Step-Video
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,251 words
Step-Video is a family of open-source video generation models developed by StepFun (Shanghai Jieyue Xingchen Intelligent Technology Co., Ltd., Chinese: 阶跃星辰), a Chinese AI startup headquartered in Shanghai. The flagship model, Step-Video-T2V, is a 30-billion-parameter text-to-video diffusion model released on 17 February 2025 under the MIT License, accompanied by a detailed technical report on arXiv.[^1][^2][^3] A follow-up image-to-video variant, Step-Video-TI2V, was released on 17 March 2025.[^4][^5] At the time of release, Step-Video-T2V was the largest publicly available text-to-video model by parameter count, and it was distinguished by a deep-compression Video-VAE, a DiT-based denoiser with 3D full attention, and a dual bilingual text encoder pipeline supporting both Chinese and English prompts.[^1][^2]
| Field | Value |
|---|---|
| Developer | StepFun (Shanghai Jieyue Xingchen) |
| First release | 17 February 2025 (Step-Video-T2V)[^3] |
| Follow-up | 17 March 2025 (Step-Video-TI2V)[^5] |
| Parameters | 30 billion (DiT backbone)[^1][^2] |
| Architecture | DiT with 3D full attention, Flow Matching objective[^1][^6] |
| Latent space | Video-VAE, 16x16 spatial, 8x temporal compression[^1][^2] |
| Text encoders | Hunyuan-CLIP (bidirectional) + Step-LLM (causal)[^6] |
| Output resolutions | 544x992 or 768x768 pixels[^2][^7] |
| Max frames | 204 (T2V), 102 (TI2V)[^1][^7] |
| Frame rate | Approximately 24 frames per second (8.5-second clip at 204 frames)[^8] |
| License | MIT (model weights and code)[^2][^9] |
| arXiv | 2502.10248 (T2V), 2503.11251 (TI2V)[^1][^4] |
| Repository | github.com/stepfun-ai/Step-Video-T2V[^9] |
StepFun was founded on 6 April 2023 in Shanghai by Jiang Daxin, a former global vice president and chief scientist at Microsoft Software Technology Center Asia, together with two Microsoft alumni, Jiao Binxing and Zhu Yibo.[^10][^11] The company is widely counted among China's "AI Six Tigers," a group of well-funded large-model startups that also includes Zhipu AI, Moonshot AI, MiniMax, Baichuan Intelligence, and 01.AI.[^10] Jiang Daxin had spent sixteen years at Microsoft, where he worked on the Bing search engine, Cortana intelligent voice assistant, Azure cognitive services, and natural-language understanding components of Microsoft 365 before leaving to start the company. He has publicly cited the November 2022 release of ChatGPT as the immediate catalyst for founding StepFun.[^11] Within two months of starting operations the team trained its first 100-billion-parameter model, Step-1, and StepFun was the only one of the so-called Six Tigers to reach unicorn valuation in its initial funding round.[^11] Before turning to video generation, StepFun shipped Step-1 in 2023, the multimodal Step-1V, and the trillion-parameter Mixture-of-Experts language model Step-2, which was previewed in March 2024 and formally released at the World Artificial Intelligence Conference in July 2024.[^11][^12]
Step-Video-T2V was announced and open-sourced on 17 February 2025 alongside Step-Audio, a 130-billion-parameter speech interaction model, in a joint release that StepFun co-promoted with Geely Auto Group.[^10][^13] The technical report, titled "Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model," appeared on arXiv on 14 February 2025 (with later revisions on 17 and 24 February) and lists over 100 contributors led by Guoqing Ma and Haoyang Huang.[^1] Both the inference code and the model weights were published on GitHub and Hugging Face on the 17 February release date, alongside a turbo variant produced by step distillation and an online demo at yuewen.cn/videos.[^2][^9][^14] The release was notable not only for the model itself but also for the breadth of the technical report, which at over 100 pages was one of the most detailed published descriptions of a large-scale video foundation model up to that point, covering data filtering, captioning, infrastructure, distillation, and preference alignment as well as architecture.[^1][^6]
One month later, on 17 March 2025, StepFun released Step-Video-TI2V, an image-to-video extension of the same backbone whose technical report (arXiv:2503.11251) was authored by 54 researchers led by Haoyang Huang.[^4][^5] The TI2V model is licensed under the same MIT terms as the T2V release.[^15] Both models were placed on Hugging Face under the stepfun-ai organization, where the T2V checkpoint and the T2V-Turbo distilled checkpoint are hosted at stepfun-ai/stepvideo-t2v and stepfun-ai/stepvideo-t2v-turbo respectively, with the TI2V checkpoint at stepfun-ai/stepvideo-ti2v.[^2][^15] An evaluation dataset, stepfun-ai/Step-Video-T2V-Eval, was published as a companion resource for community reproduction of the report's benchmark numbers.[^9]
The core of Step-Video-T2V's efficiency is a custom deep-compression Variational Auto-encoder for video, called Video-VAE, which encodes raw RGB frames into a much smaller latent grid that the diffusion transformer operates on.[^1][^2] The compression factor is 16 times in each spatial dimension and 8 times in the temporal dimension (written compactly as 8x16x16), so a 204-frame clip at 544x992 is reduced to a latent tensor with 26 temporal latents and a 34x62 spatial grid.[^1][^6] The encoder uses a dual-path design: a convolutional path with causal 3D convolutions and pixel-unshuffle layers, and a shortcut path that performs grouped channel averaging to preserve coarse structural semantics.[^6] To prevent temporal flickering during decoding, the decoder replaces standard GroupNorm with a spatial-only GroupNorm.[^16] Causal 3D convolutions ensure that the latent at time step t depends only on earlier frames, making the encoder usable for streaming inference.[^6]
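The latent-grid arithmetic above can be checked with a short sketch. The function name `latent_shape` is hypothetical, and rounding a partial temporal chunk up is an assumption chosen so that the result matches the 26 temporal latents quoted for a 204-frame clip; the actual encoder may pad or handle the first frame differently.

```python
import math

# Compression factors stated in the text: 8x temporal, 16x16 spatial.
T_FACTOR, S_FACTOR = 8, 16

def latent_shape(frames: int, height: int, width: int) -> tuple:
    """Return (temporal, height, width) of the latent grid.

    Assumption: a partial temporal chunk is rounded up, which reproduces
    the 26 temporal latents quoted for 204 frames.
    """
    if height % S_FACTOR or width % S_FACTOR:
        raise ValueError("height and width must be multiples of 16")
    return (math.ceil(frames / T_FACTOR), height // S_FACTOR, width // S_FACTOR)

print(latent_shape(204, 544, 992))  # (26, 34, 62)
print(latent_shape(204, 768, 768))  # (26, 48, 48)
```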
The denoising network is a Diffusion Transformer (DiT) with 48 transformer layers, 48 attention heads per layer, a head dimension of 128, and a feed-forward dimension of 24,576, totaling roughly 30 billion parameters.[^2][^6] All layers use a single 3D full-attention block rather than factorizing spatial and temporal attention, on the grounds that decoupled attention loses joint spatio-temporal interaction.[^6] The transformer uses RMSNorm for the main residual stream and Query-Key Normalization (QK-Norm) to stabilize the attention dot-products at scale.[^1][^6] Timestep conditioning is injected through AdaLN-Single, a parameter-efficient variant of adaptive layer normalization that shares scale and shift parameters across the depth of the model to reduce overhead.[^2][^6]
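As a rough sanity check on the 30-billion figure, the quoted dimensions can be turned into a back-of-the-envelope parameter estimate. The sketch below assumes each layer holds one self-attention block, one cross-attention block, and a two-matrix feed-forward network, and it ignores biases, normalization weights, embeddings, and AdaLN parameters; the real layer layout may differ.

```python
# Dimensions quoted in the text.
LAYERS, HEADS, HEAD_DIM, FFN_DIM = 48, 48, 128, 24_576

def estimate_dit_params(layers=LAYERS, heads=HEADS, head_dim=HEAD_DIM, ffn_dim=FFN_DIM):
    """Count only the large weight matrices (an approximation, see lead-in)."""
    hidden = heads * head_dim               # model width
    self_attn = 4 * hidden * hidden         # Q, K, V, and output projections
    cross_attn = 4 * hidden * hidden        # assumed same shape as self-attention
    ffn = 2 * hidden * ffn_dim              # up- and down-projection
    return layers * (self_attn + cross_attn + ffn)

total = estimate_dit_params()
print(f"hidden width: {HEADS * HEAD_DIM}")           # hidden width: 6144
print(f"approx. parameters: {total / 1e9:.1f}B")     # approx. parameters: 29.0B
```

Under these assumptions the estimate lands at about 29 billion, consistent with the "roughly 30 billion" description once the omitted components are added back.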
Positional information is encoded with a 3D extension of Rotary Position Embedding (RoPE-3D), which assigns separate rotary angles to the temporal axis and the two spatial axes so that the same backbone can handle clips of different durations and resolutions without retraining.[^1][^2][^6]
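The idea of assigning separate rotary angles to each axis can be illustrated with NumPy. This is a minimal sketch, not the released implementation: the function names, the frequency base of 10000, and the 32/48/48 split of the 128-dimensional head across the temporal and two spatial axes are all assumptions made for illustration.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x (N, d) by angles pos * freq."""
    d = x.shape[-1]
    assert d % 2 == 0, "feature width must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)   # (d/2,) per-pair frequencies
    ang = pos[..., None] * freqs                # (N, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, t, h, w, split=(32, 48, 48)):
    """Apply independent rotary embeddings to three slices of the head dim.

    The split sizes are hypothetical; they only need to be even and sum
    to the head dimension.
    """
    a, b, c = split
    assert a + b + c == x.shape[-1]
    return np.concatenate(
        [rope_1d(x[..., :a], t),
         rope_1d(x[..., a:a + b], h),
         rope_1d(x[..., a + b:], w)],
        axis=-1,
    )

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128))                   # four tokens, head dim 128
t = np.array([0.0, 0.0, 1.0, 1.0])              # temporal index per token
h = np.array([0.0, 1.0, 0.0, 1.0])              # row index per token
w = np.array([2.0, 3.0, 2.0, 3.0])              # column index per token
print(rope_3d(q, t, h, w).shape)                # (4, 128)
```

Because each pair of features is only rotated, vector norms are unchanged and the dot product between two rotated vectors depends only on the difference of their positions along each axis, which is what lets one backbone handle varying durations and resolutions.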
Step-Video-T2V conditions the diffusion process on a concatenation of features from two pretrained text encoders, designed jointly so that the model can ingest both English and Chinese prompts of varying length.[^1][^6]
The two encoders are run in parallel, their outputs are projected and concatenated, and the resulting sequence is supplied to the DiT through cross-attention. StepFun reports that the combination is necessary because Hunyuan-CLIP gives strong visual alignment for short prompts while Step-LLM remains useful for dense, paragraph-length captions.[^6]
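The project-and-concatenate step can be sketched as follows. Everything concrete here is hypothetical: the token counts, feature widths, and random projection matrices are toy stand-ins, concatenation along the sequence axis is an assumption, and the attention function omits the learned query, key, and value projections a real DiT layer would have.

```python
import numpy as np

rng = np.random.default_rng(0)
MODEL_WIDTH = 24                                   # toy width, not the real 6144

def build_text_context(clip_feats, llm_feats, w_clip, w_llm):
    """Project both encoder outputs to a shared width, then join the sequences."""
    return np.concatenate([clip_feats @ w_clip, llm_feats @ w_llm], axis=0)

def cross_attention(latent_tokens, context):
    """Single-head attention from latent tokens to the text context."""
    scores = latent_tokens @ context.T / np.sqrt(latent_tokens.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ context

clip_feats = rng.normal(size=(8, 16))              # short-prompt encoder output (toy)
llm_feats = rng.normal(size=(20, 32))              # long-prompt encoder output (toy)
w_clip = rng.normal(size=(16, MODEL_WIDTH))
w_llm = rng.normal(size=(32, MODEL_WIDTH))

context = build_text_context(clip_feats, llm_feats, w_clip, w_llm)
latents = rng.normal(size=(5, MODEL_WIDTH))        # five latent video tokens
print(context.shape)                               # (28, 24)
print(cross_attention(latents, context).shape)     # (5, 24)
```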
Rather than the conventional noise-prediction or v-prediction objective used by many earlier DiTs, Step-Video-T2V is trained with Flow Matching: linearly interpolating between a noise sample X0 and a data sample X1 and learning a velocity field that maps the noise distribution to the data distribution.[^1][^6] Concretely, the model parameterizes a function u(X_t, y, t; theta) that predicts the time-derivative of the interpolation X_t = (1 - t) X0 + t X1, and is optimized against the ground-truth velocity V_t = X1 - X0 using a simple squared-error loss conditioned on the text embedding y. At inference time the model integrates an ordinary differential equation defined by its predicted velocity, requiring on the order of 50 sampling steps for the base T2V model.[^1][^6] A distilled "Turbo" variant uses 2-rectified flow distillation with a U-shaped timestep sampler (proportional to exp(au) + exp(-au) with a = 5) and a linearly diminishing classifier-free guidance schedule of the form cfg_t = max(cfg_max - 9t (cfg_max - 1), 1), reducing inference to 8 to 15 steps with minimal quality loss.[^6][^2]
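The training objective, the ODE sampler, and the Turbo guidance schedule described above can be written out in a few lines of NumPy. This is a toy sketch under stated assumptions: `oracle` is a hypothetical stand-in for the network that knows the exact velocity when the data distribution is a single point, the sampler is plain fixed-step Euler with no time-shift, and `turbo_cfg` simply transcribes the quoted formula without asserting which end of the time axis it is applied from.

```python
import numpy as np

def interpolate(x0, x1, t):
    """X_t = (1 - t) X0 + t X1, with X0 noise and X1 data."""
    return (1.0 - t) * x0 + t * x1

def flow_matching_loss(model, x0, x1, y, t):
    """Squared error between the predicted velocity and the target X1 - X0."""
    x_t = interpolate(x0, x1, t)
    return float(np.mean((model(x_t, y, t) - (x1 - x0)) ** 2))

def euler_sample(model, x0, y, steps=50):
    """Integrate dX/dt = u(X, y, t) from t = 0 (noise) towards t = 1 (data)."""
    x, dt = np.array(x0, dtype=float), 1.0 / steps
    for i in range(steps):
        x = x + dt * model(x, y, i * dt)
    return x

def turbo_cfg(t, cfg_max):
    """Linearly diminishing guidance: max(cfg_max - 9 t (cfg_max - 1), 1)."""
    return max(cfg_max - 9.0 * t * (cfg_max - 1.0), 1.0)

def oracle(x_t, y, t):
    """Toy model: exact velocity when the only data point is y (assumption)."""
    return (y - x_t) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.normal(size=4)
target = np.array([1.0, -2.0, 0.5, 3.0])

print(flow_matching_loss(oracle, noise, target, target, t=0.3))  # close to zero
print(euler_sample(oracle, noise, target, steps=50))             # close to target
print(turbo_cfg(0.0, 5.0), turbo_cfg(0.5, 5.0))                  # 5.0 1.0
```

With the quoted schedule, guidance starts at `cfg_max` and reaches the floor of 1 once t passes 1/9, after which classifier-free guidance is effectively switched off.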
The released checkpoints support two main output configurations: 544 by 992 pixels or 768 by 768 pixels, each with up to 204 frames.[^2][^7] At a default playback rate of approximately 24 frames per second, 204 frames correspond to about 8.5 seconds of video; StepFun's own Yuewen demo serves 8-second clips.[^8][^14] Step-Video-TI2V produces shorter clips of up to 102 frames at the same resolutions.[^5][^15]
GPU memory usage for the base T2V model peaks at approximately 77.6 gigabytes for a 544x992x204 generation and 78.5 gigabytes for 768x768x204, requiring at least one NVIDIA H800-class GPU (80 GB) or distributed inference for production use.[^2]
The Step-Video-T2V technical report describes a four-stage cascaded training pipeline designed to make best use of mixed-quality video data.[^1][^6]
Raw video is segmented with PySceneDetect to detect cuts and then split with FFmpeg, and each clip is passed through a battery of quality filters: aesthetic scoring (a CLIP-based predictor trained on LAION ratings), NSFW filtering, watermark detection, subtitle detection, saturation and blur scores, black-border detection, and motion magnitude statistics (mean, max, min).[^6] Captions are produced by an internal vision-language model that emits a short caption mirroring user-style prompts and a dense caption with style, camera movement, and detail annotations; original titles are also retained to preserve diversity.[^6] To balance concept distribution, clips are k-means clustered into more than 120,000 buckets, and rare-concept buckets are upsampled.[^6] StepFun reports a series of progressively stricter quality thresholds, producing six nested data subsets used across the pretraining stages.[^6] A separate text-video alignment filter computes the average cosine similarity between eight uniformly sampled frames and the generated caption using CLIP, providing a CLIP Score used both for selection and for monitoring caption drift over the course of training.[^6]
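The text-video alignment filter can be sketched as an average cosine similarity over eight uniformly spaced frames. The sketch assumes the CLIP image and text embeddings have already been computed elsewhere and are passed in as arrays; the function names are hypothetical, and the exact frame-sampling rule used by StepFun is not specified in the text, so evenly spaced indices including both endpoints are an assumption.

```python
import numpy as np

def uniform_frame_indices(num_frames, num_samples=8):
    """Evenly spaced frame indices from the first to the last frame."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

def clip_score(frame_embs, caption_emb):
    """Mean cosine similarity between frame embeddings (N, d) and a caption (d,)."""
    f = frame_embs / np.linalg.norm(frame_embs, axis=-1, keepdims=True)
    c = caption_emb / np.linalg.norm(caption_emb)
    return float(np.mean(f @ c))

print(uniform_frame_indices(204).tolist())
# [0, 29, 58, 87, 116, 145, 174, 203]

# Toy embeddings standing in for real CLIP outputs.
rng = np.random.default_rng(0)
caption = rng.normal(size=16)
frames = caption + 0.1 * rng.normal(size=(8, 16))   # frames close to the caption
print(round(clip_score(frames, caption), 3))
```

A clip whose score falls below a chosen threshold would be dropped or re-captioned; the threshold itself is a tuning choice not given in the text.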
Post-training data for supervised fine-tuning is curated more aggressively. From the pretraining pool, StepFun selects roughly 30 million clips that pass both quality and stylistic filters, then applies cluster-distance constraints to ensure the final SFT set spans the desired concept distribution rather than collapsing on any single visual style. A final manual annotation pass discards remaining failure cases.[^6]
Distributed training is run with an 8-way tensor parallelism plus sequence parallelism plus ZeRO-1 configuration, reaching a model FLOPs utilization (MFU) of approximately 32 percent against a theoretical ceiling near 36.5 percent on its target cluster.[^6] StepFun built an internal stack to support this scale, including: Step Emulator (SEMU) for resource and parallelism simulation; StepCCL, a communication library that overlaps DMA-based transfers with GEMM computation; StepRPC, a tensor-native RPC framework with RDMA and TCP backends; StepTelemetry, a low-overhead observability suite; and StepMind, a training orchestrator that reports effective GPU training time above 99 percent over multi-week runs and a daily restart rate of roughly 0.037 per 1000 GPUs.[^6] Public secondary reporting summarizing the report describes a roughly 4,096-GPU NVIDIA H800 deployment for the Step-Video-T2V run.[^17]
The base text-to-video checkpoint, Step-Video-T2V, was published on 17 February 2025 with the configuration described above. The model card lists two operating points, 544x992x204 and 768x768x204, and documents recommended inference settings of 50 denoising steps with a classifier-free guidance scale of 9.0 and a time-shift of 13.0.[^2]
The Turbo checkpoint is a distilled variant released on the same day, produced with 2-rectified flow distillation. It uses 10 to 15 denoising steps with a lower CFG scale of 5.0 and a higher time-shift of 17.0, cutting the step count to roughly a third to a fifth of the base model's 50 at comparable visual quality.[^2][^6]
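The recommended settings for the base and Turbo checkpoints can be collected into a small configuration sketch. The class and field names below are illustrative choices rather than the official command-line interface, and the two Turbo presets simply take the ends of the quoted 10 to 15 step range.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SamplingConfig:
    """Sampler settings quoted from the model card (names are illustrative)."""
    infer_steps: int
    cfg_scale: float
    time_shift: float

BASE = SamplingConfig(infer_steps=50, cfg_scale=9.0, time_shift=13.0)
TURBO_MIN = SamplingConfig(infer_steps=10, cfg_scale=5.0, time_shift=17.0)
TURBO_MAX = SamplingConfig(infer_steps=15, cfg_scale=5.0, time_shift=17.0)

def step_reduction(base: SamplingConfig, other: SamplingConfig) -> float:
    """How many times fewer denoising steps `other` needs than `base`."""
    return base.infer_steps / other.infer_steps

print(step_reduction(BASE, TURBO_MIN))            # 5.0
print(round(step_reduction(BASE, TURBO_MAX), 2))  # 3.33
```

Step count is only a proxy for wall-clock time; actual latency also depends on resolution, parallelism, and decoder cost.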
Released on 17 March 2025, Step-Video-TI2V extends the same 30-billion-parameter backbone to text-driven image-to-video generation. A reference image is encoded with the Video-VAE and its latent is concatenated with the first-frame latent in the DiT's input, providing a direct visual anchor for the generated motion.[^4][^15] This direct-concatenation strategy contrasts with adapter-style conditioning used by some other image-to-video systems, and the StepFun report argues that it produces tighter visual fidelity to the conditioning image because the visual feature path shares the same VAE-derived geometry as the noisy latents being denoised.[^4] The model adds a motion-score conditioning input that gives users explicit control over the dynamic intensity of the generated clip: recommended settings are motion_score 2 for highly stable scenes, 5 for general use, and 10 or higher for highly dynamic motion. Camera trajectory prompts (pan, tilt, zoom, dolly, rotation, tracking, orbit, rack focus) are also explicitly supported.[^5][^15] TI2V's maximum clip length is 102 frames at either 544x992 or 768x768 resolution, with peak GPU memory between roughly 75 and 77 gigabytes on a single H800.[^15] Distributed inference is supported with Ulysses-style sequence parallelism up to a parallel degree of 8, reducing peak per-GPU memory to roughly 64 gigabytes and per-clip latency to around 250 to 290 seconds.[^15] At launch StepFun reported that the model held the top position on the VBench-I2V leaderboard and released an evaluation set, Step-Video-TI2V-Eval, with 178 real-world and 120 anime-style prompt-image pairs.[^4][^5]
Along with the model weights, StepFun published Step-Video-T2V-Eval, an evaluation benchmark of 128 prompts spanning 11 categories (sports, food, scenery, surrealism, people, animation, festivals, animals, and others).[^1][^6] The report compares Step-Video-T2V against open-source baselines including HunyuanVideo and Open-Sora as well as proprietary commercial engines including OpenAI Sora, Runway Gen-3 Alpha, and Movie Gen.[^1][^6] StepFun reports that Step-Video-T2V outperforms HunyuanVideo in overall video quality and motion smoothness on the benchmark, is comparable to Movie Gen Video for general prompts while trailing it in fine-grained aesthetic detail, and outperforms Runway Gen-3 Alpha on motion consistency while trailing it on cinematic appeal. The report attributes the residual aesthetic gap mostly to the 540p output cap and to limited high-quality labeled data.[^6][^18]
Step-Video-TI2V was reported to occupy the leading position on the public VBench-I2V leaderboard at the time of its March 2025 release, making it the strongest publicly available image-to-video model by that metric at launch.[^4][^5]
The table below summarizes how Step-Video-T2V relates to several contemporaneous and competing systems. Numbers are drawn from each project's official documentation or technical report at the time of writing.
| Model | Developer | Parameters | Architecture | Max output | License |
|---|---|---|---|---|---|
| Step-Video-T2V | StepFun | 30B | DiT + Flow Matching | 544x992, 204 frames[^1][^2] | MIT[^9] |
| HunyuanVideo | Tencent | 13B | DiT + Flow Matching | 1280x720, 129 frames[^19] | Tencent open-source[^19] |
| Wan 2.1 | Alibaba | 1.3B and 14B | DiT + Flow Matching | up to 720p[^20] | Apache 2.0[^20] |
| CogVideoX | Zhipu AI | 2B and 5B | DiT | 720x480, 49 frames | Open weights |
| Mochi 1 | Genmo | ~10B | DiT + Asymmetric VAE | 480p, 5.4s | Apache 2.0 |
| LTX-Video | Lightricks | ~2B | DiT (real-time) | 768x512, 5s | Open weights |
| Veo 3 | Google DeepMind | undisclosed | proprietary | up to 4K, with audio[^21] | proprietary[^21] |
| Sora 2 | OpenAI | undisclosed | proprietary | up to 15s clips[^21] | proprietary[^21] |
Within the open-source landscape of early 2025, Step-Video-T2V was the largest model by raw parameter count and one of only two models (alongside HunyuanVideo) to attempt full 3D attention at scale.[^1][^6][^19] Compared to Wan 2.1, which was released by Alibaba's Tongyi Lab the same month, Step-Video-T2V emphasizes parameter scale and 3D full attention, while Wan 2.1 emphasizes a more memory-efficient 1.3-billion-parameter checkpoint that can run on consumer GPUs.[^20] CogVideoX from Zhipu AI, which preceded Step-Video by roughly six months, uses a smaller DiT (2B or 5B parameters) and shorter clips. Mochi 1 from Genmo, released in October 2024, and LTX-Video from Lightricks, released in late 2024, target shorter clips at lower resolutions but with substantially lower compute footprints, with LTX-Video in particular targeting real-time inference.[^22] Veo 3 and Sora 2, the closed-source systems listed in the table, were both released after the Step-Video-T2V report and so are not part of its evaluation; against the commercial engines it does test, StepFun does not claim parity on cinematic aesthetics but does report competitive motion smoothness and instruction following on its own benchmark.[^6][^18]
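Step-Video sits alongside several other StepFun foundation models, including the Step-1 and Step-2 language models, the multimodal Step-1V, and the Step-Audio speech interaction model.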
The open release of Step-Video-T2V has produced two broad classes of downstream use. First, the MIT license has made it a popular base for community fine-tuning and inference research, with third-party deployments including AMD's xDiT-based inference recipe on Instinct MI300X GPUs and the integration of TI2V into ComfyUI workflows.[^23] AMD's published recipe demonstrates that Step-Video-T2V can be run on non-NVIDIA accelerators with the same xDiT parallelism strategies used in the official release, making it one of the first 30-billion-parameter video models with documented production inference on ROCm hardware.[^17] Second, StepFun's own product surface, the Yuewen video service, exposes the underlying models for end-user prompt-to-video generation in eight-second clips.[^14]
The Step-Video Turbo and TI2V variants are also positioned for downstream applications such as advertising content, anime production (TI2V was explicitly tuned with anime-style data and evaluated on an anime prompt subset), e-commerce product videos, and educational explainer animation.[^5][^15] The combination of explicit motion-score conditioning and camera-trajectory prompts in TI2V is particularly relevant to advertising and short-form content workflows where art directors expect fine-grained control over shot composition rather than a single text-to-video prompt; StepFun's own marketing materials and the TI2V technical report both emphasize this controllability dimension.[^4][^5]
The technical report is forthright about residual weaknesses.[^6] Five problem areas are explicitly named:
Secondary coverage adds that the 540p output cap leaves Step-Video-T2V behind 1080p-capable commercial models on aesthetic detail, and that the 77 to 78 gigabyte peak VRAM requirement effectively excludes single-consumer-GPU local inference at full resolution.[^18][^2] Some reviewers have also noted that the choice of Hunyuan-CLIP as one of the two text encoders introduces a coupling to a Tencent-released model, although StepFun's own Step-LLM encoder provides an independent path for long-context prompts.[^6][^18]
The closing section of the Step-Video-T2V technical report lays out an explicit research agenda the authors label Level-2: Predictable Video Foundation Models.[^6] They argue that current text-to-video systems, including Step-Video-T2V, are best viewed as translational systems mapping from text to pixels rather than predictive systems modeling underlying world dynamics. As a result, even at 30-billion-parameter scale the models fail at compositional physics and causal reasoning. The proposed direction is to integrate explicit causal modeling, predictive simulation of physical scenes, and multimodal reasoning capabilities of the kind that the modern wave of large language models has demonstrated. StepFun frames this as analogous to the qualitative jump that occurred in language modeling when reasoning capabilities emerged, and positions Step-Video-T2V's open release as a foundation on which subsequent predictive models can be built.[^6]
Step-Video-T2V is one of a small number of foundation-scale text-to-video systems released with full open weights, code, and a detailed training report rather than a proprietary API. Its 30-billion-parameter scale, combined with a permissive MIT License and a documented Flow Matching plus Video-DPO recipe, has made it a reference point for subsequent academic and open-source video generation work, and the accompanying Step-Video-T2V-Eval benchmark provides a reproducible measurement target for prompt categories that VBench under-samples.[^1][^6][^9] The Step-Video-TI2V follow-up extended the same backbone to image-to-video conditioning and reached the top of the VBench-I2V leaderboard at release, while introducing explicit motion-score and camera-trajectory controls that have since been adopted by other open systems.[^4][^5]
Beyond its concrete technical contributions, the Step-Video release also helped establish a pattern in 2025 of Chinese AI labs publishing both very large video models and exhaustive technical reports, alongside Tencent's HunyuanVideo and Alibaba's Wan 2.1. The combined effect lowered the barrier to academic and open-source video generation research that had been dominated, before late 2024, by proprietary systems with limited public information.[^1][^6][^19][^20] StepFun's emphasis on a deep-compression Video-VAE (16x16 spatially and 8x temporally) also encouraged subsequent open-source efforts to invest in more aggressive temporal compression as a way to make 30-billion-parameter scale tractable on commercially available hardware.[^1][^6]