Step-Video

Updated Jul 16, 2026

Step-Video is a family of open-source video generation models developed by StepFun (Shanghai Jieyue Xingchen Intelligent Technology Co., Ltd., Chinese: 阶跃星辰), a Chinese AI startup headquartered in Shanghai. The flagship model, Step-Video-T2V, is a 30-billion-parameter text-to-video diffusion model released on 17 February 2025 under the MIT License, accompanied by a detailed technical report on arXiv.^[1]^[2]^[3] A follow-up image-to-video variant, Step-Video-TI2V, was released on 17 March 2025.^[4]^[5] At the time of release, Step-Video-T2V was the largest publicly available text-to-video model by parameter count, and it was distinguished by a deep-compression Video-VAE, a DiT-based denoiser with 3D full attention, and a dual bilingual text encoder pipeline supporting both Chinese and English prompts.^[1]^[2]

Infobox

Field	Value
Developer	StepFun (Shanghai Jieyue Xingchen)
First release	17 February 2025 (Step-Video-T2V)^[3]
Follow-up	17 March 2025 (Step-Video-TI2V)^[5]
Parameters	30 billion (DiT backbone)^[1]^[2]
Architecture	DiT with 3D full attention, Flow Matching objective^[1]^[6]
Latent space	Video-VAE, 16x16 spatial, 8x temporal compression^[1]^[2]
Text encoders	Hunyuan-CLIP (bidirectional) + Step-LLM (causal)^[6]
Output resolutions	544x992 or 768x768 pixels^[2]^[7]
Max frames	204 (T2V), 102 (TI2V)^[1]^[7]
Frame rate	Approximately 24 frames per second (8.5-second clip at 204 frames)^[8]
License	MIT (model weights and code)^[2]^[9]
arXiv	2502.10248 (T2V), 2503.11251 (TI2V)^[1]^[4]
Repository	github.com/stepfun-ai/Step-Video-T2V^[9]

Background and Release

StepFun was founded on 6 April 2023 in Shanghai by Jiang Daxin, a former global vice president and chief scientist at Microsoft Software Technology Center Asia, together with two Microsoft alumni, Jiao Binxing and Zhu Yibo.^[10]^[11] The company is widely counted among China's "AI Six Tigers," a group of well-funded large-model startups that also includes Zhipu AI, Moonshot AI, MiniMax, Baichuan Intelligence, and 01.AI.^[10] Jiang Daxin spent sixteen years at Microsoft, where he worked on the Bing search engine, the Cortana voice assistant, Azure cognitive services, and natural-language understanding components of Microsoft 365. He has publicly cited the November 2022 release of ChatGPT as the immediate catalyst for founding StepFun.^[11] Within two months of starting operations the team trained its first 100-billion-parameter model, Step-1, and StepFun was the only one of the Six Tigers to reach unicorn valuation in its initial funding round.^[11] Before turning to video generation, StepFun shipped Step-1 in 2023, followed by the multimodal Step-1V and the trillion-parameter Mixture-of-Experts language model Step-2, which was previewed in March 2024 and formally released at the World Artificial Intelligence Conference in July 2024.^[11]^[12]

Step-Video-T2V was announced and open-sourced on 17 February 2025 alongside Step-Audio, a 130-billion-parameter speech interaction model, in a joint release that StepFun co-promoted with Geely Auto Group.^[10]^[13] The technical report, titled "Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model," appeared on arXiv on 14 February 2025 (with later revisions on 17 and 24 February) and lists over 100 contributors led by Guoqing Ma and Haoyang Huang.^[1] Both the inference code and the model weights were published on GitHub and Hugging Face on the same day, alongside a turbo variant produced by step distillation and an online demo at yuewen.cn/videos.^[2]^[9]^[14] The release was notable not only for the model itself but also for the breadth of the technical report, which at over 100 pages constituted one of the most detailed published descriptions of a large-scale video foundation model up to that point, covering data filtering, captioning, infrastructure, distillation, and preference alignment as well as architecture.^[1]^[6]

One month later, on 17 March 2025, StepFun released Step-Video-TI2V, an image-to-video extension of the same backbone whose technical report (arXiv:2503.11251) was authored by 54 researchers led by Haoyang Huang.^[4]^[5] The TI2V model is licensed under the same MIT terms as the T2V release.^[15] Both models were placed on Hugging Face under the stepfun-ai organization, where the T2V checkpoint and the T2V-Turbo distilled checkpoint are hosted at stepfun-ai/stepvideo-t2v and stepfun-ai/stepvideo-t2v-turbo respectively, with the TI2V checkpoint at stepfun-ai/stepvideo-ti2v.^[2]^[15] An evaluation dataset, stepfun-ai/Step-Video-T2V-Eval, was published as a companion resource for community reproduction of the report's benchmark numbers.^[9]

Technical Details

Video-VAE

The core of Step-Video-T2V's efficiency is a custom deep-compression variational autoencoder for video, called Video-VAE, which encodes raw RGB frames into a much smaller latent grid on which the diffusion transformer operates.^[1]^[2] The compression factor is 16 in each spatial dimension and 8 in the temporal dimension (written compactly as 8x16x16), so a 204-frame clip at 544x992 is reduced to a latent tensor with 26 temporal latents (204 / 8 = 25.5, rounded up) and a 34x62 spatial grid (544 / 16 by 992 / 16).^[1]^[6] The encoder uses a dual-path design: a convolutional path with causal 3D convolutions and pixel-unshuffle layers, and a shortcut path that performs grouped channel averaging to preserve coarse structural semantics.^[6] To prevent temporal flickering during decoding, the decoder replaces standard GroupNorm with a spatial-only GroupNorm.^[16] Causal 3D convolutions ensure that the latent at time step t depends only on frames at or before t, never on later ones, making the encoder usable for streaming inference.^[6]
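The latent-grid arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not StepFun's code: the function name `latent_shape` is hypothetical, and rounding the temporal count up is an assumption made so that 204 frames yield the 26 temporal latents quoted in the text.

```python
import math

def latent_shape(frames, height, width, t_factor=8, s_factor=16):
    """Return (temporal, rows, cols) of the latent grid for a clip."""
    if height % s_factor or width % s_factor:
        raise ValueError("height and width must be multiples of the spatial factor")
    # Assumption: a partial final group of frames still yields one temporal latent.
    return (math.ceil(frames / t_factor), height // s_factor, width // s_factor)

print(latent_shape(204, 544, 992))  # (26, 34, 62)
print(latent_shape(204, 768, 768))  # (26, 48, 48)
```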

Diffusion Transformer

The denoising network is a Diffusion Transformer (DiT) with 48 transformer layers, 48 attention heads per layer, a head dimension of 128 (implying a hidden width of 48 x 128 = 6,144), and a feed-forward dimension of 24,576, totaling roughly 30 billion parameters.^[2]^[6] All layers use a single 3D full-attention block rather than factorizing spatial and temporal attention, on the grounds that decoupled attention loses joint spatio-temporal interaction.^[6] The transformer uses RMSNorm for the main residual stream and Query-Key Normalization (QK-Norm) to stabilize the attention dot-products at scale.^[1]^[6] Timestep conditioning is injected through AdaLN-Single, a parameter-efficient variant of adaptive layer normalization that shares scale and shift parameters across the depth of the model to reduce overhead.^[2]^[6]
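As a rough sanity check, the quoted layer count, head count, head dimension, and feed-forward dimension can be turned into a back-of-the-envelope parameter estimate. The sketch below rests on assumptions that are not confirmed by the sources cited here: each layer is taken to hold one self-attention block, one cross-attention block of the same width, and a two-matrix (non-gated) feed-forward network, while normalization, embedding, and conditioning parameters are ignored. The function name is hypothetical.

```python
def estimate_dit_params(layers=48, heads=48, head_dim=128, ffn_dim=24_576):
    """Rough weight count for the transformer stack (biases and norms ignored)."""
    hidden = heads * head_dim
    attn = 4 * hidden * hidden  # Q, K, V and output projections
    # Assumption: one self-attention and one cross-attention block per layer,
    # plus a feed-forward network with one up- and one down-projection.
    per_layer = 2 * attn + 2 * hidden * ffn_dim
    return hidden, layers * per_layer

hidden, total = estimate_dit_params()
print(hidden, f"{total / 1e9:.1f}B")  # 6144 29.0B
```

Under these assumptions the stack alone lands just under the "roughly 30 billion" figure, which suggests the headline number is dominated by the attention and feed-forward weights.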

Positional information is encoded with a 3D extension of Rotary Position Embedding (RoPE-3D), which assigns separate rotary angles to the temporal axis and the two spatial axes so that the same backbone can handle clips of different durations and resolutions without retraining.^[1]^[2]^[6]
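One way to picture a three-axis rotary embedding is to split each attention head's channels into three groups and rotate each group by angles derived from one coordinate. The NumPy sketch below is illustrative only: the 64/32/32 channel split, the base of 10000, and the function names are assumptions, not details taken from the report.

```python
import numpy as np

def rope_angles(pos, dim, base=10000.0):
    """Rotary angles for integer positions; returns shape (len(pos), dim // 2)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(pos, inv_freq)

def rope_3d(x, grid, split=(64, 32, 32)):
    """Rotate x of shape (T*H*W, head_dim) using each token's (t, h, w) coordinates.

    The channel split across the three axes is a hypothetical choice.
    """
    T, H, W = grid
    t, h, w = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    coords = (t.ravel(), h.ravel(), w.ravel())
    angles = np.concatenate([rope_angles(c, d) for c, d in zip(coords, split)], axis=1)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin  # 2D rotation of each channel pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
grid = (3, 2, 2)  # tiny (T, H, W) grid, 12 tokens
q = rope_3d(np.tile(rng.standard_normal(128), (12, 1)), grid)
k = rope_3d(np.tile(rng.standard_normal(128), (12, 1)), grid)
# Token index = t*H*W + h*W + w, so indices 0 -> 4 and 4 -> 8 are both "one frame later".
print(np.isclose(q[0] @ k[4], q[4] @ k[8]))
```

Because every channel pair is only rotated, attention scores depend on coordinate differences rather than absolute positions, which is the property that lets one backbone serve clips of different lengths and resolutions.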

Dual Bilingual Text Encoders

Step-Video-T2V conditions the diffusion process on a concatenation of features from two pretrained text encoders, designed jointly so that the model can ingest both English and Chinese prompts of varying length.^[1]^[6]

Hunyuan-CLIP, the bidirectional CLIP-style text tower from Tencent's open-source bilingual CLIP model, provides text representations that are well aligned with the visual embedding space, but is capped at a 77-token input.^[6]
Step-LLM, a StepFun in-house unidirectional decoder-only language model with no length cap, contributes long-context representation. Its positional encoding is a redesigned variant of ALiBi adapted to support the long captions used in video training.^[6]

The two encoders are run in parallel, their outputs are projected and concatenated, and the resulting sequence is supplied to the DiT through cross-attention. StepFun reports that the combination is necessary because Hunyuan-CLIP gives strong visual alignment for short prompts while Step-LLM remains useful for dense, paragraph-length captions.^[6]
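The project-concatenate-attend flow described above can be sketched with toy tensors. Everything numeric here except the 77-token cap is hypothetical (feature widths, the 300-token caption, random projection weights), and the attention omits learned query, key, and value projections for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen to keep the example small; only the 77-token cap is from the text.
CLIP_TOKENS, LLM_TOKENS = 77, 300
CLIP_DIM, LLM_DIM, COND_DIM = 32, 64, 48

def build_condition(clip_feats, llm_feats, w_clip, w_llm):
    """Project both encoders to a shared width and concatenate along the token axis."""
    return np.concatenate([clip_feats @ w_clip, llm_feats @ w_llm], axis=0)

def cross_attention(video_tokens, cond):
    """Single-head cross-attention: video tokens are queries, text tokens are keys and values."""
    scores = video_tokens @ cond.T / np.sqrt(cond.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability before exp
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ cond, weights

clip_feats = rng.standard_normal((CLIP_TOKENS, CLIP_DIM))
llm_feats = rng.standard_normal((LLM_TOKENS, LLM_DIM))
cond = build_condition(clip_feats, llm_feats,
                       rng.standard_normal((CLIP_DIM, COND_DIM)),
                       rng.standard_normal((LLM_DIM, COND_DIM)))
video_tokens = rng.standard_normal((10, COND_DIM))
attended, weights = cross_attention(video_tokens, cond)
print(cond.shape, attended.shape)  # (377, 48) (10, 48)
```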

Flow Matching Training Objective

Rather than the conventional noise-prediction or v-prediction objective used by many earlier DiTs, Step-Video-T2V is trained with Flow Matching: linearly interpolating between a noise sample X_0 and a data sample X_1 and learning a velocity field that maps the noise distribution to the data distribution.^[1]^[6] Concretely, the model parameterizes a function u(X_t, y, t; theta) that predicts the time-derivative of the interpolation X_t = (1 - t) X_0 + t X_1, and is optimized against the ground-truth velocity V_t = X_1 - X_0 using a simple squared-error loss conditioned on the text embedding y. At inference time the model integrates an ordinary differential equation defined by its predicted velocity, requiring on the order of 50 sampling steps for the base T2V model.^[1]^[6]

A distilled "Turbo" variant uses 2-rectified flow distillation with a U-shaped timestep sampler (proportional to exp(au) + exp(-au) with a = 5) and a linearly diminishing classifier-free guidance schedule of the form cfg_t = max(cfg_max - 9t (cfg_max - 1), 1), reducing inference to 8 to 15 steps with minimal quality loss.^[2]^[6]
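The interpolation, target velocity, and ODE integration described above fit in a short NumPy sketch. It is a toy illustration rather than the released training code: the function names are hypothetical, the sampler is a plain fixed-step Euler integrator (the actual solver is not specified here), and the learned network is replaced by an oracle that returns the true velocity for a single noise-data pair. The guidance schedule is transcribed as written in the text.

```python
import numpy as np

def interpolate(x0, x1, t):
    """X_t = (1 - t) X_0 + t X_1, with X_0 noise and X_1 data."""
    return (1.0 - t) * x0 + t * x1

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Squared error between the predicted velocity and the target X_1 - X_0."""
    x_t = interpolate(x0, x1, t)
    return np.mean((predict_velocity(x_t, t) - (x1 - x0)) ** 2)

def euler_sample(predict_velocity, x0, steps=50):
    """Integrate dX/dt = u(X, t) from t = 0 (noise) to t = 1 (data) with fixed steps."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        x = x + dt * predict_velocity(x, i * dt)
    return x

def cfg_schedule(t, cfg_max):
    """Linearly diminishing guidance scale, floored at 1 (formula as quoted)."""
    return max(cfg_max - 9.0 * t * (cfg_max - 1.0), 1.0)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))  # noise sample
x1 = rng.standard_normal((4, 8))  # stand-in for a data latent
oracle = lambda x_t, t: x1 - x0   # hypothetical perfect velocity predictor

print(flow_matching_loss(oracle, x0, x1, 0.3))       # 0.0
print(np.allclose(euler_sample(oracle, x0), x1))      # True
```

With a perfect velocity field the path from noise to data is a straight line, which is why a handful of Euler steps can suffice once a model has been distilled toward straighter trajectories.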

Output Specifications

The released checkpoints support two main output configurations: 544 by 992 pixels with up to 204 frames, or 768 by 768 pixels with up to 204 frames.^[2]^[7] At a default playback rate of approximately 24 frames per second, 204 frames corresponds to about 8.5 seconds of video; StepFun's own Yuewen demo serves 8-second clips.^[8]^[14] Step-Video-TI2V produces shorter clips of up to 102 frames at the same resolutions.^[5]^[15]

GPU memory usage for the base T2V model peaks at approximately 77.6 gigabytes for a 544x992x204 generation and 78.5 gigabytes for 768x768x204, requiring at least one NVIDIA H800-class GPU (80 GB) or distributed inference for production use.^[2]

Training Pipeline

The Step-Video-T2V technical report describes a four-stage cascaded training pipeline designed to make best use of mixed-quality video data.^[1]^[6]

Text-to-image pre-training. The DiT is first trained on roughly 3.8 billion image-text pairs at 256x256 resolution to build a strong prior over visual concepts, scenes, and compositional relationships before any video data is introduced.^[6]
Low-resolution text-to-video pre-training. The model is then trained on roughly one billion video clips at 192x320 resolution, where the focus is to learn temporal dynamics and motion priors. The reduced resolution keeps the per-sample sequence length tractable and allows the model to see far more clips per unit of compute.^[6]
High-resolution text-to-video pre-training. A smaller curated set of about 27.3 million 540p clips is used to scale the model to its target 544x992 resolution and to learn fine visual details.^[6]
Supervised fine-tuning and Video-DPO. A supervised fine-tuning stage uses a curated subset of approximately 30 million high-quality videos with careful captions and consistent style, followed by a final preference-alignment stage called Video-DPO, an adaptation of Direct Preference Optimization to video diffusion. Video-DPO uses human-annotated win-loss pairs of generated clips, decreases the beta hyperparameter relative to image DiffusionDPO, and uses fixed seeds across positive and negative samples to stabilize gradients.^[6]
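The preference-alignment step in the last stage can be illustrated with a generic Diffusion-DPO-style loss. This is a sketch of the general technique, not StepFun's exact Video-DPO formulation: the function name, the toy error values, and the choice of beta are all hypothetical. The inputs are per-sample denoising errors for the preferred ("win") and rejected ("lose") clips under the model being trained and under a frozen reference copy.

```python
import numpy as np

def diffusion_dpo_loss(err_w, err_w_ref, err_l, err_l_ref, beta):
    """Generic Diffusion-DPO-style loss on per-sample denoising errors.

    The loss falls when the trained model improves on the reference more for
    the preferred clip than for the rejected one.
    """
    margin = (err_w - err_w_ref) - (err_l - err_l_ref)
    # -log(sigmoid(-beta * margin)), written stably as log(1 + exp(beta * margin))
    return float(np.mean(np.logaddexp(0.0, beta * margin)))

ref_w = np.array([0.40, 0.55])  # toy reference errors on preferred clips
ref_l = np.array([0.50, 0.60])  # toy reference errors on rejected clips

same = diffusion_dpo_loss(ref_w, ref_w, ref_l, ref_l, beta=0.5)
better = diffusion_dpo_loss(ref_w - 0.1, ref_w, ref_l + 0.1, ref_l, beta=0.5)
print(same, better)
```

When the trained model matches the reference the loss equals log 2, and it drops below that value once the model fits preferred clips better and rejected clips worse. Beta scales how strongly a given margin moves the loss.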

Data Pipeline

Raw video is segmented with PySceneDetect to detect cuts and then split with FFmpeg, and each clip is passed through a battery of quality filters: aesthetic scoring (a CLIP-based predictor trained on LAION ratings), NSFW filtering, watermark detection, subtitle detection, saturation and blur scores, black-border detection, and motion magnitude statistics (mean, max, min).^[6] Captions are produced by an internal vision-language model that emits a short caption mirroring user-style prompts and a dense caption with style, camera movement, and detail annotations; original titles are also retained to preserve diversity.^[6] To balance concept distribution, clips are k-means clustered into more than 120,000 buckets, and rare-concept buckets are upsampled.^[6] StepFun reports a series of progressively stricter quality thresholds, producing six nested data subsets used across the pretraining stages.^[6] A separate text-video alignment filter computes the average cosine similarity between eight uniformly sampled frames and the generated caption using CLIP, providing a CLIP Score used both for selection and for monitoring caption drift over the course of training.^[6]
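The text-video alignment filter described above reduces to an average cosine similarity. The sketch below assumes the frame and caption embeddings have already been produced by a CLIP model (random vectors stand in for them here), and the helper names are hypothetical.

```python
import numpy as np

def uniform_frame_indices(num_frames, k=8):
    """Pick k frame indices spread evenly from the first to the last frame."""
    return np.linspace(0, num_frames - 1, k).round().astype(int)

def clip_score(frame_embeddings, caption_embedding):
    """Mean cosine similarity between each frame embedding and the caption embedding."""
    f = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    c = caption_embedding / np.linalg.norm(caption_embedding)
    return float(np.mean(f @ c))

indices = uniform_frame_indices(204)
rng = np.random.default_rng(0)
caption = rng.standard_normal(512)                     # stand-in caption embedding
frames = caption + 0.1 * rng.standard_normal((8, 512)) # stand-in frame embeddings
print(indices, clip_score(frames, caption))
```

A clip whose frames all embed close to its caption scores near 1, while unrelated frames score near 0, so a threshold on this value can serve as a selection filter.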

Post-training data for supervised fine-tuning is curated more aggressively. From the pretraining pool, StepFun selects roughly 30 million clips that pass both quality and stylistic filters, then applies cluster-distance constraints to ensure the final SFT set spans the desired concept distribution rather than collapsing on any single visual style. A final manual annotation pass discards remaining failure cases.^[6]

Infrastructure

Distributed training combines 8-way tensor parallelism, sequence parallelism, and ZeRO-1 optimizer-state sharding, reaching a model FLOPs utilization (MFU) of approximately 32 percent against a theoretical ceiling near 36.5 percent on its target cluster.^[6] StepFun built an internal stack to support this scale, including: Step Emulator (SEMU) for resource and parallelism simulation; StepCCL, a communication library that overlaps DMA-based transfers with GEMM computation; StepRPC, a tensor-native RPC framework with RDMA and TCP backends; StepTelemetry, a low-overhead observability suite; and StepMind, a training orchestrator that reports effective GPU training time above 99 percent over multi-week runs and a daily restart rate of roughly 0.037 per 1000 GPUs.^[6] Public secondary reporting summarizing the report describes a roughly 4,096-GPU NVIDIA H800 deployment for the Step-Video-T2V run.^[17]

Variants

Step-Video-T2V

The base text-to-video checkpoint was published on 17 February 2025 with the configuration described above. The model card lists two operating points, 544x992x204 and 768x768x204, and documents recommended inference settings of 50 denoising steps with a classifier-free guidance scale of 9.0 and a time-shift of 13.0.^[2]

Step-Video-T2V-Turbo

Step-Video-T2V-Turbo is a distilled variant released on the same day, produced with 2-rectified flow distillation. It uses 10 to 15 denoising steps with a lower CFG scale of 5.0 and a higher time-shift of 17.0, cutting the number of denoising steps by roughly a factor of three to five relative to the 50-step base configuration at comparable visual quality.^[2]^[6]

Step-Video-TI2V

Released on 17 March 2025, Step-Video-TI2V extends the same 30-billion-parameter backbone to text-driven image-to-video generation. A reference image is encoded with the Video-VAE and its latent is concatenated with the first-frame latent in the DiT's input, providing a direct visual anchor for the generated motion.^[4]^[15] This direct-concatenation strategy contrasts with adapter-style conditioning used by some other image-to-video systems, and the StepFun report argues that it produces tighter visual fidelity to the conditioning image because the visual feature path shares the same VAE-derived geometry as the noisy latents being denoised.^[4] The model adds a motion-score conditioning input that gives users explicit control over the dynamic intensity of the generated clip: recommended settings are motion_score 2 for highly stable scenes, 5 for general use, and 10 or higher for highly dynamic motion. Camera trajectory prompts (pan, tilt, zoom, dolly, rotation, tracking, orbit, rack focus) are also explicitly supported.^[5]^[15] TI2V's maximum clip length is 102 frames at either 544x992 or 768x768 resolution, with peak GPU memory between roughly 75 and 77 gigabytes on a single H800.^[15] Distributed inference is supported with Ulysses-style sequence parallelism up to a parallel degree of 8, reducing peak per-GPU memory to roughly 64 gigabytes and per-clip latency to around 250 to 290 seconds.^[15] At launch StepFun reported that the model held the top position on the VBench-I2V leaderboard and released an evaluation set, Step-Video-TI2V-Eval, with 178 real-world and 120 anime-style prompt-image pairs.^[4]^[5]
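The direct-concatenation idea can be pictured with toy arrays. The sketch below shows one plausible reading of the description, not the released implementation: the reference-image latent fills the first temporal slot of an otherwise zero conditioning tensor, which is then joined to the noisy latents along the channel axis. The tensor layout, channel count, function name, and preset labels are assumptions; only the motion_score values 2, 5, and 10 come from the text.

```python
import numpy as np

def build_ti2v_input(noisy_latents, image_latent):
    """Join noisy video latents with an image-conditioning tensor.

    noisy_latents: (T, C, H, W); image_latent: (C, H, W).
    Assumption: the image latent occupies the first temporal slot and the
    remaining slots are zero-padded before channel-wise concatenation.
    """
    cond = np.zeros_like(noisy_latents)
    cond[0] = image_latent
    return np.concatenate([noisy_latents, cond], axis=1)

# Recommended motion_score settings from the text, under hypothetical preset names.
MOTION_PRESETS = {"stable": 2, "general": 5, "dynamic": 10}

rng = np.random.default_rng(0)
noisy = rng.standard_normal((13, 16, 34, 62))  # 13 temporal latents; 16 channels is a toy value
image = rng.standard_normal((16, 34, 62))
dit_input = build_ti2v_input(noisy, image)
print(dit_input.shape)  # (13, 32, 34, 62)
```

Because the conditioning tensor comes from the same VAE as the noisy latents, both halves of the concatenated input share one spatial grid, which is the geometric alignment the report credits for tighter fidelity to the reference image.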

Evaluation

Step-Video-T2V-Eval

Along with the model weights, StepFun published Step-Video-T2V-Eval, an evaluation benchmark of 128 prompts spanning 11 categories (sports, food, scenery, surrealism, people, animation, festivals, animals, and others).^[1]^[6] The report compares Step-Video-T2V against open-source baselines including HunyuanVideo and Open-Sora as well as proprietary commercial engines including OpenAI Sora, Runway Gen-3 Alpha, and Movie Gen.^[1]^[6] StepFun reports that Step-Video-T2V outperforms HunyuanVideo in overall video quality and motion smoothness on the benchmark, is comparable to Movie Gen Video for general prompts while trailing it in fine-grained aesthetic detail, and outperforms Runway Gen-3 Alpha on motion consistency while trailing it on cinematic appeal. The report attributes the residual aesthetic gap mostly to the 540p output cap and to limited high-quality labeled data.^[6]^[18]

VBench-I2V

Step-Video-TI2V was reported to occupy the leading position on the public VBench-I2V leaderboard at the time of its March 2025 release, making it the strongest publicly available image-to-video model by that metric at launch.^[4]^[5]

Comparison with Other Video Generation Models

The table below summarizes how Step-Video-T2V relates to several contemporaneous and competing systems. Numbers are drawn from each project's official documentation or technical report at the time of writing.

Model	Developer	Parameters	Architecture	Max output	License
Step-Video-T2V	StepFun	30B	DiT + Flow Matching	544x992, 204 frames^[1]^[2]	MIT^[9]
HunyuanVideo	Tencent	13B	DiT + Flow Matching	1280x720, 129 frames^[19]	Tencent open-source^[19]
Wan 2.1	Alibaba	1.3B and 14B	DiT + Flow Matching	up to 720p^[20]	Apache 2.0^[20]
CogVideoX	Zhipu AI	2B and 5B	DiT	720x480, 49 frames	Open weights
Mochi 1	Genmo	~10B	DiT + Asymmetric VAE	480p, 5.4s	Apache 2.0
LTX-Video	Lightricks	~2B	DiT (real-time)	768x512, 5s	Open weights
Veo 3	Google DeepMind	undisclosed	proprietary	up to 4K, with audio^[21]	proprietary^[21]
Sora 2	OpenAI	undisclosed	proprietary	up to 15s clips^[21]	proprietary^[21]

Within the open-source landscape of early 2025, Step-Video-T2V was the largest model by raw parameter count and one of only two models (alongside HunyuanVideo) to attempt full 3D attention at scale.^[1]^[6]^[19] Compared to Wan 2.1, which was released by Alibaba's Tongyi Lab the same month, Step-Video-T2V emphasizes parameter scale and 3D full attention, while Wan 2.1 emphasizes a more memory-efficient 1.3-billion-parameter checkpoint that can run on consumer GPUs.^[20] CogVideoX from Zhipu AI, which preceded Step-Video by roughly six months, uses a smaller DiT (2B or 5B parameters) and shorter clips. Mochi 1 from Genmo, released in October 2024, and LTX-Video from Lightricks, released in late 2024, target shorter clips at lower resolutions but with substantially lower compute footprints, with LTX-Video in particular targeting real-time inference.^[22] Veo 3 and Sora 2 both postdate the Step-Video-T2V release and appear in the table for context only. Against the closed-source systems available at the time, StepFun does not claim parity on cinematic aesthetics or audio generation but does report competitive motion smoothness and instruction following on its own benchmark.^[6]^[18]

The Broader StepFun Catalogue

Step-Video sits alongside several other StepFun foundation models:

Step-1 (2023), a 100-billion-parameter dense language model that was StepFun's first release.^[11]
Step-1V, a multimodal extension of Step-1.^[11]
Step-2 (preview March 2024, release July 2024), a trillion-parameter Mixture-of-Experts language model reported by StepFun as the first such MoE built by a Chinese startup.^[12]
Step-Audio (17 February 2025), a 130-billion-parameter unified speech understanding and generation model with a 3-billion-parameter text-to-speech sub-model, open-sourced jointly with Step-Video-T2V.^[13]
Step-1X, an image generation model launched in 2024.^[11]
Step 3, a successor language model unveiled at the 2025 World Artificial Intelligence Conference in July 2025.^[11]

Applications

Open release of Step-Video-T2V has produced two broad classes of downstream use. First, the MIT license has made it a popular base for community fine-tuning and inference research, with third-party deployments including AMD's xDiT-based inference recipe on Instinct MI300X GPUs and the integration of TI2V into ComfyUI workflows.^[23] AMD's published recipe demonstrates that Step-Video-T2V can be run on non-NVIDIA accelerators with the same xDiT parallelism strategies used in the official release, making it one of the first 30-billion-parameter video models with documented production inference on ROCm hardware.^[17] Second, StepFun's own product surface, the Yuewen video service, exposes the underlying models for end-user prompt-to-video generation in eight-second clips.^[14]

The Step-Video Turbo and TI2V variants are also positioned for downstream applications such as advertising content, anime production (TI2V was explicitly tuned with anime-style data and evaluated on an anime prompt subset), e-commerce product videos, and educational explainer animation.^[5]^[15] The combination of explicit motion-score conditioning and camera-trajectory prompts in TI2V is particularly relevant to advertising and short-form content workflows where art directors expect fine-grained control over shot composition rather than a single text-to-video prompt; StepFun's own marketing materials and the TI2V technical report both emphasize this controllability dimension.^[4]^[5]

Limitations

The technical report is forthright about residual weaknesses.^[6] Five problem areas are explicitly named:

Video captioning hallucination. Vision-language models used to caption training data introduce factual errors that destabilize training and limit instruction following. StepFun observes that the noisy training signal from imperfect captions is one of the primary blockers to better prompt fidelity in 30-billion-parameter video models.^[6]
Rare concept composition. Compositions of uncommon concepts (the report uses the example of an elephant with a penguin) tend to fail or produce blended artifacts. This pattern is consistent with similar reports from image diffusion models at smaller scales and suggests that pretraining concept distribution, rather than model capacity alone, is the binding constraint.^[6]
Compute cost. Long, high-resolution training and inference remain expensive even with the deep-compression Video-VAE. The base model requires roughly 77 to 78 gigabytes of VRAM for a 204-frame generation at the default resolution and roughly 15 to 17 minutes per clip on a single H800.^[2]^[6]
Physics and complex action. Even at 30 billion parameters the model struggles with complex gymnastic motion, realistic bouncing physics, and other causal reasoning tasks. The report frames this as motivating their proposed shift toward "predictable video foundation models" that incorporate explicit causal modeling.^[6]
DPO saturation. Improvement from Video-DPO saturates as preference data ages relative to the current policy, suggesting on-policy data refresh is required. The report proposes a dynamic reward model scored on freshly generated outputs as one mitigation, but does not claim a final solution.^[6]

Secondary coverage adds that the 540p output cap leaves Step-Video-T2V behind 1080p-capable commercial models on aesthetic detail, and that the 77 to 78 gigabyte peak VRAM requirement effectively excludes single-consumer-GPU local inference at full resolution.^[18]^[2] Some reviewers have also noted that the choice of Hunyuan-CLIP as one of the two text encoders introduces a coupling to a Tencent-released model, although StepFun's own Step-LLM encoder provides an independent path for long-context prompts.^[6]^[18]

Future Directions

The closing section of the Step-Video-T2V technical report lays out an explicit research agenda the authors label Level-2: Predictable Video Foundation Models.^[6] They argue that current text-to-video systems, including Step-Video-T2V, are best viewed as translational systems mapping from text to pixels rather than predictive systems modeling underlying world dynamics. As a result, even at 30-billion-parameter scale the models fail at compositional physics and causal reasoning. The proposed direction is to integrate explicit causal modeling, predictive simulation of physical scenes, and multimodal reasoning capabilities of the kind that the modern wave of large language models has demonstrated. StepFun frames this as analogous to the qualitative jump that occurred in language modeling when reasoning capabilities emerged, and positions Step-Video-T2V's open release as a foundation on which subsequent predictive models can be built.^[6]

Significance

Step-Video-T2V is one of a small number of foundation-scale text-to-video systems released with full open weights, code, and a detailed training report rather than a proprietary API. Its 30-billion-parameter scale, combined with a permissive MIT License and a documented Flow Matching plus Video-DPO recipe, has made it a reference point for subsequent academic and open-source video generation work, and the accompanying Step-Video-T2V-Eval benchmark provides a reproducible measurement target for prompt categories that VBench under-samples.^[1]^[6]^[9] The Step-Video-TI2V follow-up extended the same backbone to image-to-video conditioning and reached the top of the VBench-I2V leaderboard at release, while introducing explicit motion-score and camera-trajectory controls that have since been adopted by other open systems.^[4]^[5]

Beyond its concrete technical contributions, the Step-Video release also helped establish a pattern in 2025 of Chinese AI labs publishing both very large video models and exhaustive technical reports, alongside Tencent's HunyuanVideo and Alibaba's Wan 2.1. The combined effect lowered the barrier to academic and open-source video generation research that had been dominated, before late 2024, by proprietary systems with limited public information.^[1]^[6]^[19]^[20] StepFun's emphasis on a deep-compression Video-VAE (16x16 spatially and 8x temporally) also encouraged subsequent open-source efforts to invest in more aggressive temporal compression as a way to make 30-billion-parameter scale tractable on commercially available hardware.^[1]^[6]

Related Work

HunyuanVideo, Tencent's 13-billion-parameter open-source DiT video model, a direct contemporary.
Wan 2.1, Alibaba's 1.3-billion and 14-billion-parameter open-source video model family.
CogVideoX, Zhipu AI's earlier DiT-based open-source video model.
Mochi 1, Genmo's open-source DiT video model.
LTX-Video, Lightricks' real-time-oriented open-source video model.
Open-Sora, a fully open reproduction effort inspired by Sora.
Sora 2 and Veo 3, proprietary commercial competitors.
Kling and Runway Gen-4, proprietary commercial video generators.

References

1. Ma, Guoqing; Huang, Haoyang; et al. "Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model", arXiv, 2025-02-14. https://arxiv.org/abs/2502.10248. Accessed 2026-05-21.
2. StepFun, "stepvideo-t2v model card", Hugging Face, 2025-02-17. https://huggingface.co/stepfun-ai/stepvideo-t2v. Accessed 2026-05-21.
3. ComfyUI Wiki, "StepFun releases Step-Video-T2V", ComfyUI Wiki, 2025-02-17. https://comfyui-wiki.com/en/news/2025-02-17-stepfun-stepvideo-t2v. Accessed 2026-05-21.
4. Huang, Haoyang; et al. "Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model", arXiv, 2025-03-14. https://arxiv.org/abs/2503.11251. Accessed 2026-05-21.
5. StepFun, "Introduction of Step-Video-TI2V", X (Twitter), 2025-03-25. https://x.com/StepFun_ai/status/1904545620360319418. Accessed 2026-05-21.
6. Ma, Guoqing; Huang, Haoyang; et al. "Step-Video-T2V Technical Report", arXiv HTML, 2025-02-14. https://arxiv.org/html/2502.10248v1. Accessed 2026-05-21.
7. StepFun, "Step-Video-T2V GitHub repository README", GitHub, 2025-02-17. https://github.com/stepfun-ai/Step-Video-T2V. Accessed 2026-05-21.
8. Replicate, "step-video-t2v model details", Replicate, 2025-02-20. https://replicate.com/zsxkib/step-video-t2v. Accessed 2026-05-21.
9. StepFun, "Step-Video-T2V repository", GitHub, 2025-02-17. https://github.com/stepfun-ai/Step-Video-T2V. Accessed 2026-05-21.
10. Wikipedia contributors, "StepFun", Wikipedia, 2026-02-25. https://en.wikipedia.org/wiki/StepFun. Accessed 2026-05-21.
11. South China Morning Post, "Shanghai AI start-up founded by ex-Microsoft engineers bets on 'scaling law'", SCMP, 2024-06-13. https://www.scmp.com/tech/tech-trends/article/3267420/shanghai-ai-start-founded-ex-microsoft-engineers-bets-scaling-law-boost-ai-capabilities. Accessed 2026-05-21.
12. MarkTechPost, "Chinese AGI Startup StepFun Developed Step-2: A New Trillion-Parameter MoE Architecture Model", MarkTechPost, 2024-11-20. https://www.marktechpost.com/2024/11/20/chinese-agi-startup-stepfun-developed-step-2-a-new-trillion-parameter-moe-architecture-model-ranking-5th-on-livebench/. Accessed 2026-05-21.
13. Huang, Ailin; Wu, Boyong; et al. "Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction", arXiv, 2025-02-17. https://arxiv.org/abs/2502.11946. Accessed 2026-05-21.
14. StepFun, "Yuewen Video demo platform", Yuewen, 2025-02-17. https://yuewen.cn/videos. Accessed 2026-05-21.
15. StepFun, "stepvideo-ti2v model card", Hugging Face, 2025-03-17. https://huggingface.co/stepfun-ai/stepvideo-ti2v. Accessed 2026-05-21.
16. Ma, Guoqing; Huang, Haoyang; et al. "Step-Video-T2V Technical Report, Section on Video-VAE", arXiv HTML v3, 2025-02-24. https://arxiv.org/html/2502.10248v3. Accessed 2026-05-21.
17. AMD ROCm Blogs, "Step-Video-T2V Inference with xDiT on AMD Instinct MI300X GPUs", AMD, 2025-04-08. https://rocm.blogs.amd.com/artificial-intelligence/step-video-t2v/README.html. Accessed 2026-05-21.
18. Analytics Vidhya, "China's New AI Video Star: Step-Video-T2V", Analytics Vidhya, 2025-02-19. https://www.analyticsvidhya.com/blog/2025/02/step-video-t2v/. Accessed 2026-05-21.
19. Tencent, "HunyuanVideo repository", GitHub, 2024-12-03. https://github.com/Tencent/HunyuanVideo. Accessed 2026-05-21.
20. Alibaba Tongyi Lab, "Wan 2.1 release", GitHub, 2025-02-25. https://github.com/Wan-Video/Wan2.1. Accessed 2026-05-21.
21. MindStudio, "Sora vs Veo 3.1 vs Seedance 2.0", MindStudio, 2026-04-15. https://www.mindstudio.ai/blog/sora-vs-veo-3-1-vs-seedance-2-comparison. Accessed 2026-05-21.
22. Clore.ai, "Video Generation Comparison Guide", Clore.ai, 2026-03-01. https://docs.clore.ai/guides/comparisons/video-gen-comparison. Accessed 2026-05-21.
23. Kingy AI, "Step-Video-T2V Technical Report Paper Summary", Kingy AI, 2025-02-19. https://kingy.ai/blog/step-video-t2v-technical-report-paper-summary/. Accessed 2026-05-21.
