CogVideoX is a family of open-source text-to-video generation models developed by Zhipu AI and Tsinghua University (THUDM). Built on a diffusion transformer architecture, CogVideoX generates coherent, high-quality videos from text prompts or images. The model family traces its lineage to CogVideo (2022), the first publicly released large-scale pretrained text-to-video model, and has evolved through multiple iterations including CogVideoX (August 2024) and CogVideoX-1.5 (November 2024). CogVideoX was accepted as a conference paper at ICLR 2025 and has become one of the most widely adopted open-source video generation frameworks in the AI community.
The commercial counterpart of CogVideoX is Qingying (also translated as "Clear Shadow"), Zhipu AI's consumer-facing video generation product available through the Zhipu Qingyan platform and bigmodel.cn API.
The CogVideo project originated at Tsinghua University's Knowledge Engineering Group (KEG), led by Professor Jie Tang. The original CogVideo model was open-sourced on May 19, 2022, with an accompanying paper submitted to arXiv on May 29, 2022 (arXiv:2205.15868). It was later published as a conference paper at ICLR 2023.
CogVideo was a 9.4-billion-parameter transformer model trained by inheriting weights from CogView2, a pretrained text-to-image model also developed at Tsinghua. This approach of transferring knowledge from a text-to-image model to a text-to-video model significantly reduced training costs and addressed the scarcity of paired text-video datasets at the time. The model introduced a multi-frame-rate hierarchical training strategy to better align text descriptions with video content.
As the first open-source large-scale pretrained text-to-video model, CogVideo achieved state-of-the-art performance on both machine metrics and human evaluations when it was released. The paper was authored by Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang.
CogVideoX represented a fundamental architectural shift from the original CogVideo. Rather than using an autoregressive transformer approach, CogVideoX adopted a latent diffusion framework with a custom diffusion transformer backbone. The paper, titled "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer" (arXiv:2408.06072), was submitted on August 12, 2024, and accepted at ICLR 2025.
The CogVideoX paper lists 19 co-authors: Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, and Jie Tang.
The model was initially released in two variants:

- CogVideoX-2B, released on August 6, 2024: a 2-billion-parameter model aimed at lower-cost deployment, licensed under Apache 2.0.
- CogVideoX-5B, released on August 27, 2024: a 5-billion-parameter model with higher generation quality, released under the custom CogVideoX License.
On September 19, 2024, the team released CogVideoX-5B-I2V, an image-to-video variant that takes an input image along with a text prompt to generate a video, providing greater controllability over the generation process.
CogVideoX-1.5, released on November 8, 2024, brought substantial improvements in resolution, video length, and generation quality. The CogVideoX1.5-5B series supports video generation at 1360x768 resolution, 16 frames per second, and durations of 5 to 10 seconds. Both text-to-video and image-to-video variants (CogVideoX1.5-5B and CogVideoX1.5-5B-I2V) were released simultaneously.
Key improvements in the 1.5 release include:

- Higher resolution: 1360x768, up from 720x480.
- Doubled frame rate: 16 fps, up from 8 fps.
- Longer videos: 5-10 seconds, up from 6 seconds.
- Improved overall generation quality and prompt adherence.
CogVideoX introduces three core architectural innovations: a 3D Variational Autoencoder for video compression, an Expert Transformer for multimodal fusion, and a progressive training strategy.
The 3D Causal Variational Autoencoder (VAE) compresses raw video data into a compact latent representation along both spatial and temporal dimensions. Unlike 2D VAEs used in image generation models, this 3D VAE processes entire video volumes using three-dimensional convolutions.
The VAE achieves a compression ratio of 4x8x8, meaning it compresses the temporal dimension by a factor of 4 and each spatial dimension by a factor of 8. This results in a total compression factor of 256x from pixel space to latent space, substantially reducing the sequence length that the transformer must process.
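For intuition, the latent-shape arithmetic can be sketched in a few lines (the 49-frame, 480x720 example clip and the convention of keeping the first frame as its own latent are illustrative assumptions, not published specifications):

```python
# Sketch: latent-shape arithmetic for CogVideoX's 3D VAE (4x temporal, 8x8 spatial).
# The 49-frame / 480x720 example dimensions are illustrative assumptions; the
# "first frame kept, remainder compressed" convention is common in causal video VAEs.

def latent_shape(frames, height, width, t_ratio=4, s_ratio=8):
    t_latent = 1 + (frames - 1) // t_ratio  # causal: the first frame maps to its own latent
    return t_latent, height // s_ratio, width // s_ratio

t, h, w = latent_shape(49, 480, 720)
print(t, h, w)                    # 13 60 90
pixels = 49 * 480 * 720
latents = t * h * w
print(pixels // latents)          # 241 -- close to the nominal 4*8*8 = 256
```

The effective ratio here (~241x) comes in slightly below the nominal 256x because the causal first frame is not temporally compressed.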
The architecture uses temporally causal convolutions, which ensure that predictions for each frame depend only on the current and preceding frames, not on future frames. This causal design preserves temporal coherence and maintains natural frame-to-frame continuity. The encoder and decoder each contain four symmetric stages with 2x downsampling and upsampling operations using ResNet blocks.
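The causal constraint can be illustrated with a minimal 1-D temporal convolution that pads only on the past side (a pure-Python sketch; the actual VAE uses stacks of 3-D convolutions over space and time):

```python
# Sketch: temporally causal 1-D convolution. Padding is applied only at the
# front (the past), so the output for frame t depends only on frames <= t.
# Illustrative only; not the model's actual implementation.

def causal_conv1d(frames, kernel):
    k = len(kernel)
    padded = [frames[0]] * (k - 1) + list(frames)  # replicate-pad the past only
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(frames))]

frames = [1.0, 2.0, 3.0, 4.0]
out = causal_conv1d(frames, kernel=[0.5, 0.5])  # 2-tap average of (t-1, t)
print(out)  # [1.0, 1.5, 2.5, 3.5]
```

Changing a later frame leaves all earlier outputs untouched, which is exactly the property that preserves frame-to-frame continuity during decoding.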
The denoising backbone of CogVideoX is an Expert Transformer, which handles the fusion of text and video modalities. The design departs from conventional approaches that use separate cross-attention modules to inject text information into the visual stream.
Instead, CogVideoX concatenates patchified video latent embeddings with text embeddings (encoded by a T5 text encoder) along the sequence dimension. This combined sequence is then processed through transformer blocks that apply full 3D attention across both modalities simultaneously. By using this joint attention approach rather than separated spatial and temporal attention mechanisms, the model can directly model temporal relationships between frames while maintaining strong text-video alignment.
The "expert" designation refers to the Expert Adaptive LayerNorm (AdaLN) mechanism. Standard layer normalization applies the same normalization parameters to all tokens, but in CogVideoX, the adaptive LayerNorm applies different normalization parameters to text tokens and video tokens. This allows each modality to maintain its own scale and shift values, addressing the inherent distributional differences between textual and visual features. The adaptive parameters are conditioned on the diffusion timestep, enabling the model to adjust its behavior at different stages of the denoising process. This design facilitates deep cross-modal fusion while keeping the additional parameter count minimal.
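A toy NumPy sketch of the mechanism, with illustrative dimensions and randomly initialized projection weights (the real model predicts several modulation parameters per transformer block, not just one scale/shift pair per modality):

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
d = 8
text_tokens = rng.normal(size=(4, d))    # 4 text tokens
video_tokens = rng.normal(size=(16, d))  # 16 video patch tokens
t_emb = rng.normal(size=(d,))            # diffusion-timestep embedding

# Two "expert" affine maps, one per modality, both conditioned on the timestep.
W_text, W_vid = rng.normal(size=(2, 2 * d, d))
scale_t, shift_t = np.split(W_text @ t_emb, 2)
scale_v, shift_v = np.split(W_vid @ t_emb, 2)

# Each modality gets its own scale and shift after normalization.
text_out = layernorm(text_tokens) * (1 + scale_t) + shift_t
video_out = layernorm(video_tokens) * (1 + scale_v) + shift_v

# The modulated tokens then form one joint sequence for full attention.
seq = np.concatenate([text_out, video_out], axis=0)
print(seq.shape)  # (20, 8)
```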
CogVideoX employs a 3D full attention mechanism that operates jointly across space and time. This avoids the information loss that can occur when spatial and temporal attention are computed separately, as is done in many prior video generation models. The full attention mechanism enables direct interaction between all patches across all frames, leading to more consistent dynamics throughout the generated video.
For positional encoding, the CogVideoX-2B variant uses 3D sinusoidal-cosine positional embeddings, while the 5B and 1.5 variants use 3D Rotary Position Embeddings (3D-RoPE). The 3D-RoPE approach independently embeds spatial coordinates (height, width) and temporal coordinates (frame index), allowing the model to generalize to different resolutions and video lengths.
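The mechanism can be sketched by applying standard 1-D rotary embeddings independently to channel groups assigned to each axis (the equal three-way channel split below is an illustrative simplification, not necessarily the paper's exact allocation):

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate pairs of channels by an angle determined by one coordinate."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, t, h, w):
    # Split the channel dimension into three groups, one per coordinate axis.
    d = x.shape[-1] // 3
    parts = [rope_1d(x[..., :d], t),
             rope_1d(x[..., d:2 * d], h),
             rope_1d(x[..., 2 * d:], w)]
    return np.concatenate(parts, axis=-1)

x = np.random.default_rng(1).normal(size=(12,))
y = rope_3d(x, t=3, h=5, w=7)
# Rotations preserve the norm of each channel pair, hence the whole vector:
print(np.allclose(np.linalg.norm(x), np.linalg.norm(y)))  # True
```

Because position enters only through per-axis rotations, tokens at unseen (t, h, w) coordinates still receive well-defined embeddings, which is what lets the model extrapolate to new resolutions and lengths.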
After the 3D VAE encodes the video into a latent representation, the latent volume is divided into spatiotemporal patches. This patchification step converts the continuous latent tensor into a sequence of discrete tokens suitable for transformer processing. The process is analogous to how Vision Transformers (ViT) patchify images, but extended to the temporal dimension.
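The payoff of compression plus patchification is a tractable sequence length. A back-of-the-envelope sketch (the latent dimensions and the 2x2 spatial patch size here are illustrative assumptions):

```python
# Sketch: transformer sequence length after patchifying a video latent.
# A latent of 13 x 60 x 90 (for a ~6 s, 480x720 clip) and a 2x2 spatial
# patch are illustrative assumptions, not published specifications.

def num_tokens(t, h, w, patch=2):
    return t * (h // patch) * (w // patch)

latent_tokens = num_tokens(13, 60, 90)       # tokens the transformer attends over
print(latent_tokens)                          # 17550
raw_tokens = num_tokens(49, 480, 720)         # same patching in raw pixel space
print(raw_tokens // latent_tokens)            # 241 -- the VAE's savings
```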
CogVideoX-5B was trained on approximately 35 million high-quality video clips, each averaging about six seconds in length. In addition, the training incorporated roughly 2 billion filtered images sourced from the LAION-5B and COYO-700M datasets. This mixed image-video training approach helps the model learn both spatial visual quality from still images and temporal dynamics from video clips.
The training data underwent a rigorous filtering pipeline. Automated classifiers identified and removed low-quality, redundant, or otherwise unsuitable content. Videos with poor motion connectivity, excessive editing artifacts, and other noise characteristics were excluded.
For video captioning, the team deployed a multi-stage pipeline that captions sampled frames with an image-understanding model and then summarizes the frame-level captions into a single dense description of each video.
For CogVideoX-1.5, the captioning pipeline was further improved with CogVLM2-Caption, an end-to-end video understanding model that generates more accurate content descriptions.
CogVideoX uses a progressive training approach that proceeds through multiple stages: the model is first pretrained at low resolution to learn semantics and coarse motion, then trained at progressively higher resolutions, and finally fine-tuned on a smaller subset of high-quality videos to sharpen visual fidelity.
This progressive approach, combined with a multi-resolution frame packing technique, enables the model to efficiently learn to generate coherent videos with significant motion across varying durations and aspect ratios.
The training employs explicit uniform sampling for the diffusion noise schedule, which improves training stability.
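Explicit uniform sampling can be sketched as stratified timestep sampling: rather than every data-parallel process drawing its diffusion timestep independently, the timestep range is partitioned so that each process samples from its own slice (an illustrative sketch, not the actual training code):

```python
import random

def explicit_uniform_timesteps(num_ranks, T, seed=0):
    """Each rank samples from its own equal-width slice of [0, T),
    so every training step covers the timestep range more evenly
    than fully independent uniform draws would."""
    rng = random.Random(seed)
    width = T / num_ranks
    return [int(rng.uniform(i * width, (i + 1) * width)) for i in range(num_ranks)]

ts = explicit_uniform_timesteps(num_ranks=8, T=1000)
print(ts)  # one timestep per rank, each inside its own 125-step stratum
```

Stratifying the timesteps reduces the variance of the diffusion loss across steps, which is the stability benefit the training recipe is after.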
The following table summarizes the key specifications across all released CogVideoX model variants.
| Model | Release Date | Parameters | Resolution | FPS | Duration | Positional Encoding | License |
|---|---|---|---|---|---|---|---|
| CogVideoX-2B | August 6, 2024 | 2B | 720x480 | 8 | 6 seconds | 3D Sincos | Apache 2.0 |
| CogVideoX-5B | August 27, 2024 | 5B | 720x480 | 8 | 6 seconds | 3D RoPE | CogVideoX License |
| CogVideoX-5B-I2V | September 19, 2024 | 5B | 720x480 | 8 | 6 seconds | 3D RoPE | CogVideoX License |
| CogVideoX1.5-5B | November 8, 2024 | 5B | 1360x768 | 16 | 5-10 seconds | 3D RoPE | CogVideoX License |
| CogVideoX1.5-5B-I2V | November 8, 2024 | 5B | 1360x768 | 16 | 5-10 seconds | 3D RoPE | CogVideoX License |
All models accept English-language prompts, with a maximum prompt length of 224-226 tokens depending on the variant.
The CogVideoX models use a split licensing structure:

- CogVideoX-2B is released under the permissive Apache 2.0 license.
- The 5B models (CogVideoX-5B, CogVideoX-5B-I2V, and the CogVideoX1.5-5B series) are released under the custom CogVideoX License, which imposes additional usage terms.
CogVideoX models can run on consumer and professional GPUs with various optimization strategies.
| Configuration | CogVideoX-2B (FP16) | CogVideoX-5B (BF16) | CogVideoX1.5-5B (BF16) |
|---|---|---|---|
| Single GPU VRAM (with offloading) | From 4 GB | From 5 GB | From 10 GB |
| Single GPU VRAM (INT8 quantized) | From 3.6 GB | From 4.4 GB | From 7 GB |
| Multi-GPU (diffusers) | N/A | 15 GB | 24 GB |
| Inference speed (A100, 50 steps) | ~90 seconds | ~180 seconds | ~1000 seconds (5s video) |
| Inference speed (H100, 50 steps) | ~45 seconds | ~90 seconds | ~550 seconds (5s video) |
These VRAM figures assume that CPU offloading, VAE slicing, and VAE tiling optimizations are enabled. Disabling them roughly triples peak VRAM consumption but speeds up inference by 3-4x.
Quantization through PyTorch AO and Optimum-quanto libraries further reduces memory requirements, enabling deployment on free-tier cloud GPUs and consumer hardware with as little as 4 GB of VRAM.
CogVideoX has developed a broad ecosystem of integrations and community tools since its release.
CogVideoX is officially supported in the Hugging Face Diffusers library, providing a standardized inference and fine-tuning interface. Model weights are available on the Hugging Face Hub under both the THUDM and zai-org organizations. The diffusers integration supports all CogVideoX variants and includes built-in support for memory optimization techniques.
In addition to Diffusers, CogVideoX maintains native support for the SAT (SwissArmyTransformer) framework, which was the original inference and training framework used during development. Tools are provided to convert between SAT and Diffusers weight formats.
The ComfyUI-CogVideoXWrapper integrates CogVideoX into the ComfyUI node-based workflow system, making it accessible to users who prefer a visual interface for video generation pipelines.
The official repository and Diffusers both provide LoRA fine-tuning scripts for CogVideoX. The CogVideoX team recommends approximately 4,000 training steps with around 100 training videos for optimal LoRA results. Fine-tuning CogVideoX-5B with LoRA is feasible on a single NVIDIA RTX 4090 GPU through the cogvideox-factory framework.
Third-party support extends to ControlNet for guided generation, xDiT for distributed inference, VideoSys for system-level optimization, and various quantization backends for efficient deployment.
The following table compares CogVideoX with other prominent video generation models as of late 2024.
| Feature | CogVideoX-5B / 1.5-5B | HunyuanVideo | Kling | Sora |
|---|---|---|---|---|
| Developer | Zhipu AI / Tsinghua University | Tencent | Kuaishou | OpenAI |
| Release Date | August-November 2024 | December 2024 | June 2024 | December 2024 |
| Parameters | 5B | 13B+ | Undisclosed | Undisclosed |
| Architecture | Diffusion Transformer + Expert AdaLN | Dual-stream to Single-stream DiT | DiT + 3D VAE | Diffusion Transformer |
| Max Resolution | 1360x768 (v1.5) | 1280x720 | Up to 1080p | Up to 1080p |
| Max Duration | 10 seconds (v1.5) | 5 seconds (open-source) | Up to 2 minutes | Up to 20 seconds |
| Frame Rate | 16 fps (v1.5) | 24 fps | Up to 30 fps | Variable |
| Open Source | Yes (weights + code) | Yes (weights + code) | No (API only) | No (API only) |
| License | Apache 2.0 (2B) / CogVideoX License (5B) | Tencent Hunyuan Community License | Proprietary | Proprietary |
| Text Encoder | T5 | MLLM (multimodal LLM) | Undisclosed | Undisclosed |
| Image-to-Video | Yes (I2V variants) | Yes | Yes | Yes |
| Fine-Tuning Support | Yes (LoRA, full) | Yes (LoRA) | No | No |
CogVideoX occupies a distinctive position in the video generation landscape. While it does not match the raw output quality or duration capabilities of proprietary systems like Sora or Kling, its fully open weights and code, combined with active fine-tuning support, make it one of the most accessible and customizable video generation frameworks available. HunyuanVideo offers a larger model with higher parameter count but was released later and uses a more restrictive license with geographic limitations.
CogVideoX demonstrates state-of-the-art performance among open-source video generation models across multiple evaluation frameworks.
On the VBench benchmark suite, which evaluates video generation across dimensions including action fidelity, scene structure, motion dynamics, object consistency, and semantic alignment, CogVideoX-5B achieves the best performance in five out of seven metrics compared to models like VideoCrafter-2.0 and Open-Sora.
CogVideoX-1.5 shows strong performance in dimensions related to complex prompt adherence (such as complex landscape and complex plot generation) and physics simulation. Human evaluation studies indicate that CogVideoX performs well across sensory quality, instruction following, physical realism, and prompt coverage. The model achieves high peak signal-to-noise ratio (PSNR) scores and low flicker scores, supporting strong temporal coherence.
In VBench-2.0 evaluations published in 2025, which introduced reasoning-focused evaluation dimensions, CogVideoX and other open-source models (including HunyuanVideo and Wan2.2) scored between 0.273 and 0.371 on reasoning tasks, compared to 0.546 for Sora 2, suggesting that reasoning and physical consistency remain areas for improvement across all open-source video generation models.
Qingying (meaning "Clear Shadow" in Chinese) is Zhipu AI's consumer-facing video generation product built on CogVideoX technology. Launched in July 2024, Qingying supports text-to-video and image-to-video generation through both a web interface and API access via the Zhipu Big Model Open Platform.
The commercial Qingying system has been upgraded beyond the open-source CogVideoX capabilities, supporting 10-second, 4K-resolution videos at 60 frames per second with synchronized sound effects. The product initially generated 1440x960 resolution videos up to six seconds long, and has continued to improve with each model update.
Despite its strengths, CogVideoX has several known limitations:

- Prompts must be in English and are capped at 224-226 tokens.
- Maximum video duration and resolution remain below those of proprietary systems such as Sora and Kling.
- Inference is slow: generating a single 5-second video with CogVideoX1.5-5B takes roughly 9-17 minutes on H100/A100-class GPUs at 50 sampling steps.
- Reasoning about physics and complex causal interactions remains weak, as reflected in VBench-2.0 reasoning scores well below closed-source leaders.