CogVideoX is a family of open-source text-to-video generation models developed by Zhipu AI and Tsinghua University (THUDM). Built on a diffusion transformer architecture, CogVideoX generates coherent, high-quality videos from text prompts or images. The model family traces its lineage to CogVideo (2022), the first publicly released large-scale pretrained text-to-video model, and has evolved through multiple iterations including CogVideoX (August 2024) and CogVideoX-1.5 (November 2024). CogVideoX was accepted as a conference paper at ICLR 2025 and has become one of the most widely adopted open-source video generation frameworks in the AI community.
The commercial counterpart of CogVideoX is Qingying (also translated as "Clear Shadow"), Zhipu AI's consumer-facing video generation product available through the Zhipu Qingyan platform and bigmodel.cn API.
The CogVideo project originated at Tsinghua University's Knowledge Engineering Group (KEG), led by Professor Jie Tang. The original CogVideo model was open-sourced on May 19, 2022, with an accompanying paper submitted to arXiv on May 29, 2022 (arXiv:2205.15868). It was later published as a conference paper at ICLR 2023.
CogVideo was a 9.4-billion-parameter transformer model trained by inheriting weights from CogView2, a pretrained text-to-image model also developed at Tsinghua. This approach of transferring knowledge from a text-to-image model to a text-to-video model significantly reduced training costs and addressed the scarcity of paired text-video datasets at the time. The model introduced a multi-frame-rate hierarchical training strategy to better align text descriptions with video content.
As the first open-source large-scale pretrained text-to-video model, CogVideo achieved state-of-the-art performance on both machine metrics and human evaluations when it was released. The paper was authored by Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang.
CogVideoX represented a fundamental architectural shift from the original CogVideo. Rather than using an autoregressive transformer approach, CogVideoX adopted a latent diffusion framework with a custom diffusion transformer backbone. The paper, titled "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer" (arXiv:2408.06072), was submitted on August 12, 2024, and accepted at ICLR 2025.
The CogVideoX paper lists 19 co-authors: Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, and Jie Tang.
The model was initially released in two variants:

- CogVideoX-2B, released on August 6, 2024: a 2-billion-parameter model aimed at lower-cost deployment, licensed under Apache 2.0.
- CogVideoX-5B, released on August 27, 2024: a 5-billion-parameter model with higher generation quality, released under the custom CogVideoX License.
On September 19, 2024, the team released CogVideoX-5B-I2V, an image-to-video variant that takes an input image along with a text prompt to generate a video, providing greater controllability over the generation process.
CogVideoX-1.5, released on November 8, 2024, brought substantial improvements in resolution, video length, and generation quality. The CogVideoX1.5-5B series supports video generation at 1360x768 resolution, 16 frames per second, and durations of 5 to 10 seconds. Both text-to-video and image-to-video variants (CogVideoX1.5-5B and CogVideoX1.5-5B-I2V) were released simultaneously.
Key improvements in the 1.5 release include:

- Higher resolution: 1360x768, up from 720x480.
- Doubled frame rate: 16 fps, up from 8 fps.
- Longer videos: 5-10 seconds, up from 6 seconds.
- Improved overall generation quality and prompt adherence.
CogVideoX introduces three core architectural innovations: a 3D Variational Autoencoder for video compression, an Expert Transformer for multimodal fusion, and a progressive training strategy.
The 3D Causal Variational Autoencoder (VAE) compresses raw video data into a compact latent representation along both spatial and temporal dimensions. Unlike 2D VAEs used in image generation models, this 3D VAE processes entire video volumes using three-dimensional convolutions.
The VAE achieves a compression ratio of 4x8x8, meaning it compresses the temporal dimension by a factor of 4 and each spatial dimension by a factor of 8. This results in a total compression factor of 256x from pixel space to latent space, substantially reducing the sequence length that the transformer must process.
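For intuition, the latent-shape arithmetic can be sketched in a few lines (the 49-frame, 480x720 example clip and the convention of keeping the first frame as its own latent are illustrative assumptions, not published specifications):

```python
# Sketch: latent-shape arithmetic for CogVideoX's 3D VAE (4x temporal, 8x8 spatial).
# The 49-frame / 480x720 example dimensions are illustrative assumptions; the
# "first frame kept, remainder compressed" convention is common in causal video VAEs.

def latent_shape(frames, height, width, t_ratio=4, s_ratio=8):
    t_latent = 1 + (frames - 1) // t_ratio  # causal: the first frame maps to its own latent
    return t_latent, height // s_ratio, width // s_ratio

t, h, w = latent_shape(49, 480, 720)
print(t, h, w)                    # 13 60 90
pixels = 49 * 480 * 720
latents = t * h * w
print(pixels // latents)          # 241 -- close to the nominal 4*8*8 = 256
```

The effective ratio here (~241x) comes in slightly below the nominal 256x because the causal first frame is not temporally compressed.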
The architecture uses temporally causal convolutions, which ensure that predictions for each frame depend only on the current and preceding frames, not on future frames. This causal design preserves temporal coherence and maintains natural frame-to-frame continuity. The encoder and decoder each contain four symmetric stages with 2x downsampling and upsampling operations using ResNet blocks.
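The causal constraint can be illustrated with a minimal 1-D temporal convolution that pads only on the past side (a pure-Python sketch; the actual VAE uses stacks of 3-D convolutions over space and time):

```python
# Sketch: temporally causal 1-D convolution. Padding is applied only at the
# front (the past), so the output for frame t depends only on frames <= t.
# Illustrative only; not the model's actual implementation.

def causal_conv1d(frames, kernel):
    k = len(kernel)
    padded = [frames[0]] * (k - 1) + list(frames)  # replicate-pad the past only
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(frames))]

frames = [1.0, 2.0, 3.0, 4.0]
out = causal_conv1d(frames, kernel=[0.5, 0.5])  # 2-tap average of (t-1, t)
print(out)  # [1.0, 1.5, 2.5, 3.5]
```

Changing a later frame leaves all earlier outputs untouched, which is exactly the property that preserves frame-to-frame continuity during decoding.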
The denoising backbone of CogVideoX is an Expert Transformer, which handles the fusion of text and video modalities. The design departs from conventional approaches that use separate cross-attention modules to inject text information into the visual stream.
Instead, CogVideoX concatenates patchified video latent embeddings with text embeddings (encoded by a T5 text encoder) along the sequence dimension. This combined sequence is then processed through transformer blocks that apply full 3D attention across both modalities simultaneously. By using this joint attention approach rather than separated spatial and temporal attention mechanisms, the model can directly model temporal relationships between frames while maintaining strong text-video alignment.
The "expert" designation refers to the Expert Adaptive LayerNorm (AdaLN) mechanism. Standard layer normalization applies the same normalization parameters to all tokens, but in CogVideoX, the adaptive LayerNorm applies different normalization parameters to text tokens and video tokens. This allows each modality to maintain its own scale and shift values, addressing the inherent distributional differences between textual and visual features. The adaptive parameters are conditioned on the diffusion timestep, enabling the model to adjust its behavior at different stages of the denoising process. This design facilitates deep cross-modal fusion while keeping the additional parameter count minimal.
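A toy NumPy sketch of the mechanism, with illustrative dimensions and randomly initialized projection weights (the real model predicts several modulation parameters per transformer block, not just one scale/shift pair per modality):

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
d = 8
text_tokens = rng.normal(size=(4, d))    # 4 text tokens
video_tokens = rng.normal(size=(16, d))  # 16 video patch tokens
t_emb = rng.normal(size=(d,))            # diffusion-timestep embedding

# Two "expert" affine maps, one per modality, both conditioned on the timestep.
W_text, W_vid = rng.normal(size=(2, 2 * d, d))
scale_t, shift_t = np.split(W_text @ t_emb, 2)
scale_v, shift_v = np.split(W_vid @ t_emb, 2)

# Each modality gets its own scale and shift after normalization.
text_out = layernorm(text_tokens) * (1 + scale_t) + shift_t
video_out = layernorm(video_tokens) * (1 + scale_v) + shift_v

# The modulated tokens then form one joint sequence for full attention.
seq = np.concatenate([text_out, video_out], axis=0)
print(seq.shape)  # (20, 8)
```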
CogVideoX employs a 3D full attention mechanism that operates jointly across space and time. This avoids the information loss that can occur when spatial and temporal attention are computed separately, as is done in many prior video generation models. The full attention mechanism enables direct interaction between all patches across all frames, leading to more consistent dynamics throughout the generated video.
For positional encoding, the CogVideoX-2B variant uses 3D sinusoidal-cosine positional embeddings, while the 5B and 1.5 variants use 3D Rotary Position Embeddings (3D-RoPE). The 3D-RoPE approach independently embeds spatial coordinates (height, width) and temporal coordinates (frame index), allowing the model to generalize to different resolutions and video lengths.
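The mechanism can be sketched by applying standard 1-D rotary embeddings independently to channel groups assigned to each axis (the equal three-way channel split below is an illustrative simplification, not necessarily the paper's exact allocation):

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate pairs of channels by an angle determined by one coordinate."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, t, h, w):
    # Split the channel dimension into three groups, one per coordinate axis.
    d = x.shape[-1] // 3
    parts = [rope_1d(x[..., :d], t),
             rope_1d(x[..., d:2 * d], h),
             rope_1d(x[..., 2 * d:], w)]
    return np.concatenate(parts, axis=-1)

x = np.random.default_rng(1).normal(size=(12,))
y = rope_3d(x, t=3, h=5, w=7)
# Rotations preserve the norm of each channel pair, hence the whole vector:
print(np.allclose(np.linalg.norm(x), np.linalg.norm(y)))  # True
```

Because position enters only through per-axis rotations, tokens at unseen (t, h, w) coordinates still receive well-defined embeddings, which is what lets the model extrapolate to new resolutions and lengths.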
After the 3D VAE encodes the video into a latent representation, the latent volume is divided into spatiotemporal patches. This patchification step converts the continuous latent tensor into a sequence of discrete tokens suitable for transformer processing. The process is analogous to how Vision Transformers (ViT) patchify images, but extended to the temporal dimension.
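The payoff of compression plus patchification is a tractable sequence length. A back-of-the-envelope sketch (the latent dimensions and the 2x2 spatial patch size here are illustrative assumptions):

```python
# Sketch: transformer sequence length after patchifying a video latent.
# A latent of 13 x 60 x 90 (for a ~6 s, 480x720 clip) and a 2x2 spatial
# patch are illustrative assumptions, not published specifications.

def num_tokens(t, h, w, patch=2):
    return t * (h // patch) * (w // patch)

latent_tokens = num_tokens(13, 60, 90)       # tokens the transformer attends over
print(latent_tokens)                          # 17550
raw_tokens = num_tokens(49, 480, 720)         # same patching in raw pixel space
print(raw_tokens // latent_tokens)            # 241 -- the VAE's savings
```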
CogVideoX-5B was trained on approximately 35 million high-quality video clips, each averaging about six seconds in length. In addition, the training incorporated roughly 2 billion filtered images sourced from the LAION-5B and COYO-700M datasets. This mixed image-video training approach helps the model learn both spatial visual quality from still images and temporal dynamics from video clips.
The training data underwent a rigorous filtering pipeline. Automated classifiers identified and removed low-quality, redundant, or otherwise unsuitable content. Videos with poor motion connectivity, excessive editing artifacts, and other noise characteristics were excluded.
For video captioning, the team deployed a multi-stage pipeline that captions sampled frames with an image-understanding model and then summarizes the frame-level captions into a single dense description of each video.
For CogVideoX-1.5, the captioning pipeline was further improved with CogVLM2-Caption, an end-to-end video understanding model that generates more accurate content descriptions.
CogVideoX uses a progressive training approach that proceeds through multiple stages: the model is first pretrained at low resolution to learn semantics and coarse motion, then trained at progressively higher resolutions, and finally fine-tuned on a smaller subset of high-quality videos to sharpen visual fidelity.
This progressive approach, combined with a multi-resolution frame packing technique, enables the model to efficiently learn to generate coherent videos with significant motion across varying durations and aspect ratios.
The training employs explicit uniform sampling for the diffusion noise schedule, which improves training stability.
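Explicit uniform sampling can be sketched as stratified timestep sampling: rather than every data-parallel process drawing its diffusion timestep independently, the timestep range is partitioned so that each process samples from its own slice (an illustrative sketch, not the actual training code):

```python
import random

def explicit_uniform_timesteps(num_ranks, T, seed=0):
    """Each rank samples from its own equal-width slice of [0, T),
    so every training step covers the timestep range more evenly
    than fully independent uniform draws would."""
    rng = random.Random(seed)
    width = T / num_ranks
    return [int(rng.uniform(i * width, (i + 1) * width)) for i in range(num_ranks)]

ts = explicit_uniform_timesteps(num_ranks=8, T=1000)
print(ts)  # one timestep per rank, each inside its own 125-step stratum
```

Stratifying the timesteps reduces the variance of the diffusion loss across steps, which is the stability benefit the training recipe is after.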
The following table summarizes the key specifications across all released CogVideoX model variants.
| Model | Release Date | Parameters | Resolution | FPS | Duration | Positional Encoding | License |
|---|---|---|---|---|---|---|---|
| CogVideoX-2B | August 6, 2024 | 2B | 720x480 | 8 | 6 seconds | 3D Sincos | Apache 2.0 |
| CogVideoX-5B | August 27, 2024 | 5B | 720x480 | 8 | 6 seconds | 3D RoPE | CogVideoX License |
| CogVideoX-5B-I2V | September 19, 2024 | 5B | 720x480 | 8 | 6 seconds | 3D RoPE | CogVideoX License |
| CogVideoX1.5-5B | November 8, 2024 | 5B | 1360x768 | 16 | 5-10 seconds | 3D RoPE | CogVideoX License |
| CogVideoX1.5-5B-I2V | November 8, 2024 | 5B | 1360x768 | 16 | 5-10 seconds | 3D RoPE | CogVideoX License |
All models accept English-language prompts, with a maximum prompt length of 224-226 tokens depending on the variant.
The CogVideoX models use a split licensing structure:

- CogVideoX-2B is released under the permissive Apache 2.0 license.
- The 5B models (CogVideoX-5B, CogVideoX-5B-I2V, and the CogVideoX1.5-5B series) are released under the custom CogVideoX License, which imposes additional usage terms.
CogVideoX models can run on consumer and professional GPUs with various optimization strategies.
| Configuration | CogVideoX-2B (FP16) | CogVideoX-5B (BF16) | CogVideoX1.5-5B (BF16) |
|---|---|---|---|
| Single GPU VRAM (with offloading) | From 4 GB | From 5 GB | From 10 GB |
| Single GPU VRAM (INT8 quantized) | From 3.6 GB | From 4.4 GB | From 7 GB |
| Multi-GPU (diffusers) | N/A | 15 GB | 24 GB |
| Inference speed (A100, 50 steps) | ~90 seconds | ~180 seconds | ~1000 seconds (5s video) |
| Inference speed (H100, 50 steps) | ~45 seconds | ~90 seconds | ~550 seconds (5s video) |
These VRAM figures assume that CPU offloading, VAE slicing, and VAE tiling optimizations are enabled. Disabling them roughly triples peak VRAM consumption but speeds up inference by 3-4x.
Quantization through PyTorch AO and Optimum-quanto libraries further reduces memory requirements, enabling deployment on free-tier cloud GPUs and consumer hardware with as little as 4 GB of VRAM.
CogVideoX has developed a broad ecosystem of integrations and community tools since its release.
CogVideoX is officially supported in the Hugging Face Diffusers library, providing a standardized inference and fine-tuning interface. Model weights are available on the Hugging Face Hub under both the THUDM and zai-org organizations. The diffusers integration supports all CogVideoX variants and includes built-in support for memory optimization techniques.
In addition to Diffusers, CogVideoX maintains native support for the SAT (SwissArmyTransformer) framework, which was the original inference and training framework used during development. Tools are provided to convert between SAT and Diffusers weight formats.
The ComfyUI-CogVideoXWrapper integrates CogVideoX into the ComfyUI node-based workflow system, making it accessible to users who prefer a visual interface for video generation pipelines.
The official repository and Diffusers both provide LoRA fine-tuning scripts for CogVideoX. The CogVideoX team recommends approximately 4,000 training steps with around 100 training videos for optimal LoRA results. Fine-tuning CogVideoX-5B with LoRA is feasible on a single NVIDIA RTX 4090 GPU through the cogvideox-factory framework.
Third-party support extends to ControlNet for guided generation, xDiT for distributed inference, VideoSys for system-level optimization, and various quantization backends for efficient deployment.
The following table compares CogVideoX with other prominent video generation models as of late 2024.
| Feature | CogVideoX-5B / 1.5-5B | HunyuanVideo | Kling | Sora |
|---|---|---|---|---|
| Developer | Zhipu AI / Tsinghua University | Tencent | Kuaishou | OpenAI |
| Release Date | August-November 2024 | December 2024 | June 2024 | December 2024 |
| Parameters | 5B | 13B+ | Undisclosed | Undisclosed |
| Architecture | Diffusion Transformer + Expert AdaLN | Dual-stream to Single-stream DiT | DiT + 3D VAE | Diffusion Transformer |
| Max Resolution | 1360x768 (v1.5) | 1280x720 | Up to 1080p | Up to 1080p |
| Max Duration | 10 seconds (v1.5) | 5 seconds (open-source) | Up to 2 minutes | Up to 20 seconds |
| Frame Rate | 16 fps (v1.5) | 24 fps | Up to 30 fps | Variable |
| Open Source | Yes (weights + code) | Yes (weights + code) | No (API only) | No (API only) |
| License | Apache 2.0 (2B) / CogVideoX License (5B) | Tencent Hunyuan Community License | Proprietary | Proprietary |
| Text Encoder | T5 | MLLM (multimodal LLM) | Undisclosed | Undisclosed |
| Image-to-Video | Yes (I2V variants) | Yes | Yes | Yes |
| Fine-Tuning Support | Yes (LoRA, full) | Yes (LoRA) | No | No |
CogVideoX occupies a distinctive position in the video generation landscape. While it does not match the raw output quality or duration capabilities of proprietary systems like Sora or Kling, its fully open weights and code, combined with active fine-tuning support, make it one of the most accessible and customizable video generation frameworks available. HunyuanVideo offers a larger model with higher parameter count but was released later and uses a more restrictive license with geographic limitations.
CogVideoX demonstrates state-of-the-art performance among open-source video generation models across multiple evaluation frameworks.
On the VBench benchmark suite, which evaluates video generation across dimensions including action fidelity, scene structure, motion dynamics, object consistency, and semantic alignment, CogVideoX-5B achieves the best performance in five out of seven metrics compared to models like VideoCrafter-2.0 and Open-Sora.
CogVideoX-1.5 shows strong performance in dimensions related to complex prompt adherence (such as complex landscape and complex plot generation) and physics simulation. Human evaluation studies indicate that CogVideoX performs well across sensory quality, instruction following, physical realism, and prompt coverage. The model achieves high peak signal-to-noise ratio (PSNR) scores and low flicker scores, supporting strong temporal coherence.
In VBench-2.0 evaluations published in 2025, which introduced reasoning-focused evaluation dimensions, CogVideoX and other open-source models (including HunyuanVideo and Wan2.2) scored between 0.273 and 0.371 on reasoning tasks, compared to 0.546 for Sora 2, suggesting that reasoning and physical consistency remain areas for improvement across all open-source video generation models.
Qingying (meaning "Clear Shadow" in Chinese) is Zhipu AI's consumer-facing video generation product built on CogVideoX technology. Launched in July 2024, Qingying supports text-to-video and image-to-video generation through both a web interface and API access via the Zhipu Big Model Open Platform.
The commercial Qingying system has been upgraded beyond the open-source CogVideoX capabilities, supporting 10-second, 4K-resolution videos at 60 frames per second with synchronized sound effects. The product initially generated 1440x960 resolution videos up to six seconds long, and has continued to improve with each model update.
Despite its strengths, CogVideoX has several known limitations:

- Prompts must be in English and are capped at 224-226 tokens.
- Maximum video duration and resolution remain below those of proprietary systems such as Sora and Kling.
- Inference is slow: generating a single 5-second video with CogVideoX1.5-5B takes roughly 9-17 minutes on H100/A100-class GPUs at 50 sampling steps.
- Reasoning about physics and complex causal interactions remains weak, as reflected in VBench-2.0 reasoning scores well below closed-source leaders.