# CogVideoX

> Source: https://aiwiki.ai/wiki/cogvideo
> Updated: 2026-06-24
> Categories: Chinese AI, Generative AI, Open Source AI, Video Generation
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**CogVideoX** is a family of open-source [text-to-video](/wiki/text_to_video) generation models developed by [Zhipu AI](/wiki/zhipu_ai) and [Tsinghua University](/wiki/tsinghua_university) (THUDM) that generate up to 10-second continuous videos from a text prompt at 16 frames per second and 1360x768 resolution.[1] Built on a [diffusion transformer](/wiki/diffusion_models) (DiT) architecture with a 3D causal [variational autoencoder](/wiki/variational_autoencoder) and an expert transformer, CogVideoX is one of the most widely adopted open-weights video generation models in the AI community. Its lineage traces to CogVideo (2022), which its authors describe as the first open-source large-scale pretrained text-to-video model, a 9.4-billion-parameter [transformer](/wiki/transformer) built by inheriting weights from the CogView2 [text-to-image](/wiki/text-to-image_models) model.[2] The CogVideoX paper was accepted at [ICLR](/wiki/iclr) 2025, and the model weights, 3D causal VAE, and video caption model are all publicly released on GitHub and the [Hugging Face](/wiki/hugging_face) Hub.[1][3]

The commercial counterpart of CogVideoX is Qingying (translated as "Clear Shadow" and marketed in English as "Ying"), Zhipu AI's consumer-facing video generation product, available through the Zhipu Qingyan platform and the bigmodel.cn API.[7]

## History and Development

### When was CogVideo released?

The CogVideo project originated at Tsinghua University's Knowledge Engineering Group (KEG), led by Professor Jie Tang. The original CogVideo model was open-sourced on May 19, 2022, with an accompanying paper submitted to arXiv on May 29, 2022 (arXiv:2205.15868).[2] It was later published as a conference paper at ICLR 2023.

CogVideo was a 9.4-billion-parameter [transformer](/wiki/transformer) model trained by inheriting weights from CogView2, a pretrained text-to-image model also developed at Tsinghua. It inherited roughly 6 billion of those parameters directly from CogView2 and reused its architecture as the foundation for a dual-channel attention mechanism.[2] This approach of transferring knowledge from a text-to-image model to a text-to-video model significantly reduced training costs and addressed the scarcity of paired text-video datasets at the time. The model introduced a multi-frame-rate hierarchical training strategy to better align text descriptions with video content.

Described by its authors as the first open-source large-scale pretrained text-to-video model, CogVideo reported state-of-the-art results on both machine metrics and human evaluations at the time of its release.[2] The paper was authored by Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang.

### CogVideoX (2024)

CogVideoX represented a fundamental architectural shift from the original CogVideo. Rather than using an autoregressive transformer approach, CogVideoX adopted a [latent diffusion](/wiki/latent_diffusion) framework with a custom diffusion transformer backbone. The paper, titled "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer" (arXiv:2408.06072), was submitted on August 12, 2024, and accepted at ICLR 2025; its latest revision (v3) was posted on March 26, 2025.[1]

The CogVideoX paper lists 19 co-authors: Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, and Jie Tang.[1]

The model was released in two variants:

- **CogVideoX-2B**: Released on August 6, 2024, with 2 billion parameters. This entry-level variant generates 6-second videos (49 frames) at 720x480 resolution and 8 frames per second. It uses FP16 precision by default and 3D sine-cosine (sincos) positional encoding.[4]
- **CogVideoX-5B**: Released on August 27, 2024, with 5 billion parameters. This higher-capacity variant produces videos at the same resolution and frame rate but with improved visual quality and better text-video alignment. It uses BF16 precision by default and [3D Rotary Position Embedding](/wiki/rotary_position_embedding) (3D-RoPE).[5]

On September 19, 2024, the team released CogVideoX-5B-I2V, an image-to-video variant that takes an input image along with a text prompt to generate a video, providing greater controllability over the generation process.

### CogVideoX-1.5 (2024)

CogVideoX-1.5, released on November 8, 2024, brought substantial improvements in resolution, video length, and generation quality. The CogVideoX1.5-5B series supports video generation at 1360x768 resolution, 16 frames per second, and durations of 5 to 10 seconds.[6] Both text-to-video and image-to-video variants (CogVideoX1.5-5B and CogVideoX1.5-5B-I2V) were released simultaneously.

Key improvements in the 1.5 release include:

- Higher output resolution (1360x768, up from 720x480)
- Doubled frame rate (16 fps, up from 8 fps)
- Longer video duration (up to 10 seconds, up from 6 seconds)
- Improved text comprehension and instruction-following abilities through enhanced captioning with CogVLM2-Caption
- An automated filtering framework to eliminate video training data with poor dynamic connectivity
- Support for any-size aspect ratios in the I2V model

## How does CogVideoX work?

CogVideoX introduces three core innovations: a 3D Variational [Autoencoder](/wiki/autoencoder) for video compression, an Expert Transformer for multimodal fusion, and a progressive training strategy. Of the Expert Transformer, the paper states that "to improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities."[1]

### 3D Causal VAE

The 3D Causal Variational Autoencoder (VAE) compresses raw video data into a compact latent representation along both spatial and temporal dimensions. As the paper puts it, the authors "propose a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions, to improve both compression rate and video fidelity."[1] Unlike 2D VAEs used in image generation models, this 3D VAE processes entire video volumes using three-dimensional convolutions.

The VAE achieves a compression ratio of 4x8x8, meaning it compresses the temporal dimension by a factor of 4 and each spatial dimension by a factor of 8. This results in a total compression factor of 256x from pixel space to latent space, substantially reducing the sequence length that the transformer must process.
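The arithmetic can be checked with a short sketch. The function names are illustrative, and the `(frames - 1) // 4 + 1` temporal rule is an assumption (that the causal VAE encodes the first frame on its own and compresses the remaining frames in groups of four); the text above states only the nominal 4x8x8 ratio.

```python
# Nominal compression factors stated above: 4x in time, 8x in height and width.
TEMPORAL_FACTOR = 4
SPATIAL_FACTOR = 8


def nominal_compression() -> int:
    """Reduction in spatiotemporal positions implied by a 4x8x8 ratio."""
    return TEMPORAL_FACTOR * SPATIAL_FACTOR * SPATIAL_FACTOR


def latent_grid(frames: int, height: int, width: int) -> tuple:
    """Latent (frames, height, width) for a pixel-space video.

    Assumption: the first frame is encoded separately and the remaining
    frames are compressed in groups of TEMPORAL_FACTOR.
    """
    latent_frames = (frames - 1) // TEMPORAL_FACTOR + 1
    return (latent_frames, height // SPATIAL_FACTOR, width // SPATIAL_FACTOR)


if __name__ == "__main__":
    print(nominal_compression())      # 256
    print(latent_grid(49, 480, 720))  # (13, 60, 90)
```

Under these assumptions, a 49-frame 720x480 clip becomes a 13x60x90 latent grid before patchification.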

The architecture uses temporally causal convolutions, which ensure that predictions for each frame depend only on the current and preceding frames, not on future frames. This causal design preserves temporal coherence and maintains natural frame-to-frame continuity. The encoder and decoder each contain four symmetric stages with 2x downsampling and upsampling operations using [ResNet](/wiki/resnet) blocks.
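The causal property can be illustrated with a one-dimensional toy, where each "frame" is reduced to a single number. This is a minimal sketch rather than the model's actual layer: the function name is hypothetical, and zero padding at the start of the sequence is an assumption made for simplicity.

```python
def causal_temporal_conv(frames, kernel):
    """1D convolution over time in which output t sees only frames <= t.

    All padding is placed before the first frame, so no output position
    ever reads a later frame. Zero padding is an assumption of this sketch.
    """
    pad = len(kernel) - 1
    padded = [0.0] * pad + list(frames)
    out = []
    for t in range(len(frames)):
        window = padded[t : t + len(kernel)]  # covers frames t-pad .. t
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out


if __name__ == "__main__":
    kernel = [0.25, 0.25, 0.5]
    a = causal_temporal_conv([1.0, 2.0, 3.0, 4.0], kernel)
    b = causal_temporal_conv([1.0, 2.0, 3.0, 99.0], kernel)
    # Changing the last frame leaves every earlier output untouched.
    print(a[:3] == b[:3])  # True
```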

### Expert Transformer

The denoising backbone of CogVideoX is an Expert Transformer, which handles the fusion of text and video modalities. The design departs from conventional approaches that use separate [cross-attention](/wiki/cross_attention) modules to inject text information into the visual stream.

Instead, CogVideoX concatenates patchified video latent embeddings with text embeddings (encoded by a [T5](/wiki/t5) text encoder) along the sequence dimension. This combined sequence is then processed through transformer blocks that apply full 3D attention across both modalities simultaneously. By using this joint attention approach rather than separated spatial and temporal attention mechanisms, the model can directly model temporal relationships between frames while maintaining strong text-video alignment.

The "expert" designation refers to the Expert Adaptive LayerNorm (AdaLN) mechanism. Standard [layer normalization](/wiki/layer_normalization) applies the same normalization parameters to all tokens, but in CogVideoX, the adaptive LayerNorm applies different normalization parameters to text tokens and video tokens. This allows each modality to maintain its own scale and shift values, addressing the inherent distributional differences between textual and visual features. The adaptive parameters are conditioned on the diffusion timestep, enabling the model to adjust its behavior at different stages of the denoising process. This design facilitates deep cross-modal fusion while keeping the additional parameter count minimal.
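The idea of modality-specific modulation can be sketched in plain Python. This is an illustration under stated assumptions, not the released implementation: the function names are hypothetical, the `(1 + scale)` form follows a common DiT convention rather than anything stated above, and the `(scale, shift)` pairs are passed in directly, whereas in a real model they would be computed from the timestep embedding by separate text and video modulation layers.

```python
import math


def layer_norm(x, eps=1e-5):
    """Plain layer normalization of one token vector (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def expert_adaln(tokens, is_text, text_mod, video_mod):
    """Normalize every token, then modulate it with its own modality's
    (scale, shift) pair.

    Assumption: text_mod and video_mod stand in for timestep-conditioned
    outputs of two separate modulation layers.
    """
    out = []
    for token, text_flag in zip(tokens, is_text):
        scale, shift = text_mod if text_flag else video_mod
        out.append([n * (1.0 + scale) + shift for n in layer_norm(token)])
    return out


if __name__ == "__main__":
    tokens = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
    result = expert_adaln(tokens, [True, False], (0.0, 0.0), (1.0, 0.5))
    # Identical inputs come out differently depending on modality.
    print(result[0] != result[1])  # True
```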

### 3D Full Attention and Positional Encoding

CogVideoX employs a 3D full attention mechanism that operates jointly across space and time. This avoids the information loss that can occur when spatial and temporal attention are computed separately, as is done in many prior video generation models. The full attention mechanism enables direct interaction between all patches across all frames, leading to more consistent dynamics throughout the generated video.
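The difference between the two designs shows up in which token pairs are allowed to interact. The sketch below counts query-key pairs for full 3D attention and for a generic factorized scheme (spatial attention within each frame, then temporal attention at each spatial location). The factorized scheme is a simplified stand-in for prior designs in general, not a description of any particular model, and the function names are hypothetical.

```python
def full_attention_pairs(frames, height, width):
    """Every patch attends to every patch in every frame."""
    n = frames * height * width
    return n * n


def factorized_attention_pairs(frames, height, width):
    """Spatial attention inside each frame plus temporal attention
    across frames at each fixed spatial location."""
    per_frame = height * width
    spatial = frames * per_frame ** 2
    temporal = per_frame * frames ** 2
    return spatial + temporal


if __name__ == "__main__":
    # Toy 2x2x2 token grid.
    print(full_attention_pairs(2, 2, 2))        # 64
    print(factorized_attention_pairs(2, 2, 2))  # 48
```

The pairs missing from the factorized count are those between patches at different locations in different frames, which full attention connects directly at the cost of more computation.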

For positional encoding, the CogVideoX-2B variant uses 3D sine-cosine (sincos) positional embeddings, while the 5B and 1.5 variants use 3D Rotary Position [Embeddings](/wiki/embeddings) (3D-RoPE). The 3D-RoPE approach embeds the spatial coordinates (height, width) and the temporal coordinate (frame index) independently, allowing the model to generalize to different resolutions and video lengths.
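A minimal sketch of the rotary idea follows: ordinary 1D RoPE is applied to three channel groups, one per coordinate. The equal three-way channel split and the function names are assumptions for illustration; the released models may allocate channels differently.

```python
import math


def rope_1d(vec, position, base=10000.0):
    """Rotate consecutive channel pairs by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def rope_3d(vec, frame, row, col):
    """Apply 1D RoPE separately to three channel groups (time, height, width).

    Assumption: len(vec) is divisible by 6 and the groups are equal in size.
    """
    third = len(vec) // 3
    return (
        rope_1d(vec[:third], frame)
        + rope_1d(vec[third : 2 * third], row)
        + rope_1d(vec[2 * third :], col)
    )


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.7, 0.5, -0.4, 1.1]
    k = [1.0, 0.2, -0.6, 0.9, 0.4, -0.3]
    # Same relative offset (1, 1, 2) at two different absolute positions.
    near = dot(rope_3d(q, 1, 2, 3), rope_3d(k, 0, 1, 1))
    far = dot(rope_3d(q, 5, 6, 7), rope_3d(k, 4, 5, 5))
    print(math.isclose(near, far, abs_tol=1e-9))  # True
```

Because the attention score depends only on the relative offset along each axis, the same weights can be applied to grids of a different size.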

### Patchification

After the 3D VAE encodes the video into a latent representation, the latent volume is divided into spatiotemporal patches. This patchification step converts the latent tensor into a sequence of patch tokens suitable for transformer processing. The process is analogous to how [Vision Transformers](/wiki/vision_transformer) (ViT) patchify images, but extended to the temporal dimension.
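The resulting sequence length can be estimated with a small calculation. The sketch assumes a 2x2 spatial patch with no temporal merging, a 13x60x90 latent grid, and 226 text tokens; the patch size and latent grid are illustrative assumptions rather than figures given in this article, and the function names are hypothetical.

```python
def video_token_count(latent_frames, latent_height, latent_width,
                      patch_t=1, patch_h=2, patch_w=2):
    """Number of patch tokens obtained from a latent volume.

    Assumption: the default patch size (1, 2, 2) is illustrative.
    """
    return (
        (latent_frames // patch_t)
        * (latent_height // patch_h)
        * (latent_width // patch_w)
    )


def joint_sequence_length(latent_frames, latent_height, latent_width,
                          text_tokens):
    """Length of the concatenated text + video sequence seen by the transformer."""
    return text_tokens + video_token_count(
        latent_frames, latent_height, latent_width
    )


if __name__ == "__main__":
    print(video_token_count(13, 60, 90))           # 17550
    print(joint_sequence_length(13, 60, 90, 226))  # 17776
```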

## How was CogVideoX trained?

### Dataset

CogVideoX-5B was trained on approximately 35 million high-quality video clips, each averaging about six seconds in length. In addition, the training incorporated roughly 2 billion filtered images sourced from the [LAION-5B](/wiki/laion) and COYO-700M datasets. This mixed image-video training approach helps the model learn both spatial visual quality from still images and temporal dynamics from video clips.

### Data Processing Pipeline

The training data underwent a rigorous filtering pipeline. Automated classifiers identified and removed low-quality, redundant, or otherwise unsuitable content. Videos with poor motion connectivity, excessive editing artifacts, and other noise characteristics were excluded.

For video captioning, the team deployed a multi-stage pipeline:

1. Short captions were generated using the Panda70M captioning model
2. Dense frame-level descriptions were produced by CogVLM, an image understanding model
3. These descriptions were then aggregated and summarized by [large language models](/wiki/large_language_model) ([GPT-4](/wiki/gpt-4) or fine-tuned [LLaMA](/wiki/llama)-2) to produce detailed video-level captions

For CogVideoX-1.5, the captioning pipeline was further improved with CogVLM2-Caption, an end-to-end video understanding model that generates more accurate content descriptions.

### Progressive Training Strategy

CogVideoX uses a progressive training approach that proceeds through multiple stages:

1. **Low-resolution stage**: The model first trains on 256px-resolution video data to learn semantic patterns and low-frequency visual structures
2. **High-resolution stage**: The model is then fine-tuned on higher-resolution data (up to 1360x768) to capture finer visual details
3. **Quality refinement stage**: A final fine-tuning phase emphasizes high-frequency visual details and overall generation quality

According to the paper, "by employing a progressive training and multi-resolution frame pack technique, CogVideoX is adept at producing coherent, long-duration, different shape videos characterized by significant motions."[1] In other words, staged training combined with multi-resolution frame packing lets a single model handle varying durations and aspect ratios.

Training uses explicit uniform sampling of diffusion timesteps, which improves training stability.
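A minimal sketch of this sampling scheme follows. It assumes the timestep range is split into one equal interval per data-parallel rank, with each rank drawing uniformly from its own interval, so that every training step covers the whole range evenly. The function name and the zero-based timestep indexing are choices made for this illustration.

```python
import random


def explicit_uniform_timestep(rank, world_size, num_timesteps, rng):
    """Draw a timestep from the sub-interval owned by one data-parallel rank.

    Assumption: timesteps are indexed 0 .. num_timesteps - 1 and split
    into world_size contiguous intervals.
    """
    lo = rank * num_timesteps // world_size
    hi = (rank + 1) * num_timesteps // world_size
    return rng.randrange(lo, hi)


if __name__ == "__main__":
    rng = random.Random(0)
    world_size, num_timesteps = 4, 1000
    # One global step: each rank contributes a timestep from its own quarter.
    step = [
        explicit_uniform_timestep(r, world_size, num_timesteps, rng)
        for r in range(world_size)
    ]
    print(step)
```

Compared with every rank sampling independently from the full range, this stratified draw avoids steps in which most ranks happen to land on similar noise levels.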

## Model Variants

The following table summarizes the key specifications across all released CogVideoX model variants.[4][5][6]

| Model | Release Date | Parameters | Resolution | FPS | Duration | Positional Encoding | License |
|---|---|---|---|---|---|---|---|
| CogVideoX-2B | August 6, 2024 | 2B | 720x480 | 8 | 6 seconds | 3D Sincos | Apache 2.0 |
| CogVideoX-5B | August 27, 2024 | 5B | 720x480 | 8 | 6 seconds | 3D RoPE | CogVideoX License |
| CogVideoX-5B-I2V | September 19, 2024 | 5B | 720x480 | 8 | 6 seconds | 3D RoPE | CogVideoX License |
| CogVideoX1.5-5B | November 8, 2024 | 5B | 1360x768 | 16 | 5-10 seconds | 3D RoPE | CogVideoX License |
| CogVideoX1.5-5B-I2V | November 8, 2024 | 5B | 1360x768 | 16 | 5-10 seconds | 3D RoPE | CogVideoX License |

All models accept English-language prompts with a token limit of 224-226 tokens.[5]
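The frame counts behind the durations in the table can be estimated with a one-line formula. The sketch assumes clips contain `seconds * fps + 1` frames, a pattern chosen because it reproduces the 49-frame figure quoted earlier for 6-second clips at 8 fps; the function name is hypothetical.

```python
def frame_count(seconds, fps):
    """Frames in a clip, assuming one extra leading frame (seconds * fps + 1)."""
    return seconds * fps + 1


if __name__ == "__main__":
    print(frame_count(6, 8))    # 49
    print(frame_count(5, 16))   # 81
    print(frame_count(10, 16))  # 161
```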

## Is CogVideoX open source?

Yes. CogVideoX is released as open weights with publicly available code, but it uses a split licensing structure across model variants:[5]

- **CogVideoX-2B** (including the transformer and VAE modules) is released under the **Apache 2.0 License**, which permits both commercial and non-commercial use, subject to the license's attribution and notice requirements.
- **CogVideoX-5B and CogVideoX-1.5** variants are released under the **CogVideoX License**, a custom license that permits free use for academic research. Commercial use requires registration for a basic commercial license through Zhipu AI's open platform (open.bigmodel.cn). The free commercial license caps usage at 1 million service visits per month.
- The associated **code** in the GitHub repository is released under the Apache 2.0 License regardless of model variant.[3]

The paper states that "the model weight of both 3D Causal VAE, Video caption model and CogVideoX are publicly available," hosted on the project GitHub and the Hugging Face Hub.[1]

## What hardware does CogVideoX need?

CogVideoX models can run on consumer and professional GPUs with various optimization strategies.

| Configuration | CogVideoX-2B (FP16) | CogVideoX-5B (BF16) | CogVideoX1.5-5B (BF16) |
|---|---|---|---|
| Single GPU VRAM (with offloading) | From 4 GB | From 5 GB | From 10 GB |
| Single GPU VRAM (INT8 quantized) | From 3.6 GB | From 4.4 GB | From 7 GB |
| Multi-GPU (diffusers) | N/A | 15 GB | 24 GB |
| Inference speed (A100, 50 steps) | ~90 seconds | ~180 seconds | ~1000 seconds (5s video) |
| Inference speed (H100, 50 steps) | ~45 seconds | ~90 seconds | ~550 seconds (5s video) |

These VRAM figures assume that CPU offloading, VAE slicing, and VAE tiling optimizations are enabled.[5] Without these optimizations, peak VRAM consumption increases roughly threefold, but inference speed improves by 3-4x.

[Quantization](/wiki/quantization) through [PyTorch](/wiki/pytorch) AO and Optimum-quanto libraries further reduces memory requirements, enabling deployment on free-tier cloud GPUs and consumer hardware with as little as 4 GB of VRAM.[5]

## Community Adoption and Ecosystem

CogVideoX has developed a broad ecosystem of integrations and community tools since its release.

### Hugging Face Diffusers

CogVideoX is officially supported in the [Hugging Face](/wiki/hugging_face) [Diffusers](/wiki/hugging_face) library, providing a standardized inference and fine-tuning interface.[10] Model weights are available on the Hugging Face Hub under both the THUDM and zai-org organizations. The diffusers integration supports all CogVideoX variants and includes built-in support for memory optimization techniques.
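A minimal inference sketch with Diffusers might look like the following. It assumes `torch` and `diffusers` are installed and that a suitable GPU is available; the prompt, output path, step count, and guidance scale are illustrative values, and `generate_video` is a hypothetical helper name. The imports sit inside the function so that the module can be loaded without those packages.

```python
def generate_video(prompt, output_path="output.mp4",
                   model_id="zai-org/CogVideoX-5b"):
    """Generate one clip with the Diffusers CogVideoX pipeline (sketch)."""
    # Heavy dependencies are imported lazily.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    pipe = CogVideoXPipeline.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    )

    # Memory optimizations: CPU offloading, VAE slicing, and VAE tiling.
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    frames = pipe(
        prompt=prompt,
        num_frames=49,            # about 6 seconds at 8 fps
        num_inference_steps=50,
        guidance_scale=6.0,       # illustrative value
    ).frames[0]

    export_to_video(frames, output_path, fps=8)
    return output_path


if __name__ == "__main__":
    generate_video("A panda playing guitar in a bamboo forest")
```

Removing the three optimization calls trades higher peak VRAM for faster generation, as described in the hardware section.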

### SAT Framework

In addition to Diffusers, CogVideoX maintains native support for the SAT (SwissArmyTransformer) framework, which was the original inference and training framework used during development.[3] Tools are provided to convert between SAT and Diffusers weight formats.

### ComfyUI

The ComfyUI-CogVideoXWrapper integrates CogVideoX into the [ComfyUI](/wiki/comfyui) node-based workflow system, making it accessible to users who prefer a visual interface for video generation pipelines.

### Fine-Tuning

The official repository and Diffusers both provide [LoRA](/wiki/lora) fine-tuning scripts for CogVideoX. The CogVideoX team recommends approximately 4,000 training steps with around 100 training videos for optimal LoRA results. [Fine-tuning](/wiki/fine_tuning) CogVideoX-5B with LoRA is feasible on a single NVIDIA 4090 GPU through the cogvideox-factory framework.[3]

### Additional Integrations

Third-party support extends to ControlNet for guided generation, xDiT for distributed inference, VideoSys for system-level optimization, and various quantization backends for efficient deployment.

## How does CogVideoX compare to Sora and Kling?

The following table compares CogVideoX with other prominent video generation models as of late 2024.

| Feature | CogVideoX-5B / 1.5-5B | [HunyuanVideo](/wiki/hunyuan_video) | [Kling](/wiki/kling) | [Sora](/wiki/sora) |
|---|---|---|---|---|
| Developer | Zhipu AI / Tsinghua University | [Tencent](/wiki/tencent_ai) | [Kuaishou](/wiki/kuaishou) | [OpenAI](/wiki/openai) |
| Release Date | August-November 2024 | December 2024 | June 2024 | December 2024 |
| Parameters | 5B | 13B+ | Undisclosed | Undisclosed |
| Architecture | Diffusion Transformer + Expert AdaLN | Dual-stream to Single-stream DiT | DiT + 3D VAE | Diffusion Transformer |
| Max Resolution | 1360x768 (v1.5) | 1280x720 | Up to 1080p | Up to 1080p |
| Max Duration | 10 seconds (v1.5) | 5 seconds (open-source) | Up to 2 minutes | Up to 20 seconds |
| Frame Rate | 16 fps (v1.5) | 24 fps | Up to 30 fps | Variable |
| Open Source | Yes (weights + code) | Yes (weights + code) | No (hosted service only) | No (hosted service only) |
| License | Apache 2.0 (2B) / CogVideoX License (5B) | Tencent Hunyuan Community License | Proprietary | Proprietary |
| Text Encoder | T5 | MLLM (multimodal LLM) | Undisclosed | Undisclosed |
| Image-to-Video | Yes (I2V variants) | Yes | Yes | Yes |
| Fine-Tuning Support | Yes (LoRA, full) | Yes (LoRA) | No | No |

CogVideoX occupies a distinctive position in the video generation landscape. While it does not match the raw output quality or duration capabilities of proprietary systems like Sora or Kling, its fully open weights and code, combined with active fine-tuning support, make it one of the most accessible and customizable video generation frameworks available. HunyuanVideo offers a larger model with higher parameter count but was released later and uses a more restrictive license with geographic limitations.

## Evaluation and Benchmarks

Among open-source video generation models, the paper reports that "CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations."[1]

On the VBench benchmark suite, which evaluates video generation across dimensions including action fidelity, scene structure, motion dynamics, object consistency, and semantic alignment, CogVideoX-5B achieves the best performance in five out of seven metrics compared to models like VideoCrafter-2.0 and Open-Sora.[9]

CogVideoX-1.5 shows strong performance in dimensions related to complex prompt adherence (such as complex landscape and complex plot generation) and physics simulation. Human evaluation studies indicate that CogVideoX performs well across sensory quality, instruction following, physical realism, and prompt coverage. The model achieves high peak signal-to-noise ratio (PSNR) scores and low flicker scores, supporting strong temporal coherence.

In VBench-2.0 evaluations published in 2025, which introduced reasoning-focused evaluation dimensions, CogVideoX and other open-source models (including HunyuanVideo and Wan2.2) scored between 0.273 and 0.371 on reasoning tasks, compared to 0.546 for Sora 2, suggesting that reasoning and physical consistency remain areas for improvement across all open-source video generation models.

## What is Qingying, the commercial product?

Qingying (meaning "Clear Shadow" in Chinese, and marketed in English as "Ying") is Zhipu AI's consumer-facing video generation product built on CogVideoX technology. Launched on July 26, 2024, Qingying supports text-to-video and image-to-video generation through both a web interface and API access via the Zhipu Big Model Open Platform.[7] Unlike Sora, which remained inaccessible to the public months after its preview, Qingying was made available for free from launch day.[7]

The product initially generated 1440x960 resolution videos up to six seconds long, with Zhipu stating it could produce a clip in as little as 30 seconds.[7] The commercial Qingying system has since been upgraded beyond the open-source CogVideoX capabilities: powered by CogVideoX v1.5, the enhanced product supports 10-second, 4K-resolution videos at 60 frames per second with synchronized sound effects.[8]

## Limitations

Despite its strengths, CogVideoX has several known limitations:

- **[Prompt](/wiki/prompt) language**: All variants accept only English-language prompts, though other languages can be translated to English using external tools[5]
- **Prompt length**: The token limit of 224-226 tokens constrains the level of detail that can be specified in a single generation[5]
- **Human generation quality**: The model shows weaker performance in human-centric dimensions such as human fidelity and motion rationality compared to its performance on scenes and objects
- **Physical reasoning**: Like all current open-source video models, CogVideoX struggles with complex physical reasoning and causal consistency
- **Generation speed**: Producing a single video takes from about 45 seconds to over 16 minutes depending on the model variant and hardware, making real-time generation infeasible

## See Also

- [Diffusion Models](/wiki/diffusion_models)
- [Sora](/wiki/sora)
- [Stable Video Diffusion](/wiki/stable_diffusion)
- [Runway](/wiki/runway_ml)
- [Kling](/wiki/kling)
- [HunyuanVideo](/wiki/hunyuan_video)
- [Video Generation](/wiki/ai_video_generation)

## References

1. Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., Yin, D., Gu, X., Zhang, Y., Wang, W., Cheng, Y., Liu, T., Xu, B., Dong, Y., & Tang, J. (2024). "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer." *arXiv:2408.06072*. Accepted at ICLR 2025. https://arxiv.org/abs/2408.06072
2. Hong, W., Ding, M., Zheng, W., Liu, X., & Tang, J. (2022). "CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers." *arXiv:2205.15868*. Published at ICLR 2023. https://arxiv.org/abs/2205.15868
3. zai-org/CogVideo (formerly THUDM/CogVideo) GitHub Repository. https://github.com/zai-org/CogVideo
4. CogVideoX-2B Model Card, Hugging Face. https://huggingface.co/zai-org/CogVideoX-2b
5. CogVideoX-5B Model Card, Hugging Face. https://huggingface.co/zai-org/CogVideoX-5b
6. CogVideoX1.5-5B Model Card, Hugging Face. https://huggingface.co/zai-org/CogVideoX1.5-5B
7. "Zhipu Launches AI-Powered Video Generator in Bid to Rival OpenAI's Sora." Caixin Global, July 27, 2024. https://www.caixinglobal.com/2024-07-27/zhipu-launches-ai-powered-video-generator-in-bid-to-rival-openais-sora-102220649.html
8. CogVideoX-1.5 Release Notes. https://news.aibase.com/news/13100
9. Huang, Z., et al. (2024). "VBench: Comprehensive Benchmark Suite for Video Generative Models." *CVPR 2024*.
10. Hugging Face Diffusers Documentation: CogVideoX. https://huggingface.co/docs/diffusers/api/pipelines/cogvideox

