HunyuanVideo is an open-source video generation model developed by Tencent. Released on December 3, 2024, it was the largest open-source video generation model at the time of launch, with over 13 billion parameters. HunyuanVideo generates high-quality videos from text prompts and, through subsequent releases, from images as well. It uses a Diffusion Transformer (DiT) architecture with a novel "Dual-Stream to Single-Stream" design, a Multimodal Large Language Model (MLLM) as its text encoder, and a 3D Variational Autoencoder (VAE) for spatiotemporal compression.
Tencent released the model weights and inference code under the Tencent Hunyuan Community License Agreement. In professional human evaluations conducted with 1,533 text prompts and more than 60 evaluators, HunyuanVideo outperformed several leading models including Runway Gen-3 Alpha and Luma 1.6, particularly in motion quality. A follow-up release, HunyuanVideo 1.5, arrived in November 2025 with a smaller 8.3 billion parameter model that runs on consumer-grade GPUs.
HunyuanVideo was developed by the Tencent Hunyuan Foundation Model Team. The accompanying research paper, titled "HunyuanVideo: A Systematic Framework For Large Video Generative Models," was published on arXiv (2412.03603) on December 3, 2024, the same day the inference code and model weights were made publicly available on GitHub and Hugging Face.
The project grew out of Tencent's broader Hunyuan AI initiative, which includes large language models and image generation systems. HunyuanVideo was positioned as an answer to closed-source video generation tools such as OpenAI's Sora and Kuaishou's Kling, offering comparable quality in an open-weight package that researchers and developers could run locally.
Several milestones followed the initial release:
| Date | Release | Description |
|---|---|---|
| December 3, 2024 | HunyuanVideo (T2V) | Original 13B parameter text-to-video model, 720p output, 129 frames at 24 fps |
| March 6, 2025 | HunyuanVideo-I2V | 13B parameter image-to-video variant with token replace technique for reference image injection |
| May 9, 2025 | HunyuanCustom | Multimodal-driven framework for customized video generation supporting image, audio, video, and text conditions |
| May 28, 2025 | HunyuanVideo-Avatar | Audio-driven human animation model for generating speech-synchronized digital human videos |
| November 21, 2025 | HunyuanVideo 1.5 | Lightweight 8.3B parameter model with SSTA attention, super-resolution to 1080p, and consumer GPU support |
HunyuanVideo is built on three main components: a Diffusion Transformer backbone, an MLLM-based text encoder, and a 3D causal VAE. The system operates in a compressed latent space, where Gaussian noise is progressively denoised conditioned on text (or image) inputs, and the resulting latent representation is decoded back into pixel-space video.
The core of HunyuanVideo is a transformer-based diffusion model that processes video and text tokens through a hybrid architecture. This design was referred to by the team as "Dual-Stream to Single-Stream."
In the dual-stream phase, video tokens and text tokens pass through separate transformer blocks independently. Each modality learns its own modulation mechanisms (such as adaptive layer normalization) without interference from the other. This separation allows the model to develop strong representations for both visual content and language semantics before they interact.
In the single-stream phase, video and text tokens are concatenated into a single sequence and processed jointly through additional transformer blocks using full attention. This stage enables deep multimodal fusion, allowing the model to align generated visual content with the text description.
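The two phases can be illustrated with a toy attention computation. This is a schematic in plain NumPy, not the actual implementation, which also involves adaptive layer normalization, rotary position embeddings, and feed-forward sublayers; all dimensions and names here are illustrative.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention with a numerically stable softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8                                    # toy hidden dimension
video_tokens = rng.standard_normal((16, d))   # 16 video tokens
text_tokens = rng.standard_normal((4, d))     # 4 text tokens

# Dual-stream phase: each modality attends only within itself,
# so video and text develop independent representations.
video_out = attention(video_tokens, video_tokens, video_tokens)
text_out = attention(text_tokens, text_tokens, text_tokens)

# Single-stream phase: concatenate both modalities into one sequence
# and apply full attention jointly for multimodal fusion.
joint = np.concatenate([video_out, text_out], axis=0)  # (20, d)
joint_out = attention(joint, joint, joint)
```

The key structural point is the concatenation step: after it, every video token can attend to every text token and vice versa, which is where text-to-video alignment happens.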
The architecture hyperparameters for the 13B foundation model are:
| Parameter | Value |
|---|---|
| Dual-stream blocks | 20 |
| Single-stream blocks | 40 |
| Hidden dimension | 3,072 |
| FFN dimension | 12,288 |
| Attention heads | 24 |
| Head dimension | 128 |
| Positional embedding channels (dt, dh, dw) | 16, 56, 56 |
The model uses Flow Matching as its training objective rather than the more traditional DDPM (Denoising Diffusion Probabilistic Model) approach. In flow matching, the network learns to predict the velocity field that transports samples between a simple noise distribution and the target data distribution. This formulation has been shown to produce more stable training dynamics and higher-quality outputs compared to standard noise prediction.
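A minimal sketch of how flow-matching training targets are constructed, using the common straight-line (rectified-flow) convention. Variable names and the interpolation direction are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_targets(x0, t):
    """Straight-line path between data x0 and noise x1.
    The network's regression target is the constant velocity x1 - x0
    that transports samples from data to noise along the path."""
    x1 = rng.standard_normal(x0.shape)   # sample from the noise distribution
    xt = (1.0 - t) * x0 + t * x1         # point on the interpolation path at time t
    v_target = x1 - x0                   # velocity field value along this path
    return xt, v_target

x0 = rng.standard_normal((2, 4))         # toy "latent video" batch
t = 0.3
xt, v = flow_matching_targets(x0, t)
# The training loss would be the mean squared error between
# model(xt, t, text_embedding) and v.
```

Compared with DDPM-style noise prediction, the regression target here is a simple difference of endpoints, which is one reason the training dynamics tend to be more stable.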
Rather than relying on CLIP or T5 alone as text encoders, HunyuanVideo uses a pre-trained Multimodal Large Language Model (MLLM) with a decoder-only architecture. The publicly released version uses llava-llama-3-8b-v1_1 (provided by Xtuner) as the text encoder, though Tencent has indicated that a proprietary HunyuanMLLM was used internally.
The MLLM offers several advantages over traditional text encoders. Compared to CLIP, it provides superior image detail description and complex reasoning capabilities. Compared to T5, its visual instruction fine-tuning gives it better image-text alignment.
However, the MLLM is based on causal (autoregressive) attention, while diffusion models tend to benefit from bidirectional text representations. To address this gap, HunyuanVideo introduces an extra bidirectional token refiner that post-processes the MLLM's output features, producing enhanced text embeddings that better guide the diffusion process.
A secondary text encoder, OpenAI's clip-vit-large-patch14, is also used in the pipeline alongside the MLLM.
In addition, a prompt rewrite model fine-tuned from Hunyuan-Large rewrites user prompts into more detailed descriptions before they are fed to the text encoder. This approach improves generation quality by expanding sparse user inputs into rich, descriptive text.
HunyuanVideo compresses pixel-space videos into a compact latent space using a 3D Variational Autoencoder (VAE) built with CausalConv3D layers. The compression ratios are:
| Dimension | Compression Ratio |
|---|---|
| Temporal (video length) | 4x |
| Spatial (height and width) | 8x |
| Channel | 16 latent channels |
The causal convolution design ensures temporal causality, meaning each frame's latent representation depends only on current and previous frames, never future ones. This property is important for maintaining coherent motion and enabling autoregressive-style generation patterns.
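The causal padding idea can be sketched with a 1-D temporal convolution in NumPy. Replicate padding on the past side is one common choice for causal convolutions and is assumed here for illustration; the real model uses CausalConv3D layers over space and time.

```python
import numpy as np

def causal_temporal_conv(x, kernel):
    """1-D causal convolution along the time axis: the output at frame t
    depends only on frames <= t. Causality is enforced by padding on the
    past side only, repeating the first frame (kernel_size - 1) times."""
    k = len(kernel)
    pad = np.repeat(x[:1], k - 1, axis=0)      # pad the past, never the future
    xp = np.concatenate([pad, x], axis=0)
    w = np.array(kernel)[:, None]
    return np.array([(xp[t:t + k] * w).sum(axis=0) for t in range(len(x))])

x = np.arange(5, dtype=float)[:, None]         # 5 frames, 1 feature channel
y = causal_temporal_conv(x, [0.25, 0.25, 0.5])
```

Because no future frames enter any output, the first frame can be encoded entirely on its own, which is what lets the same VAE treat a still image as a one-frame video.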
By compressing along all three dimensions simultaneously (rather than separately handling spatial and temporal compression), the 3D VAE can capture joint spatiotemporal patterns. This compression significantly reduces the number of tokens the diffusion transformer must process, making it possible to train on high-resolution video at the original frame rate.
The VAE handles both images (treated as single-frame videos) and multi-frame videos, allowing the same architecture to support unified image and video generation.
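Putting the compression ratios together, the latent shape for a given clip can be computed as follows. The `1 + (T - 1) // 4` temporal formula assumes the first frame is encoded separately, the usual convention for causal 3D VAEs, and is consistent with the model's 129-frame (4 × 32 + 1) clip length.

```python
def latent_shape(frames, height, width, t_ratio=4, s_ratio=8, channels=16):
    """Latent tensor shape produced by the 3D causal VAE: 4x temporal and
    8x spatial compression into 16 latent channels. A video of T frames
    maps to 1 + (T - 1) // t_ratio latent frames (first frame is causal)."""
    lat_frames = 1 + (frames - 1) // t_ratio
    return (channels, lat_frames, height // s_ratio, width // s_ratio)

print(latent_shape(129, 720, 1280))  # -> (16, 33, 90, 160)
print(latent_shape(1, 720, 1280))    # a still image: (16, 1, 90, 160)
```

The 33 × 90 × 160 latent grid is what the diffusion transformer actually attends over, which is why the compression ratios directly determine training and inference cost.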
HunyuanVideo was pre-trained on internet-scale images and videos, processed through a multi-stage data curation and filtering pipeline. After all filtering stages, approximately 800 million high-quality video segments remained for pre-training.
The model was trained using a progressive multi-stage approach that gradually increased resolution, video length, and frame rate.
This progressive approach allows the model to first learn basic visual concepts at low resolution before tackling the harder problem of high-resolution video with complex motion. The strategy also improves training efficiency, since early stages process far fewer tokens per sample.
HunyuanVideo 1.5 was released on November 21, 2025, as a lighter and more efficient successor to the original model. The accompanying technical report (arXiv: 2511.18870) detailed a number of architectural improvements designed to make high-quality video generation accessible on consumer hardware.
The most significant change was the reduction in model size from 13 billion to 8.3 billion parameters. Despite the smaller size, HunyuanVideo 1.5 achieved state-of-the-art visual quality and motion coherence through several architectural innovations.
The 3D causal VAE was also updated, with spatial compression increased to 16x (up from 8x in version 1.0) while maintaining 4x temporal compression. The latent channel dimension was set to 32.
The headline architectural innovation of HunyuanVideo 1.5 is Selective and Sliding Tile Attention (SSTA), a sparse attention mechanism designed to address the high computational cost of full attention over long video sequences. SSTA partitions the video latent into 3D tiles and combines sliding-window locality with selective pruning of redundant tiles.
By dynamically pruning redundant spatiotemporal tokens, SSTA achieved an end-to-end speedup of 1.87x for 10-second 720p video synthesis compared to FlashAttention-3, without a meaningful loss in output quality.
HunyuanVideo 1.5 includes a dedicated video super-resolution network that upscales outputs from the base resolution (480p to 720p) to 1080p. This network follows the same 8.3B Diffusion Transformer architecture as the main model and operates in latent space. Low-resolution latents are injected using channel concatenation, and a separate latent upsample block spatially aligns low-resolution and high-resolution latents before the final VAE decoding step.
The super-resolution network was trained on 1 million high-quality video clips. It not only increases resolution but also corrects distortions and refines details in the base output.
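The channel-concatenation injection described above can be sketched as follows. Nearest-neighbor upsampling stands in for the actual latent upsample block, and all shapes and the function name are illustrative assumptions.

```python
import numpy as np

def inject_low_res(lowres_latent, scale=2, seed=0):
    """Schematic of channel-concatenation conditioning: spatially upsample
    the low-resolution latent (nearest neighbor here for simplicity) and
    concatenate it with the noisy high-resolution latent along the
    channel axis, doubling the channel count the DiT sees."""
    up = lowres_latent.repeat(scale, axis=2).repeat(scale, axis=3)
    noise = np.random.default_rng(seed).standard_normal(up.shape)
    return np.concatenate([noise, up], axis=0)

# Toy latent: 32 channels, 9 latent frames, 30 x 40 spatial grid
x = inject_low_res(np.zeros((32, 9, 30, 40)))
print(x.shape)  # (64, 9, 60, 80)
```

Because the condition is carried in extra input channels rather than via cross-attention, the super-resolution network can reuse the same 8.3B DiT backbone with only a widened input projection.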
HunyuanVideo 1.5 introduced a three-phase post-training pipeline.
One of the goals of HunyuanVideo 1.5 was to run on consumer-grade GPUs. Peak memory usage was reported at 13.6 GB for 720p video with 121 frames, making it feasible to run on GPUs like the NVIDIA RTX 4090. With GGUF quantization (available in Q8, Q6, and Q4 variants), the model can run on GPUs with as little as 8 GB of VRAM through ComfyUI, though quality degrades noticeably at Q3 and below.
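A back-of-the-envelope estimate of the weight memory implied by the listed quantization levels. This covers model weights only; actual peak usage also depends on activations, the VAE, text encoders, and the offloading strategy, so these figures are rough bounds rather than measured values.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory footprint of the model weights alone,
    in binary gigabytes (GiB)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

# HunyuanVideo 1.5 has 8.3B parameters
for name, bits in [("bf16", 16), ("Q8", 8), ("Q6", 6), ("Q4", 4)]:
    print(f"{name}: {weight_memory_gb(8.3, bits):.1f} GB")
```

The arithmetic makes clear why Q4 quantization (roughly 3.9 GB of weights) brings the model within reach of 8 GB consumer GPUs, while full-precision weights alone already exceed that budget.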
The primary capability of HunyuanVideo is text-to-video generation. Given a text prompt, the model generates video clips at up to 720p resolution (1280x720 or 720x1280 depending on aspect ratio) with 129 frames at 24 fps, yielding roughly 5 seconds of video. Multiple aspect ratios are supported, including 16:9, 9:16, 4:3, 3:4, and 1:1.
The prompt rewrite system, powered by a fine-tuned Hunyuan-Large model, automatically expands brief user prompts into detailed descriptions. This substantially improves generation quality for casual users who may not write highly detailed prompts.
HunyuanVideo-I2V, released in March 2025, extends the framework to accept a reference image as input alongside text. The model uses a token replace technique to inject reference image information into the generation process, preserving the visual style, color palette, and character identity of the source image throughout the generated video.
The I2V variant also supports LoRA training for customizable special effects, lip synchronization with 10 speech styles, and preset dance routine templates.
Because the 3D VAE treats images as single-frame videos, HunyuanVideo can generate both still images and videos from the same architecture. This unified approach simplifies the pipeline and allows knowledge transfer between image and video generation tasks during training.
Tencent conducted a professional human evaluation using 1,533 text prompts. More than 60 professional evaluators assessed generated videos across three criteria: Text Alignment, Motion Quality, and Visual Quality. To ensure fairness, inference was conducted only once per prompt with no cherry-picking of results.
| Model | Text Alignment | Motion Quality | Visual Quality | Overall |
|---|---|---|---|---|
| HunyuanVideo | 61.8% | 66.5% | 95.7% | 41.3% |
| CNTopA (API) | 62.6% | 61.7% | 95.6% | 37.7% |
| CNTopB (Web) | 60.1% | 62.9% | 97.7% | 37.5% |
| Runway Gen-3 Alpha | 47.7% | 54.7% | 97.5% | 27.4% |
| Luma 1.6 | 57.6% | 44.2% | 94.1% | 24.8% |
HunyuanVideo achieved the highest overall score (41.3%) and the best motion quality score (66.5%) among all tested models. "CNTopA" and "CNTopB" refer to anonymized top-performing Chinese video generation models that were included in the comparison.
The following table compares HunyuanVideo with other notable video generation models released around the same period.
| Feature | HunyuanVideo (1.0) | HunyuanVideo 1.5 | Sora (OpenAI) | Kling (Kuaishou) | CogVideoX-5B (Zhipu AI) |
|---|---|---|---|---|---|
| Release Date | December 2024 | November 2025 | December 2024 | June 2024 | August 2024 |
| Parameters | 13B | 8.3B | Undisclosed | Undisclosed | 5B |
| Open Source | Yes (Tencent Hunyuan Community License) | Yes (Tencent Hunyuan Community License) | No | No | Yes (Apache 2.0) |
| Architecture | Dual-Stream/Single-Stream DiT | Dual-Stream/Single-Stream DiT with SSTA | Diffusion Transformer | Diffusion Transformer (DiT) | Expert Transformer (DiT) |
| Max Resolution | 720p (native) | 1080p (with super-resolution) | 1080p | 1080p | 768x1360 |
| Max Duration | ~5 seconds (129 frames at 24 fps) | 5 to 10 seconds | Up to 20 seconds | Up to 2 minutes | Up to 10 seconds |
| Text Encoder | MLLM (decoder-only) + CLIP | MLLM + CLIP | Undisclosed | Undisclosed | T5 with Expert LayerNorm |
| Image-to-Video | Yes (separate I2V model) | Yes (unified) | Yes | Yes | Yes |
| Consumer GPU Support | Limited (60GB+ VRAM) | Yes (8GB+ with GGUF quantization) | No (cloud only) | No (cloud only) | Yes (8-12GB VRAM) |
HunyuanVideo is released under the Tencent Hunyuan Community License Agreement, dated December 3, 2024. This is not a standard open-source license. Key terms include a monthly-active-user threshold (services exceeding 100 million monthly active users must obtain a separate license from Tencent) and a territorial restriction that excludes use in the European Union, the United Kingdom, and South Korea.
Tencent retains intellectual property rights over the original HunyuanVideo works, while users own their derivative works and modifications as long as they comply with the license terms. An Acceptable Use Policy (included as an exhibit to the license) outlines prohibited uses.
ComfyUI, the popular node-based interface for diffusion model workflows, added official native support for HunyuanVideo starting with version 0.3.8. The integration allows users to build text-to-video and image-to-video workflows using ComfyUI's visual node editor.
Multiple integration pathways exist, including the native ComfyUI nodes, community wrapper nodes such as ComfyUI-HunyuanVideoWrapper, and GGUF-quantized checkpoints loaded through ComfyUI-GGUF.
For HunyuanVideo 1.5, Tencent released an official ComfyUI plugin (comfyui_hunyuanvideo_1.5_plugin) with both simplified and complete node sets, along with built-in automatic model download support.
HunyuanVideo supports LoRA (Low-Rank Adaptation) fine-tuning, allowing users to customize the model for specific styles, characters, or effects without retraining the full model. LoRA support was added on December 20, 2024, shortly after the initial release.
The training pipeline supports distributed training, Fully Sharded Data Parallel (FSDP), context parallelism, and gradient checkpointing. The recommended optimizer for LoRA fine-tuning is Muon. Third-party tools like finetrainers and various community repositories also provide training scripts compatible with HunyuanVideo.
On platforms like Civitai, a growing library of community-created LoRA adapters is available for HunyuanVideo, covering character styles, animation effects, and camera movements.
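The core LoRA idea, a frozen base weight plus a trainable low-rank update, can be sketched in a few lines. Dimensions here are toy values (real adapters target the 3,072-wide attention and FFN projections of the DiT blocks), and the `alpha / rank` scaling is the common LoRA convention.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 8                   # toy dimensions
alpha = 16.0                                    # LoRA scaling hyperparameter

W = rng.standard_normal((d_out, d_in))          # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, rank))                     # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus low-rank update, scaled by alpha / rank.
    # Only A and B (rank * (d_in + d_out) parameters) are trained.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# Zero-initializing B means the adapter starts as an exact identity
# on top of the base model, so fine-tuning begins from its behavior.
assert np.allclose(lora_forward(x), W @ x)
```

With rank 8 on a 3,072-wide projection, the adapter adds about 49K parameters per matrix versus 9.4M in the frozen weight, which is why LoRA files are small enough to share on community hubs.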
HunyuanVideo is integrated into the Hugging Face Diffusers library, making it accessible through a standard Python API. This integration simplifies model loading, inference, and pipeline customization for developers already familiar with the Diffusers ecosystem.
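A usage sketch of the Diffusers integration, based on the `HunyuanVideoPipeline` API documented in recent Diffusers releases. The checkpoint ID and generation parameters below are illustrative and may differ from the current recommended settings; this requires a CUDA GPU and substantial VRAM even with the memory savers enabled.

```python
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

model_id = "hunyuanvideo-community/HunyuanVideo"  # community-converted checkpoint

# Load the 13B transformer in bfloat16 and assemble the pipeline.
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)

# Memory savers: tiled VAE decoding and CPU offloading of idle components.
pipe.vae.enable_tiling()
pipe.enable_model_cpu_offload()

frames = pipe(
    prompt="A cat walks on the grass, realistic style.",
    height=320,
    width=512,
    num_frames=61,          # 4k + 1 frames, matching the causal VAE design
    num_inference_steps=30,
).frames[0]
export_to_video(frames, "output.mp4", fps=15)
```

Reducing the resolution and frame count, as shown, is the simplest way to trade output quality for lower VRAM use on a single consumer GPU.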
The community has developed several approaches to reduce VRAM requirements, including FP8 weight quantization (official FP8 weights were released by Tencent), GGUF quantization, CPU offloading of pipeline components, and tiled VAE decoding.
Tencent has built several specialized models on top of the HunyuanVideo foundation:
Released on May 28, 2025, HunyuanVideo-Avatar is an audio-driven human animation model. Given a single character image and an audio clip, it generates video of the character speaking with lip synchronization and emotional expression. It includes a character image injection module for identity consistency, an Audio Emotion Module for transferring emotional cues, and a Face-Aware Audio Adapter for handling multi-character dialogue scenes. The model supports photorealistic, cartoon, 3D-rendered, and anthropomorphic character styles.
Released on May 9, 2025, HunyuanCustom extends the video generation framework with multi-modal conditioning. It accepts image, audio, video, and text inputs simultaneously, with an emphasis on subject consistency across generated frames. The model uses a text-image fusion module based on LLaVA and an image ID enhancement module that reinforces identity features across frames through temporal concatenation.
Like all current video generation models, HunyuanVideo has several known limitations, including the short maximum clip length (roughly five seconds for the original model), substantial hardware requirements for the 13B variant, and occasional temporal artifacts in scenes with complex motion.