HunyuanVideo is an open-source video generation model developed by Tencent. Released on December 3, 2024, it was the largest open-source video generation model at the time of launch, with over 13 billion parameters. HunyuanVideo generates high-quality videos from text prompts and, through subsequent releases, from images as well. It uses a Diffusion Transformer (DiT) architecture with a novel "Dual-Stream to Single-Stream" design, a Multimodal Large Language Model (MLLM) as its text encoder, and a 3D Variational Autoencoder (VAE) for spatiotemporal compression.
Tencent released the model weights and inference code under the Tencent Hunyuan Community License Agreement. In professional human evaluations conducted with 1,533 text prompts and more than 60 evaluators, HunyuanVideo outperformed several leading models including Runway Gen-3 Alpha and Luma 1.6, particularly in motion quality. A follow-up release, HunyuanVideo 1.5, arrived in November 2025 with a smaller 8.3 billion parameter model that runs on consumer-grade GPUs.
HunyuanVideo was developed by the Tencent Hunyuan Foundation Model Team. The accompanying research paper, titled "HunyuanVideo: A Systematic Framework For Large Video Generative Models," was published on arXiv (2412.03603) on December 3, 2024, the same day the inference code and model weights were made publicly available on GitHub and Hugging Face.
The project grew out of Tencent's broader Hunyuan AI initiative, which includes large language models and image generation systems. HunyuanVideo was positioned as an answer to closed-source video generation tools such as OpenAI's Sora and Kuaishou's Kling, offering comparable quality in an open-weight package that researchers and developers could run locally.
Several milestones followed the initial release:
| Date | Release | Description |
|---|---|---|
| December 3, 2024 | HunyuanVideo (T2V) | Original 13B parameter text-to-video model, 720p output, 129 frames at 24 fps |
| March 6, 2025 | HunyuanVideo-I2V | 13B parameter image-to-video variant with token replace technique for reference image injection |
| May 9, 2025 | HunyuanCustom | Multimodal-driven framework for customized video generation supporting image, audio, video, and text conditions |
| May 28, 2025 | HunyuanVideo-Avatar | Audio-driven human animation model for generating speech-synchronized digital human videos |
| November 21, 2025 | HunyuanVideo 1.5 | Lightweight 8.3B parameter model with SSTA attention, super-resolution to 1080p, and consumer GPU support |
HunyuanVideo is built on three main components: a Diffusion Transformer backbone, an MLLM-based text encoder, and a 3D causal VAE. The system operates in a compressed latent space, where Gaussian noise is progressively denoised conditioned on text (or image) inputs, and the resulting latent representation is decoded back into pixel-space video.
The core of HunyuanVideo is a transformer-based diffusion model that processes video and text tokens through a hybrid architecture. This design was referred to by the team as "Dual-Stream to Single-Stream."
In the dual-stream phase, video tokens and text tokens pass through separate transformer blocks independently. Each modality learns its own modulation mechanisms (such as adaptive layer normalization) without interference from the other. This separation allows the model to develop strong representations for both visual content and language semantics before they interact.
In the single-stream phase, video and text tokens are concatenated into a single sequence and processed jointly through additional transformer blocks using full attention. This stage enables deep multimodal fusion, allowing the model to align generated visual content with the text description.
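The two phases can be illustrated with a toy attention computation. This is a schematic in plain NumPy, not the actual implementation, which also involves adaptive layer normalization, rotary position embeddings, and feed-forward sublayers; all dimensions and names here are illustrative.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention with a numerically stable softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8                                    # toy hidden dimension
video_tokens = rng.standard_normal((16, d))   # 16 video tokens
text_tokens = rng.standard_normal((4, d))     # 4 text tokens

# Dual-stream phase: each modality attends only within itself,
# so video and text develop independent representations.
video_out = attention(video_tokens, video_tokens, video_tokens)
text_out = attention(text_tokens, text_tokens, text_tokens)

# Single-stream phase: concatenate both modalities into one sequence
# and apply full attention jointly for multimodal fusion.
joint = np.concatenate([video_out, text_out], axis=0)  # (20, d)
joint_out = attention(joint, joint, joint)
```

The key structural point is the concatenation step: after it, every video token can attend to every text token and vice versa, which is where text-to-video alignment happens.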
The architecture hyperparameters for the 13B foundation model are:
| Parameter | Value |
|---|---|
| Dual-stream blocks | 20 |
| Single-stream blocks | 40 |
| Hidden dimension | 3,072 |
| FFN dimension | 12,288 |
| Attention heads | 24 |
| Head dimension | 128 |
| Positional embedding channels (dt, dh, dw) | 16, 56, 56 |
The model uses Flow Matching as its training objective rather than the more traditional DDPM (Denoising Diffusion Probabilistic Model) approach. In flow matching, the network learns to predict the velocity field that transports samples between a simple noise distribution and the target data distribution. This formulation has been shown to produce more stable training dynamics and higher-quality outputs compared to standard noise prediction.
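A minimal sketch of how flow-matching training targets are constructed, using the common straight-line (rectified-flow) convention. Variable names and the interpolation direction are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_targets(x0, t):
    """Straight-line path between data x0 and noise x1.
    The network's regression target is the constant velocity x1 - x0
    that transports samples from data to noise along the path."""
    x1 = rng.standard_normal(x0.shape)   # sample from the noise distribution
    xt = (1.0 - t) * x0 + t * x1         # point on the interpolation path at time t
    v_target = x1 - x0                   # velocity field value along this path
    return xt, v_target

x0 = rng.standard_normal((2, 4))         # toy "latent video" batch
t = 0.3
xt, v = flow_matching_targets(x0, t)
# The training loss would be the mean squared error between
# model(xt, t, text_embedding) and v.
```

Compared with DDPM-style noise prediction, the regression target here is a simple difference of endpoints, which is one reason the training dynamics tend to be more stable.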
Rather than relying on CLIP or T5 alone as text encoders, HunyuanVideo uses a pre-trained Multimodal Large Language Model (MLLM) with a decoder-only architecture. The publicly released version uses llava-llama-3-8b-v1_1 (provided by Xtuner) as the text encoder, though Tencent has indicated that a proprietary HunyuanMLLM was used internally.
The MLLM offers several advantages over traditional text encoders. Compared to CLIP, it provides superior image detail description and complex reasoning capabilities. Compared to T5, its visual instruction fine-tuning gives it better image-text alignment.
However, the MLLM is based on causal (autoregressive) attention, while diffusion models tend to benefit from bidirectional text representations. To address this gap, HunyuanVideo introduces an extra bidirectional token refiner that post-processes the MLLM's output features, producing enhanced text embeddings that better guide the diffusion process.
A secondary text encoder, OpenAI's clip-vit-large-patch14, is also used in the pipeline alongside the MLLM.
In addition, a prompt rewrite model fine-tuned from Hunyuan-Large rewrites user prompts into more detailed descriptions before they are fed to the text encoder. This approach improves generation quality by expanding sparse user inputs into rich, descriptive text.
HunyuanVideo compresses pixel-space videos into a compact latent space using a 3D Variational Autoencoder (VAE) built with CausalConv3D layers. The compression ratios are:
| Dimension | Compression Ratio |
|---|---|
| Temporal (video length) | 4x |
| Spatial (height and width) | 8x |
| Channel | 16 latent channels |
The causal convolution design ensures temporal causality, meaning each frame's latent representation depends only on current and previous frames, never future ones. This property is important for maintaining coherent motion and enabling autoregressive-style generation patterns.
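The causal padding idea can be sketched with a 1-D temporal convolution in NumPy. Replicate padding on the past side is one common choice for causal convolutions and is assumed here for illustration; the real model uses CausalConv3D layers over space and time.

```python
import numpy as np

def causal_temporal_conv(x, kernel):
    """1-D causal convolution along the time axis: the output at frame t
    depends only on frames <= t. Causality is enforced by padding on the
    past side only, repeating the first frame (kernel_size - 1) times."""
    k = len(kernel)
    pad = np.repeat(x[:1], k - 1, axis=0)      # pad the past, never the future
    xp = np.concatenate([pad, x], axis=0)
    w = np.array(kernel)[:, None]
    return np.array([(xp[t:t + k] * w).sum(axis=0) for t in range(len(x))])

x = np.arange(5, dtype=float)[:, None]         # 5 frames, 1 feature channel
y = causal_temporal_conv(x, [0.25, 0.25, 0.5])
```

Because no future frames enter any output, the first frame can be encoded entirely on its own, which is what lets the same VAE treat a still image as a one-frame video.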
By compressing along all three dimensions simultaneously (rather than separately handling spatial and temporal compression), the 3D VAE can capture joint spatiotemporal patterns. This compression significantly reduces the number of tokens the diffusion transformer must process, making it possible to train on high-resolution video at the original frame rate.
The VAE handles both images (treated as single-frame videos) and multi-frame videos, allowing the same architecture to support unified image and video generation.
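Putting the compression ratios together, the latent shape for a given clip can be computed as follows. The `1 + (T - 1) // 4` temporal formula assumes the first frame is encoded separately, the usual convention for causal 3D VAEs, and is consistent with the model's 129-frame (4 × 32 + 1) clip length.

```python
def latent_shape(frames, height, width, t_ratio=4, s_ratio=8, channels=16):
    """Latent tensor shape produced by the 3D causal VAE: 4x temporal and
    8x spatial compression into 16 latent channels. A video of T frames
    maps to 1 + (T - 1) // t_ratio latent frames (first frame is causal)."""
    lat_frames = 1 + (frames - 1) // t_ratio
    return (channels, lat_frames, height // s_ratio, width // s_ratio)

print(latent_shape(129, 720, 1280))  # -> (16, 33, 90, 160)
print(latent_shape(1, 720, 1280))    # a still image: (16, 1, 90, 160)
```

The 33 × 90 × 160 latent grid is what the diffusion transformer actually attends over, which is why the compression ratios directly determine training and inference cost.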
HunyuanVideo was pre-trained on internet-scale images and videos, processed through a multi-stage data curation and filtering pipeline. After all filtering stages, approximately 800 million high-quality video segments remained for pre-training.
The model was trained using a progressive multi-stage approach that gradually increased resolution, video length, and frame rate.
This progressive approach allows the model to first learn basic visual concepts at low resolution before tackling the harder problem of high-resolution video with complex motion. The strategy also improves training efficiency, since early stages process far fewer tokens per sample.
HunyuanVideo 1.5 was released on November 21, 2025, as a lighter and more efficient successor to the original model. The accompanying technical report (arXiv: 2511.18870) detailed a number of architectural improvements designed to make high-quality video generation accessible on consumer hardware.
The most significant change was the reduction in model size from 13 billion to 8.3 billion parameters. Despite the smaller size, HunyuanVideo 1.5 achieved state-of-the-art visual quality and motion coherence through several architectural innovations.
The 3D causal VAE was also updated, with spatial compression increased to 16x (up from 8x in version 1.0) while maintaining 4x temporal compression. The latent channel dimension was set to 32.
The headline architectural innovation of HunyuanVideo 1.5 is Selective and Sliding Tile Attention (SSTA), a sparse attention mechanism designed to address the high computational cost of full attention over long video sequences. SSTA partitions the video latent into 3D tiles and combines sliding-window locality with selective pruning of redundant tiles.
By dynamically pruning redundant spatiotemporal tokens, SSTA achieved an end-to-end speedup of 1.87x for 10-second 720p video synthesis compared to FlashAttention-3, without a meaningful loss in output quality.
HunyuanVideo 1.5 includes a dedicated video super-resolution network that upscales outputs from the base resolution (480p to 720p) to 1080p. This network follows the same 8.3B Diffusion Transformer architecture as the main model and operates in latent space. Low-resolution latents are injected using channel concatenation, and a separate latent upsample block spatially aligns low-resolution and high-resolution latents before the final VAE decoding step.
The super-resolution network was trained on 1 million high-quality video clips. It not only increases resolution but also corrects distortions and refines details in the base output.
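The channel-concatenation injection described above can be sketched as follows. Nearest-neighbor upsampling stands in for the actual latent upsample block, and all shapes and the function name are illustrative assumptions.

```python
import numpy as np

def inject_low_res(lowres_latent, scale=2, seed=0):
    """Schematic of channel-concatenation conditioning: spatially upsample
    the low-resolution latent (nearest neighbor here for simplicity) and
    concatenate it with the noisy high-resolution latent along the
    channel axis, doubling the channel count the DiT sees."""
    up = lowres_latent.repeat(scale, axis=2).repeat(scale, axis=3)
    noise = np.random.default_rng(seed).standard_normal(up.shape)
    return np.concatenate([noise, up], axis=0)

# Toy latent: 32 channels, 9 latent frames, 30 x 40 spatial grid
x = inject_low_res(np.zeros((32, 9, 30, 40)))
print(x.shape)  # (64, 9, 60, 80)
```

Because the condition is carried in extra input channels rather than via cross-attention, the super-resolution network can reuse the same 8.3B DiT backbone with only a widened input projection.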
HunyuanVideo 1.5 introduced a three-phase post-training pipeline.
One of the goals of HunyuanVideo 1.5 was to run on consumer-grade GPUs. Peak memory usage was reported at 13.6 GB for 720p video with 121 frames, making it feasible to run on GPUs like the NVIDIA RTX 4090. With GGUF quantization (available in Q8, Q6, and Q4 variants), the model can run on GPUs with as little as 8 GB of VRAM through ComfyUI, though quality degrades noticeably at Q3 and below.
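A back-of-the-envelope estimate of the weight memory implied by the listed quantization levels. This covers model weights only; actual peak usage also depends on activations, the VAE, text encoders, and the offloading strategy, so these figures are rough bounds rather than measured values.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory footprint of the model weights alone,
    in binary gigabytes (GiB)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

# HunyuanVideo 1.5 has 8.3B parameters
for name, bits in [("bf16", 16), ("Q8", 8), ("Q6", 6), ("Q4", 4)]:
    print(f"{name}: {weight_memory_gb(8.3, bits):.1f} GB")
```

The arithmetic makes clear why Q4 quantization (roughly 3.9 GB of weights) brings the model within reach of 8 GB consumer GPUs, while full-precision weights alone already exceed that budget.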
The primary capability of HunyuanVideo is text-to-video generation. Given a text prompt, the model generates video clips at up to 720p resolution (1280x720 or 720x1280 depending on aspect ratio) with 129 frames at 24 fps, yielding roughly 5 seconds of video. Multiple aspect ratios are supported, including 16:9, 9:16, 4:3, 3:4, and 1:1.
The prompt rewrite system, powered by a fine-tuned Hunyuan-Large model, automatically expands brief user prompts into detailed descriptions. This substantially improves generation quality for casual users who may not write highly detailed prompts.
HunyuanVideo-I2V, released in March 2025, extends the framework to accept a reference image as input alongside text. The model uses a token replace technique to inject reference image information into the generation process, preserving the visual style, color palette, and character identity of the source image throughout the generated video.
The I2V variant also supports LoRA training for customizable special effects, lip synchronization with 10 speech styles, and preset dance routine templates.
Because the 3D VAE treats images as single-frame videos, HunyuanVideo can generate both still images and videos from the same architecture. This unified approach simplifies the pipeline and allows knowledge transfer between image and video generation tasks during training.
Tencent conducted a professional human evaluation using 1,533 text prompts. More than 60 professional evaluators assessed generated videos across three criteria: Text Alignment, Motion Quality, and Visual Quality. To ensure fairness, inference was conducted only once per prompt with no cherry-picking of results.
| Model | Text Alignment | Motion Quality | Visual Quality | Overall |
|---|---|---|---|---|
| HunyuanVideo | 61.8% | 66.5% | 95.7% | 41.3% |
| CNTopA (API) | 62.6% | 61.7% | 95.6% | 37.7% |
| CNTopB (Web) | 60.1% | 62.9% | 97.7% | 37.5% |
| Runway Gen-3 Alpha | 47.7% | 54.7% | 97.5% | 27.4% |
| Luma 1.6 | 57.6% | 44.2% | 94.1% | 24.8% |
HunyuanVideo achieved the highest overall score (41.3%) and the best motion quality score (66.5%) among all tested models. "CNTopA" and "CNTopB" refer to anonymized top-performing Chinese video generation models that were included in the comparison.
The following table compares HunyuanVideo with other notable video generation models released around the same period.
| Feature | HunyuanVideo (1.0) | HunyuanVideo 1.5 | Sora (OpenAI) | Kling (Kuaishou) | CogVideoX-5B (Zhipu AI) |
|---|---|---|---|---|---|
| Release Date | December 2024 | November 2025 | December 2024 | June 2024 | August 2024 |
| Parameters | 13B | 8.3B | Undisclosed | Undisclosed | 5B |
| Open Source | Yes (Tencent Hunyuan Community License) | Yes (Tencent Hunyuan Community License) | No | No | Yes (Apache 2.0) |
| Architecture | Dual-Stream/Single-Stream DiT | Dual-Stream/Single-Stream DiT with SSTA | Diffusion Transformer | Diffusion Transformer (DiT) | Expert Transformer (DiT) |
| Max Resolution | 720p (native) | 1080p (with super-resolution) | 1080p | 1080p | 768x1360 |
| Max Duration | ~5 seconds (129 frames at 24 fps) | 5 to 10 seconds | Up to 20 seconds | Up to 2 minutes | Up to 10 seconds |
| Text Encoder | MLLM (decoder-only) + CLIP | MLLM + CLIP | Undisclosed | Undisclosed | T5 with Expert LayerNorm |
| Image-to-Video | Yes (separate I2V model) | Yes (unified) | Yes | Yes | Yes |
| Consumer GPU Support | Limited (60GB+ VRAM) | Yes (8GB+ with GGUF quantization) | No (cloud only) | No (cloud only) | Yes (8-12GB VRAM) |
HunyuanVideo is released under the Tencent Hunyuan Community License Agreement, dated December 3, 2024. This is not a standard open-source license. Key terms include a monthly-active-user threshold (services exceeding 100 million monthly active users must obtain a separate license from Tencent) and a territorial restriction that excludes use in the European Union, the United Kingdom, and South Korea.
Tencent retains intellectual property rights over the original HunyuanVideo works, while users own their derivative works and modifications as long as they comply with the license terms. An Acceptable Use Policy (included as an exhibit to the license) outlines prohibited uses.
ComfyUI, the popular node-based interface for diffusion model workflows, added official native support for HunyuanVideo starting with version 0.3.8. The integration allows users to build text-to-video and image-to-video workflows using ComfyUI's visual node editor.
Multiple integration pathways exist, including the native ComfyUI nodes, community wrapper nodes such as ComfyUI-HunyuanVideoWrapper, and GGUF-quantized checkpoints loaded through ComfyUI-GGUF.
For HunyuanVideo 1.5, Tencent released an official ComfyUI plugin (comfyui_hunyuanvideo_1.5_plugin) with both simplified and complete node sets, along with built-in automatic model download support.
HunyuanVideo supports LoRA (Low-Rank Adaptation) fine-tuning, allowing users to customize the model for specific styles, characters, or effects without retraining the full model. LoRA support was added on December 20, 2024, shortly after the initial release.
The training pipeline supports distributed training, Fully Sharded Data Parallel (FSDP), context parallelism, and gradient checkpointing. The recommended optimizer for LoRA fine-tuning is Muon. Third-party tools like finetrainers and various community repositories also provide training scripts compatible with HunyuanVideo.
On platforms like Civitai, a growing library of community-created LoRA adapters is available for HunyuanVideo, covering character styles, animation effects, and camera movements.
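The core LoRA idea, a frozen base weight plus a trainable low-rank update, can be sketched in a few lines. Dimensions here are toy values (real adapters target the 3,072-wide attention and FFN projections of the DiT blocks), and the `alpha / rank` scaling is the common LoRA convention.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 8                   # toy dimensions
alpha = 16.0                                    # LoRA scaling hyperparameter

W = rng.standard_normal((d_out, d_in))          # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, rank))                     # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus low-rank update, scaled by alpha / rank.
    # Only A and B (rank * (d_in + d_out) parameters) are trained.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# Zero-initializing B means the adapter starts as an exact identity
# on top of the base model, so fine-tuning begins from its behavior.
assert np.allclose(lora_forward(x), W @ x)
```

With rank 8 on a 3,072-wide projection, the adapter adds about 49K parameters per matrix versus 9.4M in the frozen weight, which is why LoRA files are small enough to share on community hubs.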
HunyuanVideo is integrated into the Hugging Face Diffusers library, making it accessible through a standard Python API. This integration simplifies model loading, inference, and pipeline customization for developers already familiar with the Diffusers ecosystem.
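A usage sketch of the Diffusers integration, based on the `HunyuanVideoPipeline` API documented in recent Diffusers releases. The checkpoint ID and generation parameters below are illustrative and may differ from the current recommended settings; this requires a CUDA GPU and substantial VRAM even with the memory savers enabled.

```python
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

model_id = "hunyuanvideo-community/HunyuanVideo"  # community-converted checkpoint

# Load the 13B transformer in bfloat16 and assemble the pipeline.
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)

# Memory savers: tiled VAE decoding and CPU offloading of idle components.
pipe.vae.enable_tiling()
pipe.enable_model_cpu_offload()

frames = pipe(
    prompt="A cat walks on the grass, realistic style.",
    height=320,
    width=512,
    num_frames=61,          # 4k + 1 frames, matching the causal VAE design
    num_inference_steps=30,
).frames[0]
export_to_video(frames, "output.mp4", fps=15)
```

Reducing the resolution and frame count, as shown, is the simplest way to trade output quality for lower VRAM use on a single consumer GPU.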
The community has developed several approaches to reduce VRAM requirements, including FP8 weight quantization (official FP8 weights were released by Tencent), GGUF quantization, CPU offloading of pipeline components, and tiled VAE decoding.
Tencent has built several specialized models on top of the HunyuanVideo foundation:
Released on May 28, 2025, HunyuanVideo-Avatar is an audio-driven human animation model. Given a single character image and an audio clip, it generates video of the character speaking with lip synchronization and emotional expression. It includes a character image injection module for identity consistency, an Audio Emotion Module for transferring emotional cues, and a Face-Aware Audio Adapter for handling multi-character dialogue scenes. The model supports photorealistic, cartoon, 3D-rendered, and anthropomorphic character styles.
Released on May 9, 2025, HunyuanCustom extends the video generation framework with multi-modal conditioning. It accepts image, audio, video, and text inputs simultaneously, with an emphasis on subject consistency across generated frames. The model uses a text-image fusion module based on LLaVA and an image ID enhancement module that reinforces identity features across frames through temporal concatenation.
Like all current video generation models, HunyuanVideo has several known limitations, including the short maximum clip length (roughly five seconds for the original model), substantial hardware requirements for the 13B variant, and occasional temporal artifacts in scenes with complex motion.