HunyuanVideo
Last reviewed
May 17, 2026
Sources
16 citations
Review status
Source-backed
Revision
v4 · 5,876 words
HunyuanVideo is an open-source video generation model developed by Tencent. Released on December 3, 2024, it was the largest open-source video generation model at the time of launch, with over 13 billion parameters. HunyuanVideo generates high-quality videos from text prompts and, through subsequent releases, from images, audio, and reference videos as well. It uses a Diffusion Transformer (DiT) architecture with a novel "Dual-Stream to Single-Stream" design, a Multimodal Large Language Model (MLLM) as its text encoder, and a 3D Variational Autoencoder (VAE) for spatiotemporal compression.
Tencent released the model weights and inference code under the Tencent Hunyuan Community License Agreement. In professional human evaluations conducted with 1,533 text prompts and more than 60 evaluators, HunyuanVideo outperformed several leading models including Runway Gen-3 Alpha and Luma 1.6, particularly in motion quality. A follow-up release, HunyuanVideo 1.5, arrived in November 2025 with a smaller 8.3 billion parameter model that runs on consumer-grade GPUs while delivering further gains in instruction following and motion clarity. Together with sibling models such as HunyuanCustom, HunyuanVideo-Avatar, and HunyuanVideo-Foley, the HunyuanVideo family has grown into one of the most complete open-source video generation ecosystems available as of 2026.
HunyuanVideo was developed by the Tencent Hunyuan Foundation Model Team. The accompanying research paper, titled "HunyuanVideo: A Systematic Framework For Large Video Generative Models," was published on arXiv (2412.03603) on December 3, 2024, the same day the inference code and model weights were made publicly available on GitHub and Hugging Face.
The project grew out of Tencent's broader Hunyuan AI initiative, which includes large language models, image generation systems, and the 3D generation line. HunyuanVideo was positioned as an answer to closed-source video generation tools such as OpenAI's Sora and Kuaishou's Kling, offering comparable quality in an open-weight package that researchers and developers could run locally. The release marked a strategic decision by Tencent to publish a frontier-scale video model rather than keep it as an internal product, mirroring the company's approach to its text and image foundation models.
Several milestones followed the initial release:
| Date | Release | Description |
|---|---|---|
| December 3, 2024 | HunyuanVideo (T2V) | Original 13B parameter text-to-video model, 720p output, 129 frames at 24 fps |
| December 20, 2024 | LoRA fine-tuning support | Official LoRA training pipeline and example scripts released for community customization |
| March 6, 2025 | HunyuanVideo-I2V | 13B parameter image-to-video variant with token replace technique for reference image injection |
| May 9, 2025 | HunyuanCustom | Multimodal-driven framework for customized video generation supporting image, audio, video, and text conditions |
| May 28, 2025 | HunyuanVideo-Avatar | Audio-driven human animation model for generating speech-synchronized digital human videos |
| August 2025 | HunyuanVideo-Foley | Text-Video-to-Audio (TV2A) model that automatically generates sound effects synchronized to existing video |
| September 29, 2025 | HunyuanVideo-Foley-XL | Higher-capacity Foley model with offload inference for reduced VRAM use |
| November 21, 2025 | HunyuanVideo 1.5 | Lightweight 8.3B parameter model with Selective and Sliding Tile Attention (SSTA), super-resolution to 1080p, and consumer GPU support |
| November 27, 2025 | Cache inference support | TeaCache, TaylorCache, and DeepCache integrations land for HunyuanVideo 1.5 |
| December 2025 | HunyuanVideo-1.5 I2V step-distilled | 8 or 12-step image-to-video distillation generating clips in roughly 75 seconds on an RTX 4090 |
The cadence of releases through 2025 transformed HunyuanVideo from a single text-to-video checkpoint into a layered ecosystem of foundation, conditional, and audio models, all sharing the same underlying DiT architecture and VAE compression scheme.
HunyuanVideo is built on three main components: a Diffusion Transformer backbone, an MLLM-based text encoder, and a 3D causal VAE. The system operates in a compressed latent space, where Gaussian noise is progressively denoised conditioned on text (or image) inputs, and the resulting latent representation is decoded back into pixel-space video. The same architectural blueprint also underpins the diffusion model variants in HunyuanVideo-I2V, HunyuanCustom, and HunyuanVideo-Avatar, which extend the base transformer with additional conditioning modules rather than replacing it.
The core of HunyuanVideo is a transformer-based diffusion model that processes video and text tokens through a hybrid architecture, which the team calls "Dual-Stream to Single-Stream."
In the dual-stream phase, video tokens and text tokens pass through separate transformer blocks independently. Each modality learns its own modulation mechanisms (such as adaptive layer normalization) without interference from the other. This separation allows the model to develop strong representations for both visual content and language semantics before they interact.
In the single-stream phase, video and text tokens are concatenated into a single sequence and processed jointly through additional transformer blocks using full attention. This stage enables deep multimodal fusion, allowing the model to align generated visual content with the text description.
The architecture hyperparameters for the 13B foundation model are:
| Parameter | Value |
|---|---|
| Dual-stream blocks | 20 |
| Single-stream blocks | 40 |
| Hidden dimension | 3,072 |
| FFN dimension | 12,288 |
| Attention heads | 24 |
| Head dimension | 128 |
| Positional embedding channels (dt, dh, dw) | 16, 56, 56 |
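The figures in the table are mutually consistent, which a few lines of Python can confirm. This is only an arithmetic check; the dictionary keys are illustrative names and are not taken from the official code.

```python
# Values copied from the hyperparameter table for the 13B model.
# Key names are illustrative, not identifiers from the official repository.
config = {
    "dual_stream_blocks": 20,
    "single_stream_blocks": 40,
    "hidden_dim": 3072,
    "ffn_dim": 12288,
    "attention_heads": 24,
    "head_dim": 128,
    "rope_channels": (16, 56, 56),  # (dt, dh, dw)
}

# Attention heads times head width should fill the hidden dimension.
heads_times_head_dim = config["attention_heads"] * config["head_dim"]

# The per-axis positional channels should add up to one head's width.
rope_total = sum(config["rope_channels"])

# Feed-forward expansion factor and total transformer depth.
ffn_ratio = config["ffn_dim"] / config["hidden_dim"]
total_blocks = config["dual_stream_blocks"] + config["single_stream_blocks"]

print(heads_times_head_dim, rope_total, ffn_ratio, total_blocks)
# 3072 128 4.0 60
```

The time axis receives 16 positional channels and each spatial axis receives 56, so the three axes together cover the full 128-wide attention head.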
The model uses Flow Matching as its training objective rather than the more traditional DDPM (Denoising Diffusion Probabilistic Model) approach. In flow matching, the network learns to predict the velocity field that transports samples from a simple noise distribution to the target data distribution. This formulation is generally reported to give more stable training dynamics and higher-quality outputs than standard noise prediction. The Tencent team also reported that flow matching converges faster in the long-sequence regime imposed by video latents, where every reverse (denoising) iteration is expensive.
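A minimal sketch of the idea, assuming the common straight-line (rectified-flow style) path between noise and data; the article does not specify the exact interpolation schedule, so treat the path and the function names as illustrative. The training target is the velocity of that path, and sampling integrates the predicted velocity from noise to data.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

random.seed(0)
data = [0.5, -1.0, 2.0]                           # stand-in for a video latent
noise = [random.gauss(0.0, 1.0) for _ in data]    # sample from the noise prior

# A perfect "network" for this single pair just returns the true velocity.
oracle = lambda x, t: target_velocity(noise, data)
sample = euler_sample(noise, oracle, steps=30)
midpoint = interpolate(noise, data, 0.5)
```

In real training the network sees `interpolate(noise, data, t)` for random `t` and is regressed onto `target_velocity(noise, data)`; the oracle above only shows that following the correct velocity field carries noise onto the data sample.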
Rather than relying on CLIP or T5 alone as text encoders, HunyuanVideo uses a pre-trained Multimodal Large Language Model (MLLM) with a decoder-only architecture. The publicly released version uses llava-llama-3-8b-v1_1 (provided by Xtuner) as the text encoder, though Tencent has indicated that a proprietary HunyuanMLLM was used internally.
The MLLM offers several advantages over traditional text encoders. Compared to CLIP, it provides superior image detail description and complex reasoning capabilities. Compared to T5, its visual instruction fine-tuning gives it better image-text alignment.
However, the MLLM is based on causal (autoregressive) attention, while diffusion models tend to benefit from bidirectional text representations. To address this gap, HunyuanVideo introduces an extra bidirectional token refiner that post-processes the MLLM's output features, producing enhanced text embeddings that better guide the diffusion process.
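The difference between the two attention patterns can be illustrated with plain boolean masks. This sketch shows only the visibility pattern, not the refiner's actual layers, and the helper names are hypothetical.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every token may attend to every other token."""
    return [[True] * n for _ in range(n)]

def visible_counts(mask):
    """How many tokens each position can see."""
    return [sum(row) for row in mask]

n = 5
print(visible_counts(causal_mask(n)))         # [1, 2, 3, 4, 5]
print(visible_counts(bidirectional_mask(n)))  # [5, 5, 5, 5, 5]
```

Under the causal pattern the first prompt tokens are encoded without any knowledge of what follows, which is the gap a bidirectional refinement pass is meant to close.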
A secondary text encoder, OpenAI's clip-vit-large-patch14, is also used in the pipeline alongside the MLLM. The CLIP encoder contributes a pooled global representation that supplements the token-level features extracted from the MLLM, giving the diffusion transformer both fine-grained semantic anchors and a coarse overall prompt summary.
In addition, a prompt rewrite model fine-tuned from Hunyuan-Large rewrites user prompts into more detailed descriptions before they are fed to the text encoder. This approach improves generation quality by expanding sparse user inputs into rich, descriptive text. The rewrite system is also where Tencent integrates bilingual handling, automatically converting Chinese prompts into English equivalents augmented with cinematic terminology before encoding.
HunyuanVideo compresses pixel-space videos into a compact latent space using a 3D Variational Autoencoder (VAE) built with CausalConv3D layers. The compression ratios are:
| Dimension | Compression ratio |
|---|---|
| Temporal (video length) | 4x |
| Spatial (height and width) | 8x |
| Channel | 16 latent channels |
The causal convolution design ensures temporal causality, meaning each frame's latent representation depends only on current and previous frames, never future ones. This property is important for maintaining coherent motion and enabling autoregressive-style generation patterns.
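Temporal causality can be demonstrated with a one-dimensional toy version of a causal convolution. This is a sketch over scalar "frames", not the CausalConv3D layer itself, and the choice of replicating the first frame as padding is an assumption.

```python
def causal_conv1d(frames, kernel):
    """Temporal convolution with all padding placed before the first frame,
    so output t depends only on frames t-(k-1) .. t."""
    k = len(kernel)
    padded = [frames[0]] * (k - 1) + list(frames)   # pad the past only
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(frames))
    ]

frames = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [0.25, 0.25, 0.5]

out = causal_conv1d(frames, kernel)

# Change only the final frame and recompute.
edited = frames[:4] + [100.0]
out_edited = causal_conv1d(edited, kernel)

print(out[:4] == out_edited[:4])   # True: earlier outputs ignore the future frame
print(out[4] == out_edited[4])     # False: only the last output changes
```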
By compressing along all three dimensions simultaneously (rather than separately handling spatial and temporal compression), the 3D VAE can capture joint spatiotemporal patterns. This compression significantly reduces the number of tokens the diffusion transformer must process, making it possible to train on high-resolution video at the original frame rate.
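To make the reduction concrete, the sketch below computes the latent shape for a 129-frame 720p clip using the ratios from the table. Two details are assumptions rather than statements from this article: the first frame is encoded on its own (the usual causal VAE convention, giving (frames - 1) / 4 + 1 latent frames), and the transformer patchifies latents with a 1x2x2 patch.

```python
def latent_shape(frames, height, width, t_ratio=4, s_ratio=8, channels=16):
    """Latent (channels, T, H, W), assuming the first frame is kept on its own
    and the remaining frames are compressed t_ratio at a time."""
    if (frames - 1) % t_ratio or height % s_ratio or width % s_ratio:
        raise ValueError("size is not aligned with the compression ratios")
    return (channels, (frames - 1) // t_ratio + 1, height // s_ratio, width // s_ratio)

def token_count(shape, patch=(1, 2, 2)):
    """Transformer tokens after patchifying; the 1x2x2 patch is an assumption."""
    _, t, h, w = shape
    return (t // patch[0]) * (h // patch[1]) * (w // patch[2])

video = latent_shape(129, 720, 1280)
image = latent_shape(1, 720, 1280)    # an image is a single-frame video

print(video)               # (16, 33, 90, 160)
print(image)               # (16, 1, 90, 160)
print(token_count(video))  # 118800
```

Under these assumptions a five-second 720p clip becomes roughly 119 thousand tokens, compared with about 119 million pixel positions in the raw video.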
The VAE handles both images (treated as single-frame videos) and multi-frame videos, allowing the same architecture to support unified image and video generation. This shared codec is also why HunyuanVideo-I2V, HunyuanCustom, and HunyuanVideo-Avatar can interchange latents and reference images without retraining the VAE for each downstream task.
HunyuanVideo was pre-trained on internet-scale images and videos, processed through a multi-stage data curation pipeline. The filtering process included:
After all filtering stages, approximately 800 million high-quality video segments remained for pre-training. The dataset spans a deliberately broad mix of human action, natural scenery, animal behavior, sports, animation, and product footage so that the model can match a wide variety of prompts encountered downstream.
The model was trained using a progressive multi-stage approach that gradually increased resolution, video length, and frame rate:
This progressive approach allows the model to first learn basic visual concepts at low resolution before tackling the harder problem of high-resolution video with complex motion. The strategy also improves training efficiency, since early stages process far fewer tokens per sample.
Training was performed on a large GPU cluster using Tencent's internal distributed framework, with combinations of Fully Sharded Data Parallel (FSDP), context parallelism for long sequences, and gradient checkpointing to manage activation memory. The same scaffolding was later reused for the post-training stages of HunyuanVideo 1.5.
HunyuanVideo 1.5 was released on November 21, 2025, as a lighter and more efficient successor to the original model. The accompanying technical report (arXiv: 2511.18870) detailed a number of architectural improvements designed to make high-quality video generation accessible on consumer hardware while keeping pace with proprietary competitors. As of early 2026 it is widely regarded as the open-source state of the art for instruction following and natural motion at the sub-10B parameter scale.
The most significant change was the reduction in model size from 13 billion to 8.3 billion parameters. Despite the smaller size, the technical report claims state-of-the-art visual quality and motion coherence, which it attributes to several architectural innovations.
The 3D causal VAE was also updated, with spatial compression increased to 16x (up from 8x in version 1.0) while maintaining 4x temporal compression. The latent channel dimension was set to 32. The denser latent representation lets a smaller transformer span the same field of view, which is one of the main reasons HunyuanVideo 1.5 can run on a single 24 GB consumer GPU.
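The effect of doubling the spatial compression can be estimated with simple arithmetic. The sketch assumes the first frame is encoded separately, so a clip of N frames yields (N - 1) / 4 + 1 latent frames; the function name is illustrative.

```python
def latent_positions(frames, height, width, t_ratio, s_ratio):
    """Number of spatiotemporal latent positions, assuming the first frame
    is encoded on its own and the rest are grouped t_ratio at a time."""
    t = (frames - 1) // t_ratio + 1
    return t * (height // s_ratio) * (width // s_ratio)

# Same 121-frame 720p clip under the two VAE configurations.
v10 = latent_positions(121, 720, 1280, t_ratio=4, s_ratio=8)    # version 1.0
v15 = latent_positions(121, 720, 1280, t_ratio=4, s_ratio=16)   # version 1.5

print(v10, v15, v10 / v15)        # 446400 111600 4.0
print(v10 * 16, v15 * 32)         # latent values: 7142400 3571200
```

Each 1.5 latent position carries twice as many channels (32 versus 16), but there are four times fewer positions, so the sequence seen by the transformer shrinks by 4x and the total latent volume by 2x.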
Other notable updates include:
The headline architectural innovation of HunyuanVideo 1.5 is Selective and Sliding Tile Attention (SSTA), a mechanism designed to address the high computational cost of full attention over long video sequences. SSTA operates through four steps:
By dynamically pruning redundant spatiotemporal tokens, SSTA achieved an end-to-end speedup of 1.87x for 10-second 720p video synthesis compared to FlashAttention-3, without a meaningful loss in output quality. SSTA is implemented as a drop-in replacement for the dense attention path used in HunyuanVideo 1.0, so existing training and inference utilities require only minor modifications to take advantage of it.
HunyuanVideo 1.5 includes a dedicated video super-resolution network that upscales outputs from the base resolution (480p to 720p) to 1080p. This network follows the same 8.3B Diffusion Transformer architecture as the main model and operates in latent space. Low-resolution latents are injected using channel concatenation, and a separate latent upsample block spatially aligns low-resolution and high-resolution latents before the final VAE decoding step.
The super-resolution network was trained on 1 million high-quality video clips. It not only increases resolution but also corrects distortions and refines details in the base output, effectively serving as a "polish" stage that can be skipped to save compute when 1080p is not required.
HunyuanVideo 1.5 introduced a three-phase post-training pipeline:
The RLHF stage uses a reward model trained on pairwise comparisons of generated clips. This step is credited with much of the visible improvement in instruction following relative to the 13B foundation model, particularly on multi-step or compositional prompts.
One of the goals of HunyuanVideo 1.5 was to run on consumer-grade GPUs. Peak memory usage was reported at 13.6 GB for 720p video with 121 frames, making it feasible to run on GPUs like the NVIDIA RTX 4090. With GGUF quantization (available in Q8, Q6, and Q4 variants), the model can run on GPUs with as little as 8 GB of VRAM through ComfyUI, though quality degrades noticeably at Q3 and below. Community benchmarks place 720p T2V generation at roughly 60 to 120 seconds per clip on an RTX 4090 with the step-distilled checkpoint, and 5 to 8 minutes on the full sampling schedule.
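A back-of-envelope estimate shows why lower-bit quantization brings the transformer within reach of small GPUs. The calculation covers weights only and uses nominal bit widths; real GGUF files store additional per-block scale data, and activations, the VAE, and the text encoders add to the total.

```python
PARAMS = 8.3e9  # parameter count quoted for HunyuanVideo 1.5

def weight_gb(bits_per_weight, params=PARAMS):
    """Weights-only footprint in GB (10**9 bytes). Ignores activations,
    the VAE, text encoders, and quantization scale overhead."""
    return params * bits_per_weight / 8 / 1e9

nominal_bits = {"bf16": 16, "8-bit": 8, "6-bit": 6, "4-bit": 4}
estimates = {name: weight_gb(bits) for name, bits in nominal_bits.items()}

for name, gb in estimates.items():
    print(f"{name:>6}: {gb:.1f} GB")
```

At a nominal 4 bits the transformer weights come to a little over 4 GB, which leaves headroom on an 8 GB card once the other components are offloaded or loaded in sequence.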
The primary capability of HunyuanVideo is text-to-video generation. Given a text prompt, the model generates video clips at up to 720p resolution (1280x720 or 720x1280 depending on aspect ratio) with 129 frames at 24 fps, yielding roughly 5 seconds of video. Multiple aspect ratios are supported, including 16:9, 9:16, 4:3, 3:4, and 1:1. HunyuanVideo 1.5 extends the maximum duration to 10 seconds and supports outputs from 480p directly up to 1080p when the super-resolution stage is enabled.
The prompt rewrite system, powered by a fine-tuned Hunyuan-Large model, automatically expands brief user prompts into detailed descriptions. This substantially improves generation quality for casual users who may not write highly detailed prompts. The pipeline accepts both English and Chinese natural language and routes through a normalized intermediate prompt format.
HunyuanVideo-I2V, released in March 2025, extends the framework to accept a reference image as input alongside text. The model uses a token replace technique to inject reference image information into the generation process, preserving the visual style, color palette, and character identity of the source image throughout the generated video.
The I2V variant also supports LoRA training for customizable special effects, lip synchronization with 10 speech styles, and preset dance routine templates. A step-distilled I2V variant released in December 2025 generates clips in 8 or 12 steps and reduces end-to-end generation time by approximately 75 percent, bringing a clip down to around 75 seconds on a single RTX 4090.
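The two figures quoted for the step-distilled model imply a baseline, which can be worked out directly. This is arithmetic on the numbers above, not a measured benchmark.

```python
distilled_seconds = 75   # quoted time per clip on an RTX 4090
reduction = 0.75         # quoted reduction in end-to-end time

# If 75 s is what remains after removing 75 percent, the baseline is 75 / 0.25.
baseline_seconds = distilled_seconds / (1 - reduction)
speedup = baseline_seconds / distilled_seconds

print(baseline_seconds, speedup)   # 300.0 4.0
```

The implied baseline of about five minutes sits at the lower end of the 5 to 8 minute range quoted for the full sampling schedule.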
In practice, image-to-video has become one of the most common workflows for HunyuanVideo, because creators can first generate a still image with tools such as Flux or HunyuanImage and then use HunyuanVideo-I2V to animate it. The reference image is encoded by the same MLLM-CLIP pipeline as text and threaded into both streams of the diffusion transformer.
Because the 3D VAE treats images as single-frame videos, HunyuanVideo can generate both still images and videos from the same architecture. This unified approach simplifies the pipeline and allows knowledge transfer between image and video generation tasks during training. Internally, the diffusion transformer uses a per-batch flag to indicate whether a sample is an image or a video, and the same flow matching loss is applied in both cases.
With the release of HunyuanCustom in May 2025, the family extended beyond text and a single reference image to accept image, audio, video, and text conditions concurrently. This enables workflows such as inserting a specific person, animating them speaking a recorded audio clip, and constraining the camera trajectory by referencing an existing video, all in a single inference. The same multimodal toolkit underlies many of the editing and dubbing pipelines built on top of HunyuanVideo through 2025 and 2026.
Tencent conducted a professional human evaluation using 1,533 text prompts. More than 60 professional evaluators assessed generated videos across three criteria: Text Alignment, Motion Quality, and Visual Quality. To ensure fairness, inference was conducted only once per prompt with no cherry-picking of results.
| Model | Text Alignment | Motion Quality | Visual Quality | Overall |
|---|---|---|---|---|
| HunyuanVideo | 61.8% | 66.5% | 95.7% | 41.3% |
| CNTopA (API) | 62.6% | 61.7% | 95.6% | 37.7% |
| CNTopB (Web) | 60.1% | 62.9% | 97.7% | 37.5% |
| Runway Gen-3 Alpha | 47.7% | 54.7% | 97.5% | 27.4% |
| Luma 1.6 | 57.6% | 44.2% | 94.1% | 24.8% |
HunyuanVideo achieved the highest overall score (41.3%) and the best motion quality score (66.5%) among all tested models. "CNTopA" and "CNTopB" refer to anonymized top-performing Chinese video generation models that were included in the comparison.
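The per-criterion leaders can be read off the table programmatically. The scores below are copied from the table; the variable names are illustrative.

```python
# (Text Alignment, Motion Quality, Visual Quality, Overall), in percent.
scores = {
    "HunyuanVideo":       (61.8, 66.5, 95.7, 41.3),
    "CNTopA (API)":       (62.6, 61.7, 95.6, 37.7),
    "CNTopB (Web)":       (60.1, 62.9, 97.7, 37.5),
    "Runway Gen-3 Alpha": (47.7, 54.7, 97.5, 27.4),
    "Luma 1.6":           (57.6, 44.2, 94.1, 24.8),
}
criteria = ("Text Alignment", "Motion Quality", "Visual Quality", "Overall")

# Best-scoring model for each criterion.
leaders = {
    criterion: max(scores, key=lambda model: scores[model][i])
    for i, criterion in enumerate(criteria)
}

for criterion, model in leaders.items():
    print(f"{criterion}: {model}")
```

HunyuanVideo leads on Motion Quality and Overall, while CNTopA is marginally ahead on Text Alignment and CNTopB on Visual Quality.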
Subsequent third-party evaluations on the public VBench benchmark have placed the HunyuanVideo family near the top of the open-source leaderboard, with HunyuanVideo 1.5 generally tied with or slightly ahead of Wan 2.2 on instruction following while trailing it modestly on photoreal human aesthetics. The same benchmarks show HunyuanVideo 1.5 ahead of CogVideoX-5B, LTX-Video 13B, and Open-Sora 2.0 on overall video quality at comparable parameter counts.
The following table compares HunyuanVideo with other notable video generation models released around the same period.
| Feature | HunyuanVideo (1.0) | HunyuanVideo 1.5 | Sora (OpenAI) | Kling (Kuaishou) | CogVideoX-5B (Zhipu AI) | Wan 2.2 (Alibaba) |
|---|---|---|---|---|---|---|
| Release Date | December 2024 | November 2025 | December 2024 | June 2024 | August 2024 | August 2025 |
| Parameters | 13B | 8.3B | Undisclosed | Undisclosed | 5B | 27B MoE (14B active) |
| Open Source | Yes (Tencent Hunyuan Community License) | Yes (Tencent Hunyuan Community License) | No | No | Yes (Apache 2.0) | Yes (Apache 2.0) |
| Architecture | Dual-Stream/Single-Stream DiT | Dual-Stream/Single-Stream DiT with SSTA | Diffusion Transformer | Diffusion Transformer (DiT) | Expert Transformer (DiT) | Mixture-of-Experts DiT |
| Max Resolution | 720p (native) | 1080p (with super-resolution) | 1080p | 1080p | 768x1360 | 1080p |
| Max Duration | ~5 seconds (129 frames at 24 fps) | 5 to 10 seconds | Up to 20 seconds | Up to 2 minutes | Up to 10 seconds | 5 to 10 seconds |
| Text Encoder | MLLM (decoder-only) + CLIP | MLLM + CLIP (glyph-aware) | Undisclosed | Undisclosed | T5 with Expert LayerNorm | Bilingual T5-style |
| Image-to-Video | Yes (separate I2V model) | Yes (unified) | Yes | Yes | Yes | Yes |
| Audio-driven Avatar | Yes (HunyuanVideo-Avatar) | Yes (HunyuanVideo-Avatar) | No (unconfirmed) | Partial | No | No |
| Consumer GPU Support | Limited (60GB+ VRAM) | Yes (8GB+ with GGUF quantization) | No (cloud only) | No (cloud only) | Yes (8-12GB VRAM) | Partial (24GB+ recommended) |
Alibaba's Wan family has emerged as the most direct open-source competitor to HunyuanVideo since 2025. Wan 2.2, released in August 2025, uses a Mixture-of-Experts DiT with 27 billion total parameters and 14 billion active during inference, while HunyuanVideo 1.5 pursues efficiency at 8.3 billion. Independent comparisons published through late 2025 and early 2026 highlight a consistent pattern: HunyuanVideo 1.5 leads on instruction following accuracy, motion clarity, and physically grounded effects like fluids and cloth, whereas Wan 2.2 leads on photorealistic aesthetics, especially fine skin texture and hair on human subjects. HunyuanVideo also requires less VRAM (roughly 14 GB minimum versus 24 GB for Wan 2.2 with comparable settings) and generally renders shorter clips faster thanks to SSTA.
Open-Sora 2.0, an 11B parameter model released in March 2025 by HPC AI Tech, was explicitly designed to close the gap with HunyuanVideo. On VBench, Open-Sora 2.0 narrowed the difference to OpenAI's Sora to under one percentage point and achieved scores broadly comparable to the original 13B HunyuanVideo. Genmo's Mochi 1, another popular open-source contender, is praised for fluid motion but tends to lag HunyuanVideo on temporal coherence beyond five seconds and on text alignment. With the November 2025 release of HunyuanVideo 1.5, third-party leaderboards as of early 2026 generally place HunyuanVideo 1.5 ahead of both Open-Sora 2.0 and Mochi 1 on overall video quality at comparable parameter counts.
HunyuanVideo is released under the Tencent Hunyuan Community License Agreement, dated December 3, 2024. This is not a standard open-source license. Key terms include:
Tencent retains intellectual property rights over the original HunyuanVideo works, while users own their derivative works and modifications as long as they comply with the license terms. An Acceptable Use Policy (included as an exhibit to the license) outlines prohibited uses, including the generation of child sexual abuse material, non-consensual intimate imagery, deceptive deepfakes intended for election interference, and other categories common across modern foundation-model licenses.
ComfyUI, the popular node-based interface for diffusion model workflows, added official native support for HunyuanVideo starting with version 0.3.8. The integration allows users to build text-to-video and image-to-video workflows using ComfyUI's visual node editor.
Multiple integration pathways exist:
For HunyuanVideo 1.5, Tencent released an official ComfyUI plugin (comfyui_hunyuanvideo_1.5_plugin) with both simplified and complete node sets, along with built-in automatic model download support. ComfyUI workflows have also become the de facto distribution format for community LoRAs, since users can drop a checkpoint file into a folder and immediately reuse a published workflow.
HunyuanVideo supports LoRA (Low-Rank Adaptation) fine-tuning, allowing users to customize the model for specific styles, characters, or effects without retraining the full model. LoRA support was added on December 20, 2024, shortly after the initial release. The same LoRA scaffolding works for HunyuanVideo 1.5 with minor configuration changes, so most community recipes carry forward.
Two community-maintained training stacks dominate as of 2026:
Additional tools include finetrainers, OneTrainer, and various community forks. The recommended optimizer for LoRA fine-tuning is Muon, though AdamW and Lion are also widely used in community recipes. The training pipeline supports distributed training, Fully Sharded Data Parallel (FSDP), context parallelism, and gradient checkpointing.
Practical recipes published on Civitai and YouTube converge on a common pattern: 10 to 50 short clips of three to five seconds at 720p, paired with dense captions in a sidecar text file, trained for around 4,000 steps. A character LoRA typically requires 20 to 30 reference clips covering varied angles, lighting, and expressions, and finishes in 6 to 12 hours on a single RTX 4090.
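Because these recipes pair each clip with a sidecar caption file, a quick pre-flight check for missing captions can save a failed run. The sketch below assumes the common convention of a `.txt` file sharing the clip's base name; individual trainers may expect a different layout, and the file names are hypothetical.

```python
from pathlib import Path
import tempfile

def find_uncaptioned(dataset_dir, video_exts=(".mp4", ".mov")):
    """Return the names of clips that lack a same-named .txt caption file."""
    root = Path(dataset_dir)
    clips = sorted(p for p in root.iterdir() if p.suffix.lower() in video_exts)
    return [p.name for p in clips if not p.with_suffix(".txt").is_file()]

# Demo on a throwaway folder with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("clip_001.mp4", "clip_002.mp4"):
        (root / name).touch()
    (root / "clip_001.txt").write_text(
        "a woman in a red coat walks through falling snow, handheld camera"
    )
    missing = find_uncaptioned(root)

print(missing)   # ['clip_002.mp4']
```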
On platforms like Civitai, a growing library of community-created LoRA adapters is available for HunyuanVideo, covering character styles, animation effects, camera movements, fictional intellectual properties, and style transfers such as Studio Ghibli aesthetics or specific cinematographers' looks.
Because HunyuanVideo's MLLM text encoder can handle long, detailed prompts, community guides recommend writing dense descriptions with explicit subject, action, environment, camera, and lighting clauses. For LoRA-conditioned generation, the LoRA trigger word should be written directly into the caption as a short phrase rather than a single token. Recommended sampler settings include 25 to 30 inference steps for smooth motion, a flow shift around 0.5 to 1.0 for dialogue-heavy scenes with little motion, and embedded guidance scales of 10 to 12 when strict prompt adherence is required. Multiple LoRAs can be chained by stacking nodes in ComfyUI, with strengths typically set between 0.5 and 1.2.
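One way to keep prompts in that shape is to assemble them from named clauses. The helper below is a hypothetical convenience function, and the trigger phrase is invented for illustration; it is not tied to any published LoRA.

```python
def build_prompt(subject, action, environment, camera, lighting, trigger_phrase=None):
    """Join subject/action, environment, camera, and lighting clauses,
    optionally prefixed by a LoRA trigger phrase."""
    clauses = [f"{subject} {action}", environment, camera, lighting]
    if trigger_phrase:
        clauses.insert(0, trigger_phrase)
    return ", ".join(c.strip() for c in clauses if c and c.strip())

prompt = build_prompt(
    subject="an elderly fisherman",
    action="mends a net on a wooden pier",
    environment="foggy harbor at dawn",
    camera="slow dolly-in at eye level",
    lighting="soft diffuse backlight",
    trigger_phrase="in the style of harborfilm",   # hypothetical trigger phrase
)
print(prompt)
```

Keeping the clauses separate makes it easy to vary one element, such as the camera move, while holding the rest of the description fixed across a batch.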
HunyuanVideo is integrated into the Hugging Face Diffusers library, making it accessible through a standard Python API. This integration simplifies model loading, inference, and pipeline customization for developers already familiar with the Diffusers ecosystem. Both the 13B 1.0 model and the 8.3B 1.5 model expose Diffusers pipelines, and the HunyuanCustom and HunyuanVideo-Avatar variants are gradually being integrated as well.
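A minimal text-to-video sketch with Diffusers might look like the following. The repository id is an assumption (a community conversion of the 13B weights), the resolution and frame count are deliberately small illustrative values, and the function needs a CUDA GPU plus a multi-gigabyte download, so it is defined here but not called.

```python
# Small illustrative settings; raise them if enough VRAM is available.
SETTINGS = {"height": 320, "width": 512, "num_frames": 61, "num_inference_steps": 30}

def generate_clip(prompt, out_path="output.mp4", fps=15):
    """Generate one clip with the Diffusers HunyuanVideo pipeline (needs a CUDA GPU)."""
    # Imported lazily so this file can be read or imported without the heavy dependencies.
    import torch
    from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
    from diffusers.utils import export_to_video

    model_id = "hunyuanvideo-community/HunyuanVideo"   # assumed repository id
    transformer = HunyuanVideoTransformer3DModel.from_pretrained(
        model_id, subfolder="transformer", torch_dtype=torch.bfloat16
    )
    pipe = HunyuanVideoPipeline.from_pretrained(
        model_id, transformer=transformer, torch_dtype=torch.float16
    )
    pipe.vae.enable_tiling()           # decode the video in tiles to lower peak memory
    pipe.enable_model_cpu_offload()    # keep idle sub-models on the CPU

    frames = pipe(prompt=prompt, **SETTINGS).frames[0]
    export_to_video(frames, out_path, fps=fps)
    return out_path
```

Calling `generate_clip("A cat walks on the grass, realistic style.")` would write `output.mp4` to the working directory.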
The community has developed several approaches to reduce VRAM requirements and inference latency:
In combination, these techniques have brought HunyuanVideo from a model that initially required an 80 GB H100 to a system that can produce 5-second 720p clips on a hobbyist RTX 4090 in under two minutes, and on a high-end RTX 5090 in well under a minute as of early 2026.
Tencent has built several specialized models on top of the HunyuanVideo foundation. Together they form a layered ecosystem in which the same diffusion transformer backbone and 3D causal VAE are adapted to new conditioning modalities by adding lightweight injection modules.
HunyuanVideo-I2V is the dedicated image-to-video variant of the foundation model. It uses a token replace technique in which a portion of the input latent sequence is overwritten with tokens derived from the reference image, anchoring the first frames of the generated clip to the input. The MLLM text encoder receives both the user prompt and a visual description of the reference image to align style and content. Minimum GPU memory is 60 GB at full precision for 720p output, with 80 GB recommended for best quality, although community FP8 and GGUF variants substantially lower these requirements.
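The token replace idea can be sketched on toy data. The example assumes that only the first latent frame is overwritten and represents each latent frame as a short list of numbers; neither detail is taken from the official implementation.

```python
def token_replace(noisy_latent_frames, image_latent):
    """Return a copy of the latent sequence whose first frame is replaced
    by the clean latent of the reference image (assumed behaviour)."""
    return [list(image_latent)] + [list(frame) for frame in noisy_latent_frames[1:]]

# Three toy latent frames of pure "noise" and one clean reference latent.
noisy = [[0.9, -1.2], [0.3, 0.7], [-0.5, 1.1]]
reference = [0.1, 0.2]

conditioned = token_replace(noisy, reference)
print(conditioned)   # [[0.1, 0.2], [0.3, 0.7], [-0.5, 1.1]]
```

Because the replaced tokens stay clean while the others are denoised, the remaining frames can attend to them at every step and inherit the reference image's appearance.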
Released on May 9, 2025, HunyuanCustom extends the video generation framework with multi-modal conditioning. It accepts image, audio, video, and text inputs simultaneously, with an emphasis on subject consistency across generated frames. The model uses a text-image fusion module based on LLaVA and an image ID enhancement module that reinforces identity features across frames through temporal concatenation.
For audio and video conditioning, HunyuanCustom introduces:
Tencent reports that HunyuanCustom outperforms VACE, Skyreels, Pika, Vidu, Kling (listed as "Keling"), and Hailuo on subject consistency, text-video alignment, and overall video quality in head-to-head subject-driven generation tests.
Released on May 28, 2025, HunyuanVideo-Avatar is an audio-driven human animation model jointly developed by Tencent Hunyuan and Tencent Music. Given a single character image and an audio clip, it generates video of the character speaking or singing with lip synchronization, emotional expression, and full-body motion. Key components include:
The model supports photorealistic, cartoon, 3D-rendered, and anthropomorphic character styles, and has been deployed inside multiple Tencent Music applications for short-form video, virtual host, and avatar singing use cases.
HunyuanVideo-Foley, released in August 2025 and updated with the XL variant on September 29, 2025, is an end-to-end Text-Video-to-Audio (TV2A) model that automatically generates sound effects synchronized to existing video. It analyses the visual content of the input video, identifies actions and environments, and produces contextually appropriate Foley sound effects, ambience, and incidental music. The architecture is a multimodal diffusion transformer (MM-DiT) paired with a self-developed 48 kHz audio VAE that can faithfully reconstruct sound effects, music, and vocals.
Tencent reports best-in-class scores across audio quality, synchronization, and semantic alignment metrics in its internal evaluation. In creator workflows, HunyuanVideo-Foley is commonly chained after HunyuanVideo or HunyuanVideo-Avatar to produce a fully voiced and dubbed clip from text plus a still image.
Beyond the base T2V checkpoint, the HunyuanVideo 1.5 release ships several model files:
This modular packaging mirrors the structure used by the broader Hunyuan foundation suite and gives community deployments fine control over the tradeoff between quality, speed, and resolution.
HunyuanVideo and its derivatives are used across a wide range of practical applications:
The combination of open weights, a mature ComfyUI ecosystem, and consumer-grade hardware support has made HunyuanVideo a popular default in indie filmmaking and YouTube channels focused on AI-assisted content production.
Like all current video generation models, HunyuanVideo has several known limitations: