HeyGen Avatar IV
Last reviewed: May 16, 2026
Sources: 20 citations
Review status: Source-backed
Revision: v1 · 3,144 words
Avatar IV is the fourth generation of the AI avatar engine from HeyGen, the AI video company co-founded in 2020 by Joshua Xu and Wayne Liang. It was announced by Xu on May 6, 2025, with the model going live in HeyGen's web app the same week. Avatar IV generates a talking-avatar video from a single photo, a script or audio file, and a voice, and is built on what HeyGen describes as a diffusion-inspired audio-to-expression engine that drives facial motion, head movement and micro-expressions from the input audio rather than from a fixed gesture library. Through the second half of 2025 the model expanded from head-and-shoulders output into half-body and full-body framings with timing-aware hand gestures, and on November 5, 2025 HeyGen paired it with a new voice control layer called Voice Director and the Panda Voice Engine.
Avatar IV is the headline model in HeyGen's lineup, sitting above the older Avatar III lip-sync engine and beneath the video-reference Avatar V model that followed it in April 2026. It is gated as a premium feature on HeyGen's plans, with usage metered in Premium Credits at roughly 20 credits per minute of generated video, while Avatar III remains the unmetered workhorse for paid users. By the time of HeyGen's $100 million annual recurring revenue milestone in October 2025, Avatar IV was the model the company pointed to in nearly every product announcement, including the August 2025 upgrade of the Digital Twin feature and the May 2026 launch of the Avatar IV API.
HeyGen was founded in Shenzhen in December 2020 as Surreal by Joshua Xu and Wayne Liang, both Tongji University alumni with master's degrees from Carnegie Mellon University. The company moved its headquarters to Los Angeles, rebranded to HeyGen, and over the next four years built an avatar video platform around stock avatars, voice cloning and AI dubbing. By June 2024 HeyGen had raised a $60 million round led by Benchmark at a $500 million valuation, and by October 2025 the company stated it had crossed $100 million in annual recurring revenue, a roughly tenfold jump from late 2023.
The avatar pipeline went through several model generations before Avatar IV. The earliest engines were focused on accurate lip sync over a stock or custom presenter, with movement limited to small head motion against a static background. Avatar III, the immediate predecessor, kept lip sync as its main strength and is described in HeyGen's own product materials as the workhorse model that remains unmetered for paid users. Avatar III handled the bulk of routine talking-head generation but did not produce significant body motion or context-aware gestures, and its outputs were generally framed as head-and-shoulders. Avatar IV was designed to add expressive performance on top of accurate lip sync, with HeyGen positioning it as a model that interprets the script rather than only syncing to it.
The launch landed during a noisy stretch in the AI avatar market. Synthesia was preparing its own Express-2 engine, which shipped in October 2025 as part of Synthesia 3.0 and made full-body avatars and in-context voice cloning the new baseline. D-ID, Hedra and several open-source projects were also pushing on photoreal head-and-shoulders avatars. HeyGen's positioning with Avatar IV was that a single photo plus a voice should be enough to drive a performance rather than a lip sync, and that the model should scale from a portrait crop up to a full-body shot without needing a separate motion-capture pipeline.
Joshua Xu announced Avatar IV on X on May 6, 2025, framing it as HeyGen's most advanced avatar model and describing the input as "one photo, one script, just your voice." In the same post Xu wrote that "most avatars sync to your words, Avatar IV interprets them," and credited a diffusion-inspired audio-to-expression engine that analyzes vocal tone, rhythm and emotion as the core architecture. HeyGen ran a live product launch event for Avatar IV in the same window and pushed the model into general availability across its web app and creator plans in the days that followed.
The initial release covered portrait and partial-body framings, photorealistic and stylized characters, and a wide range of input photo angles including front-facing, three-quarter and profile shots. HeyGen also marketed the model as working on non-human characters such as anime portraits and pets, where the audio-to-expression engine drives mouth and head motion from the same audio input. Across the rest of 2025 HeyGen layered new capabilities on top of the launch release in monthly product update cycles, rather than calling them new model generations.
The most significant rollout milestones after the initial announcement were:

- June 2025: timing-aware hand gestures synchronized with script content, plus natural-language gesture prompts in the Avatar IV editor.
- August 2025: the upgraded Digital Twin feature, extending Avatar IV output to full-body weight shifts and posture.
- November 5, 2025: Voice Director and the Panda Voice Engine, alongside prompt-based gesture controls, finer-grained expression controls and intelligent render selection.
- May 4, 2026: the self-serve Avatar IV API on the Pro and Scale developer tiers.
Avatar IV's successor, Avatar V, was introduced at a HeyGen webinar on April 16, 2026. Avatar V is positioned as a video-reference model that fine-tunes on a 15-second clip of the user rather than predicting motion from a single image, but Avatar IV remains live in the product and remains the model HeyGen recommends when only a photo is available.
Avatar IV produces video from three inputs: a photo of the subject, a script or pre-recorded audio file, and a voice. The script is converted to speech using HeyGen's voice engine if no audio is provided, and the audio is then passed into the audio-to-expression engine that drives the visual generation. The output is a fully rendered video clip with synchronized lip movement, head motion, facial micro-expressions and, in half-body and full-body modes, hand and arm gestures.
| Capability | Details |
|---|---|
| Input photo | Single image; front-facing, three-quarter or profile angles supported |
| Supported subjects | Photoreal humans, stylized portraits, anime characters and pets |
| Framing | Portrait, half-body and full-body; framing is selected per generation |
| Lip sync | Audio-driven sync; HeyGen claims industry-leading accuracy across dozens of languages |
| Facial motion | Diffusion-inspired audio-to-expression engine for brow shifts, eye softening, asymmetric smiles |
| Head motion | Tilts on pauses, forward lean on emphasis, turn-away on reflective lines |
| Hand gestures | Timing-aware gestures synchronized with script content from June 2025 onward |
| Body motion | Full-body weight shifts and posture, added with the August 2025 Digital Twin upgrade |
| Prompted gestures | Natural-language gesture and movement prompts inside the Avatar IV editor |
| Voice languages | More than 175 languages and dialects across HeyGen's voice stack |
| Voice cloning | Few-second reference clip for voice mirroring; integrated with Voice Director |
| Output resolution | Up to 1080p on Creator plans; 4K on Pro and above |
| API access | Avatar IV API on Pro and Scale developer tiers from May 2026 |
| Credit cost | 20 Premium Credits per minute of generated Avatar IV video |
The diffusion-inspired engine is the part HeyGen emphasizes most in its own materials. Rather than mapping phonemes to mouth shapes, the engine analyzes the audio frame by frame for tone, rhythm and emotional cues, then synthesizes a coherent set of facial movements that match those cues. HeyGen describes the result as photoreal motion with temporal realism, where blinks, head tilts and small smiles arrive at moments that fit the spoken cadence rather than at fixed intervals. The same signal path drives the hand-gesture system that was added in June 2025; prompted phrases such as "emphasize this point" or specific descriptions of a gesture can be inserted into the script and the model attempts to produce the corresponding motion.
A second wave of capabilities arrived in HeyGen's November 2025 release. The Avatar IV editor gained more controllable movements with prompt-based gestures, finer-grained expression controls for subtler body language, and an intelligent render selection that automatically picks an optimal rendering approach for a given clip. These changes did not break existing Avatar IV jobs; HeyGen treated them as in-place upgrades to the same model line rather than a new generation.
Voice Director is the voice control layer that HeyGen shipped on November 5, 2025 alongside the November Avatar IV update. It is built on the Panda Voice Engine and is designed to let users shape an avatar's vocal delivery through natural-language prompts rather than through manual SSML or a separate audio editor. Prompts such as "add excitement," "make it sound confident," or "emphasize this line" can be applied at the word, sentence or paragraph level, and the engine adjusts tone, pacing and emotion for that span while keeping the underlying voice consistent.
HeyGen pairs Voice Director with a companion feature called Voice Mirroring. Voice Mirroring takes a short uploaded voice recording and replicates not only the speaker's voice identity but also their rhythm, emotion, pacing and personality, so that the avatar speaks in the same style the reference speaker actually used rather than in a flattened default. This is the same input pattern HeyGen had used for voice cloning in earlier engines, but extended so that the cloned voice carries the speaker's expressive habits into the Avatar IV performance.
The November 2025 release also moved voice tools into the workflows where they are used most, including AI Studio, Proofread Studio and the Avatars surface. The standalone Voices tab was deprecated as part of that change. Inside the Avatar IV editor, Voice Director output is wired directly into the audio-to-expression engine, so a Voice Director prompt that adds excitement to a line also tends to produce larger gestures, more head motion and more emphatic facial expressions on the avatar.
Avatar IV is included on HeyGen's paid plans and is metered through HeyGen's Premium Credit system, which is the meter the company uses for its newer generative features. Avatar III remains unmetered for paid users and is the default fallback for high-volume routine work. Premium Credit consumption for Avatar IV runs at approximately 20 credits per minute of generated video, although HeyGen has tuned the exact ratio over time and bundles credits differently across plans.
| Plan | Monthly price | Notable Avatar IV terms |
|---|---|---|
| Free | $0 | Short watermarked trial videos; limited Avatar IV access |
| Creator | About $29/month ($24 with annual billing) | Roughly 200 Premium Credits per month, about 10 minutes of Avatar IV video |
| Pro | About $99/month ($79 with annual billing) | Roughly 2,000 Premium Credits per month, about 100 minutes of Avatar IV; 4K export; Avatar IV API access on the developer tier |
| Business | About $149/month | Shared workspace credits; additional seats at about $20 each |
| Enterprise | Custom pricing | Custom credit pools, dedicated support, governance and security controls |
The Avatar IV API, opened on May 4, 2026, is self-serve on the Pro and Scale developer tiers, with custom rates for higher-volume Enterprise usage. The API exposes the same Avatar IV engine through POST requests to HeyGen's video generation endpoints and through photo-avatar endpoints that handle motion and sound effects programmatically. Pricing for API usage is published on HeyGen's API pricing page and is metered separately from the credit pool tied to the web app subscription.
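A generation request through the API can be sketched as follows. This is a minimal illustration only: the payload field names (`video_inputs`, `talking_photo_id`, `input_text`, and so on) and the endpoint path are assumptions modeled on HeyGen's v2 video-generation schema, and the authoritative shapes live in HeyGen's API reference.

```python
import json

# Assumed base URL and payload layout for an Avatar IV job; verify field
# names against HeyGen's published API reference before relying on them.
API_BASE = "https://api.heygen.com"


def build_avatar_iv_request(photo_avatar_id: str, script: str, voice_id: str) -> dict:
    """Assemble a JSON body for a single-photo Avatar IV video job."""
    return {
        "video_inputs": [
            {
                "character": {
                    "type": "talking_photo",       # single-photo avatar input
                    "talking_photo_id": photo_avatar_id,
                },
                "voice": {
                    "type": "text",                # script-to-speech path
                    "input_text": script,
                    "voice_id": voice_id,
                },
            }
        ],
        "dimension": {"width": 1920, "height": 1080},  # 1080p output
    }


payload = build_avatar_iv_request("photo_123", "Welcome to the demo.", "voice_456")
print(json.dumps(payload, indent=2))
```

In practice this body would be sent as a POST with the account's API key in the request headers, and the job ID in the response polled for the finished video; both of those details are plan-dependent and documented on HeyGen's API pages.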
The Premium Credit cap is the part most often flagged in third-party reviews. Reviewers note that 200 credits a month on the Creator plan covers roughly ten minutes of Avatar IV output before the user has to either fall back to Avatar III or buy additional credits, which can make Avatar IV feel premium-gated for casual creators. Pro and Business plans, with much larger credit allocations, are the tiers most often recommended for users whose primary workflow is Avatar IV rather than Avatar III.
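The credit math above reduces to a single ratio. A small sketch, using the approximate 20-credits-per-minute rate cited in this article (HeyGen has tuned the exact ratio over time, so this is an estimate rather than a billing calculator):

```python
# Approximate Avatar IV metering rate cited in HeyGen's plan materials;
# subject to change, so treat the result as an estimate.
CREDITS_PER_MINUTE = 20


def avatar_iv_minutes(monthly_credits: int) -> float:
    """Estimate how many minutes of Avatar IV video a credit pool covers."""
    return monthly_credits / CREDITS_PER_MINUTE


print(avatar_iv_minutes(200))   # Creator plan: 10.0 minutes
print(avatar_iv_minutes(2000))  # Pro plan: 100.0 minutes
```

This is the arithmetic behind the reviewer complaint: at roughly ten minutes a month, a Creator-plan user producing daily Avatar IV clips exhausts the pool quickly and must fall back to Avatar III or buy additional credits.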
Through late 2025 and into 2026 the AI avatar market has been dominated by HeyGen and Synthesia, with Hedra Character sitting alongside them as a more cinematic single-image option and D-ID still active in real-time talking-head workflows. Avatar IV is the model HeyGen sends into all of these comparisons.
| Platform | Engine | Avatar style | Primary positioning | Notable late-2025 feature |
|---|---|---|---|---|
| HeyGen Avatar IV | Diffusion-inspired audio-to-expression engine | Photoreal portraits to full-body avatars from a single photo, stylized and non-human subjects also supported | Marketing, social, creator and small-to-mid business video, plus Digital Twin for executive use | Voice Director and Panda Voice Engine integration, prompted gesture control |
| Synthesia Express-2 | DiT-based Express-Video plus two-stage Express-Voice transformer | Full-body stock and personal avatars from a curated actor library, rendered at 1080p, 30 fps | Enterprise learning and development, training and internal communications | Action-capable avatars that perform prompted B-roll gestures, embedded AI Playground with Sora 2 and Veo 3.1 |
| Hedra Character (Character-3) | Multimodal model reasoning over image, text and audio jointly | Cinematic talking characters from a single image, oriented toward storytelling | Creator-side video, short film, character-driven content | Live Avatars and tightly synchronized audio-driven character performance at sub-dollar per-minute pricing |
| D-ID | Talking-head pipeline with strong real-time path | Head-and-shoulders presenters with streaming output | Real-time customer interaction, agents, chat-style avatars | Streaming avatars and conversational deployment in customer support tools |
The practical split between HeyGen and Synthesia in 2026 reviews tends to track buyer profile rather than raw quality. Avatar IV is consistently rated higher by solo creators and marketing teams who want a faster path from a single photo to a finished video, broader language coverage and lower per-seat pricing. Synthesia's Express-2 is preferred by enterprise buyers who want a curated stock-avatar library, governance, SCORM-ready learning content and the new Video Agents layer. Reviewers covering both platforms in late 2025 generally describe Express-2 as closing most of the realism gap that Avatar IV had opened earlier in the year, while still trailing Avatar IV on flexibility with arbitrary input photos.
Against Hedra Character, Avatar IV is usually framed as the broader product with stronger full-body motion and a deeper voice stack, while Hedra Character is rated higher for character expressiveness and cinematic single-image talking heads at a much lower per-minute price. Against D-ID, Avatar IV is rated higher on visual realism and emotional nuance but does not, as a non-streaming video generation product, replace D-ID's real-time conversational deployments.
Avatar IV was treated by AI video coverage through 2025 as the model that pushed the photoreal avatar bar past head-and-shoulders lip sync into full-body performance. Reviews in publications including TechCrunch, AI Magazine and a wide set of independent creator-focused outlets pointed to the single-photo input, the prompted gesture control, and the November integration with Voice Director as the features that separated it from Avatar III and from competing models earlier in the year. HeyGen's own metrics through the same period, including the jump to $100 million in annual recurring revenue by October 2025, were repeatedly tied back to Avatar IV adoption.
Individual user reviews were more mixed. The most common complaint in third-party reviews of Avatar IV in 2026 is the Premium Credit cap, which makes the 200-credit Creator plan feel narrow for users whose primary workflow is Avatar IV rather than Avatar III. Reviewers also flagged artifacts in complex hand motion, swift head turns and rapid phrasing, with the model performing best on medium-length conversational takes and explainer content rather than fast-paced commentary or heavy choreography. Several reviews noted that the model's strongest output comes from well-lit, front-facing reference photos and that more extreme angles or low-quality source images produce visible artifacts in the generated motion.
Joshua Xu has continued to frame Avatar IV publicly as a step toward avatars that can carry an actual performance rather than only lip-sync a script. In his August 2025 Digital Twin announcement Xu wrote that the upgraded Digital Twin powered by Avatar IV is "indistinguishable from you," and in his June 2025 gesture control announcement he described Avatar IV as a model that can "speak, gesture, and move its body with meaning." HeyGen has since used Avatar IV as the foundation under Avatar V, the April 2026 video-reference model that takes the same general workflow and replaces single-photo prediction with a fine-tuned model based on a 15-second user clip.