Synthesia 3.0
Last reviewed
May 16, 2026
Sources
18 citations
Review status
Source-backed
Revision
v1 · 2,977 words
Synthesia 3.0 is a major release of the AI video generation platform from Synthesia, the London-based company co-founded in 2017 by Victor Riparbelli, Steffen Tjerrild, Lourdes Agapito and Matthias Niessner. Announced on October 1, 2025, the release reframes the product from a generator of one-way "talking head" presentations into a platform built around interactive, two-way video. The launch introduced a new generative engine called Express-2, a feature set branded as Video Agents that can hold real-time conversations with viewers, a quizzing and branching layer called Interactivity 2.0, and an AI Playground that embeds external models from OpenAI and Google inside Synthesia's editor.
The company described it as the biggest update in Synthesia's history. At launch, the platform supported more than 240 stock avatars and content creation in over 140 languages, and the new Express-2 engine produced 1080p, 30 frames-per-second video of arbitrary length. Synthesia 3.0 also marked the company's clearest strategic pivot toward enterprise learning and development, with corporate training, onboarding, candidate screening and internal communications positioned as the primary use cases rather than marketing or consumer content. Synthesia entered the launch at a $2.1 billion valuation set in its February 2025 Series D round led by NEA; roughly three months after Synthesia 3.0 shipped, a January 2026 Series E round doubled that valuation to $4 billion.
Synthesia was founded in 2017 on the back of academic research into neural face reenactment, including the Face2Face paper from Matthias Niessner's lab at the Technical University of Munich. The first commercial product let users create short marketing or training videos by selecting a stock avatar, typing a script, and rendering a video where the avatar lip-synced the text in one of a growing set of voices. Through the 2020 to 2024 period, the platform leaned heavily on enterprise customers in financial services, healthcare, retail, and large industrial companies. By the time of the 3.0 launch, Synthesia stated it served more than one million users across over 50,000 teams, including roughly 60 percent of the largest United States companies and around 90 percent of the Fortune 100.
The direct predecessor to Express-2 was an engine called Express-1, released in 2024. Express-1 introduced more naturalistic facial movement, emotional expression and lip sync compared to Synthesia's older avatar pipeline, but it remained focused on the head and upper torso of an avatar standing in front of a static background. Synthesia framed Express-1 internally as the model that "brought expressiveness to the content" and Express-2 as the model that "will make content more engaging to watch." In April 2025, Synthesia announced a research-license deal with Shutterstock that gave its R&D team access to HD and 4K stock footage to pre-train future avatar models, with workplace interactions and professional tasks as the data of interest. Synthesia later confirmed that this pre-training program was part of the work that produced Express-2.
The launch happened during a noisy stretch in the generative video space. Runway, Pika, Luma and others were pushing camera and scene generation forward, while OpenAI's Sora line and Google's Veo line had moved from research demos into commercial APIs. Synthesia's positioning in 3.0 was that pure text-to-video models could not yet produce the consistent, controllable, on-brand avatars that enterprises needed for training and internal communications. The company chose to wrap those external generative models inside its editor rather than try to match them on raw cinematic quality.
Express-2 is the generative core of Synthesia 3.0. The engine is split into two large sub-systems that operate in series: Express-Voice for speech synthesis and voice cloning, and Express-Video for visual generation. Synthesia describes the visual side as a diffusion transformer (DiT) architecture trained on workplace and presentation footage, with several auxiliary models that score motion against audio and render the final frames.
Express-Voice is the speech component. It is a two-stage transformer (an autoregressive front end paired with a non-autoregressive back end, each approximately 800 million parameters) that operates over graphemes rather than phonemes. It is conditioned on a reference audio clip at inference time, rather than using a fixed set of learned speaker embeddings. According to Synthesia's technical post, this in-context approach lets the model clone a voice in seconds and preserve identity, accent, dialect, rhythm and emotional tone without any per-speaker fine-tuning. Training data included publicly available speech corpora such as YODAS and LibriLight to cover a wide range of accents. In internal evaluations across 17 accents, Express-Voice was rated highest among compared systems for matching the original speaker's identity, rhythm and accent, with WavLM used for speaker similarity scoring and emotion2vec for emotional fidelity.
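The two-stage, in-context flow described above can be sketched as follows. This is an illustrative toy pipeline, not Synthesia's implementation: every class, method, and the "conditioning" arithmetic are invented stand-ins, and the real model is a pair of roughly 800-million-parameter transformers rather than the hash trick used here.

```python
# Hypothetical sketch of Express-Voice's two-stage inference: an autoregressive
# front end produces acoustic tokens from graphemes plus a reference clip, and
# a non-autoregressive back end decodes them in parallel. All names are invented.
from dataclasses import dataclass
from typing import List


def graphemes(text: str) -> List[str]:
    # Express-Voice operates over graphemes (raw characters) rather than
    # phonemes, so no per-language grapheme-to-phoneme dictionary is needed.
    return list(text)


@dataclass
class ReferenceClip:
    speaker: str
    samples: List[float]  # a few seconds of audio suffices for cloning


class AutoregressiveFrontEnd:
    """Stage 1 stand-in: predicts coarse acoustic tokens step by step,
    conditioned on the grapheme sequence and the reference clip."""

    def generate_tokens(self, chars: List[str], ref: ReferenceClip) -> List[int]:
        # Toy "in-context conditioning": a speaker-dependent offset mimics how
        # identity, accent and tone come only from the reference audio.
        style = hash(ref.speaker) % 97
        return [(ord(c) + style) % 256 for c in chars]


class NonAutoregressiveBackEnd:
    """Stage 2 stand-in: decodes all acoustic tokens to a waveform at once."""

    def decode(self, tokens: List[int]) -> List[float]:
        return [t / 255.0 for t in tokens]


def synthesize(text: str, ref: ReferenceClip) -> List[float]:
    # No per-speaker fine-tuning: the only speaker signal is the reference clip.
    tokens = AutoregressiveFrontEnd().generate_tokens(graphemes(text), ref)
    return NonAutoregressiveBackEnd().decode(tokens)
```

The key property the sketch mirrors is that swapping the reference clip changes the output voice without retraining anything, which is what "in-context" cloning means in practice.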
Express-Video is split into three coordinated models. Express-Animate generates the co-speech motion (gestures, posture shifts, head turns) that should accompany the audio. Express-Eval is a CLIP-style model that scores how well a candidate motion sequence matches the audio, and is used to filter generations. Express-Render is the final rendering stage, a distilled diffusion model that produces 1080p video at 30 frames per second using only two diffusion steps per frame; a faster variant called Express-Render-Turbo was announced as work in progress at the time of launch. The combination means Express-2 generates full-body avatars rather than head-and-shoulders shots, and the avatars produce hand gestures, weight shifts and micro-expressions that line up with what they are saying.
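The generate-then-score pattern behind Express-Animate and Express-Eval resembles best-of-N rejection sampling, and can be sketched in miniature. This is an assumption-laden toy: the function names, the scoring heuristic, and the candidate count are all invented; a real CLIP-style scorer would embed the audio and motion streams and compare embeddings.

```python
# Illustrative best-of-N filtering in the style of the Express-2 pipeline:
# a motion generator proposes candidate takes, a scorer ranks them against
# the audio, and only the best take goes on to rendering. Names are invented.
import random
from typing import List, Tuple


def generate_motion(audio_len: int, seed: int) -> List[float]:
    """Stand-in for Express-Animate: one gesture value per audio frame."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(audio_len)]


def score_motion(audio: List[float], motion: List[float]) -> float:
    """Stand-in for Express-Eval: higher when motion energy tracks the audio.
    A real CLIP-style model would compare learned embeddings instead."""
    return -sum((abs(a) - abs(m)) ** 2 for a, m in zip(audio, motion))


def best_of_n(audio: List[float], n: int = 8) -> Tuple[int, List[float]]:
    """Generate n candidate takes and keep the highest-scoring one; the winner
    would then be handed to the final renderer (Express-Render)."""
    candidates = [(seed, generate_motion(len(audio), seed)) for seed in range(n)]
    return max(candidates, key=lambda c: score_motion(audio, c[1]))
```

Note how the seed doubles as a "take" selector, which is also how the launch feature set exposes seed variation to users who want a different take of the same script.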
| Capability | Description |
|---|---|
| Output resolution | 1080p video at 30 frames per second |
| Maximum length | Arbitrary length, limited only by the user's plan |
| Body coverage | Full-body avatars with gestures, posture and movement, not only head and shoulders |
| Voice cloning | Few-second reference clip, preserves identity, accent and emotional tone |
| Voice languages | Speech generation in more than 140 languages |
| Voice style controls | Style presets such as Excited and Neutral, plus performance intensity dial |
| Motion controls | Seed variation for different takes, temporal diversity slider, framing adjustments |
| Action prompts | Prompted gestures or actions such as "walk to the whiteboard" or "place the device on the table" |
| Wardrobe and scene | Prompt-based wardrobe (high-vis vest, hospital scrubs) and environment swaps |
| Personal avatars | Personal avatar creation from a single image upload |
| Pricing | Included on all plans at launch, no per-minute Express-2 surcharge |
A second avatar update, published by Synthesia on November 13, 2025, extended Express-2 with what the company called action-capable avatars. These can be directed by short text prompts to perform B-roll actions inside a scene rather than only delivering a script as A-roll. Synthesia gave examples such as having an avatar walk to a whiteboard, point at a screen, or pick up a device, with the actions generated rather than pulled from a fixed gesture library.
At the launch of Synthesia 3.0 the platform offered more than 240 stock avatars covering a wide range of ages, ethnicities, body types and wardrobe styles, all rebuilt under the Express-2 engine to support full-body rendering and gesture generation. The avatar set is curated by Synthesia using a network of paid actors who consent to having their likeness used, rather than relying on web-scraped imagery. The library is heavily skewed toward presentation contexts such as office, classroom, retail, healthcare and industrial settings, which lines up with Synthesia's enterprise focus.
Language coverage at launch sat above 140 languages for text-to-video creation, with the AI dubbing feature able to translate an existing avatar video into more than 30 target languages while preserving frame-accurate lip sync. Synthesia's own materials around the broader Express-2 rollout describe the platform as supporting 160-plus languages, with the difference attributable to whether each language has full studio-grade voice coverage or is offered through the dubbing pipeline only. Personal avatars, which let a customer create a custom avatar of themselves or an employee, were upgraded so that they could be generated from a single uploaded image rather than requiring a studio recording session, although enterprise-grade studio avatars remain available for higher fidelity.
Synthesia 3.0 made the company's strategic shift toward enterprise learning and development explicit. The headline new feature, Video Agents, is an avatar that a viewer can talk to in real time inside a video, which Synthesia positions for training roleplay, candidate screening, knowledge checks and customer guidance. A Video Agent can be inserted at any point inside an otherwise scripted video, switch into a live conversation grounded in a customer-supplied knowledge base, capture and process viewer data, and then hand back to the scripted track. At launch, Synthesia said Video Agents would be available for enterprise customers in early 2026 rather than on day one.
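The scripted-to-live-and-back handoff described above is essentially a small state machine, which can be sketched as follows. Every class, method, and field name here is hypothetical; Synthesia has not published a Video Agents API, and the knowledge-base lookup is reduced to a dictionary for illustration.

```python
# Minimal state-machine sketch of the Video Agent handoff: a scripted track
# plays until an agent segment, the agent holds a knowledge-base-grounded
# exchange and captures viewer data, then scripted playback resumes.
from typing import Dict, List


class VideoAgentSession:
    def __init__(self, script: List[str], knowledge_base: Dict[str, str]):
        self.script = script          # ordered segments; "AGENT" marks a handoff
        self.kb = knowledge_base      # customer-supplied grounding data
        self.captured: Dict[str, str] = {}  # viewer data gathered mid-video
        self.mode = "scripted"

    def play_next(self) -> str:
        segment = self.script.pop(0)
        if segment == "AGENT":
            self.mode = "live"        # hand off to the conversational agent
            return "agent: how can I help?"
        return f"video: {segment}"

    def viewer_says(self, utterance: str) -> str:
        # Capture viewer input, answer from the knowledge base, hand back.
        self.captured["last_utterance"] = utterance
        answer = self.kb.get(utterance.lower(), "let me note that down")
        self.mode = "scripted"        # return control to the scripted track
        return f"agent: {answer}"
```

The point of the sketch is the control flow, not the dialogue: the agent segment can sit anywhere in the script, and the session resumes the scripted track afterward with the captured data available for downstream processing.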
Interactivity 2.0 is the lighter-touch interactive layer in the same release. It adds clickable hotspots, branching pathways, embedded quizzes, polls and call-to-action buttons inside a Synthesia video, so a single video can fork based on viewer responses. This is paired with a separate product called Courses, which combines Express-2 avatars with branching learning paths, Interactivity 2.0 elements, and SCORM export so that the courses run inside enterprise learning management systems.
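A branching video of the kind Interactivity 2.0 enables reduces to a directed graph where quiz answers select the next segment. The sketch below uses an invented node layout and field names purely to show that structure; it is not a Synthesia data format, and SCORM export would wrap such a graph in additional LMS tracking metadata.

```python
# Sketch of an Interactivity 2.0-style branching course: each node is a video
# segment, and quiz answers pick the next node. Node names are invented.
from typing import Optional

# node -> (question or None for a linear segment, {answer: next node})
COURSE = {
    "intro": ("Ready for the safety quiz?", {"yes": "quiz", "no": "recap"}),
    "recap": (None, {None: "quiz"}),                  # linear segment
    "quiz":  ("Which gear is required?", {"helmet": "pass", "gloves": "retry"}),
    "retry": (None, {None: "quiz"}),                  # loop back to the quiz
    "pass":  (None, {}),                              # terminal segment
}


def next_node(current: str, answer: Optional[str]) -> Optional[str]:
    """Follow the branch for the viewer's answer; linear segments use the
    None key, and an empty branch map means the course has ended."""
    _, branches = COURSE[current]
    if not branches:
        return None
    return branches.get(answer, branches.get(None))
```

A single video authored this way forks per viewer, which is what lets one Courses artifact cover both the fast path and the remediation loop inside a learning management system.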
The AI Playground is the third pillar. It is a panel inside the Synthesia editor that embeds external generative models, specifically OpenAI's Sora 2 and Google's Veo 3.1, so that a customer can generate cinematic B-roll, transitions or product shots from a prompt and cut them straight into a video that otherwise centers on Express-2 avatars. Synthesia positions the Playground as a way to bring text-to-video generation into a controlled enterprise workflow with single sign-on, audit logs and content moderation rather than asking customers to assemble those models separately.
A Copilot feature, framed as an AI script writer and editing assistant tied to a customer's knowledge base, was previewed at launch as a 2026 product alongside the broader Courses rollout. Roughly 90 percent of Fortune 100 companies and more than 60 percent of the largest United States companies are described by Synthesia as customers, and the company's product surface for 3.0 is built around the workflows those customers already use: corporate training, compliance, onboarding, internal communications and sales enablement.
Synthesia 3.0 kept the same overall plan structure as the prior Synthesia product, with Express-2 avatars and the new editor available across all tiers rather than gated behind a separate pricing line. The Free plan is intended for short trial videos and is watermarked. The Starter plan is aimed at individual creators and small teams. The Creator plan adds more minutes and editing seats. The Enterprise plan is custom-priced and adds unlimited minutes within fair-use limits, single sign-on, SCORM export, advanced security and compliance reviews, and the early-2026 Video Agents and Courses features.
| Plan | Headline price (2025-2026) | Video minutes | Notable features |
|---|---|---|---|
| Free | 0 dollars per month | About 3 minutes per month | Watermarked output, 9 stock avatars, more than 140 languages |
| Starter | About 22 to 29 dollars per month, billed annually or monthly | About 10 minutes per month | Watermark removal, full stock avatar library, basic Interactivity 2.0 |
| Creator | About 53 to 89 dollars per month | About 30 minutes per month or 360 minutes per year via credits | Personal avatar from single image, AI Dubbing, AI Playground access |
| Enterprise | Custom, typically 4,000 dollars per year and up | Unlimited within fair use | SSO, SCORM export, Video Agents (early 2026), Courses, dedicated support |
Pricing varies by billing term, region and negotiated enterprise terms. Synthesia does not publish a list price for its Enterprise plan, and third-party reviews place real-world Enterprise contracts well above the visible Creator tier when factoring in seat counts and custom avatar work.
The AI avatar video market through late 2025 and into 2026 is dominated by Synthesia, HeyGen, and a handful of newer entrants focused on more cinematic or character-driven work such as Hedra Character. Hour One, which had been a frequent comparison point earlier in the cycle, was acquired and effectively wound down for new customers, so head-to-head comparisons after the Synthesia 3.0 launch usually narrow to Synthesia versus HeyGen, with Hedra and a few open-source projects on the edges.
| Platform | Avatar style | Language coverage | Primary positioning | Notable 2025 to 2026 feature |
|---|---|---|---|---|
| Synthesia 3.0 | Full-body avatars from a curated stock library, plus personal avatars from a single image, rendered by Express-2 at 1080p, 30 fps | More than 140 languages for text-to-video, more than 30 for AI dubbing | Enterprise learning, training and internal communications | Video Agents (real-time conversational avatars), Interactivity 2.0, AI Playground with Sora 2 and Veo 3.1 |
| HeyGen | Photoreal head-and-shoulders and partial-body avatars in Avatar IV, with strong custom avatar creation from a short video | 175-plus languages and dialects with real-time translation and lip sync | Marketing, social, and small-to-mid business video | Avatar IV with motion-capture-style animation and Photo Avatar from a single image |
| Hedra Character | Cinematic character avatars driven by audio, oriented toward storytelling and short-form film | Audio-driven, language coverage tied to upstream TTS choice | Creator-side video and short film | Audio-to-character video generation with strong identity preservation |
| Hour One | Stock-presenter avatars in fixed scenes, talking-head emphasis | Multiple languages, smaller library than Synthesia or HeyGen | Workplace training and HR, pre-acquisition | Wound down for new customers after acquisition by Wix |
The practical difference between Synthesia 3.0 and HeyGen in 2026 reviews tends to be framed around buyer profile rather than raw quality. HeyGen is generally rated higher for solo creators and marketing teams who want more flexible avatar customization, broader language coverage and lower per-seat pricing. Synthesia is rated higher for enterprise buyers who want governance, compliance, SCORM-ready learning content, and the option to put a Video Agent in front of a viewer with a knowledge base behind it. Both platforms can produce convincing avatars in 2026, and which one looks better in a side-by-side render depends heavily on the input image and script. Hedra and similar character platforms are usually a separate purchase for film-style content rather than a like-for-like alternative.
Reception to Synthesia 3.0 has been split along customer profile. Reviews aimed at enterprise readers, including industry blogs and corporate learning publications, treated the launch as a significant generational jump for the category, with particular praise for the move from talking heads to full-body avatars, the in-context voice cloning that preserves an employee's accent rather than flattening it, and the embedded AI Playground that pulls in Sora 2 and Veo 3.1 without leaving the editor. Synthesia's own metrics, including a 4.7 out of 5 rating on G2 in the AI Video High Performer category and around 4.0 on Trustpilot from more than 1,700 reviews, held steady or improved after the 3.0 launch.
Individual creators and marketers were less uniformly positive. Common complaints in user reviews after the 3.0 launch include content moderation that rejects marketing scripts the user considers benign, inconsistent moderation decisions where similar videos pass once and are blocked a second time, and stricter limits on what can be said by personal avatars compared with stock ones. Some reviewers also flagged that the platform's center of gravity is moving away from light marketing use cases and toward heavy enterprise workflows, which makes the Free and Starter tiers feel narrower than they used to.
Victor Riparbelli summarized the launch on his own account by saying the team had "just wrapped Synthesia 3.0," framing Video Agents as "video you can talk to in real time" and presenting the new avatars as ones that "move and talk like professional speakers." Press coverage from outlets including CNBC, TechCrunch and AI Magazine connected the product launch directly to Synthesia's January 2026 Series E round, where Alphabet's GV and NVIDIA's NVentures led a $200 million round at a $4 billion valuation, roughly doubling the company's prior valuation in under a year. Reporters generally treated Synthesia 3.0 as the artifact that made the new round possible.