Tavus is an American generative AI research company headquartered in San Francisco, California, that develops video generation models and real-time conversational video technology. The company builds infrastructure enabling developers to create interactive AI video agents -- digital representations of humans that can see, hear, speak, and respond in real time. Tavus is best known for two product lines: an asynchronous video generation platform for personalized video campaigns, and the Conversational Video Interface (CVI), a real-time face-to-face interaction layer for AI agents. The company was founded in 2020 by Hassaan Raza and Quinn Favret and participated in Y Combinator's Summer 2021 batch. As of November 2025, Tavus had raised a total of approximately $64 million across seed, Series A, and Series B funding rounds.
Tavus describes its broader mission as "human computing" -- the thesis that natural human-AI interaction should take the form of face-to-face video conversation rather than text chat, and that building this capability requires co-designed, integrated models for visual rendering, turn-taking, and perception.
Hassaan Raza, co-founder and CEO of Tavus, studied computer science at the University of Texas at Austin. He held engineering and program management roles at Hewlett Packard Enterprise, Apple (where he worked on macOS security initiatives), and Google before founding Tavus. At Google, Raza was a technical program manager with responsibilities spanning machine learning platform programs. Alongside his corporate career he co-founded Accantus, an early-stage SaaS and IoT company applying sensor telemetry to musculoskeletal patient outcomes. His combination of software engineering experience and machine learning program management informed Tavus's technical architecture decisions during its founding years.
Quinn Favret, co-founder and COO, attended the University of Michigan's Stephen M. Ross School of Business. Before Tavus, Favret co-founded Chime Menu, a restaurant technology startup, from 2019 to 2020. Favret departed the University of Michigan to pursue Tavus full-time after meeting Raza and developing the initial product concept.
Raza and Favret founded Tavus in 2020 with an initial focus on personalized video for sales and marketing. The core insight was that sales representatives could record a single video template, and the platform would automatically generate thousands of individualized versions -- inserting each prospect's name, company, or other details while synthesizing the sender's voice and lip movements to match each personalized script variant. The resulting videos appeared as though the sender had personally recorded each one, while in practice only one source recording was required.
This approach addressed a common constraint in outbound video outreach: the time required to manually record individual messages at scale. Tavus's early platform integrated with CRM tools including Salesforce, HubSpot, and Mailchimp, allowing personalized video delivery to be triggered automatically by prospect actions or campaign logic such as a pricing-page visit or a trial signup.
The company was accepted into Y Combinator's Summer 2021 (S21) batch. Its Hacker News launch post from that period described the product as "AI-generated personalized videos for sales outreach," targeting go-to-market teams. The YC cohort provided early capital and validated the market for AI-powered personalization in video.
Through 2022 and 2023, Tavus began repositioning from a sales SaaS tool toward a developer platform and AI research organization. Rather than serving end users directly through a polished interface, the company increasingly packaged its models as APIs that third-party applications could integrate. This shift reflected both the broader maturation of generative AI infrastructure in the post-GPT-3 period and an opportunity to serve a wider range of industries -- healthcare, education, financial services, and entertainment -- rather than sales teams alone.
In March 2023, Tavus announced a seed funding round of $6.1 million led by Sequoia Capital, with participation from Y Combinator Continuity, HubSpot Ventures, Accel Partners, Index Ventures, and Lightspeed Ventures. The company simultaneously launched early API access to its Phoenix video generation model and began opening the platform to developers outside the original sales-outreach context.
By the time of the Series A announcement in March 2024, Tavus described its technology as powering digital replicas usable across a wide range of applications including customer onboarding, interactive product demos, and real-time conversational AI agents -- a significantly expanded scope from its original personalized-outreach product.
Tavus raised $6.1 million in a seed round in early 2023, led by Sequoia Capital. Additional participants included Y Combinator Continuity, HubSpot Ventures, Accel Partners, Index Ventures, and Lightspeed Ventures. The round supported the company's transition from a closed SaaS product to an API-accessible developer platform.
In March 2024, Tavus raised $18 million in a Series A round led by Scale Venture Partners. Sequoia Capital, Y Combinator, and HubSpot Ventures participated. The company announced the round alongside the public beta launch of its developer platform. At the time, notable enterprise customers included Meta and Salesforce, which used the platform to produce personalized B2B upsell and demo video campaigns at scale.
The funding was covered by TechCrunch and announced via Business Wire. Tavus stated plans to expand enterprise and mid-market sales across banking, real estate, automotive, and healthcare verticals, and to accelerate development of the Conversational Video Interface product line.
In November 2025, Tavus raised $40 million in a Series B round led by CRV. Returning investors Scale Venture Partners, Sequoia Capital, Y Combinator, HubSpot Ventures, and Flex Capital also participated. The company framed the round around its vision of "human computing" and simultaneously launched PALs (Personal Affective Links), a new product category representing agentic AI humans with persistent memory, emotional awareness, and multimodal capabilities across video, voice, and text. Total funding raised by Tavus reached approximately $64 million.
Hassaan Raza described the company's vision at the time of the Series B as a "future where machines adapt to us, learning to see, hear, and respond with the nuance and empathy that define human connection," contrasting this with the conventional paradigm in which humans adapt to machine interfaces.
Phoenix is Tavus's proprietary video generation model series, used to synthesize realistic talking-head video of a digital replica from a text script. Phoenix processes a text input, generates the corresponding voice audio using cloned or synthetic speech, and renders synchronized facial movements -- lip, cheek, eyebrow, nose, and chin articulations -- to match the audio output.
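As a rough sketch of this asynchronous flow, the example below submits a script for rendering over HTTP. The base URL, endpoint path, and field names (`replica_id`, `script`, `video_id`) are assumptions for illustration, not the documented schema.

```python
import os
import requests

# Hypothetical endpoint and field names, for illustration only;
# consult the Tavus API reference for the actual schema.
API_BASE = "https://tavusapi.com/v2"  # assumed base URL
headers = {"x-api-key": os.environ["TAVUS_API_KEY"]}

# Submit a script: Phoenix synthesizes the voice, then renders
# synchronized facial motion for the chosen replica.
resp = requests.post(
    f"{API_BASE}/videos",
    headers=headers,
    json={
        "replica_id": "r_example123",  # a previously trained replica
        "script": "Hi, thanks for trying the product -- here's a quick tour.",
    },
)
resp.raise_for_status()
print("Video queued:", resp.json().get("video_id"))  # assumed response field
```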
Early versions of Phoenix used neural radiance fields (NeRF) for three-dimensional facial reconstruction, allowing more spatially consistent rendering than traditional 2D methods that operate only on image surfaces. NeRF-based approaches model a face as a continuous three-dimensional volume, enabling the system to handle changes in head angle and lighting more realistically than image-plane methods.
Phoenix-2 extended the NeRF-based foundation and improved training efficiency. It supported replica creation from approximately two minutes of source video footage, a significantly lower data requirement than earlier personal video generation systems that needed hours of footage.
Phoenix-3, released in early 2025, adopted a Gaussian-diffusion rendering architecture. This approach represents facial geometry as a collection of Gaussian volumetric primitives (sometimes described as 3D Gaussian splatting), which can be deformed and re-rendered rapidly. The change enabled higher-fidelity simulation of micro-expressions -- small, fast facial movements around the eyes, forehead, and mouth that human observers use to assess authenticity and emotional state. Phoenix-3 also introduced full-face animation rather than lip-region-only synthesis: eyebrows, cheeks, eyelids, and the overall head motion were all generated jointly rather than composited from separate systems.
Phoenix-4, released in February 2026, extended the architecture to include a real-time emotional control system. An Emotion Control API allows developers to specify discrete emotional states -- joy, sadness, anger, and surprise -- and the renderer adjusts facial geometry accordingly. Specifying joy, for example, produces rendered zygomatic engagement affecting the cheeks and periorbital region, not merely an upward mouth curve. Phoenix-4 also introduced a stream-first rendering architecture using WebRTC for direct packet delivery to browsers, targeting sub-600 millisecond end-to-end conversational latency at 30 frames per second.
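The sketch below shows what driving such an emotion control could look like over HTTP. Only the four discrete states come from the description above; the endpoint path and the `emotion` and `intensity` fields are assumptions.

```python
import os
import requests

API_BASE = "https://tavusapi.com/v2"  # assumed base URL
headers = {"x-api-key": os.environ["TAVUS_API_KEY"]}

# Hypothetical request: bias the replica's rendered facial geometry
# toward one of Phoenix-4's discrete states (joy, sadness, anger,
# surprise). The path and field names are illustrative.
resp = requests.post(
    f"{API_BASE}/conversations/c_example456/emotion",
    headers=headers,
    json={
        "emotion": "joy",   # one of the documented discrete states
        "intensity": 0.7,   # assumed 0-1 scale, not confirmed
    },
)
resp.raise_for_status()
```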
Digital replicas in the Tavus system are trained from as little as two minutes of source video footage. This is significantly lower than historical requirements for photorealistic personalized video generation, which often required tens of minutes to hours of footage. The low training data requirement was achieved through NeRF-based 3D reconstruction and later Gaussian-diffusion approaches, both of which can generalize from limited input.
As a safety measure, the platform requires explicit verbal consent on camera as part of the training submission process. The speaker must recite a specific consent statement, and the platform automatically cross-checks the voice in the consent recording against the voice in the main training footage to confirm identity. All replica creation requests pass through both automated verification and a subsequent manual human review before the replica is activated.
The Conversational Video Interface (CVI) is Tavus's real-time, end-to-end pipeline for building face-to-face AI agents. Unlike asynchronous video generation -- where a script is submitted and a finished video file is returned -- CVI runs as a live session in which an AI agent speaks, listens, observes the user through their camera, and responds with sub-second latency.
CVI is built on WebRTC for media transport, enabling peer-to-peer audio and video streaming with low latency and automatic bandwidth adaptation. By default, Tavus uses Daily.co's hosted WebRTC infrastructure, which provides pre-built meeting room UIs that can be embedded or linked directly into applications. The pipeline chains together several processing stages: speech recognition (ASR) on the incoming audio, visual perception (Raven) on the incoming video, turn-taking detection (Sparrow), response generation by an LLM, text-to-speech synthesis (TTS), and real-time video rendering of the replica (Phoenix).
The final output is a live video stream of a human-appearing agent that responds within approximately 600 to 1,000 milliseconds of the human finishing an utterance. Tavus publishes an SLA target of under one second for utterance-to-utterance latency under standard network conditions.
CVI exposes a modular API: each layer -- ASR, LLM, TTS, video rendering -- can be substituted or customized independently. The LLM layer accepts any model compatible with the OpenAI API specification, allowing developers to use GPT-4o, Claude, Llama, or custom fine-tuned models. Session callbacks deliver webhooks for events including transcript availability, replica joining, recording readiness, and conversation analytics.
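As a sketch of consuming those session callbacks, the minimal Flask receiver below dispatches on an event-type field. The event name strings are guesses modeled on the callback categories listed above, not documented values.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/tavus/webhook", methods=["POST"])
def tavus_webhook():
    event = request.get_json(force=True)
    # Event type strings below are illustrative guesses based on the
    # callback categories Tavus documents (transcript availability,
    # replica joining, recording readiness, analytics).
    event_type = event.get("event_type", "")
    if event_type == "conversation.transcript_ready":
        print("Transcript:", event.get("transcript"))
    elif event_type == "conversation.replica_joined":
        print("Replica joined session", event.get("conversation_id"))
    elif event_type == "conversation.recording_ready":
        print("Recording available:", event.get("recording_url"))
    return jsonify({"ok": True})

if __name__ == "__main__":
    app.run(port=8080)
```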
Tavus introduced a distinction between "Personas" (behavioral configuration: LLM system prompt, voice settings, personality parameters, TTS engine selection) and "Replicas" (the visual and audio identity of the agent). A developer can pair a custom Persona with either a Tavus stock Replica or a custom Replica trained from two minutes of footage. This separation of identity from behavior allows the same visual replica to behave differently across applications without retraining the underlying video model.
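A hedged sketch of the Persona/Replica separation in practice: behavior is defined once as a Persona, then combined with a Replica when a session starts. Endpoint paths and field names (`persona_id`, `replica_id`, `conversation_url`) are illustrative assumptions, not the exact schema.

```python
import os
import requests

API_BASE = "https://tavusapi.com/v2"  # assumed base URL
headers = {"x-api-key": os.environ["TAVUS_API_KEY"]}

# 1) Behavior: a Persona holds the system prompt, LLM choice, and
#    voice/TTS settings. Field names here are illustrative.
persona = requests.post(
    f"{API_BASE}/personas",
    headers=headers,
    json={
        "persona_name": "Onboarding guide",
        "system_prompt": "You are a friendly onboarding assistant.",
        "llm": {"model": "gpt-4o"},  # any OpenAI-compatible model
    },
).json()

# 2) Identity: a stock Replica, or one trained from ~2 minutes of footage.
replica_id = "r_stock_example"  # placeholder ID

# 3) Start a live CVI session combining the two.
conversation = requests.post(
    f"{API_BASE}/conversations",
    headers=headers,
    json={"persona_id": persona.get("persona_id"), "replica_id": replica_id},
).json()
print("Join URL:", conversation.get("conversation_url"))
```

Because behavior and identity are created independently, swapping the `replica_id` changes who the agent appears to be without touching the Persona, and vice versa.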
Tavus also published integration support for Pipecat, Daily's open-source framework for building voice and multimodal AI agents. Through TavusVideoService in the Pipecat framework, developers can incorporate Tavus replicas into broader AI pipeline architectures that chain together speech recognition, language models, and other tools.
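A minimal sketch of instantiating that service follows; the import path and constructor arguments are assumptions (they vary across Pipecat versions) and should be checked against the installed release.

```python
import asyncio
import os

import aiohttp

# Import path and constructor signature are assumptions based on the
# TavusVideoService integration described above; verify against the
# installed Pipecat version.
from pipecat.services.tavus.video import TavusVideoService


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        tavus = TavusVideoService(
            api_key=os.environ["TAVUS_API_KEY"],
            replica_id="r_example123",  # placeholder replica ID
            session=session,
        )
        # In a full pipeline this service is chained after the
        # ASR -> LLM -> TTS stages so the replica renders the agent's
        # speech as live video.
        print("Service constructed:", tavus)


asyncio.run(main())
```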
In November 2025, alongside the Series B announcement, Tavus launched PALs (Personal Affective Links) -- a product tier representing agentic AI humans built on top of the CVI platform. PALs are described as AI agents that maintain persistent memory across conversations, initiate interactions proactively, take agentic actions such as scheduling calendar events and sending emails, and fluidly transition between video, voice, and text modalities depending on context.
PALs represent Tavus's furthest extension of the human computing concept: AI systems that behave as persistent social actors rather than single-session conversational interfaces. The PAL product category was announced as a forward-looking capability, with availability in preview form at launch.
Hummingbird-0 is a zero-shot lip synchronization model released by Tavus in April 2025 as a standalone research preview product. Unlike the full Phoenix replica pipeline, which requires a training phase tied to a specific individual, Hummingbird-0 operates without per-person training. Given an arbitrary video of a face and any audio track, the model generates synchronized lip movements for the audio without prior exposure to that individual.
Traditional lip synchronization systems require training on samples from the target speaker to learn that person's mouth geometry and speech patterns. Hummingbird-0 eliminates this requirement by generalizing across identities from its training distribution, enabling immediate application to new faces without a pipeline delay for per-speaker training.
This zero-shot capability is significant for use cases where a training dataset is unavailable or impractical: animating historical photographs, dubbing archival footage, rapidly prototyping video content without investing time in replica training, or applying lip sync to faces in user-generated content at scale.
Hummingbird-0 was derived from components of the Phoenix-3 full-face rendering system, inheriting its Gaussian-diffusion architecture's capacity for identity-preserving facial deformation. Tavus described the model as outperforming other commercially available lip-sync models on visual quality, lip-sync accuracy, and identity preservation metrics at the time of release.
The model supports video dubbing workflows: a source video recorded in one language can be re-voiced in a different language while generating lip movements appropriate to the dubbed audio rather than the original speech. This enables multilingual content production from a single source recording without re-shooting. Hummingbird-0 was made available through a standalone Lip Sync API in the Tavus developer platform, separate from the full replica creation workflow.
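A sketch of a dubbing call against such a standalone endpoint; the path and field names are assumptions chosen to mirror the workflow described above.

```python
import os
import requests

API_BASE = "https://tavusapi.com/v2"  # assumed base URL
headers = {"x-api-key": os.environ["TAVUS_API_KEY"]}

# Hypothetical request shape: a source video plus a replacement audio
# track (e.g. a Spanish dub), returning a lip-synced render. Field
# names are illustrative, not the documented schema.
resp = requests.post(
    f"{API_BASE}/lipsync",
    headers=headers,
    json={
        "original_video_url": "https://example.com/source_en.mp4",
        "source_audio_url": "https://example.com/dub_es.mp3",
    },
)
resp.raise_for_status()
print("Lip-sync job submitted:", resp.json())
```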
Raven is Tavus's visual perception model series, designed to give CVI agents contextual awareness of the human participant they are conversing with. Rather than treating a conversation as a purely audio-and-text exchange, Raven continuously analyzes the video feed from the human participant's camera and provides structured perception outputs to the downstream LLM and response generation pipeline.
Raven-0 was introduced alongside Phoenix-3 and Sparrow-0 in March 2025. The model performs several perception tasks in real time: reading facial expressions and apparent emotional state, tracking gaze direction and attention, and interpreting posture and other body-language cues.
These perception outputs are passed as context to the LLM, enabling AI agent responses that adapt to non-verbal signals. An agent whose Raven output indicates the user looks confused can proactively offer clarification; an agent observing sustained eye contact and forward posture can recognize engagement and maintain its conversational pace.
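Conceptually, the handoff can be pictured as merging a structured perception snapshot into the LLM's context before each response. The payload shape below is invented for illustration and is not Raven's actual output schema.

```python
# Invented perception payload, for illustration only -- not Raven's
# actual output schema.
perception = {
    "expression": "confused",      # dominant facial expression
    "gaze": "on_screen",           # attention estimate
    "posture": "leaning_forward",  # engagement cue
}

def build_llm_messages(user_utterance: str, cues: dict) -> list[dict]:
    """Fold non-verbal cues into the prompt so replies can adapt."""
    cue_text = ", ".join(f"{k}={v}" for k, v in cues.items())
    return [
        {
            "role": "system",
            "content": (
                f"Non-verbal cues from the vision model: {cue_text}. "
                "If the user looks confused, proactively offer clarification."
            ),
        },
        {"role": "user", "content": user_utterance},
    ]

print(build_llm_messages("So how does billing work?", perception))
```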
Raven-1, released as part of the Phoenix-4 system in February 2026, extended the perception model to correlate visual cues with voice tone analysis, giving the model a more complete picture of the participant's affective state. Where Raven-0 processed only the visual stream, Raven-1 integrated audio tonal features -- pitch variation, speaking rate, voice energy -- with facial expression analysis to produce a richer emotional context vector passed to the LLM and Sparrow.
Sparrow is a transformer-based conversational turn-taking model that manages the timing of exchanges in CVI sessions. Human conversation depends on shared understanding of when one participant has finished speaking and another may begin -- a coordination mechanism involving prosodic cues, sentence completion patterns, gaze, and breath rhythm. Detecting end-of-utterance by voice activity detection (VAD) alone introduces delays and fails to handle interruptions or mid-speech pauses gracefully.
Sparrow-0, released in March 2025, was trained on conversational data to distinguish sentence-final pauses from mid-utterance pauses, detect intentional interruptions, and predict natural handoff points in dialogue. The result is that CVI agents do not wait for an extended silence before responding, and can handle overlapping speech more naturally than VAD-based systems. Sparrow-0 was described as enabling approximately 600 millisecond response latency from the end of a human utterance to the start of the agent's audible reply.
The model is described as a transformer architecture trained to model the statistical patterns of conversational timing across a large corpus of dialogue data. Its outputs are probabilistic: it produces a continuous-valued estimate of the likelihood that the human has finished speaking, allowing the CVI pipeline to begin preparing a response before a definitive silence is detected, further reducing perceptible latency.
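A toy sketch of how such a continuous estimate could gate speculative response preparation; the thresholds and callback structure are invented for illustration.

```python
# Toy illustration of threshold logic over a continuous end-of-turn
# probability. The threshold values are invented for illustration.
PREPARE_THRESHOLD = 0.6  # begin drafting a reply early
COMMIT_THRESHOLD = 0.9   # actually start speaking

def on_turn_probability(p_end_of_turn: float, state: dict) -> None:
    """React to each probability update from the turn-taking model."""
    if p_end_of_turn >= PREPARE_THRESHOLD and not state.get("preparing"):
        state["preparing"] = True
        print("Prefetch LLM/TTS output before silence is confirmed")
    if p_end_of_turn >= COMMIT_THRESHOLD and state.get("preparing"):
        state["preparing"] = False
        print("Play the prepared reply")

state: dict = {}
for p in (0.2, 0.45, 0.65, 0.8, 0.93):  # simulated model outputs
    on_turn_probability(p, state)
```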
Sparrow-1, introduced with the Phoenix-4 system in February 2026, incorporated multimodal inputs from Raven-1's perception output -- visual expression analysis and voice tonal features -- as additional inputs to turn-taking decisions. This allows Sparrow-1 to factor in whether the participant's body language suggests they have more to say even when they have momentarily paused in speech, producing more natural interruption handling and more appropriate agent response timing.
Tavus serves enterprise customers including Meta and Salesforce, both of which have used the asynchronous video generation platform to produce personalized B2B sales and upsell video campaigns at scale. Both deployments were cited at the time of the Series A as examples of large enterprises integrating AI personalization into video-based go-to-market motions.
On the developer API side, Tavus's CVI platform is used by product teams building interactive AI agent experiences. Published and documented use cases include AI-assisted healthcare intake and mental-health check-ins, sales and professional training simulations, interview coaching, language-learning conversation partners, and interactive customer onboarding and product demonstrations.
Tavus publishes HIPAA-compliant configuration guidance for healthcare deployments, covering consent flow design, data handling requirements, and escalation protocols appropriate for sensitive mental health conversations. The company has positioned healthcare as a strategic vertical given research findings that face-to-face video interactions produce higher patient engagement and satisfaction than telephone or text-based channels.
Tavus operates in a market alongside several other AI video and avatar companies, most notably HeyGen, Synthesia, and D-ID. The companies share overlap in AI avatar video generation but differ significantly in product architecture, target customer profiles, and supported use cases.
| Feature | Tavus | HeyGen | Synthesia | D-ID |
|---|---|---|---|---|
| Primary use case | Real-time AI agents, personalized video | Avatar video creation, marketing | Enterprise training videos, localization | Animated speaking portraits, API |
| Real-time conversation | Yes (CVI) | Limited | No | Limited |
| Custom avatar creation | From ~2 min of footage | Yes | Yes | From a photo or short video |
| Lip-sync API | Yes (Hummingbird-0) | Yes | Yes | Yes |
| Visual perception | Yes (Raven) | No | No | No |
| Turn-taking model | Yes (Sparrow) | No | No | No |
| Emotional control API | Yes (Phoenix-4) | No | No | No |
| Languages | 30+ (CVI), more in async | 175+ | 140+ | Multiple |
| Enterprise compliance | HIPAA guidance | SOC 2 | SOC 2 Type II | SOC 2 |
| Primary API focus | Developer (CVI + video gen) | Both | Enterprise SaaS + API | Developer API |
| Notable investors | Sequoia, CRV, Scale | Andreessen Horowitz | Kleiner Perkins, Accel, NVIDIA | Various |
HeyGen is most directly competitive in avatar realism and high-volume video production. HeyGen's Avatar IV product uses motion capture to generate natural hand gestures, eye movements, and full-body animations. HeyGen supports over 175 languages and real-time translation with lip-sync, making it strong for global content localization. As of 2025-2026, HeyGen is generally considered to lead on avatar realism for scripted video production, while Tavus differentiates on the interactive, real-time conversation layer.
Synthesia targets enterprise clients requiring high-volume multilingual video production for training, onboarding, and compliance content. Synthesia is notable for SOC 2 Type II compliance and tight integration with enterprise learning management and HR systems. Synthesia does not support real-time interactive conversation -- videos are generated from scripts and delivered as files. Its strength lies in the script-to-localized-video workflow for Fortune 500 customers in regulated industries.
D-ID focuses on animated speaking portraits: still photographs or short video clips transformed into talking-head videos. D-ID offers an accessible API and notable integrations including a Canva plugin, making it convenient for design-adjacent workflows. D-ID introduced conversational features in its Creative Reality Studio product but at lower real-time fidelity than Tavus's CVI pipeline.
The key differentiator Tavus emphasizes is its integrated perception-and-turn-taking stack. While competitors focus on video generation or avatar realism for asynchronous content, Tavus argues that genuine interactive AI presence requires co-designed models for perception (Raven), timing (Sparrow), and rendering (Phoenix) operating together within a single sub-second latency budget. No competitor had released equivalent dedicated models for visual perception and semantic turn-taking as of the time of Tavus's March 2025 model family announcement.
CVI is designed for any application where a user benefits from speaking face-to-face with an AI rather than typing into a text chat box. Tavus frames the interface as "human computing" -- the thesis that the natural interface for AI agents in many domains is video conversation rather than text, just as mobile computing shifted interfaces from keyboards toward touchscreens and voice.
Healthcare is a primary vertical. An AI triage agent powered by CVI can gather symptom information before a clinical appointment, helping physicians focus consultation time on diagnosis and decision-making rather than data collection. Conversational AI companions in mental health applications can conduct regular check-ins, detect changes in mood and engagement through Raven's real-time visual analysis, and escalate to human clinicians when warranted. Tavus cites research indicating patients engage longer and share more information in face-to-face video interactions than in telephone or text-based channels, with face-to-face video consultations achieving approximately 86% satisfaction compared to 77% for telephone in one referenced study.
Sales and professional training applications use CVI to create realistic simulated practice environments. Sales representatives can rehearse objection handling with an AI persona that responds dynamically based on the trainee's specific statements, rather than following fixed scripted branches. Interview coaching tools let candidates practice with AI interviewers that provide specific behavioral feedback grounded in what the candidate actually said and how Raven perceived their delivery.
Language education is another documented use case: CVI provides a face-to-face conversation partner available on demand for immersive speaking practice in a target language, with correction and feedback integrated into the conversational flow.
The original Tavus use case remains a significant product line. Sales and marketing teams record a single master video template, define variable fields (recipient name, company name, industry, specific product or feature to mention), and submit a contact list. Tavus generates individualized video files for each contact, with the sender's voice and lip movements synthesized to match each personalized script variant.
This approach is used for outbound sales outreach, customer renewal notices, product upsell campaigns, and onboarding walkthroughs. CRM integrations with Salesforce, HubSpot, and Mailchimp allow video delivery to be triggered by CRM events -- a prospect visiting a pricing page, a contract approaching expiration, or a new account activation. Tavus's API supports campaigns at large scale, with dynamic variable substitution handled automatically during video generation.
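Under the same endpoint assumptions as the earlier sketches, the campaign workflow reduces to one render request per contact row:

```python
import csv
import os
import requests

API_BASE = "https://tavusapi.com/v2"  # assumed base URL
headers = {"x-api-key": os.environ["TAVUS_API_KEY"]}

# One master replica, many personalized renders. The endpoint path and
# field names are illustrative, not the exact schema.
with open("prospects.csv", newline="") as f:
    for row in csv.DictReader(f):  # columns: name, company
        resp = requests.post(
            f"{API_BASE}/videos",
            headers=headers,
            json={
                "replica_id": "r_sales_rep",  # the sender's trained replica
                "script": (
                    f"Hi {row['name']}, I saw {row['company']} just "
                    "started a trial -- here's a two-minute walkthrough."
                ),
            },
        )
        resp.raise_for_status()
```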
Tavus received coverage in TechCrunch at its seed round announcement in March 2023, at the Series A in March 2024, and indirectly through coverage of the broader generative AI video market. The Series A TechCrunch article described the company as moving from a personalized-outreach SaaS tool toward a general developer platform for face and voice cloning with API access. Business Wire covered the August 2024 CVI launch announcement, the March 2025 Phoenix-3/Raven-0/Sparrow-0 model family launch, the April 2025 Hummingbird-0 release, and the November 2025 Series B and PALs launch.
The technology analyst blog Intellyx published a review in May 2025 describing Tavus as producing "uncannily human AI conversational avatars," noting that the Phoenix-3 and Raven combination had meaningfully advanced the fidelity ceiling for real-time interactive AI video compared to the state of the market in 2023. SiliconANGLE covered the March 2025 model family launch with the headline "Tavus introduces family of AI models to power real-time human face-to-face interaction."
Y Combinator listed Tavus in its company directory with the description "Building the human layer of AI." The company appeared in VentureBeat's coverage of the Series B with the framing "Tavus Raises $40M to Build AI Humans That See, Hear, and Feel."
User reviews on aggregator platforms have generally praised the platform's lip-sync quality, the CRM variable-insertion workflow, and the realism of face cloning from short training videos. The conversational interface received positive marks for interactivity but mixed marks on voice naturalness in longer responses.
Several limitations have been documented in independent reviews, analyst commentary, and Tavus's own documentation:
**Uncanny valley and voice quality.** Despite improvements across Phoenix versions, AI-generated voices can still fall short of fully natural human speech. The fine-grained prosodic variation -- subtle pitch contour changes, micro-pauses, breath support, and the emotional inflection that signals genuine affect -- is technically difficult to replicate under real-time latency constraints. Some users describe the voice output as occasionally flat or robotic in emotional range, particularly in longer monologue segments or emotionally nuanced content. Tavus addressed part of this limitation through the Emotion Control API in Phoenix-4, but the problem is not fully eliminated.
**Niche in the product lifecycle.** The asynchronous personalized video product excels at first-touch outreach but cannot assist with ongoing customer conversations after the initial video is delivered. The video introduces the sender; it does not negotiate, resolve support issues, or answer follow-up questions. CVI addresses this gap but requires developer integration effort and is positioned at a different pricing tier from the self-service video generation product.
**Pricing opacity.** Tavus does not publish pricing on its website; enterprise pricing is available only through sales conversations. Estimates from user reports suggest starter tiers begin around $39 to $150 per month for limited API access, with growth tiers reaching $300 to $600 per month, and enterprise pricing on custom terms. The absence of self-serve transparent pricing is a barrier for developers evaluating the platform for smaller projects or prototyping.
**Deepfake misuse potential.** The company's face and voice cloning capabilities carry dual-use risk: the same technology that enables personalized enterprise video could be misused to generate deceptive synthetic media. Tavus addresses this through its consent verification system -- training footage must include an on-camera verbal consent statement, and the platform cross-checks the voice in the consent recording against the voice in the training footage. All replica creation requests pass through automated identity-verification checks followed by a mandatory manual human review before activation. Nonetheless, researchers studying synthetic media risks have noted that platform-enforced safeguards are policy controls rather than cryptographic or physical barriers, and that the dual-use nature of the technology is an inherent characteristic of face-and-voice cloning systems regardless of provider.
**Latency variability.** The sub-600 millisecond conversational latency target for CVI represents a best-case SLA under optimal network conditions. Real-world deployments with variable internet connectivity, complex LLM prompts requiring longer inference time, high API load, or long conversation context windows can experience higher latency, degrading the natural feel of the interaction. The perception-rendering pipeline involves multiple sequential AI model inferences, each of which contributes to the total latency budget.
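As a purely illustrative decomposition, every figure below is invented to show how sequential stages consume a sub-600 millisecond budget; Tavus does not publish this breakdown.

```python
# Hypothetical per-stage latencies, invented purely to illustrate how
# a sub-600 ms budget decomposes across sequential inferences.
budget_ms = {
    "ASR final transcript": 120,
    "turn-taking decision (Sparrow)": 40,
    "LLM first token": 250,
    "TTS first audio chunk": 90,
    "video render + network": 90,
}
print(f"Illustrative total: {sum(budget_ms.values())} ms")  # 590 ms
# Any single stage slipping (slow network, long prompt, loaded API)
# pushes the end-to-end figure past the target.
```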
**Training data requirements and quality sensitivity.** While two minutes of training footage is the stated minimum for replica creation, output quality is sensitive to the consistency of the source material. Poor lighting, inconsistent head pose variation, significant background noise in the audio, or camera motion in the training footage can produce lower-fidelity replicas with less natural expression and lip movement. The two-minute figure represents a minimum, not an optimum.
**Language coverage asymmetry.** While the Phoenix async video generation product supports over 30 languages through voice cloning and script generation, the CVI real-time pipeline's language support depends in part on the ASR and TTS components selected by the developer, so real-time quality varies across languages in ways that are less controlled than in asynchronous generation.