Hume AI is a New York-based artificial intelligence research company and API platform founded in 2021 by Alan Cowen, a former researcher at Google. The company builds voice AI models designed to understand and respond to human emotional expression. Its flagship product, the Empathic Voice Interface (EVI), is a speech-to-speech AI that analyzes the tone, rhythm, and prosody of a speaker's voice to infer emotional state and adjusts its own vocal responses accordingly. Hume has released successive generations of EVI alongside a standalone text-to-speech system called Octave. The company has raised $72.8 million in total funding and operates under an ethical framework maintained by its affiliated nonprofit, The Hume Initiative.
Hume occupies a distinct position in the voice AI market: it treats emotional signal not as a surface feature but as the primary training objective. Where most conversational AI systems try to sound pleasant, Hume's models attempt to detect the user's current emotional state and adjust prosody, word choice, and response tone accordingly. The company argues this produces outcomes that correlate more closely with user well-being than systems optimized purely for task completion or engagement metrics.
Alan Cowen holds a PhD in Psychology from the University of California, Berkeley, where he developed methods for mapping emotional experience using large-scale data collection and dimensionality reduction. His most widely cited paper, published in the Proceedings of the National Academy of Sciences in 2017, identified 27 distinct categories of reported emotional experience by having participants watch 2,185 emotionally evocative short videos. That research challenged the prevailing view in psychology that emotions collapse into a handful of basic categories like happiness, sadness, and fear. Cowen argued that human emotional experience is high-dimensional, better described by a continuous space of blended states than by discrete bins.
This work led to what Cowen calls semantic space theory, a computational framework that treats emotions as positions in a multidimensional space rather than as labeled buckets. The theory holds that the nuances of voice, face, body movement, and gesture all carry meaning that standard categorical emotion labels miss. Cowen published more than 40 peer-reviewed papers on this topic, with research appearing in Nature Human Behaviour and Science Advances, accumulating over 3,000 citations.
After completing his PhD, Cowen joined Google AI, where he led the Affective Computing team. At Google he continued studying how AI systems could be trained to perceive emotional signals across modalities. He grew increasingly concerned that consumer AI products were optimized for engagement metrics rather than user well-being, and that the emotional signals embedded in voice, face, and language were either ignored or exploited. In a 2021 essay that circulated widely in AI circles, Cowen argued that recommender systems and social platforms had learned to read emotional vulnerability and use it to drive compulsive behavior rather than serve users' long-term interests. This tension between what emotion-sensing AI could do to help people and what it could do to manipulate them became the central framing for Hume AI's founding mission.
Cowen left Google in 2021 and founded Hume AI in March of that year. The company's name references the Scottish philosopher David Hume, whose writing on emotions as the primary drivers of human choice and welfare informed Cowen's framing. From the start, Hume positioned itself as a research lab as much as a product company, publishing scientific work alongside building commercial tools.
The founding team included John Beadle, co-founder and managing partner at Aegis Ventures, who joined as a founding investor, CFO, and board member, and Janet Ho, who joined as COO. Ho had previously served as a managing partner at Aegis Ventures and brought operational experience from Zynga and Rakuten.
In the early period Hume released its Expression Measurement API, which allowed developers to submit audio recordings, video files, images, and text to receive structured output describing the emotional dimensions detected in each modality. The company also began building a waitlist for a planned conversational voice product. By early 2023, that waitlist had grown to over 2,000 organizations and research institutions. The early customer mix reflected Hume's research-lab origins: academic medical centers interested in clinical monitoring, psychology researchers studying human expression at scale, and a smaller set of commercial developers building health and wellness applications.
Hume raised a $12.7 million Series A round in January 2023, led by Union Square Ventures. Additional participants included Comcast Ventures, LG Technology Ventures, Northwell Holdings, Wisdom Ventures, and Evan Sharp, co-founder of Pinterest. The company said it would use the funds to meet growing demand for its Expression Measurement API and to develop its conversational voice product. At the time, Hume had research partnerships with labs at Mount Sinai, Boston University Medical Center, and Harvard Medical School examining how analysis of vocal and facial expression could improve healthcare outcomes.
In March 2024, Hume announced a $50 million Series B round led by EQT Ventures. Other participants included Union Square Ventures (returning), Nat Friedman and Daniel Gross, Metaplanet, Northwell Holdings, Comcast Ventures, and LG Technology Ventures. The announcement came on March 25, 2024, simultaneously with the public unveiling of EVI. Total funding reached $72.8 million.
The participation of Northwell Holdings, the investment arm of Northwell Health (one of the largest US hospital systems), was notable because it signaled healthcare as a target vertical. Northwell's involvement gave Hume a potential channel for clinical deployments and framed the investment as strategic rather than purely financial.
As of early 2026, Hume employed 56 people and reported over 100,000 customers across developers and businesses, with more than 200 new API platform sign-ups per week.
The Empathic Voice Interface is Hume's core commercial product. It is a speech-to-speech AI that handles the full pipeline of a voice conversation: transcribing incoming audio, understanding language and emotional context, generating a response, and synthesizing speech with prosody tuned to the situation. Each generation of EVI has been released alongside API access so developers can embed it in their own applications.
Hume began accepting beta users for EVI in February 2023, though the product was not made broadly available until April 2024, shortly after the Series B announcement. EVI 1 introduced several capabilities that distinguished it from conventional voice assistants at the time.
The system used what Hume called an empathic large language model (eLLM), a multimodal architecture that combined a large language model with the company's expression measurement technology. Rather than treating speech purely as text input, the eLLM received the audio directly and used Hume's trained models to extract emotional signal from prosody before generating a response.
EVI 1 also introduced intelligent end-of-turn detection. Conventional voice systems wait for silence before responding, which leads to awkward pauses when speakers trail off without finishing a thought. EVI's approach used vocal tone as an additional signal to infer when a speaker had genuinely finished rather than pausing mid-sentence.
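The idea can be illustrated with a simple decision rule that weighs silence duration against a prosody-derived completion estimate. The sketch below is a conceptual illustration only; the scoring function, thresholds, and variable names are hypothetical and do not reflect Hume's actual model or training recipe.

```python
# Conceptual sketch of end-of-turn detection that combines silence duration with
# a prosody-derived "completion" score, as described above. The thresholds and
# the notion of a single completion score are hypothetical stand-ins.
def turn_is_over(silence_ms: int, completion_score: float) -> bool:
    """Decide whether the speaker has finished their turn.

    completion_score: 0.0-1.0 estimate (from a prosody model) that the final
    phrase sounded finished (falling pitch, full-stop cadence) rather than a
    mid-sentence pause.
    """
    if completion_score > 0.9:
        return silence_ms > 200   # confident finish: respond quickly
    if completion_score > 0.5:
        return silence_ms > 600   # likely finished: wait a little longer
    return silence_ms > 1500      # probably mid-thought: fall back to a long timeout


# A trailing-off "so I was thinking..." (low completion score) gets a long grace
# period, while a crisp "What's the weather today?" triggers a fast response.
print(turn_is_over(silence_ms=250, completion_score=0.95))  # True
print(turn_is_over(silence_ms=250, completion_score=0.30))  # False
```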
The system could be interrupted mid-response, stopping speech when the user spoke over it. It generated responses in a voice that could vary in tone, pace, and pitch in response to the user's detected emotional state. Hume described this as the model learning to optimize for user satisfaction by adjusting its own expressive behavior based on how users responded.
At launch, EVI 1 was available through a WebSocket API. Developers could configure the underlying language model (choosing from providers including Anthropic, OpenAI, and others), set a system prompt to define the persona and task, and connect custom tools. Hume also released SDKs for Python and TypeScript to simplify integration.
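A minimal session, sketched below with the standard `websockets` library, shows the general shape of such an integration. The endpoint URL, query parameters, and message schema are illustrative assumptions rather than Hume's documented wire format; the official Python and TypeScript SDKs are intended to handle this plumbing.

```python
# Minimal sketch of a raw EVI WebSocket session using the `websockets` library.
# The endpoint URL, query parameter, and message fields are illustrative
# assumptions, not Hume's documented wire format.
import asyncio
import base64
import json
import os

import websockets

EVI_URL = "wss://api.hume.ai/v0/evi/chat"  # assumed endpoint for illustration


def play(pcm: bytes) -> None:
    print(f"[would play {len(pcm)} bytes of audio]")  # placeholder for playback


async def talk(audio_chunk: bytes) -> None:
    # Authentication via query parameter is assumed here for brevity.
    url = f"{EVI_URL}?api_key={os.environ['HUME_API_KEY']}"
    async with websockets.connect(url) as ws:
        # Send one chunk of microphone audio, base64-encoded (assumed schema).
        await ws.send(json.dumps({
            "type": "audio_input",
            "data": base64.b64encode(audio_chunk).decode(),
        }))
        # Read server events until the assistant finishes its spoken reply.
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "audio_output":
                play(base64.b64decode(event["data"]))
            elif event.get("type") == "assistant_end":
                break


asyncio.run(talk(open("hello.wav", "rb").read()))  # example audio file
```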
The April 2024 public availability of EVI 1 put Hume into direct competition with OpenAI's then-unreleased GPT-4o voice mode, which OpenAI demonstrated publicly in May 2024. The two products addressed overlapping problems but with different architectures and different emphases. EVI 1 foregrounded emotional responsiveness and LLM flexibility; GPT-4o voice prioritized natural-sounding speech and tight integration with the OpenAI model ecosystem.
Hume released EVI 2 in September 2024 in beta. The model was described as a voice-to-voice foundation model rebuilt from the ground up rather than an incremental update to EVI 1.
EVI 2 introduced continuous voice modulation without voice cloning. Developers could adjust parameters along scales including pitch, gender presentation, and nasality to create custom voice characteristics. The architecture deliberately prevented voice cloning: by design, EVI 2 could not replicate a specific person's voice without modifications to its underlying code. Hume cited safety considerations in this decision, aiming to prevent the system from being used to impersonate real individuals.
The model achieved sub-second response latency and supported what Hume called emergent multilingual capabilities, with initial support covering English, Spanish, French, German, and Polish. Hume also released an EVI-2-small variant alongside the main model and announced a forthcoming EVI-2-large.
In a comparison Hume published against OpenAI's GPT-4o voice model, EVI 2 cost approximately $4.32 per hour versus roughly $9 per hour for GPT-4o Realtime, while offering the flexibility to use any LLM provider rather than being locked to OpenAI's models. EVI 2 also offered more personality customization: the continuous voice modulation parameters let developers tune voice characteristics without creating a new voice clone for each variation, which made it practical to build products offering multiple distinct AI personas without a large upfront data collection effort.
Hume announced EVI 3 on May 29, 2025, initially through a live demo and iOS app, with API access made available in the weeks that followed. EVI 3 represented a more substantial architectural shift. Where earlier versions of EVI separated language understanding and speech synthesis into distinct components, EVI 3 was designed as a unified speech-language model that handled transcription, language modeling, and speech generation in a single pass.
The most notable new capability was voice and personality cloning. Using as little as 30 seconds of audio, EVI 3 could capture not only a voice's timbre and accent but also its rhythm, pace, and speaking style. Hume described this as personality cloning, distinguishing it from the simpler timbre-matching that most voice cloning systems perform. The system had access to over 200,000 custom voices built on the Octave TTS platform, with inferred personality profiles for each.
Voice and personality could be specified entirely through natural language prompts, without fine-tuning. A developer could write a prompt describing a character's voice, speech patterns, and emotional tendencies, and EVI 3 would generate speech matching that description.
EVI 3 supported integration with external language models including Claude 4 (Anthropic), Gemini 2.5 (Google), and Kimi K2. Developers could also connect custom LLMs or retrieval-augmented generation pipelines. The architecture generated initial responses while external LLMs processed in parallel, then handed off seamlessly to the external model's output once available.
In blind comparisons Hume conducted with human raters, EVI 3 was rated higher than GPT-4o across seven dimensions, including empathy, expressiveness, naturalness, interruption handling, response speed, and audio quality. It also outperformed Gemini and Sesame AI on emotion and style modulation across 30 distinct speaking styles. Practical latency averaged around 1.2 seconds, with responses beginning in under 300 milliseconds on high-end hardware.
Hume announced that EVI 1 and EVI 2 would be deprecated on August 30, 2025. The deprecation timeline gave developers roughly three months to migrate after the EVI 3 API became broadly available.
EVI 3 also integrated with the Hume app, a consumer iOS application that gave end users direct access to a conversational AI companion without developer intermediation. The app was positioned as a personal AI product rather than a developer tool, targeting general consumers interested in voice-based AI interaction.
In October 2025, alongside the release of Octave 2, Hume launched EVI 4 mini. The model uses Octave 2 as its speech synthesis layer and requires pairing with an external large language model for full language generation. It targets use cases where low latency and low cost matter more than the full integrated personality and emotional reasoning of EVI 3. EVI 4 mini is available through the Hume API with WebSocket streaming support.
The empathic large language model (eLLM) is Hume's term for the multimodal architecture that powers EVI. The concept was introduced publicly with the launch of EVI 1 in 2024.
Conventional speech AI pipelines chain together separate systems: a speech-to-text transcriber, a language model, and a text-to-speech synthesizer. Each component receives only the output of the previous one. By the time language reaches the LLM, paralinguistic information about how words were spoken has been discarded.
Hume's eLLM architecture instead feeds the audio signal directly into a system that jointly processes language content and expression measures derived from prosodic analysis. This lets the model condition its response on detected emotional state. If a speaker sounds frustrated, the model can acknowledge that frustration in its word choice and in the prosody of its synthesized response. If the user's tone brightens, the model can reflect that change.
In practice, the eLLM uses output from Hume's expression measurement technology as additional context for language generation. The model is trained using feedback derived from how users respond to different types of AI output, treating proxies of user satisfaction as the optimization target rather than relying solely on human preference ratings collected in controlled studies.
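The general idea of feeding expression measures in as additional generation context can be approximated, very loosely, with an ordinary text LLM, as in the sketch below. This is not Hume's eLLM architecture, which ingests audio directly; the function names and dimension labels are hypothetical and only illustrate the conditioning idea.

```python
# Conceptual sketch: expression measures as extra generation context.
# This is NOT Hume's eLLM (which processes audio directly); it only shows the
# idea of conditioning a text LLM on prosody scores. `call_llm` is hypothetical.
from typing import Dict


def build_prompt(transcript: str, prosody: Dict[str, float], top_k: int = 3) -> str:
    # Keep only the strongest expression dimensions as context.
    top = sorted(prosody.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    signal = ", ".join(f"{name} ({score:.2f})" for name, score in top)
    return (
        "The user said: " + repr(transcript) + "\n"
        f"Their vocal tone most strongly expressed: {signal}.\n"
        "Respond in a way that acknowledges this emotional context."
    )


# A frustrated-sounding request produces a prompt that carries the
# paralinguistic signal a plain transcript would lose.
prompt = build_prompt(
    "I've asked about this refund three times already.",
    {"frustration": 0.81, "tiredness": 0.44, "calmness": 0.12},
)
print(prompt)
# response = call_llm(prompt)  # hypothetical LLM call
```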
With EVI 3, the eLLM concept evolved further into a fully unified speech-language model where speech encoding, language modeling, and speech decoding are handled in a single architecture rather than as separate modules sharing outputs.
One consequential implication of the eLLM design is that the system can generate initial words while simultaneously processing the rest of a response from an external LLM. This speculative generation reduces perceived latency: the user hears a response beginning almost immediately while the full answer is still being computed. The architecture is somewhat analogous to speculative decoding in text generation, adapted for the sequential constraints of audio output.
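The pattern can be sketched with `asyncio`: start the slow external request immediately, speak a short opener while it runs, then hand off. The functions below are hypothetical stand-ins that illustrate the handoff pattern, not Hume's implementation.

```python
# Conceptual sketch of the latency-hiding handoff described above: speak a short
# opener immediately while the full answer from a slower external LLM is still
# being generated, then hand off once it arrives.
import asyncio


async def fast_opener(user_text: str) -> str:
    # A small local model (or template) produces the first words almost instantly.
    await asyncio.sleep(0.05)
    return "Sure, let me think about that for a second."


async def external_llm(user_text: str) -> str:
    # Stand-in for a slower, higher-quality external model.
    await asyncio.sleep(1.0)
    return "Here's the detailed answer to your question..."


def speak(text: str) -> None:
    print(f"[speaking] {text}")  # placeholder for speech synthesis


async def respond(user_text: str) -> None:
    full_task = asyncio.create_task(external_llm(user_text))  # start slow path early
    speak(await fast_opener(user_text))   # user hears speech almost immediately
    speak(await full_task)                # hand off to the full response


asyncio.run(respond("How does speculative generation reduce latency?"))
```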
Octave is Hume's text-to-speech product, released separately from EVI to serve use cases that do not require two-way conversation. Hume describes Octave as the first LLM for text-to-speech, meaning that the system is built on a language model architecture rather than the signal-processing pipelines used by conventional TTS systems.
Hume released Octave on February 26, 2025. The core claim was that Octave understood the meaning and emotional context of the text it was reading, rather than simply converting characters to sound. A conventional TTS system reads words in sequence without knowledge of narrative context or character intent. Octave's LLM foundation allowed it to infer how lines should be delivered based on what the words mean and how they fit the surrounding passage.
Octave allowed developers to specify voices through natural language descriptions, ranging from simple directives like an English accent to detailed character briefs like a goblin auctioneer with a deep, gruff, gravelly voice. Developers could also give acting instructions to control emotional delivery: telling the system to deliver a line in a whispered and hushed tone, or angry and furious. Speed and pause parameters were adjustable as well.
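A request combining a voice description and a delivery instruction might look like the sketch below, written with the `requests` library; the endpoint path, header name, and JSON fields are assumptions for illustration rather than the documented request format.

```python
# Sketch of an Octave TTS request with a natural-language voice description.
# The endpoint path, header name, and JSON field names are assumptions for
# illustration; consult Hume's API reference for the actual request format.
import os

import requests

resp = requests.post(
    "https://api.hume.ai/v0/tts",  # assumed endpoint
    headers={"X-Hume-Api-Key": os.environ["HUME_API_KEY"]},  # assumed header name
    json={
        "utterances": [
            {
                "text": "Going once, going twice... sold to the wizard in the back!",
                # Voice specified by description rather than a cloned sample.
                "description": (
                    "A goblin auctioneer with a deep, gruff, gravelly voice, "
                    "speaking quickly and with theatrical excitement."
                ),
            }
        ]
    },
    timeout=30,
)
resp.raise_for_status()
# The response would carry the synthesized audio (e.g., base64 or a URL),
# depending on the actual API contract.
```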
At launch, Octave offered more than 60 pre-made voices, audio output at 48kHz, and a Creator Studio interface for producing long-form content such as audiobooks and podcasts.
In benchmark tests against ElevenLabs Voice Design using 180 human raters evaluating 120 diverse prompts, Octave was preferred on audio quality 71.6% of the time, on naturalness 51.7% of the time, and on prompt adherence 57.7% of the time.
Octave 2 launched on October 1, 2025, alongside EVI 4 mini. The update delivered a 40% reduction in latency (responses under 200 milliseconds) and a 50% reduction in cost compared to Octave 1, achieved partly through a compute optimization partnership with SambaNova. Octave 2 added support for 11 languages: Arabic, English, French, German, Hindi, Italian, Japanese, Korean, Portuguese, Russian, and Spanish. It also introduced multi-speaker conversation capability, enhanced reliability on uncommon words and numbers, and a voice conversion feature that could replace one speaker's voice with another while preserving the original phonetic timing, useful for dubbing applications.
The Expression Measurement API was Hume's earliest commercial product and the foundation for its early research partnerships. The API accepted audio, video, images, and text and returned structured measurements of emotional expression across multiple modalities.
The system measured 48 dimensions of emotional expression from facial movements, 48 dimensions from vocal prosody (tone, rhythm, and timbre of speech), 48 dimensions from non-linguistic vocalizations such as laughs, sighs, and gasps, and 53 dimensions of emotional expression from the meaning and tone of text. These dimensions were derived from Cowen's semantic space theory research and the 27-emotion taxonomy from his Berkeley work, extended and refined through further studies.
The API supported batch processing through a job-based interface as well as streaming. Use cases included research applications at academic medical centers, mental health assessment tools, and customer experience analytics.
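The batch workflow followed a familiar submit-poll-fetch shape, sketched below with the `requests` library; the endpoint paths and field names are illustrative assumptions rather than the exact contract of the now-deprecated API.

```python
# Sketch of the batch job workflow described above: submit media URLs for
# expression measurement, poll until the job completes, then fetch predictions.
# Endpoint paths and JSON fields are illustrative assumptions.
import os
import time

import requests

BASE = "https://api.hume.ai/v0/batch/jobs"  # assumed base path
HEADERS = {"X-Hume-Api-Key": os.environ["HUME_API_KEY"]}  # assumed header name

# Submit a job covering prosody and language models for one audio recording.
job = requests.post(
    BASE,
    headers=HEADERS,
    json={
        "urls": ["https://example.com/interview.wav"],
        "models": {"prosody": {}, "language": {}},  # assumed model selectors
    },
    timeout=30,
).json()

# Poll until the job finishes, then download the per-dimension scores.
while True:
    status = requests.get(f"{BASE}/{job['job_id']}", headers=HEADERS, timeout=30).json()
    if status.get("state", {}).get("status") in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)

predictions = requests.get(
    f"{BASE}/{job['job_id']}/predictions", headers=HEADERS, timeout=30
).json()
```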
Hume announced the deprecation of the Expression Measurement API in 2026. The last day to create new jobs through the API Playground was set for May 14, 2026, with full API access ending on June 14, 2026. Hume cited a strategic shift toward integrating expression measurement directly into EVI and other conversational products, rather than offering it as a standalone batch-processing tool.
The deprecation represented a significant change for researchers and clinical developers who had built workflows around the batch-processing format. EVI's embedded expression sensing is designed for real-time conversation rather than offline analysis of large media archives, so applications built on the batch API face genuine migration challenges rather than a simple model swap.
Hume distributes all its products through a unified API platform. Access requires an API key and uses WebSocket connections for real-time voice interactions (EVI) and standard HTTP for Octave TTS.
Pricing uses a tiered subscription model with monthly allotments of usage and overage rates for usage beyond the included amounts.
For EVI (speech-to-speech), pricing in 2025 to 2026 was structured as follows:
| Plan | Monthly cost | Included EVI minutes | Overage rate |
|---|---|---|---|
| Free | $0 | 5 minutes | $0.07/min |
| Starter | $3 | 40 minutes | $0.07/min |
| Creator | $7 | 200 minutes | $0.07/min |
| Pro | $70 | 1,200 minutes | $0.06/min |
| Scale | $200 | 5,000 minutes | $0.05/min |
| Business | $500 | 12,500 minutes | $0.04/min |
| Enterprise | Custom | Custom | Custom |
For Octave TTS (text-to-speech), pricing was based on character volume:
| Plan | Monthly cost | Included characters | Overage rate |
|---|---|---|---|
| Free | $0 | 10,000 characters | $0.15/1,000 chars |
| Starter | $3 | 30,000 characters | $0.12/1,000 chars |
| Creator | $7 | 140,000 characters | $0.10/1,000 chars |
| Pro | $70 | 1,000,000 characters | $0.05/1,000 chars |
| Scale | $200 | 3,300,000 characters | $0.10/1,000 chars |
| Business | $500 | 10,000,000 characters | $0.05/1,000 chars |
| Enterprise | Custom | Custom | Custom |
Expression Measurement (while active) used pay-as-you-go rates: $0.0828 per minute for video with audio, $0.0639 per minute for audio only, $0.045 per minute for video only, $0.00204 per image, and $0.00024 per word for text.
Hume also offered startup grants and volume discounts for large-scale deployments. At high enough volume, the per-minute rate on EVI could fall below $0.02.
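The tier structure means a monthly bill is the plan fee plus any overage billed at the plan's per-minute rate; the short calculation below works through two examples drawn from the EVI table above.

```python
# Worked example of the tiered EVI pricing above: monthly cost is the plan fee
# plus overage minutes billed at the plan's per-minute rate.
def evi_monthly_cost(plan_fee: float, included_min: int, overage_rate: float,
                     minutes_used: int) -> float:
    overage = max(0, minutes_used - included_min)
    return plan_fee + overage * overage_rate


# A Pro-plan developer using 1,500 minutes pays $70 + 300 * $0.06 = $88.00.
print(evi_monthly_cost(70, 1_200, 0.06, 1_500))   # 88.0
# The same usage on Scale stays within the included 5,000 minutes: $200 flat.
print(evi_monthly_cost(200, 5_000, 0.05, 1_500))  # 200.0
```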
Several of Hume's documented customers operate in mental health and wellness. Ream, a mental health service for adults with ADHD, integrated EVI to enable daily five-minute AI coaching calls. The service found that voice-based coaching sessions powered by EVI doubled daily active users compared to text-based sessions. The company noted that EVI's emotional responsiveness contributed to user retention, with users expressing a sense of genuine connection to the coaching interaction.
hpy, a therapy platform, used both the Expression Measurement API and EVI in combination. Therapists received structured output from expression measurement during sessions, giving them data on patterns in clients' vocal and facial expression over time. Between sessions, clients could use EVI as a conversational companion providing therapy-informed support.
The Hume Initiative's list of supported use cases includes athletic and professional performance optimization. Developers have built coaching applications using EVI that deliver feedback calibrated to the learner's emotional state, adjusting tone when frustration or disengagement is detected rather than continuing to deliver information in a flat instructional register.
EVI's end-of-turn detection and interrupt handling make it useful for automated phone support, where natural conversation flow is important for user experience. Hume has positioned EVI as a tool for building voice agents in customer service contexts, with the emotional responsiveness providing a warmer interaction quality than systems that do not attend to vocal tone.
The Hume API is also used by developers building social AI companions, interactive fiction, and character animation. EVI 3's voice and personality cloning capabilities, combined with the ability to specify character voice through natural language prompts, have made it a tool for interactive entertainment applications where a large cast of distinct-sounding characters is needed without manual fine-tuning.
Hume's early research partnerships with Mount Sinai, Boston University Medical Center, and Harvard Medical School focused on applying expression measurement to clinical contexts: standardized patient screening, mental health diagnosis support, and monitoring patient emotional state during treatment. These applications used the Expression Measurement API rather than EVI.
The Hume Initiative is a nonprofit organization founded alongside Hume AI to develop ethical guidelines for empathic AI systems. Alan Cowen serves as its executive director. The initiative operates independently of the commercial company, though Cowen leads both.
The initiative convened an ethics committee of 11 experts spanning AI research, emotion science, legal practice, privacy law, public policy, and industry standards. Members include Ben Bland, chair of the IEEE P7014 Empathic AI Working Group; Danielle Krettek Cobb of the Google Empathy Lab; Kristen Mathews of Morrison & Foerster; and Edward Dove of the University of Edinburgh. Committee members without commercial ties to Hume AI voted on the initiative's guidelines.
The initiative's central concern is that AI systems capable of detecting and responding to emotional state can be used in ways that exploit vulnerability rather than support well-being. Current AI systems often optimize for engagement signals that do not correlate with user welfare, and emotion-aware AI could deepen this problem. The guidelines aim to distinguish between applications that improve human emotional experience and those that manipulate or exploit it.
Supported use cases in the initiative's framework include mental health applications, empathic digital assistants optimized for well-being, accessibility tools such as speech emotion transcription, health monitoring and crisis assessment, and social algorithms designed to surface content that is genuinely satisfying rather than merely habit-forming. Prohibited applications include systems that surface unhealthy temptations during vulnerable emotional states and tools designed to manipulate users through their emotional reactions.
Hume has stated that all of its commercial products are subject to review against the initiative's guidelines before release, and that enterprise customers using EVI for consumer-facing applications must agree to use policies derived from those guidelines.
The initiative's work addresses a regulatory gap. As of 2025, no major jurisdiction has specific legislation governing AI systems that measure or respond to emotional state. The European Union's AI Act classifies emotion recognition systems in certain contexts as high-risk, requiring conformity assessments before deployment, but the scope and enforcement timeline of those provisions remain unsettled. The Hume Initiative's framework predates these regulatory developments and was designed to provide enforceable guidelines while waiting for law to catch up.
The following table compares Hume AI's EVI with other major real-time voice AI platforms as of mid-2025.
| Feature | Hume EVI 3 | OpenAI Realtime API | Cartesia Sonic | ElevenLabs Conversational AI |
|---|---|---|---|---|
| Architecture | Unified speech-language model | Speech-to-speech (GPT-4o) | TTS-only; bring your own STT and LLM | Pipeline (STT + LLM + TTS) |
| Emotional intelligence | Native prosodic emotion modeling; bidirectional | Basic tone detection | Adequate for real-time use | Moderate; tag-based controls |
| Voice customization | 200,000+ voices; personality cloning from 30s audio | 6 preset voices | Custom voice cloning | Strong cloning; 3,000+ voices |
| LLM flexibility | Any LLM (Anthropic, Google, OpenAI, custom) | OpenAI models only | Any LLM | Any LLM |
| Latency (time to first byte) | ~300ms (high-end hardware); ~1.2s typical | Sub-second | Sub-100ms | Sub-second |
| Multilingual support | English primary; French, German, Italian, Spanish planned | Wide multilingual | English primary | Wide multilingual |
| Pricing (approx.) | ~$4.32/hour (EVI 2 rate); volume discounts to <$0.02/min | ~$9.00/hour | Lower than Hume at scale | Subscription and usage tiers |
| Primary strength | Emotional fidelity; ethical framework | OpenAI ecosystem integration | Latency optimization | Voice quality and cloning depth |
In comparisons conducted by Hume using blind human ratings, EVI 3 was rated higher than GPT-4o on all seven dimensions tested, and outperformed Sesame AI on emotion and style modulation across 30 distinct styles.
EVI's emotional inference is probabilistic. The model estimates emotional state from acoustic signals and does not have direct access to what a user is actually feeling. Prosodic cues can be ambiguous: a flat tone might indicate boredom, fatigue, or simply a reserved speaking style. The system can misread emotional context, particularly across cultural variation in prosodic norms.
Despite the Hume Initiative's ethical framework, the same technology that enables supportive emotional responsiveness could be adapted to manipulative purposes by actors who deploy it outside the initiative's guidelines. Hume's use policies create contractual constraints but cannot prevent misuse by parties who agree to those policies and then violate them.
The Expression Measurement API, which provided the most detailed structured output of emotional signal across modalities, is being deprecated in mid-2026. Developers who built analytical tools on top of it will need to migrate to EVI's embedded expression features or find alternative solutions.
EVI 3's voice and personality cloning capabilities, while subject to Hume's policies, raise questions about consent and impersonation that the company acknowledges but has not fully resolved. The 30-second cloning threshold is low enough that a voice sample obtained without a subject's awareness could in principle be used to generate cloned output.
Language support beyond English remains limited relative to competitors. OpenAI's Realtime API and ElevenLabs both cover a wider range of languages. Hume's roadmap includes French, German, Italian, and Spanish, but as of mid-2025 EVI 3 was primarily tested and optimized for English.