Inworld AI is an American artificial intelligence company headquartered in Mountain View, California, that builds real-time voice and character AI infrastructure for games, interactive applications, and voice agents. Founded in September 2021 by Ilya Gelfenbeyn, Kylan Gibbs, and Michael Ermolenko, the company initially focused on creating AI-driven non-player characters (NPCs) for video games and virtual environments. Beginning in 2024 and accelerating through 2025, Inworld broadened its platform into a general-purpose voice AI stack, offering text-to-speech models, speech-to-text, and a low-latency orchestration layer called the Agent Runtime. Its flagship TTS line, anchored by the Realtime TTS-2 model launched in May 2026, ranks first on the Artificial Analysis TTS leaderboard by blind human preference evaluation. Customers range from indie game developers and language learning applications to major studios including Ubisoft; the company also maintains partnerships with Xbox and NVIDIA and enterprise deployments at NBCUniversal and Logitech Streamlabs.
The founding team shares deep roots in conversational AI infrastructure. Ilya Gelfenbeyn co-founded API.AI, a developer platform for natural language understanding and speech recognition, and served as its chief executive officer. API.AI grew to serve millions of developers before Google acquired it in September 2016; following the acquisition, Google rebranded the platform as Dialogflow, one of the most widely deployed conversational AI frameworks in the industry. Gelfenbeyn also built Assistant.ai, an independent voice assistant that attracted more than 40 million users. After the acquisition, he joined Google's developer ecosystem programs before departing to co-found Inworld.
Michael Ermolenko led AI engineering at API.AI before and after the Google acquisition, accumulating direct experience building large-scale conversational pipelines. Kylan Gibbs took a different path: he worked as a product manager at Google DeepMind and earlier as a consultant at Bain & Company. Gibbs also co-founded FlowX, a startup that was acquired before he departed to join Gelfenbeyn and Ermolenko.
John Gaeta, an Academy Award-winning visual effects designer best known for pioneering the "bullet time" technique in The Matrix, joined as Chief Creative Officer in April 2022, lending creative credibility to the company's vision of immersive AI characters. Gaeta transitioned to a strategic advisor role in 2024.
Gelfenbeyn, Gibbs, and Ermolenko founded Inworld AI in September 2021 with a specific thesis: virtual environments -- video games, online worlds, and nascent metaverse platforms -- were rapidly increasing in the amount of time users spent inside them, yet those environments remained socially shallow. NPCs in most games still relied on branching dialogue trees written by hand, offering limited interactivity and no capacity to respond dynamically to player actions or language. The founders believed that generative AI had reached a threshold where AI-driven characters could replace scripted dialogue with contextual, personality-consistent conversation.
The company's initial product, the Character Engine, was designed as a layered system. A "Character Brain" layer handled personality definition, emotional states, and long-term memory. A "Contextual Mesh" grounded characters in game-specific lore and rules, mitigating the risk of AI models generating responses that broke fictional immersion or "hallucinated" out-of-game knowledge. A "Real-Time AI" layer handled low-latency inference delivery to game engines. Game developers could define a character through natural language description -- specifying backstory, personality traits, speech patterns, and knowledge constraints -- without writing explicit if-then dialogue logic.
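The declarative character description the paragraph above describes can be sketched in code. The field names and grounding-prompt step below are illustrative assumptions, not Inworld's actual schema; they only show how backstory, traits, and knowledge constraints might replace if-then dialogue logic.

```typescript
// Hypothetical character definition, illustrating the kind of declarative
// description the Character Engine accepts in place of branching dialogue
// logic. Field names here are illustrative, not Inworld's actual schema.
interface CharacterDefinition {
  name: string;
  backstory: string;              // grounds long-term memory ("Character Brain")
  personalityTraits: string[];    // steers tone and emotional state
  speechPatterns: string;         // e.g. dialect, verbosity
  knowledgeConstraints: string[]; // "Contextual Mesh"-style lore boundaries
}

const blacksmith: CharacterDefinition = {
  name: "Tharn",
  backstory: "A retired soldier who now forges tools in the village of Oakrest.",
  personalityTraits: ["gruff", "loyal", "superstitious"],
  speechPatterns: "Short sentences; avoids modern idioms.",
  knowledgeConstraints: [
    "Knows nothing of events outside the kingdom.",
    "Must never reference the real world or the player's hardware.",
  ],
};

// One plausible rendering step: flatten the definition into a grounding
// prompt that constrains the underlying language model.
function toGroundingPrompt(c: CharacterDefinition): string {
  return [
    `You are ${c.name}. ${c.backstory}`,
    `Traits: ${c.personalityTraits.join(", ")}.`,
    `Style: ${c.speechPatterns}`,
    `Hard constraints: ${c.knowledgeConstraints.join(" ")}`,
  ].join("\n");
}
```

The key design point is that the developer authors a description, not dialogue branches; the constraint list is what the Contextual Mesh concept uses to keep generated lines inside the fiction.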
Inworld joined the Disney Accelerator program in 2022, gaining access to resources and strategic relationships within Disney's entertainment portfolio. At the 2022 Disney Accelerator Demo Day, the company presented a "Droid Maker" prototype built in collaboration with ILM Immersive, the storytelling studio within Lucasfilm. The demo allowed users to assemble and converse with interactive Star Wars droids, illustrating how AI characters could extend major intellectual property into interactive experiences.
During the same period, community modders began integrating Inworld's API into popular titles including The Elder Scrolls V: Skyrim, Stardew Valley, and Grand Theft Auto V. These unofficial mods attracted significant attention online and demonstrated consumer appetite for dynamic NPC conversation, while also functioning as organic demonstration of the platform's versatility.
At the Game Developers Conference (GDC) in 2024, Inworld appeared alongside several major industry names. Ubisoft presented NEO NPCs, a research prototype built using the Inworld Engine that showed NPCs capable of environmental awareness, real-time reaction and animation, conversation memory, and collaborative decision-making with other characters. The demo represented Ubisoft's first public-facing generative AI NPC prototype and was developed in conjunction with NVIDIA's Avatar Cloud Engine (ACE) technology.
Microsoft's Xbox division announced a multi-year co-development partnership with Inworld to build AI-assisted tools for game narrative creators. One output of this collaboration, called the Narrative Graph, allows developers to upload source material and generate branching narrative structures that visualize story logic. Another tool, Project Explora, extended narrative AI into broader game design workflows.
NVIDIA integrated Inworld's character intelligence into its Covert Protocol demo at GDC 2024, a social-simulation experience where players acted as private detectives completing objectives through conversation with AI digital humans. The demo used NVIDIA's hardware acceleration alongside Inworld's inference pipeline to demonstrate viable real-time AI character interactions.
As generative AI infrastructure matured and competition intensified, Inworld broadened its strategic focus. The company repositioned from a character-creation platform specifically for games toward a general-purpose real-time voice AI infrastructure provider. This shift was partly driven by the recognition that the technical challenges underlying game NPCs -- low latency, expressive synthesis, voice cloning, multi-turn context management -- were the same challenges faced by voice agent builders, language learning applications, and interactive entertainment products outside of games.
Inworld launched dedicated TTS and STT APIs, followed by the Agent Runtime, a C++ orchestration engine that developers can use to build voice pipelines combining LLMs from multiple providers (OpenAI, Anthropic, Google, Mistral), TTS, STT, memory, and tool integrations in a single configurable graph. The company simultaneously launched native integrations with LiveKit, Pipecat, Vapi, and NLX, all widely used open-source and commercial voice agent frameworks.
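The Agent Runtime itself is a C++ engine, but the provider-swappable graph idea can be sketched in a few lines of TypeScript. Everything below (node names, wiring, stub providers) is an illustrative assumption, not the Runtime's actual API.

```typescript
// Minimal sketch of a configurable voice pipeline: STT -> LLM -> TTS nodes
// composed as a graph, with the LLM provider swappable in one place.
// Stubs stand in for real provider calls; this is not Inworld's API.
type Node = (input: string) => Promise<string>;

const stt: Node = async (audio) => `transcript(${audio})`;
const makeLlm = (provider: string): Node =>
  async (text) => `${provider}-reply(${text})`;
const tts: Node = async (text) => `audio(${text})`;

// A pipeline is an ordered chain of nodes; swapping "openai" for another
// provider changes one node without touching the rest of the graph.
function pipeline(...nodes: Node[]): Node {
  return async (input) => {
    let out = input;
    for (const node of nodes) out = await node(out);
    return out;
  };
}

const agent = pipeline(stt, makeLlm("openai"), tts);
// agent("mic-frame") resolves to "audio(openai-reply(transcript(mic-frame)))"
```

This is the property that makes A/B testing model configurations cheap: the graph is data, so substituting one provider node does not require redeploying the rest of the pipeline.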
Inworld raised a $7 million seed round in November 2021; combined with pre-seed and additional seed capital, its early rounds totaled approximately $20 million. The company announced a $50 million Series A in August 2022, positioning itself as a character AI platform for games and the metaverse. The round brought total funding to approximately $70 million at the time.
In August 2023, Inworld raised an additional $50 million, bringing its post-money valuation to over $500 million and making it, by its own characterization, the best-funded startup at the intersection of AI and gaming. The round was led by Lightspeed Venture Partners, with participation from Stanford University, Samsung Next, Microsoft's M12 fund, First Spark Ventures (co-founded by former Google CEO Eric Schmidt), and LG Technology Ventures. Total funding reached more than $100 million following this round.
The company's investor base over its lifetime has included Lightspeed Venture Partners, Kleiner Perkins, Founders Fund, CRV, Intel Capital, Meta, Microsoft M12, Samsung Next, Stanford University, LG Technology Ventures, and Bitkraft Ventures, among others. Notable angel investors include Twitch co-founder Kevin Lin and Oculus co-founder Nate Mitchell, alongside gaming executives from Riot Games and Animoca Brands.
Inworld's original and still prominent application is enabling game developers to build NPCs with dynamic dialogue and behavioral autonomy. Traditional NPC dialogue relies on pre-authored trees: writers compose every possible exchange, and the game engine navigates those branches based on player choices. This approach scales poorly with narrative complexity, produces rigid interactions that players can exhaust, and cannot respond coherently to player input that deviates from anticipated paths.
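The pre-authored tree model described above can be made concrete with a toy sketch. The structure is generic (not any particular engine's dialogue format) and shows the core limitation: input outside the authored branches has nowhere to go.

```typescript
// A hand-authored dialogue tree of the kind dynamic NPC systems aim to
// replace: every exchange is pre-written, and unanticipated player input
// simply falls off the tree.
interface DialogueNode {
  line: string;
  choices: Record<string, DialogueNode>; // player choice -> next node
}

const guard: DialogueNode = {
  line: "Halt! State your business.",
  choices: {
    "I'm a merchant": { line: "Move along, then.", choices: {} },
    "None of yours": { line: "Watch your tongue.", choices: {} },
  },
};

// Rigid navigation: there is no dynamic fallback, only a canned non-response.
function respond(node: DialogueNode, playerInput: string): string {
  const next = node.choices[playerInput];
  return next ? next.line : "(The guard stares blankly.)";
}
```

Every reachable line must be written by hand, so authoring cost grows with the number of branches while coverage of free-form player input stays at zero, which is exactly the scaling problem the paragraph describes.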
Inworld's Character Engine replaces static trees with an AI system that generates responses dynamically while keeping them grounded in a character's defined personality, knowledge base, and the game's fictional rules. Characters are defined through a combination of natural language descriptions, memory modules, and lore files that developers upload to the Contextual Mesh. The system uses this context to constrain the underlying large language model, reducing hallucinations and maintaining narrative consistency.
The platform supports multi-agent scenarios in which two to five AI characters can converse autonomously with each other and with players simultaneously, coordinated by a Director Layer that manages conversational flow and prevents characters from talking over each other or diverging from narrative logic.
Inworld provides native SDK integrations for Unreal Engine and Unity, the two dominant commercial game engines. For Unreal, the Inworld AI NPC Engine plugin (version 1.5 as of 2025) includes a prebuilt dialogue and behavior system. The company launched the Unreal AI Runtime as a unified interactive AI toolkit for game developers in October 2025. Unity support arrived in subsequent releases. The Agent Runtime also ships SDKs for Node.js, enabling web-based interactive experiences.
An open-source Godot SDK was released for developers building games on the free and open-source engine, broadening the addressable developer base beyond commercial engine users.
NetEase Games: Inworld worked with NetEase's Team Miaozi studio to build fully AI-controlled NPCs that respond in real time in playable game builds. NetEase also integrated Inworld's character AI into Cygnus Enterprises as a generative AI-powered companion.
Niantic: The creator of Pokemon Go used Inworld AI to power Wol, an augmented reality experience set in Muir Woods that lets visitors converse with interactive characters representing the redwood ecosystem. The project demonstrated AI NPC applications outside of screen-based games.
Ubisoft: As described above, Ubisoft's NEO NPC prototype showcased at GDC 2024 used Inworld's engine. The prototype explored NPCs with environmental awareness, real-time emotional reactions, and inter-character strategic collaboration.
Xbox (Microsoft): A multi-year co-development agreement produced the Narrative Graph tool and Project Explora, AI-assisted authoring tools for game writers.
NVIDIA: The Covert Protocol demo at GDC 2024 combined NVIDIA ACE hardware acceleration with Inworld's character AI and was used at industry events to illustrate next-generation social simulation.
Indie and community titles: Inworld-powered mods for Skyrim, Stardew Valley, and Grand Theft Auto V attracted community attention. The title Vaudeville, an indie puzzle game, gained Steam traction in 2023 using Inworld's dialogue system.
Death by AI: An AI-native game that reached 20 million players within two months of launch. The studio's API costs scaled from $5,000 to $250,000 per month in two weeks following viral growth. Inworld built custom APIs and optimization layers that returned costs to sustainable levels, and the game subsequently reached profitability.
Status by Wishroll: A social AI application that reached one million users within two weeks of its public beta launch in February 2025, powered by Inworld's voice infrastructure. The product ranked as one of the fastest consumer AI apps to reach that milestone.
In August 2025, Inworld launched its first standalone text-to-speech models under the designations Realtime TTS 1 and Realtime TTS 1-Max. Both are transformer-based autoregressive models built on LLaMA backbones and trained using a sequential pipeline of pre-training, supervised fine-tuning, and reinforcement learning alignment.
TTS 1-Max employs LLaMA-3.1-8B as its speech language model (SpeechLM) backbone, yielding approximately 8.8 billion parameters. The architecture uses X-codec2, an audio codec that merges acoustic and semantic information into a single codebook of 65,536 tokens. The model's vocabulary was expanded from the LLaMA base of 128,256 tokens to 193,856 tokens, incorporating audio tokens and special control tokens. The audio decoder converts generated tokens back into 48 kHz waveforms.
TTS 1, the smaller variant, was designed for real-time synthesis and on-device use cases, with approximately 1.6 billion parameters. Both models support zero-shot voice cloning: given a reference audio clip of 5 to 15 seconds, either model can replicate that speaker's voice characteristics for new utterances without fine-tuning or additional training, using in-context learning.
In blind human evaluation at launch, TTS 1-Max achieved win rates of 59.1% against ElevenLabs, 60.9% against Cartesia, 55.3% against TTS 1, and 60.7% against OpenAI TTS-1-HD. These benchmarks reflected pairwise comparisons where human raters chose their preferred sample without knowing which model produced it.
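A win rate like those above is just wins over total pairwise judgments; the sketch below shows the arithmetic (the rater counts are invented for illustration, and treating ties as half a win is one common convention, not necessarily the one used in these evaluations).

```typescript
// Blind pairwise preference: raters pick between two anonymized samples;
// the win rate is wins over total comparisons, with ties (if counted)
// conventionally worth half a win.
function winRate(wins: number, losses: number, ties = 0): number {
  return (wins + ties / 2) / (wins + losses + ties);
}

// e.g. 591 preferred samples out of 1,000 judgments -> 0.591, i.e. 59.1%
// (the 1,000-judgment count is illustrative, not from the evaluation)
const rate = winRate(591, 409);
```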
At launch, supported languages in production included English (all major accent variants), Mandarin Chinese, Korean, Dutch, French, and Spanish. Japanese, German, Italian, Polish, and Portuguese were available in experimental status.
Inworld updated the TTS line with the 1.5 generation, introducing what the company described as 30% greater expressiveness over TTS 1. The 1.5 generation came in two variants, both of which expanded language support to 15 languages and served as the foundation for subsequent model generations.
In May 2026, Inworld launched Realtime TTS-2 as its new flagship voice model, available initially as a research preview. TTS-2 represents an architectural advance beyond the TTS 1.x generation by incorporating closed-loop audio context: the model processes the full audio of an ongoing conversation, not just the text of the current turn. This allows TTS-2 to perceive the listener's tone, pacing, and emotional state from their speech and adapt its own delivery in response -- a property the company refers to as "contextual empathy."
Developers can steer TTS-2 using natural language instructions rather than discrete emotion tags. An instruction like "tired but warm after a long day" applied to a character's voice is processed directly by the model, analogously to how a system prompt steers an LLM. Inline controls handle specific moments -- whispering, sighing, laughter -- at precise timestamps within the generated audio.
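A request combining a free-form delivery instruction with inline controls might look like the sketch below. The field names and the bracketed `[control]` marker syntax are assumptions for illustration; the source does not specify the actual wire format.

```typescript
// Illustrative TTS-2-style request: a natural-language instruction steering
// overall delivery, plus an inline control inserted at a specific point in
// the text. Field names and [marker] syntax are hypothetical.
interface TtsRequest {
  text: string;        // may embed inline control markers
  voiceId: string;     // hypothetical voice identifier
  instruction: string; // free-form direction, analogous to a system prompt
}

// Insert a marker such as [sigh] or [whisper] at a character offset.
function withInlineControl(text: string, index: number, control: string): string {
  return text.slice(0, index) + `[${control}]` + text.slice(index);
}

const request: TtsRequest = {
  voiceId: "narrator-1",
  instruction: "tired but warm after a long day",
  text: withInlineControl("Well... we made it home.", 8, "sigh"),
};
// request.text === "Well... [sigh]we made it home."
```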
TTS-2 maintains a consistent voice identity across more than 100 languages with on-the-fly language switching within a single generation, without accent carryover. The model achieves P90 first-chunk latency under 250 milliseconds. It is priced at $35 per million characters.
Integration partners at launch included Layercode, LiveKit, NLX, Pipecat, Vapi, and Voximplant.
Voice cloning is a central capability across all Inworld TTS generations. The platform offers two modes:
Instant cloning: Available free for all API users. Providing 5 to 15 seconds of reference audio produces a production-ready voice clone in seconds through zero-shot inference. The resulting voice can generate any new text in the cloned speaker's style.
Professional cloning: Intended for enterprise deployments requiring maximum fidelity. Requires 30 or more minutes of reference audio and is processed to optimize the voice for specific use cases. Available under custom commercial agreements.
Both modes include voice consent safeguards designed to prevent unauthorized cloning of real individuals. Generated audio is watermarked to identify it as synthetically produced, consistent with emerging AI disclosure norms.
Inworld's cross-lingual voice cloning feature, introduced with TTS-2, preserves a speaker's voice identity when switching output languages, allowing a single voice definition to serve global deployments without the speaker needing to record in each target language.
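The two cloning modes' stated reference-audio requirements (5 to 15 seconds for instant, 30 or more minutes for professional) map naturally onto a client-side eligibility check. The helper below is an illustrative sketch, not part of any Inworld SDK.

```typescript
// Map a reference clip's duration to the cloning modes it qualifies for,
// using the thresholds stated in the text: 5-15 s for instant cloning,
// 30+ minutes for professional cloning.
type CloneMode = "instant" | "professional";

function eligibleModes(referenceSeconds: number): CloneMode[] {
  const modes: CloneMode[] = [];
  if (referenceSeconds >= 5 && referenceSeconds <= 15) modes.push("instant");
  if (referenceSeconds >= 30 * 60) modes.push("professional");
  return modes;
}
```

A 10-second clip qualifies only for instant cloning, a 40-minute recording only for professional, and a 20-second clip for neither, so an application can route uploads to the right mode before any server round-trip.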
The Inworld Agent Runtime is an open orchestration infrastructure introduced in 2024 and actively developed through 2025. Implemented in C++ with SDKs for Node.js and Unreal Engine, the Runtime is a graph-based engine that connects LLMs, speech-to-text, text-to-speech, memory systems, knowledge bases, and external tools into a configurable real-time pipeline.
Key capabilities include graph-based pipeline composition; provider-agnostic model selection across LLM, STT, and TTS components; A/B testing of model configurations without redeployment; and native integrations with voice agent frameworks such as LiveKit, Pipecat, Vapi, and NLX.
The Unreal Engine integration of the Runtime entered early access in October 2025. Unity support was announced for subsequent availability. The Runtime is free to use, with costs accruing only from model consumption.
Real-world deployments validated through GDC 2025 include Status by Wishroll, which reduced infrastructure costs by over 95% while scaling to 500,000 or more daily active users using the Runtime.
Alongside TTS, Inworld offers a Realtime STT (speech-to-text) API and a Realtime API that combines STT, LLM, and TTS into a single low-latency round-trip pipeline for conversational agents. Together with the Agent Runtime, these components form a full-stack voice agent infrastructure. The company positions this offering as comparable in capability to OpenAI's Realtime API but with provider flexibility, lower per-character pricing, and gaming-specific optimization.
Beyond game studios, Inworld's customer base as of 2025 and 2026 spans:
Language learning: Talkpal AI uses Inworld TTS to serve five million language learners, achieving a 40% reduction in TTS costs while improving feature engagement by 7% and user retention by 4%. The application serves multiple language pairs including English, German, and French.
Interactive entertainment: Bible Chat, with approximately 800,000 daily active users, uses Inworld's voice stack for conversational biblical content. Status by Wishroll reached one million users two weeks after public beta.
Live streaming: Logitech Streamlabs integrated Inworld's AI into a streaming intelligence agent for content creators.
Media and entertainment: NBCUniversal has used Inworld's platform for interactive media applications.
Automotive and consumer electronics: Alpine Electronics and LG Uplus have deployed Inworld-powered voice experiences in automotive and smart device contexts.
Inworld competes across two distinct markets: character AI for games, and real-time voice synthesis. In character AI, its primary B2B competitors include Convai (founded 2022) and Artificial Agency (founded 2023). In voice synthesis, its competitors include ElevenLabs and Cartesia, alongside OpenAI's TTS offerings.
| Feature | Inworld AI | ElevenLabs | Cartesia |
|---|---|---|---|
| Primary focus | Voice AI + Game NPCs | Voice synthesis + cloning | Real-time voice synthesis |
| Flagship TTS model | Realtime TTS-2 | Multilingual v2 / Flash v2.5 | Sonic 3 |
| TTS architecture | Autoregressive transformer (LLaMA backbone) | Autoregressive transformer | State space model (SSM) |
| Time-to-first-audio (P90) | Under 250ms (TTS-2) | ~75ms (Flash v2.5) | 40 to 90ms (Sonic 3) |
| Languages | 100+ (TTS-2) | 32+ | 17+ |
| Voice cloning | Zero-shot, instant and professional tiers | Instant and professional tiers | Instant cloning |
| Game engine SDKs | Unreal Engine, Unity, Godot | None native | None native |
| Agent Runtime / orchestration | Yes (C++ core, Node.js, Unreal) | Limited (Voice API) | Limited |
| Pricing per million characters | $15 (Mini) to $35 (TTS-2) | ~$180 to $300+ | ~$65 (Sonic) |
| Artificial Analysis ranking | 1st (blind preference) | Top 5 | Top 5 |
Inworld's principal differentiator versus ElevenLabs is its lower pricing and game-specific infrastructure. ElevenLabs has a broader library of pre-built voices and a more mature dubbing and audio book workflow, while Inworld focuses on latency and developer-facing APIs for real-time interactive contexts. Versus Cartesia, Inworld offers better expressiveness and higher quality in blind tests, while Cartesia holds an advantage in raw latency for the very lowest-latency deployments. Versus Character.AI, Inworld targets B2B developer infrastructure rather than end-user consumer chat, with no direct consumer-facing character chat product.
Character.AI is a consumer-facing AI companion platform that allows end users to chat with fictional, celebrity-inspired, or user-created AI characters. It is not primarily a developer platform and does not provide game engine SDKs or real-time TTS APIs. The two companies serve largely non-overlapping markets: Character.AI targets recreational chat users, while Inworld targets developers building interactive products. Character.AI's backing from Andreessen Horowitz and its large consumer user base (reportedly hundreds of millions of conversations) represent scale in the consumer space that Inworld does not pursue.
Replika is an AI companion application focused on emotional support, personal wellness, and relationship simulation. Like Character.AI, Replika is a consumer product rather than developer infrastructure. It does not offer B2B APIs or game engine integrations. Inworld is sometimes compared to Replika in discussions of AI character experiences, but the two companies have distinct business models: Replika monetizes subscriptions with individual end users, while Inworld monetizes API consumption with developers and studios.
While gaming remains central to Inworld's identity, the company's voice infrastructure has found application in several adjacent domains:
Language learning: Conversational language tutors benefit from low-latency, expressive TTS with realistic accent representation. The ability to define a tutor's persona and speaking style through the Character Engine allows language apps to differentiate the conversational experience.
Customer service and sales agents: Enterprise applications use the Agent Runtime to build voice-enabled customer service workflows that combine LLM reasoning with expressive TTS output and STT input.
Accessibility: High-quality synthetic voice output with emotional nuance extends accessibility tools for users who rely on text-to-speech for reading and communication.
Training simulations: Medical, military, and corporate training applications use AI characters to run scenario-based simulations where trainees interact verbally with characters representing patients, customers, or adversaries.
Interactive entertainment outside games: Experiences like Niantic's Wol augmented reality product in Muir Woods illustrate how location-based and mixed-reality applications can use AI characters to create contextual, conversational interactions in physical spaces.
Content creation tools: The Xbox Narrative Graph co-development illustrates how AI character infrastructure can assist writers and designers as a creative tool rather than a runtime component, generating narrative structures for human review.
Inworld received sustained press attention beginning with its Disney Accelerator participation in 2022 and accelerating through the Series B announcement in 2023 and its GDC 2024 showcase. TechCrunch's August 2023 coverage of the Series B described the company as the leading generative AI platform for NPC creation. VentureBeat and GamesBeat covered the Ubisoft NEO NPC demo as one of the most concrete examples of AAA generative AI NPC research shown publicly.
The voice AI pivot and TTS launch in August 2025 generated coverage in the AI developer community, with the model's first-place ranking on the Artificial Analysis TTS leaderboard cited widely. The marktechpost.com coverage of Realtime TTS-2 in May 2026 described it as a significant advance in contextually adaptive synthesis.
Developer reception to the Agent Runtime has been positive among teams building real-time voice agents, particularly for its provider-agnostic architecture and the ability to A/B test model configurations without redeployment. The Runtime's free base tier with consumption-based pricing has lowered experimentation costs for smaller teams.
Reception has not been uniformly positive. In the 2024 State of Game Industry survey, only 21% of developers believed generative AI would have a positive impact on game development, reflecting broad industry skepticism about AI technology integration in creative workflows. Concerns about NPC hallucination -- characters generating responses that contradict established game lore or behave inappropriately -- have persisted as a technical and reputational challenge. Inworld's Contextual Mesh is designed to mitigate this risk but does not eliminate it entirely; developers report that edge cases in complex narrative environments still require significant tuning.
The expansion into voice cloning has also attracted scrutiny, primarily from voice actors and performers' guilds. SAG-AFTRA and related organizations have raised concerns about synthetic voice replication displacing voice acting work, and the widespread commercial availability of instant cloning from short audio clips accelerates that displacement risk regardless of provider-specific consent safeguards.
Several limitations affect Inworld's platform as of 2025 and 2026:
Inference cost at scale: Large-scale NPC deployments require many simultaneous AI inferences, and the per-API-call cost model can become prohibitive for games with millions of concurrent players. The Death by AI case study illustrates how rapid user growth can generate unsustainable cost spikes. Inworld has worked with affected customers to build custom optimization layers, but this requires individualized engineering effort.
Hallucination and lore breakage: Despite the Contextual Mesh, NPCs can generate responses that break narrative immersion, contradict established facts, or produce content inappropriate for the target audience. This is a fundamental challenge of using probabilistic generative models in constrained fictional contexts and is shared across all generative NPC platforms.
Latency floor: Although Inworld's TTS achieves sub-250ms first-chunk latency for most models, some competitors using state space models achieve sub-100ms latency, which can be significant for highly interactive real-time applications. Inworld's Mini variants target 130ms, but the higher-quality flagship models trade some latency for expressiveness.
Unpredictability in production: Developers building against cloud AI APIs face challenges from provider updates, model changes, and pricing revisions outside their control. The Agent Runtime's provider-agnostic design partially addresses this by enabling provider switching, but the underlying models remain external dependencies.
Adoption timeline in AAA gaming: As of 2025, most games shipping with Inworld-powered NPCs are indie or mid-tier titles. Major AAA studios have conducted research and prototype work but have not yet shipped widely available titles with generative NPC dialogue at scale. The transition from demo to production at AAA scale involves challenges including quality assurance, localization, legal review of AI-generated content, and integration with existing studio pipelines.
Voice actor relations: The voice synthesis and cloning capabilities place Inworld in tension with the voice acting community. While the platform includes consent mechanisms, the technical capability to clone voices from short samples raises questions about long-term employment effects in the voice acting industry that Inworld has not fully resolved.