Voice Engine (OpenAI)
Last reviewed
Jun 3, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,447 words
Voice Engine is a speech-generation and voice-cloning model developed by OpenAI that can produce natural-sounding speech resembling a specific person from a single audio sample of about 15 seconds. OpenAI publicly previewed the model on March 29, 2024, in a blog post titled "Navigating the challenges and opportunities of synthetic voices," but declined to release it widely, citing the risks of synthetic-voice misuse during a year of major elections.[1][2] As of mid-2025, more than a year after the preview, OpenAI had still not made Voice Engine generally available and kept it restricted to a small set of trusted partners.[3]
Voice Engine is the underlying model that powers OpenAI's text-to-speech preset voices and the spoken-output features in ChatGPT, and it is distinct from Whisper (OpenAI's speech-to-text system) and from Advanced Voice Mode (the real-time conversational voice feature in ChatGPT).[1][4]
OpenAI says it first developed Voice Engine in late 2022 and has used it internally since then to provide the preset voices available in its text-to-speech API, as well as the ChatGPT Voice and Read Aloud features.[1][4] The same model can also generate speech in multiple languages, including languages other than that of the original speaker, which underpins some of the translation use cases explored by partners.[2][5]
The model was originally referred to internally as "Custom Voices." OpenAI had reportedly planned to bring it to the API on March 7, 2024, offering access to up to 100 developers building applications with a clear social benefit or "innovative and responsible" uses. Proposed pricing was about $15 per million characters for standard voices and $30 per million characters for "HD" quality. The company postponed that launch at the last minute and instead unveiled Voice Engine a few weeks later as a limited preview without a public sign-up, available only to a cohort of around 10 developers it had begun working with in late 2023.[3][6]
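The reported pricing lends itself to a quick worked estimate. The sketch below is illustrative only: the rates are the proposed figures reported for the postponed API launch, not published prices, and the function name and tier labels are assumptions introduced here rather than part of any OpenAI interface.

```python
# Illustrative only: rates are the *reported, proposed* prices for the
# postponed launch, not published OpenAI pricing. Tier names are assumptions.
REPORTED_RATE_USD_PER_MILLION_CHARS = {
    "standard": 15.0,
    "hd": 30.0,
}


def estimate_cost_usd(num_characters: int, tier: str = "standard") -> float:
    """Estimate the cost of synthesizing `num_characters` of input text."""
    if num_characters < 0:
        raise ValueError("num_characters must be non-negative")
    rate = REPORTED_RATE_USD_PER_MILLION_CHARS[tier]
    return num_characters / 1_000_000 * rate


# A 250,000-character script under each reported tier.
print(estimate_cost_usd(250_000))        # 3.75
print(estimate_cost_usd(250_000, "hd"))  # 7.5
```

At these reported rates, the HD tier would have cost exactly twice the standard tier for the same text.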
Voice Engine takes a short reference recording of a target speaker, around 15 seconds long, together with a passage of input text, and generates speech that reads the text in a voice resembling the reference speaker.[1][2] According to reporting on the preview, the model uses a combination of diffusion and transformer techniques, and the reference audio submitted to the system is discarded after a request is completed.[6] The preview did not expose fine-grained controls for adjusting tone, pitch, or cadence.[6]
The table below summarizes the model's reported capabilities.
| Capability | Detail |
|---|---|
| Reference sample | Single audio clip of roughly 15 seconds |
| Output | Natural-sounding speech reading arbitrary text in the reference voice |
| Languages | Multiple languages, including languages other than the speaker's own |
| Architecture (reported) | Combined diffusion and transformer approach |
| Reference handling | Submitted audio discarded after the request completes |
| Provenance | Generated audio carries a watermark to trace its origin |
OpenAI distinguishes Voice Engine from its other audio systems. Whisper transcribes speech into text and does not generate audio, while Advanced Voice Mode is a low-latency, interactive spoken-conversation capability built into ChatGPT. Voice Engine, by contrast, is the text-to-speech generation model that synthesizes audio output.[1][4]
Rather than a public launch, OpenAI shared Voice Engine with a small group of partners who agreed to its usage policies, including obtaining the explicit consent of any person whose voice was used, not impersonating individuals or organizations without permission, and disclosing to audiences that the voices were AI-generated.[1][2] OpenAI highlighted several partners and the ways they were testing the model.
| Partner | Field | Reported use case |
|---|---|---|
| Age of Learning | Education | Generating natural, emotive voice-over for pre-scripted educational content aimed at non-readers and children |
| HeyGen | Video / media | Translating video content so creators and businesses can reach audiences in multiple languages |
| Dimagi | Global health | Providing interactive feedback to community health workers in their native languages, including Swahili and a mix of Swahili and English (Sheng) |
| Livox | Accessibility | Powering augmentative and alternative communication (AAC) devices with more natural, distinct voices in multiple languages for people with disabilities |
| Lifespan, Norman Prince Neurosciences Institute | Healthcare | Exploring clinical use to restore the voices of patients with speech impairments from conditions such as brain tumors |
The most widely cited example came from the Norman Prince Neurosciences Institute at Lifespan, a nonprofit health system affiliated with Brown University. Clinicians there, including Rohaid Ali and pediatric neurosurgeon Konstantina Svokos, used Voice Engine to help restore the voice of a young patient who had lost fluent speech because of a brain tumor, recreating her voice from a recording made for a school project before her condition worsened.[2][5] OpenAI presented these pilots as illustrations of potential benefits in education, accessibility, translation, and care for people who have lost the ability to speak.[1]
OpenAI framed the preview explicitly as a discussion about safety rather than a product launch, stating that generating speech resembling real people carries serious risks that were "especially top of mind in an election year."[2] The company noted that 2024 would see voting in more than 80 countries and pointed to real-world misuse of synthetic audio. Examples included a January 2024 robocall in New Hampshire that used an AI-generated imitation of U.S. President Joe Biden, which prompted action from the U.S. Federal Communications Commission, and the use of AI-generated speech tied to Pakistan's Imran Khan.[2][6] These concerns were a central reason the company chose not to release the model broadly.
OpenAI said it implemented technical safeguards for the preview, including watermarking generated audio to trace its origin and proactively monitoring how partners used the system.[1][2] It also set out a series of steps it believed society should take alongside any broad deployment of synthetic-voice technology.
| Recommendation | Description |
|---|---|
| Phase out voice authentication | Stop using voice as a security factor for accessing bank accounts and other sensitive information |
| No-go voice list | Maintain lists that detect and prevent the creation of voices too similar to prominent figures |
| Provenance and watermarking | Accelerate techniques for tracking the origin of audiovisual content, such as watermarking |
| Public education | Educate the public about the capabilities and limits of AI, including the possibility of deceptive AI-generated content |
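The "no-go voice list" recommendation can be pictured as a similarity check against a set of protected reference voices. The sketch below is a hypothetical illustration and does not describe OpenAI's method: it assumes voices are already represented as fixed-length speaker embeddings, and the cosine-similarity comparison, the 0.85 threshold, and all names are assumptions chosen for the example.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_no_go_list(
    candidate: Sequence[float],
    no_go_list: Dict[str, List[float]],
    threshold: float = 0.85,  # assumed value, not from any source
) -> List[str]:
    """Return the names of protected voices the candidate is too similar to."""
    return [
        name
        for name, reference in no_go_list.items()
        if cosine_similarity(candidate, reference) >= threshold
    ]


# Toy 3-dimensional embeddings; real speaker embeddings would be much larger.
no_go = {"public_figure_a": [1.0, 0.0, 0.0]}
print(matches_no_go_list([0.9, 0.1, 0.0], no_go))  # ['public_figure_a']
print(matches_no_go_list([0.0, 1.0, 0.0], no_go))  # []
```

A request whose reference voice matched any entry would be refused before synthesis. In practice the hard parts are the ones the sketch assumes away: producing reliable embeddings and deciding who belongs on the list.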
OpenAI added that it was engaging with partners across government, media, entertainment, education, and civil society, and that policies protecting the use of individuals' voices in AI should be explored.[1][2] Because building safeguards such as comprehensive no-go voice lists and broad voice-authentication phase-outs is technically and institutionally demanding, the model remained in limited preview with no announced timeline for general availability.[3]
Coverage of the preview emphasized the tension between the technology's potential and its risks. Outlets including TechCrunch, NBC News, VentureBeat, and Al Jazeera described Voice Engine as impressive but characterized OpenAI's decision to withhold it as a recognition that the tool was, in effect, too risky for unrestricted public release in 2024.[2][3][6] Reporters noted that synthetic-voice cloning had become a fast-growing vector for fraud and that proposed safeguards such as watermarking can be difficult to enforce because watermarks may be stripped or bypassed.[3]
A year after the preview, in March 2025, TechCrunch reported that Voice Engine still had not been released publicly and remained limited to roughly the same small group of partners, with OpenAI saying it was continuing to test the model and learn from partners to improve its usefulness and safety. The article also reported that at least one partner, the accessibility company Livox, found the model's requirement for an online connection difficult to reconcile with the needs of customers who depend on offline devices.[3] The prolonged, deliberately cautious rollout was widely cited as an example of an AI developer restraining the release of a capable model on safety grounds.[3][6]