MAI-Voice-1
Last reviewed
Jun 8, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,593 words
MAI-Voice-1 is a text-to-speech (speech generation) model developed by Microsoft AI, the consumer artificial-intelligence division of Microsoft led by Mustafa Suleyman. Announced on August 28, 2025, it was the first fully in-house speech-generation model from the division, unveiled alongside MAI-1-preview, Microsoft AI's first end-to-end trained in-house text foundation model. [1][2] MAI-Voice-1 produces expressive, natural, high-fidelity audio across single-speaker and multi-speaker scenarios, and Microsoft positioned it as a centerpiece of consumer voice features in Microsoft Copilot, including Copilot Daily, Copilot Podcasts, and a creative voice playground called Copilot Audio Expressions. [1][3]
The release marked a strategic milestone in Microsoft's effort to build a self-sufficient stack of foundation models rather than rely solely on its partner OpenAI. Together with MAI-1-preview, it opened an expanding family of in-house "MAI" models that grew over the following year to include MAI-Voice-2, MAI-Image-2, MAI-Transcribe-1, MAI-Thinking-1, and MAI-Code-1. [4][5]
MAI-Voice-1 is a neural speech-synthesis model that converts written text into spoken audio. Microsoft AI describes it as "one of the most expressive and natural speech generation models available," capable of delivering high-fidelity audio in both single-speaker and multi-speaker settings. [1] Microsoft's headline performance claim is that the model is "lightning-fast," with the ability to generate a full minute of audio in under one second on a single GPU, which the company characterizes as making it "one of the most efficient speech systems available today." [1][2] These speed and efficiency figures are vendor claims published by Microsoft AI and have not been independently benchmarked in a peer-reviewed setting; the company has not disclosed the model's parameter count, architecture details, or training-data composition.
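To make the speed claim concrete, the following minimal sketch converts "one minute of audio in under one second" into a real-time factor and a speedup over real time. The function names are illustrative, and the linear extrapolation to longer audio is an assumption made here, not something Microsoft has stated. Because the claim is "under one second," the numbers are bounds rather than measurements.

```python
# Back-of-the-envelope arithmetic for the vendor claim:
# about 60 s of audio generated in under 1 s on a single GPU.
# All helper names are illustrative, not part of any Microsoft API.

def real_time_factor(audio_seconds: float, compute_seconds: float) -> float:
    """Compute time spent per second of generated audio (lower is faster)."""
    return compute_seconds / audio_seconds


def speedup_over_real_time(audio_seconds: float, compute_seconds: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    return audio_seconds / compute_seconds


def compute_seconds_for(target_audio_seconds: float,
                        audio_seconds: float,
                        compute_seconds: float) -> float:
    """Compute time for a longer clip, ASSUMING cost scales linearly with length."""
    return target_audio_seconds * compute_seconds / audio_seconds


CLAIMED_AUDIO_S = 60.0    # one minute of audio
CLAIMED_COMPUTE_S = 1.0   # "under one second", so treat this as an upper bound

rtf_upper_bound = real_time_factor(CLAIMED_AUDIO_S, CLAIMED_COMPUTE_S)
speedup_lower_bound = speedup_over_real_time(CLAIMED_AUDIO_S, CLAIMED_COMPUTE_S)
hour_upper_bound = compute_seconds_for(3600.0, CLAIMED_AUDIO_S, CLAIMED_COMPUTE_S)

print(f"real-time factor is below {rtf_upper_bound:.4f}")
print(f"speedup is at least {speedup_lower_bound:.0f}x real time")
print(f"one hour of audio would take under {hour_upper_bound:.0f} GPU-seconds (linear assumption)")
```

Read this way, the claim implies generation at least 60 times faster than playback on a single GPU. It says nothing about latency to first audio, batch size, or GPU type, none of which Microsoft has specified.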
Beyond the consumer Copilot surfaces it first powered, MAI-Voice-1 was later made available to third-party developers through Azure Speech in Microsoft Foundry (formerly Azure AI Foundry), where it shipped as a public-preview neural text-to-speech model with six prebuilt English (United States) voices and Speech Synthesis Markup Language (SSML) style controls. [3][6]
| Specification | Detail |
|---|---|
| Developer | Microsoft AI (Microsoft) |
| Model family | MAI (Microsoft AI in-house models) |
| Type | Text-to-speech / speech generation (neural) |
| Announced | August 28, 2025 |
| Modality | Text input, audio output |
| Speakers | Single-speaker and multi-speaker |
| Languages (Foundry preview) | English (United States); six prebuilt voices |
| Expressive control | SSML mstts:express-as styles (e.g., joy, excitement, empathy) |
| Stated speed (Microsoft claim) | About one minute of audio in under one second on a single GPU |
| Consumer products | Copilot Daily, Copilot Podcasts, Copilot Audio Expressions (Copilot Labs) |
| Developer access | Azure Speech in Microsoft Foundry (public preview) |
| Architecture / parameters | Not publicly disclosed |
| Training data | Not publicly disclosed |
| Successor | MAI-Voice-2 (announced June 2, 2026) |
Microsoft AI is the division Microsoft formed in March 2024 to lead its consumer AI products, including Copilot, Bing, and Edge, with Mustafa Suleyman, a co-founder of DeepMind and Inflection AI, hired as its chief executive. [1][4] Initially, the division's products leaned heavily on models supplied by OpenAI, in which Microsoft is the largest investor. MAI-Voice-1 and MAI-1-preview, released together in late August 2025, were the division's first publicly announced models trained in-house, signaling an intent to develop proprietary foundation models in parallel with its OpenAI partnership. [2][7]
Microsoft framed the two August 2025 models as complementary pieces of a broader plan to orchestrate multiple specialized models for different user intents. MAI-1-preview is a mixture-of-experts text model that Microsoft said was trained on roughly 15,000 NVIDIA H100 GPUs and submitted to the public LMArena leaderboard for community evaluation, while MAI-Voice-1 supplied the speech layer. [1][2] In its announcement, the company emphasized building a "self-sufficient" capability and described an ambition to serve its hundreds of millions of consumer users with models it controls end to end. [1]
The MAI lineup expanded substantially over the following year. Microsoft introduced MAI-Voice-2, a multilingual, prompted speech model, and brought MAI-Voice-1 and MAI-Voice-2 to Azure Speech as the "MAI-Voice" family. [6][8] At its Build developer conference on June 2, 2026, Microsoft AI announced seven new MAI models spanning text, image, voice, transcription, reasoning, and code, which Suleyman described as components of a "hill-climbing machine" of continuously improving in-house systems oriented toward what he calls "humanist superintelligence." [5] Within that trajectory, MAI-Voice-1 stands as the family's founding speech model.
MAI-Voice-1 generates spoken audio from text input. According to Microsoft's Azure documentation, the model "interprets input text holistically and automatically adjusts emotion, pace, and rhythm without manual configuration," producing speech the company describes as highly natural and emotionally rich. [6] It is optimized for conversational, expressive, and long-form scenarios, and maintains a consistent voice persona across extended content while still allowing expressive variation. [6]
In the Azure Speech preview, MAI-Voice-1 ships with six prebuilt English (US) voices, with names such as Jasper, June, Grant, Iris, Reed, and Joy. The voices span male and female options and are recommended for uses such as general conversation, customer service, professional narration, and emotionally expressive delivery. [6] Developers can shape delivery using SSML, in particular the mstts:express-as element, to request emotional styles such as joy, excitement, or empathy, and the model supports real-time synthesis through the standard Azure Speech SDKs and APIs. [6] A gated "personal voice" (voice-cloning) prompt mode lets approved customers create a custom voice from consented audio, subject to Microsoft's limited-access review for Custom Neural Voice. [6]
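As an illustration of the SSML style control described above, the sketch below assembles a small SSML document that wraps text in an mstts:express-as element. Several details are assumptions: the voice name is a placeholder because the exact MAI-Voice-1 voice identifiers are not given here, the style token "empathy" is taken from the styles named above and may not match the exact string the service accepts, and the mstts namespace URI is the one used in Azure's general SSML documentation. The helper build_ssml is hypothetical and not part of any SDK.

```python
# Minimal sketch: build an SSML request body with an express-as style.
# build_ssml and the placeholder voice name are illustrative assumptions.
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"   # W3C SSML namespace
MSTTS_NS = "https://www.w3.org/2001/mstts"        # assumed, per Azure SSML docs


def build_ssml(text: str, voice: str, style: str, lang: str = "en-US") -> str:
    """Return an SSML string asking for `text` spoken by `voice` in `style`."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"{escape(text)}"  # escape &, <, > so user text cannot break the markup
        "</mstts:express-as>"
        "</voice>"
        "</speak>"
    )


ssml = build_ssml(
    "Thanks for waiting & welcome back!",
    voice="PLACEHOLDER_VOICE_NAME",  # hypothetical; substitute a real voice name
    style="empathy",                 # assumed style token
)

# Parse the result back to confirm it is well-formed XML.
root = ET.fromstring(ssml)
express = root.find(f".//{{{MSTTS_NS}}}express-as")
print(ssml)
```

A string like this could then be passed to an SSML synthesis call such as speak_ssml_async in the Azure Speech SDK for Python. Whether a given voice honors a given style depends on the service, so an unsupported combination should be expected to fall back to default delivery or return an error.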
Microsoft has stressed responsible-use safeguards around the technology, including consent requirements and gated access for voice cloning, reflecting the broader industry concern that high-fidelity speech synthesis can be misused for impersonation. The company has not published the model's architecture, size, native sampling rate, or training datasets, so several technical specifics remain undisclosed. [1][6]
Copilot Daily is a feature in which an AI voice reads a short, curated rundown of news and information to the user. Microsoft cited MAI-Voice-1 as the model powering the voice in Copilot Daily, providing the spoken delivery for these audio briefings. [1][3]
Microsoft also pointed to Copilot Podcasts (sometimes described as audio explanations or podcast-style summaries) as a use case for MAI-Voice-1. In this scenario, the model generates conversational, multi-speaker audio that can explain a topic in a podcast-like format, taking advantage of its support for multiple voices in a single output. [1]
Copilot Audio Expressions is an experimental feature in Copilot Labs that exposes MAI-Voice-1 directly to users, letting them type or paste a script and generate narrated audio with adjustable delivery. [1][9] Microsoft highlighted creative uses such as personalized "choose your own adventure" stories and bespoke guided-meditation content. [1] The experience offers distinct creative modes: an Emotive mode that adapts a script for a single expressive voice, a Story mode that blends multiple voices and accents for multi-character narration, and a Scripted mode that reads input verbatim for cases such as disclaimers or exact recitation. [9] At launch, Audio Expressions was available through Copilot Labs for personal accounts, with broader and enterprise availability following over time. [3][9]
MAI-Voice-1 entered a fast-growing market for AI voice synthesis. Its most direct comparison is the speech startup ElevenLabs, widely regarded as a leader in expressive AI voices, along with the voice and text-to-speech offerings of OpenAI, Google, and Amazon, plus numerous specialized speech vendors. [2][7] By building its own speech model, Microsoft reduced its reliance on external voice providers for consumer Copilot features and gained control over latency, cost, and the user experience of its audio products. Industry observers framed the launch, together with MAI-1-preview, as a meaningful step in Microsoft diversifying beyond OpenAI for the models behind its products. [2][7]
MAI-Voice-1 is notable as the first in-house speech-generation model from Microsoft AI and the first member of the MAI model family to ship in consumer products. [1][4] Its release demonstrated that Microsoft intended to compete directly in foundation-model development, not only to integrate partner models, and it gave the company a controllable, high-throughput voice engine for Copilot. [2][7] Microsoft's emphasis on efficiency, encapsulated in its claim of generating a minute of audio in under a second on a single GPU, underscored a strategy of serving very large consumer audiences economically. [1]
As the founding entry in the MAI-Voice line, MAI-Voice-1 set the template that MAI-Voice-2 later extended with multilingual coverage, voice prompting, and broader expressive range. [5][8] Within Microsoft's stated long-term goal of a self-sufficient, continuously improving model stack, MAI-Voice-1 occupies the position of the first proprietary speech model the company brought to market under Suleyman's leadership. [1][5]