MAI-Voice-1

Updated Jun 8, 2026

MAI-Voice-1 is a text-to-speech (speech generation) model developed by Microsoft AI, the consumer artificial-intelligence division of Microsoft led by Mustafa Suleyman. Announced on August 28, 2025, it was the first fully in-house speech-generation model from the division, unveiled alongside MAI-1-preview, Microsoft AI's first end-to-end trained in-house text foundation model. ^[1]^[2] MAI-Voice-1 produces expressive, natural, high-fidelity audio across single-speaker and multi-speaker scenarios, and Microsoft positioned it as a centerpiece of consumer voice features in Microsoft Copilot, including Copilot Daily, Copilot Podcasts, and a creative voice playground called Copilot Audio Expressions. ^[1]^[3]

The release marked a strategic milestone in Microsoft's effort to build a self-sufficient stack of foundation models rather than rely solely on its partner OpenAI, and it became the first member of an expanding family of in-house "MAI" models that grew over the following year to include MAI-Voice-2, MAI-Image-2, MAI-Transcribe-1, MAI-Thinking-1, and MAI-Code-1. ^[4]^[5]

Overview

MAI-Voice-1 is a neural speech-synthesis model that converts written text into spoken audio. Microsoft AI describes it as "one of the most expressive and natural speech generation models available," capable of delivering high-fidelity audio in both single-speaker and multi-speaker settings. ^[1] Microsoft's headline performance claim is that the model is "lightning-fast," with the ability to generate a full minute of audio in under one second on a single GPU, which the company characterizes as making it "one of the most efficient speech systems available today." ^[1]^[2] These speed and efficiency figures are vendor claims published by Microsoft AI and have not been independently benchmarked in a peer-reviewed setting; the company has not disclosed the model's parameter count, architecture details, or training-data composition.
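Taken at face value, the speed claim implies a lower bound on how much faster than real time the model runs. The snippet below is a minimal sketch of that arithmetic, not a benchmark. The function name is hypothetical, and the inputs are simply Microsoft's stated figures, with "under one second" treated as exactly one second.

```python
def speedup_over_real_time(audio_seconds: float, generation_seconds: float) -> float:
    """Return how many seconds of audio are produced per second of compute."""
    if audio_seconds < 0 or generation_seconds <= 0:
        raise ValueError("audio_seconds must be >= 0 and generation_seconds > 0")
    return audio_seconds / generation_seconds


if __name__ == "__main__":
    # Microsoft's claim: 60 s of audio in under 1 s on a single GPU.
    # Using 1.0 s as the worst case yields a floor, not a measurement.
    print(speedup_over_real_time(60.0, 1.0))  # 60.0
```

Because the claim is "under one second," 60 is a floor on the ratio rather than a measured value. The actual figure would depend on factors such as the GPU used and the length of the input text.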

Beyond the consumer Copilot surfaces it first powered, MAI-Voice-1 was later made available to third-party developers through Azure Speech in Microsoft Foundry (formerly Azure AI Foundry), where it shipped as a public-preview neural text-to-speech model with six prebuilt English (United States) voices and Speech Synthesis Markup Language (SSML) style controls. ^[3]^[6]

Specification	Detail
Developer	Microsoft AI (Microsoft)
Model family	MAI (Microsoft AI in-house models)
Type	Text-to-speech / speech generation (neural)
Announced	August 28, 2025
Modality	Text input, audio output
Speakers	Single-speaker and multi-speaker
Languages (Foundry preview)	English (United States); six prebuilt voices
Expressive control	SSML `mstts:express-as` styles (e.g., joy, excitement, empathy)
Stated speed (Microsoft claim)	About one minute of audio in under one second on a single GPU
Consumer products	Copilot Daily, Copilot Podcasts, Copilot Audio Expressions (Copilot Labs)
Developer access	Azure Speech in Microsoft Foundry (public preview)
Architecture / parameters	Not publicly disclosed
Training data	Not publicly disclosed
Successor	MAI-Voice-2 (announced June 2, 2026)

Microsoft AI and the MAI family

Microsoft AI is the division Microsoft formed in March 2024 to lead its consumer AI products, including Copilot, Bing, and Edge, with Mustafa Suleyman, a co-founder of DeepMind and Inflection AI, hired as its chief executive. ^[1]^[4] For its first year, the division's products leaned heavily on models supplied by OpenAI, in which Microsoft is the largest investor. MAI-Voice-1 and MAI-1-preview, released together in late August 2025, were the division's first publicly announced models trained in-house, signaling an intent to develop proprietary foundation models in parallel with its OpenAI partnership. ^[2]^[7]

Microsoft framed the two August 2025 models as complementary pieces of a broader plan to orchestrate multiple specialized models for different user intents. MAI-1-preview is a mixture-of-experts text model that Microsoft said was trained on roughly 15,000 NVIDIA H100 GPUs and submitted to the public LMArena leaderboard for community evaluation, while MAI-Voice-1 supplied the speech layer. ^[1]^[2] In its announcement, the company emphasized building a "self-sufficient" capability and described an ambition to serve its hundreds of millions of consumer users with models it controls end to end. ^[1]

The MAI lineup expanded substantially over the following year. Microsoft introduced MAI-Voice-2, a multilingual, prompted speech model, and brought MAI-Voice-1 and MAI-Voice-2 to Azure Speech as the "MAI-Voice" family. ^[6]^[8] At its Build developer conference on June 2, 2026, Microsoft AI announced seven new MAI models spanning text, image, voice, transcription, reasoning, and code, which Suleyman described as components of a "hill-climbing machine" of continuously improving in-house systems oriented toward what he calls "humanist superintelligence." ^[5] Within that trajectory, MAI-Voice-1 stands as the family's founding speech model.

What MAI-Voice-1 does

MAI-Voice-1 generates spoken audio from text input. According to Microsoft's Azure documentation, the model "interprets input text holistically and automatically adjusts emotion, pace, and rhythm without manual configuration," producing speech the company describes as highly natural and emotionally rich. ^[6] It is optimized for conversational, expressive, and long-form scenarios, and maintains a consistent voice persona across extended content while still allowing expressive variation. ^[6]

In the Azure Speech preview, MAI-Voice-1 ships with six prebuilt English (US) voices, identified by names such as Jasper, June, Grant, Iris, Reed, and Joy. The voices span male and female options and are recommended for uses including general conversation, customer service, professional narration, and emotional styles. ^[6] Developers can shape delivery using SSML, in particular the `mstts:express-as` element, to request emotional styles such as joy, excitement, or empathy, and the model supports real-time synthesis through the standard Azure Speech SDKs and APIs. ^[6] A gated "personal voice" (voice-cloning) prompt mode lets approved customers create a custom voice from consented audio, subject to Microsoft's limited-access review for Custom Neural Voice. ^[6]
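To illustrate the SSML style controls described above, the sketch below assembles a request body that wraps text in an `mstts:express-as` element. It uses only the Python standard library and does not call Azure. Several parts are assumptions: the voice identifier is a placeholder (the exact MAI-Voice-1 voice names accepted by the service are not given here), the `joy` style value is taken from the examples in this article, and `build_ssml` is a hypothetical helper rather than part of any SDK. The two namespace URIs are the ones used in Azure Speech SSML documents.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # standard SSML namespace
MSTTS_NS = "https://www.w3.org/2001/mstts"       # Microsoft extension namespace

# Placeholder only: substitute a real voice name from the service's voice list.
PLACEHOLDER_VOICE = "EXAMPLE-MAI-VOICE-1-VOICE"


def build_ssml(text: str, voice: str = PLACEHOLDER_VOICE,
               style: str = "joy", lang: str = "en-US") -> str:
    """Build an SSML document that requests one expressive style for `text`."""
    ssml = (
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"{escape(text)}"  # escape &, <, > so user text cannot break the markup
        f"</mstts:express-as></voice></speak>"
    )
    ET.fromstring(ssml)  # raises ParseError if the document is not well-formed
    return ssml


if __name__ == "__main__":
    print(build_ssml("Welcome back! Here is today's briefing."))
```

A string built this way could then be handed to an SSML synthesis call, for example `speak_ssml_async` on a `SpeechSynthesizer` in the Azure Speech SDK for Python. Whether a given style value is honored by a particular MAI-Voice-1 voice is something to confirm against the service documentation.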

Microsoft has stressed responsible-use safeguards around the technology, including consent requirements and gated access for voice cloning, reflecting the broader industry concern that high-fidelity speech synthesis can be misused for impersonation. The company has not published the model's underlying architecture, size, or native sampling rate, nor the datasets used to train it. ^[1]^[6]

Products

Copilot Daily

Copilot Daily is a feature in which an AI voice reads a short, curated rundown of news and information to the user. Microsoft cited MAI-Voice-1 as the model powering the voice in Copilot Daily, providing the spoken delivery for these audio briefings. ^[1]^[3]

Copilot Podcasts

Microsoft also pointed to Copilot Podcasts (sometimes described as audio explanations or podcast-style summaries) as a use case for MAI-Voice-1. In this scenario, the model generates conversational, multi-speaker audio that can explain a topic in a podcast-like format, taking advantage of its support for multiple voices in a single output. ^[1]

Copilot Audio Expressions

Copilot Audio Expressions is an experimental feature in Copilot Labs that exposes MAI-Voice-1 directly to users, letting them type or paste a script and generate narrated audio with adjustable delivery. ^[1]^[9] Microsoft highlighted creative uses such as personalized "choose your own adventure" stories and bespoke guided-meditation content. ^[1] The experience offers distinct creative modes: an Emotive mode that adapts a script for a single expressive voice, a Story mode that blends multiple voices and accents for multi-character narration, and a Scripted mode that reads input verbatim for cases such as disclaimers or exact recitation. ^[9] At launch, Audio Expressions was available through Copilot Labs for personal accounts, with broader and enterprise availability following over time. ^[3]^[9]

Competitive landscape

MAI-Voice-1 entered a fast-growing market for AI voice synthesis. Its most direct points of comparison are ElevenLabs, a speech startup widely regarded as a leader in expressive AI voices, and the voice and text-to-speech offerings of OpenAI, Google, and Amazon, as well as numerous specialized speech vendors. ^[2]^[7] By building its own speech model, Microsoft reduced its reliance on external voice providers for consumer Copilot features and gained control over latency, cost, and the user experience of its audio products. Industry observers framed the launch, together with MAI-1-preview, as a meaningful step in Microsoft diversifying beyond OpenAI for the models behind its products. ^[2]^[7]

Significance

MAI-Voice-1 is notable as the first in-house speech-generation model from Microsoft AI and the first member of the MAI model family to ship in consumer products. ^[1]^[4] Its release demonstrated that Microsoft intended to compete directly in foundation-model development, not only to integrate partner models, and it gave the company a controllable, high-throughput voice engine for Copilot. ^[2]^[7] Microsoft's emphasis on efficiency, encapsulated in its claim of generating a minute of audio in under a second on a single GPU, underscored a strategy of serving very large consumer audiences economically. ^[1]

As the founding entry in the MAI-Voice line, MAI-Voice-1 set the template that MAI-Voice-2 later extended with multilingual coverage, voice prompting, and broader expressive range. ^[5]^[8] Within Microsoft's stated long-term goal of a self-sufficient, continuously improving model stack, MAI-Voice-1 occupies the position of the first proprietary speech model the company brought to market under Suleyman's leadership. ^[1]^[5]

References

1. Microsoft AI, "Two in-house models in support of our mission," microsoft.ai, August 28, 2025. https://microsoft.ai/news/two-new-in-house-models/
2. MarkTechPost, "Microsoft AI Lab Unveils MAI-Voice-1 and MAI-1-Preview: New In-House Models for Voice AI," August 29, 2025. https://www.marktechpost.com/2025/08/29/microsoft-ai-lab-unveils-mai-voice-1-and-mai-1-preview-new-in-house-models-for-voice-ai/
3. Microsoft Community Hub (Azure AI Foundry Blog), "Introducing MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 in Microsoft Foundry." https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/introducing-mai-transcribe-1-mai-voice-1-and-mai-image-2-in-microsoft-foundry/4507787
4. Neowin, "Microsoft reveals two in-house AI models: MAI-Voice-1 and MAI-1-preview," August 2025. https://www.neowin.net/news/microsoft-reveals-two-in-house-ai-models-mai-voice-1-and-mai-1-preview/
5. Microsoft AI, "Building a hill-climbing machine: Launching seven new MAI models," microsoft.ai, June 2, 2026. https://microsoft.ai/news/building-a-hillclimbing-machine-launching-seven-new-mai-models/
6. Microsoft Learn, "What is MAI-Voice? - Foundry Tools," Azure AI Speech documentation. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/mai-voices
7. Winbuzzer, "Microsoft Unveils In-House MAI-1 and MAI-Voice-1 AI Models to Diversify Beyond OpenAI," August 29, 2025. https://winbuzzer.com/2025/08/29/microsoft-unveils-in-house-mai-1-and-mai-voice-1-ai-models-to-diversify-beyond-openai-xcxwbn/
8. Microsoft AI, "MAI-Voice-2." https://microsoft.ai/models/mai-voice-2/
9. Windows Forum, "Scripted Mode in Copilot Labs: Verbatim Audio with MAI-Voice-1." https://windowsforum.com/threads/scripted-mode-in-copilot-labs-verbatim-audio-with-mai-voice-1.380655/
