Stable Audio 2.5

Updated Jun 28, 2026

Stable Audio 2.5 is an enterprise-focused text-to-audio generation model released by Stability AI on September 10, 2025. It generates production-ready music tracks of up to three minutes at 44.1 kHz stereo from a text prompt, and it also supports audio-to-audio transformation, audio inpainting, and custom fine-tuning on a customer's own licensed audio library. Stability AI describes it as "the first audio generation model designed specifically for enterprise-grade sound production," the third numbered iteration of the Stable Audio family and the first version aimed at brand and commercial sound work rather than consumer experimentation.^[1]^[2]

The headline technical change relative to earlier Stable Audio releases is generation speed. Stability AI reports that Stable Audio 2.5 produces a three-minute track in under two seconds on an Nvidia H100 GPU, using an eight-step diffusion process obtained with a post-training method the company calls Adversarial Relativistic-Contrastive (ARC). Stable Audio 2.0 required roughly 50 inference steps to reach comparable quality, so the 2.5 release cuts the sampling budget by a factor of more than six (50 / 8 = 6.25).^[1]^[3]

The launch was accompanied by a partnership with sound branding agency amp, part of Landor Group within the WPP holding company, which made Stable Audio 2.5 available to WPP's enterprise clients through the WPP Open platform.^[1]^[2]

What is Stable Audio 2.5?

Stable Audio 2.5 is a latent diffusion text-to-audio model that turns a written description into a finished instrumental music track. A prompt such as "uplifting cinematic strings with a driving percussion bed" returns a stereo track of up to three minutes, structured with an explicit intro, development, and outro rather than a single repeating loop. Beyond plain text-to-audio it supports two editing workflows: audio-to-audio, where a reference clip plus a prompt is reshaped into a new style, and audio inpainting, where the model fills or extends a marked section of an uploaded track while keeping the surrounding material intact.^[1]^[3]

The model is positioned for enterprise sound production. Stability AI markets the training data as fully licensed and the output as cleared for commercial use under the relevant license tier, and it offers organizations the ability to fine-tune a bespoke version of the model on their own sound library so that a brand's signature audio is embedded directly in the generative workflow.^[1]^[2]

Background

Stability AI is a London-based generative AI company best known for the Stable Diffusion family of image models. The company expanded into audio in September 2023 with the original Stable Audio text-to-music system, which produced short clips of up to 90 seconds in 44.1 kHz stereo using a latent diffusion architecture.^[4]

The second generation, Stable Audio 2.0, launched on April 3, 2024. It extended generation length to three minutes, switched from a U-Net to a diffusion transformer (DiT) over a highly compressed autoencoder, and introduced audio-to-audio style transfer alongside text-to-audio generation. According to Stability AI, the 2.0 model was trained on a dataset of over 800,000 audio files licensed from the AudioSparx music library, with an artist opt-out provision, and the web app used Audible Magic content recognition technology to screen uploaded audio for copyrighted material. Stability AI made the 2.0 web app free at launch on StableAudio.com and added API access shortly afterward.^[4]^[5]

In May 2025, between the 2.0 and 2.5 releases, Stability AI and Arm released Stable Audio Open Small, a 341 million parameter text-to-audio model optimized to run on Arm CPUs and generate short samples on-device. That model was the first public deployment of the ARC post-training method later used in Stable Audio 2.5, and it was released under the permissive Stability AI Community License.^[6]

Across the same period the generative music space became crowded. Suno and Udio both gained large consumer followings for full song generation with vocals. ElevenLabs entered the field in 2025 with ElevenLabs Music. Stability AI's response was to specialize. Stable Audio 2.5 doubles down on instrumental music, sound design beds, and brand audio rather than chasing the pop song use case the consumer apps target.^[7]^[8]

What is Stable Audio 2.5 used for?

The model handles three core generation modes and several enterprise features that did not ship with the consumer-focused 2.0 release.

Capability	Description
Text-to-audio	Generates music from a natural language prompt. Stability AI says 2.5 responds more precisely to mood descriptors like "uplifting" and instrument-level prompts like "lush synthesizers" than 2.0.
Audio-to-audio	Accepts a reference clip plus a text prompt and transforms the input into a new track in a target style. Carried over from Stable Audio 2.0.
Audio inpainting	New in 2.5. Users upload a track, mark a start point, and the model generates a continuation that musically fits the existing material. Suited to fixing or extending parts of an existing arrangement.
Track length	Up to three minutes per generation, output at 44.1 kHz stereo.
Musical structure	Tracks include explicit intro, development, and outro sections rather than a single repeating loop.
Custom fine-tuning	Enterprise customers can fine-tune the model on their own audio library to produce a bespoke sonic identity. Sold as part of enterprise licensing.
Commercial safety	Trained on a fully licensed dataset. Stability AI advertises the output as cleared for commercial use, subject to the license tier.

The inpainting feature is the most consequential addition for sound designers. Earlier audio diffusion tools tended to regenerate a whole clip whenever a section needed changing, which broke continuity. Inpainting lets a producer keep the parts that already work and only resynthesize what is wrong. In its announcement Stability AI described the workflow directly: "Stable Audio 2.5 supports audio inpainting, which means users can input their own audio, select where they want it to start, and the model will use the context to generate the rest of the track." The company framed the target use cases as brand sound work spanning everything from commercials and game intros to in-store music and audio cues for products such as credit cards and car stereos.^[1]^[3]
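Conceptually, the inpainting workflow splits an uploaded track into audio to keep and a span to regenerate. The Python sketch below is a hypothetical illustration in per-channel sample indices at 44.1 kHz; Stability AI has not published how the model represents the marked region (a latent diffusion model would more plausibly mask compressed latent frames than raw samples), and the helper names are invented for this example. With no end point it models the "extend from here" case; with an end point it models filling a gap bounded by kept audio on both sides.

```python
from __future__ import annotations

SAMPLE_RATE_HZ = 44_100  # output sample rate stated for Stable Audio 2.5
CHANNELS = 2             # stereo output
MAX_SECONDS = 180        # three-minute limit per generation


def stereo_sample_count(seconds: float) -> int:
    """Total samples across both channels for a clip of the given length."""
    return round(seconds * SAMPLE_RATE_HZ) * CHANNELS


def inpainting_regions(
    track_seconds: float, start_seconds: float, end_seconds: float | None = None
) -> tuple[range, range, range]:
    """Hypothetical helper: per-channel sample ranges (keep, regenerate, keep)."""
    if not 0 < track_seconds <= MAX_SECONDS:
        raise ValueError("track length must be within (0, 180] seconds")
    total = round(track_seconds * SAMPLE_RATE_HZ)
    start = round(start_seconds * SAMPLE_RATE_HZ)
    # No end point: regenerate from the start point to the end of the track.
    end = total if end_seconds is None else round(end_seconds * SAMPLE_RATE_HZ)
    if not 0 <= start < end <= total:
        raise ValueError("the marked region must lie inside the track")
    return range(0, start), range(start, end), range(end, total)


before, regenerate, after = inpainting_regions(10.0, 4.0)
print(len(before), len(regenerate), len(after))  # 176400 264600 0
```

For scale, a full three-minute stereo output is 180 × 44,100 × 2 = 15,876,000 samples, which is why a real system would work in a compressed latent space rather than on samples directly.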

How fast is Stable Audio 2.5?

Speed is the defining claim of the release. Stability AI states that Stable Audio 2.5 reaches an "inference speed of less than two seconds on a GPU, for tracks up to three minutes," measured on an Nvidia H100, and that it does so in just eight diffusion steps rather than the roughly 50 steps used by Stable Audio 2.0. VentureBeat characterized the eight-step pipeline as cutting brand audio production time "from weeks to minutes."^[1]^[2]

Metric	Stable Audio 2.5	Stable Audio 2.0
Inference steps	About 8	About 50
Time for a 3-minute track	Under 2 seconds on H100	Not stated; slower (about 50 steps)
Acceleration method	ARC adversarial post-training	Standard diffusion sampling

Stability AI attributes the speed gain to the post-training stage rather than to a smaller model. ARC compresses the number of sampling steps the model needs at inference time while preserving output quality, which is what lets a long stereo track render in roughly the time it takes to read the prompt.^[1]^[9]
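The speed figures in this section reduce to two ratios: how many times fewer sampling steps are run, and how many seconds of audio are produced per second of compute (a real-time factor). The sketch below only re-derives those ratios from the numbers Stability AI reports; the function names are illustrative, and because the H100 figure is stated as "under two seconds", the computed real-time factor for 2.5 is a lower bound.

```python
def step_reduction(old_steps: int, new_steps: int) -> float:
    """How many times fewer diffusion sampling steps the new model runs."""
    return old_steps / new_steps


def real_time_factor(audio_seconds: float, compute_seconds: float) -> float:
    """Seconds of audio generated per second of wall-clock compute."""
    if compute_seconds <= 0:
        raise ValueError("compute time must be positive")
    return audio_seconds / compute_seconds


# Reported step counts (approximate): ~50 for Stable Audio 2.0, ~8 for 2.5.
print(step_reduction(50, 8))               # 6.25
# 2.5: a 180-second track in under 2 seconds on an H100, so at least 90x real time.
print(real_time_factor(180, 2.0))          # 90.0
# Open Small (ARC paper): ~12 seconds of audio in ~75 ms on an H100.
print(round(real_time_factor(12, 0.075)))  # 160
```

The step ratio and the wall-clock speedup are not the same quantity: per-step cost depends on model size and hardware, and Stability AI has not published the 2.5 model's size.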

How does Stable Audio 2.5 work?

Stability AI has disclosed two specific technical details about Stable Audio 2.5. The first is that the model is a latent diffusion system, consistent with the diffusion transformer over a compressed autoencoder used in Stable Audio 2.0. The second is the post-training method used to compress the sampling schedule.^[1]^[4]

Adversarial Relativistic-Contrastive post-training, or ARC, is the technique Stability AI credits for the eight step inference budget. The method was introduced in the paper "Fast Text-to-Audio Generation with Adversarial Post-Training" (arXiv 2505.08175), submitted by Stability AI researchers in May 2025. The authors describe ARC as the first adversarial acceleration algorithm for diffusion and flow models that is not based on distillation: it extends a relativistic adversarial training formulation to diffusion and flow post-training and adds a contrastive discriminator objective intended to improve prompt adherence. Because it avoids distillation, the approach does not require a separate pretrained teacher model or classifier-free guidance. ARC was first applied to the open Stable Audio Open Small model, where Stability AI reported generating about 12 seconds of 44.1 kHz stereo audio in roughly 75 milliseconds on an H100 GPU, before the same method was used to accelerate Stable Audio 2.5.^[1]^[9]
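The two ingredients named in the paper, a relativistic adversarial loss and a contrastive discriminator objective, can be illustrated on plain discriminator scores. The Python sketch below is a simplified, hypothetical rendering: it assumes the common relativistic-pairing form with a softplus link, in which each generated sample is scored against a real sample paired with it, and a contrastive term in which the discriminator must rank real audio with its own prompt above the same audio paired with a shuffled prompt. The exact losses, pairing (for example by prompt and noise level), and weighting used in ARC are defined in arXiv 2505.08175 and may differ from this sketch.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def _mean_softplus_gap(worse: Sequence[float], better: Sequence[float]) -> float:
    """Mean of softplus(worse_i - better_i); small when each better_i >> worse_i."""
    if len(worse) != len(better) or not worse:
        raise ValueError("score lists must be non-empty and paired")
    return sum(softplus(w - b) for w, b in zip(worse, better)) / len(worse)


def relativistic_d_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    # Discriminator is rewarded for scoring each real sample above its paired fake.
    return _mean_softplus_gap(worse=d_fake, better=d_real)


def relativistic_g_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    # Generator is rewarded when its output outscores the paired real sample.
    return _mean_softplus_gap(worse=d_real, better=d_fake)


def contrastive_d_loss(d_matched: Sequence[float], d_shuffled: Sequence[float]) -> float:
    # Discriminator-only term: real audio + its own prompt should outscore the
    # same audio + a shuffled prompt, which pushes it to check prompt adherence.
    return _mean_softplus_gap(worse=d_shuffled, better=d_matched)


# Toy scores: the discriminator currently prefers real over fake by a margin of 2.
d_real, d_fake = [2.0, 1.5], [0.0, -0.5]
print(relativistic_d_loss(d_real, d_fake))  # small: the discriminator is winning
print(relativistic_g_loss(d_real, d_fake))  # large: the generator has work to do
```

Scoring each fake against a paired real sample, rather than against a fixed "real" label, is what makes the formulation relativistic; no teacher model appears anywhere in these losses, which is the sense in which the method is not distillation-based.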

Beyond the headline timing of under two seconds for a three-minute output on an Nvidia H100, the company has not published parameter counts, training set size, or the encoder bottleneck rate for the 2.5 model.^[1]^[3]

How much does Stable Audio 2.5 cost?

Stable Audio 2.5 is available through several channels with different commercial terms.

Channel	Notes
StableAudio.com	Web interface aimed at individuals. Free tier for personal use only. Paid Creator tier extends commercial rights to individuals earning under $1 million per year.
Stability AI API	Pay-as-you-go credit pricing on the Stability AI Platform, where 1 credit equals $0.01. New accounts start with 25 free credits.
fal, Replicate, ComfyUI	Third party hosting platforms with their own per-request pricing.
Enterprise license	Direct contract with Stability AI required for any organization with annual revenue above $1 million, for API resellers, and for on-premises deployment. Includes implementation support, custom fine-tuning, and professional services.
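The thresholds in the table above can be read as a simple decision rule, and the API's credit pricing as a unit conversion. The Python sketch below is a simplified, hypothetical encoding of the table as summarized here: it treats annual revenue, reseller status, and on-premises deployment as the only tests, and it covers the StableAudio.com tiers plus the enterprise rule rather than third-party hosts. The helper names are invented, and the actual Stability AI license terms govern.

```python
from decimal import Decimal

USD_PER_CREDIT = Decimal("0.01")    # Stability AI Platform: 1 credit = $0.01
NEW_ACCOUNT_CREDITS = 25            # free credits granted to new API accounts
ENTERPRISE_REVENUE_USD = 1_000_000  # revenue above this requires an enterprise license


def credits_to_usd(credits: int) -> Decimal:
    """Convert platform credits to US dollars using exact decimal arithmetic."""
    return credits * USD_PER_CREDIT


def required_license(annual_revenue_usd: int, commercial_use: bool,
                     api_reseller: bool = False, on_premises: bool = False) -> str:
    """Hypothetical, simplified reading of the licensing table."""
    if api_reseller or on_premises or annual_revenue_usd > ENTERPRISE_REVENUE_USD:
        return "enterprise"
    if not commercial_use:
        return "free"  # personal, non-commercial use on StableAudio.com
    # Creator covers individuals "under $1 million"; exactly $1,000,000 falls
    # between the two stated thresholds and is treated as Creator by assumption.
    return "creator"


print(credits_to_usd(NEW_ACCOUNT_CREDITS))  # 0.25
print(required_license(5_000_000, commercial_use=True))  # enterprise
```

Using Decimal rather than float keeps credit totals exact, so a balance of 25 credits converts to precisely $0.25.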

Stability AI markets the underlying training data as fully licensed and the outputs as commercially safe, which is the company's primary selling point against legally contested competitors. Uploaded reference audio still has to be copyright-cleared by the user, and the platform runs content recognition checks on uploads to enforce this.^[1]^[10]^[11]

How does Stable Audio 2.5 differ from Stable Audio 2.0 and Stable Audio Open?

The three products share a lineage but target different users. Stable Audio 2.0 (April 2024) is the prior consumer flagship; Stable Audio Open is the openly licensed line for local and on-device use; Stable Audio 2.5 (September 2025) is the enterprise tier.

Feature	Stable Audio 2.5	Stable Audio 2.0	Stable Audio Open / Open Small
Released	September 2025	April 2024	2024 (Open), May 2025 (Open Small)
Primary audience	Enterprise and brand sound	Individual creators	Researchers, on-device developers
Max track length	3 minutes	3 minutes	About 47 seconds (Open), short samples (Open Small)
Inference steps	About 8 (ARC)	About 50	About 8 (Open Small, ARC)
Audio inpainting	Yes (new)	No	No
Custom fine-tuning service	Yes	No	Open weights, self fine-tune
Weights	Closed, hosted	Closed, hosted	Open under Stability AI Community License

The practical differences from 2.0 are the cut from about 50 to about 8 sampling steps via ARC, the new audio inpainting workflow, the enterprise fine-tuning service, and the explicit commercial-safety positioning. Stable Audio Open differs again by shipping open weights for self-hosting, with Open Small being the 341-million-parameter on-device variant that first proved out the ARC method.^[1]^[4]^[6]

How does Stable Audio 2.5 compare to Suno, Udio, and ElevenLabs?

Stable Audio 2.5 sits in an awkward part of the market. It is faster than the consumer music generators and is positioned for licensed enterprise work, but it does not generate vocals or full songs in the way the leading consumer systems do.

Model	Vendor	Released	Vocals	Max length	Commercial license clarity
Stable Audio 2.5	Stability AI	September 2025	No	3 minutes	Trained on licensed data, commercial use under paid tiers
Suno v5	Suno	2025	Yes	About 8 minutes per generation	Disputed, multiple major label lawsuits pending
Udio	Uncharted Labs	2024	Yes	Up to 15 minutes via extensions	Disputed, multiple major label lawsuits pending
ElevenLabs Music	ElevenLabs	2025	Yes	Up to 5 minutes	Trained on licensed data, commercial use under paid tiers

Reviewers writing about the 2025 audio model landscape generally place Stable Audio in the instrumental and sound design category rather than the songwriting category. Coverage from outlets such as Geeky Gadgets and aicompetence.org describes Stable Audio as the strongest choice for cinematic beds, sample packs, loops, and sound effects, and Suno or Udio as the better picks for vocal-led pop and hip hop.^[7]^[8]

Reception

Reception at launch focused on three things: the speed claim, the enterprise positioning, and the WPP partnership. VentureBeat called the eight-step generation pipeline a "breakthrough" that cut audio production time from weeks to minutes for brand teams that previously commissioned bespoke sound design. The Decoder framed the release as Stability AI choosing to compete on speed and licensing safety rather than vocal generation. Winbuzzer emphasized the inpainting workflow as the most useful day-to-day addition for sound designers, since it removes the all-or-nothing regeneration problem of earlier tools.^[2]^[3]^[12]

Independent coverage of Stable Audio's quality has tended to praise the model for its instrumental fidelity at 44.1 kHz stereo while flagging the absence of vocals as a real limitation for anyone trying to use it for full songs. As of mid-2026, the model has not been widely benchmarked against Suno v5 or ElevenLabs Music in listener preference tests, partly because the systems target different output categories.^[7]^[8]

The WPP partnership is the part of the launch that says the most about the strategy. By going through Landor and amp, Stability AI gets distribution into hundreds of brands that already buy sound design as a service. To justify that focus, Stability AI cited Ipsos research in its announcement, stating that custom audio can make a brand eight times more memorable, that audio influences brand engagement by 86 percent, and that only 6 percent of creative uses a distinct sound identity. That is a smaller market than consumer music apps, but it is one where licensing provenance and on-premises deployment are worth real money.^[1]^[2]

References

1. Stability AI. "Stability AI Introduces Stable Audio 2.5, the First Audio Model Built for Enterprise Sound Production at Scale." September 10, 2025. https://stability.ai/news-updates/stability-ai-introduces-stable-audio-25-the-first-audio-model-built-for-enterprise-sound-production-at-scale Accessed 2026-06-28.
2. Michael Nuñez. "Stability AI's enterprise audio model cuts production time from weeks to minutes with 8-step generation breakthrough." VentureBeat, September 10, 2025. https://venturebeat.com/ai/stability-ais-enterprise-audio-model-cuts-production-time-from-weeks-to Accessed 2026-06-28.
3. The Decoder. "Stability AI releases Stable Audio 2.5 for faster and more complex AI-generated music." September 2025. https://the-decoder.com/stability-ai-releases-stable-audio-2-5-for-faster-and-more-complex-ai-generated-music/ Accessed 2026-06-28.
4. Stability AI. "Introducing Stable Audio 2.0." April 3, 2024. https://stability.ai/news-updates/stable-audio-2-0 Accessed 2026-06-28.
5. Voicebot.ai. "Stability AI Releases Augmented Text-to-Music Engine Stable Audio 2 With Upload and Style Transfer Features." April 4, 2024. https://voicebot.ai/2024/04/04/stability-ai-releases-augmented-text-to-music-engine-stable-audio-2-with-upload-and-style-transfer-features/ Accessed 2026-06-28.
6. Stability AI. "Stability AI and Arm Release Stable Audio Open Small, Enabling Real-World Deployment for On-Device Audio Generation." May 2025. https://stability.ai/news/stability-ai-and-arm-release-stable-audio-open-small-enabling-real-world-deployment-for-on-device-audio-control Accessed 2026-06-28.
7. Geeky Gadgets. "AI music creators compared: Udio vs Suno vs Stable Audio vs Audio Shake." 2025. Accessed 2026-06-28.
8. aicompetence.org. "AI Music Generation: Suno vs. Udio vs. Stable Audio." 2025. Accessed 2026-06-28.
9. Zachary Novack et al. "Fast Text-to-Audio Generation with Adversarial Post-Training." arXiv:2505.08175, submitted May 13, 2025. https://arxiv.org/abs/2505.08175 Accessed 2026-06-28.
10. Stability AI. "Stability AI License." https://stability.ai/license Accessed 2026-06-28.
11. Stability AI. "Stability AI Developer Platform Pricing." https://platform.stability.ai/pricing Accessed 2026-06-28.
12. Winbuzzer. "Stability AI Launches Stable Audio 2.5 with Enterprise-Grade Speed and Creative Control." September 11, 2025. https://winbuzzer.com/2025/09/11/stability-ai-launches-stable-audio-2-5-with-enterprise-grade-speed-and-creative-control-xcxwbn/ Accessed 2026-06-28.
13. fal.ai blog. "Stable Audio 2.5 Now Available on fal." 2025. https://blog.fal.ai/stable-audio-2-5-now-available-on-fal/ Accessed 2026-06-28.
