Stable Audio
Last reviewed
May 1, 2026
Sources
20 citations
Review status
Source-backed
Revision
v1 · 4,000 words
Stable Audio is a family of generative AI models for text-to-audio synthesis developed by Stability AI, the British startup best known for the Stable Diffusion image model. The system takes a natural-language prompt and produces music or sound effects as a stereo audio file. Stable Audio launched on September 13, 2023 as a hosted commercial product at stableaudio.com and was the first widely available text-to-music tool to render audio at full 44.1 kHz sample rate. The family has since grown to include the larger Stable Audio 2.0 (April 2024), Stable Audio 2.5 (September 2025), the open-weight Stable Audio Open (June 2024), and the on-device Stable Audio Open Small (May 2025).
The system is built on latent diffusion, the same family of models that powers Stable Diffusion, but adapted for raw audio waveforms. A variational autoencoder compresses 44.1 kHz audio into a much shorter latent sequence, and a diffusion model operating in that latent space (a U-Net in the first release, a diffusion transformer, or DiT, from version 2.0 onward) produces samples conditioned on a text embedding plus explicit timing tokens. The timing mechanism, in which the model receives the desired start position and total duration as inputs, was the central methodological contribution of the paper Fast Timing-Conditioned Latent Audio Diffusion (Evans, Carr, Taylor, Hawley, and Pons, ICML 2024). It lets a single model generate variable-length clips up to its maximum context, which earlier text-to-audio systems could not do cleanly.
Within AI music generation, Stable Audio sits in a different niche than Suno and Udio. Suno and Udio focus on full songs with realistic AI vocals; Stable Audio targets instrumental tracks, sound design, ambient beds, and short loops, with all of its commercial training data licensed. The open-source variant competes more directly with Meta's AudioCraft family for self-hosted use.
Stability AI's audio research group was led by Zach Evans and a small team including CJ Carr, Josiah Taylor, Scott Hawley, Jordi Pons, Julian Parker, and Zack Zukowski. Their research thread began with the open-source audio-diffusion-pytorch library and the Dance Diffusion experimental models in 2022, then matured into the commercial Stable Audio product in late 2023.
| Date | Event |
|---|---|
| 2022 | Stability AI publishes early open-source audio diffusion experiments (Dance Diffusion, audio-diffusion-pytorch). |
| September 13, 2023 | Stable Audio 1.0 launches at stableaudio.com. Free tier produces tracks up to 45 seconds; Pro subscription up to 90 seconds at 44.1 kHz stereo. |
| October 2023 | Stable Audio is named one of TIME's Best Inventions of 2023. |
| February 7, 2024 | Evans, Carr, Taylor, Hawley, and Pons post Fast Timing-Conditioned Latent Audio Diffusion on arXiv (2402.04825); accepted to ICML 2024. |
| March 23, 2024 | Stability AI founder and CEO Emad Mostaque resigns; COO Shan Shan Wong and CTO Christian Laforte serve as interim co-CEOs. |
| April 3, 2024 | Stable Audio 2.0 launches: full tracks up to 3 minutes, audio-to-audio prompting, structured musical compositions, new compressed autoencoder, DiT replaces U-Net. |
| June 5, 2024 | Stable Audio Open releases on Hugging Face under the Stability AI Community License; 1.21 billion parameters, up to 47 seconds at 44.1 kHz stereo, trained only on Creative Commons audio. |
| June 25, 2024 | Prem Akkaraju (former CEO of Weta Digital) named permanent CEO of Stability AI. |
| July 19, 2024 | Stable Audio Open paper appears on arXiv (2407.14358), authored by Evans, Parker, Carr, Zukowski, Taylor, and Pons. |
| September 24, 2024 | James Cameron joins the Stability AI board of directors. |
| May 14, 2025 | Stable Audio Open Small (341M parameters) released in collaboration with Arm, optimised to run on-device on smartphones; generates 11 seconds of audio in under 8 seconds on an Arm CPU. |
| September 10, 2025 | Stable Audio 2.5 launches as the company's first audio model aimed at enterprise sound production: 3-minute tracks, audio inpainting, and inference under 2 seconds via the Adversarial Relativistic-Contrastive (ARC) training method. |
Stable Audio shipped during a turbulent stretch for Stability AI. By early 2024 the company was burning cash, losing senior researchers, and facing the resignation of its founder. The audio team kept shipping at a steady cadence through this period.
Stable Audio is a latent diffusion model for raw audio. It has three main components: a variational autoencoder that turns waveforms into a compact latent sequence, a text encoder that turns the prompt into a sequence of conditioning tokens, and a diffusion backbone (a U-Net in 1.0, a diffusion transformer from 2.0 onward) that learns to denoise audio latents conditioned on the text and on timing information. The original paper reports that the system can render up to 95 seconds of stereo signal at 44.1 kHz in roughly 8 seconds on a single NVIDIA A100 GPU.
The autoencoder is a fully convolutional variational autoencoder (VAE) that takes raw stereo waveforms and compresses them along the time axis into a latent representation suitable for diffusion. Working in this compressed latent space is what makes long-form audio tractable for diffusion: a 95-second stereo clip at 44.1 kHz contains roughly 8.4 million sample values, far too many positions for a transformer to attend over directly. The Stable Audio Open autoencoder reports a latent rate of 21.5 Hz, meaning each second of audio is represented by 21.5 latent vectors instead of 44,100 amplitude samples per channel. The autoencoder is trained with a combination of reconstruction and adversarial losses to keep perceptual fidelity high after the round trip through the decoder.
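The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only: the downsampling factor of 2,048 is an assumption inferred from the reported 21.5 Hz latent rate (44,100 / 2,048 ≈ 21.5), and the function names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 44_100
CHANNELS = 2
# Assumption: a time-axis downsampling factor of 2048, inferred from the
# reported ~21.5 Hz latent rate. It is not a figure quoted in the article.
DOWNSAMPLE_FACTOR = 2048


def raw_sample_values(seconds: float) -> int:
    """Amplitude values in a stereo clip of the given length."""
    return int(seconds * SAMPLE_RATE_HZ * CHANNELS)


def latent_rate_hz() -> float:
    """Latent vectors per second of audio under the assumed factor."""
    return SAMPLE_RATE_HZ / DOWNSAMPLE_FACTOR


def latent_frames(seconds: float) -> int:
    """Latent vectors needed to cover the clip, rounded up."""
    return math.ceil(seconds * SAMPLE_RATE_HZ / DOWNSAMPLE_FACTOR)
```

Under these assumptions, `raw_sample_values(95)` gives 8,379,000 values, while `latent_frames(95)` gives only a couple of thousand sequence positions, which is the reduction that makes attention over a full clip practical.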
Stable Audio 1.0 used a U-Net-style backbone borrowed from image diffusion. Stable Audio 2.0 replaced this with a diffusion transformer, the same architecture family popularised in image work by Peebles and Xie. The DiT processes the audio latents as a sequence of tokens with self-attention layers and is conditioned on text and timing information through cross-attention and adaptive layer normalisation. The transformer's ability to model long-range structure is what lets 2.0 produce coherent three-minute compositions with intro, development, and outro sections, where the U-Net of 1.0 tended to drift over the same 90-second window.
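Adaptive layer normalisation can be pictured as ordinary layer normalisation whose scale and shift are predicted from the conditioning signal instead of being fixed. The following is a minimal pure-Python sketch of that general idea, not Stability AI's implementation; the function names, the `(1 + scale)` parameterisation, and the weight layout are assumptions made for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, vec):
    """Dense projection: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def adaptive_layer_norm(x, cond, w_scale, b_scale, w_shift, b_shift):
    """Layer norm whose scale and shift depend on the conditioning vector.

    `cond` stands in for a pooled text-plus-timing embedding (hypothetical).
    """
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    return [n * (1.0 + s) + t
            for n, s, t in zip(layer_norm(x), scale, shift)]
```

With all projection weights and biases set to zero the function reduces to plain layer normalisation, so the conditioning only changes the activations once the projections have been trained.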
The commercial Stable Audio uses a CLAP-style joint text-audio encoder for conditioning, which gives the model a shared representation of text and music. Stable Audio Open replaced the CLAP text encoder with the open-source pre-trained T5-base model so the entire pipeline could be released with open weights.
The novel piece is timing conditioning. The diffusion model receives two extra conditioning tokens at each step: the desired start time within the source audio and the desired total duration. During training, every clip is paired with these tags so the model learns the correspondence between the values and the resulting waveform shape. At inference, the user can ask for a 12-second clip, a 47-second loop, or a 95-second track, and the model produces something that is recognisably that long, with a real ending rather than an abrupt cut-off. Most earlier text-to-music systems either generated a fixed length or required ad hoc looping, and autoregressive systems such as MusicLM and MusicGen generated token by token until a preset limit was reached. Variable, controllable duration is the headline feature that distinguishes Stable Audio from those alternatives.
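One way to picture how clips get their timing tags is sketched below. It is a simplified illustration, not Stability AI's code. The 95-second window comes from the text above; the function names, the dictionary layout, the key names `seconds_start` and `seconds_total`, and the idea that shorter tracks are padded with silence to fill the window are all assumptions made for the example.

```python
import random

WINDOW_SECONDS = 95.0  # maximum context quoted in the article


def training_timing_tags(track_seconds, rng):
    """Pick a random training window and return its timing tags."""
    latest_start = max(0.0, track_seconds - WINDOW_SECONDS)
    start = rng.uniform(0.0, latest_start)
    return {"seconds_start": start, "seconds_total": track_seconds}


def inference_timing_tags(desired_seconds):
    """Tags for generating a fresh clip of the requested length."""
    if not 0 < desired_seconds <= WINDOW_SECONDS:
        raise ValueError("duration must be within the model window")
    return {"seconds_start": 0.0, "seconds_total": float(desired_seconds)}


def trailing_silence_seconds(tags):
    """Part of the window that falls after the end of the track."""
    audio_in_window = min(WINDOW_SECONDS,
                          tags["seconds_total"] - tags["seconds_start"])
    return WINDOW_SECONDS - audio_in_window
```

Under these assumptions a 30-second track always starts at zero and leaves 65 seconds of trailing silence in the window, which is the kind of example from which a model could learn what a given total duration looks like, including where the music stops.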
The commercial models output 44.1 kHz stereo, the standard for distributable music. Stable Audio Open 1.0 also outputs 44.1 kHz stereo (despite some early reporting that called it mono). Stable Audio Open Small produces 44.1 kHz stereo as well, but with shorter clips (up to 11 seconds) and a much smaller parameter budget so it can run on a smartphone CPU.
Stability AI made the licensed-only training story a central part of the Stable Audio pitch from launch, in part because the image side of the company had been entangled in copyright disputes around Stable Diffusion 1.5. The commercial models and the open variant use disjoint corpora.
| Model | Training corpus | Size | Licensing approach |
|---|---|---|---|
| Stable Audio 1.0, 2.0, 2.5 | AudioSparx licensed catalogue | ~800,000 audio files (~19,500 hours) of music, sound effects, and field recordings | Commercial licence with AudioSparx; artists in the catalogue offered an opt-out option before training. |
| Stable Audio Open 1.0 | Freesound + Free Music Archive (FMA) | 486,492 recordings (472,618 from Freesound, 13,874 from FMA), roughly 7,300 hours total | Restricted to CC0, CC-BY, and CC Sampling+ tracks; copyright cleared with Audible Magic plus human review of FMA metadata. |
| Stable Audio Open Small | Freesound + Free Music Archive | Subset of the Open 1.0 corpus | Same Creative Commons restrictions. |
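The corpus figures in the table can be cross-checked with simple arithmetic. The snippet below only restates numbers from the table; the helper name is hypothetical, and the average clip lengths it produces are derived estimates (the hour totals are themselves approximate), not figures published by Stability AI.

```python
FREESOUND_RECORDINGS = 472_618
FMA_RECORDINGS = 13_874
OPEN_TOTAL_RECORDINGS = 486_492
OPEN_HOURS = 7_300            # approximate, from the table
AUDIOSPARX_FILES = 800_000    # approximate, from the table
AUDIOSPARX_HOURS = 19_500     # approximate, from the table


def mean_clip_seconds(hours, files):
    """Average clip length implied by a corpus size in hours and files."""
    return hours * 3600 / files


# The two Open 1.0 sources add up to the stated total.
assert FREESOUND_RECORDINGS + FMA_RECORDINGS == OPEN_TOTAL_RECORDINGS

open_mean = mean_clip_seconds(OPEN_HOURS, OPEN_TOTAL_RECORDINGS)
licensed_mean = mean_clip_seconds(AUDIOSPARX_HOURS, AUDIOSPARX_FILES)
freesound_share = FREESOUND_RECORDINGS / OPEN_TOTAL_RECORDINGS
```

The implied averages are a little under a minute per file for the open corpus and roughly a minute and a half for the licensed one, and Freesound accounts for about 97 percent of the open recordings by count, which is consistent with the open model's orientation toward sound effects and short loops.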
The AudioSparx partnership was struck before the original launch and gave Stability AI a large catalogue of professionally produced music, sound effects, and Foley to train on. AudioSparx represents over 8,000 contributors, and the company has reported that roughly 10 percent of its artists chose to opt out of inclusion in the training set. For audio uploaded in the audio-to-audio mode, Stable Audio uses Audible Magic's content recognition service to block uploads that match copyrighted recordings.
This posture matters commercially. In June 2024 the RIAA filed lawsuits against Suno and Udio over alleged use of copyrighted recordings in training. Stable Audio was not named in those suits, and Stability AI has continued to lean on the AudioSparx provenance and the Creative Commons base for the open variant when pitching to enterprise customers.
| Version | Released | Max length | Sample rate / channels | Parameters | License | Distribution | Notable features |
|---|---|---|---|---|---|---|---|
| Stable Audio 1.0 | September 13, 2023 | 90 seconds (Pro), 45 seconds (Free) | 44.1 kHz stereo | Not disclosed | Commercial subscription | stableaudio.com | First commercial 44.1 kHz text-to-music product; latent diffusion with U-Net backbone; CLAP text encoder; timing conditioning. |
| Stable Audio 2.0 | April 3, 2024 | 3 minutes (180 seconds) | 44.1 kHz stereo | Not disclosed | Commercial subscription, free tier | stableaudio.com, Stability AI API | Audio-to-audio prompting, structured compositions with intro/development/outro, new compressed autoencoder, DiT backbone. |
| Stable Audio Open 1.0 | June 5, 2024 | 47 seconds | 44.1 kHz stereo | ~1.21 billion | Stability AI Community License (commercial use up to $1M annual revenue) | Hugging Face (stabilityai/stable-audio-open-1.0), GitHub | First open-weight Stable Audio variant; T5-base text encoder; trained on Freesound and FMA; oriented toward sound effects and short loops. |
| Stable Audio Open Small | May 14, 2025 | 11 seconds | 44.1 kHz stereo | 341 million | Stability AI Community License | Hugging Face, Arm reference implementations | Runs on smartphone CPUs; collaboration with Arm; sub-8-second generation on Arm-based devices. |
| Stable Audio 2.5 | September 10, 2025 | 3 minutes | 44.1 kHz stereo | Not disclosed | Commercial, enterprise licensing | stableaudio.com, Stability AI API, fal, Replicate, ComfyUI, on-premises | Audio inpainting, ARC training for sub-2-second inference, improved musical structure and prompt adherence, fine-tuning for enterprise sound libraries. |
The commercial line moves toward longer, more structured tracks and faster inference; the open line moves toward smaller, more accessible weights that researchers and hobbyists can fine-tune. The two lines are complementary by design rather than a fork: Stability AI uses the open releases as research baselines and to attract community fine-tunes that feed back into the commercial product strategy.
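As a quick reading aid, the version table above can be treated as a small lookup. The sketch below encodes only the maximum lengths and open-weight status listed there; the `Variant` class and `candidates` function are hypothetical names, and a real selection would also weigh licence terms and hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    max_seconds: int
    open_weights: bool


# Values copied from the version table (1.0 uses the Pro-tier limit).
VARIANTS = (
    Variant("Stable Audio 1.0", 90, False),
    Variant("Stable Audio 2.0", 180, False),
    Variant("Stable Audio Open 1.0", 47, True),
    Variant("Stable Audio Open Small", 11, True),
    Variant("Stable Audio 2.5", 180, False),
)


def candidates(duration_seconds, need_open_weights=False):
    """Names of variants that can produce a clip of the given length."""
    return [
        v.name
        for v in VARIANTS
        if v.max_seconds >= duration_seconds
        and (v.open_weights or not need_open_weights)
    ]
```

For example, a 30-second clip with open weights leaves only Stable Audio Open 1.0, and no open-weight variant in the table covers a 60-second request.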
The model family supports several distinct generation modes.
| Capability | Available in | Description |
|---|---|---|
| Text-to-music | All versions | Generate instrumental music from a prompt that names genre, instruments, mood, BPM, and structural cues. |
| Text-to-sound-effect | All versions, strongest in Open 1.0 | Generate Foley and ambient effects ("rain on a tin roof", "footsteps on gravel", "crowd murmuring"). |
| Audio-to-audio | 2.0 and 2.5 | Upload an audio sample and transform it under a text prompt, useful for style transfer or stem variations. |
| Structured composition | 2.0 and 2.5 | Produce intros, developments, and outros within a single track instead of one looping section. |
| Audio inpainting | 2.5 | Insert generated audio at a chosen position within an uploaded clip while preserving surrounding context. |
| Variable-length output | All versions | Specify the desired total duration directly, courtesy of the timing-token mechanism. |
| On-device inference | Open Small | Run the model entirely on an Arm-based smartphone CPU without an internet connection. |
Vocal generation is intentionally limited. The commercial models generally avoid lyrics; Stable Audio Open's documentation states explicitly that the model cannot generate realistic vocals, and the training data was filtered to lean on instrumental and sound-effect content. This is the largest functional gap between Stable Audio and the Suno/Udio family.
The text-to-music landscape now contains several distinct product categories. Stable Audio overlaps with all of them but does not compete head-on with any single one.
| System | Vendor | First release | Max length | Vocals | Output | Training data | Distribution |
|---|---|---|---|---|---|---|---|
| Stable Audio 2.5 | Stability AI | September 2025 | 3 minutes | Limited | 44.1 kHz stereo | AudioSparx (licensed) | Commercial API, web, enterprise |
| Stable Audio Open 1.0 | Stability AI | June 2024 | 47 seconds | No | 44.1 kHz stereo | Freesound + FMA (CC) | Open weights (Community License) |
| MusicGen | Meta (AudioCraft) | June 2023 | ~30 seconds (extendable) | No | 32 kHz mono/stereo | Licensed (Shutterstock + others) | Open weights (CC-BY-NC) |
| AudioGen | Meta (AudioCraft) | October 2022 | Short clips | No | 16 kHz mono | AudioSet, BBC Sound Effects | Open weights |
| MusicLM | Google Research | January 2023 | Up to 5 minutes (research) | Hummed only | 24 kHz | Free Music Archive + private corpora | Limited Test Kitchen access, then folded into other Google products |
| Lyria 2 | Google DeepMind | April 2025 (Lyria 1 in November 2023) | Multiple minutes | Yes | 48 kHz | Not disclosed | Internal Google services and YouTube Music AI |
| Suno v5 | Suno | December 2023 (v1), September 2025 (v5) | ~4 minutes | Yes (high realism) | 44.1 kHz | Disputed (RIAA lawsuit June 2024) | Web, mobile, API |
| Udio | Udio | April 2024 | ~15 minutes (with extensions) | Yes (high realism) | 44.1 kHz | Disputed (RIAA lawsuit June 2024) | Web, mobile, API |
| ElevenLabs Sound Effects | ElevenLabs | May 2024 | 22 seconds | Effects, no music vocals | 44.1 kHz | Licensed audio | Web, API |
| Riffusion | Forsgren and Martiros (independent) | December 2022 | Short loops | Limited | Spectrogram via Stable Diffusion | Open-weight Stable Diffusion fine-tune | Web, open weights |
| Jukebox | OpenAI | April 2020 | Multiple minutes | Yes (lo-fi) | Lo-fi 44.1 kHz | 1.2 million songs (research only) | Open weights, research |
The practical takeaway: Stable Audio is the strongest enterprise-grade option for instrumental music and sound design with clear licensing, MusicGen is the strongest open self-hosting option for music, Suno and Udio dominate consumer full-song generation, ElevenLabs leads in dedicated sound effects, and Lyria sits inside Google's product ecosystem.
Stable Audio is used across content production wherever on-demand instrumental audio is the bottleneck.
The variable-length generation through timing conditioning is the technical headline. Most rival text-to-music models produce fixed-length outputs and require post-processing to reach the desired duration. Stable Audio handles 12-second loops, 47-second open-model clips, and 3-minute commercial tracks with the same model and the same prompt format. The 44.1 kHz stereo output also matches the standard for distributable music, where MusicGen tops out at 32 kHz and many earlier systems produced 16 or 24 kHz.
The latent diffusion approach also gives reasonable inference speed. The 1.0 paper reported 95 seconds of stereo audio in 8 seconds on an A100, and Stable Audio 2.5 with the ARC post-training method generates a 3-minute track in under 2 seconds on a GPU. Autoregressive token-based systems are typically slower at long outputs because they generate token by token, so generation time grows with clip length.
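These speed claims are easier to compare as real-time factors, meaning seconds of audio produced per second of compute. The sketch below uses only figures quoted in this article; the helper name is hypothetical, and because the 2.5 and Open Small timings are stated as upper bounds ("under 2 seconds", "under 8 seconds"), the corresponding factors are lower bounds.

```python
def realtime_factor(audio_seconds, wall_seconds):
    """Seconds of audio generated per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# (audio seconds, reported generation seconds), taken from the article.
REPORTED = {
    "Stable Audio 1.0 on an A100": (95, 8),
    "Stable Audio 2.5 on a GPU": (180, 2),         # time is an upper bound
    "Stable Audio Open Small on Arm": (11, 8),     # time is an upper bound
}

FACTORS = {name: realtime_factor(audio, wall)
           for name, (audio, wall) in REPORTED.items()}
```

On these numbers the 1.0 system runs at roughly twelve times real time, 2.5 at ninety times or better, and the on-device model stays above real time even on a phone CPU.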
Licensed training data has become a meaningful commercial advantage. Customers who pay for music generation typically need to use the output in commercial contexts, and a model trained on a clearly licensed corpus (AudioSparx) reduces downstream risk in a way that the disputed Suno and Udio corpora do not. The combination of a commercial product and an open-weight variant under the Stability AI Community License also lets Stability AI cover both enterprise and community use without splitting the brand.
Vocal generation is the largest gap. The Stable Audio family is essentially instrumental, and even the audio-to-audio modes do not produce convincing lyric-driven songs. Suno and Udio dominate the consumer mindshare for AI-generated full songs as a direct result.
The open variant trades a lot to be open. Stable Audio Open 1.0 is capped at 47 seconds, and Stable Audio Open Small at 11 seconds, both far shorter than the commercial 3-minute ceiling. Both open models also struggle with non-Western musical styles because their Creative Commons training data skews Western, and both are explicit in their model cards that fine-tuning will probably be needed for specific genres.
The quality is genre-dependent. Reviewers consistently rate Stable Audio strong on cinematic and electronic instrumental tracks, weaker on jazz and acoustic styles, and limited on anything that benefits from a lead vocal line. Some prompts produce coherent music; others drift or fall apart over the full 3-minute window. As with image diffusion, prompt engineering carries a lot of weight.
Integration with professional digital audio workstations (DAWs) is still light. There is no first-party Logic Pro or Ableton plug-in. Third-party integrations through ComfyUI and the API exist, but most professional producers still pull Stable Audio outputs into the DAW manually as audio files.
Licensing terms have evolved with corporate turbulence. The Stability AI Community License changed several times in 2024, the open variant's redistribution terms shifted, and enterprises have asked for clearer commercial commitments before adopting the technology at scale. The September 2025 enterprise release of Stable Audio 2.5 is in part an answer to those concerns.
Stability AI was founded in 2019 by Emad Mostaque and rose to prominence in August 2022 with the public release of Stable Diffusion, a text-to-image model whose open weights catalysed the open-source generative AI ecosystem. The company shipped a wide product line over the next two years: Stable Diffusion 1.x and 2.x, SDXL, Stable LM language models, Stable Video Diffusion (November 2023), Stable Cascade (February 2024), Stable Diffusion 3 (February to June 2024), Stable Diffusion 3.5 (October 2024), and the Stable Audio family covered here.
The company hit a rough patch in 2024. Emad Mostaque resigned as CEO on March 23, 2024, citing concerns about "centralised AI" and stepping down from the board. COO Shan Shan Wong and CTO Christian Laforte served as interim co-CEOs, and Stability laid off roughly 10 percent of staff in April. Prem Akkaraju, the former CEO of visual effects studio Weta Digital, was appointed permanent CEO on June 25, 2024. Sean Parker briefly served as executive chairman, and on September 24, 2024 the filmmaker James Cameron joined Stability AI's board of directors. Through that turmoil and afterwards the audio research group continued to ship: Stable Audio 2.0 in April 2024, Stable Audio Open in June 2024, the Open Small release in May 2025, and Stable Audio 2.5 in September 2025.
Stable Audio is one of several Stability AI product families. The current commercial line-up includes Stable Diffusion 3.5 for image generation, Stable Video for video generation, Stable Audio for audio, and various enterprise-only models. Many of the original image researchers (Robin Rombach, Andreas Blattmann, Dominik Lorenz) left Stability in 2024 to found Black Forest Labs, which shipped the Flux.1 image models in August of that year. The audio team has been more stable than the image team.
Stable Audio Open's June 2024 release pushed the open-source text-to-audio frontier forward, much as the original Stable Diffusion did for images two years earlier. Researchers used the weights as a baseline for academic papers, and hobbyists built fine-tuned variants for specific genres and sound libraries.
In parallel, the commercial Stable Audio offering pivoted toward enterprise customers, culminating in the Stable Audio 2.5 release. The ARC training method that gives 2.5 its sub-2-second inference was developed for the throughput demands of agencies generating large volumes of branded audio. Distribution expanded beyond stableaudio.com to include fal, Replicate, ComfyUI, and the Stability AI API, plus on-premises licensing for enterprises with strict data-handling requirements.
The consumer end of the market continued to be dominated by Suno and Udio, both of which faced RIAA lawsuits in mid-2024 and reached settlements with major labels in late 2025 (Suno with Warner Music, Udio with Universal Music). Google released Lyria 2 in 2025 as the engine behind YouTube's AI music tools. In this crowded market, Stable Audio has carved out a defensible position as the high-fidelity, licensed, instrumental-first option, with a credible open-source variant for the research community.