Stable Audio

Updated Jun 22, 2026

Stable Audio is a family of generative AI models from Stability AI that turn a text prompt into music or sound effects as a stereo audio file. It launched on September 13, 2023 as a hosted commercial product at stableaudio.com and was, in Stability AI's words, "the first music generation product enabling the creation of high-quality, 44.1 kHz music for commercial use via latent diffusion."^[3]^[9] The family has since grown to include the larger Stable Audio 2.0 (April 2024), Stable Audio 2.5 (September 2025), Stable Audio 3.0 (May 2026), the open-weight Stable Audio Open (June 2024), and the on-device Stable Audio Open Small (May 2025).^[4]^[6]^[7]^[21]

The system is built on latent diffusion, the same family of diffusion models that powers Stable Diffusion, but adapted for raw audio waveforms. A variational autoencoder compresses 44.1 kHz audio into a much shorter latent sequence, and a diffusion transformer (DiT) operating in that latent space produces samples conditioned on a text embedding plus explicit timing tokens.^[1] The timing trick, where the model receives the desired start position and total duration as inputs, was the central methodological contribution of the paper Fast Timing-Conditioned Latent Audio Diffusion (Evans, Carr, Taylor, Hawley, and Pons, ICML 2024).^[1] It lets a single model generate variable-length clips up to its maximum context, which earlier text-to-audio systems could not do cleanly.^[1]

Within AI music generation, Stable Audio sits in a different niche than Suno and Udio. Suno and Udio focus on full songs with realistic AI vocals; Stable Audio targets instrumental tracks, sound design, ambient beds, and short loops, with all of its commercial training data licensed.^[3] The open-source variant competes more directly with Meta's AudioCraft family for self-hosted use.

What is Stable Audio used for?

Stable Audio is used across content production wherever on-demand instrumental audio is the bottleneck. Common applications include:

Background music for video, podcasts, and short-form social content where royalty clearance is the obstacle.
Sound effect libraries for games, animation, and film, especially effects that are easier to describe than to record.
Music sketches and ideation for songwriters and producers testing arrangement ideas before tracking.
Looping music beds for streamers who need hours of varied music without licence headaches.
Game audio prototyping during pre-production, before commissioning a composer.
Brand-led audio identities for marketing campaigns, which is the niche Stable Audio 2.5 explicitly targets with its enterprise fine-tuning options.^[6]
On-device sample generation via Stable Audio Open Small in field environments without connectivity.^[7]

When was Stable Audio released?

Stability AI's audio research group was led by Zach Evans, working with a small team that included CJ Carr, Josiah Taylor, Scott Hawley, Jordi Pons, Julian Parker, and Zack Zukowski.^[1]^[2] Their research thread began with the open-source audio-diffusion-pytorch library and the Dance Diffusion experimental models in 2022, then matured into the commercial Stable Audio product in late 2023.^[3]

Date	Event
2022	Stability AI publishes early open-source audio diffusion experiments (Dance Diffusion, audio-diffusion-pytorch).
September 13, 2023	Stable Audio 1.0 launches at stableaudio.com. Free tier produces tracks up to 45 seconds; Pro subscription up to 90 seconds at 44.1 kHz stereo.
October 2023	Stable Audio is named one of TIME's Best Inventions of 2023.
February 7, 2024	Evans, Carr, Taylor, Hawley, and Pons post Fast Timing-Conditioned Latent Audio Diffusion on arXiv (2402.04825); accepted to ICML 2024.
March 23, 2024	Stability AI founder and CEO Emad Mostaque resigns; COO Shan Shan Wong and CTO Christian Laforte serve as interim co-CEOs.
April 3, 2024	Stable Audio 2.0 launches: full tracks up to 3 minutes, audio-to-audio prompting, structured musical compositions, new compressed autoencoder, DiT replaces U-Net.
June 5, 2024	Stable Audio Open releases on Hugging Face under the Stability AI Community License; 1.21 billion parameters, up to 47 seconds at 44.1 kHz stereo, trained only on Creative Commons audio.
June 25, 2024	Prem Akkaraju (former CEO of Weta Digital) named permanent CEO of Stability AI.
July 19, 2024	Stable Audio Open paper appears on arXiv (2407.14358), authored by Evans, Parker, Carr, Zukowski, Taylor, and Pons.
September 24, 2024	James Cameron joins the Stability AI board of directors.
May 14, 2025	Stable Audio Open Small (341M parameters) released in collaboration with Arm, optimised to run on-device on smartphones; generates 11 seconds of audio in under 8 seconds on an Arm CPU.
September 10, 2025	Stable Audio 2.5 launches as the company's first audio model aimed at enterprise sound production: 3-minute tracks, audio inpainting, and inference under 2 seconds via the Adversarial Relativistic-Contrastive (ARC) training method.
May 20, 2026	Stable Audio 3.0 launches: a family of four models (Small SFX, Small, Medium, Large) generating tracks up to 6 minutes 20 seconds, with the Small SFX, Small, and Medium variants released as open weights on Hugging Face.

Stable Audio shipped during a turbulent stretch for Stability AI. By early 2024 the company was burning cash, losing senior researchers, and facing the resignation of its founder.^[15] The audio team kept shipping at a steady cadence through this period.

How does Stable Audio work?

Stable Audio is a latent diffusion model for raw audio. It has three main components: a variational autoencoder that turns waveforms into a compact latent sequence, a text encoder that turns the prompt into a sequence of conditioning tokens, and a diffusion transformer that learns to denoise audio latents conditioned on the text and on timing information.^[1] The original paper reports that the system can render up to 95 seconds of stereo signal at 44.1 kHz in roughly 8 seconds on a single NVIDIA A100 GPU.^[1]

Audio autoencoder

The autoencoder is a fully convolutional variational autoencoder (VAE) that takes raw stereo waveforms and compresses them along the time axis into a latent representation suitable for diffusion.^[1] Compressing into a latent space is what makes long-form audio tractable for diffusion: a 95-second stereo clip at 44.1 kHz contains roughly 8.4 million sample values, far too long a sequence for a transformer to attend over directly. The Stable Audio Open autoencoder reports a latent rate of 21.5 Hz, meaning each second of audio is represented by about 21.5 latent vectors instead of 44,100 samples per channel.^[2] The autoencoder is trained with a combination of reconstruction and adversarial losses to keep perceptual fidelity high after the round trip through the decoder.^[2]
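The compression arithmetic in this paragraph can be checked with a short calculation. The sketch below uses only figures quoted in the article (44.1 kHz, stereo, 95 seconds, a 21.5 Hz latent rate). The function names are illustrative and not part of any Stability AI library, and applying the Stable Audio Open latent rate to a 95-second clip is an assumption made purely for illustration.

```python
import math

SAMPLE_RATE_HZ = 44_100   # sample rate quoted in the article
CHANNELS = 2              # stereo
LATENT_RATE_HZ = 21.5     # latent rate reported for Stable Audio Open (assumed here)


def raw_sample_values(seconds: float) -> int:
    """Number of amplitude values in a stereo clip of the given length."""
    return round(seconds * SAMPLE_RATE_HZ * CHANNELS)


def latent_frames(seconds: float) -> int:
    """Number of latent vectors needed to cover the same clip."""
    return math.ceil(seconds * LATENT_RATE_HZ)


clip_seconds = 95
print(f"raw sample values: {raw_sample_values(clip_seconds):,}")
print(f"latent frames:     {latent_frames(clip_seconds):,}")
print(f"time-axis reduction: roughly {SAMPLE_RATE_HZ / LATENT_RATE_HZ:.0f}x")
```

Under these assumptions the sequence the transformer attends over shrinks from millions of values to a few thousand latent frames, which is the point the paragraph makes.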

Diffusion backbone

Stable Audio 1.0 used a U-Net-style backbone borrowed from image diffusion.^[1] Stable Audio 2.0 replaced this with a diffusion transformer, the same architecture family popularised in image work by Peebles and Xie.^[4] The DiT processes the audio latents as a sequence of tokens with self-attention layers and is conditioned on text and timing information through cross-attention and adaptive layer normalisation. The transformer's ability to model long-range structure is what lets 2.0 produce coherent three-minute compositions with intro, development, and outro sections, whereas the U-Net of 1.0 tended to drift even within its shorter 90-second window.^[4]
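As a rough illustration of adaptive layer normalisation, the sketch below normalises a feature vector and then scales and shifts it with values predicted from a conditioning vector. It is not the Stable Audio implementation: the `(1 + scale)` parameterisation follows the convention of the DiT paper, and the weights, shapes, and function names are all hypothetical.

```python
import math
from typing import List


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    """Normalise a feature vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights: List[List[float]], bias: List[float], c: List[float]) -> List[float]:
    """Matrix-vector product plus bias: one output per weight row."""
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weights, bias)]


def adaptive_layer_norm(x, cond, w_scale, b_scale, w_shift, b_shift):
    """Layer-normalise x, then modulate it with a scale and shift predicted from cond."""
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    normed = layer_norm(x)
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]


# Hypothetical 4-dimensional token and 2-dimensional conditioning vector.
x = [1.0, 2.0, 3.0, 4.0]
cond = [0.5, -0.5]
zero_w = [[0.0, 0.0] for _ in x]
zero_b = [0.0] * len(x)

# With zero projection weights the layer reduces to plain layer normalisation.
print(adaptive_layer_norm(x, cond, zero_w, zero_b, zero_w, zero_b))
```

Initialising the modulation projections to zero, as in the usage lines, makes the block start out as ordinary layer normalisation, so the conditioning signal only takes effect as the projections are learned.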

Text and timing conditioning

The commercial Stable Audio uses a CLAP-style joint text-audio encoder for conditioning, which gives the model a shared representation of text and music.^[1] Stable Audio Open replaced the CLAP text encoder with the open-source pre-trained T5-base model so the entire pipeline could be released under a permissive licence.^[2]

The novel piece is timing conditioning. The diffusion model receives two extra tokens at each step: the desired start time within the source audio and the desired total duration.^[1] During training, every clip is paired with these values so the model learns the correspondence between them and the resulting waveform shape.^[1] At inference, the user can ask for a 12-second clip, a 47-second loop, or a 95-second track, and the model produces something that is recognisably that long, with a real ending rather than an abrupt cut-off.^[1] Most earlier text-to-music systems either generated a fixed length or required ad hoc looping, and autoregressive systems such as MusicLM and MusicGen generate a preset number of tokens, so the output stops when the budget runs out rather than at a musical ending. Variable, controllable duration is the headline feature that distinguishes Stable Audio from those alternatives.^[1]
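A minimal sketch of how a request might bundle the prompt with the two timing values is shown below. The key names `prompt`, `seconds_start`, and `seconds_total` mirror the example in the public Stable Audio Open model card, but the helper functions, the validation rules, and the trimming step are assumptions for illustration. In particular, the sketch assumes the model renders a fixed-length window and that the unused tail is discarded afterwards, which this article does not spell out.

```python
from typing import Dict, List, Tuple, Union

MAX_SECONDS = 95.0      # maximum context quoted for the original paper's model
SAMPLE_RATE_HZ = 44_100

Conditioning = Dict[str, Union[str, float]]


def build_conditioning(prompt: str, seconds_total: float,
                       seconds_start: float = 0.0,
                       max_seconds: float = MAX_SECONDS) -> List[Conditioning]:
    """Bundle a text prompt with the start-time and total-duration values."""
    if not 0 < seconds_total <= max_seconds:
        raise ValueError(f"seconds_total must be in (0, {max_seconds}]")
    if not 0 <= seconds_start < seconds_total:
        raise ValueError("seconds_start must be in [0, seconds_total)")
    return [{
        "prompt": prompt,
        "seconds_start": float(seconds_start),
        "seconds_total": float(seconds_total),
    }]


def normalised_timing(seconds_start: float, seconds_total: float,
                      max_seconds: float = MAX_SECONDS) -> Tuple[float, float]:
    """Scale both timing values to [0, 1] before they are embedded (hypothetical step)."""
    return seconds_start / max_seconds, seconds_total / max_seconds


def samples_to_keep(seconds_total: float, sample_rate: int = SAMPLE_RATE_HZ) -> int:
    """Samples per channel to keep once the fixed-length window is trimmed."""
    return round(seconds_total * sample_rate)


cond = build_conditioning("128 BPM tech house drum loop", seconds_total=12)
print(cond)
print(normalised_timing(0.0, 12.0))
print(samples_to_keep(12))
```

The same request shape covers a 12-second loop and a 95-second track; only `seconds_total` changes, which is what makes the duration controllable from a single model.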

Output format

The commercial models output 44.1 kHz stereo, the standard for distributable music.^[3] Stable Audio Open 1.0 also outputs 44.1 kHz stereo (despite some early reporting that called it mono).^[8] Stable Audio Open Small produces 44.1 kHz stereo as well, but with shorter clips (up to 11 seconds) and a much smaller parameter budget so it can run on a smartphone CPU.^[7]

What data was Stable Audio trained on?

Stability AI made the licensed-only training story a central part of the Stable Audio pitch from launch, in part because the image side of the company had been entangled in copyright disputes around Stable Diffusion 1.5.^[3] The commercial models and the open variant use disjoint corpora.

Model	Training corpus	Size	Licensing approach
Stable Audio 1.0, 2.0, 2.5	AudioSparx licensed catalogue	Over 800,000 audio files of music, sound effects, and single-instrument stems	Commercial licence with AudioSparx; artists in the catalogue offered an opt-out option before training.
Stable Audio Open 1.0	Freesound + Free Music Archive (FMA)	486,492 recordings (472,618 from Freesound, 13,874 from FMA), roughly 7,300 hours total	Restricted to CC0, CC-BY, and CC Sampling+ tracks; copyright cleared with Audible Magic plus human review of FMA metadata.
Stable Audio Open Small	Freesound + Free Music Archive	Subset of the Open 1.0 corpus	Same Creative Commons restrictions.
Stable Audio 3.0	Licensed catalogue plus label partnerships	Not disclosed	"Built on fully licensed data," supported by partnerships with Universal Music Group and Warner Music Group.^[21]

The AudioSparx partnership was struck before the original launch and gave Stability AI a large catalogue of professionally produced music, sound effects, and Foley to train on. Stability AI states that Stable Audio 2.0 was "trained on data from AudioSparx consisting of over 800,000 audio files containing music, sound effects, and single-instrument stems, as well as corresponding text metadata," and that all of AudioSparx's artists were given the option to opt out of the training set.^[4] For uploaded audio in the audio-to-audio mode, Stable Audio uses Audible Magic's content recognition service to block prompts that match copyrighted recordings.^[4]

This posture matters commercially. In June 2024 the RIAA filed lawsuits against Suno and Udio over alleged use of copyrighted recordings in training. Stable Audio was not named in those suits, and Stability AI has continued to lean on the AudioSparx provenance and the Creative Commons base for the open variant when pitching to enterprise customers.

Versions and capabilities

Version	Released	Max length	Sample rate / channels	Parameters	License	Distribution	Notable features
Stable Audio 1.0	September 13, 2023	90 seconds (Pro), 45 seconds (Free)	44.1 kHz stereo	Not disclosed	Commercial subscription	stableaudio.com	First commercial 44.1 kHz text-to-music product; latent diffusion with U-Net backbone; CLAP text encoder; timing conditioning.
Stable Audio 2.0	April 3, 2024	3 minutes (180 seconds)	44.1 kHz stereo	Not disclosed	Commercial subscription, free tier	stableaudio.com, Stability AI API	Audio-to-audio prompting, structured compositions with intro/development/outro, new compressed autoencoder, DiT backbone.
Stable Audio Open 1.0	June 5, 2024	47 seconds	44.1 kHz stereo	~1.21 billion	Stability AI Community License (commercial use up to $1M annual revenue)	Hugging Face (stabilityai/stable-audio-open-1.0), GitHub	First open-weight Stable Audio variant; T5-base text encoder; trained on Freesound and FMA; oriented toward sound effects and short loops.
Stable Audio Open Small	May 14, 2025	11 seconds	44.1 kHz stereo	341 million	Stability AI Community License	Hugging Face, Arm reference implementations	Runs on smartphone CPUs; collaboration with Arm; sub-8-second generation on Arm-based devices.
Stable Audio 2.5	September 10, 2025	3 minutes	44.1 kHz stereo	Not disclosed	Commercial, enterprise licensing	stableaudio.com, Stability AI API, fal, Replicate, ComfyUI, on-premises	Audio inpainting, ARC training for sub-2-second inference, improved musical structure and prompt adherence, fine-tuning for enterprise sound libraries.
Stable Audio 3.0 (Small SFX / Small / Medium / Large)	May 20, 2026	Up to 2 minutes (Small models), up to 6 minutes 20 seconds (Medium and Large)	44.1 kHz stereo	459M (Small SFX), 459M (Small), 1.4B (Medium), 2.7B (Large)	Open weights for Small SFX, Small, Medium; commercial/API/enterprise for Large	Hugging Face (open variants), Stability AI API, fal	New semantic-acoustic autoencoder; inpainting and track extension; on-device music composition (Small); Large reserved for API and enterprise.

The commercial line moves toward longer, more structured tracks and faster inference; the open line moves toward smaller, more accessible weights that researchers and hobbyists can fine-tune. With Stable Audio 3.0 the two lines partly converged: the Medium 6-minute model ships as open weights while only the 2.7B Large model is held back for API and enterprise use.^[21] Stability AI uses the open releases as research baselines and to recruit community fine-tunes that feed back into the commercial product strategy.

Capabilities

The model family supports several distinct generation modes.

Capability	Available in	Description
Text-to-music	All versions	Generate instrumental music from a prompt that names genre, instruments, mood, BPM, and structural cues.
Text-to-sound-effect	All versions, strongest in Open 1.0 and 3.0 Small SFX	Generate Foley and ambient effects ("rain on a tin roof", "footsteps on gravel", "crowd murmuring").
Audio-to-audio	2.0, 2.5, and 3.0	Upload an audio sample and transform it under a text prompt, useful for style transfer or stem variations.
Structured composition	2.0, 2.5, and 3.0	Produce intros, developments, and outros within a single track instead of one looping section.
Audio inpainting and extension	2.5 and 3.0	Insert generated audio at a chosen position within an uploaded clip, edit sections, or extend a track while preserving surrounding context.
Variable-length output	All versions	Specify the desired total duration directly, courtesy of the timing-token mechanism.
On-device inference	Open Small, 3.0 Small	Run the model entirely on a smartphone CPU without an internet connection.
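One common way to frame inpainting in diffusion models is a keep/regenerate mask over the latent sequence: frames outside the edited region are held fixed while frames inside it are resampled. Whether Stable Audio 2.5 or 3.0 implements inpainting this way is not stated in the sources cited here, so the sketch below is a hypothetical illustration. The function name is invented, and the 21.5 Hz latent rate is borrowed from Stable Audio Open as an assumption.

```python
import math
from typing import List

LATENT_RATE_HZ = 21.5  # reported for Stable Audio Open; assumed for illustration


def inpaint_mask(clip_seconds: float, start: float, end: float,
                 latent_rate_hz: float = LATENT_RATE_HZ) -> List[int]:
    """Return 1 for latent frames to regenerate and 0 for frames to keep."""
    if not 0 <= start < end <= clip_seconds:
        raise ValueError("need 0 <= start < end <= clip_seconds")
    n_frames = math.ceil(clip_seconds * latent_rate_hz)
    first = math.floor(start * latent_rate_hz)   # round outwards so the whole
    last = math.ceil(end * latent_rate_hz)       # requested region is covered
    return [1 if first <= i < last else 0 for i in range(n_frames)]


# Regenerate seconds 2 to 4 of a 10-second clip.
mask = inpaint_mask(10, 2, 4)
print(len(mask), sum(mask))
```

Track extension can be seen as the special case where the regenerated region runs from the end of the uploaded audio to the end of the window.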

Vocal generation is intentionally limited. The commercial models generally avoid lyrics; Stable Audio Open's documentation states explicitly that the model cannot generate realistic vocals, and the training data was filtered to lean on instrumental and sound-effect content.^[8] This is the largest functional gap between Stable Audio and the Suno/Udio family.

How does Stable Audio compare to Suno, Udio, and MusicGen?

The text-to-music landscape now contains several distinct product categories. Stable Audio overlaps with all of them but does not compete head-on with any single one.

System	Vendor	First release	Max length	Vocals	Output	Training data	Distribution
Stable Audio 3.0	Stability AI	May 2026	6 min 20 sec	Limited	44.1 kHz stereo	Licensed (AudioSparx, UMG, WMG)	Open weights (3 of 4 models), API, enterprise
Stable Audio Open 1.0	Stability AI	June 2024	47 seconds	No	44.1 kHz stereo	Freesound + FMA (CC)	Open weights (Community License)
MusicGen	Meta (AudioCraft)	June 2023	~30 seconds (extendable)	No	32 kHz mono/stereo	Licensed (Shutterstock + others)	Open weights (CC-BY-NC)
AudioGen	Meta (AudioCraft)	October 2022	Short clips	No	16 kHz mono	AudioSet, BBC Sound Effects	Open weights
MusicLM	Google Research	January 2023	Up to 5 minutes (research)	Hummed only	24 kHz	Free Music Archive + private corpora	Limited Test Kitchen access, then folded into other Google products
Lyria 2	Google DeepMind	April 2025 (Lyria 1 in November 2023)	Multiple minutes	Yes	48 kHz	Not disclosed	Internal Google services and YouTube Music AI
Suno v5	Suno	December 2023 (v1), September 2025 (v5)	~4 minutes	Yes (high realism)	44.1 kHz	Disputed (RIAA lawsuit June 2024)	Web, mobile, API
Udio	Udio	April 2024	~15 minutes (with extensions)	Yes (high realism)	44.1 kHz	Disputed (RIAA lawsuit June 2024)	Web, mobile, API
ElevenLabs Sound Effects	ElevenLabs	May 2024	22 seconds	Effects, no music vocals	44.1 kHz	Licensed audio	Web, API
Riffusion	Forsgren and Martiros (independent)	December 2022	Short loops	Limited	Spectrogram via Stable Diffusion	Open-weight Stable Diffusion fine-tune	Web, open weights
Jukebox	OpenAI	April 2020	Multiple minutes	Yes (lo-fi)	Lo-fi 44.1 kHz	1.2 million songs (research only)	Open weights, research

The practical takeaway: Stable Audio is the strongest enterprise-grade option for instrumental music and sound design with clear licensing, MusicGen is the strongest open self-hosting option for music, Suno and Udio dominate consumer full-song generation, ElevenLabs leads in dedicated sound effects, and Lyria sits inside Google's product ecosystem.

Strengths

The variable-length generation through timing conditioning is the technical headline. Most rival text-to-music models produce fixed-length outputs and require post-processing to reach the desired duration. Stable Audio handles 12-second loops, 47-second open-model clips, and 3-minute commercial tracks with the same timing-conditioning mechanism and the same prompt format.^[1] The 44.1 kHz stereo output also matches the standard for distributable music, where MusicGen tops out at 32 kHz and many earlier systems produced 16 or 24 kHz.

The latent diffusion approach also gives reasonable inference speed. The 1.0 paper reported 95 seconds of stereo audio in 8 seconds on an A100, and Stable Audio 2.5 with the ARC post-training method produces a 3-minute track in under 2 seconds on a GPU.^[1]^[6] Autoregressive token-based systems are typically slower at long outputs because they generate the sequence one token at a time.
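The speed figures quoted in this article can be put on a common footing as a real-time factor, meaning seconds of audio produced per second of compute. The sketch below uses only numbers from the article. Where the source says "under" a given time, the quoted time is used, so the result is a lower bound on the factor. The hardware differs between rows, so the rows are not directly comparable.

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio generated per second of wall-clock compute."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# (label, seconds of audio, seconds of compute) as quoted in the article.
REPORTED = [
    ("Stable Audio 1.0 paper, A100 GPU", 95, 8),
    ("Stable Audio 2.5, GPU (time is an upper bound)", 180, 2),
    ("Stable Audio Open Small, Arm CPU (time is an upper bound)", 11, 8),
]

for label, audio_s, wall_s in REPORTED:
    print(f"{label}: about {realtime_factor(audio_s, wall_s):.1f}x real time")
```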

Licensed training data has become a meaningful commercial advantage. Customers who pay for music generation typically need to use the output in commercial contexts, and a model trained on a clearly licensed corpus (AudioSparx, and from 2026 the Universal and Warner partnerships) reduces downstream risk in a way that the disputed Suno and Udio corpora do not.^[3]^[21] The combination of a commercial product and openly released weights under the Stability AI Community License also lets Stability AI cover both enterprise and community use without splitting the brand.

Limitations

Vocal generation is the largest gap. The Stable Audio family is essentially instrumental, and even the audio-to-audio modes do not produce convincing lyric-driven songs.^[8] Suno and Udio dominate the consumer mindshare for AI-generated full songs as a direct result.

The open variant trades a lot to be open. Stable Audio Open 1.0 is capped at 47 seconds, and Stable Audio Open Small at 11 seconds, both far shorter than the commercial 3-minute ceiling.^[7]^[8] Both open models also struggle with non-Western musical styles because their Creative Commons training data skews Western, and both are explicit in their model cards that fine-tuning will probably be needed for specific genres.^[8]

The quality is genre-dependent. Reviewers consistently rate Stable Audio strong on cinematic and electronic instrumental tracks, weaker on jazz and acoustic styles, and limited on anything that benefits from a lead vocal line.^[19] Some prompts produce coherent music; others drift or fall apart over the full 3-minute window. As with image diffusion, prompt engineering carries a lot of weight.

Integration with professional digital audio workstations (DAWs) is still light. There is no first-party Logic Pro or Ableton plug-in. Third-party integrations through ComfyUI and the API exist, but most professional producers still pull Stable Audio outputs into the DAW manually as audio files.

Licensing terms have evolved with corporate turbulence. The Stability AI Community License changed several times in 2024, the open variant's redistribution terms shifted, and enterprises have asked for clearer commercial commitments before adopting the technology at scale. The September 2025 enterprise release of Stable Audio 2.5 and the May 2026 label-backed Stable Audio 3.0 are in part answers to those concerns.^[6]^[21]

Stability AI corporate context

Stability AI was founded in 2019 by Emad Mostaque and rose to prominence in August 2022 with the public release of Stable Diffusion, a text-to-image model whose open weights catalysed the open-source generative AI ecosystem.^[18] The company shipped a wide product line over the next two years: Stable Diffusion 1.x and 2.x, SDXL, Stable LM language models, Stable Video Diffusion (November 2023), Stable Cascade (February 2024), Stable Diffusion 3 (February to June 2024), Stable Diffusion 3.5 (October 2024), and the Stable Audio family covered here.^[18]

The company hit a rough patch in 2024. Emad Mostaque resigned as CEO on March 23, 2024, citing concerns about "centralised AI", and also stepped down from the board.^[15] COO Shan Shan Wong and CTO Christian Laforte served as interim co-CEOs, and Stability laid off roughly 10 percent of staff in April 2024.^[15] Prem Akkaraju, the former CEO of visual effects studio Weta Digital, was appointed permanent CEO on June 25, 2024.^[18] Sean Parker briefly served as executive chairman, and on September 24, 2024 the filmmaker James Cameron joined Stability AI's board of directors.^[18] Through that turmoil and after it, the audio research group continued to ship: Stable Audio 2.0 in April 2024, Stable Audio Open in June 2024, the Open Small release in May 2025, Stable Audio 2.5 in September 2025, and Stable Audio 3.0 in May 2026. Alongside the 3.0 release, Ethan Kaplan, former chief digital officer at Universal Audio and Fender, joined Stability AI to lead the company's professional music offering.^[21]

Stable Audio is one of several Stability AI product families. The current commercial line-up includes Stable Diffusion 3.5 for image generation, Stable Video for video generation, Stable Audio for audio, and various enterprise-only models. Many of the original image researchers (Robin Rombach, Andreas Blattmann, Dominik Lorenz) left Stability in 2024 to found Black Forest Labs, which shipped the Flux.1 image models in August of that year.^[18] The audio team has been more stable than the image team.

Recent context (2024 to 2026)

Stable Audio Open's June 2024 release pushed the open-source text-to-audio frontier forward, much as the original Stable Diffusion did for images two years earlier.^[5] Researchers used the weights as a baseline for academic papers, and hobbyists built fine-tuned variants for specific genres and sound libraries.

In parallel, the commercial Stable Audio offering pivoted toward enterprise customers, culminating in the Stable Audio 2.5 release.^[6] The ARC training method that gives 2.5 its sub-2-second inference was developed for the throughput demands of agencies generating large volumes of branded audio.^[17] Distribution expanded beyond stableaudio.com to include fal, Replicate, ComfyUI, and the Stability AI API, plus on-premises licensing for enterprises with strict data-handling requirements.^[6] In May 2026 Stability AI released Stable Audio 3.0, a family of four models capable of tracks up to 6 minutes 20 seconds, with three of the four (Small SFX, Small, and Medium) released as open weights and the 2.7-billion-parameter Large model reserved for the API and enterprise licensing.^[21] The 3.0 line also formalised the licensed-data story through partnerships with Universal Music Group and Warner Music Group.^[21]

The consumer end of the market continued to be dominated by Suno and Udio, both of which faced RIAA lawsuits in mid-2024 and reached settlements with major labels in late 2025 (Suno with Warner Music, Udio with Universal Music). Google released Lyria 2 in 2025 as the engine behind YouTube's AI music tools. In this crowded market, Stable Audio has carved out a defensible position as the high-fidelity, licensed, instrumental-first option, with a credible open-source variant for the research community.

References

1. Evans, Z., Carr, C. J., Taylor, J., Hawley, S. H., and Pons, J. (2024). *Fast Timing-Conditioned Latent Audio Diffusion*. ICML 2024. arXiv:2402.04825. https://arxiv.org/abs/2402.04825
2. Evans, Z., Parker, J. D., Carr, C. J., Zukowski, Z., Taylor, J., and Pons, J. (2024). *Stable Audio Open*. arXiv:2407.14358. https://arxiv.org/abs/2407.14358
3. Stability AI (2023). *Stable Audio: Using AI to Generate Music*. https://stability.ai/news/stable-audio-using-ai-to-generate-music
4. Stability AI (2024). *Introducing Stable Audio 2.0*. https://stability.ai/news/stable-audio-2-0
5. Stability AI (2024). *Stable Audio Open: Research Paper*. https://stability.ai/news/stable-audio-open-research-paper
6. Stability AI (2025). *Stability AI Introduces Stable Audio 2.5, the First Audio Model Built for Enterprise Sound Production at Scale*. https://stability.ai/news-updates/stability-ai-introduces-stable-audio-25-the-first-audio-model-built-for-enterprise-sound-production-at-scale
7. Stability AI and Arm (2025). *Stability AI and Arm Collaborate to Release Stable Audio Open Small*. https://stability.ai/news/stability-ai-and-arm-release-stable-audio-open-small-enabling-real-world-deployment-for-on-device-audio-control
8. Hugging Face. *stabilityai/stable-audio-open-1.0 model card*. https://huggingface.co/stabilityai/stable-audio-open-1.0
9. Music Ally (2023). *Stable Diffusion maker launches Stable Audio text-to-music AI*. https://musically.com/2023/09/13/stable-diffusion-maker-launches-stable-audio-text-to-music-ai/
10. Music Ally (2024). *Stable Audio AI model now makes full music tracks and allows user uploads*. https://musically.com/2024/04/03/stable-audio-ai-model-now-makes-full-music-tracks-and-allows-user-uploads/
11. MarkTechPost (2023). *Stability AI Introduces Stable Audio: A New Artificial Intelligence Model That Can Generate Audio Clips From Text Prompts*. https://www.marktechpost.com/2023/09/16/stability-ai-introduces-stable-audio-a-new-artificial-intelligence-model-that-can-generate-audio-clips-from-text-prompts/
12. MarkTechPost (2024). *Stability AI Launches Stable Audio 2.0: Empowering Artists with Next-Gen Audio Tools*. https://www.marktechpost.com/2024/04/03/stability-ai-launches-stable-audio-2-0-empowering-artists-with-next-gen-audio-tools/
13. SiliconANGLE (2024). *Stability AI debuts Stable Audio 2.0 model for generating sound clips*. https://siliconangle.com/2024/04/03/stability-ai-debuts-stable-audio-2-0-model-generating-sound-clips/
14. Music Business Worldwide (2024). *Stability AI's Stable Audio update enables full-length song production from text or audio*. https://www.musicbusinessworldwide.com/stability-ais-stable-audio-update-enables-full-length-song-production-from-text-or-audio/
15. TechCrunch (2024). *Stability AI CEO resigns because you can't beat centralized AI with more centralized AI*. https://techcrunch.com/2024/03/22/stability-ai-ceo-resigns-because-youre-not-going-to-beat-centralized-ai-with-more-centralized-ai/
16. TechCrunch (2025). *Stability AI releases an audio-generating model that can run on smartphones*. https://techcrunch.com/2025/05/14/stability-ai-releases-an-audio-generating-model-that-can-run-on-smartphones/
17. VentureBeat (2025). *Stability AI's enterprise audio model cuts production time from weeks to minutes with 8-step generation breakthrough*. https://venturebeat.com/ai/stability-ais-enterprise-audio-model-cuts-production-time-from-weeks-to
18. Wikipedia. *Stability AI*. https://en.wikipedia.org/wiki/Stability_AI
19. AudioCipher (2024). *Stable Audio 2.0: Review of the New Audio-to-Audio Features*. https://www.audiocipher.com/post/stable-audio-ai
20. Voicebot.ai (2024). *Stability AI Releases Augmented Text-to-Music Engine Stable Audio 2 With Upload and Style Transfer Features*. https://voicebot.ai/2024/04/04/stability-ai-releases-augmented-text-to-music-engine-stable-audio-2-with-upload-and-style-transfer-features/
21. TechCrunch (2026). *Stability AI releases a new audio model that can create six-minute songs*. https://techcrunch.com/2026/05/20/stability-ai-release-a-new-audio-model-that-can-create-six-minute-songs/
