Audio-to-Audio Models

Updated Jul 16, 2026

See also: Audio Models and Tasks

Audio-to-audio models are machine learning systems that take an audio waveform as input and produce a different audio waveform as output. The category covers voice conversion, speech enhancement (denoising and dereverberation), music source separation, vocoders that turn spectrograms into waveforms, and neural audio codecs that compress and reconstruct sound. They sit alongside text-to-speech and speech-to-text in the broader family of audio models, but they are distinguished by the fact that both the input and the output live in the audio domain.

Most modern systems share a common pipeline: an encoder that turns the input waveform or spectrogram into a learned representation, a transformation network that manipulates that representation (separating sources, removing noise, swapping speaker identity, predicting clean targets), and a decoder or vocoder that synthesizes the new waveform. Since around 2019 the field has moved away from spectrogram masking with classical signal processing toward fully neural approaches, often built around diffusion models, transformer architectures, or generative adversarial networks (GANs). The same underlying components show up across the subfields: a HiFi-GAN vocoder can be the back end of a voice cloning pipeline, a music separator, or a neural codec.

Overview

The phrase "audio-to-audio" appears on Hugging Face as one of the official task tags.^[28] Models filed under it on the Hub include speech enhancement, voice conversion, source separation, target speaker extraction, and bandwidth extension.^[28] The taxonomy used in this article groups the field into the following tasks.

Task	Input	Output	Typical use
Voice conversion	Speech from speaker A	Same words spoken by speaker B	Voice cloning, dubbing, anonymization
Speech enhancement	Noisy or reverberant speech	Clean speech	Conferencing, podcasts, hearing aids
Source separation	Mixed audio	Individual stems	Karaoke, remix, music production
Vocoder	Spectrogram or features	Waveform	Back end of text-to-speech
Neural audio codec	Waveform	Compressed tokens, then waveform	Low-bitrate audio, generative audio backbones
Speech-to-speech translation	Speech in language A	Speech in language B	Real-time interpretation
Bandwidth extension	8 kHz speech	16 or 24 kHz speech	Upsampling old recordings
Target speaker extraction	Multi-talker audio	One speaker's voice	Cocktail-party problem

The history of the field tracks the wider arc of deep learning audio research. Classical signal processing (Wiener filtering, spectral subtraction, ICA-based source separation) dominated until around 2015. Then convolutional neural networks operating on spectrograms took over, with U-Net architectures from medical imaging adapted for vocal separation. Generative AI models, starting with WaveNet in 2016, made waveform-level synthesis feasible.^[13] The transformer wave reached audio around 2020 and 2021, and by 2023 large generative models such as VALL-E and Voicebox were producing voice cloning that needed only a few seconds of reference audio.^[20]^[21]

Voice conversion

Voice conversion (VC) systems take an utterance from one speaker and re-render it in the voice of another, ideally keeping linguistic content and prosody intact. Early VC used Gaussian mixture models or vector quantization on aligned parallel data. Modern systems learn disentangled representations of content and speaker identity, then recombine them.

StarGAN-VC, AutoVC and the disentanglement era

StarGAN-VC was introduced by Hirokazu Kameoka and colleagues at NTT in 2018.^[2] It applies the StarGAN image translation framework to mel-cepstral features, enabling many-to-many voice conversion without parallel training data.^[2] A follow-up, StarGAN-VC2, improved naturalness in 2019.

AutoVC was published by Kaizhi Qian and collaborators at MIT and IBM in 2019.^[3] The architecture is a content encoder, a speaker encoder, and a decoder, with an information bottleneck on the content path that forces the model to discard speaker identity.^[3] AutoVC was one of the first systems to achieve plausible zero-shot voice conversion to unseen target speakers.^[3]

Singing voice conversion: SoftVC VITS SVC and RVC

SoftVC VITS Singing Voice Conversion (So-VITS-SVC) is an open-source project that combines a SoftVC content encoder with the VITS end-to-end TTS model. The repository was released on GitHub by the svc-develop-team and went through several iterations between 2022 and 2023. The 4.0 line switched the content features to the 12th layer of ContentVec and replaced the vocoder with NSF-HiFiGAN, while 4.1 added an optional shallow diffusion stage for higher sound quality. So-VITS-SVC 4.0 and 4.1 became the de facto standard for fan-made covers in which one singer's voice is mapped onto another's recording.

Retrieval-based Voice Conversion (RVC) is a sibling project that became the dominant voice cloning tool on social media in 2023. RVC uses a HuBERT-style content encoder, an NSF-HiFiGAN vocoder, and a feature retrieval step that pulls the closest matching frames from a target speaker's voice database to suppress timbre leakage from the source speaker. The original RVC repository, RVC-Project, gathered tens of thousands of GitHub stars and powered viral clips of Joe Biden, Donald Trump and various musicians appearing to sing songs they never recorded.

NSF-HiFiGAN is the vocoder used by both So-VITS-SVC and RVC. It is a HiFi-GAN variant with a Neural Source-Filter front end that takes fundamental frequency (F0) as an input, which helps preserve pitch through voice conversion.
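
The role of F0 in a source-filter front end can be illustrated with a minimal sketch. It assumes a per-sample F0 contour in hertz, with 0 marking unvoiced samples, and builds a sine excitation by accumulating phase. The function name, the 16 kHz default rate and the silent handling of unvoiced samples are illustrative assumptions, not the NSF-HiFiGAN implementation, which also adds harmonics and noise.

```python
import math


def sine_excitation(f0_per_sample, sample_rate=16000, amplitude=0.1):
    """Build a sine source whose instantaneous frequency follows F0.

    f0_per_sample: F0 in Hz for every output sample; 0 marks unvoiced samples.
    Hypothetical helper for illustration only.
    """
    phase = 0.0
    samples = []
    for f0 in f0_per_sample:
        if f0 > 0:
            # Voiced: advance the phase by one sample's worth of rotation.
            phase += 2.0 * math.pi * f0 / sample_rate
            samples.append(amplitude * math.sin(phase))
        else:
            # Unvoiced: emit silence here to keep the sketch deterministic.
            samples.append(0.0)
    return samples


# A 220 Hz voiced segment followed by a short unvoiced gap.
excitation = sine_excitation([220.0] * 160 + [0.0] * 40)
```

Because the pitch is injected explicitly as a signal, the downstream network only has to shape the timbre around it rather than infer the pitch from content features.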

DiffSVC by Songxiang Liu, Yuewen Cao, Dan Su and Helen Meng (CUHK and Tencent AI Lab, ASRU 2021) was the first singing voice conversion system built on a denoising diffusion probabilistic model, generating acoustic features from phonetic posteriorgram content features.[^diffsvc] FastSVC, by Songxiang Liu and colleagues, introduced a non-autoregressive approach for faster inference.

OpenVoice and modern instant voice cloning

OpenVoice was released by MyShell.ai in December 2023.^[26] It separates tone color cloning, which captures speaker timbre from a short reference clip, from style control, which handles emotion, accent and pacing.^[26] OpenVoice was published with code on GitHub and an accompanying paper by Zengyi Qin and colleagues.^[26] OpenVoice v2 followed in April 2024 with native multilingual support for English, Spanish, French, Chinese, Japanese and Korean.

kNN-VC and Seed-VC: the self-supervised and zero-shot wave

kNN-VC (Voice Conversion With Just Nearest Neighbors) by Matthew Baas, Benjamin van Niekerk and Herman Kamper at Stellenbosch University (Interspeech 2023) showed that any-to-any voice conversion needs no dedicated conversion network at all. Source and reference utterances are encoded into self-supervised WavLM features, each source frame is replaced by the mean of its k nearest neighbours among the reference frames, and a HiFi-GAN vocoder synthesizes the result. The authors report that this concatenative approach improves speaker similarity over prior methods at similar intelligibility, despite its simplicity.[^knnvc]
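
The matching step described above can be sketched in a few lines. This is a toy illustration on hand-written vectors, assuming cosine distance and k = 4 as defaults; the function name is hypothetical, and the real system applies the same idea to WavLM feature frames before vocoding.

```python
import math


def knn_convert(source_frames, reference_frames, k=4):
    """Replace each source frame with the mean of its k nearest reference frames."""

    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / norm

    converted = []
    for frame in source_frames:
        # Rank all reference frames by distance to this source frame.
        nearest = sorted(
            reference_frames, key=lambda ref: cosine_distance(frame, ref)
        )[:k]
        # Average the selected reference frames dimension by dimension.
        converted.append(
            [sum(ref[d] for ref in nearest) / len(nearest) for d in range(len(frame))]
        )
    return converted


source = [[1.0, 0.0]]
reference = [[0.9, 0.1], [1.0, 0.1], [0.0, 1.0]]
converted = knn_convert(source, reference, k=2)
```

Every output frame is built only from reference-speaker frames, which is why the converted speech inherits the target timbre without any trained conversion network.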

Seed-VC by Songting Liu (Nanyang Technological University, arXiv November 2024) is a zero-shot framework that converts a source utterance to the timbre of an unseen reference. It applies an external timbre shifter during training to perturb the source timbre and uses a diffusion transformer that reads the whole reference through in-context learning. The paper reports higher speaker similarity and lower word error rate than the OpenVoice and CosyVoice baselines, and the framework extends to zero-shot singing voice conversion with fundamental-frequency conditioning and offers a real-time mode. Code and pretrained models were released on GitHub.[^seedvc]

Voicebox, VALL-E and large generative voice models

VALL-E was announced by Microsoft Research in January 2023, in the paper Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers by Chengyi Wang and colleagues.^[20] The model is a transformer language model trained on tokens from the EnCodec neural codec; given a 3-second voice prompt it can synthesize speech in that voice with reasonable identity preservation.^[20] VALL-E showed that language modelling over neural codec tokens could replace the mel-spectrogram-and-vocoder pipeline in text-to-speech.^[20] VALL-E X, also released by Microsoft in 2023, extended the model to cross-lingual cloning, so that an English speaker's voice can be used for Mandarin output.

Voicebox was published by Meta AI in June 2023.^[21] The paper, by Matthew Le and colleagues, describes a non-autoregressive flow-matching model trained on 60,000 hours of multilingual speech that supports text-to-speech, noise removal, content editing inside a recording and zero-shot voice cloning.^[21] Meta did not release Voicebox weights or a demo product, citing potential misuse for impersonation.^[21]

Model	Year	Group	Approach	Open weights
StarGAN-VC	2018	NTT	StarGAN on mel-cepstral features^[2]	Research code
AutoVC	2019	MIT, IBM	Bottleneck autoencoder^[3]	Research code
So-VITS-SVC	2022 to 2023	Open source	SoftVC content plus VITS	Yes
RVC	2023	Open source	HuBERT plus NSF-HiFiGAN with retrieval	Yes
DiffSVC	2021	CUHK, Tencent	Diffusion model	Research code
OpenVoice v1	2023	MyShell	Tone and style decoupling^[26]	Yes
OpenVoice v2	2024	MyShell	Multilingual extension	Yes
kNN-VC	2023	Stellenbosch	WavLM features plus k-NN	Yes
Seed-VC	2024	NTU	Diffusion transformer, zero-shot	Yes
VALL-E	2023	Microsoft	Codec language model^[20]	No
VALL-E X	2023	Microsoft	Cross-lingual codec LM	No
Voicebox	2023	Meta	Flow matching^[21]	No

Speech enhancement

Speech enhancement covers any task in which the goal is to recover clean speech from a degraded signal. The main sub-tasks are denoising (removing background noise), dereverberation (removing room reverb), declipping, packet loss concealment and bandwidth extension. The field is older than deep learning, with Yariv Ephraim and David Malah's 1984 MMSE-STSA estimator still cited in modern papers, but the current state of the art is dominated by neural networks.

SEGAN, DCCRN and DCUNet

SEGAN by Santiago Pascual, Antonio Bonafonte and Joan Serra at the Universitat Politecnica de Catalunya, presented at Interspeech 2017, was one of the first speech enhancement systems to operate directly on raw waveforms with a generative adversarial network.^[1] The generator is a fully convolutional encoder-decoder; the discriminator decides whether a waveform looks clean or denoised.^[1]

Deep Complex Convolution Recurrent Network (DCCRN) by Yanxin Hu and colleagues (Interspeech 2020) won the first round of the Deep Noise Suppression Challenge organized by Microsoft.^[10] It uses complex-valued convolutions to handle phase information in the short-time Fourier transform directly, rather than processing magnitude and phase separately.^[10] DCUNet is a related complex U-Net architecture from 2019.

RNNoise and FullSubNet

RNNoise by Jean-Marc Valin (IEEE MMSP 2018) is a widely deployed open-source noise suppressor that pairs classical signal processing with a small recurrent network. A four-layer recurrent network using gated recurrent units (GRUs) estimates ideal critical-band gains while a pitch filter attenuates noise between harmonics. Valin reports that the hybrid design runs in real time at 48 kHz on a low-power CPU with a model that fits in roughly 85 kB, and the project is bundled in many open-source voice and telephony stacks.[^rnnoise]

FullSubNet by Xiang Hao, Xiangdong Su, Radu Horaud and Xiaofei Li (Inner Mongolia University and Inria, ICASSP 2021) is a real-time single-channel enhancement model that connects a full-band model, which captures global spectral context and cross-band dependencies, with a sub-band model that processes each frequency independently from a few neighbouring bands. The authors report that the fused system exceeds the top-ranked methods of the Interspeech 2020 Deep Noise Suppression Challenge, and the code is open source.[^fullsubnet]

NSNet and the Microsoft pipeline

Noise Suppression Network (NSNet) is Microsoft Research's noise suppressor, used in Microsoft Teams since 2020. It is a recurrent neural network that operates on log-mel features and outputs a suppression gain per time-frequency bin. The team behind NSNet, including Sebastian Braun and Hannes Gamper, also organizes the annual Deep Noise Suppression Challenge that has driven competitive progress on the task.^[29]

DeepFilterNet

DeepFilterNet by Hendrik Schroter, Alberto Escalante and Andreas Maier at Friedrich-Alexander-Universitat Erlangen-Nurnberg was introduced in 2022 and followed by DeepFilterNet 2 and DeepFilterNet 3, the latter in 2023. It operates at 48 kHz, runs in real time on a single CPU core, and is permissively licensed.^[11] The architecture predicts a per-frequency deep filter (a complex-valued convolution over recent frames) rather than a single suppression mask, which preserves transients and consonants better than mask-only methods.^[11]
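
The difference between a deep filter and a mask can be shown with a small sketch. It assumes a complex spectrogram indexed as spectrogram[t][f] and a list of complex taps per time-frequency bin, where tap i multiplies the frame i steps in the past. The function name and data layout are hypothetical, and in a real system the taps would be predicted by the network rather than written by hand.

```python
def apply_deep_filter(spectrogram, coefficients):
    """Filter each frequency bin with a short complex FIR over recent frames.

    spectrogram[t][f]  : complex STFT value at frame t, bin f
    coefficients[t][f] : list of complex taps; tap i weights frame t - i
    """
    filtered = []
    for t, frame in enumerate(spectrogram):
        out_frame = []
        for f in range(len(frame)):
            acc = 0j
            for i, tap in enumerate(coefficients[t][f]):
                if t - i >= 0:  # frames before the start count as silence
                    acc += tap * spectrogram[t - i][f]
            out_frame.append(acc)
        filtered.append(out_frame)
    return filtered


# One frequency bin, three frames, two taps per bin.
spec = [[1 + 0j], [2 + 0j], [3 + 0j]]
taps = [[[0.5, 0.5]] for _ in spec]
smoothed = apply_deep_filter(spec, taps)
```

With a single tap per bin the operation reduces to an ordinary complex mask, so the deep filter can be read as a mask that is also allowed to draw on the preceding frames.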

Resemble Enhance, Adobe Enhance Speech and Voice Isolator

Resemble Enhance was open-sourced by Resemble.ai under an MIT license in late 2023, with the source code published on GitHub. It combines a noise suppression stage with a CFM (conditional flow matching) generative model that re-synthesizes a clean speech waveform, allowing the system to fix not only noise but also bandwidth limitation and other distortions.

Adobe Enhance Speech, originally announced as Project Shasta in 2022 and released as a free web tool through Adobe Podcast in 2023, applies a proprietary neural model to clean up dialogue recorded in untreated rooms. It quickly became a default tool among podcasters and was integrated into Adobe Premiere Pro as Enhance Speech.

ElevenLabs Voice Isolator is a hosted product released by ElevenLabs in 2024 that strips background music and noise from voice recordings. It targets the same use case as Adobe Enhance Speech.

Krisp and NVIDIA Broadcast

Krisp is an Armenian company founded in 2017 that ships a desktop application and SDK for real-time noise suppression in video calls. Its model runs locally on the user's CPU.

NVIDIA RTX Voice was launched in April 2020 during the early days of the COVID-19 lockdowns as a noise suppression utility that ran on RTX GPUs using tensor cores. It evolved into NVIDIA Broadcast, a free Windows application available on RTX cards that includes noise removal, room echo removal, virtual background, eye contact correction and other camera and microphone effects.

Demucs for denoising

Demucs, the music separation model from Meta, was adapted for speech enhancement in a paper titled Real Time Speech Enhancement in the Waveform Domain by Alexandre Defossez, Gabriel Synnaeve and Yossi Adi (Interspeech 2020).^[12] The variant, sometimes called Denoiser or Demucs Denoiser, was for a time a popular open-source baseline before being overtaken by DeepFilterNet and other dedicated speech models.

System	Year	Type	License	Notes
RNNoise	2018	Hybrid DSP plus GRU	Open source	Real-time 48 kHz, ~85 kB model
SEGAN	2017	GAN, waveform	Open source	First waveform GAN denoiser^[1]
DCCRN	2020	Complex CRN	Research	DNS 2020 winner^[10]
FullSubNet	2021	Full plus sub-band fusion	Open source	Beat DNS 2020 top entries
NSNet	2020	RNN	Proprietary	Used in Microsoft Teams
Demucs Denoiser	2020	Waveform U-Net	Open source	Real-time CPU^[12]
DeepFilterNet 3	2023	Deep filter	Open source	48 kHz, real time CPU
Resemble Enhance	2023	Denoise plus CFM	Open source	Quality restoration
Adobe Enhance Speech	2023	Proprietary	Free web tool	Podcast cleanup
ElevenLabs Voice Isolator	2024	Proprietary	Paid API	Voice isolation
Krisp	2017 onward	RNN	Proprietary	Real-time call filter
NVIDIA Broadcast	2020 onward	Proprietary	Free with RTX	GPU accelerated

Source separation

Source separation breaks a mixed audio recording into its component sources. The most studied flavour is music separation into vocals, drums, bass and other, evaluated on the MUSDB18 benchmark introduced by Zafar Rafii and colleagues in 2017. Speech separation, where the goal is to split overlapping speakers, is the related problem behind the cocktail-party effect and is usually benchmarked on the WSJ0-2mix mixture set. Separation quality is most often reported as scale-invariant signal-to-distortion ratio (SI-SDR) or its improvement over the input mixture (SI-SDRi), a metric proposed by Jonathan Le Roux, Scott Wisdom, Hakan Erdogan and John R. Hershey (ICASSP 2019) to make the older SDR measure robust to amplitude scaling.[^sisdr] Higher SI-SDR in decibels means a cleaner separation.
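
As a rough illustration of the metric, the sketch below computes SI-SDR for two equal-length signals: the reference is rescaled by its optimal projection coefficient, and the ratio of the scaled target energy to the residual energy is reported in decibels. The function name is an assumption, and mean removal, which many toolkits apply first, is omitted for brevity.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB for two equal-length signals."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    reference_energy = sum(r * r for r in reference)
    # Optimal scaling of the reference, which makes the metric scale invariant.
    alpha = dot / reference_energy
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10(target_energy / residual_energy)


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, 0.1]
score = si_sdr(estimate, reference)
louder_score = si_sdr([3.0 * e for e in estimate], reference)
```

Multiplying the estimate by a constant leaves the score essentially unchanged, which is the property that distinguishes SI-SDR from plain SDR. SI-SDRi is then the score of the separated output minus the score of the unprocessed mixture.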

Spleeter

Spleeter was open-sourced by the research team at Deezer in November 2019.^[4] It uses a U-Net trained on the company's internal catalogue and ships pretrained two-stem (vocals and accompaniment), four-stem (vocals, drums, bass, other) and five-stem (adds piano) models.^[4] Despite its modest size, Spleeter became the most widely used music separator for several years because it was easy to install, fast on CPU and good enough for karaoke and DJ use. The release paper, by Romain Hennequin, Anis Khlif, Felix Voituret and Manuel Moussallam, is one of the most cited audio papers of the 2019 to 2020 period.

Demucs and Hybrid Demucs

Demucs by Alexandre Defossez at Meta AI was published in 2019 and revised through several versions.^[5] The original Demucs was a waveform U-Net with a bidirectional LSTM bottleneck.^[5] Hybrid Demucs (2021) combined waveform-domain and spectrogram-domain branches that merge in shared inner layers, hitting state-of-the-art performance on MUSDB18.^[6] Hybrid Transformer Demucs (HT Demucs), released in 2022 in the paper Hybrid Transformers for Music Source Separation, replaced those innermost layers with a transformer that attends across both domains and is the version distributed in the demucs Python package today.^[7]

Open-Unmix and Sigsep

Open-Unmix (UMX) by Fabian-Robert Stoter, Stefan Uhlich, Antoine Liutkus and Yuki Mitsufuji was released in 2019 by the Sigsep collective, a group of academic and industry researchers focused on reproducible source separation.^[8] It is a bidirectional LSTM trained on spectrograms and serves as a reference implementation in many papers.^[8]

Conv-TasNet, DPRNN and SepFormer: time-domain speech separation

A parallel line of research targets overlapping speech rather than music, and it drove much of the architectural innovation now used across audio-to-audio. Conv-TasNet by Yi Luo and Nima Mesgarani at Columbia University (IEEE/ACM TASLP 2019) replaced the spectrogram with a learned convolutional encoder, a temporal convolutional network that predicts masks, and a transposed-convolution decoder. With roughly 5 million parameters it reported about 15.3 dB SI-SNRi on WSJ0-2mix, surpassing ideal time-frequency magnitude masks and establishing the time-domain (waveform) paradigm.[^convtasnet]

Dual-Path RNN (DPRNN) by Yi Luo, Zhuo Chen and Takuya Yoshioka (ICASSP 2020) reorganised the separator to model very long sequences by splitting them into chunks and alternating intra-chunk and inter-chunk RNNs. Dropping DPRNN into the TasNet pipeline reached about 18.8 dB SI-SNRi on WSJ0-2mix with a model around 20 times smaller than the previous best system.[^dprnn] DPTNet (Dual-Path Transformer Network) by Jingjing Chen, Qirong Mao and Dong Liu (Interspeech 2020) added direct context-aware modelling within the dual-path structure and reported about 20.6 dB SDR on WSJ0-2mix.[^dptnet]
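
The chunking idea can be sketched without any neural network. The example assumes half-overlapping chunks with zero padding at the tail, which is a simplification of the paper's segmentation, and both helper names are hypothetical. The intra-chunk pass would run over each row of the result, and the inter-chunk pass over each column.

```python
def segment(sequence, chunk_size):
    """Split a sequence into 50%-overlapping chunks, zero-padding the tail."""
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2")
    hop = chunk_size // 2
    padded = list(sequence)
    # Pad until the chunks tile the sequence exactly.
    while len(padded) < chunk_size or (len(padded) - chunk_size) % hop != 0:
        padded.append(0)
    return [
        padded[start:start + chunk_size]
        for start in range(0, len(padded) - chunk_size + 1, hop)
    ]


def inter_chunk_view(chunks):
    """Regroup so each row holds the same position taken from every chunk."""
    return [list(column) for column in zip(*chunks)]


chunks = segment(list(range(1, 9)), chunk_size=4)  # rows for the intra-chunk RNN
across = inter_chunk_view(chunks)                  # rows for the inter-chunk RNN
```

Each RNN now sees sequences whose length is either the chunk size or the number of chunks, both far shorter than the original input, which is what makes very long waveforms tractable.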

SepFormer (Separation Transformer) by Cem Subakan, Mirco Ravanelli and colleagues, distributed through the SpeechBrain toolkit, replaced the dual-path RNNs with multi-scale transformer blocks. It reported about 22.3 dB SI-SNRi (22.4 dB SDRi) on WSJ0-2mix with dynamic mixing, a state-of-the-art result at publication, while remaining parallelisable and faster than comparable RNN systems.[^sepformer] Pretrained SepFormer checkpoints for WSJ0-2mix, WSJ0-3mix and the noisy WHAM! set are available on Hugging Face.

MossFormer and recent speech separation

MossFormer by Shengkui Zhao and Bin Ma at Alibaba (ICASSP 2023) is a state-of-the-art speech separation model that combines a convolutional front end with a transformer bottleneck. MossFormer2, published in 2024, adds a recurrent module and pushes performance further on the WSJ0-2mix benchmark.

Band-Split RNN, MDX-Net and BS-RoFormer

Band-Split RNN (BSRNN) by Yi Luo and Jianwei Yu at Tencent AI Lab (IEEE/ACM TASLP 2023, preprint September 2022) is the frequency-domain model that introduced the band-split idea now common in music separation. It splits the mixture spectrogram into subbands whose bandwidths can be chosen from prior knowledge of the target instrument, then interleaves band-level and sequence-level RNN modelling. The authors report that BSRNN trained only on MUSDB18-HQ outperforms several top-ranking entries of the 2021 Music Demixing Challenge, with a semi-supervised finetuning stage giving a further gain.[^bsrnn]
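
The splitting step itself is simple, as the following sketch suggests. The band widths used here are invented for illustration and are not the configuration from the paper; the function name is likewise hypothetical.

```python
def split_bands(frame, band_widths):
    """Cut one spectrogram frame (a list of frequency bins) into subbands."""
    if sum(band_widths) != len(frame):
        raise ValueError("band widths must add up to the number of bins")
    bands, start = [], 0
    for width in band_widths:
        bands.append(frame[start:start + width])
        start += width
    return bands


# Hypothetical layout: narrow bands at low frequencies, one wide band on top.
frame = list(range(10))
bands = split_bands(frame, band_widths=[2, 2, 6])
```

Narrow bands give the model fine resolution where the target instrument carries most of its energy, while wide bands keep the number of band-level sequence steps small.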

The Music Demixing Challenge (MDX) of 2021 and its successor, the Sound Demixing Challenge (SDX) of 2023, drove a generation of new separators. MDX-Net by Kuielab combined waveform and spectrogram branches with knowledge distillation and placed among the top entries on the 2021 leaderboards. By 2023 the strongest entries (MDX23) used larger transformer backbones and ensembles.

BS-RoFormer (Band-Split RoPE Transformer) by Wei-Tsung Lu, Ju-Chiang Wang, Qiuqiang Kong and Yun-Ning Hung at ByteDance (SAMI), published in September 2023, replaced BSRNN's recurrent modelling with hierarchical transformers that use rotary positional embeddings (RoPE) over both inner-band and inter-band sequences.^[9] The system ranked first in the music separation track of the 2023 Sound Demixing Challenge (SDX'23); a smaller version trained on MUSDB18-HQ without extra data reported about 9.80 dB average SDR, which the authors describe as state of the art.[^bsroformer] It is widely regarded as one of the strongest open music source separation models, is the backbone of many high-quality vocal stems posted on community sites, and is supported by the Ultimate Vocal Remover (UVR) GUI.

Model	Year	Group	Domain	Strengths	Open weights
Conv-TasNet	2019	Columbia	Speech	Time-domain masking, ~15.3 dB SI-SNRi	Yes
DPRNN	2020	Columbia, Microsoft	Speech	Long-sequence dual-path, ~18.8 dB SI-SNRi	Yes
SepFormer	2021	SpeechBrain	Speech	Transformer dual-path, ~22.3 dB SI-SNRi	Yes
Spleeter	2019	Deezer	Music	Fast, easy install^[4]	Yes
Open-Unmix	2019	Sigsep	Music	Reproducible baseline^[8]	Yes
Demucs v3	2021	Meta	Music	Strong baseline^[6]	Yes
HT Demucs	2022	Meta	Music	Hybrid transformer^[7]	Yes
MossFormer 2	2024	Alibaba	Speech	Speech separation SOTA	Yes
MDX-Net	2021	Kuielab	Music	Top-ranked MDX 2021 entry	Yes
Band-Split RNN	2022	Tencent AI Lab	Music	Band-split frequency-domain	Yes
BS-RoFormer	2023	ByteDance	Music	~9.80 dB SDR on MUSDB18-HQ^[9]	Yes

Vocoders

A vocoder in the modern sense is a model that turns an intermediate representation, typically a mel-spectrogram, back into a waveform. Neural vocoders replaced the older Griffin-Lim algorithm and source-filter vocoders in text-to-speech pipelines because they produce much higher fidelity.

WaveNet, WaveRNN and Parallel WaveNet

WaveNet by Aaron van den Oord and colleagues at DeepMind, published in September 2016, was the first neural vocoder to match or beat the perceptual quality of concatenative speech synthesis.^[13] The model uses stacked dilated causal convolutions to predict the next audio sample, conditioned on linguistic features.^[13] WaveNet powered Google Assistant speech from 2017.
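
A minimal sketch of a dilated causal convolution follows, together with the receptive-field arithmetic for a stack of such layers. The doubling dilation pattern and the kernel size of 2 are assumptions taken from common descriptions of WaveNet rather than from this article, and both function names are illustrative.

```python
def causal_dilated_conv(signal, weights, dilation):
    """y[t] = sum_j weights[j] * x[t - j * dilation], with zeros before t = 0."""
    output = []
    for t in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(weights):
            index = t - j * dilation
            if index >= 0:  # never look at future samples
                acc += weight * signal[index]
        output.append(acc)
    return output


def receptive_field(kernel_size, dilations):
    """Number of past-and-present samples visible to the top of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


layer_out = causal_dilated_conv([1.0, 2.0, 3.0, 4.0], weights=[1.0, 1.0], dilation=2)
field = receptive_field(kernel_size=2, dilations=[2 ** i for i in range(10)])
```

Ten layers with dilations 1 through 512 already cover 1,024 samples, which shows how doubling the dilation lets the receptive field grow exponentially with depth while each layer stays cheap.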

Parallel WaveNet (van den Oord et al., 2017) used probability density distillation to produce a non-autoregressive student that ran 1,000 times faster than the original, making large-scale deployment feasible. WaveRNN by Nal Kalchbrenner and colleagues (2018) achieved similar quality with a recurrent architecture optimised for CPU inference.

MelGAN, HiFi-GAN and iSTFTNet

MelGAN by Kundan Kumar and colleagues at the Universite de Montreal (NeurIPS 2019) was the first GAN-based vocoder to reach competitive quality. HiFi-GAN by Jungil Kong, Jaehyeon Kim and Jaekyoung Bae at Kakao Enterprise (NeurIPS 2020) combined multi-scale and multi-period discriminators with a generator made of transposed convolutions and residual blocks.^[14] HiFi-GAN became the default vocoder for nearly every open TTS system released between 2021 and 2024 because it is small, fast and high quality.

iSTFTNet by Takuhiro Kaneko and colleagues at NTT (ICASSP 2022) replaced the final upsampling layers of HiFi-GAN with an inverse short-time Fourier transform, cutting inference time substantially.

BigVGAN, Vocos and the new generation

BigVGAN by Sang-gil Lee and colleagues at NVIDIA (ICLR 2023) extended HiFi-GAN to 24 kHz and 44.1 kHz, scaling the generator to 112 million parameters and adding the periodic Snake activation together with anti-aliased filtering.^[15] BigVGAN v2 was released in 2024 with improved training and is shipped through Hugging Face. It is the vocoder used by NVIDIA's NeMo TTS systems and several large generative audio projects.

Vocos by Hubert Siuzdak, published in 2023 (Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis), takes a different route.^[16] Instead of upsampling to time domain with transposed convolutions, Vocos predicts magnitude and phase in the STFT domain and uses an inverse STFT to reach the waveform, which is faster than HiFi-GAN at similar quality.^[16]
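
The final step of such a Fourier-domain vocoder can be sketched for a single frame. The example assumes the network has already produced magnitudes and phases for the non-negative frequency bins and applies a plain inverse real DFT; windowing and overlap-add across frames, which a full inverse STFT needs, are left out, and the function name is hypothetical.

```python
import cmath
import math


def frame_to_waveform(magnitudes, phases):
    """Inverse real DFT of one frame given N/2 + 1 magnitude and phase values."""
    bins = [m * cmath.exp(1j * p) for m, p in zip(magnitudes, phases)]
    n_fft = 2 * (len(bins) - 1)
    samples = []
    for n in range(n_fft):
        # DC and Nyquist bins appear once; all other bins have a mirrored twin.
        total = bins[0].real + bins[-1].real * (-1) ** n
        for k in range(1, len(bins) - 1):
            total += 2.0 * (bins[k] * cmath.exp(2j * math.pi * k * n / n_fft)).real
        samples.append(total / n_fft)
    return samples


# Energy only in bin 1 with zero phase yields one cosine cycle over 8 samples.
wave = frame_to_waveform([0.0, 4.0, 0.0, 0.0, 0.0], [0.0] * 5)
```

Because the transform produces a whole frame of samples in one step, the network itself can run at the frame rate instead of upsampling to the sample rate layer by layer.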

Neural audio codecs

A neural audio codec is a model that compresses audio into a sequence of discrete tokens and reconstructs it. Codecs serve two purposes: low-bitrate transmission for telephony or storage, and tokenization for downstream generative models that treat audio like text.

SoundStream

SoundStream by Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund and Marco Tagliasacchi at Google was published in 2021.^[17] It is an end-to-end encoder, residual vector quantizer (RVQ) and decoder trained jointly with adversarial and reconstruction losses.^[17] SoundStream achieves 3 kbps speech at quality close to legacy 12 kbps codecs and was the first published neural codec to operate as a streaming-capable end-to-end system.^[17] It was followed in 2022 by Lyra v2 from Google, a productized version used in Google Meet under poor network conditions.
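
Residual vector quantization can be illustrated with hand-written codebooks. In the sketch below each stage quantizes whatever the previous stages left unexplained; the codebook values and function names are hypothetical, and a real codec learns its codebooks jointly with the encoder and decoder.

```python
def nearest_index(vector, codebook):
    """Index of the codebook entry closest to the vector (squared distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])),
    )


def rvq_encode(vector, codebooks):
    """Quantize stage by stage, passing the residual on to the next codebook."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        index = nearest_index(residual, codebook)
        indices.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entries; fewer indices give a coarser reconstruction."""
    output = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(indices, codebooks):
        output = [o + c for o, c in zip(output, codebook[index])]
    return output


codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],     # coarse stage
    [[0.0, 0.0], [0.25, -0.25]],  # refinement stage
]
codes = rvq_encode([1.2, 0.8], codebooks)
full = rvq_decode(codes, codebooks)
coarse = rvq_decode(codes[:1], codebooks)
```

Dropping later stages at decode time lowers the bitrate at the cost of detail, which is how a single RVQ model can serve several bitrates.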

EnCodec

EnCodec by Alexandre Defossez, Jade Copet, Gabriel Synnaeve and Yossi Adi at Meta AI was released in October 2022 in the paper High Fidelity Neural Audio Compression.^[18] It builds on the SoundStream architecture, adds a small transformer language model on top of the codes for further entropy coding, and is trained at multiple bitrates between 1.5 and 24 kbps.^[18] EnCodec was released open source under the MIT licence and quickly became the tokenizer of choice for downstream audio language models, including MusicGen (Meta), VALL-E (Microsoft) and AudioGen.

DAC

Descript Audio Codec (DAC) by Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar and Kundan Kumar at Descript was published in 2023 (High-Fidelity Audio Compression with Improved RVQGAN).^[19] DAC pushes neural codec quality further by improving the RVQ training, adding a multi-scale STFT discriminator, and using snake activations.^[19] At 8 kbps it produces 44.1 kHz audio that is widely judged to be near-transparent on speech and music.^[19] DAC is also released open source and is used as the audio tokenizer in many late-2023 and 2024 audio language models.

Mimi

Mimi is the streaming-capable neural codec released by Kyutai in 2024 as part of the Moshi full-duplex speech dialogue system.^[27] It encodes 24 kHz audio at 1.1 kbps with an architecture similar to EnCodec and DAC but with a stronger emphasis on low latency. It produces tokens at a frame rate of 12.5 Hz, so each frame covers 80 milliseconds of audio and a downstream language model has to predict only 12.5 steps per second of speech.^[27] With the Moshi release, Mimi was presented as a neural codec engineered specifically for real-time conversational AI.
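
The relationship between frame rate, codebooks and bitrate is simple arithmetic, sketched below. The codebook counts and sizes are assumptions drawn from the respective papers rather than from this article: 8 codebooks of 2,048 entries for Mimi, and 1,024-entry codebooks at 75 frames per second for 24 kHz EnCodec. The function names are illustrative.

```python
import math


def codec_bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """Bits per second = frames per second x codebooks x bits per code."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


def frame_duration_ms(frame_rate_hz):
    """Length of audio covered by one token frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


mimi_bps = codec_bitrate_bps(12.5, num_codebooks=8, codebook_size=2048)
mimi_frame_ms = frame_duration_ms(12.5)
encodec_low_bps = codec_bitrate_bps(75, num_codebooks=2, codebook_size=1024)
encodec_high_bps = codec_bitrate_bps(75, num_codebooks=32, codebook_size=1024)
```

Under these assumptions the formula gives 1,100 bits per second for Mimi and spans 1,500 to 24,000 bits per second for EnCodec, in line with the bitrates quoted in this section.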

Codec	Year	Group	Bitrate	Notes
SoundStream	2021	Google	3 to 18 kbps	First end-to-end neural codec^[17]
Lyra v2	2022	Google	3.2 to 9.2 kbps	Production deployment in Meet
EnCodec	2022	Meta	1.5 to 24 kbps	MIT licence, MusicGen tokenizer^[18]
DAC	2023	Descript	8 kbps at 44.1 kHz	Near-transparent quality^[19]
Mimi	2024	Kyutai	1.1 kbps at 24 kHz	Streaming, used in Moshi^[27]

Multitask and universal audio models

A growing class of models cuts across the categories above by handling several audio-to-audio tasks with a single backbone, sometimes alongside text inputs or outputs.

SpeechT5

SpeechT5 by Junyi Ao, Rui Wang and colleagues at Microsoft Research Asia, published at ACL 2022, is a unified encoder-decoder pre-trained on both speech and text.^[25] After fine-tuning, the same architecture can perform text-to-speech, automatic speech recognition, voice conversion and speech enhancement, all by changing the input and output streams.^[25] SpeechT5 is distributed through Hugging Face and is one of the first practical demonstrations of a shared backbone across audio tasks.

SeamlessM4T

SeamlessM4T (Massively Multilingual and Multimodal Machine Translation) was released by Meta AI in August 2023.^[24] Trained on the SeamlessAlign dataset, the model performs text-to-text, text-to-speech, speech-to-text and speech-to-speech translation across nearly 100 languages, with about 36 target languages for speech output.^[24] The follow-ups, SeamlessExpressive and SeamlessStreaming (late 2023), added expressive prosody transfer (so the translated voice keeps the original speaker's emotion) and low-latency streaming. SeamlessM4T weights are released under a custom Meta licence.^[24]

AudioLM and the codec language model line

AudioLM by Zalan Borsos and colleagues at Google, published in 2022, was the foundational paper that framed audio generation as language modelling over discrete neural codec tokens.^[22] It used a hierarchy of semantic tokens (from w2v-BERT) and acoustic tokens (from SoundStream).^[22] AudioLM can continue an audio prompt in a consistent voice and style, producing convincing speech and piano continuations from a few seconds of input.^[22] The model is a research artefact (not open source) but its architecture inspired VALL-E, MusicGen, AudioGen and a long subsequent line of work.

AudioGen by Felix Kreuk and colleagues at Meta AI (2022) applied the same codec language model recipe to general environmental sounds, conditioned on text prompts.

Stable Audio Open

Stable Audio Open by Stability AI, released in June 2024, is an open-weights text-to-audio diffusion model trained on the CC-licensed Free Music Archive and Freesound libraries. While it is primarily text-to-audio, the model can be conditioned on existing audio for style transfer and timbre matching, putting it on the boundary between text-to-audio and audio-to-audio.

Music generation models with audio input

Several large music generation models accept audio as a conditioning signal in addition to text, blurring the line with audio-to-audio:

MusicGen by Jade Copet and colleagues at Meta AI (2023) is primarily text-to-music but also accepts a melody waveform as conditioning, effectively performing style or instrument transfer on a tune.^[23]
MusicLM by Andrea Agostinelli and colleagues at Google (2023) added humming and whistling as audio prompts.
Stable Audio 2.0 from Stability AI (April 2024) supports audio-to-audio prompting, allowing users to upload a reference clip and re-render it in a different style.
Suno's Cover and Udio's Remix features (2024) take an existing song and re-perform it with new instrumentation or vocals.

Open-source ecosystem

The day-to-day open-source landscape for audio-to-audio models is built around a few hubs:

Hugging Face audio-to-audio: the task page on the Hugging Face Hub lists hundreds of community models for separation, enhancement and voice conversion.^[28] The transformers and pyannote-audio libraries provide inference wrappers.
demucs: the official demucs package, maintained by Alexandre Defossez, ships Hybrid Transformer Demucs as a command-line tool that produces stems with one command.^[7] It is widely used as a default separator in studio workflows.
Ultimate Vocal Remover (UVR): a free GUI by Anjok07 that wraps a wide range of models including MDX-Net, Demucs, BS-RoFormer and Kim's UVR-MDX-NET variants. UVR is used heavily by hobbyist producers.
audio-separator: a Python CLI by Andrew Beveridge that exposes UVR's models as a pip-installable command, used by many automated stem-extraction pipelines.
RVC-Project and Mangio-RVC-Fork: GitHub repositories that distribute RVC training scripts and pre-trained voice models. Hugging Face hosts thousands of community-trained voice models.
Coqui TTS and NVIDIA NeMo: TTS toolkits that also ship vocoders, voice conversion checkpoints and speech enhancement components.
AudioCraft: Meta's library that bundles EnCodec, AudioGen and MusicGen in one repo; the code is MIT-licensed, while the released model weights carry a non-commercial CC-BY-NC 4.0 licence.
Resemble Enhance, DeepFilterNet, Demucs Denoiser: pip-installable speech enhancement tools; DeepFilterNet and the Demucs Denoiser are designed to run in real time on CPU.^[11]^[12]
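
As a rough illustration of the one-command demucs workflow listed above, the sketch below wraps the demucs command-line tool from Python. The helper names are hypothetical. The `-n`, `-o` and `--two-stems` options, the `htdemucs` model name and the output folder layout follow the demucs README at the time of writing and should be checked against `demucs --help` for the installed version.

```python
import shutil
import subprocess
from pathlib import Path


def build_demucs_command(track, model="htdemucs", two_stems=None, out_dir="separated"):
    """Build the argument list for one demucs CLI call (nothing is executed here)."""
    cmd = ["demucs", "-n", model, "-o", str(out_dir)]
    if two_stems is not None:
        # e.g. two_stems="vocals" requests vocals plus one accompaniment stem
        cmd.append(f"--two-stems={two_stems}")
    cmd.append(str(track))
    return cmd


def separate(track, **options):
    """Run demucs on one file; requires the demucs package to be installed."""
    if shutil.which("demucs") is None:
        raise RuntimeError("demucs executable not found on PATH")
    subprocess.run(build_demucs_command(track, **options), check=True)
    # Assumed default layout: <out_dir>/<model>/<track name without extension>/
    out_dir = Path(options.get("out_dir", "separated"))
    model = options.get("model", "htdemucs")
    return out_dir / model / Path(track).stem


if __name__ == "__main__":
    # Only prints the command, so the sketch runs without demucs installed.
    print(" ".join(build_demucs_command("song.mp3", two_stems="vocals")))
```

Keeping command construction separate from execution lets an automated pipeline log or test the exact invocation before any model weights are downloaded.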

Applications and concerns

Audio-to-audio models have moved out of the lab and into daily life. Podcasters clean dialogue with Adobe Enhance Speech. Video conferencing platforms remove keyboard clatter with Krisp, RTX Voice or built-in Teams suppression. DJs and producers extract vocals with Demucs or BS-RoFormer to make remixes that would have required a multitrack session a decade ago. Hearing aid manufacturers, including GN ReSound and Starkey, ship neural speech enhancement on-device. Streaming services such as Spotify use neural codecs internally for low-bitrate listening.

The same capabilities also raise concerns. Voice cloning systems can generate convincing impersonations of public figures from a few seconds of reference audio, and several incidents in 2023 and 2024 involved RVC and ElevenLabs clones used in scam calls and political robocalls. In February 2024 the US Federal Communications Commission ruled that robocalls using AI-generated voices are illegal under the Telephone Consumer Protection Act, after a cloned Joe Biden voice was used to discourage New Hampshire voters from casting ballots.^[30] Meta cited similar concerns when withholding Voicebox weights.^[21]

Detection of audio deepfakes is an active research area. The ASVspoof series of challenges, run since 2015, evaluates anti-spoofing systems. Commercial detectors, including those from Pindrop and Resemble AI (Resemble Detect), target call centres and journalism.

Copyright is contested for music separation: extracting and re-using stems from commercial recordings is technically possible but may infringe the underlying composition and master rights. Spleeter's release in 2019 prompted debate among labels and platforms, and the question of stem extraction has resurfaced with each new state-of-the-art separator.

Finally, the audio-to-audio toolkit has begun to merge with speech-to-speech language models such as Moshi (Kyutai, 2024)^[27] and GPT-4o voice mode (OpenAI, 2024). In these systems a single end-to-end transformer takes a user's microphone audio and produces a spoken reply, replacing the traditional pipeline of speech-to-text, LLM and text-to-speech. The neural codecs described above are the tokenizers that make this possible.

References

1. Pascual, S., Bonafonte, A., Serra, J. (2017). *SEGAN: Speech Enhancement Generative Adversarial Network*. Interspeech 2017. arXiv:1703.09452.
2. Kameoka, H., Kaneko, T., Tanaka, K., Hojo, N. (2018). *StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks*. IEEE SLT.
3. Qian, K., Zhang, Y., Chang, S., Yang, X., Hasegawa-Johnson, M. (2019). *AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss*. ICML 2019.
4. Hennequin, R., Khlif, A., Voituret, F., Moussallam, M. (2020). *Spleeter: a fast and efficient music source separation tool with pre-trained models*. Journal of Open Source Software.
5. Defossez, A., Usunier, N., Bottou, L., Bach, F. (2019). *Music Source Separation in the Waveform Domain*. arXiv:1911.13254.
6. Defossez, A. (2021). *Hybrid Spectrogram and Waveform Source Separation*. ISMIR 2021 Music Demixing Workshop.
7. Rouard, S., Massa, F., Defossez, A. (2022). *Hybrid Transformers for Music Source Separation*. ICASSP 2023.
8. Stoter, F.-R., Uhlich, S., Liutkus, A., Mitsufuji, Y. (2019). *Open-Unmix - A Reference Implementation for Music Source Separation*. JOSS.
9. Lu, W.-T., Wang, J.-C., Kong, Q., Hung, Y.-N. (2023). *Music Source Separation with Band-Split RoPE Transformer*. arXiv:2309.02612.
10. Hu, Y. et al. (2020). *DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement*. Interspeech 2020.
11. Schroter, H., Escalante-B., A. N., Rosenkranz, T., Maier, A. (2022). *DeepFilterNet: A Low Complexity Speech Enhancement Framework for Full-Band Audio*. ICASSP.
12. Defossez, A., Synnaeve, G., Adi, Y. (2020). *Real Time Speech Enhancement in the Waveform Domain*. Interspeech 2020.
13. van den Oord, A. et al. (2016). *WaveNet: A Generative Model for Raw Audio*. arXiv:1609.03499.
14. Kong, J., Kim, J., Bae, J. (2020). *HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis*. NeurIPS 2020.
15. Lee, S.-g., Ping, W., Ginsburg, B., Catanzaro, B., Yoon, S. (2023). *BigVGAN: A Universal Neural Vocoder with Large-Scale Training*. ICLR 2023.
16. Siuzdak, H. (2023). *Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis*. arXiv:2306.00814.
17. Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., Tagliasacchi, M. (2021). *SoundStream: An End-to-End Neural Audio Codec*. IEEE/ACM TASLP.
18. Defossez, A., Copet, J., Synnaeve, G., Adi, Y. (2022). *High Fidelity Neural Audio Compression*. arXiv:2210.13438 (EnCodec).
19. Kumar, R., Seetharaman, P., Luebs, A., Tewari, I., Kumar, K. (2023). *High-Fidelity Audio Compression with Improved RVQGAN*. NeurIPS 2023 (DAC).
20. Wang, C. et al. (2023). *Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers*. arXiv:2301.02111 (VALL-E).
21. Le, M. et al. (2023). *Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale*. Meta AI research paper.
22. Borsos, Z. et al. (2022). *AudioLM: a Language Modeling Approach to Audio Generation*. arXiv:2209.03143.
23. Copet, J. et al. (2023). *Simple and Controllable Music Generation*. NeurIPS 2023 (MusicGen).
24. Seamless Communication et al. (2023). *SeamlessM4T: Massively Multilingual and Multimodal Machine Translation*. Meta AI publication.
25. Ao, J. et al. (2022). *SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing*. ACL 2022.
26. Qin, Z. et al. (2023). *OpenVoice: Versatile Instant Voice Cloning*. MyShell.ai research paper.
27. Kyutai. (2024). *Moshi: a speech-text foundation model for real-time dialogue* (introduces the Mimi codec).
28. Hugging Face. *Audio-to-Audio task page*. https://huggingface.co/tasks/audio-to-audio
29. Microsoft. *Deep Noise Suppression (DNS) Challenge series*. https://www.microsoft.com/en-us/research/academic-program/deep-noise-suppression-challenge-icassp-2023/
30. Federal Communications Commission. (February 8, 2024). *Declaratory Ruling: AI-generated voices in robocalls are illegal under the TCPA*.
