# Audio-to-Audio Models

> Source: https://aiwiki.ai/wiki/audio-to-audio_models
> Updated: 2026-07-16
> Categories: AI Models, Music & Audio Generation, Speech & Audio AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Audio Models](/wiki/audio_models) and Tasks*

**Audio-to-audio models** are [machine learning](/wiki/machine_learning) systems that take an audio waveform as input and produce a different audio waveform as output. The category covers [voice conversion](/wiki/voice_conversion), [speech enhancement](/wiki/speech_enhancement) (denoising and dereverberation), [music source separation](/wiki/source_separation), [vocoders](/wiki/vocoder) that turn spectrograms into waveforms, and [neural audio codecs](/wiki/neural_audio_codec) that compress and reconstruct sound. They sit alongside text-to-speech and [speech-to-text](/wiki/speech_recognition) in the broader family of [audio models](/wiki/audio_models), but they are distinguished by the fact that both the input and the output live in the [audio](/wiki/audio) domain.

Most modern systems share a common pipeline: an encoder that turns the input waveform or spectrogram into a learned representation, a transformation network that manipulates that representation (separating sources, removing noise, swapping speaker identity, predicting clean targets), and a decoder or vocoder that synthesizes the new waveform. Since around 2019 the field has moved away from spectrogram masking with classical signal processing toward fully neural approaches, often built around [diffusion models](/wiki/diffusion_models), [transformer](/wiki/transformer) architectures, or generative adversarial networks ([GANs](/wiki/gan)). The same underlying components show up across the subfields: a HiFi-GAN vocoder can be the back end of a [voice cloning](/wiki/voice_cloning) pipeline, a music separator, or a neural codec.

## Overview

The phrase "audio-to-audio" appears on Hugging Face as one of the official task tags.[28] Models filed under it on the Hub include speech enhancement, voice conversion, source separation, target speaker extraction, and bandwidth extension.[28] The taxonomy used in this article groups the field into the following tasks.

| Task | Input | Output | Typical use |
| --- | --- | --- | --- |
| [Voice conversion](/wiki/voice_conversion) | Speech from speaker A | Same words spoken by speaker B | Voice cloning, dubbing, anonymization |
| Speech enhancement | Noisy or reverberant speech | Clean speech | Conferencing, podcasts, hearing aids |
| Source separation | Mixed audio | Individual stems | Karaoke, remix, music production |
| [Vocoder](/wiki/vocoder) | Spectrogram or features | Waveform | Back end of [text-to-speech](/wiki/text_to_speech) |
| [Neural audio codec](/wiki/neural_audio_codec) | Waveform | Compressed tokens, then waveform | Low-bitrate audio, generative audio backbones |
| Speech-to-speech translation | Speech in language A | Speech in language B | Real-time interpretation |
| Bandwidth extension | 8 kHz speech | 16 or 24 kHz speech | Upsampling old recordings |
| Target speaker extraction | Multi-talker audio | One speaker's voice | Cocktail-party problem |

The history of the field tracks the wider arc of [deep learning](/wiki/deep_learning) audio research. Classical signal processing (Wiener filtering, spectral subtraction, ICA-based source separation) dominated until around 2015. Then convolutional [neural networks](/wiki/neural_networks) operating on spectrograms took over, with U-Net architectures from medical imaging adapted for vocal separation. [Generative AI](/wiki/generative_ai) models, starting with WaveNet in 2016, made waveform-level synthesis feasible.[13] The transformer wave reached audio around 2020 and 2021, and by 2023 large generative models such as VALL-E and Voicebox were producing voice cloning that needed only a few seconds of reference audio.[20][21]

## Voice conversion

[Voice conversion](/wiki/voice_conversion) (VC) systems take an utterance from one speaker and re-render it in the voice of another, ideally keeping linguistic content and prosody intact. Early VC used Gaussian mixture models or vector quantization on aligned parallel data. Modern systems learn disentangled representations of content and speaker identity, then recombine them.

### StarGAN-VC, AutoVC and the disentanglement era

**StarGAN-VC** was introduced by Hirokazu Kameoka and colleagues at NTT in 2018.[2] It applies the StarGAN image translation framework to mel-cepstral features, enabling many-to-many voice conversion without parallel training data.[2] A follow-up, StarGAN-VC2, improved naturalness in 2019.

**AutoVC** was published by Kaizhi Qian and collaborators at MIT and IBM in 2019.[3] The architecture is a content encoder, a speaker encoder, and a decoder, with an information bottleneck on the content path that forces the model to discard speaker identity.[3] AutoVC was one of the first systems to achieve plausible zero-shot voice conversion to unseen target speakers.[3]

### Singing voice conversion: SoftVC VITS SVC and RVC

**SoftVC VITS Singing Voice Conversion (So-VITS-SVC)** is an open-source project that combines a SoftVC content encoder with the [VITS](/wiki/vits) end-to-end TTS model. The repository was released on GitHub by the svc-develop-team and went through several iterations between 2022 and 2023. The 4.0 line switched the content features to the 12th layer of ContentVec and replaced the vocoder with NSF-HiFiGAN, while 4.1 added an optional shallow diffusion stage for higher sound quality. So-VITS-SVC 4.0 and 4.1 became the de facto standard for fan-made covers in which one singer's voice is mapped onto another's recording.

**Retrieval-based Voice Conversion (RVC)** is a sibling project that became the dominant voice cloning tool on social media in 2023. RVC uses a HuBERT-style content encoder, an NSF-HiFiGAN vocoder, and a feature retrieval step that pulls the closest matching frames from a target speaker's voice database to suppress timbre leakage from the source speaker. The original RVC repository, RVC-Project, gathered tens of thousands of GitHub stars and powered viral clips of Joe Biden, Donald Trump and various musicians appearing to sing songs they never recorded.

**NSF-HiFiGAN** is the vocoder used by both So-VITS-SVC and RVC. It is a HiFi-GAN variant with a Neural Source-Filter front end that takes fundamental frequency (F0) as an input, which helps preserve pitch through voice conversion.

**DiffSVC** by Songxiang Liu, Yuewen Cao, Dan Su and Helen Meng (CUHK and Tencent AI Lab, ASRU 2021) was the first singing voice conversion system built on a denoising diffusion probabilistic model, generating acoustic features from phonetic posteriorgram content features.[^diffsvc] **FastSVC** by Shijun Wang and Yi Zhao introduced a non-autoregressive approach for faster inference.

### OpenVoice and modern instant voice cloning

**OpenVoice** was released by MyShell.ai in December 2023.[26] It separates tone color cloning, which captures speaker timbre from a short reference clip, from style control, which handles emotion, accent and pacing.[26] OpenVoice was published with code on GitHub and an accompanying paper by Zengyi Qin and colleagues.[26] **OpenVoice v2** followed in April 2024 with native multilingual support for English, Spanish, French, Chinese, Japanese and Korean.

### kNN-VC and Seed-VC: the self-supervised and zero-shot wave

**kNN-VC (Voice Conversion With Just Nearest Neighbors)** by Matthew Baas, Benjamin van Niekerk and Herman Kamper at Stellenbosch University (Interspeech 2023) showed that any-to-any voice conversion needs no dedicated conversion network at all. Source and reference utterances are encoded into self-supervised [WavLM](/wiki/wavlm) features, each source frame is replaced by the mean of its k nearest neighbours among the reference frames, and a HiFi-GAN vocoder synthesizes the result. The authors report that this concatenative approach improves speaker similarity over prior methods at similar intelligibility, despite its simplicity.[^knnvc]
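
The nearest-neighbour matching step described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' code: the two-dimensional vectors stand in for WavLM frames, and the use of cosine distance, the value of `k`, and the function names are assumptions made for the example. The vocoder stage is omitted.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_convert(source_frames, reference_frames, k=4):
    """Replace each source frame with the mean of its k nearest reference frames."""
    converted = []
    for frame in source_frames:
        # Rank every reference frame by its distance to this source frame.
        ranked = sorted(reference_frames,
                        key=lambda ref: cosine_distance(frame, ref))
        nearest = ranked[:k]
        # Average the selected reference frames dimension by dimension.
        mean = [sum(ref[d] for ref in nearest) / len(nearest)
                for d in range(len(frame))]
        converted.append(mean)
    return converted

# Toy 2-D "features" standing in for self-supervised frames (hypothetical values).
source = [[1.0, 0.1], [0.1, 1.0]]
reference = [[2.0, 0.0], [2.0, 0.4], [0.0, 3.0], [0.3, 3.0]]
converted = knn_convert(source, reference, k=2)
```

Every output frame is built only from reference frames, which is why the converted features carry the target speaker's timbre while the frame order still follows the source utterance.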

**Seed-VC** by Songting Liu (Nanyang Technological University, arXiv November 2024) is a zero-shot framework that converts a source utterance to the timbre of an unseen reference. It applies an external timbre shifter during training to perturb the source timbre and uses a diffusion transformer that reads the whole reference through in-context learning. The paper reports higher speaker similarity and lower word error rate than the OpenVoice and CosyVoice baselines, and the framework extends to zero-shot singing voice conversion with fundamental-frequency conditioning and offers a real-time mode. Code and pretrained models were released on GitHub.[^seedvc]

### Voicebox, VALL-E and large generative voice models

**VALL-E** was announced by Microsoft Research in January 2023, in the paper *Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers* by Chengyi Wang and colleagues.[20] The model is a transformer language model trained on tokens from the EnCodec neural codec; given a 3-second voice prompt it can synthesize speech in that voice with reasonable identity preservation.[20] VALL-E demonstrated for the first time that token-level language modelling over a neural codec could replace traditional vocoder pipelines.[20] **VALL-E X**, also released by Microsoft in 2023, extended the model to cross-lingual cloning, allowing an English speaker to be cloned for Mandarin output.

**Voicebox** was published by Meta AI in June 2023.[21] The paper, by Matthew Le and colleagues, describes a non-autoregressive flow-matching model trained on 60,000 hours of multilingual speech that supports text-to-speech, noise removal, content editing inside a recording and zero-shot voice cloning.[21] Meta did not release Voicebox weights or a demo product, citing potential misuse for impersonation.[21]

| Model | Year | Group | Approach | Open weights |
| --- | --- | --- | --- | --- |
| StarGAN-VC | 2018 | NTT | StarGAN on mel-cepstral features[2] | Research code |
| AutoVC | 2019 | MIT, IBM | Bottleneck autoencoder[3] | Research code |
| So-VITS-SVC | 2022 to 2023 | Open source | SoftVC content plus VITS | Yes |
| RVC | 2023 | Open source | HuBERT plus NSF-HiFiGAN with retrieval | Yes |
| DiffSVC | 2021 | CUHK, Tencent | Diffusion model | Research code |
| OpenVoice v1 | 2023 | MyShell | Tone and style decoupling[26] | Yes |
| OpenVoice v2 | 2024 | MyShell | Multilingual extension | Yes |
| kNN-VC | 2023 | Stellenbosch | WavLM features plus k-NN | Yes |
| Seed-VC | 2024 | NTU | Diffusion transformer, zero-shot | Yes |
| VALL-E | 2023 | Microsoft | Codec language model[20] | No |
| VALL-E X | 2023 | Microsoft | Cross-lingual codec LM | No |
| Voicebox | 2023 | Meta | Flow matching[21] | No |

## Speech enhancement

Speech enhancement covers any task in which the goal is to recover clean speech from a degraded signal. The main sub-tasks are denoising (removing background noise), dereverberation (removing room reverb), declipping, packet loss concealment and bandwidth extension. The field is older than deep learning, with Yariv Ephraim and David Malah's 1984 MMSE-STSA estimator still cited in modern papers, but the current state of the art is dominated by neural networks.

### SEGAN, DCCRN and DCUNet

**SEGAN** by Santiago Pascual, Antonio Bonafonte and Joan Serra at the Universitat Politecnica de Catalunya, presented at Interspeech 2017, was one of the first speech enhancement systems to operate directly on raw waveforms with a generative adversarial network.[1] The generator is a fully convolutional encoder-decoder; the discriminator decides whether a waveform looks clean or denoised.[1]

**Deep Complex Convolution Recurrent Network (DCCRN)** by Yanxin Hu and colleagues (Interspeech 2020) won the first round of the [Deep Noise Suppression Challenge](/wiki/dns_challenge) organized by Microsoft.[10] It uses complex-valued convolutions to handle phase information in the short-time Fourier transform directly, rather than processing magnitude and phase separately.[10] **DCUNet** is a related complex U-Net architecture from 2019.

### RNNoise and FullSubNet

**RNNoise** by Jean-Marc Valin (IEEE MMSP 2018) is a widely deployed open-source noise suppressor that pairs classical signal processing with a small recurrent network. A four-layer recurrent network using gated recurrent units (GRUs) estimates ideal critical-band gains while a pitch filter attenuates noise between harmonics. Valin reports that the hybrid design runs in real time at 48 kHz on a low-power CPU with a model that fits in roughly 85 kB, and the project is bundled in many open-source voice and telephony stacks.[^rnnoise]

**FullSubNet** by Xiang Hao, Xiangdong Su, Radu Horaud and Xiaofei Li (Inner Mongolia University and Inria, ICASSP 2021) is a real-time single-channel enhancement model that connects a full-band model, which captures global spectral context and cross-band dependencies, with a sub-band model that processes each frequency independently from a few neighbouring bands. The authors report that the fused system exceeds the top-ranked methods of the Interspeech 2020 [Deep Noise Suppression Challenge](/wiki/dns_challenge), and the code is open source.[^fullsubnet]

### NSNet and the Microsoft pipeline

**Noise Suppression Network (NSNet)** is Microsoft Research's noise suppressor, used in Microsoft Teams since 2020. It is a recurrent neural network that operates on log-mel features and outputs a suppression gain per time-frequency bin. The team behind NSNet, including Sebastian Braun and Hannes Gamper, also organizes the annual Deep Noise Suppression Challenge that has driven competitive progress on the task.[29]

### DeepFilterNet

**DeepFilterNet** by Hendrik Schroter, Alberto Escalante and Andreas Maier at Friedrich-Alexander-Universitat Erlangen-Nurnberg was introduced in 2022 and updated to DeepFilterNet 2 and DeepFilterNet 3 in 2023. It operates at 48 kHz, runs in real time on a single CPU core, and is permissively licensed.[11] The architecture predicts a per-frequency deep filter (a complex-valued convolution over recent frames) rather than a single suppression mask, which preserves transients and consonants better than mask-only methods.[11]
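
The difference between a deep filter and a plain suppression mask can be illustrated with a small sketch. It assumes the common formulation in which each time-frequency bin of the output is a weighted sum of the current and previous frames of the same bin, with complex weights. The function name, the data layout and the toy values are hypothetical, and the network that would predict the coefficients is not shown.

```python
def apply_deep_filter(spec, taps):
    """Filter each frequency bin along time with its own complex taps.

    spec[t][f]    : complex STFT value of the noisy input (frame t, bin f)
    taps[t][f][i] : complex coefficient applied to frame t - i of bin f
    """
    enhanced = []
    for t, frame in enumerate(spec):
        row = []
        for f in range(len(frame)):
            acc = 0j
            for i, coef in enumerate(taps[t][f]):
                if t - i >= 0:  # frames before the start are treated as zero
                    acc += coef * spec[t - i][f]
            row.append(acc)
        enhanced.append(row)
    return enhanced

# One frequency bin, three frames, two taps per frame (hypothetical values).
noisy = [[1 + 1j], [2 + 0j], [0 + 1j]]
two_taps = [[[0.5, 0.5j]], [[0.5, 0.5j]], [[0.5, 0.5j]]]
filtered = apply_deep_filter(noisy, two_taps)

# A single real tap per bin reduces to an ordinary suppression mask.
one_tap = [[[0.5]], [[0.5]], [[0.5]]]
masked = apply_deep_filter(noisy, one_tap)
```

With one real tap the operation is exactly a per-bin gain. Additional complex taps let the output draw on neighbouring frames and adjust phase, which a magnitude mask cannot do.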

### Resemble Enhance, Adobe Enhance Speech and Voice Isolator

**Resemble Enhance** was open-sourced by Resemble.ai under an MIT license in late 2023, with the source code published on GitHub. It combines a noise suppression stage with a CFM (conditional flow matching) generative model that re-synthesizes a clean speech waveform, allowing the system to fix not only noise but also bandwidth limitation and other distortions.

**Adobe Enhance Speech**, originally announced as Project Shasta in 2022 and released as a free web tool through Adobe Podcast in 2023, applies a proprietary neural model to clean up dialogue recorded in untreated rooms. It quickly became a default tool among podcasters and was integrated into Adobe Premiere Pro as Enhance Speech.

**ElevenLabs Voice Isolator** is a hosted product released by [ElevenLabs](/wiki/elevenlabs) in 2024 that strips background music and noise from voice recordings. It targets the same use case as Adobe Enhance Speech.

### Krisp and NVIDIA Broadcast

**Krisp** is an Armenian company founded in 2017 that ships a desktop application and SDK for real-time noise suppression in video calls. Its model runs locally on the user's CPU.

**NVIDIA RTX Voice** was launched in April 2020 during the early days of the COVID-19 lockdowns as a noise suppression utility that ran on RTX GPUs using tensor cores. It evolved into **NVIDIA Broadcast**, a free Windows application available on RTX cards that includes noise removal, room echo removal, virtual background, eye contact correction and other camera and microphone effects.

### Demucs for denoising

[Demucs](/wiki/demucs), the music separation model from Meta, was adapted for speech enhancement in a paper titled *Real Time Speech Enhancement in the Waveform Domain* by Alexandre Defossez, Gabriel Synnaeve and Yossi Adi (Interspeech 2020).[12] The variant, sometimes called Denoiser or Demucs Denoiser, was for a time a popular open-source baseline before being overtaken by DeepFilterNet and other dedicated speech models.

| System | Year | Type | License | Notes |
| --- | --- | --- | --- | --- |
| RNNoise | 2018 | Hybrid DSP plus GRU | Open source | Real-time 48 kHz, ~85 kB model |
| SEGAN | 2017 | GAN, waveform | Open source | First waveform GAN denoiser[1] |
| DCCRN | 2020 | Complex CRN | Research | DNS 2020 winner[10] |
| FullSubNet | 2021 | Full plus sub-band fusion | Open source | Beat DNS 2020 top entries |
| NSNet | 2020 | RNN | Proprietary | Used in Microsoft Teams |
| Demucs Denoiser | 2020 | Waveform U-Net | Open source | Real-time CPU[12] |
| DeepFilterNet 3 | 2023 | Deep filter | Open source | 48 kHz, real time CPU |
| Resemble Enhance | 2023 | Denoise plus CFM | Open source | Quality restoration |
| Adobe Enhance Speech | 2023 | Proprietary | Free web tool | Podcast cleanup |
| ElevenLabs Voice Isolator | 2024 | Proprietary | Paid API | Voice isolation |
| Krisp | 2017 onward | RNN | Proprietary | Real-time call filter |
| NVIDIA Broadcast | 2020 onward | Proprietary | Free with RTX | GPU accelerated |

## Source separation

[Source separation](/wiki/source_separation) breaks a mixed audio recording into its component sources. The most studied flavour is music separation into vocals, drums, bass and other, evaluated on the MUSDB18 benchmark introduced by Zafar Rafii and colleagues in 2017. Speech separation, where the goal is to split overlapping speakers, is the related problem behind the cocktail-party effect and is usually benchmarked on the WSJ0-2mix mixture set. Separation quality is most often reported as scale-invariant signal-to-distortion ratio (SI-SDR) or its improvement over the input mixture (SI-SDRi), a metric proposed by Jonathan Le Roux, Scott Wisdom, Hakan Erdogan and John R. Hershey (ICASSP 2019) to make the older SDR measure robust to amplitude scaling.[^sisdr] Higher SI-SDR in decibels means a cleaner separation.
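
As a worked illustration of the metric, the sketch below computes SI-SDR by projecting the estimate onto the reference and comparing the energy of that projection with the energy of what is left over. It assumes both signals are zero-mean and the same length. The function names and the toy signals are illustrative, and the code does not guard against a perfect estimate, for which the residual energy would be zero.

```python
import math

def si_sdr(estimate, target):
    """Scale-invariant SDR in dB for two equal-length, zero-mean signals."""
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    # Optimal scaling of the target so that it best matches the estimate.
    scale = dot / target_energy
    projection = [scale * t for t in target]
    residual = [e - p for e, p in zip(estimate, projection)]
    signal_energy = sum(p * p for p in projection)
    residual_energy = sum(r * r for r in residual)
    return 10.0 * math.log10(signal_energy / residual_energy)

def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: gain of the estimate over the unprocessed mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)

target = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, -0.1]       # target plus a small error
louder = [3.0 * x for x in estimate]     # same estimate at a different gain
mixture = [1.0, 1.0, -1.0, -1.0]         # target plus an equally loud interferer
```

On these toy signals the estimate scores about 20 dB, rescaling it leaves the score unchanged, and the mixture scores about 0 dB, so the improvement is also about 20 dB.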

### Spleeter

**Spleeter** was open-sourced by the research team at Deezer in November 2019.[4] It uses a U-Net trained on the company's internal catalogue and ships pretrained two-stem (vocals and accompaniment), four-stem (vocals, drums, bass, other) and five-stem (adds piano) models.[4] Despite its modest size, Spleeter became the most widely used music separator for several years because it was easy to install, fast on CPU and good enough for karaoke and DJ use. The release paper, by Romain Hennequin, Anis Khlif, Felix Voituret and Manuel Moussallam, is one of the most cited audio papers of the 2019 to 2020 period.

### Demucs and Hybrid Demucs

**Demucs** by Alexandre Defossez at Meta AI was published in 2019 and revised through several versions.[5] The original Demucs was a waveform U-Net with a bidirectional LSTM bottleneck.[5] **Hybrid Demucs** (2021) combined waveform-domain and spectrogram-domain branches that share a common bottleneck, hitting state-of-the-art performance on MUSDB18.[6] **Hybrid Transformer Demucs (HT Demucs)**, released in 2022 in the paper *Hybrid Transformers for Music Source Separation*, replaced the innermost layers with a cross-domain transformer that applies self-attention within each branch and cross-attention between them, and is the version distributed in the `demucs` Python package today.[7]

### Open-Unmix and Sigsep

**Open-Unmix (UMX)** by Fabian-Robert Stoter, Stefan Uhlich, Antoine Liutkus and Yuki Mitsufuji was released in 2019 by the Sigsep collective, a group of academic and industry researchers focused on reproducible source separation.[8] It is a bidirectional LSTM trained on spectrograms and serves as a reference implementation in many papers.[8]

### Conv-TasNet, DPRNN and SepFormer: time-domain speech separation

A parallel line of research targets overlapping speech rather than music, and it drove much of the architectural innovation now used across audio-to-audio. **Conv-TasNet** by Yi Luo and Nima Mesgarani at Columbia University (IEEE/ACM TASLP 2019) replaced the spectrogram front end with a learned convolutional encoder, used a temporal convolutional network to predict a mask for each source, and reconstructed the waveforms with a transposed-convolution decoder. With roughly 5 million parameters it reported about 15.3 dB SI-SNRi on WSJ0-2mix, surpassing ideal time-frequency magnitude masks and establishing the time-domain (waveform) paradigm.[^convtasnet]

**Dual-Path RNN (DPRNN)** by Yi Luo, Zhuo Chen and Takuya Yoshioka (ICASSP 2020) reorganised the separator to model very long sequences by splitting them into chunks and alternating intra-chunk and inter-chunk RNNs. Dropping DPRNN into the TasNet pipeline reached about 18.8 dB SI-SNRi on WSJ0-2mix with a model around 20 times smaller than the previous best system.[^dprnn] **DPTNet (Dual-Path Transformer Network)** by Jingjing Chen, Qirong Mao and Dong Liu (Interspeech 2020) added direct context-aware modelling within the dual-path structure and reported about 20.6 dB SDR on WSJ0-2mix.[^dptnet]
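
The chunking idea behind DPRNN can be shown without any neural network. The sketch below splits a long sequence into half-overlapping chunks and then lists the two sets of short sequences that the intra-chunk and inter-chunk passes would each process. It is a simplified illustration under stated assumptions: padding is applied only at the tail, integers stand in for feature frames, and the function names are hypothetical.

```python
def segment(sequence, chunk_size, pad_value=0):
    """Split a sequence into 50%-overlapping chunks, padding the tail."""
    hop = chunk_size // 2
    padded = list(sequence)
    # Pad until the chunks tile the sequence exactly.
    while len(padded) < chunk_size or (len(padded) - chunk_size) % hop != 0:
        padded.append(pad_value)
    return [padded[start:start + chunk_size]
            for start in range(0, len(padded) - chunk_size + 1, hop)]

def intra_chunk_views(chunks):
    """Sequences for the intra-chunk pass: one per chunk (local context)."""
    return [list(chunk) for chunk in chunks]

def inter_chunk_views(chunks):
    """Sequences for the inter-chunk pass: one per position inside a chunk."""
    return [list(column) for column in zip(*chunks)]

frames = list(range(10))               # stand-in for 10 feature frames
chunks = segment(frames, chunk_size=4)
local_views = intra_chunk_views(chunks)
global_views = inter_chunk_views(chunks)
```

Each pass only ever sees sequences of length 4 here, yet alternating the two passes lets information travel across the whole input, which is how the dual-path layout keeps very long waveforms tractable.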

**SepFormer (Separation Transformer)** by Cem Subakan, Mirco Ravanelli and colleagues, distributed through the [SpeechBrain](/wiki/speechbrain) toolkit, replaced the dual-path RNNs with multi-scale transformer blocks. It reported about 22.3 dB SI-SNRi (22.4 dB SDRi) on WSJ0-2mix with dynamic mixing, a state-of-the-art result at publication, while remaining parallelisable and faster than comparable RNN systems.[^sepformer] Pretrained SepFormer checkpoints for WSJ0-2mix, WSJ0-3mix and the noisy WHAM! set are available on Hugging Face.

### MossFormer and recent speech separation

**MossFormer** by Shengkui Zhao and Bin Ma at Alibaba (ICASSP 2023) is a speech separation model that combines a convolutional front end with a transformer bottleneck and reported state-of-the-art results on WSJ0-2mix at publication. **MossFormer2**, published in 2024, adds a recurrent module and pushes performance further on the WSJ0-2mix benchmark.

### Band-Split RNN, MDX-Net and BS-RoFormer

**Band-Split RNN (BSRNN)** by Yi Luo and Jianwei Yu at ByteDance (IEEE/ACM TASLP 2023, preprint September 2022) is the frequency-domain model that introduced the band-split idea now common in music separation. It splits the mixture spectrogram into subbands whose bandwidths can be chosen from prior knowledge of the target instrument, then interleaves band-level and sequence-level RNN modelling. The authors report that BSRNN trained only on MUSDB18-HQ outperforms several top-ranking entries of the 2021 Music Demixing Challenge, with a semi-supervised finetuning stage giving a further gain.[^bsrnn]

The **Music Demixing Challenge (MDX)**, held in 2021 and continued as the Sound Demixing Challenge in 2023, drove a generation of new separators. **MDX-Net** by Kuielab blended the outputs of a time-frequency branch and a time-domain branch and placed second on leaderboard A and third on leaderboard B in 2021. By 2023 the strongest entries (MDX23) used larger transformer backbones and ensembles.

**BS-RoFormer (Band-Split RoPE Transformer)** by Wei-Tsung Lu, Ju-Chiang Wang, Qiuqiang Kong and Yun-Ning Hung at ByteDance (SAMI), published in September 2023, replaced BSRNN's recurrent modelling with hierarchical transformers that use rotary positional embeddings (RoPE) over both inner-band and inter-band sequences.[9] The system ranked first in the music separation track of the 2023 Sound Demixing Challenge (SDX'23); a smaller version trained on MUSDB18-HQ without extra data reported about 9.80 dB average SDR, which the authors describe as state of the art.[^bsroformer] It is widely regarded as one of the strongest open music source separation models, is the backbone of many high-quality vocal stems posted on community sites, and is supported by the Ultimate Vocal Remover (UVR) GUI.

| Model | Year | Group | Domain | Strengths | Open weights |
| --- | --- | --- | --- | --- | --- |
| Conv-TasNet | 2019 | Columbia | Speech | Time-domain masking, ~15.3 dB SI-SNRi | Yes |
| DPRNN | 2020 | Columbia, Microsoft | Speech | Long-sequence dual-path, ~18.8 dB SI-SNRi | Yes |
| SepFormer | 2021 | SpeechBrain | Speech | Transformer dual-path, ~22.3 dB SI-SNRi | Yes |
| Spleeter | 2019 | Deezer | Music | Fast, easy install[4] | Yes |
| Open-Unmix | 2019 | Sigsep | Music | Reproducible baseline[8] | Yes |
| Hybrid Demucs (v3) | 2021 | Meta | Music | Strong baseline[6] | Yes |
| HT Demucs | 2022 | Meta | Music | Hybrid transformer[7] | Yes |
| MossFormer2 | 2024 | Alibaba | Speech | Speech separation SOTA | Yes |
| MDX-Net | 2021 | Kuielab | Music | MDX 2021 runner-up (leaderboard A) | Yes |
| Band-Split RNN | 2022 | ByteDance | Music | Band-split frequency-domain | Yes |
| BS-RoFormer | 2023 | ByteDance | Music | ~9.80 dB SDR on MUSDB18-HQ[9] | Yes |

## Vocoders

A **vocoder** in the modern sense is a model that turns an intermediate representation, typically a mel-spectrogram, back into a waveform. Neural vocoders replaced the older Griffin-Lim algorithm and source-filter vocoders in [text-to-speech](/wiki/text_to_speech) pipelines because they produce much higher fidelity.

### WaveNet, WaveRNN and Parallel WaveNet

**WaveNet** by Aaron van den Oord and colleagues at DeepMind, published in September 2016, was the first neural vocoder to match or beat the perceptual quality of concatenative speech synthesis.[13] The model uses stacked dilated causal convolutions to predict the next audio sample, conditioned on linguistic features.[13] WaveNet powered Google Assistant speech from 2017.
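
To see why dilation matters, the sketch below implements a one-dimensional causal dilated convolution and measures how far back a stack of such layers can see. The kernel size of 2 and the doubling dilation pattern from 1 to 512 are illustrative assumptions, and gating, residual connections and conditioning are left out.

```python
def causal_dilated_conv(signal, weights, dilation):
    """y[t] = sum_k weights[k] * signal[t - k * dilation], zero before t = 0."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # never look at future samples
                acc += w * signal[idx]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

dilations = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512

# Push an impulse through the stack and count how many outputs it reaches.
impulse = [1.0] + [0.0] * 1499
response = impulse
for d in dilations:
    response = causal_dilated_conv(response, [1.0, 1.0], d)
samples_reached = sum(1 for value in response if value != 0.0)
```

Ten layers with kernel size 2 cover 1,024 samples, whereas ten undilated layers of the same size would cover only 11. The exponential growth is what makes sample-level modelling of audio practical.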

**Parallel WaveNet** (van den Oord et al., 2017) used probability density distillation to produce a non-autoregressive student that ran 1,000 times faster than the original, making large-scale deployment feasible. **WaveRNN** by Nal Kalchbrenner and colleagues (2018) achieved similar quality with a recurrent architecture optimised for CPU inference.

### MelGAN, HiFi-GAN and iSTFTNet

**MelGAN** by Kundan Kumar and colleagues at the Universite de Montreal (NeurIPS 2019) was the first GAN-based vocoder to reach competitive quality. **HiFi-GAN** by Jungil Kong, Jaehyeon Kim and Jaekyoung Bae at Kakao Enterprise (NeurIPS 2020) combined multi-scale and multi-period discriminators with a generator made of transposed convolutions and residual blocks.[14] HiFi-GAN became the default vocoder for nearly every open TTS system released between 2021 and 2024 because it is small, fast and high quality.

**iSTFTNet** by Takuhiro Kaneko and colleagues at NTT (ICASSP 2022) replaced the final upsampling layers of HiFi-GAN with an inverse short-time Fourier transform, cutting inference time substantially.

### BigVGAN, Vocos and the new generation

**BigVGAN** by Sang-gil Lee and colleagues at NVIDIA (ICLR 2023) extended HiFi-GAN to 24 kHz and 44.1 kHz, scaling the generator to 112 million parameters and adding the periodic Snake activation together with anti-aliased filtering.[15] **BigVGAN v2** was released in 2024 with improved training and is shipped through Hugging Face. BigVGAN is the vocoder used by NVIDIA's NeMo TTS systems and several large generative audio projects.
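
The Snake activation is simple to state: it adds a squared sine term to the identity, so the output keeps a linear trend while gaining a periodic component. The sketch below is a scalar illustration only. In a real generator the frequency parameter `alpha` would be learned per channel, and the low-pass filtering used for anti-aliasing is omitted here.

```python
import math

def snake(x, alpha=1.0):
    """Snake activation: x + sin^2(alpha * x) / alpha, for alpha > 0."""
    return x + (math.sin(alpha * x) ** 2) / alpha

# Sample the activation on a small grid for two frequencies.
grid = [i * 0.5 for i in range(-6, 7)]
slow = [snake(x, alpha=1.0) for x in grid]
fast = [snake(x, alpha=4.0) for x in grid]
```

Because the sine term is squared, the output never falls below the input, and a larger `alpha` gives faster but smaller ripples around the identity line.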

**Vocos** by Hubert Siuzdak, published in 2023 (*Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis*), takes a different route.[16] Instead of upsampling to time domain with transposed convolutions, Vocos predicts magnitude and phase in the STFT domain and uses an inverse STFT to reach the waveform, which is faster than HiFi-GAN at similar quality.[16]

## Neural audio codecs

A **neural audio codec** is a model that compresses audio into a sequence of discrete tokens and reconstructs it. Codecs serve two purposes: low-bitrate transmission for telephony or storage, and tokenization for downstream generative models that treat audio like text.

### SoundStream

**SoundStream** by Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund and Marco Tagliasacchi at Google was published in 2021.[17] It is an end-to-end encoder, residual vector quantizer (RVQ) and decoder trained jointly with adversarial and reconstruction losses.[17] SoundStream achieves 3 kbps speech at quality close to legacy 12 kbps codecs and was the first published neural codec to operate as a streaming-capable end-to-end system.[17] It was followed in 2022 by **Lyra v2** from Google, a productized version used in Google Meet under poor network conditions.
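
Residual vector quantization can be sketched with two tiny codebooks. Each stage picks the codeword closest to what the previous stages left unexplained, and the decoder sums the chosen codewords. The codebook values, the function names and the two-dimensional vectors are hypothetical; a real codec learns its codebooks and applies them to encoder outputs of much higher dimension.

```python
def nearest_index(vector, codebook):
    """Index of the codeword with the smallest squared distance to vector."""
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2
                                 for v, c in zip(vector, codebook[i])))

def rvq_encode(vector, codebooks):
    """Quantize stage by stage, passing the residual to the next codebook."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected codeword from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

coarse = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0]]
codebooks = [coarse, fine]

latent = [1.2, 0.9]
codes = rvq_encode(latent, codebooks)
full = rvq_decode(codes, codebooks)               # both stages
coarse_only = rvq_decode(codes[:1], codebooks[:1])  # first stage only
```

Decoding with only the first index still gives a usable but coarser reconstruction, which illustrates how dropping later stages trades quality for a lower bitrate.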

### EnCodec

**EnCodec** by Alexandre Defossez, Jade Copet, Gabriel Synnaeve and Yossi Adi at Meta AI was released in October 2022 in the paper *High Fidelity Neural Audio Compression*.[18] It builds on the SoundStream architecture, adds a small transformer language model on top of the codes for further entropy coding, and is trained at multiple bitrates between 1.5 and 24 kbps.[18] EnCodec was released open source under the MIT licence and quickly became the tokenizer of choice for downstream audio language models, including MusicGen (Meta), VALL-E (Microsoft) and AudioGen.

### DAC

**Descript Audio Codec (DAC)** by Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar and Kundan Kumar at Descript was published in 2023 (*High-Fidelity Audio Compression with Improved RVQGAN*).[19] DAC pushes neural codec quality further by improving the RVQ training, adding a multi-scale STFT discriminator, and using snake activations.[19] At 8 kbps it produces 44.1 kHz audio that is widely judged to be near-transparent on speech and music.[19] DAC is also released open source and is used as the audio tokenizer in many late-2023 and 2024 audio language models.

### Mimi

**Mimi** is the streaming-capable neural codec released by [Kyutai](/wiki/kyutai) in 2024 as part of the [Moshi](/wiki/moshi) full-duplex speech dialogue system.[27] It encodes 24 kHz audio at 1.1 kbps with an architecture similar to EnCodec and DAC but with a stronger emphasis on low latency, producing tokens at 12.5 Hz, so each frame covers 80 milliseconds of audio and a downstream language model can predict successive tokens with correspondingly low delay.[27] The release accompanying Moshi made Mimi the first neural codec specifically engineered for real-time conversational AI.
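
The relationship between frame rate, codebooks and bitrate is plain arithmetic, sketched below. The 12.5 Hz frame rate comes from the text above. The figures of 8 codebooks with 2,048 entries each are taken as an assumption for this example and are not stated in this article; the function names are illustrative.

```python
import math

def codec_bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """Bits per second when every frame emits one index per codebook."""
    bits_per_index = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_index

def frame_duration_ms(frame_rate_hz):
    """Length of audio covered by one token frame, in milliseconds."""
    return 1000.0 / frame_rate_hz

# Assumed configuration: 8 codebooks of 2,048 entries at 12.5 frames per second.
bitrate = codec_bitrate_bps(12.5, num_codebooks=8, codebook_size=2048)
frame_ms = frame_duration_ms(12.5)
```

Under these assumptions the result is 1,100 bits per second, matching the 1.1 kbps figure, and each frame spans 80 ms. Halving the number of codebooks would halve the bitrate without changing the frame duration.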

| Codec | Year | Group | Bitrate | Notes |
| --- | --- | --- | --- | --- |
| SoundStream | 2021 | Google | 3 to 18 kbps | First end-to-end neural codec[17] |
| Lyra v2 | 2022 | Google | 3.2 to 9.2 kbps | Production deployment in Meet |
| EnCodec | 2022 | Meta | 1.5 to 24 kbps | MIT licence, MusicGen tokenizer[18] |
| DAC | 2023 | Descript | 8 kbps at 44.1 kHz | Near-transparent quality[19] |
| Mimi | 2024 | Kyutai | 1.1 kbps at 24 kHz | Streaming, used in Moshi[27] |

## Multitask and universal audio models

A growing class of models cuts across the categories above by handling several audio-to-audio tasks with a single backbone, sometimes alongside text inputs or outputs.

### SpeechT5

**SpeechT5** by Junyi Ao, Rui Wang and colleagues at Microsoft Research Asia, published at ACL 2022, is a unified encoder-decoder pre-trained on both speech and text.[25] After fine-tuning, the same architecture can perform text-to-speech, automatic speech recognition, voice conversion and speech enhancement, all by changing the input and output streams.[25] SpeechT5 is distributed through Hugging Face and is one of the first practical demonstrations of a shared backbone across audio tasks.

### SeamlessM4T

**SeamlessM4T (Massively Multilingual and Multimodal Machine Translation)** was released by Meta AI in August 2023.[24] Trained on the SeamlessAlign dataset, the model performs text-to-text, text-to-speech, speech-to-text and speech-to-speech translation across nearly 100 languages, with about 36 target languages for speech output.[24] The follow-ups, **SeamlessExpressive** and **Seamless Streaming** (October 2023), added expressive prosody transfer (so the translated voice keeps the original speaker's emotion) and low-latency streaming. SeamlessM4T weights are released under a custom Meta licence.[24]

### AudioLM and the codec language model line

**AudioLM** by Zalan Borsos and colleagues at Google, published in 2022, was the foundational paper that framed audio generation as language modelling over discrete neural codec tokens.[22] It used a hierarchy of semantic tokens (from w2v-BERT) and acoustic tokens (from SoundStream).[22] Given a few seconds of audio as a prompt, AudioLM generates a continuation that stays consistent with the prompt's speaker and style, producing convincing speech and piano continuations.[22] The model is a research artefact (not open source), but its architecture inspired VALL-E, MusicGen, AudioGen and a long line of subsequent work.

**AudioGen** by Felix Kreuk and colleagues at Meta AI (2022) applied the same codec language model recipe to general environmental sounds, conditioned on text prompts.
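
To make the "language modelling over discrete codec tokens" framing concrete: a codec with several codebooks produces a grid of integer indices (frames by codebooks), and a language model needs that grid as a single token sequence. One simple option, sketched here, is frame-major flattening with a per-codebook vocabulary offset so that indices from different codebooks never collide. This is an illustrative layout only, not the exact tokenization used by AudioLM or AudioGen; the function names and toy sizes are hypothetical.

```python
from typing import List


def flatten_codes(codes: List[List[int]], codebook_size: int) -> List[int]:
    """Flatten a [frames][codebooks] grid into one token sequence.

    The index from codebook q is shifted by q * codebook_size, giving a
    single vocabulary of num_codebooks * codebook_size token ids.
    """
    flat = []
    for frame in codes:
        for q, index in enumerate(frame):
            if not 0 <= index < codebook_size:
                raise ValueError(f"index {index} outside codebook of size {codebook_size}")
            flat.append(q * codebook_size + index)
    return flat


def unflatten_codes(tokens: List[int], num_codebooks: int, codebook_size: int) -> List[List[int]]:
    """Invert flatten_codes so a codec decoder can consume the grid again."""
    if len(tokens) % num_codebooks != 0:
        raise ValueError("token count is not a multiple of num_codebooks")
    frames = []
    for start in range(0, len(tokens), num_codebooks):
        chunk = tokens[start:start + num_codebooks]
        frames.append([t - q * codebook_size for q, t in enumerate(chunk)])
    return frames


# Toy example: 2 frames, 3 codebooks, 8 entries per codebook.
codes = [[5, 1, 7], [0, 3, 2]]
tokens = flatten_codes(codes, codebook_size=8)
print(tokens)  # [5, 9, 23, 0, 11, 18]
print(unflatten_codes(tokens, num_codebooks=3, codebook_size=8) == codes)  # True
```

Flattening multiplies the sequence length by the number of codebooks, which is one reason codec language models pay close attention to frame rate and codebook count.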

### Stable Audio Open

**Stable Audio Open** by [Stability AI](/wiki/stability_ai), released in June 2024, is an open-weights text-to-audio diffusion model trained on the CC-licensed Free Music Archive and Freesound libraries. While it is primarily text-to-audio, the model can be conditioned on existing audio for style transfer and timbre matching, putting it on the boundary between text-to-audio and audio-to-audio.

## Music generation models with audio input

Several large music generation models accept audio as a conditioning signal in addition to text, blurring the line with audio-to-audio:

- **MusicGen** by Jade Copet and colleagues at Meta AI (2023) is primarily text-to-music but also accepts a melody waveform as conditioning, effectively performing style or instrument transfer on a tune.[23]
- **MusicLM** by Andrea Agostinelli and colleagues at Google (2023) accepts a hummed or whistled melody as an audio prompt alongside the text description.
- **Stable Audio 2.0** from Stability AI (April 2024) supports audio-to-audio prompting, allowing users to upload a reference clip and re-render it in a different style.
- **Suno's Cover** and **Udio's Remix** features (2024) take an existing song and re-perform it with new instrumentation or vocals.

## Open-source ecosystem

The day-to-day open-source landscape for audio-to-audio models is built around a few hubs:

- **[Hugging Face](/wiki/hugging_face) audio-to-audio**: the task page on the Hugging Face Hub lists hundreds of community models for separation, enhancement and voice conversion.[28] Libraries such as `speechbrain`, `asteroid` and `transformers` provide inference wrappers.
- **demucs**: the official `demucs` package, maintained by Alexandre Defossez, ships Hybrid Transformer Demucs as a command-line tool that separates a track into stems with a single command.[7] It is a common default separator in studio and remix workflows.
- **Ultimate Vocal Remover (UVR)**: a free GUI by Anjok07 that wraps a wide range of models including MDX-Net, Demucs, BS-RoFormer and Kim's UVR-MDX-NET variants. UVR is used heavily by hobbyist producers.
- **audio-separator**: a Python CLI by Andrew Beveridge that exposes UVR's models as a pip-installable command, used by many automated stem-extraction pipelines.
- **RVC-Project and Mangio-RVC-Fork**: GitHub repositories that distribute RVC training scripts and pre-trained voice models. Thousands of community-trained voice models are hosted on Hugging Face.
- **Coqui TTS** and **NVIDIA NeMo**: TTS toolkits that also ship vocoders, voice conversion checkpoints and speech enhancement components.
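- **AudioCraft**: Meta's library that bundles EnCodec, AudioGen and MusicGen in one repository. The code is MIT-licensed; the model weights are released separately under a non-commercial CC BY-NC 4.0 licence.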
- **Resemble Enhance, DeepFilterNet, Demucs Denoiser**: pip-installable speech enhancement tools. DeepFilterNet and the Demucs-based Denoiser are designed to run in real time on CPU.
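
The "one command" workflow mentioned for `demucs` is easy to script. The sketch below only builds the command line and leaves execution to a separate helper, so it can be inspected without the package installed. It assumes the `demucs` CLI is on the PATH and that the `-n` (model name), `-o` (output directory) and `--two-stems` options behave as in recent releases; check `demucs --help` for the installed version. The file name and helper names are placeholders.

```python
import shlex
import subprocess
from typing import List, Optional


def build_demucs_command(
    track: str,
    out_dir: str = "separated",
    model: str = "htdemucs",
    two_stems: Optional[str] = None,
) -> List[str]:
    """Build the argv list for a demucs separation run."""
    cmd = ["demucs", "-n", model, "-o", out_dir]
    if two_stems is not None:
        # Produce only the named stem plus everything else, e.g. vocals and accompaniment.
        cmd += ["--two-stems", two_stems]
    cmd.append(track)
    return cmd


def separate(track: str, **kwargs) -> subprocess.CompletedProcess:
    """Run demucs on a track. Requires `pip install demucs` and an existing file."""
    return subprocess.run(build_demucs_command(track, **kwargs), check=True)


cmd = build_demucs_command("song.mp3", two_stems="vocals")
print(shlex.join(cmd))
# prints "demucs -n htdemucs -o separated --two-stems vocals song.mp3"
```

Calling `separate("song.mp3", two_stems="vocals")` would then write the stems under the chosen output directory.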

## Applications and concerns

Audio-to-audio models have moved out of the lab and into daily life. Podcasters clean dialogue with Adobe Enhance Speech. Video conferencing platforms remove keyboard clatter with Krisp, RTX Voice or built-in Teams suppression. DJs and producers extract vocals with Demucs or BS-RoFormer to make remixes that would have required a multitrack session a decade ago. Hearing aid manufacturers, including GN ReSound and Starkey, ship neural speech enhancement on-device. Neural codecs such as Lyra v2 are deployed in calling products to keep speech intelligible at low bitrates.

The same capabilities also raise concerns. **Voice cloning** systems can generate convincing impersonations of public figures from a few seconds of reference audio, and several incidents in 2023 and 2024 involved RVC and ElevenLabs clones used in scam calls and political robocalls. In February 2024 the US Federal Communications Commission ruled that AI-generated voices in robocalls are illegal under the Telephone Consumer Protection Act, shortly after a fake Joe Biden voice clone was used to discourage New Hampshire voters from casting ballots.[30] Meta cited similar concerns when withholding Voicebox weights.[21]

**Detection** of audio deepfakes is an active research area. The ASVspoof series of challenges, run since 2015, evaluates anti-spoofing systems. Commercial detectors, including those from Pindrop and Resemble AI (Resemble Detect), target call centres and journalism.

Copyright is contested for music separation: extracting and re-using stems from commercial recordings is technically possible but may infringe the underlying composition and master rights. Spleeter's release in 2019 prompted debate among labels and platforms, and the question of stem extraction has resurfaced with each new state-of-the-art separator.

Finally, the audio-to-audio toolkit has begun to merge with **speech-to-speech** language models such as Moshi (Kyutai, 2024)[27] and GPT-4o voice mode (OpenAI, 2024), where a single end-to-end transformer takes a user's microphone audio and produces a spoken reply, dropping the traditional speech-to-text plus LLM plus text-to-speech pipeline. The neural codecs described above are the tokenizers that make this possible.

## See also

- [Audio Models](/wiki/audio_models)
- [Text-to-Speech](/wiki/text_to_speech)
- [Speech Recognition](/wiki/speech_recognition)
- [Voice Conversion](/wiki/voice_conversion)
- [Voice Cloning](/wiki/voice_cloning)
- [Source Separation](/wiki/source_separation)
- [Vocoder](/wiki/vocoder)
- [Neural Audio Codec](/wiki/neural_audio_codec)
- [Demucs](/wiki/demucs)
- [HiFi-GAN](/wiki/hifi_gan)
- [WaveNet](/wiki/wavenet)
- [VITS](/wiki/vits)
- [Moshi](/wiki/moshi)
- [SeamlessM4T](/wiki/seamlessm4t)
- [ElevenLabs](/wiki/elevenlabs)
- [Stability AI](/wiki/stability_ai)

## References

1. Pascual, S., Bonafonte, A., Serra, J. (2017). *SEGAN: Speech Enhancement Generative Adversarial Network*. Interspeech 2017. arXiv:1703.09452.
2. Kameoka, H., Kaneko, T., Tanaka, K., Hojo, N. (2018). *StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks*. IEEE SLT 2018.
3. Qian, K., Zhang, Y., Chang, S., Yang, X., Hasegawa-Johnson, M. (2019). *AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss*. ICML 2019.
4. Hennequin, R., Khlif, A., Voituret, F., Moussallam, M. (2020). *Spleeter: a fast and efficient music source separation tool with pre-trained models*. Journal of Open Source Software.
5. Defossez, A., Usunier, N., Bottou, L., Bach, F. (2019). *Music Source Separation in the Waveform Domain*. arXiv:1911.13254.
6. Defossez, A. (2021). *Hybrid Spectrogram and Waveform Source Separation*. ISMIR 2021 Music Demixing Workshop.
7. Rouard, S., Massa, F., Defossez, A. (2023). *Hybrid Transformers for Music Source Separation*. ICASSP 2023. arXiv:2211.08553.
8. Stoter, F.-R., Uhlich, S., Liutkus, A., Mitsufuji, Y. (2019). *Open-Unmix - A Reference Implementation for Music Source Separation*. Journal of Open Source Software.
9. Lu, W.-T., Wang, J.-C., Kong, Q., Hung, Y.-N. (2023). *Music Source Separation with Band-Split RoPE Transformer*. arXiv:2309.02612.
10. Hu, Y. et al. (2020). *DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement*. Interspeech 2020.
11. Schroter, H., Escalante-B., A. N., Rosenkranz, T., Maier, A. (2022). *DeepFilterNet: A Low Complexity Speech Enhancement Framework for Full-Band Audio*. ICASSP 2022.
12. Defossez, A., Synnaeve, G., Adi, Y. (2020). *Real Time Speech Enhancement in the Waveform Domain*. Interspeech 2020.
13. van den Oord, A. et al. (2016). *WaveNet: A Generative Model for Raw Audio*. arXiv:1609.03499.
14. Kong, J., Kim, J., Bae, J. (2020). *HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis*. NeurIPS 2020.
15. Lee, S.-g., Ping, W., Ginsburg, B., Catanzaro, B., Yoon, S. (2023). *BigVGAN: A Universal Neural Vocoder with Large-Scale Training*. ICLR 2023.
16. Siuzdak, H. (2023). *Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis*. arXiv:2306.00814.
17. Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., Tagliasacchi, M. (2021). *SoundStream: An End-to-End Neural Audio Codec*. IEEE/ACM TASLP.
18. Defossez, A., Copet, J., Synnaeve, G., Adi, Y. (2022). *High Fidelity Neural Audio Compression*. arXiv:2210.13438 (EnCodec).
19. Kumar, R., Seetharaman, P., Luebs, A., Kumar, I., Kumar, K. (2023). *High-Fidelity Audio Compression with Improved RVQGAN*. NeurIPS 2023 (DAC).
20. Wang, C. et al. (2023). *Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers*. arXiv:2301.02111 (VALL-E).
21. Le, M. et al. (2023). *Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale*. Meta AI research paper.
22. Borsos, Z. et al. (2022). *AudioLM: a Language Modeling Approach to Audio Generation*. arXiv:2209.03143.
23. Copet, J. et al. (2023). *Simple and Controllable Music Generation*. NeurIPS 2023 (MusicGen).
24. Seamless Communication, Barrault, L. et al. (2023). *SeamlessM4T: Massively Multilingual and Multimodal Machine Translation*. Meta AI publication. arXiv:2308.11596.
25. Ao, J. et al. (2022). *SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing*. ACL 2022.
26. Qin, Z. et al. (2023). *OpenVoice: Versatile Instant Voice Cloning*. MyShell.ai research paper.
27. Kyutai. (2024). *Moshi: a speech-text foundation model for real-time dialogue* (introduces the Mimi codec).
28. Hugging Face. *Audio-to-Audio task page*. https://huggingface.co/tasks/audio-to-audio
29. Microsoft. *Deep Noise Suppression (DNS) Challenge series*. https://www.microsoft.com/en-us/research/academic-program/deep-noise-suppression-challenge-icassp-2023/
30. Federal Communications Commission. (February 8, 2024). *Declaratory Ruling: AI-generated voices in robocalls are illegal under the TCPA*.

[^convtasnet]: Luo, Y., Mesgarani, N. (2019). *Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation*. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27(8). arXiv:1809.07454. https://arxiv.org/abs/1809.07454 Accessed 2026-05-31.

[^dprnn]: Luo, Y., Chen, Z., Yoshioka, T. (2020). *Dual-path RNN: Efficient Long Sequence Modeling for Time-domain Single-channel Speech Separation*. ICASSP 2020. arXiv:1910.06379. https://arxiv.org/abs/1910.06379 Accessed 2026-05-31.

[^dptnet]: Chen, J., Mao, Q., Liu, D. (2020). *Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation*. Interspeech 2020. arXiv:2007.13975. https://arxiv.org/abs/2007.13975 Accessed 2026-05-31.

[^sepformer]: Subakan, C., Ravanelli, M., Cornell, S., Bronzi, M., Zhong, J. (2021). *Attention Is All You Need In Speech Separation*. ICASSP 2021. arXiv:2010.13154. https://arxiv.org/abs/2010.13154 Accessed 2026-05-31. Pretrained model: https://huggingface.co/speechbrain/sepformer-wsj02mix

[^sisdr]: Le Roux, J., Wisdom, S., Erdogan, H., Hershey, J. R. (2019). *SDR: Half-baked or Well Done?*. ICASSP 2019. arXiv:1811.02508. https://arxiv.org/abs/1811.02508 Accessed 2026-05-31.

[^bsrnn]: Luo, Y., Yu, J. (2023). *Music Source Separation with Band-split RNN*. IEEE/ACM Transactions on Audio, Speech, and Language Processing. arXiv:2209.15174. https://arxiv.org/abs/2209.15174 Accessed 2026-05-31.

[^bsroformer]: Lu, W.-T., Wang, J.-C., Kong, Q., Hung, Y.-N. (2023). *Music Source Separation with Band-Split RoPE Transformer*. arXiv:2309.02612. https://arxiv.org/abs/2309.02612 Accessed 2026-05-31.

[^rnnoise]: Valin, J.-M. (2018). *A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement*. IEEE MMSP 2018. arXiv:1709.08243. https://arxiv.org/abs/1709.08243 Accessed 2026-05-31. Source: https://github.com/xiph/rnnoise

[^fullsubnet]: Hao, X., Su, X., Horaud, R., Li, X. (2021). *FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement*. ICASSP 2021. arXiv:2010.15508. https://arxiv.org/abs/2010.15508 Accessed 2026-05-31.

[^diffsvc]: Liu, S., Cao, Y., Su, D., Meng, H. (2021). *DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion*. ASRU 2021. arXiv:2105.13871. https://arxiv.org/abs/2105.13871 Accessed 2026-05-31.

[^knnvc]: Baas, M., van Niekerk, B., Kamper, H. (2023). *Voice Conversion With Just Nearest Neighbors*. Interspeech 2023. arXiv:2305.18975. https://arxiv.org/abs/2305.18975 Accessed 2026-05-31. Code and samples: https://bshall.github.io/knn-vc/

[^seedvc]: Liu, S. (2024). *Zero-shot Voice Conversion with Diffusion Transformers*. arXiv:2411.09943. https://arxiv.org/abs/2411.09943 Accessed 2026-05-31. Code: https://github.com/Plachtaa/seed-vc