Audio-to-Audio Models
Last reviewed
May 13, 2026
Sources
30 citations
Review status
Source-backed
Revision
v2 · 5,125 words
See also: Audio Models and Tasks
Audio-to-audio models are machine learning systems that take an audio waveform as input and produce a different audio waveform as output. The category covers voice conversion, speech enhancement (denoising and dereverberation), music source separation, vocoders that turn spectrograms into waveforms, and neural audio codecs that compress and reconstruct sound. They sit alongside text-to-speech and speech-to-text in the broader family of audio models, but they are distinguished by the fact that both the input and the output live in the audio domain.
Most of the modern systems share a common pipeline: an encoder that turns the input waveform or spectrogram into a learned representation, a transformation network that manipulates that representation (separating sources, removing noise, swapping speaker identity, predicting clean targets), and a decoder or vocoder that synthesizes the new waveform. Since around 2019 the field has moved away from spectrogram masking with classical signal processing toward fully neural approaches, often built around diffusion models, transformer architectures, or generative adversarial networks (GANs). The same underlying components show up across the subfields: a HiFi-GAN vocoder can be the back end of a voice cloning pipeline, a music separator, or a neural codec.
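The encoder, transformation and decoder stages can be illustrated with a deliberately simple, non-neural sketch. It assumes a short-time Fourier transform as the "encoder", a crude magnitude gate as the "transformation" (a trained network would predict this mask in a real system), and a weighted overlap-add inverse STFT as the "decoder". The function names, the window and hop sizes, and the gate threshold are all illustrative choices, not taken from any particular model.

```python
import numpy as np

N_FFT, HOP = 512, 128
WINDOW = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window

def encode(wave):
    """Encoder stand-in: waveform -> complex STFT of shape (frames, bins)."""
    padded = np.pad(wave, (N_FFT, N_FFT))
    n_frames = 1 + (len(padded) - N_FFT) // HOP
    frames = np.stack([padded[i * HOP:i * HOP + N_FFT] * WINDOW
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def transform(spec, threshold=0.1):
    """Transformation stand-in: keep only bins above a relative magnitude
    threshold. A learned model would predict the mask instead."""
    mag = np.abs(spec)
    mask = (mag > threshold * mag.max()).astype(float)
    return spec * mask

def decode(spec, length):
    """Decoder stand-in: weighted overlap-add inverse STFT."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    out = np.zeros(N_FFT + HOP * (len(spec) - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += WINDOW ** 2
    out /= np.maximum(norm, 1e-8)      # undo the squared-window weighting
    return out[N_FFT:N_FFT + length]   # drop the padding added in encode()

# Toy input: a 440 Hz tone buried in white noise.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * np.random.default_rng(0).standard_normal(len(t))
enhanced = decode(transform(encode(noisy)), len(noisy))
```

Swapping `transform` for a separator, a denoiser or a speaker-conversion network changes the task while the surrounding encode and decode steps keep the same shape, which is the sense in which components are shared across subfields.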
The phrase "audio-to-audio" appears on Hugging Face as one of the official task tags. Models filed under it on the Hub include speech enhancement, voice conversion, source separation, target speaker extraction, and bandwidth extension. The taxonomy used in this article groups the field into the following tasks.
| Task | Input | Output | Typical use |
|---|---|---|---|
| Voice conversion | Speech from speaker A | Same words spoken by speaker B | Voice cloning, dubbing, anonymization |
| Speech enhancement | Noisy or reverberant speech | Clean speech | Conferencing, podcasts, hearing aids |
| Source separation | Mixed audio | Individual stems | Karaoke, remix, music production |
| Vocoder | Spectrogram or features | Waveform | Back end of text-to-speech |
| Neural audio codec | Waveform | Compressed tokens, then waveform | Low-bitrate audio, generative audio backbones |
| Speech-to-speech translation | Speech in language A | Speech in language B | Real-time interpretation |
| Bandwidth extension | 8 kHz speech | 16 or 24 kHz speech | Upsampling old recordings |
| Target speaker extraction | Multi-talker audio | One speaker's voice | Cocktail-party problem |
The history of the field tracks the wider arc of deep learning audio research. Classical signal processing (Wiener filtering, spectral subtraction, ICA-based source separation) dominated until around 2015. Then convolutional neural networks operating on spectrograms took over, with U-Net architectures from medical imaging adapted for vocal separation. Generative models, starting with WaveNet in 2016, made waveform-level synthesis feasible. The transformer wave reached audio around 2020 and 2021, and by 2023 large generative models such as VALL-E and Voicebox were producing voice cloning that needed only a few seconds of reference audio.
Voice conversion (VC) systems take an utterance from one speaker and re-render it in the voice of another, ideally keeping linguistic content and prosody intact. Early VC used Gaussian mixture models or vector quantization on aligned parallel data. Modern systems learn disentangled representations of content and speaker identity, then recombine them.
StarGAN-VC was introduced by Hirokazu Kameoka and colleagues at NTT in 2018. It applies the StarGAN image translation framework to mel-cepstral features, enabling many-to-many voice conversion without parallel training data. A follow-up, StarGAN-VC2, improved naturalness in 2019.
AutoVC was published by Kaizhi Qian and collaborators at MIT and IBM in 2019. The architecture is a content encoder, a speaker encoder, and a decoder, with an information bottleneck on the content path that forces the model to discard speaker identity. AutoVC was one of the first systems to achieve plausible zero-shot voice conversion to unseen target speakers.
SoftVC VITS Singing Voice Conversion (So-VITS-SVC) is an open-source project that combines a SoftVC content encoder with the VITS end-to-end TTS model. The repository was released on GitHub by the user svc-develop-team and went through several iterations between 2022 and 2023. So-VITS-SVC 4.0 and 4.1 became the de facto standard for fan-made covers in which one singer's voice is mapped onto another's recording.
Retrieval-based Voice Conversion (RVC) is a sibling project that became the dominant voice cloning tool on social media in 2023. RVC uses a HuBERT-style content encoder, an NSF-HiFiGAN vocoder, and a feature retrieval step that pulls the closest matching frames from a target speaker's voice database to suppress timbre leakage from the source speaker. The original RVC repository, RVC-Project, gathered tens of thousands of GitHub stars and powered viral clips of Joe Biden, Donald Trump and various musicians appearing to sing songs they never recorded.
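The retrieval step can be sketched as a nearest-neighbour lookup followed by a blend. The version below is a minimal illustration rather than RVC's actual implementation: it uses brute-force search in NumPy where a real system would typically use an approximate-nearest-neighbour index, and the function name, the inverse-distance weighting, and the `k` and `blend` parameters are assumptions made for the example.

```python
import numpy as np

def retrieve_and_blend(source_feats, target_bank, k=4, blend=0.75):
    """Replace each source content frame with a mix of its k nearest
    frames from the target speaker's feature bank.

    source_feats: (n_frames, dim) content features of the input speech
    target_bank:  (n_bank, dim) features extracted from the target voice
    blend:        1.0 uses only retrieved frames, 0.0 only source frames
    """
    # Squared Euclidean distance between every source and bank frame.
    diff = source_feats[:, None, :] - target_bank[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)                    # (n_frames, n_bank)
    nearest = np.argsort(dist2, axis=1)[:, :k]          # (n_frames, k)
    near_dist2 = np.take_along_axis(dist2, nearest, axis=1)
    # Closer bank frames get larger weights (hypothetical weighting).
    weights = 1.0 / (near_dist2 + 1e-8)
    weights /= weights.sum(axis=1, keepdims=True)
    retrieved = (target_bank[nearest] * weights[..., None]).sum(axis=1)
    return blend * retrieved + (1.0 - blend) * source_feats
```

Raising `blend` pulls the features further toward the target speaker's bank, which is the mechanism the paragraph above describes for suppressing source timbre; lowering it keeps more of the source articulation.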
NSF-HiFiGAN is the vocoder used by both So-VITS-SVC and RVC. It is a HiFi-GAN variant with a Neural Source-Filter front end that takes fundamental frequency (F0) as an input, which helps preserve pitch through voice conversion.
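To show how an F0 contour can drive a source signal, here is a simplified sketch of a sine excitation of the kind used in source-filter vocoders. It assumes the F0 track has already been upsampled to one value per audio sample, treats `f0 == 0` as unvoiced, and uses an illustrative noise level; the real NSF source module is more elaborate, so this should be read as an approximation of the idea only.

```python
import numpy as np

def sine_excitation(f0, sample_rate, noise_std=0.003, seed=0):
    """Build an excitation signal from a per-sample F0 contour (Hz).

    Voiced samples (f0 > 0) get a sine whose instantaneous frequency
    follows f0; every sample also gets a little Gaussian noise.
    """
    f0 = np.asarray(f0, dtype=float)
    # Integrate frequency to phase so pitch changes stay continuous.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
    noise = np.random.default_rng(seed).standard_normal(len(f0))
    voiced = f0 > 0
    return np.where(voiced, np.sin(phase), 0.0) + noise_std * noise
```

Because the phase is accumulated rather than recomputed per frame, the pitch of the excitation tracks the input contour exactly, which is why feeding F0 explicitly helps a vocoder keep the melody of a sung line.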
DiffSVC by Songxiang Liu and colleagues (2021) applied diffusion models to singing voice conversion. FastSVC by Shijun Wang and Yi Zhao introduced a non-autoregressive approach for faster inference.
OpenVoice was released by MyShell.ai in December 2023. It separates tone color cloning, which captures speaker timbre from a short reference clip, from style control, which handles emotion, accent and pacing. OpenVoice was published with code on GitHub and an accompanying paper by Zengyi Qin and colleagues. OpenVoice v2 followed in April 2024 with native multilingual support for English, Spanish, French, Chinese, Japanese and Korean.
VALL-E was announced by Microsoft Research in January 2023, in the paper Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers by Chengyi Wang and colleagues. The model is a transformer language model trained on tokens from the EnCodec neural codec; given a 3-second voice prompt it can synthesize speech in that voice with reasonable identity preservation. VALL-E was among the first systems to show that, for text-to-speech, token-level language modelling over a neural codec could replace the traditional acoustic-model-plus-vocoder pipeline. VALL-E X, also released by Microsoft in 2023, extended the model to cross-lingual cloning, allowing an English speaker's voice to be cloned for Mandarin output.
Voicebox was published by Meta AI in June 2023. The paper, by Matthew Le and colleagues, describes a non-autoregressive flow-matching model trained on 60,000 hours of multilingual speech that supports text-to-speech, noise removal, content editing inside a recording and zero-shot voice cloning. Meta did not release Voicebox weights or a demo product, citing potential misuse for impersonation.
| Model | Year | Group | Approach | Open weights |
|---|---|---|---|---|
| StarGAN-VC | 2018 | NTT | StarGAN on mel-cepstral features | Research code |
| AutoVC | 2019 | MIT, IBM | Bottleneck autoencoder | Research code |
| So-VITS-SVC | 2022 to 2023 | Open source | SoftVC content plus VITS | Yes |
| RVC | 2023 | Open source | HuBERT plus NSF-HiFiGAN with retrieval | Yes |
| DiffSVC | 2021 | Tencent | Diffusion model | Research code |
| OpenVoice v1 | 2023 | MyShell | Tone and style decoupling | Yes |
| OpenVoice v2 | 2024 | MyShell | Multilingual extension | Yes |
| VALL-E | 2023 | Microsoft | Codec language model | No |
| VALL-E X | 2023 | Microsoft | Cross-lingual codec LM | No |
| Voicebox | 2023 | Meta | Flow matching | No |
Speech enhancement covers any task in which the goal is to recover clean speech from a degraded signal. The main sub-tasks are denoising (removing background noise), dereverberation (removing room reverb), declipping, packet loss concealment and bandwidth extension. The field is older than deep learning, with Yariv Ephraim and David Malah's 1984 MMSE-STSA estimator still cited in modern papers, but the current state of the art is dominated by neural networks.
SEGAN by Santiago Pascual, Antonio Bonafonte and Joan Serra at the Universitat Politecnica de Catalunya, presented at Interspeech 2017, was one of the first speech enhancement systems to operate directly on raw waveforms with a generative adversarial network. The generator is a fully convolutional encoder-decoder; the discriminator decides whether a waveform looks clean or denoised.
Deep Complex Convolution Recurrent Network (DCCRN) by Yanxin Hu and colleagues (Interspeech 2020) won the first round of the Deep Noise Suppression Challenge organized by Microsoft. It uses complex-valued convolutions to handle phase information in the short-time Fourier transform directly, rather than processing magnitude and phase separately. DCUNet is a related complex U-Net architecture from 2019.
Noise Suppression Network (NSNet) is Microsoft Research's noise suppressor, used in Microsoft Teams since 2020. It is a recurrent neural network that operates on log-mel features and outputs a suppression gain per time-frequency bin. The team behind NSNet, including Sebastian Braun and Hannes Gamper, also organizes the annual Deep Noise Suppression Challenge that has driven competitive progress on the task.
DeepFilterNet by Hendrik Schroter, Alberto Escalante and Andreas Maier at Friedrich-Alexander-Universitat Erlangen-Nurnberg was introduced in 2022, followed by DeepFilterNet 2 later that year and DeepFilterNet 3 in 2023. It operates at 48 kHz, runs in real time on a single CPU core, and is permissively licensed. The architecture predicts a per-frequency deep filter (a short complex-valued filter applied across recent frames) rather than a single suppression mask, which preserves transients and consonants better than mask-only methods.
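The difference between a mask and a deep filter can be made concrete with a small sketch. It assumes the filter coefficients have already been predicted by some network and simply applies them: each output bin is a complex-weighted sum of the current and previous frames at the same frequency. The function name and the array layout are illustrative, and the sketch uses no look-ahead frames, which a real system may include.

```python
import numpy as np

def apply_deep_filter(spec, coefs):
    """Apply a per-bin multi-frame complex filter to a spectrogram.

    spec:  (T, F) complex STFT of the noisy signal
    coefs: (T, N, F) complex filter taps; tap i weights frame t - i
    Returns Y[t, f] = sum_i coefs[t, i, f] * spec[t - i, f].
    """
    n_frames, n_bins = spec.shape
    n_taps = coefs.shape[1]
    # Zero-pad the past so the first frames have something to look back at.
    history = np.zeros((n_taps - 1, n_bins), dtype=spec.dtype)
    padded = np.concatenate([history, spec], axis=0)
    out = np.zeros_like(spec)
    for i in range(n_taps):
        start = n_taps - 1 - i           # padded index of frame t - i at t = 0
        out += coefs[:, i, :] * padded[start:start + n_frames]
    return out
```

With a single tap (`N = 1`) this reduces to an ordinary complex mask; the extra taps are what let the filter reconstruct a bin from its recent history instead of merely scaling it.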
Resemble Enhance was open-sourced by Resemble.ai under an MIT license in late 2023, with the source code published on GitHub. It combines a noise suppression stage with a CFM (conditional flow matching) generative model that re-synthesizes a clean speech waveform, allowing the system to fix not only noise but also bandwidth limitation and other distortions.
Adobe Enhance Speech, originally announced as Project Shasta in 2022 and released as a free web tool through Adobe Podcast in 2023, applies a proprietary neural model to clean up dialogue recorded in untreated rooms. It quickly became a default tool among podcasters and was integrated into Adobe Premiere Pro as Enhance Speech.
ElevenLabs Voice Isolator is a hosted product released by ElevenLabs in 2024 that strips background music and noise from voice recordings. It targets the same use case as Adobe Enhance Speech.
Krisp is an Armenian company founded in 2017 that ships a desktop application and SDK for real-time noise suppression in video calls. Its model runs locally on the user's CPU.
NVIDIA RTX Voice was launched in April 2020 during the early days of the COVID-19 lockdowns as a noise suppression utility that ran on RTX GPUs using tensor cores. It evolved into NVIDIA Broadcast, a free Windows application available on RTX cards that includes noise removal, room echo removal, virtual background, eye contact correction and other camera and microphone effects.
Demucs, the music separation model from Meta, was adapted for speech enhancement in a paper titled Real Time Speech Enhancement in the Waveform Domain by Alexandre Defossez, Gabriel Synnaeve and Yossi Adi (Interspeech 2020). The variant, sometimes called Denoiser or Demucs Denoiser, was for a time a popular open-source baseline before being overtaken by DeepFilterNet and other dedicated speech models.
| System | Year | Type | License | Notes |
|---|---|---|---|---|
| SEGAN | 2017 | GAN, waveform | Open source | First waveform GAN denoiser |
| DCCRN | 2020 | Complex CRN | Research | DNS 2020 winner |
| NSNet | 2020 | RNN | Proprietary | Used in Microsoft Teams |
| Demucs Denoiser | 2020 | Waveform U-Net | Open source | Real-time CPU |
| DeepFilterNet 3 | 2023 | Deep filter | Open source | 48 kHz, real time CPU |
| Resemble Enhance | 2023 | Denoise plus CFM | Open source | Quality restoration |
| Adobe Enhance Speech | 2023 | Undisclosed | Proprietary, free web tool | Podcast cleanup |
| ElevenLabs Voice Isolator | 2024 | Undisclosed | Proprietary, paid API | Voice isolation |
| Krisp | 2017 onward | RNN | Proprietary | Real-time call filter |
| NVIDIA Broadcast | 2020 onward | Undisclosed | Proprietary, free with RTX | GPU accelerated |
Source separation breaks a mixed audio recording into its component sources. The most studied flavour is music separation into vocals, drums, bass and other, evaluated on the MUSDB18 benchmark introduced by Zafar Rafii and colleagues in 2017. Speech separation, where the goal is to split overlapping speakers, is the related problem behind the cocktail-party effect.
Spleeter was open-sourced by the research team at Deezer in November 2019. It uses a U-Net trained on the company's internal catalogue and ships pretrained two-stem (vocals and accompaniment), four-stem (vocals, drums, bass, other) and five-stem (adds piano) models. Despite its modest size, Spleeter became the most widely used music separator for several years because it was easy to install, fast on CPU and good enough for karaoke and DJ use. The release paper, by Romain Hennequin, Anis Khlif, Felix Voituret and Manuel Moussallam, is one of the most cited audio papers of the 2019 to 2020 period.
Demucs by Alexandre Defossez at Meta AI was published in 2019 and revised through several versions. The original Demucs was a waveform U-Net with a bidirectional LSTM bottleneck. Hybrid Demucs (2021) combined waveform-domain and spectrogram-domain branches that share a common bottleneck, hitting state-of-the-art performance on MUSDB18. Hybrid Transformer Demucs (HT Demucs), released in 2022 in the paper Hybrid Transformers for Music Source Separation, replaced that bottleneck with a transformer that applies self-attention within each branch and cross-attention between them, and is the version distributed in the demucs Python package today.
Open-Unmix (UMX) by Fabian-Robert Stoter, Stefan Uhlich, Antoine Liutkus and Yuki Mitsufuji was released in 2019 by the Sigsep collective, a group of academic and industry researchers focused on reproducible source separation. It is a bidirectional LSTM trained on spectrograms and serves as a reference implementation in many papers.
MossFormer by Shengkui Zhao and Bin Ma at Alibaba (ICASSP 2023) is a state-of-the-art speech separation model that combines a convolutional front end with a transformer bottleneck. MossFormer2, published in 2024, adds a recurrent module and pushes performance further on the WSJ0-2mix benchmark.
The Music Demixing Challenge (MDX), run as part of the Sound Demixing Challenge in 2021 and 2023, drove a generation of new separators. MDX-Net by Kuielab combined waveform and spectrogram branches with knowledge distillation and won the leaderboard in 2021. By 2023 the strongest entries (MDX23) used larger transformer backbones and ensembles.
BS-RoFormer (Band-Split Rotary-position Embedded Transformer) by Ju-Chiang Wang and colleagues at ByteDance, published in 2023, is widely regarded as the strongest open music source separation model as of 2024. It splits the spectrogram into frequency bands, processes each band with a transformer using rotary positional embeddings (RoPE), and then aggregates results. The model is the backbone of many of the high-quality vocal stems posted on community sites and is supported by the Ultimate Vocal Remover (UVR) GUI.
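The band-split step can be sketched in a few lines. The example below assumes a complex spectrogram and a hand-picked list of band widths, and it uses random matrices where the real model would have learned per-band projection layers. It covers only the splitting and projection, not the rotary-embedding transformer that follows, and every name in it is hypothetical.

```python
import numpy as np

def band_split(spec, band_widths, dim, rng):
    """Turn a (T, F) complex spectrogram into (T, n_bands, dim) tokens.

    Each band's real and imaginary parts are concatenated and projected
    to a shared feature size so that bands of different widths can be
    processed by the same transformer.
    """
    n_frames, n_bins = spec.shape
    if sum(band_widths) != n_bins:
        raise ValueError("band widths must cover every frequency bin")
    tokens, start = [], 0
    for width in band_widths:
        band = spec[:, start:start + width]
        feats = np.concatenate([band.real, band.imag], axis=1)  # (T, 2*width)
        # Stand-in for a learned per-band linear layer.
        proj = rng.standard_normal((2 * width, dim)) / np.sqrt(2 * width)
        tokens.append(feats @ proj)
        start += width
    return np.stack(tokens, axis=1)
```

Narrow bands at low frequencies and wider ones higher up are a common design choice for this kind of split, since musical detail is denser at the low end; the widths used in any given model are a hyperparameter.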
| Model | Year | Group | Strengths | Open weights |
|---|---|---|---|---|
| Spleeter | 2019 | Deezer | Fast, easy install | Yes |
| Open-Unmix | 2019 | Sigsep | Reproducible baseline | Yes |
| Demucs v3 | 2021 | Meta | Strong baseline | Yes |
| HT Demucs | 2022 | Meta | Hybrid transformer | Yes |
| MossFormer 2 | 2024 | Alibaba | Speech separation SOTA | Yes |
| MDX-Net | 2021 | Kuielab | Music challenge winner | Yes |
| BS-RoFormer | 2023 | ByteDance | 2024 music SOTA | Yes |
A vocoder in the modern sense is a model that turns an intermediate representation, typically a mel-spectrogram, back into a waveform. Neural vocoders replaced the older Griffin-Lim algorithm and source-filter vocoders in text-to-speech pipelines because they produce much higher fidelity.
WaveNet by Aaron van den Oord and colleagues at DeepMind, published in September 2016, was the first neural vocoder to match or beat the perceptual quality of concatenative speech synthesis. The model uses stacked dilated causal convolutions to predict the next audio sample, conditioned on linguistic features. WaveNet powered Google Assistant speech from 2017.
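The effect of stacking dilated causal convolutions can be checked with a short sketch. It keeps only the linear convolution, leaving out the gated activations, residual connections and conditioning of the real model, and it assumes a kernel size of 2 with dilations that double at each layer. Both helper names are illustrative.

```python
import numpy as np

def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_j weights[j] * x[t - j * dilation], with zeros before t = 0.

    The output at time t never depends on samples after t (causality).
    """
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for j, w in enumerate(weights):
        shift = j * dilation
        if shift < len(x):
            y[shift:] += w * x[:len(x) - shift]
    return y

def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Ten layers with dilations 1, 2, 4, ..., 512 see 1024 past samples.
stack = [2 ** i for i in range(10)]
samples_seen = receptive_field(stack)
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth while the parameter count grows only linearly, which is what makes sample-level prediction over long contexts tractable.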
Parallel WaveNet (van den Oord et al., 2017) used probability density distillation to produce a non-autoregressive student that ran 1,000 times faster than the original, making large-scale deployment feasible. WaveRNN by Nal Kalchbrenner and colleagues (2018) achieved similar quality with a recurrent architecture optimised for CPU inference.
MelGAN by Kundan Kumar and colleagues at the Universite de Montreal (NeurIPS 2019) was the first GAN-based vocoder to reach competitive quality. HiFi-GAN by Jungil Kong, Jaehyeon Kim and Jaekyoung Bae at Kakao Enterprise (NeurIPS 2020) combined multi-scale and multi-period discriminators with a generator made of transposed convolutions and residual blocks. HiFi-GAN became the default vocoder for nearly every open TTS system released between 2021 and 2024 because it is small, fast and high quality.
iSTFTNet by Takuhiro Kaneko and colleagues at NTT (ICASSP 2022) replaced the final upsampling layers of HiFi-GAN with an inverse short-time Fourier transform, cutting inference time substantially.
BigVGAN by Sang-gil Lee and colleagues at NVIDIA (ICLR 2023) extended HiFi-GAN to 24 kHz and 44.1 kHz, scaling the generator to 112 million parameters and adding a periodic activation called Snake together with anti-aliased filtering. BigVGAN v2 was released in 2024 with improved training and is shipped through Hugging Face. It is the vocoder used by NVIDIA's NeMo TTS systems and several large generative audio projects.
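The Snake activation is usually written as snake(x) = x + (1/alpha) * sin^2(alpha * x). A minimal sketch follows; in a vocoder `alpha` would be a learned parameter, typically one per channel, whereas here it is a fixed scalar chosen for illustration.

```python
import numpy as np

def snake(x, alpha=1.0):
    """Snake activation: identity plus a periodic bump of frequency alpha."""
    x = np.asarray(x, dtype=float)
    return x + np.sin(alpha * x) ** 2 / alpha
```

The derivative is 1 + sin(2 * alpha * x), which is never negative, so the function stays monotonic while still adding a periodic component, a useful inductive bias for generating waveforms.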
Vocos by Hubert Siuzdak, published in 2023 (Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis), takes a different route. Instead of upsampling to time domain with transposed convolutions, Vocos predicts magnitude and phase in the STFT domain and uses an inverse STFT to reach the waveform, which is faster than HiFi-GAN at similar quality.
A neural audio codec is a model that compresses audio into a sequence of discrete tokens and reconstructs it. Codecs serve two purposes: low-bitrate transmission for telephony or storage, and tokenization for downstream generative models that treat audio like text.
SoundStream by Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund and Marco Tagliasacchi at Google was published in 2021. It is an end-to-end encoder, residual vector quantizer (RVQ) and decoder trained jointly with adversarial and reconstruction losses. SoundStream achieves 3 kbps speech at quality close to legacy 12 kbps codecs and was the first published neural codec to operate as a streaming-capable end-to-end system. It was followed in 2022 by Lyra v2 from Google, a productized version used in Google Meet under poor network conditions.
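Residual vector quantization can be sketched independently of any particular codec. The example assumes the codebooks are already given (in a real codec they are learned jointly with the encoder and decoder) and quantizes one latent vector at a time; the function names are illustrative.

```python
import numpy as np

def rvq_encode(vector, codebooks):
    """Quantize a vector with a cascade of codebooks.

    Each stage picks the codeword nearest to what the previous stages
    left unexplained, then passes the remaining residual on.
    Returns one index per codebook.
    """
    residual = np.asarray(vector, dtype=float).copy()
    indices = []
    for codebook in codebooks:              # codebook shape: (K, dim)
        distances = ((codebook - residual) ** 2).sum(axis=1)
        best = int(np.argmin(distances))
        indices.append(best)
        residual = residual - codebook[best]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from each stage."""
    return sum(codebook[i] for codebook, i in zip(codebooks, indices))
```

Decoding with only the first few indices gives a coarser reconstruction, which is how a single RVQ model can serve several bitrates: dropping later codebooks lowers the bitrate at the cost of fidelity.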
EnCodec by Alexandre Defossez, Jade Copet, Gabriel Synnaeve and Yossi Adi at Meta AI was released in October 2022 in the paper High Fidelity Neural Audio Compression. It builds on the SoundStream architecture, adds a small transformer language model on top of the codes for further entropy coding, and is trained at multiple bitrates between 1.5 and 24 kbps. EnCodec was released open source under the MIT licence and quickly became the tokenizer of choice for downstream audio language models, including MusicGen (Meta), VALL-E (Microsoft) and AudioGen.
Descript Audio Codec (DAC) by Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar and Kundan Kumar at Descript was published in 2023 (High-Fidelity Audio Compression with Improved RVQGAN). DAC pushes neural codec quality further by improving the RVQ training, adding a multi-scale STFT discriminator, and using snake activations. At 8 kbps it produces 44.1 kHz audio that is widely judged to be near-transparent on speech and music. DAC is also released open source and is used as the audio tokenizer in many late-2023 and 2024 audio language models.
Mimi is the streaming-capable neural codec released by Kyutai in 2024 as part of the Moshi full-duplex speech dialogue system. It encodes 24 kHz audio at 1.1 kbps with an architecture similar to EnCodec and DAC but with a stronger emphasis on low latency. It produces tokens at 12.5 Hz, so each token frame covers 80 milliseconds of audio and a downstream language model can consume and predict tokens one frame at a time. The release accompanying Moshi made Mimi the first neural codec specifically engineered for real-time conversational AI.
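The bitrates quoted for token-based codecs follow from simple arithmetic: frames per second, times codebooks per frame, times bits per codebook index. The sketch below works this out. The codebook counts and sizes are assumptions drawn from the respective model descriptions rather than from this article (8 codebooks of 2,048 entries for Mimi; 75 frames per second and 1,024-entry codebooks for 24 kHz EnCodec), and the helper names are hypothetical.

```python
import math

def codec_bitrate_bps(frame_rate_hz, n_codebooks, codebook_size):
    """Bits per second for an RVQ codec with fixed-size codebooks."""
    bits_per_index = math.log2(codebook_size)
    return frame_rate_hz * n_codebooks * bits_per_index

def frame_duration_ms(frame_rate_hz):
    """Audio covered by one token frame, in milliseconds."""
    return 1000.0 / frame_rate_hz

# Assumed configurations (not stated in the article itself).
mimi_bps = codec_bitrate_bps(12.5, 8, 2048)          # 1100.0, i.e. 1.1 kbps
mimi_frame_ms = frame_duration_ms(12.5)              # 80.0
encodec_low_bps = codec_bitrate_bps(75, 2, 1024)     # 1500.0, i.e. 1.5 kbps
encodec_high_bps = codec_bitrate_bps(75, 32, 1024)   # 24000.0, i.e. 24 kbps
```

The same formula shows why a low frame rate matters for language modelling: at 12.5 Hz a model predicts far fewer token frames per second of audio than at 75 Hz, at the cost of each frame having to carry more information.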
| Codec | Year | Group | Bitrate | Notes |
|---|---|---|---|---|
| SoundStream | 2021 | Google | 3 to 18 kbps | First end-to-end neural codec |
| Lyra v2 | 2022 | Google | 3.2 to 9.2 kbps | Production deployment in Meet |
| EnCodec | 2022 | Meta | 1.5 to 24 kbps | MIT licence, MusicGen tokenizer |
| DAC | 2023 | Descript | 8 kbps at 44.1 kHz | Near-transparent quality |
| Mimi | 2024 | Kyutai | 1.1 kbps at 24 kHz | Streaming, used in Moshi |
A growing class of models cuts across the categories above by handling several audio-to-audio tasks with a single backbone, sometimes alongside text inputs or outputs.
SpeechT5 by Junyi Ao, Rui Wang and colleagues at Microsoft Research Asia, published at ACL 2022, is a unified encoder-decoder pre-trained on both speech and text. After fine-tuning, the same architecture can perform text-to-speech, automatic speech recognition, voice conversion and speech enhancement, all by changing the input and output streams. SpeechT5 is distributed through Hugging Face and is one of the first practical demonstrations of a shared backbone across audio tasks.
SeamlessM4T (Massively Multilingual and Multimodal Machine Translation) was released by Meta AI in August 2023. Trained on the SeamlessAlign dataset, the model performs text-to-text, text-to-speech, speech-to-text and speech-to-speech translation across nearly 100 languages, with about 36 source languages for speech-to-speech. The follow-up models, SeamlessExpressive and SeamlessStreaming (October 2023), added expressive prosody transfer (so the translated voice keeps the original speaker's emotion) and low-latency streaming, respectively. SeamlessM4T weights are released under a custom Meta licence.
AudioLM by Zalan Borsos and colleagues at Google, published in 2022, was the foundational paper that framed audio generation as language modelling over discrete neural codec tokens. It used a hierarchy of semantic tokens (from w2v-BERT) and acoustic tokens (from SoundStream). AudioLM itself can continue an audio prompt in the style of the prompt, producing convincing speech and piano continuations from a few seconds of input. The model is a research artefact (not open source) but its architecture inspired VALL-E, MusicGen, AudioGen and a long subsequent line of work.
AudioGen by Felix Kreuk and colleagues at Meta AI (2022) applied the same codec language model recipe to general environmental sounds, conditioned on text prompts.
Stable Audio Open by Stability AI, released in June 2024, is an open-weights text-to-audio diffusion model trained on the CC-licensed Free Music Archive and Freesound libraries. While it is primarily text-to-audio, the model can be conditioned on existing audio for style transfer and timbre matching, putting it on the boundary between text-to-audio and audio-to-audio.
Several large music generation models accept audio as a conditioning signal in addition to text, blurring the line with audio-to-audio.
The day-to-day open-source landscape for audio-to-audio models is built around a few hubs:
- The transformers and pyannote-audio libraries provide inference wrappers.
- The demucs package, maintained by Alexandre Defossez, ships Hybrid Transformer Demucs as a command-line tool that produces stems with one command. It is the standard separator for studio workflows.
Audio-to-audio models have moved out of the lab and into daily life. Podcasters clean dialogue with Adobe Enhance Speech. Video conferencing platforms remove keyboard clatter with Krisp, RTX Voice or built-in Teams suppression. DJs and producers extract vocals with Demucs or BS-RoFormer to make remixes that would have required a multitrack session a decade ago. Hearing aid manufacturers, including GN ReSound and Starkey, ship neural speech enhancement on-device. Streaming services such as Spotify use neural codecs internally for low-bitrate listening.
The same capabilities also raise concerns. Voice cloning systems can generate convincing impersonations of public figures from a few seconds of reference audio, and several incidents in 2023 and 2024 involved RVC and ElevenLabs clones used in scam calls and political robocalls. The US Federal Communications Commission banned AI-generated robocalls in February 2024 after a fake Joe Biden voice clone was used to discourage New Hampshire voters from casting ballots. Meta cited similar concerns when withholding Voicebox weights.
Detection of audio deepfakes is an active research area. The ASVspoof series of challenges, run since 2015, evaluates anti-spoofing systems. Commercial detectors, including Pindrop's and Resemble AI's Resemble Detect, target call centres and journalism.
Copyright is contested for music separation: extracting and re-using stems from commercial recordings is technically possible but may infringe the underlying composition and master rights. Spleeter's release in 2019 prompted debate among labels and platforms, and the question of stem extraction has resurfaced with each new state-of-the-art separator.
Finally, the audio-to-audio toolkit has begun to merge with speech-to-speech language models such as Moshi (Kyutai, 2024) and GPT-4o voice mode (OpenAI, 2024), where a single end-to-end transformer takes a user's microphone audio and produces a spoken reply, dropping the traditional speech-to-text plus LLM plus text-to-speech pipeline. The neural codecs described above are the tokenizers that make this possible.