Jukebox (OpenAI)

Updated Jun 27, 2026

Jukebox is a neural network for music generation developed by OpenAI that produces music, including rudimentary singing, as raw audio across a range of genres and artist styles. The system was introduced in the 2020 paper "Jukebox: A Generative Model for Music," whose authors describe it as "a model that generates music with singing in the raw audio domain" by using "a multi-scale VQ-VAE to compress it to discrete codes, and modeling those using autoregressive Transformers."^[3] Announced on April 30, 2020, Jukebox was notable for operating directly in the raw audio domain rather than generating symbolic note sequences, and for being able to condition its output on artist, genre, and lyrics. The paper, arXiv:2005.00341, was written by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever, and OpenAI released both the model weights and the code on GitHub.^[1]^[2]^[3]

At a glance:

Attribute	Detail
Developer	OpenAI
Announced	April 30, 2020
Paper	arXiv:2005.00341
Domain	Raw audio, 44.1 kHz
Architecture	Hierarchical VQ-VAE plus autoregressive Sparse Transformer priors
Top-level prior	~5 billion parameters
Upsampling priors	~1 billion parameters each
Conditioning	Artist, genre, and (unaligned) lyrics
Training data	1.2 million songs (600,000 in English)
Generation speed	~3 hours to fully sample 20 seconds on one V100 GPU
License	Noncommercial use

What is Jukebox?

Jukebox generates songs as continuous audio waveforms at a 44.1 kHz sample rate, the standard fidelity of CD audio, rather than as MIDI or other symbolic representations. According to OpenAI, the model "generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles."^[1] Generation can be steered by selecting an artist and genre to influence the musical and vocal style, and by supplying unaligned lyrics that the model attempts to sing.^[2]^[3]

The core difficulty Jukebox addresses is the extreme length of raw audio sequences. A four-minute song at 44.1 kHz contains more than 10 million samples per channel, far more than the contexts handled by contemporary sequence models. To make this tractable, Jukebox first compresses audio into much shorter sequences of discrete tokens using a vector-quantized variational autoencoder (VQ-VAE), then trains autoregressive Transformers to model those token sequences, and finally decodes the tokens back into audio.^[2]^[3]
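The scale of the problem can be checked with simple arithmetic. The sketch below counts the samples in a clip at the 44.1 kHz rate cited above and shows how many tokens remain after the 8x, 32x, and 128x compression factors described in the next section. The four-minute duration and the function names are assumptions chosen for illustration, not values or identifiers from the Jukebox code.

```python
import math

SAMPLE_RATE_HZ = 44_100  # sample rate cited in the article


def samples_per_channel(duration_seconds: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples in one channel of a clip."""
    return round(duration_seconds * sample_rate_hz)


def tokens_after_compression(num_samples: int, factor: int) -> int:
    """Sequence length once every `factor` samples are replaced by one token."""
    return math.ceil(num_samples / factor)


song = samples_per_channel(4 * 60)  # assumed four-minute song
print(f"{song:,} samples per channel")  # 10,584,000 samples per channel

for factor in (8, 32, 128):
    print(f"{factor:>3}x -> {tokens_after_compression(song, factor):,} tokens")
```

Even at the most aggressive 128x setting, a four-minute song still yields tens of thousands of tokens, which is why the priors work on windows of the song rather than on the whole sequence at once.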

Jukebox followed MuseNet, an earlier OpenAI music system announced in 2019 that generated symbolic MIDI note events using a Transformer. Jukebox differs fundamentally in that it works with raw audio, which lets it capture timbre, vocal mannerisms, and production characteristics that symbolic representations cannot express. The Jukebox paper groups MuseNet and Music Transformer together as prior work on symbolic music generation that predicts MIDI events autoregressively.^[3]

How does Jukebox work?

Jukebox uses a two-stage "compress-then-model" pipeline: a hierarchical VQ-VAE first turns raw audio into short sequences of discrete tokens, and then autoregressive Transformers learn to generate those tokens, which the VQ-VAE decoder converts back into audio.^[2]^[3]

Hierarchical VQ-VAE

The first stage of Jukebox is a hierarchical VQ-VAE that encodes raw audio into discrete tokens at three different time resolutions. The three levels use hop lengths of 8, 32, and 128, so each token stands in for 8, 32, or 128 audio samples and the 44.1 kHz input sequence is shortened by factors of 8x, 32x, and 128x respectively. Each level uses a codebook of 2,048 entries, so every token is one of 2,048 learned discrete codes.^[3]

Level	Hop length	Compression factor	Approx. audio per 8,192-token context
Top	128	128x	~24 seconds
Middle	32	32x	~6 seconds
Bottom	8	8x	~1.5 seconds
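The durations in the last column of the table follow directly from the hop length, the 8,192-token context, and the sample rate. The short calculation below reproduces them; the constant and function names are illustrative assumptions.

```python
SAMPLE_RATE_HZ = 44_100   # sample rate cited in the article
CONTEXT_TOKENS = 8_192    # context length of each prior
HOP_LENGTHS = {"top": 128, "middle": 32, "bottom": 8}


def context_seconds(hop_length: int,
                    context_tokens: int = CONTEXT_TOKENS,
                    sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Seconds of audio covered by one context window at a given hop length."""
    return context_tokens * hop_length / sample_rate_hz


for level, hop in HOP_LENGTHS.items():
    print(f"{level:<6} hop={hop:<3} covers {context_seconds(hop):.2f} s")
```

The results come out at roughly 23.8, 5.9, and 1.5 seconds, matching the rounded figures in the table.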

The top level applies the most aggressive compression and retains the coarsest musical information (long-range structure and melody), while the bottom level preserves the finest detail (timbre and local texture). Rather than training a single deeply hierarchical autoencoder, the authors trained separate autoencoders with different hop lengths, a design choice intended to prevent the higher levels from collapsing and ignoring their codes. The VQ-VAE has roughly two million parameters and was trained on short audio clips.^[3]
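The quantization step at the heart of a VQ-VAE can be illustrated with a toy example: each latent vector produced by the encoder is replaced by the index of its nearest codebook entry. The sketch below is a minimal illustration under stated assumptions, not Jukebox's implementation. It uses a hypothetical 4-entry, 2-dimensional codebook (Jukebox uses 2,048 entries per level), hand-written latent vectors in place of a learned encoder, and squared Euclidean distance as the nearest-neighbour criterion.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Map each latent vector to the index (token) of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantize(tokens, codebook):
    """Look tokens back up in the codebook, as a decoder's first step would."""
    return [codebook[t] for t in tokens]


# Toy codebook and latents (assumed values for illustration only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

tokens = quantize(latents, codebook)
print(tokens)                        # [0, 3, 2]
print(dequantize(tokens, codebook))  # [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
```

The token sequence is what the Transformer priors model; the dequantized vectors are what the decoder turns back into audio.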

Autoregressive Transformer priors

After the audio is tokenized, autoregressive Transformers model the distribution of the discrete codes. Jukebox uses Sparse Transformers, which employ factorized (axis-aligned) attention patterns so they can scale to long sequences more efficiently than dense attention. Each prior operates over a context of 8,192 tokens using 72 layers of factorized self-attention, corresponding to approximately 24, 6, and 1.5 seconds of raw audio at the top, middle, and bottom levels respectively.^[3]
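To give a feel for why factorized attention is cheaper than dense attention, the sketch below builds toy causal masks for a 16-token sequence laid out as a 4x4 grid. One mask lets each position attend only within its own row, the other only within its own column. This is a simplified illustration of axis-aligned factorization under assumed names and sizes; it is not the exact set of attention patterns used in the released Jukebox code.

```python
def dense_causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def row_mask(n, block):
    """Causal attention restricted to positions in the same row of the grid."""
    return [[(i // block == j // block) and j <= i for j in range(n)] for i in range(n)]


def column_mask(n, block):
    """Causal attention restricted to positions in the same column of the grid."""
    return [[(i % block == j % block) and j <= i for j in range(n)] for i in range(n)]


def count_allowed(mask):
    """Number of (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)


n, block = 16, 4  # toy sizes; the real context is 8,192 tokens
print("dense :", count_allowed(dense_causal_mask(n)))    # dense : 136
print("row   :", count_allowed(row_mask(n, block)))      # row   : 40
print("column:", count_allowed(column_mask(n, block)))   # column: 40
```

Each factorized pattern touches far fewer pairs than the dense mask, and the gap widens as the sequence grows, since the dense count scales with the square of the sequence length.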

Generation proceeds top down. A top-level prior with about 5 billion parameters first produces the coarsest tokens, capturing the overall structure of the song. Two upsampling Transformers, each with about 1 billion parameters, then successively refine the sequence by generating the middle-level and bottom-level tokens conditioned on the level above. Finally the VQ-VAE decoder converts the bottom-level tokens back into audio.^[2]^[3] The largest top-level prior was trained on 512 V100 GPUs for four weeks, and the upsampling models were trained on 128 V100 GPUs for two weeks each.^[3]
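The top-down order of generation can be sketched with placeholder samplers. In the example below, `sample_prior` is a hypothetical stand-in that returns random tokens; a real prior would predict each token autoregressively from the preceding tokens and its conditioning. Only the bookkeeping is meant to be accurate: with hop lengths of 128, 32, and 8, each top-level token corresponds to 4 middle-level tokens, each middle-level token to 4 bottom-level tokens, and each bottom-level token to 8 audio samples.

```python
import random

CODEBOOK_SIZE = 2_048
HOPS = {"top": 128, "middle": 32, "bottom": 8}


def sample_prior(num_tokens, conditioning, rng):
    """Placeholder prior: uniformly random tokens. `conditioning` is unused
    here; a trained model would condition on it (and on artist, genre, lyrics)."""
    return [rng.randrange(CODEBOOK_SIZE) for _ in range(num_tokens)]


def generate_tokens(num_top_tokens, seed=0):
    """Sample top-level tokens, then upsample to the middle and bottom levels."""
    rng = random.Random(seed)
    top = sample_prior(num_top_tokens, conditioning=None, rng=rng)

    mid_per_top = HOPS["top"] // HOPS["middle"]      # 4
    middle = sample_prior(len(top) * mid_per_top, conditioning=top, rng=rng)

    bot_per_mid = HOPS["middle"] // HOPS["bottom"]   # 4
    bottom = sample_prior(len(middle) * bot_per_mid, conditioning=middle, rng=rng)
    return top, middle, bottom


top, middle, bottom = generate_tokens(num_top_tokens=8)
num_audio_samples = len(bottom) * HOPS["bottom"]  # decoder output length
print(len(top), len(middle), len(bottom), num_audio_samples)  # 8 32 128 1024
```

The sequence grows by a factor of 4 at each upsampling step, so most of the tokens that must be sampled belong to the two upsampling levels.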

Lyrics conditioning

To enable singing of specific words, Jukebox incorporates lyrics conditioning. The authors framed this as a "lyrics-to-singing" (LTS) task, in which the model must align text to vocals over time even though the training lyrics carry no explicit timing information. An encoder processes the lyrics, and an attention mechanism in the top-level prior learns to align the text to the generated vocals so that the singing roughly follows the supplied words.^[3] Notably, Jukebox does not write lyrics; it only sings when lyrics are provided as input, and otherwise produces nonsensical vocal sounds in the style of the chosen artist.^[5]
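The alignment mechanism can be pictured as attention from music positions to lyric positions: each music-side query produces a set of weights over the lyric tokens, and those weights indicate which part of the text that moment of audio is drawing on. The sketch below is a generic scaled dot-product attention in plain Python with made-up 3-dimensional vectors; the vectors, sizes, and function names are assumptions for illustration and do not come from the Jukebox code.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention from music queries to lyric keys/values."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values))
                        for j in range(len(values[0]))])
    return outputs, weights


# Three lyric tokens (keys/values) and two music positions (queries); toy values.
lyric_keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
music_queries = [[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]

_, weights = cross_attention(music_queries, lyric_keys, lyric_keys)
aligned = [w.index(max(w)) for w in weights]
print(aligned)  # [0, 2]
```

In a trained model the queries and keys are learned, so the pattern of high-weight entries traces how the vocals move through the lyrics over time.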

What data was Jukebox trained on?

OpenAI assembled a new dataset by crawling the web. According to OpenAI, the dataset comprised 1.2 million songs, 600,000 of which were in English, each paired with the corresponding lyrics and metadata from LyricWiki.^[1]^[2]^[3] The metadata included the artist, album, genre, and year of release, along with common moods or playlist keywords associated with each song.^[3] These artist and genre labels were used as conditioning signals, which is what allows a user to request output in a particular artist's style or musical genre.^[2]^[3]

The audio itself was used as 32-bit, 44.1 kHz raw audio during training. To improve lyric alignment, the team used automated tools to match lyrics to the corresponding portions of the audio at the word level, since the raw scraped lyrics did not specify when each word was sung.^[3]

What can Jukebox do, and what were its limitations?

In its demonstrations, OpenAI produced samples in the styles of artists such as Elvis Presley, 2Pac, and Ella Fitzgerald, generating both new material and continuations of existing songs.^[4] The model could generate music that stayed musically coherent through roughly the 24-second context length of the top-level prior, and it captured chord patterns and the stylistic mannerisms of artists.^[3]^[5]

Jukebox had several significant limitations that OpenAI acknowledged directly:

Audio quality. The output frequently contained audible noise and scratchiness, and observers widely described the results as falling into an "uncanny valley" of audio. A TechCrunch writer characterized the singing as sounding "like good, but drunk, karaoke heard through a haze of drugs."^[4]
Lack of large-scale structure. While the samples exhibited local musical coherence, they lacked familiar larger musical structures such as choruses that repeat.^[1]^[5]
Generation speed. Producing a song was extremely slow. The paper reported that the model took around an hour to generate one minute of top-level tokens, and that the subsequent upsampling was very slow, taking on the order of eight hours to upsample one minute of audio.^[3] The released code's documentation noted that, on a single V100 GPU, "it takes about 3 hrs to fully sample 20 seconds of music."^[2]
Gap from human music. OpenAI stated plainly that "there is a significant gap between these generations and human-created music," and the musicians it consulted did not find the tool immediately applicable to their creative process.^[4]^[5]
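The two generation-speed figures in the list above can be cross-checked with a little arithmetic. The sketch below converts each into a slowdown relative to real time and then estimates the compute time for a longer clip. Treating the paper's and the repository's figures as directly comparable, and the three-minute song length, are assumptions made for illustration; all inputs are the approximate values quoted above.

```python
def slowdown_vs_realtime(compute_hours: float, audio_seconds: float) -> float:
    """How many seconds of compute are needed per second of generated audio."""
    return compute_hours * 3600 / audio_seconds


def estimated_hours(audio_seconds: float, slowdown: float) -> float:
    """Compute time in hours for a clip, given a slowdown factor."""
    return audio_seconds * slowdown / 3600


# Paper: about 1 hour of top-level sampling plus about 8 hours of upsampling per minute.
paper = slowdown_vs_realtime(compute_hours=1 + 8, audio_seconds=60)
# Repository documentation: about 3 hours to fully sample 20 seconds on one V100.
readme = slowdown_vs_realtime(compute_hours=3, audio_seconds=20)

print(paper, readme)                # 540.0 540.0
print(estimated_hours(180, paper))  # 27.0
```

Both sources imply roughly 540 seconds of compute per second of audio, so a three-minute song would take on the order of a day to sample at those rates.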

How was Jukebox released and received?

Alongside the announcement, OpenAI released the model weights and the code on GitHub, together with a tool for exploring the generated samples and thousands of non-cherry-picked examples.^[1]^[2] The repository provided multiple model variants, including a 5-billion-parameter model, a 5-billion-parameter model with lyrics conditioning, and a smaller 1-billion-parameter model with lyrics conditioning. On a V100 GPU these required roughly 10.3 GB, 11.5 GB, and 3.8 GB of memory respectively. The code and weights were distributed under a noncommercial-use license.^[2]

Reception emphasized both the technical leap and the rough edges of the output. Writing at Waxy, Andy Baio called the results "a clear leap forward in musical quality" while describing them as "the uncanny valley of music: machine-hallucinated melodies and nonsensical DeepDream-esque vocals, but often capturing the style and mannerisms of the artist."^[5] Coverage repeatedly stressed that the very slow generation and the noisy audio made Jukebox impractical for real-time or professional use at the time.^[4]^[5]

What is Jukebox's legacy?

Jukebox demonstrated that high-dimensional raw audio, including expressive singing voices, could be modeled at scale by combining learned discrete audio tokenization with large autoregressive Transformers. This compress-then-model recipe, in which a VQ-VAE or similar codec turns continuous signals into discrete tokens that a Transformer then predicts, became influential across later generative audio and music systems. The work also sat within OpenAI's broader line of large autoregressive Transformer research that included GPT-2 and MuseNet.^[2]^[3]

Several of Jukebox's authors continued to shape OpenAI's later work. Jong Wook Kim went on to co-author Whisper, OpenAI's speech recognition system, and Alec Radford and Ilya Sutskever remained central figures in the organization's subsequent models. After Jukebox's 2020 release, OpenAI did not ship a direct successor music model, but the system remained a widely cited reference point in the development of AI music generation.^[1]^[3]

References

[1] OpenAI. "Jukebox." OpenAI, April 30, 2020. https://openai.com/index/jukebox/
[2] openai/jukebox. "Code for the paper 'Jukebox: A Generative Model for Music'." GitHub. https://github.com/openai/jukebox
[3] Dhariwal, Prafulla; Jun, Heewoo; Payne, Christine; Kim, Jong Wook; Radford, Alec; Sutskever, Ilya. "Jukebox: A Generative Model for Music." arXiv:2005.00341, April 30, 2020. https://arxiv.org/abs/2005.00341
[4] Coldewey, Devin. "OpenAI's new experiments in music generation create an uncanny valley Elvis." TechCrunch, April 30, 2020. https://techcrunch.com/2020/04/30/openais-new-experiments-in-music-generation-create-an-uncanny-valley-elvis/
[5] Baio, Andy. "OpenAI's Jukebox Opens the Pandora's Box of AI-Generated Music." Waxy.org, April 30, 2020. https://waxy.org/2020/04/openais-jukebox-opens-the-pandoras-box-of-ai-generated-music/
