Jukebox (OpenAI)
Last reviewed
Jun 3, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,540 words
Jukebox is a neural network for music generation developed by OpenAI that produces music, including rudimentary singing, as raw audio across a range of genres and artist styles. Announced on April 30, 2020, Jukebox was notable for operating directly in the raw audio domain rather than generating symbolic note sequences, and for being able to condition its output on artist, genre, and lyrics. The system was described in the paper "Jukebox: A Generative Model for Music" (arXiv:2005.00341) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever, and OpenAI released both the model weights and the code on GitHub.[1][2][3]
Jukebox generates songs as continuous audio waveforms at a 44.1 kHz sample rate, the standard fidelity of CD audio, rather than as MIDI or other symbolic representations. According to OpenAI, the model "generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles."[1] Generation can be steered by selecting an artist and genre to influence the musical and vocal style, and by supplying unaligned lyrics that the model attempts to sing.[2][3]
The core difficulty Jukebox addresses is the extreme length of raw audio sequences. A four-minute song at the 44.1 kHz CD sample rate contains more than 10 million samples per channel, far longer than the contexts that sequence models of the time could handle. To make this tractable, Jukebox first compresses audio into much shorter sequences of discrete tokens using a vector-quantized variational autoencoder (VQ-VAE), then trains autoregressive Transformers to model those token sequences, and finally decodes the tokens back into audio.[2][3]
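The scale involved can be checked with simple arithmetic. The sketch below takes a four-minute song as an illustrative duration and uses the 44.1 kHz sample rate; the factor of 128 corresponds to the coarsest VQ-VAE level described later in this article. The function names are illustrative and do not come from the Jukebox codebase.

```python
SAMPLE_RATE_HZ = 44_100  # CD-quality sample rate used by Jukebox


def samples_per_channel(duration_s: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of raw audio samples in one channel of a clip."""
    return round(duration_s * sample_rate_hz)


def tokens_after_compression(num_samples: int, factor: int) -> int:
    """Sequence length after shortening by an integer factor (ceiling division)."""
    return -(-num_samples // factor)


if __name__ == "__main__":
    song_seconds = 4 * 60  # illustrative song length
    n = samples_per_channel(song_seconds)
    print(n)                                 # 10584000
    print(tokens_after_compression(n, 128))  # 82688
```

Even at 128x compression, a full song still spans tens of thousands of tokens, which is why the priors work on fixed-size windows rather than on whole songs at once.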
Jukebox followed MuseNet, an earlier OpenAI music system announced in 2019 that generated symbolic MIDI note events using a Transformer. Jukebox differs fundamentally in that it works with raw audio, which lets it capture timbre, vocal mannerisms, and production characteristics that symbolic representations cannot express. The Jukebox paper groups MuseNet and Music Transformer together as prior work on symbolic music generation that predicts MIDI events autoregressively.[3]
The first stage of Jukebox is a hierarchical VQ-VAE that encodes raw audio into discrete tokens at three time resolutions. The three levels use hop lengths of 8, 32, and 128 samples, shortening the 44.1 kHz audio sequence by factors of 8x, 32x, and 128x respectively. Each level uses a codebook of 2,048 entries, so every token is one of 2,048 learned discrete codes.[3]
| Level | Hop length | Compression factor | Approx. audio per 8,192-token context |
|---|---|---|---|
| Top | 128 | 128x | ~24 seconds |
| Middle | 32 | 32x | ~6 seconds |
| Bottom | 8 | 8x | ~1.5 seconds |
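The durations in the last column follow directly from the hop length, the 8,192-token context, and the 44.1 kHz sample rate. A minimal sketch of that calculation, with illustrative names:

```python
SAMPLE_RATE_HZ = 44_100   # audio sample rate
CONTEXT_TOKENS = 8_192    # tokens seen by each prior at once
HOP_LENGTHS = {"top": 128, "middle": 32, "bottom": 8}  # samples per token


def context_seconds(hop_length: int,
                    context_tokens: int = CONTEXT_TOKENS,
                    sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Seconds of audio covered by one context window at a given hop length."""
    return context_tokens * hop_length / sample_rate_hz


if __name__ == "__main__":
    for level, hop in HOP_LENGTHS.items():
        print(f"{level:>6}: {context_seconds(hop):.2f} s")
    # Prints:
    #    top: 23.78 s
    # middle: 5.94 s
    # bottom: 1.49 s
```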
The top level applies the most aggressive compression and retains the coarsest musical information (long-range structure and melody), while the bottom level preserves the finest detail (timbre and local texture). Rather than training a single deeply hierarchical autoencoder, the authors trained separate autoencoders with different hop lengths, a design choice intended to prevent the higher levels from collapsing and ignoring their codes. The VQ-VAE has roughly two million parameters and was trained on short audio clips.[3]
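The quantization step at the heart of a VQ-VAE can be illustrated with a toy example. The sketch below assumes a hand-written four-entry, two-dimensional codebook purely for readability; Jukebox's codebooks have 2,048 learned entries, and the encoder and decoder networks that produce and consume the latent vectors are omitted here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Replace each latent vector with the index of its nearest codebook entry."""
    tokens = []
    for z in latents:
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(z, codebook[k]))
        tokens.append(nearest)
    return tokens


def dequantize(tokens, codebook):
    """Look up the codebook vector for each token (input to the decoder)."""
    return [codebook[t] for t in tokens]


if __name__ == "__main__":
    # Toy codebook; real entries are learned during training.
    codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]
    tokens = quantize(latents, codebook)
    print(tokens)                        # [0, 3, 2]
    print(dequantize(tokens, codebook))  # [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
```

The token indices are what the Transformer priors model; the continuous audio detail lost in this rounding step is one source of reconstruction error.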
After the audio is tokenized, autoregressive Transformers model the distribution of the discrete codes. Jukebox uses Sparse Transformers, which employ factorized (axis-aligned) attention patterns so they can scale to long sequences more efficiently than dense attention. Each prior operates over a context of 8,192 tokens, corresponding to approximately 24, 6, and 1.5 seconds of raw audio at the top, middle, and bottom levels respectively.[3]
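The saving from factorized attention can be seen by counting which positions each token is allowed to attend to. The sketch below is a simplified illustration of the general idea, with a sequence laid out as rows of a fixed block size and attention restricted to either the same row or the same column; it is not a reproduction of the exact patterns used in Jukebox, and the sequence length and block size are arbitrary example values.

```python
def dense_causal_mask(n):
    """Every position attends to itself and all earlier positions."""
    return [[j <= i for j in range(n)] for i in range(n)]


def row_mask(n, block):
    """Causal attention limited to positions in the same block (row)."""
    return [[j <= i and j // block == i // block for j in range(n)]
            for i in range(n)]


def column_mask(n, block):
    """Causal attention limited to the same offset within each block (column)."""
    return [[j <= i and j % block == i % block for j in range(n)]
            for i in range(n)]


def attended_pairs(mask):
    """Total number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    n, block = 16, 4  # toy sizes; the real context is 8,192 tokens
    print(attended_pairs(dense_causal_mask(n)))    # 136
    print(attended_pairs(row_mask(n, block)))      # 40
    print(attended_pairs(column_mask(n, block)))   # 40
```

Alternating row-style and column-style layers lets information still reach any earlier position in two hops, while each layer touches far fewer pairs than dense attention does.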
Generation proceeds top down. A top-level prior with about 5 billion parameters first produces the coarsest tokens, capturing the overall structure of the song. Two upsampling Transformers, each with about 1 billion parameters, then successively refine the sequence by generating the middle-level and bottom-level tokens conditioned on the level above. Finally the VQ-VAE decoder converts the bottom-level tokens back into audio.[2][3] The largest top-level prior was trained on 512 V100 GPUs for four weeks, and the upsampling models were trained on 128 V100 GPUs for two weeks each.[3]
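The data flow of this top-down procedure can be sketched with placeholder functions. In the example below, every model is a stand-in: the hypothetical `sample_top_prior` and `upsample` draw random codes instead of running a Transformer, and `decode` returns silence. Only the sequence lengths, the codebook size, and the order of the stages reflect the description above.

```python
import random

CODEBOOK_SIZE = 2048
HOPS = {"top": 128, "middle": 32, "bottom": 8}  # samples per token


def sample_top_prior(num_tokens, rng):
    """Stand-in for the top-level prior: random codes, no Transformer."""
    return [rng.randrange(CODEBOOK_SIZE) for _ in range(num_tokens)]


def upsample(coarse_tokens, ratio, rng):
    """Stand-in for an upsampling prior: `ratio` finer tokens per coarse token.

    A real upsampler conditions on coarse_tokens; this placeholder only
    reproduces the output length.
    """
    return [rng.randrange(CODEBOOK_SIZE)
            for _ in range(len(coarse_tokens) * ratio)]


def decode(bottom_tokens):
    """Stand-in for the VQ-VAE decoder: silence of the correct length."""
    return [0.0] * (len(bottom_tokens) * HOPS["bottom"])


def generate(num_top_tokens, seed=0):
    """Run the stages in order: top prior, two upsamplers, decoder."""
    rng = random.Random(seed)
    top = sample_top_prior(num_top_tokens, rng)
    middle = upsample(top, HOPS["top"] // HOPS["middle"], rng)
    bottom = upsample(middle, HOPS["middle"] // HOPS["bottom"], rng)
    audio = decode(bottom)
    return top, middle, bottom, audio


if __name__ == "__main__":
    top, middle, bottom, audio = generate(num_top_tokens=10)
    print(len(top), len(middle), len(bottom), len(audio))  # 10 40 160 1280
```

Each stage multiplies the sequence length by four, so most of the sampling work falls on the two upsamplers even though the top-level prior is the largest model.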
To enable singing of specific words, Jukebox incorporates lyrics conditioning. The authors framed this as a "lyrics-to-singing" (LTS) task, in which the model must align text to vocals over time even though the training lyrics carry no explicit timing information. An encoder processes the lyrics, and an attention mechanism in the top-level prior learns to align the text to the generated vocals so that the singing roughly follows the supplied words.[3] Notably, Jukebox does not write lyrics; it only sings when lyrics are provided as input, and otherwise produces nonsensical vocal sounds in the style of the chosen artist.[5]
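The alignment mechanism rests on attention from music positions to lyric positions. The toy example below shows single-head scaled dot-product attention for one query vector, with hand-picked two-dimensional keys and values standing in for lyric encoder outputs. It illustrates the general mechanism only; the vectors, dimensions, and function names are assumptions for the example, not details of Jukebox's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(dim)]
    return weights, context


if __name__ == "__main__":
    # Three lyric positions (toy vectors); the query resembles the second key.
    lyric_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    lyric_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    music_query = [0.0, 4.0]
    weights, context = attend(music_query, lyric_keys, lyric_values)
    print(max(range(len(weights)), key=weights.__getitem__))  # 1
```

During training, weights of this kind are learned rather than hand-set, and the position with the highest weight tends to move forward through the lyrics as the generated vocals progress.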
OpenAI assembled a new dataset by crawling the web. According to OpenAI, the dataset comprised 1.2 million songs, 600,000 of which were in English, each paired with the corresponding lyrics and metadata from LyricWiki.[1][2][3] The metadata included the artist, album, genre, and year of release, along with common moods or playlist keywords associated with each song.[3] These artist and genre labels were used as conditioning signals, which is what allows a user to request output in a particular artist's style or musical genre.[2][3]
The models were trained on 32-bit, 44.1 kHz raw audio. Because the scraped lyrics did not specify when each word was sung, the team used automated tools to align the lyrics with the corresponding portions of the audio at the word level.[3]
In its demonstrations, OpenAI produced samples in the styles of artists such as Elvis Presley, 2Pac, and Ella Fitzgerald, generating both new material and continuations of existing songs.[4] The model could generate music that stayed musically coherent through roughly the 24-second context length of the top-level prior, and it captured chord patterns and the stylistic mannerisms of artists.[3][5]
Jukebox had several significant limitations that OpenAI acknowledged directly, including very slow generation and audible noise in the output audio.
Alongside the announcement, OpenAI released the model weights and the code on GitHub, together with a tool for exploring the generated samples and thousands of non-cherry-picked examples.[1][2] The repository provided multiple model variants, including a 5-billion-parameter model, a 5-billion-parameter model with lyrics conditioning, and a smaller 1-billion-parameter model with lyrics conditioning. On a V100 GPU these required roughly 10.3 GB, 11.5 GB, and 3.8 GB of memory respectively. The code and weights were distributed under a noncommercial-use license.[2]
Reception emphasized both the technical leap and the rough edges of the output. Writing at Waxy, Andy Baio called the results "a clear leap forward in musical quality" while describing them as "the uncanny valley of music: machine-hallucinated melodies and nonsensical DeepDream-esque vocals, but often capturing the style and mannerisms of the artist."[5] Coverage repeatedly stressed that the very slow generation and the noisy audio made Jukebox impractical for real-time or professional use at the time.[4][5]
Jukebox demonstrated that high-dimensional raw audio, including expressive singing voices, could be modeled at scale by combining learned discrete audio tokenization with large autoregressive Transformers. This compress-then-model recipe, in which a VQ-VAE or similar codec turns continuous signals into discrete tokens that a Transformer then predicts, became influential across later generative audio and music systems. The work also sat within OpenAI's broader line of large autoregressive Transformer research that included GPT-2 and MuseNet.[2][3]
Several of Jukebox's authors continued to shape OpenAI's later work. Jong Wook Kim went on to co-author Whisper, OpenAI's speech recognition system, and Alec Radford and Ilya Sutskever remained central figures in the organization's subsequent models. After Jukebox's 2020 release, OpenAI did not ship a direct successor music model, but the system remained a widely cited reference point in the development of AI music generation.[1][3]