MuseNet
Last reviewed
Jun 3, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 1,376 words
MuseNet is a deep neural network for symbolic music generation, announced by OpenAI on April 25, 2019. The model can generate musical compositions up to about four minutes long using as many as 10 different instruments, and it can blend styles ranging from classical composers such as Mozart and Chopin to bands such as the Beatles, as well as country, pop, and other genres. MuseNet was not explicitly programmed with any understanding of music theory; instead it discovered patterns of harmony, rhythm, and style by learning to predict the next token across hundreds of thousands of MIDI files. It used the same general-purpose, unsupervised next-token-prediction approach as GPT-2, implemented as a large Sparse Transformer.[1][2][3]
OpenAI presented MuseNet as evidence that the general-purpose sequence-modeling technology behind its language models could be applied to other domains, in this case symbolic music. The system treats a piece of music as a sequence of discrete tokens, much as a language model treats text as a sequence of word or subword tokens, and learns to continue that sequence one token at a time.[1][2]
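The idea of continuing a sequence one token at a time can be illustrated with a deliberately tiny stand-in. The sketch below swaps the Transformer for a bigram count table, so it shows only the autoregressive loop, not MuseNet's model; the function names and the note-name tokens are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count, for each token, which tokens follow it in the training data."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def continue_sequence(counts, prompt, max_new_tokens):
    """Greedily extend `prompt`, choosing the most frequent follower each step."""
    if not prompt:
        raise ValueError("prompt must contain at least one token")
    out = list(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # token never seen with a follower; stop early
        out.append(followers.most_common(1)[0][0])
    return out

# Toy "corpus" of note-name tokens (hypothetical, for illustration only).
corpus = [["C", "E", "G", "C"], ["C", "E", "G", "C"]]
model = train_bigram(corpus)
print(continue_sequence(model, ["C"], 4))
```

A real model conditions on thousands of previous tokens rather than one, and typically samples from the predicted distribution instead of always taking the most likely token.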
Because it was trained purely by prediction rather than by encoding musical rules, MuseNet learned conventions of harmony and rhythm implicitly from data. OpenAI noted that this approach allowed the model to combine styles in unusual ways, for example generating a piece that begins in one composer's idiom and shifts toward another, or rendering a melody associated with one artist using the instrumentation of a different genre. The company also acknowledged limitations: when asked to combine clashing styles and instruments, such as pairing a Chopin-style piano piece with bass and drums, the model could produce odd or incoherent results because such combinations were rare or absent in the training data.[1][2]
MuseNet was released as a research demonstration with an accompanying interactive web tool rather than as a commercial product or downloadable model.[1][4]
MuseNet is built on the Sparse Transformer architecture, a more compute-efficient variant of the Transformer that OpenAI had introduced earlier in April 2019. Using the Sparse Transformer's memory-saving recomputation and optimized kernels, OpenAI trained a 72-layer network with 24 attention heads and full attention over a context of 4,096 tokens. The long context window is what allowed the model to learn and maintain long-term structure across a multi-minute composition.[1][2][3]
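A back-of-envelope count gives a sense of why memory-saving techniques matter at this scale. The sketch below only multiplies the figures quoted above; it assumes causal masking (each token attends to itself and earlier tokens) and is not a description of OpenAI's implementation.

```python
# Figures quoted in the article.
LAYERS = 72
HEADS = 24
CONTEXT = 4096

def attention_scores(context, causal=True):
    """Number of query-key pairs one head scores for `context` tokens."""
    if causal:
        # Token i attends to tokens 0..i, giving 1 + 2 + ... + context pairs.
        return context * (context + 1) // 2
    return context * context

per_head = attention_scores(CONTEXT)
per_pass = per_head * HEADS * LAYERS  # assumes every layer and head uses full attention
print(f"{per_head:,} scores per head, {per_pass:,} per forward pass")
```

Storing every one of these values for the backward pass is what recomputation avoids, at the cost of extra arithmetic.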
The model was trained on a large collection of MIDI files. ClassicalArchives and BitMidi donated their MIDI collections for the project, and OpenAI supplemented these with other collections found online, including jazz, pop, African, Indian, and Arabic styles. To convert MIDI into a sequence the network could model, OpenAI experimented with several token encodings and found that the most effective scheme combined the pitch, volume, and instrument of each note into a single token, alongside tokens that advanced time. OpenAI also applied data augmentation during training: transposing notes into different keys, varying volume and timing, and applying mixup in the token-embedding space.[1][2][3]
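A minimal sketch of such an encoding follows. The `Note` class, the `instrument:vVELOCITY:PITCH` and `wait:N` token spellings, and the tick unit are assumptions made for illustration; MuseNet's exact vocabulary is not specified here. The `transpose` helper shows the key-shifting augmentation in the same terms.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Note:
    start: int        # onset time in ticks (hypothetical unit)
    instrument: str
    velocity: int     # MIDI velocity (volume), 0-127
    pitch: int        # MIDI pitch number, 0-127

def encode(notes):
    """Emit one combined token per note, with wait tokens that advance time."""
    tokens = []
    now = 0
    for note in sorted(notes, key=lambda n: (n.start, n.pitch)):
        if note.start > now:
            tokens.append(f"wait:{note.start - now}")
            now = note.start
        tokens.append(f"{note.instrument}:v{note.velocity}:{note.pitch}")
    return tokens

def transpose(notes, semitones):
    """Augmentation sketch: shift pitches, dropping notes that leave MIDI range."""
    return [
        replace(n, pitch=n.pitch + semitones)
        for n in notes
        if 0 <= n.pitch + semitones <= 127
    ]

chord_then_note = [
    Note(0, "piano", 72, 43),
    Note(0, "piano", 72, 55),
    Note(12, "piano", 72, 62),
]
print(encode(chord_then_note))
print(encode(transpose(chord_then_note, 2)))
```

Notes that start together appear as adjacent tokens with no wait token between them, which is how a chord is represented in this scheme.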
Beyond standard positional embeddings, MuseNet added several learned embeddings to help the model track musical context:
| Embedding | Purpose |
|---|---|
| Positional | Standard Transformer embedding indicating a token's position in the sequence. |
| Timing | A learned embedding tracking the passage of time so that all notes sounding simultaneously share the same timing embedding. |
| Chord | An embedding added for each note within a chord, mimicking relative attention so the model can more easily relate a note to earlier notes in the same or previous chord. |
| Structural (part) | An embedding dividing the larger piece into 128 parts to indicate where a sample sits within the whole. |
| Structural (countdown) | A second structural encoding that counts down from 127 to 0 as the model approaches the end-of-piece token. |
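
The indices that select the timing and structural embeddings can be sketched as below. This assumes a hypothetical `wait:N` token marks each advance in time, that the 128 parts are equal-sized by token position, and that the countdown simply mirrors the part index; the source does not spell out these details.

```python
def timing_indices(tokens):
    """Give each token a time-step index; tokens between waits share one index."""
    indices = []
    step = 0
    for tok in tokens:
        indices.append(step)
        if tok.startswith("wait:"):
            step += 1  # later tokens sound at a later time step
    return indices

def structural_indices(position, piece_length, parts=128):
    """Return (part, countdown) for the token at `position` in the piece."""
    if not 0 <= position < piece_length:
        raise ValueError("position must lie inside the piece")
    part = position * parts // piece_length   # 0 .. parts-1
    countdown = (parts - 1) - part            # parts-1 down to 0 (assumed linear)
    return part, countdown

tokens = ["piano:v72:43", "piano:v72:55", "wait:12", "piano:v72:62"]
print(timing_indices(tokens))
print(structural_indices(0, 1000), structural_indices(999, 1000))
```

Each index would then look up a learned vector that is added to the token's embedding, alongside the standard positional embedding.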
To steer generation, OpenAI created composer and instrumentation tokens that were prepended to each sample at training time, so the model learned to associate them with particular musical characteristics. At generation time, supplying these tokens (for example, a token for a given composer or a chosen lead instrument) biased the output toward the requested style or arrangement.[1][2]
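In code, this kind of conditioning amounts to placing control tokens ahead of the note tokens. The sketch below is illustrative only: the function name, the token spellings, and the `start` separator are assumptions rather than MuseNet's documented vocabulary.

```python
def build_conditioned_sequence(composer, instruments, note_tokens=()):
    """Prepend a composer token and an instrumentation token to a sample."""
    if not instruments:
        raise ValueError("at least one instrument is required")
    instrumentation = "_".join(instruments)  # e.g. "piano_strings"
    return [composer, instrumentation, "start", *note_tokens]

# Training time: the control tokens precede the notes of a real piece.
training_sample = build_conditioned_sequence(
    "chopin", ["piano"], ["piano:v72:60", "wait:12", "piano:v72:64"]
)

# Generation time: the same prefix with no notes acts as the prompt to continue.
prompt = build_conditioned_sequence("mozart", ["piano", "strings"])
print(training_sample)
print(prompt)
```

Because the prefix is just part of the sequence, the model needs no separate conditioning mechanism; it learns the association through ordinary next-token prediction.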
MuseNet's headline capabilities, as described by OpenAI, can be summarized as follows.[1][2][5]
| Aspect | Detail |
|---|---|
| Output format | Symbolic music (MIDI), not raw audio |
| Maximum length | About four minutes per composition |
| Instruments | Up to 10 different instruments in a single piece |
| Styles | Classical composers (for example Mozart, Chopin), bands (for example the Beatles), plus country, pop, jazz, and other genres |
| Generation | Unsupervised next-token prediction, optionally conditioned on composer and instrument tokens |
| Steering | Users could choose a composer or style, a starting set of notes, and instrumentation |
The model could generate a piece from scratch given a chosen style, or continue a short musical snippet provided by a user, predicting how the piece might develop. Because it produced MIDI rather than audio, the output was rendered through software instruments, and it did not include vocals or the fine timbral detail of recorded sound.[1][2]
Alongside the announcement, OpenAI published an interactive demonstration and made a MuseNet-powered co-composer tool available to the public for a limited time, through May 12, 2019.[1][2][6]
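The tool offered two modes: a simple mode, which played pre-generated, uncurated samples for a chosen composer or style and an optional opening from a well-known piece, and an advanced mode, in which users interacted with the model directly to generate new pieces.[1]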
To mark the launch, OpenAI ran an experimental concert on April 25, 2019, livestreamed from roughly 12:00 to 3:00 p.m. Pacific Time. The pieces played during the stream were generated directly by MuseNet with no human curation or filtering, and OpenAI stated that no one, including the company, had heard the samples before they were broadcast.[7][8]
The interactive co-composer was a temporary prototype rather than a permanent public service. Access to the original co-composer ended in mid-May 2019, and OpenAI did not keep the broader interactive tooling running long term as its research focus shifted.[2][6]
MuseNet attracted significant media attention as a demonstration that the unsupervised, large-scale Transformer approach pioneered for text could generalize to music. Technology outlets highlighted both the breadth of styles the model could imitate and its ability to maintain coherent structure over several minutes, while also noting that genre blends could sound disjointed and that the MIDI output lacked the realism of recorded audio.[3][4][5]
MuseNet directly preceded OpenAI's Jukebox, unveiled in 2020, which represented a different and more ambitious approach to music generation. Whereas MuseNet operated on symbolic MIDI, Jukebox generated music as raw audio, including rudimentary singing, conditioned on genre, artist, and lyrics. OpenAI described its earlier MuseNet work on synthesizing music from large amounts of MIDI data as a precursor to Jukebox, and observers framed the shift from symbolic generation to raw-audio synthesis as an attempt to capture human voices and subtle timbres that symbolic systems cannot represent.[9][10]
As a research artifact, MuseNet remains an early and frequently cited example of applying general-purpose sequence models to creative domains, illustrating the same underlying principle, predicting the next token in a long sequence, that OpenAI applied across text, code, and other modalities.[1][2]