MEGABYTE
Last reviewed
Jun 3, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,401 words
MEGABYTE is a transformer architecture for autoregressive modeling of very long sequences directly at the byte level, introduced by researchers at Meta AI (FAIR) in May 2023. The design is a multiscale decoder: it splits a long byte sequence into fixed-size patches, runs a large "global" transformer over patch-level representations, and uses a smaller "local" transformer to predict the individual bytes inside each patch. This factorization makes self-attention sub-quadratic in the full sequence length and lets the model handle sequences of more than one million bytes without tokenization. MEGABYTE was published as a poster at NeurIPS 2023 and is a conceptual predecessor to Meta's later Byte Latent Transformer (BLT), which replaced MEGABYTE's fixed patches with dynamic, entropy-based ones.[1][2][3]
Standard autoregressive transformers work well on short sequences but scale poorly to long ones, because the cost of self-attention grows quadratically with sequence length and a large feedforward layer is applied at every single position.[1] Most large language model systems sidestep this by first running tokenization, which compresses raw text into a much shorter sequence of subword tokens. Tokenization introduces its own problems: it is language- and domain-specific, it complicates modeling of non-text modalities, and it can make models brittle to spelling, formatting, and rare character sequences.
Operating directly on raw bytes removes the tokenizer entirely and gives a single, universal vocabulary of 256 values that works for any text, image, or audio file. The drawback is that byte sequences are far longer than token sequences (a book of a few hundred thousand tokens can be millions of bytes), which makes a naive byte-level transformer prohibitively expensive. MEGABYTE was designed to make byte-level modeling tractable at that scale.[1][4]
MEGABYTE is built from three components. A long input of length $T$ bytes is first divided into $K$ patches of fixed size $P$, so that $T = K \times P$.[1][3]
| Component | Role | Operates over |
|---|---|---|
| Patch embedder | Embeds each byte and losslessly concatenates the per-byte embeddings into a single patch representation of dimension $P \cdot D_G$, where $D_G$ is the width of each per-byte embedding | Bytes within a patch |
| Global model | A large autoregressive transformer that reads and predicts patch representations, modeling dependencies across the whole sequence | The $K$ patches |
| Local model | A small autoregressive transformer that, conditioned on the global model's output for a patch, predicts the bytes inside that patch one at a time | Bytes within a single patch |
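The patching and embedding step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `patchify` and `embed_patches`, the zero-byte padding of the final patch, the tiny width `D_G = 4`, and the random embedding table are all assumptions made for the example, and positional embeddings are omitted.

```python
import numpy as np

def patchify(data: bytes, patch_size: int, pad_byte: int = 0) -> np.ndarray:
    """Split a byte string into K fixed-size patches of P bytes each.

    Padding the tail with `pad_byte` is an assumption of this sketch.
    """
    arr = np.frombuffer(data, dtype=np.uint8)
    remainder = len(arr) % patch_size
    if remainder:
        pad = np.full(patch_size - remainder, pad_byte, dtype=np.uint8)
        arr = np.concatenate([arr, pad])
    return arr.reshape(-1, patch_size)  # shape (K, P)

def embed_patches(patches: np.ndarray, table: np.ndarray) -> np.ndarray:
    """Look up one D_G-wide vector per byte, then concatenate within each patch."""
    k, p = patches.shape
    per_byte = table[patches]  # shape (K, P, D_G)
    return per_byte.reshape(k, p * table.shape[1])  # shape (K, P * D_G)

rng = np.random.default_rng(0)
D_G = 4                                # hypothetical per-byte embedding width
table = rng.normal(size=(256, D_G))    # one row per possible byte value

patches = patchify(b"hello, megabyte!", patch_size=8)
global_inputs = embed_patches(patches, table)
print(patches.shape, global_inputs.shape)  # (2, 8) (2, 32)
```

Each row of `global_inputs` is what the global model would see as one position, so a 16-byte input with $P = 8$ becomes a sequence of only two positions.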
The global model captures long-range structure cheaply, because it attends over only $K = T/P$ positions rather than all $T$ bytes. The local model then fills in the fine-grained detail inside each patch. Because every patch is decoded by the same small local model, MEGABYTE can also generate the bytes of different patches with more parallelism than a flat byte-level transformer.[1][3] In the paper's experiments the global and local models are sized independently: for example, a 1.3-billion-parameter global model is paired with a roughly 218-million-parameter local model for language modeling, and a 2.7-billion-parameter global model is used for high-resolution image modeling.[3]
Splitting the sequence into patches changes the dominant cost of self-attention. A flat transformer over $T$ bytes costs $O(T^2)$ in the attention layers. In MEGABYTE the global model attends over $T/P$ patches at cost $O((T/P)^2)$, and the local model attends within patches at cost $O(P \cdot T)$, since each of the $T/P$ patches costs $O(P^2)$.[1][3] The two terms balance when $T^2/P^2 \approx P \cdot T$, that is when $P \approx T^{1/3}$, which minimizes the combined cost at roughly $O(T^{4/3})$ and is sub-quadratic in $T$. A second benefit is in the feedforward layers: because the heavy global feedforward computation is applied once per patch rather than once per byte, the same compute budget buys much larger feedforward layers, where transformers spend most of their parameters.[1][4]
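A short numerical check makes the trade-off concrete. The sketch below counts pairwise attention interactions only, ignoring constant factors, model width, and feedforward cost. The helper name `attention_cost` and the example sizes are illustrative choices, not values from the paper.

```python
def attention_cost(T: int, P: int) -> dict:
    """Attention cost in abstract units (pairwise interactions, constants ignored)."""
    K = T // P  # number of patches
    return {
        "flat": T ** 2,       # one transformer over all T bytes
        "global": K ** 2,     # global model over K patches
        "local": K * P ** 2,  # K local models, each over P bytes
    }

T = 1_000_000  # one million bytes, so T ** (1/3) is 100
for P in (8, 100, 1000):
    cost = attention_cost(T, P)
    multiscale = cost["global"] + cost["local"]
    print(f"P={P:5d}  multiscale={multiscale:.3e}  flat={cost['flat']:.3e}")
```

With $T = 10^6$, the patch size $P = 100 = T^{1/3}$ gives a combined count of $2 \times 10^8$, lower than either $P = 8$ or $P = 1000$ and about four orders of magnitude below the flat $10^{12}$. Minimizing $T^2/P^2 + P \cdot T$ exactly gives $P = (2T)^{1/3}$, which differs from $T^{1/3}$ only by a constant factor.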
MEGABYTE was evaluated across three modalities, in every case modeling raw bytes with no tokenizer. Performance on language and image tasks is reported in bits per byte (bpb), where lower is better.[1][3]
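For readers unfamiliar with the metric, bits per byte is the average negative base-2 log-probability the model assigns to each byte that actually occurs. The sketch below uses made-up probabilities and hypothetical helper names. The second function covers the common case where a training framework reports mean loss in nats.

```python
import math

def bits_per_byte(byte_probs):
    """Mean of -log2(p) over the probabilities given to the observed bytes."""
    return -sum(math.log2(p) for p in byte_probs) / len(byte_probs)

def nats_to_bpb(mean_nll_nats):
    """Convert a mean per-byte negative log-likelihood from nats to bits."""
    return mean_nll_nats / math.log(2)

# Illustrative probabilities only, not model output.
print(bits_per_byte([0.5, 0.5, 0.25, 0.25]))  # 1.5
print(bits_per_byte([1 / 256] * 4))           # 8.0, a uniform guess over 256 values
```

A model that guesses uniformly over the 256 byte values scores 8 bpb, so that value is a natural reference point for the figures reported here.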
The paper used long-form text corpora including PG-19, Stories, Books, arXiv, and a code dataset, with patch size $P = 8$.[3] On these long-context benchmarks, byte-level MEGABYTE reached lower bits per byte than both baselines trained with comparable compute, a standard byte-level transformer and PerceiverAR, on all five datasets.[1][3]
| Dataset | MEGABYTE (bpb) | Transformer (bpb) | PerceiverAR (bpb) |
|---|---|---|---|
| PG-19 | 1.000 | 1.057 | 1.104 |
| Stories | 0.978 | 1.064 | 1.070 |
| Books | 1.007 | 1.097 | 1.104 |
| arXiv | 0.678 | 0.816 | 0.791 |
| Code | 0.411 | 0.575 | 0.546 |
These results were presented as evidence that byte-level models can be competitive with subword models on long-context language modeling, rather than as a claim of beating the best tokenized systems outright.[1][4]
On autoregressive density estimation for ImageNet, MEGABYTE reported state-of-the-art or competitive bits per byte at multiple resolutions, using larger patch sizes for larger images (for example $P = 12$ at 64x64 and $P = 192$ at higher resolutions).[3] At 640x640 resolution each image is around 1.2 million bytes, which is where the "million-byte" capability is exercised on real data.[3]
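The sequence lengths involved follow from simple arithmetic. The sketch below assumes three 8-bit color channels per pixel, so one byte per channel value. The function name is hypothetical.

```python
def image_sequence(height: int, width: int, patch_size: int, channels: int = 3):
    """Return (bytes per image, patches per image) for a raw 8-bit image.

    Assumes one byte per channel value; the last patch is counted even if partial.
    """
    n_bytes = height * width * channels
    n_patches = -(-n_bytes // patch_size)  # ceiling division
    return n_bytes, n_patches

print(image_sequence(64, 64, patch_size=12))     # (12288, 1024)
print(image_sequence(640, 640, patch_size=192))  # (1228800, 6400)
```

Under these assumptions a 640x640 image is 1,228,800 bytes, and with $P = 192$ the global model attends over 6,400 patch positions rather than over more than a million bytes.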
| Resolution | MEGABYTE (bpb) | PerceiverAR (bpb) |
|---|---|---|
| 64x64 | 3.40 | 3.40 |
| 256x256 | 3.158 | 3.373 |
| 640x640 | 2.282 | 2.345 |
MEGABYTE was also trained on raw audio files (16 kHz, 16-bit) using a sequence length of 524,288 bytes per example and patch size $P = 32$, reaching 3.477 bpb versus 3.543 for a PerceiverAR baseline.[3] Modeling audio directly from raw files, with no spectrogram or codec front end, illustrates the architecture's modality independence.[1]
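The same arithmetic gives a feel for how much audio one training example covers. The sketch assumes a single (mono) channel, which the text above does not state. With more channels the duration would shrink proportionally.

```python
def audio_example(n_bytes: int, sample_rate: int, bits_per_sample: int,
                  patch_size: int, channels: int = 1):
    """Return (seconds of audio, number of patches) for a raw PCM byte sequence.

    channels=1 (mono) is an assumption of this sketch.
    """
    bytes_per_second = sample_rate * (bits_per_sample // 8) * channels
    return n_bytes / bytes_per_second, n_bytes // patch_size

seconds, n_patches = audio_example(524_288, sample_rate=16_000,
                                   bits_per_sample=16, patch_size=32)
print(seconds, n_patches)  # 16.384 16384
```

Under the mono assumption, each 524,288-byte example is about 16.4 seconds of audio, which the global model sees as 16,384 patches.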
MEGABYTE argued that tokenizer-free autoregressive modeling is viable at scale, showing that a single byte-level decoder could be competitive across text, images, and audio while handling contexts far longer than a flat transformer could afford.[1] Its central idea, factoring a long sequence into a coarse global model over patches plus a fine local model within patches, has since appeared in several follow-up architectures.
The most direct successor from the same lab is the Byte Latent Transformer, published by Meta FAIR in December 2024 (Pagnoni et al.).[5][6] BLT keeps the byte-level, patch-based philosophy but changes how patches are formed. Whereas MEGABYTE strides over the sequence in fixed-size blocks (for example grouping every few bytes), BLT segments the byte stream dynamically using the entropy of a small next-byte prediction model, creating patches with roughly uniform information density.[2][6] The BLT authors describe the limitation they were addressing: with a fixed stride "compute is not dynamically allocated to where it is needed most," so a model may waste a step on predictable bytes (such as whitespace in code) or under-spend on information-dense bytes (such as the start of a new word or a mathematical expression).[6] Allocating more compute to high-entropy regions let BLT, for the first time, match tokenization-based LLM performance at scale while improving inference efficiency and robustness.[5][6] In this lineage MEGABYTE is the fixed-patch starting point and BLT is the dynamic-patch refinement.
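The difference between the two patching schemes can be illustrated with a toy example. This is a simplified sketch, not BLT's actual procedure. The entropy values are invented, whereas in BLT they would come from a small byte-level language model. The single global threshold and the function names are also assumptions made for illustration.

```python
def fixed_patches(n_bytes: int, patch_size: int):
    """Fixed-stride boundaries: a new patch every `patch_size` bytes."""
    return [(s, min(s + patch_size, n_bytes)) for s in range(0, n_bytes, patch_size)]

def entropy_patches(entropies, threshold: float):
    """Start a new patch wherever the next-byte entropy exceeds `threshold`."""
    bounds, start = [], 0
    for i, h in enumerate(entropies):
        if i > start and h > threshold:
            bounds.append((start, i))
            start = i
    bounds.append((start, len(entropies)))
    return bounds

# Invented per-byte entropies (bits): spikes mark hard-to-predict positions.
entropies = [3.1, 0.4, 0.3, 0.2, 2.9, 0.5, 0.1, 0.2, 0.1, 3.3, 0.6, 0.2]

print(fixed_patches(len(entropies), 4))  # [(0, 4), (4, 8), (8, 12)]
print(entropy_patches(entropies, 2.0))   # [(0, 4), (4, 9), (9, 12)]
```

The fixed scheme cuts at position 8 regardless of content, while the entropy-based scheme extends the second patch through the run of predictable bytes and cuts at position 9, where the entropy spikes.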
The work was published as "MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers" by Lili Yu, Daniel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, and Mike Lewis, all affiliated with Meta AI (FAIR), with Zettlemoyer also at the University of Washington.[1][3] The preprint appeared on arXiv on 12 May 2023 (revised 19 May 2023) and the paper was accepted to NeurIPS 2023.[1][7] Lili Yu, the lead author, later co-authored the Byte Latent Transformer.[6]