MEGABYTE

Updated Jul 16, 2026

MEGABYTE is a transformer architecture for autoregressive modeling of very long sequences directly at the byte level, introduced by researchers at Meta AI (FAIR) in May 2023. The design is a multiscale decoder: it splits a long byte sequence into fixed-size patches, runs a large "global" transformer over patch-level representations, and uses a smaller "local" transformer to predict the individual bytes inside each patch. This factorization makes self-attention sub-quadratic in the full sequence length and lets the model handle sequences of more than one million bytes without tokenization. MEGABYTE was published as a poster at NeurIPS 2023 and is a conceptual predecessor to Meta's later Byte Latent Transformer (BLT), which replaced MEGABYTE's fixed patches with dynamic, entropy-based ones.^[1]^[2]^[3]

Background and motivation

Standard autoregressive transformers work well on short sequences but scale poorly to long ones, because the cost of self-attention grows quadratically with sequence length and a large feedforward layer is applied at every single position.^[1] Most large language models sidestep this by tokenizing their input first, which compresses raw text into a much shorter sequence of subword tokens. Tokenization introduces its own problems: it is language- and domain-specific, it complicates modeling of non-text modalities, and it can make models brittle to spelling, formatting, and rare character sequences.

Operating directly on raw bytes removes the tokenizer entirely and gives a single, universal vocabulary of 256 values that works for any text, image, or audio file. The drawback is that byte sequences are far longer than token sequences (a book of a few hundred thousand tokens can be millions of bytes), which makes a naive byte-level transformer prohibitively expensive. MEGABYTE was designed to make byte-level modeling tractable at that scale.^[1]^[4]

Architecture

MEGABYTE is built from three components. A long input of length $T$ bytes is first divided into $K$ patches of fixed size $P$, so that $T = K \times P$.^[1]^[3]

Component	Role	Operates over
Patch embedder	Embeds each byte and losslessly concatenates the per-byte embeddings into a single patch representation of dimension $P \cdot D_G$	Bytes within a patch
Global model	A large autoregressive transformer that reads and predicts patch representations, modeling dependencies across the whole sequence	The $K$ patches
Local model	A small autoregressive transformer that, conditioned on the global model's output for a patch, predicts the bytes inside that patch one at a time	Bytes within a single patch
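The patching step and the resulting shapes can be sketched in a few lines of Python. This is only an illustration of the relation $T = K \times P$ and of the $P \cdot D_G$ patch dimension from the table above; the function names, the zero-byte tail padding, and the example value of `d_global` are assumptions made here, not details taken from the paper.

```python
def split_into_patches(data: bytes, patch_size: int, pad_byte: int = 0) -> list[bytes]:
    """Split a byte string into K fixed-size patches of P bytes each.

    Padding the tail with `pad_byte` is an assumption made for this sketch
    so that T = K * P always holds.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    remainder = len(data) % patch_size
    if remainder:
        data = data + bytes([pad_byte]) * (patch_size - remainder)
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]


def patch_embedding_shape(num_bytes: int, patch_size: int, d_global: int) -> tuple[int, int]:
    """Shape (K, P * D_G) of the global model's input once per-byte
    embeddings of width D_G are concatenated within each patch."""
    if num_bytes % patch_size:
        raise ValueError("num_bytes must be a multiple of patch_size")
    return num_bytes // patch_size, patch_size * d_global


text = "MEGABYTE models raw bytes.".encode("utf-8")  # 26 bytes
patches = split_into_patches(text, patch_size=8)      # padded to 32 bytes
shape = patch_embedding_shape(32, patch_size=8, d_global=16)
```

With $P = 8$ the 26-byte input is padded to 32 bytes and yields four patches, so the global model would see 4 positions instead of 32.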

The global model captures long-range structure cheaply, because it attends over only $K = T/P$ positions rather than all $T$ bytes. The local model then fills in the fine-grained detail inside each patch. Because the large global model runs once per patch rather than once per byte, and every patch is decoded by the same small local model, MEGABYTE can also exploit more parallelism during generation than a flat byte-level transformer.^[1]^[3] In the paper's experiments the global and local models are sized independently: for example, a 1.3-billion-parameter global model is paired with a roughly 218-million-parameter local model for language modeling, and a 2.7-billion-parameter global model is used for high-resolution image modeling.^[3]
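The division of labor between the two models can be made concrete with a toy generation loop. The sketch below uses hypothetical stand-in functions (`global_step`, `local_step`) that return arbitrary deterministic bytes instead of real transformer outputs; its only purpose is to show the control flow described above, with one global call per patch and one local call per byte.

```python
def generate(num_patches: int, patch_size: int) -> tuple[bytes, dict[str, int]]:
    """Toy two-level decoding loop; the 'models' are placeholders, not transformers."""
    calls = {"global": 0, "local": 0}

    def global_step(previous_patches: list[bytes]) -> int:
        # Stand-in for the large global model: summarizes all earlier patches.
        calls["global"] += 1
        return sum(sum(p) for p in previous_patches) % 256

    def local_step(patch_context: int, bytes_so_far: bytes) -> int:
        # Stand-in for the small local model: predicts the next byte in the
        # current patch from the global context and the bytes decoded so far.
        calls["local"] += 1
        return (patch_context + 31 * len(bytes_so_far) + sum(bytes_so_far)) % 256

    patches: list[bytes] = []
    for _ in range(num_patches):
        context = global_step(patches)       # once per patch
        current = b""
        for _ in range(patch_size):          # once per byte
            current += bytes([local_step(context, current)])
        patches.append(current)
    return b"".join(patches), calls


output, calls = generate(num_patches=4, patch_size=8)
```

For 32 generated bytes the expensive stand-in is invoked 4 times and the cheap one 32 times, which is the source of the savings relative to a flat model that runs its full stack at every byte.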

Efficiency argument

Splitting the sequence into patches changes the dominant cost of self-attention. A flat transformer over $T$ bytes costs $O(T^2)$ in the attention layers. In MEGABYTE the global model attends over $T/P$ patches at cost $O((T/P)^2)$, and the local model attends within patches at cost $O(P \cdot T)$.^[1]^[3] Choosing the patch size so that $P \approx T^{1/3}$ minimizes the combined cost at roughly $O(T^{4/3})$, which is sub-quadratic in $T$. A second benefit is in the feedforward layers: because the heavy global feedforward computation is applied once per patch rather than once per byte, the same compute budget buys much larger feedforward layers, where transformers spend most of their parameters.^[1]^[4]
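The trade-off can be checked numerically. The sketch below counts one unit of work per query-key pair and ignores constant factors, model widths, and the feedforward layers, so it is a simplification rather than the paper's cost model; the function names are illustrative. Under this count the total is $T^2/P^2 + TP$, and setting its derivative with respect to $P$ to zero gives $P = (2T)^{1/3}$, which has the same $T^{1/3}$ scaling as the rule of thumb above.

```python
def flat_attention_cost(t: int) -> int:
    """Query-key pairs for a flat transformer over t bytes: t^2."""
    return t * t


def megabyte_attention_cost(t: int, p: int) -> int:
    """Global cost (t/p)^2 plus local cost (t/p) * p^2 = t * p."""
    if t % p:
        raise ValueError("patch size must divide the sequence length")
    k = t // p
    return k * k + k * p * p


def best_patch_size(t: int) -> int:
    """Brute-force search over patch sizes that divide t evenly."""
    candidates = [p for p in range(1, t + 1) if t % p == 0]
    return min(candidates, key=lambda p: megabyte_attention_cost(t, p))


T = 2 ** 20                      # about one million bytes
P = best_patch_size(T)           # 128, which equals (2 * T) ** (1 / 3)
speedup = flat_attention_cost(T) / megabyte_attention_cost(T, P)
```

For $T = 2^{20}$ the search returns $P = 128$, and the simplified attention count drops from $2^{40}$ to $3 \cdot 2^{26}$, a reduction of roughly 5,000 times.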

Experiments and results

MEGABYTE was evaluated across three modalities (text, images, and audio), in every case modeling raw bytes with no tokenizer. The results quoted below are reported in bits per byte (bpb), where lower is better.^[1]^[3]
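Bits per byte is the model's average negative log-likelihood per byte, expressed in base 2. The helper below is a minimal sketch assuming the loss has been summed in nats (natural logarithm) over the evaluated bytes, which is a common convention but an assumption here; the function name is illustrative.

```python
import math


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    return total_nll_nats / (num_bytes * math.log(2))


# Sanity check: a model that assigns probability 1/256 to every byte
# should score exactly 8 bits per byte (up to floating-point error).
uniform_nll = 1000 * math.log(256)
uniform_bpb = bits_per_byte(uniform_nll, 1000)
```

A uniform guess over the 256 byte values is the 8 bpb ceiling, so a score near 1.0 bpb means the model has removed most of that uncertainty.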

Language modeling

The paper used long-form text corpora including PG-19, Stories, Books, arXiv, and a code dataset, with patch size $P = 8$.^[3] On these long-context benchmarks, the byte-level MEGABYTE matched or beat strong baselines such as a standard byte-level transformer and PerceiverAR trained with comparable compute.^[1]^[3]

Dataset	MEGABYTE (bpb)	Transformer (bpb)	PerceiverAR (bpb)
PG-19	1.000	1.057	1.104
Stories	0.978	1.064	1.070
Books	1.007	1.097	1.104
arXiv	0.678	0.816	0.791
Code	0.411	0.575	0.546

The table compares byte-level models only. More broadly, the paper presented its language results as evidence that byte-level models can be competitive with subword models on long-context language modeling, rather than as a claim of beating the best tokenized systems outright.^[1]^[4]

Image modeling

On autoregressive density estimation for ImageNet, MEGABYTE reported state-of-the-art or competitive bits per byte at multiple resolutions, using larger patch sizes for larger images (for example $P = 12$ at 64x64 and $P = 192$ at higher resolutions).^[3] At 640x640 resolution each image is around 1.2 million bytes, which is where the "million-byte" capability is exercised on real data.^[3]

Resolution	MEGABYTE (bpb)	PerceiverAR (bpb)
64x64	3.40	3.40
256x256	3.158	3.373
640x640	2.282	2.345
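The sequence lengths behind these resolutions follow from simple arithmetic. The sketch below assumes square RGB images stored as 3 bytes per pixel, which is consistent with the roughly 1.2 million bytes quoted for 640x640, and uses the patch sizes given above; the function name is illustrative.

```python
def image_sequence(side: int, patch_size: int, channels: int = 3) -> tuple[int, int]:
    """Return (bytes per image, patches per image) for a square image.

    Assumes one byte per channel value, so T = side * side * channels.
    """
    num_bytes = side * side * channels
    if num_bytes % patch_size:
        raise ValueError("patch size must divide the image byte count")
    return num_bytes, num_bytes // patch_size


small = image_sequence(64, patch_size=12)     # (12288, 1024)
medium = image_sequence(256, patch_size=192)  # (196608, 1024)
large = image_sequence(640, patch_size=192)   # (1228800, 6400)
```

Even at 640x640 the global model attends over only 6,400 patch positions, although the underlying sequence is over 1.2 million bytes long.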

Audio modeling

MEGABYTE was also trained on raw audio files (16 kHz, 16-bit) using a sequence length of 524,288 bytes per example and patch size $P = 32$, reaching 3.477 bpb versus 3.543 for a PerceiverAR baseline.^[3] Modeling audio directly from raw files, with no spectrogram or codec front end, illustrates the architecture's modality independence.^[1]
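To put the audio sequence length in perspective, the figures above can be converted into a clip duration and a patch count. The sketch assumes single-channel audio with 2 bytes per 16-bit sample; the mono assumption and the function name are not taken from the source.

```python
def audio_sequence(num_bytes: int, sample_rate_hz: int, bytes_per_sample: int,
                   patch_size: int) -> tuple[float, int]:
    """Return (duration in seconds, number of patches) for a raw mono audio example."""
    if num_bytes % bytes_per_sample or num_bytes % patch_size:
        raise ValueError("byte count must align with sample and patch sizes")
    samples = num_bytes // bytes_per_sample
    return samples / sample_rate_hz, num_bytes // patch_size


duration_s, num_patches = audio_sequence(
    num_bytes=524_288, sample_rate_hz=16_000, bytes_per_sample=2, patch_size=32
)
```

Under these assumptions each training example covers about 16.4 seconds of audio, which the global model sees as 16,384 patches.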

Significance and relationship to the Byte Latent Transformer

MEGABYTE argued that tokenizer-free autoregressive modeling is viable at scale, showing that a single byte-level decoder could be competitive across text, images, and audio while handling contexts far longer than a flat transformer could afford.^[1] Its central idea, factoring a long sequence into a coarse global model over patches plus a fine local model within patches, has since appeared in several follow-up architectures.

The most direct successor from the same lab is the Byte Latent Transformer, published by Meta FAIR in December 2024 (Pagnoni et al.).^[5]^[6] BLT keeps the byte-level, patch-based philosophy but changes how patches are formed. Whereas MEGABYTE strides over the sequence in fixed-size blocks (for example, every 8 bytes in its language-modeling experiments), BLT segments the byte stream dynamically using the entropy of a small next-byte prediction model, creating patches with roughly uniform information density.^[2]^[6] The BLT authors describe the limitation they were addressing: with a fixed stride "compute is not dynamically allocated to where it is needed most," so a model may waste a step on predictable bytes (such as whitespace in code) or under-spend on information-dense bytes (such as the start of a new word or a mathematical expression).^[6] According to its authors, allocating more compute to high-entropy regions let BLT match tokenization-based LLM performance at scale for the first time, while improving inference efficiency and robustness.^[5]^[6] In this lineage MEGABYTE is the fixed-patch starting point and BLT is the dynamic-patch refinement.
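The contrast between the two patching schemes can be illustrated with a toy example. The sketch below assumes a simple rule in which a new patch starts wherever the next-byte entropy exceeds a fixed threshold; the threshold value, the hand-written entropy list, and the function names are all hypothetical, and a real system would obtain the entropies from a trained next-byte model.

```python
import math


def entropy_bits(probs: list[float]) -> float:
    """Shannon entropy in bits of a predicted next-byte distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def fixed_patches(n: int, stride: int) -> list[tuple[int, int]]:
    """MEGABYTE-style boundaries: a new patch every `stride` bytes."""
    return [(start, min(start + stride, n)) for start in range(0, n, stride)]


def entropy_patches(entropies: list[float], threshold: float) -> list[tuple[int, int]]:
    """Entropy-style boundaries: start a new patch at each position whose
    next-byte entropy exceeds `threshold` (an assumed, simplified rule)."""
    if not entropies:
        return []
    starts = [0] + [i for i, h in enumerate(entropies) if i > 0 and h > threshold]
    ends = starts[1:] + [len(entropies)]
    return list(zip(starts, ends))


# Made-up per-byte entropies: high at "word starts", low elsewhere.
entropies = [3.1, 0.4, 0.2, 0.1, 2.8, 0.3, 0.2, 0.1, 0.1, 0.1, 2.9, 0.5]
fixed = fixed_patches(len(entropies), stride=4)        # [(0, 4), (4, 8), (8, 12)]
dynamic = entropy_patches(entropies, threshold=2.0)    # [(0, 4), (4, 10), (10, 12)]
```

In this toy input the fixed stride cuts through the long predictable run at positions 4 to 9, while the entropy rule keeps it in one patch and spends a fresh patch boundary on the high-entropy byte at position 10.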

Paper and authorship

The work was published as "MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers" by Lili Yu, Daniel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, and Mike Lewis, all affiliated with Meta AI (FAIR), with Zettlemoyer also at the University of Washington.^[1]^[3] The preprint appeared on arXiv on 12 May 2023 (revised 19 May 2023) and the paper was accepted to NeurIPS 2023.^[1]^[7] Lili Yu, the lead author, later co-authored the Byte Latent Transformer.^[6]

References

1. Yu, Lili; Simig, Daniel; Flaherty, Colin; Aghajanyan, Armen; Zettlemoyer, Luke; Lewis, Mike. "MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers." arXiv:2305.07185, 12 May 2023. https://arxiv.org/abs/2305.07185
2. "Byte Latent Transformer: Patches Scale Better Than Tokens." Graphcore Research Blog (discussion of MEGABYTE fixed patches versus BLT entropy-based patches). https://graphcore-research.github.io/byte-latent-transformer/
3. Yu, Lili; et al. "MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers" (full text PDF). arXiv:2305.07185. https://arxiv.org/pdf/2305.07185
4. "Meta AI's MegaByte Scalable Architecture for Long Sequence Modelling Outperforms Existing Byte-Level Models." Synced, 19 May 2023. https://syncedreview.com/2023/05/19/meta-ais-megabyte-scalable-architecture-for-long-sequence-modelling-outperforms-existing-byte-level-models/
5. "Byte Latent Transformer: Patches Scale Better Than Tokens." AI at Meta (Research). https://ai.meta.com/research/publications/byte-latent-transformer-patches-scale-better-than-tokens/
6. Pagnoni, Artidoro; et al. "Byte Latent Transformer: Patches Scale Better Than Tokens." arXiv:2412.09871, 13 December 2024 (HTML full text, related work citing MEGABYTE). https://arxiv.org/html/2412.09871v1
7. "MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers." NeurIPS 2023 (Advances in Neural Information Processing Systems 36), poster. https://proceedings.neurips.cc/paper_files/paper/2023/hash/f8f78f8043f35890181a824e53a57134-Abstract-Conference.html
