Byte Latent Transformer

Updated Jul 16, 2026

The Byte Latent Transformer (BLT) is a tokenizer-free large language model architecture introduced by researchers at Meta AI's Fundamental AI Research (FAIR) group in December 2024. Instead of mapping text to a fixed vocabulary of subword tokens, BLT operates directly on the raw bytes of an input and groups them into variable-length units called patches. The boundaries between patches are chosen dynamically from the entropy of a small byte-level prediction model, so the architecture spends more compute on parts of a sequence that are harder to predict and less on parts that are easy. The paper describing the method, "Byte Latent Transformer: Patches Scale Better Than Tokens," reports what its authors describe as the first FLOP-controlled scaling study of byte-level models, covering models of up to 8 billion parameters, and argues that at a fixed inference cost BLT scales better than conventional tokenization-based models while matching the quality of Llama 3 at the 8B scale. ^[1]^[2]

Motivation

Almost all modern language models first convert text into tokens using a learned subword vocabulary, typically through byte-pair encoding (BPE). Tokenization compresses text into a manageable number of units and lets a transformer operate over short sequences, but it introduces several known drawbacks: a fixed vocabulary is biased toward the languages and scripts seen during its construction, models become brittle to spelling, casing, and noise because tokens are opaque chunks, and the same amount of model compute is spent on every token regardless of how predictable it is. Byte-level models avoid a vocabulary entirely and are more robust, but a naive byte-level transformer must process far longer sequences (one step per byte rather than per token), which has historically made them too expensive to train and run at the scale of token-based models. BLT is an attempt to keep the robustness of bytes while recovering the efficiency that tokenization provides, by learning where to draw unit boundaries rather than fixing them in advance. ^[1]^[2]

Entropy-based patching

The central idea is that patch boundaries should track how difficult the next byte is to predict. BLT first trains a small, separate byte-level language model whose only job is to estimate the entropy of the next-byte distribution at each position. In the paper this entropy model is a transformer with roughly 100 million parameters, 14 layers, a hidden dimension of 512, and sliding-window attention over the previous 512 bytes. ^[1]
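As a minimal illustration of the quantity the entropy model supplies, the Shannon entropy of a next-byte distribution can be computed as follows. The function name `next_byte_entropy`, the choice of bits (log base 2) as the unit, and the two toy distributions are assumptions made for this sketch; in BLT the probabilities would come from the trained entropy model rather than from hand-written lists.

```python
import math


def next_byte_entropy(probs):
    """Shannon entropy, in bits, of a distribution over the 256 byte values."""
    if len(probs) != 256:
        raise ValueError("expected one probability per byte value")
    # Zero-probability entries contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


# Toy case 1: every byte equally likely, the most uncertain prediction possible.
uniform = [1.0 / 256] * 256

# Toy case 2: all probability mass on the byte "e", a fully certain prediction.
peaked = [0.0] * 256
peaked[ord("e")] = 1.0

high = next_byte_entropy(uniform)  # 8 bits for a uniform distribution
low = next_byte_entropy(peaked)    # 0 bits for a certain prediction
```

High values mark positions where a patching rule is likely to cut, and low values mark positions that can be absorbed into the current patch.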

Given the per-byte entropies, BLT segments a sequence using one of two rules: ^[1]

Patching rule	Boundary condition	Idea
Global threshold	Start a new patch when the next-byte entropy exceeds a fixed global value	Cut wherever the model is uncertain in absolute terms
Approximate monotonic constraint	Start a new patch when entropy rises sharply relative to the previous byte	Cut at points where uncertainty suddenly increases

The effect is that predictable spans, such as the end of a common word, are absorbed into a single long patch, while uncertain spans, such as the first character of a new word, tend to start fresh patches. The average patch size is a tunable hyperparameter: the paper trains models with average patch sizes around 4.5, 6, and 8 bytes, where larger average patches mean fewer global-transformer steps per byte and therefore lower cost. Because patch length is determined only by the bytes seen so far, the same segmentation can be applied incrementally at inference time. ^[1]^[3]
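The two boundary rules in the table above can be sketched in a few lines. This is an illustrative sketch rather than the released implementation: the function names, the threshold values, and the hand-written entropy list are hypothetical, and it assumes `entropies[i]` is the uncertainty about byte `i` given the bytes before it, with position 0 always starting a patch.

```python
def patch_starts_global(entropies, threshold):
    """Global rule: a new patch starts wherever the entropy exceeds a fixed value."""
    starts = [0]
    for i in range(1, len(entropies)):
        if entropies[i] > threshold:
            starts.append(i)
    return starts


def patch_starts_relative(entropies, jump):
    """Relative rule: a new patch starts wherever entropy rises by more than `jump`
    compared with the previous byte."""
    starts = [0]
    for i in range(1, len(entropies)):
        if entropies[i] - entropies[i - 1] > jump:
            starts.append(i)
    return starts


def to_patches(data, starts):
    """Slice a byte string into patches given the patch start offsets."""
    if not data:
        return []
    bounds = list(starts) + [len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]


data = b"the cat"
# Made-up entropies: high at the start, at the space, and at the first letter of "cat".
entropies = [3.0, 0.5, 0.3, 2.2, 2.8, 0.6, 0.4]

global_patches = to_patches(data, patch_starts_global(entropies, threshold=2.0))
relative_patches = to_patches(data, patch_starts_relative(entropies, jump=1.0))
average_patch_size = len(data) / len(global_patches)
```

With these toy numbers the global rule cuts at both the space and the "c", while the relative rule cuts only at the space, because the entropy rises sharply there and then stays high. Raising either threshold yields fewer, longer patches, which is how the average patch size is tuned.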

Architecture

BLT is built from three modules. A lightweight local model handles the byte level at the input and output, and a much larger model does the heavy reasoning over patches in the middle. ^[1]

Module	Role	Relative size
Local encoder	Maps the raw byte sequence into one representation per patch	Few layers, small
Latent global transformer	Processes the sequence of patch representations; carries most of the model's capacity and dominates the FLOPs	Many layers, large
Local decoder	Turns patch representations back into a prediction over the next raw bytes	Few layers, small

The local encoder is a small transformer that embeds individual bytes and then uses cross-attention to pool the bytes belonging to each patch into a single patch representation. To give each byte more context than its own value, the encoder augments byte embeddings with hash n-gram embeddings: for n-grams of length 3 through 8, a rolling hash indexes into learned embedding tables, and these are added to the per-byte embeddings. The cross-attention follows the design used in the Perceiver architecture, with patch representations acting as queries that attend only to the byte keys and values inside their own patch. ^[1]
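The hash n-gram augmentation described above can be sketched as follows. This is a simplified illustration under stated assumptions: the polynomial hash, the table size, the embedding dimension, and the random initialisation are all hypothetical stand-ins (real tables are learned), and the hash is recomputed per n-gram rather than updated incrementally.

```python
import random

NGRAM_SIZES = range(3, 9)  # n = 3 through 8, as described above


def poly_hash(ngram, base=257, modulus=(1 << 61) - 1):
    """Polynomial hash of a byte n-gram (an illustrative choice of hash)."""
    h = 0
    for b in ngram:
        h = (h * base + b + 1) % modulus
    return h


def make_tables(table_size, dim, seed=0):
    """One randomly initialised table per n-gram size, standing in for learned tables."""
    rng = random.Random(seed)
    return {
        n: [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(table_size)]
        for n in NGRAM_SIZES
    }


def augment(data, byte_emb, tables):
    """Return one vector per byte: its byte embedding plus the hashed embeddings
    of every n-gram (n = 3..8) that ends at that byte."""
    out = []
    for i, b in enumerate(data):
        vec = list(byte_emb[b])
        for n in NGRAM_SIZES:
            if i + 1 < n:
                continue  # fewer than n bytes seen so far, so no n-gram ends here
            table = tables[n]
            row = table[poly_hash(data[i + 1 - n : i + 1]) % len(table)]
            vec = [v + r for v, r in zip(vec, row)]
        out.append(vec)
    return out


dim = 4
rng = random.Random(1)
byte_emb = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(256)]
tables = make_tables(table_size=1000, dim=dim)
vectors = augment(b"patches", byte_emb, tables)
```

Because distinct n-grams can hash to the same table row, the tables stay a fixed size however many n-grams occur in the data, at the cost of occasional collisions.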

The latent global transformer is the main model. It is a standard transformer with block-causal attention that operates over the (much shorter) sequence of patch representations rather than over bytes. Because there are far fewer patches than bytes, this large module runs many fewer steps than a pure byte-level transformer would, which is the source of BLT's efficiency gains. ^[1]

The local decoder is another small transformer. It takes the patch representations produced by the global model and, again using cross-attention (this time with byte queries attending to patch keys and values), unrolls them back into byte-level predictions so the model can generate raw output bytes one at a time. ^[1]
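The encoder-side restriction described earlier, in which each patch query attends only to the bytes inside its own patch, amounts to a boolean attention mask. The sketch below builds such a mask from a list of patch lengths; the function name and the example lengths are hypothetical, and it covers only the encoder direction, not the decoder's byte-to-patch attention.

```python
def patch_membership_mask(patch_lengths):
    """mask[p][j] is True when byte j lies inside patch p, so patch query p
    may attend to byte j; every other query/key pair is masked out."""
    total = sum(patch_lengths)
    mask = []
    start = 0
    for length in patch_lengths:
        mask.append([start <= j < start + length for j in range(total)])
        start += length
    return mask


# Three patches covering a 7-byte sequence, for example b"the", b" ", b"cat".
mask = patch_membership_mask([3, 1, 3])
```

Each byte column contains exactly one `True`, reflecting that every byte belongs to exactly one patch.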

A consequence of this design is that BLT decouples vocabulary size from compute. In a token-based model, processing more text per step requires a larger vocabulary, which inflates the embedding and output layers. In BLT, longer patches reduce the number of expensive global-transformer steps without changing the byte-level input and output, so model size and patch size can be increased together to trade quality against cost. ^[1]^[2]

Scale and training data

The paper presents what its authors describe as the first FLOP-controlled scaling study of byte-level language models, with models trained up to 8 billion parameters and 4 trillion training bytes. Scaling-law experiments used data comparable to the Llama 2 training set, and a separate 1-trillion-token dataset assembled from public sources (including a subset of DataComp-LM), referred to as BLT-1T, was used to train the models evaluated on downstream tasks. ^[1]

Quantity	Value
Largest model	8B parameters
Largest training budget	4T bytes
Entropy model	~100M parameters, 14 layers, hidden dim 512
Hash n-gram sizes	3 to 8
Average patch sizes studied	~4.5, 6, 8 bytes
Released checkpoints	BLT 1B, BLT 7B

Two pretrained checkpoints, a 1B and a 7B model, were released alongside code. ^[4]

Comparison with Llama 3

The headline empirical claim is that BLT can match a strong tokenizer-based baseline at the 8B scale while costing less to run. The paper compares an 8B "BLT-Entropy" model trained on 4.5T bytes against an 8B Llama 3 model trained on 1T tokens of the same data, evaluating both on standard benchmarks. ^[1]

Benchmark	Llama 3 8B	BLT-Entropy 8B
ARC-Easy	77.6	79.6
ARC-Challenge	53.3	52.1
HellaSwag	79.1	80.6
MMLU	58.1	57.4
Average (all tasks in the paper's comparison, including ones not listed here)	60.0	61.1

The authors summarize this as matching the training-FLOP-controlled performance of Llama 3 up to the 8B scale. Separately, because inference cost is roughly inversely proportional to average patch size, a model using an average patch size of about 8 bytes runs at close to half the inference FLOPs of a BPE-tokenized model whose tokens average roughly 4.4 bytes; the paper frames this as up to 50% fewer FLOPs at inference, with the option to trade small quality losses for that efficiency. The broader scaling conclusion is that, for a fixed inference budget, BLT scales better than tokenization-based models by growing patch size and model size at the same time. ^[1]^[2]^[3]
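The "close to half" figure follows from simple arithmetic on bytes per step. The sketch below is a back-of-the-envelope estimate, not the paper's FLOP accounting: it assumes cost is dominated by the large model and scales with the number of steps it takes per byte, and it ignores the local encoder, local decoder, and entropy model. The function name is hypothetical.

```python
def relative_large_model_steps(patch_bytes, token_bytes):
    """Steps per byte taken by the large model in a patch-based model,
    relative to a token-based model: (1 / patch_bytes) / (1 / token_bytes)."""
    return token_bytes / patch_bytes


# Average patch of 8 bytes versus BPE tokens averaging about 4.4 bytes.
ratio = relative_large_model_steps(patch_bytes=8.0, token_bytes=4.4)
saving = 1.0 - ratio  # fraction of large-model steps avoided
```

Under these assumptions the ratio is 4.4 / 8 = 0.55, so the large model takes about 45% fewer steps per byte, in line with the "up to 50%" framing.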

Robustness and long-tail behavior

Because BLT sees characters directly rather than through opaque tokens, it does noticeably better on tasks that depend on the internal spelling of words and on inputs that differ from clean training text. On the CUTE benchmark, which probes character-level understanding and manipulation, the paper reports BLT scoring about 54.1 against roughly 27.5 for the comparable Llama 3 model, with near-perfect accuracy on spelling subtasks. On noised versions of HellaSwag, where the input text is corrupted with character-level perturbations, BLT holds an average advantage of about 8 points over the equivalently trained token model. The paper also reports gains on low-resource machine translation using the FLORES benchmark and on a grapheme-to-phoneme task, consistent with the intuition that byte-level modeling helps most where a fixed subword vocabulary provides poor coverage. ^[1]

Reception and status

The work was first posted to arXiv on 13 December 2024 and was subsequently published at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), held in Vienna, as paper 2025.acl-long.453. The paper is credited to Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, and Srinivasan Iyer. Meta released training and inference code and the pretrained checkpoints on GitHub and Hugging Face, with the code under a CC-BY-NC-4.0 license and partly based on Meta's Lingua codebase. ^[1]^[4]^[6]

BLT drew wide attention as one of the more credible challenges to subword tokenization, a component that had been nearly universal in large language models. Commentators noted that it sits in a longer line of byte- and character-level work (such as ByT5, CANINE, and MEGABYTE, the last of which shares authors with BLT) but is distinctive for demonstrating compute-matched parity with a production-grade tokenized model at the 8B scale and for its learned, entropy-driven segmentation. ^[3]^[5]

References

1. Pagnoni, A., Pasunuru, R., Rodriguez, P., Nguyen, J., Muller, B., Li, M., Zhou, C., Yu, L., Weston, J., Zettlemoyer, L., Ghosh, G., Lewis, M., Holtzman, A., & Iyer, S. (2024). "Byte Latent Transformer: Patches Scale Better Than Tokens." arXiv:2412.09871. https://arxiv.org/abs/2412.09871 (HTML: https://arxiv.org/html/2412.09871v1)
2. "Byte Latent Transformer: Patches Scale Better Than Tokens." Research, AI at Meta. https://ai.meta.com/research/publications/byte-latent-transformer-patches-scale-better-than-tokens/
3. "Byte Latent Transformer: Patches Scale Better Than Tokens." Graphcore Research Blog. https://graphcore-research.github.io/byte-latent-transformer/
4. facebookresearch/blt: Code for BLT research paper. GitHub. https://github.com/facebookresearch/blt
5. Wiggers, K. "Meta's new BLT architecture replaces tokens to make LLMs more efficient and versatile." VentureBeat. https://venturebeat.com/ai/metas-new-blt-architecture-replaces-tokens-to-make-llms-more-efficient-and-versatile
6. Pagnoni, A., et al. (2025). "Byte Latent Transformer: Patches Scale Better Than Tokens." Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), 2025.acl-long.453. https://aclanthology.org/2025.acl-long.453/
