Byte Latent Transformer
Last reviewed
Jun 3, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,783 words
The Byte Latent Transformer (BLT) is a tokenizer-free large language model architecture introduced by researchers at Meta AI's Fundamental AI Research (FAIR) group in December 2024. Instead of mapping text to a fixed vocabulary of subword tokens, BLT operates directly on the raw bytes of an input and groups them into variable-length units called patches. The boundaries between patches are chosen dynamically from the entropy of a small byte-level prediction model, so the architecture spends more compute on parts of a sequence that are harder to predict and less on parts that are easy. The paper describing the method, "Byte Latent Transformer: Patches Scale Better Than Tokens," reports the first compute-controlled scaling study of byte-level models up to 8 billion parameters, and argues that at fixed inference cost BLT scales better than conventional tokenization-based models while matching the quality of Llama 3 at the 8B scale. [1][2]
Almost all modern language models first convert text into tokens using a learned subword vocabulary, typically through byte-pair encoding (BPE). Tokenization compresses text into a manageable number of units and lets a transformer operate over short sequences, but it introduces several known drawbacks: a fixed vocabulary is biased toward the languages and scripts seen during its construction, models become brittle to spelling, casing, and noise because tokens are opaque chunks, and the same amount of model compute is spent on every token regardless of how predictable it is. Byte-level models avoid a vocabulary entirely and are more robust, but a naive byte-level transformer must process far longer sequences (one step per byte rather than per token), which has historically made them too expensive to train and run at the scale of token-based models. BLT is an attempt to keep the robustness of bytes while recovering the efficiency that tokenization provides, by learning where to draw unit boundaries rather than fixing them in advance. [1][2]
The central idea is that patch boundaries should track how difficult the next byte is to predict. BLT first trains a small, separate byte-level language model whose only job is to estimate the entropy of the next-byte distribution at each position. In the paper this entropy model is a transformer with roughly 100 million parameters, 14 layers, a hidden dimension of 512, and sliding-window attention over the previous 512 bytes. [1]
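As a minimal sketch of the quantity involved, the function below computes the Shannon entropy of a next-byte distribution over the 256 possible byte values. The function name and the choice of bits (log base 2) are assumptions made for illustration. In BLT the probabilities would come from the trained entropy model, whereas here they are hand-written toy distributions.

```python
import math


def next_byte_entropy(probs):
    """Shannon entropy, in bits, of a distribution over the 256 byte values.

    `probs` stands in for the output of a byte-level language model at one
    position; this sketch does not include such a model.
    """
    if len(probs) != 256:
        raise ValueError("expected one probability per byte value (256)")
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    # Zero-probability entries contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


# Maximally uncertain: every byte is equally likely.
uniform = [1.0 / 256] * 256

# Fully predictable: all mass on a single byte.
peaked = [0.0] * 256
peaked[ord("e")] = 1.0

print(next_byte_entropy(uniform))  # 8.0 bits, the maximum for 256 outcomes
print(next_byte_entropy(peaked) == 0.0)  # True
```

High values mark positions where the next byte is hard to guess, and low values mark positions where it is nearly determined by the preceding context.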
Given the per-byte entropies, BLT segments a sequence using one of two rules: [1]
| Patching rule | Boundary condition | Idea |
|---|---|---|
| Global threshold | Start a new patch when the next-byte entropy exceeds a fixed global value | Cut wherever the model is uncertain in absolute terms |
| Approximate monotonic constraint | Start a new patch when entropy rises sharply relative to the previous byte | Cut at points where uncertainty suddenly increases |
The effect is that predictable spans, such as the end of a common word, are absorbed into a single long patch, while uncertain spans, such as the first character of a new word, tend to start fresh patches. The average patch size is a tunable hyperparameter: the paper trains models with average patch sizes around 4.5, 6, and 8 bytes, where larger average patches mean fewer global-transformer steps per byte and therefore lower cost. Because patch length is determined only by the bytes seen so far, the same segmentation can be applied incrementally at inference time. [1][3]
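The two boundary rules can be sketched as follows. This is an illustrative reading of the rules in the table, not the released implementation: the function names, the threshold values, and the per-byte entropies are all made up for the example, and `entropies[i]` is assumed to be the entropy of predicting byte `i` from the bytes before it.

```python
def patch_starts_global(entropies, threshold):
    """Global-threshold rule: byte i starts a patch if its entropy > threshold."""
    starts = [0]  # the first byte always opens a patch
    for i in range(1, len(entropies)):
        if entropies[i] > threshold:
            starts.append(i)
    return starts


def patch_starts_monotonic(entropies, rise):
    """Approximate monotonic rule: byte i starts a patch if entropy jumps
    by more than `rise` relative to the previous byte."""
    starts = [0]
    for i in range(1, len(entropies)):
        if entropies[i] - entropies[i - 1] > rise:
            starts.append(i)
    return starts


def split_into_patches(data, starts):
    """Cut `data` at the given start indices."""
    bounds = list(starts) + [len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]


data = b"the cat"
# Hypothetical entropies: high at the start of each word, low inside it.
entropies = [3.0, 0.5, 0.3, 0.4, 2.8, 0.9, 0.6]

print(split_into_patches(data, patch_starts_global(entropies, 2.0)))
# [b'the ', b'cat']
print(split_into_patches(data, patch_starts_monotonic(entropies, 1.0)))
# [b'the ', b'cat']
```

Both functions look only at entropies up to the current byte, which is what allows the same segmentation to be computed incrementally during generation. Raising either threshold produces fewer, longer patches.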
BLT is built from three modules. Two lightweight local models handle the byte level at the input and output, and a much larger model does the heavy reasoning over patches in the middle. [1]
| Module | Role | Relative size |
|---|---|---|
| Local encoder | Maps the raw byte sequence into one representation per patch | Few layers, small |
| Latent global transformer | Processes the sequence of patch representations; carries most of the model's capacity and dominates the FLOPs | Many layers, large |
| Local decoder | Turns patch representations back into a prediction over the next raw bytes | Few layers, small |
The local encoder is a small transformer that embeds individual bytes and then uses cross-attention to pool the bytes belonging to each patch into a single patch representation. To give each byte more context than its own value, the encoder augments byte embeddings with hash n-gram embeddings: for n-grams of length 3 through 8, a rolling hash indexes into learned embedding tables, and these are added to the per-byte embeddings. The cross-attention follows the design used in the Perceiver architecture, with patch representations acting as queries that attend only to the byte keys and values inside their own patch. [1]
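A rough sketch of the hash n-gram lookup is given below. Everything specific in it is an assumption for illustration: the polynomial hash, the table size of 1000, the embedding width of 4, the random table values, and the choice to take n-grams that end at the current byte. The paper's actual hash function and table sizes are not reproduced here, and any normalization applied after the sum is omitted.

```python
import random

NGRAM_SIZES = range(3, 9)  # n = 3 through 8, as described above
TABLE_SIZE = 1000          # hypothetical number of rows per n-gram table
DIM = 4                    # hypothetical embedding width

rng = random.Random(0)
# Stand-ins for learned parameters: one byte table, one table per n.
byte_table = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(256)]
ngram_tables = {
    n: [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(TABLE_SIZE)]
    for n in NGRAM_SIZES
}


def polynomial_hash(ngram, base=257, mod=(1 << 61) - 1):
    """Simple polynomial hash of a byte string (an assumed stand-in).

    A rolling implementation would update this value incrementally as the
    window slides; it is recomputed from scratch here for clarity.
    """
    h = 0
    for b in ngram:
        h = (h * base + b + 1) % mod
    return h


def ngram_table_indices(data, pos, table_size=TABLE_SIZE):
    """Map each n-gram ending at `pos` to a row index in its table."""
    indices = {}
    for n in NGRAM_SIZES:
        start = pos - n + 1
        if start < 0:
            continue  # not enough preceding bytes for this n
        indices[n] = polynomial_hash(data[start:pos + 1]) % table_size
    return indices


def embed_byte(data, pos):
    """Byte embedding plus the embeddings of all n-grams ending at `pos`."""
    vec = list(byte_table[data[pos]])
    for n, idx in ngram_table_indices(data, pos).items():
        vec = [v + e for v, e in zip(vec, ngram_tables[n][idx])]
    return vec


text = b"hello world"
print(sorted(ngram_table_indices(text, 4)))  # [3, 4, 5]
print(len(embed_byte(text, 10)))             # 4
```

Because the table index is a hash modulo the table size, different n-grams can collide on the same row. That is the trade-off that keeps the tables at a fixed size regardless of how many distinct n-grams occur in the data.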
The latent global transformer is the main model. It is a standard transformer with block-causal attention that operates over the (much shorter) sequence of patch representations rather than over bytes. Because there are far fewer patches than bytes, this large module runs far fewer steps than a pure byte-level transformer would, which is the source of BLT's efficiency gains. [1]
The local decoder is another small transformer. It takes the patch representations produced by the global model and, again using cross-attention (this time with byte queries attending to patch keys and values), unrolls them back into byte-level predictions so the model can generate raw output bytes one at a time. [1]
A consequence of this design is that BLT decouples vocabulary size from compute. In a token-based model, processing more text per step requires a larger vocabulary, which inflates the embedding and output layers. In BLT, longer patches reduce the number of expensive global-transformer steps without changing the byte-level input and output, so model size and patch size can be increased together to trade quality against cost. [1][2]
The paper presents what its authors describe as the first FLOP-controlled scaling study of byte-level language models, covering models of up to 8 billion parameters trained on up to 4 trillion bytes. Scaling-law experiments used data comparable to the Llama 2 training set. A separate 1-trillion-token dataset assembled from public sources (including a subset of DataComp-LM), referred to as BLT-1T, was used to train the models evaluated on downstream tasks. [1]
| Quantity | Value |
|---|---|
| Largest model | 8B parameters |
| Largest training budget | 4T bytes |
| Entropy model | ~100M parameters, 14 layers, hidden dim 512 |
| Hash n-gram sizes | 3 to 8 |
| Average patch sizes studied | ~4.5, 6, 8 bytes |
| Released checkpoints | BLT 1B, BLT 7B |
Two pretrained checkpoints, a 1B and a 7B model, were released alongside code. [4]
The headline empirical claim is that BLT can match a strong tokenizer-based baseline at the 8B scale while costing less to run. The paper compares an 8B "BLT-Entropy" model trained on 4.5T bytes against an 8B Llama 3 model trained on 1T tokens of the same data, evaluating both on standard benchmarks. [1]
| Benchmark | Llama 3 8B | BLT-Entropy 8B |
|---|---|---|
| ARC-Easy | 77.6 | 79.6 |
| ARC-Challenge | 53.3 | 52.1 |
| HellaSwag | 79.1 | 80.6 |
| MMLU | 58.1 | 57.4 |
| Average (over the paper's full task set, not only the four rows above) | 60.0 | 61.1 |
The authors summarize this as matching the training-FLOP-controlled performance of Llama 3 up to the 8B scale. Separately, because the large global transformer runs once per patch, inference cost is roughly inversely proportional to average patch size. A model with an average patch size of about 8 bytes therefore runs at close to half the inference FLOPs of a BPE-tokenized model whose tokens average roughly 4.4 bytes. The paper frames this as up to 50% fewer FLOPs at inference, with the option to trade small quality losses for that efficiency. The broader scaling conclusion is that, for a fixed inference budget, BLT scales better than tokenization-based models by growing patch size and model size at the same time. [1][2][3]
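The arithmetic behind the "close to half" figure can be checked with a back-of-the-envelope calculation. The sketch below counts only steps of the large model and assumes that model is the same size in both cases. It ignores the cost of the local encoder, local decoder, and entropy model, so it is an approximation of the argument rather than a FLOP count from the paper. The function names are invented for the example.

```python
def large_model_steps(num_bytes, avg_unit_bytes):
    """Steps of the large model needed to cover `num_bytes` of text when
    each step consumes one unit (token or patch) of the given average size."""
    return num_bytes / avg_unit_bytes


def relative_step_cost(avg_patch_bytes, avg_token_bytes):
    """Large-model steps for a patch-based model relative to a token-based one."""
    return avg_token_bytes / avg_patch_bytes


TOKEN_BYTES = 4.4  # average bytes per BPE token, as quoted above

for patch_bytes in (4.5, 6.0, 8.0):
    ratio = relative_step_cost(patch_bytes, TOKEN_BYTES)
    print(f"patch size {patch_bytes}: {ratio:.2f}x the large-model steps")
# patch size 4.5: 0.98x the large-model steps
# patch size 6.0: 0.73x the large-model steps
# patch size 8.0: 0.55x the large-model steps
```

At an average patch size of 8 bytes the large model takes about 55% as many steps as it would over 4.4-byte tokens, which is consistent with the "up to 50% fewer FLOPs" framing once rounding is taken into account.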
Because BLT sees characters directly rather than through opaque tokens, it does noticeably better on tasks that depend on the internal spelling of words and on inputs that differ from clean training text. On the CUTE benchmark, which probes character-level understanding and manipulation, the paper reports BLT scoring 54.1 against 27.5 for the comparable Llama 3 model, with near-perfect accuracy on spelling subtasks. On noised versions of HellaSwag, where the input text is corrupted with character-level perturbations, BLT holds an average advantage of about 8 points over the equivalently trained token model. The paper also reports gains on low-resource machine translation using the FLORES benchmark and on a grapheme-to-phoneme task, consistent with the intuition that byte-level modeling helps most where a fixed subword vocabulary provides poor coverage. [1]
The work was first posted to arXiv on 13 December 2024 and was subsequently published at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), held in Vienna, as paper 2025.acl-long.453. The paper is credited to Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, and Srinivasan Iyer. Meta released training and inference code and the pretrained checkpoints on GitHub and Hugging Face, with the code under a CC-BY-NC-4.0 license and partly based on Meta's Lingua codebase. [1][4]
BLT drew wide attention as one of the more credible challenges to subword tokenization, a component that had been nearly universal in large language models. Commentators noted that it sits in a longer line of byte- and character-level work (such as ByT5, CANINE, and MEGABYTE, the last of which shares authors with BLT) but is distinctive for demonstrating compute-matched parity with a production-grade tokenized model at the 8B scale and for its learned, entropy-driven segmentation. [3][5]