# Byte Latent Transformer

> Source: https://aiwiki.ai/wiki/byte_latent_transformer
> Updated: 2026-07-16
> Categories: Large Language Models, Meta AI, Model Architecture
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

The **Byte Latent Transformer** (**BLT**) is a tokenizer-free [large language model](/wiki/large_language_model) architecture introduced by researchers at [Meta AI](/wiki/meta_ai)'s Fundamental AI Research (FAIR) group in December 2024. Instead of mapping text to a fixed vocabulary of subword tokens, BLT operates directly on the raw [bytes](/wiki/byte) of an input and groups them into variable-length units called *patches*. The boundaries between patches are chosen dynamically from the entropy of a small byte-level prediction model, so the architecture spends more compute on parts of a sequence that are harder to predict and less on parts that are easy. The paper describing the method, "Byte Latent Transformer: Patches Scale Better Than Tokens," reports the first compute-controlled scaling study of byte-level models up to 8 billion parameters, and argues that at fixed inference cost BLT scales better than conventional tokenization-based models while matching the quality of [Llama 3](/wiki/llama_3) at the 8B scale. [1][2]

## Motivation

Almost all modern language models first convert text into tokens using a learned subword vocabulary, typically through byte-pair encoding ([BPE](/wiki/byte_pair_encoding)). [Tokenization](/wiki/tokenization) compresses text into a manageable number of units and lets a [transformer](/wiki/transformer) operate over short sequences, but it has several known drawbacks. A fixed vocabulary is biased toward the languages and scripts seen during its construction. Models become brittle to spelling variation, casing, and noise because tokens are opaque chunks. The same amount of model compute is also spent on every token regardless of how predictable it is. Byte-level models avoid a vocabulary entirely and are more robust, but a naive byte-level transformer must process far longer sequences (one step per byte rather than per token), which has historically made such models too expensive to train and run at the scale of token-based models. BLT is an attempt to keep the robustness of bytes while recovering the efficiency that tokenization provides, by learning where to draw unit boundaries rather than fixing them in advance. [1][2]

## Entropy-based patching

The central idea is that patch boundaries should track how difficult the next byte is to predict. BLT first trains a small, separate byte-level language model whose only job is to estimate the entropy of the next-byte distribution at each position. In the paper this *entropy model* is a transformer with roughly 100 million parameters, 14 layers, a hidden dimension of 512, and sliding-window attention over the previous 512 bytes. [1]
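
The quantity the entropy model supplies is the Shannon entropy of its predicted distribution over the 256 possible next-byte values. The following is a minimal sketch, assuming the model's output at one position is available as a list of 256 logits. The function name `next_byte_entropy` and the choice of nats as the unit are illustrative assumptions, not taken from the BLT codebase.

```python
import math

def next_byte_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) over the 256 byte values."""
    if len(logits) != 256:
        raise ValueError("expected one logit per byte value (256)")
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A uniform prediction is maximally uncertain: entropy equals ln(256).
uniform = [0.0] * 256

# A sharply peaked prediction is nearly certain: entropy is close to 0.
peaked = [0.0] * 256
peaked[ord("e")] = 50.0

print(next_byte_entropy(uniform), next_byte_entropy(peaked))
```

High values mark positions where the next byte is hard to guess, and low values mark positions where it is almost determined by the preceding context.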

Given the per-byte entropies, BLT segments a sequence using one of two rules: [1]

| Patching rule | Boundary condition | Idea |
|---|---|---|
| Global threshold | Start a new patch when the next-byte entropy exceeds a fixed global value | Cut wherever the model is uncertain in absolute terms |
| Approximate monotonic constraint | Start a new patch when entropy rises sharply relative to the previous byte | Cut at points where uncertainty suddenly increases |

The effect is that predictable spans, such as the end of a common word, are absorbed into a single long patch, while uncertain spans, such as the first character of a new word, tend to start fresh patches. The average patch size is a tunable hyperparameter: the paper trains models with average patch sizes around 4.5, 6, and 8 bytes, where larger average patches mean fewer global-transformer steps per byte and therefore lower cost. Because patch length is determined only by the bytes seen so far, the same segmentation can be applied incrementally at inference time. [1][3]
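
The two rules in the table above can be sketched in a few lines. This is an illustrative sketch rather than the paper's implementation: it assumes `entropies[i]` is the entropy model's uncertainty about byte `i` given the bytes before it, and the function names, the entropy values, and the thresholds are all hypothetical.

```python
def patch_starts_global(entropies, threshold):
    """Start a new patch wherever entropy exceeds a fixed global value."""
    starts = [0]                             # the first byte always opens a patch
    for i in range(1, len(entropies)):
        if entropies[i] > threshold:
            starts.append(i)
    return starts

def patch_starts_monotonic(entropies, rise):
    """Start a new patch wherever entropy rises by more than `rise` over the previous byte."""
    starts = [0]
    for i in range(1, len(entropies)):
        if entropies[i] - entropies[i - 1] > rise:
            starts.append(i)
    return starts

def split_into_patches(data, starts):
    """Cut `data` at the given start indices."""
    bounds = starts + [len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]

text = b"the cat"
# Hypothetical entropies: high at word starts, low inside words.
entropies = [3.1, 0.6, 0.3, 1.8, 2.8, 0.7, 0.4]

by_threshold = split_into_patches(text, patch_starts_global(entropies, threshold=1.5))
by_rise = split_into_patches(text, patch_starts_monotonic(entropies, rise=1.2))
print(by_threshold)   # [b'the', b' ', b'cat']
print(by_rise)        # [b'the', b' cat']
```

Both functions look only at entropies up to the current position, so the segmentation can be extended one byte at a time during generation. Raising either threshold yields fewer, longer patches, which is how the average patch size would be tuned in this sketch.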

## Architecture

BLT is built from three modules. Two lightweight local models handle the byte level at the input and output, and a much larger model does the heavy reasoning over patches in the middle. [1]

| Module | Role | Relative size |
|---|---|---|
| Local encoder | Maps the raw byte sequence into one representation per patch | Few layers, small |
| Latent global transformer | Processes the sequence of patch representations; carries most of the model's capacity and dominates the FLOPs | Many layers, large |
| Local decoder | Turns patch representations back into a prediction over the next raw bytes | Few layers, small |

The **local encoder** is a small transformer that embeds individual bytes and then uses cross-attention to pool the bytes belonging to each patch into a single patch representation. To give each byte more context than its own value, the encoder augments byte embeddings with *hash n-gram embeddings*: for n-grams of length 3 through 8, a rolling hash indexes into learned embedding tables, and these are added to the per-byte embeddings. The cross-attention follows the design used in the Perceiver architecture, with patch representations acting as queries that attend only to the byte keys and values inside their own patch. [1]
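
The hash n-gram lookup can be illustrated with a small pure-Python sketch. It assumes each n-gram ends at the current byte, so only earlier context is used. The table size, embedding width, hash base, and the names `ngram_slot` and `embed_bytes` are hypothetical choices for illustration, and the hash is recomputed per n-gram for clarity instead of being rolled incrementally.

```python
import random

NGRAM_SIZES = range(3, 9)    # n = 3 through 8, as described above
TABLE_SIZE = 1000            # hypothetical; real tables would be far larger
DIM = 8                      # hypothetical embedding width
HASH_BASE = 257              # hypothetical polynomial hash base
HASH_MOD = 2**61 - 1

_rng = random.Random(0)

def _table(rows):
    """Stand-in for a learned embedding table (random values here)."""
    return [[_rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(rows)]

byte_table = _table(256)
hash_tables = {n: _table(TABLE_SIZE) for n in NGRAM_SIZES}

def ngram_slot(gram: bytes) -> int:
    """Polynomial hash of an n-gram, reduced to a row index in its table."""
    h = 0
    for b in gram:
        h = (h * HASH_BASE + b + 1) % HASH_MOD
    return h % TABLE_SIZE

def embed_bytes(data: bytes):
    """Byte embedding plus one hashed embedding per n-gram ending at each position."""
    out = []
    for i, b in enumerate(data):
        vec = list(byte_table[b])
        for n in NGRAM_SIZES:
            if i + 1 >= n:                   # enough preceding bytes for this n
                row = hash_tables[n][ngram_slot(data[i + 1 - n : i + 1])]
                vec = [v + r for v, r in zip(vec, row)]
        out.append(vec)
    return out

embeddings = embed_bytes(b"hello world")
print(len(embeddings), len(embeddings[0]))
```

Because distinct n-grams can hash to the same row, the tables stay a fixed size no matter how many n-grams occur in the data, at the cost of occasional collisions.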

The **latent global transformer** is the main model. It is a standard transformer with block-causal attention that operates over the (much shorter) sequence of patch representations rather than over bytes. Because there are far fewer patches than bytes, this large module runs many fewer steps than a pure byte-level transformer would, which is the source of BLT's efficiency gains. [1]

The **local decoder** is another small transformer. It takes the patch representations produced by the global model and, again using cross-attention (this time with byte queries attending to patch keys and values), unrolls them back into byte-level predictions so the model can generate raw output bytes one at a time. [1]
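
Both cross-attention steps depend on knowing which patch each byte belongs to. The sketch below builds that mapping and the encoder-side mask described above, in which each patch query may attend only to the bytes inside its own patch. It assumes `starts` is a sorted list of patch start indices beginning with 0. The function names are hypothetical, and the decoder-side mask is not shown because the text above does not specify it.

```python
def patch_ids(num_bytes, starts):
    """Map each byte position to the index of the patch that contains it."""
    start_set = set(starts)
    ids, current = [], -1
    for i in range(num_bytes):
        if i in start_set:
            current += 1                     # a new patch begins at this byte
        ids.append(current)
    return ids

def encoder_cross_attention_mask(num_bytes, starts):
    """mask[p][i] is True when patch query p may attend to byte i."""
    ids = patch_ids(num_bytes, starts)
    return [[ids[i] == p for i in range(num_bytes)] for p in range(len(starts))]

# Seven bytes split into three patches starting at positions 0, 3, and 4.
starts = [0, 3, 4]
ids = patch_ids(7, starts)
mask = encoder_cross_attention_mask(7, starts)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In this toy case the global transformer would run three steps (one per patch) instead of seven (one per byte), which is the step reduction the architecture relies on.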

A consequence of this design is that BLT decouples *vocabulary size* from *compute*. In a token-based model, processing more text per step requires a larger vocabulary, which inflates the embedding and output layers. In BLT, longer patches reduce the number of expensive global-transformer steps without changing the byte-level input and output, so patch size and model size can be increased together while the inference budget is held fixed. [1][2]

## Scale and training data

The paper presents what its authors describe as the first FLOP-controlled scaling study of byte-level language models, with models trained up to 8 billion parameters and 4 trillion training bytes. Scaling-law experiments used data comparable to the Llama 2 training set, and a separate 1-trillion-token dataset assembled from public sources (including a subset of DataComp-LM), referred to as BLT-1T, was used to train the models evaluated on downstream tasks. [1]

| Quantity | Value |
|---|---|
| Largest model | 8B parameters |
| Largest training budget | 4T bytes |
| Entropy model | ~100M parameters, 14 layers, hidden dim 512 |
| Hash n-gram sizes | 3 to 8 |
| Average patch sizes studied | ~4.5, 6, 8 bytes |
| Released checkpoints | BLT 1B, BLT 7B |

Two pretrained checkpoints, a 1B and a 7B model, were released alongside code. [4]

## Comparison with Llama 3

The headline empirical claim is that BLT can match a strong tokenizer-based baseline at the 8B scale while costing less to run. The paper compares an 8B "BLT-Entropy" model trained on roughly 4T bytes against an 8B [Llama 3](/wiki/llama_3) model trained on 1T tokens of the same data, evaluating both on standard benchmarks. [1]

| Benchmark | Llama 3 8B | BLT-Entropy 8B |
|---|---|---|
| ARC-Easy | 77.6 | 79.6 |
| ARC-Challenge | 53.3 | 52.1 |
| HellaSwag | 79.1 | 80.6 |
| MMLU | 58.1 | 57.4 |
| Average (paper's full task suite, not only the rows above) | 60.0 | 61.1 |

The authors summarize this as matching the training-FLOP-controlled performance of Llama 3 up to the 8B scale. Separately, because inference cost is roughly inversely proportional to average patch size, a model using an average patch size of about 8 bytes runs at close to half the inference FLOPs of a BPE-tokenized model whose tokens average roughly 4.4 bytes; the paper frames this as up to 50% fewer FLOPs at inference, with the option to trade small quality losses for that efficiency. The broader scaling conclusion is that, for a fixed inference budget, BLT scales better than tokenization-based models by growing patch size and model size at the same time. [1][2][3]
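
The "close to half" figure follows from simple arithmetic on the two average unit sizes. The sketch below assumes that inference cost is dominated by the large model and scales with the number of steps it takes per byte. It ignores the local encoder and decoder, so it is a rough estimate and not the paper's FLOP accounting. The function name is hypothetical.

```python
def relative_inference_flops(avg_patch_bytes: float, avg_token_bytes: float) -> float:
    """Large-model steps per byte for BLT, divided by steps per byte for the token model."""
    if avg_patch_bytes <= 0 or avg_token_bytes <= 0:
        raise ValueError("average sizes must be positive")
    # Steps per byte are 1/size, so the ratio is token size over patch size.
    return avg_token_bytes / avg_patch_bytes

ratio = relative_inference_flops(avg_patch_bytes=8.0, avg_token_bytes=4.4)
savings = 1.0 - ratio
print(f"relative FLOPs: {ratio:.2f}, savings: {savings:.0%}")
```

With patches of 8 bytes and tokens of 4.4 bytes the ratio is about 0.55, or roughly 45% fewer large-model steps, which is in line with the "up to 50%" framing.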

## Robustness and long-tail behavior

Because BLT sees characters directly rather than through opaque tokens, it does noticeably better on tasks that depend on the internal spelling of words and on inputs that differ from clean training text. On the CUTE benchmark, which probes character-level understanding and manipulation, the paper reports BLT scoring 54.1 against 27.5 for the comparable Llama 3 model, with near-perfect accuracy on spelling subtasks. On noised versions of HellaSwag, where the input text is corrupted with character-level perturbations, BLT holds an average advantage of about 8 points over the equivalently trained token model. The paper also reports gains on low-resource machine translation using the FLORES benchmark and on a grapheme-to-phoneme task, consistent with the intuition that byte-level modeling helps most where a fixed subword vocabulary provides poor coverage. [1]

## Reception and status

The work was first posted to arXiv on 13 December 2024 and was subsequently published at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), held in Vienna, as paper 2025.acl-long.453. The paper is credited to Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, and Srinivasan Iyer. Meta released training and inference code and the pretrained checkpoints on [GitHub](https://github.com/facebookresearch/blt) and Hugging Face, with the code under a CC-BY-NC-4.0 license and partly based on Meta's Lingua codebase. [1][4][6]

BLT drew wide attention as one of the more credible challenges to subword tokenization, a component that had been nearly universal in large language models. Commentators noted that it sits in a longer line of byte- and character-level work (such as ByT5, CANINE, and MEGABYTE, the last of which shares authors with BLT) but is distinctive for demonstrating compute-matched parity with a production-grade tokenized model at the 8B scale and for its learned, entropy-driven segmentation. [3][5]

## See also

- [Tokenization](/wiki/tokenization)
- [Byte-pair encoding](/wiki/byte_pair_encoding)
- [Transformer](/wiki/transformer)
- [Llama 3](/wiki/llama_3)
- [Large language model](/wiki/large_language_model)
- [Meta AI](/wiki/meta_ai)

## References

1. Pagnoni, A., Pasunuru, R., Rodriguez, P., Nguyen, J., Muller, B., Li, M., Zhou, C., Yu, L., Weston, J., Zettlemoyer, L., Ghosh, G., Lewis, M., Holtzman, A., & Iyer, S. (2024). "Byte Latent Transformer: Patches Scale Better Than Tokens." arXiv:2412.09871. https://arxiv.org/abs/2412.09871 (HTML: https://arxiv.org/html/2412.09871v1)
2. "Byte Latent Transformer: Patches Scale Better Than Tokens." Research, AI at Meta. https://ai.meta.com/research/publications/byte-latent-transformer-patches-scale-better-than-tokens/
3. "Byte Latent Transformer: Patches Scale Better Than Tokens." Graphcore Research Blog. https://graphcore-research.github.io/byte-latent-transformer/
4. facebookresearch/blt: Code for BLT research paper. GitHub. https://github.com/facebookresearch/blt
5. Wiggers, K. "Meta's new BLT architecture replaces tokens to make LLMs more efficient and versatile." VentureBeat. https://venturebeat.com/ai/metas-new-blt-architecture-replaces-tokens-to-make-llms-more-efficient-and-versatile
6. Pagnoni, A., et al. (2025). "Byte Latent Transformer: Patches Scale Better Than Tokens." Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), 2025.acl-long.453. https://aclanthology.org/2025.acl-long.453/