Tokens
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 2,604 words
Tokens are the basic units of text that large language models read and produce. A token can be a whole word, a chunk of a word, a single character, or even a piece of punctuation. Models do not see raw letters at all. Before any prompt reaches the neural network, a piece of software called a tokenizer chops the text into tokens and replaces each one with an integer ID. The model then operates on those numbers and emits new ones, which the tokenizer turns back into text on the way out.
Modern tokenizers for natural language processing and LLMs almost always use a subword approach. Common English words like "the" or "cat" become a single token. Less common words get split: "tokenization" might split into "token" and "ization", and a rare proper noun breaks down further. This keeps the vocabulary tractable, around 30,000 to 200,000 entries for most current models, while still letting the system represent any string that fits the underlying byte encoding.
Because the model thinks in tokens rather than words, almost every practical question about LLMs ends up being a question about tokens: how many fit in the context window, how much an API call costs, how fast generation feels, and which languages a model handles efficiently.
The job of a tokenizer is to turn arbitrary text into a sequence of integers and back again, losslessly. To do that, it needs a vocabulary, a fixed list of strings each assigned a unique ID, and a set of merging rules that say which adjacent pieces to glue together.
Three algorithm families dominate current LLMs.
Byte pair encoding (BPE) is by far the most common method, used by the GPT family, Llama, Gemma, Qwen, Mistral, and many others. The compression algorithm was originally described by Philip Gage in The C Users Journal in 1994, then adapted to NLP by Sennrich, Haddow, and Birch in 2015.
The training procedure is simple. Start with a base vocabulary (often just the 256 possible byte values), count which pair of adjacent symbols appears most often in the training corpus, and merge that pair into a single new token. Repeat tens of thousands of times until the vocabulary reaches the target size. The final tokenizer is just the base alphabet plus the ordered list of merges, which is enough to encode any new string.
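The training loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not the code any production tokenizer uses: the function names `train_bpe`, `merge_pair`, and `encode` are made up for this example, the corpus is split on whitespace as a simplifying assumption, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn an ordered list of merges from a tiny corpus (toy sketch)."""
    # Base vocabulary: single bytes. Each whitespace-separated word is a
    # list of one-byte symbols (splitting on whitespace is an assumption).
    words = [[bytes([b]) for b in w.encode("utf-8")] for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in words:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = [merge_pair(symbols, best) for symbols in words]
    return merges


def encode(text, merges):
    """Encode new text by replaying the learned merges in order."""
    symbols = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = train_bpe("low low low lower lowest", num_merges=3)
print(merges)                    # [(b'l', b'o'), (b'lo', b'w'), (b'low', b'e')]
print(encode("lowest", merges))  # [b'lowe', b's', b't']
```

Real implementations assign each merged symbol an integer ID and use faster data structures, but the learned artifact is the same: a base alphabet plus an ordered merge list.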
Byte level BPE is the variant used by GPT-2 and later OpenAI models. Instead of operating on Unicode characters directly, it operates on the UTF-8 byte stream. That guarantees any text can be encoded without an "unknown" token, no matter what script or emoji it contains, which is one reason this style of tokenizer beat earlier alternatives.
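A short sketch of why the byte-level starting point never needs an "unknown" token: every string, whatever its script, reduces to UTF-8 bytes in the range 0 to 255, so a 256-entry base vocabulary covers it. The helper name `to_byte_symbols` is hypothetical and the sample strings are arbitrary.

```python
def to_byte_symbols(text: str) -> list[int]:
    """Return the UTF-8 byte values that a byte-level tokenizer starts from."""
    return list(text.encode("utf-8"))


for sample in ["cat", "né", "猫", "🙂"]:
    symbols = to_byte_symbols(sample)
    # Every value fits in the 256-entry base vocabulary.
    assert all(0 <= b <= 255 for b in symbols)
    # The mapping is lossless: the bytes decode back to the original text.
    assert bytes(symbols).decode("utf-8") == sample
    print(repr(sample), len(symbols), symbols)
```

ASCII letters take one byte each, while the CJK character takes three and the emoji four, which hints at why non-Latin text tends to cost more tokens unless the merge table has learned those byte sequences.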
WordPiece was developed at Google and is the tokenizer used by BERT, DistilBERT, and Electra. It looks similar to BPE from the outside but picks merges differently. Where BPE always merges the most frequent adjacent pair, WordPiece picks the pair whose joint frequency is highest relative to the product of the parts' frequencies. The effect is that WordPiece prefers merges that are more informative than chance would predict. WordPiece subword pieces are conventionally written with a ## prefix when they continue a previous token, so "playing" becomes play plus ##ing.
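The difference between the two selection rules is easiest to see on made-up counts. In the sketch below, the symbol and pair frequencies are invented for illustration, and `bpe_choice` and `wordpiece_choice` are hypothetical helper names. The WordPiece score is taken to be the pair count divided by the product of the two parts' counts, as described above.

```python
# Invented frequencies for illustration only.
symbol_counts = {"t": 1000, "h": 400, "q": 20, "u": 300}
pair_counts = {("t", "h"): 300, ("q", "u"): 19}


def bpe_choice(pair_counts):
    """BPE rule: merge the most frequent adjacent pair."""
    return max(pair_counts, key=pair_counts.get)


def wordpiece_choice(pair_counts, symbol_counts):
    """WordPiece-style rule: pair count relative to the product of part counts."""
    def score(pair):
        a, b = pair
        return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])
    return max(pair_counts, key=score)


print(bpe_choice(pair_counts))                       # ('t', 'h')
print(wordpiece_choice(pair_counts, symbol_counts))  # ('q', 'u')
```

With these numbers "th" is far more frequent, so BPE merges it first. "q" is almost always followed by "u", so the pair is much more common than its parts would predict, and the WordPiece-style score prefers it.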
SentencePiece is a Google library rather than an algorithm in itself. It can apply BPE or a Unigram language model to raw text without language-specific preprocessing. The practical win is that it treats the space character as part of the vocabulary, shown as ▁, which lets it work cleanly on Chinese, Japanese, Thai, and other scripts that do not use spaces between words. Llama 2 used SentencePiece BPE with a 32,000 token vocabulary; T5, ALBERT, and XLNet use the Unigram flavor, which scores candidate subwords by how much removing each one would hurt the likelihood of the training corpus and trims the lowest scoring entries.
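The whitespace convention can be illustrated without the library itself. The sketch below is not the sentencepiece API: `to_pieces` and `from_pieces` are hypothetical helpers that only show how marking spaces with ▁ keeps the round trip lossless. A real model would split the pieces further into learned subwords.

```python
MARK = "\u2581"  # the "▁" character used to represent a space


def to_pieces(text: str) -> list[str]:
    """Mark spaces with ▁ and cut before each marker (toy segmentation)."""
    marked = MARK + text.replace(" ", MARK)
    pieces = []
    for ch in marked:
        if ch == MARK or not pieces:
            pieces.append(ch)   # start a new piece at each marker
        else:
            pieces[-1] += ch    # otherwise extend the current piece
    return pieces


def from_pieces(pieces: list[str]) -> str:
    """Undo the marking: join, restore spaces, drop the leading marker."""
    return "".join(pieces).replace(MARK, " ")[1:]


pieces = to_pieces("Hello world")
print(pieces)               # ['▁Hello', '▁world']
print(from_pieces(pieces))  # Hello world
```

Because the space lives inside the pieces, decoding is plain concatenation, with no language-specific rule about where spaces belong.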
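For OpenAI's GPT models, the cookbook gives a useful approximation for English text: one token is roughly four characters, or about three quarters of a word, so 100 tokens come to about 75 words.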
Some reference values OpenAI cites: Wayne Gretzky's quote "You miss 100% of the shots you don't take" is 11 tokens, OpenAI's charter is 476 tokens, and the U.S. Declaration of Independence is 1,695 tokens.
These ratios fall apart outside English. Non-Latin scripts, code, JSON, and unusual symbols all use more tokens per character. A paragraph of Japanese or Hindi can cost two or three times as many tokens as the same paragraph in English, which has real cost and latency consequences. Anthropic notes that the new tokenizer in Claude Opus 4.7 can use up to 35% more tokens than Opus 4.6 for the same text, with the biggest jumps on code, structured data, and non-English content.
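For English text, a rough estimate can be computed from character counts alone. The sketch below assumes the common rule of thumb of about four characters per token. The helper name `estimate_tokens` and its default are illustrative, and the result is only a ballpark that an actual tokenizer should confirm.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Ballpark token count for English text (assumes ~4 characters per token)."""
    return math.ceil(len(text) / chars_per_token)


quote = "You miss 100% of the shots you don't take"
print(len(quote), estimate_tokens(quote))  # 41 11
```

The estimate happens to land on the 11 tokens cited for this quote, but for code, JSON, or non-English text the same formula can undercount badly.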
| Model family | Tokenizer | Algorithm | Vocabulary size |
|---|---|---|---|
| GPT-2, GPT-3 | r50k_base / p50k_base | byte level BPE | 50,257 |
| GPT-3.5, GPT-4 | cl100k_base | byte level BPE | ~100,277 |
| GPT-4o, o1, o3, o4 family | o200k_base | byte level BPE | ~200,000 |
| BERT, DistilBERT | WordPiece | WordPiece | 30,522 (English uncased) |
| RoBERTa, BART | byte level BPE | byte level BPE | 50,265 |
| T5 | SentencePiece | Unigram | 32,128 |
| Llama 2 | SentencePiece BPE | BPE on SentencePiece | 32,000 |
| Llama 3, Llama 3.1+ | tiktoken-style BPE | byte level BPE | 128,256 |
| Mistral, Mixtral | SentencePiece BPE | BPE | 32,000 |
| Gemma | SentencePiece BPE | BPE | 256,000 |
The trend over the past few years is toward larger vocabularies. OpenAI roughly doubled its vocab going from cl100k_base to o200k_base, and Llama 3 quadrupled its vocab from Llama 2's 32k to 128k. Bigger vocabularies pack more characters into each token, especially for non-English text and code, which shrinks the number of tokens you need to represent the same content. That lowers cost and lets the model see more material within the same context window. The tradeoff is a larger embedding table and slightly more memory.
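The embedding cost of a larger vocabulary is simple arithmetic: one vector per token. The sketch below assumes a hidden size of 4,096 and 2 bytes per parameter, both hypothetical values chosen for illustration rather than the specification of any particular model.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of parameters in an embedding table: one vector per token."""
    return vocab_size * hidden_size


HIDDEN = 4096        # assumed hidden size, for illustration only
BYTES_PER_PARAM = 2  # assumed 16-bit weights

for name, vocab in [("32k vocab", 32_000), ("128k vocab", 128_256)]:
    params = embedding_params(vocab, HIDDEN)
    gib = params * BYTES_PER_PARAM / 2**30
    print(f"{name}: {params:,} parameters, about {gib:.2f} GiB")
```

Under these assumptions the table grows from about 131 million to about 525 million parameters. Models that keep a separate output projection of the same shape pay that cost twice.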
OpenAI publishes tiktoken, a fast open source BPE tokenizer written in Rust with Python bindings. It implements the same encodings used by the OpenAI API, so a token count produced locally for a piece of text matches what the API counts for that text; chat requests add a few formatting tokens per message on top.
A minimal example:
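```python
import tiktoken

# Look up the encoding used by the named model.
enc = tiktoken.encoding_for_model("gpt-4o")
ids = enc.encode("Hello, world!")  # text -> list of integer token IDs
print(len(ids), ids)
print(enc.decode(ids))  # token IDs -> original text
```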
The library exposes encodings by name: cl100k_base for GPT-3.5 and GPT-4, o200k_base for GPT-4o and the o-series reasoning models, p50k_base for older Codex models, and r50k_base (also called gpt2) for GPT-2 and the original GPT-3 endpoints. The README claims tiktoken is three to six times faster than comparable open source tokenizers.
For other model families, the equivalent tools are Hugging Face's tokenizers library (a fast Rust implementation that ships with the Transformers package), Google's sentencepiece Python module, and Anthropic's /v1/messages/count_tokens endpoint for Claude. Anthropic exposes token counting as a free API call, which is convenient if you need to budget a prompt before sending it.
A model's context window is the maximum number of tokens it can attend to at once, counting prompt and completion together. Sizes have grown by orders of magnitude in a few years, from a few thousand tokens to a million.
What fits inside that window: a few short emails at 4k, a chapter at 32k, a full novel at 200k, several novels or a moderate codebase at 1M. The number of tokens you can actually use in practice is often less than the advertised maximum because the system prompt, tool definitions, and conversation history all count against the same budget, and output tokens come out of the same pool unless the provider allocates them separately.
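A quick budget calculation makes the point concrete. All the numbers below are hypothetical, and `remaining_budget` is an illustrative helper, not part of any SDK. It assumes output tokens share the same pool as the prompt.

```python
def remaining_budget(context_window: int, system_tokens: int, tool_tokens: int,
                     history_tokens: int, reserved_output: int) -> int:
    """Tokens left for new user content after fixed overheads are counted."""
    used = system_tokens + tool_tokens + history_tokens + reserved_output
    return context_window - used


# Hypothetical request against a 200k-token window.
left = remaining_budget(
    context_window=200_000,
    system_tokens=3_000,
    tool_tokens=12_000,
    history_tokens=40_000,
    reserved_output=8_000,
)
print(left)  # 137000
```

In this example almost a third of the advertised window is gone before the user adds anything.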
For API access, tokens are the unit of billing. Providers charge separate rates for input tokens (everything sent to the model) and output tokens (everything the model generates). Output is typically two to five times more expensive than input.
A snapshot of pricing per million tokens at the time of writing:
| Model | Input ($/MTok) | Output ($/MTok) |
|---|---|---|
| Claude Haiku 4.5 | $1 | $5 |
| Claude Sonnet 4.6 | $3 | $15 |
| Claude Opus 4.7 | $5 | $25 |
| GPT-4o | $2.50 | $10 |
| GPT-4o mini | $0.15 | $0.60 |
At those rates a 100,000 token prompt to Claude Opus costs about 50 cents on input alone, before any generation. That arithmetic is why so much engineering effort goes into compressing prompts, caching frequently reused chunks, and choosing the right model for each step of a pipeline.
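The arithmetic can be wrapped in a small helper. The rates below are copied from the snapshot table above and will go stale; the `PRICES` dictionary and `request_cost` function are illustrative names, not part of any provider SDK.

```python
# (input $/MTok, output $/MTok), taken from the snapshot table above.
PRICES = {
    "Claude Haiku 4.5": (1.00, 5.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-4o": (2.50, 10.00),
    "GPT-4o mini": (0.15, 0.60),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars of one request at the listed per-million-token rates."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


print(request_cost("Claude Opus 4.7", 100_000, 0))      # 0.5
print(request_cost("Claude Opus 4.7", 100_000, 2_000))  # 0.55
```

Running the same request through each model in the table is a quick way to see how much a cheaper model saves on a given pipeline step.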
Providers also offer ways to bring the per-token price down. OpenAI and Anthropic both support prompt caching, which can discount cached input tokens by up to 90%. Batch APIs that run requests asynchronously usually shave another 50%. For very large prompts beyond 200k tokens, Anthropic historically applied a long context surcharge; in 2026 it merged that into standard pricing for the 4.6 and 4.7 families.
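A sketch of how a cache discount changes the input bill, assuming the best case of a 90% discount on cached tokens. The helper `input_cost_with_cache` is hypothetical, and it ignores any extra fee a provider may charge for writing to the cache, so treat the result as a lower bound.

```python
def input_cost_with_cache(total_tokens: int, cached_tokens: int,
                          price_per_mtok: float,
                          cache_discount: float = 0.90) -> float:
    """Input cost in dollars when part of the prompt is served from cache."""
    fresh_tokens = total_tokens - cached_tokens
    cached_price = price_per_mtok * (1 - cache_discount)
    total = fresh_tokens * price_per_mtok + cached_tokens * cached_price
    return total / 1_000_000


# 100k-token prompt at $5/MTok, with 90k tokens already cached.
print(round(input_cost_with_cache(100_000, 90_000, 5.00), 4))  # 0.095
# Same prompt with nothing cached.
print(round(input_cost_with_cache(100_000, 0, 5.00), 4))       # 0.5
```

The saving depends on how much of the prompt is a stable prefix, which is why long system prompts and tool definitions are the usual caching targets.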
Word boundaries are not stable. The same word can tokenize to a different number of tokens depending on capitalization, leading whitespace, and surrounding context. " red" with a leading space is one token in cl100k_base; "Red" at the start of a sentence is another; "RED" in all caps may be two or three. This is why prompts that end with a trailing space sometimes generate worse output: the model has already committed to a token boundary that does not match what the training data normally looks like there.
Numbers and code tokenize unevenly. In the GPT-2 and GPT-3 era encodings, four digit numbers like "2024" were sometimes a single token and sometimes split into "20" and "24", which is part of why early GPT models struggled with long arithmetic. cl100k_base and later encodings instead split runs of digits into chunks of at most three, which is more consistent but still does not line up with place value.
Non-English text is more expensive: scripts that appear less often in training get fewer dedicated tokens and fall back to byte level encoding, so a sentence in Bengali or Thai can take three to four times more tokens than the same meaning in English. Larger multilingual vocabularies (Gemma's 256k, o200k_base's 200k, Llama 3's 128k) narrow this gap but do not close it.
Swapping the tokenizer after training is hard. Because the model's embedding table is keyed to token IDs, changing the tokenizer means retraining the embeddings, and often the whole model. Every major LLM release ships with a tokenizer choice frozen at the start of training.
Understanding tokens unlocks several lower level controls that show up in chat completions and similar APIs.
The logit_bias parameter accepts a map from token ID to a bias value between -100 and +100, which is added to the model's logit for that token before sampling. A value of -100 effectively forbids the token; +100 forces it. The classic example: if you are building an assistant that should never suggest eggs in a recipe, you look up the IDs for " egg", " eggs", and the subword piece "gg" with a tokenizer tool, then push them all to -100.
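The effect of the bias on sampling probabilities can be sketched with a toy softmax. The token IDs and logits below are invented, and `apply_logit_bias` and `softmax` are illustrative helpers rather than anything the API exposes. Real IDs would come from a tokenizer lookup.

```python
import math


def apply_logit_bias(logits: dict[int, float], bias: dict[int, float]) -> dict[int, float]:
    """Add the per-token bias to each logit before sampling."""
    return {tok: value + bias.get(tok, 0.0) for tok, value in logits.items()}


def softmax(logits: dict[int, float]) -> dict[int, float]:
    """Convert logits to probabilities (shifted by the max for stability)."""
    peak = max(logits.values())
    exps = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Hypothetical IDs: 101 = " egg", 102 = " tofu", 103 = " flour".
logits = {101: 3.0, 102: 1.0, 103: 0.5}
before = softmax(logits)
after = softmax(apply_logit_bias(logits, {101: -100.0}))

print(max(before, key=before.get))  # 101
print(max(after, key=after.get))    # 102
print(after[101] < 1e-30)           # True
```

A bias of -100 does not delete the token from the vocabulary; it pushes its probability so close to zero that it is never sampled in practice.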
The max_tokens parameter caps how many tokens the model is allowed to generate. If the budget runs out mid-sentence, the response ends with a finish_reason of length. The stop parameter takes one or more strings that end the response immediately when generated. For structured output, the chat completions and responses APIs constrain generation to match a JSON schema by masking out tokens that would violate the grammar at each step, which is a logit_bias style trick applied automatically.
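The interaction between max_tokens and stop can be mimicked with a toy loop over a scripted list of tokens. This is a sketch, not provider code: `generate` is a hypothetical function, the "model" is just a fixed list of pieces, and it assumes that a matched stop string is left out of the returned text and that a natural end is also reported as "stop".

```python
def generate(pieces, max_tokens, stop=()):
    """Consume scripted tokens until the budget or a stop string ends the run."""
    text = ""
    for count, piece in enumerate(pieces):
        if count >= max_tokens:
            return text, "length"        # budget exhausted mid-stream
        text += piece
        for marker in stop:
            cut = text.find(marker)
            if cut != -1:
                return text[:cut], "stop"  # stop string is not returned
    return text, "stop"                  # stream ended on its own


pieces = ["The", " cat", " sat", ".", "\n", "Next"]
print(generate(pieces, max_tokens=3))                # ('The cat sat', 'length')
print(generate(pieces, max_tokens=10, stop=["\n"]))  # ('The cat sat.', 'stop')
```

Checking the finish reason after every call is the simplest way to detect truncated output before it reaches a parser or a user.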
The word "token" gets pulled in a few different directions in adjacent fields. In classical lexical analysis, a token is a unit produced by a deterministic rule based lexer: an identifier, keyword, literal, or operator. LLM tokens come from a vocabulary learned statistically from a corpus rather than from a hand written grammar, and they end up as integer IDs feeding a neural network instead of typed objects feeding a parser.
In web authentication, a token (session token, API token, JWT) is an opaque credential used to identify a user or application. In crypto and Web3, "token" refers to a unit issued on a blockchain. Around 2022 some marketing material started using "generative AI tokens" as a synonym for blockchain assets that gate access to a generative model. None of these are interchangeable with the linguistic tokens described here.
Encoded with cl100k_base, the string "Hello, world!" becomes four tokens, [9906, 11, 1917, 0], which decode back to "Hello", ",", " world", and "!". The space before "world" is part of the " world" token itself; that is how byte level BPE handles spacing.
The word "red" illustrates how position matters. In cl100k_base, " red" (lowercase with a leading space) is one token, " Red" is a different token, and "Red" at the start of a sentence is a third. The period is token 13 because single bytes form the base vocabulary and take the lowest IDs, with printable ASCII punctuation first ("!" is 0 and "," is 11). After the base bytes, merges learned earlier get lower IDs, so the most frequent strings sit near the low end and rare strings drift toward the high end, and tokens behave a bit like a Zipfian dictionary.
The most useful tools for working with tokens:

- Anthropic's count_tokens API gives exact Claude counts without sending a billable message.
- The tiktoken Python and JavaScript packages let you count locally for any OpenAI model.
- For models hosted on Hugging Face, AutoTokenizer.from_pretrained(...) is the standard entry point.

For prompt design, the practical takeaway is to measure rather than guess. A surprisingly large fraction of "why is my context window full?" debugging sessions come down to a single tool definition or system prompt that tokenizes to many more tokens than the author expected, especially when it contains JSON, code samples, or non-English content.