See also: Machine learning terms, Tokenization, Byte pair encoding
In the field of machine learning, a token refers to a fundamental unit of text or data that is used for processing, analysis, or modeling. Tokens are essential components of natural language processing (NLP) systems, which aim to enable computers to understand, interpret, and generate human language. In this context, a token can represent a single word, a character, a subword, or any other unit of text that serves as an input for a given NLP model. The process of breaking down a given text into its constituent tokens is known as tokenization.
In the era of large language models (LLMs), the concept of a token has become central to how these systems are designed, priced, and evaluated. Every prompt sent to a model and every response generated is measured in tokens. The context window of a model, the cost of an API call, and the quality of the model's output all depend on how text is split into tokens and how many tokens the model can handle at once.
A common source of confusion is the relationship between tokens, words, and characters. These three units of text are related but distinct.
Characters are the smallest units of written language: individual letters, digits, punctuation marks, and spaces. The English word "tokenization" contains 12 characters.
Words are sequences of characters separated by spaces or punctuation in most Western languages. "The cat sat on the mat" contains six words.
Tokens fall somewhere between characters and words. Most modern NLP systems use subword tokenization, which splits text into pieces that are sometimes whole words and sometimes fragments of words. For example, a subword tokenizer might split "tokenization" into "token" and "ization," yielding two tokens from one word. Common short words like "the" or "is" are typically kept as single tokens, while rare or long words are broken into multiple subword units.
For English text processed by modern LLM tokenizers such as OpenAI's tiktoken (cl100k_base encoding), one token corresponds to roughly 0.75 words, or equivalently, about 4 characters. This means 100 tokens translates to approximately 75 English words. However, this ratio varies significantly depending on the language, the domain of the text, and the specific tokenizer being used.
| Language | Approximate Characters per Token | Relative Cost vs. English |
|---|---|---|
| English | 4 to 5 | 1.0x (baseline) |
| Spanish | 3 to 4 | ~1.2x |
| German | 3 to 4 | ~1.5x |
| French | 3 to 4 | ~1.3x |
| Russian (Cyrillic) | 3 to 4 | ~1.5x |
| Chinese | 1 to 2 | ~2.0x |
| Japanese | 1 to 2 | ~2.0x |
| Korean | 1 to 2 | ~1.5x to 2.0x |
| Arabic | 2 to 3 | ~2.0x |
| Hindi (Devanagari) | 2 to 3 | ~3.0x |
| Thai | 1 to 2 | ~3.0x |
The disparity arises because most tokenizer vocabularies are trained primarily on English-heavy corpora. Languages with non-Latin scripts or complex morphology require more tokens to express the same meaning, which translates directly into higher processing costs and reduced effective context lengths when using token-based APIs.
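This effect can be observed directly with a tokenizer library. The sketch below uses OpenAI's tiktoken with the cl100k_base encoding to count tokens for short sample sentences in a few languages; the sentences are arbitrary and the exact counts will vary with the text chosen.

```python
# Illustrative per-language token counting with tiktoken (cl100k_base).
# Sample sentences are arbitrary; exact counts depend on the text and encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
samples = {
    "English": "The weather is nice today.",
    "German":  "Das Wetter ist heute schön.",
    "Chinese": "今天天气很好。",
}
for language, text in samples.items():
    tokens = enc.encode(text)
    print(f"{language}: {len(text)} characters -> {len(tokens)} tokens")
```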
The methods used to convert text into tokens have evolved considerably over the past several decades, driven by advances in computational linguistics and deep learning.
Early NLP systems in the 1950s through the 1990s relied on rule-based tokenization. Text was split on whitespace and punctuation using hand-crafted rules for each language. This approach was tightly coupled to linguistic expertise and failed to generalize across languages or handle noisy real-world text (such as social media posts or OCR output). Word-level tokenizers were used alongside n-gram language models, where the vocabulary consisted of the most frequent words in a training corpus. Any word not present in the vocabulary was mapped to a special unknown token, a problem known as the out-of-vocabulary (OOV) issue.
The shift toward data-driven tokenization began in earnest with the rise of neural machine translation in the mid-2010s. Sennrich, Haddow, and Birch (2016) adapted Byte Pair Encoding (BPE), originally a data compression algorithm developed by Philip Gage in 1994, for use in NLP. Their key insight was that subword units could handle rare words by decomposing them into frequently occurring pieces, effectively eliminating the OOV problem without inflating sequence lengths as much as character-level approaches. Around the same time, Google's neural machine translation team (Wu et al., 2016) employed the WordPiece algorithm, first described by Schuster and Nakajima (2012), in their production system. Kudo (2018) introduced the Unigram language model approach and the SentencePiece library, which further removed the dependence on language-specific pre-processing.
By the early 2020s, subword tokenization had become the universal standard for transformer-based models. The main trend since then has been the growth of vocabulary sizes. GPT-2 (2019) used a vocabulary of approximately 50,000 tokens. GPT-4 (2023) expanded to roughly 100,000 tokens with its cl100k_base encoding. LLaMA 3 (2024) adopted a vocabulary of approximately 128,000 tokens, and Gemini models use vocabularies of 256,000 tokens or more. Larger vocabularies encode text more efficiently (fewer tokens per word), which reduces computational cost in the self-attention layers, though at the expense of larger embedding matrices. Some recent models have pushed vocabulary sizes above 200,000 tokens; OpenAI's o200k_base encoding used by GPT-4o and successors contains approximately 200,000 tokens.
Tokenization is a crucial step in the preprocessing of textual data for various machine learning tasks, such as text classification, sentiment analysis, and machine translation. There are several techniques employed for tokenization, each with its own set of advantages and disadvantages.
Word-based tokenization is a straightforward approach that involves segmenting text into individual words, treating each word as a separate token. This technique often uses white spaces and punctuation marks as delimiters. While it is relatively simple and intuitive, word-based tokenization faces several challenges. First, it produces very large vocabularies because every unique word form (including plurals, verb conjugations, and misspellings) gets its own entry. Second, any word not seen during training becomes an out-of-vocabulary (OOV) token, which the model cannot meaningfully process. Third, it struggles with languages that do not use spaces between words, such as Chinese, Japanese, and Thai, or with morphologically rich languages like Finnish and Turkish, where a single word can convey multiple pieces of information through affixes.
Character-based tokenization divides text into individual characters, which are then used as tokens. This approach eliminates the out-of-vocabulary problem entirely because any text can be represented using a small, fixed set of characters. It can also be useful when dealing with languages that lack clear word boundaries. However, character-based tokenization significantly increases sequence lengths, making it computationally expensive. It also places a heavier burden on the model to learn word-level and phrase-level semantics from very small building blocks, which can lead to reduced performance on many tasks.
Byte-level tokenization takes character-level tokenization one step further by operating on raw byte values (0 to 255) rather than Unicode characters. This guarantees that any input, in any language or encoding, can be represented without unknown tokens and without needing a massive base vocabulary. The base vocabulary is fixed at 256 entries. Byte-level approaches are used as the foundation for byte-level BPE (as in GPT-2 and later GPT models) and have been explored in models like ByT5 (Xue et al., 2022), which processes text entirely at the byte level without any subword merging. The trade-off is that byte-level sequences are significantly longer than subword sequences, increasing computational cost.
Subword tokenization combines the benefits of both word-based and character-based tokenization. It involves breaking text into smaller units, known as subwords or subtokens, which are derived from frequently occurring character sequences in the training corpus. Common words remain as single tokens while rare words are decomposed into recognizable subword pieces. This approach has become the dominant tokenization strategy for modern language models and transformers.
Four algorithms dominate subword tokenization in contemporary NLP. Each takes a different approach to building a vocabulary of subword units.
Byte pair encoding (BPE) was originally developed as a data compression algorithm by Philip Gage in 1994 and was adapted for NLP tokenization by Sennrich, Haddow, and Birch in 2016. BPE builds its vocabulary through an iterative bottom-up process: starting from a base vocabulary of individual characters (or bytes), it counts how often each adjacent pair of tokens occurs in the training corpus, merges the most frequent pair into a new token, and repeats until the target vocabulary size is reached.
For example, given the word frequencies ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5), BPE would start with the base vocabulary ["b", "g", "h", "n", "p", "s", "u"]. The most frequent adjacent pair is "u" + "g" (appearing 20 times across "hug", "pug", and "hugs"), so BPE merges them into "ug". The next most common pair is "u" + "n" (appearing 16 times in "pun" and "bun"), yielding "un". This process continues until the target vocabulary size is reached.
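The merge procedure in this example can be reproduced in a few lines of code. The following is a minimal from-scratch sketch of BPE training on the toy corpus above, not the optimized implementation used by tiktoken or HuggingFace Tokenizers.

```python
# Minimal BPE training sketch on the toy corpus above.
from collections import Counter

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
splits = {w: list(w) for w in word_freqs}     # start with character-level splits

def most_frequent_pair(splits, word_freqs):
    pair_counts = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return pair_counts.most_common(1)[0][0]

def merge_pair(pair, splits):
    a, b = pair
    for word, symbols in splits.items():
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        splits[word] = merged

merges = []
for _ in range(3):                            # learn three merges for illustration
    pair = most_frequent_pair(splits, word_freqs)
    merges.append(pair)
    merge_pair(pair, splits)

print(merges)   # [('u', 'g'), ('u', 'n'), ('h', 'ug')] on this corpus
print(splits)   # e.g. 'hugs' -> ['hug', 's']
```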
BPE guarantees that the most common substrings in the training data receive their own dedicated tokens, while rare strings are decomposed into smaller, already-known pieces. This gives it a good balance between vocabulary size and sequence length.
BPE is used by the GPT family of models. OpenAI's tiktoken library implements a highly optimized byte-level BPE tokenizer. GPT-2 and GPT-3 use this approach with a vocabulary of roughly 50,000 tokens, while GPT-4 and later models use the cl100k_base and o200k_base encodings, which are refined versions of byte-level BPE with larger vocabularies.
WordPiece was developed by Schuster and Nakajima in 2012 for Japanese and Korean voice search and was later adopted by Google for BERT and related models. WordPiece is similar to BPE in that it iteratively merges character pairs, but it differs in the criterion for choosing which pair to merge. Instead of selecting the most frequent pair, WordPiece selects the pair that maximizes the likelihood of the training data when merged. Specifically, it chooses the pair whose combined frequency divided by the product of the individual frequencies is highest:
score("u", "g") = frequency("ug") / (frequency("u") x frequency("g"))
This scoring favors merging pairs that appear together more often than would be expected by chance, rather than simply the most frequent pair. Two tokens that co-occur far more than their individual frequencies would predict get merged first.
WordPiece uses the "##" prefix to indicate that a subword is a continuation of a previous token rather than the start of a new word. For example, "playing" might be tokenized as ["play", "##ing"]. WordPiece is the tokenizer behind BERT, DistilBERT, and Electra.
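The continuation prefix can be seen by running a pre-trained BERT tokenizer. The snippet below loads the bert-base-uncased tokenizer from the HuggingFace Hub; the exact splits depend on that model's learned vocabulary.

```python
# WordPiece tokenization with the pre-trained BERT tokenizer.
# Exact splits depend on the bert-base-uncased vocabulary; "##" marks continuations.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("playing tokenization"))
# Expected output along the lines of: ['playing', 'token', '##ization']
```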
The Unigram algorithm, proposed by Kudo in 2018, takes the opposite approach from BPE and WordPiece. Instead of starting small and building up, Unigram starts with a large initial vocabulary (often all substrings up to a certain length or all characters plus frequent substrings) and iteratively removes tokens to shrink the vocabulary down to the desired size.
At each step, Unigram computes how much the overall log-likelihood of the training corpus would decrease if each token were removed, and it discards the tokens whose removal causes the least damage (typically the bottom 10% to 20% of tokens by loss impact). The resulting vocabulary tends to produce tokenizations that are probabilistically optimal under the unigram language model assumption, where each token is treated as independent.
During tokenization at inference time, Unigram can produce multiple valid segmentations for any input and selects the one with the highest probability. This probabilistic framework also allows Unigram to perform subword regularization, where different segmentations are sampled during training to make the model more robust. Research has shown that this regularization consistently improves translation quality, especially on low-resource and out-of-domain settings.
Unigram is used by T5, mT5, ALBERT, XLNet, and mBART, typically through the SentencePiece library.
SentencePiece, developed by Kudo and Richardson in 2018, is not strictly a new tokenization algorithm but rather a language-independent framework and library that can implement either BPE or the Unigram algorithm. Its key innovation is that it treats the input text as a raw byte stream, including whitespace characters, without requiring any language-specific pre-tokenization or preprocessing.
Most other tokenizers assume that words are separated by spaces and operate on pre-split words. SentencePiece removes this assumption by representing spaces as a special Unicode character ("▁", U+2581, which resembles an underscore) and learning segmentation boundaries directly from the raw text. This makes SentencePiece particularly effective for languages that do not use spaces between words, such as Chinese, Japanese, and Thai. At decoding time, SentencePiece simply concatenates all tokens and replaces the special space character with a regular space to reconstruct the original text.
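A minimal usage sketch with the sentencepiece Python package is shown below; the model file path is a placeholder standing in for any pre-trained SentencePiece model.

```python
# Encoding and decoding with a pre-trained SentencePiece model.
# "spm.model" is a placeholder path for a model trained elsewhere.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")
pieces = sp.encode("Hello world", out_type=str)
print(pieces)              # e.g. ['▁Hello', '▁world'] — "▁" marks a word boundary
print(sp.decode(pieces))   # 'Hello world'
```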
SentencePiece is used by LLaMA (versions 1 and 2), Mistral, and many multilingual models. LLaMA 3 switched to a tiktoken-based BPE tokenizer with a vocabulary of approximately 128,000 tokens.
The following table summarizes which tokenization algorithm is used by major model families.
| Model Family | Tokenization Algorithm | Library | Vocabulary Size |
|---|---|---|---|
| GPT-2 | Byte-level BPE | tiktoken | ~50,257 |
| GPT-3 | Byte-level BPE | tiktoken | ~50,257 |
| GPT-4 / GPT-4 Turbo | Byte-level BPE (cl100k_base) | tiktoken | ~100,256 |
| GPT-4o / GPT-5 | Byte-level BPE (o200k_base) | tiktoken | ~200,019 |
| BERT | WordPiece | HuggingFace Tokenizers | ~30,522 |
| ALBERT | Unigram (SentencePiece) | SentencePiece | ~30,000 |
| T5 | Unigram (SentencePiece) | SentencePiece | ~32,000 |
| XLNet | Unigram (SentencePiece) | SentencePiece | ~32,000 |
| LLaMA 1 and 2 | BPE (SentencePiece) | SentencePiece | ~32,000 |
| LLaMA 3 | Byte-level BPE | tiktoken | ~128,000 |
| Mistral | BPE (SentencePiece) | SentencePiece | ~32,000 |
| Mistral-Nemo / Ministral | BPE (Tekken) | Tekken | ~131,000 |
| Gemini | BPE (SentencePiece) | SentencePiece | ~256,000 |
Several open-source libraries are widely used for tokenization in research and production.
tiktoken is OpenAI's fast BPE tokenizer, written in Rust with Python bindings. It implements the encodings used by all OpenAI models: r50k_base (GPT-3 era), p50k_base (Codex), cl100k_base (GPT-4, GPT-3.5-Turbo), and o200k_base (GPT-4o and successors). tiktoken is 3 to 6 times faster than comparable open-source tokenizers and is the standard tool for counting tokens before sending requests to the OpenAI API.
HuggingFace Tokenizers is a library written in Rust that provides implementations of BPE, WordPiece, and Unigram tokenizers. It is tightly integrated with the HuggingFace Transformers ecosystem and supports training custom tokenizers from scratch. The library provides both a Python API and a standalone Rust crate.
SentencePiece is Google's C++ library with Python wrappers that implements both BPE and Unigram tokenization in a language-independent manner. It is the tokenizer behind many multilingual and research models including T5, LLaMA 1/2, and XLNet.
For counting tokens in practice, OpenAI provides the tiktoken.encoding_for_model() function, which automatically selects the correct encoding for a given model name. HuggingFace provides AutoTokenizer.from_pretrained() for loading the tokenizer associated with any model on the Hub.
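Both entry points are one-liners. The sketch below counts tokens for the same sentence with each library; the model names are examples only.

```python
# Counting tokens with tiktoken (OpenAI models) and HuggingFace (Hub models).
import tiktoken
from transformers import AutoTokenizer

text = "How many tokens is this sentence?"

# tiktoken selects the encoding that matches an OpenAI model name.
enc = tiktoken.encoding_for_model("gpt-4o")
print(len(enc.encode(text)))

# AutoTokenizer loads the tokenizer shipped with a Hub checkpoint.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(len(tok(text)["input_ids"]))  # includes special tokens such as [CLS] and [SEP]
```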
The vocabulary of a tokenizer is the complete set of tokens it can produce. Vocabulary size is a critical design decision that balances several trade-offs.
A larger vocabulary means that more words and common phrases receive their own dedicated tokens, resulting in shorter token sequences for a given input. Shorter sequences reduce computational cost in the self-attention layers of transformer models, where the cost scales quadratically with sequence length. Larger vocabularies also improve multilingual coverage by dedicating more tokens to non-English scripts and morphemes.
However, a larger vocabulary increases the size of the embedding matrix (vocabulary size multiplied by embedding dimension), which consumes more memory. It can also lead to data sparsity: tokens that appear rarely in training data receive poorly trained embeddings.
Modern LLM vocabulary sizes have grown steadily. The trend from 2018 to 2025 is shown below.
| Year | Representative Model | Vocabulary Size |
|---|---|---|
| 2018 | BERT | ~30,522 |
| 2019 | GPT-2 | ~50,257 |
| 2020 | GPT-3 | ~50,257 |
| 2020 | T5 | ~32,000 |
| 2023 | LLaMA 2 | ~32,000 |
| 2023 | GPT-4 (cl100k_base) | ~100,256 |
| 2024 | LLaMA 3 | ~128,000 |
| 2024 | GPT-4o (o200k_base) | ~200,019 |
| 2024 | Gemini 1.5 | ~256,000 |
| 2025 | Mistral-Nemo / Ministral | ~131,000 |
| 2025 | Gemini 3 | ~262,000 |
This represents roughly an eightfold increase between the 32,000-token vocabularies still common in 2023 (LLaMA 2) and the 256,000-plus vocabularies of 2024-2025 models, reflecting the industry's emphasis on multilingual support and computational efficiency.
The context window of a language model refers to the maximum number of tokens it can process in a single input-output interaction. This includes both the input prompt (system instructions, conversation history, user query) and the generated output. If a conversation exceeds the context window, the model must either truncate older content or use techniques such as summarization or retrieval-augmented generation to stay within limits.
Context windows are measured in tokens, not words or characters. Because of this, the effective length of a context window in human-readable terms depends on the language and content being processed. A 128,000-token context window holds roughly 96,000 English words but fewer words in languages with lower tokenization efficiency.
Context windows have expanded rapidly since the early transformer models. The original BERT model (2018) had a context window of just 512 tokens. By 2024 and 2025, models with context windows of 200,000 tokens or more became common, and some models now support millions of tokens.
| Model | Provider | Context Window (Tokens) | Max Output Tokens |
|---|---|---|---|
| BERT (2018) | Google | 512 | N/A (encoder only) |
| GPT-2 (2019) | OpenAI | 1,024 | 1,024 |
| GPT-3 (2020) | OpenAI | 2,048 | 2,048 |
| GPT-4 (2023) | OpenAI | 8,192 / 32,768 | 8,192 |
| GPT-4 Turbo (2023) | OpenAI | 128,000 | 4,096 |
| GPT-4o (2024) | OpenAI | 128,000 | 16,384 |
| GPT-4.1 (2025) | OpenAI | 1,000,000 | 32,768 |
| GPT-5.4 (2026) | OpenAI | 272,000 (1M via API) | 128,000 |
| Claude 3.5 Sonnet (2024) | Anthropic | 200,000 | 8,192 |
| Claude Sonnet 4 (2025) | Anthropic | 200,000 (1M beta) | 64,000 |
| Claude Sonnet 4.6 (2026) | Anthropic | 1,000,000 | 64,000 |
| Gemini 2.5 Pro (2025) | Google DeepMind | 1,000,000 | 65,536 |
| Gemini 3.1 Pro (2026) | Google DeepMind | 1,000,000 | 65,000 |
| LLaMA 3.1 (2024) | Meta AI | 128,000 | N/A (open weights) |
| Llama 4 Scout (2025) | Meta AI | 10,000,000 | N/A (open weights) |
| Llama 4 Maverick (2025) | Meta AI | 1,000,000 | N/A (open weights) |
| DeepSeek R1 (2025) | DeepSeek | 128,000 | 64,000 |
| Mistral Large (2024) | Mistral AI | 128,000 | N/A |
| Codestral (2025) | Mistral AI | 256,000 | N/A |
It is worth noting that a model's advertised context window does not guarantee consistent performance across the entire range. Research has shown that model accuracy tends to degrade as input length approaches the stated maximum, with some models exhibiting sudden performance drops rather than gradual decline. A model claiming 200K tokens may become unreliable around 130K. The "lost in the middle" phenomenon describes the tendency of models to recall information from the beginning and end of long contexts more accurately (85% to 95%) than from the middle (76% to 82%).
Commercial LLM APIs charge users based on the number of tokens processed. Pricing is typically quoted per million tokens, with separate rates for input tokens (the prompt) and output tokens (the model's response). Output tokens are generally more expensive than input tokens because generating each output token requires a full forward pass through the model.
| Model | Provider | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| GPT-5.4 | OpenAI | $2.50 | $15.00 |
| GPT-5.4 Pro | OpenAI | $30.00 | $180.00 |
| GPT-4o mini | OpenAI | $0.15 | $0.60 |
| o3 | OpenAI | $0.40 | $1.60 |
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 |
| Claude Haiku 3.5 | Anthropic | $0.25 | $1.25 |
| Gemini 2.5 Pro | Google DeepMind | $1.25 | $10.00 |
| Gemini 2.5 Flash | Google DeepMind | $0.30 | $2.50 |
| Gemini 2.0 Flash-Lite | Google DeepMind | $0.075 | $0.30 |
| DeepSeek V3.2 | DeepSeek | $0.14 | $0.28 |
| Grok 4 | xAI | $0.20 | $0.50 |
| Codestral | Mistral AI | $0.30 | $0.90 |
LLM API prices dropped roughly 80% across the board between early 2025 and early 2026, driven by hardware improvements, model distillation, and increased competition among providers. Many providers also offer cached input pricing at roughly 10% of the standard input rate, and batch processing discounts that can reduce costs by up to 50%. Combined, caching and batching can yield total savings of up to 95%.
For cost estimation, a useful rule of thumb is that 1,000 tokens of English text is approximately 750 words, or roughly 1.5 pages of single-spaced text. A full-length novel (~80,000 words) translates to approximately 107,000 tokens.
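Token-based pricing makes cost estimation straightforward once input and output token counts are known. The helper below is a simple sketch using illustrative per-million-token rates; actual prices should be taken from the provider's current price list.

```python
# Back-of-the-envelope API cost estimate from token counts.
# Rates are illustrative per-million-token prices, not authoritative.
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate: float, output_rate: float) -> float:
    return (input_tokens / 1_000_000) * input_rate \
         + (output_tokens / 1_000_000) * output_rate

# Example: a 3,000-token prompt and a 500-token reply at $3.00 / $15.00 per million.
print(f"${estimate_cost(3_000, 500, 3.00, 15.00):.4f}")  # $0.0165
```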
Beyond the subword tokens that represent ordinary text, modern NLP models rely on a set of special tokens that carry structural or functional meaning. These tokens are inserted by the tokenizer to provide the model with signals about the boundaries and roles of different parts of the input. They are not part of the original text but are essential for the model to process inputs correctly.
| Special Token | Common Notation | Purpose | Used By |
|---|---|---|---|
| Beginning of Sequence | <s>, [BOS], <|startoftext|> | Signals the start of a new sequence to the model | GPT, LLaMA, and other autoregressive models |
| End of Sequence | </s>, [EOS], <|endoftext|> | Signals the end of a sequence; tells the model to stop generating | GPT, LLaMA, T5, and most generative models |
| Padding | [PAD] | Fills sequences to a uniform length within a batch so that tensors have consistent dimensions; ignored by the attention mechanism | BERT, T5, and most models during batch processing |
| Classification | [CLS] | Placed at the start of an input; the model's hidden state at this position is used as the aggregate representation for classification tasks | BERT and derived encoder models |
| Separator | [SEP] | Separates two segments of text within a single input (for example, a question and a passage in question answering) | BERT, Electra, and sentence-pair tasks |
| Mask | [MASK] | Replaces a token during masked language model pre-training; the model must predict the original token from context | BERT, RoBERTa, and other masked LMs |
| Unknown | [UNK] | Represents any token that is not found in the vocabulary (primarily in word-level tokenizers; subword tokenizers rarely produce this) | Older word-level tokenizers |
GPT-2 uses the single token <|endoftext|> for both beginning-of-sequence and end-of-sequence roles. During training, documents in the corpus were simply concatenated with this token as a delimiter between them. More recent models, including Claude and LLaMA variants, use distinct tokens for sequence boundaries and additional special tokens for system prompts, tool use, and turn-taking in conversations.
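Special tokens can be inspected directly through a tokenizer's attributes. The snippet below uses the BERT tokenizer as an example; other models expose different special tokens.

```python
# Inspecting special tokens and seeing where they are inserted (BERT example).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.cls_token, tok.sep_token, tok.pad_token, tok.mask_token, tok.unk_token)
# [CLS] [SEP] [PAD] [MASK] [UNK]

encoded = tok("How are you?", "I am fine.")
print(tok.convert_ids_to_tokens(encoded["input_ids"]))
# [CLS] starts the input and [SEP] closes each of the two segments.
```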
Once text has been tokenized, each token must be converted into a numerical representation that a neural network can process. This conversion is performed by the token embedding layer, which maps each token in the vocabulary to a dense, high-dimensional vector.
A token embedding layer is essentially a lookup table. If the vocabulary contains V tokens and the embedding dimension is D, then the embedding layer is a matrix of size V x D. When a token with index i is fed into the model, the embedding layer returns the i-th row of this matrix as the token's embedding vector. During training, the values in this matrix are learned through backpropagation alongside all other model parameters.
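In code, this lookup table is a single learned layer. The PyTorch sketch below uses illustrative sizes (a roughly GPT-2-scale vocabulary and a 768-dimensional embedding); it is a minimal demonstration rather than any particular model's implementation.

```python
# A token embedding layer is a V x D lookup table.
import torch
import torch.nn as nn

vocab_size, embed_dim = 50_257, 768           # illustrative sizes
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([[15496, 995]])      # a batch with two arbitrary token indices
vectors = embedding(token_ids)                # each index selects one row of the matrix
print(vectors.shape)                          # torch.Size([1, 2, 768])
```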
In the original transformer architecture (Vaswani et al., 2017), token embeddings are combined with positional encodings, which provide information about each token's position in the sequence. Without positional information, the model would treat a bag of tokens the same regardless of their order. Modern models use various forms of positional encoding, including sinusoidal encodings (original transformer), learned absolute position embeddings (GPT-2, BERT), and rotary position embeddings (RoPE, used by LLaMA and many recent models).
Earlier embedding approaches like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) produced static embeddings, where each word type has a single fixed vector regardless of context. The word "bank" would have the same vector whether it appeared in "river bank" or "bank account."
Modern transformer-based models like BERT and GPT produce contextual embeddings. The initial token embedding from the lookup table is transformed through multiple layers of self-attention and feed-forward computation. By the time a token's representation reaches the upper layers of the model, its vector encodes not just the token's identity but its role and meaning within the specific sentence. This means the word "bank" receives different representations depending on the surrounding context. BERT-Base produces 768-dimensional contextual embeddings, while BERT-Large generates 1,024-dimensional embeddings.
The dimensionality of token embeddings varies across models. Larger embedding dimensions allow the model to capture more nuanced semantic distinctions but require more memory and computation.
| Model | Embedding Dimension |
|---|---|
| Word2Vec | 100 to 300 |
| GloVe | 50 to 300 |
| BERT Base | 768 |
| BERT Large | 1,024 |
| GPT-3 (175B) | 12,288 |
| LLaMA 2 (70B) | 8,192 |
| GPT-4 | Not publicly disclosed |
Despite the success of subword tokenization, several categories of input continue to present difficulties for current tokenizers. These issues are sometimes called tokenization artifacts, referring to unintended behaviors that arise from how text is split into tokens.
Numbers are tokenized inconsistently. A tokenizer might split the number "12345" into tokens like "123" and "45" or "1" and "2345," depending on how often various digit sequences appeared in training data. This inconsistent splitting makes it difficult for models to learn basic arithmetic or reason about numerical magnitudes. The number "380" and "381" might have entirely different tokenizations, making it harder for the model to learn that they are numerically adjacent.
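The splits can be inspected directly. The snippet below prints how the cl100k_base encoding segments a few digit strings, without assuming what the exact splits will be.

```python
# Inspecting how a BPE encoding splits digit strings (splits vary by encoding).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for number in ["380", "381", "12345"]:
    ids = enc.encode(number)
    print(number, "->", [enc.decode([i]) for i in ids])
```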
Some recent approaches have explored digit-by-digit tokenization for numbers or special numerical encoding schemes to address this limitation.
Programming languages pose tokenization challenges because they contain a mix of natural language identifiers, keywords, operators, whitespace-based indentation (in Python), and special symbols. Variable names and function names in code are often compound words written in camelCase or snake_case, which tokenizers may split unpredictably. Indentation-sensitive languages consume extra tokens for whitespace. Research has shown that a Python project may fit roughly 50% more code into the same context window compared to a language like Rust, highlighting efficiency gaps across programming languages.
As discussed in the section on token-to-word ratios, tokenizers trained on English-dominant corpora produce highly inefficient tokenizations for many other languages. This phenomenon is quantified by token fertility, defined as the average number of tokens produced per word. A fertility of 1.0 means each word maps to exactly one token, while a fertility of 3.0 means each word is split into three tokens on average.
Research evaluating large language models on multilingual tasks has shown that fertility reliably predicts downstream accuracy: higher fertility consistently predicts lower accuracy across all models and subjects. Doubling fertility roughly doubles sequence length, which leads to approximately 4x increases in training cost and inference latency because the self-attention computation scales quadratically with sequence length.
For agglutinative languages (such as Turkish, Finnish, and Hungarian), tokenization interacts destructively with prefix and suffix morphology. The model must reconstruct intended characters from sub-token fragments, recover the morphological structure, and then map back to the semantic concept. This serial reconstruction process consumes model capacity that would otherwise be used for reasoning.
A study published in 2023 found that for some languages (such as Dzongkha, Odia, and Santali), the cost to process equivalent text is more than 12 times higher than for English. This has led to active research in language-fair tokenization, including specialized multilingual tokenizers and vocabulary extension techniques.
The specific positions where a tokenizer splits text can affect model behavior in subtle ways. If a word is split differently depending on surrounding context or casing (for example, "token" as one token vs. "Token" as two tokens), the model may treat these as unrelated concepts. These boundary effects are especially pronounced with proper nouns, technical terminology, and novel compound words that the tokenizer did not encounter during vocabulary training. Research has described these effects as imposing a "cognitive load" on the model, where middle layers must compensate for tokenization inconsistencies out of a shared capacity budget.
Emoji sequences, especially those with skin tone modifiers and zero-width joiners, can expand into many tokens. A single visually rendered emoji may consume 5 to 10 tokens or more. Non-standard Unicode characters, mathematical symbols, and rare scripts also tend to be tokenized very inefficiently.
Excessive whitespace, tabs, and line breaks all consume tokens. Documents with heavy formatting, such as tables or nested lists, use a disproportionate number of tokens relative to their informational content.
Several important NLP tasks operate at the token level, meaning the model must produce a prediction or label for each individual token in the input sequence rather than a single prediction for the entire input.
Named entity recognition is the task of identifying and classifying named entities (people, organizations, locations, dates, monetary values, and other categories) in text. Each token receives a label indicating whether it is part of a named entity and, if so, which type. The standard labeling scheme is BIO (Beginning, Inside, Outside): the first token of an entity receives a B- label, subsequent tokens within the same entity receive I- labels, and tokens outside any entity receive the O label.
For example, the sentence "Barack Obama visited New York" might be labeled:
| Token | Label |
|---|---|
| Barack | B-PER |
| Obama | I-PER |
| visited | O |
| New | B-LOC |
| York | I-LOC |
POS tagging assigns a grammatical category (noun, verb, adjective, adverb, preposition, determiner, and so on) to each token in a sentence. This information is foundational for downstream tasks such as parsing, information extraction, and machine translation. Modern transformer-based POS taggers achieve accuracy above 97% on standard English benchmarks.
Chunking, also called shallow parsing, groups consecutive tokens into syntactic phrases such as noun phrases (NP), verb phrases (VP), and prepositional phrases (PP). Like NER, chunking uses BIO-style labels to indicate the beginning, interior, and exterior of each chunk.
A practical challenge in token-level tasks is that subword tokenizers may split a single word into multiple tokens. For NER or POS tagging, the model must produce a label for each subword token, but the final output should assign a single label per original word. Common strategies include taking the label of the first subword token, majority voting among subword labels, or using special loss masking that ignores all but the first subword of each word during training.
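A common implementation of the first-subword strategy uses the word_ids() mapping exposed by HuggingFace fast tokenizers, as sketched below. The label value -100 is the conventional "ignore" index for the loss; in practice the string labels would also be mapped to integer ids.

```python
# Aligning word-level BIO labels to subword tokens (first-subword strategy).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-cased")
words = ["Barack", "Obama", "visited", "New", "York"]
labels = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]

encoding = tok(words, is_split_into_words=True)
aligned, previous = [], None
for word_id in encoding.word_ids():
    if word_id is None:              # special tokens such as [CLS] and [SEP]
        aligned.append(-100)
    elif word_id != previous:        # first subword keeps the word's label
        aligned.append(labels[word_id])
    else:                            # remaining subwords are masked from the loss
        aligned.append(-100)
    previous = word_id

print(tok.convert_ids_to_tokens(encoding["input_ids"]))
print(aligned)
```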
Tokens play a crucial role in various machine learning tasks, particularly in the domain of NLP and natural language understanding. Some of the primary applications include:
Text Classification. Tokens are used to represent textual data in a format that can be easily processed by machine learning algorithms for tasks such as sentiment analysis, topic classification, and spam detection.
Machine Translation. In machine translation systems, tokens are used to represent source and target language texts, enabling the model to learn mappings between different languages.
Named Entity Recognition. Tokenization allows for the identification of words or phrases that represent specific entities, such as people, organizations, or locations, in a given text.
Text Generation. Autoregressive language models generate text one token at a time. At each step, the model predicts a probability distribution over the vocabulary and samples or selects the next token; a minimal decoding sketch follows this list.
Retrieval-Augmented Generation. Documents are split into token-counted chunks for storage in vector databases, and the number of retrieved tokens must fit within the model's context window alongside the user query and system prompt.
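As a concrete illustration of token-by-token generation, the sketch below performs greedy decoding with the small GPT-2 checkpoint from the HuggingFace Hub; any causal language model could be substituted.

```python
# Greedy, token-by-token text generation with a small causal language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The cat sat on the", return_tensors="pt")["input_ids"]
with torch.no_grad():
    for _ in range(5):                                 # generate five more tokens
        logits = model(ids).logits[:, -1, :]           # distribution over the vocabulary
        next_id = torch.argmax(logits, dim=-1, keepdim=True)  # greedy choice
        ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))
```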
Imagine you have a box of toy blocks with letters on them. If you want to build a sentence or a phrase, you need to use these blocks to make words, and then put those words together to create sentences.
In machine learning, a token is like one of those blocks. But instead of always being one letter, a token can be a whole word (like "cat"), part of a word (like "un" and "happy" for "unhappy"), or even just a letter. The computer breaks text into these blocks so it can understand and work with language.
Why not just use whole words? Because there are too many words in the world, and the computer might see a word it has never seen before. By using smaller building blocks, the computer can handle any word, even new or unusual ones, by putting known pieces together. It is like how you can build any LEGO creation from a set of standard bricks.
When companies charge money for AI chatbots, they count how many tokens you use. That is why you might hear people say things like "this model supports one million tokens" or "it costs three dollars per million tokens."