Token

Updated Jun 21, 2026

A token is the basic unit of text that a language model reads and writes: a word, a subword fragment, a single character, or a byte, produced by splitting text during a step called tokenization. In modern large language models (LLMs), tokens are usually subword pieces, and for English one token corresponds on average to about 4 characters or 0.75 words, so 100 tokens is roughly 75 English words.^[13] Tokens are the unit in which a model's context window is measured and the unit in which commercial APIs are priced, so the cost, length, and even the accuracy of an LLM interaction all depend on how text is broken into tokens.

What is a token in machine learning?

In the field of machine learning, a token refers to a fundamental unit of text or data that is used for processing, analysis, or modeling. Tokens are essential components of natural language processing (NLP) systems, which aim to enable computers to understand, interpret, and generate human language. In this context, a token can represent a single word, a character, a subword, or any other unit of text that serves as an input for a given NLP model. The process of breaking down a given text into its constituent tokens is known as tokenization.

In the era of large language models (LLMs), the concept of a token has become central to how these systems are designed, priced, and evaluated. Every prompt sent to a model and every response generated is measured in tokens. The context window of a model, the cost of an API call, and the quality of the model's output all depend on how text is split into tokens and how many tokens the model can handle at once.

How do tokens differ from words and characters?

A common source of confusion is the relationship between tokens, words, and characters. These three units of text are related but distinct.

Characters are the smallest units of written language: individual letters, digits, punctuation marks, and spaces. The English word "tokenization" contains 12 characters.

Words are sequences of characters separated by spaces or punctuation in most Western languages. "The cat sat on the mat" contains six words.

Tokens fall somewhere between characters and words. Most modern NLP systems use subword tokenization, which splits text into pieces that are sometimes whole words and sometimes fragments of words. For example, a subword tokenizer might split "tokenization" into "token" and "ization," yielding two tokens from one word. Common short words like "the" or "is" are typically kept as single tokens, while rare or long words are broken into multiple subword units.

What is the token-to-word ratio?

For English text processed by modern LLM tokenizers such as OpenAI's tiktoken (cl100k_base encoding), one token corresponds to roughly 0.75 words, or equivalently, about 4 characters. OpenAI's own developer guidance states the rule plainly: "As a rough rule of thumb, 1 token is approximately 4 characters or 0.75 words for English text."^[13] This means 100 tokens translates to approximately 75 English words.^[13] However, this ratio varies significantly depending on the language, the domain of the text, and the specific tokenizer being used.

Language	Approximate Characters per Token	Relative Cost vs. English
English	4 to 5	1.0x (baseline)
Spanish	3 to 4	~1.2x
German	3 to 4	~1.5x
French	3 to 4	~1.3x
Russian (Cyrillic)	3 to 4	~1.5x
Chinese	1 to 2	~2.0x
Japanese	1 to 2	~2.0x
Korean	1 to 2	~1.5x to 2.0x
Arabic	2 to 3	~2.0x
Hindi (Devanagari)	2 to 3	~3.0x
Thai	1 to 2	~3.0x

The disparity arises because most tokenizer vocabularies are trained primarily on English-heavy corpora. Languages with non-Latin scripts or complex morphology require more tokens to express the same meaning, which translates directly into higher processing costs and reduced effective context lengths when using token-based APIs.^[10]
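The rule of thumb and the relative costs in the table above can be combined into a rough estimator. The sketch below is illustrative only: the function name is hypothetical, the multipliers are copied from a few rows of the table, and treating "relative cost" as a simple token-count multiplier is an assumption.

```python
# Approximate relative token cost versus English, copied from the table above.
RELATIVE_COST = {"English": 1.0, "Spanish": 1.2, "German": 1.5, "Chinese": 2.0, "Hindi": 3.0}


def estimate_tokens(english_word_count: int, language: str = "English") -> int:
    """Estimate tokens for text equivalent in meaning to `english_word_count` English words."""
    english_tokens = english_word_count / 0.75  # 1 token is about 0.75 English words
    return round(english_tokens * RELATIVE_COST[language])


print(estimate_tokens(75))            # English baseline
print(estimate_tokens(75, "Hindi"))   # same content, costlier script
```

Such estimates are only useful for budgeting; exact counts require running the actual tokenizer of the target model.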

How has tokenization evolved over time?

The methods used to convert text into tokens have evolved considerably over the past several decades, driven by advances in computational linguistics and deep learning.

Rule-based and word-level era

Early NLP systems in the 1950s through the 1990s relied on rule-based tokenization. Text was split on whitespace and punctuation using hand-crafted rules for each language. This approach was tightly coupled to linguistic expertise and failed to generalize across languages or handle noisy real-world text (such as social media posts or OCR output). Word-level tokenizers were used alongside n-gram language models, where the vocabulary consisted of the most frequent words in a training corpus. Any word not present in the vocabulary was mapped to a special unknown token, a problem known as the out-of-vocabulary (OOV) issue.

Statistical and neural tokenization

The shift toward data-driven tokenization began in earnest with the rise of neural machine translation in the mid-2010s. Sennrich, Haddow, and Birch (2016) adapted Byte Pair Encoding (BPE), originally a data compression algorithm developed by Philip Gage in 1994, for use in NLP.^[1]^[11] Their motivating observation was that "Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem."^[1] Their key insight was that subword units could handle rare words by decomposing them into frequently occurring pieces, effectively eliminating the OOV problem without inflating sequence lengths as much as character-level approaches.^[1] On the WMT 15 benchmark, their subword models improved over a back-off dictionary baseline by up to 1.1 BLEU on English to German and 1.3 BLEU on English to Russian.^[1] Around the same time, Google's neural machine translation team (Wu et al., 2016) employed the WordPiece algorithm, first described by Schuster and Nakajima (2012), in their production system.^[3]^[2] Kudo (2018) introduced the Unigram language model approach and the SentencePiece library, which further removed the dependence on language-specific pre-processing.^[4]^[5]

Modern trends

By the early 2020s, subword tokenization had become the de facto standard for transformer-based models. The main trend since then has been the growth of vocabulary sizes. GPT-2 (2019) used a vocabulary of approximately 50,000 tokens. GPT-4 (2023) expanded to roughly 100,000 tokens with its cl100k_base encoding. LLaMA 3 (2024) adopted a vocabulary of approximately 128,000 tokens, OpenAI's o200k_base encoding (used by GPT-4o and its successors) contains approximately 200,000 tokens, and Gemini models use vocabularies of 256,000 tokens or more. Larger vocabularies encode text more efficiently (fewer tokens per word), which reduces computational cost in the self-attention layers, though at the expense of larger embedding matrices.

Tokenization techniques

Tokenization is a crucial step in the preprocessing of textual data for various machine learning tasks, such as text classification, sentiment analysis, and machine translation. There are several techniques employed for tokenization, each with its own set of advantages and disadvantages.

Word-based tokenization

Word-based tokenization is a straightforward approach that involves segmenting text into individual words, treating each word as a separate token. This technique often uses white spaces and punctuation marks as delimiters. While it is relatively simple and intuitive, word-based tokenization faces several challenges. First, it produces very large vocabularies because every unique word form (including plurals, verb conjugations, and misspellings) gets its own entry. Second, any word not seen during training becomes an out-of-vocabulary (OOV) token, which the model cannot meaningfully process. Third, it struggles with languages that do not use spaces between words, such as Chinese, Japanese, and Thai, or with morphologically rich languages like Finnish and Turkish, where a single word can convey multiple pieces of information through affixes.
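The out-of-vocabulary problem can be seen in a minimal sketch of a word-level tokenizer. The regular expression, the lowercasing step, and the function names are illustrative assumptions, not a reference implementation.

```python
import re


def word_tokenize(text: str) -> list[str]:
    """Split text into lowercase words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(corpus: list[str]) -> set[str]:
    """Collect every distinct token seen in the training corpus."""
    return {tok for doc in corpus for tok in word_tokenize(doc)}


def encode(text: str, vocab: set[str]) -> list[str]:
    """Replace any token missing from the vocabulary with the unknown token."""
    return [tok if tok in vocab else "[UNK]" for tok in word_tokenize(text)]


vocab = build_vocab(["The cat sat on the mat."])
# "cats" was never seen during training, so it is mapped to "[UNK]".
print(encode("The cats sat on the mat.", vocab))
```

Even a simple plural is lost here, which is the weakness that subword methods were designed to address.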

Character-based tokenization

Character-based tokenization divides text into individual characters, which are then used as tokens. This approach eliminates the out-of-vocabulary problem entirely because any text can be represented using a small, fixed set of characters. It can also be useful when dealing with languages that lack clear word boundaries. However, character-based tokenization significantly increases sequence lengths, making it computationally expensive. It also places a heavier burden on the model to learn word-level and phrase-level semantics from very small building blocks, which can lead to reduced performance on many tasks.

Byte-level tokenization

Byte-level tokenization takes character-level tokenization one step further by operating on raw byte values (0 to 255) rather than Unicode characters. This guarantees that any input, in any language or encoding, can be represented without unknown tokens and without needing a massive base vocabulary. The base vocabulary is fixed at 256 entries. Byte-level approaches are used as the foundation for byte-level BPE (as in GPT-2 and later GPT models) and have been explored in models like ByT5 (Xue et al., 2022), which processes text entirely at the byte level without any subword merging. The trade-off is that byte-level sequences are significantly longer than subword sequences, increasing computational cost.
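The difference between character-level and byte-level units can be illustrated with Python's built-in UTF-8 encoding. This is a sketch of the base representation only; it performs no merging, and the sample string is arbitrary.

```python
# Sample text written with escapes: "na" + i-diaeresis + "ve", a space, and two CJK characters.
text = "na\u00efve \u65e5\u672c"

chars = list(text)                     # character-level tokens (Unicode code points)
byte_ids = list(text.encode("utf-8"))  # byte-level tokens, each an integer from 0 to 255

print(len(chars), len(byte_ids))

# Byte-level tokenization is lossless: decoding the bytes restores the text.
restored = bytes(byte_ids).decode("utf-8")
```

ASCII letters occupy one byte each, the accented letter two, and each CJK character three, which is why byte sequences grow fastest for non-Latin scripts.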

Subword tokenization

Subword tokenization combines the benefits of both word-based and character-based tokenization. It involves breaking text into smaller units, known as subwords or subtokens, which are derived from frequently occurring character sequences in the training corpus. Common words remain as single tokens while rare words are decomposed into recognizable subword pieces. This approach has become the dominant tokenization strategy for modern language models and transformers.

What are the main subword tokenization algorithms?

Four algorithms dominate subword tokenization in contemporary NLP. Each takes a different approach to building a vocabulary of subword units.

Byte Pair Encoding (BPE)

Byte pair encoding (BPE) was originally developed as a data compression algorithm by Philip Gage in 1994^[11] and was adapted for NLP tokenization by Sennrich, Haddow, and Birch in 2016.^[1] BPE builds its vocabulary through an iterative bottom-up process:

Initialization. Start with a vocabulary consisting of all individual characters (or bytes) present in the training corpus, plus any special tokens.
Counting. Count the frequency of every adjacent pair of tokens in the corpus.
Merging. Merge the most frequent pair into a single new token and add it to the vocabulary.
Repeating. Repeat the counting and merging steps until the vocabulary reaches a predefined target size (for example, 50,000 or 100,000 tokens).

For example, given the word frequencies ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5), BPE would start with the base vocabulary ["b", "g", "h", "n", "p", "s", "u"]. The most frequent adjacent pair is "u" + "g" (appearing 20 times across "hug", "pug", and "hugs"), so BPE merges them into "ug". The next most common pair is "u" + "n" (appearing 16 times in "pun" and "bun"), yielding "un". This process continues until the target vocabulary size is reached.

BPE guarantees that the most common substrings in the training data receive their own dedicated tokens, while rare strings are decomposed into smaller, already-known pieces. This gives it a good balance between vocabulary size and sequence length.
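The four steps above can be sketched in a few lines of Python using the word frequencies from the example. This is a minimal teaching version, not an optimized tokenizer; the function name and the tie-breaking rule (first pair encountered) are assumptions.

```python
from collections import Counter


def train_bpe(word_freqs: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` BPE merge rules from a word-frequency table."""
    # Represent each word as a tuple of symbols, starting from single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Select the most frequent pair.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the chosen pair with one merged symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges


word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges = train_bpe(word_freqs, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

The first two merges match the worked example; the third merge, "h" + "ug", follows because that pair then occurs 15 times.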

BPE is used by the GPT family of models. OpenAI's tiktoken library implements a highly optimized byte-level BPE tokenizer.^[12] GPT-3 and GPT-2 use this approach, and GPT-4 and later models use the cl100k_base and o200k_base encodings, which are refined versions of byte-level BPE with larger vocabularies.

WordPiece

WordPiece was developed by Schuster and Nakajima in 2012 for Japanese and Korean voice search^[2] and was later adopted by Google for BERT and related models.^[7] WordPiece is similar to BPE in that it iteratively merges character pairs, but it differs in the criterion for choosing which pair to merge. Instead of selecting the most frequent pair, WordPiece selects the pair that maximizes the likelihood of the training data when merged. Specifically, it chooses the pair whose combined frequency divided by the product of the individual frequencies is highest:

score("u", "g") = frequency("ug") / (frequency("u") x frequency("g"))

This scoring favors merging pairs that appear together more often than would be expected by chance, rather than simply the most frequent pair. Two tokens that co-occur far more than their individual frequencies would predict get merged first.
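The effect of this criterion can be checked on the same toy word frequencies used in the BPE example. The sketch below is simplified: it ignores the "##" continuation prefix and scores plain character pairs, and the helper names are hypothetical.

```python
from collections import Counter

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

symbol_counts, pair_counts = Counter(), Counter()
for word, freq in word_freqs.items():
    for ch in word:
        symbol_counts[ch] += freq
    for pair in zip(word, word[1:]):
        pair_counts[pair] += freq


def score(pair: tuple[str, str]) -> float:
    """Pair frequency divided by the product of the individual frequencies."""
    first, second = pair
    return pair_counts[pair] / (symbol_counts[first] * symbol_counts[second])


best_pair = max(pair_counts, key=score)
print(best_pair, score(best_pair))
```

On this corpus the pair "u" + "g" is the most frequent (20 occurrences) but scores only 1/36, because "u" appears in every word. The pair "g" + "s" occurs just 5 times yet scores 1/20, so this criterion selects it first, whereas BPE would pick "u" + "g".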

WordPiece uses the "##" prefix to indicate that a subword is a continuation of a previous token rather than the start of a new word. For example, "playing" might be tokenized as ["play", "##ing"]. WordPiece is the tokenizer behind BERT, DistilBERT, and ELECTRA.
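How the "##" prefix is applied when splitting a word can be sketched with a greedy longest-match-first procedure, the strategy used by BERT-style WordPiece tokenizers. The vocabulary below is hypothetical and far smaller than a real one.

```python
def wordpiece_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split one word by repeatedly taking the longest vocabulary entry that matches."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


vocab = {"play", "##ing", "##ed", "##s"}  # hypothetical vocabulary
print(wordpiece_tokenize("playing", vocab))  # ['play', '##ing']
print(wordpiece_tokenize("jumping", vocab))  # ['[UNK]']
```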

Unigram language model

The Unigram algorithm, proposed by Kudo in 2018, takes the opposite approach from BPE and WordPiece.^[4] Instead of starting small and building up, Unigram starts with a large initial vocabulary (often all substrings up to a certain length or all characters plus frequent substrings) and iteratively removes tokens to shrink the vocabulary down to the desired size.

At each step, Unigram computes how much the overall log-likelihood of the training corpus would decrease if each token were removed, and it discards the tokens whose removal causes the least damage (typically the bottom 10% to 20% of tokens by loss impact). The resulting vocabulary tends to produce tokenizations that are probabilistically optimal under the unigram language model assumption, where each token is treated as independent.

During tokenization at inference time, Unigram can produce multiple valid segmentations for any input and selects the one with the highest probability. This probabilistic framework also allows Unigram to perform subword regularization, where different segmentations are sampled during training to make the model more robust. Research has shown that this regularization consistently improves translation quality, especially on low-resource and out-of-domain settings.^[4]
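Selecting the highest-probability segmentation is usually done with dynamic programming (the Viterbi algorithm). The sketch below assumes a tiny hypothetical vocabulary with made-up probabilities; it illustrates the selection step only, not vocabulary training or subword regularization.

```python
import math


def unigram_segment(text: str, probs: dict[str, float]) -> list[str]:
    """Return the most probable segmentation of `text` under a unigram model."""
    n = len(text)
    # best[i] holds (log-probability, start index of the last token) for text[:i].
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in probs and best[start][0] > -math.inf:
                candidate = best[start][0] + math.log(probs[piece])
                if candidate > best[end][0]:
                    best[end] = (candidate, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the text to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]


# Hypothetical token probabilities, chosen only for illustration.
probs = {"h": 0.1, "u": 0.1, "g": 0.1, "s": 0.1, "hu": 0.1, "ug": 0.2, "hug": 0.25, "gs": 0.05}
print(unigram_segment("hugs", probs))  # ['hug', 's']
```

Here "hug" + "s" has probability 0.25 x 0.1 = 0.025, which beats alternatives such as "hu" + "gs" (0.005) and "h" + "ug" + "s" (0.002).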

Unigram is used by T5, mT5, ALBERT, XLNet, and mBART, typically through the SentencePiece library.

SentencePiece

SentencePiece, developed by Kudo and Richardson in 2018, is not strictly a new tokenization algorithm but rather a language-independent framework and library that can implement either BPE or the Unigram algorithm.^[5] Its key innovation is that it treats the input text as a raw stream of Unicode characters, with whitespace handled as an ordinary symbol, without requiring any language-specific pre-tokenization or preprocessing.^[5]

Most other tokenizers assume that words are separated by spaces and operate on pre-split words. SentencePiece removes this assumption by representing spaces as a special Unicode character, "▁" (U+2581, which resembles an underscore), and learning segmentation boundaries directly from the raw text. This makes SentencePiece particularly effective for languages that do not use spaces between words, such as Chinese, Japanese, and Thai. At decoding time, SentencePiece simply concatenates all tokens and replaces the special space character with a regular space to reconstruct the original text.
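The whitespace handling can be imitated in plain Python without the library. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the leading marker mimics the library's default behaviour of prefixing the text with a space symbol, and the pieces are invented rather than produced by a trained model.

```python
SPACE_MARK = "\u2581"  # U+2581, the visible stand-in for a space


def mark_spaces(text: str) -> str:
    """Replace spaces with the marker and add a leading marker, before segmentation."""
    return SPACE_MARK + text.replace(" ", SPACE_MARK)


def detokenize(pieces: list[str]) -> str:
    """Concatenate pieces, turn markers back into spaces, and drop the leading space."""
    text = "".join(pieces).replace(SPACE_MARK, " ")
    return text[1:] if text.startswith(" ") else text


marked = mark_spaces("Hello world")
pieces = [SPACE_MARK + "Hello", SPACE_MARK + "wor", "ld"]  # hypothetical segmentation
print(detokenize(pieces))  # Hello world
```

Because the space is carried inside the tokens themselves, decoding is a plain string operation that needs no language-specific rules.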

SentencePiece is used by LLaMA (versions 1 and 2), Mistral, and many multilingual models. LLaMA 3 switched to a tiktoken-based BPE tokenizer with a vocabulary of approximately 128,000 tokens.

Tokenizer comparison

The following table summarizes which tokenization algorithm is used by major model families.

Model Family	Tokenization Algorithm	Library	Vocabulary Size
GPT-2	Byte-level BPE	tiktoken	~50,257
GPT-3	Byte-level BPE	tiktoken	~50,257
GPT-4 / GPT-3.5-Turbo	Byte-level BPE (cl100k_base)	tiktoken	~100,256
GPT-4o / GPT-5	Byte-level BPE (o200k_base)	tiktoken	~200,019
BERT	WordPiece	HuggingFace Tokenizers	~30,522
ALBERT	Unigram (SentencePiece)	SentencePiece	~30,000
T5	Unigram (SentencePiece)	SentencePiece	~32,000
XLNet	Unigram (SentencePiece)	SentencePiece	~32,000
LLaMA 1 and 2	BPE (SentencePiece)	SentencePiece	~32,000
LLaMA 3	Byte-level BPE	tiktoken	~128,000
Mistral	BPE (SentencePiece)	SentencePiece	~32,000
Mistral-Nemo / Ministral	BPE (Tekken)	Tekken	~131,000
Gemini	BPE (SentencePiece)	SentencePiece	~256,000

Tokenizer libraries and tools

Several open-source libraries are widely used for tokenization in research and production.

tiktoken is OpenAI's fast BPE tokenizer, written in Rust with Python bindings.^[12] It implements the encodings used by all OpenAI models: r50k_base (GPT-3 era), p50k_base (Codex), cl100k_base (GPT-4, GPT-3.5-Turbo), and o200k_base (GPT-4o and successors). tiktoken is 3 to 6 times faster than comparable open-source tokenizers and is the standard tool for counting tokens before sending requests to the OpenAI API.^[12]

HuggingFace Tokenizers is a library written in Rust that provides implementations of BPE, WordPiece, and Unigram tokenizers. It is tightly integrated with the HuggingFace Transformers ecosystem and supports training custom tokenizers from scratch. The library provides both a Python API and a standalone Rust crate.

SentencePiece is Google's C++ library with Python wrappers that implements both BPE and Unigram tokenization in a language-independent manner. It is the tokenizer behind many multilingual and research models including T5, LLaMA 1/2, and XLNet.

For counting tokens in practice, OpenAI provides the tiktoken.encoding_for_model() function, which automatically selects the correct encoding for a given model name. HuggingFace provides AutoTokenizer.from_pretrained() for loading the tokenizer associated with any model on the Hub.

What is a token vocabulary?

The vocabulary of a tokenizer is the complete set of tokens it can produce. Vocabulary size is a critical design decision that balances several trade-offs.

A larger vocabulary means that more words and common phrases receive their own dedicated tokens, resulting in shorter token sequences for a given input. Shorter sequences reduce computational cost in the self-attention layers of transformer models, where the cost scales quadratically with sequence length. Larger vocabularies also improve multilingual coverage by dedicating more tokens to non-English scripts and morphemes.

However, a larger vocabulary increases the size of the embedding matrix (vocabulary size multiplied by embedding dimension), which consumes more memory. It can also lead to data sparsity: tokens that appear rarely in training data receive poorly trained embeddings.
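The memory side of this trade-off is simple arithmetic. The sketch below uses a hypothetical embedding dimension of 4,096 and assumes 2 bytes per parameter (16-bit weights); both values are illustrative, not taken from any specific model.

```python
def embedding_parameters(vocab_size: int, embedding_dim: int) -> int:
    """Number of weights in a V x D embedding matrix."""
    return vocab_size * embedding_dim


def embedding_megabytes(vocab_size: int, embedding_dim: int, bytes_per_param: int = 2) -> float:
    """Approximate storage for the embedding matrix in megabytes (10**6 bytes)."""
    return embedding_parameters(vocab_size, embedding_dim) * bytes_per_param / 1e6


for vocab_size in (32_000, 128_000, 256_000):
    params = embedding_parameters(vocab_size, 4096)
    print(vocab_size, params, embedding_megabytes(vocab_size, 4096))
```

Under these assumptions, moving from a 32,000-token to a 128,000-token vocabulary grows the matrix from about 131 million to about 524 million weights.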

Modern LLM vocabulary sizes have grown steadily. The trend from 2018 to 2025 is shown below.

Year	Representative Model	Vocabulary Size
2018	BERT	~30,522
2019	GPT-2	~50,257
2020	GPT-3	~50,257
2020	T5	~32,000
2023	LLaMA 2	~32,000
2023	GPT-4 (cl100k_base)	~100,256
2024	LLaMA 3	~128,000
2024	GPT-4o (o200k_base)	~200,019
2024	Gemini 1.5	~256,000
2025	Mistral-Nemo / Ministral	~131,000
2025	Gemini 3	~262,000

This represents roughly an 8x increase in vocabulary size over about seven years (from roughly 30K in 2018 to 256K+ by 2024 and 2025), reflecting the industry's emphasis on multilingual support and computational efficiency.

What is a context window measured in tokens?

The context window of a language model refers to the maximum number of tokens it can process in a single input-output interaction. This includes both the input prompt (system instructions, conversation history, user query) and the generated output. If a conversation exceeds the context window, the model must either truncate older content or use techniques such as summarization or retrieval-augmented generation to stay within limits.
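Truncating older content is the simplest of these strategies. The sketch below keeps the most recent messages that fit an input budget; the function names are hypothetical, and the 4-characters-per-token counter is a stand-in for a real tokenizer.

```python
from typing import Callable


def fit_to_context(
    messages: list[str],
    context_window: int,
    reserved_for_output: int,
    count_tokens: Callable[[str], int],
) -> list[str]:
    """Keep the newest messages whose combined token count fits the input budget."""
    budget = context_window - reserved_for_output
    kept, used = [], 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return kept[::-1]  # restore chronological order


def rough_count(text: str) -> int:
    """Crude stand-in for a tokenizer: about 4 characters per token."""
    return max(1, round(len(text) / 4))


messages = ["a" * 40, "b" * 40, "c" * 40]  # roughly 10 tokens each
trimmed = fit_to_context(messages, context_window=30, reserved_for_output=8, count_tokens=rough_count)
```

Reserving part of the window for the response matters because input and output share the same token limit.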

Context windows are measured in tokens, not words or characters. Because of this, the effective length of a context window in human-readable terms depends on the language and content being processed. A 128,000-token context window holds roughly 96,000 English words but fewer words in languages with lower tokenization efficiency.

Context window sizes by model

Context windows have expanded rapidly since the early transformer models. The original BERT model (2018) had a context window of just 512 tokens. By 2024 and 2025, models with context windows of 200,000 tokens or more became common, and some models now support millions of tokens.

Model	Provider	Context Window (Tokens)	Max Output Tokens
BERT (2018)	Google	512	N/A (encoder only)
GPT-2 (2019)	OpenAI	1,024	1,024
GPT-3 (2020)	OpenAI	2,048	2,048
GPT-4 (2023)	OpenAI	8,192 / 32,768	8,192
GPT-4 Turbo (2023)	OpenAI	128,000	4,096
GPT-4o (2024)	OpenAI	128,000	16,384
GPT-4.1 (2025)	OpenAI	1,000,000	32,768
GPT-5.4 (2026)	OpenAI	272,000 (1M via API)	128,000
Claude 3.5 Sonnet (2024)	Anthropic	200,000	8,192
Claude Sonnet 4 (2025)	Anthropic	200,000 (1M beta)	64,000
Claude Sonnet 4.6 (2026)	Anthropic	1,000,000	64,000
Gemini 2.5 Pro (2025)	Google DeepMind	1,000,000	65,536
Gemini 3.1 Pro (2026)	Google DeepMind	1,000,000	65,000
LLaMA 3.1 (2024)	Meta AI	128,000	N/A (open weights)
Llama 4 Scout (2025)	Meta AI	10,000,000	N/A (open weights)
Llama 4 Maverick (2025)	Meta AI	1,000,000	N/A (open weights)
DeepSeek R1 (2025)	DeepSeek	128,000	64,000
Mistral Large (2024)	Mistral AI	128,000	N/A
Codestral (2025)	Mistral AI	256,000	N/A

It is worth noting that a model's advertised context window does not guarantee consistent performance across the entire range. Research has shown that model accuracy tends to degrade as input length approaches the stated maximum, with some models exhibiting sudden performance drops rather than gradual decline. A model claiming 200K tokens may become unreliable around 130K. The "lost in the middle" phenomenon, documented by Liu et al. (2023), describes the tendency of models to recall information placed at the start or end of a long context more reliably than information in the middle. As the authors put it, "performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle."^[14] In the article's measured terms, models recall information from the beginning and end of long contexts more accurately (85% to 95%) than from the middle (76% to 82%).

How does token-based pricing work?

Commercial LLM APIs charge users based on the number of tokens processed. Pricing is typically quoted per million tokens, with separate rates for input tokens (the prompt) and output tokens (the model's response). Output tokens are generally more expensive than input tokens because the prompt can be processed in parallel in a single pass, whereas output tokens are generated one at a time, each requiring its own forward pass through the model.

API pricing comparison (as of early 2026)

Model	Provider	Input (per 1M tokens)	Output (per 1M tokens)
GPT-5.4	OpenAI	$2.50	$15.00
GPT-5.4 Pro	OpenAI	$30.00	$180.00
GPT-4o mini	OpenAI	$0.15	$0.60
o3	OpenAI	$0.40	$1.60
Claude Opus 4.6	Anthropic	$5.00	$25.00
Claude Sonnet 4.6	Anthropic	$3.00	$15.00
Claude Haiku 3.5	Anthropic	$0.25	$1.25
Gemini 2.5 Pro	Google DeepMind	$1.25	$10.00
Gemini 2.5 Flash	Google DeepMind	$0.30	$2.50
Gemini 2.0 Flash-Lite	Google DeepMind	$0.075	$0.30
DeepSeek V3.2	DeepSeek	$0.14	$0.28
Grok 4	xAI	$0.20	$0.50
Codestral	Mistral AI	$0.30	$0.90

LLM API prices dropped roughly 80% across the board between early 2025 and early 2026, driven by hardware improvements, model distillation, and increased competition among providers. Many providers also offer cached input pricing at roughly 10% of the standard input rate, and batch processing discounts that can reduce costs by up to 50%. Where both apply, a cached input token processed in a batch costs about 5% of the standard input rate (10% x 50%), a saving of up to 95% on those tokens.

For cost estimation, a useful rule of thumb is that 1,000 tokens of English text is approximately 750 words, or roughly 1.5 pages of single-spaced text.^[13] A full-length novel (~80,000 words) translates to approximately 107,000 tokens.
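Per-request cost follows directly from the per-million-token rates. The sketch below is a hedged illustration: the function and parameter names are hypothetical, the $3.00 and $15.00 rates are borrowed from one row of the table above purely as sample values, and the 10% cached-input rate is the approximate figure quoted in the text.

```python
def api_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    cached_fraction: float = 0.0,
    cache_discount: float = 0.10,
) -> float:
    """Estimate the cost of one request in US dollars."""
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    # Cached input tokens are billed at a fraction of the standard input rate.
    input_cost = (uncached + cached * cache_discount) * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost


# 10,000 prompt tokens and 2,000 response tokens at $3.00 / $15.00 per million.
print(api_cost_usd(10_000, 2_000, 3.00, 15.00))
print(api_cost_usd(10_000, 2_000, 3.00, 15.00, cached_fraction=1.0))
```

In this example the 2,000 output tokens cost as much as the 10,000 input tokens, which shows why response length often dominates the bill.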

What are special tokens?

Beyond the subword tokens that represent ordinary text, modern NLP models rely on a set of special tokens that carry structural or functional meaning. These tokens are inserted by the tokenizer to provide the model with signals about the boundaries and roles of different parts of the input. They are not part of the original text but are essential for the model to process inputs correctly.

Special Token	Common Notation	Purpose	Used By
Beginning of Sequence	`<s>`, `[BOS]`, `<|startoftext|>`	Signals the start of a new sequence to the model	GPT, LLaMA, and other autoregressive models
End of Sequence	`</s>`, `[EOS]`, `<|endoftext|>`	Signals the end of a sequence; tells the model to stop generating	GPT, LLaMA, T5, and most generative models
Padding	`[PAD]`	Fills sequences to a uniform length within a batch so that tensors have consistent dimensions; ignored by the attention mechanism	BERT, T5, and most models during batch processing
Classification	`[CLS]`	Placed at the start of an input; the model's hidden state at this position is used as the aggregate representation for classification tasks	BERT and derived encoder models
Separator	`[SEP]`	Separates two segments of text within a single input (for example, a question and a passage in question answering)	BERT, ELECTRA, and sentence-pair tasks
Mask	`[MASK]`	Replaces a token during masked language model pre-training; the model must predict the original token from context	BERT, RoBERTa, and other masked LMs
Unknown	`[UNK]`	Represents any token that is not found in the vocabulary (primarily in word-level tokenizers; subword tokenizers rarely produce this)	Older word-level tokenizers

GPT-2 uses the single token <|endoftext|> for both beginning-of-sequence and end-of-sequence roles. During training, documents in the corpus were simply concatenated with this token as a delimiter between them. More recent models, including Claude and LLaMA variants, use distinct tokens for sequence boundaries and additional special tokens for system prompts, tool use, and turn-taking in conversations.
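How the BERT-style special tokens from the table fit together can be sketched for a sentence-pair input. The function name is hypothetical and the tokens are shown as strings rather than vocabulary indices; the layout ([CLS], first segment, [SEP], second segment, [SEP], then padding) follows the conventional BERT input format.

```python
def build_pair_input(
    tokens_a: list[str], tokens_b: list[str], max_length: int
) -> tuple[list[str], list[int]]:
    """Assemble a padded two-segment input and its attention mask."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_length:
        raise ValueError("input is longer than max_length")
    attention_mask = [1] * len(tokens)  # 1 marks real tokens
    padding = max_length - len(tokens)
    # Padding positions get mask value 0 so attention can ignore them.
    return tokens + ["[PAD]"] * padding, attention_mask + [0] * padding


tokens, mask = build_pair_input(["who", "?"], ["me"], max_length=8)
print(tokens)
print(mask)
```

The attention mask is what makes the padding "ignored": the [PAD] tokens are still present in the tensor, but masked positions contribute nothing to attention.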

What are token embeddings?

Once text has been tokenized, each token must be converted into a numerical representation that a neural network can process. This conversion is performed by the token embedding layer, which maps each token in the vocabulary to a dense, high-dimensional vector.

How token embeddings work

A token embedding layer is essentially a lookup table. If the vocabulary contains V tokens and the embedding dimension is D, then the embedding layer is a matrix of size V x D. When a token with index i is fed into the model, the embedding layer returns the i-th row of this matrix as the token's embedding vector. During training, the values in this matrix are learned through backpropagation alongside all other model parameters.
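The lookup itself can be shown without any deep learning framework. In the sketch below the table is filled with random values as a stand-in for learned weights; the sizes, the seed, and the function names are illustrative assumptions.

```python
import random


def make_embedding_table(vocab_size: int, dim: int, seed: int = 0) -> list[list[float]]:
    """Create a V x D table of small random values (placeholders for trained weights)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]


def embed(token_ids: list[int], table: list[list[float]]) -> list[list[float]]:
    """Return the row of the table for each token index."""
    return [table[i] for i in token_ids]


table = make_embedding_table(vocab_size=5, dim=3)
vectors = embed([2, 0, 2], table)  # token 2 appears twice and maps to the same row
```

The same token index always yields the same row, which is why these input embeddings are context-independent until later layers transform them.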

In the original transformer architecture (Vaswani et al., 2017), token embeddings are combined with positional encodings, which provide information about each token's position in the sequence.^[6] Without positional information, the model would treat a bag of tokens the same regardless of their order. Modern models use various forms of positional encoding, including sinusoidal encodings (original transformer), learned absolute position embeddings (GPT-2, BERT), and rotary position embeddings (RoPE, used by LLaMA and many recent models).
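The sinusoidal scheme of the original transformer can be written directly from its definition: even dimensions use a sine and odd dimensions a cosine of the position divided by 10000 raised to a dimension-dependent power. The sketch below is a plain-Python illustration with hypothetical function names; the embedding values are invented.

```python
import math


def sinusoidal_encoding(position: int, dim: int) -> list[float]:
    """Positional encoding vector for one position (dim is assumed to be even)."""
    vector = []
    for i in range(0, dim, 2):  # i is the even dimension index
        angle = position / (10000 ** (i / dim))
        vector.append(math.sin(angle))  # dimension i
        vector.append(math.cos(angle))  # dimension i + 1
    return vector


def add_position(embedding: list[float], position: int) -> list[float]:
    """Combine a token embedding with its positional encoding by element-wise addition."""
    encoding = sinusoidal_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, encoding)]


print(sinusoidal_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(add_position([0.5, 0.5, 0.5, 0.5], position=0))
```

Because the encoding differs at every position, two occurrences of the same token no longer produce identical input vectors.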

Static vs. contextual embeddings

Earlier embedding approaches like Word2Vec (Mikolov et al., 2013)^[8] and GloVe (Pennington et al., 2014)^[9] produced static embeddings, where each word type has a single fixed vector regardless of context. The word "bank" would have the same vector whether it appeared in "river bank" or "bank account."

Modern transformer-based models like BERT and GPT produce contextual embeddings. The initial token embedding from the lookup table is transformed through multiple layers of self-attention and feed-forward computation. By the time a token's representation reaches the upper layers of the model, its vector encodes not just the token's identity but its role and meaning within the specific sentence. This means the word "bank" receives different representations depending on the surrounding context. BERT-Base produces 768-dimensional contextual embeddings, while BERT-Large generates 1,024-dimensional embeddings.^[7]

Embedding dimensions

The dimensionality of token embeddings varies across models. Larger embedding dimensions allow the model to capture more nuanced semantic distinctions but require more memory and computation.

Model	Embedding Dimension
Word2Vec	100 to 300
GloVe	50 to 300
BERT Base	768
BERT Large	1,024
GPT-3 (175B)	12,288
LLaMA 2 (70B)	8,192
GPT-4	Not publicly disclosed

What are tokenization challenges and artifacts?

Despite the success of subword tokenization, several categories of input continue to present difficulties for current tokenizers. These issues are sometimes called tokenization artifacts, referring to unintended behaviors that arise from how text is split into tokens.

Numbers and arithmetic

Numbers are tokenized inconsistently. A tokenizer might split the number "12345" into tokens like "123" and "45" or "1" and "2345," depending on how often various digit sequences appeared in training data. This inconsistent splitting makes it difficult for models to learn basic arithmetic or reason about numerical magnitudes. The numbers "380" and "381" might have entirely different tokenizations, making it harder for the model to learn that they are numerically adjacent.

Some recent approaches have explored digit-by-digit tokenization for numbers or special numerical encoding schemes to address this limitation.
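Digit-by-digit splitting can be sketched as a pre-tokenization rule. The regular expression and function name below are illustrative assumptions, not the scheme of any particular model.

```python
import re


def split_digits(text: str) -> list[str]:
    """Make every digit its own token; keep other non-space runs intact."""
    return re.findall(r"\d|[^\d\s]+", text)


print(split_digits("380 + 381"))  # ['3', '8', '0', '+', '3', '8', '1']
```

With this rule, "380" and "381" share their first two tokens and differ only in the last, so their numerical closeness is visible in the token sequence.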

Source code

Programming languages pose tokenization challenges because they contain a mix of natural language identifiers, keywords, operators, whitespace-based indentation (in Python), and special symbols. Variable names and function names in code are often compound words written in camelCase or snake_case, which tokenizers may split unpredictably. Indentation-sensitive languages consume extra tokens for whitespace. Research has shown that a Python project may fit roughly 50% more code into the same context window compared to a language like Rust, highlighting efficiency gaps across programming languages.

Multilingual text and token fertility

As discussed in the section on token-to-word ratios, tokenizers trained on English-dominant corpora produce highly inefficient tokenizations for many other languages. This phenomenon is quantified by token fertility, defined as the average number of tokens produced per word. A fertility of 1.0 means each word maps to exactly one token, while a fertility of 3.0 means each word is split into three tokens on average.
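Fertility is straightforward to compute once a tokenizer is available. The sketch below uses a deliberately crude, hypothetical tokenizer that cuts each word into 4-character chunks; a real measurement would substitute the tokenizer of the model under study.

```python
from typing import Callable


def fertility(text: str, tokenize: Callable[[str], list[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    return len(tokenize(text)) / len(words)


def chunk_tokenizer(text: str, size: int = 4) -> list[str]:
    """Toy tokenizer: split every word into chunks of at most `size` characters."""
    return [word[i:i + size] for word in text.split() for i in range(0, len(word), size)]


print(fertility("the cat", chunk_tokenizer))              # 1.0
print(fertility("tokenization is fun", chunk_tokenizer))  # 5 tokens over 3 words
```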

Research evaluating large language models on multilingual tasks has shown that fertility reliably predicts downstream accuracy: higher fertility consistently predicts lower accuracy across all models and subjects.^[10] Doubling fertility leads to approximately 4x increases in training cost and inference latency, because doubling the number of tokens doubles the sequence length and self-attention cost grows quadratically with sequence length.^[10]

For agglutinative languages (such as Turkish, Finnish, and Hungarian), tokenization interacts destructively with prefix and suffix morphology. The model must reconstruct intended characters from sub-token fragments, recover the morphological structure, and then map back to the semantic concept. This serial reconstruction process consumes model capacity that would otherwise be used for reasoning.

A study published in 2023 found that for some languages (such as Dzongkha, Odia, and Santali), the cost to process equivalent text is more than 12 times higher than for English.^[10] This has led to active research in language-fair tokenization, including specialized multilingual tokenizers and vocabulary extension techniques.

Token boundary effects

The specific positions where a tokenizer splits text can affect model behavior in subtle ways. If a word is split differently depending on surrounding context or casing (for example, "token" as one token vs. "Token" as two tokens), the model may treat these as unrelated concepts. These boundary effects are especially pronounced with proper nouns, technical terminology, and novel compound words that the tokenizer did not encounter during vocabulary training. Research has described these effects as imposing a "cognitive load" on the model, where middle layers must compensate for tokenization inconsistencies out of a shared capacity budget.
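
A toy example makes the casing effect concrete. The sketch below uses greedy longest-match segmentation, which loosely resembles how WordPiece segments words at inference time; the three-entry vocabulary is invented for illustration and does not come from any real model.

```python
from typing import List, Set


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match segmentation with single-character fallback."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: emit one character as-is.
            tokens.append(text[i])
            i += 1
    return tokens


vocab = {"token", "oken", "T"}  # hypothetical vocabulary

print(greedy_tokenize("token", vocab))  # ['token']
print(greedy_tokenize("Token", vocab))  # ['T', 'oken']
```

The two spellings share no token at all, so nothing in the input tells the model that they are the same word.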

Emoji and Unicode

Emoji sequences, especially those with skin tone modifiers and zero-width joiners, can expand into many tokens. A single visually rendered emoji may consume 5 to 10 tokens or more. Non-standard Unicode characters, mathematical symbols, and rare scripts also tend to be tokenized very inefficiently.
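
The expansion is easier to see by looking at what a single rendered emoji contains. The sketch below counts Unicode code points and UTF-8 bytes only. The number of tokens depends on the tokenizer's vocabulary, but a byte-level tokenizer with no merged entry for the sequence could, in the worst case, emit one token per byte.

```python
# Family emoji: four person emoji joined by three zero-width joiners (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
# Thumbs-up followed by a medium skin tone modifier.
thumbs_up = "\U0001F44D\U0001F3FD"


def describe(text: str) -> dict:
    """Count code points and UTF-8 bytes in a string."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
    }


print(describe(family))     # {'code_points': 7, 'utf8_bytes': 25}
print(describe(thumbs_up))  # {'code_points': 2, 'utf8_bytes': 8}
```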

Whitespace and formatting

Excessive whitespace, tabs, and line breaks all consume tokens. Documents with heavy formatting, such as tables or nested lists, use a disproportionate number of tokens relative to their informational content.

Token-level tasks

Several important NLP tasks operate at the token level, meaning the model must produce a prediction or label for each individual token in the input sequence rather than a single prediction for the entire input.

Named Entity Recognition (NER)

Named entity recognition is the task of identifying and classifying named entities (people, organizations, locations, dates, monetary values, and other categories) in text. Each token receives a label indicating whether it is part of a named entity and, if so, which type. The standard labeling scheme is BIO (Beginning, Inside, Outside): the first token of an entity receives a B- label, subsequent tokens within the same entity receive I- labels, and tokens outside any entity receive the O label.

For example, the sentence "Barack Obama visited New York" might be labeled:

Token	Label
Barack	B-PER
Obama	I-PER
visited	O
New	B-LOC
York	I-LOC
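
The labels in the table can be generated mechanically from entity spans. A minimal sketch, assuming entities are given as hypothetical `(start, end_exclusive, type)` tuples over token indices:

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, type) entity spans to BIO labels."""
    labels = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"invalid span: {(start, end, entity_type)}")
        labels[start] = f"B-{entity_type}"      # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{entity_type}"      # remaining tokens of the entity
    return labels


tokens = ["Barack", "Obama", "visited", "New", "York"]
spans = [(0, 2, "PER"), (3, 5, "LOC")]

for token, label in zip(tokens, spans_to_bio(tokens, spans)):
    print(f"{token}\t{label}")
```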

Part-of-Speech (POS) tagging

POS tagging assigns a grammatical category (noun, verb, adjective, adverb, preposition, determiner, and so on) to each token in a sentence. This information is foundational for downstream tasks such as parsing, information extraction, and machine translation. Modern transformer-based POS taggers achieve accuracy above 97% on standard English benchmarks.

Chunking

Chunking, also called shallow parsing, groups consecutive tokens into syntactic phrases such as noun phrases (NP), verb phrases (VP), and prepositional phrases (PP). Like NER, chunking uses BIO-style labels to indicate the beginning, interior, and exterior of each chunk.
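
Going the other way, BIO labels can be decoded back into phrases. The sketch below is one possible decoder; the sentence and its labels are invented for illustration, and treating a stray I- label as the start of a new chunk is a design choice of this sketch rather than a standard.

```python
from typing import List, Optional, Tuple


def bio_to_chunks(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group tokens into (chunk_type, text) pairs from BIO labels."""
    chunks: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        if current:
            chunks.append((current_type, " ".join(current)))

    for token, label in zip(tokens, labels):
        continues = label.startswith("I-") and label[2:] == current_type
        if continues:
            current.append(token)
        elif label == "O":
            flush()
            current, current_type = [], None
        else:
            # B- label, or an I- label that does not continue the open chunk.
            flush()
            current, current_type = [token], label[2:]
    flush()
    return chunks


tokens = ["The", "quick", "fox", "jumped", "over", "the", "fence"]
labels = ["B-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP"]
print(bio_to_chunks(tokens, labels))
# [('NP', 'The quick fox'), ('VP', 'jumped'), ('PP', 'over'), ('NP', 'the fence')]
```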

Token classification in subword models

A practical challenge in token-level tasks is that subword tokenizers may split a single word into multiple tokens. For NER or POS tagging, the model must produce a label for each subword token, but the final output should assign a single label per original word. Common strategies include taking the label of the first subword token, majority voting among subword labels, or using special loss masking that ignores all but the first subword of each word during training.
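
The first-subword strategy with loss masking can be sketched as follows. `toy_subwords` is a hypothetical tokenizer invented for this example, and the label IDs are arbitrary. The value -100 is used because it matches the default `ignore_index` of PyTorch's `CrossEntropyLoss`, although any sentinel the training loss is configured to skip would work.

```python
from typing import Callable, List, Tuple

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def toy_subwords(word: str, piece_len: int = 4) -> List[str]:
    """Hypothetical tokenizer: fixed-size pieces, '##' marks continuations."""
    pieces = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
    return pieces[:1] + ["##" + piece for piece in pieces[1:]]


def align_labels(
    words: List[str],
    word_label_ids: List[int],
    tokenize: Callable[[str], List[str]],
) -> Tuple[List[str], List[int]]:
    """Label the first subword of each word; mask the remaining subwords."""
    subwords: List[str] = []
    label_ids: List[int] = []
    for word, label_id in zip(words, word_label_ids):
        pieces = tokenize(word)
        if not pieces:
            continue  # nothing to label for an empty word
        subwords.extend(pieces)
        label_ids.extend([label_id] + [IGNORE_INDEX] * (len(pieces) - 1))
    return subwords, label_ids


def word_level_predictions(subword_preds: List[int], aligned: List[int]) -> List[int]:
    """Keep only the predictions made at the first subword of each word."""
    return [pred for pred, label in zip(subword_preds, aligned) if label != IGNORE_INDEX]


words = ["Obama", "visited"]
word_label_ids = [1, 0]  # hypothetical IDs: 1 = B-PER, 0 = O

subwords, aligned = align_labels(words, word_label_ids, toy_subwords)
print(subwords)  # ['Obam', '##a', 'visi', '##ted']
print(aligned)   # [1, -100, 0, -100]
print(word_level_predictions([1, 2, 0, 0], aligned))  # [1, 0]
```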

What are tokens used for in machine learning?

Tokens play a crucial role in various machine learning tasks, particularly in the domain of NLP and natural language understanding. Some of the primary applications include:

Text Classification. Tokens are used to represent textual data in a format that can be easily processed by machine learning algorithms for tasks such as sentiment analysis, topic classification, and spam detection.
Machine Translation. In machine translation systems, tokens are used to represent source and target language texts, enabling the model to learn mappings between different languages.
Named Entity Recognition. Tokenization allows for the identification of words or phrases that represent specific entities, such as people, organizations, or locations, in a given text.
Text Generation. Autoregressive language models generate text one token at a time. At each step, the model predicts a probability distribution over the vocabulary and samples or selects the next token.
Retrieval-Augmented Generation. Documents are split into token-counted chunks for storage in vector databases, and the number of retrieved tokens must fit within the model's context window alongside the user query and system prompt.
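
The token budgeting described for retrieval-augmented generation can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are represented as plain integer IDs, chunks arrive already ranked by relevance, and the function names, chunk size, and budget numbers are all hypothetical.

```python
from typing import List


def chunk_tokens(tokens: List[int], chunk_size: int, overlap: int = 0) -> List[List[int]]:
    """Split a token sequence into fixed-size chunks with optional overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not tokens:
        return []
    step = chunk_size - overlap
    # Stop early so the final chunk is never made up of overlap alone.
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + chunk_size] for i in range(0, last_start, step)]


def fit_to_budget(
    ranked_chunks: List[List[int]], context_window: int, reserved_tokens: int
) -> List[List[int]]:
    """Take chunks in rank order until the remaining token budget is used up."""
    budget = context_window - reserved_tokens
    selected: List[List[int]] = []
    used = 0
    for chunk in ranked_chunks:
        if used + len(chunk) > budget:
            break
        selected.append(chunk)
        used += len(chunk)
    return selected


document = list(range(10))  # ten placeholder token IDs
chunks = chunk_tokens(document, chunk_size=4, overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]

# A 12-token window with 3 tokens reserved for the query and system prompt.
print(fit_to_budget(chunks, context_window=12, reserved_tokens=3))
# [[0, 1, 2, 3], [3, 4, 5, 6]]
```

Here `reserved_tokens` stands for everything else that must share the window with the retrieved text. In practice a budget would usually also leave room for the model's response.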

Explain Like I'm 5 (ELI5)

Imagine you have a box of toy blocks with letters on them. If you want to build a sentence or a phrase, you need to use these blocks to make words, and then put those words together to create sentences.

In machine learning, a token is like one of those blocks. But instead of always being one letter, a token can be a whole word (like "cat"), part of a word (like "un" and "happy" for "unhappy"), or even just a letter. The computer breaks text into these blocks so it can understand and work with language.

Why not just use whole words? Because there are too many words in the world, and the computer might see a word it has never seen before. By using smaller building blocks, the computer can handle any word, even new or unusual ones, by putting known pieces together. It is like how you can build any LEGO creation from a set of standard bricks.

When companies charge money for AI chatbots, they count how many tokens you use. That is why you might hear people say things like "this model supports one million tokens" or "it costs three dollars per million tokens."

References

1. Sennrich, R., Haddow, B., & Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." *Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL)*, pp. 1715-1725. The foundational paper adapting BPE for NLP tokenization. https://aclanthology.org/P16-1162/
2. Schuster, M., & Nakajima, K. (2012). "Japanese and Korean voice search." *IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. The paper introducing the WordPiece algorithm.
3. Wu, Y., Schuster, M., Chen, Z., Le, Q. V., et al. (2016). "Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation." *arXiv preprint arXiv:1609.08144*. Describes the use of WordPiece in Google's production NMT system.
4. Kudo, T. (2018). "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates." *Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL)*. The paper proposing the Unigram tokenization algorithm.
5. Kudo, T., & Richardson, J. (2018). "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing." *Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)*.
6. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems (NeurIPS)*. The paper introducing the transformer architecture and its token embedding approach.
7. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *Proceedings of NAACL-HLT*. Introduced the use of WordPiece tokenization along with special tokens like [CLS], [SEP], and [MASK].
8. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). "Efficient Estimation of Word Representations in Vector Space." *arXiv preprint arXiv:1301.3781*. The Word2Vec paper that pioneered dense word embeddings.
9. Pennington, J., Socher, R., & Manning, C. (2014). "GloVe: Global Vectors for Word Representation." *Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)*.
10. Petrov, A., La Malfa, E., Torr, P., & Bibi, A. (2023). "Language Model Tokenizers Introduce Unfairness Between Languages." *arXiv preprint arXiv:2305.15425*. Documents the tokenization efficiency disparities across languages.
11. Gage, P. (1994). "A New Algorithm for Data Compression." *The C Users Journal*, 12(2), pp. 23-38. The original BPE compression algorithm later adapted for NLP.
12. OpenAI. "tiktoken: A fast BPE tokeniser for use with OpenAI's models." GitHub repository. https://github.com/openai/tiktoken
13. OpenAI. "What are tokens and how to count them?" OpenAI Help Center. States the rule of thumb that 1 token is approximately 4 characters or 0.75 words for English text, and that 100 tokens is about 75 words. https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
14. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). "Lost in the Middle: How Language Models Use Long Contexts." *arXiv preprint arXiv:2307.03172* (Transactions of the Association for Computational Linguistics, 2024). https://arxiv.org/abs/2307.03172
