tiktoken
Last reviewed
May 25, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,879 words
tiktoken is an open source byte pair encoding (BPE) tokenizer library released by OpenAI in December 2022. Written in Rust with Python bindings via PyO3, it converts text into integer token sequences that the company's language models, including GPT-3.5, GPT-4, GPT-4o, GPT-4.1, and GPT-5, consume during inference and training.[1][2] The package distributes four production encodings (r50k_base, p50k_base, cl100k_base, and o200k_base) plus the legacy gpt2 and p50k_edit variants, and its README advertises a 3 to 6 times speedup over a comparable Hugging Face implementation when tokenizing 1 GB of text.[1] tiktoken has become a de facto reference for measuring token counts in third party tooling: by May 2026 the PyPI package was being downloaded more than 167 million times per month, with LangChain and LlamaIndex both using cl100k_base as their default token counter.[3][4][5]
tiktoken's first PyPI release, version 0.1.1, was uploaded on 15 December 2022.[6] The accompanying Hacker News thread, posted on 16 December 2022, documents the community's earliest reactions; users immediately noted that the new cl100k_base encoding doubled the vocabulary of the GPT-2 tokenizer from roughly 50,000 tokens to about 100,000 and gave special attention to numeric sequences.[7] The repository's LICENSE file dates copyright to 2022 and credits OpenAI alongside individual contributor Shantanu Jain, the project's principal maintainer.[8]
The library was developed alongside OpenAI's transition from the older byte-level BPE tokenizers shipped with GPT-2 and GPT-3 to a denser encoding for the GPT-3.5 and GPT-4 generations. Before tiktoken, most developers tokenized OpenAI inputs using the GPT2TokenizerFast class from Hugging Face's transformers library (backed by the Rust tokenizers library), which had been the canonical implementation since the publication of GPT-2 in February 2019.[9] tiktoken provided a faster alternative tailored to the new vocabularies and shipped the merge tables in compressed BPE files that the Rust core loaded at runtime.
Subsequent releases tracked OpenAI's model launches. Version 0.7.0 was published on 13 May 2024, the same day OpenAI announced GPT-4o and introduced the o200k_base encoding with a vocabulary of roughly 200,000 tokens.[10][11] Version 0.8.0 followed on 3 October 2024 and added support for the o1 reasoning family, version 0.9.0 was released on 14 February 2025, and version 0.13.0 (the current release as of May 2026) was published on 15 May 2026.[10][3] Throughout this period the project remained MIT licensed and accepted contributions from outside OpenAI through GitHub pull requests, including community submissions that mapped new public model names (such as gpt-5 and gpt-5.1) to the appropriate encoding.[12]
Andrej Karpathy's February 2024 video lecture "Let's build the GPT Tokenizer" gave tiktoken a wider pedagogical audience by walking through the exact algorithm the library implements; the companion minbpe repository explicitly references tiktoken as the production target that the educational code mirrors.[13][14]
tiktoken is structured as a small Rust crate with thin Python bindings. The hot path lives in the Rust module _tiktoken, exposed to Python via the PyO3 framework; the project's setup.py builds the extension through setuptools-rust, and its Cargo.toml pins PyO3 to a specific minor version so that the compiled extension binds against a stable ABI.[1][15] The Python package (tiktoken/) contains the user-facing Encoding class and the lookup tables in model.py, while encoding factories are registered under the separate tiktoken_ext plugin namespace.[15]
Tokenization in tiktoken follows the standard byte-level BPE pipeline that traces back to Rico Sennrich, Barry Haddow, and Alexandra Birch's 2016 ACL paper on subword units for neural machine translation, which itself adapted the byte pair encoding compression algorithm published by Philip Gage in 1994.[16] In tiktoken's variant, text is first split into chunks by a regular expression that captures contractions, runs of letters, runs of digits, runs of punctuation, and various whitespace classes, and each chunk is then encoded to UTF-8 bytes. Within each chunk, the Rust core iteratively merges adjacent byte sequences, always choosing the pair with the lowest rank in a fixed merge table, until no further merges from the encoding's vocabulary apply.[1][17]
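The merge loop described above can be sketched in a few lines of pure Python. This is a minimal illustration rather than tiktoken's Rust implementation: the function name `bpe_encode` and the three-entry toy merge table are hypothetical, and a real encoding carries on the order of 100,000 ranked entries.

```python
def bpe_encode(chunk: bytes, ranks: dict[bytes, int]) -> list[int]:
    """Encode one pretokenized chunk by repeatedly merging the lowest-ranked adjacent pair."""
    parts = [bytes([b]) for b in chunk]  # start from individual bytes
    while len(parts) > 1:
        best_idx, best_rank = None, None
        # Scan every adjacent pair and remember the one with the lowest rank.
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_idx, best_rank = i, rank
        if best_idx is None:
            break  # no adjacent pair is in the table, so we are done
        # Replace the two pieces with their concatenation.
        parts[best_idx:best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
    return [ranks[p] for p in parts]


# Hypothetical toy table: all 256 single bytes, then three learned merges.
toy_ranks = {bytes([i]): i for i in range(256)}
toy_ranks[b"he"] = 256
toy_ranks[b"ll"] = 257
toy_ranks[b"hell"] = 258

print(bpe_encode(b"hello", toy_ranks))
```

With this table, `b"hello"` is reduced in three steps (`he`, then `ll`, then `hell`), leaving the merged piece followed by the single byte for `o`.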
The regex split follows GPT-2's approach in spirit, but the patterns were rewritten with Unicode property escapes and possessive quantifiers. The cl100k_base pretokenization pattern, for example, is:
'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}++|\p{N}{1,3}+| ?[^\s\p{L}\p{N}]++[\r\n]*+|\s++$|\s*[\r\n]|\s
This pattern matches English contractions case-insensitively, runs of Unicode letters, numeric groups of up to three digits, runs of non-letter non-digit punctuation, and trailing or interleaved whitespace.[17] The o200k_base pattern, introduced for GPT-4o, generalizes the letter classes to include modifier letters and combining marks (\p{Lm}, \p{Lo}, and \p{M}) so that scripts like Devanagari, Tamil, and Arabic split into fewer pieces before the BPE merges run.[18][19] The pattern also adds a case-aware branch that keeps capitalized words intact and explicitly handles a longer set of contraction suffixes.[17]
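To make the splitting behaviour concrete, here is a deliberately simplified approximation written against Python's standard `re` module. It is an assumption-laden sketch: `re` does not support `\p{L}` or `\p{N}`, so the pattern below (named `SIMPLE_PAT` for illustration) uses ASCII letter and digit classes and omits several whitespace branches of the real pattern.

```python
import re

# Simplified, ASCII-only stand-in for the cl100k_base pattern (not the real one).
SIMPLE_PAT = re.compile(
    r"'(?:[sdmt]|ll|ve|re)"      # English contraction suffixes
    r"|[^\r\n\w]?[A-Za-z]+"      # optional leading non-word char, then letters
    r"|\d{1,3}"                  # digits, at most three per chunk
    r"| ?[^\s\w]+[\r\n]*"        # punctuation runs, optional leading space
    r"|\s+",                     # remaining whitespace
    re.IGNORECASE,
)


def pretokenize(text: str) -> list[str]:
    """Split text into chunks that BPE merges are not allowed to cross."""
    return SIMPLE_PAT.findall(text)


chunks = pretokenize("Hello world's 12345!")
print(chunks)
# → ['Hello', ' world', "'s", ' ', '123', '45', '!']
```

Note how the leading space stays attached to `world`, the contraction suffix is split off, and the five-digit number is cut into groups of at most three digits.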
Special tokens such as <|endoftext|>, the fill-in-the-middle (FIM) sentinels (<|fim_prefix|>, <|fim_middle|>, <|fim_suffix|>), and <|endofprompt|> are stored at fixed integer IDs above the BPE vocabulary. For cl100k_base, <|endoftext|> is token 100257 while the FIM tokens occupy 100258 through 100260; for o200k_base, <|endoftext|> is 199999 and <|endofprompt|> is 200018.[17] tiktoken requires callers to opt into special tokens by listing them in the allowed_special argument of encode, which prevents accidental injection of FIM or end-of-text markers from untrusted user input.[1]
A tiktoken_ext.openai_public plugin registers the four production encodings by name and points at the compressed BPE merge tables hosted on OpenAI's public blob storage. The first call to tiktoken.get_encoding("cl100k_base") downloads the merge file and caches it on local disk (by default in a data-gym-cache directory under the system temporary directory, or in a directory chosen via TIKTOKEN_CACHE_DIR), so subsequent calls on the same machine work offline.[1][7]
The Rust core is exposed to Python through the CoreBPE class, which receives a Python dictionary mapping byte sequences to token integers, a separate dictionary of special tokens, and the precompiled regex string. The constructor builds an internal HashMap<Vec<u8>, u32> for the merge table and a reverse HashMap<u32, Vec<u8>> for the decoder, then compiles the regex with the fancy_regex crate (which supports lookaround required by some of the older encodings) for use during encoding.[1][15] The _encode_native method on CoreBPE performs the inner loop without holding the Python global interpreter lock, which is what allows encode_batch to scale across threads when called from a Python thread pool.[1]
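The two lookup tables can be pictured with ordinary Python dictionaries. The sketch below is an analogy for the forward and reverse maps, not the Rust code itself; the two merged entries and the helper name `decode` are hypothetical.

```python
# Forward table: byte sequence -> token ID (toy version with two merged entries).
encoder: dict[bytes, int] = {bytes([i]): i for i in range(256)}
encoder[b"to"] = 256
encoder[b"ken"] = 257

# Reverse table for decoding, built once by inverting the forward table.
decoder: dict[int, bytes] = {rank: tok for tok, rank in encoder.items()}


def decode(tokens: list[int]) -> str:
    """Concatenate each token's bytes, then decode; invalid UTF-8 becomes U+FFFD."""
    raw = b"".join(decoder[t] for t in tokens)
    return raw.decode("utf-8", errors="replace")


print(decode([256, 257]))               # the two merged pieces
print(decode([256, 257, ord("s")]))     # plus a single-byte token
```

Decoding is a pure table lookup followed by a byte join, which is why it needs no regex and no merge ranking.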
Memory usage is dominated by the merge dictionary. The cl100k_base merge file decompresses to roughly 1.7 MiB and the o200k_base file to roughly 3.5 MiB; both are loaded fully into RAM at first use. The library exposes no streaming or memory-mapped loader, which means cold-start latency is dominated by the disk read or HTTP fetch of the merge file rather than by the merge table construction itself.[1]
The four encodings shipped in the library correspond to four generations of OpenAI text models. The table below summarizes the mappings as of the tiktoken/model.py source on the main branch in May 2026.[15][17]
| Encoding | Vocabulary size | First introduced | Representative models |
|---|---|---|---|
| r50k_base (alias gpt2) | 50,257 tokens | February 2019 with GPT-2 | gpt-2, davinci, curie, babbage, ada |
| p50k_base | 50,281 tokens | June 2022 with Codex | text-davinci-002, text-davinci-003, code-davinci-002, cushman-codex |
| p50k_edit | 50,284 tokens | March 2023 with edit endpoint | text-davinci-edit-001, code-davinci-edit-001 |
| cl100k_base | 100,277 tokens, including five special tokens | December 2022 with GPT-3.5 turbo and GPT-4 | gpt-3.5-turbo, gpt-4, gpt-4-turbo, text-embedding-ada-002, text-embedding-3-small, text-embedding-3-large, davinci-002, babbage-002 |
| o200k_base | 199,998 BPE entries plus special tokens | May 2024 with GPT-4o | gpt-4o, gpt-4o-mini, gpt-4.1, gpt-5, o1, o3, o3-mini, o4-mini |
The r50k_base encoding is byte-identical to the public GPT-2 tokenizer published by OpenAI in 2019: 256 base bytes, 50,000 learned merges, and a single <|endoftext|> token at index 50256.[9] p50k_base extends this with merges learned on Codex's enlarged corpus (predominantly source code) and reserves several extra IDs for whitespace-only merges that improve indentation handling.[20]
cl100k_base, introduced alongside the launch of ChatGPT on 30 November 2022 and the GPT-4 release on 14 March 2023, doubles the merge count and adds first-class support for FIM completion. The OpenAI Cookbook recipe "How to count tokens with tiktoken" lists cl100k_base as the canonical encoding for gpt-4, gpt-4-turbo, gpt-3.5-turbo, and all three current embedding endpoints (text-embedding-ada-002, text-embedding-3-small, text-embedding-3-large).[2]
o200k_base doubles the vocabulary again to roughly 200,000 entries. OpenAI shipped this encoding with GPT-4o on 13 May 2024 and uses it for every subsequent flagship model, including the o-series reasoning models (o1, o3, o3-mini, o4-mini) and the GPT-4.1 and GPT-5 families.[11][15] tiktoken's MODEL_PREFIX_TO_ENCODING table uses prefix matching, so any model whose name starts with gpt-5- or o3- automatically resolves to o200k_base without requiring a new release of tiktoken every time OpenAI ships a versioned model.[15]
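The lookup order (exact name first, then prefix) can be sketched as follows. The tables here are small illustrative subsets using entries mentioned in this article, and `resolve_encoding_name` is a hypothetical helper, not the library's own function.

```python
# Illustrative subsets only; the full tables live in tiktoken/model.py.
MODEL_TO_ENCODING = {
    "gpt-4": "cl100k_base",
    "gpt-4o": "o200k_base",
}
MODEL_PREFIX_TO_ENCODING = {
    "gpt-5-": "o200k_base",
    "o3-": "o200k_base",
    "gpt-4o-": "o200k_base",
    "gpt-4-": "cl100k_base",
}


def resolve_encoding_name(model: str) -> str:
    """Return the encoding name for a model: exact match first, then prefix match."""
    if model in MODEL_TO_ENCODING:
        return MODEL_TO_ENCODING[model]
    for prefix, encoding_name in MODEL_PREFIX_TO_ENCODING.items():
        if model.startswith(prefix):
            return encoding_name
    raise KeyError(f"no encoding known for model {model!r}")


print(resolve_encoding_name("o3-mini"))
print(resolve_encoding_name("gpt-4o-mini"))
```

Because `gpt-4o-mini` does not start with `gpt-4-` (the sixth character is `o`, not `-`), the two GPT-4 prefixes never collide.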
The library exposes a compact Python API. Two factory functions create Encoding objects, and the resulting objects expose encode and decode methods that round-trip text and integers.[1][2]
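import tiktoken
enc = tiktoken.get_encoding("cl100k_base")  # load an encoding by explicit name
tokens = enc.encode("tiktoken is great!")  # text -> list of integer token IDs
text = enc.decode(tokens)  # token IDs -> the original text
enc4o = tiktoken.encoding_for_model("gpt-4o-mini")  # resolve the encoding from a model name
print(len(enc4o.encode("Hello world")))  # token count for the string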
tiktoken.get_encoding(name) returns the encoding by explicit name, while tiktoken.encoding_for_model(model) looks the model up in the MODEL_TO_ENCODING table (or applies prefix matching) and returns the corresponding encoding. The OpenAI Cookbook documents the canonical pattern for token counting as len(encoding.encode(string)).[2]
Beyond encode and decode, the Encoding class exposes encode_batch for parallel processing of multiple strings, encode_with_unstable for tokenizing input that may be a partial prefix of a larger string (useful when streaming completions), and decode_single_token_bytes for inspecting individual token bytes. The name, eot_token, and n_vocab attributes describe the encoding statically.[1] Special tokens are disallowed by default; callers must pass allowed_special={"<|endoftext|>"} (or "all") to permit them, which guards against injecting end-of-text markers from user data.[1]
The Python package depends only on regex for the pretokenization pattern and on requests for the one-time blob download; blobfile is an optional dependency used when applications need to load merge files from cloud object stores.[3] The Rust core ships as a manylinux wheel for x86_64 and aarch64 Linux, plus universal2 wheels for macOS and a Windows x86_64 wheel, so most users install with pip install tiktoken and never invoke cargo.[3]
Two further utilities round out the API. tiktoken.list_encoding_names() returns the names of all registered encodings, including those contributed by plugins, and Encoding.encode_ordinary skips the special token check entirely for performance-sensitive paths where the caller knows the input contains no chat or FIM markers. The latter is the function that most token counting wrappers, including LangChain's _get_token_ids, end up calling because it avoids the overhead of pattern matching against the special token list on every call.[1][5]
For applications that need to construct custom encodings (for example, while training a derivative model on a private vocabulary), tiktoken's Encoding constructor accepts the merge dictionary, the special tokens, the pretokenization regex, and an optional explicit vocabulary size directly. The library deliberately does not ship a training routine, so users wanting to learn new merges typically pair tiktoken.Encoding with an external trainer such as Karpathy's rustbpe or Hugging Face's BpeTrainer, then load the resulting merges into a fresh Encoding for inference.[14][15]
The repository's README claims that "tiktoken is between 3 to 6 times faster than a comparable open source tokeniser," referring to a benchmark that tokenizes 1 GB of English text using both cl100k_base and Hugging Face's GPT2TokenizerFast.[1] Independent reviewers have generally confirmed a speedup for single-threaded workloads, though not always the full headline figure: an analysis published on machinelearningplus.com in 2024 reported a 2 to 3 times speedup for tiktoken in OpenAI token counting tasks, with the gap narrowing when Hugging Face's encode_batch distributed the work across Rust threads.[21]
The speed advantage comes from two design choices. First, tiktoken's pretokenization regex is compiled once when the encoding is constructed rather than on every call, and the Rust core avoids allocating intermediate Python objects during the inner merge loop.[1] Second, the BPE merges are stored as a flat HashMap<Vec<u8>, u32> keyed by byte sequences rather than the trie-based approach used in some earlier implementations, which trades a small amount of memory for cache-friendly lookups.[1]
Several newer Rust tokenizers have since outperformed tiktoken on specific benchmarks. The bpe crate's backtracking encoder reports being roughly 3 times faster than tiktoken with pretokenization enabled and 10 times faster than Hugging Face's BPE without pretokenization on its synthetic corpus.[22] The rs-bpe project published comparisons in 2025 showing it outperforms both tiktoken and Hugging Face's tokenizers on Latin text.[23] These results sit alongside tiktoken rather than displacing it: tiktoken's value is that it ships the exact OpenAI merge tables under the MIT license and is the implementation the official cookbook recommends, so applications targeting OpenAI models continue to use it for correctness even when raw throughput matters less than fidelity.[2]
The o200k_base encoding's principal practical change over cl100k_base is improved compression of non-English text. Independent measurements published the day GPT-4o launched show large reductions in token counts for Indic, Arabic, Chinese, and other non-Latin scripts. One analysis on njkumar.com cited a Tamil example where the sentence "நீ எப்படி இருக்கிறாய்? நான் நல்லா இருக்கேன்" was tokenized as 68 tokens with cl100k_base but only 21 with o200k_base, a 3.2 times compression improvement.[19] An aggregated comparison on aipmguru.substack.com reported a 17 percent reduction in tokens for English text and an 86 percent reduction for Gujarati, alongside specific drops from 70 to 21 tokens for an Arabic phrase and 56 to 18 for a Chinese phrase.[24]
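The ratios quoted above follow directly from the before and after token counts. As a quick check, this small sketch recomputes them; the helper name `reduction` is illustrative.

```python
def reduction(before: int, after: int) -> tuple[float, float]:
    """Return (compression ratio, percent reduction) for a pair of token counts."""
    ratio = before / after
    percent = 100 * (before - after) / before
    return ratio, percent


# Token counts reported in the cited analyses (cl100k_base -> o200k_base).
examples = [("Tamil", 68, 21), ("Arabic", 70, 21), ("Chinese", 56, 18)]

for label, before, after in examples:
    ratio, percent = reduction(before, after)
    print(f"{label}: {ratio:.1f}x fewer tokens, {percent:.0f}% reduction")
```

The Tamil example works out to 68 / 21 ≈ 3.2, matching the figure in the text, and the Arabic and Chinese phrases land in the same range at roughly 3.3 and 3.1 times respectively.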
Two factors drive the improvement. First, the larger merge table can absorb full common phrases as single tokens in scripts where cl100k_base had to fall back to byte-level encoding. Second, the pretokenization regex was rewritten to use the Unicode general categories \p{Lo} (other letters), \p{Lm} (modifier letters), and \p{M} (combining marks), which prevents the splitter from cutting between a base character and its combining diacritic before the BPE merges have a chance to operate.[17][19]
The compression improvement matters for users in two ways. First, because OpenAI bills its API by token, the same source text in Hindi or Korean costs less to send to GPT-4o than to GPT-4 turbo. Second, the model can fit more content in its context window when the encoding is denser. The trade-off is that the larger vocabulary increases the embedding table size and the softmax over the output distribution, which contributes to GPT-4o's higher per-token compute compared to a hypothetical model that retained cl100k_base.[11][19]
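The embedding-table side of that trade-off is simple multiplication: one row per vocabulary entry, one column per hidden dimension. The sketch below uses a hidden size of 4,096 purely as a hypothetical, since OpenAI has not published the dimensions of these models, and rounds the o200k_base vocabulary to 200,000.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of parameters in a vocab_size x d_model embedding matrix."""
    return vocab_size * d_model


D_MODEL = 4096  # hypothetical hidden size, for illustration only

small = embedding_params(100_277, D_MODEL)   # cl100k_base-sized vocabulary
large = embedding_params(200_000, D_MODEL)   # roughly o200k_base-sized vocabulary

print(f"cl100k-sized table: {small:,} parameters")
print(f"o200k-sized table:  {large:,} parameters")
print(f"growth factor:      {large / small:.2f}x")
```

Whatever the true hidden size is, the table grows linearly with the vocabulary, so doubling the vocabulary roughly doubles both the input embedding and the output projection.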
OpenAI's commentary in the GPT-4o launch post highlighted the multilingual tokenizer change as one of three principal upgrades over GPT-4 turbo, alongside native multimodal support and lower latency. Modal's blog on the o200k_harmony variant (an extension of o200k_base shipped with the open weight gpt-oss models in 2025) confirmed that the base encoding has 199,998 BPE merges plus two reserved special tokens, with the Harmony variant adding chat-template tokens for tool calls.[25]
The two principal alternatives to tiktoken in the open ecosystem are Hugging Face's tokenizers library and Google's SentencePiece. All three implement the BPE family but differ in scope, training capability, and licensing.
Hugging Face tokenizers, also written in Rust with Python bindings, is a much broader toolkit. It supports BPE, WordPiece, and Unigram models, exposes a PreTokenizer and Normalizer configuration system, and ships a train_from_iterator method that fits new tokenizers from corpora. By contrast, tiktoken is inference only: it ships pretrained OpenAI merge tables and offers no training API.[1][26] In raw token counting, tiktoken is faster on single threads, but Hugging Face overtakes it once encode_batch parallelizes across Rust threads.[21]
SentencePiece, published by Taku Kudo and John Richardson at Google in 2018, takes a different architectural stance: it treats input as a raw stream of Unicode characters, whitespace included, and trains directly on unprocessed text, which makes it language agnostic and removes the need for a separate pretokenization regex. SentencePiece supports both BPE and Unigram language model tokenization and is the implementation used by T5, Gemma, LLaMA, and the original PaLM.[27] SentencePiece's BPE differs from tiktoken's primarily because it marks whitespace with a ▁ (U+2581) meta symbol rather than relying on bytes for whitespace; this makes its outputs incompatible with tiktoken's merge tables, but the algorithmic complexity is similar.[27]
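The whitespace convention is easy to see in isolation. The following sketch only imitates the marker substitution (it is not SentencePiece's code, and the helper names are hypothetical); it assumes the common default of prefixing a marker to the start of the input.

```python
WS = "\u2581"  # "▁", the visible whitespace meta symbol


def mark_whitespace(text: str) -> str:
    """Replace spaces with the marker and prefix one at the start of the text."""
    return WS + text.replace(" ", WS)


def restore_whitespace(marked: str) -> str:
    """Undo mark_whitespace: turn markers back into spaces and drop the prefix."""
    return marked.replace(WS, " ")[1:]


marked = mark_whitespace("Hello world")
print(marked)
print(restore_whitespace(marked))
```

Because the space becomes an ordinary symbol in the training alphabet, subword pieces such as `▁world` carry their own word boundary, whereas tiktoken's pieces carry a literal leading space byte.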
A third class of implementation, often grouped under the loose umbrella "GPT2-style" tokenizers, includes pure JavaScript ports (such as niieani/gpt-tokenizer), .NET ports (tryAGI/Tiktoken), and Go and R bindings that reuse the official cl100k_base and o200k_base merge files. These ports exist because OpenAI's official package only targets Python; the merge files themselves are CC-licensed metadata that ports can redistribute. None of these ports trains new tokenizers either: they exist purely to count or encode tokens against OpenAI's vocabularies in non-Python environments.[28]
Karpathy's minbpe and rustbpe projects exist explicitly as educational reference implementations. minbpe reproduces tiktoken's encode path in roughly 200 lines of Python and adds a small training loop; rustbpe extends the Rust path with a train method that tiktoken lacks. Both projects target byte-compatible output with tiktoken's cl100k_base for any input.[14]
tiktoken sits at the foundation of nearly every Python tool that interacts with OpenAI's text APIs. The package's PyPI page reported 167 million downloads in the 30 days preceding 26 May 2026, with 4.99 million downloads on the most recent day alone, ranking it among the most downloaded AI-adjacent libraries on the index.[3]
LangChain uses tiktoken as the default token counter inside its BaseChatModel.get_num_tokens_from_messages implementation for OpenAI models, and the framework's documentation recommends installing tiktoken alongside the core package for any application that needs to enforce context length limits.[5] LlamaIndex sets cl100k_base as its global default tokenizer via the Settings.tokenizer attribute; the project's documentation specifically calls out the dependency on tiktoken to match GPT-3.5 turbo's default behavior.[4] LiteLLM, an aggregation layer that exposes a single API across multiple providers, falls back to tiktoken when no provider-specific tokenizer is available.[5]
Third party token counters for non-OpenAI models often use tiktoken as a rough approximation. Anthropic's official position is that Claude requires its own client.messages.count_tokens endpoint for accurate counts, but several community libraries continue to use tiktoken.get_encoding("p50k_base") as a heuristic for older Claude 2 era inputs. Anthropic now publishes its own TypeScript tokenizer in the anthropic-tokenizer-typescript repository for browser-side estimation, leaving tiktoken as the OpenAI-specific tool.[29][30]
tiktoken is also the canonical reference implementation cited by academic papers when reporting token-based metrics for OpenAI evaluation. Papers studying multilingual token compression, prompt cost analysis, and prompt injection routinely report token counts using tiktoken's cl100k_base or o200k_base and link the exact library version they used for reproducibility.[19][31]
The library has been ported into more than a dozen non-Python ecosystems. The tiktoken-go package provides Go bindings, gpt-tokenizer and js-tiktoken cover Node.js and the browser, tryAGI/Tiktoken covers .NET, and rtiktoken provides an R interface that wraps the official Rust core through extendr. Most of these projects bundle the official merge tables and reach byte-identical output with the Python package for the supported encodings, which means their token counts can be used as the ground truth for OpenAI billing in JavaScript, Go, and .NET applications without round-tripping calls to a Python service.[28]
The library exposes only inference. Users cannot fit new BPE vocabularies through tiktoken itself; producing a new encoding requires an external trainer and then constructing a tiktoken.Encoding around the resulting merges. The repository's README states this constraint explicitly and points to minbpe and rustbpe as suggested training companions.[1][14]
Special token handling is conservative by design. Calling encode("<|endoftext|>") raises a ValueError unless the caller passes allowed_special={"<|endoftext|>"} or sets allowed_special="all". This prevents prompt injection through user-supplied content that contains the literal special token string, but it can surprise developers who unwittingly include the token in their templates. The opposite mode, disallowed_special=(), turns the check off, so that any special token string in the input is encoded as ordinary text; it is sometimes used to round-trip arbitrary text without errors.[1]
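The guard can be modelled in plain Python. This is a sketch of the behaviour described above rather than tiktoken's actual code: `check_special` is a hypothetical helper, and the token set is limited to the strings named in this article.

```python
import re

# Special-token strings named in this article; their IDs are irrelevant here.
SPECIAL_TOKENS = frozenset({
    "<|endoftext|>", "<|fim_prefix|>", "<|fim_middle|>",
    "<|fim_suffix|>", "<|endofprompt|>",
})


def check_special(text, allowed_special=frozenset(), disallowed_special="all"):
    """Raise ValueError if text contains a special-token string that was not allowed."""
    allowed = SPECIAL_TOKENS if allowed_special == "all" else frozenset(allowed_special)
    if disallowed_special == "all":
        disallowed = SPECIAL_TOKENS - allowed
    else:
        disallowed = frozenset(disallowed_special)
    if not disallowed:
        return  # check disabled: special strings are treated as plain text
    pattern = re.compile("|".join(re.escape(tok) for tok in sorted(disallowed)))
    match = pattern.search(text)
    if match:
        raise ValueError(f"disallowed special token in input: {match.group()!r}")


check_special("plain user text")                                   # passes
check_special("a <|endoftext|> b", allowed_special={"<|endoftext|>"})  # passes
check_special("a <|endoftext|> b", disallowed_special=())          # passes
```

Calling `check_special("a <|endoftext|> b")` with the defaults raises `ValueError`, which is the safe failure mode for untrusted input.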
tiktoken's merge files are downloaded from blob storage on first use, which means the library does not work in air-gapped environments without prior preparation. The workaround documented in the README is to download the merge file in a connected environment, copy the cache directory to the target machine, and set TIKTOKEN_CACHE_DIR to the destination. Several community ports (such as js-tiktoken and the .NET Tiktoken package) bundle the merge tables directly to avoid this problem.[1][7][28]
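When preparing such a cache by hand, it helps to know which file name the library will look for. The sketch below rests on two assumptions that are not stated in this article and should be checked against the installed version: that cached files are named after the SHA-1 hex digest of the blob URL, and that the cl100k_base merge file is served from the URL shown. The target directory is hypothetical.

```python
import hashlib
import os

# Assumption: download URL of the cl100k_base merge file.
CL100K_URL = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"


def cache_file_path(cache_dir: str, blob_url: str) -> str:
    """Expected cache path, assuming files are keyed by SHA-1 of the blob URL."""
    key = hashlib.sha1(blob_url.encode()).hexdigest()
    return os.path.join(cache_dir, key)


cache_dir = "/opt/tiktoken-cache"  # hypothetical location on the air-gapped host

# Must be set before the first get_encoding() call in the process.
os.environ["TIKTOKEN_CACHE_DIR"] = cache_dir

print(cache_file_path(cache_dir, CL100K_URL))
```

Copying the downloaded merge file to the printed path on the offline machine should let the first `get_encoding` call succeed without a network request, provided the two assumptions hold.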
The library targets OpenAI's vocabularies and matches them exactly, but it offers no tokenizers for other providers. Counts produced by tiktoken.get_encoding("cl100k_base") for a Claude or LLaMA input are approximations that drift from the actual model's tokenizer by a few percent on English and considerably more on multilingual text. Anthropic's documentation explicitly recommends using client.messages.count_tokens rather than tiktoken for Claude, and the transformers library's LlamaTokenizer is the canonical counter for Meta's models.[29][30]
Finally, because the merge tables are byte-level, tiktoken cannot guarantee that arbitrary sequences of tokens decode to valid UTF-8. Individual tokens may be partial UTF-8 byte sequences that only become valid when concatenated with neighbors; decode_single_token_bytes returns raw bytes precisely so that callers handling streaming output can buffer until a complete code point arrives. This semantic is documented in the README and matches how OpenAI's server side streaming endpoints deliver byte-level deltas.[1]
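The buffering pattern can be demonstrated with the standard library's incremental decoder. In this sketch the byte chunks stand in for what `decode_single_token_bytes` would return per token; the split of "é" across two tokens is a constructed example, and `stream_text` is a hypothetical helper.

```python
import codecs


def stream_text(token_bytes_iter):
    """Yield text as soon as complete UTF-8 code points are available."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in token_bytes_iter:
        piece = decoder.decode(chunk)  # incomplete trailing bytes stay buffered
        if piece:
            yield piece
    tail = decoder.decode(b"", final=True)  # flush; dangling bytes become U+FFFD
    if tail:
        yield tail


# "é" is 0xC3 0xA9 in UTF-8; here it is split across two hypothetical tokens.
chunks = [b"caf", b"\xc3", b"\xa9", b"!"]
print(list(stream_text(chunks)))
# → ['caf', 'é', '!']
```

Decoding each chunk independently would instead emit two replacement characters for the split "é", which is the failure this buffering avoids.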