# tiktoken

> Source: https://aiwiki.ai/wiki/tiktoken
> Updated: 2026-06-25
> Categories: Natural Language Processing, Open Source AI, OpenAI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**tiktoken** is an open source byte pair encoding (BPE) tokenizer library released by [OpenAI](/wiki/openai) in December 2022 that converts text into the integer token sequences its language models read and write. OpenAI describes the project in a single line at the top of its README: "tiktoken is a fast BPE tokeniser for use with OpenAI's models."[^1] Written in Rust with [Python](/wiki/python) bindings via PyO3, it ships the exact merge tables behind models including [GPT-3.5](/wiki/gpt-3.5), [GPT-4](/wiki/gpt-4), [GPT-4o](/wiki/gpt_4o), [GPT-4.1](/wiki/gpt-4.1), and [GPT-5](/wiki/gpt-5), and is the implementation OpenAI's own cookbook recommends for counting tokens.[^1][^2] The package distributes four production encodings (`r50k_base`, `p50k_base`, `cl100k_base`, and `o200k_base`) plus the legacy `gpt2` and `p50k_edit` variants, and its README advertises a 3 to 6 times speedup over a comparable open source tokenizer when tokenizing 1 GB of text.[^1] tiktoken has become a de facto reference for measuring token counts in third party tooling: by May 2026 the PyPI package was being downloaded more than 167 million times per month, with [LangChain](/wiki/langchain) and [LlamaIndex](/wiki/llamaindex) both using `cl100k_base` as their default token counter.[^3][^4][^5]

## What is tiktoken?

tiktoken is a tokenizer, the component that sits between human-readable text and a language model. A [transformer](/wiki/transformer) model does not operate on characters or words; it operates on integers drawn from a fixed vocabulary, and tiktoken is the function that maps a string to that list of integers (encoding) and back (decoding). Because OpenAI bills its API by the token and every model has a finite [context window](/wiki/context_window), tiktoken is also the standard tool developers use to predict request cost and to verify that a prompt fits before they send it.[^1][^2]
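The cost-and-fit check described above can be sketched in a few lines. This is a minimal illustration, not an OpenAI utility: the function name `estimate_request`, the window size, and the price are hypothetical placeholders, and a byte-counting stand-in replaces the real encoder so the example runs offline. With tiktoken installed, `encode` would typically be `tiktoken.get_encoding("o200k_base").encode`.

```python
from typing import Callable, NamedTuple, Sequence


class RequestEstimate(NamedTuple):
    prompt_tokens: int
    fits: bool
    input_cost_usd: float


def estimate_request(
    prompt: str,
    encode: Callable[[str], Sequence[int]],
    context_window: int,
    usd_per_million_input_tokens: float,
    reserved_output_tokens: int = 0,
) -> RequestEstimate:
    """Count prompt tokens, check the context budget, and price the input."""
    prompt_tokens = len(encode(prompt))
    # The prompt and the space reserved for the reply share one window.
    fits = prompt_tokens + reserved_output_tokens <= context_window
    cost = prompt_tokens * usd_per_million_input_tokens / 1_000_000
    return RequestEstimate(prompt_tokens, fits, cost)


# Stand-in encoder (one "token" per UTF-8 byte) so the sketch runs offline.
def byte_encoder(text: str) -> list[int]:
    return list(text.encode("utf-8"))


estimate = estimate_request(
    "tiktoken is great!",
    byte_encoder,
    context_window=32,                 # hypothetical window size
    usd_per_million_input_tokens=2.0,  # hypothetical price
    reserved_output_tokens=8,
)
print(estimate)
```

Passing the encoder in as a callable keeps the budgeting logic independent of which encoding a given model uses.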

The library is deliberately narrow. It does not train new vocabularies, normalize text, or support providers other than OpenAI; it ships pretrained OpenAI merge tables and exposes a small `encode`/`decode` surface around them. That focus is what lets it guarantee byte-identical output with OpenAI's production tokenizers, which is the property that makes it citable as ground truth for OpenAI token counts.[^1][^2]

## When was tiktoken released?

tiktoken's first PyPI release, version 0.1.1, was uploaded on 15 December 2022.[^6] The accompanying Hacker News thread, posted on 16 December 2022, documents the community's earliest reactions; users immediately noted that the new `cl100k_base` encoding doubled the vocabulary of the [GPT-2](/wiki/gpt-2) tokenizer from roughly 50,000 tokens to about 100,000 and gave special attention to numeric sequences.[^7] The repository's LICENSE file dates copyright to 2022 and credits OpenAI alongside individual contributor Shantanu Jain, the project's principal maintainer.[^8]

The library was developed alongside OpenAI's transition from the older byte-level BPE tokenizers shipped with GPT-2 and GPT-3 to a denser encoding for the GPT-3.5 and GPT-4 generations. Before tiktoken, most developers tokenized OpenAI inputs with the `GPT2TokenizerFast` class from Hugging Face's `transformers` library (backed by the Rust `tokenizers` package), which had been the canonical implementation since the publication of GPT-2 in February 2019.[^9] tiktoken provided a faster alternative tailored to the new vocabularies and shipped the merge tables in compressed BPE files that the Rust core loaded at runtime.

Subsequent releases tracked OpenAI's model launches. Version 0.7.0 was published on 13 May 2024, the same day OpenAI announced [GPT-4o](/wiki/gpt_4o) and introduced the `o200k_base` encoding with a vocabulary of roughly 200,000 tokens.[^10][^11] Version 0.8.0 followed on 3 October 2024 and added support for the o1 reasoning family, version 0.9.0 was released on 14 February 2025, and version 0.13.0 (the current release as of May 2026) was published on 15 May 2026.[^10][^3] Throughout this period the project remained MIT licensed and accepted contributions from outside OpenAI through GitHub pull requests, including community submissions that mapped new public model names (such as `gpt-5` and `gpt-5.1`) to the appropriate encoding.[^12]

Andrej [Karpathy](/wiki/andrej_karpathy)'s February 2024 video lecture "Let's build the GPT Tokenizer" gave tiktoken a wider pedagogical audience by walking through the exact algorithm the library implements; the companion `minbpe` repository explicitly references tiktoken as the production target that the educational code mirrors.[^13][^14]

## How does tiktoken work?

tiktoken is structured as a small Rust crate with thin Python bindings. The hot path lives in the Rust module `_tiktoken`, exposed to Python via the PyO3 framework; the `setup.py` for the project lists `setuptools-rust` and pins PyO3 to a specific minor version so that the compiled extension binds against a stable ABI.[^1][^15] The Python package (`tiktoken/`) contains the user-facing `Encoding` class, the lookup tables in `model.py`, and the `tiktoken_ext` plugin namespace under which encoding factories are registered.[^15]

Tokenization in tiktoken follows the standard byte-level BPE pipeline that traces back to Rico Sennrich, Barry Haddow, and Alexandra Birch's 2016 ACL paper on subword units for neural machine translation, which itself adapted the byte pair encoding compression algorithm published by Philip Gage in 1994.[^16] In tiktoken's variant, text is first encoded to UTF-8 bytes, then split into chunks by a regular expression that captures contractions, runs of letters, runs of digits, runs of punctuation, and various whitespace classes. The Rust core then iteratively merges adjacent byte pairs according to a fixed merge ranking until no further merges from the encoding's vocabulary apply.[^1][^17]
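The merge step can be illustrated with a toy version in plain Python. This is a sketch under stated assumptions, not the Rust implementation: the rank table `toy_ranks` is a hypothetical vocabulary (all 256 single bytes plus two merges), and the function handles one pre-split chunk at a time.

```python
def bpe_merge(chunk: bytes, ranks: dict[bytes, int]) -> list[int]:
    """Greedily merge the adjacent pair with the lowest rank until none apply."""
    parts = [bytes([b]) for b in chunk]  # start from single bytes
    while len(parts) > 1:
        best_index, best_rank = -1, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_index, best_rank = i, rank
        if best_rank is None:
            break  # no adjacent pair is in the vocabulary
        merged = parts[best_index] + parts[best_index + 1]
        parts[best_index:best_index + 2] = [merged]
    return [ranks[part] for part in parts]


# Hypothetical vocabulary: all 256 single bytes, then two learned merges.
toy_ranks = {bytes([i]): i for i in range(256)}
toy_ranks[b"th"] = 256
toy_ranks[b"the"] = 257

print(bpe_merge(b"the", toy_ranks))   # [257]
print(bpe_merge(b"then", toy_ranks))  # [257, 110]
```

Because a lower rank means an earlier-learned merge, `b"th"` is merged before `b"the"` can form, which mirrors the order in which the merges were learned.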

The regex split follows GPT-2's in spirit but is rewritten with Unicode property escapes and possessive quantifiers. The `cl100k_base` pretokenization pattern, for example, is:

```
'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}++|\p{N}{1,3}+| ?[^\s\p{L}\p{N}]++[\r\n]*+|\s++$|\s*[\r\n]|\s+(?!\S)|\s
```

This pattern matches English contractions case-insensitively, runs of Unicode letters, numeric groups of up to three digits, runs of non-letter non-digit punctuation, and trailing or interleaved whitespace.[^17] The `o200k_base` pattern, introduced for GPT-4o, generalizes the letter classes to include modifier letters, other letters, and combining marks (`\p{Lm}`, `\p{Lo}`, and `\p{M}`) so that scripts like Devanagari, Tamil, and Arabic split into fewer pieces before the BPE merges run.[^18][^19] The pattern also adds a case-aware branch that keeps capitalized words intact and explicitly handles a longer set of contraction suffixes.[^17]
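To see what this kind of splitting does to a sentence, the sketch below uses a standard-library approximation. It is not the library's pattern: Python's built-in `re` module has no `\p{L}` or `\p{N}` classes, so letters are approximated as `[^\W\d_]` and digits as `\d`, and the names `APPROX_SPLIT` and `pretokenize` are illustrative only.

```python
import re

# Stdlib-only approximation of a cl100k-style splitter (an assumption for
# illustration; the real pattern relies on Unicode property escapes).
APPROX_SPLIT = re.compile(
    r"'(?i:[sdmt]|ll|ve|re)"        # contraction suffixes such as 's or 'll
    r"|(?:[^\r\n\w]|_)?[^\W\d_]+"   # optional leading symbol or space, then letters
    r"|\d{1,3}"                     # digits in groups of at most three
    r"| ?(?:[^\s\w]|_)+[\r\n]*"     # punctuation run with optional leading space
    r"|\s*[\r\n]"                   # whitespace that ends in a newline
    r"|\s+(?!\S)"                   # whitespace run, leaving the last space for the next word
    r"|\s+"                         # any remaining whitespace
)


def pretokenize(text: str) -> list[str]:
    """Split text into the chunks that BPE merging would then process."""
    return APPROX_SPLIT.findall(text)


print(pretokenize("Hello world, it's 12345!"))
# ['Hello', ' world', ',', ' it', "'s", ' ', '123', '45', '!']
```

Each chunk is merged independently, so no token can span a boundary such as the one between `123` and `45`.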

Special tokens such as `<|endoftext|>`, the FIM fill-in-the-middle sentinels (`<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>`), and `<|endofprompt|>` are stored at fixed integer IDs above the BPE vocabulary. For `cl100k_base`, `<|endoftext|>` is token 100257 while the FIM tokens occupy 100258 through 100260; for `o200k_base`, `<|endoftext|>` is 199999 and `<|endofprompt|>` is 200018.[^17] tiktoken requires callers to opt into special tokens by listing them in the `allowed_special` argument of `encode`, which prevents accidental injection of FIM or end-of-text markers from untrusted user input.[^1]

A `tiktoken_ext.openai_public` plugin registers the four production encodings by name and points at the compressed BPE merge tables hosted on OpenAI's `tiktoken_bfile` blob storage. The first call to `tiktoken.get_encoding("cl100k_base")` downloads the merge file and caches it under `~/.cache/tiktoken` (or a directory chosen via `TIKTOKEN_CACHE_DIR`), so subsequent calls on the same machine work offline.[^1][^7]

The Rust core is exposed to Python through the `CoreBPE` class, which receives a Python dictionary mapping byte sequences to token integers, a separate dictionary of special tokens, and the regex pattern string. The constructor builds an internal `HashMap<Vec<u8>, u32>` for the merge table and a reverse `HashMap<u32, Vec<u8>>` for the decoder, then compiles the pattern with the `fancy_regex` crate (which supports the lookaround and possessive quantifiers that the pretokenization patterns rely on) for use during encoding.[^1][^15] The `_encode_native` method on `CoreBPE` performs the inner loop without holding the Python global interpreter lock, which is what allows `encode_batch` to scale across the worker threads of a Python thread pool.[^1]
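The thread-pool pattern this enables can be sketched as follows. The helper name `encode_many` is hypothetical, and a byte-level stand-in replaces the real encoder so the example runs without tiktoken installed; with the library, `encode` would be an `Encoding.encode` bound method.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def encode_many(
    texts: Sequence[str],
    encode: Callable[[str], list[int]],
    num_threads: int = 8,
) -> list[list[int]]:
    """Encode each text on a worker thread, preserving input order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() yields results in the order of `texts`, not completion order.
        return list(pool.map(encode, texts))


# Stand-in encoder. A pure-Python callable like this one still holds the GIL;
# threads only help when `encode` releases it, as described above.
def byte_encoder(text: str) -> list[int]:
    return list(text.encode("utf-8"))


batches = encode_many(["hello", "world!"], byte_encoder, num_threads=2)
print([len(tokens) for tokens in batches])  # [5, 6]
```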

Memory usage is dominated by the merge dictionary. The `cl100k_base` merge file decompresses to roughly 1.7 MiB and the `o200k_base` file to roughly 3.5 MiB; both are loaded fully into RAM at first use. The library exposes no streaming or memory-mapped loader, which means cold-start latency is dominated by the disk read or HTTP fetch of the merge file rather than by the merge table construction itself.[^1]

## Which encodings does tiktoken use?

The encodings shipped in the library track successive generations of OpenAI text models. The table below lists the four production encodings plus the legacy `p50k_edit` variant, with mappings taken from the `tiktoken/model.py` source on the `main` branch in May 2026.[^15][^17]

| Encoding | Vocabulary size | First introduced | Representative models |
|----------|-----------------|-------------------|------------------------|
| `r50k_base` (alias `gpt2`) | 50,257 tokens | February 2019 with GPT-2 | `gpt-2`, `davinci`, `curie`, `babbage`, `ada` |
| `p50k_base` | 50,281 tokens | June 2022 with Codex | `text-davinci-002`, `text-davinci-003`, `code-davinci-002`, `cushman-codex` |
| `p50k_edit` | 50,284 tokens | March 2022 with edit endpoint | `text-davinci-edit-001`, `code-davinci-edit-001` |
| `cl100k_base` | 100,256 mergeable tokens plus five special tokens (`n_vocab` of 100,277) | December 2022 with GPT-3.5 turbo and GPT-4 | `gpt-3.5-turbo`, `gpt-4`, `gpt-4-turbo`, `text-embedding-ada-002`, `text-embedding-3-small`, `text-embedding-3-large`, `davinci-002`, `babbage-002` |
| `o200k_base` | 199,998 mergeable tokens plus two special tokens | May 2024 with GPT-4o | `gpt-4o`, `gpt-4o-mini`, `gpt-4.1`, `gpt-5`, `o1`, `o3`, `o3-mini`, `o4-mini` |

The `r50k_base` encoding is byte-identical to the public [GPT-2](/wiki/gpt-2) tokenizer published by OpenAI in 2019: 256 base bytes, 50,000 learned merges, and a single `<|endoftext|>` token at index 50256.[^9] `p50k_base` extends this with merges learned on Codex's enlarged corpus (predominantly source code) and reserves several extra IDs for whitespace-only merges that improve indentation handling.[^20]

`cl100k_base`, introduced around the launch of ChatGPT on 30 November 2022 (the ChatGPT API followed on 1 March 2023) and carried into the GPT-4 release on 14 March 2023, roughly doubles the vocabulary and adds first-class support for FIM completion. The OpenAI Cookbook recipe "How to count tokens with tiktoken" lists `cl100k_base` as the canonical encoding for `gpt-4`, `gpt-4-turbo`, `gpt-3.5-turbo`, and all three current embedding endpoints (`text-embedding-ada-002`, `text-embedding-3-small`, `text-embedding-3-large`).[^2]

`o200k_base` doubles the vocabulary again to roughly 200,000 entries. OpenAI shipped this encoding with [GPT-4o](/wiki/gpt_4o) on 13 May 2024 and uses it for every subsequent flagship model, including the o-series reasoning models (`o1`, `o3`, `o3-mini`, `o4-mini`) and the [GPT-4.1](/wiki/gpt-4.1) and [GPT-5](/wiki/gpt-5) families.[^11][^15] OpenAI's `MODEL_PREFIX_TO_ENCODING` table uses prefix matching, so any model whose name starts with `gpt-5-`, `gpt-4.1-`, `gpt-4.5-`, `o1-`, `o3-`, or `o4-mini-` automatically resolves to `o200k_base` without requiring a new release of tiktoken every time OpenAI ships a versioned model.[^15]
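The exact-then-prefix lookup can be sketched in plain Python. The two tables below are small illustrative subsets built from names mentioned in this article, not copies of the real tables in `tiktoken/model.py`, and `resolve_encoding` is a hypothetical helper name.

```python
# Illustrative subsets only; the real tables live in tiktoken/model.py.
EXACT = {
    "gpt-4o": "o200k_base",
    "gpt-4": "cl100k_base",
    "davinci": "r50k_base",
}
PREFIXES = {
    "gpt-5-": "o200k_base",
    "gpt-4.1-": "o200k_base",
    "o1-": "o200k_base",
}


def resolve_encoding(model: str) -> str:
    """Return the encoding name for a model: exact match first, then prefix."""
    if model in EXACT:
        return EXACT[model]
    for prefix, encoding in PREFIXES.items():
        if model.startswith(prefix):
            return encoding
    raise KeyError(f"no encoding registered for model {model!r}")


print(resolve_encoding("gpt-4"))       # cl100k_base
print(resolve_encoding("gpt-5-mini"))  # o200k_base, via the "gpt-5-" prefix
```

A versioned name resolves as long as it starts with a registered prefix, which is why new snapshots do not need a library release.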

## How do you use tiktoken in Python?

The library exposes a compact Python API. Two factory functions create `Encoding` objects, and the resulting objects expose `encode` and `decode` methods that round-trip text and integers.[^1][^2]

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("tiktoken is great!")
text = enc.decode(tokens)

enc4o = tiktoken.encoding_for_model("gpt-4o-mini")
print(len(enc4o.encode("Hello world")))
```

`tiktoken.get_encoding(name)` returns the encoding by explicit name, while `tiktoken.encoding_for_model(model)` looks the model up in the `MODEL_TO_ENCODING` table (or applies prefix matching) and returns the corresponding encoding. The OpenAI Cookbook documents the canonical pattern for token counting as `len(encoding.encode(string))`.[^2]

Beyond `encode` and `decode`, the `Encoding` class exposes `encode_batch` for parallel processing of multiple strings, `encode_with_unstable` for tokenizing input that may be a partial prefix of a larger string (useful when streaming completions), and `decode_single_token_bytes` for inspecting individual token bytes. The `name`, `eot_token`, and `n_vocab` attributes describe the encoding statically.[^1] Special tokens are disallowed by default; callers must pass `allowed_special={"<|endoftext|>"}` (or `"all"`) to permit them, which guards against injecting end-of-text markers from user data.[^1]

The Python package depends only on `regex` for the pretokenization pattern and on `requests` for the one-time blob download; `blobfile` is an optional dependency used when applications need to load merge files from cloud object stores.[^3] The Rust core ships as a manylinux wheel for x86_64 and aarch64 Linux, plus universal2 wheels for macOS and a Windows x86_64 wheel, so most users install with `pip install tiktoken` and never invoke `cargo`.[^3]

Two further utilities round out the API. `tiktoken.list_encoding_names()` returns the names of all registered encodings, including those contributed by plugins, and `Encoding.encode_ordinary` skips the special token check entirely for performance-sensitive paths where the caller knows the input contains no chat or FIM markers. The latter is the function that most token counting wrappers, including LangChain's `_get_token_ids`, end up calling because it avoids the overhead of pattern matching against the special token list on every call.[^1][^5]

For applications that need to construct custom encodings (for example, while training a derivative model on a private vocabulary), tiktoken's `Encoding` constructor accepts the merge dictionary, the special tokens, the pretokenization regex, and an optional explicit vocabulary size directly. The library deliberately does not ship a training routine, so users wanting to learn new merges typically pair `tiktoken.Encoding` with an external trainer such as Karpathy's `rustbpe` or Hugging Face's `BpeTrainer`, then load the resulting merges into a fresh `Encoding` for inference.[^14][^15]

## How fast is tiktoken?

The repository's README claims that "tiktoken is between 3-6x faster than a comparable open source tokeniser," referring to a benchmark that tokenizes 1 GB of text with the GPT-2 tokeniser and compares tiktoken against Hugging Face's `GPT2TokenizerFast` (measured with `tokenizers==0.13.2`, `transformers==4.24.0`, and `tiktoken==0.2.0`).[^1] Independent reviewers have found a smaller but consistent advantage for single-threaded workloads: an analysis published on machinelearningplus.com in 2024 reported a 2 to 3 times speedup for tiktoken in OpenAI token counting tasks, with the gap narrowing when Hugging Face's `encode_batch` distributed the work across Rust threads.[^21]

The speed advantage comes from two design choices. First, tiktoken's pretokenization regex is compiled once, when the encoding is constructed, by the `fancy_regex` crate (which hands off to the `regex` crate's automata where a pattern allows it), and the Rust core avoids allocating intermediate Python objects during the inner merge loop.[^1] Second, the BPE merges are stored as a flat `HashMap<Vec<u8>, u32>` keyed by byte sequences rather than the trie-based approach used in some earlier implementations, which trades a small amount of memory for cache-friendly lookups.[^1]

Several newer Rust tokenizers have since outperformed tiktoken on specific benchmarks. The `bpe` crate's backtracking encoder reports being roughly 3 times faster than tiktoken with pretokenization enabled and 10 times faster than Hugging Face's BPE without pretokenization on its synthetic corpus.[^22] The `rs-bpe` project published comparisons in 2025 showing it outperforms both tiktoken and Hugging Face's tokenizers on Latin text.[^23] These results sit alongside tiktoken rather than displacing it: tiktoken's value is that it ships the exact OpenAI merge tables under the MIT license and is the implementation the official cookbook recommends, so applications targeting OpenAI models continue to use it because fidelity to OpenAI's token counts matters more to them than raw throughput.[^2]

## How does tiktoken handle non-English text?

The `o200k_base` encoding's principal practical change over `cl100k_base` is improved compression of non-English text. Independent measurements published in the weeks after GPT-4o launched show large reductions in token counts for Indic, Arabic, Chinese, and other non-Latin scripts. One analysis on njkumar.com cited a Tamil example where the sentence "நீ எப்படி இருக்கிறாய்? நான் நல்லா இருக்கேன்" was tokenized as 68 tokens with `cl100k_base` but only 21 with `o200k_base`, a 3.2 times compression improvement.[^19] An aggregated comparison on aipmguru.substack.com reported a 17 percent reduction in tokens for English text and an 86 percent reduction for Gujarati, alongside specific drops from 70 to 21 tokens for an Arabic phrase and 56 to 18 for a Chinese phrase.[^24]
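The two ways of reporting these gains (a compression ratio versus a percent reduction) are easy to confuse, so the arithmetic is worked through below using only the token counts quoted above. The helper names are illustrative.

```python
def compression_ratio(before: int, after: int) -> float:
    """How many times fewer tokens the new encoding needs."""
    return before / after


def percent_reduction(before: int, after: int) -> float:
    """Share of tokens saved, as a percentage of the old count."""
    return 100 * (before - after) / before


def ratio_from_reduction(percent: float) -> float:
    """Convert a percent reduction into the equivalent compression ratio."""
    return 1 / (1 - percent / 100)


# Token counts quoted above (cl100k_base -> o200k_base).
examples = {
    "Tamil sentence": (68, 21),
    "Arabic phrase": (70, 21),
    "Chinese phrase": (56, 18),
}
for label, (before, after) in examples.items():
    print(
        f"{label}: {compression_ratio(before, after):.1f}x fewer tokens, "
        f"{percent_reduction(before, after):.0f}% reduction"
    )

# The reported 86 percent reduction corresponds to roughly a 7x ratio.
print(f"{ratio_from_reduction(86):.1f}x")
```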

Two factors drive the improvement. First, the larger merge table can absorb full common phrases as single tokens in scripts where `cl100k_base` had to fall back to byte-level encoding. Second, the pretokenization regex was rewritten to use the Unicode general categories `\p{Lo}` (other letters), `\p{Lm}` (modifier letters), and `\p{M}` (combining marks), which prevents the splitter from cutting between a base character and its combining diacritic before the BPE merges have a chance to operate.[^17][^19]

The compression improvement matters for users in two ways. First, because OpenAI bills its API by token, the same source text in Hindi or Korean costs less to send to GPT-4o than to GPT-4 turbo. Second, the model can fit more content in its [context window](/wiki/context_window) when the encoding is denser. The trade-off is that the larger vocabulary increases the embedding table size and the softmax over the output distribution, which contributes to GPT-4o's higher per-token compute compared to a hypothetical model that retained `cl100k_base`.[^11][^19]

OpenAI's commentary in the GPT-4o launch post highlighted the multilingual tokenizer change as one of three principal upgrades over GPT-4 turbo, alongside native multimodal support and lower latency. Modal's blog on the `o200k_harmony` variant (an extension of `o200k_base` shipped with the open weight `gpt-oss` models in 2025) confirmed that the base encoding has 199,998 BPE merges plus two reserved special tokens, with the Harmony variant adding chat-template tokens for tool calls.[^25]

## How does tiktoken compare to other tokenizers?

The two principal alternatives to tiktoken in the open ecosystem are Hugging Face's `tokenizers` library and Google's [SentencePiece](/wiki/sentencepiece). All three implement the BPE family but differ in scope, training capability, and licensing.

Hugging Face `tokenizers`, also written in Rust with Python bindings, is a much broader toolkit. It supports BPE, WordPiece, and Unigram models, exposes a `PreTokenizer` and `Normalizer` configuration system, and ships a `train_from_iterator` method that fits new tokenizers from corpora. By contrast, tiktoken is inference only: it ships pretrained OpenAI merge tables and offers no training API.[^1][^26] In raw token counting, tiktoken is faster on single threads, but the gap narrows once Hugging Face's `encode_batch` parallelizes across Rust threads.[^21]

[SentencePiece](/wiki/sentencepiece), published by Taku Kudo and John Richardson at Google in 2018, takes a different architectural stance: it treats input as a raw stream of Unicode characters, whitespace included, and trains directly on unprocessed text, which makes it language agnostic and removes the need for a separate pretokenization regex. SentencePiece supports both BPE and Unigram language model tokenization and is the implementation used by T5, Gemma, LLaMA, and the original PaLM.[^27] SentencePiece's BPE differs from tiktoken's primarily because it escapes whitespace with a `▁` (U+2581) meta symbol rather than relying on byte-level handling of whitespace; this makes its outputs incompatible with tiktoken's merge tables, but the algorithmic complexity is similar.[^27]

A third class of implementation, often grouped under the loose umbrella "GPT2-style" tokenizers, includes pure JavaScript ports (such as `niieani/gpt-tokenizer`), .NET ports (`tryAGI/Tiktoken`), and Go and R bindings that reuse the official `cl100k_base` and `o200k_base` merge files. These ports exist because OpenAI's official package only targets Python; the merge files themselves are CC-licensed metadata that ports can redistribute. None of these ports trains new tokenizers either: they exist purely to count or encode tokens against OpenAI's vocabularies in non-Python environments.[^28]

[Karpathy](/wiki/andrej_karpathy)'s `minbpe` and `rustbpe` projects exist explicitly as educational reference implementations. `minbpe` reproduces tiktoken's encode path in roughly 200 lines of Python and adds a small training loop; `rustbpe` extends the Rust path with a `train` method that tiktoken lacks. Both projects target byte-compatible output with tiktoken's `cl100k_base` for any input.[^14]

## Who uses tiktoken?

tiktoken sits at the foundation of nearly every Python tool that interacts with OpenAI's text APIs. The package's PyPI page reported 167 million downloads in the 30 days preceding 26 May 2026, with 4.99 million downloads on the most recent day alone, ranking it among the most downloaded AI-adjacent libraries on the index.[^3]

[LangChain](/wiki/langchain) uses tiktoken as the default token counter inside its `BaseChatModel.get_num_tokens_from_messages` implementation for OpenAI models, and the framework's documentation recommends installing tiktoken alongside the core package for any application that needs to enforce context length limits.[^5] [LlamaIndex](/wiki/llamaindex) sets `cl100k_base` as its global default tokenizer via the `Settings.tokenizer` attribute; the project's documentation specifically calls out the dependency on tiktoken to match GPT-3.5 turbo's default behavior.[^4] LiteLLM, an aggregation layer that exposes a single API across multiple providers, falls back to tiktoken when no provider-specific tokenizer is available.[^5]

Third party token counters for non-OpenAI models often use tiktoken as a rough approximation. Anthropic's official position is that [Claude](/wiki/claude) requires its own `client.messages.count_tokens` endpoint for accurate counts, but several community libraries continue to use `tiktoken.get_encoding("p50k_base")` as a heuristic for older Claude 2 era inputs. Anthropic now publishes its own TypeScript tokenizer in the `anthropic-tokenizer-typescript` repository for browser-side estimation, leaving tiktoken as the OpenAI-specific tool.[^29][^30]

tiktoken is also the canonical reference implementation cited by academic papers when reporting token-based metrics for OpenAI evaluation. Papers studying multilingual token compression, prompt cost analysis, and prompt injection routinely report token counts using tiktoken's `cl100k_base` or `o200k_base` and link the exact library version they used for reproducibility.[^19][^31]

The library has been ported into more than a dozen non-Python ecosystems. The `tiktoken-go` package provides Go bindings, `gpt-tokenizer` and `js-tiktoken` cover Node.js and the browser, `tryAGI/Tiktoken` covers .NET, and `rtiktoken` provides an R interface that wraps the official Rust core through `extendr`. Most of these projects bundle the official merge tables and reach byte-identical output with the Python package for the supported encodings, which means their token counts can be used as the ground truth for OpenAI billing in JavaScript, Go, and .NET applications without round-tripping calls to a Python service.[^28]

## What are tiktoken's limitations?

The library exposes only inference. Users cannot fit new BPE vocabularies through tiktoken itself; producing a new encoding requires an external trainer and then constructing a `tiktoken.Encoding` around the resulting merges. The repository's README states this constraint explicitly and points to `minbpe` and `rustbpe` as suggested training companions.[^1][^14]

Special token handling is conservative by design. Calling `encode("<|endoftext|>")` raises a `ValueError` unless the caller passes `allowed_special={"<|endoftext|>"}` or sets `allowed_special="all"`. This prevents prompt injection through user-supplied content that contains the literal special token string, but it can surprise developers who unwittingly include the token in their templates. The opposite mode, `disallowed_special=()`, turns the check off so that special-token strings are encoded as ordinary text, and is sometimes used to round-trip arbitrary text without errors.[^1]

tiktoken's merge files are downloaded from blob storage on first use, which means the library does not work in air-gapped environments without prior preparation. The workaround documented in the README is to download the merge file in a connected environment, copy `~/.cache/tiktoken` to the target machine, and set `TIKTOKEN_CACHE_DIR` to the destination. Several community ports (such as `js-tiktoken` and the .NET `Tiktoken` package) bundle the merge tables directly to avoid this problem.[^1][^7][^28]
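The final step of that workaround, pointing the library at the copied directory, might look like the sketch below. The helper name `use_offline_cache` is hypothetical and the temporary directory merely stands in for a real pre-populated cache; only the `TIKTOKEN_CACHE_DIR` variable comes from the text above.

```python
import os
import tempfile
from pathlib import Path


def use_offline_cache(cache_dir: str) -> Path:
    """Point tiktoken at a pre-populated cache directory.

    Call this before the first tiktoken.get_encoding(...) so the loader can
    read the copied merge files instead of attempting a download.
    """
    path = Path(cache_dir).expanduser().resolve()
    if not path.is_dir():
        raise FileNotFoundError(f"cache directory not found: {path}")
    os.environ["TIKTOKEN_CACHE_DIR"] = str(path)
    return path


# Demo with a throwaway directory standing in for the copied cache.
with tempfile.TemporaryDirectory() as demo_dir:
    configured = use_offline_cache(demo_dir)
    configured_ok = os.environ["TIKTOKEN_CACHE_DIR"] == str(configured)
    print(configured_ok)
```

Failing fast on a missing directory makes a misconfigured deployment visible at startup rather than at the first tokenization call.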

The library targets OpenAI's vocabularies and matches them exactly, but it offers no tokenizers for other providers. Counts produced by `tiktoken.get_encoding("cl100k_base")` for a Claude or [LLaMA](/wiki/llama) input are approximations that drift from the actual model's tokenizer by a few percent on English and considerably more on multilingual text. Anthropic's documentation explicitly recommends using `client.messages.count_tokens` rather than tiktoken for Claude, and the `transformers` library's `LlamaTokenizer` is the canonical counter for Meta's models.[^29][^30]

Finally, because the merge tables are byte-level, tiktoken cannot guarantee that arbitrary sequences of tokens decode to valid UTF-8. Individual tokens may be partial UTF-8 byte sequences that only become valid when concatenated with neighbors; `decode_single_token_bytes` returns raw bytes precisely so that callers handling streaming output can buffer until a complete code point arrives. This behavior is documented in the README and matches how OpenAI's server-side streaming endpoints deliver byte-level deltas.[^1]
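One way to do that buffering is with the standard library's incremental UTF-8 decoder. The sketch below is an illustration rather than a tiktoken API: `stream_text` is a hypothetical helper, and the byte strings stand in for what `decode_single_token_bytes` might return for consecutive tokens.

```python
import codecs
from typing import Iterable, Iterator


def stream_text(token_bytes: Iterable[bytes]) -> Iterator[str]:
    """Yield text only once complete UTF-8 code points are available."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    for chunk in token_bytes:
        text = decoder.decode(chunk)  # buffers any incomplete trailing bytes
        if text:
            yield text
    # Flush at end of stream; raises if bytes are left dangling.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Hypothetical per-token bytes: "é" (0xC3 0xA9) is split across two tokens.
pieces = [b"caf", b"\xc3", b"\xa9", b"!"]
print(list(stream_text(pieces)))  # ['caf', 'é', '!']
```

Nothing is emitted for the lone `0xC3` byte; the character appears only when its second byte arrives.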

## ELI5: What does tiktoken actually do?

Language models do not read words; they read numbers. Before [GPT-4](/wiki/gpt-4) can process the sentence "tiktoken is great!", that text has to be chopped into pieces called tokens and each piece swapped for a number from a fixed list. tiktoken is the tool that does the chopping and the swapping. It uses a method called byte pair encoding, which starts from raw bytes and repeatedly glues together the most common neighboring pairs, so frequent chunks like " the" or "ing" become single tokens while rare words get split into several pieces. The number of tokens is exactly what OpenAI charges you for and exactly what fills up a model's memory limit, so people use tiktoken to answer two everyday questions: how much will this prompt cost, and will it fit?

## See also

- [Byte-Pair Encoding](/wiki/byte_pair_encoding)
- [Tokenization](/wiki/tokenization)
- [SentencePiece](/wiki/sentencepiece)
- [GPT-2](/wiki/gpt-2)
- [GPT-4](/wiki/gpt-4)
- [GPT-4o](/wiki/gpt_4o)
- [GPT-4.1](/wiki/gpt-4.1)
- [GPT-5](/wiki/gpt-5)
- [OpenAI](/wiki/openai)
- [LangChain](/wiki/langchain)
- [LlamaIndex](/wiki/llamaindex)
- [Context window](/wiki/context_window)
- [Andrej Karpathy](/wiki/andrej_karpathy)

## References

[^1]: OpenAI, "openai/tiktoken: tiktoken is a fast BPE tokeniser for use with OpenAI's models", GitHub, 2026-05-15. https://github.com/openai/tiktoken. Accessed 2026-05-26.
[^2]: OpenAI, "How to count tokens with tiktoken", OpenAI Cookbook. https://developers.openai.com/cookbook/examples/how_to_count_tokens_with_tiktoken. Accessed 2026-05-26.
[^3]: tiktoken project, "tiktoken 0.13.0", PyPI, 2026-05-15. https://pypi.org/project/tiktoken/. Accessed 2026-05-26.
[^4]: LlamaIndex, "Using LLMs", LlamaIndex documentation. https://docs.llamaindex.ai/en/stable/module_guides/models/llms/. Accessed 2026-05-26.
[^5]: BerriAI, "Completion Token Usage and Cost", LiteLLM documentation. https://docs.litellm.ai/docs/completion/token_usage. Accessed 2026-05-26.
[^6]: tiktoken project, "tiktoken 0.1.1", PyPI, 2022-12-15. https://pypi.org/project/tiktoken/0.1.1/. Accessed 2026-05-26.
[^7]: Hacker News, "Tiktoken: OpenAI's Tokenizer", Y Combinator, 2022-12-16. https://news.ycombinator.com/item?id=34008839. Accessed 2026-05-26.
[^8]: OpenAI, "tiktoken/LICENSE", GitHub, 2022. https://github.com/openai/tiktoken/blob/main/LICENSE. Accessed 2026-05-26.
[^9]: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, "Language Models are Unsupervised Multitask Learners", OpenAI, 2019-02-14. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. Accessed 2026-05-26.
[^10]: OpenAI, "Releases: openai/tiktoken", GitHub. https://github.com/openai/tiktoken/releases. Accessed 2026-05-26.
[^11]: OpenAI, "Hello GPT-4o", OpenAI blog, 2024-05-13. https://openai.com/index/hello-gpt-4o/. Accessed 2026-05-26.
[^12]: rjarun8235, "Add GPT-5 model support with o200k_base encoding", openai/tiktoken pull request 440, GitHub. https://github.com/openai/tiktoken/pull/440. Accessed 2026-05-26.
[^13]: Andrej Karpathy, "Let's build the GPT Tokenizer", YouTube, 2024-02-20. https://www.youtube.com/watch?v=zduSFxRajkE. Accessed 2026-05-26.
[^14]: Andrej Karpathy, "karpathy/minbpe: Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization", GitHub, 2024-02-16. https://github.com/karpathy/minbpe. Accessed 2026-05-26.
[^15]: OpenAI, "tiktoken/tiktoken/model.py", GitHub. https://github.com/openai/tiktoken/blob/main/tiktoken/model.py. Accessed 2026-05-26.
[^16]: Rico Sennrich, Barry Haddow, and Alexandra Birch, "Neural Machine Translation of Rare Words with Subword Units", Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016-08-07. https://aclanthology.org/P16-1162/. Accessed 2026-05-26.
[^17]: OpenAI, "tiktoken/tiktoken_ext/openai_public.py", GitHub. https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py. Accessed 2026-05-26.
[^18]: Various contributors, "o200k_base pretokenizer regex error?", openai/tiktoken issue 298, GitHub. https://github.com/openai/tiktoken/issues/298. Accessed 2026-05-26.
[^19]: NJ Kumar, "Multilingual token compression in GPT-o family models", njkumar.com, 2024-06-12. https://www.njkumar.com/gpt-o-multilingual-token-compression/. Accessed 2026-05-26.
[^20]: OpenAI, "Models", OpenAI Platform documentation. https://platform.openai.com/docs/models. Accessed 2026-05-26.
[^21]: MachineLearningPlus, "tiktoken vs HuggingFace Tokenizers: Benchmark Guide", 2024. https://machinelearningplus.com/gen-ai/tiktoken-vs-huggingface-tokenizers/. Accessed 2026-05-26.
[^22]: Github user contribution, "bpe", crates.io. https://crates.io/crates/bpe. Accessed 2026-05-26.
[^23]: gweidart, "rs-bpe outperforms tiktoken and tokenizers", DEV Community, 2025-03-18. https://dev.to/gweidart/rs-bpe-outperforms-tiktoken-tokenizers-2h3j. Accessed 2026-05-26.
[^24]: AIPM Guru, "The Invisible Upgrade: How Tokenization Quietly Got Better (And Why Your AI Costs Dropped)", Substack. https://aipmguru.substack.com/p/the-invisible-upgrade-how-tokenization. Accessed 2026-05-26.
[^25]: Modal, "What is o200k Harmony? OpenAI's latest edition to their tiktoken tokenizer library", Modal blog, 2025. https://modal.com/blog/what-is-o200k-harmony. Accessed 2026-05-26.
[^26]: Hugging Face, "huggingface/tokenizers", GitHub. https://github.com/huggingface/tokenizers. Accessed 2026-05-26.
[^27]: Taku Kudo and John Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing", arXiv:1808.06226, 2018-08-19. https://arxiv.org/abs/1808.06226. Accessed 2026-05-26.
[^28]: tryAGI, "tryAGI/Tiktoken: High performance .NET BPE tokenizer", GitHub. https://github.com/tryAGI/Tiktoken. Accessed 2026-05-26.
[^29]: Anthropic, "Token counting", Claude API docs. https://platform.claude.com/docs/en/build-with-claude/token-counting. Accessed 2026-05-26.
[^30]: Anthropic, "anthropics/anthropic-tokenizer-typescript", GitHub. https://github.com/anthropics/anthropic-tokenizer-typescript. Accessed 2026-05-26.
[^31]: tiktoken project, "tiktoken download statistics", pypistats.org. https://pypistats.org/packages/tiktoken. Accessed 2026-05-26.