==Introduction==
[[Tokens]] are fragments of words, which may include trailing spaces or sub-words. Before [[natural language processing]] (NLP) systems such as [[large language models]] (LLMs) process a [[prompt]], the prompt is transformed into tokens. How words are broken down into tokens is language-dependent, which can affect the cost of using an API for languages other than English.
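As an illustration, the following minimal sketch encodes a short prompt with the [[Tiktoken]] library listed under Tokenization Tools below. The <code>cl100k_base</code> encoding is an assumption here; other models use different vocabularies, so the exact fragments and counts will vary.

<syntaxhighlight lang="python">
import tiktoken

# Assumption: the "cl100k_base" BPE encoding; other models ship other encodings.
enc = tiktoken.get_encoding("cl100k_base")

prompt = "Tokenization splits prompts into sub-words."
token_ids = enc.encode(prompt)

# Decode each id on its own to see the fragments, including any leading spaces.
fragments = [enc.decode([t]) for t in token_ids]

print(token_ids)
print(fragments)
print(f"{len(token_ids)} tokens for {len(prompt)} characters")
</syntaxhighlight>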
==Understanding Token Lengths==
To get a sense of token lengths, consider these general approximations (a rough estimator based on these ratios is sketched after the lists below):
*1 token ≈ 4 characters in English
*1 token ≈ ¾ words
*100 tokens ≈ 75 words
Regarding sentences and paragraphs:
*1-2 sentences ≈ 30 tokens
*1 paragraph ≈ 100 tokens
*1,500 words ≈ 2,048 tokens
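As a quick illustration of these ratios (not a substitute for a real tokenizer), the sketch below estimates a token count from character and word counts; the helper name <code>estimate_tokens</code> is hypothetical.

<syntaxhighlight lang="python">
# Rough token estimates from the rules of thumb above. These apply to English
# text only; actual counts depend on the tokenizer and can differ noticeably.
def estimate_tokens(text: str) -> dict:
    chars = len(text)
    words = len(text.split())
    return {
        "from_characters": round(chars / 4),  # 1 token ≈ 4 characters
        "from_words": round(words * 4 / 3),   # 1 token ≈ 3/4 of a word
    }

print(estimate_tokens("You miss 100% of the shots you don't take"))
</syntaxhighlight>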
For more context, consider these examples:
*Wayne Gretzky’s quote "You miss 100% of the shots you don't take" contains 11 tokens.
*OpenAI’s charter contains 476 tokens.
*The transcript of the US Declaration of Independence contains 1,695 tokens.
==Tokenization Tools==
To delve deeper into tokenization, the following tools and libraries are available:
*[https://platform.openai.com/tokenizer OpenAI's interactive Tokenizer tool]
*[[Tiktoken]], a fast BPE tokenizer specifically for OpenAI models
*[[Transformers]] package for Python
*[[gpt-3-encoder]] package for Node.js
==Token Limits==