Tokens

==Introduction==
[[Tokens]] are fragments of words, which may include trailing spaces or sub-words. Before [[natural language processing]] (NLP) systems, such as [[large language models]] (LLMs), process a [[prompt]], the prompt is transformed into tokens. The way words are broken down into tokens is language-dependent, which can make an API more expensive to use for languages other than English.
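As a rough illustration, the Python sketch below (using the [[Tiktoken]] library; the <code>cl100k_base</code> encoding is chosen here only as an example) shows how a prompt is split into token ids and word fragments, and how the same idea phrased in another language can take a different number of tokens:

<syntaxhighlight lang="python">
# A minimal sketch of how a prompt becomes tokens, using the tiktoken library
# (pip install tiktoken). The encoding name is an assumption for illustration;
# each model family ships its own encoding, so counts vary between models.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Tokens are fragments of words."
token_ids = encoding.encode(prompt)

# Decode each id individually to see the word fragments the model receives.
pieces = [encoding.decode([t]) for t in token_ids]
print(token_ids)   # a list of integers
print(pieces)      # word fragments, often with leading spaces

# Tokenization is language-dependent: the same idea expressed in another
# language usually maps to a different number of tokens, which affects cost.
spanish = "Los tokens son fragmentos de palabras."
print(len(token_ids), len(encoding.encode(spanish)))
</syntaxhighlight>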


==Understanding Token Lengths==
To comprehend the idea of tokens, consider these general approximations concerning token lengths:


*1 token ≈ 4 characters in English
*1 token ≈ ¾ words
*100 tokens ≈ 75 words

Regarding sentences and paragraphs:


*1-2 sentences ≈ 30 tokens
*1 paragraph ≈ 100 tokens
*1,500 words ≈ 2048 tokens

For more context, consider these examples:


*Wayne Gretzky’s quote "You miss 100% of the shots you don't take" contains 11 tokens.
*OpenAI’s charter contains 476 tokens.
*The transcript of the US Declaration of Independence contains 1,695 tokens.
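These rules of thumb can be checked against an actual tokenizer. The sketch below assumes the [[Tiktoken]] library with the GPT-3-era <code>r50k_base</code> encoding; other encodings give slightly different counts, so the ratios are approximate:

<syntaxhighlight lang="python">
# A rough check of the approximations above. "r50k_base" is assumed here
# because it is the GPT-3-era encoding; treat the ratios as rules of thumb.
import tiktoken

encoding = tiktoken.get_encoding("r50k_base")

quote = "You miss 100% of the shots you don't take"
tokens = encoding.encode(quote)

print(len(tokens))                       # token count for the quote
print(len(quote) / len(tokens))          # roughly 4 characters per token in English
print(len(quote.split()) / len(tokens))  # roughly 0.75 words per token
</syntaxhighlight>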


==Tokenization Tools==
To delve deeper into tokenization, the following tools and libraries are available:


*[https://platform.openai.com/tokenizer OpenAI's interactive Tokenizer tool]
*[[Tiktoken]], a fast BPE tokenizer specifically for OpenAI models
*[[Transformers]] package for Python
*[[gpt-3-encoder]] package for node.js
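As a minimal example of the [[Transformers]] package, the sketch below loads a tokenizer (the <code>gpt2</code> checkpoint is used here only as an example) and inspects the token ids and sub-word pieces it produces:

<syntaxhighlight lang="python">
# A minimal sketch using the Hugging Face Transformers package
# (pip install transformers). "gpt2" is just an example checkpoint;
# any model's tokenizer can be loaded the same way.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokens are fragments of words."
ids = tokenizer.encode(text)

print(ids)                                   # token ids
print(tokenizer.convert_ids_to_tokens(ids))  # the underlying sub-word pieces
print(len(ids))                              # token count for the text
</syntaxhighlight>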


==Token Limits==