==Introduction==
[[Tokens]] are fragments of words, which may include trailing spaces or sub-words. Before [[natural language processing]] (NLP) systems such as [[large language models]] (LLMs) process a [[prompt]], the prompt is transformed into tokens. How words are broken down into tokens is language-dependent, which can affect the cost of using an API for languages other than English.
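As an illustration, the following minimal sketch encodes a short prompt with the [[Tiktoken]] library listed under Tokenization Tools below. The <code>cl100k_base</code> encoding is an assumption here; other models use different vocabularies, so the exact fragments and counts will vary.

<syntaxhighlight lang="python">
import tiktoken

# Assumption: the "cl100k_base" BPE encoding; other models ship other encodings.
enc = tiktoken.get_encoding("cl100k_base")

prompt = "Tokenization splits prompts into sub-words."
token_ids = enc.encode(prompt)

# Decode each id on its own to see the fragments, including any leading spaces.
fragments = [enc.decode([t]) for t in token_ids]

print(token_ids)
print(fragments)
print(f"{len(token_ids)} tokens for {len(prompt)} characters")
</syntaxhighlight>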
==Understanding Token Lengths==
To get a sense of token lengths, consider these general approximations (a rough estimator based on these ratios is sketched after the lists below):
*1 token ≈ 4 characters in English
*1 token ≈ ¾ words
*100 tokens ≈ 75 words
Regarding sentences and paragraphs:
*1-2 sentences ≈ 30 tokens
*1 paragraph ≈ 100 tokens
*1,500 words ≈ 2,048 tokens
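As a quick illustration of these ratios (not a substitute for a real tokenizer), the sketch below estimates a token count from character and word counts; the helper name <code>estimate_tokens</code> is hypothetical.

<syntaxhighlight lang="python">
# Rough token estimates from the rules of thumb above. These apply to English
# text only; actual counts depend on the tokenizer and can differ noticeably.
def estimate_tokens(text: str) -> dict:
    chars = len(text)
    words = len(text.split())
    return {
        "from_characters": round(chars / 4),  # 1 token ≈ 4 characters
        "from_words": round(words * 4 / 3),   # 1 token ≈ 3/4 of a word
    }

print(estimate_tokens("You miss 100% of the shots you don't take"))
</syntaxhighlight>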
For more context, consider these examples:
*Wayne Gretzky’s quote "You miss 100% of the shots you don't take" contains 11 tokens.
*OpenAI’s charter contains 476 tokens.
*The transcript of the US Declaration of Independence contains 1,695 tokens.
==Tokenization Tools==
To delve deeper into tokenization, the following tools and libraries are available:
*[https://platform.openai.com/tokenizer OpenAI's interactive Tokenizer tool]
*[[Tiktoken]], a fast BPE tokenizer specifically for OpenAI models
*[[Transformers]] package for Python
*[[gpt-3-encoder]] package for Node.js
==Token Limits==