Tokens

==Introduction==
Tokens are fragments of words, which may include trailing spaces or sub-words. They are used by natural language processing (NLP) systems, such as the OpenAI API, to process text input. The way words are broken down into tokens is language-dependent, which can affect the implementation cost of the API for languages other than English.


==Understanding Token Lengths==
To understand tokens, consider these general approximations of token lengths:


*OpenAI’s charter contains 476 tokens.
*The transcript of the US Declaration of Independence contains 1,695 tokens.
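Counts like the ones above can be roughly approximated without running a tokenizer: for English text, one token corresponds to about four characters on average. A minimal sketch of that rule of thumb (the heuristic and the helper name are illustrative; exact counts require a real tokenizer such as Tiktoken):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for English text.

    Assumes the common ~4 characters-per-token heuristic for OpenAI
    models; real counts require an actual tokenizer.
    """
    return max(1, round(len(text) / chars_per_token))

# A 39-character sentence estimates to about 10 tokens.
print(estimate_tokens("The quick brown fox jumps over the dog."))
```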
==Tokenization Tools==
To delve deeper into tokenization, the following tools and libraries are available:


*[[OpenAI]]'s interactive Tokenizer tool
*[[Tiktoken]], a fast BPE tokenizer specifically for OpenAI models
*Transformers package for Python
*gpt-3-encoder package for node.js
==Token Limits==
The token limit for requests depends on the model used, with a maximum of 4097 tokens shared between the prompt and its completion. If a prompt consists of 4000 tokens, the completion can have a maximum of 97 tokens. This limitation is a technical constraint, but there are strategies to work within it, such as shortening prompts or dividing text into smaller sections.
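The budgeting arithmetic above can be sketched in code. A minimal example, assuming the 4097-token combined limit described here; the helper names are illustrative, not part of any OpenAI library:

```python
MODEL_TOKEN_LIMIT = 4097  # combined prompt + completion budget

def max_completion_tokens(prompt_tokens: int, limit: int = MODEL_TOKEN_LIMIT) -> int:
    """Tokens left for the completion after the prompt is counted."""
    return max(0, limit - prompt_tokens)

def chunk_tokens(token_ids, chunk_size):
    """Split a token-ID sequence into pieces that fit a smaller budget."""
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

print(max_completion_tokens(4000))                   # 97 tokens left, as above
print(len(chunk_tokens(list(range(10000)), 4000)))   # 3 chunks
```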


==Token Pricing==
API token pricing varies depending on the model type, with different capabilities and speeds offered at different price points. Davinci is the most capable model, while Ada is the fastest. Detailed token pricing information can be found on the OpenAI API's pricing page.
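Since pricing is per token, the cost of a request is a simple product of token count and rate. A sketch of that calculation; the per-1,000-token rates below are placeholders for illustration only, and the real, current rates are listed on the OpenAI API's pricing page:

```python
# Placeholder per-1,000-token prices, NOT real current rates;
# see the OpenAI API pricing page for actual figures.
PRICE_PER_1K_TOKENS = {
    "davinci": 0.0200,  # most capable (placeholder rate)
    "ada": 0.0004,      # fastest (placeholder rate)
}

def estimate_cost(model: str, tokens: int) -> float:
    """Estimated request cost in dollars for a given token count."""
    return PRICE_PER_1K_TOKENS[model] * tokens / 1000

print(round(estimate_cost("davinci", 4097), 4))
```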


==Token Context==
GPT-3 handles tokens based on their context in the corpus data. Identical words might map to different tokens depending on their position within the text. For instance, the token generated for the word "red" changes based on its context:


*Lowercase in the middle of a sentence: " red" (token: "2266")
*Uppercase in the middle of a sentence: " Red" (token: "2297")
*Uppercase at the beginning of a sentence: "Red" (token: "7738")

The more likely or common a token is, the lower the token number assigned to it. For example, the token for the period ("13") remains consistent in all three sentences because its usage is similar throughout the corpus data.


==Prompt Design and Token Knowledge==
Understanding tokens can help enhance [[prompt design]] in several ways.


===Prompts Ending with a Space===