Tokens


What Are Tokens?

Tokens are pieces of words that can include trailing spaces or even sub-word fragments. Natural language processing (NLP) systems, such as the OpenAI API, use tokens to manage and process text input. How words are split into tokens depends on the language, which can affect the cost of using the API for languages other than English.
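As an illustration of the language effect, here is a minimal sketch comparing token counts with the Tiktoken library (introduced under "Tokenization Tools" below); the two sentences are illustrative examples:

```python
# Sketch: the same idea can cost more tokens in some languages than others.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # encoding used by GPT-3 models

english = "How many tokens does this sentence use?"
spanish = "¿Cuántos tokens utiliza esta oración?"

# Non-English text typically splits into more tokens per word,
# making equivalent API requests more expensive.
print(len(enc.encode(english)))
print(len(enc.encode(spanish)))
```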

Understanding Token Lengths

To get a feel for token lengths, consider these general approximations:

- 1 token ≈ 4 characters in English
- 1 token ≈ ¾ of a word
- 100 tokens ≈ 75 words

Regarding sentences and paragraphs:

- 1–2 sentences ≈ 30 tokens
- 1 paragraph ≈ 100 tokens
- 1,500 words ≈ 2,048 tokens

For more context, consider these examples:

- Wayne Gretzky’s quote "You miss 100% of the shots you don't take" contains 11 tokens.
- OpenAI’s charter contains 476 tokens.
- The transcript of the US Declaration of Independence contains 1,695 tokens.

Tokenization Tools

To delve deeper into tokenization, the following tools and libraries are available:

- OpenAI's interactive Tokenizer tool
- Tiktoken, a fast BPE tokenizer designed for OpenAI models
- The Transformers package for Python
- The gpt-3-encoder package for Node.js
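As an example, here is a minimal sketch of counting tokens with Tiktoken (assuming the package is installed):

```python
# Sketch: counting and inspecting tokens with tiktoken.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # GPT-3's encoding

quote = "You miss 100% of the shots you don't take"
tokens = enc.encode(quote)

print(len(tokens))  # 11, matching the Gretzky example above
for t in tokens:
    print(t, repr(enc.decode([t])))  # each token ID and the text it maps to
```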

Token Limits

The token limit for a request depends on the model employed. For example, a model with a 4,097-token limit shares those tokens between the prompt and its completion: if the prompt uses 4,000 tokens, the completion can be at most 97 tokens. The limit is a technical constraint, but there are strategies for working within it, such as condensing prompts or dividing text into smaller sections.
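For instance, here is a small sketch (using the 4,097-token limit from above, with counting via tiktoken) of checking how much room a prompt leaves for its completion:

```python
# Sketch: budgeting prompt vs. completion tokens against a model's limit.
import tiktoken

TOKEN_LIMIT = 4097  # shared between prompt and completion
enc = tiktoken.get_encoding("r50k_base")

def completion_budget(prompt: str) -> int:
    """Tokens left for the completion after counting the prompt."""
    return TOKEN_LIMIT - len(enc.encode(prompt))

print(completion_budget("Summarize the following article: ..."))
```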

Token Pricing

API token pricing varies by model type, with different capabilities and speeds available at distinct price points. Davinci is the most capable model, while Ada is the fastest. Detailed token pricing information can be found on the OpenAI API's pricing page.

Token Context

GPT-3 processes tokens based on their context in the corpus data. Identical words may map to different tokens depending on how they are arranged within the text. For instance, the token generated for the word "red" changes with its context:

- Lowercase in the middle of a sentence: " red" (token: "2266")
- Uppercase in the middle of a sentence: " Red" (token: "2297")
- Uppercase at the beginning of a sentence: "Red" (token: "7738")

The more likely or common a token is, the lower the token number assigned to it. For example, the token for the period ("13") remains the same in all three sentences because periods are used similarly throughout the corpus data.
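These IDs can be verified directly; a minimal sketch, assuming the r50k_base encoding used by the GPT-3 models:

```python
# Sketch: the same word maps to different token IDs depending on
# capitalization and the surrounding space.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

for text in [" red", " Red", "Red", "."]:
    print(repr(text), "->", enc.encode(text))
# Expected: [2266], [2297], [7738], and [13] respectively.
```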

Designing Prompts with Token Knowledge

Understanding tokens can improve prompt design in several ways.

Prompts Ending with a Space

Because tokens can include trailing spaces, prompts that end with a space may produce suboptimal output: the API's token vocabulary already accounts for trailing spaces, so a lone space at the end of a prompt is encoded as its own, relatively rare token rather than as part of the next word, as the sketch below illustrates.
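A quick way to see this is to compare tokenizations with and without the final space; a minimal sketch using tiktoken:

```python
# Sketch: a prompt's trailing space is encoded as a separate token
# instead of attaching to the word the model generates next.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

print(enc.encode("The sky is"))   # ends cleanly on a word token
print(enc.encode("The sky is "))  # same IDs plus one extra lone-space token
```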

Using the logit_bias Parameter

The logit_bias parameter lets you set biases for specific tokens, altering the probability that those tokens appear in the completion. For example, when designing an AI Baking Assistant that must take users' egg allergies into account, logit_bias can be used to discourage the model from generating responses that include any form of the word "egg".

First, use a tokenizer tool to identify the tokens for which biases should be set:

- Singular with trailing space: " egg" (token: "5935")
- Plural with trailing space: " eggs" (token: "9653")
- Subword token generated for "Egg" or "Eggs": "gg" (token: "1130")

The logit_bias parameter accepts bias values ranging from -100 to +100. Extreme values either prohibit (-100) or exclusively select (+100) the associated token.
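Putting this together, here is a minimal sketch of such a request using the openai Python package's 2023-era Completion endpoint (the model name, prompt, and API key are illustrative placeholders):

```python
# Sketch: discouraging egg-related tokens with logit_bias.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Completion.create(
    model="text-davinci-003",  # an illustrative GPT-3 model
    prompt="Write a banana bread recipe.",
    max_tokens=256,
    logit_bias={
        "5935": -100,  # " egg"
        "9653": -100,  # " eggs"
        "1130": -100,  # "gg" (covers "Egg" / "Eggs")
    },
)

print(response["choices"][0]["text"])
```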

With these logit biases in place, the likelihood of the word "egg" (and its variations) appearing in the response for a banana bread recipe is greatly diminished. The AI Baking Assistant consequently generates a response that excludes eggs, meeting its requirement of being mindful of the user's egg allergy.