Prompt engineering for text generation



===Context or External Information===
Context or external information is additional information supplied to the model that it may not already contain. It can be manually inserted into the prompt, retrieved from a vector database ([[retrieval augmentation]]), or gathered from other sources such as [[APIs]]. Providing context helps the model generate more informed and precise responses.
*Context or external information is entirely optional.
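The sketch below shows one way such context can be retrieved and inserted into a prompt. It is a minimal example that assumes the sentence-transformers embedding library; the documents and query are made-up placeholders, and a production setup would typically use a dedicated vector database.
<syntaxhighlight lang="python">
# Minimal retrieval-augmentation sketch: embed a small document store, find the
# documents most similar to the query, and paste them into the prompt as context.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [  # toy in-memory "document store" (placeholder content)
    "Joe Biden is the 46th president of the United States (2021-2025).",
    "Barack Obama was the 44th president of the United States (2009-2017).",
    "The Eiffel Tower is located in Paris, France.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = model.encode(documents)  # one embedding vector per document

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q = model.encode([query])[0]
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

query = "Who is the president of the United States in 2023?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # the assembled prompt is what gets sent to the language model
</syntaxhighlight>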


===Example(s)===
Examples are a few instances of user input/query together with the corresponding desired output, provided by the user and shown to the language model. They are often used in [[few-shot prompting]] and are entirely optional.


===User Input or Query===
===Output Indicator===
The output indicator signifies the beginning of the generated text. For instance, when generating Python code, the word "import" may be used to indicate that the model should start writing Python code, as many Python scripts begin with the "import" statement. The output indicator helps to maintain the desired format and structure of the response.
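For instance, a code-generation prompt might end with the output indicator so the model continues directly from it (illustrative wording):
<poem style="border: 1px solid; padding: 1rem">
'''Instruction:''' Write a Python script that reads a CSV file and prints the number of rows.
'''Output indicator:'''
import
</poem>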
===Example of a Prompt with the Above Structure===
Example using [[ChatGPT]] with [[GPT-4]]
====Input====
<poem style="border: 1px solid; padding: 1rem">
'''Instruction:''' Identify the president of the United States of America for the given year.
'''Context:'''
Since GPT-4's knowledge is limited to information up to September 2021, we provide additional context so the model can answer correctly for the years covering Joe Biden's term.
#1861-1865: Abraham Lincoln was the 16th president of the United States.
#1933-1945: Franklin D. Roosevelt was the 32nd president of the United States and served four terms.
#1961-1963: John F. Kennedy was the 35th president of the United States.
#2009-2017: Barack Obama was the 44th president of the United States.
#2021-2025: Joe Biden is the 46th president of the United States.
'''Examples:'''
*1801: Thomas Jefferson
*1901: William McKinley
*1981: Ronald Reagan
*2001: George W. Bush
'''Query and output indicator:'''
*1945:
*2017:
*2023:
</poem>
====Output====
<poem style="border: 1px solid; padding: 1rem">
*1945: Franklin D. Roosevelt
*2017: Barack Obama
*2023: Joe Biden
</poem>


==Building Prompts==
The model can also be asked to act as a technical advisor, mentor, quality assurance engineer, code reviewer, debugging assistant, compliance checker, code optimization specialist, accessibility expert, search engine optimization specialist, or performance analyst. Examples of prompts for these use cases are available [https://github.com/mattnigh/ChatGPT3-Free-Prompt-List here].


==Zero-shot Prompting==
[[Zero-shot prompting]] involves giving the model a [[task description]]/[[instruction]] and directly asking it for a result. It does not show the model any [[examples]].
 
<poem style="border: 1px solid; padding: 1rem">
'''Input (task description/instruction):''' Who was the president of the United States in 1999?
'''Output:'''
</poem>
 
==Few-shot Prompting (Basic)==
[[Few-shot prompting]] provides a task description, a few examples, and then the query.


===For example===
<poem style="border: 1px solid; padding: 1rem">
Vacationing in Florida is fun: FL
</poem>
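A minimal sketch of assembling such a few-shot prompt as a plain string, using a state-abbreviation task like the one above; the task wording and the extra example sentences are illustrative assumptions.
<syntaxhighlight lang="python">
# Build a few-shot prompt: task description, then examples, then the query.
task_description = "Give the two-letter abbreviation of the US state mentioned in each sentence."

examples = [  # (sentence, label) demonstrations; made up for illustration
    ("I grew up in New York", "NY"),
    ("Texas is known for barbecue", "TX"),
    ("Vacationing in Florida is fun", "FL"),
]

query = "The Golden Gate Bridge is in California"

lines = [task_description]
lines += [f"{sentence}: {label}" for sentence, label in examples]
lines.append(f"{query}:")  # the trailing colon doubles as the output indicator

prompt = "\n".join(lines)
print(prompt)  # this string is then sent to the language model
</syntaxhighlight>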
==Few-shot Prompting (Advanced)==
In few-shot prompting, the model is presented with high-quality demonstrations, including input and desired output, for the target task. This approach enables the model to understand the human intention better and the desired criteria for answers, often resulting in improved performance compared to zero-shot prompting. However, this comes at the expense of increased token consumption and may reach the context length limit for longer input and output texts.
Numerous studies have explored how to construct in-context examples to maximize performance. [[Prompt format]], [[training examples]], and [[example order]] can lead to dramatically different performance outcomes, ranging from near-random guessing to near state-of-the-art (SoTA) results.
Zhao et al. (2021) investigated [[few-shot classification]] using LLMs, specifically [[GPT-3]]. They identified several biases that contribute to high [[variance]] in performance: (1) majority [[label bias]], (2) [[recency bias]], and (3) [[common token bias]]. To address these [[biases]], they proposed a method to calibrate label probabilities output by the model to be uniform when the input string is N/A.<ref name="”111”">Zhao et al. (2021) Calibrate Before Use: Improving Few-Shot Performance of Language Models https://arxiv.org/abs/2102.09690</ref>
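A minimal sketch of that calibration step, with made-up probability values: the model's bias is estimated from a content-free input such as "N/A", and the label probabilities for real inputs are rescaled so that the content-free input would be scored uniformly.
<syntaxhighlight lang="python">
# Contextual calibration sketch: divide out the bias measured on a content-free
# input, then renormalize. The probability values below are placeholders.
import numpy as np

p_cf = np.array([0.70, 0.20, 0.10])    # label probabilities the model assigns to the content-free input "N/A"
p_test = np.array([0.50, 0.30, 0.20])  # raw label probabilities for a real test input

calibrated = p_test / p_cf             # equivalent to applying W = diag(p_cf)^-1
calibrated /= calibrated.sum()         # renormalize into a probability distribution
print(calibrated)                      # the calibrated distribution is used for the final prediction
</syntaxhighlight>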
===Tips for Example Selection===
====Semantically Similar Examples====
Liu et al. (2021) suggested choosing examples that are semantically similar to the test example by employing [[k-nearest neighbors]] (KNN) clustering in the [[embedding space]].<ref name="”112”">Liu et al. (2021) What Makes Good In-Context Examples for GPT-3? https://arxiv.org/abs/2101.06804</ref>
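A minimal sketch of KNN-based example selection, again assuming the sentence-transformers embedding library; the candidate pool and the test input are placeholders.
<syntaxhighlight lang="python">
# Select the k labelled examples whose embeddings are closest to the test input.
import numpy as np
from sentence_transformers import SentenceTransformer

pool = [  # candidate (input, label) pairs for the few-shot prompt (placeholders)
    ("The movie was a delight from start to finish", "positive"),
    ("I want my money back, terrible product", "negative"),
    ("The restaurant was fine, nothing special", "neutral"),
    ("Best concert I have ever attended", "positive"),
]

model = SentenceTransformer("all-MiniLM-L6-v2")
pool_vectors = model.encode([text for text, _ in pool])

def knn_examples(test_input: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k pool items most similar to the test input (cosine similarity)."""
    q = model.encode([test_input])[0]
    sims = pool_vectors @ q / (np.linalg.norm(pool_vectors, axis=1) * np.linalg.norm(q))
    return [pool[i] for i in np.argsort(sims)[::-1][:k]]

print(knn_examples("An absolutely wonderful film"))  # these become the in-context examples
</syntaxhighlight>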
====Diverse and Representative Examples====
Su et al. (2022) proposed a [[graph-based approach]] to select a diverse and representative set of examples: (1) construct a directed graph based on the cosine similarity between samples in the embedding space (e.g., using [[SBERT]] or other [[embedding models]]), and (2) start with a set of selected samples and a set of remaining samples, scoring each sample to encourage [[diverse selection]].<ref name="”113”">Su et al. (2022) Selective Annotation Makes Language Models Better Few-Shot Learners https://arxiv.org/abs/2209.01975</ref>
====Embeddings via Contrastive Learning====
Rubin et al. (2022) suggested training embeddings through [[contrastive learning]] specific to one [[training dataset]] for [[in-context learning]] sample selection. This approach measures the quality of an example based on a conditioned probability assigned by the language model.<ref name="”114”">Rubin et al. (2022) Learning To Retrieve Prompts for In-Context Learning https://arxiv.org/abs/2112.08633</ref>
====Q-Learning====
Zhang et al. (2022) framed in-context example selection as a sequential decision problem and explored using [[Q-Learning]] to choose examples.<ref name="”115”">Zhang et al. (2022) Active Example Selection for In-Context Learning https://arxiv.org/abs/2211.04486</ref>
====Uncertainty-Based Active Learning====
Diao et al. (2023) proposed identifying examples with [[high disagreement]] or [[entropy]] among multiple sampling trials based on [[uncertainty-based active learning]]. These examples can then be annotated and used in few-shot prompts.<ref name="”116”">Diao et al. (2023) Active Prompting with Chain-of-Thought for Large Language Models https://arxiv.org/abs/2302.12246</ref>
===Tips for Example Ordering===
A general recommendation is to maintain a diverse selection of examples relevant to the test sample and present them in random order to avoid [[majority label bias]] and [[recency bias]]. Increasing [[model size]]s or including more [[training examples]] does not necessarily reduce [[variance]] among different permutations of in-context examples. An order that works well for one model may work poorly for another.
When the [[validation set]] is limited, Lu et al. (2022) suggested choosing the order such that the model does not produce extremely unbalanced predictions or exhibit overconfidence in its predictions.<ref name="”117”">Lu et al. (2022) Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity https://arxiv.org/abs/2104.08786</ref>
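A simplified sketch of that heuristic: score candidate orderings by the entropy of the predicted-label distribution over a small probe set and keep an ordering whose predictions are not extremely unbalanced. The probe inputs and the predict_label stub are placeholders standing in for model-generated probes and an actual language-model call.
<syntaxhighlight lang="python">
# Score example orderings by how balanced the model's predictions are on a
# probe set (higher entropy = less unbalanced), then keep the best ordering.
import itertools
import math
import random
from collections import Counter

examples = [("great film", "positive"), ("awful film", "negative"),
            ("superb acting", "positive"), ("boring plot", "negative")]
probe_inputs = ["a stunning movie", "a waste of time", "an average story"]

def predict_label(ordered_examples, text):
    # Placeholder standing in for prompting the model with this example order.
    return random.choice(["positive", "negative"])

def prediction_entropy(ordered_examples):
    counts = Counter(predict_label(ordered_examples, t) for t in probe_inputs)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

best_order = max(itertools.permutations(examples), key=prediction_entropy)
print(best_order)
</syntaxhighlight>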


==Roles==
*'''[[Straightforward]]''' and '''[[Professional]]''' - business emails, formal communication, legal documents
*'''[[Trustworthy]]''' and '''[[Professional]]''' - business proposals, executive summaries, investor pitches
==Self-Consistency Sampling==
[[Self-consistency sampling]] is a method for generating multiple outputs using a [[temperature]] greater than 0 and selecting the best candidate from the generated outputs. The criteria for choosing the best candidate may vary according to the task. A common approach is to use [[majority vote]]. In tasks that are easy to validate, such as programming questions with unit tests, the outputs can be run through an interpreter and their correctness can be verified using unit tests.<ref name="”118”">Wang et al. (2022a) Self-Consistency Improves Chain of Thought Reasoning in Language Models https://arxiv.org/abs/2203.11171</ref>
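A minimal sketch of the majority-vote variant; sample_answer is a simulated placeholder standing in for sampling the language model at a non-zero temperature.
<syntaxhighlight lang="python">
# Self-consistency sketch: sample several answers at temperature > 0 and
# return the most common one. The sampling function is simulated here.
import random
from collections import Counter

def sample_answer(prompt: str, temperature: float = 0.7) -> str:
    # Placeholder for a language-model sampling call; simulates occasional errors.
    return random.choice(["42", "42", "42", "41"])

def self_consistent_answer(prompt: str, n_samples: int = 10) -> str:
    answers = [sample_answer(prompt, temperature=0.7) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]  # majority vote

print(self_consistent_answer("What is 6 * 7? Let's think step by step."))
</syntaxhighlight>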


==Chain of Thought Prompting==
{{see also|Chain of Thought Prompting}}
[[Chain of Thought Prompting]] (CoT prompting) is a technique introduced by Wei et al. (2022) to generate a sequence of short sentences describing step-by-step reasoning, known as [[reasoning chains]] or [[rationales]], leading to the final answer. [[CoT prompting]] is particularly useful for complex reasoning tasks when applied to large language models (e.g., those with over 50 billion parameters), while simpler tasks may benefit only marginally.<ref name="”119”">Wei et al. (2022) Chain-of-Thought Prompting Elicits Reasoning in Large Language Models https://arxiv.org/abs/2201.11903</ref>
 
===Types of CoT Prompts===
There are two main types of CoT prompting:
 
====Few-shot CoT====
[[Few-shot CoT prompting]] involves providing the model with a limited number of demonstrations, each containing either manually written or model-generated high-quality reasoning chains. Examples of such demonstrations are provided in the original article, showcasing how this type of prompting is used to solve various mathematical reasoning problems.
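An illustrative few-shot CoT prompt in the style of the demonstrations from Wei et al. (2022): each demonstration pairs a question with a written-out reasoning chain, and the final question is left for the model to complete.
<poem style="border: 1px solid; padding: 1rem">
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A:
</poem>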
 
====Zero-shot CoT====
[[Zero-shot CoT prompting]] uses natural language statements, such as "Let's think step by step" or "Let's work this out step by step to be sure we have the right answer," to explicitly encourage the model to generate reasoning chains. Following this, a statement like "Therefore, the answer is" is used to prompt the model to produce the final answer.<ref name="”128”">Kojima et al. (2022) Large Language Models are Zero-Shot Reasoners https://arxiv.org/abs/2205.11916</ref><ref name="”129”">Zhou et al. (2022) Large Language Models Are Human-Level Prompt Engineers https://arxiv.org/abs/2211.01910</ref>
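An illustrative zero-shot CoT prompt using a simple arithmetic word problem:
<poem style="border: 1px solid; padding: 1rem">
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: Let's think step by step.
</poem>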
 
===Tips and Extensions===
Several techniques have been proposed to improve the accuracy and effectiveness of CoT prompting:
 
*[[Self-consistency sampling]], as suggested by Wang et al. (2022a), can improve reasoning accuracy by sampling a number of diverse answers and taking the majority vote.<ref name="”118”"></ref>
 
*Wang et al. (2022b) proposed using ensemble learning by altering the example order or replacing human-written rationales with model-generated ones, introducing randomness during multiple sample trials. Model outputs can then be aggregated using a majority vote to obtain the final answer.<ref name="”120”">Wang et al. (2022b) Rationale-Augmented Ensembles in Language Models https://arxiv.org/abs/2207.00747</ref>
 
*If training examples only have true answers but no rationales, the [[STaR]] ([[Self-Taught Reasoner]]) method by Zelikman et al. (2022) can be followed: (1) ask the model to generate reasoning chains and keep only those leading to correct answers; (2) fine-tune the model with generated rationales and repeat the process until convergence. Higher temperature settings are more likely to generate incorrect rationales with correct answers.<ref name="”121”">Zelikman et al. (2022) STaR: Bootstrapping Reasoning With Reasoning https://arxiv.org/abs/2203.14465</ref>
 
*Fu et al. (2023) found that prompts with demonstrations of higher reasoning complexity lead to better performance. They also suggested that using newline (\n) symbols to separate reasoning steps works better than step indicators, periods, or semicolons.<ref name="”122”">Fu et al. (2023) Complexity-Based Prompting for Multi-Step Reasoning https://arxiv.org/abs/2210.00720</ref>
 
*Complexity-based consistency, as proposed by Fu et al. (2023), involves explicitly preferring complex chains among all generations by taking a majority vote among only the top complex chains.<ref name="”122”"></ref>
 
*Shum et al. (2023) discovered that CoT prompts with only complex examples improve the accuracy of complex questions but perform poorly on simple questions. This finding was based on evidence from the [[GSM8k]] dataset.<ref name="”123”">Shum et al. (2023) Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data https://arxiv.org/abs/2302.12822</ref>
 
*Fu et al. (2023) found that changing "Q:" to "Question:" in the prompts is helpful.<ref name="”122”"></ref>
 
*Ye & Durrett (2022) observed that including explanations in prompts has a small to moderate effect on [[NLP]] tasks that involve reasoning over text, such as [[question-answering]] (QA) and [[natural language inference]] (NLI). They also noted that nonfactual explanations are more likely to lead to incorrect predictions than inconsistent explanations.<ref name="”124”">Ye & Durrett (2022) The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning https://arxiv.org/abs/2205.03401</ref>
 
*[[Self-Ask]], a method proposed by Press et al. (2022), repeatedly prompts the model to ask follow-up questions, constructing the thought process iteratively.<ref name="”125”">Press et al. (2022) Measuring and Narrowing the Compositionality Gap in Language Models https://arxiv.org/abs/2210.03350</ref> Search engine results can be used to answer these follow-up questions. Similarly, [[IRCoT]] ([[Interleaving Retrieval CoT]]; Trivedi et al. 2022) and [[ReAct]] ([[Reason + Act]]; Yao et al. 2023) combine iterative CoT prompting with queries to Wikipedia APIs. These methods search for relevant entities and content and then incorporate the retrieved information back into the context, further enhancing the model's reasoning capabilities.<ref name="”126”">Trivedi et al. (2022) Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions https://arxiv.org/abs/2212.10509</ref><ref name="”127”">Yao et al. (2023) ReAct: Synergizing Reasoning and Acting in Language Models https://arxiv.org/abs/2210.03629</ref> A minimal sketch of the Self-Ask loop is shown after this list.
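A minimal sketch of the [[Self-Ask]] loop; generate and web_search are simulated placeholders standing in for a language-model call and a search-engine query, and the question and intermediate steps are only illustrative.
<syntaxhighlight lang="python">
# Self-Ask style loop: the model proposes follow-up questions, each answered by
# an external search, until it states a final answer. Both helper functions are
# simulated stand-ins, not real APIs.

_simulated_model_steps = iter([
    "Follow up: How old was Theodor Haecker when he died?",
    "Follow up: How old was Harry Vaughan Watkins when he died?",
    "So the final answer is: Harry Vaughan Watkins.",
])

def generate(context: str) -> str:
    return next(_simulated_model_steps)  # stand-in for a language-model continuation

def web_search(question: str) -> str:
    return f"(search result for: {question})"  # stand-in for a search-engine call

def self_ask(question: str, max_steps: int = 5) -> str:
    context = f"Question: {question}\nAre follow up questions needed here: Yes.\n"
    step = ""
    for _ in range(max_steps):
        step = generate(context)  # model proposes the next step given the context so far
        if step.startswith("Follow up:"):
            follow_up = step[len("Follow up:"):].strip()
            context += f"{step}\nIntermediate answer: {web_search(follow_up)}\n"
        else:
            break  # the model produced the final answer
    return step

print(self_ask("Who lived longer, Theodor Haecker or Harry Vaughan Watkins?"))
</syntaxhighlight>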
 
==26 Principles for Good Prompts==
{{see also|26 Principles of Good Prompts}}
{{:26 Principles of Good Prompts}}


==Prompt Engineering for Code Generation Models==
*[[Vividness]] -
*[[Ecclesiastical]] -
==Resources==
*'''[https://github.com/openai/openai-cookbook OpenAI Cookbook]''' - detailed examples of how to use LLMs effectively.
*'''[[LangChain]]''' - library for combining language models with other components to build applications.
*'''[https://github.com/dair-ai/Prompt-Engineering-Guide Prompt Engineering Guide]''' - a comprehensive collection of educational materials on prompt engineering.
*'''[https://learnprompting.org/docs/intro learnprompting.org]''' - free online course on prompt engineering.
*'''[https://promptperfect.jina.ai/ PromptPerfect]''' - tool for automatically optimizing prompts.
*'''[https://github.com/microsoft/semantic-kernel Semantic Kernel]''' - Microsoft's open-source SDK for integrating LLMs into applications.


==References==
<references />