Line 95: | Line 95: | ||
</poem> | </poem> | ||
==Few-shot | ==Few-shot Prompting== | ||
[[Few-shot | ===Basics=== | ||
A [[Few-shot prompting|few-shot prompt]] has a task description, a few examples, and then the prompt to be completed. | |||
===For example=== | ====For example==== | ||
<poem style="border: 1px solid; padding: 1rem"> | <poem style="border: 1px solid; padding: 1rem"> | ||
Line 119: | Line 120: | ||
*Vacationing in Florida is fun: | *Vacationing in Florida is fun: | ||
===Example output=== | ====Example output==== | ||
<poem style="border: 1px solid; padding: 1rem"> | <poem style="border: 1px solid; padding: 1rem"> | ||
Vacationing in Florida is fun: FL | Vacationing in Florida is fun: FL | ||
</poem> | </poem> | ||
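The structure described above (task description, examples, then the prompt) can be assembled programmatically. The following is a minimal sketch: the function name <code>build_few_shot_prompt</code>, the task description, and the two demonstration sentences are hypothetical and chosen only to match the Florida example.

```python
def build_few_shot_prompt(task, examples, query):
    """Join a task description, worked examples, and the final query."""
    lines = [task]
    for text, label in examples:
        lines.append(f"{text}: {label}")
    # The last line stops at the colon so the model fills in the label.
    lines.append(f"{query}:")
    return "\n".join(lines)


# Hypothetical task description and demonstrations (not from the article).
prompt = build_few_shot_prompt(
    "Give the two-letter abbreviation of the US state in each sentence.",
    [("Skiing in Colorado is great", "CO"),
     ("Texas has good barbecue", "TX")],
    "Vacationing in Florida is fun",
)
print(prompt)
```

Ending the prompt at the colon leaves the label as the only thing the model needs to generate.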
===Advanced=== | |||
In few-shot prompting, the model is presented with high-quality demonstrations, including input and desired output, for the target task. This approach enables the model to understand the human intention better and the desired criteria for answers, often resulting in improved performance compared to zero-shot prompting. However, this comes at the expense of increased token consumption and may reach the context length limit for longer input and output texts. | |||
Numerous studies have explored how to construct in-context examples to maximize performance. Prompt format, training examples, and example order can lead to dramatically different performance outcomes, ranging from near-random guessing to near state-of-the-art (SoTA) results. | |||
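Zhao et al. (2021) investigated few-shot classification using LLMs, specifically GPT-3. They identified several biases that contribute to high variance in performance: (1) majority label bias, where the model favors labels that are frequent among the examples, (2) recency bias, where the model tends to repeat the label of the examples near the end of the prompt, and (3) common token bias, where the model favors common tokens over rare ones. To address these biases, they proposed calibrating the label probabilities output by the model so that they are uniform when the input is a content-free string such as "N/A". | |||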
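The calibration idea can be illustrated with a short sketch. It assumes the label probabilities have already been obtained from the model, once for the content-free input and once for the real test input; the numbers below are made up, and dividing then renormalizing is a simplified stand-in for the procedure in the paper rather than a reproduction of it.

```python
def calibrate(probs, content_free_probs):
    """Rescale label probabilities so the content-free input maps to uniform."""
    scaled = [p / c for p, c in zip(probs, content_free_probs)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Made-up label probabilities for a two-label task.
p_content_free = [0.7, 0.3]  # model output for the input "N/A"
p_test = [0.6, 0.4]          # model output for a real test input

print(calibrate(p_content_free, p_content_free))  # uniform by construction
print(calibrate(p_test, p_content_free))
```

In this toy case the uncalibrated prediction is the first label, while the calibrated prediction is the second, because the model was already biased toward the first label before seeing any real input.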
==Tips for Example Selection== | |||
===Semantically Similar Examples=== | |||
Liu et al. (2021) suggested choosing examples that are semantically similar to the test example by retrieving its k nearest neighbors (k-NN) in the embedding space. | |||
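A minimal sketch of nearest-neighbor example selection is shown here. It assumes each candidate example already has an embedding vector; the two-dimensional vectors and the names <code>select_nearest</code> and <code>pool</code> are hypothetical, and a real setup would obtain embeddings from a sentence encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_nearest(query_vec, pool, k):
    """Return the k examples whose embeddings are closest to the query."""
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [example for example, _ in ranked[:k]]


# Toy pool of (example, embedding) pairs with made-up 2-D embeddings.
pool = [("ex_a", [1.0, 0.0]), ("ex_b", [0.9, 0.1]), ("ex_c", [0.0, 1.0])]
chosen = select_nearest([1.0, 0.05], pool, k=2)
print(chosen)
```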
===Diverse and Representative Examples=== | |||
Su et al. (2022) proposed a graph-based approach to select a diverse and representative set of examples: (1) construct a directed graph based on the cosine similarity between samples in the embedding space (e.g., using SBERT or other embedding models), and (2) start with a set of selected samples and a set of remaining samples, scoring each sample to encourage diverse selection. | |||
===Embeddings via Contrastive Learning=== | |||
Rubin et al. (2022) suggested training embeddings through contrastive learning, specific to one training dataset, for in-context learning sample selection. This approach measures the quality of an example by the conditional probability that the language model assigns to the correct output when that example is included in the prompt. | |||
===Q-Learning=== | |||
Zhang et al. (2022) explored using Q-Learning to select in-context examples for LLMs. | |||
===Uncertainty-Based Active Learning=== | |||
Diao et al. (2023) applied uncertainty-based active learning: the model is sampled several times on each candidate question, and the questions whose answers show high disagreement or entropy across the sampling trials are identified. These examples can then be annotated and used in few-shot prompts. | |||
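The entropy criterion can be sketched as follows. The sampled answers are made up, and the helper name <code>answer_entropy</code> is hypothetical; in practice each list would hold the answers from repeated sampling of the model on one question.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Made-up answers from four sampling trials per question.
samples = {
    "q1": ["4", "4", "4", "4"],    # the model always agrees with itself
    "q2": ["7", "9", "7", "12"],   # the model disagrees with itself
}

# Questions with the highest entropy are the first candidates for annotation.
ranked = sorted(samples, key=lambda q: answer_entropy(samples[q]),
                reverse=True)
print(ranked)
```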
==Tips for Example Ordering== | |||
A general recommendation is to maintain a diverse selection of examples relevant to the test sample and present them in random order to avoid majority label bias and recency bias. Increasing model size or including more training examples does not necessarily reduce variance among different permutations of in-context examples. The same order may work well for one model but poorly for another. | |||
When the validation set is limited, Lu et al. (2022) suggested choosing the order such that the model does not produce extremely unbalanced predictions or exhibit overconfidence in its predictions. | |||
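One way to read the balance criterion is to score each candidate ordering by the entropy of the labels the model predicts on a small probe set and keep the ordering with the most balanced predictions. The sketch below is a simplified stand-in, not the procedure from the paper; <code>toy_predict</code> is a hypothetical model with a built-in recency bias, and the probe inputs are arbitrary strings.

```python
import itertools
import math
from collections import Counter


def label_entropy(predictions):
    """Entropy of the predicted-label distribution; higher is more balanced."""
    counts = Counter(predictions)
    n = len(predictions)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def pick_order(examples, probe_inputs, predict):
    """Try every permutation and keep the one with the most balanced output."""
    best = max(
        itertools.permutations(examples),
        key=lambda order: label_entropy(
            [predict(order, x) for x in probe_inputs]),
    )
    return list(best)


def toy_predict(order, x):
    # Hypothetical stand-in for a model call: copies the last
    # demonstration's label on odd-length inputs, else answers "pos".
    return order[-1][1] if len(x) % 2 else "pos"


examples = [("great film", "pos"), ("dull plot", "neg")]
probes = ["ab", "abc", "abcd", "abcde"]
best_order = pick_order(examples, probes, toy_predict)
print(best_order)
```

Enumerating every permutation is only feasible for a handful of examples; with more examples a random subset of orderings would have to be scored instead.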
==Roles== | ==Roles== |