See also: Machine learning terms
Zero-shot, one-shot, and few-shot learning are three closely related settings in machine learning and prompt engineering that describe how many labelled examples a model sees of a target task or class before it has to make a prediction. Zero-shot uses no task-specific examples, one-shot uses exactly one, and few-shot uses a small handful (commonly anywhere from two to a few dozen). All three are responses to the same practical problem: building useful systems when labelled data for the exact thing you care about is scarce, expensive, or simply nonexistent.
The terms have two distinct lineages that often get conflated, and it is worth pulling them apart before going any further. The older lineage is classical zero-shot and few-shot learning, a body of work in computer vision and NLP that started in 2008 with Larochelle, Erhan, and Bengio's "Zero-data Learning of New Tasks." Here, the model is trained with auxiliary information (attributes, class descriptions, semantic embeddings) so that at test time it can generalize to classes that were never in the training set. The newer lineage is zero-shot, one-shot, and few-shot prompting, a vocabulary popularized by Brown et al. in the 2020 GPT-3 paper "Language Models are Few-Shot Learners." Here, a pretrained large language model is conditioned on zero, one, or several input-output examples placed inside the prompt, with no parameter updates at all. Both lineages share a goal (rapid generalization with little data) but the mechanisms are very different.
The table below summarizes how the three settings differ, using the GPT-3 framing that has become standard in the LLM era.
| Setting | Examples in prompt or support set | Typical use | Example |
|---|---|---|---|
| Zero-shot | 0 | The model is given only a task description or instruction. It must rely entirely on knowledge acquired during pretraining or auxiliary class information. | "Translate to French: cheese ->" |
| One-shot | 1 | A single demonstration is provided. Often used when a task is hard to describe in words but easy to show. | "Translate to French: sea otter -> loutre de mer. cheese ->" |
| Few-shot | A small handful, commonly 2 to 32 (Brown et al. allowed up to 100) | Multiple demonstrations let the model infer format, label space, and edge cases. | A list of 5 to 20 worked English-to-French pairs followed by a new English word. |
In classical few-shot learning the analogous structure is the N-way K-shot task: at test time the model sees a small support set of N novel classes with K labelled examples each, then must classify a query set drawn from the same N classes. A 5-way 1-shot Omniglot task, for example, gives the model one example each of five new handwritten characters and asks it to identify another image of one of them.
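To make the episodic structure concrete, here is a minimal Python sketch of sampling a single N-way K-shot episode. The `data` dictionary mapping class names to lists of examples is a hypothetical stand-in for a dataset such as Omniglot; real implementations add batching and tensor conversion.

```python
import random

def sample_episode(data, n_way=5, k_shot=1, n_queries=5):
    """Sample one N-way K-shot episode from a dict mapping
    class name -> list of examples (e.g. image arrays)."""
    classes = random.sample(list(data), n_way)        # N novel classes
    support, query = [], []
    for label, cls in enumerate(classes):
        examples = random.sample(data[cls], k_shot + n_queries)
        support += [(x, label) for x in examples[:k_shot]]   # K labelled shots
        query += [(x, label) for x in examples[k_shot:]]     # held-out queries
    random.shuffle(query)
    return support, query  # the model classifies `query` using only `support`
```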
It is worth noting that the cutoffs are conventional, not principled. Brown et al. drew the line at "as many demonstrations as fit in the context window," while the Omniglot tradition tends to use small fixed K values such as 1 and 5. Different papers report numbers differently, and "few-shot" in a 2024 LLM paper often means three to ten examples while "few-shot" in a 2018 vision paper might mean exactly one or five.
The classical zero-shot setting was first formalized by Hugo Larochelle, Dumitru Erhan, and Yoshua Bengio in 2008 under the name "zero-data learning," applied to character recognition and a multi-task drug discovery problem. A year later, Mark Palatucci and colleagues at Carnegie Mellon coined the now-standard term "zero-shot learning" in a NIPS 2009 paper that decoded fMRI brain activity to predict which word a person was thinking of, even for words not in the training set. The trick was to map both the input (brain activity) and the output (a word) into a shared semantic space defined by hand-crafted features.
The attribute-based variant was popularized in computer vision the same year by Christoph Lampert, Hannes Nickisch, and Stefan Harmeling, whose CVPR 2009 paper "Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer" introduced the Animals with Attributes dataset. The idea is that animals can be described by a fixed vocabulary of attributes ("has stripes," "four legs," "black and white," "lives in Africa") and a model that predicts attributes well can identify a zebra at test time even if it has never seen one labelled, as long as it knows the zebra's attribute profile. The paper introduced two methods, Direct Attribute Prediction (DAP) and Indirect Attribute Prediction (IAP), achieving 40.5% and 27.8% accuracy respectively on Animals with Attributes.
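The mechanism is simple enough to sketch in a few lines. What follows is a minimal illustration of the direct-attribute-prediction idea, not Lampert et al.'s exact probabilistic formulation; the attribute profiles and probabilities are invented for the example.

```python
import numpy as np

# Binary attribute profiles for classes never seen during training
# (rows: unseen classes; columns: "has stripes", "four legs",
#  "hooved", "lives in Africa").
unseen_profiles = np.array([
    [1, 1, 1, 1],   # zebra
    [0, 1, 1, 1],   # rhinoceros
    [0, 0, 0, 0],   # dolphin
])
unseen_names = ["zebra", "rhinoceros", "dolphin"]

def classify_unseen(attribute_probs):
    """attribute_probs: per-attribute probabilities predicted by a model
    trained only on seen classes. Score each unseen class by the
    log-likelihood of its attribute profile under those predictions."""
    p = np.clip(attribute_probs, 1e-6, 1 - 1e-6)
    scores = unseen_profiles @ np.log(p) + (1 - unseen_profiles) @ np.log(1 - p)
    return unseen_names[int(np.argmax(scores))]

print(classify_unseen(np.array([0.9, 0.95, 0.8, 0.7])))  # -> zebra
```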
With the rise of word embedding methods in 2013, embedding-based zero-shot approaches took over. Andrea Frome and colleagues at Google introduced DeViSE at NIPS 2013, which trained a CNN to project images into a word embedding space (specifically a word2vec-style skip-gram space) so that an unseen class label like "okapi" could still be located in the same vector space as the image. Richard Socher, Milind Ganjoo, Christopher Manning, and Andrew Ng's "Zero-Shot Learning Through Cross-Modal Transfer" (NIPS 2013) made a similar move using a different embedding architecture. Mohammad Norouzi and colleagues' ConSE method (ICLR 2014) showed that you could simply take a probability-weighted average (a convex combination) of the word embeddings of an off-the-shelf classifier's top-K predicted classes and get strong zero-shot results without any joint training.
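ConSE in particular is nearly a one-liner. Here is a hedged numpy sketch of its convex-combination rule, assuming `seen_embeddings` and `unseen_embeddings` are precomputed word-embedding matrices whose rows correspond to class labels:

```python
import numpy as np

def conse_embed(class_probs, seen_embeddings, top_k=10):
    """ConSE: embed an image as the probability-weighted average of the
    word embeddings of its top-K predicted *seen* classes."""
    top = np.argsort(class_probs)[-top_k:]
    weights = class_probs[top] / class_probs[top].sum()   # renormalize
    return weights @ seen_embeddings[top]                 # convex combination

def zero_shot_predict(class_probs, seen_embeddings, unseen_embeddings):
    """Nearest unseen class to the image's induced embedding, by cosine."""
    v = conse_embed(class_probs, seen_embeddings)
    sims = unseen_embeddings @ v / (
        np.linalg.norm(unseen_embeddings, axis=1) * np.linalg.norm(v))
    return int(np.argmax(sims))  # index of the predicted unseen class
```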
A practical complication that this body of work surfaced is the gap between conventional zero-shot evaluation (where test classes are disjoint from train classes) and generalized zero-shot learning, where at test time you might see either a familiar class or a brand-new one and the model must handle both. Generalized zero-shot is much harder because models tend to be biased toward familiar classes. Yongqin Xian, Christoph Lampert, Bernt Schiele, and Zeynep Akata's 2018 "Zero-Shot Learning: A Comprehensive Evaluation of the Good, the Bad and the Ugly" exposed how much published numbers depended on quirks in evaluation protocols and proposed a unified benchmark that is still cited today.
Few-shot learning as a research area really took off after Brenden Lake, Ruslan Salakhutdinov, and Joshua Tenenbaum's 2015 Science paper "Human-level concept learning through probabilistic program induction." The paper introduced the Omniglot dataset of 1,623 handwritten characters from 50 alphabets and showed that a Bayesian program learning model could match human performance on one-shot classification. Omniglot quickly became the "MNIST of few-shot learning" and is still the standard sanity check.
The deep learning community responded with three families of approaches that are now standard reference points.
| Family | Core idea | Representative papers |
|---|---|---|
| Metric-based | Learn an embedding space where same-class examples are close; classify by nearest neighbour or class prototype | Siamese Networks (Koch, Zemel, and Salakhutdinov, 2015); Matching Networks (Vinyals et al., 2016); Prototypical Networks (Snell, Swersky, and Zemel, 2017); Relation Networks (Sung et al., 2018) |
| Optimization-based | Learn an initialization or update rule that adapts quickly to new tasks with few gradient steps | MAML (Finn, Abbeel, and Levine, 2017); Reptile and FOMAML (Nichol, Achiam, and Schulman, 2018); Meta-SGD (Li et al., 2017) |
| Model-based | Use external memory or specialized architectures so the model can store and retrieve task information at inference time | Memory-Augmented Neural Networks (Santoro et al., 2016); SNAIL (Mishra et al., 2018) |
Vinyals et al.'s Matching Networks paper deserves a special mention because it introduced the episodic training paradigm now standard in the field. Instead of training on a fixed set of classes, the model is trained on a stream of small N-way K-shot tasks sampled from a large pool of classes, so that learning to handle new tasks itself becomes the training objective. Matching Networks also introduced mini-ImageNet, the most heavily used few-shot benchmark, derived from ImageNet.
Chelsea Finn, Pieter Abbeel, and Sergey Levine's 2017 ICML paper on MAML is probably the most influential optimization-based result. MAML finds an initialization such that one or a few gradient descent steps on the support set produce good predictions on the query set. Crucially, MAML is model-agnostic: the same idea works for convolutional neural network image classifiers, recurrent neural network sequence models, and reinforcement learning policies. Snell, Swersky, and Zemel's Prototypical Networks (NeurIPS 2017) showed that something much simpler often works comparably well: compute a class prototype as the mean of the embedded support examples, classify a query by Euclidean distance plus a softmax, and stop there.
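The prototypical-network rule is compact enough to show directly. A minimal PyTorch sketch, assuming `embed` is any trained backbone mapping a batch of inputs to embedding vectors:

```python
import torch

def proto_classify(embed, support_x, support_y, query_x, n_way):
    """Prototype = mean embedding of each class's support examples;
    logits = negative squared Euclidean distance to each prototype
    (the classification rule of Snell, Swersky, and Zemel, 2017)."""
    z_support = embed(support_x)                    # (N*K, D)
    z_query = embed(query_x)                        # (Q, D)
    prototypes = torch.stack([
        z_support[support_y == c].mean(dim=0) for c in range(n_way)
    ])                                              # (N, D)
    dists = torch.cdist(z_query, prototypes) ** 2   # (Q, N)
    return (-dists).log_softmax(dim=1)              # log-probability per class
```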
A recurring uncomfortable finding is that strong transfer learning baselines (pretrain a backbone on the base classes, then fit a linear classifier on top of the frozen features for the novel classes) are often competitive with sophisticated meta-learning methods, especially as backbones get larger. Wei-Yu Chen et al.'s ICLR 2019 paper "A Closer Look at Few-Shot Classification" was an early statement of this.
The phrase "few-shot learning" took on a second meaning when Tom Brown and colleagues at OpenAI published "Language Models are Few-Shot Learners" in 2020, the paper that introduced GPT-3 and its 175 billion parameters. Brown et al. evaluated GPT-3 in three modes that map directly onto the terminology you find in modern prompt engineering tutorials.
| Brown et al. setting | Demonstrations | Description |
|---|---|---|
| Zero-shot | 0 | Only a natural-language task description is given (for example: "Translate English to French:"). |
| One-shot | 1 | Task description plus a single English-French pair, then the new English word. |
| Few-shot | Up to ~100 | Task description plus as many demonstrations as fit in the 2,048-token context window. |
The striking result was not just that GPT-3 could do this at all but that performance on many tasks scaled smoothly with both model size and the number of in-context examples, often approaching the accuracy of dedicated fine-tuned models without a single gradient update. This phenomenon is now called in-context learning (ICL), and Brown et al.'s vocabulary of zero-shot, one-shot, and few-shot prompting has become the lingua franca of prompt engineering.
In-context learning is fundamentally different from classical few-shot learning in three ways. First, no weights are updated; the "learning" happens entirely in the forward pass. Second, the demonstrations consume context window space, so there is a hard cap on how many you can use, especially for longer tasks. Third, what is actually being exploited is the model's pretraining: the same model with random weights would learn nothing from a few examples in a prompt. Researchers have argued that ICL is an emergent ability, only appearing reliably above a certain model scale, although whether such emergence is genuine or an artifact of how it is measured is contested.
A related paradigm is chain-of-thought (CoT) prompting. Jason Wei et al.'s 2022 paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" showed that giving an LLM a few demonstrations that include intermediate reasoning steps dramatically boosts performance on multi-step problems; with eight chain-of-thought exemplars, a 540B-parameter PaLM hit then-state-of-the-art on the GSM8K math word problem benchmark. Takeshi Kojima et al.'s NeurIPS 2022 paper "Large Language Models are Zero-Shot Reasoners" went further and showed that simply appending the phrase "Let's think step by step" to a question, with no demonstrations at all, lifted accuracy on MultiArith from 17.7% to 78.7% with text-davinci-002. Zero-shot CoT is now a routine prompt-engineering trick.
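Because zero-shot CoT is pure prompt construction, it fits in a few lines. Below is a simplified sketch of Kojima et al.'s two-stage prompting; the trigger phrase is from the paper, while the function and the answer-extraction wording are illustrative.

```python
from typing import Optional

def zero_shot_cot(question: str, reasoning: Optional[str] = None) -> str:
    """Stage 1: append the reasoning trigger with no demonstrations.
    Stage 2: feed the model's generated reasoning back in and ask for
    the final answer."""
    prompt = f"Q: {question}\nA: Let's think step by step."
    if reasoning is not None:
        prompt += f" {reasoning}\nTherefore, the answer is"
    return prompt
```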
A few worked examples make the differences concrete.
A classifier is trained on hundreds of animal species labelled with a fixed attribute vocabulary ("has stripes," "four legs," "hooved," "lives in Africa"). It learns to predict attributes from images. At test time the system is shown a picture of a zebra, an animal it has never seen labelled. Even so, the model can predict that the image has stripes, four legs, hooves, and an African habitat. Cross-referencing this attribute profile against a database of unseen species (which includes "zebra") returns the correct class. This is the recipe behind Lampert et al.'s 2009 work and the Animals with Attributes dataset.
Alec Radford et al.'s 2021 OpenAI paper "Learning Transferable Visual Models From Natural Language Supervision" trained CLIP on roughly 400 million (image, caption) pairs scraped from the web. To classify a new image into a set of categories, you simply embed the image and embed text prompts like "a photo of a [category]," then pick the category whose text embedding has the highest cosine similarity. CLIP achieved 76.2% top-1 accuracy on ImageNet without ever training on the ImageNet labels, matching the original ResNet-50 and shrinking the robustness gap to natural distribution shifts by up to 75%.
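In code, CLIP-style zero-shot classification looks roughly like the sketch below, using OpenAI's open-source `clip` package; the label set and image filename are invented for the example.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["zebra", "okapi", "rhinoceros"]
texts = clip.tokenize([f"a photo of a {c}" for c in labels]).to(device)
image = preprocess(Image.open("animal.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(texts)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)  # cosine similarity =
    text_feat /= text_feat.norm(dim=-1, keepdim=True)    # normalized dot product
    sims = (image_feat @ text_feat.T).squeeze(0)

print(labels[sims.argmax().item()])  # zero-shot: no training on these labels
```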
Imagine you want a system that can recognize a specific Tibetan character it has never seen before. You show it one example of the character, then ask it to identify which of several test images contains the same character. Classical machine learning would need many copies of the character to learn the relevant features. A Siamese network trained on Omniglot, by contrast, learns to compare pairs of character images and can score the test images by similarity to the single example, often coming close to human accuracy. Lake, Salakhutdinov, and Tenenbaum's 2015 Bayesian program learning model achieved roughly 95% on the 20-way 1-shot Omniglot task, matching human performance.
A fine-grained image classifier has been trained on dozens of fruit species. Given five photos of a new fruit (say, the rambutan) it has never seen, a Prototypical Network embeds the five images, averages their embeddings into a class prototype, and classifies new fruit images by Euclidean distance to that prototype and the prototypes of the other classes. This is a 5-shot learning task, and it works because the embedding space was trained to put similar fruits close together regardless of which classes were used during training.
A prompt to GPT-3 or GPT-4 might look like this:
```
Classify the sentiment of each review as positive or negative.

Review: "The food was cold and the service was terrible."
Sentiment: negative

Review: "Loved every bite. Coming back next week."
Sentiment: positive

Review: "Service was fine but the menu felt overpriced."
Sentiment:
```
Given this three-shot prompt, the model is expected to output "negative" (or possibly "mixed," depending on how strict you want it to be). No weights are updated; the model is just continuing the pattern.
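In practice such a prompt is sent to a model through an API. Here is a sketch using the OpenAI Python SDK; the model name is illustrative, and any instruction-following chat model would behave similarly.

```python
from openai import OpenAI

PROMPT = """Classify the sentiment of each review as positive or negative.

Review: "The food was cold and the service was terrible."
Sentiment: negative

Review: "Loved every bite. Coming back next week."
Sentiment: positive

Review: "Service was fine but the menu felt overpriced."
Sentiment:"""

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; any chat model works
    messages=[{"role": "user", "content": PROMPT}],
    max_tokens=4,
    temperature=0,        # deterministic continuation of the pattern
)
print(response.choices[0].message.content.strip())  # expected: "negative"
```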
Different communities have settled on different evaluation suites.
| Benchmark | Type | Classes | Year | Source |
|---|---|---|---|---|
| Omniglot | Few-shot character recognition | 1,623 | 2015 | Lake, Salakhutdinov, and Tenenbaum |
| mini-ImageNet | Few-shot image classification | 100 (64 train / 16 val / 20 test) | 2016 | Vinyals et al. |
| tieredImageNet | Few-shot image classification (harder splits) | 608 | 2018 | Ren et al. |
| Animals with Attributes 2 | Zero-shot image classification | 50 | 2018 | Xian et al. |
| CUB-200-2011 | Fine-grained zero/few-shot bird classification | 200 | 2011 | Wah et al. |
| Meta-Dataset | Cross-domain few-shot | 10 datasets | 2020 | Triantafillou et al. |
| SuperGLUE / BIG-bench / MMLU | Zero-shot and few-shot language tasks | Varies | 2019 to 2022 | Multiple |
The gap between vision and language benchmarks reflects the gap between the two lineages. Vision benchmarks fix the N-way K-shot protocol and measure accuracy in episodes. Language benchmarks measure zero-shot and few-shot prompted accuracy on a set of NLP tasks; the demonstrations and the test query both live inside a single text prompt.
When working with LLMs, the choice between zero-shot, one-shot, and few-shot is mostly a question of how much you trust the model to infer the task from instructions versus how much you can constrain it by example.
Demonstration selection matters more than novices expect. Prefer examples that are diverse and representative rather than near-duplicates of one another, balance the classes, and avoid long runs of demonstrations with the same label; imbalanced or poorly ordered labels trigger the majority-label and recency biases documented below.
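One cheap heuristic is to shuffle demonstrations within each label and then round-robin across labels, so that no class dominates the end of the prompt. A sketch follows; the function and its name are illustrative rather than taken from any of the cited papers.

```python
import random
from collections import defaultdict

def interleave_demonstrations(demos):
    """Order (input, label) demonstrations so the same label rarely
    repeats consecutively, mitigating majority-label and recency
    biases (a heuristic, not a guarantee)."""
    by_label = defaultdict(list)
    for x, y in demos:
        by_label[y].append((x, y))
    for group in by_label.values():
        random.shuffle(group)
    ordered = []
    while any(by_label.values()):          # round-robin until exhausted
        for y in list(by_label):
            if by_label[y]:
                ordered.append(by_label[y].pop())
    return ordered
```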
Few-shot prompting looks magical until you measure how brittle it is. The literature has documented several biases that consistently hurt accuracy.
| Failure mode | What happens | Source |
|---|---|---|
| Demonstration ordering | Performance can swing from near-state-of-the-art to near-random based on the order of the same examples; a good order for one model often does not transfer to another | Lu et al., "Fantastically Ordered Prompts and Where to Find Them," ACL 2022 |
| Majority-label bias | If most demonstrations belong to one class, the model is biased toward predicting that class | Zhao et al., "Calibrate Before Use," ICML 2021 |
| Recency bias | The model is biased toward classes that appear near the end of the prompt | Zhao et al., 2021 |
| Common token bias | The model prefers answer tokens that are frequent in pretraining data | Zhao et al., 2021 |
| Label correctness matters less than format | Randomly relabeling demonstrations barely hurts accuracy in many settings; the model is mostly using demonstrations to infer label space, input distribution, and output format rather than to learn from the (input, label) mapping | Min et al., "Rethinking the Role of Demonstrations," EMNLP 2022 |
| Generalized zero-shot bias | When test inputs include both seen and unseen classes, models heavily favour seen classes | Xian et al., 2018 |
| Domain shift | Few-shot accuracy collapses when novel classes come from a visibly different domain than base classes (for example, training on natural images and testing on satellite imagery) | Triantafillou et al., "Meta-Dataset," ICLR 2020 |
Zhao et al.'s contextual calibration trick (ask the model what it would predict for a content-free input like "N/A," then subtract that bias) recovered up to 30 percentage points of absolute accuracy on some tasks, which is a strong indicator of how unstable raw few-shot prompting can be.
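The correction itself is one line once you have the model's probabilities over the label tokens. A minimal numpy sketch of the idea, simplified from Zhao et al.'s diagonal-matrix formulation:

```python
import numpy as np

def contextual_calibration(label_probs, content_free_probs):
    """label_probs: model probabilities over label tokens for the real input.
    content_free_probs: same probabilities for a content-free input ("N/A").
    Dividing out the content-free distribution removes the prompt's bias."""
    p_cf = content_free_probs / content_free_probs.sum()
    calibrated = label_probs / p_cf        # equivalent to W = diag(1/p_cf)
    return calibrated / calibrated.sum()   # renormalize to a distribution

# A prompt biased toward "positive" gets corrected:
# contextual_calibration(np.array([0.6, 0.4]), np.array([0.7, 0.3]))
# -> array([0.391..., 0.608...]), now favouring "negative"
```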
One consequence of these failure modes is that small benchmarks of zero-shot or few-shot performance can be misleading. Reported numbers depend on the prompt template, the exact demonstrations chosen, the order, the temperature, and the model version. When comparing methods, it is good practice to report mean and standard deviation across many random seeds and demonstration orderings.
Zero-shot, one-shot, and few-shot learning sit at the intersection of several broader ideas in machine learning. The table below sketches how they relate.
| Related technique | Relationship |
|---|---|
| Transfer learning | The umbrella concept: reuse knowledge from one task or distribution on another. Zero/few-shot learning is transfer with extreme data scarcity at the target. |
| Meta-learning | "Learning to learn." Trains a model across many small tasks so it adapts quickly to a new one. Most pre-2020 few-shot work is meta-learning. |
| In-context learning | The mechanism by which an LLM does few-shot prompting at inference time. No gradient updates. |
| Fine-tuning | Updates model parameters on labelled data. Zero/few-shot prompting is the no-update alternative. |
| Instruction tuning | Fine-tuning a base LLM on instruction-following data so that zero-shot prompting works better. |
| Chain-of-thought prompting | Augments zero/few-shot prompting with intermediate reasoning steps, often boosting accuracy on multi-step problems. |
| Self-supervised pretraining | Provides the broad foundation of features and knowledge that makes zero/few-shot generalization possible. |
| Data augmentation | Synthetically expands a few labelled examples; complementary to few-shot learning rather than a substitute. |
The zero/few-shot vocabulary now extends well outside its original homes of image classification and text NLP.
Imagine you have never seen a duck-billed platypus before. If a friend says, "It's a small swimming animal with fur, a beak like a duck, and a flat tail," and you spot one in the wild, you can probably point and say "that's a platypus." That is zero-shot learning: you used a description, not examples. If your friend instead shows you one photo of a platypus, that is one-shot learning. If they show you five photos taken from different angles, that is few-shot learning. Computers do something similar: zero-shot models use general descriptions or knowledge, one-shot models use a single example, and few-shot models use a handful. The whole point is to learn quickly when you do not have thousands of training pictures lying around.