See also: machine learning terms, one-shot learning, zero-shot learning, meta-learning
Few-shot learning is a subfield of machine learning focused on training models to perform tasks or make accurate predictions using only a very small number of labeled examples, typically between one and five per class. In contrast to conventional supervised learning, which often requires thousands or millions of labeled samples to reach acceptable performance, few-shot learning aims to generalize effectively from minimal data. The problem setting is motivated by the observation that humans can learn new concepts from just a handful of examples. A child who sees two or three pictures of a giraffe can reliably identify giraffes in the future, yet standard deep learning models trained from scratch on so few samples would fail dramatically due to overfitting.
The roots of few-shot learning trace back to early work on learning with limited data, but the field gained significant momentum in the mid-2010s as researchers developed meta-learning and metric learning approaches specifically designed for low-data regimes. The publication of Matching Networks by Vinyals et al. (2016), Prototypical Networks by Snell et al. (2017), and Model-Agnostic Meta-Learning (MAML) by Finn et al. (2017) established foundational methods that remain influential. More recently, the emergence of large language models such as GPT-3 demonstrated that few-shot learning can also be achieved through in-context learning, where a pre-trained model is given a few demonstration examples within its input prompt and generalizes to new instances without any gradient updates.
Few-shot learning has practical significance in domains where labeled data is scarce, expensive to acquire, or inherently rare. These include medical imaging for rare diseases, drug discovery for novel molecular targets, robotics in unstructured environments, personalized recommendation systems, and natural language processing tasks with limited labeled corpora.
The standard few-shot classification problem is formulated as an N-way K-shot task. In this setup, the model must classify query examples into one of N classes, with only K labeled examples (called the support set) available for each class. For example, a 5-way 1-shot task requires classifying an input into one of five classes when only a single labeled example per class is available.
More formally, a few-shot learning episode consists of:
| Component | Definition |
|---|---|
| Support set | A small set of labeled examples, containing K samples for each of N classes (total of N x K examples) |
| Query set | A set of unlabeled examples that the model must classify using information from the support set |
| N (ways) | The number of distinct classes in the episode |
| K (shots) | The number of labeled examples per class in the support set |
Common experimental configurations include 5-way 1-shot and 5-way 5-shot settings. The model is trained across many such episodes (a procedure called episodic training), where each episode samples different classes and examples. This training strategy forces the model to learn a general ability to classify from few examples, rather than memorizing specific classes.
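Concretely, episodic sampling amounts to drawing N classes and then splitting each class's drawn examples into support and query portions. The following is a minimal sketch, assuming the dataset is a dictionary mapping class labels to lists of examples (a hypothetical format; actual benchmarks ship with their own samplers):

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15):
    """Sample a single N-way K-shot episode.

    `dataset` is assumed to map class labels to lists of examples.
    Classes are relabeled 0..N-1 within the episode, as is standard,
    so the model cannot memorize global class identities.
    """
    classes = random.sample(list(dataset.keys()), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        examples = random.sample(dataset[cls], k_shot + n_query)
        # First K examples form the support set, the rest the query set.
        support += [(x, episode_label) for x in examples[:k_shot]]
        query += [(x, episode_label) for x in examples[k_shot:]]
    return support, query
```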
Few-shot learning exists on a spectrum with zero-shot and one-shot learning. The differences are defined by how many labeled examples are available at inference time.
| Setting | Labeled examples per class | Typical mechanism |
|---|---|---|
| Zero-shot | 0 | Relies on class descriptions, attributes, or semantic embeddings to recognize unseen classes without any examples |
| One-shot | 1 | A single example per class; the model must generalize from that lone instance |
| Few-shot | 2 to 5 (sometimes up to ~20) | A small handful of examples per class; the most commonly studied setting |
| Standard supervised | Hundreds to millions | Conventional training with large labeled datasets |
One-shot learning is a special case of few-shot learning where K = 1. Zero-shot learning does not use any labeled examples from the target classes at all and instead relies on auxiliary information such as class attribute vectors or natural language descriptions to bridge the gap between seen and unseen categories.
Metric learning approaches learn an embedding space where examples from the same class are close together and examples from different classes are far apart. At inference time, a query example is classified by finding the nearest class representation in this learned space. Metric learning methods are among the most popular approaches to few-shot learning because of their simplicity and strong empirical performance.
Siamese networks. Introduced by Koch et al. (2015) for one-shot image recognition, Siamese networks consist of two identical sub-networks with shared weights that process a pair of inputs and output a similarity score. The network learns to determine whether two inputs belong to the same class. At inference time, the query image is compared pairwise against all examples in the support set, and the class of the most similar support example is assigned. Siamese networks laid the groundwork for subsequent metric learning methods in few-shot learning.
Matching networks. Proposed by Vinyals et al. (2016), matching networks introduced the episodic training paradigm that became standard in the field. The model maps both support and query examples into an embedding space using a learned encoder, then classifies query examples using a weighted nearest-neighbor approach with attention over the support set. Matching networks improved one-shot accuracy on Omniglot from 88.0% to 93.8% and on ImageNet from 87.6% to 93.2% compared to competing approaches, substantial gains over prior work.
Prototypical networks. Developed by Snell et al. (2017), prototypical networks simplify the matching process by computing a single prototype (the mean embedding) for each class from its support examples. Query examples are classified based on their Euclidean distance to each class prototype. Despite their simplicity, prototypical networks achieved competitive or superior results compared to more complex approaches. The authors also showed that the method extends naturally to zero-shot learning when prototypes are derived from class metadata instead of examples.
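The core of the method is just a mean and a distance computation. Below is a minimal NumPy sketch, assuming an embedding network has already mapped support and query examples to feature vectors; during actual training, the negative distances are passed through a softmax and optimized with cross-entropy loss.

```python
import numpy as np

def prototypical_classify(support_emb, support_labels, query_emb, n_way):
    """Nearest-prototype classification in embedding space.

    support_emb:    (N*K, D) embedded support examples
    support_labels: (N*K,) integer labels in [0, n_way)
    query_emb:      (Q, D) embedded query examples
    """
    # Prototype = mean embedding of each class's support examples.
    prototypes = np.stack([
        support_emb[support_labels == c].mean(axis=0) for c in range(n_way)
    ])                                                        # (N, D)
    # Squared Euclidean distance from every query to every prototype.
    dists = ((query_emb[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # predicted class per query
```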
Relation networks. Proposed by Sung et al. (2018), relation networks replace the fixed distance metric used in prototypical networks with a learned distance function. A relation module (a small neural network) takes as input the concatenation of query and support embeddings and outputs a relation score. This allows the model to learn task-specific notions of similarity rather than relying on a predefined metric like Euclidean distance.
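A minimal PyTorch sketch of the idea follows. Note that the original paper's relation module operates on concatenated convolutional feature maps; the fully connected version here, with illustrative layer sizes, conveys only the structure of the learned similarity function.

```python
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    """A small learned similarity function in the spirit of Sung et al.
    (2018); layer sizes are illustrative, not the paper's."""
    def __init__(self, emb_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden),  # takes a concatenated pair
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),                    # relation score in (0, 1)
        )

    def forward(self, query_emb, class_emb):
        # Concatenate each query embedding with a candidate class embedding.
        pair = torch.cat([query_emb, class_emb], dim=-1)
        return self.net(pair).squeeze(-1)
```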
| Method | Year | Distance function | Key idea |
|---|---|---|---|
| Siamese networks | 2015 | Learned (binary) | Pairwise similarity between input pairs |
| Matching networks | 2016 | Cosine with attention | Episodic training; weighted nearest-neighbor classification |
| Prototypical networks | 2017 | Euclidean to class mean | Class prototypes as mean embeddings; simple and effective |
| Relation networks | 2018 | Learned (neural network) | End-to-end learned distance function |
Meta-learning approaches train models across a distribution of tasks so that they can adapt quickly to new tasks with minimal data. Rather than learning to solve a specific task, meta-learning algorithms learn the learning procedure itself, which is why meta-learning is often described as "learning to learn".
Model-Agnostic Meta-Learning (MAML). Introduced by Finn et al. (2017), MAML is an optimization-based meta-learning algorithm that learns an initialization of model parameters from which the model can be fine-tuned to a new task with just a few gradient steps on a small support set. The key insight is that MAML does not learn a fixed classifier but rather finds a point in parameter space that is maximally sensitive to task-specific information. MAML is model-agnostic, meaning it can be applied to any model trained with gradient descent, including classifiers, regressors, and reinforcement learning policies. On 5-way 1-shot miniImageNet classification, MAML achieved 48.70% accuracy, a strong result at the time of publication.
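The following sketch compresses MAML's two-level optimization into a single meta-update with one inner gradient step per task. It assumes PyTorch's `torch.func.functional_call` (available in recent PyTorch releases); the function names and hyperparameters are illustrative, and practical implementations typically take several inner steps.

```python
import torch
from torch.func import functional_call

def maml_meta_step(model, tasks, loss_fn, meta_opt, inner_lr=0.01):
    """One MAML meta-update over a batch of tasks.

    Each task is a (support_x, support_y, query_x, query_y) tuple.
    Minimal sketch: a single inner gradient step per task.
    """
    meta_opt.zero_grad()
    params = dict(model.named_parameters())
    meta_loss = 0.0
    for sx, sy, qx, qy in tasks:
        # Inner loop: adapt to the task with one step on the support set.
        loss = loss_fn(functional_call(model, params, (sx,)), sy)
        grads = torch.autograd.grad(loss, list(params.values()),
                                    create_graph=True)
        adapted = {n: p - inner_lr * g
                   for (n, p), g in zip(params.items(), grads)}
        # Outer loop: evaluate the adapted parameters on the query set.
        meta_loss = meta_loss + loss_fn(functional_call(model, adapted, (qx,)), qy)
    # Backpropagate through the inner update (second-order gradients).
    (meta_loss / len(tasks)).backward()
    meta_opt.step()
```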
Reptile. Developed by Nichol and Schulman at OpenAI (2018), Reptile is a first-order meta-learning algorithm that simplifies MAML by avoiding the computation of second-order gradients. Reptile works by repeatedly sampling a task, performing several steps of stochastic gradient descent on that task, and then moving the initialization parameters toward the resulting task-specific parameters. Despite its simplicity, Reptile achieves performance comparable to first-order MAML on benchmarks like Omniglot and miniImageNet while being easier to implement and computationally more efficient.
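Reptile's update rule is simple enough to sketch directly: adapt a copy of the model to a sampled task with ordinary SGD, then interpolate the initialization toward the adapted weights. The hyperparameters below are illustrative.

```python
import copy
import torch

def reptile_step(model, task_loader, loss_fn,
                 inner_lr=0.01, meta_lr=0.1, inner_steps=5):
    """One Reptile meta-update on a single sampled task."""
    task_model = copy.deepcopy(model)
    opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        x, y = next(iter(task_loader))   # sample a batch from the task
        opt.zero_grad()
        loss_fn(task_model(x), y).backward()
        opt.step()
    # Interpolate: theta <- theta + meta_lr * (phi - theta),
    # where phi are the task-adapted parameters.
    with torch.no_grad():
        for p, p_task in zip(model.parameters(), task_model.parameters()):
            p.add_(meta_lr * (p_task - p))
```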
Task-agnostic representations. More recent meta-learning research has explored learning universal feature representations that transfer well across tasks. Work on the Meta-Dataset benchmark (Triantafillou et al., 2020) trains on multiple diverse datasets simultaneously, learning representations that generalize to entirely new domains with few examples.
Memory-augmented neural networks (MANNs) incorporate an external memory module that stores and retrieves information from past examples. The Neural Turing Machine (NTM) and the Differentiable Neural Computer (DNC) are foundational architectures in this category. By writing support examples to memory and reading from memory when classifying query examples, MANNs can effectively leverage small support sets. Santoro et al. (2016) demonstrated that MANNs with an explicit memory buffer can rapidly assimilate new classes, achieving strong performance on Omniglot with a single example per class.
Transfer learning provides a practical and increasingly popular approach to few-shot learning. The strategy involves two stages: first, pre-train a model on a large dataset with abundant labels (the base classes), and then fine-tune the model on the target task using only a few labeled examples from new classes. Pre-trained models capture general-purpose features (such as edge detectors, texture patterns, and semantic representations) that transfer well to new tasks. Research by Chen et al. (2019) showed that simple fine-tuning of a pre-trained feature extractor can match or outperform many meta-learning methods when the backbone network is sufficiently powerful, challenging the assumption that specialized meta-learning algorithms are always necessary.
With the advent of large-scale pre-trained models like CLIP (Radford et al., 2021) and foundation models, transfer-based few-shot learning has become even more effective. These models, trained on billions of image-text pairs, develop rich feature spaces that enable strong few-shot performance through simple linear probing or lightweight adaptation.
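A minimal sketch of the linear-probing recipe, assuming a frozen feature extractor `embed` (a hypothetical function mapping raw inputs to feature vectors) and scikit-learn's logistic regression, might look as follows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(embed, support_x, support_y, query_x):
    """Few-shot classification by linear probing frozen features.

    `embed` is any function from raw inputs to feature vectors,
    e.g. a frozen pre-trained backbone; it is assumed, not prescribed.
    """
    # Fit a linear classifier on the few embedded support examples.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.stack([embed(x) for x in support_x]), support_y)
    # Predict labels for the embedded query examples.
    return clf.predict(np.stack([embed(x) for x in query_x]))
```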
The release of GPT-3 (Brown et al., 2020), a 175-billion-parameter autoregressive language model, introduced a fundamentally different paradigm for few-shot learning: in-context learning. Rather than updating model parameters, in-context learning provides a few input-output demonstration examples directly in the model's text prompt. The model then generates a prediction for a new query by pattern-matching against these demonstrations, without any gradient updates or fine-tuning.
Brown et al. evaluated GPT-3 under three settings:
| Setting | Description | Example |
|---|---|---|
| Zero-shot | Only a task instruction is provided | "Translate English to French: cheese ->" |
| One-shot | A single input-output pair is provided before the query | "sea otter -> loutre de mer, cheese ->" |
| Few-shot | Multiple input-output pairs (typically 10 to 100) are provided in the prompt | Several translation pairs followed by the query |
GPT-3's few-shot performance approached or matched fine-tuned models on a range of NLP benchmarks, including translation, question answering, and cloze tasks, without modifying any parameters. This demonstrated that scale alone (in terms of model parameters and training data) can give rise to strong few-shot capabilities. The finding has been replicated and extended by subsequent models including PaLM, LLaMA, and GPT-4.
In-context learning differs from traditional few-shot learning in several important ways. It requires no episodic training, no learned metric space, and no parameter updates at test time. Instead, it leverages the vast knowledge encoded during pre-training and the model's ability to recognize patterns within its context window.
The success of in-context learning has spawned the field of prompt engineering, where practitioners carefully design prompts to maximize a language model's few-shot performance. Key techniques include careful selection and ordering of demonstration examples, explicit task instructions, consistent output formatting, and chain-of-thought prompting, in which demonstrations include intermediate reasoning steps.
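Mechanically, constructing a few-shot prompt amounts to string concatenation. The sketch below mirrors the translation examples in the table above; the `->` separator and the formatting conventions are illustrative, since the best format varies by model.

```python
def build_few_shot_prompt(instruction, demonstrations, query):
    """Assemble a few-shot prompt for in-context learning.

    `demonstrations` is a list of (input, output) pairs.
    """
    lines = [instruction]
    for x, y in demonstrations:
        lines.append(f"{x} -> {y}")
    lines.append(f"{query} ->")   # the model completes this final line
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
# Translate English to French:
# sea otter -> loutre de mer
# peppermint -> menthe poivrée
# cheese ->
```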
Data augmentation is a natural strategy for mitigating the scarcity of labeled data in few-shot learning. Because the support set is so small, even modest augmentation can meaningfully expand the effective training set.
Standard augmentation techniques (random cropping, flipping, color jittering, rotation) are directly applicable and can improve few-shot classification accuracy by increasing the diversity of support examples. More advanced strategies have been developed specifically for few-shot settings:
Feature hallucination. Rather than augmenting raw images, hallucination methods synthesize new examples in the feature space. Adversarial Feature Hallucination Networks (AFHN) use conditional generative adversarial networks to generate diverse, discriminative feature vectors for novel classes. Saliency-guided hallucination (Zhang et al., 2019) combines foregrounds and backgrounds from different images to create plausible new training samples.
Cross-class transfer. Some methods generate augmented features for novel classes by transferring intra-class variation patterns observed in base classes with abundant data to novel classes with few examples.
Text-based augmentation for NLP. In natural language processing, techniques such as back-translation, synonym replacement, and paraphrase generation can expand few-shot training sets. More recently, large language models themselves have been used to generate synthetic training examples for few-shot text classification.
A critical consideration in few-shot augmentation is maintaining both diversity and discriminability. Augmented samples that are too similar to existing ones provide little new information, while samples that deviate too far from the true class distribution can degrade the classifier's decision boundary.
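As an illustration of the standard techniques described above, the following sketch uses torchvision transforms to expand a small image support set with augmented copies; the specific transforms and magnitudes are illustrative choices, not a prescription.

```python
from torchvision import transforms

# Standard augmentations; 84x84 is the usual miniImageNet resolution.
augment = transforms.Compose([
    transforms.RandomResizedCrop(84, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
])

def expand_support_set(support, n_copies=4):
    """Return the original (image, label) pairs plus augmented copies."""
    return list(support) + [
        (augment(img), label)
        for img, label in support
        for _ in range(n_copies)
    ]
```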
The few-shot learning community relies on several standard benchmarks to evaluate and compare methods. These datasets are designed to test a model's ability to generalize to novel classes from limited examples.
| Benchmark | Domain | Classes | Samples per class | Introduced by | Notes |
|---|---|---|---|---|---|
| Omniglot | Handwritten characters | 1,623 characters from 50 alphabets | 20 drawings per character | Lake et al., 2015 | Often called the "MNIST of few-shot learning"; tests character recognition across diverse writing systems |
| miniImageNet | Natural images | 100 classes (64 train / 16 val / 20 test) | 600 images per class | Vinyals et al., 2016; Ravi & Larochelle, 2017 | Subset of ImageNet; the most widely used benchmark for few-shot image classification |
| tieredImageNet | Natural images | 608 classes grouped into 34 high-level categories | ~1,300 images per class | Ren et al., 2018 | Larger-scale benchmark where train, validation, and test splits use non-overlapping high-level categories, testing broader generalization |
| CIFAR-FS | Natural images | 100 classes (64 / 16 / 20 split) | 600 images per class | Bertinetto et al., 2019 | Derived from CIFAR-100; lower resolution (32x32) than miniImageNet |
| Meta-Dataset | Multiple domains | Varies by dataset | Varies | Triantafillou et al., 2020 | Aggregates 10 image datasets to test cross-domain generalization |
| FewRel | Relation classification (NLP) | 100 relations | 700 sentences per relation | Han et al., 2018 | Few-shot relation extraction benchmark for NLP |
Omniglot was one of the earliest benchmarks and served as a proving ground for initial few-shot methods. However, its relative simplicity (small grayscale images of characters) led the community to adopt miniImageNet as the primary benchmark, as it presents a more challenging visual recognition task. tieredImageNet further increases difficulty by ensuring that the semantic categories used during training and testing do not overlap at a high taxonomic level, requiring models to generalize across broader conceptual boundaries.
Few-shot learning has found applications across a wide range of domains where labeled data is inherently limited.
Few-shot image classification is the most extensively studied application. Beyond classification, few-shot object detection (FSOD) extends the paradigm to localization tasks, where the model must detect and localize objects from novel categories given only a few annotated bounding boxes. Recent FSOD methods leverage meta-learning, feature reweighting, and attention mechanisms to adapt region proposal networks to new classes. Foundation models like DINOv2 and vision-language models have further advanced few-shot detection capabilities.
Few-shot semantic segmentation addresses the task of pixel-level labeling for new classes with minimal annotations, which is particularly valuable in medical imaging where expert annotations are expensive.
Medical imaging is a natural domain for few-shot learning because many conditions are rare, and obtaining expert-annotated images is costly and time-consuming. Few-shot approaches have been applied to classification of skin lesions, detection of tumors in radiology scans, and diagnosis of retinal diseases from fundus images.
The SHEPHERD system (Agrawal et al., 2024) applies few-shot learning to rare genetic disease diagnosis by performing deep learning over a knowledge graph enriched with rare disease information. In evaluation, SHEPHERD ranked the correct causal gene among its top five predictions for 77.8% of hard-to-diagnose patients, improving diagnostic efficiency by at least twofold compared to unguided approaches.
In drug discovery, obtaining biological assay data for novel molecular targets is expensive and slow. Few-shot learning addresses this by enabling models to predict molecular activity from just a handful of confirmed active compounds. Stanley et al. (2022) applied prototypical networks and relation networks to ligand-based virtual screening, demonstrating that metric-based meta-learning can effectively identify hit compounds for new targets using as few as 5 to 16 active examples.
Few-shot learning has significant applications in NLP, especially for tasks where labeled data is scarce or domain-specific.
Text classification. Few-shot text classification is critical for applications like sentiment analysis in niche domains, intent detection for new conversational topics, and content moderation for emerging categories. Both meta-learning approaches (using prototypical networks over sentence embeddings) and in-context learning with large language models have shown strong results.
Named entity recognition (NER). Few-shot NER is the task of identifying named entities in text when only a small number of labeled examples are available for each entity type. This is especially relevant for specialized domains (biomedical, legal, financial) where entity types are domain-specific and annotation requires expert knowledge. Approaches include metric learning applied at the token level, prompt-based methods using language models, and hybrid systems that combine LLM predictions with metric learning models for robust low-resource entity extraction.
Relation extraction. The FewRel benchmark (Han et al., 2018) specifically targets few-shot relation classification, where models must determine the semantic relationship between entities in a sentence given only a few labeled examples of each relation type.
Robots operating in unstructured real-world environments frequently encounter objects, tasks, and scenarios not seen during training. Few-shot learning enables robotic systems to adapt to new manipulation tasks, recognize novel objects, and learn new skills from a small number of human demonstrations. Meta-learning methods like MAML have been applied to robotic manipulation, where a robot learns a policy initialization that can be quickly fine-tuned to new tasks with just a few trial-and-error episodes.
Despite significant progress, few-shot learning faces several persistent challenges.
Overfitting. With extremely limited training data, models are highly susceptible to overfitting to the few available support examples. Regularization techniques, data augmentation, and episodic training mitigate this risk, but overfitting remains the fundamental challenge of the few-shot regime.
Task distribution mismatch. Meta-learning methods assume that the tasks encountered during meta-training are drawn from a similar distribution as the tasks encountered at test time. When this assumption is violated (for example, training on natural image tasks but testing on medical imaging tasks), performance can degrade substantially. Domain adaptation and cross-domain few-shot learning are active research areas addressing this gap.
Evaluation protocol variability. Different papers use different evaluation protocols, making it difficult to compare results fairly. Variations include the number of episodes used for evaluation, the specific class splits, the backbone architectures used, and whether or not pre-training on additional data is employed. Efforts like the Meta-Dataset benchmark and unified evaluation frameworks aim to standardize evaluation.
Sensitivity to example selection. In both episodic few-shot learning and in-context learning with LLMs, the specific examples chosen for the support set or prompt can dramatically affect performance. Research has shown that different example orderings in few-shot prompts can cause accuracy to vary from near-random to near-optimal, highlighting the fragility of few-shot approaches.
Scalability to complex tasks. While few-shot learning works well for simple classification tasks, scaling to more complex outputs (such as structured prediction, generation, or multi-step reasoning) remains an open challenge. In-context learning with large language models has partially addressed this for text-based tasks, but vision and multimodal domains still lag behind.
Computational cost of meta-learning. Optimization-based meta-learning methods like MAML require computing gradients through the inner optimization loop, which involves second-order derivatives and significant memory overhead. Although first-order approximations like Reptile and first-order MAML reduce this cost, meta-learning still typically requires training across thousands of episodes, which can be computationally expensive.
Imagine someone shows you a picture of a strange animal you have never seen before, like an okapi. They tell you, "This is an okapi." Then they mix that picture in with photos of zebras and giraffes, and ask you to find the okapi. You could probably do it, even though you only saw one picture of an okapi. Your brain is really good at learning new things from just one or two examples.
Computers usually are not that good at this. Most of the time, you have to show a computer thousands of pictures of cats before it learns what a cat looks like. Few-shot learning is a special way of training computers so they can do what you do naturally: learn to recognize something new after seeing only a few examples. It is like teaching the computer to be a fast learner, instead of a slow one that needs to study the same thing over and over.