One-shot learning is a machine learning approach in which a model learns to recognize or classify new categories from only a single labeled example per class. Unlike conventional supervised learning methods that typically require hundreds or thousands of labeled training examples, one-shot learning systems aim to replicate the human ability to generalize from very limited experience. The term was introduced by Li Fei-Fei, Rob Fergus, and Pietro Perona in their 2006 paper on object category recognition, where they demonstrated a Bayesian framework for learning visual categories from one or a few examples by leveraging prior knowledge from previously learned categories.
One-shot learning occupies a specific position within the broader spectrum of N-shot learning paradigms. Zero-shot learning requires no examples of target classes at all, instead relying on auxiliary information such as semantic attributes or textual descriptions. Few-shot learning generalizes to settings with a small number of examples (typically 2 to 20 per class). One-shot learning is the special case where exactly one example is available per novel class, making it one of the most challenging data-efficient learning problems.
Imagine someone shows you a single photo of a platypus. You have never seen one before. But from that one picture, you notice key features: a duck-like bill, a beaver tail, and brown fur. Later at the zoo, you spot a platypus right away because you remember those features from the one photo. That is basically what one-shot learning does for computers. It teaches a computer to recognize something new after seeing it just one time, by paying attention to the features that make it different from everything else.
Traditional deep learning models, particularly convolutional neural networks for image classification, achieve high accuracy only when trained on large labeled datasets. ImageNet, for example, contains over 14 million labeled images across more than 20,000 categories. Collecting and annotating datasets of this scale is time-consuming, expensive, and sometimes impossible for specialized domains.
Several real-world scenarios make large-scale data collection impractical:

- Rare events and conditions: rare diseases, unusual manufacturing defects, or endangered species provide few examples by their very nature.
- Expensive annotation: labeling may require scarce domain experts, as in medical imaging or drug discovery.
- Privacy constraints: data in domains such as healthcare and biometrics often cannot be collected or shared freely.
- Continuously emerging classes: applications such as face-recognition enrollment or new-product cataloging must handle classes that appear one example at a time.
Humans, by contrast, can learn to identify a new animal, recognize a new friend's face, or understand a new word after encountering it just once or twice. This cognitive ability inspired researchers to develop machine learning methods that can similarly generalize from minimal data.
One-shot learning is most commonly formulated as an N-way K-shot classification problem. In this setting:

- The support set contains N classes with K labeled examples each (N·K examples in total).
- The query set contains unlabeled examples drawn from the same N classes, which the model must classify.
- One-shot learning is the case K = 1: a single labeled example per class.
This formulation is distinct from standard classification because the classes in the support set are typically unseen during training. The model must generalize to new classes at test time without any fine-tuning.
Most modern one-shot learning methods use episodic training (also called meta-training). During each training iteration, the algorithm samples a random N-way K-shot task (called an episode) from the training set:

- Sample N classes at random from the set of training classes.
- Sample K support examples and a set of query examples from each of those classes.
- Predict labels for the query examples conditioned on the support set.
- Compute the loss on the query predictions and update the model parameters.
By training on thousands of such episodes, the model learns a general strategy for classifying novel classes from few examples, rather than memorizing specific categories.
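Episode construction can be sketched in a few lines. The snippet below assumes, purely for illustration, that the dataset is a dict mapping each class label to a list of examples; this structure and the function name `sample_episode` are not a standard API.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, n_query=5):
    """Sample one N-way K-shot episode from a class -> examples mapping."""
    classes = random.sample(sorted(dataset), n_way)
    support, query = [], []
    # Relabel the sampled classes 0..N-1 within the episode.
    for episode_label, cls in enumerate(classes):
        examples = random.sample(dataset[cls], k_shot + n_query)
        support += [(x, episode_label) for x in examples[:k_shot]]
        query += [(x, episode_label) for x in examples[k_shot:]]
    return support, query

# Toy dataset: 10 classes with 20 examples each.
data = {c: [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(data, n_way=5, k_shot=1, n_query=5)
```

For a 5-way 1-shot episode with 5 queries per class, this yields 5 support pairs and 25 query pairs, each labeled with an episode-local class index.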
Several standard benchmarks are used to evaluate one-shot learning methods:
| Dataset | Domain | Classes | Images per class | Typical evaluation |
|---|---|---|---|---|
| Omniglot | Handwritten characters | 1,623 characters from 50 alphabets | 20 | 5-way and 20-way 1-shot |
| miniImageNet | Natural images | 100 (64 train / 16 val / 20 test) | 600 | 5-way 1-shot and 5-shot |
| tieredImageNet | Natural images | 608 (351 train / 97 val / 160 test) | ~1,200 | 5-way 1-shot and 5-shot |
| CUB-200-2011 | Fine-grained birds | 200 | ~60 | 5-way 1-shot and 5-shot |
| CIFAR-FS | Natural images | 100 (64 train / 16 val / 20 test) | 600 | 5-way 1-shot and 5-shot |
Omniglot, sometimes called the "transpose of MNIST" because it contains many classes with only a few examples each (where MNIST has few classes with many examples), was introduced by Brenden Lake and colleagues in 2015 and is now considered a relatively easy benchmark, with modern methods approaching ceiling performance. miniImageNet, proposed by Vinyals et al. (2016) and later standardized by Ravi and Larochelle (2017), has become the most widely used benchmark for comparing few-shot learning methods.
Research on one-shot learning can be broadly organized into four families of approaches: metric-based methods, optimization-based (meta-learning) methods, memory-augmented methods, and data augmentation-based methods.
Metric-based approaches learn an embedding space where examples from the same class are close together and examples from different classes are far apart. Classification is then performed by comparing the distances between query embeddings and support set embeddings.
Siamese networks, applied to one-shot image recognition by Koch, Zemel, and Salakhutdinov (2015), consist of two identical neural network branches that share the same weights. Each branch processes one input image and produces a feature vector. The network then computes a distance between the two feature vectors (typically using a weighted L1 distance followed by a sigmoid function) to output a similarity score.
During training, the network is presented with pairs of images that are either from the same class (positive pairs) or different classes (negative pairs). It learns to output high similarity for same-class pairs and low similarity for different-class pairs. At test time, a query image is compared against each example in the support set, and the class of the most similar support example is assigned.
Strengths: Siamese networks are conceptually simple and can generalize to new classes without retraining. Once the embedding function is learned, new classes can be added by providing a single reference image.
Limitations: Pairwise comparison scales quadratically with the number of support examples and classes, and the fixed distance function may not capture complex class boundaries.
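The test-time procedure can be illustrated with a minimal NumPy sketch. It assumes the embeddings have already been produced by the trained branches; the per-dimension `weights` vector stands in for the learned L1 weighting, and the function names are illustrative, not from the original paper.

```python
import numpy as np

def siamese_score(f1, f2, weights):
    """Similarity in (0, 1): sigmoid of a weighted L1 distance between
    two embeddings. Identical embeddings score 0.5; larger distances
    (with positive weights) push the score toward 0."""
    d = np.dot(weights, np.abs(f1 - f2))
    return 1.0 / (1.0 + np.exp(d))

def one_shot_classify(query_emb, support_embs, support_labels, weights):
    """Assign the label of the most similar support embedding."""
    scores = [siamese_score(query_emb, s, weights) for s in support_embs]
    return support_labels[int(np.argmax(scores))]
```

Because each query is scored against every support example independently, adding a new class at test time only requires adding its reference embedding to the support set.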
Matching Networks, introduced by Vinyals, Blundell, Lillicrap, Kavukcuoglu, and Wierstra (2016), extend the Siamese approach by framing one-shot learning as a differentiable nearest-neighbor problem. The model maps a support set and a query image to a predicted label using an attention mechanism over the support set embeddings.
A key innovation of Matching Networks is the use of Full Context Embeddings (FCE), where the embedding of each support example is conditioned on the entire support set through a bidirectional LSTM. This allows the model to produce embeddings that are adapted to the specific task at hand. Matching Networks improved one-shot accuracy on ImageNet from 87.6% to 93.2%, and on Omniglot from 88.0% to 93.8%, relative to competing approaches.
Prototypical Networks, proposed by Snell, Swersky, and Zemel (2017), simplify the metric-based approach by computing a single prototype representation for each class. The prototype is defined as the mean of the embedded support examples for that class. Classification is performed by computing the distance from a query embedding to each class prototype and assigning the nearest class.
Formally, given an embedding function f (typically a CNN), the prototype for class c is the mean of the embedded support examples:

p_c = (1 / |S_c|) Σ_{x ∈ S_c} f(x)

where S_c is the set of support examples for class c. In the one-shot case, each prototype reduces to the embedding of the single support example.
Prototypical Networks use squared Euclidean distance as the distance metric. The authors showed that this choice is equivalent to a linear classifier in the embedding space, providing a simple inductive bias well suited to the limited-data regime. The method achieves strong results while being simpler than Matching Networks and faster to train.
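The prototype computation and nearest-prototype classification are simple enough to sketch directly in NumPy, assuming the embeddings have already been produced by a trained backbone (the function names here are illustrative):

```python
import numpy as np

def prototypes(support_embs, support_labels):
    """Class prototypes: the mean of the embedded support examples per class."""
    classes = sorted(set(support_labels))
    labels = np.array(support_labels)
    return classes, np.stack(
        [support_embs[labels == c].mean(axis=0) for c in classes]
    )

def nearest_prototype(query_emb, protos):
    """Index of the nearest prototype under squared Euclidean distance."""
    return int(np.argmin(((protos - query_emb) ** 2).sum(axis=1)))

# Toy 2-way 2-shot example with 2-D "embeddings".
embs = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [9.0, 10.0]])
labels = [0, 0, 1, 1]
classes, protos = prototypes(embs, labels)   # protos: [[0, 0.5], [9.5, 10]]
pred = classes[nearest_prototype(np.array([1.0, 1.0]), protos)]  # -> 0
```

With K = 1, `prototypes` simply returns the support embeddings themselves, so one-shot classification degenerates to nearest-neighbor search in the embedding space.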
Relation Networks, introduced by Sung, Yang, Zhang, Xiang, Torr, and Hospedales (2018), replace the fixed distance function with a learned relation module. Instead of computing Euclidean or cosine distance between embeddings, the model concatenates the query and support embeddings and passes them through a small neural network that outputs a relation score between 0 and 1.
This approach allows the model to learn a non-linear similarity function that may capture more complex relationships than a fixed metric. Relation Networks were evaluated on both Omniglot and miniImageNet and showed competitive performance with other metric-based methods.
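The relation module itself is just a small network applied to concatenated embeddings. The sketch below uses randomly initialized weights to stand in for trained parameters, so the particular scores are meaningless; it only shows the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

def relation_score(query_emb, support_emb, W1, b1, W2, b2):
    """Tiny relation module: concatenate the two embeddings, pass them
    through one ReLU hidden layer, and squash to a score in (0, 1)."""
    z = np.concatenate([query_emb, support_emb])
    h = np.maximum(0.0, W1 @ z + b1)
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))

# Randomly initialized weights stand in for trained parameters (assumption).
dim, hidden = 4, 8
W1 = rng.normal(size=(hidden, 2 * dim)); b1 = np.zeros(hidden)
W2 = rng.normal(size=hidden); b2 = 0.0
score = relation_score(rng.normal(size=dim), rng.normal(size=dim), W1, b1, W2, b2)
```

Because the similarity function is itself trained, its decision surface is not constrained to the spherical contours implied by Euclidean or cosine distance.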
| Method | Year | Distance function | Key idea | 5-way 1-shot miniImageNet accuracy |
|---|---|---|---|---|
| Siamese Networks | 2015 | Weighted L1 + sigmoid | Pairwise similarity learning | N/A (evaluated on Omniglot) |
| Matching Networks | 2016 | Cosine + attention | Differentiable nearest neighbor with FCE | 43.56% |
| Prototypical Networks | 2017 | Squared Euclidean | Class prototypes as mean embeddings | 49.42% |
| Relation Networks | 2018 | Learned neural network | Non-linear learned similarity | 50.44% |
Optimization-based methods take the approach of "learning to learn" by training model initialization or update rules that enable rapid adaptation to new tasks from minimal data.
MAML, proposed by Finn, Abbeel, and Levine (2017), is a meta-learning algorithm that seeks an initialization of model parameters from which a few gradient steps on a small support set can produce good performance on the corresponding query set. The key insight is that some parameter initializations are better suited for fast adaptation than others.
The MAML training procedure works in two loops:

- Inner loop: for each sampled task, adapt the current parameters θ with a few gradient steps on the task's support set, producing task-specific parameters θ′.
- Outer loop: evaluate θ′ on the task's query set and update the shared initialization θ by backpropagating through the inner-loop adaptation.
This requires computing second-order gradients (gradients of gradients), which is computationally expensive. A first-order approximation called FOMAML drops these second-order terms with minimal performance loss.
MAML is "model-agnostic" in the sense that it can be applied to any model trained with gradient descent, including classification, regression, and reinforcement learning tasks. On 5-way 1-shot miniImageNet, MAML achieves approximately 48.70% accuracy.
Reptile, proposed by Nichol, Achiam, and Schulman (2018), is a simpler alternative to MAML that avoids computing second-order gradients. Instead of differentiating through the optimization process, Reptile repeatedly performs several steps of SGD on a task and then moves the initialization toward the resulting parameters. Despite its simplicity, Reptile achieves comparable performance to MAML on standard benchmarks.
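Reptile's update rule is simple enough to sketch in full. The toy "task" below is a quadratic loss, chosen only so the behavior is easy to verify; the function name and hyperparameter values are illustrative.

```python
import numpy as np

def reptile_step(theta, task_grad, inner_lr=0.01, inner_steps=5, meta_lr=0.1):
    """One Reptile meta-iteration: run a few SGD steps on a sampled task,
    then move the shared initialization toward the adapted parameters."""
    phi = theta.copy()
    for _ in range(inner_steps):
        phi = phi - inner_lr * task_grad(phi)   # inner-loop SGD on the task
    return theta + meta_lr * (phi - theta)      # interpolate toward phi

# Toy "task": loss ||p - target||^2, whose gradient is 2 * (p - target).
target = np.ones(3)
theta = np.zeros(3)
for _ in range(500):
    theta = reptile_step(theta, lambda p: 2.0 * (p - target))
# theta drifts toward the task optimum without any second-order gradients
```

Note that the outer update never differentiates through the inner loop, which is exactly what lets Reptile avoid MAML's second-order terms.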
Meta-SGD (Li, Yang, Song, and Hospedales, 2017) extends MAML by also meta-learning the learning rate and update direction for each parameter, rather than using a fixed learning rate. This provides more flexibility in the inner-loop optimization and can lead to faster adaptation.
Memory-augmented neural networks (MANNs) extend standard neural networks with an external memory module that allows them to store and retrieve information about previously seen examples.
Santoro, Bartunov, Botvinick, Wierstra, and Lillicrap (2016) demonstrated that a neural network augmented with an external memory, similar to a Neural Turing Machine, could perform one-shot learning by quickly encoding new examples into memory and using content-based addressing to retrieve relevant memories at test time. Their approach uses a Least Recently Used Access (LRUA) memory writing strategy that encourages the network to write new information to either the least recently used memory slot or the most recently used slot, balancing novelty and familiarity.
The advantage of memory-augmented approaches is that they can rapidly incorporate new information without changing the model weights, making them well suited for scenarios where new classes appear continuously.
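The read side of such a memory can be sketched compactly; the snippet below shows cosine-similarity content-based addressing with softmax read weights, a common pattern in MANN-style architectures. The LRUA write rule is omitted, and the function name is illustrative.

```python
import numpy as np

def content_read(memory, key, eps=1e-8):
    """Content-based addressing: cosine similarity between the read key and
    each memory slot, softmax to get read weights, weighted sum of slots."""
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps
    )
    weights = np.exp(sims) / np.exp(sims).sum()
    return weights @ memory

memory = np.array([[1.0, 0.0], [0.0, 1.0]])
read = content_read(memory, np.array([1.0, 0.0]))  # pulled toward the first slot
```

Because reading is a soft weighted sum rather than a hard lookup, the whole operation is differentiable and can be trained end to end.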
Data augmentation-based methods address the scarcity of training examples by generating additional synthetic examples for novel classes.
Hariharan and Girshick (2017) proposed learning a "hallucinator" network that generates additional feature vectors for underrepresented classes. Given one real example from a novel class, the hallucinator produces multiple synthetic feature vectors by learning transformations from base classes that have abundant data. This approach improved one-shot accuracy on ImageNet by 2.3x. A follow-up paper by Wang, Girshick, Hebert, and Hariharan (2018) jointly optimized the hallucinator with a meta-learner, yielding up to a 6-point improvement in classification accuracy.
Other data augmentation strategies for one-shot learning include geometric transformations (rotation, scaling, flipping), color jittering, and more sophisticated methods such as using generative adversarial networks to synthesize training examples.
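The geometric transformations mentioned above can be sketched with NumPy alone; a single support image yields eight distinct views (the function name is illustrative):

```python
import numpy as np

def augment(image):
    """Simple geometric augmentation of one support example: the four
    90-degree rotations, plus a horizontal flip of each."""
    variants = []
    for k in range(4):
        rot = np.rot90(image, k)
        variants.extend([rot, np.fliplr(rot)])
    return variants

variants = augment(np.arange(9).reshape(3, 3))  # 8 augmented views
```

Whether such label-preserving transforms help depends on the domain: rotating a handwritten character can change its identity, so augmentation choices must respect class semantics.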
Garcia and Bruna (2018) proposed using graph neural networks for few-shot learning by constructing a graph where nodes represent both support and query examples and edges encode similarity relationships. Message passing on this graph allows information to propagate between examples, enabling the model to make predictions by aggregating neighborhood information. This framework generalizes several earlier few-shot learning methods and can be extended to semi-supervised and active learning settings.
| Approach family | Representative methods | Key mechanism | Strengths | Limitations |
|---|---|---|---|---|
| Metric-based | Siamese Networks, Matching Networks, Prototypical Networks, Relation Networks | Learn embedding space and distance function | Simple, fast inference, no fine-tuning needed | Fixed embedding may not adapt to diverse tasks |
| Optimization-based | MAML, Reptile, Meta-SGD | Learn parameter initialization for fast adaptation | Model-agnostic, flexible | Computationally expensive (second-order gradients), sensitive to hyperparameters |
| Memory-augmented | MANNs, Neural Turing Machines | External memory for storing and retrieving examples | Rapid incorporation of new data, no weight updates needed | Memory management overhead, scalability concerns |
| Data augmentation | Hallucinator networks, GAN-based synthesis | Generate synthetic training examples | Directly addresses data scarcity | Quality of synthetic examples may vary, risk of introducing artifacts |
| Graph-based | GNN-based methods | Message passing on example graphs | Captures inter-example relationships, extensible to semi-supervised settings | Graph construction can be expensive |
One-shot learning is closely connected to several related learning paradigms.
| Aspect | Zero-shot learning | One-shot learning | Few-shot learning |
|---|---|---|---|
| Examples per novel class | 0 | 1 | 2 to ~20 |
| Auxiliary information required | Yes (attributes, descriptions, or class embeddings) | No | No |
| Core methods | Semantic embeddings, attribute-based classification | Metric learning, meta-learning | Meta-learning, metric learning |
| Typical applications | Recognizing unseen object categories, cross-domain transfer | Face verification, signature matching | General classification with limited data |
Transfer learning involves reusing a model trained on one task (the source) for a different but related task (the target). One-shot learning can be seen as an extreme form of transfer learning, where the "transfer" happens from base classes with many examples to novel classes with only one example. Many one-shot learning methods use a backbone network pretrained on base classes and then apply metric learning or meta-learning strategies to generalize to novel classes.
Meta-learning, or "learning to learn," is the broader framework within which many one-shot learning methods operate. While meta-learning encompasses any approach that improves learning efficiency across tasks, one-shot learning focuses specifically on the extreme low-data regime. MAML, Matching Networks, and Prototypical Networks are all instances of meta-learning applied to the one-shot setting.
Sucholutsky and Schonlau (2021) introduced the concept of "less than one"-shot learning, where a model must learn to distinguish N classes given fewer than N examples total. This is achieved through the use of soft labels (probability distributions over classes rather than hard class assignments), which encode information about multiple classes in a single example. While still primarily theoretical and experimental on simple datasets, this work pushes the boundaries of data-efficient learning beyond what was previously considered possible.
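The core intuition can be illustrated with a distance-weighted soft-label classifier; this is a simplified sketch of the soft-label idea, not the exact algorithm from the paper. Two labeled points carry probability mass for three classes, so a third class can be recognized with no dedicated example.

```python
import numpy as np

def soft_label_predict(query, examples, soft_labels):
    """Classify by a distance-weighted sum of soft label vectors.
    Each example carries a distribution over classes, so two examples
    can jointly encode information about three classes."""
    d = np.linalg.norm(examples - query, axis=1)
    weights = 1.0 / (d + 1e-8)
    return int(np.argmax(weights @ soft_labels))

# Two 1-D examples, three classes: class 2 "lives between" classes 0 and 1.
examples = np.array([[0.0], [4.0]])
soft = np.array([[0.6, 0.0, 0.4],
                 [0.0, 0.6, 0.4]])
```

Queries near either example recover that example's dominant class, while queries near the midpoint are assigned class 2, which has no example of its own.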
One-shot learning has found practical applications across many domains.
Medical imaging datasets are often limited because of privacy constraints, the rarity of certain conditions, and the high cost of expert annotation. One-shot and few-shot learning methods enable models to classify medical images (such as X-rays, MRI scans, or histopathology slides) with minimal labeled examples, which is especially valuable for rare diseases.
Altae-Tran, Ramsundar, Pappu, and Pande (2017) demonstrated that one-shot learning can be applied to molecular property prediction, enabling models to predict chemical properties like toxicity from very few examples of a compound class. Their iterative refinement LSTM combined with graph neural networks achieved strong results on molecular datasets with limited training data.
In NLP, one-shot and few-shot learning have been applied to tasks such as text classification, named entity recognition, and relation extraction in low-resource settings. Large language models like GPT-3 have demonstrated remarkable few-shot and one-shot abilities through in-context learning, where examples are provided directly in the input prompt.
Despite significant progress, one-shot learning still faces several open challenges.
Domain shift. Most methods are evaluated on benchmarks where training and test classes come from similar distributions (for example, different subsets of ImageNet). Performance degrades significantly when there is a large domain gap between base and novel classes, such as transferring from natural images to medical scans.
Scalability. Many one-shot learning methods are evaluated on small-scale benchmarks with a limited number of classes. Scaling to thousands or millions of classes, as would be needed in real-world product recognition or biodiversity monitoring, remains difficult.
Cross-modal generalization. Current methods typically operate within a single modality (images, text, or audio). Developing one-shot learning systems that can transfer knowledge across modalities is an active research area.
Noisy and ambiguous support examples. In practical settings, the single available example may be noisy, atypical, or poorly representative of the class. Robust one-shot learning methods must handle such cases gracefully.
Evaluation standardization. Different papers use different dataset splits, backbone architectures, and training procedures, making fair comparison of methods challenging. Efforts to standardize evaluation protocols are ongoing.
Open-world recognition. Most one-shot learning methods assume a closed-set scenario where every query belongs to one of the N support classes. In the real world, models must also detect when a query does not belong to any known class (out-of-distribution detection).
| Year | Milestone |
|---|---|
| 2003 | Fei-Fei, Fergus, and Perona propose a Bayesian approach to unsupervised one-shot learning of object categories |
| 2006 | Fei-Fei et al. publish "One-shot learning of object categories" in IEEE TPAMI, formally introducing the term |
| 2015 | Koch, Zemel, and Salakhutdinov apply Siamese networks to one-shot image recognition; Lake et al. introduce the Omniglot dataset |
| 2016 | Vinyals et al. introduce Matching Networks; Santoro et al. demonstrate memory-augmented neural networks for one-shot learning |
| 2017 | Snell et al. propose Prototypical Networks; Finn et al. introduce MAML; Ravi and Larochelle standardize the miniImageNet benchmark |
| 2018 | Sung et al. introduce Relation Networks; Garcia and Bruna apply graph neural networks to few-shot learning |
| 2021 | Sucholutsky and Schonlau introduce "less than one"-shot learning with soft labels |
| 2020s | Integration with large pretrained models (vision transformers, foundation models) becomes a dominant trend |