One-shot learning is a machine learning approach in which a model learns to recognize or classify new categories from only a single labeled example per class. Unlike conventional supervised learning methods that typically require hundreds or thousands of labeled training examples, one-shot learning systems aim to replicate the human ability to generalize from very limited experience. The term was introduced by Li Fei-Fei, Rob Fergus, and Pietro Perona in their 2006 paper on object category recognition, where they demonstrated a Bayesian framework for learning visual categories from one or a few examples by leveraging prior knowledge from previously learned categories.
One-shot learning occupies a specific position within the broader spectrum of N-shot learning paradigms. Zero-shot learning requires no examples of target classes at all, instead relying on auxiliary information such as semantic attributes or textual descriptions. Few-shot learning generalizes to settings with a small number of examples (typically 2 to 20 per class). One-shot learning is the special case where exactly one example is available per novel class, making it one of the most challenging data-efficient learning problems.
Imagine someone shows you a single photo of a platypus. You have never seen one before. But from that one picture, you notice key features: a duck-like bill, a beaver tail, and brown fur. Later at the zoo, you spot a platypus right away because you remember those features from the one photo. That is basically what one-shot learning does for computers. It teaches a computer to recognize something new after seeing it just one time, by paying attention to the features that make it different from everything else.
Traditional deep learning models, particularly convolutional neural networks for image classification, achieve high accuracy only when trained on large labeled datasets. ImageNet, for example, contains over 14 million labeled images across more than 20,000 categories. Collecting and annotating datasets of this scale is time-consuming, expensive, and sometimes impossible for specialized domains.
Several real-world scenarios make large-scale data collection impractical:

- Rare events and conditions: rare diseases, unusual manufacturing defects, or endangered species provide few examples by their very nature.
- Expensive annotation: labeling may require scarce domain experts, as in medical imaging or drug discovery.
- Privacy constraints: data in domains such as healthcare and biometrics often cannot be collected or shared freely.
- Continuously emerging classes: applications such as face-recognition enrollment or new-product cataloging must handle classes that appear one example at a time.
Humans, by contrast, can learn to identify a new animal, recognize a new friend's face, or understand a new word after encountering it just once or twice. This cognitive ability inspired researchers to develop machine learning methods that can similarly generalize from minimal data.
One-shot learning is most commonly formulated as an N-way K-shot classification problem. In this setting:

- The support set contains N classes with K labeled examples each (N·K examples in total).
- The query set contains unlabeled examples drawn from the same N classes, which the model must classify.
- One-shot learning is the case K = 1: a single labeled example per class.
This formulation is distinct from standard classification because the classes in the support set are typically unseen during training. The model must generalize to new classes at test time without any fine-tuning.
Most modern one-shot learning methods use episodic training (also called meta-training). During each training iteration, the algorithm samples a random N-way K-shot task (called an episode) from the training set:

- Sample N classes at random from the set of training classes.
- Sample K support examples and a set of query examples from each of those classes.
- Predict labels for the query examples conditioned on the support set.
- Compute the loss on the query predictions and update the model parameters.
By training on thousands of such episodes, the model learns a general strategy for classifying novel classes from few examples, rather than memorizing specific categories.
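Episode construction can be sketched in a few lines. The snippet below assumes, purely for illustration, that the dataset is a dict mapping each class label to a list of examples; this structure and the function name `sample_episode` are not a standard API.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, n_query=5):
    """Sample one N-way K-shot episode from a class -> examples mapping."""
    classes = random.sample(sorted(dataset), n_way)
    support, query = [], []
    # Relabel the sampled classes 0..N-1 within the episode.
    for episode_label, cls in enumerate(classes):
        examples = random.sample(dataset[cls], k_shot + n_query)
        support += [(x, episode_label) for x in examples[:k_shot]]
        query += [(x, episode_label) for x in examples[k_shot:]]
    return support, query

# Toy dataset: 10 classes with 20 examples each.
data = {c: [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(data, n_way=5, k_shot=1, n_query=5)
```

For a 5-way 1-shot episode with 5 queries per class, this yields 5 support pairs and 25 query pairs, each labeled with an episode-local class index.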
Several standard benchmarks are used to evaluate one-shot learning methods:
| Dataset | Domain | Classes | Images per class | Typical evaluation |
|---|---|---|---|---|
| Omniglot | Handwritten characters | 1,623 characters from 50 alphabets | 20 | 5-way and 20-way 1-shot |
| miniImageNet | Natural images | 100 (64 train / 16 val / 20 test) | 600 | 5-way 1-shot and 5-shot |
| tieredImageNet | Natural images | 608 (351 train / 97 val / 160 test) | ~1,200 | 5-way 1-shot and 5-shot |
| CUB-200-2011 | Fine-grained birds | 200 | ~60 | 5-way 1-shot and 5-shot |
| CIFAR-FS | Natural images | 100 (64 train / 16 val / 20 test) | 600 | 5-way 1-shot and 5-shot |
Omniglot, sometimes called the "transpose of MNIST" because it contains many classes with only a few examples each (where MNIST has few classes with many examples), was introduced by Brenden Lake and colleagues in 2015 and is now considered a relatively easy benchmark, with modern methods approaching ceiling performance. miniImageNet, proposed by Vinyals et al. (2016) and later standardized by Ravi and Larochelle (2017), has become the most widely used benchmark for comparing few-shot learning methods.
Research on one-shot learning can be broadly organized into four families of approaches: metric-based methods, optimization-based (meta-learning) methods, memory-augmented methods, and data augmentation-based methods.
Metric-based approaches learn an embedding space where examples from the same class are close together and examples from different classes are far apart. Classification is then performed by comparing the distances between query embeddings and support set embeddings.
Siamese networks, applied to one-shot image recognition by Koch, Zemel, and Salakhutdinov (2015), consist of two identical neural network branches that share the same weights. Each branch processes one input image and produces a feature vector. The network then computes a distance between the two feature vectors (typically using a weighted L1 distance followed by a sigmoid function) to output a similarity score.
During training, the network is presented with pairs of images that are either from the same class (positive pairs) or different classes (negative pairs). It learns to output high similarity for same-class pairs and low similarity for different-class pairs. At test time, a query image is compared against each example in the support set, and the class of the most similar support example is assigned.
Strengths: Siamese networks are conceptually simple and can generalize to new classes without retraining. Once the embedding function is learned, new classes can be added by providing a single reference image.
Limitations: Pairwise comparison scales quadratically with the number of support examples and classes, and the fixed distance function may not capture complex class boundaries.
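The test-time procedure can be illustrated with a minimal NumPy sketch. It assumes the embeddings have already been produced by the trained branches; the per-dimension `weights` vector stands in for the learned L1 weighting, and the function names are illustrative, not from the original paper.

```python
import numpy as np

def siamese_score(f1, f2, weights):
    """Similarity in (0, 1): sigmoid of a weighted L1 distance between
    two embeddings. Identical embeddings score 0.5; larger distances
    (with positive weights) push the score toward 0."""
    d = np.dot(weights, np.abs(f1 - f2))
    return 1.0 / (1.0 + np.exp(d))

def one_shot_classify(query_emb, support_embs, support_labels, weights):
    """Assign the label of the most similar support embedding."""
    scores = [siamese_score(query_emb, s, weights) for s in support_embs]
    return support_labels[int(np.argmax(scores))]
```

Because each query is scored against every support example independently, adding a new class at test time only requires adding its reference embedding to the support set.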
Matching Networks, introduced by Vinyals, Blundell, Lillicrap, Kavukcuoglu, and Wierstra (2016), extend the Siamese approach by framing one-shot learning as a differentiable nearest-neighbor problem. The model maps a support set and a query image to a predicted label using an attention mechanism over the support set embeddings.
A key innovation of Matching Networks is the use of Full Context Embeddings (FCE), where the embedding of each support example is conditioned on the entire support set through a bidirectional LSTM. This allows the model to produce embeddings that are adapted to the specific task at hand. Matching Networks improved one-shot accuracy on ImageNet from 87.6% to 93.2%, and on Omniglot from 88.0% to 93.8%, relative to competing approaches.
Prototypical Networks, proposed by Snell, Swersky, and Zemel (2017), simplify the metric-based approach by computing a single prototype representation for each class. The prototype is defined as the mean of the embedded support examples for that class. Classification is performed by computing the distance from a query embedding to each class prototype and assigning the nearest class.
Formally, given an embedding function f (typically a CNN), the prototype for class c is the mean of the embedded support examples:

p_c = (1 / |S_c|) Σ_{x ∈ S_c} f(x)

where S_c is the set of support examples for class c. In the one-shot case, each prototype reduces to the embedding of the single support example.
Prototypical Networks use squared Euclidean distance as the distance metric. The authors showed that this choice is equivalent to a linear classifier in the embedding space, providing a simple inductive bias well suited to the limited-data regime. The method achieves strong results while being simpler than Matching Networks and faster to train.
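The prototype computation and nearest-prototype classification are simple enough to sketch directly in NumPy, assuming the embeddings have already been produced by a trained backbone (the function names here are illustrative):

```python
import numpy as np

def prototypes(support_embs, support_labels):
    """Class prototypes: the mean of the embedded support examples per class."""
    classes = sorted(set(support_labels))
    labels = np.array(support_labels)
    return classes, np.stack(
        [support_embs[labels == c].mean(axis=0) for c in classes]
    )

def nearest_prototype(query_emb, protos):
    """Index of the nearest prototype under squared Euclidean distance."""
    return int(np.argmin(((protos - query_emb) ** 2).sum(axis=1)))

# Toy 2-way 2-shot example with 2-D "embeddings".
embs = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [9.0, 10.0]])
labels = [0, 0, 1, 1]
classes, protos = prototypes(embs, labels)   # protos: [[0, 0.5], [9.5, 10]]
pred = classes[nearest_prototype(np.array([1.0, 1.0]), protos)]  # -> 0
```

With K = 1, `prototypes` simply returns the support embeddings themselves, so one-shot classification degenerates to nearest-neighbor search in the embedding space.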
Relation Networks, introduced by Sung, Yang, Zhang, Xiang, Torr, and Hospedales (2018), replace the fixed distance function with a learned relation module. Instead of computing Euclidean or cosine distance between embeddings, the model concatenates the query and support embeddings and passes them through a small neural network that outputs a relation score between 0 and 1.
This approach allows the model to learn a non-linear similarity function that may capture more complex relationships than a fixed metric. Relation Networks were evaluated on both Omniglot and miniImageNet and showed competitive performance with other metric-based methods.
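The relation module itself is just a small network applied to concatenated embeddings. The sketch below uses randomly initialized weights to stand in for trained parameters, so the particular scores are meaningless; it only shows the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

def relation_score(query_emb, support_emb, W1, b1, W2, b2):
    """Tiny relation module: concatenate the two embeddings, pass them
    through one ReLU hidden layer, and squash to a score in (0, 1)."""
    z = np.concatenate([query_emb, support_emb])
    h = np.maximum(0.0, W1 @ z + b1)
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))

# Randomly initialized weights stand in for trained parameters (assumption).
dim, hidden = 4, 8
W1 = rng.normal(size=(hidden, 2 * dim)); b1 = np.zeros(hidden)
W2 = rng.normal(size=hidden); b2 = 0.0
score = relation_score(rng.normal(size=dim), rng.normal(size=dim), W1, b1, W2, b2)
```

Because the similarity function is itself trained, its decision surface is not constrained to the spherical contours implied by Euclidean or cosine distance.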
| Method | Year | Distance function | Key idea | 5-way 1-shot miniImageNet accuracy |
|---|---|---|---|---|
| Siamese Networks | 2015 | Weighted L1 + sigmoid | Pairwise similarity learning | N/A (evaluated on Omniglot) |
| Matching Networks | 2016 | Cosine + attention | Differentiable nearest neighbor with FCE | 43.56% |
| Prototypical Networks | 2017 | Squared Euclidean | Class prototypes as mean embeddings | 49.42% |
| Relation Networks | 2018 | Learned neural network | Non-linear learned similarity | 50.44% |
Optimization-based methods take the approach of "learning to learn" by training model initialization or update rules that enable rapid adaptation to new tasks from minimal data.
MAML, proposed by Finn, Abbeel, and Levine (2017), is a meta-learning algorithm that seeks an initialization of model parameters from which a few gradient steps on a small support set can produce good performance on the corresponding query set. The key insight is that some parameter initializations are better suited for fast adaptation than others.
The MAML training procedure works in two loops:

- Inner loop: for each sampled task, adapt the current parameters θ with a few gradient steps on the task's support set, producing task-specific parameters θ′.
- Outer loop: evaluate θ′ on the task's query set and update the shared initialization θ by backpropagating through the inner-loop adaptation.
This requires computing second-order gradients (gradients of gradients), which is computationally expensive. A first-order approximation called FOMAML drops these second-order terms with minimal performance loss.
MAML is "model-agnostic" in the sense that it can be applied to any model trained with gradient descent, including classification, regression, and reinforcement learning tasks. On 5-way 1-shot miniImageNet, MAML achieves approximately 48.70% accuracy.
Reptile, proposed by Nichol, Achiam, and Schulman (2018), is a simpler alternative to MAML that avoids computing second-order gradients. Instead of differentiating through the optimization process, Reptile repeatedly performs several steps of SGD on a task and then moves the initialization toward the resulting parameters. Despite its simplicity, Reptile achieves comparable performance to MAML on standard benchmarks.
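Reptile's update rule is simple enough to sketch in full. The toy "task" below is a quadratic loss, chosen only so the behavior is easy to verify; the function name and hyperparameter values are illustrative.

```python
import numpy as np

def reptile_step(theta, task_grad, inner_lr=0.01, inner_steps=5, meta_lr=0.1):
    """One Reptile meta-iteration: run a few SGD steps on a sampled task,
    then move the shared initialization toward the adapted parameters."""
    phi = theta.copy()
    for _ in range(inner_steps):
        phi = phi - inner_lr * task_grad(phi)   # inner-loop SGD on the task
    return theta + meta_lr * (phi - theta)      # interpolate toward phi

# Toy "task": loss ||p - target||^2, whose gradient is 2 * (p - target).
target = np.ones(3)
theta = np.zeros(3)
for _ in range(500):
    theta = reptile_step(theta, lambda p: 2.0 * (p - target))
# theta drifts toward the task optimum without any second-order gradients
```

Note that the outer update never differentiates through the inner loop, which is exactly what lets Reptile avoid MAML's second-order terms.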
Meta-SGD (Li, Yang, Song, and Hospedales, 2017) extends MAML by also meta-learning the learning rate and update direction for each parameter, rather than using a fixed learning rate. This provides more flexibility in the inner-loop optimization and can lead to faster adaptation.
Memory-augmented neural networks (MANNs) extend standard neural networks with an external memory module that allows them to store and retrieve information about previously seen examples.
Santoro, Bartunov, Botvinick, Wierstra, and Lillicrap (2016) demonstrated that a neural network augmented with an external memory, similar to a Neural Turing Machine, could perform one-shot learning by quickly encoding new examples into memory and using content-based addressing to retrieve relevant memories at test time. Their approach uses a Least Recently Used Access (LRUA) memory writing strategy that encourages the network to write new information to either the least recently used memory slot or the most recently used slot, balancing novelty and familiarity.
The advantage of memory-augmented approaches is that they can rapidly incorporate new information without changing the model weights, making them well suited for scenarios where new classes appear continuously.
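The read side of such a memory can be sketched compactly; the snippet below shows cosine-similarity content-based addressing with softmax read weights, a common pattern in MANN-style architectures. The LRUA write rule is omitted, and the function name is illustrative.

```python
import numpy as np

def content_read(memory, key, eps=1e-8):
    """Content-based addressing: cosine similarity between the read key and
    each memory slot, softmax to get read weights, weighted sum of slots."""
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps
    )
    weights = np.exp(sims) / np.exp(sims).sum()
    return weights @ memory

memory = np.array([[1.0, 0.0], [0.0, 1.0]])
read = content_read(memory, np.array([1.0, 0.0]))  # pulled toward the first slot
```

Because reading is a soft weighted sum rather than a hard lookup, the whole operation is differentiable and can be trained end to end.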
Data augmentation-based methods address the scarcity of training examples by generating additional synthetic examples for novel classes.
Hariharan and Girshick (2017) proposed learning a "hallucinator" network that generates additional feature vectors for underrepresented classes. Given one real example from a novel class, the hallucinator produces multiple synthetic feature vectors by learning transformations from base classes that have abundant data. This approach improved one-shot accuracy on ImageNet by 2.3x. A follow-up paper by Wang, Girshick, Hebert, and Hariharan (2018) jointly optimized the hallucinator with a meta-learner, yielding up to a 6-point improvement in classification accuracy.
Other data augmentation strategies for one-shot learning include geometric transformations (rotation, scaling, flipping), color jittering, and more sophisticated methods such as using generative adversarial networks to synthesize training examples.
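The geometric transformations mentioned above can be sketched with NumPy alone; a single support image yields eight distinct views (the function name is illustrative):

```python
import numpy as np

def augment(image):
    """Simple geometric augmentation of one support example: the four
    90-degree rotations, plus a horizontal flip of each."""
    variants = []
    for k in range(4):
        rot = np.rot90(image, k)
        variants.extend([rot, np.fliplr(rot)])
    return variants

variants = augment(np.arange(9).reshape(3, 3))  # 8 augmented views
```

Whether such label-preserving transforms help depends on the domain: rotating a handwritten character can change its identity, so augmentation choices must respect class semantics.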
Garcia and Bruna (2018) proposed using graph neural networks for few-shot learning by constructing a graph where nodes represent both support and query examples and edges encode similarity relationships. Message passing on this graph allows information to propagate between examples, enabling the model to make predictions by aggregating neighborhood information. This framework generalizes several earlier few-shot learning methods and can be extended to semi-supervised and active learning settings.
| Approach family | Representative methods | Key mechanism | Strengths | Limitations |
|---|---|---|---|---|
| Metric-based | Siamese Networks, Matching Networks, Prototypical Networks, Relation Networks | Learn embedding space and distance function | Simple, fast inference, no fine-tuning needed | Fixed embedding may not adapt to diverse tasks |
| Optimization-based | MAML, Reptile, Meta-SGD | Learn parameter initialization for fast adaptation | Model-agnostic, flexible | Computationally expensive (second-order gradients), sensitive to hyperparameters |
| Memory-augmented | MANNs, Neural Turing Machines | External memory for storing and retrieving examples | Rapid incorporation of new data, no weight updates needed | Memory management overhead, scalability concerns |
| Data augmentation | Hallucinator networks, GAN-based synthesis | Generate synthetic training examples | Directly addresses data scarcity | Quality of synthetic examples may vary, risk of introducing artifacts |
| Graph-based | GNN-based methods | Message passing on example graphs | Captures inter-example relationships, extensible to semi-supervised settings | Graph construction can be expensive |
One-shot learning is closely connected to several related learning paradigms.
| Aspect | Zero-shot learning | One-shot learning | Few-shot learning |
|---|---|---|---|
| Examples per novel class | 0 | 1 | 2 to ~20 |
| Auxiliary information required | Yes (attributes, descriptions, or class embeddings) | No | No |
| Core methods | Semantic embeddings, attribute-based classification | Metric learning, meta-learning | Meta-learning, metric learning |
| Typical applications | Recognizing unseen object categories, cross-domain transfer | Face verification, signature matching | General classification with limited data |
Transfer learning involves reusing a model trained on one task (the source) for a different but related task (the target). One-shot learning can be seen as an extreme form of transfer learning, where the "transfer" happens from base classes with many examples to novel classes with only one example. Many one-shot learning methods use a backbone network pretrained on base classes and then apply metric learning or meta-learning strategies to generalize to novel classes.
Meta-learning, or "learning to learn," is the broader framework within which many one-shot learning methods operate. While meta-learning encompasses any approach that improves learning efficiency across tasks, one-shot learning focuses specifically on the extreme low-data regime. MAML, Matching Networks, and Prototypical Networks are all instances of meta-learning applied to the one-shot setting.
Sucholutsky and Schonlau (2021) introduced the concept of "less than one"-shot learning, where a model must learn to distinguish N classes given fewer than N examples total. This is achieved through the use of soft labels (probability distributions over classes rather than hard class assignments), which encode information about multiple classes in a single example. While still primarily theoretical and experimental on simple datasets, this work pushes the boundaries of data-efficient learning beyond what was previously considered possible.
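The core intuition can be illustrated with a distance-weighted soft-label classifier; this is a simplified sketch of the soft-label idea, not the exact algorithm from the paper. Two labeled points carry probability mass for three classes, so a third class can be recognized with no dedicated example.

```python
import numpy as np

def soft_label_predict(query, examples, soft_labels):
    """Classify by a distance-weighted sum of soft label vectors.
    Each example carries a distribution over classes, so two examples
    can jointly encode information about three classes."""
    d = np.linalg.norm(examples - query, axis=1)
    weights = 1.0 / (d + 1e-8)
    return int(np.argmax(weights @ soft_labels))

# Two 1-D examples, three classes: class 2 "lives between" classes 0 and 1.
examples = np.array([[0.0], [4.0]])
soft = np.array([[0.6, 0.0, 0.4],
                 [0.0, 0.6, 0.4]])
```

Queries near either example recover that example's dominant class, while queries near the midpoint are assigned class 2, which has no example of its own.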
One-shot learning has found practical applications across many domains.
Medical imaging datasets are often limited because of privacy constraints, the rarity of certain conditions, and the high cost of expert annotation. One-shot and few-shot learning methods enable models to classify medical images (such as X-rays, MRI scans, or histopathology slides) with minimal labeled examples, which is especially valuable for rare diseases.
Altae-Tran, Ramsundar, Pappu, and Pande (2017) demonstrated that one-shot learning can be applied to molecular property prediction, enabling models to predict chemical properties like toxicity from very few examples of a compound class. Their iterative refinement LSTM combined with graph neural networks achieved strong results on molecular datasets with limited training data.
In NLP, one-shot and few-shot learning have been applied to tasks such as text classification, named entity recognition, and relation extraction in low-resource settings. Large language models like GPT-3 have demonstrated remarkable few-shot and one-shot abilities through in-context learning, where examples are provided directly in the input prompt.
Despite significant progress, one-shot learning still faces several open challenges.
Domain shift. Most methods are evaluated on benchmarks where training and test classes come from similar distributions (for example, different subsets of ImageNet). Performance degrades significantly when there is a large domain gap between base and novel classes, such as transferring from natural images to medical scans.
Scalability. Many one-shot learning methods are evaluated on small-scale benchmarks with a limited number of classes. Scaling to thousands or millions of classes, as would be needed in real-world product recognition or biodiversity monitoring, remains difficult.
Cross-modal generalization. Current methods typically operate within a single modality (images, text, or audio). Developing one-shot learning systems that can transfer knowledge across modalities is an active research area.
Noisy and ambiguous support examples. In practical settings, the single available example may be noisy, atypical, or poorly representative of the class. Robust one-shot learning methods must handle such cases gracefully.
Evaluation standardization. Different papers use different dataset splits, backbone architectures, and training procedures, making fair comparison of methods challenging. Efforts to standardize evaluation protocols are ongoing.
Open-world recognition. Most one-shot learning methods assume a closed-set scenario where every query belongs to one of the N support classes. In the real world, models must also detect when a query does not belong to any known class (out-of-distribution detection).
| Year | Milestone |
|---|---|
| 2003 | Fei-Fei, Fergus, and Perona propose a Bayesian approach to unsupervised one-shot learning of object categories |
| 2006 | Fei-Fei et al. publish "One-shot learning of object categories" in IEEE TPAMI, formally introducing the term |
| 2015 | Koch, Zemel, and Salakhutdinov apply Siamese networks to one-shot image recognition; Lake et al. introduce the Omniglot dataset |
| 2016 | Vinyals et al. introduce Matching Networks; Santoro et al. demonstrate memory-augmented neural networks for one-shot learning |
| 2017 | Snell et al. propose Prototypical Networks; Finn et al. introduce MAML; Ravi and Larochelle standardize the miniImageNet benchmark |
| 2018 | Sung et al. introduce Relation Networks; Garcia and Bruna apply graph neural networks to few-shot learning |
| 2021 | Sucholutsky and Schonlau introduce "less than one"-shot learning with soft labels |
| 2020s | Integration with large pretrained models (vision transformers, foundation models) becomes a dominant trend |