In machine learning and statistics, an instance is a single data point used by a model. It is the atomic unit of a dataset: one row in a spreadsheet, one image in a photo collection, one sentence in a text corpus, or one audio clip in a speech recognition corpus. Every supervised, unsupervised, semi-supervised, and reinforcement learning pipeline ultimately processes a sequence of instances, so the concept underpins almost everything else in the field.

The term is heavily overloaded. Software engineers use it to mean an object created from a class. Cloud engineers use it to mean a virtual machine. Computer vision researchers use it to mean a single object within an image, as in instance segmentation. This article focuses primarily on the data-point sense of the word and disambiguates the others at the end.
Imagine a teacher has a stack of flashcards. On the front of each card is a picture of an animal. On the back is the animal's name. Each card is one instance. The picture is what the student looks at, and the name is the answer the student tries to learn. If you give the student a thousand cards, you have a dataset of a thousand instances. After studying enough of them, the student can look at a brand new picture (a new instance with no name on the back) and guess the animal correctly.
Different communities use different words for the same idea. The vocabulary depends on the field, the textbook, and sometimes the decade. The table below collects the most common terms.
| Term | Field where common | Notes |
|---|---|---|
| Instance | Machine learning, data mining | The default term in textbooks like Witten and Frank's Data Mining and most ML courses |
| Example | Machine learning, statistical learning theory | Used heavily by Vapnik, Mitchell, and the PAC-learning community |
| Sample | Statistics, deep learning | Confusing in statistics where "sample" can also mean a set of observations rather than one |
| Observation | Statistics, econometrics | Standard in regression analysis |
| Record | Databases, data warehousing | Each row in a table |
| Row | Tabular data, pandas, SQL | Common in tabular ML pipelines |
| Data point | Geometry, visualization | Emphasizes the geometric view of an instance as a point in feature space |
| Tuple | Relational databases | A row treated as an ordered collection of values |
| Case | Survey research, medical statistics | One subject or one event |
| Item | Recommender systems | Often used for the thing being recommended |
The lack of a single standard word can be confusing. "Sample" is the most ambiguous: in deep learning a sample usually means one instance, but in classical statistics a sample is a collection of observations drawn from a population. When reading papers it pays to check which sense the author intends.
A dataset of n instances is usually written as
D = { (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) }
where x_i is the input for the i-th instance and y_i is its label. The input x_i lives in an input space (often called the instance space) X, and the label y_i lives in a label space Y. For a tabular problem with d features, X is typically R^d and each x_i is a feature vector. For classification, Y might be a finite set like {0, 1} or {cat, dog, fish}. For regression, Y is usually R.
In unsupervised settings the dataset is just D = {x_1, ..., x_n} with no labels. In semi-supervised learning some instances have labels and others do not. In reinforcement learning the equivalent unit is a transition (state, action, reward, next state), which is sometimes called an experience or a sample.
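The notation above maps directly onto array shapes in code. A minimal sketch in Python with NumPy, using made-up toy values:

```python
import numpy as np

# A toy supervised dataset D = {(x_1, y_1), ..., (x_n, y_n)}:
# n instances, each a feature vector in R^d, with labels in {0, 1}.
n, d = 4, 3
X = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.5, 0.0],
              [2.0, 2.0, 1.0],
              [0.0, 1.0, 1.0]])   # inputs, shape (n, d)
y = np.array([0, 1, 1, 0])       # labels, shape (n,)

x_2, y_2 = X[1], y[1]            # the second instance (zero-based index 1)

# An unsupervised dataset is just the inputs with no label array.
X_unlabeled = X
```

Each row of `X` is one instance's feature vector, and indexing `X` and `y` with the same position recovers the (input, label) pair.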
The physical representation of an instance varies enormously depending on the type of data the model consumes. The model architecture is usually chosen to match the structure of the instance.
| Data type | Instance representation | Typical shape | Example architectures |
|---|---|---|---|
| Tabular | Row vector or dictionary of named features | (d,) where d is the number of columns | Logistic regression, gradient boosting, MLP |
| Image | 2D or 3D tensor of pixel intensities | (H, W, C) where C is channels | CNN, Vision Transformer |
| Text | Sequence of tokens or token IDs | (L,) where L is sequence length | RNN, Transformer |
| Audio | Raw waveform or spectrogram | (T,) for waveform, (T, F) for spectrogram | WaveNet, conformer, Whisper |
| Video | 4D tensor over time | (T, H, W, C) | 3D CNN, video transformer |
| Graph | Set of nodes plus an edge list | Variable | Graph neural network |
| Time series | Ordered sequence of observations | (T, d) | RNN, temporal CNN, Informer |
| Point cloud | Unordered set of 3D points | (N, 3) or (N, 6) with color | PointNet, PointNet++ |
| Multimodal | Tuple of representations from different modalities | Varies | CLIP, Flamingo, multimodal transformers |
The choice of representation often determines what model can be applied. A convolutional network expects a fixed-size image tensor; a transformer expects a sequence; a graph neural network expects a node-and-edge structure. Preprocessing pipelines exist to convert raw data into the right instance shape.
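The shapes listed in the table can be made concrete with a few NumPy placeholders; the specific sizes below are arbitrary illustrations, not requirements of any framework:

```python
import numpy as np

# Illustrative instance shapes for several data types (sizes are arbitrary).
tabular_instance = np.zeros(8)               # (d,) with d = 8 features
image_instance = np.zeros((224, 224, 3))     # (H, W, C) RGB image
text_instance = np.zeros(128, dtype=int)     # (L,) token IDs
audio_spectrogram = np.zeros((100, 80))      # (T, F) time x frequency bins
video_instance = np.zeros((16, 64, 64, 3))   # (T, H, W, C) clip of 16 frames
point_cloud = np.zeros((1024, 3))            # (N, 3) xyz coordinates
```

A preprocessing pipeline's job is to transform raw files into tensors of the shape the chosen architecture expects.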
The terms instance space and feature space are sometimes used interchangeably and sometimes given distinct meanings. In most modern texts the instance space is the set of all possible inputs the model could ever receive, and the feature space is the geometric space in which feature vectors live. For tabular data with d numerical features, both are usually R^d. For images, the raw instance space is the set of all possible H by W by C pixel grids, while the learned feature space is the lower-dimensional embedding space produced by the network.
Many classical algorithms are easiest to understand geometrically. K-nearest neighbors classifies a new instance by finding the closest training instances in feature space. Support vector machines find a hyperplane separating instances of different classes. K-means clustering groups instances by their distances to centroids. In each case the geometry of the instance space drives the algorithm's behavior.
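The geometric view is easy to sketch in code. Below is a minimal k-nearest-neighbors classifier over instances as points in feature space, with hypothetical toy data:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training instances."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to each instance
    nearest = np.argsort(dists)[:k]                   # indices of the k closest instances
    votes = y_train[nearest]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]                  # most common label among neighbors

# Four training instances in a 2D feature space, two per class.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
pred = knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3)  # near the class-1 cluster
```

The new instance sits close to the two class-1 points, so two of its three nearest neighbors vote for class 1.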
A dataset is partitioned into disjoint subsets of instances. The standard split is a training set, a validation set, and a test set. Training instances are used to fit model parameters. Validation instances are used to tune hyperparameters and choose between candidate models. Test instances are held out until the very end to give an unbiased estimate of generalization performance. The splits are partitions of the same instance pool, not separate kinds of data.
Leakage between splits is a frequent source of inflated benchmark numbers. Common mistakes include placing instances from the same patient in both training and test sets in medical imaging, or splitting time series randomly so that future instances appear in the training set. Careful split design is part of building an honest evaluation pipeline.
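One common defense against the patient-leakage mistake is to split by group rather than by instance, so that all instances from one patient land on the same side of the split. A minimal sketch (the function name and data are hypothetical; libraries such as scikit-learn offer production-grade versions of this idea):

```python
import numpy as np

def group_split(groups, test_fraction=0.25, seed=0):
    """Split instance indices so no group (e.g., patient) spans both sets."""
    rng = np.random.default_rng(seed)
    unique = np.array(sorted(set(groups)))
    rng.shuffle(unique)
    n_test = max(1, int(len(unique) * test_fraction))
    test_groups = set(unique[:n_test].tolist())
    groups = np.asarray(groups)
    test_idx = np.where([g in test_groups for g in groups])[0]
    train_idx = np.where([g not in test_groups for g in groups])[0]
    return train_idx, test_idx

# Six instances from four patients; p1 and p3 each contribute two instances.
patients = ["p1", "p1", "p2", "p3", "p3", "p4"]
train_idx, test_idx = group_split(patients)
```

Splitting at the group level guarantees that the model is never evaluated on an instance whose sibling it has already seen during training.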
An instance is called labeled if it carries a known target value y_i. Otherwise it is unlabeled. Labeled data is required for supervised learning and is usually expensive to obtain because it requires human annotation. Unlabeled data is typically abundant. Several learning paradigms are organized around how labels are assigned to instances.
| Paradigm | Labeled instances | Unlabeled instances | Notes |
|---|---|---|---|
| Supervised learning | All | None | Standard classification and regression |
| Unsupervised learning | None | All | Clustering, density estimation, dimensionality reduction |
| Semi-supervised learning | Few | Many | Label propagation, pseudo-labeling, consistency regularization |
| Self-supervised learning | None directly; labels are derived from the input itself | All | Contrastive learning, masked language modeling |
| Active learning | Initially few; the model selects which to label next | Many | Human-in-the-loop annotation |
| Weakly supervised learning | Noisy or coarse labels | Some | Distant supervision, programmatic labeling |
Multiple instance learning (MIL) is a variant of supervised learning where labels are attached to bags of instances rather than to individual instances. It was formalized by Thomas Dietterich, Richard Lathrop, and Tomas Lozano-Perez in their 1997 paper "Solving the Multiple Instance Problem with Axis-Parallel Rectangles," published in the journal Artificial Intelligence. The original motivation came from drug discovery. A small molecule can fold into many three-dimensional shapes called conformations. The molecule binds to a target protein if at least one of its conformations fits the binding site. Chemists could label the molecule as active or inactive but could not say which specific conformation was responsible. Each molecule was therefore a bag containing many candidate conformations as instances, with a single bag-level label.
The standard MIL assumption is sometimes written as a logical OR over instances. A bag is positive if at least one instance is positive, and negative if all instances are negative. The learner sees only the bag label, not the per-instance labels.
| MIL setting | Bag label rule | Typical use |
|---|---|---|
| Standard MIL | Positive if any instance is positive | Drug activity prediction, image classification with object presence |
| Threshold MIL | Positive if at least k instances are positive | Counting tasks |
| Collective MIL | Bag label depends on the distribution of instances | Histopathology slide grading |
| Embedded-space MIL | Bag is mapped to a single feature vector before classification | Modern attention-based MIL pipelines |
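The standard and threshold bag-label rules above reduce to a one-line function. A minimal sketch, with hypothetical per-instance labels standing in for the unobservable conformation-level ground truth:

```python
def bag_label(instance_labels, k=1):
    """MIL bag rules: standard (k=1) and threshold (k > 1) variants.

    A bag is positive if at least k of its instances are positive.
    With k=1 this is the logical OR over instances.
    """
    return int(sum(instance_labels) >= k)

# Drug-discovery reading: each list holds the (hidden) activity of one
# molecule's conformations; only the bag-level label is observable.
active_molecule = bag_label([0, 0, 1])    # one active conformation -> active
inactive_molecule = bag_label([0, 0, 0])  # no active conformation -> inactive
```

The learner, of course, never sees `instance_labels` directly; this function only describes how the observable bag labels relate to the hidden instance labels.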
MIL is now widely used in computational pathology, where a whole-slide image of a biopsy is too large to label pixel by pixel and is instead split into thousands of tile instances grouped into a slide-level bag. It is also used in object detection with image-level labels, video classification where the bag is a clip and each frame is an instance, and weakly supervised text classification.
In many practical pipelines, not every instance is treated equally. Each instance can be assigned a weight w_i that scales its contribution to the loss. Higher weights make the model pay more attention to that instance during training. Common reasons to use instance weights include class imbalance, in which minority-class instances are upweighted; importance sampling, in which weights correct for the difference between the sampling distribution and a target distribution; and curriculum learning, in which weights vary across training to introduce hard examples gradually. See upweighting for more detail.
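A common weighting scheme for class imbalance sets each instance's weight inversely proportional to its class frequency. A minimal sketch (the function name is illustrative, not a standard API):

```python
import numpy as np

def balanced_weights(y):
    """Inverse-frequency instance weights: rare classes get larger weights.

    Normalized so the weights sum to the number of instances n, matching
    the common n / (n_classes * class_count) convention.
    """
    classes, counts = np.unique(y, return_counts=True)
    freq = dict(zip(classes.tolist(), counts.tolist()))
    n, k = len(y), len(classes)
    return np.array([n / (k * freq[label]) for label in y])

y = np.array([0, 0, 0, 0, 1])   # 4:1 class imbalance
w = balanced_weights(y)         # the lone minority instance gets the largest weight
```

Multiplying each instance's loss term by its weight before averaging makes the minority class contribute as much total gradient signal as the majority class.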
Resampling is an alternative to weighting. Oversampling duplicates rare instances and undersampling discards common ones. Both change the effective distribution of training instances seen by the model. SMOTE is a popular oversampling method that creates synthetic minority instances by interpolating between existing ones.
Not all instances are equally easy for a model to learn. An instance is sometimes called hard if the model consistently mispredicts it or assigns it low confidence, and easy if the model gets it right with high confidence early in training. The distinction matters because hard instances often dominate the loss and gradient signal late in training, while easy instances provide little new information once the model has converged on them.
Hard example mining is the practice of identifying difficult instances and giving them more attention. Online hard example mining (OHEM), introduced by Shrivastava and colleagues in 2016 for object detection, ranks region proposals by loss within each minibatch and trains only on the hardest fraction. Focal loss, introduced by Lin and colleagues in their 2017 RetinaNet paper, achieves a similar effect by smoothly downweighting easy examples in the loss function rather than discarding them.
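The downweighting behavior of focal loss can be sketched in a few lines. This is the binary form from the RetinaNet paper, without the optional class-balancing alpha term:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss: (1 - p_t)^gamma * cross-entropy.

    p is the predicted probability of class 1, y is the 0/1 label.
    p_t is the probability assigned to the true class, so easy examples
    (p_t near 1) are smoothly downweighted; gamma=0 recovers plain
    cross-entropy.
    """
    p_t = np.where(y == 1, p, 1.0 - p)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

easy = focal_loss(np.array([0.95]), np.array([1]))  # confident and correct
hard = focal_loss(np.array([0.2]), np.array([1]))   # confident and wrong
```

With gamma = 2, the easy example's loss is scaled by (0.05)^2 = 0.0025, so hard instances dominate the gradient by orders of magnitude.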
Curriculum learning, introduced by Bengio and colleagues in 2009, takes the opposite ordering. Easier instances are presented first and harder ones are introduced gradually, mimicking the way humans learn structured material. Both approaches treat instance difficulty as a first-class signal rather than a fixed property of the data.
Instance discrimination is a self-supervised pretraining technique in which each instance in the dataset is treated as its own class. The model is trained to recognize each instance as distinct from every other instance. The idea was introduced by Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin in their 2018 CVPR paper "Unsupervised Feature Learning via Non-Parametric Instance Discrimination." Their method stored a memory bank of feature vectors for every image in the training set and used noise-contrastive estimation to keep the computation tractable when the number of effective classes equals the number of training images, which can be in the millions.
Instance discrimination became the conceptual foundation for the contrastive learning wave that followed. Methods like MoCo (Momentum Contrast, He et al. 2020) replaced the memory bank with a momentum-updated encoder and a queue of negative examples. SimCLR (Chen et al. 2020) showed that strong data augmentation and a large batch of in-batch negatives could match or beat memory-bank approaches. Both still treat each augmented view of an instance as an anchor whose only positive partner is another augmentation of the same instance.
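The shared objective behind these methods is an InfoNCE-style contrastive loss in which the diagonal of a similarity matrix holds the matching (anchor, positive) pairs and every other instance in the batch serves as a negative. A simplified NumPy sketch (real implementations use learned encoders and backpropagation; here the "views" are just raw vectors):

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Contrastive loss where each instance's only positive is its own
    other view; all remaining instances in the batch act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                   # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # diagonal = matching pairs

anchors = np.eye(4, 8)  # four orthonormal "instances" in an 8-d feature space
matched = info_nce_loss(anchors, anchors)                        # correct pairing
shuffled = info_nce_loss(anchors, np.roll(anchors, 1, axis=0))   # wrong pairing
```

When each anchor's positive really is its own view, the loss is near zero; when the pairing is scrambled, each anchor's highest similarity points at a negative and the loss is large.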
In computer vision, instance takes on a slightly different meaning. An instance is a single object occurrence in an image, and instance segmentation is the task of producing a separate pixel mask for each object instance, even when several objects belong to the same class. The table below contrasts the three major segmentation tasks.
| Task | Output | Distinguishes individual objects? | Example |
|---|---|---|---|
| Semantic segmentation | One class label per pixel | No | All cars share the same "car" label |
| Instance segmentation | One mask per object instance | Yes | Each car gets its own mask |
| Panoptic segmentation | Per-pixel class plus instance ID for countable classes | Yes for things, no for stuff | Combines both |
The canonical instance segmentation model is Mask R-CNN, introduced by Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick at Facebook AI Research in 2017. It extends Faster R-CNN by adding a small fully convolutional mask prediction head in parallel with the existing classification and bounding-box regression heads. Each region of interest produces a class score, a refined box, and a binary mask. Mask R-CNN won the ICCV 2017 Best Paper Award (the Marr Prize) and remained the dominant framework for instance segmentation for several years. More recent work explores transformer-based approaches like DETR and Mask2Former.
Instance normalization is a normalization layer that computes statistics independently for each instance in a batch, rather than across the whole batch. It was introduced by Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky in their 2016 paper "Instance Normalization: The Missing Ingredient for Fast Stylization." The authors found that replacing batch normalization with per-instance normalization in a feed-forward style transfer network produced sharper, more consistent stylized outputs and let the network reach quality competitive with the slower iterative method of Gatys and colleagues. The intuition is that style transfer should be invariant to the contrast and brightness of the content image, and per-instance statistics remove that variation more effectively than batch statistics.
The table below contrasts the four main normalization variants by what dimensions they reduce over for a 4D tensor of shape (N, C, H, W).
| Method | Statistics computed over | Typical use |
|---|---|---|
| Batch normalization | (N, H, W) per channel | General classification, large batches |
| Layer normalization | (C, H, W) per instance | Transformers, RNNs |
| Instance normalization | (H, W) per instance per channel | Style transfer, generative models |
| Group normalization | (C/G, H, W) per instance, per group of channels | Small batches, detection, segmentation |
The word instance shows up in several other parts of computing with unrelated meanings. The table below disambiguates the three uses most likely to overlap with machine learning conversations.
| Sense | Meaning | Example |
|---|---|---|
| Data instance (this article) | A single data point in a dataset | One row of a spreadsheet, one image in ImageNet |
| Object-oriented programming | A concrete object created from a class | model = Model() creates a new instance of Model |
| Cloud computing | A virtual machine or container provisioned in a cloud platform | An AWS EC2 g5.xlarge instance used for GPU training |
The overlap is rarely accidental. A data scientist might write "I trained the model on 1 million instances using a g5.12xlarge instance," using two different senses of the word in the same sentence. Context usually disambiguates: data instances live in datasets, software instances live in memory, and cloud instances live in data centers.
| Concept | One-line definition |
|---|---|
| Instance | A single data point in a dataset |
| Example | Synonym for instance, common in PAC learning literature |
| Feature | A measurable property of an instance |
| Feature vector | The numerical representation of an instance's features |
| Dataset | A collection of instances |
| Labeled data | Instances paired with target values |
| Training data | The subset of instances used to fit the model |
| Test set | The subset of instances held out for final evaluation |
| Multiple instance learning | Learning from bags of instances with bag-level labels |
| Instance segmentation | Segmenting each object occurrence in an image |
| Instance normalization | Normalizing per instance rather than per batch |