In machine learning and statistics, an instance is a single data point used by a model. It is the atomic unit of a dataset: one row in a spreadsheet, one image in a photo collection, one sentence in a text corpus, or one audio clip in a speech recognition corpus. Every supervised, unsupervised, semi-supervised, and reinforcement learning pipeline ultimately processes a sequence of instances, so the concept underpins almost everything else in the field.

The term is heavily overloaded. Software engineers use it to mean an object created from a class. Cloud engineers use it to mean a virtual machine. Computer vision researchers use it to mean a single object within an image, as in instance segmentation. This article focuses primarily on the data-point sense of the word and disambiguates the others at the end.
Imagine a teacher has a stack of flashcards. On the front of each card is a picture of an animal. On the back is the animal's name. Each card is one instance. The picture is what the student looks at, and the name is the answer the student tries to learn. If you give the student a thousand cards, you have a dataset of a thousand instances. After studying enough of them, the student can look at a brand new picture (a new instance with no name on the back) and guess the animal correctly.
Different communities use different words for the same idea. The vocabulary depends on the field, the textbook, and sometimes the decade. The table below collects the most common terms.
| Term | Field where common | Notes |
|---|---|---|
| Instance | Machine learning, data mining | The default term in textbooks like Witten and Frank's Data Mining and most ML courses |
| Example | Machine learning, statistical learning theory | Used heavily by Vapnik, Mitchell, and the PAC-learning community |
| Sample | Statistics, deep learning | Confusing in statistics where "sample" can also mean a set of observations rather than one |
| Observation | Statistics, econometrics | Standard in regression analysis |
| Record | Databases, data warehousing | Each row in a table |
| Row | Tabular data, pandas, SQL | Common in tabular ML pipelines |
| Data point | Geometry, visualization | Emphasizes the geometric view of an instance as a point in feature space |
| Tuple | Relational databases | A row treated as an ordered collection of values |
| Case | Survey research, medical statistics | One subject or one event |
| Item | Recommender systems | Often used for the thing being recommended |
The lack of a single standard word can be confusing. "Sample" is the most ambiguous: in deep learning a sample usually means one instance, but in classical statistics a sample is a collection of observations drawn from a population. When reading papers it pays to check which sense the author intends.
A dataset of n instances is usually written as
D = { (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) }
where x_i is the input for the i-th instance and y_i is its label. The input x_i lives in an input space (often called the instance space) X, and the label y_i lives in a label space Y. For a tabular problem with d features, X is typically R^d and each x_i is a feature vector. For classification, Y might be a finite set like {0, 1} or {cat, dog, fish}. For regression, Y is usually R.
In unsupervised settings the dataset is just D = {x_1, ..., x_n} with no labels. In semi-supervised learning some instances have labels and others do not. In reinforcement learning the equivalent unit is a transition (state, action, reward, next state), which is sometimes called an experience or a sample.
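The notation above maps directly onto array shapes in code. A minimal sketch in Python with NumPy, using made-up toy values:

```python
import numpy as np

# A toy supervised dataset D = {(x_1, y_1), ..., (x_n, y_n)}:
# n instances, each a feature vector in R^d, with labels in {0, 1}.
n, d = 4, 3
X = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.5, 0.0],
              [2.0, 2.0, 1.0],
              [0.0, 1.0, 1.0]])   # inputs, shape (n, d)
y = np.array([0, 1, 1, 0])       # labels, shape (n,)

x_2, y_2 = X[1], y[1]            # the second instance (zero-based index 1)

# An unsupervised dataset is just the inputs with no label array.
X_unlabeled = X
```

Each row of `X` is one instance's feature vector, and indexing `X` and `y` with the same position recovers the (input, label) pair.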
The physical representation of an instance varies enormously depending on the type of data the model consumes. The model architecture is usually chosen to match the structure of the instance.
| Data type | Instance representation | Typical shape | Example architectures |
|---|---|---|---|
| Tabular | Row vector or dictionary of named features | (d,) where d is the number of columns | Logistic regression, gradient boosting, MLP |
| Image | 2D or 3D tensor of pixel intensities | (H, W, C) where C is channels | CNN, Vision Transformer |
| Text | Sequence of tokens or token IDs | (L,) where L is sequence length | RNN, Transformer |
| Audio | Raw waveform or spectrogram | (T,) for waveform, (T, F) for spectrogram | WaveNet, conformer, Whisper |
| Video | 4D tensor over time | (T, H, W, C) | 3D CNN, video transformer |
| Graph | Set of nodes plus an edge list | Variable | Graph neural network |
| Time series | Ordered sequence of observations | (T, d) | RNN, temporal CNN, Informer |
| Point cloud | Unordered set of 3D points | (N, 3) or (N, 6) with color | PointNet, PointNet++ |
| Multimodal | Tuple of representations from different modalities | Varies | CLIP, Flamingo, multimodal transformers |
The choice of representation often determines what model can be applied. A convolutional network expects a fixed-size image tensor; a transformer expects a sequence; a graph neural network expects a node-and-edge structure. Preprocessing pipelines exist to convert raw data into the right instance shape.
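The shapes listed in the table can be made concrete with a few NumPy placeholders; the specific sizes below are arbitrary illustrations, not requirements of any framework:

```python
import numpy as np

# Illustrative instance shapes for several data types (sizes are arbitrary).
tabular_instance = np.zeros(8)               # (d,) with d = 8 features
image_instance = np.zeros((224, 224, 3))     # (H, W, C) RGB image
text_instance = np.zeros(128, dtype=int)     # (L,) token IDs
audio_spectrogram = np.zeros((100, 80))      # (T, F) time x frequency bins
video_instance = np.zeros((16, 64, 64, 3))   # (T, H, W, C) clip of 16 frames
point_cloud = np.zeros((1024, 3))            # (N, 3) xyz coordinates
```

A preprocessing pipeline's job is to transform raw files into tensors of the shape the chosen architecture expects.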
The terms instance space and feature space are sometimes used interchangeably and sometimes given distinct meanings. In most modern texts the instance space is the set of all possible inputs the model could ever receive, and the feature space is the geometric space in which feature vectors live. For tabular data with d numerical features, both are usually R^d. For images, the raw instance space is the set of all possible H by W by C pixel grids, while the learned feature space is the lower-dimensional embedding space produced by the network.
Many classical algorithms are easiest to understand geometrically. K-nearest neighbors classifies a new instance by finding the closest training instances in feature space. Support vector machines find a hyperplane separating instances of different classes. K-means clustering groups instances by their distances to centroids. In each case the geometry of the instance space drives the algorithm's behavior.
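The geometric view is easy to sketch in code. Below is a minimal k-nearest-neighbors classifier over instances as points in feature space, with hypothetical toy data:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training instances."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to each instance
    nearest = np.argsort(dists)[:k]                   # indices of the k closest instances
    votes = y_train[nearest]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]                  # most common label among neighbors

# Four training instances in a 2D feature space, two per class.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
pred = knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3)  # near the class-1 cluster
```

The new instance sits close to the two class-1 points, so two of its three nearest neighbors vote for class 1.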
A dataset is partitioned into disjoint subsets of instances. The standard split is a training set, a validation set, and a test set. Training instances are used to fit model parameters. Validation instances are used to tune hyperparameters and choose between candidate models. Test instances are held out until the very end to give an unbiased estimate of generalization performance. The splits are partitions of the same instance pool, not separate kinds of data.
Leakage between splits is a frequent source of inflated benchmark numbers. Common mistakes include placing instances from the same patient in both training and test sets in medical imaging, or splitting time series randomly so that future instances appear in the training set. Careful split design is part of building an honest evaluation pipeline.
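One common defense against the patient-leakage mistake is to split by group rather than by instance, so that all instances from one patient land on the same side of the split. A minimal sketch (the function name and data are hypothetical; libraries such as scikit-learn offer production-grade versions of this idea):

```python
import numpy as np

def group_split(groups, test_fraction=0.25, seed=0):
    """Split instance indices so no group (e.g., patient) spans both sets."""
    rng = np.random.default_rng(seed)
    unique = np.array(sorted(set(groups)))
    rng.shuffle(unique)
    n_test = max(1, int(len(unique) * test_fraction))
    test_groups = set(unique[:n_test].tolist())
    groups = np.asarray(groups)
    test_idx = np.where([g in test_groups for g in groups])[0]
    train_idx = np.where([g not in test_groups for g in groups])[0]
    return train_idx, test_idx

# Six instances from four patients; p1 and p3 each contribute two instances.
patients = ["p1", "p1", "p2", "p3", "p3", "p4"]
train_idx, test_idx = group_split(patients)
```

Splitting at the group level guarantees that the model is never evaluated on an instance whose sibling it has already seen during training.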
An instance is called labeled if it carries a known target value y_i. Otherwise it is unlabeled. Labeled data is required for supervised learning and is usually expensive to obtain because it requires human annotation. Unlabeled data is typically abundant. Several learning paradigms are organized around how labels are assigned to instances.
| Paradigm | Labeled instances | Unlabeled instances | Notes |
|---|---|---|---|
| Supervised learning | All | None | Standard classification and regression |
| Unsupervised learning | None | All | Clustering, density estimation, dimensionality reduction |
| Semi-supervised learning | Few | Many | Label propagation, pseudo-labeling, consistency regularization |
| Self-supervised learning | None directly; labels are derived from the input itself | All | Contrastive learning, masked language modeling |
| Active learning | Initially few; the model selects which to label next | Many | Human-in-the-loop annotation |
| Weakly supervised learning | Noisy or coarse labels | Some | Distant supervision, programmatic labeling |
Multiple instance learning (MIL) is a variant of supervised learning where labels are attached to bags of instances rather than to individual instances. It was formalized by Thomas Dietterich, Richard Lathrop, and Tomas Lozano-Perez in their 1997 paper "Solving the Multiple Instance Problem with Axis-Parallel Rectangles," published in the journal Artificial Intelligence. The original motivation came from drug discovery. A small molecule can fold into many three-dimensional shapes called conformations. The molecule binds to a target protein if at least one of its conformations fits the binding site. Chemists could label the molecule as active or inactive but could not say which specific conformation was responsible. Each molecule was therefore a bag containing many candidate conformations as instances, with a single bag-level label.
The standard MIL assumption is sometimes written as a logical OR over instances. A bag is positive if at least one instance is positive, and negative if all instances are negative. The learner sees only the bag label, not the per-instance labels.
| MIL setting | Bag label rule | Typical use |
|---|---|---|
| Standard MIL | Positive if any instance is positive | Drug activity prediction, image classification with object presence |
| Threshold MIL | Positive if at least k instances are positive | Counting tasks |
| Collective MIL | Bag label depends on the distribution of instances | Histopathology slide grading |
| Embedded-space MIL | Bag is mapped to a single feature vector before classification | Modern attention-based MIL pipelines |
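The standard and threshold bag-label rules above reduce to a one-line function. A minimal sketch, with hypothetical per-instance labels standing in for the unobservable conformation-level ground truth:

```python
def bag_label(instance_labels, k=1):
    """MIL bag rules: standard (k=1) and threshold (k > 1) variants.

    A bag is positive if at least k of its instances are positive.
    With k=1 this is the logical OR over instances.
    """
    return int(sum(instance_labels) >= k)

# Drug-discovery reading: each list holds the (hidden) activity of one
# molecule's conformations; only the bag-level label is observable.
active_molecule = bag_label([0, 0, 1])    # one active conformation -> active
inactive_molecule = bag_label([0, 0, 0])  # no active conformation -> inactive
```

The learner, of course, never sees `instance_labels` directly; this function only describes how the observable bag labels relate to the hidden instance labels.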
MIL is now widely used in computational pathology, where a whole-slide image of a biopsy is too large to label pixel by pixel and is instead split into thousands of tile instances grouped into a slide-level bag. It is also used in object detection with image-level labels, video classification where the bag is a clip and each frame is an instance, and weakly supervised text classification.
In many practical pipelines, not every instance is treated equally. Each instance can be assigned a weight w_i that scales its contribution to the loss. Higher weights make the model pay more attention to that instance during training. Common reasons to use instance weights include class imbalance, in which minority-class instances are upweighted; importance sampling, in which weights correct for the difference between the sampling distribution and a target distribution; and curriculum learning, in which weights vary across training to introduce hard examples gradually. See upweighting for more detail.
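A common weighting scheme for class imbalance sets each instance's weight inversely proportional to its class frequency. A minimal sketch (the function name is illustrative, not a standard API):

```python
import numpy as np

def balanced_weights(y):
    """Inverse-frequency instance weights: rare classes get larger weights.

    Normalized so the weights sum to the number of instances n, matching
    the common n / (n_classes * class_count) convention.
    """
    classes, counts = np.unique(y, return_counts=True)
    freq = dict(zip(classes.tolist(), counts.tolist()))
    n, k = len(y), len(classes)
    return np.array([n / (k * freq[label]) for label in y])

y = np.array([0, 0, 0, 0, 1])   # 4:1 class imbalance
w = balanced_weights(y)         # the lone minority instance gets the largest weight
```

Multiplying each instance's loss term by its weight before averaging makes the minority class contribute as much total gradient signal as the majority class.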
Resampling is an alternative to weighting. Oversampling duplicates rare instances and undersampling discards common ones. Both change the effective distribution of training instances seen by the model. SMOTE is a popular oversampling method that creates synthetic minority instances by interpolating between existing ones.
Not all instances are equally easy for a model to learn. An instance is sometimes called hard if the model consistently mispredicts it or assigns it low confidence, and easy if the model gets it right with high confidence early in training. The distinction matters because hard instances often dominate the loss and gradient signal late in training, while easy instances provide little new information once the model has converged on them.
Hard example mining is the practice of identifying difficult instances and giving them more attention. Online hard example mining (OHEM), introduced by Shrivastava and colleagues in 2016 for object detection, ranks region proposals by loss within each minibatch and trains only on the hardest fraction. Focal loss, introduced by Lin and colleagues in their 2017 RetinaNet paper, achieves a similar effect by smoothly downweighting easy examples in the loss function rather than discarding them.
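The downweighting behavior of focal loss can be sketched in a few lines. This is the binary form from the RetinaNet paper, without the optional class-balancing alpha term:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss: (1 - p_t)^gamma * cross-entropy.

    p is the predicted probability of class 1, y is the 0/1 label.
    p_t is the probability assigned to the true class, so easy examples
    (p_t near 1) are smoothly downweighted; gamma=0 recovers plain
    cross-entropy.
    """
    p_t = np.where(y == 1, p, 1.0 - p)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

easy = focal_loss(np.array([0.95]), np.array([1]))  # confident and correct
hard = focal_loss(np.array([0.2]), np.array([1]))   # confident and wrong
```

With gamma = 2, the easy example's loss is scaled by (0.05)^2 = 0.0025, so hard instances dominate the gradient by orders of magnitude.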
Curriculum learning, introduced by Bengio and colleagues in 2009, takes the opposite ordering. Easier instances are presented first and harder ones are introduced gradually, mimicking the way humans learn structured material. Both approaches treat instance difficulty as a first-class signal rather than a fixed property of the data.
Instance discrimination is a self-supervised pretraining technique in which each instance in the dataset is treated as its own class. The model is trained to recognize each instance as distinct from every other instance. The idea was introduced by Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin in their 2018 CVPR paper "Unsupervised Feature Learning via Non-Parametric Instance Discrimination." Their method stored a memory bank of feature vectors for every image in the training set and used noise-contrastive estimation to keep the computation tractable when the number of effective classes equals the number of training images, which can be in the millions.
Instance discrimination became the conceptual foundation for the contrastive learning wave that followed. Methods like MoCo (Momentum Contrast, He et al. 2020) replaced the memory bank with a momentum-updated encoder and a queue of negative examples. SimCLR (Chen et al. 2020) showed that strong data augmentation and a large batch of in-batch negatives could match or beat memory-bank approaches. Both still treat each augmented view of an instance as an anchor whose only positive partner is another augmentation of the same instance.
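The shared objective behind these methods is an InfoNCE-style contrastive loss in which the diagonal of a similarity matrix holds the matching (anchor, positive) pairs and every other instance in the batch serves as a negative. A simplified NumPy sketch (real implementations use learned encoders and backpropagation; here the "views" are just raw vectors):

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Contrastive loss where each instance's only positive is its own
    other view; all remaining instances in the batch act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                   # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # diagonal = matching pairs

anchors = np.eye(4, 8)  # four orthonormal "instances" in an 8-d feature space
matched = info_nce_loss(anchors, anchors)                        # correct pairing
shuffled = info_nce_loss(anchors, np.roll(anchors, 1, axis=0))   # wrong pairing
```

When each anchor's positive really is its own view, the loss is near zero; when the pairing is scrambled, each anchor's highest similarity points at a negative and the loss is large.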
In computer vision, instance takes on a slightly different meaning. An instance is a single object occurrence in an image, and instance segmentation is the task of producing a separate pixel mask for each object instance, even when several objects belong to the same class. The table below contrasts the three major segmentation tasks.
| Task | Output | Distinguishes individual objects? | Example |
|---|---|---|---|
| Semantic segmentation | One class label per pixel | No | All cars share the same "car" label |
| Instance segmentation | One mask per object instance | Yes | Each car gets its own mask |
| Panoptic segmentation | Per-pixel class plus instance ID for countable classes | Yes for things, no for stuff | Combines both |
The canonical instance segmentation model is Mask R-CNN, introduced by Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick at Facebook AI Research in 2017. It extends Faster R-CNN by adding a small fully convolutional mask prediction head in parallel with the existing classification and bounding-box regression heads. Each region of interest produces a class score, a refined box, and a binary mask. Mask R-CNN won the ICCV 2017 Best Paper Award (the Marr Prize) and remained the dominant framework for instance segmentation for several years. More recent work explores transformer-based approaches like DETR and Mask2Former.
Instance normalization is a normalization layer that computes statistics independently for each instance in a batch, rather than across the whole batch. It was introduced by Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky in their 2016 paper "Instance Normalization: The Missing Ingredient for Fast Stylization." The authors found that replacing batch normalization with per-instance normalization in a feed-forward style transfer network produced sharper, more consistent stylized outputs and let the network reach quality competitive with the slower iterative method of Gatys and colleagues. The intuition is that style transfer should be invariant to the contrast and brightness of the content image, and per-instance statistics remove that variation more effectively than batch statistics.
The table below contrasts the four main normalization variants by what dimensions they reduce over for a 4D tensor of shape (N, C, H, W).
| Method | Statistics computed over | Typical use |
|---|---|---|
| Batch normalization | (N, H, W) per channel | General classification, large batches |
| Layer normalization | (C, H, W) per instance | Transformers, RNNs |
| Instance normalization | (H, W) per instance per channel | Style transfer, generative models |
| Group normalization | (C/G, H, W) per instance, per group of channels | Small batches, detection, segmentation |
The word instance shows up in several other parts of computing with unrelated meanings. The table below disambiguates the three uses most likely to overlap with machine learning conversations.
| Sense | Meaning | Example |
|---|---|---|
| Data instance (this article) | A single data point in a dataset | One row of a spreadsheet, one image in ImageNet |
| Object-oriented programming | A concrete object created from a class | model = Model() creates a new instance of Model |
| Cloud computing | A virtual machine or container provisioned in a cloud platform | An AWS EC2 g5.xlarge instance used for GPU training |
The overlap is rarely accidental. A data scientist might write "I trained the model on 1 million instances using a g5.12xlarge instance," using two different senses of the word in the same sentence. Context usually disambiguates: data instances live in datasets, software instances live in memory, and cloud instances live in data centers.
| Concept | One-line definition |
|---|---|
| Instance | A single data point in a dataset |
| Example | Synonym for instance, common in PAC learning literature |
| Feature | A measurable property of an instance |
| Feature vector | The numerical representation of an instance's features |
| Dataset | A collection of instances |
| Labeled data | Instances paired with target values |
| Training data | The subset of instances used to fit the model |
| Test set | The subset of instances held out for final evaluation |
| Multiple instance learning | Learning from bags of instances with bag-level labels |
| Instance segmentation | Segmenting each object occurrence in an image |
| Instance normalization | Normalizing per instance rather than per batch |