In machine learning, an example is a single data point that a model trains on or makes a prediction about. In supervised settings an example is a pair $(x, y)$ where $x$ is a vector of features and $y$ is a label. In unsupervised settings an example is just the feature vector $x$. The word is used almost interchangeably with instance, sample, observation, record, row, and data point. Different research communities prefer different terms, but the underlying object is the same: one row of data that the learning algorithm consumes. The choice of which examples to collect, how to label them, how to split them across training and evaluation, and how to weight them during training shapes almost every property of the resulting model.
Imagine you are teaching a friend to tell apples from oranges. Each time you hold up a piece of fruit and say what it is, that is one example. The fruit itself is what your friend looks at. The name ("apple" or "orange") is the answer. After enough examples your friend can guess on their own. A computer learns the same way, just with millions of examples instead of a handful. If you only ever show your friend yellow apples, they will struggle when a green apple shows up. The mix of examples you choose decides what the learner can and cannot do later.
In the standard formulation of supervised learning, examples are assumed to be drawn independently and identically distributed (iid) from a fixed but unknown joint distribution $P(X, Y)$ over an input space $\mathcal{X}$ and an output space $\mathcal{Y}$. A training data set of size $n$ is then
$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$
where each $(x_i, y_i)$ is one example. The features $x_i$ may be a real-valued vector, a sequence of tokens, a tensor of pixel intensities, a graph, or any other structured object. The label $y_i$ may be a class index for classification, a real number for regression, a sequence for sequence-to-sequence tasks, or even another structured object.
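As a concrete sketch, a tiny labeled dataset can be held as parallel arrays, with one example corresponding to one row of features together with its label; the numbers below are invented purely for illustration.

```python
import numpy as np

# A toy dataset of n = 4 examples, each with 3 features and a binary label.
# The values are made up for illustration.
X = np.array([
    [5.1, 3.5, 1.4],   # features x_1
    [4.9, 3.0, 1.4],   # features x_2
    [6.3, 3.3, 6.0],   # features x_3
    [5.8, 2.7, 5.1],   # features x_4
])
y = np.array([0, 0, 1, 1])  # labels y_1 ... y_4

# One example is one (x_i, y_i) pair: a row of X together with its label.
x_i, y_i = X[2], y[2]
print(x_i, y_i)
```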
In unsupervised settings the examples are just $\{x_1, \ldots, x_n\}$ with no labels attached. The iid assumption is convenient for theory but rarely holds exactly in practice. Real datasets tend to be biased by the collection process, by the population that was sampled, and by the labelers who annotated them.
Different communities use different words for the same idea, which causes a lot of needless confusion when reading across fields.
| Term | Field where common | Notes |
|---|---|---|
| Example | Machine learning, statistical learning theory | The default in PAC-learning literature and in textbooks by Mitchell and Vapnik |
| Instance | Machine learning, data mining | Used in Witten and Frank's Data Mining and Mitchell's Machine Learning |
| Sample | Deep learning, signal processing | Ambiguous: in statistics a "sample" usually means a set of observations, not one |
| Observation | Statistics, econometrics | Standard in regression and time series |
| Record | Databases, data engineering | One row in a table or one document in a store |
| Row | Tabular ML, pandas, SQL | Common when data is held in a DataFrame |
| Data point | Geometry, visualization | Emphasizes the view of an example as a point in feature space |
| Datum | Older statistics texts | Singular of "data"; rarely used in modern ML papers |
| Training example | ML pedagogy | Emphasizes that the example is part of the training set |
| Demonstration / shot | Prompt engineering, in-context learning | An example placed in a prompt rather than in a training set |
"Sample" is the most slippery of these. A deep learning practitioner who says "the model saw 10 million samples" means 10 million examples. A classical statistician who says "we drew a sample of size 1000" means a single dataset of 1000 observations.
The shape of an example depends on the kind of learning being done.
| Paradigm | Form of an example | Typical label |
|---|---|---|
| Supervised learning | $(x, y)$ pair | Class index, real number, sequence |
| Unsupervised learning | $x$ alone | None |
| Semi-supervised learning | Mix of $(x, y)$ and $x$ | Some labeled, most unlabeled |
| Self-supervised learning | $x$ split into input and target | Derived from $x$ itself, e.g., next token or masked patch |
| Reinforcement learning | Trajectory of $(s_t, a_t, r_t)$ | Reward signal |
| Active learning | $x$ chosen by the learner, then queried for $y$ | Label produced by oracle on demand |
| In-context learning | $(x, y)$ shown inside the prompt | Demonstration, no weight update |
In supervised learning the examples are the raw material. In self-supervised learning, examples are constructed from unlabeled text or images by hiding part of the input and asking the model to predict it. The trillions of next-token prediction examples used to pretrain modern large language models all come from this trick. In reinforcement learning, the unit is a transition or full trajectory, which is a sequence of state, action, and reward triples rather than a single (input, label) pair.
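A minimal sketch of how a single unlabeled token sequence is turned into many next-token prediction examples; the whitespace tokenization below is a simplification of the subword tokenizers used in practice.

```python
# Turning one unlabeled token sequence into many (input, target) examples
# for next-token prediction. Tokens here are just words for illustration.
tokens = "the cat sat on the mat".split()

examples = []
for t in range(1, len(tokens)):
    context = tokens[:t]      # everything seen so far is the input
    target = tokens[t]        # the next token is the label
    examples.append((context, target))

for context, target in examples:
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ... and so on: the labels come from the data itself.
```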
The distinction between labeled and unlabeled examples sits at the heart of semi-supervised learning, which was given its modern formulation in the 2006 MIT Press volume edited by Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Labels are usually expensive: a doctor must read the X-ray, a translator must produce the target sentence, a judge must rate the search result. Unlabeled inputs are usually cheap: web pages, photographs, and recordings can be collected by the billion at almost no cost.
| Type | Composition | Where it comes from | Typical use |
|---|---|---|---|
| Labeled example | Features plus a target label | Human annotation, click logs, sensor measurements | Supervised training, evaluation |
| Unlabeled example | Features only | Crawls, scrapes, sensor streams | Pretraining, unsupervised learning, semi-supervised learning |
| Weakly labeled example | Features plus a noisy or partial label | Distant supervision, hashtags, search queries | Pretraining, weak supervision |
| Synthetic example | Features (and possibly label) generated by a model or simulator | GANs, simulators, LLMs | Data augmentation, fine-tuning |
The ratio of labeled to unlabeled examples in a project drives many design choices. If labels are abundant, plain supervised learning is usually the right tool. If labels are scarce but inputs are plentiful, semi-supervised learning, self-training, and pretraining followed by fine-tuning all become attractive.
"Example" and "instance" are usually treated as synonyms. The two communities that introduced them simply chose different words. There is one place where the distinction matters: multi-instance learning (MIL). MIL was introduced by Thomas Dietterich, Richard Lathrop, and Tomás Lozano-Pérez in a 1997 paper on drug activity prediction. In MIL the unit of training is a bag of instances rather than a single instance. The label is attached to the bag. A drug molecule can fold into many shapes (conformers), and the molecule binds to a target if at least one of its shapes binds. Only the molecule's overall activity is observed, not which conformer is responsible. So the bag of conformers is the example, but each conformer is an instance.
This bag-versus-instance split shows up again in computer vision (where an image is a bag of patches), in document classification (where a document is a bag of sentences), and in pathology (where a slide is a bag of tiles). Outside of MIL the distinction collapses and the two words are interchangeable.
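A minimal sketch of the standard MIL assumption, where a bag is positive if at least one of its instances is positive; the instance scores below stand in for a hypothetical per-conformer classifier.

```python
# Multi-instance learning: the training example is a bag of instances and the
# label is attached to the bag.

def bag_label(instance_scores, threshold=0.5):
    """Max-pool instance scores: the bag is positive if any instance crosses the threshold."""
    return int(max(instance_scores) >= threshold)

# A molecule (the bag) with three conformers (the instances); the scores are
# invented outputs of a hypothetical per-conformer binding classifier.
molecule = [0.1, 0.05, 0.9]
print(bag_label(molecule))  # 1: one conformer binds, so the molecule is active
```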
Modern training rarely processes one example at a time. Instead, examples are grouped into a batch (sometimes called a mini-batch) and passed through the model together. The gradient of the loss is then averaged across the batch before the weights are updated.
| Term | What it refers to |
|---|---|
| Example | One $(x, y)$ pair, the smallest unit of data |
| Batch | A group of examples processed together in one forward and backward pass |
| Epoch | One full pass through every example in the training set |
| Step / iteration | One gradient update, usually one batch |
Batch size is one of the most discussed hyperparameters in deep learning. Small batches give noisier gradient estimates but often generalize better. Large batches use hardware more efficiently and reduce the variance of the gradient estimate, at the cost of needing more careful learning-rate tuning.
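A minimal NumPy sketch of how examples, batches, epochs, and steps relate inside a training loop, using plain least-squares gradients on invented data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # 1000 examples, 5 features
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
batch_size, lr = 32, 0.1

for epoch in range(3):                               # one epoch = one full pass over the examples
    perm = rng.permutation(len(X))                   # shuffle examples each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]                      # one batch of examples
        grad = 2 * xb.T @ (xb @ w - yb) / len(xb)    # gradient averaged over the batch
        w -= lr * grad                               # one step / iteration
```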
A single dataset is almost always split into disjoint subsets so that performance can be measured honestly.
| Split | Purpose | Typical fraction |
|---|---|---|
| Training set | Examples used to fit model parameters | 60 to 90 percent |
| Validation (dev) set | Examples used for hyperparameter tuning and early stopping | 5 to 20 percent |
| Test set | Examples used only at the end to estimate generalization | 5 to 20 percent |
A cardinal rule of supervised ML is that test examples must never be used to train or tune the model. When they are, the test error is no longer an honest estimate of how the model will perform on new data. Information leakage between splits is one of the most common reasons published results fail to reproduce.
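A minimal sketch of a three-way split using scikit-learn's train_test_split on toy data; the 80/10/10 fractions are illustrative, not prescriptive.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)              # 1000 examples, 10 features (toy data)
y = np.random.randint(0, 2, size=1000)

# First carve off 20 percent, then split that half-and-half into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# The test examples are held out until the very end; they are never used to
# fit parameters or to choose hyperparameters.
```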
Not every example is equally informative. Some are easy: the model gets them right almost from the first epoch. Others are hard: the model misclassifies them again and again, or the loss on them stays high. The recognition that this matters led to several lines of work.
Curriculum learning, introduced by Yoshua Bengio and colleagues in 2009, presents examples in order of increasing difficulty rather than at random. The intuition is borrowed from human and animal learning: build up from simple cases to complex ones. Bengio et al. reported faster convergence and better final solutions on shape recognition and language modeling tasks.
Hard-example mining goes the other way. Object detectors and face recognizers often train on a stream of mostly easy negatives. Online hard example mining (OHEM), proposed by Abhinav Shrivastava and colleagues, deliberately oversamples the examples on which the current model performs worst. This focuses gradient signal where it matters and tends to improve detection accuracy.
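The core selection step can be sketched as keeping the top-k losses in a batch; this omits the detector-specific machinery (region proposals, non-maximum suppression) of the original OHEM paper.

```python
import numpy as np

def hard_example_indices(per_example_losses, k):
    """Return the indices of the k examples with the highest loss.

    A simplified sketch of the idea behind online hard example mining:
    the update is computed from the examples the current model handles worst.
    """
    losses = np.asarray(per_example_losses)
    return np.argsort(losses)[-k:]

losses = [0.02, 1.7, 0.05, 0.9, 0.01, 2.3]   # invented per-example losses
print(hard_example_indices(losses, k=2))      # indices of the two hardest examples
```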
Noisy or mislabeled examples are a third category. They are hard for the wrong reason. A line of confident learning research, including work by Curtis Northcutt and colleagues at MIT, audits popular benchmarks and finds that test sets like ImageNet, MNIST, and CIFAR contain real labeling errors that depress reported accuracies of strong models.
Adversarial examples are inputs that have been intentionally perturbed so that a model misclassifies them, often by a perturbation imperceptible to humans. The phenomenon was reported by Christian Szegedy and colleagues in the 2014 ICLR paper Intriguing properties of neural networks. They showed that almost any image could be perturbed by a tiny amount that causes a confident image classifier to assign the wrong label. The perturbations also transferred between models trained on different subsets of the data.
Ian Goodfellow, Jonathon Shlens, and Christian Szegedy followed up in 2014 with Explaining and Harnessing Adversarial Examples, which introduced the fast gradient sign method (FGSM) and argued that the underlying cause was the near-linear behavior of deep networks in high-dimensional input space rather than overfitting. FGSM constructs an adversarial example as $x' = x + \varepsilon \cdot \text{sign}(\nabla_x J(\theta, x, y))$, where $\varepsilon$ controls the size of the perturbation and $J$ is the loss. Including these perturbed examples in training, called adversarial training, became one of the standard defenses.
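A minimal PyTorch sketch of that update; the model, loss function, and valid input range are assumptions left to the caller.

```python
import torch

def fgsm_perturb(model, loss_fn, x, y, epsilon):
    """One FGSM step: x_adv = x + epsilon * sign(grad_x J(theta, x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    # For images, the result is usually clamped back to the valid pixel range,
    # e.g. x_adv = x_adv.clamp(0.0, 1.0).
    return x_adv.detach()
```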
Adversarial examples are a serious safety concern in any deployment where an attacker can craft inputs: spam filtering, malware detection, face recognition, and autonomous driving have all been shown to be vulnerable.
A counterfactual example is closely related but built for a different purpose: explanation rather than attack. Sandra Wachter, Brent Mittelstadt, and Chris Russell argued in their 2017 paper Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR that data subjects have a legitimate interest in knowing what minimal change to their input would have flipped a model's decision. If a loan application was denied, a useful explanation might be "if your annual income had been 4,000 USD higher, the model would have approved you." The counterfactual is an example that the model classifies the way the user wanted, and that is as close as possible to their actual input. Counterfactual explanations have since become a standard tool in the interpretable machine learning literature and are often discussed in the context of the EU General Data Protection Regulation.
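A toy sketch of a one-feature counterfactual search against a scikit-learn classifier trained on invented loan data; real counterfactual methods optimize over all features with distance and plausibility constraints.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy loan data: features are [annual_income_kusd, debt_kusd]; both the data
# and the labels are invented purely to illustrate the search.
X = np.array([[30, 20], [40, 15], [55, 10], [70, 5], [35, 25], [80, 2]])
y = np.array([0, 0, 1, 1, 0, 1])
model = LogisticRegression().fit(X, y)

applicant = np.array([38.0, 18.0])   # on this invented data, expected to be denied

# Greedy one-feature search: the smallest income increase that flips the decision.
x_cf = applicant.copy()
while model.predict([x_cf])[0] == 0 and x_cf[0] < 200:
    x_cf[0] += 0.5                    # raise income in 500 USD steps
print(f"approve if income rises by {x_cf[0] - applicant[0]:.1f} thousand USD")
```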
When real labeled examples are scarce, synthetic ones can fill the gap. The simplest form is data augmentation: cropping, flipping, and color-jittering images, or paraphrasing text. More elaborate methods generate entirely new examples.
| Method | How examples are produced | Typical use |
|---|---|---|
| Data augmentation | Programmatic transforms of real examples | Vision, audio, NLP |
| Simulation | Examples sampled from a physics or game engine | Robotics, autonomous driving |
| GAN-generated examples | A generative network draws from the data distribution | Image synthesis, privacy-preserving data |
| LLM-generated examples | A large language model writes new text examples | Instruction tuning, classification, evaluation |
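A minimal NumPy sketch of augmentation by flipping and cropping; production pipelines typically use library transforms (for example, torchvision) instead.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Produce a new training example from a real one by simple transforms."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]                     # horizontal flip
    top = rng.integers(0, 5)
    left = rng.integers(0, 5)
    out = out[top:top + 24, left:left + 24]    # random 24x24 crop of a 28x28 image
    return out

image = rng.random((28, 28))                    # toy grayscale image
augmented = [augment(image) for _ in range(8)]  # eight synthetic variants of one example
```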
The LLM-generated case has exploded in importance since 2022. Yizhong Wang and colleagues' Self-Instruct: Aligning Language Models with Self-Generated Instructions bootstraps an instruction-tuning dataset by prompting a base model to generate new tasks from a small seed set. The original paper produced over 52,000 instructions and 82,000 instances, then used them to fine-tune GPT-3 to behave more like InstructGPT despite using no human-written demonstrations beyond the seed. Stanford Alpaca followed the same recipe with examples generated by GPT-3.5, and Anthropic's Constitutional AI used a model to critique and revise its own outputs to produce preference examples without human labelers in the loop.
Synthetic examples are not free. Errors compound: if the generator hallucinates a fact, every example built on it is wrong. Quality filtering and careful prompt design are the main defenses.
Large language models stretch the meaning of "example" in interesting ways. The same sequence of tokens can serve as many different examples depending on which sub-sequence is treated as the target.
| Stage | What an example looks like | Where the labels come from |
|---|---|---|
| Pretraining | One token sequence, with each next token as the target | The data itself (self-supervised) |
| Supervised fine-tuning (SFT) | A prompt and a desired response | Human writers or another model |
| Reward modeling | Two responses to the same prompt, one preferred | Human ranking |
| RLHF policy training | A prompt, a sampled response, and a scalar reward | Reward model |
| In-context learning | Demonstrations packed into the prompt | Static, no gradient update |
Long Ouyang and colleagues at OpenAI laid out this multi-stage pipeline in Training Language Models to Follow Instructions With Human Feedback, the 2022 InstructGPT paper. Reinforcement learning from human feedback (RLHF) reframes "example" as a preference pair: the labeler picks the better of two model outputs, and the reward model learns to predict which a human would prefer. This pairwise structure traces back to the Bradley-Terry model from 1952 and turns subjective judgments into a differentiable training signal.
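A minimal PyTorch sketch of the pairwise Bradley-Terry style loss a reward model might be trained with; the scalar scores below are invented.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss for reward modeling.

    Each training example is a pair of responses to the same prompt;
    the arguments are the reward model's scalar scores for the preferred
    and the dispreferred response.
    """
    # -log sigmoid(r_chosen - r_rejected): minimized when the chosen
    # response is scored higher than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Invented scores for a batch of four preference pairs.
chosen = torch.tensor([1.2, 0.3, 2.0, -0.1])
rejected = torch.tensor([0.4, 0.5, 1.0, -0.6])
print(preference_loss(chosen, rejected))
```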
Few-shot learning and zero-shot learning further blur the line between training and inference. In Tom Brown and colleagues' 2020 paper Language Models Are Few-Shot Learners, GPT-3 is shown $K$ input-output demonstrations inside the prompt and asked to complete the next case. No weights are updated. The demonstrations act as examples, but they live in the context window rather than in a training file. The phrase "shot" comes from one-shot and few-shot learning in computer vision.
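A minimal sketch of packing demonstrations into a prompt; the "Input:/Label:" layout is one arbitrary convention among many.

```python
def few_shot_prompt(demonstrations, query):
    """Pack K labeled examples into a prompt, followed by the unlabeled query."""
    lines = [f"Input: {x}\nLabel: {y}" for x, y in demonstrations]
    lines.append(f"Input: {query}\nLabel:")
    return "\n\n".join(lines)

demos = [("the movie was wonderful", "positive"),
         ("a complete waste of time", "negative")]
print(few_shot_prompt(demos, "gripping from start to finish"))
```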
Not every example contributes equally to the loss. A common pattern is to upweight rare classes or important subgroups so that they are not drowned out by the majority. In the simplest form the loss becomes
$$L = \frac{1}{n} \sum_{i=1}^{n} w_i \cdot \ell(f(x_i), y_i)$$
where $w_i$ is a per-example weight. Importance sampling, class-balanced loss, and curriculum reweighting are all variants of this idea. Choosing weights carelessly can introduce bias; choosing them deliberately is one of the main levers for handling class imbalance and distribution shift.
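A minimal NumPy sketch of this weighted loss; the weights below upweight an invented minority class.

```python
import numpy as np

def weighted_loss(per_example_losses, weights):
    """Mean of per-example losses, each scaled by its weight w_i."""
    losses = np.asarray(per_example_losses, dtype=float)
    w = np.asarray(weights, dtype=float)
    return np.mean(w * losses)

# Upweight the rare class: examples at indices 2 and 5 belong to a minority class.
losses = [0.3, 0.1, 0.9, 0.2, 0.4, 1.1]
weights = [1.0, 1.0, 5.0, 1.0, 1.0, 5.0]
print(weighted_loss(losses, weights))
```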
Examples need to live somewhere. The format depends on scale, on access patterns, and on the framework in use.
| Format | Layout | Common in | Notes |
|---|---|---|---|
| CSV / TSV | Row-oriented text | Tabular data, small datasets | Human-readable, slow to parse, no schema |
| JSON / JSONL | Row-oriented text, one example per line | NLP, instruction tuning | Easy to inspect, supports nested structure |
| Parquet | Columnar binary | Large tabular pipelines, Spark, BigQuery | Compressed, fast to scan a subset of columns |
| Apache Arrow | Columnar in-memory | Hugging Face Datasets, pandas | Zero-copy reads, language-agnostic |
| TFRecord | Sequence of serialized tf.train.Example protos | TensorFlow pipelines | High streaming throughput, opaque without a schema |
| WebDataset | Tar shards of files | Large vision pretraining | Streams from object storage, easy sharding |
A TensorFlow tf.train.Example is, true to its name, exactly the data structure described in this article: a single instance, encoded as a dictionary of feature names to values. The record format, TFRecord, simply packs many of these protos into a binary stream that data loaders can read sequentially.
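A minimal sketch of building one tf.train.Example and writing it to a TFRecord file; the feature names are arbitrary.

```python
import tensorflow as tf

# One example encoded as a dictionary of feature names to typed values.
example = tf.train.Example(features=tf.train.Features(feature={
    "text": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"the movie was wonderful"])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
}))

# A TFRecord file is just many of these serialized protos back to back.
with tf.io.TFRecordWriter("train.tfrecord") as writer:
    writer.write(example.SerializeToString())
```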
An example is only as good as its label. Mislabeled examples poison the gradient signal, mislead reward models, and inflate or deflate evaluation metrics depending on which way the noise runs. The Northcutt confident learning audits found error rates of several percent in widely used benchmarks. Anthropic, OpenAI, and others have invested heavily in labeler training and inter-annotator agreement for the same reason. The trend in 2024 and 2025 has been toward fewer, higher-quality examples for fine-tuning, with synthetic generation used to amplify a small, carefully curated seed rather than to flood the model with cheap data.
In pretraining the picture is different. Quantity still matters, but so does deduplication, contamination filtering, and quality scoring. The same web page repeated a thousand times in the corpus does not produce a thousand distinct examples worth of learning signal.
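A minimal sketch of exact deduplication by hashing normalized text; real pretraining pipelines also need near-duplicate detection such as MinHash, which this does not cover.

```python
import hashlib

def dedup(documents):
    """Drop exact duplicates by hashing lightly normalized text."""
    seen, unique = set(), []
    for doc in documents:
        key = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = ["Same page.", "same page.", "A different page."]
print(dedup(corpus))  # the repeated page contributes only one example
```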