In machine learning, an example is a single data point that a model trains on or makes a prediction about. In supervised settings an example is a pair $(x, y)$ where $x$ is a vector of features and $y$ is a label. In unsupervised settings an example is just the feature vector $x$. The word is used almost interchangeably with instance, sample, observation, record, row, and data point. Different research communities prefer different terms, but the underlying object is the same: one row of data that the learning algorithm consumes. The choice of which examples to collect, how to label them, how to split them across training and evaluation, and how to weight them during training shapes almost every property of the resulting model.
Imagine you are teaching a friend to tell apples from oranges. Each time you hold up a piece of fruit and say what it is, that is one example. The fruit itself is what your friend looks at. The name ("apple" or "orange") is the answer. After enough examples your friend can guess on their own. A computer learns the same way, just with millions of examples instead of a handful. If you only ever show your friend yellow apples, they will struggle when a green apple shows up. The mix of examples you choose decides what the learner can and cannot do later.
In the standard formulation of supervised learning, examples are assumed to be drawn independently and identically distributed (iid) from a fixed but unknown joint distribution $P(X, Y)$ over an input space $\mathcal{X}$ and an output space $\mathcal{Y}$. A training data set of size $n$ is then
$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$
where each $(x_i, y_i)$ is one example. The features $x_i$ may be a real-valued vector, a sequence of tokens, a tensor of pixel intensities, a graph, or any other structured object. The label $y_i$ may be a class index for classification, a real number for regression, a sequence for sequence-to-sequence tasks, or even another structured object.
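As a concrete sketch, a tiny labeled dataset can be held as parallel arrays, with one example corresponding to one row of features together with its label; the numbers below are invented purely for illustration.

```python
import numpy as np

# A toy dataset of n = 4 examples, each with 3 features and a binary label.
# The values are made up for illustration.
X = np.array([
    [5.1, 3.5, 1.4],   # features x_1
    [4.9, 3.0, 1.4],   # features x_2
    [6.3, 3.3, 6.0],   # features x_3
    [5.8, 2.7, 5.1],   # features x_4
])
y = np.array([0, 0, 1, 1])  # labels y_1 ... y_4

# One example is one (x_i, y_i) pair: a row of X together with its label.
x_i, y_i = X[2], y[2]
print(x_i, y_i)
```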
In unsupervised settings the examples are just $\{x_1, \ldots, x_n\}$ with no labels attached. The iid assumption is convenient for theory but rarely holds exactly in practice. Real datasets tend to be biased by the collection process, by the population that was sampled, and by the labelers who annotated them.
Different communities use different words for the same idea, which causes a lot of needless confusion when reading across fields.
| Term | Field where common | Notes |
|---|---|---|
| Example | Machine learning, statistical learning theory | The default in PAC-learning literature and in textbooks by Mitchell and Vapnik |
| Instance | Machine learning, data mining | Used in Witten and Frank's Data Mining and Mitchell's Machine Learning |
| Sample | Deep learning, signal processing | Ambiguous: in statistics a "sample" usually means a set of observations, not one |
| Observation | Statistics, econometrics | Standard in regression and time series |
| Record | Databases, data engineering | One row in a table or one document in a store |
| Row | Tabular ML, pandas, SQL | Common when data is held in a DataFrame |
| Data point | Geometry, visualization | Emphasizes the view of an example as a point in feature space |
| Datum | Older statistics texts | Singular of "data"; rarely used in modern ML papers |
| Training example | ML pedagogy | Emphasizes that the example is part of the training set |
| Demonstration / shot | Prompt engineering, in-context learning | An example placed in a prompt rather than in a training set |
"Sample" is the most slippery of these. A deep learning practitioner who says "the model saw 10 million samples" means 10 million examples. A classical statistician who says "we drew a sample of size 1000" means a single dataset of 1000 observations.
The shape of an example depends on the kind of learning being done.
| Paradigm | Form of an example | Typical label |
|---|---|---|
| Supervised learning | $(x, y)$ pair | Class index, real number, sequence |
| Unsupervised learning | $x$ alone | None |
| Semi-supervised learning | Mix of $(x, y)$ and $x$ | Some labeled, most unlabeled |
| Self-supervised learning | $x$ split into input and target | Derived from $x$ itself, e.g., next token or masked patch |
| Reinforcement learning | Trajectory of $(s_t, a_t, r_t)$ | Reward signal |
| Active learning | $x$ chosen by the learner, then queried for $y$ | Label produced by oracle on demand |
| In-context learning | $(x, y)$ shown inside the prompt | Demonstration, no weight update |
In supervised learning the examples are the raw material. In self-supervised learning, examples are constructed from unlabeled text or images by hiding part of the input and asking the model to predict it. The trillions of next-token prediction examples used to pretrain modern large language models all come from this trick. In reinforcement learning, the unit is a transition or full trajectory, which is a sequence of state, action, and reward triples rather than a single (input, label) pair.
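A minimal sketch of how a single unlabeled token sequence is turned into many next-token prediction examples; the whitespace tokenization below is a simplification of the subword tokenizers used in practice.

```python
# Turning one unlabeled token sequence into many (input, target) examples
# for next-token prediction. Tokens here are just words for illustration.
tokens = "the cat sat on the mat".split()

examples = []
for t in range(1, len(tokens)):
    context = tokens[:t]      # everything seen so far is the input
    target = tokens[t]        # the next token is the label
    examples.append((context, target))

for context, target in examples:
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ... and so on: the labels come from the data itself.
```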
The distinction between labeled and unlabeled examples sits at the heart of semi-supervised learning, which was given its modern formulation in the 2006 MIT Press volume edited by Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Labels are usually expensive: a doctor must read the X-ray, a translator must produce the target sentence, a judge must rate the search result. Unlabeled inputs are usually cheap: web pages, photographs, and recordings can be collected by the billion at almost no cost.
| Type | Composition | Where it comes from | Typical use |
|---|---|---|---|
| Labeled example | Features plus a target label | Human annotation, click logs, sensor measurements | Supervised training, evaluation |
| Unlabeled example | Features only | Crawls, scrapes, sensor streams | Pretraining, unsupervised learning, semi-supervised learning |
| Weakly labeled example | Features plus a noisy or partial label | Distant supervision, hashtags, search queries | Pretraining, weak supervision |
| Synthetic example | Features (and possibly label) generated by a model or simulator | GANs, simulators, LLMs | Data augmentation, fine-tuning |
The ratio of labeled to unlabeled examples in a project drives many design choices. If labels are abundant, plain supervised learning is usually the right tool. If labels are scarce but inputs are plentiful, semi-supervised learning, self-training, and pretraining followed by fine-tuning all become attractive.
"Example" and "instance" are usually treated as synonyms. The two communities that introduced them simply chose different words. There is one place where the distinction matters: multi-instance learning (MIL). MIL was introduced by Thomas Dietterich, Richard Lathrop, and Tomás Lozano-Pérez in a 1997 paper on drug activity prediction. In MIL the unit of training is a bag of instances rather than a single instance. The label is attached to the bag. A drug molecule can fold into many shapes (conformers), and the molecule binds to a target if at least one of its shapes binds. Only the molecule's overall activity is observed, not which conformer is responsible. So the bag of conformers is the example, but each conformer is an instance.
This bag-versus-instance split shows up again in computer vision (where an image is a bag of patches), in document classification (where a document is a bag of sentences), and in pathology (where a slide is a bag of tiles). Outside of MIL the distinction collapses and the two words are interchangeable.
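A minimal sketch of the standard MIL assumption, where a bag is positive if at least one of its instances is positive; the instance scores below stand in for a hypothetical per-conformer classifier.

```python
# Multi-instance learning: the training example is a bag of instances and the
# label is attached to the bag.

def bag_label(instance_scores, threshold=0.5):
    """Max-pool instance scores: the bag is positive if any instance crosses the threshold."""
    return int(max(instance_scores) >= threshold)

# A molecule (the bag) with three conformers (the instances); the scores are
# invented outputs of a hypothetical per-conformer binding classifier.
molecule = [0.1, 0.05, 0.9]
print(bag_label(molecule))  # 1: one conformer binds, so the molecule is active
```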
Modern training rarely processes one example at a time. Instead, examples are grouped into a batch (sometimes called a mini-batch) and passed through the model together. The gradient of the loss is then averaged across the batch before the weights are updated.
| Term | What it refers to |
|---|---|
| Example | One $(x, y)$ pair, the smallest unit of data |
| Batch | A group of examples processed together in one forward and backward pass |
| Epoch | One full pass through every example in the training set |
| Step / iteration | One gradient update, usually one batch |
Batch size is one of the most discussed hyperparameters in deep learning. Small batches give noisier gradient estimates but often generalize better. Large batches use hardware more efficiently and reduce the variance of the gradient estimate, at the cost of needing more careful learning-rate tuning.
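A minimal NumPy sketch of how examples, batches, epochs, and steps relate inside a training loop, using plain least-squares gradients on invented data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # 1000 examples, 5 features
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
batch_size, lr = 32, 0.1

for epoch in range(3):                               # one epoch = one full pass over the examples
    perm = rng.permutation(len(X))                   # shuffle examples each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]                      # one batch of examples
        grad = 2 * xb.T @ (xb @ w - yb) / len(xb)    # gradient averaged over the batch
        w -= lr * grad                               # one step / iteration
```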
A single dataset is almost always split into disjoint subsets so that performance can be measured honestly.
| Split | Purpose | Typical fraction |
|---|---|---|
| Training set | Examples used to fit model parameters | 60 to 90 percent |
| Validation (dev) set | Examples used for hyperparameter tuning and early stopping | 5 to 20 percent |
| Test set | Examples used only at the end to estimate generalization | 5 to 20 percent |
A cardinal rule of supervised ML is that test examples must never be used to train or tune the model. When they are, the test error is no longer an honest estimate of how the model will perform on new data. Information leakage between splits is one of the most common reasons published results fail to reproduce.
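A minimal sketch of a three-way split using scikit-learn's train_test_split on toy data; the 80/10/10 fractions are illustrative, not prescriptive.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)              # 1000 examples, 10 features (toy data)
y = np.random.randint(0, 2, size=1000)

# First carve off 20 percent, then split that half-and-half into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# The test examples are held out until the very end; they are never used to
# fit parameters or to choose hyperparameters.
```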
Not every example is equally informative. Some are easy: the model gets them right almost from the first epoch. Others are hard: the model misclassifies them again and again, or the loss on them stays high. The recognition that this matters led to several lines of work.
Curriculum learning, introduced by Yoshua Bengio and colleagues in 2009, presents examples in order of increasing difficulty rather than at random. The intuition is borrowed from human and animal learning: build up from simple cases to complex ones. Bengio et al. reported faster convergence and better final solutions on shape recognition and language modeling tasks.
Hard-example mining goes the other way. Object detectors and face recognizers often train on a stream of mostly easy negatives. Online hard example mining (OHEM), proposed by Abhinav Shrivastava and colleagues, deliberately oversamples the examples on which the current model performs worst. This focuses gradient signal where it matters and tends to improve detection accuracy.
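The core selection step can be sketched as keeping the top-k losses in a batch; this omits the detector-specific machinery (region proposals, non-maximum suppression) of the original OHEM paper.

```python
import numpy as np

def hard_example_indices(per_example_losses, k):
    """Return the indices of the k examples with the highest loss.

    A simplified sketch of the idea behind online hard example mining:
    the update is computed from the examples the current model handles worst.
    """
    losses = np.asarray(per_example_losses)
    return np.argsort(losses)[-k:]

losses = [0.02, 1.7, 0.05, 0.9, 0.01, 2.3]   # invented per-example losses
print(hard_example_indices(losses, k=2))      # indices of the two hardest examples
```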
Noisy or mislabeled examples are a third category. They are hard for the wrong reason. A line of confident learning research, including work by Curtis Northcutt and colleagues at MIT, audits popular benchmarks and finds that test sets like ImageNet, MNIST, and CIFAR contain real labeling errors that depress reported accuracies of strong models.
Adversarial examples are inputs that have been intentionally perturbed so that a model misclassifies them, often by a perturbation imperceptible to humans. The phenomenon was reported by Christian Szegedy and colleagues in the 2014 ICLR paper Intriguing properties of neural networks. They showed that almost any image could be perturbed by a tiny amount that causes a confident image classifier to assign the wrong label. The perturbations also transferred between models trained on different subsets of the data.
Ian Goodfellow, Jonathon Shlens, and Christian Szegedy followed up in 2014 with Explaining and Harnessing Adversarial Examples, which introduced the fast gradient sign method (FGSM) and argued that the underlying cause was the near-linear behavior of deep networks in high-dimensional input space rather than overfitting. FGSM constructs an adversarial example as $x' = x + \varepsilon \cdot \text{sign}(\nabla_x J(\theta, x, y))$, where $\varepsilon$ controls the size of the perturbation and $J$ is the loss. Including these perturbed examples in training, called adversarial training, became one of the standard defenses.
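A minimal PyTorch sketch of that update; the model, loss function, and valid input range are assumptions left to the caller.

```python
import torch

def fgsm_perturb(model, loss_fn, x, y, epsilon):
    """One FGSM step: x_adv = x + epsilon * sign(grad_x J(theta, x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    # For images, the result is usually clamped back to the valid pixel range,
    # e.g. x_adv = x_adv.clamp(0.0, 1.0).
    return x_adv.detach()
```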
Adversarial examples are a serious safety concern in any deployment where an attacker can craft inputs: spam filtering, malware detection, face recognition, and autonomous driving have all been shown to be vulnerable.
A counterfactual example is closely related but built for a different purpose: explanation rather than attack. Sandra Wachter, Brent Mittelstadt, and Chris Russell argued in their 2017 paper Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR that data subjects have a legitimate interest in knowing what minimal change to their input would have flipped a model's decision. If a loan application was denied, a useful explanation might be "if your annual income had been 4,000 USD higher, the model would have approved you." The counterfactual is an example that the model classifies the way the user wanted, and that is as close as possible to their actual input. Counterfactual explanations have since become a standard tool in the interpretable machine learning literature and are often discussed in the context of the EU General Data Protection Regulation.
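A toy sketch of a one-feature counterfactual search against a scikit-learn classifier trained on invented loan data; real counterfactual methods optimize over all features with distance and plausibility constraints.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy loan data: features are [annual_income_kusd, debt_kusd]; both the data
# and the labels are invented purely to illustrate the search.
X = np.array([[30, 20], [40, 15], [55, 10], [70, 5], [35, 25], [80, 2]])
y = np.array([0, 0, 1, 1, 0, 1])
model = LogisticRegression().fit(X, y)

applicant = np.array([38.0, 18.0])   # on this invented data, expected to be denied

# Greedy one-feature search: the smallest income increase that flips the decision.
x_cf = applicant.copy()
while model.predict([x_cf])[0] == 0 and x_cf[0] < 200:
    x_cf[0] += 0.5                    # raise income in 500 USD steps
print(f"approve if income rises by {x_cf[0] - applicant[0]:.1f} thousand USD")
```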
When real labeled examples are scarce, synthetic ones can fill the gap. The simplest form is data augmentation: cropping, flipping, and color-jittering images, or paraphrasing text. More elaborate methods generate entirely new examples.
| Method | How examples are produced | Typical use |
|---|---|---|
| Data augmentation | Programmatic transforms of real examples | Vision, audio, NLP |
| Simulation | Examples sampled from a physics or game engine | Robotics, autonomous driving |
| GAN-generated examples | A generative network draws from the data distribution | Image synthesis, privacy-preserving data |
| LLM-generated examples | A large language model writes new text examples | Instruction tuning, classification, evaluation |
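A minimal NumPy sketch of augmentation by flipping and cropping; production pipelines typically use library transforms (for example, torchvision) instead.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    """Produce a new training example from a real one by simple transforms."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]                     # horizontal flip
    top = rng.integers(0, 5)
    left = rng.integers(0, 5)
    out = out[top:top + 24, left:left + 24]    # random 24x24 crop of a 28x28 image
    return out

image = rng.random((28, 28))                    # toy grayscale image
augmented = [augment(image) for _ in range(8)]  # eight synthetic variants of one example
```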
The LLM-generated case has exploded in importance since 2022. Yizhong Wang and colleagues' Self-Instruct: Aligning Language Models with Self-Generated Instructions bootstraps an instruction-tuning dataset by prompting a base model to generate new tasks from a small seed set. The original paper produced over 52,000 instructions and 82,000 instances, then used them to fine-tune GPT-3 to behave more like InstructGPT despite using no human-written demonstrations beyond the seed. Stanford Alpaca followed the same recipe with examples generated by GPT-3.5, and Anthropic's Constitutional AI used a model to critique and revise its own outputs to produce preference examples without human labelers in the loop.
Synthetic examples are not free. Errors compound: if the generator hallucinates a fact, every example built on it is wrong. Quality filtering and careful prompt design are the main defenses.
Large language models stretch the meaning of "example" in interesting ways. The same sequence of tokens can serve as many different examples depending on which sub-sequence is treated as the target.
| Stage | What an example looks like | Where the labels come from |
|---|---|---|
| Pretraining | One token sequence, with each next token as the target | The data itself (self-supervised) |
| Supervised fine-tuning (SFT) | A prompt and a desired response | Human writers or another model |
| Reward modeling | Two responses to the same prompt, one preferred | Human ranking |
| RLHF policy training | A prompt, a sampled response, and a scalar reward | Reward model |
| In-context learning | Demonstrations packed into the prompt | Static, no gradient update |
Long Ouyang and colleagues at OpenAI laid out this multi-stage pipeline in Training Language Models to Follow Instructions With Human Feedback, the 2022 InstructGPT paper. Reinforcement learning from human feedback (RLHF) reframes "example" as a preference pair: the labeler picks the better of two model outputs, and the reward model learns to predict which a human would prefer. This pairwise structure traces back to the Bradley-Terry model from 1952 and turns subjective judgments into a differentiable training signal.
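A minimal PyTorch sketch of the pairwise Bradley-Terry style loss a reward model might be trained with; the scalar scores below are invented.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss for reward modeling.

    Each training example is a pair of responses to the same prompt;
    the arguments are the reward model's scalar scores for the preferred
    and the dispreferred response.
    """
    # -log sigmoid(r_chosen - r_rejected): minimized when the chosen
    # response is scored higher than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Invented scores for a batch of four preference pairs.
chosen = torch.tensor([1.2, 0.3, 2.0, -0.1])
rejected = torch.tensor([0.4, 0.5, 1.0, -0.6])
print(preference_loss(chosen, rejected))
```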
Few-shot learning and zero-shot learning further blur the line between training and inference. In Tom Brown and colleagues' 2020 paper Language Models Are Few-Shot Learners, GPT-3 is shown $K$ input-output demonstrations inside the prompt and asked to complete the next case. No weights are updated. The demonstrations act as examples, but they live in the context window rather than in a training file. The phrase "shot" comes from one-shot and few-shot learning in computer vision.
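A minimal sketch of packing demonstrations into a prompt; the "Input:/Label:" layout is one arbitrary convention among many.

```python
def few_shot_prompt(demonstrations, query):
    """Pack K labeled examples into a prompt, followed by the unlabeled query."""
    lines = [f"Input: {x}\nLabel: {y}" for x, y in demonstrations]
    lines.append(f"Input: {query}\nLabel:")
    return "\n\n".join(lines)

demos = [("the movie was wonderful", "positive"),
         ("a complete waste of time", "negative")]
print(few_shot_prompt(demos, "gripping from start to finish"))
```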
Not every example contributes equally to the loss. A common pattern is to upweight rare classes or important subgroups so that they are not drowned out by the majority. In the simplest form the loss becomes
$$L = \frac{1}{n} \sum_{i=1}^{n} w_i \cdot \ell(f(x_i), y_i)$$
where $w_i$ is a per-example weight. Importance sampling, class-balanced loss, and curriculum reweighting are all variants of this idea. Choosing weights carelessly can introduce bias; choosing them deliberately is one of the main levers for handling class imbalance and distribution shift.
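A minimal NumPy sketch of this weighted loss; the weights below upweight an invented minority class.

```python
import numpy as np

def weighted_loss(per_example_losses, weights):
    """Mean of per-example losses, each scaled by its weight w_i."""
    losses = np.asarray(per_example_losses, dtype=float)
    w = np.asarray(weights, dtype=float)
    return np.mean(w * losses)

# Upweight the rare class: examples at indices 2 and 5 belong to a minority class.
losses = [0.3, 0.1, 0.9, 0.2, 0.4, 1.1]
weights = [1.0, 1.0, 5.0, 1.0, 1.0, 5.0]
print(weighted_loss(losses, weights))
```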
Examples need to live somewhere. The format depends on scale, on access patterns, and on the framework in use.
| Format | Layout | Common in | Notes |
|---|---|---|---|
| CSV / TSV | Row-oriented text | Tabular data, small datasets | Human-readable, slow to parse, no schema |
| JSON / JSONL | Row-oriented text, one example per line | NLP, instruction tuning | Easy to inspect, supports nested structure |
| Parquet | Columnar binary | Large tabular pipelines, Spark, BigQuery | Compressed, fast to scan a subset of columns |
| Apache Arrow | Columnar in-memory | Hugging Face Datasets, pandas | Zero-copy reads, language-agnostic |
| TFRecord | Sequence of serialized tf.train.Example protos | TensorFlow pipelines | High streaming throughput, opaque without a schema |
| WebDataset | Tar shards of files | Large vision pretraining | Streams from object storage, easy sharding |
A TensorFlow tf.train.Example is, true to its name, exactly the data structure described in this article: a single instance, encoded as a dictionary of feature names to values. The record format, TFRecord, simply packs many of these protos into a binary stream that data loaders can read sequentially.
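A minimal sketch of building one tf.train.Example and writing it to a TFRecord file; the feature names are arbitrary.

```python
import tensorflow as tf

# One example encoded as a dictionary of feature names to typed values.
example = tf.train.Example(features=tf.train.Features(feature={
    "text": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"the movie was wonderful"])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
}))

# A TFRecord file is just many of these serialized protos back to back.
with tf.io.TFRecordWriter("train.tfrecord") as writer:
    writer.write(example.SerializeToString())
```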
An example is only as good as its label. Mislabeled examples poison the gradient signal, mislead reward models, and inflate or deflate evaluation metrics depending on which way the noise runs. The Northcutt confident learning audits found error rates of several percent in widely used benchmarks. Anthropic, OpenAI, and others have invested heavily in labeler training and inter-annotator agreement for the same reason. The trend in 2024 and 2025 has been toward fewer, higher-quality examples for fine-tuning, with synthetic generation used to amplify a small, carefully curated seed rather than to flood the model with cheap data.
In pretraining the picture is different. Quantity still matters, but so does deduplication, contamination filtering, and quality scoring. The same web page repeated a thousand times in the corpus does not produce a thousand distinct examples worth of learning signal.
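A minimal sketch of exact deduplication by hashing normalized text; real pretraining pipelines also need near-duplicate detection such as MinHash, which this does not cover.

```python
import hashlib

def dedup(documents):
    """Drop exact duplicates by hashing lightly normalized text."""
    seen, unique = set(), []
    for doc in documents:
        key = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = ["Same page.", "same page.", "A different page."]
print(dedup(corpus))  # the repeated page contributes only one example
```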