See also: Machine learning terms
In supervised learning, the target is the variable that a model learns to predict from input features. It is the answer side of the dataset: for every training example, a target value tells the algorithm what the right output should be. The target is also called the label, the ground truth, the dependent variable, the response variable, the outcome, or simply y. Different communities prefer different words. Statisticians tend to say "response" or "dependent variable." Computer vision and NLP researchers usually say "label." Decision theorists and engineers often just write y on the board and move on.
Whatever you call it, the target is the thing you are trying to predict, and the choice of target is one of the most consequential decisions in any machine learning project. A model that perfectly predicts the wrong target is worse than useless because it gives confident, official-looking answers to a question nobody asked. Defining the target well, sourcing accurate target values, and validating their quality are usually harder than picking an algorithm.
The target appears under several symbols depending on the textbook. The most common are:
| Symbol | Typical use |
|---|---|
| y | Generic target value, especially in regression and classification |
| t | Target in neural network literature (Bishop's Pattern Recognition and Machine Learning uses this convention) |
| z | Sometimes used for latent or auxiliary targets |
| Y | Random variable for the target, with y as a realisation |
| ŷ | The model's prediction for y |
When training data has n examples, the targets are usually written as a vector y of length n, or as a matrix Y when each example has multiple target dimensions (multi-output regression, multi-label classification, or sequence outputs).
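To make the shapes concrete, here is a minimal NumPy sketch (the array names and sizes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of training examples

# Single-output regression: the targets form a length-n vector y.
y = rng.normal(size=n)                # shape (n,)

# Multi-label classification with K = 5 tags: the targets form an
# n-by-K binary indicator matrix Y, one column per tag.
K = 5
Y = rng.integers(0, 2, size=(n, K))   # shape (n, 5)

print(y.shape, Y.shape)               # (1000,) (1000, 5)
```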
The shape and meaning of the target change with the task. The table below summarises the main categories.
| Task | Target shape | Example | Typical loss |
|---|---|---|---|
| Regression | Scalar real number | House price in dollars | Mean squared error, MAE, Huber |
| Binary classification | 0 or 1, or sometimes -1 and +1 | Spam vs. not spam | Binary cross-entropy, hinge |
| Multi-class classification | One of K classes | Handwritten digit 0 through 9 | Categorical cross-entropy |
| Multi-label classification | Binary vector of length K | Tags on a photo (beach, sunset, dog) | Sum of binary cross-entropies |
| Ordinal regression | Ordered category | Movie rating from 1 to 5 stars | Ordinal hinge, cumulative link |
| Ranking | Relevance grade or pairwise preference | Search result quality scores | LambdaRank, ListNet |
| Sequence labelling | Tag per input token | Named entity tags on a sentence | Sum of token-level cross-entropies |
| Sequence-to-sequence | Output token sequence | French translation of an English sentence | Sum of token cross-entropies |
| Object detection | Bounding boxes plus class per box | Cars and pedestrians in a photo | Box regression plus classification |
| Image segmentation | Class label per pixel | Road, sky, vehicle masks | Pixel-wise cross-entropy, Dice |
| Density estimation | The input itself; no separate label | Generative modelling of images | Negative log likelihood |
The boundaries are not always clean. Image segmentation can be framed as multi-class classification at the pixel level. Multi-label classification can be implemented as K parallel binary classifiers. The target structure also drives architectural choices: a model that emits a fixed-length vector cannot directly produce a variable-length translation, which is why sequence-to-sequence models use decoders.
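As a concrete instance of that multi-label reduction, scikit-learn's OneVsRestClassifier fits one binary classifier per column of a binary indicator matrix. A minimal sketch on invented toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 examples, 10 features
Y = rng.integers(0, 2, size=(200, 3))   # 3 independent binary tags per example

# One logistic regression per tag: each column of Y is treated as a
# separate binary classification target.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
print(clf.predict(X[:2]))               # a length-3 binary vector per example
```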
In most real projects, target values do not arrive prelabelled. Someone has to produce them, and the way you produce them shapes the noise, cost, and bias of the resulting model. The dominant sources are summarised below.
| Source | How it works | Strengths | Weaknesses |
|---|---|---|---|
| Manual annotation | Humans label examples one at a time | Highest accuracy when annotators are well-trained | Slow, expensive, hard to scale past tens of thousands of examples |
| Crowdsourcing | Many non-expert workers label cheaply (Amazon Mechanical Turk, Scale AI) | Cheap and fast at scale | Inter-annotator agreement is often low for ambiguous tasks |
| Expert annotation | Specialists label (radiologists, lawyers, scientists) | Necessary for technical domains | Very expensive and bottlenecked by available experts |
| Programmatic / weak labels | Heuristics, regex, or labelling functions assign labels (Snorkel, data programming) | Scales to millions of examples | Each rule is noisy; labels need denoising |
| Distant supervision | A knowledge base is aligned with text or images to produce labels | No new annotation effort | Many false positives, since co-occurrence does not imply a relation |
| User behaviour | Clicks, conversions, dwell time, ratings act as implicit labels | Available in huge volume from production traffic | Confounded by position bias, exposure, and selection effects |
| Self-supervision | The label is derived directly from the input itself | Unlimited labels with no annotators | The pretext task may not be aligned with the downstream goal |
| Active learning | Model picks which examples to send to humans next | Reduces label budget by focusing on informative examples | Requires a working pipeline and a calibrated uncertainty score |
| Derived label | Label is computed from other recorded fields | Cheap and consistent | Risk of proxy labels drifting from the construct of interest |
For weak supervision, the foundational system is Snorkel by Ratner et al. (2017), which lets domain experts write labelling functions that vote on each example. A generative model then learns the functions' accuracies and emits probabilistic training labels. Mintz et al. (2009) introduced distant supervision for relation extraction, where Freebase triples were aligned with Wikipedia sentences to produce noisy training data for relation classifiers.
"Ground truth" sounds absolute, but in practice it is whatever the dataset says is correct. That can be wrong. Northcutt, Athalye, and Mueller (2021) audited the test sets of ten widely used benchmarks (ImageNet, MNIST, CIFAR-10, CIFAR-100, Caltech-256, QuickDraw, IMDB, Amazon Reviews, 20News, AudioSet) and found an average error rate of at least 3.3 percent, with the ImageNet validation set carrying around 6 percent label errors. Their headline result was that label errors can flip benchmark rankings: on corrected ImageNet, ResNet-18 outperforms ResNet-50 once the original label error rate exceeds 6 percent of the test set. The implication is uncomfortable. Models tuned to chase the last fraction of a percentage point on a noisy benchmark may be overfitting to the noise.
For any new project, it is worth assuming that some non-trivial fraction of your labels are wrong, and budgeting for label auditing accordingly. Tools like cleanlab, which grew out of the same line of research, score every label by how confidently the model disagrees with it.
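A minimal sketch of that workflow with cleanlab, assuming out-of-sample predicted probabilities from cross-validation (the toy data here is random, so the flagged indices are meaningless; the point is the shape of the pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
labels = rng.integers(0, 3, size=500)   # dataset labels, possibly noisy

# Out-of-sample predicted probabilities via cross-validation, so the
# model never scores an example it was trained on.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels,
    cv=5, method="predict_proba",
)

# Indices of examples whose given label the model most confidently disputes.
issues = find_label_issues(
    labels=labels, pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issues[:10])
```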
The target rarely enters a model in human-readable form. It is encoded numerically, and the encoding choice interacts with the loss function and architecture.
| Encoding | Form | When it is used |
|---|---|---|
| Integer encoding | Single integer in {0, 1, ..., K-1} | Tree-based learners, sparse categorical cross-entropy in deep frameworks |
| One-hot encoding | Length-K binary vector with a single 1 | Standard for softmax classifiers and most neural net classifiers |
| Soft labels | Length-K probability vector | Knowledge distillation (Hinton et al. 2015), label smoothing |
| Label smoothing | One-hot mixed with uniform: 1 - epsilon + epsilon/K on true class, epsilon/K on others | Szegedy et al. (2016) Inception v3 trick to reduce overconfidence |
| Embedding | Dense vector per class | Hierarchical or text-side label encoders, retrieval-style classifiers |
Label smoothing, introduced in Rethinking the Inception Architecture for Computer Vision (Szegedy et al. 2016), is worth singling out. Hard one-hot targets push the model toward producing logits that go to plus or minus infinity, which encourages overconfidence and can hurt calibration. Replacing the hard label with, for example, 0.9 on the true class and 0.1 split across the others gives a finite optimum, smoother gradients, and often slightly better held-out accuracy. The trick has become close to standard for large image and language models.
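A minimal sketch of the uniform-mixing version (the helper name is my own):

```python
import numpy as np

def smooth_labels(y, num_classes, epsilon=0.1):
    """One-hot targets mixed with a uniform distribution, as in
    Szegedy et al. (2016): (1 - eps) * one_hot + eps / K."""
    one_hot = np.eye(num_classes)[y]
    return (1.0 - epsilon) * one_hot + epsilon / num_classes

y = np.array([2, 0, 1])
print(smooth_labels(y, num_classes=3))
# True class gets 1 - eps + eps/K ~= 0.933; each other class gets eps/K ~= 0.033.
```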
Regression targets often violate the assumptions of linear models or simply make optimisation harder than it needs to be. Common transformations:
| Transformation | Formula | Purpose |
|---|---|---|
| Standardisation | (y - mean) / std | Put targets on the same scale as features so gradients behave |
| Log transform | log(y) or log(1 + y) | Tame right-skewed strictly positive targets like prices, counts, or durations |
| Box-Cox | Power transform parameterised by lambda | Find the power that best normalises a positive target |
| Yeo-Johnson | Generalisation of Box-Cox | Handles zero and negative values, used by scikit-learn's PowerTransformer |
| Quantile transform | Map y onto a uniform or normal quantile grid | Aggressive rank-preserving normalisation |
| Log-odds | log(p / (1 - p)) | Convert a bounded probability target into an unbounded regression target |
The critical detail is that any transformation applied to the training target must be inverted on predictions before they are reported. Predictions from a model trained on log price have to be exponentiated, and once you exponentiate, what you have actually predicted is the conditional median rather than the mean. Forgetting the inverse, or misapplying it, is a common bug.
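A sketch of the round trip on a synthetic log-normal target, including the often-forgotten mean correction (the data-generating process is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
price = np.exp(1.0 + X @ np.array([0.5, -0.2, 0.1])
               + rng.normal(scale=0.5, size=1000))

# Train on log(price); predictions come back in log space.
model = LinearRegression().fit(X, np.log(price))
pred_log = model.predict(X)

# Inverting with exp gives the conditional *median* price (assuming
# roughly symmetric errors in log space), not the conditional mean.
pred_median = np.exp(pred_log)

# Recovering the mean under a lognormal error assumption needs the
# variance correction exp(sigma^2 / 2) (Duan's "smearing" idea).
sigma2 = np.var(np.log(price) - pred_log)
pred_mean = np.exp(pred_log + sigma2 / 2)
```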
Target leakage, sometimes called label leakage, happens when features available at training time secretly contain information about the target that will not be available at prediction time. The model looks brilliant on the validation set and falls apart in production. Classic examples: a hospital feature such as "antibiotics prescribed" used to predict infection, which is only recorded after the diagnosis has been made; aggregate statistics or target encodings computed over the full dataset, including the rows being predicted; and fields populated only after the outcome is known, like a chargeback flag on a transaction used as a feature for fraud prediction.
Kaufman et al. (2012) catalogued seven types of leakage and gave the example of the IJCNN 2011 Social Network Challenge, where participants discovered the data came from Flickr and looked up the labels for 60 percent of the test set directly from the source. Avoiding leakage requires temporal discipline: features must be computed only from information that would have been available at the moment a real prediction was needed. This often means rebuilding feature pipelines from event-time logs rather than from snapshots.
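One way to enforce that discipline is to compute every feature as of an explicit prediction timestamp. A hypothetical pandas sketch (schema and column names invented):

```python
import pandas as pd

# One row per user action, with an event timestamp (hypothetical schema).
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime(
        ["2024-01-02", "2024-01-10", "2024-02-01", "2024-01-05", "2024-03-01"]),
    "amount": [10.0, 25.0, 5.0, 40.0, 15.0],
})

def features_as_of(events, user_id, prediction_time):
    """Aggregate only events strictly before the moment the prediction
    would have been made -- anything later is leakage."""
    past = events[(events["user_id"] == user_id)
                  & (events["event_time"] < prediction_time)]
    return {"n_events": len(past), "total_amount": past["amount"].sum()}

print(features_as_of(events, user_id=1,
                     prediction_time=pd.Timestamp("2024-01-15")))
# {'n_events': 2, 'total_amount': 35.0} -- the 2024-02-01 event is excluded
```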
Real labels are rarely perfect. The standard taxonomy comes from Frenay and Verleysen (2014), who classified label noise into three types based on what the noise depends on.
| Noise type | Abbreviation | What it depends on | Example |
|---|---|---|---|
| Noisy completely at random | NCAR | Independent of features and true class | Random keystroke errors when an annotator clicks the wrong button |
| Noisy at random | NAR | Depends on the true class but not features | Annotators reliably confuse husky and wolf images |
| Noisy not at random | NNAR | Depends on both features and true class | Hard or ambiguous examples are mislabelled at higher rates |
NCAR noise is the easiest case theoretically; it tends to bias estimates only mildly when the dataset is large. NNAR is the worst case and the most common in practice, because the same examples that confuse the model also confused the annotators. Methods for learning under label noise include robust loss functions (generalised cross-entropy, symmetric losses), noise-transition modelling, sample reweighting, co-teaching, and confident learning. None of them eliminate the underlying problem, which is that you do not know the true labels.
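To make one of these concrete, here is a sketch of the forward correction from the noise-transition family: the model's clean-class probabilities are pushed through an assumed transition matrix before being scored against the observed noisy label (function and variable names are my own):

```python
import numpy as np

def forward_corrected_nll(clean_probs, noisy_label, T):
    """Forward loss correction. T[i, j] = P(noisy = j | true = i), so the
    model's clean-class probabilities imply noisy-class probabilities
    p_noisy = p_clean @ T, which are scored against the noisy label."""
    noisy_probs = clean_probs @ T
    return -np.log(noisy_probs[noisy_label])

# Toy example: 3 classes; class 0 is flipped to class 1 twenty percent of the time.
T = np.array([[0.8, 0.2, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
clean_probs = np.array([0.9, 0.05, 0.05])  # the model believes class 0
print(forward_corrected_nll(clean_probs, noisy_label=1, T=T))  # ~= 1.47
```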
In self-supervised learning, the target is generated automatically from the input through a pretext task. There is no human annotation step at all. The most consequential examples in modern AI:
| Pretext task | Target | Used in |
|---|---|---|
| Next-token prediction | The next token of the sequence | GPT family, most modern LLMs |
| Masked language modelling | The identity of randomly masked tokens | BERT, RoBERTa, DeBERTa |
| Masked image modelling | The pixel values or token IDs of masked patches | MAE, BEiT |
| Contrastive learning | Whether two views come from the same underlying example (two augmentations of one image, or an image and its caption in CLIP) | SimCLR, MoCo, CLIP |
| Rotation prediction | The angle (0, 90, 180, 270) the input was rotated by | RotNet |
| Jigsaw puzzles | The original arrangement of shuffled patches | Noroozi and Favaro (2016) |
| Colourisation | The original colour channels of a greyscale image | Zhang et al. (2016) |
Self-supervised targets dominate modern foundation model training. A large language model like GPT-4 is trained on a single, very simple target: predict the next token. The trick is that doing this well at internet scale appears to require learning a great deal about syntax, semantics, world knowledge, and reasoning, which then transfers to downstream tasks after fine-tuning.
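The target construction itself is just a one-position shift of the token sequence (a minimal sketch with invented token IDs):

```python
import numpy as np

# A tokenised sequence (hypothetical token IDs).
tokens = np.array([17, 4, 92, 8, 51, 3])

# Next-token prediction: the target at position t is the token at t + 1,
# so inputs and targets are the same sequence shifted by one.
inputs  = tokens[:-1]   # [17,  4, 92,  8, 51]
targets = tokens[1:]    # [ 4, 92,  8, 51,  3]

# Training minimises cross-entropy between the model's prediction at each
# position and the corresponding entry of `targets`.
```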
Reinforcement learning from human feedback (RLHF) introduces a different kind of target. Christiano et al. (2017) trained a reward model whose target is a scalar value representing how good a human would judge a given output to be. The reward model is itself learned, typically from pairwise human preferences using a Bradley-Terry style loss function. The downstream policy is then optimised to maximise this learned scalar target, often with KL regularisation back to the supervised baseline.
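A minimal sketch of the Bradley-Terry pairwise loss for the reward model (function name my own):

```python
import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log likelihood that the chosen response beats the rejected
    one under a Bradley-Terry model:
    P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# Scalar rewards the reward model assigned to two candidate responses.
print(bradley_terry_loss(r_chosen=1.3, r_rejected=0.4))  # ~= 0.34
```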
Constitutional AI (Bai et al. 2022) replaces the human preference labels with model-generated ones derived from a written set of principles. The target is still a scalar reward, but the preference data is bootstrapped from the model itself rather than purchased from human annotators. This pattern of using one model's outputs as another model's targets has become widespread, with the obvious risk that errors and biases in the labelling model propagate into the trained policy.
Dataset shift describes the situation where the training distribution differs from the deployment distribution. Storkey (2009), in When Training and Test Sets Are Different, gives a six-way taxonomy that includes covariate shift, prior probability shift (also called target shift or label shift), sample selection bias, imbalanced data, domain shift, and source component shift.
Target shift specifically means the marginal P(y) changes between training and test, while the class-conditional P(x | y) stays fixed. A medical example: a disease becomes more prevalent in the population, but the way it manifests in symptoms is unchanged. A model trained on the old prevalence will under-predict the disease unless its decision threshold or its prior is recalibrated. Detecting target shift is harder than covariate shift because you usually do not have test-set labels; methods include black-box shift estimation (Lipton, Wang, and Smola 2018), which estimates the new label proportions from confusion-matrix structure on labelled training data plus unlabelled test data.
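A sketch of the BBSE estimator under its core assumption that P(x | y) is fixed (function and variable names are my own):

```python
import numpy as np

def bbse_weights(y_true_val, y_pred_val, y_pred_test, num_classes):
    """Black-box shift estimation (Lipton, Wang, and Smola 2018), sketched:
    solve C w = mu, where C[i, j] = P(pred = i, true = j) on labelled
    validation data and mu[i] = P(pred = i) on unlabelled test data.
    w[j] then estimates q(y = j) / p(y = j)."""
    C = np.zeros((num_classes, num_classes))
    for pred, true in zip(y_pred_val, y_true_val):
        C[pred, true] += 1
    C /= len(y_true_val)
    mu = np.bincount(y_pred_test, minlength=num_classes) / len(y_pred_test)
    # Assumes the classifier is better than chance, so C is invertible.
    return np.linalg.solve(C, mu)
```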
A few rules of thumb that come up over and over in real projects: decide what the target means before touching a model, because a perfect predictor of the wrong target is worse than useless; assume a non-trivial fraction of labels are wrong and budget for auditing them; invert every target transformation, correctly, before reporting predictions; build features only from information that existed at the moment the prediction would have been made; and when the prevalence of the target drifts in deployment, recalibrate the decision threshold or the prior.
Imagine you are teaching a friend to recognise dogs. You show them lots of pictures and for each one you say "dog" or "not a dog." The word you say is the target. After enough pictures, your friend can guess the answer for new pictures on their own. The target is just the right answer that comes with each example while you are teaching.