See also: Machine learning terms
In supervised learning, the target is the variable that a model learns to predict from input features. It is the answer side of the dataset: for every training example, a target value tells the algorithm what the right output should be. The target is also called the label, the ground truth, the dependent variable, the response variable, the outcome, or simply y. Different communities prefer different words. Statisticians tend to say "response" or "dependent variable." Computer vision and NLP researchers usually say "label." Decision theorists and engineers often just write y on the board and move on.
Whatever you call it, the target is the thing you are trying to predict, and the choice of target is one of the most consequential decisions in any machine learning project. A model that perfectly predicts the wrong target is worse than useless because it gives confident, official-looking answers to a question nobody asked. Defining the target well, sourcing accurate target values, and validating their quality are usually harder than picking an algorithm.
The target appears under several symbols depending on the textbook. The most common are:
| Symbol | Typical use |
|---|---|
| y | Generic target value, especially in regression and classification |
| t | Target in neural network literature (Bishop's Pattern Recognition and Machine Learning uses this convention) |
| z | Sometimes used for latent or auxiliary targets |
| Y | Random variable for the target, with y as a realisation |
| ŷ | The model's prediction for y |
When training data has n examples, the targets are usually written as a vector y of length n, or as a matrix Y when each example has multiple target dimensions (multi-output regression, multi-label classification, or sequence outputs).
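To make the shapes concrete, here is a minimal NumPy sketch (the array names and sizes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of training examples

# Single-output regression: the targets form a length-n vector y.
y = rng.normal(size=n)                # shape (n,)

# Multi-label classification with K = 5 tags: the targets form an
# n-by-K binary indicator matrix Y, one column per tag.
K = 5
Y = rng.integers(0, 2, size=(n, K))   # shape (n, 5)

print(y.shape, Y.shape)               # (1000,) (1000, 5)
```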
The shape and meaning of the target change with the task. The table below summarises the main categories.
| Task | Target shape | Example | Typical loss |
|---|---|---|---|
| Regression | Scalar real number | House price in dollars | Mean squared error, MAE, Huber |
| Binary classification | 0 or 1, or sometimes -1 and +1 | Spam vs. not spam | Binary cross-entropy, hinge |
| Multi-class classification | One of K classes | Handwritten digit 0 through 9 | Categorical cross-entropy |
| Multi-label classification | Binary vector of length K | Tags on a photo (beach, sunset, dog) | Sum of binary cross-entropies |
| Ordinal regression | Ordered category | Movie rating from 1 to 5 stars | Ordinal hinge, cumulative link |
| Ranking | Relevance grade or pairwise preference | Search result quality scores | LambdaRank, ListNet |
| Sequence labelling | Tag per input token | Named entity tags on a sentence | Sum of token-level cross-entropies |
| Sequence-to-sequence | Output token sequence | French translation of an English sentence | Sum of token cross-entropies |
| Object detection | Bounding boxes plus class per box | Cars and pedestrians in a photo | Box regression plus classification |
| Image segmentation | Class label per pixel | Road, sky, vehicle masks | Pixel-wise cross-entropy, Dice |
| Density estimation | The input itself; no separate label | Generative modelling of images | Negative log likelihood |
The boundaries are not always clean. Image segmentation can be framed as multi-class classification at the pixel level. Multi-label classification can be implemented as K parallel binary classifiers. The target structure also drives architectural choices: a model that emits a fixed-length vector cannot directly produce a variable-length translation, which is why sequence-to-sequence models use decoders.
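As a concrete instance of that multi-label reduction, scikit-learn's OneVsRestClassifier fits one binary classifier per column of a binary indicator matrix. A minimal sketch on invented toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 examples, 10 features
Y = rng.integers(0, 2, size=(200, 3))   # 3 independent binary tags per example

# One logistic regression per tag: each column of Y is treated as a
# separate binary classification target.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
print(clf.predict(X[:2]))               # a length-3 binary vector per example
```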
In most real projects, target values do not arrive prelabelled. Someone has to produce them, and the way you produce them shapes the noise, cost, and bias of the resulting model. The dominant sources are summarised below.
| Source | How it works | Strengths | Weaknesses |
|---|---|---|---|
| Manual annotation | Humans label examples one at a time | Highest accuracy when annotators are well-trained | Slow, expensive, hard to scale past tens of thousands of examples |
| Crowdsourcing | Many non-expert workers label cheaply (Amazon Mechanical Turk, Scale AI) | Cheap and fast at scale | Inter-annotator agreement is often low for ambiguous tasks |
| Expert annotation | Specialists label (radiologists, lawyers, scientists) | Necessary for technical domains | Very expensive and bottlenecked by available experts |
| Programmatic / weak labels | Heuristics, regex, or labelling functions assign labels (Snorkel, data programming) | Scales to millions of examples | Each rule is noisy; labels need denoising |
| Distant supervision | A knowledge base is aligned with text or images to produce labels | No new annotation effort | Many false positives, since co-occurrence does not imply a relation |
| User behaviour | Clicks, conversions, dwell time, ratings act as implicit labels | Available in huge volume from production traffic | Confounded by position bias, exposure, and selection effects |
| Self-supervision | The label is derived directly from the input itself | Unlimited labels with no annotators | The pretext task may not be aligned with the downstream goal |
| Active learning | Model picks which examples to send to humans next | Reduces label budget by focusing on informative examples | Requires a working pipeline and a calibrated uncertainty score |
| Derived label | Label is computed from other recorded fields | Cheap and consistent | Risk of proxy labels drifting from the construct of interest |
For weak supervision, the foundational system is Snorkel by Ratner et al. (2017), which lets domain experts write labelling functions that vote on each example. A generative model then learns the functions' accuracies and emits probabilistic training labels. Mintz et al. (2009) introduced distant supervision for relation extraction, where Freebase triples were aligned with Wikipedia sentences to produce noisy training data for relation classifiers.
"Ground truth" sounds absolute, but in practice it is whatever the dataset says is correct. That can be wrong. Northcutt, Athalye, and Mueller (2021) audited the test sets of ten widely used benchmarks (ImageNet, MNIST, CIFAR-10, CIFAR-100, Caltech-256, QuickDraw, IMDB, Amazon Reviews, 20News, AudioSet) and found an average error rate of at least 3.3 percent, with the ImageNet validation set carrying around 6 percent label errors. Their headline result was that label errors can flip benchmark rankings: on corrected ImageNet, ResNet-18 outperforms ResNet-50 once the original label error rate exceeds 6 percent of the test set. The implication is uncomfortable. Models tuned to chase the last fraction of a percentage point on a noisy benchmark may be overfitting to the noise.
For any new project, it is worth assuming that some non-trivial fraction of your labels are wrong, and budgeting for label auditing accordingly. Tools like cleanlab, which grew out of the same line of research, score every label by how confidently the model disagrees with it.
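A minimal sketch of that workflow with cleanlab, assuming out-of-sample predicted probabilities from cross-validation (the toy data here is random, so the flagged indices are meaningless; the point is the shape of the pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
labels = rng.integers(0, 3, size=500)   # dataset labels, possibly noisy

# Out-of-sample predicted probabilities via cross-validation, so the
# model never scores an example it was trained on.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels,
    cv=5, method="predict_proba",
)

# Indices of examples whose given label the model most confidently disputes.
issues = find_label_issues(
    labels=labels, pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issues[:10])
```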
The target rarely enters a model in human-readable form. It is encoded numerically, and the encoding choice interacts with the loss function and architecture.
| Encoding | Form | When it is used |
|---|---|---|
| Integer encoding | Single integer in {0, 1, ..., K-1} | Tree-based learners, sparse categorical cross-entropy in deep frameworks |
| One-hot encoding | Length-K binary vector with a single 1 | Standard for softmax classifiers and most neural net classifiers |
| Soft labels | Length-K probability vector | Knowledge distillation (Hinton et al. 2015), label smoothing |
| Label smoothing | One-hot mixed with uniform: 1 - epsilon + epsilon/K on true class, epsilon/K on others | Szegedy et al. (2016) Inception v3 trick to reduce overconfidence |
| Embedding | Dense vector per class | Hierarchical or text-side label encoders, retrieval-style classifiers |
Label smoothing, introduced in Rethinking the Inception Architecture for Computer Vision (Szegedy et al. 2016), is worth singling out. Hard one-hot targets push the model toward producing logits that go to plus or minus infinity, which encourages overconfidence and can hurt calibration. Replacing the hard label with, for example, 0.9 on the true class and 0.1 split across the others gives a finite optimum, smoother gradients, and often slightly better held-out accuracy. The trick has become close to standard for large image and language models.
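A minimal sketch of the uniform-mixing version (the helper name is my own):

```python
import numpy as np

def smooth_labels(y, num_classes, epsilon=0.1):
    """One-hot targets mixed with a uniform distribution, as in
    Szegedy et al. (2016): (1 - eps) * one_hot + eps / K."""
    one_hot = np.eye(num_classes)[y]
    return (1.0 - epsilon) * one_hot + epsilon / num_classes

y = np.array([2, 0, 1])
print(smooth_labels(y, num_classes=3))
# True class gets 1 - eps + eps/K ~= 0.933; each other class gets eps/K ~= 0.033.
```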
Regression targets often violate the assumptions of linear models or simply make optimisation harder than it needs to be. Common transformations:
| Transformation | Formula | Purpose |
|---|---|---|
| Standardisation | (y - mean) / std | Put targets on the same scale as features so gradients behave |
| Log transform | log(y) or log(1 + y) | Tame right-skewed strictly positive targets like prices, counts, or durations |
| Box-Cox | Power transform parameterised by lambda | Find the power that best normalises a positive target |
| Yeo-Johnson | Generalisation of Box-Cox | Handles zero and negative values, used by scikit-learn's PowerTransformer |
| Quantile transform | Map y onto a uniform or normal quantile grid | Aggressive rank-preserving normalisation |
| Log-odds | log(p / (1 - p)) | Convert a bounded probability target into an unbounded regression target |
The critical detail is that any transformation applied to the training target must be inverted on predictions before they are reported. Predictions from a model trained on log price have to be exponentiated, and once you exponentiate, what you have actually predicted is the conditional median rather than the mean. Forgetting the inverse, or misapplying it, is a common bug.
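A sketch of the round trip on a synthetic log-normal target, including the often-forgotten mean correction (the data-generating process is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
price = np.exp(1.0 + X @ np.array([0.5, -0.2, 0.1])
               + rng.normal(scale=0.5, size=1000))

# Train on log(price); predictions come back in log space.
model = LinearRegression().fit(X, np.log(price))
pred_log = model.predict(X)

# Inverting with exp gives the conditional *median* price (assuming
# roughly symmetric errors in log space), not the conditional mean.
pred_median = np.exp(pred_log)

# Recovering the mean under a lognormal error assumption needs the
# variance correction exp(sigma^2 / 2) (Duan's "smearing" idea).
sigma2 = np.var(np.log(price) - pred_log)
pred_mean = np.exp(pred_log + sigma2 / 2)
```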
Target leakage, sometimes called label leakage, happens when features available at training time secretly contain information about the target that will not be available at prediction time. The model looks brilliant on the validation set and falls apart in production. Classic examples: a hospital feature such as "antibiotics prescribed" used to predict infection, which is only recorded after the diagnosis has been made; aggregate statistics or target encodings computed over the full dataset, including the rows being predicted; and fields populated only after the outcome is known, like a chargeback flag on a transaction used as a feature for fraud prediction.
Kaufman et al. (2012) catalogued seven types of leakage and gave the example of the IJCNN 2011 Social Network Challenge, where participants discovered the data came from Flickr and looked up the labels for 60 percent of the test set directly from the source. Avoiding leakage requires temporal discipline: features must be computed only from information that would have been available at the moment a real prediction was needed. This often means rebuilding feature pipelines from event-time logs rather than from snapshots.
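One way to enforce that discipline is to compute every feature as of an explicit prediction timestamp. A hypothetical pandas sketch (schema and column names invented):

```python
import pandas as pd

# One row per user action, with an event timestamp (hypothetical schema).
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime(
        ["2024-01-02", "2024-01-10", "2024-02-01", "2024-01-05", "2024-03-01"]),
    "amount": [10.0, 25.0, 5.0, 40.0, 15.0],
})

def features_as_of(events, user_id, prediction_time):
    """Aggregate only events strictly before the moment the prediction
    would have been made -- anything later is leakage."""
    past = events[(events["user_id"] == user_id)
                  & (events["event_time"] < prediction_time)]
    return {"n_events": len(past), "total_amount": past["amount"].sum()}

print(features_as_of(events, user_id=1,
                     prediction_time=pd.Timestamp("2024-01-15")))
# {'n_events': 2, 'total_amount': 35.0} -- the 2024-02-01 event is excluded
```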
Real labels are rarely perfect. The standard taxonomy comes from Frenay and Verleysen (2014), who classified label noise into three types based on what the noise depends on.
| Noise type | Abbreviation | What it depends on | Example |
|---|---|---|---|
| Noisy completely at random | NCAR | Independent of features and true class | Random keystroke errors when an annotator clicks the wrong button |
| Noisy at random | NAR | Depends on the true class but not features | Annotators reliably confuse husky and wolf images |
| Noisy not at random | NNAR | Depends on both features and true class | Hard or ambiguous examples are mislabelled at higher rates |
NCAR noise is the easiest case theoretically; it tends to bias estimates only mildly when the dataset is large. NNAR is the worst case and the most common in practice, because the same examples that confuse the model also confused the annotators. Methods for learning under label noise include robust loss functions (generalised cross-entropy, symmetric losses), noise-transition modelling, sample reweighting, co-teaching, and confident learning. None of them eliminate the underlying problem, which is that you do not know the true labels.
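To make one of these concrete, here is a sketch of the forward correction from the noise-transition family: the model's clean-class probabilities are pushed through an assumed transition matrix before being scored against the observed noisy label (function and variable names are my own):

```python
import numpy as np

def forward_corrected_nll(clean_probs, noisy_label, T):
    """Forward loss correction. T[i, j] = P(noisy = j | true = i), so the
    model's clean-class probabilities imply noisy-class probabilities
    p_noisy = p_clean @ T, which are scored against the noisy label."""
    noisy_probs = clean_probs @ T
    return -np.log(noisy_probs[noisy_label])

# Toy example: 3 classes; class 0 is flipped to class 1 twenty percent of the time.
T = np.array([[0.8, 0.2, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
clean_probs = np.array([0.9, 0.05, 0.05])  # the model believes class 0
print(forward_corrected_nll(clean_probs, noisy_label=1, T=T))  # ~= 1.47
```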
In self-supervised learning, the target is generated automatically from the input through a pretext task. There is no human annotation step at all. The most consequential examples in modern AI:
| Pretext task | Target | Used in |
|---|---|---|
| Next-token prediction | The next token of the sequence | GPT family, most modern LLMs |
| Masked language modelling | The identity of randomly masked tokens | BERT, RoBERTa, DeBERTa |
| Masked image modelling | The pixel values or token IDs of masked patches | MAE, BEiT |
| Contrastive learning | Whether two views come from the same underlying example (two augmentations of one image, or an image and its caption in CLIP) | SimCLR, MoCo, CLIP |
| Rotation prediction | The angle (0, 90, 180, 270) the input was rotated by | RotNet |
| Jigsaw puzzles | The original arrangement of shuffled patches | Noroozi and Favaro (2016) |
| Colourisation | The original colour channels of a greyscale image | Zhang et al. (2016) |
Self-supervised targets dominate modern foundation model training. A large language model like GPT-4 is trained on a single, very simple target: predict the next token. The trick is that doing this well at internet scale appears to require learning a great deal about syntax, semantics, world knowledge, and reasoning, which then transfers to downstream tasks after fine-tuning.
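The target construction itself is just a one-position shift of the token sequence (a minimal sketch with invented token IDs):

```python
import numpy as np

# A tokenised sequence (hypothetical token IDs).
tokens = np.array([17, 4, 92, 8, 51, 3])

# Next-token prediction: the target at position t is the token at t + 1,
# so inputs and targets are the same sequence shifted by one.
inputs  = tokens[:-1]   # [17,  4, 92,  8, 51]
targets = tokens[1:]    # [ 4, 92,  8, 51,  3]

# Training minimises cross-entropy between the model's prediction at each
# position and the corresponding entry of `targets`.
```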
Reinforcement learning from human feedback (RLHF) introduces a different kind of target. Christiano et al. (2017) trained a reward model whose target is a scalar value representing how good a human would judge a given output to be. The reward model is itself learned, typically from pairwise human preferences using a Bradley-Terry style loss function. The downstream policy is then optimised to maximise this learned scalar target, often with KL regularisation back to the supervised baseline.
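A minimal sketch of the Bradley-Terry pairwise loss for the reward model (function name my own):

```python
import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log likelihood that the chosen response beats the rejected
    one under a Bradley-Terry model:
    P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# Scalar rewards the reward model assigned to two candidate responses.
print(bradley_terry_loss(r_chosen=1.3, r_rejected=0.4))  # ~= 0.34
```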
Constitutional AI (Bai et al. 2022) replaces the human preference labels with model-generated ones derived from a written set of principles. The target is still a scalar reward, but the preference data is bootstrapped from the model itself rather than purchased from human annotators. This pattern of using one model's outputs as another model's targets has become widespread, with the obvious risk that errors and biases in the labelling model propagate into the trained policy.
Dataset shift describes the situation where the training distribution differs from the deployment distribution. Storkey (2009), in When Training and Test Sets Are Different, gives a six-way taxonomy that includes covariate shift, prior probability shift (also called target shift or label shift), sample selection bias, imbalanced data, domain shift, and source component shift.
Target shift specifically means the marginal P(y) changes between training and test, while the class-conditional P(x | y) stays fixed. A medical example: a disease becomes more prevalent in the population, but the way it manifests in symptoms is unchanged. A model trained on the old prevalence will under-predict the disease unless its decision threshold or its prior is recalibrated. Detecting target shift is harder than covariate shift because you usually do not have test-set labels; methods include black-box shift estimation (Lipton, Wang, and Smola 2018), which estimates the new label proportions from confusion-matrix structure on labelled training data plus unlabelled test data.
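A sketch of the BBSE estimator under its core assumption that P(x | y) is fixed (function and variable names are my own):

```python
import numpy as np

def bbse_weights(y_true_val, y_pred_val, y_pred_test, num_classes):
    """Black-box shift estimation (Lipton, Wang, and Smola 2018), sketched:
    solve C w = mu, where C[i, j] = P(pred = i, true = j) on labelled
    validation data and mu[i] = P(pred = i) on unlabelled test data.
    w[j] then estimates q(y = j) / p(y = j)."""
    C = np.zeros((num_classes, num_classes))
    for pred, true in zip(y_pred_val, y_true_val):
        C[pred, true] += 1
    C /= len(y_true_val)
    mu = np.bincount(y_pred_test, minlength=num_classes) / len(y_pred_test)
    # Assumes the classifier is better than chance, so C is invertible.
    return np.linalg.solve(C, mu)
```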
A few rules of thumb that come up over and over in real projects: decide what the target means before touching a model, because a perfect predictor of the wrong target is worse than useless; assume a non-trivial fraction of labels are wrong and budget for auditing them; invert every target transformation, correctly, before reporting predictions; build features only from information that existed at the moment the prediction would have been made; and when the prevalence of the target drifts in deployment, recalibrate the decision threshold or the prior.
Imagine you are teaching a friend to recognise dogs. You show them lots of pictures and for each one you say "dog" or "not a dog." The word you say is the target. After enough pictures, your friend can guess the answer for new pictures on their own. The target is just the right answer that comes with each example while you are teaching.