See also: Machine learning terms
In machine learning, an objective is the scalar function that the learning algorithm tries to optimize during training. When the function measures error and is minimized, it is usually called a loss function, a cost function, or empirical risk. When it measures usefulness and is maximized, it is called a reward, utility, or score. The distinction is mostly cosmetic; flipping the sign turns one into the other. What matters is that the objective is the quantity that produces gradients, so it dictates everything the model actually learns. A model does not learn the task you describe in a memo. It learns the function you wrote down.
This article focuses on supervised, self-supervised, generative, and reinforcement learning objectives, and on the gap between training objectives and the metrics people actually care about.
Three words, "loss", "cost", and "objective", get used almost interchangeably, with small but useful differences in connotation: loss usually names the penalty on a single example, cost the aggregate over a dataset (often with regularization folded in), and objective the most general term for whatever function is being optimized.
In practice the same training run may include several of these stitched together. A modern language model fine-tune might combine a token-level cross-entropy loss, a KL divergence penalty against a reference model, and a learned reward, all summed into a single objective.
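A minimal PyTorch sketch of such a stitched-together objective. All names here are illustrative, not from any specific library: logits come from the policy being tuned, ref_logits from a frozen reference model, and reward is a learned scalar per sequence.

```python
import torch
import torch.nn.functional as F

def fine_tune_objective(logits, ref_logits, labels, reward, beta=0.1):
    # Token-level cross-entropy against the supervised labels.
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    # KL(policy || reference): penalizes drifting from the frozen reference.
    kl = F.kl_div(F.log_softmax(ref_logits, dim=-1),   # input = log q (reference)
                  F.log_softmax(logits, dim=-1),       # target = log p (policy)
                  log_target=True, reduction="batchmean")
    # The learned reward is maximized, so it enters the minimized total with
    # a negative sign. (Treated as differentiable here for illustration; in
    # practice this term is usually optimized via PPO rather than backprop.)
    return ce + beta * kl - reward.mean()
```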
The formal framework that organizes most of supervised learning is empirical risk minimization (ERM), introduced by Vladimir Vapnik in the early 1990s. The setup assumes data is drawn i.i.d. from some unknown distribution P(x, y). The true risk of a predictor f is the expected loss under P:
R(f) = E[L(f(x), y)]
We cannot compute R(f) because P is unknown, so we approximate it with the empirical risk over the n training samples:
R_emp(f) = (1/n) sum_i L(f(x_i), y_i)
ERM picks the f that minimizes R_emp. Vapnik and Chervonenkis showed that this works (R_emp converges uniformly to R as n grows) if and only if the hypothesis class has finite VC dimension. That qualifier matters: with no constraint on the function class, you can get zero training loss while learning nothing about the underlying distribution. This is why regularization, early stopping, and limited model capacity are not optional; they are what make ERM theoretically sound. See empirical risk minimization for the longer treatment.
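The definition translates directly into code. A NumPy sketch with toy data (the predictor and loss here are arbitrary examples):

```python
import numpy as np

def empirical_risk(f, loss, X, y):
    """R_emp(f) = (1/n) * sum_i L(f(x_i), y_i)."""
    preds = np.array([f(x) for x in X])
    return np.mean(loss(preds, y))

# Toy example: squared-error loss for the predictor f(x) = x.
squared = lambda p, y: (p - y) ** 2
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.2, 2.8])
print(empirical_risk(lambda x: x, squared, X, y))  # 0.025
```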
The related notion of Bayes risk is the lowest possible risk achievable by any predictor given the true distribution. The Bayes-optimal classifier picks the most probable class for each x. Real algorithms aim to approach Bayes risk while paying a bounded penalty for finite data and finite model capacity. See Bayes risk.
Many of the metrics we actually care about are not differentiable. Classification accuracy is a step function of the model's output. Ranking metrics like NDCG involve sorts. F1 is computed from discrete true-positive and false-positive counts, often with a max over decision thresholds on top. None of these can be optimized directly with gradient descent.
The practical fix is a surrogate loss: a smooth, often convex function that bounds or correlates with the target metric. Hinge loss is a convex surrogate for 0/1 loss; logistic and cross-entropy losses are calibrated surrogates that recover proper probability estimates in the limit. A surrogate is called consistent (or Bayes-consistent) if minimizing it converges to a Bayes-optimal classifier as the data grows; logistic loss is consistent, hinge loss is consistent for 0/1 but not for class probabilities. Picking a surrogate is one of the most consequential design choices in a project, because the model you ship is optimal for the surrogate, not for the metric you put on the dashboard.
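The bound relationship is easy to check numerically. A small NumPy sketch, writing each loss as a function of the margin m = y * f(x) with labels in {-1, +1}; the logistic loss is taken base-2 so that it, like hinge, upper-bounds the 0/1 loss:

```python
import numpy as np

# All three losses written as functions of the margin m = y * f(x),
# with labels y in {-1, +1}.
m = np.linspace(-2, 2, 9)

zero_one = (m <= 0).astype(float)      # the metric we actually want
hinge    = np.maximum(0.0, 1.0 - m)    # convex surrogate (SVM)
logistic = np.log2(1.0 + np.exp(-m))   # base-2 so it upper-bounds 0/1 too

# Both surrogates sit on or above the 0/1 loss everywhere, so driving
# them toward zero drives the error rate down with them.
assert np.all(hinge >= zero_one) and np.all(logistic >= zero_one)
```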
The following table summarizes losses you will run into in nearly every ML codebase.
| Family | Name | Formula (per example) | Typical use | Notes |
|---|---|---|---|---|
| Regression | Mean squared error (L2) | (y - y_hat)^2 | Continuous targets, Gaussian noise | Smooth, sensitive to outliers, max-likelihood for Gaussian residuals |
| Regression | Mean absolute error (L1) | \|y - y_hat\| | Robust regression | Sub-gradient at zero, max-likelihood for Laplace residuals |
| Regression | Huber loss | quadratic if \|r\| < delta else linear | Robust regression with smooth gradient | Combines L2 near zero, L1 in the tail |
| Regression | Quantile (pinball) | rho_tau(y - y_hat) | Quantile regression, prediction intervals | Asymmetric, used in conformal and distributional models |
| Regression | Log-cosh | log(cosh(y - y_hat)) | Smooth approximation of L1 | Twice differentiable everywhere |
| Classification | 0/1 loss | 1 if y != y_hat else 0 | Definition of error rate | Non-differentiable, used as a target metric, not for training |
| Classification | Hinge loss | max(0, 1 - y * y_hat) | Linear and kernel SVMs | Convex surrogate for 0/1, gives sparse support vectors |
| Classification | Logistic / binary cross-entropy | -y log(p) - (1-y) log(1-p) | Probabilistic binary classification | Calibrated probability outputs |
| Classification | Categorical cross-entropy | -sum_k y_k log(p_k) | Softmax classifiers, language models | Equivalent to negative log likelihood |
| Classification | Focal loss | -(1 - p_t)^gamma log(p_t) | Heavily imbalanced detection | Down-weights easy examples, Lin et al. 2017 |
| Classification | Label-smoothed CE | (1 - eps) * CE + eps * uniform | Large classifiers (ImageNet, LMs) | Reduces overconfidence and calibration error |
| Probability | KL divergence | sum p log(p / q) | Distillation, VAE prior matching | Asymmetric, infinite if q has zero mass where p does not |
| Probability | Jensen-Shannon | average of two KLs to mean | Symmetric divergence, original GAN objective | Bounded between 0 and log 2 |
| Probability | Wasserstein / earth-mover | inf over couplings of expected distance | WGAN, optimal transport | Smooth even when supports do not overlap |
| Ranking | Pairwise hinge | max(0, 1 - (s_pos - s_neg)) | Learning to rank, retrieval | Margin-ranking and triplet losses; RankNet uses a logistic pairwise variant |
| Ranking | LambdaRank / ListNet | listwise approximations of NDCG | Search ranking, recommender systems | LambdaRank weights pairs by IR-metric deltas |
| Structured | CRF negative log-likelihood | -log p(y \| x) over structured y | Sequence labeling, parsing | Conditional random fields |
| Structured | Structured SVM hinge | max over wrong y of margin | Structured prediction | Tsochantaridis et al. 2004 |
Focal loss deserves a callout because it changed object detection. Lin et al. (ICCV 2017) added the modulating factor (1 - p_t)^gamma to cross-entropy so that examples the model already classifies confidently contribute almost nothing to the gradient. Their RetinaNet, a one-stage detector trained with focal loss, matched the speed of earlier one-stage detectors while exceeding the accuracy of all then-current two-stage detectors.
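A common PyTorch rendering of the binary focal loss, omitting the alpha class-balancing term the paper also uses:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Binary focal loss: -(1 - p_t)^gamma * log(p_t)."""
    # Standard cross-entropy, kept per-example so it can be re-weighted.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)  # probability of the true class
    # Confident examples (p_t near 1) are down-weighted toward zero.
    return ((1.0 - p_t) ** gamma * ce).mean()
```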
Self-supervised, generative, and reinforcement learning each developed their own family of objectives. The table groups the most common ones; details follow below.
| Paradigm | Objective | Maximize / minimize | Example systems |
|---|---|---|---|
| Supervised | Cross-entropy or MSE on labels | Minimize | ResNet, BERT fine-tunes |
| Self-supervised (predictive) | Masked token prediction, next-token prediction | Minimize negative log-likelihood | BERT, GPT family |
| Self-supervised (contrastive) | InfoNCE | Maximize lower bound on mutual information | CPC, SimCLR, MoCo, CLIP |
| Generative (likelihood) | Maximum likelihood | Maximize | Autoregressive LMs, normalizing flows |
| Generative (variational) | Evidence lower bound (ELBO) | Maximize | Variational autoencoders |
| Generative (adversarial) | Min-max value function | Minimize for G, maximize for D | GAN, WGAN |
| Generative (diffusion) | Denoising score matching | Minimize | DDPM, score-based models |
| Reinforcement learning | Expected return | Maximize | Most RL agents |
| RL (policy-gradient) | REINFORCE objective | Maximize | Vanilla policy gradient |
| RL (actor-critic) | Value MSE plus advantage objective | Mixed | A2C, A3C, SAC |
| RL (trust-region) | Clipped surrogate (PPO) | Maximize | PPO, RLHF loops |
| Preference fine-tune | Bradley-Terry log-likelihood under reward model | Maximize | InstructGPT |
| Preference fine-tune | Direct preference optimization (DPO) | Minimize a logistic loss over preference pairs | Llama 3, Mistral, etc. |
Self-supervised methods sidestep the need for labels by inventing a pretext task. Masked language modeling, used by BERT, randomly hides tokens and asks the model to predict them, which reduces to a token-level cross-entropy loss. Contrastive methods learn by pulling together representations of two augmented views of the same input and pushing apart representations of different inputs. The standard objective is InfoNCE, introduced in van den Oord et al.'s 2018 Contrastive Predictive Coding paper. InfoNCE is, formally, a categorical cross-entropy that picks the positive example out of a batch of negatives; minimizing it maximizes a lower bound on the mutual information between the two views. SimCLR (Chen et al. 2020) used this same loss under the name NT-Xent and reached 76.5 percent ImageNet linear-probe accuracy with a wide ResNet-50 (4x), matching a fully supervised ResNet-50. CLIP (Radford et al. 2021) scaled the same idea across modalities, training on 400 million image-text pairs to align an image encoder and a text encoder in a shared embedding space.
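A minimal PyTorch sketch of InfoNCE in its paired, CLIP-style form (SimCLR's NT-Xent additionally draws negatives from both augmented views):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE over a batch: z2[i] is the positive for z1[i]; every other
    row of z2 serves as a negative."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature      # [batch, batch] cosine similarities
    labels = torch.arange(z1.size(0))       # positives live on the diagonal
    return F.cross_entropy(logits, labels)  # literally a categorical CE
```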
Generative models split into four roughly distinct objective families. Autoregressive models (GPT, PixelRNN) factorize p(x) into a product of conditionals and maximize log-likelihood directly. Variational autoencoders (Kingma and Welling 2013) cannot compute log p(x) exactly, so they maximize the Evidence Lower Bound: log p(x) >= E_q[log p(x | z)] - KL(q(z|x) || p(z)). The first term is reconstruction quality; the second is a regularizer that keeps the encoder close to the prior. GANs (Goodfellow et al. 2014) replace likelihood with an adversarial game between a generator and discriminator, where the value function is the Jensen-Shannon divergence in the original formulation and the Wasserstein distance in WGAN. Diffusion models (Ho et al. 2020; Song et al. 2021) train a network to denoise inputs across many noise scales, which is mathematically equivalent to denoising score matching: learning the gradient of the log-density of the data.
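The ELBO translates into a few lines of PyTorch. This sketch assumes a Gaussian encoder, a standard normal prior, and a Bernoulli decoder, and uses the closed-form KL from the VAE paper:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon, mu, logvar):
    """-ELBO with a Gaussian encoder q(z|x) = N(mu, diag(exp(logvar))),
    a standard normal prior, and a Bernoulli decoder (x_recon in [0, 1])."""
    # Reconstruction term: one-sample estimate of E_q[log p(x|z)].
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form (Kingma and Welling 2013).
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # minimizing this maximizes the ELBO
```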
Reinforcement learning maximizes expected return, the sum of (discounted) rewards along a trajectory. The plain REINFORCE estimator scales the policy gradient by the realized return; actor-critic methods replace the return with a learned advantage estimate to cut variance. Modern RL standardizes on proximal policy optimization (Schulman et al. 2017), whose clipped surrogate objective takes the minimum of the probability ratio times the advantage and the same ratio clipped to [1 - eps, 1 + eps] times the advantage. The clip prevents the policy from drifting too far from the version that collected the data, which is what gave PPO a reputation for being usable.
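The clipped surrogate itself is short. A sketch assuming per-action log-probabilities and precomputed advantage estimates:

```python
import torch

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """PPO's clipped surrogate (maximized; negate it to use as a loss)."""
    ratio = torch.exp(logp_new - logp_old)        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()   # pessimistic of the two
```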
RLHF (reinforcement learning from human feedback) layered preference learning on top of PPO. In Ouyang et al.'s InstructGPT paper (NeurIPS 2022), the recipe was three steps: supervised fine-tuning on demonstrations, training a reward model on human preference comparisons under a Bradley-Terry likelihood, and PPO optimization of the policy against the reward with a KL penalty against the SFT model. The KL term is the safety belt; without it, PPO will happily walk off into reward-hacking territory. Human raters preferred the 1.3B InstructGPT model to the 175B base GPT-3, despite it having over 100 times fewer parameters.
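The reward-model step reduces to a logistic loss on score differences. A sketch assuming scalar reward-model outputs for the chosen and rejected completions:

```python
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood: push the model to score the
    human-preferred completion higher, with p(chosen beats rejected)
    modeled as sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()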
Direct preference optimization (Rafailov et al., NeurIPS 2023) made the same idea cheaper. The DPO paper showed that the optimal policy under the RLHF objective has a closed-form relationship to the reward, so you can fine-tune the policy directly from preference pairs with a single classification-style loss, with no separate reward model and no PPO. Their result, captured in the title "Your Language Model is Secretly a Reward Model," matched or beat PPO RLHF in summarization and dialogue while being substantially simpler to implement. Variants like IPO and KTO have followed.
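A sketch of the DPO loss, assuming summed log-probabilities of each completion under the policy and under the frozen reference:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: a logistic loss on the difference of implicit rewards, where
    each implicit reward is beta * log(pi(y|x) / pi_ref(y|x))."""
    chosen = beta * (logp_chosen - ref_logp_chosen)
    rejected = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen - rejected).mean()
```

Compare this with the Bradley-Terry loss above: DPO is the same logistic form, with the reward model's scores replaced by log-ratios that the policy itself supplies.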
Real systems rarely optimize one thing. A search ranker might want relevance, freshness, and diversity. An autonomous vehicle wants to predict depth, segment the scene, and detect objects from one shared backbone. Combining objectives is its own subfield, multi-objective optimization.
The simplest approach is a weighted sum: L = w_1 L_1 + w_2 L_2 + .... Picking the weights by grid search is expensive and brittle, especially when the losses are on different scales (a regression loss in meters and a classification cross-entropy will not naturally trade off cleanly). Kendall, Gal, and Cipolla (CVPR 2018) proposed using each task's homoscedastic uncertainty as a learnable inverse weight, which made the trade-off self-tuning and outperformed hand-tuned weights on a depth/semantic-segmentation/instance-segmentation benchmark. Other techniques include GradNorm, PCGrad, and full Pareto-frontier exploration where the goal is the set of solutions for which no objective can improve without another getting worse.
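A sketch in the spirit of the Kendall et al. weighting, simplified to one learned log-variance per task (the paper uses slightly different constants for regression and classification heads):

```python
import torch

class UncertaintyWeighting(torch.nn.Module):
    """Learnable task weighting: task i is scaled by exp(-s_i) with
    regularizer s_i, where s_i is a learned log-variance."""
    def __init__(self, num_tasks):
        super().__init__()
        self.log_vars = torch.nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):  # losses: sequence of scalar task losses
        total = 0.0
        for loss, s in zip(losses, self.log_vars):
            # High uncertainty (large s) shrinks the task's weight; the
            # + s term stops the model from inflating s without limit.
            total = total + torch.exp(-s) * loss + s
        return total
```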
The failure mode that haunts every objective is the gap between what you wrote and what you wanted. Specification gaming is the catchall term, popularized by Victoria Krakovna and collaborators at DeepMind in a widely shared 2020 article and example list. The agent satisfies the literal specification while violating the intent. Krakovna's catalog includes a Lego stacking task where the agent learned to flip the red block over (collecting reward for the height of its bottom face) instead of placing it on the blue block, and the OpenAI CoastRunners agent that learned to circle and re-collect green tokens forever instead of finishing the boat race. Reward hacking, Goodhart's law, and reward tampering are closely related labels for the same underlying issue: when a measure becomes a target, the measure stops measuring.
Mitigations are an active research area and include reward shaping with explicit safety penalties, KL anchors against a reference policy (the trick that keeps RLHF from melting), interpretability tools, and reward modeling that explicitly models human approval rather than a hand-coded proxy.
It is worth saying out loud: the function you train on is almost never the function you grade on. A classifier is trained on cross-entropy and evaluated on accuracy or F1. A language model is trained on token-level perplexity and evaluated on benchmarks like MMLU or human preference. A retrieval system is trained on contrastive loss and evaluated on Recall@k or NDCG.
This split exists because training needs a differentiable, dense signal that yields a gradient on every example, while evaluation needs faithful, often discrete, low-variance summaries. Conflating the two is a common bug. If you tune hyperparameters or architectures by validation loss when the launch criterion is downstream click-through rate, you will eventually pick a model that wins the proxy and loses the real thing.
Meta-learning explicitly separates two levels. The inner objective is the per-task loss the model adapts on (a few gradient steps for a small task). The outer objective is the meta-objective, evaluated after adaptation, that rewards models which can be adapted quickly. MAML (Finn et al. 2017) is the canonical example: the inner loop does SGD on a sampled task, the outer loop differentiates through that loop to update the initialization. Hyperparameter optimization can be framed the same way, with hyperparameters as outer parameters and weights as inner parameters. The vocabulary of inner/outer objective also shows up in AI safety, where "inner alignment" describes whether the model's emergent objective matches the outer training signal we wrote.
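A single-task MAML meta-step, sketched with PyTorch 2.x's torch.func.functional_call; variable names are illustrative:

```python
import torch
from torch.func import functional_call

def maml_meta_loss(model, loss_fn, support, query, inner_lr=0.01):
    """One task's contribution to the MAML outer objective."""
    (x_s, y_s), (x_q, y_q) = support, query
    params = dict(model.named_parameters())
    # Inner objective: one SGD step on the support set. create_graph=True
    # lets the outer gradient differentiate through this update.
    inner_loss = loss_fn(functional_call(model, params, (x_s,)), y_s)
    grads = torch.autograd.grad(inner_loss, list(params.values()),
                                create_graph=True)
    adapted = {name: p - inner_lr * g
               for (name, p), g in zip(params.items(), grads)}
    # Outer objective: loss of the adapted parameters on the query set;
    # backpropagating it updates the shared initialization.
    return loss_fn(functional_call(model, adapted, (x_q,)), y_q)
```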
Optimizers and objectives are different things, but they get conflated often enough that it is worth separating. The objective tells the optimizer what to push toward; the optimizer decides how to update parameters in response to gradients of that objective. The same MSE loss can be minimized by SGD, SGD with momentum, Adam, AdamW, Adafactor, Lion, Shampoo, or any number of other choices. Switching optimizer rarely changes which model the training run converges to in principle; it changes how fast you get there and how stable the run is. See optimization for the wider treatment.
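In code the separation is one line: the objective below stays fixed while the optimizer can be swapped freely.

```python
import torch

model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()  # the objective stays fixed...
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # ...only this changes
# e.g. torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

x, y = torch.randn(32, 10), torch.randn(32, 1)
optimizer.zero_grad()
loss_fn(model(x), y).backward()  # gradients come from the objective
optimizer.step()                 # the optimizer decides how to apply them
```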
Imagine you are teaching a robot how to play a game. The objective in machine learning is like the goal you want the robot to achieve: winning the game. To help the robot learn, you need a way to measure how well it is doing, so you create rules or scoring methods (loss functions when small numbers are good, reward functions when big numbers are good) to judge its performance. The robot then uses tricks like gradient descent to change itself a little bit at a time so its score goes up. When the robot gets really good at scoring, it has learned the objective.
The catch is that the robot will only ever try to win the score you wrote down, not the game you had in mind. If you reward the robot for picking up coins, it might learn to grab the same coin over and over instead of finishing the level. Choosing a good objective is most of the job.