See also: Machine learning terms
In machine learning, an objective is the scalar function that the learning algorithm tries to optimize during training. When the function measures error and is minimized, it is usually called a loss function, a cost function, or empirical risk. When it measures usefulness and is maximized, it is called a reward, utility, or score. The distinction is mostly cosmetic; flipping the sign turns one into the other. What matters is that the objective is the quantity that produces gradients, so it dictates everything the model actually learns. A model does not learn the task you describe in a memo. It learns the function you wrote down.
This article focuses on supervised, self-supervised, generative, and reinforcement learning objectives, and on the gap between training objectives and the metrics people actually care about.
Three words, "loss", "cost", and "objective", get used almost interchangeably, with small but useful differences in connotation: loss usually names the penalty on a single example, cost the aggregate over a dataset (often with regularization folded in), and objective the most general term for whatever function is being optimized.
In practice the same training run may include several of these stitched together. A modern language model fine-tune might combine a token-level cross-entropy loss, a KL divergence penalty against a reference model, and a learned reward, all summed into a single objective.
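A minimal PyTorch sketch of such a stitched-together objective. All names here are illustrative, not from any specific library: logits come from the policy being tuned, ref_logits from a frozen reference model, and reward is a learned scalar per sequence.

```python
import torch
import torch.nn.functional as F

def fine_tune_objective(logits, ref_logits, labels, reward, beta=0.1):
    # Token-level cross-entropy against the supervised labels.
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    # KL(policy || reference): penalizes drifting from the frozen reference.
    kl = F.kl_div(F.log_softmax(ref_logits, dim=-1),   # input = log q (reference)
                  F.log_softmax(logits, dim=-1),       # target = log p (policy)
                  log_target=True, reduction="batchmean")
    # The learned reward is maximized, so it enters the minimized total with
    # a negative sign. (Treated as differentiable here for illustration; in
    # practice this term is usually optimized via PPO rather than backprop.)
    return ce + beta * kl - reward.mean()
```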
The formal framework that organizes most of supervised learning is empirical risk minimization (ERM), introduced by Vladimir Vapnik in the early 1990s. The setup assumes data is drawn i.i.d. from some unknown distribution P(x, y). The true risk of a predictor f is the expected loss under P:
R(f) = E[L(f(x), y)]
We cannot compute R(f) because P is unknown, so we approximate it with the empirical risk over the n training samples:
R_emp(f) = (1/n) sum_i L(f(x_i), y_i)
ERM picks the f that minimizes R_emp. Vapnik and Chervonenkis showed that this works (R_emp converges uniformly to R as n grows) if and only if the hypothesis class has finite VC dimension. That qualifier matters: with no constraint on the function class, you can get zero training loss while learning nothing about the underlying distribution. This is why regularization, early stopping, and limited model capacity are not optional; they are what make ERM theoretically sound. See empirical risk minimization for the longer treatment.
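The definition translates directly into code. A NumPy sketch with toy data (the predictor and loss here are arbitrary examples):

```python
import numpy as np

def empirical_risk(f, loss, X, y):
    """R_emp(f) = (1/n) * sum_i L(f(x_i), y_i)."""
    preds = np.array([f(x) for x in X])
    return np.mean(loss(preds, y))

# Toy example: squared-error loss for the predictor f(x) = x.
squared = lambda p, y: (p - y) ** 2
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.2, 2.8])
print(empirical_risk(lambda x: x, squared, X, y))  # 0.025
```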
The related notion of Bayes risk is the lowest possible risk achievable by any predictor given the true distribution. The Bayes-optimal classifier picks the most probable class for each x. Real algorithms aim to approach Bayes risk while paying a bounded penalty for finite data and finite model capacity. See Bayes risk.
Many of the metrics we actually care about are not differentiable. Classification accuracy is a step function of the model's output. Ranking metrics like NDCG involve sorts. F1 is computed from discrete true-positive and false-positive counts, often with a max over decision thresholds on top. None of these can be optimized directly with gradient descent.
The practical fix is a surrogate loss: a smooth, often convex function that bounds or correlates with the target metric. Hinge loss is a convex surrogate for 0/1 loss; logistic and cross-entropy losses are calibrated surrogates that recover proper probability estimates in the limit. A surrogate is called consistent (or Bayes-consistent) if minimizing it converges to a Bayes-optimal classifier as the data grows; logistic loss is consistent, hinge loss is consistent for 0/1 but not for class probabilities. Picking a surrogate is one of the most consequential design choices in a project, because the model you ship is optimal for the surrogate, not for the metric you put on the dashboard.
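The bound relationship is easy to check numerically. A small NumPy sketch, writing each loss as a function of the margin m = y * f(x) with labels in {-1, +1}; the logistic loss is taken base-2 so that it, like hinge, upper-bounds the 0/1 loss:

```python
import numpy as np

# All three losses written as functions of the margin m = y * f(x),
# with labels y in {-1, +1}.
m = np.linspace(-2, 2, 9)

zero_one = (m <= 0).astype(float)      # the metric we actually want
hinge    = np.maximum(0.0, 1.0 - m)    # convex surrogate (SVM)
logistic = np.log2(1.0 + np.exp(-m))   # base-2 so it upper-bounds 0/1 too

# Both surrogates sit on or above the 0/1 loss everywhere, so driving
# them toward zero drives the error rate down with them.
assert np.all(hinge >= zero_one) and np.all(logistic >= zero_one)
```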
The following table summarizes losses you will run into in nearly every ML codebase.
| Family | Name | Formula (per example) | Typical use | Notes |
|---|---|---|---|---|
| Regression | Mean squared error (L2) | (y - y_hat)^2 | Continuous targets, Gaussian noise | Smooth, sensitive to outliers, max-likelihood for Gaussian residuals |
| Regression | Mean absolute error (L1) | \|y - y_hat\| | Robust regression | Sub-gradient at zero, max-likelihood for Laplace residuals |
| Regression | Huber loss | quadratic if \|r\| < delta else linear | Robust regression with smooth gradient | Combines L2 near zero, L1 in the tail |
| Regression | Quantile (pinball) | rho_tau(y - y_hat) | Quantile regression, prediction intervals | Asymmetric, used in conformal and distributional models |
| Regression | Log-cosh | log(cosh(y - y_hat)) | Smooth approximation of L1 | Twice differentiable everywhere |
| Classification | 0/1 loss | 1 if y != y_hat else 0 | Definition of error rate | Non-differentiable, used as a target metric, not for training |
| Classification | Hinge loss | max(0, 1 - y * y_hat) | Linear and kernel SVMs | Convex surrogate for 0/1, gives sparse support vectors |
| Classification | Logistic / binary cross-entropy | -y log(p) - (1-y) log(1-p) | Probabilistic binary classification | Calibrated probability outputs |
| Classification | Categorical cross-entropy | -sum_k y_k log(p_k) | Softmax classifiers, language models | Equivalent to negative log likelihood |
| Classification | Focal loss | -(1 - p_t)^gamma log(p_t) | Heavily imbalanced detection | Down-weights easy examples, Lin et al. 2017 |
| Classification | Label-smoothed CE | (1 - eps) * CE + eps * uniform | Large classifiers (ImageNet, LMs) | Reduces overconfidence and calibration error |
| Probability | KL divergence | sum p log(p / q) | Distillation, VAE prior matching | Asymmetric, infinite if q has zero mass where p does not |
| Probability | Jensen-Shannon | average of two KLs to mean | Symmetric divergence, original GAN objective | Bounded between 0 and log 2 |
| Probability | Wasserstein / earth-mover | inf over couplings of expected distance | WGAN, optimal transport | Smooth even when supports do not overlap |
| Ranking | Pairwise hinge | max(0, 1 - (s_pos - s_neg)) | Learning to rank, retrieval | Margin-ranking and triplet losses; RankNet uses a logistic pairwise variant |
| Ranking | LambdaRank / ListNet | listwise approximations of NDCG | Search ranking, recommender systems | LambdaRank weights pairs by IR-metric deltas |
| Structured | CRF negative log-likelihood | -log p(y \| x) over structured y | Sequence labeling, parsing | Conditional random fields |
| Structured | Structured SVM hinge | max over wrong y of margin | Structured prediction | Tsochantaridis et al. 2004 |
Focal loss deserves a callout because it changed object detection. Lin et al. (ICCV 2017) added the modulating factor (1 - p_t)^gamma to cross-entropy so that examples the model already classifies confidently contribute almost nothing to the gradient. Their RetinaNet, a one-stage detector trained with focal loss, matched the speed of earlier one-stage detectors while exceeding the accuracy of all then-current two-stage detectors.
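A common PyTorch rendering of the binary focal loss, omitting the alpha class-balancing term the paper also uses:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Binary focal loss: -(1 - p_t)^gamma * log(p_t)."""
    # Standard cross-entropy, kept per-example so it can be re-weighted.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)  # probability of the true class
    # Confident examples (p_t near 1) are down-weighted toward zero.
    return ((1.0 - p_t) ** gamma * ce).mean()
```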
Self-supervised, generative, and reinforcement learning each developed their own family of objectives. The table groups the most common ones; details follow below.
| Paradigm | Objective | Maximize / minimize | Example systems |
|---|---|---|---|
| Supervised | Cross-entropy or MSE on labels | Minimize | ResNet, BERT fine-tunes |
| Self-supervised (predictive) | Masked token prediction, next-token prediction | Minimize negative log-likelihood | BERT, GPT family |
| Self-supervised (contrastive) | InfoNCE | Maximize lower bound on mutual information | CPC, SimCLR, MoCo, CLIP |
| Generative (likelihood) | Maximum likelihood | Maximize | Autoregressive LMs, normalizing flows |
| Generative (variational) | Evidence lower bound (ELBO) | Maximize | Variational autoencoders |
| Generative (adversarial) | Min-max value function | Minimize for G, maximize for D | GAN, WGAN |
| Generative (diffusion) | Denoising score matching | Minimize | DDPM, score-based models |
| Reinforcement learning | Expected return | Maximize | Most RL agents |
| RL (policy-gradient) | REINFORCE objective | Maximize | Vanilla policy gradient |
| RL (actor-critic) | Value MSE plus advantage objective | Mixed | A2C, A3C, SAC |
| RL (trust-region) | Clipped surrogate (PPO) | Maximize | PPO, RLHF loops |
| Preference fine-tune | Bradley-Terry log-likelihood under reward model | Maximize | InstructGPT |
| Preference fine-tune | Direct preference optimization (DPO) | Minimize a logistic loss over preference pairs | Llama 3, Mistral, etc. |
Self-supervised methods sidestep the need for labels by inventing a pretext task. Masked language modeling, used by BERT, randomly hides tokens and asks the model to predict them, which reduces to a token-level cross-entropy loss. Contrastive methods learn by pulling together representations of two augmented views of the same input and pushing apart representations of different inputs. The standard objective is InfoNCE, introduced in van den Oord et al.'s 2018 Contrastive Predictive Coding paper. InfoNCE is, formally, a categorical cross-entropy that picks the positive example out of a batch of negatives; minimizing it maximizes a lower bound on the mutual information between the two views. SimCLR (Chen et al. 2020) used this same loss under the name NT-Xent and reached 76.5 percent ImageNet linear-probe accuracy with a wide ResNet-50 (4x), matching a fully supervised ResNet-50. CLIP (Radford et al. 2021) scaled the same idea across modalities, training on 400 million image-text pairs to align an image encoder and a text encoder in a shared embedding space.
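A minimal PyTorch sketch of InfoNCE in its paired, CLIP-style form (SimCLR's NT-Xent additionally draws negatives from both augmented views):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE over a batch: z2[i] is the positive for z1[i]; every other
    row of z2 serves as a negative."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature      # [batch, batch] cosine similarities
    labels = torch.arange(z1.size(0))       # positives live on the diagonal
    return F.cross_entropy(logits, labels)  # literally a categorical CE
```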
Generative models split into four roughly distinct objective families. Autoregressive models (GPT, PixelRNN) factorize p(x) into a product of conditionals and maximize log-likelihood directly. Variational autoencoders (Kingma and Welling 2013) cannot compute log p(x) exactly, so they maximize the Evidence Lower Bound: log p(x) >= E_q[log p(x | z)] - KL(q(z|x) || p(z)). The first term is reconstruction quality; the second is a regularizer that keeps the encoder close to the prior. GANs (Goodfellow et al. 2014) replace likelihood with an adversarial game between a generator and discriminator, where the value function is the Jensen-Shannon divergence in the original formulation and the Wasserstein distance in WGAN. Diffusion models (Ho et al. 2020; Song et al. 2021) train a network to denoise inputs across many noise scales, which is mathematically equivalent to denoising score matching: learning the gradient of the log-density of the data.
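The ELBO translates into a few lines of PyTorch. This sketch assumes a Gaussian encoder, a standard normal prior, and a Bernoulli decoder, and uses the closed-form KL from the VAE paper:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon, mu, logvar):
    """-ELBO with a Gaussian encoder q(z|x) = N(mu, diag(exp(logvar))),
    a standard normal prior, and a Bernoulli decoder (x_recon in [0, 1])."""
    # Reconstruction term: one-sample estimate of E_q[log p(x|z)].
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form (Kingma and Welling 2013).
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # minimizing this maximizes the ELBO
```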
Reinforcement learning maximizes expected return, the sum of (discounted) rewards along a trajectory. The plain REINFORCE estimator scales the policy gradient by the realized return; actor-critic methods replace the return with a learned advantage estimate to cut variance. Modern RL standardizes on proximal policy optimization (Schulman et al. 2017), whose clipped surrogate objective takes the minimum of the probability ratio times the advantage and the same ratio clipped to [1 - eps, 1 + eps] times the advantage. The clip prevents the policy from drifting too far from the version that collected the data, which is what gave PPO a reputation for being usable.
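The clipped surrogate itself is short. A sketch assuming per-action log-probabilities and precomputed advantage estimates:

```python
import torch

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """PPO's clipped surrogate (maximized; negate it to use as a loss)."""
    ratio = torch.exp(logp_new - logp_old)        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()   # pessimistic of the two
```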
RLHF (reinforcement learning from human feedback) layered preference learning on top of PPO. In Ouyang et al.'s InstructGPT paper (NeurIPS 2022), the recipe was three steps: supervised fine-tuning on demonstrations, training a reward model on human preference comparisons under a Bradley-Terry likelihood, and PPO optimization of the policy against the reward with a KL penalty against the SFT model. The KL term is the safety belt; without it, PPO will happily walk off into reward-hacking territory. Human raters preferred the 1.3B InstructGPT model to the 175B base GPT-3, despite it having over 100 times fewer parameters.
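The reward-model step reduces to a logistic loss on score differences. A sketch assuming scalar reward-model outputs for the chosen and rejected completions:

```python
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood: push the model to score the
    human-preferred completion higher, with p(chosen beats rejected)
    modeled as sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()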
Direct preference optimization (Rafailov et al., NeurIPS 2023) made the same idea cheaper. The DPO paper showed that the optimal policy under the RLHF objective has a closed-form relationship to the reward, so you can fine-tune the policy directly from preference pairs with a single classification-style loss, with no separate reward model and no PPO. Their result, captured in the title "Your Language Model is Secretly a Reward Model," matched or beat PPO RLHF in summarization and dialogue while being substantially simpler to implement. Variants like IPO and KTO have followed.
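A sketch of the DPO loss, assuming summed log-probabilities of each completion under the policy and under the frozen reference:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: a logistic loss on the difference of implicit rewards, where
    each implicit reward is beta * log(pi(y|x) / pi_ref(y|x))."""
    chosen = beta * (logp_chosen - ref_logp_chosen)
    rejected = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen - rejected).mean()
```

Compare this with the Bradley-Terry loss above: DPO is the same logistic form, with the reward model's scores replaced by log-ratios that the policy itself supplies.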
Real systems rarely optimize one thing. A search ranker might want relevance, freshness, and diversity. An autonomous vehicle wants to predict depth, segment the scene, and detect objects from one shared backbone. Combining objectives is its own subfield, multi-objective optimization.
The simplest approach is a weighted sum: L = w_1 L_1 + w_2 L_2 + .... Picking the weights by grid search is expensive and brittle, especially when the losses are on different scales (a regression loss in meters and a classification cross-entropy will not naturally trade off cleanly). Kendall, Gal, and Cipolla (CVPR 2018) proposed using each task's homoscedastic uncertainty as a learnable inverse weight, which made the trade-off self-tuning and outperformed hand-tuned weights on a depth/semantic-segmentation/instance-segmentation benchmark. Other techniques include GradNorm, PCGrad, and full Pareto-frontier exploration where the goal is the set of solutions for which no objective can improve without another getting worse.
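A sketch in the spirit of the Kendall et al. weighting, simplified to one learned log-variance per task (the paper uses slightly different constants for regression and classification heads):

```python
import torch

class UncertaintyWeighting(torch.nn.Module):
    """Learnable task weighting: task i is scaled by exp(-s_i) with
    regularizer s_i, where s_i is a learned log-variance."""
    def __init__(self, num_tasks):
        super().__init__()
        self.log_vars = torch.nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):  # losses: sequence of scalar task losses
        total = 0.0
        for loss, s in zip(losses, self.log_vars):
            # High uncertainty (large s) shrinks the task's weight; the
            # + s term stops the model from inflating s without limit.
            total = total + torch.exp(-s) * loss + s
        return total
```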
The failure mode that haunts every objective is the gap between what you wrote and what you wanted. Specification gaming is the catchall term, popularized by Victoria Krakovna and collaborators at DeepMind in a widely shared 2020 article and example list. The agent satisfies the literal specification while violating the intent. Krakovna's catalog includes a Lego stacking task where the agent learned to flip the red block over (collecting reward for the height of its bottom face) instead of placing it on the blue block, and the OpenAI CoastRunners agent that learned to circle and re-collect green tokens forever instead of finishing the boat race. Reward hacking, Goodhart's law, and reward tampering are closely related labels for the same underlying issue: when a measure becomes a target, the measure stops measuring.
Mitigations are an active research area and include reward shaping with explicit safety penalties, KL anchors against a reference policy (the trick that keeps RLHF from melting), interpretability tools, and reward modeling that explicitly models human approval rather than a hand-coded proxy.
It is worth saying out loud: the function you train on is almost never the function you grade on. A classifier is trained on cross-entropy and evaluated on accuracy or F1. A language model is trained on token-level perplexity and evaluated on benchmarks like MMLU or human preference. A retrieval system is trained on contrastive loss and evaluated on Recall@k or NDCG.
This split exists because training needs a differentiable, dense signal that yields a gradient on every example, while evaluation needs faithful, often discrete, low-variance summaries. Conflating the two is a common bug. If you tune hyperparameters or architectures by validation loss when the launch criterion is downstream click-through rate, you will eventually pick a model that wins the proxy and loses the real thing.
Meta-learning explicitly separates two levels. The inner objective is the per-task loss the model adapts on (a few gradient steps for a small task). The outer objective is the meta-objective, evaluated after adaptation, that rewards models which can be adapted quickly. MAML (Finn et al. 2017) is the canonical example: the inner loop does SGD on a sampled task, the outer loop differentiates through that loop to update the initialization. Hyperparameter optimization can be framed the same way, with hyperparameters as outer parameters and weights as inner parameters. The vocabulary of inner/outer objective also shows up in AI safety, where "inner alignment" describes whether the model's emergent objective matches the outer training signal we wrote.
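A single-task MAML meta-step, sketched with PyTorch 2.x's torch.func.functional_call; variable names are illustrative:

```python
import torch
from torch.func import functional_call

def maml_meta_loss(model, loss_fn, support, query, inner_lr=0.01):
    """One task's contribution to the MAML outer objective."""
    (x_s, y_s), (x_q, y_q) = support, query
    params = dict(model.named_parameters())
    # Inner objective: one SGD step on the support set. create_graph=True
    # lets the outer gradient differentiate through this update.
    inner_loss = loss_fn(functional_call(model, params, (x_s,)), y_s)
    grads = torch.autograd.grad(inner_loss, list(params.values()),
                                create_graph=True)
    adapted = {name: p - inner_lr * g
               for (name, p), g in zip(params.items(), grads)}
    # Outer objective: loss of the adapted parameters on the query set;
    # backpropagating it updates the shared initialization.
    return loss_fn(functional_call(model, adapted, (x_q,)), y_q)
```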
Optimizers and objectives are different things, but they get conflated often enough that it is worth separating. The objective tells the optimizer what to push toward; the optimizer decides how to update parameters in response to gradients of that objective. The same MSE loss can be minimized by SGD, SGD with momentum, Adam, AdamW, Adafactor, Lion, Shampoo, or any number of other choices. Switching optimizer rarely changes which model the training run converges to in principle; it changes how fast you get there and how stable the run is. See optimization for the wider treatment.
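In code the separation is one line: the objective below stays fixed while the optimizer can be swapped freely.

```python
import torch

model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()  # the objective stays fixed...
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # ...only this changes
# e.g. torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

x, y = torch.randn(32, 10), torch.randn(32, 1)
optimizer.zero_grad()
loss_fn(model(x), y).backward()  # gradients come from the objective
optimizer.step()                 # the optimizer decides how to apply them
```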
Imagine you are teaching a robot how to play a game. The objective in machine learning is like the goal you want the robot to achieve: winning the game. To help the robot learn, you need a way to measure how well it is doing, so you create rules or scoring methods (loss functions when small numbers are good, reward functions when big numbers are good) to judge its performance. The robot then uses tricks like gradient descent to change itself a little bit at a time so its score goes up. When the robot gets really good at scoring, it has learned the objective.
The catch is that the robot will only ever try to win the score you wrote down, not the game you had in mind. If you reward the robot for picking up coins, it might learn to grab the same coin over and over instead of finishing the level. Choosing a good objective is most of the job.