See also: Machine learning terms
In machine learning and mathematical optimization, an objective function is the scalar-valued function that an optimization algorithm seeks to minimize or maximize. It maps a candidate solution (in ML, the model parameters) to a single real number that summarizes how good or bad that solution is. Every gradient that updates a neural network, every step a gradient descent routine takes, and every weight an SGD update changes is ultimately produced by differentiating an objective function. In Goodfellow, Bengio, and Courville's Deep Learning, the authors put it plainly: the function we want to minimize or maximize is called the objective function, or criterion, and when we are minimizing it we may also call it the cost function, loss function, or error function.
This article focuses on the objective function as a mathematical object: its formal structure, the most common families of objectives used in ML, the analytical properties (convexity, smoothness, Lipschitz continuity) that govern which optimizers can solve it, and the practical pitfalls that arise when an objective is poorly chosen. For the broader concept that includes rewards, business metrics, and alignment goals, see objective.
Given a parameter space Θ (typically R^d for a model with d parameters), an objective function is any map
J : Θ → R
that the learning algorithm tries to minimize:
θ* = arg min_{θ ∈ Θ} J(θ)
or maximize, depending on convention. In supervised ML the canonical form is the regularized empirical risk:
J(θ) = (1/n) Σ_{i=1}^{n} L(f_θ(x_i), y_i) + λ R(θ)
The first term is a data-fit cost averaged over n training examples, where L is a per-example loss function, f_θ is the model, and (x_i, y_i) is the i-th input/label pair. The second term is a regularization penalty (such as L2, L1, or weight-decay), scaled by a hyperparameter λ that trades off training fit against complexity. When λ = 0, J reduces to the empirical risk and the setup is the classical empirical risk minimization (ERM) problem of Vapnik.
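Concretely, the regularized empirical risk translates to a few lines of code. A minimal PyTorch sketch, assuming a squared-error loss and an L2 penalty (the model and data here are placeholders, not part of any particular recipe):

```python
import torch

def objective(model, X, y, lam=1e-2):
    """Regularized empirical risk: J(theta) = (1/n) sum of per-example losses + lam * R(theta)."""
    preds = model(X)                                        # f_theta(x_i) for the whole batch
    data_fit = torch.mean((preds - y) ** 2)                 # squared-error loss averaged over n examples
    l2 = sum(p.pow(2).sum() for p in model.parameters())    # R(theta): L2 penalty on the parameters
    return data_fit + lam * l2                              # the scalar the optimizer differentiates

model = torch.nn.Linear(3, 1)
X, y = torch.randn(8, 3), torch.randn(8, 1)
loss = objective(model, X, y)
loss.backward()                                             # gradients of J with respect to theta
```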
Writing the objective explicitly serves two roles. It tells the optimizer what to do, and it tells the analyst what the model is actually learning. The model does not learn the task description in a Slack message; it learns whatever function J encodes.
The vocabulary is overloaded and inconsistent across textbooks. Here is the breakdown used by most modern ML practitioners.
| Term | Scope | Typical formula | Direction |
|---|---|---|---|
| Loss | Per example | L(f(x_i), y_i) | Minimize |
| Cost | Aggregate over a batch or dataset | (1/n) Σ L(f(x_i), y_i) | Minimize |
| Empirical risk | Average loss on training set | R_emp(f) = (1/n) Σ L(f(x_i), y_i) | Minimize |
| True risk (population risk) | Expected loss under data distribution P | R(f) = E_{(x,y)~P}[L(f(x), y)] | Minimize |
| Reward / utility | Per-step or per-trajectory benefit (RL) | r(s, a) | Maximize |
| Return | Discounted sum of future rewards | G_t = Σ γ^k r_{t+k+1} | Maximize |
| Fitness | Quality score in evolutionary search | problem-specific | Maximize |
| Criterion / objective | Umbrella term for any of the above | J(θ) | Either |
In casual usage "loss" and "cost" are often interchangeable. Sebastian Raschka points out that some authors reserve "loss" for the single-example version and "cost" for the dataset average, but plenty of textbooks (including Deep Learning) use them as synonyms. The PyTorch and TensorFlow APIs both call them "losses" and aggregate with a reduction argument that defaults to 'mean'.
"Criterion" is the mathematical-statistics term and shows up in PyTorch (nn.CrossEntropyLoss is invoked as criterion). In evolutionary algorithms the same scalar is called a fitness function and is maximized rather than minimized. None of these distinctions affect the math; flipping the sign turns a loss into a reward.
Almost every modern training objective decomposes into a small number of recognizable pieces: a data-fit term, one or more regularization or penalty terms, and sometimes auxiliary losses. These terms are usually combined linearly with weighting hyperparameters. Treating those weights as fixed is a modeling choice, and it implicitly defines a single point on a Pareto frontier in the space of objectives. Multi-objective approaches, discussed below, make that trade-off explicit.
The table below collects the formulas you will encounter in nearly any production codebase. Per-example formulas are shown; the full objective sums or averages over the dataset and adds regularization.
| Family | Name | Formula (per example) | Range | Differentiable | Typical use |
|---|---|---|---|---|---|
| Regression | Mean squared error (MSE, L2) | (y - ŷ)^2 | [0, ∞) | Everywhere | Linear regression, value heads in RL, diffusion noise prediction |
| Regression | Mean absolute error (MAE, L1) | \|y - ŷ\| | [0, ∞) | Except at 0 | Robust regression with heavy-tailed noise |
| Regression | Huber loss | (1/2)r^2 if \|r\| ≤ δ, else δ(\|r\| - δ/2) | [0, ∞) | Yes (C^1) | Robust regression, RL value heads (DQN's clipped TD error) |
| Regression | Quantile (pinball) | ρ_τ(y - ŷ) | [0, ∞) | Except at 0 | Quantile regression, prediction intervals |
| Classification | 0/1 loss | 1[y ≠ ŷ] | {0, 1} | No | Definition of error rate, not used for training |
| Classification | Binary cross-entropy | -y log p - (1-y) log(1-p) | [0, ∞) | Yes | Binary classification, sigmoid heads |
| Classification | Categorical cross-entropy | -Σ_k y_k log p_k | [0, ∞) | Yes | Softmax classifiers, language models |
| Classification | Hinge loss | max(0, 1 - y·ŷ) | [0, ∞) | Except at the kink | Linear and kernel SVMs |
| Classification | Focal loss | -(1 - p_t)^γ log p_t | [0, ∞) | Yes | Heavily imbalanced detection (RetinaNet) |
| Probability | KL divergence | Σ p log(p/q) | [0, ∞) | Yes (where p, q > 0) | Distillation, VAE prior matching, RLHF penalty |
| Probability | Wasserstein-1 | inf over couplings of E\|x-y\| | [0, ∞) | Yes (with critic) | WGAN, optimal transport |
| Metric | Triplet loss | max(0, d(a,p) - d(a,n) + m) | [0, ∞) | Except at the kink | Face recognition, embedding learning (FaceNet) |
| Self-supervised | InfoNCE | -log[exp(s_pos/τ) / Σ_j exp(s_j/τ)] | [0, ∞) | Yes | CPC, SimCLR, MoCo, CLIP |
| Generative (variational) | Negative ELBO | -E_q[log p(x\|z)] + KL(q(z\|x) \|\| p(z)) | R | Yes | Variational autoencoders |
| Generative (diffusion) | Noise-prediction MSE | \|\|ε - ε_θ(x_t, t)\|\|^2 | [0, ∞) | Yes | DDPM, Stable Diffusion |
| RL (policy gradient) | PPO clipped surrogate | E[min(r·Â, clip(r, 1-ε, 1+ε)·Â)] | R | Almost everywhere | LLM RLHF, robotics, game-playing |
| RL (preference) | DPO loss | -log σ(β log(π_θ(y_w)/π_ref(y_w)) - β log(π_θ(y_l)/π_ref(y_l))) | [0, ∞) | Yes | Preference fine-tuning of LLMs |
A few of these are worth a closer look.
Huber loss is parameterized by a threshold δ. For residuals smaller than δ it behaves like MSE (smooth, easy to optimize); for larger residuals it behaves like MAE (linear, robust to outliers). It dates to Peter Huber's 1964 paper on robust estimation, and DQN's error clipping is essentially Huber loss with δ = 1.
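A minimal NumPy sketch of the piecewise definition:

```python
import numpy as np

def huber(residual, delta=1.0):
    """Quadratic (MSE-like) inside |r| <= delta, linear (MAE-like) outside."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

print(huber(np.array([-3.0, -0.5, 0.0, 0.5, 3.0])))  # tails grow linearly, the center quadratically
```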
Focal loss, introduced by Lin et al. at ICCV 2017, multiplies cross-entropy by (1 - p_t)^γ so that easy examples (those already classified with high probability) contribute almost nothing to the gradient. Their RetinaNet detector trained with focal loss matched single-stage detector speeds while exceeding the accuracy of all then-current two-stage detectors.
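A per-example sketch of the focal term; the class-balancing factor α from the paper is omitted here for brevity:

```python
import torch

def focal_loss(p_t, gamma=2.0):
    """p_t: predicted probability of the true class.
    The (1 - p_t)^gamma factor shrinks the loss of already-easy examples toward zero."""
    return -((1 - p_t) ** gamma) * torch.log(p_t)

easy, hard = torch.tensor(0.95), torch.tensor(0.2)
print(focal_loss(easy), focal_loss(hard))   # the easy example contributes almost nothing
```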
InfoNCE, due to van den Oord, Li, and Vinyals (2018) in the Contrastive Predictive Coding paper, treats representation learning as classification: pick the positive sample out of a set that includes one positive and N - 1 negatives. The loss is a categorical cross-entropy over the resulting logits, and minimizing it maximizes a lower bound on the mutual information between context and target (I ≥ log N minus the expected loss).
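A sketch of the computation, assuming the embeddings have already been produced by an encoder; names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce(query, keys, tau=0.07):
    """query: (d,) anchor embedding; keys: (N, d) with row 0 the positive sample.
    Picking the positive out of N candidates is an N-way classification problem."""
    logits = keys @ query / tau                     # similarity scores s_j / tau
    target = torch.zeros(1, dtype=torch.long)       # the positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)
```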
The DPO loss of Rafailov et al. (NeurIPS 2023) is a binary cross-entropy on a margin between the log-ratio of preferred and dispreferred completions under the policy versus a frozen reference model. The result is a closed-form objective that recovers the optimal policy of an RLHF KL-regularized reward maximization problem without ever fitting an explicit reward model.
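A sketch of the DPO loss assuming the per-completion log-probabilities have already been computed (extracting them from a language model is the expensive part and is omitted):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Binary cross-entropy on the margin between policy and reference log-ratios.
    *_w: summed log-prob of the preferred completion; *_l: of the dispreferred one."""
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```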
The analytical properties of J(θ) determine which optimizers can solve it efficiently and whether the global optimum can be found at all.
| Property | Definition (informal) | Why it matters | ML examples |
|---|---|---|---|
| Convex | Line segment between any two points lies above the graph | Every local minimum is a global minimum | MSE in linear regression, logistic regression loss, SVM hinge with L2 |
| Strictly convex | Convex with no flat regions | At most one global minimum | Logistic regression with a full-rank design matrix |
| Strongly convex | Convex with quadratic lower bound (μ-strongly convex) | Linear convergence rate for gradient descent | L2-regularized convex losses |
| Non-convex | Multiple local minima or saddle points | No global guarantees from local methods | Neural networks, matrix factorization, mixture models |
| Smooth (C^1, Lipschitz gradient) | ∇f exists everywhere and is L-Lipschitz | Required for standard convergence proofs of GD | Cross-entropy, MSE, networks with smooth activations such as sigmoid |
| Non-smooth | Subgradient exists but gradient does not at some points | Need subgradient or proximal methods | L1 regularization, hinge loss, ReLU networks at kinks |
| Lipschitz (function) | \|f(x) - f(y)\| ≤ K\|x - y\| | Bounds on excess risk, GAN critic constraints | WGAN critic with weight clipping or gradient penalty |
| Coercive | f(θ) → ∞ as \|\|θ\|\| → ∞ | Guarantees a finite minimum exists | Regularized objectives (L2 makes anything coercive) |
In the Boyd and Vandenberghe formulation, a function is convex if its domain is convex and the chord between any two graph points lies above the graph. Strong convexity adds a quadratic lower bound: f(y) ≥ f(x) + ∇f(x)·(y - x) + (μ/2)||y - x||^2 for some μ > 0. L-smoothness is the symmetric quadratic upper bound. Together, L-smooth and μ-strongly convex objectives admit linear convergence rates for gradient descent, with iteration complexity O((L/μ) log(1/ε)).
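The linear rate is easy to verify numerically on a strongly convex quadratic, where μ and L are the smallest and largest eigenvalues of the Hessian; a sketch:

```python
import numpy as np

A = np.diag([1.0, 10.0])        # Hessian eigenvalues: mu = 1, L = 10
x = np.array([1.0, 1.0])
step = 1.0 / 10.0               # the classical 1/L step size

for k in range(5):
    x = x - step * (A @ x)      # gradient step on f(x) = 0.5 x^T A x
    print(np.linalg.norm(x))    # error contracts by the constant factor 1 - mu/L = 0.9
```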
Most classical ML objectives (linear regression, logistic regression, soft-margin SVM with L2) are convex; almost no neural-network objective is. That is why theoretical guarantees from convex optimization only loosely transfer to deep learning.
The right optimizer depends on whether the objective is convex, smooth, constrained, and stochastic.
| Objective shape | Recommended optimizers | Notes |
|---|---|---|
| Convex, smooth, low-dim | Newton's method, BFGS, L-BFGS | Quadratic convergence near optimum |
| Convex, smooth, large-scale | Gradient descent, accelerated gradient (Nesterov), conjugate gradient | O(1/k^2) with acceleration |
| Convex, non-smooth (e.g., L1) | Subgradient method, proximal gradient (ISTA, FISTA), ADMM | Proximal map handles non-smooth term in closed form |
| Non-convex, smooth, large-scale | SGD with momentum, Adam, AdamW | The dominant family for deep learning |
| Non-convex, non-smooth | Subgradient SGD, proximal SGD | Most production neural nets use this implicitly via ReLU plus weight decay |
| Constrained (linear or convex) | Projected gradient, interior-point, Frank-Wolfe, augmented Lagrangian, ADMM | Interior-point dominates small/medium convex programs |
| Constrained (general nonlinear) | Sequential quadratic programming, Lagrange + KKT conditions | Standard for trajectory optimization, control |
| Black-box (no gradient) | Bayesian optimization, CMA-ES, evolutionary strategies, simulated annealing | Used for hyperparameter tuning, neural-architecture search |
Adam, introduced by Diederik Kingma and Jimmy Ba at ICLR 2015, is the default optimizer for most deep-learning workloads. Their paper described it as a method for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The algorithm is invariant to diagonal rescaling of gradients, has small memory overhead, and is well suited for problems with very noisy or sparse gradients. Subsequent work (AdamW, by Loshchilov and Hutter) decoupled weight decay from the gradient step, fixing a subtle interaction between adaptive learning rates and L2 regularization.
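A condensed sketch of the Adam update rule, using the paper's default hyperparameters:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: per-coordinate step sizes from bias-corrected moment estimates."""
    m = b1 * m + (1 - b1) * grad           # first moment: running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2      # second moment: running mean of squared gradients
    m_hat = m / (1 - b1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```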
A recurring problem is that the metric we care about (classification accuracy, F1, NDCG, BLEU) is non-differentiable, piecewise constant, or involves sorts and thresholds. None of these can be optimized directly with gradient methods. The standard fix is to optimize a surrogate: a smooth, often convex function that bounds or correlates with the target metric.
A surrogate is called Bayes-consistent (or proper) if minimizing it converges to a Bayes-optimal predictor as data grows. Logistic loss is consistent for classification and gives calibrated probabilities; hinge loss is consistent for the 0/1 decision but not for probabilities; squared loss is consistent for the conditional mean. Picking a surrogate is one of the most consequential design choices in a project, because the model that gets shipped is optimal for the surrogate, not for the metric on the dashboard.
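A toy comparison makes the point concrete: the 0/1 loss is piecewise constant in the margin, so it provides no gradient, while the (base-2) logistic loss is a smooth convex upper bound on it:

```python
import numpy as np

margin = np.linspace(-3, 3, 601)               # m = y * f(x)
zero_one = (margin <= 0).astype(float)         # the metric: flat almost everywhere, zero gradient
logistic = np.log2(1.0 + np.exp(-margin))      # smooth convex surrogate

assert np.all(logistic >= zero_one)            # the surrogate upper-bounds the 0/1 loss
```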
Many ML problems are naturally constrained. SVM training is a constrained quadratic program; fairness-aware learning constrains a disparity metric; safe RL constrains expected damage. The general form is
minimize J(θ)
subject to g_i(θ) ≤ 0, i = 1, ..., m
           h_j(θ) = 0, j = 1, ..., p
The two main analytic tools are Lagrange multipliers and the Karush-Kuhn-Tucker (KKT) conditions. The Lagrangian
L(θ, λ, ν) = J(θ) + Σ λ_i g_i(θ) + Σ ν_j h_j(θ)
turns a constrained problem into an unconstrained saddle-point problem in (θ, λ, ν). The KKT conditions are necessary first-order optimality conditions: stationarity of the Lagrangian, primal feasibility, dual feasibility (λ ≥ 0), and complementary slackness (λ_i g_i(θ) = 0). For convex problems with appropriate constraint qualifications, KKT conditions are also sufficient.
The original SVM was framed as a constrained quadratic program: maximize the margin subject to all training points being classified correctly with a margin of at least 1. The dual formulation, derived via Lagrange multipliers, replaces the optimization over weight vectors with an optimization over per-example dual variables, exposes the support vectors as those with non-zero duals, and enables the kernel trick.
In practice, deep-learning constrained optimization usually uses penalty methods (add λ times a violation term to J) or augmented Lagrangians (alternate updates to θ and to the multipliers). These are easier to implement on autograd frameworks than projected gradient descent for complex constraints.
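A minimal penalty-method sketch in PyTorch; the objective and the single inequality constraint are toy placeholders standing in for real ones:

```python
import torch

theta = torch.randn(2, requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.05)
lam = 10.0                                    # penalty weight; grown across rounds in practice

J = lambda t: (t ** 2).sum()                  # toy objective
g = lambda t: 1.0 - t.sum()                   # toy constraint: g(theta) <= 0 means t.sum() >= 1

for _ in range(500):
    opt.zero_grad()
    loss = J(theta) + lam * torch.relu(g(theta)) ** 2   # quadratic penalty on the violation
    loss.backward()
    opt.step()

print(theta.detach(), g(theta).item())        # ends near the constraint boundary
```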
Real projects rarely care about exactly one number. A recommender wants high engagement and low policy violations; a self-driving stack wants low time-to-destination, low jerk, and low collision rate. The mathematical setup is
minimize (J_1(θ), J_2(θ), ..., J_k(θ))
where the output is a vector. There is no single "optimum"; instead there is a Pareto frontier, the set of solutions that cannot be improved on any objective without worsening another.
The two practical approaches are scalarization (collapsing the objective vector into a single weighted sum) and the ε-constraint method (optimizing one objective subject to explicit bounds on the others).
Most ML systems quietly do scalarization: the deployed objective is a weighted sum of accuracy, latency cost, fairness penalty, and engagement, with weights tuned offline.
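In code, scalarization is rarely more than a weighted sum. A sketch with placeholder component losses:

```python
import torch

def scalarize(losses, weights):
    """Weighted-sum scalarization: each choice of weights selects one point
    on the Pareto frontier of the component objectives."""
    return sum(w * losses[name] for name, w in weights.items())

losses = {"task": torch.tensor(0.70),        # e.g., cross-entropy
          "fairness": torch.tensor(0.20),    # e.g., a disparity penalty
          "latency": torch.tensor(1.30)}     # e.g., expected compute cost
total = scalarize(losses, {"task": 1.0, "fairness": 0.3, "latency": 0.05})
```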
The choice of objective function defines the learning paradigm: supervised learning minimizes an empirical risk, reinforcement learning maximizes an expected return, and generative modeling maximizes a likelihood or an evidence lower bound.
Writing down the objective is the most consequential decision in a training run, and the most common mistakes range from the conceptual to the purely numerical.
- Cross-entropy takes the log of a predicted probability, and a probability that underflows to zero produces NaNs. The standard fixes are numerically stable fused implementations (TensorFlow's softmax_cross_entropy_with_logits, PyTorch's cross_entropy) and clamping probabilities to (ε, 1 - ε).
- reduction='sum' versus reduction='mean' changes the effective learning rate by a factor of the batch size. The same applies to ignoring padding tokens in language modeling: a per-token mean over only valid tokens is not the same as a per-batch mean.

The broader objective article covers the umbrella concept and includes rewards, business KPIs, alignment goals, and the loss-vs-reward distinction. This article is narrower: it concentrates on the mathematical function J(θ) and its analytical properties.
The loss function article provides per-example formulas and code examples for individual losses (cross-entropy, MSE, hinge, focal, triplet). Where the loss-function article is recipe-oriented, this article is structural: how those losses are aggregated into objectives, what properties they inherit, and which optimizers can solve them.
The empirical risk minimization article gives the formal statistical framework (Vapnik, ERM, generalization bounds, VC dimension) under which minimizing the empirical objective approximates minimizing the true risk.
Imagine you're playing a video game where your score is a penalty count, like strokes in golf. The objective function is the rule that converts what you did into a number: crash a lot and the number goes up (bad); finish levels quickly and the number goes down (good). Your goal is to make the number as small as possible.
Machine learning works the same way. The model has a bunch of dials (its parameters), and the objective function turns the position of all those dials into a single score. The training algorithm jiggles the dials in whichever direction makes the score smaller, over and over again. The model never "sees" the task you wanted; it only sees the score. Pick the wrong scoring rule and the model will get really good at scoring well while doing the wrong thing.