See also: Machine learning terms
In machine learning and mathematical optimization, an objective function is the scalar-valued function that an optimization algorithm seeks to minimize or maximize. It maps a candidate solution (in ML, the model parameters) to a single real number that summarizes how good or bad that solution is. Every gradient that updates a neural network, every step a gradient descent routine takes, and every weight an SGD update changes is ultimately produced by differentiating an objective function. In Goodfellow, Bengio, and Courville's Deep Learning, the authors put it plainly: the function we want to minimize or maximize is called the objective function, or criterion, and when we are minimizing it we may also call it the cost function, loss function, or error function.
This article focuses on the objective function as a mathematical object: its formal structure, the most common families of objectives used in ML, the analytical properties (convexity, smoothness, Lipschitz continuity) that govern which optimizers can solve it, and the practical pitfalls that arise when an objective is poorly chosen. For the broader concept that includes rewards, business metrics, and alignment goals, see objective.
Given a parameter space Θ (typically R^d for a model with d parameters), an objective function is any map
J : Θ → R
that the learning algorithm tries to minimize:
θ* = arg min_{θ ∈ Θ} J(θ)
or maximize, depending on convention. In supervised ML the canonical form is the regularized empirical risk:
J(θ) = (1/n) Σ_{i=1}^{n} L(f_θ(x_i), y_i) + λ R(θ)
The first term is a data-fit cost averaged over n training examples, where L is a per-example loss function, f_θ is the model, and (x_i, y_i) is the i-th input/label pair. The second term is a regularization penalty (such as L2, L1, or weight-decay), scaled by a hyperparameter λ that trades off training fit against complexity. When λ = 0, J reduces to the empirical risk and the setup is the classical empirical risk minimization (ERM) problem of Vapnik.
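Concretely, the regularized empirical risk translates to a few lines of code. A minimal PyTorch sketch, assuming a squared-error loss and an L2 penalty (the model and data here are placeholders, not part of any particular recipe):

```python
import torch

def objective(model, X, y, lam=1e-2):
    """Regularized empirical risk: J(theta) = (1/n) sum of per-example losses + lam * R(theta)."""
    preds = model(X)                                        # f_theta(x_i) for the whole batch
    data_fit = torch.mean((preds - y) ** 2)                 # squared-error loss averaged over n examples
    l2 = sum(p.pow(2).sum() for p in model.parameters())    # R(theta): L2 penalty on the parameters
    return data_fit + lam * l2                              # the scalar the optimizer differentiates

model = torch.nn.Linear(3, 1)
X, y = torch.randn(8, 3), torch.randn(8, 1)
loss = objective(model, X, y)
loss.backward()                                             # gradients of J with respect to theta
```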
Writing the objective explicitly serves two roles. It tells the optimizer what to do, and it tells the analyst what the model is actually learning. The model does not learn the task description in a Slack message; it learns whatever function J encodes.
The vocabulary is overloaded and inconsistent across textbooks. Here is the breakdown used by most modern ML practitioners.
| Term | Scope | Typical formula | Direction |
|---|---|---|---|
| Loss | Per example | L(f(x_i), y_i) | Minimize |
| Cost | Aggregate over a batch or dataset | (1/n) Σ L(f(x_i), y_i) | Minimize |
| Empirical risk | Average loss on training set | R_emp(f) = (1/n) Σ L(f(x_i), y_i) | Minimize |
| True risk (population risk) | Expected loss under data distribution P | R(f) = E_{(x,y)~P}[L(f(x), y)] | Minimize |
| Reward / utility | Per-step or per-trajectory benefit (RL) | r(s, a) | Maximize |
| Return | Discounted sum of future rewards | G_t = Σ γ^k r_{t+k+1} | Maximize |
| Fitness | Quality score in evolutionary search | problem-specific | Maximize |
| Criterion / objective | Umbrella term for any of the above | J(θ) | Either |
In casual usage "loss" and "cost" are often interchangeable. Sebastian Raschka points out that some authors reserve "loss" for the single-example version and "cost" for the dataset average, but plenty of textbooks (including Deep Learning) use them as synonyms. The PyTorch and TensorFlow APIs both call them "losses" and aggregate with a reduction argument that defaults to 'mean'.
"Criterion" is the mathematical-statistics term and shows up in PyTorch (nn.CrossEntropyLoss is invoked as criterion). In evolutionary algorithms the same scalar is called a fitness function and is maximized rather than minimized. None of these distinctions affect the math; flipping the sign turns a loss into a reward.
Almost every modern training objective decomposes into a small number of recognizable pieces: a data-fit term, one or more regularization or penalty terms, and sometimes auxiliary losses. These terms are usually combined linearly with weighting hyperparameters. Treating those weights as fixed is a modeling choice, and it implicitly defines a single point on a Pareto frontier in the space of objectives. Multi-objective approaches, discussed below, make that trade-off explicit.
The table below collects the formulas you will encounter in nearly any production codebase. Per-example formulas are shown; the full objective sums or averages over the dataset and adds regularization.
| Family | Name | Formula (per example) | Range | Differentiable | Typical use |
|---|---|---|---|---|---|
| Regression | Mean squared error (MSE, L2) | (y - ŷ)^2 | [0, ∞) | Everywhere | Linear regression, value heads in RL, diffusion noise prediction |
| Regression | Mean absolute error (MAE, L1) | \|y - ŷ\| | [0, ∞) | Except at 0 | Robust regression with heavy-tailed noise |
| Regression | Huber loss | (1/2)r^2 if \|r\| ≤ δ, else δ(\|r\| - δ/2) | [0, ∞) | Yes (C^1) | Robust regression, RL value heads (DQN's clipped TD error) |
| Regression | Quantile (pinball) | ρ_τ(y - ŷ) | [0, ∞) | Except at 0 | Quantile regression, prediction intervals |
| Classification | 0/1 loss | 1[y ≠ ŷ] | {0, 1} | No | Definition of error rate, not used for training |
| Classification | Binary cross-entropy | -y log p - (1-y) log(1-p) | [0, ∞) | Yes | Binary classification, sigmoid heads |
| Classification | Categorical cross-entropy | -Σ_k y_k log p_k | [0, ∞) | Yes | Softmax classifiers, language models |
| Classification | Hinge loss | max(0, 1 - y·ŷ) | [0, ∞) | Except at the kink | Linear and kernel SVMs |
| Classification | Focal loss | -(1 - p_t)^γ log p_t | [0, ∞) | Yes | Heavily imbalanced detection (RetinaNet) |
| Probability | KL divergence | Σ p log(p/q) | [0, ∞) | Yes (where p, q > 0) | Distillation, VAE prior matching, RLHF penalty |
| Probability | Wasserstein-1 | inf over couplings of E\|x-y\| | [0, ∞) | Yes (with critic) | WGAN, optimal transport |
| Metric | Triplet loss | max(0, d(a,p) - d(a,n) + m) | [0, ∞) | Except at the kink | Face recognition, embedding learning (FaceNet) |
| Self-supervised | InfoNCE | -log[exp(s_pos/τ) / Σ_j exp(s_j/τ)] | [0, ∞) | Yes | CPC, SimCLR, MoCo, CLIP |
| Generative (variational) | Negative ELBO | -E_q[log p(x\|z)] + KL(q(z\|x) \|\| p(z)) | R | Yes | Variational autoencoders |
| Generative (diffusion) | Noise-prediction MSE | \|\|ε - ε_θ(x_t, t)\|\|^2 | [0, ∞) | Yes | DDPM, Stable Diffusion |
| RL (policy gradient) | PPO clipped surrogate | E[min(r·Â, clip(r, 1-ε, 1+ε)·Â)] | R | Almost everywhere | LLM RLHF, robotics, game-playing |
| RL (preference) | DPO loss | -log σ(β log(π_θ(y_w)/π_ref(y_w)) - β log(π_θ(y_l)/π_ref(y_l))) | [0, ∞) | Yes | Preference fine-tuning of LLMs |
A few of these are worth a closer look.
Huber loss is parameterized by a threshold δ. For residuals smaller than δ it behaves like MSE (smooth, easy to optimize); for larger residuals it behaves like MAE (linear, robust to outliers). It dates to Peter Huber's 1964 paper on robust estimation, and DQN's error clipping is essentially Huber loss with δ = 1.
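A minimal NumPy sketch of the piecewise definition:

```python
import numpy as np

def huber(residual, delta=1.0):
    """Quadratic (MSE-like) inside |r| <= delta, linear (MAE-like) outside."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

print(huber(np.array([-3.0, -0.5, 0.0, 0.5, 3.0])))  # tails grow linearly, the center quadratically
```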
Focal loss, introduced by Lin et al. at ICCV 2017, multiplies cross-entropy by (1 - p_t)^γ so that easy examples (those already classified with high probability) contribute almost nothing to the gradient. Their RetinaNet detector trained with focal loss matched single-stage detector speeds while exceeding the accuracy of all then-current two-stage detectors.
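A per-example sketch of the focal term; the class-balancing factor α from the paper is omitted here for brevity:

```python
import torch

def focal_loss(p_t, gamma=2.0):
    """p_t: predicted probability of the true class.
    The (1 - p_t)^gamma factor shrinks the loss of already-easy examples toward zero."""
    return -((1 - p_t) ** gamma) * torch.log(p_t)

easy, hard = torch.tensor(0.95), torch.tensor(0.2)
print(focal_loss(easy), focal_loss(hard))   # the easy example contributes almost nothing
```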
InfoNCE, due to van den Oord, Li, and Vinyals (2018) in the Contrastive Predictive Coding paper, treats representation learning as classification: pick the positive sample out of a set that includes one positive and N - 1 negatives. The loss is a categorical cross-entropy over the resulting logits, and minimizing it maximizes a lower bound on the mutual information between context and target (I ≥ log N minus the expected loss).
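A sketch of the computation, assuming the embeddings have already been produced by an encoder; names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce(query, keys, tau=0.07):
    """query: (d,) anchor embedding; keys: (N, d) with row 0 the positive sample.
    Picking the positive out of N candidates is an N-way classification problem."""
    logits = keys @ query / tau                     # similarity scores s_j / tau
    target = torch.zeros(1, dtype=torch.long)       # the positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)
```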
The DPO loss of Rafailov et al. (NeurIPS 2023) is a binary cross-entropy on a margin between the log-ratio of preferred and dispreferred completions under the policy versus a frozen reference model. The result is a closed-form objective that recovers the optimal policy of an RLHF KL-regularized reward maximization problem without ever fitting an explicit reward model.
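A sketch of the DPO loss assuming the per-completion log-probabilities have already been computed (extracting them from a language model is the expensive part and is omitted):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Binary cross-entropy on the margin between policy and reference log-ratios.
    *_w: summed log-prob of the preferred completion; *_l: of the dispreferred one."""
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```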
The analytical properties of J(θ) determine which optimizers can solve it efficiently and whether the global optimum can be found at all.
| Property | Definition (informal) | Why it matters | ML examples |
|---|---|---|---|
| Convex | Line segment between any two points lies above the graph | Every local minimum is a global minimum | MSE in linear regression, logistic regression loss, SVM hinge with L2 |
| Strictly convex | Convex with no flat regions | At most one global minimum | Logistic regression with a full-rank design matrix |
| Strongly convex | Convex with quadratic lower bound (μ-strongly convex) | Linear convergence rate for gradient descent | L2-regularized convex losses |
| Non-convex | Multiple local minima or saddle points | No global guarantees from local methods | Neural networks, matrix factorization, mixture models |
| Smooth (C^1, Lipschitz gradient) | ∇f exists everywhere and is L-Lipschitz | Required for standard convergence proofs of GD | Cross-entropy, MSE, networks with smooth activations such as sigmoid |
| Non-smooth | Subgradient exists but gradient does not at some points | Need subgradient or proximal methods | L1 regularization, hinge loss, ReLU networks at kinks |
| Lipschitz (function) | \|f(x) - f(y)\| ≤ K\|x - y\| | Bounds on excess risk, GAN critic constraints | WGAN critic with weight clipping or gradient penalty |
| Coercive | f(θ) → ∞ as \|\|θ\|\| → ∞ | Guarantees a finite minimum exists | Regularized objectives (L2 makes anything coercive) |
In the Boyd and Vandenberghe formulation, a function is convex if its domain is convex and the chord between any two graph points lies above the graph. Strong convexity adds a quadratic lower bound: f(y) ≥ f(x) + ∇f(x)·(y - x) + (μ/2)||y - x||^2 for some μ > 0. L-smoothness is the symmetric quadratic upper bound. Together, L-smooth and μ-strongly convex objectives admit linear convergence rates for gradient descent, with iteration complexity O((L/μ) log(1/ε)).
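The linear rate is easy to verify numerically on a strongly convex quadratic, where μ and L are the smallest and largest eigenvalues of the Hessian; a sketch:

```python
import numpy as np

A = np.diag([1.0, 10.0])        # Hessian eigenvalues: mu = 1, L = 10
x = np.array([1.0, 1.0])
step = 1.0 / 10.0               # the classical 1/L step size

for k in range(5):
    x = x - step * (A @ x)      # gradient step on f(x) = 0.5 x^T A x
    print(np.linalg.norm(x))    # error contracts by the constant factor 1 - mu/L = 0.9
```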
Most classical ML objectives (linear regression, logistic regression, soft-margin SVM with L2) are convex; almost no neural-network objective is. That is why theoretical guarantees from convex optimization only loosely transfer to deep learning.
The right optimizer depends on whether the objective is convex, smooth, constrained, and stochastic.
| Objective shape | Recommended optimizers | Notes |
|---|---|---|
| Convex, smooth, low-dim | Newton's method, BFGS, L-BFGS | Quadratic convergence near optimum |
| Convex, smooth, large-scale | Gradient descent, accelerated gradient (Nesterov), conjugate gradient | O(1/k^2) with acceleration |
| Convex, non-smooth (e.g., L1) | Subgradient method, proximal gradient (ISTA, FISTA), ADMM | Proximal map handles non-smooth term in closed form |
| Non-convex, smooth, large-scale | SGD with momentum, Adam, AdamW | The dominant family for deep learning |
| Non-convex, non-smooth | Subgradient SGD, proximal SGD | Most production neural nets use this implicitly via ReLU plus weight decay |
| Constrained (linear or convex) | Projected gradient, interior-point, Frank-Wolfe, augmented Lagrangian, ADMM | Interior-point dominates small/medium convex programs |
| Constrained (general nonlinear) | Sequential quadratic programming, Lagrange + KKT conditions | Standard for trajectory optimization, control |
| Black-box (no gradient) | Bayesian optimization, CMA-ES, evolutionary strategies, simulated annealing | Used for hyperparameter tuning, neural-architecture search |
Adam, introduced by Diederik Kingma and Jimmy Ba at ICLR 2015, is the default optimizer for most deep-learning workloads. Their paper described it as a method for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The algorithm is invariant to diagonal rescaling of gradients, has small memory overhead, and is well suited for problems with very noisy or sparse gradients. Subsequent work (AdamW, by Loshchilov and Hutter) decoupled weight decay from the gradient step, fixing a subtle interaction between adaptive learning rates and L2 regularization.
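A condensed sketch of the Adam update rule, using the paper's default hyperparameters:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: per-coordinate step sizes from bias-corrected moment estimates."""
    m = b1 * m + (1 - b1) * grad           # first moment: running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2      # second moment: running mean of squared gradients
    m_hat = m / (1 - b1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```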
A recurring problem is that the metric we care about (classification accuracy, F1, NDCG, BLEU) is non-differentiable, piecewise constant, or involves sorts and thresholds. None of these can be optimized directly with gradient methods. The standard fix is to optimize a surrogate: a smooth, often convex function that bounds or correlates with the target metric.
A surrogate is called Bayes-consistent (or proper) if minimizing it converges to a Bayes-optimal predictor as data grows. Logistic loss is consistent for classification and gives calibrated probabilities; hinge loss is consistent for the 0/1 decision but not for probabilities; squared loss is consistent for the conditional mean. Picking a surrogate is one of the most consequential design choices in a project, because the model that gets shipped is optimal for the surrogate, not for the metric on the dashboard.
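A toy comparison makes the point concrete: the 0/1 loss is piecewise constant in the margin, so it provides no gradient, while the (base-2) logistic loss is a smooth convex upper bound on it:

```python
import numpy as np

margin = np.linspace(-3, 3, 601)               # m = y * f(x)
zero_one = (margin <= 0).astype(float)         # the metric: flat almost everywhere, zero gradient
logistic = np.log2(1.0 + np.exp(-margin))      # smooth convex surrogate

assert np.all(logistic >= zero_one)            # the surrogate upper-bounds the 0/1 loss
```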
Many ML problems are naturally constrained. SVM training is a constrained quadratic program; fairness-aware learning constrains a disparity metric; safe RL constrains expected damage. The general form is
minimize J(θ)
subject to g_i(θ) ≤ 0, i = 1, ..., m
           h_j(θ) = 0, j = 1, ..., p
The two main analytic tools are Lagrange multipliers and the Karush-Kuhn-Tucker (KKT) conditions. The Lagrangian
L(θ, λ, ν) = J(θ) + Σ λ_i g_i(θ) + Σ ν_j h_j(θ)
turns a constrained problem into an unconstrained saddle-point problem in (θ, λ, ν). The KKT conditions are necessary first-order optimality conditions: stationarity of the Lagrangian, primal feasibility, dual feasibility (λ ≥ 0), and complementary slackness (λ_i g_i(θ) = 0). For convex problems with appropriate constraint qualifications, KKT conditions are also sufficient.
The original SVM was framed as a constrained quadratic program: maximize the margin subject to all training points being classified correctly with a margin of at least 1. The dual formulation, derived via Lagrange multipliers, replaces the optimization over weight vectors with an optimization over per-example dual variables, exposes the support vectors as those with non-zero duals, and enables the kernel trick.
In practice, deep-learning constrained optimization usually uses penalty methods (add λ times a violation term to J) or augmented Lagrangians (alternate updates to θ and to the multipliers). These are easier to implement on autograd frameworks than projected gradient descent for complex constraints.
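A minimal penalty-method sketch in PyTorch; the objective and the single inequality constraint are toy placeholders standing in for real ones:

```python
import torch

theta = torch.randn(2, requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.05)
lam = 10.0                                    # penalty weight; grown across rounds in practice

J = lambda t: (t ** 2).sum()                  # toy objective
g = lambda t: 1.0 - t.sum()                   # toy constraint: g(theta) <= 0 means t.sum() >= 1

for _ in range(500):
    opt.zero_grad()
    loss = J(theta) + lam * torch.relu(g(theta)) ** 2   # quadratic penalty on the violation
    loss.backward()
    opt.step()

print(theta.detach(), g(theta).item())        # ends near the constraint boundary
```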
Real projects rarely care about exactly one number. A recommender wants high engagement and low policy violations; a self-driving stack wants low time-to-destination, low jerk, and low collision rate. The mathematical setup is
minimize (J_1(θ), J_2(θ), ..., J_k(θ))
where the output is a vector. There is no single "optimum"; instead there is a Pareto frontier, the set of solutions that cannot be improved on any objective without worsening another.
The two practical approaches are scalarization (collapsing the objective vector into a single weighted sum) and the ε-constraint method (optimizing one objective subject to explicit bounds on the others).
Most ML systems quietly do scalarization: the deployed objective is a weighted sum of accuracy, latency cost, fairness penalty, and engagement, with weights tuned offline.
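In code, scalarization is rarely more than a weighted sum. A sketch with placeholder component losses:

```python
import torch

def scalarize(losses, weights):
    """Weighted-sum scalarization: each choice of weights selects one point
    on the Pareto frontier of the component objectives."""
    return sum(w * losses[name] for name, w in weights.items())

losses = {"task": torch.tensor(0.70),        # e.g., cross-entropy
          "fairness": torch.tensor(0.20),    # e.g., a disparity penalty
          "latency": torch.tensor(1.30)}     # e.g., expected compute cost
total = scalarize(losses, {"task": 1.0, "fairness": 0.3, "latency": 0.05})
```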
The choice of objective function defines the learning paradigm: supervised learning minimizes an empirical risk, reinforcement learning maximizes an expected return, and generative modeling maximizes a likelihood or an evidence lower bound.
Writing down the objective is the most consequential decision in a training run, and the most common mistakes range from the conceptual to the purely numerical.
- Cross-entropy takes the log of a predicted probability, and a probability that underflows to zero produces NaNs. The standard fixes are numerically stable fused implementations (TensorFlow's softmax_cross_entropy_with_logits, PyTorch's cross_entropy) and clamping probabilities to (ε, 1 - ε).
- reduction='sum' versus reduction='mean' changes the effective learning rate by a factor of the batch size. The same applies to ignoring padding tokens in language modeling: a per-token mean over only valid tokens is not the same as a per-batch mean.

The broader objective article covers the umbrella concept and includes rewards, business KPIs, alignment goals, and the loss-vs-reward distinction. This article is narrower: it concentrates on the mathematical function J(θ) and its analytical properties.
The loss function article provides per-example formulas and code examples for individual losses (cross-entropy, MSE, hinge, focal, triplet). Where the loss-function article is recipe-oriented, this article is structural: how those losses are aggregated into objectives, what properties they inherit, and which optimizers can solve them.
The empirical risk minimization article gives the formal statistical framework (Vapnik, ERM, generalization bounds, VC dimension) under which minimizing the empirical objective approximates minimizing the true risk.
Imagine you're playing a video game where your score is a penalty count, like strokes in golf. The objective function is the rule that converts what you did into a number: crash a lot and the number goes up (bad); finish levels quickly and the number goes down (good). Your goal is to make the number as small as possible.
Machine learning works the same way. The model has a bunch of dials (its parameters), and the objective function turns the position of all those dials into a single score. The training algorithm jiggles the dials in whichever direction makes the score smaller, over and over again. The model never "sees" the task you wanted; it only sees the score. Pick the wrong scoring rule and the model will get really good at scoring well while doing the wrong thing.