# Objective function

> Source: https://aiwiki.ai/wiki/objective_function
> Updated: 2026-04-26
> Categories: Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

In [machine learning](/wiki/machine_learning) and mathematical optimization, an **objective function** is the scalar-valued function that an optimization algorithm seeks to minimize or maximize. It maps a candidate solution (in ML, the model parameters) to a single real number that summarizes how good or bad that solution is. Every gradient that updates a neural network, every step a [gradient descent](/wiki/gradient_descent) routine takes, and every weight change an [SGD](/wiki/stochastic_gradient_descent_sgd) update makes ultimately comes from differentiating an objective function. In Goodfellow, Bengio, and Courville's *Deep Learning*, the authors put it plainly: the function we want to minimize or maximize is called the objective function, or criterion, and when we are minimizing it we may also call it the cost function, loss function, or error function.

This article focuses on the objective function as a mathematical object: its formal structure, the most common families of objectives used in ML, the analytical properties (convexity, smoothness, Lipschitz continuity) that govern which optimizers can solve it, and the practical pitfalls that arise when an objective is poorly chosen. For the broader concept that includes rewards, business metrics, and alignment goals, see [objective](/wiki/objective).

## definition

Given a parameter space Θ (typically R^d for a model with d parameters), an objective function is any map

J : Θ → R

that the learning algorithm tries to minimize:

θ* = arg min_{θ ∈ Θ} J(θ)

or maximize, depending on convention. In supervised ML the canonical form is the regularized empirical risk:

J(θ) = (1/n) Σ_{i=1}^{n} L(f_θ(x_i), y_i) + λ R(θ)

The first term is a data-fit cost averaged over n training examples, where L is a per-example [loss function](/wiki/loss_function), f_θ is the model, and (x_i, y_i) is the i-th input/label pair. The second term is a [regularization](/wiki/regularization) penalty (such as L2, L1, or weight-decay), scaled by a hyperparameter λ that trades off training fit against complexity. When λ = 0, J reduces to the empirical risk and the setup is the classical [empirical risk minimization](/wiki/empirical_risk_minimization_erm) (ERM) problem of Vapnik.
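As a minimal sketch of the formula above, the following computes J(θ) for a one-dimensional linear model with squared-error loss and an L2 penalty. The model, the toy dataset, and the choice to leave the bias unpenalized are illustrative assumptions, not part of the definition.

```python
def objective(theta, data, lam):
    """Regularized empirical risk J(theta) for the linear model f(x) = w*x + b.

    Per-example loss L is squared error; the regularizer is R(theta) = w**2
    (the bias b is left unpenalized, an illustrative choice).
    """
    w, b = theta
    n = len(data)
    data_fit = sum((w * x + b - y) ** 2 for x, y in data) / n  # (1/n) * sum of L
    penalty = lam * w ** 2                                      # lambda * R(theta)
    return data_fit + penalty


# Hypothetical toy dataset generated by y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

erm_value = objective((2.0, 1.0), data, lam=0.0)  # lambda = 0: plain empirical risk
reg_value = objective((2.0, 1.0), data, lam=0.1)  # same theta, now penalized
```

With λ = 0 the perfectly fitting parameters score zero; with λ > 0 the same parameters pay a penalty proportional to w², which is the trade-off the hyperparameter controls.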

Writing the objective explicitly serves two roles. It tells the optimizer what to do, and it tells the analyst what the model is actually learning. The model does not learn the task description in a Slack message; it learns whatever function J encodes.

## terminology: loss, cost, risk, criterion

The vocabulary is overloaded and inconsistent across textbooks. Here is the breakdown used by most modern ML practitioners.

| Term | Scope | Typical formula | Direction |
| --- | --- | --- | --- |
| Loss | Per example | L(f(x_i), y_i) | Minimize |
| Cost | Aggregate over a batch or dataset | (1/n) Σ L(f(x_i), y_i) | Minimize |
| Empirical risk | Average loss on training set | R_emp(f) = (1/n) Σ L(f(x_i), y_i) | Minimize |
| True risk (population risk) | Expected loss under data distribution P | R(f) = E_{(x,y)~P}[L(f(x), y)] | Minimize |
| Reward / utility | Per-step or per-trajectory benefit (RL) | r(s, a) | Maximize |
| Return | Discounted sum of future rewards | G_t = Σ γ^k r_{t+k+1} | Maximize |
| Fitness | Quality score in evolutionary search | problem-specific | Maximize |
| Criterion / objective | Umbrella term for any of the above | J(θ) | Either |

In casual usage "loss" and "cost" are often interchangeable. Sebastian Raschka points out that some authors reserve "loss" for the single-example version and "cost" for the dataset average, but plenty of textbooks (including *Deep Learning*) use them as synonyms. The PyTorch and TensorFlow APIs both call them "losses" and expose a `reduction` argument that controls how per-example values are aggregated (PyTorch's default is `reduction='mean'`).

"Criterion" is the mathematical-statistics term and shows up in PyTorch code, where an instance of `nn.CrossEntropyLoss` is conventionally bound to a variable named `criterion`. In evolutionary algorithms the same scalar is called a **fitness function** and is maximized rather than minimized. None of these distinctions affects the math; flipping the sign turns a loss into a reward.

## structural anatomy of an objective

Almost every modern training objective decomposes into a small number of recognizable pieces.

- A **data-fit term** that depends on training examples: cross-entropy on labels, MSE on continuous targets, masked-token negative log-likelihood, contrastive logits.
- One or more **regularization terms** that depend only on parameters: L2 (ridge), L1 (lasso), weight decay, dropout (implicit), spectral norm penalties.
- Optional **auxiliary losses** that share parameters: an autoencoder reconstruction term added to a classifier, a VAE's KL term against the prior, a [direct preference optimization](/wiki/direct_preference_optimization_dpo) (DPO) margin added to supervised cross-entropy.
- Optional **constraints**, either hard (projection onto a feasible set) or soft (penalty terms with a Lagrange multiplier; see [Lagrange multiplier](/wiki/lagrange_multiplier)).

These terms are usually combined linearly with weighting hyperparameters. Treating those weights as fixed is a modeling choice, and it implicitly defines a single point on a Pareto frontier in the space of objectives. Multi-objective approaches make that trade-off explicit.

## common objective functions

The table below collects the formulas you will encounter in nearly any production codebase. Per-example formulas are shown; the full objective sums or averages over the dataset and adds regularization.

| Family | Name | Formula (per example) | Range | Differentiable | Typical use |
| --- | --- | --- | --- | --- | --- |
| Regression | [Mean squared error](/wiki/mean_squared_error_mse) (MSE, L2) | (y - ŷ)^2 | [0, ∞) | Everywhere | Linear regression, value heads in RL, diffusion noise prediction |
| Regression | Mean absolute error (MAE, L1) | \|y - ŷ\| | [0, ∞) | Except at 0 | Robust regression with heavy-tailed noise |
| Regression | Huber loss | (1/2)r^2 if \|r\|≤δ else δ(\|r\|-δ/2) | [0, ∞) | Yes (C^1) | Robust regression, RL value heads (DQN clipped) |
| Regression | Quantile (pinball) | ρ_τ(y - ŷ) | [0, ∞) | Except at 0 | Quantile regression, prediction intervals |
| Classification | 0/1 loss | 1[y ≠ ŷ] | {0, 1} | No | Definition of error rate, not used for training |
| Classification | Binary [cross-entropy](/wiki/cross-entropy) | -y log p - (1-y) log(1-p) | [0, ∞) | Yes | Binary classification, sigmoid heads |
| Classification | Categorical cross-entropy | -Σ_k y_k log p_k | [0, ∞); equals log K at a uniform prediction | Yes | Softmax classifiers, language models |
| Classification | [Hinge loss](/wiki/hinge_loss) | max(0, 1 - y·ŷ) | [0, ∞) | Except at the kink | Linear and kernel SVMs |
| Classification | Focal loss | -(1 - p_t)^γ log p_t | [0, ∞) | Yes | Heavily imbalanced detection (RetinaNet) |
| Probability | KL divergence | Σ p log(p/q) | [0, ∞) | Yes (where p, q > 0) | Distillation, VAE prior matching, RLHF penalty |
| Probability | Wasserstein-1 | inf over couplings of E\|x-y\| | [0, ∞) | Yes (with critic) | WGAN, optimal transport |
| Metric | Triplet loss | max(0, d(a,p) - d(a,n) + m) | [0, ∞) | Except at the kink | Face recognition, embedding learning (FaceNet) |
| Self-supervised | InfoNCE | -log[exp(s_pos/τ) / Σ exp(s_j/τ)] | [0, ∞); equals log N at uniform scores | Yes | CPC, SimCLR, MoCo, CLIP |
| Generative (variational) | Negative ELBO | -E_q[log p(x\|z)] + KL(q(z\|x) \|\| p(z)) | R | Yes | Variational autoencoders |
| Generative (diffusion) | Noise-prediction MSE | \|\|ε - ε_θ(x_t, t)\|\|^2 | [0, ∞) | Yes | DDPM, Stable Diffusion |
| RL (policy gradient) | PPO clipped surrogate | E[min(r·Â, clip(r, 1-ε, 1+ε)·Â)] | R | Almost everywhere | LLM RLHF, robotics, game-playing |
| RL (preference) | [DPO](/wiki/direct_preference_optimization_dpo) loss | -log σ(β log(π_θ(y_w)/π_ref(y_w)) - β log(π_θ(y_l)/π_ref(y_l))) | [0, ∞) | Yes | Preference fine-tuning of LLMs |

A few of these are worth a closer look.

**Huber loss** is parameterized by a threshold δ. For residuals smaller than δ it behaves like MSE (smooth, easy to optimize); for larger residuals it behaves like MAE (linear, robust to outliers). It is attributed to Peter Huber's 1964 paper on robust estimation. DQN's error clipping, which limits the TD-error term in the update to [-1, 1], is equivalent to a Huber loss with δ = 1; this is distinct from DQN's reward clipping, which rescales the rewards themselves.
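The piecewise definition from the table can be sketched in a few lines. The helper names are hypothetical; the derivative shown is simply the residual clipped to [-δ, δ], which is why large residuals have bounded influence on the update.

```python
def huber(residual, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r ** 2
    return delta * (r - 0.5 * delta)


def huber_grad(residual, delta=1.0):
    """Derivative with respect to the residual: the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, residual))
```

At |r| = δ both branches give δ²/2 and both slopes equal δ, so the function is continuously differentiable even though its second derivative jumps.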

**Focal loss**, introduced by Lin et al. at ICCV 2017, multiplies cross-entropy by (1 - p_t)^γ so that easy examples (those already classified with high probability) contribute almost nothing to the gradient. Their RetinaNet detector trained with focal loss matched single-stage detector speeds while exceeding the accuracy of all then-current two-stage detectors.
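A short sketch of the modulating factor, written for a single example with true-class probability p_t. The default γ = 2 is an assumption chosen for illustration, and the class-balancing α weight used in some variants is omitted.

```python
import math


def cross_entropy(p_t):
    """Standard cross-entropy for the probability assigned to the true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t)**gamma, which shrinks easy examples."""
    return (1.0 - p_t) ** gamma * cross_entropy(p_t)
```

With γ = 2, an easy example at p_t = 0.9 keeps about 1% of its cross-entropy, while a hard example at p_t = 0.1 keeps about 81%. Setting γ = 0 recovers plain cross-entropy.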

**InfoNCE**, due to van den Oord, Li, and Vinyals (2018) in the Contrastive Predictive Coding paper, treats representation learning as classification: pick the positive sample out of a set that includes one positive and N - 1 negatives. The loss is a categorical cross-entropy over the resulting logits, and log N minus its expected value is a lower bound on the mutual information between context and target, so minimizing the loss raises that bound.
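The "classification among N candidates" view can be sketched directly from similarity scores. The function names, the temperature value, and the convention that the positive sits at index 0 are assumptions made for this example.

```python
import math


def info_nce(scores, positive_index=0, temperature=0.1):
    """Cross-entropy of picking the positive among N scored candidates."""
    logits = [s / temperature for s in scores]
    m = max(logits)  # subtract the max before exponentiating for stability
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[positive_index]


def mi_lower_bound(loss, num_candidates):
    """The bound log N - loss on mutual information (in nats)."""
    return math.log(num_candidates) - loss
```

When every candidate receives the same score the loss equals log N and the bound is zero; as the positive's score pulls ahead of the negatives, the loss falls toward zero and the bound approaches log N.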

**The DPO loss** of Rafailov et al. (NeurIPS 2023) is a binary cross-entropy on a margin between the log-ratio of preferred and dispreferred completions under the policy versus a frozen reference model. The result is a closed-form objective that recovers the optimal policy of an RLHF KL-regularized reward maximization problem without ever fitting an explicit reward model.
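For a single preference pair, the table formula reduces to a logistic loss on a scaled difference of log-ratios. The sketch below takes sequence log-probabilities as plain floats; the argument names and β = 0.1 are illustrative assumptions.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, dispreferred) completion pair.

    logp_* are log-probabilities under the policy being trained;
    ref_logp_* are the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)
```

When the policy equals the reference the margin is zero and the loss is log 2. Raising the preferred completion's log-probability relative to the reference, or lowering the dispreferred one, reduces the loss.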

## properties of objective functions

The analytical properties of J(θ) determine which optimizers can solve it efficiently and whether the global optimum can be found at all.

| Property | Definition (informal) | Why it matters | ML examples |
| --- | --- | --- | --- |
| Convex | Line segment between any two points lies above the graph | Every local minimum is a global minimum | MSE in linear regression, logistic regression loss, SVM hinge with L2 |
| Strictly convex | Convex with no flat regions | At most one global minimizer | Logistic regression with L2 |
| Strongly convex | Convex with quadratic lower bound (μ-strongly convex) | Unique minimizer; linear convergence rate for gradient descent when the objective is also L-smooth | L2-regularized convex losses |
| Non-convex | Multiple local minima or saddle points | No global guarantees from local methods | Neural networks, matrix factorization, mixture models |
| Smooth (C^1, Lipschitz gradient) | ∇f exists everywhere and ∇f is L-Lipschitz | Required for standard convergence proofs of GD | Cross-entropy, MSE, sigmoid composed networks |
| Non-smooth | Subgradient exists but gradient does not at some points | Need subgradient or proximal methods | L1 regularization, hinge loss, ReLU networks at kinks |
| Lipschitz (function) | \|f(x) - f(y)\| ≤ K\|x - y\| | Bounds on excess risk, GAN critic constraints | WGAN critic with weight clipping or gradient penalty |
| Coercive | f(θ) → ∞ as \|\|θ\|\| → ∞ | Guarantees that a continuous objective attains its minimum | Regularized objectives (adding L2 makes any loss that is bounded below coercive) |

In the Boyd and Vandenberghe formulation, a function is convex if its domain is convex and the chord between any two graph points lies above the graph. Strong convexity adds a quadratic lower bound: f(y) ≥ f(x) + ∇f(x)·(y - x) + (μ/2)\|\|y - x\|\|^2 for some μ > 0. L-smoothness is the symmetric quadratic upper bound. Together, L-smooth and μ-strongly convex objectives admit linear convergence rates for [gradient descent](/wiki/gradient_descent), with iteration complexity O((L/μ) log(1/ε)).
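The linear rate can be checked numerically on a small quadratic. This is a sketch under stated assumptions: the objective f(x) = (μ x₁² + L x₂²)/2 with μ = 1 and L = 10, the step size 1/L, and the starting point are all chosen for illustration. For such objectives the standard guarantee is f(x_k) - f* ≤ (1 - μ/L)^k (f(x_0) - f*), and here f* = 0.

```python
MU, L_SMOOTH = 1.0, 10.0  # strong-convexity and smoothness constants


def f(x):
    """A mu-strongly convex, L-smooth quadratic with minimum value 0 at the origin."""
    return 0.5 * (MU * x[0] ** 2 + L_SMOOTH * x[1] ** 2)


def grad(x):
    return (MU * x[0], L_SMOOTH * x[1])


def gradient_descent(x0, steps, lr=1.0 / L_SMOOTH):
    """Plain gradient descent; returns the final iterate and the objective history."""
    x = x0
    values = [f(x)]
    for _ in range(steps):
        g = grad(x)
        x = (x[0] - lr * g[0], x[1] - lr * g[1])
        values.append(f(x))
    return x, values


x_final, values = gradient_descent((1.0, 1.0), steps=50)
rate = 1.0 - MU / L_SMOOTH  # per-step contraction factor in the bound
```

Each entry of `values` stays below `rate**k * values[0]`, so the number of steps needed to reach accuracy ε grows like (L/μ) log(1/ε), matching the iteration complexity quoted above.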

Most classical ML objectives (linear regression, logistic regression, soft-margin SVM with L2) are convex; almost no neural-network objective is. That is why theoretical guarantees from [convex optimization](/wiki/convex_optimization) only loosely transfer to deep learning.

## optimizers matched to objective properties

The right optimizer depends on whether the objective is convex, smooth, constrained, and stochastic.

| Objective shape | Recommended optimizers | Notes |
| --- | --- | --- |
| Convex, smooth, low-dim | Newton's method, BFGS, L-BFGS | Newton converges quadratically near the optimum; quasi-Newton methods such as BFGS converge superlinearly |
| Convex, smooth, large-scale | Gradient descent, accelerated gradient (Nesterov), conjugate gradient | O(1/k^2) with acceleration |
| Convex, non-smooth (e.g., L1) | Subgradient method, proximal gradient (ISTA, FISTA), ADMM | Proximal map handles non-smooth term in closed form |
| Non-convex, smooth, large-scale | [SGD](/wiki/stochastic_gradient_descent_sgd) with momentum, Adam, AdamW | The dominant family for deep learning |
| Non-convex, non-smooth | Subgradient SGD, proximal SGD | Most production neural nets use this implicitly via ReLU plus weight decay |
| Constrained (linear or convex) | Projected gradient, interior-point, Frank-Wolfe, augmented Lagrangian, ADMM | Interior-point dominates small/medium convex programs |
| Constrained (general nonlinear) | Sequential quadratic programming, Lagrange + [KKT conditions](/wiki/kkt_conditions) | Standard for trajectory optimization, control |
| Black-box (no gradient) | Bayesian optimization, CMA-ES, evolutionary strategies, simulated annealing | Used for hyperparameter tuning, neural-architecture search |

Adam, introduced by Diederik Kingma and Jimmy Ba at ICLR 2015, is the default optimizer for most deep-learning workloads. Their paper described it as a method for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The algorithm is invariant to diagonal rescaling of gradients, has small memory overhead, and is well suited for problems with very noisy or sparse gradients. Subsequent work (AdamW, by Loshchilov and Hutter) decoupled weight decay from the gradient step, fixing a subtle interaction between adaptive learning rates and L2 regularization.
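The update rule itself is short. The following is a minimal pure-Python sketch of the Adam step with the default hyperparameters from the paper; the function name and the list-based parameter representation are assumptions for illustration, not how any framework implements it.

```python
import math


def adam_minimize(grad_fn, theta, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run `steps` Adam updates on a list of parameters and return the result."""
    theta = list(theta)
    m = [0.0] * len(theta)  # first-moment (mean) estimate of the gradient
    v = [0.0] * len(theta)  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction for the zero init
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def grad_quadratic(theta):
    """Gradient of f(theta) = theta**2."""
    return [2.0 * theta[0]]


def grad_rescaled(theta):
    """Gradient of the same objective multiplied by 1000."""
    return [2000.0 * theta[0]]
```

Running one step from θ = 1 with either gradient function moves the parameter by almost exactly `lr`, which illustrates the invariance to gradient rescaling: the ratio of the two moment estimates cancels the scale.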

## surrogate objectives

A recurring problem is that the metric we care about (classification accuracy, F1, NDCG, BLEU) is non-differentiable, piecewise constant, or involves sorts and thresholds. None of these can be optimized directly with gradient methods. The standard fix is to optimize a **surrogate**: a smooth, often convex function that bounds or correlates with the target metric.

- 0/1 loss is the metric, but it is a step function. Cross-entropy is its convex, smooth surrogate; logistic regression is then minimum cross-entropy over linear models.
- Hinge loss is another convex surrogate for 0/1 loss, used in SVMs. Unlike cross-entropy it is consistent for the classification decision but not for class probabilities.
- For ranking, NDCG is non-differentiable, so listwise surrogates (LambdaRank, ListNet, ApproxNDCG) approximate it.
- For BLEU and ROUGE in machine translation, training uses token-level cross-entropy as a surrogate, sometimes followed by minimum-risk training or reinforcement learning that optimizes the metric more directly.

A surrogate is called **Bayes-consistent** (or classification-calibrated) if minimizing it converges to a Bayes-optimal predictor as data grows. Logistic loss is consistent for classification and gives calibrated probabilities; hinge loss is consistent for the 0/1 decision but not for probabilities; squared loss is consistent for the conditional mean. Picking a surrogate is one of the most consequential design choices in a project, because the model that gets shipped is optimal for the surrogate, not for the metric on the dashboard.
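The sense in which hinge and logistic losses "bound" the 0/1 loss is easiest to see as functions of the margin m = y·score with y in {-1, +1}. In this sketch the logistic loss is written in base 2 so that it equals 1 at m = 0; that scaling, and counting m = 0 as an error, are conventions assumed here.

```python
import math


def zero_one(margin):
    """0/1 loss: an error whenever the margin is not strictly positive."""
    return 1.0 if margin <= 0 else 0.0


def hinge(margin):
    return max(0.0, 1.0 - margin)


def logistic_base2(margin):
    """Logistic loss in bits, log2(1 + exp(-margin)); equals 1 at margin 0."""
    return math.log2(1.0 + math.exp(-margin))


margins = [-3.0, -1.0, 0.0, 0.5, 1.0, 3.0]
table = [(m, zero_one(m), hinge(m), logistic_base2(m)) for m in margins]
```

Both surrogates sit on or above the step function everywhere, so driving either one down also drives down an upper bound on the error rate, while giving the optimizer a usable gradient at every margin.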

## constrained optimization

Many ML problems are naturally constrained. SVM training is a constrained quadratic program; fairness-aware learning constrains a disparity metric; safe RL constrains expected damage. The general form is

minimize J(θ)
subject to g_i(θ) ≤ 0, i = 1, ..., m
          h_j(θ) = 0, j = 1, ..., p

The two main analytic tools are Lagrange multipliers and the Karush-Kuhn-Tucker (KKT) conditions. The Lagrangian

L(θ, λ, ν) = J(θ) + Σ λ_i g_i(θ) + Σ ν_j h_j(θ)

turns a constrained problem into an unconstrained saddle-point problem in (θ, λ, ν). The KKT conditions are necessary first-order optimality conditions: stationarity of the Lagrangian, primal feasibility, dual feasibility (λ ≥ 0), and complementary slackness (λ_i g_i(θ) = 0). For convex problems with appropriate constraint qualifications, KKT conditions are also sufficient.
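A one-dimensional worked example makes the four conditions concrete. Take the hypothetical problem: minimize J(θ) = θ² subject to g(θ) = 1 - θ ≤ 0. The Lagrangian is θ² + λ(1 - θ), stationarity gives 2θ - λ = 0, and the constrained optimum is θ = 1 with λ = 2. The helper below (an illustrative name) reports how far a candidate pair is from satisfying each condition.

```python
def kkt_residuals(theta, lam):
    """KKT violations for: minimize theta**2 subject to 1 - theta <= 0."""
    stationarity = 2.0 * theta - lam  # d/dtheta of theta**2 + lam * (1 - theta)
    g = 1.0 - theta                   # constraint value; feasible when g <= 0
    return {
        "stationarity": abs(stationarity),
        "primal_feasibility": max(0.0, g),
        "dual_feasibility": max(0.0, -lam),
        "complementary_slackness": abs(lam * g),
    }


at_optimum = kkt_residuals(1.0, 2.0)        # every residual is zero
at_unconstrained = kkt_residuals(0.0, 0.0)  # stationary, but infeasible
```

The unconstrained minimizer θ = 0 satisfies stationarity with λ = 0 but violates primal feasibility, which is exactly the case the multiplier is there to correct.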

The original SVM was framed as a constrained quadratic program: maximize the margin subject to all training points being classified correctly with a margin of at least 1. The dual formulation, derived via Lagrange multipliers, replaces the optimization over weight vectors with an optimization over per-example dual variables, exposes the support vectors as those with non-zero duals, and enables the kernel trick.

In practice, deep-learning constrained optimization usually uses penalty methods (add λ times a violation term to J) or augmented Lagrangians (alternate updates to θ and to the multipliers). These are easier to implement on autograd frameworks than projected gradient descent for complex constraints.
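The difference between a fixed penalty and an augmented Lagrangian shows up even on a toy problem. The sketch below uses the hypothetical problem minimize θ² subject to 1 - θ ≤ 0 (optimum θ = 1, multiplier 2). The inner solver, step sizes, and ρ = 10 are illustrative assumptions.

```python
def solve_inner(lam, rho, theta=0.0, lr=0.05, steps=500):
    """Gradient descent on the augmented Lagrangian in theta, multiplier held fixed.

    Augmented Lagrangian for g(theta) = 1 - theta <= 0:
        theta**2 + (max(0, lam + rho * g)**2 - lam**2) / (2 * rho)
    """
    for _ in range(steps):
        active = max(0.0, lam + rho * (1.0 - theta))
        grad = 2.0 * theta - active  # since dg/dtheta = -1
        theta -= lr * grad
    return theta


def augmented_lagrangian(rho=10.0, outer_steps=20):
    """Alternate inner minimization over theta with a multiplier update."""
    theta, lam = 0.0, 0.0
    for _ in range(outer_steps):
        theta = solve_inner(lam, rho, theta)
        lam = max(0.0, lam + rho * (1.0 - theta))  # dual ascent, kept nonnegative
    return theta, lam


theta_al, lam_al = augmented_lagrangian()
theta_penalty = solve_inner(lam=0.0, rho=10.0)  # quadratic penalty only, no multiplier
```

With the multiplier frozen at zero the method is a plain quadratic penalty and settles at θ ≈ 0.83, which violates the constraint. Updating the multiplier drives θ to the feasible optimum without having to send the penalty weight to infinity.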

## multi-objective optimization

Real projects rarely care about exactly one number. A recommender wants high engagement and low policy violations; a self-driving stack wants low time-to-destination, low jerk, and low collision rate. The mathematical setup is

minimize (J_1(θ), J_2(θ), ..., J_k(θ))

where the output is a vector. There is no single "optimum"; instead there is a **Pareto frontier**, the set of solutions that cannot be improved on any objective without worsening another.

The two practical approaches are scalarization and the constraint method.

- **Weighted-sum scalarization** combines objectives into J(θ) = Σ w_i J_i(θ) and picks one Pareto point per choice of weights. Simple, but cannot reach non-convex parts of the frontier.
- **The ε-constraint method** picks one objective to minimize and bounds the others as constraints: minimize J_1(θ) subject to J_i(θ) ≤ ε_i. Can reach the entire frontier but requires a constrained solver.

Most ML systems quietly do scalarization: the deployed objective is a weighted sum of accuracy, latency cost, fairness penalty, and engagement, with weights tuned offline.
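Weighted-sum scalarization can be sketched with two competing quadratics. The objectives J_1(θ) = θ² and J_2(θ) = (θ - 1)² are hypothetical; for them the minimizer of w·J_1 + (1 - w)·J_2 has the closed form θ = 1 - w, so each weight picks out one point on the frontier.

```python
def j1(theta):
    return theta ** 2


def j2(theta):
    return (theta - 1.0) ** 2


def scalarized_minimizer(w):
    """Argmin of w*j1 + (1-w)*j2, from 2*w*theta + 2*(1-w)*(theta-1) = 0."""
    return 1.0 - w


weights = [0.1, 0.3, 0.5, 0.7, 0.9]
frontier = [(j1(scalarized_minimizer(w)), j2(scalarized_minimizer(w))) for w in weights]
```

Along the sweep J_1 falls while J_2 rises, so no point dominates another. Because both objectives are convex here, the weighted sum can reach the whole frontier; with a non-convex frontier some points would be unreachable for any choice of weights.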

## modern training objectives in practice

The choice of objective function defines the paradigm.

- **Supervised vision and tabular models** minimize cross-entropy or MSE on labels, regularized by weight decay and dropout.
- **Language model pre-training** minimizes per-token cross-entropy: J = -(1/T) Σ_t log p_θ(x_t | x_{<t}). GPT-3 and later autoregressive models keep this same next-token objective; what changes is mainly the data and the parameter count.
- **Diffusion models** minimize a noise-prediction MSE: J = E[\|\|ε - ε_θ(x_t, t)\|\|^2] over diffusion timesteps and noise samples. This is the simplified ELBO from Ho, Jain, and Abbeel (2020) and remains the dominant training objective for image and video diffusion.
- **Contrastive self-supervised learning** minimizes an InfoNCE-style loss, which maximizes a lower bound on mutual information: positive pairs (two augmentations of the same image, or aligned image-text in CLIP) get pulled together while negatives are pushed apart.
- **Reward modeling for RLHF** is a pairwise log-likelihood over human preference data: maximize log σ(r_θ(x, y_w) - r_θ(x, y_l)) where y_w is the preferred completion. This is the Bradley-Terry model.
- **RLHF policy optimization with PPO** maximizes E[r(x, y) - β KL(π_θ(·|x) \|\| π_ref(·|x))], the reward shaped by a KL penalty against a frozen reference policy. The inner objective is the PPO clipped surrogate of Schulman et al. (2017): E[min(r_t Â_t, clip(r_t, 1-ε, 1+ε) Â_t)] where r_t is the importance-sampling ratio. The KL penalty, as used in OpenAI's InstructGPT line of work, keeps the fine-tuned model from drifting too far from the supervised base.
- **Direct Preference Optimization (DPO)** replaces the entire RL loop with a closed-form classification loss derived from the same KL-regularized reward formulation. The Rafailov et al. paper showed that the optimal policy under that objective can be expressed in terms of the reward, then inverted so that the reward never has to be parameterized at all.
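Of the objectives listed above, the PPO clipped surrogate is the one whose behavior is least obvious from the formula. A per-sample sketch follows; the function name and ε = 0.2 are assumptions for illustration, and a real implementation would average this term over a batch and maximize it.

```python
def ppo_clipped_term(ratio, advantage, eps=0.2):
    """One sample of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0), probability already raised a lot: the gain is capped.
capped_gain = ppo_clipped_term(1.5, 1.0)
# Bad action (A < 0), probability already lowered a lot: the gain is capped too.
capped_drop = ppo_clipped_term(0.5, -1.0)
# Moves in the wrong direction are never clipped, so they are fully penalized.
uncapped_penalty = ppo_clipped_term(1.5, -1.0)
```

Taking the minimum makes the surrogate a pessimistic bound: once the ratio leaves [1 - ε, 1 + ε] in the direction that would improve the objective, the term goes flat and contributes no further gradient.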

## common pitfalls

Writing down the objective is the most consequential decision in a training run, and the most common mistakes are conceptual rather than numerical.

- **Confusing the metric with the objective.** F1 score, AUC, accuracy at threshold 0.5, BLEU, and recall@10 are evaluation metrics. If they are non-differentiable, they cannot be the training objective directly. Training with cross-entropy and reporting F1 is fine; training on F1 through an ad hoc soft approximation usually is not, unless the approximation is principled and tested.
- **Misaligned proxy (Goodhart's law).** When the training objective is not the true business objective, any sufficiently strong optimizer will exploit the gap. Recommender click-through-rate maximization that erodes long-term user engagement is a commonly cited example. Specifying the right J is harder than optimizing it.
- **Numerical issues with logs and exponentials.** Cross-entropy includes log(p), which blows up at p = 0; softmax includes exp(z), which overflows for large z. Standard fixes are the log-sum-exp trick, fused softmax-cross-entropy kernels (TensorFlow's `softmax_cross_entropy_with_logits`, PyTorch's `cross_entropy`), and clamping probabilities to (ε, 1 - ε).
- **Scale mismatch between data-fit and regularization.** If the per-example loss is in the thousands and λ R(θ) is order 1, regularization does nothing. If the loss is averaged over a batch and λ is set assuming a sum, the regularizer dominates. Always check the relative magnitudes early in training.
- **Reduction bugs (sum vs mean).** Switching PyTorch's `reduction='mean'` to `reduction='sum'` scales the gradient by the batch size, which for plain SGD is equivalent to multiplying the learning rate by the same factor. Adaptive optimizers such as Adam are largely insensitive to that overall scale, but the balance between the data-fit term and any regularizer still shifts. The same applies to ignoring padding tokens in language modeling: a per-token mean over only valid tokens is not the same as a per-batch mean.
- **Ignoring stochasticity.** SGD follows a noisy, unbiased minibatch estimate of the gradient of the objective, not the exact gradient. Loss curves wiggle because of that noise; training loss going up for one batch is not necessarily a problem.
- **Overfitting the surrogate.** The model is optimal for J, not for the held-out metric. Track the metric you actually care about on validation, even if you cannot train on it. If they diverge, something in the surrogate or the data is wrong.
- **Constraint violation in penalty methods.** Adding λ · g(θ) as a soft penalty does not enforce g(θ) ≤ 0. For safety-critical constraints, use projection or augmented Lagrangians, not a single penalty term.
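Two of the pitfalls above, numerical overflow and reduction over padding, can be illustrated with standard-library Python. The function names are hypothetical; the first applies the log-sum-exp trick to compute cross-entropy directly from logits, and the second averages only over unmasked positions.

```python
import math


def cross_entropy_from_logits(logits, target):
    """Cross-entropy computed from raw logits with the log-sum-exp trick."""
    m = max(logits)  # shifting by the max keeps every exponent <= 0
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def masked_mean(losses, mask):
    """Mean over positions where mask is truthy (for example, non-padding tokens)."""
    valid = [loss for loss, keep in zip(losses, mask) if keep]
    return sum(valid) / len(valid)


stable = cross_entropy_from_logits([1000.0, 0.0], target=0)  # exp(1000) would overflow
token_losses = [2.0, 4.0, 0.0, 0.0]                          # last two are padding
per_valid_token = masked_mean(token_losses, [1, 1, 0, 0])
naive_mean = sum(token_losses) / len(token_losses)
```

The naive mean halves the reported loss here simply because half the positions are padding, which is the kind of silent scale change that makes loss values incomparable across batches with different amounts of padding.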

## relationship to related concepts

The broader [objective](/wiki/objective) article covers the umbrella concept and includes rewards, business KPIs, alignment goals, and the loss-vs-reward distinction. This article is narrower: it concentrates on the mathematical function J(θ) and its analytical properties.

The [loss function](/wiki/loss_function) article provides per-example formulas and code examples for individual losses (cross-entropy, MSE, hinge, focal, triplet). Where the loss-function article is recipe-oriented, this article is structural: how those losses are aggregated into objectives, what properties they inherit, and which optimizers can solve them.

The [empirical risk minimization](/wiki/empirical_risk_minimization_erm) article gives the formal statistical framework (Vapnik, ERM, generalization bounds, VC dimension) under which minimizing the empirical objective approximates minimizing the true risk.

## explain like I'm 5

Imagine you're playing a video game that gives you penalty points based on how badly you do. The objective function is the rule that converts what you did into a number. If you crash a lot, the number goes up (bad); if you finish levels quickly, the number goes down (good). Your goal is to make the number as small as possible.

Machine learning works the same way. The model has a bunch of dials (its parameters), and the objective function turns the position of all those dials into a single score. The training algorithm jiggles the dials in whichever direction makes the score smaller, over and over again. The model never "sees" the task you wanted; it only sees the score. Pick the wrong scoring rule and the model will get really good at scoring well while doing the wrong thing.

## references

1. Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*. MIT Press. Section 4.3 introduces the terms objective function, criterion, cost, and loss.
2. Boyd, S. and Vandenberghe, L. (2004). *Convex Optimization*. Cambridge University Press. Definitions of convexity, smoothness, Lipschitz continuity, and KKT conditions.
3. Vapnik, V. (1998). *Statistical Learning Theory*. Wiley. Empirical risk minimization framework.
4. Kingma, D. P. and Ba, J. (2015). "Adam: A Method for Stochastic Optimization." ICLR. arXiv:1412.6980.
5. Loshchilov, I. and Hutter, F. (2019). "Decoupled Weight Decay Regularization" (AdamW). ICLR.
6. Schulman, J. et al. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347.
7. Lin, T.-Y. et al. (2017). "Focal Loss for Dense Object Detection." ICCV.
8. van den Oord, A., Li, Y., and Vinyals, O. (2018). "Representation Learning with Contrastive Predictive Coding." arXiv:1807.03748.
9. Ho, J., Jain, A., and Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." NeurIPS.
10. Rafailov, R. et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS. arXiv:2305.18290.
11. Huber, P. J. (1964). "Robust Estimation of a Location Parameter." *Annals of Mathematical Statistics*.
12. Cortes, C. and Vapnik, V. (1995). "Support-Vector Networks." *Machine Learning*. SVMs as a constrained quadratic program.
13. Bradley, R. A. and Terry, M. E. (1952). "Rank Analysis of Incomplete Block Designs." *Biometrika*. The pairwise preference model used in RLHF reward modeling.
14. Raschka, S. "What is the difference between a cost function and a loss function in machine learning?" sebastianraschka.com.

