See also: Machine learning terms
In machine learning, cost is the scalar number that summarizes how badly a model is performing on a chunk of data. The function that produces it is the cost function, often written J(θ), where θ is the vector of model parameters. A training algorithm changes θ to push J(θ) downward, and that single number is the only thing the optimizer ever sees. Whatever J encodes, the model will learn to do; whatever J ignores, the model will treat as free to wreck.
In casual conversation, ML practitioners use "cost," "loss," and "objective" interchangeably, and the standard textbook (Goodfellow, Bengio, and Courville's Deep Learning) explicitly says the function we want to minimize is called the objective function, criterion, cost, or loss, with no real distinction. When people do bother to draw a line, the most common convention is the one Sebastian Raschka uses on his FAQ: loss is per example, cost is the aggregate over a dataset, and objective is the broadest umbrella term that may include a regularizer or even maximization. This article focuses on the cost view, and on the second meaning that has crept into modern ML: the dollars, watts, and GPU-hours it takes to actually train and serve a model.
The canonical machine-learning cost function is the regularized empirical average
J(θ) = (1/N) Σ_{i=1}^{N} L(f_θ(x_i), y_i) + λ R(θ)
where L is a per-example loss function, f_θ is the model, (x_i, y_i) is the i-th input-label pair from a dataset of size N, R(θ) is a regularization penalty, and λ ≥ 0 controls how much the regularizer matters. When λ = 0 the cost reduces to the empirical risk and the setup is the textbook empirical risk minimization (ERM) problem of Vapnik. When λ > 0, the cost is a Pareto trade-off between fitting the training data and keeping θ small or simple.
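The shell above is short enough to write out directly. A minimal numpy sketch, with squared-error loss, a linear model, and an L2 penalty standing in for L, f_θ, and R (the function and variable names here are illustrative, not from any library):

```python
import numpy as np

def cost(theta, X, y, lam=0.0):
    """Regularized empirical average J(theta) for a linear model."""
    residuals = X @ theta - y               # f_theta(x_i) - y_i for every example
    data_term = np.mean(residuals ** 2)     # (1/N) sum_i L(f_theta(x_i), y_i)
    reg_term = lam * np.sum(theta ** 2)     # lambda * R(theta), here R = sum of squares
    return data_term + reg_term
```

Setting lam=0 recovers the plain empirical risk of the ERM setup.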
The cost is what the optimizer minimizes; everything else (accuracy on validation, calibration, latency, fairness, dollars per inference) is a secondary metric the team tracks but the gradient never sees.
The vocabulary is overloaded across textbooks, papers, and frameworks. The table below shows how careful authors distinguish the terms.
| Term | Scope | Typical formula | What it measures |
|---|---|---|---|
| Loss | One example | L(f_θ(x_i), y_i) | Per-sample penalty |
| Cost | A batch or full training set | J(θ) = (1/N) Σ L(f_θ(x_i), y_i) | Average penalty over the data we have |
| Empirical risk | Same as cost without regularization | R_emp(θ) = (1/N) Σ L(f_θ(x_i), y_i) | Statistical-learning name for cost |
| True risk | Expectation over the data distribution | R(θ) = E_{(x,y) ~ D}[L(f_θ(x), y)] | Average penalty on data we will see in deployment |
| Objective function | Whatever the optimizer touches | J(θ) plus regularization, constraints, multipliers | The thing being minimized or maximized |
| Criterion | Same as objective | J(θ) | Statistics and PyTorch terminology |
| Reward | One step or one trajectory in RL | r(s, a) or R(τ) | Per-step or per-episode benefit, maximized |
The loss-vs-cost split is a convention, not a law. Andrew Ng's Coursera ML course popularized writing L for the per-example version and J for the aggregated version, and the Baeldung CS notes follow the same split. PyTorch and TensorFlow both call them "losses" and apply a reduction='mean' argument to do the aggregation; in those frameworks there is no separate "cost" object.
The risk-vs-cost split is the load-bearing one. Cost is sample-based and computable; risk is population-based and only ever estimated. The whole point of generalization theory is that minimizing the cost (what we can compute) does not automatically minimize the risk (what we care about). Léon Bottou's writings on stochastic gradient descent are explicit about this: SGD optimizes the empirical risk while we hope it generalizes to the expected risk, and the gap between them is the entire field of statistical learning theory.
The per-example loss L drops into the same averaging shell. The choice of L is what defines the task. The table below collects the formulas you will encounter in nearly any production codebase.
| Family | Name | Per-example formula | Typical use |
|---|---|---|---|
| Regression | Mean squared error (MSE, L2) | (y - ŷ)^2 | Linear regression, value heads in RL, diffusion noise prediction |
| Regression | Mean absolute error (MAE, L1) | \|y - ŷ\| | Robust regression with heavy-tailed noise |
| Regression | Huber | (1/2)r^2 if \|r\| ≤ δ else δ(\|r\| - δ/2) | Robust regression, DQN value head with reward clipping |
| Regression | Quantile (pinball) | ρ_τ(y - ŷ) | Quantile regression, prediction intervals |
| Classification | Binary cross-entropy | -y log p - (1 - y) log(1 - p) | Sigmoid binary classifiers |
| Classification | Categorical cross-entropy | -Σ_k y_k log p_k | Softmax classifiers, language model next-token prediction |
| Classification | Negative log-likelihood | -log p_θ(y \| x) | Generative models, autoregressive LMs |
| Classification | Hinge loss | max(0, 1 - y · ŷ) | Linear and kernel SVMs |
| Classification | Focal loss | -(1 - p_t)^γ log p_t | Class-imbalanced detection (RetinaNet) |
| Probability | KL divergence | Σ p log(p / q) | Distillation, VAE prior matching, RLHF KL penalty |
| Metric | Triplet loss | max(0, d(a, p) - d(a, n) + m) | Face recognition (FaceNet), embedding learning |
| Self-supervised | InfoNCE | -log[exp(s_pos / τ) / Σ_j exp(s_j / τ)] | SimCLR, MoCo, CLIP, CPC |
| Generative (diffusion) | DDPM noise-prediction MSE | \|\|ε - ε_θ(x_t, t)\|\|^2 | DDPM (Ho, Jain, Abbeel 2020), Stable Diffusion |
| RL (policy gradient) | PPO clipped surrogate | E[min(r_t Â_t, clip(r_t, 1 - ε, 1 + ε) Â_t)] | LLM RLHF, robotics, game-playing |
| RL (preference) | DPO | -log σ(β log(π_θ(y_w) / π_ref(y_w)) - β log(π_θ(y_l) / π_ref(y_l))) | Preference fine-tuning of LLMs |
A few of these merit a longer look.
Focal loss, introduced by Lin et al. at ICCV 2017 with their RetinaNet detector, multiplies the standard cross-entropy by (1 - p_t)^γ so that easy, well-classified examples contribute almost nothing to the gradient. The motivation came from one-stage object detectors that have to score 10K to 100K candidate boxes per image, almost all of them background. Without down-weighting the easy negatives, they swamp the loss and the detector never learns to find the rare positives.
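A numpy sketch of the binary focal loss makes the down-weighting concrete (variable names are mine; γ = 0 recovers plain binary cross-entropy):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    # p: predicted probability of the positive class, y in {0, 1}
    p_t = np.where(y == 1, p, 1 - p)              # probability assigned to the true class
    return -((1 - p_t) ** gamma) * np.log(p_t)    # cross-entropy scaled by (1 - p_t)^gamma
```

An easy negative scored at p_t = 0.99 contributes roughly 10,000 times less than it would under plain cross-entropy when γ = 2, which is exactly how the rare positives survive the flood of background boxes.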
The DDPM noise-prediction loss of Ho, Jain, and Abbeel (2020) is just an MSE between the actual Gaussian noise sampled in the forward process and the noise the network predicts. It is the simplest possible objective in the deep generative literature, and it is the foundation under Stable Diffusion, Imagen, and most large-scale image and video diffusion systems. The original derivation goes through a variational lower bound, but the simplified loss is what is actually trained.
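In numpy the simplified loss is a few lines. This sketch assumes a scalar schedule value alpha_bar_t and a callable eps_model standing in for the network; both names are mine, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def ddpm_loss(x0, alpha_bar_t, eps_model):
    eps = rng.standard_normal(x0.shape)                               # forward-process Gaussian noise
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps  # noised input at step t
    eps_hat = eps_model(x_t, alpha_bar_t)                             # network's noise prediction
    return np.mean((eps - eps_hat) ** 2)                              # the simple MSE objective
```

A model that always predicts zero noise lands at a loss near 1 (the mean squared norm of standard Gaussian noise), the usual early-training baseline.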
The PPO clipped surrogate of Schulman et al. (2017) is the inner optimization step in reinforcement learning from human feedback (RLHF) for LLMs, among many other RL applications. The clip prevents the importance-sampling ratio r_t = π_θ(a | s) / π_old(a | s) from straying too far from 1, which keeps each policy update small and stable.
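A numpy sketch of the surrogate, written as the objective to maximize (a training loop would minimize its negative; variable names are illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage   # ratio pinned near 1
    return np.mean(np.minimum(unclipped, clipped))           # take the pessimistic branch
```

With a positive advantage, pushing the ratio beyond 1 + ε buys nothing: the surrogate is flat there, so the incentive to move the policy any further disappears.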
DPO, the direct-preference-optimization loss of Rafailov et al. (NeurIPS 2023), is a binary cross-entropy on the difference of log-ratios between a policy and a fixed reference policy. The closed-form derivation collapses the entire RLHF pipeline (reward model, PPO loop, KL penalty) into a single supervised loss on preference pairs. The β temperature controls how strongly the policy is pulled toward the preference signal versus held near the reference.
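For a single preference pair, the loss fits in a few lines of numpy (the log-probabilities here are placeholders for sums over response tokens; argument names are mine):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Difference of log-ratios between the policy and the frozen reference
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)
```

When the policy exactly matches the reference the margin is 0 and the loss is log 2; as the policy assigns more probability to the preferred response, the margin grows and the loss falls.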
When R(θ) is added to the average loss, the cost punishes both poor fit and large or complex parameters. The two most common choices are L2 and L1 penalties, with names that vary by community.
| Name | Penalty R(θ) | Effect on θ | ML examples |
|---|---|---|---|
| L2 / ridge / Tikhonov / weight decay | (1/2) Σ_j θ_j^2 | Shrinks all weights toward 0 | Logistic regression with L2, SGD with weight decay, AdamW |
| L1 / lasso | Σ_j |θ_j| | Drives many weights exactly to 0 (sparse solutions) | Lasso regression, sparse coding |
| Elastic net | α Σ |θ_j| + (1 - α)/2 Σ θ_j^2 | Combination of L1 and L2 | Glmnet, sklearn ElasticNet |
| Group sparsity | Σ_g \|\|θ_g\|\|_2 | Sets entire groups of weights to 0 | Structured pruning, channel pruning |
| Dropout (implicit) | Random masking at train time | Implicit ensembling, approximate L2 | Most deep nets |
| Spectral norm penalty | \|\|W\|\|_2 | Bounds Lipschitz constant of each layer | WGAN critic, robust training |
In linear regression, the ridge cost is J(θ) = ||y - Xθ||^2 + λ||θ||^2, with closed-form solution θ = (X^T X + λ I)^{-1} X^T y. The lasso version replaces ||θ||^2 with ||θ||_1 and gives the celebrated sparse solutions of Tibshirani (1996). For deep nets, weight decay (L2 with a small λ, often around 1e-4) is the default regularizer, applied either through the loss or, in modern AdamW, decoupled from the gradient step entirely.
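The closed form translates directly into numpy (a sketch; at λ = 0 with invertible X^T X it reduces to ordinary least squares):

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    # theta = (X^T X + lambda I)^{-1} X^T y, via a linear solve rather than an explicit inverse
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

Larger λ shrinks the solution norm toward zero, trading training fit for smaller weights.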
In many real applications the cost of getting an answer wrong is not symmetric. Missing a malignant tumor (a false negative) is much worse than calling a benign mole malignant (a false positive). Cost-sensitive learning modifies the cost function so that different mistakes are weighted differently.
The simplest version replaces the uniform 1/N average with class-weighted or sample-weighted averaging:
J(θ) = (1/N) Σ_i w_i L(f_θ(x_i), y_i)
where w_i depends on the true class y_i (class weights) or on the specific example (sample weights). For binary classification with a 2x2 cost matrix C, the per-example loss becomes a weighted sum over the four (true, predicted) cells, and the optimal classifier shifts the decision threshold away from 0.5 toward whichever class is cheaper to predict.
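Both pieces fit in a short sketch (names are illustrative; the threshold formula assumes zero cost for correct predictions):

```python
import numpy as np

def weighted_cost(losses, y, w_pos, w_neg):
    w = np.where(y == 1, w_pos, w_neg)     # weight each example by its true class
    return np.mean(w * losses)

def optimal_threshold(c_fp, c_fn):
    # Predicting positive costs (1 - p) * c_fp in expectation, negative costs p * c_fn,
    # so the Bayes-optimal rule predicts positive when p > c_fp / (c_fp + c_fn)
    return c_fp / (c_fp + c_fn)
```

With a missed tumor costing 9 and a false alarm costing 1, the threshold drops from 0.5 to 0.1: far more borderline cases get flagged for review.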
The medical-screening literature is full of this: Rao et al. (2024) review cost-sensitive learning across imbalanced medical data and note that the choice of cost ratio can change AUROC, precision-recall, and the deployed sensitivity of the screen by tens of percentage points. scikit-learn implements this through class_weight='balanced' on most estimators, and PyTorch through the weight argument to nn.CrossEntropyLoss.
A cost function only earns its place if it has the right mathematical properties for the optimizer it will be paired with.
| Property | What it means | Why it matters |
|---|---|---|
| Differentiable (or subdifferentiable) | Has a gradient or subgradient almost everywhere | Required for gradient descent, backprop, autograd |
| Aligned with the task metric | Lower J corresponds to better performance on what you care about | Otherwise you optimize the wrong thing (Goodhart's law) |
| Convex (when possible) | Every local minimum is a global minimum | Linear regression, logistic regression, soft-margin SVM are convex; neural nets are not |
| Bounded below | inf J(θ) > -∞ | Otherwise the optimizer can drive J to negative infinity without learning anything |
| Numerically stable | No log(0), no exp(very large) without log-sum-exp | Otherwise NaNs in the loss curve |
| Smooth (Lipschitz gradient) | ∇J does not jump suddenly | Standard convergence proofs of GD assume this |
The non-convex part is the painful one. Almost no neural-network cost is convex, which is why theoretical guarantees for convex optimization only loosely transfer to deep learning. In practice the loss surface is full of saddle points and local minima; SGD with momentum and Adam happen to find good ones often enough that we ship the model anyway.
The cost function and the optimizer are paired. You design J(θ) so that an algorithm can minimize it; the optimizer takes gradients (or finite differences, or surrogate models) of J and walks θ downhill. Three concepts come up over and over.
The cost surface is the graph of J(θ) over the parameter space. For a deep net it lives in millions or billions of dimensions and is impossible to visualize directly; the popular "loss landscape" plots project it to two dimensions by sweeping along chosen directions in θ-space. Local minima are points where ∇J = 0 and the Hessian is positive semi-definite; saddle points are points where ∇J = 0 but the Hessian has both positive and negative eigenvalues. Goodfellow et al. argue that in high dimensions saddle points dominate, and SGD escapes them by adding the right kind of noise.
Gradient descent updates θ_{t+1} = θ_t - η ∇J(θ_t). For a full-batch gradient over the entire dataset, each step is exact but expensive. Stochastic gradient descent estimates ∇J from a single example or a mini-batch, which is what makes large-scale ML feasible. Bottou's chapter in Neural Networks: Tricks of the Trade (2012) and the survey "Optimization Methods for Large-Scale Machine Learning" (Bottou, Curtis, Nocedal 2018) lay out the convergence theory. Adam, AdamW, RMSProp, and the rest of the modern menagerie are variants that adapt the step size per parameter using running estimates of the first and second moments of the gradient.
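The update rule in miniature, run on a noiseless linear least-squares problem (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(X, y, lr=0.05, batch=16, steps=2000):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        idx = rng.integers(0, len(y), size=batch)         # sample a mini-batch
        Xb, yb = X[idx], y[idx]
        grad = (2.0 / batch) * Xb.T @ (Xb @ theta - yb)   # stochastic estimate of grad J
        theta -= lr * grad                                # theta_{t+1} = theta_t - eta * grad
    return theta
```

Each step estimates the gradient from 16 examples rather than the full dataset, which is the entire reason the method scales.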
The practical point is that the gradient is computed by autograd from the cost; if the cost is wrong, the gradient is wrong, and the optimizer obediently walks the wrong way.
Machine-learning practitioners increasingly use "cost" in a second sense that has nothing to do with J(θ): the actual money and electricity spent on training and serving a model. As models scaled past GPT-3, this kind of cost stopped being a footnote and became a strategic constraint.
| Cost dimension | What gets measured | Typical units |
|---|---|---|
| Training compute | FLOPs or GPU-hours used to train one model | GPU-hours, PFLOP-days, dollars |
| Inference compute | FLOPs per query at serving time | ms latency, $ per million tokens, J per token |
| Energy | Power drawn during training and inference | kWh, MWh |
| Carbon | Greenhouse-gas emissions from that energy | kg CO2e, t CO2e |
| Cloud / hardware | Rental price of GPUs and TPUs | $ per GPU-hour |
| Annotation | Human time to label data | $ per label, $ per task |
| Inference pricing | What end users or developers pay | $ per million tokens, $ per image |
The Strubell, Ganesh, and McCallum 2019 paper at ACL was the first widely cited attempt to put numbers on the carbon side of this. They estimated that training a single large NLP model with neural-architecture search produced roughly the same lifetime CO2 emissions as five average American cars, and they pushed the field to start reporting energy and dollar costs alongside accuracy. The paper has been hugely influential in the green-AI conversation and in the more recent push for energy-efficient training.
For frontier LLMs the dollar numbers are public, if approximate. Sam Altman pegged the training cost of GPT-4 at "more than $100 million," and Stanford's 2025 AI Index along with Epoch AI estimate the compute portion alone at around $78 million. Llama 3.1 405B is estimated by the same Stanford report at roughly $170 million in compute, with other analysts citing figures as low as $60 million depending on whether amortized hardware capex is included; Meta used over 16,000 H100 GPUs and roughly 30.8 million GPU-hours. Google's Gemini Ultra came in around $191 million by the same methodology. These numbers are rounded, contested, and almost always exclude staff, R&D, and failed runs, but they are the right order of magnitude.
Inference cost has its own pricing world. Anthropic's Claude Opus 4.7 (released April 2026) is priced at $5 per million input tokens and $25 per million output tokens; Sonnet 4.6 sits at $3 / $15, and Haiku 4.5 at $1 / $5. Batch processing across all of these is 50% cheaper, and prompt caching can drop the effective input cost by up to 90%. Inference cost matters because it is paid every single time a user makes a query, while training cost is paid once.
Annotation cost is the unglamorous one. Hiring humans to label millions of images for ImageNet, to write demonstration responses for InstructGPT, or to rank pairwise preferences for RLHF is often the dominant line item for tasks where data is the bottleneck rather than compute. The OpenAI InstructGPT paper noted that the human-feedback dataset was small but expensive, and the entire DPO line of work is partly motivated by squeezing more out of the same preference labels.
Writing down J is the most consequential step in a training run, and the most common failures are conceptual rather than numerical.
- Mismatched reductions. reduction='sum' versus reduction='mean' changes the effective learning rate by a factor of the batch size, and ignoring padding tokens in a language model makes the per-token mean differ from the per-batch mean. The cost stays correct only if these reductions match what the optimizer expects.
- Numerical instability. The standard fixes are fused, numerically stable implementations (PyTorch's cross_entropy on logits, TensorFlow's softmax_cross_entropy_with_logits) and clamping probabilities into (ε, 1 - ε).

In practice you almost never write the cost function from scratch. The major ML frameworks ship dozens of pre-built losses, all of them invoked the same way: instantiate, then call on (predictions, targets) to get a scalar.
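The reduction pitfall fits in four lines of numpy: with a summed loss the gradient is batch-size times larger than with a mean, so a learning rate tuned under one reduction silently misbehaves under the other.

```python
import numpy as np

preds = np.array([0.2, 0.9, 0.4, 0.7])
targets = np.array([0.0, 1.0, 0.0, 1.0])
grad_mean = 2 * (preds - targets) / len(preds)   # gradient of mean squared error
grad_sum = 2 * (preds - targets)                 # gradient of summed squared error
# grad_sum is exactly batch_size * grad_mean
```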
| Framework | Common API | Reduction control |
|---|---|---|
| PyTorch | nn.MSELoss, nn.CrossEntropyLoss, nn.BCEWithLogitsLoss, nn.HuberLoss, nn.TripletMarginLoss, nn.KLDivLoss | reduction='mean' \| 'sum' \| 'none', default 'mean' |
| TensorFlow / Keras | tf.keras.losses.MeanSquaredError, CategoricalCrossentropy, SparseCategoricalCrossentropy, Hinge, Huber, KLDivergence | reduction= argument, plus from_logits=True for numerical stability |
| JAX / Flax | optax.softmax_cross_entropy, optax.l2_loss, optax.huber_loss | Functional, no implicit reduction; the user calls .mean() or .sum() |
| scikit-learn | loss= keyword on most estimators (SGDClassifier(loss='hinge'), LogisticRegression, GradientBoostingRegressor(loss='huber')) | Implicit per estimator |
| Hugging Face Transformers | Cross-entropy is built into the model forward pass when labels are passed; custom losses override Trainer.compute_loss in a subclass | Inherits PyTorch reductions |
PyTorch's convention is worth a closer look because it shows up in nearly every research paper. nn.CrossEntropyLoss takes raw logits (not probabilities) and an integer class label, computes a numerically stable log-softmax internally, and reduces with mean by default. The conventional variable name for the loss object is criterion, which goes back to the statistics term. nn.BCEWithLogitsLoss does the same trick for binary classification: it fuses sigmoid and binary cross-entropy so that the gradient stays stable for very confident predictions.
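What "takes raw logits" buys can be shown in a few lines of numpy: a naive softmax followed by log overflows or hits log(0) for large logits, while the shifted log-sum-exp form stays finite (the function name here is mine, not PyTorch's):

```python
import numpy as np

def stable_ce(logits, target):
    z = logits - logits.max()                  # shift so the largest logit is 0
    log_probs = z - np.log(np.exp(z).sum())    # log-softmax via log-sum-exp
    return -log_probs[target]                  # negative log-likelihood of the true class
```

For logits like [1000, 0] the naive route computes exp(1000) and overflows; the shifted version never exponentiates anything larger than 0.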
Imagine you are teaching a robot to throw beanbags into a bucket. After every throw the robot gets a number that says how badly it missed. Big miss, big number. Hit the bucket, small number. That number is the cost. The robot's whole life is figuring out how to wiggle its arm so the next number is smaller than the last one.
The rule that turns a throw into a number is the cost function. Different rules give different robots. If the rule is "how far the bag landed from the bucket," the robot learns to aim for the middle of the bucket. If the rule is "did the bag land in the bucket, yes or no," the robot might learn to just barely tip it over the rim. Pick the wrong rule and you get the wrong robot, even if the wrong robot is very good at being wrong.
The other meaning of cost is even simpler: training a really big robot uses a lot of electricity, costs a lot of money, and sometimes burns through millions of dollars. That is also called cost.
The broader objective function article covers the umbrella mathematical concept and includes detailed treatment of convexity, smoothness, and constrained optimization. The loss function article covers the per-example formulas with code examples for individual losses. The empirical risk minimization article gives the formal statistical framework under which minimizing the cost approximates minimizing the true risk, and the training loss article describes the specific case of cost computed on the training set during the inner optimization loop.
PyTorch documentation: torch.nn.CrossEntropyLoss, torch.nn.MSELoss. pytorch.org/docs.