See also: Machine learning terms
In machine learning, cost is the scalar number that summarizes how badly a model is performing on a chunk of data. The function that produces it is the cost function, often written J(θ), where θ is the vector of model parameters. A training algorithm changes θ to push J(θ) downward, and that single number is the only thing the optimizer ever sees. Whatever J encodes, the model will learn to do; whatever J ignores, the model will treat as free to wreck.
In casual conversation, ML practitioners use "cost," "loss," and "objective" interchangeably, and the standard textbook (Goodfellow, Bengio, and Courville's Deep Learning) explicitly says the function we want to minimize is called the objective function, criterion, cost, or loss, with no real distinction. When people do bother to draw a line, the most common convention is the one Sebastian Raschka uses on his FAQ: loss is per example, cost is the aggregate over a dataset, and objective is the broadest umbrella term that may include a regularizer or even maximization. This article focuses on the cost view, and on the second meaning that has crept into modern ML: the dollars, watts, and GPU-hours it takes to actually train and serve a model.
The canonical machine-learning cost function is the regularized empirical average
J(θ) = (1/N) Σ_{i=1}^{N} L(f_θ(x_i), y_i) + λ R(θ)
where L is a per-example loss function, f_θ is the model, (x_i, y_i) is the i-th input-label pair from a dataset of size N, R(θ) is a regularization penalty, and λ ≥ 0 controls how much the regularizer matters. When λ = 0 the cost reduces to the empirical risk and the setup is the textbook empirical risk minimization (ERM) problem of Vapnik. When λ > 0, the cost is a Pareto trade-off between fitting the training data and keeping θ small or simple.
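The shell above is short enough to write out directly. A minimal numpy sketch, with squared-error loss, a linear model, and an L2 penalty standing in for L, f_θ, and R (the function and variable names here are illustrative, not from any library):

```python
import numpy as np

def cost(theta, X, y, lam=0.0):
    """Regularized empirical average J(theta) for a linear model."""
    residuals = X @ theta - y               # f_theta(x_i) - y_i for every example
    data_term = np.mean(residuals ** 2)     # (1/N) sum_i L(f_theta(x_i), y_i)
    reg_term = lam * np.sum(theta ** 2)     # lambda * R(theta), here R = sum of squares
    return data_term + reg_term
```

Setting lam=0 recovers the plain empirical risk of the ERM setup.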
The cost is what the optimizer minimizes; everything else (accuracy on validation, calibration, latency, fairness, dollars per inference) is a secondary metric the team tracks but the gradient never sees.
The vocabulary is overloaded across textbooks, papers, and frameworks. The table below shows how careful authors distinguish the terms.
| Term | Scope | Typical formula | What it measures |
|---|---|---|---|
| Loss | One example | L(f_θ(x_i), y_i) | Per-sample penalty |
| Cost | A batch or full training set | J(θ) = (1/N) Σ L(f_θ(x_i), y_i) | Average penalty over the data we have |
| Empirical risk | Same as cost without regularization | R_emp(θ) = (1/N) Σ L(f_θ(x_i), y_i) | Statistical-learning name for cost |
| True risk | Expectation over the data distribution | R(θ) = E_{(x,y) ~ D}[L(f_θ(x), y)] | Average penalty on data we will see in deployment |
| Objective function | Whatever the optimizer touches | J(θ) plus regularization, constraints, multipliers | The thing being minimized or maximized |
| Criterion | Same as objective | J(θ) | Statistics and PyTorch terminology |
| Reward | One step or one trajectory in RL | r(s, a) or R(τ) | Per-step or per-episode benefit, maximized |
The loss-vs-cost split is a convention, not a law. Andrew Ng's Coursera ML course popularized writing L for the per-example version and J for the aggregated version, and the Baeldung CS notes follow the same split. PyTorch and TensorFlow both call them "losses" and apply a reduction='mean' argument to do the aggregation; in those frameworks there is no separate "cost" object.
The risk-vs-cost split is the load-bearing one. Cost is sample-based and computable; risk is population-based and only ever estimated. The whole point of generalization theory is that minimizing the cost (what we can compute) does not automatically minimize the risk (what we care about). Léon Bottou's writings on stochastic gradient descent are explicit about this: SGD optimizes the empirical risk while we hope it generalizes to the expected risk, and the gap between them is the entire field of statistical learning theory.
The per-example loss L drops into the same averaging shell. The choice of L is what defines the task. The table below collects the formulas you will encounter in nearly any production codebase.
| Family | Name | Per-example formula | Typical use |
|---|---|---|---|
| Regression | Mean squared error (MSE, L2) | (y - ŷ)^2 | Linear regression, value heads in RL, diffusion noise prediction |
| Regression | Mean absolute error (MAE, L1) | \|y - ŷ\| | Robust regression with heavy-tailed noise |
| Regression | Huber | (1/2)r^2 if \|r\| ≤ δ else δ(\|r\| - δ/2) | Robust regression, DQN value head with reward clipping |
| Regression | Quantile (pinball) | ρ_τ(y - ŷ) | Quantile regression, prediction intervals |
| Classification | Binary cross-entropy | -y log p - (1 - y) log(1 - p) | Sigmoid binary classifiers |
| Classification | Categorical cross-entropy | -Σ_k y_k log p_k | Softmax classifiers, language model next-token prediction |
| Classification | Negative log-likelihood | -log p_θ(y \| x) | Generative models, autoregressive LMs |
| Classification | Hinge loss | max(0, 1 - y · ŷ) | Linear and kernel SVMs |
| Classification | Focal loss | -(1 - p_t)^γ log p_t | Class-imbalanced detection (RetinaNet) |
| Probability | KL divergence | Σ p log(p / q) | Distillation, VAE prior matching, RLHF KL penalty |
| Metric | Triplet loss | max(0, d(a, p) - d(a, n) + m) | Face recognition (FaceNet), embedding learning |
| Self-supervised | InfoNCE | -log[exp(s_pos / τ) / Σ_j exp(s_j / τ)] | SimCLR, MoCo, CLIP, CPC |
| Generative (diffusion) | DDPM noise-prediction MSE | \|\|ε - ε_θ(x_t, t)\|\|^2 | DDPM (Ho, Jain, Abbeel 2020), Stable Diffusion |
| RL (policy gradient) | PPO clipped surrogate | E[min(r_t Â_t, clip(r_t, 1 - ε, 1 + ε) Â_t)] | LLM RLHF, robotics, game-playing |
| RL (preference) | DPO | -log σ(β log(π_θ(y_w) / π_ref(y_w)) - β log(π_θ(y_l) / π_ref(y_l))) | Preference fine-tuning of LLMs |
A few of these merit a longer look.
Focal loss, introduced by Lin et al. at ICCV 2017 with their RetinaNet detector, multiplies the standard cross-entropy by (1 - p_t)^γ so that easy, well-classified examples contribute almost nothing to the gradient. The motivation came from one-stage object detectors that have to score 10K to 100K candidate boxes per image, almost all of them background. Without down-weighting the easy negatives, they swamp the loss and the detector never learns to find the rare positives.
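A numpy sketch of the binary focal loss makes the down-weighting concrete (variable names are mine; γ = 0 recovers plain binary cross-entropy):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    # p: predicted probability of the positive class, y in {0, 1}
    p_t = np.where(y == 1, p, 1 - p)              # probability assigned to the true class
    return -((1 - p_t) ** gamma) * np.log(p_t)    # cross-entropy scaled by (1 - p_t)^gamma
```

An easy negative scored at p_t = 0.99 contributes roughly 10,000 times less than it would under plain cross-entropy when γ = 2, which is exactly how the rare positives survive the flood of background boxes.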
The DDPM noise-prediction loss of Ho, Jain, and Abbeel (2020) is just an MSE between the actual Gaussian noise sampled in the forward process and the noise the network predicts. It is the simplest possible objective in the deep generative literature, and it is the foundation under Stable Diffusion, Imagen, and most large-scale image and video diffusion systems. The original derivation goes through a variational lower bound, but the simplified loss is what is actually trained.
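In numpy the simplified loss is a few lines. This sketch assumes a scalar schedule value alpha_bar_t and a callable eps_model standing in for the network; both names are mine, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def ddpm_loss(x0, alpha_bar_t, eps_model):
    eps = rng.standard_normal(x0.shape)                               # forward-process Gaussian noise
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps  # noised input at step t
    eps_hat = eps_model(x_t, alpha_bar_t)                             # network's noise prediction
    return np.mean((eps - eps_hat) ** 2)                              # the simple MSE objective
```

A model that always predicts zero noise lands at a loss near 1 (the mean squared norm of standard Gaussian noise), the usual early-training baseline.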
The PPO clipped surrogate of Schulman et al. (2017) is the inner optimization step in reinforcement learning from human feedback (RLHF) for LLMs, among many other RL applications. The clip prevents the importance-sampling ratio r_t = π_θ(a | s) / π_old(a | s) from straying too far from 1, which keeps each policy update small and stable.
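A numpy sketch of the surrogate, written as the objective to maximize (a training loop would minimize its negative; variable names are illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage   # ratio pinned near 1
    return np.mean(np.minimum(unclipped, clipped))           # take the pessimistic branch
```

With a positive advantage, pushing the ratio beyond 1 + ε buys nothing: the surrogate is flat there, so the incentive to move the policy any further disappears.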
DPO, the direct-preference-optimization loss of Rafailov et al. (NeurIPS 2023), is a binary cross-entropy on the difference of log-ratios between a policy and a fixed reference policy. The closed-form derivation collapses the entire RLHF pipeline (reward model, PPO loop, KL penalty) into a single supervised loss on preference pairs. The β temperature controls how strongly the policy is pulled toward the preference signal versus held near the reference.
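For a single preference pair, the loss fits in a few lines of numpy (the log-probabilities here are placeholders for sums over response tokens; argument names are mine):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Difference of log-ratios between the policy and the frozen reference
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)
```

When the policy exactly matches the reference the margin is 0 and the loss is log 2; as the policy assigns more probability to the preferred response, the margin grows and the loss falls.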
When R(θ) is added to the average loss, the cost punishes both poor fit and large or complex parameters. The two most common choices are L2 and L1 penalties, with names that vary by community.
| Name | Penalty R(θ) | Effect on θ | ML examples |
|---|---|---|---|
| L2 / ridge / Tikhonov / weight decay | (1/2) Σ_j θ_j^2 | Shrinks all weights toward 0 | Logistic regression with L2, SGD with weight decay, AdamW |
| L1 / lasso | Σ_j |θ_j| | Drives many weights exactly to 0 (sparse solutions) | Lasso regression, sparse coding |
| Elastic net | α Σ |θ_j| + (1 - α)/2 Σ θ_j^2 | Combination of L1 and L2 | Glmnet, sklearn ElasticNet |
| Group sparsity | Σ_g \|\|θ_g\|\|_2 | Sets entire groups of weights to 0 | Structured pruning, channel pruning |
| Dropout (implicit) | Random masking at train time | Implicit ensembling, approximate L2 | Most deep nets |
| Spectral norm penalty | \|\|W\|\|_2 | Bounds Lipschitz constant of each layer | WGAN critic, robust training |
In linear regression, the ridge cost is J(θ) = ||y - Xθ||^2 + λ||θ||^2, with closed-form solution θ = (X^T X + λ I)^{-1} X^T y. The lasso version replaces ||θ||^2 with ||θ||_1 and gives the celebrated sparse solutions of Tibshirani (1996). For deep nets, weight decay (L2 with a small λ, often around 1e-4) is the default regularizer, applied either through the loss or, in modern AdamW, decoupled from the gradient step entirely.
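The closed form translates directly into numpy (a sketch; at λ = 0 with invertible X^T X it reduces to ordinary least squares):

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    # theta = (X^T X + lambda I)^{-1} X^T y, via a linear solve rather than an explicit inverse
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

Larger λ shrinks the solution norm toward zero, trading training fit for smaller weights.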
In many real applications the cost of getting an answer wrong is not symmetric. Missing a malignant tumor (a false negative) is much worse than calling a benign mole malignant (a false positive). Cost-sensitive learning modifies the cost function so that different mistakes are weighted differently.
The simplest version replaces the uniform 1/N average with class-weighted or sample-weighted averaging:
J(θ) = (1/N) Σ_i w_i L(f_θ(x_i), y_i)
where w_i depends on the true class y_i (class weights) or on the specific example (sample weights). For binary classification with a 2x2 cost matrix C, the per-example loss becomes a weighted sum over the four (true, predicted) cells, and the optimal classifier shifts the decision threshold away from 0.5 toward whichever class is cheaper to predict.
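Both pieces fit in a short sketch (names are illustrative; the threshold formula assumes zero cost for correct predictions):

```python
import numpy as np

def weighted_cost(losses, y, w_pos, w_neg):
    w = np.where(y == 1, w_pos, w_neg)     # weight each example by its true class
    return np.mean(w * losses)

def optimal_threshold(c_fp, c_fn):
    # Predicting positive costs (1 - p) * c_fp in expectation, negative costs p * c_fn,
    # so the Bayes-optimal rule predicts positive when p > c_fp / (c_fp + c_fn)
    return c_fp / (c_fp + c_fn)
```

With a missed tumor costing 9 and a false alarm costing 1, the threshold drops from 0.5 to 0.1: far more borderline cases get flagged for review.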
The medical-screening literature is full of this: Rao et al. (2024) review cost-sensitive learning across imbalanced medical data and note that the choice of cost ratio can change AUROC, precision-recall, and the deployed sensitivity of the screen by tens of percentage points. scikit-learn implements this through class_weight='balanced' on most estimators, and PyTorch through the weight argument to nn.CrossEntropyLoss.
A cost function only earns its place if it has the right mathematical properties for the optimizer it will be paired with.
| Property | What it means | Why it matters |
|---|---|---|
| Differentiable (or subdifferentiable) | Has a gradient or subgradient almost everywhere | Required for gradient descent, backprop, autograd |
| Aligned with the task metric | Lower J corresponds to better performance on what you care about | Otherwise you optimize the wrong thing (Goodhart's law) |
| Convex (when possible) | Every local minimum is a global minimum | Linear regression, logistic regression, soft-margin SVM are convex; neural nets are not |
| Bounded below | inf J(θ) > -∞ | Otherwise the optimizer can drive J to negative infinity without learning anything |
| Numerically stable | No log(0), no exp(very large) without log-sum-exp | Otherwise NaNs in the loss curve |
| Smooth (Lipschitz gradient) | ∇J does not jump suddenly | Standard convergence proofs of GD assume this |
The non-convex part is the painful one. Almost no neural-network cost is convex, which is why theoretical guarantees for convex optimization only loosely transfer to deep learning. In practice the loss surface is full of saddle points and local minima; SGD with momentum and Adam happen to find good ones often enough that we ship the model anyway.
The cost function and the optimizer are paired. You design J(θ) so that an algorithm can minimize it; the optimizer takes gradients (or finite differences, or surrogate models) of J and walks θ downhill. Three concepts come up over and over.
The cost surface is the graph of J(θ) over the parameter space. For a deep net it lives in millions or billions of dimensions and is impossible to visualize directly; the popular "loss landscape" plots project it to two dimensions by sweeping along chosen directions in θ-space. Local minima are points where ∇J = 0 and the Hessian is positive semi-definite; saddle points are points where ∇J = 0 but the Hessian has both positive and negative eigenvalues. Goodfellow et al. argue that in high dimensions saddle points dominate, and SGD escapes them by adding the right kind of noise.
Gradient descent updates θ_{t+1} = θ_t - η ∇J(θ_t). For a full-batch gradient over the entire dataset, each step is exact but expensive. Stochastic gradient descent estimates ∇J from a single example or a mini-batch, which is what makes large-scale ML feasible. Bottou's chapter in Neural Networks: Tricks of the Trade (2012) and the survey "Optimization Methods for Large-Scale Machine Learning" (Bottou, Curtis, Nocedal 2018) lay out the convergence theory. Adam, AdamW, RMSProp, and the rest of the modern menagerie are variants that adapt the step size per parameter using running estimates of the first and second moments of the gradient.
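The update rule in miniature, run on a noiseless linear least-squares problem (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(X, y, lr=0.05, batch=16, steps=2000):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        idx = rng.integers(0, len(y), size=batch)         # sample a mini-batch
        Xb, yb = X[idx], y[idx]
        grad = (2.0 / batch) * Xb.T @ (Xb @ theta - yb)   # stochastic estimate of grad J
        theta -= lr * grad                                # theta_{t+1} = theta_t - eta * grad
    return theta
```

Each step estimates the gradient from 16 examples rather than the full dataset, which is the entire reason the method scales.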
The practical point is that the gradient is computed by autograd from the cost; if the cost is wrong, the gradient is wrong, and the optimizer obediently walks the wrong way.
Machine-learning practitioners increasingly use "cost" in a second sense that has nothing to do with J(θ): the actual money and electricity spent on training and serving a model. As models scaled past GPT-3, this kind of cost stopped being a footnote and became a strategic constraint.
| Cost dimension | What gets measured | Typical units |
|---|---|---|
| Training compute | FLOPs or GPU-hours used to train one model | GPU-hours, PFLOP-days, dollars |
| Inference compute | FLOPs per query at serving time | ms latency, $ per million tokens, J per token |
| Energy | Power drawn during training and inference | kWh, MWh |
| Carbon | Greenhouse-gas emissions from that energy | kg CO2e, t CO2e |
| Cloud / hardware | Rental price of GPUs and TPUs | $ per GPU-hour |
| Annotation | Human time to label data | $ per label, $ per task |
| Inference pricing | What end users or developers pay | $ per million tokens, $ per image |
The Strubell, Ganesh, and McCallum 2019 paper at ACL was the first widely cited attempt to put numbers on the carbon side of this. They estimated that training a single large NLP model with neural-architecture search produced roughly the same lifetime CO2 emissions as five average American cars, and they pushed the field to start reporting energy and dollar costs alongside accuracy. The paper has been hugely influential in the green-AI conversation and in the more recent push for energy-efficient training.
For frontier LLMs the dollar numbers are public, if approximate. Sam Altman pegged the training cost of GPT-4 at "more than $100 million," and Stanford's 2025 AI Index along with Epoch AI estimate the compute portion alone at around $78 million. Llama 3.1 405B is estimated by the same Stanford report at roughly $170 million in compute, with other analysts citing figures as low as $60 million depending on whether amortized hardware capex is included; Meta used over 16,000 H100 GPUs and roughly 30.8 million GPU-hours. Google's Gemini Ultra came in around $191 million by the same methodology. These numbers are rounded, contested, and almost always exclude staff, R&D, and failed runs, but they are the right order of magnitude.
Inference cost has its own pricing world. Anthropic's Claude Opus 4.7 (released April 2026) is priced at $5 per million input tokens and $25 per million output tokens; Sonnet 4.6 sits at $3 / $15, and Haiku 4.5 at $1 / $5. Batch processing across all of these is 50% cheaper, and prompt caching can drop the effective input cost by up to 90%. Inference cost matters because it is paid every single time a user makes a query, while training cost is paid once.
Annotation cost is the unglamorous one. Hiring humans to label millions of images for ImageNet, to write demonstration responses for InstructGPT, or to rank pairwise preferences for RLHF is often the dominant line item for tasks where data is the bottleneck rather than compute. The OpenAI InstructGPT paper noted that the human-feedback dataset was small but expensive, and the entire DPO line of work is partly motivated by squeezing more out of the same preference labels.
Writing down J is the most consequential step in a training run, and the most common failures are conceptual rather than numerical.
- Mismatched reductions. reduction='sum' versus reduction='mean' changes the effective learning rate by a factor of the batch size, and ignoring padding tokens in a language model makes the per-token mean differ from the per-batch mean. The cost stays correct only if these reductions match what the optimizer expects.
- Numerical instability. The standard fixes are fused, numerically stable implementations (PyTorch's cross_entropy on logits, TensorFlow's softmax_cross_entropy_with_logits) and clamping probabilities into (ε, 1 - ε).

In practice you almost never write the cost function from scratch. The major ML frameworks ship dozens of pre-built losses, all of them invoked the same way: instantiate, then call on (predictions, targets) to get a scalar.
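The reduction pitfall fits in four lines of numpy: with a summed loss the gradient is batch-size times larger than with a mean, so a learning rate tuned under one reduction silently misbehaves under the other.

```python
import numpy as np

preds = np.array([0.2, 0.9, 0.4, 0.7])
targets = np.array([0.0, 1.0, 0.0, 1.0])
grad_mean = 2 * (preds - targets) / len(preds)   # gradient of mean squared error
grad_sum = 2 * (preds - targets)                 # gradient of summed squared error
# grad_sum is exactly batch_size * grad_mean
```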
| Framework | Common API | Reduction control |
|---|---|---|
| PyTorch | nn.MSELoss, nn.CrossEntropyLoss, nn.BCEWithLogitsLoss, nn.HuberLoss, nn.TripletMarginLoss, nn.KLDivLoss | reduction='mean' \| 'sum' \| 'none', default 'mean' |
| TensorFlow / Keras | tf.keras.losses.MeanSquaredError, CategoricalCrossentropy, SparseCategoricalCrossentropy, Hinge, Huber, KLDivergence | reduction= argument, plus from_logits=True for numerical stability |
| JAX / Flax | optax.softmax_cross_entropy, optax.l2_loss, optax.huber_loss | Functional, no implicit reduction; the user calls .mean() or .sum() |
| scikit-learn | loss= keyword on most estimators (SGDClassifier(loss='hinge'), LogisticRegression, GradientBoostingRegressor(loss='huber')) | Implicit per estimator |
| Hugging Face Transformers | Cross-entropy is built into the model forward pass when labels are passed; custom losses override Trainer.compute_loss in a subclass | Inherits PyTorch reductions |
PyTorch's convention is worth a closer look because it shows up in nearly every research paper. nn.CrossEntropyLoss takes raw logits (not probabilities) and an integer class label, computes a numerically stable log-softmax internally, and reduces with mean by default. The conventional variable name for the loss object is criterion, which goes back to the statistics term. nn.BCEWithLogitsLoss does the same trick for binary classification: it fuses sigmoid and binary cross-entropy so that the gradient stays stable for very confident predictions.
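What "takes raw logits" buys can be shown in a few lines of numpy: a naive softmax followed by log overflows or hits log(0) for large logits, while the shifted log-sum-exp form stays finite (the function name here is mine, not PyTorch's):

```python
import numpy as np

def stable_ce(logits, target):
    z = logits - logits.max()                  # shift so the largest logit is 0
    log_probs = z - np.log(np.exp(z).sum())    # log-softmax via log-sum-exp
    return -log_probs[target]                  # negative log-likelihood of the true class
```

For logits like [1000, 0] the naive route computes exp(1000) and overflows; the shifted version never exponentiates anything larger than 0.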
Imagine you are teaching a robot to throw beanbags into a bucket. After every throw the robot gets a number that says how badly it missed. Big miss, big number. Hit the bucket, small number. That number is the cost. The robot's whole life is figuring out how to wiggle its arm so the next number is smaller than the last one.
The rule that turns a throw into a number is the cost function. Different rules give different robots. If the rule is "how far the bag landed from the bucket," the robot learns to aim for the middle of the bucket. If the rule is "did the bag land in the bucket, yes or no," the robot might learn to just barely tip it over the rim. Pick the wrong rule and you get the wrong robot, even if the wrong robot is very good at being wrong.
The other meaning of cost is even simpler: training a really big robot uses a lot of electricity, costs a lot of money, and sometimes burns through millions of dollars. That is also called cost.
The broader objective function article covers the umbrella mathematical concept and includes detailed treatment of convexity, smoothness, and constrained optimization. The loss function article covers the per-example formulas with code examples for individual losses. The empirical risk minimization article gives the formal statistical framework under which minimizing the cost approximates minimizing the true risk, and the training loss article describes the specific case of cost computed on the training set during the inner optimization loop.
PyTorch documentation: torch.nn.CrossEntropyLoss, torch.nn.MSELoss. pytorch.org/docs.