A convex function is a real-valued mathematical function whose graph curves upward, forming a bowl or cup shape. Formally, a function is convex if the line segment connecting any two points on its graph lies on or above the graph itself. Convex functions are foundational to convex optimization and play a central role in machine learning, where they guarantee that optimization algorithms such as gradient descent can find global minima efficiently.
Imagine you have a bowl sitting on a table. If you drop a marble into the bowl, it will always roll down to the very bottom, no matter where you drop it. A convex function is like the inside surface of that bowl. There is only one lowest point, and every path leads down to it.
In machine learning, we use bowl-shaped (convex) functions to measure how wrong our predictions are. Because the function is shaped like a bowl, our training algorithm can always find the single best answer by rolling downhill. If the function were bumpy with lots of dips (non-convex), the marble might get stuck in a small dip that is not the true bottom, and our model would not learn the best possible answer.
Let f : R^n -> R be a function whose domain is a convex set. The function f is convex if, for all x, y in its domain and for every t in [0, 1]:
f(tx + (1 - t)y) <= t f(x) + (1 - t) f(y)
Geometrically, this inequality states that the chord (line segment) between any two points (x, f(x)) and (y, f(y)) on the graph of f lies on or above the graph. Equivalently, the epigraph of f (the set of points lying on or above the graph) is a convex set.
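The chord condition can be checked numerically. The sketch below (a hypothetical helper, not part of any standard library) samples random point pairs and interpolation weights and tests the defining inequality for f(x) = x^2, which is convex, and f(x) = x^3, which is not convex on all of R.

```python
import random

random.seed(0)  # deterministic sampling for reproducibility

def convexity_holds(f, trials=10_000, lo=-10.0, hi=10.0, tol=1e-9):
    """Check f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) on random samples."""
    for _ in range(trials):
        x, y = random.uniform(lo, hi), random.uniform(lo, hi)
        t = random.random()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + tol:
            return False
    return True

print(convexity_holds(lambda x: x * x))   # True: x^2 is convex
print(convexity_holds(lambda x: x ** 3))  # False: x^3 is concave for x < 0
```

A sampling check like this can only refute convexity, never prove it, but it is a useful sanity test.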
A function f is concave if -f is convex. Affine functions f(x) = a^T x + b (which include linear functions) are the only functions that are both convex and concave.
A function f is strictly convex if the inequality holds with strict inequality for all distinct x != y and for all t in (0, 1):
f(tx + (1 - t)y) < t f(x) + (1 - t) f(y)
Strict convexity means the chord between two distinct points lies strictly above the graph (never touching it except at the endpoints). A strictly convex function has at most one global minimum. If a minimum exists, it is unique. The function f(x) = e^x is a classic example: it is strictly convex on all of R, but it does not attain a minimum.
Strong convexity is a more powerful condition that quantifies how "curved" a function is. A differentiable function f is mu-strongly convex (for some mu > 0) if for all x, y in its domain:
f(y) >= f(x) + nabla f(x)^T (y - x) + (mu / 2) ||y - x||^2
Intuitively, a mu-strongly convex function grows at least as fast as a quadratic with curvature parameter mu. This provides a quadratic lower bound on the function, guaranteeing a unique global minimum and enabling stronger convergence guarantees for optimization algorithms.
For a differentiable function f, the following characterizations of mu-strong convexity are equivalent:
| Characterization | Statement |
|---|---|
| Quadratic lower bound | f(y) >= f(x) + nabla f(x)^T (y - x) + (mu/2) \|\|y - x\|\|^2 |
| Subtraction test | g(x) = f(x) - (mu/2) \|\|x\|\|^2 is convex |
| Monotone gradient | (nabla f(x) - nabla f(y))^T (x - y) >= mu \|\|x - y\|\|^2 |
| Strengthened Jensen | f(tx + (1-t)y) <= t f(x) + (1-t) f(y) - t(1-t)(mu/2) \|\|x - y\|\|^2 |
Strong convexity implies strict convexity, which in turn implies convexity, but the reverse implications do not hold.
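The quadratic-lower-bound characterization can be tested numerically for f(x) = x^2, whose gradient is 2x. The helper below is a hypothetical sketch: the bound should hold with mu = 2 but fail with mu = 3, since the curvature of x^2 is exactly 2.

```python
import random

random.seed(0)

def quadratic_lower_bound_holds(f, grad, mu, trials=10_000, tol=1e-9):
    """Test f(y) >= f(x) + f'(x)*(y - x) + (mu/2)*(y - x)^2 on random samples."""
    for _ in range(trials):
        x, y = random.uniform(-10, 10), random.uniform(-10, 10)
        if f(y) < f(x) + grad(x) * (y - x) + (mu / 2) * (y - x) ** 2 - tol:
            return False
    return True

f, grad = lambda x: x * x, lambda x: 2 * x
print(quadratic_lower_bound_holds(f, grad, mu=2))  # True: x^2 is 2-strongly convex
print(quadratic_lower_bound_holds(f, grad, mu=3))  # False: the curvature is only 2
```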
Strong convexity is especially important in optimization because it guarantees linear convergence (exponential decrease in error) for gradient descent. For a mu-strongly convex function with L-Lipschitz continuous gradients, gradient descent achieves an error of at most epsilon after O((L/mu) log(1/epsilon)) iterations. Without strong convexity, the convergence rate for smooth convex functions is only O(1/k), which is substantially slower.
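To make the linear-convergence claim concrete, here is a minimal sketch on the 2-strongly convex f(x) = x^2 (function names are illustrative). Its gradient is 2x, so one gradient step with step size s maps x to (1 - 2s)x: the distance to the minimizer shrinks by the same fixed factor every iteration, which is exactly linear (geometric) convergence.

```python
def gradient_descent(grad, x0, step, iters):
    x, errors = x0, []
    for _ in range(iters):
        x -= step * grad(x)
        errors.append(abs(x))  # distance to the minimizer x* = 0
    return errors

errs = gradient_descent(lambda x: 2 * x, x0=1.0, step=0.1, iters=20)
ratios = [b / a for a, b in zip(errs, errs[1:])]
print(ratios[0])        # each step multiplies the error by |1 - 2*step| = 0.8
print(errs[-1] < 0.02)  # True: after 20 steps the error is 0.8^20 ~ 0.012
```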
If f is differentiable, it is convex if and only if its domain is a convex set and for all x, y in the domain:
f(y) >= f(x) + nabla f(x)^T (y - x)
This condition has a clear geometric meaning: the tangent hyperplane at any point lies entirely on or below the graph of the function. In one dimension, this says that every tangent line is a global underestimator of the function. This first-order condition is the basis for many optimization algorithms, because it implies that the gradient provides a reliable direction toward the minimum.
If f is twice differentiable, it is convex if and only if its Hessian matrix (the matrix of second partial derivatives) is positive semidefinite everywhere on its domain:
nabla^2 f(x) >= 0 for all x in dom(f)
For a function of a single variable, this reduces to the familiar condition f''(x) >= 0 for all x. A positive definite Hessian (all eigenvalues strictly positive) implies strict convexity, but the converse does not hold: f(x) = x^4 is strictly convex even though f''(0) = 0. For mu-strongly convex functions, the Hessian satisfies nabla^2 f(x) >= mu * I for all x, meaning every eigenvalue of the Hessian is at least mu.
| Convexity Type | Second-Order Condition |
|---|---|
| Convex | nabla^2 f(x) >= 0 (positive semidefinite) |
| Strictly convex | nabla^2 f(x) > 0 (positive definite; sufficient but not necessary) |
| mu-strongly convex | nabla^2 f(x) >= mu I (eigenvalues >= mu) |
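A one-dimensional version of the second-order test can be sketched with a central finite difference (a rough numeric stand-in for checking Hessian definiteness; the helper name is made up). Note that the test uses >= 0 rather than > 0: f(x) = x^4 is strictly convex even though its second derivative vanishes at 0.

```python
def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

grid = [i / 10 for i in range(-50, 51)]
# x^4 has f''(x) = 12 x^2 >= 0 everywhere: convex.
print(all(second_derivative(lambda x: x ** 4, x) >= -1e-6 for x in grid))  # True
# x^3 has f''(x) = 6x < 0 for x < 0: not convex on R.
print(all(second_derivative(lambda x: x ** 3, x) >= -1e-6 for x in grid))  # False
```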
The following table lists commonly encountered convex functions and their key properties.
| Function | Formula | Strict? | Strongly Convex? | Notes |
|---|---|---|---|---|
| Affine / Linear | f(x) = a^T x + b | No | No | Both convex and concave; the only functions with this property |
| Quadratic | f(x) = x^2 | Yes | Yes (mu = 2) | The prototypical strongly convex function |
| Positive-definite quadratic | f(x) = x^T A x for A > 0 | Yes | Yes (mu = 2 lambda_min(A)) | Strongly convex; the Hessian is 2A, so mu equals twice the smallest eigenvalue of A |
| Exponential | f(x) = e^(ax) for a != 0 | Yes | No | Strictly convex but not strongly convex on all of R |
| Negative logarithm | f(x) = -log(x), x > 0 | Yes | No | Strictly convex on its domain; used as a barrier function in interior-point methods |
| Norms | f(x) = \|\|x\|\|_p, p >= 1 | No | No | Convex by the triangle inequality; not strictly convex in general |
| Log-sum-exp | f(x) = log(sum e^(x_i)) | No | No | Smooth approximation to the max function; its gradient is the softmax function |
| Absolute value | f(x) = \|x\| | No | No | Convex but not differentiable at x = 0 |
| Powers | f(x) = x^p, x >= 0, p >= 1 | Yes (if p > 1) | Not in general | Convex on the nonnegative reals for p >= 1 |
The log-sum-exp (LSE) function, defined as f(x) = log(sum_{i=1}^{n} e^(x_i)), deserves special attention because of its wide use in machine learning. It is a smooth, convex approximation to the maximum function, satisfying:
max(x_1, ..., x_n) <= log(sum e^(x_i)) <= max(x_1, ..., x_n) + log(n)
The gradient of the LSE function is the softmax function, and its Hessian can be shown to be positive semidefinite via the Cauchy-Schwarz inequality. The LSE function appears in the log-partition function of exponential family distributions, in the cross-entropy loss function for classification, and in smooth relaxations used in reinforcement learning.
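A practical detail worth showing: computing LSE naively overflows for large inputs, so implementations shift by the maximum first. The sketch below implements the stable form, its softmax gradient, and checks the sandwich bound stated above.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x_i))): shift by the max first."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def softmax(xs):
    """Gradient of logsumexp at xs."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

xs = [1.0, 2.0, 3.0]
lse = logsumexp(xs)
# Sandwich bound: max(x) <= LSE(x) <= max(x) + log(n)
print(max(xs) <= lse <= max(xs) + math.log(len(xs)))  # True
print(abs(sum(softmax(xs)) - 1.0) < 1e-12)            # True: softmax sums to 1
print(logsumexp([1000.0, 1000.0]))                    # stable: 1000 + log 2
```

A naive `math.log(sum(math.exp(x) for x in xs))` would raise an overflow error on the last call.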
Not all functions encountered in machine learning are convex. Important non-convex cases include neural network training objectives (because of nonlinear activations and products of weight matrices), matrix factorization, and clustering objectives such as k-means.
When the loss function is non-convex, gradient descent may converge to local minima or saddle points rather than the global minimum. In practice, techniques such as stochastic gradient descent with momentum, learning rate schedules, and random initialization help navigate non-convex landscapes.
Convexity is preserved under several useful operations, making it possible to build complex convex functions from simpler ones. These composition rules are central to modeling in convex optimization.
| Operation | Rule | Example |
|---|---|---|
| Nonnegative weighted sum | If f_1, ..., f_m are convex and w_i >= 0, then sum w_i f_i is convex | Weighted sum of loss terms in multi-task learning |
| Affine composition | If f is convex, then f(Ax + b) is convex | Applying a linear transformation to the input |
| Pointwise maximum | If f_1, ..., f_m are convex, then max(f_1, ..., f_m) is convex | Piecewise-linear functions; hinge loss |
| Pointwise supremum | If f(x, y) is convex in x for each y, then sup_y f(x, y) is convex in x | Support function of a set |
| Composition (nondecreasing) | If g is convex and nondecreasing and f is convex, then g(f(x)) is convex | e^(f(x)) when f is convex |
| Composition (nonincreasing) | If g is convex and nonincreasing and f is concave, then g(f(x)) is convex | -log of a concave function |
| Minimization over convex set | If f(x, y) is convex in (x, y) and C is convex, then g(x) = inf_{y in C} f(x, y) is convex | Projection onto a convex set |
| Perspective | If f is convex, the perspective g(x, t) = t f(x/t) is convex for t > 0 | KL divergence as the perspective of -log |
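The pointwise-maximum rule explains, for example, why the hinge loss is convex: max(0, 1 - z) is the pointwise maximum of two affine (hence convex) functions. The hypothetical helper below spot-checks the defining inequality on random samples.

```python
import random

random.seed(0)

def convexity_violated(f, trials=10_000, tol=1e-9):
    """Return True if a random sample violates the convexity inequality."""
    for _ in range(trials):
        x, y, t = random.uniform(-5, 5), random.uniform(-5, 5), random.random()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + tol:
            return True
    return False

hinge = lambda z: max(0.0, 1.0 - z)  # pointwise max of 0 and 1 - z
print(convexity_violated(hinge))     # False: the pointwise maximum is convex
```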
One of the most important results involving convex functions is Jensen's inequality. For a convex function f and a random variable X:
f(E[X]) <= E[f(X)]
In words, applying a convex function to the expected value of a random variable gives a result no larger than the expected value of the convex function applied to the random variable.
For concave functions, the inequality reverses: f(E[X]) >= E[f(X)].
Jensen's inequality has wide-ranging applications. In information theory, it is used to prove that the Kullback-Leibler divergence is always nonnegative. In statistics, it establishes bounds on expectations. In machine learning, it underlies the derivation of the evidence lower bound (ELBO) used in variational inference and variational autoencoders.
The finite (discrete) form of Jensen's inequality states that for nonnegative weights w_1, ..., w_n summing to 1 and points x_1, ..., x_n:
f(sum w_i x_i) <= sum w_i f(x_i)
This follows directly from the definition of convexity by induction.
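The discrete form is easy to verify numerically. This sketch draws random nonnegative weights summing to 1 and random points, and checks the inequality for the convex function f(x) = e^x.

```python
import math
import random

random.seed(0)

# Random nonnegative weights normalized to sum to 1, and random points.
raw = [random.random() for _ in range(5)]
w = [r / sum(raw) for r in raw]
pts = [random.uniform(-2, 2) for _ in range(5)]

lhs = math.exp(sum(wi * xi for wi, xi in zip(w, pts)))  # f(sum w_i x_i)
rhs = sum(wi * math.exp(xi) for wi, xi in zip(w, pts))  # sum w_i f(x_i)
print(lhs <= rhs)  # True: f(sum w_i x_i) <= sum w_i f(x_i)
```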
Convex functions are central to machine learning because they guarantee that optimization problems have well-behaved solutions. When the loss function is convex with respect to the model parameters, every local minimum is also a global minimum, and gradient descent is guaranteed to converge to an optimal solution.
The following table summarizes common convex loss functions used in machine learning.
| Loss Function | Formula | Used In | Convexity Notes |
|---|---|---|---|
| Mean Squared Error (MSE) | (1/n) sum (y_i - y_hat_i)^2 | Linear regression | Convex in the model parameters for linear models; strongly convex when the design matrix has full column rank |
| Cross-entropy (log loss) | -sum y_i log(p_i) | Logistic regression, classification | Convex in the linear model parameters; non-convex in neural network weights |
| Hinge loss | max(0, 1 - y_i f(x_i)) | Support vector machines | Convex (piecewise linear); not differentiable at the hinge point |
| Huber loss | Quadratic for small errors, linear for large | Robust regression | Convex and differentiable everywhere |
| L1 loss (MAE) | (1/n) sum \|y_i - y_hat_i\| | Robust regression | Convex but not differentiable at zero |
For linear regression with MSE loss, the optimization problem is a convex quadratic, which admits a closed-form solution (the normal equations) or can be solved efficiently with gradient descent. For logistic regression with cross-entropy loss, the objective is convex in the weight parameters, guaranteeing convergence to the global optimum.
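A minimal one-parameter sketch of this fact (the data values are made up for illustration): fitting y ~ w * x by MSE, the closed-form minimizer is w* = sum(x_i y_i) / sum(x_i^2), and gradient descent converges to the same point because the objective is a convex quadratic.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]

# Closed form (the 1-D normal equation): w* = sum(x*y) / sum(x^2)
w_closed = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Gradient descent on the MSE objective (1/n) * sum((w*x - y)^2)
w, lr, n = 0.0, 0.01, len(xs)
for _ in range(2000):
    grad = (2 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
    w -= lr * grad

print(abs(w - w_closed) < 1e-8)  # True: both reach the global minimum
```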
However, when these same loss functions are composed with a multi-layer neural network, the overall objective becomes non-convex because of the nonlinear activation functions and the product structure of the weight matrices.
Regularization terms added to loss functions are also typically convex: the L2 penalty ||w||^2 is strongly convex, and the L1 penalty (the sum of the absolute values of the weights) is convex but not differentiable at zero.
Adding L2 regularization to a convex loss function is a practical technique for converting a merely convex problem into a strongly convex one, enabling faster and more stable optimization.
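As a sketch of this effect (helper name is illustrative): the single-sample logistic loss log(1 + e^(-w)) is convex, but its curvature decays to zero for large |w|, so it is not strongly convex. Adding an L2 term lam * w^2 lifts the second derivative to at least 2 * lam everywhere.

```python
import math

def second_derivative(f, w, h=1e-4):
    """Central finite-difference approximation of f''(w)."""
    return (f(w + h) - 2 * f(w) + f(w - h)) / (h * h)

loss = lambda w: math.log(1 + math.exp(-w))   # convex, not strongly convex
lam = 0.1
regularized = lambda w: loss(w) + lam * w * w  # now lam-strongly convex (times 2)

print(second_derivative(loss, 10.0))         # ~4.5e-5: curvature vanishes
print(second_derivative(regularized, 10.0))  # ~0.2: at least 2 * lam everywhere
```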
Convex optimization is the study of minimizing convex functions over convex sets. The key guarantee is that any local minimum of a convex function over a convex set is a global minimum. This property makes convex optimization problems tractable: efficient polynomial-time algorithms exist for solving them to arbitrary precision.
Many classical machine learning algorithms, including linear regression, logistic regression, and support vector machines, are formulated as convex optimization problems.
When a machine learning problem can be cast as a convex optimization problem, practitioners benefit from strong theoretical guarantees (global optimality, convergence bounds) and mature solver software. When the problem is non-convex (as in deep learning), researchers often apply techniques originally developed for convex settings, such as gradient descent and regularization, even though the theoretical guarantees no longer strictly apply.
The following table summarizes the relationships between different levels of convexity.
| Property | Guarantees Unique Minimum? | GD Iterations to Reach Error epsilon | Second-Order Condition |
|---|---|---|---|
| Convex | No (may have multiple or none) | O(1/epsilon) | nabla^2 f >= 0 |
| Strictly convex | Yes (at most one) | O(1/epsilon) | nabla^2 f > 0 |
| Strongly convex (mu > 0) | Yes (exactly one) | O(log(1/epsilon)) | nabla^2 f >= mu I |