Convex Function
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v6 ยท 6,211 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v6 ยท 6,211 words
Add missing citations, update stale details, or suggest a clearer explanation.
A convex function is a real-valued mathematical function whose graph curves upward, forming a bowl or cup shape. Formally, a function is convex if the line segment connecting any two points on its graph lies on or above the graph itself. Convex functions are foundational to convex optimization and play a central role in machine learning, where they guarantee that optimization algorithms such as gradient descent can find global minima efficiently.
The study of convex functions belongs to a branch of mathematics called convex analysis, formalized in the 20th century in works such as Werner Fenchel's lecture notes (1951) and R. Tyrrell Rockafellar's Convex Analysis (1970). The properties of convex functions, especially the fact that local minima coincide with global minima, make them the most well-behaved class of nonlinear objective functions in mathematical optimization. Modern textbooks such as Boyd and Vandenberghe's Convex Optimization (2004) and Nesterov's Introductory Lectures on Convex Optimization (2004) treat convex functions as the central object of study because efficient solvers exist for almost any problem expressible in convex form.
Imagine you have a bowl sitting on a table. If you drop a marble into the bowl, it will always roll down to the very bottom, no matter where you drop it. A convex function is like the inside surface of that bowl. There is only one lowest point, and every path leads down to it.
In machine learning, we use bowl-shaped (convex) functions to measure how wrong our predictions are. Because the function is shaped like a bowl, our training algorithm can always find the single best answer by rolling downhill. If the function were bumpy with lots of dips (non-convex), the marble might get stuck in a small dip that is not the true bottom, and our model would not learn the best possible answer.
A second way to picture convexity: stretch a rubber band between any two points on the curve. If the rubber band always stays above (or touching) the curve, the function is convex. If the rubber band ever dips below the curve, the function is not convex.
Let f : R^n -> R be a function whose domain is a convex set. The function f is convex if, for all x, y in its domain and for every t in [0, 1]:
f(tx + (1 - t)y) <= t f(x) + (1 - t) f(y)
Geometrically, this inequality states that the chord (line segment) between any two points (x, f(x)) and (y, f(y)) on the graph of f lies on or above the graph. Equivalently, the epigraph of f (the set of points lying on or above the graph) is a convex set.
A function f is concave if -f is convex. Linear and affine functions are the only functions that are both convex and concave simultaneously.
The definition can be extended to functions whose domain is any real vector space, including infinite-dimensional function spaces. In abstract convex analysis, a convex function is often allowed to take the value plus infinity, with the convention that the inequality remains valid by reading any expression involving infinity in the natural way. The set where f is finite is called the effective domain, written dom f. A convex function whose effective domain is nonempty is called proper.
The epigraph of f is the set epi f = {(x, r) in R^n x R : f(x) <= r}. A function is convex if and only if its epigraph is a convex subset of R^(n+1). This characterization is powerful because it converts every question about a convex function into a question about a convex set, unifying the theory of convex functions with the geometry of convex sets. The closely related sublevel set {x : f(x) <= alpha} is also convex for every alpha whenever f is convex, although sublevel-set convexity alone is weaker than convexity (it characterizes the broader class of quasi-convex functions).
A function f is strictly convex if the inequality holds with strict inequality for all distinct x != y and for all t in (0, 1):
f(tx + (1 - t)y) < t f(x) + (1 - t) f(y)
Strict convexity means the chord between two distinct points lies strictly above the graph (never touching it except at the endpoints). A strictly convex function has at most one global minimum. If a minimum exists, it is unique. The function f(x) = e^x is a classic example: it is strictly convex on all of R, but it does not attain a minimum.
Strong convexity is a more powerful condition that quantifies how "curved" a function is. A differentiable function f is mu-strongly convex (for some mu > 0) if for all x, y in its domain:
f(y) >= f(x) + nabla f(x)^T (y - x) + (mu / 2) ||y - x||^2
Intuitively, a mu-strongly convex function grows at least as fast as a quadratic with curvature parameter mu. This provides a quadratic lower bound on the function, guaranteeing a unique global minimum and enabling stronger convergence guarantees for optimization algorithms.
The following conditions are equivalent for a differentiable function f:
| Characterization | Statement |
|---|---|
| Quadratic lower bound | f(y) >= f(x) + nabla f(x)^T (y - x) + (mu/2) ||y - x||^2 |
| Subtraction test | g(x) = f(x) - (mu/2) ||x||^2 is convex |
| Monotone gradient | (nabla f(x) - nabla f(y))^T (x - y) >= mu ||x - y||^2 |
| Strengthened Jensen | f(tx + (1-t)y) <= t f(x) + (1-t)f(y) - t(1-t)(mu/2) ||x - y||^2 |
Strong convexity implies strict convexity, which in turn implies convexity, but the reverse implications do not hold.
Strong convexity is especially important in optimization because it guarantees linear convergence (exponential decrease in error) for gradient descent. For a mu-strongly convex function with L-Lipschitz continuous gradients, gradient descent achieves an error of at most epsilon after O((L/mu) log(1/epsilon)) iterations. Without strong convexity, the convergence rate for smooth convex functions is only O(1/k), which is substantially slower.
The ratio kappa = L/mu is called the condition number of the optimization problem. A small condition number (close to 1) means the function is well conditioned and gradient methods converge quickly. A large condition number means the function is elongated and ill conditioned, in which case accelerated gradient methods such as Nesterov's method achieve a faster O(sqrt(kappa) log(1/epsilon)) rate.
If f is differentiable, it is convex if and only if its domain is a convex set and for all x, y in the domain:
f(y) >= f(x) + nabla f(x)^T (y - x)
This condition has a clear geometric meaning: the tangent hyperplane at any point lies entirely on or below the graph of the function. In one dimension, this says that every tangent line is a global underestimator of the function. This first-order condition is the basis for many optimization algorithms, because it implies that the gradient provides a reliable direction toward the minimum.
A related first-order characterization is monotonicity of the gradient: a differentiable function f on a convex domain is convex if and only if (nabla f(x) - nabla f(y))^T (x - y) >= 0 for all x, y. This generalizes the fact that the derivative of a convex function on the real line is nondecreasing.
If f is twice differentiable, it is convex if and only if its Hessian matrix (the matrix of second partial derivatives) is positive semidefinite everywhere on its domain:
nabla^2 f(x) >= 0 for all x in dom(f)
For a function of a single variable, this reduces to the familiar condition f''(x) >= 0 for all x. For strictly convex functions, the Hessian must be positive definite (all eigenvalues strictly positive). For mu-strongly convex functions, the Hessian satisfies nabla^2 f(x) >= mu * I for all x, meaning every eigenvalue of the Hessian is at least mu.
| Convexity Type | Second-Order Condition |
|---|---|
| Convex | nabla^2 f(x) >= 0 (positive semidefinite) |
| Strictly convex | nabla^2 f(x) > 0 (positive definite) |
| mu-strongly convex | nabla^2 f(x) >= mu I (eigenvalues >= mu) |
A subtle point: positive definite Hessian everywhere on the domain implies strict convexity, but the converse is false. The classical counterexample is f(x) = x^4, which is strictly convex on R but has f''(0) = 0. Strong convexity is the condition that pins down a uniform positive lower bound on the curvature.
The following table lists commonly encountered convex functions and their key properties.
| Function | Formula | Strict? | Strongly Convex? | Notes |
|---|---|---|---|---|
| Affine / Linear | f(x) = a^T x + b | No | No | Both convex and concave; the only functions with this property |
| Quadratic | f(x) = x^2 | Yes | Yes (mu = 2) | The prototypical strongly convex function |
| Positive-definite quadratic | f(x) = x^T A x for A > 0 | Yes | Yes (mu = lambda_min(A)) | Strongly convex with parameter equal to the smallest eigenvalue of A |
| Exponential | f(x) = e^(ax) for any a | Yes | No | Strictly convex but not strongly convex on all of R |
| Negative logarithm | f(x) = -log(x), x > 0 | Yes | No | Strictly convex on its domain; used as a barrier function in interior-point methods |
| Norms | f(x) = ||x||_p, p >= 1 | No | No | Convex by the triangle inequality; not strictly convex in general |
| Log-sum-exp | f(x) = log(sum e^(x_i)) | No | No | Smooth approximation to the max function; its gradient is the softmax function |
| Absolute value | f(x) = |x| | No | No | Convex but not differentiable at x = 0 |
| Powers | f(x) = x^p, x >= 0, p >= 1 | Yes (if p > 1) | No | Convex on the nonneg. reals for p >= 1 |
| Negative entropy | f(x) = x log x, x > 0 | Yes | No | Strictly convex on (0, infinity); appears in information theory |
| Maximum eigenvalue | f(X) = lambda_max(X) | No | No | Convex on symmetric matrices; basis of semidefinite programming |
| Spectral norm | f(X) = sigma_max(X) | No | No | Largest singular value; convex matrix function |
| Quadratic over linear | f(x, y) = x^2 / y, y > 0 | Yes | No | Convex jointly in (x, y); appears in regression and signal processing |
The log-sum-exp (LSE) function, defined as f(x) = log(sum_{i=1}^{n} e^(x_i)), deserves special attention because of its wide use in machine learning. It is a smooth, convex approximation to the maximum function, satisfying:
max(x_1, ..., x_n) <= log(sum e^(x_i)) <= max(x_1, ..., x_n) + log(n)
The gradient of the LSE function is the softmax function, and its Hessian can be shown to be positive semidefinite via the Cauchy-Schwarz inequality. The LSE function appears in the log-partition function of exponential family distributions, in the cross-entropy loss function for classification, and in smooth relaxations used in reinforcement learning.
Every norm on R^n is a convex function. The proof is a direct application of the triangle inequality and absolute homogeneity. For t in [0, 1] and any norm || . ||:
||t x + (1 - t) y|| <= ||t x|| + ||(1 - t) y|| = t ||x|| + (1 - t) ||y||
This covers the Euclidean (L2) norm, the L1 (Manhattan) norm, the L-infinity (max) norm, and every Lp norm for p >= 1. None of these are strictly convex except for the L2 norm restricted away from the origin (and even then only weakly so). Norms are not differentiable at zero in general, so subgradient calculus replaces standard gradient calculus when working with norm-regularized problems such as the Lasso.
Several important functions on matrix arguments are convex. The trace of a matrix product A****X is linear (and hence convex) in X. The largest eigenvalue lambda_max(X) of a symmetric matrix is convex, since it can be written as the supremum of the linear functions u^T X u over unit vectors u. The negative log determinant f(X) = -log det(X) is convex on the cone of positive definite matrices and serves as a barrier in semidefinite programming. The spectral norm and the nuclear norm (sum of singular values) are also convex; the nuclear norm is the convex envelope of the matrix rank and underlies many low-rank matrix recovery techniques.
Not all functions encountered in machine learning are convex. Important non-convex cases include:
When the loss function is non-convex, gradient descent may converge to local minima or saddle points rather than the global minimum. In practice, techniques such as stochastic gradient descent with momentum, learning rate schedules, and random initialization help navigate non-convex landscapes.
A practical workaround when the original problem is non-convex is to construct a convex relaxation: replace the function with a convex envelope or a convex upper bound that is tractable to optimize. The use of the L1 norm in place of the L0 "norm" (cardinality) is the canonical example, and it underlies compressed sensing as well as sparse regression.
Convexity is preserved under several useful operations, making it possible to build complex convex functions from simpler ones. These composition rules are central to modeling in convex optimization.
| Operation | Rule | Example |
|---|---|---|
| Nonnegative weighted sum | If f_1, ..., f_m are convex and w_i >= 0, then sum w_i f_i is convex | Weighted sum of loss terms in multi-task learning |
| Affine composition | If f is convex, then f(A****x + b) is convex | Applying a linear transformation to the input |
| Pointwise maximum | If f_1, ..., f_m are convex, then max(f_1, ..., f_m) is convex | Piecewise-linear functions; hinge loss |
| Pointwise supremum | If f(x, y) is convex in x for each y, then sup_y f(x, y) is convex in x | Support function of a set |
| Composition (nondecreasing) | If g is convex and nondecreasing and f is convex, then g(f(x)) is convex | e^(f(x)) when f is convex |
| Composition (nonincreasing) | If g is convex and nonincreasing and f is concave, then g(f(x)) is convex | -log of a concave function |
| Minimization over convex set | If f(x, y) is convex in (x, y) and C is convex, then g(x) = inf_{y in C} f(x, y) is convex | Projection onto a convex set |
| Perspective | If f is convex, the perspective g(x, t) = t f(x/t) is convex for t > 0 | KL divergence as the perspective of -log |
| Conjugate | The convex conjugate f^*(y) = sup_x (y^T x - f(x)) is always convex | Legendre transform; dual formulations |
| Infimal convolution | The inf-convolution (f box g)(x) = inf_y f(y) + g(x - y) is convex if f and g are | Moreau envelope; smoothing of nondifferentiable functions |
These composition rules form the basis of disciplined convex programming (DCP), a methodology popularized by Stephen Boyd's group at Stanford and implemented in modeling languages such as CVX (MATLAB), CVXPY (Python), and Convex.jl (Julia). In DCP, every expression is built from a small library of atomic convex and concave functions combined according to verifiable composition rules. If a problem is written in DCP form, the modeling tool can automatically certify its convexity and dispatch it to a suitable solver such as ECOS, SCS, or MOSEK. DCP is the reason convex optimization is widely accessible to practitioners who have not memorized the underlying theory.
Many convex functions of practical interest, such as the L1 norm and the hinge loss, are not differentiable everywhere. The right generalization of the gradient for convex functions is the subgradient. A vector g is a subgradient of a convex function f at the point x if for every y in dom f:
f(y) >= f(x) + g^T (y - x)
The set of all subgradients at x is called the subdifferential and is denoted partial f(x). Three key facts make the subdifferential a useful object:
The subdifferential of the absolute value at zero is the interval [-1, 1], and the subdifferential of the L1 norm at a sparse vector contains the sign vectors with -1 or 1 in the nonzero coordinates and any value in [-1, 1] in the zero coordinates. This is why the L1 norm encourages sparsity: setting a coordinate to zero is a stationary point because the subdifferential straddles zero.
Subgradient methods replace the gradient in gradient descent with any element of the subdifferential. They converge at the slower O(1/sqrt(k)) rate but apply to the broader class of nonsmooth convex functions. The proximal gradient method further generalizes this by using the proximal operator of a (possibly nonsmooth) convex regularizer, and it is the workhorse for solving regularized problems such as the Lasso.
The convex conjugate of a function f, written f^*, is defined by:
f^*(y) = sup_{x in dom f} ((y)^T (x) - f(x))
This transform, also called the Legendre-Fenchel transform, plays the role of the Fourier transform in convex analysis: it linearizes the nonlinear structure of f by re-expressing it in terms of supporting hyperplanes. Key properties:
Convex conjugates underpin the theory of Lagrange duality: the dual of a convex optimization problem is constructed from the conjugate of the Lagrangian. They also describe the Bregman divergence in dual coordinates and provide the link between primal and dual formulations of regularized problems.
Given a strictly convex, differentiable function phi, the Bregman divergence generated by phi is defined as:
D_phi(x, y) = phi(x) - phi(y) - nabla phi(y)^T (x - y)
The Bregman divergence measures the gap between phi(x) and the first-order Taylor expansion of phi around y. It is always nonnegative and equals zero exactly when x = y (because phi is strictly convex).
Different generators give rise to different divergences and reproduce many familiar discrepancy measures.
| Generator phi | Domain | Resulting Bregman Divergence |
|---|---|---|
| (1/2)||x||^2 | R^n | Squared Euclidean distance |
| sum x_i log x_i (negative entropy) | Probability simplex | Kullback-Leibler divergence |
| -sum log x_i | Positive orthant | Itakura-Saito divergence |
| (1/2) x^T A x for A > 0 | R^n | Mahalanobis-like quadratic divergence |
Bregman divergences are not in general symmetric and do not satisfy the triangle inequality, so they are not metrics. They do satisfy a Pythagorean-style identity that supports projection arguments, and they are used in the analysis of mirror descent, proximal point methods, and online convex optimization.
One of the most important results involving convex functions is Jensen's inequality, named after Danish mathematician Johan Jensen who published it in 1906. For a convex function f and a random variable X:
f(E[X]) <= E[f(X)]
In words, applying a convex function to the expected value of a random variable gives a result no larger than the expected value of the convex function applied to the random variable.
For concave functions, the inequality reverses: f(E[X]) >= E[f(X)].
Jensen's inequality has wide-ranging applications. In information theory, it is used to prove that the Kullback-Leibler divergence is always nonnegative. In statistics, it establishes bounds on expectations. In machine learning, it underlies the derivation of the evidence lower bound (ELBO) used in variational inference and variational autoencoders.
The finite (discrete) form of Jensen's inequality states that for nonneg. weights w_1, ..., w_n summing to 1 and points x_1, ..., x_n:
f(sum w_i x_i) <= sum w_i f(x_i)
This follows directly from the definition of convexity by induction.
Many classical inequalities are special cases or near-corollaries of Jensen's inequality:
A twin notion to strong convexity is L-smoothness. A differentiable function f is L-smooth if its gradient is L-Lipschitz continuous, that is, ||nabla f(x) - nabla f(y)|| <= L ||x - y|| for all x, y. For twice-differentiable functions, this is equivalent to nabla^2 f(x) <= L I, an upper bound on the Hessian. L-smoothness gives a quadratic upper bound:
f(y) <= f(x) + nabla f(x)^T (y - x) + (L/2) ||y - x||^2
This upper bound, sometimes called the descent lemma, is what makes gradient descent provably effective: at each step, one can guarantee a decrease in the function value by at least a fixed amount. Combined with mu-strong convexity (lower quadratic bound), it pins the function between two paraboloids of curvature mu and L. The standard convergence theorems for gradient descent and its accelerated variants are stated in this regime.
The table below summarizes the standard regimes of convex optimization and the corresponding best-known iteration complexity for first-order methods (counting one gradient evaluation per iteration).
| Regime | Lower bound | Upper bound (vanilla GD) | Upper bound (accelerated) |
|---|---|---|---|
| Smooth (L), nonconvex | Stationary point only | O(1/sqrt(k)) gradient norm | O(1/sqrt(k)) |
| Convex, smooth (L) | O(1/k^2) | O(1/k) | O(1/k^2) |
| Strongly convex (mu), smooth (L) | O((1 - 1/kappa)^k) | O((1 - 1/kappa)^k) | O((1 - 1/sqrt(kappa))^k) |
| Strongly convex, nonsmooth | O(1/k) | O(1/k) | O(1/k) |
| Convex, nonsmooth | O(1/sqrt(k)) | O(1/sqrt(k)) | O(1/sqrt(k)) |
Convex functions are central to machine learning because they guarantee that optimization problems have well-behaved solutions. When the loss function is convex with respect to the model parameters, every local minimum is also a global minimum, and gradient descent is guaranteed to converge to an optimal solution.
The following table summarizes common convex loss functions used in machine learning.
| Loss Function | Formula | Used In | Convexity Notes |
|---|---|---|---|
| Mean Squared Error (MSE) | (1/n) sum (y_i - y_hat_i)^2 | Linear regression | Convex (and strongly convex) in the model parameters for linear models |
| Cross-entropy (log loss) | -sum y_i log(p_i) | Logistic regression, classification | Convex in the linear model parameters; non-convex in neural network weights |
| Hinge loss | max(0, 1 - y_i f(x_i)) | Support vector machines | Convex (piecewise linear); not differentiable at the hinge point |
| Huber loss | Quadratic for small errors, linear for large | Robust regression | Convex and differentiable everywhere |
| L1 loss (MAE) | (1/n) sum |y_i - y_hat_i| | Robust regression | Convex but not differentiable at zero |
| Squared hinge loss | max(0, 1 - y_i f(x_i))^2 | Squared-loss SVM variants | Convex and differentiable everywhere |
| Logistic loss | log(1 + exp(-y_i f(x_i))) | Logistic regression | Strictly convex in linear model parameters |
| Quantile (pinball) loss | max(tau (y - y_hat), (tau - 1)(y - y_hat)) | Quantile regression | Convex; estimates conditional quantiles |
| Epsilon-insensitive loss | max(0, |y - y_hat| - epsilon) | Support vector regression | Convex; ignores small errors |
For linear regression with MSE loss, the optimization problem is a convex quadratic, which admits a closed-form solution (the normal equations) or can be solved efficiently with gradient descent. For logistic regression with cross-entropy loss, the objective is convex in the weight parameters, guaranteeing convergence to the global optimum.
However, when these same loss functions are composed with a multi-layer neural network, the overall objective becomes non-convex because of the nonlinear activation functions and the product structure of the weight matrices.
Regularization terms added to loss functions are also typically convex.
| Regularizer | Formula | Convexity | Effect on Solution |
|---|---|---|---|
| L2 (ridge) | lambda ||w||_2^2 | Strongly convex | Shrinks all weights smoothly toward zero |
| L1 (lasso) | lambda ||w||_1 | Convex (not strict) | Promotes sparsity; many weights become exactly zero |
| Elastic net | lambda_1 ||w||_1 + lambda_2 ||w||_2^2 | Convex (strongly convex if lambda_2 > 0) | Combines sparsity and smoothing |
| Group Lasso | lambda sum_g ||w_g||_2 | Convex | Promotes sparsity at the group level |
| Total variation | lambda sum |w_(i+1) - w_i| | Convex | Promotes piecewise-constant solutions |
| Nuclear norm | lambda ||W||_* | Convex | Promotes low-rank solutions for matrices |
L2 regularization and L1 regularization are the most common; Lasso refers specifically to least-squares regression with L1 regularization, while ridge regression is least squares with L2 regularization. Adding L2 regularization to a convex loss function is a practical technique for converting a merely convex problem into a strongly convex one, enabling faster and more stable optimization.
The following table catalogs widely used classical models that admit convex training problems.
| Model | Training Problem | Convex Class |
|---|---|---|
| Ordinary least squares | min ||X****w - y||_2^2 | Convex quadratic |
| Ridge regression | min ||X****w - y||_2^2 + lambda ||w||_2^2 | Strongly convex quadratic |
| Lasso | min ||X****w - y||_2^2 + lambda ||w||_1 | Convex (nonsmooth) |
| Logistic regression | min sum log(1 + exp(-y_i w^T x_i)) | Convex; strongly convex with L2 regularization |
| Linear SVM (primal) | min (1/2)||w||_2^2 + C sum max(0, 1 - y_i (w^T x_i + b)) | Convex (quadratic + piecewise linear) |
| Kernel SVM (dual) | Convex quadratic in dual variables under positive semidefinite kernel | Convex QP |
| Maximum entropy / softmax classifier | min cross-entropy with linear model | Convex |
| Quantile regression | min sum pinball loss | Convex (LP) |
| Poisson regression | min Poisson negative log likelihood | Convex (canonical link) |
| LogDet model selection | min trace(S^(-1) X) - log det(X) + lambda ||X||_1 (graphical lasso) | Convex |
Support vector machines deserve particular attention. The SVM training problem is a convex quadratic program (QP) in either its primal or dual form. The convexity of this formulation means there is no risk of getting stuck in a poor local optimum, and standard QP solvers (or specialized algorithms such as SMO) can compute the unique global optimum. The kernel trick extends this convex framework to nonlinear decision boundaries while preserving convexity in the dual.
The loss function of a deep neural network is non-convex in its parameters as a whole. However, several partial convexity properties still play a role:
Despite these connections, training deep networks is fundamentally non-convex, and the success of stochastic gradient descent on these problems is one of the most studied empirical phenomena in modern machine learning. Theoretical work has shown that for sufficiently overparameterized networks, gradient descent can escape saddle points and converge to global minima despite the non-convexity, but the precise mechanisms remain an active research area.
Convex functions also appear in probabilistic methods. The negative log-likelihood of any exponential family distribution is convex in the natural parameters: this includes Gaussian (with fixed variance), Bernoulli, multinomial, Poisson, exponential, and gamma distributions. Generalized linear models (GLMs), which model the natural parameter as a linear function of features, therefore have convex training objectives. The variational lower bound used in variational inference and variational autoencoders involves a KL divergence term that is convex in the variational parameters under common parameterizations such as Gaussian families.
Convex optimization is the study of minimizing convex functions over convex sets. The key guarantee is that any local minimum of a convex function over a convex set is a global minimum. This property makes convex optimization problems tractable: efficient polynomial-time algorithms exist for solving them to arbitrary precision.
Many classical machine learning algorithms are formulated as convex optimization problems:
When a machine learning problem can be cast as a convex optimization problem, practitioners benefit from strong theoretical guarantees (global optimality, convergence bounds) and mature solver software. When the problem is non-convex (as in deep learning), researchers often apply techniques originally developed for convex settings, such as gradient descent and regularization, even though the theoretical guarantees no longer strictly apply.
Within convex optimization, several special structures unlock faster algorithms.
| Class | Form | Examples in ML |
|---|---|---|
| Linear programming (LP) | Linear objective and constraints | LP relaxations of integer programs; quantile regression |
| Quadratic programming (QP) | Convex quadratic objective, linear constraints | SVM, MPC, portfolio optimization |
| Quadratically constrained QP (QCQP) | Quadratic in objective and constraints | Sphere-constrained least squares |
| Second-order cone programming (SOCP) | Linear objective, second-order cone constraints | Robust regression, minimum-volume ellipsoid |
| Semidefinite programming (SDP) | Linear objective, PSD constraints | Max-cut relaxations, kernel learning, matrix completion |
| Geometric programming | Posynomial constraints (after log change of variables) | Circuit design, statistics |
| Conic programming | Linear objective, intersection with a closed convex cone | Unifies LP, SOCP, SDP |
A range of algorithms exploit convexity to achieve strong convergence guarantees.
| Algorithm | Best for | Convergence rate |
|---|---|---|
| Gradient descent | Smooth convex / strongly convex | O(1/k) / linear |
| Nesterov accelerated gradient | Smooth convex / strongly convex | O(1/k^2) / linear with sqrt(kappa) |
| Subgradient method | Nonsmooth convex | O(1/sqrt(k)) |
| Proximal gradient (ISTA / FISTA) | Smooth + nonsmooth (regularized) | O(1/k) / O(1/k^2) |
| Mirror descent | Constrained convex on simplex or other geometries | O(1/sqrt(k)) |
| ADMM | Splitting structured convex problems | O(1/k) under typical conditions |
| Coordinate descent | Separable convex | Linear in many regimes |
| Newton's method | Strongly convex, twice differentiable | Quadratic locally |
| Interior-point methods | Conic programs (LP, SOCP, SDP) | Polynomial in problem size and 1/epsilon |
The earliest systematic study of convex functions of a real variable is due to Otto Holder (in connection with his eponymous inequality) and Johan Jensen, who in 1906 published the inequality that bears his name. The general theory in R^n was developed in the mid-20th century by researchers including Werner Fenchel, Jean-Jacques Moreau, and R. Tyrrell Rockafellar, whose 1970 monograph Convex Analysis remains a standard reference. The Fenchel-Moreau theorem, the calculus of subdifferentials, and the perspective transform all originated in this period.
Applications of convex analysis to optimization were given a major push by Naum Shor, Boris Polyak, and Yurii Nesterov in the Soviet school during the 1960s through 1980s. Nesterov and Arkadi Nemirovski developed the modern theory of polynomial-time interior-point methods for self-concordant convex functions in the late 1980s and 1990s, which extended the celebrated polynomial-time guarantees of linear programming (Khachiyan 1979, Karmarkar 1984) to the much broader class of conic convex programs. Stephen Boyd and Lieven Vandenberghe then translated the theory into a practitioner-oriented framework, culminating in the textbook Convex Optimization (2004) and in the disciplined convex programming methodology embodied in CVX, CVXPY, and Convex.jl. Their work made convex optimization a routine tool across engineering, statistics, and machine learning.
The following table summarizes the relationships between different levels of convexity.
| Property | Guarantees Unique Minimum? | Convergence Rate (GD) | Second-Order Condition |
|---|---|---|---|
| Convex | No (may have multiple or none) | O(1/k) | nabla^2 f >= 0 |
| Strictly convex | Yes (at most one) | O(1/k) | nabla^2 f > 0 |
| Strongly convex (mu > 0) | Yes (exactly one) | O(log(1/epsilon)) | nabla^2 f >= mu I |