# Convex Function

> Source: https://aiwiki.ai/wiki/convex_function
> Updated: 2026-07-11
> Categories: Machine Learning, Mathematics, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **convex function** is a real-valued function whose graph curves upward into a bowl or cup shape, so that the line segment (chord) connecting any two points on the graph lies on or above the graph itself. [1] Formally, a function *f* is convex when $$f(t\mathbf{x} + (1 - t)\mathbf{y}) \le t f(\mathbf{x}) + (1 - t) f(\mathbf{y})$$ for all points **x**, **y** in its domain and every *t* in [0, 1], and equivalently when its epigraph (the set of points lying on or above the graph) is a [convex set](/wiki/convex_set). [1][2] The single property that makes convex functions central to [machine learning](/wiki/machine_learning) is that for a convex objective every local minimum is also a global minimum, so algorithms such as [gradient descent](/wiki/gradient_descent) provably converge to the best possible solution. [1][4]

Convexity, not linearity, is the dividing line between optimization problems that are reliably solvable and those that are not. As R. Tyrrell Rockafellar put it, "the great watershed in optimization isn't between linearity and nonlinearity, but convexity and nonconvexity." [21] The study of convex functions belongs to a branch of mathematics called convex analysis, formalized in the 20th century in works such as Werner Fenchel's lecture notes (1951) and Rockafellar's *Convex Analysis* (1970). [2] The properties of convex functions, especially the fact that local minima coincide with global minima, make them the most well-behaved class of nonlinear objective functions in mathematical optimization. Modern textbooks such as Boyd and Vandenberghe's *Convex Optimization* (2004) and Nesterov's *Introductory Lectures on Convex Optimization* (2004) treat convex functions as the central object of study because efficient solvers exist for almost any problem expressible in convex form. [1][3]

## eli5: explain like i'm 5

Imagine you have a bowl sitting on a table. If you drop a marble into the bowl, it will always roll down to the very bottom, no matter where you drop it. A convex function is like the inside surface of that bowl. There is only one lowest point, and every path leads down to it.

In machine learning, we use bowl-shaped (convex) functions to measure how wrong our predictions are. Because the function is shaped like a bowl, our training algorithm can always find the single best answer by rolling downhill. If the function were bumpy with lots of dips (non-convex), the marble might get stuck in a small dip that is not the true bottom, and our model would not learn the best possible answer.

A second way to picture convexity: stretch a rubber band between any two points on the curve. If the rubber band always stays above (or touching) the curve, the function is convex. If the rubber band ever dips below the curve, the function is not convex.

## What is a convex function? (formal definition)

Let $$f : \mathbb{R}^n \to \mathbb{R}$$ be a function whose domain is a [convex set](/wiki/convex_set). The function *f* is **convex** if, for all **x**, **y** in its domain and for every *t* in [0, 1]:

$$
f(t\mathbf{x} + (1 - t)\mathbf{y}) \le t f(\mathbf{x}) + (1 - t)f(\mathbf{y})
$$

Geometrically, this inequality states that the chord (line segment) between any two points $$(x, f(x))$$ and $$(y, f(y))$$ on the graph of *f* lies on or above the graph. [1] Equivalently, the **epigraph** of *f* (the set of points lying on or above the graph) is a convex set. [2]

A function *f* is **concave** if -*f* is convex. Linear and affine functions are the only functions that are both convex and concave simultaneously. [1]

The definition can be extended to functions whose domain is any real vector space, including infinite-dimensional function spaces. In abstract convex analysis, a convex function is often allowed to take the value plus infinity, with the convention that the inequality remains valid by reading any expression involving infinity in the natural way. The set where *f* is finite is called the **effective domain**, written $$\operatorname{dom} f$$. A convex function whose effective domain is nonempty is called **proper**. [2]

### the epigraph viewpoint

The epigraph of *f* is the set $$\operatorname{epi} f = \{(\mathbf{x}, r) \in \mathbb{R}^n \times \mathbb{R} : f(\mathbf{x}) \le r\}$$. A function is convex if and only if its epigraph is a convex subset of $$\mathbb{R}^{n+1}$$. [1] This characterization is powerful because it converts every question about a convex function into a question about a [convex set](/wiki/convex_set), unifying the theory of convex functions with the geometry of convex sets. The closely related **sublevel set** $$\{\mathbf{x} : f(\mathbf{x}) \le \alpha\}$$ is also convex for every *alpha* whenever *f* is convex, although sublevel-set convexity alone is weaker than convexity (it characterizes the broader class of [quasi-convex functions](/wiki/quasiconvex_function)). [1]

## strictly convex functions

A function *f* is **strictly convex** if the inequality holds with strict inequality for all distinct **x** != **y** and for all *t* in (0, 1):

$$
f(t\mathbf{x} + (1 - t)\mathbf{y}) < t f(\mathbf{x}) + (1 - t)f(\mathbf{y})
$$

Strict convexity means the chord between two distinct points lies strictly above the graph (never touching it except at the endpoints). A strictly convex function has **at most one global minimum**. If a minimum exists, it is unique. [1] The function $$f(x) = e^x$$ is a classic example: it is strictly convex on all of $$\mathbb{R}$$, but it does not attain a minimum.

## strongly convex functions

Strong convexity is a more powerful condition that quantifies how "curved" a function is. A differentiable function *f* is **mu-strongly convex** (for some $$\mu > 0$$) if for all **x**, **y** in its domain:

$$
f(\mathbf{y}) \ge f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y} - \mathbf{x}) + \frac{\mu}{2} \lVert \mathbf{y} - \mathbf{x} \rVert^2
$$

Intuitively, a mu-strongly convex function grows at least as fast as a quadratic with curvature parameter mu. This provides a quadratic lower bound on the function, guaranteeing a unique global minimum and enabling stronger convergence guarantees for optimization algorithms. [3][7]

### equivalent characterizations

The following conditions are equivalent for a differentiable function *f*: [8]

| Characterization | Statement |
|---|---|
| Quadratic lower bound | $$f(\mathbf{y}) \ge f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y} - \mathbf{x}) + \frac{\mu}{2} \lVert \mathbf{y} - \mathbf{x} \rVert^2$$ |
| Subtraction test | $$g(\mathbf{x}) = f(\mathbf{x}) - \frac{\mu}{2} \lVert \mathbf{x} \rVert^2$$ is convex |
| Monotone gradient | $$(\nabla f(\mathbf{x}) - \nabla f(\mathbf{y}))^\top (\mathbf{x} - \mathbf{y}) \ge \mu \lVert \mathbf{x} - \mathbf{y} \rVert^2$$ |
| Strengthened Jensen | $$f(t\mathbf{x} + (1-t)\mathbf{y}) \le t f(\mathbf{x}) + (1-t)f(\mathbf{y}) - t(1-t)\frac{\mu}{2} \lVert \mathbf{x} - \mathbf{y} \rVert^2$$ |

Strong convexity implies strict convexity, which in turn implies convexity, but the reverse implications do not hold. [8]

### convergence rates

Strong convexity is especially important in optimization because it guarantees **linear convergence** (exponential decrease in error) for gradient descent. For a mu-strongly convex function with *L*-Lipschitz continuous gradients, gradient descent achieves an error of at most epsilon after $$O\!\left(\frac{L}{\mu} \log(1/\epsilon)\right)$$ iterations. [3][7] Without strong convexity, the convergence rate for smooth convex functions is only $$O(1/k)$$, which is substantially slower. [7]

The ratio $$\kappa = L/\mu$$ is called the **condition number** of the optimization problem. A small condition number (close to 1) means the function is well conditioned and gradient methods converge quickly. A large condition number means the function is elongated and ill conditioned, in which case [accelerated gradient methods](/wiki/nesterov_accelerated_gradient) such as Nesterov's method achieve a faster $$O(\sqrt{\kappa} \log(1/\epsilon))$$ rate. [3][7]

## How do you test whether a function is convex?

There are three standard tests for convexity, of increasing strength of assumption: the zeroth-order definition (the chord inequality above), the first-order condition (tangent lines lie below the graph), and the second-order condition (the Hessian is positive semidefinite). [1] The second-order Hessian test is the one most often applied in practice because it reduces convexity to a linear-algebra check on the matrix of second derivatives.

### first-order condition

If *f* is differentiable, it is convex if and only if its domain is a convex set and for all **x**, **y** in the domain:

$$
f(\mathbf{y}) \ge f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y} - \mathbf{x})
$$

This condition has a clear geometric meaning: the tangent hyperplane at any point lies entirely on or below the graph of the function. [1] In one dimension, this says that every tangent line is a global underestimator of the function. This first-order condition is the basis for many optimization algorithms, because it implies that the gradient provides a reliable direction toward the minimum.

A related first-order characterization is **monotonicity of the gradient**: a differentiable function *f* on a convex domain is convex if and only if (nabla *f*(**x**) - nabla *f*(**y**))^T (**x** - **y**) >= 0 for all **x**, **y**. [1] This generalizes the fact that the derivative of a convex function on the real line is nondecreasing.

### second-order condition

If *f* is twice differentiable, it is convex if and only if its **Hessian matrix** (the matrix of second partial derivatives) is **positive semidefinite** everywhere on its domain:

$$
\nabla^2 f(\mathbf{x}) \succeq 0 \quad \text{for all } \mathbf{x} \in \operatorname{dom}(f)
$$

For a function of a single variable, this reduces to the familiar condition $$f''(x) \ge 0$$ for all *x*. [1] For strictly convex functions, the Hessian must be positive definite (all eigenvalues strictly positive). For mu-strongly convex functions, the Hessian satisfies $$\nabla^2 f(\mathbf{x}) \succeq \mu \mathbf{I}$$ for all **x**, meaning every eigenvalue of the Hessian is at least mu. [3]

| Convexity Type | Second-Order Condition |
|---|---|
| Convex | $$\nabla^2 f(\mathbf{x}) \succeq 0$$ (positive semidefinite) |
| Strictly convex | $$\nabla^2 f(\mathbf{x}) \succ 0$$ (positive definite) |
| mu-strongly convex | $$\nabla^2 f(\mathbf{x}) \succeq \mu \mathbf{I}$$ (eigenvalues $$\ge \mu$$) |

A subtle point: positive definite Hessian everywhere on the domain implies strict convexity, but the converse is false. The classical counterexample is $$f(x) = x^4$$, which is strictly convex on $$\mathbb{R}$$ but has $$f''(0) = 0$$. [1] Strong convexity is the condition that pins down a uniform positive lower bound on the curvature.

## What are some examples of convex functions?

The following table lists commonly encountered convex functions and their key properties. [1]

| Function | Formula | Strict? | Strongly Convex? | Notes |
|---|---|---|---|---|
| Affine / Linear | $$f(\mathbf{x}) = \mathbf{a}^\top \mathbf{x} + b$$ | No | No | Both convex and concave; the only functions with this property |
| Quadratic | $$f(x) = x^2$$ | Yes | Yes ($$\mu = 2$$) | The prototypical strongly convex function |
| Positive-definite quadratic | $$f(\mathbf{x}) = \mathbf{x}^\top \mathbf{A} \mathbf{x}$$ for $$\mathbf{A} \succ 0$$ | Yes | Yes ($$\mu = \lambda_{\min}(\mathbf{A})$$) | Strongly convex with parameter equal to the smallest eigenvalue of **A** |
| Exponential | $$f(x) = e^{ax}$$ for any *a* | Yes | No | Strictly convex but not strongly convex on all of $$\mathbb{R}$$ |
| Negative logarithm | $$f(x) = -\log(x)$$, $$x > 0$$ | Yes | No | Strictly convex on its domain; used as a barrier function in interior-point methods |
| Norms | $$f(\mathbf{x}) = \lVert \mathbf{x} \rVert_p$$, $$p \ge 1$$ | No | No | Convex by the triangle inequality; not strictly convex in general |
| Log-sum-exp | $$f(\mathbf{x}) = \log\left(\sum e^{x_i}\right)$$ | No | No | Smooth approximation to the max function; its gradient is the [softmax](/wiki/softmax) function |
| Absolute value | $$f(x) = \lvert x \rvert$$ | No | No | Convex but not differentiable at *x* = 0 |
| Powers | $$f(x) = x^p$$, $$x \ge 0$$, $$p \ge 1$$ | Yes (if $$p > 1$$) | No | Convex on the nonneg. reals for $$p \ge 1$$ |
| Negative entropy | $$f(x) = x \log x$$, $$x > 0$$ | Yes | No | Strictly convex on $$(0, \infty)$$; appears in information theory |
| Maximum eigenvalue | $$f(\mathbf{X}) = \lambda_{\max}(\mathbf{X})$$ | No | No | Convex on symmetric matrices; basis of semidefinite programming |
| Spectral norm | $$f(\mathbf{X}) = \sigma_{\max}(\mathbf{X})$$ | No | No | Largest singular value; convex matrix function |
| Quadratic over linear | $$f(x, y) = x^2 / y$$, $$y > 0$$ | Yes | No | Convex jointly in (*x*, *y*); appears in regression and signal processing |

### the log-sum-exp function

The **log-sum-exp** (LSE) function, defined as $$f(\mathbf{x}) = \log\left(\sum_{i=1}^{n} e^{x_i}\right)$$, deserves special attention because of its wide use in machine learning. It is a smooth, convex approximation to the maximum function, satisfying the tight two-sided bound: [10]

$$
\max(x_1, \ldots, x_n) \le \log\left(\sum e^{x_i}\right) \le \max(x_1, \ldots, x_n) + \log(n)
$$

The lower bound is strict except when $$n = 1$$, and the upper bound is strict unless all arguments are equal, so the approximation error never exceeds $$\log(n)$$. [10] The gradient of the LSE function is the [softmax](/wiki/softmax) function, and its Hessian can be shown to be positive semidefinite via the Cauchy-Schwarz inequality. [1][10] The LSE function appears in the log-partition function of exponential family distributions, in the cross-entropy [loss function](/wiki/loss_function) for classification, and in smooth relaxations used in reinforcement learning.

### norms and seminorms

Every norm on $$\mathbb{R}^n$$ is a convex function. The proof is a direct application of the triangle inequality and absolute homogeneity. For *t* in [0, 1] and any norm $$\lVert \cdot \rVert$$: [1]

$$
\lVert t\mathbf{x} + (1 - t)\mathbf{y} \rVert \le \lVert t\mathbf{x} \rVert + \lVert (1 - t)\mathbf{y} \rVert = t \lVert \mathbf{x} \rVert + (1 - t) \lVert \mathbf{y} \rVert
$$

This covers the Euclidean (L2) norm, the L1 (Manhattan) norm, the L-infinity (max) norm, and every Lp norm for *p* >= 1. None of these are strictly convex except for the L2 norm restricted away from the origin (and even then only weakly so). Norms are not differentiable at zero in general, so subgradient calculus replaces standard gradient calculus when working with norm-regularized problems such as the [Lasso](/wiki/lasso).

### convex matrix functions

Several important functions on matrix arguments are convex. The trace of a matrix product **A****X** is linear (and hence convex) in **X**. The largest eigenvalue lambda_max(**X**) of a symmetric matrix is convex, since it can be written as the supremum of the linear functions **u**^T **X** **u** over unit vectors **u**. [1] The negative log determinant *f*(**X**) = -log det(**X**) is convex on the cone of positive definite matrices and serves as a barrier in semidefinite programming. The spectral norm and the nuclear norm (sum of singular values) are also convex; the nuclear norm is the convex envelope of the matrix rank and underlies many [low-rank matrix recovery](/wiki/matrix_completion) techniques.

## What is an example of a non-convex function?

Not all functions encountered in machine learning are convex. Important non-convex cases include: [16]

- **Neural network loss surfaces.** The [loss function](/wiki/loss_function) of a [neural network](/wiki/neural_network) with hidden layers is generally non-convex due to nonlinear [activation functions](/wiki/activation_function) and the multiplicative interaction of weight matrices. The loss landscape contains many local minima and saddle points. [16]
- **f(x) = sin(x).** Oscillating functions are neither convex nor concave over their full domain.
- **f(x) = x^3.** This function is concave for *x* < 0 and convex for *x* > 0, so it is not convex on all of $$\mathbb{R}$$.
- **f(x) = log(x).** The logarithm is concave (not convex) on the positive reals.
- **The sigmoid function.** *f*(*x*) = 1 / (1 + *e*^(-*x*)) is neither convex nor concave on $$\mathbb{R}$$; it is convex on (-infinity, 0] and concave on [0, infinity).
- **Mixture model log-likelihoods.** The negative log-likelihood of a Gaussian mixture model is non-convex, which is why expectation-maximization is used to find local rather than global optima.
- **Matrix factorization objectives.** Loss functions that involve the product of two matrices to be optimized jointly (as in low-rank factorization) are non-convex even when each factor is fixed and the loss in the other factor is convex.

When the loss function is non-convex, [gradient descent](/wiki/gradient_descent) may converge to local minima or saddle points rather than the global minimum. In practice, techniques such as [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) with momentum, learning rate schedules, and random initialization help navigate non-convex landscapes. [16]

A practical workaround when the original problem is non-convex is to construct a **convex relaxation**: replace the function with a convex envelope or a convex upper bound that is tractable to optimize. The use of the L1 norm in place of the L0 "norm" (cardinality) is the canonical example, and it underlies compressed sensing as well as sparse regression. [1]

## operations that preserve convexity

Convexity is preserved under several useful operations, making it possible to build complex convex functions from simpler ones. These composition rules are central to modeling in [convex optimization](/wiki/convex_optimization). [1]

| Operation | Rule | Example |
|---|---|---|
| Nonnegative weighted sum | If *f_1*, ..., *f_m* are convex and $$w_i \ge 0$$, then $$\sum w_i f_i$$ is convex | Weighted sum of loss terms in multi-task learning |
| Affine composition | If *f* is convex, then $$f(\mathbf{A}\mathbf{x} + \mathbf{b})$$ is convex | Applying a linear transformation to the input |
| Pointwise maximum | If *f_1*, ..., *f_m* are convex, then $$\max(f_1, \ldots, f_m)$$ is convex | Piecewise-linear functions; hinge loss |
| Pointwise supremum | If *f*(**x**, *y*) is convex in **x** for each *y*, then $$\sup_y f(\mathbf{x}, y)$$ is convex in **x** | Support function of a set |
| Composition (nondecreasing) | If *g* is convex and nondecreasing and *f* is convex, then $$g(f(\mathbf{x}))$$ is convex | $$e^{f(\mathbf{x})}$$ when *f* is convex |
| Composition (nonincreasing) | If *g* is convex and nonincreasing and *f* is concave, then $$g(f(\mathbf{x}))$$ is convex | -log of a concave function |
| Minimization over convex set | If *f*(**x**, **y**) is convex in (**x**, **y**) and *C* is convex, then $$g(\mathbf{x}) = \inf_{\mathbf{y} \in C} f(\mathbf{x}, \mathbf{y})$$ is convex | Projection onto a convex set |
| Perspective | If *f* is convex, the perspective $$g(\mathbf{x}, t) = t f(\mathbf{x}/t)$$ is convex for $$t > 0$$ | KL divergence as the perspective of -log |
| Conjugate | The convex conjugate $$f^*(\mathbf{y}) = \sup_x (\mathbf{y}^\top \mathbf{x} - f(\mathbf{x}))$$ is always convex | Legendre transform; dual formulations |
| Infimal convolution | The inf-convolution $$(f \,\square\, g)(\mathbf{x}) = \inf_y f(\mathbf{y}) + g(\mathbf{x} - \mathbf{y})$$ is convex if *f* and *g* are | Moreau envelope; smoothing of nondifferentiable functions |

### disciplined convex programming

These composition rules form the basis of **disciplined convex programming** (DCP), a methodology popularized by Stephen Boyd's group at Stanford and implemented in modeling languages such as CVX (MATLAB), CVXPY (Python), and Convex.jl (Julia). [14] In DCP, every expression is built from a small library of atomic convex and concave functions combined according to verifiable composition rules. If a problem is written in DCP form, the modeling tool can automatically certify its convexity and dispatch it to a suitable solver such as ECOS, SCS, or MOSEK. DCP is the reason convex optimization is widely accessible to practitioners who have not memorized the underlying theory. [14]

## subgradients and subdifferentials

Many convex functions of practical interest, such as the L1 norm and the hinge loss, are not differentiable everywhere. The right generalization of the gradient for convex functions is the **subgradient**. A vector **g** is a subgradient of a convex function *f* at the point **x** if for every **y** in dom *f*: [18]

$$
f(\mathbf{y}) \ge f(\mathbf{x}) + \mathbf{g}^\top (\mathbf{y} - \mathbf{x})
$$

The set of all subgradients at **x** is called the **subdifferential** and is denoted partial *f*(**x**). Three key facts make the subdifferential a useful object: [2][18]

- For convex *f*, the subdifferential partial *f*(**x**) is nonempty at every interior point of dom *f*.
- If *f* is differentiable at **x**, then partial *f*(**x**) = {nabla *f*(**x**)}.
- A point **x*** is a global minimum of a convex function *f* if and only if **0** in partial *f*(**x***).

The subdifferential of the absolute value at zero is the interval [-1, 1], and the subdifferential of the L1 norm at a sparse vector contains the sign vectors with -1 or 1 in the nonzero coordinates and any value in [-1, 1] in the zero coordinates. This is why the L1 norm encourages sparsity: setting a coordinate to zero is a stationary point because the subdifferential straddles zero. [18]

Subgradient methods replace the gradient in [gradient descent](/wiki/gradient_descent) with any element of the subdifferential. They converge at the slower $$O(1/\sqrt{k})$$ rate but apply to the broader class of nonsmooth convex functions. [7][12] The **proximal gradient method** further generalizes this by using the proximal operator of a (possibly nonsmooth) convex regularizer, and it is the workhorse for solving regularized problems such as the Lasso. [12]

## convex conjugate (legendre-fenchel transform)

The **convex conjugate** of a function *f*, written *f*^*, is defined by: [19]

$$
f^*(\mathbf{y}) = \sup_{\mathbf{x} \in \operatorname{dom} f} (\mathbf{y}^\top \mathbf{x} - f(\mathbf{x}))
$$

This transform, also called the Legendre-Fenchel transform, plays the role of the Fourier transform in convex analysis: it linearizes the nonlinear structure of *f* by re-expressing it in terms of supporting hyperplanes. Key properties: [2][19]

- The conjugate *f*^* is always convex and lower semicontinuous, regardless of *f*.
- For convex, lower semicontinuous, proper *f*, the **Fenchel-Moreau theorem** states that *f*** = *f*. The conjugate is therefore an involution on this class.
- The conjugate of the squared L2 norm is the squared L2 norm itself (up to a factor).
- The conjugate of the negative entropy is log-sum-exp.
- The conjugate of an indicator function delta_C of a convex set *C* is the support function of *C*.

Convex conjugates underpin the theory of **Lagrange duality**: the dual of a convex optimization problem is constructed from the conjugate of the Lagrangian. [1] They also describe the Bregman divergence in dual coordinates and provide the link between primal and dual formulations of regularized problems.

## bregman divergence

Given a strictly convex, differentiable function *phi*, the **Bregman divergence** generated by *phi* is defined as: [15][20]

$$
D_\phi(\mathbf{x}, \mathbf{y}) = \phi(\mathbf{x}) - \phi(\mathbf{y}) - \nabla \phi(\mathbf{y})^\top (\mathbf{x} - \mathbf{y})
$$

The Bregman divergence measures the gap between *phi*(**x**) and the first-order Taylor expansion of *phi* around **y**. It is always nonnegative and equals zero exactly when **x** = **y** (because *phi* is strictly convex). [15]

Different generators give rise to different divergences and reproduce many familiar discrepancy measures. [15]

| Generator $$\phi$$ | Domain | Resulting Bregman Divergence |
|---|---|---|
| $$\frac{1}{2}\lVert \mathbf{x} \rVert^2$$ | $$\mathbb{R}^n$$ | Squared Euclidean distance |
| $$\sum x_i \log x_i$$ (negative entropy) | Probability simplex | [Kullback-Leibler divergence](/wiki/kl_divergence) |
| $$-\sum \log x_i$$ | Positive orthant | Itakura-Saito divergence |
| $$\frac{1}{2} \mathbf{x}^\top \mathbf{A} \mathbf{x}$$ for $$\mathbf{A} \succ 0$$ | $$\mathbb{R}^n$$ | Mahalanobis-like quadratic divergence |

Bregman divergences are not in general symmetric and do not satisfy the triangle inequality, so they are not metrics. They do satisfy a Pythagorean-style identity that supports projection arguments, and they are used in the analysis of **mirror descent**, **proximal point methods**, and **online convex optimization**. [15]

## What is Jensen's inequality?

One of the most important results involving convex functions is **Jensen's inequality**, named after Danish mathematician Johan Jensen, who published it in his 1906 paper "Sur les fonctions convexes et les inegalites entre les valeurs moyennes" in *Acta Mathematica* (volume 30, pages 175-193). [6][22] For a convex function *f* and a random variable *X*:

$$
f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]
$$

In words, applying a convex function to the expected value of a random variable gives a result no larger than the expected value of the convex function applied to the random variable. [6] Jensen's 1906 paper introduced the modern definition and terminology of convex functions, building on an earlier 1889 result by Otto Holder for twice-differentiable functions ("Uber einen Mittelwerthsatz"). [6][23]

For concave functions, the inequality reverses: $$f(\mathbb{E}[X]) \ge \mathbb{E}[f(X)]$$.

Jensen's inequality has wide-ranging applications. In information theory, it is used to prove that the [Kullback-Leibler divergence](/wiki/kl_divergence) is always nonnegative. In statistics, it establishes bounds on expectations. In machine learning, it underlies the derivation of the evidence lower bound (ELBO) used in variational inference and [variational autoencoders](/wiki/vae). [6][16]

### finite form

The finite (discrete) form of Jensen's inequality states that for nonneg. weights *w_1*, ..., *w_n* summing to 1 and points *x_1*, ..., *x_n*: [6]

$$
f\left(\sum w_i x_i\right) \le \sum w_i f(x_i)
$$

This follows directly from the definition of convexity by induction.

### derived inequalities

Many classical inequalities are special cases or near-corollaries of Jensen's inequality: [6]

- The **arithmetic-geometric mean (AM-GM) inequality** follows from applying Jensen to the concave logarithm.
- **Holder's inequality** can be derived from Jensen by applying it to the convex function *t* -> *t*^p with appropriate measures.
- The **information inequality** D_KL(*P* \|\| *Q*) >= 0 between any two probability distributions is a direct consequence of applying Jensen to -log.
- **Cauchy-Schwarz** can be obtained from Jensen applied to the squaring map.
- The fact that the variance of a random variable is nonnegative is the simplest case (apply Jensen to *t* -> *t*^2).

## smoothness and lipschitz gradients

A twin notion to strong convexity is **L-smoothness**. A differentiable function *f* is *L*-smooth if its gradient is *L*-Lipschitz continuous, that is, \|\|nabla *f*(**x**) - nabla *f*(**y**)\|\| <= *L* \|\|**x** - **y**\|\| for all **x**, **y**. For twice-differentiable functions, this is equivalent to nabla^2 *f*(**x**) <= *L* **I**, an upper bound on the Hessian. *L*-smoothness gives a quadratic upper bound: [7][12]

$$
f(\mathbf{y}) \le f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y} - \mathbf{x}) + \frac{L}{2} \lVert \mathbf{y} - \mathbf{x} \rVert^2
$$

This upper bound, sometimes called the descent lemma, is what makes gradient descent provably effective: at each step, one can guarantee a decrease in the function value by at least a fixed amount. [12] Combined with mu-strong convexity (lower quadratic bound), it pins the function between two paraboloids of curvature mu and *L*. The standard convergence theorems for gradient descent and its accelerated variants are stated in this regime. [3][7]

The table below summarizes the standard regimes of convex optimization and the corresponding best-known iteration complexity for first-order methods (counting one gradient evaluation per iteration). [7]

| Regime | Lower bound | Upper bound (vanilla GD) | Upper bound (accelerated) |
|---|---|---|---|
| Smooth (L), nonconvex | Stationary point only | $$O(1/\sqrt{k})$$ gradient norm | $$O(1/\sqrt{k})$$ |
| Convex, smooth (L) | $$O(1/k^2)$$ | $$O(1/k)$$ | $$O(1/k^2)$$ |
| Strongly convex ($$\mu$$), smooth (L) | $$O((1 - 1/\kappa)^k)$$ | $$O((1 - 1/\kappa)^k)$$ | $$O((1 - 1/\sqrt{\kappa})^k)$$ |
| Strongly convex, nonsmooth | $$O(1/k)$$ | $$O(1/k)$$ | $$O(1/k)$$ |
| Convex, nonsmooth | $$O(1/\sqrt{k})$$ | $$O(1/\sqrt{k})$$ | $$O(1/\sqrt{k})$$ |

## Why do convex functions matter in machine learning?

Convex functions are central to machine learning because they guarantee that optimization problems have well-behaved solutions. When the [loss function](/wiki/loss_function) is convex with respect to the model parameters, every local minimum is also a global minimum, and [gradient descent](/wiki/gradient_descent) is guaranteed to converge to an optimal solution. [1][4] As Boyd and Vandenberghe state the defining consequence, "any locally optimal point of a convex problem is (globally) optimal." [1] This removes the central difficulty of general optimization, namely the risk of getting trapped in a suboptimal local minimum.

### convex loss functions

The following table summarizes common convex loss functions used in machine learning. [4]

| Loss Function | Formula | Used In | Convexity Notes |
|---|---|---|---|
| Mean Squared Error (MSE) | $$\frac{1}{n} \sum (y_i - \hat{y}_i)^2$$ | [Linear regression](/wiki/linear_regression) | Convex (and strongly convex) in the model parameters for linear models |
| Cross-entropy (log loss) | $$-\sum y_i \log(p_i)$$ | [Logistic regression](/wiki/logistic_regression), classification | Convex in the linear model parameters; non-convex in neural network weights |
| Hinge loss | $$\max(0, 1 - y_i f(x_i))$$ | [Support vector machines](/wiki/support_vector_machine_svm) | Convex (piecewise linear); not differentiable at the hinge point |
| Huber loss | Quadratic for small errors, linear for large | Robust regression | Convex and differentiable everywhere |
| L1 loss (MAE) | $$\frac{1}{n} \sum \lvert y_i - \hat{y}_i \rvert$$ | Robust regression | Convex but not differentiable at zero |
| Squared hinge loss | $$\max(0, 1 - y_i f(x_i))^2$$ | Squared-loss SVM variants | Convex and differentiable everywhere |
| Logistic loss | $$\log(1 + \exp(-y_i f(x_i)))$$ | [Logistic regression](/wiki/logistic_regression) | Strictly convex in linear model parameters |
| Quantile (pinball) loss | $$\max(\tau (y - \hat{y}), (\tau - 1)(y - \hat{y}))$$ | Quantile regression | Convex; estimates conditional quantiles |
| Epsilon-insensitive loss | $$\max(0, \lvert y - \hat{y} \rvert - \epsilon)$$ | Support vector regression | Convex; ignores small errors |

For [linear regression](/wiki/linear_regression) with MSE loss, the optimization problem is a convex quadratic, which admits a closed-form solution (the normal equations) or can be solved efficiently with gradient descent. For [logistic regression](/wiki/logistic_regression) with cross-entropy loss, the objective is convex in the weight parameters, guaranteeing convergence to the global optimum. [4]

However, when these same loss functions are composed with a multi-layer [neural network](/wiki/neural_network), the overall objective becomes non-convex because of the nonlinear activation functions and the product structure of the weight matrices. [16]

### convex regularization

Regularization terms added to loss functions are also typically convex. [1][4]

| Regularizer | Formula | Convexity | Effect on Solution |
|---|---|---|---|
| [L2](/wiki/l2_regularization) (ridge) | $$\lambda \lVert \mathbf{w} \rVert_2^2$$ | Strongly convex | Shrinks all weights smoothly toward zero |
| [L1](/wiki/l1_regularization) (lasso) | $$\lambda \lVert \mathbf{w} \rVert_1$$ | Convex (not strict) | Promotes sparsity; many weights become exactly zero |
| Elastic net | $$\lambda_1 \lVert \mathbf{w} \rVert_1 + \lambda_2 \lVert \mathbf{w} \rVert_2^2$$ | Convex (strongly convex if $$\lambda_2 > 0$$) | Combines sparsity and smoothing |
| Group Lasso | $$\lambda \sum_g \lVert \mathbf{w}_g \rVert_2$$ | Convex | Promotes sparsity at the group level |
| Total variation | $$\lambda \sum \lvert w_{i+1} - w_i \rvert$$ | Convex | Promotes piecewise-constant solutions |
| Nuclear norm | $$\lambda \lVert \mathbf{W} \rVert_*$$ | Convex | Promotes low-rank solutions for matrices |

[L2 regularization](/wiki/l2_regularization) and [L1 regularization](/wiki/l1_regularization) are the most common; [Lasso](/wiki/lasso) refers specifically to least-squares regression with L1 regularization, while [ridge regression](/wiki/ridge_regression) is least squares with L2 regularization. Adding L2 regularization to a convex loss function is a practical technique for converting a merely convex problem into a strongly convex one, enabling faster and more stable optimization. [3][4]

### classical convex models in machine learning

The following table catalogs widely used classical models that admit convex training problems. [1][4]

| Model | Training Problem | Convex Class |
|---|---|---|
| Ordinary least squares | $$\min \lVert \mathbf{X}\mathbf{w} - \mathbf{y} \rVert_2^2$$ | Convex quadratic |
| Ridge regression | $$\min \lVert \mathbf{X}\mathbf{w} - \mathbf{y} \rVert_2^2 + \lambda \lVert \mathbf{w} \rVert_2^2$$ | Strongly convex quadratic |
| Lasso | $$\min \lVert \mathbf{X}\mathbf{w} - \mathbf{y} \rVert_2^2 + \lambda \lVert \mathbf{w} \rVert_1$$ | Convex (nonsmooth) |
| Logistic regression | $$\min \sum \log(1 + \exp(-y_i \mathbf{w}^\top \mathbf{x_i}))$$ | Convex; strongly convex with L2 regularization |
| Linear SVM (primal) | $$\min \frac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum \max(0, 1 - y_i (\mathbf{w}^\top \mathbf{x_i} + b))$$ | Convex (quadratic + piecewise linear) |
| Kernel SVM (dual) | Convex quadratic in dual variables under positive semidefinite kernel | Convex QP |
| Maximum entropy / softmax classifier | min cross-entropy with linear model | Convex |
| Quantile regression | min sum pinball loss | Convex (LP) |
| Poisson regression | min Poisson negative log likelihood | Convex (canonical link) |
| LogDet model selection | $$\min \operatorname{trace}(\mathbf{S}^{-1} \mathbf{X}) - \log \det(\mathbf{X}) + \lambda \lVert \mathbf{X} \rVert_1$$ (graphical lasso) | Convex |

[Support vector machines](/wiki/support_vector_machine_svm) deserve particular attention. The SVM training problem is a convex quadratic program (QP) in either its primal or dual form. The convexity of this formulation means there is no risk of getting stuck in a poor local optimum, and standard QP solvers (or specialized algorithms such as SMO) can compute the unique global optimum. [1][4] The kernel trick extends this convex framework to nonlinear decision boundaries while preserving convexity in the dual.

### deep learning and convexity

The loss function of a deep neural network is non-convex in its parameters as a whole. [16] However, several partial convexity properties still play a role:

- Holding all but the **final layer** fixed, the loss is typically convex in the final layer weights when the loss itself is convex (cross-entropy, MSE, hinge). This is why a linear classifier on frozen features works as a useful baseline and why the last linear layer of a transformer is sometimes treated separately.
- **Two-layer infinite-width networks** (mean-field limit) admit a convex formulation in the space of probability measures over neuron parameters, even though finite-width networks are non-convex.
- The **neural tangent kernel** regime linearizes the network around initialization, producing a convex (kernel ridge regression) problem that approximates the original non-convex training dynamics for large widths.
- Many regularizers used in deep learning (weight decay, L1, dropout-as-Bayesian-prior, spectral norm penalties) are convex, even though the data-fitting term is not.
- **Convex relaxations** of neural network verification problems (e.g., LP relaxations of ReLU constraints) are used to certify robustness against adversarial examples.

Despite these connections, training deep networks is fundamentally non-convex, and the success of [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) on these problems is one of the most studied empirical phenomena in modern machine learning. [16] Theoretical work has shown that for sufficiently overparameterized networks, gradient descent can escape saddle points and converge to global minima despite the non-convexity, but the precise mechanisms remain an active research area.

### probabilistic and bayesian methods

Convex functions also appear in probabilistic methods. The negative log-likelihood of any **exponential family** distribution is convex in the natural parameters: this includes Gaussian (with fixed variance), Bernoulli, multinomial, Poisson, exponential, and gamma distributions. [1] Generalized linear models (GLMs), which model the natural parameter as a linear function of features, therefore have convex training objectives. The variational lower bound used in [variational inference](/wiki/variational_inference) and [variational autoencoders](/wiki/vae) involves a [KL divergence](/wiki/kl_divergence) term that is convex in the variational parameters under common parameterizations such as Gaussian families. [16]

## relationship to convex optimization

[Convex optimization](/wiki/convex_optimization) is the study of minimizing convex functions over [convex sets](/wiki/convex_set). The key guarantee is that any local minimum of a convex function over a convex set is a global minimum. [1] This property makes convex optimization problems tractable: efficient polynomial-time algorithms exist for solving them to arbitrary precision. [13]

Many classical machine learning algorithms are formulated as convex optimization problems: [4]

- [Linear regression](/wiki/linear_regression) (least squares)
- [Logistic regression](/wiki/logistic_regression)
- [Support vector machines](/wiki/support_vector_machine_svm)
- [Ridge regression](/wiki/ridge_regression) and [Lasso](/wiki/lasso) regression
- Principal component analysis (via semidefinite relaxation)

When a machine learning problem can be cast as a convex optimization problem, practitioners benefit from strong theoretical guarantees (global optimality, convergence bounds) and mature solver software. When the problem is non-convex (as in deep learning), researchers often apply techniques originally developed for convex settings, such as gradient descent and regularization, even though the theoretical guarantees no longer strictly apply.

### structured convex problem classes

Within convex optimization, several special structures unlock faster algorithms. [1]

| Class | Form | Examples in ML |
|---|---|---|
| Linear programming (LP) | Linear objective and constraints | LP relaxations of integer programs; quantile regression |
| Quadratic programming (QP) | Convex quadratic objective, linear constraints | SVM, MPC, portfolio optimization |
| Quadratically constrained QP (QCQP) | Quadratic in objective and constraints | Sphere-constrained least squares |
| Second-order cone programming (SOCP) | Linear objective, second-order cone constraints | Robust regression, minimum-volume ellipsoid |
| Semidefinite programming (SDP) | Linear objective, PSD constraints | Max-cut relaxations, kernel learning, matrix completion |
| Geometric programming | Posynomial constraints (after log change of variables) | Circuit design, statistics |
| Conic programming | Linear objective, intersection with a closed convex cone | Unifies LP, SOCP, SDP |

### optimization algorithms for convex problems

A range of algorithms exploit convexity to achieve strong convergence guarantees. [7]

| Algorithm | Best for | Convergence rate |
|---|---|---|
| [Gradient descent](/wiki/gradient_descent) | Smooth convex / strongly convex | $$O(1/k)$$ / linear |
| [Nesterov accelerated gradient](/wiki/nesterov_accelerated_gradient) | Smooth convex / strongly convex | $$O(1/k^2)$$ / linear with $$\sqrt{\kappa}$$ |
| Subgradient method | Nonsmooth convex | $$O(1/\sqrt{k})$$ |
| Proximal gradient (ISTA / FISTA) | Smooth + nonsmooth (regularized) | $$O(1/k)$$ / $$O(1/k^2)$$ |
| Mirror descent | Constrained convex on simplex or other geometries | $$O(1/\sqrt{k})$$ |
| ADMM | Splitting structured convex problems | $$O(1/k)$$ under typical conditions |
| Coordinate descent | Separable convex | Linear in many regimes |
| Newton's method | Strongly convex, twice differentiable | Quadratic locally |
| Interior-point methods | Conic programs (LP, SOCP, SDP) | Polynomial in problem size and $$1/\epsilon$$ |

## historical notes

The earliest systematic study of convex functions of a real variable is due to **Otto Holder** (in connection with his eponymous inequality) and **Johan Jensen**, who in 1906 published the inequality that bears his name and introduced the modern terminology of convex functions. [6][22] The general theory in $$\mathbb{R}^n$$ was developed in the mid-20th century by researchers including **Werner Fenchel**, **Jean-Jacques Moreau**, and **R. Tyrrell Rockafellar**, whose 1970 monograph *Convex Analysis* remains a standard reference. [2] The Fenchel-Moreau theorem, the calculus of subdifferentials, and the perspective transform all originated in this period.

Applications of convex analysis to optimization were given a major push by **Naum Shor**, **Boris Polyak**, and **Yurii Nesterov** in the Soviet school during the 1960s through 1980s. Nesterov and **Arkadi Nemirovski** developed the modern theory of polynomial-time interior-point methods for self-concordant convex functions in the late 1980s and 1990s, which extended the celebrated polynomial-time guarantees of [linear programming](/wiki/linear_programming) (Khachiyan 1979, Karmarkar 1984) to the much broader class of conic convex programs. [13] **Stephen Boyd** and **Lieven Vandenberghe** then translated the theory into a practitioner-oriented framework, culminating in the textbook *Convex Optimization* (2004) and in the disciplined convex programming methodology embodied in CVX, CVXPY, and Convex.jl. [1][14] Their work made convex optimization a routine tool across engineering, statistics, and machine learning. It was in this context that Rockafellar emphasized that "the great watershed in optimization isn't between linearity and nonlinearity, but convexity and nonconvexity." [21]

## summary of convexity hierarchy

The following table summarizes the relationships between different levels of convexity. [1][3]

| Property | Guarantees Unique Minimum? | Convergence Rate (GD) | Second-Order Condition |
|---|---|---|---|
| Convex | No (may have multiple or none) | $$O(1/k)$$ | $$\nabla^2 f \succeq 0$$ |
| Strictly convex | Yes (at most one) | $$O(1/k)$$ | $$\nabla^2 f \succ 0$$ |
| Strongly convex ($$\mu > 0$$) | Yes (exactly one) | $$O(\log(1/\epsilon))$$ | $$\nabla^2 f \succeq \mu \mathbf{I}$$ |

## see also

- [Convex set](/wiki/convex_set)
- [Convex optimization](/wiki/convex_optimization)
- [Gradient descent](/wiki/gradient_descent)
- [Jensen inequality](/wiki/jensen_inequality)
- [Loss function](/wiki/loss_function)
- [Logistic regression](/wiki/logistic_regression)
- [Support vector machine (SVM)](/wiki/support_vector_machine_svm)
- [Linear regression](/wiki/linear_regression)
- [L1 regularization](/wiki/l1_regularization)
- [L2 regularization](/wiki/l2_regularization)
- [Lasso](/wiki/lasso)
- [Ridge regression](/wiki/ridge_regression)
- [Stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd)
- [Nesterov accelerated gradient](/wiki/nesterov_accelerated_gradient)
- [KL divergence](/wiki/kl_divergence)
- [Softmax](/wiki/softmax)
- [Variational inference](/wiki/variational_inference)

## references

1. Boyd, S. and Vandenberghe, L. (2004). *Convex Optimization*. Cambridge University Press. Available at https://web.stanford.edu/~boyd/cvxbook/
2. Rockafellar, R. T. (1970). *Convex Analysis*. Princeton University Press.
3. Nesterov, Y. (2004). *Introductory Lectures on Convex Optimization: A Basic Course*. Springer.
4. Shalev-Shwartz, S. and Ben-David, S. (2014). *Understanding Machine Learning: From Theory to Algorithms*. Cambridge University Press.
5. "Convex function." Wikipedia. https://en.wikipedia.org/wiki/Convex_function
6. "Jensen's inequality." Wikipedia. https://en.wikipedia.org/wiki/Jensen%27s_inequality
7. Bubeck, S. (2015). "Convex Optimization: Algorithms and Complexity." *Foundations and Trends in Machine Learning*, 8(3-4), 231-357.
8. Zhou, X. "Strong Convexity." https://xingyuzhou.org/blog/notes/strong-convexity
9. Boyd, S. and Vandenberghe, L. "Convex Functions." Lecture slides, Stanford EE364a. https://see.stanford.edu/materials/lsocoee364a/03ConvexFunctions.pdf
10. "LogSumExp." Wikipedia. https://en.wikipedia.org/wiki/LogSumExp
11. Bertsekas, D. P. (1999). *Nonlinear Programming*, 2nd ed. Athena Scientific.
12. Beck, A. (2017). *First-Order Methods in Optimization*. SIAM.
13. Nesterov, Y. and Nemirovskii, A. (1994). *Interior-Point Polynomial Algorithms in Convex Programming*. SIAM.
14. Diamond, S. and Boyd, S. (2016). "CVXPY: A Python-Embedded Modeling Language for Convex Optimization." *Journal of Machine Learning Research*, 17(83), 1-5.
15. Banerjee, A., Merugu, S., Dhillon, I., and Ghosh, J. (2005). "Clustering with Bregman Divergences." *Journal of Machine Learning Research*, 6, 1705-1749.
16. Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*. MIT Press. Available at https://www.deeplearningbook.org
17. Hiriart-Urruty, J.-B. and Lemarechal, C. (2001). *Fundamentals of Convex Analysis*. Springer.
18. "Subderivative." Wikipedia. https://en.wikipedia.org/wiki/Subderivative
19. "Convex conjugate." Wikipedia. https://en.wikipedia.org/wiki/Convex_conjugate
20. "Bregman divergence." Wikipedia. https://en.wikipedia.org/wiki/Bregman_divergence
21. Rockafellar, R. T. (1993). "Lagrange Multipliers and Optimality." *SIAM Review*, 35(2), 183-238.
22. Jensen, J. L. W. V. (1906). "Sur les fonctions convexes et les inegalites entre les valeurs moyennes." *Acta Mathematica*, 30(1), 175-193.
23. Holder, O. (1889). "Uber einen Mittelwerthsatz." *Nachrichten von der Konigl. Gesellschaft der Wissenschaften und der Georg-Augusts-Universitat zu Gottingen*, 38-47.