A convex function is a real-valued mathematical function whose graph curves upward, forming a bowl or cup shape. Formally, a function is convex if the line segment connecting any two points on its graph lies on or above the graph itself. Convex functions are foundational to convex optimization and play a central role in machine learning, where they guarantee that optimization algorithms such as gradient descent can find global minima efficiently.
Imagine you have a bowl sitting on a table. If you drop a marble into the bowl, it will always roll down to the very bottom, no matter where you drop it. A convex function is like the inside surface of that bowl. There is only one lowest point, and every path leads down to it.
In machine learning, we use bowl-shaped (convex) functions to measure how wrong our predictions are. Because the function is shaped like a bowl, our training algorithm can always find the single best answer by rolling downhill. If the function were bumpy with lots of dips (non-convex), the marble might get stuck in a small dip that is not the true bottom, and our model would not learn the best possible answer.
Let f : R^n -> R be a function whose domain is a convex set. The function f is convex if, for all x, y in its domain and for every t in [0, 1]:
f(tx + (1 - t)y) <= t f(x) + (1 - t) f(y)
Geometrically, this inequality states that the chord (line segment) between any two points (x, f(x)) and (y, f(y)) on the graph of f lies on or above the graph. Equivalently, the epigraph of f (the set of points lying on or above the graph) is a convex set.
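The chord condition can be checked numerically. The sketch below (a hypothetical helper, not part of any standard library) samples random point pairs and interpolation weights and tests the defining inequality for f(x) = x^2, which is convex, and f(x) = x^3, which is not convex on all of R.

```python
import random

random.seed(0)  # deterministic sampling for reproducibility

def convexity_holds(f, trials=10_000, lo=-10.0, hi=10.0, tol=1e-9):
    """Check f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) on random samples."""
    for _ in range(trials):
        x, y = random.uniform(lo, hi), random.uniform(lo, hi)
        t = random.random()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + tol:
            return False
    return True

print(convexity_holds(lambda x: x * x))   # True: x^2 is convex
print(convexity_holds(lambda x: x ** 3))  # False: x^3 is concave for x < 0
```

A sampling check like this can only refute convexity, never prove it, but it is a useful sanity test.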
A function f is concave if -f is convex. Affine functions f(x) = a^T x + b (which include linear functions) are the only functions that are both convex and concave.
A function f is strictly convex if the inequality holds with strict inequality for all distinct x != y and for all t in (0, 1):
f(tx + (1 - t)y) < t f(x) + (1 - t) f(y)
Strict convexity means the chord between two distinct points lies strictly above the graph (never touching it except at the endpoints). A strictly convex function has at most one global minimum. If a minimum exists, it is unique. The function f(x) = e^x is a classic example: it is strictly convex on all of R, but it does not attain a minimum.
Strong convexity is a more powerful condition that quantifies how "curved" a function is. A differentiable function f is mu-strongly convex (for some mu > 0) if for all x, y in its domain:
f(y) >= f(x) + nabla f(x)^T (y - x) + (mu / 2) ||y - x||^2
Intuitively, a mu-strongly convex function grows at least as fast as a quadratic with curvature parameter mu. This provides a quadratic lower bound on the function, guaranteeing a unique global minimum and enabling stronger convergence guarantees for optimization algorithms.
For a differentiable function f, the following characterizations of mu-strong convexity are equivalent:
| Characterization | Statement |
|---|---|
| Quadratic lower bound | f(y) >= f(x) + nabla f(x)^T (y - x) + (mu/2) \|\|y - x\|\|^2 |
| Subtraction test | g(x) = f(x) - (mu/2) \|\|x\|\|^2 is convex |
| Monotone gradient | (nabla f(x) - nabla f(y))^T (x - y) >= mu \|\|x - y\|\|^2 |
| Strengthened Jensen | f(tx + (1-t)y) <= t f(x) + (1-t) f(y) - t(1-t)(mu/2) \|\|x - y\|\|^2 |
Strong convexity implies strict convexity, which in turn implies convexity, but the reverse implications do not hold.
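The quadratic-lower-bound characterization can be tested numerically for f(x) = x^2, whose gradient is 2x. The helper below is a hypothetical sketch: the bound should hold with mu = 2 but fail with mu = 3, since the curvature of x^2 is exactly 2.

```python
import random

random.seed(0)

def quadratic_lower_bound_holds(f, grad, mu, trials=10_000, tol=1e-9):
    """Test f(y) >= f(x) + f'(x)*(y - x) + (mu/2)*(y - x)^2 on random samples."""
    for _ in range(trials):
        x, y = random.uniform(-10, 10), random.uniform(-10, 10)
        if f(y) < f(x) + grad(x) * (y - x) + (mu / 2) * (y - x) ** 2 - tol:
            return False
    return True

f, grad = lambda x: x * x, lambda x: 2 * x
print(quadratic_lower_bound_holds(f, grad, mu=2))  # True: x^2 is 2-strongly convex
print(quadratic_lower_bound_holds(f, grad, mu=3))  # False: the curvature is only 2
```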
Strong convexity is especially important in optimization because it guarantees linear convergence (exponential decrease in error) for gradient descent. For a mu-strongly convex function with L-Lipschitz continuous gradients, gradient descent achieves an error of at most epsilon after O((L/mu) log(1/epsilon)) iterations. Without strong convexity, the convergence rate for smooth convex functions is only O(1/k), which is substantially slower.
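To make the linear-convergence claim concrete, here is a minimal sketch on the 2-strongly convex f(x) = x^2 (function names are illustrative). Its gradient is 2x, so one gradient step with step size s maps x to (1 - 2s)x: the distance to the minimizer shrinks by the same fixed factor every iteration, which is exactly linear (geometric) convergence.

```python
def gradient_descent(grad, x0, step, iters):
    x, errors = x0, []
    for _ in range(iters):
        x -= step * grad(x)
        errors.append(abs(x))  # distance to the minimizer x* = 0
    return errors

errs = gradient_descent(lambda x: 2 * x, x0=1.0, step=0.1, iters=20)
ratios = [b / a for a, b in zip(errs, errs[1:])]
print(ratios[0])        # each step multiplies the error by |1 - 2*step| = 0.8
print(errs[-1] < 0.02)  # True: after 20 steps the error is 0.8^20 ~ 0.012
```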
If f is differentiable, it is convex if and only if its domain is a convex set and for all x, y in the domain:
f(y) >= f(x) + nabla f(x)^T (y - x)
This condition has a clear geometric meaning: the tangent hyperplane at any point lies entirely on or below the graph of the function. In one dimension, this says that every tangent line is a global underestimator of the function. This first-order condition is the basis for many optimization algorithms, because it implies that the gradient provides a reliable direction toward the minimum.
If f is twice differentiable, it is convex if and only if its Hessian matrix (the matrix of second partial derivatives) is positive semidefinite everywhere on its domain:
nabla^2 f(x) >= 0 for all x in dom(f)
For a function of a single variable, this reduces to the familiar condition f''(x) >= 0 for all x. A positive definite Hessian (all eigenvalues strictly positive) implies strict convexity, but the converse does not hold: f(x) = x^4 is strictly convex even though f''(0) = 0. For mu-strongly convex functions, the Hessian satisfies nabla^2 f(x) >= mu * I for all x, meaning every eigenvalue of the Hessian is at least mu.
| Convexity Type | Second-Order Condition |
|---|---|
| Convex | nabla^2 f(x) >= 0 (positive semidefinite) |
| Strictly convex | nabla^2 f(x) > 0 (positive definite; sufficient but not necessary) |
| mu-strongly convex | nabla^2 f(x) >= mu I (eigenvalues >= mu) |
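A one-dimensional version of the second-order test can be sketched with a central finite difference (a rough numeric stand-in for checking Hessian definiteness; the helper name is made up). Note that the test uses >= 0 rather than > 0: f(x) = x^4 is strictly convex even though its second derivative vanishes at 0.

```python
def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

grid = [i / 10 for i in range(-50, 51)]
# x^4 has f''(x) = 12 x^2 >= 0 everywhere: convex.
print(all(second_derivative(lambda x: x ** 4, x) >= -1e-6 for x in grid))  # True
# x^3 has f''(x) = 6x < 0 for x < 0: not convex on R.
print(all(second_derivative(lambda x: x ** 3, x) >= -1e-6 for x in grid))  # False
```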
The following table lists commonly encountered convex functions and their key properties.
| Function | Formula | Strict? | Strongly Convex? | Notes |
|---|---|---|---|---|
| Affine / Linear | f(x) = a^T x + b | No | No | Both convex and concave; the only functions with this property |
| Quadratic | f(x) = x^2 | Yes | Yes (mu = 2) | The prototypical strongly convex function |
| Positive-definite quadratic | f(x) = x^T A x for A > 0 | Yes | Yes (mu = 2 lambda_min(A)) | Strongly convex; the Hessian is 2A, so mu equals twice the smallest eigenvalue of A |
| Exponential | f(x) = e^(ax) for a != 0 | Yes | No | Strictly convex but not strongly convex on all of R |
| Negative logarithm | f(x) = -log(x), x > 0 | Yes | No | Strictly convex on its domain; used as a barrier function in interior-point methods |
| Norms | f(x) = \|\|x\|\|_p, p >= 1 | No | No | Convex by the triangle inequality; not strictly convex in general |
| Log-sum-exp | f(x) = log(sum e^(x_i)) | No | No | Smooth approximation to the max function; its gradient is the softmax function |
| Absolute value | f(x) = \|x\| | No | No | Convex but not differentiable at x = 0 |
| Powers | f(x) = x^p, x >= 0, p >= 1 | Yes (if p > 1) | Not in general | Convex on the nonnegative reals for p >= 1 |
The log-sum-exp (LSE) function, defined as f(x) = log(sum_{i=1}^{n} e^(x_i)), deserves special attention because of its wide use in machine learning. It is a smooth, convex approximation to the maximum function, satisfying:
max(x_1, ..., x_n) <= log(sum e^(x_i)) <= max(x_1, ..., x_n) + log(n)
The gradient of the LSE function is the softmax function, and its Hessian can be shown to be positive semidefinite via the Cauchy-Schwarz inequality. The LSE function appears in the log-partition function of exponential family distributions, in the cross-entropy loss function for classification, and in smooth relaxations used in reinforcement learning.
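A practical detail worth showing: computing LSE naively overflows for large inputs, so implementations shift by the maximum first. The sketch below implements the stable form, its softmax gradient, and checks the sandwich bound stated above.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x_i))): shift by the max first."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def softmax(xs):
    """Gradient of logsumexp at xs."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

xs = [1.0, 2.0, 3.0]
lse = logsumexp(xs)
# Sandwich bound: max(x) <= LSE(x) <= max(x) + log(n)
print(max(xs) <= lse <= max(xs) + math.log(len(xs)))  # True
print(abs(sum(softmax(xs)) - 1.0) < 1e-12)            # True: softmax sums to 1
print(logsumexp([1000.0, 1000.0]))                    # stable: 1000 + log 2
```

A naive `math.log(sum(math.exp(x) for x in xs))` would raise an overflow error on the last call.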
Not all functions encountered in machine learning are convex. Important non-convex cases include neural network training objectives (because of nonlinear activations and products of weight matrices), matrix factorization, and clustering objectives such as k-means.
When the loss function is non-convex, gradient descent may converge to local minima or saddle points rather than the global minimum. In practice, techniques such as stochastic gradient descent with momentum, learning rate schedules, and random initialization help navigate non-convex landscapes.
Convexity is preserved under several useful operations, making it possible to build complex convex functions from simpler ones. These composition rules are central to modeling in convex optimization.
| Operation | Rule | Example |
|---|---|---|
| Nonnegative weighted sum | If f_1, ..., f_m are convex and w_i >= 0, then sum w_i f_i is convex | Weighted sum of loss terms in multi-task learning |
| Affine composition | If f is convex, then f(Ax + b) is convex | Applying a linear transformation to the input |
| Pointwise maximum | If f_1, ..., f_m are convex, then max(f_1, ..., f_m) is convex | Piecewise-linear functions; hinge loss |
| Pointwise supremum | If f(x, y) is convex in x for each y, then sup_y f(x, y) is convex in x | Support function of a set |
| Composition (nondecreasing) | If g is convex and nondecreasing and f is convex, then g(f(x)) is convex | e^(f(x)) when f is convex |
| Composition (nonincreasing) | If g is convex and nonincreasing and f is concave, then g(f(x)) is convex | -log of a concave function |
| Minimization over convex set | If f(x, y) is convex in (x, y) and C is convex, then g(x) = inf_{y in C} f(x, y) is convex | Projection onto a convex set |
| Perspective | If f is convex, the perspective g(x, t) = t f(x/t) is convex for t > 0 | KL divergence as the perspective of -log |
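The pointwise-maximum rule explains, for example, why the hinge loss is convex: max(0, 1 - z) is the pointwise maximum of two affine (hence convex) functions. The hypothetical helper below spot-checks the defining inequality on random samples.

```python
import random

random.seed(0)

def convexity_violated(f, trials=10_000, tol=1e-9):
    """Return True if a random sample violates the convexity inequality."""
    for _ in range(trials):
        x, y, t = random.uniform(-5, 5), random.uniform(-5, 5), random.random()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + tol:
            return True
    return False

hinge = lambda z: max(0.0, 1.0 - z)  # pointwise max of 0 and 1 - z
print(convexity_violated(hinge))     # False: the pointwise maximum is convex
```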
One of the most important results involving convex functions is Jensen's inequality. For a convex function f and a random variable X:
f(E[X]) <= E[f(X)]
In words, applying a convex function to the expected value of a random variable gives a result no larger than the expected value of the convex function applied to the random variable.
For concave functions, the inequality reverses: f(E[X]) >= E[f(X)].
Jensen's inequality has wide-ranging applications. In information theory, it is used to prove that the Kullback-Leibler divergence is always nonnegative. In statistics, it establishes bounds on expectations. In machine learning, it underlies the derivation of the evidence lower bound (ELBO) used in variational inference and variational autoencoders.
The finite (discrete) form of Jensen's inequality states that for nonnegative weights w_1, ..., w_n summing to 1 and points x_1, ..., x_n:
f(sum w_i x_i) <= sum w_i f(x_i)
This follows directly from the definition of convexity by induction.
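The discrete form is easy to verify numerically. This sketch draws random nonnegative weights summing to 1 and random points, and checks the inequality for the convex function f(x) = e^x.

```python
import math
import random

random.seed(0)

# Random nonnegative weights normalized to sum to 1, and random points.
raw = [random.random() for _ in range(5)]
w = [r / sum(raw) for r in raw]
pts = [random.uniform(-2, 2) for _ in range(5)]

lhs = math.exp(sum(wi * xi for wi, xi in zip(w, pts)))  # f(sum w_i x_i)
rhs = sum(wi * math.exp(xi) for wi, xi in zip(w, pts))  # sum w_i f(x_i)
print(lhs <= rhs)  # True: f(sum w_i x_i) <= sum w_i f(x_i)
```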
Convex functions are central to machine learning because they guarantee that optimization problems have well-behaved solutions. When the loss function is convex with respect to the model parameters, every local minimum is also a global minimum, and gradient descent is guaranteed to converge to an optimal solution.
The following table summarizes common convex loss functions used in machine learning.
| Loss Function | Formula | Used In | Convexity Notes |
|---|---|---|---|
| Mean Squared Error (MSE) | (1/n) sum (y_i - y_hat_i)^2 | Linear regression | Convex in the model parameters for linear models; strongly convex when the design matrix has full column rank |
| Cross-entropy (log loss) | -sum y_i log(p_i) | Logistic regression, classification | Convex in the linear model parameters; non-convex in neural network weights |
| Hinge loss | max(0, 1 - y_i f(x_i)) | Support vector machines | Convex (piecewise linear); not differentiable at the hinge point |
| Huber loss | Quadratic for small errors, linear for large | Robust regression | Convex and differentiable everywhere |
| L1 loss (MAE) | (1/n) sum \|y_i - y_hat_i\| | Robust regression | Convex but not differentiable at zero |
For linear regression with MSE loss, the optimization problem is a convex quadratic, which admits a closed-form solution (the normal equations) or can be solved efficiently with gradient descent. For logistic regression with cross-entropy loss, the objective is convex in the weight parameters, guaranteeing convergence to the global optimum.
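A minimal one-parameter sketch of this fact (the data values are made up for illustration): fitting y ~ w * x by MSE, the closed-form minimizer is w* = sum(x_i y_i) / sum(x_i^2), and gradient descent converges to the same point because the objective is a convex quadratic.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]

# Closed form (the 1-D normal equation): w* = sum(x*y) / sum(x^2)
w_closed = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Gradient descent on the MSE objective (1/n) * sum((w*x - y)^2)
w, lr, n = 0.0, 0.01, len(xs)
for _ in range(2000):
    grad = (2 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
    w -= lr * grad

print(abs(w - w_closed) < 1e-8)  # True: both reach the global minimum
```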
However, when these same loss functions are composed with a multi-layer neural network, the overall objective becomes non-convex because of the nonlinear activation functions and the product structure of the weight matrices.
Regularization terms added to loss functions are also typically convex: the L2 penalty ||w||^2 is strongly convex, and the L1 penalty (the sum of the absolute values of the weights) is convex but not differentiable at zero.
Adding L2 regularization to a convex loss function is a practical technique for converting a merely convex problem into a strongly convex one, enabling faster and more stable optimization.
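As a sketch of this effect (helper name is illustrative): the single-sample logistic loss log(1 + e^(-w)) is convex, but its curvature decays to zero for large |w|, so it is not strongly convex. Adding an L2 term lam * w^2 lifts the second derivative to at least 2 * lam everywhere.

```python
import math

def second_derivative(f, w, h=1e-4):
    """Central finite-difference approximation of f''(w)."""
    return (f(w + h) - 2 * f(w) + f(w - h)) / (h * h)

loss = lambda w: math.log(1 + math.exp(-w))   # convex, not strongly convex
lam = 0.1
regularized = lambda w: loss(w) + lam * w * w  # now lam-strongly convex (times 2)

print(second_derivative(loss, 10.0))         # ~4.5e-5: curvature vanishes
print(second_derivative(regularized, 10.0))  # ~0.2: at least 2 * lam everywhere
```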
Convex optimization is the study of minimizing convex functions over convex sets. The key guarantee is that any local minimum of a convex function over a convex set is a global minimum. This property makes convex optimization problems tractable: efficient polynomial-time algorithms exist for solving them to arbitrary precision.
Many classical machine learning algorithms, including linear regression, logistic regression, and support vector machines, are formulated as convex optimization problems.
When a machine learning problem can be cast as a convex optimization problem, practitioners benefit from strong theoretical guarantees (global optimality, convergence bounds) and mature solver software. When the problem is non-convex (as in deep learning), researchers often apply techniques originally developed for convex settings, such as gradient descent and regularization, even though the theoretical guarantees no longer strictly apply.
The following table summarizes the relationships between different levels of convexity.
| Property | Guarantees Unique Minimum? | GD Iterations to Reach Error epsilon | Second-Order Condition |
|---|---|---|---|
| Convex | No (may have multiple or none) | O(1/epsilon) | nabla^2 f >= 0 |
| Strictly convex | Yes (at most one) | O(1/epsilon) | nabla^2 f > 0 |
| Strongly convex (mu > 0) | Yes (exactly one) | O(log(1/epsilon)) | nabla^2 f >= mu I |