The loss surface (also called the error surface or objective function surface) is the geometric representation of a loss function as a function of the model's parameters. For a neural network with n trainable parameters, the loss surface is the graph of the loss function: an n-dimensional hypersurface embedded in an (n + 1)-dimensional space, where n axes correspond to weight values and the remaining axis represents the loss evaluated on a given dataset. The shape and structure of this surface determine how easily an optimizer can find parameter configurations that achieve low training error and, more importantly, good generalization to unseen data.
Research into loss surfaces has reshaped the understanding of why deep learning works despite the non-convex nature of the optimization problem. Early fears that deep networks would be riddled with poor local minima have given way to a more nuanced picture involving saddle points, flat regions, connected valleys, and the interplay between surface geometry and generalization performance.
Imagine you are blindfolded and standing on a hilly field. Your goal is to walk downhill until you reach the lowest spot. The shape of the ground under your feet is the "loss surface." When you train a neural network, the computer does something similar: it adjusts little dials (the weights) and checks whether the answer got better or worse (the loss). Flat areas are easy to walk on, but they make it hard to tell which way is down. Steep, narrow valleys can trap you in a low spot that is not the very lowest. Wide, gentle bowls are the best places to end up, because small changes in position do not send you uphill again. The computer uses tricks (optimizers) to feel the slope and decide which direction to step, trying to reach a nice wide low area.
Given a model parameterized by a vector w in R^n and a dataset D = {(x_i, y_i)}, the loss surface is defined by the mapping:
L(w) = (1 / N) * sum_{i=1}^{N} l(f(x_i; w), y_i)
where f(x_i; w) is the model output for input x_i under parameters w, l is a per-sample loss (such as cross-entropy or mean squared error), and N is the number of samples. In practice, stochastic gradient descent and its variants compute the loss over mini-batches rather than the full dataset, so the optimizer actually navigates a stochastic approximation of the true surface.
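To make the distinction between the true surface and its stochastic approximation concrete, the following sketch (PyTorch, with placeholder data and model chosen only for illustration) evaluates the loss once over a full dataset and once over a random mini-batch; the mini-batch value is a noisy estimate of the full-dataset value.

```python
import torch
import torch.nn.functional as F

# Placeholder data and model: 1000 samples, 20 features, 3 classes.
X = torch.randn(1000, 20)
y = torch.randint(0, 3, (1000,))
model = torch.nn.Sequential(
    torch.nn.Linear(20, 32), torch.nn.ReLU(), torch.nn.Linear(32, 3))

def full_loss(model, X, y):
    """L(w): average per-sample loss over the entire dataset."""
    with torch.no_grad():
        return F.cross_entropy(model(X), y).item()

def minibatch_loss(model, X, y, batch_size=64):
    """Noisy estimate of L(w) from a random mini-batch."""
    idx = torch.randperm(X.shape[0])[:batch_size]
    with torch.no_grad():
        return F.cross_entropy(model(X[idx]), y[idx]).item()

print(full_loss(model, X, y))       # height of the true surface at w
print(minibatch_loss(model, X, y))  # height of the stochastic approximation
```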
The properties of the loss surface depend on three factors: the choice of loss function, the model architecture, and the data distribution. Even when the per-sample loss is convex (as with cross-entropy applied to a single logistic unit), composing it with a multi-layer nonlinear network produces a highly non-convex surface.
A global minimum is a point w* where L(w*) is less than or equal to L(w) for all w in the parameter space. A local minimum is a point where L is smaller than in all nearby points but not necessarily across the entire space. In low-dimensional convex optimization, every local minimum is also global. Neural network loss surfaces are non-convex and can, in principle, have many local minima.
However, a series of theoretical results starting with Choromanska et al. (2015) and Kawaguchi (2016) showed that, under certain conditions, the local minima of deep networks tend to have loss values close to the global minimum. Kawaguchi proved that for deep linear networks (with any depth and width), every local minimum is a global minimum and every other critical point is a saddle point [7]. Choromanska et al. drew an analogy between neural network loss surfaces and the Hamiltonians of spherical spin-glass models from statistical physics, arguing that for large networks the poor local minima become exponentially rare [1].
A saddle point is a critical point (where the gradient is zero) that is neither a minimum nor a maximum. At a saddle point, the surface curves upward along some parameter directions and downward along others. Dauphin et al. (2014) provided a theoretical argument and empirical evidence that saddle points, not local minima, are the primary obstacle to optimization in high-dimensional spaces [2]. The intuition is straightforward: for a critical point to be a local minimum in a space of n dimensions, the Hessian must have all n eigenvalues positive. As n grows into the millions (typical for modern networks), the probability that every eigenvalue is positive drops to near zero. Almost every critical point is therefore a saddle point.
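As a toy illustration of the eigenvalue criterion (not drawn from any of the cited papers), consider the two-parameter function L(w1, w2) = w1^2 - w2^2, whose only critical point is the origin. Inspecting the signs of the Hessian eigenvalues classifies it:

```python
import numpy as np

# Hessian of L(w1, w2) = w1^2 - w2^2 at the origin: diag(2, -2).
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
eigvals = np.linalg.eigvalsh(H)

if np.all(eigvals > 0):
    kind = "local minimum"     # curvature upward in every direction
elif np.all(eigvals < 0):
    kind = "local maximum"     # curvature downward in every direction
else:
    kind = "saddle point"      # mixed curvature

print(eigvals, kind)           # [-2.  2.] saddle point
```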
Saddle points surrounded by flat plateaus can slow training because the gradient magnitude is small and provides little signal for direction. Algorithms such as momentum-based SGD, Adam, and the saddle-free Newton method (Dauphin et al., 2014) help the optimizer escape these regions by accumulating velocity or using second-order curvature information.
The geometry of a minimum, specifically how quickly the loss increases as parameters move away from it, has been linked to generalization. This idea dates back to Hochreiter and Schmidhuber (1997), who defined a "flat minimum" as a large connected region in weight space where the loss remains approximately constant [3]. Using a minimum description length argument, they proposed that flat minima correspond to simpler models with lower expected overfitting.
Keskar et al. (2017) revived this line of inquiry by demonstrating that large-batch training tends to converge to sharp minima (narrow valleys), while small-batch training finds flat minima (wide basins), and that this difference correlates with the generalization gap [4]. The proposed explanation is that the noise inherent in small-batch SGD acts as an implicit regularizer, pushing the optimizer away from sharp regions and toward flatter ones.
The flat-versus-sharp narrative is not without controversy. Dinh et al. (2017) showed that for networks with ReLU activations, the rescaling symmetry of the parameters allows one to construct reparameterizations of a given minimum that are arbitrarily sharp yet represent the same function and therefore generalize identically [5]. This result demonstrated that naive measures of sharpness (such as the largest eigenvalue of the Hessian or the trace) are not invariant to reparameterization, and more careful definitions of sharpness are needed.
Large expanses of nearly zero gradient, often called plateaus, appear in regions where many Hessian eigenvalues are close to zero. Empirical studies of the Hessian spectrum (Sagun et al., 2017; Ghorbani et al., 2019) have found that the eigenvalue distribution of trained networks splits into two distinct parts: a bulk concentrated near zero (representing the vast majority of eigenvalues) and a small number of outlier eigenvalues that correspond to high-curvature directions [6][8]. The number of outliers roughly equals the number of output classes minus one. This structure implies that most directions in parameter space are nearly flat, with only a few directions carrying meaningful curvature.
The Hessian matrix H of the loss function, the matrix of all second-order partial derivatives with respect to the parameters, provides local curvature information at any point on the loss surface. Its eigenvalues indicate how the loss changes along each eigenvector direction: positive eigenvalues indicate upward curvature (bowl-like), negative eigenvalues indicate downward curvature (hill-like), and zero eigenvalues indicate flat directions.
| Hessian property | Interpretation | Implication |
|---|---|---|
| All eigenvalues positive | Local minimum | Optimizer has reached a basin |
| All eigenvalues negative | Local maximum | Unstable; optimizer will move away |
| Mixed positive and negative | Saddle point | Common in high dimensions; slows training |
| Many eigenvalues near zero | Flat region or plateau | Most directions uninformative; a few dominate |
| Large spectral gap (outliers far from bulk) | High curvature in a few directions | Risk of sharp minimum; may hurt generalization |
Computing the full Hessian is infeasible for modern networks with millions of parameters (the matrix has n^2 entries). Practical methods include:

- Hessian-vector products (Pearlmutter's trick), which compute Hv at roughly the cost of two backward passes without ever materializing H.
- Power iteration or the Lanczos algorithm applied to Hessian-vector products, which recover the largest-magnitude eigenvalues and their eigenvectors.
- Stochastic Lanczos quadrature, which estimates the full eigenvalue density (the approach used by Ghorbani et al., 2019) [8].
- Diagonal or block-structured approximations, which trade accuracy for tractability.
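A minimal sketch of the first two approaches, assuming PyTorch autograd and a generic loss/parameter pair (the model and batch below are placeholders): a Hessian-vector product via double backpropagation, and power iteration on top of it to estimate the largest-magnitude eigenvalue.

```python
import torch

def hessian_vector_product(loss, params, v):
    """Compute Hv via double backpropagation (Pearlmutter's trick);
    the full Hessian is never materialized."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv])

def top_eigenvalue(loss, params, iters=50):
    """Estimate the largest-magnitude Hessian eigenvalue by power
    iteration on Hessian-vector products."""
    n = sum(p.numel() for p in params)
    v = torch.randn(n)
    v = v / v.norm()
    eig = 0.0
    for _ in range(iters):
        hv = hessian_vector_product(loss, params, v)
        eig = torch.dot(v, hv).item()   # Rayleigh quotient estimate
        v = hv / (hv.norm() + 1e-12)
    return eig

# Usage with a placeholder model and batch:
model = torch.nn.Linear(10, 3)
X, y = torch.randn(128, 10), torch.randint(0, 3, (128,))
loss = torch.nn.functional.cross_entropy(model(X), y)
print(top_eigenvalue(loss, list(model.parameters())))
```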
The design of a neural network has a direct impact on the topology and smoothness of its loss surface.
Li et al. (2018) used filter-normalized 2D loss surface visualizations to show that networks without skip connections (such as plain deep networks) produce highly chaotic, non-convex surfaces, while architectures with skip connections (such as ResNet) exhibit dramatically smoother surfaces [9]. Skip connections create shortcut paths for the gradient, preventing the gradient from vanishing through many layers and effectively "convexifying" the loss surface. This smoothing effect partly explains why residual networks can be trained to much greater depths than plain networks.
Wider networks (more neurons per layer) also produce smoother loss surfaces. Li et al. observed that increasing the number of filters in convolutional layers reduces the prevalence of chaotic, non-convex regions. Overparameterization, where the number of parameters far exceeds the number of training samples, has been shown to eliminate strict bad local minima under certain assumptions. For networks with continuous activation functions and sufficiently many parameters, every local minimum is either global or lies on a flat plateau from which the optimizer can escape [10].
Batch normalization also smooths the loss surface. Santurkar et al. (2018) argued that the primary benefit of batch normalization is not reducing "internal covariate shift" (as originally proposed) but rather making the optimization problem smoother by reducing the Lipschitz constant of the loss function and its gradients. Empirical Hessian analysis shows that networks with batch normalization have far fewer large isolated eigenvalues compared to unnormalized networks.
Increasing network depth without skip connections or normalization rapidly degrades the loss surface, introducing more saddle points, sharper minima, and chaotic regions. This observation aligns with the well-known difficulty of training very deep plain networks and the historical importance of the vanishing gradient problem.
| Architectural choice | Effect on loss surface |
|---|---|
| Skip connections | Smoother, more convex-like geometry; enables training at greater depth |
| Wider layers | Reduces chaotic regions; overparameterization eliminates bad local minima |
| Batch normalization | Lowers Lipschitz constant; fewer large Hessian eigenvalues |
| Greater depth (without skip connections) | More saddle points; sharper minima; harder optimization |
| Dropout | Adds stochasticity; can smooth effective loss surface |
A surprising discovery about neural network loss surfaces is that independently trained minima are typically connected by paths of low loss. Two concurrent 2018 studies established this result.
Garipov et al. (2018) showed that the optima found by independent training runs are connected by simple curves, specifically polygonal chains with only one bend, along which both training and test loss remain nearly constant [11]. They exploited this geometric insight to develop Fast Geometric Ensembling (FGE), an ensembling method that can produce high-quality ensembles in the time required to train a single model.
Draxler et al. (2018) independently demonstrated that continuous paths between minima of modern architectures (tested on CIFAR-10 and CIFAR-100) are essentially flat in both training and test loss, leading them to propose that minima are best understood as points on a single connected manifold of low loss rather than as isolated valleys [12].
A stronger form of connectivity, called linear mode connectivity, asks whether the straight line between two minima in parameter space also maintains low loss. Frankle et al. (2020) found that networks trained from the same initialization (but with different data orderings) are linearly mode-connected, while networks trained from different random initializations generally are not, unless a permutation of the neurons is applied to align them. This is because neural networks have a discrete permutation symmetry: reordering neurons within a hidden layer does not change the function the network computes but does change the location of the minimum in parameter space. Accounting for this symmetry, recent work has shown that independently trained networks can often be linearly connected after appropriate neuron alignment.
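A simple way to probe linear mode connectivity is to evaluate the loss along the straight line between two trained solutions. The sketch below (model and evaluation data are placeholders) does exactly that; for networks with batch normalization, the running statistics would ideally be recomputed at each interpolation point rather than interpolated.

```python
import copy
import torch
import torch.nn.functional as F

def interpolation_losses(model_a, model_b, X, y, steps=11):
    """Loss along the straight line (1 - alpha) * w_a + alpha * w_b.
    A flat, low curve suggests linear mode connectivity; a pronounced
    bump in the middle indicates a barrier between the two basins."""
    probe = copy.deepcopy(model_a)
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        mixed = {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}
        probe.load_state_dict(mixed)
        with torch.no_grad():
            losses.append(F.cross_entropy(probe(X), y).item())
    return losses
```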
The noise in stochastic gradient descent plays a central role in determining where the optimizer settles on the loss surface.
SGD computes gradients over random mini-batches, introducing noise whose variance is inversely proportional to the batch size. This noise has a regularizing effect: it helps the optimizer escape sharp minima (which are sensitive to perturbations) and biases convergence toward flat minima (which are robust to perturbations). Smith and Le (2018) formalized this by modeling SGD as a stochastic differential equation and showing that the noise scale, proportional to the learning rate divided by the batch size, determines the width of the basins the optimizer can stably occupy.
The relationship between batch size and the shape of the loss surface around the converged solution has practical consequences. Empirical studies consistently show that very large batch sizes lead to solutions in sharper regions of the loss surface, producing models with worse test accuracy, even when training loss is comparable to small-batch solutions. This generalization gap can be partly mitigated by scaling the learning rate proportionally to the batch size (the "linear scaling rule") or by using learning rate warmup schedules.
| Training setting | Typical effect on converged solution |
|---|---|
| Small batch size | More noise; converges to flatter minima; better generalization |
| Large batch size | Less noise; converges to sharper minima; worse generalization |
| Higher learning rate | Larger effective noise; favors wider basins |
| Learning rate decay | Reduces noise over time; allows settling into narrower minima |
| Cyclical learning rate | Periodically increases noise; can escape local minima |
Several optimization algorithms have been designed to explicitly take advantage of the structure of the loss surface.
Foret et al. (2021) introduced Sharpness-Aware Minimization (SAM), which seeks parameters that lie in neighborhoods where the loss is uniformly low, rather than parameters that merely minimize the loss at a single point [13]. SAM formulates this as a min-max problem: it first perturbs the parameters in the direction that maximally increases the loss, then takes a gradient step to minimize the loss at that worst-case perturbation. The result is convergence to flatter regions of the loss surface. SAM has shown consistent generalization improvements across image classification benchmarks including CIFAR-10, CIFAR-100, and ImageNet. Adaptive extensions such as ASAM (Kwon et al., 2021) improve SAM by making the perturbation scale-invariant.
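The following is a simplified sketch of one SAM update, not the authors' reference implementation; it assumes `loss_fn` is a closure that runs the forward pass on the current batch and returns the loss.

```python
import torch

def sam_step(model, loss_fn, optimizer, rho=0.05):
    """One simplified Sharpness-Aware Minimization update:
    (1) ascend to the approximate worst point within an L2 ball of radius rho,
    (2) compute the gradient there, (3) apply it from the original weights."""
    optimizer.zero_grad()
    loss_fn(model).backward()                       # gradient at current weights
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters() if p.grad is not None]))
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                perturbations.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)  # ascent step, scaled to rho
            p.add_(e)
            perturbations.append(e)
    optimizer.zero_grad()
    loss_fn(model).backward()                       # gradient at perturbed point
    with torch.no_grad():
        for p, e in zip(model.parameters(), perturbations):
            if e is not None:
                p.sub_(e)                           # restore original weights
    optimizer.step()                                # descend with the SAM gradient
    optimizer.zero_grad()
```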
Izmailov et al. (2018) proposed Stochastic Weight Averaging (SWA), which averages the weights collected at multiple points along the SGD trajectory using a cyclical or high constant learning rate [14]. The averaged solution tends to land in the center of a wide, flat region of the loss surface, whereas standard SGD converges to the boundary of such regions. SWA improves generalization with negligible computational overhead and is available in PyTorch through the torch.optim.swa_utils module.
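A usage sketch with PyTorch's torch.optim.swa_utils; the toy model, data, and schedule below are placeholders rather than recommended settings.

```python
import torch
import torch.nn.functional as F
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

# Placeholder model and data loader for illustration only.
model = torch.nn.Sequential(
    torch.nn.Linear(20, 32), torch.nn.ReLU(), torch.nn.Linear(32, 3))
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(1000, 20),
                                   torch.randint(0, 3, (1000,))),
    batch_size=64, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
swa_model = AveragedModel(model)               # running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # high constant LR while averaging
swa_start = 15                                 # epoch at which averaging begins

for epoch in range(20):
    for X, y in train_loader:
        optimizer.zero_grad()
        F.cross_entropy(model(X), y).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)     # fold current weights into average
        swa_scheduler.step()

update_bn(train_loader, swa_model)  # recompute BatchNorm statistics if present
```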
Chaudhari et al. (2017) proposed Entropy-SGD, which explicitly optimizes a "local entropy" objective that favors wide valleys over narrow ones [15]. The algorithm uses an inner loop of stochastic gradient Langevin dynamics to estimate the gradient of the local entropy, followed by an outer loop that updates the parameters. The local entropy measure assigns higher value to minima surrounded by large flat regions, biasing the optimization toward well-generalizing solutions.
Dauphin et al. (2014) proposed a modification of Newton's method that takes the absolute value of the Hessian eigenvalues, effectively converting saddle-point directions into descent directions. This approach escapes saddle points much faster than first-order methods, though its practical use is limited by the cost of computing Hessian information.
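A toy dense-Hessian version of the idea (feasible only for small problems, mirroring the cost caveat above) replaces each eigenvalue by its absolute value before inverting:

```python
import numpy as np

def saddle_free_newton_step(grad, hessian, damping=1e-3):
    """Newton-like step using |H|: every eigenvalue is replaced by its absolute
    value, so negative-curvature (saddle) directions become descent directions."""
    eigvals, eigvecs = np.linalg.eigh(hessian)
    abs_inv = eigvecs @ np.diag(1.0 / (np.abs(eigvals) + damping)) @ eigvecs.T
    return -abs_inv @ grad

# Toy example: mixed-curvature Hessian diag(2, -2) with a small gradient along
# the negative-curvature direction. Plain Newton would step toward the saddle;
# the |H| step descends away from it instead.
grad = np.array([0.0, 0.1])
H = np.array([[2.0, 0.0], [0.0, -2.0]])
print(saddle_free_newton_step(grad, H))   # approx [0., -0.05]
```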
| Method | Strategy | Primary benefit |
|---|---|---|
| SGD with momentum | Accumulates velocity to move through flat regions and past saddle points | Faster convergence; escapes plateaus |
| SAM | Min-max perturbation to find uniformly low-loss neighborhoods | Converges to flat minima; better generalization |
| SWA | Averages weights along SGD trajectory | Finds center of wide basins; low overhead |
| Entropy-SGD | Optimizes local entropy to favor wide valleys | Biased toward flat, well-generalizing regions |
| Saddle-free Newton | Uses absolute Hessian eigenvalues to convert saddle directions | Rapid escape from saddle points |
| Adam | Adaptive per-parameter learning rates | Handles varying curvature across dimensions |
Because neural network parameter spaces are enormously high-dimensional, direct visualization of the loss surface is impossible. Researchers use dimensionality reduction techniques to project the surface onto one or two dimensions.
The simplest approach picks one or two random direction vectors in parameter space and evaluates the loss along those directions starting from a trained solution. This produces a 1D curve or 2D contour plot. However, random directions are sensitive to the scale of different layers, making comparisons across architectures misleading.
Li et al. (2018) introduced filter normalization, which rescales each random direction vector so that each filter (or neuron) has the same norm as the corresponding filter in the trained network [9]. This normalization removes the confounding effect of parameter scale and allows meaningful visual comparisons between different architectures. Using filter-normalized 2D plots, Li et al. showed striking differences between the loss surfaces of ResNet (smooth, nearly convex) and VGG-like networks without skip connections (chaotic, highly non-convex).
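A minimal sketch of a 1D slice along such a direction, simplified to per-tensor rather than per-filter normalization (an assumption for brevity, not the paper's exact procedure); the model and data are placeholders.

```python
import copy
import torch
import torch.nn.functional as F

def normalized_direction(model):
    """Random direction with each parameter tensor rescaled to the norm of the
    corresponding trained tensor (a per-tensor simplification of the per-filter
    normalization of Li et al.)."""
    direction = []
    for p in model.parameters():
        d = torch.randn_like(p)
        direction.append(d * p.norm() / (d.norm() + 1e-12))
    return direction

def loss_slice(model, direction, X, y, radius=1.0, steps=21):
    """Evaluate the loss along w + alpha * d for alpha in [-radius, radius]."""
    probe = copy.deepcopy(model)
    base = [p.detach().clone() for p in model.parameters()]
    losses = []
    for alpha in torch.linspace(-radius, radius, steps):
        with torch.no_grad():
            for p, b, d in zip(probe.parameters(), base, direction):
                p.copy_(b + alpha * d)
            losses.append(F.cross_entropy(probe(X), y).item())
    return losses
```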
Another approach applies PCA to the sequence of parameter vectors visited during training, projecting the optimization trajectory and surrounding loss surface onto the top two principal components. This captures the directions of greatest variation during training and often reveals the structure of the basin the optimizer converges to.
Loss surfaces in reinforcement learning exhibit additional complications because the objective (expected cumulative reward) depends on the policy, the environment dynamics, and the sampling distribution, all of which change as the policy improves. The resulting surfaces tend to have more pathological features, including degenerate saddle points and highly non-stationary geometry.
In GANs, the loss surface is defined by a two-player minimax game between the generator and discriminator. The surface is not a single function to be minimized but rather a saddle-point problem in a joint parameter space. Training dynamics involve cycling, mode collapse, and other instabilities that are directly related to the geometry of this joint surface.
Physics-informed neural networks (PINNs) incorporate physical laws as constraints in the loss function. The resulting loss surfaces often have sharper features and more complex topology than standard supervised learning surfaces because the physics residual terms introduce competing objectives that must be simultaneously satisfied.
In continual learning settings, where a network is trained sequentially on multiple tasks, the loss surface geometry changes with each new task. Lyle et al. (2024) showed that standard deep learning methods gradually lose plasticity (the ability to learn new tasks) as training progresses, partly because the Hessian spectrum collapses and the loss surface becomes effectively low-rank [16]. This spectral collapse removes the curvature information needed for effective gradient-based optimization on new tasks.
Choromanska et al. (2015) established a formal connection between neural network loss surfaces and the energy functions of spherical spin-glass models from statistical physics [1]. Under assumptions of variable independence, parameter redundancy, and uniformity, the critical points of the loss function have a layered structure: high-loss critical points are overwhelmingly saddle points, while low-loss critical points are increasingly likely to be local minima. As the network grows, the gap between the global minimum and the lowest local minima shrinks, and the number of bad local minima decreases exponentially.
Pennington and Bahri (2017) used random matrix theory to analyze the Hessian spectrum of neural networks, deriving analytical predictions for the distribution of eigenvalues. Their results connect the macroscopic properties of the loss surface (overall curvature, number of negative eigenvalues) to the statistical properties of the data and the network architecture.
The link between flat minima and generalization has been formalized through PAC-Bayes theory. A PAC-Bayes bound states that the generalization error is controlled by the KL divergence between the learned parameter distribution and a prior, which in turn relates to how much the loss increases when parameters are perturbed. Flatter minima tolerate larger perturbations, yielding tighter generalization bounds. SAM and Entropy-SGD can both be interpreted as optimizing PAC-Bayes-style objectives.
Several questions about loss surfaces remain active areas of research:

- How to measure sharpness in a way that is invariant to reparameterization, addressing the critique of Dinh et al. [5].
- Whether flatness causally improves generalization or merely correlates with it.
- Why the implicit bias of SGD so reliably steers optimization toward well-generalizing basins.
- How far permutation alignment can extend linear mode connectivity between independently trained networks.