Squared hinge loss (also called L2 hinge loss or L2-loss) is a loss function used in machine learning for classification tasks, most commonly in support vector machines (SVMs). It is defined as the square of the standard hinge loss and takes the form L(y, f(x)) = max(0, 1 - y * f(x))². By squaring the hinge loss, the function becomes differentiable everywhere, which is a significant practical advantage over the standard hinge loss for gradient-based optimization. The squared hinge loss penalizes margin violations quadratically rather than linearly, resulting in stronger penalties for points that are far on the wrong side of the decision boundary and comparatively weaker penalties for points that only slightly violate the margin.
Squared hinge loss is the default loss function in scikit-learn's LinearSVC classifier and is widely implemented in frameworks such as TensorFlow, Keras, and PyTorch. It plays a central role in L2-SVM formulations for both binary and multiclass classification, and it has been extended to the Crammer-Singer multiclass SVM framework.
Imagine you are learning to sort toy animals into two boxes: cats in one box and dogs in the other. Every time you put a toy in the wrong box, you get a penalty score. With regular hinge loss, the penalty grows steadily the more wrong you are. With squared hinge loss, the penalty grows much faster when you make a big mistake, but you barely get penalized at all when you are only a tiny bit wrong. This encourages you to fix your biggest mistakes first. The squared version also has a "smooth" shape with no sharp corners, which makes it easier for a computer to figure out exactly how to adjust and improve.
For a single training example with true label y (where y is either +1 or -1) and model prediction f(x), the squared hinge loss is defined as:
L(y, f(x)) = [max(0, 1 - y * f(x))]²
This can be written equivalently as a piecewise function:
L(y, f(x)) = (1 - y * f(x))² if y * f(x) < 1
L(y, f(x)) = 0 if y * f(x) >= 1
The quantity yz = y * f(x), where z = f(x) denotes the raw model score, is called the functional margin. When yz >= 1, the example is correctly classified with sufficient margin and incurs zero loss. When yz < 1, the loss increases quadratically as the margin decreases.
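To make the definition concrete, here is a minimal NumPy sketch of the piecewise form above; the function and variable names are illustrative, not part of any library API.

```python
import numpy as np

def squared_hinge_loss(y, scores):
    """Element-wise squared hinge loss for labels in {-1, +1} and raw scores f(x)."""
    margins = y * scores                      # functional margin y * f(x)
    return np.maximum(0.0, 1.0 - margins) ** 2

# A point beyond the margin incurs zero loss; a badly misclassified point
# is penalized quadratically.
y = np.array([+1, +1, -1])
scores = np.array([1.5, 0.2, 1.0])
print(squared_hinge_loss(y, scores))          # -> 0.0, 0.64, 4.0
```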
For multiclass classification with k classes, the squared hinge loss generalizes in more than one way. A common choice sums the squared margin violations over all incorrect classes:

L_i = sum over j != y_i of [max(0, f_j(x_i) - f_{y_i}(x_i) + 1)]²

The Crammer-Singer formulation instead penalizes only the largest violation among the incorrect classes:

L_i = [max over j != y_i of max(0, f_j(x_i) - f_{y_i}(x_i) + 1)]²
Here, f_j(x_i) is the score assigned to class j, and f_{y_i}(x_i) is the score for the correct class. The loss penalizes cases where any incorrect class score comes within a margin of the correct class score. Lee and Lin (2013) provided a detailed study of this L2-loss extension to the Crammer-Singer multiclass SVM, showing that the derivation of the dual problem has subtle but important differences from the L1 (standard hinge) case.
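The sketch below (NumPy only, with illustrative names) computes both variants for a single example, so the difference between summing the violations and squaring only the largest one is easy to see.

```python
import numpy as np

def multiclass_squared_hinge(scores, y_true, variant="sum"):
    """Squared hinge loss for one example with class scores f_1, ..., f_k.

    variant="sum": sum of squared violations over incorrect classes.
    variant="max": Crammer-Singer style; square only the largest violation.
    """
    violations = np.maximum(0.0, scores - scores[y_true] + 1.0)
    violations[y_true] = 0.0                  # exclude the correct class
    if variant == "sum":
        return np.sum(violations ** 2)
    return np.max(violations) ** 2

scores = np.array([2.0, 3.5, 2.5])            # raw scores for 3 classes
print(multiclass_squared_hinge(scores, y_true=0, variant="sum"))  # 2.5^2 + 1.5^2 = 8.5
print(multiclass_squared_hinge(scores, y_true=0, variant="max"))  # 2.5^2 = 6.25
```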
One of the main advantages of squared hinge loss over standard hinge loss is that it is differentiable everywhere. The gradient of the squared hinge loss with respect to the prediction f(x) is:
dL/df(x) = -2y * (1 - y * f(x)) if y * f(x) < 1
dL/df(x) = 0 if y * f(x) >= 1
At the boundary point y * f(x) = 1, both the function value and its derivative are zero, so the function transitions smoothly. In contrast, the standard hinge loss has a non-differentiable "kink" at y * f(x) = 1, where its derivative with respect to the margin jumps from -1 to 0. This kink forces the use of subgradient methods rather than standard gradient descent.
The differentiability of squared hinge loss means it can be used directly with:

- standard (batch) gradient descent;
- stochastic gradient descent, as in scikit-learn's SGDClassifier;
- Newton-type and trust region methods, such as the TRON solver in LIBLINEAR.

However, it is worth noting that the squared hinge loss is not twice differentiable at yz = 1 (the second derivative is discontinuous there). This means that pure second-order methods may not achieve their theoretical quadratic convergence rate on L2-SVM problems.
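As a quick sanity check, the following sketch (illustrative names) evaluates the analytic gradient next to a finite-difference estimate near the boundary y * f(x) = 1, where both approach zero.

```python
import numpy as np

def squared_hinge_grad(y, score):
    """Derivative of [max(0, 1 - y*score)]^2 with respect to the score."""
    margin = y * score
    return -2.0 * y * (1.0 - margin) if margin < 1.0 else 0.0

loss = lambda y, s: max(0.0, 1.0 - y * s) ** 2
y, eps = 1.0, 1e-6
for score in (0.5, 0.999, 1.0, 1.5):
    numeric = (loss(y, score + eps) - loss(y, score - eps)) / (2 * eps)
    print(score, squared_hinge_grad(y, score), round(numeric, 6))
```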
The following table compares squared hinge loss with other commonly used classification loss functions.
| Property | Hinge loss | Squared hinge loss | Logistic loss (log loss) | Exponential loss | Modified Huber loss |
|---|---|---|---|---|---|
| Formula | max(0, 1 - yz) | [max(0, 1 - yz)]² | log(1 + exp(-yz)) | exp(-yz) | max(0, 1 - yz)² if yz > -1; -4yz otherwise |
| Convex | Yes | Yes | Yes | Yes | Yes |
| Differentiable | No (kink at yz = 1) | Yes | Yes | Yes | Yes |
| Twice differentiable | No | No (at yz = 1) | Yes | Yes | No (at yz = -1 and yz = 1) |
| Outlier sensitivity | Moderate (linear growth) | Higher (quadratic growth) | Low (linear growth for large negative yz) | Very high (exponential growth) | Low (linear growth for yz < -1) |
| Sparse solutions (SVM) | Yes | No | No | No | No |
| Probability estimates | No | No | Yes | No | Yes |
| Primary use | L1-SVM | L2-SVM | Logistic regression | AdaBoost | Robust SVM |
The standard hinge loss, defined as max(0, 1 - yz), increases linearly for margin violations, while the squared hinge loss increases quadratically. This has several practical consequences:

- Large violations dominate the total loss, so the squared hinge loss is more sensitive to outliers and mislabeled points; for a margin of yz = -2, the hinge penalty is 3 while the squared penalty is 9.
- Small violations (0 < yz < 1) are penalized only lightly, so points just inside the margin contribute little to the objective.
- The resulting objective is smooth, so standard gradient-based and Newton-type solvers can be applied directly.
Logistic loss (also known as cross-entropy loss for binary classification) and squared hinge loss are both smooth, convex loss functions. Key differences include:

- The squared hinge loss is exactly zero once the margin satisfies yz >= 1, whereas the logistic loss is strictly positive everywhere, so every training point contributes to the logistic objective.
- For large negative margins, the logistic loss grows only linearly, while the squared hinge loss grows quadratically and is therefore more sensitive to outliers.
- The logistic loss yields calibrated probability estimates, which the squared hinge loss does not.
The L2-SVM (also called L2-loss SVM) replaces the standard hinge loss in the SVM objective with squared hinge loss. The primal optimization problem is:
minimize (1/2) * ||w||² + C * sum_i [max(0, 1 - y_i * (w^T x_i + b))]²
Here, w is the weight vector, b is the bias term, C is the regularization parameter controlling the trade-off between margin maximization and loss minimization, and the sum is over all training examples. The L2 regularization term (1/2) * ||w||² encourages a wide margin, while the squared hinge loss term penalizes margin violations.
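Translated directly into NumPy (a sketch for the linear case; the data, weights, and value of C below are placeholders), the objective reads:

```python
import numpy as np

def l2svm_primal_objective(w, b, X, y, C):
    """(1/2)*||w||^2 + C * sum_i max(0, 1 - y_i*(w^T x_i + b))^2."""
    margins = y * (X @ w + b)
    return 0.5 * np.dot(w, w) + C * np.sum(np.maximum(0.0, 1.0 - margins) ** 2)

# Toy data: three points, two features, labels in {-1, +1}.
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -1.0]])
y = np.array([+1, -1, -1])
w, b, C = np.array([0.3, -0.2]), 0.1, 1.0
print(l2svm_primal_objective(w, b, X, y, C))
```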
Squared hinge loss can be combined with different regularization penalties:
| Formulation | Regularization | Loss | Characteristics |
|---|---|---|---|
| L2-regularized L2-loss SVM | (1/2) * ||w||² | [max(0, 1 - yz)]² | Default in scikit-learn's LinearSVC; smooth objective; dense solutions |
| L1-regularized L2-loss SVM | ||w||_1 | [max(0, 1 - yz)]² | Sparse weight vector for feature selection; must be solved in primal form |
| L2-regularized L1-loss SVM | (1/2) * ||w||² | max(0, 1 - yz) | Traditional SVM; non-smooth loss; sparse dual variables |
The L1-regularized variant with squared hinge loss is particularly useful for high-dimensional datasets where feature selection is desired, as the L1 penalty drives many weights to exactly zero. In scikit-learn, this is available through LinearSVC(penalty='l1', loss='squared_hinge', dual=False). Note that the combination of L1 penalty with standard hinge loss is not supported in most solvers because the resulting optimization problem is non-smooth in both the loss and the regularizer.
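A short scikit-learn sketch of this combination, using synthetic data for illustration (the dataset parameters and the value of C are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic high-dimensional data in which only a few features are informative.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# The L1 penalty with squared hinge loss must be solved in the primal (dual=False).
clf = LinearSVC(penalty='l1', loss='squared_hinge', dual=False, C=0.1)
clf.fit(X, y)

# The L1 penalty typically drives many coefficients to exactly zero.
print("nonzero weights:", np.count_nonzero(clf.coef_))
```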
The dual problem of the L2-regularized L2-loss SVM differs from the standard L1-loss SVM dual. In the L1-loss case, the dual variables alpha_i are bounded above by C. In the L2-loss case, the dual variables are unbounded, but the objective includes an additional quadratic term (1/(4C)) * sum_i alpha_i², which acts as an implicit regularizer on the dual variables. This structural difference affects how solvers such as LIBLINEAR handle the two formulations.
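Written out for the formulation without a bias term (adding the bias b introduces the equality constraint sum_i y_i * alpha_i = 0), the dual problem sketched here takes the form:

maximize over alpha_i >= 0:  sum_i alpha_i - (1/2) * sum_{i,j} alpha_i * alpha_j * y_i * y_j * (x_i^T x_j) - (1/(4C)) * sum_i alpha_i²

with the primal weight vector recovered as w = sum_i alpha_i * y_i * x_i.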
Several optimization algorithms can be used to minimize the squared hinge loss SVM objective.
LIBLINEAR, the solver underlying scikit-learn's LinearSVC, uses a dual coordinate descent method. For L2-loss SVMs, the algorithm iteratively updates individual dual variables while keeping others fixed. The update rule for each dual variable has a closed-form solution because the objective is quadratic in each variable. This makes the per-iteration cost low and the method scalable to large datasets.
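The sketch below illustrates the idea under simplifying assumptions (no bias term, no shrinking heuristics, dense NumPy arrays, names chosen for clarity); it is not the LIBLINEAR implementation itself.

```python
import numpy as np

def dual_cd_l2svm(X, y, C, n_epochs=20, seed=0):
    """Dual coordinate descent sketch for the L2-regularized L2-loss SVM.

    Minimizes (1/2)*alpha^T (Q + D) alpha - sum_i alpha_i over alpha >= 0,
    where Q_ij = y_i * y_j * x_i^T x_j and D = (1/(2C)) * I.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                                   # w = sum_i alpha_i y_i x_i
    diag = np.einsum('ij,ij->i', X, X) + 1.0 / (2.0 * C)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            # Partial derivative of the dual objective with respect to alpha_i.
            g = y[i] * (X[i] @ w) - 1.0 + alpha[i] / (2.0 * C)
            new_alpha = max(alpha[i] - g / diag[i], 0.0)   # closed-form update
            w += (new_alpha - alpha[i]) * y[i] * X[i]
            alpha[i] = new_alpha
    return w, alpha

# Toy usage on linearly separable data with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
w, alpha = dual_cd_l2svm(X, y, C=1.0)
print(w, np.sign(X @ w))                              # signs should match y
```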
Because the squared hinge loss is differentiable, the primal L2-loss SVM objective can be minimized using a truncated Newton method within a trust region framework. This approach computes approximate Newton directions using conjugate gradient iterations and can achieve superlinear convergence. The trust region Newton (TRON) solver in LIBLINEAR implements this strategy.
For very large datasets, stochastic gradient descent (SGD) is an efficient alternative. Scikit-learn's SGDClassifier with loss='squared_hinge' performs SGD on the squared hinge loss objective. Because the loss is differentiable, the gradient at each step is well defined (unlike the standard hinge loss, which requires a subgradient).
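A minimal usage sketch (the dataset and hyperparameter values are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stochastic gradient descent on the squared hinge objective;
# alpha is the regularization strength.
clf = SGDClassifier(loss='squared_hinge', alpha=1e-4, max_iter=1000, tol=1e-3)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```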
Squared hinge loss is available in most major machine learning frameworks.
| Library | Class/Function | Notes |
|---|---|---|
| scikit-learn | LinearSVC(loss='squared_hinge') | Default loss for LinearSVC; uses LIBLINEAR solver |
| scikit-learn | SGDClassifier(loss='squared_hinge') | SGD-based training for large-scale problems |
| TensorFlow / Keras | tf.keras.losses.SquaredHinge() | Class-based API; also available as tf.keras.losses.squared_hinge |
| PyTorch | torchmetrics.classification.HingeLoss(squared=True) | Via TorchMetrics; PyTorch core does not have a dedicated squared hinge loss |
| LIBLINEAR | Solver type 2 (primal) or type 1 (dual) | L2-regularized L2-loss SVM classification |
In Keras, the squared hinge loss expects labels to be -1 or +1. If binary labels (0 or 1) are provided, Keras automatically converts them to -1 and +1.
```python
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(num_features,)),
    keras.layers.Dense(1)  # No activation; raw score output
])

model.compile(
    optimizer='adam',
    loss=keras.losses.SquaredHinge(),
    metrics=['accuracy']
)

model.fit(X_train, y_train, epochs=50, batch_size=32)
```
```python
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# L2-regularized L2-loss SVM (default)
clf = make_pipeline(
    StandardScaler(),
    LinearSVC(loss='squared_hinge', C=1.0, max_iter=1000)
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```
The hinge loss family includes several related functions that address different limitations of the standard hinge loss.
The modified Huber loss combines the squared hinge loss with a linear tail for robustness to outliers. It is defined as:
L(y, f(x)) = max(0, 1 - yz)² if yz > -1
L(y, f(x)) = -4yz if yz <= -1
For margin values greater than -1, the modified Huber loss behaves like the squared hinge loss. For very large margin violations (yz <= -1), it switches to a linear penalty, preventing extreme outliers from dominating the loss. The modified Huber loss is available in scikit-learn through SGDClassifier(loss='modified_huber') and, unlike the squared hinge loss, supports probability estimation via the predict_proba method.
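A minimal NumPy sketch of this piecewise definition (names are illustrative):

```python
import numpy as np

def modified_huber_loss(y, scores):
    """Quadratic near the margin, linear for severe violations (yz <= -1)."""
    m = y * scores                                    # margin y * f(x)
    return np.where(m > -1.0, np.maximum(0.0, 1.0 - m) ** 2, -4.0 * m)

y = np.array([+1, +1, +1])
scores = np.array([0.5, -1.0, -3.0])
print(modified_huber_loss(y, scores))                 # -> 0.25, 4.0, 12.0
```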
Rennie and Srebro proposed a piecewise quadratic approximation to the hinge loss that is smooth but does not square the entire loss:
L(y, f(x)) = 1/2 - yz if yz <= 0
L(y, f(x)) = (1/2)(1 - yz)² if 0 < yz < 1
L(y, f(x)) = 0 if yz >= 1
This function is differentiable everywhere and closely approximates the standard hinge loss while avoiding the quadratic growth of the full squared hinge loss. It provides a compromise between smoothness and outlier robustness.
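The same three-branch definition, sketched in NumPy with illustrative names:

```python
import numpy as np

def smooth_hinge_loss(y, scores):
    """Smoothed hinge: linear for yz <= 0, quadratic for 0 < yz < 1, zero for yz >= 1."""
    m = y * scores
    return np.where(m <= 0.0, 0.5 - m,
                    np.where(m < 1.0, 0.5 * (1.0 - m) ** 2, 0.0))

y = np.array([+1, +1, +1])
scores = np.array([-0.5, 0.5, 2.0])
print(smooth_hinge_loss(y, scores))                   # -> 1.0, 0.125, 0.0
```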
Luo, Qiao, and Zhang (2021) introduced a family of infinitely differentiable smooth hinge losses that converge uniformly to the standard hinge loss as a smoothing parameter approaches zero. These smooth variants enable the use of second-order optimization methods (such as the Trust Region Newton method) with theoretical guarantees of quadratic convergence, which the squared hinge loss cannot fully provide due to its discontinuous second derivative.
Squared hinge loss is a good choice in the following situations:

- when a smooth, convex objective is needed so that standard gradient-based or Newton-type solvers can be applied;
- when training large-scale linear classifiers, for example with LIBLINEAR or stochastic gradient descent;
- when the data contain few extreme outliers, since large margin violations are penalized quadratically;
- when calibrated probability estimates are not required.
The regularization parameter C controls the trade-off between maximizing the margin and minimizing the training loss. With squared hinge loss, results tend to be less sensitive to the exact choice of C than with standard hinge loss, meaning that a broader range of C values produces comparable performance; the smooth, quadratic shape of the loss makes the optimization landscape more forgiving. Nevertheless, cross-validation should be used to select C, as sketched below.
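A cross-validation sketch with scikit-learn (the grid values and synthetic data are placeholders, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

pipe = make_pipeline(StandardScaler(), LinearSVC(loss='squared_hinge'))
# The parameter name follows the step name that make_pipeline assigns ('linearsvc').
grid = GridSearchCV(pipe, {'linearsvc__C': [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```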
The origins of squared hinge loss are closely tied to the development of support vector machines. Vladimir Vapnik and colleagues introduced the SVM framework and the standard hinge loss in the early 1990s, building on the concept of optimal separating hyperplanes from Vapnik's earlier work in statistical learning theory. The squared variant emerged as researchers explored smoother alternatives to the hinge loss that could leverage faster optimization algorithms.
The term "L2-SVM" became standard in the optimization community to distinguish SVMs using squared hinge loss from the original "L1-SVM" formulation. LIBLINEAR, developed by Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin at National Taiwan University, was one of the first widely used solvers to include efficient algorithms for both L1-loss and L2-loss linear SVMs. Their work demonstrated that the L2-loss formulation could be solved efficiently using trust region Newton methods.
In 2013, Ching-Pei Lee and Chih-Jen Lin published a detailed study extending the Crammer-Singer multiclass SVM formulation to use L2 loss. Their paper, published in Neural Computation, showed that the derivation of the dual problem for L2 loss has subtle but important differences from the L1 case and provided experimental evidence that L2-loss multiclass SVM performs comparably to the L1-loss version.
More recently, squared hinge loss has found applications in deep learning, where it serves as an alternative to cross-entropy loss for training classifiers. Research has shown that using squared hinge loss with higher-order optimization algorithms can improve convergence rates and generalization in deep neural networks.