Squared hinge loss (also called L2 hinge loss or L2-loss) is a loss function used in machine learning for classification tasks, most commonly in support vector machines (SVMs). It is defined as the square of the standard hinge loss and takes the form L(y, f(x)) = max(0, 1 - y * f(x))². By squaring the hinge loss, the function becomes differentiable everywhere, which is a significant practical advantage over the standard hinge loss for gradient-based optimization. The squared hinge loss penalizes margin violations quadratically rather than linearly, resulting in stronger penalties for points that are far on the wrong side of the decision boundary and comparatively weaker penalties for points that only slightly violate the margin.
Squared hinge loss is the default loss function in scikit-learn's LinearSVC classifier and is widely implemented in frameworks such as TensorFlow, Keras, and PyTorch. It plays a central role in L2-SVM formulations for both binary and multiclass classification, and it has been extended to the Crammer-Singer multiclass SVM framework.
Imagine you are learning to sort toy animals into two boxes: cats in one box and dogs in the other. Every time you put a toy in the wrong box, you get a penalty score. With regular hinge loss, the penalty grows steadily the more wrong you are. With squared hinge loss, the penalty grows much faster when you make a big mistake, but you barely get penalized at all when you are only a tiny bit wrong. This encourages you to fix your biggest mistakes first. The squared version also has a "smooth" shape with no sharp corners, which makes it easier for a computer to figure out exactly how to adjust and improve.
For a single training example with true label y (where y is either +1 or -1) and model prediction f(x), the squared hinge loss is defined as:
L(y, f(x)) = [max(0, 1 - y * f(x))]²
This can be written equivalently as a piecewise function:
L(y, f(x)) = (1 - y * f(x))² if y * f(x) < 1
L(y, f(x)) = 0 if y * f(x) >= 1
The quantity yz = y * f(x), where z = f(x) denotes the raw model score, is called the functional margin. When yz >= 1, the example is correctly classified with sufficient margin and incurs zero loss. When yz < 1, the loss increases quadratically as the margin decreases.
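To make the definition concrete, here is a minimal NumPy sketch of the piecewise form above; the function and variable names are illustrative, not part of any library API.

```python
import numpy as np

def squared_hinge_loss(y, scores):
    """Element-wise squared hinge loss for labels in {-1, +1} and raw scores f(x)."""
    margins = y * scores                      # functional margin y * f(x)
    return np.maximum(0.0, 1.0 - margins) ** 2

# A point beyond the margin incurs zero loss; a badly misclassified point
# is penalized quadratically.
y = np.array([+1, +1, -1])
scores = np.array([1.5, 0.2, 1.0])
print(squared_hinge_loss(y, scores))          # -> 0.0, 0.64, 4.0
```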
For multiclass classification with k classes, the squared hinge loss generalizes in more than one way. A common choice sums the squared margin violations over all incorrect classes:

L_i = sum over j != y_i of [max(0, f_j(x_i) - f_{y_i}(x_i) + 1)]²

The Crammer-Singer formulation instead penalizes only the largest violation among the incorrect classes:

L_i = [max over j != y_i of max(0, f_j(x_i) - f_{y_i}(x_i) + 1)]²
Here, f_j(x_i) is the score assigned to class j, and f_{y_i}(x_i) is the score for the correct class. The loss penalizes cases where any incorrect class score comes within a margin of the correct class score. Lee and Lin (2013) provided a detailed study of this L2-loss extension to the Crammer-Singer multiclass SVM, showing that the derivation of the dual problem has subtle but important differences from the L1 (standard hinge) case.
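The sketch below (NumPy only, with illustrative names) computes both variants for a single example, so the difference between summing the violations and squaring only the largest one is easy to see.

```python
import numpy as np

def multiclass_squared_hinge(scores, y_true, variant="sum"):
    """Squared hinge loss for one example with class scores f_1, ..., f_k.

    variant="sum": sum of squared violations over incorrect classes.
    variant="max": Crammer-Singer style; square only the largest violation.
    """
    violations = np.maximum(0.0, scores - scores[y_true] + 1.0)
    violations[y_true] = 0.0                  # exclude the correct class
    if variant == "sum":
        return np.sum(violations ** 2)
    return np.max(violations) ** 2

scores = np.array([2.0, 3.5, 2.5])            # raw scores for 3 classes
print(multiclass_squared_hinge(scores, y_true=0, variant="sum"))  # 2.5^2 + 1.5^2 = 8.5
print(multiclass_squared_hinge(scores, y_true=0, variant="max"))  # 2.5^2 = 6.25
```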
One of the main advantages of squared hinge loss over standard hinge loss is that it is differentiable everywhere. The gradient of the squared hinge loss with respect to the prediction f(x) is:
dL/df(x) = -2y * (1 - y * f(x)) if y * f(x) < 1
dL/df(x) = 0 if y * f(x) >= 1
At the boundary point y * f(x) = 1, both the function value and its derivative are zero, so the function transitions smoothly. In contrast, the standard hinge loss has a non-differentiable "kink" at y * f(x) = 1, where its derivative with respect to the margin jumps from -1 to 0. This kink forces the use of subgradient methods rather than standard gradient descent.
The differentiability of squared hinge loss means it can be used directly with:

- standard (batch) gradient descent;
- stochastic gradient descent, as in scikit-learn's SGDClassifier;
- Newton-type and trust region methods, such as the TRON solver in LIBLINEAR.

However, it is worth noting that the squared hinge loss is not twice differentiable at yz = 1 (the second derivative is discontinuous there). This means that pure second-order methods may not achieve their theoretical quadratic convergence rate on L2-SVM problems.
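As a quick sanity check, the following sketch (illustrative names) evaluates the analytic gradient next to a finite-difference estimate near the boundary y * f(x) = 1, where both approach zero.

```python
import numpy as np

def squared_hinge_grad(y, score):
    """Derivative of [max(0, 1 - y*score)]^2 with respect to the score."""
    margin = y * score
    return -2.0 * y * (1.0 - margin) if margin < 1.0 else 0.0

loss = lambda y, s: max(0.0, 1.0 - y * s) ** 2
y, eps = 1.0, 1e-6
for score in (0.5, 0.999, 1.0, 1.5):
    numeric = (loss(y, score + eps) - loss(y, score - eps)) / (2 * eps)
    print(score, squared_hinge_grad(y, score), round(numeric, 6))
```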
The following table compares squared hinge loss with other commonly used classification loss functions.
| Property | Hinge loss | Squared hinge loss | Logistic loss (log loss) | Exponential loss | Modified Huber loss |
|---|---|---|---|---|---|
| Formula | max(0, 1 - yz) | [max(0, 1 - yz)]² | log(1 + exp(-yz)) | exp(-yz) | max(0, 1 - yz)² if yz > -1; -4yz otherwise |
| Convex | Yes | Yes | Yes | Yes | Yes |
| Differentiable | No (kink at yz = 1) | Yes | Yes | Yes | Yes |
| Twice differentiable | No | No (at yz = 1) | Yes | Yes | No (at yz = -1 and yz = 1) |
| Outlier sensitivity | Moderate (linear growth) | Higher (quadratic growth) | Low (linear growth for large negative yz) | Very high (exponential growth) | Low (linear growth for yz < -1) |
| Sparse solutions (SVM) | Yes | No | No | No | No |
| Probability estimates | No | No | Yes | No | Yes |
| Primary use | L1-SVM | L2-SVM | Logistic regression | AdaBoost | Robust SVM |
The standard hinge loss, defined as max(0, 1 - yz), increases linearly for margin violations, while the squared hinge loss increases quadratically. This has several practical consequences:

- Large violations dominate the total loss, so the squared hinge loss is more sensitive to outliers and mislabeled points; for a margin of yz = -2, the hinge penalty is 3 while the squared penalty is 9.
- Small violations (0 < yz < 1) are penalized only lightly, so points just inside the margin contribute little to the objective.
- The resulting objective is smooth, so standard gradient-based and Newton-type solvers can be applied directly.
Logistic loss (also known as cross-entropy loss for binary classification) and squared hinge loss are both smooth, convex loss functions. Key differences include:

- The squared hinge loss is exactly zero once the margin satisfies yz >= 1, whereas the logistic loss is strictly positive everywhere, so every training point contributes to the logistic objective.
- For large negative margins, the logistic loss grows only linearly, while the squared hinge loss grows quadratically and is therefore more sensitive to outliers.
- The logistic loss yields calibrated probability estimates, which the squared hinge loss does not.
The L2-SVM (also called L2-loss SVM) replaces the standard hinge loss in the SVM objective with squared hinge loss. The primal optimization problem is:
minimize (1/2) * ||w||² + C * sum_i [max(0, 1 - y_i * (w^T x_i + b))]²
Here, w is the weight vector, b is the bias term, C is the regularization parameter controlling the trade-off between margin maximization and loss minimization, and the sum is over all training examples. The L2 regularization term (1/2) * ||w||² encourages a wide margin, while the squared hinge loss term penalizes margin violations.
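Translated directly into NumPy (a sketch for the linear case; the data, weights, and value of C below are placeholders), the objective reads:

```python
import numpy as np

def l2svm_primal_objective(w, b, X, y, C):
    """(1/2)*||w||^2 + C * sum_i max(0, 1 - y_i*(w^T x_i + b))^2."""
    margins = y * (X @ w + b)
    return 0.5 * np.dot(w, w) + C * np.sum(np.maximum(0.0, 1.0 - margins) ** 2)

# Toy data: three points, two features, labels in {-1, +1}.
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -1.0]])
y = np.array([+1, -1, -1])
w, b, C = np.array([0.3, -0.2]), 0.1, 1.0
print(l2svm_primal_objective(w, b, X, y, C))
```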
Squared hinge loss can be combined with different regularization penalties:
| Formulation | Regularization | Loss | Characteristics |
|---|---|---|---|
| L2-regularized L2-loss SVM | (1/2) * ||w||² | [max(0, 1 - yz)]² | Default in scikit-learn's LinearSVC; smooth objective; dense solutions |
| L1-regularized L2-loss SVM | ||w||_1 | [max(0, 1 - yz)]² | Sparse weight vector for feature selection; must be solved in primal form |
| L2-regularized L1-loss SVM | (1/2) * ||w||² | max(0, 1 - yz) | Traditional SVM; non-smooth loss; sparse dual variables |
The L1-regularized variant with squared hinge loss is particularly useful for high-dimensional datasets where feature selection is desired, as the L1 penalty drives many weights to exactly zero. In scikit-learn, this is available through LinearSVC(penalty='l1', loss='squared_hinge', dual=False). Note that the combination of L1 penalty with standard hinge loss is not supported in most solvers because the resulting optimization problem is non-smooth in both the loss and the regularizer.
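A short scikit-learn sketch of this combination, using synthetic data for illustration (the dataset parameters and the value of C are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic high-dimensional data in which only a few features are informative.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# The L1 penalty with squared hinge loss must be solved in the primal (dual=False).
clf = LinearSVC(penalty='l1', loss='squared_hinge', dual=False, C=0.1)
clf.fit(X, y)

# The L1 penalty typically drives many coefficients to exactly zero.
print("nonzero weights:", np.count_nonzero(clf.coef_))
```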
The dual problem of the L2-regularized L2-loss SVM differs from the standard L1-loss SVM dual. In the L1-loss case, the dual variables alpha_i are bounded above by C. In the L2-loss case, the dual variables are unbounded, but the objective includes an additional quadratic term (1/(4C)) * sum_i alpha_i², which acts as an implicit regularizer on the dual variables. This structural difference affects how solvers such as LIBLINEAR handle the two formulations.
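Written out for the formulation without a bias term (adding the bias b introduces the equality constraint sum_i y_i * alpha_i = 0), the dual problem sketched here takes the form:

maximize over alpha_i >= 0:  sum_i alpha_i - (1/2) * sum_{i,j} alpha_i * alpha_j * y_i * y_j * (x_i^T x_j) - (1/(4C)) * sum_i alpha_i²

with the primal weight vector recovered as w = sum_i alpha_i * y_i * x_i.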
Several optimization algorithms can be used to minimize the squared hinge loss SVM objective.
LIBLINEAR, the solver underlying scikit-learn's LinearSVC, uses a dual coordinate descent method. For L2-loss SVMs, the algorithm iteratively updates individual dual variables while keeping others fixed. The update rule for each dual variable has a closed-form solution because the objective is quadratic in each variable. This makes the per-iteration cost low and the method scalable to large datasets.
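The sketch below illustrates the idea under simplifying assumptions (no bias term, no shrinking heuristics, dense NumPy arrays, names chosen for clarity); it is not the LIBLINEAR implementation itself.

```python
import numpy as np

def dual_cd_l2svm(X, y, C, n_epochs=20, seed=0):
    """Dual coordinate descent sketch for the L2-regularized L2-loss SVM.

    Minimizes (1/2)*alpha^T (Q + D) alpha - sum_i alpha_i over alpha >= 0,
    where Q_ij = y_i * y_j * x_i^T x_j and D = (1/(2C)) * I.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                                   # w = sum_i alpha_i y_i x_i
    diag = np.einsum('ij,ij->i', X, X) + 1.0 / (2.0 * C)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            # Partial derivative of the dual objective with respect to alpha_i.
            g = y[i] * (X[i] @ w) - 1.0 + alpha[i] / (2.0 * C)
            new_alpha = max(alpha[i] - g / diag[i], 0.0)   # closed-form update
            w += (new_alpha - alpha[i]) * y[i] * X[i]
            alpha[i] = new_alpha
    return w, alpha

# Toy usage on linearly separable data with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
w, alpha = dual_cd_l2svm(X, y, C=1.0)
print(w, np.sign(X @ w))                              # signs should match y
```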
Because the squared hinge loss is differentiable, the primal L2-loss SVM objective can be minimized using a truncated Newton method within a trust region framework. This approach computes approximate Newton directions using conjugate gradient iterations and can achieve superlinear convergence. The trust region Newton (TRON) solver in LIBLINEAR implements this strategy.
For very large datasets, stochastic gradient descent (SGD) is an efficient alternative. Scikit-learn's SGDClassifier with loss='squared_hinge' performs SGD on the squared hinge loss objective. Because the loss is differentiable, the gradient at each step is well defined (unlike the standard hinge loss, which requires a subgradient).
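A minimal usage sketch (the dataset and hyperparameter values are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stochastic gradient descent on the squared hinge objective;
# alpha is the regularization strength.
clf = SGDClassifier(loss='squared_hinge', alpha=1e-4, max_iter=1000, tol=1e-3)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```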
Squared hinge loss is available in most major machine learning frameworks.
| Library | Class/Function | Notes |
|---|---|---|
| scikit-learn | LinearSVC(loss='squared_hinge') | Default loss for LinearSVC; uses LIBLINEAR solver |
| scikit-learn | SGDClassifier(loss='squared_hinge') | SGD-based training for large-scale problems |
| TensorFlow / Keras | tf.keras.losses.SquaredHinge() | Class-based API; also available as tf.keras.losses.squared_hinge |
| PyTorch | torchmetrics.classification.HingeLoss(squared=True) | Via TorchMetrics; PyTorch core does not have a dedicated squared hinge loss |
| LIBLINEAR | Solver type 2 (primal) or type 1 (dual) | L2-regularized L2-loss SVM classification |
In Keras, the squared hinge loss expects labels to be -1 or +1. If binary labels (0 or 1) are provided, Keras automatically converts them to -1 and +1.
```python
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(num_features,)),
    keras.layers.Dense(1)  # No activation; raw score output
])

model.compile(
    optimizer='adam',
    loss=keras.losses.SquaredHinge(),
    metrics=['accuracy']
)

model.fit(X_train, y_train, epochs=50, batch_size=32)
```
```python
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# L2-regularized L2-loss SVM (default)
clf = make_pipeline(
    StandardScaler(),
    LinearSVC(loss='squared_hinge', C=1.0, max_iter=1000)
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```
The hinge loss family includes several related functions that address different limitations of the standard hinge loss.
The modified Huber loss combines the squared hinge loss with a linear tail for robustness to outliers. It is defined as:
L(y, f(x)) = max(0, 1 - yz)² if yz > -1
L(y, f(x)) = -4yz if yz <= -1
For margin values greater than -1, the modified Huber loss behaves like the squared hinge loss. For very large margin violations (yz <= -1), it switches to a linear penalty, preventing extreme outliers from dominating the loss. The modified Huber loss is available in scikit-learn through SGDClassifier(loss='modified_huber') and, unlike the squared hinge loss, supports probability estimation via the predict_proba method.
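A minimal NumPy sketch of this piecewise definition (names are illustrative):

```python
import numpy as np

def modified_huber_loss(y, scores):
    """Quadratic near the margin, linear for severe violations (yz <= -1)."""
    m = y * scores                                    # margin y * f(x)
    return np.where(m > -1.0, np.maximum(0.0, 1.0 - m) ** 2, -4.0 * m)

y = np.array([+1, +1, +1])
scores = np.array([0.5, -1.0, -3.0])
print(modified_huber_loss(y, scores))                 # -> 0.25, 4.0, 12.0
```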
Rennie and Srebro proposed a piecewise quadratic approximation to the hinge loss that is smooth but does not square the entire loss:
L(y, f(x)) = 1/2 - yz if yz <= 0
L(y, f(x)) = (1/2)(1 - yz)² if 0 < yz < 1
L(y, f(x)) = 0 if yz >= 1
This function is differentiable everywhere and closely approximates the standard hinge loss while avoiding the quadratic growth of the full squared hinge loss. It provides a compromise between smoothness and outlier robustness.
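The same three-branch definition, sketched in NumPy with illustrative names:

```python
import numpy as np

def smooth_hinge_loss(y, scores):
    """Smoothed hinge: linear for yz <= 0, quadratic for 0 < yz < 1, zero for yz >= 1."""
    m = y * scores
    return np.where(m <= 0.0, 0.5 - m,
                    np.where(m < 1.0, 0.5 * (1.0 - m) ** 2, 0.0))

y = np.array([+1, +1, +1])
scores = np.array([-0.5, 0.5, 2.0])
print(smooth_hinge_loss(y, scores))                   # -> 1.0, 0.125, 0.0
```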
Luo, Qiao, and Zhang (2021) introduced a family of infinitely differentiable smooth hinge losses that converge uniformly to the standard hinge loss as a smoothing parameter approaches zero. These smooth variants enable the use of second-order optimization methods (such as the Trust Region Newton method) with theoretical guarantees of quadratic convergence, which the squared hinge loss cannot fully provide due to its discontinuous second derivative.
Squared hinge loss is a good choice in the following situations:

- when a smooth, convex objective is needed so that standard gradient-based or Newton-type solvers can be applied;
- when training large-scale linear classifiers, for example with LIBLINEAR or stochastic gradient descent;
- when the data contain few extreme outliers, since large margin violations are penalized quadratically;
- when calibrated probability estimates are not required.
The regularization parameter C controls the trade-off between maximizing the margin and minimizing the training loss. With squared hinge loss, results tend to be less sensitive to the exact choice of C than with standard hinge loss, meaning that a broader range of C values produces comparable performance; the smooth, quadratic shape of the loss makes the optimization landscape more forgiving. Nevertheless, cross-validation should be used to select C, as sketched below.
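A cross-validation sketch with scikit-learn (the grid values and synthetic data are placeholders, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

pipe = make_pipeline(StandardScaler(), LinearSVC(loss='squared_hinge'))
# The parameter name follows the step name that make_pipeline assigns ('linearsvc').
grid = GridSearchCV(pipe, {'linearsvc__C': [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```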
The origins of squared hinge loss are closely tied to the development of support vector machines. Vladimir Vapnik and colleagues introduced the SVM framework and the standard hinge loss in the early 1990s, building on the concept of optimal separating hyperplanes from Vapnik's earlier work in statistical learning theory. The squared variant emerged as researchers explored smoother alternatives to the hinge loss that could leverage faster optimization algorithms.
The term "L2-SVM" became standard in the optimization community to distinguish SVMs using squared hinge loss from the original "L1-SVM" formulation. LIBLINEAR, developed by Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin at National Taiwan University, was one of the first widely used solvers to include efficient algorithms for both L1-loss and L2-loss linear SVMs. Their work demonstrated that the L2-loss formulation could be solved efficiently using trust region Newton methods.
In 2013, Ching-Pei Lee and Chih-Jen Lin published a detailed study extending the Crammer-Singer multiclass SVM formulation to use L2 loss. Their paper, published in Neural Computation, showed that the derivation of the dual problem for L2 loss has subtle but important differences from the L1 case and provided experimental evidence that L2-loss multiclass SVM performs comparably to the L1-loss version.
More recently, squared hinge loss has found applications in deep learning, where it serves as an alternative to cross-entropy loss for training classifiers. Research has shown that using squared hinge loss with higher-order optimization algorithms can improve convergence rates and generalization in deep neural networks.