Hyperplane
Last reviewed
Sources
14 citations
Review status
Source-backed
Revision
v4 ยท 3,742 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Sources
14 citations
Review status
Source-backed
Revision
v4 ยท 3,742 words
Add missing citations, update stale details, or suggest a clearer explanation.
A hyperplane is a flat, affine subspace of dimension n-1 embedded in an n-dimensional space, defined by the linear equation w . x + b = 0, where w is a normal vector and b is a scalar offset [1][12]. It generalizes a line in two dimensions and a plane in three dimensions to spaces of arbitrary dimensionality, and every hyperplane divides the space into two half-spaces. In machine learning and statistics, hyperplanes serve as decision boundaries that separate data into distinct classes, which makes them foundational to linear classifiers such as support vector machines, logistic regression, and perceptrons [3][5].
Imagine you have a pile of red balls and a pile of blue balls on a table. You want to draw a line with a ruler so that all the red balls are on one side and all the blue balls are on the other. That line is a hyperplane in two dimensions.
Now imagine the balls are floating in the air instead of sitting on a flat table. You would need a flat sheet of paper (a plane) to separate the red ones from the blue ones. That sheet of paper is a hyperplane in three dimensions.
If you had even more dimensions (which is hard to picture, but computers handle it just fine), the separator would still be a flat surface one dimension thinner than the space it sits in. That flat separator is always called a hyperplane. The goal of many machine learning algorithms is to find the best position for this separator so that it divides the data as cleanly as possible.
Formally, a hyperplane in the Euclidean space R^n is the set of all points x satisfying a linear equation:
a_1 x_1 + a_2 x_2 + ... + a_n x_n = b
where a = (a_1, a_2, ..., a_n) is a nonzero normal vector perpendicular to the hyperplane and b is a scalar offset (also called the bias). Using vector notation, this can be written more compactly as:
w . x + b = 0
where w is the weight (normal) vector and b is the bias term [1][12]. Every hyperplane divides the ambient space into two half-spaces: the set of points where w . x + b > 0 and the set where w . x + b < 0.
The following table shows what a hyperplane looks like in spaces of increasing dimension.
| Ambient space dimension (n) | Hyperplane dimension (n-1) | Geometric object | Example |
|---|---|---|---|
| 1 | 0 | Point | A single point on a number line divides it into two rays |
| 2 | 1 | Line | A line divides the plane into two half-planes |
| 3 | 2 | Plane | A plane divides 3D space into two half-spaces |
| 4 | 3 | 3D hyperplane | A three-dimensional "slice" of 4D space |
| n | n-1 | (n-1)-flat | General case |
A linear hyperplane passes through the origin and is defined by w . x = 0. It is a true linear subspace of R^n.
An affine hyperplane does not necessarily pass through the origin. It is defined by w . x = b where b is not zero. An affine hyperplane can be seen as a translated copy of a linear hyperplane. Most hyperplanes encountered in machine learning are affine because the bias term b shifts the decision boundary away from the origin to better fit the data.
In projective geometry, a projective hyperplane is a subspace of codimension 1 in a projective space. Unlike Euclidean hyperplanes, it takes two projective hyperplanes to separate points in a projective space.
In supervised classification, a decision boundary is the surface that separates the regions of feature space assigned to different classes. When this surface is a hyperplane, the classifier is called a linear classifier. The model assigns a class label to an input x based on which side of the hyperplane it falls on:
Classifying a new point therefore reduces to evaluating the sign of w . x + b, a single dot product and comparison [3][7]. Several widely used algorithms produce hyperplane decision boundaries, as summarized below.
| Algorithm | How it finds the hyperplane | Key characteristic |
|---|---|---|
| Perceptron | Iteratively adjusts weights using misclassified points | Converges if data is linearly separable; no unique solution |
| Logistic regression | Maximizes the log-likelihood of class labels via a sigmoid function | Outputs class probabilities; decision boundary at p = 0.5 |
| Support vector machine (SVM) | Maximizes the geometric margin between classes | Unique, optimal hyperplane determined by support vectors |
| Linear discriminant analysis (LDA) | Maximizes between-class variance relative to within-class variance | Also used for dimensionality reduction |
| Naive Bayes (Gaussian) | Derives boundary from posterior probability estimates | Linear boundary when class covariances are equal |
Support vector machines provide the most direct and well-studied application of hyperplanes in classification. The core idea is to find the hyperplane that maximizes the margin, the distance between the hyperplane and the nearest training points from each class. This maximum-margin hyperplane is the unique separating hyperplane that lies halfway between the two classes and is as far as possible from the closest point of either class [3][13].
The approach traces to Corinna Cortes and Vladimir Vapnik, whose 1995 paper "Support-Vector Networks" introduced the modern formulation. They described the method this way: "The support-vector network is a new learning machine for two-group classification problems" in which "input vectors are non-linearly mapped to a very high-dimension feature space" where "a linear decision surface is constructed" with "special properties" that "ensure high generalization ability of the learning machine" [3]. The earlier algorithm of Bernhard Boser, Isabelle Guyon, and Vapnik (1992) framed the same goal directly: "A training algorithm that maximizes the margin between the training patterns and the decision boundary is presented" [4].
Given a separating hyperplane w . x + b = 0, the geometric margin is the perpendicular distance from the hyperplane to the closest data point of either class. For a correctly classified point (x_i, y_i) with y_i in {+1, -1}, the functional margin is y_i(w . x_i + b). The geometric margin normalizes this by the magnitude of w:
gamma = y_i(w . x_i + b) / ||w||
The overall margin of the classifier is the minimum geometric margin across all training points [3][10].
When the training data is perfectly linearly separable, the hard-margin SVM solves the following optimization problem:
Minimize (1/2)||w||^2
Subject to y_i(w . x_i + b) >= 1 for all i = 1, ..., m
Minimizing ||w||^2 is equivalent to maximizing the margin 2/||w||. The constraint ensures that every training point lies on the correct side of the margin boundary [3][8].
Real-world data is rarely perfectly separable. Corinna Cortes and Vladimir Vapnik introduced the soft-margin formulation in 1995, which allows some points to violate the margin by introducing slack variables xi_i >= 0 [3]:
Minimize (1/2)||w||^2 + C * sum(xi_i)
Subject to y_i(w . x_i + b) >= 1 - xi_i and xi_i >= 0
The regularization parameter C controls the trade-off between maximizing the margin and minimizing classification errors. A large C penalizes misclassifications heavily (narrower margin), while a small C allows more violations (wider margin).
The training points that lie exactly on the margin boundaries (where y_i(w . x_i + b) = 1) are called support vectors. These are the only data points that influence the position and orientation of the optimal hyperplane [3][10]. If a non-support-vector point were removed from the training set or moved (as long as it stays outside the margin), the resulting hyperplane would remain unchanged. In practice the number of support vectors is often a small fraction of the training set, which is why SVMs can be both compact and accurate.
The SVM optimization problem can be reformulated using Lagrange multipliers alpha_i into its dual form:
Maximize sum(alpha_i) - (1/2) * sum_i sum_j alpha_i alpha_j y_i y_j (x_i . x_j)
Subject to sum(alpha_i y_i) = 0 and 0 <= alpha_i <= C
The weight vector is recovered as w = sum(alpha_i y_i x_i). Points with alpha_i > 0 are the support vectors. The dual form is significant because the data appears only through dot products x_i . x_j, which enables the kernel trick [3][4].
Many real-world datasets are not linearly separable in the original input space. The kernel method addresses this by implicitly mapping data into a higher-dimensional feature space where a linear hyperplane can separate the classes. Bernhard Boser, Isabelle Guyon, and Vladimir Vapnik proposed this approach in 1992 [4].
A mapping function phi transforms each input x into a higher-dimensional space, x -> phi(x). In this transformed space, a linear hyperplane can separate data that was nonlinearly distributed in the original space. The kernel function K(x_i, x_j) = phi(x_i) . phi(x_j) computes the dot product in the higher-dimensional space without explicitly performing the transformation. This avoids the computational cost of working in a potentially infinite-dimensional space [4][9].
In the dual SVM formulation, every occurrence of the dot product x_i . x_j is simply replaced with K(x_i, x_j).
| Kernel | Formula | Description |
|---|---|---|
| Linear | K(x_i, x_j) = x_i . x_j | No transformation; equivalent to standard linear SVM |
| Polynomial | K(x_i, x_j) = (x_i . x_j + r)^d | Captures feature interactions up to degree d; r >= 0 |
| Radial basis function (RBF) / Gaussian | K(x_i, x_j) = exp(-gamma * | |
| Sigmoid | K(x_i, x_j) = tanh(kappa * x_i . x_j + c) | Mimics neural network activation; valid only for certain kappa, c |
The RBF kernel is the most widely used nonlinear kernel in practice because it can model complex decision boundaries while having only one hyperparameter (gamma) to tune [9].
The resulting decision boundary in the original input space is nonlinear (it can be a curve, a closed region, or any complex shape), but in the high-dimensional feature space it remains a flat hyperplane. The kernel trick therefore lets SVMs find nonlinear decision boundaries while retaining the mathematical elegance and optimization guarantees of linear hyperplane separation [4][9].
The perceptron, conceived by Frank Rosenblatt in 1957 and described in a 1958 paper, was one of the earliest algorithms to learn a hyperplane decision boundary [5]. A single-layer perceptron computes:
f(x) = sign(w . x + b)
The perceptron learning algorithm updates the weights each time it encounters a misclassified training example:
w <- w + eta * y_i * x_i
b <- b + eta * y_i
where eta is the learning rate. The Perceptron Convergence Theorem guarantees that if the training data is linearly separable, the algorithm will find a separating hyperplane in a finite number of iterations [2][5]. However, the perceptron provides no guarantee about the quality of the hyperplane it finds; unlike SVMs, it does not maximize the margin.
Multi-layer perceptrons (the building blocks of deep learning) compose multiple layers of linear transformations and nonlinear activation functions. Each neuron in a hidden layer defines a hyperplane, and the network as a whole learns to combine many such hyperplanes to form complex, nonlinear decision boundaries.
Logistic regression is a linear classifier that models the probability that an input belongs to the positive class using the sigmoid function:
P(y = 1 | x) = sigma(w . x + b) = 1 / (1 + exp(-(w . x + b)))
The decision boundary is the hyperplane where the predicted probability equals 0.5, which occurs when w . x + b = 0. Points on the positive side of the hyperplane receive predicted probabilities greater than 0.5 and are assigned to the positive class; points on the negative side are assigned to the negative class.
The decision threshold can be adjusted from 0.5 to change the trade-off between precision and recall. Changing the threshold effectively moves the decision boundary parallel to the original hyperplane.
Linear discriminant analysis (LDA), developed by Ronald Fisher in 1936, finds a hyperplane that separates classes by maximizing the ratio of between-class variance to within-class variance [6]. The projection direction w is chosen to maximize the Fisher criterion:
J(w) = (w^T S_B w) / (w^T S_W w)
where S_B is the between-class scatter matrix and S_W is the within-class scatter matrix. The optimal solution is w = inverse(S_W) (mu_1 - mu_2), where mu_1 and mu_2 are the class means. The resulting hyperplane is perpendicular to this direction.
LDA differs from SVM in that it models the class distributions (assuming Gaussian distributions with equal covariance), while SVM makes no distributional assumptions and focuses solely on the margin [6][7].
A single hyperplane can only separate two classes. To handle problems with K > 2 classes, several strategies exist.
| Strategy | Number of classifiers | How it works |
|---|---|---|
| One-vs-rest (OvR) | K | Each classifier separates one class from all others; assign the class whose classifier gives the highest score |
| One-vs-one (OvO) | K(K-1)/2 | Each classifier separates one pair of classes; assign the class that wins the most pairwise comparisons |
| Multiclass SVM (Crammer-Singer) | 1 (joint) | Optimizes a single objective over all K classes simultaneously |
For example, in a 5-class problem, OvR trains 5 binary classifiers (each learning one hyperplane), while OvO trains 10 binary classifiers.
The theoretical foundation for using hyperplanes as class separators comes from the hyperplane separation theorem in convex optimization. The theorem states [1][14]:
If A and B are two disjoint, nonempty convex sets in R^n, then there exists a nonzero vector v and a scalar c such that v . x >= c for all x in A and v . y <= c for all y in B.
This is the weak separation form. A stronger version adds that if both sets are closed and at least one is compact, then strict separation is possible with a gap between two parallel hyperplanes [1][14].
The theorem guarantees that whenever two classes form disjoint convex sets in feature space, a separating hyperplane exists. The SVM optimization problem then finds the one with the maximum margin.
Decision trees that split on a single feature at each node create axis-aligned hyperplanes. Each split is a hyperplane perpendicular to one feature axis. The overall decision boundary is a piecewise combination of these axis-aligned hyperplanes, forming rectangular regions in feature space. Oblique decision trees allow splits on linear combinations of features, producing arbitrarily oriented hyperplanes at each node.
Each neuron in a neural network computes w . x + b followed by a nonlinear activation function. Before the activation, the neuron defines a hyperplane in its input space. A single hidden layer with enough neurons can approximate any continuous decision boundary by composing many such hyperplanes. Deep neural networks create hierarchical compositions of hyperplanes across multiple layers, which enables them to learn highly complex, nonlinear boundaries.
Hyperplanes also play a role in dimensionality reduction. Principal component analysis (PCA) projects data onto a lower-dimensional subspace defined by hyperplanes. Random projection methods, supported by the Johnson-Lindenstrauss lemma, use random hyperplanes to project high-dimensional data into fewer dimensions while approximately preserving pairwise distances.
Spectral clustering and some variants of k-means use hyperplanes to partition feature space. In particular, max-margin clustering extends the SVM framework to unsupervised settings by finding the hyperplane that separates unlabeled data with the largest margin.
The margin is the perpendicular distance between the two parallel hyperplanes w . x + b = +1 and w . x + b = -1. These are called the margin boundaries. The distance between them is 2/||w|| [3][13].
To see why: pick any point x_plus on the positive margin boundary (w . x_plus + b = 1) and compute its distance to the negative margin boundary (w . x + b = -1). The distance is:
d = |(w . x_plus + b) - (-1)| / ||w|| = |1 - (-1)| / ||w|| = 2/||w||
Maximizing 2/||w|| is equivalent to minimizing ||w||^2, which is the SVM objective. A larger margin means the classifier is more robust to small perturbations in the test data, which generally leads to better generalization [3][10].
The concept of hyperplanes has roots in 19th-century mathematics and has been applied in machine learning since the mid-20th century.
| Year | Milestone |
|---|---|
| 1936 | Ronald Fisher develops linear discriminant analysis, using hyperplanes to separate biological classes [6] |
| 1957 | Frank Rosenblatt introduces the perceptron, one of the first algorithms to learn a hyperplane from data (1958 paper) [5] |
| 1963 | Vladimir Vapnik and Alexey Chervonenkis propose the original SVM algorithm based on optimal separating hyperplanes [2] |
| 1992 | Boser, Guyon, and Vapnik introduce the kernel trick, allowing SVMs to find nonlinear boundaries via hyperplanes in transformed feature spaces [4] |
| 1995 | Cortes and Vapnik publish the soft-margin SVM formulation, handling non-separable data [3] |
| 1998 | SVMs gain widespread adoption in text classification, bioinformatics, and computer vision [10] |
The position and orientation of the optimal hyperplane depend on the scale of each feature. Features with larger numeric ranges can dominate the dot product w . x, causing the hyperplane to be biased toward those features. Normalization or standardization of features before training is therefore important for algorithms that rely on hyperplane decision boundaries.
In high-dimensional settings (where the number of features p is large relative to the number of samples n), linear classifiers can often find a separating hyperplane even when no real pattern exists in the data. This is because, in high dimensions, random point sets become increasingly easy to separate with a hyperplane (Cover's theorem) [11]. Regularization and cross-validation are needed to avoid overfitting in such scenarios.
Training a hard-margin SVM requires solving a quadratic programming problem. The time complexity is roughly O(n^2 * p) to O(n^3) depending on the solver, where n is the number of training samples and p is the number of features [10]. The kernel trick does not change the asymptotic complexity but affects the constant factor, since kernel evaluations replace simple dot products.
| Concept | Formula |
|---|---|
| Hyperplane equation | w . x + b = 0 |
| Signed distance from point x_0 | (w . x_0 + b) / |
| Margin width | 2 / |
| Hard-margin SVM objective | Minimize (1/2) |
| Soft-margin SVM objective | Minimize (1/2) |
| Weight vector from dual | w = sum(alpha_i y_i x_i) |
| Kernel substitution | Replace x_i . x_j with K(x_i, x_j) |
| Perceptron update rule | w <- w + eta * y_i * x_i |
| Fisher criterion (LDA) | J(w) = (w^T S_B w) / (w^T S_W w) |