See also: Classification Model, Support Vector Machine, Logistic Regression
A decision boundary (also called a decision surface) is a hypersurface in feature space that partitions data points into distinct class regions. In a classification model, the decision boundary represents the set of points where the model's predicted class changes from one label to another. Every classifier, whether a simple logistic regression or a deep neural network, implicitly or explicitly defines a decision boundary that determines how new, unseen data points are classified.
Formally, for a binary classifier with output function f(x), the decision boundary is the locus of points where f(x) = 0 (or equivalently, where the predicted probability equals the classification threshold). Points on one side of this surface are assigned to the positive class, and points on the other side are assigned to the negative class. The shape and complexity of this surface depend on the learning algorithm, the model's capacity, and the distribution of the training data.
The Bayes optimal decision boundary is the theoretically best possible boundary for a given classification problem. It arises from Bayesian decision theory, which defines the optimal classifier as the one that assigns each point x to the class with the highest posterior probability P(class | x). The decision boundary of this Bayes optimal classifier sits exactly where the posterior probabilities of two (or more) classes are equal.
For a binary problem, the Bayes optimal boundary is the set of all points where P(class = 1 | x) = P(class = 0 | x) = 0.5. Any departure from this boundary increases the overall classification error. The minimum error rate achievable by any classifier on a given data distribution is called the Bayes error rate; it is an irreducible error floor. The Bayes error rate is non-zero whenever class distributions overlap in feature space, meaning some points genuinely have a non-zero probability of belonging to more than one class.
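To make this concrete, the Bayes optimal boundary can be found numerically for a toy one-dimensional problem. The sketch below assumes (purely for illustration) two equal-prior Gaussian class-conditional densities, N(0, 1) and N(2, 1); the 0.5-posterior point then lands at the midpoint of the two means.

```python
import numpy as np

# Toy setup (assumed, not from real data): equal priors,
# class 0 ~ N(0, 1) and class 1 ~ N(2, 1).
def gauss_pdf(x, mu, sigma=1.0):
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def posterior_1(x):
    # P(class = 1 | x) via Bayes' rule with equal priors P(0) = P(1) = 0.5
    p0, p1 = gauss_pdf(x, 0.0), gauss_pdf(x, 2.0)
    return p1 / (p0 + p1)

# The Bayes optimal boundary is where the posterior crosses 0.5.
xs = np.linspace(-3.0, 5.0, 8001)
boundary = xs[np.argmin(np.abs(posterior_1(xs) - 0.5))]
print(round(boundary, 3))  # 1.0 -- the midpoint of the two means
```

Because the two Gaussians have equal variance and equal priors, the posteriors cross exactly halfway between the means; any classifier that places its boundary elsewhere must misclassify more probability mass.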
In practice, the true class-conditional distributions are unknown, so no real classifier can perfectly recover the Bayes optimal boundary. Instead, machine learning algorithms attempt to approximate it from finite training data. The quality of this approximation depends on the algorithm's inductive bias, the amount of training data, and the complexity of the true boundary.
A linear decision boundary is a straight line (in two dimensions), a plane (in three dimensions), or a hyperplane (in higher dimensions) that separates the feature space into two half-spaces, one for each class. A boundary is linear when it can be expressed as a weighted sum of features equal to a constant:
w₁x₁ + w₂x₂ + ... + wₙxₙ + b = 0
where w is the weight vector, x is the feature vector, and b is the bias term.
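As a minimal numerical illustration (with made-up weights and bias, not learned ones), classifying with a linear boundary reduces to checking the sign of w · x + b:

```python
import numpy as np

# Hypothetical parameters for a 2-feature linear classifier (illustrative only)
w = np.array([2.0, -1.0])  # weight vector
b = -3.0                   # bias term

def predict(x):
    # The boundary is the set of points where w . x + b = 0;
    # the sign of that expression decides which side (class) x falls on.
    return 1 if np.dot(w, x) + b >= 0 else 0

print(predict(np.array([3.0, 1.0])))  # 2*3 - 1*1 - 3 =  2 >= 0 -> class 1
print(predict(np.array([1.0, 2.0])))  # 2*1 - 1*2 - 3 = -3 <  0 -> class 0
```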
Several classical algorithms produce linear decision boundaries:
| Algorithm | How It Forms the Boundary | Key Characteristic |
|---|---|---|
| Logistic Regression | Finds weights that maximize the likelihood of observed class labels; the boundary sits where the predicted probability equals the threshold (default 0.5) | Outputs calibrated probabilities via the sigmoid function |
| Linear SVM | Finds the hyperplane that maximizes the margin between the two closest points of opposite classes | Maximizes geometric margin |
| Perceptron | Iteratively adjusts weights when a training point is misclassified until all points are correctly separated | Guaranteed convergence only for linearly separable data |
| Linear Discriminant Analysis (LDA) | Projects data onto a lower-dimensional space and finds the boundary that maximizes class separability | Assumes Gaussian class distributions with equal covariance |
Linear decision boundaries are computationally efficient and easy to interpret. They work well when the underlying data is approximately linearly separable, meaning a single flat surface can adequately separate the classes. However, many real-world datasets contain overlapping or interleaved class distributions that cannot be divided by a flat surface, which motivates the use of nonlinear methods.
It is worth noting that logistic regression, despite using a nonlinear sigmoid function to map its output to probabilities, produces a linear decision boundary. This is because the sigmoid function is monotonic: the boundary occurs where the linear combination of features equals zero (w · x + b = 0), which is a hyperplane regardless of the nonlinear squashing applied afterward.
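A quick numerical check of this point, using arbitrary illustrative weights: the sigmoid of zero is exactly 0.5, so any point lying on the hyperplane w · x + b = 0 receives a predicted probability of exactly 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative (not learned) parameters
w, b = np.array([1.5, -2.0]), 0.5

def prob_positive(x):
    return sigmoid(np.dot(w, x) + b)

# sigmoid is monotonic, so sigmoid(z) >= 0.5 exactly when z >= 0:
# the 0.5-probability contour IS the hyperplane w . x + b = 0.
x_on_boundary = np.array([1.0, 1.0])   # 1.5 - 2.0 + 0.5 = 0
print(prob_positive(x_on_boundary))    # 0.5
```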
A nonlinear decision boundary is a curved, bent, or otherwise non-flat surface that separates classes in feature space. Nonlinear boundaries are necessary when the data distribution is too complex for a single hyperplane to achieve adequate separation. Several approaches produce nonlinear decision boundaries.
One straightforward approach is to augment the original features with polynomial terms (such as x₁², x₁x₂, x₂²) and then apply a linear classifier in the expanded feature space. The linear boundary in the higher-dimensional space corresponds to a curved boundary when projected back into the original feature space. For example, a logistic regression model trained on quadratic features can learn circular or elliptical decision boundaries.
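A minimal sketch of this idea in scikit-learn, using synthetic data labeled by a circular rule (an assumption made here for illustration): adding degree-2 polynomial features lets a plain logistic regression recover a circular boundary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic data: class 1 inside the unit circle, class 0 outside
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# Quadratic features (x1^2, x1*x2, x2^2, ...) make the circular boundary
# a hyperplane in the expanded space, so a linear model can fit it.
model = make_pipeline(PolynomialFeatures(degree=2),
                      LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.score(X, y))  # close to 1.0 on this separable toy problem
```

The learned boundary is linear in the expanded feature space but projects back to a (near-)circle in the original two dimensions.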
The kernel trick allows support vector machines to find nonlinear decision boundaries without explicitly computing the coordinates in a high-dimensional feature space. A kernel function K(xᵢ, xⱼ) computes the inner product of two data points in the transformed space using only their original coordinates. Common kernels include:
| Kernel | Formula | Boundary Shape |
|---|---|---|
| Polynomial | K(x, y) = (x · y + c)^d | Curves of degree d |
| Radial Basis Function (RBF) | K(x, y) = exp(-γ ‖x - y‖²) | Smooth, flexible contours |
| Sigmoid | K(x, y) = tanh(αx · y + c) | Similar to neural network activation |
The RBF kernel is particularly popular because it maps data into an infinite-dimensional feature space and can model highly complex boundaries. The parameter γ controls the influence radius of each support vector: a large γ creates tightly curved boundaries around individual points, while a small γ produces smoother, broader boundaries.
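The effect of γ can be sketched on a synthetic two-moons dataset (the dataset and γ values are arbitrary choices for illustration): a large γ lets the RBF SVM wrap a tightly curved boundary around the training points, while a small γ yields a smoother, broader fit.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons

# Synthetic noisy dataset for illustration
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Small gamma -> broad influence per support vector, smooth boundary;
# large gamma -> narrow influence, tightly curved boundary that can memorize noise.
for gamma in (0.1, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    print(gamma, round(clf.score(X, y), 3))
```

The near-perfect training accuracy at large γ is a warning sign, not a virtue: the tightly curved boundary is likely fitting the noise.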
A neural network with one or more hidden layers can approximate arbitrarily complex decision boundaries. Each layer applies a nonlinear activation function (such as ReLU or sigmoid) to a linear transformation of its inputs. The composition of multiple such layers enables the network to carve out intricate, highly nonlinear regions in feature space. By the universal approximation theorem, a feedforward network with a single hidden layer containing a sufficient number of neurons can approximate any continuous function on a compact subset of Euclidean space to arbitrary accuracy, which means it can approximate any continuous decision boundary.
Deep neural networks with many layers can learn hierarchical representations that capture complex patterns at multiple scales, making them especially effective for high-dimensional data such as images, audio, and text. Research has also drawn formal connections between the multi-layer nonlinear feature transformations in deep networks and kernel feature mappings, revealing conceptual similarities between how kernel SVMs and neural networks construct nonlinear decision boundaries.
A decision tree creates a decision boundary by recursively splitting the feature space along individual feature axes. Each internal node tests a single feature against a threshold, producing axis-aligned (orthogonal) splits. The result is a piecewise boundary made up of horizontal and vertical segments (in two dimensions) or hyper-rectangular regions (in higher dimensions). While each individual split is simple, a sufficiently deep tree can approximate complex boundaries through a staircase-like pattern of many small rectangular partitions.
Ensemble methods like random forests and gradient boosting combine many decision trees. Because each tree in the ensemble contributes its own set of axis-aligned splits, the combined boundary of the ensemble is much smoother and more flexible than that of any single tree. Random forests, for example, average the predictions of hundreds of trees, each trained on a different bootstrap sample, producing a decision boundary that effectively approximates curved surfaces despite being composed of rectangular segments.
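The staircase approximation described above is easy to demonstrate on synthetic data whose true boundary is diagonal (an assumed toy setup), since no single axis-aligned split can match a diagonal line:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: true boundary is the diagonal y = x
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(500, 2))
y = (X[:, 1] > X[:, 0]).astype(int)

# Depth 1 allows a single axis-aligned split (a poor fit to a diagonal);
# deeper trees stack many small rectangular splits into a staircase.
for depth in (1, 3, 8):
    acc = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y).score(X, y)
    print(depth, round(acc, 3))
```

Training accuracy rises with depth because each added level refines the staircase; a forest of such trees would smooth the staircase further by averaging.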
The K-nearest neighbors (KNN) algorithm defines its decision boundary implicitly through a majority vote of the k closest training points. Unlike parametric models, KNN does not learn fixed parameters; instead, the boundary is entirely determined by the training data and the choice of k.
The value of k has a profound effect on boundary complexity:
| Value of k | Boundary Behavior | Trade-off |
|---|---|---|
| k = 1 | The boundary corresponds to the Voronoi diagram of the training points, creating highly irregular, jagged regions around each individual sample | Low bias, high variance |
| Small k (e.g., 3 or 5) | Flexible boundary that follows local structure in the data closely | Risk of overfitting to noise |
| Large k | Smoother, more regularized boundary that averages over many neighbors | Risk of underfitting; boundaries between classes become less distinct |
As k increases toward the total number of training points, KNN converges to simply predicting the majority class everywhere, and the decision boundary disappears entirely. Choosing an appropriate k through cross-validation is essential for balancing boundary flexibility against generalization.
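A small sketch of this effect on a noisy synthetic dataset (dataset and k values chosen arbitrarily for illustration): with k = 1 every training point is its own nearest neighbor, so the training set is memorized exactly, while a larger k averages over neighbors and smooths the boundary.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_moons

# Synthetic noisy dataset for illustration
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# k = 1 memorizes the training data (training accuracy 1.0);
# k = 15 regularizes the boundary and tolerates some training errors.
for k in (1, 15):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X, y).score(X, y)
    print(k, round(acc, 3))
```

Training accuracy alone is misleading here; cross-validation, as noted above, is the appropriate way to choose k.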
In support vector machines, the margin is the distance between the decision boundary and the nearest data points from either class. These nearest points are called support vectors, and they are the only training examples that influence the position and orientation of the boundary. All other data points could be moved or removed without changing the decision boundary (provided they remain outside the margin), which is a distinctive property of SVMs.
The SVM training objective is to find the hyperplane that maximizes this margin. The intuition is that a larger margin provides a greater "safety buffer" for classification: points near the boundary represent uncertain predictions (roughly a 50% chance of belonging to either class), so pushing the boundary as far as possible from training points reduces the chance of misclassifying slightly noisy or shifted test points. This principle is why SVMs are sometimes called maximum-margin classifiers.
Two main formulations exist:
| Formulation | Description | When to Use |
|---|---|---|
| Hard margin | Requires all training points to be correctly classified and lie outside the margin | Data is perfectly linearly separable with no noise |
| Soft margin | Allows some training points to violate the margin or be misclassified, controlled by a regularization parameter C | Data has noise, outliers, or is not perfectly separable |
The soft margin formulation introduces slack variables that permit controlled violations. A large C penalizes misclassifications heavily, producing a narrow margin that fits the training data closely. A small C tolerates more misclassifications, producing a wider margin that generalizes better to unseen data. The relationship between C and margin width is inverse: increasing C tightens the margin and makes the boundary more sensitive to individual data points, while decreasing C widens it and promotes smoother boundaries.
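One way to see this inverse relationship is to count support vectors at different values of C on overlapping synthetic blobs (an assumed toy dataset): a wider margin (small C) pulls more points inside the margin, and every such point becomes a support vector. A rough sketch:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Overlapping synthetic blobs: not perfectly separable, so the soft margin matters
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

# Small C -> wide margin -> many margin violations -> many support vectors;
# large C -> narrow margin -> fewer support vectors.
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, len(clf.support_))
```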
The terms "decision boundary" and "classification threshold" are related but refer to different concepts. The decision boundary is a geometric surface in feature space that may be a line, plane, curve, or complex manifold depending on the model. The classification threshold is a scalar probability value (commonly 0.5) used to convert a model's predicted probability into a discrete class label.
For a binary logistic regression model, the decision boundary in feature space corresponds to the set of points where the sigmoid output equals the chosen threshold. When the threshold is 0.5, the boundary sits where the linear combination of features equals zero (w · x + b = 0). If the threshold is changed to, say, 0.3, the boundary shifts so that the model predicts the positive class more aggressively (at lower predicted probabilities), and the geometric decision boundary in feature space moves accordingly.
Adjusting the classification threshold is a common technique for handling class imbalance or for tuning the tradeoff between precision and recall. Lowering the threshold increases recall (more positive predictions) at the cost of precision, while raising it increases precision at the cost of recall. Importantly, the model's learned parameters do not change when the threshold is adjusted; only the location of the decision boundary in feature space shifts.
| Threshold Change | Effect on Boundary | Effect on Metrics |
|---|---|---|
| Lowered (e.g., 0.5 to 0.3) | Boundary shifts to include more points in the positive class | Higher recall, lower precision |
| Raised (e.g., 0.5 to 0.7) | Boundary shifts to include fewer points in the positive class | Higher precision, lower recall |
| Default (0.5) | Boundary at the natural midpoint of predicted probabilities | Balanced trade-off (model-dependent) |
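The behavior summarized above can be sketched on synthetic imbalanced data (dataset and thresholds are illustrative choices): lowering the threshold from 0.5 to 0.3 flags more points as positive, while the learned weights stay untouched.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic imbalanced dataset: roughly 80% negatives, 20% positives
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X)[:, 1]   # P(class = 1 | x)

# Only the cut on the predicted probability moves; the model is unchanged.
pos_at_050 = (proba >= 0.5).sum()
pos_at_030 = (proba >= 0.3).sum()
print(pos_at_050, pos_at_030)  # the lower threshold yields at least as many positives
```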
When a classification problem involves more than two classes, the feature space must be divided into multiple regions, one for each class. The decision boundaries between these regions form a set of surfaces that collectively partition the space.
In the one-vs-rest strategy, a separate binary classifier is trained for each class, treating that class as positive and all others as negative. Each classifier defines its own decision boundary. A new data point is assigned to the class whose classifier outputs the highest confidence score. One limitation of this approach is the creation of ambiguous regions where either multiple classifiers claim a point as positive or no classifier does. In such cases, the point is typically assigned to the class with the highest raw decision function value.
The one-vs-one strategy trains a binary classifier for every pair of classes, resulting in K(K-1)/2 classifiers for K classes. Each new point is classified by majority vote among all pairwise classifiers. This approach avoids some of the ambiguity problems of one-vs-rest but requires training substantially more classifiers.
Algorithms that natively support multi-class classification, such as neural networks with a softmax output layer or multinomial logistic regression, compute class probabilities simultaneously. The decision boundary between any two classes i and j is the surface where P(class = i | x) = P(class = j | x). All boundaries are determined jointly, which often leads to more coherent and consistent class regions than the one-vs-rest or one-vs-one decomposition approaches. Because the softmax function ensures all class probabilities sum to one, the resulting boundaries are globally consistent and typically produce better-calibrated probability estimates.
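A small check of this property using scikit-learn's multinomial logistic regression on the Iris dataset: because the class probabilities come out of a softmax jointly, every row of predicted probabilities sums to one by construction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)  # 3 classes

# Multinomial (softmax) logistic regression computes all class
# probabilities simultaneously rather than via pairwise decompositions.
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X[:5])
print(proba.sum(axis=1))  # each row sums to 1
```

The boundary between classes i and j is then implicitly the set of points where the i-th and j-th probabilities (rows of this joint output) are equal and maximal.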
The complexity of a model's decision boundary is closely tied to the bias-variance tradeoff and the risk of overfitting.
A model with high bias (such as a linear classifier applied to a problem with a nonlinear class structure) produces an overly simple boundary that cannot capture the true class structure. This leads to underfitting, where the model performs poorly on both training and test data.
A model with high variance (such as a deep decision tree with no pruning or an SVM with a very large γ) produces an overly complex boundary that conforms tightly to the training data, including its noise. This leads to overfitting, where the model achieves low training error but high test error because the boundary does not generalize.
The goal is to find a boundary complex enough to capture the true underlying class structure but smooth enough to generalize to unseen data. Several techniques help control boundary complexity:
| Technique | Effect on Decision Boundary |
|---|---|
| Regularization (L1, L2) | Penalizes large weights, smoothing the boundary |
| Cross-validation | Selects hyperparameters that balance training and validation performance |
| Early stopping | Halts training before the model memorizes noise |
| Pruning (decision trees) | Removes splits that do not improve generalization |
| Dropout (neural networks) | Randomly deactivates neurons during training, preventing co-adaptation |
| Reducing model capacity | Fewer layers, neurons, or polynomial degree limits boundary flexibility |
| Bagging and ensembling | Averages predictions from multiple models, smoothing out individual boundary irregularities |
As a rule of thumb, if a decision boundary wraps tightly around every training observation, the model is almost certainly overfitting. A well-generalizing boundary should capture broad class structure while tolerating some training errors in regions where classes naturally overlap.
Most real-world classification problems involve many features, placing the decision boundary in a high-dimensional space. While the mathematical definition of the boundary remains the same (the surface where the classifier's prediction changes), high-dimensional boundaries present unique challenges.
The curse of dimensionality means that as the number of features grows, the volume of the feature space increases exponentially, and training data becomes increasingly sparse. A model that works well with 1,000 samples in 10 dimensions might need millions of samples to achieve similar performance in 100 dimensions. Sparse data makes it harder to estimate the true decision boundary accurately, which increases the risk of overfitting.
Visualization of decision boundaries is straightforward in two or three dimensions but becomes impractical in higher-dimensional spaces, where practitioners typically resort to visualizing boundaries over selected pairs of features or over low-dimensional projections of the data.
In two dimensions, the standard approach for visualizing a decision boundary is to create a fine grid of points spanning the feature space, classify each grid point using the trained model, and color-code the regions by predicted class. The boundary appears as the border between differently colored regions.
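The grid approach can be sketched as follows (using a synthetic dataset and a logistic regression as stand-ins; the resulting class map would typically be handed to a contour-plotting routine such as matplotlib's `contourf`):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_moons

# Synthetic 2-D dataset and an arbitrary classifier for illustration
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Build a fine grid spanning the feature space and classify every grid point;
# the decision boundary is the border between the two predicted regions.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
Z = clf.predict(grid).reshape(xx.shape)
print(Z.shape)  # a 200x200 class map, one predicted label per grid point
```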
Popular tools for decision boundary visualization include:
- The `DecisionBoundaryDisplay` class in scikit-learn provides a convenient method for plotting the decision boundary of any classifier.
- The `plot_decision_regions` function in the mlxtend library renders multi-class boundaries with overlaid training points.

Visualization is a valuable tool for model selection and debugging. By comparing decision boundaries across different algorithms or hyperparameter settings, practitioners can quickly identify whether a model is underfitting (too smooth a boundary) or overfitting (too jagged a boundary).
Imagine you have a big box of red and blue marbles scattered on a table. You want to draw a line so that all the red marbles end up on one side and all the blue marbles end up on the other side. That line is the decision boundary.
Sometimes a straight line works perfectly. But sometimes the marbles are all mixed together in a swirly pattern, and you need a curvy line to separate them. In machine learning, simple models draw straight lines, while more powerful models can draw wavy, curvy lines to separate tricky patterns.
The important thing is not to make the line too wiggly. If you try to curve around every single marble perfectly, you might memorize where today's marbles are but do a bad job when someone dumps new marbles on the table tomorrow. A good decision boundary is one that separates the colors well without being overly complicated.