Outliers
Last reviewed
Apr 30, 2026
Sources
30 citations
Review status
Source-backed
Revision
v4 · 7,391 words
See also: machine learning terms, anomaly detection, robust statistics, data preprocessing
In statistics and machine learning, an outlier is a data point that differs significantly from the bulk of the observations in a dataset. The word itself comes from the idea that the value lies outside the range that the rest of the sample suggests is plausible. Outliers can arise from genuine phenomena, such as a record-breaking earthquake or a fraudulent credit card charge, or from errors in measurement, data entry, sensor faults, or transmission noise. The choice between treating a suspicious point as a real signal or as contamination drives the entire field of anomaly detection and a large body of work in robust statistics.
The statistician John Tukey defined an outlier informally in his 1977 book Exploratory Data Analysis as any point that falls beyond a fence drawn at 1.5 times the interquartile range (IQR) from the first or third quartile. This rule of thumb still drives the whiskers in modern boxplots and remains one of the first checks performed during data exploration. Earlier work by Frank Anscombe in 1960 had already framed the underlying tension: rejection rules trade off the cost of discarding good data against the cost of letting bad data corrupt downstream analysis.
Outliers matter because many statistical and machine learning algorithms are not robust to extreme values. A single point with a large residual can shift an ordinary least squares regression line by several standard deviations, distort principal component directions, pull centroids in k-means clustering toward empty regions of feature space, or dominate the loss function during training of a neural network. At the other end of the spectrum, the same extreme value may be the most important observation in the dataset; in fraud detection, intrusion detection, predictive maintenance, and rare-disease screening, the outliers are the entire reason the analysis exists.
This article covers the formal types of outliers, the major detection algorithms grouped by family, software libraries that implement them, treatment strategies, and the relationship between outlier analysis and the broader field of anomaly detection.
A standard definition, due to Douglas Hawkins in his 1980 monograph Identification of Outliers, is that an outlier is "an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism." Hawkins's definition emphasizes the data-generating process: outliers are points that the assumed model cannot easily produce.
Three practical implications follow from this definition. First, whether a point is an outlier depends on the model. A value of 200 cm for adult height is unusual under a Gaussian model fit to a general population but unremarkable inside a sample of professional basketball players. Second, outliers can be informative or pathological. Whether to keep, transform, or remove them depends on whether the unusual mechanism is the phenomenon of interest. Third, outlier identification is inherently statistical; it carries error rates and requires choices about thresholds and significance.
A related distinction is between univariate outliers, which are extreme on a single variable, and multivariate outliers, which are extreme in the joint distribution of several variables even though each marginal value looks ordinary. A person 180 cm tall and weighing 50 kg has unremarkable height and unremarkable weight separately, but the combination is unusual and shows up only in two-dimensional analysis.
Researchers in anomaly detection commonly distinguish three types of outliers based on the structure of the surrounding data. The taxonomy was popularized by Chandola, Banerjee, and Kumar in their 2009 ACM Computing Surveys article on anomaly detection.
| Type | Definition | Example |
|---|---|---|
| Point (global) outlier | A single observation that is anomalous with respect to the entire dataset | A credit card charge of $50,000 in a stream where the typical transaction is $50 |
| Contextual (conditional) outlier | A point that is anomalous only within a specific context, such as time of day or location | A temperature of 30 degrees Celsius is normal in summer but anomalous in winter |
| Collective outlier | A group of related points that together deviate from the rest of the data, even if each individual point looks normal | A long sequence of identical packets in a network capture that signals a denial-of-service attack |
Point outliers are the easiest to detect and the focus of most classical statistical tests. Contextual outliers require defining the context, usually by splitting the features into contextual attributes (time, location, user identity) that define the neighborhood and behavioral attributes (the variable being measured) that are judged within it. Collective outliers usually appear in sequence, time series, graph, or spatial data, where the relationships between points carry as much information as the points themselves.
There is also a distinction between inliers and outliers. An inlier is a point that lies in the dense region of the data distribution. Some methods, such as one-class classification, model the inliers explicitly and flag everything else as a potential outlier.
The practical question of how to handle an outlier often turns on its source. Common sources include:
Most real datasets contain a mix of all five sources, which is one reason that purely statistical detection is rarely the whole story.
Outlier detection algorithms fall into a small number of families with different assumptions, computational profiles, and failure modes. The main families are summarized below.
| Family | Core idea | Representative algorithms | Typical assumption |
|---|---|---|---|
| Statistical | Fit a probability model and flag low-likelihood points | Z-score, modified Z-score, Tukey's IQR fence, Grubbs' test, Dixon's Q test | Data follow a known parametric distribution |
| Distance-based | Flag points that are far from their neighbors | k-Nearest Neighbors (kNN) outlier score, distance to k-th nearest neighbor | Outliers are isolated in feature space |
| Density-based | Flag points whose local density is much lower than that of their neighbors | Local Outlier Factor (LOF), DBSCAN noise points, COF | Normal points lie in dense regions |
| Tree-based | Use random space partitions and exploit the fact that anomalies are easier to isolate | Isolation Forest, Extended Isolation Forest | Anomalies are few and different |
| Linear/subspace | Project data onto a subspace and measure reconstruction error | PCA reconstruction, robust PCA, EllipticEnvelope (Mahalanobis distance) | Data lie near a low-dimensional subspace |
| Neural | Train a neural network to model the normal class and flag deviations | Autoencoder reconstruction error, Deep SVDD, GAN-based methods | Sufficient training data exists |
| Ensemble | Combine multiple detectors to reduce variance and improve robustness | Feature Bagging, LODA, SUOD | Individual detectors are diverse |
| One-class classification | Learn a boundary around normal data and flag points outside | One-Class SVM, Deep SVDD | Normal class is well-defined |
Within each family, methods vary along several axes: supervised vs. unsupervised, parametric vs. nonparametric, point vs. structural, batch vs. streaming, and global vs. local. The right choice depends on the dimensionality of the data, the density of the normal class, the rate of contamination, and whether labeled examples of anomalies are available.
Statistical outlier detection has the longest history of any approach and remains the default choice for low-dimensional, well-behaved data.
The Z-score expresses how many standard deviations a value lies from the sample mean:
Z = (x - mean) / standard_deviation
A common rule flags values with absolute Z-score greater than 3, which corresponds to roughly 0.27% of observations under a Gaussian distribution. The Z-score is fast and easy to compute but has two well-known weaknesses. First, the sample mean and sample standard deviation are themselves not robust; a single extreme point inflates the standard deviation and thereby reduces the Z-score of every other point, masking additional outliers. Second, the rule assumes Gaussian tails, which are too thin for many real distributions.
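The masking weakness can be illustrated with a small sketch. The sample values and the helper name `z_scores` are hypothetical; the point is only that one gross error is flagged while three identical gross errors inflate the standard deviation enough to hide one another.

```python
import numpy as np

def z_scores(values):
    """Classical Z-scores from the sample mean and sample standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)

# Hypothetical data: 27 ordinary readings near 10, then gross errors at 50.
clean = np.linspace(9.5, 10.5, 27)
one_error = np.append(clean, 50.0)
three_errors = np.append(clean, [50.0, 50.0, 50.0])

flag_one = np.abs(z_scores(one_error)) > 3
flag_three = np.abs(z_scores(three_errors)) > 3

# The single error is flagged; the three errors mask each other.
print(int(flag_one.sum()), int(flag_three.sum()))
```

With three errors the standard deviation grows to roughly 12, so each error sits just under three standard deviations from the mean and the rule flags nothing.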
Iglewicz and Hoaglin proposed a modified Z-score that replaces the mean with the median and the standard deviation with the median absolute deviation (MAD):
Modified Z = 0.6745 * (x - median) / MAD
The constant 0.6745 makes the modified Z-score consistent with the standard Z-score under a Gaussian distribution. Iglewicz and Hoaglin recommended flagging values with absolute modified Z-score greater than 3.5. Because the median and the MAD have a 50% breakdown point, this method is far more resistant to the masking effect that contaminates the standard Z-score.
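A minimal sketch of the modified Z-score, assuming the 3.5 cutoff quoted above. The function name `modified_z_scores` and the sample are illustrative, and the zero-MAD guard is a choice made for this sketch rather than part of the published rule.

```python
import numpy as np

def modified_z_scores(values):
    """Modified Z-scores based on the median and the median absolute deviation."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # Assumption of this sketch: refuse to divide by a zero MAD.
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return 0.6745 * (values - median) / mad

# Hypothetical data: 27 ordinary readings near 10 plus three gross errors.
sample = np.append(np.linspace(9.5, 10.5, 27), [50.0, 50.0, 50.0])
flags = np.abs(modified_z_scores(sample)) > 3.5
print(int(flags.sum()))
```

Because the median and MAD barely move when the three errors are added, all three are flagged, which is the resistance to masking described above.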
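John Tukey's 1977 boxplot uses the interquartile range (IQR), defined as the difference between the third quartile (Q3) and the first quartile (Q1). Tukey's fences are placed at Q1 - 1.5 × IQR and Q3 + 1.5 × IQR (the inner fences) and at Q1 - 3 × IQR and Q3 + 3 × IQR (the outer fences).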
Values outside the inner fences are typically flagged as mild outliers; values outside the outer fences are flagged as extreme outliers. The rule does not depend on the mean or the standard deviation, so it is robust to contamination, and it underlies the whiskers and dots drawn in standard boxplot software.
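The fence rule can be sketched as follows, assuming the conventional multipliers of 1.5 for the inner fences and 3 for the outer fences, and NumPy's default linear-interpolation quartiles (which can differ slightly from Tukey's hinges on small samples). The name `tukey_fences` and the sample values are illustrative.

```python
import numpy as np

def tukey_fences(values, k_inner=1.5, k_outer=3.0):
    """Return ((inner_low, inner_high), (outer_low, outer_high))."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    inner = (q1 - k_inner * iqr, q3 + k_inner * iqr)
    outer = (q1 - k_outer * iqr, q3 + k_outer * iqr)
    return inner, outer

# Hypothetical sample with one mild and one extreme value.
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 40], dtype=float)
inner, outer = tukey_fences(sample)

outside_inner = (sample < inner[0]) | (sample > inner[1])
outside_outer = (sample < outer[0]) | (sample > outer[1])
mild = sample[outside_inner & ~outside_outer]   # beyond inner, within outer
extreme = sample[outside_outer]                 # beyond outer fences
print(inner, outer, mild, extreme)
```

For this sample the quartiles are 3.5 and 8.5, so the inner fences are -4 and 16 and the outer fences are -11.5 and 23.5; the value 20 is a mild outlier and 40 is an extreme one.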
Grubbs' test, also called the maximum normalized residual test, is designed to detect a single outlier in a univariate sample drawn from an approximately Gaussian distribution. The test statistic is:
G = max |x - mean| / standard_deviation
The critical value depends on the sample size and the chosen significance level. Grubbs' test can be applied iteratively after removing the most extreme point, but it is sensitive to the masking effect when more than one outlier is present and tends to lose power as the contamination rate grows.
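A sketch of the two-sided Grubbs' test, assuming SciPy is available and using the standard critical-value formula based on the Student t distribution. The function name `grubbs_test`, the significance level, and the sample are illustrative.

```python
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier.

    Returns (G, critical value, whether G exceeds the critical value).
    """
    x = np.asarray(values, dtype=float)
    n = x.size
    if n < 3:
        raise ValueError("Grubbs' test needs at least 3 observations")
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    # Upper critical value of the t distribution with n - 2 degrees of freedom.
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return float(g), float(g_crit), bool(g > g_crit)

# Hypothetical sample: eight readings near 10 and one suspect value.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 14.0]
g, g_crit, is_outlier = grubbs_test(sample)
print(round(g, 3), round(g_crit, 3), is_outlier)
```

Note that G can never exceed (n - 1) / sqrt(n), so in very small samples even a gross error may fail to reach the critical value.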
Dixon's Q test, introduced by Wilfrid Dixon in 1950, is widely used in analytical chemistry for small samples (typically 3 to 30 measurements). The Q statistic is the gap between the suspected outlier and its nearest neighbor divided by the total range of the data. The value is compared against tabulated critical values. Dixon's Q is intended for a single suspected outlier in small samples; it should not be reused in a sequential strip-and-retest loop on the same data.
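The Q statistic itself is simple to compute. The sketch below returns the statistic for the smallest and the largest value; it deliberately omits the critical values, which must be looked up in a published table for the sample size and confidence level in use. The function name `dixon_q` and the measurements are illustrative.

```python
def dixon_q(values):
    """Return (Q for the smallest value, Q for the largest value)."""
    x = sorted(float(v) for v in values)
    if len(x) < 3:
        raise ValueError("Dixon's Q needs at least 3 observations")
    data_range = x[-1] - x[0]
    if data_range == 0:
        raise ValueError("all values are identical; Q is undefined")
    q_low = (x[1] - x[0]) / data_range     # gap below the second-smallest value
    q_high = (x[-1] - x[-2]) / data_range  # gap above the second-largest value
    return q_low, q_high

# Hypothetical replicate measurements with a suspiciously low reading (0.167).
measurements = [0.189, 0.167, 0.187, 0.183, 0.186,
                0.182, 0.181, 0.184, 0.181, 0.177]
q_low, q_high = dixon_q(measurements)
print(round(q_low, 3), round(q_high, 3))
```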
For multivariate Gaussian data, the Mahalanobis distance measures how far a point is from the sample mean while accounting for the covariance structure. Squared Mahalanobis distance follows a chi-squared distribution under the Gaussian assumption, which gives a principled cutoff. The estimator is sensitive to outliers in the sample mean and covariance, so robust covariance estimators such as the Minimum Covariance Determinant (MCD) of Rousseeuw are typically used in practice. Scikit-learn wraps this approach in sklearn.covariance.EllipticEnvelope.
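The sketch below illustrates the chi-squared cutoff on synthetic height and weight data. For brevity it uses the classical sample mean and covariance; in practice a robust estimate such as MCD would be substituted, as noted above. The data-generating parameters, the planted point, and the 97.5% quantile are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500

# Hypothetical population: height (cm) and a correlated weight (kg).
height = rng.normal(170.0, 8.0, size=n)
weight = 70.0 + 1.0 * (height - 170.0) + rng.normal(0.0, 6.0, size=n)
X = np.column_stack([height, weight])

# Planted point: tall but light, unremarkable on each variable separately.
X = np.vstack([X, [180.0, 50.0]])

mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)           # classical (non-robust) covariance
diff = X - mean
d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

# Squared distances are compared with a chi-squared quantile (2 degrees of freedom).
cutoff = stats.chi2.ppf(0.975, df=X.shape[1])
marginal_z = np.abs(diff[-1]) / np.sqrt(np.diag(cov))
print(marginal_z, d2[-1], cutoff)
```

The planted point stays within three standard deviations on each axis, yet its squared Mahalanobis distance is far above the cutoff because it contradicts the positive correlation between the two variables.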
Several other tests appear in the literature. The Generalized Extreme Studentized Deviate (ESD) test of Rosner (1983) extends Grubbs' test to detect multiple outliers without the masking issue. The Tietjen-Moore test detects a known number of outliers. Chauvenet's criterion and Peirce's criterion are older rules from physical sciences that are still occasionally cited but have been largely superseded by the methods above.
Distance-based methods make no parametric assumption. They flag a point as an outlier if it is far from its neighbors in the chosen distance metric.
The simplest approach, introduced by Knorr and Ng in 1998, defines a point as a DB(p, D)-outlier if at least a fraction p of the dataset lies farther than distance D from it. A common practical variant is to compute the distance from each point to its k-th nearest neighbor and rank points by that distance; points with the largest k-th nearest neighbor distance are the strongest outliers. Ramaswamy, Rastogi, and Shim refined the formulation in 2000 to rank points by distance rather than by a hard threshold.
Advantages include conceptual simplicity and a single hyperparameter k. The main disadvantages are quadratic time complexity in the number of points, sensitivity to the choice of distance metric in high dimensions (where the curse of dimensionality compresses pairwise distances), and difficulty handling clusters of varying density.
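A brute-force sketch of the k-th nearest neighbor score, which also makes the quadratic cost visible: the full pairwise distance matrix is built explicitly. The function name `kth_neighbor_distance` and the toy grid are illustrative, and Euclidean distance is assumed.

```python
import numpy as np

def kth_neighbor_distance(X, k):
    """Distance from each point to its k-th nearest neighbor (Euclidean)."""
    X = np.asarray(X, dtype=float)
    diffs = X[:, None, :] - X[None, :, :]          # O(n^2) memory and time
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the zero self-distance,
    # so column k holds the distance to the k-th nearest neighbor.
    return np.sort(dist, axis=1)[:, k]

# Hypothetical data: a 5 x 5 unit grid plus one isolated point.
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
X = np.vstack([grid, [[10.0, 10.0]]])

scores = kth_neighbor_distance(X, k=3)
top = int(np.argmax(scores))
print(top, scores[top])
```

The isolated point (index 25) receives the largest score. For larger datasets a spatial index such as a k-d tree or ball tree would normally replace the dense distance matrix.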
Kriegel, Schubert, and Zimek proposed the Angle-Based Outlier Detection (ABOD) method in 2008. ABOD measures the variance of angles between pairs of vectors from a query point to other points in the dataset. Points inside a cluster see a wide range of angles, while points in the periphery see a narrow range. ABOD performs better than distance-based methods in very high-dimensional spaces because angular variation is more stable than absolute distance.
Density-based methods improve on pure distance methods by considering local rather than global structure.
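Local Outlier Factor (LOF), introduced by Breunig, Kriegel, Ng, and Sander in 2000, was the first widely used local outlier detection method. LOF compares the local density of a point with the local density of its k-nearest neighbors. The score is computed in three steps: compute the reachability distance from the point to each of its k nearest neighbors, invert the mean of those distances to obtain the point's local reachability density, and average the ratio of each neighbor's local reachability density to the point's own.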
A LOF score around 1 indicates that the point has a density similar to its neighbors, so it is not an outlier. Scores significantly greater than 1 indicate that the point is in a much sparser region than its neighbors and is therefore a local outlier. LOF handles datasets where normal clusters have very different densities, which is its main advantage over global distance methods. Its drawback is sensitivity to the parameter k and quadratic complexity for naive implementations. Scikit-learn implements LOF as sklearn.neighbors.LocalOutlierFactor.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise), introduced by Ester, Kriegel, Sander, and Xu in 1996, is primarily a clustering algorithm but produces an outlier classification as a byproduct. DBSCAN labels each point as either a core point, a border point, or noise. Noise points are not reachable from any core point within the chosen radius and are effectively flagged as outliers. DBSCAN does not require specifying the number of clusters and naturally handles arbitrarily shaped clusters, but it requires careful tuning of the radius (epsilon) and minimum number of neighbors (minPts) parameters.
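A short sketch of using DBSCAN noise labels as outlier flags, assuming scikit-learn is available. The two grid-shaped clusters, the isolated points, and the `eps` and `min_samples` values are chosen for illustration only. In scikit-learn, noise points receive the label -1 and `min_samples` counts the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense 5 x 5 grids (spacing 0.5) and two isolated points.
grid = np.array([[0.5 * i, 0.5 * j] for i in range(5) for j in range(5)])
cluster_a = grid
cluster_b = grid + 10.0
isolated = np.array([[5.0, 5.0], [20.0, 0.0]])
X = np.vstack([cluster_a, cluster_b, isolated])

# eps covers orthogonal (0.5) and diagonal (about 0.71) grid neighbors.
labels = DBSCAN(eps=0.75, min_samples=4).fit_predict(X)

noise_idx = np.flatnonzero(labels == -1)   # indices flagged as outliers
print(noise_idx, sorted(set(labels.tolist())))
```

Both grids are recovered as clusters, and only the two isolated points (indices 50 and 51) are labeled as noise.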
The Connectivity-Based Outlier Factor (COF), proposed by Tang, Chen, Fu, and Cheung in 2002, is a variant of LOF that is better suited to data lying on lower-dimensional manifolds. Instead of using direct k-nearest neighbor distances, COF uses chaining distance along the shortest path through the neighborhood graph.
Tree-based methods exploit the geometric intuition that anomalies are easier to isolate than normal points.
Isolation Forest, introduced by Liu, Ting, and Zhou at the IEEE International Conference on Data Mining in 2008, builds an ensemble of random binary trees called isolation trees. To build each tree, the algorithm recursively partitions the data by selecting a random feature and a random split point between the minimum and maximum values of that feature. Anomalies tend to be separated from the rest of the data after only a few splits, while normal points require many splits to isolate. The anomaly score for a point is derived from its average path length across the forest, normalized by the expected path length of an unsuccessful search in a binary search tree.
Isolation Forest has linear time complexity, low memory requirements, scales to high-dimensional data, and does not require a distance metric, all of which helped make it one of the most widely used outlier detection algorithms in the years following its publication. The Extended Isolation Forest of Hariri, Carrasco Kind, and Brunner (2019) addresses an axis-alignment artifact in the original method by using random hyperplanes for the splits. Scikit-learn implements the original algorithm as sklearn.ensemble.IsolationForest.
Linear methods assume that normal data lie close to a low-dimensional subspace.
Principal component analysis (PCA) projects data onto the directions of greatest variance. If most of the variance is explained by the top principal components, normal points can be reconstructed with low error from a reduced number of components, while outliers cannot. The reconstruction error then serves as an outlier score. Robust PCA, formulated as a low-rank-plus-sparse decomposition by Candes, Li, Ma, and Wright in 2011, separates the data matrix into a low-rank component representing the bulk of the data and a sparse component representing outliers.
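PCA reconstruction error can be sketched with a plain singular value decomposition. The synthetic data (points near a line in three dimensions plus one point off the line), the function name `pca_reconstruction_error`, and the choice of one retained component are assumptions for illustration. This is ordinary PCA, not the robust low-rank-plus-sparse decomposition.

```python
import numpy as np

def pca_reconstruction_error(X, n_components):
    """Squared reconstruction error of each row after projecting onto top PCs."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    reconstructed = centered @ components.T @ components
    return np.sum((centered - reconstructed) ** 2, axis=1)

rng = np.random.default_rng(1)
t = rng.normal(size=200)
# Hypothetical data lying close to the line spanned by (1, 2, -1).
X = np.column_stack([t, 2 * t, -t]) + rng.normal(scale=0.05, size=(200, 3))
X = np.vstack([X, [0.0, 0.0, 3.0]])     # one point well off the line

errors = pca_reconstruction_error(X, n_components=1)
print(int(np.argmax(errors)), errors[-1], np.median(errors))
```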
The One-Class Support Vector Machine, proposed by Scholkopf, Platt, Shawe-Taylor, Smola, and Williamson in 2001, learns a boundary around the normal data in feature space and flags points outside the boundary. With an RBF kernel, the boundary can take complex non-linear shapes. One-Class SVM is sensitive to the choice of the nu parameter, which sets an upper bound on the fraction of training points allowed to be outliers, and to the kernel bandwidth. Scikit-learn provides this method as sklearn.svm.OneClassSVM.
Neural network methods scale to very high-dimensional data such as images, audio, and text.
An autoencoder is a neural network that learns to compress an input through a low-dimensional bottleneck and reconstruct it. When trained on data drawn predominantly from the normal class, the network learns to reconstruct normal points well and reconstruct anomalies poorly. The reconstruction error (typically mean squared error between the input and its reconstruction) serves as an anomaly score. Variational autoencoders and denoising autoencoders are common variants. The technique was popularized for industrial fault detection in the early 2010s and remains the standard deep learning baseline.
Deep Support Vector Data Description (Deep SVDD), introduced by Lukas Ruff and colleagues at ICML 2018, generalizes the classical SVDD method of Tax and Duin to deep representation learning. Deep SVDD learns a feature mapping (typically a convolutional neural network for image data) and trains it to map normal data into the smallest possible hypersphere in feature space. The anomaly score of a point is the distance of its embedding from the center of the hypersphere. Deep SAD extends the framework to use a limited number of labeled anomalies. PatchCore and PaDiM take a related but distinct route for industrial defect detection, modeling features from pre-trained backbones rather than training the network end to end.
Generative adversarial network methods such as AnoGAN (Schlegl et al., 2017) and f-AnoGAN train a generator to produce normal samples and use the residual between an input and its closest reconstruction in the generator's range as the anomaly score. The approach was first applied to medical imaging for retinal disease detection.
More recent work uses self-supervised pretext tasks. CSI (Contrasting Shifted Instances) of Tack, Mo, Jeong, and Shin (NeurIPS 2020) uses contrastive learning with distribution-shifted samples. PANDA, ICL, and other methods adapt pretrained representations such as those from a vision transformer for industrial anomaly detection.
Ensemble methods combine multiple base detectors to improve robustness and reduce variance.
Feature Bagging, proposed by Lazarevic and Kumar in 2005, trains multiple base detectors (often LOF) on random subspaces of the original feature set and combines their scores by averaging or breadth-first ranking. The approach reduces the impact of irrelevant or noisy features.
Lightweight On-line Detector of Anomalies (LODA), introduced by Tomas Pevny in 2016, is a fast ensemble that uses random projections to one-dimensional histograms and aggregates the log-densities. LODA scales linearly with the number of points and features, supports streaming updates, and provides feature-importance scores for explaining detected anomalies.
SUOD, introduced by Zhao and colleagues in 2021, is an acceleration framework that runs many heterogeneous base detectors in parallel and uses model approximation to reduce per-detector cost. It is available in PyOD as the SUOD model.
The LOF score for a point p with parameter k is computed as follows. Let k-distance(p) denote the distance from p to its k-th nearest neighbor, and let N_k(p) denote the set of k-nearest neighbors of p. The reachability distance from p to a neighbor o is defined as max(k-distance(o), dist(p, o)). This smoothing damps statistical fluctuations in the distances between very close points and keeps the reachability distance from collapsing to zero when two points coincide. The local reachability density of p is the inverse of the mean reachability distance from p to its neighbors:
lrd(p) = 1 / mean(reach-dist(p, o)) for o in N_k(p)
Finally, the LOF score of p is:
LOF(p) = mean(lrd(o) / lrd(p)) for o in N_k(p)
The ratio compares each neighbor's density to the point's own density. Values close to 1 mean p sits in the same density region as its neighbors. Values significantly greater than 1 mean p is in a sparser region than its neighbors, indicating a local outlier. Typical thresholds are LOF > 1.5 for mild outliers and LOF > 2 for strong outliers, though application-specific tuning is usually required. The choice of k typically falls between 10 and 50; smaller values are more sensitive to local fluctuations, larger values smooth them out.
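The formulas above translate almost directly into NumPy. This is a brute-force sketch for small datasets, under two simplifying assumptions: exactly k neighbors are used even when several points tie at the k-distance, and the data contain no groups of more than k duplicate points (which would make a mean reachability distance zero). The function name `lof_scores` and the toy data are illustrative.

```python
import numpy as np

def lof_scores(X, k):
    """Local Outlier Factor of each row of X, computed by brute force."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbor

    neighbors = np.argsort(dist, axis=1)[:, :k]    # N_k(p), ties broken arbitrarily
    rows = np.arange(n)[:, None]
    neighbor_dist = dist[rows, neighbors]          # dist(p, o) for o in N_k(p)
    k_distance = neighbor_dist[:, -1]              # k-distance(p)

    # reach-dist(p, o) = max(k-distance(o), dist(p, o))
    reach = np.maximum(k_distance[neighbors], neighbor_dist)
    lrd = 1.0 / reach.mean(axis=1)                 # local reachability density

    # LOF(p) = mean over neighbors of lrd(o) / lrd(p)
    return lrd[neighbors].mean(axis=1) / lrd

# Hypothetical data: a 5 x 5 unit grid plus one isolated point.
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
X = np.vstack([grid, [[10.0, 10.0]]])
scores = lof_scores(X, k=3)
print(scores[12], scores[25])
```

The grid center (index 12) scores exactly 1, since it shares the density of its neighbors, while the isolated point (index 25) scores far above the thresholds quoted above.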
An isolation tree is built on a subsample of the dataset (default subsample size 256 in the original paper). At each internal node, the algorithm picks a random feature and a random split value uniformly between the minimum and maximum of that feature within the current partition. The recursion stops when the partition contains a single point or when a maximum depth is reached. The depth of the leaf in which a point lands is its path length for that tree.
For a forest with t trees, the anomaly score for a point x is:
s(x, n) = 2^(-E[h(x)] / c(n))
where E[h(x)] is the average path length for x across the forest and c(n) is the average path length of an unsuccessful search in a binary search tree of n points, with n the subsample size used to build each tree. Scores range from 0 to 1: values near 1 indicate strong anomalies, values well below 0.5 indicate points that can safely be treated as normal, and if every point scores close to 0.5 the sample contains no distinct anomalies. A contamination parameter (the expected fraction of anomalies) is then used to set a decision threshold.
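The normalization term can be made concrete. The sketch below uses the closed form c(n) = 2H(n - 1) - 2(n - 1)/n, with the harmonic number H(i) approximated by ln(i) plus the Euler-Mascheroni constant, as in the Isolation Forest paper. The function names and the special cases for n of 2 or less are conventions adopted for this sketch.

```python
import math

EULER_GAMMA = 0.5772156649015329

def average_path_length(n):
    """c(n): average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA   # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n))."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

n = 256                                  # default subsample size
c = average_path_length(n)
print(c)                                 # roughly 10.24
print(anomaly_score(c, n))               # average-depth point scores 0.5
print(anomaly_score(2.0, n))             # isolated after ~2 splits: close to 1
```

A point whose mean path length equals c(n) scores exactly 0.5, which is why a forest in which every point scores near 0.5 shows no distinct anomalies.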
A simple anomaly-detection autoencoder consists of an encoder f mapping inputs to a low-dimensional code z and a decoder g reconstructing the input from z. Training minimizes the reconstruction loss:
L = mean ||x - g(f(x))||^2
over normal training data. At test time, the reconstruction error of a new input x is used as the anomaly score. The bottleneck dimension is the most important hyperparameter: too large and the network learns the identity function, including for anomalies; too small and even normal inputs are reconstructed poorly. Common variants include the variational autoencoder, which adds a probabilistic prior on the latent code, and the denoising autoencoder, which is trained to reconstruct clean inputs from corrupted ones. Memory-augmented autoencoders explicitly store prototypes of normal data in a memory bank, forcing test-time reconstructions to use those prototypes and thus producing larger errors on anomalies.
PyOD (Python Outlier Detection) is the most comprehensive Python library for outlier detection. Initially released by Yue Zhao in 2017, PyOD provides a unified scikit-learn-compatible API for more than 40 algorithms, including Isolation Forest, LOF, COF, ABOD, One-Class SVM, MCD (the estimator behind scikit-learn's EllipticEnvelope), AutoEncoder, Deep SVDD, LODA, Feature Bagging, and SUOD. The 2019 JMLR paper describing PyOD has been cited several thousand times and the library is widely used for tabular outlier detection in research and industry.
Example usage:
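```python
from pyod.models.iforest import IForest
from pyod.utils.data import generate_data

# Synthetic data with 10% injected outliers (labels: 0 = inlier, 1 = outlier)
X_train, X_test, y_train, y_test = generate_data(
    n_train=1000, n_test=200, contamination=0.1
)

# Fit an Isolation Forest that expects about 10% outliers
clf = IForest(contamination=0.1, random_state=42)
clf.fit(X_train)

# Binary labels and raw scores on the training data (higher = more anomalous)
y_train_pred = clf.labels_
y_train_scores = clf.decision_scores_

# Labels and scores for unseen data
y_test_pred = clf.predict(X_test)
y_test_scores = clf.decision_function(X_test)
```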
Scikit-learn ships several outlier detection methods in its sklearn.ensemble, sklearn.neighbors, sklearn.svm, and sklearn.covariance modules. The main classes are:
| Class | Method family | Notes |
|---|---|---|
| IsolationForest | Tree-based | Default for general-purpose tabular outlier detection |
| LocalOutlierFactor | Density-based | Supports novelty=True for prediction on new data |
| OneClassSVM | Boundary-based | Sensitive to bandwidth choice; expensive for large data |
| EllipticEnvelope | Gaussian assumption | Uses Minimum Covariance Determinant for robust estimation |
| SGDOneClassSVM | Linear approximation of OneClassSVM | Scales to large datasets |
Example:
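```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# 500 Gaussian inliers plus 20 uniformly scattered points (about 4% contamination)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
X_outliers = rng.uniform(low=-6, high=6, size=(20, 2))
X = np.vstack([X, X_outliers])

# Fitted Isolation Forest; predict() returns +1 for inliers and -1 for outliers
iforest = IsolationForest(contamination=0.04, random_state=42).fit(X)
iforest_labels = iforest.predict(X)

# In its default (non-novelty) mode LOF only labels the data it was fitted on
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.04).fit_predict(X)
```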
Major cloud providers offer hosted anomaly detection services. Amazon Lookout for Metrics, Azure Anomaly Detector, and Google Cloud Vertex AI all wrap variants of the methods above behind managed APIs. Open-source observability platforms such as Prometheus and Grafana support anomaly-detection plugins for monitoring.
Detection is only the first step. The right treatment depends on the source of the outlier and the downstream use case.
Deletion is the simplest treatment but the most consequential. Removing values without a documented reason is generally regarded as bad statistical practice because it can introduce bias and inflate apparent precision. Deletion is appropriate when there is independent evidence that the point is the result of a measurement error (such as a known sensor failure) or a data-entry mistake. When deletion is used, the procedure should be pre-registered or at least documented and the resulting estimates should be compared with and without the removed points.
Winsorization, introduced by Charles Winsor and popularized by Tukey, replaces extreme values with the values at a specified percentile. A 90% Winsorization caps values below the 5th percentile and above the 95th percentile to those quantile values. Capping limits the influence of extreme observations without throwing them away. Trimmed means, which discard a fixed percentage of the highest and lowest values before averaging, are a related estimator with high efficiency under heavy-tailed contamination.
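A minimal sketch of 90% Winsorization using percentile capping. The function name `winsorize` and the sample are illustrative, and NumPy's default linear-interpolation percentiles are assumed.

```python
import numpy as np

def winsorize(values, lower_pct=5.0, upper_pct=95.0):
    """Cap values below/above the given percentiles at those percentile values."""
    values = np.asarray(values, dtype=float)
    low, high = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, low, high)

# Hypothetical sample: the integers 1..99 and one extreme value.
sample = np.arange(1, 101, dtype=float)
sample[-1] = 1000.0

capped = winsorize(sample)
print(sample.mean(), capped.mean())   # the capped mean is much less affected
print(capped.min(), capped.max())
```

Every observation is retained, but the extreme value now contributes no more than the 95th percentile does.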
Applying a monotone transformation can compress the right tail of skewed distributions. Common transformations include log, square root, Box-Cox, and Yeo-Johnson. The Yeo-Johnson transformation, due to Yeo and Johnson (2000), accommodates negative values, unlike Box-Cox. Transformations preserve the rank ordering of points but pull extreme values toward the rest of the distribution.
When an outlier is treated as a missing value, it can be replaced using imputation. Mean and median imputation are simple but understate variability. More sophisticated approaches use k-nearest neighbor imputation, multiple imputation by chained equations (MICE), or model-based approaches such as iterative imputation in scikit-learn.
Rather than alter the data, the analyst can choose a model that is less sensitive to outliers in the first place. This is the realm of robust statistics, discussed in the next section.
| Strategy | When to use | Caveats |
|---|---|---|
| Remove | Documented measurement or entry error | Risks bias if used informally |
| Winsorize / cap | Heavy tails dominate the analysis | Distorts marginal distribution |
| Transform | Skewed positive data | Changes interpretation of coefficients |
| Impute | Outlier reflects truly missing or implausible value | Adds modeling assumptions |
| Use a robust model | Want to keep all data | Robust methods can lose efficiency under no contamination |
| Flag and report separately | Outliers are the analytical target | Requires clear communication with stakeholders |
Robust statistics, pioneered in the 1960s by Peter Huber and Frank Hampel, develops estimators that perform well even when the data deviate from the assumed model. A few central ideas recur across this field.
The ordinary least squares loss is sensitive to outliers because it grows quadratically with the residual. Peter Huber proposed in 1964 a loss function that is quadratic for small residuals and linear for large ones:
L(r) = r^2 / 2 if |r| <= delta, else delta * (|r| - delta / 2)
This Huber loss combines the low variance of squared error for clean data with the bounded influence of absolute error for contaminated data. M-estimators generalize this idea: any loss with a bounded influence function leads to a robust estimator. Tukey's biweight (bisquare) loss goes further by assigning zero weight to points beyond a cutoff. M-estimators are the basis of robust regression methods such as HuberRegressor in scikit-learn and rlm in R's MASS package.
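The piecewise definition can be written directly in NumPy. The function name `huber_loss` and the default `delta=1.0` are illustrative choices.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    r = np.abs(np.asarray(residual, dtype=float))
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

residuals = np.array([0.5, 1.0, 3.0, 10.0])
print(huber_loss(residuals))          # [0.125, 0.5, 2.5, 9.5]
print(0.5 * residuals**2)             # squared-error loss: [0.125, 0.5, 4.5, 50.0]
```

The two branches agree at |r| = delta, so the loss is continuous, and a residual of 10 costs 9.5 rather than 50, which is how the loss limits the pull of a single extreme point.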
Random Sample Consensus (RANSAC), proposed by Fischler and Bolles in 1981, is an iterative procedure for fitting a model to data that contains a large fraction of outliers. RANSAC repeatedly samples a minimal subset of points, fits a candidate model, counts the inliers within a residual tolerance, and keeps the model with the most inliers. RANSAC is the standard tool for geometric problems in computer vision, such as homography estimation, fundamental matrix estimation, and 3D point cloud alignment.
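The sample-fit-count loop can be sketched for the simplest case, fitting a line to two-dimensional points. The function name `ransac_line`, the iteration count, the tolerance, and the final least-squares refit on the consensus set are choices made for this sketch, not part of any particular library.

```python
import numpy as np

def ransac_line(x, y, n_iter=200, tol=0.5, seed=0):
    """Fit y = slope * x + intercept by RANSAC; return (slope, intercept, inlier mask)."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(x.size, dtype=bool)
    for _ in range(n_iter):
        # A minimal sample for a line is two distinct points.
        i, j = rng.choice(x.size, size=2, replace=False)
        if x[i] == x[j]:
            continue                      # vertical pair: skip this sample
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        inliers = np.abs(y - (slope * x + intercept)) <= tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the largest consensus set with ordinary least squares.
    slope, intercept = np.polyfit(x[best_inliers], y[best_inliers], deg=1)
    return slope, intercept, best_inliers

# Hypothetical data: points on y = 2x + 1 with three gross outliers.
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0
y[[3, 8, 15]] = [40.0, -20.0, 90.0]

slope, intercept, inliers = ransac_line(x, y)
print(slope, intercept, int(inliers.sum()))
```

The recovered line matches the 17 clean points, whereas an ordinary least squares fit to all 20 points would be pulled toward the three outliers.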
The sample covariance matrix is highly sensitive to outliers because it is a sum of squared deviations. Rousseeuw's Minimum Covariance Determinant (MCD) estimator searches for the subset of h points (with h at least half the sample) whose covariance matrix has the smallest determinant, and estimates location and scatter from that subset. Robust Mahalanobis distances and some robust PCA variants are then built on this estimator. The Fast-MCD algorithm of Rousseeuw and Van Driessen (1999) made this approach computationally practical, and it underlies EllipticEnvelope in scikit-learn.
Two concepts characterize the robustness of an estimator. The breakdown point is the smallest fraction of contaminated data that can drive the estimator to an arbitrary value. The sample mean has a breakdown point of zero (a single outlier can shift it without bound), while the sample median has a breakdown point of 50%. The influence function measures the change in the estimator caused by an infinitesimal contamination at a given point. Bounded influence functions are characteristic of robust estimators.
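The breakdown points of the mean and the median can be checked numerically. In this sketch the sample, the contaminating value of one million, and the helper name `contaminate` are illustrative.

```python
import numpy as np

clean = np.arange(1.0, 12.0)     # 1, 2, ..., 11: mean and median are both 6

def contaminate(values, n_bad, bad_value=1e6):
    """Replace the n_bad largest entries of a sorted array with bad_value."""
    out = values.copy()
    if n_bad > 0:
        out[-n_bad:] = bad_value
    return out

for n_bad in (0, 1, 5, 6):
    data = contaminate(clean, n_bad)
    print(n_bad, data.mean(), np.median(data))
```

A single bad value already moves the mean by several orders of magnitude. The median stays at 6 with 5 of 11 values corrupted and only breaks once a majority (6 of 11) is contaminated.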
Outlier detection appears in nearly every quantitative discipline. The most prominent applications include:
| Domain | Typical anomalies | Common methods |
|---|---|---|
| Fraud detection | Unusual transactions, account takeovers | Supervised classifiers, Isolation Forest, autoencoders |
| Network intrusion detection | Port scans, denial-of-service traffic, lateral movement | One-Class SVM, autoencoder reconstruction error, sequence models |
| Manufacturing defect detection | Surface scratches, missing parts, color anomalies | PaDiM, PatchCore, Anomalib pipeline on MVTec data |
| Medical anomaly detection | Tumors on imaging, abnormal ECG patterns, lab outliers | Autoencoders, GAN-based methods, deep one-class methods |
| KPI and time-series monitoring | Server-load spikes, revenue drops, latency increases | STL decomposition + IQR, Prophet residuals, LSTM autoencoders |
| Predictive maintenance | Vibration anomalies, temperature drift, acoustic faults | LSTM autoencoders, change-point detection, Mahalanobis on PCA features |
| Astronomical surveys | Transient events, asteroids, exoplanet transits | Isolation Forest, deep generative models, neural ODEs |
| Insurance claims | Suspicious claim patterns, exaggerated amounts | Gradient boosting, anomaly scores combined with rule engines |
| Cybersecurity log analysis | Anomalous user behavior, credential misuse | LODA, sequence-based detectors, embedding-distance methods |
| Quality control in datasets for ML | Mislabeled examples, near-duplicates, distribution shift | Loss-based detectors, cleanlab confident learning |
Industrial defect detection on MVTec AD is a good example of how rapidly the deep methods have advanced. In 2019, the original benchmark paper reported AUC scores around 0.7 for the best published methods. By 2023, several methods built on pretrained backbones, such as PatchCore, exceeded 0.99 average AUC across the 15 categories.
In finance, fraud detection systems combine outlier scores with rule engines and human review queues. The Federal Trade Commission reports tens of billions of dollars in U.S. consumer losses to fraud each year, and machine learning fraud detection is a multi-billion-dollar industry segment dominated by vendors such as Featurespace, Sift, and the in-house systems of major banks and payment networks.
Several practical issues recur regardless of the method chosen.
In high-dimensional spaces, distances between points become more uniform and the contrast between near and far neighbors collapses. This phenomenon, formalized by Beyer and colleagues in 1999, weakens distance-based and density-based methods. Subspace methods, projection-based methods (such as LODA), and tree-based methods (such as Isolation Forest) tend to scale better.
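The loss of contrast can be observed with a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube and uses a "relative contrast" ratio, (farthest distance minus nearest distance) divided by nearest distance, as an illustrative measure; the function name, sample size, and seed are choices made here, not part of the original analysis.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from a random query point
    to n_points random points, all drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

low_dim_contrast = relative_contrast(2)
high_dim_contrast = relative_contrast(500)
print(low_dim_contrast, high_dim_contrast)
```

In two dimensions the farthest point is many times farther away than the nearest one, whereas in 500 dimensions the two distances differ by only a small fraction, which is why nearest-neighbor scores lose discriminative power.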
Masking occurs when a group of outliers prevents detection of any individual outlier in the group, because the group inflates the variance estimate or shifts the centroid. Swamping is the opposite problem: outliers distort the estimates of location and scale so much that normal points are flagged. Robust estimators and iterative procedures such as Rosner's generalized ESD test address these effects.
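Masking is easy to reproduce on a toy sample. In the sketch below, two identical extreme values inflate the sample standard deviation enough that neither exceeds the usual z-score cutoff of 3, while a modified z-score based on the median and the median absolute deviation (MAD) flags both. The data, the cutoffs of 3 and 3.5, and the function names are illustrative assumptions.

```python
import statistics

# Nine ordinary readings plus two identical extreme values (hypothetical data).
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100, 100]

def zscore_flags(values, cutoff=3.0):
    """Indices whose classical z-score exceeds the cutoff."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    return [i for i, v in enumerate(values) if abs(v - mean) / std > cutoff]

def modified_zscore_flags(values, cutoff=3.5):
    """Indices whose MAD-based modified z-score exceeds the cutoff."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # degenerate case: more than half the values are identical
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > cutoff]

masked = zscore_flags(data)            # the two outliers hide each other
robust = modified_zscore_flags(data)   # median and MAD are barely affected
print(masked, robust)
```

The classical rule returns no indices because the mean and standard deviation are themselves contaminated, whereas the robust rule returns the positions of both extreme values.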
Most detection methods assume a small contamination rate. When contamination exceeds 30% to 40%, the boundary between normal and anomalous becomes blurred and the assumptions of methods such as Isolation Forest break down.
In streaming or production settings, the definition of normal changes over time. Models trained on historical data become stale, requiring continual retraining or online updates. Methods such as Half-Space Trees, ADWIN-based detectors, and the streaming variants of Isolation Forest address this.
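One simple way to let the notion of normal follow the data is to score each new point against a sliding window of recent values. The sketch below is a deliberately basic rolling median/MAD detector, not an implementation of Half-Space Trees, ADWIN, or streaming Isolation Forest; the function name, window size, threshold, and synthetic stream are all assumptions made for illustration.

```python
import statistics
from collections import deque

def rolling_mad_outliers(stream, window=20, threshold=3.5):
    """Yield (index, value) for points far from the median of the last
    `window` values, measured in MAD units (modified z-score)."""
    recent = deque(maxlen=window)
    for i, x in enumerate(stream):
        if len(recent) == window:
            med = statistics.median(recent)
            mad = statistics.median([abs(v - med) for v in recent])
            if mad > 0 and 0.6745 * abs(x - med) / mad > threshold:
                yield i, x
        # Every point enters the window, flagged or not, so that the
        # detector can adapt when the underlying level really changes.
        recent.append(x)

# Synthetic stream: 30 readings around 10, then a level shift to around 50.
pattern = [10, 11, 9, 10, 12]
stream = pattern * 6 + [v + 40 for v in pattern] * 6

flagged = [i for i, _ in rolling_mad_outliers(stream)]
print(flagged)
```

On this toy stream the detector flags the first readings after the level shift and then goes quiet once the window is dominated by the new regime. Whether flagged points should enter the window is a design choice: excluding them protects against contamination but prevents adaptation to genuine drift.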
Evaluating outlier detection is harder than evaluating supervised classification because labels are typically scarce. Common metrics include AUC-ROC, the area under the precision-recall curve, and precision at the top k. ADBench (2022) is a widely used benchmark suite for tabular outlier detection that compares 30 algorithms across 57 datasets.
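When a small labeled set is available, these metrics are straightforward to compute from anomaly scores. The following standard-library sketch uses the pairwise-ranking definition of AUC-ROC and uses average precision as a common summary of the precision-recall curve; the function names and the six-point example are hypothetical, and labels are assumed to be 1 for outliers and 0 for normal points.

```python
def roc_auc(scores, labels):
    """Probability that a random outlier scores higher than a random
    normal point (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def precision_at_k(scores, labels, k):
    """Fraction of true outliers among the k highest-scoring points."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k

def average_precision(scores, labels):
    """Mean of the precision values at the rank of each true outlier."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Hypothetical anomaly scores (higher = more anomalous) and ground truth.
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 0]

auc = roc_auc(scores, labels)            # 7 of 8 outlier/normal pairs ordered correctly
p_at_2 = precision_at_k(scores, labels, 2)
ap = average_precision(scores, labels)
print(auc, p_at_2, ap)
```

Because outliers are rare, precision-oriented metrics such as average precision and precision at k often separate detectors more clearly than AUC-ROC, which can look high even when the top of the ranking is dominated by false alarms.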
In practice, the terms outlier detection, anomaly detection, and novelty detection are often used interchangeably, but a few useful distinctions exist.
| Term | Common meaning |
|---|---|
| Outlier | A data point that differs from the bulk of the dataset, usually in a single batch of observations |
| Anomaly | A broader term covering points, contexts, or sequences that deviate from expected behavior, often in streaming or temporal settings |
| Novelty | A new data point that differs from training data, where the training data is assumed to be clean |
| Out-of-distribution (OOD) | A test input drawn from a distribution different from the training distribution, typically in deep learning |
| Change-point | A time at which the underlying data-generating process changes |
Classical statistical literature talks about outliers; machine learning literature, especially after the 2000s, talks about anomalies; and the deep learning literature on safety, monitoring, and OOD detection extends the same ideas to test-time inputs of neural networks. The methods overlap heavily across these communities, and the same algorithm (Isolation Forest, autoencoder reconstruction, LOF) often shows up under all three labels.
The following table summarizes the practical trade-offs of the most widely used outlier detection methods.
| Method | Family | Time complexity | Handles high dim | Handles varying density | Streaming support | Library |
|---|---|---|---|---|---|---|
| Z-score | Statistical | O(n) | Univariate only | No | Trivial | NumPy, SciPy |
| Modified Z-score (MAD) | Statistical | O(n) | Univariate only | No | Easy | NumPy, statsmodels |
| Tukey IQR fence | Statistical | O(n log n) | Univariate only | No | Easy | pandas, seaborn |
| Grubbs' test | Statistical | O(n) | Univariate only | No | No | SciPy, outliers |
| EllipticEnvelope | Linear (MCD) | O(n^2 d^2) | Moderate | No | No | scikit-learn |
| kNN distance | Distance | O(n^2 d) | Limited | Limited | No | PyOD, scikit-learn |
| LOF | Density | O(n^2 d) | Limited | Yes | Limited | scikit-learn, PyOD |
| DBSCAN noise | Density | O(n log n) with index | Limited | Limited | No | scikit-learn |
| Isolation Forest | Tree | O(n log n) | Yes | Moderate | Yes (variants) | scikit-learn, PyOD |
| Extended Isolation Forest | Tree | O(n log n) | Yes | Moderate | Yes (variants) | eif package, PyOD |
| One-Class SVM | Boundary | O(n^2) to O(n^3) | Moderate | Limited | No | scikit-learn, PyOD |
| Autoencoder | Neural | Training O(n * epochs) | Yes | Yes | Yes | PyTorch, TensorFlow, PyOD |
| Deep SVDD | Neural | Training O(n * epochs) | Yes | Yes | No | PyOD, DeepOD |
| LODA | Ensemble | O(n * k * d) | Yes | Yes | Yes | PyOD |
| Feature Bagging | Ensemble | O(base detector * t) | Yes | Inherited | Limited | PyOD |
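As a concrete instance of the simplest rows in the table, the Tukey IQR fence needs only quartiles. The sketch below uses `statistics.quantiles` from the Python standard library with its default quartile convention; other libraries interpolate quartiles differently, so the fence values can differ slightly between tools. The heights are hypothetical and the function name is chosen here.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical classroom heights in centimeters, with one data-entry error.
heights_cm = [110, 112, 115, 118, 120, 122, 125, 128, 130, 980]

low, high = tukey_fences(heights_cm)
outliers = [h for h in heights_cm if h < low or h > high]
print(low, high, outliers)
```

Only the mistyped value falls outside the fences. Because the quartiles ignore the magnitude of the extreme value, the fences stay close to the bulk of the data, unlike limits based on the mean and standard deviation.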
The most common defaults in practice are:
Decades of methodological work and applied experience suggest a small number of consistently useful guidelines.
Imagine you and your friends measure how tall everyone in your class is. Most kids are between 110 and 130 centimeters. Then you measure one student and write down 980 centimeters. That number is an outlier. It is way bigger than the other numbers, and almost certainly someone wrote down the wrong thing.
Outliers in data are like that one strange measurement. They are points that look very different from everything else. Sometimes they are mistakes, and we want to catch them and fix them so they do not mess up our calculations. Other times they are real and important; if we are looking for the tallest tree in the forest or the strangest credit card charge of the week, the outlier is exactly the thing we want to find.
Computers find outliers in many ways. Some methods count how far each point is from the others. Some methods cut the data into pieces with random lines and see which points get separated quickly. Some methods build a small machine that learns what normal looks like and then complains when a new point does not fit. The right method depends on what kind of data we have and what we are looking for.