Outliers

Introduction

In statistics and machine learning, an outlier is a data point that differs significantly from the bulk of the observations in a dataset. The word itself comes from the idea that the value lies outside the range that the rest of the sample suggests is plausible. Outliers can arise from genuine phenomena, such as a record-breaking earthquake or a fraudulent credit card charge, or from errors in measurement, data entry, sensor faults, or transmission noise. The choice between treating a suspicious point as a real signal or as contamination drives the entire field of anomaly detection and a large body of work in robust statistics.

The statistician John Tukey defined an outlier informally in his 1977 book Exploratory Data Analysis as any point that falls beyond a fence drawn at 1.5 times the interquartile range (IQR) from the first or third quartile. This rule of thumb still drives the whiskers in modern boxplots and remains one of the first checks performed during data exploration. Earlier work by Frank Anscombe in 1960 had already framed the underlying tension: rejection rules trade off the cost of discarding good data against the cost of letting bad data corrupt downstream analysis.

Outliers matter because many statistical and machine learning algorithms are not robust to extreme values. A single point with a large residual can noticeably change the slope and intercept of an ordinary least squares regression line, distort principal component directions, pull centroids in k-means clustering toward empty regions of feature space, or dominate the loss function during training of a neural network. At the other end of the spectrum, the same extreme value may be the most important observation in the dataset; in fraud detection, intrusion detection, predictive maintenance, and rare-disease screening, the outliers are the entire reason the analysis exists.

This article covers the formal types of outliers, the major detection algorithms grouped by family, software libraries that implement them, treatment strategies, and the relationship between outlier analysis and the broader field of anomaly detection.

Definition and basic concepts

A standard definition, due to Douglas Hawkins in his 1980 monograph Identification of Outliers, is that an outlier is "an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism." Hawkins's definition emphasizes the data-generating process: outliers are points that the assumed model cannot easily produce.

Three practical implications follow from this definition. First, whether a point is an outlier depends on the model. A value of 200 cm for adult height is unusual under a Gaussian model fit to a general population but unremarkable inside a sample of professional basketball players. Second, outliers can be informative or pathological. Whether to keep, transform, or remove them depends on whether the unusual mechanism is the phenomenon of interest. Third, outlier identification is inherently statistical; it carries error rates and requires choices about thresholds and significance.

A related distinction is between univariate outliers, which are extreme on a single variable, and multivariate outliers, which are extreme in the joint distribution of several variables even though each marginal value looks ordinary. A person 180 cm tall and weighing 50 kg has unremarkable height and unremarkable weight separately, but the combination is unusual and shows up only in two-dimensional analysis.

Types of outliers

Researchers in anomaly detection commonly distinguish three types of outliers based on the structure of the surrounding data. The taxonomy was popularized by Chandola, Banerjee, and Kumar in their 2009 ACM Computing Surveys article on anomaly detection.

Type	Definition	Example
Point (global) outlier	A single observation that is anomalous with respect to the entire dataset	A credit card charge of $50,000 in a stream where the typical transaction is $50
Contextual (conditional) outlier	A point that is anomalous only within a specific context, such as time of day or location	A temperature of 30 degrees Celsius is normal in summer but anomalous in winter
Collective outlier	A group of related points that together deviate from the rest of the data, even if each individual point looks normal	A long sequence of identical packets in a network capture that signals a denial-of-service attack

Point outliers are the easiest to detect and the focus of most classical statistical tests. Contextual outliers require defining the context, usually by separating contextual attributes (time, location, user identity) from behavioral attributes (the variable being measured); a point is then judged by its behavioral values relative to other points that share its context. Collective outliers usually appear in sequence, time series, graph, or spatial data, where the relationships between points carry as much information as the points themselves.

There is also a distinction between inliers and outliers. An inlier is a point that lies in the dense region of the data distribution. Some methods, such as one-class classification, model the inliers explicitly and flag everything else as a potential outlier.

Sources of outliers

The practical question of how to handle an outlier often turns on its source. Common sources include:

Measurement error. Faulty sensors, miscalibrated instruments, or transient interference can produce values far outside the physical range of the quantity being measured.
Data entry error. Manual transcription introduces typos, swapped fields, and misplaced decimals. A weight recorded as 700 instead of 70 will dominate any unprotected analysis.
Sampling error. A non-representative sample may contain points from a different population than the intended target.
Genuine extreme values. Some processes have heavy-tailed distributions in which large deviations are rare but real. Stock returns, insurance claims, and earthquake magnitudes are classic examples.
Adversarial activity. Fraud, intrusion, and abuse generate points that are deliberately designed to look unusual or, harder still, deliberately designed to look normal while being malicious.

Real datasets often contain a mix of several of these sources, which is one reason that purely statistical detection is rarely the whole story.

Detection methods overview

Outlier detection algorithms fall into a small number of families with different assumptions, computational profiles, and failure modes. The main families are summarized below.

Family	Core idea	Representative algorithms	Typical assumption
Statistical	Fit a probability model and flag low-likelihood points	Z-score, modified Z-score, Tukey's IQR fence, Grubbs' test, Dixon's Q test	Data follow a known parametric distribution
Distance-based	Flag points that are far from their neighbors	k-Nearest Neighbors (kNN) outlier score, distance to k-th nearest neighbor	Outliers are isolated in feature space
Density-based	Flag points whose local density is much lower than that of their neighbors	Local Outlier Factor (LOF), DBSCAN noise points, COF	Normal points lie in dense regions
Tree-based	Use random space partitions and exploit the fact that anomalies are easier to isolate	Isolation Forest, Extended Isolation Forest	Anomalies are few and different
Linear/subspace	Project data onto a subspace and measure reconstruction error	PCA reconstruction, robust PCA, EllipticEnvelope (Mahalanobis distance)	Data lie near a low-dimensional subspace
Neural	Train a neural network to model the normal class and flag deviations	Autoencoder reconstruction error, Deep SVDD, GAN-based methods	Sufficient training data exists
Ensemble	Combine multiple detectors to reduce variance and improve robustness	Feature Bagging, LODA, SUOD	Individual detectors are diverse
One-class classification	Learn a boundary around normal data and flag points outside	One-Class SVM, Deep SVDD	Normal class is well-defined

Within each family, methods vary along several axes: supervised vs. unsupervised, parametric vs. nonparametric, point vs. structural, batch vs. streaming, and global vs. local. The right choice depends on the dimensionality of the data, the density of the normal class, the rate of contamination, and whether labeled examples of anomalies are available.

Statistical methods

Statistical outlier detection has the longest history of any approach and remains the default choice for low-dimensional, well-behaved data.

Z-score

The Z-score expresses how many standard deviations a value lies from the sample mean:

Z = (x - mean) / standard_deviation

A common rule flags values with absolute Z-score greater than 3, which corresponds to roughly 0.27% of observations under a Gaussian distribution. The Z-score is fast and easy to compute but has two well-known weaknesses. First, the sample mean and sample standard deviation are themselves not robust; a single extreme point inflates the standard deviation and thereby reduces the Z-score of every other point, masking additional outliers. Second, the rule assumes Gaussian tails, which are too thin for many real distributions.
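The masking effect can be seen in a few lines of NumPy. The sketch below is illustrative only: the function name `zscore_outliers` and the readings are made up for the example, and the sample standard deviation (`ddof=1`) is an assumed convention.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return (z, mask): the Z-scores and a boolean mask of |z| > threshold."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)  # sample standard deviation (assumed convention)
    return z, np.abs(z) > threshold

# Hypothetical readings: ten ordinary values near 10.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9]

# One gross error is flagged, although only barely in a sample this small.
z_one, mask_one = zscore_outliers(clean + [25.0])

# Two gross errors inflate the standard deviation so much that neither is flagged.
z_two, mask_two = zscore_outliers(clean + [25.0, 26.0])

print(mask_one.sum(), mask_two.sum())
```

With two contaminated readings the largest absolute Z-score drops to roughly 2.2, so the rule reports no outliers at all.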

Modified Z-score (MAD)

Iglewicz and Hoaglin proposed a modified Z-score that replaces the mean with the median and the standard deviation with the median absolute deviation (MAD):

Modified Z = 0.6745 * (x - median) / MAD

The constant 0.6745 makes the modified Z-score consistent with the standard Z-score under a Gaussian distribution. Iglewicz and Hoaglin recommended flagging values with absolute modified Z-score greater than 3.5. Because the median and the MAD have a 50% breakdown point, this method is far more resistant to the masking effect that undermines the standard Z-score.
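A minimal sketch of the modified Z-score follows, applied to made-up readings that contain two gross errors. The function name and the data are illustrative, and the zero-MAD guard is an added assumption rather than part of the original proposal.

```python
import numpy as np

def modified_z_scores(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:
        # More than half the values are identical; the score is undefined.
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return 0.6745 * (x - median) / mad

# Hypothetical readings: ten ordinary values and two gross errors.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9, 25.0, 26.0]
scores = modified_z_scores(data)
flagged = np.abs(scores) > 3.5  # cutoff recommended by Iglewicz and Hoaglin

print(np.asarray(data)[flagged])
```

Both contaminated readings are flagged here, because the median and the MAD are barely moved by them.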

Tukey's IQR rule and boxplots

John Tukey's 1977 boxplot uses the interquartile range (IQR), defined as the difference between the third quartile (Q3) and the first quartile (Q1). Tukey's fences are placed at:

Inner fences: Q1 minus 1.5 times IQR, and Q3 plus 1.5 times IQR
Outer fences: Q1 minus 3 times IQR, and Q3 plus 3 times IQR

Values outside the inner fences are typically flagged as mild outliers; values outside the outer fences are flagged as extreme outliers. The rule does not depend on the mean or the standard deviation, so it is robust to contamination, and it underlies the whiskers and dots drawn in standard boxplot software.
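The fences can be sketched as follows. The helper name `tukey_labels` and the data are illustrative; the quartiles come from `np.percentile` with its default linear interpolation, which is one of several quartile conventions, so other software may place the fences slightly differently.

```python
import numpy as np

def tukey_labels(values):
    """Label each value as 'inside', 'mild', or 'extreme' using Tukey's fences."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])  # default linear interpolation (assumed convention)
    iqr = q3 - q1
    mild = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)     # beyond the inner fences
    extreme = (x < q1 - 3.0 * iqr) | (x > q3 + 3.0 * iqr)  # beyond the outer fences
    labels = np.full(x.shape, "inside", dtype=object)
    labels[mild] = "mild"
    labels[extreme] = "extreme"  # overrides 'mild' where both apply
    return labels

# Hypothetical sample: Q1 = 3.5, Q3 = 8.5, so the upper fences sit at 16 and 23.5.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 17, 30]
labels = tukey_labels(data)
print(list(zip(data, labels)))
```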

Grubbs' test

Grubbs' test, also called the maximum normalized residual test, is designed to detect a single outlier in a univariate sample drawn from an approximately Gaussian distribution. The test statistic is:

G = max |x - mean| / standard_deviation

The critical value depends on the sample size and the chosen significance level. Grubbs' test can be applied iteratively after removing the most extreme point, but it is sensitive to the masking effect when more than one outlier is present and tends to lose power as the contamination rate grows.

Dixon's Q test

Dixon's Q test, introduced by Wilfrid Dixon in 1950, is widely used in analytical chemistry for small samples (typically 3 to 30 measurements). The Q statistic is the gap between the suspected outlier and its nearest neighbor divided by the total range of the data. The value is compared against tabulated critical values. Dixon's Q is intended for a single suspected outlier in small samples; it should not be reused in a sequential strip-and-retest loop on the same data.
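The Q statistic itself takes only a few lines. The sketch below tests whichever end of the sorted sample has the larger gap. The function name, the measurements, and the critical value `Q_CRITICAL` are all assumptions made for the example; in particular, the critical value should be looked up in a published table for the actual sample size and confidence level.

```python
def dixon_q(values):
    """Return (q, suspect): the Q statistic and the value it refers to."""
    x = sorted(float(v) for v in values)
    if len(x) < 3:
        raise ValueError("Dixon's Q needs at least three observations")
    data_range = x[-1] - x[0]
    if data_range == 0:
        raise ValueError("all observations are identical")
    q_low = (x[1] - x[0]) / data_range     # gap at the low end over the range
    q_high = (x[-1] - x[-2]) / data_range  # gap at the high end over the range
    return (q_low, x[0]) if q_low >= q_high else (q_high, x[-1])

# Hypothetical replicate measurements; 0.167 is the suspect value.
measurements = [0.189, 0.167, 0.187, 0.183, 0.186,
                0.182, 0.181, 0.184, 0.181, 0.177]
q, suspect = dixon_q(measurements)

# Assumed tabulated critical value for n = 10 at 95% confidence; verify against a table.
Q_CRITICAL = 0.466
reject = q > Q_CRITICAL
print(round(q, 3), suspect, reject)
```

In this example Q is about 0.455, which is below the assumed critical value, so the suspect measurement would be retained.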

Mahalanobis distance and EllipticEnvelope

For multivariate Gaussian data, the Mahalanobis distance measures how far a point is from the sample mean while accounting for the covariance structure. Squared Mahalanobis distance follows a chi-squared distribution under the Gaussian assumption, which gives a principled cutoff. The estimator is sensitive to outliers in the sample mean and covariance, so robust covariance estimators such as the Minimum Covariance Determinant (MCD) of Rousseeuw are typically used in practice. Scikit-learn wraps this approach in sklearn.covariance.EllipticEnvelope.
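As a rough illustration, the sketch below computes squared Mahalanobis distances for synthetic height and weight data. It uses the classical sample mean and covariance rather than MCD, so it is not what EllipticEnvelope does internally. The data-generating parameters are invented. The cutoff uses the fact that, in two dimensions only, the chi-squared quantile has the closed form -2 ln(1 - p), which avoids a SciPy dependency.

```python
import numpy as np

def squared_mahalanobis(X):
    """Squared Mahalanobis distance of each row from the sample mean."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)  # classical (non-robust) covariance estimate
    return np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)

# Hypothetical population: weight rises with height, plus noise.
rng = np.random.default_rng(0)
height = rng.normal(170.0, 8.0, size=500)
weight = 70.0 + 0.9 * (height - 170.0) + rng.normal(0.0, 5.0, size=500)
X = np.column_stack([height, weight])
X = np.vstack([X, [180.0, 50.0]])  # tall but very light: unusual only jointly

d2 = squared_mahalanobis(X)

# 97.5% chi-squared quantile with 2 degrees of freedom (closed form valid for 2-D only).
cutoff = -2.0 * np.log(1.0 - 0.975)

# Marginal Z-scores of the same points, for comparison.
marginal_z = np.abs((X - X.mean(axis=0)) / X.std(axis=0, ddof=1))
print(d2[-1] > cutoff, marginal_z[-1])
```

The added point exceeds the joint cutoff by a wide margin even though neither of its marginal Z-scores reaches 3.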

Other classical tests

Several other tests appear in the literature. The Generalized Extreme Studentized Deviate (ESD) test of Rosner (1983) extends Grubbs' test to detect multiple outliers without the masking issue. The Tietjen-Moore test detects a known number of outliers. Chauvenet's criterion and Peirce's criterion are older rules from physical sciences that are still occasionally cited but have been largely superseded by the methods above.

Distance-based methods

Distance-based methods make no parametric assumption. They flag a point as an outlier if it is far from its neighbors in the chosen distance metric.

k-Nearest Neighbors outlier score

The simplest approach, introduced by Knorr and Ng in 1998, defines a point as a DB(p, d)-outlier if at least a fraction p of the dataset lies further than distance d from it. A common practical variant is to compute the distance from each point to its k-th nearest neighbor and rank points by that distance; points with the largest k-th nearest neighbor distance are the strongest outliers. Ramaswamy, Rastogi, and Shim refined the formulation in 2000 to rank points by distance rather than by a hard threshold.

Advantages include conceptual simplicity and a single hyperparameter k. The main disadvantages are quadratic time complexity in the number of points, sensitivity to the choice of distance metric in high dimensions (where the curse of dimensionality compresses pairwise distances), and difficulty handling clusters of varying density.
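The k-th nearest neighbor score can be sketched with a brute-force distance matrix, which also makes the quadratic cost visible. The function name, the synthetic data, and the choice k = 5 are illustrative assumptions; a real implementation would normally use a spatial index.

```python
import numpy as np

def kth_neighbor_distance(X, k):
    """Distance from each point to its k-th nearest neighbor (Euclidean)."""
    X = np.asarray(X, dtype=float)
    diffs = X[:, None, :] - X[None, :, :]        # O(n^2) memory and time
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the zero distance to the point itself,
    # so column k holds the distance to the k-th nearest other point.
    return np.sort(dist, axis=1)[:, k]

# Hypothetical data: a Gaussian cloud plus one isolated point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [8.0, 8.0]])

scores = kth_neighbor_distance(X, k=5)
ranking = np.argsort(scores)[::-1]  # strongest outliers first
print(ranking[0], scores[ranking[0]])
```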

Angle-based methods

Kriegel, Schubert, and Zimek proposed the Angle-Based Outlier Detection (ABOD) method in 2008. ABOD measures the variance of angles between pairs of vectors from a query point to other points in the dataset. Points inside a cluster see a wide range of angles, while points in the periphery see a narrow range. ABOD performs better than distance-based methods in very high-dimensional spaces because angular variation is more stable than absolute distance.

Density-based methods

Density-based methods improve on pure distance methods by considering local rather than global structure.

Local Outlier Factor (LOF)

Local Outlier Factor (LOF), introduced by Breunig, Kriegel, Ng, and Sander in 2000, was the first widely used local outlier detection method. LOF compares the local density of a point with the local density of its k-nearest neighbors. The score is computed in three steps:

For each point p, compute the k-distance, the distance to its k-th nearest neighbor.
Define the local reachability density of p as the inverse of the average reachability distance from p to its k-nearest neighbors.
Compute the LOF score as the average ratio of the local reachability densities of p's neighbors to the local reachability density of p.

A LOF score around 1 indicates that the point has a density similar to its neighbors, so it is not an outlier. Scores significantly greater than 1 indicate that the point is in a much sparser region than its neighbors and is therefore a local outlier. LOF handles datasets where normal clusters have very different densities, which is its main advantage over global distance methods. Its drawback is sensitivity to the parameter k and quadratic complexity for naive implementations. Scikit-learn implements LOF as sklearn.neighbors.LocalOutlierFactor.

DBSCAN noise points

DBSCAN (Density-Based Spatial Clustering of Applications with Noise), introduced by Ester, Kriegel, Sander, and Xu in 1996, is primarily a clustering algorithm but produces an outlier classification as a byproduct. DBSCAN labels each point as either a core point, a border point, or noise. Noise points are not reachable from any core point within the chosen radius and are effectively flagged as outliers. DBSCAN does not require specifying the number of clusters and naturally handles arbitrarily shaped clusters, but it requires careful tuning of the radius (epsilon) and minimum number of neighbors (minPts) parameters.

Connectivity-based outlier factor

The Connectivity-Based Outlier Factor (COF), proposed by Tang, Chen, Fu, and Cheung in 2002, is a variant of LOF that is better suited to data lying on lower-dimensional manifolds. Instead of using direct k-nearest neighbor distances, COF uses chaining distance along the shortest path through the neighborhood graph.

Tree-based methods

Tree-based methods exploit the geometric intuition that anomalies are easier to isolate than normal points.

Isolation Forest

Isolation Forest, introduced by Liu, Ting, and Zhou at the IEEE International Conference on Data Mining in 2008, builds an ensemble of random binary trees called isolation trees. To build each tree, the algorithm recursively partitions the data by selecting a random feature and a random split point between the minimum and maximum values of that feature. Anomalies tend to be separated from the rest of the data after only a few splits, while normal points require many splits to isolate. The anomaly score for a point is derived from its average path length across the forest, normalized by the expected path length of an unsuccessful search in a binary search tree.

Isolation Forest has linear time complexity, low memory requirements, scales to high-dimensional data, and does not require a distance metric, all of which helped make it one of the most widely deployed outlier detection algorithms in the years following its publication. The Extended Isolation Forest of Hariri, Kind, and Brunner (2019) addresses an axis-alignment artifact in the original method by using random hyperplanes for the splits. Scikit-learn implements the original algorithm as sklearn.ensemble.IsolationForest.

Linear and subspace methods

Linear methods assume that normal data lie close to a low-dimensional subspace.

PCA reconstruction error

Principal component analysis (PCA) projects data onto the directions of greatest variance. If most of the variance is explained by the top principal components, normal points can be reconstructed with low error from a reduced number of components, while outliers cannot. The reconstruction error then serves as an outlier score. Robust PCA, formulated as a low-rank-plus-sparse decomposition by Candes, Li, Ma, and Wright in 2011, separates the data matrix into a low-rank component representing the bulk of the data and a sparse component representing outliers.
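A minimal sketch of the reconstruction-error score (plain PCA via SVD, not robust PCA) is shown below. The function name and the synthetic data, which lie near a plane in three dimensions, are assumptions made for the example.

```python
import numpy as np

def pca_reconstruction_error(X, n_components):
    """Squared reconstruction error of each row after projecting onto the top components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    reconstructed = centered @ components.T @ components
    return ((centered - reconstructed) ** 2).sum(axis=1)

# Hypothetical data near the plane z = x + y, plus one point well off the plane.
rng = np.random.default_rng(0)
xy = rng.normal(size=(300, 2))
z = xy.sum(axis=1) + rng.normal(0.0, 0.05, size=300)
X = np.vstack([np.column_stack([xy, z]), [0.0, 0.0, 5.0]])

errors = pca_reconstruction_error(X, n_components=2)
print(errors.argmax(), errors[-1])
```

Because the principal directions are fitted on data that includes the outlier, heavy contamination can tilt the subspace; that weakness is what the robust variants are meant to address.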

One-Class SVM

The One-Class Support Vector Machine, proposed by Scholkopf, Platt, Shawe-Taylor, Smola, and Williamson in 2001, learns a boundary around the normal data in feature space and flags points outside the boundary. With an RBF kernel, the boundary can take complex non-linear shapes. One-Class SVM is sensitive to the choice of the nu parameter, which sets an upper bound on the fraction of training points allowed to be outliers, and to the kernel bandwidth. Scikit-learn provides this method as sklearn.svm.OneClassSVM.

Neural methods

Neural network methods scale to very high-dimensional data such as images, audio, and text.

Autoencoder reconstruction error

An autoencoder is a neural network that learns to compress an input through a low-dimensional bottleneck and reconstruct it. When trained on data drawn predominantly from the normal class, the network learns to reconstruct normal points well and to reconstruct anomalies poorly. The reconstruction error (typically mean squared error between the input and its reconstruction) serves as an anomaly score. Variational autoencoders and denoising autoencoders are common variants. The technique was popularized for industrial fault detection in the early 2010s and remains a standard deep learning baseline.

Deep SVDD

Deep Support Vector Data Description (Deep SVDD), introduced by Lukas Ruff and colleagues at ICML 2018, generalizes the classical SVDD method of Tax and Duin to deep representation learning. Deep SVDD learns a feature mapping (typically a convolutional neural network for image data) and trains it to map normal data into the smallest possible hypersphere in feature space. The anomaly score is the distance from the center of the hypersphere. Deep SAD extends the framework to use a limited number of labeled anomalies. Related methods for industrial defect detection, such as PatchCore and PaDiM, instead score test images against features extracted by pre-trained backbones.

GAN-based methods

Generative adversarial network methods such as AnoGAN (Schlegl et al., 2017) and f-AnoGAN train a generator to produce normal samples and use the residual between an input and its closest reconstruction in the generator's range as the anomaly score. The approach was first applied to medical imaging for retinal disease detection.

Self-supervised methods

More recent work uses self-supervised pretext tasks. CSI (Contrasting Shifted Instances) of Tack, Mo, Jeong, and Shin (NeurIPS 2020) uses contrastive learning with distribution-shifted samples. PANDA, ICL, and other methods adapt pretrained representations such as those from a vision transformer for industrial anomaly detection.

Ensemble methods

Ensemble methods combine multiple base detectors to improve robustness and reduce variance.

Feature Bagging

Feature Bagging, proposed by Lazarevic and Kumar in 2005, trains multiple base detectors (often LOF) on random subspaces of the original feature set and combines their scores by averaging or breadth-first ranking. The approach reduces the impact of irrelevant or noisy features.

LODA

Lightweight On-line Detector of Anomalies (LODA), introduced by Tomas Pevny in 2016, is a fast ensemble that uses random projections to one-dimensional histograms and aggregates the log-densities. LODA scales linearly with the number of points and features, supports streaming updates, and provides feature-importance scores for explaining detected anomalies.

SUOD

SUOD, introduced by Zhao and colleagues in 2021, is an acceleration framework that runs many heterogeneous base detectors in parallel and uses model approximation to reduce per-detector cost. It is available in PyOD as an ensemble model.

Specific algorithms in detail

Local Outlier Factor (LOF) deep dive

The LOF score for a point p with parameter k is computed as follows. Let k-distance(p) denote the distance from p to its k-th nearest neighbor, and let N_k(p) denote the set of k-nearest neighbors of p. The reachability distance from p to a neighbor o is defined as max(k-distance(o), dist(p, o)). Taking the maximum smooths the distances: every point inside the k-neighborhood of o is treated as equally far from o, which reduces statistical fluctuation in the density estimate for points that lie very close together. The local reachability density of p is the inverse of the mean reachability distance from p to its neighbors:

lrd(p) = 1 / mean(reach-dist(p, o)) for o in N_k(p)

Finally, the LOF score of p is:

LOF(p) = mean(lrd(o) / lrd(p)) for o in N_k(p)

The ratio compares each neighbor's density to the point's own density. Values close to 1 mean p sits in the same density region as its neighbors. Values significantly greater than 1 mean p is in a sparser region than its neighbors, indicating a local outlier. Typical thresholds are LOF > 1.5 for mild outliers and LOF > 2 for strong outliers, though application-specific tuning is usually required. The choice of k typically falls between 10 and 50; smaller values are more sensitive to local fluctuations, larger values smooth them out.
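The formulas above can be sketched directly in NumPy. This is an illustrative brute-force version, not the scikit-learn implementation: it takes exactly k neighbors per point (ignoring ties at the k-distance) and assumes there are no duplicate points, which would make a reachability distance zero. The function name and the synthetic clusters are made up for the example.

```python
import numpy as np

def lof_scores(X, k):
    """Brute-force LOF; assumes no duplicate points and ignores distance ties."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]    # indices of N_k(p)
    rows = np.arange(n)
    k_distance = dist[rows, neighbors[:, -1]]      # distance to the k-th neighbor
    # reach-dist(p, o) = max(k-distance(o), dist(p, o)) for each o in N_k(p)
    reach = np.maximum(k_distance[neighbors], dist[rows[:, None], neighbors])
    lrd = 1.0 / reach.mean(axis=1)                 # local reachability density
    return lrd[neighbors].mean(axis=1) / lrd       # mean(lrd(o) / lrd(p))

# Hypothetical data: a tight cluster, a loose cluster, and a point just outside the tight one.
rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.1, size=(100, 2))
sparse = rng.normal(5.0, 1.0, size=(100, 2))
X = np.vstack([dense, sparse, [1.0, 1.0]])

scores = lof_scores(X, k=10)
print(scores[-1], np.median(scores))
```

The added point is closer to its neighbors than many members of the loose cluster are to theirs, yet it receives a much higher score because it is compared with the tight cluster next to it.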

Isolation Forest deep dive

An isolation tree is built on a subsample of the dataset (default subsample size 256 in the original paper). At each internal node, the algorithm picks a random feature and a random split value uniformly between the minimum and maximum of that feature within the current partition. The recursion stops when the partition contains a single point or when a maximum depth is reached. The depth of the leaf in which a point lands is its path length for that tree.

For a forest with t trees, the anomaly score for a point x is:

s(x, n) = 2^(-E[h(x)] / c(n))

where E[h(x)] is the average path length for x across the forest and c(n) is the average path length of an unsuccessful search in a binary search tree of n points. Scores lie between 0 and 1: values close to 1 indicate strong anomalies, values well below 0.5 indicate points that can safely be treated as normal, and a sample in which every score is close to 0.5 has no distinct anomalies. In implementations such as scikit-learn's, a contamination parameter (the expected fraction of anomalies) is then used to set a decision threshold on the scores.
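The normalization can be made concrete with a short sketch. It uses the expression for c(n) given in the original Isolation Forest paper, c(n) = 2H(n - 1) - 2(n - 1)/n, with the harmonic number approximated by ln(i) plus the Euler-Mascheroni constant. The function names are illustrative, and the path lengths passed in are made-up inputs rather than the output of a fitted forest.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n))."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

n = 256  # subsample size used in the original paper
for mean_depth in (2.0, average_path_length(n), 20.0):
    print(round(mean_depth, 2), round(anomaly_score(mean_depth, n), 3))
```

A point isolated after about two splits on average scores close to 0.87, a point whose average depth equals c(n) scores exactly 0.5, and deeper points score lower.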

Autoencoder anomaly detection deep dive

A simple anomaly-detection autoencoder consists of an encoder f mapping inputs to a low-dimensional code z and a decoder g reconstructing the input from z. Training minimizes the reconstruction loss:

L = mean ||x - g(f(x))||^2

over normal training data. At test time, the reconstruction error of a new input x is used as the anomaly score. The bottleneck dimension is the most important hyperparameter: too large and the network learns the identity function, including for anomalies; too small and even normal inputs are reconstructed poorly. Common variants include the variational autoencoder, which adds a probabilistic prior on the latent code, and the denoising autoencoder, which is trained to reconstruct clean inputs from corrupted ones. Memory-augmented autoencoders explicitly store prototypes of normal data in a memory bank, forcing test-time reconstructions to use those prototypes and thus producing larger errors on anomalies.

Libraries and tools

PyOD

PyOD (Python Outlier Detection) is one of the most comprehensive Python libraries for outlier detection. Initially released by Yue Zhao in 2017, PyOD provides a unified scikit-learn-compatible API for more than 40 algorithms, including Isolation Forest, LOF, COF, ABOD, One-Class SVM, Minimum Covariance Determinant (MCD), AutoEncoder, Deep SVDD, LODA, Feature Bagging, and SUOD. The 2019 JMLR paper describing PyOD has been cited several thousand times, and the library is widely used for tabular outlier detection in research and industry.

Example usage:

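from pyod.models.iforest import IForest
from pyod.utils.data import generate_data

# Synthetic data with 10% outliers; labels are 1 for outliers and 0 for inliers.
X_train, X_test, y_train, y_test = generate_data(
    n_train=1000, n_test=200, contamination=0.1, random_state=42
)

clf = IForest(contamination=0.1, random_state=42)
clf.fit(X_train)  # unsupervised: the labels are not used for fitting

# Results on the training data (a higher score means more anomalous).
y_train_pred = clf.labels_
y_train_scores = clf.decision_scores_

# Results on unseen data.
y_test_pred = clf.predict(X_test)
y_test_scores = clf.decision_function(X_test)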

Scikit-learn

Scikit-learn ships several outlier detection methods in its sklearn.ensemble, sklearn.neighbors, sklearn.svm, and sklearn.covariance modules. The main classes are:

Class	Method family	Notes
`IsolationForest`	Tree-based	Default for general-purpose tabular outlier detection
`LocalOutlierFactor`	Density-based	Supports `novelty=True` for prediction on new data
`OneClassSVM`	Boundary-based	Sensitive to bandwidth choice; expensive for large data
`EllipticEnvelope`	Gaussian assumption	Uses Minimum Covariance Determinant for robust estimation
`SGDOneClassSVM`	Linear approximation of OneClassSVM	Scales to large datasets

Example:

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# 500 Gaussian inliers plus 20 uniformly scattered points (about 4% contamination).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
X_outliers = rng.uniform(low=-6, high=6, size=(20, 2))
X = np.vstack([X, X_outliers])

# Both detectors label inliers as 1 and outliers as -1.
iforest = IsolationForest(contamination=0.04, random_state=42).fit(X)
iforest_labels = iforest.predict(X)
lof_labels = LocalOutlierFactor(n_neighbors=20, contamination=0.04).fit_predict(X)

Other libraries and frameworks

alibi-detect (Seldon, 2019) provides outlier, adversarial, and drift detectors with both tabular and image support.
PyTorch-based libraries such as DeepOD, Anomalib (Intel, 2022), and PyTorch Outlier Detection focus on deep methods.
Anomalib, in particular, is widely used for industrial anomaly detection on the MVTec benchmark.
River provides streaming anomaly detection (incremental Isolation Forest, Half-Space Trees).
OpenAD and TimeEval specialize in time-series anomaly detection with dozens of baselines.
PyGOD (2022) extends PyOD to graph-structured data.

Cloud and enterprise tools

Major cloud providers offer hosted anomaly detection services. Amazon Lookout for Metrics, Azure Anomaly Detector, and Google Cloud Vertex AI all wrap variants of the methods above behind managed APIs. Open-source observability platforms such as Prometheus and Grafana support anomaly-detection plugins for monitoring.

Treatment of outliers

Detection is only the first step. The right treatment depends on the source of the outlier and the downstream use case.

Removing outliers

Deletion is the simplest treatment but the most consequential. Removing values without a documented reason is generally regarded as bad statistical practice because it can introduce bias and inflate apparent precision. Deletion is appropriate when there is independent evidence that the point is the result of a measurement error (such as a known sensor failure) or a data-entry mistake. When deletion is used, the procedure should be pre-registered or at least documented and the resulting estimates should be compared with and without the removed points.

Capping and Winsorization

Winsorization, introduced by Charles Winsor and popularized by Tukey, replaces extreme values with the values at a specified percentile. A 90% Winsorization caps values below the 5th percentile and above the 95th percentile to those quantile values. Capping limits the influence of extreme observations without throwing them away. Trimmed means, which discard a fixed percentage of the highest and lowest values before averaging, are a related estimator with high efficiency under heavy-tailed contamination.
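A 90% Winsorization can be sketched with percentile clipping. The function name and the data are illustrative, and the percentile convention (NumPy's default linear interpolation) is an assumption; other conventions move the caps slightly.

```python
import numpy as np

def winsorize(values, lower_pct=5.0, upper_pct=95.0):
    """Clip values to the given lower and upper percentiles."""
    x = np.asarray(values, dtype=float)
    low, high = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, low, high)

# Hypothetical sample: the integers 1..99 plus one extreme value.
data = np.append(np.arange(1.0, 100.0), 1000.0)
capped = winsorize(data)

print(data.mean(), capped.mean())
print(capped.min(), capped.max())
```

The sample size is unchanged; only the values in the two tails are replaced by the cap values.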

Transformation

Applying a monotone transformation can compress the right tail of skewed distributions. Common transformations include log, square root, Box-Cox, and Yeo-Johnson. The Yeo-Johnson transformation, due to Yeo and Johnson (2000), accommodates negative values, unlike Box-Cox. Transformations preserve the rank ordering of points but pull extreme values toward the rest of the distribution.

Imputation

When an outlier is treated as a missing value, it can be replaced using imputation. Mean and median imputation are simple but understate variability. More sophisticated approaches use k-nearest neighbor imputation, multiple imputation by chained equations (MICE), or model-based approaches such as iterative imputation in scikit-learn.

Robust modeling

Rather than alter the data, the analyst can choose a model that is less sensitive to outliers in the first place. This is the realm of robust statistics, discussed in the Robust statistics section below.

Treatment summary

Strategy	When to use	Caveats
Remove	Documented measurement or entry error	Risks bias if used informally
Winsorize / cap	Heavy tails dominate the analysis	Distorts marginal distribution
Transform	Skewed positive data	Changes interpretation of coefficients
Impute	Outlier reflects truly missing or implausible value	Adds modeling assumptions
Use a robust model	Want to keep all data	Robust methods can lose efficiency under no contamination
Flag and report separately	Outliers are the analytical target	Requires clear communication with stakeholders

Robust statistics

Robust statistics, pioneered in the 1960s by Peter Huber and Frank Hampel, develops estimators that perform well even when the data deviate from the assumed model. A few central ideas recur across this field.

Huber loss and M-estimators

The ordinary least squares loss is sensitive to outliers because it grows quadratically with the residual. Peter Huber proposed in 1964 a loss function that is quadratic for small residuals and linear for large ones:

L(r) = r^2 / 2 if |r| <= delta, else delta * (|r| - delta / 2)

This Huber loss combines the low variance of squared error for clean data with the bounded influence of absolute error for contaminated data. M-estimators generalize this idea: any loss with a bounded influence function leads to a robust estimator. Tukey's biweight (bisquare) loss goes further by giving zero weight to points beyond a cutoff. M-estimators are the basis of robust regression methods such as HuberRegressor in scikit-learn and rlm in R's MASS package.
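The loss and a simple M-estimate of location can be sketched as follows. The estimator uses iteratively reweighted averaging with a fixed MAD-based scale, which is one common way to compute a Huber location estimate but not the only one. The function names, the data, and the tuning constant 1.345 (a conventional choice) are assumptions for the example, and the data are assumed to have a nonzero MAD.

```python
import numpy as np

def huber_loss(residuals, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    r = np.abs(np.asarray(residuals, dtype=float))
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

def huber_location(values, delta=1.345, n_iter=50):
    """Huber M-estimate of location via iteratively reweighted averaging."""
    x = np.asarray(values, dtype=float)
    mu = np.median(x)
    scale = 1.4826 * np.median(np.abs(x - mu))  # MAD rescaled for Gaussian data; assumed nonzero
    for _ in range(n_iter):
        r = np.abs(x - mu) / scale
        # Weight 1 inside the quadratic zone, delta / |r| in the linear zone.
        w = np.minimum(1.0, delta / np.maximum(r, 1e-12))
        mu = np.sum(w * x) / np.sum(w)
    return mu

# Hypothetical readings with one gross error.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9, 25.0]
print(huber_loss([0.5, 3.0]))            # small and large residual
print(np.mean(data), huber_location(data))
```

The sample mean is pulled above 11 by the single bad reading, while the Huber estimate stays close to 10.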

RANSAC

Random Sample Consensus (RANSAC), proposed by Fischler and Bolles in 1981, is an iterative procedure for fitting a model to data that contains a large fraction of outliers. RANSAC repeatedly samples a minimal subset of points, fits a candidate model, counts the inliers within a residual tolerance, and keeps the model with the most inliers. RANSAC is the standard tool for geometric problems in computer vision, such as homography estimation, fundamental matrix estimation, and 3D point cloud alignment.
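The loop described above (sample, fit, count inliers, keep the best) can be sketched for a straight line in a few lines of NumPy. This is a simplified illustration, not the original algorithm in full: the parameter names `n_iters` and `tol`, the fixed iteration count, and the final least-squares refit on the consensus set are assumptions made for the example.

```python
import numpy as np

def ransac_line(x, y, n_iters=200, tol=1.0, seed=0):
    """Fit y = slope * x + intercept by sampling minimal two-point subsets."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(x), dtype=bool)
    for _ in range(n_iters):
        # A line needs two points, so the minimal subset has size 2.
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue  # vertical line, cannot be written as y = f(x)
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        inliers = np.abs(y - (slope * x + intercept)) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the largest consensus set with ordinary least squares.
    slope, intercept = np.polyfit(x[best_inliers], y[best_inliers], deg=1)
    return slope, intercept, best_inliers

# Synthetic data: an exact line y = 2x + 1 with three gross outliers.
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0
y[[3, 8, 15]] += 30.0

slope, intercept, inlier_mask = ransac_line(x, y)
print(slope, intercept, inlier_mask.sum())
```

On this toy data the three corrupted points are excluded from the consensus set, so the refit recovers the underlying line, whereas an ordinary least-squares fit on all 20 points would be pulled toward the outliers.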

Robust covariance and PCA

The sample covariance matrix is highly sensitive to outliers because it averages squared deviations from the mean, so a single distant point can dominate it. Rousseeuw's Minimum Covariance Determinant (MCD) estimator searches for the subset of a fixed size h (at least half the sample) whose covariance matrix has the smallest determinant, and estimates location and scatter from that subset. The Fast-MCD algorithm of Rousseeuw and Van Driessen (1999) made this approach computationally practical, and it underlies EllipticEnvelope in scikit-learn. Robust PCA variants can then take their principal directions from the MCD covariance estimate instead of the sample covariance.
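A short sketch of the difference between the classical and the MCD estimates, assuming scikit-learn is installed. It uses `sklearn.covariance.MinCovDet`; the synthetic data (200 standard-normal points plus a tight cluster of 20 shifted points) are made up for the illustration.

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 2))                       # bulk of the data
shifted = rng.normal(loc=8.0, scale=0.3, size=(20, 2))  # contaminating cluster
X = np.vstack([clean, shifted])

# Classical estimate: the contaminating cluster drags the mean away from the origin.
classical_center = X.mean(axis=0)

# Robust estimate: location and scatter come from the most concentrated subset.
mcd = MinCovDet(random_state=0).fit(X)
robust_center = mcd.location_
robust_d2 = mcd.mahalanobis(X)  # squared Mahalanobis distances under the robust fit

print("classical center:", classical_center)
print("robust center:   ", robust_center)
print("median squared distance, bulk vs. cluster:",
      np.median(robust_d2[:200]), np.median(robust_d2[200:]))
```

Under the robust fit the contaminating cluster sits at a very large Mahalanobis distance, which is the quantity a detector such as EllipticEnvelope thresholds.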

Breakdown point and influence function

Two concepts characterize the robustness of an estimator. The breakdown point is the smallest fraction of contaminated data that can drive the estimator to an arbitrary value. The sample mean has a breakdown point of zero (a single outlier can shift it without bound), while the sample median has a breakdown point of 50%. The influence function measures the change in the estimator caused by an infinitesimal contamination at a given point. Bounded influence functions are characteristic of robust estimators.
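The contrast between the two breakdown points can be checked numerically. The sketch below replaces a growing number of values in a small made-up sample with an arbitrarily large number and reports the mean and the median; the helper name `location_under_contamination` is illustrative.

```python
import numpy as np

# Made-up clean sample of ten measurements near 10.
clean = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9])

def location_under_contamination(sample, n_bad, bad_value=1e6):
    """Replace the first n_bad values with bad_value and return (mean, median)."""
    corrupted = sample.copy()
    corrupted[:n_bad] = bad_value
    return corrupted.mean(), np.median(corrupted)

for n_bad in (0, 1, 4, 5):
    mean, median = location_under_contamination(clean, n_bad)
    print(f"{n_bad} bad of {len(clean)}: mean={mean:.2f}, median={median:.2f}")
```

A single corrupted value already moves the mean by about 100,000, while the median stays near 10 until half of the sample is corrupted.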

Use cases

Outlier detection appears in nearly every quantitative discipline. The most prominent applications include:

Domain	Typical anomalies	Common methods
Fraud detection	Unusual transactions, account takeovers	Supervised classifiers, Isolation Forest, autoencoders
Network intrusion detection	Port scans, denial-of-service traffic, lateral movement	One-Class SVM, autoencoder reconstruction error, sequence models
Manufacturing defect detection	Surface scratches, missing parts, color anomalies	PaDiM, PatchCore, Anomalib pipeline on MVTec data
Medical anomaly detection	Tumors on imaging, abnormal ECG patterns, lab outliers	Autoencoders, GAN-based methods, deep one-class methods
KPI and time-series monitoring	Server-load spikes, revenue drops, latency increases	STL decomposition + IQR, Prophet residuals, LSTM autoencoders
Predictive maintenance	Vibration anomalies, temperature drift, acoustic faults	LSTM autoencoders, change-point detection, Mahalanobis on PCA features
Astronomical surveys	Transient events, asteroids, exoplanet transits	Isolation Forest, deep generative models, neural ODEs
Insurance claims	Suspicious claim patterns, exaggerated amounts	Gradient boosting, anomaly scores combined with rule engines
Cybersecurity log analysis	Anomalous user behavior, credential misuse	LODA, sequence-based detectors, embedding-distance methods
Quality control in datasets for ML	Mislabeled examples, near-duplicates, distribution shift	Loss-based detectors, cleanlab confident learning

Industrial defect detection on MVTec AD is a good example of how rapidly the deep methods have advanced. The original benchmark paper (Bergmann et al., 2019) reported AUC scores around 0.7 for the best published methods. By 2023, several methods built on pretrained backbones, PatchCore among them, exceeded 0.99 average AUC across the 15 categories.

In finance, fraud detection systems combine outlier scores with rule engines and human review queues. The Federal Trade Commission has reported more than ten billion dollars a year in U.S. consumer losses to fraud in recent years, and machine learning fraud detection is a multi-billion-dollar industry segment dominated by vendors such as Featurespace, Sift, and the in-house systems of major banks and payment networks.

Challenges in outlier detection

Several practical issues recur regardless of the method chosen.

Curse of dimensionality

In high-dimensional spaces, distances between points become more uniform and the contrast between near and far neighbors collapses. This phenomenon, formalized by Beyer and colleagues in 1999, weakens distance-based and density-based methods. Subspace methods, projection-based methods (such as LODA), and tree-based methods (such as Isolation Forest) tend to scale better.
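The loss of contrast can be observed with a small simulation. The sketch below draws uniform random points, measures Euclidean distances from a random query point, and reports the relative contrast (farthest minus nearest, divided by nearest). The uniform distribution, sample size, and function name are assumptions chosen for the illustration.

```python
import numpy as np

def relative_contrast(n_points, dim, seed=0):
    """(d_max - d_min) / d_min for distances from one query to n_points uniform points."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(500, dim), 3))
```

The printed contrast shrinks as the dimension grows, which is why a "nearest" neighbor in a thousand dimensions is barely nearer than the farthest point.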

Masking and swamping

Masking occurs when a group of outliers prevents detection of any individual outlier in the group, because the group inflates the variance estimate or shifts the centroid. Swamping is the opposite problem: the outliers distort those estimates so much that normal points are flagged. Robust statistics and iterative procedures such as Rosner's generalized ESD test address these effects.
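Masking is easy to reproduce on a toy sample. In the sketch below, two large values inflate the standard deviation enough that neither exceeds a z-score threshold of 3, while a median/MAD-based modified z-score (with the 3.5 cutoff recommended by Iglewicz and Hoaglin) flags both. The data and function names are made up for the example.

```python
import numpy as np

def zscore_flags(x, threshold=3.0):
    """Classical z-score rule using the sample mean and standard deviation."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def modified_zscore_flags(x, threshold=3.5):
    """Modified z-score using the median and the median absolute deviation (MAD)."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    # Assumes mad > 0; data with many identical values needs a fallback scale.
    return np.abs(0.6745 * (x - med) / mad) > threshold

readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 50, 52]
print(np.flatnonzero(zscore_flags(readings)))           # [] : both outliers are masked
print(np.flatnonzero(modified_zscore_flags(readings)))  # [ 9 10]
```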

Contamination rate

Most detection methods assume that anomalies are rare. Isolation Forest, for example, relies on anomalies being few and different, so that random splits isolate them quickly. When contamination exceeds roughly 30% to 40%, the boundary between normal and anomalous blurs and such assumptions break down.

Concept drift

In streaming or production settings, the definition of normal changes over time. Models trained on historical data become stale, requiring continual retraining or online updates. Methods such as Half-Space Trees, ADWIN-based detectors, and the streaming variants of Isolation Forest address this.
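The idea of online updating can be illustrated with something much simpler than the methods named above: a sliding-window detector that scores each new value against the median and MAD of the most recent observations. This is a swapped-in toy technique, not Half-Space Trees or ADWIN; the class name, window size, warm-up length, and threshold are assumptions for the sketch.

```python
from collections import deque
import statistics

class RollingMadDetector:
    """Flag a value if its modified z-score against a sliding window exceeds a threshold."""

    def __init__(self, window=20, threshold=3.5, warmup=10):
        self.window = deque(maxlen=window)  # old values fall out automatically
        self.threshold = threshold
        self.warmup = warmup

    def update(self, x):
        is_anomaly = False
        if len(self.window) >= self.warmup:
            med = statistics.median(self.window)
            mad = statistics.median(abs(v - med) for v in self.window)
            if mad > 0:
                is_anomaly = abs(0.6745 * (x - med) / mad) > self.threshold
        # Always append, so the notion of "normal" follows the stream.
        self.window.append(x)
        return is_anomaly

# Synthetic stream: a level around 10 that shifts to a level around 20.
pattern = [0, 1, -1, 2, -2]
stream = [10 + pattern[i % 5] for i in range(60)] + [20 + pattern[i % 5] for i in range(60)]

detector = RollingMadDetector()
flags = [detector.update(value) for value in stream]
print([i for i, flagged in enumerate(flags) if flagged])
```

The first values after the shift are flagged, and once the window has filled with the new level the detector treats it as normal again. Whether that adaptation is desirable depends on whether the shift is a fault or a legitimate change in the process.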

Evaluation

Evaluating outlier detection is harder than evaluating supervised classification because labels are typically scarce. Common metrics include AUC-ROC, AUC of the precision-recall curve, and precision at the top k. ADBench (2022) is a widely used benchmark suite for tabular outlier detection that compares more than 30 algorithms across 57 datasets.
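The three metrics can be computed from a vector of binary labels (1 = anomaly) and a vector of anomaly scores. The sketch below implements them in plain NumPy to make the definitions explicit; the function names and the toy labels and scores are made up, and the pairwise AUC computation is only practical for small samples.

```python
import numpy as np

def roc_auc(labels, scores):
    """Probability that a random anomaly outscores a random normal point (ties count half)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def precision_at_k(labels, scores, k):
    """Fraction of true anomalies among the k highest-scoring points."""
    top_k = np.argsort(scores)[::-1][:k]
    return np.asarray(labels, dtype=bool)[top_k].mean()

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve."""
    order = np.argsort(scores)[::-1]
    hits = np.asarray(labels, dtype=bool)[order]
    precision_at_rank = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return precision_at_rank[hits].mean()

# Toy example: 3 anomalies among 10 points.
labels = [0, 0, 1, 0, 1, 0, 0, 0, 0, 1]
scores = [0.10, 0.20, 0.90, 0.30, 0.40, 0.80, 0.15, 0.05, 0.25, 0.70]

print(roc_auc(labels, scores))            # 19/21, about 0.905
print(precision_at_k(labels, scores, 3))  # 2/3
print(average_precision(labels, scores))  # about 0.806
```

For real workloads, scikit-learn's `roc_auc_score` and `average_precision_score` compute the first and third quantities more efficiently.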

Outliers vs. anomaly detection

In practice, the terms outlier detection, anomaly detection, and novelty detection are often used interchangeably, but a few useful distinctions exist.

Term	Common meaning
Outlier	A data point that differs from the bulk of the dataset, usually in a single batch of observations
Anomaly	A broader term covering points, contexts, or sequences that deviate from expected behavior, often in streaming or temporal settings
Novelty	A new data point that differs from training data, where the training data is assumed to be clean
Out-of-distribution (OOD)	A test input drawn from a distribution different from the training distribution, typically in deep learning
Change-point	A time at which the underlying data-generating process changes

Classical statistical literature talks about outliers; machine learning literature, especially after the 2000s, talks about anomalies; and the deep learning literature on safety, monitoring, and OOD detection extends the same ideas to test-time inputs of neural networks. The methods overlap heavily across these communities, and the same algorithm (Isolation Forest, autoencoder reconstruction, LOF) often shows up under all three labels.

Comparison of detection methods

The following table summarizes the practical trade-offs of the most widely used outlier detection methods.

Method	Family	Time complexity	Handles high dimensions	Handles varying density	Streaming support	Library
Z-score	Statistical	O(n)	Univariate only	No	Trivial	NumPy, SciPy
Modified Z-score (MAD)	Statistical	O(n)	Univariate only	No	Easy	NumPy, statsmodels
Tukey IQR fence	Statistical	O(n log n)	Univariate only	No	Easy	pandas, seaborn
Grubbs' test	Statistical	O(n)	Univariate only	No	No	SciPy, outliers (R package)
EllipticEnvelope	Linear (MCD)	O(n^2 d^2)	Moderate	No	No	scikit-learn
kNN distance	Distance	O(n^2 d)	Limited	Limited	No	PyOD, scikit-learn
LOF	Density	O(n^2 d)	Limited	Yes	Limited	scikit-learn, PyOD
DBSCAN noise	Density	O(n log n) with index	Limited	Limited	No	scikit-learn
Isolation Forest	Tree	O(n log n)	Yes	Moderate	Yes (variants)	scikit-learn, PyOD
Extended Isolation Forest	Tree	O(n log n)	Yes	Moderate	Yes (variants)	eif package, PyOD
One-Class SVM	Boundary	O(n^2) to O(n^3)	Moderate	Limited	No	scikit-learn, PyOD
Autoencoder	Neural	Training O(n * epochs)	Yes	Yes	Yes	PyTorch, TensorFlow, PyOD
Deep SVDD	Neural	Training O(n * epochs)	Yes	Yes	No	PyOD, DeepOD
LODA	Ensemble	O(n * k * d)	Yes	Yes	Yes	PyOD
Feature Bagging	Ensemble	O(base detector * t)	Yes	Inherited	Limited	PyOD

The most common defaults in practice are:

Tabular data, low to moderate dimension, no labels: Isolation Forest with contamination set to a domain estimate, or LOF if clusters have very different densities.
Multivariate Gaussian-like data: EllipticEnvelope using the Minimum Covariance Determinant.
High-dimensional data, no labels: Autoencoder reconstruction error or Deep SVDD.
Industrial visual inspection: PatchCore or PaDiM in Anomalib with a pretrained backbone.
Time series, single metric: Tukey IQR or modified Z-score on residuals from a forecasting model such as Prophet or STL decomposition.
Streaming data: LODA, Half-Space Trees, or online Isolation Forest variants.
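As a sketch of the first default in the list above, the following snippet runs scikit-learn's `IsolationForest` on synthetic tabular data, assuming scikit-learn is installed. The data (200 Gaussian points plus 10 planted points on a circle of radius 8) and the contamination value of 0.05 are made-up stand-ins for a real dataset and a domain estimate.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
inliers = rng.normal(size=(200, 2))
angles = np.linspace(0, 2 * np.pi, 10, endpoint=False)
planted = 8.0 * np.column_stack([np.cos(angles), np.sin(angles)])
X = np.vstack([inliers, planted])

# contamination sets the fraction of points that predict() labels as outliers.
detector = IsolationForest(contamination=0.05, random_state=0)
labels = detector.fit_predict(X)     # -1 = outlier, 1 = inlier
scores = -detector.score_samples(X)  # negated so that higher means more anomalous

print("flagged:", int((labels == -1).sum()))
print("planted points flagged:", int((labels[-10:] == -1).sum()), "of 10")
```

The contamination parameter only moves the threshold on the scores. Ranking by `scores` and reviewing the top of the list is often more useful than the hard labels when the true rate is uncertain.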

Best practices

Decades of methodological work and applied experience suggest a small number of consistently useful guidelines.

Define what an outlier means for the application before running any algorithm. A sensor reading 10 standard deviations above the mean might be a crucial early warning, an instrumentation failure, or a clerical error; the right response is different in each case.
Visualize the data. Histograms, boxplots, scatter plots, parallel coordinates, and PCA projections often reveal structure that summary statistics miss.
Try multiple detectors. Different families capture different kinds of anomalies. Combining scores from Isolation Forest, LOF, and an autoencoder, for example, can flag points that any single method would miss.
Calibrate thresholds with domain expertise. The contamination rate is rarely known a priori, and even when it is, the cost of false positives and false negatives is application-specific.
Document every removal. When points are excluded, log the reason, the rule, and the affected indices. Reproducibility depends on this.
Prefer robust methods for downstream models. Even after detection, downstream regressions and classifiers benefit from robust losses and regularization that limit the influence of any single point.
Monitor in production. A model that filtered the training data well may behave differently after deployment as the data distribution drifts.
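The "try multiple detectors" guideline raises a practical question: scores from different detectors live on different scales. One common, simple option is to convert each detector's scores to ranks and average them. The sketch below shows that approach; the function names and the three score vectors are hypothetical toy values, not outputs of real models.

```python
import numpy as np

def rank_normalize(scores):
    """Map scores to (0, 1] by rank so detectors on different scales are comparable."""
    scores = np.asarray(scores, dtype=float)
    ranks = scores.argsort().argsort() + 1  # 1 = least anomalous; ties broken arbitrarily
    return ranks / len(scores)

def combine_scores(score_lists):
    """Average the rank-normalized scores of several detectors."""
    return np.mean([rank_normalize(s) for s in score_lists], axis=0)

# Toy, made-up scores from three hypothetical detectors (higher = more anomalous).
detector_a = [0.10, 0.20, 0.15, 0.90, 0.30]
detector_b = [12.0, 15.0, 11.0, 40.0, 38.0]
detector_c = [0.01, 0.02, 0.03, 0.50, 0.04]

combined = combine_scores([detector_a, detector_b, detector_c])
print(combined)           # point 3 gets 1.0, point 4 gets 0.8
print(combined.argmax())  # 3
```

Rank averaging discards the magnitude of each score, which makes it robust to one detector with an extreme scale but also hides how confident any single detector was.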

Explain like I'm 5 (ELI5)

Imagine you and your friends measure how tall everyone in your class is. Most kids are between 110 and 130 centimeters. Then you measure one student and write down 980 centimeters. That number is an outlier. It is way bigger than the other numbers, and almost certainly someone wrote down the wrong thing.

Outliers in data are like that one strange measurement. They are points that look very different from everything else. Sometimes they are mistakes, and we want to catch them and fix them so they do not mess up our calculations. Other times they are real and important; if we are looking for the tallest tree in the forest or the strangest credit card charge of the week, the outlier is exactly the thing we want to find.

Computers find outliers in many ways. Some methods count how far each point is from the others. Some methods cut the data into pieces with random lines and see which points get separated quickly. Some methods build a small machine that learns what normal looks like and then complains when a new point does not fit. The right method depends on what kind of data we have and what we are looking for.

References

Tukey, J. W. (1977). *Exploratory Data Analysis*. Addison-Wesley.
Hawkins, D. M. (1980). *Identification of Outliers*. Chapman and Hall.
Anscombe, F. J. (1960). "Rejection of Outliers." *Technometrics*, 2(2), 123-146.
Iglewicz, B., & Hoaglin, D. C. (1993). *How to Detect and Handle Outliers*. ASQC Quality Press.
Rosner, B. (1983). "Percentage Points for a Generalized ESD Many-Outlier Procedure." *Technometrics*, 25(2), 165-172.
Dixon, W. J. (1950). "Analysis of Extreme Values." *The Annals of Mathematical Statistics*, 21(4), 488-506.
Grubbs, F. E. (1969). "Procedures for Detecting Outlying Observations in Samples." *Technometrics*, 11(1), 1-21.
Knorr, E. M., & Ng, R. T. (1998). "Algorithms for Mining Distance-Based Outliers in Large Datasets." *Proceedings of VLDB*, 392-403.
Ramaswamy, S., Rastogi, R., & Shim, K. (2000). "Efficient Algorithms for Mining Outliers from Large Data Sets." *Proceedings of SIGMOD*, 427-438.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). "LOF: Identifying Density-Based Local Outliers." *Proceedings of SIGMOD*, 93-104.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise." *Proceedings of KDD*, 226-231.
Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). "Isolation Forest." *Proceedings of the 8th IEEE International Conference on Data Mining (ICDM)*, 413-422.
Hariri, S., Kind, M. C., & Brunner, R. J. (2019). "Extended Isolation Forest." *IEEE Transactions on Knowledge and Data Engineering*, 33(4), 1479-1489.
Scholkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). "Estimating the Support of a High-Dimensional Distribution." *Neural Computation*, 13(7), 1443-1471.
Ruff, L., et al. (2018). "Deep One-Class Classification." *Proceedings of the 35th International Conference on Machine Learning (ICML)*, 4393-4402.
Schlegl, T., Seebock, P., Waldstein, S. M., Schmidt-Erfurth, U., & Langs, G. (2017). "Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery." *Information Processing in Medical Imaging*, 146-157.
Chandola, V., Banerjee, A., & Kumar, V. (2009). "Anomaly Detection: A Survey." *ACM Computing Surveys*, 41(3), Article 15.
Aggarwal, C. C. (2017). *Outlier Analysis* (2nd ed.). Springer.
Zhao, Y., Nasrullah, Z., & Li, Z. (2019). "PyOD: A Python Toolbox for Scalable Outlier Detection." *Journal of Machine Learning Research*, 20(96), 1-7.
Pedregosa, F., et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830.
Huber, P. J. (1964). "Robust Estimation of a Location Parameter." *Annals of Mathematical Statistics*, 35(1), 73-101.
Huber, P. J. (1981). *Robust Statistics*. Wiley.
Rousseeuw, P. J., & Van Driessen, K. (1999). "A Fast Algorithm for the Minimum Covariance Determinant Estimator." *Technometrics*, 41(3), 212-223.
Fischler, M. A., & Bolles, R. C. (1981). "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography." *Communications of the ACM*, 24(6), 381-395.
Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). "When Is 'Nearest Neighbor' Meaningful?" *Proceedings of ICDT*, 217-235.
Lazarevic, A., & Kumar, V. (2005). "Feature Bagging for Outlier Detection." *Proceedings of KDD*, 157-166.
Pevny, T. (2016). "LODA: Lightweight On-line Detector of Anomalies." *Machine Learning*, 102(2), 275-304.
Han, S., Hu, X., Huang, H., Jiang, M., & Zhao, Y. (2022). "ADBench: Anomaly Detection Benchmark." *Proceedings of NeurIPS Datasets and Benchmarks Track*.
Bergmann, P., Fauser, M., Sattlegger, D., & Steger, C. (2019). "MVTec AD: A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection." *Proceedings of CVPR*, 9592-9600.
Roth, K., Pemula, L., Zepeda, J., Scholkopf, B., Brox, T., & Gehler, P. (2022). "Towards Total Recall in Industrial Anomaly Detection." *Proceedings of CVPR*, 14318-14328.
