# Outlier Detection

> Source: https://aiwiki.ai/wiki/outlier_detection
> Updated: 2026-06-27
> Categories: Data & Datasets, Machine Learning, Statistics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Outlier detection** is the process of identifying data points, observations, or patterns that deviate so markedly from the rest of a dataset that they are likely to have been generated by a different process. The statistician Douglas Hawkins gave the classical definition in 1980: an outlier is "an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism" [1]. Outlier detection (closely related to [anomaly detection](/wiki/anomaly_detection) and [novelty detection](/wiki/novelty_detection)) is applied across [fraud detection](/wiki/fraud_detection), network intrusion detection, fault detection and predictive maintenance, medical diagnosis, [sensor](/wiki/sensor) monitoring, and data cleaning [11][12].

The major method families are statistical (z-score, interquartile range, Grubbs' test), distance and density based (k-nearest neighbors, Local Outlier Factor), model based (Isolation Forest, One-Class SVM, Elliptic Envelope), and deep learning based (autoencoders, Deep SVDD) [11][12][13]. Methods draw from statistics, [machine learning](/wiki/machine_learning), and [deep learning](/wiki/deep_learning), and the right choice depends on data dimensionality, the availability of labeled examples, computational constraints, and whether the data arrives as a fixed batch or a continuous stream.

## Explain like I'm 5 (ELI5)

Imagine you are sorting a jar of red gumballs and you find one blue marble mixed in. The blue marble looks different from everything else in the jar. That is what an outlier is: something that does not fit with the rest of the group. Outlier detection is like having a helper who checks every item in the jar and says, "This one does not belong." Computers do the same thing with numbers and data, looking for the items that are unusual or surprising compared to everything else.

## What is outlier detection?

Outlier detection finds the small number of observations in a dataset that do not look like the rest. Hawkins framed the goal in 1980 as flagging any "observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism" [1]. The practical aim is either to remove or correct those points (data cleaning) or to surface them as the signal of interest, as in fraud or intrusion, where the rare event is exactly what you want to catch [11][12].

What counts as "too far" is not universal. A point that is normal next to a sparse cluster can be an outlier next to a dense one, which is why methods range from a single global threshold (z-score) to local, neighborhood-relative scores (Local Outlier Factor) [2]. Most real-world deployments use [unsupervised learning](/wiki/unsupervised_learning), because labeled anomalies are rare, so the algorithm must judge each point by intrinsic properties such as density, distance, or how easily it can be isolated [12][13].

## How is outlier detection different from anomaly detection?

In most of the literature the terms are used interchangeably. Chandola, Banerjee, and Kumar, in their widely cited 2009 survey, define the field broadly and note that non-conforming points are "often referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities or contaminants, depending on the application domain" [12]. As a rough convention, "outlier detection" is most common in statistics and data cleaning (a property of a static dataset), while "anomaly detection" is the more common term in machine learning and operations, especially for streaming and [time series](/wiki/time_series_analysis) data where the goal is to flag anomalous behavior over time.

Novelty detection is the related task of deciding whether a new observation differs from data seen during training. The [scikit-learn](/wiki/scikit-learn) documentation draws the line by what the training set contains [5]:

| Task | Training data | scikit-learn framing |
|------|---------------|----------------------|
| Outlier detection | Contains outliers (polluted); the model fits where the training data is most concentrated and ignores deviant points. | Unsupervised anomaly detection. |
| Novelty detection | Not polluted by outliers (clean); the model decides whether a new point is an outlier (a "novelty"). | Semi-supervised anomaly detection. |

In scikit-learn's words, for novelty detection "the training data is not polluted by outliers and we are interested in detecting whether a new observation is an outlier. In this context an outlier is also called a novelty" [5].

## What are the main types of outliers?

Outliers are generally classified into three categories [12].

| Type | Description | Example |
|------|-------------|---------|
| Point outlier (global outlier) | A single data point that is far from the rest of the dataset. | A credit card transaction of $50,000 when typical transactions are under $200. |
| Contextual outlier (conditional outlier) | A data point that is anomalous in a specific context but might be normal in another. | A temperature of 35°C is normal in July but anomalous in January (in a temperate climate). |
| Collective outlier | A group of data points that are individually unremarkable but together form an unusual pattern. | A sequence of small transactions in rapid succession that together indicate a card-testing fraud attack. |

## Is outlier detection supervised or unsupervised?

Outlier detection methods fall into three learning paradigms based on the availability of labels. Unsupervised methods dominate in practice because labeled anomalies are scarce [12].

| Paradigm | Label requirement | Description |
|----------|-------------------|-------------|
| [Supervised learning](/wiki/supervised_learning) | Fully labeled (normal and anomalous) | A [classification](/wiki/classification_model) model is trained on labeled examples of both normal and anomalous data. Effective when labeled anomalies are available, but this is rare in practice. |
| Semi-supervised learning | Only normal labels | A model learns the distribution of normal data and flags deviations at test time. Also called novelty detection. |
| [Unsupervised learning](/wiki/unsupervised_learning) | No labels | The algorithm identifies outliers based on intrinsic properties of the data such as density, distance, or isolation. This is the most common paradigm for outlier detection. |

## What are the main outlier detection methods?

The sections below cover the major families: statistical tests, distance-based and density-based methods, model-based methods (Isolation Forest, One-Class SVM, Elliptic Envelope), and deep learning approaches.

## Statistical methods

Statistical approaches are among the oldest techniques for identifying outliers [11]. They rely on fitting a statistical model to the data and flagging points that have low probability under that model.

### Z-score method

The Z-score measures how many [standard deviations](/wiki/standard_deviation) a data point is from the [mean](/wiki/mean). For a data point *x* in a dataset with mean *μ* and standard deviation *σ*:

**Z = (x - μ) / σ**

A common convention is to flag data points with |Z| > 3 as outliers, meaning they lie more than three standard deviations from the mean (the "3-sigma rule"). This method assumes the data follows a [normal distribution](/wiki/normal_distribution); on skewed or heavy-tailed data the threshold loses its probabilistic interpretation. It is also sensitive to the very values it is meant to find: extreme points inflate the mean and standard deviation, which lowers their own Z-scores and can mask them.
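The rule can be sketched with the Python standard library. The function name `z_score_outliers` and the sample readings are illustrative assumptions, and the sketch uses the sample standard deviation. Because the largest possible Z-score in a sample of size n is (n - 1) / √n, no point in a sample of ten or fewer values can exceed 3, so the example uses 21 values.

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    if std == 0:
        return []  # all values identical, nothing to flag
    return [x for x in values if abs(x - mean) / std > threshold]

# Illustrative data: 20 readings near 10 plus one extreme value.
readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
            11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 100]
flagged = z_score_outliers(readings)  # only the value 100 is flagged
```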

### Modified Z-score

The modified Z-score uses the median and the median absolute deviation (MAD) instead of the mean and standard deviation, making it more robust to the very outliers it is trying to detect:

**M = 0.6745 × (x - median) / MAD**

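A threshold of |M| > 3.5 is commonly used. The constant 0.6745 is the 75th percentile of the standard normal distribution; it rescales the MAD so that *M* is comparable to an ordinary Z-score when the data is normal. Because the median and MAD are resistant to extreme values, the modified Z-score performs better than the standard Z-score on datasets with heavy contamination.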
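A minimal sketch of the modified Z-score, again using only the standard library. The helper name `modified_z_scores` and the seven-point sample are assumptions made for illustration.

```python
import statistics

def modified_z_scores(values):
    """Compute M = 0.6745 * (x - median) / MAD for every value."""
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    if mad == 0:
        # More than half the values equal the median; M is undefined.
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return [0.6745 * (x - med) / mad for x in values]

sample = [10, 11, 9, 10, 12, 10, 100]
scores = modified_z_scores(sample)
flagged = [x for x, m in zip(sample, scores) if abs(m) > 3.5]

# For comparison: the ordinary Z-score of the extreme value in the same sample.
plain_z = (100 - statistics.mean(sample)) / statistics.stdev(sample)
```

In this small sample the ordinary Z-score of 100 stays below 3, because the value inflates the mean and standard deviation it is measured against, while its modified score is roughly 60.7.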

### Interquartile range (IQR) method

The IQR method is a non-parametric approach that does not assume any particular data distribution. It computes the first quartile (Q1) and third quartile (Q3) and defines the IQR as Q3 - Q1. Data points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are classified as outliers. This is the rule used in standard box plots, introduced by John Tukey in his 1977 book *Exploratory Data Analysis* [14]. Tukey called the fences at 1.5 × IQR the "inner fences" and those at 3.0 × IQR the "outer fences"; points beyond the outer fences are described as "far out." Under a perfect normal distribution the 1.5 × IQR rule flags roughly 0.7% of observations as outliers, which gives a sense of how aggressive the default threshold is [14].
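The fences and the roughly 0.7% figure can be checked with a short sketch. Quartile conventions differ between tools; the choice of `method="inclusive"` in `statistics.quantiles` is an assumption, as are the function names and the sample data.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    low, high = tukey_fences(values, k)
    return [x for x in values if x < low or x > high]

def normal_flag_rate(k=1.5):
    """Share of a standard normal population outside the k*IQR fences."""
    nd = statistics.NormalDist()
    q1, q3 = nd.inv_cdf(0.25), nd.inv_cdf(0.75)
    upper = q3 + k * (q3 - q1)
    return 2 * (1 - nd.cdf(upper))  # both tails, by symmetry

data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 50]
outliers = iqr_outliers(data)   # inner fences (k = 1.5)
rate = normal_flag_rate()       # close to 0.007 for k = 1.5
```

Passing `k=3.0` gives the outer fences, which flag only a few observations per million under a normal distribution.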

### Grubbs' test

Grubbs' test (also known as the maximum normed residual test) is a formal [hypothesis testing](/wiki/hypothesis_testing) procedure designed to detect a single outlier in a univariate dataset assumed to come from a normal distribution. The test statistic is the largest absolute deviation from the sample mean divided by the sample standard deviation. If this statistic exceeds a critical value determined by the sample size and significance level, the most extreme point is declared an outlier. Grubbs' test can be applied iteratively to detect multiple outliers, but each removal changes the sample size, mean, and standard deviation, so the statistic and critical value must be recomputed on every pass and the overall significance level is no longer guaranteed.

### Mahalanobis distance and Elliptic Envelope

For multivariate data, the Mahalanobis distance accounts for correlations between variables and differences in scale. Unlike the Euclidean distance, it uses the covariance matrix of the data:

**D = √((x - μ)ᵀ S⁻¹ (x - μ))**

where *S* is the [covariance](/wiki/covariance) matrix. Points with large Mahalanobis distances are flagged as outliers. The Elliptic Envelope method in [scikit-learn](/wiki/scikit-learn) uses the Minimum Covariance Determinant (MCD) estimator to compute a robust version of the covariance matrix, making it more resistant to the influence of outliers on the covariance estimate itself [5]. This approach works best when n_samples > n_features² and when the data is roughly elliptically (Gaussian) distributed.
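To illustrate why the covariance matters, the sketch below computes the Mahalanobis distance for two-dimensional data using the closed-form inverse of a 2 × 2 matrix. It uses the ordinary sample covariance, not the robust MCD estimator; the function name and the data are assumptions for illustration.

```python
import math

def mahalanobis_2d(point, data):
    """Mahalanobis distance of `point` from the mean of 2-D `data`."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    # Entries of the sample covariance matrix S.
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mx, point[1] - my
    # (x - mu)^T S^-1 (x - mu), expanded with the 2x2 inverse of S.
    quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(quad)

# Strongly correlated data lying close to the line y = x (mean is (3.5, 3.5)).
data = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1), (6, 5.9)]
d_along = mahalanobis_2d((5.5, 5.5), data)   # follows the correlation
d_across = mahalanobis_2d((5.5, 1.5), data)  # violates the correlation
```

Both query points are the same Euclidean distance from the mean, yet the point that breaks the correlation receives a Mahalanobis distance more than ten times larger.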

### Summary of statistical methods

| Method | Parametric? | Univariate/Multivariate | Assumptions | Strengths | Limitations |
|--------|-------------|------------------------|-------------|-----------|-------------|
| Z-score | Yes | Univariate | Normal distribution | Simple, fast | Sensitive to extreme values; assumes normality |
| Modified Z-score | Yes | Univariate | Approximate normality | Robust to contamination | Still assumes rough symmetry |
| IQR | No | Univariate | None | Distribution-free; simple | May miss outliers in multimodal data |
| Grubbs' test | Yes | Univariate | Normal distribution | Formal hypothesis test with p-values | Designed for single outliers; iterative use changes data |
| Mahalanobis / Elliptic Envelope | Yes | Multivariate | Elliptical distribution | Accounts for correlations | Requires stable covariance estimation; degrades in high dimensions |

## Distance-based methods

Distance-based methods define outliers by their remoteness from other data points. Instead of assuming a particular data distribution, they use distance metrics (such as [Euclidean distance](/wiki/euclidean_distance) or Manhattan distance) to quantify how far each point is from its neighbors.

### k-nearest neighbors (k-NN) approach

The [k-nearest neighbors](/wiki/k_nearest_neighbors) approach computes the distance from each data point to its k-th nearest neighbor. Points whose k-th neighbor distance exceeds a threshold are considered outliers. Knorr and Ng introduced one of the earliest distance-based outlier definitions in 1998, defining an outlier as a point for which at least a fraction *p* of all data points lie farther than distance *D* away [6]. A common variant scores each point by the average distance to its k nearest neighbors instead of the distance to the k-th neighbor alone.

The main advantage is conceptual simplicity. The main limitation is computational cost: computing all pairwise distances requires O(n²) time, which becomes expensive for large datasets. Spatial index structures such as KD-trees or ball trees can reduce this cost in low-dimensional spaces.
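A brute-force sketch of the k-th neighbor score makes the O(n²) cost visible: every point is compared against every other point. The function name and the toy points are assumptions; choosing the threshold is left to the caller.

```python
import math

def kth_neighbor_distances(points, k):
    """Distance from each point to its k-th nearest other point (brute force)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Five points in a unit square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
scores = kth_neighbor_distances(points, k=2)
most_remote = scores.index(max(scores))  # index of the distant point
```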

## Density-based methods

Density-based methods estimate the local density around each data point and flag points in regions of unusually low density. Their strength is the ability to detect outliers relative to their local neighborhood, which makes them effective when normal data forms clusters of varying density.

### Local Outlier Factor (LOF)

The Local Outlier Factor algorithm was proposed by Breunig, Kriegel, Ng, and Sander in 2000 at the ACM SIGMOD conference in Dallas [2]. Rather than treating outliers as a binary property, the paper assigns each object "a degree of being an outlier," called its local outlier factor, by comparing the local density of a point to the local densities of its k nearest neighbors [2].

The key steps are:

1. For each point, find its k nearest neighbors.
2. Compute the reachability distance, which is the maximum of the actual distance and the k-distance of the neighbor (smoothing out density estimates for points very close to dense clusters).
3. Compute the local reachability density (LRD) as the inverse of the average reachability distance to the k neighbors.
4. Compute the LOF score as the ratio of the average LRD of the point's neighbors to the point's own LRD.

A LOF score near 1 indicates the point has density similar to its neighbors (inlier). A score significantly greater than 1 indicates an outlier. A score below 1 indicates the point is in a denser region than its neighbors.
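The four steps can be sketched in plain Python. This is a simplified illustration, not the reference implementation: it assumes each neighborhood contains exactly k points (ties at the k-distance are broken by index) and that the data has no groups of more than k identical points, which would make a density infinite. The function name and data are illustrative.

```python
import math

def lof_scores(points, k):
    """Local Outlier Factor for every point, brute-force neighbor search."""
    n = len(points)
    # Step 1: the k nearest neighbors of each point as (distance, index) pairs.
    neighbors = []
    for i in range(n):
        ranked = sorted((math.dist(points[i], points[j]), j)
                        for j in range(n) if j != i)
        neighbors.append(ranked[:k])
    k_dist = [nb[-1][0] for nb in neighbors]  # distance to the k-th neighbor
    # Steps 2-3: reachability distances, then local reachability density.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # Step 4: ratio of the neighbors' average density to the point's own.
    return [sum(lrd[j] for _, j in neighbors[i]) / (k * lrd[i])
            for i in range(n)]

# A 3 x 3 grid of points plus one isolated point far from the grid.
grid = [(x, y) for x in range(3) for y in range(3)]
scores = lof_scores(grid + [(10, 10)], k=3)
```

On this data the grid points score close to 1, while the isolated point scores well above it.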

LOF is effective at finding outliers in datasets with clusters of different densities, where a global threshold approach would fail. For example, a point at moderate distance from a very dense cluster may be an outlier, even though it would be considered normal if judged against a sparser cluster. LOF shares some foundational concepts with [DBSCAN](/wiki/dbscan), including core distance and reachability distance [2].

### DBSCAN as an outlier detector

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is primarily a [clustering](/wiki/clustering) algorithm, but it has built-in outlier detection capabilities. DBSCAN classifies points as core points (with at least *minPts* neighbors within radius *eps*), border points (within *eps* of a core point but with fewer than *minPts* neighbors), or noise points. Noise points are not assigned to any cluster and can be treated as outliers.

DBSCAN's advantage for outlier detection is that it does not assume clusters have a particular shape (such as spherical). Its limitation is sensitivity to the choice of *eps* and *minPts* parameters, which can be difficult to set without domain knowledge.
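Noise points can be identified without running the full clustering: a point is noise when it is not a core point and has no core point within *eps*. The sketch below applies that definition directly. It assumes that a point counts toward its own neighborhood when comparing against *minPts*; the function name and data are illustrative.

```python
import math

def dbscan_noise(points, eps, min_pts):
    """Return indices of points that DBSCAN would label as noise."""
    n = len(points)
    # eps-neighborhood of every point (includes the point itself).
    neigh = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
             for i in range(n)]
    core = [len(nb) >= min_pts for nb in neigh]
    # Border points have a core point nearby; noise points do not.
    return [i for i in range(n)
            if not core[i] and not any(core[j] for j in neigh[i])]

grid = [(x, y) for x in range(3) for y in range(3)]
points = grid + [(3, 0), (10, 10)]  # one border point, one isolated point
noise = dbscan_noise(points, eps=1.1, min_pts=3)
```

Here `(3, 0)` is a border point because it sits within *eps* of the core point `(2, 0)`, so only `(10, 10)` is reported as noise.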

## Model-based and tree-based methods

### Isolation Forest

Isolation Forest was introduced by Liu, Ting, and Zhou in 2008 at the IEEE International Conference on Data Mining (ICDM) [3]. It takes a fundamentally different approach from density-based and distance-based methods: instead of building a profile of normal data and then identifying deviations, it explicitly isolates anomalies.

The core principle, in the authors' words, is that anomalies are "few and different," and "as a result of these properties, anomalies are susceptible to a mechanism called isolation" [3]. Because anomalies are few and different, they are easier to separate from the rest of the data through random partitioning. The algorithm works as follows:

1. **Build isolation trees.** For each tree, randomly sample a subset of the data. Recursively partition the data by randomly selecting a feature and a split value between the feature's minimum and maximum in the current subset. Continue until each point is isolated in its own leaf node or a maximum tree height is reached.
2. **Compute path lengths.** The path length for a data point is the number of edges traversed from the root to the leaf node where it ends up.
3. **Compute anomaly scores.** Average the path lengths across all trees in the forest. Shorter average path lengths indicate anomalies, because anomalous points are easier to isolate. The anomaly score *s* is normalized using the average path length of unsuccessful searches in a Binary Search Tree as a baseline.

Isolation Forest has linear time complexity with low memory requirements, which makes it efficient for large datasets [3][4]. It handles high-dimensional data better than many density-based methods and does not require distance or density computations. The main parameters are the number of trees and the subsampling size. Scikit-learn provides an implementation via `sklearn.ensemble.IsolationForest`.
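The three steps can be sketched in pure Python. This is a simplified illustration under stated assumptions, not the scikit-learn implementation: trees are stored as dictionaries, the height limit is ⌈log₂ ψ⌉ for subsample size ψ, and the score is s = 2^(−E[h(x)] / c(ψ)), where c(n) is the average path length of an unsuccessful binary search tree lookup. All function names and the test data are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(rows, depth, max_depth, rng):
    """Recursively partition rows with random axis-aligned splits."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:
        return {"size": len(rows)}  # cannot split on this feature
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {"feature": feature, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(x, node, depth=0):
    """Edges from the root to x's leaf, plus c(size) for unsplit leaves."""
    if "feature" not in node:
        return depth + c(node["size"])
    child = node["left"] if x[node["feature"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)

def isolation_scores(data, n_trees=100, sample_size=256, seed=0):
    """Anomaly score in (0, 1] for every row; higher means more anomalous."""
    rng = random.Random(seed)
    psi = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(data, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    return [2 ** (-sum(path_length(x, t) for t in trees) / n_trees / c(psi))
            for x in data]

# 100 points from a standard Gaussian cloud plus one distant point.
gen = random.Random(1)
data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(100)] + [(8.0, 8.0)]
scores = isolation_scores(data)
```

The distant point is separated after only a few random splits, so its average path length is short and its score is the highest in the dataset.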

### Extended Isolation Forest

The original Isolation Forest uses axis-aligned splits, which can produce artifacts when anomalies do not align with feature axes. The Extended Isolation Forest (Hariri, Kind, and Brunner, 2019) addresses this by using hyperplane splits with random slopes rather than axis-parallel cuts, allowing it to capture anomalies in data with correlated features more effectively [7].

### One-Class SVM

The One-Class [Support Vector Machine](/wiki/support_vector_machine_svm) (SVM) learns a decision boundary that encloses the normal data in feature space. Points falling outside this boundary are classified as outliers. It maps data into a high-dimensional feature space using a [kernel function](/wiki/kernel_function) and finds the maximum-margin hyperplane separating the data from the origin. It was introduced by Scholkopf, Platt, Shawe-Taylor, Smola, and Williamson in their 2001 paper "Estimating the Support of a High-Dimensional Distribution" [5]. The method uses a parameter *ν* (nu) in the range (0, 1) that is simultaneously an upper bound on the fraction of training points allowed to fall outside the boundary (outliers) and a lower bound on the fraction of support vectors [5]. One-Class SVM performs well when normal behavior is well represented and anomalies are rare or unknown at training time, but it can be sensitive to the kernel and to *ν*, and it scales poorly to very large datasets.

### Deep SVDD

Deep Support Vector Data Description (Deep SVDD) extends the support-vector idea to neural networks. It trains a network to map normal data into a minimum-volume hypersphere in a learned feature space, then scores points by their distance from the center of that sphere; points far from the center are anomalous. Deep SVDD was introduced by Ruff et al. in 2018 and is one of the standard deep one-class baselines [13].

### Clustering-based detection

Clustering-based approaches first group data into clusters and then identify points that do not belong to any cluster, belong to very small clusters, or are far from cluster centroids. After running [k-means](/wiki/k-means) clustering, for example, points that are far from their assigned cluster centroid (relative to other points in the same cluster) can be flagged as outliers. The distance to the centroid can be compared against a threshold such as the mean plus some multiple of the standard deviation within each cluster.

## Deep learning approaches

[Neural network](/wiki/neural_network) architectures have become widely used for outlier detection, especially in high-dimensional data such as images, text, and time series [13].

### Autoencoders

An [autoencoder](/wiki/autoencoder) is a neural network trained to reconstruct its input through a bottleneck layer. During training on normal data, the autoencoder learns to compress and reconstruct typical patterns. At inference time, anomalous inputs produce high reconstruction error because the network has not learned to represent them. The reconstruction error serves as the anomaly score [13].

Variants include:

- **Denoising autoencoders**: trained to reconstruct clean data from artificially corrupted inputs, improving robustness.
- **Sparse autoencoders**: add a sparsity constraint on the bottleneck activations.
- **Convolutional autoencoders**: use [convolutional layers](/wiki/convolutional_neural_network) for image and spatial data.

### Variational autoencoders (VAEs)

A [variational autoencoder](/wiki/variational_autoencoder) (VAE) learns a probabilistic latent space rather than a deterministic encoding. The encoder outputs parameters of a distribution (typically a Gaussian) over the latent space, and the decoder reconstructs the input from a latent vector sampled from that distribution. Anomalies can be detected by computing the reconstruction probability or the evidence lower bound (ELBO). Because VAEs model uncertainty in the data, they can be more sensitive to subtle anomalies than standard autoencoders.

### Generative adversarial networks (GANs)

A [generative adversarial network](/wiki/generative_adversarial_network) (GAN) consists of a generator and a discriminator trained in an adversarial setup. For anomaly detection, the GAN is trained on normal data. At test time, anomalies are identified by their poor reconstruction by the generator, the discriminator's confidence score, or a combination of both. AnoGAN (Schlegl et al., 2017) was one of the first GAN-based anomaly detection methods, designed for detecting anomalies in retinal optical coherence tomography images [8].

### Transformer-based methods

[Transformer](/wiki/transformer) architectures, originally developed for [natural language processing](/wiki/natural_language_processing), have been adapted for time series anomaly detection. Models such as AnomalyBERT use self-supervised pretraining with synthetic anomaly injection to learn representations of normal temporal patterns. The [self-attention](/wiki/self_attention) mechanism allows these models to capture long-range dependencies in sequential data.

### Self-supervised learning for anomaly detection

[Self-supervised learning](/wiki/self-supervised_learning) methods train models on pretext tasks (such as predicting rotations, solving jigsaw puzzles, or reconstructing masked portions of input) using only normal data. At test time, anomalous inputs produce poor performance on these pretext tasks, which serves as the detection signal. This approach reduces the need for labeled anomaly data and has been applied in computer vision and industrial defect detection [13].

### Summary of deep learning approaches

| Method | Architecture | Anomaly signal | Strengths | Limitations |
|--------|-------------|----------------|-----------|-------------|
| [Autoencoder](/wiki/autoencoder) | Encoder-decoder | Reconstruction error | Simple, effective for tabular and image data | May reconstruct anomalies well if model capacity is too high |
| [VAE](/wiki/variational_autoencoder) | Probabilistic encoder-decoder | Reconstruction probability / ELBO | Captures uncertainty; more sensitive to subtle anomalies | More complex to train; requires tuning of latent dimension |
| [GAN](/wiki/generative_adversarial_network) | Generator + discriminator | Generator reconstruction + discriminator score | Can generate realistic normal data for comparison | Training instability; mode collapse |
| Deep SVDD | One-class neural network | Distance from hypersphere center | End-to-end one-class objective | Risk of trivial constant solutions; needs careful regularization |
| [Transformer](/wiki/transformer) | Self-attention blocks | Attention-based anomaly score | Captures long-range temporal dependencies | Computationally expensive; requires large datasets |
| Self-supervised | Task-specific architecture | Pretext task performance drop | No labeled anomalies needed | Performance depends on pretext task design |

## How is outlier detection done on time series data?

Detecting outliers in [time series](/wiki/time_series_analysis) data requires methods that account for temporal dependencies, trends, and seasonality.

### Seasonal decomposition

Seasonal-trend decomposition using LOESS (STL) splits a time series into trend, seasonal, and residual components. Outliers are then detected in the residual component using standard statistical methods (such as the IQR method or Grubbs' test). This approach, used in methods like S-ESD (Seasonal Extreme Studentized Deviate), removes predictable patterns before testing for anomalies.
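The idea of testing residuals rather than raw values can be sketched without a full STL implementation. The simplified version below is an assumption-laden stand-in: it estimates the seasonal component as the median of each phase, ignores trend entirely, and applies the modified Z-score to the residuals. The function name, period, and synthetic series are illustrative.

```python
import statistics

def seasonal_residual_outliers(series, period, threshold=3.5):
    """Indices whose deseasonalized residual has a large modified Z-score."""
    # Seasonal profile: median of all values that share the same phase.
    profile = [statistics.median(series[phase::period])
               for phase in range(period)]
    residuals = [x - profile[i % period] for i, x in enumerate(series)]
    center = statistics.median(residuals)
    mad = statistics.median(abs(r - center) for r in residuals)
    if mad == 0:
        return []  # residuals are (almost) constant; nothing to score
    return [i for i, r in enumerate(residuals)
            if abs(0.6745 * (r - center) / mad) > threshold]

# Repeating pattern with small deterministic jitter and one injected anomaly.
pattern = [10, 20, 30, 20]
series = [pattern[i % 4] + ((i * 7) % 5 - 2) * 0.1 for i in range(24)]
series[9] = 27.0  # unusual for its phase, yet below the seasonal peak of 30
flagged = seasonal_residual_outliers(series, period=4)
```

The injected value of 27 is lower than the regular seasonal peak, so a global threshold on the raw series would not catch it; it stands out only after the seasonal pattern is removed.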

### LSTM and RNN approaches

[Recurrent neural networks](/wiki/recurrent_neural_network) (RNNs) and [Long Short-Term Memory](/wiki/long_short-term_memory_lstm) (LSTM) networks can be trained to predict the next value in a time series. When the prediction error exceeds a threshold, the observation is flagged as anomalous. LSTM-based methods are effective at capturing both short-term and long-term temporal dependencies.

### Streaming data detection

For data that arrives continuously (such as server logs, financial tickers, or IoT sensor feeds), outlier detection must be performed online. Algorithms such as Amazon's Random Cut Forest (RCF) are designed for streaming scenarios, updating the model incrementally as new data points arrive without storing the entire dataset in memory.
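As a much simpler illustration of the online setting than Random Cut Forest, the sketch below keeps a running mean and variance with Welford's algorithm and flags each new value by its Z-score before absorbing it. It stores three numbers regardless of stream length. The class name, threshold, and warm-up length are assumptions.

```python
import math

class StreamingZScore:
    """Online Z-score detector using Welford's running mean and variance."""

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold
        self.warmup = warmup  # observations required before flagging
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0         # running sum of squared deviations

    def update(self, x):
        """Score x against the values seen so far, then absorb it."""
        is_outlier = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                is_outlier = True
        # Welford update of the running statistics.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_outlier

stream = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 9, 50, 10]
detector = StreamingZScore()
flags = [i for i, x in enumerate(stream) if detector.update(x)]
```

In this sketch flagged values still update the statistics, which keeps the code short but lets a large outlier widen the variance for later points; excluding or down-weighting flagged values is a common alternative design.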

## How is outlier detection evaluated?

Evaluating outlier detection algorithms requires metrics suited to the typically imbalanced nature of the problem (anomalies are rare). Common metrics include:

| Metric | Description | When to use |
|--------|-------------|-------------|
| [Precision](/wiki/precision) | Fraction of detected outliers that are true outliers | When false alarms are costly |
| [Recall](/wiki/recall) | Fraction of true outliers that are detected | When missing anomalies is costly |
| [F1 score](/wiki/f1_score) | Harmonic mean of precision and recall | When balancing false positives and false negatives |
| [AUC-ROC](/wiki/auc_area_under_the_roc_curve) | Area under the receiver operating characteristic curve | For overall ranking quality across all thresholds |
| AUC-PR | Area under the precision-recall curve | Preferred for highly imbalanced datasets |
| Average precision | Weighted mean of precisions at each recall threshold | When the ranking order of anomaly scores matters |

AUC-ROC and AUC-PR do not require setting a detection threshold, which makes them useful for comparing algorithms independently of threshold selection. In highly imbalanced settings (where normal data vastly outnumbers anomalies), AUC-PR is generally more informative than AUC-ROC: the large pool of true negatives keeps the false positive rate low even when most flagged points are false alarms, so AUC-ROC can look overly optimistic.
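A small worked example shows how the threshold-based and threshold-free metrics are computed. The labels, scores, and the 0.5 threshold are made up for illustration; AUC-ROC is computed here through its ranking interpretation (the probability that a random anomaly scores higher than a random normal point).

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 with 1 = outlier and 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def roc_auc(y_true, scores):
    """AUC-ROC as the fraction of (outlier, normal) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
scores = [0.1, 0.2, 0.15, 0.3, 0.05, 0.8, 0.25, 0.9, 0.1, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

With these values both true outliers are found and one normal point is a false alarm, giving a precision of 2/3, a recall of 1, an F1 of 0.8, and an AUC-ROC of 0.9375.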

## What is outlier detection used for?

Outlier detection is used in a wide range of practical domains [11][12].

| Domain | Application | Examples |
|--------|-------------|----------|
| Finance | [Fraud detection](/wiki/fraud_detection) | Credit card fraud, money laundering, insider trading, fraudulent insurance claims |
| Cybersecurity | Network intrusion detection | Detecting unauthorized access, malware infections, data exfiltration, unusual login patterns |
| Manufacturing | Quality control and predictive maintenance | Defective products on assembly lines, equipment degradation, sensor anomalies |
| Healthcare | Medical diagnosis | Unusual patient vital signs, rare disease identification, anomalous medical images |
| IoT and smart infrastructure | Sensor monitoring | Abnormal readings from environmental sensors, smart grid anomalies, pipeline leak detection |
| Science | Experimental data cleaning | Removing erroneous measurements from datasets before analysis |
| E-commerce | User behavior analysis | Bot detection, fake review identification, unusual browsing patterns |

The link between outlier detection and security dates to 1986, when Dorothy Denning proposed an intrusion-detection model built on the idea that attacks would show up as statistically anomalous behavior relative to a user's normal profile [10].

## What are the main challenges in outlier detection?

Several challenges affect the performance of outlier detection systems in practice.

### Curse of dimensionality

As the number of features increases, the notion of distance becomes less meaningful. In high-dimensional spaces, the distances between all pairs of points tend to converge (a phenomenon called distance concentration), making it difficult for distance-based and density-based methods to distinguish outliers from normal points. [Dimensionality reduction](/wiki/dimension_reduction) techniques such as [PCA](/wiki/principal_component_analysis), [t-SNE](/wiki/t_sne), or autoencoders can help mitigate this problem by projecting data into a lower-dimensional space before applying outlier detection.
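Distance concentration is easy to observe empirically. The sketch below draws uniform random points in the unit hypercube and measures the relative contrast, (max − min) / min, of their distances to a random query point. The function name, sample size, and dimensions are assumptions chosen for illustration.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of distances from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

low_dim = relative_contrast(2)     # nearest and farthest points differ a lot
high_dim = relative_contrast(500)  # all distances are nearly the same
```

As the dimension grows the contrast shrinks toward zero, which is why a "far away" point becomes hard to distinguish from a typical one.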

### Lack of labeled data

Labeled anomaly data is scarce in most real-world settings. Anomalies are by definition rare, and labeling them often requires expensive domain expertise. This limits the use of supervised methods and makes evaluation difficult, since ground truth labels may be incomplete or noisy [12].

### Concept drift

In streaming and production environments, the distribution of normal data changes over time. A model trained on historical data may fail to distinguish genuine anomalies from new-but-normal patterns. Adaptive methods that update their models incrementally are needed to handle concept drift.

### Interpretability

Many outlier detection algorithms (especially deep learning methods) produce anomaly scores without explaining why a point was flagged. In applications such as healthcare and finance, explaining the reason for an alert is often as important as detecting it. Research into explainable anomaly detection includes methods like the Subspace Outlier Degree (SOD), which identifies which features contributed most to the anomaly, and Correlation Outlier Probabilities (COP), which computes error vectors showing how a point would need to change to become normal.

### Parameter sensitivity

Most outlier detection algorithms require parameters that influence their behavior: the number of neighbors *k* in LOF, the *eps* and *minPts* in DBSCAN, the contamination rate in Isolation Forest, and the architecture of autoencoders. Setting these parameters without labeled validation data is a persistent difficulty.

## What software is used for outlier detection?

Several software tools and libraries provide implementations of outlier detection algorithms.

| Library | Language | Key algorithms | Notes |
|---------|----------|---------------|-------|
| [scikit-learn](/wiki/scikit-learn) | Python | Isolation Forest, LOF, One-Class SVM, Elliptic Envelope | Part of the broader scikit-learn ML toolkit; well-documented and widely used [5] |
| PyOD | Python | 50+ algorithms including LOF, k-NN, ECOD, autoencoders, COPOD, deep models | Dedicated outlier detection library released by Zhao, Nasrullah, and Li in 2019 [9] |
| ELKI | Java | LOF, ABOD, k-NN, DBSCAN, and many more | Research-oriented; optimized with index acceleration structures |
| [TensorFlow](/wiki/tensorflow) / [PyTorch](/wiki/pytorch) | Python | Custom autoencoder, VAE, GAN implementations | General deep learning frameworks used for building custom anomaly detectors |

PyOD provides "a wide range of outlier detection algorithms, including established outlier ensembles and more recent neural network-based approaches, under a single, well-documented API" intended for both practitioners and researchers [9].

## History of outlier detection

The formal study of outliers dates back to the 19th century, but computational outlier detection became a distinct field in the late 20th century.

| Year | Development |
|------|-------------|
| 1977 | John Tukey publishes *Exploratory Data Analysis*, introducing the box plot and the 1.5 × IQR fence rule [14]. |
| 1980 | Douglas Hawkins publishes *Identification of Outliers*, providing the widely cited formal definition of an outlier [1]. |
| 1986 | Dorothy Denning proposes using anomaly detection for intrusion-detection systems, linking outlier detection to cybersecurity [10]. |
| 1998 | Knorr and Ng introduce distance-based outlier detection, moving beyond distributional assumptions [6]. |
| 2000 | Breunig, Kriegel, Ng, and Sander propose the Local Outlier Factor (LOF) at ACM SIGMOD, establishing density-based outlier detection [2]. |
| 2001 | Scholkopf et al. introduce the One-Class SVM for novelty detection [5]. |
| 2008 | Liu, Ting, and Zhou propose Isolation Forest at IEEE ICDM, introducing the isolation-based paradigm [3]. |
| 2017 | Schlegl et al. introduce AnoGAN, applying GANs to anomaly detection in medical imaging [8]. |
| 2018 | Ruff et al. introduce Deep SVDD, a deep one-class objective for anomaly detection [13]. |
| 2019 | Zhao et al. release PyOD, a unified Python toolkit for outlier detection [9]; Hariri, Kind, and Brunner introduce the Extended Isolation Forest [7]. |

## See also

- [Anomaly detection](/wiki/anomaly_detection)
- [Novelty detection](/wiki/novelty_detection)
- [Clustering](/wiki/clustering)
- [Dimensionality reduction](/wiki/dimension_reduction)
- [Feature engineering](/wiki/feature_engineering)
- [Overfitting](/wiki/overfitting)
- [Data augmentation](/wiki/data_augmentation)

## References

1. Hawkins, D. M. (1980). *Identification of Outliers*. Chapman and Hall.
2. Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). "LOF: Identifying Density-Based Local Outliers." *Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data*, 93-104.
3. Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). "Isolation Forest." *Proceedings of the 2008 IEEE International Conference on Data Mining (ICDM)*, 413-422.
4. Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2012). "Isolation-Based Anomaly Detection." *ACM Transactions on Knowledge Discovery from Data*, 6(1), 1-39.
5. Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). "Estimating the Support of a High-Dimensional Distribution." *Neural Computation*, 13(7), 1443-1471.
6. Knorr, E. M. & Ng, R. T. (1998). "Algorithms for Mining Distance-Based Outliers in Large Datasets." *Proceedings of the 24th International Conference on Very Large Data Bases (VLDB)*, 392-403.
7. Hariri, S., Kind, M. C., & Brunner, R. J. (2019). "Extended Isolation Forest." *IEEE Transactions on Knowledge and Data Engineering*, 33(4), 1479-1489.
8. Schlegl, T., Seebock, P., Waldstein, S. M., Schmidt-Erfurth, U., & Langs, G. (2017). "Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery." *Information Processing in Medical Imaging (IPMI)*, 146-157.
9. Zhao, Y., Nasrullah, Z., & Li, Z. (2019). "PyOD: A Python Toolbox for Scalable Outlier Detection." *Journal of Machine Learning Research*, 20(96), 1-7.
10. Denning, D. E. (1987). "An Intrusion-Detection Model." *IEEE Transactions on Software Engineering*, SE-13(2), 222-232.
11. Aggarwal, C. C. (2017). *Outlier Analysis* (2nd ed.). Springer.
12. Chandola, V., Banerjee, A., & Kumar, V. (2009). "Anomaly Detection: A Survey." *ACM Computing Surveys*, 41(3), 1-58.
13. Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., Mueller, E., & Kloft, M. (2018). "Deep One-Class Classification." *Proceedings of the 35th International Conference on Machine Learning (ICML)*, 4393-4402. See also Pang, G., Shen, C., Cao, L., & Hengel, A. V. D. (2021). "Deep Learning for Anomaly Detection: A Review." *ACM Computing Surveys*, 54(2), 1-38.
14. Tukey, J. W. (1977). *Exploratory Data Analysis*. Addison-Wesley.