In machine learning, anomaly detection is the process of identifying data points, events, or observations that deviate significantly from normal patterns in a dataset. These abnormal outcomes are known as anomalies, outliers, or exceptions. As an example, if the mean for a certain feature is 50 with a standard deviation of 5, then anomaly detection should flag a value of 300 as suspicious.
Anomaly detection is a fundamental problem across many disciplines, from finance and cybersecurity to manufacturing and healthcare. The core difficulty lies in the fact that anomalies are, by definition, rare and unpredictable. A system that works well on one type of anomaly may completely miss another. Because of this, the field has produced a wide range of methods, spanning statistical models, classical machine learning, and modern deep learning techniques.
The global anomaly detection market was valued at approximately $6.9 billion in 2025 and is projected to reach $28 billion by 2034, reflecting the growing adoption of machine learning-based anomaly detection across industries.
Anomaly detection tasks generally fall into three learning paradigms: supervised (both normal and anomalous examples are labeled), semi-supervised (the model is trained on normal data only), and unsupervised (no labels are available, and anomalies must be inferred from the structure of the data).
Anomalies can be divided into three primary categories: point anomalies, contextual anomalies, and collective anomalies. This taxonomy was formalized by Chandola et al. in their 2009 survey paper and remains the standard framework used in the literature.
Point anomalies, also referred to as global anomalies, are individual data points that differ significantly from the majority: a single instance is considered anomalous if it deviates substantially from the rest of the data, regardless of context or surrounding observations.
Examples of point anomalies include:

- A credit card purchase far larger than any the cardholder has ever made
- A sensor reading far outside the instrument's physically plausible range
- A single server reporting 100% CPU usage while the rest of the fleet sits idle
Point anomalies are the simplest type to detect. They can be identified using statistical methods like the z-score, interquartile range, or Mahalanobis distance, or machine learning techniques like Isolation Forest, One-Class SVM, or autoencoders.
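The z-score approach mentioned above can be sketched in a few lines of NumPy. Note how the outlier itself inflates the mean and standard deviation, which is exactly the sensitivity to contaminated training data noted in the table below:

```python
import numpy as np

def zscore_outliers(x, threshold=2.5):
    """Flag values whose z-score magnitude exceeds `threshold`."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

# Matches the introductory example: values around 50, one extreme 300.
readings = np.array([48, 52, 50, 47, 51, 49, 53, 300], dtype=float)
mask = zscore_outliers(readings)
```

With a threshold of 3 instead of 2.5, the 300 would slip through here, because it drags the mean and standard deviation up so far that its own z-score falls below 3.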
Contextual anomalies, also referred to as conditional anomalies, are data points that are anomalous only within certain contexts or subpopulations of the data. The same value might be perfectly normal in one context but highly unusual in another.
For instance, a high heart rate may be considered normal during physical exercise but abnormal when sleeping. Similarly, a temperature of 35 degrees Celsius is normal in summer but unusual in winter for temperate climates. In network traffic, a spike in data transfers may be expected during business hours but suspicious at 3:00 AM.
Contextual anomalies require two types of attributes:

- Contextual attributes, which define the context of an instance (for example, time of day, location, or season)
- Behavioral attributes, which define the value being evaluated within that context (for example, heart rate, temperature, or traffic volume)
To detect contextual anomalies, context information must be integrated into the model. This can be done through rule-based systems, Bayesian networks, decision trees, or conditional probability models.
Collective anomalies, also referred to as group anomalies, are collections of data points that exhibit unusual behavior when taken together but not individually. Each observation may fall within normal ranges, but the pattern formed by the group as a whole is abnormal.
Examples include:

- A flat segment in an electrocardiogram: every individual reading is within the normal range, but a sustained unchanging sequence indicates a problem
- A burst of network requests that are individually benign but together resemble a denial-of-service attack
- Many small bank withdrawals that only look suspicious in aggregate
Detecting collective anomalies requires the identification of patterns or dependencies between data points and the discovery of subpopulations that show anomalous behavior. Clustering, principal component analysis, sequence analysis, and Local Outlier Factor can all be utilized for detection.
Statistical approaches to anomaly detection are among the oldest and most established techniques. They assume that normal data follows a known or estimable probability distribution and flag data points that have low probability under that distribution.
Parametric methods assume the data follows a specific distribution (typically Gaussian). They estimate the distribution parameters from the training data and then calculate the probability of each test point.
| Method | Description | Strengths | Limitations |
|---|---|---|---|
| Z-score | Measures how many standard deviations a point is from the mean. Points with z-scores beyond a threshold (commonly 2.5 or 3) are flagged. | Simple, fast, interpretable | Assumes normal distribution; sensitive to outliers in the training data |
| Grubbs' test | A formal statistical hypothesis test that checks whether the most extreme value in a univariate sample is an outlier | Statistically rigorous with p-values | Assumes normal distribution; tests one outlier at a time |
| Mahalanobis distance | Measures the distance from a point to the distribution center, accounting for correlations between variables | Handles correlated features; multivariate | Requires estimating the covariance matrix, which is unstable in high dimensions |
| Gaussian Mixture Models (GMM) | Models the data as a mixture of multiple Gaussian distributions using Expectation-Maximization. Points with low likelihood under the fitted mixture are flagged. | Can handle multi-modal data | Requires choosing the number of components; may overfit |
Non-parametric methods do not assume a fixed distributional form. They estimate the underlying distribution directly from the data.
| Method | Description | Strengths | Limitations |
|---|---|---|---|
| Histogram-based | Builds a histogram of feature values and flags points in low-density bins | Very fast; easy to implement | Struggles in high dimensions; bin size selection is sensitive |
| Kernel Density Estimation (KDE) | Estimates the probability density function using kernels centered on each data point | Smooth density estimate; no binning required | Computationally expensive for large datasets; bandwidth selection is critical |
| Interquartile Range (IQR) | Flags points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR | Robust to outliers in training data; simple | Univariate only; does not capture relationships between features |
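The IQR rule from the table above is a one-liner in NumPy; a minimal sketch with illustrative data:

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

# Sensor readings clustered around 11-13, with one spike at 95.
readings = np.array([10, 12, 11, 13, 12, 95, 11, 12], dtype=float)
flags = iqr_outliers(readings)
```

Because quartiles barely move when a single extreme value is added, the 95 is cleanly flagged — the robustness advantage listed in the table.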
For higher-dimensional data, multivariate methods are necessary. Principal component analysis (PCA) can be used for anomaly detection by projecting data onto principal components and measuring reconstruction error. Points with high reconstruction error in the reduced space are flagged as anomalies.
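A minimal sketch of PCA-based detection with scikit-learn, using synthetic data where normal points lie near a line in 2-D (the data and component count are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Normal data lies close to the line x2 = 2 * x1.
t = rng.normal(size=200)
X_train = np.column_stack([t, 2 * t + 0.05 * rng.normal(size=200)])

# Keep one principal component; reconstruction error measures
# how far a point sits from the learned low-dimensional subspace.
pca = PCA(n_components=1).fit(X_train)

X_test = np.vstack([X_train[:5], [[0.0, 5.0]]])   # last point is far off the line
recon = pca.inverse_transform(pca.transform(X_test))
recon_error = np.linalg.norm(X_test - recon, axis=1)
```

Points near the line reconstruct almost perfectly; the off-line point incurs a large residual and would be flagged by any reasonable threshold.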
The Minimum Covariance Determinant (MCD) estimator provides a robust estimate of the covariance matrix that is less influenced by outliers, making it suitable for detecting anomalies in multivariate data.
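In scikit-learn, the MCD estimator backs the `EllipticEnvelope` detector, which flags points with large robust Mahalanobis distances. A sketch on synthetic correlated data (the covariance and contamination values are assumptions for illustration):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(42)
# Strongly correlated 2-D Gaussian data.
cov = [[1.0, 0.8], [0.8, 1.0]]
X_train = rng.multivariate_normal([0.0, 0.0], cov, size=300)

# contamination is the assumed fraction of outliers in the data.
detector = EllipticEnvelope(contamination=0.01, random_state=0).fit(X_train)

# (6, -6) violates the positive correlation, so its Mahalanobis
# distance is huge even though each coordinate alone is only ~6 sigma.
X_test = np.array([[0.1, 0.2], [6.0, -6.0]])
labels = detector.predict(X_test)   # +1 = inlier, -1 = outlier
```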
Machine learning methods for anomaly detection go beyond distributional assumptions and instead learn the structure of normal data directly from examples. These approaches generally fall into distance-based, density-based, tree-based, and boundary-based categories.
Isolation Forest, proposed by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in 2008, takes a fundamentally different approach to anomaly detection. Rather than modeling normal data and then identifying deviations, it directly isolates anomalies.
The algorithm works by building an ensemble of random binary trees called isolation trees. Each tree is constructed by repeatedly selecting a random feature and a random split value within that feature's range. This recursive partitioning continues until every data point is isolated in its own leaf node.
The key insight is that anomalies, being "few and different," require fewer random splits to be isolated. They tend to land in leaf nodes closer to the root of the tree, resulting in shorter average path lengths across the forest. Normal points, which are clustered together in dense regions, require many splits to be separated and thus have longer average path lengths.
The anomaly score for a data point is derived from its average path length E(h(x)) across all trees in the forest, normalized by the expected path length c(n) for a dataset of size n: s(x, n) = 2^(-E(h(x))/c(n)). Scores close to 1 indicate likely anomalies, while scores near 0.5 or below indicate normal behavior.
Isolation Forest has several practical advantages:

- Near-linear time complexity and low memory use, since each tree is built on a small subsample
- No distance or density computations, avoiding expensive pairwise comparisons
- Good performance on high-dimensional data relative to distance-based methods
Extensions of the original algorithm include Extended Isolation Forest (EIF), which uses random hyperplanes instead of axis-aligned splits, reducing bias on datasets with strong feature correlations.
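A minimal usage sketch with scikit-learn's `IsolationForest` on synthetic data (the cluster parameters are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # "normal" cloud

forest = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
forest.fit(X_train)

X_test = np.array([[0.1, -0.2],   # central, should be normal
                   [8.0, 8.0]])   # far away, easy to isolate
labels = forest.predict(X_test)        # +1 = normal, -1 = anomaly
scores = forest.score_samples(X_test)  # lower = more anomalous in sklearn's convention
```

Note that scikit-learn negates the paper's score, so *lower* `score_samples` values mean *more* anomalous.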
One-Class Support Vector Machine (One-Class SVM), introduced by Scholkopf et al. in 2001, is a variant of the traditional SVM designed for anomaly and novelty detection. Unlike standard SVMs that separate two classes, One-Class SVM is trained exclusively on data from the normal class.
The algorithm maps the training data into a high-dimensional feature space using a kernel function (typically the radial basis function, or RBF kernel). It then finds the hyperplane with the maximum margin that separates the mapped data points from the origin. The origin in kernel feature space serves as a stand-in for "everything anomalous." Data points that fall on the side of the origin (outside the learned boundary) are classified as anomalies.
One-Class SVM is effective when:

- The training data consists almost entirely of clean normal examples
- The boundary of the normal region is non-linear, which the kernel can capture
- The dataset is small to moderate in size, so the kernel computation remains tractable
However, it can be computationally expensive for large datasets due to the kernel matrix computation (O(n^2) to O(n^3)), and its performance is sensitive to the choice of kernel parameters and the contamination ratio (nu parameter). A related approach, Support Vector Data Description (SVDD), proposed by Tax and Duin, finds the smallest hypersphere enclosing the normal data rather than using a hyperplane.
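A minimal sketch with scikit-learn's `OneClassSVM`, trained on normal data only (the `nu` and data parameters are assumptions for illustration):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 2))   # normal class only, no anomaly labels

# nu upper-bounds the fraction of training points treated as outliers;
# gamma controls the RBF kernel width ("scale" adapts it to the data).
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_test = np.array([[0.0, 0.1],    # near the center of the training cloud
                   [6.0, 6.0]])   # well outside it
labels = ocsvm.predict(X_test)    # +1 = inside the learned boundary, -1 = outside
```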
Local Outlier Factor (LOF), proposed by Breunig, Kriegel, Ng, and Sander in 2000, is a density-based anomaly detection algorithm. It identifies anomalies by comparing the local density of a data point with the local densities of its neighbors.
The algorithm works in several steps:

1. For each point, find its k nearest neighbors and the distance to the k-th neighbor (the k-distance).
2. Compute the reachability distance from the point to each neighbor, defined as the maximum of the actual distance and the neighbor's k-distance.
3. Compute the local reachability density (LRD) of the point as the inverse of the average reachability distance to its neighbors.
4. Compute the LOF score as the average ratio of the neighbors' LRDs to the point's own LRD.
A LOF score of approximately 1 indicates that a point has a similar density to its neighbors (normal). A score significantly greater than 1 indicates that the point has a lower density than its neighbors (anomaly). A score below 1 suggests the point is in a denser region than its neighbors.
The key advantage of LOF is its ability to detect local anomalies. A point that would not be considered an outlier globally (e.g., it is not far from the overall mean) can still be flagged if its nearest neighbors form a dense region from which the point itself is relatively isolated. This makes LOF particularly useful for datasets with clusters of varying densities. LOF shares density-based concepts such as reachability with DBSCAN and OPTICS, reflecting its roots in that family of algorithms.
The k-nearest neighbors approach to anomaly detection uses the distance to a data point's k-th nearest neighbor (or the average distance to its k nearest neighbors) as an anomaly score. Points that are far from their neighbors are considered anomalous.
This method is simple and intuitive but has O(n^2) time complexity for naive implementations, which limits scalability. Approximate nearest neighbor search algorithms and spatial indexing structures (such as KD-trees and ball trees) can significantly reduce this computational burden. Unlike LOF, k-NN distance does not account for local density variations, making it less effective when clusters have different densities.
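The k-th-neighbor distance score can be sketched with scikit-learn's `NearestNeighbors`, which uses tree-based indexing internally (k=5 is an assumed choice; the extra neighbor accounts for each point matching itself):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = np.vstack([X, [[7.0, 7.0]]])   # one far-away point

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbor
dists, _ = nn.kneighbors(X)
knn_score = dists[:, -1]   # distance to the k-th true neighbor; larger = more anomalous
```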
DBSCAN (Density-Based Spatial Clustering of Applications with Noise), proposed by Ester, Kriegel, Sander, and Xu in 1996, is a clustering algorithm that naturally identifies outliers as a byproduct of its density-based grouping process. Points that do not belong to any cluster are labeled as noise, and these noise points serve as detected anomalies.
DBSCAN uses two parameters: epsilon (the radius of the neighborhood) and MinPts (the minimum number of points required to form a dense region). Points that have at least MinPts neighbors within epsilon distance are core points. Points within epsilon of a core point but with fewer than MinPts neighbors are border points. All remaining points are noise (anomalies).
DBSCAN is effective at finding clusters of arbitrary shape and does not require specifying the number of clusters in advance, unlike k-means. However, it can struggle with datasets that have varying densities, since a single epsilon value may not capture all cluster structures. The algorithm received the ACM SIGKDD Test of Time Award in 2014 in recognition of its broad impact on the field.
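A sketch of DBSCAN's noise-as-anomaly behavior with scikit-learn (the `eps` and `min_samples` values are assumptions tuned to this synthetic data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
cluster_a = rng.normal([0.0, 0.0], 0.3, size=(100, 2))
cluster_b = rng.normal([4.0, 4.0], 0.3, size=(100, 2))
X = np.vstack([cluster_a, cluster_b, [[2.0, 2.0]]])  # lone point between the clusters

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
noise_mask = db.labels_ == -1   # noise points are the detected anomalies
```

The two clusters are recovered without specifying their number, and the isolated point, having no dense neighborhood, is labeled noise.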
Histogram-Based Outlier Score (HBOS), introduced by Goldstein and Dengel in 2012, is an unsupervised anomaly detection method that scores data points based on their positions within univariate histograms. For each feature, HBOS constructs a histogram and assigns higher outlier scores to values that fall into bins with low frequency.
HBOS assumes feature independence, which makes it computationally very efficient (linear time complexity) but limits its ability to detect anomalies that arise from unusual combinations of feature values. Benchmarks have shown HBOS to be up to five times faster than clustering-based methods and up to seven times faster than nearest-neighbor-based methods, making it a strong choice for initial screening on large datasets. It reliably detects global outliers but performs poorly on local outlier problems.
Extended HBOS (EHBOS) addresses some of these limitations by using two-dimensional histograms to capture pairwise feature interactions.
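The core of HBOS fits in a few lines of NumPy. This is a simplified sketch of the idea (per-feature histograms, scores summed in log space under the independence assumption); the published method additionally normalizes bin heights and supports dynamic bin widths:

```python
import numpy as np

def hbos_scores(X, n_bins=10):
    """Simplified HBOS: sum over features of log inverse bin density."""
    X = np.asarray(X, dtype=float)
    scores = np.zeros(len(X))
    for j in range(X.shape[1]):
        col = X[:, j]
        hist, edges = np.histogram(col, bins=n_bins, density=True)
        # Map each value to its bin; interior edges give indices 0..n_bins-1.
        idx = np.clip(np.digitize(col, edges[1:-1]), 0, n_bins - 1)
        density = np.maximum(hist[idx], 1e-12)   # guard against log(0)
        scores += np.log(1.0 / density)
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
X[0] = [9.0, 9.0, 9.0]        # one extreme point, alone in the last bin of every feature
scores = hbos_scores(X)
```

Each pass over the data is a single histogram per feature, which is where the linear time complexity comes from.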
| Method | Type | Year | Complexity | Handles high dimensions | Detects local anomalies | Key parameter(s) |
|---|---|---|---|---|---|---|
| Isolation Forest | Tree-based | 2008 | O(n log n) | Yes | Partially | Number of trees, contamination |
| One-Class SVM | Boundary-based | 2001 | O(n^2) to O(n^3) | Moderate | No | Kernel, nu, gamma |
| LOF | Density-based | 2000 | O(n^2) | No | Yes | k (number of neighbors) |
| DBSCAN | Density-based | 1996 | O(n log n) with indexing | No | Yes | Epsilon, MinPts |
| k-NN distance | Distance-based | Classic | O(n^2) | Moderate | No | k (number of neighbors) |
| HBOS | Histogram-based | 2012 | O(n) | No | No | Number of bins |
| Random Forest (supervised) | Tree-based | 2001 | O(n log n) | Yes | No | Number of trees, depth |
| Elliptic Envelope | Statistical | Classic | O(n * d^2) | Moderate | No | Contamination |
Deep learning methods have become increasingly prominent in anomaly detection, particularly for high-dimensional and complex data types such as images, time series, and network traffic. These approaches leverage neural networks to learn rich representations of normal data and identify deviations.
An autoencoder is a neural network trained to reconstruct its input through a bottleneck layer. It consists of an encoder that compresses the input into a lower-dimensional latent representation and a decoder that reconstructs the original input from this representation.
For anomaly detection, the autoencoder is trained on normal data only. During training, the network learns to compress and reconstruct normal patterns efficiently, minimizing the reconstruction error (typically measured as mean squared error). At test time, normal data will be reconstructed accurately with low error, while anomalous data will produce high reconstruction error because the network has never learned to represent such patterns. The reconstruction error serves as the anomaly score: points with error above a chosen threshold are flagged as anomalies.
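The reconstruction-error principle can be illustrated without a deep learning framework: scikit-learn's `MLPRegressor`, trained to reproduce its own input through a narrow bottleneck layer, acts as a small autoencoder. This is a toy sketch (the architecture and synthetic 1-D manifold are assumptions), not a production model:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Normal data lives near a 1-D curve embedded in 3-D space: (t, t^2, -t).
t = rng.uniform(-1, 1, size=500)
X_train = np.column_stack([t, t ** 2, -t]) + 0.01 * rng.normal(size=(500, 3))

# The 2-unit bottleneck forces the network to learn the normal structure
# rather than copying inputs through.
ae = MLPRegressor(hidden_layer_sizes=(8, 2, 8), activation="tanh",
                  max_iter=3000, random_state=0)
ae.fit(X_train, X_train)   # target equals input: learn to reconstruct

X_test = np.vstack([X_train[:5], [[1.0, 1.0, 1.0]]])  # last row breaks x3 = -x1
recon = ae.predict(X_test)
errors = ((X_test - recon) ** 2).mean(axis=1)         # per-sample reconstruction error
```

Points on the learned manifold reconstruct with low error; the off-manifold point cannot be represented by the bottleneck and yields the largest error, so a simple threshold on `errors` separates it.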
Variants of autoencoders used for anomaly detection include:

- Denoising autoencoders, trained to reconstruct clean inputs from corrupted versions
- Sparse autoencoders, which penalize latent activations to encourage compact codes
- Convolutional autoencoders for image data
- LSTM autoencoders for sequential and time series data
Variational autoencoders (VAEs), introduced by Kingma and Welling in 2014, extend the standard autoencoder with a probabilistic framework. Instead of encoding each input to a single point in latent space, the encoder outputs the parameters of a probability distribution (typically a Gaussian, defined by a mean and variance). A latent vector is sampled from this distribution and passed to the decoder to produce the reconstruction.
The training objective for a VAE includes two terms: the reconstruction loss (how well the decoder reproduces the input) and the KL divergence (how close the learned latent distribution is to a standard normal prior). This combination encourages a smooth, organized latent space.
For anomaly detection, VAEs offer several advantages over standard autoencoders:

- Anomaly scores can be defined probabilistically (for example, as a reconstruction probability) rather than as a raw reconstruction error
- The KL-regularized latent space is smoother and less prone to memorizing individual training points
- The model can generate samples of normal data, which is useful for inspection and validation
Recent research has combined VAEs with other architectures. For instance, hybrid VAE-Transformer models integrate variational inference with attention mechanisms for time series anomaly detection, while GAN-VAE hybrid models combine generative adversarial training with variational inference to better capture complex data distributions.
Generative adversarial networks (GANs) have also been adapted for anomaly detection. The core idea is to train a GAN on normal data so that its generator learns to produce realistic normal samples. At test time, anomalous inputs will not fit the learned distribution of normal data.
Several GAN-based anomaly detection methods have been proposed:
| Method | Year | Approach | Key innovation |
|---|---|---|---|
| AnoGAN | 2017 | Searches the latent space for the closest generated image to the input | First GAN-based anomaly detection method; uses iterative optimization |
| f-AnoGAN | 2019 | Learns a direct mapping from images to latent space | Replaces iterative search with a learned encoder; much faster inference |
| GANomaly | 2018 | Uses an encoder-decoder-encoder architecture | Compares latent representations at two stages; avoids costly iterative search |
| ALAD | 2018 | Adversarially learned anomaly detection using BiGAN | Learns both generation and inference simultaneously |
GAN-based methods are particularly useful in domains where anomaly examples are scarce, since the generator can be trained exclusively on normal data. However, GANs can be difficult to train due to mode collapse and instability, and recent work on conditional GANs and cycle-consistent GANs has sought to address these challenges.
Transformer architectures, originally developed for natural language processing, have been adapted for anomaly detection in time series and other sequential data.
The Anomaly Transformer, proposed by Xu et al. in 2022, modifies the standard self-attention mechanism to distinguish between normal and anomalous temporal patterns. It introduces an "association discrepancy" metric that measures the difference between learned associations and a prior distribution. Normal time points tend to form strong associations with their adjacent context, while anomalous points exhibit weaker or different association patterns.
More recent models include the Variable Temporal Transformer (VTT), which uses temporal self-attention for modeling temporal dependencies and variable self-attention for modeling correlations between variables, and CAE-T (Convolutional Autoencoding Transformer), which combines convolutional autoencoders for spatial feature extraction with Transformers for capturing long-term temporal dependencies.
Transformer-based approaches for anomaly detection can operate in two modes:

- Forecasting-based: the model predicts future values and flags points with large prediction errors
- Reconstruction-based: the model reconstructs (possibly masked) input sequences and flags points with large reconstruction errors
The main advantage of Transformers is their ability to capture both local and global dependencies through the attention mechanism, making them effective at detecting anomalies that depend on long-range temporal context.
Time series data presents unique challenges for anomaly detection because the ordering and temporal context of observations is critical. What constitutes an anomaly often depends on the surrounding values, seasonal patterns, and long-term trends.
Time series anomalies can be categorized as:

- Point anomalies: single timestamps with abnormal values (spikes or dips)
- Contextual anomalies: values that are abnormal only given the time of day, season, or recent history
- Collective (subsequence) anomalies: whole segments whose shape deviates from normal patterns, even when each value is individually plausible
| Method | Category | Approach | Best suited for |
|---|---|---|---|
| ARIMA-based | Statistical | Fits autoregressive models and flags residuals exceeding a threshold | Stationary or trend-stationary series |
| STL decomposition | Statistical | Separates trend, seasonal, and residual components; detects anomalies in residuals | Seasonal data with clear periodicity |
| Prophet | Statistical | Facebook's forecasting tool with built-in anomaly detection via prediction intervals | Business time series with holidays and trend changes |
| LSTM autoencoder | Deep learning | Encodes temporal patterns with LSTM layers; detects anomalies via reconstruction error | Multivariate time series with complex temporal patterns |
| Anomaly Transformer | Deep learning | Uses modified self-attention with association discrepancy | Long sequences with both local and global anomalies |
| Matrix Profile | Algorithmic | Computes all-pairs subsequence distances efficiently | Motif and discord discovery in long series |
| Spectral Residual (SR) | Signal processing | Uses Fourier transform to extract the spectral residual of the time series | Detecting saliency-based anomalies |
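The residual-thresholding idea behind the statistical methods in the table can be sketched in NumPy, using a trailing moving average as the baseline and a robust (MAD-based) scale estimate so the anomaly itself does not inflate the threshold. The window and threshold values are assumptions for this synthetic series:

```python
import numpy as np

def residual_anomalies(series, window=20, threshold=3.0):
    """Flag points whose deviation from a trailing moving average
    exceeds `threshold` robust standard deviations of the residuals."""
    s = np.asarray(series, dtype=float)
    kernel = np.ones(window) / window
    forecast = np.convolve(s, kernel, mode="full")[:len(s)]  # trailing mean baseline
    forecast[:window] = s[:window]        # warm-up region: residual forced to zero
    resid = s - forecast
    dev = resid - np.median(resid)
    sigma = 1.4826 * np.median(np.abs(dev))   # MAD-based scale, robust to the spike
    return np.abs(dev) > threshold * sigma

rng = np.random.default_rng(0)
t = np.arange(300)
series = 0.01 * t + 0.05 * rng.normal(size=300)   # slow upward trend plus noise
series[200] += 2.0                                # injected spike
flags = residual_anomalies(series)
```

Because the baseline tracks the trend, the gradual drift produces small residuals while the spike stands out; a fixed global threshold on the raw values would instead flag the entire end of the series.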
Several challenges are specific to the time series setting:

- Seasonality and trend must be modeled or removed before residuals are meaningful
- Normal behavior drifts over time, so models must be retrained or adapted
- Detection often must happen online, with low latency and without access to future values
- Labels are scarce, and the exact boundaries of an anomalous interval are often ambiguous
Image anomaly detection, often studied in the context of industrial visual inspection, aims to identify defective or unusual images (or regions within images) given only examples of normal images during training.
The MVTec Anomaly Detection (MVTec AD) dataset, introduced by Bergmann et al. in 2019, has become the standard benchmark for image anomaly detection research. It contains over 5,000 high-resolution images divided into 15 categories, including 5 texture classes (carpet, grid, leather, tile, wood) and 10 object classes (bottle, cable, capsule, hazelnut, metal nut, pill, screw, toothbrush, transistor, zipper).
Each category includes defect-free training images and a test set containing both normal images and images with various defects such as scratches, dents, contaminations, cracks, and structural changes. Pixel-precise ground truth annotations are provided for all anomalous regions, enabling both image-level anomaly detection ("is this image defective?") and pixel-level anomaly localization ("where is the defect?").
MVTec AD 2, released as a follow-up, expands the benchmark with eight additional scenarios and over 8,000 images, covering more challenging inspection tasks.
Modern image anomaly detection methods can be grouped into several families:
Reconstruction-based methods train a model (such as a convolutional autoencoder or VAE) to reconstruct normal images. At test time, defective regions produce high reconstruction error. Approaches like DRAEM (2021) use synthetic anomalies during training to guide the reconstruction learning.
Feature embedding methods extract features from a pretrained convolutional neural network (such as a ResNet or Wide ResNet trained on ImageNet) and model the distribution of normal features.
| Method | Year | Approach | MVTec AD image AUROC |
|---|---|---|---|
| SPADE | 2020 | K-nearest-neighbor matching on pretrained CNN features at multiple resolutions | 85.5% |
| PaDiM | 2020 | Models patch-level feature distributions using multivariate Gaussian | 95.3% |
| PatchCore | 2022 | Memory bank of representative patch features with coreset subsampling | 99.1% |
| CFlow-AD | 2022 | Normalizing flow on multi-scale pretrained features | 98.3% |
| SimpleNet | 2023 | Feature adaptation with a simple discriminator network | 99.6% |
Knowledge distillation methods train a student network to mimic the features of a pretrained teacher network on normal data. At test time, anomalous regions produce discrepancies between teacher and student features. Methods in this family include STPM (Student-Teacher Feature Pyramid Matching) and Reverse Distillation.
Synthetic anomaly methods generate artificial defects during training to provide explicit supervision. CutPaste (2021) creates anomalies by cutting and pasting image patches, while DRAEM (2021) applies Perlin noise textures as synthetic anomalies.
Anomaly detection is used in many fields to detect and prevent potentially hazardous events.
In finance, anomaly detection is employed to spot fraudulent transactions. Credit card fraud detection systems analyze spending patterns and flag transactions that deviate from a cardholder's established behavior. For instance, a purchase in a foreign country minutes after a domestic transaction is likely anomalous. Mastercard's Decision Intelligence platform, for example, analyzes up to 160 billion transactions annually in under 50 milliseconds, reportedly boosting fraud detection by up to 300% while reducing false positives by more than 85%.
Machine learning models used for fraud detection include:

- Supervised classifiers such as gradient-boosted trees and random forests, trained on historically labeled fraud
- Unsupervised detectors such as Isolation Forest and autoencoders, which can surface novel fraud patterns
- Graph-based methods that model relationships between accounts, merchants, and devices
A key challenge in fraud detection is the extreme class imbalance: fraudulent transactions typically represent less than 0.5% of all transactions. Techniques such as SMOTE (Synthetic Minority Oversampling Technique), class weighting, and cost-sensitive learning are commonly used to address this imbalance.
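The effect of class weighting is easy to demonstrate with scikit-learn on synthetic imbalanced data (the 1% fraud rate and class separation are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
# ~1% "fraud": the positive class is shifted but overlaps the bulk.
n_normal, n_fraud = 5000, 50
X = np.vstack([rng.normal(0.0, 1.0, size=(n_normal, 2)),
               rng.normal(2.5, 1.0, size=(n_fraud, 2))])
y = np.array([0] * n_normal + [1] * n_fraud)

# Unweighted: the 0.5 decision threshold is biased toward the majority class.
plain = LogisticRegression().fit(X, y)
# class_weight="balanced" reweights losses inversely to class frequency.
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

recall_plain = recall_score(y, plain.predict(X))
recall_weighted = recall_score(y, weighted.predict(X))
```

The weighted model catches substantially more of the rare class, at the cost of more false positives — the trade-off that cost-sensitive learning makes explicit.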
Network intrusion detection systems (NIDS) use anomaly detection to identify malicious network activity. By learning the baseline behavior of network traffic (packet sizes, protocols, source/destination patterns, timing), these systems can flag deviations that may indicate attacks such as port scanning, denial-of-service (DoS) attacks, data exfiltration, or lateral movement within a network.
Anomaly-based intrusion detection has several advantages over signature-based systems: it can detect novel (zero-day) attacks that have no known signature and adapt to evolving network environments. However, it also tends to produce higher false positive rates.
Common datasets used for benchmarking network intrusion detection include NSL-KDD, CICIDS2017, and the more recent CICIoMT2024 dataset for Internet of Medical Things (IoMT) security. Recent work has applied Transformer architectures and ensemble methods combining deep learning with traditional ML algorithms for improved detection accuracy.
In manufacturing, anomaly detection is used for quality control and defect detection. Visual inspection systems powered by computer vision and deep learning can automatically identify defective products on assembly lines, detecting scratches, dents, discoloration, cracks, missing components, and assembly errors.
The typical approach involves training a model exclusively on images of non-defective products. At inference time, any deviation from the learned normal appearance triggers an alert. This is particularly valuable because defect examples are rare and highly variable in manufacturing settings.
Beyond visual inspection, anomaly detection is applied to sensor data from industrial equipment for predictive maintenance. By monitoring vibration, temperature, pressure, and other sensor readings, models can detect early signs of equipment failure before a breakdown occurs, reducing unplanned downtime and maintenance costs.
In healthcare, anomaly detection supports several critical tasks:

- Identifying abnormal patterns in vital signs and ECG or EEG recordings for early warning of patient deterioration
- Flagging unusual regions in medical images that may indicate disease
- Detecting anomalous insurance claims and billing patterns
- Monitoring medical devices for malfunctions
| Domain | Application | Typical methods |
|---|---|---|
| Astronomy | Detecting unusual celestial events or objects | Clustering, Isolation Forest |
| Environmental monitoring | Detecting pollution events or unusual weather patterns | Time series methods, statistical process control |
| Social media | Identifying bot activity, spam, or fake accounts | Graph-based methods, classification |
| Telecommunications | Detecting network faults and service degradation | Statistical methods, LSTM models |
| Energy | Detecting power grid faults and abnormal consumption | ARIMA, autoencoders |
| Agriculture | Identifying crop disease or pest infestations from imagery | Convolutional neural network autoencoders, PatchCore |
Evaluating anomaly detection systems requires careful consideration of the metrics used, especially because of the inherent class imbalance between normal and anomalous data.
Threshold-dependent metrics require choosing a specific decision threshold to classify scores as normal or anomalous.
| Metric | Formula | Interpretation |
|---|---|---|
| Precision | TP / (TP + FP) | Proportion of detected anomalies that are truly anomalous |
| Recall (Sensitivity) | TP / (TP + FN) | Proportion of actual anomalies that are correctly detected |
| F1 Score | 2 * Precision * Recall / (Precision + Recall) | Harmonic mean of precision and recall |
| Specificity | TN / (TN + FP) | Proportion of normal instances correctly identified |
| False Positive Rate | FP / (FP + TN) | Rate at which normal instances are incorrectly flagged |
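The formulas in the table map directly to scikit-learn's metric functions; a small worked example with one false positive and one false negative:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# 1 = anomaly. Ten ground-truth labels vs. a detector's thresholded output:
# 3 true positives, 1 false positive (index 4), 1 false negative (index 8).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]

precision = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75
f1 = f1_score(y_true, y_pred)                # harmonic mean = 0.75
```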
Threshold-free metrics, most commonly ROC-AUC (area under the receiver operating characteristic curve) and PR-AUC (area under the precision-recall curve, often reported as average precision), evaluate the quality of the anomaly scores across all possible thresholds.
For image anomaly localization, additional metrics are used:

- Pixel-level AUROC, which treats each pixel as a separate detection decision
- PRO (per-region overlap), which averages overlap across ground-truth defect regions so that small defects count as much as large ones
In production systems, the choice of metric depends on the application:

- In fraud detection and security monitoring, precision often matters most, because every alert consumes analyst time
- In medical screening and safety-critical monitoring, recall is paramount, because a missed anomaly is far more costly than a false alarm
- When both matter, the F1 score or a cost-weighted combination is used, with the threshold tuned on held-out data
Anomaly detection presents several persistent obstacles.
One of the major obstacles is data imbalance: anomalies make up a small fraction of all instances, which makes it difficult for machine learning models to learn their characteristics and distinguish them from regular instances. In many real-world scenarios, anomalous events represent less than 1% of the data, and in some cases (such as fraud detection) the ratio can be as extreme as 1 in 100,000. Under such severe imbalance, ROC-AUC can give a misleadingly optimistic view of performance, because the false positive rate is diluted by the enormous number of negatives; precision-recall metrics usually give a more honest picture.
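The gap between ROC-AUC and precision-recall metrics under imbalance is easy to reproduce on synthetic scores (the 0.1% anomaly rate and score separation are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
# 10,000 normal points, 10 anomalies whose scores are only moderately higher.
y = np.array([0] * 10000 + [1] * 10)
scores = np.concatenate([rng.normal(0.0, 1.0, 10000),
                         rng.normal(2.5, 1.0, 10)])

roc_auc = roc_auc_score(y, scores)           # looks excellent
pr_auc = average_precision_score(y, scores)  # reveals that most alerts are false alarms
```

The same detector that scores well into the 0.9s on ROC-AUC yields a low PR-AUC, because at any workable threshold the handful of true anomalies is buried under hundreds of flagged normals.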
Labeled anomalies may be scarce or unavailable, and the definition of what constitutes an anomaly may be uncertain or context-dependent. Collecting labeled anomaly data is expensive because it requires domain expertise, and anomalies are inherently rare. To address this, unsupervised or semi-supervised techniques that do not require labeled data may be utilized, along with expert knowledge and feedback to refine the definition of anomalies.
Anomaly detection often faces the problem of high dimensionality, where data may contain many features or variables that make it challenging to detect anomalies and visualize them. As the number of dimensions grows, the concept of distance becomes less meaningful (the "curse of dimensionality"), and all points tend to appear equally distant from one another. To address this challenge, feature engineering, dimensionality reduction techniques (such as PCA or autoencoders), or feature selection strategies can be employed to simplify the data and focus on the most pertinent features.
Concept drift occurs when the distribution of data changes over time, making a previously trained model outdated or ineffective at detecting new anomalies. What was considered normal six months ago may no longer apply. To combat this problem, adaptive or online learning techniques can be used that update models in real time or adapt to changes in data distribution. Sliding window approaches, periodic retraining, and incremental model updates are common solutions.
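A sliding-window detector is the simplest of these adaptations: each point is scored against statistics of only the most recent window, so the notion of "normal" tracks a drifting baseline. A sketch in NumPy (the window size and threshold are assumed values):

```python
import numpy as np

def sliding_zscore(stream, window=100, threshold=4.0):
    """Score each point against the mean/std of the preceding `window` points."""
    flags = []
    for i, x in enumerate(stream):
        if i < window:
            flags.append(False)          # warm-up: not enough history yet
            continue
        ref = stream[i - window:i]
        mu, sd = ref.mean(), ref.std() + 1e-9
        flags.append(abs(x - mu) / sd > threshold)
    return np.array(flags)

rng = np.random.default_rng(0)
drift = np.linspace(0.0, 5.0, 1000)      # slow upward drift in the mean
stream = drift + 0.1 * rng.normal(size=1000)
stream[700] += 3.0                        # genuine spike on top of the drift
flags = sliding_zscore(stream)
```

A z-score computed from global statistics would flag the entire drifted tail of this stream as anomalous; the windowed version follows the drift and isolates the spike.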
In many real-world settings, there is no clear-cut boundary between normal and anomalous data. Normal behavior may be multimodal, context-dependent, or gradually shifting. A model that defines normality too narrowly will generate excessive false positives, while a model that defines it too broadly will miss genuine anomalies. The choice of how to define normal is often application-specific and requires close collaboration between data scientists and domain experts.
In many applications, it is not enough to simply flag a data point as anomalous. Operators and analysts need to understand why a detection was made. Deep learning models, while powerful, often function as black boxes. Explainability techniques such as SHAP (SHapley Additive exPlanations), LIME (Local Interpretable Model-agnostic Explanations), and attention visualization are increasingly used alongside anomaly detection models to provide interpretable explanations for flagged anomalies.
| Method | Category | Learning type | Data type | Strengths | Limitations |
|---|---|---|---|---|---|
| Z-score / IQR | Statistical | Unsupervised | Tabular | Simple, fast, interpretable | Assumes specific distributions; univariate |
| Mahalanobis distance | Statistical | Unsupervised | Tabular | Handles correlations | Requires covariance estimation |
| Isolation Forest | ML (tree-based) | Unsupervised | Tabular | Fast, scalable, no distance computations | Struggles with many irrelevant features |
| One-Class SVM | ML (boundary) | Semi-supervised | Tabular | Flexible non-linear boundaries via kernels | Computationally expensive; sensitive to kernel choice |
| LOF | ML (density) | Unsupervised | Tabular | Detects local anomalies | O(n^2) complexity; sensitive to k |
| DBSCAN | ML (density) | Unsupervised | Tabular, spatial | Finds clusters of arbitrary shape | Sensitive to epsilon; struggles with varying density |
| HBOS | ML (histogram) | Unsupervised | Tabular | Extremely fast; linear time | Assumes feature independence; misses local outliers |
| Autoencoder | Deep learning | Semi-supervised | Tabular, image, sequence | Learns complex patterns; versatile architecture | Requires careful architecture design; threshold selection |
| VAE | Deep learning | Semi-supervised | Tabular, image, sequence | Probabilistic framework; smooth latent space | More complex training; may underfit |
| GAN-based (AnoGAN, etc.) | Deep learning | Semi-supervised | Image | Generates realistic normal data | Difficult to train; mode collapse |
| LSTM Autoencoder | Deep learning | Semi-supervised | Time series | Captures temporal dependencies | Slow training; struggles with very long sequences |
| Anomaly Transformer | Deep learning | Semi-supervised | Time series | Global and local context via attention | Large model size; requires substantial data |
| PatchCore | Deep learning (feature embedding) | Semi-supervised | Image | State-of-the-art image detection; fast inference with coreset | Relies on pretrained backbone; memory-intensive |
Several open-source libraries provide implementations of anomaly detection algorithms:

- scikit-learn includes IsolationForest, OneClassSVM, LocalOutlierFactor, and EllipticEnvelope
- PyOD collects dozens of detectors (including HBOS, k-NN, and autoencoder-based methods) behind a unified API
- anomalib focuses on image anomaly detection, with implementations of methods such as PaDiM and PatchCore
Imagine you have a big box of red marbles. You know what red marbles look like because you see them every day. Now, if someone sneaks a blue marble into your box, you would notice it right away because it looks different from everything else.
Anomaly detection works the same way, but with computers. We show a computer lots of examples of what "normal" looks like. Then, when we give it new data, it checks whether each piece looks like the normal examples it has seen before. If something looks very different from what the computer expects, it raises its hand and says, "Hey, this one looks strange!"
This is useful for things like catching someone who stole a credit card (their purchases look different from the real owner's), finding a broken machine in a factory (it starts making weird sounds), or spotting a sick plant in a field (it looks different from the healthy ones).