In machine learning, anomaly detection is the process of identifying data points, events, or observations that deviate significantly from normal patterns in a dataset. These abnormal outcomes are known as anomalies, outliers, or exceptions. As an example, if the mean for a certain feature is 50 with a standard deviation of 5, then anomaly detection should flag a value of 300 as suspicious.
Anomaly detection is a fundamental problem across many disciplines, from finance and cybersecurity to manufacturing and healthcare. The core difficulty lies in the fact that anomalies are, by definition, rare and unpredictable. A system that works well on one type of anomaly may completely miss another. Because of this, the field has produced a wide range of methods, spanning statistical models, classical machine learning, and modern deep learning techniques.
The global anomaly detection market was valued at approximately $6.9 billion in 2025 and is projected to reach $28 billion by 2034, reflecting the growing adoption of machine learning-based anomaly detection across industries.
Anomaly detection tasks generally fall into three learning paradigms: supervised (both normal and anomalous examples are labeled), semi-supervised (the model is trained on normal data only), and unsupervised (no labels are available, and anomalies must be inferred from the structure of the data).
Anomalies can be divided into three primary categories: point anomalies, contextual anomalies, and collective anomalies. This taxonomy was formalized by Chandola et al. in their 2009 survey paper and remains the standard framework used in the literature.
Point anomalies, also referred to as global anomalies, are individual data points that differ significantly from the majority: a single instance is considered anomalous if it deviates substantially from the rest of the data, regardless of context or surrounding observations.
Examples of point anomalies include:

- A credit card purchase far larger than any the cardholder has ever made
- A sensor reading far outside the instrument's physically plausible range
- A single server reporting 100% CPU usage while the rest of the fleet sits idle
Point anomalies are the simplest type to detect. They can be identified using statistical methods like the z-score, interquartile range, or Mahalanobis distance, or machine learning techniques like Isolation Forest, One-Class SVM, or autoencoders.
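The z-score approach mentioned above can be sketched in a few lines of NumPy. Note how the outlier itself inflates the mean and standard deviation, which is exactly the sensitivity to contaminated training data noted in the table below:

```python
import numpy as np

def zscore_outliers(x, threshold=2.5):
    """Flag values whose z-score magnitude exceeds `threshold`."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

# Matches the introductory example: values around 50, one extreme 300.
readings = np.array([48, 52, 50, 47, 51, 49, 53, 300], dtype=float)
mask = zscore_outliers(readings)
```

With a threshold of 3 instead of 2.5, the 300 would slip through here, because it drags the mean and standard deviation up so far that its own z-score falls below 3.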
Contextual anomalies, also referred to as conditional anomalies, are data points that are anomalous only within certain contexts or subpopulations of the data. The same value might be perfectly normal in one context but highly unusual in another.
For instance, a high heart rate may be considered normal during physical exercise but abnormal when sleeping. Similarly, a temperature of 35 degrees Celsius is normal in summer but unusual in winter for temperate climates. In network traffic, a spike in data transfers may be expected during business hours but suspicious at 3:00 AM.
Contextual anomalies require two types of attributes:

- Contextual attributes, which define the context of an instance (for example, time of day, location, or season)
- Behavioral attributes, which define the value being evaluated within that context (for example, heart rate, temperature, or traffic volume)
To detect contextual anomalies, context information must be integrated into the model. This can be done through rule-based systems, Bayesian networks, decision trees, or conditional probability models.
Collective anomalies, also referred to as group anomalies, are collections of data points that exhibit unusual behavior when taken together but not individually. Each observation may fall within normal ranges, but the pattern formed by the group as a whole is abnormal.
Examples include:

- A flat segment in an electrocardiogram: every individual reading is within the normal range, but a sustained unchanging sequence indicates a problem
- A burst of network requests that are individually benign but together resemble a denial-of-service attack
- Many small bank withdrawals that only look suspicious in aggregate
Detecting collective anomalies requires the identification of patterns or dependencies between data points and the discovery of subpopulations that show anomalous behavior. Clustering, principal component analysis, sequence analysis, and Local Outlier Factor can all be utilized for detection.
Statistical approaches to anomaly detection are among the oldest and most established techniques. They assume that normal data follows a known or estimable probability distribution and flag data points that have low probability under that distribution.
Parametric methods assume the data follows a specific distribution (typically Gaussian). They estimate the distribution parameters from the training data and then calculate the probability of each test point.
| Method | Description | Strengths | Limitations |
|---|---|---|---|
| Z-score | Measures how many standard deviations a point is from the mean. Points with z-scores beyond a threshold (commonly 2.5 or 3) are flagged. | Simple, fast, interpretable | Assumes normal distribution; sensitive to outliers in the training data |
| Grubbs' test | A formal statistical hypothesis test that checks whether the most extreme value in a univariate sample is an outlier | Statistically rigorous with p-values | Assumes normal distribution; tests one outlier at a time |
| Mahalanobis distance | Measures the distance from a point to the distribution center, accounting for correlations between variables | Handles correlated features; multivariate | Requires estimating the covariance matrix, which is unstable in high dimensions |
| Gaussian Mixture Models (GMM) | Models the data as a mixture of multiple Gaussian distributions using Expectation-Maximization. Points with low likelihood under the fitted mixture are flagged. | Can handle multi-modal data | Requires choosing the number of components; may overfit |
Non-parametric methods do not assume a fixed distributional form. They estimate the underlying distribution directly from the data.
| Method | Description | Strengths | Limitations |
|---|---|---|---|
| Histogram-based | Builds a histogram of feature values and flags points in low-density bins | Very fast; easy to implement | Struggles in high dimensions; bin size selection is sensitive |
| Kernel Density Estimation (KDE) | Estimates the probability density function using kernels centered on each data point | Smooth density estimate; no binning required | Computationally expensive for large datasets; bandwidth selection is critical |
| Interquartile Range (IQR) | Flags points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR | Robust to outliers in training data; simple | Univariate only; does not capture relationships between features |
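The IQR rule from the table above is a one-liner in NumPy; a minimal sketch with illustrative data:

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

# Sensor readings clustered around 11-13, with one spike at 95.
readings = np.array([10, 12, 11, 13, 12, 95, 11, 12], dtype=float)
flags = iqr_outliers(readings)
```

Because quartiles barely move when a single extreme value is added, the 95 is cleanly flagged — the robustness advantage listed in the table.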
For higher-dimensional data, multivariate methods are necessary. Principal component analysis (PCA) can be used for anomaly detection by projecting data onto principal components and measuring reconstruction error. Points with high reconstruction error in the reduced space are flagged as anomalies.
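A minimal sketch of PCA-based detection with scikit-learn, using synthetic data where normal points lie near a line in 2-D (the data and component count are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Normal data lies close to the line x2 = 2 * x1.
t = rng.normal(size=200)
X_train = np.column_stack([t, 2 * t + 0.05 * rng.normal(size=200)])

# Keep one principal component; reconstruction error measures
# how far a point sits from the learned low-dimensional subspace.
pca = PCA(n_components=1).fit(X_train)

X_test = np.vstack([X_train[:5], [[0.0, 5.0]]])   # last point is far off the line
recon = pca.inverse_transform(pca.transform(X_test))
recon_error = np.linalg.norm(X_test - recon, axis=1)
```

Points near the line reconstruct almost perfectly; the off-line point incurs a large residual and would be flagged by any reasonable threshold.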
The Minimum Covariance Determinant (MCD) estimator provides a robust estimate of the covariance matrix that is less influenced by outliers, making it suitable for detecting anomalies in multivariate data.
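In scikit-learn, the MCD estimator backs the `EllipticEnvelope` detector, which flags points with large robust Mahalanobis distances. A sketch on synthetic correlated data (the covariance and contamination values are assumptions for illustration):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(42)
# Strongly correlated 2-D Gaussian data.
cov = [[1.0, 0.8], [0.8, 1.0]]
X_train = rng.multivariate_normal([0.0, 0.0], cov, size=300)

# contamination is the assumed fraction of outliers in the data.
detector = EllipticEnvelope(contamination=0.01, random_state=0).fit(X_train)

# (6, -6) violates the positive correlation, so its Mahalanobis
# distance is huge even though each coordinate alone is only ~6 sigma.
X_test = np.array([[0.1, 0.2], [6.0, -6.0]])
labels = detector.predict(X_test)   # +1 = inlier, -1 = outlier
```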
Machine learning methods for anomaly detection go beyond distributional assumptions and instead learn the structure of normal data directly from examples. These approaches generally fall into distance-based, density-based, tree-based, and boundary-based categories.
Isolation Forest, proposed by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in 2008, takes a fundamentally different approach to anomaly detection. Rather than modeling normal data and then identifying deviations, it directly isolates anomalies.
The algorithm works by building an ensemble of random binary trees called isolation trees. Each tree is constructed by repeatedly selecting a random feature and a random split value within that feature's range. This recursive partitioning continues until every data point is isolated in its own leaf node.
The key insight is that anomalies, being "few and different," require fewer random splits to be isolated. They tend to land in leaf nodes closer to the root of the tree, resulting in shorter average path lengths across the forest. Normal points, which are clustered together in dense regions, require many splits to be separated and thus have longer average path lengths.
The anomaly score for a data point is derived from its average path length E(h(x)) across all trees in the forest, normalized by the expected path length c(n) for a dataset of size n: s(x, n) = 2^(-E(h(x))/c(n)). Scores close to 1 indicate likely anomalies, while scores near 0.5 or below indicate normal behavior.
Isolation Forest has several practical advantages:

- Near-linear time complexity and low memory use, since each tree is built on a small subsample
- No distance or density computations, avoiding expensive pairwise comparisons
- Good performance on high-dimensional data relative to distance-based methods
Extensions of the original algorithm include Extended Isolation Forest (EIF), which uses random hyperplanes instead of axis-aligned splits, reducing bias on datasets with strong feature correlations.
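A minimal usage sketch with scikit-learn's `IsolationForest` on synthetic data (the cluster parameters are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # "normal" cloud

forest = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
forest.fit(X_train)

X_test = np.array([[0.1, -0.2],   # central, should be normal
                   [8.0, 8.0]])   # far away, easy to isolate
labels = forest.predict(X_test)        # +1 = normal, -1 = anomaly
scores = forest.score_samples(X_test)  # lower = more anomalous in sklearn's convention
```

Note that scikit-learn negates the paper's score, so *lower* `score_samples` values mean *more* anomalous.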
One-Class Support Vector Machine (One-Class SVM), introduced by Scholkopf et al. in 2001, is a variant of the traditional SVM designed for anomaly and novelty detection. Unlike standard SVMs that separate two classes, One-Class SVM is trained exclusively on data from the normal class.
The algorithm maps the training data into a high-dimensional feature space using a kernel function (typically the radial basis function, or RBF kernel). It then finds the hyperplane with the maximum margin that separates the mapped data points from the origin. The origin in kernel feature space serves as a stand-in for "everything anomalous." Data points that fall on the side of the origin (outside the learned boundary) are classified as anomalies.
One-Class SVM is effective when:

- The training data consists almost entirely of clean normal examples
- The boundary of the normal region is non-linear, which the kernel can capture
- The dataset is small to moderate in size, so the kernel computation remains tractable
However, it can be computationally expensive for large datasets due to the kernel matrix computation (O(n^2) to O(n^3)), and its performance is sensitive to the choice of kernel parameters and the contamination ratio (nu parameter). A related approach, Support Vector Data Description (SVDD), proposed by Tax and Duin, finds the smallest hypersphere enclosing the normal data rather than using a hyperplane.
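A minimal sketch with scikit-learn's `OneClassSVM`, trained on normal data only (the `nu` and data parameters are assumptions for illustration):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 2))   # normal class only, no anomaly labels

# nu upper-bounds the fraction of training points treated as outliers;
# gamma controls the RBF kernel width ("scale" adapts it to the data).
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_test = np.array([[0.0, 0.1],    # near the center of the training cloud
                   [6.0, 6.0]])   # well outside it
labels = ocsvm.predict(X_test)    # +1 = inside the learned boundary, -1 = outside
```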
Local Outlier Factor (LOF), proposed by Breunig, Kriegel, Ng, and Sander in 2000, is a density-based anomaly detection algorithm. It identifies anomalies by comparing the local density of a data point with the local densities of its neighbors.
The algorithm works in several steps:

1. For each point, find its k nearest neighbors and the distance to the k-th neighbor (the k-distance).
2. Compute the reachability distance from the point to each neighbor, defined as the maximum of the actual distance and the neighbor's k-distance.
3. Compute the local reachability density (LRD) of the point as the inverse of the average reachability distance to its neighbors.
4. Compute the LOF score as the average ratio of the neighbors' LRDs to the point's own LRD.
A LOF score of approximately 1 indicates that a point has a similar density to its neighbors (normal). A score significantly greater than 1 indicates that the point has a lower density than its neighbors (anomaly). A score below 1 suggests the point is in a denser region than its neighbors.
The key advantage of LOF is its ability to detect local anomalies. A point that would not be considered an outlier globally (e.g., it is not far from the overall mean) can still be flagged if its nearest neighbors form a dense region from which the point itself is relatively isolated. This makes LOF particularly useful for datasets with clusters of varying densities. LOF shares density-based concepts such as reachability with DBSCAN and OPTICS, reflecting its roots in that family of algorithms.
The k-nearest neighbors approach to anomaly detection uses the distance to a data point's k-th nearest neighbor (or the average distance to its k nearest neighbors) as an anomaly score. Points that are far from their neighbors are considered anomalous.
This method is simple and intuitive but has O(n^2) time complexity for naive implementations, which limits scalability. Approximate nearest neighbor search algorithms and spatial indexing structures (such as KD-trees and ball trees) can significantly reduce this computational burden. Unlike LOF, k-NN distance does not account for local density variations, making it less effective when clusters have different densities.
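The k-th-neighbor distance score can be sketched with scikit-learn's `NearestNeighbors`, which uses tree-based indexing internally (k=5 is an assumed choice; the extra neighbor accounts for each point matching itself):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = np.vstack([X, [[7.0, 7.0]]])   # one far-away point

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbor
dists, _ = nn.kneighbors(X)
knn_score = dists[:, -1]   # distance to the k-th true neighbor; larger = more anomalous
```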
DBSCAN (Density-Based Spatial Clustering of Applications with Noise), proposed by Ester, Kriegel, Sander, and Xu in 1996, is a clustering algorithm that naturally identifies outliers as a byproduct of its density-based grouping process. Points that do not belong to any cluster are labeled as noise, and these noise points serve as detected anomalies.
DBSCAN uses two parameters: epsilon (the radius of the neighborhood) and MinPts (the minimum number of points required to form a dense region). Points that have at least MinPts neighbors within epsilon distance are core points. Points within epsilon of a core point but with fewer than MinPts neighbors are border points. All remaining points are noise (anomalies).
DBSCAN is effective at finding clusters of arbitrary shape and does not require specifying the number of clusters in advance, unlike k-means. However, it can struggle with datasets that have varying densities, since a single epsilon value may not capture all cluster structures. The algorithm received the ACM SIGKDD Test of Time Award in 2014 in recognition of its broad impact on the field.
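A sketch of DBSCAN's noise-as-anomaly behavior with scikit-learn (the `eps` and `min_samples` values are assumptions tuned to this synthetic data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
cluster_a = rng.normal([0.0, 0.0], 0.3, size=(100, 2))
cluster_b = rng.normal([4.0, 4.0], 0.3, size=(100, 2))
X = np.vstack([cluster_a, cluster_b, [[2.0, 2.0]]])  # lone point between the clusters

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
noise_mask = db.labels_ == -1   # noise points are the detected anomalies
```

The two clusters are recovered without specifying their number, and the isolated point, having no dense neighborhood, is labeled noise.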
Histogram-Based Outlier Score (HBOS), introduced by Goldstein and Dengel in 2012, is an unsupervised anomaly detection method that scores data points based on their positions within univariate histograms. For each feature, HBOS constructs a histogram and assigns higher outlier scores to values that fall into bins with low frequency.
HBOS assumes feature independence, which makes it computationally very efficient (linear time complexity) but limits its ability to detect anomalies that arise from unusual combinations of feature values. Benchmarks have shown HBOS to be up to five times faster than clustering-based methods and up to seven times faster than nearest-neighbor-based methods, making it a strong choice for initial screening on large datasets. It reliably detects global outliers but performs poorly on local outlier problems.
Extended HBOS (EHBOS) addresses some of these limitations by using two-dimensional histograms to capture pairwise feature interactions.
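The core of HBOS fits in a few lines of NumPy. This is a simplified sketch of the idea (per-feature histograms, scores summed in log space under the independence assumption); the published method additionally normalizes bin heights and supports dynamic bin widths:

```python
import numpy as np

def hbos_scores(X, n_bins=10):
    """Simplified HBOS: sum over features of log inverse bin density."""
    X = np.asarray(X, dtype=float)
    scores = np.zeros(len(X))
    for j in range(X.shape[1]):
        col = X[:, j]
        hist, edges = np.histogram(col, bins=n_bins, density=True)
        # Map each value to its bin; interior edges give indices 0..n_bins-1.
        idx = np.clip(np.digitize(col, edges[1:-1]), 0, n_bins - 1)
        density = np.maximum(hist[idx], 1e-12)   # guard against log(0)
        scores += np.log(1.0 / density)
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
X[0] = [9.0, 9.0, 9.0]        # one extreme point, alone in the last bin of every feature
scores = hbos_scores(X)
```

Each pass over the data is a single histogram per feature, which is where the linear time complexity comes from.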
| Method | Type | Year | Complexity | Handles high dimensions | Detects local anomalies | Key parameter(s) |
|---|---|---|---|---|---|---|
| Isolation Forest | Tree-based | 2008 | O(n log n) | Yes | Partially | Number of trees, contamination |
| One-Class SVM | Boundary-based | 2001 | O(n^2) to O(n^3) | Moderate | No | Kernel, nu, gamma |
| LOF | Density-based | 2000 | O(n^2) | No | Yes | k (number of neighbors) |
| DBSCAN | Density-based | 1996 | O(n log n) with indexing | No | Yes | Epsilon, MinPts |
| k-NN distance | Distance-based | Classic | O(n^2) | Moderate | No | k (number of neighbors) |
| HBOS | Histogram-based | 2012 | O(n) | No | No | Number of bins |
| Random Forest (supervised) | Tree-based | 2001 | O(n log n) | Yes | No | Number of trees, depth |
| Elliptic Envelope | Statistical | Classic | O(n * d^2) | Moderate | No | Contamination |
Deep learning methods have become increasingly prominent in anomaly detection, particularly for high-dimensional and complex data types such as images, time series, and network traffic. These approaches leverage neural networks to learn rich representations of normal data and identify deviations.
An autoencoder is a neural network trained to reconstruct its input through a bottleneck layer. It consists of an encoder that compresses the input into a lower-dimensional latent representation and a decoder that reconstructs the original input from this representation.
For anomaly detection, the autoencoder is trained on normal data only. During training, the network learns to compress and reconstruct normal patterns efficiently, minimizing the reconstruction error (typically measured as mean squared error). At test time, normal data will be reconstructed accurately with low error, while anomalous data will produce high reconstruction error because the network has never learned to represent such patterns. The reconstruction error serves as the anomaly score: points with error above a chosen threshold are flagged as anomalies.
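The reconstruction-error principle can be illustrated without a deep learning framework: scikit-learn's `MLPRegressor`, trained to reproduce its own input through a narrow bottleneck layer, acts as a small autoencoder. This is a toy sketch (the architecture and synthetic 1-D manifold are assumptions), not a production model:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Normal data lives near a 1-D curve embedded in 3-D space: (t, t^2, -t).
t = rng.uniform(-1, 1, size=500)
X_train = np.column_stack([t, t ** 2, -t]) + 0.01 * rng.normal(size=(500, 3))

# The 2-unit bottleneck forces the network to learn the normal structure
# rather than copying inputs through.
ae = MLPRegressor(hidden_layer_sizes=(8, 2, 8), activation="tanh",
                  max_iter=3000, random_state=0)
ae.fit(X_train, X_train)   # target equals input: learn to reconstruct

X_test = np.vstack([X_train[:5], [[1.0, 1.0, 1.0]]])  # last row breaks x3 = -x1
recon = ae.predict(X_test)
errors = ((X_test - recon) ** 2).mean(axis=1)         # per-sample reconstruction error
```

Points on the learned manifold reconstruct with low error; the off-manifold point cannot be represented by the bottleneck and yields the largest error, so a simple threshold on `errors` separates it.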
Variants of autoencoders used for anomaly detection include:

- Denoising autoencoders, trained to reconstruct clean inputs from corrupted versions
- Sparse autoencoders, which penalize latent activations to encourage compact codes
- Convolutional autoencoders for image data
- LSTM autoencoders for sequential and time series data
Variational autoencoders (VAEs), introduced by Kingma and Welling in 2014, extend the standard autoencoder with a probabilistic framework. Instead of encoding each input to a single point in latent space, the encoder outputs the parameters of a probability distribution (typically a Gaussian, defined by a mean and variance). A latent vector is sampled from this distribution and passed to the decoder to produce the reconstruction.
The training objective for a VAE includes two terms: the reconstruction loss (how well the decoder reproduces the input) and the KL divergence (how close the learned latent distribution is to a standard normal prior). This combination encourages a smooth, organized latent space.
For anomaly detection, VAEs offer several advantages over standard autoencoders:

- Anomaly scores can be defined probabilistically (for example, as a reconstruction probability) rather than as a raw reconstruction error
- The KL-regularized latent space is smoother and less prone to memorizing individual training points
- The model can generate samples of normal data, which is useful for inspection and validation
Recent research has combined VAEs with other architectures. For instance, hybrid VAE-Transformer models integrate variational inference with attention mechanisms for time series anomaly detection, while GAN-VAE hybrid models combine generative adversarial training with variational inference to better capture complex data distributions.
Generative adversarial networks (GANs) have also been adapted for anomaly detection. The core idea is to train a GAN on normal data so that its generator learns to produce realistic normal samples. At test time, anomalous inputs will not fit the learned distribution of normal data.
Several GAN-based anomaly detection methods have been proposed:
| Method | Year | Approach | Key innovation |
|---|---|---|---|
| AnoGAN | 2017 | Searches the latent space for the closest generated image to the input | First GAN-based anomaly detection method; uses iterative optimization |
| f-AnoGAN | 2019 | Learns a direct mapping from images to latent space | Replaces iterative search with a learned encoder; much faster inference |
| GANomaly | 2018 | Uses an encoder-decoder-encoder architecture | Compares latent representations at two stages; avoids costly iterative search |
| ALAD | 2018 | Adversarially learned anomaly detection using BiGAN | Learns both generation and inference simultaneously |
GAN-based methods are particularly useful in domains where anomaly examples are scarce, since the generator can be trained exclusively on normal data. However, GANs can be difficult to train due to mode collapse and instability, and recent work on conditional GANs and cycle-consistent GANs has sought to address these challenges.
Transformer architectures, originally developed for natural language processing, have been adapted for anomaly detection in time series and other sequential data.
The Anomaly Transformer, proposed by Xu et al. in 2022, modifies the standard self-attention mechanism to distinguish between normal and anomalous temporal patterns. It introduces an "association discrepancy" metric that measures the difference between learned associations and a prior distribution. Normal time points tend to form strong associations with their adjacent context, while anomalous points exhibit weaker or different association patterns.
More recent models include the Variable Temporal Transformer (VTT), which uses temporal self-attention for modeling temporal dependencies and variable self-attention for modeling correlations between variables, and CAE-T (Convolutional Autoencoding Transformer), which combines convolutional autoencoders for spatial feature extraction with Transformers for capturing long-term temporal dependencies.
Transformer-based approaches for anomaly detection can operate in two modes:

- Forecasting-based: the model predicts future values and flags points with large prediction errors
- Reconstruction-based: the model reconstructs (possibly masked) input sequences and flags points with large reconstruction errors
The main advantage of Transformers is their ability to capture both local and global dependencies through the attention mechanism, making them effective at detecting anomalies that depend on long-range temporal context.
Time series data presents unique challenges for anomaly detection because the ordering and temporal context of observations is critical. What constitutes an anomaly often depends on the surrounding values, seasonal patterns, and long-term trends.
Time series anomalies can be categorized as:

- Point anomalies: single timestamps with abnormal values (spikes or dips)
- Contextual anomalies: values that are abnormal only given the time of day, season, or recent history
- Collective (subsequence) anomalies: whole segments whose shape deviates from normal patterns, even when each value is individually plausible
| Method | Category | Approach | Best suited for |
|---|---|---|---|
| ARIMA-based | Statistical | Fits autoregressive models and flags residuals exceeding a threshold | Stationary or trend-stationary series |
| STL decomposition | Statistical | Separates trend, seasonal, and residual components; detects anomalies in residuals | Seasonal data with clear periodicity |
| Prophet | Statistical | Facebook's forecasting tool with built-in anomaly detection via prediction intervals | Business time series with holidays and trend changes |
| LSTM autoencoder | Deep learning | Encodes temporal patterns with LSTM layers; detects anomalies via reconstruction error | Multivariate time series with complex temporal patterns |
| Anomaly Transformer | Deep learning | Uses modified self-attention with association discrepancy | Long sequences with both local and global anomalies |
| Matrix Profile | Algorithmic | Computes all-pairs subsequence distances efficiently | Motif and discord discovery in long series |
| Spectral Residual (SR) | Signal processing | Uses Fourier transform to extract the spectral residual of the time series | Detecting saliency-based anomalies |
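The residual-thresholding idea behind the statistical methods in the table can be sketched in NumPy, using a trailing moving average as the baseline and a robust (MAD-based) scale estimate so the anomaly itself does not inflate the threshold. The window and threshold values are assumptions for this synthetic series:

```python
import numpy as np

def residual_anomalies(series, window=20, threshold=3.0):
    """Flag points whose deviation from a trailing moving average
    exceeds `threshold` robust standard deviations of the residuals."""
    s = np.asarray(series, dtype=float)
    kernel = np.ones(window) / window
    forecast = np.convolve(s, kernel, mode="full")[:len(s)]  # trailing mean baseline
    forecast[:window] = s[:window]        # warm-up region: residual forced to zero
    resid = s - forecast
    dev = resid - np.median(resid)
    sigma = 1.4826 * np.median(np.abs(dev))   # MAD-based scale, robust to the spike
    return np.abs(dev) > threshold * sigma

rng = np.random.default_rng(0)
t = np.arange(300)
series = 0.01 * t + 0.05 * rng.normal(size=300)   # slow upward trend plus noise
series[200] += 2.0                                # injected spike
flags = residual_anomalies(series)
```

Because the baseline tracks the trend, the gradual drift produces small residuals while the spike stands out; a fixed global threshold on the raw values would instead flag the entire end of the series.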
Several challenges are specific to the time series setting:

- Seasonality and trend must be modeled or removed before residuals are meaningful
- Normal behavior drifts over time, so models must be retrained or adapted
- Detection often must happen online, with low latency and without access to future values
- Labels are scarce, and the exact boundaries of an anomalous interval are often ambiguous
Image anomaly detection, often studied in the context of industrial visual inspection, aims to identify defective or unusual images (or regions within images) given only examples of normal images during training.
The MVTec Anomaly Detection (MVTec AD) dataset, introduced by Bergmann et al. in 2019, has become the standard benchmark for image anomaly detection research. It contains over 5,000 high-resolution images divided into 15 categories, including 5 texture classes (carpet, grid, leather, tile, wood) and 10 object classes (bottle, cable, capsule, hazelnut, metal nut, pill, screw, toothbrush, transistor, zipper).
Each category includes defect-free training images and a test set containing both normal images and images with various defects such as scratches, dents, contaminations, cracks, and structural changes. Pixel-precise ground truth annotations are provided for all anomalous regions, enabling both image-level anomaly detection ("is this image defective?") and pixel-level anomaly localization ("where is the defect?").
MVTec AD 2, released as a follow-up, expands the benchmark with eight additional scenarios and over 8,000 images, covering more challenging inspection tasks.
Modern image anomaly detection methods can be grouped into several families:
Reconstruction-based methods train a model (such as a convolutional autoencoder or VAE) to reconstruct normal images. At test time, defective regions produce high reconstruction error. Approaches like DRAEM (2021) use synthetic anomalies during training to guide the reconstruction learning.
Feature embedding methods extract features from a pretrained convolutional neural network (such as a ResNet or Wide ResNet trained on ImageNet) and model the distribution of normal features.
| Method | Year | Approach | MVTec AD image AUROC |
|---|---|---|---|
| SPADE | 2020 | K-nearest-neighbor matching on pretrained CNN features at multiple resolutions | 85.5% |
| PaDiM | 2020 | Models patch-level feature distributions using multivariate Gaussian | 95.3% |
| PatchCore | 2022 | Memory bank of representative patch features with coreset subsampling | 99.1% |
| CFlow-AD | 2022 | Normalizing flow on multi-scale pretrained features | 98.3% |
| SimpleNet | 2023 | Feature adaptation with a simple discriminator network | 99.6% |
Knowledge distillation methods train a student network to mimic the features of a pretrained teacher network on normal data. At test time, anomalous regions produce discrepancies between teacher and student features. Methods in this family include STPM (Student-Teacher Feature Pyramid Matching) and Reverse Distillation.
Synthetic anomaly methods generate artificial defects during training to provide explicit supervision. CutPaste (2021) creates anomalies by cutting and pasting image patches, while DRAEM (2021) applies Perlin noise textures as synthetic anomalies.
Anomaly detection is used in many fields to detect and prevent potentially hazardous events.
In finance, anomaly detection is employed to spot fraudulent transactions. Credit card fraud detection systems analyze spending patterns and flag transactions that deviate from a cardholder's established behavior. For instance, a purchase in a foreign country minutes after a domestic transaction is likely anomalous. Mastercard's Decision Intelligence platform, for example, analyzes up to 160 billion transactions annually in under 50 milliseconds, reportedly boosting fraud detection by up to 300% while reducing false positives by more than 85%.
Machine learning models used for fraud detection include:

- Supervised classifiers such as gradient-boosted trees and random forests, trained on historically labeled fraud
- Unsupervised detectors such as Isolation Forest and autoencoders, which can surface novel fraud patterns
- Graph-based methods that model relationships between accounts, merchants, and devices
A key challenge in fraud detection is the extreme class imbalance: fraudulent transactions typically represent less than 0.5% of all transactions. Techniques such as SMOTE (Synthetic Minority Oversampling Technique), class weighting, and cost-sensitive learning are commonly used to address this imbalance.
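The effect of class weighting is easy to demonstrate with scikit-learn on synthetic imbalanced data (the 1% fraud rate and class separation are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
# ~1% "fraud": the positive class is shifted but overlaps the bulk.
n_normal, n_fraud = 5000, 50
X = np.vstack([rng.normal(0.0, 1.0, size=(n_normal, 2)),
               rng.normal(2.5, 1.0, size=(n_fraud, 2))])
y = np.array([0] * n_normal + [1] * n_fraud)

# Unweighted: the 0.5 decision threshold is biased toward the majority class.
plain = LogisticRegression().fit(X, y)
# class_weight="balanced" reweights losses inversely to class frequency.
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

recall_plain = recall_score(y, plain.predict(X))
recall_weighted = recall_score(y, weighted.predict(X))
```

The weighted model catches substantially more of the rare class, at the cost of more false positives — the trade-off that cost-sensitive learning makes explicit.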
Network intrusion detection systems (NIDS) use anomaly detection to identify malicious network activity. By learning the baseline behavior of network traffic (packet sizes, protocols, source/destination patterns, timing), these systems can flag deviations that may indicate attacks such as port scanning, denial-of-service (DoS) attacks, data exfiltration, or lateral movement within a network.
Anomaly-based intrusion detection has several advantages over signature-based systems: it can detect novel (zero-day) attacks that have no known signature and adapt to evolving network environments. However, it also tends to produce higher false positive rates.
Common datasets used for benchmarking network intrusion detection include NSL-KDD, CICIDS2017, and the more recent CICIoMT2024 dataset for Internet of Medical Things (IoMT) security. Recent work has applied Transformer architectures and ensemble methods combining deep learning with traditional ML algorithms for improved detection accuracy.
In manufacturing, anomaly detection is used for quality control and defect detection. Visual inspection systems powered by computer vision and deep learning can automatically identify defective products on assembly lines, detecting scratches, dents, discoloration, cracks, missing components, and assembly errors.
The typical approach involves training a model exclusively on images of non-defective products. At inference time, any deviation from the learned normal appearance triggers an alert. This is particularly valuable because defect examples are rare and highly variable in manufacturing settings.
Beyond visual inspection, anomaly detection is applied to sensor data from industrial equipment for predictive maintenance. By monitoring vibration, temperature, pressure, and other sensor readings, models can detect early signs of equipment failure before a breakdown occurs, reducing unplanned downtime and maintenance costs.
In healthcare, anomaly detection supports several critical tasks:

- Identifying abnormal patterns in vital signs and ECG or EEG recordings for early warning of patient deterioration
- Flagging unusual regions in medical images that may indicate disease
- Detecting anomalous insurance claims and billing patterns
- Monitoring medical devices for malfunctions
| Domain | Application | Typical methods |
|---|---|---|
| Astronomy | Detecting unusual celestial events or objects | Clustering, Isolation Forest |
| Environmental monitoring | Detecting pollution events or unusual weather patterns | Time series methods, statistical process control |
| Social media | Identifying bot activity, spam, or fake accounts | Graph-based methods, classification |
| Telecommunications | Detecting network faults and service degradation | Statistical methods, LSTM models |
| Energy | Detecting power grid faults and abnormal consumption | ARIMA, autoencoders |
| Agriculture | Identifying crop disease or pest infestations from imagery | Convolutional neural network autoencoders, PatchCore |
Evaluating anomaly detection systems requires careful consideration of the metrics used, especially because of the inherent class imbalance between normal and anomalous data.
Threshold-dependent metrics require choosing a specific decision threshold to classify scores as normal or anomalous.
| Metric | Formula | Interpretation |
|---|---|---|
| Precision | TP / (TP + FP) | Proportion of detected anomalies that are truly anomalous |
| Recall (Sensitivity) | TP / (TP + FN) | Proportion of actual anomalies that are correctly detected |
| F1 Score | 2 * Precision * Recall / (Precision + Recall) | Harmonic mean of precision and recall |
| Specificity | TN / (TN + FP) | Proportion of normal instances correctly identified |
| False Positive Rate | FP / (FP + TN) | Rate at which normal instances are incorrectly flagged |
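The formulas in the table map directly to scikit-learn's metric functions; a small worked example with one false positive and one false negative:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# 1 = anomaly. Ten ground-truth labels vs. a detector's thresholded output:
# 3 true positives, 1 false positive (index 4), 1 false negative (index 8).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]

precision = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75
f1 = f1_score(y_true, y_pred)                # harmonic mean = 0.75
```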
Threshold-free metrics, most commonly ROC-AUC (area under the receiver operating characteristic curve) and PR-AUC (area under the precision-recall curve, often reported as average precision), evaluate the quality of the anomaly scores across all possible thresholds.
For image anomaly localization, additional metrics are used:

- Pixel-level AUROC, which treats each pixel as a separate detection decision
- PRO (per-region overlap), which averages overlap across ground-truth defect regions so that small defects count as much as large ones
In production systems, the choice of metric depends on the application:

- In fraud detection and security monitoring, precision often matters most, because every alert consumes analyst time
- In medical screening and safety-critical monitoring, recall is paramount, because a missed anomaly is far more costly than a false alarm
- When both matter, the F1 score or a cost-weighted combination is used, with the threshold tuned on held-out data
Anomaly detection presents several persistent obstacles.
One of the major obstacles is data imbalance: anomalies make up a small fraction of all instances, which makes it difficult for machine learning models to learn their characteristics and distinguish them from regular instances. In many real-world scenarios, anomalous events represent less than 1% of the data, and in some cases (such as fraud detection) the ratio can be as extreme as 1 in 100,000. Under such severe imbalance, ROC-AUC can give a misleadingly optimistic view of performance, because the false positive rate is diluted by the enormous number of negatives; precision-recall metrics usually give a more honest picture.
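The gap between ROC-AUC and precision-recall metrics under imbalance is easy to reproduce on synthetic scores (the 0.1% anomaly rate and score separation are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
# 10,000 normal points, 10 anomalies whose scores are only moderately higher.
y = np.array([0] * 10000 + [1] * 10)
scores = np.concatenate([rng.normal(0.0, 1.0, 10000),
                         rng.normal(2.5, 1.0, 10)])

roc_auc = roc_auc_score(y, scores)           # looks excellent
pr_auc = average_precision_score(y, scores)  # reveals that most alerts are false alarms
```

The same detector that scores well into the 0.9s on ROC-AUC yields a low PR-AUC, because at any workable threshold the handful of true anomalies is buried under hundreds of flagged normals.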
Labeled anomalies may be scarce or unavailable, and the definition of what constitutes an anomaly may be uncertain or context-dependent. Collecting labeled anomaly data is expensive because it requires domain expertise, and anomalies are inherently rare. To address this, unsupervised or semi-supervised techniques that do not require labeled data may be utilized, along with expert knowledge and feedback to refine the definition of anomalies.
Anomaly detection often faces the problem of high dimensionality, where data may contain many features or variables that make it challenging to detect anomalies and visualize them. As the number of dimensions grows, the concept of distance becomes less meaningful (the "curse of dimensionality"), and all points tend to appear equally distant from one another. To address this challenge, feature engineering, dimensionality reduction techniques (such as PCA or autoencoders), or feature selection strategies can be employed to simplify the data and focus on the most pertinent features.
Concept drift occurs when the distribution of data changes over time, making a previously trained model outdated or ineffective at detecting new anomalies. What was considered normal six months ago may no longer apply. To combat this problem, adaptive or online learning techniques can be used that update models in real time or adapt to changes in data distribution. Sliding window approaches, periodic retraining, and incremental model updates are common solutions.
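A sliding-window detector is the simplest of these adaptations: each point is scored against statistics of only the most recent window, so the notion of "normal" tracks a drifting baseline. A sketch in NumPy (the window size and threshold are assumed values):

```python
import numpy as np

def sliding_zscore(stream, window=100, threshold=4.0):
    """Score each point against the mean/std of the preceding `window` points."""
    flags = []
    for i, x in enumerate(stream):
        if i < window:
            flags.append(False)          # warm-up: not enough history yet
            continue
        ref = stream[i - window:i]
        mu, sd = ref.mean(), ref.std() + 1e-9
        flags.append(abs(x - mu) / sd > threshold)
    return np.array(flags)

rng = np.random.default_rng(0)
drift = np.linspace(0.0, 5.0, 1000)      # slow upward drift in the mean
stream = drift + 0.1 * rng.normal(size=1000)
stream[700] += 3.0                        # genuine spike on top of the drift
flags = sliding_zscore(stream)
```

A z-score computed from global statistics would flag the entire drifted tail of this stream as anomalous; the windowed version follows the drift and isolates the spike.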
In many real-world settings, there is no clear-cut boundary between normal and anomalous data. Normal behavior may be multimodal, context-dependent, or gradually shifting. A model that defines normality too narrowly will generate excessive false positives, while a model that defines it too broadly will miss genuine anomalies. The choice of how to define normal is often application-specific and requires close collaboration between data scientists and domain experts.
In many applications, it is not enough to simply flag a data point as anomalous. Operators and analysts need to understand why a detection was made. Deep learning models, while powerful, often function as black boxes. Explainability techniques such as SHAP (SHapley Additive exPlanations), LIME (Local Interpretable Model-agnostic Explanations), and attention visualization are increasingly used alongside anomaly detection models to provide interpretable explanations for flagged anomalies.
| Method | Category | Learning type | Data type | Strengths | Limitations |
|---|---|---|---|---|---|
| Z-score / IQR | Statistical | Unsupervised | Tabular | Simple, fast, interpretable | Assumes specific distributions; univariate |
| Mahalanobis distance | Statistical | Unsupervised | Tabular | Handles correlations | Requires covariance estimation |
| Isolation Forest | ML (tree-based) | Unsupervised | Tabular | Fast, scalable, no distance computations | Struggles with many irrelevant features |
| One-Class SVM | ML (boundary) | Semi-supervised | Tabular | Flexible non-linear boundaries via kernels | Computationally expensive; sensitive to kernel choice |
| LOF | ML (density) | Unsupervised | Tabular | Detects local anomalies | O(n^2) complexity; sensitive to k |
| DBSCAN | ML (density) | Unsupervised | Tabular, spatial | Finds clusters of arbitrary shape | Sensitive to epsilon; struggles with varying density |
| HBOS | ML (histogram) | Unsupervised | Tabular | Extremely fast; linear time | Assumes feature independence; misses local outliers |
| Autoencoder | Deep learning | Semi-supervised | Tabular, image, sequence | Learns complex patterns; versatile architecture | Requires careful architecture design; threshold selection |
| VAE | Deep learning | Semi-supervised | Tabular, image, sequence | Probabilistic framework; smooth latent space | More complex training; may underfit |
| GAN-based (AnoGAN, etc.) | Deep learning | Semi-supervised | Image | Generates realistic normal data | Difficult to train; mode collapse |
| LSTM Autoencoder | Deep learning | Semi-supervised | Time series | Captures temporal dependencies | Slow training; struggles with very long sequences |
| Anomaly Transformer | Deep learning | Semi-supervised | Time series | Global and local context via attention | Large model size; requires substantial data |
| PatchCore | Deep learning (feature embedding) | Semi-supervised | Image | State-of-the-art image detection; fast inference with coreset | Relies on pretrained backbone; memory-intensive |
Several open-source libraries provide implementations of anomaly detection algorithms:

- scikit-learn includes IsolationForest, OneClassSVM, LocalOutlierFactor, and EllipticEnvelope
- PyOD collects dozens of detectors (including HBOS, k-NN, and autoencoder-based methods) behind a unified API
- anomalib focuses on image anomaly detection, with implementations of methods such as PaDiM and PatchCore
Imagine you have a big box of red marbles. You know what red marbles look like because you see them every day. Now, if someone sneaks a blue marble into your box, you would notice it right away because it looks different from everything else.
Anomaly detection works the same way, but with computers. We show a computer lots of examples of what "normal" looks like. Then, when we give it new data, it checks whether each piece looks like the normal examples it has seen before. If something looks very different from what the computer expects, it raises its hand and says, "Hey, this one looks strange!"
This is useful for things like catching someone who stole a credit card (their purchases look different from the real owner's), finding a broken machine in a factory (it starts making weird sounds), or spotting a sick plant in a field (it looks different from the healthy ones).