Novelty detection is a branch of machine learning concerned with identifying test data that differ in some meaningful way from the data available during training. Unlike standard supervised learning tasks where both normal and abnormal classes are well-represented, novelty detection typically operates in a setting where only data from the "normal" class is available at training time. When a new observation arrives, the system must determine whether it belongs to the same distribution as the training data or represents something previously unseen. The problem is sometimes referred to as one-class classification, since the learning algorithm builds a model describing only the normal class and flags anything that falls outside that model's boundary as novel.
Novelty detection plays a central role in fields where abnormal events are rare, expensive to label, or inherently unpredictable. Examples include detecting previously unknown cyber attacks in network traffic, identifying manufacturing defects that have never been seen on a production line, and spotting early signs of disease from patient monitoring data. Because the system trains on clean, normal data and then evaluates new observations against that learned baseline, novelty detection is classified as a form of semi-supervised anomaly detection.
Imagine you have a toy box full of red balls. You play with those red balls every day, so you know exactly what a red ball looks like. One day, someone drops a blue cube into your toy box. You would notice right away that the blue cube does not look like any of your red balls. Novelty detection works the same way for computers: the computer first learns what "normal" looks like by studying lots of normal examples, and then whenever something new and different shows up, it raises a flag and says, "Hey, this does not match what I have seen before!"
The terms novelty detection, anomaly detection, and outlier detection are often used interchangeably in casual discussion, but they carry distinct meanings in the machine learning literature. Understanding the differences is important for choosing the right algorithm and evaluation strategy.
| Aspect | Novelty detection | Outlier detection | Anomaly detection |
|---|---|---|---|
| Training data | Clean, uncontaminated (no anomalies) | May contain outliers that need to be identified | Varies; may or may not contain anomalies |
| Task timing | Detects anomalies in new, unseen data after training | Identifies anomalies within the existing training dataset | General umbrella term covering both settings |
| Learning paradigm | Semi-supervised (one-class) | Unsupervised learning | Supervised, semi-supervised, or unsupervised |
| Anomaly clustering | Novel points can form dense clusters, as long as they fall in a low-density region of the training distribution | Outliers cannot form dense clusters; they must lie in low-density regions globally | Depends on the specific method |
| Typical use case | Monitoring a deployed system for new types of failures | Cleaning a dataset by removing erroneous or extreme values | Broad term for any task that separates normal from abnormal |
In the scikit-learn documentation, novelty detection is described as semi-supervised anomaly detection, while outlier detection is described as unsupervised anomaly detection. The practical difference comes down to whether the training set is assumed to be free of anomalies (novelty detection) or whether anomalies may already be present in the training set (outlier detection).
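The difference is easy to see in code. Below is a minimal sketch using scikit-learn's LocalOutlierFactor in both modes; the synthetic data and parameter values are purely illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X_train = rng.normal(0.0, 1.0, size=(200, 2))   # assumed clean "normal" data
X_new = np.array([[0.1, -0.2], [4.0, 4.0]])     # one normal-looking point, one far away

# Novelty detection (semi-supervised): fit on clean data, score unseen points.
nov = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
print(nov.predict(X_new))  # 1 = inlier, -1 = novelty

# Outlier detection (unsupervised): labels refer to the training set itself.
out = LocalOutlierFactor(n_neighbors=20)
labels = out.fit_predict(X_train)  # -1 marks outliers within X_train
```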
Novelty detection is closely related to two other problems in modern machine learning: open-set recognition and out-of-distribution detection.
In novelty detection (also called one-class classification), the model is trained on data from a single normal class and must decide whether a new sample belongs to that class or not. Open-set recognition extends this to multiple known classes: the model is trained on K classes and must reject samples that do not belong to any of those K classes while still correctly classifying samples from the known classes. Out-of-distribution (OOD) detection addresses the same goal as open-set recognition but is typically studied in the context of deep neural networks and uses confidence scores or other signals from a pre-trained classifier to flag inputs that differ from the training distribution.
Novelty detection can be viewed as an extreme case of open-set recognition where K equals 1. All three problems share the goal of deciding whether a sample comes from the distribution seen during training, but they differ in the number of known classes and the types of models and evaluation protocols used.
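As a concrete illustration of the OOD setting, one widely used baseline scores each input by the maximum softmax probability of a pre-trained classifier and flags low-confidence inputs as out-of-distribution. The sketch below assumes logits from some existing K-class model; the threshold value is an arbitrary placeholder.

```python
import numpy as np

def max_softmax_score(logits: np.ndarray) -> np.ndarray:
    """Maximum softmax probability per sample; low values suggest OOD inputs."""
    z = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

# Hypothetical logits from a pre-trained 3-class classifier.
logits = np.array([[8.0, 0.5, 0.2],    # confident -> likely in-distribution
                   [1.1, 1.0, 0.9]])   # flat -> possibly OOD
is_ood = max_softmax_score(logits) < 0.7  # placeholder threshold
```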
Novelty detection methods can be grouped into several broad categories based on how they model the normal class. The taxonomy below follows the structure described by Pimentel et al. (2014) in their review of novelty detection.
| Category | Core idea | Representative methods |
|---|---|---|
| Probabilistic | Estimate the probability density of normal data; flag low-probability samples as novel | Gaussian mixture models, kernel density estimation, Bayesian approaches |
| Distance-based | Measure distances or similarities to training data; large distances indicate novelty | k-nearest neighbors, Local Outlier Factor |
| Domain-based | Learn a boundary or region in feature space that encloses the normal data | One-class SVM, Support Vector Data Description (SVDD) |
| Reconstruction-based | Learn to reconstruct normal data; high reconstruction error on a test sample indicates novelty | Autoencoders, variational autoencoders, sparse coding |
| Information-theoretic | Use information-theoretic measures such as entropy or Kolmogorov complexity to detect distributional changes | Entropy-based detectors, minimum description length |
Each category has its own strengths and weaknesses. Probabilistic methods provide principled uncertainty estimates but can struggle in high-dimensional spaces. Distance-based methods are intuitive and nonparametric but become computationally expensive with large datasets. Domain-based methods are efficient at test time but require careful kernel selection. Reconstruction-based methods scale well with deep learning but can sometimes reconstruct anomalies too accurately if the model is overly flexible.
Probabilistic approaches to novelty detection estimate the probability density function (PDF) of the training data and then classify a new observation as novel if its estimated density falls below a threshold.
A Gaussian mixture model (GMM) represents the training distribution as a weighted sum of multiple Gaussian components. Each component is defined by a mean vector and a covariance matrix. The parameters are typically estimated using the expectation-maximization (EM) algorithm. At test time, the likelihood of a new sample under the GMM is computed. If the likelihood is below a predefined threshold, the sample is flagged as novel.
GMMs are flexible enough to model multimodal distributions, making them useful when the normal data form several distinct clusters. However, the user must choose the number of components, and the method can become unreliable in high-dimensional spaces due to the difficulty of estimating covariance matrices accurately.
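A minimal sketch of GMM-based novelty detection with scikit-learn might look like the following; the number of components, the synthetic data, and the 1% threshold quantile are all illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X_train = np.vstack([rng.normal(0, 1, (150, 2)),
                     rng.normal(5, 1, (150, 2))])  # two "normal" clusters

gmm = GaussianMixture(n_components=2, covariance_type="full").fit(X_train)

# Log-likelihood of each training point; a low quantile serves as the threshold.
train_ll = gmm.score_samples(X_train)
threshold = np.quantile(train_ll, 0.01)

X_new = np.array([[0.2, 0.1], [10.0, -8.0]])
is_novel = gmm.score_samples(X_new) < threshold  # True for the distant point
```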
Kernel density estimation (KDE) is a nonparametric method that places a kernel function (often a Gaussian) at each training point and sums them to produce a smooth estimate of the density. For a test point x, the estimated density is:
f(x) = (1 / (n h)) * sum_{i=1}^{n} K((x - x_i) / h)
where K is the kernel function, h is the bandwidth (smoothing parameter), n is the number of training points, and x_i are the training samples. Points with density estimates below a threshold are classified as novel.
KDE makes no assumptions about the shape of the underlying distribution, which is an advantage over parametric methods like GMMs. The main drawback is that KDE does not scale well to high-dimensional data because the number of samples needed to produce reliable density estimates grows exponentially with the number of dimensions.
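For low-dimensional data, KDE-based novelty detection takes only a few lines with scikit-learn's KernelDensity; the bandwidth and threshold quantile below are illustrative and would normally be tuned (e.g., by cross-validation).

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, (300, 2))

# Gaussian kernel; the bandwidth h is the key smoothing choice.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_train)

log_dens = kde.score_samples(X_train)    # log-density at each training point
threshold = np.quantile(log_dens, 0.05)  # flag the lowest-density 5%

X_new = np.array([[0.0, 0.3], [6.0, 6.0]])
is_novel = kde.score_samples(X_new) < threshold
```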
The Elliptic Envelope method assumes that the normal data follow a single multivariate Gaussian distribution. It fits a robust covariance estimate to the data and uses Mahalanobis distance to measure how far a new observation lies from the center of the distribution. Points with large Mahalanobis distances are classified as outliers or novelties. This method is implemented in scikit-learn as EllipticEnvelope and works well when the Gaussian assumption holds, but it is unreliable for data with non-Gaussian structure.
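A short sketch of the scikit-learn estimator, with synthetic Gaussian data and an illustrative contamination setting:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
X_train = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=300)

# contamination is the assumed fraction of points outside the envelope.
env = EllipticEnvelope(contamination=0.01).fit(X_train)

X_new = np.array([[0.5, -0.2], [6.0, 6.0]])
print(env.predict(X_new))      # 1 = inlier, -1 = outlier/novelty
print(env.mahalanobis(X_new))  # squared Mahalanobis distances to the fitted center
```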
Distance-based methods define novelty in terms of how far a test point is from its nearest neighbors in the training set. The assumption is that normal data points tend to be close to other normal points, while novel points tend to be far from any training example.
The simplest distance-based approach computes the distance from a test point to its k-th nearest neighbor in the training set. If that distance exceeds a threshold, the point is flagged as novel. Variants include using the average distance to the k nearest neighbors rather than the distance to the k-th neighbor alone.
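A minimal sketch of this k-th-neighbor rule with scikit-learn's NearestNeighbors; the value of k, the data, and the 99% calibration quantile are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, (300, 2))

k = 5
nn = NearestNeighbors(n_neighbors=k).fit(X_train)

def knn_novelty_score(X):
    """Distance to the k-th nearest training point; larger = more novel."""
    dists, _ = nn.kneighbors(X, n_neighbors=k)
    return dists[:, -1]

# Calibrate on training data, asking for k+1 neighbors to skip the self-match.
train_dists, _ = nn.kneighbors(X_train, n_neighbors=k + 1)
threshold = np.quantile(train_dists[:, -1], 0.99)

X_new = np.array([[0.1, 0.0], [5.0, 5.0]])
is_novel = knn_novelty_score(X_new) > threshold
```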
The Local Outlier Factor (LOF), proposed by Breunig et al. (2000), refines the basic k-nearest-neighbor approach by comparing the local density of a point to the local densities of its neighbors. The LOF score for a point p is the ratio of the average local density of p's k nearest neighbors to the local density of p itself. A LOF score close to 1 indicates that the point has a density similar to its neighbors (normal), while a score significantly greater than 1 indicates that the point lies in a region of lower density than its neighbors (potentially novel).
LOF is effective when the data contain clusters of varying densities, because it evaluates each point relative to its local neighborhood rather than against a global threshold. In scikit-learn, the LocalOutlierFactor class supports both outlier detection and novelty detection. When the novelty parameter is set to True, the model can be used to score new, unseen data points. When novelty is False (the default), the model can only be applied to the training data itself for outlier detection.
The one-class support vector machine (One-Class SVM), introduced by Schölkopf et al. (2001), learns a decision boundary that separates the training data from the origin in a high-dimensional feature space induced by a kernel function. The algorithm finds a hyperplane with maximum margin between the data points and the origin. The fraction of training points allowed to fall on the wrong side of the hyperplane is controlled by a parameter called nu, which can be interpreted as an upper bound on the fraction of outliers and a lower bound on the fraction of support vectors.
At test time, a new observation is projected into the kernel feature space, and the sign of its distance from the hyperplane determines whether it is classified as normal (positive) or novel (negative). The most common kernel choice is the radial basis function (RBF) kernel, which allows the decision boundary to take complex, nonlinear shapes in the original input space.
One-Class SVM has strong theoretical foundations rooted in statistical learning theory. However, training complexity scales roughly quadratically with the number of samples, which can make it impractical for very large datasets. For such cases, scikit-learn provides SGDOneClassSVM, a linear approximation trained with stochastic gradient descent that scales linearly with the number of samples.
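A minimal scikit-learn sketch; the nu and gamma values below are illustrative and would normally be tuned for the task.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, (300, 2))

# nu bounds the fraction of training points treated as outliers;
# gamma controls how flexible the RBF boundary can be.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_new = np.array([[0.2, -0.1], [4.0, 4.0]])
print(ocsvm.predict(X_new))            # 1 = normal, -1 = novel
print(ocsvm.decision_function(X_new))  # signed distance to the learned boundary
```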
Support Vector Data Description (SVDD), proposed by Tax and Duin (2004), takes a complementary approach to One-Class SVM. Instead of separating the data from the origin with a hyperplane, SVDD finds the smallest hypersphere that encloses most of the training data. Points that fall outside the hypersphere at test time are classified as novel. Like One-Class SVM, SVDD can be kernelized to produce flexible, nonlinear boundaries. The two methods are mathematically equivalent when using an RBF kernel, since the RBF kernel maps all data points onto the surface of a unit hypersphere in the feature space.
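For reference, SVDD's primal problem can be written as follows (the standard formulation from Tax and Duin), where a is the sphere center, R its radius, phi the kernel feature map, xi_i slack variables, and C trades off sphere volume against training errors:

```latex
\min_{R,\,a,\,\xi}\; R^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
\lVert \phi(x_i) - a \rVert^2 \le R^2 + \xi_i, \qquad \xi_i \ge 0 .
```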
Reconstruction-based approaches train a model to compress and then reconstruct normal data. The idea is that a model trained only on normal data will learn representations tailored to that data and will produce high reconstruction error when presented with novel inputs.
An autoencoder is a neural network trained to map its input to a lower-dimensional latent representation (the bottleneck) and then reconstruct the original input from that representation. The training objective is to minimize the reconstruction error, typically measured as mean squared error or binary cross-entropy, over the normal training data.
For novelty detection, the reconstruction error serves as the novelty score. Because the autoencoder has been trained exclusively on normal data, it learns to reconstruct normal patterns accurately. When a novel input is presented, the autoencoder produces a poor reconstruction, resulting in a high reconstruction error that exceeds a predefined threshold.
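A minimal PyTorch sketch of this recipe appears below; the architecture, dimensions, epoch count, and 99% threshold quantile are illustrative placeholders rather than recommended settings.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim=30, latent_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 16), nn.ReLU(),
                                     nn.Linear(16, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(),
                                     nn.Linear(16, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
X_train = torch.randn(512, 30)  # stand-in for real normal training data

for _ in range(100):  # train on normal data only
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X_train), X_train)
    loss.backward()
    opt.step()

# Per-sample reconstruction error as the novelty score; calibrate on train data.
with torch.no_grad():
    err = ((model(X_train) - X_train) ** 2).mean(dim=1)
    threshold = torch.quantile(err, 0.99)

def is_novel(x):
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1) > threshold
```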
Several architectural variants have been explored for novelty detection:
A variational autoencoder (VAE) extends the standard autoencoder by imposing a probabilistic structure on the latent space. Instead of encoding each input as a single point, the encoder outputs the parameters (mean and variance) of a Gaussian distribution. During training, samples are drawn from this distribution and decoded back to the input space. The training objective combines reconstruction error with a Kullback-Leibler divergence term that regularizes the latent distribution toward a standard normal prior.
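In symbols, training maximizes the evidence lower bound (ELBO) for each input x, where q_phi is the encoder, p_theta the decoder, and p(z) the standard normal prior:

```latex
\mathcal{L}(x) \;=\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
\;-\; \mathrm{KL}\!\left(q_\phi(z \mid x) \,\Vert\, p(z)\right).
```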
For novelty detection, VAEs can use either the reconstruction error or the evidence lower bound (ELBO) as a novelty score. Because the latent space is regularized, VAEs tend to produce smoother latent representations than standard autoencoders, which can improve separation between normal and novel samples. Research has shown that VAEs can outperform ordinary autoencoders for novelty detection in certain settings because the probabilistic encoding provides a richer signal for identifying distributional shifts.
A generative adversarial network (GAN) consists of a generator that produces synthetic data and a discriminator that attempts to distinguish real data from generated data. For novelty detection, a GAN is trained on normal data so that the generator learns to produce realistic normal samples. At test time, novelty can be assessed in several ways: by computing the reconstruction error of a test sample through the generator (as in BiGAN or AnoGAN architectures), by examining the discriminator's confidence score, or by measuring the distance between a test sample and the closest generated sample in the latent space.
GAN-based novelty detection has shown strong results in image domains, but training GANs can be unstable and sensitive to hyperparameter choices.
The Isolation Forest algorithm, proposed by Liu, Ting, and Zhou (2008), takes a fundamentally different approach to novelty and anomaly detection. Rather than modeling what normal data looks like, it explicitly isolates anomalies by exploiting their key properties: anomalies are few in number and differ significantly from normal points.
The algorithm works as follows: it draws a random subsample of the training data and builds an ensemble of binary isolation trees, each constructed by recursively choosing a random feature and a random split value between that feature's minimum and maximum, until every point is isolated in its own leaf or a depth limit is reached. Because anomalies are few and different, they tend to be separated from the rest of the data after only a few splits, so their average path length from root to leaf across the ensemble is short; normal points, which lie in dense regions, require many more splits to isolate.
Path lengths are normalized by c(n) = 2H(n-1) - 2(n-1)/n, the average path length of an unsuccessful search in a binary search tree built from n samples, where H(k) is the harmonic number. The anomaly score s for a data point x is then defined as s(x, n) = 2^(-E(h(x)) / c(n)), where E(h(x)) is the average path length of x over all trees. A score close to 1 indicates an anomaly, a score well below 0.5 indicates a clearly normal point, and scores around 0.5 are inconclusive.
| Property | Value |
|---|---|
| Training time complexity | O(t * psi * log(psi)), where t is the number of trees and psi is the subsample size |
| Prediction time complexity | O(t * log(psi)) per sample |
| Memory requirement | Low; each tree stores only split features and split values |
| Hyperparameters | Number of trees (t), subsample size (psi), contamination ratio |
| High-dimensional performance | Good; random feature selection mitigates, though does not eliminate, the curse of dimensionality |
Isolation Forest is implemented in scikit-learn as IsolationForest. While originally designed for outlier detection (where the training set may contain anomalies), it can also be used for novelty detection by training on clean data and then applying the model to new observations.
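A minimal novelty-style usage with scikit-learn; max_samples=256 mirrors the subsample size recommended in the original paper, and the remaining values are illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, (500, 2))  # assumed clean for the novelty setting

forest = IsolationForest(n_estimators=100, max_samples=256,
                         random_state=0).fit(X_train)

X_new = np.array([[0.1, -0.3], [5.0, 5.0]])
print(forest.predict(X_new))        # 1 = normal, -1 = anomalous/novel
print(forest.score_samples(X_new))  # lower (more negative) = more anomalous
```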
Beyond autoencoders, VAEs, and GANs, several other deep learning architectures have been applied to novelty detection.
Deep SVDD (Ruff et al., 2018) combines the idea of SVDD with deep neural networks. A neural network is trained to map normal data to a compact region around a fixed center point in the output space. The training objective minimizes the mean distance between the network's outputs and the center. At test time, points that map far from the center are classified as novel. This approach allows the boundary to be learned in a data-driven feature space rather than in a fixed kernel feature space.
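The core training loop is compact. The PyTorch sketch below is a simplified illustration, not the reference implementation: Ruff et al. additionally remove bias terms (among other safeguards) to prevent the network from collapsing to a trivial constant mapping.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(30, 16), nn.ReLU(), nn.Linear(16, 8))
X_train = torch.randn(512, 30)  # stand-in for real normal training data

with torch.no_grad():
    c = net(X_train).mean(dim=0)  # fix the center as the mean initial output

opt = torch.optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-5)
for _ in range(100):
    opt.zero_grad()
    loss = ((net(X_train) - c) ** 2).sum(dim=1).mean()  # pull outputs toward c
    loss.backward()
    opt.step()

def novelty_score(x):
    with torch.no_grad():
        return ((net(x) - c) ** 2).sum(dim=1)  # distance from the center
```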
Self-supervised novelty detection creates pretext tasks (such as predicting image rotations, solving jigsaw puzzles, or contrastive learning) from unlabeled normal data. A model trained on these tasks learns representations that capture the structure of normal data. At test time, poor performance on the pretext task indicates that the input does not match the patterns seen during training. Self-supervised approaches have become increasingly popular because they leverage the power of modern representation learning without requiring labeled anomalies.
Recent work has applied transformer architectures to novelty detection, particularly for sequential and time-series data. Transformers can model long-range dependencies in normal sequences and detect novel patterns that violate those dependencies. Attention mechanisms in transformers also provide interpretability by highlighting which parts of the input contributed most to a novelty decision.
Evaluating novelty detection systems requires metrics that account for the typically imbalanced nature of the problem (novel samples are rare). The most commonly used metrics include the following.
| Metric | Description | When to use |
|---|---|---|
| AUROC | Area under the ROC curve; measures discrimination ability across all thresholds | General-purpose evaluation; as a ranking metric, it is insensitive to class imbalance |
| AUPRC | Area under the precision-recall curve; focuses on performance for the positive (novel) class | When the positive class is very rare and false positives are costly |
| F1 score | Harmonic mean of precision and recall; requires a fixed threshold | When a single operating point must be chosen |
| Precision at k | Fraction of the top-k scored samples that are truly novel | When only the top-ranked samples will be inspected |
| False positive rate at fixed true positive rate | FPR when TPR is held at a specific level (e.g., 95%) | When a minimum detection rate is required |
AUROC is the most widely reported metric in novelty detection research because it summarizes performance across all possible thresholds. However, AUPRC can be more informative when the novel class is extremely rare, because AUROC can appear high even when the model produces many false positives in absolute terms.
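Both ranking metrics are one-liners with scikit-learn; the labels and scores below are made-up values for illustration (1 = novel, higher score = more novel).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.1, 0.4, 0.2, 0.9, 0.35])

print("AUROC:", roc_auc_score(y_true, scores))
print("AUPRC:", average_precision_score(y_true, scores))  # average precision, a common AUPRC estimate
```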
Novelty detection has been applied across a wide range of domains. The following table summarizes some of the most common application areas.
| Domain | Application | Typical data | Common methods |
|---|---|---|---|
| Cybersecurity | Detecting novel intrusion attempts, zero-day attacks, and new malware variants | Network traffic logs, system call traces | One-Class SVM, Isolation Forest, autoencoders |
| Manufacturing | Identifying defects or failures not seen during quality control training | Sensor readings, images of products | Convolutional autoencoders, One-Class SVM |
| Medical monitoring | Detecting abnormal patient vitals or rare disease patterns | Electrocardiograms, medical images, patient records | GMMs, LOF, deep autoencoders |
| Fraud detection | Flagging new types of fraudulent transactions | Transaction records, user behavior logs | Isolation Forest, autoencoders, LOF |
| Autonomous vehicles | Recognizing unknown objects or scenarios on the road | Camera images, lidar point clouds | Deep one-class classification, self-supervised methods |
| Natural language processing | Detecting novel topics, events, or out-of-domain text | Text documents, social media posts | Transformer-based methods, probabilistic models |
| Robotics | Identifying unfamiliar objects or environments | Sensor data, camera feeds | Reconstruction-based methods, distance-based methods |
| Scientific discovery | Flagging unusual astronomical events or particle physics signals | Telescope observations, detector readouts | KDE, Isolation Forest, deep learning |
Several popular machine learning libraries provide implementations of novelty detection algorithms.
| Library | Algorithms available | Language |
|---|---|---|
| scikit-learn | One-Class SVM, SGD One-Class SVM, Isolation Forest, LOF, Elliptic Envelope | Python |
| PyTorch | Custom autoencoders, VAEs, GANs, Deep SVDD (via libraries like PyOD) | Python |
| TensorFlow / Keras | Custom autoencoders, VAEs, GANs | Python |
| PyOD | Over 40 algorithms including ABOD, COPOD, ECOD, SOD, and deep learning methods | Python |
| MATLAB | Isolation Forest, One-Class SVM, LOF, robust covariance | MATLAB |
PyOD (Python Outlier Detection) is a particularly comprehensive library that provides a unified API for dozens of novelty and outlier detection algorithms, including both classical statistical methods and deep learning approaches.
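As a taste of that unified API, here is a sketch using PyOD's kNN detector; the data are synthetic and the model choice is arbitrary. Note that PyOD labels outliers with 1 and inliers with 0, unlike scikit-learn's 1/-1 convention.

```python
import numpy as np
from pyod.models.knn import KNN  # requires: pip install pyod

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, (300, 2))
X_new = np.array([[0.1, 0.2], [5.0, 5.0]])

clf = KNN(n_neighbors=5)
clf.fit(X_train)

print(clf.decision_function(X_new))  # higher = more anomalous
print(clf.predict(X_new))            # 0 = inlier, 1 = outlier
```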
Despite significant progress, novelty detection remains a challenging problem for several reasons.
Defining normality. In many real-world settings, the boundary between normal and novel is not sharp. What counts as normal can change over time (concept drift), and different stakeholders may disagree on where to draw the line.
High-dimensional data. Many novelty detection methods rely on distance or density computations that become unreliable in high-dimensional spaces due to the curse of dimensionality. Deep learning methods partially address this by learning lower-dimensional representations, but they introduce their own challenges around architecture selection and training stability.
Threshold selection. Most novelty detection algorithms produce a continuous score rather than a binary decision. Choosing the threshold that separates normal from novel is a practical challenge, especially when labeled novel samples are unavailable for calibration.
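A common heuristic, when the training data are trusted to be clean, is to set the threshold at a high quantile of the training scores so that a chosen fraction of normal data (here roughly 1%) would be falsely flagged. A minimal sketch with made-up scores:

```python
import numpy as np

# Hypothetical novelty scores on (assumed clean) training data; higher = more novel.
scores_train = np.random.RandomState(0).gamma(2.0, 1.0, size=1000)

threshold = np.quantile(scores_train, 0.99)  # ~1% training false-positive rate

def flag_novel(scores_new):
    return scores_new > threshold
```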
Contaminated training data. Novelty detection assumes clean training data, but in practice some anomalies may be present in the training set. Methods that are robust to small amounts of contamination are an active area of research.
Evaluation without labeled data. In deployment settings, ground-truth labels for novel events may not be available, making it difficult to evaluate and monitor model performance over time.
Interpretability. When a system flags an observation as novel, practitioners often need to understand why. Providing explanations for novelty scores is an active research topic, with approaches ranging from feature importance analysis to attention visualization in deep models.
The roots of novelty detection can be traced to classical statistical methods for outlier rejection, which date back to the 19th century. Early work focused on identifying extreme values in univariate distributions using techniques such as Grubbs' test and Dixon's Q test.
The modern formulation of novelty detection as a machine learning problem began to take shape in the late 1990s and early 2000s. Schölkopf et al. introduced the support vector method for novelty detection at NIPS (now NeurIPS) in 1999, and the full journal version describing the One-Class SVM appeared in Neural Computation in 2001. Around the same time, Tax and Duin developed Support Vector Data Description (SVDD) as a complementary approach. Breunig et al. introduced the Local Outlier Factor algorithm at ACM SIGMOD in 2000, providing a density-based alternative to distance-based methods.
Markou and Singh published a comprehensive two-part review of novelty detection in Signal Processing in 2003, covering both statistical approaches (Part 1) and neural network-based approaches (Part 2). These reviews helped establish novelty detection as a recognized subfield of machine learning.
Liu, Ting, and Zhou introduced the Isolation Forest algorithm at IEEE ICDM in 2008, offering a tree-based approach with linear time complexity and low memory requirements. The extended journal version appeared in ACM Transactions on Knowledge Discovery from Data in 2012.
Pimentel et al. published an updated review in Signal Processing in 2014, organizing the field into five categories (probabilistic, distance-based, domain-based, reconstruction-based, and information-theoretic) and surveying applications across multiple domains.
The rise of deep learning from 2015 onward brought new reconstruction-based and representation-learning approaches. Deep SVDD (Ruff et al., 2018), self-supervised novelty detection methods, and transformer-based approaches have since expanded the toolkit available to practitioners.