See also: Machine learning terms
Unsupervised machine learning or unsupervised training is a type of machine learning in which the model is trained using unlabeled data. Unlike supervised machine learning, where the training set includes both input data and corresponding output labels, unsupervised learning aims to recognize patterns, structures, or relationships in data without prior knowledge about their labels or categories. The algorithm receives no explicit feedback on whether its discoveries are correct; instead, it must identify regularities in the data on its own.
Unsupervised learning is particularly valuable when labeled data is scarce, expensive to obtain, or simply unavailable. In many real-world scenarios, manual labeling is prohibitively time-consuming or costly, or outright impossible. For example, a biologist studying gene expression data across thousands of samples may have no predefined categories for the data, making unsupervised methods the natural starting point for analysis. Similarly, companies analyzing millions of customer transactions often rely on unsupervised techniques to discover natural groupings before any domain-specific labels have been assigned.
At its core, unsupervised learning involves giving a model a collection of data points and asking it to discover structure or relationships within them. Because there are no labels to compare against, the model receives no feedback on the accuracy of what it finds. This characteristic makes unsupervised learning both powerful and challenging: it can reveal genuinely novel structure in data, but evaluating the quality of its results is inherently more difficult than in supervised settings.
Unsupervised machine learning encompasses several major families of techniques, each addressing a different aspect of discovering structure in unlabeled data. The primary categories are clustering, dimensionality reduction, density estimation, anomaly detection, association rule mining, topic modeling, and generative modeling.
Clustering is an unsupervised learning technique used to group similar data points together. The objective of clustering is to discover natural groupings within the data such that data points within the same cluster are more similar to each other than they are to data points in other clusters. Clustering can be beneficial for tasks such as customer segmentation, anomaly detection, image segmentation, document organization, and biological taxonomy.
Several families of clustering algorithms exist, each making different assumptions about the shape, size, and density of clusters.
K-means is one of the most widely used clustering algorithms due to its simplicity and efficiency. The algorithm partitions a dataset into k clusters, where k is specified in advance by the user. It works by iteratively assigning each data point to the nearest cluster centroid and then recomputing the centroids as the mean of all points assigned to each cluster. This process repeats until the assignments stabilize or a maximum number of iterations is reached.
K-means minimizes the within-cluster sum of squares (WCSS), also known as inertia. It works best when clusters are roughly spherical and of similar size. The algorithm has a time complexity of O(n * k * d * i), where n is the number of data points, k is the number of clusters, d is the number of dimensions, and i is the number of iterations.
Variants such as K-means++ improve initialization to avoid poor convergence, while Mini-Batch K-means processes random subsets of data for faster execution on large datasets. K-medoids (PAM) is a related algorithm that uses actual data points as cluster centers rather than computed means, making it more robust to outliers.
Limitations of K-means include its sensitivity to the initial placement of centroids, the requirement to specify k beforehand, and its assumption that clusters are convex and isotropic. It also struggles with clusters of varying sizes or densities.
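As a minimal sketch of the procedure described above (assuming scikit-learn is available; the synthetic data and parameter choices are illustrative):

```python
# A minimal K-means sketch using scikit-learn's KMeans on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points drawn around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k-means++ initialization (the scikit-learn default) mitigates poor
# centroid placement; n_init restarts guard against bad local optima.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Centroids:\n", kmeans.cluster_centers_)
print("WCSS (inertia):", kmeans.inertia_)
```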
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm introduced by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996. Unlike K-means, DBSCAN does not require the user to specify the number of clusters in advance. Instead, it identifies clusters as regions of high density separated by regions of low density.
DBSCAN uses two parameters: epsilon (the radius of a neighborhood around a data point) and MinPts (the minimum number of points required to form a dense region). Points that have at least MinPts neighbors within their epsilon-radius are classified as core points. Points reachable from core points but without enough neighbors of their own are border points. Points that are neither core nor border points are classified as noise.
DBSCAN's strengths include its ability to discover clusters of arbitrary shape, its robustness to outliers, and the fact that it does not require a predefined number of clusters. However, it can struggle when clusters have significantly different densities, since a single set of epsilon and MinPts values may not suit all clusters. HDBSCAN (Hierarchical DBSCAN) addresses this limitation by allowing the density threshold to vary.
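A brief sketch with scikit-learn (eps and min_samples correspond to the epsilon and MinPts parameters described above; the two-moons data is illustrative):

```python
# DBSCAN on two interleaved half-moons, a shape K-means cannot separate.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Label -1 marks noise points; other labels index the discovered clusters.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("Clusters found:", n_clusters, "| noise points:", list(db.labels_).count(-1))
```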
Hierarchical clustering builds a tree-like structure (called a dendrogram) of nested clusters. There are two main approaches:

- Agglomerative (bottom-up): each data point starts in its own cluster, and the two closest clusters are repeatedly merged until a single cluster remains.
- Divisive (top-down): all data points start in a single cluster, which is recursively split into smaller clusters.
The choice of linkage criterion determines how the distance between clusters is calculated. Common linkage methods include single linkage (minimum distance between any pair of points), complete linkage (maximum distance), average linkage (mean pairwise distance), and Ward's method (minimizes the increase in total within-cluster variance).
Hierarchical clustering does not require specifying the number of clusters in advance, and the dendrogram provides an intuitive visualization of the data's hierarchical structure. However, it has a higher computational cost, typically O(n^2) or worse for agglomerative methods, making it less suitable for very large datasets.
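A short sketch using SciPy (the dataset is synthetic): the linkage matrix encodes the dendrogram, and fcluster cuts it into a chosen number of flat clusters.

```python
# Agglomerative clustering with SciPy using Ward's method.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

Z = linkage(X, method="ward")                    # Ward: minimize variance increase
labels = fcluster(Z, t=3, criterion="maxclust")  # cut dendrogram into 3 clusters
print("Cluster sizes:", np.bincount(labels)[1:])
```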
Gaussian Mixture Models (GMMs) represent a probabilistic approach to clustering. A GMM assumes that the data is generated from a mixture of several Gaussian (normal) distributions, each representing a cluster. The model estimates the parameters (mean, covariance, and mixing coefficient) of each Gaussian component using the Expectation-Maximization (EM) algorithm.
Unlike K-means, which assigns each point to exactly one cluster (hard assignment), GMMs provide soft assignments: each data point receives a probability of belonging to each cluster. This makes GMMs more flexible, as they can model elliptical clusters of varying sizes and orientations. GMMs are also the foundation for more advanced techniques in speech recognition, image segmentation, and density estimation.
The number of components in a GMM is typically selected using model selection criteria such as the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC).
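The sketch below (scikit-learn; synthetic data) fits GMMs with one to six components and selects the count by BIC, then shows the soft assignments described above:

```python
# GMM fitting with BIC-based model selection.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=7)

models = [GaussianMixture(n_components=k, random_state=7).fit(X)
          for k in range(1, 7)]
best = min(models, key=lambda m: m.bic(X))       # lower BIC is better
print("Components chosen by BIC:", best.n_components)

# Soft assignment: a probability of membership in each cluster, per point.
print("P(cluster | first point):", best.predict_proba(X[:1]).round(3))
```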
Mean shift is a non-parametric clustering algorithm that does not require specifying the number of clusters in advance. It works by iteratively shifting each data point toward the mode (region of highest density) of the local data distribution. The bandwidth parameter controls the size of the region considered for each shift. Mean shift can find clusters of arbitrary shape, but it is computationally expensive for large datasets.
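A minimal mean shift sketch with scikit-learn (the quantile used to estimate the bandwidth is an illustrative choice, not a universal default):

```python
# Mean shift clustering; the bandwidth is estimated from the data.
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

bw = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bw).fit(X)
print("Clusters found:", len(ms.cluster_centers_))
```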
Spectral clustering uses the eigenvalues of a similarity (affinity) matrix to reduce the dimensionality of the data before applying a standard clustering method like K-means. By constructing a graph where edges represent similarities between data points and computing the graph Laplacian, spectral clustering can identify non-convex clusters that K-means would miss. It is particularly useful for image segmentation and community detection in networks.
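The sketch below shows spectral clustering separating two interleaved half-moons, a non-convex shape that defeats K-means (scikit-learn; illustrative data):

```python
# Spectral clustering with a nearest-neighbors similarity graph.
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors", random_state=0)
labels = sc.fit_predict(X)
print("Points per cluster:", [(labels == c).sum() for c in (0, 1)])
```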
Dimensionality reduction is an unsupervised learning technique used to reduce the number of features in data. The objective is to simplify the information while maintaining as much meaningful structure as possible. High-dimensional data is common in fields such as genomics, natural language processing, and computer vision, where datasets may have thousands or even millions of features. Dimensionality reduction can be beneficial for tasks such as data visualization, noise reduction, feature extraction, and as a preprocessing step for other machine learning algorithms.
Principal Component Analysis (PCA) is one of the oldest and most widely used dimensionality reduction techniques. Introduced by Karl Pearson in 1901 and further developed by Harold Hotelling in the 1930s, PCA works by projecting data onto a lower-dimensional subspace that captures the maximum amount of variance.
PCA computes the eigenvectors and eigenvalues of the data's covariance matrix. The eigenvectors (principal components) define the directions of maximum variance, and the eigenvalues indicate the amount of variance captured along each direction. By selecting only the top k principal components, PCA reduces the dimensionality of the data while retaining as much information as possible.
PCA is a linear method, meaning it can only capture linear relationships between features. It is computationally efficient, interpretable (each principal component is a linear combination of the original features), and widely used for preprocessing, visualization, and noise reduction. Kernel PCA extends PCA to capture nonlinear relationships by mapping data into a higher-dimensional space before applying PCA.
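A brief PCA sketch with scikit-learn on the 64-dimensional digits dataset, inspecting how much variance each principal component captures:

```python
# PCA: project 64-dimensional digit images onto the top 10 components.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # 1797 samples, 64 features

pca = PCA(n_components=10).fit(X)
print("Variance explained per component:", pca.explained_variance_ratio_.round(3))
print("Total variance retained:", round(float(pca.explained_variance_ratio_.sum()), 3))

X_reduced = pca.transform(X)                 # shape: (1797, 10)
```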
t-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique developed by Laurens van der Maaten and Geoffrey Hinton in 2008. It is primarily used for visualization of high-dimensional data in two or three dimensions.
t-SNE works by converting high-dimensional pairwise distances into conditional probabilities that represent similarities. It then finds a low-dimensional embedding that minimizes the Kullback-Leibler divergence between the high-dimensional and low-dimensional probability distributions. The use of a Student t-distribution in the low-dimensional space (rather than a Gaussian) helps alleviate the "crowding problem" that affects earlier methods like Stochastic Neighbor Embedding.
t-SNE excels at preserving local structure and revealing clusters in data, making it popular for visualizing embeddings from neural networks, gene expression data, and other high-dimensional datasets. However, it is computationally expensive (O(n^2) in its basic form), does not preserve global structure reliably, and the results can vary across different runs due to random initialization. The perplexity hyperparameter, which roughly controls the balance between local and global structure, must be carefully tuned.
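A minimal t-SNE sketch with scikit-learn (perplexity and the random seed are the illustrative knobs noted above):

```python
# t-SNE embedding of the digits dataset into 2D for visualization.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print("Embedding shape:", emb.shape)         # (1797, 2), ready for a scatter plot
```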
Uniform Manifold Approximation and Projection (UMAP), introduced by Leland McInnes, John Healy, and James Melville in 2018, is a nonlinear dimensionality reduction technique grounded in Riemannian geometry and algebraic topology. UMAP constructs a high-dimensional graph representation of the data and then optimizes a low-dimensional layout that preserves the topological structure.
UMAP offers several advantages over t-SNE. It is significantly faster, especially on large datasets, and scales better to millions of data points. UMAP also tends to preserve both local and global structure more effectively than t-SNE, producing embeddings where the relative distances between clusters are more meaningful. It supports arbitrary embedding dimensions (not just 2D or 3D) and can be used as a general-purpose dimensionality reduction technique, not only for visualization.
UMAP has become widely adopted in bioinformatics (particularly single-cell RNA sequencing analysis), natural language processing, and other domains where high-dimensional data visualization is critical.
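A short sketch using the third-party umap-learn package (installable via pip install umap-learn; parameter values are illustrative):

```python
# UMAP embedding; n_neighbors balances local vs. global structure,
# min_dist controls how tightly points pack in the embedding.
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
emb = reducer.fit_transform(X)
print("Embedding shape:", emb.shape)
```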
Autoencoders are a class of neural networks trained to reconstruct their input through a bottleneck layer, effectively learning a compressed representation of the data. An autoencoder consists of two parts: an encoder that maps the input to a lower-dimensional latent space, and a decoder that reconstructs the input from the latent representation. The network is trained to minimize the reconstruction error between the input and the output.
Because autoencoders learn nonlinear transformations, they can capture more complex relationships than PCA. Several important variants exist:

- Denoising autoencoders, trained to reconstruct clean inputs from corrupted versions, which forces the network to learn robust features.
- Sparse autoencoders, which add a sparsity penalty on the latent activations so that only a few units are active for any given input.
- Contractive autoencoders, which penalize the sensitivity of the latent representation to small perturbations of the input.
- Variational autoencoders, which impose a probabilistic structure on the latent space (discussed under generative models below).
Autoencoders are used for dimensionality reduction, feature learning, denoising, and as building blocks for more complex generative models such as variational autoencoders.
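A minimal PyTorch sketch of the encode-bottleneck-decode-reconstruct cycle (trained here on random stand-in data purely to show the mechanics):

```python
# A minimal autoencoder: 64 -> 8 -> 64, trained to minimize MSE reconstruction error.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 8))  # 8-dim bottleneck
decoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 64))

model = nn.Sequential(encoder, decoder)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.randn(256, 64)               # stand-in for real 64-feature data
for step in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), X)        # reconstruction error
    loss.backward()
    opt.step()

codes = encoder(X)                     # the learned 8-dim representation
print("Latent codes shape:", codes.shape)
```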
Independent Component Analysis (ICA) is a computational technique for separating a multivariate signal into additive, statistically independent components. ICA was originally developed for the "cocktail party problem," where multiple audio sources (such as several people speaking simultaneously in a room) are recorded by multiple microphones, and the goal is to recover the individual source signals from the mixed recordings.
Unlike PCA, which finds components that are uncorrelated and ordered by variance explained, ICA finds components that are statistically independent and typically non-Gaussian. ICA assumes that the observed signals are linear mixtures of the independent source signals. The FastICA algorithm, developed by Aapo Hyvärinen and Erkki Oja, is one of the most widely used ICA implementations and uses negentropy as its cost function.
ICA is applied in biomedical signal processing (separating brain signals in EEG and MEG recordings), telecommunications, image processing, and financial data analysis. It is a linear method and assumes that the mixing process is linear and that the source signals are non-Gaussian and statistically independent.
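A toy blind-source-separation sketch with scikit-learn's FastICA, in the spirit of the cocktail party problem (the sources and mixing matrix are synthetic):

```python
# FastICA: recover two independent source signals from linear mixtures.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                         # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))                # source 2: square wave (non-Gaussian)
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5], [0.5, 1.0]])     # mixing matrix
X = S @ A.T                                # observed mixtures ("microphones")

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)               # recovered sources (up to order and scale)
print("Recovered shape:", S_est.shape)
```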
Self-Organizing Maps (SOMs), also known as Kohonen maps, are a type of artificial neural network introduced by Finnish professor Teuvo Kohonen in the 1980s. SOMs produce a low-dimensional (typically two-dimensional) discrete representation of the input space, preserving the topological properties of the data.
Unlike most neural networks, SOMs are trained using competitive learning rather than backpropagation. The training process works as follows:

1. The weight vectors of the grid nodes are initialized, randomly or by sampling from the data.
2. A data point is selected, and the node whose weight vector is closest to it is identified as the best matching unit (BMU).
3. The BMU and its neighbors on the grid are moved toward the data point, with the adjustment shrinking with grid distance from the BMU.
4. The learning rate and neighborhood radius decay over time, and steps 2-3 repeat until the map stabilizes.
SOMs are useful for visualizing high-dimensional data on a 2D grid while preserving topological relationships. They have been applied in document clustering, financial analysis, bioinformatics, and industrial process monitoring. A key advantage of SOMs over methods like K-means is that they preserve the neighborhood structure of the data, so similar clusters appear adjacent on the map.
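A from-scratch NumPy sketch implementing the training steps above (the grid size, schedules, and color data are illustrative choices, not a canonical configuration):

```python
# A minimal SOM: a 10x10 grid of 3-dim prototypes trained on random RGB colors.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 3))                 # e.g., RGB colors in [0, 1]
grid_w, grid_h, n_iter = 10, 10, 2000
W = rng.random((grid_w, grid_h, 3))         # step 1: random weight initialization

# Grid coordinates, used to compute neighborhood distances on the map.
gx, gy = np.meshgrid(np.arange(grid_w), np.arange(grid_h), indexing="ij")

for t in range(n_iter):
    lr = 0.5 * np.exp(-t / n_iter)                  # decaying learning rate
    radius = (grid_w / 2) * np.exp(-t / n_iter)     # decaying neighborhood radius

    x = data[rng.integers(len(data))]               # step 2: pick a random input
    dists = np.linalg.norm(W - x, axis=2)
    bi, bj = np.unravel_index(dists.argmin(), dists.shape)  # best matching unit

    # Step 3: pull the BMU and its grid neighbors toward the input.
    grid_dist2 = (gx - bi) ** 2 + (gy - bj) ** 2
    influence = np.exp(-grid_dist2 / (2 * radius ** 2))
    W += lr * influence[..., None] * (x - W)

print("Trained map shape:", W.shape)        # (10, 10, 3): a 2D map of prototypes
```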
Density estimation is the task of estimating the probability density function that generated a set of observed data points. It is a fundamental problem in unsupervised learning because understanding the data distribution enables many downstream tasks, including anomaly detection, data generation, and clustering.
Parametric density estimation assumes a specific functional form for the distribution (such as a Gaussian) and estimates its parameters from the data. Gaussian Mixture Models are a common parametric approach. Non-parametric density estimation makes fewer assumptions about the distribution shape. Kernel Density Estimation (KDE) is a widely used non-parametric method that places a kernel function (often a Gaussian) at each data point and sums them to produce a smooth density estimate. The bandwidth parameter controls the smoothness of the resulting estimate.
Density estimation serves as a building block for many unsupervised learning tasks. DBSCAN, for example, implicitly relies on density estimation to define clusters, and anomaly detectors frequently flag points in low-density regions.
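A brief KDE sketch with scikit-learn, placing a Gaussian kernel at each point of a bimodal sample (bandwidth and data are illustrative):

```python
# Kernel density estimation on a two-mode 1D sample.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.5, 300),
                    rng.normal(2, 1.0, 300)]).reshape(-1, 1)

kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(X)

grid = np.linspace(-5, 5, 5).reshape(-1, 1)
log_density = kde.score_samples(grid)       # log p(x) at each grid point
print(np.exp(log_density).round(3))
```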
Anomaly detection (also called outlier detection) is the task of identifying data points that deviate significantly from the expected pattern. In the unsupervised setting, no labels indicating which data points are anomalies are available; the algorithm must determine what constitutes "normal" behavior from the data alone.
Unsupervised anomaly detection is used in fraud detection, network intrusion detection, manufacturing quality control, medical diagnostics, and system monitoring.
The Isolation Forest algorithm, proposed by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in 2008, detects anomalies by exploiting the observation that anomalous points are easier to isolate than normal points. The algorithm builds an ensemble of random isolation trees by recursively selecting random features and random split values. Anomalies, being few and different, tend to be isolated in fewer splits (shorter path lengths) than normal points.
Isolation Forest is efficient, with a time complexity of O(n * log(n)), and performs well on high-dimensional data. It does not require computing distances between data points, making it faster than many density-based methods.
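A short sketch with scikit-learn, fitting on mostly-normal data with a few injected outliers (the contamination value is the assumed outlier fraction, an illustrative choice):

```python
# Isolation Forest: anomalies get short average path lengths and are flagged -1.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, (300, 2))
outliers = rng.uniform(-6, 6, (10, 2))
X = np.vstack([normal, outliers])

iso = IsolationForest(n_estimators=100, contamination=0.03, random_state=0).fit(X)
pred = iso.predict(X)                  # +1 = normal, -1 = anomaly
print("Flagged anomalies:", (pred == -1).sum())
```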
The Local Outlier Factor (LOF) algorithm, introduced by Markus Breunig, Hans-Peter Kriegel, Raymond Ng, and Jorg Sander in 2000, detects anomalies based on the local density of data points relative to their neighbors. LOF computes a score for each point that reflects how isolated it is compared to its surrounding neighborhood. A LOF score significantly greater than 1 indicates a potential outlier.
LOF is effective at detecting local anomalies that may appear normal in a global context but are unusual compared to their neighbors. It handles datasets where the density varies across different regions.
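A minimal LOF sketch with scikit-learn on data whose density varies by region (the neighborhood size is an illustrative choice):

```python
# LOF: scores well above 1 mark points much less dense than their neighbors.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (200, 2)),    # dense cluster
               rng.normal(5, 2.0, (200, 2)),    # sparse cluster
               [[2.5, 2.5]]])                   # a point between the two

lof = LocalOutlierFactor(n_neighbors=20)
pred = lof.fit_predict(X)                       # +1 = inlier, -1 = outlier
print("Outliers flagged:", (pred == -1).sum())
print("LOF score of last point:", round(float(-lof.negative_outlier_factor_[-1]), 2))
```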
One-Class Support Vector Machine (One-Class SVM) learns a boundary that encompasses the majority of the data in a high-dimensional feature space. Data points falling outside this boundary are classified as anomalies. The algorithm maps data into a high-dimensional space using a kernel function and finds a hyperplane that separates the data from the origin with maximum margin.
One-Class SVM is effective for novelty detection, where the training set contains only normal examples and the goal is to identify new data points that differ from the training distribution.
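A novelty-detection sketch with scikit-learn: the model is trained only on normal data, then scores unseen points (nu bounds the fraction of margin violations; values are illustrative):

```python
# One-Class SVM for novelty detection with an RBF kernel.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
X_train = rng.normal(0, 1, (200, 2))            # normal data only

oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_new = np.array([[0.1, -0.2], [4.0, 4.0]])     # one typical, one novel point
print(oc_svm.predict(X_new))                    # +1 = normal, -1 = novelty
```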
Association rule mining is a technique for discovering relationships between variables in large datasets. It identifies rules of the form "if X then Y" (written X => Y), where X and Y are sets of items that frequently co-occur.
Three key metrics are used to evaluate association rules:

- Support: the fraction of transactions that contain both X and Y, measuring how frequently the rule applies.
- Confidence: the fraction of transactions containing X that also contain Y, an estimate of P(Y | X).
- Lift: the ratio of the rule's confidence to the support of Y; a lift greater than 1 indicates that X and Y co-occur more often than expected if they were independent.
The Apriori algorithm, introduced by Rakesh Agrawal and Ramakrishnan Srikant in 1994, is the foundational algorithm for association rule mining. It works by iteratively identifying frequent itemsets (sets of items that appear together above a minimum support threshold) and then generating rules from these itemsets.
The key insight of Apriori is the "Apriori principle": if an itemset is infrequent, all of its supersets must also be infrequent. This allows the algorithm to prune the search space significantly.
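A market-basket sketch using the third-party mlxtend package (pip install mlxtend; the transactions and thresholds are illustrative):

```python
# Frequent itemsets via Apriori, then rules filtered by confidence.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["bread", "milk"],
                ["bread", "diapers", "beer"],
                ["milk", "diapers", "beer"],
                ["bread", "milk", "diapers"],
                ["bread", "milk", "beer"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```

mlxtend's fpgrowth function is a drop-in replacement for apriori in this snippet, which is one way to try the FP-Growth approach described next.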
The FP-Growth (Frequent Pattern Growth) algorithm, proposed by Jiawei Han, Jian Pei, and Yiwen Yin in 2000, improves upon Apriori by avoiding repeated scans of the database. It compresses the dataset into a compact data structure called an FP-tree and then mines frequent patterns directly from the tree using a divide-and-conquer strategy.
FP-Growth is typically faster and more memory-efficient than Apriori, especially on large datasets.
Association rule mining is widely used in market basket analysis (discovering which products are frequently purchased together), web usage mining, bioinformatics, and recommendation systems.
Topic modeling is an unsupervised technique for discovering abstract "topics" that occur in a collection of documents. Each document is represented as a mixture of topics, and each topic is characterized by a distribution over words.
Latent Dirichlet Allocation (LDA), introduced by David Blei, Andrew Ng, and Michael Jordan in 2003, is the most widely used topic model. LDA assumes that each document is generated by first choosing a distribution over topics, and then for each word in the document, choosing a topic from that distribution and drawing a word from the topic's word distribution. The model is typically fit using variational inference or Gibbs sampling.
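A toy LDA sketch with scikit-learn (the four-document corpus and topic count are illustrative):

```python
# LDA on a tiny corpus: fit 2 topics and print the top words of each.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat",
        "dogs and cats make good pets",
        "stock markets fell sharply today",
        "investors fear rising interest rates"]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-3:][::-1]            # indices of the 3 heaviest words
    print(f"Topic {k}:", [vocab[i] for i in top])
```

Swapping in scikit-learn's NMF (typically with a TF-IDF matrix) gives the alternative factorization described below.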
Non-Negative Matrix Factorization (NMF) is an alternative approach to topic modeling that decomposes the document-term matrix into two non-negative matrices: one representing document-topic associations and another representing topic-word associations. Because all factors are non-negative, the resulting components are often more interpretable than those produced by methods that allow negative values (such as SVD).
Topic models are used for document organization, information retrieval, content recommendation, and exploratory analysis of large text corpora.
Generative models are a class of unsupervised learning algorithms that learn the underlying probability distribution of the training data in order to generate new samples that resemble the original data. Rather than simply discovering structure (as clustering or dimensionality reduction do), generative models explicitly or implicitly model the full data distribution.
Variational autoencoders (VAEs), introduced by Diederik Kingma and Max Welling in 2013, extend the standard autoencoder framework with a probabilistic approach. Instead of mapping each input to a single point in the latent space, VAEs map inputs to a probability distribution (typically a Gaussian). The model is trained to maximize a lower bound on the data likelihood (the evidence lower bound, or ELBO), which consists of a reconstruction term and a regularization term that encourages the latent space to follow a standard normal distribution.
VAEs produce a smooth, continuous latent space, which makes them suitable for interpolation between data points and controlled generation. They are used in image generation, drug discovery, music synthesis, and anomaly detection. However, VAEs tend to produce blurrier outputs compared to other generative models, because the reconstruction loss encourages averaging over plausible outputs.
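A compact PyTorch VAE sketch of the mechanics described above: the encoder outputs a mean and log-variance, a latent code is sampled via the reparameterization trick, and the loss combines reconstruction error with a KL term pulling the latent distribution toward N(0, I). The architecture and stand-in data are illustrative.

```python
# A minimal VAE trained on random stand-in data.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=64, z_dim=8):
        super().__init__()
        self.enc = nn.Linear(x_dim, 32)
        self.mu = nn.Linear(32, z_dim)
        self.logvar = nn.Linear(32, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, x_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, logvar

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
X = torch.randn(256, 64)                      # stand-in data

for step in range(200):
    recon, mu, logvar = model(X)
    recon_loss = F.mse_loss(recon, X, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
    loss = (recon_loss + kl) / len(X)         # negative ELBO (up to constants)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("Final loss:", loss.item())
```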
Generative Adversarial Networks (GANs), introduced by Ian Goodfellow and colleagues in 2014, consist of two neural networks trained in competition with each other. The generator network produces synthetic data samples, while the discriminator network attempts to distinguish between real and generated samples. Training proceeds as a minimax game: the generator improves by trying to fool the discriminator, and the discriminator improves by getting better at detecting fakes.
GANs are capable of generating highly realistic images, videos, and audio. Notable GAN variants include DCGAN (using convolutional architectures), StyleGAN (enabling control over visual features at different scales), CycleGAN (performing unpaired image-to-image translation), and Wasserstein GAN (using the Wasserstein distance for more stable training).
Training GANs can be difficult due to issues such as mode collapse (the generator produces limited variety), training instability, and the need to carefully balance the generator and discriminator. Despite these challenges, GANs have had a significant impact on creative applications, data augmentation, and super-resolution imaging.
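A minimal PyTorch sketch of the minimax game on 2D toy data; real GANs use convolutional architectures and many stabilization tricks omitted here, so this is only a skeleton of the alternating updates.

```python
# Alternating discriminator/generator updates on a toy 2D Gaussian target.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))   # generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_data = torch.randn(10000, 2) * 0.5 + torch.tensor([2.0, 2.0])  # target distribution

for step in range(1000):
    real = real_data[torch.randint(0, len(real_data), (128,))]
    fake = G(torch.randn(128, 2))

    # Discriminator step: push real toward 1, fake toward 0.
    d_loss = (bce(D(real), torch.ones(128, 1))
              + bce(D(fake.detach()), torch.zeros(128, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: fool the discriminator (fake toward 1).
    g_loss = bce(D(fake), torch.ones(128, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print("Mean of generated samples:", G(torch.randn(1000, 2)).mean(0).detach())
```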
Diffusion models, which gained prominence beginning around 2020, represent a newer class of generative models. They work through two processes: a forward diffusion process that gradually adds Gaussian noise to the data over a series of steps until the data becomes pure noise, and a reverse denoising process where a neural network learns to reverse each step, gradually reconstructing data from noise.
Diffusion models have achieved state-of-the-art results in image generation, surpassing GANs in both sample quality and diversity. Systems such as DALL-E 2, Stable Diffusion, and Midjourney are built on diffusion model architectures. The training process for diffusion models is more stable than GAN training, though inference (sample generation) is slower because it requires running the denoising process over many steps. Techniques such as DDIM (Denoising Diffusion Implicit Models) and latent diffusion (operating in a compressed latent space rather than pixel space) have been developed to speed up inference.
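The forward (noising) process can be written in a few lines: under the standard DDPM formulation, x_t is a closed-form interpolation between the data x_0 and pure Gaussian noise, governed by the cumulative schedule. The linear beta schedule below is an illustrative choice.

```python
# The closed-form forward diffusion step: x_t ~ q(x_t | x_0).
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # per-step noise schedule
alpha_bar = np.cumprod(1.0 - betas)           # cumulative signal retention

def q_sample(x0, t, rng=np.random.default_rng(0)):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

x0 = np.ones(4)                               # stand-in data point
for t in (0, 250, 999):
    print(f"t={t}: signal weight {np.sqrt(alpha_bar[t]):.3f}")
```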
Normalizing flows are a family of generative models that transform a simple base distribution (such as a standard Gaussian) into a complex target distribution through a sequence of invertible, differentiable mappings. Because each transformation is invertible, normalizing flows allow exact computation of the data likelihood, unlike VAEs (which optimize a lower bound) and GANs (which have no explicit density).
Examples of normalizing flow architectures include RealNVP, Glow, and Neural Spline Flows. Normalizing flows are used for density estimation, variational inference, and data generation. Their main limitation is that each transformation must be invertible, which constrains the model architecture and can limit expressiveness compared to GANs or diffusion models.
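The exact-likelihood property follows from the change-of-variables formula, sketched below for a single affine transformation (real flows such as RealNVP and Glow stack many invertible layers; this one-layer example just makes the bookkeeping concrete):

```python
# Exact log-likelihood under one affine flow x = f(z) = mu + sigma * z:
# log p_x(x) = log p_z(f^{-1}(x)) + log |d f^{-1} / d x|.
import numpy as np

mu, sigma = 3.0, 2.0

def log_prob_x(x):
    z = (x - mu) / sigma                          # inverse transform f^{-1}
    log_pz = -0.5 * (z**2 + np.log(2 * np.pi))    # standard normal base density
    log_det = -np.log(sigma)                      # log |dz/dx|
    return log_pz + log_det

print(log_prob_x(np.array([1.0, 3.0, 5.0])).round(3))
```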
Self-supervised learning (SSL) has emerged as a powerful paradigm that bridges unsupervised and supervised learning. In self-supervised learning, the model generates its own supervisory signal from the structure of the unlabeled data itself. For example, a model might be trained to predict a missing portion of its input (such as a masked word in a sentence or a masked patch in an image), using the surrounding context as the learning signal.
Self-supervised learning is technically a form of unsupervised learning because it does not require human-provided labels. However, it differs from traditional unsupervised methods like clustering or dimensionality reduction in that the training objective resembles a supervised task (the model predicts a target derived from the input). Some researchers, including Yann LeCun, have argued that self-supervised learning should be distinguished from classical unsupervised learning, while others consider it a modern evolution of the same concept.
Contrastive learning methods train models to produce similar representations for related data points (positive pairs) and dissimilar representations for unrelated data points (negative pairs).
SimCLR, introduced by Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton in 2020, demonstrated that a simple framework combining data augmentation, a neural network encoder, and a contrastive loss function could learn visual representations competitive with supervised methods.
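A simplified contrastive (NT-Xent-style) loss in PyTorch, in the spirit of SimCLR: two augmented views of each image form a positive pair, and every other embedding in the batch serves as a negative. The batch layout and temperature are illustrative.

```python
# A simplified NT-Xent loss over two views of the same batch of images.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (N, d) embeddings of two augmented views of the same N images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)    # (2N, d), unit norm
    sim = z @ z.T / temperature                            # cosine similarities
    mask = torch.eye(z.shape[0], dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))             # exclude self-pairs
    N = z1.shape[0]
    # The positive for row i is row i+N (and vice versa).
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)          # stand-in embeddings
print("NT-Xent loss:", nt_xent(z1, z2).item())
```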
MoCo (Momentum Contrast), developed by Kaiming He and colleagues at Facebook AI Research (now Meta AI), uses a momentum-updated encoder and a queue of negative samples to enable contrastive learning with large effective batch sizes.
BYOL (Bootstrap Your Own Latent), introduced by Jean-Bastien Grill and colleagues in 2020, showed that contrastive learning could succeed without negative pairs at all. BYOL uses two networks (an online network and a target network updated via exponential moving average) and trains the online network to predict the target network's representation of an augmented view of the same image.
Masked prediction methods train models to reconstruct missing parts of the input.
In natural language processing, BERT (Bidirectional Encoder Representations from Transformers) introduced masked language modeling, where random tokens in a sentence are masked and the model learns to predict them from context. This approach has become the foundation for modern NLP pre-training.
In computer vision, Masked Autoencoders (MAE), introduced by Kaiming He and colleagues in 2022, apply a similar principle: random patches of an image are masked, and the model learns to reconstruct the missing patches. MAE ViT-Huge achieved 87.8% top-1 accuracy on ImageNet, demonstrating that masked prediction can match or surpass supervised learning baselines.
DINO and DINOv2, developed by Meta AI, combine self-distillation with vision transformers to learn visual features without any labels. DINOv2 has shown remarkable generalizability across a wide range of visual tasks without fine-tuning.
Self-supervised learning is commonly used as a pre-training step. A model is first trained on a large unlabeled dataset using a self-supervised objective, and then fine-tuned on a smaller labeled dataset for a specific downstream task. This approach, known as transfer learning, has proven highly effective in both NLP and computer vision, enabling models to achieve strong performance even with limited labeled data.
Recent research has highlighted that the choice of data augmentation strategy is often more important than the specific self-supervised paradigm used, and that scaling to larger models and datasets consistently improves representation quality.
Unsupervised machine learning has numerous applications across a wide range of fields.
Businesses use clustering algorithms to group customers based on purchasing behavior, demographics, browsing patterns, and other features. These segments can then inform targeted marketing campaigns, personalized recommendations, and pricing strategies. For example, an e-commerce company might use K-means clustering to identify groups of customers with similar buying habits and tailor promotions accordingly. Netflix uses unsupervised algorithms in its recommendation engine by analyzing viewing history, search trends, and ratings to identify hidden patterns in user preferences.
Unsupervised methods are widely used to learn useful representations of data that can be used as features for downstream tasks. Autoencoders, self-supervised models, and other unsupervised techniques can discover meaningful features without labels, which is especially valuable when labeled data is limited. Pre-trained word embeddings (such as Word2Vec and GloVe) are a classic example of unsupervised feature learning in NLP.
Dimensionality reduction techniques like PCA, t-SNE, and UMAP are essential tools for exploratory data analysis. They allow researchers to visualize high-dimensional data in two or three dimensions, revealing clusters, trends, and outliers that would be invisible in the raw data. This is particularly important in fields like genomics, where datasets may have tens of thousands of dimensions.
Dimensionality reduction and autoencoders can be used to compress data by learning compact representations that preserve the essential information. This is useful for reducing storage requirements, speeding up computation, and transmitting data more efficiently. Image compression using autoencoders, for instance, can produce smaller file sizes than traditional methods while maintaining perceptual quality.
Unsupervised learning is used to identify topics and themes from unstructured text data through techniques like Latent Dirichlet Allocation (LDA) and non-negative matrix factorization (NMF). Topic models can automatically organize large document collections, summarize content, and discover hidden thematic structures. Word embeddings such as Word2Vec, GloVe, and FastText learn distributed representations of words from large text corpora in a fully unsupervised manner.
Unsupervised learning can be applied to recognize objects, scenes, and events captured in images and videos. Clustering of image features enables content-based image retrieval and automatic tagging. Generative models trained in an unsupervised manner can produce realistic synthetic images for data augmentation and creative applications. Self-supervised pre-training on large image datasets has become the standard approach for learning visual representations.
Unsupervised learning is employed to detect unusual patterns in data that could indicate anomalies or outliers. In financial services, unsupervised anomaly detection systems flag potentially fraudulent transactions by identifying patterns inconsistent with normal behavior. In cybersecurity, these methods detect network intrusions and unusual system activity.
Clustering algorithms are used to group genes with similar expression patterns, identify disease subtypes, and discover biomarkers. Dimensionality reduction techniques help visualize complex biological datasets, such as single-cell RNA sequencing data, where each cell may be characterized by thousands of gene expression measurements. Unsupervised methods are also used for drug discovery, where generative models such as VAEs explore chemical space to propose novel molecular structures.
ICA and related techniques are used in biomedical signal processing to separate independent source signals from recorded mixtures. Applications include removing artifacts from EEG recordings, separating fetal heartbeat signals from maternal recordings, and enhancing speech signals in noisy environments.
The following table compares unsupervised learning with supervised and self-supervised learning across several key dimensions.
| Aspect | Supervised learning | Unsupervised learning | Self-supervised learning |
|---|---|---|---|
| Training data | Labeled (input-output pairs) | Unlabeled | Unlabeled (labels derived from data) |
| Human annotation required | Yes | No | No |
| Learning objective | Predict known targets | Discover structure or patterns | Predict derived targets (e.g., masked tokens) |
| Typical tasks | Classification, regression | Clustering, dimensionality reduction, density estimation | Pre-training for downstream tasks |
| Evaluation | Direct (compare predictions to labels) | Indirect (silhouette score, visual inspection, downstream performance) | Indirect (downstream task performance) |
| Examples | Linear regression, decision trees, neural networks | K-means, PCA, GANs | BERT, SimCLR, MAE |
| Data requirements | Needs labeled data (expensive) | Works with raw, unlabeled data | Works with raw, unlabeled data |
| Scalability with data | Limited by labeling cost | Scales easily with more data | Scales easily with more data |
Evaluating unsupervised learning is fundamentally more difficult than evaluating supervised learning, because there are no ground-truth labels against which to measure performance. Several approaches and metrics are used, each with limitations.
Internal metrics assess the quality of clustering or dimensionality reduction using only the data itself, without reference to external labels. Common examples include the silhouette coefficient (how well each point fits its own cluster relative to the nearest other cluster), the Davies-Bouldin index, and the Calinski-Harabasz index.
When ground-truth labels are available (for example, in benchmarking studies), external metrics can be used:

- Adjusted Rand Index (ARI): measures agreement between the discovered clusters and the true labels, corrected for chance.
- Normalized Mutual Information (NMI): quantifies the information shared between the discovered clustering and the true labeling.
- Purity: the fraction of points assigned to the majority true class of their cluster.
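A small illustration of the internal/external distinction with scikit-learn (synthetic data, so ground-truth labels are available for the external metric):

```python
# Silhouette needs no labels; adjusted Rand compares against ground truth.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("Silhouette (internal):", round(float(silhouette_score(X, labels)), 3))
print("Adjusted Rand (external):", round(float(adjusted_rand_score(y_true, labels)), 3))
```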
For generative models, evaluation is particularly complex because assessing the quality of generated samples involves multiple dimensions:

- Fidelity: whether individual samples are realistic; for images, this is often measured with the Fréchet Inception Distance (FID) or the Inception Score.
- Diversity: whether the model covers the full variety of the data distribution rather than collapsing onto a few modes.
- Likelihood: for models with tractable densities (such as normalizing flows), held-out log-likelihood can be reported directly.
Beyond the choice of metric, several practical challenges affect the evaluation of unsupervised learning:

- There is often no single correct answer; different clusterings or embeddings of the same data can all be reasonable.
- Internal metrics encode their own assumptions (for example, favoring compact, convex clusters) and can disagree with one another.
- Results are frequently sensitive to hyperparameters and random initialization, so multiple runs are needed for reliable comparisons.
- In practice, quality is often judged by performance on a downstream task, which measures usefulness rather than intrinsic correctness.
The following table summarizes major unsupervised learning algorithms organized by category.
| Category | Algorithm | Key parameters | Strengths | Limitations |
|---|---|---|---|---|
| Clustering | K-Means | Number of clusters (k) | Simple, fast, scalable | Assumes spherical clusters; requires k in advance |
| Clustering | DBSCAN | Epsilon, MinPts | Finds arbitrary-shaped clusters; handles noise | Sensitive to epsilon; struggles with varying densities |
| Clustering | Hierarchical Clustering | Linkage method, distance metric | No need to predefine k; produces dendrogram | O(n^2) or higher; not ideal for large datasets |
| Clustering | GMM | Number of components | Soft assignments; models elliptical clusters | Assumes Gaussian distributions; sensitive to initialization |
| Clustering | Mean Shift | Bandwidth | No need to predefine k; finds arbitrarily shaped clusters | Computationally expensive; bandwidth selection is critical |
| Clustering | Spectral Clustering | Number of clusters, affinity | Handles non-convex clusters via graph Laplacian | Expensive for large datasets; requires k |
| Dimensionality Reduction | PCA | Number of components | Fast, interpretable, well-understood | Linear only; may miss nonlinear structure |
| Dimensionality Reduction | t-SNE | Perplexity, learning rate | Excellent for 2D/3D visualization | Slow on large data; does not preserve global structure well |
| Dimensionality Reduction | UMAP | Number of neighbors, min_dist | Fast; preserves local and global structure | Hyperparameter-sensitive; less interpretable than PCA |
| Dimensionality Reduction | Autoencoder | Architecture, latent dimension | Captures nonlinear relationships; flexible | Requires neural network training; harder to interpret |
| Dimensionality Reduction | ICA | Number of components | Separates independent sources; useful for BSS | Assumes statistical independence; linear |
| Dimensionality Reduction | SOM | Grid size, learning rate, neighborhood radius | Preserves topology; visual 2D mapping | Requires grid size choice; not suited for very high-dimensional data |
| Density Estimation | KDE | Bandwidth, kernel function | Non-parametric; flexible | Slow in high dimensions; bandwidth-sensitive |
| Density Estimation | GMM | Number of components | Parametric; provides cluster probabilities | Assumes Gaussian components |
| Anomaly Detection | Isolation Forest | Number of trees, contamination | Fast; handles high dimensions | Less effective at detecting local anomalies |
| Anomaly Detection | Local Outlier Factor | Number of neighbors | Detects local anomalies | Sensitive to parameter choice; slow on large data |
| Anomaly Detection | One-Class SVM | Kernel, nu | Effective in high-dimensional spaces | Computationally expensive; kernel choice affects results |
| Association Rules | Apriori | Min support, min confidence | Intuitive; well-established | Slow on large datasets; many candidate itemsets |
| Association Rules | FP-Growth | Min support | Faster than Apriori; no candidate generation | Memory-intensive for very large FP-trees |
| Topic Modeling | LDA | Number of topics, alpha, beta | Interpretable topics; principled probabilistic model | Requires specifying topic count; bag-of-words assumption |
| Topic Modeling | NMF | Number of components | Non-negative factors; interpretable | Sensitive to initialization; no probabilistic interpretation |
| Generative | VAE | Latent dimension, architecture | Smooth latent space; stable training | Blurry outputs; limited expressiveness |
| Generative | GAN | Architecture, learning rates | High-quality samples; versatile | Training instability; mode collapse |
| Generative | Diffusion Model | Noise schedule, number of steps | State-of-the-art image quality; stable training | Slow inference; high compute requirements |
| Generative | Normalizing Flow | Number of layers, coupling design | Exact likelihood; invertible | Constrained architecture; limited expressiveness |
Imagine you have a big box of colorful building blocks, but nobody told you what to build or how the blocks should be sorted. You start looking at the blocks and notice some are red and round, some are blue and square, and some are green and long. You decide to put similar-looking blocks into piles: all the red round ones together, all the blue square ones together, and so on. Nobody told you the names of these groups or which block goes where. You figured it all out by yourself, just by looking at the blocks.
That is basically what unsupervised machine learning does. A computer looks at a huge pile of data, with no labels or instructions, and tries to find patterns and groups on its own. It might discover that some data points are very similar to each other and put them in a group, or it might figure out a simpler way to describe complicated data. The computer does not know if it is right or wrong because nobody gave it an answer key. It just does its best to make sense of the information.