See also: Machine learning terms
Unsupervised machine learning or unsupervised training is a type of machine learning in which the model is trained using unlabeled data. Unlike supervised machine learning, where the training set includes both input data and corresponding output labels, unsupervised learning aims to recognize patterns, structures, or relationships in data without prior knowledge about their labels or categories. The algorithm receives no explicit feedback on whether its discoveries are correct; instead, it must identify regularities in the data on its own.
Unsupervised learning is particularly valuable when labeled data is scarce, expensive to obtain, or simply unavailable. In many real-world scenarios, manual labeling is prohibitively time-consuming or costly, or outright impossible. For example, a biologist studying gene expression data across thousands of samples may have no predefined categories for the data, making unsupervised methods the natural starting point for analysis. Similarly, companies analyzing millions of customer transactions often rely on unsupervised techniques to discover natural groupings before any domain-specific labels have been assigned.
At its core, unsupervised learning involves giving a model a collection of data points and asking it to discover structure or relationships within them. Because there are no labels to compare against, the model receives no feedback on the accuracy of what it finds. This characteristic makes unsupervised learning both powerful and challenging: it can reveal genuinely novel structure in data, but evaluating the quality of its results is inherently more difficult than in supervised settings.
Unsupervised machine learning encompasses several major families of techniques, each addressing a different aspect of discovering structure in unlabeled data. The primary categories are clustering, dimensionality reduction, density estimation, anomaly detection, association rule mining, topic modeling, and generative modeling.
Clustering is an unsupervised learning technique used to group similar data points together. The objective of clustering is to discover natural groupings within the data such that data points within the same cluster are more similar to each other than they are to data points in other clusters. Clustering can be beneficial for tasks such as customer segmentation, anomaly detection, image segmentation, document organization, and biological taxonomy.
Several families of clustering algorithms exist, each making different assumptions about the shape, size, and density of clusters.
K-means is one of the most widely used clustering algorithms due to its simplicity and efficiency. The algorithm partitions a dataset into k clusters, where k is specified in advance by the user. It works by iteratively assigning each data point to the nearest cluster centroid and then recomputing the centroids as the mean of all points assigned to each cluster. This process repeats until the assignments stabilize or a maximum number of iterations is reached.
K-means minimizes the within-cluster sum of squares (WCSS), also known as inertia. It works best when clusters are roughly spherical and of similar size. The algorithm has a time complexity of O(n * k * d * i), where n is the number of data points, k is the number of clusters, d is the number of dimensions, and i is the number of iterations.
Variants such as K-means++ improve initialization to avoid poor convergence, while Mini-Batch K-means processes random subsets of data for faster execution on large datasets. K-medoids (PAM) is a related algorithm that uses actual data points as cluster centers rather than computed means, making it more robust to outliers.
Limitations of K-means include its sensitivity to the initial placement of centroids, the requirement to specify k beforehand, and its assumption that clusters are convex and isotropic. It also struggles with clusters of varying sizes or densities.
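As a minimal sketch of the procedure described above (assuming scikit-learn is available; the synthetic data and parameter choices are illustrative):

```python
# A minimal K-means sketch using scikit-learn's KMeans on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points drawn around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k-means++ initialization (the scikit-learn default) mitigates poor
# centroid placement; n_init restarts guard against bad local optima.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Centroids:\n", kmeans.cluster_centers_)
print("WCSS (inertia):", kmeans.inertia_)
```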
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm introduced by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996. Unlike K-means, DBSCAN does not require the user to specify the number of clusters in advance. Instead, it identifies clusters as regions of high density separated by regions of low density.
DBSCAN uses two parameters: epsilon (the radius of a neighborhood around a data point) and MinPts (the minimum number of points required to form a dense region). Points that have at least MinPts neighbors within their epsilon-radius are classified as core points. Points reachable from core points but without enough neighbors of their own are border points. Points that are neither core nor border points are classified as noise.
DBSCAN's strengths include its ability to discover clusters of arbitrary shape, its robustness to outliers, and the fact that it does not require a predefined number of clusters. However, it can struggle when clusters have significantly different densities, since a single set of epsilon and MinPts values may not suit all clusters. HDBSCAN (Hierarchical DBSCAN) addresses this limitation by allowing the density threshold to vary.
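A brief sketch with scikit-learn (eps and min_samples correspond to the epsilon and MinPts parameters described above; the two-moons data is illustrative):

```python
# DBSCAN on two interleaved half-moons, a shape K-means cannot separate.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Label -1 marks noise points; other labels index the discovered clusters.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("Clusters found:", n_clusters, "| noise points:", list(db.labels_).count(-1))
```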
Hierarchical clustering builds a tree-like structure (called a dendrogram) of nested clusters. There are two main approaches:

- Agglomerative (bottom-up): each data point starts in its own cluster, and the two closest clusters are repeatedly merged until a single cluster remains.
- Divisive (top-down): all data points start in a single cluster, which is recursively split into smaller clusters.
The choice of linkage criterion determines how the distance between clusters is calculated. Common linkage methods include single linkage (minimum distance between any pair of points), complete linkage (maximum distance), average linkage (mean pairwise distance), and Ward's method (minimizes the increase in total within-cluster variance).
Hierarchical clustering does not require specifying the number of clusters in advance, and the dendrogram provides an intuitive visualization of the data's hierarchical structure. However, it has a higher computational cost, typically O(n^2) or worse for agglomerative methods, making it less suitable for very large datasets.
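A short sketch using SciPy (the dataset is synthetic): the linkage matrix encodes the dendrogram, and fcluster cuts it into a chosen number of flat clusters.

```python
# Agglomerative clustering with SciPy using Ward's method.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

Z = linkage(X, method="ward")                    # Ward: minimize variance increase
labels = fcluster(Z, t=3, criterion="maxclust")  # cut dendrogram into 3 clusters
print("Cluster sizes:", np.bincount(labels)[1:])
```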
Gaussian Mixture Models (GMMs) represent a probabilistic approach to clustering. A GMM assumes that the data is generated from a mixture of several Gaussian (normal) distributions, each representing a cluster. The model estimates the parameters (mean, covariance, and mixing coefficient) of each Gaussian component using the Expectation-Maximization (EM) algorithm.
Unlike K-means, which assigns each point to exactly one cluster (hard assignment), GMMs provide soft assignments: each data point receives a probability of belonging to each cluster. This makes GMMs more flexible, as they can model elliptical clusters of varying sizes and orientations. GMMs are also the foundation for more advanced techniques in speech recognition, image segmentation, and density estimation.
The number of components in a GMM is typically selected using model selection criteria such as the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC).
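The sketch below (scikit-learn; synthetic data) fits GMMs with one to six components and selects the count by BIC, then shows the soft assignments described above:

```python
# GMM fitting with BIC-based model selection.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=7)

models = [GaussianMixture(n_components=k, random_state=7).fit(X)
          for k in range(1, 7)]
best = min(models, key=lambda m: m.bic(X))       # lower BIC is better
print("Components chosen by BIC:", best.n_components)

# Soft assignment: a probability of membership in each cluster, per point.
print("P(cluster | first point):", best.predict_proba(X[:1]).round(3))
```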
Mean shift is a non-parametric clustering algorithm that does not require specifying the number of clusters in advance. It works by iteratively shifting each data point toward the mode (region of highest density) of the local data distribution. The bandwidth parameter controls the size of the region considered for each shift. Mean shift can find clusters of arbitrary shape, but it is computationally expensive for large datasets.
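A minimal mean shift sketch with scikit-learn (the quantile used to estimate the bandwidth is an illustrative choice, not a universal default):

```python
# Mean shift clustering; the bandwidth is estimated from the data.
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

bw = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bw).fit(X)
print("Clusters found:", len(ms.cluster_centers_))
```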
Spectral clustering uses the eigenvalues of a similarity (affinity) matrix to reduce the dimensionality of the data before applying a standard clustering method like K-means. By constructing a graph where edges represent similarities between data points and computing the graph Laplacian, spectral clustering can identify non-convex clusters that K-means would miss. It is particularly useful for image segmentation and community detection in networks.
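The sketch below shows spectral clustering separating two interleaved half-moons, a non-convex shape that defeats K-means (scikit-learn; illustrative data):

```python
# Spectral clustering with a nearest-neighbors similarity graph.
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors", random_state=0)
labels = sc.fit_predict(X)
print("Points per cluster:", [(labels == c).sum() for c in (0, 1)])
```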
Dimensionality reduction is an unsupervised learning technique used to reduce the number of features in data. The objective is to simplify the information while maintaining as much meaningful structure as possible. High-dimensional data is common in fields such as genomics, natural language processing, and computer vision, where datasets may have thousands or even millions of features. Dimensionality reduction can be beneficial for tasks such as data visualization, noise reduction, feature extraction, and as a preprocessing step for other machine learning algorithms.
Principal Component Analysis (PCA) is one of the oldest and most widely used dimensionality reduction techniques. Introduced by Karl Pearson in 1901 and further developed by Harold Hotelling in the 1930s, PCA works by projecting data onto a lower-dimensional subspace that captures the maximum amount of variance.
PCA computes the eigenvectors and eigenvalues of the data's covariance matrix. The eigenvectors (principal components) define the directions of maximum variance, and the eigenvalues indicate the amount of variance captured along each direction. By selecting only the top k principal components, PCA reduces the dimensionality of the data while retaining as much information as possible.
PCA is a linear method, meaning it can only capture linear relationships between features. It is computationally efficient, interpretable (each principal component is a linear combination of the original features), and widely used for preprocessing, visualization, and noise reduction. Kernel PCA extends PCA to capture nonlinear relationships by mapping data into a higher-dimensional space before applying PCA.
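A brief PCA sketch with scikit-learn on the 64-dimensional digits dataset, inspecting how much variance each principal component captures:

```python
# PCA: project 64-dimensional digit images onto the top 10 components.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # 1797 samples, 64 features

pca = PCA(n_components=10).fit(X)
print("Variance explained per component:", pca.explained_variance_ratio_.round(3))
print("Total variance retained:", round(float(pca.explained_variance_ratio_.sum()), 3))

X_reduced = pca.transform(X)                 # shape: (1797, 10)
```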
t-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique developed by Laurens van der Maaten and Geoffrey Hinton in 2008. It is primarily used for visualization of high-dimensional data in two or three dimensions.
t-SNE works by converting high-dimensional pairwise distances into conditional probabilities that represent similarities. It then finds a low-dimensional embedding that minimizes the Kullback-Leibler divergence between the high-dimensional and low-dimensional probability distributions. The use of a Student t-distribution in the low-dimensional space (rather than a Gaussian) helps alleviate the "crowding problem" that affects earlier methods like Stochastic Neighbor Embedding.
t-SNE excels at preserving local structure and revealing clusters in data, making it popular for visualizing embeddings from neural networks, gene expression data, and other high-dimensional datasets. However, it is computationally expensive (O(n^2) in its basic form), does not preserve global structure reliably, and the results can vary across different runs due to random initialization. The perplexity hyperparameter, which roughly controls the balance between local and global structure, must be carefully tuned.
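A minimal t-SNE sketch with scikit-learn (perplexity and the random seed are the illustrative knobs noted above):

```python
# t-SNE embedding of the digits dataset into 2D for visualization.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print("Embedding shape:", emb.shape)         # (1797, 2), ready for a scatter plot
```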
Uniform Manifold Approximation and Projection (UMAP), introduced by Leland McInnes, John Healy, and James Melville in 2018, is a nonlinear dimensionality reduction technique grounded in Riemannian geometry and algebraic topology. UMAP constructs a high-dimensional graph representation of the data and then optimizes a low-dimensional layout that preserves the topological structure.
UMAP offers several advantages over t-SNE. It is significantly faster, especially on large datasets, and scales better to millions of data points. UMAP also tends to preserve both local and global structure more effectively than t-SNE, producing embeddings where the relative distances between clusters are more meaningful. It supports arbitrary embedding dimensions (not just 2D or 3D) and can be used as a general-purpose dimensionality reduction technique, not only for visualization.
UMAP has become widely adopted in bioinformatics (particularly single-cell RNA sequencing analysis), natural language processing, and other domains where high-dimensional data visualization is critical.
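A short sketch using the third-party umap-learn package (installable via pip install umap-learn; parameter values are illustrative):

```python
# UMAP embedding; n_neighbors balances local vs. global structure,
# min_dist controls how tightly points pack in the embedding.
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
emb = reducer.fit_transform(X)
print("Embedding shape:", emb.shape)
```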
Autoencoders are a class of neural networks trained to reconstruct their input through a bottleneck layer, effectively learning a compressed representation of the data. An autoencoder consists of two parts: an encoder that maps the input to a lower-dimensional latent space, and a decoder that reconstructs the input from the latent representation. The network is trained to minimize the reconstruction error between the input and the output.
Because autoencoders learn nonlinear transformations, they can capture more complex relationships than PCA. Several important variants exist:

- Denoising autoencoders, trained to reconstruct clean inputs from corrupted versions, which forces the network to learn robust features.
- Sparse autoencoders, which add a sparsity penalty on the latent activations so that only a few units are active for any given input.
- Contractive autoencoders, which penalize the sensitivity of the latent representation to small perturbations of the input.
- Variational autoencoders, which impose a probabilistic structure on the latent space (discussed under generative models below).
Autoencoders are used for dimensionality reduction, feature learning, denoising, and as building blocks for more complex generative models such as variational autoencoders.
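A minimal PyTorch sketch of the encode-bottleneck-decode-reconstruct cycle (trained here on random stand-in data purely to show the mechanics):

```python
# A minimal autoencoder: 64 -> 8 -> 64, trained to minimize MSE reconstruction error.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 8))  # 8-dim bottleneck
decoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 64))

model = nn.Sequential(encoder, decoder)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.randn(256, 64)               # stand-in for real 64-feature data
for step in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), X)        # reconstruction error
    loss.backward()
    opt.step()

codes = encoder(X)                     # the learned 8-dim representation
print("Latent codes shape:", codes.shape)
```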
Independent Component Analysis (ICA) is a computational technique for separating a multivariate signal into additive, statistically independent components. ICA was originally developed for the "cocktail party problem," where multiple audio sources (such as several people speaking simultaneously in a room) are recorded by multiple microphones, and the goal is to recover the individual source signals from the mixed recordings.
Unlike PCA, which finds components that are uncorrelated and ordered by variance explained, ICA finds components that are statistically independent and typically non-Gaussian. ICA assumes that the observed signals are linear mixtures of the independent source signals. The FastICA algorithm, developed by Aapo Hyvärinen and Erkki Oja, is one of the most widely used ICA implementations and uses negentropy as its cost function.
ICA is applied in biomedical signal processing (separating brain signals in EEG and MEG recordings), telecommunications, image processing, and financial data analysis. It is a linear method and assumes that the mixing process is linear and that the source signals are non-Gaussian and statistically independent.
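A toy blind-source-separation sketch with scikit-learn's FastICA, in the spirit of the cocktail party problem (the sources and mixing matrix are synthetic):

```python
# FastICA: recover two independent source signals from linear mixtures.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                         # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))                # source 2: square wave (non-Gaussian)
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5], [0.5, 1.0]])     # mixing matrix
X = S @ A.T                                # observed mixtures ("microphones")

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)               # recovered sources (up to order and scale)
print("Recovered shape:", S_est.shape)
```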
Self-Organizing Maps (SOMs), also known as Kohonen maps, are a type of artificial neural network introduced by Finnish professor Teuvo Kohonen in the 1980s. SOMs produce a low-dimensional (typically two-dimensional) discrete representation of the input space, preserving the topological properties of the data.
Unlike most neural networks, SOMs are trained using competitive learning rather than backpropagation. The training process works as follows:

1. The weight vectors of the grid nodes are initialized, randomly or by sampling from the data.
2. A data point is selected, and the node whose weight vector is closest to it is identified as the best matching unit (BMU).
3. The BMU and its neighbors on the grid are moved toward the data point, with the adjustment shrinking with grid distance from the BMU.
4. The learning rate and neighborhood radius decay over time, and steps 2-3 repeat until the map stabilizes.
SOMs are useful for visualizing high-dimensional data on a 2D grid while preserving topological relationships. They have been applied in document clustering, financial analysis, bioinformatics, and industrial process monitoring. A key advantage of SOMs over methods like K-means is that they preserve the neighborhood structure of the data, so similar clusters appear adjacent on the map.
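A from-scratch NumPy sketch implementing the training steps above (the grid size, schedules, and color data are illustrative choices, not a canonical configuration):

```python
# A minimal SOM: a 10x10 grid of 3-dim prototypes trained on random RGB colors.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 3))                 # e.g., RGB colors in [0, 1]
grid_w, grid_h, n_iter = 10, 10, 2000
W = rng.random((grid_w, grid_h, 3))         # step 1: random weight initialization

# Grid coordinates, used to compute neighborhood distances on the map.
gx, gy = np.meshgrid(np.arange(grid_w), np.arange(grid_h), indexing="ij")

for t in range(n_iter):
    lr = 0.5 * np.exp(-t / n_iter)                  # decaying learning rate
    radius = (grid_w / 2) * np.exp(-t / n_iter)     # decaying neighborhood radius

    x = data[rng.integers(len(data))]               # step 2: pick a random input
    dists = np.linalg.norm(W - x, axis=2)
    bi, bj = np.unravel_index(dists.argmin(), dists.shape)  # best matching unit

    # Step 3: pull the BMU and its grid neighbors toward the input.
    grid_dist2 = (gx - bi) ** 2 + (gy - bj) ** 2
    influence = np.exp(-grid_dist2 / (2 * radius ** 2))
    W += lr * influence[..., None] * (x - W)

print("Trained map shape:", W.shape)        # (10, 10, 3): a 2D map of prototypes
```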
Density estimation is the task of estimating the probability density function that generated a set of observed data points. It is a fundamental problem in unsupervised learning because understanding the data distribution enables many downstream tasks, including anomaly detection, data generation, and clustering.
Parametric density estimation assumes a specific functional form for the distribution (such as a Gaussian) and estimates its parameters from the data. Gaussian Mixture Models are a common parametric approach. Non-parametric density estimation makes fewer assumptions about the distribution shape. Kernel Density Estimation (KDE) is a widely used non-parametric method that places a kernel function (often a Gaussian) at each data point and sums them to produce a smooth density estimate. The bandwidth parameter controls the smoothness of the resulting estimate.
Density estimation serves as a building block for many unsupervised learning tasks. DBSCAN, for example, implicitly relies on density estimation to define clusters, and anomaly detectors frequently flag points in low-density regions.
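A brief KDE sketch with scikit-learn, placing a Gaussian kernel at each point of a bimodal sample (bandwidth and data are illustrative):

```python
# Kernel density estimation on a two-mode 1D sample.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.5, 300),
                    rng.normal(2, 1.0, 300)]).reshape(-1, 1)

kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(X)

grid = np.linspace(-5, 5, 5).reshape(-1, 1)
log_density = kde.score_samples(grid)       # log p(x) at each grid point
print(np.exp(log_density).round(3))
```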
Anomaly detection (also called outlier detection) is the task of identifying data points that deviate significantly from the expected pattern. In the unsupervised setting, no labels indicating which data points are anomalies are available; the algorithm must determine what constitutes "normal" behavior from the data alone.
Unsupervised anomaly detection is used in fraud detection, network intrusion detection, manufacturing quality control, medical diagnostics, and system monitoring.
The Isolation Forest algorithm, proposed by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in 2008, detects anomalies by exploiting the observation that anomalous points are easier to isolate than normal points. The algorithm builds an ensemble of random isolation trees by recursively selecting random features and random split values. Anomalies, being few and different, tend to be isolated in fewer splits (shorter path lengths) than normal points.
Isolation Forest is efficient, with a time complexity of O(n * log(n)), and performs well on high-dimensional data. It does not require computing distances between data points, making it faster than many density-based methods.
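A short sketch with scikit-learn, fitting on mostly-normal data with a few injected outliers (the contamination value is the assumed outlier fraction, an illustrative choice):

```python
# Isolation Forest: anomalies get short average path lengths and are flagged -1.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, (300, 2))
outliers = rng.uniform(-6, 6, (10, 2))
X = np.vstack([normal, outliers])

iso = IsolationForest(n_estimators=100, contamination=0.03, random_state=0).fit(X)
pred = iso.predict(X)                  # +1 = normal, -1 = anomaly
print("Flagged anomalies:", (pred == -1).sum())
```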
The Local Outlier Factor (LOF) algorithm, introduced by Markus Breunig, Hans-Peter Kriegel, Raymond Ng, and Jorg Sander in 2000, detects anomalies based on the local density of data points relative to their neighbors. LOF computes a score for each point that reflects how isolated it is compared to its surrounding neighborhood. A LOF score significantly greater than 1 indicates a potential outlier.
LOF is effective at detecting local anomalies that may appear normal in a global context but are unusual compared to their neighbors. It handles datasets where the density varies across different regions.
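A minimal LOF sketch with scikit-learn on data whose density varies by region (the neighborhood size is an illustrative choice):

```python
# LOF: scores well above 1 mark points much less dense than their neighbors.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (200, 2)),    # dense cluster
               rng.normal(5, 2.0, (200, 2)),    # sparse cluster
               [[2.5, 2.5]]])                   # a point between the two

lof = LocalOutlierFactor(n_neighbors=20)
pred = lof.fit_predict(X)                       # +1 = inlier, -1 = outlier
print("Outliers flagged:", (pred == -1).sum())
print("LOF score of last point:", round(float(-lof.negative_outlier_factor_[-1]), 2))
```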
One-Class Support Vector Machine (One-Class SVM) learns a boundary that encompasses the majority of the data in a high-dimensional feature space. Data points falling outside this boundary are classified as anomalies. The algorithm maps data into a high-dimensional space using a kernel function and finds a hyperplane that separates the data from the origin with maximum margin.
One-Class SVM is effective for novelty detection, where the training set contains only normal examples and the goal is to identify new data points that differ from the training distribution.
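A novelty-detection sketch with scikit-learn: the model is trained only on normal data, then scores unseen points (nu bounds the fraction of margin violations; values are illustrative):

```python
# One-Class SVM for novelty detection with an RBF kernel.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
X_train = rng.normal(0, 1, (200, 2))            # normal data only

oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_new = np.array([[0.1, -0.2], [4.0, 4.0]])     # one typical, one novel point
print(oc_svm.predict(X_new))                    # +1 = normal, -1 = novelty
```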
Association rule mining is a technique for discovering relationships between variables in large datasets. It identifies rules of the form "if X then Y" (written X => Y), where X and Y are sets of items that frequently co-occur.
Three key metrics are used to evaluate association rules:

- Support: the fraction of transactions that contain both X and Y, measuring how frequently the rule applies.
- Confidence: the fraction of transactions containing X that also contain Y, an estimate of P(Y | X).
- Lift: the ratio of the rule's confidence to the support of Y; a lift greater than 1 indicates that X and Y co-occur more often than expected if they were independent.
The Apriori algorithm, introduced by Rakesh Agrawal and Ramakrishnan Srikant in 1994, is the foundational algorithm for association rule mining. It works by iteratively identifying frequent itemsets (sets of items that appear together above a minimum support threshold) and then generating rules from these itemsets.
The key insight of Apriori is the "Apriori principle": if an itemset is infrequent, all of its supersets must also be infrequent. This allows the algorithm to prune the search space significantly.
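A market-basket sketch using the third-party mlxtend package (pip install mlxtend; the transactions and thresholds are illustrative):

```python
# Frequent itemsets via Apriori, then rules filtered by confidence.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["bread", "milk"],
                ["bread", "diapers", "beer"],
                ["milk", "diapers", "beer"],
                ["bread", "milk", "diapers"],
                ["bread", "milk", "beer"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```

mlxtend's fpgrowth function is a drop-in replacement for apriori in this snippet, which is one way to try the FP-Growth approach described next.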
The FP-Growth (Frequent Pattern Growth) algorithm, proposed by Jiawei Han, Jian Pei, and Yiwen Yin in 2000, improves upon Apriori by avoiding repeated scans of the database. It compresses the dataset into a compact data structure called an FP-tree and then mines frequent patterns directly from the tree using a divide-and-conquer strategy.
FP-Growth is typically faster and more memory-efficient than Apriori, especially on large datasets.
Association rule mining is widely used in market basket analysis (discovering which products are frequently purchased together), web usage mining, bioinformatics, and recommendation systems.
Topic modeling is an unsupervised technique for discovering abstract "topics" that occur in a collection of documents. Each document is represented as a mixture of topics, and each topic is characterized by a distribution over words.
Latent Dirichlet Allocation (LDA), introduced by David Blei, Andrew Ng, and Michael Jordan in 2003, is the most widely used topic model. LDA assumes that each document is generated by first choosing a distribution over topics, and then for each word in the document, choosing a topic from that distribution and drawing a word from the topic's word distribution. The model is typically fit using variational inference or Gibbs sampling.
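A toy LDA sketch with scikit-learn (the four-document corpus and topic count are illustrative):

```python
# LDA on a tiny corpus: fit 2 topics and print the top words of each.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat",
        "dogs and cats make good pets",
        "stock markets fell sharply today",
        "investors fear rising interest rates"]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-3:][::-1]            # indices of the 3 heaviest words
    print(f"Topic {k}:", [vocab[i] for i in top])
```

Swapping in scikit-learn's NMF (typically with a TF-IDF matrix) gives the alternative factorization described below.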
Non-Negative Matrix Factorization (NMF) is an alternative approach to topic modeling that decomposes the document-term matrix into two non-negative matrices: one representing document-topic associations and another representing topic-word associations. Because all factors are non-negative, the resulting components are often more interpretable than those produced by methods that allow negative values (such as SVD).
Topic models are used for document organization, information retrieval, content recommendation, and exploratory analysis of large text corpora.
Generative models are a class of unsupervised learning algorithms that learn the underlying probability distribution of the training data in order to generate new samples that resemble the original data. Rather than simply discovering structure (as clustering or dimensionality reduction do), generative models explicitly or implicitly model the full data distribution.
Variational autoencoders (VAEs), introduced by Diederik Kingma and Max Welling in 2013, extend the standard autoencoder framework with a probabilistic approach. Instead of mapping each input to a single point in the latent space, VAEs map inputs to a probability distribution (typically a Gaussian). The model is trained to maximize a lower bound on the data likelihood (the evidence lower bound, or ELBO), which consists of a reconstruction term and a regularization term that encourages the latent space to follow a standard normal distribution.
VAEs produce a smooth, continuous latent space, which makes them suitable for interpolation between data points and controlled generation. They are used in image generation, drug discovery, music synthesis, and anomaly detection. However, VAEs tend to produce blurrier outputs compared to other generative models, because the reconstruction loss encourages averaging over plausible outputs.
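A compact PyTorch VAE sketch of the mechanics described above: the encoder outputs a mean and log-variance, a latent code is sampled via the reparameterization trick, and the loss combines reconstruction error with a KL term pulling the latent distribution toward N(0, I). The architecture and stand-in data are illustrative.

```python
# A minimal VAE trained on random stand-in data.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=64, z_dim=8):
        super().__init__()
        self.enc = nn.Linear(x_dim, 32)
        self.mu = nn.Linear(32, z_dim)
        self.logvar = nn.Linear(32, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, x_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, logvar

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
X = torch.randn(256, 64)                      # stand-in data

for step in range(200):
    recon, mu, logvar = model(X)
    recon_loss = F.mse_loss(recon, X, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
    loss = (recon_loss + kl) / len(X)         # negative ELBO (up to constants)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("Final loss:", loss.item())
```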
Generative Adversarial Networks (GANs), introduced by Ian Goodfellow and colleagues in 2014, consist of two neural networks trained in competition with each other. The generator network produces synthetic data samples, while the discriminator network attempts to distinguish between real and generated samples. Training proceeds as a minimax game: the generator improves by trying to fool the discriminator, and the discriminator improves by getting better at detecting fakes.
GANs are capable of generating highly realistic images, videos, and audio. Notable GAN variants include DCGAN (using convolutional architectures), StyleGAN (enabling control over visual features at different scales), CycleGAN (performing unpaired image-to-image translation), and Wasserstein GAN (using the Wasserstein distance for more stable training).
Training GANs can be difficult due to issues such as mode collapse (the generator produces limited variety), training instability, and the need to carefully balance the generator and discriminator. Despite these challenges, GANs have had a significant impact on creative applications, data augmentation, and super-resolution imaging.
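A minimal PyTorch sketch of the minimax game on 2D toy data; real GANs use convolutional architectures and many stabilization tricks omitted here, so this is only a skeleton of the alternating updates.

```python
# Alternating discriminator/generator updates on a toy 2D Gaussian target.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))   # generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_data = torch.randn(10000, 2) * 0.5 + torch.tensor([2.0, 2.0])  # target distribution

for step in range(1000):
    real = real_data[torch.randint(0, len(real_data), (128,))]
    fake = G(torch.randn(128, 2))

    # Discriminator step: push real toward 1, fake toward 0.
    d_loss = (bce(D(real), torch.ones(128, 1))
              + bce(D(fake.detach()), torch.zeros(128, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: fool the discriminator (fake toward 1).
    g_loss = bce(D(fake), torch.ones(128, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print("Mean of generated samples:", G(torch.randn(1000, 2)).mean(0).detach())
```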
Diffusion models, which gained prominence beginning around 2020, represent a newer class of generative models. They work through two processes: a forward diffusion process that gradually adds Gaussian noise to the data over a series of steps until the data becomes pure noise, and a reverse denoising process where a neural network learns to reverse each step, gradually reconstructing data from noise.
Diffusion models have achieved state-of-the-art results in image generation, surpassing GANs in both sample quality and diversity. Systems such as DALL-E 2, Stable Diffusion, and Midjourney are built on diffusion model architectures. The training process for diffusion models is more stable than GAN training, though inference (sample generation) is slower because it requires running the denoising process over many steps. Techniques such as DDIM (Denoising Diffusion Implicit Models) and latent diffusion (operating in a compressed latent space rather than pixel space) have been developed to speed up inference.
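The forward (noising) process can be written in a few lines: under the standard DDPM formulation, x_t is a closed-form interpolation between the data x_0 and pure Gaussian noise, governed by the cumulative schedule. The linear beta schedule below is an illustrative choice.

```python
# The closed-form forward diffusion step: x_t ~ q(x_t | x_0).
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # per-step noise schedule
alpha_bar = np.cumprod(1.0 - betas)           # cumulative signal retention

def q_sample(x0, t, rng=np.random.default_rng(0)):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

x0 = np.ones(4)                               # stand-in data point
for t in (0, 250, 999):
    print(f"t={t}: signal weight {np.sqrt(alpha_bar[t]):.3f}")
```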
Normalizing flows are a family of generative models that transform a simple base distribution (such as a standard Gaussian) into a complex target distribution through a sequence of invertible, differentiable mappings. Because each transformation is invertible, normalizing flows allow exact computation of the data likelihood, unlike VAEs (which optimize a lower bound) and GANs (which have no explicit density).
Examples of normalizing flow architectures include RealNVP, Glow, and Neural Spline Flows. Normalizing flows are used for density estimation, variational inference, and data generation. Their main limitation is that each transformation must be invertible, which constrains the model architecture and can limit expressiveness compared to GANs or diffusion models.
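The exact-likelihood property follows from the change-of-variables formula, sketched below for a single affine transformation (real flows such as RealNVP and Glow stack many invertible layers; this one-layer example just makes the bookkeeping concrete):

```python
# Exact log-likelihood under one affine flow x = f(z) = mu + sigma * z:
# log p_x(x) = log p_z(f^{-1}(x)) + log |d f^{-1} / d x|.
import numpy as np

mu, sigma = 3.0, 2.0

def log_prob_x(x):
    z = (x - mu) / sigma                          # inverse transform f^{-1}
    log_pz = -0.5 * (z**2 + np.log(2 * np.pi))    # standard normal base density
    log_det = -np.log(sigma)                      # log |dz/dx|
    return log_pz + log_det

print(log_prob_x(np.array([1.0, 3.0, 5.0])).round(3))
```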
Self-supervised learning (SSL) has emerged as a powerful paradigm that bridges unsupervised and supervised learning. In self-supervised learning, the model generates its own supervisory signal from the structure of the unlabeled data itself. For example, a model might be trained to predict a missing portion of its input (such as a masked word in a sentence or a masked patch in an image), using the surrounding context as the learning signal.
Self-supervised learning is technically a form of unsupervised learning because it does not require human-provided labels. However, it differs from traditional unsupervised methods like clustering or dimensionality reduction in that the training objective resembles a supervised task (the model predicts a target derived from the input). Some researchers, including Yann LeCun, have argued that self-supervised learning should be distinguished from classical unsupervised learning, while others consider it a modern evolution of the same concept.
Contrastive learning methods train models to produce similar representations for related data points (positive pairs) and dissimilar representations for unrelated data points (negative pairs).
SimCLR, introduced by Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton in 2020, demonstrated that a simple framework combining data augmentation, a neural network encoder, and a contrastive loss function could learn visual representations competitive with supervised methods.
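A simplified contrastive (NT-Xent-style) loss in PyTorch, in the spirit of SimCLR: two augmented views of each image form a positive pair, and every other embedding in the batch serves as a negative. The batch layout and temperature are illustrative.

```python
# A simplified NT-Xent loss over two views of the same batch of images.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (N, d) embeddings of two augmented views of the same N images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)    # (2N, d), unit norm
    sim = z @ z.T / temperature                            # cosine similarities
    mask = torch.eye(z.shape[0], dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))             # exclude self-pairs
    N = z1.shape[0]
    # The positive for row i is row i+N (and vice versa).
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)          # stand-in embeddings
print("NT-Xent loss:", nt_xent(z1, z2).item())
```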
MoCo (Momentum Contrast), developed by Kaiming He and colleagues at Facebook AI Research (now Meta AI), uses a momentum-updated encoder and a queue of negative samples to enable contrastive learning with large effective batch sizes.
BYOL (Bootstrap Your Own Latent), introduced by Jean-Bastien Grill and colleagues in 2020, showed that contrastive learning could succeed without negative pairs at all. BYOL uses two networks (an online network and a target network updated via exponential moving average) and trains the online network to predict the target network's representation of an augmented view of the same image.
Masked prediction methods train models to reconstruct missing parts of the input.
In natural language processing, BERT (Bidirectional Encoder Representations from Transformers) introduced masked language modeling, where random tokens in a sentence are masked and the model learns to predict them from context. This approach has become the foundation for modern NLP pre-training.
In computer vision, Masked Autoencoders (MAE), introduced by Kaiming He and colleagues in 2022, apply a similar principle: random patches of an image are masked, and the model learns to reconstruct the missing patches. MAE ViT-Huge achieved 87.8% top-1 accuracy on ImageNet, demonstrating that masked prediction can match or surpass supervised learning baselines.
DINO and DINOv2, developed by Meta AI, combine self-distillation with vision transformers to learn visual features without any labels. DINOv2 has shown remarkable generalizability across a wide range of visual tasks without fine-tuning.
Self-supervised learning is commonly used as a pre-training step. A model is first trained on a large unlabeled dataset using a self-supervised objective, and then fine-tuned on a smaller labeled dataset for a specific downstream task. This approach, known as transfer learning, has proven highly effective in both NLP and computer vision, enabling models to achieve strong performance even with limited labeled data.
Recent research has highlighted that the choice of data augmentation strategy is often more important than the specific self-supervised paradigm used, and that scaling to larger models and datasets consistently improves representation quality.
Unsupervised machine learning has numerous applications across a wide range of fields.
Businesses use clustering algorithms to group customers based on purchasing behavior, demographics, browsing patterns, and other features. These segments can then inform targeted marketing campaigns, personalized recommendations, and pricing strategies. For example, an e-commerce company might use K-means clustering to identify groups of customers with similar buying habits and tailor promotions accordingly. Netflix uses unsupervised algorithms in its recommendation engine by analyzing viewing history, search trends, and ratings to identify hidden patterns in user preferences.
Unsupervised methods are widely used to learn useful representations of data that can be used as features for downstream tasks. Autoencoders, self-supervised models, and other unsupervised techniques can discover meaningful features without labels, which is especially valuable when labeled data is limited. Pre-trained word embeddings (such as Word2Vec and GloVe) are a classic example of unsupervised feature learning in NLP.
Dimensionality reduction techniques like PCA, t-SNE, and UMAP are essential tools for exploratory data analysis. They allow researchers to visualize high-dimensional data in two or three dimensions, revealing clusters, trends, and outliers that would be invisible in the raw data. This is particularly important in fields like genomics, where datasets may have tens of thousands of dimensions.
Dimensionality reduction and autoencoders can be used to compress data by learning compact representations that preserve the essential information. This is useful for reducing storage requirements, speeding up computation, and transmitting data more efficiently. Image compression using autoencoders, for instance, can produce smaller file sizes than traditional methods while maintaining perceptual quality.
Unsupervised learning is used to identify topics and themes from unstructured text data through techniques like Latent Dirichlet Allocation (LDA) and non-negative matrix factorization (NMF). Topic models can automatically organize large document collections, summarize content, and discover hidden thematic structures. Word embeddings such as Word2Vec, GloVe, and FastText learn distributed representations of words from large text corpora in a fully unsupervised manner.
Unsupervised learning can be applied to recognize objects, scenes, and events captured in images and videos. Clustering of image features enables content-based image retrieval and automatic tagging. Generative models trained in an unsupervised manner can produce realistic synthetic images for data augmentation and creative applications. Self-supervised pre-training on large image datasets has become the standard approach for learning visual representations.
Unsupervised learning is employed to detect unusual patterns in data that could indicate anomalies or outliers. In financial services, unsupervised anomaly detection systems flag potentially fraudulent transactions by identifying patterns inconsistent with normal behavior. In cybersecurity, these methods detect network intrusions and unusual system activity.
Clustering algorithms are used to group genes with similar expression patterns, identify disease subtypes, and discover biomarkers. Dimensionality reduction techniques help visualize complex biological datasets, such as single-cell RNA sequencing data, where each cell may be characterized by thousands of gene expression measurements. Unsupervised methods are also used for drug discovery, where generative models such as VAEs explore chemical space to propose novel molecular structures.
ICA and related techniques are used in biomedical signal processing to separate independent source signals from recorded mixtures. Applications include removing artifacts from EEG recordings, separating fetal heartbeat signals from maternal recordings, and enhancing speech signals in noisy environments.
The following table compares unsupervised learning with supervised and self-supervised learning across several key dimensions.
| Aspect | Supervised learning | Unsupervised learning | Self-supervised learning |
|---|---|---|---|
| Training data | Labeled (input-output pairs) | Unlabeled | Unlabeled (labels derived from data) |
| Human annotation required | Yes | No | No |
| Learning objective | Predict known targets | Discover structure or patterns | Predict derived targets (e.g., masked tokens) |
| Typical tasks | Classification, regression | Clustering, dimensionality reduction, density estimation | Pre-training for downstream tasks |
| Evaluation | Direct (compare predictions to labels) | Indirect (silhouette score, visual inspection, downstream performance) | Indirect (downstream task performance) |
| Examples | Linear regression, decision trees, neural networks | K-means, PCA, GANs | BERT, SimCLR, MAE |
| Data requirements | Needs labeled data (expensive) | Works with raw, unlabeled data | Works with raw, unlabeled data |
| Scalability with data | Limited by labeling cost | Scales easily with more data | Scales easily with more data |
Evaluating unsupervised learning is fundamentally more difficult than evaluating supervised learning, because there are no ground-truth labels against which to measure performance. Several approaches and metrics are used, each with limitations.
Internal metrics assess the quality of clustering or dimensionality reduction using only the data itself, without reference to external labels. Common examples include the silhouette coefficient (how well each point fits its own cluster relative to the nearest other cluster), the Davies-Bouldin index, and the Calinski-Harabasz index.
When ground-truth labels are available (for example, in benchmarking studies), external metrics can be used:

- Adjusted Rand Index (ARI): measures agreement between the discovered clusters and the true labels, corrected for chance.
- Normalized Mutual Information (NMI): quantifies the information shared between the discovered clustering and the true labeling.
- Purity: the fraction of points assigned to the majority true class of their cluster.
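A small illustration of the internal/external distinction with scikit-learn (synthetic data, so ground-truth labels are available for the external metric):

```python
# Silhouette needs no labels; adjusted Rand compares against ground truth.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("Silhouette (internal):", round(float(silhouette_score(X, labels)), 3))
print("Adjusted Rand (external):", round(float(adjusted_rand_score(y_true, labels)), 3))
```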
For generative models, evaluation is particularly complex because assessing the quality of generated samples involves multiple dimensions:

- Fidelity: whether individual samples are realistic; for images, this is often measured with the Fréchet Inception Distance (FID) or the Inception Score.
- Diversity: whether the model covers the full variety of the data distribution rather than collapsing onto a few modes.
- Likelihood: for models with tractable densities (such as normalizing flows), held-out log-likelihood can be reported directly.
Beyond the choice of metric, several practical challenges affect the evaluation of unsupervised learning:

- There is often no single correct answer; different clusterings or embeddings of the same data can all be reasonable.
- Internal metrics encode their own assumptions (for example, favoring compact, convex clusters) and can disagree with one another.
- Results are frequently sensitive to hyperparameters and random initialization, so multiple runs are needed for reliable comparisons.
- In practice, quality is often judged by performance on a downstream task, which measures usefulness rather than intrinsic correctness.
The following table summarizes major unsupervised learning algorithms organized by category.
| Category | Algorithm | Key parameters | Strengths | Limitations |
|---|---|---|---|---|
| Clustering | K-Means | Number of clusters (k) | Simple, fast, scalable | Assumes spherical clusters; requires k in advance |
| Clustering | DBSCAN | Epsilon, MinPts | Finds arbitrary-shaped clusters; handles noise | Sensitive to epsilon; struggles with varying densities |
| Clustering | Hierarchical Clustering | Linkage method, distance metric | No need to predefine k; produces dendrogram | O(n^2) or higher; not ideal for large datasets |
| Clustering | GMM | Number of components | Soft assignments; models elliptical clusters | Assumes Gaussian distributions; sensitive to initialization |
| Clustering | Mean Shift | Bandwidth | No need to predefine k; finds arbitrarily shaped clusters | Computationally expensive; bandwidth selection is critical |
| Clustering | Spectral Clustering | Number of clusters, affinity | Handles non-convex clusters via graph Laplacian | Expensive for large datasets; requires k |
| Dimensionality Reduction | PCA | Number of components | Fast, interpretable, well-understood | Linear only; may miss nonlinear structure |
| Dimensionality Reduction | t-SNE | Perplexity, learning rate | Excellent for 2D/3D visualization | Slow on large data; does not preserve global structure well |
| Dimensionality Reduction | UMAP | Number of neighbors, min_dist | Fast; preserves local and global structure | Hyperparameter-sensitive; less interpretable than PCA |
| Dimensionality Reduction | Autoencoder | Architecture, latent dimension | Captures nonlinear relationships; flexible | Requires neural network training; harder to interpret |
| Dimensionality Reduction | ICA | Number of components | Separates independent sources; useful for BSS | Assumes statistical independence; linear |
| Dimensionality Reduction | SOM | Grid size, learning rate, neighborhood radius | Preserves topology; visual 2D mapping | Requires grid size choice; not suited for very high-dimensional data |
| Density Estimation | KDE | Bandwidth, kernel function | Non-parametric; flexible | Slow in high dimensions; bandwidth-sensitive |
| Density Estimation | GMM | Number of components | Parametric; provides cluster probabilities | Assumes Gaussian components |
| Anomaly Detection | Isolation Forest | Number of trees, contamination | Fast; handles high dimensions | Less effective at detecting local anomalies |
| Anomaly Detection | Local Outlier Factor | Number of neighbors | Detects local anomalies | Sensitive to parameter choice; slow on large data |
| Anomaly Detection | One-Class SVM | Kernel, nu | Effective in high-dimensional spaces | Computationally expensive; kernel choice affects results |
| Association Rules | Apriori | Min support, min confidence | Intuitive; well-established | Slow on large datasets; many candidate itemsets |
| Association Rules | FP-Growth | Min support | Faster than Apriori; no candidate generation | Memory-intensive for very large FP-trees |
| Topic Modeling | LDA | Number of topics, alpha, beta | Interpretable topics; principled probabilistic model | Requires specifying topic count; bag-of-words assumption |
| Topic Modeling | NMF | Number of components | Non-negative factors; interpretable | Sensitive to initialization; no probabilistic interpretation |
| Generative | VAE | Latent dimension, architecture | Smooth latent space; stable training | Blurry outputs; limited expressiveness |
| Generative | GAN | Architecture, learning rates | High-quality samples; versatile | Training instability; mode collapse |
| Generative | Diffusion Model | Noise schedule, number of steps | State-of-the-art image quality; stable training | Slow inference; high compute requirements |
| Generative | Normalizing Flow | Number of layers, coupling design | Exact likelihood; invertible | Constrained architecture; limited expressiveness |
Imagine you have a big box of colorful building blocks, but nobody told you what to build or how the blocks should be sorted. You start looking at the blocks and notice some are red and round, some are blue and square, and some are green and long. You decide to put similar-looking blocks into piles: all the red round ones together, all the blue square ones together, and so on. Nobody told you the names of these groups or which block goes where. You figured it all out by yourself, just by looking at the blocks.
That is basically what unsupervised machine learning does. A computer looks at a huge pile of data, with no labels or instructions, and tries to find patterns and groups on its own. It might discover that some data points are very similar to each other and put them in a group, or it might figure out a simpler way to describe complicated data. The computer does not know if it is right or wrong because nobody gave it an answer key. It just does its best to make sense of the information.