Unsupervised learning is a branch of machine learning in which algorithms identify patterns, structures, and relationships in data without relying on labeled examples or explicit human guidance. Unlike supervised learning, where a model trains on input-output pairs to learn a mapping function, unsupervised learning operates on raw, unlabeled datasets and attempts to discover the underlying organization of the data on its own. It is one of the three major paradigms in machine learning, alongside supervised learning and reinforcement learning.
Because labeled data is expensive and time-consuming to produce, unsupervised methods are attractive for the vast majority of real-world data, which arrives without annotations. Text scraped from the web, sensor readings from industrial equipment, transaction logs, and genomic sequences all represent the kinds of unlabeled data that unsupervised algorithms can process. The field has grown in importance since the mid-2010s as deep learning architectures made it possible to learn rich representations from massive unlabeled corpora, leading to breakthroughs in natural language processing, computer vision, and generative modeling.
The intellectual roots of unsupervised learning extend back to the early twentieth century. Karl Pearson introduced principal component analysis (PCA) in 1901 as a technique for fitting lines and planes to point clouds in high-dimensional space [1]. Harold Hotelling independently developed PCA in 1933 as a method for extracting dominant modes of variation from multivariate datasets [2]. These early linear methods provided the foundation for dimensionality reduction, one of the core tasks in unsupervised learning.
In neuroscience, Donald Hebb proposed in 1949 that neurons that fire together strengthen their connections, a principle now known as Hebbian learning [3]. This rule describes a form of unsupervised synaptic plasticity: the network adjusts its weights based solely on correlations in the input, with no external teaching signal. Hebbian learning influenced decades of connectionist research and inspired algorithms such as Oja's rule (1982), which extracts the first principal component of streaming data.
The k-means clustering algorithm, one of the most widely used unsupervised methods, has origins in the late 1950s. Stuart Lloyd devised the iterative procedure in 1957 at Bell Laboratories as a technique for pulse-code modulation in signal processing, though his paper was not formally published until 1982 [4]. In 1965, Edward Forgy independently described essentially the same method, which is why the algorithm is sometimes called the Lloyd-Forgy algorithm [5]. James MacQueen coined the name "k-means" in 1967 [6].
The 1980s brought several advances. Teuvo Kohonen introduced self-organizing maps (SOMs) in 1982, providing a neural network approach to topology-preserving dimensionality reduction [7]. The expectation-maximization (EM) algorithm, formalized by Dempster, Laird, and Rubin in 1977, became the standard method for fitting Gaussian mixture models and other latent-variable models [8].
The 1990s and 2000s saw density-based methods like DBSCAN (Ester et al., 1996) [9] and spectral clustering gain popularity, alongside nonlinear dimensionality reduction techniques such as Isomap (2000) and locally linear embedding (2000). The introduction of t-distributed stochastic neighbor embedding (t-SNE) by Laurens van der Maaten and Geoffrey Hinton in 2008 provided a powerful tool for visualizing high-dimensional data [10].
The deep learning era transformed unsupervised learning. Autoencoders, which date back to the 1980s but gained new life with deep architectures, became effective tools for learning compressed representations. Generative adversarial networks (GANs), introduced by Ian Goodfellow and colleagues in 2014, offered a new framework for generative modeling [11]. Variational autoencoders (VAEs), proposed by Kingma and Welling in 2013, combined deep learning with probabilistic inference [12]. More recently, diffusion models, score-based generative models, and self-supervised learning approaches (including BERT and GPT) have blurred the boundary between unsupervised and supervised paradigms.
Unsupervised learning can be formalized as the problem of estimating the probability distribution p(x) that generated an observed dataset X = {x_1, x_2, ..., x_n}, where each x_i is a data point in some high-dimensional space. Unlike in supervised learning, there is no corresponding set of labels y_i. The goal is to learn structure from the data alone.
The most general formulation of unsupervised learning is density estimation: given samples from an unknown distribution, estimate the probability density function (PDF). Parametric approaches assume the data follows a known family of distributions (such as a mixture of Gaussians) and estimate the parameters. Non-parametric approaches, such as kernel density estimation (KDE), make fewer assumptions about the data distribution and let the data shape the density estimate directly.
Maximum likelihood estimation (MLE) is a common optimization objective. Given a parametric model p(x; θ), MLE finds the parameters θ that maximize the likelihood of the observed data:
θ* = argmax_θ Σ_i log p(x_i; θ)
For models with latent variables z (such as mixture models or VAEs), the marginal likelihood p(x; θ) = ∫ p(x, z; θ) dz is often intractable, and the EM algorithm or variational inference methods are used instead.
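As a minimal illustration of the MLE objective above, consider a univariate Gaussian, where the argmax has a closed form: the sample mean and the (biased) sample variance. The toy data below is hypothetical and only the standard library is assumed.

```python
import math
import random

random.seed(0)

# Draw samples from a "true" Gaussian with mean 5.0 and std 2.0.
data = [random.gauss(5.0, 2.0) for _ in range(10_000)]

# Maximizing sum_i log p(x_i; mu, sigma^2) for a univariate Gaussian
# yields the sample mean and the biased sample variance in closed form.
n = len(data)
mu_hat = sum(data) / n
var_hat = sum((x - mu_hat) ** 2 for x in data) / n

print(f"MLE mean:     {mu_hat:.2f}")   # close to the true mean 5.0
print(f"MLE variance: {var_hat:.2f}")  # close to the true variance 4.0
```

For models without such a closed form, the same objective is optimized numerically, and for latent-variable models it motivates EM and variational inference.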
Many unsupervised methods assume that each observed data point x was generated by some unobserved (latent) variable z through a process x = f(z) + noise. The goal is to infer both the latent variables and the mapping function. PCA can be understood as a linear latent variable model where f is a linear transformation. Autoencoders learn nonlinear versions of this mapping using neural networks. VAEs place a probabilistic prior p(z) on the latent space and optimize a variational lower bound (the evidence lower bound, or ELBO) on the log-likelihood.
Clustering algorithms typically optimize an objective that measures how well the data is partitioned. K-means minimizes the within-cluster sum of squared distances:
J = Σ_k Σ_{x_i ∈ C_k} ||x_i - μ_k||²
where C_k is the set of points assigned to cluster k and μ_k is the centroid. Gaussian mixture models maximize the likelihood of a mixture of K Gaussian distributions, each with its own mean and covariance. Spectral clustering methods optimize graph-based objectives such as the normalized cut.
Unsupervised learning encompasses several distinct tasks, each with its own methods and evaluation criteria.
Clustering is the task of partitioning a dataset into groups (clusters) such that points within the same group are more similar to each other than to points in different groups. It is one of the most common applications of unsupervised learning and is used across domains from biology to marketing.
| Algorithm | Type | Key idea | Handles arbitrary shapes | Requires k | Time complexity |
|---|---|---|---|---|---|
| K-means | Centroid-based | Iteratively assigns points to nearest centroid and recomputes centroids | No (assumes convex clusters) | Yes | O(nkdt) for t iterations |
| Hierarchical (agglomerative) | Connectivity-based | Builds a tree of clusters by merging closest pairs | Depends on linkage | No (uses dendrogram) | O(n²log n) to O(n³) |
| DBSCAN | Density-based | Groups points in dense regions; marks sparse regions as noise | Yes | No (uses ε, MinPts) | O(n log n) with spatial indexing |
| Gaussian mixture models | Model-based | Fits a mixture of Gaussian distributions using EM | Partial (ellipsoidal) | Yes | O(nkd²) per iteration |
| Spectral clustering | Graph-based | Uses eigenvalues of a similarity graph Laplacian | Yes | Yes | O(n³) for eigen-decomposition |
| Mean shift | Density-based | Iteratively shifts points toward local density maxima | Yes | No (uses bandwidth) | O(n²) per iteration |
| OPTICS | Density-based | Extension of DBSCAN that produces a reachability ordering | Yes | No | O(n log n) with spatial indexing |
K-means clustering. The k-means algorithm partitions n data points into k clusters by iterating two steps: (1) assign each point to the cluster with the nearest centroid, and (2) recompute each centroid as the mean of its assigned points. The algorithm converges when assignments stop changing. Despite its simplicity, k-means remains popular because it scales well to large datasets and is easy to implement. Its main limitations are that it assumes roughly spherical clusters, is sensitive to initialization (the k-means++ initialization helps mitigate this), and requires the user to specify k in advance [4].
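The two alternating steps can be sketched in a few lines of plain Python. This is an illustrative toy implementation on hypothetical 2-D data, using a naive deterministic initialization rather than k-means++:

```python
import random

random.seed(42)

# Two well-separated 2-D blobs (toy data for illustration).
blob_a = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(100)]
blob_b = [(random.gauss(5, 0.5), random.gauss(5, 0.5)) for _ in range(100)]
points = blob_a + blob_b

def kmeans(points, k, iters=100):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    # Naive init for the demo; k-means++ is the better choice in practice.
    centroids = [points[i * (len(points) // k)] for i in range(k)]
    for _ in range(iters):
        # Step 1: assign each point to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2
                                + (p[1] - centroids[j][1]) ** 2)
            clusters[j].append(p)
        # Step 2: recompute each centroid as the mean of its assigned points.
        new_centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: assignments stopped changing
            break
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))  # one centroid near (0, 0), the other near (5, 5)
```

Each iteration can only decrease the within-cluster sum of squares J, which is why the procedure is guaranteed to converge (though possibly to a local minimum).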
Hierarchical clustering. Agglomerative hierarchical clustering starts with each data point as its own cluster and repeatedly merges the two closest clusters until a single cluster remains. The result is a dendrogram, a tree structure that shows the hierarchy of merges at different distance thresholds. The choice of linkage criterion (single, complete, average, or Ward's method) determines how inter-cluster distance is defined and strongly affects the shape of the resulting clusters. Divisive hierarchical clustering works top-down but is less commonly used.
DBSCAN. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) identifies clusters as contiguous regions of high density separated by regions of low density [9]. It requires two parameters: ε (the neighborhood radius) and MinPts (the minimum number of points to form a dense region). Points that do not belong to any dense region are labeled as noise. DBSCAN can discover clusters of arbitrary shape and does not require specifying the number of clusters, but it struggles when clusters have widely varying densities.
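The parameters ε and MinPts map directly onto scikit-learn's `eps` and `min_samples`. A hedged sketch, assuming scikit-learn and NumPy are available and using hypothetical toy data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Two dense blobs plus one isolated point that should be labeled noise.
blob_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
outlier = np.array([[10.0, -10.0]])
X = np.vstack([blob_a, blob_b, outlier])

# eps is the neighborhood radius; min_samples corresponds to MinPts.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print("labels found:", sorted(set(labels)))  # [-1, 0, 1]: two clusters plus noise
print("outlier's label:", labels[-1])        # -1 (noise)
```

Note that the number of clusters (two here) falls out of the density structure; it is never specified by the user.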
Gaussian mixture models. A Gaussian mixture model (GMM) represents the data as a weighted sum of K multivariate Gaussian distributions. Each Gaussian component has its own mean vector and covariance matrix, allowing the model to capture ellipsoidal clusters of different sizes and orientations. The parameters are typically estimated using the EM algorithm, which alternates between computing soft assignments (E-step) and updating parameters (M-step). GMMs are more flexible than k-means because they model cluster shapes and provide probabilistic cluster assignments rather than hard boundaries [8].
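The soft (probabilistic) assignments that distinguish GMMs from k-means can be seen directly in scikit-learn's `predict_proba`. An illustrative sketch on hypothetical data, assuming scikit-learn and NumPy:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Two Gaussian blobs with different spreads.
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(200, 2)),
    rng.normal([4, 4], 1.0, size=(200, 2)),
])

# Fitting runs the EM algorithm internally (E-step: soft assignments,
# M-step: update means, covariances, and mixing weights).
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

probs = gmm.predict_proba(X[:1])  # posterior over components for one point
print("component means:\n", gmm.means_.round(1))
print("posterior for first point:", probs.round(3))
```

Unlike a k-means assignment, each row of `predict_proba` sums to 1 and expresses how confidently a point belongs to each component.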
Dimensionality reduction maps high-dimensional data to a lower-dimensional representation while preserving important structure. It is used for visualization, noise reduction, feature extraction, and as a preprocessing step before other analyses.
| Method | Linear/Nonlinear | Preserves | Output dimensions | Year introduced |
|---|---|---|---|---|
| PCA | Linear | Global variance | Any | 1901 |
| Independent Component Analysis (ICA) | Linear | Statistical independence | Same as input | 1994 |
| t-SNE | Nonlinear | Local neighborhoods | 2 or 3 (visualization) | 2008 |
| UMAP | Nonlinear | Local and some global structure | Any | 2018 |
| Autoencoders | Nonlinear | Learned reconstruction | Any | 1986/2006 |
| Isomap | Nonlinear | Geodesic distances | Any | 2000 |
Principal component analysis. PCA finds the directions (principal components) along which the data varies most. Mathematically, it computes the eigenvectors of the data's covariance matrix and projects the data onto the top k eigenvectors. The first principal component captures the direction of maximum variance, the second captures the direction of maximum remaining variance orthogonal to the first, and so on. PCA is fast, well-understood, and provides an optimal linear projection in the sense of minimizing reconstruction error. Its main limitation is that it can only capture linear relationships [1].
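The eigendecomposition view of PCA described above can be written out directly with NumPy. A sketch on hypothetical correlated 2-D data (most variance lies along the (1, 1) direction):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D data: both coordinates share the same underlying signal.
z = rng.normal(size=500)
X = np.column_stack([z + 0.1 * rng.normal(size=500),
                     z + 0.1 * rng.normal(size=500)])

# 1. Center the data.  2. Form the covariance matrix.
# 3. Eigendecompose.   4. Project onto the top-k eigenvectors.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]              # columns = principal components
explained = eigvals[order] / eigvals.sum()

projected = Xc @ components[:, :1]          # reduce 2-D -> 1-D
print("explained variance ratio:", explained.round(3))
print("first PC direction:", components[:, 0].round(2))  # ~ ±(0.71, 0.71)
```

In practice one would use a library routine (e.g. SVD-based PCA), which is numerically more stable, but the result is the same projection.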
t-SNE. t-distributed stochastic neighbor embedding (t-SNE) models pairwise similarities between data points as joint probabilities in the high-dimensional space, defines a corresponding distribution over pairs in the low-dimensional embedding using a heavy-tailed Student's t-distribution (the source of the "t"), and minimizes the Kullback-Leibler (KL) divergence between the two distributions [10]. It excels at revealing local cluster structure and is widely used for visualizing high-dimensional datasets in two or three dimensions. However, t-SNE is computationally expensive for large datasets, its results depend on hyperparameters (particularly the perplexity), and distances between distant clusters in the output are not meaningful.
UMAP. Uniform Manifold Approximation and Projection (UMAP), introduced by McInnes, Healy, and Melville in 2018, constructs a topological representation of the high-dimensional data using fuzzy simplicial sets, then optimizes a low-dimensional layout to preserve that topology [13]. UMAP tends to preserve both local and some global structure better than t-SNE, runs faster on large datasets, and can be used for general-purpose dimensionality reduction (not just visualization). It has become a standard tool in genomics, single-cell biology, and other fields with high-dimensional data.
Autoencoders. An autoencoder is a neural network trained to reconstruct its input through a bottleneck layer. The encoder maps the input to a low-dimensional latent representation, and the decoder maps the latent representation back to the original space. The network is trained by minimizing reconstruction error (typically mean squared error or binary cross-entropy). Because the bottleneck forces the network to learn a compressed representation, autoencoders perform nonlinear dimensionality reduction. Variants include denoising autoencoders (which corrupt the input before encoding), sparse autoencoders (which add a sparsity penalty), and contractive autoencoders (which penalize sensitivity to input perturbations).
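To make the bottleneck idea concrete, here is a deliberately minimal linear autoencoder trained by gradient descent in NumPy; real autoencoders use nonlinear activations and a deep learning framework, and the data here is hypothetical (10-D points that actually lie near a 2-D subspace, so a 2-unit bottleneck suffices):

```python
import numpy as np

rng = np.random.default_rng(0)

# 10-D data generated from a 2-D latent space plus a little noise.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

# Encoder W1 (10 -> 2) and decoder W2 (2 -> 10), trained by gradient
# descent on the mean squared reconstruction error.
W1 = 0.1 * rng.normal(size=(10, 2))
W2 = 0.1 * rng.normal(size=(2, 10))
lr = 0.01

def loss(X, W1, W2):
    return np.mean((X - X @ W1 @ W2) ** 2)

initial = loss(X, W1, W2)
for _ in range(2000):
    H = X @ W1                          # encode through the bottleneck
    err = H @ W2 - X                    # reconstruction residual
    grad_W2 = 2 * H.T @ err / len(X)    # gradient of MSE w.r.t. decoder
    grad_W1 = 2 * X.T @ (err @ W2.T) / len(X)  # ... and w.r.t. encoder
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

print(f"reconstruction MSE: {initial:.4f} -> {loss(X, W1, W2):.4f}")
```

A purely linear autoencoder like this one recovers the same subspace as PCA; the nonlinear activations of practical autoencoders are what let them go beyond linear dimensionality reduction.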
Anomaly detection, also called outlier detection, identifies data points that deviate significantly from the majority of the data. In unsupervised settings, there are no labeled examples of anomalies; the algorithm must learn what "normal" looks like and flag deviations.
Kernel density estimation. KDE places a kernel function (typically Gaussian) at each data point and sums the contributions to estimate the probability density at any location. Points falling in low-density regions are considered potential anomalies. KDE makes few distributional assumptions but becomes computationally expensive in high dimensions.
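A minimal one-dimensional sketch of KDE-based anomaly scoring, using only the standard library and hypothetical sensor readings (the bandwidth value is an illustrative choice):

```python
import math

def gaussian_kde(x, data, bandwidth=0.5):
    """Estimate density at x as the average of Gaussian kernels
    centered at each data point."""
    norm = math.sqrt(2 * math.pi) * bandwidth * len(data)
    return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data) / norm

# Normal readings cluster around 10; one reading sits far away.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.05, 9.95, 17.0]

densities = {x: gaussian_kde(x, readings) for x in readings}
anomaly = min(densities, key=densities.get)
print(f"lowest-density reading: {anomaly}")  # 17.0 flagged as the anomaly
```

In practice the bandwidth would be chosen by cross-validation or a rule of thumb, and a threshold on the density would separate normal points from anomalies.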
Isolation Forest. The Isolation Forest algorithm, proposed by Liu, Ting, and Zhou in 2008, detects anomalies by building an ensemble of random decision trees [14]. The core insight is that anomalies are "few and different," so they tend to be isolated by fewer random splits than normal points. The average path length from the root to a point across the ensemble serves as an anomaly score: shorter paths indicate more anomalous points. Isolation Forest is efficient, scales well, and does not require density estimation.
One-class SVM. The one-class support vector machine learns a boundary around the normal data in a high-dimensional feature space (using the kernel trick) and classifies points outside the boundary as anomalies.
Autoencoders for anomaly detection. An autoencoder trained on normal data learns to reconstruct normal patterns well. When presented with an anomalous input, the reconstruction error is high, which serves as a signal for anomaly detection. This approach has been applied in manufacturing quality control, network intrusion detection, and medical imaging.
Association rule learning discovers interesting relationships between variables in large databases. The most well-known application is market basket analysis, where the goal is to find sets of products that are frequently purchased together.
The Apriori algorithm, introduced by Agrawal and Srikant in 1994, is the classic method for mining frequent itemsets and generating association rules [15]. It operates on three key measures: support (how frequently an itemset appears in the dataset), confidence (the conditional probability of the consequent given the antecedent), and lift (the ratio of observed co-occurrence to expected co-occurrence under independence). Extensions such as the FP-Growth algorithm improve efficiency by using a compact tree structure to avoid repeated database scans.
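The three measures are easy to compute directly for a single rule. A sketch over a hypothetical basket dataset (full Apriori would additionally prune the search over itemsets by the support threshold):

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing the whole itemset."""
    return sum(itemset <= t for t in transactions) / n

def confidence(antecedent, consequent):
    """P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Observed co-occurrence relative to independence; 1.0 = independent."""
    return confidence(antecedent, consequent) / support(consequent)

rule = ({"bread"}, {"milk"})
print(f"support:    {support(rule[0] | rule[1]):.2f}")   # 0.60
print(f"confidence: {confidence(*rule):.2f}")            # 0.75
print(f"lift:       {lift(*rule):.2f}")
```

Here the lift is slightly below 1, meaning bread and milk co-occur a little less often than independence would predict, so the rule is not actually interesting despite its high confidence.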
Generative models learn to model the data distribution p(x) and can generate new samples that resemble the training data. They represent a major subfield of unsupervised learning, with applications in image synthesis, text generation, drug discovery, and data augmentation.
Variational autoencoders (VAEs), introduced by Kingma and Welling in 2013, combine deep neural networks with variational Bayesian inference [12]. A VAE consists of an encoder network that maps input data to a distribution over latent variables (typically a diagonal Gaussian) and a decoder network that maps latent samples back to the data space. The model is trained by maximizing the evidence lower bound (ELBO):
ELBO = E_q(z|x)[log p(x|z)] - KL(q(z|x) || p(z))
The first term encourages accurate reconstruction, while the KL divergence term regularizes the latent space by pushing the approximate posterior q(z|x) toward the prior p(z). This regularization enables smooth interpolation in latent space and generation of new samples by decoding random draws from the prior.
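For the common case of a diagonal-Gaussian posterior and a standard-normal prior, the KL term in the ELBO has a closed form, 0.5 · Σ(μ² + σ² − log σ² − 1), which is what VAE implementations actually compute. A sketch assuming NumPy:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL(q || p) for q = N(mu, diag(exp(log_var))) and p = N(0, I).

    Closed form: 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1).
    """
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1)

# When q equals the prior, the regularizer vanishes...
print(kl_to_standard_normal(np.zeros(4), np.zeros(4)))  # 0.0
# ...and it grows as the approximate posterior drifts away from the prior.
print(kl_to_standard_normal(np.ones(4), np.zeros(4)))   # 2.0
```

During training this term is added to the (negative) reconstruction term, and gradients flow through μ and log σ² via the reparameterization trick.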
VAEs are valued for their principled probabilistic framework, stable training, and meaningful latent representations. Their main drawback is that generated images tend to be blurrier than those from GANs, a consequence of optimizing a lower bound on the likelihood rather than an adversarial objective.
Generative adversarial networks (GANs), proposed by Ian Goodfellow and colleagues in 2014, train two neural networks in competition: a generator G that produces synthetic data and a discriminator D that tries to distinguish real data from generated data [11]. The training objective is a minimax game:
min_G max_D E_x[log D(x)] + E_z[log(1 - D(G(z)))]
At equilibrium, the generator produces data indistinguishable from real samples, and the discriminator outputs 0.5 for all inputs. In practice, reaching this equilibrium is difficult, and GAN training can suffer from mode collapse (where the generator produces limited variety), training instability, and vanishing gradients.
Numerous GAN variants address these issues. Wasserstein GAN (WGAN) replaces the divergence objective with the Wasserstein distance for more stable training. Progressive GAN and StyleGAN (Karras et al., 2018, 2019) introduced progressive training and style-based architectures that produce high-resolution, photorealistic images. Conditional GANs add label information to control the generation process, which moves them outside the purely unsupervised setting.
Diffusion models have emerged as a powerful alternative to GANs and VAEs. They define a forward diffusion process that gradually adds Gaussian noise to the data over T steps until it becomes pure noise, and a reverse process that learns to iteratively denoise, recovering the original data distribution [16]. The training objective is typically a simplified form of variational inference, with the model (usually a U-Net architecture) predicting the noise added at each step.
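A useful property of the forward process is that x_t can be sampled in one step: x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε, where ᾱ_t is the cumulative product of (1 − β_s). The sketch below shows only this forward (noising) direction on hypothetical 1-D data, with an illustrative linear schedule; the learned reverse process is where the actual model lives:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule (illustrative)
alpha_bar = np.cumprod(1.0 - betas)     # cumulative signal-retention factor

def forward_diffuse(x0, t):
    """Sample x_t in closed form: sqrt(a_t) * x0 + sqrt(1 - a_t) * noise."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

x0 = np.ones(10_000)                    # toy "data": all values equal to 1
for t in [0, 499, 999]:
    xt = forward_diffuse(x0, t)
    print(f"t={t:4d}  mean={xt.mean():+.3f}  std={xt.std():.3f}")
```

By the final step the samples are statistically indistinguishable from pure standard Gaussian noise, which is exactly the starting point the learned reverse process denoises from.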
Diffusion models produce high-fidelity, diverse samples and avoid the mode collapse problems of GANs. DALL-E 2 (OpenAI, 2022), Stable Diffusion (Stability AI, 2022), and Imagen (Google, 2022) demonstrated the power of diffusion models for text-to-image generation. The main trade-off is sampling speed: generating a single image requires running the reverse process for many steps, though methods like DDIM and consistency models have reduced the number of required steps.
| Model type | Year | Training stability | Sample quality | Sample diversity | Latent space | Generation speed |
|---|---|---|---|---|---|---|
| VAE | 2013 | Stable | Moderate (can be blurry) | High | Structured, continuous | Fast |
| GAN | 2014 | Can be unstable | High (sharp images) | Risk of mode collapse | No explicit latent model | Fast |
| Diffusion model | 2015/2020 | Stable | Very high | High | Implicit | Slow (many steps) |
Self-supervised learning is a form of unsupervised learning in which the model generates its own supervision signal from the structure of the unlabeled data. Rather than predicting external labels, the model solves a pretext task derived from the data itself, such as predicting masked tokens, predicting the next word, or determining the relative position of image patches.
Self-supervised methods have driven some of the most significant advances in deep learning. In natural language processing, BERT (Devlin et al., 2018) trains by masking 15% of input tokens and predicting their identities from context [17]. The GPT family of models (Radford et al., 2018, 2019) trains by predicting the next token in a sequence, a form of autoregressive language modeling [18]. Word2Vec, introduced by Mikolov et al. in 2013, learns word embeddings by predicting context words from a target word (skip-gram) or vice versa (CBOW) [19].
In computer vision, contrastive learning methods such as SimCLR (Chen et al., 2020) and MoCo (He et al., 2020) learn visual representations by training the network to produce similar embeddings for augmented versions of the same image and dissimilar embeddings for different images. DINO (Caron et al., 2021) and MAE (He et al., 2022) extended self-supervised learning to Vision Transformers, achieving performance competitive with supervised pretraining on ImageNet.
The relationship between self-supervised learning and traditional unsupervised learning is a matter of ongoing discussion. Some researchers classify self-supervised learning as a subset of unsupervised learning because no human-provided labels are used. Others consider it a distinct paradigm because the model still optimizes a prediction loss against a concrete target (just one derived automatically from the data). Yann LeCun has advocated for the term "self-supervised learning" over "unsupervised learning," arguing that the former more accurately describes what these systems actually do [20].
Unsupervised learning methods are applied across a wide range of industries and scientific disciplines.
Businesses use clustering algorithms to segment customers into groups based on purchasing behavior, demographics, browsing patterns, and other features. K-means, hierarchical clustering, and GMMs are commonly applied to customer datasets. The resulting segments enable targeted marketing campaigns, personalized product recommendations, and pricing strategies tailored to different customer profiles. Research has shown that machine-learning-based segmentation can improve customer engagement by 4-5% and increase revenue growth [21].
Unsupervised learning is used extensively in biology. Single-cell RNA sequencing generates gene expression profiles for individual cells; clustering and dimensionality reduction (particularly UMAP and t-SNE) help identify cell types and states within tissues. PCA is a standard preprocessing step in genome-wide association studies. Hierarchical clustering of gene expression data reveals co-regulated gene modules. Anomaly detection methods identify aberrant samples in clinical datasets.
Topic modeling algorithms such as Latent Dirichlet Allocation (LDA) discover latent topics in document collections without labeled categories. Word embeddings learned through Word2Vec or GloVe capture semantic relationships between words in an unsupervised manner. These representations serve as features for downstream tasks such as sentiment analysis, document classification, and information retrieval.
Unsupervised methods are used for image recognition preprocessing, image compression, denoising, and feature extraction. Autoencoders learn compressed image representations. GANs generate realistic synthetic images for data augmentation. Self-supervised pretraining on large unlabeled image datasets produces feature extractors that transfer well to downstream tasks with limited labeled data.
Manufacturing companies deploy unsupervised anomaly detection to identify defective products on production lines using sensor data or visual inspection. Financial institutions use unsupervised methods to detect fraudulent transactions by identifying patterns that deviate from normal spending behavior. Network security systems use clustering and density estimation to flag unusual traffic patterns that may indicate intrusions.
In physics and astronomy, unsupervised learning helps analyze large datasets from particle colliders and telescopes. Clustering algorithms group astronomical objects by spectral properties. Dimensionality reduction helps researchers visualize and interpret complex simulation outputs. In chemistry and materials science, unsupervised methods identify patterns in molecular properties and accelerate the search for new materials.
Evaluating unsupervised learning algorithms is inherently more difficult than evaluating supervised methods because there are no ground-truth labels to compare against. The field relies on a combination of internal metrics, external metrics (when partial labels are available), and downstream task performance.
Internal metrics evaluate the quality of a clustering or representation using only the data itself.
| Metric | Range | Optimal value | What it measures |
|---|---|---|---|
| Silhouette coefficient | [-1, 1] | Close to 1 | How similar a point is to its own cluster versus other clusters |
| Davies-Bouldin index | [0, ∞) | Close to 0 | Average similarity between each cluster and its most similar neighbor |
| Calinski-Harabasz index | [0, ∞) | Higher is better | Ratio of between-cluster dispersion to within-cluster dispersion |
| Elbow method (inertia) | [0, ∞) | Elbow point | Within-cluster sum of squared distances as a function of k |
| BIC/AIC | (-∞, ∞) | Lower is better | Model likelihood penalized by model complexity |
The silhouette coefficient measures how well each data point fits within its assigned cluster compared to the nearest neighboring cluster. A value near 1 indicates that the point is well-matched to its own cluster; a value near -1 suggests the point may be assigned to the wrong cluster. Research has found the silhouette coefficient and the Davies-Bouldin index to be more informative than other metrics such as the Dunn index when assessing convex-shaped clusters [22].
The Davies-Bouldin index computes the average of the maximum ratio of within-cluster scatter to between-cluster distance for each pair of clusters. Lower values indicate better-separated, more compact clusters.
The Calinski-Harabasz index (also called the variance ratio criterion) computes the ratio of between-cluster variance to within-cluster variance, scaled by the number of clusters and data points. Higher values indicate well-defined clusters.
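These internal metrics are often used together to choose the number of clusters. A sketch assuming scikit-learn and NumPy, on hypothetical data with three true blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)

# Three well-separated 2-D blobs (so the "right" answer is k = 3).
X = np.vstack([rng.normal(c, 0.4, size=(60, 2))
               for c in ([0, 0], [4, 0], [2, 4])])

# Score candidate values of k using internal metrics only (no labels needed).
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}  silhouette={silhouette_score(X, labels):.3f}  "
          f"davies-bouldin={davies_bouldin_score(X, labels):.3f}")
```

On data like this, the silhouette coefficient peaks (and the Davies-Bouldin index bottoms out) at the true k; on real data the metrics may disagree, which is itself informative.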
When some ground-truth labels are available (even for a subset of the data), external metrics can evaluate how well the discovered clusters align with known categories.
| Metric | Range | What it measures |
|---|---|---|
| Adjusted Rand Index (ARI) | [-1, 1] | Agreement between two partitions, adjusted for chance |
| Normalized Mutual Information (NMI) | [0, 1] | Mutual information between cluster assignments and true labels, normalized |
| V-measure | [0, 1] | Harmonic mean of homogeneity and completeness |
| Fowlkes-Mallows index | [0, 1] | Geometric mean of pairwise precision and recall |
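A key property of these external metrics is invariance to cluster relabeling: only the partition matters, not the arbitrary integer names the algorithm assigns. A sketch assuming scikit-learn, with hypothetical label vectors:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Same partition as the truth, just with permuted label names.
found_perfect = [2, 2, 2, 0, 0, 0, 1, 1, 1]
# A partly wrong partition: two points assigned to the wrong group.
found_noisy = [0, 0, 1, 1, 1, 1, 2, 2, 0]

print(adjusted_rand_score(true_labels, found_perfect))            # 1.0
print(f"{adjusted_rand_score(true_labels, found_noisy):.2f}")
print(f"{normalized_mutual_info_score(true_labels, found_noisy):.2f}")
```

The "adjusted for chance" correction in ARI means a random partition scores near 0 rather than some positive baseline, which makes scores comparable across datasets and values of k.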
A practical and widely used approach to evaluating unsupervised representations is to measure their usefulness on downstream supervised tasks. The learned features or embeddings are fed into a simple classifier (such as a linear probe or k-nearest neighbors), and performance on the supervised task (measured by accuracy, F1 score, or other standard metrics) serves as a proxy for representation quality. This evaluation protocol is standard in self-supervised learning research, where models like SimCLR, DINO, and MAE are evaluated by their linear probing accuracy on ImageNet classification.
Several factors make unsupervised evaluation difficult. The "right" number of clusters is often unknown and domain-dependent. Different algorithms may produce valid but structurally different partitions of the same data. Metrics that work well for compact, spherical clusters (like the silhouette coefficient) may be misleading for clusters with irregular shapes. For generative models, evaluation metrics such as the Fréchet Inception Distance (FID) and Inception Score (IS) measure sample quality but do not fully capture diversity, mode coverage, or usefulness for downstream applications.
Unsupervised learning is one of several machine learning paradigms, each defined by the type of supervision available during training.
| Aspect | Unsupervised learning | Supervised learning | Semi-supervised learning | Self-supervised learning | Reinforcement learning |
|---|---|---|---|---|---|
| Training data | Unlabeled | Labeled (input-output pairs) | Mix of labeled and unlabeled | Unlabeled (labels derived from data) | Environment interactions |
| Supervision signal | None | Human-provided labels | Partial human labels | Automatically generated from data | Reward signal |
| Primary goal | Discover structure | Predict labels/values | Leverage unlabeled data | Learn general representations | Maximize cumulative reward |
| Example tasks | Clustering, density estimation, dimensionality reduction | Classification, regression | Classification with few labels | Masked language modeling, contrastive learning | Game playing, robotics |
| Evaluation | Intrinsic metrics, downstream tasks | Accuracy, precision, recall, F1 | Same as supervised | Linear probing, fine-tuning accuracy | Cumulative reward, win rate |
Supervised learning requires labeled examples and trains a model to generalize from input-output pairs. It is well-suited to tasks where labeled data is abundant, such as image recognition or spam filtering. Evaluation is straightforward because predictions can be compared to ground truth.
Semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data. Methods like self-training, co-training, and label propagation use the structure of the unlabeled data to improve classification accuracy beyond what the labeled data alone can achieve.
Self-supervised learning generates its own labels from the input data and optimizes a prediction objective. As discussed above, it is often classified as a subset of unsupervised learning but has distinct characteristics. Self-supervised pretraining followed by supervised fine-tuning (the "transfer learning" paradigm) has become the dominant approach in NLP and is growing in influence in computer vision.
Reinforcement learning learns through interaction with an environment, receiving reward signals rather than labeled examples. The agent must balance exploration and exploitation to maximize long-term reward.
Despite its broad applicability, unsupervised learning faces several challenges.
Sensitivity to hyperparameters. Most unsupervised algorithms have hyperparameters that significantly affect results. K-means requires choosing k; DBSCAN requires setting ε and MinPts; t-SNE depends on perplexity; autoencoders depend on architecture choices and the size of the latent dimension. Without ground-truth labels, selecting optimal hyperparameters is difficult.
Scalability. Some methods scale poorly to large datasets. Hierarchical clustering has quadratic or cubic time complexity. t-SNE is slow for datasets with more than tens of thousands of points, though approximate methods (Barnes-Hut t-SNE) help. Spectral clustering requires computing eigenvectors of an n-by-n matrix.
Curse of dimensionality. In very high dimensions, distance metrics become less meaningful because all points tend to be roughly equidistant. This affects distance-based methods like k-means and DBSCAN. Dimensionality reduction can mitigate this problem but may discard important information.
Interpretability. The clusters or representations discovered by unsupervised methods may not correspond to meaningful or actionable categories. A clustering algorithm will always produce clusters, but they may not reflect genuine structure in the data. Interpreting and validating the output requires domain expertise.
Evaluation difficulty. As discussed in the Evaluation section, the lack of ground-truth labels makes it difficult to assess whether an unsupervised model has learned useful structure. This complicates model selection and comparison.
Computational cost of generative models. Training GANs, VAEs, and diffusion models on high-resolution data requires substantial computational resources. Diffusion models in particular require many denoising steps during generation, making inference slower than for GANs.