t-distributed stochastic neighbor embedding (t-SNE) is a nonlinear dimensionality reduction technique used primarily for visualizing high-dimensional data in two or three dimensions. Developed by Laurens van der Maaten and Geoffrey Hinton in 2008, t-SNE converts pairwise similarities between data points in a high-dimensional space into joint probabilities, then constructs a similar probability distribution over pairs of points in a low-dimensional map. The algorithm minimizes the Kullback-Leibler divergence between these two distributions using gradient descent, producing visualizations where similar objects appear as nearby points and dissimilar objects appear as distant points.
t-SNE has become one of the most widely used methods for visualizing complex datasets in fields such as bioinformatics, natural language processing, computer vision, and genomics. It is especially popular for visualizing word embeddings, single-cell RNA sequencing data, and the internal representations of deep neural networks. Unlike linear methods such as principal component analysis (PCA), t-SNE can capture nonlinear relationships in data and often reveals cluster structure that linear methods miss.
Imagine you have a giant room full of toys, and each toy has a long list of features: color, size, weight, softness, number of wheels, and dozens more. You want to arrange all the toys on a flat table so that similar toys (like two stuffed animals) end up close together, while very different toys (like a stuffed animal and a toy truck) end up far apart. The problem is that you can only use two directions on the table (left-right and up-down), but the toys were originally described by many different features.
t-SNE is like a clever helper that looks at all those features, figures out which toys are similar to each other, and then carefully places them on the table so that the neighborhoods are preserved. Toys that were close together in the big feature list stay close on the table. It is not perfect: the exact distances on the table do not always match the original distances perfectly, and the sizes of groups can look different from what they really are. But it does a very good job at keeping neighbors together, so you can quickly see clusters of similar toys just by looking at the table.
The predecessor to t-SNE is stochastic neighbor embedding (SNE), introduced by Geoffrey Hinton and Sam Roweis at NeurIPS (then called NIPS) in 2002. SNE was the first algorithm to use probabilistic neighbor relationships for dimensionality reduction. In SNE, a Gaussian distribution is centered on each data point in the high-dimensional space, and the probability that one point would choose another as its neighbor is computed from the density under this Gaussian. A similar distribution is computed in the low-dimensional space, and the algorithm minimizes the Kullback-Leibler divergence between the two sets of conditional probability distributions.
SNE had several limitations. The cost function was asymmetric (using conditional probabilities rather than joint probabilities), which made optimization difficult. More importantly, SNE used Gaussian distributions in both the high-dimensional and low-dimensional spaces, which led to the "crowding problem" where moderately distant points in high dimensions were forced too close together in the low-dimensional map, causing clusters to overlap.
Van der Maaten and Hinton addressed both of these problems in their 2008 paper "Visualizing Data using t-SNE," published in the Journal of Machine Learning Research. They introduced two modifications to SNE: a symmetrized cost function defined over joint rather than conditional probabilities, and a heavy-tailed Student-t distribution in place of the Gaussian for measuring similarities in the low-dimensional map.
The resulting method produced substantially better visualizations than SNE and quickly became one of the most popular dimensionality reduction tools in machine learning.
Several important extensions followed the original t-SNE paper:
| Year | Development | Authors | Contribution |
|---|---|---|---|
| 2009 | Parametric t-SNE | van der Maaten | Used a neural network to learn a parametric mapping, enabling out-of-sample embedding |
| 2013 | Barnes-Hut t-SNE | van der Maaten | Reduced computational complexity from O(n^2) to O(n log n) using tree-based approximations |
| 2019 | FIt-SNE | Linderman et al. | Achieved O(n) complexity using FFT-based interpolation |
| 2019 | openTSNE | Policar et al. | Provided a modular, extensible Python library with support for adding new points to existing embeddings |
| 2019 | opt-SNE | Belkina et al. | Automated hyperparameter selection for large datasets |
The t-SNE algorithm operates in two main stages: constructing a probability distribution over pairs of points in the high-dimensional space, and then optimizing a low-dimensional embedding to match this distribution as closely as possible.
Given a dataset of n points x_1, x_2, ..., x_n in a high-dimensional space, t-SNE first computes pairwise conditional probabilities that measure how similar each pair of points is.
For each pair of points x_i and x_j, the conditional probability p(j|i) represents the probability that x_i would pick x_j as its neighbor, assuming neighbors are chosen in proportion to their probability density under a Gaussian centered at x_i:
p(j|i) = exp(-||x_i - x_j||^2 / (2 * sigma_i^2)) / sum over all k != i of exp(-||x_i - x_k||^2 / (2 * sigma_i^2))
The bandwidth sigma_i of the Gaussian for each point is set individually so that the conditional distribution P_i has a fixed perplexity (described below). Points in denser regions of the data get smaller values of sigma_i, while points in sparser regions get larger values. The value of sigma_i for each point is found using binary search.
The conditional probabilities are then symmetrized into joint probabilities:
p_ij = (p(j|i) + p(i|j)) / (2n)
This symmetrization ensures that p_ij = p_ji for all pairs, which simplifies the gradient computation. It also ensures that each data point makes a meaningful contribution to the cost function, even if the point is an outlier. By convention, p_ii = 0.
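The construction of P can be expressed compactly in NumPy. The following brute-force O(n^2) sketch is purely illustrative (function names and numeric tolerances are arbitrary choices, not from any particular library); it shows the per-point binary search for sigma_i and the final symmetrization:

```python
import numpy as np

def conditional_probs(sq_dists_row, sigma):
    # p(j|i) for one point, given squared distances to all *other* points
    p = np.exp(-sq_dists_row / (2.0 * sigma ** 2))
    return p / p.sum()

def find_sigma(sq_dists_row, target_perplexity, tol=1e-5, max_iter=50):
    # Binary search for sigma_i so that 2^H(P_i) matches the target perplexity
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = (lo + hi) / 2.0
        p = conditional_probs(sq_dists_row, sigma)
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if abs(2.0 ** entropy - target_perplexity) < tol:
            break
        if 2.0 ** entropy > target_perplexity:
            hi = sigma   # too many effective neighbors: shrink the bandwidth
        else:
            lo = sigma   # too few effective neighbors: widen the bandwidth
    return sigma

def joint_probabilities(X, perplexity=30.0):
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P_cond = np.zeros((n, n))
    for i in range(n):
        row = np.delete(sq_dists[i], i)            # exclude the point itself
        sigma = find_sigma(row, perplexity)
        P_cond[i, np.arange(n) != i] = conditional_probs(row, sigma)
    # Symmetrize: p_ij = (p(j|i) + p(i|j)) / (2n); the diagonal stays zero
    return (P_cond + P_cond.T) / (2.0 * n)
```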
In the low-dimensional embedding (typically 2D or 3D), t-SNE defines a second probability distribution Q over all pairs of map points y_1, y_2, ..., y_n. Instead of using a Gaussian, t-SNE uses a Student-t distribution with one degree of freedom (a Cauchy distribution):
q_ij = (1 + ||y_i - y_j||^2)^(-1) / sum over all k != l of (1 + ||y_k - y_l||^2)^(-1)
The use of the Student-t distribution is the defining innovation of t-SNE. The heavier tails of the Student-t distribution compared to the Gaussian allow moderate distances in the high-dimensional space to be represented by larger distances in the low-dimensional map. This directly addresses the crowding problem that plagued the original SNE.
The algorithm minimizes the Kullback-Leibler divergence between the joint distributions P and Q:
C = KL(P || Q) = sum over all i != j of p_ij * log(p_ij / q_ij)
The KL divergence is asymmetric: it penalizes map configurations where nearby points in the high-dimensional space are placed far apart in the low-dimensional map (because p_ij is large but q_ij is small) more heavily than configurations where distant points in high-dimensional space are placed close together. This means t-SNE prioritizes preserving local structure (keeping neighbors together) over global structure (maintaining large-scale distances).
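Continuing the same illustrative sketch, the low-dimensional affinities and the KL cost can be written as:

```python
import numpy as np

def low_dim_affinities(Y):
    # Student-t kernel with one degree of freedom on pairwise map distances
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(inv, 0.0)       # q_ii = 0 by convention
    Q = inv / inv.sum()              # normalize over all pairs k != l
    return Q, inv                    # inv is reused by the gradient below

def kl_divergence(P, Q, eps=1e-12):
    # C = sum over i != j of p_ij * log(p_ij / q_ij); terms with p_ij = 0 vanish
    mask = P > 0
    return np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps)))
```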
The gradient of the cost function with respect to the low-dimensional point y_i is:
dC/dy_i = 4 * sum over j of (p_ij - q_ij) * (y_i - y_j) * (1 + ||y_i - y_j||^2)^(-1)
This gradient has an intuitive interpretation. Each pair of points exerts a "force" on y_i: if p_ij > q_ij (the points should be closer), the force is attractive; if p_ij < q_ij (the points should be farther apart), the force is repulsive. The magnitude depends on the mismatch between p_ij and q_ij and on the current distance in the low-dimensional space.
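Translated directly into NumPy (again as an illustrative brute-force sketch, paired with the low_dim_affinities sketch above, whose second return value supplies the (1 + ||y_i - y_j||^2)^(-1) terms):

```python
import numpy as np

def tsne_gradient(P, Q, Y, inv):
    # dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) * (1 + ||y_i - y_j||^2)^(-1)
    PQ = (P - Q) * inv                   # elementwise mismatch times the t-kernel
    grad = np.zeros_like(Y)
    for i in range(Y.shape[0]):
        grad[i] = 4.0 * np.sum(PQ[i][:, None] * (Y[i] - Y), axis=0)
    return grad
```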
The embedding is optimized using gradient descent with momentum. The positions of the low-dimensional points are initialized randomly (often from a Gaussian distribution with small variance) and then updated iteratively.
Several tricks improve the optimization: early exaggeration, in which all p_ij are multiplied by a constant factor (typically 12) during the first few hundred iterations so that tight, well-separated clusters can form, and a momentum term that is increased (commonly from 0.5 to 0.8) once the exaggeration phase ends. Many implementations also adapt the per-coordinate step sizes during the run.
Typical implementations run for 1,000 iterations total, though convergence should be verified by checking that the KL divergence has stabilized.
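Putting the pieces together, a minimal optimization loop might look like the sketch below. It reuses the helper functions sketched earlier and is intended only to show the structure (early exaggeration, momentum switching), not to be an efficient implementation:

```python
import numpy as np

def run_tsne(P, n_points, n_iter=1000, learning_rate=200.0,
             early_exaggeration=12.0, exaggeration_iters=250):
    # Relies on low_dim_affinities() and tsne_gradient() from the sketches above
    rng = np.random.default_rng(0)
    Y = rng.normal(scale=1e-4, size=(n_points, 2))   # small random initialization
    update = np.zeros_like(Y)
    P_run = P * early_exaggeration                   # early exaggeration phase
    for t in range(n_iter):
        if t == exaggeration_iters:
            P_run = P                                # switch exaggeration off
        momentum = 0.5 if t < exaggeration_iters else 0.8
        Q, inv = low_dim_affinities(Y)
        grad = tsne_gradient(P_run, Q, Y, inv)
        update = momentum * update - learning_rate * grad
        Y = Y + update
    return Y
```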
The crowding problem is the central motivation for using a Student-t distribution in the low-dimensional space. When embedding high-dimensional data into two dimensions, the volume of space available for moderately distant points shrinks dramatically. In a 10-dimensional hypersphere, for example, the volume of a thin shell at moderate distance from the center is much larger relative to the volume of the inner sphere than the corresponding ratio in two dimensions.
This means that even if high-dimensional similarities are computed correctly, a faithful embedding in 2D would require placing moderately distant points unreasonably close to nearby points, causing clusters to collapse together. By using a heavy-tailed distribution in the low-dimensional space, t-SNE allows these moderately distant points to be placed farther apart without incurring a large penalty, effectively spreading out the embedding and giving clusters room to separate.
The Student-t distribution with one degree of freedom was chosen because its heavy tails directly counteract the crowding problem, because the kernel (1 + d^2)^(-1) approaches an inverse-square law for large pairwise distances (making well-separated clusters interact in a nearly scale-invariant way), and because it is cheap to evaluate, involving no exponential. These properties are a large part of why the method works so well in practice.
Perplexity is the single most important hyperparameter in t-SNE. It controls the effective number of neighbors that each point considers, balancing attention between local and global aspects of the data.
Formally, the perplexity of the conditional distribution P_i is defined as:
Perp(P_i) = 2^(H(P_i))
where H(P_i) is the Shannon entropy of P_i in bits:
H(P_i) = -sum over j of p(j|i) * log_2(p(j|i))
A perplexity of k means that the conditional distribution has roughly the same entropy as a uniform distribution over k neighbors. Higher perplexity makes the Gaussian bandwidth sigma_i larger, so each point considers more neighbors; lower perplexity focuses attention on fewer, closer neighbors.
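A tiny illustrative check of this definition (the function name is arbitrary):

```python
import numpy as np

def perplexity(p_row):
    # Perp(P_i) = 2^H(P_i), with H measured in bits; zero entries contribute nothing
    p = p_row[p_row > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

# A uniform distribution over k neighbors has perplexity exactly k
k = 10
print(perplexity(np.full(k, 1.0 / k)))   # -> 10.0 (up to floating-point error)
```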
Van der Maaten and Hinton recommended perplexity values between 5 and 50, noting that the algorithm is "fairly robust to changes in the perplexity."
In practice, the perplexity must always be smaller than the number of data points. Setting it too low can produce noisy, fragmented visualizations where random groupings appear as false clusters; setting it too high can blur meaningful local structure.
Beyond perplexity, several other hyperparameters influence the quality of t-SNE visualizations.
| Parameter | Typical value | Effect |
|---|---|---|
| Perplexity | 5 to 50 | Controls the effective number of neighbors; balances local vs. global structure |
| Learning rate | 200 (or n/12) | Step size for gradient descent; too small causes slow convergence, too large causes instability |
| Number of iterations | 1,000 | Total optimization steps; too few produces distorted shapes |
| Early exaggeration factor | 12 | Multiplier applied to p_ij during early iterations to encourage cluster formation |
| Early exaggeration iterations | 250 | Number of iterations with exaggeration active |
| Momentum (early) | 0.5 | Momentum during early exaggeration phase |
| Momentum (late) | 0.8 | Momentum after early exaggeration |
| Initialization | Random or PCA | Starting positions; PCA initialization often improves global structure |
| Metric | Euclidean | Distance function in the input space; can be changed to cosine, Manhattan, etc. |
The learning rate (sometimes called epsilon) controls how much the embedding points move at each iteration. If the learning rate is too low, the optimization converges slowly and may get stuck in poor local minima. If it is too high, the points oscillate and the embedding never stabilizes. The scikit-learn documentation recommends a learning rate between 10 and 1,000, with 200 as the default. Belkina et al. (2019) proposed setting the learning rate to n/early_exaggeration_factor as a heuristic that works well across dataset sizes.
The initial positions of the low-dimensional points can significantly affect the result. Random initialization (sampling from a small Gaussian) is the traditional approach, but PCA-based initialization (projecting the data onto its first two principal components) often produces better results because it provides a reasonable starting configuration that already captures some of the global structure. Recent best-practice recommendations favor PCA initialization.
The original t-SNE algorithm has O(n^2) time and space complexity because it computes pairwise similarities and forces between all n data points. This makes it impractical for datasets with more than a few thousand points. Several approximations have been developed to scale t-SNE to larger datasets.
Van der Maaten (2014) adapted the Barnes-Hut algorithm, originally developed for N-body gravitational simulations, to approximate the repulsive forces in t-SNE. The key insight is that the repulsive force between two distant points in the embedding depends on (1 + ||y_i - y_j||^2)^(-1), which follows an inverse-square-like law. Groups of distant points can be approximated as a single point (their center of mass), just as in gravitational simulations.
The algorithm uses a quad-tree (in 2D) or oct-tree (in 3D) to hierarchically partition the embedding space. For each point, repulsive forces from distant groups of points are approximated using the tree structure, while forces from nearby points are computed exactly. This reduces the complexity of each gradient computation from O(n^2) to O(n log n).
For the attractive forces, Barnes-Hut t-SNE uses vantage-point trees to efficiently compute approximate nearest neighbors, computing only the k nearest neighbors for each point rather than all pairwise distances. This makes the attractive force computation O(n * k) rather than O(n^2).
Barnes-Hut t-SNE is the default algorithm in scikit-learn's implementation and can handle datasets with tens of thousands of points.
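In scikit-learn, the approximation is controlled through the method and angle parameters (the values below are illustrative; angle is the Barnes-Hut accuracy/speed trade-off, often called theta, with larger values being faster but coarser):

```python
from sklearn.manifold import TSNE
import numpy as np

X = np.random.randn(20000, 50)

# method='barnes_hut' is the scikit-learn default and supports only 2 or 3 components
tsne = TSNE(n_components=2, method="barnes_hut", angle=0.5, perplexity=30)
X_2d = tsne.fit_transform(X)
```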
Linderman, Rachh, Hoskins, Steinerberger, and Kluger (2019) introduced FIt-SNE (Fast Interpolation-based t-SNE), which further reduces the complexity to O(n). Instead of using tree-based approximations for the repulsive forces, FIt-SNE interpolates the forces onto a regular grid and uses the fast Fourier transform (FFT) to compute the convolution efficiently.
The key steps in FIt-SNE are: laying an equispaced grid over the current embedding, evaluating the repulsive kernel at the grid nodes, computing the resulting convolution with the FFT, and interpolating the forces from the grid back onto the individual embedding points.
FIt-SNE also replaces vantage-point trees with the Annoy (Approximate Nearest Neighbors Oh Yeah) library for faster nearest-neighbor computation. The result is an algorithm that can embed datasets with millions of points in minutes rather than hours.
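In openTSNE, the FFT-based approximation can be selected explicitly. A minimal sketch (parameter values are illustrative, and option names may differ slightly between versions):

```python
from openTSNE import TSNE
import numpy as np

X = np.random.randn(50000, 50).astype(np.float32)

# negative_gradient_method='fft' selects the interpolation + FFT approximation
# of the repulsive forces, in the spirit of FIt-SNE
tsne = TSNE(negative_gradient_method="fft", n_jobs=-1)
embedding = tsne.fit(X)
```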
Several GPU implementations exist for even faster computation:
| Implementation | Platform | Speedup | Notes |
|---|---|---|---|
| t-SNE-CUDA | NVIDIA GPUs | Up to 1,200x vs. scikit-learn | Based on FIt-SNE, Python bindings with scikit-learn-style API |
| RAPIDS cuML | NVIDIA GPUs | Up to 2,000x vs. scikit-learn | Drop-in replacement for scikit-learn's TSNE |
| TensorBoard projector | Browser | Interactive | Part of TensorFlow ecosystem |
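As an illustration of how the cuML implementation mirrors the scikit-learn API, a hypothetical usage sketch (requires an NVIDIA GPU and the RAPIDS libraries, and is not verified here):

```python
# RAPIDS cuML exposes a TSNE class with a scikit-learn-style interface
from cuml.manifold import TSNE
import numpy as np

X = np.random.randn(500_000, 50).astype(np.float32)
tsne = TSNE(n_components=2, perplexity=30)
X_2d = tsne.fit_transform(X)   # runs on the GPU
```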
The flexibility of t-SNE makes it a powerful visualization tool, but it also means that the resulting plots can be misleading if interpreted carelessly. Several common pitfalls have been identified by Wattenberg, Viegas, and Johnson (2016) in their interactive article "How to Use t-SNE Effectively."
t-SNE tends to equalize the apparent sizes of clusters. A cluster that is highly dispersed in the original high-dimensional space and a cluster that is tightly packed can appear similar in size in a t-SNE plot. This happens because the algorithm adjusts the bandwidth sigma_i independently for each point based on the local density, effectively normalizing local distances. As a result, the visual size of a cluster in a t-SNE plot does not reliably indicate its actual spread or variance.
The distances between clusters in a t-SNE plot should not be interpreted as reflecting true inter-cluster distances in the original space. Two clusters that appear far apart may not be more different than two clusters that appear closer together. At low perplexity values, the algorithm focuses almost exclusively on local neighborhoods and may place clusters at arbitrary relative positions. At higher perplexity values, some global structure may be preserved, but this is not guaranteed.
t-SNE can produce apparent cluster structure even in random, uniformly distributed data, especially at low perplexity values. This means that observing clusters in a t-SNE plot is not sufficient evidence that genuine clusters exist in the data. The visualization should always be validated using other methods, such as statistical clustering algorithms or domain-specific analysis.
Different perplexity values can produce qualitatively different visualizations of the same dataset. A visualization that shows clear clusters at one perplexity value may show a featureless blob at another. Best practice is to examine multiple perplexity values and look for patterns that are consistent across settings. If a structure disappears at higher perplexity, it may reflect local noise rather than genuine structure.
Because t-SNE uses random initialization and a non-convex objective function, different runs on the same data can produce different-looking plots. Features that appear consistently across multiple runs are more likely to reflect real data structure. Running t-SNE several times and comparing results is recommended practice.
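A simple robustness check is to sweep both the perplexity and the random seed and compare the resulting maps. A sketch with scikit-learn (the parameter grids are illustrative):

```python
from sklearn.manifold import TSNE
import numpy as np

X = np.random.randn(2000, 50)

# Structure that persists across perplexities and seeds is more likely to be real
embeddings = {}
for perplexity in (5, 30, 50):
    for seed in (0, 1, 2):
        tsne = TSNE(n_components=2, perplexity=perplexity,
                    init="random", random_state=seed)
        embeddings[(perplexity, seed)] = tsne.fit_transform(X)
```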
If the optimization is stopped too early, the embedding may not have converged, producing distorted shapes. A common sign of premature stopping is the presence of "pinched" or "compressed" shapes in the plot. The optimization should run until the KL divergence has stabilized, which typically requires at least 1,000 iterations.
PCA is a linear method that finds orthogonal directions of maximum variance in the data. It is fast (O(n * d^2) for n points in d dimensions), deterministic, and preserves global structure well. However, PCA cannot capture nonlinear relationships and often fails to reveal cluster structure in complex data.
t-SNE is nonlinear, stochastic, and much more computationally expensive, but it excels at revealing local cluster structure. A common workflow is to use PCA to reduce high-dimensional data to 50 dimensions first (to remove noise and speed up computation), then apply t-SNE for the final 2D visualization.
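A minimal sketch of this PCA-then-t-SNE workflow with scikit-learn (the choice of 50 components is a common convention, not a requirement):

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import numpy as np

X = np.random.randn(5000, 1000)

# PCA to ~50 dimensions removes noise and speeds up the subsequent t-SNE
X_50 = PCA(n_components=50).fit_transform(X)
X_2d = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(X_50)
```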
UMAP (Uniform Manifold Approximation and Projection), introduced by McInnes, Healy, and Melville in 2018, is the most common alternative to t-SNE for nonlinear dimensionality reduction. The two methods share many similarities but differ in important ways.
| Aspect | t-SNE | UMAP |
|---|---|---|
| Theoretical basis | Information theory (KL divergence) | Riemannian geometry and algebraic topology |
| Global structure preservation | Weak; focuses on local neighborhoods | Better; preserves both local and some global structure |
| Computational speed | Slower (O(n log n) with Barnes-Hut) | Faster, especially for large datasets |
| Scalability | Millions of points with FIt-SNE | Scales well to millions natively |
| Embedding new points | Requires recomputation (non-parametric) | Supports transform on new data |
| Determinism | Non-deterministic (random initialization) | Non-deterministic (random initialization) |
| Hyperparameter sensitivity | Sensitive (perplexity, learning rate) | Sensitive (n_neighbors, min_dist) |
| Continuous structure | Tends to break continua into clusters | Better at preserving continuous manifold structure |
| Cluster separation | Often produces tighter, more distinct clusters | Clusters may be less tightly separated |
In practice, UMAP is often preferred for exploratory analysis of large datasets because of its speed and better global structure preservation. t-SNE remains popular for producing publication-quality visualizations of moderate-sized datasets where fine local structure is the priority.
| Method | Type | Preserves | Scalability | Key limitation |
|---|---|---|---|---|
| t-SNE | Probabilistic | Local neighborhoods | O(n log n) with Barnes-Hut | No global structure guarantee |
| UMAP | Topological | Local and some global | O(n) approximate | Sensitive to hyperparameters |
| Isomap | Geodesic | Global (geodesic distances) | O(n^3) for shortest paths | Fails with non-convex manifolds |
| LLE (Locally Linear Embedding) | Linear reconstruction | Local geometry | O(n^2) | Sensitive to noise, requires connected manifold |
| Laplacian Eigenmaps | Spectral | Local connectivity | O(n^2) for graph construction | Requires choosing neighborhood size |
| MDS (Multidimensional scaling) | Distance-based | All pairwise distances | O(n^3) for classical MDS | Linear; does not capture nonlinear structure |
| Autoencoders | Neural network | Learned representation | O(n) per epoch | Requires architecture design and training |
t-SNE has found widespread use across many domains. Below are some of the most common applications.
The most prominent application of t-SNE is in single-cell RNA sequencing (scRNA-seq) analysis. Single-cell experiments can measure the expression levels of thousands of genes in hundreds of thousands or millions of individual cells. t-SNE is used to visualize these high-dimensional expression profiles in 2D, revealing distinct cell types as clusters. Kobak and Berens (2019) provided detailed guidelines for using t-SNE with single-cell data, recommending PCA preprocessing, perplexity values proportional to the dataset size, and the use of FIt-SNE for large datasets.
t-SNE is commonly used to visualize word embeddings from models such as Word2Vec, GloVe, and contextual embeddings from transformers. When word vectors are projected into 2D using t-SNE, semantically similar words cluster together. For example, country names, color terms, or verb forms tend to form visually distinct groups. TensorFlow's Embedding Projector uses t-SNE as one of its primary visualization methods.
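As a toy illustration of this use case, the snippet below projects a handful of stand-in word vectors and labels them; with real Word2Vec or GloVe vectors (the random vectors here are placeholders), semantically related words would land near each other:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

# Placeholder vectors standing in for real 300-dimensional word embeddings
words = ["red", "green", "blue", "paris", "london", "tokyo"]
vectors = np.random.randn(len(words), 300)

# Perplexity must be smaller than the number of points
coords = TSNE(n_components=2, perplexity=2, init="pca",
              random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
```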
t-SNE is used to visualize the feature representations learned by convolutional neural networks (CNNs). By applying t-SNE to the activations of a layer (often the penultimate fully connected layer), researchers can see how the network organizes images internally. Images of the same class should cluster together if the network has learned effective representations.
t-SNE has been applied to network traffic analysis for intrusion detection. By visualizing high-dimensional network flow features, analysts can identify clusters of normal traffic and spot anomalous patterns that may indicate attacks.
In cancer biology, t-SNE is used to visualize flow cytometry and mass cytometry (CyTOF) data, where each cell is characterized by the expression of dozens of protein markers. t-SNE plots help identify distinct cell populations in tumor microenvironments and track how these populations change in response to treatment.
t-SNE has been used to visualize audio features extracted from music tracks, revealing genre clusters, artist similarities, and temporal patterns within musical datasets.
t-SNE has several well-known limitations that practitioners should consider.
t-SNE learns a non-parametric embedding, meaning it does not produce an explicit function that maps from the high-dimensional input space to the low-dimensional output space. This has two practical consequences: new data points cannot be placed into an existing embedding without rerunning the optimization, and the embedding cannot be reused as a learned transformation, for example to project a held-out test set in a machine-learning pipeline.
Parametric t-SNE (van der Maaten, 2009) addresses the first limitation by training a neural network to approximate the t-SNE mapping, but this variant is less commonly used.
Even with Barnes-Hut approximation, t-SNE is significantly more expensive than linear methods like PCA. For very large datasets (millions of points), the computation can take hours even with optimized implementations. FIt-SNE and GPU-accelerated implementations have largely addressed this issue, but they require additional software dependencies.
t-SNE is designed for reducing data to 2 or 3 dimensions for visualization purposes. It is not appropriate as a general-purpose dimensionality reduction method for downstream tasks such as classification or clustering, because the embedding does not preserve distances or densities in a consistent way.
In very high-dimensional spaces, Euclidean distances lose their discriminative power (the distances between all pairs of points tend to become similar). This can degrade the quality of t-SNE embeddings. A common mitigation is to first reduce the dimensionality to 50 or so using PCA before applying t-SNE.
As discussed in the interpretation pitfalls section, t-SNE results can vary significantly with different hyperparameter settings. There is no single set of hyperparameters that works well for all datasets, and manual tuning or automated approaches like opt-SNE are often necessary.
The t-SNE objective function is non-convex, so the optimization can converge to different local minima depending on the random initialization. This non-determinism means that two runs on the same data with the same hyperparameters may produce visually different results.
t-SNE is available in many popular machine learning libraries.
| Library | Language | Algorithm | Notes |
|---|---|---|---|
| scikit-learn | Python | Barnes-Hut, exact | sklearn.manifold.TSNE; the most widely used implementation |
| openTSNE | Python | FIt-SNE, Barnes-Hut | Modular, supports adding new points to existing embeddings |
| FIt-SNE | C++ with Python bindings | FFT-accelerated | Fastest CPU implementation for large datasets |
| t-SNE-CUDA | Python/CUDA | GPU-accelerated FIt-SNE | Up to 1,200x speedup over scikit-learn |
| RAPIDS cuML | Python/CUDA | GPU-accelerated | Drop-in scikit-learn replacement |
| Rtsne | R | Barnes-Hut | R wrapper around C++ implementation |
| MulticoreTSNE | Python/C++ | Barnes-Hut (parallel) | Multi-threaded CPU implementation |
| TensorBoard | JavaScript | Interactive | Part of TensorFlow; runs in the browser |
| ELKI | Java | Barnes-Hut | General-purpose data mining framework |
| TSne.jl | Julia | Barnes-Hut | Julia implementation |
A basic usage example with scikit-learn:

```python
from sklearn.manifold import TSNE
import numpy as np

# X is an array of shape (n_samples, n_features)
X = np.random.randn(1000, 50)

# Create and fit t-SNE
tsne = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate='auto',
    init='pca',
    n_iter=1000,        # renamed max_iter in recent scikit-learn versions
    random_state=42,
)
X_embedded = tsne.fit_transform(X)
# X_embedded is now (1000, 2)
```
A workflow with openTSNE, including PCA preprocessing and the embedding of new points into an existing map:

```python
from openTSNE import TSNE
from sklearn.decomposition import PCA
import numpy as np

# Reduce to 50 dimensions with PCA first
X = np.random.randn(10000, 784)
pca = PCA(n_components=50).fit(X)
X_pca = pca.transform(X)

# Run t-SNE with the FFT-accelerated (FIt-SNE style) backend
tsne = TSNE(
    n_components=2,
    perplexity=30,
    initialization='pca',
    n_jobs=-1,
)
embedding = tsne.fit(X_pca)

# Add new points without recomputing the embedding; note that the new points
# must be projected with the PCA fitted on the original data, not a new PCA
new_points = np.random.randn(100, 784)
new_pca = pca.transform(new_points)
new_embedding = embedding.transform(new_pca)
```
Based on the accumulated experience of the research community, the following guidelines help produce reliable t-SNE visualizations:

- Reduce very high-dimensional data to roughly 50 dimensions with PCA before running t-SNE.
- Prefer PCA initialization over random initialization to retain more global structure.
- Try several perplexity values (scaling the perplexity with dataset size for large datasets) and trust only structure that persists across them.
- Run the algorithm with multiple random seeds and compare the results.
- Let the optimization run until the KL divergence has stabilized, typically at least 1,000 iterations.
- Scale the learning rate to the dataset size, for example n/12 or learning_rate='auto' in scikit-learn.
- Do not read cluster sizes or inter-cluster distances as quantitative measurements, and validate apparent clusters with independent methods.