t-distributed stochastic neighbor embedding (t-SNE) is a nonlinear dimensionality reduction technique used primarily for visualizing high-dimensional data in two or three dimensions. Developed by Laurens van der Maaten and Geoffrey Hinton in 2008, t-SNE converts pairwise similarities between data points in a high-dimensional space into joint probabilities, then constructs a similar probability distribution over pairs of points in a low-dimensional map. The algorithm minimizes the Kullback-Leibler divergence between these two distributions using gradient descent, producing visualizations where similar objects appear as nearby points and dissimilar objects appear as distant points.
t-SNE has become one of the most widely used methods for visualizing complex datasets in fields such as bioinformatics, natural language processing, computer vision, and genomics. It is especially popular for visualizing word embeddings, single-cell RNA sequencing data, and the internal representations of deep neural networks. Unlike linear methods such as principal component analysis (PCA), t-SNE can capture nonlinear relationships in data and often reveals cluster structure that linear methods miss.
Imagine you have a giant room full of toys, and each toy has a long list of features: color, size, weight, softness, number of wheels, and dozens more. You want to arrange all the toys on a flat table so that similar toys (like two stuffed animals) end up close together, while very different toys (like a stuffed animal and a toy truck) end up far apart. The problem is that you can only use two directions on the table (left-right and up-down), but the toys were originally described by many different features.
t-SNE is like a clever helper that looks at all those features, figures out which toys are similar to each other, and then carefully places them on the table so that the neighborhoods are preserved. Toys that were close together in the big feature list stay close on the table. It is not perfect: the exact distances on the table do not always match the original distances perfectly, and the sizes of groups can look different from what they really are. But it does a very good job at keeping neighbors together, so you can quickly see clusters of similar toys just by looking at the table.
The predecessor to t-SNE is stochastic neighbor embedding (SNE), introduced by Geoffrey Hinton and Sam Roweis at NeurIPS (then called NIPS) in 2002. SNE was the first algorithm to use probabilistic neighbor relationships for dimensionality reduction. In SNE, a Gaussian distribution is centered on each data point in the high-dimensional space, and the probability that one point would choose another as its neighbor is computed from the density under this Gaussian. A similar distribution is computed in the low-dimensional space, and the algorithm minimizes the Kullback-Leibler divergence between the two sets of conditional probability distributions.
SNE had several limitations. The cost function was asymmetric (using conditional probabilities rather than joint probabilities), which made optimization difficult. More importantly, SNE used Gaussian distributions in both the high-dimensional and low-dimensional spaces, which led to the "crowding problem" where moderately distant points in high dimensions were forced too close together in the low-dimensional map, causing clusters to overlap.
Van der Maaten and Hinton addressed both of these problems in their 2008 paper "Visualizing Data using t-SNE," published in the Journal of Machine Learning Research. They introduced two modifications to SNE: a symmetrized cost function defined over joint rather than conditional probabilities, and a heavy-tailed Student-t distribution in place of the Gaussian for measuring similarities in the low-dimensional map.
The resulting method produced substantially better visualizations than SNE and quickly became one of the most popular dimensionality reduction tools in machine learning.
Several important extensions followed the original t-SNE paper:
| Year | Development | Authors | Contribution |
|---|---|---|---|
| 2009 | Parametric t-SNE | van der Maaten | Used a neural network to learn a parametric mapping, enabling out-of-sample embedding |
| 2013 | Barnes-Hut t-SNE | van der Maaten | Reduced computational complexity from O(n^2) to O(n log n) using tree-based approximations |
| 2019 | FIt-SNE | Linderman et al. | Achieved O(n) complexity using FFT-based interpolation |
| 2019 | openTSNE | Policar et al. | Provided a modular, extensible Python library with support for adding new points to existing embeddings |
| 2019 | opt-SNE | Belkina et al. | Automated hyperparameter selection for large datasets |
The t-SNE algorithm operates in two main stages: constructing a probability distribution over pairs of points in the high-dimensional space, and then optimizing a low-dimensional embedding to match this distribution as closely as possible.
Given a dataset of n points x_1, x_2, ..., x_n in a high-dimensional space, t-SNE first computes pairwise conditional probabilities that measure how similar each pair of points is.
For each pair of points x_i and x_j, the conditional probability p(j|i) represents the probability that x_i would pick x_j as its neighbor, assuming neighbors are chosen in proportion to their probability density under a Gaussian centered at x_i:
p(j|i) = exp(-||x_i - x_j||^2 / (2 * sigma_i^2)) / sum over all k != i of exp(-||x_i - x_k||^2 / (2 * sigma_i^2))
The bandwidth sigma_i of the Gaussian for each point is set individually so that the conditional distribution P_i has a fixed perplexity (described below). Points in denser regions of the data get smaller values of sigma_i, while points in sparser regions get larger values. The value of sigma_i for each point is found using binary search.
The conditional probabilities are then symmetrized into joint probabilities:
p_ij = (p(j|i) + p(i|j)) / (2n)
This symmetrization ensures that p_ij = p_ji for all pairs, which simplifies the gradient computation. It also ensures that each data point makes a meaningful contribution to the cost function, even if the point is an outlier. By convention, p_ii = 0.
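The construction of P can be expressed compactly in NumPy. The following brute-force O(n^2) sketch is purely illustrative (function names and numeric tolerances are arbitrary choices, not from any particular library); it shows the per-point binary search for sigma_i and the final symmetrization:

```python
import numpy as np

def conditional_probs(sq_dists_row, sigma):
    # p(j|i) for one point, given squared distances to all *other* points
    p = np.exp(-sq_dists_row / (2.0 * sigma ** 2))
    return p / p.sum()

def find_sigma(sq_dists_row, target_perplexity, tol=1e-5, max_iter=50):
    # Binary search for sigma_i so that 2^H(P_i) matches the target perplexity
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = (lo + hi) / 2.0
        p = conditional_probs(sq_dists_row, sigma)
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if abs(2.0 ** entropy - target_perplexity) < tol:
            break
        if 2.0 ** entropy > target_perplexity:
            hi = sigma   # too many effective neighbors: shrink the bandwidth
        else:
            lo = sigma   # too few effective neighbors: widen the bandwidth
    return sigma

def joint_probabilities(X, perplexity=30.0):
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P_cond = np.zeros((n, n))
    for i in range(n):
        row = np.delete(sq_dists[i], i)            # exclude the point itself
        sigma = find_sigma(row, perplexity)
        P_cond[i, np.arange(n) != i] = conditional_probs(row, sigma)
    # Symmetrize: p_ij = (p(j|i) + p(i|j)) / (2n); the diagonal stays zero
    return (P_cond + P_cond.T) / (2.0 * n)
```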
In the low-dimensional embedding (typically 2D or 3D), t-SNE defines a second probability distribution Q over all pairs of map points y_1, y_2, ..., y_n. Instead of using a Gaussian, t-SNE uses a Student-t distribution with one degree of freedom (a Cauchy distribution):
q_ij = (1 + ||y_i - y_j||^2)^(-1) / sum over all k != l of (1 + ||y_k - y_l||^2)^(-1)
The use of the Student-t distribution is the defining innovation of t-SNE. The heavier tails of the Student-t distribution compared to the Gaussian allow moderate distances in the high-dimensional space to be represented by larger distances in the low-dimensional map. This directly addresses the crowding problem that plagued the original SNE.
The algorithm minimizes the Kullback-Leibler divergence between the joint distributions P and Q:
C = KL(P || Q) = sum over all i != j of p_ij * log(p_ij / q_ij)
The KL divergence is asymmetric: it penalizes map configurations where nearby points in the high-dimensional space are placed far apart in the low-dimensional map (because p_ij is large but q_ij is small) more heavily than configurations where distant points in high-dimensional space are placed close together. This means t-SNE prioritizes preserving local structure (keeping neighbors together) over global structure (maintaining large-scale distances).
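Continuing the same illustrative sketch, the low-dimensional affinities and the KL cost can be written as:

```python
import numpy as np

def low_dim_affinities(Y):
    # Student-t kernel with one degree of freedom on pairwise map distances
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(inv, 0.0)       # q_ii = 0 by convention
    Q = inv / inv.sum()              # normalize over all pairs k != l
    return Q, inv                    # inv is reused by the gradient below

def kl_divergence(P, Q, eps=1e-12):
    # C = sum over i != j of p_ij * log(p_ij / q_ij); terms with p_ij = 0 vanish
    mask = P > 0
    return np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps)))
```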
The gradient of the cost function with respect to the low-dimensional point y_i is:
dC/dy_i = 4 * sum over j of (p_ij - q_ij) * (y_i - y_j) * (1 + ||y_i - y_j||^2)^(-1)
This gradient has an intuitive interpretation. Each pair of points exerts a "force" on y_i: if p_ij > q_ij (the points should be closer), the force is attractive; if p_ij < q_ij (the points should be farther apart), the force is repulsive. The magnitude depends on the mismatch between p_ij and q_ij and on the current distance in the low-dimensional space.
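Translated directly into NumPy (again as an illustrative brute-force sketch, paired with the low_dim_affinities sketch above, whose second return value supplies the (1 + ||y_i - y_j||^2)^(-1) terms):

```python
import numpy as np

def tsne_gradient(P, Q, Y, inv):
    # dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) * (1 + ||y_i - y_j||^2)^(-1)
    PQ = (P - Q) * inv                   # elementwise mismatch times the t-kernel
    grad = np.zeros_like(Y)
    for i in range(Y.shape[0]):
        grad[i] = 4.0 * np.sum(PQ[i][:, None] * (Y[i] - Y), axis=0)
    return grad
```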
The embedding is optimized using gradient descent with momentum. The positions of the low-dimensional points are initialized randomly (often from a Gaussian distribution with small variance) and then updated iteratively.
Several tricks improve the optimization: early exaggeration, in which all p_ij are multiplied by a constant factor (typically 12) during the first few hundred iterations so that tight, well-separated clusters can form, and a momentum term that is increased (commonly from 0.5 to 0.8) once the exaggeration phase ends. Many implementations also adapt the per-coordinate step sizes during the run.
Typical implementations run for 1,000 iterations total, though convergence should be verified by checking that the KL divergence has stabilized.
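Putting the pieces together, a minimal optimization loop might look like the sketch below. It reuses the helper functions sketched earlier and is intended only to show the structure (early exaggeration, momentum switching), not to be an efficient implementation:

```python
import numpy as np

def run_tsne(P, n_points, n_iter=1000, learning_rate=200.0,
             early_exaggeration=12.0, exaggeration_iters=250):
    # Relies on low_dim_affinities() and tsne_gradient() from the sketches above
    rng = np.random.default_rng(0)
    Y = rng.normal(scale=1e-4, size=(n_points, 2))   # small random initialization
    update = np.zeros_like(Y)
    P_run = P * early_exaggeration                   # early exaggeration phase
    for t in range(n_iter):
        if t == exaggeration_iters:
            P_run = P                                # switch exaggeration off
        momentum = 0.5 if t < exaggeration_iters else 0.8
        Q, inv = low_dim_affinities(Y)
        grad = tsne_gradient(P_run, Q, Y, inv)
        update = momentum * update - learning_rate * grad
        Y = Y + update
    return Y
```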
The crowding problem is the central motivation for using a Student-t distribution in the low-dimensional space. When embedding high-dimensional data into two dimensions, the volume of space available for moderately distant points shrinks dramatically. In a 10-dimensional hypersphere, for example, the volume of a thin shell at moderate distance from the center is much larger relative to the volume of the inner sphere than the corresponding ratio in two dimensions.
This means that even if high-dimensional similarities are computed correctly, a faithful embedding in 2D would require placing moderately distant points unreasonably close to nearby points, causing clusters to collapse together. By using a heavy-tailed distribution in the low-dimensional space, t-SNE allows these moderately distant points to be placed farther apart without incurring a large penalty, effectively spreading out the embedding and giving clusters room to separate.
The Student-t distribution with one degree of freedom was chosen because its heavy tails directly counteract the crowding problem, because the kernel (1 + d^2)^(-1) approaches an inverse-square law for large pairwise distances (making well-separated clusters interact in a nearly scale-invariant way), and because it is cheap to evaluate, involving no exponential. These properties are a large part of why the method works so well in practice.
Perplexity is the single most important hyperparameter in t-SNE. It controls the effective number of neighbors that each point considers, balancing attention between local and global aspects of the data.
Formally, the perplexity of the conditional distribution P_i is defined as:
Perp(P_i) = 2^(H(P_i))
where H(P_i) is the Shannon entropy of P_i in bits:
H(P_i) = -sum over j of p(j|i) * log_2(p(j|i))
A perplexity of k means that the conditional distribution has roughly the same entropy as a uniform distribution over k neighbors. Higher perplexity makes the Gaussian bandwidth sigma_i larger, so each point considers more neighbors; lower perplexity focuses attention on fewer, closer neighbors.
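A tiny illustrative check of this definition (the function name is arbitrary):

```python
import numpy as np

def perplexity(p_row):
    # Perp(P_i) = 2^H(P_i), with H measured in bits; zero entries contribute nothing
    p = p_row[p_row > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

# A uniform distribution over k neighbors has perplexity exactly k
k = 10
print(perplexity(np.full(k, 1.0 / k)))   # -> 10.0 (up to floating-point error)
```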
Van der Maaten and Hinton recommended perplexity values between 5 and 50, noting that the algorithm is "fairly robust to changes in the perplexity."
In practice, the perplexity must always be smaller than the number of data points. Setting it too low can produce noisy, fragmented visualizations where random groupings appear as false clusters; setting it too high can blur meaningful local structure.
Beyond perplexity, several other hyperparameters influence the quality of t-SNE visualizations.
| Parameter | Typical value | Effect |
|---|---|---|
| Perplexity | 5 to 50 | Controls the effective number of neighbors; balances local vs. global structure |
| Learning rate | 200 (or n/12) | Step size for gradient descent; too small causes slow convergence, too large causes instability |
| Number of iterations | 1,000 | Total optimization steps; too few produces distorted shapes |
| Early exaggeration factor | 12 | Multiplier applied to p_ij during early iterations to encourage cluster formation |
| Early exaggeration iterations | 250 | Number of iterations with exaggeration active |
| Momentum (early) | 0.5 | Momentum during early exaggeration phase |
| Momentum (late) | 0.8 | Momentum after early exaggeration |
| Initialization | Random or PCA | Starting positions; PCA initialization often improves global structure |
| Metric | Euclidean | Distance function in the input space; can be changed to cosine, Manhattan, etc. |
The learning rate (sometimes called epsilon) controls how much the embedding points move at each iteration. If the learning rate is too low, the optimization converges slowly and may get stuck in poor local minima. If it is too high, the points oscillate and the embedding never stabilizes. The scikit-learn documentation recommends a learning rate between 10 and 1,000, with 200 as the default. Belkina et al. (2019) proposed setting the learning rate to n/early_exaggeration_factor as a heuristic that works well across dataset sizes.
The initial positions of the low-dimensional points can significantly affect the result. Random initialization (sampling from a small Gaussian) is the traditional approach, but PCA-based initialization (projecting the data onto its first two principal components) often produces better results because it provides a reasonable starting configuration that already captures some of the global structure. Recent best-practice recommendations favor PCA initialization.
The original t-SNE algorithm has O(n^2) time and space complexity because it computes pairwise similarities and forces between all n data points. This makes it impractical for datasets with more than a few thousand points. Several approximations have been developed to scale t-SNE to larger datasets.
Van der Maaten (2014) adapted the Barnes-Hut algorithm, originally developed for N-body gravitational simulations, to approximate the repulsive forces in t-SNE. The key insight is that the repulsive force between two distant points in the embedding depends on (1 + ||y_i - y_j||^2)^(-1), which follows an inverse-square-like law. Groups of distant points can be approximated as a single point (their center of mass), just as in gravitational simulations.
The algorithm uses a quad-tree (in 2D) or oct-tree (in 3D) to hierarchically partition the embedding space. For each point, repulsive forces from distant groups of points are approximated using the tree structure, while forces from nearby points are computed exactly. This reduces the complexity of each gradient computation from O(n^2) to O(n log n).
For the attractive forces, Barnes-Hut t-SNE uses vantage-point trees to efficiently compute approximate nearest neighbors, computing only the k nearest neighbors for each point rather than all pairwise distances. This makes the attractive force computation O(n * k) rather than O(n^2).
Barnes-Hut t-SNE is the default algorithm in scikit-learn's implementation and can handle datasets with tens of thousands of points.
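In scikit-learn, the approximation is controlled through the method and angle parameters (the values below are illustrative; angle is the Barnes-Hut accuracy/speed trade-off, often called theta, with larger values being faster but coarser):

```python
from sklearn.manifold import TSNE
import numpy as np

X = np.random.randn(20000, 50)

# method='barnes_hut' is the scikit-learn default and supports only 2 or 3 components
tsne = TSNE(n_components=2, method="barnes_hut", angle=0.5, perplexity=30)
X_2d = tsne.fit_transform(X)
```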
Linderman, Rachh, Hoskins, Steinerberger, and Kluger (2019) introduced FIt-SNE (Fast Interpolation-based t-SNE), which further reduces the complexity to O(n). Instead of using tree-based approximations for the repulsive forces, FIt-SNE interpolates the forces onto a regular grid and uses the fast Fourier transform (FFT) to compute the convolution efficiently.
The key steps in FIt-SNE are: laying an equispaced grid over the current embedding, evaluating the repulsive kernel at the grid nodes, computing the resulting convolution with the FFT, and interpolating the forces from the grid back onto the individual embedding points.
FIt-SNE also replaces vantage-point trees with the Annoy (Approximate Nearest Neighbors Oh Yeah) library for faster nearest-neighbor computation. The result is an algorithm that can embed datasets with millions of points in minutes rather than hours.
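In openTSNE, the FFT-based approximation can be selected explicitly. A minimal sketch (parameter values are illustrative, and option names may differ slightly between versions):

```python
from openTSNE import TSNE
import numpy as np

X = np.random.randn(50000, 50).astype(np.float32)

# negative_gradient_method='fft' selects the interpolation + FFT approximation
# of the repulsive forces, in the spirit of FIt-SNE
tsne = TSNE(negative_gradient_method="fft", n_jobs=-1)
embedding = tsne.fit(X)
```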
Several GPU implementations exist for even faster computation:
| Implementation | Platform | Speedup | Notes |
|---|---|---|---|
| t-SNE-CUDA | NVIDIA GPUs | Up to 1,200x vs. scikit-learn | Based on FIt-SNE, Python bindings with scikit-learn-style API |
| RAPIDS cuML | NVIDIA GPUs | Up to 2,000x vs. scikit-learn | Drop-in replacement for scikit-learn's TSNE |
| TensorBoard projector | Browser | Interactive | Part of TensorFlow ecosystem |
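As an illustration of how the cuML implementation mirrors the scikit-learn API, a hypothetical usage sketch (requires an NVIDIA GPU and the RAPIDS libraries, and is not verified here):

```python
# RAPIDS cuML exposes a TSNE class with a scikit-learn-style interface
from cuml.manifold import TSNE
import numpy as np

X = np.random.randn(500_000, 50).astype(np.float32)
tsne = TSNE(n_components=2, perplexity=30)
X_2d = tsne.fit_transform(X)   # runs on the GPU
```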
The flexibility of t-SNE makes it a powerful visualization tool, but it also means that the resulting plots can be misleading if interpreted carelessly. Several common pitfalls have been identified by Wattenberg, Viegas, and Johnson (2016) in their interactive article "How to Use t-SNE Effectively."
t-SNE tends to equalize the apparent sizes of clusters. A cluster that is highly dispersed in the original high-dimensional space and a cluster that is tightly packed can appear similar in size in a t-SNE plot. This happens because the algorithm adjusts the bandwidth sigma_i independently for each point based on the local density, effectively normalizing local distances. As a result, the visual size of a cluster in a t-SNE plot does not reliably indicate its actual spread or variance.
The distances between clusters in a t-SNE plot should not be interpreted as reflecting true inter-cluster distances in the original space. Two clusters that appear far apart may not be more different than two clusters that appear closer together. At low perplexity values, the algorithm focuses almost exclusively on local neighborhoods and may place clusters at arbitrary relative positions. At higher perplexity values, some global structure may be preserved, but this is not guaranteed.
t-SNE can produce apparent cluster structure even in random, uniformly distributed data, especially at low perplexity values. This means that observing clusters in a t-SNE plot is not sufficient evidence that genuine clusters exist in the data. The visualization should always be validated using other methods, such as statistical clustering algorithms or domain-specific analysis.
Different perplexity values can produce qualitatively different visualizations of the same dataset. A visualization that shows clear clusters at one perplexity value may show a featureless blob at another. Best practice is to examine multiple perplexity values and look for patterns that are consistent across settings. If a structure disappears at higher perplexity, it may reflect local noise rather than genuine structure.
Because t-SNE uses random initialization and a non-convex objective function, different runs on the same data can produce different-looking plots. Features that appear consistently across multiple runs are more likely to reflect real data structure. Running t-SNE several times and comparing results is recommended practice.
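A simple robustness check is to sweep both the perplexity and the random seed and compare the resulting maps. A sketch with scikit-learn (the parameter grids are illustrative):

```python
from sklearn.manifold import TSNE
import numpy as np

X = np.random.randn(2000, 50)

# Structure that persists across perplexities and seeds is more likely to be real
embeddings = {}
for perplexity in (5, 30, 50):
    for seed in (0, 1, 2):
        tsne = TSNE(n_components=2, perplexity=perplexity,
                    init="random", random_state=seed)
        embeddings[(perplexity, seed)] = tsne.fit_transform(X)
```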
If the optimization is stopped too early, the embedding may not have converged, producing distorted shapes. A common sign of premature stopping is the presence of "pinched" or "compressed" shapes in the plot. The optimization should run until the KL divergence has stabilized, which typically requires at least 1,000 iterations.
PCA is a linear method that finds orthogonal directions of maximum variance in the data. It is fast (O(n * d^2) for n points in d dimensions), deterministic, and preserves global structure well. However, PCA cannot capture nonlinear relationships and often fails to reveal cluster structure in complex data.
t-SNE is nonlinear, stochastic, and much more computationally expensive, but it excels at revealing local cluster structure. A common workflow is to use PCA to reduce high-dimensional data to 50 dimensions first (to remove noise and speed up computation), then apply t-SNE for the final 2D visualization.
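A minimal sketch of this PCA-then-t-SNE workflow with scikit-learn (the choice of 50 components is a common convention, not a requirement):

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import numpy as np

X = np.random.randn(5000, 1000)

# PCA to ~50 dimensions removes noise and speeds up the subsequent t-SNE
X_50 = PCA(n_components=50).fit_transform(X)
X_2d = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(X_50)
```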
UMAP (Uniform Manifold Approximation and Projection), introduced by McInnes, Healy, and Melville in 2018, is the most common alternative to t-SNE for nonlinear dimensionality reduction. The two methods share many similarities but differ in important ways.
| Aspect | t-SNE | UMAP |
|---|---|---|
| Theoretical basis | Information theory (KL divergence) | Riemannian geometry and algebraic topology |
| Global structure preservation | Weak; focuses on local neighborhoods | Better; preserves both local and some global structure |
| Computational speed | Slower (O(n log n) with Barnes-Hut) | Faster, especially for large datasets |
| Scalability | Millions of points with FIt-SNE | Scales well to millions natively |
| Embedding new points | Requires recomputation (non-parametric) | Supports transform on new data |
| Determinism | Non-deterministic (random initialization) | Non-deterministic (random initialization) |
| Hyperparameter sensitivity | Sensitive (perplexity, learning rate) | Sensitive (n_neighbors, min_dist) |
| Continuous structure | Tends to break continua into clusters | Better at preserving continuous manifold structure |
| Cluster separation | Often produces tighter, more distinct clusters | Clusters may be less tightly separated |
In practice, UMAP is often preferred for exploratory analysis of large datasets because of its speed and better global structure preservation. t-SNE remains popular for producing publication-quality visualizations of moderate-sized datasets where fine local structure is the priority.
| Method | Type | Preserves | Scalability | Key limitation |
|---|---|---|---|---|
| t-SNE | Probabilistic | Local neighborhoods | O(n log n) with Barnes-Hut | No global structure guarantee |
| UMAP | Topological | Local and some global | O(n) approximate | Sensitive to hyperparameters |
| Isomap | Geodesic | Global (geodesic distances) | O(n^3) for shortest paths | Fails with non-convex manifolds |
| LLE (Locally Linear Embedding) | Linear reconstruction | Local geometry | O(n^2) | Sensitive to noise, requires connected manifold |
| Laplacian Eigenmaps | Spectral | Local connectivity | O(n^2) for graph construction | Requires choosing neighborhood size |
| MDS (Multidimensional scaling) | Distance-based | All pairwise distances | O(n^3) for classical MDS | Linear; does not capture nonlinear structure |
| Autoencoders | Neural network | Learned representation | O(n) per epoch | Requires architecture design and training |
t-SNE has found widespread use across many domains. Below are some of the most common applications.
The most prominent application of t-SNE is in single-cell RNA sequencing (scRNA-seq) analysis. Single-cell experiments can measure the expression levels of thousands of genes in hundreds of thousands or millions of individual cells. t-SNE is used to visualize these high-dimensional expression profiles in 2D, revealing distinct cell types as clusters. Kobak and Berens (2019) provided detailed guidelines for using t-SNE with single-cell data, recommending PCA preprocessing, perplexity values proportional to the dataset size, and the use of FIt-SNE for large datasets.
t-SNE is commonly used to visualize word embeddings from models such as Word2Vec, GloVe, and contextual embeddings from transformers. When word vectors are projected into 2D using t-SNE, semantically similar words cluster together. For example, country names, color terms, or verb forms tend to form visually distinct groups. TensorFlow's Embedding Projector uses t-SNE as one of its primary visualization methods.
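As a toy illustration of this use case, the snippet below projects a handful of stand-in word vectors and labels them; with real Word2Vec or GloVe vectors (the random vectors here are placeholders), semantically related words would land near each other:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

# Placeholder vectors standing in for real 300-dimensional word embeddings
words = ["red", "green", "blue", "paris", "london", "tokyo"]
vectors = np.random.randn(len(words), 300)

# Perplexity must be smaller than the number of points
coords = TSNE(n_components=2, perplexity=2, init="pca",
              random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
```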
t-SNE is used to visualize the feature representations learned by convolutional neural networks (CNNs). By applying t-SNE to the activations of a layer (often the penultimate fully connected layer), researchers can see how the network organizes images internally. Images of the same class should cluster together if the network has learned effective representations.
t-SNE has been applied to network traffic analysis for intrusion detection. By visualizing high-dimensional network flow features, analysts can identify clusters of normal traffic and spot anomalous patterns that may indicate attacks.
In cancer biology, t-SNE is used to visualize flow cytometry and mass cytometry (CyTOF) data, where each cell is characterized by the expression of dozens of protein markers. t-SNE plots help identify distinct cell populations in tumor microenvironments and track how these populations change in response to treatment.
t-SNE has been used to visualize audio features extracted from music tracks, revealing genre clusters, artist similarities, and temporal patterns within musical datasets.
t-SNE has several well-known limitations that practitioners should consider.
t-SNE learns a non-parametric embedding, meaning it does not produce an explicit function that maps from the high-dimensional input space to the low-dimensional output space. This has two practical consequences: new data points cannot be placed into an existing embedding without rerunning the optimization, and the embedding cannot be reused as a learned transformation, for example to project a held-out test set in a machine-learning pipeline.
Parametric t-SNE (van der Maaten, 2009) addresses the first limitation by training a neural network to approximate the t-SNE mapping, but this variant is less commonly used.
Even with Barnes-Hut approximation, t-SNE is significantly more expensive than linear methods like PCA. For very large datasets (millions of points), the computation can take hours even with optimized implementations. FIt-SNE and GPU-accelerated implementations have largely addressed this issue, but they require additional software dependencies.
t-SNE is designed for reducing data to 2 or 3 dimensions for visualization purposes. It is not appropriate as a general-purpose dimensionality reduction method for downstream tasks such as classification or clustering, because the embedding does not preserve distances or densities in a consistent way.
In very high-dimensional spaces, Euclidean distances lose their discriminative power (the distances between all pairs of points tend to become similar). This can degrade the quality of t-SNE embeddings. A common mitigation is to first reduce the dimensionality to 50 or so using PCA before applying t-SNE.
As discussed in the interpretation pitfalls section, t-SNE results can vary significantly with different hyperparameter settings. There is no single set of hyperparameters that works well for all datasets, and manual tuning or automated approaches like opt-SNE are often necessary.
The t-SNE objective function is non-convex, so the optimization can converge to different local minima depending on the random initialization. This non-determinism means that two runs on the same data with the same hyperparameters may produce visually different results.
t-SNE is available in many popular machine learning libraries.
| Library | Language | Algorithm | Notes |
|---|---|---|---|
| scikit-learn | Python | Barnes-Hut, exact | sklearn.manifold.TSNE; the most widely used implementation |
| openTSNE | Python | FIt-SNE, Barnes-Hut | Modular, supports adding new points to existing embeddings |
| FIt-SNE | C++ with Python bindings | FFT-accelerated | Fastest CPU implementation for large datasets |
| t-SNE-CUDA | Python/CUDA | GPU-accelerated FIt-SNE | Up to 1,200x speedup over scikit-learn |
| RAPIDS cuML | Python/CUDA | GPU-accelerated | Drop-in scikit-learn replacement |
| Rtsne | R | Barnes-Hut | R wrapper around C++ implementation |
| MulticoreTSNE | Python/C++ | Barnes-Hut (parallel) | Multi-threaded CPU implementation |
| TensorBoard | JavaScript | Interactive | Part of TensorFlow; runs in the browser |
| ELKI | Java | Barnes-Hut | General-purpose data mining framework |
| TSne.jl | Julia | Barnes-Hut | Julia implementation |
A basic usage example with scikit-learn:

```python
from sklearn.manifold import TSNE
import numpy as np

# X is an array of shape (n_samples, n_features)
X = np.random.randn(1000, 50)

# Create and fit t-SNE
tsne = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate='auto',
    init='pca',
    n_iter=1000,        # renamed max_iter in recent scikit-learn versions
    random_state=42,
)
X_embedded = tsne.fit_transform(X)
# X_embedded is now (1000, 2)
```
A workflow with openTSNE, including PCA preprocessing and the embedding of new points into an existing map:

```python
from openTSNE import TSNE
from sklearn.decomposition import PCA
import numpy as np

# Reduce to 50 dimensions with PCA first
X = np.random.randn(10000, 784)
pca = PCA(n_components=50).fit(X)
X_pca = pca.transform(X)

# Run t-SNE with the FFT-accelerated (FIt-SNE style) backend
tsne = TSNE(
    n_components=2,
    perplexity=30,
    initialization='pca',
    n_jobs=-1,
)
embedding = tsne.fit(X_pca)

# Add new points without recomputing the embedding; note that the new points
# must be projected with the PCA fitted on the original data, not a new PCA
new_points = np.random.randn(100, 784)
new_pca = pca.transform(new_points)
new_embedding = embedding.transform(new_pca)
```
Based on the accumulated experience of the research community, the following guidelines help produce reliable t-SNE visualizations:

- Reduce very high-dimensional data to roughly 50 dimensions with PCA before running t-SNE.
- Prefer PCA initialization over random initialization to retain more global structure.
- Try several perplexity values (scaling the perplexity with dataset size for large datasets) and trust only structure that persists across them.
- Run the algorithm with multiple random seeds and compare the results.
- Let the optimization run until the KL divergence has stabilized, typically at least 1,000 iterations.
- Scale the learning rate to the dataset size, for example n/12 or learning_rate='auto' in scikit-learn.
- Do not read cluster sizes or inter-cluster distances as quantitative measurements, and validate apparent clusters with independent methods.