See also: Machine learning terms
In machine learning, the word "dimensions" is overloaded. Depending on the context, it can refer to the number of input features that describe a data point, the number of axes (rank) of a tensor, the width of a hidden layer in a neural network, the size of a learned embedding, the bottleneck of an autoencoder, or even the number of output classes in a classifier. All of these meanings are related, because every one of them counts coordinates in some real or learned vector space, but they have very different practical consequences for memory, compute, and statistical behavior.
This article walks through the main meanings of "dimensions" used in modern machine learning, lists typical numerical values found in popular models, explains why high-dimensional spaces behave in counterintuitive ways (the curse of dimensionality), and outlines how practitioners mitigate it with dimension reduction, feature engineering, and manifold learning.
The term shows up in at least seven distinct ways. Mixing them up is one of the more common sources of confusion when reading papers or model cards.
| Meaning | What it counts | Typical scale |
|---|---|---|
| Feature dimensions | Input features per example | 10 to 10^6 |
| Tensor rank | Number of axes of a tensor | 1 to 5 in practice |
| Tensor shape | Size along each axis | varies per axis (see shape (tensor) and size (tensor)) |
| Embedding / hidden dim | Width of internal vector representation | 64 to 16,384 |
| Latent dim | Bottleneck width in autoencoder, VAE, GAN | 2 to 512 |
| Output dim | Outputs of the final layer | 1 (regression) to 128,000+ (vocabularies) |
| Intrinsic dim | True degrees of freedom of the data | usually << ambient dim |
The rest of the article expands each of these in turn.
The oldest meaning is the one used in classical statistics and pre-deep-learning ML: a feature dimension is one column in a tabular dataset, one measurement that describes an example. A house listing might have 12 features (square footage, bedrooms, year built, ZIP code, etc.). An iris flower in Fisher's classic dataset has 4 features. Tabular datasets in production usually sit somewhere between 10 and a few hundred feature dimensions.
Not every data type is naturally tabular, and the dimensionality of the raw input can be much larger.
| Data type | Raw dimensions | Note |
|---|---|---|
| Iris (UCI) | 4 | Classic toy dataset |
| Tabular CRM data | 10 to 500 | Mostly numeric and categorical columns |
| MNIST digit | 28 x 28 = 784 | Grayscale image flattened |
| ImageNet 224 image | 224 x 224 x 3 = 150,528 | RGB pixels |
| 4K video frame | 3840 x 2160 x 3 = ~25 million | Per single frame |
| One-hot bag of words | vocabulary size, e.g. 50,000 | Sparse representation |
| Genomics SNP array | ~10^6 | Many more features than examples |
When the number of features p is comparable to or larger than the number of training examples n (the so-called "large p, small n" regime, common in genomics and finance), classical estimators like ordinary least squares break down and regularization or feature selection becomes essential.
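A small scikit-learn sketch of that breakdown, using synthetic data (the sizes and the alpha value are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n, p = 50, 500                      # far fewer examples than features
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:5] = 1.0                    # only 5 features actually matter
y = X @ true_w + 0.1 * rng.normal(size=n)

# With p > n, OLS can interpolate the training data perfectly;
# the fitted coefficients are unstable and generalize poorly.
ols = LinearRegression().fit(X, y)

# Ridge regularization shrinks coefficients and restores a
# well-posed problem.
ridge = Ridge(alpha=10.0).fit(X, y)

X_test = rng.normal(size=(1000, p))
y_test = X_test @ true_w
print("OLS R^2 on held-out data:  ", ols.score(X_test, y_test))
print("Ridge R^2 on held-out data:", ridge.score(X_test, y_test))
```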
In deep learning libraries like PyTorch and TensorFlow, every piece of data is a tensor. Two related but distinct numbers describe a tensor: its rank (also called its number of dimensions or axes) and its shape.
The rank is the number of indices needed to address a single element. A scalar has rank 0. A vector has rank 1. A matrix has rank 2. A batch of color images is usually rank 4. PyTorch exposes the rank as tensor.ndim or tensor.dim(). The shape is a tuple giving the length along each axis, available as tensor.shape or tensor.size(). See the dedicated articles on shape (tensor) and size (tensor) for more detail.
A confusing point: the word "dimension" can mean either "how many axes" or "how long is the axis," and you have to read the surrounding context to know which. "This tensor has 4 dimensions" almost always means rank 4. "The hidden dimension is 768" almost always means the size along one specific axis.
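A short PyTorch snippet makes the two readings concrete (the tensor contents are arbitrary; ndim, dim(), shape, and size() are the actual attributes mentioned above):

```python
import torch

batch = torch.zeros(32, 3, 224, 224)  # a batch of 32 RGB images

# Rank: how many indices are needed to address one element.
print(batch.ndim)    # 4
print(batch.dim())   # 4 (same thing, method form)

# Shape: the length along each of those 4 axes.
print(batch.shape)   # torch.Size([32, 3, 224, 224])
print(batch.size())  # same as .shape

# "The hidden dimension is 768" refers to one entry of the shape:
hidden = torch.zeros(32, 128, 768)   # (batch, sequence, hidden dim)
print(hidden.shape[-1])              # 768
```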
Inside a neural network, every layer transforms its inputs into another vector (or a batch of vectors). The width of that internal vector is called the hidden dimension or, when the vector represents a discrete token like a word or an image patch, the embedding dimension. This is one of the most important architectural hyperparameters in modern models because it largely controls capacity and cost.
The table below lists hidden dimensions for several well known architectures. Numbers come from the original papers and official model cards.
| Model | Hidden dim | Layers | Heads | Notes |
|---|---|---|---|---|
| Word2vec (skip-gram, original) | 300 | n/a | n/a | Single embedding layer |
| BERT-base | 768 | 12 | 12 | 110M parameters |
| BERT-large | 1,024 | 24 | 16 | 340M parameters |
| GPT-2 small | 768 | 12 | 12 | 124M parameters |
| GPT-2 XL | 1,600 | 48 | 25 | 1.5B parameters |
| GPT-3 175B | 12,288 | 96 | 96 | Each head uses a 128-dim subspace |
| ViT-Base/16 | 768 | 12 | 12 | Image patches as tokens |
| ViT-Large/16 | 1,024 | 24 | 16 | Image patches as tokens |
| CLIP ViT-L/14 | 768 (shared) | 24 (image) | 16 | Joint image / text embedding space |
| Llama 3 8B | 4,096 | 32 | 32 | 14,336 FFN dim |
| Llama 3 70B | 8,192 | 80 | 64 | 28,672 FFN dim |
| Llama 3 405B | 16,384 | 126 | 128 | 53,248 FFN dim |
A recurring pattern in transformer models is that the per-head dimension stays roughly constant (often 64 or 128) while the number of heads grows: GPT-3 uses 96 heads of width 128, giving the 12,288 dim hidden state.
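The arithmetic can be checked directly against the table above (plain Python; the figures are copied from the table):

```python
# hidden dim / number of attention heads = per-head dimension
models = {
    "GPT-2 small":  (768, 12),
    "GPT-2 XL":     (1600, 25),
    "GPT-3 175B":   (12288, 96),
    "Llama 3 8B":   (4096, 32),
    "Llama 3 405B": (16384, 128),
}
for name, (d_model, n_heads) in models.items():
    print(f"{name}: head dim = {d_model // n_heads}")
# GPT-2 small: 64, GPT-2 XL: 64, the other three: 128
```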
Word embedding sizes tend to be smaller than LLM hidden states. The original word2vec paper by Mikolov and colleagues at Google trained 300-dimensional vectors. Sentence and document embedding models still cluster in a similar range: 384 dimensions is common for compact models like Sentence-Transformers MiniLM, while OpenAI's text-embedding-3-small returns 1,536-dim vectors and text-embedding-3-large returns 3,072-dim vectors by default. Both can be truncated through the API's dimensions parameter when storage is tight.
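When truncating such embeddings by hand, the usual recipe is to keep the leading coordinates and re-normalize so cosine similarity still behaves. A sketch in numpy, where full_embedding is just a stand-in for a vector returned by an embedding API:

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and re-normalize to unit length,
    so cosine similarity remains meaningful after truncation."""
    v = np.asarray(vec, dtype=np.float32)[:dim]
    return v / np.linalg.norm(v)

# Stand-in for a 3,072-dim embedding; here faked with random numbers.
full_embedding = np.random.default_rng(0).normal(size=3072)
short = truncate_embedding(full_embedding, 256)
print(short.shape)            # (256,)
print(np.linalg.norm(short))  # 1.0
```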
In generative models like autoencoders, variational autoencoders (VAEs), and generative adversarial networks (GANs), the latent dimension is the width of the bottleneck through which the model is forced to compress its inputs. Kingma and Welling introduced the VAE in 2013 with a Gaussian prior over a latent vector whose dimensionality is left as a hyperparameter; smaller latent spaces give stronger compression and more disentangled features at the cost of reconstruction quality. Latent vectors in image VAEs are usually 16 to 512 dimensional, with very small (2 to 10) latent spaces sometimes used for visualization. Stable Diffusion's image VAE encodes a 512x512x3 image into a 64x64x4 latent grid; that is a roughly 48x compression in raw element count.
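A minimal PyTorch sketch of the bottleneck idea; the layer widths are arbitrary, and latent_dim is the hyperparameter discussed above:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # The encoder squeezes the input through the latent bottleneck...
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # ...and the decoder tries to reconstruct the input from it.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)    # z.shape == (batch, latent_dim)
        return self.decoder(z)

x = torch.rand(16, 784)        # e.g. a batch of flattened MNIST digits
model = Autoencoder(latent_dim=32)
loss = nn.functional.mse_loss(model(x), x)  # reconstruction objective
```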
The output dimension is dictated by the task, not by capacity choices.
| Task | Output dim |
|---|---|
| Binary classification | 1 (logit) or 2 (softmax) |
| Multiclass classification | number of classes (e.g. 1,000 for ImageNet) |
| Univariate regression | 1 |
| Multivariate regression | one per target |
| Object detection | (4 box coords + classes + confidence) per anchor |
| Language modeling, GPT-2 | 50,257 (vocabulary) |
| Language modeling, Llama 3 | 128,256 (vocabulary) |
For a language model the output projection from hidden dim to vocabulary is one of the largest matrices in the network. In Llama 3 8B, the embedding and output projection together take up most of the parameter difference vs Llama 2 7B because the vocabulary grew from about 32,000 to 128,256 tokens.
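A back-of-the-envelope check using the dimensions above (this assumes untied input and output embedding matrices, so treat it as an estimate rather than an audit of the released checkpoints):

```python
hidden = 4096                     # shared by Llama 2 7B and Llama 3 8B

llama2_vocab = 32_000
llama3_vocab = 128_256

# Input embedding + output projection, each of size vocab x hidden.
llama2_embed_params = 2 * llama2_vocab * hidden   # ~0.26B
llama3_embed_params = 2 * llama3_vocab * hidden   # ~1.05B

print(f"Llama 2 7B embed/unembed: {llama2_embed_params / 1e9:.2f}B params")
print(f"Llama 3 8B embed/unembed: {llama3_embed_params / 1e9:.2f}B params")
print(f"Difference: {(llama3_embed_params - llama2_embed_params) / 1e9:.2f}B")
```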
Richard Bellman coined the phrase "curse of dimensionality" in his 1957 book Dynamic Programming and elaborated on it in his 1961 Adaptive Control Processes. He used it to describe how the number of grid points needed to discretize a state space grows exponentially with the number of state variables, making naive dynamic programming infeasible past a handful of dimensions. The term has since spread to cover several related phenomena that all bite when the number of features is large.
| Phenomenon | What happens | Example consequence |
|---|---|---|
| Volume explosion | Volume of a unit cube grows as 2^d if you double each side | A grid with 10 points per axis needs 10^d cells |
| Sparsity | Any fixed number of samples covers a vanishing fraction of the space | Density estimation needs exponentially more data |
| Distance concentration | Pairwise distances tend toward a single value as d grows | Nearest-neighbor queries become unstable (Beyer et al., 1999) |
| Hubness | A few points become nearest neighbors to many others | k-NN classification gets biased |
| Empty space phenomenon | Most of the volume of a high-dim ball is near its surface | Gaussian samples concentrate on a thin shell |
| Statistical inefficiency | Sample complexity for many estimators grows exponentially in d | Kernel density estimators are unusable past d~10 |
These effects hit nearest-neighbor methods, kernel density estimation, and grid-based approaches the hardest. Linear models and decision trees with regularization are more robust, and modern deep networks largely sidestep the curse by exploiting structure in the data rather than learning over a uniform high-dimensional grid.
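Distance concentration in particular is easy to reproduce. This numpy sketch samples uniform points and tracks the relative contrast between the farthest and nearest neighbor of a query as d grows (random data, so exact values vary by seed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
for d in [2, 10, 100, 1000, 10000]:
    X = rng.uniform(size=(n, d))
    q = rng.uniform(size=d)
    dists = np.linalg.norm(X - q, axis=1)
    # Relative contrast: how much farther is the farthest point than
    # the nearest? It shrinks toward 0 as d grows (Beyer et al., 1999).
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:>6}: relative contrast = {contrast:.3f}")
```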
A dataset with thousands of features rarely fills the ambient space uniformly. The manifold hypothesis says that high-dimensional natural data such as images, audio, or text actually lies on or near a much lower dimensional manifold embedded inside that ambient space. Pope, Zhu, Abbas, Goldblum, and Goldstein (2021) estimated the intrinsic dimension of common image datasets and reported numbers around a dozen for MNIST and a few dozen for ImageNet, even though the raw pixel dimensions are 784 and roughly 150,000 respectively. They also showed that intrinsic dimension correlates closely with the number of training samples needed to learn a given task.
Ansuini, Laio, Macke, and Zoccolan (2019) measured intrinsic dimension layer by layer inside trained convolutional networks and found that the dimension first grows in early layers and then steadily contracts in later layers, ending well below the ambient feature dimension. That progressive contraction toward the data's true degrees of freedom is one mechanistic explanation for why deep networks generalize despite huge nominal capacity.
The gap between nominal dimension and intrinsic dimension is also the formal justification for dimension reduction. Methods like principal component analysis, t-SNE, UMAP, and trained autoencoders all try to approximate this lower-dimensional structure either for visualization, denoising, or downstream learning.
For convolutional networks operating on images, the input tensor has both spatial dimensions (height H and width W) and a channel dimension (C, often 3 for RGB). Together with the batch axis N you get a rank-4 tensor. The two common layouts are NCHW (channels first, default in PyTorch and cuDNN) and NHWC (channels last, default in TensorFlow). The choice matters for performance: NVIDIA Tensor Cores run convolutions fastest in NHWC, and PyTorch added a torch.channels_last memory format in version 1.5 to take advantage of this.
When people speak of a "512-channel feature map" inside a deep CNN, they are talking about C, not about H or W. A typical ResNet-50 starts at 64 channels in conv1 and grows the channel dimension to 2,048 by the final stage while shrinking the spatial dimensions from 224x224 down to 7x7.
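In PyTorch, switching to channels last changes the physical layout without changing the logical NCHW shape; a short sketch (tensor sizes are illustrative):

```python
import torch

x = torch.randn(8, 64, 56, 56)   # N, C, H, W (PyTorch default)
print(x.shape)                    # torch.Size([8, 64, 56, 56])
print(x.is_contiguous())          # True: physically laid out as NCHW

# Switch the memory layout to NHWC; the logical shape stays the same.
x_cl = x.to(memory_format=torch.channels_last)
print(x_cl.shape)                 # still torch.Size([8, 64, 56, 56])
print(x_cl.is_contiguous(memory_format=torch.channels_last))  # True

# Convolutions on Tensor Core GPUs typically run faster in this layout.
conv = torch.nn.Conv2d(64, 128, 3, padding=1).to(memory_format=torch.channels_last)
y = conv(x_cl)
```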
Dimension choices propagate through the entire systems stack.
Memory cost scales linearly in the hidden dimension d for activations and quadratically for the largest weight matrices. The query, key, and value projections in self-attention are d x d, so doubling the hidden dim quadruples those weight matrices. Compute cost in self-attention scales as O(L^2 * d), where L is the sequence length. Optimizer states for Adam carry two extra moments per parameter, so a model with W parameters needs roughly 12 * W bytes in fp32 just for weights and optimizer state.
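A rough cost calculator built from the rules of thumb in this paragraph (plain Python; the constants and model sizes are illustrative, not measurements):

```python
def attention_costs(d, seq_len, n_layers):
    # Q, K, V, and output projections: four d x d matrices per layer.
    proj_params = 4 * d * d * n_layers
    # fp32 weights (4 bytes) + Adam's two moments (8 bytes) = 12 bytes/param.
    optim_bytes = 12 * proj_params
    # Attention compute scales as O(L^2 * d) per layer.
    attn_ops = seq_len ** 2 * d * n_layers
    return proj_params, optim_bytes, attn_ops

# Doubling d quadruples the projection parameters but only doubles
# the attention compute term.
for d in [768, 1536, 3072]:
    params, mem, ops = attention_costs(d, seq_len=2048, n_layers=12)
    print(f"d={d}: {params / 1e6:.0f}M proj params, "
          f"{mem / 1e9:.2f} GB weights+optimizer, {ops / 1e9:.0f}B attn ops")
```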
Data efficiency is also tied to dimension. The classical heuristic from learning theory is that you need on the order of d log d samples to fit a linear model with d features, and far more for nonparametric methods. In practice, transformer scaling laws (Kaplan et al., 2020; the Chinchilla analysis of Hoffmann et al., 2022) tie hidden dimension, depth, and dataset size together: a model with a hidden dimension that is too large for the available data underperforms a smaller, better matched one.
When the ambient dimension is unwieldy, practitioners reduce it. The two broad strategies are feature selection, which keeps a subset of original features, and feature extraction, which builds new features as combinations of the originals. The major methods are listed below; a minimal PCA sketch follows the table.
| Method | Type | Notes |
|---|---|---|
| Principal component analysis | Linear, unsupervised | Captures axes of maximum variance |
| Linear discriminant analysis | Linear, supervised | Maximizes class separation |
| t-SNE | Nonlinear, visualization | Preserves local neighborhoods, distorts global structure |
| UMAP | Nonlinear, visualization | Faster than t-SNE, preserves more global structure |
| Autoencoder bottleneck | Nonlinear, learned | Trained reconstruction objective |
| Variational autoencoder | Probabilistic, learned | Adds a prior on latent space |
| Random projection | Linear, unsupervised | Justified by Johnson-Lindenstrauss lemma |
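As promised, a minimal PCA sketch using scikit-learn (the digits dataset and the choice of two components are arbitrary illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)  # 1,797 examples, 64 features
pca = PCA(n_components=2)            # project onto the top-2 variance axes
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)     # (1797, 64) -> (1797, 2)
# Fraction of total variance captured by the 2 kept axes:
print(pca.explained_variance_ratio_.sum())
```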
See the dedicated dimension reduction and manifold learning articles for full discussion.
Imagine sorting toys. If you only care about color, you have one dimension and a long row of bins works fine. Add size and now you have a grid: small red, big red, small blue, big blue. Add shape and the grid becomes a cube. Add ten more features and you cannot draw it anymore, but you can still write each toy down as a list of numbers. That list of numbers is a vector, and its length is the number of dimensions. Computers do not mind high-dimensional toys, but they need a lot more examples to fill in all the bins, and a lot more memory to store all the lists.