See also: Machine learning terms
In machine learning, the word "dimensions" is overloaded. Depending on the context, it can refer to the number of input features that describe a data point, the number of axes (rank) of a tensor, the width of a hidden layer in a neural network, the size of a learned embedding, the bottleneck of an autoencoder, or even the number of output classes in a classifier. All of these meanings are related, because every one of them counts coordinates in some real or learned vector space, but they have very different practical consequences for memory, compute, and statistical behavior.
This article walks through the main meanings of "dimensions" used in modern machine learning, lists typical numerical values found in popular models, explains why high-dimensional spaces behave in counterintuitive ways (the curse of dimensionality), and outlines how practitioners mitigate it with dimension reduction, feature engineering, and manifold learning.
The term shows up in at least seven distinct ways. Mixing them up is one of the more common sources of confusion when reading papers or model cards.
| Meaning | What it counts | Typical scale |
|---|---|---|
| Feature dimensions | Input features per example | 10 to 10^6 |
| Tensor rank | Number of axes of a tensor | 1 to 5 in practice |
| Tensor shape | Size along each axis | varies per axis (see shape (tensor) and size (tensor)) |
| Embedding / hidden dim | Width of internal vector representation | 64 to 16,384 |
| Latent dim | Bottleneck width in autoencoder, VAE, GAN | 2 to 512 |
| Output dim | Outputs of the final layer | 1 (regression) to 128,000+ (vocabularies) |
| Intrinsic dim | True degrees of freedom of the data | usually << ambient dim |
The rest of the article expands each of these in turn.
The oldest meaning is the one used in classical statistics and pre-deep-learning ML: a feature dimension is one column in a tabular dataset, one measurement that describes an example. A house listing might have 12 features (square footage, bedrooms, year built, ZIP code, etc.). An iris flower in Fisher's classic dataset has 4 features. Tabular datasets in production usually sit somewhere between 10 and a few hundred feature dimensions.
Not every data type is naturally tabular, and the dimensionality of the raw input can be much larger.
| Data type | Raw dimensions | Note |
|---|---|---|
| Iris (UCI) | 4 | Classic toy dataset |
| Tabular CRM data | 10 to 500 | Mostly numeric and categorical columns |
| MNIST digit | 28 x 28 = 784 | Grayscale image flattened |
| ImageNet 224 image | 224 x 224 x 3 = 150,528 | RGB pixels |
| 4K video frame | 3840 x 2160 x 3 = ~25 million | Per single frame |
| One-hot bag of words | vocabulary size, e.g. 50,000 | Sparse representation |
| Genomics SNP array | ~10^6 | Many more features than examples |
When the number of features p is comparable to or larger than the number of training examples n (the so-called "large p, small n" regime, common in genomics and finance), classical estimators like ordinary least squares break down and regularization or feature selection becomes essential.
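A small scikit-learn sketch of that breakdown, using synthetic data (the sizes and the alpha value are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n, p = 50, 500                      # far fewer examples than features
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:5] = 1.0                    # only 5 features actually matter
y = X @ true_w + 0.1 * rng.normal(size=n)

# With p > n, OLS can interpolate the training data perfectly;
# the fitted coefficients are unstable and generalize poorly.
ols = LinearRegression().fit(X, y)

# Ridge regularization shrinks coefficients and restores a
# well-posed problem.
ridge = Ridge(alpha=10.0).fit(X, y)

X_test = rng.normal(size=(1000, p))
y_test = X_test @ true_w
print("OLS R^2 on held-out data:  ", ols.score(X_test, y_test))
print("Ridge R^2 on held-out data:", ridge.score(X_test, y_test))
```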
In deep learning libraries like PyTorch and TensorFlow, every piece of data is a tensor. Two related but distinct numbers describe a tensor: its rank (also called its number of dimensions or axes) and its shape.
The rank is the number of indices needed to address a single element. A scalar has rank 0. A vector has rank 1. A matrix has rank 2. A batch of color images is usually rank 4. PyTorch exposes the rank as tensor.ndim or tensor.dim(). The shape is a tuple giving the length along each axis, available as tensor.shape or tensor.size(). See the dedicated articles on shape (tensor) and size (tensor) for more detail.
A confusing point: the word "dimension" can mean either "how many axes" or "how long is the axis," and you have to read the surrounding context to know which. "This tensor has 4 dimensions" almost always means rank 4. "The hidden dimension is 768" almost always means the size along one specific axis.
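A short PyTorch snippet makes the two readings concrete (the tensor contents are arbitrary; ndim, dim(), shape, and size() are the actual attributes mentioned above):

```python
import torch

batch = torch.zeros(32, 3, 224, 224)  # a batch of 32 RGB images

# Rank: how many indices are needed to address one element.
print(batch.ndim)    # 4
print(batch.dim())   # 4 (same thing, method form)

# Shape: the length along each of those 4 axes.
print(batch.shape)   # torch.Size([32, 3, 224, 224])
print(batch.size())  # same as .shape

# "The hidden dimension is 768" refers to one entry of the shape:
hidden = torch.zeros(32, 128, 768)   # (batch, sequence, hidden dim)
print(hidden.shape[-1])              # 768
```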
Inside a neural network, every layer transforms its inputs into another vector (or a batch of vectors). The width of that internal vector is called the hidden dimension or, when the vector represents a discrete token like a word or an image patch, the embedding dimension. This is one of the most important architectural hyperparameters in modern models because it largely controls capacity and cost.
The table below lists hidden dimensions for several well known architectures. Numbers come from the original papers and official model cards.
| Model | Hidden dim | Layers | Heads | Notes |
|---|---|---|---|---|
| Word2vec (skip-gram, original) | 300 | n/a | n/a | Single embedding layer |
| BERT-base | 768 | 12 | 12 | 110M parameters |
| BERT-large | 1,024 | 24 | 16 | 340M parameters |
| GPT-2 small | 768 | 12 | 12 | 124M parameters |
| GPT-2 XL | 1,600 | 48 | 25 | 1.5B parameters |
| GPT-3 175B | 12,288 | 96 | 96 | Each head uses a 128-dim subspace |
| ViT-Base/16 | 768 | 12 | 12 | Image patches as tokens |
| ViT-Large/16 | 1,024 | 24 | 16 | Image patches as tokens |
| CLIP ViT-L/14 | 768 (shared) | 24 (image) | 16 | Joint image / text embedding space |
| Llama 3 8B | 4,096 | 32 | 32 | 14,336 FFN dim |
| Llama 3 70B | 8,192 | 80 | 64 | 28,672 FFN dim |
| Llama 3 405B | 16,384 | 126 | 128 | 53,248 FFN dim |
A recurring pattern in transformer models is that the per-head dimension stays roughly constant (often 64 or 128) while the number of heads grows: GPT-3 uses 96 heads of width 128, giving the 12,288 dim hidden state.
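The arithmetic can be checked directly against the table above (plain Python; the figures are copied from the table):

```python
# hidden dim / number of attention heads = per-head dimension
models = {
    "GPT-2 small":  (768, 12),
    "GPT-2 XL":     (1600, 25),
    "GPT-3 175B":   (12288, 96),
    "Llama 3 8B":   (4096, 32),
    "Llama 3 405B": (16384, 128),
}
for name, (d_model, n_heads) in models.items():
    print(f"{name}: head dim = {d_model // n_heads}")
# GPT-2 small: 64, GPT-2 XL: 64, the other three: 128
```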
Word embedding sizes tend to be smaller than LLM hidden states. The original word2vec paper by Mikolov and colleagues at Google trained 300-dimensional vectors. Sentence and document embedding models still cluster in a similar range: 384 dimensions is common for compact models like Sentence-Transformers MiniLM, while OpenAI's text-embedding-3-small returns 1,536-dim vectors and text-embedding-3-large returns 3,072-dim vectors by default. Both can be truncated through the API's dimensions parameter when storage is tight.
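When truncating such embeddings by hand, the usual recipe is to keep the leading coordinates and re-normalize so cosine similarity still behaves. A sketch in numpy, where full_embedding is just a stand-in for a vector returned by an embedding API:

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and re-normalize to unit length,
    so cosine similarity remains meaningful after truncation."""
    v = np.asarray(vec, dtype=np.float32)[:dim]
    return v / np.linalg.norm(v)

# Stand-in for a 3,072-dim embedding; here faked with random numbers.
full_embedding = np.random.default_rng(0).normal(size=3072)
short = truncate_embedding(full_embedding, 256)
print(short.shape)            # (256,)
print(np.linalg.norm(short))  # 1.0
```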
In generative models like autoencoders, variational autoencoders (VAEs), and generative adversarial networks (GANs), the latent dimension is the width of the bottleneck through which the model is forced to compress its inputs. Kingma and Welling introduced the VAE in 2013 with a Gaussian prior over a latent vector whose dimensionality is left as a hyperparameter; smaller latent spaces give stronger compression and more disentangled features at the cost of reconstruction quality. Latent vectors in image VAEs are usually 16 to 512 dimensional, with very small (2 to 10) latent spaces sometimes used for visualization. Stable Diffusion's image VAE encodes a 512x512x3 image into a 64x64x4 latent grid; that is a roughly 48x compression in raw element count.
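A minimal PyTorch sketch of the bottleneck idea; the layer widths are arbitrary, and latent_dim is the hyperparameter discussed above:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # The encoder squeezes the input through the latent bottleneck...
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # ...and the decoder tries to reconstruct the input from it.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)    # z.shape == (batch, latent_dim)
        return self.decoder(z)

x = torch.rand(16, 784)        # e.g. a batch of flattened MNIST digits
model = Autoencoder(latent_dim=32)
loss = nn.functional.mse_loss(model(x), x)  # reconstruction objective
```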
The output dimension is dictated by the task, not by capacity choices.
| Task | Output dim |
|---|---|
| Binary classification | 1 (logit) or 2 (softmax) |
| Multiclass classification | number of classes (e.g. 1,000 for ImageNet) |
| Univariate regression | 1 |
| Multivariate regression | one per target |
| Object detection | (4 box coords + classes + confidence) per anchor |
| Language modeling, GPT-2 | 50,257 (vocabulary) |
| Language modeling, Llama 3 | 128,256 (vocabulary) |
For a language model the output projection from hidden dim to vocabulary is one of the largest matrices in the network. In Llama 3 8B, the embedding and output projection together take up most of the parameter difference vs Llama 2 7B because the vocabulary grew from about 32,000 to 128,256 tokens.
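A back-of-the-envelope check using the dimensions above (this assumes untied input and output embedding matrices, so treat it as an estimate rather than an audit of the released checkpoints):

```python
hidden = 4096                     # shared by Llama 2 7B and Llama 3 8B

llama2_vocab = 32_000
llama3_vocab = 128_256

# Input embedding + output projection, each of size vocab x hidden.
llama2_embed_params = 2 * llama2_vocab * hidden   # ~0.26B
llama3_embed_params = 2 * llama3_vocab * hidden   # ~1.05B

print(f"Llama 2 7B embed/unembed: {llama2_embed_params / 1e9:.2f}B params")
print(f"Llama 3 8B embed/unembed: {llama3_embed_params / 1e9:.2f}B params")
print(f"Difference: {(llama3_embed_params - llama2_embed_params) / 1e9:.2f}B")
```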
Richard Bellman coined the phrase "curse of dimensionality" in his 1957 book Dynamic Programming and elaborated on it in his 1961 Adaptive Control Processes. He used it to describe how the number of grid points needed to discretize a state space grows exponentially with the number of state variables, making naive dynamic programming infeasible past a handful of dimensions. The term has since spread to cover several related phenomena that all bite when the number of features is large.
| Phenomenon | What happens | Example consequence |
|---|---|---|
| Volume explosion | Volume of a unit cube grows as 2^d if you double each side | A grid with 10 points per axis needs 10^d cells |
| Sparsity | Any fixed number of samples covers a vanishing fraction of the space | Density estimation needs exponentially more data |
| Distance concentration | Pairwise distances tend toward a single value as d grows | Nearest-neighbor queries become unstable (Beyer et al., 1999) |
| Hubness | A few points become nearest neighbors to many others | k-NN classification gets biased |
| Empty space phenomenon | Most of the volume of a high-dim ball is near its surface | Gaussian samples concentrate on a thin shell |
| Statistical inefficiency | Sample complexity for many estimators grows exponentially in d | Kernel density estimators are unusable past d~10 |
These effects hit nearest-neighbor methods, kernel density estimation, and grid-based approaches the hardest. Linear models and decision trees with regularization are more robust, and modern deep networks largely sidestep the curse by exploiting structure in the data rather than learning over a uniform high-dimensional grid.
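Distance concentration in particular is easy to reproduce. This numpy sketch samples uniform points and tracks the relative contrast between the farthest and nearest neighbor of a query as d grows (random data, so exact values vary by seed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
for d in [2, 10, 100, 1000, 10000]:
    X = rng.uniform(size=(n, d))
    q = rng.uniform(size=d)
    dists = np.linalg.norm(X - q, axis=1)
    # Relative contrast: how much farther is the farthest point than
    # the nearest? It shrinks toward 0 as d grows (Beyer et al., 1999).
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:>6}: relative contrast = {contrast:.3f}")
```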
A dataset with thousands of features rarely fills the ambient space uniformly. The manifold hypothesis says that high-dimensional natural data such as images, audio, or text actually lies on or near a much lower dimensional manifold embedded inside that ambient space. Pope, Zhu, Abbas, Goldblum, and Goldstein (2021) estimated the intrinsic dimension of common image datasets and reported numbers around a dozen for MNIST and a few dozen for ImageNet, even though the raw pixel dimensions are 784 and roughly 150,000 respectively. They also showed that intrinsic dimension correlates closely with the number of training samples needed to learn a given task.
Ansuini, Laio, Macke, and Zoccolan (2019) measured intrinsic dimension layer by layer inside trained convolutional networks and found that the dimension first grows in early layers and then steadily contracts in later layers, ending well below the ambient feature dimension. That progressive contraction toward the data's true degrees of freedom is one mechanistic explanation for why deep networks generalize despite huge nominal capacity.
The gap between nominal dimension and intrinsic dimension is also the formal justification for dimension reduction. Methods like principal component analysis, t-SNE, UMAP, and trained autoencoders all try to approximate this lower-dimensional structure either for visualization, denoising, or downstream learning.
For convolutional networks operating on images, the input tensor has both spatial dimensions (height H and width W) and a channel dimension (C, often 3 for RGB). Together with the batch axis N you get a rank-4 tensor. The two common layouts are NCHW (channels first, default in PyTorch and cuDNN) and NHWC (channels last, default in TensorFlow). The choice matters for performance: NVIDIA Tensor Cores run convolutions fastest in NHWC, and PyTorch added a torch.channels_last memory format in version 1.5 to take advantage of this.
When people speak of a "512-channel feature map" inside a deep CNN, they are talking about C, not about H or W. A typical ResNet-50 starts at 64 channels in conv1 and grows the channel dimension to 2,048 by the final stage while shrinking the spatial dimensions from 224x224 down to 7x7.
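In PyTorch, switching to channels last changes the physical layout without changing the logical NCHW shape; a short sketch (tensor sizes are illustrative):

```python
import torch

x = torch.randn(8, 64, 56, 56)   # N, C, H, W (PyTorch default)
print(x.shape)                    # torch.Size([8, 64, 56, 56])
print(x.is_contiguous())          # True: physically laid out as NCHW

# Switch the memory layout to NHWC; the logical shape stays the same.
x_cl = x.to(memory_format=torch.channels_last)
print(x_cl.shape)                 # still torch.Size([8, 64, 56, 56])
print(x_cl.is_contiguous(memory_format=torch.channels_last))  # True

# Convolutions on Tensor Core GPUs typically run faster in this layout.
conv = torch.nn.Conv2d(64, 128, 3, padding=1).to(memory_format=torch.channels_last)
y = conv(x_cl)
```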
Dimension choices propagate through the entire systems stack.
Memory cost scales linearly in the hidden dimension d for activations and quadratically for the largest weight matrices. The query, key, and value projections in self-attention are d x d, so doubling the hidden dim quadruples those weight matrices. Compute cost in self-attention scales as O(L^2 * d), where L is the sequence length. Optimizer states for Adam carry two extra moments per parameter, so a model with W parameters needs roughly 12 * W bytes in fp32 just for weights and optimizer state.
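A rough cost calculator built from the rules of thumb in this paragraph (plain Python; the constants and model sizes are illustrative, not measurements):

```python
def attention_costs(d, seq_len, n_layers):
    # Q, K, V, and output projections: four d x d matrices per layer.
    proj_params = 4 * d * d * n_layers
    # fp32 weights (4 bytes) + Adam's two moments (8 bytes) = 12 bytes/param.
    optim_bytes = 12 * proj_params
    # Attention compute scales as O(L^2 * d) per layer.
    attn_ops = seq_len ** 2 * d * n_layers
    return proj_params, optim_bytes, attn_ops

# Doubling d quadruples the projection parameters but only doubles
# the attention compute term.
for d in [768, 1536, 3072]:
    params, mem, ops = attention_costs(d, seq_len=2048, n_layers=12)
    print(f"d={d}: {params / 1e6:.0f}M proj params, "
          f"{mem / 1e9:.2f} GB weights+optimizer, {ops / 1e9:.0f}B attn ops")
```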
Data efficiency is also tied to dimension. The classical heuristic from learning theory is that you need on the order of d log d samples to fit a linear model with d features, and far more for nonparametric methods. In practice, transformer scaling laws (Kaplan et al., 2020; the Chinchilla analysis of Hoffmann et al., 2022) tie hidden dimension, depth, and dataset size together: a model with a hidden dimension that is too large for the available data underperforms a smaller, better matched one.
When the ambient dimension is unwieldy, practitioners reduce it. The two broad strategies are feature selection, which keeps a subset of original features, and feature extraction, which builds new features as combinations of the originals. The major methods are listed below; a minimal PCA sketch follows the table.
| Method | Type | Notes |
|---|---|---|
| Principal component analysis | Linear, unsupervised | Captures axes of maximum variance |
| Linear discriminant analysis | Linear, supervised | Maximizes class separation |
| t-SNE | Nonlinear, visualization | Preserves local neighborhoods, distorts global structure |
| UMAP | Nonlinear, visualization | Faster than t-SNE, preserves more global structure |
| Autoencoder bottleneck | Nonlinear, learned | Trained reconstruction objective |
| Variational autoencoder | Probabilistic, learned | Adds a prior on latent space |
| Random projection | Linear, unsupervised | Justified by Johnson-Lindenstrauss lemma |
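As promised, a minimal PCA sketch using scikit-learn (the digits dataset and the choice of two components are arbitrary illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)  # 1,797 examples, 64 features
pca = PCA(n_components=2)            # project onto the top-2 variance axes
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)     # (1797, 64) -> (1797, 2)
# Fraction of total variance captured by the 2 kept axes:
print(pca.explained_variance_ratio_.sum())
```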
See the dedicated dimension reduction and manifold learning articles for full discussion.
Imagine sorting toys. If you only care about color, you have one dimension and a long row of bins works fine. Add size and now you have a grid: small red, big red, small blue, big blue. Add shape and the grid becomes a cube. Add ten more features and you cannot draw it anymore, but you can still write each toy down as a list of numbers. That list of numbers is a vector, and its length is the number of dimensions. Computers do not mind high-dimensional toys, but they need a lot more examples to fill in all the bins, and a lot more memory to store all the lists.