A tensor shape is a tuple of integers that describes the number of elements along each dimension (or axis) of a tensor. In machine learning and deep learning, tensor shape is one of the most frequently encountered concepts because it governs how data flows through every layer of a neural network, how operations combine tensors, and how memory is allocated on hardware accelerators like GPUs and TPUs. A mismatch in tensor shapes is one of the most common sources of runtime errors during model development, making a solid understanding of shapes, ranks, and dimension conventions essential for practitioners.
Imagine you have a box of crayons. If you line up 8 crayons in a single row, the "shape" of that row is just (8). Now picture a muffin tin that has 3 rows and 4 columns of cups. Its shape is (3, 4), because you need two numbers to describe where each cup is. If you stack several muffin tins on top of each other, say 5 of them, you now need three numbers: (5, 3, 4). That is exactly what tensor shape does. It tells you how many slots exist along each direction of your container, so you (and the computer) always know exactly how the data is organized.
A tensor is a multidimensional array of numerical values arranged in a regular grid. Its shape is the tuple that lists the size of every dimension. Several closely related terms appear throughout the literature, and their usage varies between mathematics, physics, and computer science.
| Term | Meaning in computer science / ML | Meaning in mathematics / physics |
|---|---|---|
| Rank (also called order or ndim) | The number of dimensions of the tensor (the length of the shape tuple). A scalar has rank 0, a vector has rank 1, a matrix has rank 2. | The number of indices needed to address a component. In physics, tensors of rank n may be further classified by their contravariant and covariant index structure. |
| Shape | The tuple of dimension sizes, e.g. (3, 224, 224). | Sometimes called the "type" or "signature" of the tensor when referring to its index structure. |
| Axis (or dimension) | A single positional index within the shape tuple. Axis 0 is the first dimension, axis 1 is the second, and so on. | Equivalent to a particular mode of the tensor. |
| Size (of a dimension) | The number of elements along that axis. | The range of the corresponding index. |
| Dtype | The data type of the tensor elements (e.g. float32, int64). Not part of shape, but closely related because it determines memory usage per element. | N/A |
The total number of elements (sometimes called numel) in a tensor equals the product of all dimension sizes. For example, a tensor of shape (2, 3, 4) contains 2 x 3 x 4 = 24 elements.
The following table summarizes the most frequently used tensor ranks and their typical roles in machine learning.
| Rank | Common name | Example shape | Typical use in ML |
|---|---|---|---|
| 0 | Scalar | () | A single loss value, learning rate, or metric |
| 1 | Vector | (512,) | A bias vector, a 1-D embedding |
| 2 | Matrix | (64, 768) | A batch of feature vectors, a weight matrix in a linear layer |
| 3 | 3-D tensor | (32, 128, 768) | A batch of token sequences in an NLP transformer (batch, sequence length, embedding dim) |
| 4 | 4-D tensor | (16, 3, 224, 224) | A batch of RGB images for a convolutional neural network (batch, channels, height, width) |
| 5 | 5-D tensor | (8, 3, 16, 112, 112) | A batch of video clips (batch, channels, frames, height, width) |
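As a quick illustration of these ranks, the sketch below (using PyTorch; NumPy and TensorFlow expose the same information under slightly different names) creates one tensor per common rank and prints its shape, rank, and element count. The sizes are taken from the example shapes in the table.

```python
import torch

loss = torch.tensor(3.14)               # rank 0: scalar, shape ()
bias = torch.zeros(512)                 # rank 1: vector, shape (512,)
features = torch.randn(64, 768)         # rank 2: matrix, shape (64, 768)
tokens = torch.randn(32, 128, 768)      # rank 3: (batch, seq_len, embed_dim)
images = torch.randn(16, 3, 224, 224)   # rank 4: (batch, channels, height, width)

for name, t in [("loss", loss), ("bias", bias), ("features", features),
                ("tokens", tokens), ("images", images)]:
    # shape is the dimension tuple, ndim its length, numel() the product of sizes
    print(f"{name}: shape={tuple(t.shape)}, rank={t.ndim}, numel={t.numel()}")
```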
Different application areas and frameworks follow different ordering conventions for the dimensions of their tensors. Understanding these conventions is necessary when converting data between frameworks or when reading model code.
Computer vision models process images as 4-D tensors. The two dominant conventions are:
| Convention | Dimension order | Frameworks |
|---|---|---|
| NCHW | Batch, Channels, Height, Width | PyTorch, Caffe, cuDNN default |
| NHWC | Batch, Height, Width, Channels | TensorFlow / Keras default, NVIDIA Tensor Cores |
NCHW stores all values of a single channel contiguously in memory, which can benefit certain GPU kernels. NHWC stores all channels for a single spatial location together, which is the preferred layout for NVIDIA Tensor Cores and often yields faster training when using mixed precision. PyTorch supports both layouts through its channels_last memory format, introduced to take advantage of Tensor Core acceleration.
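As a sketch of how the two layouts coexist in PyTorch, the snippet below converts a tensor and a small convolution to the channels_last memory format; the logical shape remains NCHW, and only the underlying stride order changes. The tensor sizes here are illustrative.

```python
import torch
import torch.nn as nn

x = torch.randn(16, 3, 224, 224)                      # logical layout: NCHW
x = x.contiguous(memory_format=torch.channels_last)   # physical layout: NHWC
print(x.shape)      # torch.Size([16, 3, 224, 224]) -- logical shape unchanged
print(x.stride())   # strides reveal the channels-last memory order

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
conv = conv.to(memory_format=torch.channels_last)     # weights converted as well
out = conv(x)
# convolution output typically stays in the channels_last layout
print(out.is_contiguous(memory_format=torch.channels_last))
```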
Transformer models for natural language processing work with 3-D tensors whose shape is typically (N, L, E), where N is the batch size, L is the sequence length (number of tokens), and E is the embedding dimension. After passing through the final linear layer, the output often becomes (N, L, V), where V is the vocabulary size, representing a probability distribution over tokens at each position.
Some older APIs (and certain NVIDIA libraries) place the sequence dimension first, using the (L, N, E) convention, so checking the documentation for each library is important.
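Both conventions can be seen in PyTorch's nn.MultiheadAttention, whose batch_first flag switches between the default (L, N, E) layout and (N, L, E). The sketch below uses the (N, L, E) sizes from the text; the specific values are illustrative.

```python
import torch
import torch.nn as nn

N, L, E = 32, 128, 768                    # batch, sequence length, embedding dim
x = torch.randn(N, L, E)

# batch_first=True expects and returns (N, L, E)
mha = nn.MultiheadAttention(embed_dim=E, num_heads=8, batch_first=True)
out, _ = mha(x, x, x)
print(out.shape)                          # torch.Size([32, 128, 768])

# The default (batch_first=False) uses the older (L, N, E) convention
mha_lne = nn.MultiheadAttention(embed_dim=E, num_heads=8)
out2, _ = mha_lne(x.transpose(0, 1), x.transpose(0, 1), x.transpose(0, 1))
print(out2.shape)                         # torch.Size([128, 32, 768])
```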
Audio is commonly represented as a 3-D tensor of shape (N, C, T) for raw waveforms (batch, channels, time samples) or (N, C, F, T) for spectrograms (batch, channels, frequency bins, time frames).
Changing the shape of a tensor without altering (or selectively altering) its underlying data is one of the most frequent tasks in deep learning code. The table below summarizes the main operations.
| Operation | Description | Key constraint | Example (PyTorch) |
|---|---|---|---|
| Reshape | Reinterprets the data with a new shape | Total element count must stay the same | x.reshape(2, 6) on shape (3, 4) |
| View | Same as reshape but requires contiguous memory | Tensor must be contiguous; shares memory with original | x.view(2, 6) |
| Permute | Reorders the dimensions (axes) | Does not change element count; may make tensor non-contiguous | x.permute(0, 2, 1) swaps axes 1 and 2 |
| Transpose | Swaps exactly two dimensions | Limited to two axes at a time | x.transpose(1, 2) |
| Squeeze | Removes all dimensions of size 1, or a specified one | Only affects size-1 dimensions | x.squeeze(1) on shape (3, 1, 4) gives (3, 4) |
| Unsqueeze | Inserts a new dimension of size 1 at a given position | Adds exactly one axis | x.unsqueeze(0) on shape (3, 4) gives (1, 3, 4) |
| Expand / Repeat | Replicates data along one or more dimensions | Expand uses no extra memory (virtual repeat); repeat copies data | x.expand(4, 3, 4) on shape (1, 3, 4) |
| Flatten | Collapses a contiguous range of dims into one | Specified dims must be contiguous | x.flatten(1, 2) on shape (2, 3, 4) gives (2, 12) |
| Concatenate | Joins tensors along an existing dimension | All other dimensions must match | torch.cat([a, b], dim=0) |
| Stack | Joins tensors along a new dimension | All shapes must be identical | torch.stack([a, b], dim=0) |
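The following sketch exercises several of the operations from the table on a small tensor so the resulting shapes can be checked directly; the example sizes are arbitrary.

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)     # shape (2, 3, 4), 24 elements

print(x.reshape(6, 4).shape)              # (6, 4)   -- same 24 elements
print(x.permute(0, 2, 1).shape)           # (2, 4, 3) -- axes 1 and 2 reordered
print(x.transpose(1, 2).shape)            # (2, 4, 3) -- same result here
print(x.unsqueeze(0).shape)               # (1, 2, 3, 4)
print(x.unsqueeze(0).squeeze(0).shape)    # (2, 3, 4)
print(x.flatten(1, 2).shape)              # (2, 12)

a = torch.zeros(2, 3, 4)
b = torch.ones(2, 3, 4)
print(torch.cat([a, b], dim=0).shape)     # (4, 3, 4)    -- existing axis grows
print(torch.stack([a, b], dim=0).shape)   # (2, 2, 3, 4) -- new axis inserted
```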
In PyTorch, view() and reshape() both produce a tensor with a different shape but the same data. The key difference is that view() requires the source tensor to be contiguous in memory and always returns a tensor that shares storage with the original. reshape() works on both contiguous and non-contiguous tensors; it returns a view when possible and falls back to copying the data when a view is not feasible. Using view() is slightly more explicit because it will raise an error if the memory layout does not support a zero-copy view, which can help catch bugs early.
A tensor is contiguous when its elements are stored in memory in the same order they would be visited by iterating over the tensor in row-major (C-style) order. Operations like transpose() and permute() change the stride metadata but do not move data in memory, so the result is typically non-contiguous. Calling .contiguous() on such a tensor copies the data into a new, contiguous block of memory. Many operations (including view()) require contiguity.
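A minimal sketch of this behavior: transposing changes only the stride metadata, after which view() fails but reshape(), or .contiguous() followed by view(), succeeds.

```python
import torch

x = torch.arange(12).reshape(3, 4)   # contiguous, row-major
y = x.t()                            # shape (4, 3), strides swapped, NOT contiguous
print(y.is_contiguous())             # False

try:
    y.view(12)                       # view() requires contiguous memory
except RuntimeError as e:
    print("view failed:", e)

z = y.reshape(12)                    # reshape() falls back to copying the data
w = y.contiguous().view(12)          # explicit copy, then zero-copy view
print(z.shape, w.shape)              # torch.Size([12]) torch.Size([12])
```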
Broadcasting is the mechanism by which frameworks automatically expand the shapes of tensors so that element-wise operations can be performed on tensors of different shapes without explicitly copying data. The rules originated in NumPy and have been adopted by PyTorch, TensorFlow, and JAX.
Two tensors are broadcastable if, when comparing their shapes element-wise starting from the trailing (rightmost) dimension, each pair of dimensions satisfies one of the following: the sizes are equal, one of the sizes is 1, or the dimension is missing in one of the tensors (missing leading dimensions are treated as size 1).
The output shape takes the maximum size along each dimension.
| Tensor A shape | Tensor B shape | Result shape | Broadcastable? |
|---|---|---|---|
| (5, 3, 4, 1) | (3, 1, 1) | (5, 3, 4, 1) | Yes |
| (1,) | (3, 1, 7) | (3, 1, 7) | Yes |
| (15, 3, 5) | (3, 1) | (15, 3, 5) | Yes |
| (5, 4) | (4,) | (5, 4) | Yes |
| (8, 1, 6, 1) | (7, 1, 5) | (8, 7, 6, 5) | Yes |
| (3,) | (4,) | N/A | No (trailing dims 3 vs 4) |
| (2, 1) | (8, 4, 3) | N/A | No (dim mismatch 2 vs 4) |
A common use of broadcasting is adding a bias vector of shape (C,) to a batch of feature maps of shape (N, C, H, W) in a convolutional layer. The bias is implicitly expanded to (1, C, 1, 1) before addition.
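A sketch of this bias-addition pattern, with the reshape that broadcasting performs implicitly written out explicitly (the sizes are illustrative):

```python
import torch

N, C, H, W = 16, 64, 28, 28
feature_maps = torch.randn(N, C, H, W)
bias = torch.randn(C)                            # shape (C,)

# Reshape the bias to (1, C, 1, 1) so it lines up with (N, C, H, W);
# broadcasting then expands it virtually across the batch and spatial dims.
out = feature_maps + bias.view(1, C, 1, 1)
print(out.shape)                                 # torch.Size([16, 64, 28, 28])

# Equivalent indexing-based form
out2 = feature_maps + bias[None, :, None, None]
print(torch.allclose(out, out2))                 # True
```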
In PyTorch, in-place operations (e.g. x.add_(y)) do not allow the shape of x to change as a result of broadcasting. If the broadcast would require x to grow, a RuntimeError is raised.
Understanding how each type of layer transforms the shape of its input is essential for building and debugging models.
A linear layer with in_features inputs and out_features outputs transforms shape (*, in_features) to (*, out_features), where * represents any number of leading batch dimensions.
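A brief sketch of this behavior: the same nn.Linear layer accepts a 2-D batch and a 3-D (batch, sequence, feature) input, transforming only the last dimension. The feature sizes below are arbitrary examples.

```python
import torch
import torch.nn as nn

linear = nn.Linear(in_features=768, out_features=256)

flat = torch.randn(64, 768)        # (batch, in_features)
print(linear(flat).shape)          # torch.Size([64, 256])

seq = torch.randn(32, 128, 768)    # (batch, seq_len, in_features)
print(linear(seq).shape)           # torch.Size([32, 128, 256]) -- leading dims kept
```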
For a 2-D convolutional layer, the spatial dimensions of the output are determined by:
H_out = floor((H_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)
W_out = floor((W_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)
The number of output channels equals the number of filters (out_channels), and the batch dimension is unchanged. A full shape transformation example:
| Parameter | Value |
|---|---|
| Input shape | (N, 3, 224, 224) |
| out_channels | 64 |
| kernel_size | 7 |
| stride | 2 |
| padding | 3 |
| dilation | 1 |
| Output shape | (N, 64, 112, 112) |
Applying the formula: floor((224 + 2 x 3 - 1 x (7 - 1) - 1) / 2 + 1) = floor((224 + 6 - 6 - 1) / 2 + 1) = floor(223 / 2 + 1) = floor(112.5) = 112.
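The calculation can be verified directly in PyTorch; the sketch below uses the parameter values from the table and an arbitrary batch size of 8.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64,
                 kernel_size=7, stride=2, padding=3, dilation=1)

x = torch.randn(8, 3, 224, 224)    # (N, C, H, W)
print(conv(x).shape)               # torch.Size([8, 64, 112, 112])
```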
Pooling layers (max pool, average pool) follow the same spatial output formula as convolutional layers but do not change the channel dimension.
An LSTM or GRU with input shape (N, L, H_in) and hidden_size H produces an output of shape (N, L, D * H), where D is 2 for bidirectional and 1 otherwise.
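For instance, a bidirectional LSTM in PyTorch (with batch_first=True so the input is (N, L, H_in)); the sizes chosen here are illustrative:

```python
import torch
import torch.nn as nn

N, L, H_in, H = 32, 50, 128, 256
lstm = nn.LSTM(input_size=H_in, hidden_size=H,
               batch_first=True, bidirectional=True)   # D = 2

x = torch.randn(N, L, H_in)
out, (h_n, c_n) = lstm(x)
print(out.shape)    # torch.Size([32, 50, 512])  -- (N, L, D * H)
print(h_n.shape)    # torch.Size([2, 32, 256])   -- (D * num_layers, N, H)
```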
A standard multi-head self-attention layer preserves the input shape (N, L, E). The queries, keys, and values are internally reshaped from (N, L, E) to (N, num_heads, L, E // num_heads) for parallel attention computation, then reshaped back. The feed-forward network inside each transformer block temporarily projects to a higher dimension (often 4E) and then back to E, again preserving the overall shape (N, L, E).
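A sketch of the head-splitting reshape described above, written with plain view/transpose calls; the names num_heads and head_dim and the specific sizes are illustrative.

```python
import torch

N, L, E, num_heads = 32, 128, 768, 12
head_dim = E // num_heads                        # 64

q = torch.randn(N, L, E)
# (N, L, E) -> (N, L, num_heads, head_dim) -> (N, num_heads, L, head_dim)
q_heads = q.view(N, L, num_heads, head_dim).transpose(1, 2)
print(q_heads.shape)                             # torch.Size([32, 12, 128, 64])

# After attention, merge the heads back: (N, num_heads, L, head_dim) -> (N, L, E)
q_merged = q_heads.transpose(1, 2).contiguous().view(N, L, E)
print(q_merged.shape)                            # torch.Size([32, 128, 768])
```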
Tensor shape mismatches are among the most frequent runtime errors in deep learning. A 2021 study (Shin et al.) found that shape-related bugs in neural network training code are both common and difficult to detect statically. The following strategies help prevent and diagnose these errors.
| Error type | Typical symptom | Example |
|---|---|---|
| Mismatched matrix multiply dimensions | RuntimeError: mat1 and mat2 shapes cannot be multiplied | The in_features of a linear layer does not match the last dimension of the input |
| Wrong number of dimensions | RuntimeError: Expected 4-dimensional input for spatial ... but got 3-dimensional input | Forgetting the batch dimension when feeding a single image to a CNN |
| Incompatible broadcast | RuntimeError: The size of tensor a (X) must match the size of tensor b (Y) at non-singleton dimension Z | Trying to add tensors whose shapes violate broadcasting rules |
| Invalid reshape | RuntimeError: shape [X] is invalid for input of size Y | Reshaping to a shape whose total element count differs from the source |
| Last-batch size mismatch | RuntimeError: size mismatch, m1: [A x B], m2: [C x D] | The final mini-batch is smaller than batch_size and a layer expects a fixed size |
Several practical strategies help locate the point where a shape goes wrong:

- Insert print(x.shape) before and after each layer or operation inside the forward() method. This is the fastest way to find the offending operation; a reusable variant using forward hooks is sketched below.
- Run a dry forward pass with a dummy input, for example x = torch.randn(1, 3, 224, 224, device='meta'); out = model(x); print(out.shape), to check the end-to-end shape flow without allocating real data.
- Model summary tools such as torchinfo (formerly torchsummary) print a table of layer names, output shapes, and parameter counts for a given input size.
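As a sketch of the shape-printing approach, forward hooks can log every layer's output shape without editing the model code; the model below is an arbitrary example.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
)

def print_shape(module, inputs, output):
    # Runs after each module's forward(); logs the module type and output shape
    print(f"{module.__class__.__name__:>10}: {tuple(output.shape)}")

hooks = [m.register_forward_hook(print_shape) for m in model]

model(torch.randn(4, 3, 32, 32))   # prints the shape after every layer
for h in hooks:
    h.remove()                     # detach the hooks when done
```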
The distinction between static and dynamic shapes arises when compiling or tracing a model.

Static shapes are fixed at graph-construction or compilation time. TensorFlow 1.x graph mode and TensorRT require static shapes by default, which enables aggressive kernel fusion and memory planning but limits flexibility.
Dynamic shapes allow one or more dimensions to vary between invocations. This is the default behavior in PyTorch eager mode and TensorFlow 2.x eager mode. When using torch.compile(), PyTorch initially treats all shapes as static and recompiles if a shape changes. Developers can mark dimensions as dynamic with torch._dynamo.mark_dynamic() to avoid repeated recompilation. Internally, PyTorch uses SymPy to represent symbolic shape expressions that are solved at dispatch time.
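A minimal sketch of this workflow (assuming a recent PyTorch version; mark_dynamic lives in the private torch._dynamo namespace mentioned above, so details may change between releases):

```python
import torch

@torch.compile
def scale(x):
    return x * 2.0

x = torch.randn(8, 128)
torch._dynamo.mark_dynamic(x, 0)    # treat dim 0 (batch) as dynamic so the first
scale(x)                            # compilation produces a symbolic batch size
scale(torch.randn(16, 128))         # a different batch size should not recompile
```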
Bounded dynamic shapes (used by PyTorch/XLA for TPUs) restrict dynamic dimensions to a declared range, allowing the compiler to allocate a fixed memory budget while still accepting variable-length inputs.
Einops is a library that provides a concise, readable notation for tensor shape operations. Published as a conference paper at ICLR 2022, einops offers three core functions, rearrange, reduce, and repeat, that replace many individual calls to reshape, transpose, permute, squeeze, and unsqueeze.
For example, converting an image batch from NHWC to NCHW:
```python
from einops import rearrange

# x has shape (batch, height, width, channels)
x = rearrange(x, 'b h w c -> b c h w')
```
Splitting an embedding dimension into multiple attention heads:
```python
# q has shape (batch, seq_len, num_heads * head_dim)
q = rearrange(q, 'b s (h d) -> b h s d', h=8)
```
The einops notation makes the intended shape transformation self-documenting, reducing the risk of silent shape errors that can occur with chains of .view() and .permute() calls.
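The reduce function covers aggregation-plus-reshape patterns as well; for example, global average pooling over the spatial dimensions of an NCHW batch (the sizes below are illustrative):

```python
import torch
from einops import reduce

x = torch.randn(16, 64, 7, 7)                   # (batch, channels, height, width)
pooled = reduce(x, 'b c h w -> b c', 'mean')    # average over h and w
print(pooled.shape)                             # torch.Size([16, 64])
```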
A known limitation of positional-index-based shape manipulation is that axes are identified only by integer positions, making code error-prone and difficult to read. Several projects aim to attach human-readable names to tensor dimensions.
PyTorch Named Tensors (prototype API) let you assign names to dimensions at creation time, for example torch.zeros(2, 3, names=('N', 'C')). Operations then check dimension names for correctness at runtime, catching permutation errors that would otherwise produce silent bugs.
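A brief sketch of the prototype named-tensor API (behavior may change, since the feature remains a prototype):

```python
import torch

x = torch.zeros(2, 3, names=('N', 'C'))
print(x.names)            # ('N', 'C')

# Reductions can refer to dimensions by name instead of position
totals = x.sum('C')
print(totals.shape)       # torch.Size([2])

# Name mismatches are caught at runtime instead of silently broadcasting
y = torch.zeros(2, 3, names=('N', 'H'))
try:
    x + y
except RuntimeError as e:
    print("name check failed:", e)
```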
Named Tensor Notation, proposed by Chiang and Rush (2021), is a formal notation that uses subscript names on tensors to make dimension semantics explicit in mathematical writing, analogous to how einsum uses named indices.
Xarray and xarray-jax bring labeled, named dimensions to JAX and NumPy arrays, and are widely used in scientific computing.
Tensor decomposition methods factorize a high-dimensional tensor into smaller tensors, effectively changing the shape representation while preserving (or approximating) the information content.
| Decomposition | Input shape | Output shapes (conceptual) | Use in ML |
|---|---|---|---|
| CP (CANDECOMP/PARAFAC) | (I, J, K) | R vectors of sizes (I,), (J,), (K,) | Compressing convolutional filters, recommender systems |
| Tucker | (I, J, K) | Core tensor (R1, R2, R3) + factor matrices (I, R1), (J, R2), (K, R3) | Model compression, higher-order SVD |
| Tensor Train (TT) | (I1, I2, ..., In) | Chain of 3-D cores | Compressing large embedding tables, physics simulations |
These decompositions reduce parameter counts and computational cost while transforming the original tensor's shape into a set of smaller, structured shapes. They are used in practice to compress deep neural network layers for deployment on resource-constrained devices.
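To make the shape bookkeeping concrete, the sketch below reconstructs a rank-R CP tensor from three factor matrices using einsum; it illustrates only the shape structure and parameter savings, not an actual decomposition algorithm, and the sizes are arbitrary.

```python
import torch

I, J, K, R = 10, 12, 14, 3
# One factor matrix per mode; column r holds the r-th rank-1 component
A = torch.randn(I, R)
B = torch.randn(J, R)
C = torch.randn(K, R)

# Sum of R outer products: X[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
X = torch.einsum('ir,jr,kr->ijk', A, B, C)
print(X.shape)                              # torch.Size([10, 12, 14])

# Parameter count of the factors vs. the full tensor
print((I + J + K) * R, "vs", I * J * K)     # 108 vs 1680
```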
The following table compares how common tensor operations are invoked across the three major frameworks.
| Operation | NumPy | PyTorch | TensorFlow |
|---|---|---|---|
| Get shape | a.shape | x.shape or x.size() | x.shape or tf.shape(x) |
| Reshape | np.reshape(a, (2, 6)) | x.reshape(2, 6) or x.view(2, 6) | tf.reshape(x, [2, 6]) |
| Transpose | np.transpose(a, (1, 0, 2)) | x.permute(1, 0, 2) | tf.transpose(x, perm=[1, 0, 2]) |
| Add axis | np.expand_dims(a, 0) | x.unsqueeze(0) | tf.expand_dims(x, 0) |
| Remove size-1 axis | np.squeeze(a) | x.squeeze() | tf.squeeze(x) |
| Concatenate | np.concatenate([a, b], axis=0) | torch.cat([x, y], dim=0) | tf.concat([x, y], axis=0) |
| Stack | np.stack([a, b], axis=0) | torch.stack([x, y], dim=0) | tf.stack([x, y], axis=0) |
| Number of elements | a.size | x.numel() | tf.size(x) |
A few best practices reduce the likelihood of shape-related bugs:

- Add shape assertions such as assert x.shape == (batch, channels, h, w) to catch errors early.
- Use -1 to let the framework infer one dimension: x.reshape(batch, -1) flattens all trailing dimensions.
- Define named constants such as BATCH = 32; SEQ_LEN = 512; EMBED = 768 and use them in shape assertions and reshapes to make code self-documenting.
- Remember that view() shares storage with the original tensor. When you need an independent copy, use reshape() (which copies when a zero-copy view is not possible) or call .contiguous() first.
- Guard against last-batch size mismatches: set drop_last=True in your data loader or ensure your model code does not hardcode the batch size.