A tensor shape is a tuple of integers that describes the number of elements along each dimension (or axis) of a tensor. In machine learning and deep learning, tensor shape is one of the most frequently encountered concepts because it governs how data flows through every layer of a neural network, how operations combine tensors, and how memory is allocated on hardware accelerators like GPUs and TPUs. A mismatch in tensor shapes is one of the most common sources of runtime errors during model development, making a solid understanding of shapes, ranks, and dimension conventions essential for practitioners.
Imagine you have a box of crayons. If you line up 8 crayons in a single row, the "shape" of that row is just (8). Now picture a muffin tin that has 3 rows and 4 columns of cups. Its shape is (3, 4), because you need two numbers to describe where each cup is. If you stack several muffin tins on top of each other, say 5 of them, you now need three numbers: (5, 3, 4). That is exactly what tensor shape does. It tells you how many slots exist along each direction of your container, so you (and the computer) always know exactly how the data is organized.
A tensor is a multidimensional array of numerical values arranged in a regular grid. Its shape is the tuple that lists the size of every dimension. Several closely related terms appear throughout the literature, and their usage varies between mathematics, physics, and computer science.
| Term | Meaning in computer science / ML | Meaning in mathematics / physics |
|---|---|---|
| Rank (also called order or ndim) | The number of dimensions of the tensor (the length of the shape tuple). A scalar has rank 0, a vector has rank 1, a matrix has rank 2. | The number of indices needed to address a component. In physics, tensors of rank n may be further classified by their contravariant and covariant index structure. |
| Shape | The tuple of dimension sizes, e.g. (3, 224, 224). | Sometimes called the "type" or "signature" of the tensor when referring to its index structure. |
| Axis (or dimension) | A single positional index within the shape tuple. Axis 0 is the first dimension, axis 1 is the second, and so on. | Equivalent to a particular mode of the tensor. |
| Size (of a dimension) | The number of elements along that axis. | The range of the corresponding index. |
| Dtype | The data type of the tensor elements (e.g. float32, int64). Not part of shape, but closely related because it determines memory usage per element. | N/A |
The total number of elements (sometimes called numel) in a tensor equals the product of all dimension sizes. For example, a tensor of shape (2, 3, 4) contains 2 x 3 x 4 = 24 elements.
The following table summarizes the most frequently used tensor ranks and their typical roles in machine learning.
| Rank | Common name | Example shape | Typical use in ML |
|---|---|---|---|
| 0 | Scalar | () | A single loss value, learning rate, or metric |
| 1 | Vector | (512,) | A bias vector, a 1-D embedding |
| 2 | Matrix | (64, 768) | A batch of feature vectors, a weight matrix in a linear layer |
| 3 | 3-D tensor | (32, 128, 768) | A batch of token sequences in an NLP transformer (batch, sequence length, embedding dim) |
| 4 | 4-D tensor | (16, 3, 224, 224) | A batch of RGB images for a convolutional neural network (batch, channels, height, width) |
| 5 | 5-D tensor | (8, 3, 16, 112, 112) | A batch of video clips (batch, channels, frames, height, width) |
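As a quick illustration of these ranks, the sketch below (using PyTorch; NumPy and TensorFlow expose the same information under slightly different names) creates one tensor per common rank and prints its shape, rank, and element count. The sizes are taken from the example shapes in the table.

```python
import torch

loss = torch.tensor(3.14)               # rank 0: scalar, shape ()
bias = torch.zeros(512)                 # rank 1: vector, shape (512,)
features = torch.randn(64, 768)         # rank 2: matrix, shape (64, 768)
tokens = torch.randn(32, 128, 768)      # rank 3: (batch, seq_len, embed_dim)
images = torch.randn(16, 3, 224, 224)   # rank 4: (batch, channels, height, width)

for name, t in [("loss", loss), ("bias", bias), ("features", features),
                ("tokens", tokens), ("images", images)]:
    # shape is the dimension tuple, ndim its length, numel() the product of sizes
    print(f"{name}: shape={tuple(t.shape)}, rank={t.ndim}, numel={t.numel()}")
```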
Different application areas and frameworks follow different ordering conventions for the dimensions of their tensors. Understanding these conventions is necessary when converting data between frameworks or when reading model code.
Computer vision models process images as 4-D tensors. The two dominant conventions are:
| Convention | Dimension order | Frameworks |
|---|---|---|
| NCHW | Batch, Channels, Height, Width | PyTorch, Caffe, cuDNN default |
| NHWC | Batch, Height, Width, Channels | TensorFlow / Keras default, NVIDIA Tensor Cores |
NCHW stores all values of a single channel contiguously in memory, which can benefit certain GPU kernels. NHWC stores all channels for a single spatial location together, which is the preferred layout for NVIDIA Tensor Cores and often yields faster training when using mixed precision. PyTorch supports both layouts through its channels_last memory format, introduced to take advantage of Tensor Core acceleration.
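As a sketch of how the two layouts coexist in PyTorch, the snippet below converts a tensor and a small convolution to the channels_last memory format; the logical shape remains NCHW, and only the underlying stride order changes. The tensor sizes here are illustrative.

```python
import torch
import torch.nn as nn

x = torch.randn(16, 3, 224, 224)                      # logical layout: NCHW
x = x.contiguous(memory_format=torch.channels_last)   # physical layout: NHWC
print(x.shape)      # torch.Size([16, 3, 224, 224]) -- logical shape unchanged
print(x.stride())   # strides reveal the channels-last memory order

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
conv = conv.to(memory_format=torch.channels_last)     # weights converted as well
out = conv(x)
# convolution output typically stays in the channels_last layout
print(out.is_contiguous(memory_format=torch.channels_last))
```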
Transformer models for natural language processing work with 3-D tensors whose shape is typically (N, L, E), where N is the batch size, L is the sequence length (number of tokens), and E is the embedding dimension. After passing through the final linear layer, the output often becomes (N, L, V), where V is the vocabulary size, representing a probability distribution over tokens at each position.
Some older APIs (and certain NVIDIA libraries) place the sequence dimension first, using the (L, N, E) convention, so checking the documentation for each library is important.
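Both conventions can be seen in PyTorch's nn.MultiheadAttention, whose batch_first flag switches between the default (L, N, E) layout and (N, L, E). The sketch below uses the (N, L, E) sizes from the text; the specific values are illustrative.

```python
import torch
import torch.nn as nn

N, L, E = 32, 128, 768                    # batch, sequence length, embedding dim
x = torch.randn(N, L, E)

# batch_first=True expects and returns (N, L, E)
mha = nn.MultiheadAttention(embed_dim=E, num_heads=8, batch_first=True)
out, _ = mha(x, x, x)
print(out.shape)                          # torch.Size([32, 128, 768])

# The default (batch_first=False) uses the older (L, N, E) convention
mha_lne = nn.MultiheadAttention(embed_dim=E, num_heads=8)
out2, _ = mha_lne(x.transpose(0, 1), x.transpose(0, 1), x.transpose(0, 1))
print(out2.shape)                         # torch.Size([128, 32, 768])
```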
Audio is commonly represented as a 3-D tensor of shape (N, C, T) for raw waveforms (batch, channels, time samples) or (N, C, F, T) for spectrograms (batch, channels, frequency bins, time frames).
Changing the shape of a tensor without altering (or selectively altering) its underlying data is one of the most frequent tasks in deep learning code. The table below summarizes the main operations.
| Operation | Description | Key constraint | Example (PyTorch) |
|---|---|---|---|
| Reshape | Reinterprets the data with a new shape | Total element count must stay the same | x.reshape(2, 6) on shape (3, 4) |
| View | Same as reshape but requires contiguous memory | Tensor must be contiguous; shares memory with original | x.view(2, 6) |
| Permute | Reorders the dimensions (axes) | Does not change element count; may make tensor non-contiguous | x.permute(0, 2, 1) swaps axes 1 and 2 |
| Transpose | Swaps exactly two dimensions | Limited to two axes at a time | x.transpose(1, 2) |
| Squeeze | Removes all dimensions of size 1, or a specified one | Only affects size-1 dimensions | x.squeeze(1) on shape (3, 1, 4) gives (3, 4) |
| Unsqueeze | Inserts a new dimension of size 1 at a given position | Adds exactly one axis | x.unsqueeze(0) on shape (3, 4) gives (1, 3, 4) |
| Expand / Repeat | Replicates data along one or more dimensions | Expand uses no extra memory (virtual repeat); repeat copies data | x.expand(4, 3, 4) on shape (1, 3, 4) |
| Flatten | Collapses a contiguous range of dims into one | Specified dims must be contiguous | x.flatten(1, 2) on shape (2, 3, 4) gives (2, 12) |
| Concatenate | Joins tensors along an existing dimension | All other dimensions must match | torch.cat([a, b], dim=0) |
| Stack | Joins tensors along a new dimension | All shapes must be identical | torch.stack([a, b], dim=0) |
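The following sketch exercises several of the operations from the table on a small tensor so the resulting shapes can be checked directly; the example sizes are arbitrary.

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)     # shape (2, 3, 4), 24 elements

print(x.reshape(6, 4).shape)              # (6, 4)   -- same 24 elements
print(x.permute(0, 2, 1).shape)           # (2, 4, 3) -- axes 1 and 2 reordered
print(x.transpose(1, 2).shape)            # (2, 4, 3) -- same result here
print(x.unsqueeze(0).shape)               # (1, 2, 3, 4)
print(x.unsqueeze(0).squeeze(0).shape)    # (2, 3, 4)
print(x.flatten(1, 2).shape)              # (2, 12)

a = torch.zeros(2, 3, 4)
b = torch.ones(2, 3, 4)
print(torch.cat([a, b], dim=0).shape)     # (4, 3, 4)    -- existing axis grows
print(torch.stack([a, b], dim=0).shape)   # (2, 2, 3, 4) -- new axis inserted
```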
In PyTorch, view() and reshape() both produce a tensor with a different shape but the same data. The key difference is that view() requires the source tensor to be contiguous in memory and always returns a tensor that shares storage with the original. reshape() works on both contiguous and non-contiguous tensors; it returns a view when possible and falls back to copying the data when a view is not feasible. Using view() is slightly more explicit because it will raise an error if the memory layout does not support a zero-copy view, which can help catch bugs early.
A tensor is contiguous when its elements are stored in memory in the same order they would be visited by iterating over the tensor in row-major (C-style) order. Operations like transpose() and permute() change the stride metadata but do not move data in memory, so the result is typically non-contiguous. Calling .contiguous() on such a tensor copies the data into a new, contiguous block of memory. Many operations (including view()) require contiguity.
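A minimal sketch of this behavior: transposing changes only the stride metadata, after which view() fails but reshape(), or .contiguous() followed by view(), succeeds.

```python
import torch

x = torch.arange(12).reshape(3, 4)   # contiguous, row-major
y = x.t()                            # shape (4, 3), strides swapped, NOT contiguous
print(y.is_contiguous())             # False

try:
    y.view(12)                       # view() requires contiguous memory
except RuntimeError as e:
    print("view failed:", e)

z = y.reshape(12)                    # reshape() falls back to copying the data
w = y.contiguous().view(12)          # explicit copy, then zero-copy view
print(z.shape, w.shape)              # torch.Size([12]) torch.Size([12])
```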
Broadcasting is the mechanism by which frameworks automatically expand the shapes of tensors so that element-wise operations can be performed on tensors of different shapes without explicitly copying data. The rules originated in NumPy and have been adopted by PyTorch, TensorFlow, and JAX.
Two tensors are broadcastable if, when comparing their shapes element-wise starting from the trailing (rightmost) dimension, each pair of dimensions satisfies one of the following: the sizes are equal, one of the sizes is 1, or the dimension is missing in one of the tensors (missing leading dimensions are treated as size 1).
The output shape takes the maximum size along each dimension.
| Tensor A shape | Tensor B shape | Result shape | Broadcastable? |
|---|---|---|---|
| (5, 3, 4, 1) | (3, 1, 1) | (5, 3, 4, 1) | Yes |
| (1,) | (3, 1, 7) | (3, 1, 7) | Yes |
| (15, 3, 5) | (3, 1) | (15, 3, 5) | Yes |
| (5, 4) | (4,) | (5, 4) | Yes |
| (8, 1, 6, 1) | (7, 1, 5) | (8, 7, 6, 5) | Yes |
| (3,) | (4,) | N/A | No (trailing dims 3 vs 4) |
| (2, 1) | (8, 4, 3) | N/A | No (dim mismatch 2 vs 4) |
A common use of broadcasting is adding a bias vector of shape (C,) to a batch of feature maps of shape (N, C, H, W) in a convolutional layer. The bias is implicitly expanded to (1, C, 1, 1) before addition.
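A sketch of this bias-addition pattern, with the reshape that broadcasting performs implicitly written out explicitly (the sizes are illustrative):

```python
import torch

N, C, H, W = 16, 64, 28, 28
feature_maps = torch.randn(N, C, H, W)
bias = torch.randn(C)                            # shape (C,)

# Reshape the bias to (1, C, 1, 1) so it lines up with (N, C, H, W);
# broadcasting then expands it virtually across the batch and spatial dims.
out = feature_maps + bias.view(1, C, 1, 1)
print(out.shape)                                 # torch.Size([16, 64, 28, 28])

# Equivalent indexing-based form
out2 = feature_maps + bias[None, :, None, None]
print(torch.allclose(out, out2))                 # True
```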
In PyTorch, in-place operations (e.g. x.add_(y)) do not allow the shape of x to change as a result of broadcasting. If the broadcast would require x to grow, a RuntimeError is raised.
Understanding how each type of layer transforms the shape of its input is essential for building and debugging models.
A linear layer with in_features inputs and out_features outputs transforms shape (*, in_features) to (*, out_features), where * represents any number of leading batch dimensions.
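A brief sketch of this behavior: the same nn.Linear layer accepts a 2-D batch and a 3-D (batch, sequence, feature) input, transforming only the last dimension. The feature sizes below are arbitrary examples.

```python
import torch
import torch.nn as nn

linear = nn.Linear(in_features=768, out_features=256)

flat = torch.randn(64, 768)        # (batch, in_features)
print(linear(flat).shape)          # torch.Size([64, 256])

seq = torch.randn(32, 128, 768)    # (batch, seq_len, in_features)
print(linear(seq).shape)           # torch.Size([32, 128, 256]) -- leading dims kept
```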
For a 2-D convolutional layer, the spatial dimensions of the output are determined by:
H_out = floor((H_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)
W_out = floor((W_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)
The number of output channels equals the number of filters (out_channels), and the batch dimension is unchanged. A full shape transformation example:
| Parameter | Value |
|---|---|
| Input shape | (N, 3, 224, 224) |
| out_channels | 64 |
| kernel_size | 7 |
| stride | 2 |
| padding | 3 |
| dilation | 1 |
| Output shape | (N, 64, 112, 112) |
Applying the formula: floor((224 + 2 x 3 - 1 x (7 - 1) - 1) / 2 + 1) = floor((224 + 6 - 6 - 1) / 2 + 1) = floor(223 / 2 + 1) = floor(112.5) = 112.
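The calculation can be verified directly in PyTorch; the sketch below uses the parameter values from the table and an arbitrary batch size of 8.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64,
                 kernel_size=7, stride=2, padding=3, dilation=1)

x = torch.randn(8, 3, 224, 224)    # (N, C, H, W)
print(conv(x).shape)               # torch.Size([8, 64, 112, 112])
```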
Pooling layers (max pool, average pool) follow the same spatial output formula as convolutional layers but do not change the channel dimension.
An LSTM or GRU with input shape (N, L, H_in) and hidden_size H produces an output of shape (N, L, D * H), where D is 2 for bidirectional and 1 otherwise.
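For instance, a bidirectional LSTM in PyTorch (with batch_first=True so the input is (N, L, H_in)); the sizes chosen here are illustrative:

```python
import torch
import torch.nn as nn

N, L, H_in, H = 32, 50, 128, 256
lstm = nn.LSTM(input_size=H_in, hidden_size=H,
               batch_first=True, bidirectional=True)   # D = 2

x = torch.randn(N, L, H_in)
out, (h_n, c_n) = lstm(x)
print(out.shape)    # torch.Size([32, 50, 512])  -- (N, L, D * H)
print(h_n.shape)    # torch.Size([2, 32, 256])   -- (D * num_layers, N, H)
```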
A standard multi-head self-attention layer preserves the input shape (N, L, E). The queries, keys, and values are internally reshaped from (N, L, E) to (N, num_heads, L, E // num_heads) for parallel attention computation, then reshaped back. The feed-forward network inside each transformer block temporarily projects to a higher dimension (often 4E) and then back to E, again preserving the overall shape (N, L, E).
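A sketch of the head-splitting reshape described above, written with plain view/transpose calls; the names num_heads and head_dim and the specific sizes are illustrative.

```python
import torch

N, L, E, num_heads = 32, 128, 768, 12
head_dim = E // num_heads                        # 64

q = torch.randn(N, L, E)
# (N, L, E) -> (N, L, num_heads, head_dim) -> (N, num_heads, L, head_dim)
q_heads = q.view(N, L, num_heads, head_dim).transpose(1, 2)
print(q_heads.shape)                             # torch.Size([32, 12, 128, 64])

# After attention, merge the heads back: (N, num_heads, L, head_dim) -> (N, L, E)
q_merged = q_heads.transpose(1, 2).contiguous().view(N, L, E)
print(q_merged.shape)                            # torch.Size([32, 128, 768])
```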
Tensor shape mismatches are among the most frequent runtime errors in deep learning. A 2021 study (Shin et al.) found that shape-related bugs in neural network training code are both common and difficult to detect statically. The following strategies help prevent and diagnose these errors.
| Error type | Typical symptom | Example |
|---|---|---|
| Mismatched matrix multiply dimensions | RuntimeError: mat1 and mat2 shapes cannot be multiplied | The in_features of a linear layer does not match the last dimension of the input |
| Wrong number of dimensions | RuntimeError: Expected 4-dimensional input for spatial ... but got 3-dimensional input | Forgetting the batch dimension when feeding a single image to a CNN |
| Incompatible broadcast | RuntimeError: The size of tensor a (X) must match the size of tensor b (Y) at non-singleton dimension Z | Trying to add tensors whose shapes violate broadcasting rules |
| Invalid reshape | RuntimeError: shape [X] is invalid for input of size Y | Reshaping to a shape whose total element count differs from the source |
| Last-batch size mismatch | RuntimeError: size mismatch, m1: [A x B], m2: [C x D] | The final mini-batch is smaller than batch_size and a layer expects a fixed size |
Several practical strategies help locate the point where a shape goes wrong:

- Insert print(x.shape) before and after each layer or operation inside the forward() method. This is the fastest way to find the offending operation; a reusable variant using forward hooks is sketched below.
- Run a dry forward pass with a dummy input, for example x = torch.randn(1, 3, 224, 224, device='meta'); out = model(x); print(out.shape), to check the end-to-end shape flow without allocating real data.
- Model summary tools such as torchinfo (formerly torchsummary) print a table of layer names, output shapes, and parameter counts for a given input size.
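As a sketch of the shape-printing approach, forward hooks can log every layer's output shape without editing the model code; the model below is an arbitrary example.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
)

def print_shape(module, inputs, output):
    # Runs after each module's forward(); logs the module type and output shape
    print(f"{module.__class__.__name__:>10}: {tuple(output.shape)}")

hooks = [m.register_forward_hook(print_shape) for m in model]

model(torch.randn(4, 3, 32, 32))   # prints the shape after every layer
for h in hooks:
    h.remove()                     # detach the hooks when done
```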
The distinction between static and dynamic shapes arises when compiling or tracing a model.

Static shapes are fixed at graph-construction or compilation time. TensorFlow 1.x graph mode and TensorRT require static shapes by default, which enables aggressive kernel fusion and memory planning but limits flexibility.
Dynamic shapes allow one or more dimensions to vary between invocations. This is the default behavior in PyTorch eager mode and TensorFlow 2.x eager mode. When using torch.compile(), PyTorch initially treats all shapes as static and recompiles if a shape changes. Developers can mark dimensions as dynamic with torch._dynamo.mark_dynamic() to avoid repeated recompilation. Internally, PyTorch uses SymPy to represent symbolic shape expressions that are solved at dispatch time.
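A minimal sketch of this workflow (assuming a recent PyTorch version; mark_dynamic lives in the private torch._dynamo namespace mentioned above, so details may change between releases):

```python
import torch

@torch.compile
def scale(x):
    return x * 2.0

x = torch.randn(8, 128)
torch._dynamo.mark_dynamic(x, 0)    # treat dim 0 (batch) as dynamic so the first
scale(x)                            # compilation produces a symbolic batch size
scale(torch.randn(16, 128))         # a different batch size should not recompile
```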
Bounded dynamic shapes (used by PyTorch/XLA for TPUs) restrict dynamic dimensions to a declared range, allowing the compiler to allocate a fixed memory budget while still accepting variable-length inputs.
Einops is a library that provides a concise, readable notation for tensor shape operations. Published as a conference paper at ICLR 2022, einops offers three core functions, rearrange, reduce, and repeat, that replace many individual calls to reshape, transpose, permute, squeeze, and unsqueeze.
For example, converting an image batch from NHWC to NCHW:
```python
from einops import rearrange

# x has shape (batch, height, width, channels)
x = rearrange(x, 'b h w c -> b c h w')
```
Splitting an embedding dimension into multiple attention heads:
```python
# q has shape (batch, seq_len, num_heads * head_dim)
q = rearrange(q, 'b s (h d) -> b h s d', h=8)
```
The einops notation makes the intended shape transformation self-documenting, reducing the risk of silent shape errors that can occur with chains of .view() and .permute() calls.
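The reduce function covers aggregation-plus-reshape patterns as well; for example, global average pooling over the spatial dimensions of an NCHW batch (the sizes below are illustrative):

```python
import torch
from einops import reduce

x = torch.randn(16, 64, 7, 7)                   # (batch, channels, height, width)
pooled = reduce(x, 'b c h w -> b c', 'mean')    # average over h and w
print(pooled.shape)                             # torch.Size([16, 64])
```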
A known limitation of positional-index-based shape manipulation is that axes are identified only by integer positions, making code error-prone and difficult to read. Several projects aim to attach human-readable names to tensor dimensions.
PyTorch Named Tensors (prototype API) let you assign names to dimensions at creation time, for example torch.zeros(2, 3, names=('N', 'C')). Operations then check dimension names for correctness at runtime, catching permutation errors that would otherwise produce silent bugs.
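A brief sketch of the prototype named-tensor API (behavior may change, since the feature remains a prototype):

```python
import torch

x = torch.zeros(2, 3, names=('N', 'C'))
print(x.names)            # ('N', 'C')

# Reductions can refer to dimensions by name instead of position
totals = x.sum('C')
print(totals.shape)       # torch.Size([2])

# Name mismatches are caught at runtime instead of silently broadcasting
y = torch.zeros(2, 3, names=('N', 'H'))
try:
    x + y
except RuntimeError as e:
    print("name check failed:", e)
```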
Named Tensor Notation, proposed by Chiang and Rush (2021), is a formal notation that uses subscript names on tensors to make dimension semantics explicit in mathematical writing, analogous to how einsum uses named indices.
Xarray and xarray-jax bring labeled, named dimensions to JAX and NumPy arrays, and are widely used in scientific computing.
Tensor decomposition methods factorize a high-dimensional tensor into smaller tensors, effectively changing the shape representation while preserving (or approximating) the information content.
| Decomposition | Input shape | Output shapes (conceptual) | Use in ML |
|---|---|---|---|
| CP (CANDECOMP/PARAFAC) | (I, J, K) | R vectors of sizes (I,), (J,), (K,) | Compressing convolutional filters, recommender systems |
| Tucker | (I, J, K) | Core tensor (R1, R2, R3) + factor matrices (I, R1), (J, R2), (K, R3) | Model compression, higher-order SVD |
| Tensor Train (TT) | (I1, I2, ..., In) | Chain of 3-D cores | Compressing large embedding tables, physics simulations |
These decompositions reduce parameter counts and computational cost while transforming the original tensor's shape into a set of smaller, structured shapes. They are used in practice to compress deep neural network layers for deployment on resource-constrained devices.
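To make the shape bookkeeping concrete, the sketch below reconstructs a rank-R CP tensor from three factor matrices using einsum; it illustrates only the shape structure and parameter savings, not an actual decomposition algorithm, and the sizes are arbitrary.

```python
import torch

I, J, K, R = 10, 12, 14, 3
# One factor matrix per mode; column r holds the r-th rank-1 component
A = torch.randn(I, R)
B = torch.randn(J, R)
C = torch.randn(K, R)

# Sum of R outer products: X[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
X = torch.einsum('ir,jr,kr->ijk', A, B, C)
print(X.shape)                              # torch.Size([10, 12, 14])

# Parameter count of the factors vs. the full tensor
print((I + J + K) * R, "vs", I * J * K)     # 108 vs 1680
```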
The following table compares how common tensor operations are invoked across the three major frameworks.
| Operation | NumPy | PyTorch | TensorFlow |
|---|---|---|---|
| Get shape | a.shape | x.shape or x.size() | x.shape or tf.shape(x) |
| Reshape | np.reshape(a, (2, 6)) | x.reshape(2, 6) or x.view(2, 6) | tf.reshape(x, [2, 6]) |
| Transpose | np.transpose(a, (1, 0, 2)) | x.permute(1, 0, 2) | tf.transpose(x, perm=[1, 0, 2]) |
| Add axis | np.expand_dims(a, 0) | x.unsqueeze(0) | tf.expand_dims(x, 0) |
| Remove size-1 axis | np.squeeze(a) | x.squeeze() | tf.squeeze(x) |
| Concatenate | np.concatenate([a, b], axis=0) | torch.cat([x, y], dim=0) | tf.concat([x, y], axis=0) |
| Stack | np.stack([a, b], axis=0) | torch.stack([x, y], dim=0) | tf.stack([x, y], axis=0) |
| Number of elements | a.size | x.numel() | tf.size(x) |
A few best practices reduce the likelihood of shape-related bugs:

- Add shape assertions such as assert x.shape == (batch, channels, h, w) to catch errors early.
- Use -1 to let the framework infer one dimension: x.reshape(batch, -1) flattens all trailing dimensions.
- Define named constants such as BATCH = 32; SEQ_LEN = 512; EMBED = 768 and use them in shape assertions and reshapes to make code self-documenting.
- Remember that view() shares storage with the original tensor. When you need an independent copy, use reshape() (which copies when a zero-copy view is not possible) or call .contiguous() first.
- Guard against last-batch size mismatches: set drop_last=True in your data loader or ensure your model code does not hardcode the batch size.