# Tensor Shape

> Source: https://aiwiki.ai/wiki/tensor_shape
> Updated: 2026-04-07
> Categories: Deep Learning, Machine Learning, Mathematics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **tensor shape** is a tuple of integers that describes the number of elements along each dimension (or axis) of a [tensor](/wiki/tensor). In [machine learning](/wiki/machine_learning) and [deep learning](/wiki/deep_learning), tensor shape is one of the most frequently encountered concepts because it governs how data flows through every layer of a [neural network](/wiki/neural_network), how operations combine tensors, and how memory is allocated on hardware accelerators like [GPUs](/wiki/gpu_computing) and [TPUs](/wiki/tpu). A mismatch in tensor shapes is one of the most common sources of runtime errors during model development, making a solid understanding of shapes, ranks, and dimension conventions essential for practitioners.

## ELI5 (Explain like I'm 5)

Imagine you have a box of crayons. If you line up 8 crayons in a single row, the "shape" of that row is just (8). Now picture a muffin tin that has 3 rows and 4 columns of cups. Its shape is (3, 4), because you need two numbers to describe where each cup is. If you stack several muffin tins on top of each other, say 5 of them, you now need three numbers: (5, 3, 4). That is exactly what tensor shape does. It tells you how many slots exist along each direction of your container, so you (and the computer) always know exactly how the data is organized.

## Definition and terminology

A [tensor](/wiki/tensor) is a multidimensional array of numerical values arranged in a regular grid. Its shape is the tuple that lists the size of every dimension. Several closely related terms appear throughout the literature, and their usage varies between mathematics, physics, and computer science.

| Term | Meaning in computer science / ML | Meaning in mathematics / physics |
|---|---|---|
| **Rank** (also called *order* or *ndim*) | The number of dimensions of the tensor (the length of the shape tuple). A [scalar](/wiki/scalar) has rank 0, a vector has rank 1, a matrix has rank 2. | The number of indices needed to address a component. In physics, tensors of rank *n* may be further classified by their contravariant and covariant index structure. |
| **Shape** | The tuple of dimension sizes, e.g. `(3, 224, 224)`. | Sometimes called the "type" or "signature" of the tensor when referring to its index structure. |
| **Axis** (or **dimension**) | A single positional index within the shape tuple. Axis 0 is the first dimension, axis 1 is the second, and so on. | Equivalent to a particular mode of the tensor. |
| **Size** (of a dimension) | The number of elements along that axis. | The range of the corresponding index. |
| **Dtype** | The data type of the tensor elements (e.g. float32, int64). Not part of shape, but closely related because it determines memory usage per element. | N/A |

The total number of elements (sometimes called *numel*) in a tensor equals the product of all dimension sizes. For example, a tensor of shape `(2, 3, 4)` contains 2 × 3 × 4 = 24 elements.
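
The relationship between shape, rank, and element count can be sketched in a few lines of plain Python. The helper name `describe` is hypothetical and only illustrates the definitions above; frameworks expose the same quantities through attributes such as `ndim` and `numel()`.

```python
from math import prod

def describe(shape):
    """Return (rank, numel) for a shape tuple."""
    rank = len(shape)    # number of axes = length of the shape tuple
    numel = prod(shape)  # product of all sizes; prod(()) is 1, so a scalar holds one element
    return rank, numel

for shape in [(), (512,), (64, 768), (2, 3, 4)]:
    rank, numel = describe(shape)
    print(f"shape={shape!r:<12} rank={rank} numel={numel}")
```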

## Common tensor ranks

The following table summarizes the most frequently used tensor ranks and their typical roles in machine learning.

| Rank | Common name | Example shape | Typical use in ML |
|---|---|---|---|
| 0 | Scalar | `()` | A single loss value, learning rate, or metric |
| 1 | Vector | `(512,)` | A bias vector, a 1-D [embedding](/wiki/embedding_vector) |
| 2 | Matrix | `(64, 768)` | A batch of feature vectors, a weight matrix in a [linear layer](/wiki/fully_connected_layer) |
| 3 | 3-D tensor | `(32, 128, 768)` | A batch of token sequences in an NLP [transformer](/wiki/transformer) (batch, sequence length, embedding dim) |
| 4 | 4-D tensor | `(16, 3, 224, 224)` | A batch of RGB images for a [convolutional neural network](/wiki/convolutional_neural_network) (batch, channels, height, width) |
| 5 | 5-D tensor | `(8, 3, 16, 112, 112)` | A batch of video clips (batch, channels, frames, height, width) |

## Shape conventions by domain

Different application areas and frameworks follow different ordering conventions for the dimensions of their tensors. Understanding these conventions is necessary when converting data between frameworks or when reading model code.

### Image data

[Computer vision](/wiki/computer_vision) models process images as 4-D tensors. The two dominant conventions are:

| Convention | Dimension order | Frameworks |
|---|---|---|
| **NCHW** | Batch, Channels, Height, Width | [PyTorch](/wiki/pytorch), Caffe, cuDNN default |
| **NHWC** | Batch, Height, Width, Channels | [TensorFlow](/wiki/tensorflow) / [Keras](/wiki/keras) default, NVIDIA Tensor Cores |

NCHW stores all values of a single channel contiguously in memory, which can benefit certain GPU kernels. NHWC stores all channels for a single spatial location together, which is the preferred layout for NVIDIA Tensor Cores and often yields faster training when using mixed precision. PyTorch supports both layouts: its `channels_last` memory format, introduced to take advantage of Tensor Core acceleration, keeps the logical shape in NCHW order while storing the data in NHWC order in memory.
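
Converting between the two conventions is a pure axis permutation. The sketch below uses NumPy as a stand-in for any framework; the array sizes are arbitrary illustration values.

```python
import numpy as np

# A tiny NHWC batch: 2 images, 4x5 pixels, 3 channels.
x_nhwc = np.arange(2 * 4 * 5 * 3).reshape(2, 4, 5, 3)

# NHWC -> NCHW: move the channel axis (3) to position 1.
x_nchw = np.transpose(x_nhwc, (0, 3, 1, 2))
print(x_nchw.shape)  # (2, 3, 4, 5)

# The same value is reachable under both index orders.
n, h, w, c = 1, 2, 3, 0
assert x_nhwc[n, h, w, c] == x_nchw[n, c, h, w]

# NCHW -> NHWC is the inverse permutation.
x_back = np.transpose(x_nchw, (0, 2, 3, 1))
assert np.array_equal(x_back, x_nhwc)
```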

### Sequence / NLP data

[Transformer](/wiki/transformer) models for [natural language processing](/wiki/natural_language_understanding) work with 3-D tensors whose shape is typically `(N, L, E)`, where N is the batch size, L is the sequence length (number of [tokens](/wiki/token)), and E is the [embedding](/wiki/embedding_vector) dimension. After passing through the final linear layer, the output often becomes `(N, L, V)`, where V is the vocabulary size. These values are logits (unnormalized scores); applying softmax over the last axis turns them into a probability distribution over tokens at each position.

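Some older APIs (and certain NVIDIA libraries) place the sequence dimension first, using the `(L, N, E)` convention. PyTorch's `nn.LSTM` and `nn.Transformer`, for example, expect this layout unless `batch_first=True` is passed, so checking the documentation for each library is important.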

### Audio data

Audio is commonly represented as a 3-D tensor of shape `(N, C, T)` for raw waveforms (batch, channels, time samples) or as a 4-D tensor of shape `(N, C, F, T)` for spectrograms (batch, channels, frequency bins, time frames).

## Shape manipulation operations

Changing the shape of a tensor without altering (or selectively altering) its underlying data is one of the most frequent tasks in deep learning code. The table below summarizes the main operations.

| Operation | Description | Key constraint | Example (PyTorch) |
|---|---|---|---|
| **Reshape** | Reinterprets the data with a new shape | Total element count must stay the same | `x.reshape(2, 6)` on shape `(3, 4)` |
| **View** | Same as reshape but never copies data | Requested shape must be compatible with the existing memory layout (always true for contiguous tensors); shares memory with original | `x.view(2, 6)` |
| **Permute** | Reorders the dimensions (axes) | Does not change element count; may make tensor non-contiguous | `x.permute(0, 2, 1)` swaps axes 1 and 2 |
| **Transpose** | Swaps exactly two dimensions | Limited to two axes at a time | `x.transpose(1, 2)` |
| **Squeeze** | Removes all dimensions of size 1, or a specified one | Only affects size-1 dimensions | `x.squeeze(1)` on shape `(3, 1, 4)` gives `(3, 4)` |
| **Unsqueeze** | Inserts a new dimension of size 1 at a given position | Adds exactly one axis | `x.unsqueeze(0)` on shape `(3, 4)` gives `(1, 3, 4)` |
| **Expand / Repeat** | Replicates data along one or more dimensions | Expand can only stretch dimensions of size 1 (or add leading ones) and uses no extra memory (virtual repeat); repeat copies data | `x.expand(4, 3, 4)` on shape `(1, 3, 4)` |
| **Flatten** | Collapses a range of adjacent dims into one | Specified dims must be adjacent (a consecutive range of axes) | `x.flatten(1, 2)` on shape `(2, 3, 4)` gives `(2, 12)` |
| **Concatenate** | Joins tensors along an existing dimension | All other dimensions must match | `torch.cat([a, b], dim=0)` |
| **Stack** | Joins tensors along a new dimension | All shapes must be identical | `torch.stack([a, b], dim=0)` |
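
The shape effect of each operation in the table can be checked with a short script. This sketch uses NumPy equivalents rather than the PyTorch calls listed above (`np.expand_dims` for unsqueeze, `np.broadcast_to` for expand, `reshape(2, -1)` for flatten), so the function names are stand-ins and only the resulting shapes carry over.

```python
import numpy as np

x = np.zeros((3, 4))
y = np.zeros((2, 3, 4))
a, b = np.zeros((3, 4)), np.zeros((3, 4))

shapes = {
    "reshape":     x.reshape(2, 6).shape,                                   # (2, 6)
    "permute":     np.transpose(y, (0, 2, 1)).shape,                        # (2, 4, 3)
    "transpose":   np.swapaxes(y, 1, 2).shape,                              # (2, 4, 3)
    "squeeze":     np.squeeze(np.zeros((3, 1, 4)), axis=1).shape,           # (3, 4)
    "unsqueeze":   np.expand_dims(x, 0).shape,                              # (1, 3, 4)
    "expand":      np.broadcast_to(np.zeros((1, 3, 4)), (4, 3, 4)).shape,   # (4, 3, 4)
    "flatten":     y.reshape(2, -1).shape,                                  # (2, 12)
    "concatenate": np.concatenate([a, b], axis=0).shape,                    # (6, 4)
    "stack":       np.stack([a, b], axis=0).shape,                          # (2, 3, 4)
}
for name, shape in shapes.items():
    print(f"{name:<12}{shape}")
```

Concatenation grows an existing axis (3 + 3 = 6 rows), while stacking adds a new axis of length 2 in front of the unchanged `(3, 4)` shape.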

### View vs. reshape

In [PyTorch](/wiki/pytorch), `view()` and `reshape()` both produce a tensor with a different shape but the same data. The key difference is that `view()` never copies: it requires the requested shape to be compatible with the tensor's existing memory layout (a contiguous tensor always qualifies) and always returns a tensor that shares storage with the original. `reshape()` works on both contiguous and non-contiguous tensors; it returns a view when possible and falls back to copying the data when a view is not feasible. Using `view()` is slightly more explicit because it raises an error if the memory layout does not support a zero-copy view, which can help catch bugs early.

### Contiguity

A tensor is contiguous when its elements are stored in memory in the same order they would be visited by iterating over the tensor in row-major (C-style) order. Operations like `transpose()` and `permute()` change the stride metadata but do not move data in memory, so the result is typically non-contiguous. Calling `.contiguous()` on such a tensor copies the data into a new, contiguous block of memory. Many operations (including `view()`) require contiguity.
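
The same stride mechanics can be observed in NumPy, used here as an analogue of the PyTorch behavior described above: a transpose only rewrites strides, a reshape of a contiguous array returns a view, and a reshape of the transposed array has to copy. `np.ascontiguousarray` plays the role of `.contiguous()` in this sketch.

```python
import numpy as np

def element_strides(arr):
    """Strides measured in elements instead of bytes."""
    return tuple(s // arr.itemsize for s in arr.strides)

x = np.arange(12).reshape(3, 4)   # row-major (C-contiguous) storage
xt = x.T                          # transpose: same buffer, swapped strides

print(element_strides(x), x.flags["C_CONTIGUOUS"])    # (4, 1) True
print(element_strides(xt), xt.flags["C_CONTIGUOUS"])  # (1, 4) False
print(np.shares_memory(x, xt))                        # True

# Reshaping a contiguous array only changes metadata, so the result is a view.
v = x.reshape(2, 6)
print(np.shares_memory(x, v))                         # True

# Flattening the transposed array cannot be expressed with strides alone,
# so the data is copied.
r = xt.reshape(12)
print(np.shares_memory(xt, r))                        # False

# Explicitly materialize a contiguous copy of the transposed data.
c = np.ascontiguousarray(xt)
print(c.flags["C_CONTIGUOUS"], np.shares_memory(xt, c))  # True False
```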

## Broadcasting

[Broadcasting](/wiki/broadcasting) is the mechanism by which frameworks automatically expand the shapes of tensors so that element-wise operations can be performed on tensors of different shapes without explicitly copying data. The rules originated in [NumPy](/wiki/numpy) and have been adopted by PyTorch, TensorFlow, and JAX.

### Broadcasting rules

Two tensors are broadcastable if, when their shapes are compared dimension by dimension starting from the trailing (rightmost) dimension, every pair of dimensions satisfies one of the following:

1. The dimension sizes are equal, **or**
2. One of the dimension sizes is 1, **or**
3. One of the tensors does not have that dimension (it is implicitly prepended with size 1).

The output shape takes the maximum size along each dimension.
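
The rules above translate directly into a small shape calculator. The function name `broadcast_shape` is hypothetical and written in plain Python for illustration; NumPy ships a comparable utility, `np.broadcast_shapes`, that raises an error instead of returning `None`.

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    """Return the broadcast result shape of shapes a and b, or None if incompatible."""
    out = []
    # Walk both shapes from the trailing dimension; a missing dimension counts as size 1.
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if da == db or db == 1:
            out.append(da)
        elif da == 1:
            out.append(db)
        else:
            return None  # sizes differ and neither is 1
    return tuple(reversed(out))

print(broadcast_shape((8, 1, 6, 1), (7, 1, 5)))  # (8, 7, 6, 5)
print(broadcast_shape((5, 4), (4,)))             # (5, 4)
print(broadcast_shape((3,), (4,)))               # None
```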

### Broadcasting examples

| Tensor A shape | Tensor B shape | Result shape | Broadcastable? |
|---|---|---|---|
| `(5, 3, 4, 1)` | `(3, 1, 1)` | `(5, 3, 4, 1)` | Yes |
| `(1,)` | `(3, 1, 7)` | `(3, 1, 7)` | Yes |
| `(15, 3, 5)` | `(3, 1)` | `(15, 3, 5)` | Yes |
| `(5, 4)` | `(4,)` | `(5, 4)` | Yes |
| `(8, 1, 6, 1)` | `(7, 1, 5)` | `(8, 7, 6, 5)` | Yes |
| `(3,)` | `(4,)` | N/A | **No** (trailing dims 3 vs 4) |
| `(2, 1)` | `(8, 4, 3)` | N/A | **No** (second-to-last dims 2 vs 4) |

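A common use of broadcasting is adding a per-channel bias of shape `(C,)` to a batch of feature maps of shape `(N, C, H, W)` in a [convolutional layer](/wiki/convolutional_layer). Because broadcasting aligns shapes from the trailing dimension, a bare `(C,)` vector would be matched against W rather than C, so the bias is first reshaped to `(1, C, 1, 1)` (or `(C, 1, 1)`); it is then implicitly expanded across the batch and spatial dimensions during the addition.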
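
A minimal NumPy sketch of this bias addition follows; the sizes and bias values are arbitrary. Since shapes are aligned from the trailing axis, the bias is given the shape `(1, C, 1, 1)` so that its C axis lines up with axis 1 of the feature maps.

```python
import numpy as np

N, C, H, W = 2, 3, 4, 5
feature_maps = np.zeros((N, C, H, W))
bias = np.array([10.0, 20.0, 30.0])           # shape (C,)

# Reshape so the channel axis of the bias lines up with axis 1.
out = feature_maps + bias.reshape(1, C, 1, 1)
print(out.shape)         # (2, 3, 4, 5)
print(out[0, :, 0, 0])   # [10. 20. 30.]

# A bare (C,) vector is aligned with the trailing axis W instead (3 vs 5), which fails.
try:
    feature_maps + bias
    bare_bias_ok = True
except ValueError as err:
    bare_bias_ok = False
    print("broadcast error:", err)
```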

### In-place broadcasting restriction

In PyTorch, in-place operations (e.g. `x.add_(y)`) do not allow the shape of `x` to change as a result of broadcasting. If the broadcast would require `x` to grow, a `RuntimeError` is raised.
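
NumPy enforces the analogous restriction, which makes it easy to see without PyTorch installed. In this sketch the out-of-place sum is free to take the broadcast shape, while the in-place update fails because the result would not fit in `x`; NumPy raises `ValueError` here where PyTorch raises `RuntimeError`.

```python
import numpy as np

x = np.zeros((1, 3))
y = np.ones((2, 3))

# Out-of-place: a new array with the broadcast shape is allocated.
print((x + y).shape)  # (2, 3)

# In-place: the (2, 3) result cannot be stored in x, whose shape is (1, 3).
try:
    x += y
    in_place_ok = True
except ValueError:
    in_place_ok = False
print(in_place_ok)  # False

# Broadcasting the other operand is fine, because the target keeps its shape.
z = np.zeros((2, 3))
z += np.ones((1, 3))
print(z.shape)  # (2, 3)
```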

## Shape through neural network layers

Understanding how each type of layer transforms the shape of its input is essential for building and debugging models.

### Linear (fully connected) layer

A [linear layer](/wiki/fully_connected_layer) with `in_features` inputs and `out_features` outputs transforms shape `(*, in_features)` to `(*, out_features)`, where `*` represents any number of leading batch dimensions.
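
The "any number of leading dimensions" behavior follows from how matrix multiplication treats leading axes. The sketch below imitates a linear layer in NumPy with a weight of shape `(out_features, in_features)`, the layout PyTorch's `nn.Linear` uses; the sizes are small placeholder values and the `linear` helper is hypothetical.

```python
import numpy as np

in_features, out_features = 8, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((out_features, in_features))  # (out_features, in_features)
b = np.zeros(out_features)

def linear(x):
    """Apply y = x @ W^T + b over the last axis; leading axes pass through unchanged."""
    return x @ W.T + b

for shape in [(8,), (64, 8), (32, 128, 8)]:
    print(shape, "->", linear(np.zeros(shape)).shape)
```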

### Convolutional layer

For a 2-D [convolutional layer](/wiki/convolutional_layer), the spatial dimensions of the output are determined by:

```
H_out = floor((H_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)
W_out = floor((W_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)
```

The number of output channels equals the number of filters (`out_channels`), and the batch dimension is unchanged. A full shape transformation example:

| Parameter | Value |
|---|---|
| Input shape | `(N, 3, 224, 224)` |
| `out_channels` | 64 |
| `kernel_size` | 7 |
| `stride` | 2 |
| `padding` | 3 |
| `dilation` | 1 |
| **Output shape** | `(N, 64, 112, 112)` |

Applying the formula: `floor((224 + 2*3 - 1*(7-1) - 1) / 2 + 1)` = `floor((224 + 6 - 6 - 1) / 2 + 1)` = `floor(223 / 2 + 1)` = `floor(112.5)` = 112.
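
The formula is easy to wrap in a helper for checking layer configurations before building a model. The function name `conv_out_size` is hypothetical; it uses integer floor division, which is equivalent to the `floor(... / stride + 1)` form above.

```python
def conv_out_size(size_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial axis of a convolution."""
    return (size_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# The worked example: 7x7 kernel, stride 2, padding 3 on a 224-pixel axis.
print(conv_out_size(224, kernel_size=7, stride=2, padding=3))   # 112

# A 3x3 kernel with padding 1 and stride 1 preserves the spatial size.
print(conv_out_size(224, kernel_size=3, padding=1))             # 224

# A 3x3 window with stride 2 and padding 1 roughly halves it.
print(conv_out_size(112, kernel_size=3, stride=2, padding=1))   # 56
```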

### Pooling layer

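[Pooling](/wiki/pooling) layers (max pool, average pool) follow the same spatial output formula as convolutional layers, with the pooling window size in place of `kernel_size`, but do not change the channel dimension. In PyTorch, `stride` defaults to the window size when it is not given.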

### Recurrent layers

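An [LSTM](/wiki/long_short-term_memory_lstm) or GRU with input shape `(N, L, H_in)` and `hidden_size` H produces an output of shape `(N, L, D * H)`, where D is 2 for bidirectional and 1 otherwise. In PyTorch this batch-first layout requires `batch_first=True`; the default layout is `(L, N, H_in)` for the input and `(L, N, D * H)` for the output.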

### Attention / transformer layers

A standard multi-head [self-attention](/wiki/self-attention_also_called_self-attention_layer) layer preserves the input shape `(N, L, E)`. The queries, keys, and values are internally reshaped from `(N, L, E)` to `(N, num_heads, L, E // num_heads)` for parallel attention computation, then reshaped back. The [feed-forward network](/wiki/feedforward_neural_network_ffn) inside each transformer block temporarily projects to a higher dimension (often 4E) and then back to E, again preserving the overall shape `(N, L, E)`.
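
The head split and merge can be traced shape by shape in NumPy. This is a sketch of the reshaping only: the learned query, key, value, and output projections of a real attention layer are omitted, and the sizes are arbitrary.

```python
import numpy as np

N, L, E, num_heads = 2, 5, 16, 4
head_dim = E // num_heads

rng = np.random.default_rng(0)
q = rng.standard_normal((N, L, E))
k = rng.standard_normal((N, L, E))
v = rng.standard_normal((N, L, E))

def split_heads(x):
    # (N, L, E) -> (N, L, num_heads, head_dim) -> (N, num_heads, L, head_dim)
    return x.reshape(N, L, num_heads, head_dim).transpose(0, 2, 1, 3)

qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)

# Scaled dot-product scores per head: (N, num_heads, L, L)
scores = qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(head_dim)

# Softmax over the last axis (keys).
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Weighted sum of values: (N, num_heads, L, head_dim)
ctx = weights @ vh

# Merge heads back: (N, num_heads, L, head_dim) -> (N, L, E)
out = ctx.transpose(0, 2, 1, 3).reshape(N, L, E)

print(qh.shape, scores.shape, ctx.shape, out.shape)
```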

## Debugging shape errors

Tensor shape mismatches are among the most frequent runtime errors in deep learning. Shin et al. (2022) found that shape-related bugs in neural network training code are both common and difficult to detect statically. The following strategies help prevent and diagnose these errors.

### Common causes

| Error type | Typical symptom | Example |
|---|---|---|
| Mismatched matrix multiply dimensions | `RuntimeError: mat1 and mat2 shapes cannot be multiplied` | The `in_features` of a [linear layer](/wiki/fully_connected_layer) does not match the last dimension of the input |
| Wrong number of dimensions | `RuntimeError: Expected 4-dimensional input for spatial ... but got 3-dimensional input` | Forgetting the batch dimension when feeding a single image to a [CNN](/wiki/convolutional_neural_network) |
| Incompatible broadcast | `RuntimeError: The size of tensor a (X) must match the size of tensor b (Y) at non-singleton dimension Z` | Trying to add tensors whose shapes violate broadcasting rules |
| Invalid reshape | `RuntimeError: shape [X] is invalid for input of size Y` | Reshaping to a shape whose total element count differs from the source |
| Last-batch size mismatch | `RuntimeError: size mismatch, m1: [A x B], m2: [C x D]` | The final mini-batch is smaller than `batch_size` and a layer expects a fixed size |

### Debugging techniques

1. **Print shapes at every step.** Insert `print(x.shape)` before and after each layer or operation inside the `forward()` method. This is the fastest way to find the point where a shape goes wrong.
2. **Use the meta device.** PyTorch's meta device lets you compute output shapes without allocating memory for tensor data. Both the model's parameters and the input must live on the meta device: for example, construct the model inside a `with torch.device('meta'):` block, create the input with `x = torch.randn(1, 3, 224, 224, device='meta')`, then run `out = model(x); print(out.shape)` to trace shapes at near-zero cost.
3. **Use model summary tools.** Libraries such as `torchinfo` (formerly `torch-summary`, a successor to `torchsummary`) print a table of layer names, output shapes, and parameter counts for a given input size.
4. **Read the error message carefully.** PyTorch error messages typically include the exact shapes that caused the failure and the operation that triggered it.
5. **Check the documentation.** Layer documentation specifies the expected input and output shapes, including which dimensions correspond to batch, channels, features, and spatial extent.

## Static and dynamic shapes

The distinction between static and dynamic shapes arises when compiling or tracing a model.

**Static shapes** are fixed at graph-construction or compilation time. [TensorFlow](/wiki/tensorflow) 1.x graph mode and TensorRT require static shapes by default, which enables aggressive kernel fusion and memory planning but limits flexibility.

**Dynamic shapes** allow one or more dimensions to vary between invocations. This is the default behavior in [PyTorch](/wiki/pytorch) eager mode and TensorFlow 2.x eager mode. When using `torch.compile()`, PyTorch initially treats all shapes as static and recompiles if a shape changes; recent versions then treat the dimension that changed as dynamic, so further size changes along it do not trigger another recompilation. Developers can mark dimensions as dynamic up front with `torch._dynamo.mark_dynamic()` to avoid the initial recompilation. Internally, PyTorch uses SymPy to represent symbolic shape expressions that are solved at dispatch time.

**Bounded dynamic shapes** (used by PyTorch/XLA for TPUs) restrict dynamic dimensions to a declared range, allowing the compiler to allocate a fixed memory budget while still accepting variable-length inputs.

## Einops and expressive shape notation

[Einops](https://einops.rocks/) is a library that provides a concise, readable notation for tensor shape operations. Published as a conference paper at ICLR 2022, einops offers three core functions, `rearrange`, `reduce`, and `repeat`, that replace many individual calls to reshape, transpose, permute, squeeze, and unsqueeze.

For example, converting an image batch from NHWC to NCHW:

```python
from einops import rearrange
# x has shape (batch, height, width, channels)
x = rearrange(x, 'b h w c -> b c h w')
```

Splitting an embedding dimension into multiple attention heads:

```python
# q has shape (batch, seq_len, num_heads * head_dim)
q = rearrange(q, 'b s (h d) -> b h s d', h=8)
```

The einops notation makes the intended shape transformation self-documenting, reducing the risk of silent shape errors that can occur with chains of `.view()` and `.permute()` calls.

## Named tensors

A known limitation of positional-index-based shape manipulation is that axes are identified only by integer positions, making code error-prone and difficult to read. Several projects aim to attach human-readable names to tensor dimensions.

**PyTorch Named Tensors** (prototype API) let you assign names to dimensions at creation time, for example `torch.zeros(2, 3, names=('N', 'C'))`. Operations then check dimension names for correctness at runtime, catching permutation errors that would otherwise produce silent bugs.

**Named Tensor Notation**, proposed by Chiang, Rush, and Barak (2021), is a formal notation that uses subscript names on tensors to make dimension semantics explicit in mathematical writing, analogous to how einsum uses named indices.

**Xarray** and **xarray-jax** bring labeled, named dimensions to NumPy and JAX arrays, respectively; Xarray in particular is widely used in scientific computing.

## Tensor decomposition and shape reduction

Tensor decomposition methods factorize a high-dimensional tensor into smaller tensors, effectively changing the shape representation while preserving (or approximating) the information content.

| Decomposition | Input shape | Output shapes (conceptual) | Use in ML |
|---|---|---|---|
| **CP (CANDECOMP/PARAFAC)** | `(I, J, K)` | Sum of R rank-one terms, each the outer product of vectors of sizes `(I,)`, `(J,)`, `(K,)`; equivalently three factor matrices `(I, R)`, `(J, R)`, `(K, R)` | Compressing convolutional filters, recommender systems |
| **Tucker** | `(I, J, K)` | Core tensor `(R1, R2, R3)` + factor matrices `(I, R1)`, `(J, R2)`, `(K, R3)` | Model compression, higher-order SVD |
| **Tensor Train (TT)** | `(I1, I2, ..., In)` | Chain of n 3-D cores, the k-th of shape `(R(k-1), Ik, Rk)` with `R0 = Rn = 1` | Compressing large [embedding](/wiki/embedding_vector) tables, physics simulations |

These decompositions reduce parameter counts and computational cost while transforming the original tensor's shape into a set of smaller, structured shapes. They are used in practice to compress [deep neural network](/wiki/deep_neural_network) layers for deployment on resource-constrained devices.
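
How the factor shapes recombine into the original shape, and how many parameters they save, can be illustrated with `np.einsum`. The sketch below builds random factors with arbitrarily chosen ranks and reconstructs a tensor from them; it does not fit a decomposition to data, which requires an algorithm such as alternating least squares.

```python
import numpy as np

I, J, K = 6, 5, 4
rng = np.random.default_rng(0)

# CP: three factor matrices that share a single rank R.
R = 3
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))
cp = np.einsum("ir,jr,kr->ijk", A, B, C)             # sum of R rank-one terms

# Tucker: a small core tensor plus one factor matrix per mode.
R1, R2, R3 = 3, 2, 2
G = rng.standard_normal((R1, R2, R3))
U1 = rng.standard_normal((I, R1))
U2 = rng.standard_normal((J, R2))
U3 = rng.standard_normal((K, R3))
tucker = np.einsum("abc,ia,jb,kc->ijk", G, U1, U2, U3)

full_params = I * J * K                               # 120
cp_params = R * (I + J + K)                           # 45
tucker_params = R1 * R2 * R3 + I * R1 + J * R2 + K * R3  # 48

print(cp.shape, tucker.shape)                         # (6, 5, 4) (6, 5, 4)
print(full_params, cp_params, tucker_params)          # 120 45 48
```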

## Shape in popular frameworks

The following table compares how common shape operations are invoked in NumPy, PyTorch, and TensorFlow.

| Operation | NumPy | PyTorch | TensorFlow |
|---|---|---|---|
| Get shape | `a.shape` | `x.shape` or `x.size()` | `x.shape` or `tf.shape(x)` |
| Reshape | `np.reshape(a, (2, 6))` | `x.reshape(2, 6)` or `x.view(2, 6)` | `tf.reshape(x, [2, 6])` |
| Transpose | `np.transpose(a, (1, 0, 2))` | `x.permute(1, 0, 2)` | `tf.transpose(x, perm=[1, 0, 2])` |
| Add axis | `np.expand_dims(a, 0)` | `x.unsqueeze(0)` | `tf.expand_dims(x, 0)` |
| Remove size-1 axis | `np.squeeze(a)` | `x.squeeze()` | `tf.squeeze(x)` |
| Concatenate | `np.concatenate([a, b], axis=0)` | `torch.cat([x, y], dim=0)` | `tf.concat([x, y], axis=0)` |
| Stack | `np.stack([a, b], axis=0)` | `torch.stack([x, y], dim=0)` | `tf.stack([x, y], axis=0)` |
| Number of elements | `a.size` | `x.numel()` | `tf.size(x)` |

## Best practices

1. **Always verify shapes during development.** Print or log tensor shapes at every major step. Use assertions such as `assert x.shape == (batch, channels, h, w)` to catch errors early.
2. **Use -1 for inferred dimensions.** When reshaping, you can set one dimension to -1 and the framework will compute it automatically: `x.reshape(batch, -1)` flattens all trailing dimensions.
3. **Prefer named constants over magic numbers.** Define `BATCH = 32; SEQ_LEN = 512; EMBED = 768` and use them in shape assertions and reshapes to make code self-documenting.
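4. **Be deliberate about view vs. copy.** When you need a new shape that must share memory with the original, use `view()`, which raises an error if that is impossible. `reshape()` may return either a view or a copy depending on the memory layout, and `.contiguous()` returns the tensor itself when it is already contiguous, so call `.clone()` when you need a guaranteed independent copy.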
5. **Handle the last batch.** If the dataset size is not divisible by the batch size, the last batch will have a smaller first dimension. Either set `drop_last=True` in your data loader or ensure your model code does not hardcode the batch size.
6. **Favor einops for complex rearrangements.** For any operation that involves more than a simple reshape or transpose, einops notation is more readable, self-documenting, and less error-prone than raw view/permute chains.
7. **Use the meta device for shape debugging.** Before running expensive forward passes, trace shapes with meta tensors to verify that all dimensions are compatible.
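
The shape assertions recommended in the first practice can be wrapped in a small helper that also tolerates a variable batch size. The function `check_shape` below is a hypothetical plain-Python sketch, not part of any framework; it accepts any tuple-like shape, such as a NumPy `shape` or a PyTorch `torch.Size`, and treats `None` as a wildcard.

```python
def check_shape(shape, expected, name="tensor"):
    """Raise ValueError unless shape matches expected; None in expected matches any size."""
    shape = tuple(shape)
    if len(shape) != len(expected):
        raise ValueError(f"{name}: expected rank {len(expected)}, got shape {shape}")
    for axis, (got, want) in enumerate(zip(shape, expected)):
        if want is not None and got != want:
            raise ValueError(f"{name}: axis {axis} expected {want}, got {got} (shape {shape})")
    return shape

CHANNELS, HEIGHT, WIDTH = 3, 224, 224

# The batch axis is left as a wildcard so a smaller final batch still passes.
check_shape((32, 3, 224, 224), (None, CHANNELS, HEIGHT, WIDTH), name="images")
check_shape((7, 3, 224, 224), (None, CHANNELS, HEIGHT, WIDTH), name="images")

# A channels-last batch is rejected with a message naming the offending axis.
try:
    check_shape((32, 224, 224, 3), (None, CHANNELS, HEIGHT, WIDTH), name="images")
except ValueError as err:
    message = str(err)
    print(message)
```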

## See also

- [Tensor](/wiki/tensor)
- [Broadcasting](/wiki/broadcasting)
- [Convolutional neural network](/wiki/convolutional_neural_network)
- [NumPy](/wiki/numpy)
- [PyTorch](/wiki/pytorch)
- [TensorFlow](/wiki/tensorflow)
- [Batch size](/wiki/batch_size)
- [Embedding vector](/wiki/embedding_vector)

## References

1. Harris, C. R., Millman, K. J., van der Walt, S. J., et al. (2020). "Array programming with NumPy." *Nature*, 585(7825), 357-362. https://doi.org/10.1038/s41586-020-2649-2
2. Paszke, A., Gross, S., Massa, F., et al. (2019). "PyTorch: An Imperative Style, High-Performance Deep Learning Library." *Advances in Neural Information Processing Systems 32 (NeurIPS 2019)*.
3. Abadi, M., Barham, P., Chen, J., et al. (2016). "TensorFlow: A System for Large-Scale Machine Learning." *Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI)*.
4. Rogozhnikov, A. (2022). "Einops: Clear and Reliable Tensor Manipulations with Einstein-like Notation." *Proceedings of the International Conference on Learning Representations (ICLR 2022)*.
5. Shin, J., Lee, S., Yoon, H., and Oh, H. (2022). "A Static Analyzer for Detecting Tensor Shape Errors in Deep Neural Network Training Code." *Proceedings of the ACM/IEEE 44th International Conference on Software Engineering (ICSE)*.
6. Lagouvardos, S., Dolby, J., Grech, N., Antoniadis, A., and Smaragdakis, Y. (2020). "Static Analysis of Shape in TensorFlow Programs." *Proceedings of the 34th European Conference on Object-Oriented Programming (ECOOP 2020)*.
7. Chiang, D., Rush, A. M., and Barak, B. (2021). "Named Tensor Notation." *arXiv preprint arXiv:2102.13196*.
8. Kolda, T. G. and Bader, B. W. (2009). "Tensor Decompositions and Applications." *SIAM Review*, 51(3), 455-500.
9. NVIDIA (2023). "Convolutional Layers User's Guide." *NVIDIA Deep Learning Performance Documentation*. https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html
10. PyTorch Contributors (2025). "Broadcasting Semantics." *PyTorch Documentation*. https://docs.pytorch.org/docs/stable/notes/broadcasting.html
11. PyTorch Contributors (2025). "Reasoning about Shapes in PyTorch." *PyTorch Tutorials*. https://docs.pytorch.org/tutorials/recipes/recipes/reasoning_about_shapes.html
12. NumPy Contributors (2025). "Broadcasting." *NumPy Documentation*. https://numpy.org/doc/stable/user/basics.broadcasting.html
13. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems 30 (NeurIPS 2017)*.
