# Tensor Shape

> Source: https://aiwiki.ai/wiki/tensor_shape
> Updated: 2026-06-28
> Categories: Deep Learning, Machine Learning, Mathematics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **tensor shape** is a tuple of integers that describes the number of elements along each dimension (or axis) of a [tensor](/wiki/tensor). For a tensor of rank `r`, the shape is written `(d_1, d_2, ..., d_r)`, where each `d_i` is the size of the tensor along axis `i`; for example, a batch of 16 RGB images that are 224 pixels tall and 224 wide has shape `(16, 3, 224, 224)`. Shape is one of the two pieces of metadata (alongside data type) that virtually every deep learning framework attaches to a tensor, and it governs which operations are legal, how data flows through a [neural network](/wiki/neural_network), and how memory is laid out on hardware.

In [machine learning](/wiki/machine_learning) and [deep learning](/wiki/deep_learning), shape determines how operations combine tensors and how much memory is allocated on hardware accelerators like [GPUs](/wiki/gpu_computing) and [TPUs](/wiki/tpu). A mismatch in tensor shapes is one of the most common sources of runtime errors during model development, so a solid understanding of shapes, ranks, and dimension conventions is essential for practitioners. This page covers the same core concept as the companion article [shape (tensor)](/wiki/shape_tensor) and goes further into shape manipulation, layer-by-layer transformations, broadcasting, and decomposition.

## ELI5 (Explain like I'm 5)

Imagine you have a box of crayons. If you line up 8 crayons in a single row, the "shape" of that row is just (8). Now picture a muffin tin that has 3 rows and 4 columns of cups. Its shape is (3, 4), because you need two numbers to describe where each cup is. If you stack several muffin tins on top of each other, say 5 of them, you now need three numbers: (5, 3, 4). That is exactly what tensor shape does. It tells you how many slots exist along each direction of your container, so you (and the computer) always know exactly how the data is organized.

## What is the shape of a tensor?

A [tensor](/wiki/tensor) is a multidimensional array of numerical values arranged in a regular grid. Its shape is the tuple that lists the size of every dimension. The length of that tuple is the **rank** (the number of axes), and each entry is the number of elements along the corresponding axis. The [NumPy](/wiki/numpy) manual defines `ndarray.shape` simply as a "Tuple of array dimensions," and in [PyTorch](/wiki/pytorch) the equivalent `torch.Tensor.size()` returns a `torch.Size` object that the documentation describes as the "result type of a call to `torch.Tensor.size()`" which "describes the size of all dimensions of the original tensor."[14][15] In PyTorch, `tensor.shape` is an alias for `tensor.size()`, and because `torch.Size` is a subclass of `tuple`, the result can be indexed, unpacked, and compared like any ordinary tuple.[15]

Several closely related terms appear throughout the literature, and their usage varies between mathematics, physics, and computer science.

| Term | Meaning in computer science / ML | Meaning in mathematics / physics |
|---|---|---|
| **Rank** (also called *order* or *ndim*) | The number of dimensions of the tensor (the length of the shape tuple). A scalar has rank 0, a [vector](/wiki/scalar) has rank 1, a matrix has rank 2. | The number of indices needed to address a component. In physics, tensors of rank *n* may be further classified by their contravariant and covariant index structure. |
| **Shape** | The tuple of dimension sizes, e.g. `(3, 224, 224)`. | Sometimes called the "type" or "signature" of the tensor when referring to its index structure. |
| **Axis** (or **dimension**) | A single positional index within the shape tuple. Axis 0 is the first dimension, axis 1 is the second, and so on. | Equivalent to a particular mode of the tensor. |
| **Size** (of a dimension) | The number of elements along that axis. | The range of the corresponding index. |
| **Dtype** | The data type of the tensor elements (e.g. float32, int64). Not part of shape, but closely related because it determines memory usage per element. | N/A |

The total number of elements (sometimes called *numel*) in a tensor equals the product of all dimension sizes. For example, a tensor of shape `(2, 3, 4)` contains 2 x 3 x 4 = 24 elements.
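As a small illustration of these definitions, rank and element count can be derived from a shape tuple using only the Python standard library. The helper names `rank` and `numel` are chosen here for illustration and are not framework APIs.

```python
import math

def rank(shape):
    """Number of axes: the length of the shape tuple."""
    return len(shape)

def numel(shape):
    """Total element count: the product of all dimension sizes."""
    # math.prod of an empty tuple is 1, so a scalar (shape ()) has one element.
    return math.prod(shape)

shape = (2, 3, 4)
print(rank(shape), numel(shape))  # 3 24
```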

## What is the rank of a tensor?

Rank refers to how many axes a tensor has, not how big each axis is: a 1000-element vector has the same rank (1) as a 3-element vector. The following table summarizes the most frequently used tensor ranks and their typical roles in machine learning.

| Rank | Common name | Example shape | Typical use in ML |
|---|---|---|---|
| 0 | Scalar | `()` | A single loss value, learning rate, or metric |
| 1 | Vector | `(512,)` | A bias vector, a 1-D [embedding](/wiki/embedding_vector) |
| 2 | Matrix | `(64, 768)` | A batch of feature vectors, a weight matrix in a [linear layer](/wiki/fully_connected_layer) |
| 3 | 3-D tensor | `(32, 128, 768)` | A batch of token sequences in an NLP [transformer](/wiki/transformer) (batch, sequence length, embedding dim) |
| 4 | 4-D tensor | `(16, 3, 224, 224)` | A batch of RGB images for a [convolutional neural network](/wiki/convolutional_neural_network) (batch, channels, height, width) |
| 5 | 5-D tensor | `(8, 3, 16, 112, 112)` | A batch of video clips (batch, channels, frames, height, width) |

The trailing comma in `(512,)` is Python syntax that distinguishes a one-element tuple from a parenthesized integer: `(512)` is the integer 512, while `(512,)` is a tuple of length one. A scalar produced by a reduction such as `arr.sum()` has shape `()`, the empty tuple, which is distinct from a length-1 vector with shape `(1,)`.
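The distinction between a rank-0 scalar and a length-1 vector can be checked directly. The following is a minimal sketch in NumPy; the array contents are arbitrary.

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

total = arr.sum()          # full reduction produces a rank-0 result
one_vec = np.array([21])   # a vector that happens to hold a single element

print(total.shape, total.ndim)      # () 0
print(one_vec.shape, one_vec.ndim)  # (1,) 1

# Python tuple syntax: parentheses alone do not make a tuple.
print(type((512)).__name__, type((512,)).__name__)  # int tuple
```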

## What is the batch dimension?

In nearly all deep learning frameworks the first axis (axis 0) of a tensor is the **batch dimension**: the number of independent examples processed together in one forward or backward pass. A single 224x224 RGB image has shape `(3, 224, 224)` in channels-first order, but a [CNN](/wiki/convolutional_neural_network) expects it to be wrapped as `(N, 3, 224, 224)`, where `N` is the [batch size](/wiki/batch_size). Forgetting to add this leading dimension (for example, feeding one image without `unsqueeze(0)`) is one of the most common beginner shape errors. The batch axis is also the dimension that most often varies at runtime, which is why it is frequently left unspecified (`None` in [TensorFlow](/wiki/tensorflow), or marked dynamic in [PyTorch](/wiki/pytorch)) when a model is compiled.
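Adding the leading batch axis to a single example might look like the sketch below. It uses NumPy, where `np.expand_dims(image, 0)` plays the same role as `unsqueeze(0)` in PyTorch; the image contents are placeholder zeros.

```python
import numpy as np

# One channels-first RGB image: (C, H, W)
image = np.zeros((3, 224, 224), dtype=np.float32)

# Insert a size-1 batch axis at position 0: (1, C, H, W)
batched = np.expand_dims(image, axis=0)

# Equivalent indexing form
also_batched = image[np.newaxis, ...]

print(image.shape)    # (3, 224, 224)
print(batched.shape)  # (1, 3, 224, 224)
```

Both forms return a view of the original data, so no pixels are copied.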

## What are NCHW and NHWC?

[Computer vision](/wiki/computer_vision) models process images as 4-D tensors, and the two dominant ordering conventions differ in where the channel axis sits.

| Convention | Dimension order | Frameworks |
|---|---|---|
| **NCHW** (channels-first) | Batch, Channels, Height, Width | [PyTorch](/wiki/pytorch), Caffe, ONNX, cuDNN default |
| **NHWC** (channels-last) | Batch, Height, Width, Channels | [TensorFlow](/wiki/tensorflow) / [Keras](/wiki/keras) default, NVIDIA Tensor Cores |

NCHW stores all values of a single channel contiguously in memory, which can benefit certain GPU kernels. NHWC stores all channels for a single spatial location together, which is the preferred layout for NVIDIA Tensor Cores and often yields faster training when using mixed precision. PyTorch supports both layouts through its `channels_last` memory format, which the official PyTorch tutorial defines as "an alternative way of ordering NCHW tensors in memory preserving dimensions ordering."[16] That same tutorial reports "8%-35% performance gains on Volta devices" and "26%-76% performance gains on Intel(R) Xeon(R) Ice Lake (or newer) CPUs" from switching to channels-last via `tensor.to(memory_format=torch.channels_last)`.[16]

Importantly, in PyTorch channels-last is a memory format on a 4-D NCHW tensor, not a different shape: the strides change but `tensor.shape` still reports `(N, C, H, W)`. To actually reorder the axes, use `permute` (`x.permute(0, 2, 3, 1)`); in TensorFlow use `tf.transpose(x, [0, 2, 3, 1])`.
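An actual axis reordering, as opposed to a memory-format change, can be sketched with NumPy, whose `np.transpose` takes the same permutation as `permute`. The sizes below are arbitrary illustrative values.

```python
import numpy as np

n, c, h, w = 2, 3, 4, 5
x_nchw = np.arange(n * c * h * w).reshape(n, c, h, w)

# Reorder axes (N, C, H, W) -> (N, H, W, C); same permutation as x.permute(0, 2, 3, 1).
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))

print(x_nchw.shape)  # (2, 3, 4, 5)
print(x_nhwc.shape)  # (2, 4, 5, 3)

# The same element is now addressed with reordered indices.
assert x_nchw[1, 2, 3, 4] == x_nhwc[1, 3, 4, 2]
```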

### Sequence / NLP data

[Transformer](/wiki/transformer) models for [natural language processing](/wiki/natural_language_understanding) work with 3-D tensors whose shape is typically `(N, L, E)`, where N is the batch size, L is the sequence length (number of [tokens](/wiki/token)), and E is the [embedding](/wiki/embedding_vector) dimension. After passing through the final linear layer, the output often becomes `(N, L, V)`, where V is the vocabulary size; these are per-position logits that a softmax over the last axis turns into a probability distribution over tokens.

Some older APIs (and certain NVIDIA libraries) place the sequence dimension first, using the `(L, N, E)` convention, so checking the documentation for each library is important.

### Audio data

Audio is commonly represented as a 3-D tensor of shape `(N, C, T)` for raw waveforms (batch, channels, time samples) or as a 4-D tensor of shape `(N, C, F, T)` for spectrograms (batch, channels, frequency bins, time frames).

## How does reshaping work?

Changing the shape of a tensor without altering (or selectively altering) its underlying data is one of the most frequent tasks in deep learning code. The `reshape` operation "Returns a tensor with the same data and number of elements as input, but with the specified shape," so the only hard constraint is that the product of the new dimension sizes must equal the product of the old ones.[17] The table below summarizes the main operations.

| Operation | Description | Key constraint | Example (PyTorch) |
|---|---|---|---|
| **Reshape** | Reinterprets the data with a new shape | Total element count must stay the same | `x.reshape(2, 6)` on shape `(3, 4)` |
| **View** | Same as reshape but never copies | New shape must be compatible with the existing strides (always true for contiguous tensors); shares memory with original | `x.view(2, 6)` |
| **Permute** | Reorders the dimensions (axes) | Does not change element count; may make tensor non-contiguous | `x.permute(0, 2, 1)` swaps axes 1 and 2 |
| **Transpose** | Swaps exactly two dimensions | Limited to two axes at a time | `x.transpose(1, 2)` |
| **Squeeze** | Removes all dimensions of size 1, or a specified one | Only affects size-1 dimensions | `x.squeeze(1)` on shape `(3, 1, 4)` gives `(3, 4)` |
| **Unsqueeze** | Inserts a new dimension of size 1 at a given position | Adds exactly one axis | `x.unsqueeze(0)` on shape `(3, 4)` gives `(1, 3, 4)` |
| **Expand / Repeat** | Replicates data along one or more dimensions | Expand uses no extra memory (virtual repeat); repeat copies data | `x.expand(4, 3, 4)` on shape `(1, 3, 4)` |
| **Flatten** | Collapses a contiguous range of dims into one | Specified dims must be contiguous | `x.flatten(1, 2)` on shape `(2, 3, 4)` gives `(2, 12)` |
| **Concatenate** | Joins tensors along an existing dimension | All other dimensions must match | `torch.cat([a, b], dim=0)` |
| **Stack** | Joins tensors along a new dimension | All shapes must be identical | `torch.stack([a, b], dim=0)` |
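
Several rows of this table can be tried out with NumPy equivalents, shown here as a rough sketch. The mapping assumed is `np.concatenate` for `torch.cat`, `np.expand_dims` for `unsqueeze`, `np.broadcast_to` for `expand`, and `np.tile` for `repeat`.

```python
import numpy as np

a = np.zeros((3, 4))
b = np.ones((3, 4))

cat = np.concatenate([a, b], axis=0)  # existing axis grows: (6, 4)
stk = np.stack([a, b], axis=0)        # new leading axis:    (2, 3, 4)

col = np.zeros((3, 1, 4))
squeezed = np.squeeze(col, axis=1)          # drops the size-1 axis: (3, 4)
restored = np.expand_dims(squeezed, axis=1) # puts it back:          (3, 1, 4)

base = np.zeros((1, 3, 4))
virtual = np.broadcast_to(base, (4, 3, 4))  # read-only view, stride 0 on axis 0
copied = np.tile(base, (4, 1, 1))           # real copy of the data

print(cat.shape, stk.shape)            # (6, 4) (2, 3, 4)
print(squeezed.shape, restored.shape)  # (3, 4) (3, 1, 4)
print(virtual.strides[0], copied.shape)
```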

### What does -1 mean when reshaping?

When reshaping you can set exactly one dimension to `-1` and let the framework compute it for you. The PyTorch documentation states: "A single dimension may be -1, in which case it's inferred from the remaining dimensions and the number of elements in input."[17] For example, calling `x.reshape(batch, -1)` on a tensor of shape `(32, 3, 224, 224)` produces `(32, 150528)`, because 3 x 224 x 224 = 150528 is inferred automatically. Only one dimension may be `-1`; supplying two raises an error because the size would be ambiguous. NumPy uses the same convention (`a.reshape(2, -1)`) and TensorFlow accepts `-1` in `tf.reshape` as well.
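
The inference rule amounts to one integer division. The sketch below implements it in plain Python; `resolve_shape` is a hypothetical helper written for this illustration, not a framework function.

```python
import math

def resolve_shape(old_shape, new_shape):
    """Return new_shape with a single -1 replaced by the inferred size."""
    new_shape = tuple(new_shape)
    if new_shape.count(-1) > 1:
        raise ValueError("only one dimension can be inferred")
    total = math.prod(old_shape)
    # Product of the explicitly given sizes (the -1 placeholder is skipped).
    known = math.prod(d for d in new_shape if d != -1)
    if -1 in new_shape:
        if known == 0 or total % known != 0:
            raise ValueError(f"cannot reshape {old_shape} into {new_shape}")
        inferred = total // known
        return tuple(inferred if d == -1 else d for d in new_shape)
    if known != total:
        raise ValueError(f"cannot reshape {old_shape} into {new_shape}")
    return new_shape

print(resolve_shape((32, 3, 224, 224), (32, -1)))  # (32, 150528)
```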

### View vs. reshape

In [PyTorch](/wiki/pytorch), `view()` and `reshape()` both produce a tensor with a different shape but the same data. The PyTorch documentation describes `view` as returning "a new tensor with the same data as the self tensor but of a different shape," adding that "the returned tensor shares the same data and must have the same number of elements, but may have a different size."[14] The key difference is that `view()` never copies: it always returns a tensor that shares storage with the original, which requires the requested shape to be compatible with the tensor's existing size and strides (always the case for a contiguous tensor), and it raises an error otherwise. `reshape()` works on both contiguous and non-contiguous tensors; it returns a view when possible and falls back to copying the data when a view is not feasible. Using `view()` is therefore more explicit: the error it raises when the memory layout does not support a zero-copy view can help catch bugs early.

### Contiguity

A tensor is contiguous when its elements are stored in memory in the same order they would be visited by iterating over the tensor in row-major (C-style) order. Operations like `transpose()` and `permute()` change the stride metadata but do not move data in memory, so the result is typically non-contiguous. Calling `.contiguous()` on such a tensor copies the data into a new, contiguous block of memory. Many operations (including `view()`) require contiguity.
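
The same stride behavior can be observed in NumPy, used here as an analogue for the PyTorch calls above. In this sketch `np.ascontiguousarray` plays the role of `.contiguous()`, and NumPy's `reshape`, like PyTorch's, falls back to a copy when no view is possible.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)  # freshly created, row-major (C-contiguous)
xt = x.T                         # transpose: only the stride metadata changes

print(x.flags['C_CONTIGUOUS'], xt.flags['C_CONTIGUOUS'])  # True False
print(np.shares_memory(x, xt))                            # True

# Flattening the transposed array cannot be expressed as a view, so it copies.
flat = xt.reshape(-1)
print(np.shares_memory(xt, flat))                         # False

# Explicit contiguous copy with the same values and shape.
xc = np.ascontiguousarray(xt)
print(xc.flags['C_CONTIGUOUS'], xc.shape)                 # True (4, 3)
```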

## How does broadcasting use tensor shapes?

[Broadcasting](/wiki/broadcasting) is the mechanism by which frameworks automatically expand the shapes of tensors so that element-wise operations can be performed on tensors of different shapes without explicitly copying data. The [NumPy](/wiki/numpy) documentation explains that "the term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations."[12] The rules originated in NumPy and have been adopted by [PyTorch](/wiki/pytorch), TensorFlow, and JAX.

### Broadcasting rules

The NumPy manual states the comparison procedure precisely: "When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing (i.e. rightmost) dimension and works its way left."[12] Two tensors are broadcastable if, comparing their shapes from the trailing (rightmost) dimension, each pair of dimensions is **compatible**. Per the NumPy documentation, "Two dimensions are compatible when 1. they are equal, or 2. one of them is 1."[12] A tensor that lacks a given dimension is treated as if that dimension had size 1 (it is implicitly prepended). The output shape takes the maximum size along each dimension.

### Broadcasting examples

| Tensor A shape | Tensor B shape | Result shape | Broadcastable? |
|---|---|---|---|
| `(5, 3, 4, 1)` | `(3, 1, 1)` | `(5, 3, 4, 1)` | Yes |
| `(1,)` | `(3, 1, 7)` | `(3, 1, 7)` | Yes |
| `(15, 3, 5)` | `(3, 1)` | `(15, 3, 5)` | Yes |
| `(5, 4)` | `(4,)` | `(5, 4)` | Yes |
| `(8, 1, 6, 1)` | `(7, 1, 5)` | `(8, 7, 6, 5)` | Yes |
| `(3,)` | `(4,)` | N/A | **No** (trailing dims 3 vs 4) |
| `(2, 1)` | `(8, 4, 3)` | N/A | **No** (second-from-last dims 2 vs 4) |
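
The comparison procedure can be written out in a few lines of plain Python and checked against rows of the table. This is a sketch of the rule as stated above, not any framework's implementation; `broadcast_shape` and `is_broadcastable` are hypothetical helper names.

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    """Return the broadcast result shape of shapes a and b, or raise ValueError."""
    out = []
    # Walk both shapes from the trailing dimension; a missing dimension counts as 1.
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if da == db or da == 1 or db == 1:
            # Keep the size that is not 1 (or the shared size if they are equal).
            out.append(db if da == 1 else da)
        else:
            raise ValueError(f"incompatible dimensions {da} and {db}")
    return tuple(reversed(out))

def is_broadcastable(a, b):
    """True if the two shapes can be broadcast together."""
    try:
        broadcast_shape(a, b)
        return True
    except ValueError:
        return False

print(broadcast_shape((8, 1, 6, 1), (7, 1, 5)))  # (8, 7, 6, 5)
print(is_broadcastable((2, 1), (8, 4, 3)))       # False
```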

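A common use of broadcasting is adding a bias vector of shape `(C,)` to a batch of feature maps of shape `(N, C, H, W)` in a [convolutional layer](/wiki/convolutional_layer). Because broadcasting aligns shapes from the right, a bare `(C,)` vector would line up with the width axis, so the bias is first reshaped to `(1, C, 1, 1)` (or `(C, 1, 1)`), which then broadcasts across the batch and spatial axes.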
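
A minimal NumPy sketch of this bias addition follows, with the bias given the shape `(1, C, 1, 1)` so that it lines up with the channel axis. The sizes and bias values are arbitrary.

```python
import numpy as np

n, c, h, w = 2, 3, 4, 4
features = np.zeros((n, c, h, w))
bias = np.array([0.1, 0.2, 0.3])  # shape (C,)

# Shape (1, C, 1, 1) broadcasts over batch, height, and width.
# A bare (C,) vector would instead be matched against the trailing W axis.
out = features + bias.reshape(1, c, 1, 1)

print(out.shape)        # (2, 3, 4, 4)
print(out[0, :, 0, 0])  # one bias value per channel
```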

### In-place broadcasting restriction

In PyTorch, in-place operations (e.g. `x.add_(y)`) do not allow the shape of `x` to change as a result of broadcasting. If the broadcast would require `x` to grow, a `RuntimeError` is raised.
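
NumPy enforces an analogous restriction, which makes the behavior easy to try without PyTorch. In the sketch below the failure surfaces as a `ValueError` rather than PyTorch's `RuntimeError`; the variable `grew` exists only to record the outcome.

```python
import numpy as np

x = np.zeros((1, 3))
y = np.ones((2, 3))

# Out-of-place: a new (2, 3) result is allocated, so broadcasting is fine.
z = x + y

# In-place: x would have to grow from (1, 3) to (2, 3), which is not allowed.
try:
    x += y
    grew = True
except ValueError:
    grew = False

print(z.shape, x.shape, grew)  # (2, 3) (1, 3) False
```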

## How does shape change through neural network layers?

Understanding how each type of layer transforms the shape of its input is essential for building and debugging models.

### Linear (fully connected) layer

A [linear layer](/wiki/fully_connected_layer) with `in_features` inputs and `out_features` outputs transforms shape `(*, in_features)` to `(*, out_features)`, where `*` represents any number of leading batch dimensions.
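
The "any number of leading dimensions" behavior follows directly from matrix multiplication acting on the last axis. Below is a sketch in NumPy, assuming the weight is stored as `(out_features, in_features)` as in PyTorch's `nn.Linear`; the `linear` function is an illustrative stand-in for the layer.

```python
import numpy as np

in_features, out_features = 768, 256
rng = np.random.default_rng(0)
weight = rng.standard_normal((out_features, in_features))  # (out, in)
bias = np.zeros(out_features)                              # (out,)

def linear(x):
    """Apply y = x W^T + b over the last axis of x."""
    return x @ weight.T + bias

print(linear(np.zeros((64, in_features))).shape)       # (64, 256)
print(linear(np.zeros((32, 128, in_features))).shape)  # (32, 128, 256)
```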

### Convolutional layer

For a 2-D [convolutional layer](/wiki/convolutional_layer), the spatial dimensions of the output are determined by:

```
H_out = floor((H_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)
W_out = floor((W_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1)
```

The number of output channels equals the number of filters (`out_channels`), and the batch dimension is unchanged. A full shape transformation example:

| Parameter | Value |
|---|---|
| Input shape | `(N, 3, 224, 224)` |
| `out_channels` | 64 |
| `kernel_size` | 7 |
| `stride` | 2 |
| `padding` | 3 |
| `dilation` | 1 |
| **Output shape** | `(N, 64, 112, 112)` |

Applying the formula: floor((224 + 2*3 - 1*(7-1) - 1) / 2 + 1) = floor((224 + 6 - 6 - 1) / 2 + 1) = floor(223 / 2 + 1) = floor(111.5 + 1) = 112.
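
The formula translates into a one-line helper. In this sketch `conv_out_size` is a hypothetical function name, and integer floor division stands in for the explicit `floor`.

```python
def conv_out_size(size_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial axis of a convolution or pooling layer."""
    numerator = size_in + 2 * padding - dilation * (kernel_size - 1) - 1
    return numerator // stride + 1  # floor division implements floor(...)

# The worked example: 7x7 kernel, stride 2, padding 3 on a 224-pixel axis.
print(conv_out_size(224, kernel_size=7, stride=2, padding=3))  # 112

# A 3x3 kernel with stride 1 and padding 1 preserves the spatial size.
print(conv_out_size(224, kernel_size=3, stride=1, padding=1))  # 224
```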

### Pooling layer

[Pooling](/wiki/pooling) layers (max pool, average pool) follow the same spatial output formula as convolutional layers but do not change the channel dimension.

### Recurrent layers

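An [LSTM](/wiki/long_short-term_memory_lstm) or GRU with input shape `(N, L, H_in)` and `hidden_size` H produces an output of shape `(N, L, D * H)`, where D is 2 for bidirectional and 1 otherwise. In PyTorch this batch-first layout requires `batch_first=True`; with the default setting the input and output are sequence-first, `(L, N, H_in)` and `(L, N, D * H)`.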

### Attention / transformer layers

A standard multi-head [self-attention](/wiki/self-attention_also_called_self-attention_layer) layer preserves the input shape `(N, L, E)`. The queries, keys, and values are internally reshaped from `(N, L, E)` to `(N, num_heads, L, E // num_heads)` for parallel attention computation, then reshaped back. The [feed-forward network](/wiki/feedforward_neural_network_ffn) inside each transformer block temporarily projects to a higher dimension (often 4E) and then back to E, again preserving the overall shape `(N, L, E)`.
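
The head split and merge described here is a reshape followed by an axis swap. The NumPy sketch below uses small arbitrary sizes and, purely to show the score shape, reuses the queries as keys.

```python
import numpy as np

n, l, e, num_heads = 2, 5, 16, 4
head_dim = e // num_heads

q = np.arange(n * l * e, dtype=np.float32).reshape(n, l, e)

# (N, L, E) -> (N, L, heads, head_dim) -> (N, heads, L, head_dim)
q_heads = q.reshape(n, l, num_heads, head_dim).transpose(0, 2, 1, 3)

# Per-head attention scores: (N, heads, L, head_dim) @ (N, heads, head_dim, L)
scores = q_heads @ q_heads.transpose(0, 1, 3, 2)

# Merge the heads back: (N, heads, L, head_dim) -> (N, L, E)
merged = q_heads.transpose(0, 2, 1, 3).reshape(n, l, e)

print(q_heads.shape)  # (2, 4, 5, 4)
print(scores.shape)   # (2, 4, 5, 5)
print(merged.shape)   # (2, 5, 16)
```

Splitting and then merging returns the original tensor unchanged, which is a quick sanity check that the permutation is the right one.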

## How do you debug tensor shape errors?

Tensor shape mismatches are among the most frequent runtime errors in deep learning. Lagouvardos et al. (2020) note that in TensorFlow programs "a large class of errors is reported by the library at runtime, due to mismatches in tensor shapes" and that such errors "are very common: we encountered numerous such errors" in real code.[6] A separate static analyzer for PyTorch and TensorFlow training code, presented by Shin et al. (2022) at ICSE, was built specifically because these shape bugs are common and hard to detect statically.[5] The following strategies help prevent and diagnose them.

### Common causes

| Error type | Typical symptom | Example |
|---|---|---|
| Mismatched matrix multiply dimensions | `RuntimeError: mat1 and mat2 shapes cannot be multiplied` | The `in_features` of a [linear layer](/wiki/fully_connected_layer) does not match the last dimension of the input |
| Wrong number of dimensions | `RuntimeError: Expected 4-dimensional input for spatial ... but got 3-dimensional input` | Forgetting the batch dimension when feeding a single image to a [CNN](/wiki/convolutional_neural_network) |
| Incompatible broadcast | `RuntimeError: The size of tensor a (X) must match the size of tensor b (Y) at non-singleton dimension Z` | Trying to add tensors whose shapes violate broadcasting rules |
| Invalid reshape | `RuntimeError: shape [X] is invalid for input of size Y` | Reshaping to a shape whose total element count differs from the source |
| Last-batch size mismatch | `RuntimeError: size mismatch, m1: [A x B], m2: [C x D]` | The final mini-batch is smaller than `batch_size` and a layer expects a fixed size |

### Debugging techniques

1. **Print shapes at every step.** Insert `print(x.shape)` before and after each layer or operation inside the `forward()` method. This is the fastest way to find the point where a shape goes wrong.
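2. **Use the meta device.** PyTorch's meta device lets you compute output shapes without allocating memory for tensor data. Both the model and the input must live on the meta device, so move a throwaway copy of the model there (its weights are discarded) and run a forward pass to trace shapes at near-zero cost: `model = model.to('meta'); x = torch.randn(1, 3, 224, 224, device='meta'); out = model(x); print(out.shape)`.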
3. **Use model summary tools.** Libraries such as `torchinfo` (formerly `torch-summary`, a successor to the older `torchsummary` package) print a table of layer names, output shapes, and parameter counts for a given input size.
4. **Read the error message carefully.** PyTorch error messages typically include the exact shapes that caused the failure and the operation that triggered it.
5. **Check the documentation.** Layer documentation specifies the expected input and output shapes, including which dimensions correspond to batch, channels, features, and spatial extent.

## What is the difference between static and dynamic shapes?

The distinction between static and dynamic shapes arises when compiling or tracing a model.

**Static shapes** are fixed at graph-construction or compilation time. [TensorFlow](/wiki/tensorflow) 1.x graph mode and TensorRT require static shapes by default, which enables aggressive kernel fusion and memory planning but limits flexibility.

**Dynamic shapes** allow one or more dimensions to vary between invocations. This is the default behavior in [PyTorch](/wiki/pytorch) eager mode and TensorFlow 2.x eager mode. When using `torch.compile()`, PyTorch initially treats all shapes as static and recompiles if a shape changes; in recent releases the recompiled graph by default treats the dimension that changed as dynamic. Developers can mark dimensions as dynamic up front with `torch._dynamo.mark_dynamic()` to avoid repeated recompilation. Internally, PyTorch uses SymPy to represent symbolic shape expressions that are solved at dispatch time.

**Bounded dynamic shapes** (used by PyTorch/XLA for TPUs) restrict dynamic dimensions to a declared range, allowing the compiler to allocate a fixed memory budget while still accepting variable-length inputs.

## Einops and expressive shape notation

[Einops](https://einops.rocks/) is a library that provides a concise, readable notation for tensor shape operations. Published as a conference paper at ICLR 2022, einops offers three core functions, `rearrange`, `reduce`, and `repeat`, that replace many individual calls to reshape, transpose, permute, squeeze, and unsqueeze.[4]

For example, converting an image batch from NHWC to NCHW:

```python
from einops import rearrange
# x has shape (batch, height, width, channels)
x = rearrange(x, 'b h w c -> b c h w')
```

Splitting an embedding dimension into multiple attention heads:

```python
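from einops import rearrange
# q has shape (batch, seq_len, num_heads * head_dim), with num_heads = 8
q = rearrange(q, 'b s (h d) -> b h s d', h=8)
# q now has shape (batch, num_heads, seq_len, head_dim)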
```

The einops notation makes the intended shape transformation self-documenting, reducing the risk of silent shape errors that can occur with chains of `.view()` and `.permute()` calls.

## Named tensors

A known limitation of positional-index-based shape manipulation is that axes are identified only by integer positions, making code error-prone and difficult to read. Several projects aim to attach human-readable names to tensor dimensions.

**PyTorch Named Tensors** (prototype API) let you assign names to dimensions at creation time, for example `torch.zeros(2, 3, names=('N', 'C'))`. Operations then check dimension names for correctness at runtime, catching permutation errors that would otherwise produce silent bugs.

**Named Tensor Notation**, proposed by Chiang and Rush (2021), is a formal notation that uses subscript names on tensors to make dimension semantics explicit in mathematical writing, analogous to how einsum uses named indices.[7]

**Xarray** brings labeled, named dimensions to NumPy arrays and is widely used in scientific computing; **xarray-jax** extends the same idea to JAX arrays.

## Tensor decomposition and shape reduction

Tensor decomposition methods factorize a high-dimensional tensor into smaller tensors, effectively changing the shape representation while preserving (or approximating) the information content.[8]

| Decomposition | Input shape | Output shapes (conceptual) | Use in ML |
|---|---|---|---|
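| **CP (CANDECOMP/PARAFAC)** | `(I, J, K)` | Sum of R rank-one terms, each the outer product of three vectors of sizes `(I,)`, `(J,)`, `(K,)`; equivalently factor matrices `(I, R)`, `(J, R)`, `(K, R)` | Compressing convolutional filters, recommender systems |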
| **Tucker** | `(I, J, K)` | Core tensor `(R1, R2, R3)` + factor matrices `(I, R1)`, `(J, R2)`, `(K, R3)` | Model compression, higher-order SVD |
| **Tensor Train (TT)** | `(I1, I2, ..., In)` | Chain of n 3-D cores, the k-th of shape `(R(k-1), Ik, Rk)` with `R0 = Rn = 1` | Compressing large [embedding](/wiki/embedding_vector) tables, physics simulations |

These decompositions reduce parameter counts and computational cost while transforming the original tensor's shape into a set of smaller, structured shapes. They are used in practice to compress [deep neural network](/wiki/deep_neural_network) layers for deployment on resource-constrained devices.
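
The shapes in the table, and the resulting parameter savings, can be made concrete with `np.einsum`. The sketch below only reconstructs tensors from random factors to show how the shapes fit together; it does not fit a decomposition to data, and the sizes and ranks are arbitrary.

```python
import numpy as np

I, J, K = 6, 5, 4
rng = np.random.default_rng(0)

# CP with rank R: three factor matrices, one per mode.
R = 3
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))
X_cp = np.einsum('ir,jr,kr->ijk', A, B, C)  # sum of R outer products

# Tucker with ranks (R1, R2, R3): a small core plus three factor matrices.
R1, R2, R3 = 2, 2, 2
G = rng.standard_normal((R1, R2, R3))
U1 = rng.standard_normal((I, R1))
U2 = rng.standard_normal((J, R2))
U3 = rng.standard_normal((K, R3))
X_tucker = np.einsum('abc,ia,jb,kc->ijk', G, U1, U2, U3)

full_params = I * J * K                                  # 120
cp_params = R * (I + J + K)                              # 45
tucker_params = R1 * R2 * R3 + I * R1 + J * R2 + K * R3  # 38

print(X_cp.shape, X_tucker.shape)  # (6, 5, 4) (6, 5, 4)
print(full_params, cp_params, tucker_params)
```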

## How do tensor shapes differ across frameworks?

The following table compares how common tensor operations are invoked across the three major frameworks. Note that the APIs agree on the meaning of shape but differ on when it is known: NumPy shapes are always concrete, PyTorch returns a `torch.Size` tuple subclass, and TensorFlow distinguishes a static `tensor.shape` (which may contain `None`) from a dynamic `tf.shape(tensor)` resolved at run time.

| Operation | NumPy | PyTorch | TensorFlow |
|---|---|---|---|
| Get shape | `a.shape` | `x.shape` or `x.size()` | `x.shape` or `tf.shape(x)` |
| Reshape | `np.reshape(a, (2, 6))` | `x.reshape(2, 6)` or `x.view(2, 6)` | `tf.reshape(x, [2, 6])` |
| Transpose | `np.transpose(a, (1, 0, 2))` | `x.permute(1, 0, 2)` | `tf.transpose(x, perm=[1, 0, 2])` |
| Add axis | `np.expand_dims(a, 0)` | `x.unsqueeze(0)` | `tf.expand_dims(x, 0)` |
| Remove size-1 axis | `np.squeeze(a)` | `x.squeeze()` | `tf.squeeze(x)` |
| Concatenate | `np.concatenate([a, b], axis=0)` | `torch.cat([x, y], dim=0)` | `tf.concat([x, y], axis=0)` |
| Stack | `np.stack([a, b], axis=0)` | `torch.stack([x, y], dim=0)` | `tf.stack([x, y], axis=0)` |
| Number of elements | `a.size` | `x.numel()` | `tf.size(x)` |

## Best practices

1. **Always verify shapes during development.** Print or log tensor shapes at every major step. Use assertions such as `assert x.shape == (batch, channels, h, w)` to catch errors early.
2. **Use -1 for inferred dimensions.** When reshaping, you can set one dimension to -1 and the framework will compute it automatically: `x.reshape(batch, -1)` flattens all trailing dimensions.
3. **Prefer named constants over magic numbers.** Define `BATCH = 32; SEQ_LEN = 512; EMBED = 768` and use them in shape assertions and reshapes to make code self-documenting.
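4. **Be deliberate about view vs. copy.** When you need a new shape that is guaranteed to share memory with the original, use `view()`. When you need an independent copy, call `.clone()` before or after reshaping; `reshape()` returns a view whenever it can and copies only when it must, and `.contiguous()` returns the tensor itself if it is already contiguous, so neither guarantees a copy.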
5. **Handle the last batch.** If the dataset size is not divisible by the batch size, the last batch will have a smaller first dimension. Either set `drop_last=True` in your data loader or ensure your model code does not hardcode the batch size.
6. **Favor einops for complex rearrangements.** For any operation that involves more than a simple reshape or transpose, einops notation is more readable, self-documenting, and less error-prone than raw view/permute chains.
7. **Use the meta device for shape debugging.** Before running expensive forward passes, trace shapes with meta tensors to verify that all dimensions are compatible.
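
Several of these practices (shape assertions, named constants, tolerance for a smaller last batch) can be combined in a small helper. The sketch below is one possible approach; `check_shape` is a hypothetical function, and `None` is used as a wildcard for dimensions that may vary, such as the batch axis. The constants are kept deliberately small.

```python
import numpy as np

def check_shape(x, expected):
    """Raise ValueError unless x.shape matches expected; None matches any size."""
    actual = tuple(x.shape)
    ok = len(actual) == len(expected) and all(
        e is None or e == a for a, e in zip(actual, expected)
    )
    if not ok:
        raise ValueError(f"expected shape {expected}, got {actual}")
    return x

SEQ_LEN, EMBED = 16, 8  # named constants instead of magic numbers

full_batch = np.zeros((32, SEQ_LEN, EMBED), dtype=np.float32)
last_batch = np.zeros((7, SEQ_LEN, EMBED), dtype=np.float32)  # smaller final batch

# The batch axis is left unconstrained, so both batches pass.
check_shape(full_batch, (None, SEQ_LEN, EMBED))
check_shape(last_batch, (None, SEQ_LEN, EMBED))

# Flattening with -1 works for either batch size.
flat = last_batch.reshape(last_batch.shape[0], -1)
print(flat.shape)  # (7, 128)
```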

## See also

- [Tensor](/wiki/tensor)
- [Shape (tensor)](/wiki/shape_tensor)
- [Broadcasting](/wiki/broadcasting)
- [Convolutional neural network](/wiki/convolutional_neural_network)
- [NumPy](/wiki/numpy)
- [PyTorch](/wiki/pytorch)
- [TensorFlow](/wiki/tensorflow)
- [Batch size](/wiki/batch_size)
- [Embedding vector](/wiki/embedding_vector)

## References

1. Harris, C. R., Millman, K. J., van der Walt, S. J., et al. (2020). "Array programming with NumPy." *Nature*, 585(7825), 357-362. https://doi.org/10.1038/s41586-020-2649-2
2. Paszke, A., Gross, S., Massa, F., et al. (2019). "PyTorch: An Imperative Style, High-Performance Deep Learning Library." *Advances in Neural Information Processing Systems 32 (NeurIPS 2019)*.
3. Abadi, M., Barham, P., Chen, J., et al. (2016). "TensorFlow: A System for Large-Scale Machine Learning." *Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI)*.
4. Rogozhnikov, A. (2022). "Einops: Clear and Reliable Tensor Manipulations with Einstein-like Notation." *Proceedings of the International Conference on Learning Representations (ICLR 2022)*.
5. Shin, J., Lee, S., Yoon, H., and Oh, H. (2022). "A Static Analyzer for Detecting Tensor Shape Errors in Deep Neural Network Training Code." *Proceedings of the ACM/IEEE 44th International Conference on Software Engineering (ICSE)*.
6. Lagouvardos, S., Dolby, J., Grech, N., Antoniadis, A., and Smaragdakis, Y. (2020). "Static Analysis of Shape in TensorFlow Programs." *Proceedings of the 34th European Conference on Object-Oriented Programming (ECOOP 2020)*.
7. Chiang, D. and Rush, A. M. (2021). "Named Tensor Notation." *arXiv preprint arXiv:2102.13196*.
8. Kolda, T. G. and Bader, B. W. (2009). "Tensor Decompositions and Applications." *SIAM Review*, 51(3), 455-500.
9. NVIDIA (2023). "Convolutional Layers User's Guide." *NVIDIA Deep Learning Performance Documentation*. https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html
10. PyTorch Contributors (2025). "Broadcasting Semantics." *PyTorch Documentation*. https://docs.pytorch.org/docs/stable/notes/broadcasting.html
11. PyTorch Contributors (2025). "Reasoning about Shapes in PyTorch." *PyTorch Tutorials*. https://docs.pytorch.org/tutorials/recipes/recipes/reasoning_about_shapes.html
12. NumPy Contributors (2025). "Broadcasting." *NumPy Documentation*. https://numpy.org/doc/stable/user/basics.broadcasting.html
13. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems 30 (NeurIPS 2017)*.
14. PyTorch Contributors (2025). "torch.Tensor.view." *PyTorch Documentation*. https://docs.pytorch.org/docs/stable/generated/torch.Tensor.view.html
15. PyTorch Contributors (2025). "torch.Tensor.size." *PyTorch Documentation*. https://docs.pytorch.org/docs/stable/generated/torch.Tensor.size.html
16. PyTorch Contributors (2025). "(beta) Channels Last Memory Format in PyTorch." *PyTorch Tutorials*. https://docs.pytorch.org/tutorials/intermediate/memory_format_tutorial.html
17. PyTorch Contributors (2025). "torch.reshape." *PyTorch Documentation*. https://docs.pytorch.org/docs/stable/generated/torch.reshape.html