# Shape (Tensor)

> Source: https://aiwiki.ai/wiki/shape_tensor
> Updated: 2026-04-26
> Categories: Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms), [Tensor](/wiki/tensor), [Tensor size](/wiki/tensor_size)*

In machine learning, the **shape** of a [tensor](/wiki/tensor) is the tuple of integers giving the size of the tensor along each of its axes. For a tensor of rank `r`, the shape is written `(d_1, d_2, ..., d_r)`, where each `d_i` is the number of elements stored along axis `i`. Shape is one of the two metadata fields (alongside data type) that almost every deep learning framework attaches to a tensor, and shape mismatches are among the most common sources of runtime errors in model code.

A tensor's shape determines the total number of elements (`prod(shape)`), the legal operations that can be applied to it (matrix multiplication, convolution, broadcasting), and, together with the strides, the memory layout of the underlying buffer. Frameworks such as [NumPy](/wiki/numpy), [PyTorch](/wiki/pytorch), [TensorFlow](/wiki/tensorflow), and [JAX](/wiki/jax) all expose a `.shape` attribute on their array types, but they differ in whether the shape is fixed at trace time or known only at runtime, and in the conventions they use for image, sequence, and audio data.

## definition

A shape is an ordered tuple of non-negative integers. The length of the tuple is called the **rank** (also **ndim**, the number of dimensions, or the number of axes), and each entry is the size of the tensor along the corresponding axis. NumPy's documentation defines `ndarray.shape` as "a tuple of array dimensions" whose length equals the array's number of dimensions.[1] PyTorch's `torch.Tensor.size()` returns a `torch.Size` object, a subclass of `tuple`; the equivalent attribute is `tensor.shape`, and the two are interchangeable.[2]
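The relationship between shape, rank, and element count can be checked directly. The snippet below is a minimal NumPy sketch; the array `x` and its sizes are arbitrary illustrative choices.

```python
import math
import numpy as np

x = np.zeros((2, 3, 4))          # an arbitrary rank-3 array

shape = x.shape                  # tuple of per-axis sizes
rank = x.ndim                    # number of axes
n_elements = x.size              # total element count

# The rank is the length of the shape tuple, and the element
# count is the product of the entries of the shape.
assert rank == len(shape)
assert n_elements == math.prod(shape)

print(shape, rank, n_elements)   # (2, 3, 4) 3 24
```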

The vocabulary varies between communities. Mathematicians and physicists tend to call the rank the *order* of the tensor and call individual axes *modes*; deep learning practitioners usually say *rank*, *ndim*, or just "number of dimensions," and call individual axes *axes* or *dims*. The numeric values inside the shape tuple are sometimes called *dimension sizes*, *extents*, or simply *dims*.

## rank, axes, and examples

Rank refers to how many axes a tensor has, not how big each axis is. A 1000-element vector has the same rank (1) as a 3-element vector. The table below collects the canonical examples.

| Rank | Object | Example shape | Notes |
|------|--------|---------------|-------|
| 0 | Scalar | `()` | A single number; NumPy reports the empty tuple |
| 1 | Vector | `(n,)` | A 1D array of length `n` |
| 2 | Matrix | `(m, n)` | `m` rows, `n` columns |
| 3 | 3D tensor | `(d, h, w)` | Often a single image with `d` channels |
| 4 | Image batch | `(N, C, H, W)` or `(N, H, W, C)` | NCHW (PyTorch) vs NHWC (TensorFlow) |
| 5 | Video batch | `(N, T, C, H, W)` | Adds a time axis `T` |

The trailing comma in `(n,)` is Python syntax to disambiguate a one-element tuple from a parenthesized expression; `(4)` is the integer 4, while `(4,)` is a tuple of length one. A scalar produced by a NumPy reduction (`arr.sum()`) has shape `()`, an empty tuple, which is distinct from a length-1 vector with shape `(1,)`.
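These distinctions are easy to verify interactively. The following is a small NumPy sketch with illustrative variable names; it contrasts the integer `(4)` with the tuple `(4,)`, and a 0-D reduction result with a length-1 vector.

```python
import numpy as np

not_a_tuple = (4)                 # parentheses only: the integer 4
one_tuple = (4,)                  # trailing comma: a tuple of length one
print(type(not_a_tuple).__name__, type(one_tuple).__name__)  # int tuple

scalar = np.arange(6).sum()       # a reduction over every axis
vector = np.array([15])           # a length-1 vector holding the same value
print(scalar.shape, scalar.ndim)  # () 0
print(vector.shape, vector.ndim)  # (1,) 1

# Rank counts axes, not elements: both vectors have exactly one axis.
small, large = np.zeros(3), np.zeros(1000)
print(small.ndim, large.ndim)     # 1 1
```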

## framework conventions

Different frameworks expose shape through slightly different APIs and with different semantics around when the shape is known.

| Framework | Shape access | Type returned | Notes |
|-----------|--------------|---------------|-------|
| [NumPy](/wiki/numpy) | `arr.shape` | `tuple` of `int` | Always concrete; assigning to `shape` reshapes in place but is discouraged in favor of `reshape`[1] |
| [PyTorch](/wiki/pytorch) | `t.shape` or `t.size()` | `torch.Size` (tuple subclass) | `t.size(dim)` returns the size of one axis as `int`[2] |
| [TensorFlow](/wiki/tensorflow) | `t.shape` (static) and `tf.shape(t)` (dynamic) | `TensorShape` and 1-D `int32` `Tensor` | Static may contain `None` for unknown dims; dynamic is always concrete at run time[3] |
| [JAX](/wiki/jax) | `x.shape` (on a `jax.Array`) | `tuple` of `int` | Inside `jit`, shapes must be concrete at trace time; varying shape recompiles[4] |

In TensorFlow, the static shape returned by `tensor.shape` is the shape inferred during graph construction. It can be partially known, with `None` standing in for axes whose size depends on runtime data (typical for the batch axis or a variable sequence length). The dynamic shape returned by `tf.shape(tensor)` is a 1-D `int32` tensor that is always fully known at execution time and that can be fed into other ops.[3] The two forms are equivalent only when the static shape is fully defined.

JAX takes a stricter line. Inside a `jax.jit`-compiled function, all array shapes must be static at trace time, because the function is lowered to StableHLO and compiled separately for each combination of input shapes and dtypes. Calling the same jitted function with a new input shape triggers a recompilation. Practitioners usually pad inputs to a small set of fixed sizes or use `static_argnums` for shape-dependent constants. JAX's *shape polymorphism* feature relaxes this for export, allowing a single compiled artifact to handle a family of shapes.[4]
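The padding strategy can be sketched without any JAX-specific code. In the NumPy example below, the bucket lengths and the helper name `pad_to_bucket` are hypothetical; the point is only that many distinct input lengths collapse to a handful of shapes, so a function compiled per shape would see few variants.

```python
import numpy as np

BUCKETS = (128, 256, 512)   # hypothetical fixed lengths, chosen for illustration

def pad_to_bucket(x, buckets=BUCKETS, pad_value=0):
    """Pad a 1-D array up to the smallest bucket length that fits it."""
    n = x.shape[0]
    for size in buckets:
        if n <= size:
            # Pad only on the right, leaving the original data in front.
            return np.pad(x, (0, size - n), constant_values=pad_value)
    raise ValueError(f"length {n} exceeds the largest bucket {buckets[-1]}")

lengths = [100, 120, 200, 300, 511]
padded_shapes = {pad_to_bucket(np.ones(n)).shape for n in lengths}
print(sorted(padded_shapes))   # [(128,), (256,), (512,)]
```

Five input lengths map to three shapes here. The cost is wasted compute on the padded positions, so bucket sizes are a trade-off between recompilation and padding overhead.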

## channel ordering in computer vision

Four-dimensional image tensors come in two competing layouts. **NCHW** (channels-first) puts the channel axis right after the batch axis: `(N, C, H, W)`. **NHWC** (channels-last) puts channels at the end: `(N, H, W, C)`. PyTorch defaults to NCHW; TensorFlow and Keras default to NHWC; ONNX uses NCHW; Apple's Metal Performance Shaders prefer NHWC; cuDNN supports both.

| Layout | Convention | Default in | Memory locality |
|--------|-----------|-----------|-----------------|
| NCHW | Channels-first | PyTorch, ONNX, Caffe | Spatial pixels of one channel are contiguous |
| NHWC | Channels-last | TensorFlow, Keras, Apple MPS | All channels of one pixel are contiguous |

The choice is not just cosmetic. NHWC is often faster on modern hardware because Tensor Cores and vector instructions read across channels, and convolution kernels written for the cuDNN `NHWC` path avoid the layout conversion that the `NCHW` path performs internally. The official PyTorch tutorial reports 8 to 35 percent speedups on Volta GPUs and over 22 percent on ResNet-50 with mixed-precision training when using `tensor.to(memory_format=torch.channels_last)`.[5] Intel's CPU benchmarks for vision models show 1.3 to 1.8 times higher throughput with channels-last on Ice Lake and newer CPUs.[6]

In PyTorch, channels-last is implemented as a memory format on a 4D NCHW tensor rather than as a different shape. The strides change from `(C*H*W, H*W, W, 1)` to `(C*H*W, 1, W*C, C)`, but `tensor.shape` still reports `(N, C, H, W)`. To switch layouts entirely, use [`permute`](/wiki/transpose) (`x.permute(0, 2, 3, 1)`); in TensorFlow use `tf.transpose(x, [0, 2, 3, 1])`.
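The "same shape, different strides" idea can be reproduced in NumPy as an analogy. This is a sketch, not PyTorch's implementation: it allocates an NHWC buffer and views it through NCHW axis order, then prints strides in elements (NumPy reports bytes, so the hypothetical helper `element_strides` divides by the item size).

```python
import numpy as np

N, C, H, W = 2, 3, 4, 5

def element_strides(a):
    """Strides measured in elements rather than bytes."""
    return tuple(s // a.itemsize for s in a.strides)

# Plain channels-first buffer.
nchw = np.zeros((N, C, H, W), dtype=np.float32)

# Channels-last buffer, viewed with the logical axis order (N, C, H, W).
channels_last = np.zeros((N, H, W, C), dtype=np.float32).transpose(0, 3, 1, 2)

print(nchw.shape, element_strides(nchw))                    # (2, 3, 4, 5) (60, 20, 5, 1)
print(channels_last.shape, element_strides(channels_last))  # (2, 3, 4, 5) (60, 1, 15, 3)
```

Both arrays report the same shape; only the strides reveal that the second one stores all channels of a pixel next to each other.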

## common shape patterns

Most deep learning code follows a small set of canonical shape conventions, summarized below.

| Domain | Typical shape | Meaning of axes |
|--------|---------------|-----------------|
| Tabular | `(N, F)` | `N` rows, `F` features |
| Image batch (PyTorch) | `(N, C, H, W)` | batch, channels, height, width |
| Image batch (TF/Keras) | `(N, H, W, C)` | batch, height, width, channels |
| Token IDs | `(B, T)` | batch, sequence length |
| Token embeddings | `(B, T, D)` | batch, sequence, hidden dim |
| Attention scores | `(B, H, T, T)` | batch, heads, query len, key len |
| Audio waveform | `(B, C, L)` or `(B, L, C)` | batch, channels, samples |
| Mel spectrogram | `(B, C, F, T)` | batch, channels, frequency, time |
| Video | `(B, T, C, H, W)` | batch, frames, channels, height, width |
| Point cloud | `(B, N, 3)` | batch, points, xyz |

The leading axis is almost always the [batch size](/wiki/batch_size) `N` or `B`; this matters because most operators (linear layers, convolutions, attention) treat the first axis as independent samples that can be parallelized. Batch normalization is a notable exception, since it computes statistics across the batch axis.

## broadcasting rules

Broadcasting is the rule that lets two arrays with different shapes participate in an element-wise operation. NumPy, PyTorch, TensorFlow, and JAX all share the same rules, originally specified by NumPy.[7]

The shapes are aligned from the right (trailing axis first), padding the shorter shape with leading 1s. Two aligned axes are compatible if they are equal, or if one of them is 1. The broadcast result takes the per-axis maximum.

| Operation | Left shape | Right shape | Aligned | Result |
|-----------|-----------|-------------|---------|--------|
| Add scalar | `(3, 4)` | `()` | `(3, 4)` vs `(1, 1)` | `(3, 4)` |
| Add row | `(3, 4)` | `(4,)` | `(3, 4)` vs `(1, 4)` | `(3, 4)` |
| Add column | `(3, 4)` | `(3, 1)` | `(3, 4)` vs `(3, 1)` | `(3, 4)` |
| Outer product | `(3, 1)` | `(1, 4)` | `(3, 1)` vs `(1, 4)` | `(3, 4)` |
| Mismatch | `(3, 4)` | `(3,)` | `(3, 4)` vs `(1, 3)` | `ValueError` |
| 4D vs 3D | `(8, 1, 6, 1)` | `(7, 1, 5)` | `(8, 1, 6, 1)` vs `(1, 7, 1, 5)` | `(8, 7, 6, 5)` |

The last row is the canonical example from the NumPy manual.[7] Broadcasting itself does not copy the inputs; it conceptually stretches a size-1 axis by reusing the same memory through a stride of 0. This is why broadcasting is cheap, and why bugs caused by accidental broadcasting (a `(1000, 1)` vs `(1000,)` mistake that silently produces a `(1000, 1000)` result) can be hard to spot.
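The alignment rule is short enough to write out in plain Python. The function below is an illustrative sketch (the name `broadcast_shape` is not a library API); it right-aligns the two shapes, pads with leading 1s, and checks each pair of axes.

```python
def broadcast_shape(a, b):
    """Return the broadcast of two shapes, or raise ValueError if incompatible."""
    rank = max(len(a), len(b))
    # Right-align by padding the shorter shape with leading 1s.
    a = (1,) * (rank - len(a)) + tuple(a)
    b = (1,) * (rank - len(b)) + tuple(b)
    result = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            result.append(da)      # equal sizes, or the right side stretches
        elif da == 1:
            result.append(db)      # the left side stretches
        else:
            raise ValueError(f"incompatible axes: {da} vs {db}")
    return tuple(result)

print(broadcast_shape((3, 4), ()))               # (3, 4)
print(broadcast_shape((3, 4), (4,)))             # (3, 4)
print(broadcast_shape((3, 1), (1, 4)))           # (3, 4)
print(broadcast_shape((8, 1, 6, 1), (7, 1, 5)))  # (8, 7, 6, 5)
```

Calling `broadcast_shape((3, 4), (3,))` raises `ValueError`, matching the mismatch row of the table.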

## static vs dynamic shapes

The term *static shape* describes a shape known at compile time, before any data has been seen. *Dynamic shape* describes a shape known only at run time. The distinction matters because compilers (XLA, TorchInductor, ONNX Runtime, TensorRT) generally produce faster code when shapes are static, since they can fuse operations, allocate buffers, and pick kernels for specific dimensions ahead of time.

In the eager mode used by default in PyTorch and TensorFlow 2, every shape is dynamic in the sense that it is computed each time the program runs. In graph or JIT mode the picture changes:

- `torch.compile` and TorchScript can specialize on shapes, but PyTorch 2 also supports *dynamic shapes* through symbolic reasoning so that variable batch or sequence lengths do not force a recompile.
- TensorFlow's `tf.function` traces a `ConcreteFunction` per input signature; passing a tensor with a new shape may trigger retracing unless the input signature uses `None` for the variable axis.
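- JAX's `jit` recompiles for each new input shape, and an argument marked with `static_argnums` likewise triggers a recompile for each new value; shape polymorphism lets an exported function cover a family of shapes.[4]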
- TensorRT and ONNX Runtime expose explicit *dynamic axes* in the model signature so that one engine handles a range of input sizes.

A practical consequence is that ML engineers spend a noticeable fraction of debugging time on shape specialization: a model that runs fine on a fixed batch of 32 may recompile (or crash) when given a batch of 31 at the end of an epoch.
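The end-of-epoch case is simple arithmetic. The helper below is a hypothetical sketch in plain Python; its `drop_last` flag mirrors the common data-loader option of discarding the final partial batch so that only one batch shape ever reaches the model.

```python
def batch_sizes(n_samples, batch_size, drop_last=False):
    """List the size of every batch produced in one pass over the data."""
    full, remainder = divmod(n_samples, batch_size)
    sizes = [batch_size] * full
    if remainder and not drop_last:
        sizes.append(remainder)    # the ragged final batch
    return sizes

with_tail = batch_sizes(1023, 32)
without_tail = batch_sizes(1023, 32, drop_last=True)

print(sorted(set(with_tail)))      # [31, 32]  two distinct batch shapes
print(sorted(set(without_tail)))   # [32]      a single batch shape
```

Dropping the tail avoids the second shape at the cost of skipping 31 samples per epoch in this example; padding the last batch is the usual alternative.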

## shape manipulation operations

The table below lists the operations every deep learning practitioner uses to bend shapes into the form an operator expects.

| Operation | What it does | Constraints |
|-----------|--------------|-------------|
| [reshape](/wiki/reshape) | Returns a tensor with a new shape and the same total elements | `prod(new_shape) == prod(old_shape)`; one entry may be `-1` to be inferred |
| `view` (PyTorch) | Like `reshape`, but always returns a view sharing memory and never copies | New shape must be compatible with the existing strides; a contiguous tensor always qualifies |
| `flatten` | Collapses a range of axes into one | Equivalent to `reshape` with the right product |
| `squeeze` | Removes axes of size 1 | Optionally restricted to a single axis |
| `unsqueeze` (PyTorch) / `expand_dims` (NumPy, TF) | Inserts a new axis of size 1 | New rank is `r + 1` |
| [`transpose`](/wiki/transpose) | Swaps two axes | Returns a view; result is usually non-contiguous |
| `permute` (PyTorch) | Reorders all axes by a permutation | Returns a view; usually non-contiguous |
| `expand` (PyTorch) / `broadcast_to` (NumPy, TF) | Makes a size-1 axis appear larger via stride 0 | No data copy in NumPy or PyTorch; NumPy returns a read-only view, and in-place writes to an expanded PyTorch tensor are unsafe |
| `repeat` (PyTorch) / `tile` (NumPy, TF) | Actually copies data along an axis | Allocates new memory |
| `stack` | Concatenates along a new axis | All inputs must share the same shape |
| `concat` / `cat` | Concatenates along an existing axis | All inputs must match on every axis except the one being concatenated |

In PyTorch, `view` requires the requested shape to be compatible with the tensor's existing size and strides (any contiguous tensor qualifies), while `reshape` falls back to a copy when it cannot return a view. Calling `tensor.contiguous()` materializes a fresh contiguous copy when needed, typically after a `transpose` or `permute`.[8]
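Most of the operations in the table have direct NumPy counterparts, which makes their effect on shape easy to inspect. The sketch below uses an arbitrary `(2, 3, 4)` array; the final lines use `np.shares_memory` as a NumPy analogue of the view-versus-copy distinction, not as a description of PyTorch internals.

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)

shapes = {
    "reshape":     x.reshape(6, -1).shape,                         # (6, 4), -1 is inferred
    "flatten":     x.reshape(2, -1).shape,                         # (2, 12)
    "expand_dims": np.expand_dims(x, 0).shape,                     # (1, 2, 3, 4)
    "squeeze":     np.squeeze(np.expand_dims(x, 0)).shape,         # (2, 3, 4)
    "transpose":   np.transpose(x, (2, 0, 1)).shape,               # (4, 2, 3)
    "broadcast":   np.broadcast_to(x[:, :1, :], (2, 3, 4)).shape,  # (2, 3, 4)
    "tile":        np.tile(x, (1, 1, 2)).shape,                    # (2, 3, 8)
    "stack":       np.stack([x, x]).shape,                         # (2, 2, 3, 4)
    "concatenate": np.concatenate([x, x], axis=0).shape,           # (4, 3, 4)
}
for name, shape in shapes.items():
    print(f"{name:12s} {shape}")

# Reshaping a contiguous array returns a view; reshaping a transposed
# (non-contiguous) array forces NumPy to copy the data.
flat_view = x.reshape(-1)
flat_copy = x.T.reshape(-1)
print(np.shares_memory(x, flat_view))   # True
print(np.shares_memory(x, flat_copy))   # False
```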

## common shape errors

Shape errors fall into a small number of recurring patterns:

- Matrix multiplication mismatch. [`matmul`](/wiki/matmul) requires the inner dimensions to agree: `(B, M, K) @ (B, K, N)` is valid, `(B, M, K) @ (B, M, N)` is not.
- Forgetting the batch axis at inference. A model trained on `(B, C, H, W)` rejects a single image of shape `(C, H, W)`; the fix is `image.unsqueeze(0)`.
- Wrong channel order. Loading an image with PIL (`(H, W, C)`) and feeding it directly into a PyTorch model that expects `(C, H, W)` triggers a [`convolution`](/wiki/convolution) shape error or, worse, silently treats height as channels.
- Sequence-first vs batch-first confusion. Some libraries default to `(T, B, D)` (sequence first), others to `(B, T, D)`. The PyTorch RNN modules default to sequence first; passing `batch_first=True` is now common.
- Broadcasting accidents. Subtracting a `(N,)` row from an `(N, 1)` column produces an `(N, N)` result by broadcasting, which is rarely what the author meant.
- `transpose` followed by `view`. `view` fails after `transpose` because the result is non-contiguous; use `reshape` or call `.contiguous()` first.
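Several of these patterns can be reproduced in a few lines. The example below is a NumPy sketch with arbitrary sizes; NumPy indexing (`[None]`) and `transpose` stand in for the PyTorch calls `unsqueeze(0)` and `permute` named in the list.

```python
import numpy as np

# Matrix multiplication: the inner dimensions must agree.
a = np.ones((8, 5, 7))                       # (B, M, K)
product = a @ np.ones((8, 7, 3))             # (B, K, N) -> (B, M, N)
try:
    a @ np.ones((8, 5, 3))                   # inner dimensions 7 vs 5
    matmul_failed = False
except ValueError:
    matmul_failed = True

# Missing batch axis: add a leading axis of size 1.
image_chw = np.zeros((3, 32, 48))            # (C, H, W)
batch = image_chw[None]                      # (1, C, H, W)

# Wrong channel order: move channels from last to first.
image_hwc = np.zeros((32, 48, 3))            # (H, W, C)
converted = image_hwc.transpose(2, 0, 1)     # (C, H, W)

# Broadcasting accident: a column minus a row silently becomes a matrix.
row = np.arange(4)                           # (4,)
col = np.arange(4)[:, None]                  # (4, 1)
accident = col - row                         # (4, 4)
intended = col[:, 0] - row                   # (4,)

print(product.shape, matmul_failed)          # (8, 5, 3) True
print(batch.shape, converted.shape)          # (1, 3, 32, 48) (3, 32, 48)
print(accident.shape, intended.shape)        # (4, 4) (4,)
```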

## performance implications

Shape interacts with performance in several non-obvious ways. Memory layout, set by both the shape and the strides, decides whether a kernel can use vectorized loads or must gather scattered elements. The official PyTorch channels-last guide reports significant speedups precisely because the new layout makes the convolution inner loop a streaming load over channels.[5][6]

Tensor Cores on NVIDIA GPUs (Volta, Turing, Ampere, Hopper) operate on tiles of 16x16 or larger, so matrix dimensions that are not multiples of 8 or 16 either fall back to slower paths or pad internally. The cuDNN documentation and several training playbooks recommend rounding hidden sizes, vocabulary sizes, and batch sizes to multiples of 64 or 128 for the same reason. Wave quantization, NVIDIA's term for the throughput cliff that appears when the number of thread-block tiles is just above a multiple of the number of streaming multiprocessors, and the related tile quantization, which appears when a matrix dimension is just above a multiple of the tile size, are both essentially shape-rounding effects.
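Rounding a dimension up to a hardware-friendly multiple is a one-line calculation. The helper name `round_up` and the sample sizes below are illustrative assumptions, not values taken from any particular playbook.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is greater than or equal to `value`."""
    return -(-value // multiple) * multiple   # ceiling division, then scale back

print(round_up(50257, 64))    # 50304
print(round_up(1000, 128))    # 1024
print(round_up(4096, 64))     # 4096  already aligned, so unchanged
```

The extra rows or columns introduced this way are padding: they cost a little memory but let every tile be full.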

For variable-length text or audio batches, padding to the longest sequence wastes compute on the padded positions, while *bucketing* groups similar-length samples and *packing* (also called *example packing*) concatenates several short samples into one long sequence with an attention mask that forbids cross-example attention.[9] Modern LLM training and inference stacks (Megatron, NeMo, vLLM, FlashAttention with variable length) all rely on packed or variable-length sequences to keep the GPUs saturated.
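Both the padding waste and the packing mask can be computed directly. The NumPy sketch below uses made-up sequence lengths and assumes a causal (decoder-style) model; the mask lets a position attend only to earlier positions within the same example.

```python
import numpy as np

lengths = [5, 3, 2, 7, 1]                      # illustrative sample lengths

# Padding every sample to the longest one.
padded_slots = len(lengths) * max(lengths)     # 5 * 7 = 35
real_tokens = sum(lengths)                     # 18
padding_waste = padded_slots - real_tokens     # 17 wasted positions

# Packing the first three samples (5 + 3 + 2 = 10 tokens) into one sequence.
segment_ids = np.repeat(np.arange(3), lengths[:3])   # [0 0 0 0 0 1 1 1 2 2]
same_example = segment_ids[:, None] == segment_ids[None, :]
causal = np.tril(np.ones((10, 10), dtype=bool))
mask = same_example & causal                   # (T, T) block-diagonal, lower-triangular

print(padding_waste)                           # 17
print(mask.shape, int(mask.sum()))             # (10, 10) 24
```

Each example of length `L` contributes `L * (L + 1) / 2` allowed pairs, which gives 15 + 6 + 3 = 24 here; every cross-example entry is masked out.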

## modern context

In the transformer era, sequence length has become the dominant variable axis. A typical decoder-only LLM forward pass works on a `(B, T)` token tensor that becomes `(B, T, D)` after embedding, `(B, H, T, D_head)` inside the attention head split, and `(B, H, T, T)` for the attention score matrix. The quadratic `T*T` term is the reason context-length growth is expensive, and the reason FlashAttention and ring attention focus on streaming the score matrix without materializing it.
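The sequence of shapes can be traced with random data. This is a shape-only NumPy sketch under simplifying assumptions: the sizes are arbitrary, and the same tensor is reused for queries and keys, whereas a real model applies separate learned projections.

```python
import numpy as np

B, T, D, H, V = 2, 6, 16, 4, 100      # batch, sequence, hidden, heads, vocabulary
D_head = D // H
rng = np.random.default_rng(0)

tokens = rng.integers(0, V, size=(B, T))              # (B, T)
embedding = rng.standard_normal((V, D))               # lookup table
x = embedding[tokens]                                 # (B, T, D)

# Split the hidden axis into heads, then move the head axis before time.
q = x.reshape(B, T, H, D_head).transpose(0, 2, 1, 3)  # (B, H, T, D_head)
k = q                                                 # simplification for the sketch

# Every query position scores against every key position: the T*T term.
scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(D_head)  # (B, H, T, T)

print(tokens.shape, x.shape, q.shape, scores.shape)
# (2, 6) (2, 6, 16) (2, 4, 6, 4) (2, 4, 6, 6)
```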

Mixture-of-experts models add a routing dimension that changes shape per token: each expert receives a `(K, D)` slice of the batch where `K` is the number of tokens routed to it, which varies. Diffusion models and video generators add a denoising-step axis or a frame axis. Multimodal models stack image patches, audio frames, and text tokens into a single shared sequence, and the bookkeeping for which axis means what is now a major part of model engineering.
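The data-dependent `(K, D)` slices can be illustrated with a toy router. In this NumPy sketch the random `expert_index` is a stand-in for a learned top-1 router, and all sizes are arbitrary; the only claim is that `K` differs from expert to expert while `D` stays fixed.

```python
import numpy as np

n_tokens, D, n_experts = 12, 8, 4
rng = np.random.default_rng(0)

tokens = rng.standard_normal((n_tokens, D))                # flattened (B*T, D) batch
expert_index = rng.integers(0, n_experts, size=n_tokens)   # stand-in for a top-1 router

# Each expert receives only the tokens routed to it.
slices = [tokens[expert_index == e] for e in range(n_experts)]

for e, s in enumerate(slices):
    print(f"expert {e}: shape {s.shape}")   # (K_e, D), where K_e varies per expert
```

Because each `K` is known only after routing, compiled implementations often pad every expert to a fixed capacity, which trades wasted compute for static shapes.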

The shape of a tensor was once a quiet implementation detail. In production deep learning it has become an interface contract: the type signature of every operator, the unit of test coverage, and often the difference between a 10-millisecond and a 100-millisecond inference call.

## see also

- [Tensor](/wiki/tensor)
- [Tensor size](/wiki/tensor_size)
- [Broadcasting](/wiki/broadcasting)
- [Reshape](/wiki/reshape)
- [Transpose](/wiki/transpose)
- [Matmul](/wiki/matmul)
- [Convolution](/wiki/convolution)
- [Batch size](/wiki/batch_size)
- [Channels first](/wiki/channels_first)
- [Channels last](/wiki/channels_last)
- [NumPy](/wiki/numpy)
- [PyTorch](/wiki/pytorch)
- [TensorFlow](/wiki/tensorflow)
- [JAX](/wiki/jax)

## explain like i'm 5

A tensor is a box of numbers. Its shape is the list of how many numbers fit along each side of the box. A flat row of 5 numbers has shape `(5,)`. A grid with 3 rows and 4 columns has shape `(3, 4)`. A stack of 10 such grids has shape `(10, 3, 4)`. When two boxes are different shapes, the computer sometimes lets you add them anyway by pretending the smaller one is repeated along the missing sides; this trick is called broadcasting. Many bugs in machine learning code are about boxes that did not line up the way you thought they did.

## references

1. NumPy Developers. "numpy.ndarray.shape." *NumPy v2.4 Manual*. https://numpy.org/doc/stable/reference/generated/numpy.ndarray.shape.html
2. PyTorch Contributors. "torch.Tensor.size." *PyTorch 2.11 Documentation*. https://docs.pytorch.org/docs/stable/generated/torch.Tensor.size.html
3. TensorFlow Authors. "tf.shape." *TensorFlow API Documentation*. https://www.tensorflow.org/api_docs/python/tf/shape
4. JAX Authors. "Shape polymorphism" and "jax.jit." *JAX Documentation*. https://docs.jax.dev/en/latest/export/shape_poly.html and https://docs.jax.dev/en/latest/_autosummary/jax.jit.html
5. Vitaly Fedyunin. "Channels Last Memory Format in PyTorch." *PyTorch Tutorials*. https://docs.pytorch.org/tutorials/intermediate/memory_format_tutorial.html
6. Mingfei Ma et al. "Accelerating PyTorch Vision Models with Channels Last on CPU." *PyTorch Blog*, August 2022. https://pytorch.org/blog/accelerating-pytorch-vision-models-with-channels-last-on-cpu/
7. NumPy Developers. "Broadcasting." *NumPy v2.4 Manual*. https://numpy.org/doc/stable/user/basics.broadcasting.html
8. PyTorch Contributors. "torch.Tensor.view" and "torch.reshape." *PyTorch Documentation*. https://docs.pytorch.org/docs/stable/generated/torch.Tensor.view.html
9. Lukas Brun. "Efficient LLM Pretraining: Packed Sequences and Masked Attention." *Hugging Face Blog*. https://huggingface.co/blog/sirluk/llm-sequence-packing

