# Shape (Tensor)

> Source: https://aiwiki.ai/wiki/shape_tensor
> Updated: 2026-06-28
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms), [Tensor](/wiki/tensor), [Tensor size](/wiki/tensor_size)*

In machine learning, the **shape** of a [tensor](/wiki/tensor) is the tuple of integers giving the size of the tensor along each of its axes. For a tensor of rank `r`, the shape is written `(d_1, d_2, ..., d_r)`, where each `d_i` is the number of elements stored along axis `i`. For example, a batch of 32 RGB images that are 224 pixels tall and 224 wide, stored channels-first, has shape `(32, 3, 224, 224)`. Shape is one of the two metadata fields (alongside data type) that almost every deep learning framework attaches to a tensor, and shape mismatches are among the most common sources of runtime errors in model code.

A tensor's shape determines the total number of elements (`prod(shape)`), the legal operations that can be applied to it (matrix multiplication, convolution, broadcasting), and the memory layout used by the underlying buffer. Frameworks such as [NumPy](/wiki/numpy), [PyTorch](/wiki/pytorch), [TensorFlow](/wiki/tensorflow), and [JAX](/wiki/jax) all expose a `.shape` attribute on their array types, but they differ in whether the shape is fixed at trace time or known only at runtime, and in the conventions they use for image, sequence, and audio data.
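The relationship between shape, element count, and buffer size can be checked directly. The following is a minimal NumPy sketch; the variable names are illustrative and the batch dimensions are taken from the example above.

```python
import math
import numpy as np

# A batch of 32 RGB images, 224 x 224, stored channels-first.
images = np.zeros((32, 3, 224, 224), dtype=np.float32)

shape = images.shape                  # tuple of per-axis sizes
num_elements = math.prod(shape)       # total element count, prod(shape)
num_bytes = num_elements * images.dtype.itemsize

print(shape)          # (32, 3, 224, 224)
print(images.ndim)    # 4
print(num_elements)   # 4816896
print(num_bytes)      # 19267584
```

NumPy exposes the same two quantities as `images.size` and `images.nbytes`.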

## What is the shape of a tensor?

A shape is an ordered tuple of non-negative integers. The length of the tuple is called the **rank** (also **ndim**, the number of dimensions, or the number of axes), and each entry is the size of the tensor along the corresponding axis. The NumPy v2.4 manual defines `ndarray.shape` as a "Tuple of array dimensions" and notes that "The shape property is usually used to get the current shape of an array, but may also be used to reshape the array in-place by assigning a tuple of array dimensions to it."[1] In [PyTorch](/wiki/pytorch), `torch.Tensor.size()` returns a `torch.Size` object that the documentation describes as "the result type of a call to torch.Tensor.size(). It describes the size of all dimensions of the original tensor"; because `torch.Size` is a subclass of `tuple`, it "supports common sequence operations like indexing and length."[2] The equivalent attribute is `tensor.shape`, and the two are interchangeable.[2]

The vocabulary varies between communities. Mathematicians and physicists tend to call the rank the *order* of the tensor and call individual axes *modes*; deep learning practitioners usually say *rank*, *ndim*, or just "number of dimensions," and call individual axes *axes* or *dims*. The numeric values inside the shape tuple are sometimes called *dimension sizes*, *extents*, or simply *dims*.

## What is the rank of a tensor?

Rank refers to how many axes a tensor has, not how big each axis is. A 1000-element vector has the same rank (1) as a 3-element vector. The table below collects the canonical examples.

| Rank | Object | Example shape | Notes |
|------|--------|---------------|-------|
| 0 | Scalar | `()` | A single number; NumPy reports the empty tuple |
| 1 | Vector | `(n,)` | A 1D array of length `n` |
| 2 | Matrix | `(m, n)` | `m` rows, `n` columns |
| 3 | 3D tensor | `(d, h, w)` | Often a single image with `d` channels |
| 4 | Image batch | `(N, C, H, W)` or `(N, H, W, C)` | NCHW (PyTorch) vs NHWC (TensorFlow) |
| 5 | Video batch | `(N, T, C, H, W)` | Adds a time axis `T` |

The trailing comma in `(n,)` is Python syntax to disambiguate a one-element tuple from a parenthesized expression; `(4)` is the integer 4, while `(4,)` is a tuple of length one. A scalar produced by a NumPy reduction (`arr.sum()`) has shape `()`, an empty tuple, which is distinct from a length-1 vector with shape `(1,)`.
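The distinctions in this section are easy to confirm interactively. A short NumPy sketch, with illustrative variable names:

```python
import numpy as np

scalar = np.array(3.0)               # rank 0
vector = np.array([1.0, 2.0, 3.0])   # rank 1
matrix = np.zeros((2, 3))            # rank 2

print(scalar.shape, scalar.ndim)     # () 0
print(vector.shape, vector.ndim)     # (3,) 1
print(matrix.shape, matrix.ndim)     # (2, 3) 2

# A full reduction returns a scalar with the empty shape,
# which is not the same as a length-1 vector.
total = vector.sum()
one_element = np.array([6.0])
print(total.shape)                   # ()
print(one_element.shape)             # (1,)

# The trailing comma is what makes a one-element tuple.
print(type((4)).__name__)            # int
print(type((4,)).__name__)           # tuple
```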

## How do frameworks expose tensor shape?

Different frameworks expose shape through slightly different APIs and with different semantics around when the shape is known.

| Framework | Shape access | Type returned | Notes |
|-----------|--------------|---------------|-------|
| [NumPy](/wiki/numpy) | `arr.shape` | `tuple` of `int` | Always concrete; assigning to `shape` reshapes in place but is discouraged in favor of `reshape`[1] |
| [PyTorch](/wiki/pytorch) | `t.shape` or `t.size()` | `torch.Size` (tuple subclass) | `t.size(dim)` returns the size of one axis as `int`[2] |
| [TensorFlow](/wiki/tensorflow) | `t.shape` (static) and `tf.shape(t)` (dynamic) | `TensorShape` and 1-D `int32` `Tensor` | Static may contain `None` for unknown dims; dynamic is always concrete at run time[3] |
| [JAX](/wiki/jax) | `x.shape` (on a `jax.Array`) | `tuple` of `int` | Inside `jit`, shapes must be concrete at trace time; varying shape recompiles[4] |

In TensorFlow, the static shape returned by `tensor.shape` is the shape inferred during graph construction. It can be partially known, with `None` standing in for axes whose size depends on runtime data (typical for the batch axis or a variable sequence length). The dynamic shape returned by `tf.shape(tensor)` is a 1-D `int32` tensor that is always fully known at execution time and that can be fed into other ops.[3] The two forms are equivalent only when the static shape is fully defined.

JAX takes a stricter line. Inside a `jax.jit`-compiled function, all array shapes must be static at trace time, because the function is lowered to StableHLO and compiled separately for each combination of input shapes and dtypes; the JAX documentation states that "the function will be recompiled whenever the input data type or shape is changed."[4] Calling the same jitted function with a new input shape triggers a recompilation. Practitioners usually pad inputs to a small set of fixed sizes or use `static_argnums` for shape-dependent constants. JAX's *shape polymorphism* feature relaxes this for export, allowing "some exported functions to be used for a whole family of input shapes" by introducing symbolic dimension variables.[4]
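The padding strategy mentioned above can be sketched without any JAX-specific code. In the NumPy example below, `BUCKETS` and `pad_to_bucket` are hypothetical names chosen for illustration, and the bucket sizes are arbitrary. The idea is that a compiled function fed only padded inputs would see at most `len(BUCKETS)` distinct shapes.

```python
import numpy as np

BUCKETS = (64, 128, 256, 512)  # hypothetical set of allowed lengths

def pad_to_bucket(tokens, buckets=BUCKETS, pad_value=0):
    """Pad a 1-D array up to the smallest bucket length that fits it.

    Returns the padded array and the original length, which a caller
    would typically use to build a mask over the padded positions.
    """
    length = tokens.shape[0]
    for size in buckets:
        if length <= size:
            padded = np.full((size,), pad_value, dtype=tokens.dtype)
            padded[:length] = tokens
            return padded, length
    raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")

lengths = [5, 60, 70, 100, 129, 300]
padded_shapes = {
    pad_to_bucket(np.ones(n, dtype=np.int32))[0].shape for n in lengths
}
print(sorted(padded_shapes))  # [(64,), (128,), (256,), (512,)]
```

Six different input lengths collapse to four shapes here. The trade-off is wasted compute on padded positions, which grows as the buckets get coarser.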

## What is the difference between NCHW and NHWC?

Four-dimensional image tensors come in two competing layouts. **NCHW** (channels-first) puts the channel axis right after the batch axis: `(N, C, H, W)`. **NHWC** (channels-last) puts channels at the end: `(N, H, W, C)`. PyTorch defaults to NCHW; TensorFlow and Keras default to NHWC; ONNX uses NCHW; Apple's Metal Performance Shaders prefer NHWC; cuDNN supports both.

| Layout | Convention | Default in | Memory locality |
|--------|-----------|-----------|-----------------|
| NCHW | Channels-first | PyTorch, ONNX, Caffe | Spatial pixels of one channel are contiguous |
| NHWC | Channels-last | TensorFlow, Keras, Apple MPS | All channels of one pixel are contiguous |

The choice is not just cosmetic. NHWC is often faster on modern hardware because Tensor Cores and vector instructions read across channels, and convolution kernels written for the cuDNN `NHWC` path avoid the layout conversion that the `NCHW` path performs internally. The official PyTorch tutorial, which defines the channels-last format as "an alternative way of ordering NCHW tensors in memory preserving dimensions ordering," reports "8%-35% performance gains on Volta devices" and notes that "We were able to archive over 22% performance gains with channels last comparing to contiguous format" on ResNet-50 with mixed-precision training, via `tensor.to(memory_format=torch.channels_last)`.[5] A separate PyTorch CPU blog reports 1.3 to 1.8 times higher throughput for vision models with channels-last on Ice Lake and newer Intel CPUs.[6]

In PyTorch, channels-last is implemented as a memory format on a 4D NCHW tensor rather than as a different shape. The strides change from `(C*H*W, H*W, W, 1)` to `(C*H*W, 1, W*C, C)`, but `tensor.shape` still reports `(N, C, H, W)`. To switch layouts entirely, use [`permute`](/wiki/transpose) (`x.permute(0, 2, 3, 1)`); in TensorFlow use `tf.transpose(x, [0, 2, 3, 1])`.
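The stride pattern can be reproduced in NumPy as an analogy. This sketch does not use PyTorch's memory-format API; it stores data physically as NHWC and then views it with NCHW axis order, which is an assumption about how best to mimic the effect. `element_strides` is a hypothetical helper that converts NumPy's byte strides to element strides.

```python
import numpy as np

N, C, H, W = 2, 3, 4, 5

# Contiguous NCHW buffer.
nchw = np.zeros((N, C, H, W), dtype=np.float32)

# Data laid out physically as NHWC, then viewed with NCHW axis order.
nhwc_buffer = np.zeros((N, H, W, C), dtype=np.float32)
channels_last_view = nhwc_buffer.transpose(0, 3, 1, 2)

def element_strides(a):
    """Strides measured in elements rather than bytes."""
    return tuple(s // a.itemsize for s in a.strides)

print(nchw.shape, element_strides(nchw))
# (2, 3, 4, 5) (60, 20, 5, 1)      i.e. (C*H*W, H*W, W, 1)
print(channels_last_view.shape, element_strides(channels_last_view))
# (2, 3, 4, 5) (60, 1, 15, 3)      i.e. (C*H*W, 1, W*C, C)
```

Both arrays report the same shape; only the strides reveal which layout the underlying buffer uses.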

## What are common tensor shape conventions?

Most deep learning code follows a small set of canonical shape conventions, summarized below.

| Domain | Typical shape | Meaning of axes |
|--------|---------------|-----------------|
| Tabular | `(N, F)` | `N` rows, `F` features |
| Image batch (PyTorch) | `(N, C, H, W)` | batch, channels, height, width |
| Image batch (TF/Keras) | `(N, H, W, C)` | batch, height, width, channels |
| Token IDs | `(B, T)` | batch, sequence length |
| Token embeddings | `(B, T, D)` | batch, sequence, hidden dim |
| Attention scores | `(B, H, T, T)` | batch, heads, query len, key len |
| Audio waveform | `(B, C, L)` or `(B, L, C)` | batch, channels, samples |
| Mel spectrogram | `(B, C, F, T)` | batch, channels, frequency, time |
| Video | `(B, T, C, H, W)` | batch, frames, channels, height, width |
| Point cloud | `(B, N, 3)` | batch, points, xyz |

The leading axis is almost always the [batch size](/wiki/batch_size) `N` or `B`; this matters because most operators (linear layers, convolutions, attention) treat the first axis as independent samples that can be parallelized. Batch normalization is a notable exception, since it computes statistics across the batch axis.

## How do broadcasting rules work?

Broadcasting is the rule that lets two arrays with different shapes participate in an element-wise operation. NumPy, PyTorch, TensorFlow, and JAX all share the same rules, originally specified by NumPy.[7] The NumPy v2.4 manual states the rule precisely: "When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing (i.e. rightmost) dimension and works its way left. Two dimensions are compatible when 1. they are equal, or 2. one of them is 1."[7]

The shapes are aligned from the right (trailing axis first), padding the shorter shape with leading 1s. Two aligned axes are compatible if they are equal, or if one of them is 1. The broadcast result takes the per-axis maximum.
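The alignment rule is short enough to write out in plain Python. The function below is an illustrative sketch, not the code any framework uses; `broadcast_shape` is a hypothetical name. NumPy provides a built-in counterpart, `np.broadcast_shapes`.

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    """Return the broadcast of shapes a and b, or raise ValueError."""
    result = []
    # Walk both shapes from the trailing axis; a missing axis counts as size 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or y == 1:
            result.append(x)
        elif x == 1:
            result.append(y)
        else:
            raise ValueError(f"shapes {a} and {b} are not broadcast-compatible")
    return tuple(reversed(result))

print(broadcast_shape((3, 4), ()))               # (3, 4)
print(broadcast_shape((3, 4), (4,)))             # (3, 4)
print(broadcast_shape((3, 1), (1, 4)))           # (3, 4)
print(broadcast_shape((8, 1, 6, 1), (7, 1, 5)))  # (8, 7, 6, 5)

try:
    broadcast_shape((3, 4), (3,))
except ValueError as err:
    print(err)  # shapes (3, 4) and (3,) are not broadcast-compatible
```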

| Operation | Left shape | Right shape | Aligned | Result |
|-----------|-----------|-------------|---------|--------|
| Add scalar | `(3, 4)` | `()` | `(3, 4)` vs `(1, 1)` | `(3, 4)` |
| Add row | `(3, 4)` | `(4,)` | `(3, 4)` vs `(1, 4)` | `(3, 4)` |
| Add column | `(3, 4)` | `(3, 1)` | `(3, 4)` vs `(3, 1)` | `(3, 4)` |
| Outer product | `(3, 1)` | `(1, 4)` | `(3, 1)` vs `(1, 4)` | `(3, 4)` |
| Mismatch | `(3, 4)` | `(3,)` | `(3, 4)` vs `(1, 3)` | `ValueError` |
| 4D vs 3D | `(8, 1, 6, 1)` | `(7, 1, 5)` | `(8, 1, 6, 1)` vs `(1, 7, 1, 5)` | `(8, 7, 6, 5)` |

The last row is the canonical example from the NumPy manual.[7] Broadcasting never copies the input data; it conceptually stretches a size-1 axis by reusing the same memory through a stride of 0. As the NumPy documentation puts it, "The stretching analogy is only conceptual. NumPy is smart enough to use the original scalar value without actually making copies so that broadcasting operations are as memory and computationally efficient as possible."[7] This is why broadcasting is cheap, and why bugs caused by accidental broadcasting (a `(1000, 1)` vs `(1000,)` mistake that silently produces a `(1000, 1000)` result) can be hard to spot.
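The stride-0 mechanism is visible through `np.broadcast_to`, which returns the stretched view explicitly. A minimal sketch with illustrative variable names; strides are reported in bytes, and `float32` elements are 4 bytes each.

```python
import numpy as np

row = np.arange(4, dtype=np.float32)        # shape (4,)
stretched = np.broadcast_to(row, (3, 4))    # shape (3, 4), no copy

print(stretched.shape)                   # (3, 4)
print(stretched.strides)                 # (0, 4)
print(np.shares_memory(row, stretched))  # True
print(stretched.flags.writeable)         # False
```

The leading stride of 0 means that stepping to the next row does not move through memory at all, so all three rows read the same four values.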

## What is the difference between static and dynamic shapes?

The term *static shape* describes a shape known at compile time, before any data has been seen. *Dynamic shape* describes a shape known only at run time. The distinction matters because compilers (XLA, TorchInductor, ONNX Runtime, TensorRT) generally produce faster code when shapes are static, since they can fuse operations, allocate buffers, and pick kernels for specific dimensions ahead of time.

In the eager mode used by default in PyTorch and TensorFlow 2, every shape is dynamic in the sense that it is computed each time the program runs. In graph or JIT mode the picture changes:

- `torch.compile` and TorchScript can specialize on shapes, but PyTorch 2 also supports *dynamic shapes* through symbolic reasoning so that variable batch or sequence lengths do not force a recompile.
- TensorFlow's `tf.function` traces a `ConcreteFunction` per input signature; passing a tensor with a new shape may trigger retracing unless the input signature uses `None` for the variable axis.
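- JAX's `jit` recompiles for each new input shape. Marking an argument with `static_argnums` makes it a compile-time constant (so a new value also triggers a recompile), while shape polymorphism in `jax.export` lets one exported function serve a family of input shapes.[4]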
- TensorRT and ONNX Runtime expose explicit *dynamic axes* in the model signature so that one engine handles a range of input sizes.

A practical consequence is that ML engineers spend a noticeable fraction of debugging time on shape specialization: a model that runs fine on a fixed batch of 32 may recompile (or crash) when given a batch of 31 at the end of an epoch.
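The end-of-epoch case is simple arithmetic on the leading axis. The helper below is a hypothetical illustration in plain Python; its `drop_last` parameter is named after the similarly named data-loader option in PyTorch, but the function itself belongs to no library.

```python
def batch_sizes(num_samples, batch_size, drop_last=False):
    """Sizes of the leading axis for each batch in one epoch."""
    full, remainder = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    if remainder and not drop_last:
        sizes.append(remainder)
    return sizes

sizes = batch_sizes(1023, 32)
print(len(sizes), sizes[-1], sorted(set(sizes)))  # 32 31 [31, 32]

sizes_dropped = batch_sizes(1023, 32, drop_last=True)
print(len(sizes_dropped), sorted(set(sizes_dropped)))  # 31 [32]
```

With 1023 samples, a shape-specialized model sees two distinct batch shapes per epoch unless the short final batch is dropped or padded.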

## How does reshaping work?

The table below lists the operations every deep learning practitioner uses to bend shapes into the form an operator expects.

| Operation | What it does | Constraints |
|-----------|--------------|-------------|
| [reshape](/wiki/reshape) | Returns a tensor with a new shape and the same total elements | `prod(new_shape) == prod(old_shape)`; one entry may be `-1` to be inferred |
| `view` (PyTorch) | Like `reshape`, but returns a view sharing memory | Tensor must be contiguous in the requested layout |
| `flatten` | Collapses a range of axes into one | Equivalent to `reshape` with the right product |
| `squeeze` | Removes axes of size 1 | Optionally restricted to a single axis |
| `unsqueeze` (PyTorch) / `expand_dims` (NumPy, TF) | Inserts a new axis of size 1 | New rank is `r + 1` |
| [`transpose`](/wiki/transpose) | Swaps two axes | Returns a view; result is usually non-contiguous |
| `permute` (PyTorch) | Reorders all axes by a permutation | Returns a view; usually non-contiguous |
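| `expand` (PyTorch) / `broadcast_to` (NumPy, TF) | Makes a size-1 axis appear larger | PyTorch and NumPy return a stride-0 view with no data copy (NumPy's view is read-only; in-place writes to an expanded PyTorch tensor are unsafe because elements alias); `tf.broadcast_to` materializes the full tensor |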
| `repeat` (PyTorch) / `tile` (NumPy, TF) | Actually copies data along an axis | Allocates new memory |
| `stack` | Concatenates along a new axis | All inputs must share the same shape |
| `concat` / `cat` | Concatenates along an existing axis | All inputs must match on every axis except the one being concatenated |

A key convenience is that one entry of a reshape target may be set to `-1`, and the framework infers it. The NumPy manual specifies that for `reshape`, "One shape dimension can be -1. In this case, the value is inferred from the length of the array and remaining dimensions"; only one dimension may be left unspecified.[10] For example, `np.reshape(a, (3, -1))` on a 6-element array yields shape `(3, 2)`.[10] In PyTorch, `view` requires the requested shape to be compatible with the tensor's existing strides (a contiguous tensor always qualifies), while `reshape` falls back to a copy when it cannot return a view. Calling `tensor.contiguous()` materializes a fresh contiguous copy when needed, typically after a `transpose` or `permute`.[8]
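Both behaviors, `-1` inference and the fallback from view to copy, can be observed in NumPy, whose `reshape` follows the same view-when-possible policy. This sketch uses NumPy only and makes no claim about PyTorch internals; `np.ascontiguousarray` plays roughly the role that `.contiguous()` plays in PyTorch.

```python
import numpy as np

a = np.arange(6)
b = np.reshape(a, (3, -1))            # -1 is inferred as 2
print(b.shape)                        # (3, 2)
print(np.shares_memory(a, b))         # True: reshape returned a view

t = b.T                               # shape (2, 3), a non-contiguous view
print(t.flags["C_CONTIGUOUS"])        # False
flat = t.reshape(-1)                  # no valid view exists, so NumPy copies
print(np.shares_memory(t, flat))      # False
print(flat.tolist())                  # [0, 2, 4, 1, 3, 5]

c = np.ascontiguousarray(t)           # explicit contiguous copy
print(c.flags["C_CONTIGUOUS"])        # True
```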

## What causes shape errors?

Shape errors fall into a small number of recurring patterns:

- Matrix multiplication mismatch. [`matmul`](/wiki/matmul) requires the inner dimensions to agree: `(B, M, K) @ (B, K, N)` is valid, `(B, M, K) @ (B, M, N)` is not.
- Forgetting the batch axis at inference. A model trained on `(B, C, H, W)` rejects a single image of shape `(C, H, W)`; the fix is `image.unsqueeze(0)`.
- Wrong channel order. Loading an image with PIL (`(H, W, C)`) and feeding it directly into a PyTorch model that expects `(C, H, W)` triggers a [`convolution`](/wiki/convolution) shape error or, worse, silently treats height as channels.
- Axis-order confusion with sequence axes. Some libraries default to `(T, B, D)` (sequence first), others to `(B, T, D)`. The PyTorch RNN modules still default to sequence-first (`batch_first=False`), although passing `batch_first=True` is now common.
- Broadcasting accidents. Subtracting a `(N,)` row from an `(N, 1)` column produces an `(N, N)` result by broadcasting, which is rarely what the author meant.
- `transpose` followed by `view`. `view` fails after `transpose` because the result is non-contiguous; use `reshape` or call `.contiguous()` first.

## Why does tensor shape affect performance?

Shape interacts with performance in several non-obvious ways. Memory layout, set by both the shape and the strides, decides whether a kernel can use vectorized loads or must gather scattered elements. The official PyTorch channels-last guide reports significant speedups precisely because the new layout makes the convolution inner loop a streaming load over channels.[5][6]

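Tensor Cores on NVIDIA GPUs (Volta, Turing, Ampere, Hopper) operate on tiles of 16x16 or larger, so matrix dimensions that are not multiples of 8 or 16 either fall back to slower paths or pad internally. The cuDNN documentation and several training playbooks recommend rounding hidden sizes, vocabulary sizes, and batch sizes to multiples of 64 or 128 for the same reason. NVIDIA's performance guides describe two related effects: *tile quantization*, where a matrix dimension is not divisible by the thread-block tile size and part of the last tile does wasted work, and *wave quantization*, the throughput cliff that appears when the number of tiles is just above a multiple of the number of streaming multiprocessors, so that a nearly empty final wave takes as long as a full one. Both are essentially shape-rounding effects.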
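Rounding a dimension up to a hardware-friendly multiple is a one-line integer calculation. `round_up` is a hypothetical helper name, and the sizes below are illustrative inputs rather than recommendations.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is greater than or equal to value."""
    return ((value + multiple - 1) // multiple) * multiple

print(round_up(50257, 64))   # 50304
print(round_up(1000, 128))   # 1024
print(round_up(4096, 128))   # 4096 (already aligned, unchanged)
```

Padding a vocabulary or hidden size this way adds a small number of unused rows or columns in exchange for aligned matrix dimensions.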

For variable-length text or audio batches, padding to the longest sequence wastes compute on the padded positions, while *bucketing* groups similar-length samples and *packing* (also called *example packing*) concatenates several short samples into one long sequence with an attention mask that forbids cross-example attention.[9] Modern LLM training and serving stacks (Megatron, NeMo, vLLM, FlashAttention with variable length) rely on packed or variable-length sequences to keep the GPUs saturated.
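Packing and its attention mask can be sketched in NumPy. This is a simplified illustration, not the implementation used by any of the libraries named above; `pack_sequences` and `segment_ids` are hypothetical names, and a real decoder would combine this mask with a causal mask.

```python
import numpy as np

def pack_sequences(sequences):
    """Concatenate 1-D sequences and build a mask blocking cross-example attention."""
    tokens = np.concatenate(sequences)
    # segment_ids[i] is the index of the example that token i came from.
    segment_ids = np.concatenate(
        [np.full(len(seq), i) for i, seq in enumerate(sequences)]
    )
    # mask[q, k] is True only when query q and key k share an example.
    mask = segment_ids[:, None] == segment_ids[None, :]
    return tokens, segment_ids, mask

seqs = [np.array([11, 12, 13]), np.array([21, 22]), np.array([31])]
tokens, segment_ids, mask = pack_sequences(seqs)

print(tokens.shape, mask.shape)   # (6,) (6, 6)
print(segment_ids.tolist())       # [0, 0, 0, 1, 1, 2]
print(int(mask.sum()))            # 14
```

The mask is block-diagonal: the 14 allowed positions are the 3x3, 2x2, and 1x1 blocks belonging to the three original examples.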

## How does shape matter for transformers and LLMs?

In the transformer era, sequence length has become the dominant variable axis. A typical decoder-only LLM forward pass works on a `(B, T)` token tensor that becomes `(B, T, D)` after embedding, `(B, H, T, D_head)` inside the attention head split, and `(B, H, T, T)` for the attention score matrix. The quadratic `T*T` term is the reason context-length growth is expensive, and the reason FlashAttention and ring attention focus on streaming the score matrix without materializing it.
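The shape sequence described above can be traced with random data. This NumPy sketch omits the learned query and key projections and reuses one tensor for both, purely to keep the shape bookkeeping visible; all sizes are illustrative.

```python
import numpy as np

B, T, D, H = 2, 10, 64, 4          # batch, sequence, hidden, heads
D_head = D // H                    # 16
vocab = 100

rng = np.random.default_rng(0)
token_ids = rng.integers(0, vocab, size=(B, T))         # (B, T)
embedding = rng.standard_normal((vocab, D))
x = embedding[token_ids]                                # (B, T, D)

# Split the hidden axis into heads, then move heads ahead of the sequence axis.
q = x.reshape(B, T, H, D_head).transpose(0, 2, 1, 3)    # (B, H, T, D_head)
k = q                                                   # projections omitted
scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(D_head)  # (B, H, T, T)

print(token_ids.shape)   # (2, 10)
print(x.shape)           # (2, 10, 64)
print(q.shape)           # (2, 4, 10, 16)
print(scores.shape)      # (2, 4, 10, 10)
```

The score tensor holds `B * H * T * T` elements, so doubling `T` quadruples its size while the other tensors only double.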

Mixture-of-experts models add a routing dimension that changes shape per token: each expert receives a `(K, D)` slice of the batch where `K` is the number of tokens routed to it, which varies. Diffusion models and video generators add a denoising-step axis or a frame axis. Multimodal models stack image patches, audio frames, and text tokens into a single shared sequence, and the bookkeeping for which axis means what is now a major part of model engineering.

The shape of a tensor was once a quiet implementation detail. In production deep learning it has become an interface contract: the type signature of every operator, the unit of test coverage, and often the difference between a 10-millisecond and a 100-millisecond inference call.

## See also

- [Tensor](/wiki/tensor)
- [Tensor size](/wiki/tensor_size)
- [Broadcasting](/wiki/broadcasting)
- [Reshape](/wiki/reshape)
- [Transpose](/wiki/transpose)
- [Matmul](/wiki/matmul)
- [Convolution](/wiki/convolution)
- [Batch size](/wiki/batch_size)
- [Channels first](/wiki/channels_first)
- [Channels last](/wiki/channels_last)
- [NumPy](/wiki/numpy)
- [PyTorch](/wiki/pytorch)
- [TensorFlow](/wiki/tensorflow)
- [JAX](/wiki/jax)

## Explain like I'm 5

A tensor is a box of numbers. Its shape is the list of how many numbers fit along each side of the box. A flat row of 5 numbers has shape `(5,)`. A grid with 3 rows and 4 columns has shape `(3, 4)`. A stack of 10 such grids has shape `(10, 3, 4)`. When two boxes are different shapes, the computer sometimes lets you add them anyway by pretending the smaller one is repeated along the missing sides; this trick is called broadcasting. Many bugs in machine learning code are about boxes that did not line up the way you thought they did.

## References

1. NumPy Developers. "numpy.ndarray.shape." *NumPy v2.4 Manual*. https://numpy.org/doc/stable/reference/generated/numpy.ndarray.shape.html
2. PyTorch Contributors. "torch.Size" and "torch.Tensor.size." *PyTorch Documentation*. https://docs.pytorch.org/docs/stable/size.html and https://docs.pytorch.org/docs/stable/generated/torch.Tensor.size.html
3. TensorFlow Authors. "tf.shape." *TensorFlow API Documentation*. https://www.tensorflow.org/api_docs/python/tf/shape
4. JAX Authors. "Shape polymorphism." *JAX Documentation*. https://docs.jax.dev/en/latest/export/shape_poly.html
5. Vitaly Fedyunin. "Channels Last Memory Format in PyTorch." *PyTorch Tutorials*. https://docs.pytorch.org/tutorials/intermediate/memory_format_tutorial.html
6. Mingfei Ma et al. "Accelerating PyTorch Vision Models with Channels Last on CPU." *PyTorch Blog*, August 2022. https://pytorch.org/blog/accelerating-pytorch-vision-models-with-channels-last-on-cpu/
7. NumPy Developers. "Broadcasting." *NumPy v2.4 Manual*. https://numpy.org/doc/stable/user/basics.broadcasting.html
8. PyTorch Contributors. "torch.Tensor.view" and "torch.reshape." *PyTorch Documentation*. https://docs.pytorch.org/docs/stable/generated/torch.Tensor.view.html
9. Lukas Brun. "Efficient LLM Pretraining: Packed Sequences and Masked Attention." *Hugging Face Blog*. https://huggingface.co/blog/sirluk/llm-sequence-packing
10. NumPy Developers. "numpy.reshape." *NumPy v2.4 Manual*. https://numpy.org/doc/stable/reference/generated/numpy.reshape.html