See also: Machine learning terms, Tensor, Tensor size
In machine learning, the shape of a tensor is the tuple of integers giving the size of the tensor along each of its axes. For a tensor of rank r, the shape is written (d_1, d_2, ..., d_r), where each d_i is the number of elements stored along axis i. Shape is one of the two metadata fields (alongside data type) that almost every deep learning framework attaches to a tensor, and shape mismatches are among the most common sources of runtime errors in model code.
A tensor's shape determines the total number of elements (prod(shape)), the legal operations that can be applied to it (matrix multiplication, convolution, broadcasting), and the memory layout used by the underlying buffer. Frameworks such as NumPy, PyTorch, TensorFlow, and JAX all expose a .shape attribute on their array types, but they differ in whether the shape is fixed at trace time or known only at runtime, and in the conventions they use for image, sequence, and audio data.
A shape is an ordered tuple of non-negative integers. The length of the tuple is called the rank (also ndim, the number of dimensions, or the number of axes), and each entry is the size of the tensor along the corresponding axis. NumPy's documentation defines ndarray.shape as "a tuple of array dimensions" whose length equals the array's number of dimensions.[1] PyTorch's torch.Tensor.size() returns a torch.Size object, a subclass of tuple; the equivalent attribute is tensor.shape, and the two are interchangeable.[2]
The vocabulary varies between communities. Mathematicians and physicists tend to call the rank the order of the tensor and call individual axes modes; deep learning practitioners usually say rank, ndim, or just "number of dimensions," and call individual axes axes or dims. The numeric values inside the shape tuple are sometimes called dimension sizes, extents, or simply dims.
Rank refers to how many axes a tensor has, not how big each axis is. A 1000-element vector has the same rank (1) as a 3-element vector. The table below collects the canonical examples.
| Rank | Object | Example shape | Notes |
|---|---|---|---|
| 0 | Scalar | () | A single number; NumPy reports the empty tuple |
| 1 | Vector | (n,) | A 1D array of length n |
| 2 | Matrix | (m, n) | m rows, n columns |
| 3 | 3D tensor | (d, h, w) | Often a single image with d channels |
| 4 | Image batch | (N, C, H, W) or (N, H, W, C) | NCHW (PyTorch) vs NHWC (TensorFlow) |
| 5 | Video batch | (N, T, C, H, W) | Adds a time axis T |
The trailing comma in (n,) is Python syntax to disambiguate a one-element tuple from a parenthesized expression; (4) is the integer 4, while (4,) is a tuple of length one. A scalar produced by a NumPy reduction (arr.sum()) has shape (), an empty tuple, which is distinct from a length-1 vector with shape (1,).
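A minimal NumPy sketch of these distinctions (the array values are arbitrary):

```python
import numpy as np

x = np.array(4.0)                # rank-0 scalar
v = np.array([1.0, 2.0, 3.0, 4.0])
m = np.zeros((3, 4))

print(x.shape, x.ndim)           # () 0   -- empty shape tuple
print(v.shape, v.ndim)           # (4,) 1 -- note the trailing comma
print(m.shape, m.ndim)           # (3, 4) 2

s = v.sum()                      # reduction produces a scalar
print(np.shape(s))               # () -- distinct from np.zeros((1,)).shape == (1,)
```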
Different frameworks expose shape through slightly different APIs and with different semantics around when the shape is known.
| Framework | Shape access | Type returned | Notes |
|---|---|---|---|
| NumPy | arr.shape | tuple of int | Always concrete; assigning to shape reshapes in place but is discouraged in favor of reshape[1] |
| PyTorch | t.shape or t.size() | torch.Size (tuple subclass) | t.size(dim) returns the size of one axis as int[2] |
| TensorFlow | t.shape (static) and tf.shape(t) (dynamic) | TensorShape and 1-D int32 Tensor | Static may contain None for unknown dims; dynamic is always concrete at run time[3] |
| JAX | arr.shape (on a jax.Array) | tuple of int | Inside jit, shapes must be concrete at trace time; a new input shape triggers recompilation[4] |
In TensorFlow, the static shape returned by tensor.shape is the shape inferred during graph construction. It can be partially known, with None standing in for axes whose size depends on runtime data (typical for the batch axis or a variable sequence length). The dynamic shape returned by tf.shape(tensor) is a 1-D int32 tensor that is always fully known at execution time and that can be fed into other ops.[3] The two forms are equivalent only when the static shape is fully defined.
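A small sketch of the two access paths, assuming TensorFlow 2 (the None batch axis is the typical case of a partially known static shape):

```python
import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec(shape=[None, 128], dtype=tf.float32)])
def model(x):
    static = x.shape          # TensorShape([None, 128]); batch size unknown at trace time
    dynamic = tf.shape(x)     # 1-D int32 tensor, fully known when the function executes
    # Use the dynamic shape when an axis size feeds another op:
    return tf.reshape(x, [dynamic[0], 4, 32])

out = model(tf.zeros([7, 128]))   # runs for any batch size without retracing
print(out.shape)                  # (7, 4, 32)
```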
JAX takes a stricter line. Inside a jax.jit-compiled function, all array shapes must be static at trace time, because the function is lowered to StableHLO and compiled separately for each combination of input shapes and dtypes. Calling the same jitted function with a new input shape triggers a recompilation. Practitioners usually pad inputs to a small set of fixed sizes or use static_argnums for shape-dependent constants. JAX's shape polymorphism feature relaxes this for export, allowing a single compiled artifact to handle a family of shapes.[4]
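A sketch of the JAX behaviour (the shapes are illustrative; the print statement only fires while the function is being traced, which makes recompilation visible):

```python
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    # This Python body runs once per (shape, dtype) combination, at trace time.
    print("tracing for shape", x.shape)
    return (x * 2.0).sum()

f(jnp.ones((32, 128)))   # prints "tracing for shape (32, 128)", then compiles
f(jnp.ones((32, 128)))   # cached: no tracing, no compilation
f(jnp.ones((31, 128)))   # new shape -> traced and compiled again
```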
Four-dimensional image tensors come in two competing layouts. NCHW (channels-first) puts the channel axis right after the batch axis: (N, C, H, W). NHWC (channels-last) puts channels at the end: (N, H, W, C). PyTorch defaults to NCHW; TensorFlow and Keras default to NHWC; ONNX uses NCHW; Apple's Metal Performance Shaders prefer NHWC; cuDNN supports both.
| Layout | Convention | Default in | Memory locality |
|---|---|---|---|
| NCHW | Channels-first | PyTorch, ONNX, Caffe | Spatial pixels of one channel are contiguous |
| NHWC | Channels-last | TensorFlow, Keras, Apple MPS | All channels of one pixel are contiguous |
The choice is not just cosmetic. NHWC is often faster on modern hardware because Tensor Cores and vector instructions read across channels, and convolution kernels written for the cuDNN NHWC path avoid the layout conversion that the NCHW path performs internally. The official PyTorch tutorial reports 8 to 35 percent speedups on Volta GPUs and over 22 percent on ResNet-50 with mixed-precision training when using tensor.to(memory_format=torch.channels_last).[5] Intel's CPU benchmarks for vision models show 1.3 to 1.8 times higher throughput with channels-last on Ice Lake and newer CPUs.[6]
In PyTorch, channels-last is implemented as a memory format on a 4D NCHW tensor rather than as a different shape. The strides change from (C*H*W, H*W, W, 1) to (C*H*W, 1, W*C, C), but tensor.shape still reports (N, C, H, W). To switch layouts entirely, use permute (x.permute(0, 2, 3, 1)); in TensorFlow use tf.transpose(x, [0, 2, 3, 1]).
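A short PyTorch sketch of the difference between changing the memory format and changing the shape:

```python
import torch

x = torch.randn(8, 3, 224, 224)               # NCHW shape, contiguous layout
print(x.shape, x.stride())                     # [8, 3, 224, 224], strides (150528, 50176, 224, 1)

cl = x.to(memory_format=torch.channels_last)   # same reported shape, NHWC-style strides
print(cl.shape, cl.stride())                   # [8, 3, 224, 224], strides (150528, 1, 672, 3)

nhwc = x.permute(0, 2, 3, 1)                   # actually changes the reported shape
print(nhwc.shape)                              # [8, 224, 224, 3]
```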
Most deep learning code follows a small set of canonical shape conventions, summarized below.
| Domain | Typical shape | Meaning of axes |
|---|---|---|
| Tabular | (N, F) | N rows, F features |
| Image batch (PyTorch) | (N, C, H, W) | batch, channels, height, width |
| Image batch (TF/Keras) | (N, H, W, C) | batch, height, width, channels |
| Token IDs | (B, T) | batch, sequence length |
| Token embeddings | (B, T, D) | batch, sequence, hidden dim |
| Attention scores | (B, H, T, T) | batch, heads, query len, key len |
| Audio waveform | (B, C, L) or (B, L, C) | batch, channels, samples |
| Mel spectrogram | (B, C, F, T) | batch, channels, frequency, time |
| Video | (B, T, C, H, W) | batch, frames, channels, height, width |
| Point cloud | (B, N, 3) | batch, points, xyz |
The leading axis is almost always the batch size N or B; this matters because most operators (linear layers, BatchNorm, attention) treat the first axis as independent samples that can be parallelized.
Broadcasting is the rule that lets two arrays with different shapes participate in an element-wise operation. NumPy, PyTorch, TensorFlow, and JAX all share the same rules, originally specified by NumPy.[7]
The shapes are aligned from the right (trailing axis first), padding the shorter shape with leading 1s. Two aligned axes are compatible if they are equal, or if one of them is 1. The broadcast result takes the per-axis maximum.
| Operation | Left shape | Right shape | Aligned | Result |
|---|---|---|---|---|
| Add scalar | (3, 4) | () | (3, 4) vs (1, 1) | (3, 4) |
| Add row | (3, 4) | (4,) | (3, 4) vs (1, 4) | (3, 4) |
| Add column | (3, 4) | (3, 1) | (3, 4) vs (3, 1) | (3, 4) |
| Outer product | (3, 1) | (1, 4) | (3, 1) vs (1, 4) | (3, 4) |
| Mismatch | (3, 4) | (3,) | (3, 4) vs (1, 3) | ValueError |
| 4D vs 3D | (8, 1, 6, 1) | (7, 1, 5) | (8, 1, 6, 1) vs (1, 7, 1, 5) | (8, 7, 6, 5) |
The last row is the canonical example from the NumPy manual.[7] Broadcasting never copies data; it conceptually stretches a size-1 axis by reusing the same memory through a stride of 0. This is why broadcasting is cheap, and why bugs caused by accidental broadcasting (a (1, 1000) vs (1000,) mistake that silently produces a (1, 1000) result) can be hard to spot.
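A NumPy sketch of the alignment rules and of an accidental-broadcast pitfall:

```python
import numpy as np

a = np.ones((3, 4))
row = np.ones(4)            # (4,)   -> aligned as (1, 4)
col = np.ones((3, 1))

print((a + row).shape)      # (3, 4)
print((a + col).shape)      # (3, 4)
print((col + row).shape)    # (3, 4): outer-product-style broadcast

# The classic silent bug: (N,) vs (N, 1) broadcasts to (N, N)
pred = np.zeros(1000)           # shape (1000,)
target = np.zeros((1000, 1))    # shape (1000, 1)
print((pred - target).shape)    # (1000, 1000), not (1000,)
```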
The term static shape describes a shape known at compile time, before any data has been seen. Dynamic shape describes a shape known only at run time. The distinction matters because compilers (XLA, TorchInductor, ONNX Runtime, TensorRT) generally produce faster code when shapes are static, since they can fuse operations, allocate buffers, and pick kernels for specific dimensions ahead of time.
In the eager mode used by default in PyTorch and TensorFlow 2, every shape is dynamic in the sense that it is computed each time the program runs. In graph or JIT mode the picture changes:
- PyTorch: torch.compile and TorchScript can specialize on shapes, but PyTorch 2 also supports dynamic shapes through symbolic reasoning, so variable batch or sequence lengths do not force a recompile.
- TensorFlow: tf.function traces a ConcreteFunction per input signature; passing a tensor with a new shape may trigger retracing unless the input signature uses None for the variable axis.
- JAX: jit recompiles per shape unless static_argnums or shape polymorphism is used.[4]

A practical consequence is that ML engineers spend a noticeable fraction of debugging time on shape specialization: a model that runs fine on a fixed batch of 32 may recompile (or crash) when given a batch of 31 at the end of an epoch.
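A sketch of PyTorch 2's dynamic-shape support (behaviour varies across versions; dynamic=True asks the compiler to use symbolic shapes rather than specializing on the exact input sizes):

```python
import torch

def f(x):
    return (x @ x.transpose(-1, -2)).relu()

compiled = torch.compile(f, dynamic=True)

compiled(torch.randn(32, 128))  # compiled once, with symbolic batch dimension
compiled(torch.randn(31, 128))  # the odd-sized last batch should not force a fresh compile
```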
The table below lists the operations every deep learning practitioner uses to bend shapes into the form an operator expects.
| Operation | What it does | Constraints |
|---|---|---|
| reshape | Returns a tensor with a new shape and the same total elements | prod(new_shape) == prod(old_shape); one entry may be -1 to be inferred |
| view (PyTorch) | Like reshape, but returns a view sharing memory | Tensor must be contiguous in the requested layout |
| flatten | Collapses a range of axes into one | Equivalent to reshape with the right product |
| squeeze | Removes axes of size 1 | Optionally restricted to a single axis |
| unsqueeze (PyTorch) / expand_dims (NumPy, TF) | Inserts a new axis of size 1 | New rank is r + 1 |
| transpose | Swaps two axes | Returns a view; result is usually non-contiguous |
| permute (PyTorch) | Reorders all axes by a permutation | Returns a view; usually non-contiguous |
| expand (PyTorch) / broadcast_to (NumPy, TF) | Makes a size-1 axis appear larger via stride 0 | No data copy; result should be treated as read-only |
| repeat (PyTorch) / tile (NumPy, TF) | Actually copies data along an axis | Allocates new memory |
| stack | Stacks inputs along a new axis | All inputs must share the same shape |
| concat / cat | Concatenates along an existing axis | All inputs must match on every axis except the one being concatenated |
In PyTorch, view requires the source tensor to be contiguous, while reshape falls back to a copy when it cannot return a view. Calling tensor.contiguous() materializes a fresh contiguous copy when needed, typically after a transpose or permute.[8]
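A PyTorch sketch of the view/reshape distinction described above:

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)
t = x.transpose(1, 2)            # shape (2, 4, 3), non-contiguous view

print(t.is_contiguous())         # False
# t.view(2, 12)                  # would raise: view needs a layout compatible with the new shape
print(t.reshape(2, 12).shape)    # works: reshape copies when it cannot return a view
print(t.contiguous().view(2, 12).shape)  # works: contiguous() materializes a copy first
```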
Shape errors fall into a small number of recurring patterns:
- Inner-dimension mismatch: matmul requires the inner dimensions to agree, so (B, M, K) @ (B, K, N) is valid but (B, M, K) @ (B, M, N) is not.
- Missing batch axis: a model that expects (B, C, H, W) rejects a single image of shape (C, H, W); the fix is image.unsqueeze(0).
- Layout confusion: loading an image in channels-last form ((H, W, C)) and feeding it directly into a PyTorch model that expects (C, H, W) triggers a convolution shape error or, worse, silently treats height as channels.
- Sequence-axis conventions: some code defaults to (T, B, D) (sequence first), others to (B, T, D). The PyTorch RNN modules used to default to T first; their batch_first=True flag is now common.
- Accidental broadcasting: subtracting an (N,) row from an (N, 1) column produces an (N, N) result by broadcasting, which is rarely what the author meant.
- transpose followed by view: view fails after transpose because the result is non-contiguous; use reshape or call .contiguous() first.

Shape interacts with performance in several non-obvious ways. Memory layout, set by both the shape and the strides, decides whether a kernel can use vectorized loads or must gather scattered elements. The official PyTorch channels-last guide reports significant speedups precisely because the new layout makes the convolution inner loop a streaming load over channels.[5][6]
Tensor Cores on NVIDIA GPUs (Volta, Turing, Ampere, Hopper) operate on tiles of 16x16 or larger, so matrix dimensions that are not multiples of 8 or 16 either fall back to slower paths or pad internally. The cuDNN documentation and several training playbooks recommend rounding hidden sizes, vocabulary sizes, and batch sizes to multiples of 64 or 128 for the same reason. Wave quantization, NVIDIA's term for the throughput cliff that appears when the number of thread-block tiles slightly exceeds a multiple of the GPU's streaming-multiprocessor count, is essentially the same shape-rounding effect at a coarser granularity.
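A trivial helper that captures the rounding rule of thumb (a sketch; the multiple of 64 is a convention, not a hard requirement):

```python
def round_up(n: int, multiple: int = 64) -> int:
    """Round n up to the next multiple, e.g. for vocabulary or hidden sizes."""
    return ((n + multiple - 1) // multiple) * multiple

print(round_up(50257))   # 50304: GPT-2's 50257-token vocabulary rounded up to a multiple of 64
```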
For variable-length text or audio batches, padding to the longest sequence wastes compute on the padded positions, while bucketing groups similar-length samples and packing (also called example packing) concatenates several short samples into one long sequence with an attention mask that forbids cross-example attention.[9] Modern LLM training stacks (Megatron, NeMo, vLLM, FlashAttention with variable length) all rely on packed sequences to keep the GPUs saturated.
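A minimal padding sketch in NumPy (the pad_id and mask conventions are illustrative; packing would go further by concatenating several samples into one row with a cross-example attention mask):

```python
import numpy as np

def pad_batch(seqs, pad_id=0):
    """Pad a list of 1-D token arrays to the longest length; return tokens and mask."""
    max_len = max(len(s) for s in seqs)
    tokens = np.full((len(seqs), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for i, s in enumerate(seqs):
        tokens[i, : len(s)] = s
        mask[i, : len(s)] = True
    return tokens, mask               # both of shape (B, T_max)

tokens, mask = pad_batch([np.arange(5), np.arange(3)])
print(tokens.shape, int(mask.sum()))  # (2, 5) 8 -- 2 of the 10 positions are wasted padding
```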
In the transformer era, sequence length has become the dominant variable axis. A typical decoder-only LLM forward pass works on a (B, T) token tensor that becomes (B, T, D) after embedding, (B, H, T, D_head) inside the attention head split, and (B, H, T, T) for the attention score matrix. The quadratic T*T term is the reason context-length growth is expensive, and the reason FlashAttention and ring attention focus on streaming the score matrix without materializing it.
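A sketch of that shape flow through one attention block (the hyperparameters are illustrative, not from any particular model):

```python
import torch

B, T, D, H = 2, 16, 512, 8          # batch, sequence length, hidden dim, heads
Dh = D // H                          # per-head dimension

x = torch.randn(B, T, D)            # token embeddings
qkv = torch.nn.Linear(D, 3 * D)(x)  # (B, T, 3*D)
q, k, v = qkv.chunk(3, dim=-1)      # each (B, T, D)

# Split heads: (B, T, D) -> (B, H, T, Dh)
q = q.view(B, T, H, Dh).transpose(1, 2)
k = k.view(B, T, H, Dh).transpose(1, 2)

scores = q @ k.transpose(-1, -2) / Dh ** 0.5   # (B, H, T, T): the quadratic-in-T term
print(scores.shape)                            # torch.Size([2, 8, 16, 16])
```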
Mixture-of-experts models add a routing dimension that changes shape per token: each expert receives a (K, D) slice of the batch where K is the number of tokens routed to it, which varies. Diffusion models and video generators add a denoising-step axis or a frame axis. Multimodal models stack image patches, audio frames, and text tokens into a single shared sequence, and the bookkeeping for which axis means what is now a major part of model engineering.
The shape of a tensor was once a quiet implementation detail. In production deep learning it has become an interface contract: the type signature of every operator, the unit of test coverage, and often the difference between a 10-millisecond and a 100-millisecond inference call.
A tensor is a box of numbers. Its shape is the list of how many numbers fit along each side of the box. A flat row of 5 numbers has shape (5,). A grid with 3 rows and 4 columns has shape (3, 4). A stack of 10 such grids has shape (10, 3, 4). When two boxes are different shapes, the computer sometimes lets you add them anyway by pretending the smaller one is repeated along the missing sides; this trick is called broadcasting. Most bugs in machine learning code are about boxes that did not line up the way you thought they did.