See also: Machine learning terms, Tensor, Tensor size
In machine learning, the shape of a tensor is the tuple of integers giving the size of the tensor along each of its axes. For a tensor of rank r, the shape is written (d_1, d_2, ..., d_r), where each d_i is the number of elements stored along axis i. Shape is one of the two metadata fields (alongside data type) that almost every deep learning framework attaches to a tensor, and shape mismatches are among the most common sources of runtime errors in model code.
A tensor's shape determines the total number of elements (prod(shape)), the legal operations that can be applied to it (matrix multiplication, convolution, broadcasting), and the memory layout used by the underlying buffer. Frameworks such as NumPy, PyTorch, TensorFlow, and JAX all expose a .shape attribute on their array types, but they differ in whether the shape is fixed at trace time or known only at runtime, and in the conventions they use for image, sequence, and audio data.
A shape is an ordered tuple of non-negative integers. The length of the tuple is called the rank (also ndim, the number of dimensions, or the number of axes), and each entry is the size of the tensor along the corresponding axis. NumPy's documentation defines ndarray.shape as "a tuple of array dimensions" whose length equals the array's number of dimensions.[1] PyTorch's torch.Tensor.size() returns a torch.Size object, a subclass of tuple; the equivalent attribute is tensor.shape, and the two are interchangeable.[2]
The vocabulary varies between communities. Mathematicians and physicists tend to call the rank the order of the tensor and call individual axes modes; deep learning practitioners usually say rank, ndim, or just "number of dimensions," and call individual axes axes or dims. The numeric values inside the shape tuple are sometimes called dimension sizes, extents, or simply dims.
Rank refers to how many axes a tensor has, not how big each axis is. A 1000-element vector has the same rank (1) as a 3-element vector. The table below collects the canonical examples.
| Rank | Object | Example shape | Notes |
|---|---|---|---|
| 0 | Scalar | () | A single number; NumPy reports the empty tuple |
| 1 | Vector | (n,) | A 1D array of length n |
| 2 | Matrix | (m, n) | m rows, n columns |
| 3 | 3D tensor | (d, h, w) | Often a single image with d channels |
| 4 | Image batch | (N, C, H, W) or (N, H, W, C) | NCHW (PyTorch) vs NHWC (TensorFlow) |
| 5 | Video batch | (N, T, C, H, W) | Adds a time axis T |
The trailing comma in (n,) is Python syntax to disambiguate a one-element tuple from a parenthesized expression; (4) is the integer 4, while (4,) is a tuple of length one. A scalar produced by a NumPy reduction (arr.sum()) has shape (), an empty tuple, which is distinct from a length-1 vector with shape (1,).
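A minimal NumPy sketch of these distinctions (the array values are arbitrary):

```python
import numpy as np

x = np.array(4.0)                # rank-0 scalar
v = np.array([1.0, 2.0, 3.0, 4.0])
m = np.zeros((3, 4))

print(x.shape, x.ndim)           # () 0   -- empty shape tuple
print(v.shape, v.ndim)           # (4,) 1 -- note the trailing comma
print(m.shape, m.ndim)           # (3, 4) 2

s = v.sum()                      # reduction produces a scalar
print(np.shape(s))               # () -- distinct from np.zeros((1,)).shape == (1,)
```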
Different frameworks expose shape through slightly different APIs and with different semantics around when the shape is known.
| Framework | Shape access | Type returned | Notes |
|---|---|---|---|
| NumPy | arr.shape | tuple of int | Always concrete; assigning to shape reshapes in place but is discouraged in favor of reshape[1] |
| PyTorch | t.shape or t.size() | torch.Size (tuple subclass) | t.size(dim) returns the size of one axis as int[2] |
| TensorFlow | t.shape (static) and tf.shape(t) (dynamic) | TensorShape and 1-D int32 Tensor | Static may contain None for unknown dims; dynamic is always concrete at run time[3] |
| JAX | arr.shape (on a jax.Array) | tuple of int | Inside jit, shapes must be concrete at trace time; a new input shape triggers recompilation[4] |
In TensorFlow, the static shape returned by tensor.shape is the shape inferred during graph construction. It can be partially known, with None standing in for axes whose size depends on runtime data (typical for the batch axis or a variable sequence length). The dynamic shape returned by tf.shape(tensor) is a 1-D int32 tensor that is always fully known at execution time and that can be fed into other ops.[3] The two forms are equivalent only when the static shape is fully defined.
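A small sketch of the two access paths, assuming TensorFlow 2 (the None batch axis is the typical case of a partially known static shape):

```python
import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec(shape=[None, 128], dtype=tf.float32)])
def model(x):
    static = x.shape          # TensorShape([None, 128]); batch size unknown at trace time
    dynamic = tf.shape(x)     # 1-D int32 tensor, fully known when the function executes
    # Use the dynamic shape when an axis size feeds another op:
    return tf.reshape(x, [dynamic[0], 4, 32])

out = model(tf.zeros([7, 128]))   # runs for any batch size without retracing
print(out.shape)                  # (7, 4, 32)
```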
JAX takes a stricter line. Inside a jax.jit-compiled function, all array shapes must be static at trace time, because the function is lowered to StableHLO and compiled separately for each combination of input shapes and dtypes. Calling the same jitted function with a new input shape triggers a recompilation. Practitioners usually pad inputs to a small set of fixed sizes or use static_argnums for shape-dependent constants. JAX's shape polymorphism feature relaxes this for export, allowing a single compiled artifact to handle a family of shapes.[4]
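A sketch of the JAX behaviour (the shapes are illustrative; the print statement only fires while the function is being traced, which makes recompilation visible):

```python
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    # This Python body runs once per (shape, dtype) combination, at trace time.
    print("tracing for shape", x.shape)
    return (x * 2.0).sum()

f(jnp.ones((32, 128)))   # prints "tracing for shape (32, 128)", then compiles
f(jnp.ones((32, 128)))   # cached: no tracing, no compilation
f(jnp.ones((31, 128)))   # new shape -> traced and compiled again
```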
Four-dimensional image tensors come in two competing layouts. NCHW (channels-first) puts the channel axis right after the batch axis: (N, C, H, W). NHWC (channels-last) puts channels at the end: (N, H, W, C). PyTorch defaults to NCHW; TensorFlow and Keras default to NHWC; ONNX uses NCHW; Apple's Metal Performance Shaders prefer NHWC; cuDNN supports both.
| Layout | Convention | Default in | Memory locality |
|---|---|---|---|
| NCHW | Channels-first | PyTorch, ONNX, Caffe | Spatial pixels of one channel are contiguous |
| NHWC | Channels-last | TensorFlow, Keras, Apple MPS | All channels of one pixel are contiguous |
The choice is not just cosmetic. NHWC is often faster on modern hardware because Tensor Cores and vector instructions read across channels, and convolution kernels written for the cuDNN NHWC path avoid the layout conversion that the NCHW path performs internally. The official PyTorch tutorial reports 8 to 35 percent speedups on Volta GPUs and over 22 percent on ResNet-50 with mixed-precision training when using tensor.to(memory_format=torch.channels_last).[5] Intel's CPU benchmarks for vision models show 1.3 to 1.8 times higher throughput with channels-last on Ice Lake and newer CPUs.[6]
In PyTorch, channels-last is implemented as a memory format on a 4D NCHW tensor rather than as a different shape. The strides change from (C*H*W, H*W, W, 1) to (C*H*W, 1, W*C, C), but tensor.shape still reports (N, C, H, W). To switch layouts entirely, use permute (x.permute(0, 2, 3, 1)); in TensorFlow use tf.transpose(x, [0, 2, 3, 1]).
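A short PyTorch sketch of the difference between changing the memory format and changing the shape:

```python
import torch

x = torch.randn(8, 3, 224, 224)               # NCHW shape, contiguous layout
print(x.shape, x.stride())                     # [8, 3, 224, 224], strides (150528, 50176, 224, 1)

cl = x.to(memory_format=torch.channels_last)   # same reported shape, NHWC-style strides
print(cl.shape, cl.stride())                   # [8, 3, 224, 224], strides (150528, 1, 672, 3)

nhwc = x.permute(0, 2, 3, 1)                   # actually changes the reported shape
print(nhwc.shape)                              # [8, 224, 224, 3]
```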
Most deep learning code follows a small set of canonical shape conventions, summarized below.
| Domain | Typical shape | Meaning of axes |
|---|---|---|
| Tabular | (N, F) | N rows, F features |
| Image batch (PyTorch) | (N, C, H, W) | batch, channels, height, width |
| Image batch (TF/Keras) | (N, H, W, C) | batch, height, width, channels |
| Token IDs | (B, T) | batch, sequence length |
| Token embeddings | (B, T, D) | batch, sequence, hidden dim |
| Attention scores | (B, H, T, T) | batch, heads, query len, key len |
| Audio waveform | (B, C, L) or (B, L, C) | batch, channels, samples |
| Mel spectrogram | (B, C, F, T) | batch, channels, frequency, time |
| Video | (B, T, C, H, W) | batch, frames, channels, height, width |
| Point cloud | (B, N, 3) | batch, points, xyz |
The leading axis is almost always the batch size N or B; this matters because most operators (linear layers, BatchNorm, attention) treat the first axis as independent samples that can be parallelized.
Broadcasting is the rule that lets two arrays with different shapes participate in an element-wise operation. NumPy, PyTorch, TensorFlow, and JAX all share the same rules, originally specified by NumPy.[7]
The shapes are aligned from the right (trailing axis first), padding the shorter shape with leading 1s. Two aligned axes are compatible if they are equal, or if one of them is 1. The broadcast result takes the per-axis maximum.
| Operation | Left shape | Right shape | Aligned | Result |
|---|---|---|---|---|
| Add scalar | (3, 4) | () | (3, 4) vs (1, 1) | (3, 4) |
| Add row | (3, 4) | (4,) | (3, 4) vs (1, 4) | (3, 4) |
| Add column | (3, 4) | (3, 1) | (3, 4) vs (3, 1) | (3, 4) |
| Outer product | (3, 1) | (1, 4) | (3, 1) vs (1, 4) | (3, 4) |
| Mismatch | (3, 4) | (3,) | (3, 4) vs (1, 3) | ValueError |
| 4D vs 3D | (8, 1, 6, 1) | (7, 1, 5) | (8, 1, 6, 1) vs (1, 7, 1, 5) | (8, 7, 6, 5) |
The last row is the canonical example from the NumPy manual.[7] Broadcasting never copies data; it conceptually stretches a size-1 axis by reusing the same memory through a stride of 0. This is why broadcasting is cheap, and why bugs caused by accidental broadcasting (a (1, 1000) vs (1000,) mistake that silently produces a (1, 1000) result) can be hard to spot.
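A NumPy sketch of the alignment rules and of an accidental-broadcast pitfall:

```python
import numpy as np

a = np.ones((3, 4))
row = np.ones(4)            # (4,)   -> aligned as (1, 4)
col = np.ones((3, 1))

print((a + row).shape)      # (3, 4)
print((a + col).shape)      # (3, 4)
print((col + row).shape)    # (3, 4): outer-product-style broadcast

# The classic silent bug: (N,) vs (N, 1) broadcasts to (N, N)
pred = np.zeros(1000)           # shape (1000,)
target = np.zeros((1000, 1))    # shape (1000, 1)
print((pred - target).shape)    # (1000, 1000), not (1000,)
```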
The term static shape describes a shape known at compile time, before any data has been seen. Dynamic shape describes a shape known only at run time. The distinction matters because compilers (XLA, TorchInductor, ONNX Runtime, TensorRT) generally produce faster code when shapes are static, since they can fuse operations, allocate buffers, and pick kernels for specific dimensions ahead of time.
In the eager mode used by default in PyTorch and TensorFlow 2, every shape is dynamic in the sense that it is computed each time the program runs. In graph or JIT mode the picture changes:
- PyTorch: torch.compile and TorchScript can specialize on shapes, but PyTorch 2 also supports dynamic shapes through symbolic reasoning, so variable batch or sequence lengths do not force a recompile.
- TensorFlow: tf.function traces a ConcreteFunction per input signature; passing a tensor with a new shape may trigger retracing unless the input signature uses None for the variable axis.
- JAX: jit recompiles per shape unless static_argnums or shape polymorphism is used.[4]

A practical consequence is that ML engineers spend a noticeable fraction of debugging time on shape specialization: a model that runs fine on a fixed batch of 32 may recompile (or crash) when given a batch of 31 at the end of an epoch.
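A sketch of PyTorch 2's dynamic-shape support (behaviour varies across versions; dynamic=True asks the compiler to use symbolic shapes rather than specializing on the exact input sizes):

```python
import torch

def f(x):
    return (x @ x.transpose(-1, -2)).relu()

compiled = torch.compile(f, dynamic=True)

compiled(torch.randn(32, 128))  # compiled once, with symbolic batch dimension
compiled(torch.randn(31, 128))  # the odd-sized last batch should not force a fresh compile
```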
The table below lists the operations every deep learning practitioner uses to bend shapes into the form an operator expects.
| Operation | What it does | Constraints |
|---|---|---|
| reshape | Returns a tensor with a new shape and the same total elements | prod(new_shape) == prod(old_shape); one entry may be -1 to be inferred |
| view (PyTorch) | Like reshape, but returns a view sharing memory | Tensor must be contiguous in the requested layout |
| flatten | Collapses a range of axes into one | Equivalent to reshape with the right product |
| squeeze | Removes axes of size 1 | Optionally restricted to a single axis |
| unsqueeze (PyTorch) / expand_dims (NumPy, TF) | Inserts a new axis of size 1 | New rank is r + 1 |
| transpose | Swaps two axes | Returns a view; result is usually non-contiguous |
| permute (PyTorch) | Reorders all axes by a permutation | Returns a view; usually non-contiguous |
| expand (PyTorch) / broadcast_to (NumPy, TF) | Makes a size-1 axis appear larger via stride 0 | No data copy; result should be treated as read-only |
| repeat (PyTorch) / tile (NumPy, TF) | Actually copies data along an axis | Allocates new memory |
| stack | Stacks inputs along a new axis | All inputs must share the same shape |
| concat / cat | Concatenates along an existing axis | All inputs must match on every axis except the one being concatenated |
In PyTorch, view requires the source tensor to be contiguous, while reshape falls back to a copy when it cannot return a view. Calling tensor.contiguous() materializes a fresh contiguous copy when needed, typically after a transpose or permute.[8]
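A PyTorch sketch of the view/reshape distinction described above:

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)
t = x.transpose(1, 2)            # shape (2, 4, 3), non-contiguous view

print(t.is_contiguous())         # False
# t.view(2, 12)                  # would raise: view needs a layout compatible with the new shape
print(t.reshape(2, 12).shape)    # works: reshape copies when it cannot return a view
print(t.contiguous().view(2, 12).shape)  # works: contiguous() materializes a copy first
```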
Shape errors fall into a small number of recurring patterns:
- Inner-dimension mismatch: matmul requires the inner dimensions to agree, so (B, M, K) @ (B, K, N) is valid but (B, M, K) @ (B, M, N) is not.
- Missing batch axis: a model that expects (B, C, H, W) rejects a single image of shape (C, H, W); the fix is image.unsqueeze(0).
- Layout confusion: loading an image in channels-last form ((H, W, C)) and feeding it directly into a PyTorch model that expects (C, H, W) triggers a convolution shape error or, worse, silently treats height as channels.
- Sequence-axis conventions: some code defaults to (T, B, D) (sequence first), others to (B, T, D). The PyTorch RNN modules used to default to T first; their batch_first=True flag is now common.
- Accidental broadcasting: subtracting an (N,) row from an (N, 1) column produces an (N, N) result by broadcasting, which is rarely what the author meant.
- transpose followed by view: view fails after transpose because the result is non-contiguous; use reshape or call .contiguous() first.

Shape interacts with performance in several non-obvious ways. Memory layout, set by both the shape and the strides, decides whether a kernel can use vectorized loads or must gather scattered elements. The official PyTorch channels-last guide reports significant speedups precisely because the new layout makes the convolution inner loop a streaming load over channels.[5][6]
Tensor Cores on NVIDIA GPUs (Volta, Turing, Ampere, Hopper) operate on tiles of 16x16 or larger, so matrix dimensions that are not multiples of 8 or 16 either fall back to slower paths or pad internally. The cuDNN documentation and several training playbooks recommend rounding hidden sizes, vocabulary sizes, and batch sizes to multiples of 64 or 128 for the same reason. Wave quantization, NVIDIA's term for the throughput cliff that appears when the number of thread-block tiles slightly exceeds a multiple of the GPU's streaming-multiprocessor count, is essentially the same shape-rounding effect at a coarser granularity.
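A trivial helper that captures the rounding rule of thumb (a sketch; the multiple of 64 is a convention, not a hard requirement):

```python
def round_up(n: int, multiple: int = 64) -> int:
    """Round n up to the next multiple, e.g. for vocabulary or hidden sizes."""
    return ((n + multiple - 1) // multiple) * multiple

print(round_up(50257))   # 50304: GPT-2's 50257-token vocabulary rounded up to a multiple of 64
```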
For variable-length text or audio batches, padding to the longest sequence wastes compute on the padded positions, while bucketing groups similar-length samples and packing (also called example packing) concatenates several short samples into one long sequence with an attention mask that forbids cross-example attention.[9] Modern LLM training stacks (Megatron, NeMo, vLLM, FlashAttention with variable length) all rely on packed sequences to keep the GPUs saturated.
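A minimal padding sketch in NumPy (the pad_id and mask conventions are illustrative; packing would go further by concatenating several samples into one row with a cross-example attention mask):

```python
import numpy as np

def pad_batch(seqs, pad_id=0):
    """Pad a list of 1-D token arrays to the longest length; return tokens and mask."""
    max_len = max(len(s) for s in seqs)
    tokens = np.full((len(seqs), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for i, s in enumerate(seqs):
        tokens[i, : len(s)] = s
        mask[i, : len(s)] = True
    return tokens, mask               # both of shape (B, T_max)

tokens, mask = pad_batch([np.arange(5), np.arange(3)])
print(tokens.shape, int(mask.sum()))  # (2, 5) 8 -- 2 of the 10 positions are wasted padding
```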
In the transformer era, sequence length has become the dominant variable axis. A typical decoder-only LLM forward pass works on a (B, T) token tensor that becomes (B, T, D) after embedding, (B, H, T, D_head) inside the attention head split, and (B, H, T, T) for the attention score matrix. The quadratic T*T term is the reason context-length growth is expensive, and the reason FlashAttention and ring attention focus on streaming the score matrix without materializing it.
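A sketch of that shape flow through one attention block (the hyperparameters are illustrative, not from any particular model):

```python
import torch

B, T, D, H = 2, 16, 512, 8          # batch, sequence length, hidden dim, heads
Dh = D // H                          # per-head dimension

x = torch.randn(B, T, D)            # token embeddings
qkv = torch.nn.Linear(D, 3 * D)(x)  # (B, T, 3*D)
q, k, v = qkv.chunk(3, dim=-1)      # each (B, T, D)

# Split heads: (B, T, D) -> (B, H, T, Dh)
q = q.view(B, T, H, Dh).transpose(1, 2)
k = k.view(B, T, H, Dh).transpose(1, 2)

scores = q @ k.transpose(-1, -2) / Dh ** 0.5   # (B, H, T, T): the quadratic-in-T term
print(scores.shape)                            # torch.Size([2, 8, 16, 16])
```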
Mixture-of-experts models add a routing dimension that changes shape per token: each expert receives a (K, D) slice of the batch where K is the number of tokens routed to it, which varies. Diffusion models and video generators add a denoising-step axis or a frame axis. Multimodal models stack image patches, audio frames, and text tokens into a single shared sequence, and the bookkeeping for which axis means what is now a major part of model engineering.
The shape of a tensor was once a quiet implementation detail. In production deep learning it has become an interface contract: the type signature of every operator, the unit of test coverage, and often the difference between a 10-millisecond and a 100-millisecond inference call.
A tensor is a box of numbers. Its shape is the list of how many numbers fit along each side of the box. A flat row of 5 numbers has shape (5,). A grid with 3 rows and 4 columns has shape (3, 4). A stack of 10 such grids has shape (10, 3, 4). When two boxes are different shapes, the computer sometimes lets you add them anyway by pretending the smaller one is repeated along the missing sides; this trick is called broadcasting. Most bugs in machine learning code are about boxes that did not line up the way you thought they did.