Tensor size
Last reviewed
Sources
15 citations
Review status
Source-backed
Revision
v3 · 4,045 words
See also: tensor, shape, tensor rank, dtype
The size of a tensor is a description of how big the tensor is, and the term carries two distinct meanings in everyday deep learning. In its most precise sense, a tensor's size is its shape: the tuple of per-axis lengths, for example (2, 3, 4) for a three-axis tensor [1][2]. In a second, equally common sense, "size" means the total number of elements, which is the product of those axis lengths (2 * 3 * 4 = 24) [1][5]. These two readings are split across libraries: PyTorch's tensor.size() returns the shape, while NumPy's ndarray.size returns the element count [2][3][1]. Closely related but separate is the tensor's rank (its number of axes, also called ndim), and its memory footprint in bytes (element count multiplied by bytes per element) [5][4].
In machine learning, tensor size is one of two related but distinct quantities, depending on which library you are reading. The word is genuinely overloaded, and confusing the two meanings is a common source of bugs.
The two meanings are:
- Element count: the total number of elements. A tensor of shape (3, 4, 5) has 3 * 4 * 5 = 60 elements. This is the meaning used by NumPy (ndarray.size), TensorFlow (tf.size), and JAX (jnp.size) [1][5][6].
- Shape: the tuple of per-axis lengths. In PyTorch, tensor.size() returns a torch.Size object, a tuple subclass that holds the length of each dimension [2]. PyTorch users get the element count from a separate method, tensor.numel() [3].
The TensorFlow tensor guide draws the line crisply, defining shape as "The length (number of elements) of each of the axes of a tensor" and size as "The total number of items in the tensor, the product of the shape vector's elements" [14]. Keeping those two definitions separate is the single most useful habit when reasoning about tensor dimensions.
In casual practice, "size" is also used loosely to mean memory footprint in bytes, which depends on the element count and the dtype. When someone says a model is "16 GB," they usually mean that its parameter tensors occupy 16 GB of memory at a particular precision, not that it contains 16 billion elements.
Because of this ambiguity, careful technical writing tries to distinguish three things: the shape (per-axis lengths), the element count (a single integer), and the byte size (element count multiplied by bytes per element).
Shape, rank (ndim), and size answer three different questions about the same tensor. The TensorFlow documentation defines them this way [14]:
| Term | TensorFlow definition (verbatim) | For a tensor of shape (2, 3, 4) |
|---|---|---|
| Shape | "The length (number of elements) of each of the axes of a tensor." | (2, 3, 4) |
| Rank (ndim) | "Number of tensor axes. A scalar has rank 0, a vector has rank 1, a matrix is rank 2." | 3 |
| Axis or Dimension | "A particular dimension of a tensor." | axis 0 has length 2, axis 1 has length 3, axis 2 has length 4 |
| Size | "The total number of items in the tensor, the product of the shape vector's elements." | 24 |
In other words:
- Shape is the tuple of per-axis lengths, for example (2, 3, 4).
- Rank (ndim or the number of axes) is just how many entries that tuple has. The shape (2, 3, 4) has rank 3. A scalar has rank 0 and shape ().
- Size is the product of the entries in that tuple: 2 * 3 * 4 = 24.
The NumPy manual describes the shape attribute simply as "Tuple of array dimensions" and the ndim attribute as "Number of array dimensions" [1][15]. Note the subtlety: NumPy uses the word "dimensions" for what TensorFlow calls axes, so np.ndarray.ndim returns the rank (axis count), not any single axis length. This is a frequent source of confusion between the two ecosystems.
Rank is often introduced through the familiar ladder of mathematical objects. The TensorFlow guide spells it out: "A scalar has rank 0, a vector has rank 1, a matrix is rank 2" [14], and the pattern continues upward.
| Object | Rank (ndim) | Example shape | Typical use |
|---|---|---|---|
| Scalar | 0 | () | a single loss value, a learning rate |
| Vector | 1 | (768,) | an embedding, a bias vector |
| Matrix | 2 | (1024, 1024) | a linear layer weight, a batch of vectors |
| 3-tensor | 3 | (32, 128, 768) | (batch, sequence, hidden) for a transformer |
| 4-tensor | 4 | (32, 3, 224, 224) | (batch, channels, height, width) for an image batch |
| 5-tensor | 5 | (32, 16, 3, 224, 224) | (batch, frames, channels, height, width) for video |
The word "tensor" in deep learning is used loosely for an array of any rank, including rank 0 and rank 1, even though the ladder above gives those ranks their own names (scalar, vector) and uses "tensor" only from rank 3 upward. Frameworks make no distinction: a PyTorch scalar and a PyTorch image batch are both torch.Tensor objects.
NumPy exposes three separate attributes, and the official one-line descriptions make the split explicit. The shape attribute is documented as "Tuple of array dimensions," ndim as "Number of array dimensions," and size as the total element count [1][15].
```python
import numpy as np

a = np.zeros((2, 3, 4))
a.shape   # (2, 3, 4)  tuple of per-axis lengths
a.ndim    # 3          number of axes (rank)
a.size    # 24         total elements
len(a)    # 2          length of the first axis only
```
A common mistake is to use len(a), which returns only the length of the first axis, when you wanted a.size (all elements) or a.ndim (the rank). len(a) and a.size coincide only for a 1-D array.
PyTorch is the framework most responsible for the overloaded meaning of "size," because tensor.size() returns the shape, not the element count. The official documentation states plainly: "torch.Size is the result type of a call to torch.Tensor.size()" and "As a subclass of tuple, it supports common sequence operations like indexing and length" [2].
```python
import torch

t = torch.zeros(2, 3, 4)
t.size()    # torch.Size([2, 3, 4])  the shape
t.shape     # torch.Size([2, 3, 4])  identical to .size()
t.size(0)   # 2                      length of one axis
t.dim()     # 3                      rank / number of axes (ndim)
t.ndim      # 3                      same as t.dim()
t.numel()   # 24                     total elements
```
So in PyTorch the shape comes from .size() or .shape, the rank from .dim() or .ndim, and the element count from .numel(). Passing an integer argument to .size(dim) returns a single axis length as a plain int rather than a torch.Size.
The table below summarizes how each library exposes "size" for a hypothetical tensor t with shape (2, 3, 4).
| Library | Expression | Returns | Value for shape (2, 3, 4) |
|---|---|---|---|
| NumPy | arr.size | total elements (int) | 24 |
| NumPy | arr.shape | per-axis tuple | (2, 3, 4) |
| NumPy | arr.ndim | number of axes (int) | 3 |
| PyTorch | t.size() | torch.Size (a tuple subclass) | torch.Size([2, 3, 4]) |
| PyTorch | t.shape | same torch.Size | torch.Size([2, 3, 4]) |
| PyTorch | t.numel() | total elements (int) | 24 |
| PyTorch | t.dim() / t.ndim | number of axes (int) | 3 |
| PyTorch | t.size().numel() | total elements (int) | 24 |
| TensorFlow | tf.size(t) | 0-D int tensor | 24 |
| TensorFlow | tf.rank(t) | 0-D int tensor | 3 |
| TensorFlow | t.shape | TensorShape | TensorShape([2, 3, 4]) |
| TensorFlow | t.ndim | number of axes (int) | 3 |
| JAX | arr.size | total elements (int) | 24 |
| JAX | jnp.size(arr) | total elements (int) | 24 |
| JAX | arr.shape | per-axis tuple | (2, 3, 4) |
A few details are worth flagging. PyTorch's torch.Size is itself a tuple subclass, so it behaves like a tuple in indexing and unpacking but also has a helper method numel() that returns the product of its entries [2]. Calling numel() on a torch.Size returns the element count of a tensor that would have that shape, not the rank of the tensor. The PyTorch issue tracker has documented this distinction explicitly, since users sometimes assume torch.Size.numel() returns the number of axes.
NumPy reports ndarray.size as a Python int of arbitrary precision, while np.prod(a.shape) returns a fixed-width np.int_. For very large arrays this matters because np.prod can silently overflow on 32-bit platforms.
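The overflow risk can be reproduced on any machine with a minimal sketch like the one below. It forces a 32-bit accumulator with dtype=np.int32 to imitate a platform whose default integer is 32 bits. The shape is hypothetical, and no array of that size is allocated.

```python
import math
import numpy as np

# Hypothetical shape with 2**32 elements; only the shape tuple is used.
shape = (65536, 65536)

# A 32-bit accumulator wraps around silently instead of raising an error.
wrapped = int(np.prod(shape, dtype=np.int32))

# Python integers have arbitrary precision, so math.prod cannot overflow.
exact = math.prod(shape)

# Requesting a 64-bit accumulator is also enough for this shape.
safe = int(np.prod(shape, dtype=np.int64))

print(wrapped, exact, safe)
```

For shape arithmetic that never touches array data, math.prod over the shape tuple is a simple way to stay in arbitrary-precision integers.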
TensorFlow's tf.size returns a 0-dimensional integer tensor (default tf.int32), not a Python integer [5]. If the count exceeds the int32 range, you must pass out_type=tf.int64. The static t.shape returns a TensorShape object that may contain None entries for dynamic dimensions, while tf.shape(t) returns a 1-D tensor with the runtime shape. The TensorFlow guide warns that the Tensor.ndim and Tensor.shape attributes return Python objects rather than tensors, so when you need the value inside a graph you should call tf.rank or tf.shape instead [14].
JAX follows NumPy semantics: arr.size is the element count, and jnp.size(arr) mirrors np.size [6]. Unlike NumPy's free function, jnp.size raises TypeError on Python lists or tuples instead of converting them implicitly.
The element count is always the product of the entries in the shape:
elements = shape[0] * shape[1] * ... * shape[rank - 1]
A few corner cases:
- A scalar (rank-0 tensor) has shape () and contains 1 element. numel(), tf.size, and NumPy's arr.size all return 1, not 0.
- A tensor with a zero-length axis, such as shape (3, 0, 4), has 0 elements. The shape is still meaningful and operations like reshape may need to preserve it.
The element count is a property of the shape only. It does not depend on the dtype, the device, or whether the tensor is contiguous in memory.
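The product rule and its corner cases can be checked without any array library. The sketch below uses only the standard library; element_count and rank are illustrative helper names, not part of any framework.

```python
import math

def element_count(shape):
    """Return the number of elements implied by a shape tuple."""
    # math.prod of an empty tuple is 1, which matches the scalar case.
    return math.prod(shape)

def rank(shape):
    """Return the number of axes, i.e. the length of the shape tuple."""
    return len(shape)

for shape in [(), (5,), (2, 3, 4), (3, 0, 4)]:
    print(shape, "rank:", rank(shape), "elements:", element_count(shape))
```

The empty product being 1 is the reason a scalar counts as one element, and a single zero-length axis is enough to make the whole product zero.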
Most array operations are best understood by how they transform the shape, because the shape, not the raw element count, is what has to line up for an operation to be legal.
- Reshaping preserves the element count: a (2, 12) tensor can reshape to (4, 6) or (2, 3, 4) (all 24 elements) but not to (5, 5). A -1 entry tells the framework to infer that axis from the element count.
- Reductions such as sum, mean, or max remove an axis, lowering the rank, or collapse it to length 1 with keepdims=True.
- Joining operations either extend an existing axis (cat / concatenate) or add a new one (stack), and require the other axis lengths to match.
- Matrix multiplication: (m, k) @ (k, n) produces (m, n). The shared k must be equal, and a mismatch there is a very common shape error in practice.
Because shape mismatches are such a frequent class of bug in tensor code, frameworks raise immediate errors when axis lengths do not align. Reading the printed shapes in the error message is usually the fastest way to find the offending operation.
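These shape rules can be sketched in NumPy as follows. NumPy is used here only for illustration, and the variable names are arbitrary; the other frameworks follow the same rules under slightly different function names.

```python
import numpy as np

x = np.arange(24).reshape(2, 12)

# Reshape keeps the element count; -1 asks NumPy to infer that axis.
a = x.reshape(4, 6)
b = x.reshape(2, -1, 4)               # inferred axis is 24 / (2 * 4) = 3

# An incompatible target shape is rejected: 5 * 5 != 24.
try:
    x.reshape(5, 5)
    reshape_failed = False
except ValueError:
    reshape_failed = True

# Reductions drop an axis unless keepdims=True keeps it with length 1.
s = b.sum(axis=1)                     # (2, 4)
k = b.sum(axis=1, keepdims=True)      # (2, 1, 4)

# concatenate extends an existing axis; stack adds a new one.
c = np.concatenate([b, b], axis=0)    # (4, 3, 4)
t = np.stack([b, b], axis=0)          # (2, 2, 3, 4)

# Matrix multiplication: (m, k) @ (k, n) -> (m, n).
m = np.ones((2, 12)) @ np.ones((12, 5))

print(a.shape, b.shape, s.shape, k.shape, c.shape, t.shape, m.shape)
```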
Memory footprint is the practical meaning of "size" for anyone budgeting GPU RAM. The formula is simple:
bytes = element_count * bytes_per_element
In PyTorch, tensor.element_size() returns the bytes per element for the tensor's dtype, so t.numel() * t.element_size() is the storage size of a contiguous tensor [4][3]. NumPy exposes the same number as arr.itemsize and the total as arr.nbytes. TensorFlow does not have a built-in helper, so users compute tf.size(t).numpy() * t.dtype.size.
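A short NumPy sketch of the same formula, using the itemsize and nbytes attributes mentioned above. The strided view is included to show why the formula describes the tensor's own elements rather than the buffer it may share with another array.

```python
import numpy as np

a = np.zeros((2, 3, 4), dtype=np.float32)

# bytes = element_count * bytes_per_element
total = a.size * a.itemsize           # 24 elements * 4 bytes

# A strided view reports the bytes of its own elements,
# not the size of the buffer it borrows from.
view = a[:, ::2, :]                   # shape (2, 2, 4), shares memory with a

print(a.itemsize, a.nbytes, total)
print(view.shape, view.nbytes, view.base is a)
```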
The table below lists the byte sizes for common dtypes used in modern deep learning. Numbers are bytes per scalar element.
| Dtype | Bytes | Common uses |
|---|---|---|
| float64 (fp64) | 8 | scientific computing, rarely used in deep learning |
| float32 (fp32) | 4 | classic training default; baseline weights |
| float16 (fp16) | 2 | mixed-precision training, inference |
| bfloat16 (bf16) | 2 | TPU and modern GPU training, same range as fp32 |
| float8 e4m3 (fp8) | 1 | forward activations, weights on Hopper and Blackwell |
| float8 e5m2 (fp8) | 1 | gradients during backward pass |
| int64 | 8 | indices, large counters |
| int32 | 4 | indices, ids |
| int16 | 2 | quantized intermediate values |
| int8 / uint8 | 1 | quantized weights or activations, image bytes |
| int4 (packed) | 0.5 | aggressive quantization for LLM serving |
| bool | 1 | masks; stored as a full byte despite being one bit logically |
| complex64 | 8 | signal processing, two fp32 components |
| complex128 | 16 | high-precision signal processing |
A practical detail: PyTorch's bool tensor uses one byte per element, not one bit. If you want true bit-packing you have to use a uint8 tensor and pack manually. The PyTorch issue tracker has a long-running discussion about this for users who expected packed booleans.
FP8 came to mainstream hardware with the NVIDIA Hopper generation and is now a fixture on Blackwell [7][8]. The two encodings are E4M3 (4 exponent bits, 3 mantissa bits, 1 sign bit, max value about 448) and E5M2 (5 exponent bits, 2 mantissa bits, 1 sign bit, max value about 57344) [7]. Both use 1 byte per element. The convention recommended by NVIDIA is E4M3 for forward activations and weights, E5M2 for backward gradients where dynamic range matters more than precision [7].
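The two maximum values can be derived from the bit layouts. The sketch below assumes the commonly published FP8 encoding rules, which the text above does not spell out: exponent bias 7 for E4M3, where only the all-ones exponent with all-ones mantissa is reserved for NaN, and bias 15 for E5M2, where the all-ones exponent is reserved for infinity and NaN as in IEEE formats. The helper name is illustrative.

```python
def normal_value(exponent_field, mantissa_field, mantissa_bits, bias):
    """Value of a normal floating-point number from its raw bit fields."""
    return 2.0 ** (exponent_field - bias) * (1 + mantissa_field / 2 ** mantissa_bits)

# E4M3 (assumed bias 7): the top exponent is usable, but mantissa 0b111
# at that exponent encodes NaN, so the largest mantissa is 0b110.
e4m3_max = normal_value(exponent_field=0b1111, mantissa_field=0b110,
                        mantissa_bits=3, bias=7)

# E5M2 (assumed bias 15): the all-ones exponent is reserved,
# so the largest usable exponent field is 0b11110.
e5m2_max = normal_value(exponent_field=0b11110, mantissa_field=0b11,
                        mantissa_bits=2, bias=15)

print(e4m3_max, e5m2_max)  # 448.0 57344.0
```

Both maxima are 1.75 times a power of two (2^8 and 2^15), which makes the trade-off visible: E5M2 spends one more bit on exponent and gains range, while E4M3 keeps one more bit of mantissa precision.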
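A float32 matrix of shape (1024, 1024) has:
- 1024 * 1024 = 1,048,576 elements
- 4 bytes per element
- 1,048,576 * 4 = 4,194,304 bytes, or 4 MiB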
In bfloat16 the same matrix is 2 MiB. In int8 it is 1 MiB. In int4 (packed two-per-byte) it is 0.5 MiB. The shape did not change; only the bytes per element did.
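The same arithmetic as a small standard-library sketch. The byte sizes come from the dtype table, int4 assumes two values packed per byte, and footprint_mib is an illustrative helper name.

```python
import math

MIB = 1024 ** 2

# Bytes per element; int4 assumes two values packed into each byte.
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2, "int8": 1, "int4": 0.5}

def footprint_mib(shape, dtype):
    """Memory footprint in MiB for a dense tensor of the given shape and dtype."""
    return math.prod(shape) * BYTES_PER_ELEMENT[dtype] / MIB

for dtype in BYTES_PER_ELEMENT:
    print(dtype, footprint_mib((1024, 1024), dtype), "MiB")
```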
When people say a model is "7B" or "70B," they mean the total count of trainable parameters across every weight tensor. To get that number programmatically:
- PyTorch, all parameters: sum(p.numel() for p in model.parameters())
- PyTorch, trainable parameters only: sum(p.numel() for p in model.parameters() if p.requires_grad)
- Hugging Face Transformers: model.num_parameters() (with optional only_trainable=True)
- Keras: model.count_params()
Multiplying the parameter count by bytes per element gives the storage cost of the weights themselves. The table below shows weight memory at common precisions for several well-known checkpoints (BERT-base at 110M and BERT-large at 340M parameters [13]). These figures are weights only; activations, optimizer state, and KV cache are separate.
| Model | Parameters | fp32 weights | fp16/bf16 weights | int8 weights | int4 weights |
|---|---|---|---|---|---|
| BERT-base | 110M | ~440 MB | ~220 MB | ~110 MB | ~55 MB |
| BERT-large | 340M | ~1.36 GB | ~680 MB | ~340 MB | ~170 MB |
| GPT-2 small | 117M | ~468 MB | ~234 MB | ~117 MB | ~59 MB |
| GPT-2 XL | 1.5B | ~6 GB | ~3 GB | ~1.5 GB | ~0.75 GB |
| Llama 3 8B | 8.0B | ~32 GB | ~16 GB | ~8 GB | ~4 GB |
| Llama 3 70B | 70B | ~280 GB | ~140 GB | ~70 GB | ~35 GB |
| Llama 3.1 405B | 405B | ~1.62 TB | ~810 GB | ~405 GB | ~203 GB |
The Hugging Face release of Llama 3.1 405B reported about 812 GB for the bf16 instruction-tuned variant, which lines up with the 405B * 2 bytes calculation [9]. At fp32 the same weights would take roughly 1.62 TB. Even an 8-way H100 node, with about 640 GB of HBM in total, cannot hold the bf16 weights of the 405B model; they must be sharded across nodes or stored in FP8, which fits in roughly 486 GB.
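A rough sketch of the weights-only arithmetic. It uses decimal gigabytes to match the table, assumes 80 GB of HBM per GPU for the 8-GPU node, and compares weights against raw capacity only, so a "fits" result leaves no room for activations or KV cache. The helper name is illustrative.

```python
GB = 10 ** 9  # decimal gigabytes, matching the rounding used in the table

def weight_gb(num_params, bytes_per_param):
    """Weights-only storage in GB; excludes activations, optimizer state, KV cache."""
    return num_params * bytes_per_param / GB

params_405b = 405e9
node_hbm_gb = 8 * 80  # assumed: 8 GPUs with 80 GB each

for name, bytes_per_param in [("fp32", 4), ("bf16", 2), ("int8", 1), ("int4", 0.5)]:
    size = weight_gb(params_405b, bytes_per_param)
    print(f"{name}: {size:.1f} GB, weights fit in node HBM: {size <= node_hbm_gb}")
```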
During training, the forward pass stores intermediate activations for use in the backward pass. Activation memory typically scales with batch size, sequence length, hidden width, and depth. For large transformers it can match or exceed the parameter footprint, which is why techniques like gradient checkpointing, sequence parallelism, and activation offloading exist. As a rough rule of thumb in Megatron-style training without recomputation, activation memory for a transformer can be on the order of the parameter memory again, sometimes more.
The optimizer keeps its own per-parameter state. Adam and AdamW track two fp32 moments per parameter, so they add 8 bytes per parameter on top of the weights [11]. In standard mixed-precision training with an fp32 master copy and Adam moments, the per-parameter cost is roughly [11]:
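- 2 bytes for the fp16 or bf16 weights
- 2 bytes for the fp16 or bf16 gradients
- 4 bytes for the fp32 master copy of the weights
- 4 bytes for the fp32 Adam first moment
- 4 bytes for the fp32 Adam second moment
This is the famous "16 bytes per parameter" budget that motivates ZeRO sharding. For an 8B model that already amounts to about 128 GB before activations or KV cache. Eight-bit Adam variants (for example bitsandbytes' adamw_8bit) cut the moment storage to 1 byte each, shrinking the 8 bytes of Adam state to 2 [12].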
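The per-parameter budget can be written out as a small sketch. The breakdown below is the usual mixed-precision Adam accounting that sums to 16 bytes. The 8-bit variant assumes everything except the two moments stays unchanged and ignores any quantization metadata. The dictionary keys and helper name are illustrative.

```python
# Bytes per parameter for mixed-precision training with Adam.
MIXED_PRECISION_ADAM = {
    "half_precision_weights": 2,
    "half_precision_gradients": 2,
    "fp32_master_weights": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}

# Assumed variant: both Adam moments stored in 1 byte each.
EIGHT_BIT_ADAM = {**MIXED_PRECISION_ADAM,
                  "adam_first_moment": 1,
                  "adam_second_moment": 1}

def model_state_gb(num_params, breakdown):
    """Model-state memory in decimal GB, before activations."""
    return num_params * sum(breakdown.values()) / 1e9

print(sum(MIXED_PRECISION_ADAM.values()), model_state_gb(8e9, MIXED_PRECISION_ADAM))
print(sum(EIGHT_BIT_ADAM.values()), model_state_gb(8e9, EIGHT_BIT_ADAM))
```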
For autoregressive LLM inference, the key and value projections from previous tokens are cached so attention does not have to recompute them. The KV cache size in bytes is [10]:
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element
The leading factor of 2 covers both K and V. With grouped-query attention, num_kv_heads is smaller than the total head count, which is one reason GQA is so popular for long-context serving. KV cache scales linearly with sequence length, so a 128k-token context can dwarf the weights themselves on a small model.
For Llama 3 8B at fp16, with 32 layers, hidden size 4096, 8 KV heads of dimension 128, the KV cache for a single 8k-token sequence is:
2 * 32 * 8 * 128 * 8192 * 1 * 2 bytes = ~1.07 GB
That is per sequence in the batch. Serving 32 concurrent 8k contexts pushes the KV cache past 30 GB, which is often the bottleneck rather than the weights.
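The KV cache formula translates directly into a function. The sketch below reproduces the Llama 3 8B figures from the text; the 32-head comparison assumes the total head count is hidden size divided by head dimension (4096 / 128), and the function name is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_element):
    """KV cache size in bytes; the factor of 2 covers keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)

# One 8k-token sequence at fp16 with 8 KV heads (grouped-query attention).
one_sequence = kv_cache_bytes(32, 8, 128, 8192, 1, 2)

# 32 concurrent 8k-token sequences.
batch_of_32 = kv_cache_bytes(32, 8, 128, 8192, 32, 2)

# Same sequence if every one of the 32 attention heads kept its own K and V.
without_gqa = kv_cache_bytes(32, 32, 128, 8192, 1, 2)

print(one_sequence / 1e9, batch_of_32 / 1e9, without_gqa / one_sequence)
```

With these numbers, grouped-query attention cuts the cache by a factor of 4 compared with caching all 32 heads.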
Quantization reduces bytes per element without changing the element count. The common steps in modern LLM serving are:
| Step | Bytes/param | Size relative to fp32 |
|---|---|---|
| fp32 baseline | 4 | 1.00x |
| fp16 or bf16 | 2 | 0.50x |
| fp8 (E4M3 or E5M2) | 1 | 0.25x |
| int8 | 1 | 0.25x |
| int4 (GPTQ, AWQ, GGUF Q4) | 0.5 | 0.125x |
| int3 / Q3 GGUF | ~0.375 | ~0.094x |
| int2 / Q2 GGUF | ~0.25 | ~0.063x |
Real quantized formats include extra metadata (group scales, zero points), so the on-disk size is usually a bit larger than the naive byte-per-parameter count. GGUF Q4_K_M, for instance, lands closer to 4.5 bits per weight on average.
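The metadata overhead can be estimated with a simple model. The layout below is hypothetical and is not the actual GGUF, GPTQ, or AWQ format: it assumes each group of weights stores one scale (and optionally one zero point) alongside the packed values. The function names are illustrative.

```python
def effective_bits_per_weight(weight_bits, group_size, scale_bits=16, zero_point_bits=0):
    """Average stored bits per weight when each group carries its own metadata."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size

def quantized_gb(num_params, bits_per_weight):
    """Storage in decimal GB for the given average bits per weight."""
    return num_params * bits_per_weight / 8 / 1e9

# Hypothetical 4-bit layouts with one fp16 scale per group.
coarse = effective_bits_per_weight(4, group_size=128)  # 4.125 bits
fine = effective_bits_per_weight(4, group_size=32)     # 4.5 bits

print(coarse, fine)
print(quantized_gb(8e9, 4), quantized_gb(8e9, fine))
```

Smaller groups track the weight distribution more closely but raise the average bits per weight, which is why real 4-bit files come out larger than the naive 0.5 bytes per parameter.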
Libraries that perform this work include bitsandbytes (NF4 and 8-bit linear layers), AutoGPTQ, AutoAWQ, and llama.cpp's GGUF format. Hardware-side, NVIDIA's Transformer Engine handles automatic FP8 scaling on Hopper and Blackwell [7].
Many machine learning workloads are bound by memory bandwidth, not raw arithmetic. Loading a billion fp16 weights from HBM to the streaming multiprocessors moves 2 GB of data no matter how fast the matmul units run. This is why halving the bytes per parameter often roughly doubles inference throughput on modern GPUs, even though the math is unchanged. Tensor size, in the byte-footprint sense, is therefore a first-order driver of decode speed for LLMs, not a secondary concern.
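A back-of-the-envelope sketch of this bound. It assumes single-sequence decoding in which every weight is read from memory once per generated token, and it ignores KV cache reads, compute time, and kernel overhead. The 2 TB/s bandwidth figure is a hypothetical round number, not a measurement of any particular GPU.

```python
def max_decode_tokens_per_s(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on decode speed if all weights are read once per token."""
    weight_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes

bandwidth = 2e12  # hypothetical: 2 TB/s of memory bandwidth

fp16_rate = max_decode_tokens_per_s(8e9, 2, bandwidth)
int8_rate = max_decode_tokens_per_s(8e9, 1, bandwidth)

print(fp16_rate, int8_rate, int8_rate / fp16_rate)
```

Under these assumptions the bound scales inversely with bytes per parameter, so going from 2 bytes to 1 byte doubles it.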
A few traps that catch newcomers and experienced practitioners alike:
- Treating tensor.size() in PyTorch as if it were the element count. It is not; it returns the shape [2]. Use numel() or size().numel() for the count.
- Reading NumPy's ndim as an axis length. a.ndim is the number of axes (the rank), not the size of any axis [15].
- Calling np.prod(a.shape) on huge arrays and getting a silent overflow on 32-bit np.int_. Use a.size or cast explicitly to np.int64.
- Confusing static t.shape (which can include None) with dynamic tf.shape(t) in TensorFlow graph mode.
- Double-counting shared (tied) weights when counting parameters: a naive sum(p.numel()) can over-count.
A short list of utilities that compute or report tensor and model sizes:
- torchinfo and the older torchsummary: print per-layer parameter and activation shapes for a PyTorch model.
- Hugging Face transformers' built-in model.num_parameters() and model.get_memory_footprint().
- Hugging Face Accelerate's infer_auto_device_map, which estimates per-device memory needs.
- DeepSpeed's estimate_zero2_model_states_mem_needs_all_live and a ZeRO-3 equivalent for training memory budgeting.
- nvidia-smi and torch.cuda.memory_allocated() for runtime measurement.
- llama.cpp: its --mlock and --n-gpu-layers flags rely on its internal byte accounting, visible at load time.
Imagine a box of Lego bricks arranged in a stack. The shape is how the stack is built: 3 bricks wide, 4 bricks tall, 5 bricks deep. The rank is just how many of those measurements there are, which is 3 (wide, tall, deep). The size can mean two different things depending on who you ask. Some people mean "how many bricks are in the box," which is 3 times 4 times 5, or 60 bricks. Other people, like the PyTorch crowd, mean the description of the stack itself: "3 by 4 by 5." And if you ask how heavy the box is, that depends on whether the bricks are big plastic ones or tiny micro pieces. That weight is the memory footprint, and it depends on the dtype.