See also: tensor, shape, tensor rank, dtype
In machine learning, "tensor size" can mean one of two related but distinct quantities, depending on which library you are reading. The term is genuinely overloaded, and confusing the two meanings is a common source of bugs.
The two meanings are:
- The total number of elements. A tensor with shape (3, 4, 5) has 3 * 4 * 5 = 60 elements. This is the meaning used by NumPy (ndarray.size), TensorFlow (tf.size), and JAX (jnp.size).
- The per-axis lengths, i.e. the shape. In PyTorch, tensor.size() returns a torch.Size object that is essentially a tuple subclass describing each dimension. PyTorch users get the element count from a separate method, tensor.numel().

In casual practice, "size" is also used loosely to mean memory footprint in bytes, which depends on the element count and the dtype. When someone says a model is "16 GB," they usually mean that the parameter tensors take 16 GB of memory in total at a particular precision, not that they contain 16 billion elements.
Because of this ambiguity, careful technical writing tries to distinguish three things: the shape (per-axis lengths), the element count (a single integer), and the byte size (element count multiplied by bytes per element).
The table below summarizes how each library exposes "size" for a hypothetical tensor t with shape (2, 3, 4).
| Library | Expression | Returns | Value for shape (2, 3, 4) |
|---|---|---|---|
| NumPy | arr.size | total elements (int) | 24 |
| NumPy | arr.shape | per-axis tuple | (2, 3, 4) |
| PyTorch | t.size() | torch.Size (a tuple subclass) | torch.Size([2, 3, 4]) |
| PyTorch | t.shape | same torch.Size | torch.Size([2, 3, 4]) |
| PyTorch | t.numel() | total elements (int) | 24 |
| PyTorch | t.size().numel() | total elements (int) | 24 |
| TensorFlow | tf.size(t) | 0-D int tensor | 24 |
| TensorFlow | t.shape | TensorShape | TensorShape([2, 3, 4]) |
| JAX | arr.size | total elements (int) | 24 |
| JAX | jnp.size(arr) | total elements (int) | 24 |
| JAX | arr.shape | per-axis tuple | (2, 3, 4) |
A few details are worth flagging. PyTorch's torch.Size is itself a tuple subclass, so it behaves like a tuple in indexing and unpacking but also has a helper method numel() that returns the product of its entries. Calling numel() on a torch.Size returns the element count of a tensor that would have that shape, not the rank of the tensor. The PyTorch issue tracker has documented this distinction explicitly, since users sometimes assume torch.Size.numel() returns the number of axes.
NumPy reports ndarray.size as a Python int of arbitrary precision, while np.prod(a.shape) returns a fixed-width np.int_. For very large arrays this matters because np.prod can silently overflow on 32-bit platforms.
TensorFlow's tf.size returns a 0-dimensional integer tensor (default tf.int32), not a Python integer. If the count exceeds the int32 range, you must pass out_type=tf.int64. The static t.shape returns a TensorShape object that may contain None entries for dynamic dimensions, while tf.shape(t) returns a 1-D tensor with the runtime shape.
JAX follows NumPy semantics: arr.size is the element count, and jnp.size(arr) mirrors np.size. Unlike NumPy's free function, jnp.size raises TypeError on Python lists or tuples instead of converting them implicitly.
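A short sketch of the calls summarized in the table above, assuming NumPy, PyTorch, TensorFlow, and JAX are all installed:

```python
import numpy as np
import torch
import tensorflow as tf
import jax.numpy as jnp

a = np.zeros((2, 3, 4))
t = torch.zeros(2, 3, 4)
x = tf.zeros((2, 3, 4))
j = jnp.zeros((2, 3, 4))

print(a.size)         # 24 -> element count (Python int)
print(a.shape)        # (2, 3, 4)
print(t.size())       # torch.Size([2, 3, 4]) -> per-axis lengths, not a count
print(t.numel())      # 24 -> element count
print(tf.size(x))     # tf.Tensor(24, shape=(), dtype=int32)
print(x.shape)        # TensorShape([2, 3, 4])
print(j.size)         # 24
print(jnp.size(j))    # 24
```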
The element count is always the product of the entries in the shape:
elements = shape[0] * shape[1] * ... * shape[rank - 1]
A few corner cases:
- A scalar (rank-0 tensor) has shape () and contains 1 element. numel(), tf.size, and NumPy's arr.size all return 1, not 0.
- A tensor with a zero-length axis, for example shape (3, 0, 4), has 0 elements. The shape is still meaningful and operations like reshape may need to preserve it.
- The element count is a property of the shape only. It does not depend on the dtype, the device, or whether the tensor is contiguous in memory.
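A minimal sketch of the element-count rule and its corner cases, using math.prod so the empty product for a scalar comes out as 1:

```python
import math
import numpy as np

def element_count(shape):
    # Product of the per-axis lengths; the empty product is 1,
    # so a scalar with shape () reports one element.
    return math.prod(shape)

print(element_count(()))         # 1  -> scalar
print(element_count((3, 0, 4)))  # 0  -> any zero-length axis empties the tensor
print(element_count((3, 4, 5)))  # 60

# The libraries agree:
print(np.zeros(()).size)         # 1
print(np.zeros((3, 0, 4)).size)  # 0
```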
Memory footprint is the practical meaning of "size" for anyone budgeting GPU RAM. The formula is simple:
bytes = element_count * bytes_per_element
In PyTorch, tensor.element_size() returns the bytes per element for the tensor's dtype, so t.numel() * t.element_size() is the storage size of a contiguous tensor. NumPy exposes the same number as arr.itemsize and the total as arr.nbytes. TensorFlow does not have a built-in helper, so users compute tf.size(t).numpy() * t.dtype.size.
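A quick sketch of those byte-footprint helpers:

```python
import numpy as np
import torch

a = np.zeros((2, 3, 4), dtype=np.float32)
print(a.itemsize)                    # 4 bytes per element
print(a.nbytes)                      # 96 = 24 elements * 4 bytes

t = torch.zeros(2, 3, 4, dtype=torch.float16)
print(t.element_size())              # 2 bytes per element
print(t.numel() * t.element_size())  # 48 -> storage of a contiguous tensor
```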
The table below lists the byte sizes for common dtypes used in modern deep learning. Numbers are bytes per scalar element.
| Dtype | Bytes | Common uses |
|---|---|---|
| float64 (fp64) | 8 | scientific computing, rarely used in deep learning |
| float32 (fp32) | 4 | classic training default; baseline weights |
| float16 (fp16) | 2 | mixed-precision training, inference |
| bfloat16 (bf16) | 2 | TPU and modern GPU training, same range as fp32 |
| float8 e4m3 (fp8) | 1 | forward activations, weights on Hopper and Blackwell |
| float8 e5m2 (fp8) | 1 | gradients during backward pass |
| int64 | 8 | indices, large counters |
| int32 | 4 | indices, ids |
| int16 | 2 | quantized intermediate values |
| int8 / uint8 | 1 | quantized weights or activations, image bytes |
| int4 (packed) | 0.5 | aggressive quantization for LLM serving |
| bool | 1 | masks; stored as a full byte despite being one bit logically |
| complex64 | 8 | signal processing, two fp32 components |
| complex128 | 16 | high-precision signal processing |
A practical detail: PyTorch's bool tensor uses one byte per element, not one bit. If you want true bit-packing you have to use a uint8 tensor and pack manually. The PyTorch issue tracker has a long-running discussion about this for users who expected packed booleans.
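A small illustration of the one-byte-per-bool behavior, with NumPy's np.packbits standing in for the manual packing mentioned above:

```python
import numpy as np
import torch

mask = torch.ones(1000, dtype=torch.bool)
print(mask.element_size())                 # 1 -> a full byte per element
print(mask.numel() * mask.element_size())  # 1000 bytes, not 125

# Manual bit-packing via NumPy: 8 booleans per uint8 byte.
packed = np.packbits(mask.numpy())
print(packed.nbytes)                       # 125 bytes
```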
FP8 came to mainstream hardware with the NVIDIA Hopper generation and is now a fixture on Blackwell. The two encodings are E4M3 (4 exponent bits, 3 mantissa bits, 1 sign bit, max value about 448) and E5M2 (5 exponent bits, 2 mantissa bits, 1 sign bit, max value about 57344). Both use 1 byte per element. The convention recommended by NVIDIA is E4M3 for forward activations and weights, E5M2 for backward gradients where dynamic range matters more than precision.
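On a recent PyTorch build that ships the float8 dtypes (2.1 or later), the quoted ranges can be checked directly:

```python
import torch

print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0
print(torch.finfo(torch.float8_e5m2).max)    # 57344.0
print(torch.empty(0, dtype=torch.float8_e4m3fn).element_size())  # 1 byte per element
```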
A float32 matrix of shape (1024, 1024) has:

- 1024 * 1024 = 1,048,576 elements,
- 4 bytes per element, and therefore
- 4,194,304 bytes of storage, i.e. 4 MiB.
In bfloat16 the same matrix is 2 MiB. In int8 it is 1 MiB. In int4 (packed two-per-byte) it is 0.5 MiB. The shape did not change; only the bytes per element did.
When people say a model is "7B" or "70B," they mean the total count of trainable parameters across every weight tensor. To get that number programmatically:
- PyTorch: sum(p.numel() for p in model.parameters())
- PyTorch, trainable parameters only: sum(p.numel() for p in model.parameters() if p.requires_grad)
- Hugging Face transformers: model.num_parameters() (with optional only_trainable=True)
- Keras: model.count_params()

Multiplying the parameter count by bytes per element gives the storage cost of the weights themselves. The table below shows weight memory at common precisions for several well-known checkpoints (a code sketch follows it). These figures are weights only; activations, optimizer state, and KV cache are separate.
| Model | Parameters | fp32 weights | fp16/bf16 weights | int8 weights | int4 weights |
|---|---|---|---|---|---|
| BERT-base | 110M | ~440 MB | ~220 MB | ~110 MB | ~55 MB |
| BERT-large | 340M | ~1.36 GB | ~680 MB | ~340 MB | ~170 MB |
| GPT-2 small | 117M | ~468 MB | ~234 MB | ~117 MB | ~59 MB |
| GPT-2 XL | 1.5B | ~6 GB | ~3 GB | ~1.5 GB | ~0.75 GB |
| Llama 3 8B | 8.0B | ~32 GB | ~16 GB | ~8 GB | ~4 GB |
| Llama 3 70B | 70B | ~280 GB | ~140 GB | ~70 GB | ~35 GB |
| Llama 3.1 405B | 405B | ~1.62 TB | ~810 GB | ~405 GB | ~203 GB |
The Hugging Face release of Llama 3.1 405B reported about 812 GB for the bf16 instruction-tuned variant, which lines up with the 405B * 2 bytes calculation. The fp32 release is roughly 2 TB. Even on an 8-way H100 node with about 640 GB of HBM, you cannot fit the bf16 weights of 405B without sharding across nodes or dropping to FP8, which fits in roughly 486 GB.
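As a code sketch, the parameter count and the weight-only memory estimate look like this in PyTorch; the two-layer stand-in model is only for illustration, and the byte widths match the table:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))  # stand-in model

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(total, trainable)  # 33562624 33562624 for this stand-in

# Weight-only footprint at common precisions, here for an 8B-parameter checkpoint.
n_params = 8_000_000_000
for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name}: {n_params * bytes_per_param / 1e9:.0f} GB")
```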
During training, the forward pass stores intermediate activations for use in the backward pass. Activation memory typically scales with batch size, sequence length, hidden width, and depth. For large transformers it can match or exceed the parameter footprint, which is why techniques like gradient checkpointing, sequence parallelism, and activation offloading exist. As a rough rule of thumb in Megatron-style training without recomputation, activations for a transformer can require on the order of the parameter count again, sometimes more.
The optimizer keeps its own per-parameter state. Adam and AdamW track two fp32 moments per parameter, so they add 8 bytes per parameter on top of the weights. In standard mixed-precision training with an fp32 master copy and Adam moments, the per-parameter cost is roughly:

- 2 bytes for the fp16/bf16 weight,
- 2 bytes for the fp16/bf16 gradient,
- 4 bytes for the fp32 master copy of the weight, and
- 4 + 4 bytes for the two fp32 Adam moments.
This is the famous "16 bytes per parameter" budget that motivates ZeRO sharding. For an 8B model that already amounts to about 128 GB before activations or KV cache. Eight-bit Adam variants (for example bitsandbytes' adamw_8bit) cut the moment storage to 1 byte each, bringing the total down significantly.
For autoregressive LLM inference, the key and value projections from previous tokens are cached so attention does not have to recompute them. The KV cache size in bytes is:
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element
The leading factor of 2 covers both K and V. With grouped-query attention, num_kv_heads is smaller than the total head count, which is one reason GQA is so popular for long-context serving. KV cache scales linearly with sequence length, so a 128k-token context can dwarf the weights themselves on a small model.
For Llama 3 8B at fp16, with 32 layers, hidden size 4096, 8 KV heads of dimension 128, the KV cache for a single 8k-token sequence is:
2 * 32 * 8 * 128 * 8192 * 1 * 2 bytes = ~1.07 GB
That is per sequence in the batch. Serving 32 concurrent 8k contexts pushes the KV cache past 30 GB, which is often the bottleneck rather than the weights.
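The same formula as a small helper; the function and argument names are illustrative, not any library's API:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    # The leading factor of 2 covers both the K and the V tensors per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element

# Llama 3 8B-style config at fp16, one 8k-token sequence:
print(kv_cache_bytes(32, 8, 128, 8192) / 2**30)                 # 1.0 GiB
# 32 concurrent 8k-token contexts:
print(kv_cache_bytes(32, 8, 128, 8192, batch_size=32) / 2**30)  # 32.0 GiB
```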
Quantization reduces bytes per element without changing the element count. The common steps in modern LLM serving are:
| Step | Bytes/param | Reduction vs fp32 |
|---|---|---|
| fp32 baseline | 4 | 1.00x |
| fp16 or bf16 | 2 | 0.50x |
| fp8 (E4M3 or E5M2) | 1 | 0.25x |
| int8 | 1 | 0.25x |
| int4 (GPTQ, AWQ, GGUF Q4) | 0.5 | 0.125x |
| int3 / Q3 GGUF | ~0.375 | ~0.094x |
| int2 / Q2 GGUF | ~0.25 | ~0.063x |
Real quantized formats include extra metadata (group scales, zero points), so the on-disk size is usually a bit larger than the naive byte-per-parameter count. GGUF Q4_K_M, for instance, lands closer to 4.5 bits per weight on average.
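A rough sketch of how per-group metadata inflates the effective bits per weight; the group size and the scale/zero-point widths are assumptions, not any particular format's spec:

```python
def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_point_bits=16):
    # Each group of `group_size` weights carries one scale and one zero point.
    metadata_bits = scale_bits + zero_point_bits
    return bits + metadata_bits / group_size

print(effective_bits_per_weight())               # 4.25 bits per weight
print(effective_bits_per_weight(group_size=32))  # 5.0 bits per weight
```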
Libraries that perform this work include bitsandbytes (NF4 and 8-bit linear layers), AutoGPTQ, AutoAWQ, and llama.cpp's GGUF format. Hardware-side, NVIDIA's Transformer Engine handles automatic FP8 scaling on Hopper and Blackwell.
Many machine learning workloads are bound by memory bandwidth, not raw arithmetic. Loading a billion fp16 weights from HBM to the streaming multiprocessors costs 2 GB of bandwidth no matter how fast the matmul units run. This is why halving the bytes per parameter often roughly doubles inference throughput on modern GPUs, even though the math is unchanged. Tensor size, in the byte-footprint sense, is therefore a first-order driver of decode speed for LLMs, not a secondary concern.
A few traps that catch newcomers and experienced practitioners alike:
- Reading tensor.size() in PyTorch as if it were the element count. It is not. Use numel() or size().numel().
- Calling np.prod(a.shape) on huge arrays and getting a silent overflow on 32-bit np.int_. Use a.size or cast explicitly to np.int64.
- Confusing the static t.shape (which can include None) with the dynamic tf.shape(t) in TensorFlow graph mode.
- Parameter sharing, such as tied embedding and output weights, where a naive per-module sum(p.numel()) can over-count.

A short list of utilities that compute or report tensor and model sizes:
- torchinfo and the older torchsummary: print per-layer parameter and activation shapes for a PyTorch model.
- transformers' built-in model.num_parameters() and model.get_memory_footprint().
- accelerate's infer_auto_device_map, which estimates per-device memory needs.
- DeepSpeed's estimate_zero2_model_states_mem_needs_all_live and a ZeRO-3 equivalent for training memory budgeting.
- nvidia-smi and torch.cuda.memory_allocated() for runtime measurement.
- llama.cpp: its --mlock and --n-gpu-layers flags rely on its internal byte accounting, visible at load time.

Imagine a box of Lego bricks arranged in a stack. The shape is how the stack is built: 3 bricks wide, 4 bricks tall, 5 bricks deep. The size can mean two different things depending on who you ask. Some people mean "how many bricks are in the box," which is 3 times 4 times 5, or 60 bricks. Other people, like the PyTorch crowd, mean the description of the stack itself: "3 by 4 by 5." And if you ask how heavy the box is, that depends on whether the bricks are big plastic ones or tiny micro pieces. That weight is the memory footprint, and it depends on the dtype.