See also: tensor, shape, tensor rank, dtype
In machine learning, "tensor size" can mean one of two related but distinct quantities, depending on which library you are reading. The term is genuinely overloaded, and confusing the two meanings is a common source of bugs.
The two meanings are:
- The total number of elements. A tensor with shape (3, 4, 5) has 3 * 4 * 5 = 60 elements. This is the meaning used by NumPy (ndarray.size), TensorFlow (tf.size), and JAX (jnp.size).
- The per-axis lengths, i.e. the shape. In PyTorch, tensor.size() returns a torch.Size object that is essentially a tuple subclass describing each dimension. PyTorch users get the element count from a separate method, tensor.numel().

In casual practice, "size" is also used loosely to mean memory footprint in bytes, which depends on the element count and the dtype. When someone says a model is "16 GB," they usually mean that the parameter tensors take 16 GB of memory in total at a particular precision, not that they contain 16 billion elements.
Because of this ambiguity, careful technical writing tries to distinguish three things: the shape (per-axis lengths), the element count (a single integer), and the byte size (element count multiplied by bytes per element).
The table below summarizes how each library exposes "size" for a hypothetical tensor t with shape (2, 3, 4).
| Library | Expression | Returns | Value for shape (2, 3, 4) |
|---|---|---|---|
| NumPy | arr.size | total elements (int) | 24 |
| NumPy | arr.shape | per-axis tuple | (2, 3, 4) |
| PyTorch | t.size() | torch.Size (a tuple subclass) | torch.Size([2, 3, 4]) |
| PyTorch | t.shape | same torch.Size | torch.Size([2, 3, 4]) |
| PyTorch | t.numel() | total elements (int) | 24 |
| PyTorch | t.size().numel() | total elements (int) | 24 |
| TensorFlow | tf.size(t) | 0-D int tensor | 24 |
| TensorFlow | t.shape | TensorShape | TensorShape([2, 3, 4]) |
| JAX | arr.size | total elements (int) | 24 |
| JAX | jnp.size(arr) | total elements (int) | 24 |
| JAX | arr.shape | per-axis tuple | (2, 3, 4) |
A few details are worth flagging. PyTorch's torch.Size is itself a tuple subclass, so it behaves like a tuple in indexing and unpacking but also has a helper method numel() that returns the product of its entries. Calling numel() on a torch.Size returns the element count of a tensor that would have that shape, not the rank of the tensor. The PyTorch issue tracker has documented this distinction explicitly, since users sometimes assume torch.Size.numel() returns the number of axes.
NumPy reports ndarray.size as a Python int of arbitrary precision, while np.prod(a.shape) returns a fixed-width np.int_. For very large arrays this matters because np.prod can silently overflow on 32-bit platforms.
TensorFlow's tf.size returns a 0-dimensional integer tensor (default tf.int32), not a Python integer. If the count exceeds the int32 range, you must pass out_type=tf.int64. The static t.shape returns a TensorShape object that may contain None entries for dynamic dimensions, while tf.shape(t) returns a 1-D tensor with the runtime shape.
JAX follows NumPy semantics: arr.size is the element count, and jnp.size(arr) mirrors np.size. Unlike NumPy's free function, jnp.size raises TypeError on Python lists or tuples instead of converting them implicitly.
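A short sketch of the calls summarized in the table above, assuming NumPy, PyTorch, TensorFlow, and JAX are all installed:

```python
import numpy as np
import torch
import tensorflow as tf
import jax.numpy as jnp

a = np.zeros((2, 3, 4))
t = torch.zeros(2, 3, 4)
x = tf.zeros((2, 3, 4))
j = jnp.zeros((2, 3, 4))

print(a.size)         # 24 -> element count (Python int)
print(a.shape)        # (2, 3, 4)
print(t.size())       # torch.Size([2, 3, 4]) -> per-axis lengths, not a count
print(t.numel())      # 24 -> element count
print(tf.size(x))     # tf.Tensor(24, shape=(), dtype=int32)
print(x.shape)        # TensorShape([2, 3, 4])
print(j.size)         # 24
print(jnp.size(j))    # 24
```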
The element count is always the product of the entries in the shape:
elements = shape[0] * shape[1] * ... * shape[rank - 1]
A few corner cases:
- A scalar (rank-0 tensor) has shape () and contains 1 element. numel(), tf.size, and NumPy's arr.size all return 1, not 0.
- A tensor with a zero-length axis, for example shape (3, 0, 4), has 0 elements. The shape is still meaningful and operations like reshape may need to preserve it.
- The element count is a property of the shape only. It does not depend on the dtype, the device, or whether the tensor is contiguous in memory.
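A minimal sketch of the element-count rule and its corner cases, using math.prod so the empty product for a scalar comes out as 1:

```python
import math
import numpy as np

def element_count(shape):
    # Product of the per-axis lengths; the empty product is 1,
    # so a scalar with shape () reports one element.
    return math.prod(shape)

print(element_count(()))         # 1  -> scalar
print(element_count((3, 0, 4)))  # 0  -> any zero-length axis empties the tensor
print(element_count((3, 4, 5)))  # 60

# The libraries agree:
print(np.zeros(()).size)         # 1
print(np.zeros((3, 0, 4)).size)  # 0
```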
Memory footprint is the practical meaning of "size" for anyone budgeting GPU RAM. The formula is simple:
bytes = element_count * bytes_per_element
In PyTorch, tensor.element_size() returns the bytes per element for the tensor's dtype, so t.numel() * t.element_size() is the storage size of a contiguous tensor. NumPy exposes the same number as arr.itemsize and the total as arr.nbytes. TensorFlow does not have a built-in helper, so users compute tf.size(t).numpy() * t.dtype.size.
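A quick sketch of those byte-footprint helpers:

```python
import numpy as np
import torch

a = np.zeros((2, 3, 4), dtype=np.float32)
print(a.itemsize)                    # 4 bytes per element
print(a.nbytes)                      # 96 = 24 elements * 4 bytes

t = torch.zeros(2, 3, 4, dtype=torch.float16)
print(t.element_size())              # 2 bytes per element
print(t.numel() * t.element_size())  # 48 -> storage of a contiguous tensor
```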
The table below lists the byte sizes for common dtypes used in modern deep learning. Numbers are bytes per scalar element.
| Dtype | Bytes | Common uses |
|---|---|---|
| float64 (fp64) | 8 | scientific computing, rarely used in deep learning |
| float32 (fp32) | 4 | classic training default; baseline weights |
| float16 (fp16) | 2 | mixed-precision training, inference |
| bfloat16 (bf16) | 2 | TPU and modern GPU training, same range as fp32 |
| float8 e4m3 (fp8) | 1 | forward activations, weights on Hopper and Blackwell |
| float8 e5m2 (fp8) | 1 | gradients during backward pass |
| int64 | 8 | indices, large counters |
| int32 | 4 | indices, ids |
| int16 | 2 | quantized intermediate values |
| int8 / uint8 | 1 | quantized weights or activations, image bytes |
| int4 (packed) | 0.5 | aggressive quantization for LLM serving |
| bool | 1 | masks; stored as a full byte despite being one bit logically |
| complex64 | 8 | signal processing, two fp32 components |
| complex128 | 16 | high-precision signal processing |
A practical detail: PyTorch's bool tensor uses one byte per element, not one bit. If you want true bit-packing you have to use a uint8 tensor and pack manually. The PyTorch issue tracker has a long-running discussion about this for users who expected packed booleans.
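A small illustration of the one-byte-per-bool behavior, with NumPy's np.packbits standing in for the manual packing mentioned above:

```python
import numpy as np
import torch

mask = torch.ones(1000, dtype=torch.bool)
print(mask.element_size())                 # 1 -> a full byte per element
print(mask.numel() * mask.element_size())  # 1000 bytes, not 125

# Manual bit-packing via NumPy: 8 booleans per uint8 byte.
packed = np.packbits(mask.numpy())
print(packed.nbytes)                       # 125 bytes
```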
FP8 came to mainstream hardware with the NVIDIA Hopper generation and is now a fixture on Blackwell. The two encodings are E4M3 (4 exponent bits, 3 mantissa bits, 1 sign bit, max value about 448) and E5M2 (5 exponent bits, 2 mantissa bits, 1 sign bit, max value about 57344). Both use 1 byte per element. The convention recommended by NVIDIA is E4M3 for forward activations and weights, E5M2 for backward gradients where dynamic range matters more than precision.
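On a recent PyTorch build that ships the float8 dtypes (2.1 or later), the quoted ranges can be checked directly:

```python
import torch

print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0
print(torch.finfo(torch.float8_e5m2).max)    # 57344.0
print(torch.empty(0, dtype=torch.float8_e4m3fn).element_size())  # 1 byte per element
```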
A float32 matrix of shape (1024, 1024) has:

- 1024 * 1024 = 1,048,576 elements,
- 4 bytes per element, and therefore
- 4,194,304 bytes of storage, i.e. 4 MiB.
In bfloat16 the same matrix is 2 MiB. In int8 it is 1 MiB. In int4 (packed two-per-byte) it is 0.5 MiB. The shape did not change; only the bytes per element did.
When people say a model is "7B" or "70B," they mean the total count of trainable parameters across every weight tensor. To get that number programmatically:
- PyTorch: sum(p.numel() for p in model.parameters())
- PyTorch, trainable parameters only: sum(p.numel() for p in model.parameters() if p.requires_grad)
- Hugging Face transformers: model.num_parameters() (with optional only_trainable=True)
- Keras: model.count_params()

Multiplying the parameter count by bytes per element gives the storage cost of the weights themselves. The table below shows weight memory at common precisions for several well-known checkpoints (a code sketch follows it). These figures are weights only; activations, optimizer state, and KV cache are separate.
| Model | Parameters | fp32 weights | fp16/bf16 weights | int8 weights | int4 weights |
|---|---|---|---|---|---|
| BERT-base | 110M | ~440 MB | ~220 MB | ~110 MB | ~55 MB |
| BERT-large | 340M | ~1.36 GB | ~680 MB | ~340 MB | ~170 MB |
| GPT-2 small | 117M | ~468 MB | ~234 MB | ~117 MB | ~59 MB |
| GPT-2 XL | 1.5B | ~6 GB | ~3 GB | ~1.5 GB | ~0.75 GB |
| Llama 3 8B | 8.0B | ~32 GB | ~16 GB | ~8 GB | ~4 GB |
| Llama 3 70B | 70B | ~280 GB | ~140 GB | ~70 GB | ~35 GB |
| Llama 3.1 405B | 405B | ~1.62 TB | ~810 GB | ~405 GB | ~203 GB |
The Hugging Face release of Llama 3.1 405B reported about 812 GB for the bf16 instruction-tuned variant, which lines up with the 405B * 2 bytes calculation. The fp32 release is roughly 2 TB. Even on an 8-way H100 node with about 640 GB of HBM, you cannot fit the bf16 weights of 405B without sharding across nodes or dropping to FP8, which fits in roughly 486 GB.
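As a code sketch, the parameter count and the weight-only memory estimate look like this in PyTorch; the two-layer stand-in model is only for illustration, and the byte widths match the table:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))  # stand-in model

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(total, trainable)  # 33562624 33562624 for this stand-in

# Weight-only footprint at common precisions, here for an 8B-parameter checkpoint.
n_params = 8_000_000_000
for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name}: {n_params * bytes_per_param / 1e9:.0f} GB")
```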
During training, the forward pass stores intermediate activations for use in the backward pass. Activation memory typically scales with batch size, sequence length, hidden width, and depth. For large transformers it can match or exceed the parameter footprint, which is why techniques like gradient checkpointing, sequence parallelism, and activation offloading exist. As a rough rule of thumb in Megatron-style training without recomputation, activations for a transformer can require on the order of the parameter count again, sometimes more.
The optimizer keeps its own per-parameter state. Adam and AdamW track two fp32 moments per parameter, so they add 8 bytes per parameter on top of the weights. In standard mixed-precision training with an fp32 master copy and Adam moments, the per-parameter cost is roughly:

- 2 bytes for the fp16/bf16 weight,
- 2 bytes for the fp16/bf16 gradient,
- 4 bytes for the fp32 master copy of the weight, and
- 4 + 4 bytes for the two fp32 Adam moments.
This is the famous "16 bytes per parameter" budget that motivates ZeRO sharding. For an 8B model that already amounts to about 128 GB before activations or KV cache. Eight-bit Adam variants (for example bitsandbytes' adamw_8bit) cut the moment storage to 1 byte each, bringing the total down significantly.
For autoregressive LLM inference, the key and value projections from previous tokens are cached so attention does not have to recompute them. The KV cache size in bytes is:
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element
The leading factor of 2 covers both K and V. With grouped-query attention, num_kv_heads is smaller than the total head count, which is one reason GQA is so popular for long-context serving. KV cache scales linearly with sequence length, so a 128k-token context can dwarf the weights themselves on a small model.
For Llama 3 8B at fp16, with 32 layers, hidden size 4096, 8 KV heads of dimension 128, the KV cache for a single 8k-token sequence is:
2 * 32 * 8 * 128 * 8192 * 1 * 2 bytes = ~1.07 GB
That is per sequence in the batch. Serving 32 concurrent 8k contexts pushes the KV cache past 30 GB, which is often the bottleneck rather than the weights.
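The same formula as a small helper; the function and argument names are illustrative, not any library's API:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    # The leading factor of 2 covers both the K and the V tensors per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element

# Llama 3 8B-style config at fp16, one 8k-token sequence:
print(kv_cache_bytes(32, 8, 128, 8192) / 2**30)                 # 1.0 GiB
# 32 concurrent 8k-token contexts:
print(kv_cache_bytes(32, 8, 128, 8192, batch_size=32) / 2**30)  # 32.0 GiB
```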
Quantization reduces bytes per element without changing the element count. The common steps in modern LLM serving are:
| Step | Bytes/param | Reduction vs fp32 |
|---|---|---|
| fp32 baseline | 4 | 1.00x |
| fp16 or bf16 | 2 | 0.50x |
| fp8 (E4M3 or E5M2) | 1 | 0.25x |
| int8 | 1 | 0.25x |
| int4 (GPTQ, AWQ, GGUF Q4) | 0.5 | 0.125x |
| int3 / Q3 GGUF | ~0.375 | ~0.094x |
| int2 / Q2 GGUF | ~0.25 | ~0.063x |
Real quantized formats include extra metadata (group scales, zero points), so the on-disk size is usually a bit larger than the naive byte-per-parameter count. GGUF Q4_K_M, for instance, lands closer to 4.5 bits per weight on average.
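A rough sketch of how per-group metadata inflates the effective bits per weight; the group size and the scale/zero-point widths are assumptions, not any particular format's spec:

```python
def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_point_bits=16):
    # Each group of `group_size` weights carries one scale and one zero point.
    metadata_bits = scale_bits + zero_point_bits
    return bits + metadata_bits / group_size

print(effective_bits_per_weight())               # 4.25 bits per weight
print(effective_bits_per_weight(group_size=32))  # 5.0 bits per weight
```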
Libraries that perform this work include bitsandbytes (NF4 and 8-bit linear layers), AutoGPTQ, AutoAWQ, and llama.cpp's GGUF format. Hardware-side, NVIDIA's Transformer Engine handles automatic FP8 scaling on Hopper and Blackwell.
Many machine learning workloads are bound by memory bandwidth, not raw arithmetic. Loading a billion fp16 weights from HBM to the streaming multiprocessors costs 2 GB of bandwidth no matter how fast the matmul units run. This is why halving the bytes per parameter often roughly doubles inference throughput on modern GPUs, even though the math is unchanged. Tensor size, in the byte-footprint sense, is therefore a first-order driver of decode speed for LLMs, not a secondary concern.
A few traps that catch newcomers and experienced practitioners alike:
- Reading tensor.size() in PyTorch as if it were the element count. It is not. Use numel() or size().numel().
- Calling np.prod(a.shape) on huge arrays and getting a silent overflow on 32-bit np.int_. Use a.size or cast explicitly to np.int64.
- Confusing the static t.shape (which can include None) with the dynamic tf.shape(t) in TensorFlow graph mode.
- Parameter sharing, such as tied embedding and output weights, where a naive per-module sum(p.numel()) can over-count.

A short list of utilities that compute or report tensor and model sizes:
- torchinfo and the older torchsummary: print per-layer parameter and activation shapes for a PyTorch model.
- transformers' built-in model.num_parameters() and model.get_memory_footprint().
- accelerate's infer_auto_device_map, which estimates per-device memory needs.
- DeepSpeed's estimate_zero2_model_states_mem_needs_all_live and a ZeRO-3 equivalent for training memory budgeting.
- nvidia-smi and torch.cuda.memory_allocated() for runtime measurement.
- llama.cpp: its --mlock and --n-gpu-layers flags rely on its internal byte accounting, visible at load time.

Imagine a box of Lego bricks arranged in a stack. The shape is how the stack is built: 3 bricks wide, 4 bricks tall, 5 bricks deep. The size can mean two different things depending on who you ask. Some people mean "how many bricks are in the box," which is 3 times 4 times 5, or 60 bricks. Other people, like the PyTorch crowd, mean the description of the stack itself: "3 by 4 by 5." And if you ask how heavy the box is, that depends on whether the bricks are big plastic ones or tiny micro pieces. That weight is the memory footprint, and it depends on the dtype.