# Tensor size

> Source: https://aiwiki.ai/wiki/tensor_size
> Updated: 2026-06-27
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [tensor](/wiki/tensor), [shape](/wiki/shape_tensor), [tensor rank](/wiki/tensor_rank), [dtype](/wiki/dtype)*

The **size of a tensor** is a description of how big the tensor is, and the term carries two distinct meanings in everyday deep learning. In its most precise sense, a tensor's size is its **shape**: the tuple of per-axis lengths, for example `(2, 3, 4)` for a three-axis tensor [1][2]. In a second, equally common sense, "size" means the **total number of elements**, which is the product of those axis lengths (`2 * 3 * 4 = 24`) [1][5]. These two readings are split across libraries: [PyTorch](/wiki/pytorch)'s `tensor.size()` returns the shape, while [NumPy](/wiki/numpy)'s `ndarray.size` returns the element count [2][3][1]. Closely related but separate is the tensor's **rank** (its number of axes, also called `ndim`), and its **memory footprint** in bytes (element count multiplied by bytes per element) [5][4].

## What is the size of a tensor?

In machine learning, **tensor size** is one of two related but distinct quantities, depending on which library you are reading. The word is genuinely overloaded, and confusing the two meanings is a common source of bugs.

The two meanings are:

1. **Total element count.** The product of the dimensions of a [tensor](/wiki/tensor). A tensor with shape `(3, 4, 5)` has `3 * 4 * 5 = 60` elements. This is the meaning used by [NumPy](/wiki/numpy) (`ndarray.size`), [TensorFlow](/wiki/tensorflow) (`tf.size`), and [JAX](/wiki/jax) (`jnp.size`) [1][5][6].
2. **The shape itself.** The tuple of per-axis lengths. This is the meaning used by [PyTorch](/wiki/pytorch), where `tensor.size()` returns a `torch.Size` object that is essentially a tuple subclass describing each dimension [2]. PyTorch users get the element count from a separate method, `tensor.numel()` [3].

The TensorFlow tensor guide draws the line crisply, defining **shape** as "The length (number of elements) of each of the axes of a tensor" and **size** as "The total number of items in the tensor, the product of the shape vector's elements" [14]. Keeping those two definitions separate is the single most useful habit when reasoning about tensor dimensions.

In casual practice, "size" is also used loosely to mean **memory footprint** in bytes, which depends on the element count and the [dtype](/wiki/dtype). When someone says a model is "16 GB," they usually mean the total parameter tensors take 16 GB of memory at a particular precision, not that they have 16 GB of elements.

Because of this ambiguity, careful technical writing tries to distinguish three things: the [shape](/wiki/shape_tensor) (per-axis lengths), the element count (a single integer), and the byte size (element count multiplied by bytes per element).

## What is the difference between shape, rank, and size?

Shape, rank (`ndim`), and size answer three different questions about the same tensor. The TensorFlow documentation defines them this way [14]:

| Term | TensorFlow definition (verbatim) | For a tensor of shape (2, 3, 4) |
|---|---|---|
| Shape | "The length (number of elements) of each of the axes of a tensor." | (2, 3, 4) |
| Rank (ndim) | "Number of tensor axes. A scalar has rank 0, a vector has rank 1, a matrix is rank 2." | 3 |
| Axis or Dimension | "A particular dimension of a tensor." | axis 0 has length 2, axis 1 has length 3, axis 2 has length 4 |
| Size | "The total number of items in the tensor, the product of the shape vector's elements." | 24 |

In other words:

- **Shape** is the full tuple of axis lengths. It is the richest description; rank and size can both be derived from it.
- **Rank** (also called `ndim` or the number of axes) is just how many entries that tuple has. The shape `(2, 3, 4)` has rank 3. A scalar has rank 0 and shape `()`.
- **Size** in the element-count sense is the product of the tuple's entries, a single integer.

The NumPy manual describes the `shape` attribute as "Tuple of array dimensions" and the `ndim` attribute as "Number of array dimensions" [1][15]. Note the subtlety: NumPy uses the word "dimensions" for what TensorFlow calls axes, so `np.ndarray.ndim` returns the rank (axis count), not any single axis length. This is a frequent source of confusion between the two ecosystems.

### The scalar, vector, matrix, tensor rank ladder

Rank is often introduced through the familiar ladder of mathematical objects. The TensorFlow guide spells it out: "A scalar has rank 0, a vector has rank 1, a matrix is rank 2" [14], and the pattern continues upward.

| Object | Rank (ndim) | Example shape | Typical use |
|---|---|---|---|
| Scalar | 0 | () | a single loss value, a learning rate |
| Vector | 1 | (768,) | an embedding, a bias vector |
| Matrix | 2 | (1024, 1024) | a linear layer weight, a batch of vectors |
| 3-tensor | 3 | (32, 128, 768) | (batch, sequence, hidden) for a transformer |
| 4-tensor | 4 | (32, 3, 224, 224) | (batch, channels, height, width) for an image batch |
| 5-tensor | 5 | (32, 16, 3, 224, 224) | (batch, frames, channels, height, width) for video |

In deep learning the word "tensor" is used loosely for an array of any rank, including rank 0 and rank 1, even though the ladder above reserves the name for objects of rank 3 and higher. Frameworks make no distinction: a PyTorch scalar and a PyTorch image batch are both `torch.Tensor` objects.

## How do you get a tensor's shape and size in NumPy?

NumPy exposes three separate attributes, and the official one-line descriptions make the split explicit. The shape attribute is documented as "Tuple of array dimensions," `ndim` as "Number of array dimensions," and `size` as the total element count [1][15].

```python
import numpy as np
a = np.zeros((2, 3, 4))
a.shape   # (2, 3, 4)   tuple of per-axis lengths
a.ndim    # 3           number of axes (rank)
a.size    # 24          total elements
len(a)    # 2           length of the first axis only
```

A common mistake is to use `len(a)`, which returns only the length of the first axis, when you wanted `a.size` (all elements) or `a.ndim` (the rank). They coincide only for a 1-D array.

## How do you get a tensor's shape in PyTorch?

PyTorch is the framework most responsible for the overloaded meaning of "size," because `tensor.size()` returns the shape, not the element count. The official documentation states plainly: "torch.Size is the result type of a call to torch.Tensor.size()" and "As a subclass of tuple, it supports common sequence operations like indexing and length" [2].

```python
import torch
t = torch.zeros(2, 3, 4)
t.size()       # torch.Size([2, 3, 4])   the shape
t.shape        # torch.Size([2, 3, 4])   identical to .size()
t.size(0)      # 2                        length of one axis
t.dim()        # 3                        rank / number of axes (ndim)
t.ndim         # 3                        same as t.dim()
t.numel()      # 24                       total elements
```

So in PyTorch the **shape** comes from `.size()` or `.shape`, the **rank** from `.dim()` or `.ndim`, and the **element count** from `.numel()`. Passing an integer argument to `.size(dim)` returns a single axis length as a plain `int` rather than a `torch.Size`.

## Size across major frameworks

The table below summarizes how each library exposes "size" for a hypothetical tensor `t` with shape `(2, 3, 4)`.

| Library | Expression | Returns | Value for shape (2, 3, 4) |
|---|---|---|---|
| NumPy | `arr.size` | total elements (int) | 24 |
| NumPy | `arr.shape` | per-axis tuple | (2, 3, 4) |
| NumPy | `arr.ndim` | number of axes (int) | 3 |
| PyTorch | `t.size()` | `torch.Size` (a tuple subclass) | torch.Size([2, 3, 4]) |
| PyTorch | `t.shape` | same `torch.Size` | torch.Size([2, 3, 4]) |
| PyTorch | `t.numel()` | total elements (int) | 24 |
| PyTorch | `t.dim()` / `t.ndim` | number of axes (int) | 3 |
| PyTorch | `t.size().numel()` | total elements (int) | 24 |
| TensorFlow | `tf.size(t)` | 0-D int tensor | 24 |
| TensorFlow | `tf.rank(t)` | 0-D int tensor | 3 |
| TensorFlow | `t.shape` | `TensorShape` | TensorShape([2, 3, 4]) |
| TensorFlow | `t.ndim` | number of axes (int) | 3 |
| JAX | `arr.size` | total elements (int) | 24 |
| JAX | `jnp.size(arr)` | total elements (int) | 24 |
| JAX | `arr.shape` | per-axis tuple | (2, 3, 4) |

A few details are worth flagging. PyTorch's `torch.Size` is itself a tuple subclass, so it behaves like a tuple in indexing and unpacking but also has a helper method `numel()` that returns the product of its entries [2]. Calling `numel()` on a `torch.Size` returns the element count of a tensor that would have that shape, not the rank of the tensor. The PyTorch issue tracker has documented this distinction explicitly, since users sometimes assume `torch.Size.numel()` returns the number of axes.

NumPy reports `ndarray.size` as a Python `int` of arbitrary precision, while `np.prod(a.shape)` returns a fixed-width `np.int_`. For very large arrays this matters because `np.prod` can silently overflow on 32-bit platforms.
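The overflow risk can be illustrated without NumPy. The sketch below uses `ctypes.c_int32` to mimic what a 32-bit signed accumulator does with a product of axis lengths. The helper name `prod_int32` is hypothetical, and this is a simulation of fixed-width wrap-around rather than NumPy's actual code path.

```python
import ctypes
import math

def prod_int32(shape):
    """Multiply axis lengths, truncating to a signed 32-bit value after each step."""
    acc = 1
    for n in shape:
        # c_int32 performs no overflow checking: it keeps only the low 32 bits.
        acc = ctypes.c_int32(acc * n).value
    return acc

shape = (65536, 65536)         # 2**32 elements in total
exact = math.prod(shape)       # Python int, arbitrary precision
wrapped = prod_int32(shape)    # what a 32-bit integer would end up holding

print(exact)    # 4294967296
print(wrapped)  # 0
```

For small shapes the two results agree, which is why the problem tends to surface only once arrays become very large.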

TensorFlow's `tf.size` returns a 0-dimensional integer tensor (default `tf.int32`), not a Python integer [5]. If the count exceeds the int32 range, you must pass `out_type=tf.int64`. The static `t.shape` returns a `TensorShape` object that may contain `None` entries for dynamic dimensions, while `tf.shape(t)` returns a 1-D tensor with the runtime shape. The TensorFlow guide warns that the `Tensor.ndim` and `Tensor.shape` attributes return Python objects rather than tensors, so when you need the value inside a graph you should call `tf.rank` or `tf.shape` instead [14].

JAX follows NumPy semantics: `arr.size` is the element count, and `jnp.size(arr)` mirrors `np.size` [6]. Unlike NumPy's free function, `jnp.size` raises `TypeError` on Python lists or tuples instead of converting them implicitly.

## Element count and shape

The element count is always the product of the entries in the shape:

```
elements = shape[0] * shape[1] * ... * shape[rank - 1]
```

A few corner cases:

- A scalar (rank 0) has shape `()` and contains 1 element. `numel()`, `tf.size`, and NumPy's `arr.size` all return 1, not 0.
- An empty tensor with any zero in its shape, for example shape `(3, 0, 4)`, has 0 elements. The shape is still meaningful and operations like reshape may need to preserve it.
- A tensor with a zero-length axis is not the same as a scalar: shape `(0,)` has rank 1 and 0 elements, while a scalar has rank 0 and 1 element. The rank still matters for [broadcasting](/wiki/broadcasting) and reshape.

The element count is a property of the shape only. It does not depend on the dtype, the device, or whether the tensor is contiguous in memory.
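The corner cases above follow directly from treating the element count as a product over the shape tuple. As a minimal sketch in plain Python (the helper names `element_count` and `rank` are illustrative, not part of any framework):

```python
import math

def element_count(shape):
    """Product of the axis lengths. The empty product is 1, so a scalar has 1 element."""
    return math.prod(shape)

def rank(shape):
    """Number of axes, i.e. the length of the shape tuple."""
    return len(shape)

for shape in [(), (5,), (3, 4, 5), (3, 0, 4), (0,)]:
    print(shape, "rank:", rank(shape), "elements:", element_count(shape))
```

Note that `()` and `(0,)` differ in both rank and element count, and that nothing in the calculation refers to a dtype or device.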

## How does a tensor's shape change through operations?

Most array operations are best understood by how they transform the shape, because the shape, not the raw element count, is what has to line up for an operation to be legal.

- **Reshape** rearranges the same elements into a new shape. It is only valid when the new shape has the same total element count, so a `(2, 12)` tensor can reshape to `(4, 6)` or `(2, 3, 4)` (all 24 elements) but not to `(5, 5)`. A `-1` entry tells the framework to infer that axis from the element count.
- **Reductions** such as `sum`, `mean`, or `max` remove an axis (or collapse it to length 1 with `keepdims=True`), lowering the rank.
- **Indexing and slicing** can drop axes (integer index) or keep them (slice), changing both shape and rank.
- **Concatenation and stacking** grow one axis (`cat` / `concatenate`) or add a new one (`stack`), and require the other axis lengths to match.
- **Matrix multiplication** contracts the shared inner dimension: `(m, k) @ (k, n)` produces `(m, n)`. The shared `k` must be equal, which is the most common shape-mismatch error in practice.
- **[Broadcasting](/wiki/broadcasting)** lets operations combine tensors of different shapes by virtually stretching axes of length 1. Two shapes are broadcast-compatible when, aligned from the trailing axis, each pair of axis lengths is either equal or one of them is 1.
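The reshape and broadcasting rules in the list above can be written as small shape-only functions. This is a sketch of the rules as stated here, not the implementation used by any framework, and the function names `infer_reshape` and `broadcast_shape` are hypothetical.

```python
import math

def infer_reshape(old_shape, new_shape):
    """Validate a reshape and resolve at most one -1 entry from the element count."""
    total = math.prod(old_shape)
    unknown = [i for i, n in enumerate(new_shape) if n == -1]
    if len(unknown) > 1:
        raise ValueError("at most one axis may be -1")
    known = math.prod(n for n in new_shape if n != -1)
    if unknown:
        if known == 0 or total % known != 0:
            raise ValueError(f"cannot reshape {old_shape} into {new_shape}")
        resolved = list(new_shape)
        resolved[unknown[0]] = total // known
        return tuple(resolved)
    if known != total:
        raise ValueError(f"cannot reshape {old_shape} into {new_shape}")
    return tuple(new_shape)

def broadcast_shape(a, b):
    """Result shape of broadcasting a with b, aligned from the trailing axis."""
    result = []
    for i in range(1, max(len(a), len(b)) + 1):
        x = a[-i] if i <= len(a) else 1   # missing leading axes act like length 1
        y = b[-i] if i <= len(b) else 1
        if x != y and x != 1 and y != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcast-compatible")
        result.append(y if x == 1 else x)
    return tuple(reversed(result))

print(infer_reshape((2, 12), (4, -1)))            # (4, 6)
print(broadcast_shape((32, 1, 768), (128, 768)))  # (32, 128, 768)
```

Checking shapes this way before running an expensive operation is one way to catch mismatches early, since both functions need only the shape tuples and never touch the data.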

Because shape mismatches are the single most frequent class of bug in tensor code, frameworks raise immediate errors when axis lengths do not align. Reading the printed shapes in the error message is usually the fastest way to find the offending operation.

## Memory footprint

Memory footprint is the practical meaning of "size" for anyone budgeting GPU RAM. The formula is simple:

```
bytes = element_count * bytes_per_element
```

In PyTorch, `tensor.element_size()` returns the bytes per element for the tensor's dtype, so `t.numel() * t.element_size()` is the storage size of a contiguous tensor [4][3]. NumPy exposes the same number as `arr.itemsize` and the total as `arr.nbytes`. TensorFlow does not have a built-in helper, so users compute `tf.size(t).numpy() * t.dtype.size`.

### Bytes per dtype

The table below lists the byte sizes for common dtypes used in modern deep learning. Numbers are bytes per scalar element.

| Dtype | Bytes | Common uses |
|---|---|---|
| float64 (fp64) | 8 | scientific computing, rarely used in deep learning |
| float32 (fp32) | 4 | classic training default; baseline weights |
| float16 (fp16) | 2 | mixed-precision training, inference |
| bfloat16 (bf16) | 2 | TPU and modern GPU training, same range as fp32 |
| float8 e4m3 (fp8) | 1 | forward activations, weights on Hopper and Blackwell |
| float8 e5m2 (fp8) | 1 | gradients during backward pass |
| int64 | 8 | indices, large counters |
| int32 | 4 | indices, ids |
| int16 | 2 | quantized intermediate values |
| int8 / uint8 | 1 | quantized weights or activations, image bytes |
| int4 (packed) | 0.5 | aggressive [quantization](/wiki/quantization) for [LLM](/wiki/llm) serving |
| bool | 1 | masks; stored as a full byte despite being one bit logically |
| complex64 | 8 | signal processing, two fp32 components |
| complex128 | 16 | high-precision signal processing |

A practical detail: PyTorch's `bool` tensor uses one byte per element, not one bit. If you want true bit-packing you have to use a uint8 tensor and pack manually. The PyTorch issue tracker has a long-running discussion about this for users who expected packed booleans.

FP8 came to mainstream hardware with the NVIDIA Hopper generation and is now a fixture on Blackwell [7][8]. The two encodings are E4M3 (4 exponent bits, 3 mantissa bits, 1 sign bit, max value about 448) and E5M2 (5 exponent bits, 2 mantissa bits, 1 sign bit, max value about 57344) [7]. Both use 1 byte per element. The convention recommended by NVIDIA is E4M3 for forward activations and weights, E5M2 for backward gradients where dynamic range matters more than precision [7].
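The two maximum values can be reproduced from the bit layouts. The sketch below assumes details that are not spelled out in this article: an exponent bias of 7 for E4M3 and 15 for E5M2, that E4M3 reserves only the all-ones exponent with all-ones mantissa for NaN, and that E5M2 reserves the entire all-ones exponent for infinity and NaN in the IEEE style.

```python
# E4M3 (assumed bias 7): the largest finite value uses exponent field 15
# and mantissa field 0b110, because 0b111 at that exponent encodes NaN.
e4m3_max = 2 ** (15 - 7) * (1 + 6 / 8)

# E5M2 (assumed bias 15): exponent field 31 is reserved for inf/NaN,
# so the largest finite value uses exponent field 30 and mantissa field 0b11.
e5m2_max = 2 ** (30 - 15) * (1 + 3 / 4)

print(e4m3_max)  # 448.0
print(e5m2_max)  # 57344.0
```

The extra exponent bit is what gives E5M2 its much larger range, at the cost of one bit of mantissa precision.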

### Worked example

A float32 matrix of shape `(1024, 1024)` has:

- 1,024 * 1,024 = 1,048,576 elements
- 1,048,576 * 4 bytes = 4,194,304 bytes = 4 MiB

In [bfloat16](/wiki/bfloat16) the same matrix is 2 MiB. In [int8](/wiki/int8) it is 1 MiB. In int4 (packed two-per-byte) it is 0.5 MiB. The shape did not change; only the bytes per element did.
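The same arithmetic in a few lines of Python, as a sketch. The `BYTES_PER_ELEMENT` table and the `footprint_bytes` helper are illustrative names, and the int4 entry assumes two values packed per byte with no extra metadata.

```python
import math

# Bytes per scalar element for a handful of dtypes (int4 assumes 2-per-byte packing).
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2, "int8": 1, "int4": 0.5}
MIB = 1024 ** 2

def footprint_bytes(shape, dtype):
    """Element count times bytes per element, for a densely stored tensor."""
    return math.prod(shape) * BYTES_PER_ELEMENT[dtype]

for dtype in BYTES_PER_ELEMENT:
    size = footprint_bytes((1024, 1024), dtype)
    print(f"{dtype:>8}: {size / MIB} MiB")
```

Only the dictionary lookup changes between rows; the shape and the element count stay fixed.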

## Model size

When people say a model is "7B" or "70B," they mean the total count of trainable parameters across every weight tensor. To get that number programmatically:

- PyTorch: `sum(p.numel() for p in model.parameters())`
- PyTorch trainable only: `sum(p.numel() for p in model.parameters() if p.requires_grad)`
- Hugging Face: `model.num_parameters()` (with optional `only_trainable=True`)
- Keras: `model.count_params()`

Multiplying the parameter count by bytes per element gives the storage cost of the weights themselves. The table below shows weight memory at common precisions for several well-known checkpoints (BERT-base at 110M and BERT-large at 340M parameters [13]). These figures are weights only; activations, optimizer state, and KV cache are separate.

| Model | Parameters | fp32 weights | fp16/bf16 weights | int8 weights | int4 weights |
|---|---|---|---|---|---|
| BERT-base | 110M | ~440 MB | ~220 MB | ~110 MB | ~55 MB |
| BERT-large | 340M | ~1.36 GB | ~680 MB | ~340 MB | ~170 MB |
| GPT-2 small | 117M | ~468 MB | ~234 MB | ~117 MB | ~59 MB |
| GPT-2 XL | 1.5B | ~6 GB | ~3 GB | ~1.5 GB | ~0.75 GB |
| Llama 3 8B | 8.0B | ~32 GB | ~16 GB | ~8 GB | ~4 GB |
| Llama 3 70B | 70B | ~280 GB | ~140 GB | ~70 GB | ~35 GB |
| Llama 3.1 405B | 405B | ~1.62 TB | ~810 GB | ~405 GB | ~203 GB |

The Hugging Face release of Llama 3.1 405B reported about 812 GB for the bf16 instruction-tuned variant, which lines up with the 405B * 2 bytes calculation [9]. An fp32 copy would need roughly 1.62 TB by the same arithmetic. Even on an 8-way H100 node with about 640 GB of HBM, you cannot fit the bf16 weights of 405B without sharding across nodes or dropping to FP8, which fits in roughly 486 GB.
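A quick way to sanity-check figures like these is to multiply the parameter count by bytes per parameter and compare against the available memory. The sketch below is a naive weights-only estimate: `weight_gb` is a hypothetical helper, the node is assumed to be 8 GPUs with 80 GB each, and a real quantized checkpoint can be larger than the estimate if some layers stay in higher precision.

```python
# Naive bytes per parameter at each precision (no metadata, no mixed layers).
PRECISIONS = {"fp32": 4, "bf16": 2, "fp8": 1, "int4": 0.5}
NODE_HBM_GB = 8 * 80  # assumed: one node of eight 80 GB GPUs

def weight_gb(num_params, bytes_per_param):
    """Weights-only storage in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

for name, nbytes in PRECISIONS.items():
    gb = weight_gb(405e9, nbytes)
    print(f"{name:>5}: {gb:7.1f} GB  fits in {NODE_HBM_GB} GB: {gb <= NODE_HBM_GB}")
```

A "fits" result here only says the weights are smaller than total HBM; activations, KV cache, and framework overhead still need room on top.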

### Activation memory

During training, the forward pass stores intermediate activations for use in the backward pass. Activation memory typically scales with batch size, sequence length, hidden width, and depth. For large transformers it can match or exceed the parameter footprint, which is why techniques like gradient checkpointing, sequence parallelism, and activation offloading exist. As a rough rule of thumb in Megatron-style training without recomputation, activations for a transformer can require memory on the order of the weight footprint again, sometimes more.

### Optimizer state

The optimizer keeps its own per-parameter state. Adam and AdamW track two fp32 moments per parameter, so they add 8 bytes per parameter on top of the weights [11]. In standard mixed-precision training with an fp32 master copy and Adam moments, the per-parameter cost is roughly [11]:

- 2 bytes for the fp16/bf16 weight
- 4 bytes for the fp32 master weight
- 4 bytes for the fp32 first moment
- 4 bytes for the fp32 second moment
- 2 or 4 bytes for the gradient

With a 2-byte gradient, these sum to the famous "16 bytes per parameter" budget that motivates ZeRO sharding; keeping gradients in fp32 raises it to 18. For an 8B model, 16 bytes per parameter already amounts to about 128 GB before activations or KV cache. Eight-bit Adam variants (for example bitsandbytes' `adamw_8bit`) cut the moment storage to 1 byte each, reducing the optimizer state from 8 to about 2 bytes per parameter [12].
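The per-parameter budget can be added up explicitly. This is a sketch of the accounting described above, assuming a 2-byte gradient; the dictionary keys and the `training_state_gb` helper are illustrative, and the 8-bit variant ignores any small per-block quantization metadata.

```python
# Bytes per parameter in mixed-precision training with Adam (2-byte gradient assumed).
MIXED_PRECISION_ADAM = {
    "fp16/bf16 weight": 2,
    "fp32 master weight": 4,
    "fp32 first moment": 4,
    "fp32 second moment": 4,
    "fp16/bf16 gradient": 2,
}

# Same budget with both Adam moments stored in 1 byte each.
EIGHT_BIT_ADAM = {**MIXED_PRECISION_ADAM, "fp32 first moment": 1, "fp32 second moment": 1}

def training_state_gb(num_params, budget):
    """Weights, gradients, and optimizer state in decimal gigabytes."""
    return num_params * sum(budget.values()) / 1e9

print(sum(MIXED_PRECISION_ADAM.values()))              # 16
print(training_state_gb(8e9, MIXED_PRECISION_ADAM))    # 128.0
print(training_state_gb(8e9, EIGHT_BIT_ADAM))          # 80.0
```

Activations and any KV cache come on top of these totals.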

### KV cache

For autoregressive [LLM](/wiki/llm) inference, the key and value projections from previous tokens are cached so attention does not have to recompute them. The KV cache size in bytes is [10]:

```
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element
```

The leading factor of 2 covers both K and V. With grouped-query attention, `num_kv_heads` is smaller than the total head count, which is one reason GQA is so popular for long-context serving. KV cache scales linearly with sequence length, so a 128k-token context can dwarf the weights themselves on a small model.

For Llama 3 8B at fp16, with 32 layers, hidden size 4096, 8 KV heads of dimension 128, the KV cache for a single 8k-token sequence is:

```
2 * 32 * 8 * 128 * 8192 * 1 * 2 bytes = ~1.07 GB
```

That is per sequence in the batch. Serving 32 concurrent 8k contexts pushes the KV cache past 30 GB, which is often the bottleneck rather than the weights.
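The formula translates directly into a function. The sketch below reuses the Llama 3 8B figures quoted above (32 layers, 8 KV heads, head dimension 128, fp16); the function name `kv_cache_bytes` is illustrative, and the 32-head comparison is a hypothetical configuration without grouped-query attention.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_element):
    """KV cache size in bytes; the leading 2 accounts for both keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element

one_seq = kv_cache_bytes(32, 8, 128, 8192, batch_size=1, bytes_per_element=2)
batch_32 = kv_cache_bytes(32, 8, 128, 8192, batch_size=32, bytes_per_element=2)
no_gqa = kv_cache_bytes(32, 32, 128, 8192, batch_size=1, bytes_per_element=2)

print(one_seq / 1e9)     # about 1.07 GB for one 8k-token sequence
print(batch_32 / 1e9)    # about 34.4 GB for 32 concurrent sequences
print(no_gqa / one_seq)  # 4.0: the cost of 32 KV heads instead of 8
```

Because every factor enters multiplicatively, doubling the context length or the batch size doubles the cache.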

## Quantization and size

[Quantization](/wiki/quantization) reduces bytes per element without changing the element count. The common steps in modern LLM serving are:

| Step | Bytes/param | Size relative to fp32 |
|---|---|---|
| fp32 baseline | 4 | 1.00x |
| fp16 or bf16 | 2 | 0.50x |
| fp8 (E4M3 or E5M2) | 1 | 0.25x |
| int8 | 1 | 0.25x |
| int4 (GPTQ, AWQ, GGUF Q4) | 0.5 | 0.125x |
| int3 / Q3 GGUF | ~0.375 | ~0.094x |
| int2 / Q2 GGUF | ~0.25 | ~0.063x |

Real quantized formats include extra metadata (group scales, zero points), so the on-disk size is usually a bit larger than the naive byte-per-parameter count. GGUF Q4_K_M, for instance, lands at roughly 4.5 to 5 bits per weight on average rather than exactly 4.
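The overhead from group metadata is easy to estimate. The sketch below assumes a generic group-wise layout in which every group of weights shares one scale and, optionally, one zero point; the function `effective_bits_per_weight` and the example group sizes are illustrative and do not describe any specific file format exactly.

```python
def effective_bits_per_weight(weight_bits, group_size, scale_bits=16, zero_point_bits=0):
    """Average bits per weight once per-group metadata is amortized over the group."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size

# 4-bit weights, groups of 32, one 16-bit scale per group.
print(effective_bits_per_weight(4, 32))                       # 4.5

# 4-bit weights, groups of 128, 16-bit scale plus 4-bit zero point.
print(effective_bits_per_weight(4, 128, zero_point_bits=4))   # 4.15625

# Resulting bytes per parameter for the first layout.
print(effective_bits_per_weight(4, 32) / 8)                   # 0.5625
```

Smaller groups track the weight distribution more closely but spend more bits on metadata, which is the trade-off behind the different group sizes seen in practice.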

Libraries that perform this work include bitsandbytes (NF4 and 8-bit linear layers), AutoGPTQ, AutoAWQ, and llama.cpp's GGUF format. Hardware-side, NVIDIA's Transformer Engine handles automatic FP8 scaling on Hopper and Blackwell [7].

## Why does tensor size matter for bandwidth and throughput?

Many machine learning workloads are bound by memory bandwidth, not raw arithmetic. Reading a billion fp16 weights from HBM into the streaming multiprocessors moves 2 GB of data no matter how fast the matmul units run. This is why halving the bytes per parameter often roughly doubles decode throughput on modern GPUs when the workload is memory-bound, even though the number of arithmetic operations is unchanged. Tensor size, in the byte-footprint sense, is therefore a first-order driver of decode speed for [LLMs](/wiki/llm), not a secondary concern.
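A back-of-the-envelope bound makes the relationship concrete. The sketch below assumes single-sequence decoding in which every weight is read from memory once per generated token, ignores KV cache traffic and compute time, and uses a hypothetical memory bandwidth of 2 TB/s; the function name is illustrative.

```python
def decode_tokens_per_sec_bound(num_params, bytes_per_param, bandwidth_bytes_per_sec):
    """Upper bound on tokens/s if each token requires streaming all weights once."""
    return bandwidth_bytes_per_sec / (num_params * bytes_per_param)

BANDWIDTH = 2e12  # hypothetical 2 TB/s of memory bandwidth

fp16 = decode_tokens_per_sec_bound(8_000_000_000, 2, BANDWIDTH)
int8 = decode_tokens_per_sec_bound(8_000_000_000, 1, BANDWIDTH)

print(fp16)         # 125.0
print(int8)         # 250.0
print(int8 / fp16)  # 2.0
```

Real throughput falls below this bound, but the ratio between precisions is the part that tends to carry over: half the bytes, about twice the ceiling.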

## What are common tensor-size pitfalls?

A few traps that catch newcomers and experienced practitioners alike:

- Treating `tensor.size()` in PyTorch as if it were the element count. It is not; it returns the shape [2]. Use `numel()` or `size().numel()` for the count.
- Confusing rank with an axis length. In NumPy, `a.ndim` is the number of axes (the rank), not the size of any axis [15].
- Forgetting that bool tensors take a full byte per element in PyTorch and NumPy, not a bit.
- Multiplying parameter count by 4 bytes for an inference budget when the model is actually loaded in bf16. The right factor is 2.
- Estimating GPU memory from weights only and being surprised when activations or KV cache take more space than the model itself.
- Using `np.prod(a.shape)` on huge arrays and getting a silent overflow on 32-bit `np.int_`. Use `a.size` or cast explicitly to `np.int64`.
- Confusing static `t.shape` (which can include `None`) with dynamic `tf.shape(t)` in TensorFlow graph mode.
- Counting tied weights twice. Embedding and LM head layers in many transformers share one weight tensor. PyTorch's `model.parameters()` yields a shared parameter only once, but summing over `state_dict()` entries or adding up per-module counts can over-count.

## Tools to estimate size

A short list of utilities that compute or report tensor and model sizes:

- `torchinfo` and the older `torchsummary`: print per-layer parameter and activation shapes for a PyTorch model.
- `transformers`' built-in `model.num_parameters()` and `model.get_memory_footprint()`.
- Hugging Face's Accelerate library includes `infer_auto_device_map` which estimates per-device memory needs.
- DeepSpeed ships `estimate_zero2_model_states_mem_needs_all_live` and a ZeRO-3 equivalent for training memory budgeting.
- `nvidia-smi` and `torch.cuda.memory_allocated()` for runtime measurement.
- llama.cpp's `--mlock` and `--n-gpu-layers` flags rely on its internal byte accounting, visible at load time.
- Online VRAM calculators such as the LLM Studio and ApxML calculators wrap these formulas in a UI for quick what-if sizing.

## Explain like I'm 5

Imagine a box of Lego bricks arranged in a stack. The **shape** is how the stack is built: 3 bricks wide, 4 bricks tall, 5 bricks deep. The **rank** is just how many of those measurements there are, which is 3 (wide, tall, deep). The **size** can mean two different things depending on who you ask. Some people mean "how many bricks are in the box," which is 3 times 4 times 5, or 60 bricks. Other people, like the PyTorch crowd, mean the description of the stack itself: "3 by 4 by 5." And if you ask how heavy the box is, that depends on whether the bricks are big plastic ones or tiny micro pieces. That weight is the memory footprint, and it depends on the dtype.

## References

1. NumPy documentation. *numpy.ndarray.size*. https://numpy.org/doc/stable/reference/generated/numpy.ndarray.size.html
2. PyTorch documentation. *torch.Size*. https://docs.pytorch.org/docs/stable/size.html
3. PyTorch documentation. *torch.Tensor.numel*. https://docs.pytorch.org/docs/stable/generated/torch.Tensor.numel.html
4. PyTorch documentation. *torch.Tensor.element_size*. https://docs.pytorch.org/docs/stable/generated/torch.Tensor.element_size.html
5. TensorFlow documentation. *tf.size*. https://www.tensorflow.org/api_docs/python/tf/size
6. JAX documentation. *jax.numpy.size*. https://docs.jax.dev/en/latest/_autosummary/jax.numpy.size.html
7. NVIDIA Developer Blog. *Floating-Point 8: An Introduction to Efficient, Lower-Precision AI Training*. https://developer.nvidia.com/blog/floating-point-8-an-introduction-to-efficient-lower-precision-ai-training/
8. NVIDIA Developer Blog. *NVIDIA Hopper Architecture In-Depth*. https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
9. Hugging Face. *Llama 3.1 release blog*. https://huggingface.co/blog/llama31
10. NVIDIA Developer Blog. *Mastering LLM Techniques: Inference Optimization* (KV cache section). https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
11. Hugging Face Transformers documentation. *Efficient Training on a Single GPU* (Adam memory breakdown). https://huggingface.co/docs/transformers/v4.20.1/en/perf_train_gpu_one
12. Hugging Face bitsandbytes documentation. *8-bit optimizers*. https://huggingface.co/docs/bitsandbytes/optimizers
13. Wikipedia. *BERT (language model)* (parameter counts). https://en.wikipedia.org/wiki/BERT_(language_model)
14. TensorFlow documentation. *Introduction to Tensors* (shape, rank, axis, size definitions). https://www.tensorflow.org/guide/tensor
15. NumPy documentation. *numpy.ndarray.ndim* and *numpy.ndarray.shape*. https://numpy.org/doc/stable/reference/generated/numpy.ndarray.ndim.html