Safetensors is an open-source tensor serialization format developed by Hugging Face as a safe, fast, and framework-agnostic alternative to Python's pickle module. The format stores only raw tensor data and a small JSON header, which makes it impossible to embed executable code in a model file. This design directly addresses the security risks of pickle-based formats like PyTorch .bin files, where loading an untrusted model can trigger arbitrary code execution on the host system.
The safetensors library was first released on September 22, 2022, as version 0.0.1 on PyPI. The primary author is Nicolas Patry, a machine learning engineer at Hugging Face known for his work on Rust-based ML infrastructure. The project is licensed under the Apache 2.0 license and hosted at github.com/huggingface/safetensors. As of late 2025, the repository has over 3,700 stars on GitHub, 76 contributors, and is used by more than 129,000 downstream projects.
Python's pickle module is the default serialization mechanism behind PyTorch model checkpoints (.pt, .pth, .bin files). When a user calls torch.load(), pickle deserializes the byte stream back into Python objects. The problem is that pickle was designed to reconstruct arbitrary Python objects, including their behavior, not just their data. An attacker can craft a pickle file that executes arbitrary Python code the moment it is loaded. This code could exfiltrate credentials, install a backdoor, open a reverse shell, or run any other system-level command.
This is not a theoretical risk. Researchers at JFrog, Rapid7, and Snyk have documented real-world malicious models uploaded to public model hubs. Weaponized .pth files have been found on the Hugging Face Hub containing payloads that perform system fingerprinting, credential theft, and remote access trojan installation. Pickle's REDUCE, GLOBAL, and STACK_GLOBAL opcodes allow a pickle program to import and invoke arbitrary Python callables during unpickling. An attacker can use these opcodes to call os.system or similar functions, achieving arbitrary shell command execution.
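The opcode mechanism is easy to observe with the standard library alone. The sketch below (using the harmless builtin print as a stand-in payload) serializes an object whose __reduce__ method requests a callable invocation, then disassembles the resulting pickle without ever loading it:

```python
import io
import pickle
import pickletools

# An object whose __reduce__ tells pickle to call an arbitrary callable
# at unpickling time. Here the callable is the harmless builtin print;
# a real attack would substitute os.system or similar.
class Payload:
    def __reduce__(self):
        return (print, ("this line would run the moment the file is loaded",))

data = pickle.dumps(Payload(), protocol=4)

# Disassemble WITHOUT loading: the STACK_GLOBAL opcode imports the callable
# and REDUCE invokes it during unpickling.
buf = io.StringIO()
pickletools.dis(data, out=buf)
opcodes = buf.getvalue()
```

Inspecting `opcodes` shows the STACK_GLOBAL and REDUCE opcodes that scanners such as picklescan look for, which is exactly the code path safetensors removes by design.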
Pickle scanning tools like picklescan exist to detect known malicious patterns, but researchers have discovered multiple bypass vulnerabilities in these scanners. JFrog disclosed three zero-day vulnerabilities in picklescan in 2025, demonstrating that static analysis cannot fully secure an inherently unsafe format. Repositories with pickle models are downloaded over 2.1 billion times per month from the Hugging Face model hub, making the attack surface enormous.
Hugging Face's own scanning infrastructure catches some malicious uploads, but the fundamental issue is that pickle allows code execution by design. No amount of scanning can guarantee safety for an inherently unsafe format. Safetensors was built from the ground up to solve this problem: it stores only numerical data and metadata, with no mechanism for embedding or executing code.
The safetensors format is intentionally simple. A .safetensors file consists of three contiguous sections:
| Section | Size | Description |
|---|---|---|
| Header size | 8 bytes | Unsigned 64-bit little-endian integer specifying the length of the JSON header in bytes |
| JSON header | N bytes (value from header size) | UTF-8 encoded JSON object containing tensor metadata |
| Data buffer | Remaining bytes | Raw tensor data stored as a contiguous binary block, without compression |
The JSON header is a dictionary where each key is a tensor name and each value is an object with three fields:
- dtype: The data type of the tensor (e.g., F16, BF16, F32, I32)
- shape: An array of integers representing the tensor dimensions
- data_offsets: A two-element array [BEGIN, END] specifying the byte range of the tensor's data within the data buffer (not absolute file offsets)

A special key called __metadata__ can store arbitrary string-to-string key-value pairs for user-defined metadata such as model architecture, training configuration, or authorship information. Because metadata values must be strings, complex data structures need to be serialized as JSON strings within the value. This metadata is commonly used by tools like CivitAI and LoRA training scripts to record learning rates, training epochs, and dataset information.
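The three-section layout can be exercised with nothing but the standard library. This stdlib-only sketch builds a minimal one-tensor file in memory and parses it back; it skips the validation (size limits, overlap checks) that the real parser performs, and the tensor name `bias` is made up:

```python
import json
import struct

# Build a minimal .safetensors byte stream: one F32 tensor of shape [2].
header = {
    "bias": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]},
    "__metadata__": {"format": "pt"},
}
header_bytes = json.dumps(header).encode("utf-8")
data = struct.pack("<2f", 1.0, 2.0)  # raw little-endian tensor data
blob = struct.pack("<Q", len(header_bytes)) + header_bytes + data

# Parse it back per the spec: 8-byte LE header size, JSON header, data buffer.
(n,) = struct.unpack_from("<Q", blob, 0)
parsed = json.loads(blob[8 : 8 + n])
buf = blob[8 + n :]
begin, end = parsed["bias"]["data_offsets"]  # relative to the data buffer
values = struct.unpack("<2f", buf[begin:end])
```

Note how data_offsets index into the data buffer, not the file: the parser never needs to touch anything but plain JSON and raw bytes.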
Here is a simplified example of a header:
```json
{
    "__metadata__": {"format": "pt"},
    "model.embed.weight": {
        "dtype": "F16",
        "shape": [32000, 4096],
        "data_offsets": [0, 262144000]
    },
    "model.norm.weight": {
        "dtype": "F16",
        "shape": [4096],
        "data_offsets": [262144000, 262152192]
    }
}
```
Several constraints are enforced to prevent abuse and ensure correctness:
- The JSON header must begin with the { character (byte 0x7B).
- The header size is capped (100 MB in the reference implementation) to prevent denial of service through oversized headers.
- Duplicate tensor names are disallowed.
- Tensor data regions must not overlap, and the data buffer must be fully addressed with no gaps, which prevents hiding extra payloads or constructing polyglot files.

Safetensors supports a range of data types relevant to modern machine learning workloads:
| Dtype string | Description | Bytes per element |
|---|---|---|
| BOOL | Boolean | 1 |
| U8 | Unsigned 8-bit integer | 1 |
| I8 | Signed 8-bit integer | 1 |
| U16 | Unsigned 16-bit integer | 2 |
| I16 | Signed 16-bit integer | 2 |
| F16 | IEEE 754 half-precision float | 2 |
| BF16 | Brain floating point (bfloat16) | 2 |
| U32 | Unsigned 32-bit integer | 4 |
| I32 | Signed 32-bit integer | 4 |
| F32 | IEEE 754 single-precision float | 4 |
| F64 | IEEE 754 double-precision float | 8 |
| I64 | Signed 64-bit integer | 8 |
| U64 | Unsigned 64-bit integer | 8 |
| F8_E4M3 | FP8 (4-bit exponent, 3-bit mantissa) | 1 |
| F8_E5M2 | FP8 (5-bit exponent, 2-bit mantissa) | 1 |
The inclusion of bfloat16 and FP8 types is notable because some older formats (such as HDF5 and NumPy's .npz) lack native support for these types, which are now standard in large-scale model training and inference.
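Because every dtype has a fixed element size, a tensor's byte footprint in the data buffer is fully determined by its shape and dtype. A small sketch of that arithmetic, using the table above (the DTYPE_SIZES map and nbytes helper are our own names, not library API):

```python
import math

# Bytes per element for each safetensors dtype, from the table above.
DTYPE_SIZES = {
    "BOOL": 1, "U8": 1, "I8": 1, "F8_E4M3": 1, "F8_E5M2": 1,
    "U16": 2, "I16": 2, "F16": 2, "BF16": 2,
    "U32": 4, "I32": 4, "F32": 4,
    "U64": 8, "I64": 8, "F64": 8,
}

def nbytes(dtype: str, shape: list) -> int:
    """Size in bytes of a tensor's contiguous data block."""
    return math.prod(shape) * DTYPE_SIZES[dtype]

# The embedding tensor from the earlier header example:
# 32000 * 4096 elements * 2 bytes (F16) = 262,144,000 bytes,
# matching its data_offsets range [0, 262144000].
size = nbytes("F16", [32000, 4096])
```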
The central promise of safetensors is that loading a .safetensors file cannot execute arbitrary code, regardless of how the file was crafted. This holds because:
- The JSON header is parsed as pure data; nothing in it is ever interpreted as code.
- The data buffer contains only raw numbers, reconstructed into tensors using the dtype and shape fields.
- The loading path contains no calls to eval, exec, or any form of dynamic code loading.

In May 2023, Hugging Face, EleutherAI, and Stability AI jointly commissioned an external security audit of the safetensors library. The audit was performed by Trail of Bits, a well-known cybersecurity firm that has audited projects across the blockchain, infrastructure, and AI security domains.
The audit found no critical security flaws in the library; the lower-severity issues it did identify were addressed by the maintainers, and the report supported the format's core claim that loading a file cannot execute code.
All three organizations agreed to make the full audit report public, in keeping with open-source transparency principles. The report is available through Trail of Bits' publications repository on GitHub. EleutherAI published a companion blog post confirming their endorsement of safetensors as a safe default format.
Safetensors achieves significant speed improvements over pickle-based loading, primarily through memory mapping (mmap). When a safetensors file is loaded, the operating system maps the file directly into the process's virtual address space. Tensor data can then be accessed without copying it into a separate memory allocation.
With pickle-based PyTorch loading (torch.load()), the process typically works as follows: PyTorch creates empty tensors, loads data from the pickle file, and then copies the data into the empty tensors. This requires approximately twice the model's size in available RAM, since both the loaded data and the destination tensors exist in memory simultaneously.
Safetensors avoids this double-allocation problem. On CPU, tensors reference the memory-mapped file directly. On GPU, the library uses cudaMemcpy to transfer data straight from the mapped file to GPU memory, bypassing intermediate CPU tensor allocations.
The term "zero-copy" deserves some clarification. On CPU, when the file contents are already in the OS page cache (for example, after a recent read), loading a safetensors file can be truly zero-copy: the tensor objects point directly at the mapped memory with no data duplication at all. On GPU, a copy from host memory to device memory is always required (there is no way around the PCIe or NVLink transfer), but safetensors still eliminates the intermediate step of allocating CPU tensors. In practice, this means safetensors skips one full copy of the model's weights compared to PyTorch's standard loading path.
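The mechanism behind this can be illustrated with the standard library alone. The sketch below uses Python's mmap module directly (it demonstrates OS-level memory mapping, not the safetensors library itself; the file and its contents are throwaway):

```python
import mmap
import os
import struct
import tempfile

# Write four float32 values to a temporary file, standing in for tensor data.
fd, path = tempfile.mkstemp()
os.write(fd, struct.pack("<4f", 0.5, 1.5, 2.5, 3.5))
os.close(fd)

with open(path, "rb") as f:
    # Map the whole file into the process's address space.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # A memoryview over the mapping reads through the OS page cache;
    # no bytes are copied into a separate Python-owned buffer.
    view = memoryview(mapped)
    values = struct.unpack_from("<4f", view, 0)
    view.release()
    mapped.close()
os.remove(path)
```

Safetensors applies the same idea at scale: tensor objects are views over the mapped file, so a warm page cache makes CPU loading nearly instantaneous.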
The official benchmarks from the Hugging Face documentation use GPT-2 weights and were run on Ubuntu 18.04.6 LTS with an Intel Xeon CPU @ 2.00 GHz and a Tesla T4 GPU (CUDA 11.2, driver version 460.32.03):
| Device | Safetensors load time | PyTorch load time | Speedup |
|---|---|---|---|
| CPU | 0.004 seconds | 0.307 seconds | 76.6x faster |
| GPU (CUDA, Tesla T4) | 0.165 seconds | 0.354 seconds | 2.1x faster |
The CPU speedup is dramatic because memory mapping avoids all data copying. The GPU speedup is smaller because the bottleneck shifts to the PCIe bus transfer between host memory and GPU memory, which both formats must perform.
For large-scale models, the gains are even more pronounced. The BLOOM 176-billion-parameter model loaded on 8 GPUs in approximately 45 seconds using safetensors, compared to roughly 10 minutes with standard PyTorch pickle weights. This roughly 13x improvement comes from lazy loading, where safetensors reads specific tensor shards directly to each GPU without loading the entire model into CPU memory first.
Safetensors supports lazy loading, meaning individual tensors can be read from the file without loading the entire model into memory. This is particularly useful for distributed inference, where each GPU only needs a subset of the model's weights. The library also supports tensor slicing, allowing users to read a sub-range of a tensor:
```python
from safetensors import safe_open

with safe_open("model.safetensors", framework="pt", device="cuda:0") as f:
    tensor_slice = f.get_slice("embedding")
    vocab_size, hidden_dim = tensor_slice.get_shape()
    partial = tensor_slice[:1000, :]  # Load only the first 1000 rows
```
This capability is useful for vocabulary-parallel inference, where different GPUs handle different portions of the embedding table.
The core safetensors library is implemented in Rust, which provides memory safety guarantees without a garbage collector. The Rust implementation handles all file parsing, header validation, and data access logic. Language-specific bindings (for Python, JavaScript, and others) call into the Rust code through foreign function interfaces built with PyO3 and setuptools-rust.
Using Rust adds a meaningful layer of security: common vulnerability classes such as buffer overflows, use-after-free errors, and data races are prevented by the Rust compiler's ownership and borrowing rules. This is particularly relevant for a library that parses untrusted binary data downloaded from the internet.
The Rust crate is published on crates.io as safetensors. It has accumulated over 980,000 monthly downloads and is used by more than 760 other Rust crates, including Hugging Face's Candle deep learning framework, which uses safetensors as its primary model loading mechanism.
Safetensors provides native bindings for multiple deep learning frameworks:
| Framework | Module | Notes |
|---|---|---|
| PyTorch | safetensors.torch | Full support; default in Transformers since v4.35.0 |
| TensorFlow | safetensors.tensorflow | Full support |
| JAX / Flax | safetensors.flax | Full support |
| NumPy | safetensors.numpy | Full support |
| PaddlePaddle | safetensors.paddle | Full support |
Because the file format stores raw tensor data without framework-specific metadata, the same .safetensors file can be loaded in different frameworks. A model saved from PyTorch can be loaded in TensorFlow or JAX, provided the consuming code knows the expected tensor names and shapes. This is a meaningful advantage over pickle, which is tightly coupled to Python and often to the specific framework version that created the file.
Models with billions of parameters often exceed the memory capacity of a single device or the practical limits of a single file. Safetensors supports sharding, where model weights are split across multiple files with a JSON index file that maps tensor names to their respective shard files.
A sharded model typically looks like this on disk:
```
model-00001-of-00008.safetensors
model-00002-of-00008.safetensors
...
model-00008-of-00008.safetensors
model.safetensors.index.json
```
The model.safetensors.index.json file contains two keys:
- metadata: An object with information such as the total model size in bytes
- weight_map: A dictionary mapping each parameter name to the shard file that contains it

The Hugging Face Transformers library and the Accelerate library handle sharding automatically. When calling save_pretrained(), the model is split into shards of a configurable maximum size (typically 5 GB per shard). During loading, from_pretrained() reads the index file and loads only the required shards, enabling efficient distribution of weights across multiple GPUs with minimal memory overhead.
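Resolving which shard files to open for a given set of parameters reduces to a lookup in weight_map. A stdlib-only sketch with a hypothetical, heavily abbreviated index (the tensor and file names are illustrative):

```python
# A hypothetical, already-parsed model.safetensors.index.json.
index = {
    "metadata": {"total_size": 140_000_000_000},
    "weight_map": {
        "model.embed.weight": "model-00001-of-00008.safetensors",
        "model.layers.0.mlp.weight": "model-00001-of-00008.safetensors",
        "model.layers.40.mlp.weight": "model-00005-of-00008.safetensors",
        "lm_head.weight": "model-00008-of-00008.safetensors",
    },
}

def shards_for(names):
    """Shard files that must be opened to load the named parameters."""
    return {index["weight_map"][name] for name in names}

# A rank that only needs the embedding and output head opens two shards,
# never touching the other six files.
needed = shards_for(["model.embed.weight", "lm_head.weight"])
```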
Sharding is particularly important for models with tens or hundreds of billions of parameters. A 70-billion-parameter model stored in FP16 requires approximately 140 GB of storage, which would be impractical as a single file for downloading, checkpointing, and distributed loading across multiple GPUs.
Following the successful Trail of Bits security audit, Hugging Face made safetensors the default serialization format in the Transformers library. Since that transition, save_pretrained() saves models as .safetensors files unless explicitly overridden. When loading, the library prefers .safetensors files over .bin files if both are available.
Hugging Face also provides a web-based conversion tool at huggingface.co/spaces/safetensors/convert that allows users to convert existing pickle-based models to safetensors format directly on the Hub.
Major model families released on the Hugging Face Hub, including Llama 3, Gemma, Phi, Mistral, Qwen, Stable Diffusion XL, and Flux, all use safetensors as their primary distribution format.
The format has been adopted by a wide range of open-source projects:
| Project | Use case |
|---|---|
| Transformers | Default model serialization for Hugging Face models |
| Diffusers | Image generation model storage |
| Candle | Rust-based ML inference |
| MLX | Apple Silicon ML framework |
| llama.cpp | CPU/GPU LLM inference (reads safetensors for conversion) |
| AUTOMATIC1111 Stable Diffusion WebUI | Image generation interface |
| ComfyUI | Node-based image generation workflow |
| CivitAI | Community model sharing platform |
| ColossalAI | Distributed training framework |
| InvokeAI | Creative AI toolkit |
| oobabooga text-generation-webui | Local LLM chat interface |
| vLLM | High-throughput LLM serving |
| pytorch-image-models (timm) | Computer vision model hub |
| BERTopic | Topic modeling library |
Several model serialization formats compete in the machine learning ecosystem. Each serves a different primary use case.
| Feature | Safetensors | Pickle / PyTorch .bin | GGUF | ONNX |
|---|---|---|---|---|
| Primary use case | Model weight storage and sharing | PyTorch checkpointing | Local LLM inference (llama.cpp, Ollama) | Cross-framework model deployment |
| Safety (no code execution) | Yes | No (arbitrary code execution possible) | Yes | Yes |
| Memory mapping (mmap) | Yes | No | Yes | No |
| Lazy loading | Yes | No | Yes | No |
| Stores computation graph | No (weights only) | No (weights only) | No (weights + tokenizer + config) | Yes (full graph + weights) |
| Built-in quantization | No (stores pre-quantized weights) | No | Yes (Q2 through Q8, k-quants) | Yes (via ONNX quantization tools) |
| Single-file format | Yes (or sharded with index) | Yes (or sharded) | Yes | Yes |
| bfloat16 support | Yes | Yes | Yes | Limited |
| FP8 support | Yes (F8_E4M3, F8_E5M2) | Yes (via PyTorch) | Yes | No |
| Framework-agnostic | Yes | No (Python/PyTorch specific) | Yes (binary format) | Yes (uses protobuf) |
| Typical file extension | .safetensors | .bin, .pt, .pth | .gguf | .onnx |
| Typical size (7B model, FP16) | ~14 GB | ~14 GB | ~4 GB (Q4 quantized) | ~14 GB |
| Developer | Hugging Face | PyTorch / Meta | ggml.ai (Georgi Gerganov) | Linux Foundation (ONNX Project) |
| License | Apache 2.0 | Python/BSD | MIT | Apache 2.0 |
The most direct comparison is between safetensors and pickle-based formats (.pt, .bin, .pth). Safetensors is strictly safer because it cannot execute code. It is also faster on CPU because of memory mapping. Pickle files, however, can serialize arbitrary Python objects, not just tensors; safetensors is limited to tensor data and string metadata. For pure model weight storage and distribution, safetensors is the preferred format. PyTorch 2.0 introduced a weights_only=True parameter for torch.load() that restricts deserialization to tensor data, but this mitigation is opt-in and does not change the fundamental pickle format.
GGUF, the successor to the GGML file format, was designed for the llama.cpp project and is optimized for running quantized models on consumer hardware, especially on CPUs. GGUF files bundle model weights, tokenizer configuration, and architecture metadata into a single file, and support a wide range of quantization formats (from 2-bit to 8-bit integers, including k-quant variants like Q4_K_M and Q5_K_S). A 7-billion-parameter model quantized to Q4 in GGUF format is roughly 4 GB, compared to approximately 14 GB for the same model in FP16 safetensors. Safetensors and GGUF serve different stages of the model lifecycle: safetensors for training, fine-tuning, and distribution; GGUF for efficient local inference. Models are typically converted from safetensors to GGUF using conversion scripts provided by the llama.cpp project.
ONNX (Open Neural Network Exchange) stores not just model weights but the full computation graph, including operator definitions and layer connections. This makes ONNX suitable for deploying models across different inference runtimes (ONNX Runtime, TensorRT, CoreML) without needing the original training framework. ONNX launched in 2017 as a collaboration between Microsoft and Facebook (now Meta) and uses Protocol Buffers for serialization. Safetensors does not store computation graphs; it is purely a weight storage format. Models stored in safetensors still require the original model architecture code (or a framework like transformers) to reconstruct the model for inference.
The safetensors documentation explains why several existing formats were considered and rejected:
| Format | Reason rejected |
|---|---|
| Pickle | Allows arbitrary code execution |
| HDF5 | Complex codebase (~210,000 lines of C) with historical CVEs |
| Protobuf | 2 GB file size limit |
| Cap'n Proto | No native float16 support |
| NumPy .npz | Vulnerable to zip bomb attacks; uses pickle internally for object arrays |
| Flatbuffers | No native bfloat16/FP8 support at the time of development |
This analysis led the Hugging Face team to create a new format from scratch rather than building on an existing one.
Safetensors is intentionally limited in scope. It stores only tensor data and simple string metadata. It does not store:
- Model architecture or computation graphs
- Optimizer state, training loops, or any executable code
- Arbitrary Python objects
- Structured metadata beyond flat string-to-string pairs (in __metadata__)

This narrow scope is a deliberate trade-off. By refusing to serialize arbitrary objects, the format remains safe by construction. Model architecture information is typically stored in a separate config.json file alongside the safetensors weights, as is standard practice in the Hugging Face ecosystem.
Another limitation is that safetensors files for full-precision models are large. A 7-billion-parameter model in FP16 occupies approximately 14 GB. Unlike GGUF, safetensors does not perform quantization; it stores whatever precision the tensors are already in. Users who need smaller files for local inference typically convert safetensors models to GGUF with quantization applied.
Safetensors can be installed from PyPI or conda:
```
pip install safetensors
conda install -c huggingface safetensors
```
Saving tensors from PyTorch:

```python
import torch
from safetensors.torch import save_file

tensors = {
    "embedding": torch.zeros((2, 2)),
    "attention": torch.zeros((2, 3))
}
save_file(tensors, "model.safetensors")
```

Loading tensors one at a time with safe_open:

```python
from safetensors import safe_open

tensors = {}
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        tensors[key] = f.get_tensor(key)
```

Loading all tensors at once, directly to a chosen device:

```python
from safetensors.torch import load_file

tensors = load_file("model.safetensors", device="cuda:0")
```
The safetensors library has been under active development since its initial release:
| Version | Date | Notes |
|---|---|---|
| 0.0.1 | September 22, 2022 | Initial release on PyPI |
| 0.2.0 | September 29, 2022 | First major iteration with Rust backend |
| 0.2.4 | November 7, 2022 | Performance improvements (Nicolas Patry announced 150x CPU speedup) |
| 0.2.5 | November 23, 2022 | Stability improvements |
| 0.3.x | Early 2023 | Added framework bindings for TensorFlow, JAX, Flax |
| 0.4.0 | October 2023 | Improved GPU loading and metadata support |
| 0.4.5 | September 2024 | Bug fixes and performance improvements |
| 0.5.0 | January 2025 | Further refinements |
| 0.7.0 | November 19, 2025 | Latest stable release |
The project has not yet reached version 1.0, but it carries a "Production/Stable" development status designation on PyPI. The library is maintained by Nicolas Patry, Daniel de Kok, and other Hugging Face engineers. As of version 0.7.0, there have been 45 total releases and 372 commits to the repository.
The safetensors project and the broader ML security community recommend the following practices:
- Prefer models distributed as .safetensors files, and treat pickle-based checkpoints (.bin, .pt, .pth) from untrusted sources as untrusted code.
- When a pickle checkpoint must be loaded, use torch.load(..., weights_only=True) (available in PyTorch 2.0+), which restricts deserialization to tensor data only.
- Scan pickle files with tools such as picklescan before loading, while remembering that scanners have known bypasses and cannot guarantee safety.
- Convert legacy pickle checkpoints to safetensors, for example with the conversion tool hosted on the Hugging Face Hub.