GGUF (GPT-Generated Unified Format) is a binary file format designed for storing large language models in a way that is optimized for fast loading and efficient inference on consumer hardware. Created by Georgi Gerganov as part of the llama.cpp ecosystem, GGUF replaced earlier formats (GGML, GGMF, and GGJT) on August 21, 2023. The format bundles model weights, tokenizer data, and metadata into a single self-contained file, making it the standard format for running quantized language models locally on CPUs and GPUs alike.
GGUF files are natively supported by llama.cpp and by popular inference tools such as Ollama, LM Studio, GPT4All, Jan, and koboldcpp. Tens of thousands of GGUF models are available on Hugging Face, where the format has first-class integration including a metadata viewer, an inference endpoint service, and a JavaScript parser library.
Towards the end of September 2022, Georgi Gerganov began work on GGML (GG's Machine Learning library), a lightweight C library for tensor operations. GGML was built with strict memory management and multi-threading in mind, and it provided the computational backbone for inference of transformer-based models without heavy dependencies on frameworks like PyTorch or TensorFlow.
The original GGML file format stored model weights alongside a flat list of untyped hyperparameters. While functional, this design had significant limitations: adding new metadata fields required changing the loader code, and there was no versioning to distinguish between format revisions.
In March 2023, shortly after Meta released the first LLaMA models, Gerganov created llama.cpp, a pure C/C++ implementation of the LLaMA inference code with no external dependencies. The project's core goal was to let users run large language models on everyday consumer hardware, including laptops and desktops without dedicated GPUs. Before llama.cpp, Gerganov had built whisper.cpp, a similar C/C++ port of OpenAI's Whisper speech-to-text model, using GGML as the underlying computation library.
As llama.cpp gained popularity, the shortcomings of the original GGML file format became more pressing. The community needed a format that could carry richer metadata, support new quantization schemes without breaking backward compatibility, and work cleanly with memory-mapped I/O.
The model file formats used by the llama.cpp project evolved in four stages:
| Format | Description | Key Characteristics |
|---|---|---|
| GGML | Original format | Unversioned; flat list of untyped hyperparameters; no alignment requirements; identified by magic number 0x67676d6c |
| GGMF | Versioned GGML | Added a version field to the header; only one version (v1) was ever released; same structure as GGML otherwise |
| GGJT | Aligned format | Aligned tensors for mmap support; three versions (v1, v2, v3) with progressively updated quantization schemes |
| GGUF | Current standard | Key-value metadata store; extensible design; self-contained with tokenizer; introduced August 21, 2023 |
GGUF is based on GGJT but replaces the flat hyperparameter list with a structured key-value metadata system. This single change made it possible to add new fields (architecture details, tokenizer vocabularies, training parameters) without modifying the loader or breaking older models. The GGUF specification was formalized through Pull Request #302 in the ggerganov/ggml repository, authored by philpax.
The GGUF format itself has gone through three internal versions:
| Version | Changes |
|---|---|
| v1 | Initial release with key-value metadata and tensor alignment |
| v2 | Changed most countable values (array lengths, tensor counts) from uint32 to uint64, enabling support for much larger models |
| v3 | Added big-endian support for architectures such as IBM POWER and s390x; current version |
Versions only increment for structural changes to the binary layout. New metadata fields and quantization types can be added without changing the format version.
A GGUF file is a contiguous binary blob divided into four sequential sections: a header, metadata key-value pairs, tensor information, and tensor data. All multi-byte values are stored in little-endian byte order by default, though version 3 of the format introduced support for big-endian files.
The header is a fixed structure at the start of every GGUF file:
| Field | Type | Size | Description |
|---|---|---|---|
| Magic number | uint32 | 4 bytes | 0x46554747 (the ASCII bytes for "GGUF" stored in little-endian order) |
| Version | uint32 | 4 bytes | Format version (currently 3) |
| Tensor count | uint64 | 8 bytes | Total number of tensors stored in the file |
| Metadata KV count | uint64 | 8 bytes | Number of key-value pairs in the metadata section |
The magic number allows any tool to quickly verify whether a file is a valid GGUF file. The version field lets loaders handle structural differences between format revisions.
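The fixed header above is simple enough to read with nothing but the standard library. The following sketch packs and parses the 24-byte header exactly as laid out in the table (all fields little-endian); it is an illustration, not the llama.cpp loader:

```python
import struct

# GGUF header layout (little-endian): magic (uint32), version (uint32),
# tensor_count (uint64), metadata KV count (uint64) = 24 bytes total.
GGUF_MAGIC = 0x46554747  # the ASCII bytes "GGUF" read as a little-endian uint32

def pack_header(version: int, tensor_count: int, kv_count: int) -> bytes:
    return struct.pack("<IIQQ", GGUF_MAGIC, version, tensor_count, kv_count)

def parse_header(blob: bytes) -> dict:
    magic, version, tensors, kvs = struct.unpack_from("<IIQQ", blob, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": tensors, "kv_count": kvs}

header = parse_header(pack_header(3, 291, 24))
```

Note that writing the magic as a little-endian uint32 produces the literal bytes `GGUF` at the start of the file, which is why a hex dump of any GGUF file begins with those four characters.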
The metadata section is a flexible key-value store that carries all the information needed to configure and run the model. Keys are hierarchical ASCII strings written in lower_snake_case with dot separators (for example, general.architecture or llama.context_length). Keys can be up to 65,535 bytes long.
The value types supported in metadata are:
| Type ID | Type Name | Description |
|---|---|---|
| 0 | UINT8 | 8-bit unsigned integer |
| 1 | INT8 | 8-bit signed integer |
| 2 | UINT16 | 16-bit unsigned integer |
| 3 | INT16 | 16-bit signed integer |
| 4 | UINT32 | 32-bit unsigned integer |
| 5 | INT32 | 32-bit signed integer |
| 6 | FLOAT32 | 32-bit IEEE 754 float |
| 7 | BOOL | Boolean (1 byte) |
| 8 | STRING | Length-prefixed UTF-8 string |
| 9 | ARRAY | Typed array of values |
| 10 | UINT64 | 64-bit unsigned integer |
| 11 | INT64 | 64-bit signed integer |
| 12 | FLOAT64 | 64-bit IEEE 754 double |
Common metadata keys include general.architecture (the model architecture, such as llama or mistral), general.name (human-readable model name), [arch].context_length (maximum sequence length), [arch].embedding_length (embedding dimensions), [arch].block_count (number of transformer layers), and tokenizer.ggml.model (the tokenizer type used, such as llama for SentencePiece BPE or gpt2 for byte-level BPE).
This metadata design is a major improvement over earlier formats. Because keys are self-describing strings, new metadata fields can be added without changing the format specification or breaking compatibility with tools that do not recognize the new keys.
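The on-disk encoding of a metadata entry follows directly from the tables above: a length-prefixed key string, a uint32 type tag, then the value. A minimal sketch for the STRING case (uint64 length prefix, UTF-8 bytes, no terminator), shown here for illustration only:

```python
import struct

GGUF_TYPE_STRING = 8  # type ID from the value-type table

def encode_string(s: str) -> bytes:
    # GGUF strings: uint64 byte length, then UTF-8 data (no NUL terminator).
    data = s.encode("utf-8")
    return struct.pack("<Q", len(data)) + data

def encode_string_kv(key: str, value: str) -> bytes:
    # A KV pair: key string, uint32 value-type tag, then the value itself.
    return encode_string(key) + struct.pack("<I", GGUF_TYPE_STRING) + encode_string(value)

def decode_string(blob: bytes, offset: int):
    (length,) = struct.unpack_from("<Q", blob, offset)
    start = offset + 8
    return blob[start:start + length].decode("utf-8"), start + length

pair = encode_string_kv("general.architecture", "llama")
```

Because the key is a self-describing string, a loader that does not recognize it can still skip the entry cleanly by decoding the type tag and value length.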
Following the metadata, the file contains an array of tensor information entries. Each entry records:
| Field | Type | Description |
|---|---|---|
| Name | string | Identifier for the tensor (up to 64 bytes, e.g. blk.0.attn_q.weight) |
| Number of dimensions | uint32 | Dimensionality of the tensor (up to 4 currently) |
| Dimensions | uint64[] | Size along each axis |
| Type | uint32 | The ggml_type enum value indicating the data type or quantization method |
| Offset | uint64 | Byte offset into the tensor data section |
After the tensor info array, the file is padded to a 32-byte alignment boundary (GGUF_DEFAULT_ALIGNMENT). The tensor data section then follows as a contiguous block of binary data containing the actual model weights. Each tensor's offset must be a multiple of the alignment value, which is essential for memory-mapped access.
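The padding rule is a standard round-up-to-multiple calculation. A small helper, assuming the default 32-byte alignment:

```python
def align_offset(offset: int, alignment: int = 32) -> int:
    # Round `offset` up to the next multiple of `alignment`
    # (GGUF_DEFAULT_ALIGNMENT is 32 bytes).
    return (offset + alignment - 1) // alignment * alignment

# e.g. if the tensor info array ends at byte 1234, tensor data begins at 1248
data_start = align_offset(1234)
```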
One of GGUF's most important design features is native support for memory-mapped I/O. When a GGUF file is loaded with mmap (the default in llama.cpp), the operating system maps the file directly into the process's virtual address space. Tensor weights can then be accessed through pointers without copying the data into a separate memory buffer.
This approach has several practical benefits:
- Near-instant startup: models that previously required a full read() into memory can begin inference almost immediately with mmap, since pages are loaded lazily on first access.
- Lower memory pressure: file-backed pages can be evicted and re-read by the operating system as needed rather than occupying anonymous memory.
- Sharing across processes: multiple processes loading the same model file share a single copy of the mapped pages.

The 32-byte alignment of tensor data in GGUF files is specifically chosen to support efficient mmap behavior. Aligned data avoids partial-page reads and ensures that individual tensors can be accessed without reading adjacent tensors' data. Justine Tunney's 2023 blog post "Edge AI Just Got Faster" demonstrated that mmap-based loading in llama.cpp delivered significant real-world speedups, especially when combined with operating system optimizations like madvise() hints for sequential access patterns.
Quantization is the process of reducing the numerical precision of model weights, trading a small amount of accuracy for significantly lower memory usage and faster inference. GGUF supports a wide range of quantization types, organized into several families: floating-point formats, legacy quantizations, K-quantizations, and I-quantizations (importance-matrix quantization).
| Type | Bits per Weight | Description |
|---|---|---|
| F32 | 32.0 | Standard IEEE 754 single-precision float |
| BF16 | 16.0 | Bfloat16; truncated single-precision format common in training |
| F16 | 16.0 | IEEE 754 half-precision float; common baseline for quality comparison |
These are unquantized formats that preserve full model precision. F16 is the most commonly used starting point for conversion, as it matches the precision at which most models are distributed.
The original quantization methods from the GGML era use simple round-to-nearest schemes with 32-weight blocks. The "_0" variants store only a scale factor (symmetric quantization), while the "_1" variants store both a scale and a minimum value (asymmetric quantization).
| Type | Bits per Weight | Block Size | Formula | Notes |
|---|---|---|---|---|
| Q4_0 | ~4.5 | 32 | w = q * scale | Symmetric; legacy |
| Q4_1 | ~5.0 | 32 | w = q * scale + min | Asymmetric; legacy |
| Q5_0 | ~5.5 | 32 | w = q * scale | Symmetric; legacy |
| Q5_1 | ~6.0 | 32 | w = q * scale + min | Asymmetric; legacy |
| Q8_0 | ~8.5 | 32 | w = q * scale | Near-lossless; still widely used |
| Q8_1 | ~9.0 | 32 | w = q * scale + min | Asymmetric 8-bit; legacy |
These legacy types have been largely superseded by the K-quantization family, which delivers better quality at comparable file sizes.
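The "_0"-style scheme described above is easy to sketch in a few lines. This is a simplified illustration at 8 bits (as in Q8_0); the real llama.cpp kernels store the scale as an f16 and pack the quants tightly:

```python
# Symmetric round-to-nearest block quantization over a 32-weight block,
# following the w = q * scale formula in the table above.

def quantize_block_q8_0(weights):
    assert len(weights) == 32
    amax = max(abs(w) for w in weights)
    scale = amax / 127.0 if amax else 1.0
    quants = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, quants

def dequantize_block(scale, quants):
    return [q * scale for q in quants]

block = [(-1) ** i * i / 31.0 for i in range(32)]
scale, q = quantize_block_q8_0(block)
recon = dequantize_block(scale, q)
```

The "_1" variants add a per-block minimum so the quantization grid need not be centered on zero, which helps for weight blocks with a skewed distribution.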
K-quantizations were introduced in llama.cpp Pull Request #1684 by Iwan Kawrakow (ikawrakow) in June 2023. They use a hierarchical "super-block" structure that groups multiple smaller blocks together with shared higher-precision scale factors. This approach produces lower quantization error than legacy methods at the same bit rate. The suffixes _S, _M, and _L stand for Small, Medium, and Large, indicating how aggressively quality-sensitive layers receive higher precision.
| Type | Bits per Weight | Super-Block Structure | Weight Formula | Description |
|---|---|---|---|---|
| Q2_K | 2.625 | 16 blocks of 16 weights | w = q * block_scale(4-bit) + block_min(4-bit) | Extreme compression; noticeable quality loss |
| Q3_K_S | 3.4375 | 16 blocks of 16 weights | w = q * block_scale(6-bit) | Small variant; all layers at Q3_K |
| Q3_K_M | ~3.9 | Mixed precision | Mixed | Medium variant; attention layers at higher precision |
| Q3_K_L | ~4.3 | Mixed precision | Mixed | Large variant; more layers kept at higher precision |
| Q4_K_S | 4.5 | 8 blocks of 32 weights | w = q * block_scale(6-bit) + block_min(6-bit) | Small variant; uniform Q4_K across layers |
| Q4_K_M | ~4.9 | Mixed precision | Mixed | Medium variant; widely recommended as the general-purpose default |
| Q5_K_S | 5.5 | 8 blocks of 32 weights | w = q * block_scale(6-bit) + block_min(6-bit) | Small variant; uniform Q5_K |
| Q5_K_M | ~5.7 | Mixed precision | Mixed | Medium variant; excellent quality-to-size ratio |
| Q6_K | 6.5625 | 16 blocks of 16 weights | w = q * block_scale(8-bit) | High quality; moderate compression |
| Q8_K | ~8.5 | 256 weights per block | w = q * block_scale | Used internally for intermediate quantization results |
The "mixed" variants (_M and _L) do not apply a uniform quantization across all layers. Instead, they assign higher-precision types to the most sensitive layers (such as attention output and feed-forward gate projections) while using the named type for less critical layers. For example, Q4_K_M may use Q6_K for half of the attention and feed-forward weight tensors, with Q4_K for the rest. This targeted approach preserves quality in the most impactful layers while still compressing the bulk of the model.
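The layer-selection idea can be sketched as a simple name-based mapping. This is illustrative only; the actual per-tensor decisions live in llama.cpp's quantization code and are more nuanced than a substring match:

```python
# Hypothetical sketch of mixed-precision type selection for a _M variant:
# "sensitive" tensors get the upgraded type, everything else the base type.
SENSITIVE_PARTS = ("attn_output", "ffn_gate")  # per the description above

def choose_quant(tensor_name: str, base: str = "Q4_K", upgraded: str = "Q6_K") -> str:
    if any(part in tensor_name for part in SENSITIVE_PARTS):
        return upgraded
    return base
```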
Q4_K_M is widely regarded as the best general-purpose quantization level, offering a good balance between model size, inference speed, and output quality. Q5_K_M is preferred when memory allows, as it preserves more of the original model's capabilities.
I-quantizations represent a newer family that uses an importance matrix to guide the quantization process. The importance matrix is generated by running a calibration dataset through the model to measure how much each weight contributes to the model's output. Weights that matter more receive higher precision. I-quantizations also use non-linear reconstruction and lookup tables, achieving better quality than K-quantizations at the same bit rate, especially at very low bit widths (2-3 bits). Most I-quant types use 256-weight super-blocks.
| Type | Bits per Weight | Description |
|---|---|---|
| IQ1_S | ~1.56 | Ultra-low 1-bit precision with importance scaling |
| IQ1_M | ~1.75 | Slightly higher precision 1-bit variant |
| IQ2_XXS | ~2.06 | 2-bit; extreme compression using learned lookup tables |
| IQ2_XS | ~2.31 | 2-bit; indices into learned lookup tables |
| IQ2_S | ~2.50 | 2-bit importance-matrix-based quantization |
| IQ2_M | ~2.93 | 2-bit; highest quality variant in the IQ2 family |
| IQ3_XXS | ~3.06 | 3-bit with importance scaling |
| IQ3_XS | ~3.30 | 3-bit extended variant |
| IQ3_S | ~3.44 | 3-bit; better quality than Q3_K at similar size |
| IQ3_M | ~3.76 | 3-bit; highest quality in the IQ3 family |
| IQ4_XS | ~4.25 | 4-bit with importance matrix; competes with Q4_K_M at smaller size |
| IQ4_NL | ~4.50 | 4-bit with non-linear mapping; uses 32-weight blocks |
I-quantizations tend to deliver better quality per byte than K-quantizations at the same bit rate. However, they require an importance matrix file generated by the llama-imatrix tool, which adds an extra step to the quantization workflow. For best results, the calibration dataset should be representative of the model's intended use case.
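The core idea of importance-guided quantization can be illustrated with a small scale search: instead of taking the naive max-abs scale, pick the block scale that minimizes importance-weighted squared error. This is a conceptual sketch only, not llama.cpp's actual IQ kernels (which also use non-linear grids and lookup tables):

```python
def quantize_importance(weights, importance, bits=3, n_grid=64):
    # Grid-search candidate scales, scoring each by importance-weighted error.
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(w) for w in weights) or 1.0
    best = (float("inf"), amax / qmax, [0] * len(weights))
    for step in range(1, n_grid + 1):
        scale = (amax / qmax) * step / n_grid
        q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
        err = sum(m * (w - qi * scale) ** 2
                  for w, m, qi in zip(weights, importance, q))
        if err < best[0]:
            best = (err, scale, q)
    return best[1], best[2]

weights = [0.9, -0.05, 0.04, 0.8, -0.7, 0.02, 0.01, -0.03]
importance = [10.0, 0.1, 0.1, 10.0, 10.0, 0.1, 0.1, 0.1]
scale, quants = quantize_importance(weights, importance)
```

Because the naive max-abs scale is one of the candidates, the result is never worse than plain round-to-nearest under the weighted error metric.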
Recent additions to the GGUF type system include:
| Type | Description |
|---|---|
| TQ1_0 | Ternary quantization; weights restricted to {-1, 0, +1}; for models like BitNet |
| TQ2_0 | Ternary quantization variant |
| MXFP4 | 4-bit Microscaling Block Floating Point; a newer standard for efficient computation |
Selecting the right quantization type involves balancing model quality, file size, memory usage, and inference speed. As a practical guide:
| Use Case | Recommended Types | Approximate Size (7B model) |
|---|---|---|
| Maximum quality | F16, Q8_0 | 13-14 GB |
| High quality, moderate compression | Q6_K, Q5_K_M | 5.5-7 GB |
| Balanced quality and size | Q4_K_M | ~4.6 GB |
| Aggressive compression, limited RAM | Q3_K_M, IQ3_M | 3.3-3.5 GB |
| Extreme compression | Q2_K, IQ2_M | 2.5-3 GB |
Perplexity measurements on Llama 2 7B illustrate the quality gradient: F16 achieves a perplexity of 7.4924, Q8_0 scores 7.4933 (essentially identical), Q4_K_M scores 7.5692 (minimal degradation), and Q2_K scores 8.6501 (noticeable but functional degradation).
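The approximate sizes in the table above follow directly from bits per weight: parameters × bpw / 8 bytes. A quick estimator (the 4.85 bpw figure for a Q4_K_M mix is an assumed average, and real files add some overhead for metadata, the tokenizer, and mixed-precision layers):

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    # Lower-bound file size estimate: weights only, no metadata overhead.
    return n_params * bits_per_weight / 8 / 1e9

f16_size = approx_size_gb(7e9, 16.0)   # matches the ~14 GB F16 figure above
q4_size = approx_size_gb(7e9, 4.85)    # assumed average bpw for a Q4_K_M mix
```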
Converting a model from a training framework like PyTorch to GGUF format is typically a two-step process handled by tools included in the llama.cpp repository.
The primary conversion script is convert_hf_to_gguf.py, which reads a model stored in Hugging Face Transformers format (typically SafeTensors files plus a tokenizer and config) and writes a GGUF file at full or reduced precision.
A basic conversion command:
```bash
python convert_hf_to_gguf.py ./models/my-model/ --outfile my-model-f16.gguf --outtype f16
```
The --outtype flag controls the initial precision. Common choices are f16 (16-bit float, lossless relative to typical model distributions), bf16, f32 (32-bit float), and q8_0 (8-bit quantized in one step). The script handles architecture detection, tokenizer extraction, and tensor layout automatically for all supported model families.
Once a model has been converted to a full-precision GGUF file, the llama-quantize tool (compiled as part of llama.cpp) reduces the precision to any of the supported quantization types:
```bash
./llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M
```
For improved quality at low bit widths, an importance matrix can be provided:
```bash
./llama-imatrix -m my-model-f16.gguf -f calibration-data.txt -o imatrix.dat
./llama-quantize --imatrix imatrix.dat my-model-f16.gguf my-model-IQ3_M.gguf IQ3_M
```
The importance matrix is generated by running a calibration dataset through the model using the llama-imatrix tool. This extra step is highly recommended for IQ-series quantizations and is also beneficial for K-quantizations at Q3 and below.
For users who prefer not to install llama.cpp locally, Hugging Face provides a web-based tool at ggml-org/gguf-my-repo that can convert and quantize models directly on Hugging Face infrastructure. Users select a source model repository and a target quantization type, and the tool produces a GGUF file hosted on Hugging Face.
Very large models (70B parameters and above) can produce GGUF files that exceed the upload limits of hosting platforms like Hugging Face (50 GB per file). The llama-gguf-split tool addresses this by splitting a GGUF file into multiple shards at the tensor level, never cutting a tensor in half:
```bash
llama-gguf-split --split --split-max-size 20G my-model.gguf my-model-split
```
Each shard retains a complete copy of the header and metadata, so any shard can be inspected independently. The shards can be reassembled with:
```bash
llama-gguf-split --merge my-model-split-00001-of-00004.gguf my-model-merged.gguf
```
llama.cpp's model loader can also read split GGUF files directly without merging them first.
GGUF has become the standard file format for local LLM inference. The following tools and libraries support GGUF files natively.
llama.cpp is the reference implementation and the project where GGUF originated. Written in C/C++ with minimal dependencies, it supports inference on x86 (with AVX, AVX2, AVX-512 SIMD optimizations), ARM (with NEON), Apple Metal, CUDA, HIP (AMD), Vulkan, and SYCL backends. llama.cpp reads GGUF files directly and uses memory-mapped I/O for efficient model loading. It also supports hybrid CPU/GPU inference, where some transformer layers run on the GPU and others on the CPU, controlled by the --n-gpu-layers parameter.
Ollama wraps llama.cpp in a user-friendly command-line tool with a built-in model registry. Users can pull models by name (e.g., ollama run llama3), and Ollama handles downloading the appropriate GGUF file, selecting a quantization level, and configuring GPU offloading. It exposes an OpenAI-compatible REST API for integration with other applications. Ollama uses GGUF as its internal model storage format, wrapping GGUF files in its own Modelfile configuration layer.
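Importing a local GGUF file into Ollama takes only a minimal Modelfile; the file name and parameter below are placeholders:

```
# Modelfile: package a local GGUF file as an Ollama model
FROM ./my-model-Q4_K_M.gguf
PARAMETER temperature 0.7
```

Running `ollama create my-model -f Modelfile` registers the model, after which `ollama run my-model` starts an interactive session.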
LM Studio is a desktop application with a graphical interface for browsing, downloading, and running GGUF models. It connects directly to the Hugging Face model hub, provides one-click downloads, and includes a chat interface for testing. LM Studio also offers a local server mode that exposes an OpenAI-compatible API, typically on port 1234.
GPT4All, developed by Nomic AI, is a desktop application that prioritizes accessibility and privacy for local model inference. It supports GGUF files and provides a simplified interface for users who want to run language models without technical barriers. GPT4All also includes a local document question-answering feature that can index files on the user's machine.
For programmatic access, two Python libraries provide bindings to GGUF-based inference:

- llama-cpp-python, the most widely used Python binding for llama.cpp, which exposes both a low-level API and a high-level completion interface.
- ctransformers, which offers a Transformers-style interface and can download GGUF models directly from the Hugging Face Hub via its from_pretrained method.

Additional tools that support GGUF include Jan (an open-source desktop client), koboldcpp (a llama.cpp fork with a web UI focused on creative writing), text-generation-webui (a popular web interface for LLM inference), and Unsloth (which can export fine-tuned models directly to GGUF). The Wolfram Language added native GGUF import support, and the Hugging Face Transformers library can also load GGUF models via its from_pretrained interface.
Hugging Face has become the primary distribution platform for GGUF models. The platform hosts tens of thousands of pre-quantized GGUF files, and its web interface includes built-in tools for working with the format.
The Hugging Face Hub includes a viewer for GGUF files that displays metadata key-value pairs and tensor information (name, shape, precision) directly on the model page. This viewer allows users to inspect a model's architecture, context length, quantization details, and tokenizer configuration without downloading the file.
Hugging Face maintains the @huggingface/gguf npm package, a JavaScript parser that can read GGUF metadata and tensor information from remotely hosted files:
```js
import { gguf } from "@huggingface/gguf";

const { metadata, tensorInfos } = await gguf(
  "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf"
);
```
This enables web applications to inspect GGUF models hosted on the Hub without downloading the full file.
Hugging Face Inference Endpoints supports GGUF models out of the box. Users can deploy any GGUF model as a dedicated endpoint with a few clicks, selecting their preferred hardware configuration.
A vibrant community of quantizers publishes pre-quantized GGUF models on Hugging Face. TheBloke (Tom Jobbins) was one of the earliest and most prolific contributors, publishing hundreds of models in multiple quantization variants starting in mid-2023. Other active quantization groups include bartowski (known for rapid turnaround on newly released models), QuantFactory, and lmstudio-community. These repositories typically provide each model in several quantization levels (from Q2_K through Q8_0 and various IQ types), along with benchmarks and recommendations in the model card.
Users can browse all GGUF models on Hugging Face by filtering with the gguf library tag at hf.co/models?library=gguf.
GGUF occupies a specific niche in the ecosystem of model file formats. Understanding how it compares to alternatives helps clarify when to use each format.
| Feature | GGUF | SafeTensors | ONNX | Pickle (.bin) |
|---|---|---|---|---|
| Primary use case | Quantized local inference | Training, fine-tuning, GPU serving | Cross-framework deployment | Legacy PyTorch serialization |
| Stores weights | Yes | Yes | Yes | Yes |
| Stores tokenizer | Yes | No (separate files) | No (separate files) | No (separate files) |
| Stores computation graph | No | No | Yes | Yes (via Python objects) |
| Stores metadata | Yes (rich key-value) | Minimal (tensor names/shapes) | Yes (graph-level) | Python object state |
| Built-in quantization | Yes (1-bit to 8-bit, plus FP16/BF16) | No | Limited (via ONNX Runtime tools) | No |
| Memory mapping | Yes | Yes (zero-copy) | Varies by runtime | No |
| Security | Safe (no code execution) | Safe (no code execution) | Safe (no code execution) | Unsafe (arbitrary code execution via pickle) |
| Self-contained | Yes (single file) | No (multiple files needed) | Yes (graph + weights) | No (needs separate files) |
| Primary ecosystem | llama.cpp, Ollama, LM Studio | Hugging Face Transformers, vLLM | ONNX Runtime, TensorRT, OpenVINO | PyTorch |
SafeTensors is the default format for the Hugging Face Transformers library and is widely used for training, fine-tuning, and GPU-based serving. It stores only tensor data with minimal metadata; the model architecture, tokenizer, and configuration live in separate files. SafeTensors supports zero-copy loading and is safe against arbitrary code execution. However, SafeTensors does not include built-in quantization. Running a 7B parameter model from SafeTensors at FP16 requires roughly 14 GB of VRAM, while a Q4_K_M GGUF of the same model fits in approximately 4.6 GB of RAM.
ONNX (Open Neural Network Exchange) stores the full computation graph of a model alongside its weights, making ONNX files portable across frameworks (PyTorch, TensorFlow, MXNet) and hardware targets (CPUs, GPUs, NPUs, custom accelerators). ONNX is better suited for deployment pipelines that require cross-platform compatibility or hardware-specific optimizations through ONNX Runtime, TensorRT, or OpenVINO. GGUF, by contrast, is optimized for the specific case of running transformer-based LLMs on consumer hardware with aggressive quantization.
Python's pickle format was historically the default serialization method for PyTorch models (stored as .bin files). Pickle files can execute arbitrary Python code during deserialization, which poses a serious security risk when loading models from untrusted sources. Both GGUF and SafeTensors were designed in part to eliminate this vulnerability. Pickle also does not support mmap-based loading and requires the full file to be loaded into memory.
GGUF is not limited to a single model family. The convert_hf_to_gguf.py script in the llama.cpp repository supports conversion for a broad and growing range of transformer-based architectures, including the Llama family (LLaMA, Llama 2, Llama 3), Mistral and Mixtral, Falcon, Phi, Qwen, Gemma, GPT-2, BLOOM, and StarCoder, among many others.
New architectures are added regularly by community contributors through pull requests to the llama.cpp repository.
While GGUF is primarily associated with language models, the format can store any collection of tensors with metadata. Projects like stable-diffusion.cpp have used GGUF for diffusion models, though this remains less common than its application in the language model space.
The GGUF format is developed as part of the broader ggml ecosystem. In 2023, Georgi Gerganov founded ggml.ai, a company dedicated to supporting the development of the GGML library and related tools. In 2025, the GGML project joined Hugging Face, bringing the core llama.cpp team under the Hugging Face umbrella while continuing to develop the project as open-source software.
The llama.cpp GitHub repository (under the ggml-org organization) serves as the central hub for GGUF format development. Format changes are proposed and discussed through GitHub issues and pull requests, and the specification document is maintained alongside the source code at docs/gguf.md.
The community around GGUF and local LLM inference is active across several platforms. The llama.cpp GitHub Discussions board hosts technical conversations about quantization methods, performance optimization, and new model support. Reddit communities such as r/LocalLLaMA serve as gathering places for users sharing benchmark results, hardware recommendations, and quantization comparisons.
GGUF's growth has been driven by the broader trend toward running AI models locally rather than relying on cloud APIs. Privacy concerns, cost reduction, offline availability, and the desire for customization have all contributed to demand for efficient local inference, and GGUF has positioned itself as the format that makes this practical on everyday hardware.