GGUF (GPT-Generated Unified Format) is a binary file format designed for storing large language models in a way that is optimized for fast loading and efficient inference on consumer hardware. Created by Georgi Gerganov as part of the llama.cpp ecosystem, GGUF replaced earlier formats (GGML, GGMF, and GGJT) on August 21, 2023. The format bundles model weights, tokenizer data, and metadata into a single self-contained file, making it the standard format for running quantized language models locally on CPUs and GPUs alike.
GGUF files are natively supported by llama.cpp and by popular inference tools such as Ollama, LM Studio, GPT4All, Jan, and koboldcpp. Tens of thousands of GGUF models are available on Hugging Face, where the format has first-class integration including a metadata viewer, an inference endpoint service, and a JavaScript parser library.
Towards the end of September 2022, Georgi Gerganov began work on GGML (GG's Machine Learning library), a lightweight C library for tensor operations. GGML was built with strict memory management and multi-threading in mind, and it provided the computational backbone for inference of transformer-based models without heavy dependencies on frameworks like PyTorch or TensorFlow.
The original GGML file format stored model weights alongside a flat list of untyped hyperparameters. While functional, this design had significant limitations: adding new metadata fields required changing the loader code, and there was no versioning to distinguish between format revisions.
In March 2023, shortly after Meta released the first LLaMA models, Gerganov created llama.cpp, a pure C/C++ implementation of the LLaMA inference code with no external dependencies. The project's core goal was to let users run large language models on everyday consumer hardware, including laptops and desktops without dedicated GPUs. Before llama.cpp, Gerganov had built whisper.cpp, a similar C/C++ port of OpenAI's Whisper speech-to-text model, using GGML as the underlying computation library.
As llama.cpp gained popularity, the shortcomings of the original GGML file format became more pressing. The community needed a format that could carry richer metadata, support new quantization schemes without breaking backward compatibility, and work cleanly with memory-mapped I/O.
The model file formats used by the llama.cpp project evolved in four stages:
| Format | Description | Key Characteristics |
|---|---|---|
| GGML | Original format | Unversioned; flat list of untyped hyperparameters; no alignment requirements; identified by magic number 0x67676d6c |
| GGMF | Versioned GGML | Added a version field to the header; only one version (v1) was ever released; same structure as GGML otherwise |
| GGJT | Aligned format | Aligned tensors for mmap support; three versions (v1, v2, v3) with progressively updated quantization schemes |
| GGUF | Current standard | Key-value metadata store; extensible design; self-contained with tokenizer; introduced August 21, 2023 |
GGUF is based on GGJT but replaces the flat hyperparameter list with a structured key-value metadata system. This single change made it possible to add new fields (architecture details, tokenizer vocabularies, training parameters) without modifying the loader or breaking older models. The GGUF specification was formalized through Pull Request #302 in the ggerganov/ggml repository, authored by philpax.
The GGUF format itself has gone through three internal versions:
| Version | Changes |
|---|---|
| v1 | Initial release with key-value metadata and tensor alignment |
| v2 | Changed most countable values (array lengths, tensor counts) from uint32 to uint64, enabling support for much larger models |
| v3 | Added big-endian support for architectures such as IBM POWER and s390x; current version |
Versions only increment for structural changes to the binary layout. New metadata fields and quantization types can be added without changing the format version.
A GGUF file is a contiguous binary blob divided into four sequential sections: a header, metadata key-value pairs, tensor information, and tensor data. All multi-byte values are stored in little-endian byte order by default, though version 3 of the format introduced support for big-endian files.
The header is a fixed structure at the start of every GGUF file:
| Field | Type | Size | Description |
|---|---|---|---|
| Magic number | uint32 | 4 bytes | 0x46554747 (the ASCII bytes for "GGUF" stored in little-endian order) |
| Version | uint32 | 4 bytes | Format version (currently 3) |
| Tensor count | uint64 | 8 bytes | Total number of tensors stored in the file |
| Metadata KV count | uint64 | 8 bytes | Number of key-value pairs in the metadata section |
The magic number allows any tool to quickly verify whether a file is a valid GGUF file. The version field lets loaders handle structural differences between format revisions.
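The fixed header above is simple enough to read with nothing but the standard library. The following sketch packs and parses the 24-byte header exactly as laid out in the table (all fields little-endian); it is an illustration, not the llama.cpp loader:

```python
import struct

# GGUF header layout (little-endian): magic (uint32), version (uint32),
# tensor_count (uint64), metadata KV count (uint64) = 24 bytes total.
GGUF_MAGIC = 0x46554747  # the ASCII bytes "GGUF" read as a little-endian uint32

def pack_header(version: int, tensor_count: int, kv_count: int) -> bytes:
    return struct.pack("<IIQQ", GGUF_MAGIC, version, tensor_count, kv_count)

def parse_header(blob: bytes) -> dict:
    magic, version, tensors, kvs = struct.unpack_from("<IIQQ", blob, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": tensors, "kv_count": kvs}

header = parse_header(pack_header(3, 291, 24))
```

Note that writing the magic as a little-endian uint32 produces the literal bytes `GGUF` at the start of the file, which is why a hex dump of any GGUF file begins with those four characters.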
The metadata section is a flexible key-value store that carries all the information needed to configure and run the model. Keys are hierarchical ASCII strings written in lower_snake_case with dot separators (for example, general.architecture or llama.context_length). Keys can be up to 65,535 bytes long.
The value types supported in metadata are:
| Type ID | Type Name | Description |
|---|---|---|
| 0 | UINT8 | 8-bit unsigned integer |
| 1 | INT8 | 8-bit signed integer |
| 2 | UINT16 | 16-bit unsigned integer |
| 3 | INT16 | 16-bit signed integer |
| 4 | UINT32 | 32-bit unsigned integer |
| 5 | INT32 | 32-bit signed integer |
| 6 | FLOAT32 | 32-bit IEEE 754 float |
| 7 | BOOL | Boolean (1 byte) |
| 8 | STRING | Length-prefixed UTF-8 string |
| 9 | ARRAY | Typed array of values |
| 10 | UINT64 | 64-bit unsigned integer |
| 11 | INT64 | 64-bit signed integer |
| 12 | FLOAT64 | 64-bit IEEE 754 double |
Common metadata keys include general.architecture (the model architecture, such as llama or mistral), general.name (human-readable model name), [arch].context_length (maximum sequence length), [arch].embedding_length (embedding dimensions), [arch].block_count (number of transformer layers), and tokenizer.ggml.model (the tokenizer type used, such as llama for SentencePiece BPE or gpt2 for byte-level BPE).
This metadata design is a major improvement over earlier formats. Because keys are self-describing strings, new metadata fields can be added without changing the format specification or breaking compatibility with tools that do not recognize the new keys.
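The on-disk encoding of a metadata entry follows directly from the tables above: a length-prefixed key string, a uint32 type tag, then the value. A minimal sketch for the STRING case (uint64 length prefix, UTF-8 bytes, no terminator), shown here for illustration only:

```python
import struct

GGUF_TYPE_STRING = 8  # type ID from the value-type table

def encode_string(s: str) -> bytes:
    # GGUF strings: uint64 byte length, then UTF-8 data (no NUL terminator).
    data = s.encode("utf-8")
    return struct.pack("<Q", len(data)) + data

def encode_string_kv(key: str, value: str) -> bytes:
    # A KV pair: key string, uint32 value-type tag, then the value itself.
    return encode_string(key) + struct.pack("<I", GGUF_TYPE_STRING) + encode_string(value)

def decode_string(blob: bytes, offset: int):
    (length,) = struct.unpack_from("<Q", blob, offset)
    start = offset + 8
    return blob[start:start + length].decode("utf-8"), start + length

pair = encode_string_kv("general.architecture", "llama")
```

Because the key is a self-describing string, a loader that does not recognize it can still skip the entry cleanly by decoding the type tag and value length.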
Following the metadata, the file contains an array of tensor information entries. Each entry records:
| Field | Type | Description |
|---|---|---|
| Name | string | Identifier for the tensor (up to 64 bytes, e.g. blk.0.attn_q.weight) |
| Number of dimensions | uint32 | Dimensionality of the tensor (up to 4 currently) |
| Dimensions | uint64[] | Size along each axis |
| Type | uint32 | The ggml_type enum value indicating the data type or quantization method |
| Offset | uint64 | Byte offset into the tensor data section |
After the tensor info array, the file is padded to a 32-byte alignment boundary (GGUF_DEFAULT_ALIGNMENT). The tensor data section then follows as a contiguous block of binary data containing the actual model weights. Each tensor's offset must be a multiple of the alignment value, which is essential for memory-mapped access.
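The padding rule is a standard round-up-to-multiple calculation. A small helper, assuming the default 32-byte alignment:

```python
def align_offset(offset: int, alignment: int = 32) -> int:
    # Round `offset` up to the next multiple of `alignment`
    # (GGUF_DEFAULT_ALIGNMENT is 32 bytes).
    return (offset + alignment - 1) // alignment * alignment

# e.g. if the tensor info array ends at byte 1234, tensor data begins at 1248
data_start = align_offset(1234)
```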
One of GGUF's most important design features is native support for memory-mapped I/O. When a GGUF file is loaded with mmap (the default in llama.cpp), the operating system maps the file directly into the process's virtual address space. Tensor weights can then be accessed through pointers without copying the data into a separate memory buffer.
This approach has several practical benefits:
- Near-instant startup: models that previously required a full read() into memory can begin inference almost immediately with mmap, since pages are loaded lazily on first access.
- Lower memory pressure: file-backed pages can be evicted and re-read by the operating system as needed rather than occupying anonymous memory.
- Sharing across processes: multiple processes loading the same model file share a single copy of the mapped pages.

The 32-byte alignment of tensor data in GGUF files is specifically chosen to support efficient mmap behavior. Aligned data avoids partial-page reads and ensures that individual tensors can be accessed without reading adjacent tensors' data. Justine Tunney's 2023 blog post "Edge AI Just Got Faster" demonstrated that mmap-based loading in llama.cpp delivered significant real-world speedups, especially when combined with operating system optimizations like madvise() hints for sequential access patterns.
Quantization is the process of reducing the numerical precision of model weights, trading a small amount of accuracy for significantly lower memory usage and faster inference. GGUF supports a wide range of quantization types, organized into several families: floating-point formats, legacy quantizations, K-quantizations, and I-quantizations (importance-matrix quantization).
| Type | Bits per Weight | Description |
|---|---|---|
| F32 | 32.0 | Standard IEEE 754 single-precision float |
| BF16 | 16.0 | Bfloat16; truncated single-precision format common in training |
| F16 | 16.0 | IEEE 754 half-precision float; common baseline for quality comparison |
These are unquantized formats that preserve full model precision. F16 is the most commonly used starting point for conversion, as it matches the precision at which most models are distributed.
The original quantization methods from the GGML era use simple round-to-nearest schemes with 32-weight blocks. The "_0" variants store only a scale factor (symmetric quantization), while the "_1" variants store both a scale and a minimum value (asymmetric quantization).
| Type | Bits per Weight | Block Size | Formula | Notes |
|---|---|---|---|---|
| Q4_0 | ~4.5 | 32 | w = q * scale | Symmetric; legacy |
| Q4_1 | ~5.0 | 32 | w = q * scale + min | Asymmetric; legacy |
| Q5_0 | ~5.5 | 32 | w = q * scale | Symmetric; legacy |
| Q5_1 | ~6.0 | 32 | w = q * scale + min | Asymmetric; legacy |
| Q8_0 | ~8.5 | 32 | w = q * scale | Near-lossless; still widely used |
| Q8_1 | ~9.0 | 32 | w = q * scale + min | Asymmetric 8-bit; legacy |
These legacy types have been largely superseded by the K-quantization family, which delivers better quality at comparable file sizes.
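The "_0"-style scheme described above is easy to sketch in a few lines. This is a simplified illustration at 8 bits (as in Q8_0); the real llama.cpp kernels store the scale as an f16 and pack the quants tightly:

```python
# Symmetric round-to-nearest block quantization over a 32-weight block,
# following the w = q * scale formula in the table above.

def quantize_block_q8_0(weights):
    assert len(weights) == 32
    amax = max(abs(w) for w in weights)
    scale = amax / 127.0 if amax else 1.0
    quants = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, quants

def dequantize_block(scale, quants):
    return [q * scale for q in quants]

block = [(-1) ** i * i / 31.0 for i in range(32)]
scale, q = quantize_block_q8_0(block)
recon = dequantize_block(scale, q)
```

The "_1" variants add a per-block minimum so the quantization grid need not be centered on zero, which helps for weight blocks with a skewed distribution.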
K-quantizations were introduced in llama.cpp Pull Request #1684 by Iwan Kawrakow (ikawrakow) in June 2023. They use a hierarchical "super-block" structure that groups multiple smaller blocks together with shared higher-precision scale factors. This approach produces lower quantization error than legacy methods at the same bit rate. The suffixes _S, _M, and _L stand for Small, Medium, and Large, indicating how aggressively quality-sensitive layers receive higher precision.
| Type | Bits per Weight | Super-Block Structure | Weight Formula | Description |
|---|---|---|---|---|
| Q2_K | 2.625 | 16 blocks of 16 weights | w = q * block_scale(4-bit) + block_min(4-bit) | Extreme compression; noticeable quality loss |
| Q3_K_S | 3.4375 | 16 blocks of 16 weights | w = q * block_scale(6-bit) | Small variant; all layers at Q3_K |
| Q3_K_M | ~3.9 | Mixed precision | Mixed | Medium variant; attention layers at higher precision |
| Q3_K_L | ~4.3 | Mixed precision | Mixed | Large variant; more layers kept at higher precision |
| Q4_K_S | 4.5 | 8 blocks of 32 weights | w = q * block_scale(6-bit) + block_min(6-bit) | Small variant; uniform Q4_K across layers |
| Q4_K_M | ~4.9 | Mixed precision | Mixed | Medium variant; widely recommended as the general-purpose default |
| Q5_K_S | 5.5 | 8 blocks of 32 weights | w = q * block_scale(6-bit) + block_min(6-bit) | Small variant; uniform Q5_K |
| Q5_K_M | ~5.7 | Mixed precision | Mixed | Medium variant; excellent quality-to-size ratio |
| Q6_K | 6.5625 | 16 blocks of 16 weights | w = q * block_scale(8-bit) | High quality; moderate compression |
| Q8_K | ~8.5 | 256 weights per block | w = q * block_scale | Used internally for intermediate quantization results |
The "mixed" variants (_M and _L) do not apply a uniform quantization across all layers. Instead, they assign higher-precision types to the most sensitive layers (such as attention output and feed-forward gate projections) while using the named type for less critical layers. For example, Q4_K_M may use Q6_K for half of the attention and feed-forward weight tensors, with Q4_K for the rest. This targeted approach preserves quality in the most impactful layers while still compressing the bulk of the model.
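The layer-selection idea can be sketched as a simple name-based mapping. This is illustrative only; the actual per-tensor decisions live in llama.cpp's quantization code and are more nuanced than a substring match:

```python
# Hypothetical sketch of mixed-precision type selection for a _M variant:
# "sensitive" tensors get the upgraded type, everything else the base type.
SENSITIVE_PARTS = ("attn_output", "ffn_gate")  # per the description above

def choose_quant(tensor_name: str, base: str = "Q4_K", upgraded: str = "Q6_K") -> str:
    if any(part in tensor_name for part in SENSITIVE_PARTS):
        return upgraded
    return base
```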
Q4_K_M is widely regarded as the best general-purpose quantization level, offering a good balance between model size, inference speed, and output quality. Q5_K_M is preferred when memory allows, as it preserves more of the original model's capabilities.
I-quantizations represent a newer family that uses an importance matrix to guide the quantization process. The importance matrix is generated by running a calibration dataset through the model to measure how much each weight contributes to the model's output. Weights that matter more receive higher precision. I-quantizations also use non-linear reconstruction and lookup tables, achieving better quality than K-quantizations at the same bit rate, especially at very low bit widths (2-3 bits). Most I-quant types use 256-weight super-blocks.
| Type | Bits per Weight | Description |
|---|---|---|
| IQ1_S | ~1.56 | Ultra-low 1-bit precision with importance scaling |
| IQ1_M | ~1.75 | Slightly higher precision 1-bit variant |
| IQ2_XXS | ~2.06 | 2-bit; extreme compression using learned lookup tables |
| IQ2_XS | ~2.31 | 2-bit; indices into learned lookup tables |
| IQ2_S | ~2.50 | 2-bit importance-matrix-based quantization |
| IQ2_M | ~2.93 | 2-bit; highest quality variant in the IQ2 family |
| IQ3_XXS | ~3.06 | 3-bit with importance scaling |
| IQ3_XS | ~3.30 | 3-bit extended variant |
| IQ3_S | ~3.44 | 3-bit; better quality than Q3_K at similar size |
| IQ3_M | ~3.76 | 3-bit; highest quality in the IQ3 family |
| IQ4_XS | ~4.25 | 4-bit with importance matrix; competes with Q4_K_M at smaller size |
| IQ4_NL | ~4.50 | 4-bit with non-linear mapping; uses 32-weight blocks |
I-quantizations tend to deliver better quality per byte than K-quantizations at the same bit rate. However, they require an importance matrix file generated by the llama-imatrix tool, which adds an extra step to the quantization workflow. For best results, the calibration dataset should be representative of the model's intended use case.
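The core idea of importance-guided quantization can be illustrated with a small scale search: instead of taking the naive max-abs scale, pick the block scale that minimizes importance-weighted squared error. This is a conceptual sketch only, not llama.cpp's actual IQ kernels (which also use non-linear grids and lookup tables):

```python
def quantize_importance(weights, importance, bits=3, n_grid=64):
    # Grid-search candidate scales, scoring each by importance-weighted error.
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(w) for w in weights) or 1.0
    best = (float("inf"), amax / qmax, [0] * len(weights))
    for step in range(1, n_grid + 1):
        scale = (amax / qmax) * step / n_grid
        q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
        err = sum(m * (w - qi * scale) ** 2
                  for w, m, qi in zip(weights, importance, q))
        if err < best[0]:
            best = (err, scale, q)
    return best[1], best[2]

weights = [0.9, -0.05, 0.04, 0.8, -0.7, 0.02, 0.01, -0.03]
importance = [10.0, 0.1, 0.1, 10.0, 10.0, 0.1, 0.1, 0.1]
scale, quants = quantize_importance(weights, importance)
```

Because the naive max-abs scale is one of the candidates, the result is never worse than plain round-to-nearest under the weighted error metric.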
Recent additions to the GGUF type system include:
| Type | Description |
|---|---|
| TQ1_0 | Ternary quantization; weights restricted to {-1, 0, +1}; for models like BitNet |
| TQ2_0 | Ternary quantization variant |
| MXFP4 | 4-bit Microscaling Block Floating Point; a newer standard for efficient computation |
Selecting the right quantization type involves balancing model quality, file size, memory usage, and inference speed. As a practical guide:
| Use Case | Recommended Types | Approximate Size (7B model) |
|---|---|---|
| Maximum quality | F16, Q8_0 | 13-14 GB |
| High quality, moderate compression | Q6_K, Q5_K_M | 5.5-7 GB |
| Balanced quality and size | Q4_K_M | ~4.6 GB |
| Aggressive compression, limited RAM | Q3_K_M, IQ3_M | 3.3-3.5 GB |
| Extreme compression | Q2_K, IQ2_M | 2.5-3 GB |
Perplexity measurements on Llama 2 7B illustrate the quality gradient: F16 achieves a perplexity of 7.4924, Q8_0 scores 7.4933 (essentially identical), Q4_K_M scores 7.5692 (minimal degradation), and Q2_K scores 8.6501 (noticeable but functional degradation).
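The approximate sizes in the table above follow directly from bits per weight: parameters × bpw / 8 bytes. A quick estimator (the 4.85 bpw figure for a Q4_K_M mix is an assumed average, and real files add some overhead for metadata, the tokenizer, and mixed-precision layers):

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    # Lower-bound file size estimate: weights only, no metadata overhead.
    return n_params * bits_per_weight / 8 / 1e9

f16_size = approx_size_gb(7e9, 16.0)   # matches the ~14 GB F16 figure above
q4_size = approx_size_gb(7e9, 4.85)    # assumed average bpw for a Q4_K_M mix
```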
Converting a model from a training framework like PyTorch to GGUF format is typically a two-step process handled by tools included in the llama.cpp repository.
The primary conversion script is convert_hf_to_gguf.py, which reads a model stored in Hugging Face Transformers format (typically SafeTensors files plus a tokenizer and config) and writes a GGUF file at full or reduced precision.
A basic conversion command:
```bash
python convert_hf_to_gguf.py ./models/my-model/ --outfile my-model-f16.gguf --outtype f16
```
The --outtype flag controls the initial precision. Common choices are f16 (16-bit float, lossless relative to typical model distributions), bf16, f32 (32-bit float), and q8_0 (8-bit quantized in one step). The script handles architecture detection, tokenizer extraction, and tensor layout automatically for all supported model families.
Once a model has been converted to a full-precision GGUF file, the llama-quantize tool (compiled as part of llama.cpp) reduces the precision to any of the supported quantization types:
```bash
./llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M
```
For improved quality at low bit widths, an importance matrix can be provided:
```bash
./llama-imatrix -m my-model-f16.gguf -f calibration-data.txt -o imatrix.dat
./llama-quantize --imatrix imatrix.dat my-model-f16.gguf my-model-IQ3_M.gguf IQ3_M
```
The importance matrix is generated by running a calibration dataset through the model using the llama-imatrix tool. This extra step is highly recommended for IQ-series quantizations and is also beneficial for K-quantizations at Q3 and below.
For users who prefer not to install llama.cpp locally, Hugging Face provides a web-based tool at ggml-org/gguf-my-repo that can convert and quantize models directly on Hugging Face infrastructure. Users select a source model repository and a target quantization type, and the tool produces a GGUF file hosted on Hugging Face.
Very large models (70B parameters and above) can produce GGUF files that exceed the upload limits of hosting platforms like Hugging Face (50 GB per file). The llama-gguf-split tool addresses this by splitting a GGUF file into multiple shards at the tensor level, never cutting a tensor in half:
```bash
llama-gguf-split --split --split-max-size 20G my-model.gguf my-model-split
```
Each shard retains a complete copy of the header and metadata, so any shard can be inspected independently. The shards can be reassembled with:
```bash
llama-gguf-split --merge my-model-split-00001-of-00004.gguf my-model-merged.gguf
```
llama.cpp's model loader can also read split GGUF files directly without merging them first.
GGUF has become the standard file format for local LLM inference. The following tools and libraries support GGUF files natively.
llama.cpp is the reference implementation and the project where GGUF originated. Written in C/C++ with minimal dependencies, it supports inference on x86 (with AVX, AVX2, AVX-512 SIMD optimizations), ARM (with NEON), Apple Metal, CUDA, HIP (AMD), Vulkan, and SYCL backends. llama.cpp reads GGUF files directly and uses memory-mapped I/O for efficient model loading. It also supports hybrid CPU/GPU inference, where some transformer layers run on the GPU and others on the CPU, controlled by the --n-gpu-layers parameter.
Ollama wraps llama.cpp in a user-friendly command-line tool with a built-in model registry. Users can pull models by name (e.g., ollama run llama3), and Ollama handles downloading the appropriate GGUF file, selecting a quantization level, and configuring GPU offloading. It exposes an OpenAI-compatible REST API for integration with other applications. Ollama uses GGUF as its internal model storage format, wrapping GGUF files in its own Modelfile configuration layer.
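Importing a local GGUF file into Ollama takes only a minimal Modelfile; the file name and parameter below are placeholders:

```
# Modelfile: package a local GGUF file as an Ollama model
FROM ./my-model-Q4_K_M.gguf
PARAMETER temperature 0.7
```

Running `ollama create my-model -f Modelfile` registers the model, after which `ollama run my-model` starts an interactive session.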
LM Studio is a desktop application with a graphical interface for browsing, downloading, and running GGUF models. It connects directly to the Hugging Face model hub, provides one-click downloads, and includes a chat interface for testing. LM Studio also offers a local server mode that exposes an OpenAI-compatible API, typically on port 1234.
GPT4All, developed by Nomic AI, is a desktop application that prioritizes accessibility and privacy for local model inference. It supports GGUF files and provides a simplified interface for users who want to run language models without technical barriers. GPT4All also includes a local document question-answering feature that can index files on the user's machine.
For programmatic access, two Python libraries provide bindings to GGUF-based inference:

- llama-cpp-python, the most widely used Python binding for llama.cpp, which exposes both a low-level API and a high-level completion interface.
- ctransformers, which offers a Transformers-style interface and can download GGUF models directly from the Hugging Face Hub via its from_pretrained method.

Additional tools that support GGUF include Jan (an open-source desktop client), koboldcpp (a llama.cpp fork with a web UI focused on creative writing), text-generation-webui (a popular web interface for LLM inference), and Unsloth (which can export fine-tuned models directly to GGUF). The Wolfram Language added native GGUF import support, and the Hugging Face Transformers library can also load GGUF models via its from_pretrained interface.
Hugging Face has become the primary distribution platform for GGUF models. The platform hosts tens of thousands of pre-quantized GGUF files, and its web interface includes built-in tools for working with the format.
The Hugging Face Hub includes a viewer for GGUF files that displays metadata key-value pairs and tensor information (name, shape, precision) directly on the model page. This viewer allows users to inspect a model's architecture, context length, quantization details, and tokenizer configuration without downloading the file.
Hugging Face maintains the @huggingface/gguf npm package, a JavaScript parser that can read GGUF metadata and tensor information from remotely hosted files:
```js
import { gguf } from "@huggingface/gguf";

const { metadata, tensorInfos } = await gguf(
  "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf"
);
```
This enables web applications to inspect GGUF models hosted on the Hub without downloading the full file.
Hugging Face Inference Endpoints supports GGUF models out of the box. Users can deploy any GGUF model as a dedicated endpoint with a few clicks, selecting their preferred hardware configuration.
A vibrant community of quantizers publishes pre-quantized GGUF models on Hugging Face. TheBloke (Tom Jobbins) was one of the earliest and most prolific contributors, publishing hundreds of models in multiple quantization variants starting in mid-2023. Other active quantization groups include bartowski (known for rapid turnaround on newly released models), QuantFactory, and lmstudio-community. These repositories typically provide each model in several quantization levels (from Q2_K through Q8_0 and various IQ types), along with benchmarks and recommendations in the model card.
Users can browse all GGUF models on Hugging Face by filtering with the gguf library tag at hf.co/models?library=gguf.
GGUF occupies a specific niche in the ecosystem of model file formats. Understanding how it compares to alternatives helps clarify when to use each format.
| Feature | GGUF | SafeTensors | ONNX | Pickle (.bin) |
|---|---|---|---|---|
| Primary use case | Quantized local inference | Training, fine-tuning, GPU serving | Cross-framework deployment | Legacy PyTorch serialization |
| Stores weights | Yes | Yes | Yes | Yes |
| Stores tokenizer | Yes | No (separate files) | No (separate files) | No (separate files) |
| Stores computation graph | No | No | Yes | Yes (via Python objects) |
| Stores metadata | Yes (rich key-value) | Minimal (tensor names/shapes) | Yes (graph-level) | Python object state |
| Built-in quantization | Yes (1-bit to 8-bit, plus FP16/BF16) | No | Limited (via ONNX Runtime tools) | No |
| Memory mapping | Yes | Yes (zero-copy) | Varies by runtime | No |
| Security | Safe (no code execution) | Safe (no code execution) | Safe (no code execution) | Unsafe (arbitrary code execution via pickle) |
| Self-contained | Yes (single file) | No (multiple files needed) | Yes (graph + weights) | No (needs separate files) |
| Primary ecosystem | llama.cpp, Ollama, LM Studio | Hugging Face Transformers, vLLM | ONNX Runtime, TensorRT, OpenVINO | PyTorch |
SafeTensors is the default format for the Hugging Face Transformers library and is widely used for training, fine-tuning, and GPU-based serving. It stores only tensor data with minimal metadata; the model architecture, tokenizer, and configuration live in separate files. SafeTensors supports zero-copy loading and is safe against arbitrary code execution. However, SafeTensors does not include built-in quantization. Running a 7B parameter model from SafeTensors at FP16 requires roughly 14 GB of VRAM, while a Q4_K_M GGUF of the same model fits in approximately 4.6 GB of RAM.
ONNX (Open Neural Network Exchange) stores the full computation graph of a model alongside its weights, making ONNX files portable across frameworks (PyTorch, TensorFlow, MXNet) and hardware targets (CPUs, GPUs, NPUs, custom accelerators). ONNX is better suited for deployment pipelines that require cross-platform compatibility or hardware-specific optimizations through ONNX Runtime, TensorRT, or OpenVINO. GGUF, by contrast, is optimized for the specific case of running transformer-based LLMs on consumer hardware with aggressive quantization.
Python's pickle format was historically the default serialization method for PyTorch models (stored as .bin files). Pickle files can execute arbitrary Python code during deserialization, which poses a serious security risk when loading models from untrusted sources. Both GGUF and SafeTensors were designed in part to eliminate this vulnerability. Pickle also does not support mmap-based loading and requires the full file to be loaded into memory.
GGUF is not limited to a single model family. The convert_hf_to_gguf.py script in the llama.cpp repository supports conversion for a broad and growing range of transformer-based architectures, including the Llama family (LLaMA, Llama 2, Llama 3), Mistral and Mixtral, Falcon, Phi, Qwen, Gemma, GPT-2, BLOOM, and StarCoder, among many others.
New architectures are added regularly by community contributors through pull requests to the llama.cpp repository.
While GGUF is primarily associated with language models, the format can store any collection of tensors with metadata. Projects like stable-diffusion.cpp have used GGUF for diffusion models, though this remains less common than its application in the language model space.
The GGUF format is developed as part of the broader ggml ecosystem. In 2023, Georgi Gerganov founded ggml.ai, a company dedicated to supporting the development of the GGML library and related tools. In 2025, the GGML project joined Hugging Face, bringing the core llama.cpp team under the Hugging Face umbrella while continuing to develop the project as open-source software.
The llama.cpp GitHub repository (under the ggml-org organization) serves as the central hub for GGUF format development. Format changes are proposed and discussed through GitHub issues and pull requests, and the specification document is maintained alongside the source code at docs/gguf.md.
The community around GGUF and local LLM inference is active across several platforms. The llama.cpp GitHub Discussions board hosts technical conversations about quantization methods, performance optimization, and new model support. Reddit communities such as r/LocalLLaMA serve as gathering places for users sharing benchmark results, hardware recommendations, and quantization comparisons.
GGUF's growth has been driven by the broader trend toward running AI models locally rather than relying on cloud APIs. Privacy concerns, cost reduction, offline availability, and the desire for customization have all contributed to demand for efficient local inference, and GGUF has positioned itself as the format that makes this practical on everyday hardware.