GGML is an open-source tensor library written in pure C, designed for efficient machine learning inference on consumer hardware. Created by Georgi Gerganov in late 2022, GGML serves as the computational foundation for llama.cpp, whisper.cpp, and a growing ecosystem of C/C++ inference tools. The name stands for "Georgi Gerganov Machine Learning," combining the creator's initials (GG) with ML. Released under the MIT license and carrying no third-party dependencies, GGML is one of the most portable and lightweight tensor libraries available for running large AI models on everyday devices.
As of March 2026, the GGML repository on GitHub has over 14,000 stars and more than 585 contributors. In February 2026, Gerganov's company ggml.ai joined Hugging Face to secure long-term institutional support for the project.
Towards the end of September 2022, Georgi Gerganov began developing GGML as a C library implementing tensor algebra. Gerganov, a Bulgarian software engineer with an MSc in Medical Physics from Sofia University St. Kliment Ohridski, had previously worked as a Principal Scientist at ViewRay, a medical technology company. Beyond his work in medical physics, Gerganov had a track record of creative side projects spanning audio-based applications, including keyboard acoustic analysis tools and peer-to-peer communication experiments. His goal was to create a minimal, dependency-free tensor library with strict memory management and built-in multithreading support.
The creation of GGML was partly inspired by Fabrice Bellard's LibNC, a C library for tensor manipulation that supported automatic differentiation and could implement models such as LSTMs and Transformers. Bellard, known for creating FFmpeg and QEMU, had used LibNC for his NNCP lossless data compressor, which applied neural networks to achieve state-of-the-art compression ratios. In a 2023 interview on the Changelog podcast (episode #532), Gerganov confirmed this connection, noting he was "definitely inspired by Fabrice Bellard" and his LibNC library. A key difference was that while LibNC was distributed primarily as a binary, GGML was designed from the start as a fully open-source project that other developers could build upon.
Before GGML gained widespread recognition through large language model inference, its first high-profile application was whisper.cpp. Released in late 2022, whisper.cpp is a complete C/C++ port of OpenAI's Whisper speech recognition model. The entire high-level model implementation is contained in whisper.h and whisper.cpp, while all the low-level tensor computation is handled by the GGML library. The project demonstrated that production-quality transcription could run efficiently on consumer devices, including iPhones and Raspberry Pis, without Python or PyTorch dependencies. On Apple M1 hardware, whisper.cpp could transcribe one minute of audio in roughly one second using ARM NEON optimizations. The project has since accumulated over 47,000 stars on GitHub.
On March 10, 2023, days after Meta released the LLaMA model weights, Gerganov published llama.cpp, an implementation of LLaMA inference in pure C/C++ built on top of GGML. The original README stated that the main goal was "to run the model using 4-bit quantization on a MacBook." The project proved that billion-parameter language models could run on a laptop without a dedicated GPU, sparking enormous interest in local AI inference. By applying 4-bit quantization, the 7-billion-parameter LLaMA model could fit into roughly 4 GB of RAM, making it accessible on hardware that most developers already owned. llama.cpp rapidly became one of the fastest-growing open-source projects on GitHub, accumulating over 97,000 stars and more than 1,200 contributors by early 2026.
In 2023, Gerganov founded ggml.ai, a company based in Sofia, Bulgaria, dedicated to supporting the development of GGML and related projects. The company received pre-seed funding from Nat Friedman (former CEO of GitHub) and Daniel Gross (former Y Combinator partner and AI investor). Core team members included Xuan-Son Nguyen (ngxson) and Aleksander Grygier (allozaur), both of whom made significant contributions to llama.cpp's codebase and backend systems.
On February 20, 2026, Gerganov announced that ggml.ai would be joining Hugging Face. The move was designed to ensure long-term sustainable resources for GGML and llama.cpp while keeping both projects fully open-source and community-driven. Under the arrangement, Gerganov and the core team became full-time Hugging Face employees but retained complete technical autonomy over the projects. The announcement, co-authored by Gerganov, Nguyen, Grygier, and Hugging Face team members Lysandre, Victor Mustar, and Julien Chaumond, emphasized several strategic goals: near single-click deployment from Hugging Face's model hub to local inference, tighter integration between llama.cpp and Hugging Face's Transformers library, and faster delivery of quantized model support after new model releases. The GitHub discussion thread announcing the move drew 389 combined reactions within its first day, reflecting broad community confidence in the decision.
GGML follows several core design principles that set it apart from larger frameworks like PyTorch and TensorFlow.
The library is composed primarily of C++ (58.9%) and C (21.2%), with specialized implementations in CUDA (10.5%), Metal (3.1%), and GLSL (2.1%) for hardware acceleration. When compiled, the binary is under 1 MB, compared to the hundreds of megabytes typical of Python-based frameworks. This makes GGML suitable for embedding in desktop applications, mobile apps, and other resource-constrained environments.
GGML has no third-party dependencies. The only requirement to build it is a C compiler (GCC or Clang). GPU backends are optional and can be enabled at compile time through CMake flags. This approach eliminates dependency conflicts and simplifies cross-platform deployment. In contrast, installing PyTorch typically requires downloading hundreds of megabytes of libraries including CUDA toolkit components, cuDNN, and numerous Python packages. Building GGML requires just a few commands:
```shell
mkdir build && cd build
cmake ..
cmake --build . --config Release
```
One of GGML's most distinctive features is that it performs zero memory allocations during inference. All memory is pre-allocated through a context system (ggml_context) before computation begins. Users calculate the required memory size upfront, accounting for tensor data, tensor metadata overhead (via ggml_tensor_overhead()), and graph overhead (via ggml_graph_overhead()). This design eliminates memory fragmentation, makes memory usage predictable, and avoids the overhead of dynamic allocation during model execution. The dynamic allocator (ggml_dyn_tallocr) maintains free block lists (up to 256 blocks per chunk) using best-fit search to minimize fragmentation.
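The pre-allocation pattern can be illustrated with a minimal bump allocator. This is a conceptual sketch of the "allocate once up front, zero allocations during compute" idea, not GGML's actual `ggml_context` or `ggml_dyn_tallocr` implementation; the names `arena_t` and `arena_alloc` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* A minimal bump allocator illustrating the "allocate once up front,
 * zero allocations during compute" pattern. Conceptual sketch only --
 * not GGML's actual context implementation. */
typedef struct {
    uint8_t *base;   /* start of the pre-allocated arena    */
    size_t   size;   /* total arena size, fixed at creation */
    size_t   offset; /* bump pointer: next free byte        */
} arena_t;

static arena_t arena_create(size_t size) {
    arena_t a = { malloc(size), size, 0 };
    return a;
}

/* Returns NULL when the arena is exhausted instead of calling malloc,
 * mirroring how GGML fails fast if the context was sized too small. */
static void *arena_alloc(arena_t *a, size_t n) {
    size_t aligned = (n + 15) & ~(size_t)15; /* 16-byte alignment */
    if (a->offset + aligned > a->size) return NULL;
    void *p = a->base + a->offset;
    a->offset += aligned;
    return p;
}

static void arena_destroy(arena_t *a) { free(a->base); a->base = NULL; }
```

Because every allocation is a pointer bump inside a fixed region, allocation cost is constant and memory usage is known exactly before computation starts.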
GGML uses memory-mapped file I/O (mmap) to load model weights directly into memory without copying them. When a model file is memory-mapped, the operating system creates page table entries that reserve virtual memory addresses, but the actual data pages are loaded into physical memory on demand. This lazy-loading approach means that mapping a 20 GB model file requires only about 40 MB of page tables initially. The individual pages are not loaded into the resident memory set until inference actually accesses them. Multiple processes can share the same memory-mapped model data, enabling efficient multi-instance deployments where several LLM server processes serve different users from the same model file without duplicating the weights in memory.
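The lazy-loading behavior described above can be demonstrated with a generic POSIX `mmap` round trip. This is an illustration of the mechanism, not llama.cpp's actual model loader; the helper name `mmap_roundtrip` is hypothetical.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write a small file, map it, and read it back through the mapping.
 * The mmap() call copies no bytes -- the kernel only sets up page table
 * entries, and pages are faulted in on first access. Returns 0 on
 * success. */
static int mmap_roundtrip(const char *path) {
    const char payload[] = "fake model weights";
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;
    if (write(fd, payload, sizeof payload) != (ssize_t)sizeof payload) {
        close(fd);
        return -1;
    }

    /* Map the file read-only; nothing is loaded into RAM yet. */
    void *p = mmap(NULL, sizeof payload, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after close */
    if (p == MAP_FAILED) return -1;

    /* First access faults the page into memory on demand. */
    int ok = memcmp(p, payload, sizeof payload) == 0;
    munmap(p, sizeof payload);
    unlink(path);
    return ok ? 0 : -1;
}
```

When the same file is mapped by several processes with shared semantics, the kernel backs all mappings with the same physical pages, which is what enables the multi-instance deployments described above.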
GGML uses a deferred (lazy) execution model. When tensor operations are called, they do not execute immediately. Instead, they create nodes in a directed acyclic graph (DAG) that records the operation type and source tensors. The computation graph is executed only when explicitly triggered via ggml_graph_compute() or ggml_backend_graph_compute(). This approach allows the library to optimize execution order through topological sorting and enables multi-threaded parallel execution of independent operations.
GGML represents multi-dimensional data using the ggml_tensor structure, supporting up to 4 dimensions. Each tensor contains metadata including element counts (ne), byte strides (nb), data type, and operation information. The stride mechanism enables zero-copy operations like transpose without data duplication.
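How strides enable a zero-copy transpose can be sketched with a simplified 2-D view. The `view2d` struct below is a hypothetical stand-in, not GGML's `ggml_tensor`, but it uses the same `ne`/`nb` convention of per-dimension element counts and byte strides.

```c
#include <stddef.h>

/* A simplified 2-D view illustrating how byte strides (nb) make
 * transpose a metadata-only change. Hypothetical struct, not GGML's
 * actual ggml_tensor. */
typedef struct {
    float *data;
    size_t ne[2]; /* element counts per dimension */
    size_t nb[2]; /* byte strides per dimension   */
} view2d;

/* Element access goes through the strides, so the same data can be
 * interpreted under different layouts. */
static float view2d_get(const view2d *v, size_t i0, size_t i1) {
    const char *p = (const char *)v->data + i0 * v->nb[0] + i1 * v->nb[1];
    return *(const float *)p;
}

/* Transpose by swapping counts and strides -- no data is copied. */
static view2d view2d_transpose(view2d v) {
    view2d t = v;
    t.ne[0] = v.ne[1]; t.ne[1] = v.ne[0];
    t.nb[0] = v.nb[1]; t.nb[1] = v.nb[0];
    return t;
}
```

The transposed view reads the same underlying buffer; only the metadata changed, which is why GGML's TRANSPOSE and PERMUTE operations cost essentially nothing.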
At the heart of GGML is the computation graph (ggml_cgraph), which represents a DAG of tensor operations. The graph contains two categories of tensors:
- Leaf tensors, which hold data but are not produced by any operation (op == GGML_OP_NONE), serving as the starting points of computation.
- Operation nodes, which are produced by tensor operations and record the operation type along with references to their source tensors.

Graph construction uses ggml_build_forward_expand(), which recursively visits parent tensors via the ggml_visit_parents() function to build the dependency structure. Each tensor is added to a hash table to prevent duplicate processing. When execution is triggered, the graph traverses nodes in topological order, ensuring that every tensor's inputs have been computed before the tensor itself is evaluated. Once built, multiple threads can safely execute the same graph with different input data.
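The deferred-execution idea can be sketched with a toy DAG evaluator: building a node records only the operation and its sources, and values are computed when the graph is walked in dependency order. All names here are illustrative, not GGML's API.

```c
#include <stddef.h>

/* Toy DAG node: building one records dependencies; nothing is computed
 * until eval() walks the graph. Illustrative only, not GGML's API. */
enum toy_op { OP_NONE, OP_ADD, OP_MUL };

typedef struct toy_node {
    enum toy_op      op;
    struct toy_node *src0, *src1;
    float            value;
    int              computed;
} toy_node;

/* Leaf tensors carry data and have op == OP_NONE. */
static toy_node toy_leaf(float v) {
    toy_node n = { OP_NONE, NULL, NULL, v, 1 };
    return n;
}

/* Operation nodes only record the op and source pointers. */
static toy_node toy_binop(enum toy_op o, toy_node *a, toy_node *b) {
    toy_node n = { o, a, b, 0.0f, 0 };
    return n;
}

/* Post-order walk: sources are evaluated before the node itself, which
 * is exactly what topological ordering guarantees. The computed flag
 * plays the role of the hash table that prevents re-processing. */
static float toy_eval(toy_node *n) {
    if (n->computed) return n->value;
    float a = toy_eval(n->src0), b = toy_eval(n->src1);
    n->value = (n->op == OP_ADD) ? a + b : a * b;
    n->computed = 1;
    return n->value;
}
```

Note that `x` feeds both the multiply and the add below, so the graph is a true DAG rather than a tree, and the shared node is evaluated only once.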
GGML's backend system abstracts hardware-specific execution behind a uniform interface (ggml_backend_i). The key components are:
| Component | Purpose |
|---|---|
| ggml_backend | Interface for executing computation graphs on specific hardware |
| ggml_backend_buffer_type | Memory allocator tied to each backend |
| ggml_backend_buffer | Allocated buffer holding data for multiple tensors |
| ggml_gallocr | Graph memory allocator that performs liveness analysis for efficient tensor memory reuse |
| ggml_backend_sched | Scheduler enabling concurrent use of multiple backends with automatic CPU fallback |
The scheduler (ggml_backend_sched) distributes operations across available hardware using a 5-pass assignment algorithm. If an operation is not supported on a GPU backend, the scheduler automatically falls back to the CPU implementation without requiring user intervention. It also inserts automatic tensor copy operations at backend boundaries, so that even if a model uses exotic operations only implemented on the CPU, the rest of the computation can still be offloaded to the GPU.
Backends can be loaded statically (compiled into the library) or dynamically at runtime when GGML_BACKEND_DL is enabled, loading shared libraries from a configurable directory (GGML_BACKEND_DIR).
GGML supports a broad range of data types, spanning full precision, half precision, and over 40 quantized formats:
| Data Type | Description |
|---|---|
| GGML_TYPE_F32 | 32-bit floating point |
| GGML_TYPE_F16 | 16-bit floating point |
| GGML_TYPE_BF16 | Brain floating point (16-bit) |
| GGML_TYPE_Q8_0 | 8-bit quantized (scale-only) |
| GGML_TYPE_Q5_1 | 5-bit quantized (scale + minimum) |
| GGML_TYPE_Q5_0 | 5-bit quantized (scale-only) |
| GGML_TYPE_Q4_1 | 4-bit quantized (scale + minimum) |
| GGML_TYPE_Q4_0 | 4-bit quantized (scale-only) |
| GGML_TYPE_I8 | 8-bit integer |
| GGML_TYPE_I16 | 16-bit integer |
| GGML_TYPE_I32 | 32-bit integer |
Each quantization format in GGML's type system provides three key functions: dequantization (converting quantized values back to floats), quantization (converting float values to the quantized representation), and an optimized vector dot product that operates directly on quantized data. Backend-specific implementations use SIMD instructions, GPU compute kernels, or hardware accelerator instructions to maximize performance.
GGML implements approximately 95 operation types organized into several categories. These operations cover the full range of computations needed for modern neural network inference.
Basic mathematical operations include ADD, SUB, MUL, and DIV for element-wise binary computation, with broadcasting support for operations between tensors of different shapes. Unary operations include ABS, SQR, SQRT, LOG, and common activation functions such as GELU, ReLU, SIGMOID, SiLU, and Tanh.
Matrix multiplication (MUL_MAT) is the most performance-critical operation in GGML, as it dominates the computation time during neural network inference. The second operand is automatically transposed internally, so the operation computes A * B^T. GGML also supports MUL_MAT_ID for multi-expert matrix multiplication (used in Mixture of Experts architectures) and OUT_PROD for outer products.
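The A * B^T convention can be written out as a plain reference loop. This is a naive sketch of the semantics only; the function name is hypothetical and real backends use SIMD or GPU kernels.

```c
#include <stddef.h>

/* Reference semantics for a MUL_MAT-style product where the second
 * operand is treated as transposed: C[i][j] = sum_k A[i][k] * B[j][k].
 * A is m x k, B is n x k, C is m x n, all row-major. */
static void mul_mat_bt(const float *a, const float *b, float *c,
                       int m, int n, int k) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float s = 0.0f;
            /* Both operands are read contiguously along k, which is
             * cache-friendly and easy to vectorize. */
            for (int p = 0; p < k; p++)
                s += a[i * k + p] * b[j * k + p];
            c[i * n + j] = s;
        }
    }
}
```

Reading both operands contiguously along the shared dimension is the practical reason for the transposed convention: each output element becomes a dot product of two contiguous rows.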
Reduction operations include SUM, MEAN, ARGMAX, COUNT_EQUAL, and SUM_ROWS, used for aggregation steps in normalization layers, loss computation, and sampling.
GGML supports several normalization schemes: NORM (layer normalization), RMS_NORM (root mean square normalization, commonly used in LLaMA-style models), GROUP_NORM, and L2_NORM. SOFT_MAX implements the softmax function used in attention mechanisms.
For Transformer model inference, GGML includes specialized operations such as ROPE (rotary position embeddings), FLASH_ATTN_EXT (fused flash attention), DIAG_MASK_INF (causal attention masking), and GET_ROWS (embedding lookup).
GGML supports CONV_1D and CONV_2D convolution operations with IM2COL support, as well as POOL_2D for pooling operations. Shape manipulation operations include VIEW, RESHAPE, PERMUTE, TRANSPOSE, CONCAT, and REPEAT. Many of these are zero-copy operations, creating new views of existing data rather than copying tensor contents. Custom operations (MAP_CUSTOM1, MAP_CUSTOM2, CUSTOM) allow users to define their own computations, extending the library for specialized use cases.
Quantization is one of GGML's most important features. It allows large model weights to be compressed from 16-bit or 32-bit floating point down to as few as 2 bits per weight, dramatically reducing memory requirements and often improving inference speed on CPUs.
GGML uses block-based quantization, where weights are grouped into blocks and each block stores a scale factor (and optionally a minimum value) alongside the quantized weights.
Legacy formats use a block size of 32 elements per block:
| Type | Bits per Weight | Size (7B model) | Perplexity Increase | Notes |
|---|---|---|---|---|
| Q4_0 | 4 | ~3.50 GB | +0.2499 | Superseded by Q3_K_M |
| Q4_1 | 4 | ~3.90 GB | +0.1846 | Superseded by Q3_K_L |
| Q5_0 | 5 | ~4.30 GB | +0.0796 | Superseded by Q4_K_M |
| Q5_1 | 5 | ~4.70 GB | +0.0415 | Superseded by Q5_K_M |
| Q8_0 | 8 | ~6.70 GB | ~+0.0004 | Nearly indistinguishable from FP16 |
K-quantization, introduced in mid-2023, uses super-blocks of 256 elements subdivided into smaller sub-blocks of 16 or 32 elements. This scheme intelligently allocates bits across different layers of the model: more critical weights receive higher precision while less important ones use lower precision. The suffixes _S (Small), _M (Medium), and _L (Large) denote how aggressively mixed precision is applied. K-quants provide better quality at the same average bit rate compared to legacy types.
| Type | Avg. Bits per Weight | Size (7B model) | Quality Impact | Recommendation |
|---|---|---|---|---|
| Q2_K | ~2.5 | ~2.67 GB | Extreme loss | Not recommended |
| Q3_K_S | ~3.0 | ~2.75 GB | Very high loss | Small models only |
| Q3_K_M | ~3.5 | ~3.06 GB | High loss | Budget constrained |
| Q3_K_L | ~3.5 | ~3.35 GB | High loss | Budget constrained |
| Q4_K_S | ~4.5 | ~3.59 GB | Moderate | Good balance |
| Q4_K_M | ~4.5 | ~3.80 GB | Balanced | Recommended default |
| Q5_K_S | ~5.0 | ~4.33 GB | Low loss | High quality |
| Q5_K_M | ~5.0 | ~4.45 GB | Low loss | High quality |
| Q6_K | ~6.0 | ~5.15 GB | Very low loss | Near-lossless |
For most users, Q4_K_M provides the best balance between model size and output quality. Q5_K_M offers high quality that is close to imperceptible degradation for many tasks. Q6_K approaches near-lossless behavior while still offering meaningful compression.
I-quantization (IQ) formats represent the most advanced quantization approach in GGML. Instead of uniform block-wise quantization, IQ formats use vector quantization with lookup tables and importance matrices to achieve better quality at extreme compression levels (2 to 4 bits per weight).
The importance matrix (imatrix) is generated by running calibration data through the model and recording which weights produce the largest activations during inference. Weights that consistently have high impact are quantized with greater care. Non-linear quantization levels allow IQ formats to learn optimal level placements that minimize reconstruction error, rather than using uniformly spaced quantization bins. The key IQ formats include:
| Type | Bits per Weight | Block Size | Description |
|---|---|---|---|
| IQ2_XXS | ~2.06 | 256 | Extreme compression with lookup tables |
| IQ2_XS | ~2.31 | 256 | Slightly higher precision than IQ2_XXS |
| IQ2_S | ~2.50 | 256 | 2-bit with sign bits stored separately |
| IQ3_XXS | ~3.06 | 256 | 3-bit importance-weighted |
| IQ3_S | ~3.44 | 256 | 3-bit with separate sign bits |
| IQ4_XS | ~4.25 | 256 | 4-bit importance-weighted |
| IQ4_NL | ~4.50 | 32 | 4-bit with non-linear quantization levels |
Generating the importance matrix requires access to representative calibration text, which is passed through the model using the llama-imatrix tool. The resulting imatrix file is then provided during quantization. Using an importance matrix is highly recommended for IQ formats and generally improves the output quality of all quantization types.
GGML also supports specialty formats such as TQ1_0, which encodes ternary weights (values restricted to {-1, 0, +1}). This format is relevant for models like BitNet that use ternary weight representations, and for very large models (such as DeepSeek) where extreme compression is needed. Additionally, MXFP4 provides a microscaling 4-bit floating point format.
GGML supports a wide range of hardware backends through its backend abstraction layer. The same model code can run across different processors and accelerators without modification.
The CPU backend is GGML's most mature and fully featured backend. It includes hand-tuned SIMD (Single Instruction, Multiple Data) kernels for multiple instruction set architectures:
| Architecture | Instruction Sets | Notes |
|---|---|---|
| x86_64 | AVX, AVX2, AVX-512 | Intel and AMD processors |
| ARM | NEON, SVE | Apple Silicon, Qualcomm, ARM servers |
| WebAssembly | WASM SIMD | Browser-based inference |
CPU feature detection happens both at build time through CMake options (GGML_NATIVE, GGML_AVX2, GGML_AVX512) and at runtime. When GGML_CPU_ALL_VARIANTS is enabled, multiple CPU backend variants are compiled, and the runtime selects the best one based on detected processor features. On ARM platforms, the KleidiAI integration provides additional optimized matrix multiplication kernels that detect features like DOTPROD and I8MM at runtime.
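The build-time half of this detection can be sketched with standard architecture macros: when CMake enables an instruction set, the compiler defines the corresponding macro. This is a generic illustration of the mechanism, not GGML's dispatch code; runtime selection (as with GGML_CPU_ALL_VARIANTS) would instead query CPUID or its platform equivalents.

```c
#include <stddef.h>

/* Report which SIMD variant this translation unit was compiled for,
 * using the architecture macros the compiler defines when the
 * corresponding instruction set is enabled. Illustrative sketch of
 * build-time feature selection. */
static const char *simd_variant(void) {
#if defined(__AVX512F__)
    return "avx512";
#elif defined(__AVX2__)
    return "avx2";
#elif defined(__AVX__)
    return "avx";
#elif defined(__ARM_NEON)
    return "neon";
#elif defined(__wasm_simd128__)
    return "wasm-simd";
#else
    return "scalar";
#endif
}
```

Compiling the same source several times with different flags and choosing among the resulting object files at runtime is essentially what the multi-variant CPU backend does.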
The CUDA backend provides GPU acceleration for NVIDIA GPUs. It implements GPU kernels for matrix multiplication, quantization/dequantization, element-wise operations, and specialized operations like flash attention. The CUDA backend is enabled at build time with -DGGML_CUDA=ON and requires the NVIDIA CUDA toolkit.
The Metal backend targets Apple Silicon GPUs (M1, M2, M3, M4 series) and is enabled by default on macOS builds. Metal support was one of the early differentiators for GGML, as it allowed MacBook users to accelerate LLM inference using their integrated GPUs. The Metal backend can fuse multiple operations into a single kernel for improved performance.
The Vulkan backend provides cross-platform GPU acceleration through the Vulkan compute API. It compiles GLSL compute shaders to SPIR-V bytecode and executes them via Vulkan compute pipelines. This backend works across NVIDIA, AMD, Intel, and other Vulkan-capable GPUs on Windows, Linux, and Android, making it the most hardware-agnostic GPU option in GGML.
The HIP backend provides GPU acceleration for AMD Radeon graphics cards through AMD's ROCm (Radeon Open Compute) platform. HIP is AMD's CUDA-compatible API, so the CUDA backend code is largely reused through HIP's compatibility layer, reducing maintenance burden. In July 2025, AMD completed extensive optimization efforts for llama.cpp that were upstreamed to the main repository, improving performance on AMD Instinct GPUs such as the MI300X.
The SYCL backend supports Intel GPUs through the Intel oneAPI toolkit and the SYCL/DPC++ programming model. It covers Intel Data Center Max Series (Ponte Vecchio), Flex Series, Arc Series discrete GPUs, and integrated GPUs from 11th Generation Intel Core processors onward. The SYCL backend leverages Intel oneMKL for optimized BLAS operations and oneDNN for neural network primitives.
GGML also supports several other hardware targets:
| Backend | Target Hardware | API |
|---|---|---|
| MUSA | Moore Threads GPUs | Moore Threads MUSA API |
| OpenCL | Qualcomm Adreno GPUs | OpenCL compute |
| CANN | Huawei Ascend NPUs | Huawei CANN toolkit |
GGML supports using multiple backends simultaneously. For example, a build can enable both CUDA and Vulkan backends with the CMake flags -DGGML_CUDA=ON -DGGML_VULKAN=ON. The backend scheduler distributes operations across available hardware using a 5-pass assignment algorithm, inserts automatic tensor copy operations at backend boundaries, and falls back to the CPU for any operation that a GPU backend does not implement.
The original GGML file format stored model weights and basic metadata in a single file optimized for memory mapping. While effective, it had significant limitations: adding new features often broke compatibility with existing models, hyperparameters were stored as a list of untyped values, and the format lacked flexibility for storing essential metadata like tokenizer information or model-specific parameters. Two intermediate revisions, GGMF and GGJT, attempted to address some of these issues but still suffered from extensibility problems.
On August 21, 2023, the llama.cpp team introduced GGUF (GPT-Generated Unified Format) as a replacement for the GGML file format. GGUF was a complete redesign that addressed the predecessor's shortcomings: a model is fully described by a single file, new metadata can be added without breaking compatibility with existing files, and hyperparameters are stored as typed key-value pairs rather than an untyped list, all while remaining friendly to memory mapping.
A GGUF file is a binary format consisting of three main parts: a header (containing a magic number, version, and counts), metadata (key-value pairs), and tensor data (the actual model weights). The format has become the standard for distributing quantized models. As of 2026, Hugging Face hosts thousands of models in GGUF format. The original GGML format is deprecated and is no longer supported by modern tools like llama.cpp, Ollama, and LM Studio.
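The fixed-size prefix of that header can be sketched directly from the layout described above: a magic number, a version, and the two counts, all little-endian. This covers only the fixed prefix; real readers must also parse the variable-length metadata and tensor-info sections that follow.

```c
#include <stdint.h>
#include <string.h>

/* "GGUF" read as a little-endian uint32. */
#define GGUF_MAGIC 0x46554747u

/* Fixed-size prefix of a GGUF file: magic, version, tensor count, and
 * metadata key-value count. The variable-length sections follow it. */
typedef struct {
    uint32_t magic;
    uint32_t version;
    uint64_t n_tensors;
    uint64_t n_kv;
} gguf_header;

/* Serialize field by field to avoid struct-padding surprises;
 * returns the number of bytes written (24). */
static size_t gguf_header_write(uint8_t *buf, const gguf_header *h) {
    memcpy(buf +  0, &h->magic,     4);
    memcpy(buf +  4, &h->version,   4);
    memcpy(buf +  8, &h->n_tensors, 8);
    memcpy(buf + 16, &h->n_kv,      8);
    return 24;
}

/* Returns 0 on success, -1 if the magic number does not match. */
static int gguf_header_read(const uint8_t *buf, gguf_header *h) {
    memcpy(&h->magic, buf, 4);
    if (h->magic != GGUF_MAGIC) return -1;
    memcpy(&h->version,   buf +  4, 4);
    memcpy(&h->n_tensors, buf +  8, 8);
    memcpy(&h->n_kv,      buf + 16, 8);
    return 0;
}
```

Checking the magic number first is how tools such as llama.cpp reject non-GGUF files before attempting to parse any metadata.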
GGML has become the foundation for a growing ecosystem of C/C++ inference tools.
| Project | Description | GitHub Stars (approx.) |
|---|---|---|
| llama.cpp | LLM inference engine supporting hundreds of model architectures | ~97,000 |
| whisper.cpp | Port of OpenAI's Whisper speech recognition model | ~47,000 |
| stable-diffusion.cpp | Stable Diffusion, Flux, Wan, and other diffusion model inference | ~4,000 |
| bark.cpp | Port of Suno AI's Bark text-to-speech model | ~800 |
Beyond direct C/C++ implementations, GGML powers several popular user-facing applications, including Ollama, LM Studio, and GPT4All, all of which build on llama.cpp for local model execution.
Several community-maintained bindings allow GGML and llama.cpp to be used from other programming languages:
| Language | Binding | Notes |
|---|---|---|
| Python | llama-cpp-python, ggml-python | Most widely used bindings |
| Go | go-llama.cpp | Go native interface |
| Rust | llama-cpp-rs | Rust safe wrapper |
| Node.js | node-llama-cpp | JavaScript/TypeScript integration |
| Swift | Native Swift bindings | Included in the llama.cpp repository |
| C# / .NET | LLamaSharp | .NET ecosystem support |
The following table compares GGML with other frameworks commonly used for machine learning inference.
| Feature | GGML | PyTorch | ONNX Runtime |
|---|---|---|---|
| Primary language | C | Python / C++ | C++ |
| Primary use case | Inference | Training + Inference | Inference |
| Binary size | < 1 MB | Hundreds of MB | ~50 MB |
| Dependencies | None | Many (CUDA, cuDNN, etc.) | Moderate |
| Quantization | Native (2-8 bit, 40+ formats) | Via add-on libraries (bitsandbytes, GPTQ) | Limited native support |
| Training support | Limited (autodiff, ADAM, L-BFGS) | Full | None (inference only) |
| GPU backends | CUDA, Metal, Vulkan, ROCm, SYCL | CUDA, ROCm, MPS | CUDA, DirectML, ROCm |
| Memory mapping | Native mmap | Not standard | Not standard |
| Target deployment | Edge / consumer devices | Cloud / research | Cross-platform |
| Model format | GGUF | .pt / .safetensors | .onnx |
| Runtime memory allocation | Zero during inference | Dynamic | Dynamic |
| WebAssembly support | Yes (WASM SIMD) | No | Limited (ONNX.js) |
| License | MIT | BSD-3 | MIT |
GGML occupies a distinct niche in the ML framework landscape. While PyTorch and TensorFlow are general-purpose frameworks designed for both training and inference with extensive Python ecosystems, GGML focuses almost exclusively on inference efficiency with a C-first approach. ONNX Runtime is the closest in purpose (cross-platform inference), but GGML's aggressive quantization support and native memory mapping give it clear advantages for running very large models on memory-constrained consumer devices. For local inference on CPUs, GGML-based tools regularly outperform both PyTorch and ONNX Runtime in tokens-per-second throughput at equivalent memory budgets.
Although GGML is primarily used for inference, it does support automatic differentiation for building backward passes. When gradients are enabled on tensors, the system constructs a backward graph that computes derivatives with respect to each input. Each of the approximately 95 operations has a corresponding backward implementation. The library includes ADAM and L-BFGS optimizers, making it technically capable of fine-tuning and small-scale training. However, this feature sees limited practical use compared to the inference capabilities, and most training workloads remain with PyTorch or similar frameworks.
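The idea that each operation carries a backward rule can be sketched on a single expression. Below, the gradient rules for y = a*b + a are written out by hand to mirror how each forward op contributes a term via the chain rule; this is a toy illustration, not GGML's autodiff machinery, and the names are hypothetical.

```c
/* Toy forward/backward pass for y = a*b + a. Each forward op (MUL,
 * ADD) contributes a gradient rule, mirroring how every GGML op has a
 * backward implementation. Illustrative sketch only. */
typedef struct { float value, grad; } var;

/* forward:  y = a*b + a
 * backward: dy/da = b + 1  (b from MUL, 1 from ADD)
 *           dy/db = a      (from MUL)                 */
static float forward_backward(var *a, var *b) {
    float y = a->value * b->value + a->value;
    /* seed dy/dy = 1, then accumulate each op's contribution */
    a->grad = b->value + 1.0f;
    b->grad = a->value;
    return y;
}
```

A general autodiff system derives these rules automatically by walking the recorded graph in reverse topological order and accumulating each node's contribution into its sources' gradients.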
The GGML community is closely intertwined with the llama.cpp community, as GGML development largely happens within the llama.cpp repository alongside standalone GGML releases.
GGML follows an open development model on GitHub under the ggml-org organization. As of March 2026, the standalone GGML repository has seen over 3,500 commits from more than 585 contributors. The project maintains bidirectional synchronization with downstream projects (llama.cpp, whisper.cpp) using automated scripts that track synchronized commits, filter already back-ported changes, and generate patches preserving authorship.
GGML and llama.cpp have had a transformative effect on the local AI movement. Before their arrival, running large language models required expensive cloud GPU instances or high-end workstations with dedicated GPUs. By combining aggressive quantization with optimized CPU and GPU kernels, GGML made it practical to run 7-billion-parameter models on a laptop and 70-billion-parameter models on a single consumer GPU. This democratization of model access contributed to the rapid growth of open-source AI and the creation of numerous local AI tools and interfaces. The ability to run models entirely offline, without sending data to external servers, also addressed growing privacy concerns around cloud-based AI services.
llama.cpp was recognized in GitHub's Octoverse 2025 report as one of the top open-source projects by contributor count, reflecting the project's central role in open-source AI infrastructure.
While GGML is highly effective for its intended use case, it has several notable limitations: