# GGML

> Source: https://aiwiki.ai/wiki/ggml
> Updated: 2026-06-23
> Categories: Developer Tools, Machine Learning, Open Source AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**GGML** is an open-source [tensor](/wiki/tensor) library written in pure C that runs [machine learning](/wiki/machine_learning) inference efficiently on consumer hardware, and it is the computational foundation beneath [llama.cpp](/wiki/llama_cpp), [whisper.cpp](/wiki/whisper_cpp), and the wider C/C++ local-inference ecosystem [1][2][3]. Created by Georgi Gerganov in late 2022, the name stands for "Georgi Gerganov Machine Learning," combining the creator's initials (GG) with ML. Released under the [MIT license](/wiki/mit_license) and carrying no third-party dependencies, GGML compiles to a binary under 1 MB and is one of the most portable tensor libraries available for running large AI models on everyday devices [1][12].

GGML's defining contribution is making billion-parameter models runnable on laptops and phones through aggressive [quantization](/wiki/quantization): with 4-bit quantization, a 7-billion-parameter [LLaMA](/wiki/llama) model fits into roughly 4 GB of RAM [2][12]. As of March 2026, the standalone GGML repository on GitHub has over 14,000 stars and more than 585 contributors [1]. On February 20, 2026, Gerganov's company ggml.ai joined [Hugging Face](/wiki/hugging_face) to secure long-term institutional support while keeping the projects fully open-source [5].

## What is GGML and who created it?

### Origins and Creation

Towards the end of September 2022, Georgi Gerganov began developing GGML as a C library implementing tensor algebra [1]. Gerganov, a Bulgarian software engineer with an MSc in Medical Physics from Sofia University St. Kliment Ohridski, had previously worked as a Principal Scientist at ViewRay, a medical technology company [4]. Beyond his work in medical physics, Gerganov had a track record of creative side projects spanning audio-based applications, including keyboard acoustic analysis tools and peer-to-peer communication experiments. His goal was to create a minimal, dependency-free tensor library with strict memory management and built-in multithreading support.

The creation of GGML was partly inspired by Fabrice Bellard's LibNC, a C library for tensor manipulation that supported [automatic differentiation](/wiki/automatic_differentiation) and could implement models such as LSTMs and [Transformers](/wiki/transformer) [13]. Bellard, known for creating FFmpeg and QEMU, had used LibNC for his NNCP lossless data compressor, which applied neural networks to achieve state-of-the-art compression ratios. In a 2023 interview on the Changelog podcast (episode #532), Gerganov confirmed this connection, noting he was "definitely inspired by Fabrice Bellard" and his LibNC library [4]. A key difference was that while LibNC was distributed primarily as a binary, GGML was designed from the start as a fully open-source project that other developers could build upon.

### How did whisper.cpp drive early adoption?

Before GGML gained widespread recognition through [large language model](/wiki/large_language_model) inference, its first high-profile application was whisper.cpp. Released in late 2022, whisper.cpp is a complete C/C++ port of [OpenAI](/wiki/openai)'s [Whisper](/wiki/whisper) [speech recognition](/wiki/speech_recognition) model [3]. The entire high-level model implementation is contained in whisper.h and whisper.cpp, while all the low-level tensor computation is handled by the GGML library. The project demonstrated that production-quality transcription could run efficiently on consumer devices, including iPhones and Raspberry Pis, without Python or [PyTorch](/wiki/pytorch) dependencies. On Apple M1 hardware, whisper.cpp could transcribe one minute of audio in roughly one second using ARM NEON optimizations. The project has since accumulated over 47,000 stars on GitHub [3].

### How did llama.cpp make GGML mainstream?

On March 10, 2023, days after [Meta](/wiki/meta_ai) released the [LLaMA](/wiki/llama) model weights, Gerganov published llama.cpp, an implementation of LLaMA inference in pure C/C++ built on top of GGML [2]. The original README stated that the main goal was "to run the model using 4-bit quantization on a MacBook" [2]. The project proved that billion-parameter language models could run on a laptop without a dedicated GPU, sparking enormous interest in local AI inference. By applying 4-bit [quantization](/wiki/quantization), the 7-billion-parameter LLaMA model could fit into roughly 4 GB of RAM, making it accessible on hardware that most developers already owned [12]. llama.cpp rapidly became one of the fastest-growing open-source projects on GitHub, accumulating over 97,000 stars and more than 1,200 contributors by early 2026 [2].

### What is ggml.ai?

In 2023, Gerganov founded ggml.ai, a company based in Sofia, Bulgaria, dedicated to supporting the development of GGML and related projects. The company received pre-seed funding from Nat Friedman (former CEO of GitHub) and Daniel Gross (former Y Combinator partner and AI investor). Core team members included Xuan-Son Nguyen (ngxson) and Aleksander Grygier (allozaur), both of whom made significant contributions to llama.cpp's codebase and backend systems [5].

### Why did ggml.ai join Hugging Face in February 2026?

On February 20, 2026, Gerganov announced that ggml.ai would be joining Hugging Face [5]. The move was designed to ensure long-term sustainable resources for GGML and llama.cpp while keeping both projects fully open-source and community-driven. According to the joint announcement, "Georgi and team still dedicate 100% of their time maintaining llama.cpp and have full autonomy and leadership on the technical directions and the community" [5]. Under the arrangement, Gerganov and the core team became full-time Hugging Face employees but retained complete technical autonomy over the projects.

The announcement, co-authored by Gerganov, Nguyen, Grygier, and Hugging Face team members Lysandre, Victor Mustar, and Julien Chaumond, emphasized several strategic goals: near single-click deployment from Hugging Face's model hub to local inference, tighter integration between llama.cpp and Hugging Face's [Transformers](/wiki/transformers_library) library, and faster delivery of quantized model support after new model releases [5]. The blog framed the shared objective as providing "the building blocks to make open-source superintelligence accessible to the world over the coming years" [5]. The GitHub discussion thread announcing the move drew 389 combined reactions within its first day, reflecting broad community confidence in the decision.

## How is GGML designed?

GGML follows several core design principles that set it apart from larger frameworks like PyTorch and [TensorFlow](/wiki/tensorflow).

### Minimalism and Portability

The library is composed primarily of C++ (58.9%) and C (21.2%), with specialized implementations in [CUDA](/wiki/cuda) (10.5%), Metal (3.1%), and GLSL (2.1%) for hardware acceleration [1]. When compiled, the binary is under 1 MB, compared to the hundreds of megabytes typical of Python-based frameworks [12]. This makes GGML suitable for embedding in desktop applications, mobile apps, and other resource-constrained environments.

### Zero Dependencies

GGML has no third-party dependencies. The core CPU build needs only a C/C++ toolchain such as GCC or Clang, driven by CMake [1]. GPU backends are optional and can be enabled at compile time through CMake flags. This approach eliminates dependency conflicts and simplifies cross-platform deployment. In contrast, installing PyTorch typically requires downloading hundreds of megabytes of libraries including CUDA toolkit components, cuDNN, and numerous Python packages. Building GGML requires just a few commands:

```
mkdir build && cd build
cmake ..
cmake --build . --config Release
```

### No Runtime Memory Allocation

One of GGML's most distinctive features is that it performs zero memory allocations during inference [12]. All memory is pre-allocated through a context system (`ggml_context`) before computation begins. Users calculate the required memory size upfront, accounting for tensor data, tensor metadata overhead (via `ggml_tensor_overhead()`), and graph overhead (via `ggml_graph_overhead()`). This design eliminates memory fragmentation, makes memory usage predictable, and avoids the overhead of dynamic allocation during model execution. When a graph is planned, the tensor allocator (`ggml_dyn_tallocr`) assigns offsets inside the pre-allocated buffers rather than requesting new memory; it maintains free block lists (up to 256 blocks per chunk) and uses best-fit search to minimize fragmentation [9].
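The best-fit search over a free block list can be illustrated with a small sketch. It is written in Python for brevity and is not GGML's C implementation: the `BestFitArena` class, its methods, and the sizes used are all hypothetical, and the only idea it aims to show is that "allocation" means handing out offsets inside one buffer that was sized in advance.

```python
class BestFitArena:
    """Toy offset allocator over one pre-allocated buffer (hypothetical)."""

    def __init__(self, size):
        self.size = size
        self.free = [(0, size)]  # free blocks as (offset, length)

    def alloc(self, n):
        # Best fit: choose the smallest free block that is still large enough.
        fits = [blk for blk in self.free if blk[1] >= n]
        if not fits:
            raise MemoryError("arena too small; it must be sized up front")
        off, length = min(fits, key=lambda blk: blk[1])
        self.free.remove((off, length))
        if length > n:
            self.free.append((off + n, length - n))  # keep the unused tail
        return off

    def release(self, off, n):
        # Return the block, then merge free blocks that touch each other.
        self.free.append((off, n))
        self.free.sort()
        merged = [self.free[0]]
        for o, length in self.free[1:]:
            prev_off, prev_len = merged[-1]
            if prev_off + prev_len == o:
                merged[-1] = (prev_off, prev_len + length)
            else:
                merged.append((o, length))
        self.free = merged


arena = BestFitArena(1024)
a = arena.alloc(300)      # offset 0
b = arena.alloc(100)      # offset 300
c = arena.alloc(200)      # offset 400
d = arena.alloc(100)      # offset 600
arena.release(a, 300)     # holes: 300 bytes at 0, 324 bytes at 700
arena.release(c, 200)     # adds a 200-byte hole at 400
e = arena.alloc(150)      # best fit takes the 200-byte hole, not the first one
```

A first-fit policy would have placed `e` at offset 0 and split the larger hole; best fit keeps the large holes intact for large tensors, which is the fragmentation argument made above.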

### Memory Mapping (mmap)

GGML uses memory-mapped file I/O (mmap) to load model weights directly into memory without copying them. When a model file is memory-mapped, the operating system creates page table entries that reserve virtual memory addresses, but the actual data pages are loaded into physical memory on demand [12]. This lazy-loading approach means that mapping a 20 GB model file requires only about 40 MB of page tables initially. The individual pages are not loaded into the resident memory set until inference actually accesses them. Multiple processes can share the same memory-mapped model data, enabling efficient multi-instance deployments where several LLM server processes serve different users from the same model file without duplicating the weights in memory.
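The 40 MB figure follows from simple arithmetic, and the lazy-loading behavior can be tried with Python's standard `mmap` module. The sketch below assumes 4 KiB pages and 8-byte page-table entries (typical for x86-64, but an assumption here), and it maps a small temporary file that merely stands in for a model file.

```python
import mmap
import os
import tempfile

PAGE = 4096   # assumed page size in bytes
PTE = 8       # assumed bytes per page-table entry

def page_table_bytes(file_bytes):
    pages = -(-file_bytes // PAGE)   # ceiling division
    return pages * PTE

# 20 GiB -> 5,242,880 pages -> 40 MiB of page-table entries
table_mib = page_table_bytes(20 * 2**30) / 2**20

# Create a 16 KiB stand-in "weights" file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(range(256)) * 64)
    path = f.name

# Map it read-only; data is paged in only when it is touched.
with open(path, "rb") as f:
    weights = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = weights[0]             # touching byte 0 faults in the first page
    sample = weights[4096:4100]    # reads from the second page
    weights.close()
os.remove(path)
```

Because the mapping is read-only and backed by the file, the operating system can share the same physical pages between several processes that map the same path.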

### Deferred Execution

GGML uses a deferred (lazy) execution model. When tensor operations are called, they do not execute immediately. Instead, they create nodes in a directed acyclic graph (DAG) that records the operation type and source tensors [9]. The computation graph is executed only when explicitly triggered via `ggml_graph_compute()` or `ggml_backend_graph_compute()`. This approach allows the library to optimize execution order through topological sorting and enables multi-threaded parallel execution of independent operations.
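A minimal sketch of the idea, in Python rather than GGML's C: operators only record a node, and nothing is evaluated until an explicit compute call. The `Lazy` class and `compute` function are hypothetical stand-ins for illustration, not GGML's API.

```python
class Lazy:
    """A tensor-like node: either a leaf value or an op plus its sources."""
    evaluations = 0  # how many ops have actually been executed

    def __init__(self, value=None, op=None, src=()):
        self.value, self.op, self.src = value, op, src

    def __add__(self, other):
        return Lazy(op="add", src=(self, other))   # records, does not compute

    def __mul__(self, other):
        return Lazy(op="mul", src=(self, other))


def compute(node):
    """Evaluate on demand, caching results so each op runs only once."""
    if node.value is None:
        a, b = (compute(s) for s in node.src)
        node.value = a + b if node.op == "add" else a * b
        Lazy.evaluations += 1
    return node.value


x, y = Lazy(3.0), Lazy(4.0)
z = (x + y) * x              # builds two graph nodes
pending = Lazy.evaluations   # nothing has run yet
result = compute(z)          # explicit trigger evaluates (3 + 4) * 3
```

Separating graph construction from execution is what gives a library the chance to plan memory and ordering for the whole graph before any kernel runs.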

## Architecture and Core Concepts

### Tensor System

GGML represents multi-dimensional data using the `ggml_tensor` structure, supporting up to 4 dimensions. Each tensor contains metadata including element counts (`ne[4]`), byte strides (`nb[4]`), data type, and operation information [9]. The stride mechanism enables zero-copy operations like transpose without data duplication.
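How strides make a transpose free can be shown with a two-dimensional sketch. It mirrors the `ne`/`nb` naming from the text, with index 0 as the innermost, contiguous dimension, but the dictionary layout and helper functions are illustrative assumptions rather than the real `ggml_tensor` struct.

```python
import struct

F32 = 4  # bytes per float32 element

def make_tensor(rows, cols, values):
    """2-D float32 tensor: ne holds element counts, nb holds byte strides."""
    data = struct.pack(f"<{rows * cols}f", *values)
    ne = [cols, rows]         # ne[0] is the innermost (contiguous) dimension
    nb = [F32, F32 * cols]    # bytes to step one element along each dimension
    return {"data": data, "ne": ne, "nb": nb}

def transpose(t):
    """Zero-copy transpose: swap counts and strides, share the same bytes."""
    return {"data": t["data"], "ne": t["ne"][::-1], "nb": t["nb"][::-1]}

def get(t, i0, i1):
    """Read the element at index i0 along dim 0 and i1 along dim 1."""
    offset = i0 * t["nb"][0] + i1 * t["nb"][1]
    return struct.unpack_from("<f", t["data"], offset)[0]

a = make_tensor(2, 3, [1, 2, 3, 4, 5, 6])   # rows: [1 2 3] and [4 5 6]
at = transpose(a)                            # same buffer, swapped strides
```

Reading `get(a, 2, 0)` and `get(at, 0, 2)` resolves to the same byte offset, so both return the same stored value even though no data moved.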

### Computation Graph

At the heart of GGML is the computation graph (`ggml_cgraph`), which represents a DAG of tensor operations. The graph contains two categories of tensors:

- **Leaf tensors**: Input tensors with no associated operation (`op == GGML_OP_NONE`), serving as the starting points of computation.
- **Node tensors**: Tensors produced by operations that need to be computed, representing intermediate and output values.

Graph construction uses `ggml_build_forward_expand()`, which recursively visits parent tensors via the `ggml_visit_parents()` function to build the dependency structure [9]. Each tensor is added to a hash table to prevent duplicate processing. When execution is triggered, the graph traverses nodes in topological order, ensuring that every tensor's inputs have been computed before the tensor itself is evaluated. Once built, multiple threads can safely execute the same graph with different input data.
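The traversal described above amounts to a depth-first visit that emits parents before children and skips tensors it has already seen. The sketch below is a simplified Python rendering of that idea; `T` and `build_forward` are hypothetical names, and the real function handles details (such as visit order and a fixed-size hash table) that are omitted here.

```python
class T:
    """Minimal tensor record: op is None for leaf tensors."""
    def __init__(self, name, op=None, src=()):
        self.name, self.op, self.src = name, op, src


def build_forward(output):
    """Return (leafs, nodes), with nodes in a valid execution order."""
    seen, leafs, nodes = set(), [], []

    def visit(t):
        if id(t) in seen:        # shared tensors are processed only once
            return
        seen.add(id(t))
        for parent in t.src:     # visit inputs first
            visit(parent)
        (leafs if t.op is None else nodes).append(t)

    visit(output)
    return leafs, nodes


x, w, b = T("x"), T("w"), T("b")
h = T("h", "mul_mat", (w, x))
y = T("y", "add", (h, b))
out = T("out", "mul", (y, h))    # h feeds two consumers
leafs, nodes = build_forward(out)
order = [n.name for n in nodes]
```

Appending each tensor only after its parents yields a topological order directly, so no separate sorting pass is needed at execution time.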

### Backend Abstraction Layer

GGML's backend system abstracts hardware-specific execution behind a uniform interface (`ggml_backend_i`). The key components are:

| Component | Purpose |
|---|---|
| `ggml_backend` | Interface for executing computation graphs on specific hardware |
| `ggml_backend_buffer_type` | Memory allocator tied to each backend |
| `ggml_backend_buffer` | Allocated buffer holding data for multiple tensors |
| `ggml_gallocr` | Graph memory allocator that performs liveness analysis for efficient tensor memory reuse |
| `ggml_backend_sched` | Scheduler enabling concurrent use of multiple backends with automatic CPU fallback |

The scheduler (`ggml_backend_sched`) distributes operations across available hardware using a 5-pass assignment algorithm. If an operation is not supported on a GPU backend, the scheduler automatically falls back to the CPU implementation without requiring user intervention. It also inserts automatic tensor copy operations at backend boundaries, so that even if a model uses exotic operations only implemented on the CPU, the rest of the computation can still be offloaded to the GPU [9].
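The fallback and copy-insertion behavior can be pictured with a deliberately simplified sketch. It is a single pass over a linear chain of operations, not the 5-pass algorithm mentioned above, and the function name, backend labels, and `exotic_op` are all made up for illustration.

```python
def assign_backends(ops, gpu_supported):
    """Place each op on "gpu" when supported, otherwise fall back to "cpu"."""
    plan, copies, prev = [], 0, None
    for op in ops:
        backend = "gpu" if op in gpu_supported else "cpu"
        if prev is not None and backend != prev:
            copies += 1          # a tensor copy is needed at the boundary
        plan.append((op, backend))
        prev = backend
    return plan, copies


ops = ["rms_norm", "mul_mat", "exotic_op", "mul_mat", "soft_max"]
plan, copies = assign_backends(
    ops, gpu_supported={"rms_norm", "mul_mat", "soft_max"}
)
```

Even in this toy version the trade-off is visible: one unsupported operation costs two boundary copies, which is why a real scheduler tries to group operations and minimize the number of splits.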

Backends can be loaded statically (compiled into the library) or dynamically at runtime when `GGML_BACKEND_DL` is enabled, loading shared libraries from a configurable directory (`GGML_BACKEND_DIR`).

### Data Types

GGML supports a broad range of data types, spanning full precision, half precision, and over 40 quantized formats [10]:

| Data Type | Description |
|---|---|
| GGML_TYPE_F32 | 32-bit floating point |
| GGML_TYPE_F16 | 16-bit floating point |
| GGML_TYPE_BF16 | Brain floating point (16-bit) |
| GGML_TYPE_Q8_0 | 8-bit quantized (scale-only) |
| GGML_TYPE_Q5_1 | 5-bit quantized (scale + minimum) |
| GGML_TYPE_Q5_0 | 5-bit quantized (scale-only) |
| GGML_TYPE_Q4_1 | 4-bit quantized (scale + minimum) |
| GGML_TYPE_Q4_0 | 4-bit quantized (scale-only) |
| GGML_TYPE_I8 | 8-bit integer |
| GGML_TYPE_I16 | 16-bit integer |
| GGML_TYPE_I32 | 32-bit integer |

Each quantization format in GGML's type system provides three key functions: dequantization (converting quantized values back to floats), quantization (converting float values to the quantized representation), and an optimized vector dot product that operates directly on quantized data [10]. Backend-specific implementations use SIMD instructions, GPU compute kernels, or hardware accelerator instructions to maximize performance.
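The three functions can be sketched for a scale-only 8-bit format. This is a simplified Python illustration in the spirit of Q8_0, not GGML's code: the function names are hypothetical, the block is shorter than 32 elements, and no bit packing or SIMD is shown.

```python
def quantize_q8(block):
    """Scale-only 8-bit quantization of one block: returns (scale, codes)."""
    amax = max(abs(v) for v in block)
    scale = amax / 127 if amax else 1.0
    return scale, [round(v / scale) for v in block]

def dequantize_q8(scale, codes):
    """Convert integer codes back to approximate float values."""
    return [scale * q for q in codes]

def vec_dot_q8(a, b):
    """Dot product on the integer codes, with both scales applied once."""
    (sa, qa), (sb, qb) = a, b
    return sa * sb * sum(x * y for x, y in zip(qa, qb))

x = [0.5, -1.0, 0.25, 2.0]
y = [1.0, 0.5, -0.5, 0.25]
exact = sum(p * q for p, q in zip(x, y))               # float reference
approx = vec_dot_q8(quantize_q8(x), quantize_q8(y))    # integer inner loop
```

The point of the third function is that the inner loop multiplies small integers and touches the floating-point scales only once per block, which is what makes quantized dot products a good match for integer SIMD instructions.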

## Tensor Operations

GGML implements approximately 95 operation types organized into several categories [9]. These operations cover the full range of computations needed for modern neural network inference.

### Arithmetic and Element-wise Operations

Basic mathematical operations include ADD, SUB, MUL, and DIV for element-wise binary computation, with broadcasting support for operations between tensors of different shapes. Unary operations include ABS, SQR, SQRT, LOG, and common [activation functions](/wiki/activation_function) such as GELU, [ReLU](/wiki/relu), SIGMOID, SiLU, and Tanh.

### Linear Algebra

Matrix multiplication (MUL_MAT) is the most performance-critical operation in GGML, as it dominates the computation time during neural network inference [9]. The second operand is automatically transposed internally, so the operation computes A * B^T. GGML also supports MUL_MAT_ID for multi-expert matrix multiplication (used in [Mixture of Experts](/wiki/mixture_of_experts) architectures) and OUT_PROD for outer products.
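The A * B^T form means every output element is a dot product between a row of one operand and a row of the other, so both operands are read along their contiguous dimension. A minimal Python sketch of that pattern follows; the function name is hypothetical, and GGML's own dimension ordering of the result is not modeled here.

```python
def mul_mat_abt(a, b):
    """Compute A * B^T for row-major A (n x k) and B (m x k)."""
    k = len(a[0])
    assert all(len(row) == k for row in a + b), "all rows must have length k"
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]

a = [[1, 2, 3],
     [4, 5, 6]]            # 2 x 3
b = [[1, 0, 1],
     [0, 1, 0]]            # 2 x 3, consumed row by row as B^T
c = mul_mat_abt(a, b)      # 2 x 2 result
```

A practical consequence is that both inputs must share the same row length, which is also why weight matrices can be stored quantized row by row and multiplied without a separate transpose step.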

### Reduction Operations

Reduction operations include SUM, MEAN, ARGMAX, COUNT_EQUAL, and SUM_ROWS, used for aggregation steps in normalization layers, loss computation, and sampling.

### Normalization

GGML supports several normalization schemes: NORM (layer normalization), RMS_NORM (root mean square normalization, commonly used in LLaMA-style models), GROUP_NORM, and L2_NORM. SOFT_MAX implements the [softmax](/wiki/softmax) function used in [attention mechanisms](/wiki/attention).
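For reference, RMS normalization and softmax reduce to a few lines each. The sketch below uses the standard textbook definitions in Python; the `eps` default is an assumption, and any learned scaling weights applied after normalization are left out.

```python
import math

def rms_norm(x, eps=1e-6):
    """Divide x by its root mean square (no mean subtraction)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def soft_max(x):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    total = sum(e)
    return [v / total for v in e]

normed = rms_norm([3.0, 4.0])
probs = soft_max([1.0, 2.0, 3.0])
```

Unlike layer normalization, RMS normalization skips the mean subtraction, which saves one reduction pass over the row.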

### Transformer-Specific Operations

For [Transformer](/wiki/transformer) model inference, GGML includes specialized operations:

- **ROPE / ROPE_BACK**: Rotary [positional encoding](/wiki/positional_encoding) (RoPE), the position embedding scheme used by LLaMA, [Mistral](/wiki/mistral), and many modern LLMs.
- **FLASH_ATTN_EXT**: An optimized flash attention implementation that reduces memory usage and improves performance for attention computation, computing Attention(Q, K, V) = Softmax(QK^T / sqrt(d_k))V with fused kernels. It supports head dimensions that are not multiples of 16 through automatic padding.
- **GET_ROWS**: Embedding lookup operations for token embedding retrieval.
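The attention formula given for FLASH_ATTN_EXT can be written out naively to show what a fused kernel has to reproduce. This Python sketch materializes the full score row for each query, which is exactly the memory cost that flash-attention-style kernels avoid; it makes no claim about how GGML's kernel is structured.

```python
import math

def attention(q, k, v):
    """Naive Softmax(Q K^T / sqrt(d_k)) V over lists of row vectors."""
    d_k = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        m = max(scores)                          # stabilize the exponentials
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        w = [x / total for x in w]
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v))
                    for c in range(len(v[0]))])
    return out

q = [[1.0, 0.0]]
k = [[1.0, 0.0], [1.0, 0.0]]    # identical keys give equal weights
v = [[2.0, 0.0], [4.0, 6.0]]
out = attention(q, k, v)        # averages the two value rows
```

With identical keys the softmax weights are both 0.5, so the output is the plain average of the value rows, which makes the example easy to check by hand.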

### Convolution, Pooling, and Shape Manipulation

GGML supports CONV_1D and CONV_2D convolution operations with IM2COL support, as well as POOL_2D for pooling operations. Shape manipulation operations include VIEW, RESHAPE, PERMUTE, TRANSPOSE, CONCAT, and REPEAT. Many of these are zero-copy operations, creating new views of existing data rather than copying tensor contents. Custom operations (MAP_CUSTOM1, MAP_CUSTOM2, CUSTOM) allow users to define their own computations, extending the library for specialized use cases.

## What is quantization in GGML?

Quantization is one of GGML's most important features. It allows large model weights to be compressed from 16-bit or 32-bit floating point down to as few as 2 bits per weight, dramatically reducing memory requirements and often improving inference speed on CPUs [10].

### Quantization Approaches

GGML uses block-based quantization, where weights are grouped into blocks and each block stores a scale factor (and optionally a minimum value) alongside the quantized weights.

- **Type-0 (Scale-only)**: weight = scale × q. This symmetric approach is used by Q4_0, Q5_0, and Q8_0.
- **Type-1 (Scale + Minimum)**: weight = scale × q + minimum. This handles asymmetric weight distributions better and is used by Q4_1 and Q5_1.
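The two reconstruction formulas can be compared on a single 32-element block. The Python sketch below is a simplified illustration: the function names are hypothetical, the way the scale is chosen is a plain max-based rule, and the codes are kept as Python integers rather than packed nibbles, so it is not bit-compatible with Q4_0 or Q4_1.

```python
def quantize_type0(block, bits=4):
    """Symmetric: w ~ scale * q, with q in [-2^(bits-1), 2^(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(w) for w in block)
    scale = amax / qmax if amax else 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in block]
    return scale, q

def quantize_type1(block, bits=4):
    """Asymmetric: w ~ scale * q + minimum, with q in [0, 2^bits - 1]."""
    levels = 2 ** bits - 1
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((w - lo) / scale) for w in block]
    return scale, lo, q

block = [0.10 + 0.01 * i for i in range(32)]   # all-positive, skewed block
s0, q0 = quantize_type0(block)
s1, m1, q1 = quantize_type1(block)
err0 = max(abs(w - s0 * q) for w, q in zip(block, q0))
err1 = max(abs(w - (s1 * q + m1)) for w, q in zip(block, q1))
```

On this skewed block the type-1 scheme spends all 16 levels on the range the weights actually occupy, so its worst-case error is smaller; on a block centered on zero the extra stored minimum buys much less.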

### Legacy Quantization Types

Legacy formats use a block size of 32 elements per block:

| Type | Nominal Bits per Weight | Size (7B model) | Perplexity Increase (vs. FP16) | Notes |
|---|---|---|---|---|
| Q4_0 | 4 | ~3.50 GB | +0.2499 | Superseded by Q3_K_M |
| Q4_1 | 4 | ~3.90 GB | +0.1846 | Superseded by Q3_K_L |
| Q5_0 | 5 | ~4.30 GB | +0.0796 | Superseded by Q4_K_M |
| Q5_1 | 5 | ~4.70 GB | +0.0415 | Superseded by Q5_K_M |
| Q8_0 | 8 | ~6.70 GB | ~+0.0004 | Nearly indistinguishable from FP16 |
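The bit widths in the table above are nominal code widths; each block also stores its scale (and, for type-1 formats, a minimum), so the effective cost per weight is somewhat higher. The arithmetic can be sketched as follows. The per-block byte counts and the 6.74 billion parameter count for a "7B" LLaMA model are assumptions stated here, not values taken from the table.

```python
# Assumed bytes per 32-weight block for each legacy format.
BLOCK_BYTES = {
    "Q4_0": 2 + 16,           # fp16 scale + 32 four-bit codes
    "Q4_1": 2 + 2 + 16,       # fp16 scale + fp16 minimum + 32 four-bit codes
    "Q5_0": 2 + 4 + 16,       # fp16 scale + 32 high bits + 32 low nibbles
    "Q5_1": 2 + 2 + 4 + 16,   # as Q5_0 plus an fp16 minimum
    "Q8_0": 2 + 32,           # fp16 scale + 32 eight-bit codes
}

def bits_per_weight(name, block_size=32):
    """Effective bits per weight, including per-block metadata."""
    return BLOCK_BYTES[name] * 8 / block_size

def model_gib(name, n_params):
    """Approximate weight storage in GiB if every tensor used this format."""
    return n_params * bits_per_weight(name) / 8 / 2**30

N_PARAMS = 6.74e9   # assumed parameter count for a "7B" LLaMA model
bpw = {name: bits_per_weight(name) for name in BLOCK_BYTES}
q4_0_size = model_gib("Q4_0", N_PARAMS)
q8_0_size = model_gib("Q8_0", N_PARAMS)
```

Under these assumptions the estimates land close to the sizes listed in the table, which suggests the listed sizes already include the per-block overhead; real files differ slightly because some tensors are usually kept at higher precision.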

### K-Quantization (Modern)

K-quantization, introduced in mid-2023, uses super-blocks of 256 elements subdivided into smaller sub-blocks of 16 or 32 elements [10]. The named presets mix precisions across the model: more critical weights receive higher precision while less important ones use lower precision. The suffixes _S (Small), _M (Medium), and _L (Large) denote how aggressively mixed precision is applied. K-quants provide better quality than legacy types at the same average bit rate.

| Type | Avg. Bits per Weight | Size (7B model) | Quality Impact | Recommendation |
|---|---|---|---|---|
| Q2_K | ~2.5 | ~2.67 GB | Extreme loss | Not recommended |
| Q3_K_S | ~3.0 | ~2.75 GB | Very high loss | Small models only |
| Q3_K_M | ~3.5 | ~3.06 GB | High loss | Budget constrained |
| Q3_K_L | ~3.5 | ~3.35 GB | High loss | Budget constrained |
| Q4_K_S | ~4.5 | ~3.59 GB | Moderate | Good balance |
| Q4_K_M | ~4.5 | ~3.80 GB | Balanced | Recommended default |
| Q5_K_S | ~5.0 | ~4.33 GB | Low loss | High quality |
| Q5_K_M | ~5.0 | ~4.45 GB | Low loss | High quality |
| Q6_K | ~6.0 | ~5.15 GB | Very low loss | Near-lossless |

For most users, Q4_K_M provides the best balance between model size and output quality. Q5_K_M costs more memory but degrades quality almost imperceptibly for many tasks. Q6_K approaches near-lossless behavior while still offering meaningful compression [10].

### I-Quantization (Importance-Weighted)

I-quantization (IQ) formats represent the most advanced quantization approach in GGML. Instead of uniform block-wise quantization, IQ formats use vector quantization with lookup tables and importance matrices to achieve better quality at extreme compression levels (2 to 4 bits per weight) [10].

The importance matrix (imatrix) is generated by running calibration data through the model and recording which weights produce the largest activations during inference. Weights that consistently have high impact are quantized with greater care. Non-linear quantization levels allow IQ formats to learn optimal level placements that minimize reconstruction error, rather than using uniformly spaced quantization bins. The key IQ formats include:

| Type | Bits per Weight | Block Size | Description |
|---|---|---|---|
| IQ2_XXS | ~2.06 | 256 | Extreme compression with lookup tables |
| IQ2_XS | ~2.31 | 256 | Slightly higher precision than IQ2_XXS |
| IQ2_S | ~2.50 | 256 | 2-bit with sign bits stored separately |
| IQ3_XXS | ~3.06 | 256 | 3-bit importance-weighted |
| IQ3_S | ~3.44 | 256 | 3-bit with separate sign bits |
| IQ4_XS | ~4.25 | 256 | 4-bit importance-weighted |
| IQ4_NL | ~4.50 | 32 | 4-bit with non-linear quantization levels |

Generating the importance matrix requires access to representative calibration text, which is passed through the model using the `llama-imatrix` tool. The resulting imatrix file is then provided during quantization. Using an importance matrix is highly recommended for IQ formats and generally improves the output quality of all quantization types.
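One way to picture what "quantized with greater care" can mean is to choose the block scale that minimizes an importance-weighted error instead of a plain one. The Python sketch below is a hedged illustration of that idea only: the grid search, the function names, and the importance values are invented for the example and do not reproduce the procedure used by the actual quantization tools.

```python
def weighted_error(block, importance, scale, qmax=7):
    """Sum of importance * (w - scale * q)^2 after rounding to a 4-bit grid."""
    err = 0.0
    for w, imp in zip(block, importance):
        q = max(-qmax - 1, min(qmax, round(w / scale)))
        err += imp * (w - scale * q) ** 2
    return err

def pick_scale(block, importance, qmax=7, steps=40):
    """Try scales at and below the naive max-based one; keep the best."""
    naive = max(abs(w) for w in block) / qmax
    candidates = [naive * (0.6 + 0.4 * i / steps) for i in range(steps)]
    candidates.append(naive)
    best = min(candidates, key=lambda s: weighted_error(block, importance, s))
    return best, naive

block = [0.03, -0.11, 0.07, 0.52, -0.02, 0.09, -0.06, 0.04]
importance = [9.0, 8.0, 7.0, 0.1, 5.0, 6.0, 4.0, 3.0]   # made-up statistics
best, naive = pick_scale(block, importance)
gain = (weighted_error(block, importance, naive)
        - weighted_error(block, importance, best))
```

Here the single large weight has low importance, so a smaller scale that clips it slightly can represent the many small but important weights more precisely; `gain` measures how much weighted error that choice removes.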

### Special Quantization Types

GGML also supports specialty formats such as TQ1_0, which encodes ternary weights (values restricted to {-1, 0, +1}). This format is relevant for models like BitNet that use ternary weight representations, and for very large models (such as [DeepSeek](/wiki/deepseek)) where extreme compression is needed. Additionally, MXFP4 provides a microscaling 4-bit floating point format.

## Which hardware backends does GGML support?

GGML supports a wide range of hardware backends through its backend abstraction layer. The same model code can run across different processors and accelerators without modification.

### CPU

The CPU backend is GGML's most mature and fully featured backend. It includes hand-tuned SIMD (Single Instruction, Multiple Data) kernels for multiple instruction set architectures:

| Architecture | Instruction Sets | Notes |
|---|---|---|
| x86_64 | AVX, AVX2, AVX-512 | Intel and AMD processors |
| ARM | NEON, SVE | Apple Silicon, Qualcomm, ARM servers |
| WebAssembly | WASM SIMD | Browser-based inference |

CPU feature detection happens both at build time through CMake options (`GGML_NATIVE`, `GGML_AVX2`, `GGML_AVX512`) and at runtime. When `GGML_CPU_ALL_VARIANTS` is enabled, multiple CPU backend variants are compiled, and the runtime selects the best one based on detected processor features. On ARM platforms, the KleidiAI integration provides additional optimized matrix multiplication kernels that detect features like DOTPROD and I8MM at runtime.

### CUDA (NVIDIA GPUs)

The CUDA backend provides GPU acceleration for [NVIDIA](/wiki/nvidia) GPUs. It implements GPU kernels for matrix multiplication, quantization/dequantization, element-wise operations, and specialized operations like flash attention. The CUDA backend is enabled at build time with `-DGGML_CUDA=ON` and requires the NVIDIA CUDA toolkit.

### Metal (Apple GPUs)

The Metal backend targets Apple Silicon GPUs (M1, M2, M3, M4 series) and is enabled by default on macOS builds. Metal support was one of the early differentiators for GGML, as it allowed MacBook users to accelerate LLM inference using their integrated GPUs. The Metal backend can fuse multiple operations into a single kernel for improved performance.

### Vulkan

The [Vulkan](/wiki/vulkan) backend provides cross-platform GPU acceleration through the Vulkan compute API. It compiles GLSL compute shaders to SPIR-V bytecode and executes them via Vulkan compute pipelines. This backend works across NVIDIA, AMD, Intel, and other Vulkan-capable GPUs on Windows, Linux, and Android, making it the most hardware-agnostic GPU option in GGML.

### ROCm / HIP (AMD GPUs)

The HIP backend provides GPU acceleration for AMD Radeon graphics cards through AMD's ROCm (Radeon Open Compute) platform. HIP is AMD's CUDA-compatible API, so the CUDA backend code is largely reused through HIP's compatibility layer, reducing maintenance burden. In July 2025, AMD completed extensive optimization efforts for llama.cpp that were upstreamed to the main repository, improving performance on AMD Instinct GPUs such as the MI300X [11].

### SYCL (Intel GPUs)

The SYCL backend supports Intel GPUs through the Intel oneAPI toolkit and the SYCL/DPC++ programming model. It covers Intel Data Center Max Series (Ponte Vecchio), Flex Series, Arc Series discrete GPUs, and integrated GPUs from 11th Generation Intel Core processors onward. The SYCL backend leverages Intel oneMKL for optimized BLAS operations and oneDNN for neural network primitives.

### Additional Backends

GGML also supports several other hardware targets:

| Backend | Target Hardware | API |
|---|---|---|
| MUSA | Moore Threads GPUs | Moore Threads MUSA API |
| OpenCL | Qualcomm Adreno GPUs | OpenCL compute |
| CANN | Huawei Ascend NPUs | Huawei CANN toolkit |

### Multi-Backend Execution

GGML supports using multiple backends simultaneously. For example, a build can enable both CUDA and Vulkan backends with the CMake flags `-DGGML_CUDA=ON -DGGML_VULKAN=ON`. At run time, the backend scheduler (`ggml_backend_sched`) splits each graph across the available devices, copies tensors at backend boundaries, and falls back to the CPU for any operation that a GPU backend does not implement [9].

## How does GGUF differ from the original GGML format?

### The Original GGML Format

The original GGML file format stored model weights and basic metadata in a single file optimized for memory mapping. While effective, it had significant limitations: adding new features often broke compatibility with existing models, hyperparameters were stored as a list of untyped values, and the format lacked flexibility for storing essential metadata like tokenizer information or model-specific parameters [7]. Two intermediate revisions, GGMF and GGJT, attempted to address some of these issues but still suffered from extensibility problems [8].

### GGUF: The Successor

On August 21, 2023, the llama.cpp team introduced [GGUF](/wiki/gguf) as a replacement for the GGML file format, formalized through pull request #302 in the ggml repository [8]. The acronym is commonly expanded as GGML Universal File (and sometimes as GPT-Generated Unified Format) [8]. GGUF was a complete redesign that addressed the predecessor's shortcomings:

- **Extensible key-value metadata**: GGUF replaced untyped hyperparameter lists with a structured key-value metadata system that can store tokenizer vocabularies, model architecture parameters, training details, and custom fields [6].
- **Self-contained files**: A single GGUF file contains everything needed to load and run a model, including all metadata that previously had to be supplied separately.
- **Backward-compatible evolution**: New fields can be added without breaking older readers [8].

A GGUF file is a binary format consisting of three main parts: a header (containing a magic number, version, and counts), metadata (key-value pairs), and tensor data (the actual model weights) [6]. The format has become the standard for distributing quantized models. As of 2026, Hugging Face hosts thousands of models in GGUF format. The original GGML format is deprecated and is no longer supported by modern tools like llama.cpp, [Ollama](/wiki/ollama), and [LM Studio](/wiki/lmstudio) [8].
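The header portion is small enough to parse by hand. The sketch below assumes the little-endian layout used by GGUF version 2 and later (4-byte magic `GGUF`, 32-bit version, 64-bit tensor count, 64-bit metadata key-value count) and works on a synthetic header built in memory; the counts are made up, and the metadata and tensor sections that follow the header are not parsed.

```python
import struct

def pack_header(version, tensor_count, kv_count):
    """Build a 24-byte header: magic, version, tensor count, KV count."""
    return b"GGUF" + struct.pack("<IQQ", version, tensor_count, kv_count)

def parse_header(buf):
    """Parse the fixed-size header at the start of a GGUF (v2+) buffer."""
    if buf[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", buf, 4)
    return {"version": version, "tensors": tensor_count, "metadata_kvs": kv_count}

raw = pack_header(version=3, tensor_count=291, kv_count=24)
header = parse_header(raw)
```

To inspect a real file, the same function could be applied to the first 24 bytes read from disk; checking the magic first is a cheap way to reject files in the older GGML, GGMF, or GGJT formats.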

## Projects Built on GGML

GGML has become the foundation for a growing ecosystem of C/C++ inference tools.

| Project | Description | GitHub Stars (approx.) |
|---|---|---|
| [llama.cpp](/wiki/llama_cpp) | LLM inference engine supporting hundreds of model architectures | ~97,000 |
| [whisper.cpp](/wiki/whisper_cpp) | Port of OpenAI's Whisper speech recognition model | ~47,000 |
| [stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp) | Stable Diffusion, Flux, Wan, and other diffusion model inference | ~4,000 |
| [bark.cpp](https://github.com/PABannier/bark.cpp) | Port of Suno AI's Bark text-to-speech model | ~800 |

Beyond direct C/C++ implementations, GGML powers several popular user-facing applications:

- **Ollama**: A tool for running LLMs locally with a simple command-line interface, built on llama.cpp.
- **LM Studio**: A desktop application for discovering, downloading, and running local LLMs.
- **GPT4All**: A desktop application by Nomic AI for running local language models.
- **Jan**: An open-source [ChatGPT](/wiki/chatgpt) alternative that runs on consumer hardware.
- **LocalAI**: An OpenAI-compatible API server for running models locally.
- **KoboldCpp**: A text generation interface designed for creative writing and roleplaying.

### Language Bindings

Several community-maintained bindings allow GGML and llama.cpp to be used from other programming languages:

| Language | Binding | Notes |
|---|---|---|
| Python | llama-cpp-python, ggml-python | Most widely used bindings |
| Go | go-llama.cpp | Go native interface |
| Rust | llama-cpp-rs | Rust safe wrapper |
| Node.js | node-llama-cpp | JavaScript/TypeScript integration |
| Swift | Native Swift bindings | Included in the llama.cpp repository |
| C# / .NET | LLamaSharp | .NET ecosystem support |

## How does GGML compare to PyTorch and ONNX Runtime?

The following table compares GGML with other frameworks commonly used for machine learning inference.

| Feature | GGML | [PyTorch](/wiki/pytorch) | [ONNX Runtime](/wiki/onnx) |
|---|---|---|---|
| Primary language | C | Python / C++ | C++ |
| Primary use case | Inference | Training + Inference | Inference |
| Binary size | < 1 MB | Hundreds of MB | ~50 MB |
| Dependencies | None | Many (CUDA, cuDNN, etc.) | Moderate |
| Quantization | Native (2-8 bit, 40+ formats) | Via add-on libraries (bitsandbytes, GPTQ) | Limited native support |
| Training support | Limited (autodiff, ADAM, L-BFGS) | Full | None (inference only) |
| GPU backends | CUDA, Metal, Vulkan, ROCm, SYCL | CUDA, ROCm, MPS | CUDA, DirectML, ROCm |
| Memory mapping | Native mmap | Not standard | Not standard |
| Target deployment | Edge / consumer devices | Cloud / research | Cross-platform |
| Model format | GGUF | .pt / [.safetensors](/wiki/safetensors) | .onnx |
| Runtime memory allocation | Zero during inference | Dynamic | Dynamic |
| WebAssembly support | Yes (WASM SIMD) | No | Yes (ONNX Runtime Web) |
| License | MIT | BSD-3 | MIT |

GGML occupies a distinct niche in the ML framework landscape. While PyTorch and [TensorFlow](/wiki/tensorflow) are general-purpose frameworks designed for both training and inference with extensive Python ecosystems, GGML focuses almost exclusively on inference efficiency with a C-first approach. ONNX Runtime is the closest in purpose (cross-platform inference), but GGML's aggressive quantization support and native memory mapping give it clear advantages for running very large models on memory-constrained consumer devices [12]. For local inference on CPUs, GGML-based tools are often reported to deliver higher tokens-per-second throughput than PyTorch or ONNX Runtime at equivalent memory budgets, although results vary with model, hardware, and quantization level.

## Automatic Differentiation

Although GGML is primarily used for inference, it does support automatic differentiation for building backward passes. When gradients are enabled on tensors, the system constructs a backward graph that computes derivatives with respect to each input [9]. Many of the approximately 95 operations have a corresponding backward implementation, though coverage is not complete. The library includes ADAM and L-BFGS optimizers, making it technically capable of fine-tuning and small-scale training. However, this feature sees limited practical use compared to the inference capabilities, and most training workloads remain with PyTorch or similar frameworks.

## Community and Ecosystem

The GGML community is closely intertwined with the llama.cpp community, as GGML development largely happens within the llama.cpp repository alongside standalone GGML releases.

### Development Model

GGML follows an open development model on GitHub under the ggml-org organization. As of March 2026, the standalone GGML repository has seen over 3,500 commits from more than 585 contributors [1]. The project maintains bidirectional synchronization with downstream projects (llama.cpp, whisper.cpp) using automated scripts that track synchronized commits, filter already back-ported changes, and generate patches preserving authorship.

### Impact on Local AI

GGML and llama.cpp have had a transformative effect on the [local AI](/wiki/local_ai) movement. Before their arrival, running large language models required expensive cloud GPU instances or high-end workstations with dedicated GPUs. By combining aggressive quantization with optimized CPU and GPU kernels, GGML made it practical to run 7-billion-parameter models on a laptop and 70-billion-parameter models on a single consumer GPU [12]. This democratization of model access contributed to the rapid growth of open-source AI and the creation of numerous local AI tools and interfaces. The ability to run models entirely offline, without sending data to external servers, also addressed growing privacy concerns around cloud-based AI services.
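The memory savings behind this shift follow from simple arithmetic. The sketch below is an estimate under stated assumptions rather than a measurement: it assumes a Q4_0-style layout in which each block of 32 weights stores 4 bits per weight plus one 16-bit scale, it counts weights only (no KV cache, activations, or file metadata), and the function name `weight_bytes` is hypothetical.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate storage for model weights alone, in bytes."""
    return n_params * bits_per_weight / 8


# Assumed Q4_0-style block: 32 weights at 4 bits each, plus one 16-bit scale.
BLOCK_WEIGHTS = 32
block_bits = BLOCK_WEIGHTS * 4 + 16
bits_per_weight = block_bits / BLOCK_WEIGHTS  # effective bits, scale included

n_params = 7e9  # a 7-billion-parameter model

fp16_gb = weight_bytes(n_params, 16) / 1e9
q4_gb = weight_bytes(n_params, bits_per_weight) / 1e9

print(f"effective bits per weight: {bits_per_weight}")
print(f"FP16 weights:  {fp16_gb:.2f} GB")
print(f"4-bit weights: {q4_gb:.2f} GB")
```

Under these assumptions the 7-billion-parameter model drops from about 14 GB at 16-bit precision to just under 4 GB. The same arithmetic puts a 70-billion-parameter model at roughly 39 GB, which suggests that fitting models of that size on a single consumer GPU generally depends on lower-bit formats or on keeping some layers in system RAM.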

llama.cpp was recognized in GitHub's Octoverse 2025 report as one of the top open-source projects by contributor count, reflecting the project's central role in open-source AI infrastructure [2].

## Limitations

While GGML is highly effective for its intended use case, it has several notable limitations:

- **Training**: Although automatic differentiation is supported, GGML is not designed for large-scale training workloads. PyTorch and TensorFlow remain far superior for this purpose.
- **Backend coverage**: Not all of the approximately 95 tensor operations are implemented on every backend. An operation that runs on CPU may have no GPU implementation, in which case that part of the graph falls back to the CPU.
- **Low-level API**: GGML requires knowledge of C programming, memory management, and tensor computation. There is no high-level Python-first interface comparable to PyTorch's.
- **Rapid development pace**: The library is under heavy active development. Breaking API changes occur regularly, which can be challenging for downstream projects that depend on specific GGML interfaces.
- **Documentation**: While improving, documentation remains sparse compared to established frameworks. Much of the learning happens through reading source code and community discussions.
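The backend-coverage limitation above can be pictured with a small sketch of per-operation fallback. Everything here is hypothetical: the `SUPPORTED` table, the operation names, and the `assign_backends` function are illustrative and do not reflect GGML's actual scheduler or any real backend's coverage.

```python
# Hypothetical support table: which operations each backend implements.
SUPPORTED = {
    "gpu": {"MUL_MAT", "ADD", "SOFT_MAX"},
    "cpu": {"MUL_MAT", "ADD", "SOFT_MAX", "CUSTOM_OP"},
}


def assign_backends(graph_ops, preferred="gpu", fallback="cpu"):
    """Place each op on the preferred backend, falling back when unsupported."""
    plan = []
    for op in graph_ops:
        if op in SUPPORTED[preferred]:
            plan.append((op, preferred))
        elif op in SUPPORTED[fallback]:
            plan.append((op, fallback))
        else:
            raise ValueError(f"no backend implements {op}")
    return plan


plan = assign_backends(["MUL_MAT", "CUSTOM_OP", "ADD"])

# Count how often consecutive ops land on different backends.
switches = sum(1 for a, b in zip(plan, plan[1:]) if a[1] != b[1])
```

In this example the middle operation lands on the CPU, so execution crosses between backends twice. Each crossing typically means moving tensor data between device and host memory, which is why a single unsupported operation can cost more than its own compute time.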

## See Also

- [llama.cpp](/wiki/llama_cpp)
- [Whisper](/wiki/whisper)
- [GGUF](/wiki/gguf)
- [Quantization](/wiki/quantization)
- [Hugging Face](/wiki/hugging_face)
- [Local AI](/wiki/local_ai)
- [ONNX](/wiki/onnx)
- [PyTorch](/wiki/pytorch)

## References

1. Gerganov, G. "ggml: Tensor library for machine learning." GitHub repository, ggml-org/ggml. https://github.com/ggml-org/ggml
2. Gerganov, G. "llama.cpp: LLM inference in C/C++." GitHub repository, ggml-org/llama.cpp. https://github.com/ggml-org/llama.cpp
3. Gerganov, G. "whisper.cpp: Port of OpenAI's Whisper model in C/C++." GitHub repository, ggml-org/whisper.cpp. https://github.com/ggml-org/whisper.cpp
4. "Bringing Whisper and LLaMA to the masses with Georgi Gerganov." Changelog Interviews #532, 2023. https://changelog.com/podcast/532
5. "GGML and llama.cpp join HF to ensure the long-term progress of Local AI." Hugging Face Blog, February 20, 2026. https://huggingface.co/blog/ggml-joins-hf
6. "GGUF specification." GitHub, ggml-org/ggml. https://github.com/ggml-org/ggml/blob/master/docs/gguf.md
7. "GGUF versus GGML." IBM Think. https://www.ibm.com/think/topics/gguf-versus-ggml
8. "GGUF." Wikipedia. https://en.wikipedia.org/wiki/GGUF
9. "Tensor Operations and Computation Graphs." DeepWiki, ggml-org/ggml. https://deepwiki.com/ggml-org/ggml/2.1-tensor-operations
10. "Quantization Techniques." DeepWiki, ggml-org/llama.cpp. https://deepwiki.com/ggml-org/llama.cpp/6.3-quantization-techniques
11. "Llama.cpp Meets Instinct: A New Era of Open-Source AI Acceleration." AMD ROCm Blog. https://rocm.blogs.amd.com/ecosystems-and-partners/llama-cpp/README.html
12. "Understanding ggml, from the ground up." Manan Shah, February 2025. https://mananshah99.github.io/blog/2025/02/23/ggml/
13. Bellard, F. "LibNC: C Library for Tensor Manipulation." https://bellard.org/libnc/
14. leejet. "stable-diffusion.cpp: Diffusion model inference in pure C/C++." GitHub repository. https://github.com/leejet/stable-diffusion.cpp
15. "ggml.ai joins Hugging Face to ensure the long-term progress of Local AI." Simon Willison's Weblog, February 20, 2026. https://simonwillison.net/2026/Feb/20/ggmlai-joins-hugging-face/

