llama.cpp is an open-source large language model inference engine written in C and C++ by Bulgarian software engineer Georgi Gerganov. First released on March 10, 2023, it allows users to run large language models on consumer-grade hardware without requiring Python or PyTorch. The project has become one of the most influential open-source AI projects in history, accumulating over 98,000 stars on GitHub and serving as the foundation for popular tools such as Ollama, LM Studio, and GPT4All. In February 2026, Gerganov and his team at ggml.ai joined Hugging Face to continue development under long-term institutional support.
Before creating llama.cpp, Georgi Gerganov had already built a reputation in the open-source machine learning community with whisper.cpp, a C/C++ port of OpenAI's Whisper speech recognition model. In late September 2022, Gerganov began work on the GGML (Georgi Gerganov Machine Learning) tensor library, a lightweight C library for tensor algebra designed with strict memory management and multi-threading in mind.
When Meta released its LLaMA (Large Language Model Meta AI) family of models in February 2023, the weights were intended for academic researchers. However, the model weights were leaked to the public within days via a torrent posted on 4chan. Meta's original LLaMA implementation depended on PyTorch and their FairScale extension for multi-GPU execution, requiring CUDA and NVIDIA hardware. This meant most individual developers and researchers could not easily run the model.
Gerganov recognized the opportunity to make LLaMA accessible to everyone. Using his GGML tensor library as the backbone, he built llama.cpp as a from-scratch implementation of the LLaMA inference code in pure C/C++ with minimal dependencies. The initial commit was pushed to GitHub on March 10, 2023, and the project immediately gained traction in the AI community.
Within its first week, llama.cpp attracted thousands of stars on GitHub. The project filled a critical gap: it let anyone with a modern laptop or desktop computer run a capable language model locally, privately, and for free. Developers across the world began contributing optimizations, hardware backends, and support for additional model architectures.
By mid-2023, llama.cpp had become the de facto standard for local LLM inference. The project was featured in GitHub's Octoverse 2025 report as one of the top open-source projects by contributor count. As of March 2026, the repository has accumulated over 98,700 stars, 15,600 forks, and contributions from nearly 800 developers.
In 2023, Gerganov founded ggml.ai, a company based in Sofia, Bulgaria, to support the ongoing development of the GGML tensor library and llama.cpp. The company received pre-seed funding from Nat Friedman (former CEO of GitHub) and Daniel Gross, two prominent angel investors in the AI space.
On February 20, 2026, Gerganov announced that ggml.ai would join Hugging Face. Under this arrangement, the GGML team became full-time Hugging Face employees while retaining full autonomy and technical leadership over the project. Critically, llama.cpp and GGML remain 100% open-source under the MIT License, and the community-driven development model continues unchanged. The partnership aims to create seamless, near-single-click integration between the Hugging Face transformers library and the GGML/llama.cpp ecosystem, making local inference more accessible than ever.
The core goal of llama.cpp is to run large language models efficiently on commodity hardware with minimal setup. Several design decisions reflect this philosophy:

- A plain C/C++ implementation with minimal external dependencies, so the project builds on virtually any platform.
- Quantization as a first-class feature, shrinking models to fit in consumer RAM and VRAM.
- A broad set of optional hardware backends rather than a hard dependency on any single vendor's stack.
- A single-file model format (GGUF) that bundles weights, tokenizer, and metadata together.
At the heart of llama.cpp lies GGML, a C library for machine learning that provides the tensor operations needed for transformer inference. GGML implements:

- Core tensor algebra (matrix multiplication, element-wise operations, activations) on dense tensors.
- 16-bit float and quantized integer data types alongside 32-bit floats.
- Multi-threaded CPU execution with SIMD acceleration.
- Strict, up-front memory management, avoiding allocations during inference.
GGML takes a different approach from frameworks like PyTorch or TensorFlow. Rather than providing automatic differentiation and training capabilities, it focuses exclusively on inference performance. This narrower scope allows for aggressive optimization and a much smaller codebase.
llama.cpp supports a wide range of hardware acceleration backends, allowing it to run on nearly any modern computing device.
| Backend | Hardware | Vendor | Notes |
|---|---|---|---|
| CPU | x86, ARM, RISC-V | Various | Default backend; uses SIMD instructions (AVX2, NEON, etc.) |
| Metal | Apple GPU | Apple | Optimized for Apple Silicon (M1, M2, M3, M4); uses unified memory |
| CUDA | NVIDIA GPU | NVIDIA | Custom kernels for high throughput; requires NVIDIA driver |
| Vulkan | Cross-platform GPU | Khronos Group | Works on NVIDIA, AMD, Intel, and mobile GPUs |
| HIP/ROCm | AMD GPU | AMD | AMD's CUDA-equivalent; supports Radeon and Instinct GPUs |
| SYCL | Intel GPU/CPU | Intel | Supports Intel Arc, Data Center Max, and integrated graphics |
| OpenCL | Qualcomm Adreno GPU | Qualcomm | Contributed by Qualcomm for mobile and edge devices |
| MUSA | Moore Threads GPU | Moore Threads | Support for Chinese-market GPUs |
| CANN | Huawei Ascend NPU | Huawei | For Huawei's AI accelerator hardware |
| WebGPU | Browser GPU | W3C | Enables in-browser LLM inference |
| RPC | Network-distributed | Community | Distributes computation across multiple machines |
| Hexagon | Qualcomm DSP | Qualcomm | Targets Qualcomm's Hexagon digital signal processors |
Multiple backends can be enabled simultaneously. For example, a user can build llama.cpp with both CUDA and Vulkan support, then choose the backend at runtime. The engine also supports hybrid CPU+GPU inference, where some model layers run on the GPU and the remainder run on the CPU. This is particularly useful when a model is too large to fit entirely in GPU VRAM.
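The hybrid split comes down to simple memory arithmetic: layers are offloaded to the GPU until the VRAM budget is exhausted, and the rest stay on the CPU. The sketch below estimates that layer count; the sizes are illustrative assumptions, not measurements of any particular model.

```python
def layers_that_fit(n_layers: int, layer_mib: int, overhead_mib: int, vram_mib: int) -> int:
    """Estimate how many transformer layers fit in VRAM; the rest run on the CPU."""
    free = vram_mib - overhead_mib          # reserve room for KV cache, buffers, etc.
    return max(0, min(n_layers, free // layer_mib))

# Illustrative numbers: a 32-layer model whose quantized layers are ~150 MiB each.
full_offload = layers_that_fit(32, 150, 1024, 6144)   # 6 GiB card: all 32 layers fit
partial      = layers_that_fit(32, 150, 1024, 4096)   # 4 GiB card: only some layers fit
```

In llama.cpp itself, the resulting count is what a user passes as the number of GPU-offloaded layers via the `-ngl`/`--n-gpu-layers` option.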
llama.cpp is especially well-suited for Apple Silicon Macs, thanks to the Metal backend and Apple's unified memory architecture. Because the CPU and GPU share the same memory pool, there is no need to copy model weights between system RAM and VRAM. Typical performance figures for a 7B-parameter model on Apple Silicon (Q4 quantization) range from 30 to 120 tokens per second depending on the chip generation and memory bandwidth. On an M2 Ultra with 96 GB of unified memory, even 70B-parameter models can run at around 8 tokens per second.
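These figures follow largely from memory bandwidth: token-by-token generation must stream essentially all model weights once per token, so bandwidth divided by model size gives a rough upper bound on speed. A back-of-the-envelope sketch, where the 800 GB/s and 40 GB figures are illustrative assumptions rather than measurements:

```python
def tokens_per_sec_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough upper bound: each generated token reads every weight once."""
    return bandwidth_gb_s / model_gb

# M2 Ultra-class bandwidth (~800 GB/s) and a ~40 GB 70B Q4 model:
ceiling = tokens_per_sec_ceiling(800, 40)
```

Observed throughput lands below this bound because compute overhead and the KV cache consume part of the memory-bandwidth budget; a reported ~8 tokens per second for a 70B model sits comfortably under the ~20 tokens-per-second ceiling this arithmetic suggests.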
The model file format used by llama.cpp has gone through several iterations:

- GGML: the original unversioned format inherited from the GGML library.
- GGMF: added a version number to the file header.
- GGJT: introduced aligned tensor data so that model files could be memory-mapped.
- GGUF: introduced in August 2023, adding extensible key-value metadata; the current format.

As of 2026, GGUF is the only format supported by llama.cpp. All prior formats have been deprecated.
GGUF (GPT-Generated Unified Format) is a binary file format optimized for fast loading and inference. Key design features include:

- Single-file deployment: weights, tokenizer, and all configuration travel together, with no external files required.
- Extensible key-value metadata, so new information can be added without breaking existing readers.
- Memory-mappable tensor data for fast model loading.
- A version field in the header to preserve compatibility as the format evolves.
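A GGUF file begins with a small fixed header: the ASCII magic `GGUF`, a format version, and the tensor and metadata-pair counts, all little-endian. The sketch below parses such a header from raw bytes; it is a simplified illustration of the layout, not a full reader (metadata values and tensor descriptors follow the header and are skipped here).

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header: magic, version, tensor count, KV count."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}

# Build a synthetic header for demonstration (version 3, 291 tensors, 24 KV pairs).
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
info = parse_gguf_header(header)
```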
GGUF files are widely hosted on Hugging Face, where community members publish pre-quantized versions of popular models in various quantization levels.
Quantization is the process of reducing the precision of model weights from their original 16-bit or 32-bit floating-point representation to lower bit widths. This reduces model file size and memory usage while also improving inference speed on many hardware configurations. llama.cpp provides a rich set of quantization options.
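The idea behind the simplest scheme, Q8_0, fits in a few lines: weights are grouped into blocks of 32, and each block stores one 16-bit scale plus 32 signed 8-bit values. That is where the 8.5 bits-per-weight figure in the table comes from: (32 × 8 + 16) / 32 = 8.5. A pure-Python sketch of the round trip (a conceptual illustration, not llama.cpp's actual C kernels):

```python
def quantize_q8_block(weights):
    """Symmetric 8-bit quantization of one 32-weight block (Q8_0-style sketch)."""
    assert len(weights) == 32
    scale = max(abs(w) for w in weights) / 127 or 1.0  # per-block scale
    q = [round(w / scale) for w in weights]            # signed 8-bit values
    return scale, q

def dequantize_q8_block(scale, q):
    return [scale * v for v in q]

bits_per_weight = (32 * 8 + 16) / 32   # 32 int8 values + one fp16 scale = 8.5 bpw

block = [(-1) ** i * i / 10 for i in range(32)]  # toy weights
scale, q = quantize_q8_block(block)
restored = dequantize_q8_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(block, restored))
```

Round-to-nearest guarantees the reconstruction error of any weight is at most half the block scale, which is why larger blocks or coarser scales cost quality.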
The following table lists the quantization types available in llama.cpp, using Llama 3.1 8B as a reference model (original F16 size: 14.96 GiB).
| Type | Bits per Weight | Model Size (8B) | Category | Notes |
|---|---|---|---|---|
| F16 | 16.00 | 14.96 GiB | Full precision | Baseline; no quantization |
| Q8_0 | 8.50 | 7.95 GiB | 8-bit | Near-lossless quality; best for accuracy-sensitive tasks |
| Q6_K | 6.56 | 6.14 GiB | 6-bit K-quant | Very high quality; minimal perplexity loss |
| Q5_K_M | 5.70 | 5.33 GiB | 5-bit K-quant | Recommended; excellent balance of quality and size |
| Q5_K_S | 5.57 | 5.21 GiB | 5-bit K-quant | Slightly smaller than Q5_K_M with minor quality trade-off |
| Q4_K_M | 4.89 | 4.58 GiB | 4-bit K-quant | Recommended default; good balance for most use cases |
| Q4_K_S | 4.67 | 4.36 GiB | 4-bit K-quant | Smaller variant with slightly lower quality |
| Q3_K_L | 4.30 | 4.02 GiB | 3-bit K-quant | Noticeable quality reduction; suitable for constrained memory |
| Q3_K_M | 4.00 | 3.74 GiB | 3-bit K-quant | Lower quality; for tight memory budgets |
| Q3_K_S | 3.64 | 3.41 GiB | 3-bit K-quant | Aggressive compression; noticeable quality loss |
| Q2_K | 3.16 | 2.95 GiB | 2-bit K-quant | Very aggressive; significant quality degradation |
| Q2_K_S | 2.97 | 2.78 GiB | 2-bit K-quant | Most aggressive K-quant; for extreme memory constraints |
| IQ4_XS | 4.46 | 4.17 GiB | Importance quant | Better quality than Q3_K_L at similar size |
| IQ3_M | 3.76 | 3.52 GiB | Importance quant | Competitive with Q3_K_M |
| IQ3_XXS | 3.25 | 3.04 GiB | Importance quant | Very small with reasonable quality |
| IQ2_M | 2.93 | 2.74 GiB | Importance quant | Ultra-compressed; for extreme constraints |
| IQ2_XS | 2.59 | 2.42 GiB | Importance quant | Sub-3-bit with importance matrix |
| IQ1_M | 2.15 | 2.01 GiB | Importance quant | Near-minimum viable quality |
| IQ1_S | 2.00 | 1.87 GiB | Importance quant | Smallest available; experimental quality |
The "K-quant" methods (types containing "K" in their name) were introduced by community contributor "ikawrakow" and represent a significant improvement over the original quantization types (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0). K-quants use a block-wise approach in which different tensors receive different quantization precision based on their importance to model quality. The suffixes S (Small), M (Medium), and L (Large) indicate how many tensors receive higher-precision treatment: S variants quantize all tensors at the base precision, M variants keep a few especially sensitive tensors (such as the attention value and feed-forward down-projection weights) at higher precision, and L variants extend the higher-precision treatment to still more tensors.
For most users, Q4_K_M is the recommended starting point. It provides a good balance between model size, inference speed, and output quality. Users who need higher quality should try Q5_K_M or Q6_K, while those with tight memory constraints can drop to Q3_K_M or the IQ-series quantizations.
The IQ (Importance Quantization) series uses an importance matrix to determine which weights are most critical to model performance. By allocating more bits to important weights and fewer bits to less important ones, IQ methods achieve better quality at the same file size compared to standard quantization. The IQ methods are particularly effective at very low bit widths (2-3 bits), where standard quantization would cause unacceptable quality degradation.
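The core idea can be illustrated with a toy scale search: instead of minimizing plain squared error, each weight's error is weighted by an importance score (in practice derived from activation statistics over a calibration set). This is a conceptual sketch only; the real IQ methods use non-linear quantization grids and considerably more elaborate machinery.

```python
def weighted_err(weights, importance, scale):
    """Importance-weighted squared error after round-to-nearest at this scale."""
    err = 0.0
    for w, imp in zip(weights, importance):
        q = max(-127, min(127, round(w / scale)))
        err += imp * (w - q * scale) ** 2
    return err

weights    = [0.8, -0.02, 0.5, 1.9, -0.7, 0.01, 1.2, -1.5]
importance = [9.0, 0.1, 4.0, 8.0, 2.0, 0.1, 6.0, 5.0]  # e.g. mean squared activations

naive = max(abs(w) for w in weights) / 127              # plain max-abs scale
# Search scales near the naive choice, keeping the importance-weighted best.
candidates = [naive] + [naive * (0.85 + 0.01 * i) for i in range(31)]
best = min(candidates, key=lambda s: weighted_err(weights, importance, s))
```

By construction the chosen scale is never worse (under the weighted metric) than the naive max-abs scale, which is the essential advantage an importance matrix provides.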
llama.cpp supports over 70 model architectures. Any model that has been converted to the GGUF format can potentially be loaded. The project maintains a conversion script (convert_hf_to_gguf.py) that can convert models from Hugging Face's safetensors format to GGUF.
Major supported model families include the Llama series (LLaMA, Llama 2, and Llama 3), Mistral and Mixtral, Qwen, Gemma, Phi, Falcon, DeepSeek, GPT-2, and BERT-style embedding models, among many others.
llama.cpp also supports several multimodal vision-language models through its libmtmd (multi-modal) library. Supported vision models include LLaVA, MiniCPM-V, Qwen2-VL, and the vision-capable Gemma 3 variants, among others.
llama.cpp implements a powerful grammar system that constrains model output to follow specific formats. Users can define valid output patterns using GBNF (GGML Backus-Naur Form), a grammar specification syntax. This enables reliable structured output for tasks that require precise formatting, such as generating valid JSON, XML, SQL queries, or code in a specific programming language. The grammar system can also enforce JSON Schema constraints directly, making it easy to integrate with applications that expect structured data.
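As an illustration, a small hand-written GBNF grammar (not taken from the llama.cpp repository) that forces the model to emit a JSON object containing exactly a string "name" field and an integer "age" field might look like this:

```
# root is the entry point; whitespace is made explicit via the ws rule
root   ::= "{" ws "\"name\":" ws string "," ws "\"age\":" ws number ws "}"
string ::= "\"" [a-zA-Z ]* "\""
number ::= [0-9]+
ws     ::= [ \t\n]*
```

Grammars like this are passed to llama-cli or llama-server with the `--grammar-file` option (or inline via `--grammar`); during sampling, any token that would violate the grammar is masked out before the next token is chosen.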
Speculative decoding is an inference optimization technique that uses a smaller, faster "draft" model to predict multiple tokens ahead, which are then verified by the larger "target" model in a single forward pass. When the draft model's predictions are correct (which happens frequently for common patterns and predictable text), the effective generation speed increases significantly. Users have reported speedups of 1.5x to 2x or more with this technique. llama.cpp's server supports multiple speculative decoding strategies, including n-gram based approaches that do not require a separate draft model.
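The mechanic is easy to simulate. In the toy sketch below, both "models" are deterministic next-token functions over integer tokens; the draft agrees with the target most of the time, and each verification round costs a single (simulated) target forward pass. The setup is entirely illustrative.

```python
def target(ctx):
    """The 'large' model: deterministically continues a counting pattern."""
    return (ctx[-1] + 1) % 10

def draft(ctx):
    """The 'small' model: usually agrees with the target, occasionally wrong."""
    t = (ctx[-1] + 1) % 10
    return t if len(ctx) % 7 else (t + 5) % 10

def speculative_decode(prompt, n_tokens, k=4):
    out = list(prompt)
    target_calls = 0
    while len(out) - len(prompt) < n_tokens:
        # The draft proposes k tokens autoregressively (cheap to run).
        ctx, proposal = list(out), []
        for _ in range(k):
            proposal.append(draft(ctx))
            ctx.append(proposal[-1])
        # One target pass verifies all k proposals; keep the matching prefix.
        target_calls += 1
        for t in proposal:
            if t == target(out):
                out.append(t)
            else:
                out.append(target(out))  # target's own token fixes the mismatch
                break
    return out[len(prompt):][:n_tokens], target_calls

tokens, calls = speculative_decode([1], 20)
```

The output is identical to plain greedy decoding with the target alone, but the number of target passes is well below the number of tokens generated, which is where the speedup comes from.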
LoRA (Low-Rank Adaptation) adapters can be loaded at runtime using the --lora flag. This allows users to apply fine-tuned behavior on top of a base model without needing a separate full-sized model file. Multiple LoRA adapters can be loaded simultaneously, and each can be assigned a different scaling factor.
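The underlying arithmetic is simple: the adapted layer computes y = Wx + s·B(Ax), where A and B are small low-rank matrices and s is the scaling factor mentioned above. A toy rank-1 example with illustrative shapes and values (real adapters operate on much larger matrices):

```python
def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2x2)
A = [[0.5, 0.5]]               # rank-1 down-projection (1x2)
B = [[1.0], [-1.0]]            # rank-1 up-projection (2x1)
scale = 0.8                    # per-adapter scaling factor

def lora_forward(x):
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))   # B @ (A @ x)
    return [b + scale * l for b, l in zip(base, low_rank)]

y = lora_forward([2.0, 4.0])
```

Because the low-rank product is added at runtime, the base weights never change on disk, which is why a single GGUF file can serve many adapters.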
The command-line interface (llama-cli) supports an interactive conversation mode with customizable chat templates. Templates for popular model formats (ChatML, Llama-style, Alpaca, Vicuna, and others) are built in, and users can define custom templates for new model formats.
llama.cpp can generate text embeddings from supported models, enabling use cases like semantic search, retrieval-augmented generation (RAG), and document clustering.
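A typical retrieval flow embeds documents once, embeds the query at search time, and ranks documents by cosine similarity. The sketch below uses tiny hand-made 3-dimensional vectors in place of real embedding output (an actual model would return hundreds of dimensions):

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy vectors standing in for llama.cpp embedding output.
docs = {
    "feline care": [0.9, 0.1, 0.0],
    "tax law":     [0.0, 0.2, 0.95],
}
query = [0.85, 0.15, 0.05]   # an embedded query about cats
best = max(docs, key=lambda d: cosine(query, docs[d]))
```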
Through its RPC backend, llama.cpp can distribute model layers across multiple machines on a network. This allows users to pool the resources of several computers to run models that would not fit on any single machine.
llama.cpp includes a built-in HTTP server (llama-server) that exposes an OpenAI-compatible API. This means applications, scripts, and libraries designed to work with the OpenAI API can be pointed at a local llama.cpp server with minimal code changes.
The server provides the following API endpoints:
- `/v1/chat/completions` for chat-based interaction (compatible with the OpenAI Chat API)
- `/v1/completions` for text completion
- `/v1/embeddings` for generating text embeddings
- `/v1/models` for listing loaded models
- `/health` for server health checks

Starting the server is straightforward:

```shell
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```
This downloads the specified model from Hugging Face and starts the server on port 8080. Applications can then connect using the standard OpenAI Python client:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="gemma-3-1b-it",
    messages=[{"role": "user", "content": "Explain quantum computing."}],
)
print(response.choices[0].message.content)
```
The server also supports concurrent requests, model hot-swapping, streaming responses, and function calling.
llama.cpp has spawned a rich ecosystem of applications and tools that build on its inference capabilities.
Ollama is the most widely used consumer-facing tool built on llama.cpp. It provides a single-binary installation with a built-in model registry, automatic GPU detection, and an OpenAI-compatible REST API. Users can download and run models with simple commands like ollama pull llama3 and ollama run llama3. Ollama abstracts away the complexity of model management, quantization selection, and hardware configuration, making local LLM inference accessible to users who may not be comfortable compiling C++ software from source.
LM Studio is a desktop application that uses llama.cpp as its primary inference backend. It provides a graphical interface for browsing, downloading, and running GGUF models from Hugging Face. LM Studio includes a built-in chat interface, a local server with an OpenAI-compatible API, and tools for comparing model outputs side by side. It is particularly popular among users who want a visual, point-and-click experience.
GPT4All is a local-first AI assistant developed by Nomic AI. It uses llama.cpp under the hood and provides a desktop application with a clean chat interface, document ingestion for local RAG, and support for running models entirely offline. GPT4All targets users who prioritize privacy and want a self-contained AI assistant that never sends data to external servers.
Jan is an open-source desktop application that positions itself as a privacy-first alternative to ChatGPT. Built on top of llama.cpp, it offers a polished chat interface, local model management, and an OpenAI-compatible API server. Jan supports extensions and plugins, allowing developers to customize its functionality.
The llama-cpp-python library provides Python bindings for llama.cpp, making it easy to integrate local inference into Python applications. It includes an OpenAI-compatible server mode, support for function calling, and integration with popular frameworks like LangChain and LlamaIndex.
Additional projects in the llama.cpp ecosystem include KoboldCpp, text-generation-webui, llamafile, and LocalAI.
llama.cpp can be installed through multiple methods:
- Homebrew (macOS and Linux): `brew install llama.cpp`
- WinGet (Windows): `winget install llama.cpp`
- Pre-built binaries from the project's GitHub releases
- Building from source with CMake, enabling backend-specific flags (e.g., `-DGGML_CUDA=ON` for NVIDIA GPU support)

Compiling from source provides the most flexibility, allowing users to enable exactly the backends and optimizations their hardware supports.
llama.cpp has had a transformative effect on the accessibility of large language models. Before its release in March 2023, running a large language model locally required expensive NVIDIA GPUs, deep knowledge of Python environments, and familiarity with machine learning frameworks. Cloud inference through APIs was the only practical option for most developers and businesses.
llama.cpp changed this calculus entirely. By enabling efficient inference on commodity hardware, including laptops, desktop PCs, and even smartphones, it opened the door to a new paradigm of private, local AI. Several important consequences followed:
Privacy and data sovereignty. Organizations handling sensitive data (medical records, legal documents, financial information) could now run AI models entirely on-premises without sending data to third-party cloud providers.
Cost reduction. For many use cases, running a quantized model locally is far cheaper than paying per-token API fees, especially for high-volume applications.
Offline capability. llama.cpp models run without an internet connection, making them suitable for field work, air-gapped environments, and regions with unreliable connectivity.
Developer experimentation. The low barrier to entry encouraged a wave of experimentation. Thousands of developers who had never worked with machine learning began building AI-powered applications using local models.
Standardization of GGUF. The GGUF format, driven by llama.cpp's adoption, became the de facto standard for distributing quantized language models. By 2024, the AI community had largely abandoned older formats in favor of GGUF, and Hugging Face integrated GGUF support directly into its platform.
Edge AI expansion. llama.cpp demonstrated that capable language models could run on edge devices, catalyzing interest in on-device AI for smartphones, embedded systems, IoT devices, and automotive applications.
The project's success also influenced how model developers release their work. Many organizations now publish GGUF-quantized versions of their models alongside the standard formats, recognizing that a significant portion of their user base runs models locally through llama.cpp or tools built on top of it.
llama.cpp is one of the most actively developed open-source projects in the AI space. The project follows a rapid release cadence, with new tagged releases appearing multiple times per month. Development is coordinated through GitHub issues and pull requests, with Gerganov maintaining the role of lead maintainer and primary architect.
The community has contributed support for dozens of model architectures, multiple hardware backends, and a steady stream of performance optimizations. Notable community contributions include the K-quant methods (by ikawrakow), the Vulkan backend, the OpenCL backend (by Qualcomm), and the RPC-based distributed inference system.
The project's MIT license has encouraged widespread adoption and forking. Ollama, which incorporates llama.cpp as a core component, is itself one of the most popular AI projects on GitHub.
llama.cpp occupies a distinct niche in the landscape of LLM inference engines: where server-focused projects such as vLLM and NVIDIA's TensorRT-LLM optimize for high-throughput batched serving on datacenter GPUs, and Apple's MLX targets Apple Silicon exclusively, llama.cpp prioritizes single-user and small-scale inference on whatever hardware is at hand.
llama.cpp's key advantages are its broad hardware support, minimal dependencies, aggressive quantization options, and the simplicity of the single-file GGUF format. These qualities make it the preferred choice for individual developers, hobbyists, and organizations that need to run models across diverse hardware environments.