llama.cpp is an open-source large language model inference engine written in C and C++ by Bulgarian software engineer Georgi Gerganov. First released on March 10, 2023, it allows users to run large language models on consumer-grade hardware without requiring Python or PyTorch. The project has become one of the most influential open-source AI projects in history, accumulating over 98,000 stars on GitHub and serving as the foundation for popular tools such as Ollama, LM Studio, and GPT4All. In February 2026, Gerganov and his team at ggml.ai joined Hugging Face to continue development under long-term institutional support.
Before creating llama.cpp, Georgi Gerganov had already built a reputation in the open-source machine learning community with whisper.cpp, a C/C++ port of OpenAI's Whisper speech recognition model. In late September 2022, Gerganov began work on the GGML (Georgi Gerganov Machine Learning) tensor library, a lightweight C library for tensor algebra designed with strict memory management and multi-threading in mind.
When Meta released its LLaMA (Large Language Model Meta AI) family of models in February 2023, the weights were intended for academic researchers. However, the model weights were leaked to the public within days via a torrent posted on 4chan. Meta's original LLaMA implementation depended on PyTorch and their FairScale extension for multi-GPU execution, requiring CUDA and NVIDIA hardware. This meant most individual developers and researchers could not easily run the model.
Gerganov recognized the opportunity to make LLaMA accessible to everyone. Using his GGML tensor library as the backbone, he built llama.cpp as a from-scratch implementation of the LLaMA inference code in pure C/C++ with minimal dependencies. The initial commit was pushed to GitHub on March 10, 2023, and the project immediately gained traction in the AI community.
Within its first week, llama.cpp attracted thousands of stars on GitHub. The project filled a critical gap: it let anyone with a modern laptop or desktop computer run a capable language model locally, privately, and for free. Developers across the world began contributing optimizations, hardware backends, and support for additional model architectures.
By mid-2023, llama.cpp had become the de facto standard for local LLM inference. The project was featured in GitHub's Octoverse 2025 report as one of the top open-source projects by contributor count. As of March 2026, the repository has accumulated over 98,700 stars, 15,600 forks, and contributions from nearly 800 developers.
In 2023, Gerganov founded ggml.ai, a company based in Sofia, Bulgaria, to support the ongoing development of the GGML tensor library and llama.cpp. The company received pre-seed funding from Nat Friedman (former CEO of GitHub) and Daniel Gross, two prominent angel investors in the AI space.
On February 20, 2026, Gerganov announced that ggml.ai would join Hugging Face. Under this arrangement, the GGML team became full-time Hugging Face employees while retaining full autonomy and technical leadership over the project. Critically, llama.cpp and GGML remain 100% open-source under the MIT License, and the community-driven development model continues unchanged. The partnership aims to create seamless, near-single-click integration between the Hugging Face transformers library and the GGML/llama.cpp ecosystem, making local inference more accessible than ever.
The core goal of llama.cpp is to run large language models efficiently on commodity hardware with minimal setup. Several design decisions reflect this philosophy:

- A plain C/C++ implementation with minimal external dependencies, so the project builds on virtually any platform.
- Quantization as a first-class feature, shrinking models to fit in consumer RAM and VRAM.
- A broad set of optional hardware backends rather than a hard dependency on any single vendor's stack.
- A single-file model format (GGUF) that bundles weights, tokenizer, and metadata together.
At the heart of llama.cpp lies GGML, a C library for machine learning that provides the tensor operations needed for transformer inference. GGML implements:

- Core tensor algebra (matrix multiplication, element-wise operations, activations) on dense tensors.
- 16-bit float and quantized integer data types alongside 32-bit floats.
- Multi-threaded CPU execution with SIMD acceleration.
- Strict, up-front memory management, avoiding allocations during inference.
GGML takes a different approach from frameworks like PyTorch or TensorFlow. Rather than providing automatic differentiation and training capabilities, it focuses exclusively on inference performance. This narrower scope allows for aggressive optimization and a much smaller codebase.
llama.cpp supports a wide range of hardware acceleration backends, allowing it to run on nearly any modern computing device.
| Backend | Hardware | Vendor | Notes |
|---|---|---|---|
| CPU | x86, ARM, RISC-V | Various | Default backend; uses SIMD instructions (AVX2, NEON, etc.) |
| Metal | Apple GPU | Apple | Optimized for Apple Silicon (M1, M2, M3, M4); uses unified memory |
| CUDA | NVIDIA GPU | NVIDIA | Custom kernels for high throughput; requires NVIDIA driver |
| Vulkan | Cross-platform GPU | Khronos Group | Works on NVIDIA, AMD, Intel, and mobile GPUs |
| HIP/ROCm | AMD GPU | AMD | AMD's CUDA-equivalent; supports Radeon and Instinct GPUs |
| SYCL | Intel GPU/CPU | Intel | Supports Intel Arc, Data Center Max, and integrated graphics |
| OpenCL | Qualcomm Adreno GPU | Qualcomm | Contributed by Qualcomm for mobile and edge devices |
| MUSA | Moore Threads GPU | Moore Threads | Support for Chinese-market GPUs |
| CANN | Huawei Ascend NPU | Huawei | For Huawei's AI accelerator hardware |
| WebGPU | Browser GPU | W3C | Enables in-browser LLM inference |
| RPC | Network-distributed | Community | Distributes computation across multiple machines |
| Hexagon | Qualcomm DSP | Qualcomm | Targets Qualcomm's Hexagon digital signal processors |
Multiple backends can be enabled simultaneously. For example, a user can build llama.cpp with both CUDA and Vulkan support, then choose the backend at runtime. The engine also supports hybrid CPU+GPU inference, where some model layers run on the GPU and the remainder run on the CPU. This is particularly useful when a model is too large to fit entirely in GPU VRAM.
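The hybrid split comes down to simple memory arithmetic: layers are offloaded to the GPU until the VRAM budget is exhausted, and the rest stay on the CPU. The sketch below estimates that layer count; the sizes are illustrative assumptions, not measurements of any particular model.

```python
def layers_that_fit(n_layers: int, layer_mib: int, overhead_mib: int, vram_mib: int) -> int:
    """Estimate how many transformer layers fit in VRAM; the rest run on the CPU."""
    free = vram_mib - overhead_mib          # reserve room for KV cache, buffers, etc.
    return max(0, min(n_layers, free // layer_mib))

# Illustrative numbers: a 32-layer model whose quantized layers are ~150 MiB each.
full_offload = layers_that_fit(32, 150, 1024, 6144)   # 6 GiB card: all 32 layers fit
partial      = layers_that_fit(32, 150, 1024, 4096)   # 4 GiB card: only some layers fit
```

In llama.cpp itself, the resulting count is what a user passes as the number of GPU-offloaded layers via the `-ngl`/`--n-gpu-layers` option.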
llama.cpp is especially well-suited for Apple Silicon Macs, thanks to the Metal backend and Apple's unified memory architecture. Because the CPU and GPU share the same memory pool, there is no need to copy model weights between system RAM and VRAM. Typical performance figures for a 7B-parameter model on Apple Silicon (Q4 quantization) range from 30 to 120 tokens per second depending on the chip generation and memory bandwidth. On an M2 Ultra with 96 GB of unified memory, even 70B-parameter models can run at around 8 tokens per second.
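These figures follow largely from memory bandwidth: token-by-token generation must stream essentially all model weights once per token, so bandwidth divided by model size gives a rough upper bound on speed. A back-of-the-envelope sketch, where the 800 GB/s and 40 GB figures are illustrative assumptions rather than measurements:

```python
def tokens_per_sec_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough upper bound: each generated token reads every weight once."""
    return bandwidth_gb_s / model_gb

# M2 Ultra-class bandwidth (~800 GB/s) and a ~40 GB 70B Q4 model:
ceiling = tokens_per_sec_ceiling(800, 40)
```

Observed throughput lands below this bound because compute overhead and the KV cache consume part of the memory-bandwidth budget; a reported ~8 tokens per second for a 70B model sits comfortably under the ~20 tokens-per-second ceiling this arithmetic suggests.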
The model file format used by llama.cpp has gone through several iterations:

- GGML: the original unversioned format inherited from the GGML library.
- GGMF: added a version number to the file header.
- GGJT: introduced aligned tensor data so that model files could be memory-mapped.
- GGUF: introduced in August 2023, adding extensible key-value metadata; the current format.

As of 2026, GGUF is the only format supported by llama.cpp. All prior formats have been deprecated.
GGUF (GPT-Generated Unified Format) is a binary file format optimized for fast loading and inference. Key design features include:

- Single-file deployment: weights, tokenizer, and all configuration travel together, with no external files required.
- Extensible key-value metadata, so new information can be added without breaking existing readers.
- Memory-mappable tensor data for fast model loading.
- A version field in the header to preserve compatibility as the format evolves.
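A GGUF file begins with a small fixed header: the ASCII magic `GGUF`, a format version, and the tensor and metadata-pair counts, all little-endian. The sketch below parses such a header from raw bytes; it is a simplified illustration of the layout, not a full reader (metadata values and tensor descriptors follow the header and are skipped here).

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header: magic, version, tensor count, KV count."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}

# Build a synthetic header for demonstration (version 3, 291 tensors, 24 KV pairs).
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
info = parse_gguf_header(header)
```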
GGUF files are widely hosted on Hugging Face, where community members publish pre-quantized versions of popular models in various quantization levels.
Quantization is the process of reducing the precision of model weights from their original 16-bit or 32-bit floating-point representation to lower bit widths. This reduces model file size and memory usage while also improving inference speed on many hardware configurations. llama.cpp provides a rich set of quantization options.
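The idea behind the simplest scheme, Q8_0, fits in a few lines: weights are grouped into blocks of 32, and each block stores one 16-bit scale plus 32 signed 8-bit values. That is where the 8.5 bits-per-weight figure in the table comes from: (32 × 8 + 16) / 32 = 8.5. A pure-Python sketch of the round trip (a conceptual illustration, not llama.cpp's actual C kernels):

```python
def quantize_q8_block(weights):
    """Symmetric 8-bit quantization of one 32-weight block (Q8_0-style sketch)."""
    assert len(weights) == 32
    scale = max(abs(w) for w in weights) / 127 or 1.0  # per-block scale
    q = [round(w / scale) for w in weights]            # signed 8-bit values
    return scale, q

def dequantize_q8_block(scale, q):
    return [scale * v for v in q]

bits_per_weight = (32 * 8 + 16) / 32   # 32 int8 values + one fp16 scale = 8.5 bpw

block = [(-1) ** i * i / 10 for i in range(32)]  # toy weights
scale, q = quantize_q8_block(block)
restored = dequantize_q8_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(block, restored))
```

Round-to-nearest guarantees the reconstruction error of any weight is at most half the block scale, which is why larger blocks or coarser scales cost quality.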
The following table lists the quantization types available in llama.cpp, using Llama 3.1 8B as a reference model (original F16 size: 14.96 GiB).
| Type | Bits per Weight | Model Size (8B) | Category | Notes |
|---|---|---|---|---|
| F16 | 16.00 | 14.96 GiB | Full precision | Baseline; no quantization |
| Q8_0 | 8.50 | 7.95 GiB | 8-bit | Near-lossless quality; best for accuracy-sensitive tasks |
| Q6_K | 6.56 | 6.14 GiB | 6-bit K-quant | Very high quality; minimal perplexity loss |
| Q5_K_M | 5.70 | 5.33 GiB | 5-bit K-quant | Recommended; excellent balance of quality and size |
| Q5_K_S | 5.57 | 5.21 GiB | 5-bit K-quant | Slightly smaller than Q5_K_M with minor quality trade-off |
| Q4_K_M | 4.89 | 4.58 GiB | 4-bit K-quant | Recommended default; good balance for most use cases |
| Q4_K_S | 4.67 | 4.36 GiB | 4-bit K-quant | Smaller variant with slightly lower quality |
| Q3_K_L | 4.30 | 4.02 GiB | 3-bit K-quant | Noticeable quality reduction; suitable for constrained memory |
| Q3_K_M | 4.00 | 3.74 GiB | 3-bit K-quant | Lower quality; for tight memory budgets |
| Q3_K_S | 3.64 | 3.41 GiB | 3-bit K-quant | Aggressive compression; noticeable quality loss |
| Q2_K | 3.16 | 2.95 GiB | 2-bit K-quant | Very aggressive; significant quality degradation |
| Q2_K_S | 2.97 | 2.78 GiB | 2-bit K-quant | Most aggressive K-quant; for extreme memory constraints |
| IQ4_XS | 4.46 | 4.17 GiB | Importance quant | Better quality than Q3_K_L at similar size |
| IQ3_M | 3.76 | 3.52 GiB | Importance quant | Competitive with Q3_K_M |
| IQ3_XXS | 3.25 | 3.04 GiB | Importance quant | Very small with reasonable quality |
| IQ2_M | 2.93 | 2.74 GiB | Importance quant | Ultra-compressed; for extreme constraints |
| IQ2_XS | 2.59 | 2.42 GiB | Importance quant | Sub-3-bit with importance matrix |
| IQ1_M | 2.15 | 2.01 GiB | Importance quant | Near-minimum viable quality |
| IQ1_S | 2.00 | 1.87 GiB | Importance quant | Smallest available; experimental quality |
The "K-quant" methods (types containing "K" in their name) were introduced by community contributor "ikawrakow" and represent a significant improvement over the original quantization types (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0). K-quants use a block-wise approach in which different tensors receive different quantization precision based on their importance to model quality. The suffixes S (Small), M (Medium), and L (Large) indicate how many tensors receive higher-precision treatment: S variants quantize all tensors at the base precision, M variants keep a few especially sensitive tensors (such as the attention value and feed-forward down-projection weights) at higher precision, and L variants extend the higher-precision treatment to still more tensors.
For most users, Q4_K_M is the recommended starting point. It provides a good balance between model size, inference speed, and output quality. Users who need higher quality should try Q5_K_M or Q6_K, while those with tight memory constraints can drop to Q3_K_M or the IQ-series quantizations.
The IQ (Importance Quantization) series uses an importance matrix to determine which weights are most critical to model performance. By allocating more bits to important weights and fewer bits to less important ones, IQ methods achieve better quality at the same file size compared to standard quantization. The IQ methods are particularly effective at very low bit widths (2-3 bits), where standard quantization would cause unacceptable quality degradation.
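The core idea can be illustrated with a toy scale search: instead of minimizing plain squared error, each weight's error is weighted by an importance score (in practice derived from activation statistics over a calibration set). This is a conceptual sketch only; the real IQ methods use non-linear quantization grids and considerably more elaborate machinery.

```python
def weighted_err(weights, importance, scale):
    """Importance-weighted squared error after round-to-nearest at this scale."""
    err = 0.0
    for w, imp in zip(weights, importance):
        q = max(-127, min(127, round(w / scale)))
        err += imp * (w - q * scale) ** 2
    return err

weights    = [0.8, -0.02, 0.5, 1.9, -0.7, 0.01, 1.2, -1.5]
importance = [9.0, 0.1, 4.0, 8.0, 2.0, 0.1, 6.0, 5.0]  # e.g. mean squared activations

naive = max(abs(w) for w in weights) / 127              # plain max-abs scale
# Search scales near the naive choice, keeping the importance-weighted best.
candidates = [naive] + [naive * (0.85 + 0.01 * i) for i in range(31)]
best = min(candidates, key=lambda s: weighted_err(weights, importance, s))
```

By construction the chosen scale is never worse (under the weighted metric) than the naive max-abs scale, which is the essential advantage an importance matrix provides.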
llama.cpp supports over 70 model architectures. Any model that has been converted to the GGUF format can potentially be loaded. The project maintains a conversion script (convert_hf_to_gguf.py) that can convert models from Hugging Face's safetensors format to GGUF.
Major supported model families include the Llama series (LLaMA, Llama 2, and Llama 3), Mistral and Mixtral, Qwen, Gemma, Phi, Falcon, DeepSeek, GPT-2, and BERT-style embedding models, among many others.
llama.cpp also supports several multimodal vision-language models through its libmtmd (multi-modal) library. Supported vision models include LLaVA, MiniCPM-V, Qwen2-VL, and the vision-capable Gemma 3 variants, among others.
llama.cpp implements a powerful grammar system that constrains model output to follow specific formats. Users can define valid output patterns using GBNF (GGML Backus-Naur Form), a grammar specification syntax. This enables reliable structured output for tasks that require precise formatting, such as generating valid JSON, XML, SQL queries, or code in a specific programming language. The grammar system can also enforce JSON Schema constraints directly, making it easy to integrate with applications that expect structured data.
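As an illustration, a small hand-written GBNF grammar (not taken from the llama.cpp repository) that forces the model to emit a JSON object containing exactly a string "name" field and an integer "age" field might look like this:

```
# root is the entry point; whitespace is made explicit via the ws rule
root   ::= "{" ws "\"name\":" ws string "," ws "\"age\":" ws number ws "}"
string ::= "\"" [a-zA-Z ]* "\""
number ::= [0-9]+
ws     ::= [ \t\n]*
```

Grammars like this are passed to llama-cli or llama-server with the `--grammar-file` option (or inline via `--grammar`); during sampling, any token that would violate the grammar is masked out before the next token is chosen.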
Speculative decoding is an inference optimization technique that uses a smaller, faster "draft" model to predict multiple tokens ahead, which are then verified by the larger "target" model in a single forward pass. When the draft model's predictions are correct (which happens frequently for common patterns and predictable text), the effective generation speed increases significantly. Users have reported speedups of 1.5x to 2x or more with this technique. llama.cpp's server supports multiple speculative decoding strategies, including n-gram based approaches that do not require a separate draft model.
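The mechanic is easy to simulate. In the toy sketch below, both "models" are deterministic next-token functions over integer tokens; the draft agrees with the target most of the time, and each verification round costs a single (simulated) target forward pass. The setup is entirely illustrative.

```python
def target(ctx):
    """The 'large' model: deterministically continues a counting pattern."""
    return (ctx[-1] + 1) % 10

def draft(ctx):
    """The 'small' model: usually agrees with the target, occasionally wrong."""
    t = (ctx[-1] + 1) % 10
    return t if len(ctx) % 7 else (t + 5) % 10

def speculative_decode(prompt, n_tokens, k=4):
    out = list(prompt)
    target_calls = 0
    while len(out) - len(prompt) < n_tokens:
        # The draft proposes k tokens autoregressively (cheap to run).
        ctx, proposal = list(out), []
        for _ in range(k):
            proposal.append(draft(ctx))
            ctx.append(proposal[-1])
        # One target pass verifies all k proposals; keep the matching prefix.
        target_calls += 1
        for t in proposal:
            if t == target(out):
                out.append(t)
            else:
                out.append(target(out))  # target's own token fixes the mismatch
                break
    return out[len(prompt):][:n_tokens], target_calls

tokens, calls = speculative_decode([1], 20)
```

The output is identical to plain greedy decoding with the target alone, but the number of target passes is well below the number of tokens generated, which is where the speedup comes from.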
LoRA (Low-Rank Adaptation) adapters can be loaded at runtime using the --lora flag. This allows users to apply fine-tuned behavior on top of a base model without needing a separate full-sized model file. Multiple LoRA adapters can be loaded simultaneously, and each can be assigned a different scaling factor.
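The underlying arithmetic is simple: the adapted layer computes y = Wx + s·B(Ax), where A and B are small low-rank matrices and s is the scaling factor mentioned above. A toy rank-1 example with illustrative shapes and values (real adapters operate on much larger matrices):

```python
def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2x2)
A = [[0.5, 0.5]]               # rank-1 down-projection (1x2)
B = [[1.0], [-1.0]]            # rank-1 up-projection (2x1)
scale = 0.8                    # per-adapter scaling factor

def lora_forward(x):
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))   # B @ (A @ x)
    return [b + scale * l for b, l in zip(base, low_rank)]

y = lora_forward([2.0, 4.0])
```

Because the low-rank product is added at runtime, the base weights never change on disk, which is why a single GGUF file can serve many adapters.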
The command-line interface (llama-cli) supports an interactive conversation mode with customizable chat templates. Templates for popular model formats (ChatML, Llama-style, Alpaca, Vicuna, and others) are built in, and users can define custom templates for new model formats.
llama.cpp can generate text embeddings from supported models, enabling use cases like semantic search, retrieval-augmented generation (RAG), and document clustering.
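A typical retrieval flow embeds documents once, embeds the query at search time, and ranks documents by cosine similarity. The sketch below uses tiny hand-made 3-dimensional vectors in place of real embedding output (an actual model would return hundreds of dimensions):

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy vectors standing in for llama.cpp embedding output.
docs = {
    "feline care": [0.9, 0.1, 0.0],
    "tax law":     [0.0, 0.2, 0.95],
}
query = [0.85, 0.15, 0.05]   # an embedded query about cats
best = max(docs, key=lambda d: cosine(query, docs[d]))
```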
Through its RPC backend, llama.cpp can distribute model layers across multiple machines on a network. This allows users to pool the resources of several computers to run models that would not fit on any single machine.
llama.cpp includes a built-in HTTP server (llama-server) that exposes an OpenAI-compatible API. This means applications, scripts, and libraries designed to work with the OpenAI API can be pointed at a local llama.cpp server with minimal code changes.
The server provides the following API endpoints:
- `/v1/chat/completions` for chat-based interaction (compatible with the OpenAI Chat API)
- `/v1/completions` for text completion
- `/v1/embeddings` for generating text embeddings
- `/v1/models` for listing loaded models
- `/health` for server health checks

Starting the server is straightforward:

```shell
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```
This downloads the specified model from Hugging Face and starts the server on port 8080. Applications can then connect using the standard OpenAI Python client:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="gemma-3-1b-it",
    messages=[{"role": "user", "content": "Explain quantum computing."}],
)
print(response.choices[0].message.content)
```
The server also supports concurrent requests, model hot-swapping, streaming responses, and function calling.
llama.cpp has spawned a rich ecosystem of applications and tools that build on its inference capabilities.
Ollama is the most widely used consumer-facing tool built on llama.cpp. It provides a single-binary installation with a built-in model registry, automatic GPU detection, and an OpenAI-compatible REST API. Users can download and run models with simple commands like ollama pull llama3 and ollama run llama3. Ollama abstracts away the complexity of model management, quantization selection, and hardware configuration, making local LLM inference accessible to users who may not be comfortable compiling C++ software from source.
LM Studio is a desktop application that uses llama.cpp as its primary inference backend. It provides a graphical interface for browsing, downloading, and running GGUF models from Hugging Face. LM Studio includes a built-in chat interface, a local server with an OpenAI-compatible API, and tools for comparing model outputs side by side. It is particularly popular among users who want a visual, point-and-click experience.
GPT4All is a local-first AI assistant developed by Nomic AI. It uses llama.cpp under the hood and provides a desktop application with a clean chat interface, document ingestion for local RAG, and support for running models entirely offline. GPT4All targets users who prioritize privacy and want a self-contained AI assistant that never sends data to external servers.
Jan is an open-source desktop application that positions itself as a privacy-first alternative to ChatGPT. Built on top of llama.cpp, it offers a polished chat interface, local model management, and an OpenAI-compatible API server. Jan supports extensions and plugins, allowing developers to customize its functionality.
The llama-cpp-python library provides Python bindings for llama.cpp, making it easy to integrate local inference into Python applications. It includes an OpenAI-compatible server mode, support for function calling, and integration with popular frameworks like LangChain and LlamaIndex.
Additional projects in the llama.cpp ecosystem include KoboldCpp, text-generation-webui, llamafile, and LocalAI.
llama.cpp can be installed through multiple methods:
- Homebrew (macOS and Linux): `brew install llama.cpp`
- WinGet (Windows): `winget install llama.cpp`
- Pre-built binaries from the project's GitHub releases
- Building from source with CMake, enabling backend-specific flags (e.g., `-DGGML_CUDA=ON` for NVIDIA GPU support)

Compiling from source provides the most flexibility, allowing users to enable exactly the backends and optimizations their hardware supports.
llama.cpp has had a transformative effect on the accessibility of large language models. Before its release in March 2023, running a large language model locally required expensive NVIDIA GPUs, deep knowledge of Python environments, and familiarity with machine learning frameworks. Cloud inference through APIs was the only practical option for most developers and businesses.
llama.cpp changed this calculus entirely. By enabling efficient inference on commodity hardware, including laptops, desktop PCs, and even smartphones, it opened the door to a new paradigm of private, local AI. Several important consequences followed:
Privacy and data sovereignty. Organizations handling sensitive data (medical records, legal documents, financial information) could now run AI models entirely on-premises without sending data to third-party cloud providers.
Cost reduction. For many use cases, running a quantized model locally is far cheaper than paying per-token API fees, especially for high-volume applications.
Offline capability. llama.cpp models run without an internet connection, making them suitable for field work, air-gapped environments, and regions with unreliable connectivity.
Developer experimentation. The low barrier to entry encouraged a wave of experimentation. Thousands of developers who had never worked with machine learning began building AI-powered applications using local models.
Standardization of GGUF. The GGUF format, driven by llama.cpp's adoption, became the de facto standard for distributing quantized language models. By 2024, the AI community had largely abandoned older formats in favor of GGUF, and Hugging Face integrated GGUF support directly into its platform.
Edge AI expansion. llama.cpp demonstrated that capable language models could run on edge devices, catalyzing interest in on-device AI for smartphones, embedded systems, IoT devices, and automotive applications.
The project's success also influenced how model developers release their work. Many organizations now publish GGUF-quantized versions of their models alongside the standard formats, recognizing that a significant portion of their user base runs models locally through llama.cpp or tools built on top of it.
llama.cpp is one of the most actively developed open-source projects in the AI space. The project follows a rapid release cadence, with new tagged releases appearing multiple times per month. Development is coordinated through GitHub issues and pull requests, with Gerganov maintaining the role of lead maintainer and primary architect.
The community has contributed support for dozens of model architectures, multiple hardware backends, and a steady stream of performance optimizations. Notable community contributions include the K-quant methods (by ikawrakow), the Vulkan backend, the OpenCL backend (by Qualcomm), and the RPC-based distributed inference system.
The project's MIT license has encouraged widespread adoption and forking. Ollama, which incorporates llama.cpp as a core component, is itself one of the most popular AI projects on GitHub.
llama.cpp occupies a distinct niche in the landscape of LLM inference engines: where server-focused projects such as vLLM and NVIDIA's TensorRT-LLM optimize for high-throughput batched serving on datacenter GPUs, and Apple's MLX targets Apple Silicon exclusively, llama.cpp prioritizes single-user and small-scale inference on whatever hardware is at hand.
llama.cpp's key advantages are its broad hardware support, minimal dependencies, aggressive quantization options, and the simplicity of the single-file GGUF format. These qualities make it the preferred choice for individual developers, hobbyists, and organizations that need to run models across diverse hardware environments.