Candle (HuggingFace Rust ML)
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,475 words
Candle is a minimalist machine learning framework written in pure Rust and published by Hugging Face under the huggingface/candle GitHub repository.[^1] The project is led primarily by Laurent Mazare (GitHub @LaurentMazare), who is also the author of the older tch-rs PyTorch bindings and co-founder of the speech AI lab Kyutai.[^2][^3] Candle's stated goal is to "make serverless inference possible" by producing small, fast-booting binaries that load models from the Hugging Face Hub directly, removing the Python runtime (and its global interpreter lock) from production deployments.[^1] It supports CPU, NVIDIA CUDA, and Apple Metal back-ends, as well as compilation to WebAssembly, and loads weights from safetensors, GGUF, NPZ, and PyTorch .pth files.[^1][^4]
| Item | Value |
|---|---|
| Repository | github.com/huggingface/candle[^1] |
| Initial public commits | July 2023 (wiki edits); first crate uploads August 2023[^5] |
| Primary author / lead | Laurent Mazare (@LaurentMazare)[^2] |
| Publisher | Hugging Face[^1] |
| License | Dual MIT / Apache-2.0[^4][^6] |
| Core crates | candle-core, candle-nn, candle-transformers, candle-kernels, candle-flash-attn, candle-onnx, candle-datasets, candle-pyo3, candle-examples[^1] |
| Supported backends | CPU (with MKL on x86 and Accelerate on macOS), CUDA (multi-GPU via NCCL), Apple Metal, WebAssembly[^1] |
| Supported model formats | Safetensors, GGUF, GGML, PyTorch .pth, NPZ[^1] |
Hugging Face had, by 2023, established itself as the dominant hub for open model weights and the transformers/diffusers Python libraries, but several deployment problems remained intractable when serving models in Python: very large container images (PyTorch wheels routinely exceed several hundred megabytes), slow cold starts on serverless platforms, and contention from the CPython global interpreter lock when many concurrent requests share a process.[^1][^7] Candle was conceived as a Rust-native alternative tailored specifically to inference workloads where these costs dominate. The official README summarises the motivation succinctly: Candle's "core goal" is "to make serverless inference possible" by producing "lightweight binaries", and to let users "remove Python from production workloads", noting that "Python overhead can seriously hurt performance, and the GIL is a notorious source of headaches".[^1]
The framework was developed under the huggingface GitHub organisation, with Laurent Mazare as the principal designer and most prolific contributor.[^2] Mazare had spent several years building tch-rs, a Rust binding for PyTorch's C++ API (LibTorch), and the OCaml ocaml-torch bindings before that, so Candle in effect represents a second, ground-up attempt to bring PyTorch-style tensor programming into a statically compiled language without dragging in the LibTorch shared object.[^2][^8] Hugging Face engineer Nicolas Patry (GitHub Narsil), the original author of the Rust safetensors and tokenizers crates, is also a regular contributor and reviewer.[^9]
Public archival snapshots show the repository active in early August 2023, with Internet Archive captures dated 2023-08-09, 2023-08-11, and 2023-08-12, and the candle-kernels crate has been continuously published to crates.io since 2 August 2023.[^5][^7] The first tagged minor release recorded in the project changelog is v0.2.0 on 30 August 2023, which already shipped Stable Diffusion XL, additional quantized GGUF data types (Q2K, Q4K, Q5K), and an API for writing GGUF files.[^10] Version 0.2.1 on 11 September 2023 added Segment-Anything, RNNs (GRU and LSTM), TinyViT and GGUF v2 support; 0.2.2 on 18 September 2023 added a T5 decoding loop and top-p sampling; and 0.3.0 on 1 October 2023 added Mistral 7B v0.1, the Phi-v1.5 mixformer variant, the Wuerstchen diffusion stack, and SIMD128 intrinsics for quantized operations.[^10] The Apple Metal back-end was the subject of a tracking issue opened on 3 August 2023 and was added incrementally over the following months.[^11]
Candle is delivered as a Cargo workspace of small crates that compose into a deployable binary. The README's "Structure" section enumerates the principal pieces.[^1]
| Crate | Role |
|---|---|
| candle-core | Core Tensor, Device, DType, layout, and back-end dispatch; defines the autograd engine.[^1][^12] |
| candle-nn | Higher-level neural network primitives: layers, optimisers, the VarBuilder weight loader.[^1][^13] |
| candle-transformers | Reference implementations of named model architectures (LLaMA, Mistral, Mixtral, Phi, Gemma, Whisper, Stable Diffusion, etc.).[^1] |
| candle-kernels | CUDA C++ kernel source compiled at crate build time; provides the device-side ops invoked through cudarc.[^14] |
| candle-flash-attn | Rust wrapper around Flash Attention v2 CUDA kernels.[^1] |
| candle-onnx | Loader and evaluator for ONNX graphs.[^1] |
| candle-datasets | Dataset helpers (MNIST, CIFAR, etc.).[^1] |
| candle-pyo3 | Python bindings exposing a subset of the Tensor API through PyO3.[^1] |
| candle-examples | Self-contained example binaries for each supported model.[^1] |
Tensor objects are constructed against a Device (Cpu, Cuda, or Metal, with GPU devices selected by ordinal) and a DType (covering f16/bf16/f32/f64 and u8/u32/i64; the GGML-style quantized formats are handled separately by the crate's quantized tensor types).[^12] The canonical "hello world" from the crate's documentation page on docs.rs is a small matmul that runs on the CPU device:[^12]
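```rust
use candle_core::{Device, Tensor};
fn main() -> candle_core::Result<()> {
    let a = Tensor::arange(0f32, 6f32, &Device::Cpu)?.reshape((2, 3))?;
    let b = Tensor::arange(0f32, 12f32, &Device::Cpu)?.reshape((3, 4))?;
    let c = a.matmul(&b)?; // (2, 3) x (3, 4) -> (2, 4)
    println!("{c}");
    Ok(())
}
```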
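To make the shapes concrete, the same product can be written out in plain Rust with no Candle dependency. This is only an illustrative sketch: `arange_matrix` and `matmul` are hypothetical helpers, not Candle APIs, and they assume row-major nested vectors rather than Candle's tensor storage.

```rust
/// Hypothetical helper: fills a rows x cols matrix with 0, 1, 2, ... in
/// row-major order, mimicking `arange(..)` followed by `reshape((rows, cols))`.
fn arange_matrix(rows: usize, cols: usize) -> Vec<Vec<f32>> {
    (0..rows)
        .map(|r| (0..cols).map(|c| (r * cols + c) as f32).collect())
        .collect()
}

/// Naive (m x k) * (k x n) matrix product; panics if the inner dimensions differ.
fn matmul(a: &[Vec<f32>], b: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let k = b.len();
    assert!(a.iter().all(|row| row.len() == k), "inner dimensions must match");
    let n = b.first().map_or(0, |row| row.len());
    a.iter()
        .map(|row| {
            (0..n)
                // Dot product of one row of `a` with column `j` of `b`.
                .map(|j| (0..k).map(|i| row[i] * b[i][j]).sum::<f32>())
                .collect()
        })
        .collect()
}
```

Calling `matmul(&arange_matrix(2, 3), &arange_matrix(3, 4))` yields the 2 x 4 matrix [[20, 23, 26, 29], [56, 68, 80, 92]].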
Models are typically wired up by parsing a Hugging Face config.json, downloading the weights via the hf-hub crate (which manages the same ~/.cache/huggingface layout as the Python huggingface_hub), memory-mapping the safetensor shards through a VarBuilder, and then materialising tensors into Rust structs that mirror the layout used by the Python transformers library.[^13] Hugging Face's own integration page on the transformers documentation site describes this loading flow explicitly and gives a three-line snippet showing VarBuilder::from_mmaped_safetensors followed by Model::new(...).[^13]
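The shared cache layout mentioned above can be sketched without any Hub client. The snippet below is a rough illustration, assuming the `models--<org>--<name>/snapshots/<revision>/<file>` folder convention used by the Python `huggingface_hub` cache. `cached_file_path` is a hypothetical helper, not part of hf-hub, and it neither downloads anything nor resolves branch names to commit hashes.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: builds the path where a cached model file is expected.
/// `cache_root` is assumed to be the hub cache directory (by default something
/// like `~/.cache/huggingface/hub`); `revision` is assumed to be a commit hash.
fn cached_file_path(cache_root: &Path, repo_id: &str, revision: &str, file: &str) -> PathBuf {
    // A repo id such as "org/name" maps to the folder "models--org--name".
    let repo_folder = format!("models--{}", repo_id.replace('/', "--"));
    cache_root
        .join(repo_folder)
        .join("snapshots")
        .join(revision)
        .join(file)
}
```

Because both ecosystems resolve files through the same folder structure, weights fetched once by a Python tool would be expected to be found by a Rust binary without a second download.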
Candle's CPU back-end is the default and is implemented in safe Rust over ndarray-style strided buffers, with optional integration with Intel MKL on x86 Linux/Windows and Apple Accelerate on macOS for BLAS-accelerated matmul.[^1] Quantized kernels reuse the block-quantized formats popularised by llama.cpp (Q4_0, Q5_0, Q8_0, Q2K, Q4K, Q5K, etc.), with SIMD128 intrinsics for WebAssembly targets added in v0.3.0.[^10]
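As a rough illustration of what a block-quantized format stores, the sketch below quantizes one block of 32 floats to 8-bit integers plus a single shared scale, in the spirit of Q8_0. It is a simplification under stated assumptions, not Candle's kernel: the scale is kept as `f32` here (llama.cpp-style formats store it as a 16-bit float), and `BlockQ8`, `quantize_block`, and `dequantize_block` are hypothetical names.

```rust
const BLOCK: usize = 32;

/// Hypothetical block layout: one shared scale plus 32 signed 8-bit values.
struct BlockQ8 {
    scale: f32,
    qs: [i8; BLOCK],
}

fn quantize_block(xs: &[f32; BLOCK]) -> BlockQ8 {
    // The largest magnitude in the block is mapped to +/-127.
    let amax = xs.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = amax / 127.0;
    // An all-zero block has scale 0; avoid dividing by it.
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut qs = [0i8; BLOCK];
    for (q, x) in qs.iter_mut().zip(xs.iter()) {
        *q = (x * inv).round() as i8;
    }
    BlockQ8 { scale, qs }
}

fn dequantize_block(b: &BlockQ8) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (o, q) in out.iter_mut().zip(b.qs.iter()) {
        *o = f32::from(*q) * b.scale;
    }
    out
}
```

Each reconstructed value differs from the original by at most about half the block scale, which is the trade-off these formats make in exchange for storing roughly one byte per weight instead of four.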
GPU support targets NVIDIA hardware via the third-party cudarc Rust crate, which provides safe wrappers over the CUDA driver API, NVRTC, cuBLAS, cuDNN, cuFFT, and NCCL.[^14][^15] Device kernels that are not provided by cuBLAS or cuDNN (custom element-wise ops, reductions, the quantized matmul paths) live in candle-kernels, whose build.rs script compiles CUDA C++ sources at compile time and embeds them into the resulting Rust binary.[^14] Multi-GPU inference uses the NCCL bindings in cudarc for collective operations, mirroring how the PyTorch and JAX ecosystems perform tensor and pipeline parallel sharding.[^1][^15]
Candle ships a Metal back-end for Apple Silicon GPUs. A tracking issue (#313) was opened on 3 August 2023 requesting MPS-style support, and Metal kernels were progressively added over the following months, accompanied by their own set of .metal source files compiled at build time.[^11] By Candle 0.3.x the back-end was sufficiently mature to power third-party text-to-speech and LLM projects on M-series Macs, though some long-tail issues such as missing BF16 matmul paths for the Falcon architecture persisted into 2024.[^16]
Because the CPU back-end is pure Rust, the workspace can be cross-compiled to the wasm32-unknown-unknown target. The repository's candle-wasm-examples directory contains in-browser ports of Whisper, LLaMA2-c, YOLO, Segment Anything, and T5 that run entirely client-side without a backend server.[^1][^17] These are deployed as live demos on Hugging Face Spaces (for example radames/candle-segment-anything-wasm and Laurent Mazare's lmz/candle-whisper, lmz/candle-llama2, and lmz/candle-yolo Spaces).[^18][^17]
Candle is unusual among Rust ML frameworks in supporting four production weight formats out of the box:[^1][^4]
- Safetensors (.safetensors): the standard Hugging Face format, loaded zero-copy through memory mapping by VarBuilder::from_mmaped_safetensors.[^13]
- GGUF and GGML: the formats used by llama.cpp, including its quantized variants. GGUF v2 reading and writing have been supported since v0.2.x.[^10]
- PyTorch .pth / .bin pickle files: parsed without invoking a Python interpreter, allowing direct interoperability with models published in the legacy PyTorch pickle format.[^1]
- NPZ: NumPy archive files.[^1]

This loader breadth is what makes Candle a practical drop-in for serving existing Hugging Face checkpoints without any conversion step: the same weight files that transformers consumes in Python can be mmapped directly from disk into a Rust process.[^13]
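Part of why safetensors files lend themselves to memory mapping is their simple layout: an 8-byte little-endian header length, a JSON header describing each tensor's dtype, shape, and byte offsets, and then the raw tensor bytes. The sketch below splits a buffer along those lines without copying. `split_header` is a hypothetical helper for illustration, not the API of the safetensors crate, and it does not parse or validate the JSON.

```rust
use std::convert::{TryFrom, TryInto};

/// Splits a safetensors-style buffer into its JSON header and raw tensor bytes.
/// Assumed layout: 8-byte little-endian header length, JSON header, then data.
/// Returns `None` if the buffer is too short or the header is not valid UTF-8.
fn split_header(buf: &[u8]) -> Option<(&str, &[u8])> {
    let len_bytes: [u8; 8] = buf.get(..8)?.try_into().ok()?;
    let header_len = usize::try_from(u64::from_le_bytes(len_bytes)).ok()?;
    let end = 8usize.checked_add(header_len)?;
    let header = std::str::from_utf8(buf.get(8..end)?).ok()?;
    // Both returned slices borrow from `buf`; no tensor data is copied.
    Some((header, &buf[end..]))
}
```

With a memory-mapped file as `buf`, a loader would only need to read the header eagerly; tensor bytes could then be paged in on demand at the offsets the header records.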
The candle-examples crate doubles as a regression suite and as a demo gallery. Each example is a small Cargo binary that downloads its weights via hf-hub, builds the model, and runs inference. Notable examples shipped in tree include:
| Domain | Example binaries (selected) |
|---|---|
| Large language models | LLaMA v1/v2/v3, Mistral, Mixtral, Phi-1.5 / Phi-2 / Phi-3, Gemma 1 / 2, Falcon, StarCoder, Qwen, Mamba, RWKV, StableLM, Yi[^1] |
| Speech | Whisper (multilingual), EnCodec, MetaVoice, Parler-TTS[^1] |
| Vision | YOLO v3/v8, Segment Anything, DINOv2, ViT, ResNet, EfficientNet, MobileNet v4, TrOCR, BLIP, CLIP[^1] |
| Text-to-image | Stable Diffusion 1.5 / 2.1, SDXL 1.0 and Turbo, Wuerstchen, FLUX.1[^1][^19] |
| Other | T5, Marian MT, BERT, sentence-transformers[^1] |
The FLUX.1 implementation was announced by Mazare on 2 August 2024: "The FLUX.1 image generation model is now available in candle. Pure #Rust implementation, no python involved."[^19] The lmz/candle-mistral, lmz/candle-whisper, and lmz/candle-yolo-v8 Hugging Face repositories, which predate that announcement, pre-stage weights in the exact layout the example binaries expect.[^3][^20]
While candle-core does include a reverse-mode automatic differentiation engine and candle-nn provides optimisers, the project is in practice strongly inference-oriented. The official documentation lists "model training" as a supported capability and the example crate contains MNIST and CIFAR training scripts, but the marquee features (quantization, GGUF, WebAssembly, the breadth of candle-transformers model definitions, multi-GPU NCCL sharding) all serve deployment rather than research training.[^1][^4][^7] Hugging Face's own framing on the integration page describes Candle as a way to provide "native Rust implementations of Transformers models" for serving, not as a competitor to PyTorch for large-scale pre-training.[^13] The community comparison literature consistently slots Candle alongside llama.cpp and ONNX Runtime in the "inference engine" bucket rather than alongside JAX or PyTorch in the "training framework" bucket.[^21][^22]
The most visible Candle deployments are the WebAssembly demos hosted on Hugging Face Spaces, which run real models entirely in the user's browser. These include Whisper-based transcription, LLaMA2-c text generation, YOLO object detection, Segment Anything masking, and T5 text-to-text. Several of these Spaces (notably lmz/candle-yolo and lmz/candle-whisper) are featured Spaces with many hundreds of likes.[^3][^18] They double as load-tests for the WASM build and as the most accessible "first contact" with the framework for users without a GPU.[^17]
The mistral.rs project by Eric L. Buehler (EricLBuehler/mistral.rs) is a Rust-native LLM inference engine built directly on top of Candle. The README's credits section states: "This project would not be possible without the excellent work at Candle."[^23] As of 2026 the project supports a long list of model families (Qwen, LLaMA, Mistral, Phi, Gemma, DeepSeek variants), vision/audio multimodality, and a Python SDK in addition to its Rust crate, and it explicitly migrated to the official crates.io release of Candle 0.9.2 to stabilise its back-end dependency.[^23] mistral.rs is independent of Hugging Face but represents the most prominent third-party project that depends on Candle.
EndlessReform/fish-speech.rs is a pure-Rust reimplementation of the Fish Speech 1.5 text-to-speech model, also built on Candle.[^24] It compiles to a roughly 15 MB static binary, supports CUDA, Apple Metal, and CPU back-ends, and ships custom grouped-query attention CUDA kernels in a candle-gqa-kernels subdirectory that are intended to be useful to other Candle users.[^24] The project is a representative example of how Candle is used outside Hugging Face: a small team takes a Python reference implementation and republishes it as a single-binary Rust server.
Laurent Mazare's other affiliation, the Paris-based speech AI lab Kyutai, has released the Moshi full-duplex speech model with Candle-compatible weight variants under names like kyutai/moshika-candle-bf16 and kyutai/moshiko-candle-q8, allowing the open-source Moshi models to be run from Rust with no Python in the loop.[^25][^26]
A growing number of small Rust projects use Candle as their tensor library, including qwen3-asr-rs (a pure-Rust ASR inference engine for Qwen3-ASR using Metal and CUDA acceleration through Candle)[^27] and various sentence-transformer servers used as embedding back-ends.[^21]
| Library | Architecture | Primary focus | GPU back-ends | Training? | Dependency footprint |
|---|---|---|---|---|---|
| Candle (huggingface/candle) | Pure Rust + cudarc + Metal kernels[^1][^14] | Inference, serverless, edge[^1] | NVIDIA CUDA, Apple Metal, WASM[^1] | Yes, but secondary[^1] | Megabyte-class binary, no LibTorch[^7] |
| tch-rs (LaurentMazare/tch-rs) | Thin FFI bindings to PyTorch C++ (LibTorch)[^8] | Bringing the full PyTorch API into Rust[^8] | Whatever LibTorch supports (CUDA, ROCm via PyTorch)[^8] | Yes, full PyTorch parity[^8] | Requires a multi-hundred-MB LibTorch shared object at runtime[^8] |
| burn (tracel-ai/burn) | Generic over a Backend trait[^28] | Full training + inference stack[^28] | Multi-backend: WGPU (Vulkan/Metal/DX12/WebGPU), CUDA, ROCm, LibTorch, NDArray[^28] | Yes, training is a first-class goal[^28] | Pure Rust by default; LibTorch only if the Tch backend is selected[^28] |
| ort (ONNX Runtime Rust bindings) | FFI over Microsoft's ONNX Runtime C/C++ library[^22] | Cross-framework inference of ONNX graphs[^22] | Whatever ONNX Runtime ships (CUDA, TensorRT, DirectML, CoreML, OpenVINO, CANN, QNN)[^22] | No (inference only)[^22] | Requires the bundled onnxruntime shared library[^22] |
tch-rs is the most direct point of comparison because Laurent Mazare is the author of both: where tch-rs exposes the LibTorch API verbatim and inherits both its breadth and its multi-hundred-megabyte runtime footprint, Candle reimplements the necessary subset in pure Rust and trades full PyTorch parity for a small static binary.[^2][^8] burn, maintained by tracel-ai, takes a third path: it is also pure Rust by default but is generic over its compute back-end, supports cross-platform GPU execution through WGPU (Vulkan, Metal, DirectX, WebGPU), and explicitly markets itself for training as well as inference.[^28] On a YOLOv8 benchmark reported in the Candle issue tracker, ONNX Runtime ran end-to-end inference roughly an order of magnitude faster than Candle for the same model (~7 ms vs ~55 ms with CUDA + cuDNN), which has been cited as evidence that Candle prioritises portability and small binaries over peak per-kernel throughput.[^22]
Candle's primary contribution is showing that a credible, ergonomic, statically compiled ML inference stack is achievable without binding to LibTorch or ONNX Runtime. By keeping the tensor library, model definitions, kernels, and weight loaders all in Rust, the framework produces single-binary deployments measured in tens of megabytes (typical example sizes: 22 MB for Whisper-tiny, 48 MB for a 4-bit LLaMA-2 7B, 38 MB for Phi-2) that cold-start in milliseconds.[^7] This profile fits cleanly into the "Rust for Lambda / Cloud Run" pattern that has emerged independently in the serverless community: Datadog, for instance, reported cutting cold starts by 82 % and reducing extension binary size from 55 MB to 7 MB by rewriting their AWS Lambda extension in Rust, the same shape of result that Candle aims to make available for model inference.[^29] For browser and embedded deployments, the WASM target eliminates a server hop entirely.[^17]
The framework also matters as part of Hugging Face's broader strategy of moving performance-critical infrastructure out of Python. The safetensors, tokenizers, and hf-hub crates and the Text Generation Inference server are all Hugging Face projects written wholly or partly in Rust; Candle slots between them as the layer that actually executes models.[^1][^9][^13]
Several recurring criticisms appear across community comparisons and the project's own issue tracker:
- [...] lightning-style training loop. Users wanting a Rust framework for training are typically directed to burn.[^28]
- [...] candle-transformers, in contrast to ONNX or PyTorch-loading frameworks that can in principle execute any graph.[^1]
- [...] CHANGELOG.md or the crates.io version history to see what shipped when.[^10]

- ort Rust crate: a cross-framework alternative for inference.[^22]