Candle (HuggingFace Rust ML)

Updated Jul 23, 2026

Candle is a minimalist machine learning framework written in pure Rust and published by Hugging Face under the huggingface/candle GitHub repository.^[1] The project is led primarily by Laurent Mazare (GitHub @LaurentMazare), who is also the author of the older tch-rs PyTorch bindings and co-founder of the speech AI lab Kyutai.^[2]^[3] Candle's stated goal is to "make serverless inference possible" by producing small, fast-booting binaries that load models from the Hugging Face Hub directly, removing the Python runtime (and its global interpreter lock) from production deployments.^[1] It supports CPU, NVIDIA CUDA, and Apple Metal back-ends, as well as compilation to WebAssembly, and loads weights from safetensors, GGUF, NPZ, and PyTorch .pth files.^[1]^[4]

Project information

Item	Value
Repository	`github.com/huggingface/candle`^[1]
Initial public commits	July 2023 (wiki edits); first crate uploads August 2023^[5]
Primary author / lead	Laurent Mazare (`@LaurentMazare`)^[2]
Publisher	Hugging Face^[1]
License	Dual MIT / Apache-2.0^[4]^[6]
Core crates	`candle-core`, `candle-nn`, `candle-transformers`, `candle-kernels`, `candle-flash-attn`, `candle-onnx`, `candle-datasets`, `candle-pyo3`, `candle-examples`^[1]
Supported backends	CPU (with MKL on x86 and Accelerate on macOS), CUDA (multi-GPU via NCCL), Apple Metal, WebAssembly^[1]
Supported model formats	Safetensors, GGUF, GGML, PyTorch `.pth`, NPZ^[1]

Background and motivation

Hugging Face had, by 2023, established itself as the dominant hub for open model weights and the transformers/diffusers Python libraries, but several deployment problems remained intractable when serving models in Python: very large container images (PyTorch wheels routinely exceed several hundred megabytes), slow cold starts on serverless platforms, and contention from the CPython global interpreter lock when many concurrent requests share a process.^[1]^[7] Candle was conceived as a Rust-native alternative tailored specifically to inference workloads where these costs dominate. The official README summarises the motivation succinctly: Candle's "core goal" is "to make serverless inference possible" by producing "lightweight binaries", and to let users "remove Python from production workloads", noting that "Python overhead can seriously hurt performance, and the GIL is a notorious source of headaches".^[1]

The framework was developed under the huggingface GitHub organisation, with Laurent Mazare as the principal designer and most prolific contributor.^[2] Mazare had spent several years building tch-rs, a Rust binding for PyTorch's C++ API (LibTorch), and the OCaml ocaml-torch bindings before that, so Candle in effect represents a second, ground-up attempt to bring PyTorch-style tensor programming into a statically compiled language without dragging in the LibTorch shared object.^[2]^[8] Hugging Face engineer Nicolas Patry (GitHub Narsil), the original author of the Rust safetensors and tokenizers crates, is also a regular contributor and reviewer.^[9]

Public archival snapshots show the repository active in early August 2023, with Internet Archive captures dated 2023-08-09, 2023-08-11, and 2023-08-12, and the candle-kernels crate has been continuously published to crates.io since 2 August 2023.^[5]^[7] The first tagged minor release recorded in the project changelog is v0.2.0 on 30 August 2023, which already shipped Stable Diffusion XL, additional quantized GGUF data types (Q2K, Q4K, Q5K), and an API for writing GGUF files.^[10] Version 0.2.1 on 11 September 2023 added Segment-Anything, RNNs (GRU and LSTM), TinyViT and GGUF v2 support; 0.2.2 on 18 September 2023 added a T5 decoding loop and top-p sampling; and 0.3.0 on 1 October 2023 added Mistral 7B v0.1, the Phi-v1.5 mixformer variant, the Wuerstchen diffusion stack, and SIMD128 intrinsics for quantized operations.^[10] The Apple Metal back-end was the subject of a tracking issue opened on 3 August 2023 and was added incrementally over the following months.^[11]

How Candle is structured

Candle is delivered as a Cargo workspace of small crates that compose into a deployable binary. The README's "Structure" section enumerates the principal pieces.^[1]

Crate	Role
`candle-core`	Core `Tensor`, `Device`, `DType`, layout, and back-end dispatch; defines the autograd engine.^[1]^[12]
`candle-nn`	Higher-level neural network primitives: layers, optimisers, the `VarBuilder` weight loader.^[1]^[13]
`candle-transformers`	Reference implementations of named model architectures (LLaMA, Mistral, Mixtral, Phi, Gemma, Whisper, Stable Diffusion, etc.).^[1]
`candle-kernels`	CUDA C++ kernel source compiled at crate build time; provides the device-side ops invoked through cudarc.^[14]
`candle-flash-attn`	Rust wrapper around Flash Attention v2 CUDA kernels.^[1]
`candle-onnx`	Loader and evaluator for ONNX graphs.^[1]
`candle-datasets`	Dataset helpers (MNIST, CIFAR, etc.).^[1]
`candle-pyo3`	Python bindings exposing a subset of the Tensor API through PyO3.^[1]
`candle-examples`	Self-contained example binaries for each supported model.^[1]

Tensor objects are constructed against a Device (Cpu, Cuda, or Metal; the GPU variants are normally obtained from a device ordinal, for example via Device::new_cuda(0)) and a DType (covering f16/bf16/f32/f64 and u8/u32/i64). The GGML-style quantized block types are handled separately by the crate's quantized module rather than as DType variants.^[12] The canonical "hello world" from the crate's documentation page on docs.rs is a small matmul that runs on the CPU device:^[12]

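use candle_core::{Device, Tensor};

fn main() -> candle_core::Result<()> {
    // Build a 2x3 and a 3x4 matrix on the CPU from the ranges 0..6 and 0..12.
    let a = Tensor::arange(0f32, 6f32, &Device::Cpu)?.reshape((2, 3))?;
    let b = Tensor::arange(0f32, 12f32, &Device::Cpu)?.reshape((3, 4))?;
    let c = a.matmul(&b)?; // result has shape (2, 4)
    println!("{c}");
    Ok(())
}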
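
To make the result of that product concrete, the following sketch reproduces the same 2x3 by 3x4 multiplication in plain Rust over flat buffers with explicit strides. It uses only the standard library. The `Strided` struct and the `matmul` function are hypothetical illustrations of a strided layout, not Candle's types or its actual kernels.

```rust
/// A hypothetical 2-D view over a flat buffer: element (i, j) lives at
/// `i * strides.0 + j * strides.1`. This is not a Candle type.
struct Strided<'a> {
    data: &'a [f32],
    shape: (usize, usize),
    strides: (usize, usize),
}

impl<'a> Strided<'a> {
    /// Row-major (contiguous) view with the given shape.
    fn contiguous(data: &'a [f32], shape: (usize, usize)) -> Self {
        assert_eq!(data.len(), shape.0 * shape.1);
        Strided { data, shape, strides: (shape.1, 1) }
    }

    fn get(&self, i: usize, j: usize) -> f32 {
        self.data[i * self.strides.0 + j * self.strides.1]
    }

    /// Transpose by swapping shape and strides; no data is copied.
    fn t(&self) -> Strided<'a> {
        Strided {
            data: self.data,
            shape: (self.shape.1, self.shape.0),
            strides: (self.strides.1, self.strides.0),
        }
    }
}

/// Naive matrix product returning a row-major Vec of shape (m, n).
fn matmul(a: &Strided, b: &Strided) -> Vec<f32> {
    let (m, k) = a.shape;
    let (k2, n) = b.shape;
    assert_eq!(k, k2, "inner dimensions must match");
    let mut out = vec![0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            out[i * n + j] = (0..k).map(|p| a.get(i, p) * b.get(p, j)).sum::<f32>();
        }
    }
    out
}

fn main() {
    let a_data: Vec<f32> = (0..6).map(|x| x as f32).collect();
    let b_data: Vec<f32> = (0..12).map(|x| x as f32).collect();
    let a = Strided::contiguous(&a_data, (2, 3));
    let b = Strided::contiguous(&b_data, (3, 4));
    // Prints [20.0, 23.0, 26.0, 29.0, 56.0, 68.0, 80.0, 92.0]
    println!("{:?}", matmul(&a, &b));
    // The transposed view reads the same buffer through swapped strides.
    println!("{}", a.t().get(2, 1)); // same element as a.get(1, 2), i.e. 5
}
```

The transpose shows why strided layouts are convenient: changing the view costs nothing, and only the index arithmetic changes.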

Models are typically wired up in four steps: parsing a Hugging Face config.json, downloading the weights via the hf-hub crate (which manages the same ~/.cache/huggingface layout as the Python huggingface_hub), memory-mapping the safetensors shards through a VarBuilder, and materialising tensors into Rust structs that mirror the layout used by the Python transformers library.^[13] Hugging Face's own integration page on the transformers documentation site describes this loading flow explicitly and gives a three-line snippet showing VarBuilder::from_mmaped_safetensors followed by Model::new(...).^[13]
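
The shared cache layout mentioned above can be illustrated with a small path-building sketch. It assumes the directory scheme used by the Python huggingface_hub client (a `hub` folder, a `models--org--name` directory per repository, and one `snapshots/<revision>` folder per commit); that scheme is an assumption here, not something stated in the Candle documentation. The function `cached_file_path` is hypothetical and is not part of the hf-hub crate.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: where a downloaded file is expected to sit in the
/// shared cache, assuming the layout
/// <cache_root>/hub/models--<org>--<name>/snapshots/<revision>/<filename>.
fn cached_file_path(
    cache_root: &Path,
    repo_id: &str,
    revision: &str,
    filename: &str,
) -> PathBuf {
    // A repository id such as "org/name" becomes the folder "models--org--name".
    let folder = format!("models--{}", repo_id.replace('/', "--"));
    cache_root
        .join("hub")
        .join(folder)
        .join("snapshots")
        .join(revision)
        .join(filename)
}

fn main() {
    // All values below are placeholders chosen for illustration.
    let root = Path::new("/home/user/.cache/huggingface");
    let path = cached_file_path(root, "org/name", "abc123", "model.safetensors");
    println!("{}", path.display());
}
```

In practice the hf-hub crate resolves this location itself; the sketch only shows why a file fetched by the Python client can be reused by a Rust process without a second download.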

Back-ends

CPU

Candle's CPU back-end is the default and is implemented in safe Rust over ndarray-style strided buffers, with optional integration with Intel MKL on x86 Linux/Windows and Apple Accelerate on macOS for BLAS-accelerated matmul.^[1] Quantized kernels reuse the integer formats popularised by llama.cpp (Q4_0, Q5_0, Q8_0, Q2K, Q4K, Q5K, etc.), with SIMD128 intrinsics (the WebAssembly SIMD instruction set) added for quantized operations in v0.3.0.^[10]
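
The idea behind a block-quantized format such as Q8_0 can be sketched in a few lines of standard-library Rust. The sketch assumes the usual Q8_0 scheme of 32 values per block sharing one scale, with each value stored as a signed byte. It simplifies the real format by keeping the scale as f32 instead of half precision, and every name in it (`BlockQ8`, `quantize_block`, `dequantize_block`, `dot_block`) is hypothetical rather than taken from Candle.

```rust
/// Values per block in this sketch (Q8_0 uses blocks of 32).
const BLOCK: usize = 32;

/// One quantized block: a shared scale plus one signed byte per value.
/// Simplification: the scale is f32 here, not half precision.
struct BlockQ8 {
    scale: f32,
    quants: [i8; BLOCK],
}

fn quantize_block(values: &[f32; BLOCK]) -> BlockQ8 {
    // The largest magnitude in the block maps to +/-127.
    let amax = values.iter().fold(0f32, |m, v| m.max(v.abs()));
    let scale = amax / 127.0;
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut quants = [0i8; BLOCK];
    for (q, v) in quants.iter_mut().zip(values.iter()) {
        *q = (v * inv).round() as i8;
    }
    BlockQ8 { scale, quants }
}

fn dequantize_block(block: &BlockQ8) -> [f32; BLOCK] {
    let mut out = [0f32; BLOCK];
    for (o, q) in out.iter_mut().zip(block.quants.iter()) {
        *o = f32::from(*q) * block.scale;
    }
    out
}

/// Dot product of two blocks: accumulate in integers, rescale once at the end.
fn dot_block(a: &BlockQ8, b: &BlockQ8) -> f32 {
    let acc: i32 = a
        .quants
        .iter()
        .zip(b.quants.iter())
        .map(|(x, y)| i32::from(*x) * i32::from(*y))
        .sum();
    acc as f32 * a.scale * b.scale
}

fn main() {
    let mut values = [0f32; BLOCK];
    for (i, v) in values.iter_mut().enumerate() {
        *v = (i as f32 - 16.0) / 4.0; // ramp from -4.0 to 3.75
    }
    let block = quantize_block(&values);
    let restored = dequantize_block(&block);
    let max_err = values
        .iter()
        .zip(restored.iter())
        .map(|(a, b)| (a - b).abs())
        .fold(0f32, f32::max);
    println!("scale = {}, max round-trip error = {}", block.scale, max_err);
    println!("dot(block, block) = {}", dot_block(&block, &block));
}
```

The round-trip error per value is bounded by about half the block scale, which is the trade-off these formats make in exchange for storing roughly one byte per weight.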

CUDA

GPU support targets NVIDIA hardware via the third-party cudarc Rust crate, which provides safe wrappers over the CUDA driver API, NVRTC, cuBLAS, cuDNN, cuFFT, and NCCL.^[14]^[15] Device kernels that are not provided by cuBLAS or cuDNN (custom element-wise ops, reductions, the quantized matmul paths) live in candle-kernels, whose build.rs script compiles CUDA C++ sources at compile time and embeds them into the resulting Rust binary.^[14] Multi-GPU inference uses the NCCL bindings in cudarc for collective operations, mirroring how the PyTorch and JAX ecosystems perform tensor and pipeline parallel sharding.^[1]^[15]

Apple Metal

Candle ships a Metal back-end for Apple Silicon GPUs. A tracking issue (#313) was opened on 3 August 2023 requesting MPS-style support, and Metal kernels were progressively added over the following months, accompanied by their own set of .metal source files compiled at build time.^[11] By Candle 0.3.x the back-end was sufficiently mature to power third-party text-to-speech and LLM projects on M-series Macs, though some long-tail issues such as missing BF16 matmul paths for the Falcon architecture persisted into 2024.^[16]

WebAssembly

Because the CPU back-end is pure Rust, the workspace can be cross-compiled to the wasm32-unknown-unknown target. The repository's candle-wasm-examples directory contains in-browser ports of Whisper, LLaMA2-c, YOLO, Segment Anything, and the T5 language model that run entirely client-side without a backend server.^[1]^[17] These are deployed as live demos on Hugging Face Spaces (for example radames/candle-segment-anything-wasm and Laurent Mazare's lmz/candle-whisper, lmz/candle-llama2, and lmz/candle-yolo Spaces).^[17]^[18]

Model loaders and file formats

Candle is unusual among Rust ML frameworks in supporting four production weight formats out of the box:^[1]^[4]

Safetensors (.safetensors): the standard Hugging Face format, loaded zero-copy through memory mapping by VarBuilder::from_mmaped_safetensors.^[13]
GGUF / GGML: the on-disk formats used by llama.cpp, including their quantized variants. An API for writing GGUF files shipped in v0.2.0, and GGUF v2 support followed in v0.2.1.^[10]
PyTorch .pth / .bin pickle files: parsed without invoking a Python interpreter, allowing direct interoperability with models published in the legacy PyTorch pickle format.^[1]
NPZ: NumPy's zipped array container, useful for small datasets and reference activations.^[1]

This loader breadth is what makes Candle a practical drop-in for serving existing Hugging Face checkpoints without any conversion step: the same weight files that transformers consumes in Python can be mmapped directly from disk into a Rust process.^[13]
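
A loader that accepts several formats first has to tell them apart. The sketch below is a hedged illustration of how that could be done from the first bytes of a file; it is not Candle's loader. It relies on three format facts: GGUF files begin with the ASCII magic `GGUF` followed by a little-endian 32-bit version, safetensors files begin with a little-endian 64-bit header length followed by a JSON header that starts with `{`, and NPZ archives are zip containers starting with `PK\x03\x04` (zip-based PyTorch checkpoints share that signature, so the two cannot be separated at this level). The `WeightFormat` enum and `sniff_format` function are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum WeightFormat {
    Gguf { version: u32 },
    Safetensors { header_len: u64 },
    /// NPZ and zip-based PyTorch checkpoints share this container.
    Zip,
    Unknown,
}

/// Guess the container format from the leading bytes of a file.
fn sniff_format(bytes: &[u8]) -> WeightFormat {
    if bytes.len() >= 8 && bytes.starts_with(b"GGUF") {
        let version = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
        return WeightFormat::Gguf { version };
    }
    if bytes.starts_with(b"PK\x03\x04") {
        return WeightFormat::Zip;
    }
    if bytes.len() >= 9 && bytes[8] == b'{' {
        let mut len = [0u8; 8];
        len.copy_from_slice(&bytes[..8]);
        return WeightFormat::Safetensors { header_len: u64::from_le_bytes(len) };
    }
    WeightFormat::Unknown
}

fn main() {
    // Build a tiny in-memory safetensors-style buffer: length, JSON header, data.
    let header = br#"{"w":{"dtype":"F32","shape":[1],"data_offsets":[0,4]}}"#;
    let mut file = (header.len() as u64).to_le_bytes().to_vec();
    file.extend_from_slice(header);
    file.extend_from_slice(&1.0f32.to_le_bytes());
    println!("{:?}", sniff_format(&file));

    // A GGUF prefix: magic followed by version 2.
    let mut gguf = b"GGUF".to_vec();
    gguf.extend_from_slice(&2u32.to_le_bytes());
    println!("{:?}", sniff_format(&gguf));
}
```

A real loader would go on to parse the header contents; the sketch stops at identification because that is the step the four formats have in common.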

Example binaries

The candle-examples crate doubles as a regression suite and as a demo gallery. Each example is a small Cargo binary that downloads its weights via hf-hub, builds the model, and runs inference. Notable examples shipped in tree include:

Domain	Example binaries (selected)
Large language models	LLaMA v1/v2/v3, Mistral, Mixtral, Phi-1.5 / Phi-2 / Phi-3, Gemma 1 / 2, Falcon, StarCoder, Qwen, Mamba, RWKV, StableLM, Yi^[1]
Speech	Whisper (multilingual), EnCodec, MetaVoice, Parler-TTS^[1]
Vision	YOLO v3/v8, Segment Anything, DINOv2, ViT, ResNet, EfficientNet, MobileNet v4, TrOCR, BLIP, CLIP^[1]
Text-to-image	Stable Diffusion 1.5 / 2.1, SDXL 1.0 and Turbo, Wuerstchen, FLUX.1^[1]^[19]
Other	T5, Marian MT, BERT, sentence-transformers^[1]

The FLUX.1 implementation was announced by Mazare on 2 August 2024: "The FLUX.1 image generation model is now available in candle. Pure #Rust implementation, no python involved."^[19] Separately, the lmz/candle-mistral, lmz/candle-whisper, and lmz/candle-yolo-v8 Hugging Face repositories pre-stage weights in the exact layout the example binaries expect.^[3]^[20]

Inference versus training

While candle-core does include a reverse-mode automatic differentiation engine and candle-nn provides optimisers, the project is in practice strongly inference-oriented. The official documentation lists "model training" as a supported capability and the example crate contains MNIST and CIFAR training scripts, but the marquee features (quantization, GGUF, WebAssembly, the breadth of candle-transformers model definitions, multi-GPU NCCL sharding) all serve deployment rather than research training.^[1]^[4]^[7] Hugging Face's own framing on the integration page describes Candle as a way to provide "native Rust implementations of Transformers models" for serving, not as a competitor to PyTorch for large-scale pre-training.^[13] The community comparison literature consistently places Candle alongside llama.cpp and ONNX Runtime in the "inference engine" bucket rather than alongside JAX or PyTorch in the "training framework" bucket.^[21]^[22]
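
For readers unfamiliar with the term, reverse-mode automatic differentiation can be sketched with a scalar gradient tape. This is a minimal standard-library illustration of the general technique, not Candle's engine, which operates on tensors; the `Tape` and `Node` types and their methods are hypothetical.

```rust
/// One recorded operation: its parents and the local partial derivatives
/// of this node with respect to each parent.
struct Node {
    parents: Vec<(usize, f64)>,
}

/// A hypothetical gradient tape; nodes are indexed in creation order,
/// so every parent has a smaller index than its children.
struct Tape {
    values: Vec<f64>,
    nodes: Vec<Node>,
}

impl Tape {
    fn new() -> Self {
        Tape { values: Vec::new(), nodes: Vec::new() }
    }

    fn push(&mut self, value: f64, parents: Vec<(usize, f64)>) -> usize {
        self.values.push(value);
        self.nodes.push(Node { parents });
        self.values.len() - 1
    }

    /// Leaf variable with no parents.
    fn var(&mut self, value: f64) -> usize {
        self.push(value, Vec::new())
    }

    fn add(&mut self, a: usize, b: usize) -> usize {
        let v = self.values[a] + self.values[b];
        self.push(v, vec![(a, 1.0), (b, 1.0)])
    }

    fn mul(&mut self, a: usize, b: usize) -> usize {
        let (va, vb) = (self.values[a], self.values[b]);
        self.push(va * vb, vec![(a, vb), (b, va)])
    }

    /// Walk the tape backwards from `output`, accumulating d(output)/d(node)
    /// into each parent via the chain rule.
    fn backward(&self, output: usize) -> Vec<f64> {
        let mut grads = vec![0.0; self.values.len()];
        grads[output] = 1.0;
        for id in (0..=output).rev() {
            let g = grads[id];
            for &(parent, local) in &self.nodes[id].parents {
                grads[parent] += g * local;
            }
        }
        grads
    }
}

fn main() {
    let mut tape = Tape::new();
    let w = tape.var(2.0);
    let x = tape.var(3.0);
    let b = tape.var(1.0);
    let wx = tape.mul(w, x);
    let y = tape.add(wx, b); // y = w * x + b
    let grads = tape.backward(y);
    println!("y = {}", tape.values[y]); // 7
    println!("dy/dw = {}, dy/dx = {}, dy/db = {}", grads[w], grads[x], grads[b]); // 3, 2, 1
}
```

A single backward pass yields the gradient with respect to every input, which is why this mode suits models with many parameters and one scalar loss.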

Adoption and downstream projects

Hugging Face Spaces

The most visible Candle deployments are the WebAssembly demos hosted on Hugging Face Spaces, which run real models entirely in the user's browser. These include Whisper-based transcription, LLaMA2-c text generation, YOLO object detection, Segment Anything masking, and T5 text-to-text. Among them, lmz/candle-yolo is a featured Space with 74 likes and lmz/candle-whisper has 65 likes, according to the profile listing at the time of access.^[3]^[18] They double as load-tests for the WASM build and as the most accessible "first contact" with the framework for users without a GPU.^[17]

mistral.rs

The mistral.rs project by Eric L. Buehler (EricLBuehler/mistral.rs) is a Rust-native LLM inference engine built directly on top of Candle. The README's credits section states: "This project would not be possible without the excellent work at Candle."^[23] As of 2026 the project supports a long list of model families (Qwen, LLaMA, Mistral, Phi, Gemma, DeepSeek variants), vision/audio multimodality, and a Python SDK in addition to its Rust crate, and it explicitly migrated to the official crates.io release of Candle 0.9.2 to stabilise its back-end dependency.^[23] mistral.rs is independent of Hugging Face but represents the most prominent third-party project that depends on Candle.

fish-speech.rs

EndlessReform/fish-speech.rs is a pure-Rust reimplementation of the Fish Speech 1.5 text-to-speech model, also built on Candle.^[24] It compiles to a roughly 15 MB static binary, supports CUDA, Apple Metal, and CPU back-ends, and ships custom grouped-query attention CUDA kernels in a candle-gqa-kernels subdirectory that are intended to be useful to other Candle users.^[24] The project is a representative example of how Candle is used outside Hugging Face: a small team takes a Python reference implementation and republishes it as a single-binary Rust server.

Kyutai and Moshi

Laurent Mazare's other affiliation, the Paris-based speech AI lab Kyutai, has released the Moshi full-duplex speech model with Candle-compatible weight variants under names like kyutai/moshika-candle-bf16 and kyutai/moshiko-candle-q8, allowing the open-source Moshi models to be run from Rust with no Python in the loop.^[25]^[26]

Other downstream projects

A growing number of small Rust projects use Candle as their tensor library, including qwen3-asr-rs (a pure-Rust ASR inference engine for Qwen3-ASR using Metal and CUDA acceleration through Candle)^[27] and various sentence-transformer servers used as embedding back-ends.^[21]

Comparison with other Rust ML libraries

Library	Architecture	Primary focus	GPU back-ends	Training?	Dependency footprint
Candle (`huggingface/candle`)	Pure Rust + cudarc + Metal kernels^[1]^[14]	Inference, serverless, edge^[1]	NVIDIA CUDA, Apple Metal (the WASM target runs on the CPU back-end)^[1]	Yes, but secondary^[1]	Megabyte-class binary, no LibTorch^[7]
`tch-rs` (`LaurentMazare/tch-rs`)	Thin FFI bindings to PyTorch C++ (LibTorch)^[8]	Bringing the full PyTorch API into Rust^[8]	Whatever LibTorch supports (CUDA, ROCm via PyTorch)^[8]	Yes, full PyTorch parity^[8]	Requires a multi-hundred-MB LibTorch shared object at runtime^[8]
`burn` (`tracel-ai/burn`)	Generic over a `Backend` trait^[28]	Full training + inference stack^[28]	Multi-backend: WGPU (Vulkan/Metal/DX12/WebGPU), CUDA, ROCm, LibTorch, NDArray^[28]	Yes, training is a first-class goal^[28]	Pure Rust by default; LibTorch only if the Tch backend is selected^[28]
`ort` (ONNX Runtime Rust bindings)	FFI over Microsoft's ONNX Runtime C/C++ library^[22]	Cross-framework inference of ONNX graphs^[22]	Whatever ONNX Runtime ships (CUDA, TensorRT, DirectML, CoreML, OpenVINO, CANN, QNN)^[22]	No (inference only)^[22]	Requires the bundled `onnxruntime` shared library^[22]

tch-rs is the most direct point of comparison because Laurent Mazare is the author of both: where tch-rs exposes the LibTorch API verbatim and inherits both its breadth and its multi-hundred-megabyte runtime footprint, Candle reimplements the necessary subset in pure Rust and trades full PyTorch parity for a small static binary.^[2]^[8] burn, maintained by tracel-ai, takes a third path: it is also pure Rust by default but is generic over its compute back-end, supports cross-platform GPU execution through WGPU (Vulkan, Metal, DirectX, WebGPU), and explicitly markets itself for training as well as inference.^[28] On a YOLOv8 benchmark reported in the Candle issue tracker, ONNX Runtime ran end-to-end inference roughly eight times faster than Candle for the same model (~7 ms vs ~55 ms with CUDA + cuDNN), which has been cited as evidence that Candle prioritises portability and small binaries over peak throughput.^[22]

Significance

Candle's primary contribution is showing that a credible, ergonomic, statically compiled ML inference stack is achievable without binding to LibTorch or ONNX Runtime. By keeping the tensor library, model definitions, kernels, and weight loaders all in Rust, the framework produces single-binary deployments measured in tens of megabytes (typical example sizes: 22 MB for Whisper-tiny, 48 MB for a 4-bit LLaMA-2 7B, 38 MB for Phi-2) that cold-start in milliseconds.^[7] This profile fits cleanly into the "Rust for Lambda / Cloud Run" pattern that has emerged independently in the serverless community: Datadog, for instance, reported cutting cold starts by 82 % and reducing extension binary size from 55 MB to 7 MB by rewriting their AWS Lambda extension in Rust, the same shape of result that Candle aims to make available for model inference.^[29] For browser and embedded deployments, the WASM target eliminates a server hop entirely.^[17]

The framework also matters as part of Hugging Face's broader strategy of moving performance-critical infrastructure out of Python. safetensors, tokenizers, Text Generation Inference, and hf-hub are all Rust-based projects maintained by Hugging Face; Candle slots between them as the layer that actually executes models.^[1]^[9]^[13]

Limitations

Several recurring criticisms appear across community comparisons and the project's own issue tracker:

Throughput is not yet competitive with vendor-tuned engines. Reported benchmarks (notably the YOLOv8 case above) show ONNX Runtime delivering substantially lower end-to-end latency for the same model, because ORT can dispatch to TensorRT, oneDNN, and other vendor execution providers that Candle does not.^[22]
Training is a secondary concern. While autodiff and optimisers exist, Candle does not provide a training dashboard, distributed data-parallel training utilities, or a research-grade Lightning-style training loop. Users wanting a Rust framework for training are typically directed to burn.^[28]
Limited Apple Metal coverage in some data types. The Metal back-end has known gaps such as missing BF16 matmul support, leading to runtime errors on certain model/architecture combinations.^[16]
Model coverage is curated, not exhaustive. Adding a new architecture requires writing a new module under candle-transformers, in contrast to ONNX or PyTorch-loading frameworks that can in principle execute any graph.^[1]
GitHub releases are not published. The project's GitHub Releases tab is empty; users must consult CHANGELOG.md or the crates.io version history to see what shipped when.^[10]

Related work

Text Generation Inference: Hugging Face's Rust-based serving system, which uses Python model code rather than Candle.^[9]
llama.cpp: the C++ inference engine whose GGUF format and quantization schemes Candle reuses.^[10]
ONNX and the ort Rust crate: a cross-framework alternative for inference.^[22]
tinygrad: another small-footprint inference framework, in Python rather than Rust.
Ollama and vLLM: higher-level serving wrappers that primarily target Python/Go workflows.

References

[1] Hugging Face, "candle: Minimalist ML framework for Rust", GitHub README, 2024-2026. https://github.com/huggingface/candle. Accessed 2026-05-21.
[2] Laurent Mazare, "GitHub profile (LaurentMazare): co-founder of Kyutai and Gradium, Paris; pinned project tch-rs (5.4k stars)", GitHub, 2026. https://github.com/LaurentMazare. Accessed 2026-05-21.
[3] Hugging Face, "lmz (Laurent Mazare) profile page: 29 models including lmz/candle-flux, lmz/candle-gemma, lmz/candle-rwkv, lmz/candle-metavoice; Spaces include Candle Whisper, Candle Llama2, Candle YOLO", huggingface.co, 2024-2026. https://huggingface.co/lmz. Accessed 2026-05-21.
[4] Hugging Face, "candle-core 0.10.2 crate documentation", docs.rs, 2026. https://docs.rs/candle-core. Accessed 2026-05-21.
[5] Internet Archive, "github.com-huggingface-candle snapshots (2023-08-09, 2023-08-11, 2023-08-12)", archive.org, 2023-08. https://archive.org/details/github.com-huggingface-candle_-_2023-08-09_16-14-17. Accessed 2026-05-21.
[6] Rust Foundation / Hugging Face, "candle-core crate page (license: MIT OR Apache-2.0)", crates.io, 2026. https://crates.io/crates/candle-core. Accessed 2026-05-21.
[7] BrightCoding, "Candle: A Minimalist Rust ML Framework with Fast Demos like Whisper and LLaMA2 (binary sizes: 22 MB Whisper tiny, 48 MB LLaMA2-7B q4k, 38 MB Phi-2)", brightcoding.dev, 2025-09-29. https://www.blog.brightcoding.dev/2025/09/29/candle-a-minimalist-rust-ml-framework-with-fast-demos-like-whisper-and-llama2/. Accessed 2026-05-21.
[8] Laurent Mazare, "tch-rs: Rust bindings for the C++ api of PyTorch (libtorch)", GitHub, 2026. https://github.com/LaurentMazare/tch-rs. Accessed 2026-05-21.
[9] Nicolas Patry (Narsil), "GitHub profile, affiliated with Hugging Face; contributor to huggingface/candle (e.g. PR #2841 enabling different linking options for MKL)", github.com, 2026. https://github.com/Narsil. Accessed 2026-05-21.
[10] Hugging Face, "candle CHANGELOG.md: v0.2.0 (2023-08-30) Stable Diffusion XL, GGUF writing, Q2K/Q4K/Q5K; v0.2.1 (2023-09-11) SAM, RNNs, GGUF v2; v0.2.2 (2023-09-18) T5 decoding; v0.3.0 (2023-10-01) Mistral 7B v0.1, Phi-v1.5 mixformer, Wuerstchen, SIMD128", GitHub, 2023-2024. https://github.com/huggingface/candle/blob/main/CHANGELOG.md. Accessed 2026-05-21.
[11] Hugging Face contributors, "Apple silicon (MPS backends) support? Issue #313", github.com, opened 2023-08-03. https://github.com/huggingface/candle/issues/313. Accessed 2026-05-21.
[12] Hugging Face, "candle_core - Rust crate documentation (Tensor, Device, DType; matmul example)", docs.rs, 2026. https://docs.rs/candle-core/latest/candle_core/. Accessed 2026-05-21.
[13] Hugging Face, "Candle - community_integrations docs (VarBuilder::from_mmaped_safetensors, hf-hub integration, config.json parsing)", huggingface.co, 2024-2026. https://huggingface.co/docs/transformers/community_integrations/candle. Accessed 2026-05-21.
[14] Hugging Face, "candle-kernels crate (CUDA kernels for Candle, build script compiles kernel sources, configurable via CUDA_COMPUTE_CAP)", lib.rs / crates.io, 2023-2026. https://lib.rs/crates/candle-kernels. Accessed 2026-05-21.
[15] Corey Lowman et al., "cudarc: minimal and safe API over the CUDA toolkit (driver API, NVRTC, cuBLAS, cuDNN, NCCL, cuFFT)", GitHub, 2026. https://github.com/coreylowman/cudarc. Accessed 2026-05-21.
[16] Hugging Face contributors, "Apple metal embedding generation panics w/ IOGPUMetalCommandBuffer failed assertion (BF16 matmul not supported)", Discussion #2818, github.com, 2024-2025. https://github.com/huggingface/candle/discussions/2818. Accessed 2026-05-21.
[17] Hugging Face Spaces, "Candle Segment Anything Model (SAM) Rust/WASM live demo (radames/candle-segment-anything-wasm)", huggingface.co, 2024. https://huggingface.co/spaces/radames/candle-segment-anything-wasm/resolve/main/index.html. Accessed 2026-05-21.
[18] Hugging Face, "lmz Spaces: Candle Whisper (65 likes), Candle Llama2, Candle YOLO (Featured, 74 likes)", huggingface.co, 2024-2026. https://huggingface.co/lmz. Accessed 2026-05-21.
[19] Laurent Mazare (@lmazare), "Time for more rusty robots! The FLUX.1 image generation model is now available in candle. Pure #Rust implementation, no python involved.", X (formerly Twitter), 2024-08-02. https://x.com/lmazare/status/1820023217773334888. Accessed 2026-05-21.
[20] Hugging Face, "lmz/candle-mistral, lmz/candle-yolo-v8, lmz/candle-whisper model repositories", huggingface.co, 2023-2024. https://huggingface.co/lmz. Accessed 2026-05-21.
[21] Athan X, "Choosing the Right Rust Machine Learning Framework: Candle, Burn, DFDX, or tch-rs?", Medium, 2024. https://medium.com/@athan.seal/choosing-the-right-rust-machine-learning-framework-candle-burn-dfdx-or-tch-rs-17501f6cd765. Accessed 2026-05-21.
[22] Hugging Face contributors, "using candle to make inference of yolov8 model is quite slow than PyTorch or onnxruntime! Issue #942 (Candle ~55 ms vs ONNX Runtime ~7 ms with CUDA + cuDNN)", github.com, 2023-2024. https://github.com/huggingface/candle/issues/942. Accessed 2026-05-21.
[23] Eric L. Buehler, "mistral.rs: Fast, flexible LLM inference (README credits: 'This project would not be possible without the excellent work at Candle'; migrated to candle 0.9.2)", GitHub, 2024-2026. https://github.com/EricLBuehler/mistral.rs. Accessed 2026-05-21.
[24] EndlessReform, "fish-speech.rs: A Fish Speech implementation in Rust, with Candle.rs (15 MB static binary, CUDA / Apple Silicon / CPU back-ends, custom candle-gqa-kernels)", GitHub, 2024-2026. https://github.com/EndlessReform/fish-speech.rs. Accessed 2026-05-21.
[25] Kyutai, "kyutai/moshika-candle-bf16 model card", huggingface.co, 2024. https://huggingface.co/kyutai/moshika-candle-bf16. Accessed 2026-05-21.
[26] Kyutai, "kyutai/moshiko-candle-q8 model card", huggingface.co, 2024. https://huggingface.co/kyutai/moshiko-candle-q8. Accessed 2026-05-21.
[27] alan890104, "qwen3-asr-rs: Pure-Rust inference engine for Qwen3-ASR speech recognition models (0.6B and 1.7B) using candle with Metal/CUDA acceleration", GitHub, 2025-2026. https://github.com/alan890104/qwen3-asr-rs. Accessed 2026-05-21.
[28] tracel-ai, "burn: a next generation tensor library and Deep Learning Framework (CUDA, ROCm, Metal, Vulkan, WebGPU, LibTorch back-ends; training dashboard; v0.21.0 May 2026)", GitHub, 2026. https://github.com/tracel-ai/burn. Accessed 2026-05-21.
[29] Luciano Mammino, "Why you should consider Rust for your Lambdas (cold starts as low as 50-75 ms; Datadog Lambda extension rewrite cut cold starts by 82% and binary size from 55 MB to 7 MB)", loige.co, 2024. https://loige.co/why-you-should-consider-rust-for-your-lambdas/. Accessed 2026-05-21.
