# Candle (HuggingFace Rust ML)

> Source: https://aiwiki.ai/wiki/candle_hf
> Updated: 2026-06-07
> Categories: Developer Tools, Machine Learning, Open Source AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

# Candle (Hugging Face Rust ML framework)

**Candle** is a minimalist machine learning framework written in pure [Rust](/wiki/rust) and published by [Hugging Face](https://github.com/huggingface) under the `huggingface/candle` GitHub repository.[^1] The project is led primarily by Laurent Mazare (GitHub `@LaurentMazare`), who is also the author of the older `tch-rs` PyTorch bindings and co-founder of the speech AI lab Kyutai.[^2][^3] Candle's stated goal is to "make serverless inference possible" by producing small, fast-booting binaries that load models from the Hugging Face Hub directly, removing the Python runtime (and its global interpreter lock) from production deployments.[^1] It supports CPU, NVIDIA CUDA, and Apple Metal back-ends, as well as compilation to WebAssembly, and loads weights from safetensors, GGUF, NPZ, and PyTorch `.pth` files.[^1][^4]

## Project information

| Item | Value |
|---|---|
| Repository | `github.com/huggingface/candle`[^1] |
| Initial public commits | July 2023 (wiki edits); first crate uploads August 2023[^5] |
| Primary author / lead | Laurent Mazare (`@LaurentMazare`)[^2] |
| Publisher | Hugging Face[^1] |
| License | Dual MIT / Apache-2.0[^4][^6] |
| Core crates | `candle-core`, `candle-nn`, `candle-transformers`, `candle-kernels`, `candle-flash-attn`, `candle-onnx`, `candle-datasets`, `candle-pyo3`, `candle-examples`[^1] |
| Supported backends | CPU (with MKL on x86 and Accelerate on macOS), CUDA (multi-GPU via [NCCL](/wiki/nccl)), Apple Metal, WebAssembly[^1] |
| Supported model formats | [Safetensors](/wiki/safetensors), [GGUF](/wiki/gguf), [GGML](/wiki/ggml), PyTorch `.pth`, NPZ[^1] |

## Background and motivation

Hugging Face had, by 2023, established itself as the dominant hub for open model weights and the `transformers`/`diffusers` Python libraries, but several deployment problems remained intractable when serving models in Python: very large container images (PyTorch wheels routinely exceed several hundred megabytes), slow cold starts on serverless platforms, and contention from the CPython global interpreter lock when many concurrent requests share a process.[^1][^7] Candle was conceived as a Rust-native alternative tailored specifically to inference workloads where these costs dominate. The official `README` summarises the motivation succinctly: Candle's "core goal" is "to make serverless inference possible" by producing "lightweight binaries", and to let users "remove Python from production workloads", noting that "Python overhead can seriously hurt performance, and the GIL is a notorious source of headaches".[^1]

The framework was developed under the `huggingface` GitHub organisation, with Laurent Mazare as the principal designer and most prolific contributor.[^2] Mazare had spent several years building `tch-rs`, a Rust binding for PyTorch's C++ API (LibTorch), and the OCaml `ocaml-torch` bindings before that, so Candle in effect represents a second, ground-up attempt to bring PyTorch-style tensor programming into a statically compiled language without dragging in the LibTorch shared object.[^2][^8] Hugging Face engineer Nicolas Patry (GitHub `Narsil`), the original author of the Rust `safetensors` and `tokenizers` crates, is also a regular contributor and reviewer.[^9]

Public archival snapshots show the repository active in early August 2023, with Internet Archive captures dated 2023-08-09, 2023-08-11, and 2023-08-12, and the `candle-kernels` crate has been continuously published to crates.io since 2 August 2023.[^5][^7] The first tagged minor release recorded in the project changelog is `v0.2.0` on 30 August 2023, which already shipped Stable Diffusion XL, additional quantized GGUF data types (Q2K, Q4K, Q5K), and an API for writing GGUF files.[^10] Version `0.2.1` on 11 September 2023 added Segment-Anything, RNNs (GRU and LSTM), TinyViT and GGUF v2 support; `0.2.2` on 18 September 2023 added a T5 decoding loop and top-p sampling; and `0.3.0` on 1 October 2023 added Mistral 7B v0.1, the Phi-v1.5 mixformer variant, the Wuerstchen diffusion stack, and SIMD128 intrinsics for quantized operations.[^10] The Apple Metal back-end was the subject of a tracking issue opened on 3 August 2023 and was added incrementally over the following months.[^11]

## How Candle is structured

Candle is delivered as a Cargo workspace of small crates that compose into a deployable binary. The README's "Structure" section enumerates the principal pieces.[^1]

| Crate | Role |
|---|---|
| `candle-core` | Core `Tensor`, `Device`, `DType`, layout, and back-end dispatch; defines the autograd engine.[^1][^12] |
| `candle-nn` | Higher-level neural network primitives: layers, optimisers, the `VarBuilder` weight loader.[^1][^13] |
| `candle-transformers` | Reference implementations of named model architectures ([LLaMA](/wiki/llama), [Mistral](/wiki/mistral), [Mixtral](/wiki/mixtral), [Phi](/wiki/phi), [Gemma](/wiki/gemma), [Whisper](/wiki/whisper), [Stable Diffusion](/wiki/stable_diffusion), etc.).[^1] |
| `candle-kernels` | CUDA C++ kernel source compiled at crate build time; provides the device-side ops invoked through cudarc.[^14] |
| `candle-flash-attn` | Rust wrapper around Flash Attention v2 CUDA kernels.[^1] |
| `candle-onnx` | Loader and evaluator for [ONNX](/wiki/onnx) graphs.[^1] |
| `candle-datasets` | Dataset helpers (MNIST, CIFAR, etc.).[^1] |
| `candle-pyo3` | Python bindings exposing a subset of the Tensor API through PyO3.[^1] |
| `candle-examples` | Self-contained example binaries for each supported model.[^1] |

Tensor objects are constructed against a `Device` (the `Cpu`, `Cuda`, or `Metal` variant, with GPU devices selected by ordinal) and a `DType` (covering f16/bf16/f32/f64 and u8/u32/i64); the GGML-style quantized block types are handled separately through the `candle_core::quantized` module.[^12] The canonical "hello world" from the crate's documentation page on docs.rs is a small matmul that runs on the CPU device:[^12]

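```rust
use candle_core::{Device, Tensor};

fn main() -> candle_core::Result<()> {
    // Build a 2x3 and a 3x4 matrix on the CPU device.
    let a = Tensor::arange(0f32, 6f32, &Device::Cpu)?.reshape((2, 3))?;
    let b = Tensor::arange(0f32, 12f32, &Device::Cpu)?.reshape((3, 4))?;
    let c = a.matmul(&b)?; // resulting shape: (2, 4)
    println!("{c}");
    Ok(())
}
```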
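To make the shapes in the example above concrete, the following dependency-free sketch reproduces the same 2x3 by 3x4 product on plain slices. It is not Candle's implementation: the `View` struct, its fields, and the `matmul` helper are hypothetical names, and the sketch assumes a simple row-major strided layout of the kind a tensor library's layout bookkeeping typically provides.

```rust
/// Hypothetical 2-D view over a flat buffer: element (i, j) lives at
/// `i * strides.0 + j * strides.1`. Not Candle's actual layout type.
struct View<'a> {
    data: &'a [f32],
    shape: (usize, usize),
    strides: (usize, usize),
}

impl<'a> View<'a> {
    /// Contiguous row-major view: stepping one row skips `cols` elements.
    fn row_major(data: &'a [f32], rows: usize, cols: usize) -> Self {
        assert_eq!(data.len(), rows * cols, "buffer does not match shape");
        View { data, shape: (rows, cols), strides: (cols, 1) }
    }

    fn get(&self, i: usize, j: usize) -> f32 {
        self.data[i * self.strides.0 + j * self.strides.1]
    }

    /// Zero-copy transpose: swap shape and strides, keep the same buffer.
    fn t(&self) -> View<'a> {
        View {
            data: self.data,
            shape: (self.shape.1, self.shape.0),
            strides: (self.strides.1, self.strides.0),
        }
    }
}

/// Naive (m, k) x (k, n) product returning a contiguous row-major buffer.
fn matmul(a: &View, b: &View) -> Vec<f32> {
    let (m, k) = a.shape;
    let (k2, n) = b.shape;
    assert_eq!(k, k2, "inner dimensions must agree");
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            out[i * n + j] = (0..k).map(|p| a.get(i, p) * b.get(p, j)).sum::<f32>();
        }
    }
    out
}

fn main() {
    // Same values as the `arange(..).reshape(..)` calls in the example above.
    let a_buf: Vec<f32> = (0..6).map(|x| x as f32).collect();
    let b_buf: Vec<f32> = (0..12).map(|x| x as f32).collect();
    let a = View::row_major(&a_buf, 2, 3);
    let b = View::row_major(&b_buf, 3, 4);
    // Prints [20.0, 23.0, 26.0, 29.0, 56.0, 68.0, 80.0, 92.0] (2x4, row-major).
    println!("{:?}", matmul(&a, &b));
    // The transpose reads the same buffer with swapped strides: prints 5.
    println!("{}", a.t().get(2, 1));
}
```

The transpose illustrates why strided views are attractive: no data is copied, only the shape and stride metadata change.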

Models are typically wired up by parsing a Hugging Face `config.json`, downloading the weights via the `hf-hub` crate (which manages the same `~/.cache/huggingface` layout as the Python `huggingface_hub`), memory-mapping the safetensors shards through a `VarBuilder`, and then materialising tensors into Rust structs that mirror the layout used by the Python `transformers` library.[^13] Hugging Face's own integration page on the `transformers` documentation site describes this loading flow explicitly and gives a three-line snippet showing `VarBuilder::from_mmaped_safetensors` followed by `Model::new(...)`.[^13]

## Back-ends

### CPU

Candle's CPU back-end is the default and is implemented in safe Rust over `ndarray`-style strided buffers, with optional integration with Intel MKL on x86 Linux/Windows and Apple Accelerate on macOS for BLAS-accelerated matmul.[^1] Quantized kernels reuse the integer formats popularised by [llama.cpp](/wiki/llama_cpp) (Q4_0, Q5_0, Q8_0, Q2K, Q4K, Q5K, etc.), with SIMD128 intrinsics (the WebAssembly SIMD instruction set) added in v0.3.0 for quantized operations.[^10]
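As an illustration of what one of these block formats involves, here is a minimal sketch of Q8_0-style quantization: 32 values share one scale, and each value is stored as a signed 8-bit integer. This is not Candle's kernel code. The names `BlockQ8_0`, `quantize`, and `dequantize` are chosen for the example, and the scale is kept as `f32` for simplicity, whereas the llama.cpp layout stores it as a 16-bit float.

```rust
/// Number of weights sharing one scale in the Q8_0 layout.
const BLOCK: usize = 32;

/// Simplified Q8_0 block: one scale plus 32 signed 8-bit quants.
/// Assumption: the scale is an f32 here; the on-disk format uses f16.
struct BlockQ8_0 {
    scale: f32,
    quants: [i8; BLOCK],
}

fn quantize(block: &[f32; BLOCK]) -> BlockQ8_0 {
    // Map the largest magnitude in the block to the i8 extreme 127.
    let amax = block.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = amax / 127.0;
    // An all-zero block has scale 0; avoid dividing by it.
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut quants = [0i8; BLOCK];
    for (q, x) in quants.iter_mut().zip(block.iter()) {
        *q = (*x * inv).round() as i8;
    }
    BlockQ8_0 { scale, quants }
}

fn dequantize(b: &BlockQ8_0) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (o, q) in out.iter_mut().zip(b.quants.iter()) {
        *o = *q as f32 * b.scale;
    }
    out
}

fn main() {
    // Sample block with values from -4.0 to 3.75 in steps of 0.25.
    let mut x = [0.0f32; BLOCK];
    for (i, v) in x.iter_mut().enumerate() {
        *v = (i as f32 - 16.0) * 0.25;
    }
    let q = quantize(&x);
    let y = dequantize(&q);
    // Round-trip error per element is about half a quantization step at most.
    let max_err = x
        .iter()
        .zip(y.iter())
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    println!("scale = {}, max round-trip error = {}", q.scale, max_err);
}
```

The block stores 32 bytes of quants plus one scale instead of 128 bytes of `f32` data, which is where the memory saving comes from; the lower-bit formats in the list trade more precision for further compression.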

### CUDA

GPU support targets NVIDIA hardware via the third-party `cudarc` Rust crate, which provides safe wrappers over the CUDA driver API, NVRTC, cuBLAS, cuDNN, cuFFT, and [NCCL](/wiki/nccl).[^14][^15] Device kernels that are not provided by cuBLAS or cuDNN (custom element-wise ops, reductions, the quantized matmul paths) live in `candle-kernels`, whose `build.rs` script compiles the CUDA C++ sources when the crate is built and embeds them into the resulting Rust binary.[^14] Multi-GPU inference uses the NCCL bindings in `cudarc` for collective operations, mirroring how the PyTorch and JAX ecosystems perform tensor and pipeline parallel sharding.[^1][^15]

### Apple Metal

Candle ships a Metal back-end for Apple Silicon GPUs. A tracking issue (#313) was opened on 3 August 2023 requesting MPS-style support, and Metal kernels were progressively added over the following months, accompanied by their own set of `.metal` source files compiled at build time.[^11] By Candle 0.3.x the back-end was sufficiently mature to power third-party text-to-speech and LLM projects on M-series Macs, though some long-tail issues such as missing BF16 matmul paths for the Falcon architecture persisted into 2024.[^16]

### WebAssembly

Because the CPU back-end is pure Rust, the workspace can be cross-compiled to the `wasm32-unknown-unknown` target. The repository's `candle-wasm-examples` directory contains in-browser ports of Whisper, LLaMA2-c, [YOLO](/wiki/yolo), [Segment Anything](/wiki/segment_anything), and [T5 (language model)](/wiki/t5) that run entirely client-side without a backend server.[^1][^17] These are deployed as live demos on [Hugging Face Spaces](https://huggingface.co/spaces) (for example `radames/candle-segment-anything-wasm` and Laurent Mazare's `lmz/candle-whisper`, `lmz/candle-llama2`, and `lmz/candle-yolo` Spaces).[^18][^17]

## Model loaders and file formats

Candle is unusual among Rust ML frameworks in supporting four production weight formats out of the box:[^1][^4]

1. **Safetensors** (`.safetensors`): the standard Hugging Face format, loaded zero-copy through memory mapping by `VarBuilder::from_mmaped_safetensors`.[^13]
2. **GGUF / GGML**: the on-disk formats used by `llama.cpp`, including their quantized variants. An API for writing GGUF files shipped in v0.2.0, and GGUF v2 support followed in v0.2.1.[^10]
3. **PyTorch `.pth` / `.bin` pickle files**: parsed without invoking a Python interpreter, allowing direct interoperability with models published in the legacy PyTorch pickle format.[^1]
4. **NPZ**: NumPy's zipped array container, useful for small datasets and reference activations.[^1]
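To give a sense of what reading a versioned GGUF file starts with, the sketch below parses the fixed-size fields at the beginning of a GGUF v2 or v3 file: the `GGUF` magic bytes, a 32-bit version, and 64-bit tensor and metadata counts, all little-endian. It is an illustration using only the standard library, not Candle's reader. `GgufHeader`, `read_header`, and the synthetic byte buffer are assumptions made for the example, and version 1 files (which used 32-bit counts) are deliberately rejected.

```rust
use std::io::{self, Cursor, Read};

/// Fixed-size fields at the start of a GGUF v2/v3 file (little-endian).
#[derive(Debug, PartialEq)]
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

fn read_u32<R: Read>(r: &mut R) -> io::Result<u32> {
    let mut b = [0u8; 4];
    r.read_exact(&mut b)?;
    Ok(u32::from_le_bytes(b))
}

fn read_u64<R: Read>(r: &mut R) -> io::Result<u64> {
    let mut b = [0u8; 8];
    r.read_exact(&mut b)?;
    Ok(u64::from_le_bytes(b))
}

fn read_header<R: Read>(r: &mut R) -> io::Result<GgufHeader> {
    let mut magic = [0u8; 4];
    r.read_exact(&mut magic)?;
    if &magic != b"GGUF" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "not a GGUF file"));
    }
    let version = read_u32(r)?;
    if version < 2 {
        // Version 1 used 32-bit counts; this sketch does not handle it.
        return Err(io::Error::new(io::ErrorKind::InvalidData, "unsupported GGUF version"));
    }
    let tensor_count = read_u64(r)?;
    let metadata_kv_count = read_u64(r)?;
    Ok(GgufHeader { version, tensor_count, metadata_kv_count })
}

fn main() -> io::Result<()> {
    // Synthetic header bytes standing in for the start of a real file.
    let mut bytes = b"GGUF".to_vec();
    bytes.extend_from_slice(&2u32.to_le_bytes());
    bytes.extend_from_slice(&291u64.to_le_bytes());
    bytes.extend_from_slice(&19u64.to_le_bytes());
    let header = read_header(&mut Cursor::new(bytes))?;
    println!("{:?}", header);
    Ok(())
}
```

A full reader would continue with the metadata key-value pairs and the tensor descriptors that follow these fields.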

This loader breadth is what makes Candle a practical drop-in for serving existing Hugging Face checkpoints without any conversion step: the same weight files that `transformers` consumes in Python can be mmapped directly from disk into a Rust process.[^13]
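The reason safetensors lends itself to memory mapping is visible in its layout: an 8-byte little-endian length, a JSON header of that length describing each tensor's dtype, shape, and byte offsets, and then the raw tensor bytes. The sketch below splits such a buffer using only the standard library. It is a simplified illustration rather than the `safetensors` crate's API; `split_safetensors` and the in-memory "file" are hypothetical, and the JSON is left unparsed.

```rust
/// Splits a safetensors buffer into its JSON header and raw tensor bytes.
/// Hypothetical helper for illustration; no JSON parsing is attempted.
fn split_safetensors(buf: &[u8]) -> Result<(&str, &[u8]), String> {
    if buf.len() < 8 {
        return Err("buffer shorter than the 8-byte length prefix".to_string());
    }
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&buf[..8]);
    let header_len = u64::from_le_bytes(len_bytes) as usize;
    // Reject lengths that overflow or point past the end of the buffer.
    let end = 8usize
        .checked_add(header_len)
        .filter(|&e| e <= buf.len())
        .ok_or_else(|| "header length exceeds buffer".to_string())?;
    let header = std::str::from_utf8(&buf[8..end]).map_err(|e| e.to_string())?;
    Ok((header, &buf[end..]))
}

fn main() {
    // Synthetic file: one f32 tensor "w" of shape [2], i.e. 8 bytes of data.
    let header = r#"{"w":{"dtype":"F32","shape":[2],"data_offsets":[0,8]}}"#;
    let mut file = (header.len() as u64).to_le_bytes().to_vec();
    file.extend_from_slice(header.as_bytes());
    file.extend_from_slice(&1.5f32.to_le_bytes());
    file.extend_from_slice(&(-2.0f32).to_le_bytes());

    let (json, data) = split_safetensors(&file).expect("valid buffer");
    println!("header: {}", json);
    // `data_offsets` [0, 8] are relative to `data`, the bytes after the header.
    let w: Vec<f32> = data[0..8]
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    println!("w = {:?}", w); // prints w = [1.5, -2.0]
}
```

Because each tensor is just a byte range after the header, a loader can hand out slices of a memory-mapped file without first copying or deserialising the weights.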

## Example binaries

The `candle-examples` crate doubles as a regression suite and as a demo gallery. Each example is a small Cargo binary that downloads its weights via `hf-hub`, builds the model, and runs inference. Notable examples shipped in tree include:

| Domain | Example binaries (selected) |
|---|---|
| Large language models | LLaMA v1/v2/v3, Mistral, Mixtral, Phi-1.5 / Phi-2 / Phi-3, Gemma 1 / 2, [Falcon](/wiki/falcon), [StarCoder](/wiki/starcoder), [Qwen](/wiki/qwen), [Mamba](/wiki/mamba), [RWKV](/wiki/rwkv), StableLM, Yi[^1] |
| Speech | [Whisper](/wiki/whisper) (multilingual), [EnCodec](/wiki/encodec), MetaVoice, Parler-TTS[^1] |
| Vision | [YOLO](/wiki/yolo) v3/v8, [Segment Anything](/wiki/segment_anything), [DINOv2](/wiki/dinov2), ViT, ResNet, EfficientNet, MobileNet v4, TrOCR, BLIP, [CLIP](/wiki/clip)[^1] |
| Text-to-image | [Stable Diffusion](/wiki/stable_diffusion) 1.5 / 2.1, [SDXL](/wiki/sdxl) 1.0 and Turbo, Wuerstchen, FLUX.1[^1][^19] |
| Other | [T5](/wiki/t5), Marian MT, BERT, sentence-transformers[^1] |

The FLUX.1 implementation was announced by Mazare on 2 August 2024: "The FLUX.1 image generation model is now available in candle. Pure #Rust implementation, no python involved."[^19] The older `lmz/candle-mistral`, `lmz/candle-whisper`, and `lmz/candle-yolo-v8` Hugging Face repositories pre-stage weights in the exact layout the example binaries expect.[^3][^20]

## Inference versus training

While `candle-core` does include a reverse-mode automatic differentiation engine and `candle-nn` provides optimisers, the project is in practice strongly inference-oriented. The official documentation lists "model training" as a supported capability and the example crate contains MNIST and CIFAR training scripts, but the marquee features (quantization, GGUF, WebAssembly, the breadth of `candle-transformers` model definitions, multi-GPU NCCL sharding) all serve deployment rather than research training.[^1][^4][^7] Hugging Face's own framing on the integration page describes Candle as a way to provide "native Rust implementations of Transformers models" for serving, not as a competitor to PyTorch for large-scale pre-training.[^13] The community comparison literature consistently slots Candle alongside `llama.cpp` and ONNX Runtime in the "inference engine" bucket rather than alongside JAX or PyTorch in the "training framework" bucket.[^21][^22]

## Adoption and downstream projects

### Hugging Face Spaces

The most visible Candle deployments are the WebAssembly demos hosted on Hugging Face Spaces, which run real models entirely in the user's browser. These include Whisper-based transcription, LLaMA2-c text generation, YOLO object detection, Segment Anything masking, and T5 text-to-text. Several of these Spaces have drawn dozens of likes: the cited profile page lists 74 for `lmz/candle-yolo`, which is marked as a featured Space, and 65 for `lmz/candle-whisper`.[^3][^18] They double as load-tests for the WASM build and as the most accessible "first contact" with the framework for users without a GPU.[^17]

### mistral.rs

The `mistral.rs` project by Eric L. Buehler (`EricLBuehler/mistral.rs`) is a Rust-native LLM inference engine built directly on top of Candle. The README's credits section states: "This project would not be possible without the excellent work at Candle."[^23] As of 2026 the project supports a long list of model families (Qwen, LLaMA, Mistral, Phi, Gemma, DeepSeek variants), vision/audio multimodality, and a Python SDK in addition to its Rust crate, and it explicitly migrated to the official crates.io release of Candle 0.9.2 to stabilise its back-end dependency.[^23] mistral.rs is independent of Hugging Face but represents the most prominent third-party project that depends on Candle.

### fish-speech.rs

`EndlessReform/fish-speech.rs` is a pure-Rust reimplementation of the Fish Speech 1.5 text-to-speech model, also built on Candle.[^24] It compiles to a roughly 15 MB static binary, supports CUDA, Apple Metal, and CPU back-ends, and ships custom grouped-query attention CUDA kernels in a `candle-gqa-kernels` subdirectory that are intended to be useful to other Candle users.[^24] The project is a representative example of how Candle is used outside Hugging Face: a small team takes a Python reference implementation and republishes it as a single-binary Rust server.

### Kyutai and Moshi

Laurent Mazare's other affiliation, the Paris-based speech AI lab Kyutai, has released the Moshi full-duplex speech model with Candle-compatible weight variants under names like `kyutai/moshika-candle-bf16` and `kyutai/moshiko-candle-q8`, allowing the open-source [Moshi](/wiki/moshi) models to be run from Rust with no Python in the loop.[^25][^26]

### Other downstream projects

A growing number of small Rust projects use Candle as their tensor library, including `qwen3-asr-rs` (a pure-Rust ASR inference engine for Qwen3-ASR using Metal and CUDA acceleration through Candle)[^27] and various sentence-transformer servers used as embedding back-ends.[^21]

## Comparison with other Rust ML libraries

| Library | Architecture | Primary focus | GPU back-ends | Training? | Dependency footprint |
|---|---|---|---|---|---|
| **Candle** (huggingface/candle) | Pure Rust + cudarc + Metal kernels[^1][^14] | Inference, serverless, edge[^1] | NVIDIA CUDA, Apple Metal, WASM[^1] | Yes, but secondary[^1] | Megabyte-class binary, no LibTorch[^7] |
| `tch-rs` (`LaurentMazare/tch-rs`) | Thin FFI bindings to PyTorch C++ (LibTorch)[^8] | Bringing the full PyTorch API into Rust[^8] | Whatever LibTorch supports (CUDA, ROCm via PyTorch)[^8] | Yes, full PyTorch parity[^8] | Requires a multi-hundred-MB LibTorch shared object at runtime[^8] |
| `burn` (`tracel-ai/burn`) | Generic over a `Backend` trait[^28] | Full training + inference stack[^28] | Multi-backend: WGPU (Vulkan/Metal/DX12/WebGPU), CUDA, ROCm, LibTorch, NDArray[^28] | Yes, training is a first-class goal[^28] | Pure Rust by default; LibTorch only if the Tch backend is selected[^28] |
| `ort` (ONNX Runtime Rust bindings) | FFI over Microsoft's ONNX Runtime C/C++ library[^22] | Cross-framework inference of [ONNX](/wiki/onnx) graphs[^22] | Whatever ONNX Runtime ships (CUDA, TensorRT, DirectML, CoreML, OpenVINO, CANN, QNN)[^22] | No (inference only)[^22] | Requires the bundled `onnxruntime` shared library[^22] |

`tch-rs` is the most direct point of comparison because Laurent Mazare is the author of both: where `tch-rs` exposes the LibTorch API verbatim and inherits both its breadth and its multi-hundred-megabyte runtime footprint, Candle reimplements the necessary subset in pure Rust and trades full PyTorch parity for a small static binary.[^2][^8] `burn`, maintained by tracel-ai, takes a third path: it is also pure Rust by default but is generic over its compute back-end, supports cross-platform GPU execution through WGPU (Vulkan, Metal, DirectX, WebGPU), and explicitly markets itself for training as well as inference.[^28] On a YOLOv8 benchmark reported in the Candle issue tracker, ONNX Runtime ran end-to-end inference roughly eight times faster than Candle for the same model (~7 ms vs ~55 ms with CUDA + cuDNN), which has been cited as evidence that Candle prioritises portability and small binaries over peak per-kernel throughput.[^22]

## Significance

Candle's primary contribution is showing that a credible, ergonomic, statically compiled ML inference stack is achievable without binding to LibTorch or ONNX Runtime. By keeping the tensor library, model definitions, kernels, and weight loaders all in Rust, the framework produces single-binary deployments measured in tens of megabytes (typical example sizes: 22 MB for Whisper-tiny, 48 MB for a 4-bit LLaMA-2 7B, 38 MB for Phi-2) that cold-start in milliseconds.[^7] This profile fits cleanly into the "Rust for Lambda / Cloud Run" pattern that has emerged independently in the serverless community: Datadog, for instance, reported cutting cold starts by 82 % and reducing extension binary size from 55 MB to 7 MB by rewriting their AWS Lambda extension in Rust, the same shape of result that Candle aims to make available for model inference.[^29] For browser and embedded deployments, the WASM target eliminates a server hop entirely.[^17]

The framework also matters as part of Hugging Face's broader strategy of moving performance-critical infrastructure out of Python. `safetensors`, `tokenizers`, Text Generation Inference, and `hf-hub` are all Rust-based projects maintained by Hugging Face; Candle slots between them as the layer that actually executes models.[^1][^9][^13]

## Limitations

Several recurring criticisms appear across community comparisons and the project's own issue tracker:

- **Single-kernel throughput is not yet competitive with vendor-tuned engines.** Reported benchmarks (notably the YOLOv8 case above) show ONNX Runtime delivering substantially lower latency for the same model, because ORT can dispatch to TensorRT, oneDNN, and other vendor execution providers that Candle does not.[^22]
- **Training is a secondary concern.** While autodiff and optimisers exist, Candle does not provide a training dashboard, distributed data-parallel training utilities, or a research-grade `lightning`-style training loop. Users wanting a Rust framework for training are typically directed to `burn`.[^28]
- **Limited Apple Metal coverage in some data types.** The Metal back-end has known gaps such as missing BF16 matmul support, leading to runtime errors on certain model/architecture combinations.[^16]
- **Model coverage is curated, not exhaustive.** Adding a new architecture requires writing a new module under `candle-transformers`, in contrast to ONNX or PyTorch-loading frameworks that can in principle execute any graph.[^1]
- **GitHub releases are not published.** The project's GitHub Releases tab is empty; users must consult `CHANGELOG.md` or the crates.io version history to see what shipped when.[^10]

## Related work

- [Text Generation Inference](/wiki/huggingface_tgi): Hugging Face's Rust-based serving system, which uses Python model code rather than Candle.[^9]
- [llama.cpp](/wiki/llama_cpp): the C++ inference engine whose GGUF format and quantization schemes Candle reuses.[^10]
- [ONNX](/wiki/onnx) and the `ort` Rust crate: a cross-framework alternative for inference.[^22]
- [tinygrad](/wiki/tinygrad): another small-footprint inference framework, in Python rather than Rust.
- [Ollama](/wiki/ollama) and [vLLM](/wiki/vllm): higher-level serving wrappers that primarily target Python/Go workflows.

## See also

- [PyTorch](/wiki/pytorch)
- [Falcon (language model)](/wiki/falcon)
- [FLUX.1](/wiki/flux_1)
- [Kyutai](/wiki/kyutai_labs)
- [CUDA](/wiki/cuda)
- [Flash Attention](/wiki/flash_attention)
- [Quantization](/wiki/quantization)
- [WebAssembly](/wiki/webassembly)
- [Hugging Face Transformers](/wiki/transformers_library)
- [Edge AI](/wiki/edge_ai)
- [Inference](/wiki/inference)

## References

[^1]: Hugging Face, "candle: Minimalist ML framework for Rust", GitHub README, 2024-2026. https://github.com/huggingface/candle. Accessed 2026-05-21.
[^2]: Laurent Mazare, "GitHub profile (LaurentMazare): co-founder of Kyutai and Gradium, Paris; pinned project tch-rs (5.4k stars)", GitHub, 2026. https://github.com/LaurentMazare. Accessed 2026-05-21.
[^3]: Hugging Face, "lmz (Laurent Mazare) profile page: 29 models including lmz/candle-flux, lmz/candle-gemma, lmz/candle-rwkv, lmz/candle-metavoice; Spaces include Candle Whisper, Candle Llama2, Candle YOLO", huggingface.co, 2024-2026. https://huggingface.co/lmz. Accessed 2026-05-21.
[^4]: Hugging Face, "candle-core 0.10.2 crate documentation", docs.rs, 2026. https://docs.rs/candle-core. Accessed 2026-05-21.
[^5]: Internet Archive, "github.com-huggingface-candle snapshots (2023-08-09, 2023-08-11, 2023-08-12)", archive.org, 2023-08. https://archive.org/details/github.com-huggingface-candle_-_2023-08-09_16-14-17. Accessed 2026-05-21.
[^6]: Rust Foundation / Hugging Face, "candle-core crate page (license: MIT OR Apache-2.0)", crates.io, 2026. https://crates.io/crates/candle-core. Accessed 2026-05-21.
[^7]: BrightCoding, "Candle: A Minimalist Rust ML Framework with Fast Demos like Whisper and LLaMA2 (binary sizes: 22 MB Whisper tiny, 48 MB LLaMA2-7B q4k, 38 MB Phi-2)", brightcoding.dev, 2025-09-29. https://www.blog.brightcoding.dev/2025/09/29/candle-a-minimalist-rust-ml-framework-with-fast-demos-like-whisper-and-llama2/. Accessed 2026-05-21.
[^8]: Laurent Mazare, "tch-rs: Rust bindings for the C++ api of PyTorch (libtorch)", GitHub, 2026. https://github.com/LaurentMazare/tch-rs. Accessed 2026-05-21.
[^9]: Nicolas Patry (Narsil), "GitHub profile, affiliated with Hugging Face; contributor to huggingface/candle (e.g. PR #2841 enabling different linking options for MKL)", github.com, 2026. https://github.com/Narsil. Accessed 2026-05-21.
[^10]: Hugging Face, "candle CHANGELOG.md: v0.2.0 (2023-08-30) Stable Diffusion XL, GGUF writing, Q2K/Q4K/Q5K; v0.2.1 (2023-09-11) SAM, RNNs, GGUF v2; v0.2.2 (2023-09-18) T5 decoding; v0.3.0 (2023-10-01) Mistral 7B v0.1, Phi-v1.5 mixformer, Wuerstchen, SIMD128", GitHub, 2023-2024. https://github.com/huggingface/candle/blob/main/CHANGELOG.md. Accessed 2026-05-21.
[^11]: Hugging Face contributors, "Apple silicon (MPS backends) support? Issue #313", github.com, opened 2023-08-03. https://github.com/huggingface/candle/issues/313. Accessed 2026-05-21.
[^12]: Hugging Face, "candle_core - Rust crate documentation (Tensor, Device, DType; matmul example)", docs.rs, 2026. https://docs.rs/candle-core/latest/candle_core/. Accessed 2026-05-21.
[^13]: Hugging Face, "Candle - community_integrations docs (VarBuilder::from_mmaped_safetensors, hf-hub integration, config.json parsing)", huggingface.co, 2024-2026. https://huggingface.co/docs/transformers/community_integrations/candle. Accessed 2026-05-21.
[^14]: Hugging Face, "candle-kernels crate (CUDA kernels for Candle, build script compiles kernel sources, configurable via CUDA_COMPUTE_CAP)", lib.rs / crates.io, 2023-2026. https://lib.rs/crates/candle-kernels. Accessed 2026-05-21.
[^15]: Corey Lowman et al., "cudarc: minimal and safe API over the CUDA toolkit (driver API, NVRTC, cuBLAS, cuDNN, NCCL, cuFFT)", GitHub, 2026. https://github.com/coreylowman/cudarc. Accessed 2026-05-21.
[^16]: Hugging Face contributors, "Apple metal embedding generation panics w/ IOGPUMetalCommandBuffer failed assertion (BF16 matmul not supported)", Discussion #2818, github.com, 2024-2025. https://github.com/huggingface/candle/discussions/2818. Accessed 2026-05-21.
[^17]: Hugging Face Spaces, "Candle Segment Anything Model (SAM) Rust/WASM live demo (radames/candle-segment-anything-wasm)", huggingface.co, 2024. https://huggingface.co/spaces/radames/candle-segment-anything-wasm/resolve/main/index.html. Accessed 2026-05-21.
[^18]: Hugging Face, "lmz Spaces: Candle Whisper (65 likes), Candle Llama2, Candle YOLO (Featured, 74 likes)", huggingface.co, 2024-2026. https://huggingface.co/lmz. Accessed 2026-05-21.
[^19]: Laurent Mazare (@lmazare), "Time for more rusty robots! The FLUX.1 image generation model is now available in candle. Pure #Rust implementation, no python involved.", X (formerly Twitter), 2024-08-02. https://x.com/lmazare/status/1820023217773334888. Accessed 2026-05-21.
[^20]: Hugging Face, "lmz/candle-mistral, lmz/candle-yolo-v8, lmz/candle-whisper model repositories", huggingface.co, 2023-2024. https://huggingface.co/lmz. Accessed 2026-05-21.
[^21]: Athan X, "Choosing the Right Rust Machine Learning Framework: Candle, Burn, DFDX, or tch-rs?", Medium, 2024. https://medium.com/@athan.seal/choosing-the-right-rust-machine-learning-framework-candle-burn-dfdx-or-tch-rs-17501f6cd765. Accessed 2026-05-21.
[^22]: Hugging Face contributors, "using candle to make inference of yolov8 model is quite slow than PyTorch or onnxruntime! Issue #942 (Candle ~55 ms vs ONNX Runtime ~7 ms with CUDA + cuDNN)", github.com, 2023-2024. https://github.com/huggingface/candle/issues/942. Accessed 2026-05-21.
[^23]: Eric L. Buehler, "mistral.rs: Fast, flexible LLM inference (README credits: 'This project would not be possible without the excellent work at Candle'; migrated to candle 0.9.2)", GitHub, 2024-2026. https://github.com/EricLBuehler/mistral.rs. Accessed 2026-05-21.
[^24]: EndlessReform, "fish-speech.rs: A Fish Speech implementation in Rust, with Candle.rs (15 MB static binary, CUDA / Apple Silicon / CPU back-ends, custom candle-gqa-kernels)", GitHub, 2024-2026. https://github.com/EndlessReform/fish-speech.rs. Accessed 2026-05-21.
[^25]: Kyutai, "kyutai/moshika-candle-bf16 model card", huggingface.co, 2024. https://huggingface.co/kyutai/moshika-candle-bf16. Accessed 2026-05-21.
[^26]: Kyutai, "kyutai/moshiko-candle-q8 model card", huggingface.co, 2024. https://huggingface.co/kyutai/moshiko-candle-q8. Accessed 2026-05-21.
[^27]: alan890104, "qwen3-asr-rs: Pure-Rust inference engine for Qwen3-ASR speech recognition models (0.6B and 1.7B) using candle with Metal/CUDA acceleration", GitHub, 2025-2026. https://github.com/alan890104/qwen3-asr-rs. Accessed 2026-05-21.
[^28]: tracel-ai, "burn: a next generation tensor library and Deep Learning Framework (CUDA, ROCm, Metal, Vulkan, WebGPU, LibTorch back-ends; training dashboard; v0.21.0 May 2026)", GitHub, 2026. https://github.com/tracel-ai/burn. Accessed 2026-05-21.
[^29]: Luciano Mammino, "Why you should consider Rust for your Lambdas (cold starts as low as 50-75 ms; Datadog Lambda extension rewrite cut cold starts by 82% and binary size from 55 MB to 7 MB)", loige.co, 2024. https://loige.co/why-you-should-consider-rust-for-your-lambdas/. Accessed 2026-05-21.

