Unsloth
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,008 words
Unsloth is an open-source Python library for fast, memory-efficient fine-tuning of large language models, developed by Australian brothers Daniel and Michael Han.[1][2] The project rewrites the heaviest computational paths of transformer training (attention, rotary position embeddings, root mean square normalization, cross-entropy loss, and several other hot operations) as hand-derived backward passes implemented in OpenAI Triton, advertising roughly two times faster training and up to seventy percent less GPU memory consumption versus a baseline of Hugging Face Transformers, the TRL trainer, and the bitsandbytes 4-bit kernels.[1][3] Unsloth supports parameter-efficient methods such as LoRA and QLoRA, full fine-tuning, continued pretraining, and a wide range of preference-tuning and reinforcement learning techniques including DPO, ORPO, KTO, and GRPO.[4][5] The core library is released under the Apache 2.0 license on GitHub under the organization name unslothai, while the company behind it (Unsloth AI) was admitted to Y Combinator's Summer 2024 batch and operates from San Francisco.[1][2] By early 2026 the project reported over forty thousand GitHub stars and more than ten million monthly downloads of its prequantized model weights on the Hugging Face Hub.[2]
| Field | Value |
|---|---|
| Project name | Unsloth |
| Founders | Daniel Han, Michael Han |
| Company | Unsloth AI |
| Year founded | 2023 |
| Headquarters | San Francisco, California (originally Sydney, Australia) |
| Y Combinator batch | Summer 2024 |
| Source repository | github.com/unslothai/unsloth |
| Core library license | Apache 2.0 |
| Studio UI license | AGPL-3.0 |
| Primary language | Python (with OpenAI Triton kernels) |
| Main use case | Fine-tuning, RL, and quantization of open LLMs |
| Headline claim | ~2x faster training, up to ~70% less VRAM versus FlashAttention 2 plus Hugging Face baseline |
Daniel and Michael Han began Unsloth in late 2023 as an open-source side project aimed at making single-GPU fine-tuning of LLaMA derivatives substantially faster. Daniel Han had previously worked as an engineer at Nvidia on optimization-heavy software, and before that he had built and maintained Hyperlearn, a small linear algebra package focused on numerically stable, low-memory implementations of classical machine learning algorithms.[2] Michael Han contributed product engineering and design alongside fine-tuning support work. The initial release shipped a set of Google Colab notebooks demonstrating that supervised fine-tuning of a 7-billion-parameter LLaMA-style model could be completed on a free Tesla T4 GPU in a fraction of the time and memory needed by a stock Hugging Face configuration. Early benchmarks circulated under headlines such as "five times faster" because some configurations on Kaggle's two-GPU T4 instances delivered roughly that uplift over the standard transformers plus bitsandbytes baseline.[1][6]
During 2024, Unsloth gained visibility not only for its kernels but also because Daniel Han began publishing detailed bug reports on flagship open models. The team identified and fixed eight separate issues in Google's Gemma release, several tokenization defects in Meta's Llama 3 family, and a sliding-window-attention defect affecting Microsoft's Phi-3 at 2048-token windows.[6][7] These fixes propagated back into Hugging Face Transformers, llama.cpp, and other downstream packages, giving Unsloth an unusually visible role in the open weights ecosystem despite its small team size. Daniel Han gave a widely circulated talk at the AI Engineer World's Fair 2024 titled "Fixing bugs in Gemma, Llama and Phi-3," which summarized this work for a broad practitioner audience.[7]
Unsloth AI, Inc. was admitted to Y Combinator's Summer 2024 batch and was publicly described as a company "developing Open-Source Reinforcement Learning (RL) and Fine-tuning for LLMs."[2] Public records list the seed-stage round at roughly $500,000 with backing including Y Combinator, the GitHub Accelerator program, and Microsoft's M12 venture arm.[8] Headquarters relocated to San Francisco, while the founders continued to maintain strong ties to the Australian developer community where the project began. Team size reported on the company's YC profile in 2026 was eight people, with a posted founding ML engineer role offering 0.30 percent to 0.70 percent equity.[2]
Through 2025 the project released a steady stream of updates covering preference optimization, reasoning training, quantization, and platform support.
By May 2026 the GitHub repository's headline summary described the project as a stack for "training and running open models like Gemma 4, Qwen3.6, DeepSeek, gpt-oss locally," and the team had introduced Unsloth Studio, a desktop and web UI built on top of the core library for users who prefer to point and click through fine-tuning, dataset construction, and local inference.[1] Coverage in independent benchmarks during late 2025 and early 2026 placed Unsloth at the top of single-GPU efficiency tables, with multi-GPU and multi-node operation still considered a relative weakness compared to alternatives like Axolotl and Torchtune.[4][13]
Unsloth's speed and memory claims do not come from a single optimization. They are produced by a stack of overlapping techniques that together change the constants in the training loop while keeping the underlying math equivalent to a standard fine-tuning pass.
The single most distinctive technical choice in Unsloth is that every important operator in a transformer forward pass has a matching backward pass derived analytically by hand, rather than relying on PyTorch's autograd. This approach lets the library fuse what would otherwise be a chain of small operations (matrix multiplications, activation functions, normalizations, and reshapes) into a single Triton kernel that touches each intermediate tensor only once. Eliminating these intermediate clones and transposes is what allows Unsloth to claim large reductions in both wall-clock time and peak memory.[14] Daniel Han has described this work as a "manual autograd engine with hand-derived matrix calculus backpropagation for peak performance"; in practice it means that adding a new architecture to Unsloth requires the team to write and verify the symbolic gradient for any operator that is not already covered.[14]
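To make the idea of a hand-derived backward pass concrete, the following is a minimal pure-Python sketch for RMS normalization. It is an illustration of the general technique, not Unsloth's Triton code. The function names are hypothetical, and the closed-form gradient is checked against central finite differences.

```python
import math

def rmsnorm_forward(x, w, eps=1e-6):
    """y_i = w_i * x_i / sqrt(mean(x^2) + eps). Also returns the inverse RMS for reuse."""
    n = len(x)
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / n + eps)
    return [wi * xi * inv_rms for xi, wi in zip(x, w)], inv_rms

def rmsnorm_backward(grad_y, x, w, inv_rms):
    """Closed-form gradients, derived by hand instead of being traced by autograd.

    dL/dx_j = inv_rms * g_j * w_j - inv_rms^3 * x_j / n * sum_i(g_i * w_i * x_i)
    dL/dw_i = g_i * x_i * inv_rms
    """
    n = len(x)
    gw = [g * wi for g, wi in zip(grad_y, w)]
    dot = sum(gwi * xi for gwi, xi in zip(gw, x))
    grad_x = [inv_rms * gwi - (inv_rms ** 3) * xi * dot / n for gwi, xi in zip(gw, x)]
    grad_w = [g * xi * inv_rms for g, xi in zip(grad_y, x)]
    return grad_x, grad_w

def numeric_grad_x(grad_y, x, w, h=1e-6):
    """Central finite differences of L = sum(grad_y * y), used only as a check."""
    grads = []
    for j in range(len(x)):
        x_plus = list(x)
        x_plus[j] += h
        x_minus = list(x)
        x_minus[j] -= h
        loss_plus = sum(g * y for g, y in zip(grad_y, rmsnorm_forward(x_plus, w)[0]))
        loss_minus = sum(g * y for g, y in zip(grad_y, rmsnorm_forward(x_minus, w)[0]))
        grads.append((loss_plus - loss_minus) / (2 * h))
    return grads

# Toy inputs chosen for illustration only.
x = [0.5, -1.0, 2.0, 0.25]
w = [1.0, 0.5, -0.75, 2.0]
grad_y = [0.1, -0.2, 0.3, 0.4]

y, inv_rms = rmsnorm_forward(x, w)
grad_x, grad_w = rmsnorm_backward(grad_y, x, w, inv_rms)
max_err = max(abs(a - b) for a, b in zip(grad_x, numeric_grad_x(grad_y, x, w)))
```

The backward pass needs only the inputs and a single saved scalar (the inverse RMS), which is the kind of saving that makes fusing forward and backward work into one kernel attractive.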
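Unsloth ships custom kernels written in OpenAI Triton for several hot operators, including attention, rotary position embeddings, root mean square normalization, and cross-entropy loss.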
These kernels target the same hot operators that other libraries such as the Liger Kernel attempt to optimize, but Unsloth combines them with its hand-written autograd and its training-loop changes to deliver the end-to-end speedups it advertises.[15]
A large fraction of Unsloth usage takes the form of QLoRA-style fine-tuning: the base model is held in a 4-bit quantized form supplied by bitsandbytes, while small low-rank adapters kept in higher precision are the only weights that receive gradient updates. The library composes its custom kernels around the quantized matrix multiplications so that the dequantize-then-multiply step is also fused into the Triton pipeline, removing one of the largest sources of overhead in vanilla bitsandbytes plus PEFT training.[1][14]
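The arrangement described above can be sketched in a few lines of pure Python. This is a toy under stated assumptions: it uses simple symmetric absmax rounding to 4-bit integer codes rather than the quantization format bitsandbytes actually uses, it does not fuse anything, and all function and variable names are hypothetical.

```python
def quantize_row_4bit(row):
    """Symmetric absmax quantization of one weight row to integer codes in [-7, 7].

    Assumption: a stand-in for a real 4-bit format, chosen for readability.
    """
    scale = (max(abs(v) for v in row) / 7.0) or 1.0  # avoid a zero scale for all-zero rows
    codes = [max(-7, min(7, round(v / scale))) for v in row]
    return codes, scale

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def qlora_forward(q_rows, lora_a, lora_b, x, alpha, rank):
    """y = dequant(W_q) x + (alpha / rank) * B (A x); only A and B would be trained."""
    base = [sum(c * scale * xi for c, xi in zip(codes, x)) for codes, scale in q_rows]
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]

# Frozen base weight (2 outputs x 3 inputs), stored only as 4-bit codes plus a scale per row.
weight = [[0.7, -0.3, 0.1], [0.14, 0.28, -0.42]]
q_rows = [quantize_row_4bit(row) for row in weight]
dequantized = [[c * s for c in codes] for codes, s in q_rows]

x = [1.0, 2.0, 3.0]
rank, alpha = 1, 2.0
lora_a = [[1.0, 0.0, 2.0]]      # rank x in_features
lora_b_init = [[0.0], [0.0]]    # out_features x rank, zero-initialized

# With B at zero the adapter contributes nothing, so training starts from the base model.
y_init = qlora_forward(q_rows, lora_a, lora_b_init, x, alpha, rank)

# After some hypothetical updates to B, the output shifts while the 4-bit codes stay frozen.
lora_b_trained = [[0.5], [-1.0]]
y_tuned = qlora_forward(q_rows, lora_a, lora_b_trained, x, alpha, rank)
```

In this sketch the dequantize step and the multiply are separate Python expressions. The fusion described above would combine them into a single kernel so the full-precision weight is never materialized.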
Unsloth also supports full-parameter fine-tuning, 8-bit and 16-bit LoRA, and (since the December 2025 release) FP8 training on consumer GPUs that expose the required instructions.[12]
Many LLM training datasets contain sequences of widely varying length. The standard approach pads every sequence in a batch to the length of the longest one, which wastes compute on padding tokens. The December 2025 release introduced padding-free training with example packing: short examples are concatenated into long packed sequences, and the attention masks are rewritten so that the model does not attend across example boundaries. Unsloth attributes a substantial part of its December headline figure of "3x faster training, 30% less VRAM" to this change.[12]
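A small sketch can illustrate packing and the rewritten mask. It assumes a greedy first-fit packing strategy and a boolean block-diagonal causal mask. Both are illustrative choices rather than a description of Unsloth's implementation, and the helper names are hypothetical.

```python
def pack_examples(examples, max_len):
    """Greedy first-fit: place each example in the first pack that still has room."""
    packs = []
    for example in examples:
        if len(example) > max_len:
            raise ValueError("example longer than max_len")
        for pack in packs:
            if sum(len(e) for e in pack) + len(example) <= max_len:
                pack.append(example)
                break
        else:
            packs.append([example])
    return packs

def segment_ids(pack):
    """Label every token with the index of the example it came from."""
    ids = []
    for index, example in enumerate(pack):
        ids.extend([index] * len(example))
    return ids

def position_ids(pack):
    """Positions restart at 0 for every example inside a pack."""
    return [pos for example in pack for pos in range(len(example))]

def packed_causal_mask(segments):
    """mask[i][j] is True when token i may attend to token j:
    j must not be in the future and must belong to the same example."""
    n = len(segments)
    return [[j <= i and segments[i] == segments[j] for j in range(n)] for i in range(n)]

# Four toy token sequences of lengths 3, 2, 4 and 1.
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packs = pack_examples(examples, max_len=5)

# Padding every example to the longest one would use 4 * 4 = 16 slots for 10 real tokens.
padded_slots = len(examples) * max(len(e) for e in examples)
packed_slots = sum(len(e) for pack in packs for e in pack)

segments = segment_ids(packs[0])
positions = position_ids(packs[0])
mask = packed_causal_mask(segments)
```

Here the first pack holds the 3-token and 2-token examples back to back, and the mask stops the fourth token (the start of the second example) from attending to any of the first three.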
In December 2024 Unsloth introduced a quantization scheme it calls Dynamic 4-bit (later updated as Dynamic 2.0 GGUFs).[16][17] Rather than apply a uniform 4-bit quantization to every weight, the method profiles each transformer block's sensitivity to precision loss and elects to leave certain parameters (typically embeddings and the earliest and latest attention blocks) at higher precision while compressing the middle feed-forward layers more aggressively. The team reports that the technique recovers most of the accuracy lost by stock bitsandbytes 4-bit while using less than ten percent more VRAM, and that on the Llama 3.2 Vision 11B and Qwen2 Vision 2B models it restored semantic details that the default 4-bit quantizer dropped or corrupted.[16] Dynamic 2.0 GGUFs extend the same idea to the GGUF quantization format used by llama.cpp, with quantization choices made per layer per model so that the scheme used for Gemma 3 differs from the scheme used for Llama 4.[17]
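The selective idea can be illustrated with a toy precision planner. The sensitivity measure used here (relative reconstruction error after a 4-bit round trip of the weights) is an assumption made for the sketch, since the text above does not specify how Unsloth profiles sensitivity. The layer names, weights, and threshold are likewise hypothetical.

```python
def roundtrip_4bit(weights):
    """Quantize to integer codes in [-7, 7] with an absmax scale, then dequantize."""
    scale = (max(abs(w) for w in weights) / 7.0) or 1.0
    return [max(-7, min(7, round(w / scale))) * scale for w in weights]

def relative_error(weights):
    """Root of (squared reconstruction error / squared weight norm)."""
    restored = roundtrip_4bit(weights)
    num = sum((w - r) ** 2 for w, r in zip(weights, restored))
    den = sum(w * w for w in weights) or 1.0
    return (num / den) ** 0.5

def choose_precision(layers, threshold):
    """Keep layers whose 4-bit round trip loses too much at higher precision."""
    return {
        name: "16-bit" if relative_error(weights) > threshold else "4-bit"
        for name, weights in layers.items()
    }

# Toy weights: the first layer has one large value and many small ones,
# so a single absmax scale rounds the small values to zero.
layers = {
    "embed_tokens": [0.7, 0.04, -0.03, 0.02, -0.04],
    "layers.0.mlp": [0.7, -0.52, 0.33, -0.11, 0.58],
    "layers.1.mlp": [0.7, -0.5, 0.3, -0.1, 0.6],
}
plan = choose_precision(layers, threshold=0.05)
errors = {name: relative_error(weights) for name, weights in layers.items()}
```

With these toy numbers only the first layer exceeds the threshold and stays at higher precision. The other two are compressed to 4 bits, mirroring the mixed-precision outcome described above.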
The February 2025 GRPO release added a tight integration with vLLM so that the inference engine used to sample on-policy completions can share the same GPU and weights as the policy that is being trained. Unsloth reports that this integration delivered roughly twenty times more throughput on the rollout phase compared with running generation through transformers with the same hardware, which in turn made GRPO-style reinforcement learning feasible on a single 16 GB T4 GPU.[9]
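The rollout phase matters because GRPO scores each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows only that group-relative normalization. It assumes the commonly described formulation (reward minus group mean, divided by group standard deviation) with a population standard deviation and a small epsilon. It omits sampling through vLLM and the policy update, and the function name is hypothetical.

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = variance ** 0.5
    # eps keeps the division defined when every completion gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical completions for one prompt, scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every completion earns the same reward, the group carries no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

Because the baseline comes from the group itself, every training step needs several fresh completions per prompt, which is why generation throughput dominates the cost of this style of training.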
Through 2025 the open-source library was strictly single-GPU, and external commentators repeatedly singled this out as Unsloth's most prominent weakness relative to Axolotl, LLaMA-Factory, and Torchtune, all of which had supported multi-GPU and multi-node training for some time.[4][13] The December 2025 release shipped a Distributed Data Parallel guide and basic accelerate launch and torchrun --nproc_per_node support; Unsloth explicitly described this as preliminary and noted that a fuller multi-GPU release was planned for 2026.[12][18] For models too large to fit on a single GPU, the library also exposes a device_map="balanced" argument that splits weights across devices.[18]
Unsloth's coverage tracks the popular open-weight ecosystem closely. As of mid-2026 the project supports more than five hundred model variants.
Unsloth maintains a corresponding unsloth organization on Hugging Face hosting prequantized bnb-4bit and unsloth-bnb-4bit checkpoints of these models, along with GGUF conversions; the company has reported in excess of ten million monthly downloads across these artifacts.[2]
Unsloth is distributed through several mutually reinforcing surfaces:
- The unsloth package on PyPI and GitHub, installable into any PyTorch environment and licensed under Apache 2.0. This is the core library that ships the Triton kernels, the autograd code, the model adapters, and the training utilities.[1]
- The unsloth organization on Hugging Face, including both standard bnb-4bit checkpoints and Unsloth's selectively quantized unsloth-bnb-4bit variants and Dynamic 2.0 GGUFs.[16][17]

The most common applications described by Unsloth users and in third-party tutorials fall into a few clusters.
Unsloth occupies a particular niche in the open-source post-training stack. The neighboring frameworks differ in their primary optimization target, their multi-GPU story, and the breadth of training algorithms they support.
| Framework | Primary strength | Multi-GPU support | RL/preference coverage | License |
|---|---|---|---|---|
| Unsloth | Single-GPU speed and VRAM via Triton kernels and hand-derived backward passes; vLLM-backed GRPO[1][9] | Preliminary DDP since Dec 2025; multi-GPU and multi-node gated to Pro/Enterprise[12][20] | SFT, DPO, ORPO, KTO, SimPO, GRPO (via TRL backbone)[5] | Apache 2.0 (library), AGPL-3.0 (Studio) |
| Axolotl | Flexible YAML configs; mature production training; broad model coverage[4][13] | Native DeepSpeed ZeRO 2/3 and FSDP support[4] | SFT, DPO, ORPO, KTO, GRPO via TRL[4] | Apache 2.0 |
| TRL (Hugging Face) | Reference implementations of RL/preference algorithms[5] | Inherits Hugging Face Accelerate, DeepSpeed, FSDP support[4] | PPO, DPO, ORPO, KTO, SimPO, GRPO, reward modeling (this is its core)[4][5] | Apache 2.0 |
| DeepSpeed | ZeRO sharding for very large models; multi-node training[4][13] | First-class; multi-node is its primary use case[4] | Provides the optimizer/sharding layer rather than RL algorithms | Apache 2.0 |
| LLaMA-Factory | Web UI; broadest model menu; easy onboarding[4] | DeepSpeed and FSDP[4] | SFT, DPO, ORPO, KTO, RLHF[4] | Apache 2.0 |
| Torchtune | PyTorch-native, lean codebase[13] | Native FSDP[13] | SFT, DPO, with growing RL coverage[13] | BSD 3-Clause |
The recurring summary from independent benchmark write-ups in 2025 and 2026 is that Unsloth dominates on a single GPU but cedes ground above one device, while the multi-GPU-native frameworks pay an overhead in single-card throughput. One frequently cited 2026 comparison reported that a fine-tuning job on an A100 40 GB took Unsloth 3.2 hours and Axolotl 5.8 hours on the same hardware, a ratio of roughly 1.8 to 1.[4][13] When practitioners need RLHF or DPO at scale on many nodes, TRL plus DeepSpeed or Axolotl plus DeepSpeed remain the default choices; when they have one GPU and need to make it count, Unsloth is generally recommended as the most efficient option.[4][13]
The most consistent criticisms of Unsloth in independent technical writing during 2024 through 2026 concern scaling, architecture coverage, and the relationship between the open-source and commercial offerings.
Unsloth's significance to the open-weights LLM ecosystem is twofold. First, by making fine-tuning of seven- and thirteen-billion-parameter models routine on a single consumer or free-tier cloud GPU, it lowered the practical floor for who can specialize an open base model. Many of the popular 2024 and 2025 community fine-tunes of Llama 3 and Mistral 7B were trained using Unsloth's notebooks, and the GRPO recipe in particular drove a wave of reasoning fine-tunes immediately after the DeepSeek-R1 release.[9][21]
Second, through Daniel Han's bug reports, the project effectively became one of the de facto QA shops for open-weight releases. Fixes that Unsloth proposed for Gemma, Llama, Phi, and (later) gpt-oss propagated into Hugging Face Transformers, llama.cpp, and the upstream model cards, often with measurable effects on benchmark scores.[6][7][11] This positioned Unsloth as more than a kernel library: it became a frequently cited point of reference for whether a newly released open model was in fact correctly implemented in the surrounding open-source stack.
The closest neighbors of Unsloth in the open-source LLM tooling space are TRL (Hugging Face's reference trainer for RLHF-adjacent algorithms), PEFT (Hugging Face's parameter-efficient fine-tuning library that implements LoRA and other adapter techniques), DeepSpeed (Microsoft's distributed training and ZeRO sharding system), and the Liger Kernel (a separate set of fused Triton kernels for LLM training).[4][5][15] On the inference side, Unsloth's prequantized weights are routinely consumed through vLLM, llama.cpp, and Ollama; on the optimization side, Unsloth's GRPO loop builds on the same algorithm popularized by DeepSeek-R1.[9] The library is also frequently discussed alongside supervised fine-tuning tutorials for individual model families such as Gemma 3, Llama 4, and Qwen3.[1][17]