# Unsloth

> Source: https://aiwiki.ai/wiki/unsloth
> Updated: 2026-06-24
> Categories: Developer Tools, Open Source AI, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

Unsloth is an open-source Python library that fine-tunes large language models up to two times faster while using up to 70 percent less GPU memory and, in its own words, with "no accuracy loss".[^1][^2] It was created in 2023 by Australian brothers Daniel and Michael Han, who rewrote the heaviest computational paths of transformer training (attention, rotary position embeddings, root mean square normalization, cross-entropy loss, and several other hot operations) as OpenAI Triton kernels with hand-derived backward passes. The advertised speedups are measured against a baseline of Hugging Face Transformers, the TRL trainer, and the bitsandbytes 4-bit kernels.[^1][^3][^22] Unsloth supports parameter-efficient methods such as [LoRA](/wiki/lora) and [QLoRA](/wiki/qlora), full fine-tuning, continued pretraining, and a wide range of preference-tuning and reinforcement learning techniques including [DPO](/wiki/dpo), [ORPO](/wiki/orpo), [KTO](/wiki/kto), and [GRPO](/wiki/grpo).[^4][^5] The core library is released under the Apache 2.0 license on GitHub under the organization name `unslothai`, while the company behind it (Unsloth AI) was admitted to Y Combinator's Summer 2024 batch, where it is described as building "Open-Source Reinforcement Learning (RL) & Fine-tuning for LLMs".[^1][^2] By mid-2026 the repository had passed roughly 67,000 GitHub stars, and the team reported more than ten million monthly downloads of its prequantized model weights on the Hugging Face Hub.[^2][^22][^23]

## Overview

| Field | Value |
|-------|-------|
| Project name | Unsloth |
| Founders | Daniel Han, Michael Han |
| Company | Unsloth AI |
| Year founded | 2023 |
| Headquarters | San Francisco, California (originally Sydney, Australia) |
| Y Combinator batch | Summer 2024 |
| Source repository | github.com/unslothai/unsloth |
| Core library license | Apache 2.0 |
| Studio UI license | AGPL-3.0 |
| Primary language | Python (with OpenAI [Triton](/wiki/triton) kernels) |
| Main use case | Fine-tuning, RL, and quantization of open LLMs |
| GitHub stars (mid-2026) | ~67,000[^22][^23] |
| Reported funding | ~$500,000 seed (2024)[^8] |
| Headline claim | ~2x faster training, up to ~70% less VRAM, no accuracy loss versus a FlashAttention 2 plus Hugging Face baseline[^1][^22] |

## What is Unsloth used for?

Unsloth is used to fine-tune, post-train, and quantize open-weight large language models on a single GPU, including hardware as modest as a free Google Colab or [Kaggle](/wiki/kaggle) Tesla T4. The library's central promise is that a job which would normally require an expensive multi-GPU server or a paid cloud session can be made to fit on one consumer card while remaining mathematically equivalent to a standard training run. The most common workflows are supervised fine-tuning of an instruction model, [QLoRA](/wiki/qlora) adapter training, reasoning-model training with [GRPO](/wiki/grpo), preference optimization with [DPO](/wiki/dpo) or [ORPO](/wiki/orpo), continued pretraining, vision fine-tuning, and local quantization to the GGUF format for inference through [llama.cpp](/wiki/llama_cpp).[^1][^5][^9][^17] The GitHub README summarizes the scope succinctly: Unsloth can train "500+ models" "up to 2x faster" with "up to 70% less VRAM".[^22]

## History

### Who created Unsloth and when?

Daniel and Michael Han began Unsloth in late 2023 as an open-source side project aimed at making single-GPU fine-tuning of [LLaMA](/wiki/llama) derivatives substantially faster. On the company's own about page the founders introduce themselves plainly: "We started as a team of two brothers!", with Daniel handling "Software, Data, Algorithms" and Michael handling "Design, Product, Engineer".[^24] Daniel Han had previously worked as an engineer at [Nvidia](/wiki/nvidia) on optimization-heavy software, and prior to that he had built and maintained Hyperlearn, a small linear algebra package focused on numerically stable, low-memory implementations of classical machine learning algorithms that the founders say has been used by organizations including Microsoft, NVIDIA, and NASA.[^2][^24] Michael Han contributed product engineering and design alongside fine-tuning support work. The initial release shipped a set of Google Colab notebooks demonstrating that supervised fine-tuning of a 7-billion parameter LLaMA-style model could be completed on a free Tesla T4 GPU in a fraction of the time and memory needed by a stock Hugging Face configuration; early benchmarks circulated under headlines such as "five times faster" because some configurations on Kaggle's two-GPU T4 instances delivered roughly that uplift over the standard `transformers` plus `bitsandbytes` baseline.[^1][^6]

### How did Unsloth build its reputation as a bug-fixer? (2024)

During 2024, Unsloth gained visibility not only for its kernels but also because Daniel Han began publishing detailed bug reports on flagship open models. The team identified and fixed eight separate issues in Google's [Gemma](/wiki/gemma) release, several tokenization defects in Meta's Llama 3 family, and a sliding-window-attention defect affecting Microsoft's [Phi-3](/wiki/phi_3) at 2048-token windows.[^6][^7] These fixes propagated back into Hugging Face Transformers, llama.cpp, and other downstream packages, giving Unsloth an unusually visible role in the open weights ecosystem despite its small team size. Daniel Han gave a widely circulated talk at the AI Engineer World's Fair 2024 titled "Fixing bugs in Gemma, Llama and Phi-3," which summarized this work for a broad practitioner audience.[^7]

### Is Unsloth a Y Combinator startup? (2024)

Unsloth AI, Inc. was admitted to Y Combinator's Summer 2024 batch and was publicly described as a company building "Open-Source Reinforcement Learning (RL) & Fine-tuning for LLMs."[^2] Public records list the seed-stage round at roughly $500,000 with backing including Y Combinator, the GitHub Accelerator program, and Microsoft's M12 venture arm.[^8] Headquarters relocated to San Francisco, while the founders continued to maintain strong ties to the Australian developer community where the project began. Team size reported on the company's YC profile in 2026 was eight people, with a posted founding ML engineer role offering 0.30 percent to 0.70 percent equity.[^2]

### What did Unsloth release in 2025?

Through 2025 the project released a steady stream of updates covering preference optimization, reasoning training, quantization, and platform support:

- February 2025: An end-to-end recipe for training reasoning models with [GRPO](/wiki/grpo) (Group Relative Policy Optimization), the algorithm that underlies [DeepSeek-R1](/wiki/deepseek_r1), paired with [vLLM](/wiki/vllm) integration that enabled concurrent generation and training on a single GPU. Unsloth reported that the recipe required only 7 GB of VRAM when applied to a 1.5-billion-parameter Qwen2.5 backbone, compared with prior implementations that needed two A100 cards (roughly 160 GB), and the GitHub README advertises up to 80 percent less VRAM for GRPO reinforcement learning.[^9][^22]
- April through August 2025: Vision and multimodal fine-tuning matured. [Llama 3.2](/wiki/llama_3_2) Vision (11B and 90B), Qwen2-VL, and Pixtral were brought under the same Triton-accelerated stack, with the team reporting 1.5 to 2 times faster training and up to 70 percent memory savings versus a [FlashAttention](/wiki/flashattention) 2 baseline, and the option to selectively unfreeze vision-only, language-only, attention, or MLP submodules.[^10]
- August 8, 2025: Support for OpenAI's [gpt-oss](/wiki/gpt_oss) 20B and 120B models on release day, with custom training functions for the MXFP4 weight format and direct integration with the Harmony tokenization library. Unsloth reported 1.5 times faster training, over 50 percent less VRAM, and 5 times longer supported context length than baseline approaches, claiming that gpt-oss-20B could be QLoRA fine-tuned within roughly 14 GB of VRAM.[^11]
- December 2025: A "December Release" tagged on GitHub introduced new Triton kernels, padding-free training with sequence packing, and preliminary multi-GPU support via Distributed Data Parallel, with the team summarizing the changes as "3x faster training, 30% less VRAM" versus the prior Unsloth baseline.[^12]

### What is the status of Unsloth in 2026?

By mid-2026 the GitHub repository's headline summary described the project as a stack for "training and running open models like Gemma 4, Qwen3.6, DeepSeek, gpt-oss locally," and the team had introduced Unsloth Studio, a desktop and web UI built on top of the core library for users who prefer to point and click through fine-tuning, dataset construction, and local inference.[^1][^22] The README also added a mixture-of-experts training path, advertising 12 times faster MoE training with 35 percent less VRAM for models such as DeepSeek, GLM, Qwen, and gpt-oss.[^22] Coverage in independent benchmarks during late 2025 and early 2026 placed Unsloth at the top of single-GPU efficiency tables, with multi-GPU and multi-node operation still considered a relative weakness compared to alternatives like Axolotl and Torchtune.[^4][^13]

## How does Unsloth make fine-tuning faster?

Unsloth's speed and memory claims do not come from a single optimization. They are produced by a stack of overlapping techniques that together change the constants in the training loop while keeping the underlying math equivalent to a standard fine-tuning pass.

### Hand-derived backward passes

The single most distinctive technical choice in Unsloth is that every important operator in a transformer forward pass has a matching backward pass derived analytically by hand, rather than relying on PyTorch's autograd. This approach lets the library fuse what would otherwise be a chain of small operations (matrix multiplications, activation functions, normalizations, and reshapes) into a single Triton kernel that touches each intermediate tensor only once. Eliminating these intermediate clones and transposes is what allows Unsloth to claim large reductions in both wall-clock time and peak memory.[^14] Daniel Han has described this work as a "manual autograd engine with hand-derived matrix calculus backpropagation for peak performance"; in practice it means that adding a new architecture to Unsloth requires the team to write and verify the symbolic gradient for any operator that is not already covered.[^14]
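To make the idea of a hand-derived backward pass concrete, the following is a minimal pure-Python sketch for RMSNorm, checked against a finite-difference gradient. It is an illustration only and is not taken from Unsloth's source: the function names are hypothetical, the code runs on plain lists instead of GPU tensors, and no kernel fusion is performed.

```python
import math

def rmsnorm_forward(x, w, eps=1e-6):
    """y_i = w_i * x_i / sqrt(mean(x^2) + eps); returns y and a small cache."""
    n = len(x)
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / n + eps)
    x_hat = [v * inv_rms for v in x]
    y = [wi * xh for wi, xh in zip(w, x_hat)]
    return y, (x_hat, inv_rms)

def rmsnorm_backward(grad_y, w, cache):
    """Analytic gradients, derived by hand from the forward formula:
    dL/dx_j = inv_rms * (w_j g_j - x_hat_j * mean_i(w_i g_i x_hat_i))
    dL/dw_i = g_i * x_hat_i
    """
    x_hat, inv_rms = cache
    n = len(x_hat)
    gw = [g * wi for g, wi in zip(grad_y, w)]
    mean_term = sum(a * xh for a, xh in zip(gw, x_hat)) / n
    grad_x = [inv_rms * (a - xh * mean_term) for a, xh in zip(gw, x_hat)]
    grad_w = [g * xh for g, xh in zip(grad_y, x_hat)]
    return grad_x, grad_w

def numeric_grad_x(x, w, grad_y, h=1e-6):
    """Central finite differences, used only to verify the analytic result."""
    grads = []
    for j in range(len(x)):
        xp = list(x); xp[j] += h
        xm = list(x); xm[j] -= h
        yp, _ = rmsnorm_forward(xp, w)
        ym, _ = rmsnorm_forward(xm, w)
        grads.append(sum(g * (a - b) for g, a, b in zip(grad_y, yp, ym)) / (2 * h))
    return grads

x = [0.5, -1.2, 2.0, 0.3]
w = [1.0, 0.8, 1.5, -0.4]
grad_y = [0.1, -0.2, 0.3, 0.4]  # upstream gradient dL/dy

y, cache = rmsnorm_forward(x, w)
grad_x, grad_w = rmsnorm_backward(grad_y, w, cache)
max_err = max(abs(a - b) for a, b in zip(grad_x, numeric_grad_x(x, w, grad_y)))
```

The backward function needs only the normalized input and one scalar from the forward pass, which is the kind of saving a closed-form gradient makes possible compared with recording every intermediate operation.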

### Custom OpenAI Triton kernels

Unsloth ships custom kernels written in OpenAI [Triton](/wiki/triton) for several hot operators:

- A fused [rotary position embedding](/wiki/rope) (RoPE) kernel that combines the query/key projection with the rotation in a single in-place pass, reported in external write-ups to deliver a 2.3 times speedup at long context lengths and 1.9 times at shorter ones for the rotation step itself.[^14]
- A fused [RMSNorm](/wiki/rmsnorm) kernel that performs the variance computation and rescaling in a single launch, with the backward pass written by hand to avoid materializing the standard intermediate buffers.[^14]
- A fused [cross-entropy loss](/wiki/cross_entropy_loss) kernel that combines the final linear projection (the "language modeling head") with the softmax and the loss reduction in a streaming fashion, so that the full vocabulary logits never have to be held in memory at once. Because vocabulary size in modern open models often exceeds 128,000 tokens, this single fusion accounts for a large share of Unsloth's memory savings on long sequences.[^15]

These kernels target the same hot operators that other libraries such as the Liger Kernel attempt to optimize, but Unsloth combines them with its hand-written autograd and its training-loop changes to deliver the end-to-end speedups it advertises.[^15]
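The streaming idea behind the fused cross-entropy kernel can be sketched in plain Python. The example below processes the vocabulary in chunks and keeps only a running maximum and a running sum (an online log-sum-exp), so the full logit vector is never stored. It is a forward-pass illustration under simplifying assumptions: one token position, list-based arithmetic, and hypothetical function names; it does not reproduce Unsloth's Triton code or its backward pass.

```python
import math
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def streaming_cross_entropy(hidden, lm_head, target, chunk_size):
    """Cross-entropy for one position, visiting vocabulary rows chunk by chunk."""
    running_max = -math.inf
    running_sum = 0.0
    target_logit = None
    for start in range(0, len(lm_head), chunk_size):
        rows = lm_head[start:start + chunk_size]
        logits = [dot(hidden, row) for row in rows]  # only this chunk is alive
        if start <= target < start + len(rows):
            target_logit = logits[target - start]
        new_max = max(running_max, max(logits))
        # Rescale the old sum to the new maximum, then add this chunk.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(l - new_max) for l in logits)
        running_max = new_max
    return running_max + math.log(running_sum) - target_logit

def full_cross_entropy(hidden, lm_head, target):
    """Reference version that materializes every logit at once."""
    logits = [dot(hidden, row) for row in lm_head]
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits)) - logits[target]

rng = random.Random(0)
hidden = [rng.uniform(-1, 1) for _ in range(4)]
lm_head = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(10)]

loss_streamed = streaming_cross_entropy(hidden, lm_head, target=7, chunk_size=3)
loss_full = full_cross_entropy(hidden, lm_head, target=7)
```

Both functions return the same loss up to floating-point rounding; the difference is that the streamed version holds at most `chunk_size` logits at a time, which matters when the vocabulary has more than 100,000 entries.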

### QLoRA and adapter-only training

A large fraction of Unsloth usage takes the form of [QLoRA](/wiki/qlora) style fine-tuning: the base model is held in a 4-bit quantized form supplied by `bitsandbytes`, while small low-rank adapters in higher precision are the only weights that actually receive gradient updates. The library composes its custom kernels around the quantized matrix multiplications so that the dequantize-then-multiply step is also fused into the Triton pipeline, removing one of the largest sources of overhead in vanilla `bitsandbytes` plus `peft` training.[^1][^14]
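The division of labor in QLoRA-style training can be illustrated with a toy forward pass. The sketch below stores the base weight as 4-bit integer codes plus one scale per row, dequantizes on the fly, and adds a scaled low-rank update `B @ A`. Several things here are assumptions made for brevity: the quantizer is a simple symmetric absmax scheme, not the NF4 format that `bitsandbytes` uses, the helper names are hypothetical, and nothing is fused.

```python
def quantize_row(row, bits=4):
    """Symmetric absmax quantization of one weight row to signed integer codes."""
    levels = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    scale = (max(abs(v) for v in row) / levels) or 1.0
    codes = [round(v / scale) for v in row]
    return codes, scale

def dequantize_row(codes, scale):
    return [c * scale for c in codes]

def matvec(matrix, x):
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

def qlora_forward(q_weight, A, B, x, alpha):
    """y = dequant(W_q) x + (alpha / r) * B (A x); only A and B would be trained."""
    r = len(A)
    base = matvec([dequantize_row(codes, scale) for codes, scale in q_weight], x)
    low_rank = matvec(B, matvec(A, x))
    return [b + (alpha / r) * l for b, l in zip(base, low_rank)]

W = [[0.7, -0.2, 0.1], [0.05, 0.9, -0.3]]          # frozen base weight (2 x 3)
q_weight = [quantize_row(row) for row in W]          # stored as codes + scales
A = [[0.01, -0.02, 0.03], [0.02, 0.01, -0.01]]       # adapter A (r=2 x 3)
B_zero = [[0.0, 0.0], [0.0, 0.0]]                    # adapter B starts at zero
B_trained = [[0.5, -0.5], [0.25, 0.75]]              # hypothetical values after training
x = [1.0, 2.0, -1.0]

base_only = matvec([dequantize_row(c, s) for c, s in q_weight], x)
y_start = qlora_forward(q_weight, A, B_zero, x, alpha=4)
y_tuned = qlora_forward(q_weight, A, B_trained, x, alpha=4)
```

With `B` initialized to zero the adapted layer reproduces the quantized base layer exactly, so training starts from the pretrained behavior and only the small `A` and `B` matrices move.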

Unsloth also supports full-parameter fine-tuning, 8-bit and 16-bit LoRA, and (since the December 2025 release) FP8 training on consumer GPUs that expose the required instructions.[^12]

### Padding-free packing

Many LLM training datasets contain sequences of widely varying length; the standard approach pads all sequences in a batch to the longest one and wastes compute on the padding tokens. The December 2025 release introduced padding-free training with example packing: short examples are concatenated into long packed sequences, with attention masks rewritten so that the model does not attend across example boundaries. Unsloth attributes a substantial part of the headline "3x faster training, 30% less VRAM" December figure to this change.[^12]
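A small sketch shows what packing changes. The code below concatenates three examples into one sequence, restarts position ids at each boundary, and builds a causal mask that is also block-diagonal so that no token attends to an earlier example. The function names are hypothetical, and the dense boolean mask is used only for readability; production implementations typically pass sequence boundaries to a variable-length attention kernel instead of materializing such a mask.

```python
def pack_examples(examples):
    """Concatenate token lists; record which example each token came from."""
    tokens, segment_ids, position_ids = [], [], []
    for seg, example in enumerate(examples):
        tokens.extend(example)
        segment_ids.extend([seg] * len(example))
        position_ids.extend(range(len(example)))  # positions restart per example
    return tokens, segment_ids, position_ids

def packed_causal_mask(segment_ids):
    """mask[i][j] is True when token i may attend to token j."""
    n = len(segment_ids)
    return [
        [j <= i and segment_ids[i] == segment_ids[j] for j in range(n)]
        for i in range(n)
    ]

examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
tokens, segment_ids, position_ids = pack_examples(examples)
mask = packed_causal_mask(segment_ids)

# Token slots consumed by padding every example to the longest one,
# versus slots consumed by the packed sequence.
padded_slots = len(examples) * max(len(e) for e in examples)
packed_slots = len(tokens)
```

In this toy batch padding would occupy 12 token slots while the packed sequence occupies 9, and the saving grows as example lengths become more uneven.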

### What is Unsloth Dynamic 2.0 quantization?

In December 2024 Unsloth introduced a quantization scheme it calls Dynamic 4-bit (later updated as Dynamic 2.0 GGUFs).[^16][^17] Rather than apply a uniform 4-bit quantization to every weight, the method profiles each transformer block's sensitivity to precision loss and elects to leave certain parameters (typically embeddings and the earliest and latest attention blocks) at higher precision while compressing the middle feed-forward layers more aggressively. The team reports that the technique recovers most of the accuracy lost by stock `bitsandbytes` 4-bit while using less than ten percent more VRAM, and that on the Llama 3.2 Vision 11B and Qwen2 Vision 2B models it restored semantic details that the default 4-bit quantizer dropped or corrupted.[^16] Dynamic 2.0 GGUFs extend the same idea to the GGUF [quantization](/wiki/quantization) format used by `llama.cpp`, with quantization choices made per layer per model so that the scheme used for [Gemma 3](/wiki/gemma_3) differs from the scheme used for [Llama 4](/wiki/llama_4).[^17]
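The general shape of a sensitivity-driven mixed-precision plan can be sketched as follows. This is a toy illustration and not Unsloth's actual procedure: it uses the relative reconstruction error of a simple absmax quantizer as a stand-in for "sensitivity", the layer names and the threshold are hypothetical, and real schemes may rely on activation or calibration statistics that are not modeled here.

```python
def absmax_quantize(weights, bits):
    """Quantize to signed integers with one shared scale, then dequantize."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]

def relative_error(weights, bits):
    """Root of (squared reconstruction error / squared weight norm)."""
    approx = absmax_quantize(weights, bits)
    num = sum((w - a) ** 2 for w, a in zip(weights, approx))
    den = sum(w * w for w in weights)
    return (num / den) ** 0.5

def choose_bits(layers, threshold, low_bits=4, high_bits=16):
    """Keep a layer at high precision when low-bit quantization distorts it too much."""
    plan = {}
    for name, weights in layers.items():
        too_lossy = relative_error(weights, low_bits) > threshold
        plan[name] = high_bits if too_lossy else low_bits
    return plan

layers = {
    # One large value forces a coarse scale, so the small weights round to zero.
    "embed_tokens": [1.0, 0.05, -0.04, 0.06, -0.05, 0.03],
    # Evenly spread weights survive 4-bit quantization almost unchanged.
    "layers.12.mlp": [0.7, -0.6, 0.3, -0.2, 0.5, 0.1],
}
plan = choose_bits(layers, threshold=0.05)
```

The resulting plan keeps the first layer at 16 bits and compresses the second to 4 bits, mirroring the pattern described above in which a minority of parameters stays at higher precision for a modest increase in memory.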

### vLLM integration for online RL

The February 2025 GRPO release added a tight integration with [vLLM](/wiki/vllm) so that the inference engine used to sample on-policy completions can share the same GPU and weights as the policy that is being trained. Unsloth reports that this integration delivered roughly twenty times more throughput on the rollout phase compared with running generation through `transformers` with the same hardware, which in turn made GRPO-style reinforcement learning feasible on a single 16 GB T4 GPU.[^9]
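The rollout phase matters because GRPO scores a group of sampled completions for each prompt and derives each completion's advantage from the group's own statistics, with no separate value model. A minimal sketch of that step is shown below. The reward function and the sample strings are hypothetical, and the normalization uses the population standard deviation with a small epsilon; individual implementations differ in such details.

```python
import statistics

def exact_match_reward(completion, answer):
    """Toy reward: 1.0 when the stripped completion equals the reference answer."""
    return 1.0 if completion.strip() == answer else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each sample = (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions sampled for the same prompt (in practice by the rollout engine).
completions = ["42", " 41", "42 ", "7"]
rewards = [exact_match_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group average receive positive advantages and the rest receive negative ones; when every completion earns the same reward all advantages are zero and the group contributes no learning signal, which is one reason fast, plentiful sampling is valuable.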

### Does Unsloth support multi-GPU training?

Until the December 2025 release the open-source library was strictly single-GPU, and external commentators repeatedly singled this out as Unsloth's most prominent weakness relative to Axolotl, LLaMA-Factory, and Torchtune, all of which had supported multi-GPU and multi-node training for some time.[^4][^13] The December 2025 release shipped a Distributed Data Parallel guide and basic `accelerate launch` and `torchrun --nproc_per_node` support; Unsloth explicitly described this as preliminary and noted that a fuller multi-GPU release was planned for 2026.[^12][^18] For models too large to fit on a single GPU, the library also exposes a `device_map="balanced"` argument that splits weights across devices.[^18]
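Distributed Data Parallel keeps a full copy of the model on every device, gives each copy a different slice of the batch, and averages the resulting gradients before the optimizer step. The sketch below simulates that arithmetic in a single process for a one-parameter linear model. It is a conceptual illustration only: the function names are hypothetical, the in-process average stands in for the all-reduce that a real launcher-managed job performs across GPUs, and equal shard sizes are assumed.

```python
def mse_grad(w, batch):
    """Gradient of mean((w * x - y)^2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def ddp_step(w, batch, world_size, lr):
    """One simulated data-parallel step with equally sized shards."""
    shard = len(batch) // world_size
    local_grads = [
        mse_grad(w, batch[rank * shard:(rank + 1) * shard])
        for rank in range(world_size)
    ]
    avg_grad = sum(local_grads) / world_size  # stand-in for an all-reduce mean
    return w - lr * avg_grad, avg_grad

batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
new_w, avg_grad = ddp_step(w=0.5, batch=batch, world_size=2, lr=0.01)
single_device_grad = mse_grad(0.5, batch)
```

Because the shards are the same size, the averaged gradient matches the gradient a single device would compute on the whole batch, so data parallelism increases throughput without changing the update. It does not reduce per-device memory, which is why the weight-splitting option mentioned above addresses a different problem.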

## Which models does Unsloth support?

Unsloth's coverage tracks the popular open-weight ecosystem closely. As of mid-2026 the project supports more than five hundred model variants spanning the following families:[^22]

- Meta's [LLaMA](/wiki/llama) family: LLaMA 1 and 2 (legacy), [Llama 3](/wiki/llama_3), [Llama 3.1](/wiki/llama_3_1), [Llama 3.2](/wiki/llama_3_2) (including the 11B and 90B vision variants), Llama 3.3, and [Llama 4](/wiki/llama_4).[^1][^22]
- Mistral AI: [Mistral 7B](/wiki/mistral_7b), [Mixtral](/wiki/mixtral) 8x7B and 8x22B, the Ministral checkpoints, and the more recent dense and mixture-of-experts models.[^1][^22]
- Google DeepMind: [Gemma](/wiki/gemma), [Gemma 2](/wiki/gemma_2), [Gemma 3](/wiki/gemma_3), the EmbeddingGemma model, and the Gemma 4 series including E2B and E4B variants visible on Unsloth's Hugging Face organization.[^1][^17][^22]
- Alibaba: [Qwen](/wiki/qwen) 2 and 2.5, [Qwen3](/wiki/qwen_3) (including 4B, 14B, and 32B variants, with the QwQ-32B reasoning model receiving its own dynamic quantization), and the Qwen3.5 and Qwen3.6 families referenced in the 2026 repository description.[^1][^22]
- Microsoft Research: [Phi-3](/wiki/phi_3) and [Phi-4](/wiki/phi_4), for both of which Unsloth published its own bug-fix variants. The team reports that "four bugs in Phi-4" were fixed in the Unsloth release, materially improving evaluation scores.[^6][^19]
- DeepSeek: [DeepSeek-R1](/wiki/deepseek_r1) (with dedicated GGUF and dynamic quantization releases), DeepSeek-V3, and the smaller distillations. The README also lists GLM, a separate model family from Zhipu AI, among the supported models.[^9][^17][^22]
- OpenAI: [gpt-oss](/wiki/gpt_oss) 20B and 120B since the August 2025 release, with custom MXFP4 training kernels.[^11]
- A range of text-to-speech, embedding, and vision-language models added through 2025 and 2026, including selectively fine-tunable Llama 3.2 Vision configurations.[^10]

Unsloth maintains a corresponding `unsloth` organization on Hugging Face hosting prequantized `bnb-4bit` and `unsloth-bnb-4bit` checkpoints of these models, along with GGUF conversions; the company has reported in excess of ten million monthly downloads across these artifacts.[^2]

## Variants and distribution

Unsloth is distributed through several mutually reinforcing surfaces:

- The Python package `unsloth` on PyPI and GitHub, installable into any [PyTorch](/wiki/pytorch) environment and licensed under Apache 2.0. This is the core library that ships the Triton kernels, the autograd code, the model adapters, and the training utilities.[^1]
- A constantly updated catalogue of Google Colab and Kaggle notebooks covering supervised fine-tuning, DPO, ORPO, KTO, GRPO, continued pretraining, vision fine-tuning, and gpt-oss specific workflows. Many of these notebooks are usable on the free Tesla T4 GPU tier of [Kaggle](/wiki/kaggle) or Colab, which is itself a deliberate marketing channel for the library.[^5][^9]
- Prequantized model weights on the Hugging Face Hub under the `unsloth` organization, including both standard `bnb-4bit` checkpoints and Unsloth's selectively quantized `unsloth-bnb-4bit` variants and Dynamic 2.0 GGUFs.[^16][^17]
- Unsloth Studio, a graphical front end released in 2025 that wraps the core library and adds dataset construction from PDF, CSV, and JSON, model comparison ("Model Arena"), an OpenAI-compatible local inference server, and an offline mode for Mac, Windows, Linux, and WSL. The Studio component is licensed AGPL-3.0 while the underlying library remains Apache 2.0.[^1][^22]
- A commercial Pro and Enterprise offering with multi-GPU and multi-node features, higher reported speedups (up to 2.5x in Pro, with claims of 30x and 90 percent VRAM reduction in Enterprise), and customer-specific deployment options. Pricing for these tiers is not publicly listed.[^20]

## Applications

The most common applications described by Unsloth users and in third-party tutorials fall into a few clusters:

- Domain or company-specific supervised fine-tuning of open base models on instruction datasets, where the appeal of Unsloth is that the same job that previously required an A100 (or a paid Colab Pro session) can be made to fit on a free T4 or a single consumer RTX card.[^21]
- Reasoning-model training with GRPO, in which the practitioner supplies a question-answer pair plus a reward function and uses Unsloth's vLLM-backed GRPO loop to induce chain-of-thought style behavior in a small open-weight backbone. This pattern grew rapidly in popularity after the DeepSeek-R1 paper appeared, and Unsloth's "train your own R1 reasoning model locally" recipe was the dominant practical entry point on consumer hardware.[^9]
- Preference optimization with [DPO](/wiki/dpo), [ORPO](/wiki/orpo), and [KTO](/wiki/kto) against pairwise or thumbs-up/thumbs-down feedback. Unsloth's preference-optimization documentation covers these alongside SimPO and other [RLHF](/wiki/rlhf)-adjacent algorithms, with code that delegates the training loop to TRL while substituting Unsloth's kernels for the heavy operators.[^5]
- Vision fine-tuning of Llama 3.2 Vision and related vision-language models, where Unsloth's ability to selectively unfreeze the vision encoder, the language model, the cross-attention layers, or the MLPs is particularly useful for domain transfer tasks like document understanding or chart analysis.[^10]
- Local quantization and inference: many users adopt Unsloth's Dynamic 2.0 GGUFs as drop-in replacements for community GGUF conversions when running models through [llama.cpp](/wiki/llama_cpp) or local desktop apps, because the dynamic per-layer scheme preserves more accuracy at a given bit width than uniform quantization.[^16][^17]
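For the preference-optimization use case listed above, the core of DPO is a single loss over one chosen and one rejected response. The sketch below writes that loss for scalar sequence log-probabilities in plain Python. It is a minimal illustration of the published DPO objective and is not code from TRL or Unsloth; the function name, the sample log-probabilities, and the `beta` value are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)), where the margin compares how much the policy
    has shifted toward the chosen response relative to a frozen reference model.
    All four inputs are summed log-probabilities of full responses."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))

# Policy identical to the reference: the margin is zero and the loss is log(2).
loss_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy now assigns more probability to the chosen response and less to the rejected one.
loss_improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss falls as the policy widens the gap between chosen and rejected responses beyond the gap the reference model already had, which is the signal the trainer optimizes.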

## How does Unsloth compare with other fine-tuning frameworks?

Unsloth occupies a particular niche in the open-source post-training stack. The neighboring frameworks differ in their primary optimization target, their multi-GPU story, and the breadth of training algorithms they support.

| Framework | Primary strength | Multi-GPU support | RL/preference coverage | License |
|-----------|------------------|-------------------|------------------------|---------|
| Unsloth | Single-GPU speed and VRAM via Triton kernels and hand-derived backward passes; vLLM-backed GRPO[^1][^9] | Preliminary DDP since Dec 2025; multi-GPU and multi-node gated to Pro/Enterprise[^12][^20] | SFT, [DPO](/wiki/dpo), [ORPO](/wiki/orpo), [KTO](/wiki/kto), SimPO, [GRPO](/wiki/grpo) (via TRL backbone)[^5] | Apache 2.0 (library), AGPL-3.0 (Studio) |
| Axolotl | Flexible YAML configs; mature production training; broad model coverage[^4][^13] | Native [DeepSpeed](/wiki/deepspeed) ZeRO 2/3 and [FSDP](/wiki/fsdp) support[^4] | SFT, DPO, ORPO, KTO, GRPO via TRL[^4] | Apache 2.0 |
| TRL (Hugging Face) | Reference implementations of RL/preference algorithms[^5] | Inherits Hugging Face Accelerate, DeepSpeed, FSDP support[^4] | PPO, DPO, ORPO, KTO, SimPO, GRPO, reward modeling (this is its core)[^4][^5] | Apache 2.0 |
| [DeepSpeed](/wiki/deepspeed) | ZeRO sharding for very large models; multi-node training[^4][^13] | First-class; multi-node is its primary use case[^4] | Provides the optimizer/sharding layer rather than RL algorithms | Apache 2.0 |
| LLaMA-Factory | Web UI; broadest model menu; easy onboarding[^4] | DeepSpeed and FSDP[^4] | SFT, DPO, ORPO, KTO, RLHF[^4] | Apache 2.0 |
| Torchtune | PyTorch-native, lean codebase[^13] | Native FSDP[^13] | SFT, DPO, with growing RL coverage[^13] | BSD 3-Clause |

The recurring summary from independent benchmark write-ups in 2025 and 2026 is that Unsloth dominates on a single GPU but cedes ground above one device, while the multi-GPU-native frameworks pay an overhead in single-card throughput. One frequently cited 2026 comparison reported that an A100 40 GB fine-tuning job that took Unsloth 3.2 hours took Axolotl 5.8 hours on the same hardware, a ratio of roughly 1.8 to 1.[^4][^13] When practitioners need RLHF or DPO at scale on many nodes, TRL plus DeepSpeed or Axolotl plus DeepSpeed remain the default choices; when they have one GPU and need to make it count, Unsloth is generally recommended as the most efficient option.[^4][^13]

## What are the limitations and criticisms of Unsloth?

The most consistent criticisms of Unsloth in independent technical writing during 2024 through 2026 concern scaling, architecture coverage, and the relationship between the open-source and commercial offerings:

- Single-GPU lock-in until late 2025. For roughly two years after launch, the open-source library could not be straightforwardly used across multiple GPUs, with multi-GPU training reserved for the commercial Pro tier. The December 2025 release added preliminary DDP support, but third-party reviewers continued to recommend Axolotl, Torchtune, or LLaMA-Factory for any serious multi-node workload.[^4][^12][^13]
- Bench-vs-real-world gap. Independent comparisons typically reproduce Unsloth's relative ranking but with smaller absolute speedups than the official "2x faster, 70% less VRAM" headline, especially when the baseline already uses FlashAttention 2, well-tuned gradient checkpointing, and modern Hugging Face Transformers. The official numbers are usually quoted versus a less optimized baseline.[^4][^13]
- Custom autograd surface. Because every important operator's backward pass is written by hand, supporting a new architecture or a new fused operator requires the Unsloth team to do non-trivial mathematical work. Users on emerging architectures sometimes have to wait for an Unsloth release before fine-tuning is possible, even when the model itself is supported by Hugging Face Transformers from day one.[^14]
- Tight coupling to TRL and bitsandbytes. Unsloth depends on TRL for the high-level training loops of DPO, ORPO, KTO, and GRPO, and on bitsandbytes for the 4-bit quantization that underpins QLoRA-style fine-tuning. Regressions or behavior changes in those upstream libraries occasionally propagate into Unsloth before they can be patched.[^5][^14]
- Open-core licensing concerns. While the Apache 2.0 license on the core library is permissive, the Studio UI ships under the copyleft AGPL-3.0, and the higher-performance multi-GPU and multi-node features are gated to undisclosed commercial pricing. Several reviewers have characterized this as a standard open-core arrangement that nevertheless requires due diligence for enterprise users.[^20]

## Why is Unsloth significant?

Unsloth's significance to the open-weights LLM ecosystem is twofold. First, by making fine-tuning of seven and thirteen billion parameter models routine on a single consumer or free-tier cloud GPU, it lowered the practical floor for who can specialize an open base model. Many of the popular 2024 and 2025 community fine-tunes of Llama 3 and Mistral 7B were trained using Unsloth's notebooks, and the GRPO recipe in particular drove a wave of reasoning fine-tunes immediately after the DeepSeek-R1 release.[^9][^21]

Second, through Daniel Han's bug reports the project effectively became one of the de facto QA shops for open weight releases. Fixes that Unsloth proposed for Gemma, Llama, Phi, and (later) gpt-oss propagated into Hugging Face Transformers, llama.cpp, and the upstream model cards, often with measurable effects on benchmark scores.[^6][^7][^11] This positioned Unsloth as more than a kernel library: it became a frequently cited point of reference for whether a newly released open model was in fact correctly implemented in the surrounding open-source stack.

## Related work

The closest neighbors of Unsloth in the open-source LLM tooling space are TRL (Hugging Face's reference trainer for [RLHF](/wiki/rlhf)-adjacent algorithms), [PEFT](/wiki/huggingface_peft) (Hugging Face's parameter-efficient fine-tuning library that implements [LoRA](/wiki/lora) and other adapter techniques), [DeepSpeed](/wiki/deepspeed) (Microsoft's distributed training and ZeRO sharding system), and the Liger Kernel (a separate set of fused Triton kernels for LLM training).[^4][^5][^15] On the inference side, Unsloth's prequantized weights are routinely consumed through [vLLM](/wiki/vllm), [llama.cpp](/wiki/llama_cpp), and Ollama; on the optimization side, Unsloth's GRPO loop builds on the same algorithm popularized by [DeepSeek-R1](/wiki/deepseek_r1).[^9] The library is also frequently discussed alongside [supervised fine-tuning](/wiki/supervised_fine-tuning) tutorials and the broader practice of [fine-tuning](/wiki/fine_tuning) for individual model families such as [Gemma 3](/wiki/gemma_3), [Llama 4](/wiki/llama_4), and [Qwen3](/wiki/qwen_3).[^1][^17]

## See also

- [LoRA](/wiki/lora)
- [QLoRA](/wiki/qlora)
- [Fine-tuning](/wiki/fine_tuning)
- [Hugging Face PEFT](/wiki/huggingface_peft)
- [DPO](/wiki/dpo)
- [ORPO](/wiki/orpo)
- [KTO](/wiki/kto)
- [GRPO](/wiki/grpo)
- [Rotary position embedding](/wiki/rope)
- [RMSNorm](/wiki/rmsnorm)
- [Cross-entropy loss](/wiki/cross_entropy_loss)
- [Triton (compiler)](/wiki/triton)
- [FlashAttention](/wiki/flashattention)
- [DeepSpeed](/wiki/deepspeed)
- [Fully Sharded Data Parallel](/wiki/fsdp)
- [vLLM](/wiki/vllm)
- [llama.cpp](/wiki/llama_cpp)
- [GGUF](/wiki/gguf)
- [DeepSeek-R1](/wiki/deepseek_r1)
- [gpt-oss](/wiki/gpt_oss)
- [Llama 3.2](/wiki/llama_3_2)
- [Gemma 3](/wiki/gemma_3)
- [Qwen3](/wiki/qwen_3)
- [Phi-4](/wiki/phi_4)
- [RLHF](/wiki/rlhf)
- [Supervised fine-tuning](/wiki/supervised_fine-tuning)
- [Hugging Face](/wiki/hugging_face)

## References

[^1]: Unsloth AI, "unslothai/unsloth GitHub repository", GitHub, 2026-05. https://github.com/unslothai/unsloth/. Accessed 2026-06-24.

[^2]: Y Combinator, "Unsloth AI: Open-Source Reinforcement Learning (RL) & Fine-tuning for LLMs", Y Combinator company directory, 2026. https://www.ycombinator.com/companies/unsloth-ai. Accessed 2026-06-24.

[^3]: Unsloth AI, "Unsloth: Train and Run Models Locally", unsloth.ai homepage, 2026. https://unsloth.ai/. Accessed 2026-06-24.

[^4]: Ultradune AI, "EVAL #003: Fine-Tuning in 2026 - Axolotl vs Unsloth vs TRL vs LLaMA-Factory", DEV Community, 2026. https://dev.to/ultraduneai/eval-003-fine-tuning-in-2026-axolotl-vs-unsloth-vs-trl-vs-llama-factory-2ohg. Accessed 2026-06-24.

[^5]: Unsloth AI, "Preference Optimization Training - DPO, ORPO and KTO", Unsloth Documentation, 2026. https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/preference-dpo-orpo-and-kto. Accessed 2026-06-24.

[^6]: Unsloth AI, "Unsloth Updates / Changelog", Unsloth Documentation, 2026. https://unsloth.ai/docs/new/changelog. Accessed 2026-06-24.

[^7]: AI Engineer World's Fair 2024, "Fixing bugs in Gemma, Llama and Phi-3 (Daniel Han)", ai.engineer schedule, 2024. https://www.ai.engineer/worldsfair/2024/schedule/daniel-han-tba. Accessed 2026-06-24.

[^8]: Crunchbase, "Unsloth AI Company Profile and Funding", Crunchbase, 2026. https://www.crunchbase.com/organization/unsloth-ai. Accessed 2026-06-24.

[^9]: Unsloth AI, "Train your own R1 reasoning model locally (GRPO)", Unsloth Blog, 2025-02-06. https://unsloth.ai/blog/r1-reasoning. Accessed 2026-06-24.

[^10]: Unsloth AI, "Fine-tune Llama 3.2 Vision with Unsloth", Unsloth Blog, 2025. https://www.unsloth.ai/blog/llama3-2. Accessed 2026-06-24.

[^11]: Unsloth AI, "Fine-tune gpt-oss with Unsloth", Unsloth Blog, 2025-08-08. https://unsloth.ai/blog/gpt-oss. Accessed 2026-06-24.

[^12]: Unsloth AI, "Release December-2025: December Release + 3x Faster Training", GitHub Releases, 2025-12. https://github.com/unslothai/unsloth/releases/tag/December-2025. Accessed 2026-06-24.

[^13]: Spheron Network, "Axolotl vs Unsloth vs TorchTune: Best LLM Fine-Tuning Frameworks in 2026", Spheron Blog, 2026. https://www.spheron.network/blog/axolotl-vs-unsloth-vs-torchtune/. Accessed 2026-06-24.

[^14]: Ahmed Lahlou Mimi, "Train LLMs faster with Unsloth (Part 1)", Medium, 2024. https://medium.com/@ahmed.mimilahlou/train-llms-faster-with-unsloth-part-1-042ab1fb7618. Accessed 2026-06-24.

[^15]: Ryan Pegoud, "Cutting LLM Memory by 84%: A Deep Dive into Fused Kernels", Towards Data Science, 2026-02. https://towardsdatascience.com/cutting-llm-memory-by-84-a-deep-dive-into-fused-kernels/. Accessed 2026-06-24.

[^16]: Unsloth AI, "Unsloth: Dynamic 4-bit Quantization", Unsloth Blog, 2024-12-04. https://unsloth.ai/blog/dynamic-4bit. Accessed 2026-06-24.

[^17]: Unsloth AI, "Unsloth Dynamic 2.0 GGUFs", Unsloth Documentation, 2025. https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs. Accessed 2026-06-24.

[^18]: Unsloth AI, "Multi-GPU Fine-tuning with Distributed Data Parallel (DDP)", Unsloth Documentation, 2025-12. https://unsloth.ai/docs/basics/multi-gpu-training-with-unsloth/ddp. Accessed 2026-06-24.

[^19]: Unsloth AI, "Finetune Phi-4 with Unsloth", Unsloth Blog, 2025-01. https://unsloth.ai/blog/phi4. Accessed 2026-06-24.

[^20]: OpenTechHub, "Unsloth: Strategic Open Source Alternative to OpenAI Fine-tuning", opentechhub.io, 2026. https://www.opentechhub.io/unsloth/. Accessed 2026-06-24.

[^21]: BrightCoding, "Unsloth: Train Massive LLMs on Consumer GPUs with 70% Less VRAM", BrightCoding Blog, 2026-02-05. https://blog.brightcoding.dev/2026/02/05/unsloth-train-massive-llms-on-consumer-gpus-with-70-less-vram. Accessed 2026-06-24.

[^22]: Unsloth AI, "unslothai/unsloth README", GitHub, 2026-06. https://github.com/unslothai/unsloth/blob/main/README.md. Accessed 2026-06-24.

[^23]: GitHub, "unslothai/unsloth stargazer count", GitHub repository header, 2026-06. https://github.com/unslothai/unsloth. Accessed 2026-06-24.

[^24]: Unsloth AI, "About Unsloth", unsloth.ai about page, 2026. https://unsloth.ai/about. Accessed 2026-06-24.

