Axolotl

Developer Tools Open Source AI Training & Optimization

22 min read

Updated Jun 24, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 24, 2026

Fact-checked

In review queue

Sources

20 citations

Revision

v4 · 4,396 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Axolotl is a free and open source framework for fine-tuning and post-training large language models, written in Python and driven entirely by a single YAML configuration file. Released under the Apache 2.0 license, it wraps Hugging Face Transformers, the TRL library, Accelerate, DeepSpeed, and PEFT into one declarative layer that supports supervised fine-tuning, preference optimization (DPO, IPO, KTO, ORPO, GRPO, GDPO), parameter-efficient methods such as LoRA and QLoRA, full multi-GPU fine-tunes via FSDP and DeepSpeed, and high-throughput training tricks including sample packing (Multipack) with FlashAttention^[3]^[4]. Originally created by Wing Lian in 2023 under the OpenAccess-AI-Collective banner, the project is now stewarded by Axolotl AI Inc., a San Francisco company founded in 2024 that maintains the framework alongside a community of contributors^[1]^[2]. The GitHub README describes Axolotl simply as "A Free and Open Source LLM Fine-tuning Framework," and the project has been used to train many of the most widely downloaded open chat models on Hugging Face, including the OpenHermes series from Teknium and the Dolphin series from Cognitive Computations^[3]^[5]^[6].

Infobox

Attribute	Value
Project type	Open source LLM fine-tuning framework
Primary language	Python
Configuration format	YAML
License	Apache 2.0^[3]
Original author	Wing Lian (caseus)^[1]
Original organization	OpenAccess-AI-Collective^[7]
Current organization	axolotl-ai-cloud / Axolotl AI Inc.^[3]
Initial public release	2023^[1]
Latest release (as of writing)	v0.17.0, June 3, 2026^[8]
GitHub popularity	~12.1k stars, 1.4k forks^[3]
Notable funding	Andreessen Horowitz Open Source AI grant, December 2023^[2]

Is Axolotl free and open source?

Yes. Axolotl is distributed under the Apache 2.0 license, a permissive open source license that allows commercial use, modification, and redistribution^[3]. The full source code, documentation, and example YAML configurations live in the public axolotl-ai-cloud/axolotl repository on GitHub, and the package is installable from PyPI as axolotl^[3]^[8]. There is no closed core or paywalled feature set in the framework itself; the commercial entity, Axolotl AI Inc., builds hosted and managed offerings around the same open source code rather than gating the library^[9]^[16].

Who created Axolotl?

Axolotl began as a personal project by Wing Lian, a software engineer with prior experience at SoundCloud and UnitedMasters.

Origins (2023)

In March 2023, a skiing injury left Lian sidelined and looking for something to occupy his recovery; he chose to learn LLM fine-tuning, which had become a fast-moving research topic after the release of Meta's LLaMA weights and Stanford's Alpaca instruction-tuning recipe^[1]. While experimenting with existing tools such as Alpaca-LoRA, he ran into two recurring frustrations: prompt formats varied wildly across the datasets being shared on Hugging Face, and the dominant training scripts were configured through long command-line argument lists that made experiments hard to reproduce and share^[1].

To solve both problems, Lian wrote a wrapper that consumed a single YAML configuration file describing the model, dataset, prompt template, optimizer, and distributed training strategy. The wrapper validated the configuration up front, then handed off to Hugging Face Transformers and Accelerate for the actual training loop. The project was released on GitHub under the OpenAccess-AI-Collective organization in mid-2023 and quickly accumulated contributors^[7]^[1].

A turning point came with Tim Dettmers's QLoRA paper in May 2023. Lian integrated QLoRA support into Axolotl within roughly a week of the paper's release, recalling in a later interview that "between that announcement it took us seven days to get that integrated into Axolotl," giving practitioners with consumer-grade hardware an immediate way to fine-tune 7B and even 70B parameter models in 4-bit precision^[1]. This pattern (rapid integration of new research) became a signature trait of the project. When Tri Dao and Albert Gu's Mamba state space model paper appeared in December 2023, Axolotl shipped support for fine-tuning Mamba checkpoints within days^[1].

Community traction and the OpenAccess-AI-Collective

Through the second half of 2023, Axolotl became the de facto fine-tuning framework for the open-weights community that grew up around LLaMA, Mistral, and the larger Hugging Face ecosystem. Teknium chose Axolotl for the OpenHermes line of Mistral fine-tunes; the OpenHermes-2.5-Mistral-7B model card explicitly notes that datasets were converted "to ShareGPT, which was then further transformed by axolotl to use ChatML"^[5]. Eric Hartford's Dolphin series, including dolphin-2.5-mixtral-8x7b and dolphin-2.8-mistral-7b-v0.2, displayed the "Built with Axolotl" badge on their model cards and shipped the full Axolotl YAML configuration alongside the weights so others could reproduce the training^[6]. Nous Research used Axolotl for several of its Capybara, Puffin, and Hermes derivatives, and Lian himself trained models such as Manticore, Minotaur, Jackalope, and Hippogriff that lived under the openaccess-ai-collective namespace on Hugging Face^[1].

Andreessen Horowitz grant (December 2023)

On December 13, 2023, Andreessen Horowitz announced a second batch of Open Source AI Grants. Axolotl appeared on that list alongside six other projects spanning model training, hosting, evaluation, and visual AI, recognizing the framework as a piece of critical open infrastructure for the LLM ecosystem. The a16z program provides grant funding rather than equity, so the award was not a venture investment^[2].

Incorporation as Axolotl AI Inc. (2024)

Through 2024 the project transitioned from a hobbyist tool maintained under OpenAccess-AI-Collective into a company. The GitHub organization was renamed and the canonical repository moved to axolotl-ai-cloud/axolotl, with the company adopting axolotl.ai as its domain and docs.axolotl.ai for documentation^[3]^[4]. Wing Lian announced the company publicly at a Nous Research meetup hosted at the a16z offices in San Francisco^[1]. Public company databases describe Axolotl AI as a San Francisco company founded in 2024 that builds open source tools for customizing and scaling AI language models, with Essence Venture Capital listed among its investors^[9].

Lian represented the new company at the PyTorch Conference 2024 Fine-Tuning Mini-Summit on September 18, 2024, giving a talk titled "The Challenges of Building an Opinionated Open Source LLM Framework" alongside the maintainers of Unsloth, torchtune, and researchers including Tim Dettmers^[10].

Recent releases (2024 to 2026)

The framework's release cadence has been steady. Version 0.12.0 (August 8, 2024) introduced N-D parallel support, DeepSpeed Automatic Tensor Parallelism, and FP8 training. Subsequent 2025 releases added reward modeling and process reward modeling, LoRA optimizations, and a beta for multimodal vision-language fine-tuning, with January 2025 specifically delivering reward and process reward modeling and February 2025 shipping the LoRA memory and speed work that targeted both single-GPU and multi-GPU adapter training. In early 2026, version 0.15.0 (March 6) shipped a Torch 2.10 upgrade, uv-based Docker builds, ScatterMoE LoRA, SonicMoE Triton kernels, and MoE expert quantization that the maintainers reported as reducing peak reserved memory dramatically on mixture-of-experts models. Version 0.16.0 (April 2) added asynchronous GRPO training reported as up to 58% faster step-time, ScatterMoE/SonicMoE fused kernels claimed to deliver up to 15x faster MoE forward passes and roughly 40x reductions in memory, FlashAttention 4 support for NVIDIA Hopper and Blackwell GPUs, NeMo Gym integration for reinforcement learning, and Energy-Based Fine-Tuning (EBFT). Version 0.16.1 followed the same day with Gemma 4 support^[8]. Version 0.17.0 (June 3, 2026) extended the stack further with Expert Parallelism via DeepEP, BitNet 1.58-bit support, the Q-GaLore optimizer, MoRA and ReMoRA adapters, context parallelism for hybrid state space models, and fused RMSNorm-plus-RoPE kernels^[8]. The roughly monthly cadence and the public release notes make it possible for downstream teams to track which research methods and model families are stable in production versus still considered beta^[4]^[8].

How does Axolotl work?

The YAML configuration model

The defining design decision in Axolotl is that an entire training run, from data preprocessing through final inference, is captured in one reusable YAML file. The configuration declares the base model, optional adapter strategy, dataset paths and templates, sequence length, batch and gradient accumulation parameters, optimizer and scheduler, distributed training backend, attention implementation, and downstream evaluation hooks. The file is parsed and statically validated; incompatible parameter combinations (for example, sample packing without an attention implementation that supports it) fail the lint step before any GPU time is spent^[1]^[4].

The CLI exposes a small set of commands that all consume the same configuration: axolotl preprocess tokenizes and caches the dataset, axolotl train runs the training loop, axolotl inference provides an interactive prompt, and axolotl evaluate runs offline evaluation. Most users interact with Axolotl entirely through these commands and a single YAML^[4].

Stack and integrations

Internally, Axolotl is a relatively thin coordination layer over a stack of mature libraries:

Hugging Face Transformers supplies the model implementations, tokenizers, and the Trainer API^[4].
PEFT provides adapter classes for LoRA and QLoRA training^[1].
Accelerate handles device placement, mixed precision, and the launcher abstraction for distributed runs^[4].
TRL contributes the preference-optimization trainers underpinning DPO, IPO, KTO, ORPO, and GRPO^[11].
DeepSpeed and Fully Sharded Data Parallel (FSDP) (both FSDP1 and FSDP2) are supported as multi-GPU and multi-node sharding backends^[3].
Flash Attention 2 is the default attention implementation; recent versions also support Flash Attention 3 and FlashAttention 4 on Hopper and Blackwell hardware, plus Flex Attention, SageAttention, and Xformers as alternatives^[3]^[8].

This approach keeps Axolotl close to the moving frontier of upstream libraries while concentrating the project's own code on what it actually owns: configuration schema, dataset format adapters, sample-packing logic, validation rules, and the integration glue.

What is sample packing (Multipack)?

Sample packing, called Multipack in the Axolotl documentation, is the framework's headline throughput optimization. The naive approach to batching variable-length sequences pads each sequence in a batch up to the longest sequence in that batch, wasting compute on padding tokens that the model is forced to process but which contribute nothing to the loss. Multipack instead concatenates multiple short sequences into a single packed sequence whose length matches the configured sequence_len, then relies on the attention implementation to prevent tokens in one packed example from attending to tokens in another^[12].

With Flash Attention enabled, Multipack passes per-sequence boundary information so that FlashAttention's variable-length kernels compute attention only within each original sequence. Without FlashAttention, Axolotl can still pack sequences by constructing 4D attention masks for PyTorch's scaled dot-product or native attention paths, though at lower efficiency because the framework cannot join multiple batches into a single batch without the variable-length attention support that FlashAttention provides^[12]. Lian has reported that the combination of sample packing and FlashAttention drives roughly an order-of-magnitude improvement in tokens-per-second relative to padded training, describing gains of "up to like a 20x improvement sometimes," and gave an illustrative figure of reproducing an Alpaca-style fine-tune for roughly $4 to $5 on L40 GPUs versus the original Alpaca team's $100 on 8x A100s^[1]. The packing scheme is effectively a descendant of StackLlama-style sequence concatenation but with attention masking that preserves the per-sample loss exactly, so models trained with Multipack are mathematically equivalent to those trained without it given the same hyperparameters and data ordering^[1]^[12].

Optimization features

Beyond sample packing, Axolotl exposes a long list of optional optimizations^[13]:

LoRA / QLoRA: train small adapter parameters instead of full weights to drastically lower memory.
Gradient checkpointing: recompute activations on the backward pass to trade compute for VRAM.
Layer offloading: stream frozen parameters between CPU and GPU during training.
Liger Kernel: Triton kernels for cross-entropy, RoPE, RMSNorm, and other hotspots that reduce both step time and peak memory.
Cut Cross Entropy: a fused loss implementation that avoids materializing the full logits tensor.
RoPE scaling: extend a base model's context window beyond its pretraining length.
Sequence parallelism and N-D parallelism: compose tensor, context, and data parallel sharding for very long contexts or very large models.
Quantization: 4-bit QLoRA, FP8 mixed precision, Quantization-Aware Training (QAT, including an NVFP4 variant), and GPTQ.
MoE-specific kernels: ScatterMoE and SonicMoE fused Triton kernels and MoE expert quantization to make MoE fine-tuning tractable on smaller node counts^[8].

Distributed training

Axolotl supports single-GPU, multi-GPU, and multi-node training through Accelerate launchers. For sharded training the user picks between DeepSpeed ZeRO stages (commonly ZeRO-2 or ZeRO-3 with optional BF16) and Fully Sharded Data Parallel (FSDP) (both the original implementation and the newer FSDP2 rewrite)^[3]. Recent releases extend this with N-D parallelism that composes tensor, context, and FSDP sharding, sequence parallelism for very long contexts, DeepSpeed Auto Tensor Parallelism introduced in v0.12.0, and Expert Parallelism via DeepEP added in v0.17.0^[8].

For preference optimization and RL methods that need rollouts, Axolotl integrates with vLLM for fast inference during trajectory generation in GRPO and GDPO, and provides async training paths that overlap rollout generation with gradient updates^[11]^[8].

What training methods does Axolotl support?

Axolotl supports a broad menu of training objectives, all selected from the same YAML^[11]^[3]:

Family	Methods	Notes
Supervised fine-tuning	Standard SFT, Instruction Tuning, continued pretraining	The default mode; ChatML, Alpaca, ShareGPT, Vicuna, and template-free formats are all supported.
Preference / Reinforcement Learning from Human Feedback (RLHF)	DPO, IPO, KTO, ORPO, SimPO, GDPO	DPO compares chosen vs. rejected; IPO is a DPO loss variant; KTO uses desirable/undesirable single-response signals; ORPO adds an odds-ratio term; SimPO removes the reference model; GDPO normalizes multiple reward signals.
RL with policy optimization	GRPO, Async GRPO	Group Relative Policy Optimization with vLLM for trajectory generation, custom reward functions, and async pipelines.
Reward modeling	Reward Modeling, Process Reward Modeling	Added in early 2025 for training scalar reward and step-level process reward models.
Parameter-efficient	LoRA, QLoRA, ReLoRA, ScatterMoE LoRA	QLoRA pairs with bitsandbytes 4-bit quantization; ScatterMoE LoRA targets MoE expert weights.
Quantization-aware	QAT, NVFP4 QAT, GPTQ	Train models to be robust to low-precision inference.
Energy-based	EBFT (Energy-Based Fine-Tuning)	Introduced in v0.16.0 as a novel RL method.

The library also supports multimodal vision-language fine-tunes (Qwen2-VL, Qwen2.5-VL, Qwen3-VL, LLaVA, SmolVLM2, and InternVL families) and audio models such as Voxtral, with multimodal SFT moving from beta into stable status during 2025^[3]^[4].

Which model families does Axolotl support?

Axolotl tracks the upstream model zoo aggressively, typically adding configurations within days of a major open-weights release^[3]^[4]:

LLaMA family: LLaMA, Llama 2, Llama 3 and the 3.1/3.2/3.3 point releases, and Llama 4 (both Scout/Maverick and Behemoth variants).
Mistral family: Mistral 7B, Mixtral 8x7B, Mixtral 8x22B, and Magistral.
Qwen family: Qwen, Qwen3 including MoE and Next variants, plus Qwen3-VL multimodal.
Google models: Gemma, Gemma 2, Gemma 3, and Gemma 4 (added in v0.16.1).
OpenAI weights: gpt-oss variants.
Microsoft models: Phi, Phi-3, Phi-4 and its mini and reasoning derivatives.
Other open models: Falcon (language model), Falcon 3, RWKV and RWKV-7 (Goose), Pythia, IBM Granite 4, Tencent HunYuan, Apertus, Seed-OSS, GLM-4, and others as they appear^[3].
State space models: Mamba and Mamba 2 checkpoints^[1].

The 2025 to 2026 releases broadened coverage further with multimodal vision and audio models, and Lian has emphasized in interviews and conference talks that adding a new architecture typically means writing a model wrapper plus example YAMLs rather than reimplementing the base model, since the heavy lifting lives in Transformers^[4]^[10].

What dataset formats does Axolotl accept?

Axolotl's dataset layer is one of the parts most directly written by the project itself. The framework natively understands several chat and instruction formats and converts them into the prompt template required by the target model^[4]^[14]:

Alpaca-style instruction/input/output triplets.
ShareGPT conversation arrays, the de facto exchange format among open chat models.
ChatML, the OpenAI-derived structured turn format adopted by Hermes, Dolphin, and many others.
Vicuna and Pygmalion dialogue formats from earlier open chat eras.
Completion and template-free datasets for raw text and for users who pre-format their own prompts.
Pre-tokenized datasets, where the user supplies token IDs directly and Axolotl skips its own preprocessing.
Stepwise supervised datasets for process reward modeling.
Preference datasets (chosen/rejected pairs and KTO-style desirable/undesirable labels) for DPO/IPO/KTO/ORPO/SimPO.

The fact that Dolphin and OpenHermes consistently distributed their training data in ShareGPT/ChatML and pointed users at Axolotl as the canonical trainer is a significant reason both formats became standard in the open-weights community^[5]^[6].

Which models were trained with Axolotl?

Axolotl's footprint on the Hugging Face Hub is broad and visible. The "Built with Axolotl" badge and accompanying YAML appear on the model cards of many of the highest-download open chat models from 2023 onward^[5]^[6]. Notable examples include:

Model series	Maintainer	Base model	Role of Axolotl
OpenHermes 2 / 2.5	Teknium	Mistral 7B	SFT with ChatML conversion via Axolotl; the OpenHermes-2.5-Mistral-7B card explicitly documents this^[5].
Dolphin 2.5	Cognitive Computations	Mixtral 8x7B	qLoRA fine-tune with Axolotl on Mixtral^[15].
Dolphin 2.6	Cognitive Computations	Mistral 7B	qLoRA fine-tune, reported as 2 days on 4x A100s^[1].
Dolphin 2.8 v0.2	Cognitive Computations	Mistral 7B v0.2	Full SFT with sample packing, DeepSpeed ZeRO-3, Flash Attention, 16k sequence length on 10x L40S over 3 days^[6].
Capybara, Puffin, Hermes derivatives	Nous Research	Llama 2, Mistral	SFT and DPO via Axolotl^[1].
Mistral-OpenOrca	OpenOrca/OpenChat	Mistral 7B	Axolotl-based SFT on Mistral-7B^[1].
Manticore, Minotaur, Jackalope, Hippogriff	OpenAccess-AI-Collective (Wing Lian)	Various	Axolotl reference models maintained alongside the framework^[1]^[7].
Mythalion, DiscoLM	Pygmalion, DiscoResearch	Llama 2 derivatives	Axolotl-based community releases^[1].

Cloud platforms tailored their environments to Axolotl in response: RunPod and Vast.ai both offer Axolotl Docker images, and Modal (platform) and Replicate publish example notebooks and templates for running Axolotl jobs^[4]^[3]. Recent releases also publish documentation tuned for AI coding assistants such as Claude Code, Cursor, and Copilot, reflecting how heavily contemporary users mix LLM-generated code with hand-edited configuration^[3].

The official site lists 170+ contributors and 500+ active Discord members; the GitHub repository displays around 12.1k stars and 1.4k forks as of writing^[3]^[16].

How does Axolotl compare to Unsloth and LLaMA-Factory?

Axolotl is the most prominent member of a small set of open source LLM fine-tuning frameworks that emerged in 2023 to 2024. The three most often compared are Axolotl itself, Unsloth, and LLaMA-Factory^[17]^[18]^[19].

Dimension	Axolotl	Unsloth	LLaMA-Factory
Configuration model	YAML files, declarative^[4]	Python-first API with notebooks^[17]	YAML + a polished web UI^[17]
Primary differentiator	Extensive feature surface, distributed training (FSDP, DeepSpeed, N-D parallel), MoE kernels^[3]^[17]	Hand-written Triton kernels delivering 2 to 5x speedups and large memory reductions on single GPU^[17]	Breadth of model support and a low-friction web UI for non-engineers^[17]
Multi-GPU / multi-node	Strong; FSDP1/2, DeepSpeed, sequence parallel, ND parallel^[3]	Historically single-GPU focused; multi-GPU support has expanded^[17]	Supported, often by delegating to DeepSpeed^[17]
Sample packing	Multipack with FlashAttention is a core feature^[12]	Supported	Supported, can use Unsloth as an acceleration backend^[17]
Preference / RL methods	SFT, DPO, IPO, KTO, ORPO, SimPO, GDPO, GRPO, reward modeling, EBFT^[11]	DPO, GRPO, and others^[17]	DPO, GRPO, ORPO, and others^[17]
Typical user	Engineering teams running production training and research-style ablations^[17]	Individual developers and resource-constrained setups^[17]	Cross-functional teams that want a web UI^[17]
License	Apache 2.0^[3]	Apache 2.0^[17]	Apache 2.0^[17]

By 2026, all three frameworks support the same core menu of objectives (LoRA, QLoRA, full fine-tuning, DPO, GRPO, multimodal) and the practical differences are mostly in workflow ergonomics and distributed training depth rather than capability^[19]^[20]. Independent 2026 benchmarks illustrate the split: on a single A100 fine-tuning Llama-3.1 8B with QLoRA, Unsloth's custom Triton kernels tend to finish fastest, while Axolotl's advantages emerge once training is parallelized across multiple GPUs with FSDP2 or DeepSpeed^[17]^[18]. Axolotl is generally positioned as the framework for ML engineering teams that need reproducible YAML configs, multi-node training, and the latest research methods landed quickly; Unsloth as the choice for individual practitioners who need maximum throughput from a single GPU through custom kernels; and LLaMA-Factory as the one with the most approachable UI for people who do not want to write Python^[17]^[18].

Axolotl is also commonly contrasted with PyTorch's own torchtune library, which is more conservative in feature scope but tightly integrated with the PyTorch core^[10]^[20].

Why does Axolotl matter?

Axolotl's significance comes less from a single algorithmic innovation than from being the connective tissue that made it practical for hobbyists, researchers, and small teams to fine-tune open-weights LLMs at modern scale. Concretely, it enables:

Reproducible community fine-tunes. Because a training run is captured in one YAML, model authors can ship the config alongside the weights. The Dolphin and OpenHermes model cards do exactly this, allowing third parties to retrain or audit a model with a single command^[5]^[6].
Rapid adoption of new research. Axolotl integrated QLoRA within a week of its publication and Mamba support within days, and continues to track new architectures and RL methods in subsequent releases^[1]^[8].
A standard recipe for preference learning. DPO, ORPO, KTO, and GRPO are all available behind the same YAML schema, which lowered the activation energy for community alignment experiments beyond classic Reinforcement Learning from Human Feedback (RLHF) with PPO^[11].
Practical multi-GPU training. Configuring Fully Sharded Data Parallel (FSDP) or DeepSpeed from scratch is non-trivial; Axolotl reduces it to a few YAML fields, which has been important for teams training 70B-class models or MoE models^[3].
A platform for cloud providers and managed services. RunPod, Modal (platform), Replicate, and other GPU clouds publish Axolotl recipes and Docker images so users can launch jobs without writing infrastructure code^[3]^[4].

In commercial terms, the framework's existence has been a meaningful contributor to the viability of the open-weights ecosystem: if fine-tuning required bespoke engineering, far fewer of the Mistral and Llama derivatives that populate Hugging Face would exist.

What are Axolotl's limitations?

Axolotl's design choices come with trade-offs that practitioners frequently surface in community discussion^[17]^[18]^[19]:

YAML-first ergonomics. The declarative model is excellent for reproducibility but less convenient for users who want to step through training code, attach a debugger, or inject custom behavior. Compared with Unsloth's Python-first notebooks, Axolotl asks users to learn its schema and conventions before they can run anything^[17].
Surface area and configuration complexity. The breadth of supported models, methods, and optimizations means the YAML schema has grown large, and not all combinations are valid. The framework's lint step catches many such errors, but the learning curve has been called out as steeper than Unsloth's or LLaMA-Factory's^[17]^[18].
Single-GPU throughput. Independent benchmarks generally show Unsloth ahead of Axolotl on single-GPU runs because of its custom Triton kernels; Axolotl's advantages emerge at multi-GPU and multi-node scale where FSDP, DeepSpeed, sequence parallelism, and Multipack matter more^[17]^[18].
Dependency on upstream libraries. Because Axolotl wraps Transformers, TRL, PEFT, Accelerate, DeepSpeed, and bitsandbytes, breaking changes in any of those packages can ripple through; pinned versions and Docker images partly mitigate this but contribute to occasional install friction^[4]^[3].
Hardware floor. Some optimizations require recent NVIDIA hardware: FlashAttention 2 requires Ampere or newer, FlashAttention 3 targets Hopper, and FlashAttention 4 in v0.16.0 targets Hopper and Blackwell. Users on older GPUs lose access to the largest performance gains^[3]^[8].
AMD and non-NVIDIA support. While Axolotl works on ROCm to varying degrees, the bulk of the optimization work and testing is NVIDIA-centric, mirroring the wider PyTorch ecosystem.

Axolotl sits at the intersection of several ecosystems that are worth navigating in their own right:

The Hugging Face ecosystem, particularly Hugging Face Transformers, PEFT, and TRL, which Axolotl uses as its foundation^[4].
Parameter-efficient fine-tuning methods such as LoRA (Low-Rank Adaptation) and QLoRA that account for most production fine-tuning runs^[1].
Preference optimization techniques DPO, ORPO, KTO, and GRPO that have become standard for instruction-tuned and reasoning models in the open-weights world^[11].
Distributed training stacks DeepSpeed and Fully Sharded Data Parallel (FSDP) that handle multi-GPU and multi-node sharding^[3].
High-throughput attention kernels Flash Attention and Flash Attention 3 underlying Multipack^[12].
Competing fine-tuning frameworks Unsloth and LLaMA-Factory that occupy the same niche with different ergonomics^[17].
The Nous Research and broader community of teams (Cognitive Computations, OpenChat, OpenOrca) that publish models trained with Axolotl^[1]^[6].
Cloud platforms RunPod, Modal (platform), and Replicate that provide hosted environments^[3].

References

Swyx and Alessio Fanelli, "The Busy Person's Intro to Finetuning & Open Source AI", Latent Space, 2023-12-22. https://www.latent.space/p/axolotl. Accessed 2026-05-20. ↩
Andreessen Horowitz, "Announcing Our Latest Open Source AI Grants", a16z.com, 2023-12-13. https://a16z.com/announcing-our-latest-open-source-ai-grants/. Accessed 2026-05-20. ↩
axolotl-ai-cloud, "axolotl GitHub repository README", GitHub, 2026. https://github.com/axolotl-ai-cloud/axolotl. Accessed 2026-06-24. ↩
Axolotl AI, "Axolotl Documentation", docs.axolotl.ai, 2026. https://docs.axolotl.ai/. Accessed 2026-05-20. ↩
Teknium, "OpenHermes-2.5-Mistral-7B model card", Hugging Face, 2023-11. https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B. Accessed 2026-05-20. ↩
Cognitive Computations, "dolphin-2.8-mistral-7b-v02 model card", Hugging Face, 2024. https://huggingface.co/cognitivecomputations/dolphin-2.8-mistral-7b-v02. Accessed 2026-05-20. ↩
OpenAccess-AI-Collective, "OpenAccess-AI-Collective GitHub organization", GitHub. https://github.com/OpenAccess-AI-Collective. Accessed 2026-05-20. ↩
axolotl-ai-cloud, "Releases page (v0.12.0 through v0.17.0)", GitHub, 2026-06-03. https://github.com/axolotl-ai-cloud/axolotl/releases. Accessed 2026-06-24. ↩
Tracxn, "AXOLOTL Company Profile", Tracxn.com, 2026. https://tracxn.com/d/companies/axolotl/__U_EjBD7RKauk8tA7WkS17nrrSfn-FrrbP8xofiBxnzg. Accessed 2026-05-20. ↩
PyTorch Conference 2024, "The Challenges of Building an Opinionated Open Source LLM Framework, Wing Lian, Axolotl AI", PyTorch Conference, 2024-09-18. https://pytorch2024.sched.com/event/1hZiF/the-challenges-of-building-an-opinionated-open-source-llm-framework-wing-lian-axolotl-ai. Accessed 2026-05-20. ↩
axolotl-ai-cloud, "rlhf.qmd: RLHF and preference optimization in Axolotl", GitHub, 2026. https://github.com/axolotl-ai-cloud/axolotl/blob/main/docs/rlhf.qmd. Accessed 2026-05-20. ↩
Axolotl AI, "Multipack (Sample Packing)", docs.axolotl.ai, 2026. https://docs.axolotl.ai/docs/multipack.html. Accessed 2026-05-20. ↩
Axolotl AI, "Optimizations Guide", docs.axolotl.ai, 2026. https://docs.axolotl.ai/docs/optimizations.html. Accessed 2026-05-20. ↩
Axolotl AI, "CLI Reference", docs.axolotl.ai, 2026. https://axolotl-ai-cloud.github.io/axolotl/docs/cli.html. Accessed 2026-05-20. ↩
Cognitive Computations, "dolphin-2.5-mixtral-8x7b model card", Hugging Face, 2023. https://huggingface.co/dphn/dolphin-2.5-mixtral-8x7b. Accessed 2026-05-20. ↩
Axolotl AI, "Axolotl AI homepage", axolotl.ai, 2026. https://axolotl.ai/. Accessed 2026-05-20. ↩
Index.dev, "Axolotl vs LLaMA-Factory vs Unsloth for AI Fine-Tuning 2026", index.dev, 2026. https://www.index.dev/skill-vs-skill/ai-axolotl-vs-llama-factory-vs-unsloth. Accessed 2026-05-20. ↩
Spheron Network, "Axolotl vs Unsloth vs TorchTune: Best LLM Fine-Tuning Frameworks in 2026", Spheron Blog, 2026. https://www.spheron.network/blog/axolotl-vs-unsloth-vs-torchtune/. Accessed 2026-05-20. ↩
Paolo Perrone, "Unsloth vs Axolotl vs LLaMA-Factory", The AI Engineer Substack, 2026. https://theaiengineer.substack.com/p/unsloth-vs-axolotl-vs-llama-factory. Accessed 2026-05-20. ↩
UltraDuneAI, "EVAL #003: Fine-Tuning in 2026 - Axolotl vs Unsloth vs TRL vs LLaMA-Factory", DEV Community, 2026. https://dev.to/ultraduneai/eval-003-fine-tuning-in-2026-axolotl-vs-unsloth-vs-trl-vs-llama-factory-2ohg. Accessed 2026-05-20. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

3 revisions by 1 contributor · full history

Suggest edit

What links here

DoRA (Weight-Decomposed Low-Rank Adaptation)GaLore (Gradient Low-Rank Projection)HuggingFace TRL LLaMA-Factory NormalFloat 4-bit (NF4)OpenOrca QLoRA rsLoRA (Rank-Stabilized LoRA)

Infobox

Is Axolotl free and open source?

Who created Axolotl?

Origins (2023)

Community traction and the OpenAccess-AI-Collective

Andreessen Horowitz grant (December 2023)

Incorporation as Axolotl AI Inc. (2024)

Recent releases (2024 to 2026)

How does Axolotl work?

The YAML configuration model

Stack and integrations

What is sample packing (Multipack)?

Optimization features

Distributed training

What training methods does Axolotl support?

Which model families does Axolotl support?

What dataset formats does Axolotl accept?

Which models were trained with Axolotl?

How does Axolotl compare to Unsloth and LLaMA-Factory?

Why does Axolotl matter?

What are Axolotl's limitations?

Related work and ecosystem

See also

References

Improve this article

Related Articles

Unsloth

LLaMA-Factory

HuggingFace PEFT

Fully Sharded Data Parallel (FSDP)

AutoML (Automated Machine Learning)

torch.compile

What links here

Related Articles

Unsloth

LLaMA-Factory

HuggingFace PEFT

Fully Sharded Data Parallel (FSDP)

AutoML (Automated Machine Learning)

torch.compile

What links here