LLaMA-Factory
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,115 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,115 words
Add missing citations, update stale details, or suggest a clearer explanation.
LLaMA-Factory is an open-source unified framework for the efficient fine-tuning of large language models (LLMs) and vision-language models (VLMs). It integrates a wide range of training algorithms, parameter-efficient adapters, and acceleration kernels behind a single command-line interface, a Python API, and a no-code web UI named LLaMA Board. The project was created and is led by Yaowei Zheng, a Ph.D. student in the School of Computer Science and Engineering at Beihang University, and was first presented in the paper "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models" at the System Demonstrations track of ACL 2024.[1][2] As of 2026, the GitHub repository hiyouga/LLaMA-Factory has accumulated more than seventy thousand stars and is one of the most widely adopted fine-tuning toolkits in the open-source ecosystem.[3]
Released under the Apache 2.0 license, LLaMA-Factory bundles supervised fine-tuning (SFT), reward modeling, reinforcement learning from human feedback (PPO), and a family of preference-optimization algorithms (DPO, KTO, ORPO, SimPO) together with full-parameter, freeze, LoRA, QLoRA, GaLore, BAdam, DoRA, and Unsloth-accelerated training paths. Multi-GPU scaling is handled through DeepSpeed (including ZeRO stages), PyTorch FSDP/FSDP2, Ray, and, since late 2025, Megatron-LM. The system also supports Ascend NPU and AMD ROCm backends in addition to NVIDIA CUDA hardware.[1][3][4]
| Field | Value |
|---|---|
| Name | LLaMA-Factory (also stylized "LlamaFactory") |
| Original author | Yaowei Zheng |
| Affiliation of authors | Beihang University, School of Computer Science and Engineering |
| Initial release | 2023 (first GitHub commits as "ChatGLM-Efficient-Tuning" / "LLaMA-Efficient-Tuning") |
| Stable release at time of writing | v0.9.4 (December 31, 2025) |
| License | Apache 2.0 |
| Repository | github.com/hiyouga/LLaMA-Factory |
| Paper | arXiv:2403.13372; ACL 2024 demos, pages 400 to 410 |
| Web UI | LLaMA Board (Gradio-based) |
| Programming language | Python (>= 3.11 from v0.9.4) |
| Built on | PyTorch, Hugging Face Transformers, PEFT |
The project began in 2023 as a pair of repositories maintained by Yaowei Zheng ("hiyouga") that targeted parameter-efficient tuning of specific model families. The earliest repository was "ChatGLM-Efficient-Tuning," which provided supervised and reward modeling pipelines for Baichuan and other Chinese LLMs together with the THUDM ChatGLM models. A sibling project, "LLaMA-Efficient-Tuning," targeted the Meta LLaMA series. As the underlying Hugging Face Transformers interface converged on a shared abstraction across architectures, the two codebases were merged and rebranded as LLaMA-Factory in late 2023, with a unified loader and trainer covering both English-centric and Chinese-centric pretrained checkpoints.[3]
A first version of the academic paper describing the framework was posted to arXiv on March 20, 2024, with revisions through June 27, 2024.[1] The paper was accepted to the System Demonstrations track of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), held in Bangkok, Thailand, with the camera-ready version appearing in the ACL Anthology as paper 2024.acl-demos.38, pages 400 to 410.[2] The listed authors are Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma; the affiliations point to Beihang University.[1][2]
The release history of LLaMA-Factory tracks the broader pace of open-weight model releases and post-training research. The table below summarizes major versions documented in the project's GitHub release notes.[3]
| Version | Date | Notable additions |
|---|---|---|
| v0.6.x | early 2024 | Mixture-of-experts (Mixtral) support, LoRA+, Gemma, unified preference data |
| v0.7.x | mid 2024 | ORPO, SimPO, KTO trainers; BAdam optimizer |
| v0.8.3 | July 18, 2024 | Neat packing (contamination-free packing), split evaluation, HQQ and EETQ quantization, NPU Dockerfile |
| v0.9.0 | September 8, 2024 | Qwen2-VL multimodal SFT, Weights & Biases hooks, Adam-mini optimizer, VLM RLHF/DPO/ORPO/SimPO, MFU calculation |
| v0.9.1 | November 24, 2024 | Llama 3.2 and Llama-3.2-Vision, LLaVA-NeXT, Video-LLaVA, Pixtral, gradient-accumulation fixes for Transformers 4.46 |
| v0.9.2 | March 11, 2025 | APOLLO optimizer, SwanLab experiment tracker, Ray Trainer, vLLM batch inference with tensor parallel, QLoRA on Ascend NPU |
| v0.9.3 | June 16, 2025 | InternVL 2.5 and 3, Qwen2.5-Omni audio-visual, Llama 4, Gemma 3, official GPU Docker images, SGLang inference backend |
| v0.9.4 | December 31, 2025 | Repository renamed to "LlamaFactory," Python 3.11 to 3.13 (3.9 to 3.10 deprecated), migration from pip to uv, Orthogonal Fine-Tuning (OFT), FP8 training, Megatron-core backend, Transformers v5 |
The cadence shows three identifiable phases. From early to mid-2024 the focus was on filling out preference-optimization trainers and integrating the new optimizers proposed in 2024 papers (LoRA+, GaLore, BAdam, DoRA). From late 2024 through mid-2025 the focus shifted to multimodal models, with LLaVA-series, Pixtral, InternVL, and Qwen-VL variants becoming first-class training targets. The v0.9.4 milestone in late 2025 modernized the project's build toolchain and introduced Megatron-core for very large-scale model-parallel training.[3]
LLaMA-Factory grew rapidly after the public release of Llama 2 in mid-2023 and accelerated again after the Llama 3 releases of April 2024. By the time of the ACL 2024 publication, the GitHub repository already counted more than 25,000 stars and 3,000 forks.[1] By December 2025 the project README reported community adoption that included the AMD ROCm, Hugging Face, and NVIDIA developer ecosystems, and the Anyscale documentation references LLaMA-Factory as one of the recommended LLM post-training stacks on the Anyscale platform.[5][3]
The 2024 paper describes the framework as a three-layer modular design: a Model Loader that normalizes model and tokenizer initialization across architectures, a Data Worker that converts heterogeneous chat and preference datasets to a unified internal schema, and a Trainer that exposes a uniform interface across the four supported training paradigms.[1][2]
The Model Loader resolves a model identifier (Hugging Face Hub name or local path) to a Hugging Face Transformers configuration, instantiates the model, patches it to insert adapter modules where required, attaches a chat template, and applies the chosen quantization. Quantization paths in the current release include 8-bit and 4-bit weight quantization via bitsandbytes (LLM.int8), 4-bit NF4 QLoRA, plus AQLM, AWQ, GPTQ, HQQ, and EETQ for inference and adapter-only training. From v0.9.2 onward, Ascend NPU is a first-class device target, with PyTorch operations dispatched through the torch_npu shim.[3][4]
The Data Worker handles dataset loading, alignment, merging, and tokenization. It expects datasets in one of three canonical formats: Alpaca-style instruction tuples (instruction, input, output), ShareGPT-style multi-turn conversations, and preference triplets (prompt, chosen, rejected) used by DPO and related preference algorithms. Internally, the worker emits a unified record shape so the Trainer does not need to know which dataset format produced it. Streaming datasets, interleaved sampling, and dataset packing (including "neat packing," which avoids cross-document attention contamination) are all configured through the same YAML config.[1][3]
The Trainer wraps the transformers.Trainer class and adds: a generative pre-training stage, a SFT stage, a reward model training stage, and a preference-optimization or RLHF stage. The preference optimization branch exposes DPO, KTO, ORPO, SimPO, and full online RL via PPO. The Trainer abstracts away the difference between a base model with attached LoRA adapters and a fully fine-tuned model, so the same configuration interface works for full-parameter and parameter-efficient runs.[1][2]
A distinctive contribution of the paper is the model-sharing RLHF trick: the reward model, the value model, and the policy model can be served by a single set of base weights with three different LoRA adapters swapped dynamically per forward pass. This allows end-to-end RLHF to run on a single consumer GPU for models in the 7B parameter class, an order-of-magnitude reduction in memory compared with naive multi-model RLHF setups.[1][2]
The framework's headline feature is the breadth of training algorithms accessible from one configuration file. The following table groups the documented options by purpose.[1][3][4]
| Category | Method |
|---|---|
| Pretraining and continued pretraining | Causal LM next-token prediction over plain-text corpora |
| Supervised fine-tuning | SFT on Alpaca, ShareGPT, OpenAssistant, and custom formats |
| Reward modeling | Bradley-Terry pairwise loss over preference data |
| Online RL | PPO with optional reward model or rule-based reward |
| Offline preference optimization | DPO, KTO, ORPO, SimPO |
| Parameter-efficient adapters | LoRA, QLoRA, DoRA, LoRA+, PiSSA, LoftQ, OFT/QOFT |
| Memory-efficient full tuning | Freeze-tuning, GaLore, BAdam, APOLLO, Adam-mini, Muon |
| Long-context fine-tuning | LongLoRA shifted-attention, sequence packing, RoPE scaling |
| Acceleration kernels | FlashAttention-2, Unsloth, Liger Kernel |
| Distributed backends | NativeDDP, DeepSpeed (ZeRO-1/2/3, offload), FSDP and FSDP2, Ray Trainer, Megatron-LM core |
The pairing of memory-efficient optimizers (GaLore, BAdam, APOLLO) with adapter methods (LoRA, DoRA) is presented as a key axis of the design: a user who needs full-parameter quality but cannot fit the optimizer state can switch from Adam to GaLore without rewriting any training code; conversely, a user who needs to swap many adapters at inference can use LoRA or DoRA with the same Trainer.[1][4]
LLaMA Board is the no-code web interface bundled with the framework. It is implemented in Gradio and is launched with the command llamafactory-cli webui.[3][4] The interface mirrors the underlying YAML configuration schema, exposing tabs for model selection, training hyperparameters, dataset choice, evaluation, and chat-style testing. The 2024 paper highlights three properties: a localized UI in English, Russian, and Chinese; live loss-curve and metric plots streamed from a background trainer process; and an evaluation pane that supports both n-gram overlap metrics (ROUGE, BLEU) and side-by-side chat testing against the current checkpoint.[1][2]
The web UI is the recommended entry point for users without a deep-learning engineering background. For research and production users, the same configurations can be exported as YAML and run from the command line, ensuring parity between interactive exploration and reproducible batch jobs.[3]
The package installs a single llamafactory-cli entry point with five subcommands:[3][4]
llamafactory-cli train <config.yaml> runs a training jobllamafactory-cli chat <config.yaml> opens a terminal chat loop with the trained adapterllamafactory-cli export <config.yaml> merges adapters into the base model and exports a runnable Hugging Face checkpointllamafactory-cli api <config.yaml> serves an OpenAI-compatible HTTP API backed by vLLM or SGLangllamafactory-cli webui launches LLaMA BoardBeyond the CLI, the package exposes a Python API for embedding the framework inside other systems. Each subcommand corresponds to a function in llamafactory.train, llamafactory.chat, and so on, with the same YAML configuration dict accepted as input.[3]
The framework's name notwithstanding, supported architectures extend far beyond the LLaMA family. As of v0.9.4, the README enumerates more than one hundred model checkpoints across the following families:[3][4]
The 2024 paper reports having validated the framework against more than forty distinct model families at submission time, and the post-2024 release notes document continuous additions for each new open-weight launch.[1][3]
The repository ships with built-in loaders for a curated catalog of more than fifty instruction, dialogue, preference, and reasoning datasets. The catalog covers English instruction data (Alpaca, ShareGPT, Open-Orca), Chinese instruction data (Belle, COIG), preference datasets (UltraFeedback, HH-RLHF), and math and code datasets (MetaMathQA, MagiCoder).[1][3] Each dataset entry maps to a dataset_info.json record that specifies its format (Alpaca-style or ShareGPT-style), its remote URL or local path, and its column names. Users can register a new dataset by appending an entry to this file, making the dataset usable from both the CLI and LLaMA Board without code changes.[3]
Chat templates are stored as Jinja2 templates in src/llamafactory/data/template.py. The Model Loader auto-detects the appropriate template from the tokenizer's name (with manual override via --template), avoiding the common error of training a model with the wrong chat formatting.[3]
The 2024 paper provides side-by-side memory and throughput numbers measured on Gemma-2B, Llama-2-7B, and Llama-2-13B using SFT on the Alpaca dataset at sequence length 512. The reported numbers (Table 3 of the paper) document the design's headline efficiency claim: QLoRA on Gemma-2B fits in 5.21 GB of GPU memory and runs at roughly 3,158 tokens/second, while LoRA on Llama-2-13B uses about 30.09 GB at roughly 1,468 tokens/second. Llama-2-7B with freeze-tuning is reported at 15.69 GB and roughly 2,905 tokens/second.[1] These figures correspond to single-GPU runs on a Nvidia A100; multi-GPU runs scale further through DeepSpeed ZeRO and FSDP.
For downstream task quality, the paper's Table 4 evaluates several models and fine-tuning methods on CNN/DailyMail, XSum, and AdGen summarization datasets using ROUGE metrics. The reported results show LoRA and QLoRA matching or surpassing freeze-tuning on most settings, with the seven-billion parameter Mistral-7B model achieving approximately 23.47 ROUGE on CNN/DailyMail.[1] The paper deliberately does not claim that LLaMA-Factory's algorithms outperform their underlying papers; rather, it documents that the unified implementation reproduces the expected efficiency and quality of each constituent method.
A subsequent third-party benchmark in the paper "Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models" (arXiv:2311.03687) compares DeepSpeed ZeRO configurations and FlashAttention kernels on similar workloads, providing context for how LLaMA-Factory's defaults sit within the wider design space.[6] The Anyscale documentation, in its overview of speed and memory optimizations for LLM post-training, similarly recommends combining FlashAttention and ZeRO with LLaMA-Factory's CLI for production-scale jobs.[5]
LLaMA-Factory is widely cited in the open-source community as a default starting point for adapting open-weight LLMs. By the time of the v0.9.4 release in late 2025, the project README listed more than seventy thousand GitHub stars; tutorials produced by Anyscale, DigitalOcean, and the AMD ROCm Developer Hub document end-to-end fine-tuning workflows on a range of hardware, including A100, H100, MI300, and Huawei Ascend 910B platforms.[3][4][5][7]
A notable industry case study is Apoidea Group's use of LLaMA-Factory on Amazon SageMaker HyperPod to fine-tune multimodal vision-language models for banking document extraction. The AWS Machine Learning Blog details a pipeline that uses LLaMA-Factory's YAML configurations to launch distributed training jobs from SageMaker HyperPod nodes, combining LLaMA-Factory's training stack with HyperPod's resilience features for large-cluster runs.[8]
The vLLM-Ascend project (an effort to port vLLM to Huawei Ascend hardware) documents LLaMA-Factory as one of its user stories, describing the combination of LLaMA-Factory training on Ascend NPUs with vLLM-Ascend for downstream inference.[9] Yaowei Zheng has also received an Outstanding Open-Source Contributor award from the Ascend ecosystem in recognition of this porting work.[10]
LLaMA-Factory's principal contribution to the field is unification. Before its release, a typical research workflow required gluing together separate codebases: PEFT for LoRA adapters, TRL for PPO and DPO, specialized scripts for each base-model family, and bespoke chat-template handling. By exposing all of these under one configuration schema with a shared Model Loader, Data Worker, and Trainer, LLaMA-Factory lowered the engineering burden of running a controlled experiment that varies one axis (for example, "LoRA vs. DoRA at fixed data and base model") without rewriting boilerplate.[1][2]
A second contribution is the no-code web UI, which made the framework accessible to non-engineering users: domain experts, language teams localizing models, and researchers in adjacent fields who lack a deep PyTorch background. LLaMA Board's defaults are tuned to "run reasonably out of the box," letting users iterate on data and prompt design rather than infrastructure.[1][2][3]
A third contribution, less visible from outside, is operational hardening. The repository tracks the Hugging Face Transformers release cycle closely; for example, v0.9.1 explicitly fixed gradient accumulation behavior changed in Transformers 4.46, and v0.9.4 was rebased on Transformers v5.[3] This ongoing maintenance is what allows the project to remain compatible with each new open-weight model release within weeks.
LLaMA-Factory occupies a position in the open-source LLM tooling landscape alongside several other frameworks with overlapping but non-identical goals. The table below sketches the comparison.[1][3][11]
| Framework | Primary focus | UI | Notable strength |
|---|---|---|---|
| LLaMA-Factory | Unified SFT/RM/PPO/DPO across 100+ models | LLaMA Board (Gradio) | Breadth of supported models and algorithms; no-code UI |
| Axolotl | YAML-driven fine-tuning of open LLMs | None (CLI only) | Mature config recipes; community fine-tunes |
| Unsloth | Triton-kernel acceleration of LoRA/QLoRA | None | Single-GPU speedups |
| Hugging Face Transformers + PEFT | Low-level building blocks | None | Maximum flexibility; canonical reference |
| DeepSpeed | Distributed training engine | None | ZeRO sharding, offloading |
| Megatron-LM | Large-scale 3D parallelism | None | Multi-thousand-GPU training |
These projects are typically complementary rather than competing: LLaMA-Factory uses DeepSpeed, FSDP, Megatron-core, and Unsloth as backends, and its PEFT integration covers most adapter variants documented in the Hugging Face PEFT library.[3][4]
Despite the breadth of the framework, several limitations are documented in the project's own issues and in external coverage:[3][4][11]
uv and Python 3.11+ also broke older environments that depended on Python 3.9 or 3.10.[3]The 2024 paper devotes its central section to a taxonomy of efficient fine-tuning techniques implemented inside the framework. The techniques fall into two broad categories: those that change which parameters are trained (parameter-efficient approaches), and those that change how the gradients and activations are computed (computation-efficient approaches).[1][2]
LoRA freezes the pretrained weights and introduces two low-rank matrices A and B such that the effective update is the product BA, materialized only at adapter sites (typically the attention projection matrices). The rank is a hyperparameter exposed in LLaMA-Factory as lora_rank, and the targeted modules are configurable via lora_target. QLoRA composes this idea with 4-bit NF4 quantization of the frozen base weights, achieving the lowest memory footprint among supported methods. DoRA (Weight-Decomposed LoRA) decomposes each weight matrix into a direction component and a magnitude component, training only the direction through LoRA and the magnitude scalars separately. LoRA+ assigns a higher learning rate to the B matrix than to the A matrix, addressing an asymmetry noted in the LoRA+ paper. PiSSA initializes the LoRA matrices from the principal singular values of the underlying weight matrix, accelerating convergence. OFT (Orthogonal Fine-Tuning), introduced in v0.9.4, applies an orthogonal transformation that preserves angular relationships between hidden states, a property argued to reduce catastrophic forgetting.[1][3][4]
GaLore (Gradient Low-Rank Projection) is a memory-efficient full-parameter method that projects gradients into a low-rank subspace before applying the optimizer state, then projects back. Unlike LoRA, GaLore updates the full weight matrix; unlike a naive Adam run, the optimizer state grows with the projected rank rather than the full matrix size. BAdam (Block-Wise Adam) further reduces optimizer memory by updating only a single transformer block at a time per training step, rotating across blocks. APOLLO and Adam-mini are recent variants in the same family, both supported as drop-in optimizer choices via LLaMA-Factory's --optim flag.[1][3][4]
FlashAttention and FlashAttention-2 are exact-attention kernels that avoid materializing the full attention matrix in high-bandwidth memory, achieving substantial speedups on long sequences. LLaMA-Factory enables FlashAttention-2 with a single flag (flash_attn=fa2) when the underlying model's attention implementation supports it. S2 attention (shifted sparse attention), introduced in the LongLoRA paper, is exposed for long-context fine-tuning of base models that lack native long-context training. Unsloth is a Triton-kernel based acceleration library that rewrites attention and LoRA backward passes for higher single-GPU throughput; LLaMA-Factory wraps Unsloth as an optional backend.[1][3]
Mixed-precision training defaults to bfloat16 on NVIDIA GPUs of compute capability 8.0 and above (Ampere and later) and falls back to float16 on older hardware. Activation checkpointing trades recompute for memory and is enabled by default in most preset configurations. Sequence packing (concatenating multiple shorter examples into a single long sequence) raises GPU utilization for datasets dominated by short examples; "neat packing" (added in v0.8.3) uses a block-diagonal attention mask to prevent attention from crossing example boundaries, preserving the semantic equivalence between packed and unpacked training.[3]
For single-node multi-GPU runs, LLaMA-Factory supports plain PyTorch DistributedDataParallel (DDP), DeepSpeed (ZeRO-1, ZeRO-2, ZeRO-3, with optional CPU and NVMe offload), and PyTorch's FSDP and FSDP2 implementations. A DeepSpeed configuration is passed by reference: the main YAML config points to a separate JSON file with the ZeRO stage, optimizer offload settings, and bf16 precision flags. For multi-node runs, Ray Trainer integration (v0.9.2) launches workers across a Ray cluster, and the Megatron-core backend (v0.9.4) enables tensor parallelism, pipeline parallelism, and expert parallelism for very large models such as DeepSeek-V3 and the Llama 4 family.[3][4]
LLaMA-Factory is also one of the most visible Chinese-led open-source projects in post-training tooling. The maintainers are based at Beihang University; the framework includes first-class support for Chinese-centric models such as Baichuan, ChatGLM, Qwen, DeepSeek, InternLM, and Yi; and Ascend NPU support was added before many comparable Western frameworks. The localized LLaMA Board UI (English, Russian, Chinese) reflects this global-first orientation.[1][3][4][10]
The project's CITATION.cff file in the repository directs users to the ACL 2024 paper. The canonical citation is:
Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo. "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, August 2024, pages 400 to 410. DOI: 10.18653/v1/2024.acl-demos.38.[2]
The arXiv version (arXiv:2403.13372) additionally lists Zhangchi Feng and Yongqiang Ma as authors and includes appendix material on the LLaMA Board design and additional efficiency experiments.[1]