# LLaMA-Factory

> Source: https://aiwiki.ai/wiki/llama_factory
> Updated: 2026-07-11
> Categories: Developer Tools, Open Source AI, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**LLaMA-Factory** is an open-source unified framework for the efficient fine-tuning of large language models (LLMs) and vision-language models (VLMs). It integrates a wide range of training algorithms, parameter-efficient adapters, and acceleration kernels behind a single command-line interface, a Python API, and a no-code web UI named **LLaMA Board**. The project was created and is led by Yaowei Zheng, a Ph.D. student in the School of Computer Science and Engineering at Beihang University, and was first presented in the paper "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models" at the System Demonstrations track of ACL 2024.[1][2] As of mid-2026, the GitHub repository hiyouga/LLaMA-Factory has accumulated more than 73,000 stars and roughly 8,900 forks, making it one of the most widely adopted fine-tuning toolkits in the open-source ecosystem; the project README describes it as "Used by Amazon, NVIDIA, Aliyun."[3]

Released under the Apache 2.0 license, LLaMA-Factory bundles supervised fine-tuning (SFT), reward modeling, reinforcement learning from human feedback (PPO), and a family of preference-optimization algorithms (DPO, KTO, ORPO, SimPO) together with full-parameter, freeze, LoRA, QLoRA, GaLore, BAdam, DoRA, and Unsloth-accelerated training paths. Multi-GPU scaling is handled through DeepSpeed (including ZeRO stages), PyTorch FSDP/FSDP2, Ray, and, since late 2025, Megatron-LM. The system also supports Ascend NPU and AMD ROCm backends in addition to NVIDIA CUDA hardware.[1][3][4] The latest stable release, v0.9.5 (May 30, 2026), adds primary support for the Qwen3.5, Qwen3.6, and Gemma 4 model families and consolidates compatibility with Hugging Face Transformers v5.[12]

The framework's motivation is stated plainly in its ACL 2024 paper: "Efficient fine-tuning is vital for adapting large language models (LLMs) to downstream tasks," yet it "requires non-trivial efforts to implement these methods on different models." LLaMA-Factory answers this by "flexibly customizing the fine-tuning of 100+ LLMs without the need for coding through the built-in web UI LlamaBoard."[1][2]

## Infobox

| Field | Value |
|---|---|
| Name | LLaMA-Factory (also stylized "LlamaFactory") |
| Original author | Yaowei Zheng |
| Affiliation of authors | Beihang University, School of Computer Science and Engineering |
| Initial release | 2023 (first GitHub commits as "ChatGLM-Efficient-Tuning" / "LLaMA-Efficient-Tuning") |
| Latest stable release | v0.9.5 (May 30, 2026); previous v0.9.4 "Goodbye 2025" (December 31, 2025) |
| License | Apache 2.0 |
| Repository | github.com/hiyouga/LLaMA-Factory |
| GitHub stars | More than 73,000 (mid-2026) |
| Paper | arXiv:2403.13372; ACL 2024 demos, pages 400 to 410 |
| Web UI | LLaMA Board (Gradio-based) |
| Programming language | Python (>= 3.11 from v0.9.4) |
| Built on | [PyTorch](/wiki/pytorch), [Hugging Face Transformers](/wiki/transformers_library), [PEFT](/wiki/peft) |

## History

### Origins and naming

The project began in 2023 as a pair of repositories maintained by Yaowei Zheng ("hiyouga") that targeted parameter-efficient tuning of specific model families. The earliest repository was "ChatGLM-Efficient-Tuning," which provided supervised and reward modeling pipelines for [Baichuan](/wiki/baichuan) and other Chinese LLMs together with the THUDM ChatGLM models. A sibling project, "LLaMA-Efficient-Tuning," targeted the Meta LLaMA series. As the underlying [Hugging Face Transformers](/wiki/transformers_library) interface converged on a shared abstraction across architectures, the two codebases were merged and rebranded as **LLaMA-Factory** in late 2023, with a unified loader and trainer covering both English-centric and Chinese-centric pretrained checkpoints.[3]

A first version of the academic paper describing the framework was posted to arXiv on March 20, 2024, with revisions through June 27, 2024.[1] The paper was accepted to the System Demonstrations track of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), held in Bangkok, Thailand, with the camera-ready version appearing in the ACL Anthology as paper 2024.acl-demos.38, pages 400 to 410.[2] The listed authors are Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma; the affiliations point to Beihang University.[1][2]

### Release timeline

The release history of LLaMA-Factory tracks the broader pace of open-weight model releases and post-training research. The table below summarizes major versions documented in the project's GitHub release notes.[3][12]

| Version | Date | Notable additions |
|---|---|---|
| v0.6.x | early 2024 | Mixture-of-experts ([Mixtral](/wiki/mixtral)) support, [LoRA](/wiki/lora)+, [Gemma](/wiki/gemma), unified preference data |
| v0.7.x | mid 2024 | [ORPO](/wiki/orpo), [SimPO](/wiki/simpo), [KTO](/wiki/kto) trainers; BAdam optimizer |
| v0.8.3 | July 18, 2024 | Neat packing (contamination-free packing), split evaluation, HQQ and EETQ quantization, NPU Dockerfile |
| v0.9.0 | September 8, 2024 | Qwen2-VL multimodal SFT, [Weights & Biases](/wiki/wandb) hooks, Adam-mini optimizer, VLM RLHF/DPO/ORPO/SimPO, MFU calculation |
| v0.9.1 | November 24, 2024 | [Llama 3.2](/wiki/llama_3_2) and Llama-3.2-Vision, LLaVA-NeXT, Video-LLaVA, [Pixtral](/wiki/pixtral), gradient-accumulation fixes for Transformers 4.46 |
| v0.9.2 | March 11, 2025 | APOLLO optimizer, SwanLab experiment tracker, [Ray](/wiki/ray) Trainer, [vLLM](/wiki/vllm) batch inference with tensor parallel, QLoRA on Ascend NPU |
| v0.9.3 | June 16, 2025 | [InternVL](/wiki/internvl) 2.5 and 3, Qwen2.5-Omni audio-visual, [Llama 4](/wiki/llama_4), [Gemma 3](/wiki/gemma_3), official GPU Docker images, [SGLang](/wiki/sglang) inference backend |
| v0.9.4 | December 31, 2025 | Repository renamed to "LlamaFactory," Python 3.11 to 3.13 (3.9 to 3.10 deprecated), migration from pip to `uv`, Orthogonal Fine-Tuning (OFT), FP8 training, Megatron-core backend, Transformers v5 |
| v0.9.5 | May 30, 2026 | Qwen3.5, Qwen3.6, and Gemma 4 support, Qwen3-VL and further multimodal models, FP8 Transformer Engine backend, FSDP2 training via Ray, consolidated Transformers v5 compatibility |

The cadence shows three identifiable phases. From early to mid-2024 the focus was on filling out preference-optimization trainers and integrating the new optimizers proposed in 2024 papers (LoRA+, GaLore, BAdam, DoRA). From late 2024 through mid-2025 the focus shifted to multimodal models, with [LLaVA](/wiki/llava)-series, [Pixtral](/wiki/pixtral), [InternVL](/wiki/internvl), and Qwen-VL variants becoming first-class training targets. The v0.9.4 milestone in late 2025 modernized the project's build toolchain and introduced Megatron-core for very large-scale model-parallel training, and the v0.9.5 release of May 2026 extended coverage to the Qwen3.5, Qwen3.6, and Gemma 4 families while adding an FP8 Transformer Engine backend and FSDP2 training through Ray.[3][12]

### Project growth

LLaMA-Factory grew rapidly after the public release of [Llama 2](/wiki/llama_2) in mid-2023 and accelerated again after the [Llama 3](/wiki/llama_3) releases of April 2024. By the time of the ACL 2024 publication, the paper reported that the GitHub repository had "received over 25,000 stars and 3,000 forks."[1] By December 2025 the project README reported community adoption that included the AMD ROCm, Hugging Face, and NVIDIA developer ecosystems, and the Anyscale documentation references LLaMA-Factory as one of the recommended LLM post-training stacks on the Anyscale platform.[5][3] By mid-2026 the repository had grown to more than 73,000 stars and roughly 8,900 forks.[3]

## Architecture

The 2024 paper describes the framework as a three-layer modular design: a **Model Loader** that normalizes model and tokenizer initialization across architectures, a **Data Worker** that converts heterogeneous chat and preference datasets to a unified internal schema, and a **Trainer** that exposes a uniform interface across the four supported training paradigms.[1][2]

### Model Loader

The Model Loader resolves a model identifier (Hugging Face Hub name or local path) to a [Hugging Face Transformers](/wiki/transformers_library) configuration, instantiates the model, patches it to insert adapter modules where required, attaches a chat template, and applies the chosen quantization. Quantization paths in the current release include 8-bit and 4-bit weight quantization via bitsandbytes (LLM.int8), 4-bit NF4 [QLoRA](/wiki/qlora), plus AQLM, AWQ, [GPTQ](/wiki/gptq), HQQ, and EETQ for inference and adapter-only training. From v0.9.2 onward, Ascend NPU is a first-class device target, with PyTorch operations dispatched through the torch_npu shim.[3][4]

### Data Worker

The Data Worker handles dataset loading, alignment, merging, and tokenization. It expects datasets in one of three canonical formats: Alpaca-style instruction tuples (instruction, input, output), ShareGPT-style multi-turn conversations, and preference triplets (prompt, chosen, rejected) used by DPO and related preference algorithms. Internally, the worker emits a unified record shape so the Trainer does not need to know which dataset format produced it. Streaming datasets, interleaved sampling, and dataset packing (including "neat packing," which avoids cross-document attention contamination) are all configured through the same YAML config.[1][3]

### Trainer

The Trainer wraps the `transformers.Trainer` class and adds: a generative pre-training stage, a [SFT](/wiki/supervised_fine-tuning) stage, a reward model training stage, and a preference-optimization or [RLHF](/wiki/rlhf) stage. The preference optimization branch exposes [DPO](/wiki/dpo), [KTO](/wiki/kto), [ORPO](/wiki/orpo), [SimPO](/wiki/simpo), and full online RL via [PPO](/wiki/ppo). The Trainer abstracts away the difference between a base model with attached LoRA adapters and a fully fine-tuned model, so the same configuration interface works for full-parameter and parameter-efficient runs.[1][2]

A distinctive contribution of the paper is the **model-sharing RLHF** trick: the reward model, the value model, and the policy model can be served by a single set of base weights with three different LoRA adapters swapped dynamically per forward pass. This allows end-to-end RLHF to run on a single consumer GPU for models in the 7B parameter class, an order-of-magnitude reduction in memory compared with naive multi-model RLHF setups.[1][2]

## What training methods does LLaMA-Factory support?

The framework's headline feature is the breadth of training algorithms accessible from one configuration file. The following table groups the documented options by purpose.[1][3][4]

| Category | Method |
|---|---|
| Pretraining and continued pretraining | Causal LM next-token prediction over plain-text corpora |
| Supervised fine-tuning | [SFT](/wiki/supervised_fine-tuning) on Alpaca, ShareGPT, OpenAssistant, and custom formats |
| Reward modeling | Bradley-Terry pairwise loss over preference data |
| Online RL | [PPO](/wiki/ppo) with optional reward model or rule-based reward |
| Offline preference optimization | [DPO](/wiki/dpo), [KTO](/wiki/kto), [ORPO](/wiki/orpo), [SimPO](/wiki/simpo) |
| Parameter-efficient adapters | [LoRA](/wiki/lora), [QLoRA](/wiki/qlora), DoRA, LoRA+, PiSSA, LoftQ, OFT/QOFT |
| Memory-efficient full tuning | Freeze-tuning, GaLore, BAdam, APOLLO, Adam-mini, Muon |
| Long-context fine-tuning | LongLoRA shifted-attention, sequence packing, RoPE scaling |
| Acceleration kernels | [FlashAttention-2](/wiki/flashattention), [Unsloth](/wiki/unsloth), Liger Kernel |
| Distributed backends | NativeDDP, [DeepSpeed](/wiki/deepspeed) (ZeRO-1/2/3, offload), [FSDP](/wiki/fsdp) and FSDP2, [Ray](/wiki/ray) Trainer, [Megatron-LM](/wiki/megatron_lm) core |

The pairing of memory-efficient optimizers (GaLore, BAdam, APOLLO) with adapter methods ([LoRA](/wiki/lora), DoRA) is presented as a key axis of the design: a user who needs full-parameter quality but cannot fit the optimizer state can switch from Adam to GaLore without rewriting any training code; conversely, a user who needs to swap many adapters at inference can use [LoRA](/wiki/lora) or DoRA with the same Trainer.[1][4]

## What is LLaMA Board?

LLaMA Board is the no-code web interface bundled with the framework. It is implemented in [Gradio](/wiki/gradio) and is launched with the command `llamafactory-cli webui`.[3][4] The interface mirrors the underlying YAML configuration schema, exposing tabs for model selection, training hyperparameters, dataset choice, evaluation, and chat-style testing. The 2024 paper highlights three properties: a localized UI in English, Russian, and Chinese; live loss-curve and metric plots streamed from a background trainer process; and an evaluation pane that supports both n-gram overlap metrics ([ROUGE](/wiki/rouge_score), BLEU) and side-by-side chat testing against the current checkpoint. On localization, the paper states: "Currently we support three languages: English, Russian and Chinese, which allows a broader range of users to utilize LlamaBoard for fine-tuning LLMs."[1][2]

The web UI is the recommended entry point for users without a deep-learning engineering background. For research and production users, the same configurations can be exported as YAML and run from the command line, ensuring parity between interactive exploration and reproducible batch jobs.[3]

## How do you run LLaMA-Factory?

The package installs a single `llamafactory-cli` entry point with five subcommands:[3][4]

- `llamafactory-cli train <config.yaml>` runs a training job
- `llamafactory-cli chat <config.yaml>` opens a terminal chat loop with the trained adapter
- `llamafactory-cli export <config.yaml>` merges adapters into the base model and exports a runnable Hugging Face checkpoint
- `llamafactory-cli api <config.yaml>` serves an OpenAI-compatible HTTP API backed by [vLLM](/wiki/vllm) or [SGLang](/wiki/sglang)
- `llamafactory-cli webui` launches LLaMA Board

Beyond the CLI, the package exposes a Python API for embedding the framework inside other systems. Each subcommand corresponds to a function in `llamafactory.train`, `llamafactory.chat`, and so on, with the same YAML configuration dict accepted as input.[3]

## Which models does LLaMA-Factory support?

The framework's name notwithstanding, supported architectures extend far beyond the [LLaMA](/wiki/llama) family. As of the current release, the README enumerates more than one hundred model checkpoints across the following families:[3][4]

- **Meta**: [LLaMA](/wiki/llama), [Llama 2](/wiki/llama_2), [Llama 3](/wiki/llama_3), [Llama 3.2](/wiki/llama_3_2), [Llama 4](/wiki/llama_4)
- **Mistral**: [Mistral 7B](/wiki/mistral_7b), [Mixtral 8x7B](/wiki/mixtral) and 8x22B, [Pixtral](/wiki/pixtral) 12B
- **Google**: [Gemma](/wiki/gemma), [Gemma 2](/wiki/gemma_2), [Gemma 3](/wiki/gemma_3) and Gemma 3-VL, Gemma 4
- **Alibaba**: [Qwen](/wiki/qwen) and Qwen 1.5/2/2.5/3 series including Qwen2-VL, Qwen2.5-VL, Qwen2-Audio, Qwen2.5-Omni, Qwen3-VL, and the Qwen3.5/3.6 releases
- **DeepSeek**: [DeepSeek](/wiki/deepseek) LLM, MoE, Math, Coder, V2, V3, and the DeepSeek-R1 reasoning family
- **Microsoft**: [Phi](/wiki/phi), [Phi-3](/wiki/phi_3), Phi-3.5, Phi-4
- **Shanghai AI Lab / OpenGVLab**: InternLM and [InternVL](/wiki/internvl) 2.5/3
- **OpenAI**: [GPT-2](/wiki/gpt-2), gpt-oss
- **Other**: [Falcon](/wiki/falcon), [BLOOM](/wiki/bloom), [Baichuan](/wiki/baichuan), Yi, MiniCPM and MiniCPM-V, GLM-4, ChatGLM-2/3, StarCoder 2, Granite, TeleChat2, Yuan 2

The 2024 paper reports having validated the framework against more than forty distinct model families at submission time, and the post-2024 release notes document continuous additions for each new open-weight launch.[1][3]

## Which datasets and chat templates are built in?

The repository ships with built-in loaders for a curated catalog of instruction, dialogue, preference, and reasoning datasets: the README groups these into pre-training corpora (16 datasets), supervised fine-tuning sets (more than 50), and preference datasets (11).[3] The catalog covers English instruction data (Alpaca, ShareGPT, Open-Orca), Chinese instruction data (Belle, COIG), preference datasets (UltraFeedback, HH-RLHF), and math and code datasets (MetaMathQA, MagiCoder).[1][3] Each dataset entry maps to a dataset_info.json record that specifies its format (Alpaca-style or ShareGPT-style), its remote URL or local path, and its column names. Users can register a new dataset by appending an entry to this file, making the dataset usable from both the CLI and LLaMA Board without code changes.[3]

Chat templates are stored as Jinja2 templates in `src/llamafactory/data/template.py`. The Model Loader auto-detects the appropriate template from the tokenizer's name (with manual override via `--template`), avoiding the common error of training a model with the wrong chat formatting.[3]

## How efficient is LLaMA-Factory?

The 2024 paper provides side-by-side memory and throughput numbers measured on Gemma-2B, Llama-2-7B, and Llama-2-13B using SFT on the Alpaca dataset at sequence length 512. The reported numbers (Table 3 of the paper) document the design's headline efficiency claim: QLoRA on Gemma-2B fits in 5.21 GB of GPU memory and runs at roughly 3,158 tokens/second, while LoRA on Llama-2-13B uses about 30.09 GB at roughly 1,468 tokens/second. Llama-2-7B with freeze-tuning is reported at 15.69 GB and roughly 2,905 tokens/second.[1] These figures correspond to single-GPU runs on a Nvidia A100; multi-GPU runs scale further through [DeepSpeed](/wiki/deepspeed) ZeRO and [FSDP](/wiki/fsdp).

For downstream task quality, the paper's Table 4 evaluates several models and fine-tuning methods on CNN/DailyMail, XSum, and AdGen summarization datasets using [ROUGE](/wiki/rouge_score) metrics. The reported results show LoRA and QLoRA matching or surpassing freeze-tuning on most settings, with the seven-billion parameter Mistral-7B model achieving approximately 23.47 ROUGE on CNN/DailyMail.[1] The paper deliberately does not claim that LLaMA-Factory's algorithms outperform their underlying papers; rather, it documents that the unified implementation reproduces the expected efficiency and quality of each constituent method.

A subsequent third-party benchmark in the paper "Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models" (arXiv:2311.03687) compares [DeepSpeed](/wiki/deepspeed) ZeRO configurations and [FlashAttention](/wiki/flashattention) kernels on similar workloads, providing context for how LLaMA-Factory's defaults sit within the wider design space.[6] The Anyscale documentation, in its overview of speed and memory optimizations for LLM post-training, similarly recommends combining [FlashAttention](/wiki/flashattention) and ZeRO with LLaMA-Factory's CLI for production-scale jobs.[5]

## Who uses LLaMA-Factory?

LLaMA-Factory is widely cited in the open-source community as a default starting point for adapting open-weight LLMs. By mid-2026 the project README listed more than 73,000 GitHub stars and described the framework as "Used by Amazon, NVIDIA, Aliyun"; tutorials produced by Anyscale, DigitalOcean, and the AMD ROCm Developer Hub document end-to-end fine-tuning workflows on a range of hardware, including A100, H100, MI300, and Huawei Ascend 910B platforms.[3][4][5][7]

A notable industry case study is Apoidea Group's use of LLaMA-Factory on Amazon SageMaker HyperPod to fine-tune multimodal vision-language models for banking document extraction. The AWS Machine Learning Blog details a pipeline that uses LLaMA-Factory's YAML configurations to launch distributed training jobs from SageMaker HyperPod nodes, combining LLaMA-Factory's training stack with HyperPod's resilience features for large-cluster runs.[8]

The vLLM-Ascend project (an effort to port [vLLM](/wiki/vllm) to Huawei Ascend hardware) documents LLaMA-Factory as one of its user stories, describing the combination of LLaMA-Factory training on Ascend NPUs with vLLM-Ascend for downstream inference.[9] Yaowei Zheng has also received an Outstanding Open-Source Contributor award from the Ascend ecosystem in recognition of this porting work.[10]

## Why is LLaMA-Factory significant?

LLaMA-Factory's principal contribution to the field is **unification**. Before its release, a typical research workflow required gluing together separate codebases: PEFT for [LoRA](/wiki/lora) adapters, TRL for [PPO](/wiki/ppo) and [DPO](/wiki/dpo), specialized scripts for each base-model family, and bespoke chat-template handling. By exposing all of these under one configuration schema with a shared Model Loader, Data Worker, and Trainer, LLaMA-Factory lowered the engineering burden of running a controlled experiment that varies one axis (for example, "LoRA vs. DoRA at fixed data and base model") without rewriting boilerplate.[1][2]

A second contribution is the **no-code web UI**, which made the framework accessible to non-engineering users: domain experts, language teams localizing models, and researchers in adjacent fields who lack a deep PyTorch background. LLaMA Board's defaults are tuned to "run reasonably out of the box," letting users iterate on data and prompt design rather than infrastructure.[1][2][3]

A third contribution, less visible from outside, is **operational hardening**. The repository tracks the [Hugging Face Transformers](/wiki/transformers_library) release cycle closely; for example, v0.9.1 explicitly fixed gradient accumulation behavior changed in Transformers 4.46, and v0.9.4 was rebased on Transformers v5.[3] This ongoing maintenance is what allows the project to remain compatible with each new open-weight model release within weeks.

## How does LLaMA-Factory compare to other frameworks?

LLaMA-Factory occupies a position in the open-source LLM tooling landscape alongside several other frameworks with overlapping but non-identical goals. The table below sketches the comparison.[1][3][11]

| Framework | Primary focus | UI | Notable strength |
|---|---|---|---|
| LLaMA-Factory | Unified SFT/RM/PPO/DPO across 100+ models | LLaMA Board (Gradio) | Breadth of supported models and algorithms; no-code UI |
| [Axolotl](/wiki/axolotl) | YAML-driven fine-tuning of open LLMs | None (CLI only) | Mature config recipes; community fine-tunes |
| [Unsloth](/wiki/unsloth) | Triton-kernel acceleration of LoRA/QLoRA | None | Single-GPU speedups |
| Hugging Face [Transformers](/wiki/transformers_library) + [PEFT](/wiki/peft) | Low-level building blocks | None | Maximum flexibility; canonical reference |
| [DeepSpeed](/wiki/deepspeed) | Distributed training engine | None | ZeRO sharding, offloading |
| [Megatron-LM](/wiki/megatron_lm) | Large-scale 3D parallelism | None | Multi-thousand-GPU training |

These projects are typically complementary rather than competing: LLaMA-Factory uses [DeepSpeed](/wiki/deepspeed), [FSDP](/wiki/fsdp), Megatron-core, and [Unsloth](/wiki/unsloth) as backends, and its [PEFT](/wiki/peft) integration covers most adapter variants documented in the Hugging Face PEFT library.[3][4]

## What are the limitations of LLaMA-Factory?

Despite the breadth of the framework, several limitations are documented in the project's own issues and in external coverage:[3][4][11]

- **Configuration surface area**. A single YAML file can express many incompatible combinations of options. Users new to the framework frequently report training failures caused by mismatches (for example, enabling [FlashAttention-2](/wiki/flashattention) with a tokenizer that does not pad on the right, or combining [QLoRA](/wiki/qlora) with full-parameter optimizers). The project mitigates this through validators in the loader, but the matrix of supported combinations remains large.
- **Multi-node maturity**. While single-node multi-GPU [DeepSpeed](/wiki/deepspeed) and [FSDP](/wiki/fsdp) runs are well-tested, multi-node clusters have historically required more user effort. The v0.9.4 introduction of a Megatron-core backend was driven in part by feedback from users training at the hundred-GPU scale and above.[3]
- **Documentation lag for cutting-edge models**. Each new model family typically requires a chat template and tokenizer adjustment; in the days after a major open-weight release, training behavior can be non-obvious until the README and template files are updated.
- **Evaluation depth**. The built-in evaluation pane covers n-gram overlap metrics and chat-style spot checks, but for systematic benchmarks ([MMLU](/wiki/mmlu), [GSM8K](/wiki/gsm8k), HumanEval, AlpacaEval, MT-Bench) users typically run external harnesses such as lm-eval-harness against the merged checkpoint.
- **API stability**. Several configuration keys have been renamed between minor releases as the project absorbed new techniques. The v0.9.4 migration to `uv` and Python 3.11+ also broke older environments that depended on Python 3.9 or 3.10.[3]

## What efficient training techniques does LLaMA-Factory implement?

The 2024 paper devotes its central section to a taxonomy of efficient fine-tuning techniques implemented inside the framework. The techniques fall into two broad categories: those that change which parameters are trained (parameter-efficient approaches), and those that change how the gradients and activations are computed (computation-efficient approaches).[1][2]

### Parameter-efficient approaches

[LoRA](/wiki/lora) freezes the pretrained weights and introduces two low-rank matrices A and B such that the effective update is the product BA, materialized only at adapter sites (typically the attention projection matrices). The rank is a hyperparameter exposed in LLaMA-Factory as `lora_rank`, and the targeted modules are configurable via `lora_target`. [QLoRA](/wiki/qlora) composes this idea with 4-bit NF4 quantization of the frozen base weights, achieving the lowest memory footprint among supported methods. DoRA (Weight-Decomposed LoRA) decomposes each weight matrix into a direction component and a magnitude component, training only the direction through LoRA and the magnitude scalars separately. LoRA+ assigns a higher learning rate to the B matrix than to the A matrix, addressing an asymmetry noted in the LoRA+ paper. PiSSA initializes the LoRA matrices from the principal singular values of the underlying weight matrix, accelerating convergence. OFT (Orthogonal Fine-Tuning), introduced in v0.9.4, applies an orthogonal transformation that preserves angular relationships between hidden states, a property argued to reduce catastrophic forgetting.[1][3][4]

GaLore (Gradient Low-Rank Projection) is a memory-efficient full-parameter method that projects gradients into a low-rank subspace before applying the optimizer state, then projects back. Unlike LoRA, GaLore updates the full weight matrix; unlike a naive Adam run, the optimizer state grows with the projected rank rather than the full matrix size. BAdam (Block-Wise Adam) further reduces optimizer memory by updating only a single transformer block at a time per training step, rotating across blocks. APOLLO and Adam-mini are recent variants in the same family, both supported as drop-in optimizer choices via LLaMA-Factory's `--optim` flag.[1][3][4]

### Computation-efficient approaches

[FlashAttention](/wiki/flashattention) and FlashAttention-2 are exact-attention kernels that avoid materializing the full attention matrix in high-bandwidth memory, achieving substantial speedups on long sequences. LLaMA-Factory enables FlashAttention-2 with a single flag (`flash_attn=fa2`) when the underlying model's attention implementation supports it. S2 attention (shifted sparse attention), introduced in the LongLoRA paper, is exposed for long-context fine-tuning of base models that lack native long-context training. [Unsloth](/wiki/unsloth) is a Triton-kernel based acceleration library that rewrites attention and LoRA backward passes for higher single-GPU throughput; LLaMA-Factory wraps Unsloth as an optional backend.[1][3]

Mixed-precision training defaults to bfloat16 on NVIDIA GPUs of compute capability 8.0 and above (Ampere and later) and falls back to float16 on older hardware. Activation checkpointing trades recompute for memory and is enabled by default in most preset configurations. Sequence packing (concatenating multiple shorter examples into a single long sequence) raises GPU utilization for datasets dominated by short examples; "neat packing" (added in v0.8.3) uses a block-diagonal attention mask to prevent attention from crossing example boundaries, preserving the semantic equivalence between packed and unpacked training.[3]

### Distributed training

For single-node multi-GPU runs, LLaMA-Factory supports plain PyTorch DistributedDataParallel (DDP), [DeepSpeed](/wiki/deepspeed) (ZeRO-1, ZeRO-2, ZeRO-3, with optional CPU and NVMe offload), and PyTorch's FSDP and FSDP2 implementations. A DeepSpeed configuration is passed by reference: the main YAML config points to a separate JSON file with the ZeRO stage, optimizer offload settings, and bf16 precision flags. For multi-node runs, [Ray](/wiki/ray) Trainer integration (v0.9.2) launches workers across a Ray cluster, and the Megatron-core backend (v0.9.4) enables tensor parallelism, pipeline parallelism, and expert parallelism for very large models such as DeepSeek-V3 and the Llama 4 family.[3][4]

## Significance for Chinese open-source AI

LLaMA-Factory is also one of the most visible Chinese-led open-source projects in post-training tooling. The maintainers are based at Beihang University; the framework includes first-class support for Chinese-centric models such as [Baichuan](/wiki/baichuan), ChatGLM, [Qwen](/wiki/qwen), [DeepSeek](/wiki/deepseek), InternLM, and Yi; and Ascend NPU support was added before many comparable Western frameworks. The localized LLaMA Board UI (English, Russian, Chinese) reflects this global-first orientation.[1][3][4][10]

## How do you cite LLaMA-Factory?

The project's CITATION.cff file in the repository directs users to the ACL 2024 paper. The canonical citation is:

> Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo. "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, August 2024, pages 400 to 410. DOI: 10.18653/v1/2024.acl-demos.38.[2]

The arXiv version (arXiv:2403.13372) additionally lists Zhangchi Feng and Yongqiang Ma as authors and includes appendix material on the LLaMA Board design and additional efficiency experiments.[1]

## See also

- [LoRA (Low-Rank Adaptation)](/wiki/lora)
- [QLoRA](/wiki/qlora)
- [Direct Preference Optimization (DPO)](/wiki/dpo)
- [Proximal Policy Optimization (PPO)](/wiki/ppo)
- [ORPO](/wiki/orpo)
- [KTO](/wiki/kto)
- [SimPO](/wiki/simpo)
- [Unsloth](/wiki/unsloth)
- [Axolotl](/wiki/axolotl)
- [DeepSpeed](/wiki/deepspeed)
- [Fully Sharded Data Parallel (FSDP)](/wiki/fsdp)
- [Megatron-LM](/wiki/megatron_lm)
- [PEFT](/wiki/peft)
- [Hugging Face Transformers](/wiki/transformers_library)
- [FlashAttention](/wiki/flashattention)
- [Supervised fine-tuning](/wiki/supervised_fine-tuning)
- [Reinforcement Learning from Human Feedback (RLHF)](/wiki/rlhf)
- [Instruction Tuning](/wiki/instruction_tuning)
- [LLaVA](/wiki/llava)
- [vLLM](/wiki/vllm)
- [SGLang](/wiki/sglang)
- [Quantization](/wiki/quantization)

## References

1. Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, Yongqiang Ma, "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models", arXiv, 2024-03-20 (v1) revised 2024-06-27 (v4). https://arxiv.org/abs/2403.13372. Accessed 2026-07-12.
2. Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models", Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Association for Computational Linguistics, 2024-08. https://aclanthology.org/2024.acl-demos.38/. Accessed 2026-07-12.
3. hiyouga (Yaowei Zheng) and contributors, "LLaMA-Factory: Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)", GitHub repository (hiyouga/LLaMA-Factory), 2026-07. https://github.com/hiyouga/LLaMA-Factory. Accessed 2026-07-12.
4. LLaMA-Factory maintainers, "Welcome to LLaMA Factory! (documentation site)", llamafactory.readthedocs.io. https://llamafactory.readthedocs.io/en/latest/. Accessed 2026-07-12.
5. Anyscale, "Speed and memory optimizations for LLM post-training and fine-tuning", Anyscale Documentation. https://docs.anyscale.com/llm/fine-tuning/speed-and-memory-optimizations. Accessed 2026-05-20.
6. Longteng Zhang et al., "Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models", arXiv, 2023-11-06 (v1) revised 2024. https://arxiv.org/abs/2311.03687. Accessed 2026-05-20.
7. AMD, "Fine-tune Llama-3.1 8B with Llama-Factory", ROCm AI Developer Hub Tutorials 3.0. https://rocm.docs.amd.com/projects/ai-developer-hub/en/v3.0/notebooks/fine_tune/llama_factory_llama3.html. Accessed 2026-05-20.
8. AWS Machine Learning Blog, "How Apoidea Group enhances visual information extraction from banking documents with multimodal models using LLaMA-Factory on Amazon SageMaker HyperPod". https://aws.amazon.com/blogs/machine-learning/how-apoidea-group-enhances-visual-information-extraction-from-banking-documents-with-multimodal-models-using-llama-factory-on-amazon-sagemaker-hyperpod/. Accessed 2026-05-20.
9. vLLM-Ascend project, "LLaMA-Factory (user stories)", vLLM Ascend documentation. https://docs.vllm.ai/projects/ascend/en/latest/community/user_stories/llamafactory.html. Accessed 2026-05-20.
10. GOSIM, "Yaowei Zheng (speaker profile, GOSIM Paris 2025)", GOSIM Paris 2025 program. https://paris2025.gosim.org/speakers/yaowei-zheng/. Accessed 2026-05-20.
11. Cogni Down Under, "Accelerating AI: How Unsloth, DeepSpeed, Axolotl, and LLaMA Factory Are Revolutionizing LLM Training", Medium, 2024. https://medium.com/@cognidownunder/accelerating-ai-how-unsloth-deepspeed-axolotl-and-llama-factory-are-revolutionizing-llm-37ba0bab2e1b. Accessed 2026-05-20.
12. LLaMA-Factory maintainers, "Releases: v0.9.5 (Qwen3.5/3.6, Gemma 4, Transformers v5) and v0.9.4 (Goodbye 2025)", GitHub release notes (hiyouga/LLaMA-Factory), published 2026-05-30 and 2025-12-31. https://github.com/hiyouga/LLaMA-Factory/releases. Accessed 2026-07-12.