LLaMA-Factory

Developer Tools Open Source AI Training & Optimization

22 min read

Updated Jul 11, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 11, 2026

Fact-checked

In review queue

Sources

12 citations

Revision

v3 · 4,413 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

LLaMA-Factory is an open-source unified framework for the efficient fine-tuning of large language models (LLMs) and vision-language models (VLMs). It integrates a wide range of training algorithms, parameter-efficient adapters, and acceleration kernels behind a single command-line interface, a Python API, and a no-code web UI named LLaMA Board. The project was created and is led by Yaowei Zheng, a Ph.D. student in the School of Computer Science and Engineering at Beihang University, and was first presented in the paper "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models" at the System Demonstrations track of ACL 2024.^[1]^[2] As of mid-2026, the GitHub repository hiyouga/LLaMA-Factory has accumulated more than 73,000 stars and roughly 8,900 forks, making it one of the most widely adopted fine-tuning toolkits in the open-source ecosystem; the project README describes it as "Used by Amazon, NVIDIA, Aliyun."^[3]

Released under the Apache 2.0 license, LLaMA-Factory bundles supervised fine-tuning (SFT), reward modeling, reinforcement learning from human feedback (PPO), and a family of preference-optimization algorithms (DPO, KTO, ORPO, SimPO) together with full-parameter, freeze, LoRA, QLoRA, GaLore, BAdam, DoRA, and Unsloth-accelerated training paths. Multi-GPU scaling is handled through DeepSpeed (including ZeRO stages), PyTorch FSDP/FSDP2, Ray, and, since late 2025, Megatron-LM. The system also supports Ascend NPU and AMD ROCm backends in addition to NVIDIA CUDA hardware.^[1]^[3]^[4] The latest stable release, v0.9.5 (May 30, 2026), adds primary support for the Qwen3.5, Qwen3.6, and Gemma 4 model families and consolidates compatibility with Hugging Face Transformers v5.^[12]

The framework's motivation is stated plainly in its ACL 2024 paper: "Efficient fine-tuning is vital for adapting large language models (LLMs) to downstream tasks," yet it "requires non-trivial efforts to implement these methods on different models." LLaMA-Factory answers this by "flexibly customizing the fine-tuning of 100+ LLMs without the need for coding through the built-in web UI LlamaBoard."^[1]^[2]

Infobox

Field	Value
Name	LLaMA-Factory (also stylized "LlamaFactory")
Original author	Yaowei Zheng
Affiliation of authors	Beihang University, School of Computer Science and Engineering
Initial release	2023 (first GitHub commits as "ChatGLM-Efficient-Tuning" / "LLaMA-Efficient-Tuning")
Latest stable release	v0.9.5 (May 30, 2026); previous v0.9.4 "Goodbye 2025" (December 31, 2025)
License	Apache 2.0
Repository	github.com/hiyouga/LLaMA-Factory
GitHub stars	More than 73,000 (mid-2026)
Paper	arXiv:2403.13372; ACL 2024 demos, pages 400 to 410
Web UI	LLaMA Board (Gradio-based)
Programming language	Python (>= 3.11 from v0.9.4)
Built on	PyTorch, Hugging Face Transformers, PEFT

History

Origins and naming

The project began in 2023 as a pair of repositories maintained by Yaowei Zheng ("hiyouga") that targeted parameter-efficient tuning of specific model families. The earliest repository was "ChatGLM-Efficient-Tuning," which provided supervised and reward modeling pipelines for Baichuan and other Chinese LLMs together with the THUDM ChatGLM models. A sibling project, "LLaMA-Efficient-Tuning," targeted the Meta LLaMA series. As the underlying Hugging Face Transformers interface converged on a shared abstraction across architectures, the two codebases were merged and rebranded as LLaMA-Factory in late 2023, with a unified loader and trainer covering both English-centric and Chinese-centric pretrained checkpoints.^[3]

A first version of the academic paper describing the framework was posted to arXiv on March 20, 2024, with revisions through June 27, 2024.^[1] The paper was accepted to the System Demonstrations track of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), held in Bangkok, Thailand, with the camera-ready version appearing in the ACL Anthology as paper 2024.acl-demos.38, pages 400 to 410.^[2] The listed authors are Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma; the affiliations point to Beihang University.^[1]^[2]

Release timeline

The release history of LLaMA-Factory tracks the broader pace of open-weight model releases and post-training research. The table below summarizes major versions documented in the project's GitHub release notes.^[3]^[12]

Version	Date	Notable additions
v0.6.x	early 2024	Mixture-of-experts (Mixtral) support, LoRA+, Gemma, unified preference data
v0.7.x	mid 2024	ORPO, SimPO, KTO trainers; BAdam optimizer
v0.8.3	July 18, 2024	Neat packing (contamination-free packing), split evaluation, HQQ and EETQ quantization, NPU Dockerfile
v0.9.0	September 8, 2024	Qwen2-VL multimodal SFT, Weights & Biases hooks, Adam-mini optimizer, VLM RLHF/DPO/ORPO/SimPO, MFU calculation
v0.9.1	November 24, 2024	Llama 3.2 and Llama-3.2-Vision, LLaVA-NeXT, Video-LLaVA, Pixtral, gradient-accumulation fixes for Transformers 4.46
v0.9.2	March 11, 2025	APOLLO optimizer, SwanLab experiment tracker, Ray Trainer, vLLM batch inference with tensor parallel, QLoRA on Ascend NPU
v0.9.3	June 16, 2025	InternVL 2.5 and 3, Qwen2.5-Omni audio-visual, Llama 4, Gemma 3, official GPU Docker images, SGLang inference backend
v0.9.4	December 31, 2025	Repository renamed to "LlamaFactory," Python 3.11 to 3.13 (3.9 to 3.10 deprecated), migration from pip to `uv`, Orthogonal Fine-Tuning (OFT), FP8 training, Megatron-core backend, Transformers v5
v0.9.5	May 30, 2026	Qwen3.5, Qwen3.6, and Gemma 4 support, Qwen3-VL and further multimodal models, FP8 Transformer Engine backend, FSDP2 training via Ray, consolidated Transformers v5 compatibility

The cadence shows three identifiable phases. From early to mid-2024 the focus was on filling out preference-optimization trainers and integrating the new optimizers proposed in 2024 papers (LoRA+, GaLore, BAdam, DoRA). From late 2024 through mid-2025 the focus shifted to multimodal models, with LLaVA-series, Pixtral, InternVL, and Qwen-VL variants becoming first-class training targets. The v0.9.4 milestone in late 2025 modernized the project's build toolchain and introduced Megatron-core for very large-scale model-parallel training, and the v0.9.5 release of May 2026 extended coverage to the Qwen3.5, Qwen3.6, and Gemma 4 families while adding an FP8 Transformer Engine backend and FSDP2 training through Ray.^[3]^[12]

Project growth

LLaMA-Factory grew rapidly after the public release of Llama 2 in mid-2023 and accelerated again after the Llama 3 releases of April 2024. By the time of the ACL 2024 publication, the paper reported that the GitHub repository had "received over 25,000 stars and 3,000 forks."^[1] By December 2025 the project README reported community adoption that included the AMD ROCm, Hugging Face, and NVIDIA developer ecosystems, and the Anyscale documentation references LLaMA-Factory as one of the recommended LLM post-training stacks on the Anyscale platform.^[5]^[3] By mid-2026 the repository had grown to more than 73,000 stars and roughly 8,900 forks.^[3]

Architecture

The 2024 paper describes the framework as a three-layer modular design: a Model Loader that normalizes model and tokenizer initialization across architectures, a Data Worker that converts heterogeneous chat and preference datasets to a unified internal schema, and a Trainer that exposes a uniform interface across the four supported training paradigms.^[1]^[2]

Model Loader

The Model Loader resolves a model identifier (Hugging Face Hub name or local path) to a Hugging Face Transformers configuration, instantiates the model, patches it to insert adapter modules where required, attaches a chat template, and applies the chosen quantization. Quantization paths in the current release include 8-bit and 4-bit weight quantization via bitsandbytes (LLM.int8), 4-bit NF4 QLoRA, plus AQLM, AWQ, GPTQ, HQQ, and EETQ for inference and adapter-only training. From v0.9.2 onward, Ascend NPU is a first-class device target, with PyTorch operations dispatched through the torch_npu shim.^[3]^[4]

Data Worker

The Data Worker handles dataset loading, alignment, merging, and tokenization. It expects datasets in one of three canonical formats: Alpaca-style instruction tuples (instruction, input, output), ShareGPT-style multi-turn conversations, and preference triplets (prompt, chosen, rejected) used by DPO and related preference algorithms. Internally, the worker emits a unified record shape so the Trainer does not need to know which dataset format produced it. Streaming datasets, interleaved sampling, and dataset packing (including "neat packing," which avoids cross-document attention contamination) are all configured through the same YAML config.^[1]^[3]

Trainer

The Trainer wraps the transformers.Trainer class and adds: a generative pre-training stage, a SFT stage, a reward model training stage, and a preference-optimization or RLHF stage. The preference optimization branch exposes DPO, KTO, ORPO, SimPO, and full online RL via PPO. The Trainer abstracts away the difference between a base model with attached LoRA adapters and a fully fine-tuned model, so the same configuration interface works for full-parameter and parameter-efficient runs.^[1]^[2]

A distinctive contribution of the paper is the model-sharing RLHF trick: the reward model, the value model, and the policy model can be served by a single set of base weights with three different LoRA adapters swapped dynamically per forward pass. This allows end-to-end RLHF to run on a single consumer GPU for models in the 7B parameter class, an order-of-magnitude reduction in memory compared with naive multi-model RLHF setups.^[1]^[2]

What training methods does LLaMA-Factory support?

The framework's headline feature is the breadth of training algorithms accessible from one configuration file. The following table groups the documented options by purpose.^[1]^[3]^[4]

Category	Method
Pretraining and continued pretraining	Causal LM next-token prediction over plain-text corpora
Supervised fine-tuning	SFT on Alpaca, ShareGPT, OpenAssistant, and custom formats
Reward modeling	Bradley-Terry pairwise loss over preference data
Online RL	PPO with optional reward model or rule-based reward
Offline preference optimization	DPO, KTO, ORPO, SimPO
Parameter-efficient adapters	LoRA, QLoRA, DoRA, LoRA+, PiSSA, LoftQ, OFT/QOFT
Memory-efficient full tuning	Freeze-tuning, GaLore, BAdam, APOLLO, Adam-mini, Muon
Long-context fine-tuning	LongLoRA shifted-attention, sequence packing, RoPE scaling
Acceleration kernels	FlashAttention-2, Unsloth, Liger Kernel
Distributed backends	NativeDDP, DeepSpeed (ZeRO-1/2/3, offload), FSDP and FSDP2, Ray Trainer, Megatron-LM core

The pairing of memory-efficient optimizers (GaLore, BAdam, APOLLO) with adapter methods (LoRA, DoRA) is presented as a key axis of the design: a user who needs full-parameter quality but cannot fit the optimizer state can switch from Adam to GaLore without rewriting any training code; conversely, a user who needs to swap many adapters at inference can use LoRA or DoRA with the same Trainer.^[1]^[4]

What is LLaMA Board?

LLaMA Board is the no-code web interface bundled with the framework. It is implemented in Gradio and is launched with the command llamafactory-cli webui.^[3]^[4] The interface mirrors the underlying YAML configuration schema, exposing tabs for model selection, training hyperparameters, dataset choice, evaluation, and chat-style testing. The 2024 paper highlights three properties: a localized UI in English, Russian, and Chinese; live loss-curve and metric plots streamed from a background trainer process; and an evaluation pane that supports both n-gram overlap metrics (ROUGE, BLEU) and side-by-side chat testing against the current checkpoint. On localization, the paper states: "Currently we support three languages: English, Russian and Chinese, which allows a broader range of users to utilize LlamaBoard for fine-tuning LLMs."^[1]^[2]

The web UI is the recommended entry point for users without a deep-learning engineering background. For research and production users, the same configurations can be exported as YAML and run from the command line, ensuring parity between interactive exploration and reproducible batch jobs.^[3]

How do you run LLaMA-Factory?

The package installs a single llamafactory-cli entry point with five subcommands:^[3]^[4]

llamafactory-cli train <config.yaml> runs a training job
llamafactory-cli chat <config.yaml> opens a terminal chat loop with the trained adapter
llamafactory-cli export <config.yaml> merges adapters into the base model and exports a runnable Hugging Face checkpoint
llamafactory-cli api <config.yaml> serves an OpenAI-compatible HTTP API backed by vLLM or SGLang
llamafactory-cli webui launches LLaMA Board

Beyond the CLI, the package exposes a Python API for embedding the framework inside other systems. Each subcommand corresponds to a function in llamafactory.train, llamafactory.chat, and so on, with the same YAML configuration dict accepted as input.^[3]

Which models does LLaMA-Factory support?

The framework's name notwithstanding, supported architectures extend far beyond the LLaMA family. As of the current release, the README enumerates more than one hundred model checkpoints across the following families:^[3]^[4]

Meta: LLaMA, Llama 2, Llama 3, Llama 3.2, Llama 4
Mistral: Mistral 7B, Mixtral 8x7B and 8x22B, Pixtral 12B
Google: Gemma, Gemma 2, Gemma 3 and Gemma 3-VL, Gemma 4
Alibaba: Qwen and Qwen 1.5/2/2.5/3 series including Qwen2-VL, Qwen2.5-VL, Qwen2-Audio, Qwen2.5-Omni, Qwen3-VL, and the Qwen3.5/3.6 releases
DeepSeek: DeepSeek LLM, MoE, Math, Coder, V2, V3, and the DeepSeek-R1 reasoning family
Microsoft: Phi, Phi-3, Phi-3.5, Phi-4
Shanghai AI Lab / OpenGVLab: InternLM and InternVL 2.5/3
OpenAI: GPT-2, gpt-oss
Other: Falcon, BLOOM, Baichuan, Yi, MiniCPM and MiniCPM-V, GLM-4, ChatGLM-2/3, StarCoder 2, Granite, TeleChat2, Yuan 2

The 2024 paper reports having validated the framework against more than forty distinct model families at submission time, and the post-2024 release notes document continuous additions for each new open-weight launch.^[1]^[3]

Which datasets and chat templates are built in?

The repository ships with built-in loaders for a curated catalog of instruction, dialogue, preference, and reasoning datasets: the README groups these into pre-training corpora (16 datasets), supervised fine-tuning sets (more than 50), and preference datasets (11).^[3] The catalog covers English instruction data (Alpaca, ShareGPT, Open-Orca), Chinese instruction data (Belle, COIG), preference datasets (UltraFeedback, HH-RLHF), and math and code datasets (MetaMathQA, MagiCoder).^[1]^[3] Each dataset entry maps to a dataset_info.json record that specifies its format (Alpaca-style or ShareGPT-style), its remote URL or local path, and its column names. Users can register a new dataset by appending an entry to this file, making the dataset usable from both the CLI and LLaMA Board without code changes.^[3]

Chat templates are stored as Jinja2 templates in src/llamafactory/data/template.py. The Model Loader auto-detects the appropriate template from the tokenizer's name (with manual override via --template), avoiding the common error of training a model with the wrong chat formatting.^[3]

How efficient is LLaMA-Factory?

The 2024 paper provides side-by-side memory and throughput numbers measured on Gemma-2B, Llama-2-7B, and Llama-2-13B using SFT on the Alpaca dataset at sequence length 512. The reported numbers (Table 3 of the paper) document the design's headline efficiency claim: QLoRA on Gemma-2B fits in 5.21 GB of GPU memory and runs at roughly 3,158 tokens/second, while LoRA on Llama-2-13B uses about 30.09 GB at roughly 1,468 tokens/second. Llama-2-7B with freeze-tuning is reported at 15.69 GB and roughly 2,905 tokens/second.^[1] These figures correspond to single-GPU runs on a Nvidia A100; multi-GPU runs scale further through DeepSpeed ZeRO and FSDP.

For downstream task quality, the paper's Table 4 evaluates several models and fine-tuning methods on CNN/DailyMail, XSum, and AdGen summarization datasets using ROUGE metrics. The reported results show LoRA and QLoRA matching or surpassing freeze-tuning on most settings, with the seven-billion parameter Mistral-7B model achieving approximately 23.47 ROUGE on CNN/DailyMail.^[1] The paper deliberately does not claim that LLaMA-Factory's algorithms outperform their underlying papers; rather, it documents that the unified implementation reproduces the expected efficiency and quality of each constituent method.

A subsequent third-party benchmark in the paper "Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models" (arXiv:2311.03687) compares DeepSpeed ZeRO configurations and FlashAttention kernels on similar workloads, providing context for how LLaMA-Factory's defaults sit within the wider design space.^[6] The Anyscale documentation, in its overview of speed and memory optimizations for LLM post-training, similarly recommends combining FlashAttention and ZeRO with LLaMA-Factory's CLI for production-scale jobs.^[5]

Who uses LLaMA-Factory?

LLaMA-Factory is widely cited in the open-source community as a default starting point for adapting open-weight LLMs. By mid-2026 the project README listed more than 73,000 GitHub stars and described the framework as "Used by Amazon, NVIDIA, Aliyun"; tutorials produced by Anyscale, DigitalOcean, and the AMD ROCm Developer Hub document end-to-end fine-tuning workflows on a range of hardware, including A100, H100, MI300, and Huawei Ascend 910B platforms.^[3]^[4]^[5]^[7]

A notable industry case study is Apoidea Group's use of LLaMA-Factory on Amazon SageMaker HyperPod to fine-tune multimodal vision-language models for banking document extraction. The AWS Machine Learning Blog details a pipeline that uses LLaMA-Factory's YAML configurations to launch distributed training jobs from SageMaker HyperPod nodes, combining LLaMA-Factory's training stack with HyperPod's resilience features for large-cluster runs.^[8]

The vLLM-Ascend project (an effort to port vLLM to Huawei Ascend hardware) documents LLaMA-Factory as one of its user stories, describing the combination of LLaMA-Factory training on Ascend NPUs with vLLM-Ascend for downstream inference.^[9] Yaowei Zheng has also received an Outstanding Open-Source Contributor award from the Ascend ecosystem in recognition of this porting work.^[10]

Why is LLaMA-Factory significant?

LLaMA-Factory's principal contribution to the field is unification. Before its release, a typical research workflow required gluing together separate codebases: PEFT for LoRA adapters, TRL for PPO and DPO, specialized scripts for each base-model family, and bespoke chat-template handling. By exposing all of these under one configuration schema with a shared Model Loader, Data Worker, and Trainer, LLaMA-Factory lowered the engineering burden of running a controlled experiment that varies one axis (for example, "LoRA vs. DoRA at fixed data and base model") without rewriting boilerplate.^[1]^[2]

A second contribution is the no-code web UI, which made the framework accessible to non-engineering users: domain experts, language teams localizing models, and researchers in adjacent fields who lack a deep PyTorch background. LLaMA Board's defaults are tuned to "run reasonably out of the box," letting users iterate on data and prompt design rather than infrastructure.^[1]^[2]^[3]

A third contribution, less visible from outside, is operational hardening. The repository tracks the Hugging Face Transformers release cycle closely; for example, v0.9.1 explicitly fixed gradient accumulation behavior changed in Transformers 4.46, and v0.9.4 was rebased on Transformers v5.^[3] This ongoing maintenance is what allows the project to remain compatible with each new open-weight model release within weeks.

How does LLaMA-Factory compare to other frameworks?

LLaMA-Factory occupies a position in the open-source LLM tooling landscape alongside several other frameworks with overlapping but non-identical goals. The table below sketches the comparison.^[1]^[3]^[11]

Framework	Primary focus	UI	Notable strength
LLaMA-Factory	Unified SFT/RM/PPO/DPO across 100+ models	LLaMA Board (Gradio)	Breadth of supported models and algorithms; no-code UI
Axolotl	YAML-driven fine-tuning of open LLMs	None (CLI only)	Mature config recipes; community fine-tunes
Unsloth	Triton-kernel acceleration of LoRA/QLoRA	None	Single-GPU speedups
Hugging Face Transformers + PEFT	Low-level building blocks	None	Maximum flexibility; canonical reference
DeepSpeed	Distributed training engine	None	ZeRO sharding, offloading
Megatron-LM	Large-scale 3D parallelism	None	Multi-thousand-GPU training

These projects are typically complementary rather than competing: LLaMA-Factory uses DeepSpeed, FSDP, Megatron-core, and Unsloth as backends, and its PEFT integration covers most adapter variants documented in the Hugging Face PEFT library.^[3]^[4]

What are the limitations of LLaMA-Factory?

Despite the breadth of the framework, several limitations are documented in the project's own issues and in external coverage:^[3]^[4]^[11]

Configuration surface area. A single YAML file can express many incompatible combinations of options. Users new to the framework frequently report training failures caused by mismatches (for example, enabling FlashAttention-2 with a tokenizer that does not pad on the right, or combining QLoRA with full-parameter optimizers). The project mitigates this through validators in the loader, but the matrix of supported combinations remains large.
Multi-node maturity. While single-node multi-GPU DeepSpeed and FSDP runs are well-tested, multi-node clusters have historically required more user effort. The v0.9.4 introduction of a Megatron-core backend was driven in part by feedback from users training at the hundred-GPU scale and above.^[3]
Documentation lag for cutting-edge models. Each new model family typically requires a chat template and tokenizer adjustment; in the days after a major open-weight release, training behavior can be non-obvious until the README and template files are updated.
Evaluation depth. The built-in evaluation pane covers n-gram overlap metrics and chat-style spot checks, but for systematic benchmarks (MMLU, GSM8K, HumanEval, AlpacaEval, MT-Bench) users typically run external harnesses such as lm-eval-harness against the merged checkpoint.
API stability. Several configuration keys have been renamed between minor releases as the project absorbed new techniques. The v0.9.4 migration to uv and Python 3.11+ also broke older environments that depended on Python 3.9 or 3.10.^[3]

What efficient training techniques does LLaMA-Factory implement?

The 2024 paper devotes its central section to a taxonomy of efficient fine-tuning techniques implemented inside the framework. The techniques fall into two broad categories: those that change which parameters are trained (parameter-efficient approaches), and those that change how the gradients and activations are computed (computation-efficient approaches).^[1]^[2]

Parameter-efficient approaches

LoRA freezes the pretrained weights and introduces two low-rank matrices A and B such that the effective update is the product BA, materialized only at adapter sites (typically the attention projection matrices). The rank is a hyperparameter exposed in LLaMA-Factory as lora_rank, and the targeted modules are configurable via lora_target. QLoRA composes this idea with 4-bit NF4 quantization of the frozen base weights, achieving the lowest memory footprint among supported methods. DoRA (Weight-Decomposed LoRA) decomposes each weight matrix into a direction component and a magnitude component, training only the direction through LoRA and the magnitude scalars separately. LoRA+ assigns a higher learning rate to the B matrix than to the A matrix, addressing an asymmetry noted in the LoRA+ paper. PiSSA initializes the LoRA matrices from the principal singular values of the underlying weight matrix, accelerating convergence. OFT (Orthogonal Fine-Tuning), introduced in v0.9.4, applies an orthogonal transformation that preserves angular relationships between hidden states, a property argued to reduce catastrophic forgetting.^[1]^[3]^[4]

GaLore (Gradient Low-Rank Projection) is a memory-efficient full-parameter method that projects gradients into a low-rank subspace before applying the optimizer state, then projects back. Unlike LoRA, GaLore updates the full weight matrix; unlike a naive Adam run, the optimizer state grows with the projected rank rather than the full matrix size. BAdam (Block-Wise Adam) further reduces optimizer memory by updating only a single transformer block at a time per training step, rotating across blocks. APOLLO and Adam-mini are recent variants in the same family, both supported as drop-in optimizer choices via LLaMA-Factory's --optim flag.^[1]^[3]^[4]

Computation-efficient approaches

FlashAttention and FlashAttention-2 are exact-attention kernels that avoid materializing the full attention matrix in high-bandwidth memory, achieving substantial speedups on long sequences. LLaMA-Factory enables FlashAttention-2 with a single flag (flash_attn=fa2) when the underlying model's attention implementation supports it. S2 attention (shifted sparse attention), introduced in the LongLoRA paper, is exposed for long-context fine-tuning of base models that lack native long-context training. Unsloth is a Triton-kernel based acceleration library that rewrites attention and LoRA backward passes for higher single-GPU throughput; LLaMA-Factory wraps Unsloth as an optional backend.^[1]^[3]

Mixed-precision training defaults to bfloat16 on NVIDIA GPUs of compute capability 8.0 and above (Ampere and later) and falls back to float16 on older hardware. Activation checkpointing trades recompute for memory and is enabled by default in most preset configurations. Sequence packing (concatenating multiple shorter examples into a single long sequence) raises GPU utilization for datasets dominated by short examples; "neat packing" (added in v0.8.3) uses a block-diagonal attention mask to prevent attention from crossing example boundaries, preserving the semantic equivalence between packed and unpacked training.^[3]

Distributed training

For single-node multi-GPU runs, LLaMA-Factory supports plain PyTorch DistributedDataParallel (DDP), DeepSpeed (ZeRO-1, ZeRO-2, ZeRO-3, with optional CPU and NVMe offload), and PyTorch's FSDP and FSDP2 implementations. A DeepSpeed configuration is passed by reference: the main YAML config points to a separate JSON file with the ZeRO stage, optimizer offload settings, and bf16 precision flags. For multi-node runs, Ray Trainer integration (v0.9.2) launches workers across a Ray cluster, and the Megatron-core backend (v0.9.4) enables tensor parallelism, pipeline parallelism, and expert parallelism for very large models such as DeepSeek-V3 and the Llama 4 family.^[3]^[4]

Significance for Chinese open-source AI

LLaMA-Factory is also one of the most visible Chinese-led open-source projects in post-training tooling. The maintainers are based at Beihang University; the framework includes first-class support for Chinese-centric models such as Baichuan, ChatGLM, Qwen, DeepSeek, InternLM, and Yi; and Ascend NPU support was added before many comparable Western frameworks. The localized LLaMA Board UI (English, Russian, Chinese) reflects this global-first orientation.^[1]^[3]^[4]^[10]

How do you cite LLaMA-Factory?

The project's CITATION.cff file in the repository directs users to the ACL 2024 paper. The canonical citation is:

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo. "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, August 2024, pages 400 to 410. DOI: 10.18653/v1/2024.acl-demos.38.^[2]

The arXiv version (arXiv:2403.13372) additionally lists Zhangchi Feng and Yongqiang Ma as authors and includes appendix material on the LLaMA Board design and additional efficiency experiments.^[1]

References

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, Yongqiang Ma, "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models", arXiv, 2024-03-20 (v1) revised 2024-06-27 (v4). https://arxiv.org/abs/2403.13372. Accessed 2026-07-12. ↩
Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models", Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Association for Computational Linguistics, 2024-08. https://aclanthology.org/2024.acl-demos.38/. Accessed 2026-07-12. ↩
hiyouga (Yaowei Zheng) and contributors, "LLaMA-Factory: Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)", GitHub repository (hiyouga/LLaMA-Factory), 2026-07. https://github.com/hiyouga/LLaMA-Factory. Accessed 2026-07-12. ↩
LLaMA-Factory maintainers, "Welcome to LLaMA Factory! (documentation site)", llamafactory.readthedocs.io. https://llamafactory.readthedocs.io/en/latest/. Accessed 2026-07-12. ↩
Anyscale, "Speed and memory optimizations for LLM post-training and fine-tuning", Anyscale Documentation. https://docs.anyscale.com/llm/fine-tuning/speed-and-memory-optimizations. Accessed 2026-05-20. ↩
Longteng Zhang et al., "Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models", arXiv, 2023-11-06 (v1) revised 2024. https://arxiv.org/abs/2311.03687. Accessed 2026-05-20. ↩
AMD, "Fine-tune Llama-3.1 8B with Llama-Factory", ROCm AI Developer Hub Tutorials 3.0. https://rocm.docs.amd.com/projects/ai-developer-hub/en/v3.0/notebooks/fine_tune/llama_factory_llama3.html. Accessed 2026-05-20. ↩
AWS Machine Learning Blog, "How Apoidea Group enhances visual information extraction from banking documents with multimodal models using LLaMA-Factory on Amazon SageMaker HyperPod". https://aws.amazon.com/blogs/machine-learning/how-apoidea-group-enhances-visual-information-extraction-from-banking-documents-with-multimodal-models-using-llama-factory-on-amazon-sagemaker-hyperpod/. Accessed 2026-05-20. ↩
vLLM-Ascend project, "LLaMA-Factory (user stories)", vLLM Ascend documentation. https://docs.vllm.ai/projects/ascend/en/latest/community/user_stories/llamafactory.html. Accessed 2026-05-20. ↩
GOSIM, "Yaowei Zheng (speaker profile, GOSIM Paris 2025)", GOSIM Paris 2025 program. https://paris2025.gosim.org/speakers/yaowei-zheng/. Accessed 2026-05-20. ↩
Cogni Down Under, "Accelerating AI: How Unsloth, DeepSpeed, Axolotl, and LLaMA Factory Are Revolutionizing LLM Training", Medium, 2024. https://medium.com/@cognidownunder/accelerating-ai-how-unsloth-deepspeed-axolotl-and-llama-factory-are-revolutionizing-llm-37ba0bab2e1b. Accessed 2026-05-20. ↩
LLaMA-Factory maintainers, "Releases: v0.9.5 (Qwen3.5/3.6, Gemma 4, Transformers v5) and v0.9.4 (Goodbye 2025)", GitHub release notes (hiyouga/LLaMA-Factory), published 2026-05-30 and 2025-12-31. https://github.com/hiyouga/LLaMA-Factory/releases. Accessed 2026-07-12. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributor · full history

Suggest edit

What links here

Axolotl Mistral 7B QLoRA rsLoRA (Rank-Stabilized LoRA)

Infobox

History

Origins and naming

Release timeline

Project growth

Architecture

Model Loader

Data Worker

Trainer

What training methods does LLaMA-Factory support?

What is LLaMA Board?

How do you run LLaMA-Factory?

Which models does LLaMA-Factory support?

Which datasets and chat templates are built in?

How efficient is LLaMA-Factory?

Who uses LLaMA-Factory?

Why is LLaMA-Factory significant?

How does LLaMA-Factory compare to other frameworks?

What are the limitations of LLaMA-Factory?

What efficient training techniques does LLaMA-Factory implement?

Parameter-efficient approaches

Computation-efficient approaches

Distributed training

Significance for Chinese open-source AI

How do you cite LLaMA-Factory?

See also

References

Improve this article

Related Articles

Axolotl

Unsloth

HuggingFace PEFT

Fully Sharded Data Parallel (FSDP)

AutoML (Automated Machine Learning)

torch.compile

What links here

Related Articles

Axolotl

Unsloth

HuggingFace PEFT

Fully Sharded Data Parallel (FSDP)

AutoML (Automated Machine Learning)

torch.compile

What links here