HuggingFace TRL
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,044 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,044 words
Add missing citations, update stale details, or suggest a clearer explanation.
TRL (Transformer Reinforcement Learning, now stylized as Transformers Reinforcement Learning) is an open-source Python library maintained by Hugging Face for post-training large language models with reinforcement learning and preference-optimization methods.[^1] Originally created by Leandro von Werra in 2020 as a Proximal Policy Optimization implementation for GPT-2,[^2] the library evolved into the de facto reference implementation of the modern post-training stack, providing trainers for supervised fine-tuning, DPO, GRPO, PPO, reward modeling, and many derivative algorithms.[^1][^3] TRL is released under the Apache 2.0 license, integrates tightly with the Transformers ecosystem, and underpins higher-level frameworks such as Axolotl and LLaMA-Factory that wrap its trainers behind YAML-based configuration interfaces.[^4][^5] The library reached version 1.0 on March 31, 2026, marking a transition from research codebase to stable production library, and is downloaded roughly three million times per month.[^4]
| Property | Value |
|---|---|
| Original author | Leandro von Werra |
| Current maintainer | Hugging Face (lead: Quentin Gallouédec) |
| Initial release | 2020 |
| First HF-namespace release tracked | 2022 (under huggingface/trl) |
| Stable v1.0 release | March 31, 2026 |
| Latest release (May 2026) | v1.4.0 (May 8, 2026) |
| License | Apache 2.0 |
| Language | Python (>= 3.10) |
| Repository | github.com/huggingface/trl |
| GitHub stars (May 2026) | ~18.4k |
| Monthly downloads (2026) | ~3 million |
TRL began in 2020 as a personal project by Leandro von Werra, who at the time wrote a PyTorch implementation of Proximal Policy Optimization that could be applied to transformer language models.[^2] The earliest released versions on PyPI described the package as "A Pytorch implementation of Proximal Policy Optimization for transfomer language models," and its motivating example trained GPT-2 to produce positive movie reviews using a BERT-based sentiment classifier as a learned reward.[^2] The library was hosted at lvwerra/trl on GitHub and shipped helper modules such as AutoModelForCausalLMWithValueHead and AutoModelForSeq2SeqLMWithValueHead, which add a scalar value head to causal and encoder-decoder transformers so they can be plugged into a standard actor-critic PPO loop.[^2] Version 0.3.1, released on March 2, 2023, was still classified as pre-alpha and largely focused on the single PPO trainer.[^2]
By 2022 the project had been adopted into the Hugging Face namespace at huggingface/trl, and von Werra joined Hugging Face as a maintainer.[^1] During this period the library was the main open-source implementation of the four-stage Reinforcement Learning from Human Feedback recipe popularized by OpenAI's InstructGPT paper: a supervised fine-tuning stage, a reward model training stage, a PPO stage with a frozen reference policy, and a KL penalty against that reference.[^1][^3] An influential April 5, 2023 Hugging Face blog post, "StackLLaMA: A hands-on guide to train LLaMA with RLHF," walked through the full pipeline using TRL on a LLaMA model, and a March 9, 2023 post showed how to fine-tune 20-billion-parameter models with RLHF on a single 24 GB consumer GPU by combining TRL with PEFT adapters and 8-bit optimizers.[^3]
The Direct Preference Optimization paper by Rafailov, Sharma, Mitchell, Ermon, Manning, and Finn was published in May 2023 and argued that the entire RL stack of RLHF could be replaced by a simple binary classification loss against a reference policy.[^6] TRL absorbed this finding rapidly: version 0.5.0 introduced the DPOTrainer,[^7] and a Hugging Face blog post dated August 8, 2023 titled "Fine-tune Llama 2 with DPO" demonstrated the new trainer applied to Llama 2 together with QLoRA adapters and the Stack Exchange Paired dataset.[^8] The DPO contribution was authored by Kashif Rasul and was later refactored by Quentin Gallouédec.[^9]
Over the following year TRL added trainers for nearly every variant of preference optimization that appeared in the literature. The DPOTrainer itself grew a large loss_type enum exposing alternative objectives: the original Bradley-Terry sigmoid loss; ipo for the Identity Preference Optimization objective of Azar et al.; hinge for the RSO/SLiC formulation; sigmoid_norm exposing the length-normalized SimPO objective of Meng et al.; bco_pair for the Binary Classifier Optimization formulation; robust for the noise-aware loss of Chowdhury et al.; aot and aot_unpaired for Distributional Preference Alignment via Optimal Transport; apo_zero and apo_down for the anchored objective; sppo_hard, nca_pair, exo_pair, and discopop; and additional combinations for Mixed Preference Optimization.[^9] Standalone trainers for the KTO objective of Ethayarajh et al., ORPO of Hong et al., and CPO of Xu et al. were added during 2024 as separate classes rather than as DPO loss types.[^1][^10]
The PPO line did not disappear. A June 12, 2024 Hugging Face blog post titled "Putting RL back in RLHF" introduced the RLOOTrainer, an implementation of the REINFORCE Leave-One-Out method described in Cohere's paper "Back to Basics: Revisiting REINFORCE-style Optimization for Learning from Human Feedback in LLMs" by Ahmadian and colleagues.[^11] RLOO drops the learned value model (loading three model copies rather than four) and uses the mean of the other completions in a batch as the baseline for each completion, which the blog reported as roughly 50 to 70 percent lower memory and 2 to 3 times faster wall-clock training than PPO while remaining competitive on win-rate.[^11]
DeepSeek-AI's January 2025 paper introducing DeepSeek-R1 popularized Group Relative Policy Optimization, an algorithm the same group had originally published in the DeepSeekMath paper of February 2024.[^12] GRPO eliminates the value model entirely by computing the advantage as the within-group z-score of rewards across multiple completions sampled for the same prompt, recovering most of the variance-reduction benefit of a critic at a fraction of the memory cost. TRL added a GRPOTrainer shortly afterward, contributed by Quentin Gallouédec.[^12] A January 28, 2025 Hugging Face blog post, "Open-R1: a fully open reproduction of DeepSeek-R1," used the new TRL trainer as one of the central building blocks of an effort to replicate the R1-Zero and R1 training pipeline with open data and open code.[^13]
After GRPO, TRL's emphasis shifted toward online RL with verifiable rewards (often abbreviated RLVR), which removed the need to train a separate reward model in many reasoning and code-generation settings. The library added or stabilized trainers for OnlineDPOTrainer, NashMDTrainer, XPOTrainer, and PRMTrainer (Process Reward Model training), and the May 25, 2025 post "Liger GRPO meets TRL" and the June 3, 2025 post "NO GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL" introduced significant kernel- and inference-level optimizations.[^3]
The v1.0 release on March 31, 2026 was accompanied by a Hugging Face blog post titled "TRL v1: Post-Training Library That Holds When the Field Invalidates Its Own Assumptions."[^4] The post described a "chaos-adaptive" design philosophy: stable trainers that follow semantic versioning live in the top-level trl namespace, while experimental implementations live in trl.experimental and may be reworked or removed between minor versions. The blog explicitly framed three eras of post-training that the library has traversed: a "PPO era" centered on a policy, reference, reward, and value model loop; a "DPO era" that cut through the stack by removing the reward and value models; and an "RLVR/GRPO era" that relies on verifier-based rewards and renewed emphasis on sampling efficiency.[^4]
The stable core trainers at v1.0 are SFTTrainer, DPOTrainer, RewardTrainer, RLOOTrainer, and GRPOTrainer.[^4] Subsequent versions added asynchronous GRPO (parallel generation and training), the DistillationTrainer and SSDTrainer for on-policy knowledge distillation, extended tool-calling and vision-language model support, and chunked cross-entropy losses that the v1.4 release notes report as reducing SFT VRAM use by up to 50 percent.[^14]
The TRL documentation organizes trainers into four groups: online methods, offline methods, reward modeling, and knowledge distillation.[^1] Online methods sample completions from the current policy during training and apply RL updates against a reward; offline methods consume static datasets of preferences or completions.
| Group | Trainer | Status | vLLM rollouts |
|---|---|---|---|
| Online | GRPOTrainer | stable | yes |
| Online | RLOOTrainer | stable | yes |
| Online | OnlineDPOTrainer | experimental | yes |
| Online | NashMDTrainer | experimental | yes |
| Online | XPOTrainer | experimental | yes |
| Online | PPOTrainer | experimental | no |
| Reward modeling | RewardTrainer | stable | n/a |
| Reward modeling | PRMTrainer | experimental | n/a |
| Offline | SFTTrainer | stable | n/a |
| Offline | DPOTrainer | stable | n/a |
| Offline | BCOTrainer | experimental | n/a |
| Offline | CPOTrainer | experimental | n/a |
| Offline | KTOTrainer | experimental | n/a |
| Offline | ORPOTrainer | experimental | n/a |
| Distillation | GKDTrainer | experimental | n/a |
| Distillation | MiniLLMTrainer | experimental | n/a |
Source: TRL documentation index.[^1]
Each trainer is a thin subclass of the Hugging Face Transformers Trainer class, which means it inherits the standard TrainingArguments, checkpointing, callback, and metric-logging machinery, and natively supports distributed training methods such as DDP, DeepSpeed ZeRO, and FSDP.[^5]
The SFTTrainer handles the language-modeling stage that precedes most preference optimization. It supports both standard and conversational dataset formats, automatically applies chat templates, packs sequences into fixed-length chunks via a Best-Fit Decreasing algorithm, and offers token-level masking so that loss is computed only on the assistant turn rather than on the user prompt.[^4] The v1.4 release added a chunked cross-entropy loss that splits the LM-head computation across the vocabulary dimension and accumulates gradients, reducing peak memory by up to 50 percent on long sequences.[^14]
The DPOTrainer consumes a preference dataset of {prompt, chosen, rejected} triples and minimizes the loss
L_DPO(θ) = -E[log σ(β(log π_θ(y+|x)/π_ref(y+|x) - log π_θ(y-|x)/π_ref(y-|x)))],
where π_θ is the policy being trained, π_ref is a frozen reference policy (typically the supervised checkpoint), σ is the logistic sigmoid, and β controls the strength of the preference signal.[^9] The trainer optionally caches reference log-probabilities so the reference model need not be kept in memory during the main optimization loop, supports synchronization of the reference model with an EMA (Exponential Moving Average) of the policy (sync_ref_model=True), and can combine multiple loss types with different weights to implement Mixed Preference Optimization.[^9] The trainer's default configuration overrides several TrainingArguments defaults: learning_rate defaults to 1e-6 (rather than 5e-5), gradient_checkpointing defaults to True, bf16 defaults to True when fp16 is not set, and logging_steps defaults to 10.[^9]
The GRPOTrainer implements the algorithm from the DeepSeekMath paper.[^12] For each prompt, the trainer samples G completions from the current policy, evaluates them with a reward function (often a deterministic verifier such as a math-answer checker or a unit-test runner), and computes the advantage of completion i as
A_i = (r_i - mean(r)) / std(r),
where the mean and standard deviation are taken across the group of G completions for that prompt. The policy is then updated with a clipped surrogate objective analogous to PPO but using these within-group advantages, and a separate KL penalty against the reference policy is added.[^12] Because there is no learned value model, GRPO loads three model copies (policy, reference, optional reward model) rather than four, and the memory savings versus PPO are significant for large models.[^4][^12]
The RLOOTrainer treats the entire completion as a single action rather than as a token-level trajectory and uses the average reward of the other completions in the same batch as a baseline.[^11] Concretely, for completions y_1 to y_k sampled for the same prompt, the advantage of completion i is
A_i = r_i - (1/(k-1)) Σ_{j ≠ i} r_j,
and the gradient is the REINFORCE estimator log π_θ(y_i|x) · A_i. The Cohere paper, integrated into TRL via the June 2024 blog post, reported a 40.1 percent win rate against an SFT baseline at 1B scale and a 78.7 percent win rate at 6.9B scale.[^11]
The RewardTrainer fits a scalar reward model on {prompt, chosen, rejected} data with the standard Bradley-Terry pairwise log-likelihood, producing a model that takes a prompt+completion pair and returns a scalar.[^1] Reward models trained with this trainer are interoperable with all of TRL's online RL trainers (PPO, GRPO, RLOO, etc.) and can be loaded from the Hugging Face Hub. An experimental PRMTrainer extends this idea to step-level supervision for process reward models used in math and code reasoning.[^1]
TRL standardizes a small set of dataset schemas so that the same dataset can be reused across trainers with minimal rewiring. The library distinguishes "standard" datasets (plain prompt, completion, chosen, or rejected text fields) from "conversational" datasets (lists of {role, content} messages that get materialized through a chat template at training time).[^9] Preference datasets carry a prompt plus chosen and rejected either as full conversations or as final-turn completions, with an "implicit prompt" form that omits the explicit prompt field when the chosen and rejected messages already share a common prefix.[^9] For vision-language training, an additional image or images column is consumed by DataCollatorForVisionPreference; for tool-calling fine-tuning, a tools column carries JSON schemas of available tools and the chosen and rejected completions may include tool_calls and tool role messages.[^9] These conventions are now followed by most major preference datasets on the Hugging Face Hub, which is what makes the trainers feel interchangeable.
Every preference-optimization trainer reports a fixed set of metrics intended to make alignment dynamics legible: rewards/chosen and rewards/rejected (the implicit DPO rewards), rewards/margins (their difference), rewards/accuracies (the fraction of examples where the chosen reward exceeds the rejected reward), per-token log-probabilities logps/chosen and logps/rejected, an entropy term over the model's predictive distribution, the gradient norm before clipping, and a mean_token_accuracy measuring top-1 agreement with the chosen completion.[^9] For GRPO and RLOO the trainers additionally log per-group reward standard deviations and per-prompt KL divergences against the reference policy.[^12] The v1.0 release announced an emerging "training legibility" effort to embed heuristics that surface actionable warnings (e.g., reward-margin collapse, KL drift) so users can diagnose runaway training without manually staring at dashboards.[^4]
A persistent practical concern with KL-anchored RLHF and DPO is the memory cost of keeping a frozen reference model alongside the policy. TRL offers three mitigations: precomputing reference log-probabilities at the start of training and caching them on disk (precompute_ref_log_probs=True), training adapters (typically LoRA or QLoRA) and using the base model with the adapter disabled as the implicit reference (eliminating the need for a second model copy), and synchronizing the reference model with an EMA of the policy.[^9] The PEFT integration is first-class: passing a peft_config=LoraConfig(...) to any trainer wraps the model with PEFT before training begins, and adapters can be pushed to the Hub at the end of training.[^9]
A bottleneck in online RL training is the latency of sampling completions from the current policy. TRL integrates with vLLM to provide high-throughput, low-latency generation during online training.[^15] Two modes are supported. In colocate mode, the trainer process holds both the training model and a vLLM engine, which share GPU memory; this is the simpler deployment but can fragment memory and introduces synchronization between training and generation. In server mode, vLLM runs as an independent HTTP server on its own GPUs and the trainer pushes weight updates to it after each policy step; this scales better and is the only mode that supports custom rollout_func callbacks.[^15] The supported trainers for vLLM rollouts at v1.4 are GRPOTrainer, RLOOTrainer, OnlineDPOTrainer, NashMDTrainer, and XPOTrainer.[^15] As of v1.4 the supported vLLM range is 0.12.0 to 0.18.0, with data-parallel scaling for dense (non-MoE) models removed after vLLM 0.14.0.[^15]
Because TRL trainers subclass the Transformers Trainer, they inherit the Hugging Face Accelerate launcher, which uniformly exposes single-GPU, multi-GPU, multi-node, and TPU configurations through a single accelerate config step.[^5] DeepSpeed ZeRO stages 1, 2, and 3 are supported (the third optionally with CPU and NVMe offload), as is PyTorch FSDP. The v1.0 release notes explicitly call out training distribution stability and MoE/expert parallelism as ongoing scaling priorities.[^4]
TRL integrates with Unsloth for kernel-level fine-tuning speedups; with Liger Kernel from LinkedIn for Triton-based fused operators (the v0.x line added a Liger-aware GRPO loss in May 2025); with the OpenEnv environment standard (introduced via the October 2025 "Building the Open Agent Ecosystem Together" blog post) for agent-driven training loops; with math-verify for math-answer verification rewards; and with the kernels and quantization optional dependencies for low-precision training.[^1][^3][^16]
TRL is the substrate for most modern open-source post-training pipelines. Two prominent wrappers are Axolotl and LLaMA-Factory, both of which expose YAML-based configuration interfaces and delegate the underlying training to TRL trainers.[^5][^17] Axolotl's RLHF documentation states that it "relies on the TRL library for implementations of various RL training methods" including DPO, KTO, ORPO, and PPO, and adds higher-level conveniences such as dataset loaders, multi-config orchestration, and DeepSpeed/FSDP launchers on top.[^17] LLaMA-Factory similarly wraps TRL trainers and adds a Gradio web UI for non-programmatic post-training.[^17] Unsloth ships custom kernels that interoperate with TRL's stable trainers (its DPO and SFT examples in the official documentation are essentially TRL examples with a faster optimizer attached).[^1] RapidFire AI sits on top of TRL specifically for multi-configuration DPO experimentation on a single GPU.[^9]
The TRL documentation lists over 1,000 community models trained with DPOTrainer and tagged with dpo,trl on the Hugging Face Hub, with similar tag pages for GRPO, ORPO, and KTO.[^9][^12] The trl-lib organization on the Hub hosts canonical reference models and datasets used in the documentation examples.[^1] The Hugging Face team's own Zephyr models (a Mistral 7B fine-tune) and the Tülu 3 series from Allen AI use TRL as the post-training engine, as does the Open-R1 project's reproduction of DeepSeek-R1.[^13]
TRL ships a CLI (trl sft, trl dpo, trl grpo) that lets users launch fine-tuning runs without writing Python, reading both model and dataset directly from the Hub and pushing the resulting checkpoint back at the end of training.[^4] Trainers also automatically populate a model card with training arguments, dataset metadata, and per-epoch loss curves when push_to_hub=True.[^9]
The library is used wherever an open-source large language model needs to be aligned with human preferences, instruction-followed, or specialized to a domain. Concrete applications documented by Hugging Face and the broader community include alignment of base models such as Llama 2, Llama 3, Mistral, and Qwen with conversational preference data;[^3][^8] reasoning fine-tuning of math models using GRPO with deterministic answer-checking rewards, as in the Open-R1 project;[^13] code generation fine-tuning using unit-test pass/fail as the reward; tool-calling and agent training using OpenEnv environments;[^16] vision-language model alignment (the August 7, 2025 blog post "Vision Language Model Alignment in TRL" covers VLM preference optimization with DPO);[^3] and image-generation alignment via the legacy DDPOTrainer, introduced in the September 29, 2023 post "Finetune Stable Diffusion Models with DDPO via TRL" for Stable Diffusion models.[^18]
TRL inherits the difficulties of the training methods it implements. PPO training of language models is notoriously unstable, and the older PPOTrainer is now classified as experimental in v1.0.[^4] DPO and its variants are sensitive to the SFT checkpoint quality and to the choice of β; mis-tuning β can cause the policy to collapse onto the chosen completions (over-fitting) or to remain too close to the reference (under-fitting). The June 2024 RLOO blog post documented a numerical-stability issue in bf16 precision where roughly 20 to 40 percent of RLOO gradient batches were nulled by gradient clipping versus around 3 percent for PPO, attributable to log-probability drift between generation and training rather than to an algorithmic flaw.[^11]
The library's rapid pace of change has occasionally created backward-compatibility friction: argument names and defaults have shifted across minor versions (the trainer constructor signatures for DPO, KTO, and ORPO have been refactored several times), and the v1.0 release explicitly published a MIGRATION.md guide for users coming from the 0.x line.[^4] Reproduction of preference-optimization papers using TRL has sometimes been complicated by the fact that loss-type names and default hyperparameters in TRL do not always match the original papers' notation. The chaos-adaptive philosophy openly accepts this tradeoff: the experimental namespace exists precisely so that breaking changes can be made without violating semantic versioning of the stable surface.[^4]
A second limitation is that TRL is fundamentally a single-policy library. Methods that require multiple competing policies (self-play tournament approaches), elaborate environment loops with long episodes, or fully off-policy RL with large replay buffers either map awkwardly onto its training-step abstraction or are out of scope; these workloads are typically handled by purpose-built frameworks such as OpenRLHF, veRL, or Nemotron's reasoning pipeline. Some users have noted that the integration of agentic environments via OpenEnv, while progressing, is less mature than dedicated agentic RL frameworks.[^16]
| Library | Primary focus | Wraps TRL | License | Notes |
|---|---|---|---|---|
| TRL | Trainers for SFT, DPO, GRPO, RLOO, PPO, reward modeling | n/a | Apache 2.0 | Reference implementation; HF maintained |
| Axolotl | YAML-configured fine-tuning | yes | Apache 2.0 | Adds dataset/config layer over TRL |
| LLaMA-Factory | YAML+Gradio UI fine-tuning | yes | Apache 2.0 | Targets ease of use; wraps TRL trainers |
| Unsloth | Kernel-level fast fine-tuning | interoperable | Apache 2.0 | Custom Triton kernels usable with TRL |
| OpenRLHF | Multi-node PPO/DPO/GRPO at scale | no | Apache 2.0 | Native Ray-based RL framework |
Source: comparison synthesized from each project's official documentation and the v1.0 announcement post.[^4][^17]
DPO and its many descendants (SimPO, KTO, ORPO) form the offline preference-optimization side of TRL's API surface, while PPO and GRPO provide the online RL side anchored by a learned or verifiable reward. The library sits in the post-training stage of the modern LLM pipeline, immediately downstream of pretraining and immediately upstream of evaluation. Closely related libraries include Hugging Face Transformers (its dependency), PEFT (parameter-efficient fine-tuning adapters), and vLLM (the rollout backend for online RL). On the dataset side, public preference datasets such as UltraFeedback and Stack Exchange Paired are the canonical inputs to the offline trainers.[^8][^9]