HuggingFace TRL

Open Source AI Reinforcement Learning Training & Optimization

22 min read

Updated Jul 11, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 11, 2026

Fact-checked

In review queue

Sources

20 citations

Revision

v3 · 4,318 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

TRL (Transformer Reinforcement Learning, now stylized as Transformers Reinforcement Learning) is an open-source Python library maintained by Hugging Face for post-training large language models with reinforcement learning and preference-optimization methods.^[1] Originally created by Leandro von Werra in 2020 as a Proximal Policy Optimization implementation for GPT-2,^[2] the library evolved into the de facto reference implementation of the modern post-training stack, providing trainers for supervised fine-tuning, DPO, GRPO, PPO, reward modeling, and many derivative algorithms.^[1]^[3] TRL is released under the Apache 2.0 license, integrates tightly with the Transformers ecosystem, and underpins higher-level frameworks such as Axolotl and LLaMA-Factory that wrap its trainers behind YAML-based configuration interfaces.^[4]^[5] The library reached version 1.0 on March 31, 2026, a transition from research codebase to stable production library, and by that release implemented more than 75 post-training methods.^[4] As of mid-2026 it is downloaded roughly 3.9 million times per month on PyPI.^[19]

Infobox

Property	Value
Original author	Leandro von Werra
Current maintainer	Hugging Face (lead: Quentin Gallouédec)
Initial release	2020
First HF-namespace release tracked	2022 (under `huggingface/trl`)
Stable v1.0 release	March 31, 2026
Latest release	v1.8.0 (July 9, 2026)
Post-training methods	more than 75 (at v1.0)
License	Apache 2.0
Language	Python (>= 3.10)
Repository	github.com/huggingface/trl
GitHub stars (July 2026)	~18.8k
Monthly downloads (PyPI, mid-2026)	~3.9 million

History

Origin: a PPO library for GPT-2 (2020)

TRL began in 2020 as a personal project by Leandro von Werra, who at the time wrote a PyTorch implementation of Proximal Policy Optimization that could be applied to transformer language models.^[2] The earliest released versions on PyPI described the package as "A Pytorch implementation of Proximal Policy Optimization for transfomer language models," and its motivating example trained GPT-2 to produce positive movie reviews using a BERT-based sentiment classifier as a learned reward.^[2] The library was hosted at lvwerra/trl on GitHub and shipped helper modules such as AutoModelForCausalLMWithValueHead and AutoModelForSeq2SeqLMWithValueHead, which add a scalar value head to causal and encoder-decoder transformers so they can be plugged into a standard actor-critic PPO loop.^[2] Version 0.3.1, released on March 2, 2023, was still classified as pre-alpha and largely focused on the single PPO trainer.^[2]

Hugging Face integration and the RLHF era (2022 to mid-2023)

By 2022 the project had been adopted into the Hugging Face namespace at huggingface/trl, and von Werra joined Hugging Face as a maintainer.^[1] During this period the library was the main open-source implementation of the four-stage Reinforcement Learning from Human Feedback recipe popularized by OpenAI's InstructGPT paper: a supervised fine-tuning stage, a reward model training stage, a PPO stage with a frozen reference policy, and a KL penalty against that reference.^[1]^[3] An influential April 5, 2023 Hugging Face blog post, "StackLLaMA: A hands-on guide to train LLaMA with RLHF," walked through the full pipeline using TRL on a LLaMA model, and a March 9, 2023 post showed how to fine-tune 20-billion-parameter models with RLHF on a single 24 GB consumer GPU by combining TRL with PEFT adapters and 8-bit optimizers.^[3]

The preference optimization wave (2023 to 2024)

The Direct Preference Optimization paper by Rafailov, Sharma, Mitchell, Ermon, Manning, and Finn was published in May 2023 and argued that the entire RL stack of RLHF could be replaced by a simple binary classification loss against a reference policy.^[6] TRL absorbed this finding rapidly: version 0.5.0 introduced the DPOTrainer,^[7] and a Hugging Face blog post dated August 8, 2023 titled "Fine-tune Llama 2 with DPO" demonstrated the new trainer applied to Llama 2 together with QLoRA adapters and the Stack Exchange Paired dataset.^[8] The DPO contribution was authored by Kashif Rasul and was later refactored by Quentin Gallouédec.^[9]

Over the following year TRL added trainers for nearly every variant of preference optimization that appeared in the literature. The DPOTrainer itself grew a large loss_type enum exposing alternative objectives: the original Bradley-Terry sigmoid loss; ipo for the Identity Preference Optimization objective of Azar et al.; hinge for the RSO/SLiC formulation; sigmoid_norm exposing the length-normalized SimPO objective of Meng et al.; bco_pair for the Binary Classifier Optimization formulation; robust for the noise-aware loss of Chowdhury et al.; aot and aot_unpaired for Distributional Preference Alignment via Optimal Transport; apo_zero and apo_down for the anchored objective; sppo_hard, nca_pair, exo_pair, and discopop; and additional combinations for Mixed Preference Optimization.^[9] Standalone trainers for the KTO objective of Ethayarajh et al., ORPO of Hong et al., and CPO of Xu et al. were added during 2024 as separate classes rather than as DPO loss types.^[1]^[10]

The PPO line did not disappear. A June 12, 2024 Hugging Face blog post titled "Putting RL back in RLHF" introduced the RLOOTrainer, an implementation of the REINFORCE Leave-One-Out method described in Cohere's paper "Back to Basics: Revisiting REINFORCE-style Optimization for Learning from Human Feedback in LLMs" by Ahmadian and colleagues.^[11] RLOO drops the learned value model (loading three model copies rather than four) and uses the mean of the other completions in a batch as the baseline for each completion, which the blog reported as roughly 50 to 70 percent lower memory and 2 to 3 times faster wall-clock training than PPO while remaining competitive on win-rate.^[11]

The GRPO and verifiable-reward era (2025)

DeepSeek-AI's January 2025 paper introducing DeepSeek-R1 popularized Group Relative Policy Optimization, an algorithm the same group had originally published in the DeepSeekMath paper of February 2024.^[12] GRPO eliminates the value model entirely by computing the advantage as the within-group z-score of rewards across multiple completions sampled for the same prompt, recovering most of the variance-reduction benefit of a critic at a fraction of the memory cost. TRL added a GRPOTrainer shortly afterward, contributed by Quentin Gallouédec.^[12] A January 28, 2025 Hugging Face blog post, "Open-R1: a fully open reproduction of DeepSeek-R1," used the new TRL trainer as one of the central building blocks of an effort to replicate the R1-Zero and R1 training pipeline with open data and open code.^[13]

After GRPO, TRL's emphasis shifted toward online RL with verifiable rewards (often abbreviated RLVR), which removed the need to train a separate reward model in many reasoning and code-generation settings. The library added or stabilized trainers for OnlineDPOTrainer, NashMDTrainer, XPOTrainer, and PRMTrainer (Process Reward Model training), and the May 25, 2025 post "Liger GRPO meets TRL" and the June 3, 2025 post "NO GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL" introduced significant kernel- and inference-level optimizations.^[3]

TRL v1.0 and the chaos-adaptive design (2026)

The v1.0 release on March 31, 2026 was announced in a Hugging Face blog post, "TRL v1.0: Post-Training Library Built to Move with the Field."^[4] The post described what its authors call a "chaos-adaptive" design: rather than "try to capture the essence of what's stable today," the library is built to "design around what could change."^[4] In practice, stable trainers that follow semantic versioning live in the top-level trl namespace, while experimental implementations live in trl.experimental and may be reworked or removed between minor versions. As the announcement puts it, "The stable core follows semantic versioning. The experimental layer makes no such promises."^[4] The blog explicitly framed three eras of post-training that the library has traversed: a "PPO era" centered on a policy, reference, reward, and value model loop; a "DPO era" that cut through the stack by removing the reward and value models; and an "RLVR/GRPO era" that relies on verifier-based rewards and renewed emphasis on sampling efficiency.^[4] At the v1.0 milestone the library implemented more than 75 post-training methods across these paradigms.^[4]

The stable core trainers at v1.0 are SFTTrainer, DPOTrainer, RewardTrainer, RLOOTrainer, and GRPOTrainer, "along with their close variants."^[4] Subsequent versions added asynchronous GRPO (parallel generation and training), on-policy knowledge-distillation trainers (GKDTrainer and MiniLLMTrainer), extended tool-calling and vision-language model support, and a chunked cross-entropy loss that the v1.4 release notes report as reducing SFT VRAM use by up to 50 percent.^[14] The July 9, 2026 v1.8.0 release graduated the KTOTrainer to the stable API after aligning its interface with DPOTrainer, and extended GRPOTrainer to multi-environment agentic RL in which each sandboxed environment can define its own reward.^[1]^[14]

Technical details

Which trainers does TRL provide?

The TRL documentation organizes trainers into four groups: online methods, offline methods, reward modeling, and knowledge distillation.^[1] Online methods sample completions from the current policy during training and apply RL updates against a reward; offline methods consume static datasets of preferences or completions.

Group	Trainer	Status	vLLM rollouts
Online	`GRPOTrainer`	stable	yes
Online	`RLOOTrainer`	stable	yes
Online	`OnlineDPOTrainer`	experimental	yes
Online	`NashMDTrainer`	experimental	yes
Online	`XPOTrainer`	experimental	yes
Online	`PPOTrainer`	experimental	no
Reward modeling	`RewardTrainer`	stable	n/a
Reward modeling	`PRMTrainer`	experimental	n/a
Offline	`SFTTrainer`	stable	n/a
Offline	`DPOTrainer`	stable	n/a
Offline	`KTOTrainer`	stable	n/a
Offline	`BCOTrainer`	experimental	n/a
Offline	`CPOTrainer`	experimental	n/a
Offline	`ORPOTrainer`	experimental	n/a
Distillation	`GKDTrainer`	experimental	n/a
Distillation	`MiniLLMTrainer`	experimental	n/a

Source: TRL documentation index, as of July 2026.^[1] The KTOTrainer moved from the experimental namespace to the stable API in the v1.8.0 release.^[1]

Each trainer is a thin subclass of the Hugging Face Transformers Trainer class, which means it inherits the standard TrainingArguments, checkpointing, callback, and metric-logging machinery, and natively supports distributed training methods such as DDP, DeepSpeed ZeRO, and FSDP.^[5]

Supervised fine-tuning

The SFTTrainer handles the language-modeling stage that precedes most preference optimization. It supports both standard and conversational dataset formats, automatically applies chat templates, packs sequences into fixed-length chunks via a Best-Fit Decreasing algorithm, and offers token-level masking so that loss is computed only on the assistant turn rather than on the user prompt.^[4] The v1.4 release added a chunked cross-entropy loss (loss_type="chunked_nll") that splits the LM-head computation across the vocabulary dimension and accumulates gradients, reducing peak memory by up to 50 percent on long sequences. The release notes cite roughly a 3.9 times reduction in peak memory on Qwen3-1.7B with LoRA and about a 1.5 times reduction on Qwen3-14B, achieved by avoiding materialization of the full batch-by-sequence-by-vocabulary logits tensor.^[14]

Direct Preference Optimization

The DPOTrainer consumes a preference dataset of {prompt, chosen, rejected} triples and minimizes the loss

L_DPO(θ) = -E[log σ(β(log π_θ(y+|x)/π_ref(y+|x) - log π_θ(y-|x)/π_ref(y-|x)))],

where π_θ is the policy being trained, π_ref is a frozen reference policy (typically the supervised checkpoint), σ is the logistic sigmoid, and β controls the strength of the preference signal.^[9] The trainer optionally caches reference log-probabilities so the reference model need not be kept in memory during the main optimization loop, supports synchronization of the reference model with an EMA (Exponential Moving Average) of the policy (sync_ref_model=True), and can combine multiple loss types with different weights to implement Mixed Preference Optimization.^[9] The trainer's default configuration overrides several TrainingArguments defaults: learning_rate defaults to 1e-6 (rather than 5e-5), gradient_checkpointing defaults to True, bf16 defaults to True when fp16 is not set, and logging_steps defaults to 10.^[9]

Group Relative Policy Optimization

The GRPOTrainer implements the algorithm from the DeepSeekMath paper.^[12] For each prompt, the trainer samples G completions from the current policy, evaluates them with a reward function (often a deterministic verifier such as a math-answer checker or a unit-test runner), and computes the advantage of completion i as

A_i = (r_i - mean(r)) / std(r),

where the mean and standard deviation are taken across the group of G completions for that prompt. The policy is then updated with a clipped surrogate objective analogous to PPO but using these within-group advantages, and a separate KL penalty against the reference policy is added.^[12] Because there is no learned value model, GRPO loads three model copies (policy, reference, optional reward model) rather than four, and the memory savings versus PPO are significant for large models.^[4]^[12]

REINFORCE Leave-One-Out (RLOO)

The RLOOTrainer treats the entire completion as a single action rather than as a token-level trajectory and uses the average reward of the other completions in the same batch as a baseline.^[11] Concretely, for completions y_1 to y_k sampled for the same prompt, the advantage of completion i is

A_i = r_i - (1/(k-1)) Σ_{j ≠ i} r_j,

and the gradient is the REINFORCE estimator log π_θ(y_i|x) · A_i. The Cohere method, integrated into TRL via the June 2024 blog post, reported a 40.1 percent win rate against an SFT baseline (which itself won 21.3 percent) at 1B scale and a 78.7 percent preferred rate at 6.9B scale as judged by GPT-4.^[11]

Reward modeling

The RewardTrainer fits a scalar reward model on {prompt, chosen, rejected} data with the standard Bradley-Terry pairwise log-likelihood, producing a model that takes a prompt+completion pair and returns a scalar.^[1] Reward models trained with this trainer are interoperable with all of TRL's online RL trainers (PPO, GRPO, RLOO, etc.) and can be loaded from the Hugging Face Hub. An experimental PRMTrainer extends this idea to step-level supervision for process reward models used in math and code reasoning.^[1]

Dataset formats

TRL standardizes a small set of dataset schemas so that the same dataset can be reused across trainers with minimal rewiring. The library distinguishes "standard" datasets (plain prompt, completion, chosen, or rejected text fields) from "conversational" datasets (lists of {role, content} messages that get materialized through a chat template at training time).^[9] Preference datasets carry a prompt plus chosen and rejected either as full conversations or as final-turn completions, with an "implicit prompt" form that omits the explicit prompt field when the chosen and rejected messages already share a common prefix.^[9] For vision-language training, an additional image or images column is consumed by DataCollatorForVisionPreference; for tool-calling fine-tuning, a tools column carries JSON schemas of available tools and the chosen and rejected completions may include tool_calls and tool role messages.^[9] These conventions are now followed by most major preference datasets on the Hugging Face Hub, which is what makes the trainers feel interchangeable.

Logged metrics and observability

Every preference-optimization trainer reports a fixed set of metrics intended to make alignment dynamics legible: rewards/chosen and rewards/rejected (the implicit DPO rewards), rewards/margins (their difference), rewards/accuracies (the fraction of examples where the chosen reward exceeds the rejected reward), per-token log-probabilities logps/chosen and logps/rejected, an entropy term over the model's predictive distribution, the gradient norm before clipping, and a mean_token_accuracy measuring top-1 agreement with the chosen completion.^[9] For GRPO and RLOO the trainers additionally log per-group reward standard deviations and per-prompt KL divergences against the reference policy.^[12] The v1.0 release announced an emerging "training legibility" effort to embed heuristics that surface actionable warnings (e.g., reward-margin collapse, KL drift) so users can diagnose runaway training without manually staring at dashboards.^[4]

Reference model handling and PEFT integration

A persistent practical concern with KL-anchored RLHF and DPO is the memory cost of keeping a frozen reference model alongside the policy. TRL offers three mitigations: precomputing reference log-probabilities at the start of training and caching them on disk (precompute_ref_log_probs=True), training adapters (typically LoRA or QLoRA) and using the base model with the adapter disabled as the implicit reference (eliminating the need for a second model copy), and synchronizing the reference model with an EMA of the policy.^[9] The PEFT integration is first-class: passing a peft_config=LoraConfig(...) to any trainer wraps the model with PEFT before training begins, and adapters can be pushed to the Hub at the end of training.^[9]

vLLM rollout backend

A bottleneck in online RL training is the latency of sampling completions from the current policy. TRL integrates with vLLM to provide high-throughput, low-latency generation during online training.^[15] Two modes are supported. In colocate mode, the trainer process holds both the training model and a vLLM engine, which share GPU memory; this is the simpler deployment but can fragment memory and introduces synchronization between training and generation. In server mode, vLLM runs as an independent HTTP server on its own GPUs and the trainer pushes weight updates to it after each policy step; this scales better and is the only mode that supports custom rollout_func callbacks.^[15] The supported trainers for vLLM rollouts at v1.4 are GRPOTrainer, RLOOTrainer, OnlineDPOTrainer, NashMDTrainer, and XPOTrainer.^[15] As of v1.4 the supported vLLM range is 0.12.0 to 0.18.0, with data-parallel scaling for dense (non-MoE) models removed after vLLM 0.14.0.^[15]

Accelerate, DeepSpeed, FSDP, and distributed training

Because TRL trainers subclass the Transformers Trainer, they inherit the Hugging Face Accelerate launcher, which uniformly exposes single-GPU, multi-GPU, multi-node, and TPU configurations through a single accelerate config step.^[5] DeepSpeed ZeRO stages 1, 2, and 3 are supported (the third optionally with CPU and NVMe offload), as is PyTorch FSDP. The v1.0 release notes explicitly call out training distribution stability and MoE/expert parallelism as ongoing scaling priorities.^[4]

Other integrations

TRL integrates with Unsloth for kernel-level fine-tuning speedups: Hugging Face and Unsloth report that passing an Unsloth-optimized model to the SFTTrainer or DPOTrainer can make fine-tuning up to roughly 2x faster while using substantially less memory (up to about 74 percent less in the collaboration's benchmarks).^[20] It also integrates with Liger Kernel from LinkedIn for Triton-based fused operators (the v0.x line added a Liger-aware GRPO loss in May 2025); with the OpenEnv environment standard (introduced via the October 23, 2025 blog post "Building the Open Agent Ecosystem Together: Introducing OpenEnv") and the Harbor environment suite for agent-driven, multi-environment training loops; with math-verify for math-answer verification rewards; and with the kernels and quantization optional dependencies for low-precision training.^[1]^[3]^[16]

Variants and downstream use

Frameworks that wrap TRL

TRL is the substrate for most modern open-source post-training pipelines. Two prominent wrappers are Axolotl and LLaMA-Factory, both of which expose YAML-based configuration interfaces and delegate the underlying training to TRL trainers.^[5]^[17] Axolotl's RLHF documentation states that it "relies on the TRL library for implementations of various RL training methods" including DPO, KTO, ORPO, and PPO, and adds higher-level conveniences such as dataset loaders, multi-config orchestration, and DeepSpeed/FSDP launchers on top.^[17] LLaMA-Factory similarly wraps TRL trainers and adds a Gradio web UI for non-programmatic post-training.^[17] Unsloth ships custom kernels that interoperate with TRL's stable trainers (its DPO and SFT examples in the official documentation are essentially TRL examples with a faster optimizer attached).^[1] RapidFire AI sits on top of TRL specifically for multi-configuration DPO experimentation on a single GPU.^[9]

Models trained with TRL

The TRL documentation lists over 1,000 community models trained with DPOTrainer and tagged with dpo,trl on the Hugging Face Hub, with similar tag pages for GRPO, ORPO, and KTO.^[9]^[12] The trl-lib organization on the Hub hosts canonical reference models and datasets used in the documentation examples.^[1] The Hugging Face team's own Zephyr models (a Mistral 7B fine-tune) and the Tülu 3 series from Allen AI use TRL as the post-training engine, as does the Open-R1 project's reproduction of DeepSeek-R1.^[13]

TRL on the Hugging Face Hub

TRL ships a CLI (trl sft, trl dpo, trl grpo) that lets users launch fine-tuning runs without writing Python, reading both model and dataset directly from the Hub and pushing the resulting checkpoint back at the end of training.^[4] Trainers also automatically populate a model card with training arguments, dataset metadata, and per-epoch loss curves when push_to_hub=True.^[9]

What is TRL used for?

The library is used wherever an open-source large language model needs to be aligned with human preferences, instruction-followed, or specialized to a domain. Concrete applications documented by Hugging Face and the broader community include alignment of base models such as Llama 2, Llama 3, Mistral, and Qwen with conversational preference data;^[3]^[8] reasoning fine-tuning of math models using GRPO with deterministic answer-checking rewards, as in the Open-R1 project;^[13] code generation fine-tuning using unit-test pass/fail as the reward; tool-calling and agent training using OpenEnv environments;^[16] vision-language model alignment (the August 7, 2025 blog post "Vision Language Model Alignment in TRL" covers VLM preference optimization with DPO);^[3] and image-generation alignment via the legacy DDPOTrainer, introduced in the September 29, 2023 post "Finetune Stable Diffusion Models with DDPO via TRL" for Stable Diffusion models.^[18]

What are the limitations of TRL?

TRL inherits the difficulties of the training methods it implements. PPO training of language models is notoriously unstable, and the older PPOTrainer is now classified as experimental in v1.0.^[4] DPO and its variants are sensitive to the SFT checkpoint quality and to the choice of β; mis-tuning β can cause the policy to collapse onto the chosen completions (over-fitting) or to remain too close to the reference (under-fitting). The June 2024 RLOO blog post documented a numerical-stability issue in bf16 precision where roughly 20 to 40 percent of RLOO gradient batches were nulled by gradient clipping versus around 3 percent for PPO, attributable to log-probability drift between generation and training rather than to an algorithmic flaw.^[11]

The library's rapid pace of change has occasionally created backward-compatibility friction: argument names and defaults have shifted across minor versions (the trainer constructor signatures for DPO, KTO, and ORPO have been refactored several times), and the v1.0 release explicitly published a MIGRATION.md guide for users coming from the 0.x line.^[4] Reproduction of preference-optimization papers using TRL has sometimes been complicated by the fact that loss-type names and default hyperparameters in TRL do not always match the original papers' notation. The chaos-adaptive philosophy openly accepts this tradeoff: the experimental namespace exists precisely so that breaking changes can be made without violating semantic versioning of the stable surface.^[4]

A second limitation is that TRL is fundamentally a single-policy library. Methods that require multiple competing policies (self-play tournament approaches), elaborate environment loops with long episodes, or fully off-policy RL with large replay buffers either map awkwardly onto its training-step abstraction or are out of scope; these workloads are typically handled by purpose-built frameworks such as OpenRLHF, veRL, or Nemotron's reasoning pipeline. Some users have noted that the integration of agentic environments via OpenEnv, while progressing, is less mature than dedicated agentic RL frameworks.^[16]

How does TRL compare to other post-training libraries?

Library	Primary focus	Wraps TRL	License	Notes
TRL	Trainers for SFT, DPO, GRPO, RLOO, PPO, reward modeling	n/a	Apache 2.0	Reference implementation; HF maintained
Axolotl	YAML-configured fine-tuning	yes	Apache 2.0	Adds dataset/config layer over TRL
LLaMA-Factory	YAML+Gradio UI fine-tuning	yes	Apache 2.0	Targets ease of use; wraps TRL trainers
Unsloth	Kernel-level fast fine-tuning	interoperable	Apache 2.0	Custom Triton kernels usable with TRL
OpenRLHF	Multi-node PPO/DPO/GRPO at scale	no	Apache 2.0	Native Ray-based RL framework

Source: comparison synthesized from each project's official documentation and the v1.0 announcement post.^[4]^[17]

DPO and its many descendants (SimPO, KTO, ORPO) form the offline preference-optimization side of TRL's API surface, while PPO and GRPO provide the online RL side anchored by a learned or verifiable reward. The library sits in the post-training stage of the modern LLM pipeline, immediately downstream of pretraining and immediately upstream of evaluation. Closely related libraries include Hugging Face Transformers (its dependency), PEFT (parameter-efficient fine-tuning adapters), and vLLM (the rollout backend for online RL). On the dataset side, public preference datasets such as UltraFeedback and Stack Exchange Paired are the canonical inputs to the offline trainers.^[8]^[9]

References

Hugging Face, "TRL - Transformers Reinforcement Learning", Hugging Face documentation, 2026. https://huggingface.co/docs/trl/en/index. Accessed 2026-07-12. ↩
Leandro von Werra, "trl 0.3.1", PyPI, 2023-03-02. https://pypi.org/project/trl/0.3.1/. Accessed 2026-05-20. ↩
Hugging Face, "TRL blog posts index", Hugging Face documentation, 2026. https://huggingface.co/docs/trl/en/index. Accessed 2026-05-20. ↩
Hugging Face, "TRL v1.0: Post-Training Library Built to Move with the Field", Hugging Face Blog, 2026-03-31. https://huggingface.co/blog/trl-v1. Accessed 2026-07-12. ↩
huggingface/trl contributors, "huggingface/trl repository README", GitHub, 2026. https://github.com/huggingface/trl. Accessed 2026-07-12. ↩
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn, "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", arXiv:2305.18290, 2023-05-29. https://huggingface.co/papers/2305.18290. Accessed 2026-05-20. ↩
huggingface/trl contributors, "Releases - huggingface/trl", GitHub, 2026. https://github.com/huggingface/trl/releases. Accessed 2026-05-20. ↩
Younes Belkada, Kashif Rasul, Leandro von Werra, "Fine-tune Llama 2 with DPO", Hugging Face Blog, 2023-08-08. https://huggingface.co/blog/dpo-trl. Accessed 2026-05-20. ↩
Hugging Face, "DPO Trainer", TRL documentation, 2026. https://huggingface.co/docs/trl/en/dpo_trainer. Accessed 2026-05-20. ↩
Hugging Face, "CPO Trainer", TRL documentation, 2026. https://huggingface.co/docs/trl/en/cpo_trainer. Accessed 2026-05-20. ↩
Costa Huang, Shengyi Huang, Quentin Gallouédec et al., "Putting RL back in RLHF", Hugging Face Blog, 2024-06-12. https://huggingface.co/blog/putting_rl_back_in_rlhf_with_rloo. Accessed 2026-07-12. ↩
Hugging Face, "GRPO Trainer", TRL documentation, 2026. https://huggingface.co/docs/trl/en/grpo_trainer. Accessed 2026-05-20. ↩
Hugging Face, "Open-R1: a fully open reproduction of DeepSeek-R1", Hugging Face Blog, 2025-01-28. https://huggingface.co/blog/open-r1. Accessed 2026-05-20. ↩
huggingface/trl contributors, "Release notes for v1.0.0 through v1.8.0", GitHub, 2026. https://github.com/huggingface/trl/releases. Accessed 2026-07-12. ↩
Hugging Face, "vLLM Integration", TRL documentation, 2026. https://github.com/huggingface/trl/blob/main/docs/source/vllm_integration.md. Accessed 2026-05-20. ↩
Hugging Face, "OpenEnv Integration for Training LLMs with Environments", TRL documentation, 2026. https://huggingface.co/docs/trl/v0.25.0/openenv. Accessed 2026-05-20. ↩
Axolotl contributors, "RLHF (Beta)", Axolotl documentation, 2026. https://docs.axolotl.ai/docs/rlhf.html. Accessed 2026-05-20. ↩
Metric-Space, Sayak Paul, Kashif Rasul, Leandro von Werra, "Finetune Stable Diffusion Models with DDPO via TRL", Hugging Face Blog, 2023-09-29. https://huggingface.co/blog/trl-ddpo. Accessed 2026-05-20. ↩
PyPI Stats, "trl download statistics", pypistats.org, 2026. https://pypistats.org/packages/trl. Accessed 2026-07-12. ↩
Hugging Face and Unsloth, "Make LLM Fine-tuning 2x faster with Unsloth and TRL", Hugging Face Blog, 2024. https://huggingface.co/blog/unsloth-trl. Accessed 2026-07-12. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributor · full history

Suggest edit

What links here

Mistral 7B NormalFloat 4-bit (NF4)RLOO (REINFORCE Leave-One-Out)

Infobox

History

Origin: a PPO library for GPT-2 (2020)

Hugging Face integration and the RLHF era (2022 to mid-2023)

The preference optimization wave (2023 to 2024)

The GRPO and verifiable-reward era (2025)

TRL v1.0 and the chaos-adaptive design (2026)

Technical details

Which trainers does TRL provide?

Supervised fine-tuning

Direct Preference Optimization

Group Relative Policy Optimization

REINFORCE Leave-One-Out (RLOO)

Reward modeling

Dataset formats

Logged metrics and observability

Reference model handling and PEFT integration

vLLM rollout backend

Accelerate, DeepSpeed, FSDP, and distributed training

Other integrations

Variants and downstream use

Frameworks that wrap TRL

Models trained with TRL

TRL on the Hugging Face Hub

What is TRL used for?

What are the limitations of TRL?

How does TRL compare to other post-training libraries?

Related work

See also

References

Improve this article

Related Articles

Policy gradient methods

GRPO

KTO

RLVR

Proximal Policy Optimization (PPO)

RLOO (REINFORCE Leave-One-Out)

What links here

Related Articles

Policy gradient methods

GRPO

KTO

RLVR

Proximal Policy Optimization (PPO)

RLOO (REINFORCE Leave-One-Out)

What links here