# HuggingFace TRL

> Source: https://aiwiki.ai/wiki/huggingface_trl
> Updated: 2026-06-07
> Categories: Open Source AI, Reinforcement Learning, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

# HuggingFace TRL

**TRL** (Transformer Reinforcement Learning, now stylized as Transformers Reinforcement Learning) is an open-source Python library maintained by [Hugging Face](/wiki/hugging_face) for post-training large language models with reinforcement learning and preference-optimization methods.[^1] Originally created by Leandro von Werra in 2020 as a Proximal Policy Optimization implementation for [GPT-2](/wiki/gpt2),[^2] the library evolved into the de facto reference implementation of the modern post-training stack, providing trainers for supervised fine-tuning, [DPO](/wiki/direct_preference_optimization_dpo), [GRPO](/wiki/grpo), [PPO](/wiki/ppo), reward modeling, and many derivative algorithms.[^1][^3] TRL is released under the Apache 2.0 license, integrates tightly with the [Transformers](/wiki/transformers) ecosystem, and underpins higher-level frameworks such as [Axolotl](/wiki/axolotl) and LLaMA-Factory that wrap its trainers behind YAML-based configuration interfaces.[^4][^5] The library reached version 1.0 on March 31, 2026, marking a transition from research codebase to stable production library, and is downloaded roughly three million times per month.[^4]

## Infobox

| Property | Value |
|---|---|
| Original author | Leandro von Werra |
| Current maintainer | Hugging Face (lead: Quentin Gallouédec) |
| Initial release | 2020 |
| First HF-namespace release tracked | 2022 (under `huggingface/trl`) |
| Stable v1.0 release | March 31, 2026 |
| Latest release (May 2026) | v1.4.0 (May 8, 2026) |
| License | Apache 2.0 |
| Language | Python (>= 3.10) |
| Repository | github.com/huggingface/trl |
| GitHub stars (May 2026) | ~18.4k |
| Monthly downloads (2026) | ~3 million |

## History

### Origin: a PPO library for GPT-2 (2020)

TRL began in 2020 as a personal project by Leandro von Werra, who at the time wrote a PyTorch implementation of Proximal Policy Optimization that could be applied to transformer language models.[^2] The earliest released versions on PyPI described the package as "A Pytorch implementation of Proximal Policy Optimization for transfomer language models," and its motivating example trained GPT-2 to produce positive movie reviews using a BERT-based sentiment classifier as a learned reward.[^2] The library was hosted at `lvwerra/trl` on GitHub and shipped helper modules such as `AutoModelForCausalLMWithValueHead` and `AutoModelForSeq2SeqLMWithValueHead`, which add a scalar value head to causal and encoder-decoder transformers so they can be plugged into a standard actor-critic PPO loop.[^2] Version 0.3.1, released on March 2, 2023, was still classified as pre-alpha and largely focused on the single PPO trainer.[^2]

### Hugging Face integration and the RLHF era (2022 to mid-2023)

By 2022 the project had been adopted into the Hugging Face namespace at `huggingface/trl`, and von Werra joined Hugging Face as a maintainer.[^1] During this period the library was the main open-source implementation of the four-stage [Reinforcement Learning from Human Feedback](/wiki/rlhf) recipe popularized by OpenAI's [InstructGPT](/wiki/instructgpt) paper: a [supervised fine-tuning](/wiki/supervised_fine-tuning) stage, a reward model training stage, a PPO stage with a frozen reference policy, and a KL penalty against that reference.[^1][^3] An influential April 5, 2023 Hugging Face blog post, "StackLLaMA: A hands-on guide to train LLaMA with RLHF," walked through the full pipeline using TRL on a [LLaMA](/wiki/llama) model, and a March 9, 2023 post showed how to fine-tune 20-billion-parameter models with RLHF on a single 24 GB consumer GPU by combining TRL with [PEFT](/wiki/peft) adapters and 8-bit optimizers.[^3]

### The preference optimization wave (2023 to 2024)

The Direct Preference Optimization paper by Rafailov, Sharma, Mitchell, Ermon, Manning, and Finn was published in May 2023 and argued that the entire RL stack of RLHF could be replaced by a simple binary classification loss against a reference policy.[^6] TRL absorbed this finding rapidly: version 0.5.0 introduced the `DPOTrainer`,[^7] and a Hugging Face blog post dated August 8, 2023 titled "Fine-tune Llama 2 with DPO" demonstrated the new trainer applied to [Llama 2](/wiki/llama_2) together with QLoRA adapters and the Stack Exchange Paired dataset.[^8] The DPO contribution was authored by Kashif Rasul and was later refactored by Quentin Gallouédec.[^9]

Over the following year TRL added trainers for nearly every variant of preference optimization that appeared in the literature. The `DPOTrainer` itself grew a large `loss_type` enum exposing alternative objectives: the original Bradley-Terry sigmoid loss; `ipo` for the Identity Preference Optimization objective of Azar et al.; `hinge` for the RSO/SLiC formulation; `sigmoid_norm` exposing the length-normalized [SimPO](/wiki/simpo) objective of Meng et al.; `bco_pair` for the Binary Classifier Optimization formulation; `robust` for the noise-aware loss of Chowdhury et al.; `aot` and `aot_unpaired` for Distributional Preference Alignment via Optimal Transport; `apo_zero` and `apo_down` for the anchored objective; `sppo_hard`, `nca_pair`, `exo_pair`, and `discopop`; and additional combinations for Mixed Preference Optimization.[^9] Standalone trainers for the [KTO](/wiki/kto) objective of Ethayarajh et al., [ORPO](/wiki/orpo) of Hong et al., and CPO of Xu et al. were added during 2024 as separate classes rather than as DPO loss types.[^1][^10]

The PPO line did not disappear. A June 12, 2024 Hugging Face blog post titled "Putting RL back in RLHF" introduced the `RLOOTrainer`, an implementation of the REINFORCE Leave-One-Out method described in Cohere's paper "Back to Basics: Revisiting REINFORCE-style Optimization for Learning from Human Feedback in LLMs" by Ahmadian and colleagues.[^11] RLOO drops the learned value model (loading three model copies rather than four) and uses the mean of the other completions in a batch as the baseline for each completion, which the blog reported as roughly 50 to 70 percent lower memory and 2 to 3 times faster wall-clock training than PPO while remaining competitive on win-rate.[^11]

### The GRPO and verifiable-reward era (2025)

DeepSeek-AI's January 2025 paper introducing [DeepSeek-R1](/wiki/deepseek_r1) popularized Group Relative Policy Optimization, an algorithm the same group had originally published in the DeepSeekMath paper of February 2024.[^12] GRPO eliminates the value model entirely by computing the advantage as the within-group z-score of rewards across multiple completions sampled for the same prompt, recovering most of the variance-reduction benefit of a critic at a fraction of the memory cost. TRL added a `GRPOTrainer` shortly afterward, contributed by Quentin Gallouédec.[^12] A January 28, 2025 Hugging Face blog post, "Open-R1: a fully open reproduction of DeepSeek-R1," used the new TRL trainer as one of the central building blocks of an effort to replicate the R1-Zero and R1 training pipeline with open data and open code.[^13]

After GRPO, TRL's emphasis shifted toward online RL with verifiable rewards (often abbreviated RLVR), which removed the need to train a separate reward model in many reasoning and code-generation settings. The library added or stabilized trainers for `OnlineDPOTrainer`, `NashMDTrainer`, `XPOTrainer`, and `PRMTrainer` (Process Reward Model training), and the May 25, 2025 post "Liger GRPO meets TRL" and the June 3, 2025 post "NO GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL" introduced significant kernel- and inference-level optimizations.[^3]

### TRL v1.0 and the chaos-adaptive design (2026)

The v1.0 release on March 31, 2026 was accompanied by a Hugging Face blog post titled "TRL v1: Post-Training Library That Holds When the Field Invalidates Its Own Assumptions."[^4] The post described a "chaos-adaptive" design philosophy: stable trainers that follow semantic versioning live in the top-level `trl` namespace, while experimental implementations live in `trl.experimental` and may be reworked or removed between minor versions. The blog explicitly framed three eras of post-training that the library has traversed: a "PPO era" centered on a policy, reference, reward, and value model loop; a "DPO era" that cut through the stack by removing the reward and value models; and an "RLVR/GRPO era" that relies on verifier-based rewards and renewed emphasis on sampling efficiency.[^4]

The stable core trainers at v1.0 are `SFTTrainer`, `DPOTrainer`, `RewardTrainer`, `RLOOTrainer`, and `GRPOTrainer`.[^4] Subsequent versions added asynchronous GRPO (parallel generation and training), the `DistillationTrainer` and `SSDTrainer` for on-policy knowledge distillation, extended tool-calling and vision-language model support, and chunked cross-entropy losses that the v1.4 release notes report as reducing SFT VRAM use by up to 50 percent.[^14]

## Technical details

### Trainer taxonomy

The TRL documentation organizes trainers into four groups: online methods, offline methods, reward modeling, and knowledge distillation.[^1] Online methods sample completions from the current policy during training and apply RL updates against a reward; offline methods consume static datasets of preferences or completions.

| Group | Trainer | Status | vLLM rollouts |
|---|---|---|---|
| Online | `GRPOTrainer` | stable | yes |
| Online | `RLOOTrainer` | stable | yes |
| Online | `OnlineDPOTrainer` | experimental | yes |
| Online | `NashMDTrainer` | experimental | yes |
| Online | `XPOTrainer` | experimental | yes |
| Online | `PPOTrainer` | experimental | no |
| Reward modeling | `RewardTrainer` | stable | n/a |
| Reward modeling | `PRMTrainer` | experimental | n/a |
| Offline | `SFTTrainer` | stable | n/a |
| Offline | `DPOTrainer` | stable | n/a |
| Offline | `BCOTrainer` | experimental | n/a |
| Offline | `CPOTrainer` | experimental | n/a |
| Offline | `KTOTrainer` | experimental | n/a |
| Offline | `ORPOTrainer` | experimental | n/a |
| Distillation | `GKDTrainer` | experimental | n/a |
| Distillation | `MiniLLMTrainer` | experimental | n/a |

Source: TRL documentation index.[^1]

Each trainer is a thin subclass of the Hugging Face Transformers `Trainer` class, which means it inherits the standard `TrainingArguments`, checkpointing, callback, and metric-logging machinery, and natively supports distributed training methods such as DDP, [DeepSpeed](/wiki/deepspeed) ZeRO, and [FSDP](/wiki/fsdp).[^5]

### Supervised fine-tuning

The `SFTTrainer` handles the language-modeling stage that precedes most preference optimization. It supports both standard and conversational dataset formats, automatically applies chat templates, packs sequences into fixed-length chunks via a Best-Fit Decreasing algorithm, and offers token-level masking so that loss is computed only on the assistant turn rather than on the user prompt.[^4] The v1.4 release added a chunked cross-entropy loss that splits the LM-head computation across the vocabulary dimension and accumulates gradients, reducing peak memory by up to 50 percent on long sequences.[^14]

### Direct Preference Optimization

The `DPOTrainer` consumes a preference dataset of `{prompt, chosen, rejected}` triples and minimizes the loss

L_DPO(θ) = -E[log σ(β(log π_θ(y+|x)/π_ref(y+|x) - log π_θ(y-|x)/π_ref(y-|x)))],

where π_θ is the policy being trained, π_ref is a frozen reference policy (typically the supervised checkpoint), σ is the logistic sigmoid, and β controls the strength of the preference signal.[^9] The trainer optionally caches reference log-probabilities so the reference model need not be kept in memory during the main optimization loop, supports synchronization of the reference model with an EMA (Exponential Moving Average) of the policy (`sync_ref_model=True`), and can combine multiple loss types with different weights to implement Mixed Preference Optimization.[^9] The trainer's default configuration overrides several `TrainingArguments` defaults: `learning_rate` defaults to 1e-6 (rather than 5e-5), `gradient_checkpointing` defaults to `True`, `bf16` defaults to `True` when `fp16` is not set, and `logging_steps` defaults to 10.[^9]

### Group Relative Policy Optimization

The `GRPOTrainer` implements the algorithm from the DeepSeekMath paper.[^12] For each prompt, the trainer samples G completions from the current policy, evaluates them with a reward function (often a deterministic verifier such as a math-answer checker or a unit-test runner), and computes the advantage of completion i as

A_i = (r_i - mean(r)) / std(r),

where the mean and standard deviation are taken across the group of G completions for that prompt. The policy is then updated with a clipped surrogate objective analogous to PPO but using these within-group advantages, and a separate KL penalty against the reference policy is added.[^12] Because there is no learned value model, GRPO loads three model copies (policy, reference, optional reward model) rather than four, and the memory savings versus PPO are significant for large models.[^4][^12]

### REINFORCE Leave-One-Out (RLOO)

The `RLOOTrainer` treats the entire completion as a single action rather than as a token-level trajectory and uses the average reward of the other completions in the same batch as a baseline.[^11] Concretely, for completions y_1 to y_k sampled for the same prompt, the advantage of completion i is

A_i = r_i - (1/(k-1)) Σ_{j ≠ i} r_j,

and the gradient is the REINFORCE estimator log π_θ(y_i|x) · A_i. The Cohere paper, integrated into TRL via the June 2024 blog post, reported a 40.1 percent win rate against an SFT baseline at 1B scale and a 78.7 percent win rate at 6.9B scale.[^11]

### Reward modeling

The `RewardTrainer` fits a scalar reward model on `{prompt, chosen, rejected}` data with the standard Bradley-Terry pairwise log-likelihood, producing a model that takes a `prompt+completion` pair and returns a scalar.[^1] Reward models trained with this trainer are interoperable with all of TRL's online RL trainers (PPO, GRPO, RLOO, etc.) and can be loaded from the Hugging Face Hub. An experimental `PRMTrainer` extends this idea to step-level supervision for [process reward models](/wiki/process_reward_model) used in math and code reasoning.[^1]

### Dataset formats

TRL standardizes a small set of dataset schemas so that the same dataset can be reused across trainers with minimal rewiring. The library distinguishes "standard" datasets (plain `prompt`, `completion`, `chosen`, or `rejected` text fields) from "conversational" datasets (lists of `{role, content}` messages that get materialized through a chat template at training time).[^9] Preference datasets carry a `prompt` plus `chosen` and `rejected` either as full conversations or as final-turn completions, with an "implicit prompt" form that omits the explicit prompt field when the chosen and rejected messages already share a common prefix.[^9] For vision-language training, an additional `image` or `images` column is consumed by `DataCollatorForVisionPreference`; for tool-calling fine-tuning, a `tools` column carries JSON schemas of available tools and the chosen and rejected completions may include `tool_calls` and `tool` role messages.[^9] These conventions are now followed by most major preference datasets on the Hugging Face Hub, which is what makes the trainers feel interchangeable.

### Logged metrics and observability

Every preference-optimization trainer reports a fixed set of metrics intended to make alignment dynamics legible: `rewards/chosen` and `rewards/rejected` (the implicit DPO rewards), `rewards/margins` (their difference), `rewards/accuracies` (the fraction of examples where the chosen reward exceeds the rejected reward), per-token log-probabilities `logps/chosen` and `logps/rejected`, an `entropy` term over the model's predictive distribution, the gradient norm before clipping, and a `mean_token_accuracy` measuring top-1 agreement with the chosen completion.[^9] For GRPO and RLOO the trainers additionally log per-group reward standard deviations and per-prompt KL divergences against the reference policy.[^12] The v1.0 release announced an emerging "training legibility" effort to embed heuristics that surface actionable warnings (e.g., reward-margin collapse, KL drift) so users can diagnose runaway training without manually staring at dashboards.[^4]

### Reference model handling and PEFT integration

A persistent practical concern with KL-anchored RLHF and DPO is the memory cost of keeping a frozen reference model alongside the policy. TRL offers three mitigations: precomputing reference log-probabilities at the start of training and caching them on disk (`precompute_ref_log_probs=True`), training adapters (typically [LoRA](/wiki/lora) or [QLoRA](/wiki/qlora)) and using the base model with the adapter disabled as the implicit reference (eliminating the need for a second model copy), and synchronizing the reference model with an EMA of the policy.[^9] The PEFT integration is first-class: passing a `peft_config=LoraConfig(...)` to any trainer wraps the model with PEFT before training begins, and adapters can be pushed to the Hub at the end of training.[^9]

### vLLM rollout backend

A bottleneck in online RL training is the latency of sampling completions from the current policy. TRL integrates with [vLLM](/wiki/vllm) to provide high-throughput, low-latency generation during online training.[^15] Two modes are supported. In **colocate** mode, the trainer process holds both the training model and a vLLM engine, which share GPU memory; this is the simpler deployment but can fragment memory and introduces synchronization between training and generation. In **server** mode, vLLM runs as an independent HTTP server on its own GPUs and the trainer pushes weight updates to it after each policy step; this scales better and is the only mode that supports custom `rollout_func` callbacks.[^15] The supported trainers for vLLM rollouts at v1.4 are `GRPOTrainer`, `RLOOTrainer`, `OnlineDPOTrainer`, `NashMDTrainer`, and `XPOTrainer`.[^15] As of v1.4 the supported vLLM range is 0.12.0 to 0.18.0, with data-parallel scaling for dense (non-MoE) models removed after vLLM 0.14.0.[^15]

### Accelerate, DeepSpeed, FSDP, and distributed training

Because TRL trainers subclass the Transformers `Trainer`, they inherit the Hugging Face Accelerate launcher, which uniformly exposes single-GPU, multi-GPU, multi-node, and TPU configurations through a single `accelerate config` step.[^5] DeepSpeed ZeRO stages 1, 2, and 3 are supported (the third optionally with CPU and NVMe offload), as is PyTorch FSDP. The v1.0 release notes explicitly call out training distribution stability and MoE/expert parallelism as ongoing scaling priorities.[^4]

### Other integrations

TRL integrates with [Unsloth](/wiki/unsloth) for kernel-level fine-tuning speedups; with Liger Kernel from LinkedIn for Triton-based fused operators (the v0.x line added a Liger-aware GRPO loss in May 2025); with the OpenEnv environment standard (introduced via the October 2025 "Building the Open Agent Ecosystem Together" blog post) for agent-driven training loops; with `math-verify` for math-answer verification rewards; and with the `kernels` and `quantization` optional dependencies for low-precision training.[^1][^3][^16]

## Variants and downstream use

### Frameworks that wrap TRL

TRL is the substrate for most modern open-source post-training pipelines. Two prominent wrappers are Axolotl and LLaMA-Factory, both of which expose YAML-based configuration interfaces and delegate the underlying training to TRL trainers.[^5][^17] Axolotl's RLHF documentation states that it "relies on the TRL library for implementations of various RL training methods" including DPO, KTO, ORPO, and PPO, and adds higher-level conveniences such as dataset loaders, multi-config orchestration, and DeepSpeed/FSDP launchers on top.[^17] LLaMA-Factory similarly wraps TRL trainers and adds a Gradio web UI for non-programmatic post-training.[^17] Unsloth ships custom kernels that interoperate with TRL's stable trainers (its DPO and SFT examples in the official documentation are essentially TRL examples with a faster optimizer attached).[^1] RapidFire AI sits on top of TRL specifically for multi-configuration DPO experimentation on a single GPU.[^9]

### Models trained with TRL

The TRL documentation lists over 1,000 community models trained with `DPOTrainer` and tagged with `dpo,trl` on the Hugging Face Hub, with similar tag pages for GRPO, ORPO, and KTO.[^9][^12] The `trl-lib` organization on the Hub hosts canonical reference models and datasets used in the documentation examples.[^1] The Hugging Face team's own Zephyr models (a [Mistral 7B](/wiki/mistral_7b) fine-tune) and the [Tülu 3](/wiki/tulu_3) series from Allen AI use TRL as the post-training engine, as does the Open-R1 project's reproduction of DeepSeek-R1.[^13]

### TRL on the Hugging Face Hub

TRL ships a CLI (`trl sft`, `trl dpo`, `trl grpo`) that lets users launch fine-tuning runs without writing Python, reading both model and dataset directly from the Hub and pushing the resulting checkpoint back at the end of training.[^4] Trainers also automatically populate a model card with training arguments, dataset metadata, and per-epoch loss curves when `push_to_hub=True`.[^9]

## Applications

The library is used wherever an open-source large language model needs to be aligned with human preferences, instruction-followed, or specialized to a domain. Concrete applications documented by Hugging Face and the broader community include alignment of base models such as [Llama 2](/wiki/llama_2), [Llama 3](/wiki/llama_3), Mistral, and [Qwen](/wiki/qwen) with conversational preference data;[^3][^8] reasoning fine-tuning of math models using GRPO with deterministic answer-checking rewards, as in the Open-R1 project;[^13] code generation fine-tuning using unit-test pass/fail as the reward; tool-calling and agent training using OpenEnv environments;[^16] vision-language model alignment (the August 7, 2025 blog post "Vision Language Model Alignment in TRL" covers VLM preference optimization with DPO);[^3] and image-generation alignment via the legacy `DDPOTrainer`, introduced in the September 29, 2023 post "Finetune Stable Diffusion Models with DDPO via TRL" for [Stable Diffusion](/wiki/stable_diffusion) models.[^18]

## Limitations and criticisms

TRL inherits the difficulties of the training methods it implements. PPO training of language models is notoriously unstable, and the older `PPOTrainer` is now classified as experimental in v1.0.[^4] DPO and its variants are sensitive to the SFT checkpoint quality and to the choice of β; mis-tuning β can cause the policy to collapse onto the chosen completions (over-fitting) or to remain too close to the reference (under-fitting). The June 2024 RLOO blog post documented a numerical-stability issue in `bf16` precision where roughly 20 to 40 percent of RLOO gradient batches were nulled by gradient clipping versus around 3 percent for PPO, attributable to log-probability drift between generation and training rather than to an algorithmic flaw.[^11]

The library's rapid pace of change has occasionally created backward-compatibility friction: argument names and defaults have shifted across minor versions (the trainer constructor signatures for DPO, KTO, and ORPO have been refactored several times), and the v1.0 release explicitly published a `MIGRATION.md` guide for users coming from the 0.x line.[^4] Reproduction of preference-optimization papers using TRL has sometimes been complicated by the fact that loss-type names and default hyperparameters in TRL do not always match the original papers' notation. The `chaos-adaptive` philosophy openly accepts this tradeoff: the experimental namespace exists precisely so that breaking changes can be made without violating semantic versioning of the stable surface.[^4]

A second limitation is that TRL is fundamentally a single-policy library. Methods that require multiple competing policies (self-play tournament approaches), elaborate environment loops with long episodes, or fully off-policy RL with large replay buffers either map awkwardly onto its training-step abstraction or are out of scope; these workloads are typically handled by purpose-built frameworks such as OpenRLHF, veRL, or Nemotron's reasoning pipeline. Some users have noted that the integration of agentic environments via OpenEnv, while progressing, is less mature than dedicated agentic RL frameworks.[^16]

## Comparison

| Library | Primary focus | Wraps TRL | License | Notes |
|---|---|---|---|---|
| TRL | Trainers for SFT, DPO, GRPO, RLOO, PPO, reward modeling | n/a | Apache 2.0 | Reference implementation; HF maintained |
| [Axolotl](/wiki/axolotl) | YAML-configured fine-tuning | yes | Apache 2.0 | Adds dataset/config layer over TRL |
| LLaMA-Factory | YAML+Gradio UI fine-tuning | yes | Apache 2.0 | Targets ease of use; wraps TRL trainers |
| [Unsloth](/wiki/unsloth) | Kernel-level fast fine-tuning | interoperable | Apache 2.0 | Custom Triton kernels usable with TRL |
| OpenRLHF | Multi-node PPO/DPO/GRPO at scale | no | Apache 2.0 | Native Ray-based RL framework |

Source: comparison synthesized from each project's official documentation and the v1.0 announcement post.[^4][^17]

## Related work

DPO and its many descendants (SimPO, KTO, ORPO) form the offline preference-optimization side of TRL's API surface, while PPO and GRPO provide the online RL side anchored by a learned or verifiable reward. The library sits in the post-training stage of the modern LLM pipeline, immediately downstream of pretraining and immediately upstream of evaluation. Closely related libraries include Hugging Face Transformers (its dependency), PEFT (parameter-efficient fine-tuning adapters), and vLLM (the rollout backend for online RL). On the dataset side, public preference datasets such as UltraFeedback and Stack Exchange Paired are the canonical inputs to the offline trainers.[^8][^9]

## See also

- [Direct Preference Optimization](/wiki/direct_preference_optimization_dpo)
- [Group Relative Policy Optimization](/wiki/grpo)
- [Proximal Policy Optimization](/wiki/ppo)
- [Reinforcement Learning from Human Feedback](/wiki/rlhf)
- [KTO](/wiki/kto)
- [ORPO](/wiki/orpo)
- [SimPO](/wiki/simpo)
- [Supervised fine-tuning](/wiki/supervised_fine-tuning)
- [Post-training](/wiki/post-training)
- [Transformers](/wiki/transformers)
- [PEFT](/wiki/peft)
- [vLLM](/wiki/vllm)
- [Axolotl](/wiki/axolotl)
- [Unsloth](/wiki/unsloth)
- [DeepSpeed](/wiki/deepspeed)
- [FSDP](/wiki/fsdp)
- [LoRA](/wiki/lora)
- [QLoRA](/wiki/qlora)
- [Hugging Face](/wiki/hugging_face)
- [DeepSeek-R1](/wiki/deepseek_r1)
- [Tülu 3](/wiki/tulu_3)

## References

[^1]: Hugging Face, "TRL - Transformers Reinforcement Learning", Hugging Face documentation, 2026. https://huggingface.co/docs/trl/en/index. Accessed 2026-05-20.

[^2]: Leandro von Werra, "trl 0.3.1", PyPI, 2023-03-02. https://pypi.org/project/trl/0.3.1/. Accessed 2026-05-20.

[^3]: Hugging Face, "TRL blog posts index", Hugging Face documentation, 2026. https://huggingface.co/docs/trl/en/index. Accessed 2026-05-20.

[^4]: Hugging Face, "TRL v1: Post-Training Library That Holds When the Field Invalidates Its Own Assumptions", Hugging Face Blog, 2026-03-27. https://huggingface.co/blog/trl-v1. Accessed 2026-05-20.

[^5]: huggingface/trl contributors, "huggingface/trl repository README", GitHub, 2026. https://github.com/huggingface/trl. Accessed 2026-05-20.

[^6]: Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn, "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", arXiv:2305.18290, 2023-05-29. https://huggingface.co/papers/2305.18290. Accessed 2026-05-20.

[^7]: huggingface/trl contributors, "Releases - huggingface/trl", GitHub, 2026. https://github.com/huggingface/trl/releases. Accessed 2026-05-20.

[^8]: Younes Belkada, Kashif Rasul, Leandro von Werra, "Fine-tune Llama 2 with DPO", Hugging Face Blog, 2023-08-08. https://huggingface.co/blog/dpo-trl. Accessed 2026-05-20.

[^9]: Hugging Face, "DPO Trainer", TRL documentation, 2026. https://huggingface.co/docs/trl/en/dpo_trainer. Accessed 2026-05-20.

[^10]: Hugging Face, "CPO Trainer", TRL documentation, 2026. https://huggingface.co/docs/trl/en/cpo_trainer. Accessed 2026-05-20.

[^11]: Costa Huang, Shengyi Huang, Quentin Gallouédec et al., "Putting RL back in RLHF", Hugging Face Blog, 2024-06-12. https://huggingface.co/blog/putting_rl_back_in_rlhf_with_rloo. Accessed 2026-05-20.

[^12]: Hugging Face, "GRPO Trainer", TRL documentation, 2026. https://huggingface.co/docs/trl/en/grpo_trainer. Accessed 2026-05-20.

[^13]: Hugging Face, "Open-R1: a fully open reproduction of DeepSeek-R1", Hugging Face Blog, 2025-01-28. https://huggingface.co/blog/open-r1. Accessed 2026-05-20.

[^14]: huggingface/trl contributors, "Release notes for v1.0.0 through v1.4.0", GitHub, 2026. https://github.com/huggingface/trl/releases. Accessed 2026-05-20.

[^15]: Hugging Face, "vLLM Integration", TRL documentation, 2026. https://github.com/huggingface/trl/blob/main/docs/source/vllm_integration.md. Accessed 2026-05-20.

[^16]: Hugging Face, "OpenEnv Integration for Training LLMs with Environments", TRL documentation, 2026. https://huggingface.co/docs/trl/v0.25.0/openenv. Accessed 2026-05-20.

[^17]: Axolotl contributors, "RLHF (Beta)", Axolotl documentation, 2026. https://docs.axolotl.ai/docs/rlhf.html. Accessed 2026-05-20.

[^18]: Metric-Space, Sayak Paul, Kashif Rasul, Leandro von Werra, "Finetune Stable Diffusion Models with DDPO via TRL", Hugging Face Blog, 2023-09-29. https://huggingface.co/blog/trl-ddpo. Accessed 2026-05-20.

