Post-training is a critical phase in large language model (LLM) development that transforms general-purpose foundation models into aligned, helpful, and safe AI assistants through techniques including supervised fine-tuning, reinforcement learning from human feedback (RLHF), and preference optimization.[1] This phase bridges the gap between raw language understanding acquired during pre-training and practical utility for real-world applications.[2]
While pre-training creates foundation models by learning from trillions of tokens, post-training refines behavior using millions of carefully curated examples—typically requiring just 1-2% of pre-training compute yet determining whether a model becomes truly useful.[3] The field has evolved rapidly from 2022's RLHF breakthroughs with InstructGPT to 2023's Direct Preference Optimization simplifications, and now 2024-2025's reasoning models using reinforcement learning with verifiable rewards.[4]
Post-training refers to the collection of processes and techniques applied to a model after its initial, large-scale pre-training phase. According to DeepLearning.AI, post-training "transforms a general-purpose token predictor—trained on trillions of unlabeled text tokens—into an assistant that follows instructions and performs specific tasks."[2] This stage readies a foundation model for real-world deployment as a specialized, efficient, and aligned tool.[5]
The PyTorch Foundation defines post-training (sometimes called "alignment") as "a key component of modern LLMs, and the way to 'teach' models how to answer in a way that humans like, and how to reason."[1] This phase addresses fundamental limitations that emerge from pre-training alone:
Pre-trained architectures reveal significant limitations in specialized contexts, including restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance.[6] OpenAI's InstructGPT research demonstrated this dramatically: their 1.3B parameter post-trained model was preferred by humans over the 175B parameter GPT-3 base model—despite being 100 times smaller—because post-training unlocked capabilities that prompt engineering alone couldn't elicit.[7]
Modern LLM development follows a structured sequence: large-scale pre-training, followed by one or more rounds of post-training (supervised fine-tuning and preference optimization), and finally compression and optimization for deployment.
Recent models employ sophisticated multi-stage approaches—Llama 3 used three pre-training stages (15.6T core tokens, 800B context extension tokens, 40M annealing tokens) followed by multiple post-training rounds combining supervised fine-tuning, rejection sampling, and Direct Preference Optimization.[3]
Pre-training and post-training are complementary phases in the development of modern AI models, especially large models. Pre-training is the process where a model learns from a very large dataset to acquire general knowledge or representations, often without any task-specific supervision. Post-training occurs after this initial learning and focuses on specialization and refinement for particular objectives.
| Aspect | Pre-training | Post-training |
|---|---|---|
| Purpose | Learn general patterns and representations from large-scale data | Refine and adapt the model for specific tasks, objectives, or constraints |
| Data | Vast, diverse, often unlabeled datasets (for example Common Crawl text, ImageNet) | Smaller, focused datasets tailored to target task or domain |
| Duration & Compute | Most resource-intensive phase, requiring extensive computation (large GPU/TPU clusters) and time (days to weeks or more) | Shorter and less costly than pre-training; uses fewer resources and can be completed in hours to days |
| Outcome | A general-purpose model (foundation model) that can be adapted to various tasks | A task-optimized model tuned for particular application with improved performance and alignment |
| Generalization vs. Specialization | Emphasizes broad generalization across many tasks and domains | Emphasizes high performance on specific target task(s) |
| Frequency | Usually done once to create a base model | Can be done multiple times or continuously as new data or requirements emerge |
Supervised Fine-Tuning (SFT) serves as the foundational post-training technique, training models on high-quality input-output pairs where responses have been verified beforehand. The PyTorch primer describes SFT's focus as "imitation"—teaching the model to learn ideal responses step by step through structured examples.[1]
The technical process resembles pre-training but with a critical difference: loss is calculated only on response tokens, not prompts. Training data follows the format (system_prompt, user_input, ideal_response), and while the entire sequence passes through the model for context, gradient updates occur only on the assistant's response portion.
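The response-only loss masking described above can be sketched as follows. This is an illustrative numpy stand-in (function names, shapes, and the toy vocabulary are hypothetical), not a training-framework API:

```python
import numpy as np

def sft_loss(logprobs, labels, prompt_len):
    """Cross-entropy computed over the assistant response only.

    logprobs:   (seq_len, vocab) log-probabilities from the model
    labels:     (seq_len,) next-token target ids
    prompt_len: number of prompt tokens to exclude from the loss
    """
    # Mask out prompt positions: gradients flow only through response tokens.
    mask = np.zeros(len(labels), dtype=bool)
    mask[prompt_len:] = True
    nll = -logprobs[np.arange(len(labels)), labels]
    return (nll * mask).sum() / mask.sum()

# Toy example: 5 tokens, vocab of 3, uniform model; first 2 tokens are the prompt.
logprobs = np.log(np.full((5, 3), 1 / 3))
labels = np.array([0, 1, 2, 0, 1])
loss = sft_loss(logprobs, labels, prompt_len=2)  # mean NLL over the 3 response tokens
```

In a real framework the same effect is usually achieved by setting the prompt positions' labels to an ignore index rather than multiplying by an explicit mask.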
| Model | SFT Examples | Epochs |
|---|---|---|
| InstructGPT | 13,000 | 1 |
| Qwen 2 | 500,000 | 2 |
| Llama 3.1 | ~1M synthetic | Multiple |
OpenAI's InstructGPT used approximately 13,000 training prompts for SFT—a tiny fraction compared to pre-training datasets.[7] More recent models use larger SFT datasets, with synthetic data generation using larger teacher models becoming standard practice.
RLHF represents a paradigm shift in post-training, using human preferences as reward signals to align model behavior with complex human values difficult to specify algorithmically.[8] The technique follows a three-stage process pioneered by OpenAI's InstructGPT paper in March 2022:
InstructGPT collected 33,000 preference comparisons, training a reward model to predict which outputs humans prefer using the Bradley-Terry preference model.[7]
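A minimal sketch of the Bradley-Terry objective used to train the reward model, assuming scalar rewards have already been computed for the chosen and rejected outputs (illustrative numpy, hypothetical names):

```python
import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the chosen output beats the rejected one
    under the Bradley-Terry model: P(chosen > rejected) = sigmoid(r_c - r_r)."""
    margin = r_chosen - r_rejected
    # log(sigmoid(x)) computed stably as -log(1 + exp(-x))
    return np.log1p(np.exp(-margin))

# A reward model that ranks the pair correctly by a wide margin incurs ~0 loss;
# equal scores give the maximum-uncertainty loss of log 2.
confident = bradley_terry_loss(4.0, 0.0)
uncertain = bradley_terry_loss(1.0, 1.0)
```

Minimizing this loss over the 33,000 comparisons pushes the reward model to assign higher scalar scores to outputs humans preferred.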
The algorithm uses importance sampling to compare new and old policy outputs, clips the ratio to prevent excessive updates (typically within 1±ε where ε=0.2), and combines this with advantage estimation, value function training, and entropy bonuses for exploration.[1] PPO achieves stability through a clipped surrogate objective function. At each update step, it compares the probability of an action under the new policy to that of the old policy. If this ratio becomes too large or too small, the objective function "clips" the update, preventing large, destabilizing changes.
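The clipped surrogate objective can be illustrated in a few lines of numpy. The function below is a hypothetical per-sample sketch, not the full PPO training loop (no value loss, advantage estimation, or entropy bonus):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate objective from PPO (to be maximized).

    ratio = pi_new(a|s) / pi_old(a|s); the ratio is clipped to [1-eps, 1+eps]
    so a single update cannot move the policy too far from the old one.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # Pessimistic bound: take the smaller of the two, averaged over samples.
    return np.minimum(unclipped, clipped).mean()

# With a positive advantage, the objective stops growing once the probability
# ratio exceeds 1 + eps = 1.2, even though the raw ratio here is 2.0.
obj = ppo_clip_objective(logp_new=np.log(2.0), logp_old=0.0, advantage=np.array([1.0]))
```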
RLHF requires running multiple models simultaneously: the policy being trained, a frozen reference copy of it for the KL penalty, the reward model, and a value (critic) model for advantage estimation.
This creates significant computational overhead—NVIDIA research found that developing derivative models through post-training could consume 30x more compute than the original pre-training.[9]
Direct Preference Optimization (DPO) emerged in May 2023 as a transformative simplification of RLHF, eliminating the need for explicit reward models and reinforcement learning while achieving comparable or superior performance.[10] The Stanford research team showed that language models can implicitly represent reward models, enabling direct optimization through a classification loss.
DPO reformulates the RLHF objective into supervised learning using preference pairs: (prompt, preferred_response, rejected_response). The loss function maximizes the log-odds ratio between preferred and rejected responses while maintaining KL divergence constraints to a reference model.[10]
The DPO loss function is formulated as:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

Where:
- $\pi_\theta$ is the policy being optimized and $\pi_{\text{ref}}$ is the frozen reference model
- $y_w$ and $y_l$ are the preferred and rejected responses for prompt $x$
- $\beta$ is a hyperparameter controlling the strength of the implicit KL constraint
- $\sigma$ is the logistic (sigmoid) function
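Assuming sequence-level log-probabilities for both responses under both models are already available, the DPO loss for a single preference pair can be sketched as (illustrative numpy, hypothetical names):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_policy_w, logp_policy_l, logp_ref_w, logp_ref_l, beta=0.1):
    """DPO loss from sequence log-probabilities of the preferred (w) and
    rejected (l) responses under the policy and the frozen reference model."""
    # Implicit reward margin: beta-scaled log-ratio difference vs. the reference.
    logits = beta * ((logp_policy_w - logp_ref_w) - (logp_policy_l - logp_ref_l))
    return -np.log(sigmoid(logits))

# If the policy exactly matches the reference, the implicit reward margin is 0
# and the loss is log 2 regardless of beta.
loss = dpo_loss(-10.0, -12.0, -10.0, -12.0)
```

Because this is an ordinary classification-style loss over logged preference pairs, it trains with standard supervised tooling and no online sampling.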
The computational advantages are substantial: DPO eliminates the reward model, the critic, and online sampling entirely, keeping only the policy and a frozen reference model in memory rather than the several models RLHF requires.
Since 2023, DPO has become the most popular RLHF alternative, especially in open-source communities. Meta's Llama 3.1 explicitly chose DPO over PPO, finding it more stable and easier to scale.[3] Hugging Face's TRL library, Axolotl, and other major frameworks provide comprehensive DPO support.
Anthropic's Constitutional AI represents a fundamental rethinking of alignment, replacing extensive human feedback with AI-generated feedback guided by explicit constitutional principles.[11] This approach enables scalable oversight while maintaining transparency about value systems encoded in models.
Anthropic's constitution draws from diverse sources, including the UN Universal Declaration of Human Rights, industry trust-and-safety practices such as Apple's terms of service, DeepMind's Sparrow rules, and Anthropic's own research.[12]
The constitution contains 75 principles covering helpfulness, harmlessness, honesty, and specific values like privacy protection and non-discrimination.
Beyond standard DPO and RLHF, the field has rapidly developed specialized preference optimization methods:
| Method | Key Innovation | Use Case |
|---|---|---|
| IPO | Identity function regularization | Prevents overfitting |
| KTO | Binary labels instead of pairs | Simpler data collection |
| ORPO | Single-stage training | Memory efficiency |
| GRPO | No critic network needed | Long context training |
Kahneman-Tversky Optimization (KTO) draws from behavioral economics and prospect theory to align LLMs, requiring only binary labels (desirable/undesirable) rather than paired comparisons.[13]
Odds Ratio Preference Optimization (ORPO) combines SFT with preference optimization in a unified loss, improving efficiency and performance on benchmarks.[14]
Group Relative Policy Optimization (GRPO) emerged in 2024-2025 as a more memory-efficient alternative to PPO, notably used in DeepSeek-R1.[15]
Beyond alignment, a major goal of post-training is to optimize models for efficient deployment. This field, known as model compression, aims to reduce a model's size, memory footprint, and computational requirements without significantly degrading its performance.[16]
Quantization is a widely used compression technique that reduces the numerical precision of a model's parameters (weights) and/or intermediate calculations (activations).[5] Most neural networks are trained using 32-bit floating-point numbers (FP32). Quantization converts these values to lower-precision formats, such as 16-bit floats (FP16), 8-bit integers (INT8), or even 4-bit integers.
The primary benefits are a smaller memory footprint, faster inference, and lower energy and serving costs.
Post-Training Quantization (PTQ) is applied to a model that has already been fully trained. It is a popular choice because it is fast and does not require access to the original training dataset or an expensive retraining process. There are two main types of PTQ: dynamic quantization, which computes activation scales on the fly during inference, and static quantization, which pre-computes them using a small calibration dataset.
In practice, these methods typically deliver 2-3x inference speedup alongside the reduced memory usage.[17]
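The simplest form of PTQ, symmetric per-tensor INT8 weight quantization, can be sketched as follows (illustrative numpy; real toolchains add per-channel scales and activation calibration):

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: map FP32 weights into [-127, 127]
    using a single scale factor derived from the largest absolute weight."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)  # close to w, within half a quantization step
```

Each weight now occupies 1 byte instead of 4, and the reconstruction error is bounded by half the quantization step (scale / 2).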
Pruning is a model compression technique based on the observation that many deep neural networks are highly over-parameterized, containing redundant weights and neurons that contribute little to their overall performance.[18] Pruning systematically removes these unimportant parameters to create a smaller, more computationally efficient model.
The most effective approach is often iterative pruning: train the model, remove the lowest-magnitude weights, fine-tune briefly to recover accuracy, and repeat until the target sparsity is reached.
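One round of the removal step can be sketched as unstructured magnitude pruning (illustrative numpy; a real pipeline interleaves this with fine-tuning):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Unstructured magnitude pruning: zero out the fraction `sparsity` of
    weights with the smallest absolute values."""
    k = int(round(sparsity * weights.size))
    if k == 0:
        return weights.copy()
    # Threshold at the k-th smallest absolute value across the whole tensor.
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

w = np.array([[0.9, -0.05], [0.01, -0.7]])
pruned = magnitude_prune(w, sparsity=0.5)
# The two smallest-magnitude weights (0.01 and -0.05) are set to zero.
```

Note that zeros produced this way only reduce latency if the inference runtime or hardware can exploit the resulting sparsity pattern, which is the "hardware acceleration challenge" noted in the table below.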
Knowledge distillation is a compression technique that involves transferring the "knowledge" from a large, complex, and high-performing model (the teacher) to a smaller, more efficient model (the student).[19]
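The soft-target component of Hinton-style distillation is typically a temperature-scaled KL divergence between the teacher's and student's output distributions; a minimal numpy sketch under that assumption:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax; higher T spreads probability mass."""
    z = np.asarray(z, dtype=np.float64) / T
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) over temperature-softened distributions, scaled
    by T^2 to keep gradient magnitudes comparable across temperatures."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return T * T * np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))

# A student whose logits already match the teacher's has zero distillation loss.
loss = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
```

In practice this soft-target term is combined with the ordinary hard-label cross-entropy on the student, weighted by a mixing coefficient.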
These compression techniques are often most powerful when used in combination—for example, using knowledge distillation followed by structured pruning and finally post-training static quantization.
| Technique | Core Idea | Primary Benefit | Main Drawback | Best For... |
|---|---|---|---|---|
| Quantization | Reduce numerical precision of weights/activations | Significant size reduction and faster inference | Potential accuracy degradation at low bit-widths | Deploying on resource-constrained hardware |
| Pruning | Remove redundant parameters | Reduces model complexity and FLOPs | Hardware acceleration challenges for unstructured pruning | Reducing latency where some accuracy trade-off acceptable |
| Knowledge Distillation | Train smaller model to mimic larger one | Compresses knowledge into smaller architecture | Requires powerful teacher model and full training cycle | Creating compact models for specific tasks |
The roots of post-training extend back to early work on learning from human feedback in reinforcement learning. The foundational breakthrough came in June 2017 with "Deep Reinforcement Learning from Human Preferences" by Christiano et al. at OpenAI and DeepMind, demonstrating that learning from pairwise human comparisons could train reward models for complex RL tasks.[20]
Google's FLAN (Finetuned Language Net) paper in 2021 introduced instruction fine-tuning at scale, training models on diverse tasks phrased as natural language instructions.[21] March 2022 marked a watershed moment with OpenAI's InstructGPT paper, which established the standard three-stage RLHF pipeline still widely used today.[7]
December 2022 saw Anthropic's Constitutional AI paper introducing RLAIF (RL from AI Feedback), pioneering AI-driven alignment techniques.[11]
May 2023 brought another paradigm shift with Direct Preference Optimization by Rafailov et al. at Stanford.[10] DPO's key insight—that language models implicitly represent reward models—enabled eliminating the explicit reward model and RL training loop.
The frontier of post-training shifted dramatically toward reasoning capabilities in late 2024 and 2025. OpenAI's o1 model introduced "thinking" modes where models engage in extended reasoning before generating final answers. DeepSeek-R1 in January 2025 provided the first open reproduction of reasoning model training, using GRPO for efficient reinforcement learning.[4]
Post-training teaches LLMs to reason beyond prediction through techniques like long chain-of-thought supervised fine-tuning, reinforcement learning with verifiable rewards, and test-time compute scaling.
| Company | Model | Post-training Method | Investment |
|---|---|---|---|
| OpenAI | ChatGPT/GPT-4 | RLHF with PPO | $10M-$50M+ |
| Anthropic | Claude | Constitutional AI | $10M-$50M+ |
| Google DeepMind | Gemini | RLHF + SFT | $10M-$50M+ |
| Meta | Llama | DPO + Rejection Sampling | $50M+ (Llama 3.1) |
| Microsoft | GitHub Copilot | RLHF (via OpenAI) | Via $13B OpenAI investment |
OpenAI pioneered RLHF for language models with InstructGPT in March 2022, establishing the three-step process of supervised fine-tuning, reward model training, and PPO-based reinforcement learning.[7] ChatGPT serves over 100 million weekly active users as of 2024.[22]
Anthropic developed Constitutional AI as its core alignment methodology, using AI-generated feedback guided by explicit principles rather than extensive human feedback.[11] Claude model evolution spans from Claude 1 (March 2023, 100K context) through Claude 4 (May 2025).
Google DeepMind's Gemini family demonstrates sophisticated multimodal post-training, combining RLHF, supervised fine-tuning, and safety filtering.[23] Gemini's context windows reach an industry-leading 2 million tokens.
Meta's Llama model family represents the most transparent post-training implementations. Llama 3.1 used $50M+ post-training involving over 200-person teams, employing iterative supervised fine-tuning, rejection sampling, and multiple rounds of DPO.[3] Meta explicitly avoided PPO-based RLHF, finding DPO more stable and easier to scale.
Post-training is essential for transforming base models into helpful conversational agents. Models like ChatGPT and Claude undergo extensive post-training using SFT and RLHF to transform them from simple text predictors into helpful, harmless, and instruction-following assistants.[24]
GitHub Copilot revolutionized software development through post-trained code models, providing real-time code completion across multiple programming languages. Over 1 million developers use AI coding assistants, with studies showing 40-50% reduction in time for repetitive coding tasks. Post-training with feedback from developers helps the model learn what constitutes "good" code in practice.
Medical AI demonstrates post-training's power in specialized domains. Google's MedGemma underwent domain-specific post-training on medical imaging data including chest X-rays, histopathology, and dermatology images.[25]
Finance: Models are fine-tuned on financial news, earnings reports, and market data to perform specialized tasks like sentiment analysis, risk assessment, or algorithmic trading.
Post-training is critical for text-to-image models like DALL-E and Midjourney. While pre-training teaches them the association between text and images, post-training using human feedback on aesthetic quality, realism, and adherence to the prompt is used to refine their output.
Model compression techniques are fundamental to the field of Edge AI and edge computing, where models must run under tight memory, compute, and power budgets.
| Framework | Organization | Key Features |
|---|---|---|
| TRL | Hugging Face | SFT, DPO, PPO, GRPO trainers |
| PEFT | Hugging Face | LoRA, QLoRA, 60-80% memory reduction |
| Axolotl | Open-source | YAML config, multi-method support |
| Unsloth | Open-source | 2-5x faster, 70-80% less memory |
| Torchtune | PyTorch | Official PyTorch library |
TRL (Transformer Reinforcement Learning) serves as a comprehensive full-stack library for post-training foundation models, supporting SFT, GRPO, DPO, Reward, and PPO trainers.[26]
Despite its transformative impact, the post-training phase faces significant technical, ethical, and practical challenges:
Post-training alignment methods are highly sensitive to the data they are trained on. The quality and diversity of human-written demonstrations for SFT and the preference labels for RLHF/DPO directly determine the quality and biases of the final model. If preference data is collected from a non-diverse group or contains inherent biases, the aligned model will learn and amplify those biases.
The alignment process can create new vulnerabilities. Adversarial attacks, or "jailbreaking", involve crafting specific prompts designed to bypass a model's safety guardrails and elicit forbidden or harmful behavior. Research has shown that if a negative behavior is suppressed but not eliminated from the model's capabilities, carefully designed prompts can re-activate it.
These challenges can be conceptualized through a "waterbed effect." Post-training applies strong optimization pressure to improve one aspect of the model's behavior (for example reducing toxicity). However, this pressure can cause unintended consequences in other areas, much like pushing down on one part of a waterbed causes another part to bulge up. For example, making a model overly cautious to avoid harm might severely reduce its helpfulness and creativity.
| Aspect | Requirement |
|---|---|
| GPU Memory | 14-80GB depending on model size |
| Training Time | Hours to days |
| Data Annotation | $10K-$200K for preference data |
| Engineering Expertise | ML engineers for tuning and deployment |
Reinforcement Learning with Verifiable Rewards (RLVR) uses objective signals like code execution results or theorem proofs as rewards, driving dramatic reasoning improvements in OpenAI o1 and DeepSeek-R1.[4]
Test-time scaling allocates more compute during inference for harder problems through chain-of-thought prompting, self-consistency, and iterative refinement.[27]
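The self-consistency component can be sketched as a majority vote over several sampled answers; `sample_answer` below is a hypothetical stand-in for one stochastic LLM generation:

```python
from collections import Counter

def self_consistency(sample_answer, prompt, n_samples=5):
    """Sample several reasoning paths and return the most common final answer."""
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Deterministic toy "sampler" that replays a fixed list of final answers,
# standing in for repeated temperature sampling from a model.
def make_sampler(answers):
    it = iter(answers)
    return lambda prompt: next(it)

sampler = make_sampler([42, 41, 42, 43, 42])
best = self_consistency(sampler, "What is 6 * 7?", n_samples=5)  # majority: 42
```

The idea is that independent reasoning paths that disagree on intermediate steps often still converge on the correct final answer, so the vote filters out individual sampling errors at the cost of extra inference compute.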
The LLM API market doubled to $8.4B in 2025 according to Menlo Ventures, with post-training driving enterprise adoption through customization and alignment capabilities.[28]
Post-training is shifting from being an afterthought to being the central stage for AI innovation. The ability to effectively and efficiently refine, align, and optimize foundation models is becoming the key competitive differentiator in the AI industry, defining who can build the most capable, reliable, and practical AI applications. As the industry matures, post-training represents the "product finishing" phase that transforms powerful but generic AI engines into tailored, polished, and deployable solutions.