# Fine Tuning

> Source: https://aiwiki.ai/wiki/fine_tuning
> Updated: 2026-06-20
> Categories: Deep Learning, Machine Learning, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

Fine-tuning is a [machine learning](/wiki/machine_learning) technique that takes a [pre-trained model](/wiki/pre-trained_model) and further trains it on a smaller, task-specific dataset, adjusting the model's existing weights instead of training a [neural network](/wiki/neural_network) from scratch. It is the dominant way modern AI systems are specialized: a developer selects a general-purpose foundation model and fine-tunes it for a target task, rather than building a new model from random initialization. Fine-tuning is a form of [transfer learning](/wiki/transfer_learning), reusing knowledge a model captured during pre-training and steering it toward a new task or domain.

Fine-tuning can be performed by updating every weight (full fine-tuning) or, increasingly, by updating only a tiny fraction of parameters. The 2021 [LoRA](/wiki/lora) method showed that low-rank adaptation "can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times" relative to fine-tuning GPT-3 175B with Adam, while matching or exceeding full fine-tuning quality.[5][19] This efficiency, combined with open-weight models, is why fine-tuning large language models on a single consumer GPU became routine by 2024.

*See also: [Machine learning terms](/wiki/machine_learning_terms), [Transfer learning](/wiki/transfer_learning), [Large language model](/wiki/large_language_model)*

## Introduction

Fine-tuning is a technique in [machine learning](/wiki/machine_learning) where a [pre-trained model](/wiki/pre-trained_model) is further trained on a smaller, task-specific dataset to adapt it for a particular use case. Rather than training a [neural network](/wiki/neural_network) from scratch, fine-tuning leverages the knowledge already captured in a model's weights during initial pre-training, then adjusts those weights to perform well on a new task or domain. This approach falls under the broader umbrella of [transfer learning](/wiki/transfer_learning), where knowledge gained from one task is applied to improve performance on another.

Fine-tuning has become one of the most important techniques in modern AI. In computer vision, it enabled researchers to take models trained on [ImageNet](/wiki/imagenet) and quickly adapt them to specialized image recognition tasks. In natural language processing (NLP), it powers the adaptation of [large language models](/wiki/large_language_model) (LLMs) like GPT, [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers), and LLaMA to specific tasks ranging from sentiment analysis to medical question answering. The practice has become so central to applied AI that the entire pipeline for building modern AI applications typically involves selecting a pre-trained foundation model and fine-tuning it, rather than training from scratch.

## How does fine-tuning differ from training from scratch and feature extraction?

Fine-tuning is one of three broad strategies for applying a neural network to a new task. Understanding the differences helps practitioners choose the right approach for their situation.

**Training from scratch** means initializing a model with random weights and training it entirely on the target dataset. This approach makes no assumptions about prior knowledge and gives the optimizer full freedom to learn task-specific representations. However, it requires large amounts of labeled data (often millions of examples), significant compute resources, and long training times. Training from scratch is appropriate when the target domain is fundamentally different from any available pre-trained model, or when massive labeled datasets are readily available.

**Feature extraction** (also called the frozen-backbone approach) uses a pre-trained model as a fixed feature extractor. The pre-trained weights are entirely frozen, and only a newly attached output head (such as a classification layer) is trained on the target data. Because the backbone parameters are never updated, feature extraction is fast and requires very little compute. It works well when the target task is closely related to the pre-training task and the dataset is small (a few hundred to a few thousand examples). The downside is that the frozen features may not be optimal for the new task, especially if the domains differ significantly.

**Fine-tuning** occupies the middle ground. It initializes from pre-trained weights and then updates some or all of those weights on the target dataset. This gives the model the benefit of transferred knowledge while still allowing adaptation to the specifics of the new task. Fine-tuning typically outperforms feature extraction when the target dataset is moderately sized or when the target domain differs meaningfully from the pre-training domain.

| Strategy | Weights updated | Data needed | Compute cost | Best when |
|---|---|---|---|---|
| Training from scratch | All (random init) | Very large (millions) | Very high | No relevant pre-trained model exists |
| Feature extraction | Output head only | Small (hundreds) | Very low | Target domain is very similar to pre-training domain |
| Fine-tuning (partial) | Top layers + head | Moderate (thousands) | Moderate | Target domain is related but not identical |
| Fine-tuning (full) | All layers | Moderate to large | High | Maximum performance is needed; sufficient data available |

In practice, many practitioners start with feature extraction to establish a baseline, then move to fine-tuning if performance is insufficient.

## History and development

### Transfer learning in computer vision

The concept of transfer learning predates the modern fine-tuning era. Early work in the 1990s explored how knowledge could be transferred between neural networks. However, transfer learning became practical and widespread in computer vision after the success of deep [convolutional neural networks](/wiki/convolutional_neural_network) (CNNs) on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) starting in 2012.[16] AlexNet's victory in 2012, followed by VGGNet, GoogLeNet, and ResNet in subsequent years, established a standard workflow: pre-train a CNN on ImageNet's 1.4 million labeled images across 1,000 categories, then fine-tune the resulting model on a smaller target dataset.

This approach worked because the early layers of CNNs learn general visual features (edges, textures, shapes) that transfer well across tasks, while later layers learn task-specific features. Researchers found that even replacing and retraining only the final classification layer could yield strong results on new image recognition tasks with as few as a hundred labeled examples.

Yosinski et al. (2014) provided an influential study on the transferability of features learned by [deep learning](/wiki/deep_model) networks.[17] They showed that the first few layers of a CNN learn general, task-independent features, while later layers become increasingly specific to the original training task. This finding informed the common practice of freezing early layers and fine-tuning only the later, more task-specific layers.

### Transfer learning comes to NLP

NLP lagged behind computer vision in adopting transfer learning. For years, the standard approach in NLP involved training task-specific models from scratch using word embeddings like Word2Vec (2013) or GloVe (2014) as the only form of transferred knowledge. These static embeddings captured some semantic relationships but could not represent context-dependent word meanings.

The year 2018 marked a turning point for transfer learning in NLP, with several breakthroughs arriving in rapid succession:

- **ELMo** (Peters et al., February 2018): Embeddings from Language Models introduced contextualized word representations generated by a bidirectional LSTM trained on a large text corpus. ELMo representations could be used as input features for downstream tasks, improving performance across a range of NLP benchmarks.[14]

- **ULMFiT** (Howard and Ruder, May 2018): Universal Language Model Fine-tuning for Text Classification demonstrated that a general-purpose language model, based on an AWD-LSTM architecture, could be pre-trained on a large corpus and then fine-tuned for text classification with very little labeled data. ULMFiT introduced several techniques that became standard practice, including discriminative fine-tuning (using different [learning rates](/wiki/learning_rate) for different layers), slanted triangular learning rates, and gradual unfreezing of layers. It reduced classification error rates by 18 to 24 percent on most benchmarks compared to training from scratch.[2]

- **GPT** (Radford et al., June 2018): OpenAI's Generative Pre-trained Transformer showed that a transformer-based language model pre-trained on a large text corpus could be fine-tuned to achieve strong performance across a range of NLP tasks with minimal architecture changes.[4]

- **BERT** (Devlin et al., October 2018): Bidirectional Encoder Representations from Transformers introduced a pre-training approach using masked language modeling and next sentence prediction. BERT "obtains new state-of-the-art results on eleven natural language processing tasks," including pushing the GLUE benchmark to 80.5 percent (a 7.7-point absolute improvement) and SQuAD v1.1 test F1 to 93.2.[3][20] It became the most widely used foundation for fine-tuning in NLP for several years.

These models collectively established the "pre-train, then fine-tune" paradigm that dominates modern NLP.

### The large language model era

As language models grew larger, from BERT's 340 million parameters to GPT-3's 175 billion parameters in 2020, fine-tuning practices evolved. Full fine-tuning of such massive models became prohibitively expensive for most practitioners, motivating research into more efficient alternatives. This led to the development of parameter-efficient fine-tuning methods (discussed below), as well as new paradigms such as instruction tuning, reinforcement learning from human feedback (RLHF), and the recognition that careful prompt engineering could sometimes substitute for fine-tuning entirely.

The release of open-weight models like LLaMA (Meta, 2023), Mistral (2023), and Qwen (Alibaba, 2023) democratized fine-tuning further. Combined with parameter-efficient methods like LoRA and QLoRA, individual researchers and small teams could adapt models with tens of billions of parameters on consumer-grade GPUs. By 2024, fine-tuning an open-weight LLM had become a standard skill in the ML practitioner's toolkit.

## The fine-tuning process

### Pre-trained models

The fine-tuning process begins with a pre-trained model: a neural network that has already been trained on a large dataset. For CNNs, this is typically a model trained on ImageNet or a similar large-scale image dataset. For NLP, this is usually a transformer-based language model pre-trained on a large text corpus (such as Common Crawl, Wikipedia, or books) using a self-supervised objective like next-token prediction or masked language modeling.

These pre-trained models have already learned general features and representations from their training data. A vision model has learned to detect edges, textures, and object parts. A language model has learned grammar, facts about the world, and reasoning patterns. These general capabilities provide a strong foundation for adaptation to specific tasks.

### Adapting the model

To adapt a pre-trained model for a new task, the following steps are typically followed:

1. **Modify the model architecture**: Depending on the target task, the model's output layer or head may need to be replaced. For example, a language model's next-token prediction head might be replaced with a classification head for sentiment analysis, or a sequence-to-sequence head for summarization.

2. **Prepare the training data**: The task-specific dataset is formatted to match the model's expected input format. For LLMs, this often means structuring data as instruction-response pairs or prompt-completion pairs.

3. **Initialize from pre-trained weights**: The model's parameters are initialized with the pre-trained weights rather than random values. This gives the model a strong starting point.

4. **Train on the new dataset**: The model is trained on the task-specific data using standard optimization algorithms such as Adam or AdamW. Learning rates are typically set lower than those used during pre-training (often in the range of 1e-5 to 5e-5 for transformer models) to avoid overwriting the useful knowledge captured during pre-training.

5. **Evaluate and iterate**: The fine-tuned model is evaluated on a held-out validation set, and hyperparameters are adjusted as needed.

### Learning rate strategies

The choice of learning rate is one of the most critical decisions in fine-tuning. Because the model already contains useful pre-trained representations, the goal is to adjust weights enough to learn the new task without destroying the knowledge acquired during pre-training.

**Lower learning rates.** Fine-tuning learning rates are typically 10 to 100 times smaller than those used for training from scratch. For BERT-based models, a learning rate of 2e-5 to 5e-5 is standard.[3] For larger LLMs, rates as low as 1e-6 to 5e-6 may be appropriate. Starting with too high a learning rate can catastrophically overwrite pre-trained features within the first few gradient steps.

**Learning rate warmup.** Many fine-tuning schedules begin with a warmup period during which the learning rate gradually increases from near zero to the target value over the first 5 to 10 percent of training steps. Warmup prevents large, destabilizing updates at the start of training when the gradients of the newly initialized output head may be noisy. After warmup, the rate typically follows a linear or cosine decay schedule.

**Discriminative learning rates.** Introduced by Howard and Ruder in ULMFiT (2018), this strategy assigns different learning rates to different layers of the network.[2] Earlier layers, which contain more general features, receive smaller learning rates to preserve their representations. Later layers, which are more task-specific, receive larger learning rates to allow faster adaptation. A common approach is to set the learning rate for each layer group as a fraction (typically one-tenth) of the learning rate of the layer group above it. This technique has been shown to reduce [overfitting](/wiki/overfitting) and improve generalization, particularly in low-data scenarios.

**Gradual unfreezing.** Rather than updating all layers from the start, gradual unfreezing begins by training only the output head, then progressively unfreezes deeper layers over the course of training. This staged approach prevents large, destabilizing weight updates from damaging the pre-trained features in earlier layers. ULMFiT demonstrated that combining gradual unfreezing with discriminative learning rates consistently outperformed training all layers simultaneously.[2]

## Full fine-tuning vs. parameter-efficient fine-tuning

### Full fine-tuning

In full fine-tuning, all of the model's parameters are updated during training on the new dataset. This gives the optimizer maximum flexibility to adapt the model but comes with significant costs:

- **Memory requirements**: Training requires storing the model weights, gradients, and optimizer states (which for Adam-based optimizers means two additional copies of every parameter). For a 7-billion-parameter model in 16-bit precision, this can require over 100 GB of GPU memory.
- **Storage**: Each fine-tuned version of the model requires a full copy of all parameters. Serving multiple fine-tuned variants means storing multiple complete model copies.
- **Risk of catastrophic forgetting**: Updating all parameters increases the risk that the model will lose knowledge acquired during pre-training.[15]

Full fine-tuning remains the gold standard in terms of potential performance, and for smaller models (under a few billion parameters), it is often practical with modern hardware. For very large models, however, parameter-efficient alternatives are usually preferred.

### Parameter-efficient fine-tuning (PEFT)

Parameter-efficient fine-tuning (PEFT) methods keep most of the pre-trained model's parameters frozen and introduce or select a small number of trainable parameters. This dramatically reduces memory requirements, training time, and storage costs while typically achieving 90 to 95 percent of full fine-tuning performance. PEFT methods have become the standard approach for adapting large language models.[18]

The main families of PEFT methods are described in the following sections.

## Key fine-tuning techniques

### LoRA (low-rank adaptation)

LoRA (Low-Rank Adaptation of Large Language Models) was introduced by Edward Hu and colleagues at Microsoft Research in a 2021 paper (published at ICLR 2022).[5] It has become the most widely used PEFT method for fine-tuning large language models.

The core idea behind LoRA is based on an important observation: the weight updates that occur during fine-tuning have a low intrinsic rank. In other words, the changes needed to adapt a model to a new task can be captured by much smaller matrices than the full weight matrices.

Specifically, for a pre-trained weight matrix W_0 with dimensions d x k, LoRA represents the weight update as a product of two low-rank matrices: deltaW = B * A, where B has dimensions d x r and A has dimensions r x k, and the rank r is much smaller than both d and k. During training, the original weight matrix W_0 is frozen and only the small matrices A and B receive gradient updates.

The number of trainable parameters is determined by the rank r and the number of weight matrices that LoRA is applied to. For GPT-3 with 175 billion parameters, the authors showed that a rank as low as 1 or 2 was sufficient for good performance, even though the full rank of the weight matrices is 12,288. The paper reports that, compared with GPT-3 175B fine-tuned with Adam, "LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times," while performing on par with or better than full fine-tuning despite having fewer trainable parameters and no additional inference latency.[5][19]

Key advantages of LoRA include:

- No additional inference latency, because the low-rank matrices can be merged into the original weights after training
- Efficient task switching by swapping different LoRA adapter weights
- Compatible with other optimization techniques like quantization

Since its introduction, several variants of LoRA have been proposed. DoRA (Weight-Decomposed Low-Rank Adaptation, 2024) decomposes pre-trained weights into magnitude and direction components and applies LoRA only to the directional component, often matching full fine-tuning performance more closely. LoRA+ (2024) improves upon LoRA by using different learning rates for the A and B matrices, yielding faster convergence.[18]

### QLoRA (quantized low-rank adaptation)

QLoRA was introduced by Dettmers et al. in 2023 (NeurIPS 2023).[6] It combines LoRA with aggressive quantization to reduce memory requirements even further, enabling fine-tuning of very large models on consumer-grade hardware.

QLoRA introduces three technical innovations:

1. **4-bit NormalFloat (NF4) quantization**: A new data type specifically designed for weights that follow a normal distribution. NF4 assigns each weight in a block to one of 16 quantile bins of a normal distribution, storing only the index and a floating-point scale. The authors found NF4 to be information-theoretically optimal for normally distributed weights and superior to both FP4 and Int4 in post-quantization accuracy.

2. **Double quantization**: This technique reduces memory overhead further by also quantizing the quantization constants themselves, saving approximately 0.37 bits per parameter on average.

3. **Paged optimizers**: Using NVIDIA's unified memory feature, QLoRA enables seamless page transfers between GPU and CPU memory when the GPU runs out of memory, preventing out-of-memory errors during training.

With these innovations, QLoRA "reduces the average memory requirements of finetuning a 65B parameter model from >780GB of GPU memory to <48GB without degrading the runtime or predictive performance," enabling fine-tuning of a 65-billion-parameter model on a single 48 GB GPU while preserving full 16-bit fine-tuning performance.[6][21] To demonstrate the method, the authors trained the Guanaco model family; the paper reports that the best Guanaco model reaches "99.3% of the performance level of ChatGPT" on the Vicuna benchmark "while only requiring 24 hours of finetuning on a single GPU."[6][21] This made fine-tuning of large models accessible to individual researchers and small teams with limited hardware budgets.

### Prefix tuning

Prefix tuning, introduced by Li and Liang in 2021, prepends a sequence of trainable continuous vectors (the "prefix") to the keys and values at every layer of the transformer.[7] The prefix vectors are optimized during training while all of the model's original parameters remain frozen.

Unlike discrete text prompts, these prefix vectors exist in the model's continuous embedding space and are not constrained to correspond to real words or tokens. This gives prefix tuning more expressiveness than manual prompt engineering while keeping the number of trainable parameters very small (typically less than 1 percent of the model's total parameters).

Prefix tuning has shown strong performance on generation tasks such as table-to-text and summarization, sometimes matching or approaching full fine-tuning performance.[7]

### Adapter layers

Adapter layers, first proposed by Houlsby et al. in 2019, insert small trainable modules (adapters) between existing layers of the transformer.[1] Each adapter typically consists of a down-projection that reduces the hidden dimension, a nonlinear activation function, and an up-projection back to the original dimension, forming a bottleneck structure.

During fine-tuning, only the adapter parameters are trained while the original model weights remain frozen. The Houlsby et al. study reported that adapters attain "within 0.4% of the performance of full fine-tuning" on the GLUE benchmark while adding only 3.6 percent new parameters per task, compared with the 100 percent that full fine-tuning trains for each task.[1][22] Adapters typically add 1 to 5 percent of the original model size and can match full fine-tuning on many tasks.

A notable drawback of adapter layers is that they introduce additional computation during inference, adding some latency. This contrasts with LoRA, which can be merged into the base model weights at inference time with no additional overhead.

### Prompt tuning (soft prompts)

Prompt tuning, introduced by Lester et al. in 2021, prepends a set of trainable embedding vectors ("soft prompts") to the input of the model.[8] Unlike prefix tuning, which adds trainable vectors at every layer, prompt tuning only modifies the input embedding layer. This makes it the most parameter-efficient of the major PEFT methods.

Soft prompts are initialized either randomly or from the embeddings of real text tokens, and they are then optimized via backpropagation. The rest of the model remains completely frozen.

Prompt tuning has been shown to approach the performance of full fine-tuning as model size increases. For models with 10 billion or more parameters, the performance gap between prompt tuning and full fine-tuning becomes very small.[8] For smaller models, however, the gap can be significant.

### Comparison of fine-tuning approaches

The following table summarizes the key characteristics of the major fine-tuning approaches:

| Approach | Trainable parameters | Memory reduction | Inference overhead | Best for | Key limitation |
|---|---|---|---|---|---|
| Full fine-tuning | 100% of model | None (baseline) | None | Maximum performance; smaller models | Very high memory and compute cost |
| [LoRA](/wiki/lora) | ~0.01-1% of model | 3x or more | None (merged at inference) | General-purpose LLM adaptation | Slight performance gap vs. full fine-tuning on some tasks |
| [QLoRA](/wiki/qlora) | ~0.01-1% of model | 10x or more | Minimal (quantized inference) | Large models on limited hardware | Quantization can affect precision on some tasks |
| Prefix tuning | <1% of model | 5-10x | Minimal | Generation tasks (summarization, translation) | Less effective on classification tasks |
| Adapter layers | 1-5% of model | 2-5x | Small latency increase | Multi-task serving | Adds inference latency |
| Prompt tuning | <0.1% of model | 10x or more | None | Very large models; task switching | Underperforms on smaller models |

## Fine-tuning in computer vision

Fine-tuning in computer vision follows patterns that were established during the ImageNet era and remain relevant today. The standard workflow involves taking a CNN or vision transformer pre-trained on a large-scale dataset (typically ImageNet-1K or ImageNet-21K) and adapting it to a specialized task such as medical image classification, satellite imagery analysis, or fine-grained recognition.

### Layer freezing strategies

CNNs learn a hierarchy of features, from low-level patterns like edges and textures in early layers to high-level, task-specific concepts in later layers.[17] This hierarchy informs several common fine-tuning strategies:

- **Freeze all but the final layer**: Replace the classification head and train only it. This is equivalent to feature extraction and works well for small datasets that are similar to the pre-training data.
- **Freeze early layers, fine-tune later layers**: Keep the first several convolutional blocks frozen and update the remaining blocks plus the classification head. This preserves general features while allowing the model to adapt its higher-level representations.
- **Fine-tune all layers with a small learning rate**: Update the entire network, but use a much lower learning rate than was used during pre-training. This gives the optimizer full flexibility while limiting how far weights drift from their pre-trained values.

The choice among these strategies depends on the size and similarity of the target dataset. Smaller, more similar datasets favor more aggressive freezing, while larger or more dissimilar datasets benefit from updating more layers.

### Vision transformers and modern approaches

With the rise of Vision Transformers (ViT) and models like DINOv2, CLIP, and SAM, fine-tuning practices in vision have evolved. These models are pre-trained on diverse data and produce highly general features. LoRA and other PEFT methods have been adapted from NLP to vision transformers, allowing efficient fine-tuning of large vision models.[18] For example, applying LoRA to the attention weight matrices in a ViT follows the same low-rank decomposition principle used for language models.

## Instruction tuning

Instruction tuning is a specialized form of fine-tuning that trains a language model to follow natural language instructions across a wide range of tasks. Unlike standard fine-tuning, which adapts a model for a single task, instruction tuning aims to produce a general-purpose model that can understand and execute diverse instructions.

### FLAN

FLAN (Finetuned Language Net) was introduced by Google Research in 2021 (Wei et al.).[9] The researchers took a 137-billion-parameter pre-trained language model and instruction-tuned it on over 60 NLP tasks, each expressed through natural language instruction templates. For example, rather than providing a sentiment classification dataset in a standard label format, the data was reformulated as instructions like "Is the sentiment of the following review positive or negative?"

The results were striking: the paper reports that instruction tuning "substantially improves zero-shot performance on unseen tasks," with FLAN surpassing zero-shot 175-billion-parameter GPT-3 on 20 of 25 evaluation tasks and even outperforming few-shot GPT-3 by a large margin on benchmarks including ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze.[9]

The FLAN-T5 and FLAN-PaLM follow-up work (Chung et al., 2022) scaled instruction tuning further, using a collection of over 1,800 tasks. This work showed that instruction tuning benefits from both increased numbers of tasks and increased model scale.[12]

### InstructGPT

InstructGPT (Ouyang et al., 2022) introduced a three-stage training process that became the template for aligning LLMs with human intentions:[10]

1. **Supervised fine-tuning (SFT)**: A pre-trained GPT-3 model was fine-tuned on a dataset of human-written demonstrations of desired behavior. Human labelers wrote high-quality responses to a range of prompts.

2. **Reward model training**: A separate model was trained to predict which of two responses a human would prefer, using a dataset of human preference comparisons.

3. **Reinforcement learning from human feedback (RLHF)**: The SFT model was further optimized using Proximal Policy Optimization (PPO) with the reward model providing the reward signal.

The result was striking: the paper states that "in human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters," demonstrating the power of alignment through fine-tuning.[10][23]

## RLHF and preference optimization

### Reinforcement learning from human feedback (RLHF)

RLHF has become a standard component of the LLM training pipeline. After supervised fine-tuning, RLHF uses human preference data to further align the model's outputs with human values and expectations.[10]

The RLHF process involves:

1. Collecting pairs of model outputs for the same prompt
2. Having human annotators rank which output they prefer
3. Training a reward model on these preference judgments
4. Using the reward model with a reinforcement learning algorithm (typically PPO) to optimize the language model's policy

RLHF has been used to train models including ChatGPT, Claude, and many other commercial LLMs. However, the process is complex and computationally expensive, requiring training and maintaining a separate reward model and using RL optimization, which can be unstable.

### Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO), introduced by Rafailov et al. in 2023, offers a simpler alternative to RLHF.[11] DPO bypasses the need for a separate reward model by directly optimizing the language model on preference data using a classification-style loss function.

The key insight is that the optimal policy under the RLHF objective can be expressed in closed form as a function of the reward model. DPO reparameterizes this relationship to define a loss function that directly uses preference pairs without explicitly training a reward model.[11]

DPO offers several practical advantages over RLHF:

- Approximately 40 to 75 percent lower compute costs
- Substantially more stable training (no RL instabilities)
- Simpler implementation with fewer hyperparameters
- Single-stage training instead of the multi-stage RLHF pipeline

DPO has become widely adopted for training open-source LLMs, including Zephyr-7B and various Mistral-based models. However, some research suggests that RLHF-trained models may perform slightly better in safety evaluations and out-of-distribution generalization.

### Other preference optimization methods

Several alternatives to DPO have been proposed:

- **IPO (Identity Preference Optimization)**: Addresses a theoretical limitation of DPO by adding a regularization term that prevents overfitting to preference data.
- **KTO (Kahneman-Tversky Optimization)**: Works with unpaired preference data (individual examples labeled as good or bad), eliminating the need for explicit pairwise comparisons.
- **ORPO (Odds Ratio Preference Optimization)**: Combines supervised fine-tuning and preference alignment into a single training stage, simplifying the pipeline further.
- **GRPO (Group Relative Policy Optimization)**: Used by DeepSeek for training reasoning models, GRPO estimates advantages from group-level comparisons rather than a trained value function.

## Supervised fine-tuning in the LLM pipeline

In the modern LLM development pipeline, supervised fine-tuning (SFT) typically serves as the bridge between pre-training and alignment. The standard pipeline consists of three stages:

1. **Pre-training**: The base model is trained on a massive text corpus using self-supervised objectives (next-token prediction). This stage teaches the model language, facts, and reasoning patterns but does not teach it to follow instructions or be helpful.

2. **Supervised fine-tuning (SFT)**: The pre-trained model is fine-tuned on a curated dataset of high-quality instruction-response pairs. This teaches the model to follow instructions, produce helpful responses, and adopt a conversational format. SFT transforms a completion model into a chat model.

3. **Alignment (RLHF/DPO)**: The SFT model is further refined using human preference data to improve safety, reduce harmful outputs, and better align with human values.

The quality of the SFT dataset is often more important than its size. Research has shown that fine-tuning on a small set of carefully curated, high-quality examples can outperform fine-tuning on much larger but lower-quality datasets. The LIMA paper (Zhou et al., 2023) fine-tuned a 65-billion-parameter LLaMA model "on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling," and in controlled human evaluation its responses were "either equivalent or strictly preferred to GPT-4 in 43% of cases," rising to 58 percent versus Bard and 65 percent versus DaVinci003.[13][24] The authors argued this supports a "Superficial Alignment Hypothesis": that a model's knowledge is learned almost entirely during pre-training, and fine-tuning mainly teaches it which format and style to use.[13]

## Fine-tuning APIs and platforms

Several major AI companies offer fine-tuning through managed APIs, allowing users to customize models without managing their own infrastructure.

### OpenAI

OpenAI provides a fine-tuning API that supports several models:

| Model | Training cost (per 1M tokens) | Inference cost (input / output per 1M tokens) |
|---|---|---|
| GPT-4o | $25.00 | $3.75 / $15.00 |
| GPT-4o mini | $3.00 | $0.30 / $1.20 |
| GPT-4.1 | $25.00 | Higher than base model rates |

OpenAI's fine-tuning API accepts training data in JSONL format with instruction-response pairs. Users can fine-tune models through the API or the OpenAI dashboard, with training typically completing in minutes to hours depending on dataset size.

### Google Vertex AI

Google's Vertex AI platform supports supervised fine-tuning for Gemini models and open-source models like LLaMA 3.1. Vertex AI also supports fine-tuning of Gemma models, which can be deployed on-premises after tuning. The platform offers both supervised fine-tuning and RLHF-based tuning options.

### Amazon Bedrock

Amazon Bedrock provides fine-tuning capabilities for several foundation models, including Amazon's own Titan models and select partner models. Bedrock supports continued pre-training and supervised fine-tuning with data stored in Amazon S3.

### Anthropic

As of early 2026, Anthropic does not offer direct fine-tuning of Claude models through its API. Instead, Anthropic provides system prompts and constitutional AI principles as mechanisms for customizing Claude's behavior without modifying the model's weights.

## Data preparation and quality

The quality of training data is the single most important factor determining the success of a fine-tuning project. Poorly prepared data leads to poor results regardless of the method used.

### How much data do you need for fine-tuning?

The amount of data needed for effective fine-tuning varies by task type, model size, and fine-tuning method:

| Scenario | Recommended minimum | Notes |
|---|---|---|
| Text classification (BERT-style) | 200-500 examples per class | More classes need more data |
| LLM instruction tuning | 500-1,000 high-quality examples | Quality matters more than quantity |
| Domain-specific LLM adaptation | 1,000-10,000 examples | Depends on domain complexity |
| Style or format transfer | 100-500 examples | Consistent formatting is key |
| Code generation | 1,000-5,000 examples | Include diverse patterns |

Research consistently shows that a well-curated dataset of 500 high-quality examples often outperforms a noisy dataset of 10,000 examples. The relationship between dataset size and performance typically follows a pattern where roughly 80 percent of performance gains come from the first 20 percent of well-chosen examples. Beyond a certain point, adding more data of the same quality yields diminishing returns. The LIMA result, in which 1,000 curated examples were enough to make a 65B model competitive with much larger instruction-tuned systems, is a widely cited demonstration of this principle.[13]

### Data quality best practices

- **Accuracy**: Every example in the training dataset should represent the desired behavior correctly. A single incorrect or misleading example can have a disproportionate impact on a small fine-tuning dataset.
- **Consistency**: The format, tone, and style of responses should be consistent across examples. Mixed formatting confuses the model about what output style is expected.
- **Diversity**: Examples should cover the full range of inputs the model is expected to handle in production. Including edge cases and variations helps the model generalize.
- **Deduplication**: Duplicate or near-duplicate examples should be removed. Duplicates cause the model to memorize specific outputs rather than learning the underlying pattern.
- **Relevance**: Every example should be directly relevant to the target task. Including off-topic examples wastes training capacity and can confuse the model.

### Data formats

Most fine-tuning frameworks expect data in JSONL (JSON Lines) format, where each line represents one training example. The typical structure includes an instruction (or system message), an input (user message), and an output (assistant response). The Alpaca format, ShareGPT format, and OpenAI chat format are the most commonly used schemas.

## Common pitfalls

### Catastrophic forgetting

Catastrophic forgetting (also called catastrophic interference) occurs when a neural network forgets previously learned knowledge upon being trained on new data.[15] This is one of the most common problems in fine-tuning: the model gains the desired task-specific behavior but loses general capabilities it had before fine-tuning.

For example, a language model fine-tuned extensively on medical question answering might become very good at that task but lose its ability to write code or discuss history. The more extensively the model is fine-tuned and the more different the fine-tuning data is from the pre-training data, the greater the risk of catastrophic forgetting.

The theoretical basis for catastrophic forgetting was explored by Kirkpatrick et al. (2017) in their work on Elastic Weight Consolidation.[15] They drew an analogy to synaptic consolidation in neuroscience, where the brain selectively strengthens synapses important for previously learned tasks. In neural networks, the equivalent challenge is identifying which parameters are most important for retaining prior knowledge and penalizing changes to those parameters during new learning.

Mitigation strategies include:

- **Lower learning rates**: Using smaller learning rates reduces the magnitude of weight updates, preserving more pre-training knowledge.
- **Fewer training epochs**: Training for fewer epochs limits how much the weights can drift from their pre-trained values.
- **Regularization techniques**: Methods like Elastic Weight Consolidation (EWC) penalize large changes to parameters that are important for previously learned tasks. Researchers have combined EWC with LoRA in a method called EWCLoRA.
- **Parameter-efficient methods**: PEFT methods like LoRA keep the base model frozen, which inherently protects against catastrophic forgetting in the base parameters. However, studies have shown that LoRA does not fully prevent forgetting in continual learning scenarios.
- **Rehearsal-based methods**: Mixing some data from the original pre-training distribution into the fine-tuning dataset helps the model maintain its general knowledge.
- **Knowledge distillation**: Using the original model's output predictions as soft targets during fine-tuning encourages the new model to preserve the original model's learned behaviors.

### Overfitting

Overfitting occurs when the fine-tuned model memorizes the training examples rather than learning generalizable patterns. This is especially problematic with small fine-tuning datasets, which are common in practice.

Signs of overfitting include:

- Training loss continuing to decrease while validation loss increases
- The model producing near-verbatim copies of training examples
- Poor performance on inputs that differ slightly from training examples

Mitigation strategies include:

- **Early stopping**: Monitor validation loss and stop training when it begins to increase.
- **Dropout**: Apply dropout regularization during fine-tuning.
- **Data augmentation**: Increase the effective size and diversity of the training dataset through paraphrasing or other augmentation techniques.
- **Weight decay**: Apply L2 regularization to penalize large weight values.
- **Smaller learning rates**: Reduce the learning rate to slow down optimization and reduce the chance of overfitting to noise in the data.

### Other common issues

- **Learning rate too high**: Using a learning rate that is too large can destroy the pre-trained representations in the first few training steps. For transformer models, fine-tuning learning rates are typically 10 to 100 times smaller than those used in pre-training.
- **Insufficient evaluation**: Evaluating only on training metrics without a proper held-out test set can give a misleading picture of model performance.
- **Distribution mismatch**: If the fine-tuning data does not reflect the real-world distribution of inputs the model will encounter, performance in production will be poor regardless of training metrics.
- **Tokenizer issues**: Using a tokenizer that was not designed for the target domain (for example, fine-tuning a general-purpose model on code without adapting the tokenizer) can lead to inefficient encoding and degraded performance.

## Fine-tuning vs. RAG vs. prompt engineering

Practitioners working with LLMs face a common decision: should they fine-tune the model, use Retrieval-Augmented Generation (RAG), or rely on prompt engineering? Each approach has different trade-offs.

| Factor | Prompt engineering | RAG | Fine-tuning |
|---|---|---|---|
| Implementation time | Hours to days | Days to weeks | Weeks to months |
| Cost | Low (no training) | Medium (retrieval infrastructure) | High (GPU compute, data curation) |
| Data requirements | None | Document corpus | Labeled training examples |
| Knowledge freshness | Limited to model's training data | Up-to-date (real-time retrieval) | Limited to fine-tuning data |
| Customization depth | Surface-level (format, tone) | Knowledge-level (new facts) | Deep (behavior, style, domain expertise) |
| Best for | Quick prototyping; simple formatting | Factual accuracy; current information | Specialized behavior; domain adaptation |
| Limitations | Context window limits; no new knowledge | Retrieval quality varies; added latency | Expensive; risk of forgetting; static knowledge |

### When should you use fine-tuning instead of RAG or prompting?

**Prompt engineering** is the right starting point for most projects. It requires no training infrastructure, produces results immediately, and can handle many use cases through careful instruction design. Start here and only move to more complex approaches if prompt engineering is insufficient.

**RAG** is best when the application requires access to specific, up-to-date, or proprietary knowledge that the base model does not contain. Examples include customer support bots that need access to product documentation, or research assistants that need to cite recent papers.

**Fine-tuning** is best when the application requires the model to exhibit specialized behavior, adopt a particular style, or handle domain-specific tasks that cannot be adequately addressed through prompting alone. Examples include training a model to follow a specific output format consistently, adapting a model to a specialized domain like law or medicine, or teaching a model to perform a task that requires understanding domain-specific terminology.

These approaches are not mutually exclusive. Many production systems combine fine-tuning with RAG, using a fine-tuned model that is also augmented with retrieved context at inference time. This combination can provide both deep domain adaptation and access to current information.

A practical decision heuristic: if the problem is about **what** the model knows (facts, documents, recent data), RAG is likely the right tool. If the problem is about **how** the model responds (style, format, reasoning patterns, domain-specific behavior), fine-tuning is more appropriate.

## Cost and compute requirements

The computational cost of fine-tuning varies dramatically depending on the model size, method, and hardware used.

### Hardware requirements by method

| Method | 7B model | 13B model | 70B model |
|---|---|---|---|
| Full fine-tuning (FP16) | 1x 80GB GPU (A100) | 2x 80GB GPUs | 8+ 80GB GPUs |
| LoRA (FP16) | 1x 24GB GPU (RTX 3090/4090) | 1x 48GB GPU (A6000) | 2x 80GB GPUs |
| QLoRA (4-bit) | 1x 16GB GPU (RTX 4080) | 1x 24GB GPU | 1x 48GB GPU |

These figures reflect the order-of-magnitude reductions reported in the LoRA and QLoRA papers: roughly a 3x GPU-memory saving from low-rank adaptation, and a further drop to under 48 GB for a 65B model from 4-bit quantization.[5][6][19][21]

### Training time estimates

Training time depends heavily on dataset size, hardware, and hyperparameters. As a rough guide, fine-tuning a 7-billion-parameter model with LoRA on a dataset of 10,000 examples typically takes 1 to 4 hours on a single A100 GPU. Full fine-tuning of the same model on the same data might take 4 to 12 hours. For reference, the QLoRA authors reported fine-tuning their 65B Guanaco model in just 24 hours on a single GPU.[6][21]

Cloud GPU costs range from approximately $1 to $4 per hour for consumer-grade GPUs (RTX 4090) to $2 to $6 per hour for data center GPUs (A100, H100) through cloud providers. A typical LoRA fine-tuning run on a 7B model might cost $5 to $20 in cloud compute.

## Open-source tools and frameworks

A rich ecosystem of open-source tools supports LLM fine-tuning. The major frameworks include:

### Hugging Face Transformers and PEFT

The Hugging Face Transformers library is the most widely used foundation for fine-tuning. The companion PEFT library provides implementations of LoRA, QLoRA, prefix tuning, prompt tuning, adapter layers, and other parameter-efficient methods. The TRL (Transformers Reinforcement Learning) library adds support for SFT, DPO, PPO, GRPO, and other alignment techniques. As of 2025, the PEFT library supports over 20 different parameter-efficient fine-tuning methods and integrates seamlessly with the broader Hugging Face ecosystem.

### Axolotl

Axolotl is an open-source tool designed to streamline fine-tuning for large language models. It supports a broad range of training methods including full fine-tuning, LoRA, QLoRA, DPO, IPO, KTO, ORPO, GRPO, and reward modeling. Axolotl uses YAML configuration files, making it easy to define and reproduce training runs. Recent updates have added support for multimodal training of vision-language models.

### Unsloth

Unsloth focuses on training speed and memory efficiency. It uses custom CUDA kernels to achieve 2 to 5 times faster training while using up to 80 percent less VRAM compared to standard implementations. Unsloth is particularly well-suited for single-GPU training on consumer hardware, and supports LoRA, QLoRA, and full fine-tuning for a wide range of model architectures. It has shown 12x speed improvements for Mixture-of-Experts (MoE) model fine-tuning.

### LLaMA-Factory

LLaMA-Factory provides a user-friendly interface for fine-tuning LLMs. It supports LoRA, full fine-tuning, and reinforcement learning methods, and includes a web-based GUI that allows users to configure and launch training runs without writing code. It supports memory-efficient quantization and works with a wide variety of model architectures.

### Other notable tools

| Tool | Primary focus | Key feature |
|---|---|---|
| Hugging Face TRL | Alignment training (DPO, PPO, GRPO) | Most comprehensive RL-based training library |
| Axolotl | General fine-tuning | YAML-based configuration; broad method support |
| Unsloth | Speed and efficiency | Custom kernels; 2-5x faster; 80% less VRAM |
| LLaMA-Factory | Ease of use | Web GUI; no-code training configuration |
| Torchtune | PyTorch-native fine-tuning | Official PyTorch library; clean composable design |
| SWIFT (ModelScope) | Multilingual model support | Strong support for Chinese and multilingual models |

## Recent developments and trends

Several trends have shaped the fine-tuning space in 2024 and 2025:

- **Small language models (SLMs)**: The rise of capable small models (1B to 8B parameters) like Phi, Gemma, and Qwen has made fine-tuning more accessible, as these models can be fine-tuned on consumer hardware even with full fine-tuning.

- **Reinforcement fine-tuning for reasoning**: OpenAI introduced reinforcement fine-tuning for its o-series reasoning models, using RL to improve performance on specific reasoning tasks rather than traditional supervised fine-tuning.

- **Synthetic data for fine-tuning**: Using larger, more capable models to generate training data for fine-tuning smaller models has become a common practice, reducing the cost of creating high-quality training datasets.

- **Mixture-of-Experts (MoE) fine-tuning**: As MoE architectures like Mixtral and DeepSeek become popular, specialized fine-tuning techniques for these architectures have emerged.

- **Continued pre-training (CPT)**: Some practitioners perform continued pre-training on domain-specific text before supervised fine-tuning, combining the benefits of domain adaptation and task adaptation.

- **Merging fine-tuned models**: Techniques like TIES-Merging and DARE allow combining multiple separately fine-tuned models into a single model that inherits capabilities from all of them, without additional training.

- **Multi-modal fine-tuning**: As models like GPT-4, Gemini, and LLaVA integrate vision, text, and audio capabilities, fine-tuning approaches have expanded to handle multi-modal data. Vision-language models can be fine-tuned on paired image-text data to improve performance on specific visual reasoning tasks.

## Benefits and limitations

### Benefits

- **Improved task performance**: Fine-tuning consistently improves performance on target tasks compared to using a general-purpose model with prompt engineering alone.
- **Faster training than training from scratch**: Since the model starts with pre-trained weights, fine-tuning requires far less data and compute than training a model from scratch.
- **Smaller dataset requirements**: Fine-tuning can achieve good results with as few as a few hundred examples, while training from scratch typically requires millions.
- **Domain specialization**: Fine-tuning can teach a model domain-specific terminology, conventions, and reasoning patterns that are difficult to convey through prompts.
- **Consistent output formatting**: Fine-tuning is particularly effective at teaching models to produce outputs in a specific format, which can be difficult to achieve reliably through prompting.

### Limitations

- **Task similarity matters**: The effectiveness of fine-tuning depends on how related the target task is to what the model learned during pre-training. Highly specialized or unusual tasks may benefit less from transfer.[17]
- **Risk of catastrophic forgetting**: As discussed above, fine-tuning can cause the model to lose general capabilities.
- **Static knowledge**: A fine-tuned model's knowledge is limited to what was in its training data. It cannot access new information that emerges after training.
- **Cost and complexity**: Even with PEFT methods, fine-tuning requires technical expertise, GPU resources, and ongoing maintenance.
- **Evaluation difficulty**: Measuring whether a fine-tuned model has truly improved in the desired way can be challenging, especially for open-ended generation tasks.

## Explain like I'm 5 (ELI5)

Imagine you learned how to ride a bicycle really well. One day, someone gives you a unicycle and asks you to ride it. Even though a unicycle is different from a bicycle, a lot of what you already know still helps: how to balance, how to pedal, how to steer with your body. You would not need to learn everything about riding from the very beginning. You would just need to practice the new, tricky parts while keeping the skills you already have.

Fine-tuning works the same way for computers. A computer program first learns a lot of general things by studying a huge amount of information (like learning to ride a bicycle). Then, when someone wants it to do a specific job (like riding a unicycle), it uses what it already knows and just practices the new parts. This way, it learns the new job much faster and with much less practice than if it started from nothing.

## See also

- [SOAP (optimizer)](/wiki/soap_optimizer)

## References

1. Houlsby, N., et al. "Parameter-Efficient Transfer Learning for NLP." ICML 2019.
2. Howard, J. and Ruder, S. "Universal Language Model Fine-tuning for Text Classification." ACL 2018.
3. Devlin, J., et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019.
4. Radford, A., et al. "Improving Language Understanding by Generative Pre-Training." OpenAI 2018.
5. Hu, E., et al. "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. arXiv:2106.09685.
6. Dettmers, T., et al. "QLoRA: Efficient Finetuning of Quantized Language Models." NeurIPS 2023. arXiv:2305.14314.
7. Li, X. and Liang, P. "Prefix-Tuning: Optimizing Continuous Prompts for Generation." ACL 2021.
8. Lester, B., et al. "The Power of Scale for Parameter-Efficient Prompt Tuning." EMNLP 2021.
9. Wei, J., et al. "Finetuned Language Models Are Zero-Shot Learners." ICLR 2022. arXiv:2109.01652.
10. Ouyang, L., et al. "Training language models to follow instructions with human feedback." NeurIPS 2022. arXiv:2203.02155.
11. Rafailov, R., et al. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS 2023. arXiv:2305.18290.
12. Chung, H., et al. "Scaling Instruction-Finetuned Language Models." arXiv:2210.11416, 2022.
13. Zhou, C., et al. "LIMA: Less Is More for Alignment." NeurIPS 2023. arXiv:2305.11206.
14. Peters, M., et al. "Deep contextualized word representations." NAACL 2018.
15. Kirkpatrick, J., et al. "Overcoming catastrophic forgetting in neural networks." PNAS 2017.
16. Krizhevsky, A., et al. "ImageNet Classification with Deep Convolutional Neural Networks." NeurIPS 2012.
17. Yosinski, J., et al. "How transferable are features in deep neural networks?" NeurIPS 2014. arXiv:1411.1792.
18. Han, S., et al. "Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey." arXiv:2403.14608, 2024.
19. Hu, E., et al. "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685, 2021 (abstract: "LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times" relative to GPT-3 175B fine-tuned with Adam).
20. Devlin, J., et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv:1810.04805, 2018 (GLUE score 80.5%, 7.7 point absolute improvement; SQuAD v1.1 test F1 93.2).
21. Dettmers, T., et al. "QLoRA: Efficient Finetuning of Quantized Language Models." arXiv:2305.14314, 2023 (65B finetuning memory reduced from >780GB to <48GB; Guanaco reaches 99.3% of ChatGPT on the Vicuna benchmark in 24 hours on a single GPU).
22. Houlsby, N., et al. "Parameter-Efficient Transfer Learning for NLP." arXiv:1902.00751, 2019 (adapters attain within 0.4% of full fine-tuning on GLUE while adding 3.6% parameters per task).
23. Ouyang, L., et al. "Training language models to follow instructions with human feedback." arXiv:2203.02155, 2022 ("outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters").
24. Zhou, C., et al. "LIMA: Less Is More for Alignment." arXiv:2305.11206, 2023 (65B LLaMA fine-tuned on 1,000 curated examples; preferred or equivalent to GPT-4 in 43% of cases, 58% vs Bard, 65% vs DaVinci003).
