See also: Machine learning terms, Transfer learning, Large language model
Fine-tuning is a technique in machine learning where a pre-trained model is further trained on a smaller, task-specific dataset to adapt it for a particular use case. Rather than training a neural network from scratch, fine-tuning leverages the knowledge already captured in a model's weights during initial pre-training, then adjusts those weights to perform well on a new task or domain. This approach falls under the broader umbrella of transfer learning, where knowledge gained from one task is applied to improve performance on another.
Fine-tuning has become one of the most important techniques in modern AI. In computer vision, it enabled researchers to take models trained on ImageNet and quickly adapt them to specialized image recognition tasks. In natural language processing (NLP), it powers the adaptation of large language models (LLMs) like GPT, BERT, and LLaMA to specific tasks ranging from sentiment analysis to medical question answering. The practice has become so central to applied AI that the entire pipeline for building modern AI applications typically involves selecting a pre-trained foundation model and fine-tuning it, rather than training from scratch.
Fine-tuning is one of three broad strategies for applying a neural network to a new task. Understanding the differences helps practitioners choose the right approach for their situation.
Training from scratch means initializing a model with random weights and training it entirely on the target dataset. This approach makes no assumptions about prior knowledge and gives the optimizer full freedom to learn task-specific representations. However, it requires large amounts of labeled data (often millions of examples), significant compute resources, and long training times. Training from scratch is appropriate when the target domain is fundamentally different from any available pre-trained model, or when massive labeled datasets are readily available.
Feature extraction (also called the frozen-backbone approach) uses a pre-trained model as a fixed feature extractor. The pre-trained weights are entirely frozen, and only a newly attached output head (such as a classification layer) is trained on the target data. Because the backbone parameters are never updated, feature extraction is fast and requires very little compute. It works well when the target task is closely related to the pre-training task and the dataset is small (a few hundred to a few thousand examples). The downside is that the frozen features may not be optimal for the new task, especially if the domains differ significantly.
Fine-tuning occupies the middle ground. It initializes from pre-trained weights and then updates some or all of those weights on the target dataset. This gives the model the benefit of transferred knowledge while still allowing adaptation to the specifics of the new task. Fine-tuning typically outperforms feature extraction when the target dataset is moderately sized or when the target domain differs meaningfully from the pre-training domain.
| Strategy | Weights updated | Data needed | Compute cost | Best when |
|---|---|---|---|---|
| Training from scratch | All (random init) | Very large (millions) | Very high | No relevant pre-trained model exists |
| Feature extraction | Output head only | Small (hundreds) | Very low | Target domain is very similar to pre-training domain |
| Fine-tuning (partial) | Top layers + head | Moderate (thousands) | Moderate | Target domain is related but not identical |
| Fine-tuning (full) | All layers | Moderate to large | High | Maximum performance is needed; sufficient data available |
In practice, many practitioners start with feature extraction to establish a baseline, then move to fine-tuning if performance is insufficient.
The concept of transfer learning predates the modern fine-tuning era. Early work in the 1990s explored how knowledge could be transferred between neural networks. However, transfer learning became practical and widespread in computer vision after the success of deep convolutional neural networks (CNNs) on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) starting in 2012. AlexNet's victory in 2012, followed by VGGNet, GoogLeNet, and ResNet in subsequent years, established a standard workflow: pre-train a CNN on ImageNet's 1.2 million labeled training images across 1,000 categories, then fine-tune the resulting model on a smaller target dataset.
This approach worked because the early layers of CNNs learn general visual features (edges, textures, shapes) that transfer well across tasks, while later layers learn task-specific features. Researchers found that even replacing and retraining only the final classification layer could yield strong results on new image recognition tasks with as few as a hundred labeled examples.
Yosinski et al. (2014) provided an influential study on the transferability of features learned by deep neural networks. They showed that the first few layers of a CNN learn general, task-independent features, while later layers become increasingly specific to the original training task. This finding informed the common practice of freezing early layers and fine-tuning only the later, more task-specific layers.
NLP lagged behind computer vision in adopting transfer learning. For years, the standard approach in NLP involved training task-specific models from scratch using word embeddings like Word2Vec (2013) or GloVe (2014) as the only form of transferred knowledge. These static embeddings captured some semantic relationships but could not represent context-dependent word meanings.
The year 2018 marked a turning point for transfer learning in NLP, with several breakthroughs arriving in rapid succession:
ELMo (Peters et al., February 2018): Embeddings from Language Models introduced contextualized word representations generated by a bidirectional LSTM trained on a large text corpus. ELMo representations could be used as input features for downstream tasks, improving performance across a range of NLP benchmarks.
ULMFiT (Howard and Ruder, May 2018): Universal Language Model Fine-tuning for Text Classification demonstrated that a general-purpose language model, based on an AWD-LSTM architecture, could be pre-trained on a large corpus and then fine-tuned for text classification with very little labeled data. ULMFiT introduced several techniques that became standard practice, including discriminative fine-tuning (using different learning rates for different layers), slanted triangular learning rates, and gradual unfreezing of layers. It reduced classification error rates by 18 to 24 percent on most benchmarks compared to training from scratch.
GPT (Radford et al., June 2018): OpenAI's Generative Pre-trained Transformer showed that a transformer-based language model pre-trained on a large text corpus could be fine-tuned to achieve strong performance across a range of NLP tasks with minimal architecture changes.
BERT (Devlin et al., October 2018): Bidirectional Encoder Representations from Transformers introduced a pre-training approach using masked language modeling and next sentence prediction. BERT set new state-of-the-art results on 11 NLP tasks and became the most widely used foundation for fine-tuning in NLP for several years.
These models collectively established the "pre-train, then fine-tune" paradigm that dominates modern NLP.
As language models grew larger, from BERT's 340 million parameters to GPT-3's 175 billion parameters in 2020, fine-tuning practices evolved. Full fine-tuning of such massive models became prohibitively expensive for most practitioners, motivating research into more efficient alternatives. This led to the development of parameter-efficient fine-tuning methods (discussed below), as well as new paradigms such as instruction tuning, reinforcement learning from human feedback (RLHF), and the recognition that careful prompt engineering could sometimes substitute for fine-tuning entirely.
The release of open-weight models like LLaMA (Meta, 2023), Mistral (2023), and Qwen (Alibaba, 2023) democratized fine-tuning further. Combined with parameter-efficient methods like LoRA and QLoRA, individual researchers and small teams could adapt models with tens of billions of parameters on consumer-grade GPUs. By 2024, fine-tuning an open-weight LLM had become a standard skill in the ML practitioner's toolkit.
The fine-tuning process begins with a pre-trained model: a neural network that has already been trained on a large dataset. For CNNs, this is typically a model trained on ImageNet or a similar large-scale image dataset. For NLP, this is usually a transformer-based language model pre-trained on a large text corpus (such as Common Crawl, Wikipedia, or books) using a self-supervised objective like next-token prediction or masked language modeling.
These pre-trained models have already learned general features and representations from their training data. A vision model has learned to detect edges, textures, and object parts. A language model has learned grammar, facts about the world, and reasoning patterns. These general capabilities provide a strong foundation for adaptation to specific tasks.
To adapt a pre-trained model for a new task, practitioners typically follow these steps:
Modify the model architecture: Depending on the target task, the model's output layer or head may need to be replaced. For example, a language model's next-token prediction head might be replaced with a classification head for sentiment analysis, or a sequence-to-sequence head for summarization.
Prepare the training data: The task-specific dataset is formatted to match the model's expected input format. For LLMs, this often means structuring data as instruction-response pairs or prompt-completion pairs.
Initialize from pre-trained weights: The model's parameters are initialized with the pre-trained weights rather than random values. This gives the model a strong starting point.
Train on the new dataset: The model is trained on the task-specific data using standard optimization algorithms such as Adam or AdamW. Learning rates are typically set lower than those used during pre-training (often in the range of 1e-5 to 5e-5 for transformer models) to avoid overwriting the useful knowledge captured during pre-training.
Evaluate and iterate: The fine-tuned model is evaluated on a held-out validation set, and hyperparameters are adjusted as needed.
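The steps above can be illustrated with a deliberately tiny numpy stand-in (in practice one would use a framework such as PyTorch or Hugging Face Transformers). All dimensions and data here are invented for illustration: a frozen "pre-trained" backbone provides features, a freshly attached head is initialized and trained on a small task-specific dataset at a modest learning rate, and the loss is tracked for evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 3: "pre-trained" backbone weights (a stand-in for downloaded weights).
W_backbone = rng.normal(size=(8, 16))      # kept frozen throughout

# Step 1: attach a fresh output head for the new task.
w_head = np.zeros(16)

# Step 2: a tiny task-specific dataset (synthetic regression targets).
X = rng.normal(size=(64, 8))
y = np.tanh(X @ W_backbone) @ rng.normal(size=16)

def features(X):
    """Frozen backbone forward pass."""
    return np.tanh(X @ W_backbone)

# Step 4: train only the head with a small learning rate.
lr = 1e-2
losses = []
for _ in range(200):
    H = features(X)
    err = H @ w_head - y
    losses.append(float(np.mean(err ** 2)))
    grad = 2 * H.T @ err / len(X)          # gradient of the MSE w.r.t. the head
    w_head -= lr * grad

# Step 5: evaluate — the loss should fall as the head adapts.
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the backbone never changes, this particular sketch is the feature-extraction end of the spectrum; unfreezing `W_backbone` and updating it with an even smaller learning rate would turn it into fine-tuning proper.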
The choice of learning rate is one of the most critical decisions in fine-tuning. Because the model already contains useful pre-trained representations, the goal is to adjust weights enough to learn the new task without destroying the knowledge acquired during pre-training.
Lower learning rates. Fine-tuning learning rates are typically 10 to 100 times smaller than those used for training from scratch. For BERT-based models, a learning rate of 2e-5 to 5e-5 is standard. For larger LLMs, rates as low as 1e-6 to 5e-6 may be appropriate. Starting with too high a learning rate can catastrophically overwrite pre-trained features within the first few gradient steps.
Learning rate warmup. Many fine-tuning schedules begin with a warmup period during which the learning rate gradually increases from near zero to the target value over the first 5 to 10 percent of training steps. Warmup prevents large, destabilizing updates at the start of training when the gradients of the newly initialized output head may be noisy. After warmup, the rate typically follows a linear or cosine decay schedule.
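A warmup-then-decay schedule of the kind described above takes only a few lines. This is a generic sketch, not any framework's exact implementation; the peak rate of 2e-5 and the 10 percent warmup fraction are illustrative defaults.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.1):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps       # linear warmup
    # progress through the decay phase, in [0, 1]
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * t))

total = 1000
schedule = [lr_at(s, total) for s in range(total)]
```

The rate climbs for the first 100 steps, peaks at 2e-5, and decays smoothly to near zero by the final step.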
Discriminative learning rates. Introduced by Howard and Ruder in ULMFiT (2018), this strategy assigns different learning rates to different layers of the network. Earlier layers, which contain more general features, receive smaller learning rates to preserve their representations. Later layers, which are more task-specific, receive larger learning rates to allow faster adaptation. A common approach is to set the learning rate for each layer group as a fraction (typically one-tenth) of the learning rate of the layer group above it. This technique has been shown to reduce overfitting and improve generalization, particularly in low-data scenarios.
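Using the one-tenth-per-group convention mentioned above, per-group learning rates can be generated directly (the decay factor, group count, and head rate here are illustrative; ULMFiT itself divided by a factor of 2.6 between layers):

```python
def discriminative_lrs(head_lr, n_groups, decay=10.0):
    """Return one learning rate per layer group.

    Index 0 is the earliest (most general) group; the last index is the
    output head. Each group trains at 1/decay the rate of the group above.
    """
    return [head_lr / decay ** (n_groups - 1 - i) for i in range(n_groups)]

lrs = discriminative_lrs(head_lr=2e-5, n_groups=4)
# earliest layers train most slowly, the head fastest
```

In a real training loop these values would be assigned to per-layer-group parameter groups in the optimizer.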
Gradual unfreezing. Rather than updating all layers from the start, gradual unfreezing begins by training only the output head, then progressively unfreezes deeper layers over the course of training. This staged approach prevents large, destabilizing weight updates from damaging the pre-trained features in earlier layers. ULMFiT demonstrated that combining gradual unfreezing with discriminative learning rates consistently outperformed training all layers simultaneously.
In full fine-tuning, all of the model's parameters are updated during training on the new dataset. This gives the optimizer maximum flexibility to adapt the model but comes with significant costs: it requires enough GPU memory to hold gradients and optimizer states for every parameter (with Adam-style optimizers, several times the memory of the weights alone), it produces a complete copy of the model for every fine-tuned task, and it carries the greatest risk of catastrophic forgetting.
Full fine-tuning remains the gold standard in terms of potential performance, and for smaller models (under a few billion parameters), it is often practical with modern hardware. For very large models, however, parameter-efficient alternatives are usually preferred.
Parameter-efficient fine-tuning (PEFT) methods keep most of the pre-trained model's parameters frozen and introduce or select a small number of trainable parameters. This dramatically reduces memory requirements, training time, and storage costs while typically achieving 90 to 95 percent of full fine-tuning performance. PEFT methods have become the standard approach for adapting large language models.
The main families of PEFT methods are described in the following sections.
LoRA (Low-Rank Adaptation of Large Language Models) was introduced by Edward Hu and colleagues at Microsoft Research in a 2021 paper (published at ICLR 2022). It has become the most widely used PEFT method for fine-tuning large language models.
The core idea behind LoRA is based on an important observation: the weight updates that occur during fine-tuning have a low intrinsic rank. In other words, the changes needed to adapt a model to a new task can be captured by much smaller matrices than the full weight matrices.
Specifically, for a pre-trained weight matrix W_0 with dimensions d x k, LoRA represents the weight update as a product of two low-rank matrices: deltaW = B * A, where B has dimensions d x r and A has dimensions r x k, and the rank r is much smaller than both d and k. During training, the original weight matrix W_0 is frozen and only the small matrices A and B receive gradient updates.
The number of trainable parameters is determined by the rank r and the number of weight matrices that LoRA is applied to. For GPT-3 with 175 billion parameters, the authors showed that a rank as low as 1 or 2 was sufficient for good performance, even though the full rank of the weight matrices is 12,288. This can reduce the number of trainable parameters by a factor of 10,000 and GPU memory requirements by a factor of 3 compared to full fine-tuning.
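The decomposition is easy to express directly. The numpy sketch below uses illustrative dimensions (not GPT-3's): the pre-trained matrix W0 stays frozen, only the factors A and B would receive gradients, and the alpha/r scaling from the paper is applied as a constant. B is initialized to zero so that the adapted model starts out identical to the base model.

```python
import numpy as np

d, k, r = 512, 512, 8        # illustrative dimensions; r << d, k
alpha = 16                   # LoRA scaling hyperparameter

rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, k))             # pre-trained weight, frozen
A = rng.normal(scale=0.01, size=(r, k))  # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init => deltaW starts at 0

def forward(x, W0, A, B):
    # Equivalent to x @ (W0 + (alpha / r) * B @ A).T, but without ever
    # materializing the full d x k update matrix.
    return x @ W0.T + (alpha / r) * (x @ A.T) @ B.T

full_params = d * k
lora_params = r * (d + k)
print(f"trainable params: {lora_params} vs {full_params} "
      f"({full_params // lora_params}x fewer)")
```

After training, `(alpha / r) * B @ A` can be added into `W0` once, so inference pays no extra cost.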
Key advantages of LoRA include:
- No inference overhead: after training, the product B * A can be merged into the frozen weight matrix, so the adapted model runs exactly as fast as the original.
- Small checkpoints: only the A and B matrices need to be stored per task, typically megabytes rather than the many gigabytes of a full model copy.
- Cheap task switching: many LoRA adapters can share a single frozen base model, and swapping adapters switches tasks.
- Competitive quality: on many benchmarks, LoRA matches or closely approaches full fine-tuning performance.
Since its introduction, several variants of LoRA have been proposed. DoRA (Weight-Decomposed Low-Rank Adaptation, 2024) decomposes pre-trained weights into magnitude and direction components and applies LoRA only to the directional component, often matching full fine-tuning performance more closely. LoRA+ (2024) improves upon LoRA by using different learning rates for the A and B matrices, yielding faster convergence.
QLoRA was introduced by Dettmers et al. in 2023 (NeurIPS 2023). It combines LoRA with aggressive quantization to reduce memory requirements even further, enabling fine-tuning of very large models on consumer-grade hardware.
QLoRA introduces three technical innovations:
4-bit NormalFloat (NF4) quantization: A new data type specifically designed for weights that follow a normal distribution. NF4 assigns each weight in a block to one of 16 quantile bins of a normal distribution, storing only the index and a floating-point scale. The authors found NF4 to be information-theoretically optimal for normally distributed weights and superior to both FP4 and Int4 in post-quantization accuracy.
Double quantization: This technique reduces memory overhead further by also quantizing the quantization constants themselves, saving approximately 0.37 bits per parameter on average.
Paged optimizers: Using NVIDIA's unified memory feature, QLoRA enables seamless page transfers between GPU and CPU memory when the GPU runs out of memory, preventing out-of-memory errors during training.
With these innovations, QLoRA enables fine-tuning of a 65-billion-parameter model on a single 48 GB GPU while preserving the performance of full 16-bit fine-tuning. This made fine-tuning of large models accessible to individual researchers and small teams with limited hardware budgets.
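The quantile-binning idea behind NF4 can be sketched with the standard library alone. This is a simplified illustration only: real NF4 uses asymmetric quantiles with an exact zero code, and production implementations live in the bitsandbytes library. Here each block of weights is mapped to the nearest of 16 normal-quantile levels, storing just a 4-bit index per weight plus one scale per block.

```python
import statistics

# 16 evenly spaced quantiles of a standard normal, rescaled to [-1, 1].
# (Simplified: real NF4 uses asymmetric quantiles and pins an exact zero.)
nd = statistics.NormalDist()
raw = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
levels = [q / max(abs(q) for q in raw) for q in raw]

def quantize_block(weights):
    """Quantize one block to 4-bit level indices plus a single float scale."""
    scale = max(abs(w) for w in weights) or 1.0
    idx = [min(range(16), key=lambda i: abs(w / scale - levels[i]))
           for w in weights]
    return idx, scale

def dequantize_block(idx, scale):
    return [levels[i] * scale for i in idx]

block = [0.4, -1.2, 0.05, 0.8, -0.3, 1.2, -0.7, 0.0]
idx, scale = quantize_block(block)
recon = dequantize_block(idx, scale)
```

Double quantization then compresses the per-block `scale` values themselves, which is where the extra ~0.37 bits per parameter are saved.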
Prefix tuning, introduced by Li and Liang in 2021, prepends a sequence of trainable continuous vectors (the "prefix") to the keys and values at every layer of the transformer. The prefix vectors are optimized during training while all of the model's original parameters remain frozen.
Unlike discrete text prompts, these prefix vectors exist in the model's continuous embedding space and are not constrained to correspond to real words or tokens. This gives prefix tuning more expressiveness than manual prompt engineering while keeping the number of trainable parameters very small (typically less than 1 percent of the model's total parameters).
Prefix tuning has shown strong performance on generation tasks such as table-to-text and summarization, sometimes matching or approaching full fine-tuning performance.
Adapter layers, first proposed by Houlsby et al. in 2019, insert small trainable modules (adapters) between existing layers of the transformer. Each adapter typically consists of a down-projection that reduces the hidden dimension, a nonlinear activation function, and an up-projection back to the original dimension, forming a bottleneck structure.
During fine-tuning, only the adapter parameters are trained while the original model weights remain frozen. Adapters add a small number of parameters (typically 1 to 5 percent of the original model size) and can achieve performance comparable to full fine-tuning on many tasks.
A notable drawback of adapter layers is that they introduce additional computation during inference, adding some latency. This contrasts with LoRA, which can be merged into the base model weights at inference time with no additional overhead.
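The bottleneck structure can be sketched in a few lines of numpy. Dimensions are illustrative; real adapters also involve layer normalization and are inserted after the attention and feed-forward sublayers. The up-projection is zero-initialized so the adapter starts as an identity function and cannot disturb the pre-trained model at the beginning of training.

```python
import numpy as np

d_model, d_bottleneck = 768, 64     # illustrative sizes
rng = np.random.default_rng(0)

W_down = rng.normal(scale=0.02, size=(d_model, d_bottleneck))
W_up = np.zeros((d_bottleneck, d_model))   # zero init: adapter starts as identity

def adapter(h):
    """Bottleneck adapter with a residual connection: h + up(relu(down(h)))."""
    z = np.maximum(h @ W_down, 0.0)        # down-project + nonlinearity
    return h + z @ W_up                    # up-project + residual

h = rng.normal(size=(4, d_model))
out = adapter(h)

adapter_params = 2 * d_model * d_bottleneck
print(f"adapter params per insertion point: {adapter_params}")
```

The extra matrix multiplications in `adapter` run at inference time too, which is the source of the latency overhead noted above.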
Prompt tuning, introduced by Lester et al. in 2021, prepends a set of trainable embedding vectors ("soft prompts") to the input of the model. Unlike prefix tuning, which adds trainable vectors at every layer, prompt tuning only modifies the input embedding layer. This makes it the most parameter-efficient of the major PEFT methods.
Soft prompts are initialized either randomly or from the embeddings of real text tokens, and they are then optimized via backpropagation. The rest of the model remains completely frozen.
Prompt tuning has been shown to approach the performance of full fine-tuning as model size increases. For models with 10 billion or more parameters, the performance gap between prompt tuning and full fine-tuning becomes very small. For smaller models, however, the gap can be significant.
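In implementation terms, prompt tuning simply concatenates a trainable matrix to the frozen token embeddings. A shape-level numpy sketch, with all sizes and token ids invented for illustration:

```python
import numpy as np

vocab, d_model, prompt_len = 5000, 256, 20     # illustrative sizes
rng = np.random.default_rng(0)

embedding = rng.normal(size=(vocab, d_model))  # frozen input embedding table
soft_prompt = rng.normal(scale=0.5, size=(prompt_len, d_model))  # trainable

token_ids = np.array([17, 2504, 99])           # hypothetical input tokens
inputs = np.concatenate([soft_prompt, embedding[token_ids]], axis=0)

trainable = soft_prompt.size                   # the only parameters that train
```

Only `soft_prompt` receives gradients; everything downstream of `inputs`, including the embedding table, stays frozen.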
The following table summarizes the key characteristics of the major fine-tuning approaches:
| Approach | Trainable parameters | Memory reduction | Inference overhead | Best for | Key limitation |
|---|---|---|---|---|---|
| Full fine-tuning | 100% of model | None (baseline) | None | Maximum performance; smaller models | Very high memory and compute cost |
| LoRA | ~0.01-1% of model | 3x or more | None (merged at inference) | General-purpose LLM adaptation | Slight performance gap vs. full fine-tuning on some tasks |
| QLoRA | ~0.01-1% of model | 10x or more | Minimal (quantized inference) | Large models on limited hardware | Quantization can affect precision on some tasks |
| Prefix tuning | <1% of model | 5-10x | Minimal | Generation tasks (summarization, translation) | Less effective on classification tasks |
| Adapter layers | 1-5% of model | 2-5x | Small latency increase | Multi-task serving | Adds inference latency |
| Prompt tuning | <0.1% of model | 10x or more | None | Very large models; task switching | Underperforms on smaller models |
Fine-tuning in computer vision follows patterns that were established during the ImageNet era and remain relevant today. The standard workflow involves taking a CNN or vision transformer pre-trained on a large-scale dataset (typically ImageNet-1K or ImageNet-21K) and adapting it to a specialized task such as medical image classification, satellite imagery analysis, or fine-grained recognition.
CNNs learn a hierarchy of features, from low-level patterns like edges and textures in early layers to high-level, task-specific concepts in later layers. This hierarchy informs several common fine-tuning strategies:
- Head-only training (feature extraction): freeze the entire backbone and train only a new classification head.
- Partial fine-tuning: freeze the early, general-purpose layers and fine-tune the later, task-specific layers together with the head.
- Full fine-tuning: update all layers with a low learning rate, often after briefly training the new head alone so its random initialization does not disrupt the pre-trained features.
The choice among these strategies depends on the size and similarity of the target dataset. Smaller, more similar datasets favor more aggressive freezing, while larger or more dissimilar datasets benefit from updating more layers.
With the rise of Vision Transformers (ViT) and models like DINOv2, CLIP, and SAM, fine-tuning practices in vision have evolved. These models are pre-trained on diverse data and produce highly general features. LoRA and other PEFT methods have been adapted from NLP to vision transformers, allowing efficient fine-tuning of large vision models. For example, applying LoRA to the attention weight matrices in a ViT follows the same low-rank decomposition principle used for language models.
Instruction tuning is a specialized form of fine-tuning that trains a language model to follow natural language instructions across a wide range of tasks. Unlike standard fine-tuning, which adapts a model for a single task, instruction tuning aims to produce a general-purpose model that can understand and execute diverse instructions.
FLAN (Finetuned Language Net) was introduced by Google Research in 2021 (Wei et al.). The researchers took a 137-billion-parameter pre-trained language model and instruction-tuned it on over 60 NLP tasks, each expressed through natural language instruction templates. For example, rather than providing a sentiment classification dataset in a standard label format, the data was reformulated as instructions like "Is the sentiment of the following review positive or negative?"
The results were striking: FLAN substantially outperformed the unmodified base model on unseen tasks and surpassed zero-shot GPT-3 (175 billion parameters) on 20 of 25 evaluation tasks. FLAN also outperformed few-shot GPT-3 on several benchmarks including ANLI, RTE, BoolQ, and AI2-ARC.
The FLAN-T5 and FLAN-PaLM follow-up work (Chung et al., 2022) scaled instruction tuning further, using a collection of over 1,800 tasks. This work showed that instruction tuning benefits from both increased numbers of tasks and increased model scale.
InstructGPT (Ouyang et al., 2022) introduced a three-stage training process that became the template for aligning LLMs with human intentions:
Supervised fine-tuning (SFT): A pre-trained GPT-3 model was fine-tuned on a dataset of human-written demonstrations of desired behavior. Human labelers wrote high-quality responses to a range of prompts.
Reward model training: A separate model was trained to predict which of two responses a human would prefer, using a dataset of human preference comparisons.
Reinforcement learning from human feedback (RLHF): The SFT model was further optimized using Proximal Policy Optimization (PPO) with the reward model providing the reward signal.
The resulting InstructGPT model (with only 1.3 billion parameters) was preferred by human evaluators over the outputs of the much larger 175-billion-parameter GPT-3, demonstrating the power of alignment through fine-tuning.
RLHF has become a standard component of the LLM training pipeline. After supervised fine-tuning, RLHF uses human preference data to further align the model's outputs with human values and expectations.
The RLHF process involves:
- Collecting preference data: human annotators compare pairs of model outputs for the same prompt and indicate which they prefer.
- Training a reward model: a model is trained on these comparisons to predict a scalar score reflecting human preference.
- Optimizing the policy: the language model is updated with a reinforcement learning algorithm (typically PPO) to maximize the reward model's score, usually with a KL-divergence penalty against the original model to keep outputs from drifting too far.
RLHF has been used to train models including ChatGPT, Claude, and many other commercial LLMs. However, the process is complex and computationally expensive, requiring training and maintaining a separate reward model and using RL optimization, which can be unstable.
Direct Preference Optimization (DPO), introduced by Rafailov et al. in 2023, offers a simpler alternative to RLHF. DPO bypasses the need for a separate reward model by directly optimizing the language model on preference data using a classification-style loss function.
The key insight is that the optimal policy under the RLHF objective can be expressed in closed form as a function of the reward model. DPO reparameterizes this relationship to define a loss function that directly uses preference pairs without explicitly training a reward model.
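For a single preference pair, the resulting loss can be written directly from summed sequence log-probabilities under the policy and the frozen reference model. The sketch below uses made-up log-probability values; `beta` is the usual DPO temperature hyperparameter.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, from summed sequence log-probs."""
    # Implicit reward margin: how much more the policy (vs. the reference)
    # favors the chosen response over the rejected one.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))   # -log sigmoid

# Hypothetical numbers: the policy already slightly prefers the chosen response.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.5)
```

The loss is log 2 when the policy and reference agree exactly, and shrinks as the policy's preference for the chosen response grows, which is what pushes training in the preferred direction.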
DPO offers several practical advantages over RLHF:
- No separate reward model needs to be trained or kept in memory during optimization.
- Training uses a simple, stable classification-style loss rather than reinforcement learning, avoiding PPO's instability and hyperparameter sensitivity.
- The overall pipeline requires substantially less compute and engineering effort.
- On many benchmarks, DPO-trained models match the quality of PPO-based RLHF.
DPO has become widely adopted for training open-source LLMs, including Zephyr-7B and various Mistral-based models. However, some research suggests that RLHF-trained models may perform slightly better in safety evaluations and out-of-distribution generalization.
Several alternatives to DPO have been proposed:
- IPO (Identity Preference Optimization) adds a regularization term that reduces overfitting to the preference data.
- KTO (Kahneman-Tversky Optimization) learns from simple binary "desirable/undesirable" labels rather than paired comparisons, which are easier to collect.
- ORPO (Odds Ratio Preference Optimization) folds preference optimization into the SFT stage itself, removing the need for a separate reference model.
- SimPO uses a sequence's length-normalized log-probability as an implicit reward, also eliminating the reference model.
In the modern LLM development pipeline, supervised fine-tuning (SFT) typically serves as the bridge between pre-training and alignment. The standard pipeline consists of three stages:
Pre-training: The base model is trained on a massive text corpus using self-supervised objectives (next-token prediction). This stage teaches the model language, facts, and reasoning patterns but does not teach it to follow instructions or be helpful.
Supervised fine-tuning (SFT): The pre-trained model is fine-tuned on a curated dataset of high-quality instruction-response pairs. This teaches the model to follow instructions, produce helpful responses, and adopt a conversational format. SFT transforms a completion model into a chat model.
Alignment (RLHF/DPO): The SFT model is further refined using human preference data to improve safety, reduce harmful outputs, and better align with human values.
The quality of the SFT dataset is often more important than its size. Research has shown that fine-tuning on a small set of carefully curated, high-quality examples can outperform fine-tuning on much larger but lower-quality datasets. The LIMA paper (Zhou et al., 2023) demonstrated that fine-tuning LLaMA-65B on only 1,000 carefully selected examples produced a model competitive with models trained on much larger instruction datasets.
Several major AI companies offer fine-tuning through managed APIs, allowing users to customize models without managing their own infrastructure.
OpenAI provides a fine-tuning API that supports several models:
| Model | Training cost (per 1M tokens) | Inference cost (input / output per 1M tokens) |
|---|---|---|
| GPT-4o | $25.00 | $3.75 / $15.00 |
| GPT-4o mini | $3.00 | $0.30 / $1.20 |
| GPT-4.1 | $25.00 | Higher than base model rates |
OpenAI's fine-tuning API accepts training data in JSONL format with instruction-response pairs. Users can fine-tune models through the API or the OpenAI dashboard, with training typically completing in minutes to hours depending on dataset size.
Google's Vertex AI platform supports supervised fine-tuning for Gemini models and open-source models like LLaMA 3.1. Vertex AI also supports fine-tuning of Gemma models, which can be deployed on-premises after tuning. The platform offers both supervised fine-tuning and RLHF-based tuning options.
Amazon Bedrock provides fine-tuning capabilities for several foundation models, including Amazon's own Titan models and select partner models. Bedrock supports continued pre-training and supervised fine-tuning with data stored in Amazon S3.
As of early 2026, Anthropic does not offer direct fine-tuning of Claude models through its API. Instead, Anthropic provides system prompts and constitutional AI principles as mechanisms for customizing Claude's behavior without modifying the model's weights.
The quality of training data is the single most important factor determining the success of a fine-tuning project. Poorly prepared data leads to poor results regardless of the method used.
The amount of data needed for effective fine-tuning varies by task type, model size, and fine-tuning method:
| Scenario | Recommended minimum | Notes |
|---|---|---|
| Text classification (BERT-style) | 200-500 examples per class | More classes need more data |
| LLM instruction tuning | 500-1,000 high-quality examples | Quality matters more than quantity |
| Domain-specific LLM adaptation | 1,000-10,000 examples | Depends on domain complexity |
| Style or format transfer | 100-500 examples | Consistent formatting is key |
| Code generation | 1,000-5,000 examples | Include diverse patterns |
Research consistently shows that a well-curated dataset of 500 high-quality examples often outperforms a noisy dataset of 10,000 examples. The relationship between dataset size and performance typically follows a pattern where roughly 80 percent of performance gains come from the first 20 percent of well-chosen examples. Beyond a certain point, adding more data of the same quality yields diminishing returns.
Most fine-tuning frameworks expect data in JSONL (JSON Lines) format, where each line represents one training example. The typical structure includes an instruction (or system message), an input (user message), and an output (assistant response). The Alpaca format, ShareGPT format, and OpenAI chat format are the most commonly used schemas.
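A minimal example of writing one training example per line in the OpenAI-style chat schema (the field names follow that format; the content is invented):

```python
import json

examples = [
    {"messages": [
        {"role": "system", "content": "You are a concise support assistant."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Open Settings > Account > Reset password."},
    ]},
]

# JSONL: one complete JSON object per line, no enclosing array.
jsonl = "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)

# Each line must parse independently as a complete JSON object.
for line in jsonl.splitlines():
    record = json.loads(line)
```

Alpaca-format data uses `instruction`/`input`/`output` keys instead, but the one-object-per-line structure is the same.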
Catastrophic forgetting (also called catastrophic interference) occurs when a neural network forgets previously learned knowledge upon being trained on new data. This is one of the most common problems in fine-tuning: the model gains the desired task-specific behavior but loses general capabilities it had before fine-tuning.
For example, a language model fine-tuned extensively on medical question answering might become very good at that task but lose its ability to write code or discuss history. The more extensively the model is fine-tuned and the more different the fine-tuning data is from the pre-training data, the greater the risk of catastrophic forgetting.
The theoretical basis for catastrophic forgetting was explored by Kirkpatrick et al. (2017) in their work on Elastic Weight Consolidation. They drew an analogy to synaptic consolidation in neuroscience, where the brain selectively strengthens synapses important for previously learned tasks. In neural networks, the equivalent challenge is identifying which parameters are most important for retaining prior knowledge and penalizing changes to those parameters during new learning.
Mitigation strategies include:
- Using lower learning rates and fewer training epochs, so weights move less far from their pre-trained values.
- Mixing a portion of general-domain or pre-training-style data into the fine-tuning dataset (sometimes called rehearsal or replay).
- Regularization methods such as Elastic Weight Consolidation, which penalize changes to parameters identified as important for prior tasks.
- Parameter-efficient methods like LoRA, which leave the base weights frozen; the original behavior can always be recovered by removing the adapter.
Overfitting occurs when the fine-tuned model memorizes the training examples rather than learning generalizable patterns. This is especially problematic with small fine-tuning datasets, which are common in practice.
Signs of overfitting include:
- Training loss continues to fall while validation loss plateaus or rises.
- The model reproduces training examples nearly verbatim.
- Performance is strong on prompts phrased like the training data but degrades sharply on paraphrases or slightly novel inputs.
Mitigation strategies include:
- Early stopping based on validation loss.
- Collecting more, and more diverse, training examples.
- Regularization such as dropout and weight decay.
- Training for fewer epochs; one to three epochs is common in LLM fine-tuning.
- With LoRA, reducing the rank or the number of adapted weight matrices to shrink the trainable parameter count.
Practitioners working with LLMs face a common decision: should they fine-tune the model, use Retrieval-Augmented Generation (RAG), or rely on prompt engineering? Each approach has different trade-offs.
| Factor | Prompt engineering | RAG | Fine-tuning |
|---|---|---|---|
| Implementation time | Hours to days | Days to weeks | Weeks to months |
| Cost | Low (no training) | Medium (retrieval infrastructure) | High (GPU compute, data curation) |
| Data requirements | None | Document corpus | Labeled training examples |
| Knowledge freshness | Limited to model's training data | Up-to-date (real-time retrieval) | Limited to fine-tuning data |
| Customization depth | Surface-level (format, tone) | Knowledge-level (new facts) | Deep (behavior, style, domain expertise) |
| Best for | Quick prototyping; simple formatting | Factual accuracy; current information | Specialized behavior; domain adaptation |
| Limitations | Context window limits; no new knowledge | Retrieval quality varies; added latency | Expensive; risk of forgetting; static knowledge |
Prompt engineering is the right starting point for most projects. It requires no training infrastructure, produces results immediately, and can handle many use cases through careful instruction design. Start here and only move to more complex approaches if prompt engineering is insufficient.
RAG is best when the application requires access to specific, up-to-date, or proprietary knowledge that the base model does not contain. Examples include customer support bots that need access to product documentation, or research assistants that need to cite recent papers.
Fine-tuning is best when the application requires the model to exhibit specialized behavior, adopt a particular style, or handle domain-specific tasks that cannot be adequately addressed through prompting alone. Examples include training a model to follow a specific output format consistently, adapting a model to a specialized domain like law or medicine, or teaching a model to perform a task that requires understanding domain-specific terminology.
These approaches are not mutually exclusive. Many production systems combine fine-tuning with RAG, using a fine-tuned model that is also augmented with retrieved context at inference time. This combination can provide both deep domain adaptation and access to current information.
A practical decision heuristic: if the problem is about what the model knows (facts, documents, recent data), RAG is likely the right tool. If the problem is about how the model responds (style, format, reasoning patterns, domain-specific behavior), fine-tuning is more appropriate.
The computational cost of fine-tuning varies dramatically with model size, method, and hardware. The table below gives approximate GPU requirements for common configurations.
| Method | 7B model | 13B model | 70B model |
|---|---|---|---|
| Full fine-tuning (FP16) | 1x 80GB GPU (A100) | 2x 80GB GPUs | 8+ 80GB GPUs |
| LoRA (FP16) | 1x 24GB GPU (RTX 3090/4090) | 1x 48GB GPU (A6000) | 2x 80GB GPUs |
| QLoRA (4-bit) | 1x 16GB GPU (RTX 4080) | 1x 24GB GPU | 1x 48GB GPU |
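The memory savings in the table follow from how few parameters LoRA actually trains. A back-of-the-envelope calculation (illustrative numbers loosely modeled on a LLaMA-7B-style architecture; exact dimensions vary by model):

```python
def lora_trainable_params(d_model, rank, matrices_per_layer, num_layers):
    """Parameters added by LoRA adapters: each adapted d x d weight matrix
    gets two low-rank factors, A (rank x d) and B (d x rank),
    i.e. 2 * rank * d trainable parameters per matrix."""
    per_matrix = 2 * rank * d_model
    return per_matrix * matrices_per_layer * num_layers

# Illustrative 7B-class numbers: hidden size 4096, 32 layers,
# LoRA rank 8 applied to the q and v attention projections only.
adapter = lora_trainable_params(d_model=4096, rank=8, matrices_per_layer=2, num_layers=32)
print(adapter)              # -> 4194304 (about 4.2M adapter parameters)
print(adapter / 7e9 * 100)  # roughly 0.06 percent of a 7B model
```

Because gradients and optimizer state are needed only for these adapter weights, optimizer memory shrinks by orders of magnitude; QLoRA additionally stores the frozen base weights in 4-bit precision, which is why a 7B model fits on a 16GB card.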
Training time depends heavily on dataset size, hardware, and hyperparameters. As a rough guide, fine-tuning a 7-billion-parameter model with LoRA on a dataset of 10,000 examples typically takes 1 to 4 hours on a single A100 GPU. Full fine-tuning of the same model on the same data might take 4 to 12 hours.
Cloud GPU pricing is approximately $1 to $4 per hour for consumer-grade GPUs (RTX 4090) and $2 to $6 per hour for data center GPUs (A100, H100). A typical LoRA fine-tuning run on a 7B model might cost $5 to $20 in cloud compute.
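These cost estimates are simply training time multiplied by the hourly rate; a quick sanity check using the rough ranges quoted above (the figures are ballpark rates, not exact prices):

```python
def training_cost(hours, dollars_per_hour):
    """Cloud compute cost for a single training run."""
    return hours * dollars_per_hour

# LoRA on a 7B model: roughly 1-4 hours on an A100 at roughly $2-6/hour.
low = training_cost(1, 2)    # best case
high = training_cost(4, 6)   # worst case
print(f"${low} to ${high}")  # the $2-$24 bracket around the typical $5-$20 range
```

The same arithmetic scales linearly: a 10x larger dataset or a full fine-tune at 4x the wall-clock time multiplies the bill accordingly.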
A rich ecosystem of open-source tools supports LLM fine-tuning. The major frameworks include:
The Hugging Face Transformers library is the most widely used foundation for fine-tuning. The companion PEFT library provides implementations of LoRA, QLoRA, prefix tuning, prompt tuning, adapter layers, and other parameter-efficient methods. The TRL (Transformers Reinforcement Learning) library adds support for SFT, DPO, PPO, GRPO, and other alignment techniques. As of 2025, the PEFT library supports over 20 different parameter-efficient fine-tuning methods and integrates seamlessly with the broader Hugging Face ecosystem.
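A minimal PEFT usage sketch is shown below. It is not runnable without downloading multi-gigabyte model weights, and the model name and hyperparameter values are illustrative choices, not recommendations; the API calls themselves (`LoraConfig`, `get_peft_model`) are standard PEFT.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; any causal LM on the Hugging Face Hub works the same way.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # which weight matrices receive adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
```

The wrapped model can then be passed to a standard Transformers or TRL training loop; only the adapter weights receive gradients.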
Axolotl is an open-source tool designed to streamline fine-tuning for large language models. It supports a broad range of training methods including full fine-tuning, LoRA, QLoRA, DPO, IPO, KTO, ORPO, GRPO, and reward modeling. Axolotl uses YAML configuration files, making it easy to define and reproduce training runs. Recent updates have added support for multimodal training of vision-language models.
Unsloth focuses on training speed and memory efficiency. It uses custom CUDA kernels to achieve 2 to 5 times faster training while using up to 80 percent less VRAM compared to standard implementations. Unsloth is particularly well-suited for single-GPU training on consumer hardware, and supports LoRA, QLoRA, and full fine-tuning for a wide range of model architectures. It has reported up to 12x speedups for Mixture-of-Experts (MoE) model fine-tuning.
LLaMA-Factory provides a user-friendly interface for fine-tuning LLMs. It supports LoRA, full fine-tuning, and reinforcement learning methods, and includes a web-based GUI that allows users to configure and launch training runs without writing code. It supports memory-efficient quantization and works with a wide variety of model architectures.
| Tool | Primary focus | Key feature |
|---|---|---|
| Hugging Face TRL | Alignment training (DPO, PPO, GRPO) | Most comprehensive RL-based training library |
| Axolotl | General fine-tuning | YAML-based configuration; broad method support |
| Unsloth | Speed and efficiency | Custom kernels; 2-5x faster; 80% less VRAM |
| LLaMA-Factory | Ease of use | Web GUI; no-code training configuration |
| Torchtune | PyTorch-native fine-tuning | Official PyTorch library; clean composable design |
| SWIFT (ModelScope) | Multilingual model support | Strong support for Chinese and multilingual models |
Several trends have shaped the fine-tuning space in 2024 and 2025:
Small language models (SLMs): The rise of capable small models (1B to 8B parameters) like Phi, Gemma, and Qwen has made fine-tuning more accessible; models in this range fit on consumer hardware even for full fine-tuning.
Reinforcement fine-tuning for reasoning: OpenAI introduced reinforcement fine-tuning for its o-series reasoning models, using RL to improve performance on specific reasoning tasks rather than traditional supervised fine-tuning.
Synthetic data for fine-tuning: Using larger, more capable models to generate training data for fine-tuning smaller models has become a common practice, reducing the cost of creating high-quality training datasets.
Mixture-of-Experts (MoE) fine-tuning: As MoE architectures like Mixtral and DeepSeek become popular, specialized fine-tuning techniques for these architectures have emerged.
Continued pre-training (CPT): Some practitioners perform continued pre-training on domain-specific text before supervised fine-tuning, combining the benefits of domain adaptation and task adaptation.
Merging fine-tuned models: Techniques like TIES-Merging and DARE allow combining multiple separately fine-tuned models into a single model that inherits capabilities from all of them, without additional training.
Multi-modal fine-tuning: As models such as GPT-4, Gemini, and LLaVA combine text with vision (and, in some cases, audio), fine-tuning approaches have expanded to handle multi-modal data. Vision-language models can be fine-tuned on paired image-text data to improve performance on specific visual reasoning tasks.
Imagine you learned how to ride a bicycle really well. One day, someone gives you a unicycle and asks you to ride it. Even though a unicycle is different from a bicycle, a lot of what you already know still helps: how to balance, how to pedal, how to steer with your body. You would not need to learn everything about riding from the very beginning. You would just need to practice the new, tricky parts while keeping the skills you already have.
Fine-tuning works the same way for computers. A computer program first learns a lot of general things by studying a huge amount of information (like learning to ride a bicycle). Then, when someone wants it to do a specific job (like riding a unicycle), it uses what it already knows and just practices the new parts. This way, it learns the new job much faster and with much less practice than if it started from nothing.