Supervised Fine-Tuning (SFT) is a critical training process in machine learning used to adapt a pre-trained large language model (LLM) for specific downstream tasks.[1][2][3] It is a form of transfer learning where the model's existing knowledge, acquired during an initial unsupervised pre-training phase, is refined through a secondary, supervised training stage.[1][4][5] This process adjusts the model's internal parameters (or weights) to improve its performance, accuracy, and alignment on specialized or domain-specific tasks.[6][7][8]
SFT is considered the most cost-effective step in the modern LLM development pipeline, often approximately 100 times less expensive than pre-training.[9]
The concept of fine-tuning in deep learning originated as a form of transfer learning, where pre-trained models are adapted to new tasks to leverage learned features and reduce training costs. Early applications were in computer vision, such as fine-tuning Convolutional Neural Networks (CNNs) like AlexNet or ResNet on datasets like ImageNet for specific image classification tasks.[10]
The breakthrough year for SFT in natural language processing (NLP) was 2018, which saw the release of several seminal models that established the pre-training and fine-tuning paradigm.
| Model | Organization | Key Innovation | Impact |
|---|---|---|---|
| ELMo | Allen AI | Contextual embeddings via bidirectional LSTMs | Handled polysemy through context |
| ULMFiT | fast.ai | First effective NLP fine-tuning framework | Reduced errors by 18-24% on text classification |
| GPT | OpenAI | Transformer-based transfer learning | Established transformers for transfer learning |
| BERT | Google | Bidirectional pre-training + fine-tuning | "Beginning of new era in NLP"[11] |
The term "Supervised Fine-Tuning" gained specific traction in the context of LLMs around 2022, particularly with OpenAI's InstructGPT paper, which formalized SFT as the initial stage in aligning models to follow human instructions.[12] This built on earlier works like GPT-3's few-shot learning but emphasized supervised adaptation using human-generated demonstrations. Subsequent models, such as LLaMA-2 (2023) and Mistral, incorporated SFT to enhance instruction-following and safety.[13] Parameter-efficient variants like Low-Rank Adaptation (LoRA) emerged in 2021 to make SFT more accessible for large models.[14]
Supervised Fine-Tuning serves as a critical bridge in the development of modern AI models, transforming general-purpose foundation models into specialized tools capable of performing specific, high-value tasks. Its principles are rooted in the strategic application of labeled data to refine a model's vast but unfocused knowledge base.
SFT is the process of taking a pre-trained model and further training it on a smaller, task-specific dataset composed of labeled examples.[2][3][4] The fundamental goal is to specialize the model's general capabilities for a narrow, well-defined task—such as sentiment analysis, medical diagnosis, or legal contract review—without erasing the foundational knowledge acquired during pre-training.[6][4] This is achieved by refining an already capable model, such as GPT-4, Gemini, or LLaMA, using a carefully curated dataset of input-output pairs. For example, a legal-tech company might fine-tune a model on thousands of court rulings to improve its understanding of legal terminology, or a customer service organization might use transcripts of support calls to align a model with its specific communication style and product knowledge.[2][6] This process effectively bridges the gap between a model's broad, generalized understanding of language and the specific nuances, jargon, and response patterns required for a particular application.[6][8]
The strategic value of SFT lies in this transition from a generalist to a specialist. The initial pre-training phase equips a model with a comprehensive understanding of language—grammar, syntax, semantics, and a vast repository of world knowledge.[2][4] However, this knowledge is latent and not inherently directed toward any specific user goal. SFT is the mechanism that activates this potential, directing the model's capabilities to produce useful, reliable, and contextually appropriate outputs for a defined purpose. Without this step, a foundation model remains a powerful but unspecialized artifact; with SFT, it becomes a functional, task-oriented tool.
The development of large-scale AI models typically follows a multi-stage pipeline, and SFT occupies a crucial position after the initial pre-training phase.
The term "supervised" in SFT refers directly to the use of a labeled training dataset to guide the fine-tuning process.[1][4] Unlike the unlabeled data used in pre-training, each data point in an SFT dataset consists of an input and a corresponding desired output, often referred to as the "ground truth" label.[4][19][20] For example, in a sentiment-analysis dataset, the input "The battery life is terrible" would be paired with the ground-truth label "negative".
During training, the model learns by attempting to map the inputs to the desired outputs. It makes a prediction for each input, and the difference between its prediction and the ground truth label is quantified by a loss function. The model then adjusts its internal parameters based on this explicit feedback, iteratively minimizing the prediction error.[7][4] This direct supervision is what allows for precise control over the model's behavior, aligning it with specific, task-oriented objectives.[1]
The SFT workflow is a structured, multi-step process that transforms a general-purpose pre-trained model into a specialized one. It involves careful selection of the base model, meticulous preparation of data, a systematic training phase, and rigorous evaluation.
The process begins with the selection of an appropriate pre-trained foundation model. Popular choices include models from the GPT family, Gemini, Claude, or open-source alternatives like LLaMA and Mistral.[2][21][6] The choice of model architecture is critical and should align with the intended downstream task. For instance, causal decoder-only models like GPT are well-suited for text generation tasks, while encoder-based models like BERT excel at text classification and understanding tasks.[21][20] The selected base model provides a robust starting point, having already learned general language syntax, semantics, and contextual understanding from its extensive pre-training.[2][4]
First, the downstream task must be clearly and narrowly defined (for example classify customer support tickets into "urgent" or "non-urgent," summarize legal depositions, generate Python code from natural language descriptions).[2]
Next, a high-quality, task-specific labeled dataset is created. The quality and relevance of this dataset are the most critical factors for successful SFT.[2][22][5] The dataset must consist of input-output pairs that serve as concrete examples of the desired model behavior.[4][19] This data is then typically split into three subsets: a training set used to update the model's weights, a validation set used to tune hyperparameters and monitor for overfitting during training, and a held-out test set used for the final, unbiased evaluation.
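Such a three-way split can be sketched in a few lines of plain Python (an illustrative sketch; the 80/10/10 ratio and the ticket-classification examples are assumptions, not a prescription):

```python
import random

def split_dataset(examples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle and partition labeled examples into train/validation/test sets."""
    rng = random.Random(seed)
    shuffled = examples[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # remainder becomes the held-out test set
    return train, val, test

# Hypothetical SFT examples: (input, label) pairs for ticket classification.
data = [(f"ticket {i}", "urgent" if i % 3 == 0 else "non-urgent") for i in range(100)]
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))  # 80 10 10
```

Shuffling before splitting keeps the three subsets statistically similar, which matters when the raw data is ordered (for example by date or category).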
The core of SFT is the training loop, where the model's parameters are iteratively adjusted.
The model is trained on the labeled dataset using supervised learning.[4] For each input example, the model generates a prediction. A loss function is used to calculate the discrepancy between the model's prediction and the ground truth label from the dataset. For most language modeling and classification tasks, the standard choice is the Cross-entropy loss function (also called negative log-likelihood), which measures the difference between two probability distributions.[5][25][26][27] The training objective is to minimize this loss across the entire training dataset.[4]
Mathematically, the token-level cross-entropy loss can be expressed as

L(θ) = −Σ_{t=1}^{T} log P_θ(y_t | y_{<t}, x)

where x is the input prompt, y_1, …, y_T are the T tokens of the ground-truth response, y_{<t} denotes the tokens preceding position t, and P_θ is the probability the model with parameters θ assigns to the correct next token.
This minimization is achieved through backpropagation. The calculated loss is propagated backward through the network to compute the gradient for each of the model's parameters (weights).[10][5] These gradients represent the direction and magnitude of the change needed for each weight to reduce the error. An optimizer algorithm—such as Adam or SGD—then uses these gradients, along with a specified learning rate, to update the model's weights.[4][5][28] This iterative process of forward pass (prediction), loss calculation, backward pass (backpropagation), and weight update is repeated for many epochs.
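The forward pass, loss calculation, backward pass, and weight update described above can be illustrated with a toy two-class classifier in plain Python (an illustrative sketch with hypothetical numbers, not an actual LLM training loop):

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def train_step(w, x, target, lr):
    # Forward pass: logits for 2 classes from a 2-feature input.
    logits = [sum(w[c][j] * x[j] for j in range(2)) for c in range(2)]
    probs = softmax(logits)
    loss = -math.log(probs[target])            # cross-entropy with the ground-truth label
    # Backward pass: d(loss)/d(logit_c) = probs[c] - 1 if c is the target else probs[c]
    for c in range(2):
        grad_logit = probs[c] - (1.0 if c == target else 0.0)
        for j in range(2):
            w[c][j] -= lr * grad_logit * x[j]  # SGD weight update
    return loss

w = [[0.0, 0.0], [0.0, 0.0]]                   # 2x2 weight matrix, initialized to zero
x, target = [1.0, 2.0], 1                      # one hypothetical labeled example
losses = [train_step(w, x, target, lr=0.1) for _ in range(20)]
print(round(losses[0], 3))                     # 0.693 (= ln 2, uniform initial prediction)
```

Repeating the step drives the loss toward zero on this single example, which is exactly why validation monitoring (below) is needed to catch overfitting.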
This mechanism of weight updates introduces a fundamental tension. The goal is to adjust the model's parameters to specialize it for a new task. However, if these adjustments are too aggressive (for example due to a high learning rate or excessive training), they risk destructively overwriting the complex, general knowledge encoded during pre-training, a phenomenon known as catastrophic forgetting.[29][30][31] Therefore, successful SFT is not merely about teaching new skills but also about carefully preserving existing ones.
Several key hyperparameters must be configured to control the training process effectively. These include the learning rate, batch size, number of training epochs, warmup steps, and weight decay.
Throughout the training process, the model's performance is periodically evaluated on the separate validation dataset.[3][23] This step is crucial for detecting overfitting, guiding hyperparameter adjustments, and deciding when to stop training (early stopping).
(See #Evaluation Metrics section for more details on specific metrics.)
Once the model achieves satisfactory performance on the validation and test sets and is deemed ready for real-world application, it is deployed. This could involve integrating it into a customer support chatbot, a content generation tool, a medical diagnosis system, or any other application where its specialized skills are needed.[2][4]
The success of supervised fine-tuning is overwhelmingly determined by the quality, relevance, and structure of the dataset used. The principle "garbage in, garbage out" applies with particular force; a powerful pre-trained model can easily be taught incorrect or undesirable behaviors with a poorly constructed dataset.
Across numerous studies and best-practice guides, there is a consensus that the quality of the SFT dataset is far more important than its sheer size.[23][22][24] A small, clean, diverse, and highly relevant dataset will produce a better-specialized model than a large, noisy, or unrepresentative one. A key paper, "LIMA: Less Is More for Alignment," famously demonstrated that a model fine-tuned on only 1,000 high-quality, curated examples could outperform models trained on much larger, but lower-quality, datasets.[32]
While the optimal number of examples varies by task and model size, significant performance improvements can be seen with as few as 50 to 100 well-crafted examples.[19] The minimum required is often around 10 examples.[19] For more complex tasks, the dataset may need to scale to thousands of examples.[23]
SFT datasets are typically stored in structured formats that are easy to parse, with JSON and JSON Lines (JSONL) being the most common.[19][15] The fundamental unit of data is a prompt-response pair or a turn in a simulated conversation.[15] For instruction-following tasks, a common schema is a JSON object with keys such as "instruction", "input" (optional), and "output".[33]
To prepare this structured data for the model, a templating step is often required. Templating engines like Jinja are used to format the prompt-response pairs into a single, consistent string. This process involves adding special tokens or markers to delineate between different parts of the interaction, such as the user's prompt and the assistant's response (for example using tags like <INST> and </INST>). This teaches the model the conversational structure it is expected to follow.[15]
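A minimal rendering step might look like the following, sketched here with plain string formatting rather than Jinja; the <INST> markers mirror the example in the text, but real chat templates vary by model family:

```python
# Format an Alpaca-style instruction record into a single training string.
# The <INST>/</INST> markers are illustrative, not a specific model's format.
TEMPLATE = "<INST>{instruction}\n{input}</INST>{output}"

def render(record):
    # The "input" field is optional in the Alpaca-style schema.
    return TEMPLATE.format(instruction=record["instruction"],
                           input=record.get("input", ""),
                           output=record["output"])

record = {
    "instruction": "Classify the sentiment of the review.",
    "input": "The battery life is terrible.",
    "output": "negative",
}
print(render(record))
```

Because the same template is applied to every example, the model reliably learns which span is the user's request and which span it is expected to produce.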
Manually creating thousands of high-quality, diverse instruction-following examples is a significant bottleneck in SFT, requiring immense time and domain expertise.[34][33] To address this, automated data generation strategies have been developed.
The most prominent of these is Self-Instruct, a framework that uses a powerful "teacher" LLM (like GPT-3.5 or GPT-4) to generate a large corpus of training data from a small seed set of human-written examples.[35][34][33] This method was famously used to generate the 52,000-example dataset for the Stanford Alpaca model, demonstrating that it was possible to create a capable instruction-following model without a massive human annotation effort.[35][36]
A rich ecosystem of public datasets enables research and application of SFT.
| Dataset | Size | Source | Key Features | License |
|---|---|---|---|---|
| Alpaca[36] | 52,000 | Generated by text-davinci-003 | Self-instruct methodology | Apache 2.0 |
| Dolly[37][38] | 15,000 | Human-written | 7 behavioral categories (by Databricks employees) | CC-BY-SA 3.0 |
| FLAN Collection | 1,800+ tasks | Task transformation | Multiple templates for existing NLP datasets | Mixed |
| LIMA[32] | 1,000 | Curated | "Quality over quantity" thesis | Research only |
| ShareGPT | ~70,000 | User conversations | Real-world multi-turn dialogue | Unclear |
| OpenAssistant Conversations (OASST1)[39][40] | 161,000 msgs | Crowdsourced | 35 languages, human-rated conversation trees | Apache 2.0 |
Supervised fine-tuning is not a monolithic technique; several variants exist, each offering different trade-offs between performance, computational cost, and risk of catastrophic forgetting.
Full fine-tuning is the most straightforward and traditional approach. In this method, all the parameters (weights) of the pre-trained model are updated during the training process on the new, task-specific dataset.[3][29][28]
A more resource-efficient alternative to full fine-tuning is to selectively update only a subset of the model's layers while keeping the others "frozen" (i.e., their weights are not changed during backpropagation).[10][30][28]
The underlying principle is that different layers in a deep neural network capture features at different levels of abstraction. Early layers (closer to the input) tend to learn general, low-level features (for example basic grammar, word relationships), while later layers (closer to the output) learn more high-level, task-specific features.[10][28] By freezing the initial layers, practitioners can preserve the model's foundational knowledge while training only the later layers to adapt to the new task. This approach reduces both computational costs and the risk of catastrophic forgetting.[10][30][8]
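The freezing strategy can be sketched abstractly as follows (layers and gradients here are just nested Python lists; in a real framework the same effect is achieved by disabling gradient tracking on the frozen parameters):

```python
# Minimal sketch of selective layer freezing: weights are grouped by layer,
# and gradient updates are applied only to layers past a chosen cutoff.
def update_weights(layers, grads, lr, freeze_up_to):
    for i, (layer, grad) in enumerate(zip(layers, grads)):
        if i < freeze_up_to:
            continue                      # frozen: preserve pre-trained knowledge
        for j in range(len(layer)):
            layer[j] -= lr * grad[j]      # trainable: adapt to the new task

layers = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]   # 3 layers of pre-trained weights
grads  = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
update_weights(layers, grads, lr=0.1, freeze_up_to=2)
print(layers)  # first two layers untouched, last layer updated to 0.95
```

Only the final layer moves, which is the mechanism that limits both compute cost and catastrophic forgetting.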
While all instruction tuning is a form of SFT, the term instruction tuning has come to refer to a specific application of SFT with a distinct goal. Instead of fine-tuning a model for a single, narrow downstream task (like sentiment classification), instruction tuning aims to improve a model's general ability to follow natural language instructions across a wide variety of tasks.[42][16]
The key differentiator is the nature and diversity of the training data. An instruction-tuned model is fine-tuned on a large mixture of datasets, where each example is formatted as an instruction (for example "Summarize the following article," "Translate this sentence to French," "Answer this question based on the context").[43][16][44] This process helps to bridge the gap between the model's original pre-training objective (next-word prediction) and the user's goal of having the model act as a helpful, instruction-following assistant.[42][16]
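A toy instruction-tuning mixture illustrates this heterogeneity (the records are hypothetical, but each follows the instruction/input/output schema described earlier):

```python
# Heterogeneous tasks cast into a single instruction/output schema.
mixture = [
    {"instruction": "Summarize the following article.",
     "input": "The council voted to extend the park's opening hours...",
     "output": "The council extended park hours."},
    {"instruction": "Translate this sentence to French.",
     "input": "Good morning.",
     "output": "Bonjour."},
    {"instruction": "Answer this question based on the context.",
     "input": "Context: SFT uses labeled pairs. Q: What data does SFT use?",
     "output": "Labeled input-output pairs."},
]

# Despite covering different tasks, every example shares one schema; a single
# fine-tuning run over such a mixture teaches general instruction following.
tasks = {m["instruction"].split()[0] for m in mixture}
print(sorted(tasks))  # ['Answer', 'Summarize', 'Translate']
```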
SFT is one of several powerful techniques for adapting pre-trained models. Understanding its relationship with other methods like RLHF, PEFT, and RAG is crucial for selecting the right approach for a given application.
* LoRA (Low-Rank Adaptation): This popular PEFT method injects small, trainable low-rank matrices (adapters) into the layers of the pre-trained model. Only these adapters are trained, significantly reducing the number of trainable parameters and memory usage.[14][47][8]
* QLoRA: An even more efficient version of LoRA that quantizes the frozen, pre-trained model to 4-bit precision, further reducing memory requirements and making it possible to fine-tune very large models on consumer-grade hardware.[47][48][49]
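The arithmetic behind LoRA can be shown numerically (a sketch of the math only; the matrix sizes and alpha/r scaling are illustrative, and this is not the API of any particular library):

```python
# LoRA idea: the frozen weight matrix W is augmented with a trainable
# low-rank product B @ A, so the effective weight is W + (alpha / r) * B @ A.
# Only A and B are trained; W never changes.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

d, r, alpha = 4, 1, 2                      # hidden size 4, rank-1 adapter
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
B = [[0.1] for _ in range(d)]              # d x r, trainable
A = [[0.2, 0.0, 0.0, 0.0]]                 # r x d, trainable

delta = matmul(B, A)                       # d x d low-rank update
W_eff = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(d)]
         for i in range(d)]

# 2*d*r adapter parameters stand in for updating all d*d entries of W.
print(2 * d * r, d * d)  # 8 16
```

Even in this tiny example the adapter has half as many parameters as the full matrix; at LLM scale (d in the thousands, r of 8-64) the savings exceed 99%.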
| Aspect | Supervised Fine-Tuning (SFT) | Parameter-Efficient Fine-Tuning (PEFT) | Reinforcement Learning from Human Feedback (RLHF) | Retrieval-Augmented Generation (RAG) | Unsupervised Fine-Tuning |
|---|---|---|---|---|---|
| Core Goal | Improve task-specific accuracy and correctness. | Efficiently improve task-specific accuracy with minimal computation. | Align model behavior with subjective human preferences. | Provide model with external, up-to-date knowledge. | Adapt model to a new domain's language style. |
| Method | Supervised learning on labeled input-output pairs. | Supervised learning, but only on a small subset of parameters (for example LoRA adapters). | Reinforcement learning guided by a reward model trained on human preference data. | At inference time, retrieve relevant context from a database and add it to the prompt. | Self-supervised learning on unlabeled domain data. |
| Model Weights | All (or a large subset of) weights are modified. | >99% of weights are frozen; only a small fraction are modified. | All (or a subset of) weights are modified. | No weights are modified. | All (or a subset of) weights are modified. |
| Data Requirement | High-quality labeled dataset (hundreds to thousands of examples). | Smaller high-quality labeled dataset. | Human preference data (for example rankings of model outputs). | A corpus of documents for the external knowledge base. | Large corpus of unlabeled text from the target domain. |
| Computational Cost | High to very high. | Very low. | Very high (requires training multiple models). | Low (cost is primarily at inference time). | High (but less than pre-training). |
| Key Advantage | Can achieve high performance on specific tasks. | Dramatically reduces cost and memory; enables easy task-switching. | Excels at capturing nuanced, subjective qualities. | Reduces hallucinations; knowledge is easily updatable. | Improves performance in a new domain without labels. |
| Key Limitation | Costly, risk of catastrophic forgetting and overfitting. | May have slightly lower performance than full SFT on complex tasks. | Complex, resource-intensive, and relies on subjective human feedback. | Does not teach the model new skills or behaviors, only provides knowledge. | Does not teach specific tasks or instruction following. |
| Ideal Use Case | Adapting a model for a well-defined task with objective answers (for example classification, summarization). | Cost-effective adaptation for multiple tasks or in resource-constrained environments. | Creating conversational agents that are helpful, harmless, and aligned with human values. | Building Q&A systems over private documents or rapidly changing information. | Adapting a general model to a specialized field (for example legal or medical text). |
SFT has been instrumental in transitioning LLMs from research curiosities to practical tools deployed across numerous industries. Its ability to specialize foundation models has unlocked a wide range of applications and has been a core component in the development of some of the most influential AI models.
SFT is widely used to adapt language models to domain-specific tasks. Common tasks that benefit from SFT include text classification (for example sentiment analysis or support-ticket triage), summarization, question answering, code generation, and domain-specific assistants in fields such as law, medicine, and customer service.
SFT has been a cornerstone in the development of many state-of-the-art language models.
A rich ecosystem of open-source libraries and cloud platforms exists to facilitate SFT.
The Hugging Face ecosystem is the de facto standard for open-source SFT.
Major cloud providers offer managed services for SFT.
| Platform | Models Supported | Key Features |
|---|---|---|
| Azure AI Foundry | GPT-4, LLaMA, Mistral, Gemini | Serverless fine-tuning, no GPU quotas |
| AWS SageMaker | FLAN-T5, open models | Jumpstart templates, Trainium support |
| Google Vertex AI | Gemini family, Gemma | LoRA fine-tuning, Cloud Storage integration |
| OpenAI API | GPT-3.5, GPT-4 | JSON Lines format, job tracking |
Evaluating the performance of a fine-tuned model is a critical step, using both task-specific metrics and general benchmarks.
These benchmarks evaluate a model's general instruction-following and reasoning abilities.
| Benchmark | Description | Metric | Coverage |
|---|---|---|---|
| MMLU | Massive Multitask Language Understanding | Accuracy % | 57 subjects (math, history, law, etc.) |
| BBH | Big-Bench Hard | Average accuracy | 23 challenging reasoning tasks |
| AlpacaEval | Instruction following | Win rate vs baseline | 805 prompts, auto-evaluated by GPT-4 |
| MT-Bench | Multi-turn conversations | GPT-4 rating (1-10) | 8 categories (writing, reasoning, math) |
Supervised fine-tuning is a widely adopted technique due to its compelling advantages in performance and efficiency. However, it also presents significant challenges that require careful management.
| Do's | Don'ts |
|---|---|
| Use a high-quality, clean, and representative labeled dataset.[30][58] | Ignore the importance of data quality; more data does not always equal better results.[30] |
| Start with a pre-trained model that is relevant to your task.[30] | Assume any pre-trained model is a suitable starting point. |
| Systematically optimize hyperparameters like learning rate and batch size.[30] | Use a high learning rate without careful testing, which can lead to catastrophic forgetting.[30] |
| Regularly validate the model's performance on a separate validation set.[30] | Overfit the model by training for too long without monitoring validation performance.[30] |
| Leverage data augmentation for small or imbalanced datasets.[30] | Assume the pre-trained model is free of biases.[58] |
| Monitor for and actively mitigate signs of catastrophic forgetting and bias amplification.[30][58] | Freeze all layers indiscriminately without considering the task's nature.[30] |
| Parameter | Full Fine-Tuning | LoRA / PEFT | Notes |
|---|---|---|---|
| Learning Rate | 2e-5 | 1e-4 to 3e-4 | Lower for larger batches or more complex tasks. |
| Batch Size | 8-16 | 4-8 | Use gradient accumulation if VRAM is limited. |
| Epochs | 3-5 (large data) | 15-20 (small data) | Monitor validation loss closely and use early stopping. |
| Warmup Steps | 3-7% of total steps | Optional | Can help stabilize training in the beginning. |
| Weight Decay | 0.01-0.05 | 0.01 | Standard L2 regularization to prevent overfitting. |
The field of supervised fine-tuning is rapidly evolving as researchers work to address its fundamental limitations and improve its efficiency and robustness.
Catastrophic forgetting occurs because updating model weights to learn a new task can overwrite the parameters that encode previously learned knowledge.[62][31] This is especially problematic in full fine-tuning.[50] Strategies to mitigate this include using lower learning rates, freezing early layers, employing parameter-efficient methods such as LoRA (which leave the pre-trained weights untouched), and mixing a small amount of general-domain data into the fine-tuning set (rehearsal).
The dominant paradigm for aligning advanced models has been a sequential, two-stage process: first, use SFT for task competence, then use RLHF for preference alignment. While effective, this pipeline is complex, inefficient, and involves optimizing for different objectives at each stage, which can lead to trade-offs and knowledge loss between steps.[64][65]
A major frontier of current research is the development of unified frameworks that integrate SFT and RLHF into a single, cohesive training process. Emerging approaches include: