# Fine Tuning

> Source: https://aiwiki.ai/wiki/fine_tuning
> Updated: 2026-07-29
> Categories: Deep Learning, Machine Learning, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

Fine-tuning is the process of adapting a pretrained machine-learning model by continuing to optimize some or all of its parameters on data selected for a target task, domain, behavior, or population. It is one form of [transfer learning](/wiki/transfer_learning): knowledge acquired during source training is reused instead of learning the target problem entirely from a random initialization. The target data may contain labels, demonstrations, preferences, or another training signal. The exact meaning of fine-tuning therefore depends on which parameters remain trainable and which objective is used.[1][24][25]

In contemporary practice, the term covers several distinct procedures. Full fine-tuning updates all model parameters. Partial fine-tuning updates selected layers. [Parameter-efficient fine-tuning](/wiki/peft) keeps most or all pretrained weights frozen and learns a comparatively small set of new or selected parameters. Supervised fine-tuning optimizes against target outputs, while preference-based post-training uses comparisons or rewards to alter behavior. These procedures share an initialization from a pretrained checkpoint, but they do not have identical data requirements, resource costs, or failure modes.[6][11][16][24]

Fine-tuning is used in [computer vision](/wiki/computer_vision), [natural language processing](/wiki/natural_language_processing), speech, recommendation, multimodal systems, and generative models. It can reduce the labeled data and training time needed for a target application, but it does not guarantee an improvement. Results depend on the relationship between source and target data, the quality of the target examples, the optimization setup, and the evaluation design. A model can overfit, lose useful source capabilities, acquire unsafe behavior, or become unstable across random seeds.[1][2][21][34][36]

## Scope and terminology

Fine-tuning begins with parameters learned during an earlier training stage. The earlier model is usually called a pretrained model or base model. The new optimization stage may be narrow, such as learning a classifier for one label set, or broad, such as training a general-purpose language model to follow instructions. The defining feature is a parameter update based on a target objective. Merely placing examples in a prompt, retrieving documents at inference time, or changing decoding settings does not fine-tune the model because those operations do not update its learned parameters.[28][29]

Transfer-learning terminology separates a domain, which includes an input space and its data distribution, from a task, which includes an output space and a predictive function. Source and target settings can differ in their domains, tasks, or both. Fine-tuning is most likely to help when the source representation contains features useful for the target. It can still help across substantial domain shifts, but transfer is an empirical question rather than an automatic property of pretraining.[1][2]

Several neighboring terms are often used imprecisely:

- **Pretraining** learns broadly reusable representations, commonly from a large corpus and a self-supervised objective.
- **Continued pretraining** applies a pretraining-style objective to additional unlabeled data. Domain-adaptive pretraining uses data from the target domain, while task-adaptive pretraining uses unlabeled data associated with the target task.[9]
- **Supervised fine-tuning** trains on input-output pairs with a supervised objective. For generative language models, these pairs are often instructions and desired responses.
- **Instruction tuning** is supervised fine-tuning across tasks represented through natural-language instructions. It is intended to improve the ability to respond to instructions, including tasks not seen in exactly the same form during tuning.[22][23]
- **Preference optimization** uses comparisons, ratings, or reward signals to favor some outputs over others. It belongs to the broader family of post-training methods. It may follow supervised fine-tuning, but it is not interchangeable with supervised fine-tuning.[24][25]
- **Feature extraction** freezes the pretrained network and trains a separate predictor on its activations. This is transfer learning, but the pretrained parameters themselves are not fine-tuned.[10]
- **Linear probing** is a restricted form of feature extraction in which the trained predictor is linear. It is also used as a diagnostic for the information already present in a representation.
- **Prompting** supplies instructions or examples in the input context without gradient-based parameter updates. [In-context learning](/wiki/in_context_learning), including the few-shot procedure studied with GPT-3, is therefore different from fine-tuning.[28]
- **Retrieval-augmented generation** supplies information from an external index or corpus at inference time. It changes the evidence available to the generator rather than necessarily changing the generator's weights.[29]

The boundaries are operational, not merely linguistic. A prompt-tuned model has learned prompt vectors and is a fine-tuned artifact even if the base network is frozen. A model using only a hand-written prompt is not. A system can also combine methods, such as domain-adaptive pretraining, supervised fine-tuning, preference optimization, retrieval, and inference-time prompting.

## How fine-tuning works

Let a pretrained model have parameters `theta_0`, and let a target dataset contain examples `D = {(x_i, y_i)}`. In supervised fine-tuning, training usually seeks parameters that minimize an empirical loss:

```
theta* = argmin over theta in Theta_T:
         (1/n) sum_i L(f_theta(x_i), y_i) + lambda R(theta, theta_0)
```

Here, `Theta_T` specifies which parameters may change. It contains all model parameters for full fine-tuning, a subset for partial fine-tuning, or only added parameters for many parameter-efficient methods. The optional regularizer `R` can discourage large departures from the pretrained state or otherwise constrain the solution. The objective may be token-level cross-entropy for generation, classification loss for a classifier, contrastive loss for representation learning, or a preference-derived objective.

Training proceeds through [backpropagation](/wiki/backpropagation) and a gradient-based optimizer. At each step, a batch is passed through the model, the objective is evaluated, gradients are computed for trainable parameters, and the optimizer applies an update. Frozen parameters participate in the forward computation but do not receive optimizer updates. This distinction matters for memory: freezing weights removes their optimizer state and weight gradients, but intermediate activations may still be needed to propagate gradients into trainable components located earlier in or around the network.

The initial checkpoint influences both the optimization path and the attainable result. Fine-tuning does not simply add a detachable database of target facts. It changes the function represented by the trainable parameters. The same target examples can produce different models when the base checkpoint, example order, random seed, optimizer, or learning-rate schedule changes. For transformer fine-tuning, instability across seeds has been documented especially in small-data settings.[34][35]

The appropriate [learning rate](/wiki/learning_rate) is commonly lower than a rate used to train the same architecture from scratch, but no universal ratio is valid across models and methods. Full fine-tuning, adapters, prompt parameters, and low-rank matrices can require different schedules. Some methods deliberately use different rates for different parameter groups. LoRA+, for example, derives a distinction between the learning rates for the two low-rank factors and reports faster convergence in the paper's experiments.[20]

## Development

Fine-tuning became prominent as learned representations replaced hand-designed feature pipelines. In vision, large labeled datasets and deep convolutional networks produced checkpoints whose intermediate features could be reused. AlexNet's 2012 ImageNet result demonstrated large-scale supervised convolutional training.[3] Later experiments by Yosinski and colleagues directly examined how neural-network features transfer between source and target tasks. They found that lower layers were generally more reusable, later layers were more specialized, and transferred initialization could improve target performance even after the transferred layers were fine-tuned.[2]

The feature hierarchy suggested two practical strategies. A practitioner could freeze the backbone and train a new task-specific head, or initialize from the checkpoint and continue training some or all layers. Both remain standard patterns in vision libraries. The official PyTorch transfer-learning tutorial presents them as separate scenarios: fine-tuning the whole network and using the network as a fixed feature extractor.[10]

Early neural transfer in language processing often relied on pretrained word embeddings or contextual representations as features. ELMo generated contextual word representations with a bidirectional language model and added them to task-specific architectures, improving results across six reported NLP tasks without defining the now-common transformer fine-tuning recipe.[4] A different path adapted a pretrained language model itself. ULMFiT introduced a general language-model fine-tuning procedure with discriminative learning rates, a slanted triangular schedule, and gradual unfreezing, and reported substantial error reductions on the text-classification benchmarks studied in the paper.[7]

The 2018 GPT work used generative language-model pretraining followed by discriminative fine-tuning for individual tasks.[5] BERT then showed that a bidirectional transformer pretrained with masked-language and sentence-level objectives could be adapted to multiple language-understanding tasks by adding a small output layer and fine-tuning the model with minimal task-specific architectural changes.[6] T5 later framed a wide range of language tasks in a common text-to-text format, making the pretraining and fine-tuning interface more uniform.[8]

As base models grew, storing and training a separate full parameter copy for every task became increasingly expensive. Adapter modules offered an early transformer-specific answer: insert small trainable bottleneck modules while keeping the original network fixed. In the experiments reported by Houlsby and colleagues, adapters added 3.6 percent trainable parameters per task and achieved performance close to full fine-tuning on GLUE.[11] Subsequent methods learned continuous prefixes, soft prompts, selected bias parameters, multiplicative activation vectors, or low-rank weight updates.[12][13][14][15][16]

The growth of general-purpose [large language models](/wiki/large_language_model) also expanded the objective of fine-tuning. Adaptation was no longer limited to a single classifier or extraction task. Instruction tuning sought broad task following, while human-feedback pipelines targeted helpfulness and other preferences. FLAN fine-tuned a 137-billion-parameter model on instructions for more than 60 tasks and reported improved zero-shot performance on held-out tasks.[22] Later work scaled instruction tuning to substantially more tasks and model sizes.[23] InstructGPT combined supervised demonstrations, a learned reward model, and reinforcement learning from human feedback in a staged alignment pipeline.[24]

## Main approaches

### Full fine-tuning

Full fine-tuning makes every parameter trainable. It provides the broadest direct update capacity among the approaches described here, which can be valuable when the target distribution or objective differs meaningfully from pretraining. It computes gradients for all model parameters and, when the optimizer is stateful, maintains optimizer state for all of them. Deployments commonly store a separate full checkpoint, although other artifact encodings are possible. The resulting storage and training requirements depend on model size, numerical precision, optimizer, batch construction, sequence or image dimensions, distributed strategy, activation-saving technique, and saved artifact format. Fixed claims that a given model size always requires a particular GPU count, memory capacity, or checkpoint size are therefore misleading.

Full fine-tuning is not automatically the most accurate option. Greater flexibility can increase the risk of fitting noise, spurious correlations, or a small validation set. It can also change capabilities that are not represented in the target data. A fair comparison with a parameter-efficient method must control the base model, data, optimization budget, model selection procedure, and evaluation suite. Research comparing LoRA and full fine-tuning has found a tradeoff in some settings: full fine-tuning learned more of the new target distribution, while LoRA retained more performance outside that distribution.[21] That finding is evidence about the studied models and tasks, not a universal ranking.

Full fine-tuning is most defensible when the target objective requires broad changes, sufficient representative data are available, and the deployment can support a separate full checkpoint. It is also useful as a reference condition in experiments. If a smaller method is chosen only because full fine-tuning was not tried, claims of equivalent quality should be avoided.

### Partial and layer-wise fine-tuning

Partial fine-tuning updates a selected subset of existing layers. Common variants train only a task head, the final block, the normalization parameters, or progressively larger portions of the network. The selection can reflect the architecture and the expected location of transferable features. In convolutional vision networks, early visual filters often transfer more readily than later task-specific features, although the exact pattern depends on source and target tasks.[2]

Gradual unfreezing starts with a small trainable portion and unlocks additional layers over time. ULMFiT used this approach to reduce destructive updates during language-model adaptation.[7] Layer-wise learning rates pursue a related goal by assigning smaller updates to lower or earlier layers and larger updates to task-specific layers. These are optimization choices, not guarantees. They should be evaluated against a fixed-backbone baseline and a full-fine-tuning baseline when resources permit.

Training only normalization statistics or biases can be surprisingly effective in some regimes, but the effect depends on architecture and task. BitFit updates only bias terms and reported competitive results on several small and medium BERT benchmarks.[14] It should be described as a specific sparse-update method, not as proof that weights are generally unnecessary.

### Adapters

Adapters insert small trainable modules into a frozen network. A common transformer adapter projects hidden states into a lower-dimensional bottleneck, applies a nonlinearity, projects back to the original dimension, and combines the result with the residual stream. The base weights remain unchanged, so one base checkpoint can be shared by multiple task-specific adapter sets.[11]

Adapters separate task storage from the base model and can simplify multi-task serving. They still add computation to the forward pass unless their transformation can be folded into another operation. Placement, bottleneck width, initialization, and which normalization or head parameters remain trainable all affect performance. "Adapter" is therefore a family label rather than a single configuration.

### Prefix and prompt tuning

Prefix-tuning learns continuous vectors that act like a task-specific prefix while keeping the language model frozen. In the original work, the learned prefix influenced every transformer layer through attention and used about 0.1 percent of the model parameters in the reported GPT-2 and BART configurations.[12] The percentage is specific to those configurations and is not a general constant.

Prompt tuning learns soft prompt embeddings that are prepended to the model input. Lester and colleagues found that its competitiveness with full model tuning improved as model scale increased in their experiments.[13] Unlike a hand-written prompt, the learned prompt is optimized by gradients and is therefore a fine-tuned parameter set. Its small storage footprint can be attractive when many tasks share one base model, but prompt length consumes part of the model's input processing and may be sensitive to initialization and model scale.

IA3 learns vectors that rescale selected internal activations while leaving the main weight matrices frozen. It was introduced as part of T-Few, a few-shot adaptation method, and is even more parameter-sparse than many low-rank configurations.[15] As with other parameter-efficient methods, a small trainable parameter count should not be confused with a proportional reduction in every resource. Forward activations, input processing, and the frozen base model still consume compute and memory.

### LoRA

[LoRA](/wiki/lora), or low-rank adaptation, represents a learned update to a weight matrix through two smaller matrices. For a pretrained matrix `W_0`, a common form is:

```
W = W_0 + delta_W
delta_W = s B A
```

where `A` and `B` have an inner dimension `r` smaller than the dimensions of `W_0`, and `s` is a scaling factor. During training, `W_0` is frozen while `A` and `B` are optimized. The original LoRA paper motivated the method with evidence that task-specific weight changes can be represented effectively in a low-rank subspace, and reported large reductions in trainable parameters and optimizer memory for the studied transformer models.[16]

Rank is a capacity choice, not a direct measure of model quality. Target modules determine where low-rank updates are applied, such as attention projections, feed-forward projections, or both. Scaling, dropout, initialization, bias handling, and the set of additionally trainable modules also matter. A reproducible LoRA result must report these choices rather than only saying that LoRA was used. The Hugging Face PEFT documentation describes common configuration fields and the option to merge learned LoRA weights into the base model for inference.[17]

LoRA can reduce the trainable state and make one frozen base checkpoint reusable across adapters. It does not eliminate the need to load and execute the base model. Training memory also depends on activations, sequence length, batch size, optimizer precision, attention implementation, and checkpointing. Claims such as "LoRA uses one-third of the memory" or "LoRA preserves full accuracy" require a specified experiment.

### Quantized low-rank adaptation

[QLoRA](/wiki/qlora) combines a frozen quantized base model with trainable LoRA adapters. Gradients are propagated through the quantized computation into the adapter parameters, but the quantized base weights remain frozen. The QLoRA paper introduced a 4-bit NormalFloat data type, double quantization of quantization constants, and paged optimizers. It demonstrated fine-tuning a 65-billion-parameter model on a single 48 GB GPU in the authors' setup.[18] That demonstration depends on the paper's architecture, sequence, batching, and software choices and should not be turned into a universal hardware requirement.

Quantization introduces its own tradeoffs. The base representation is approximate, kernels must support the selected format, and adapter training does not repair every quantization error. Evaluation should compare the deployed quantized composition, not only an unquantized development model. The base checkpoint, quantization configuration, adapter checkpoint, tokenizer, and inference library version are all part of the deployable artifact.

### LoRA variants

DoRA decomposes a pretrained weight into magnitude and direction, then applies a LoRA-style update to the directional component. Its authors reported improved learning capacity and stability relative to LoRA in the evaluated vision and language tasks, while retaining the ability to avoid added inference overhead after weight composition.[19]

LoRA+ focuses on optimization rather than the parameterization alone. It assigns different learning rates to the two low-rank factors and reports up to roughly twofold training speed improvements and modest performance gains in the paper's experiments.[20] Neither result makes the variant universally preferable. Comparisons should use the same rank, target modules, data, number of examples processed, and model-selection budget.

The growing number of variants makes method names insufficient as experiment descriptions. The exact trainable parameter set, update equations, rank or bottleneck dimensions, precision, quantization, initialization, optimizer groups, and merge procedure should be recorded.

### Comparison of approaches

| Approach | Updated state | Main benefit | Main limitation |
| --- | --- | --- | --- |
| Fixed feature extractor | New prediction head only | Small trainable state and reusable backbone | Cannot alter the pretrained representation |
| Full fine-tuning | All pretrained parameters, usually plus a head | Broadest direct update capacity among listed methods | Gradients for all weights; commonly a separate full checkpoint |
| Partial fine-tuning | Selected existing parameters | Middle ground between frozen and full training | Layer selection is architecture- and task-dependent |
| Adapters | Inserted task modules, sometimes selected existing parameters | Modular task storage | Can add inference operations and configuration complexity |
| Soft prompt or prefix | Learned continuous prompt state | Very small task-specific state | Performance can depend strongly on scale, prompt length, and initialization |
| BitFit or IA3 | Biases or activation-scaling vectors | Extremely sparse trainable state | Restricted update capacity |
| LoRA | Low-rank updates to selected matrices | Small adapter checkpoints and reduced optimizer state | Rank and target-module choices can limit adaptation |
| QLoRA | LoRA adapters over a frozen quantized base | Enables lower-memory adapter training | Adds quantization and runtime compatibility constraints |

This table describes parameterization, not a performance ordering. No method is best across all models, data regimes, target shifts, and deployment constraints.[11][12][14][15][16][18][21]

## Fine-tuning language models

### Supervised fine-tuning

[Supervised fine-tuning](/wiki/supervised_fine-tuning) of a causal language model usually trains it to predict response tokens conditioned on an input or conversational context. Datasets may contain instructions and responses, task demonstrations, dialogues, tool-call traces, or structured outputs. Implementations often mask the loss on tokens that belong only to the input so that optimization focuses on the desired response, but this is a data-format choice rather than part of the definition.

The examples determine what behavior receives direct support. A dataset that contains only successful final answers may not teach a model to expose intermediate state, ask for missing information, or recover from tool errors. Conversely, a collection with long explanations can reward verbosity even when short answers are preferable. Formatting artifacts, repeated templates, accidental label leakage, and contradictory system instructions can become learnable correlations.

Data volume alone is not a sufficient quality measure. LIMA fine-tuned a 65-billion-parameter LLaMA model on 1,000 curated prompt-response pairs and reported strong behavior in the authors' evaluations.[26] The result is evidence that a carefully selected small dataset can be effective for a capable base model. It does not establish 1,000 examples as a general requirement or show that more representative data cannot help. Work on selecting long responses similarly produced improvements in specific alignment experiments, while also illustrating that a simple proxy for quality can favor a particular output style.[27]

### Instruction tuning

[Instruction tuning](/wiki/instruction_tuning) mixes tasks expressed as natural-language instructions. The training mixture may vary the wording of instructions, task formats, and demonstration counts. FLAN's original experiments showed that tuning on many instruction-formatted tasks improved zero-shot generalization to held-out tasks for the studied model.[22] Scaling work later used 1,836 tasks and reported gains across model sizes and evaluation categories.[23]

Instruction tuning does not make a model reliable on every possible instruction. Generalization depends on the base model, task mixture, instruction diversity, contamination controls, and evaluation prompts. A held-out task can still resemble training tasks in latent structure or source. Evaluation should document how overlap was checked and should not infer broad instruction-following ability from one benchmark family.

### Preference-based post-training

Preference-based [post-training](/wiki/post-training) attempts to make one output more likely than alternatives according to human or synthetic comparisons. InstructGPT used a three-stage process: supervised fine-tuning on demonstrations, reward-model training on ranked outputs, and reinforcement learning against the learned reward while constraining departure from the supervised model.[24] This family is often described as [reinforcement learning from human feedback](/wiki/rlhf).

Direct Preference Optimization derives a classification-style objective that trains a policy directly from preferred and rejected responses relative to a reference policy. It avoids fitting an explicit reward model and avoids on-policy sampling during the optimization procedure described in the original paper.[25] DPO still depends on preference data, a reference choice, objective hyperparameters, and the coverage of the comparisons. It should not be summarized as "RLHF without a reward" in a way that hides those dependencies.

Preference optimization can trade one behavior for another. Raters may disagree, preferences may encode cultural or stylistic assumptions, and generated preference data can inherit errors from the generating system. Reported preference wins must identify the evaluator, prompt distribution, comparison protocol, and uncertainty. Improvements on a preference metric do not by themselves establish factual accuracy, safety, or robustness.

### Continued pretraining and domain adaptation

When the main gap is terminology, style, or unlabeled domain distribution, continued pretraining may be more appropriate than instruction-response training. Domain-adaptive pretraining continues the language-model objective on domain text. Task-adaptive pretraining continues it on unlabeled text associated with the target task. Gururangan and colleagues found that both stages improved performance across several domains and tasks in their experiments.[9]

Continued pretraining and supervised fine-tuning can be sequenced. A model may first learn domain statistics from unlabeled text and then learn outputs from labeled examples. The stages should be evaluated separately because continued pretraining can also shift capabilities or increase exposure to low-quality and sensitive data. Calling both stages "fine-tuning" without identifying the objective obscures what information the model received.

## Other modalities

In vision, a common design replaces the classification head of a pretrained [convolutional neural network](/wiki/convolutional_neural_network) or vision transformer and then trains the head, selected blocks, or the whole network. Source-target similarity, image resolution, augmentation, label quality, and class balance affect the outcome. ImageNet pretraining is historically important, but it is not the only source of transferable visual features, and performance should be compared with a randomly initialized or frozen-feature baseline where feasible.[2][3][10]

Visual Prompt Tuning adapts a frozen vision transformer by learning a small set of input-space prompt parameters. The original paper reported that fewer than 1 percent task-specific parameters could outperform full fine-tuning on many of the evaluated transfer tasks.[32] Those results concern the tested architectures and benchmarks, not every vision model.

For text-to-image [diffusion models](/wiki/diffusion_model), fine-tuning can specialize generation toward a subject, style, or domain. DreamBooth fine-tunes a pretrained text-to-image model to associate a rare identifier with a subject, using three to five subject images in the paper's setup and a class-specific prior-preservation objective.[30] Textual Inversion instead keeps the generative model frozen and learns an embedding for a new pseudo-word from a small set of images.[31] The first changes model weights; the second learns a compact conditioning parameter. Both can be called adaptation, but their storage, editability, and risks differ.

Speech and multimodal systems use analogous patterns: reuse a pretrained encoder, decoder, or joint representation and optimize against a target loss. Modality-specific preprocessing remains part of the model interface. A fine-tuned checkpoint is not reliably reusable if its feature extractor, sampling rate, image normalization, special tokens, or conversation template is missing.

## Data design

### Define the target distribution

A dataset should represent the inputs, outputs, users, languages, and edge cases expected after deployment. Randomly collecting examples from an available source is not enough if that source differs from the operational setting. A medical abbreviation task, for example, can vary by specialty, institution, country, and document type. The base model's broad capability does not remove the need to define the target.

The intended behavior should be written as testable requirements before training. For classification, this includes the label ontology and treatment of ambiguous cases. For generation, it includes factuality, refusal behavior, format constraints, citation behavior, tone, and what the model should do when information is missing. If two annotators cannot consistently apply the specification, the model will receive an inconsistent signal.

### Data provenance and rights

Every training example should have a recorded source, collection date, transformation history, and permission basis. Personal data, copyrighted material, confidential records, and restricted model outputs require specific review. Removing names alone may not eliminate identifying information. Deduplication should operate at appropriate levels, including exact records, near-duplicate text, templated variants, and repeated conversations.

Data provenance also supports later corrections. If a source is withdrawn or a label policy changes, the affected examples must be discoverable. A single flat file without stable example identifiers makes targeted remediation and audit difficult.

### Splits and leakage

Training, validation, and test sets should be separated before iterative model selection. The unit of separation must reflect the way information repeats. Splitting individual messages can leak nearly identical conversations across sets; splitting image crops can leak the same source image; splitting documents can leak passages from the same template or entity. Grouped or time-based splits are often more realistic.

Benchmark contamination is a separate concern for pretrained models. A public test set may already have appeared in pretraining data even if it is absent from the fine-tuning set. Where exact pretraining data are unknown, evaluation should distinguish confirmed fine-tuning leakage from possible base-model contamination and use newly collected or private tests when appropriate.

### Cleaning and balancing

Cleaning should preserve difficult but valid examples while removing corrupt, duplicate, or out-of-scope records. Class rebalancing and sampling weights change the effective training distribution and should be recorded. For generative data, repeated answer templates can dominate token-level loss even when the examples appear diverse at the prompt level.

Synthetic examples require the same provenance and evaluation discipline as human-created data. They can expand coverage, but they may also copy the generator's factual errors, style, or safety gaps. A synthetic-data filter should be tested against held-out human judgments rather than assumed to guarantee quality.

### Formatting and tokenization

Language-model examples are serialized into tokens through a tokenizer and conversation template. Role markers, separators, end-of-sequence tokens, truncation, and loss masks determine the actual training signal. A visually correct JSON conversation can become incorrect after formatting if assistant boundaries are misplaced or examples are truncated before the answer.

Dataset inspection should therefore include decoded token sequences and loss masks. Length distributions should be measured after tokenization. Truncation policy should specify whether to preserve the instruction, most recent turns, answer, or another region. Packing multiple short examples into one sequence can improve utilization, but attention boundaries and loss masks must prevent unintended cross-example conditioning.

## Training workflow

### Establish baselines

Evaluation should begin before training. Useful baselines include the unchanged base model with a fixed prompt, a fixed feature extractor with a simple head, and, when relevant, retrieval or continued pretraining. These baselines reveal whether parameter updates are necessary and provide a reference for regressions.

A strong baseline uses the same test prompts, decoding settings, preprocessing, and metric implementation as the fine-tuned model. Comparing a tuned model with carefully selected decoding against a base model with default decoding confounds training and inference.

### Choose the trainable state

The choice among full, partial, and parameter-efficient fine-tuning should follow the target shift and deployment constraints. Full training offers broad update capacity. LoRA or adapters reduce task-specific state and can support many variants over one base. Prompt methods minimize task storage but may have less capacity. Quantized adapter training reduces base-weight memory but adds numerical and runtime constraints.

The trainable parameter count should be computed from the actual model rather than copied from a method paper. Reports should list both the count and percentage, target modules, layers, rank or bottleneck size, biases, normalization parameters, task head, and any embeddings that remain trainable.

### Configure optimization

Batch size, sequence length, gradient accumulation, optimizer, schedule, precision, clipping, [regularization](/wiki/regularization), and checkpoint frequency interact. Effective batch size alone does not capture the number and length of tokens or examples contributing to an update. Token-normalized and example-normalized losses can behave differently when lengths vary.

Hyperparameters should be selected using only training and validation information. A small, predeclared search space is usually easier to interpret than repeated undocumented trial and error. When several random seeds are affordable, seed should be treated as part of the evaluation design rather than a way to select the best-looking result.[34][35]

### Monitor training

Training loss confirms that optimization is occurring but does not establish generalization. Validation loss and task metrics should be tracked at intervals that can detect divergence or overfitting. [Early stopping](/wiki/early_stopping) can limit unnecessary updates, but the stopping rule and patience become part of model selection and must not inspect the test set.

Samples of model output can reveal formatting collapse, repetition, refusal changes, and label leakage that an aggregate loss misses. Automated checks should be paired with a stable, versioned sample set. For high-impact applications, domain experts should inspect both typical and worst-case outputs.

### Save a reproducible artifact

A fine-tuned artifact includes more than a weight file. It should identify the base model and revision, tokenizer or feature processor, training data version, formatting code, trainable-parameter configuration, optimizer settings, random seeds, software versions, and evaluation results. For an adapter, the base checkpoint is a required dependency. For a merged adapter, the merge procedure and output precision are required.

Model cards provide a structured way to record intended use, evaluation, limitations, and relevant training information.[38] Deployment records should also state whether the artifact contains full weights, deltas, adapter parameters, soft prompts, or a quantized base-plus-adapter composition.

## Evaluation

### Target performance

The primary metric should correspond to the deployment decision. Accuracy can conceal class imbalance; token overlap can miss semantic errors; preference scores can hide disagreement; and average scores can conceal severe failures for a subgroup. The evaluation plan should include per-class or per-slice results where the target population is heterogeneous.

For generative systems, deterministic metrics and model-based graders should be calibrated against human judgments on a representative sample. A grader that shares a model family, training data, or stylistic preference with the candidate can introduce correlated bias. Human evaluations should report instructions, rater recruitment, blinding, number of comparisons, aggregation, and uncertainty.

### Retained capabilities

Target metrics alone do not measure what the model lost. A regression suite should cover important source capabilities, languages, safety behaviors, calibration, and output formats. The comparison between LoRA and full fine-tuning by Biderman and colleagues demonstrates why both in-domain learning and out-of-domain retention should be measured.[21]

Retention tests should be selected before inspecting the final checkpoint. Otherwise, the suite can be unconsciously chosen to favor the method. For a narrow classifier, retention may concern backbone representation quality. For a general language model, it can include reasoning, factual recall, instruction following, refusal boundaries, and tool-use protocols.

### Robustness and shift

Robustness tests vary spelling, formatting, paraphrase, image quality, demographic or geographic slices, and other conditions relevant to the use case. Out-of-distribution tests should reflect plausible deployment shift rather than arbitrary corruption. A model that performs well on a random held-out split can still fail when sources, institutions, time periods, or user populations change.

Fine-tuning comparisons should use confidence intervals or repeated runs when variation is material. Mosbach and colleagues traced some BERT fine-tuning failures to optimization difficulties and vanishing gradients rather than accepting a single simple explanation.[34] Du and Nguyen showed that standard deviation alone does not fully describe fine-tuning instability and advocated measurements that better reflect performance differences among runs.[35]

### Efficiency

Efficiency should be measured at the level that matters. Training reports can include elapsed time, energy, peak allocated and reserved memory, examples or tokens processed, and hardware. Serving reports can include latency, throughput, memory, adapter-switching cost, and cold-start behavior. Trainable parameter count is useful but does not substitute for these measurements.

All efficiency comparisons need a shared workload and environment. A method that saves optimizer memory may not reduce activation memory. A merged LoRA checkpoint may have no adapter operations at inference, while an unmerged or dynamically switched adapter may have a different serving profile.[17]

## Risks and failure modes

### Overfitting

[Overfitting](/wiki/overfitting) occurs when the fine-tuned model captures idiosyncrasies of the training sample that do not generalize. Small datasets, duplicates, noisy labels, excessive optimization, and repeated validation-driven selection can contribute. A falling training loss combined with worsening validation performance is a common signal, but overfitting can also appear as narrow stylistic imitation or memorization not captured by the primary metric.

Mitigation can include better data collection, deduplication, [data augmentation](/wiki/data_augmentation), weight decay, dropout, smaller trainable state, fewer updates, and early stopping. The correct intervention depends on the failure. Adding regularization cannot repair a mislabeled target ontology or a leaked test set.

### Negative transfer

Negative transfer occurs when reuse of source knowledge harms the target result. It can arise when source and target tasks reward conflicting features, when the pretrained representation omits important distinctions, or when a strong source bias steers optimization toward a poor solution. The transferability experiments of Yosinski and colleagues showed that the benefit of transferred features changed with task distance and layer depth.[2]

A randomly initialized target model may be too expensive as a routine baseline for a very large foundation model, but smaller controlled experiments, fixed-feature baselines, or comparisons among source checkpoints can still test the assumption that transfer is helping. The source model should not be selected only by size or popularity.

### Catastrophic forgetting

Catastrophic forgetting is the loss of previously learned capabilities while learning new information. It is a longstanding problem in sequential neural learning; elastic weight consolidation is one method designed to protect parameters important to earlier tasks.[33] During fine-tuning, forgetting can appear as a decline on general benchmarks, languages absent from the target data, safety behaviors, or capabilities that the target objective does not reward.

Freezing parameters or using LoRA can reduce some changes, but neither guarantees retention.[21] Other responses include mixing replay data, constraining parameter movement, reducing update magnitude, or training separate adapters. Each can reduce target adaptation, so target and retention metrics must be considered together.

### Optimization instability

Some fine-tuning runs fail or vary substantially even when their nominal configuration is unchanged. Random initialization of a task head, data order, dropout, and numerical operations can alter the trajectory. In small-data transformer experiments, different seeds have produced materially different outcomes.[34][35]

Reporting only the best seed overstates expected performance. Repeated runs, predeclared selection rules, learning-curve inspection, and full configuration logging make the result easier to interpret. If compute limits repeated training, the report should state that uncertainty instead of presenting a single result as deterministic.

### Safety degradation

Fine-tuning can weaken behavior learned during safety post-training. Qi and colleagues showed that a small number of adversarial examples could compromise safety alignment in the models they tested, and that even benign fine-tuning data could reduce safety performance.[36] The finding does not imply that every fine-tune is unsafe, but it shows that ordinary target accuracy is not an adequate safety check.

Safety evaluation should cover the target domain and general misuse categories before and after training. Access controls, data review, adapter provenance, and the ability to revoke a fine-tuned artifact are operational safeguards. If users can submit their own training data, the system also needs defenses against malicious examples and a policy for who may deploy the result.

### Privacy

Training data can contain personal or confidential information, and parameter updates can alter privacy behavior. Goel and colleagues found that benign fine-tuning degraded contextual privacy in experiments across several language models and datasets, including inappropriate disclosure or failure to use context-sensitive privacy norms.[37] This result concerns behavioral privacy, while memorization and extraction are related but separate evaluation questions.

Privacy review should address both the data and the model's behavior. Data minimization, access control, retention limits, redaction, and documented permission reduce exposure at collection time. Post-training tests should include context-dependent disclosure scenarios relevant to the application. A claim that data were "anonymized" should identify the transformation and the residual re-identification risk.

### Bias and representational harm

Fine-tuning data can amplify or redirect biases already present in a base model. A balanced label count does not ensure balanced coverage of language varieties, occupations, locations, disabilities, or intersecting groups. Synthetic data may further reproduce the generator's stereotypes. Evaluation should define affected groups and harms for the specific application rather than rely on a generic bias score.

Mitigation can involve targeted collection, annotation review, subgroup metrics, counterfactual tests, and escalation paths for uncertain cases. In some settings, the correct response is to restrict the task rather than optimize the model to make a sensitive inference.

### Distribution shift and maintenance

A fine-tuned model is tied to the target distribution observed during development. Policies, terminology, products, and user behavior can change. Monitoring should track input drift, output quality, subgroup performance, safety incidents, and changes in the retrieval or tool environment surrounding the model.

Retraining should not automatically append every new interaction to the dataset. Production feedback can be biased by earlier model behavior and by which users choose to report problems. New data should pass the same provenance, labeling, split, and review controls as the original corpus.

## Choosing between fine-tuning and alternatives

Fine-tuning is appropriate when the desired change is repeated, can be represented in training examples or another objective, and should become part of the model's parameters. It is often useful for a stable output format, a domain-specific classification boundary, a persistent interaction style, or efficient task adaptation across many repeated queries.

[Prompt engineering](/wiki/prompt_engineering) is a lower-cost starting point when a capable model already performs the task and the requirement can fit in the context. It is easier to revise and inspect, but consumes context and may be sensitive to phrasing. Few-shot prompting does not update weights, even though it demonstrates the task in the input.[28]

[Retrieval-augmented generation](/wiki/retrieval_augmented_generation) is usually better suited to knowledge that changes frequently, must be cited, or should be access-controlled outside the model. The original RAG formulation combined a pretrained sequence-to-sequence generator with non-parametric memory accessed through a retriever.[29] Retrieval quality, corpus freshness, and grounding remain separate engineering problems.

Continued pretraining fits a shift in unlabeled domain language or structure, while supervised fine-tuning fits a mapping from inputs to desired outputs. Preference optimization fits comparative judgments among plausible outputs. These stages can be combined, but each should have its own evaluation and data ledger.[9][24][25]

A practical decision sequence is:

1. Measure the unchanged base model with a fixed prompt and evaluation set.
2. Test whether retrieval or tools solve a changing-knowledge requirement.
3. If parameter updates are needed, establish a frozen-feature or small-adapter baseline.
4. Compare a larger trainable state only when target performance or robustness justifies it.
5. Evaluate both the target behavior and retained capabilities.
6. Choose the smallest operationally suitable method that meets the predeclared requirements, rather than the method with the fewest trainable parameters in isolation.

## Reproducibility checklist

A fine-tuning report should identify:

- the exact base checkpoint, revision, license, tokenizer, and preprocessing;
- the training objective and which tokens or examples contribute to the loss;
- dataset sources, versions, licenses or permissions, filters, deduplication, and split rules;
- the number of examples and tokens after preprocessing, with relevant length and class distributions;
- every trainable parameter group, including heads, embeddings, biases, normalization parameters, prompts, adapters, and low-rank targets;
- optimizer, learning-rate schedule, batch construction, precision, clipping, regularization, and stopping rule;
- random seeds and the number of independent runs;
- hardware and software versions;
- model-selection criteria and whether the test set remained untouched;
- target, retention, safety, privacy, robustness, and efficiency evaluations;
- the format of the released artifact and its dependency on a base model or quantization configuration.

These details are necessary to distinguish a reproducible procedure from a method label. "Fine-tuned with LoRA," for example, leaves rank, target modules, scaling, dropout, precision, quantization, optimizer groups, data format, and base revision unspecified.[16][17]

## See also

- [Machine Learning](/wiki/machine_learning)
- [Deep Learning](/wiki/deep_learning)
- [Pre-Trained Model](/wiki/pre-trained_model)
- [Foundation Models](/wiki/foundation_models)
- [BERT](/wiki/bert)
- [GPT](/wiki/gpt)
- [Quantization](/wiki/quantization)
- [Validation Set](/wiki/validation_set)

## References

1. Pan, S. J., and Yang, Q. "A Survey on Transfer Learning." IEEE Transactions on Knowledge and Data Engineering, 2010. https://doi.org/10.1109/TKDE.2009.191
2. Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. "How transferable are features in deep neural networks?" Advances in Neural Information Processing Systems 27, 2014. https://proceedings.neurips.cc/paper_files/paper/2014/hash/532a2f85b6977104bc93f8580abbb330-Abstract.html
3. Krizhevsky, A., Sutskever, I., and Hinton, G. E. "ImageNet Classification with Deep Convolutional Neural Networks." Advances in Neural Information Processing Systems 25, 2012. https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
4. Peters, M. E., et al. "Deep Contextualized Word Representations." Proceedings of NAACL-HLT, 2018. https://aclanthology.org/N18-1202/
5. Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. "Improving Language Understanding by Generative Pre-Training." OpenAI, 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
6. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of NAACL-HLT, 2019. https://aclanthology.org/N19-1423/
7. Howard, J., and Ruder, S. "Universal Language Model Fine-tuning for Text Classification." Proceedings of ACL, 2018. https://aclanthology.org/P18-1031/
8. Raffel, C., et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." Journal of Machine Learning Research, 2020. https://www.jmlr.org/papers/v21/20-074.html
9. Gururangan, S., et al. "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks." Proceedings of ACL, 2020. https://aclanthology.org/2020.acl-main.740/
10. PyTorch. "Transfer Learning for Computer Vision Tutorial." PyTorch Tutorials. https://docs.pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
11. Houlsby, N., et al. "Parameter-Efficient Transfer Learning for NLP." Proceedings of ICML, 2019. https://proceedings.mlr.press/v97/houlsby19a.html
12. Li, X. L., and Liang, P. "Prefix-Tuning: Optimizing Continuous Prompts for Generation." Proceedings of ACL-IJCNLP, 2021. https://aclanthology.org/2021.acl-long.353/
13. Lester, B., Al-Rfou, R., and Constant, N. "The Power of Scale for Parameter-Efficient Prompt Tuning." Proceedings of EMNLP, 2021. https://aclanthology.org/2021.emnlp-main.243/
14. Ben Zaken, E., Goldberg, Y., and Ravfogel, S. "BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models." Proceedings of ACL, 2022. https://aclanthology.org/2022.acl-short.1/
15. Liu, H., et al. "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning." Advances in Neural Information Processing Systems 35, 2022. https://proceedings.neurips.cc/paper_files/paper/2022/hash/0cde695b83bd186c1fd456302888454c-Abstract-Conference.html
16. Hu, E. J., et al. "LoRA: Low-Rank Adaptation of Large Language Models." ICLR, 2022. https://openreview.net/forum?id=nZeVKeeFYf9
17. Hugging Face. "LoRA conceptual guide." PEFT documentation. https://huggingface.co/docs/peft/main/conceptual_guides/lora
18. Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. "QLoRA: Efficient Finetuning of Quantized LLMs." Advances in Neural Information Processing Systems 36, 2023. https://proceedings.neurips.cc/paper_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html
19. Liu, S.-Y., et al. "DoRA: Weight-Decomposed Low-Rank Adaptation." Proceedings of ICML, 2024. https://proceedings.mlr.press/v235/liu24bn.html
20. Hayou, S., Ghosh, N., and Yu, B. "LoRA+: Efficient Low Rank Adaptation of Large Models." Proceedings of ICML, 2024. https://proceedings.mlr.press/v235/hayou24a.html
21. Biderman, D., et al. "LoRA Learns Less and Forgets Less." Transactions on Machine Learning Research, 2024. https://openreview.net/forum?id=aloEru2qCG
22. Wei, J., et al. "Finetuned Language Models Are Zero-Shot Learners." ICLR, 2022. https://openreview.net/forum?id=gEZrGCozdqR
23. Chung, H. W., et al. "Scaling Instruction-Finetuned Language Models." Journal of Machine Learning Research, 2024. https://www.jmlr.org/papers/v25/23-0870.html
24. Ouyang, L., et al. "Training language models to follow instructions with human feedback." Advances in Neural Information Processing Systems 35, 2022. https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract.html
25. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." Advances in Neural Information Processing Systems 36, 2023. https://proceedings.neurips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html
26. Zhou, C., et al. "LIMA: Less Is More for Alignment." Advances in Neural Information Processing Systems 36, 2023. https://proceedings.neurips.cc/paper_files/paper/2023/hash/ac662d74829e4407ce1d126477f4a03a-Abstract-Conference.html
27. Zhao, Y., et al. "Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning." Proceedings of ICML, 2024. https://proceedings.mlr.press/v235/zhao24b.html
28. Brown, T. B., et al. "Language Models are Few-Shot Learners." Advances in Neural Information Processing Systems 33, 2020. https://proceedings.neurips.cc/paper_files/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
29. Lewis, P., et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Advances in Neural Information Processing Systems 33, 2020. https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html
30. Ruiz, N., et al. "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation." Proceedings of CVPR, 2023. https://openaccess.thecvf.com/content/CVPR2023/html/Ruiz_DreamBooth_Fine_Tuning_Text-to-Image_Diffusion_Models_for_Subject-Driven_Generation_CVPR_2023_paper.html
31. Gal, R., et al. "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion." ICLR, 2023. https://openreview.net/forum?id=NAQvF08TcyG
32. Jia, M., Tang, L., Chen, B.-C., Cardie, C., Belongie, S., Hariharan, B., and Lim, S.-N. "Visual Prompt Tuning." Proceedings of ECCV, 2022. https://doi.org/10.1007/978-3-031-19827-4_41
33. Kirkpatrick, J., et al. "Overcoming catastrophic forgetting in neural networks." Proceedings of the National Academy of Sciences, 2017. https://doi.org/10.1073/pnas.1611835114
34. Mosbach, M., Andriushchenko, M., and Klakow, D. "On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines." ICLR, 2021. https://openreview.net/forum?id=nzpLWnVAyah
35. Du, K., and Nguyen, D. "Measuring the Instability of Fine-Tuning." Proceedings of ACL, 2023. https://aclanthology.org/2023.acl-long.342/
36. Qi, X., et al. "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!" ICLR, 2024. https://proceedings.iclr.cc/paper_files/paper/2024/hash/83b7da3ed13f06c13ce82235c8eedf35-Abstract-Conference.html
37. Goel, V., et al. "Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models." Proceedings of ACL, 2026. https://aclanthology.org/2026.acl-long.400/
38. Mitchell, M., et al. "Model Cards for Model Reporting." Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019. https://doi.org/10.1145/3287560.3287596