Underfitting occurs when a machine learning model is too simple to capture the underlying patterns in the data. An underfit model performs poorly not only on unseen data but also on the training set it was built from, indicating that it has failed to learn the relationships between inputs and outputs. Underfitting is one of the two fundamental failure modes in predictive modeling, the other being overfitting, where a model memorizes noise in the training data rather than generalizing.
In the language of the bias-variance tradeoff, underfitting corresponds to high bias and low variance. The model makes strong, oversimplified assumptions about the data, leading to systematic errors that persist regardless of which training samples are used. Where overfitting is the failure mode of an over-eager memorizer, underfitting is the failure mode of a model that never managed to start learning in the first place.
The term sits at the heart of how practitioners reason about model capacity, regularization, and the deeper question of how to choose a hypothesis class for a given problem. It is also one of the few ML concepts whose meaning has shifted in the last decade: the rise of massively overparameterized neural networks and the discovery of double descent by Belkin and colleagues in 2019 forced the field to revise the textbook picture of "too small, just right, too big."
The expected prediction error of any model can be decomposed into three components:
Expected Error = Bias² + Variance + Irreducible Error
Underfitting is the high-bias regime of this tradeoff. A model with excessive bias makes strong assumptions (for example, that the relationship between features and the target is linear) and cannot represent the true complexity of the data. While such a model tends to have low variance (its predictions are stable across different training sets), the predictions themselves are consistently inaccurate.
As model capacity increases, bias decreases because the model can represent more complex functions. At the same time, variance increases because the model becomes more sensitive to the specific training data. The classical goal in practice is to find the "sweet spot" where the combined error from bias and variance is minimized. This point represents a model that is complex enough to capture real patterns but not so complex that it fits noise. The picture is sometimes drawn as a U-shaped test error curve, with underfitting on the left side and overfitting on the right.
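As a concrete (and hedged) illustration, the sketch below uses an assumed synthetic setup, a noisy sine target fit by polynomials of varying degree, and estimates bias² and variance empirically by refitting each model on many resampled training sets:

```python
# Empirical bias-variance estimate on synthetic data (illustrative assumption, not from the text).
import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: np.sin(2 * np.pi * x)      # assumed "true" data-generating function
x_test = np.linspace(0, 1, 50)

def bias_variance(degree, n_datasets=200, n_points=30, noise=0.3):
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, n_points)
        y = true_fn(x) + rng.normal(0, noise, n_points)
        coefs = np.polyfit(x, y, degree)       # least-squares polynomial fit
        preds.append(np.polyval(coefs, x_test))
    preds = np.array(preds)
    bias_sq = np.mean((preds.mean(axis=0) - true_fn(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for degree in (1, 3, 9):
    b, v = bias_variance(degree)
    print(f"degree={degree}: bias^2={b:.3f}, variance={v:.3f}")
# Degree 1 shows high bias and low variance (underfitting); degree 9 shows the reverse.
```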
Geman, Bienenstock, and Doursat formalized this decomposition for neural networks in their influential 1992 paper, framing the bias-variance dilemma as the central challenge of generalization. Their analysis remained the dominant lens for understanding model selection until the mid-2010s, when very large neural networks began to violate its predictions. The full picture, discussed below in the section on double descent, is more nuanced: very large models can sometimes generalize well even after they perfectly fit the training data.
Underfitting manifests through several observable indicators during model training and evaluation:
| Indicator | What to look for |
|---|---|
| High training error | The model cannot accurately predict outcomes even on the data it was trained on |
| High validation error | Performance on the validation set is also poor, roughly matching the training error |
| Small gap between training and validation loss | Unlike overfitting (where training loss is low but validation loss is high), both losses remain elevated and close together |
| Flat learning curves | Training loss plateaus at a high value early and does not improve with additional training |
| Poor metric scores | Accuracy, F1, R-squared, or other evaluation metrics are well below acceptable thresholds on all data splits |
| Oversimplified predictions | The model predicts near-constant values or fails to capture obvious nonlinear trends visible in the data |
| Loss of calibration | Predicted probabilities cluster around the base rate; the model expresses no useful uncertainty |
| Failure to beat trivial baselines | A constant predictor (mean for regression, majority class for classification) performs about as well as the model |
| Oscillating loss | In some cases the loss bounces around without descending, often a sign of mismatched learning rate or optimizer |
A common diagnostic shortcut: train the model to deliberately overfit a tiny subset of the training data (say, 100 examples). If it cannot drive training loss to near zero on that small set, the model lacks the capacity required for the task and is structurally underfit. This is sometimes called the "can it memorize?" sanity check and is widely used in deep learning workflows.
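A minimal sketch of this sanity check, assuming a generic PyTorch regression model; the layer sizes and the stand-in data are placeholders rather than anything from a real workflow:

```python
import torch
import torch.nn as nn

# Tiny model and a 100-example stand-in batch (both are illustrative assumptions).
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
x_small = torch.randn(100, 20)
y_small = torch.randn(100, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(2000):          # deliberately overfit: no regularization, no validation
    optimizer.zero_grad()
    loss = loss_fn(model(x_small), y_small)
    loss.backward()
    optimizer.step()

# If this loss is not close to zero, the model (or its optimization setup) cannot even
# memorize 100 points and is structurally underfit for the real task.
print(f"final training loss on the tiny subset: {loss.item():.4f}")
```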
Several factors can lead to underfitting, often in combination. The table below summarizes the most common ones, with brief notes on how each one typically manifests.
| Cause | Mechanism | Typical signal |
|---|---|---|
| Model too simple (low capacity) | Hypothesis class cannot represent the true function | Training loss plateaus high; capacity test fails |
| Wrong model class | Linear model on nonlinear data, additive model on interactions | Residuals show clear systematic structure |
| Insufficient or irrelevant features | Inputs do not carry information about the target | Even very flexible models score near baseline |
| Poor feature engineering | Missing transformations, no interactions, no embeddings | Tree models outperform linear ones by a wide margin |
| Excessive regularization | Penalty pushes weights toward zero | Loss drops when regularization is dialed down |
| Too few training epochs | Optimizer stops before convergence | Both losses still trending downward at end of training |
| Learning rate too low | Updates are tiny relative to loss landscape | Training loss decreases very slowly, almost flat |
| Learning rate too high | Loss bounces or diverges, never settles | Oscillating training curve, NaN values |
| Heavy dropout or other stochastic noise | Effective capacity is reduced | Training and validation losses both elevated |
| Frozen layers in transfer learning | Pretrained weights cannot adapt | Strong features wasted on a head with no expressive power |
| Aggressive class-imbalance correction | Undersampling discards too much signal | Minority recall improves but overall fit degrades |
| Very small training dataset | Not enough examples to identify patterns | High variance hides as high bias |
| Mismatched optimizer | AdaGrad's diminishing learning rate, for example | Convergence stalls long before training loss is acceptable |
A few of these deserve more discussion.
The most common cause is choosing a model family that lacks the capacity to represent the true data-generating function. For example, fitting a straight line to data that follows a quadratic or exponential relationship will always produce systematic errors, no matter how much data is available or how long training runs. The model's hypothesis space simply does not contain a function close to the true one. This is sometimes called approximation error to distinguish it from estimation error (which is about how well one can choose within a given hypothesis class).
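A small sketch, on assumed synthetic data, of why more data does not help here: a straight line fit to a quadratic signal keeps roughly the same training error at any sample size.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
for n in (100, 10_000, 1_000_000):
    x = rng.uniform(-3, 3, size=(n, 1))
    y = x[:, 0] ** 2 + rng.normal(0, 0.1, size=n)   # quadratic signal, small noise
    model = LinearRegression().fit(x, y)
    mse = mean_squared_error(y, model.predict(x))
    print(f"n={n:>9}: training MSE = {mse:.3f}")
# The error plateaus at the approximation error of the linear hypothesis class; extra
# data shrinks estimation error but cannot remove the systematic miss.
```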
Even a model with adequate capacity will underfit if the input features do not carry enough information to predict the target. If key variables are missing from the dataset, or if the features provided are only weakly correlated with the outcome, the model has nothing meaningful to learn from. Poor feature engineering (failing to create interaction terms, polynomial features, or domain-specific transformations) can also limit the model's ability to detect patterns. Trees and gradient-boosted models can recover some of this through implicit feature crosses, but they cannot invent features that are not derivable from the inputs.
Techniques such as L1 (Lasso) and L2 (Ridge) regularization, dropout in neural networks, and early stopping are designed to prevent overfitting by constraining model complexity. However, if the regularization strength is set too high, these techniques can suppress the model's ability to learn genuine patterns. A very large L2 penalty will shrink all model weights toward zero, effectively reducing a complex model to a near-constant prediction. Similarly, a dropout rate of 0.8 in a small network can leave so few active neurons per forward pass that the network never accumulates a stable signal.
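A sketch of the effect on assumed synthetic data: a moderate L2 penalty leaves a genuine linear signal intact, while an enormous one collapses the model toward predicting the mean.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 0.5, size=500)   # genuine linear signal

for alpha in (0.1, 10.0, 1e6):
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:>9}: training R^2 = {r2_score(y, model.predict(X)):.3f}, "
          f"max |coef| = {np.abs(model.coef_).max():.4f}")
# At alpha = 1e6 the coefficients are squeezed to near zero and the model predicts
# roughly the mean of y: underfitting induced entirely by the regularizer.
```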
A model can have plenty of capacity on paper but still underfit because the optimizer fails to find a good minimum. The most common version is a learning rate set too low: training loss creeps down but never reaches a useful level within the available training budget. The opposite (a learning rate too high) can also cause underfitting, since the loss never settles into a basin and the network behaves as though it cannot learn at all. Adaptive optimizers like Adam reduce sensitivity to the initial learning rate, while AdaGrad has the well-known issue that its accumulated gradient sums shrink the effective learning rate to near zero, sometimes causing premature underfitting.
A model that has not been trained for enough epochs may not have had time to converge to a good solution. This is especially relevant for deep neural networks, which can require hundreds or thousands of epochs to fully learn complex representations. Learning rate schedules, warmup steps, and patience settings on early stopping all interact here. A common mistake is to treat early stopping as a free lunch and stop training the moment validation loss flattens, when in fact the model would have continued to improve given a few more epochs.
When the training dataset is very small, even a well-chosen model may not have enough examples to identify robust patterns. The model may fail to distinguish signal from noise, leading to poor performance that resembles underfitting. In financial machine learning, where the signal-to-noise ratio is famously low, even high-capacity models may barely improve on a trivial baseline because the noise floor itself is high; the irreducible error term in the bias-variance decomposition dominates.
Learning curves are one of the most practical tools for diagnosing underfitting. A learning curve plots training and validation loss (or another performance metric) against the number of training iterations or the size of the training set.
When a model is underfitting, both the training loss and validation loss converge to a high value. The training loss may decrease slightly in the first few epochs but then flattens out well above an acceptable threshold. The validation loss follows a similar trajectory and settles close to the training loss. The small gap between the two curves confirms that the problem is not overfitting (which would show a large gap) but rather that the model lacks the capacity to learn.
If both curves are still decreasing at the end of training, this suggests the model might benefit from additional epochs. The model may have the capacity to learn but was not given enough time. Some optimizers also produce oscillating loss curves when the learning rate is too aggressive; in that case the issue is optimization rather than capacity, and the fix is to lower the learning rate or switch to an adaptive optimizer.
| Pattern | Training loss | Validation loss | Gap | Diagnosis |
|---|---|---|---|---|
| Good fit | Low | Low (slightly higher than training) | Small | Model generalizes well |
| Underfitting | High | High | Small | Model is too simple or undertrained |
| Overfitting | Very low | High | Large | Model memorizes training data |
| Still converging | Decreasing | Decreasing | Moderate | Train longer or use more data |
| Diverging loss | Bouncing or rising | Bouncing or rising | Variable | Learning rate too high or optimizer broken |
| Plateau then drop | Initially flat, then improves | Follows training | Variable | Possible double-descent transition or warmup ending |
Another useful diagnostic plots performance against model capacity rather than training time. Sweep the depth of a tree, the polynomial degree, or the width of a network and record training and validation error at each setting. A purely underfit regime shows both errors high and falling together as capacity grows. The classical U-curve appears as the model crosses into the overfitting region. In modern overparameterized regimes, this sweep often reveals a second descent past the interpolation threshold, the hallmark of the double descent phenomenon.
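A sketch of such a capacity sweep on assumed synthetic data, using scikit-learn's validation_curve with polynomial degree as the capacity axis:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.2, size=200)

degrees = [1, 2, 3, 5, 9, 15]
model = make_pipeline(PolynomialFeatures(), LinearRegression())
train_scores, val_scores = validation_curve(
    model, X, y,
    param_name="polynomialfeatures__degree",
    param_range=degrees,
    scoring="neg_mean_squared_error",
    cv=5,
)
for d, tr, va in zip(degrees, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
    print(f"degree={d:>2}: train MSE = {tr:.3f}, validation MSE = {va:.3f}")
# Low degrees: both errors high and close together (underfitting). Middle degrees: both
# near the noise floor. High degrees: the gap reopens as validation error rises.
```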
Underfitting can occur in any model family, though it manifests differently depending on the algorithm:
Linear regression and logistic regression assume a linear relationship between input features and the target. When the true relationship is nonlinear (for example, a U-shaped curve or a step function), these models will systematically underfit because no linear function can approximate the true pattern. The classic example is fitting a straight line to data that follows a parabola: the line will always miss the curvature, producing high residuals at the extremes. Bishop's textbook opens its model-fitting chapter with exactly this example, using a sine wave fit by polynomials of varying degree to make the underfitting and overfitting failure modes visible side by side.
A decision tree with a very low maximum depth or a high minimum samples per leaf is forced to make splits based on only the broadest distinctions in the data. Such a tree cannot capture fine-grained patterns, interactions between features, or nonlinear boundaries. For instance, a depth-2 tree used to classify images will likely perform poorly because it can only partition the feature space into a handful of regions. Random forests and gradient boosting are partly designed to mitigate this by combining many shallow trees, each contributing a small piece of the decision surface.
A neural network with too few layers or too few neurons per layer lacks the representational power to approximate complex functions. According to the universal approximation theorem, a sufficiently wide single-hidden-layer network can approximate any continuous function, but "sufficiently wide" may mean thousands of neurons. In practice, shallow or narrow networks trained on complex tasks (such as image recognition or natural language processing) will underfit because they cannot learn the hierarchical feature representations that these tasks require.
The Naive Bayes classifier assumes that all input features are conditionally independent given the class label. When features are strongly correlated (as they often are in real-world data), this assumption leads to systematic prediction errors, a form of underfitting caused by the model's overly restrictive assumptions rather than by insufficient complexity in the traditional sense. The model's hypothesis class is large enough numerically (one parameter per feature per class) but the structural assumption blocks it from representing useful interactions.
In transfer learning, a common cause of underfitting is freezing too many layers of a pretrained network. If only a small classification head is trainable on top of frozen features, and those features are mismatched with the new task, the model cannot adapt and underfits. The standard remedy is to progressively unfreeze higher layers and fine-tune them with a low learning rate. This is one of the few cases where underfitting is more about plumbing (which weights receive gradients) than about raw model size.
A practical example often seen in the deep learning literature is the linear probe used to evaluate representations from a large language model. The probe is intentionally a tiny linear classifier on top of a frozen embedding. If the probe's score is low, the practitioner usually suspects the representation rather than the probe; but a probe that is too constrained can underfit useful information that a slightly larger nonlinear head would expose. Distinguishing model-side underfitting from probe-side underfitting requires careful experimental design.
Addressing underfitting requires giving the model more flexibility, better information, or more time to learn. The table below summarizes common fixes.
| Fix | When to use | What to watch for |
|---|---|---|
| Increase model capacity (more layers, more parameters) | Training loss high, no improvement with longer training | Watch for overfitting once gap opens between train and validation |
| Use a richer model class (tree, kernel, deep network) | Linear baseline falls well short of the performance target | May need more compute or memory |
| Add or engineer features | Tree models beat linear models by a wide margin | Risk of feature leakage if not careful |
| Reduce regularization (lower lambda, lower dropout) | Loss drops sharply when regularization is reduced | Test set performance is the final arbiter |
| Train longer | Both losses still trending down | Use validation curve to avoid overshooting into overfitting |
| Lower learning rate or add warmup | Loss is oscillating | Convergence becomes slower; tune patience |
| Switch optimizer (Adam, AdamW) | SGD with low learning rate stalls | Adam can have worse generalization; tune weight decay |
| Use early stopping more leniently | Stopping fires before model converges | Increase patience and minimum delta |
| Reduce dropout rate | Network is shallow or narrow | Dropout below 0.1 often acts as a no-op |
| Switch to ensemble or boosting | Single weak learner cannot capture pattern | Boosting can convert high-bias learners into low-bias ensembles |
| Use a more powerful pretrained backbone | Frozen features mismatch task | Larger backbones cost more inference compute |
| Improve data quality | Labels noisy or features ambiguous | Sometimes the cheapest fix and the most impactful |
| Gather more data | Genuine signal exists but is buried in noise | Diminishing returns past a problem-specific point |
A few of these are worth elaborating.
The most direct remedy is to use a more expressive model. Options include adding more layers or neurons to a neural network, increasing the degree of a polynomial regression, allowing deeper splits in a decision tree, or switching from a linear model to a nonlinear one (for example, from linear regression to a random forest or gradient boosting model). The principle is to grow capacity in the smallest increment that solves the problem; jumping straight from a linear model to a billion-parameter transformer rarely makes sense and usually buys engineering complexity without much accuracy.
Creating new features that better capture the underlying relationships can substantially improve model performance without changing the model architecture. Common techniques include adding polynomial or interaction terms, applying domain-specific transformations (log, square root), encoding cyclical features (day of week, month), and using embeddings for categorical variables. Feature engineering is sometimes the highest-leverage intervention available, especially in tabular problems where neural networks tend to lose to gradient-boosted trees.
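A brief sketch with hypothetical column names: the same linear model, before and after adding interaction features, on data where the signal lives in an interaction.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, 1000),
    "monthly_usage": rng.uniform(0, 100, 1000),
})
# Assumed target: driven by an interaction that the raw features do not expose linearly.
y = 0.05 * df["tenure_months"] * df["monthly_usage"] + rng.normal(0, 10, 1000)

plain = LinearRegression().fit(df, y)
engineered = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
).fit(df, y)
print("raw features      R^2:", round(plain.score(df, y), 3))
print("with interactions R^2:", round(engineered.score(df, y), 3))
```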
If the model has adequate capacity but is being penalized too heavily, reducing the regularization strength can help. This means lowering the lambda parameter in L1/L2 regularization, decreasing the dropout rate in neural networks, or relaxing constraints on tree depth and minimum leaf size. A useful diagnostic is to set regularization to near zero and verify that training loss drops sharply; if it does, the previous setting was too aggressive.
If the learning curves show that both training and validation loss are still decreasing at the end of training, the model may simply need more epochs. Increasing the training budget, adjusting the learning rate schedule, or using a learning rate warmup can all help the model converge to a better solution. When loss oscillates rather than descending, the issue is usually the optimizer: lowering the learning rate, switching from SGD with low learning rate to Adam, or adding gradient clipping often unblocks training. AdaGrad, in particular, can underfit because its accumulated squared-gradient denominator drives the effective learning rate to near zero in long runs.
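A hedged PyTorch sketch of the warmup-then-decay pattern; the model, step counts, and hyperparameters below are placeholders, not values taken from the text.

```python
import math
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                        # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
warmup_steps, total_steps = 2_000, 20_000

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                      # linear warmup from 0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))           # cosine decay to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):                                 # skeleton training loop
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(torch.randn(32, 10)), torch.randn(32, 1))
    loss.backward()
    optimizer.step()
    scheduler.step()            # step the schedule after each optimizer update
```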
Ensemble methods, particularly boosting algorithms such as AdaBoost and gradient boosting, are specifically designed to reduce bias. Boosting works by sequentially training weak learners (often shallow decision trees) and focusing each new learner on the mistakes of the previous ones. The result is a strong learner that can capture complex patterns even when the individual base models are simple. This is the standard escape hatch when a constrained per-model capacity is required (for example, for interpretability) but the resulting model underfits.
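A short sketch on assumed synthetic data: a single depth-1 tree (a stump) underfits a nonlinear signal badly, while boosting several hundred stumps recovers it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.3 * X[:, 0] ** 2 + rng.normal(0, 0.1, size=2000)

stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
boosted = GradientBoostingRegressor(max_depth=1, n_estimators=300,
                                    learning_rate=0.1).fit(X, y)
print("single stump   R^2:", round(stump.score(X, y), 3))
print("boosted stumps R^2:", round(boosted.score(X, y), 3))
# The base learner's bias stays high; the sequential ensemble drives it down.
```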
Adding more training examples, improving data quality, or incorporating additional features from external sources can provide the model with the information it needs to learn meaningful patterns. In practice, label noise is often the first thing to fix: a model can only fit signal that is actually present, and a 10% label error rate puts a hard ceiling on training accuracy.
When underfitting appears in a transfer-learning setup, the standard playbook is to unfreeze the upper layers of the pretrained backbone and fine-tune them with a low learning rate. Keras and PyTorch tutorials both recommend a two-phase approach: train the new head first while the backbone is frozen, then unfreeze the upper backbone layers for a low-learning-rate fine-tune. Underfitting that disappears after unfreezing is a sign that the pretrained features were close but not quite right for the task.
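A hedged PyTorch sketch of the two-phase recipe; the backbone choice, layer names, and learning rates are illustrative assumptions rather than a prescribed API.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V2")     # assumed pretrained backbone
for p in model.parameters():                         # phase 1: freeze the whole backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)       # new head for a hypothetical 10-class task

head_optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# ... train only the head to convergence here ...

for p in model.layer4.parameters():                  # phase 2: unfreeze the top block
    p.requires_grad = True
fine_tune_optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
# ... resume training at the lower learning rate; underfitting that disappears here
# points to features that were close but not adaptable while frozen ...
```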
Underfitting and overfitting represent opposite ends of the model complexity spectrum. The following table summarizes their key differences:
| Aspect | Underfitting | Overfitting |
|---|---|---|
| Bias | High | Low |
| Variance | Low | High |
| Training error | High | Low (often near zero) |
| Validation/test error | High | High |
| Training-validation gap | Small | Large |
| Model complexity | Too low | Too high |
| Sensitivity to data subset | Low; predictions stable across resamples | High; predictions swing on different splits |
| Calibration | Often miscalibrated toward the mean | Confidently wrong on out-of-distribution inputs |
| Learning curve shape | Both curves plateau at a high loss | Training loss is low; validation loss diverges |
| Primary cause | Model too simple, insufficient features, excessive regularization | Model too complex, too little regularization, noise in training data |
| Typical fix | Increase complexity, add features, reduce regularization | Simplify model, add regularization, use more training data |
| Effect of more data | Limited; bias persists | Helpful; reduces variance |
| Effect of more training time | Helps if the model has capacity | Harms; locks in noise |
In practice, practitioners aim for the point between these two extremes where the model captures real patterns in the data without fitting noise. This is sometimes called the "Goldilocks zone" of model complexity. The two failure modes are not symmetric in their costs: an overfit model can be patched with regularization or more data, while a structurally underfit model often requires throwing the architecture out and starting again.
The concept of model capacity formalizes a model's ability to fit a wide variety of functions. A model with low capacity can only represent simple functions and is prone to underfitting, while a model with very high capacity can represent extremely complex functions but risks overfitting in classical analysis.
One way to measure capacity is the Vapnik-Chervonenkis (VC) dimension, introduced by Vladimir Vapnik and Alexey Chervonenkis. The VC dimension of a hypothesis class is the size of the largest set of points that the class can shatter, that is, classify correctly under every possible labeling of that set. For example, a linear classifier in two dimensions has a VC dimension of 3: there exists a set of 3 points (in general position) that it can label in every possible way, but no set of 4 points can be shattered. A decision tree of depth d on binary features has a VC dimension that grows with d, while a neural network's VC dimension grows polynomially with the number of parameters in many configurations.
Statistical learning theory shows that the generalization error is bounded by a function of both the empirical (training) error and a complexity penalty proportional to the VC dimension. A model with a VC dimension far below what the problem requires will have high empirical error (underfitting), while one with an excessively high VC dimension relative to the number of training samples will have a large complexity penalty (overfitting). The number of training samples needed for good generalization is roughly proportional to the VC dimension of the hypothesis class.
Structural risk minimization (SRM), also due to Vapnik, is a principle that selects the model with the lowest upper bound on generalization error, balancing the training error against the complexity penalty. This provides a theoretical foundation for the practical advice to choose the simplest model that fits the data adequately. SRM was the operative principle behind the design of support vector machines, which select the maximum-margin hyperplane partly to control effective capacity.
The VC framework gives a clean explanation for classical underfitting: if the VC dimension of the chosen hypothesis class is too small, the empirical risk minimizer within that class will have a positive lower bound on training error that no amount of data can eliminate. The framework also predicts a U-shaped test-error curve, which held up empirically for decades. Modern overparameterized neural networks complicate this picture in ways the original theory did not anticipate.
One of the most striking developments in machine learning theory in the late 2010s was the observation of double descent, named and characterized by Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal in their 2019 paper "Reconciling modern machine-learning practice and the classical bias-variance trade-off," published in the Proceedings of the National Academy of Sciences.
The classical picture predicts a U-shaped test-error curve: too little capacity gives underfitting, too much gives overfitting, and the optimum sits in the middle. Belkin and colleagues showed that this curve is only the left portion of a larger pattern. As capacity grows past the interpolation threshold (the point at which the model has just enough parameters to fit every training point exactly), test error first spikes and then begins to descend again. Past that second descent, very large overparameterized models can generalize as well as or better than the optimum of the classical U-curve.
The immediate consequence for underfitting is that the textbook intuition ("add more parameters and you start overfitting") is incomplete. In modern deep learning, very wide networks rarely underfit, even with minimal explicit regularization. The implicit regularization of stochastic gradient descent and the structure of the loss landscape together push solutions toward flat minima that generalize well. Belkin's team demonstrated this empirically across random Fourier features, decision trees, AdaBoost, and small neural networks, suggesting the phenomenon is not an artifact of any one architecture.
For LLM pretraining, the picture is even more skewed: trillion-parameter models trained on internet-scale corpora rarely underfit in the classical sense. They are usually compute-bound rather than capacity-bound, and the closest equivalent of underfitting is a model that has not yet seen enough tokens. The Chinchilla scaling laws made this point quantitative: many earlier large models were undertrained for their size, a different failure mode that looks like underfitting on downstream tasks even though the architecture has plenty of capacity.
The practical takeaway is not that bias and variance no longer matter. They still do for classical models, for tabular problems, and for any setting where data is small relative to model size. But the boundary between underfitting and overfitting is more porous in deep learning than it appears in textbooks, and "my model is too small" should be evaluated against the size of the problem, the size of the dataset, and the implicit regularization properties of the optimizer.
Underfitting is usually a problem to fix, but there are settings where a slightly underfit model is the better choice.
The most common is interpretability. A linear model with a handful of coefficients is easier to explain to a regulator, a doctor, or a credit committee than a 200-tree gradient-boosted ensemble, even if the tree ensemble has higher accuracy. In credit scoring, healthcare risk stratification, and certain regulatory settings, models are often deliberately constrained to logistic regression or shallow scorecards because the cost of an unexplainable decision can outweigh the cost of a small accuracy loss. A model that mildly underfits but produces a clear chain of reasoning may be preferable to a black box that fits perfectly.
A related case is Occam's razor as a model-selection principle. When two models have similar performance on held-out data, the simpler one is usually preferred because it is less likely to depend on idiosyncrasies of the training set and easier to maintain in production. Structural risk minimization formalizes this preference: among models with the same training error, choose the one with the smaller capacity. The tradeoff is that pushing simplicity too far moves the model into structural underfitting; pushing it just enough produces a model that is parsimonious without being broken.
Underfit models are also useful as baselines. A constant predictor or a logistic regression with a handful of features is a sanity check: any more complex model worth deploying should beat it by a meaningful margin. If the elaborate model only edges out the underfit baseline, the elaborate model is probably not capturing useful structure and the apparent improvement may not survive a different data split.
In the era of overparameterized models, the relationship between capacity and underfitting has shifted in several specific ways.
Large transformer language models almost never underfit in the classical sense; their parameter counts are vast compared to typical fine-tuning datasets. When they perform poorly on a downstream task, the cause is usually distribution shift, prompt sensitivity, or insufficient pretraining tokens rather than insufficient capacity. The Chinchilla scaling laws (Hoffmann et al., 2022) reframed this: many large models are undertrained for their parameter count, which can look like underfitting even though the architecture has the capacity to fit the task in principle.
Fine-tuning is a more frequent source of underfitting than pretraining. A LoRA adapter with a very small rank or a prompt-tuning soft prompt with too few learnable tokens can be incapable of representing the desired task even though the underlying base model has plenty of capacity. The standard fix is to increase the LoRA rank, unfreeze additional layers, or move to full fine-tuning if compute allows.
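A hedged sketch using the Hugging Face peft library (assumed available); the base model and target module names are illustrative, and the right rank is task-dependent.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")       # stand-in base model

# A rank-4 adapter on attention projections only may underfit a demanding task...
low_capacity = LoraConfig(r=4, lora_alpha=8, target_modules=["c_attn"],
                          task_type="CAUSAL_LM")
# ...raising the rank and covering more projection matrices is the usual first response.
higher_capacity = LoraConfig(r=32, lora_alpha=64,
                             target_modules=["c_attn", "c_proj"],
                             task_type="CAUSAL_LM")

model = get_peft_model(base, higher_capacity)
model.print_trainable_parameters()    # shows how much capacity the adapter actually adds
```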
In computer vision and other domains, frozen feature extractors followed by a linear head are a classic source of mild underfitting. If the pretrained features are slightly mismatched with the target task, only a small accuracy improvement is possible until the upper layers are unfrozen. Two-phase fine-tuning, where the head is trained first and the backbone is then partially unfrozen with a low learning rate, is the standard practice.
Reinforcement-learning agents can underfit when the policy network is too small to represent the optimal action distribution, or when the value network is unable to track the temporal-difference targets. In contrast to supervised settings, underfitting in RL is harder to diagnose because both training and validation distributions shift as the policy improves; the closest equivalent of training loss is a moving target.
Class-imbalanced datasets create their own underfitting risks. The most common pattern: an aggressive undersampling strategy throws away majority-class examples to balance the training set, but in doing so it discards information that the model needed to learn. The minority class score may improve, but the overall model becomes simpler and more biased toward an idealized 50-50 view of the world that does not match deployment.
A 2025 simulation study by Carriero and colleagues in Statistics in Medicine found that for several common classifiers (logistic regression, random forest, XGBoost), correcting for moderate class imbalance was not necessary and could harm calibration. The takeaway is not that imbalance never matters, but that the cure can introduce its own underfitting. Practitioners working in clinical machine learning increasingly start with the natural class distribution and only intervene when the cost of one class of error clearly dominates.
A related issue is the use of class weights that are too extreme: setting a positive class weight of 100 in a problem with a 1:10 imbalance can collapse the decision boundary to predict the minority class everywhere, which is a form of underfitting in disguise. Calibrated thresholds on the natural distribution often work as well as resampling without the underfitting risk.
In medical machine learning, underfitting has unique stakes. A model trained to predict sepsis from vital signs that uses only a handful of features (temperature, heart rate, blood pressure) may underfit by missing the more subtle interaction patterns visible in lab values, medication histories, and trajectories over time. The fix in practice is rarely "train a bigger neural network"; it is to engineer richer features in collaboration with clinicians, or to switch from a linear scorecard to a gradient-boosted tree that can express interactions.
Medical models are also constrained by regulation and by the need for interpretability. A logistic regression that mildly underfits may be deployable, where a billion-parameter neural network is not. The tradeoff is explicit: some accuracy is left on the table in exchange for a model whose reasoning a clinician can audit. A 2024 Pitfalls and Best Practices review in the NCBI Bookshelf series on AI in health care discussed both overfitting and underfitting as parallel concerns, noting that overconfident underfit models (low accuracy combined with miscalibrated probabilities) can be more dangerous than overfit ones because they are easier to mistake for trustworthy.
Financial machine learning operates in a famously low signal-to-noise regime. A stock-return prediction model that achieves 51% accuracy on a binary up-down classification is doing well; the irreducible error is large. In this setting, both very simple models (linear regression on a single technical indicator) and very complex ones (deep transformers on tick data) tend to look underfit relative to the practitioner's hopes, because the underlying noise floor is high.
The practical pattern in quantitative trading is to use ensembles of simple models with carefully selected features, accepting that any individual model will look underfit. Voting schemes across many weak signals can produce useful aggregate predictions even when each component has high bias. The opposite failure (overfitting to historical price patterns) is more often discussed in trading literature than underfitting, but underfitting is the silent partner: a model that looks robust because it cannot find any signal at all is functionally useless even if it does not overfit.
Credit scoring sits between healthcare and finance in terms of constraints. Models must be explainable to regulators and to consumers (under regulations like the U.S. Equal Credit Opportunity Act), which favors logistic regression scorecards. Such models often underfit relative to gradient-boosted alternatives, but the underfitting is a deliberate tradeoff for explainability. Recent work on interpretable machine learning for credit scoring has tried to combine the accuracy of tree ensembles with post-hoc explanations (such as SHAP values), but the underlying model still tends to be more constrained than a pure-accuracy approach would choose.
A few illustrative examples bring the diagnostics together.
A team trains a logistic regression on a customer-churn dataset. Training accuracy is 72%, validation accuracy is 71%, and a constant predictor scores 68%. The small gap between training and validation, combined with the fact that the model barely beats the baseline, is the textbook signature of underfitting. Switching to a random forest pushes both scores into the high 80s, confirming that the linear model lacked capacity. Engineering a few interaction features (tenure x usage, plan x age) closes most of the remaining gap.
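A schematic reconstruction of that diagnostic on synthetic stand-in data (not the team's dataset, so the numbers will differ): compare a trivial baseline, a linear model, and a random forest on the same split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           class_sep=0.8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

models = [("constant baseline", DummyClassifier(strategy="most_frequent")),
          ("logistic regression", LogisticRegression(max_iter=1000)),
          ("random forest", RandomForestClassifier(n_estimators=200, random_state=0))]
for name, clf in models:
    clf.fit(X_tr, y_tr)
    print(f"{name:>20}: train = {clf.score(X_tr, y_tr):.3f}, "
          f"validation = {clf.score(X_val, y_val):.3f}")
# A linear model that barely clears the constant baseline while matching its own
# validation score shows the small-gap, low-score signature described above.
```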
A computer vision team fine-tunes a ResNet-50 by freezing all layers and training only a 1000-class linear head. Top-1 accuracy plateaus at 60%. Both training and validation accuracy track each other closely. Unfreezing the last residual block and resuming training with a learning rate of 1e-4 lifts both scores by 8 points. The pretrained features were close but not quite right for the target distribution; the linear head alone could not bridge the gap.
A quant team trains a gradient-boosted regressor on minute-level price data with strong L2 regularization on the leaf weights. Training and validation R-squared are both 0.02, which sounds like underfitting until they realize that the noise at this horizon caps the achievable R-squared at roughly 0.025. The model is fine; the data is not informative at this horizon. The fix is upstream: gather alternative-data features or move to a longer prediction horizon where the signal is stronger.
A clinical risk-stratification model trained with aggressive undersampling (1:1 majority-to-minority) shows 88% recall but only 62% precision on the minority class. The model has been pushed into a high-bias regime by the resampling, with calibration knocked out as a side effect. Refitting on the natural class distribution and choosing the threshold that meets the operational recall target restores calibration.
A team training a small transformer on a translation task observes that training loss flattens at a perplexity of 30 within a few epochs and refuses to drop further. They check the optimizer: Adam with a peak learning rate of 1e-3 and no warmup. Adding 2000 warmup steps and switching to AdamW with a small weight decay drops perplexity into the low teens within the same epoch budget. The original underfitting was an optimization failure rather than a capacity failure.
A systematic approach to diagnosing and fixing underfitting pulls the preceding diagnostics into a rough order of operations:

1. Compare the model against a trivial baseline (mean predictor for regression, majority class for classification).
2. Run the "can it memorize?" check on a tiny subset of the training data.
3. Plot learning curves and inspect the gap between training and validation loss.
4. Sweep model capacity (tree depth, polynomial degree, network width) and watch where both errors stop falling together.
5. Dial regularization down and check whether training loss drops sharply.
6. Audit the optimizer, learning rate, and schedule before blaming the architecture.
7. Only then add capacity, engineer richer features, or gather more data.
Imagine you are trying to draw a picture of a cat, but you can only use a ruler to draw straight lines. No matter how hard you try, a few straight lines will never look like a real cat because cats have curves, soft fur, and round eyes. Your drawing is too simple to capture what a cat actually looks like.
That is what underfitting means for a computer. The computer is given a tool (the model) that is too simple for the job. It tries its best, but it misses the important details because it does not have the right tools to capture them. The fix is to give the computer a better tool, like colored pencils and curves, so it can draw something that actually looks like a cat. Sometimes the fix is even simpler: the computer just needs more time to practice, or someone forgot to take the safety bumpers off (too much regularization), or it is using a pen that is running out of ink (a learning rate that is too low).