Underfitting occurs when a machine learning model is too simple to capture the underlying patterns in the data. An underfit model performs poorly not only on unseen data but also on the training set it was built from, indicating that it has failed to learn the relationships between inputs and outputs. Underfitting is one of the two fundamental failure modes in predictive modeling, the other being overfitting, where a model memorizes noise in the training data rather than generalizing.
In the language of the bias-variance tradeoff, underfitting corresponds to high bias and low variance. The model makes strong, oversimplified assumptions about the data, leading to systematic errors that persist regardless of which training samples are used. Where overfitting is the failure mode of an over-eager memorizer, underfitting is the failure mode of a model that never managed to start learning in the first place.
The term sits at the heart of how practitioners reason about model capacity, regularization, and the deeper question of how to choose a hypothesis class for a given problem. It is also one of the few ML concepts whose meaning has shifted in the last decade: the rise of massively overparameterized neural networks and the discovery of double descent by Belkin and colleagues in 2019 forced the field to revise the textbook picture of "too small, just right, too big."
The expected prediction error of any model can be decomposed into three components:
Expected Error = Bias² + Variance + Irreducible Error
Underfitting is the high-bias regime of this tradeoff. A model with excessive bias makes strong assumptions (for example, that the relationship between features and the target is linear) and cannot represent the true complexity of the data. While such a model tends to have low variance (its predictions are stable across different training sets), the predictions themselves are consistently inaccurate.
As model capacity increases, bias decreases because the model can represent more complex functions. At the same time, variance increases because the model becomes more sensitive to the specific training data. The classical goal in practice is to find the "sweet spot" where the combined error from bias and variance is minimized. This point represents a model that is complex enough to capture real patterns but not so complex that it fits noise. The picture is sometimes drawn as a U-shaped test error curve, with underfitting on the left side and overfitting on the right.
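As a concrete (and hedged) illustration, the sketch below uses an assumed synthetic setup, a noisy sine target fit by polynomials of varying degree, and estimates bias² and variance empirically by refitting each model on many resampled training sets:

```python
# Empirical bias-variance estimate on synthetic data (illustrative assumption, not from the text).
import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: np.sin(2 * np.pi * x)      # assumed "true" data-generating function
x_test = np.linspace(0, 1, 50)

def bias_variance(degree, n_datasets=200, n_points=30, noise=0.3):
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, n_points)
        y = true_fn(x) + rng.normal(0, noise, n_points)
        coefs = np.polyfit(x, y, degree)       # least-squares polynomial fit
        preds.append(np.polyval(coefs, x_test))
    preds = np.array(preds)
    bias_sq = np.mean((preds.mean(axis=0) - true_fn(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for degree in (1, 3, 9):
    b, v = bias_variance(degree)
    print(f"degree={degree}: bias^2={b:.3f}, variance={v:.3f}")
# Degree 1 shows high bias and low variance (underfitting); degree 9 shows the reverse.
```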
Geman, Bienenstock, and Doursat formalized this decomposition for neural networks in their influential 1992 paper, framing the bias-variance dilemma as the central challenge of generalization. Their analysis remained the dominant lens for understanding model selection until the mid-2010s, when very large neural networks began to violate its predictions. The full picture, discussed below in the section on double descent, is more nuanced: very large models can sometimes generalize well even after they perfectly fit the training data.
Underfitting manifests through several observable indicators during model training and evaluation:
| Indicator | What to look for |
|---|---|
| High training error | The model cannot accurately predict outcomes even on the data it was trained on |
| High validation error | Performance on the validation set is also poor, roughly matching the training error |
| Small gap between training and validation loss | Unlike overfitting (where training loss is low but validation loss is high), both losses remain elevated and close together |
| Flat learning curves | Training loss plateaus at a high value early and does not improve with additional training |
| Poor metric scores | Accuracy, F1, R-squared, or other evaluation metrics are well below acceptable thresholds on all data splits |
| Oversimplified predictions | The model predicts near-constant values or fails to capture obvious nonlinear trends visible in the data |
| Loss of calibration | Predicted probabilities cluster around the base rate; the model expresses no useful uncertainty |
| Failure to beat trivial baselines | A constant predictor (mean for regression, majority class for classification) performs about as well as the model |
| Oscillating loss | In some cases the loss bounces around without descending, often a sign of mismatched learning rate or optimizer |
A common diagnostic shortcut: train the model to deliberately overfit a tiny subset of the training data (say, 100 examples). If it cannot drive training loss to near zero on that small set, the model lacks the capacity required for the task and is structurally underfit. This is sometimes called the "can it memorize?" sanity check and is widely used in deep learning workflows.
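A minimal sketch of this sanity check, assuming a generic PyTorch regression model; the layer sizes and the stand-in data are placeholders rather than anything from a real workflow:

```python
import torch
import torch.nn as nn

# Tiny model and a 100-example stand-in batch (both are illustrative assumptions).
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
x_small = torch.randn(100, 20)
y_small = torch.randn(100, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(2000):          # deliberately overfit: no regularization, no validation
    optimizer.zero_grad()
    loss = loss_fn(model(x_small), y_small)
    loss.backward()
    optimizer.step()

# If this loss is not close to zero, the model (or its optimization setup) cannot even
# memorize 100 points and is structurally underfit for the real task.
print(f"final training loss on the tiny subset: {loss.item():.4f}")
```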
Several factors can lead to underfitting, often in combination. The table below summarizes the most common ones, with brief notes on how each one typically manifests.
| Cause | Mechanism | Typical signal |
|---|---|---|
| Model too simple (low capacity) | Hypothesis class cannot represent the true function | Training loss plateaus high; capacity test fails |
| Wrong model class | Linear model on nonlinear data, additive model on interactions | Residuals show clear systematic structure |
| Insufficient or irrelevant features | Inputs do not carry information about the target | Even very flexible models score near baseline |
| Poor feature engineering | Missing transformations, no interactions, no embeddings | Tree models outperform linear ones by a wide margin |
| Excessive regularization | Penalty pushes weights toward zero | Loss drops when regularization is dialed down |
| Too few training epochs | Optimizer stops before convergence | Both losses still trending downward at end of training |
| Learning rate too low | Updates are tiny relative to loss landscape | Training loss decreases very slowly, almost flat |
| Learning rate too high | Loss bounces or diverges, never settles | Oscillating training curve, NaN values |
| Heavy dropout or other stochastic noise | Effective capacity is reduced | Training and validation losses both elevated |
| Frozen layers in transfer learning | Pretrained weights cannot adapt | Strong features wasted on a head with no expressive power |
| Aggressive class-imbalance correction | Undersampling discards too much signal | Minority recall improves but overall fit degrades |
| Very small training dataset | Not enough examples to identify patterns | High variance hides as high bias |
| Mismatched optimizer | AdaGrad's diminishing learning rate, for example | Convergence stalls long before training loss is acceptable |
A few of these deserve more discussion.
The most common cause is choosing a model family that lacks the capacity to represent the true data-generating function. For example, fitting a straight line to data that follows a quadratic or exponential relationship will always produce systematic errors, no matter how much data is available or how long training runs. The model's hypothesis space simply does not contain a function close to the true one. This is sometimes called approximation error to distinguish it from estimation error (which is about how well one can choose within a given hypothesis class).
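A small sketch, on assumed synthetic data, of why more data does not help here: a straight line fit to a quadratic signal keeps roughly the same training error at any sample size.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
for n in (100, 10_000, 1_000_000):
    x = rng.uniform(-3, 3, size=(n, 1))
    y = x[:, 0] ** 2 + rng.normal(0, 0.1, size=n)   # quadratic signal, small noise
    model = LinearRegression().fit(x, y)
    mse = mean_squared_error(y, model.predict(x))
    print(f"n={n:>9}: training MSE = {mse:.3f}")
# The error plateaus at the approximation error of the linear hypothesis class; extra
# data shrinks estimation error but cannot remove the systematic miss.
```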
Even a model with adequate capacity will underfit if the input features do not carry enough information to predict the target. If key variables are missing from the dataset, or if the features provided are only weakly correlated with the outcome, the model has nothing meaningful to learn from. Poor feature engineering (failing to create interaction terms, polynomial features, or domain-specific transformations) can also limit the model's ability to detect patterns. Trees and gradient-boosted models can recover some of this through implicit feature crosses, but they cannot invent features that are not derivable from the inputs.
Techniques such as L1 (Lasso) and L2 (Ridge) regularization, dropout in neural networks, and early stopping are designed to prevent overfitting by constraining model complexity. However, if the regularization strength is set too high, these techniques can suppress the model's ability to learn genuine patterns. A very large L2 penalty will shrink all model weights toward zero, effectively reducing a complex model to a near-constant prediction. Similarly, a dropout rate of 0.8 in a small network can leave so few active neurons per forward pass that the network never accumulates a stable signal.
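A sketch of the effect on assumed synthetic data: a moderate L2 penalty leaves a genuine linear signal intact, while an enormous one collapses the model toward predicting the mean.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 0.5, size=500)   # genuine linear signal

for alpha in (0.1, 10.0, 1e6):
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:>9}: training R^2 = {r2_score(y, model.predict(X)):.3f}, "
          f"max |coef| = {np.abs(model.coef_).max():.4f}")
# At alpha = 1e6 the coefficients are squeezed to near zero and the model predicts
# roughly the mean of y: underfitting induced entirely by the regularizer.
```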
A model can have plenty of capacity on paper but still underfit because the optimizer fails to find a good minimum. The most common version is a learning rate set too low: training loss creeps down but never reaches a useful level within the available training budget. The opposite (a learning rate too high) can also cause underfitting, since the loss never settles into a basin and the network behaves as though it cannot learn at all. Adaptive optimizers like Adam reduce sensitivity to the initial learning rate, while AdaGrad has the well-known issue that its accumulated gradient sums shrink the effective learning rate to near zero, sometimes causing premature underfitting.
A model that has not been trained for enough epochs may not have had time to converge to a good solution. This is especially relevant for deep neural networks, which can require hundreds or thousands of epochs to fully learn complex representations. Learning rate schedules, warmup steps, and patience settings on early stopping all interact here. A common mistake is to treat early stopping as a free lunch and stop training the moment validation loss flattens, when in fact the model would have continued to improve given a few more epochs.
When the training dataset is very small, even a well-chosen model may not have enough examples to identify robust patterns. The model may fail to distinguish signal from noise, leading to poor performance that resembles underfitting. In financial machine learning, where the signal-to-noise ratio is famously low, even high-capacity models may barely improve on a trivial baseline because the noise floor itself is high; the irreducible error term in the bias-variance decomposition dominates.
Learning curves are one of the most practical tools for diagnosing underfitting. A learning curve plots training and validation loss (or another performance metric) against the number of training iterations or the size of the training set.
When a model is underfitting, both the training loss and validation loss converge to a high value. The training loss may decrease slightly in the first few epochs but then flattens out well above an acceptable threshold. The validation loss follows a similar trajectory and settles close to the training loss. The small gap between the two curves confirms that the problem is not overfitting (which would show a large gap) but rather that the model lacks the capacity to learn.
If both curves are still decreasing at the end of training, this suggests the model might benefit from additional epochs. The model may have the capacity to learn but was not given enough time. Some optimizers also produce oscillating loss curves when the learning rate is too aggressive; in that case the issue is optimization rather than capacity, and the fix is to lower the learning rate or switch to an adaptive optimizer.
| Pattern | Training loss | Validation loss | Gap | Diagnosis |
|---|---|---|---|---|
| Good fit | Low | Low (slightly higher than training) | Small | Model generalizes well |
| Underfitting | High | High | Small | Model is too simple or undertrained |
| Overfitting | Very low | High | Large | Model memorizes training data |
| Still converging | Decreasing | Decreasing | Moderate | Train longer or use more data |
| Diverging loss | Bouncing or rising | Bouncing or rising | Variable | Learning rate too high or optimizer broken |
| Plateau then drop | Initially flat, then improves | Follows training | Variable | Possible double-descent transition or warmup ending |
Another useful diagnostic plots performance against model capacity rather than training time. Sweep the depth of a tree, the polynomial degree, or the width of a network and record training and validation error at each setting. A purely underfit regime shows both errors high and falling together as capacity grows. The classical U-curve appears as the model crosses into the overfitting region. In modern overparameterized regimes, this sweep often reveals a second descent past the interpolation threshold, the hallmark of the double descent phenomenon.
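A sketch of such a capacity sweep on assumed synthetic data, using scikit-learn's validation_curve with polynomial degree as the capacity axis:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.2, size=200)

degrees = [1, 2, 3, 5, 9, 15]
model = make_pipeline(PolynomialFeatures(), LinearRegression())
train_scores, val_scores = validation_curve(
    model, X, y,
    param_name="polynomialfeatures__degree",
    param_range=degrees,
    scoring="neg_mean_squared_error",
    cv=5,
)
for d, tr, va in zip(degrees, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
    print(f"degree={d:>2}: train MSE = {tr:.3f}, validation MSE = {va:.3f}")
# Low degrees: both errors high and close together (underfitting). Middle degrees: both
# near the noise floor. High degrees: the gap reopens as validation error rises.
```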
Underfitting can occur in any model family, though it manifests differently depending on the algorithm:
Linear regression and logistic regression assume a linear relationship between input features and the target. When the true relationship is nonlinear (for example, a U-shaped curve or a step function), these models will systematically underfit because no linear function can approximate the true pattern. The classic example is fitting a straight line to data that follows a parabola: the line will always miss the curvature, producing high residuals at the extremes. Bishop's textbook opens its model-fitting chapter with exactly this example, using a sine wave fit by polynomials of varying degree to make the underfitting and overfitting failure modes visible side by side.
A decision tree with a very low maximum depth or a high minimum samples per leaf is forced to make splits based on only the broadest distinctions in the data. Such a tree cannot capture fine-grained patterns, interactions between features, or nonlinear boundaries. For instance, a depth-2 tree used to classify images will likely perform poorly because it can only partition the feature space into a handful of regions. Random forests and gradient boosting are partly designed to mitigate this by combining many shallow trees, each contributing a small piece of the decision surface.
A neural network with too few layers or too few neurons per layer lacks the representational power to approximate complex functions. According to the universal approximation theorem, a sufficiently wide single-hidden-layer network can approximate any continuous function, but "sufficiently wide" may mean thousands of neurons. In practice, shallow or narrow networks trained on complex tasks (such as image recognition or natural language processing) will underfit because they cannot learn the hierarchical feature representations that these tasks require.
The Naive Bayes classifier assumes that all input features are conditionally independent given the class label. When features are strongly correlated (as they often are in real-world data), this assumption leads to systematic prediction errors, a form of underfitting caused by the model's overly restrictive assumptions rather than by insufficient complexity in the traditional sense. The model's hypothesis class is large enough numerically (one parameter per feature per class) but the structural assumption blocks it from representing useful interactions.
In transfer learning, a common cause of underfitting is freezing too many layers of a pretrained network. If only a small classification head is trainable on top of frozen features, and those features are mismatched with the new task, the model cannot adapt and underfits. The standard remedy is to progressively unfreeze higher layers and fine-tune them with a low learning rate. This is one of the few cases where underfitting is more about plumbing (which weights receive gradients) than about raw model size.
A practical example often seen in the deep learning literature is the linear probe used to evaluate representations from a large language model. The probe is intentionally a tiny linear classifier on top of a frozen embedding. If the probe's score is low, the practitioner usually suspects the representation rather than the probe; but a probe that is too constrained can underfit useful information that a slightly larger nonlinear head would expose. Distinguishing model-side underfitting from probe-side underfitting requires careful experimental design.
Addressing underfitting requires giving the model more flexibility, better information, or more time to learn. The table below summarizes common fixes.
| Fix | When to use | What to watch for |
|---|---|---|
| Increase model capacity (more layers, more parameters) | Training loss high, no improvement with longer training | Watch for overfitting once gap opens between train and validation |
| Use a richer model class (tree, kernel, deep network) | Linear baseline falls well short of the performance target | May need more compute or memory |
| Add or engineer features | Tree models beat linear models by a wide margin | Risk of feature leakage if not careful |
| Reduce regularization (lower lambda, lower dropout) | Loss drops sharply when regularization is reduced | Test set performance is the final arbiter |
| Train longer | Both losses still trending down | Use validation curve to avoid overshooting into overfitting |
| Lower learning rate or add warmup | Loss is oscillating | Convergence becomes slower; tune patience |
| Switch optimizer (Adam, AdamW) | SGD with low learning rate stalls | Adam can have worse generalization; tune weight decay |
| Use early stopping more leniently | Stopping fires before model converges | Increase patience and minimum delta |
| Reduce dropout rate | Network is shallow or narrow | Dropout below 0.1 often acts as a no-op |
| Switch to ensemble or boosting | Single weak learner cannot capture pattern | Boosting can convert high-bias learners into low-bias ensembles |
| Use a more powerful pretrained backbone | Frozen features mismatch task | Larger backbones cost more inference compute |
| Improve data quality | Labels noisy or features ambiguous | Sometimes the cheapest fix and the most impactful |
| Gather more data | Genuine signal exists but is buried in noise | Diminishing returns past a problem-specific point |
A few of these are worth elaborating.
The most direct remedy is to use a more expressive model. Options include adding more layers or neurons to a neural network, increasing the degree of a polynomial regression, allowing deeper splits in a decision tree, or switching from a linear model to a nonlinear one (for example, from linear regression to a random forest or gradient boosting model). The principle is to grow capacity in the smallest increment that solves the problem; jumping straight from a linear model to a billion-parameter transformer rarely makes sense and usually buys engineering complexity without much accuracy.
Creating new features that better capture the underlying relationships can substantially improve model performance without changing the model architecture. Common techniques include adding polynomial or interaction terms, applying domain-specific transformations (log, square root), encoding cyclical features (day of week, month), and using embeddings for categorical variables. Feature engineering is sometimes the highest-leverage intervention available, especially in tabular problems where neural networks tend to lose to gradient-boosted trees.
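A brief sketch with hypothetical column names: the same linear model, before and after adding interaction features, on data where the signal lives in an interaction.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, 1000),
    "monthly_usage": rng.uniform(0, 100, 1000),
})
# Assumed target: driven by an interaction that the raw features do not expose linearly.
y = 0.05 * df["tenure_months"] * df["monthly_usage"] + rng.normal(0, 10, 1000)

plain = LinearRegression().fit(df, y)
engineered = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
).fit(df, y)
print("raw features      R^2:", round(plain.score(df, y), 3))
print("with interactions R^2:", round(engineered.score(df, y), 3))
```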
If the model has adequate capacity but is being penalized too heavily, reducing the regularization strength can help. This means lowering the lambda parameter in L1/L2 regularization, decreasing the dropout rate in neural networks, or relaxing constraints on tree depth and minimum leaf size. A useful diagnostic is to set regularization to near zero and verify that training loss drops sharply; if it does, the previous setting was too aggressive.
If the learning curves show that both training and validation loss are still decreasing at the end of training, the model may simply need more epochs. Increasing the training budget, adjusting the learning rate schedule, or using a learning rate warmup can all help the model converge to a better solution. When loss oscillates rather than descending, the issue is usually the optimizer: lowering the learning rate, switching from SGD with low learning rate to Adam, or adding gradient clipping often unblocks training. AdaGrad, in particular, can underfit because its accumulated squared-gradient denominator drives the effective learning rate to near zero in long runs.
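A hedged PyTorch sketch of the warmup-then-decay pattern; the model, step counts, and hyperparameters below are placeholders, not values taken from the text.

```python
import math
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                        # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
warmup_steps, total_steps = 2_000, 20_000

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                      # linear warmup from 0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))           # cosine decay to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):                                 # skeleton training loop
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(torch.randn(32, 10)), torch.randn(32, 1))
    loss.backward()
    optimizer.step()
    scheduler.step()            # step the schedule after each optimizer update
```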
Ensemble methods, particularly boosting algorithms such as AdaBoost and gradient boosting, are specifically designed to reduce bias. Boosting works by sequentially training weak learners (often shallow decision trees) and focusing each new learner on the mistakes of the previous ones. The result is a strong learner that can capture complex patterns even when the individual base models are simple. This is the standard escape hatch when a constrained per-model capacity is required (for example, for interpretability) but the resulting model underfits.
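A short sketch on assumed synthetic data: a single depth-1 tree (a stump) underfits a nonlinear signal badly, while boosting several hundred stumps recovers it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.3 * X[:, 0] ** 2 + rng.normal(0, 0.1, size=2000)

stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
boosted = GradientBoostingRegressor(max_depth=1, n_estimators=300,
                                    learning_rate=0.1).fit(X, y)
print("single stump   R^2:", round(stump.score(X, y), 3))
print("boosted stumps R^2:", round(boosted.score(X, y), 3))
# The base learner's bias stays high; the sequential ensemble drives it down.
```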
Adding more training examples, improving data quality, or incorporating additional features from external sources can provide the model with the information it needs to learn meaningful patterns. In practice, label noise is often the first thing to fix: a model can only fit signal that is actually present, and a 10% label error rate puts a hard ceiling on training accuracy.
When underfitting appears in a transfer-learning setup, the standard playbook is to unfreeze the upper layers of the pretrained backbone and fine-tune them with a low learning rate. Keras and PyTorch tutorials both recommend a two-phase approach: train the new head first while the backbone is frozen, then unfreeze the upper backbone layers for a low-learning-rate fine-tune. Underfitting that disappears after unfreezing is a sign that the pretrained features were close but not quite right for the task.
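A hedged PyTorch sketch of the two-phase recipe; the backbone choice, layer names, and learning rates are illustrative assumptions rather than a prescribed API.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V2")     # assumed pretrained backbone
for p in model.parameters():                         # phase 1: freeze the whole backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)       # new head for a hypothetical 10-class task

head_optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# ... train only the head to convergence here ...

for p in model.layer4.parameters():                  # phase 2: unfreeze the top block
    p.requires_grad = True
fine_tune_optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
# ... resume training at the lower learning rate; underfitting that disappears here
# points to features that were close but not adaptable while frozen ...
```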
Underfitting and overfitting represent opposite ends of the model complexity spectrum. The following table summarizes their key differences:
| Aspect | Underfitting | Overfitting |
|---|---|---|
| Bias | High | Low |
| Variance | Low | High |
| Training error | High | Low (often near zero) |
| Validation/test error | High | High |
| Training-validation gap | Small | Large |
| Model complexity | Too low | Too high |
| Sensitivity to data subset | Low; predictions stable across resamples | High; predictions swing on different splits |
| Calibration | Often miscalibrated toward the mean | Confidently wrong on out-of-distribution inputs |
| Learning curve shape | Both curves plateau at a high loss | Training loss is low; validation loss diverges |
| Primary cause | Model too simple, insufficient features, excessive regularization | Model too complex, too little regularization, noise in training data |
| Typical fix | Increase complexity, add features, reduce regularization | Simplify model, add regularization, use more training data |
| Effect of more data | Limited; bias persists | Helpful; reduces variance |
| Effect of more training time | Helps if the model has capacity | Harms; locks in noise |
In practice, practitioners aim for the point between these two extremes where the model captures real patterns in the data without fitting noise. This is sometimes called the "Goldilocks zone" of model complexity. The two failure modes are not symmetric in their costs: an overfit model can be patched with regularization or more data, while a structurally underfit model often requires throwing the architecture out and starting again.
The concept of model capacity formalizes a model's ability to fit a wide variety of functions. A model with low capacity can only represent simple functions and is prone to underfitting, while a model with very high capacity can represent extremely complex functions but risks overfitting in classical analysis.
One way to measure capacity is the Vapnik-Chervonenkis (VC) dimension, introduced by Vladimir Vapnik and Alexey Chervonenkis. The VC dimension of a hypothesis class is the size of the largest set of points that the class can shatter, that is, classify correctly under every possible labeling of that set. For example, a linear classifier in two dimensions has a VC dimension of 3: there exists a set of 3 points (in general position) that it can label in every possible way, but no set of 4 points can be shattered. A decision tree of depth d on binary features has a VC dimension that grows with d, while a neural network's VC dimension grows polynomially with the number of parameters in many configurations.
Statistical learning theory shows that the generalization error is bounded by a function of both the empirical (training) error and a complexity penalty proportional to the VC dimension. A model with a VC dimension far below what the problem requires will have high empirical error (underfitting), while one with an excessively high VC dimension relative to the number of training samples will have a large complexity penalty (overfitting). The number of training samples needed for good generalization is roughly proportional to the VC dimension of the hypothesis class.
Structural risk minimization (SRM), also due to Vapnik, is a principle that selects the model with the lowest upper bound on generalization error, balancing the training error against the complexity penalty. This provides a theoretical foundation for the practical advice to choose the simplest model that fits the data adequately. SRM was the operative principle behind the design of support vector machines, which select the maximum-margin hyperplane partly to control effective capacity.
The VC framework gives a clean explanation for classical underfitting: if the VC dimension of the chosen hypothesis class is too small, the empirical risk minimizer within that class will have a positive lower bound on training error that no amount of data can eliminate. The framework also predicts a U-shaped test-error curve, which held up empirically for decades. Modern overparameterized neural networks complicate this picture in ways the original theory did not anticipate.
One of the most striking developments in machine learning theory in the late 2010s was the observation of double descent, named and characterized by Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal in their 2019 paper "Reconciling modern machine-learning practice and the classical bias-variance trade-off," published in the Proceedings of the National Academy of Sciences.
The classical picture predicts a U-shaped test-error curve: too little capacity gives underfitting, too much gives overfitting, and the optimum sits in the middle. Belkin and colleagues showed that this curve is only the left portion of a larger pattern. As capacity grows past the interpolation threshold (the point at which the model has just enough parameters to fit every training point exactly), test error first spikes and then begins to descend again. Past that second descent, very large overparameterized models can generalize as well as or better than the optimum of the classical U-curve.
The immediate consequence for underfitting is that the textbook intuition ("add more parameters and you start overfitting") is incomplete. In modern deep learning, very wide networks rarely underfit, even with minimal explicit regularization. The implicit regularization of stochastic gradient descent and the structure of the loss landscape together push solutions toward flat minima that generalize well. Belkin's team demonstrated this empirically across random Fourier features, decision trees, AdaBoost, and small neural networks, suggesting the phenomenon is not an artifact of any one architecture.
For LLM pretraining, the picture is even more skewed: trillion-parameter models trained on internet-scale corpora rarely underfit in the classical sense. They are usually compute-bound rather than capacity-bound, and the closest equivalent of underfitting is a model that has not yet seen enough tokens. The Chinchilla scaling laws made this point quantitative: many earlier large models were undertrained for their size, a different failure mode that looks like underfitting on downstream tasks even though the architecture has plenty of capacity.
The practical takeaway is not that bias and variance no longer matter. They still do for classical models, for tabular problems, and for any setting where data is small relative to model size. But the boundary between underfitting and overfitting is more porous in deep learning than it appears in textbooks, and "my model is too small" should be evaluated against the size of the problem, the size of the dataset, and the implicit regularization properties of the optimizer.
Underfitting is usually a problem to fix, but there are settings where a slightly underfit model is the better choice.
The most common is interpretability. A linear model with a handful of coefficients is easier to explain to a regulator, a doctor, or a credit committee than a 200-tree gradient-boosted ensemble, even if the tree ensemble has higher accuracy. In credit scoring, healthcare risk stratification, and certain regulatory settings, models are often deliberately constrained to logistic regression or shallow scorecards because the cost of an unexplainable decision can outweigh the cost of a small accuracy loss. A model that mildly underfits but produces a clear chain of reasoning may be preferable to a black box that fits perfectly.
A related case is Occam's razor as a model-selection principle. When two models have similar performance on held-out data, the simpler one is usually preferred because it is less likely to depend on idiosyncrasies of the training set and easier to maintain in production. Structural risk minimization formalizes this preference: among models with the same training error, choose the one with the smaller capacity. The tradeoff is that pushing simplicity too far moves the model into structural underfitting; pushing it just enough produces a model that is parsimonious without being broken.
Underfit models are also useful as baselines. A constant predictor or a logistic regression with a handful of features is a sanity check: any more complex model worth deploying should beat it by a meaningful margin. If the elaborate model only edges out the underfit baseline, the elaborate model is probably not capturing useful structure and the apparent improvement may not survive a different data split.
In the era of overparameterized models, the relationship between capacity and underfitting has shifted in several specific ways.
Large transformer language models almost never underfit in the classical sense; their parameter counts are vast compared to typical fine-tuning datasets. When they perform poorly on a downstream task, the cause is usually distribution shift, prompt sensitivity, or insufficient pretraining tokens rather than insufficient capacity. The Chinchilla scaling laws (Hoffmann et al., 2022) reframed this: many large models are undertrained for their parameter count, which can look like underfitting even though the architecture has the capacity to fit the task in principle.
Fine-tuning is a more frequent source of underfitting than pretraining. A LoRA adapter with a very small rank or a prompt-tuning soft prompt with too few learnable tokens can be incapable of representing the desired task even though the underlying base model has plenty of capacity. The standard fix is to increase the LoRA rank, unfreeze additional layers, or move to full fine-tuning if compute allows.
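A hedged sketch using the Hugging Face peft library (assumed available); the base model and target module names are illustrative, and the right rank is task-dependent.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")       # stand-in base model

# A rank-4 adapter on attention projections only may underfit a demanding task...
low_capacity = LoraConfig(r=4, lora_alpha=8, target_modules=["c_attn"],
                          task_type="CAUSAL_LM")
# ...raising the rank and covering more projection matrices is the usual first response.
higher_capacity = LoraConfig(r=32, lora_alpha=64,
                             target_modules=["c_attn", "c_proj"],
                             task_type="CAUSAL_LM")

model = get_peft_model(base, higher_capacity)
model.print_trainable_parameters()    # shows how much capacity the adapter actually adds
```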
In computer vision and other domains, frozen feature extractors followed by a linear head are a classic source of mild underfitting. If the pretrained features are slightly mismatched with the target task, only a small accuracy improvement is possible until the upper layers are unfrozen. Two-phase fine-tuning, where the head is trained first and the backbone is then partially unfrozen with a low learning rate, is the standard practice.
Reinforcement-learning agents can underfit when the policy network is too small to represent the optimal action distribution, or when the value network is unable to track the temporal-difference targets. In contrast to supervised settings, underfitting in RL is harder to diagnose because both training and validation distributions shift as the policy improves; the closest equivalent of training loss is a moving target.
Class-imbalanced datasets create their own underfitting risks. The most common pattern: an aggressive undersampling strategy throws away majority-class examples to balance the training set, but in doing so it discards information that the model needed to learn. The minority class score may improve, but the overall model becomes simpler and more biased toward an idealized 50-50 view of the world that does not match deployment.
A 2025 simulation study by Carriero and colleagues in Statistics in Medicine found that for several common classifiers (logistic regression, random forest, XGBoost), correcting for moderate class imbalance was not necessary and could harm calibration. The takeaway is not that imbalance never matters, but that the cure can introduce its own underfitting. Practitioners working in clinical machine learning increasingly start with the natural class distribution and only intervene when the cost of one class of error clearly dominates.
A related issue is the use of class weights that are too extreme: setting a positive class weight of 100 in a problem with a 1:10 imbalance can collapse the decision boundary to predict the minority class everywhere, which is a form of underfitting in disguise. Calibrated thresholds on the natural distribution often work as well as resampling without the underfitting risk.
In medical machine learning, underfitting has unique stakes. A model trained to predict sepsis from vital signs that uses only a handful of features (temperature, heart rate, blood pressure) may underfit by missing the more subtle interaction patterns visible in lab values, medication histories, and trajectories over time. The fix in practice is rarely "train a bigger neural network"; it is to engineer richer features in collaboration with clinicians, or to switch from a linear scorecard to a gradient-boosted tree that can express interactions.
Medical models are also constrained by regulation and by the need for interpretability. A logistic regression that mildly underfits may be deployable, where a billion-parameter neural network is not. The tradeoff is explicit: some accuracy is left on the table in exchange for a model whose reasoning a clinician can audit. A 2024 Pitfalls and Best Practices review in the NCBI Bookshelf series on AI in health care discussed both overfitting and underfitting as parallel concerns, noting that overconfident underfit models (low accuracy combined with miscalibrated probabilities) can be more dangerous than overfit ones because they are easier to mistake for trustworthy.
Financial machine learning operates in a famously low signal-to-noise regime. A stock-return prediction model that achieves 51% accuracy on a binary up-down classification is doing well; the irreducible error is large. In this setting, both very simple models (linear regression on a single technical indicator) and very complex ones (deep transformers on tick data) tend to look underfit relative to the practitioner's hopes, because the underlying noise floor is high.
The practical pattern in quantitative trading is to use ensembles of simple models with carefully selected features, accepting that any individual model will look underfit. Voting schemes across many weak signals can produce useful aggregate predictions even when each component has high bias. The opposite failure (overfitting to historical price patterns) is more often discussed in trading literature than underfitting, but underfitting is the silent partner: a model that looks robust because it cannot find any signal at all is functionally useless even if it does not overfit.
Credit scoring sits between healthcare and finance in terms of constraints. Models must be explainable to regulators and to consumers (under regulations like the U.S. Equal Credit Opportunity Act), which favors logistic regression scorecards. Such models often underfit relative to gradient-boosted alternatives, but the underfitting is a deliberate tradeoff for explainability. Recent work on interpretable machine learning for credit scoring has tried to combine the accuracy of tree ensembles with post-hoc explanations (such as SHAP values), but the underlying model still tends to be more constrained than a pure-accuracy approach would choose.
A few illustrative examples bring the diagnostics together.
A team trains a logistic regression on a customer-churn dataset. Training accuracy is 72%, validation accuracy is 71%, and a constant predictor scores 68%. The small gap between training and validation, combined with the fact that the model barely beats the baseline, is the textbook signature of underfitting. Switching to a random forest pushes both scores into the high 80s, confirming that the linear model lacked capacity. Engineering a few interaction features (tenure x usage, plan x age) closes most of the remaining gap.
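A schematic reconstruction of that diagnostic on synthetic stand-in data (not the team's dataset, so the numbers will differ): compare a trivial baseline, a linear model, and a random forest on the same split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           class_sep=0.8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

models = [("constant baseline", DummyClassifier(strategy="most_frequent")),
          ("logistic regression", LogisticRegression(max_iter=1000)),
          ("random forest", RandomForestClassifier(n_estimators=200, random_state=0))]
for name, clf in models:
    clf.fit(X_tr, y_tr)
    print(f"{name:>20}: train = {clf.score(X_tr, y_tr):.3f}, "
          f"validation = {clf.score(X_val, y_val):.3f}")
# A linear model that barely clears the constant baseline while matching its own
# validation score shows the small-gap, low-score signature described above.
```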
A computer vision team fine-tunes a ResNet-50 by freezing all layers and training only a 1000-class linear head. Top-1 accuracy plateaus at 60%. Both training and validation accuracy track each other closely. Unfreezing the last residual block and resuming training with a learning rate of 1e-4 lifts both scores by 8 points. The pretrained features were close but not quite right for the target distribution; the linear head alone could not bridge the gap.
A quant team trains a gradient-boosted regressor on minute-level price data with strong L2 regularization on the leaf weights. Training and validation R-squared are both 0.02, which sounds like underfitting until they realize that the noise at this horizon caps the achievable R-squared at roughly 0.025. The model is fine; the data is not informative at this horizon. The fix is upstream: gather alternative-data features or move to a longer prediction horizon where the signal is stronger.
A clinical risk-stratification model trained with aggressive undersampling (1:1 majority-to-minority) shows 88% recall but only 62% precision on the minority class. The model has been pushed into a high-bias regime by the resampling, with calibration knocked out as a side effect. Refitting on the natural class distribution and choosing the threshold that meets the operational recall target restores calibration.
A team training a small transformer on a translation task observes that training loss flattens at a perplexity of 30 within a few epochs and refuses to drop further. They check the optimizer: Adam with a peak learning rate of 1e-3 and no warmup. Adding 2000 warmup steps and switching to AdamW with a small weight decay drops perplexity into the low teens within the same epoch budget. The original underfitting was an optimization failure rather than a capacity failure.
A systematic approach to diagnosing and fixing underfitting pulls the preceding diagnostics into a rough order of operations:

1. Compare the model against a trivial baseline (mean predictor for regression, majority class for classification).
2. Run the "can it memorize?" check on a tiny subset of the training data.
3. Plot learning curves and inspect the gap between training and validation loss.
4. Sweep model capacity (tree depth, polynomial degree, network width) and watch where both errors stop falling together.
5. Dial regularization down and check whether training loss drops sharply.
6. Audit the optimizer, learning rate, and schedule before blaming the architecture.
7. Only then add capacity, engineer richer features, or gather more data.
Imagine you are trying to draw a picture of a cat, but you can only use a ruler to draw straight lines. No matter how hard you try, a few straight lines will never look like a real cat because cats have curves, soft fur, and round eyes. Your drawing is too simple to capture what a cat actually looks like.
That is what underfitting means for a computer. The computer is given a tool (the model) that is too simple for the job. It tries its best, but it misses the important details because it does not have the right tools to capture them. The fix is to give the computer a better tool, like colored pencils and curves, so it can draw something that actually looks like a cat. Sometimes the fix is even simpler: the computer just needs more time to practice, or someone forgot to take the safety bumpers off (too much regularization), or it is using a pen that is running out of ink (a learning rate that is too low).