A regression model is a type of supervised learning algorithm designed to predict continuous numerical output values based on one or more input features. Unlike classification models, which assign inputs to discrete categories, regression models estimate real-valued quantities such as prices, temperatures, distances, or probabilities. Regression sits at the foundation of both classical statistics and modern machine learning, and it remains one of the most widely used modeling techniques across science, engineering, finance, and industry.
The roots of regression analysis trace back to the early 19th century. Adrien-Marie Legendre published the method of least squares in 1805 in his paper "New Methods for the Determination of the Orbits of Comets," providing the first formal articulation of fitting a model by minimizing squared errors. Carl Friedrich Gauss later claimed he had been using the same technique since 1795, publishing his account in 1809 in "Theoria Motus Corporum Coelestium." Gauss went further than Legendre by connecting least squares to probability theory and the normal distribution, laying the mathematical groundwork for modern regression.[1]
The term "regression" itself was coined by Sir Francis Galton in the 1880s during his studies of hereditary traits, where he observed that extreme characteristics in parents tended to "regress" toward the population mean in their offspring. Karl Pearson and others later formalized regression as a general statistical tool. Today, the concept extends well beyond its origins, encompassing dozens of algorithms from simple linear fits to deep neural network architectures.
At a high level, a regression model learns a function f that maps input features x to a continuous target variable y:
y = f(x) + e
where e represents irreducible error (noise). During training, the model adjusts its parameters to minimize a chosen loss function, such as mean squared error. The trained model then generalizes to unseen data, producing predictions for new inputs.
Regression models can be divided into two broad families: parametric models, which assume a fixed functional form for f (such as linear and polynomial regression), and non-parametric models, which let the data determine the shape of the fit (such as kernel methods, tree-based models, and neural networks). The most widely used algorithms in both families are described below.
Linear regression is the simplest and most fundamental regression technique. It models the relationship between input features and the target as a weighted linear combination:
y = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b
where w represents feature weights and b is the bias (intercept) term. The model is trained by minimizing the sum of squared residuals using ordinary least squares (OLS) or gradient-based optimization. Linear regression is fast, interpretable, and works well when the true relationship between variables is approximately linear.[2]
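As a minimal sketch of this in practice, the following uses scikit-learn's `LinearRegression` on made-up data (the features and values are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: predict price from two made-up features (size in sq ft, age in years).
X = np.array([[1400, 10], [1600, 5], [1700, 20], [1875, 2], [2350, 15]])
y = np.array([245000, 312000, 279000, 308000, 405000])

model = LinearRegression()          # fits via ordinary least squares
model.fit(X, y)

print(model.coef_)                  # learned weights w_1, w_2
print(model.intercept_)             # bias (intercept) term b
print(model.predict([[2000, 8]]))   # prediction for a new, unseen input
```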
Polynomial regression extends linear regression by introducing higher-order terms (x^2, x^3, etc.) as additional features. A degree-n polynomial model can capture curved, non-linear patterns that a straight line cannot. However, high-degree polynomials risk overfitting, especially with limited training data. Polynomial regression is technically a special case of linear regression because the model remains linear in its parameters, even though the relationship with the original input variable is non-linear.
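A short sketch of this idea, assuming scikit-learn and a synthetic quadratic signal (the degree and noise level are arbitrary choices for the example):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic curved data with noise.
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 - X.ravel() + rng.normal(scale=0.3, size=50)

# PolynomialFeatures adds the x^2 column; the model remains linear in its parameters.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print(poly_model.predict([[1.5]]))
```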
Ridge regression adds an L2 penalty to the ordinary least squares objective, shrinking the magnitude of all coefficients toward zero:
Loss = MSE + alpha * sum(w_i^2)
The regularization parameter alpha controls the strength of the penalty. Ridge regression is particularly useful when features are correlated (multicollinearity), as it stabilizes coefficient estimates and reduces variance at the cost of introducing a small amount of bias. Importantly, ridge regression keeps all features in the model; it never sets coefficients exactly to zero.[3]
Lasso (Least Absolute Shrinkage and Selection Operator) regression uses an L1 penalty instead:
Loss = MSE + alpha * sum(|w_i|)
The L1 penalty encourages sparsity by driving some coefficients to exactly zero, effectively performing automatic feature selection. This makes lasso regression valuable when many features are irrelevant or redundant. However, lasso tends to select only one feature from a group of correlated features and set the rest to zero, which may not always be desirable.[3]
Elastic net combines the L1 and L2 penalties, balancing the strengths of both ridge and lasso:
Loss = MSE + alpha_1 * sum(|w_i|) + alpha_2 * sum(w_i^2)
This hybrid approach is useful when there are groups of correlated features. Where lasso would arbitrarily select one from each group, elastic net tends to select or exclude the entire group together. The mixing ratio between L1 and L2 is controlled by an additional hyperparameter.[3]
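The sparsity difference between the three penalties can be seen in a small sketch. Note that scikit-learn parameterizes elastic net with a single `alpha` plus an `l1_ratio` mixing parameter rather than the two separate alphas in the formula above; the data here is synthetic:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.datasets import make_regression

# Synthetic data in which only 5 of 20 features are informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)                      # L2: shrinks all coefficients, none exactly zero
lasso = Lasso(alpha=1.0).fit(X, y)                      # L1: drives some coefficients exactly to zero
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)    # blend of L1 and L2 penalties

print("zero coefficients (ridge):", np.sum(ridge.coef_ == 0))
print("zero coefficients (lasso):", np.sum(lasso.coef_ == 0))
print("zero coefficients (elastic net):", np.sum(enet.coef_ == 0))
```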
Support vector regression adapts the principles of support vector machines to regression tasks. SVR fits data within an epsilon-insensitive tube, meaning errors smaller than a threshold epsilon are ignored entirely. Points outside the tube become support vectors that define the model. By applying the kernel trick, SVR can model non-linear relationships by projecting data into a higher-dimensional space where a linear fit in that space corresponds to a complex curve in the original space. SVR is effective for small to medium datasets but becomes computationally expensive for very large datasets.[4]
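A minimal SVR sketch with an RBF kernel on a noisy sine wave; the values of C and epsilon are illustrative, and feature scaling is included because SVR is sensitive to feature magnitudes:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Noisy sine wave; the RBF kernel lets a linear fit in kernel space follow the curve.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# epsilon sets the width of the insensitive tube; C trades off flatness against tolerated errors.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
svr.fit(X, y)
print(svr.predict([[2.5]]))
```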
A decision tree regression model partitions the feature space into rectangular regions by recursively splitting on feature thresholds. Each leaf node contains the mean (or median) target value of the training samples that fall into that region. Decision trees are intuitive, require minimal data preprocessing, and can capture non-linear interactions. Their main weakness is high variance: small changes in the data can produce very different tree structures, leading to overfitting.
Random forest regression is an ensemble method that builds many decision trees on bootstrapped subsets of the data (bagging) and averages their predictions. By combining diverse trees, random forests reduce variance and improve generalization compared to a single decision tree. For regression, the final prediction is the mean of all individual tree predictions. Random forests handle high-dimensional data well, are robust to outliers, and require relatively little hyperparameter tuning.[5]
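The variance-reduction effect can be illustrated with a quick sketch contrasting a single tree with a forest on synthetic data (hyperparameters are defaults or illustrative):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Each tree in the forest sees a bootstrap sample; the prediction is the mean over all trees.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

print("single tree R^2:", tree.score(X_test, y_test))
print("random forest R^2:", forest.score(X_test, y_test))
```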
Gradient boosting builds an ensemble of decision trees sequentially, where each new tree is trained to correct the residual errors of the combined ensemble so far. This additive approach optimizes a differentiable loss function using gradient descent in function space. Modern implementations such as XGBoost, LightGBM, and CatBoost add regularization, parallel processing, and efficient handling of categorical features. Gradient boosting consistently achieves top performance on structured tabular data and dominates machine learning competitions.[6]
The following table summarizes the key frameworks:
| Framework | Key Innovation | Best For |
|---|---|---|
| XGBoost | L1/L2 regularization, parallel tree construction | General-purpose tabular data |
| LightGBM | Histogram-based learning, leaf-wise tree growth | Large-scale datasets with millions of rows |
| CatBoost | Native categorical feature handling, ordered boosting | Datasets rich in categorical features |
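As a minimal sketch of the boosting idea, the example below uses scikit-learn's built-in `GradientBoostingRegressor` rather than any of the frameworks above; XGBoost, LightGBM, and CatBoost expose similar scikit-learn-style estimators. Data and hyperparameters are illustrative:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees are added sequentially, each one fit to the residual errors of the ensemble so far.
gbr = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3, random_state=0)
gbr.fit(X_train, y_train)
print(gbr.score(X_test, y_test))   # R^2 on held-out data
```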
Deep neural networks can serve as powerful regression models by using a single linear output neuron (no activation function) in the final layer. The network learns hierarchical representations of the input data through hidden layers with non-linear activation functions. Common loss functions for neural network regression include MSE and mean absolute error. Convolutional neural networks (CNNs) are widely used for regression in computer vision tasks such as depth estimation and pose estimation, while recurrent architectures and transformers handle sequential regression problems like time series forecasting.[7]
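A small PyTorch sketch of neural network regression, assuming synthetic data; the key point is the final `nn.Linear(..., 1)` layer with no activation, trained against MSE:

```python
import torch
from torch import nn

# Toy data: 200 samples, 8 features, noisy non-linear target.
X = torch.randn(200, 8)
y = (2 * X[:, 0] - X[:, 1] ** 2 + 0.1 * torch.randn(200)).unsqueeze(1)

# Hidden layers use ReLU; the output layer is a single linear neuron (no activation).
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(),
                      nn.Linear(32, 32), nn.ReLU(),
                      nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print(loss.item())
```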
| Algorithm | Type | Handles Non-linearity | Built-in Feature Selection | Interpretability | Scalability |
|---|---|---|---|---|---|
| Linear Regression | Parametric | No | No | High | High |
| Polynomial Regression | Parametric | Yes (via feature engineering) | No | Medium | High |
| Ridge Regression | Parametric | No | No (shrinks all coefficients) | High | High |
| Lasso Regression | Parametric | No | Yes (zeroes out coefficients) | High | High |
| Elastic Net | Parametric | No | Partial | High | High |
| SVR | Non-parametric (kernel) | Yes (via kernels) | No | Low | Medium |
| Decision Tree | Non-parametric | Yes | Implicit (feature importance) | High | Medium |
| Random Forest | Ensemble | Yes | Implicit | Medium | High |
| Gradient Boosting | Ensemble | Yes | Implicit | Medium | High |
| Neural Network | Non-parametric | Yes | No | Low | High (with GPUs) |
Selecting the right metric is essential for assessing how well a regression model performs. Different metrics emphasize different aspects of prediction quality.
| Metric | Formula | Interpretation | Sensitivity to Outliers |
|---|---|---|---|
| Mean Squared Error (MSE) | (1/n) * sum((y_i - y_hat_i)^2) | Average squared prediction error; in squared units of the target | High (squares amplify large errors) |
| Root Mean Squared Error (RMSE) | sqrt(MSE) | Same as MSE but in the original units of the target | High |
| Mean Absolute Error (MAE) | (1/n) * sum(abs(y_i - y_hat_i)) | Average absolute prediction error; in the original units of the target | Low (all errors weighted linearly) |
| R-squared (R^2) | 1 - (SS_res / SS_tot) | Proportion of target variance explained by the model; ranges from negative infinity to 1 | High (inherited from MSE) |
| Adjusted R-squared | 1 - ((1 - R^2)(n - 1) / (n - p - 1)) | R^2 penalized for the number of predictors p; better for model comparison | High |
| Mean Absolute Percentage Error (MAPE) | (100/n) * sum(abs(y_i - y_hat_i) / abs(y_i)) | Average error as a percentage of the true value; undefined when y_i = 0 | Medium |
MSE and RMSE are the most common choices when large errors are particularly costly, because squaring amplifies outlier deviations. MAE provides a more balanced view of typical error and is preferred when the data contains outliers. R-squared is widely reported because it gives an intuitive sense of model fit, but it always increases (or stays the same) when additional features are added, which is why adjusted R-squared is preferred for comparing models with different numbers of predictors. MAPE is useful when relative error matters more than absolute error, such as in demand forecasting.[8]
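These metrics are straightforward to compute; a minimal sketch with scikit-learn and made-up values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # back in the original units of the target
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}, R^2={r2:.3f}")
```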
Classical linear regression rests on several assumptions. While non-linear and tree-based models relax many of these requirements, understanding these assumptions remains important for model diagnostics and troubleshooting.
Linearity. The relationship between predictors and the target is linear (for linear models). Violations can be detected using residual plots. Remedies include adding polynomial terms or switching to a non-linear model.
Independence. Observations are independent of each other. Violation is common in time series data, where consecutive observations are correlated. The Durbin-Watson test detects autocorrelation in residuals; a value near 2 indicates no autocorrelation.
Homoscedasticity. The variance of residuals is constant across all levels of the predicted values. When variance changes systematically (heteroscedasticity), standard errors and confidence intervals become unreliable. The Breusch-Pagan test formally checks for heteroscedasticity.
Normality of residuals. Residuals follow a normal distribution. This assumption is most important for hypothesis testing and confidence intervals, less critical for point predictions. The Shapiro-Wilk test and Q-Q plots are standard diagnostic tools.
No multicollinearity. Predictors should not be highly correlated with each other. Multicollinearity inflates the variance of coefficient estimates and makes them unstable. The Variance Inflation Factor (VIF) quantifies collinearity: VIF values above 5 to 10 indicate problematic levels. Remedies include removing redundant features, applying principal component analysis, or using ridge regression.[9]
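The tests named above are available in statsmodels; the sketch below, on synthetic data, runs the Durbin-Watson and Breusch-Pagan tests on OLS residuals and computes a VIF per predictor:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=100)

X_const = sm.add_constant(X)                      # add the intercept column
results = sm.OLS(y, X_const).fit()

print(durbin_watson(results.resid))               # value near 2 suggests no autocorrelation
print(het_breuschpagan(results.resid, X_const))   # (LM stat, LM p-value, F stat, F p-value)
print([variance_inflation_factor(X_const, i)      # VIF for each predictor (constant excluded)
       for i in range(1, X_const.shape[1])])
```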
Residual analysis is the primary tool for diagnosing regression model problems. Key diagnostic plots include:
- Residuals versus fitted values, to check linearity and reveal heteroscedasticity
- A Q-Q plot of residuals, to check the normality assumption
- A scale-location plot, to check for constant variance
- Residuals versus leverage, to identify influential observations
Standard regression models predict the conditional mean of the target variable. Quantile regression generalizes this by predicting specific quantiles (e.g., the 10th, 50th, or 90th percentile) of the conditional distribution. This is valuable when the goal is to understand the full range of possible outcomes rather than just the average. For example, in risk assessment a financial analyst might care about the 5th percentile (worst-case scenario) rather than the mean. Quantile regression uses the pinball loss function and makes no distributional assumptions about the errors.[10]
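One way to sketch this is with scikit-learn's gradient boosting, whose `loss="quantile"` option uses the pinball loss; one model is fit per quantile, and the quantile levels below are illustrative:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)

# One model per quantile; alpha sets the target quantile of the pinball loss.
low = GradientBoostingRegressor(loss="quantile", alpha=0.05, random_state=0).fit(X, y)
med = GradientBoostingRegressor(loss="quantile", alpha=0.50, random_state=0).fit(X, y)
high = GradientBoostingRegressor(loss="quantile", alpha=0.95, random_state=0).fit(X, y)

x_new = X[:1]
print(low.predict(x_new), med.predict(x_new), high.predict(x_new))  # 5th, 50th, 95th percentile predictions
```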
Probabilistic regression models output a full probability distribution over possible target values rather than a single point estimate. Approaches include:
- Bayesian linear regression, which maintains a posterior distribution over the model parameters
- Gaussian process regression, which returns a predictive mean and variance for every input
- Mixture density networks, in which a neural network outputs the parameters of a mixture distribution
Probabilistic regression is essential in applications where knowing the confidence of a prediction matters as much as the prediction itself, such as medical dosing, autonomous driving, and weather forecasting.
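A minimal Gaussian process sketch illustrating the idea, assuming scikit-learn and synthetic data; the prediction returns both a mean and a standard deviation:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=40)

# The RBF kernel encodes smoothness; WhiteKernel accounts for observation noise.
gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), random_state=0)
gpr.fit(X, y)

mean, std = gpr.predict([[5.0]], return_std=True)   # predictive mean and standard deviation
print(mean, std)
```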
Deep learning models have expanded the scope of regression to complex, high-dimensional inputs. Convolutional neural networks predict continuous values from images (e.g., estimating a person's age from a photograph, predicting crop yield from satellite imagery). Recurrent neural networks and transformers handle sequential regression for tasks like stock price prediction and energy demand forecasting. Graph neural networks perform regression on molecular properties for drug discovery.
For tabular data, however, gradient boosting methods (XGBoost, LightGBM, CatBoost) often match or outperform deep learning, particularly when the dataset is small to medium in size. Deep learning excels when the input data is unstructured (images, text, audio) or when the dataset is very large.[7]
Regression and classification are the two primary types of supervised learning. The following table highlights their key differences:
| Aspect | Regression | Classification |
|---|---|---|
| Output type | Continuous numerical value | Discrete class label |
| Example tasks | Predicting house price, estimating temperature | Spam detection, image recognition |
| Common loss functions | MSE, MAE, Huber loss | Cross-entropy, hinge loss |
| Evaluation metrics | RMSE, MAE, R-squared | Accuracy, precision, recall, F1-score |
| Decision boundary | Not applicable (fits a curve) | Separates feature space into class regions |
| Typical algorithms | Linear regression, SVR, random forest regressor | Logistic regression, SVM, random forest classifier |
Some algorithms can be adapted for both tasks. For example, decision trees, random forests, gradient boosting, and neural networks all have both regression and classification variants. The choice between regression and classification depends on the nature of the target variable: if it is a quantity, use regression; if it is a category, use classification.[11]
The quality of input features often matters more than the choice of algorithm. Effective feature engineering techniques for regression include:
- Creating interaction and polynomial terms to capture non-linear relationships
- Log-transforming skewed features or targets
- Scaling and standardizing features, which is essential for regularized and distance-based models
- Encoding categorical variables (one-hot, ordinal, or target encoding)
- Handling outliers and missing values before fitting
Regularization is critical for preventing overfitting, especially with high-dimensional data. Beyond L1 and L2 penalties, techniques like dropout (for neural networks), early stopping, and pruning (for tree-based models) help control model complexity. Cross-validation provides a reliable estimate of out-of-sample performance and guides hyperparameter selection.
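A quick sketch of cross-validated evaluation with scikit-learn (synthetic data, illustrative settings); the negated scoring convention exists so that higher scores are always better:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=15, noise=10.0, random_state=0)

# 5-fold cross-validation; scikit-learn reports negated RMSE.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="neg_root_mean_squared_error")
print(-scores.mean())   # average RMSE across folds
```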
When the relationship between features and the target is non-linear, practitioners can either engineer non-linear features (polynomial terms, interactions, transformations) and keep a linear model, or switch to an inherently non-linear model such as a kernel SVR, a tree ensemble, or a neural network.
The choice depends on interpretability requirements, data size, and computational constraints.
Regression models are used across virtually every domain: finance (asset pricing and risk estimation), real estate (house price prediction), energy (demand forecasting), agriculture (crop yield estimation from satellite imagery), healthcare (dose-response and outcome prediction), and meteorology (temperature and precipitation forecasting).
Imagine you have a scatter plot of dots on a piece of paper. Each dot represents a real-world observation, like a house with its size on one axis and its price on the other. A regression model is like finding the best line (or curve) to draw through those dots so that the line comes as close as possible to all of them at once.
Once you have that line, you can use it to make predictions. If someone asks "How much would a 2,000 square foot house cost?", you slide along the line to the 2,000 mark and read off the predicted price. The line will not pass through every dot perfectly because real-world data is messy, but a good regression model gets close enough to be useful.
Simple regression draws a straight line. More advanced regression models draw curves, wiggly lines, or even shapes in many dimensions to capture more complicated patterns. But the core idea is always the same: learn the pattern in existing data, then use that pattern to predict new values.