A regression model is a type of supervised learning algorithm designed to predict continuous numerical output values based on one or more input features. Unlike classification models, which assign inputs to discrete categories, regression models estimate real-valued quantities such as prices, temperatures, distances, or probabilities. Regression sits at the foundation of both classical statistics and modern machine learning, and it remains one of the most widely used modeling techniques across science, engineering, finance, and industry.
The roots of regression analysis trace back to the early 19th century. Adrien-Marie Legendre published the method of least squares in 1805 in his paper "New Methods for the Determination of the Orbits of Comets," providing the first formal articulation of fitting a model by minimizing squared errors. Carl Friedrich Gauss later claimed he had been using the same technique since 1795, publishing his account in 1809 in "Theoria Motus Corporum Coelestium." Gauss went further than Legendre by connecting least squares to probability theory and the normal distribution, laying the mathematical groundwork for modern regression.[1]
The term "regression" itself was coined by Sir Francis Galton in the 1880s during his studies of hereditary traits, where he observed that extreme characteristics in parents tended to "regress" toward the population mean in their offspring. Karl Pearson and others later formalized regression as a general statistical tool. Today, the concept extends well beyond its origins, encompassing dozens of algorithms from simple linear fits to deep neural network architectures.
At a high level, a regression model learns a function f that maps input features x to a continuous target variable y:
y = f(x) + e
where e represents irreducible error (noise). During training, the model adjusts its parameters to minimize a chosen loss function, such as mean squared error. The trained model then generalizes to unseen data, producing predictions for new inputs.
Regression models can be divided into two broad families: parametric models, which assume a fixed functional form for f (such as linear and polynomial regression), and non-parametric models, which let the data determine the shape of the fit (such as kernel methods, tree-based models, and neural networks). The most widely used algorithms in both families are described below.
Linear regression is the simplest and most fundamental regression technique. It models the relationship between input features and the target as a weighted linear combination:
y = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b
where w represents feature weights and b is the bias (intercept) term. The model is trained by minimizing the sum of squared residuals using ordinary least squares (OLS) or gradient-based optimization. Linear regression is fast, interpretable, and works well when the true relationship between variables is approximately linear.[2]
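As a minimal sketch of this in practice, the following uses scikit-learn's `LinearRegression` on made-up data (the features and values are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: predict price from two made-up features (size in sq ft, age in years).
X = np.array([[1400, 10], [1600, 5], [1700, 20], [1875, 2], [2350, 15]])
y = np.array([245000, 312000, 279000, 308000, 405000])

model = LinearRegression()          # fits via ordinary least squares
model.fit(X, y)

print(model.coef_)                  # learned weights w_1, w_2
print(model.intercept_)             # bias (intercept) term b
print(model.predict([[2000, 8]]))   # prediction for a new, unseen input
```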
Polynomial regression extends linear regression by introducing higher-order terms (x^2, x^3, etc.) as additional features. A degree-n polynomial model can capture curved, non-linear patterns that a straight line cannot. However, high-degree polynomials risk overfitting, especially with limited training data. Polynomial regression is technically a special case of linear regression because the model remains linear in its parameters, even though the relationship with the original input variable is non-linear.
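A short sketch of this idea, assuming scikit-learn and a synthetic quadratic signal (the degree and noise level are arbitrary choices for the example):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic curved data with noise.
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 - X.ravel() + rng.normal(scale=0.3, size=50)

# PolynomialFeatures adds the x^2 column; the model remains linear in its parameters.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print(poly_model.predict([[1.5]]))
```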
Ridge regression adds an L2 penalty to the ordinary least squares objective, shrinking the magnitude of all coefficients toward zero:
Loss = MSE + alpha * sum(w_i^2)
The regularization parameter alpha controls the strength of the penalty. Ridge regression is particularly useful when features are correlated (multicollinearity), as it stabilizes coefficient estimates and reduces variance at the cost of introducing a small amount of bias. Importantly, ridge regression keeps all features in the model; it never sets coefficients exactly to zero.[3]
Lasso (Least Absolute Shrinkage and Selection Operator) regression uses an L1 penalty instead:
Loss = MSE + alpha * sum(|w_i|)
The L1 penalty encourages sparsity by driving some coefficients to exactly zero, effectively performing automatic feature selection. This makes lasso regression valuable when many features are irrelevant or redundant. However, lasso tends to select only one feature from a group of correlated features and set the rest to zero, which may not always be desirable.[3]
Elastic net combines the L1 and L2 penalties, balancing the strengths of both ridge and lasso:
Loss = MSE + alpha_1 * sum(|w_i|) + alpha_2 * sum(w_i^2)
This hybrid approach is useful when there are groups of correlated features. Where lasso would arbitrarily select one from each group, elastic net tends to select or exclude the entire group together. The mixing ratio between L1 and L2 is controlled by an additional hyperparameter.[3]
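The sparsity difference between the three penalties can be seen in a small sketch. Note that scikit-learn parameterizes elastic net with a single `alpha` plus an `l1_ratio` mixing parameter rather than the two separate alphas in the formula above; the data here is synthetic:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.datasets import make_regression

# Synthetic data in which only 5 of 20 features are informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)                      # L2: shrinks all coefficients, none exactly zero
lasso = Lasso(alpha=1.0).fit(X, y)                      # L1: drives some coefficients exactly to zero
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)    # blend of L1 and L2 penalties

print("zero coefficients (ridge):", np.sum(ridge.coef_ == 0))
print("zero coefficients (lasso):", np.sum(lasso.coef_ == 0))
print("zero coefficients (elastic net):", np.sum(enet.coef_ == 0))
```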
Support vector regression adapts the principles of support vector machines to regression tasks. SVR fits data within an epsilon-insensitive tube, meaning errors smaller than a threshold epsilon are ignored entirely. Points outside the tube become support vectors that define the model. By applying the kernel trick, SVR can model non-linear relationships by projecting data into a higher-dimensional space where a linear fit in that space corresponds to a complex curve in the original space. SVR is effective for small to medium datasets but becomes computationally expensive for very large datasets.[4]
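A minimal SVR sketch with an RBF kernel on a noisy sine wave; the values of C and epsilon are illustrative, and feature scaling is included because SVR is sensitive to feature magnitudes:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Noisy sine wave; the RBF kernel lets a linear fit in kernel space follow the curve.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# epsilon sets the width of the insensitive tube; C trades off flatness against tolerated errors.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
svr.fit(X, y)
print(svr.predict([[2.5]]))
```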
A decision tree regression model partitions the feature space into rectangular regions by recursively splitting on feature thresholds. Each leaf node contains the mean (or median) target value of the training samples that fall into that region. Decision trees are intuitive, require minimal data preprocessing, and can capture non-linear interactions. Their main weakness is high variance: small changes in the data can produce very different tree structures, leading to overfitting.
Random forest regression is an ensemble method that builds many decision trees on bootstrapped subsets of the data (bagging) and averages their predictions. By combining diverse trees, random forests reduce variance and improve generalization compared to a single decision tree. For regression, the final prediction is the mean of all individual tree predictions. Random forests handle high-dimensional data well, are robust to outliers, and require relatively little hyperparameter tuning.[5]
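The variance-reduction effect can be illustrated with a quick sketch contrasting a single tree with a forest on synthetic data (hyperparameters are defaults or illustrative):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Each tree in the forest sees a bootstrap sample; the prediction is the mean over all trees.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

print("single tree R^2:", tree.score(X_test, y_test))
print("random forest R^2:", forest.score(X_test, y_test))
```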
Gradient boosting builds an ensemble of decision trees sequentially, where each new tree is trained to correct the residual errors of the combined ensemble so far. This additive approach optimizes a differentiable loss function using gradient descent in function space. Modern implementations such as XGBoost, LightGBM, and CatBoost add regularization, parallel processing, and efficient handling of categorical features. Gradient boosting consistently achieves top performance on structured tabular data and dominates machine learning competitions.[6]
The following table summarizes the key frameworks:
| Framework | Key Innovation | Best For |
|---|---|---|
| XGBoost | L1/L2 regularization, parallel tree construction | General-purpose tabular data |
| LightGBM | Histogram-based learning, leaf-wise tree growth | Large-scale datasets with millions of rows |
| CatBoost | Native categorical feature handling, ordered boosting | Datasets rich in categorical features |
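As a minimal sketch of the boosting idea, the example below uses scikit-learn's built-in `GradientBoostingRegressor` rather than any of the frameworks above; XGBoost, LightGBM, and CatBoost expose similar scikit-learn-style estimators. Data and hyperparameters are illustrative:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees are added sequentially, each one fit to the residual errors of the ensemble so far.
gbr = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3, random_state=0)
gbr.fit(X_train, y_train)
print(gbr.score(X_test, y_test))   # R^2 on held-out data
```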
Deep neural networks can serve as powerful regression models by using a single linear output neuron (no activation function) in the final layer. The network learns hierarchical representations of the input data through hidden layers with non-linear activation functions. Common loss functions for neural network regression include MSE and mean absolute error. Convolutional neural networks (CNNs) are widely used for regression in computer vision tasks such as depth estimation and pose estimation, while recurrent architectures and transformers handle sequential regression problems like time series forecasting.[7]
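A small PyTorch sketch of neural network regression, assuming synthetic data; the key point is the final `nn.Linear(..., 1)` layer with no activation, trained against MSE:

```python
import torch
from torch import nn

# Toy data: 200 samples, 8 features, noisy non-linear target.
X = torch.randn(200, 8)
y = (2 * X[:, 0] - X[:, 1] ** 2 + 0.1 * torch.randn(200)).unsqueeze(1)

# Hidden layers use ReLU; the output layer is a single linear neuron (no activation).
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(),
                      nn.Linear(32, 32), nn.ReLU(),
                      nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print(loss.item())
```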
| Algorithm | Type | Handles Non-linearity | Built-in Feature Selection | Interpretability | Scalability |
|---|---|---|---|---|---|
| Linear Regression | Parametric | No | No | High | High |
| Polynomial Regression | Parametric | Yes (via feature engineering) | No | Medium | High |
| Ridge Regression | Parametric | No | No (shrinks all coefficients) | High | High |
| Lasso Regression | Parametric | No | Yes (zeroes out coefficients) | High | High |
| Elastic Net | Parametric | No | Partial | High | High |
| SVR | Non-parametric (kernel) | Yes (via kernels) | No | Low | Medium |
| Decision Tree | Non-parametric | Yes | Implicit (feature importance) | High | Medium |
| Random Forest | Ensemble | Yes | Implicit | Medium | High |
| Gradient Boosting | Ensemble | Yes | Implicit | Medium | High |
| Neural Network | Non-parametric | Yes | No | Low | High (with GPUs) |
Selecting the right metric is essential for assessing how well a regression model performs. Different metrics emphasize different aspects of prediction quality.
| Metric | Formula | Interpretation | Sensitivity to Outliers |
|---|---|---|---|
| Mean Squared Error (MSE) | (1/n) * sum((y_i - y_hat_i)^2) | Average squared prediction error; in squared units of the target | High (squares amplify large errors) |
| Root Mean Squared Error (RMSE) | sqrt(MSE) | Same as MSE but in the original units of the target | High |
| Mean Absolute Error (MAE) | (1/n) * sum(abs(y_i - y_hat_i)) | Average absolute prediction error; in the original units of the target | Low (all errors weighted linearly) |
| R-squared (R^2) | 1 - (SS_res / SS_tot) | Proportion of target variance explained by the model; ranges from negative infinity to 1 | High (inherited from MSE) |
| Adjusted R-squared | 1 - ((1 - R^2)(n - 1) / (n - p - 1)) | R^2 penalized for the number of predictors p; better for model comparison | High |
| Mean Absolute Percentage Error (MAPE) | (100/n) * sum(abs(y_i - y_hat_i) / abs(y_i)) | Average error as a percentage of the true value; undefined when y_i = 0 | Medium |
MSE and RMSE are the most common choices when large errors are particularly costly, because squaring amplifies outlier deviations. MAE provides a more balanced view of typical error and is preferred when the data contains outliers. R-squared is widely reported because it gives an intuitive sense of model fit, but it always increases (or stays the same) when additional features are added, which is why adjusted R-squared is preferred for comparing models with different numbers of predictors. MAPE is useful when relative error matters more than absolute error, such as in demand forecasting.[8]
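These metrics are straightforward to compute; a minimal sketch with scikit-learn and made-up values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # back in the original units of the target
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}, R^2={r2:.3f}")
```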
Classical linear regression rests on several assumptions. While non-linear and tree-based models relax many of these requirements, understanding these assumptions remains important for model diagnostics and troubleshooting.
Linearity. The relationship between predictors and the target is linear (for linear models). Violations can be detected using residual plots. Remedies include adding polynomial terms or switching to a non-linear model.
Independence. Observations are independent of each other. Violation is common in time series data, where consecutive observations are correlated. The Durbin-Watson test detects autocorrelation in residuals; a value near 2 indicates no autocorrelation.
Homoscedasticity. The variance of residuals is constant across all levels of the predicted values. When variance changes systematically (heteroscedasticity), standard errors and confidence intervals become unreliable. The Breusch-Pagan test formally checks for heteroscedasticity.
Normality of residuals. Residuals follow a normal distribution. This assumption is most important for hypothesis testing and confidence intervals, less critical for point predictions. The Shapiro-Wilk test and Q-Q plots are standard diagnostic tools.
No multicollinearity. Predictors should not be highly correlated with each other. Multicollinearity inflates the variance of coefficient estimates and makes them unstable. The Variance Inflation Factor (VIF) quantifies collinearity: VIF values above 5 to 10 indicate problematic levels. Remedies include removing redundant features, applying principal component analysis, or using ridge regression.[9]
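The tests named above are available in statsmodels; the sketch below, on synthetic data, runs the Durbin-Watson and Breusch-Pagan tests on OLS residuals and computes a VIF per predictor:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=100)

X_const = sm.add_constant(X)                      # add the intercept column
results = sm.OLS(y, X_const).fit()

print(durbin_watson(results.resid))               # value near 2 suggests no autocorrelation
print(het_breuschpagan(results.resid, X_const))   # (LM stat, LM p-value, F stat, F p-value)
print([variance_inflation_factor(X_const, i)      # VIF for each predictor (constant excluded)
       for i in range(1, X_const.shape[1])])
```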
Residual analysis is the primary tool for diagnosing regression model problems. Key diagnostic plots include:
- Residuals versus fitted values, to check linearity and reveal heteroscedasticity
- A Q-Q plot of residuals, to check the normality assumption
- A scale-location plot, to check for constant variance
- Residuals versus leverage, to identify influential observations
Standard regression models predict the conditional mean of the target variable. Quantile regression generalizes this by predicting specific quantiles (e.g., the 10th, 50th, or 90th percentile) of the conditional distribution. This is valuable when the goal is to understand the full range of possible outcomes rather than just the average. For example, in risk assessment a financial analyst might care about the 5th percentile (worst-case scenario) rather than the mean. Quantile regression uses the pinball loss function and makes no distributional assumptions about the errors.[10]
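One way to sketch this is with scikit-learn's gradient boosting, whose `loss="quantile"` option uses the pinball loss; one model is fit per quantile, and the quantile levels below are illustrative:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)

# One model per quantile; alpha sets the target quantile of the pinball loss.
low = GradientBoostingRegressor(loss="quantile", alpha=0.05, random_state=0).fit(X, y)
med = GradientBoostingRegressor(loss="quantile", alpha=0.50, random_state=0).fit(X, y)
high = GradientBoostingRegressor(loss="quantile", alpha=0.95, random_state=0).fit(X, y)

x_new = X[:1]
print(low.predict(x_new), med.predict(x_new), high.predict(x_new))  # 5th, 50th, 95th percentile predictions
```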
Probabilistic regression models output a full probability distribution over possible target values rather than a single point estimate. Approaches include:
- Bayesian linear regression, which maintains a posterior distribution over the model parameters
- Gaussian process regression, which returns a predictive mean and variance for every input
- Mixture density networks, in which a neural network outputs the parameters of a mixture distribution
Probabilistic regression is essential in applications where knowing the confidence of a prediction matters as much as the prediction itself, such as medical dosing, autonomous driving, and weather forecasting.
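A minimal Gaussian process sketch illustrating the idea, assuming scikit-learn and synthetic data; the prediction returns both a mean and a standard deviation:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=40)

# The RBF kernel encodes smoothness; WhiteKernel accounts for observation noise.
gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), random_state=0)
gpr.fit(X, y)

mean, std = gpr.predict([[5.0]], return_std=True)   # predictive mean and standard deviation
print(mean, std)
```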
Deep learning models have expanded the scope of regression to complex, high-dimensional inputs. Convolutional neural networks predict continuous values from images (e.g., estimating a person's age from a photograph, predicting crop yield from satellite imagery). Recurrent neural networks and transformers handle sequential regression for tasks like stock price prediction and energy demand forecasting. Graph neural networks perform regression on molecular properties for drug discovery.
For tabular data, however, gradient boosting methods (XGBoost, LightGBM, CatBoost) often match or outperform deep learning, particularly when the dataset is small to medium in size. Deep learning excels when the input data is unstructured (images, text, audio) or when the dataset is very large.[7]
Regression and classification are the two primary types of supervised learning. The following table highlights their key differences:
| Aspect | Regression | Classification |
|---|---|---|
| Output type | Continuous numerical value | Discrete class label |
| Example tasks | Predicting house price, estimating temperature | Spam detection, image recognition |
| Common loss functions | MSE, MAE, Huber loss | Cross-entropy, hinge loss |
| Evaluation metrics | RMSE, MAE, R-squared | Accuracy, precision, recall, F1-score |
| Decision boundary | Not applicable (fits a curve) | Separates feature space into class regions |
| Typical algorithms | Linear regression, SVR, random forest regressor | Logistic regression, SVM, random forest classifier |
Some algorithms can be adapted for both tasks. For example, decision trees, random forests, gradient boosting, and neural networks all have both regression and classification variants. The choice between regression and classification depends on the nature of the target variable: if it is a quantity, use regression; if it is a category, use classification.[11]
The quality of input features often matters more than the choice of algorithm. Effective feature engineering techniques for regression include:
- Creating interaction and polynomial terms to capture non-linear relationships
- Log-transforming skewed features or targets
- Scaling and standardizing features, which is essential for regularized and distance-based models
- Encoding categorical variables (one-hot, ordinal, or target encoding)
- Handling outliers and missing values before fitting
Regularization is critical for preventing overfitting, especially with high-dimensional data. Beyond L1 and L2 penalties, techniques like dropout (for neural networks), early stopping, and pruning (for tree-based models) help control model complexity. Cross-validation provides a reliable estimate of out-of-sample performance and guides hyperparameter selection.
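A quick sketch of cross-validated evaluation with scikit-learn (synthetic data, illustrative settings); the negated scoring convention exists so that higher scores are always better:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=15, noise=10.0, random_state=0)

# 5-fold cross-validation; scikit-learn reports negated RMSE.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="neg_root_mean_squared_error")
print(-scores.mean())   # average RMSE across folds
```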
When the relationship between features and the target is non-linear, practitioners can either engineer non-linear features (polynomial terms, interactions, transformations) and keep a linear model, or switch to an inherently non-linear model such as a kernel SVR, a tree ensemble, or a neural network.
The choice depends on interpretability requirements, data size, and computational constraints.
Regression models are used across virtually every domain: finance (asset pricing and risk estimation), real estate (house price prediction), energy (demand forecasting), agriculture (crop yield estimation from satellite imagery), healthcare (dose-response and outcome prediction), and meteorology (temperature and precipitation forecasting).
Imagine you have a scatter plot of dots on a piece of paper. Each dot represents a real-world observation, like a house with its size on one axis and its price on the other. A regression model is like finding the best line (or curve) to draw through those dots so that the line comes as close as possible to all of them at once.
Once you have that line, you can use it to make predictions. If someone asks "How much would a 2,000 square foot house cost?", you slide along the line to the 2,000 mark and read off the predicted price. The line will not pass through every dot perfectly because real-world data is messy, but a good regression model gets close enough to be useful.
Simple regression draws a straight line. More advanced regression models draw curves, wiggly lines, or even shapes in many dimensions to capture more complicated patterns. But the core idea is always the same: learn the pattern in existing data, then use that pattern to predict new values.