Model

Machine Learning

Updated Jun 21, 2026

See also: Machine learning terms

In machine learning, a model is a mathematical representation of a real-world process that learns patterns from data and uses those patterns to make predictions or decisions on new, unseen inputs.^[1]^[3] Every model combines two parts: an architecture (the structural blueprint defining how computations are organized) and a set of learned parameters (numerical values adjusted during training to minimize error).^[4] The architecture can be as simple as a linear equation or a decision tree, or as large as a deep neural network with hundreds of billions of parameters: OpenAI's GPT-3, for example, has 175 billion parameters, described in its 2020 paper as "10x more than any previous non-sparse language model."^[12] Regardless of complexity, every model serves the same fundamental purpose: to generalize from observed examples to novel situations.

Introduction

Models are central to virtually every application of artificial intelligence, from spam filters and recommendation engines to autonomous vehicles and medical diagnostics. The process of building a useful model involves selecting an appropriate architecture, training it on data, evaluating its performance, and deploying it to serve predictions in production. Understanding what a model is and how models differ from one another is foundational to the entire field of machine learning.

What is the formal definition of a model?

Formally, a machine learning model can be described as a function f that maps an input x (often a vector of features) to an output y:

y = f(x; theta)

where theta represents the model's parameters. During training, an optimization algorithm adjusts theta to minimize a loss function L(y_predicted, y_actual), which measures the discrepancy between the model's predictions and the true values.^[4] The choice of loss function depends on the task. For regression, the mean squared error is common; for classification, cross-entropy loss is widely used.^[3]
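As a minimal sketch of the notation above, the following assumes a linear form for f and shows the two loss functions mentioned. The function names (`predict`, `mse`, `cross_entropy`) and the parameter values are illustrative assumptions, not part of any particular library.

```python
import math

def predict(x, theta):
    """Linear model f(x; theta): theta[0] is the bias, theta[1:] are the weights."""
    return theta[0] + sum(w * xi for w, xi in zip(theta[1:], x))

def mse(y_pred, y_true):
    """Mean squared error, a common regression loss."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

def cross_entropy(p_pred, y_true, eps=1e-12):
    """Binary cross-entropy; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for p, t in zip(p_pred, y_true):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

theta = [1.0, 2.0, -0.5]            # illustrative parameter values
inputs = [[1.0, 2.0], [0.0, 4.0]]   # two feature vectors
preds = [predict(x, theta) for x in inputs]
print(preds)                        # [2.0, -1.0]
print(mse(preds, [2.0, 0.0]))       # 0.5
```

Training, in this framing, amounts to searching for the `theta` that makes the chosen loss as small as possible on the training data.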

Tom M. Mitchell provided an influential formal definition of machine learning more broadly: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."^[1] The model, in this framing, is the artifact that encodes the learned experience.

What are the main types of models?

Machine learning models can be categorized along several dimensions. The most common taxonomies are based on the learning paradigm, the modeling approach, and the nature of the output.^[3]

By learning paradigm

Paradigm	Description	Example algorithms
Supervised learning	Learns from labeled input-output pairs to predict outputs for new inputs	Linear regression, logistic regression, SVMs, random forests, neural networks
Unsupervised learning	Discovers hidden structure in unlabeled data	K-means clustering, PCA, autoencoders, Gaussian mixture models
Semi-supervised learning	Uses a small amount of labeled data alongside a large amount of unlabeled data	Label propagation, self-training, co-training
Reinforcement learning	Learns optimal actions through interaction with an environment and reward signals	Q-learning, Deep Q-Networks, PPO, SAC
Self-supervised learning	Generates supervisory signals from the data itself, often through pretext tasks	BERT (masked language modeling), contrastive learning (SimCLR)

By output type

Classification models predict discrete categories or labels. A spam filter, for instance, classifies emails as "spam" or "not spam." Common classification algorithms include logistic regression, support vector machines, decision trees, random forests, and neural networks trained with cross-entropy loss.^[3]

Regression models predict continuous numerical values. Predicting house prices based on features like square footage and location is a classic regression task. Linear regression, polynomial regression, and neural networks trained with mean squared error are typical regression models.

Generative models learn the joint probability distribution P(X, Y) or the data distribution P(X) directly, enabling them to generate new data samples.^[3] Examples include Gaussian mixture models, variational autoencoders (VAEs), generative adversarial networks (GANs), and autoregressive language models like the GPT series.^[4]

Discriminative models learn the conditional probability P(Y | X) or a direct mapping from inputs to outputs. They focus on finding decision boundaries rather than modeling the full data distribution. Logistic regression, SVMs, and most neural network classifiers are discriminative models.^[3]

Aspect	Generative models	Discriminative models
What they learn	Joint distribution P(X, Y) or P(X)	Conditional distribution P(Y | X) or direct mapping
Can generate new samples	Yes	No
Typical use cases	Data generation, density estimation, imputation	Classification, regression
Examples	Naive Bayes, VAEs, GANs, GPT	Logistic regression, SVMs, random forests

Parametric vs. non-parametric models

Another important distinction is between parametric and non-parametric models.

Parametric models assume a specific functional form for the relationship between inputs and outputs and have a fixed number of parameters regardless of the size of the training data. Linear regression (with its weight vector and bias term) is a classic parametric model. Parametric models are computationally efficient and easy to interpret, but they can underfit if the assumed functional form is too rigid for the true data distribution.^[3]

Non-parametric models do not assume a fixed functional form. Their complexity can grow with the amount of training data, so the effective number of parameters is not fixed in advance. K-nearest neighbors (KNN), decision trees, and Gaussian processes are non-parametric models. They are more flexible and can capture complex patterns, but they typically require more data, can be slower at inference (KNN, for example, must search the stored training set for every query), and carry a higher risk of overfitting.^[3]

Property	Parametric models	Non-parametric models
Number of parameters	Fixed	Grows with data
Assumptions	Strong (fixed functional form)	Weak or none
Data requirements	Lower	Higher
Interpretability	Generally higher	Generally lower
Risk of underfitting	Higher if form is misspecified	Lower
Examples	Linear regression, logistic regression, naive Bayes	KNN, decision trees, random forests, Gaussian processes
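The contrast can be sketched with one model of each kind on one-dimensional data. This is an illustrative example only: `fit_line` and `knn_predict` are hypothetical helper names, and the data is made up. The fitted line is always summarized by two numbers, while the KNN predictor has to keep the whole training set.

```python
def fit_line(xs, ys):
    """Parametric: least-squares line; always exactly two parameters."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def knn_predict(xs, ys, query, k=2):
    """Non-parametric: the 'model' is the stored training set itself."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / k

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)     # two numbers, however large xs grows
knn_value = knn_predict(xs, ys, 1.4)    # averages the targets of x=1.0 and x=2.0
print(slope, intercept, knn_value)
```

Adding more training points leaves the line's parameter count at two but makes the KNN predictor's stored state, and its per-query search, larger.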

Model complexity and capacity

A model's complexity refers to the richness of functions it can represent. A linear model with two parameters (slope and intercept) is simple, while a deep neural network with hundreds of billions of parameters is highly complex. Model capacity is the closely related concept of a model's ability to fit a wide variety of functions.

One formal measure of capacity is the Vapnik-Chervonenkis (VC) dimension, introduced by Vladimir Vapnik and Alexey Chervonenkis in their 1971 paper "On the uniform convergence of relative frequencies of events to their probabilities."^[2] The VC dimension of a model class is the size of the largest set of data points that the class can shatter, meaning that for every possible assignment of binary labels to those points, some model in the class classifies all of them correctly. A model class with a higher VC dimension can represent more complex decision boundaries but also has a greater risk of overfitting.^[2]
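Shattering can be checked by brute force for very small hypothesis classes. The sketch below assumes a deliberately simple class chosen for illustration, classifiers that label a point 1 when it falls inside an interval [a, b] on the real line; the function names are hypothetical.

```python
from itertools import product

def interval_labelings(points):
    """All labelings of `points` achievable by some classifier 1[a <= x <= b]."""
    # An interval's labeling depends only on which points it contains, so trying
    # endpoints at the points themselves (plus the empty interval) is enough.
    achievable = {tuple(0 for _ in points)}
    for a in points:
        for b in points:
            achievable.add(tuple(1 if a <= x <= b else 0 for x in points))
    return achievable

def is_shattered(points):
    """True if every one of the 2^n possible labelings is achievable."""
    return interval_labelings(points) == set(product((0, 1), repeat=len(points)))

print(is_shattered([1.0, 2.0]))        # True
print(is_shattered([1.0, 2.0, 3.0]))   # False: the labeling (1, 0, 1) is impossible
```

Two points can be shattered but no three points can, since an interval cannot include the outer two while excluding the middle one, so this class has VC dimension 2.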

The relationship between capacity and generalization is captured by the bias-variance tradeoff:^[3]

Bias measures how far the model's average predictions are from the true values. High bias indicates underfitting: the model is too simple to capture the underlying patterns.
Variance measures how much the model's predictions fluctuate across different training sets. High variance indicates overfitting: the model is too sensitive to the specific training data.

As model complexity increases, bias typically decreases (the model can fit the training data more closely), but variance increases (the model becomes more sensitive to noise). The optimal model balances these two sources of error.^[3] Classical statistical learning theory prescribes choosing the simplest model that adequately captures the data patterns.^[2] However, modern deep learning has complicated this picture: very large neural networks sometimes generalize well despite having far more parameters than training examples, a phenomenon known as "double descent." Belkin, Hsu, Ma, and Mandal described it in 2019, writing that the "double descent" curve "subsumes the textbook U-shaped bias-variance trade-off curve by showing how increasing model capacity beyond the point of interpolation results in improved performance," which challenges traditional views of the bias-variance tradeoff.^[11]
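For squared-error loss, the tradeoff described above has a standard closed form. The notation here is an assumption introduced for illustration: f^* is the true function, \hat{f} is the model fitted to a randomly drawn training set, and observations follow y = f^*(x) + \varepsilon with zero-mean noise of variance \sigma^2 that is independent of the training set. The expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f^*(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The first two terms depend on the model class and training procedure; the noise term cannot be reduced by any choice of model.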

How are models trained?

Training (also called fitting or learning) is the process of adjusting a model's parameters to minimize a loss function on the training data. The most common optimization approach for neural networks is gradient descent and its variants:^[4]

Compute predictions by passing training examples through the model (forward pass).
Calculate the loss between predictions and true labels.
Compute gradients of the loss with respect to each parameter using backpropagation.^[4]
Update parameters by moving them in the direction that reduces the loss.
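The four steps above can be sketched for the simplest possible case, a one-feature linear model trained with full-batch gradient descent on mean squared error. This is an illustrative sketch: the gradients are written out by hand rather than obtained through backpropagation, and the learning rate and step count are arbitrary choices.

```python
def train_linear(xs, ys, lr=0.05, steps=2000):
    """Full-batch gradient descent on MSE for the model y = w * x + b."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        preds = [w * x + b for x in xs]                 # 1. forward pass
        errors = [p - y for p, y in zip(preds, ys)]     # 2. loss is mean(errors ** 2)
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))  # 3. gradients
        grad_b = 2 / n * sum(errors)
        w -= lr * grad_w                                # 4. step against the gradient
        b -= lr * grad_b
    return w, b

# Data generated from y = 2x + 1, so training should recover w close to 2, b close to 1.
w, b = train_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(w, 4), round(b, 4))
```

In a neural network the same loop applies, with step 3 delegated to automatic differentiation over all layers.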

Stochastic gradient descent (SGD) processes random subsets (mini-batches) of the training data at each step rather than the full dataset, which is computationally cheaper and introduces beneficial noise that can help escape local minima. Adaptive optimizers like Adam, AdaGrad, and RMSProp adjust the learning rate for each parameter individually, often converging faster than vanilla SGD.^[4]

Key training considerations include:

Learning rate: Controls the step size of parameter updates. Too large a rate causes instability; too small a rate leads to slow convergence.
Batch size: The number of examples processed before each parameter update. Larger batches give more stable gradient estimates but require more memory.
Epochs: One epoch is a complete pass through the training dataset. Models typically train for many epochs.
Regularization: Techniques like L1/L2 penalties, dropout, and early stopping prevent overfitting by constraining the model's effective complexity.^[4]
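A rough sketch of how these settings appear in code follows, using mini-batch SGD on a one-feature linear model with an optional L2 penalty. The function name `sgd`, the default values, and the choice to penalize only the weight are assumptions made for illustration.

```python
import random

def sgd(xs, ys, lr=0.05, batch_size=2, epochs=500, l2=0.0, seed=0):
    """Mini-batch SGD for y = w * x + b with an optional L2 penalty on w."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):                       # one epoch = one pass over the data
        rng.shuffle(indices)                      # new random mini-batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errs = [(w * xs[i] + b) - ys[i] for i in batch]
            grad_w = 2 / len(batch) * sum(e * xs[i] for e, i in zip(errs, batch))
            grad_w += 2 * l2 * w                  # gradient of the penalty l2 * w ** 2
            grad_b = 2 / len(batch) * sum(errs)
            w -= lr * grad_w                      # learning rate scales every update
            b -= lr * grad_b
    return w, b

xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
w_plain, b_plain = sgd(xs, ys)
w_reg, b_reg = sgd(xs, ys, batch_size=4, l2=0.5)  # full batch, with regularization
print(w_plain, w_reg)
```

With the penalty switched on, the learned weight is pulled toward zero (below the unregularized value of about 2), which is the sense in which L2 regularization constrains the model.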

How is a model evaluated?

After training, a model must be evaluated to estimate how well it will perform on unseen data. Evaluation involves choosing appropriate metrics and validation strategies.^[3]

Common evaluation metrics

Task type	Metric	Description
Classification	Accuracy	Fraction of correct predictions
Classification	Precision	Fraction of positive predictions that are truly positive
Classification	Recall (sensitivity)	Fraction of actual positives correctly identified
Classification	F1 score	Harmonic mean of precision and recall
Classification	AUC-ROC	Area under the receiver operating characteristic curve
Regression	Mean squared error (MSE)	Average squared difference between predictions and true values
Regression	Mean absolute error (MAE)	Average absolute difference between predictions and true values
Regression	R-squared	Proportion of variance in the target explained by the model
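Generation	Perplexity	Exponentiated average negative log-likelihood per token on the test set, equivalently the inverse test-set probability normalized by length (lower is better for language models)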
Generation	BLEU / ROUGE	N-gram overlap metrics for text generation
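Most of the metrics in the table reduce to a few lines of arithmetic. The sketch below computes them from scratch for illustration; the function names are hypothetical, binary labels are assumed to be 0 or 1, and in practice an established library implementation would normally be used instead.

```python
import math

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def regression_metrics(y_true, y_pred):
    """MSE, MAE, and R-squared."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"mse": ss_res / n, "mae": mae, "r2": 1 - ss_res / ss_tot}

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to each true token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

clf = classification_metrics([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
reg = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(clf)
print(reg)
print(perplexity([0.25, 0.25]))
```

A model that assigns probability 0.25 to every correct token has a perplexity of about 4, which can be read as being as uncertain as a uniform choice among four options.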

Validation strategies

Train-test split divides the data into a training set and a held-out test set. The model is trained on the training set and evaluated on the test set. A common split is 80/20 or 70/30.

Cross-validation (typically k-fold) divides the data into k subsets. The model is trained k times, each time using a different subset as the test fold and the remaining k-1 subsets for training. The results are averaged to produce a more robust performance estimate.^[3]
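The k-fold procedure can be sketched as follows. The helper names, the contiguous (unshuffled) fold assignment, and the deliberately trivial "model" that predicts the training mean are all assumptions made to keep the example short; real pipelines usually shuffle the data first.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def cross_val_score(xs, ys, k, fit, score):
    """Average the score over k train/test rotations."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return sum(scores) / k

# Toy "model": always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mse_score(model, xs, ys):
    return sum((model - y) ** 2 for y in ys) / len(ys)

cv_mse = cross_val_score(list(range(6)), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3, fit_mean, mse_score)
print(cv_mse)
```

Every example is used for testing exactly once and for training k-1 times, which is why the averaged score is less sensitive to one lucky or unlucky split.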

Held-out validation set reserves a portion of the training data for hyperparameter tuning, preventing the test set from being used for model selection decisions (which would leak information and produce overly optimistic estimates).

How do you select among candidate models?

Model selection is the process of choosing among candidate models. This encompasses both the choice of algorithm (e.g., random forest vs. neural network) and the tuning of hyperparameters (e.g., the number of layers, the learning rate, or the regularization strength).

Common model selection strategies include:

Grid search: Exhaustively tries all combinations of hyperparameter values from a predefined grid.
Random search: Samples hyperparameter values randomly, which is often more efficient than grid search when only a few hyperparameters strongly affect performance (Bergstra and Bengio, 2012).^[8]
Bayesian optimization: Uses a probabilistic surrogate model to guide the search toward promising hyperparameter regions, reducing the number of evaluations needed.
Information criteria: Methods like the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) balance model fit against complexity without requiring a separate validation set.^[3]

The goal of model selection is to identify the model that will generalize best to new data, not merely the model that fits the training data most closely.
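Grid search and random search differ only in how candidate settings are generated. In the sketch below, `validation_loss` is a hypothetical stand-in for "train a model with these hyperparameters and return its validation loss"; its formula, the hyperparameter names, and the search ranges are invented for illustration.

```python
import itertools
import random

def validation_loss(lr, reg):
    """Hypothetical stand-in for training a model and measuring validation loss."""
    return (lr - 0.1) ** 2 + 0.01 * (reg - 1.0) ** 2

def grid_search(lrs, regs):
    """Evaluate every combination on the grid and keep the best one."""
    return min(itertools.product(lrs, regs), key=lambda hp: validation_loss(*hp))

def random_search(n_trials, seed=0):
    """Sample settings uniformly at random from the search ranges."""
    rng = random.Random(seed)
    trials = [(rng.uniform(0.0, 1.0), rng.uniform(0.0, 2.0)) for _ in range(n_trials)]
    return min(trials, key=lambda hp: validation_loss(*hp))

best_grid = grid_search([0.01, 0.1, 1.0], [0.0, 1.0, 2.0])
best_random = random_search(50)
print(best_grid, best_random)
```

In this toy objective the loss depends far more on `lr` than on `reg`, the situation in which random search tends to be efficient: every random trial tests a new value of the important hyperparameter, whereas a grid repeats the same few values.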

Pre-trained models and transfer learning

Transfer learning is the practice of taking a model trained on one task or dataset and adapting it for a different, often related, task. This approach has become dominant in modern machine learning because training large models from scratch requires enormous computational resources and data.

A pre-trained model is one that has already been trained on a large-scale dataset. For example, ImageNet-trained convolutional neural networks learn general visual features (edges, textures, shapes) that transfer well to other vision tasks. In natural language processing, language models pre-trained on vast text corpora (such as BERT or GPT) learn linguistic representations that can be fine-tuned for specific tasks like sentiment analysis, question answering, or named entity recognition.

Transfer learning typically follows one of two approaches:

Feature extraction: The pre-trained model is used as a fixed feature extractor. Its weights are frozen, and only a new output layer (or head) is trained on the target task.
Fine-tuning: All or some of the pre-trained model's weights are further adjusted on the target task data, allowing the model to adapt its learned representations.
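The difference between the two approaches comes down to which parameters receive gradient updates. The toy sketch below assumes a one-weight "backbone" and a one-weight "head", which is far simpler than any real pre-trained network; the names and values are hypothetical.

```python
def backbone(x, w_backbone):
    """Stand-in for a pre-trained feature extractor producing one scalar feature."""
    return w_backbone * x

def adapt(xs, ys, w_backbone, freeze_backbone, lr=0.01, steps=500):
    """Train a linear head on the backbone's feature; optionally update the backbone."""
    w_head = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_head = grad_back = 0.0
        for x, y in zip(xs, ys):
            feat = backbone(x, w_backbone)
            err = w_head * feat - y
            grad_head += 2 * err * feat / n
            grad_back += 2 * err * w_head * x / n
        w_head -= lr * grad_head
        if not freeze_backbone:
            w_backbone -= lr * grad_back   # fine-tuning also moves pre-trained weights
    return w_backbone, w_head

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
frozen = adapt(xs, ys, w_backbone=0.5, freeze_backbone=True)    # feature extraction
tuned = adapt(xs, ys, w_backbone=0.5, freeze_backbone=False)    # fine-tuning
print(frozen, tuned)
```

Both runs fit the target task, but only the fine-tuned run changes the backbone weight; with feature extraction the head alone has to absorb the adaptation.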

What are foundation models?

Foundation models are large-scale models trained on broad, diverse datasets that can be adapted to a wide range of downstream tasks.^[7] The term was coined in 2021 by researchers at the Center for Research on Foundation Models (CRFM), part of the Stanford Institute for Human-Centered Artificial Intelligence (HAI), in the report "On the Opportunities and Risks of Foundation Models."^[7] The authors wrote: "We call these models foundation models to underscore their critically central yet incomplete character."^[7] Examples include GPT-4, Claude, LLaMA, BERT, CLIP, and Stable Diffusion.

Foundation models represent a paradigm shift in AI development. Rather than training a specialized model for each task, a single large model is pre-trained once (at great expense) and then adapted, either through fine-tuning or through in-context learning (prompting), to serve many different applications. This approach reduces the need for task-specific labeled data and makes powerful AI capabilities accessible to organizations that lack the resources to train models from scratch. As the CRFM report notes, this scale "results in new emergent capabilities," while the effectiveness of one model across many tasks "incentivizes homogenization," meaning the defects of a foundation model are inherited by all the adapted models downstream.^[7]

The emergence of foundation models has raised important questions about centralization (a small number of organizations control the most capable models), safety (misuse potential scales with model capability), and environmental impact (training runs for frontier models consume substantial energy).^[7]

How are large models compressed?

Deploying large models in resource-constrained environments (mobile devices, edge hardware, real-time applications) often requires reducing their size and computational cost. The three principal model compression techniques are:

Pruning removes redundant weights, neurons, or entire layers from a trained model. Unstructured pruning sets individual weights to zero, while structured pruning removes entire filters or channels. Research has shown that neural networks can often be pruned by 90% or more with minimal accuracy loss. This observation motivated the "lottery ticket hypothesis" proposed by Frankle and Carbin (2019), which states that dense networks contain sparse subnetworks ("winning tickets") that, trained in isolation from their original initialization, can match the full network's accuracy in a comparable number of iterations.^[6]
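One common criterion for unstructured pruning is weight magnitude: the weights closest to zero are assumed to matter least. The sketch below uses that criterion as an illustrative choice (the text above does not prescribe one), on a flat list of made-up weights.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero the fraction `sparsity` of smallest-magnitude weights."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = magnitude_prune(weights, 0.5)
print(pruned)   # [0.8, 0.0, 0.3, 0.0, -0.6, 0.0]
```

In practice pruning is usually followed by further training so the remaining weights can compensate for the ones removed.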

Quantization reduces the numerical precision of model weights and activations. Instead of storing parameters as 32-bit floating-point numbers (FP32), quantization converts them to lower-precision formats such as 16-bit (FP16), 8-bit integers (INT8), or even 4-bit or 2-bit representations. This reduces memory usage and accelerates inference on hardware that supports low-precision arithmetic.
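A minimal sketch of INT8 quantization follows, assuming the simplest scheme: symmetric, per-tensor, with a single scale factor derived from the largest absolute weight. Real toolchains offer more elaborate variants (per-channel scales, zero points, calibration data); the function names here are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the stored integers back to approximate float values."""
    return [qi * scale for qi in q]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)          # [50, -127, 0, 100]
print(restored)
```

Each weight is stored in one byte instead of four, and the reconstruction error per weight is at most half the scale factor.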

Knowledge distillation trains a smaller "student" model to replicate the behavior of a larger "teacher" model. The student learns not just from the hard labels in the training data but also from the teacher's soft probability distributions ("dark knowledge"), which convey richer information about the relationships between classes.^[5] Hinton, Vinyals, and Dean (2015) introduced this technique, observing that "a very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions," and showing how that ensemble knowledge can be distilled into a single deployable model.^[5] It has since been widely adopted for deploying compact models.
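The training signal for the student can be sketched as a weighted sum of two cross-entropy terms, one against the teacher's temperature-softened probabilities and one against the hard label. The weighting scheme with `alpha`, the default values, and the function names are assumptions made for illustration; the multiplication by the squared temperature follows the recommendation in Hinton et al. (2015) for combining soft and hard targets.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperatures give flatter distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                                # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of soft-target and hard-label cross-entropy."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_loss = -sum(t * math.log(s) for t, s in zip(soft_teacher, soft_student))
    hard_loss = -math.log(softmax(student_logits)[hard_label])
    # The T^2 factor keeps the soft-target gradients on a comparable scale.
    return alpha * temperature ** 2 * soft_loss + (1 - alpha) * hard_loss

teacher = [5.0, 1.0, 0.0]
good = distillation_loss([5.0, 1.0, 0.0], teacher, hard_label=0)
bad = distillation_loss([0.0, 1.0, 5.0], teacher, hard_label=0)
print(good, bad)
```

A student whose logits agree with the teacher's receives a lower loss than one that ranks the classes differently, even though both are scored against the same hard label.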

Technique	What it reduces	Typical savings	Trade-off
Pruning	Number of non-zero parameters	50-95% sparsity	Unstructured sparsity yields speedups only with hardware or kernel support for sparse operations
Quantization	Precision of weights and activations	2-8x memory reduction	Small accuracy loss; may need calibration
Knowledge distillation	Model size (fewer parameters)	3-10x smaller model	Student may not fully match teacher quality

Model interpretability and explainability

As models are deployed in high-stakes domains like healthcare, finance, and criminal justice, understanding why a model makes a particular prediction becomes critical. Model interpretability refers to the degree to which a human can understand the cause of a model's decisions.

Some models are inherently interpretable. Linear regression models have coefficients that directly quantify each feature's influence. Decision trees produce human-readable if-then rules. These models are sometimes called "glass-box" models.^[3]

More complex models, particularly deep neural networks, are often treated as black boxes. Several post-hoc explanation methods have been developed to interpret their predictions:

SHAP (SHapley Additive exPlanations): Based on cooperative game theory, SHAP assigns each feature a value representing its contribution to a specific prediction. It provides both local (per-prediction) and global (across the dataset) explanations.^[10]
LIME (Local Interpretable Model-agnostic Explanations): LIME generates explanations for individual predictions by fitting a simple, interpretable model (such as a linear model) to a set of perturbed inputs around the prediction of interest.^[9]
Gradient-based methods: Techniques like saliency maps and integrated gradients compute the gradient of the output with respect to the input features, highlighting which features most strongly influence the prediction.
Attention visualization: In Transformer-based models, attention weights can be visualized to show which input tokens the model focuses on, though researchers have cautioned that attention weights do not always reliably indicate feature importance.
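The idea behind gradient-based saliency can be sketched without a deep learning framework by estimating the gradient numerically. The example below is an approximation for illustration only: it uses central finite differences instead of automatic differentiation, and the "black box" is a small hypothetical logistic model with made-up weights.

```python
import math

def model(x, weights=(2.0, -0.1, 0.0), bias=0.0):
    """Hypothetical logistic model standing in for a black-box classifier."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 / (1 + math.exp(-z))

def saliency(f, x, eps=1e-5):
    """Central-difference estimate of |d f / d x_i| for each input feature."""
    scores = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        scores.append(abs(f(hi) - f(lo)) / (2 * eps))
    return scores

x = [0.5, 1.0, 3.0]
scores = saliency(model, x)
print(scores)
```

The first feature receives the largest score and the third receives zero, matching the weights the model actually uses. Note that the scores describe sensitivity at this particular input, not global importance.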

Model deployment and serving

Deploying a model means making it available to serve predictions in a production environment. Deployment introduces engineering challenges that go beyond model accuracy, including latency requirements, throughput, reliability, and cost.

Common deployment patterns include:

REST API serving: The model is wrapped in a web service that accepts input data via HTTP requests and returns predictions. Frameworks like TensorFlow Serving, TorchServe, and Triton Inference Server support this pattern.
Batch inference: Predictions are computed offline on large datasets at scheduled intervals, suitable for use cases where real-time responses are not required.
Edge deployment: The model runs directly on devices (smartphones, IoT sensors, embedded systems), requiring model compression to meet memory and compute constraints.
Serverless deployment: Cloud platforms like AWS Lambda or Google Cloud Functions host the model, scaling automatically with request volume.
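The REST serving pattern can be sketched with only the Python standard library. This is a bare-bones illustration rather than a production setup: `predict` is a hypothetical fixed linear scorer standing in for a trained model, the JSON field names are invented, and there is no input validation, batching, or authentication, all of which the dedicated serving frameworks named above are designed to handle.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Hypothetical model: a fixed linear scorer standing in for a trained model."""
    weights = [0.4, -0.2]
    return sum(w * x for w, x in zip(weights, features))

def handle_request(body: bytes) -> bytes:
    """Pure request logic, kept separate from HTTP so it can be tested directly."""
    payload = json.loads(body)
    return json.dumps({"prediction": predict(payload["features"])}).encode()

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = handle_request(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(response)

def serve(port=8000):
    """Blocks until interrupted; call explicitly to start the server."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()

# The same logic used for batch inference: no server, just a loop over inputs.
batch_predictions = [predict(f) for f in [[1.0, 2.0], [2.0, 1.0]]]
print(handle_request(b'{"features": [1.0, 2.0]}'))
```

Calling `serve()` would expose the model over HTTP on localhost, while the final lines show that batch inference reuses the same `predict` function without any network layer.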

Model lifecycle

The full lifecycle of a machine learning model encompasses several stages, often managed under the discipline of MLOps (Machine Learning Operations):

Problem definition: Identify the business problem and formulate it as a machine learning task.
Data collection and preparation: Gather, clean, and preprocess the training data.
Feature engineering: Create informative input features from raw data.
Model development: Select architectures, train candidate models, and tune hyperparameters.
Evaluation: Assess model performance on held-out data using appropriate metrics.
Deployment: Serve the model in production.
Monitoring: Continuously track model performance, data drift (changes in input data distributions), and concept drift (changes in the relationship between inputs and outputs).
Retraining: Periodically retrain or update the model to maintain performance as conditions change.
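As a rough illustration of the monitoring stage, data drift in a single numeric feature can be flagged by comparing recent production values with a reference sample from training. The heuristic below (shift in the mean, measured in reference standard deviations, against an arbitrary threshold) is an assumption chosen for brevity, not a standard drift test, and the data is invented.

```python
import statistics

def mean_shift_score(reference, live):
    """Shift of the live mean from the reference mean, in reference standard deviations."""
    ref_std = statistics.stdev(reference)
    return abs(statistics.mean(live) - statistics.mean(reference)) / ref_std

def has_drifted(reference, live, threshold=2.0):
    """Flag drift when the mean has moved more than `threshold` standard deviations."""
    return mean_shift_score(reference, live) > threshold

reference = [10, 11, 9, 10, 10, 11, 9, 10]   # feature values seen during training
print(has_drifted(reference, [10, 11, 10, 9]))    # False: same distribution
print(has_drifted(reference, [14, 15, 13, 14]))   # True: the mean moved from 10 to 14
```

A flag like this would typically trigger investigation or retraining. Concept drift is harder to detect this way because it requires fresh labels, not just fresh inputs.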

Tools like MLflow, Weights and Biases, and DVC (Data Version Control) help teams manage experiment tracking, model versioning, and reproducibility across the lifecycle.

Model versioning and reproducibility

Reproducibility is a cornerstone of scientific machine learning. Given the same data, code, and random seeds, a reproducible workflow should produce the same model and results. In practice, achieving reproducibility requires tracking several artifacts:

Code versioning: The model architecture and training scripts should be stored in version control (e.g., Git).
Data versioning: The exact training data should be recorded. Tools like DVC track dataset versions alongside code.
Environment tracking: The software libraries, their versions, and hardware configurations should be logged. Containers (Docker) help ensure environment consistency.
Experiment tracking: Hyperparameters, metrics, and model artifacts for each experiment should be logged. MLflow, Weights and Biases, and Neptune are popular experiment tracking platforms.
Model registry: A centralized repository for managing model versions, stage transitions (development, staging, production), and metadata. MLflow's Model Registry and cloud-native registries (AWS SageMaker Model Registry, Vertex AI Model Registry) support this pattern.
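Two of these practices, fixing random seeds and recording exactly what was run, can be sketched with the standard library. The "training run" below is a toy stand-in that only draws random numbers, and the record layout and function names are hypothetical; dedicated tracking tools store much richer metadata.

```python
import hashlib
import json
import random

def run_experiment(config):
    """Toy 'training run' whose only randomness comes from the recorded seed."""
    rng = random.Random(config["seed"])
    return [rng.gauss(0.0, 1.0) for _ in range(config["n_weights"])]

def fingerprint(obj):
    """Stable SHA-256 digest of a JSON-serializable object (config, metrics, ...)."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

config = {"seed": 42, "n_weights": 3, "lr": 0.01}
record = {
    "config": config,
    "config_hash": fingerprint(config),     # identifies this exact configuration
    "weights": run_experiment(config),
}

# Re-running with the same config reproduces the same result.
print(run_experiment(config) == record["weights"])   # True
```

Real training adds sources of nondeterminism that a seed alone does not control, such as library versions and hardware, which is why the environment and data also need to be tracked.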

Explain like I'm 5 (ELI5)

Imagine you are learning to sort colored blocks into the right buckets. At first, you make a lot of mistakes. But every time someone tells you "that red block goes in the red bucket," you get a little bit better. After seeing enough examples, you can sort new blocks you have never seen before. A machine learning model works the same way. It is like a set of rules that a computer writes for itself by looking at lots of examples. The more examples it sees, the better the rules get, and soon it can make good guesses about things it has never seen before.

References

1. Mitchell, T. M. (1997). *Machine Learning*. McGraw-Hill. (Provides the foundational formal definition of machine learning.)
2. Vapnik, V. N., and Chervonenkis, A. Y. (1971). "On the uniform convergence of relative frequencies of events to their probabilities." *Theory of Probability and Its Applications*, 16(2), 264-280. (Introduces VC dimension and foundational concepts in model capacity and generalization.)
3. Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*. 2nd edition. Springer. (Comprehensive reference on model types, evaluation, and selection.)
4. Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*. MIT Press. (Standard reference for deep neural network models, training, and regularization.)
5. Hinton, G., Vinyals, O., and Dean, J. (2015). "Distilling the Knowledge in a Neural Network." *arXiv preprint arXiv:1503.02531*. (Foundational paper on knowledge distillation for model compression.)
6. Frankle, J. and Carbin, M. (2019). "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks." *Proceedings of ICLR 2019* (arXiv:1803.03635). (Demonstrates that large networks contain small subnetworks capable of matching full performance.)
7. Bommasani, R., Hudson, D. A., Adeli, E., et al. (2021). "On the Opportunities and Risks of Foundation Models." *arXiv preprint arXiv:2108.07258*. (Stanford CRFM/HAI report coining the term "foundation model" and analyzing its implications.)
8. Bergstra, J. and Bengio, Y. (2012). "Random Search for Hyper-Parameter Optimization." *Journal of Machine Learning Research*, 13, 281-305. (Shows that random search outperforms grid search for hyperparameter optimization.)
9. Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "'Why Should I Trust You?': Explaining the Predictions of Any Classifier." *Proceedings of KDD 2016*. (Introduces LIME for model-agnostic local interpretability.)
10. Lundberg, S. M. and Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." *Advances in Neural Information Processing Systems (NeurIPS) 2017*. (Introduces SHAP values for model explanation based on Shapley values from game theory.)
11. Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). "Reconciling modern machine-learning practice and the classical bias-variance trade-off." *Proceedings of the National Academy of Sciences*, 116(32), 15849-15854 (arXiv:1812.11118). (Introduces the "double descent" curve.)
12. Brown, T. B., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." *Advances in Neural Information Processing Systems (NeurIPS) 2020* (arXiv:2005.14165). (Introduces GPT-3, a 175-billion-parameter language model.)
