See also: Machine learning terms
In machine learning, a model is a mathematical representation of a real-world process that learns patterns from data and uses those patterns to make predictions or decisions on new, unseen inputs. At its core, a model consists of two components: an architecture (the structural blueprint defining how computations are organized) and a set of learned parameters (numerical values adjusted during training to minimize error). The architecture might be a simple linear equation, a decision tree, or a deep neural network with billions of parameters. Regardless of complexity, every model serves the same fundamental purpose: to generalize from observed examples to novel situations.
Models are central to virtually every application of artificial intelligence, from spam filters and recommendation engines to autonomous vehicles and medical diagnostics. The process of building a useful model involves selecting an appropriate architecture, training it on data, evaluating its performance, and deploying it to serve predictions in production. Understanding what a model is and how models differ from one another is foundational to the entire field of machine learning.
Formally, a machine learning model can be described as a function f that maps an input x (often a vector of features) to an output y:
y = f(x; theta)
where theta represents the model's parameters. During training, an optimization algorithm adjusts theta to minimize a loss function L(y_predicted, y_actual), which measures the discrepancy between the model's predictions and the true values. The choice of loss function depends on the task. For regression, the mean squared error is common; for classification, cross-entropy loss is widely used.
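To make this concrete, here is a minimal NumPy sketch that fits the linear model f(x; theta) = wx + b by gradient descent on synthetic data; the learning rate, step count, and true coefficients are arbitrary choices for illustration:

```python
import numpy as np

# Synthetic data: y = 3x + 2 plus noise (the "true" process the model must learn)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 2 + rng.normal(scale=0.1, size=100)

# Model: f(x; theta) = w * x + b, with theta = (w, b)
w, b = 0.0, 0.0
lr = 0.1  # learning rate

for step in range(500):
    error = (w * x + b) - y            # y_predicted - y_actual
    loss = np.mean(error ** 2)         # mean squared error L
    grad_w = 2 * np.mean(error * x)    # dL/dw
    grad_b = 2 * np.mean(error)        # dL/db
    w -= lr * grad_w                   # gradient descent update on theta
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")  # approaches w=3, b=2
```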
Tom M. Mitchell provided an influential formal definition of machine learning more broadly: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." The model, in this framing, is the artifact that encodes the learned experience.
Machine learning models can be categorized along several dimensions. The most common taxonomies are based on the learning paradigm, the modeling approach, and the nature of the output.
| Paradigm | Description | Example algorithms |
|---|---|---|
| Supervised learning | Learns from labeled input-output pairs to predict outputs for new inputs | Linear regression, logistic regression, SVMs, random forests, neural networks |
| Unsupervised learning | Discovers hidden structure in unlabeled data | K-means clustering, PCA, autoencoders, Gaussian mixture models |
| Semi-supervised learning | Uses a small amount of labeled data alongside a large amount of unlabeled data | Label propagation, self-training, co-training |
| Reinforcement learning | Learns optimal actions through interaction with an environment and reward signals | Q-learning, Deep Q-Networks, PPO, SAC |
| Self-supervised learning | Generates supervisory signals from the data itself, often through pretext tasks | BERT (masked language modeling), contrastive learning (SimCLR) |
Classification models predict discrete categories or labels. A spam filter, for instance, classifies emails as "spam" or "not spam." Common classification algorithms include logistic regression, support vector machines, decision trees, random forests, and neural networks trained with cross-entropy loss.
Regression models predict continuous numerical values. Predicting house prices based on features like square footage and location is a classic regression task. Linear regression, polynomial regression, and neural networks trained with mean squared error are typical regression models.
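A minimal sketch of both task types, assuming scikit-learn is available (synthetic datasets for illustration):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: predict discrete labels
Xc, yc = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(Xc, yc)
print(clf.predict(Xc[:3]))   # discrete class labels, e.g. 0 or 1

# Regression: predict continuous values
Xr, yr = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print(reg.predict(Xr[:3]))   # real-valued outputs
```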
Generative models learn the joint probability distribution P(X, Y) or the data distribution P(X) directly, enabling them to generate new data samples. Examples include Gaussian mixture models, variational autoencoders (VAEs), generative adversarial networks (GANs), and autoregressive language models like the GPT series.
Discriminative models learn the conditional probability P(Y | X) or a direct mapping from inputs to outputs. They focus on finding decision boundaries rather than modeling the full data distribution. Logistic regression, SVMs, and most neural network classifiers are discriminative models.
| Aspect | Generative models | Discriminative models |
|---|---|---|
| What they learn | Joint distribution P(X, Y) or P(X) | Conditional distribution P(Y \| X) or direct mapping |
| Can generate new samples | Yes | No |
| Typical use cases | Data generation, density estimation, imputation | Classification, regression |
| Examples | Naive Bayes, VAEs, GANs, GPT | Logistic regression, SVMs, random forests |
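To make the contrast concrete, here is a small NumPy sketch (toy one-dimensional data with illustrative class means): a generative Gaussian model of P(X | Y) and P(Y) can both sample new data and classify via Bayes' rule, whereas a discriminative model would learn only P(Y | X).

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D data: class 0 centered at -2, class 1 centered at +2
X0 = rng.normal(-2.0, 1.0, size=100)
X1 = rng.normal(+2.0, 1.0, size=100)

# Generative: estimate P(X | Y) as a Gaussian per class, plus priors P(Y)
mu, var = {0: X0.mean(), 1: X1.mean()}, {0: X0.var(), 1: X1.var()}
prior = {0: 0.5, 1: 0.5}

def gauss_pdf(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

# Because the joint distribution is modeled, we can generate a new sample...
y_new = rng.choice([0, 1], p=[prior[0], prior[1]])
x_new = rng.normal(mu[y_new], np.sqrt(var[y_new]))

# ...and also classify, by deriving P(Y | X) with Bayes' rule
def p_y1_given_x(x):
    p1 = gauss_pdf(x, mu[1], var[1]) * prior[1]
    p0 = gauss_pdf(x, mu[0], var[0]) * prior[0]
    return p1 / (p0 + p1)

print(f"generated sample: x={x_new:.2f} with label {y_new}")
print(f"P(Y=1 | X=0.5) = {p_y1_given_x(0.5):.2f}")
```

A discriminative model such as logistic regression would estimate P(Y | X) directly and could not produce the generated sample above.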
Another important distinction is between parametric and non-parametric models.
Parametric models assume a specific functional form for the relationship between inputs and outputs and have a fixed number of parameters regardless of the size of the training data. Linear regression (with its weight vector and bias term) is a classic parametric model. Parametric models are computationally efficient and easy to interpret, but they can underfit if the assumed functional form is too rigid for the true data distribution.
Non-parametric models do not assume a fixed functional form. Their complexity grows with the amount of training data, effectively allowing an infinite number of parameters. K-nearest neighbors (KNN), decision trees, and Gaussian processes are non-parametric models. They are more flexible and can capture complex patterns, but they typically require more data, are slower at inference, and carry a higher risk of overfitting.
| Property | Parametric models | Non-parametric models |
|---|---|---|
| Number of parameters | Fixed | Grows with data |
| Assumptions | Strong (fixed functional form) | Weak or none |
| Data requirements | Lower | Higher |
| Interpretability | Generally higher | Generally lower |
| Risk of underfitting | Higher if form is misspecified | Lower |
| Examples | Linear regression, logistic regression, naive Bayes | KNN, decision trees, random forests, Gaussian processes |
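The distinction is visible in code: a K-nearest neighbors "model" stores the entire training set and has no trained parameters at all. A minimal NumPy sketch:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points.

    There are no learned parameters: the "model" is the training data itself,
    so its size grows with the dataset (non-parametric).
    """
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.bincount(y_train[nearest]).argmax()

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9])))  # -> 1
```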
A model's complexity refers to the richness of functions it can represent. A linear model with two parameters (slope and intercept) is simple, while a deep neural network with hundreds of billions of parameters is highly complex. Model capacity is the closely related concept of a model's ability to fit a wide variety of functions.
One formal measure of capacity is the Vapnik-Chervonenkis (VC) dimension, introduced by Vladimir Vapnik and Alexey Chervonenkis in 1971. The VC dimension of a model is the size of the largest set of data points the model can shatter, meaning there is some arrangement of that many points for which the model can realize every possible labeling. A model with a higher VC dimension can represent more complex decision boundaries but also has a greater risk of overfitting.
The relationship between capacity and generalization is captured by the bias-variance tradeoff:
As model complexity increases, bias typically decreases (the model can fit the training data more closely), but variance increases (the model becomes more sensitive to noise). The optimal model balances these two sources of error. Classical statistical learning theory prescribes choosing the simplest model that adequately captures the data patterns. However, modern deep learning has complicated this picture: very large neural networks sometimes generalize well despite having far more parameters than training examples, a phenomenon known as "double descent" that challenges traditional views of the bias-variance tradeoff.
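The tradeoff can be observed empirically. Below is a small NumPy simulation (synthetic sine data; the degrees, noise level, and test point are arbitrary illustrative choices) that refits polynomials of varying complexity to many resampled training sets and estimates bias and variance at a single test point:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)  # the "true" process

x_test = 0.25  # fixed query point; true value is sin(pi/2) = 1

def fit_and_predict(degree):
    """Fit a polynomial of the given degree to a fresh noisy sample,
    then predict at the fixed test point."""
    x = rng.uniform(0, 1, 20)
    y = true_f(x) + rng.normal(scale=0.3, size=20)
    return np.polyval(np.polyfit(x, y, degree), x_test)

for degree in (1, 4, 9):
    preds = np.array([fit_and_predict(degree) for _ in range(500)])
    bias2 = (preds.mean() - true_f(x_test)) ** 2  # systematic error
    variance = preds.var()                        # sensitivity to the sample
    print(f"degree={degree}  bias^2={bias2:.3f}  variance={variance:.3f}")
```

The low-degree fit shows high bias and low variance; the high-degree fit shows the reverse, matching the tradeoff described above.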
Training (also called fitting or learning) is the process of adjusting a model's parameters to minimize a loss function on the training data. The most common optimization approach for neural networks is gradient descent and its variants:
Stochastic gradient descent (SGD) processes random subsets (mini-batches) of the training data at each step rather than the full dataset, which is computationally cheaper and introduces beneficial noise that can help escape local minima. Adaptive optimizers like Adam, AdaGrad, and RMSProp adjust the learning rate for each parameter individually, often converging faster than vanilla SGD.
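A minimal mini-batch SGD loop in NumPy (synthetic linear data; the learning rate and batch size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(5)
lr, batch_size = 0.05, 32

for epoch in range(20):
    order = rng.permutation(len(X))                 # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]       # one random mini-batch
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # MSE gradient on the batch
        w -= lr * grad                              # SGD update

print(np.round(w, 2))  # approaches true_w
```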
Key training considerations include:
- The learning rate, which controls the step size of each parameter update; too high causes divergence, too low slows convergence.
- The batch size, which trades off gradient noise against memory use and throughput.
- The number of epochs, often bounded by early stopping when validation loss stops improving.
- Regularization (L1/L2 penalties, dropout, data augmentation) to reduce overfitting.
- Weight initialization, which affects the stability of optimization in deep networks.
After training, a model must be evaluated to estimate how well it will perform on unseen data. Evaluation involves choosing appropriate metrics and validation strategies.
| Task type | Metric | Description |
|---|---|---|
| Classification | Accuracy | Fraction of correct predictions |
| Classification | Precision | Fraction of positive predictions that are truly positive |
| Classification | Recall (sensitivity) | Fraction of actual positives correctly identified |
| Classification | F1 score | Harmonic mean of precision and recall |
| Classification | AUC-ROC | Area under the receiver operating characteristic curve |
| Regression | Mean squared error (MSE) | Average squared difference between predictions and true values |
| Regression | Mean absolute error (MAE) | Average absolute difference between predictions and true values |
| Regression | R-squared | Proportion of variance in the target explained by the model |
| Generation | Perplexity | Exponentiated average negative log-likelihood per token (lower is better for language models) |
| Generation | BLEU / ROUGE | N-gram overlap metrics for text generation |
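For example, the core classification metrics follow directly from confusion-matrix counts; a short NumPy sketch with made-up predictions:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"acc={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```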
Train-test split divides the data into a training set and a held-out test set. The model is trained on the training set and evaluated on the test set. A common split is 80/20 or 70/30.
Cross-validation (typically k-fold) divides the data into k subsets. The model is trained k times, each time using a different subset as the test fold and the remaining k-1 subsets for training. The results are averaged to produce a more robust performance estimate.
Held-out validation set reserves a portion of the training data for hyperparameter tuning, preventing the test set from being used for model selection decisions (which would leak information and produce overly optimistic estimates).
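A minimal 5-fold cross-validation sketch, assuming scikit-learn is available (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
# Train 5 times, each fold serving once as the evaluation fold
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores, scores.mean())  # per-fold accuracies and their average
```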
Model selection is the process of choosing among candidate models. This encompasses both the choice of algorithm (e.g., random forest vs. neural network) and the tuning of hyperparameters (e.g., the number of layers, the learning rate, or the regularization strength).
Common model selection strategies include:
- Grid search, which exhaustively evaluates every combination in a predefined hyperparameter grid.
- Random search, which samples configurations randomly and often finds good settings with fewer trials.
- Bayesian optimization, which builds a probabilistic model of the validation score to choose promising configurations.
- Cross-validation, used with any of the above to score each candidate robustly.
The goal of model selection is to identify the model that will generalize best to new data, not merely the model that fits the training data most closely.
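A common concrete workflow is grid search with cross-validation; a sketch assuming scikit-learn, with an arbitrary toy parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    cv=5,  # each candidate is scored by 5-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)  # winning hyperparameters
```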
Transfer learning is the practice of taking a model trained on one task or dataset and adapting it for a different, often related, task. This approach has become dominant in modern machine learning because training large models from scratch requires enormous computational resources and data.
A pre-trained model is one that has already been trained on a large-scale dataset. For example, ImageNet-trained convolutional neural networks learn general visual features (edges, textures, shapes) that transfer well to other vision tasks. In natural language processing, language models pre-trained on vast text corpora (such as BERT or GPT) learn linguistic representations that can be fine-tuned for specific tasks like sentiment analysis, question answering, or named entity recognition.
Transfer learning typically follows one of two approaches:
- Feature extraction: the pre-trained model's weights are frozen and used as a fixed feature encoder, with only a new task-specific head (such as a final classification layer) trained on the target data.
- Fine-tuning: some or all of the pre-trained weights are further trained on the target task, usually at a small learning rate to avoid destroying the learned representations.
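For illustration, a minimal feature-extraction sketch, assuming PyTorch and torchvision are available and a hypothetical 10-class target task:

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet pre-trained backbone
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Feature extraction: freeze all pre-trained weights...
for param in model.parameters():
    param.requires_grad = False

# ...and replace the final layer with a new head for the target task.
model.fc = nn.Linear(model.fc.in_features, 10)  # only this layer will train

# Fine-tuning would instead leave some or all parameters trainable
# and continue training at a small learning rate.
```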
Foundation models are large-scale models trained on broad, diverse datasets that can be adapted to a wide range of downstream tasks. The term was coined by researchers at the Stanford Institute for Human-Centered Artificial Intelligence (HAI) in 2021. Examples include GPT-4, Claude, LLaMA, BERT, CLIP, and Stable Diffusion.
Foundation models represent a paradigm shift in AI development. Rather than training a specialized model for each task, a single large model is pre-trained once (at great expense) and then adapted, either through fine-tuning or through in-context learning (prompting), to serve many different applications. This approach reduces the need for task-specific labeled data and makes powerful AI capabilities accessible to organizations that lack the resources to train models from scratch.
The emergence of foundation models has raised important questions about centralization (a small number of organizations control the most capable models), safety (misuse potential scales with model capability), and environmental impact (training runs for frontier models consume substantial energy).
Deploying large models in resource-constrained environments (mobile devices, edge hardware, real-time applications) often requires reducing their size and computational cost. The three principal model compression techniques are:
Pruning removes redundant weights, neurons, or entire layers from a trained model. Unstructured pruning sets individual weights to zero, while structured pruning removes entire filters or channels. Research has shown that neural networks can often be pruned by 90% or more with minimal accuracy loss, supporting the "lottery ticket hypothesis" proposed by Frankle and Carbin (2019), which states that dense networks contain sparse subnetworks that, when trained in isolation from their original initialization, can match the full network's performance.
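A minimal sketch of unstructured magnitude pruning in NumPy (illustrative weight matrix and sparsity level):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Unstructured pruning: zero out the smallest-magnitude weights."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold  # keep only the largest weights
    return weights * mask, mask

w = np.random.default_rng(0).normal(size=(256, 256))
w_pruned, mask = magnitude_prune(w, sparsity=0.9)
print(f"fraction zeroed: {1 - mask.mean():.2f}")  # ~0.90
```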
Quantization reduces the numerical precision of model weights and activations. Instead of storing parameters as 32-bit floating-point numbers (FP32), quantization converts them to lower-precision formats such as 16-bit (FP16), 8-bit integers (INT8), or even 4-bit or 2-bit representations. This reduces memory usage and accelerates inference on hardware that supports low-precision arithmetic.
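A minimal sketch of symmetric INT8 quantization in NumPy (per-tensor scaling; real toolchains typically use per-channel scales and calibration data):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric INT8 quantization: map floats onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(f"max error: {np.abs(w - w_hat).max():.4f}")  # small quantization noise
# Storage drops from 4 bytes (FP32) to 1 byte (INT8) per weight: 4x smaller.
```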
Knowledge distillation trains a smaller "student" model to replicate the behavior of a larger "teacher" model. The student learns not just from the hard labels in the training data but also from the teacher's soft probability distributions ("dark knowledge"), which convey richer information about the relationships between classes. Hinton, Vinyals, and Dean (2015) introduced this technique, and it has since been widely adopted for deploying compact models.
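A minimal sketch of the distillation objective in NumPy (made-up logits and temperature; in practice this KL term is combined with ordinary cross-entropy on the hard labels):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax: higher T spreads probability mass."""
    z = z / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between softened teacher and student distributions."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return np.sum(p_t * (np.log(p_t) - np.log(p_s)))

teacher = np.array([8.0, 2.0, 1.0])  # confident teacher logits
student = np.array([3.0, 1.5, 0.5])  # student logits being trained
print(distillation_loss(student, teacher))
```

The softened teacher distribution is what carries the "dark knowledge": even the non-winning classes receive informative probability mass.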
| Technique | What it reduces | Typical savings | Trade-off |
|---|---|---|---|
| Pruning | Number of non-zero parameters | 50-95% sparsity | Requires hardware support for sparse operations |
| Quantization | Precision of weights and activations | 2-8x memory reduction | Small accuracy loss; may need calibration |
| Knowledge distillation | Model size (fewer parameters) | 3-10x smaller model | Student may not fully match teacher quality |
As models are deployed in high-stakes domains like healthcare, finance, and criminal justice, understanding why a model makes a particular prediction becomes critical. Model interpretability refers to the degree to which a human can understand the cause of a model's decisions.
Some models are inherently interpretable. Linear regression models have coefficients that directly quantify each feature's influence. Decision trees produce human-readable if-then rules. These models are sometimes called "glass-box" models.
More complex models, particularly deep neural networks, are often treated as black boxes. Several post-hoc explanation methods have been developed to interpret their predictions:
- LIME (Local Interpretable Model-agnostic Explanations), which fits a simple surrogate model around an individual prediction.
- SHAP (SHapley Additive exPlanations), which attributes a prediction to features using Shapley values from cooperative game theory.
- Saliency maps and gradient-based attributions (such as integrated gradients), which highlight the input regions that most influence a neural network's output.
- Permutation importance, which measures how much performance degrades when a feature's values are shuffled.
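Permutation importance is among the simplest of these to implement; a sketch assuming scikit-learn, with synthetic data (scored in-sample for brevity, where real usage would use held-out data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=2, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)
baseline = model.score(X, y)

rng = np.random.default_rng(0)
for j in range(X.shape[1]):
    X_shuffled = X.copy()
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])  # destroy feature j
    drop = baseline - model.score(X_shuffled, y)
    print(f"feature {j}: importance ~ {drop:.3f}")  # bigger drop = more important
```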
Deploying a model means making it available to serve predictions in a production environment. Deployment introduces engineering challenges that go beyond model accuracy, including latency requirements, throughput, reliability, and cost.
Common deployment patterns include:
- Batch inference: predictions are computed offline over large datasets at scheduled intervals.
- Online (real-time) serving: the model runs behind an API endpoint and answers individual requests with low latency.
- Streaming inference: the model consumes events from a message queue and emits predictions continuously.
- Edge deployment: a compressed model runs directly on a device (phone, sensor, vehicle) to avoid network round-trips.
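As an illustration of the online pattern, here is a minimal sketch of a prediction endpoint, assuming Flask is installed and a hypothetical serialized model saved at `model.pkl` (production systems add input validation, batching, monitoring, and autoscaling on top of this pattern):

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)  # load the trained model once at startup

@app.route("/predict", methods=["POST"])
def predict():
    features = request.json["features"]  # e.g. {"features": [[1.0, 2.0]]}
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=8080)
```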
The full lifecycle of a machine learning model encompasses several stages, often managed under the discipline of MLOps (Machine Learning Operations):
- Data collection and preparation, including labeling, cleaning, and feature engineering.
- Model development: architecture selection, training, and hyperparameter tuning.
- Evaluation and validation against held-out data and business requirements.
- Deployment to a production serving environment.
- Monitoring for data drift, prediction quality, latency, and cost.
- Retraining or rollback as data and requirements evolve.
Tools like MLflow, Weights & Biases, and DVC (Data Version Control) help teams manage experiment tracking, model versioning, and reproducibility across the lifecycle.
Reproducibility is a cornerstone of scientific machine learning. Given the same data, code, and random seeds, a reproducible workflow should produce the same model and results. In practice, achieving reproducibility requires tracking several artifacts:
- The exact version of the training code (for example, a Git commit hash).
- The dataset and any preprocessing applied to it (data versioning).
- The full hyperparameter configuration.
- The random seeds used for initialization, shuffling, and sampling.
- The software environment: library versions, hardware, and drivers.
- The trained model weights themselves.
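A common first step is pinning random seeds; a minimal sketch using the Python standard library and NumPy (deep learning frameworks add their own generators):

```python
import os
import random

import numpy as np

def set_seeds(seed=42):
    """Pin the main sources of randomness in a typical Python ML stack."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Frameworks add their own generators, e.g. torch.manual_seed(seed).

set_seeds(42)
```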
Imagine you are learning to sort colored blocks into the right buckets. At first, you make a lot of mistakes. But every time someone tells you "that red block goes in the red bucket," you get a little bit better. After seeing enough examples, you can sort new blocks you have never seen before. A machine learning model works the same way. It is like a set of rules that a computer writes for itself by looking at lots of examples. The more examples it sees, the better the rules get, and soon it can make good guesses about things it has never seen before.