# Model

> Source: https://aiwiki.ai/wiki/model
> Updated: 2026-06-21
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

In [machine learning](/wiki/machine_learning), a **model** is a mathematical representation of a real-world process that learns patterns from data and uses those patterns to make predictions or decisions on new, unseen inputs.[1][3] Every model combines two parts: an **architecture** (the structural blueprint defining how computations are organized) and a set of **learned [parameters](/wiki/parameter)** (numerical values adjusted during [training](/wiki/training) to minimize error).[4] The architecture can be as simple as a linear equation or a decision tree, or as large as a deep [neural network](/wiki/neural_network) with hundreds of billions of parameters: OpenAI's GPT-3, for example, has 175 billion parameters, described in its 2020 paper as "10x more than any previous non-sparse language model."[12] Regardless of complexity, every model serves the same fundamental purpose: to generalize from observed examples to novel situations.

## Introduction

Models are central to virtually every application of artificial intelligence, from spam filters and recommendation engines to autonomous vehicles and medical diagnostics. The process of building a useful model involves selecting an appropriate architecture, training it on data, evaluating its performance, and deploying it to serve predictions in production. Understanding what a model is and how models differ from one another is foundational to the entire field of machine learning.

## What is the formal definition of a model?

Formally, a machine learning model can be described as a function f that maps an input x (often a vector of features) to an output y:

y = f(x; theta)

where theta represents the model's parameters. During training, an optimization algorithm adjusts theta to minimize a loss function L(y_predicted, y_actual), which measures the discrepancy between the model's predictions and the true values.[4] The choice of loss function depends on the task. For regression, the mean squared error is common; for classification, cross-entropy loss is widely used.[3]
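As a minimal sketch of the notation above, the snippet below writes f(x; theta) as a linear model and implements the two losses mentioned. The function names, the example parameter values, and the choice of a linear form are illustrative assumptions, not part of any particular library.

```python
import math

def linear_model(x, theta):
    """f(x; theta) for a linear model, with theta = (weights, bias)."""
    weights, bias = theta
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def mean_squared_error(y_pred, y_true):
    """Regression loss: average squared difference."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

def binary_cross_entropy(p_pred, y_true, eps=1e-12):
    """Classification loss for predicted probabilities of the positive class."""
    total = 0.0
    for p, t in zip(p_pred, y_true):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

theta = ([2.0, -1.0], 0.5)            # hypothetical parameter values
inputs = [[1.0, 1.0], [0.0, 2.0]]
targets = [1.0, -1.0]
predictions = [linear_model(x, theta) for x in inputs]
mse = mean_squared_error(predictions, targets)
```

Training would then search for the theta that makes `mse` (or the cross-entropy, for a classifier) as small as possible.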

Tom M. Mitchell provided an influential formal definition of machine learning more broadly: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."[1] The model, in this framing, is the artifact that encodes the learned experience.

## What are the main types of models?

Machine learning models can be categorized along several dimensions. The most common taxonomies are based on the learning paradigm, the modeling approach, and the nature of the output.[3]

### By learning paradigm

| Paradigm | Description | Example algorithms |
|---|---|---|
| Supervised learning | Learns from labeled input-output pairs to predict outputs for new inputs | Linear regression, logistic regression, SVMs, random forests, neural networks |
| Unsupervised learning | Discovers hidden structure in unlabeled data | K-means clustering, PCA, autoencoders, Gaussian mixture models |
| Semi-supervised learning | Uses a small amount of labeled data alongside a large amount of unlabeled data | Label propagation, self-training, co-training |
| [Reinforcement learning](/wiki/reinforcement_learning) | Learns optimal actions through interaction with an environment and reward signals | Q-learning, Deep Q-Networks, PPO, SAC |
| Self-supervised learning | Generates supervisory signals from the data itself, often through pretext tasks | [BERT](/wiki/bert) (masked language modeling), contrastive learning (SimCLR) |

### By output type

**[Classification models](/wiki/classification_model)** predict discrete categories or labels. A spam filter, for instance, classifies emails as "spam" or "not spam." Common classification algorithms include logistic regression, support vector machines, decision trees, random forests, and neural networks trained with cross-entropy loss.[3]

**[Regression models](/wiki/regression_model)** predict continuous numerical values. Predicting house prices based on features like square footage and location is a classic regression task. Linear regression, polynomial regression, and neural networks trained with mean squared error are typical regression models.

**[Generative models](/wiki/generative_model)** learn the joint probability distribution P(X, Y) or the data distribution P(X) directly, enabling them to generate new data samples.[3] Examples include Gaussian mixture models, variational autoencoders (VAEs), [generative adversarial networks](/wiki/generative_adversarial_network) (GANs), and autoregressive language models like the [GPT](/wiki/gpt) series.[4]

**Discriminative models** learn the conditional probability P(Y | X) or a direct mapping from inputs to outputs. They focus on finding decision boundaries rather than modeling the full data distribution. Logistic regression, SVMs, and most neural network classifiers are discriminative models.[3]

| Aspect | Generative models | Discriminative models |
|---|---|---|
| What they learn | Joint distribution P(X, Y) or P(X) | Conditional distribution P(Y &#124; X) or direct mapping |
| Can generate new samples | Yes | No |
| Typical use cases | Data generation, density estimation, imputation | Classification, regression |
| Examples | Naive Bayes, VAEs, GANs, GPT | Logistic regression, SVMs, random forests |

### Parametric vs. non-parametric models

Another important distinction is between parametric and non-parametric models.

**Parametric models** assume a specific functional form for the relationship between inputs and outputs and have a fixed number of parameters regardless of the size of the training data. Linear regression (with its weight vector and bias term) is a classic parametric model. Simple parametric models are often computationally efficient and easy to interpret, but they can underfit if the assumed functional form is too rigid for the true data distribution.[3]

**Non-parametric models** do not assume a fixed functional form. Their complexity grows with the amount of training data, so the effective number of parameters is not fixed in advance and can grow without bound. K-nearest neighbors (KNN), decision trees, and Gaussian processes are non-parametric models. They are more flexible and can capture complex patterns, but they typically require more data, are often slower at inference, and carry a higher risk of [overfitting](/wiki/overfitting).[3]

| Property | Parametric models | Non-parametric models |
|---|---|---|
| Number of parameters | Fixed | Grows with data |
| Assumptions | Strong (fixed functional form) | Weak or none |
| Data requirements | Lower | Higher |
| Interpretability | Generally higher | Generally lower |
| Risk of underfitting | Higher if form is misspecified | Lower |
| Examples | Linear regression, logistic regression, naive Bayes | KNN, decision trees, random forests, Gaussian processes |
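The contrast can be made concrete with a small sketch: a least-squares line always stores two numbers, while a nearest-neighbor predictor stores the whole training set. The helper names below (`fit_line`, `NearestNeighbor`) are hypothetical and written only for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares line: always two parameters, whatever len(xs) is."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

class NearestNeighbor:
    """1-nearest-neighbor regressor: the 'model' is the stored training set."""

    def fit(self, xs, ys):
        self.points = list(zip(xs, ys))  # grows with the data
        return self

    def predict(self, x):
        # Return the target of the closest stored input.
        return min(self.points, key=lambda p: abs(p[0] - x))[1]

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated by y = 2x + 1
line_params = fit_line(xs, ys)
knn = NearestNeighbor().fit(xs, ys)
```

Doubling the training set leaves `line_params` a pair of numbers but doubles `len(knn.points)`, which is also why nearest-neighbor prediction gets slower as data accumulates.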

## Model complexity and capacity

A model's **complexity** refers to the richness of functions it can represent. A linear model with two parameters (slope and intercept) is simple, while a deep neural network with hundreds of billions of parameters is highly complex. [Model capacity](/wiki/model_capacity) is the closely related concept of a model's ability to fit a wide variety of functions.

One formal measure of capacity is the **Vapnik-Chervonenkis (VC) dimension**, introduced by Vladimir Vapnik and Alexey Chervonenkis in their 1971 paper "On the uniform convergence of relative frequencies of events to their probabilities."[2] The VC dimension of a model class is the size of the largest set of data points that it can shatter, meaning that for every possible assignment of binary labels to those points, some member of the class classifies all of them correctly. A model with a higher VC dimension can represent more complex decision boundaries but also has a greater risk of overfitting.[2]
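Shattering can be checked by brute force for a tiny hypothesis class. The sketch below uses threshold classifiers on the real line, an example chosen here for illustration rather than taken from the cited paper. A finite grid of thresholds is enough for these points because only the ordering of thresholds and points matters.

```python
def labelings_realized(points, hypotheses):
    """Set of label tuples the hypothesis class produces on the points."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def shatters(points, hypotheses):
    """True if every one of the 2^n labelings of the points is realized."""
    return len(labelings_realized(points, hypotheses)) == 2 ** len(points)

# Threshold classifiers: h_t(x) = 1 if x >= t else 0.
candidates = [-1.0, 0.5, 1.5, 2.5, 3.5]   # thresholds around and between the points
thresholds = [lambda x, t=t: int(x >= t) for t in candidates]

one_point = shatters([1.0], thresholds)         # both labels reachable
two_points = shatters([1.0, 2.0], thresholds)   # labeling (1, 0) is unreachable
```

A single point can be shattered but no pair can, since a threshold can never label the smaller point 1 and the larger point 0. The VC dimension of this class is therefore 1.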

The relationship between capacity and generalization is captured by the **bias-variance tradeoff**:[3]

- **Bias** measures how far the model's average predictions are from the true values. High bias indicates underfitting: the model is too simple to capture the underlying patterns.
- **Variance** measures how much the model's predictions fluctuate across different training sets. High variance indicates overfitting: the model is too sensitive to the specific training data.

As model complexity increases, bias typically decreases (the model can fit the training data more closely), but variance increases (the model becomes more sensitive to noise). The optimal model balances these two sources of error.[3] Classical statistical learning theory prescribes choosing the simplest model that adequately captures the data patterns.[2] However, modern deep learning has complicated this picture: very large neural networks sometimes generalize well despite having far more parameters than training examples. The associated pattern, in which test error rises as capacity approaches the point where the model fits the training data exactly and then falls again as capacity grows further, is known as "double descent." Belkin, Hsu, Ma, and Mandal described it in 2019, writing that the "double descent" curve "subsumes the textbook U-shaped bias-variance trade-off curve by showing how increasing model capacity beyond the point of interpolation results in improved performance," which challenges traditional views of the bias-variance tradeoff.[11]
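For squared-error loss, the tradeoff can be written as an exact decomposition. The symbols here are assumptions introduced for this sketch: the data follow y = f(x) + ε with zero-mean noise of variance σ², the fitted model is written as f̂, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on the model, which is why changing capacity trades one against the other while the noise term sets a floor on the achievable error.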

## How are models trained?

Training (also called fitting or learning) is the process of adjusting a model's parameters to minimize a loss function on the training data. The most common family of optimization methods for neural networks is **gradient descent** and its variants, which repeat four steps:[4]

1. **Compute predictions** by passing training examples through the model (forward pass).
2. **Calculate the loss** between predictions and true labels.
3. **Compute gradients** of the loss with respect to each parameter using [backpropagation](/wiki/backpropagation).[4]
4. **Update parameters** by moving them a small step, scaled by the learning rate, in the direction of the negative gradient, which locally reduces the loss.

Stochastic gradient descent (SGD) processes random subsets (mini-batches) of the training data at each step rather than the full dataset, which is computationally cheaper and introduces beneficial noise that can help escape local minima. Adaptive optimizers like Adam, AdaGrad, and RMSProp adjust the learning rate for each parameter individually, often converging faster than vanilla SGD.[4]
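The four steps and the mini-batch idea can be sketched for a one-feature linear model, where the gradients are simple enough to write by hand instead of using backpropagation. The function name, default hyperparameters, and toy data are assumptions made for this example.

```python
import random

def train_linear_sgd(xs, ys, learning_rate=0.05, batch_size=2, epochs=500, seed=0):
    """Fit y ~ w*x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                       # fresh random mini-batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # 1. Forward pass: predictions for the mini-batch.
            preds = [w * xs[i] + b for i in batch]
            # 2. Loss is mean((pred - y)^2); its residuals are:
            errors = [p - ys[i] for p, i in zip(preds, batch)]
            # 3. Gradients of the loss with respect to w and b.
            grad_w = 2 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2 * sum(errors) / len(batch)
            # 4. Update: step against the gradient.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated by y = 2x + 1, no noise
w, b = train_linear_sgd(xs, ys)
```

On this noise-free data the parameters approach w = 2 and b = 1. Raising `learning_rate` far enough makes the updates diverge, which is the instability described for overly large learning rates.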

Key training considerations include:

- **Learning rate:** Controls the step size of parameter updates. Too large a rate causes instability; too small a rate leads to slow convergence.
- **Batch size:** The number of examples processed before each parameter update. Larger batches give more stable gradient estimates but require more memory.
- **Epochs:** One epoch is a complete pass through the training dataset. Models typically train for many epochs.
- **Regularization:** Techniques like L1/L2 penalties, [dropout](/wiki/dropout), and early stopping prevent overfitting by constraining the model's effective complexity.[4]

## How is a model evaluated?

After training, a model must be evaluated to estimate how well it will perform on unseen data. Evaluation involves choosing appropriate metrics and validation strategies.[3]

### Common evaluation metrics

| Task type | Metric | Description |
|---|---|---|
| Classification | Accuracy | Fraction of correct predictions |
| Classification | Precision | Fraction of positive predictions that are truly positive |
| Classification | Recall (sensitivity) | Fraction of actual positives correctly identified |
| Classification | F1 score | Harmonic mean of precision and recall |
| Classification | AUC-ROC | Area under the receiver operating characteristic curve |
| Regression | Mean squared error (MSE) | Average squared difference between predictions and true values |
| Regression | Mean absolute error (MAE) | Average absolute difference between predictions and true values |
| Regression | R-squared | Proportion of variance in the target explained by the model |
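| Generation | Perplexity | Exponential of the average negative log-likelihood per token on the test set, equivalently the inverse test-set probability normalized by the number of tokens (lower is better for language models) |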
| Generation | BLEU / ROUGE | N-gram overlap metrics for text generation |
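Several of the metrics in the table follow directly from their definitions. The sketch below computes them in plain Python on small made-up label lists. The function names are hypothetical, and the zero-division fallbacks to 0.0 are a convention chosen here.

```python
import math

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def regression_metrics(y_true, y_pred):
    """MSE, MAE and R-squared."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"mse": ss_res / n, "mae": mae, "r2": 1 - ss_res / ss_tot}

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to each token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

clf = classification_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
reg = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

In the classification example, accuracy is 0.75 while precision and recall are both 2/3, a reminder that accuracy alone can look better than the positive-class metrics when negatives dominate.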

### Validation strategies

**Train-test split** divides the data into a training set and a held-out test set. The model is trained on the training set and evaluated on the test set. A common split is 80/20 or 70/30.

**Cross-validation** (typically k-fold) divides the data into k subsets. The model is trained k times, each time using a different subset as the test fold and the remaining k-1 subsets for training. The results are averaged to produce a more robust performance estimate.[3]

**Held-out validation set** reserves a portion of the training data for hyperparameter tuning, preventing the test set from being used for model selection decisions (which would leak information and produce overly optimistic estimates).
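A k-fold split can be sketched in a few lines. The version below uses contiguous folds for readability and assumes the data have already been shuffled. The helper names and the trivial "predict the mean" model are illustrative assumptions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_validate(xs, ys, k, fit, score):
    """Average the score over k train/test rotations."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / k

def fit_mean(xs, ys):
    """Trivial model: always predict the training mean."""
    return sum(ys) / len(ys)

def mse_score(model, xs, ys):
    return sum((model - y) ** 2 for y in ys) / len(ys)

folds = list(k_fold_indices(10, 3))
cv_mse = cross_validate(list(range(6)), [1.0] * 6, 3, fit_mean, mse_score)
```

Every sample lands in exactly one test fold, so each observation contributes to the performance estimate once while never being used to fit the model that scores it.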

## How do you select among candidate models?

Model selection is the process of choosing among candidate models. This encompasses both the choice of algorithm (e.g., random forest vs. neural network) and the tuning of hyperparameters (e.g., the number of layers, the learning rate, or the regularization strength).

Common model selection strategies include:

- **Grid search:** Exhaustively tries all combinations of hyperparameter values from a predefined grid.
- **Random search:** Samples hyperparameter values randomly, which is often more efficient than grid search when only a few hyperparameters strongly affect performance (Bergstra and Bengio, 2012).[8]
- **Bayesian optimization:** Uses a probabilistic surrogate model to guide the search toward promising hyperparameter regions, reducing the number of evaluations needed.
- **Information criteria:** Methods like the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) balance model fit against complexity without requiring a separate validation set.[3]

The goal of model selection is to identify the model that will generalize best to new data, not merely the model that fits the training data most closely.
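Grid search and random search differ only in how candidate settings are generated. In the sketch below, `validation_loss` is a hypothetical stand-in for "train a model with these hyperparameters and return its validation loss", and the hyperparameter names and ranges are made up for the example.

```python
import itertools
import random

def validation_loss(learning_rate, l2_strength):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (l2_strength - 1.0) ** 2

def grid_search(grid):
    """Evaluate every combination of the listed values."""
    names = list(grid.keys())
    candidates = [dict(zip(names, combo)) for combo in itertools.product(*grid.values())]
    return min(candidates, key=lambda params: validation_loss(**params))

def random_search(ranges, n_trials, seed=0):
    """Evaluate n_trials settings drawn uniformly from the given ranges."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        for _ in range(n_trials)
    ]
    return min(candidates, key=lambda params: validation_loss(**params))

grid = {"learning_rate": [0.001, 0.01, 0.1, 1.0], "l2_strength": [0.0, 1.0, 10.0]}
best_grid = grid_search(grid)                      # 4 x 3 = 12 evaluations
best_random = random_search(
    {"learning_rate": (0.0, 1.0), "l2_strength": (0.0, 10.0)}, n_trials=12
)
```

With the same budget of 12 evaluations, the grid tries only 4 distinct learning rates, whereas random search tries 12. This is the intuition behind its efficiency when one hyperparameter matters much more than the others.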

## Pre-trained models and transfer learning

[Transfer learning](/wiki/transfer_learning) is the practice of taking a model trained on one task or dataset and adapting it for a different, often related, task. This approach has become dominant in modern machine learning because training large models from scratch requires enormous computational resources and data.

A **pre-trained model** is one that has already been trained on a large-scale dataset. For example, [ImageNet](/wiki/imagenet)-trained convolutional neural networks learn general visual features (edges, textures, shapes) that transfer well to other vision tasks. In natural language processing, language models pre-trained on vast text corpora (such as BERT or GPT) learn linguistic representations that can be fine-tuned for specific tasks like sentiment analysis, question answering, or named entity recognition.

Transfer learning typically follows one of two approaches:

- **Feature extraction:** The pre-trained model is used as a fixed feature extractor. Its weights are frozen, and only a new output layer (or head) is trained on the target task.
- **[Fine-tuning](/wiki/fine_tuning):** All or some of the pre-trained model's weights are further adjusted on the target task data, allowing the model to adapt its learned representations.

## What are foundation models?

[Foundation models](/wiki/foundation_model) are large-scale models trained on broad, diverse datasets that can be adapted to a wide range of downstream tasks.[7] The term was coined in 2021 by researchers at the Center for Research on Foundation Models (CRFM), part of the Stanford Institute for Human-Centered Artificial Intelligence (HAI), in the report "On the Opportunities and Risks of Foundation Models."[7] The authors wrote: "We call these models foundation models to underscore their critically central yet incomplete character."[7] Examples include GPT-4, [Claude](/wiki/claude), [LLaMA](/wiki/llama), BERT, CLIP, and [Stable Diffusion](/wiki/stable_diffusion).

Foundation models represent a paradigm shift in AI development. Rather than training a specialized model for each task, a single large model is pre-trained once (at great expense) and then adapted, either through fine-tuning or through in-context learning (prompting), to serve many different applications. This approach reduces the need for task-specific labeled data and makes powerful AI capabilities accessible to organizations that lack the resources to train models from scratch. As the CRFM report notes, this scale "results in new emergent capabilities," while the effectiveness of one model across many tasks "incentivizes homogenization," meaning the defects of a foundation model are inherited by all the adapted models downstream.[7]

The emergence of foundation models has raised important questions about centralization (a small number of organizations control the most capable models), safety (misuse potential scales with model capability), and environmental impact (training runs for frontier models consume substantial energy).[7]

## How are large models compressed?

Deploying large models in resource-constrained environments (mobile devices, edge hardware, real-time applications) often requires reducing their size and computational cost. The three principal model compression techniques are:

**Pruning** removes redundant weights, neurons, or entire layers from a trained model. Unstructured pruning sets individual weights to zero, while structured pruning removes entire filters or channels. Research has shown that neural networks can often be pruned by 90% or more with minimal accuracy loss, supporting the "lottery ticket hypothesis" proposed by Frankle and Carbin (2019), which states that dense networks contain sparse subnetworks ("winning tickets") that, trained in isolation from their original initialization, can match the full network's accuracy in a comparable number of iterations.[6]
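A minimal sketch of unstructured pruning follows, assuming a magnitude criterion (zero the weights with the smallest absolute values). That criterion is one common choice and is an assumption here, since the text above does not prescribe how weights are ranked.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]

weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.1, -0.9, 0.04, 0.5]
pruned = magnitude_prune(weights, sparsity=0.5)
```

The pruned list keeps its original length. The savings come from storing or skipping the zeros, which is why sparse-aware storage formats or kernels are needed to benefit in practice.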

**Quantization** reduces the numerical precision of model weights and activations. Instead of storing parameters as 32-bit floating-point numbers (FP32), quantization converts them to lower-precision formats such as 16-bit (FP16), 8-bit integers (INT8), or even 4-bit or 2-bit representations. This reduces memory usage and accelerates inference on hardware that supports low-precision arithmetic.
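The idea can be sketched with symmetric 8-bit quantization, where a single scale factor maps floats onto the integers from -127 to 127. This is one simple scheme chosen for illustration; production toolchains typically add per-channel scales, zero points, and calibration.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values."""
    return [q * scale for q in quantized]

weights = [0.8, -0.05, 0.3, -0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), so the error shrinks as the weight range narrows or the bit width grows.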

**[Knowledge distillation](/wiki/knowledge_distillation)** trains a smaller "student" model to replicate the behavior of a larger "teacher" model. The student learns not just from the hard labels in the training data but also from the teacher's soft probability distributions ("dark knowledge"), which convey richer information about the relationships between classes.[5] Hinton, Vinyals, and Dean (2015) introduced this technique, observing that "a very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions," and showing how that ensemble knowledge can be distilled into a single deployable model.[5] It has since been widely adopted for deploying compact models.
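Soft targets are commonly produced by dividing the logits by a temperature before the softmax. The sketch below shows that softening and a cross-entropy loss between teacher and student distributions. The logits and the temperature value are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature (higher temperature = softer output)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

teacher_logits = [6.0, 2.0, 1.0]
hard = softmax(teacher_logits, temperature=1.0)   # nearly one-hot
soft = softmax(teacher_logits, temperature=4.0)   # more mass on the other classes
matched = distillation_loss(teacher_logits, teacher_logits, 4.0)
mismatched = distillation_loss([1.0, 2.0, 6.0], teacher_logits, 4.0)
```

At the higher temperature the teacher's second and third classes receive visibly more probability, which is the extra signal the student learns from. The loss is smallest when the student's softened distribution matches the teacher's.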

| Technique | What it reduces | Typical savings | Trade-off |
|---|---|---|---|
| Pruning | Number of non-zero parameters | 50-95% sparsity | Unstructured sparsity needs hardware or kernel support for sparse operations to yield speedups |
| Quantization | Precision of weights and activations | 2-8x memory reduction | Small accuracy loss; may need calibration |
| Knowledge distillation | Model size (fewer parameters) | 3-10x smaller model | Student may not fully match teacher quality |

## Model interpretability and explainability

As models are deployed in high-stakes domains like healthcare, finance, and criminal justice, understanding why a model makes a particular prediction becomes critical. Model interpretability refers to the degree to which a human can understand the cause of a model's decisions.

Some models are **inherently interpretable**. Linear regression models have coefficients that directly quantify each feature's influence. Decision trees produce human-readable if-then rules. These models are sometimes called "glass-box" models.[3]

More complex models, particularly deep neural networks, are often treated as **black boxes**. Several post-hoc explanation methods have been developed to interpret their predictions:

- **SHAP (SHapley Additive exPlanations):** Based on cooperative game theory, SHAP assigns each feature a value representing its contribution to a specific prediction. It provides both local (per-prediction) and global (across the dataset) explanations.[10]
- **LIME (Local Interpretable Model-agnostic Explanations):** LIME generates explanations for individual predictions by fitting a simple, interpretable model (such as a linear model) to a set of perturbed inputs around the prediction of interest.[9]
- **Gradient-based methods:** Techniques like saliency maps and integrated gradients compute the gradient of the output with respect to the input features, highlighting which features most strongly influence the prediction.
- **Attention visualization:** In [Transformer](/wiki/transformer)-based models, attention weights can be visualized to show which input tokens the model focuses on, though researchers have cautioned that attention weights do not always reliably indicate feature importance.
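The Shapley values underlying SHAP can be computed exactly for a model with a handful of features by averaging each feature's marginal contribution over all orderings. The sketch below does this for a hypothetical three-feature model, representing "absent" features by a baseline value. It illustrates the game-theoretic idea only and is not the SHAP library's implementation, which relies on approximations because the exact computation grows factorially with the number of features.

```python
import itertools

def shapley_values(model, x, baseline):
    """Exact Shapley attributions, averaged over all feature orderings."""
    n = len(x)
    totals = [0.0] * n
    orderings = list(itertools.permutations(range(n)))
    for order in orderings:
        current = list(baseline)           # start with every feature "absent"
        prev = model(current)
        for i in order:
            current[i] = x[i]              # reveal feature i
            new = model(current)
            totals[i] += new - prev        # its marginal contribution
            prev = new
    return [t / len(orderings) for t in totals]

def model(features):
    """Hypothetical model with an additive term and an interaction term."""
    a, b, c = features
    return 2.0 * a + b * c

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
```

The attributions sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term `b * c` is split evenly between the two features that produce it.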

## Model deployment and serving

Deploying a model means making it available to serve predictions in a production environment. Deployment introduces engineering challenges that go beyond model accuracy, including latency requirements, throughput, reliability, and cost.

Common deployment patterns include:

- **REST API serving:** The model is wrapped in a web service that accepts input data via HTTP requests and returns predictions. Frameworks like TensorFlow Serving, TorchServe, and Triton Inference Server support this pattern.
- **Batch inference:** Predictions are computed offline on large datasets at scheduled intervals, suitable for use cases where real-time responses are not required.
- **Edge deployment:** The model runs directly on devices (smartphones, IoT sensors, embedded systems), requiring model compression to meet memory and compute constraints.
- **Serverless deployment:** Cloud platforms like AWS Lambda or Google Cloud Functions host the model, scaling automatically with request volume.

## Model lifecycle

The full lifecycle of a machine learning model encompasses several stages, often managed under the discipline of MLOps (Machine Learning Operations):

1. **Problem definition:** Identify the business problem and formulate it as a machine learning task.
2. **Data collection and preparation:** Gather, clean, and preprocess the training data.
3. **Feature engineering:** Create informative input features from raw data.
4. **Model development:** Select architectures, train candidate models, and tune hyperparameters.
5. **Evaluation:** Assess model performance on held-out data using appropriate metrics.
6. **Deployment:** Serve the model in production.
7. **Monitoring:** Continuously track model performance, data drift (changes in input data distributions), and concept drift (changes in the relationship between inputs and outputs).
8. **Retraining:** Periodically retrain or update the model to maintain performance as conditions change.

Tools like [MLflow](/wiki/mlflow), Weights and Biases, and DVC (Data Version Control) help teams manage experiment tracking, model versioning, and reproducibility across the lifecycle.

## Model versioning and reproducibility

Reproducibility is a cornerstone of scientific machine learning. Given the same data, code, and random seeds, a reproducible workflow should produce the same model and results. In practice, achieving reproducibility requires tracking several artifacts:

- **Code versioning:** The model architecture and training scripts should be stored in version control (e.g., Git).
- **Data versioning:** The exact training data should be recorded. Tools like DVC track dataset versions alongside code.
- **Environment tracking:** The software libraries, their versions, and hardware configurations should be logged. Containers (Docker) help ensure environment consistency.
- **Experiment tracking:** Hyperparameters, metrics, and model artifacts for each experiment should be logged. MLflow, Weights and Biases, and Neptune are popular experiment tracking platforms.
- **Model registry:** A centralized repository for managing model versions, stage transitions (development, staging, production), and metadata. MLflow's Model Registry and cloud-native registries (AWS SageMaker Model Registry, Vertex AI Model Registry) support this pattern.
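The tracking ideas above can be sketched without any particular tool: derive all randomness from a recorded seed, and store content hashes of the configuration and data alongside the results. The function names and record layout below are assumptions for illustration and do not mirror the API of MLflow, DVC, or any other platform.

```python
import hashlib
import json
import random

def fingerprint(obj):
    """Stable SHA-256 hash of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def run_experiment(config, data):
    """Toy 'training run' whose only randomness comes from config['seed']."""
    rng = random.Random(config["seed"])
    weights = [rng.gauss(0.0, 1.0) for _ in range(config["n_weights"])]
    return {
        "config_hash": fingerprint(config),   # which settings produced this run
        "data_hash": fingerprint(data),       # which exact data it saw
        "weights": weights,
    }

config = {"seed": 42, "n_weights": 3, "learning_rate": 0.01}
data = [[0.0, 1.0], [1.0, 3.0]]
run_a = run_experiment(config, data)
run_b = run_experiment(config, data)
run_c = run_experiment({**config, "seed": 7}, data)
```

Two runs with identical configuration and data produce identical records, while changing the seed changes both the configuration hash and the resulting weights, making the difference traceable.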

## Explain like I'm 5 (ELI5)

Imagine you are learning to sort colored blocks into the right buckets. At first, you make a lot of mistakes. But every time someone tells you "that red block goes in the red bucket," you get a little bit better. After seeing enough examples, you can sort new blocks you have never seen before. A machine learning model works the same way. It is like a set of rules that a computer writes for itself by looking at lots of examples. The more examples it sees, the better the rules get, and soon it can make good guesses about things it has never seen before.

## References

1. Mitchell, T. M. (1997). *Machine Learning*. McGraw-Hill. (Provides the foundational formal definition of machine learning.)
2. Vapnik, V. N., and Chervonenkis, A. Y. (1971). "On the uniform convergence of relative frequencies of events to their probabilities." *Theory of Probability and Its Applications*, 16(2), 264-280. (Introduces VC dimension and foundational concepts in model capacity and generalization.)
3. Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*. 2nd edition. Springer. (Comprehensive reference on model types, evaluation, and selection.)
4. Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*. MIT Press. (Standard reference for deep neural network models, training, and regularization.)
5. Hinton, G., Vinyals, O., and Dean, J. (2015). "Distilling the Knowledge in a Neural Network." *arXiv preprint arXiv:1503.02531*. (Foundational paper on knowledge distillation for model compression.)
6. Frankle, J. and Carbin, M. (2019). "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks." *Proceedings of ICLR 2019* (arXiv:1803.03635). (Demonstrates that large networks contain small subnetworks capable of matching full performance.)
7. Bommasani, R., Hudson, D. A., Adeli, E., et al. (2021). "On the Opportunities and Risks of Foundation Models." *arXiv preprint arXiv:2108.07258*. (Stanford CRFM/HAI report coining the term "foundation model" and analyzing its implications.)
8. Bergstra, J. and Bengio, Y. (2012). "Random Search for Hyper-Parameter Optimization." *Journal of Machine Learning Research*, 13, 281-305. (Shows that random search can find good hyperparameters more efficiently than grid search, especially when only a few hyperparameters matter.)
9. Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "'Why Should I Trust You?': Explaining the Predictions of Any Classifier." *Proceedings of KDD 2016*. (Introduces LIME for model-agnostic local interpretability.)
10. Lundberg, S. M. and Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." *Advances in Neural Information Processing Systems (NeurIPS) 2017*. (Introduces SHAP values for model explanation based on Shapley values from game theory.)
11. Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). "Reconciling modern machine-learning practice and the classical bias-variance trade-off." *Proceedings of the National Academy of Sciences*, 116(32), 15849-15854 (arXiv:1812.11118). (Introduces the "double descent" curve.)
12. Brown, T. B., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." *Advances in Neural Information Processing Systems (NeurIPS) 2020* (arXiv:2005.14165). (Introduces GPT-3, a 175-billion-parameter language model.)