Machine learning terms/Fundamentals
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,983 words
See also: Machine learning terms
Machine learning (ML) is the branch of artificial intelligence concerned with building systems that learn patterns from data rather than following explicitly programmed rules. The field draws on statistics, optimization theory, and computer science. A canonical definition comes from Tom Mitchell (1997): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
This page collects the foundational vocabulary that underpins almost every ML system, from a simple linear regression baseline to a billion-parameter neural network. Each term is linked to its own dedicated wiki article. The fundamentals chapter is the entry point for readers working through the Machine learning terms glossary, which is grouped by topic into chapters such as fundamentals, neural networks and deep learning, language evaluation, and advanced topics.
A practical ML system has three ingredients: a model (a parameterized function family), a loss function (a measure of how far predictions are from ground truth), and an optimization procedure (a way to adjust the parameters so that the loss decreases). The remainder of this article walks through the categories, components, and metrics that make this loop work in practice.
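As a rough illustration of how the three ingredients fit together, the sketch below fits a one-dimensional linear model to a handful of made-up points, using a squared-error loss and plain full-batch gradient descent. The data, the function names, the step count, and the learning rate are assumptions chosen for the example, not values from this article.

```python
# Toy data generated from y = 2x + 1 (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def model(x, w, b):
    """Model: a parameterized function family, here a straight line."""
    return w * x + b


def loss(w, b):
    """Loss: mean squared error over the training set."""
    return sum((model(x, w, b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradients(w, b):
    """Analytic gradient of the mean squared error with respect to w and b."""
    n = len(xs)
    dw = sum(2 * (model(x, w, b) - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (model(x, w, b) - y) for x, y in zip(xs, ys)) / n
    return dw, db


def fit(steps=2000, learning_rate=0.05):
    """Optimizer: full-batch gradient descent starting from zero."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = gradients(w, b)
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b


w, b = fit()
print(f"w={w:.3f}, b={b:.3f}, loss={loss(w, b):.6f}")
```

On this noiseless toy data the fitted parameters should approach w = 2 and b = 1, and the loss should approach zero.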
ML problems are usually grouped by the kind of supervision signal the training set provides. The five main categories are summarized below.
| Category | Supervision signal | Typical tasks | Example algorithms |
|---|---|---|---|
| Supervised learning | Each example has an input and a label | Classification, regression | Linear regression, logistic regression, random forest, gradient boosted trees |
| Unsupervised learning | Inputs only, no labels | Clustering, density estimation, dimensionality reduction | k-means, Gaussian mixture, PCA, autoencoders |
| Semi-supervised learning | A small labeled set plus a much larger unlabeled set | Classification with scarce labels | Label propagation, pseudo-labeling, consistency regularization |
| Self-supervised learning | Labels are constructed automatically from the input itself | Pretraining language and vision models | Masked language modeling (BERT), next-token prediction (GPT), contrastive learning (SimCLR) |
| Reinforcement learning | A reward signal received after taking actions in an environment | Game playing, robotics, recommendation | Q-learning, policy gradients, actor-critic, PPO |
In supervised machine learning the dataset consists of pairs of inputs and target outputs. The learner fits a function that maps inputs to outputs and is evaluated on held-out examples. Most production ML systems, including spam filters, fraud detection, ad ranking, and medical image triage, are supervised.
Unsupervised machine learning discovers structure in unlabeled data. Clustering groups similar items, dimensionality reduction projects high-dimensional data into a lower-dimensional space for visualization or compression, and density estimation models the probability distribution of the inputs.
Semi-supervised learning is useful when labels are expensive but raw data is plentiful. Self-supervised learning has become the dominant paradigm for foundation models: the model is pretrained on large unlabeled corpora using a pretext task (predicting masked tokens, the next token, or the relative position of image patches) and is then fine-tuned on smaller supervised datasets.
Reinforcement learning (RL) trains an agent to take actions in an environment so as to maximize cumulative reward. The training signal is often sparse and delayed, which makes the credit assignment problem (deciding which earlier actions deserve credit for a later reward) central. Notable RL systems include DeepMind's AlphaGo and AlphaZero, OpenAI Five, and the reinforcement learning from human feedback (RLHF) stage used to align large language models.
Within supervised learning, the two main task types are regression and classification.
| Task | Output | Typical loss | Examples |
|---|---|---|---|
| Regression | A continuous numeric value | Squared loss (MSE), L1 loss | House price, temperature, time-to-failure |
| Binary classification | One of two classes (positive or negative class) | Log loss | Spam detection, loan default, click prediction |
| Multi-class classification | One of K mutually exclusive classes | Categorical cross-entropy | Digit recognition, image labeling |
| Multi-label classification | A subset of K classes (any number can be active) | Binary cross-entropy per label | Tagging, content moderation |
Classification models often output probabilities via a softmax (multi-class) or sigmoid (binary) head, and a classification threshold converts the probability into a discrete decision.
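The following minimal sketch shows how raw model scores (logits) might be turned into probabilities and then into a decision. The function names and the example logits are illustrative assumptions; the 0.5 default threshold is a common convention, not a requirement.

```python
import math


def sigmoid(z):
    """Map a single logit to a probability for the positive class."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Map K logits to K probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_binary(logit, threshold=0.5):
    """Apply a classification threshold to the sigmoid probability."""
    return 1 if sigmoid(logit) >= threshold else 0


# Multi-class: pick the class with the highest probability.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = max(range(len(probs)), key=probs.__getitem__)

# Binary: the same logit can give different decisions under different thresholds.
decision_default = classify_binary(0.8)                  # threshold 0.5
decision_strict = classify_binary(0.8, threshold=0.9)    # stricter threshold
```

Raising the threshold trades recall for precision, which is why the threshold is usually tuned on validation data rather than left at its default.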
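The central goal of ML is generalization: producing a model that performs well on data it has not seen. Two failure modes block this goal: underfitting, where the model is too simple to capture the underlying pattern, and overfitting, where it fits noise in the training set and fails on new data.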
The expected squared error of a regression estimator can be decomposed into three terms: a squared bias term (how far the average prediction is from the truth), a variance term (how much predictions fluctuate across training sets), and irreducible noise. High-bias models (like a linear model on a curved relationship) underfit; high-variance models (like a deep tree on a small dataset) overfit. Choosing model capacity and regularization is largely the art of balancing these two sources of error.
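Written out under the common assumption that the target is $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$ (the symbols $f$, $\hat{f}$, and $\sigma^2$ are introduced here for illustration), the decomposition at a fixed input $x$ reads as follows, with expectations taken over training sets and noise:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$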
A generalization curve plots training loss and validation loss against training time or model capacity. As capacity grows, training loss falls monotonically while validation loss typically falls and then rises again, forming a U shape whose minimum suggests the right level of capacity or the right number of training epochs. The point where validation loss is minimum motivates early stopping.
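One simple way to act on the validation curve is patience-based early stopping, sketched below. The `patience` parameter and the list of validation losses are hypothetical; real training loops usually also save a checkpoint of the best model so it can be restored afterwards.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, best_loss), stopping once the validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_loss


# Hypothetical validation losses: falling, then rising as overfitting sets in.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best_epoch, best_loss = early_stopping_epoch(val_losses)
print(best_epoch, best_loss)
```

With these made-up losses the minimum is at epoch index 3, and training would stop two epochs later.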
To estimate generalization honestly, the dataset is partitioned into disjoint subsets.
| Split | Purpose | Typical share |
|---|---|---|
| Training set | Fit the model parameters | 60 to 80 percent |
| Validation set | Tune hyperparameters, choose architectures, decide when to stop | 10 to 20 percent |
| Test set | Final unbiased evaluation, used at most once | 10 to 20 percent |
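Random splits assume that examples are independently and identically distributed (i.i.d.). When that assumption does not hold, the split must respect the structure of the data: for time-series problems use a temporal split, and for grouped data (patients, users) use a group-aware split so that the same group never appears on both sides, which would leak information.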
When data is scarce, k-fold cross-validation gives a more stable estimate of validation performance. The training set is split into k folds; the model is trained k times, each time holding out one fold as the validation set, and the k validation scores are averaged. Common choices are 5-fold and 10-fold. Stratified k-fold preserves class proportions in each fold and is preferred for class-imbalanced datasets. Leave-one-out cross-validation (LOOCV) is the limiting case where k equals the number of examples; it is nearly unbiased but expensive, and its estimate can have high variance.
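The fold bookkeeping can be sketched in a few lines of plain Python. This is a minimal, unstratified version; the function names and the `score_fn` callback are assumptions for illustration, and libraries such as scikit-learn provide tested equivalents.

```python
import random


def k_fold_indices(n_examples, k, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


def cross_validate(n_examples, k, score_fn):
    """Average a user-supplied score over the k held-out folds.

    score_fn(train_indices, val_indices) is a hypothetical callback that
    trains on the first index list and returns a score on the second.
    """
    scores = [score_fn(train, val) for train, val in k_fold_indices(n_examples, k)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(10, 5))
for train, val in splits:
    print("train:", sorted(train), "val:", sorted(val))
```

Each example lands in the validation fold exactly once, so every data point contributes to the averaged score.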
A loss function measures the cost of a prediction. Training minimizes the average loss over the training set, sometimes called the empirical risk.
| Loss | Formula (per example) | Typical use |
|---|---|---|
| Squared loss (MSE) | (y - y_hat)^2 | Regression with Gaussian noise |
| L1 loss (MAE) | abs(y - y_hat) | Regression robust to outliers |
| Huber loss | Quadratic for small errors, linear for large | Regression with heavy-tailed noise |
| Binary log loss | -[y log(p) + (1-y) log(1-p)] | Binary classification |
| Categorical cross-entropy | -sum(y_k log p_k) | Multi-class classification |
| Hinge loss | max(0, 1 - y * f(x)) | Support vector machines |
| Kullback-Leibler divergence | sum(p log(p/q)) | Distribution matching, variational inference |
| Contrastive / triplet loss | Margin-based pair or triplet objectives | Metric learning, embeddings |
Alongside the per-example loss, training often adds a regularization term to the total objective. The term is weighted by a coefficient lambda, which is itself a hyperparameter and is usually tuned on the validation set.
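Several of the per-example losses in the table above can be written directly from their formulas. The sketch below is illustrative: the Huber `delta` parameter and the clipping constant `eps` are assumptions added here, and the hinge loss assumes labels encoded as -1 or +1.

```python
import math


def squared_loss(y, y_hat):
    return (y - y_hat) ** 2


def l1_loss(y, y_hat):
    return abs(y - y_hat)


def huber_loss(y, y_hat, delta=1.0):
    """Quadratic for errors up to delta, linear beyond it."""
    err = abs(y - y_hat)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)


def binary_log_loss(y, p, eps=1e-12):
    """y is 0 or 1; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def hinge_loss(y, score):
    """y is -1 or +1; score is the raw model output f(x)."""
    return max(0.0, 1.0 - y * score)


# A large error is penalized far more by the squared loss than by L1 or Huber.
print(squared_loss(0.0, 3.0), l1_loss(0.0, 3.0), huber_loss(0.0, 3.0))
```

The comparison on the last line shows why L1 and Huber losses are considered more tolerant of outliers than the squared loss.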
Once the loss is defined, an optimizer iteratively updates the model parameters to reduce it. Almost all modern ML uses first-order gradient methods.
Vanilla gradient descent computes the gradient of the loss over the entire training set and takes a step opposite to that gradient, scaled by a learning rate. For large datasets this is impractical because each step requires a full pass over the data.
Stochastic gradient descent (SGD) approximates the full-batch gradient with the gradient computed on a single example. Mini-batch SGD, the workhorse of deep learning, computes the gradient on a small batch of examples (the batch size is a key hyperparameter, typically between 32 and 8,192). Mini-batches give a lower-variance estimate of the true gradient than a single example while still allowing many updates per epoch.
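A minimal sketch of the mini-batch loop, assuming a one-parameter model y = w * x with squared loss and made-up noiseless data; the batch size, epoch count, and learning rate are arbitrary choices for the example.

```python
import random


def minibatches(examples, batch_size, rng):
    """Shuffle once per epoch, then yield consecutive batches."""
    shuffled = examples[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


def sgd_fit(examples, batch_size=2, epochs=200, learning_rate=0.05, seed=0):
    """Fit y = w * x by mini-batch SGD on the squared loss."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        for batch in minibatches(examples, batch_size, rng):
            # Gradient of the mean squared error over this batch only.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w


# Hypothetical data from y = 3x.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]
w = sgd_fit(data)
print(f"w={w:.4f}")
```

With six examples and a batch size of two, each epoch performs three parameter updates instead of the single update that full-batch gradient descent would make.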
Momentum keeps an exponentially weighted average of past gradients, accelerating descent in consistent directions and damping oscillations. Nesterov accelerated gradient (NAG) computes the gradient at a look-ahead position. AdaGrad adapts the learning rate per parameter using the sum of squared past gradients, suiting sparse features but eventually shrinking the rate to zero. RMSProp replaces that cumulative sum with an exponentially decayed moving average. Adam (Adaptive Moment Estimation), introduced by Kingma and Ba in 2014, combines momentum and RMSProp with bias-corrected moments and is the default for many deep learning tasks. AdamW decouples weight decay from the gradient step and is preferred for transformers. Learning rate schedules (step decay, cosine decay, warmup followed by decay) are used together with optimizers to improve convergence.
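To make the Adam update concrete, the sketch below applies it to a simple two-parameter quadratic. The hyperparameter defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the commonly used ones; the objective, the step count, and the learning rate are assumptions for illustration, and deep learning frameworks ship their own tested implementations.

```python
import math


def adam_minimize(grad_fn, theta, steps=500, lr=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a function given its gradient, using the Adam update rule."""
    theta = list(theta)
    m = [0.0] * len(theta)  # first moment: running mean of gradients
    v = [0.0] * len(theta)  # second moment: running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def grad(theta):
    """Gradient of f(theta) = (theta[0] - 3)^2 + (theta[1] + 1)^2."""
    return [2 * (theta[0] - 3.0), 2 * (theta[1] + 1.0)]


theta = adam_minimize(grad, [0.0, 0.0])
print([round(t, 3) for t in theta])
```

The result should land close to the minimizer (3, -1). Dividing by the square root of the second moment gives each parameter its own effective step size, which is the RMSProp ingredient; the first moment is the momentum ingredient.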
Regularization adds inductive bias toward simpler functions to fight overfitting.
| Technique | Effect | Notes |
|---|---|---|
| L1 regularization | Sum of absolute weights, encourages sparsity | Used in lasso regression, can drive some weights to exactly zero |
| L2 regularization | Sum of squared weights, shrinks all weights toward zero | Used in ridge regression and weight decay |
| L0 regularization | Penalizes the count of non-zero weights | Combinatorial, usually approximated |
| Elastic net | Mixture of L1 and L2 | Compromise between sparsity and stability |
| Dropout | Randomly zeros activations during training | Acts as model averaging across sub-networks |
| Early stopping | Stops training when validation loss stops improving | Implicit regularization, near-free |
| Data augmentation | Generates synthetic training examples | Standard in vision (flips, crops, color jitter), audio (pitch, time stretch), and text (back-translation) |
| Batch normalization | Normalizes activations within a batch | Often acts as regularization in addition to its optimization benefits |
| Label smoothing | Softens hard labels toward a uniform distribution | Reduces overconfidence in classification |
| Gradient clipping | Bounds gradient norms | Stabilizes training when gradients explode |
The strength of L1 or L2 regularization is controlled by the regularization rate (lambda), tuned on the validation set.
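In symbols, and using notation introduced here only for illustration ($\ell$ for the per-example loss, $f_\theta$ for the model, $\Omega$ for the penalty), the regularized training objective can be written as:

$$
J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, f_\theta(x_i)\big) + \lambda\, \Omega(\theta),
\qquad
\Omega(\theta) = \lVert \theta \rVert_1 \ \text{(L1)}
\quad \text{or} \quad
\Omega(\theta) = \lVert \theta \rVert_2^2 \ \text{(L2)}
$$

Setting $\lambda = 0$ recovers plain empirical risk minimization, and larger values of $\lambda$ push the weights more strongly toward zero.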
Feature engineering converts raw data into the feature vectors consumed by the model. Even with deep learning, careful preparation of inputs often makes a measurable difference.
Numerical features typically require scaling so that no single feature dominates distance and gradient computations.
| Method | Formula | When to use |
|---|---|---|
| Min-max scaling | (x - min) / (max - min) | Bounded outputs, image pixels |
| Z-score normalization | (x - mean) / std | Default choice for most numeric features |
| Robust scaling | (x - median) / IQR | Outlier-prone features |
| Log transform | log(1 + x) | Skewed positive values, counts |
| Bucketing | Map continuous values into discrete bins | Capture non-linearity, simplify interactions |
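Three of the formulas in the table can be sketched with the standard library alone. The function names and sample values are hypothetical. In practice the statistics (min, max, mean, standard deviation) should be computed on the training set only and then reused for validation and test data.

```python
import math
import statistics


def min_max_scale(values):
    """Rescale to [0, 1]; assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def log_transform(values):
    """log(1 + x), for skewed non-negative values such as counts."""
    return [math.log1p(v) for v in values]


values = [1.0, 2.0, 3.0, 4.0, 10.0]
print(min_max_scale(values))
print([round(z, 3) for z in z_score(values)])
print([round(v, 3) for v in log_transform(values)])
```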
Categorical features must be turned into numbers.
Feature selection removes irrelevant or redundant features to reduce variance and training cost. Methods include filter (mutual information, chi-squared), wrapper (recursive feature elimination), and embedded approaches (L1 regularization, tree feature importance). Feature crosses combine two or more features into a synthetic feature that captures their interaction, for example combining latitude and longitude into a grid cell.
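The latitude and longitude example can be sketched as follows. The one-degree cell size and the string format of the cell identifier are arbitrary assumptions; the resulting identifier would then be treated as a categorical feature.

```python
import math


def grid_cell(lat, lon, cell_size_deg=1.0):
    """Cross latitude and longitude into a single categorical grid-cell ID."""
    lat_bin = math.floor(lat / cell_size_deg)
    lon_bin = math.floor(lon / cell_size_deg)
    return f"lat{lat_bin}_lon{lon_bin}"


# Two nearby points fall into the same cell; a distant one does not.
print(grid_cell(37.77, -122.42))
print(grid_cell(37.20, -122.90))
print(grid_cell(40.71, -74.01))
```

A linear model given only raw latitude and longitude cannot represent "this particular neighborhood"; the crossed feature lets it learn a separate weight per cell.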
Linear models are the foundation of supervised learning and remain strong baselines.
Decision trees split the feature space along axis-aligned thresholds and predict the majority class or mean target in each leaf. They handle mixed feature types, are invariant to monotone transformations, and are easy to inspect.
Tree models are non-linear and can capture interactions automatically, which is why they often outperform linear models on tabular tasks.
Probabilistic models represent uncertainty through joint or conditional distributions. Naive Bayes assumes feature independence given the class and applies Bayes' rule; despite the strong assumption it is fast and effective for text classification with bag-of-words features, and it comes in Multinomial, Bernoulli, and Gaussian variants. Bayesian networks are directed acyclic graphs whose nodes are random variables and whose edges encode conditional dependence; they were popularized by Judea Pearl in the 1980s. Hidden Markov models (HMMs) and conditional random fields (CRFs) model sequence data and were workhorses for speech and NLP before neural networks took over. Bayesian linear regression, Gaussian processes, and variational autoencoders (VAEs) bring Bayesian principles to modern ML.
A neural network stacks layers of neurons. Each neuron computes a weighted sum of its inputs plus a bias, then applies a non-linear activation function such as the Rectified Linear Unit (ReLU), sigmoid, or hyperbolic tangent. A network has an input layer, one or more hidden layers, and an output layer; the number of hidden layers gives the model's depth and a network with more than one hidden layer is called a deep model. Training uses backpropagation, which computes gradients by applying the chain rule from the output back through the layers. The fundamentals chapter introduces these concepts, while the dedicated neural networks and deep learning chapter covers convolutional, recurrent, and transformer architectures in more depth.
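The forward pass described above can be sketched without any framework. The weights, biases, and layer sizes below are made-up values chosen so the arithmetic is easy to follow; backpropagation is not shown.

```python
def relu(z):
    """Rectified Linear Unit activation."""
    return max(0.0, z)


def identity(z):
    return z


def dense_layer(inputs, weights, biases, activation):
    """Each neuron: weighted sum of inputs plus bias, then the activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


x = [1.0, 2.0]  # input layer

# Hidden layer with two neurons (hypothetical weights).
# Neuron 1: 0.5*1 + (-1.0)*2 + 0.0 = -1.5, which ReLU clips to 0.0
# Neuron 2: 1.0*1 + 1.0*2 - 1.0 = 2.0
hidden = dense_layer(x, [[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0], relu)

# Output layer with one linear neuron: 1.0*0.0 + 2.0*2.0 + 0.5 = 4.5
output = dense_layer(hidden, [[1.0, 2.0]], [0.5], identity)
print(hidden, output)
```

Without the non-linear activation, stacking layers would collapse into a single linear map, which is why the activation function is essential.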
The choice of metric depends on the task and on what kinds of errors matter most.
| Metric | Definition | Notes |
|---|---|---|
| Accuracy | (TP + TN) / total | Misleading on imbalanced data |
| Precision | TP / (TP + FP) | Fraction of predicted positives that are truly positive |
| Recall (sensitivity, TPR) | TP / (TP + FN) | Fraction of real positives that are recovered |
| Specificity | TN / (TN + FP) | Fraction of negatives correctly rejected |
| F1 score | Harmonic mean of precision and recall | Balances the two when one class is rare |
| F-beta | Weighted harmonic mean | Use beta>1 to weight recall more, beta<1 to weight precision |
| False positive rate | FP / (FP + TN) | Used in ROC curves |
| AUC (ROC AUC) | Area under the ROC curve | Probability a random positive ranks above a random negative |
| PR AUC | Area under the precision-recall curve | Often more informative than ROC AUC under heavy class imbalance |
| Log loss | Average negative log probability of the true class | Rewards calibrated probabilities |
| Brier score | Mean squared error between predicted probability and outcome | Used for calibration assessment |
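A confusion matrix tabulates the TP, FP, TN, and FN counts at a given classification threshold and is the source of the threshold-dependent metrics above (accuracy, precision, recall, specificity, F1, F-beta, and false positive rate). AUC, log loss, and the Brier score are instead computed from the predicted scores or probabilities.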
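As an illustration, the sketch below counts the four confusion-matrix cells for a small made-up set of binary labels and derives several metrics from them. The labels and function names are hypothetical, and the code does not guard against division by zero (for example, when there are no predicted positives).

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels encoded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts, metrics)
```

Here accuracy is 0.7 even though half of the real positives are missed, which illustrates why accuracy alone can hide poor recall.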
| Metric | Formula | Notes |
|---|---|---|
| Mean absolute error (MAE) | mean(abs(y - y_hat)) | Same units as the target, robust to outliers |
| Mean squared error (MSE) | mean((y - y_hat)^2) | Penalizes large errors more |
| Root mean squared error (RMSE) | sqrt(MSE) | Same units as the target |
| R-squared | 1 - SS_res / SS_tot | Fraction of variance explained, can be negative |
| MAPE | mean(abs((y - y_hat) / y)) | Scale-free, undefined when y is zero |
| Quantile loss | Pinball loss | Used to fit conditional quantile estimators |
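The first four regression metrics in the table can be computed directly from their formulas, as in this minimal sketch. The target and prediction values are made up, and the R-squared computation assumes the targets are not all identical.

```python
import math


def regression_metrics(y_true, y_pred):
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    mean_y = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)   # total sum of squares
    return {
        "mae": mae,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2": 1.0 - ss_res / ss_tot,
    }


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
scores = regression_metrics(y_true, y_pred)
print(scores)
```

R-squared compares the model against a baseline that always predicts the mean of the targets, so a model worse than that baseline gets a negative value.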
Ranking tasks (search, recommendation) use Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (NDCG). Calibration plots and Expected Calibration Error (ECE) check whether predicted probabilities match observed frequencies.
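Two of the ranking metrics can be sketched as follows. Each query is represented by the list of relevance grades of its results in ranked order, which is a representation chosen for this example. The DCG here uses the linear-gain form (relevance divided by log2 of rank plus one); another common variant uses 2^relevance - 1 as the gain, so results from different tools may not match.

```python
import math


def reciprocal_rank(relevances):
    """1 / rank of the first relevant result, or 0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    return sum(reciprocal_rank(q) for q in queries) / len(queries)


def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


mrr = mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]])  # first hits at ranks 2 and 1
print(mrr, ndcg([3, 2, 1]), ndcg([1, 2, 3]))
```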
A real ML system is more than a model. The end-to-end pipeline typically includes the following stages.
| Stage | Activities |
|---|---|
| Data | Collection, cleaning, deduplication, labeling, proxy labels, rater management, schema validation |
| Feature engineering | Transformations, scaling, encoding, feature selection, synthetic features |
| Training | Choosing model, loss, optimizer, hyperparameters, distributed training |
| Validation | Cross-validation, hyperparameter search (grid, random, Bayesian, Hyperband) |
| Evaluation | Held-out test metrics, slice-based analysis, fairness audits |
| Deployment | Online inference (low latency) versus offline batch scoring, static versus dynamic models, A/B testing, canary releases |
| Monitoring | Drift detection, training-serving skew, feedback loops, nonstationarity, retraining cadence, alerting |
| Governance | Interpretability, bias and fairness audits, lineage, model cards, documentation |
MLOps is the engineering discipline that productionizes this loop, with tools such as MLflow, Kubeflow, Vertex AI, SageMaker, and Tecton handling experiment tracking, feature stores, and continuous training.
Hyperparameters are configuration values set before training (learning rate, batch size, regularization strength, depth, hidden width, dropout rate). Common search strategies include grid search (all combinations on a discrete grid), random search (often outperforms grid search for the same budget; Bergstra and Bengio, 2012), Bayesian optimization (a probabilistic surrogate model guides the choice of the next configuration, for example by maximizing expected improvement), Hyperband (random search with early stopping via successive halving), BOHB (Hyperband with its random sampling replaced by Bayesian optimization), and population-based training. Tools include Optuna, Ray Tune, Weights and Biases Sweeps, and Vertex AI Vizier.
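Random search is simple enough to sketch in full. In the example below the search ranges are arbitrary, and `toy_validation_loss` is a hypothetical stand-in for "train a model with this configuration and return its validation loss"; a real search would replace it with an actual training run.

```python
import math
import random


def sample_config(rng):
    """Draw one configuration; the ranges are illustrative assumptions."""
    return {
        # Learning rates are usually sampled on a log scale.
        "learning_rate": 10 ** rng.uniform(-4.0, 0.0),
        "dropout": rng.uniform(0.0, 0.5),
    }


def toy_validation_loss(config):
    """Hypothetical objective with its minimum at lr = 1e-2, dropout = 0.2."""
    return ((math.log10(config["learning_rate"]) + 2.0) ** 2
            + (config["dropout"] - 0.2) ** 2)


def random_search(objective, n_trials=50, seed=0):
    """Evaluate n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = objective(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


best_config, best_score = random_search(toy_validation_loss)
print(best_config, best_score)
```

Because every trial is independent, random search parallelizes trivially, and the budget can be extended simply by drawing more samples.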
Offline learning trains a model on a fixed snapshot of data and then deploys it; the model is static until the next retraining cycle. Online learning updates the model as new examples arrive, which is essential when the data distribution changes (a property called nonstationarity). Online inference produces predictions on demand, while offline (batch) inference scores entire datasets on a schedule. The choice between static and dynamic inference depends on freshness requirements, latency budgets, and cost.
When one class is much more frequent than the other, naive accuracy is misleading. Strategies include resampling (oversampling the minority class with SMOTE or ADASYN, or undersampling the majority class), class weighting (scaling the loss by inverse frequency), tuning the classification threshold, cost-sensitive learning (asymmetric costs in the loss), and anomaly detection framing when the positive class is extremely rare.
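Class weighting by inverse frequency can be sketched as below. The formula n / (k * count), with n examples and k classes, is one common convention (it matches the "balanced" heuristic documented by scikit-learn); the label list is made up for illustration.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count) so rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


# Hypothetical imbalanced dataset: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)
print(weights)

# Each example's loss would then be multiplied by the weight of its class,
# so both classes contribute equally to the total weighted loss.
total_weight_per_class = {cls: weights[cls] * labels.count(cls) for cls in weights}
print(total_weight_per_class)
```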
Interpretability matters for debugging, fairness, and regulatory compliance. Common techniques include feature importance from tree ensembles (Gini, permutation, SHAP), partial dependence plots, local surrogates such as LIME (Ribeiro et al., 2016), SHAP values (Lundberg and Lee, 2017), counterfactual explanations, and attention visualization for deep models.
Algorithmic bias can creep in through skewed data, label noise, or proxy labels that correlate with protected attributes. Fairness metrics include demographic parity, equalized odds, and predictive parity; several of these criteria cannot be satisfied simultaneously except in special cases, such as equal base rates across groups or a perfect predictor (Chouldechova, 2017; Kleinberg et al., 2017). Mitigation spans pre-processing (reweighting), in-processing (constraints), and post-processing (group-specific thresholds). Model instability and feedback loops, in which a model's predictions influence the data it is later trained on, are also material risks. The NIST AI Risk Management Framework (2023) and EU AI Act (2024) formalize many of these requirements.
Most practitioners use the Python data stack: pandas (and its DataFrame abstraction), NumPy, scikit-learn for classical models, and PyTorch, TensorFlow, or JAX for deep learning. R, Julia, and Spark MLlib are also widely used. ONNX provides a portable format for deployed models, and Hugging Face Hub hosts pretrained checkpoints.
The following resources are widely used and define the canon of fundamentals.
| Resource | Authors / instructors | Year | Notes |
|---|---|---|---|
| The Elements of Statistical Learning (ESL) | Trevor Hastie, Robert Tibshirani, Jerome Friedman | 2nd ed. 2009 | Mathematical, dense, free PDF on the authors' Stanford site |
| An Introduction to Statistical Learning (ISL) | Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani | 2nd ed. 2021 | Gentler companion to ESL with R and Python labs |
| Pattern Recognition and Machine Learning (PRML) | Christopher M. Bishop | 2006 | Bayesian and probabilistic perspective |
| Machine Learning: A Probabilistic Perspective | Kevin Murphy | 2012; expanded as Probabilistic Machine Learning, 2022 and 2023 | Comprehensive modern reference |
| Deep Learning | Ian Goodfellow, Yoshua Bengio, Aaron Courville | 2016 | The standard deep learning textbook, free online |
| Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow | Aurélien Géron | 3rd ed. 2022 | Practical, code-first |
| Mathematics for Machine Learning | Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong | 2020 | Linear algebra, calculus, probability for ML |
| Coursera Machine Learning | Andrew Ng (Stanford / DeepLearning.AI) | 2011 (original); refreshed 2022 | One of the most-taken online courses ever, an entry point for millions |
| Coursera Deep Learning Specialization | Andrew Ng | 2017 | Five-course sequence on neural networks |
| Stanford CS229 Machine Learning | Andrew Ng, Christopher Ré, Tengyu Ma | ongoing | Graduate ML lectures, notes online |
| Stanford CS231n Convolutional Neural Networks for Visual Recognition | Fei-Fei Li, Andrej Karpathy, Justin Johnson | ongoing | Foundational deep learning for vision |
| Stanford CS224n Natural Language Processing with Deep Learning | Christopher Manning | ongoing | Modern NLP curriculum |
| MIT 6.036 Introduction to Machine Learning | MIT OCW | ongoing | Undergraduate ML |
| fast.ai Practical Deep Learning for Coders | Jeremy Howard, Rachel Thomas | 2017 onwards | Top-down, hands-on, free |
| Dive into Deep Learning (D2L) | Aston Zhang, Zachary Lipton, Mu Li, Alex Smola | 2020 onwards | Free interactive book with PyTorch, MXNet, TensorFlow, JAX |
| Reinforcement Learning: An Introduction | Richard S. Sutton, Andrew G. Barto | 2nd ed. 2018 | The canonical RL textbook, free PDF |
| Speech and Language Processing | Daniel Jurafsky, James H. Martin | 3rd ed. draft, ongoing | NLP reference, free chapters online |
The Google Machine Learning Glossary, on which much of the terminology in this wiki draws, is also a high-quality starting point.
The terms below have dedicated wiki pages in the fundamentals chapter of Machine learning terms. Where multiple notations exist (for example ReLU and Rectified Linear Unit), separate entries are linked.