Machine learning terms

See also: Terms and Machine learning

Machine learning terms are the standardized vocabulary used to describe how models learn from data, including concepts such as features, labels, loss, gradient descent, overfitting, and the architectures (neural networks, transformers, decision forests) that implement them. This glossary collects that core vocabulary across machine learning research, engineering, and applied data science, with more than 150 bolded, self-contained definitions plus topic hubs that link to dedicated sub-glossaries. Entries follow the conventions used by the Google Machine Learning Glossary, which itself defines over 460 terms and has been revised nine times since its 2016 launch,^[1]^[10] the Wikipedia machine learning articles,^[2]^[3] and standard textbooks such as Goodfellow, Bengio, and Courville's Deep Learning and Bishop's Pattern Recognition and Machine Learning.^[4] For the complete alphabetical index of every term, see the Machine learning terms/All page.

What does this glossary cover?

This page is organized in three parts. First, subcategory hubs group ML vocabulary by topic area (fundamentals, NLP, fairness, decision forests, reinforcement learning, computer vision, sequence models, clustering, recommendation systems, and frameworks) and link to a focused glossary for each. Second, a curated set of the most frequently referenced terms appears with concise, one-line definitions grouped by theme. Third, a short list of additional widely used terms is included for completeness. Each definition is written to stand on its own: it leads with the term and states the concept in a single sentence.

Subcategory hubs

The following hub pages organize ML vocabulary by topic area. Each hub contains a focused glossary for that domain along with related concepts and methods.

Fundamentals

Fundamentals covers the building blocks of supervised and unsupervised learning, including features, labels, examples, training, loss, regularization, gradient descent, hyperparameters, and evaluation metrics such as accuracy, precision, and recall.^[1]^[5] This is the recommended starting point for newcomers to ML.

Natural language processing

Natural Language Processing and Language Evaluation cover terms for working with text: tokenization, embeddings, attention, transformers, language models, sequence-to-sequence tasks, and quality metrics like BLEU and perplexity.^[1]

Fairness

Fairness addresses bias, demographic disparities, and ethical considerations in ML systems. It includes fairness metrics, sensitive attributes, disparate impact, and approaches to mitigating bias through pre-processing, in-processing, and post-processing techniques.^[1]

Decision forests

Decision Forests covers tree-based models such as decision trees, random forests, and gradient boosted trees, along with their splitting rules, ensemble methods, feature importance measures, and out-of-bag evaluation.^[1]^[6]

Reinforcement learning

Reinforcement Learning covers agents that learn by interacting with environments to maximize cumulative reward. Terms include policy, state, action, value function, Q-learning, Markov decision process, and Deep Q-Networks.^[1]^[9]

Computer vision

Computer Vision and Image Models cover the processing of pixels and visual scenes: convolutional networks, pooling, bounding boxes, intersection over union, image augmentation, and architectures for classification, detection, and segmentation.^[1]

Sequence models

Sequence Models covers architectures for ordered data such as recurrent neural networks, LSTMs, attention mechanisms, transformers, and time series methods.^[1]^[4]

Clustering

Clustering covers unsupervised grouping algorithms including k-means, k-median, agglomerative, divisive, hierarchical, and centroid-based clustering, along with similarity measures and centroid concepts.^[1]^[6]

Recommendation systems

Recommendation Systems covers techniques for predicting user preferences, including collaborative filtering, matrix factorization, candidate generation, scoring, re-ranking, and the user and item matrices used by recommender models.^[1]

TensorFlow and Google Cloud

TensorFlow covers the open-source ML framework's vocabulary including tensors, graphs, sessions, estimators, and the Keras and Layers APIs.^[7] Google Cloud covers Cloud TPU, TPU Pods, and other managed infrastructure for training and serving models.^[1]

Key terms with definitions

A curated set of the most frequently referenced ML terms, organized by topic.^[1] For the complete alphabetical list, see Machine learning terms/All.

Fundamentals and training

accuracy: Correct predictions divided by total predictions; can mislead on class-imbalanced datasets.
backpropagation: Algorithm computing gradients of the loss with respect to each weight by chain-rule traversal of the network.^[4]
batch: Set of examples processed in a single training iteration.
batch size: Number of examples in one batch.
cross-entropy: Loss measuring the difference between two probability distributions; generalizes log loss to multiple classes.
cross-validation: Resampling procedure that estimates model performance by repeatedly splitting data into training and validation folds.^[6]
early stopping: Regularization technique that halts training when validation loss stops improving.
epoch: One full pass over the training set during training.
example: Single row of data, comprising features and optionally a label.
feature: Input variable used by a model to make predictions.
feature engineering: Process of selecting and transforming raw data into features suitable for modeling.
generalization: Model's ability to perform well on examples not seen during training.^[3]
gradient descent: Optimization algorithm that iteratively updates parameters in the direction of steepest descent of the loss.^[5]
hyperparameter: Configuration value set by the practitioner before training, such as learning rate or batch size.
label: Target value associated with an example in supervised learning.
learning rate: Scalar that multiplies the gradient when updating parameters.
loss: Number expressing how far a model's predictions are from the labels.
mini-batch: Small subset of training examples used in a single gradient update.
model: Function learned from data that maps inputs to outputs.
optimizer: Algorithm that updates model parameters to reduce loss, such as SGD, Adam, or AdaGrad.
overfitting: Model fitting training data so closely that it fails to generalize to new data.^[3]
regularization: Technique that penalizes model complexity to reduce overfitting.^[4]
semi-supervised learning: Training using a mix of labeled and unlabeled data.
stochastic gradient descent (SGD): Gradient descent using estimated gradients from a single example or mini-batch.^[5]
supervised machine learning: Training using labeled examples to learn an input-to-output mapping.^[5]
test set: Held-out data used to evaluate the final model.
training set: Data used to fit model parameters.
transfer learning: Reusing knowledge learned on one task to improve learning on another.
underfitting: Model too simple to capture the relationship between features and labels.
unsupervised machine learning: Training without labels, discovering patterns or structure in data.^[5]
validation set: Held-out data used during training to tune hyperparameters.
weight: Coefficient that multiplies an input in a model.
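
Several of the terms above (weight, loss, gradient descent, learning rate, epoch) fit together in a single training loop. The following is a minimal sketch, assuming a one-weight model y_hat = w * x, mean squared error as the loss, and full-batch updates; the data values and function names are hypothetical and chosen only for illustration.

```python
# Toy data generated from y = 2x (hypothetical values for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mse_loss(w, xs, ys):
    """Mean squared error of the one-weight model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w, xs, ys):
    """Derivative of the MSE loss with respect to the weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, learning_rate=0.05, epochs=100):
    """Full-batch gradient descent; returns the final weight and loss history."""
    w = 0.0
    history = []
    for _ in range(epochs):  # each pass over all examples is one epoch
        history.append(mse_loss(w, xs, ys))
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * mse_gradient(w, xs, ys)
    return w, history

weight, losses = train(xs, ys)
```

With these settings the weight approaches 2 and the recorded loss shrinks toward zero. A much larger learning rate would make the same loop diverge, which is one reason the learning rate is treated as a hyperparameter.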

Neural networks and deep learning

activation function: Nonlinear function applied at each neuron that lets a network learn nonlinear relationships.
attention: Mechanism that weights different parts of an input when producing each part of an output.
batch normalization: Layer that normalizes activations across a mini-batch, often speeding training and acting as regularization.
convolutional neural network: Neural network built from convolutional layers, commonly used for images.^[4]
deep neural network: Neural network with two or more hidden layers.
dropout regularization: Regularization that randomly zeroes out a fraction of neurons during training.^[4]
embedding vector: Dense vector representation of a discrete input such as a word.
exploding gradient problem: Training instability caused by gradients growing unboundedly large.
fine tuning: Continuing training of a pretrained model on a task-specific dataset.
hidden layer: Neural network layer between the input and output layers.
Long Short-Term Memory (LSTM): Recurrent neural network cell with gating designed to capture long-range dependencies.^[4]
neural network: Model composed of layers of interconnected units that compute nonlinear functions of their inputs.
Rectified Linear Unit (ReLU): Activation function that outputs max(0, x).
recurrent neural network: Neural network that processes sequences by maintaining a hidden state across time steps.
self-attention: Attention mechanism in which queries, keys, and values all come from the same sequence.
softmax: Function that converts a vector of logits into a probability distribution over classes.
Transformer: Neural network architecture built on self-attention, the basis for most large language models. Introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need," which proposed "a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely" and reached 28.4 BLEU on the WMT 2014 English-to-German translation task.^[11]
vanishing gradient problem: Training difficulty in which gradients become very small in early layers, slowing or stopping learning.
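
As a rough illustration of how an activation function, a hidden layer, and softmax combine, the sketch below runs one forward pass through a tiny network. The weights, layer sizes, and the helper name `dense_layer` are assumptions made for this example and do not come from any particular framework.

```python
import math

def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense_layer(inputs, weights, biases, activation):
    """One layer: apply activation(weighted sum of inputs + bias) per unit."""
    return [
        activation(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]

# Hypothetical weights: 2 inputs, 2 hidden units, 3 output classes.
hidden = dense_layer([1.0, -2.0], [[0.5, -0.25], [-1.0, 1.0]], [0.0, 0.5], relu)
logits = dense_layer(hidden, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                     [0.0, 0.0, 0.0], lambda z: z)
probabilities = softmax(logits)
```

The second hidden unit receives a negative pre-activation, so ReLU zeroes it out; softmax then turns the three output logits into a distribution over classes.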

Classification and metrics

AUC (Area under the ROC curve): Probability that a randomly chosen positive ranks above a randomly chosen negative.
binary classification: Classification task with two possible label values.
confusion matrix: NxN table showing counts of predicted versus actual class for a classifier.
F1 score: Harmonic mean of precision and recall, balancing the two metrics in a single number.
logistic regression: Classification model that applies a sigmoid to a linear combination of features.^[5]
multi-class classification: Classification task with more than two possible classes.
precision: Fraction of positive predictions that are correct.
recall: Fraction of actual positives the model correctly predicted as positive.
ROC curve: Plot of true positive rate against false positive rate across classification thresholds.
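
The metrics above can all be computed from predictions and labels. This is a minimal sketch for binary classification, assuming labels and predictions are encoded as 0 and 1; the function names and sample data are illustrative. The `auc` function uses the pairwise-ranking reading of AUC given above and counts ties as half a win, which is a common convention assumed here.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of actual positives predicted as positive."""
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def auc(labels, scores):
    """Share of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
```

For this sample the counts are 3 true positives, 2 false positives, 1 false negative, and 2 true negatives, giving precision 0.6, recall 0.75, and accuracy 0.625.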

NLP and language models

BERT (Bidirectional Encoder Representations from Transformers): Transformer encoder pre-trained with masked language modeling to produce contextual text embeddings. Introduced by Devlin et al. in 2018, it masks roughly 15% of input tokens during pre-training; the released BERT-base has 110 million parameters and BERT-large has 340 million.^[12]
BLEU (Bilingual Evaluation Understudy): Score between 0 and 1, often reported scaled to 0-100, that evaluates machine translation by N-gram overlap with reference translations. Proposed by Papineni et al. at IBM in 2002, it was among the first automated MT metrics to correlate well with human judgments.^[1]^[13]
GPT (Generative Pre-trained Transformer): Family of transformer language models pretrained autoregressively on large text corpora.
hallucination: Generation of plausible-sounding but factually incorrect output by a generative model.^[2]
language model: Model that estimates the probability of token sequences.
large language model: Transformer language model with billions of parameters trained on very large text corpora.
masked language model: Language model trained to predict tokens that have been masked out of an input; BERT pre-training masks about 15% of tokens.^[12]
N-gram: Sequence of N consecutive tokens.
perplexity: Measure of how well a language model predicts a sample; lower is better.
sentiment analysis: NLP task of classifying text by emotional tone, such as positive or negative.
sequence-to-sequence task: Task that maps an input sequence to an output sequence, such as translation.
token: Smallest unit of text processed by a language model, often a word or subword.
word embedding: Vector representation of a word capturing semantic and syntactic relationships.
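
Two of the terms above, N-gram and perplexity, are easy to make concrete. The sketch below assumes whitespace tokenization and assumes the per-token probabilities have already been produced by some language model; both are simplifications for illustration, and the function names are hypothetical.

```python
import math

def ngrams(tokens, n):
    """All runs of n consecutive tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def perplexity(token_probabilities):
    """exp of the average negative log-probability assigned to each token."""
    avg_nll = -sum(math.log(p) for p in token_probabilities) / len(token_probabilities)
    return math.exp(avg_nll)

tokens = "the cat sat on the mat".split()  # naive whitespace tokenization
bigrams = ngrams(tokens, 2)

# A model that assigns probability 0.25 to every token behaves like a
# uniform choice among four options.
uniform_four = perplexity([0.25, 0.25, 0.25, 0.25])
```

Higher per-token probabilities give lower perplexity, which matches the "lower is better" reading in the definition.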

Reinforcement learning

Bellman equation: Recursive equation expressing the value of a state in terms of the expected reward and the value of successor states.^[9]
Deep Q-Network (DQN): Reinforcement learning algorithm using a deep neural network to approximate the Q-function.^[9]
Markov decision process (MDP): Mathematical framework for sequential decision making under uncertainty.^[9]
policy: Mapping from states to actions or to a distribution over actions.^[9]
Q-learning: Reinforcement learning algorithm that estimates the optimal action-value function via temporal difference updates.^[9]
reinforcement learning (RL): Paradigm in which an agent learns to act in an environment to maximize cumulative reward.^[9]
reward: Scalar signal that a reinforcement learning agent tries to maximize.^[9]
state: Description of the environment at a given time in reinforcement learning.^[9]
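
The Q-learning entry above refers to temporal difference updates. A minimal tabular sketch follows, assuming the Q-function is stored as a dictionary of dictionaries and that `alpha` (step size) and `gamma` (discount factor) are supplied by the caller; the state and action names are hypothetical.

```python
def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9):
    """Apply one temporal difference update to Q(state, action)."""
    best_next = max(q_table[next_state].values())
    td_target = reward + gamma * best_next      # Bellman-style target
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return q_table[state][action]

def greedy_policy(q_table, state):
    """Pick the action with the highest estimated value in this state."""
    return max(q_table[state], key=q_table[state].get)

# Hypothetical two-state table for illustration.
q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 1.0},
}
updated = q_learning_update(q, "s0", "right", reward=0.5, next_state="s1")
```

Here the target is 0.5 + 0.9 * 1.0 = 1.4, so the stored value moves one tenth of the way from 0 toward 1.4. A Deep Q-Network replaces the table with a neural network but keeps the same target.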

Trees, ensembles, and clustering

bagging: Ensemble method training each model on a bootstrap sample of the data, used in random forests.^[6]
boosting: Sequential ensemble method that combines weak learners, upweighting examples earlier models got wrong.^[6]
decision tree: Supervised model routing examples through a tree of conditions to reach a leaf prediction.^[6]
ensemble: Collection of models whose predictions are combined for a final output.^[6]
gradient boosting: Boosting technique in which each new model fits the negative gradient of the loss with respect to the current ensemble's predictions.^[6]
k-means: Clustering algorithm partitioning data into k clusters around mean centroids.^[6]
random forest: Ensemble of decision trees trained on bootstrap samples with attribute sampling.^[6]
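
To make the k-means entry concrete, the sketch below alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its cluster. It assumes one-dimensional points, caller-supplied starting centroids, and a fixed number of iterations; these are simplifications for illustration, not how a library such as scikit-learn implements it.

```python
def k_means_1d(points, centroids, iterations=10):
    """Lloyd-style k-means on 1-D points with given initial centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
final_centroids, final_clusters = k_means_1d(points, centroids=[0.0, 6.0])
```

With these inputs the centroids settle near 1.0 and 5.0 after the first update and stay there.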

Computer vision

bounding box: Rectangle described by image coordinates that encloses an object of interest.
convolution: Mathematical operation that slides a filter over an input to produce a feature map.
data augmentation: Artificially enlarging the training set by transforming existing examples, such as rotating images.
image recognition: Task of identifying objects, scenes, or other content in images.
intersection over union (IoU): Overlap metric for two regions equal to area of intersection divided by area of union.
pooling: Downsampling layer that aggregates spatial regions, such as max or average pooling.
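
Intersection over union for two bounding boxes can be sketched in a few lines. The example assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max); that coordinate convention is an assumption, since detection datasets differ on it.

```python
def intersection_over_union(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

overlap = intersection_over_union((0, 0, 2, 2), (1, 1, 3, 3))
```

The two boxes here share a 1 by 1 square, and their union covers 4 + 4 - 1 = 7 units of area, so the IoU is 1/7.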

Fairness

demographic parity: Fairness criterion requiring equal positive prediction rates across demographic groups.
disparate impact: Adverse effect of a decision system on a protected group, even without explicit discriminatory intent.
equality of opportunity: Fairness criterion requiring equal true positive rates across protected groups.
equalized odds: Fairness criterion requiring equal true positive and false positive rates across groups.
fairness metric: Quantitative measure of how a model's outcomes differ across groups.
sensitive attribute: Human attribute such as race, gender, or age that may warrant special consideration for legal or ethical reasons.
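
Demographic parity and equality of opportunity both compare a rate across groups. The sketch below computes the two rates per group, assuming binary predictions and labels encoded as 0/1 and a single group label per example; the group names and data are hypothetical.

```python
def positive_rate_by_group(groups, predictions):
    """Share of positive predictions within each group (demographic parity)."""
    rates = {}
    for g in set(groups):
        preds = [p for grp, p in zip(groups, predictions) if grp == g]
        rates[g] = sum(preds) / len(preds)
    return rates

def true_positive_rate_by_group(groups, labels, predictions):
    """Recall within each group (equality of opportunity)."""
    rates = {}
    for g in set(groups):
        preds_on_positives = [
            p for grp, l, p in zip(groups, labels, predictions)
            if grp == g and l == 1
        ]
        rates[g] = (sum(preds_on_positives) / len(preds_on_positives)
                    if preds_on_positives else 0.0)
    return rates

groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
predictions = [1, 1, 0, 0, 1, 0, 0, 0]

parity = positive_rate_by_group(groups, predictions)
opportunity = true_positive_rate_by_group(groups, labels, predictions)
parity_gap = abs(parity["a"] - parity["b"])
```

In this sample, group a receives positive predictions at rate 0.5 and group b at 0.25, so neither demographic parity nor equality of opportunity holds exactly.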

Recommendation and frameworks

collaborative filtering: Recommendation technique predicting user preferences based on the preferences of similar users.
Keras: High-level neural network API integrated into TensorFlow.^[7]
matrix factorization: Decomposing a matrix into a product of two lower-rank matrices, used in recommendation.
NumPy: Python library for numerical arrays and linear algebra.
pandas: Python library for tabular data manipulation built on top of NumPy.
recommendation system: System that suggests items to users based on their preferences or behaviors.
scikit-learn: Python library offering classical machine learning algorithms and preprocessing tools.^[6]
Tensor: Multidimensional array, the basic data structure in TensorFlow.^[7]
TensorFlow: Open-source platform for building and deploying machine learning models.^[7]
Tensor Processing Unit (TPU): Google's custom ASIC accelerator for machine learning workloads.^[1]
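
For the matrix factorization entry, the sketch below shows how a ratings matrix is approximated by the product of a user matrix and an item matrix. The factor values are hypothetical and fixed by hand; in a real recommender they would be learned from observed ratings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def predict_ratings(user_factors, item_factors):
    """Reconstruct the full ratings matrix as user_factors x item_factors^T."""
    return [[dot(u, v) for v in item_factors] for u in user_factors]

# Hypothetical 2-dimensional latent factors for 2 users and 2 items.
user_factors = [[1.0, 0.0], [0.5, 0.5]]
item_factors = [[4.0, 2.0], [1.0, 3.0]]

predicted = predict_ratings(user_factors, item_factors)
```

Each predicted rating is the dot product of one user vector and one item vector, so two small matrices stand in for the much larger user-by-item table.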

Generative models

diffusion model: Generative model that learns to reverse a gradual noising process to produce data.
discriminator: Network in a GAN that learns to distinguish real examples from those produced by the generator.
generative adversarial network (GAN): Two-network system where a generator produces samples and a discriminator tries to detect them.^[4]
generative model: Model that learns to produce new examples resembling the training distribution.
generator: Network in a GAN that produces synthetic examples intended to look real.
multimodal model: Model that ingests or produces more than one modality, such as text and images.

Additional commonly used terms

A short list of widely used ML terms not in the historic glossary, included for completeness.

Adam optimizer: Adaptive optimizer combining momentum with per-parameter learning rates.^[4]
autoencoder: Neural network trained to reconstruct its input through a compressed bottleneck representation.^[4]
bias-variance tradeoff: Tension between underfitting from high bias and overfitting from high variance.^[5]
gated recurrent unit (GRU): Recurrent cell similar to LSTM but with fewer gates.^[4]
hyperparameter tuning: Search procedure for choosing hyperparameter values that improve validation performance.^[6]
Kullback-Leibler divergence: Asymmetric measure of how one probability distribution diverges from a reference distribution.^[4]
learning curve: Plot of model performance against training set size or iterations.^[6]
mixed precision training: Training that uses lower-precision arithmetic for speed while keeping critical values in higher precision.^[8]
positional encoding: Vector added to token embeddings to inject information about token order.
prompt engineering: Crafting inputs to elicit desired behavior from a large language model.
retrieval-augmented generation (RAG): Generation conditioned on documents fetched at inference time from an external corpus.
variational autoencoder (VAE): Probabilistic autoencoder that models a continuous latent distribution.^[4]
zero-shot learning: Predicting on classes that were not seen during training, often by leveraging textual descriptions.^[4]
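
The asymmetry mentioned in the Kullback-Leibler divergence entry shows up directly in a small calculation. This sketch assumes two discrete distributions over the same outcomes, given as lists of probabilities, with no zero entries in the reference distribution; the function name is illustrative.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, in nats."""
    # Terms where p is zero contribute nothing to the sum.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_divergence(p, q)   # roughly 0.51 nats
reverse = kl_divergence(q, p)   # roughly 0.37 nats
```

The two directions give different values, which is why KL divergence is not a distance metric in the strict sense.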

How are these definitions sourced?

Definitions on this page are drawn from primary and authoritative references rather than informal usage. The Google Machine Learning Glossary is the primary source; Google describes it as having grown to over 460 terms across nine revisions since 2016.^[1]^[10] Architecture and algorithm entries are checked against the original papers: "Attention Is All You Need" (Vaswani et al., 2017) for the Transformer,^[11] the BERT paper (Devlin et al., 2018) for masked language modeling,^[12] and Papineni et al. (2002) for BLEU.^[13] Textbook and documentation references include Goodfellow, Bengio, and Courville's Deep Learning,^[4] Sutton and Barto's Reinforcement Learning: An Introduction for RL terms,^[9] and the scikit-learn and TensorFlow documentation for framework terms.^[6]^[7]

References
