Machine learning terms/All

Updated Jul 16, 2026

See also: Machine learning terms

This alphabetical glossary collects core terminology used across machine learning, deep learning, reinforcement learning, large language models, and TensorFlow tooling. Each entry includes a short definition drawn from references such as the Google Machine Learning Glossary and Wikipedia.^[1]^[2] Terms are grouped by first letter for easier scanning.

A

A/B testing: Comparing two variants on randomly assigned users.
accuracy: Fraction of correct predictions.
action: An RL agent's choice changing environment state.
activation function: Nonlinearity applied to a neuron's weighted sum.
active learning: Training where the model picks which examples to label.
AdaGrad: Adaptive per-parameter learning rate optimizer.
agent: RL entity acting in an environment for reward.
agglomerative clustering: Bottom-up hierarchical clustering.
anomaly detection: Spotting data points unlike typical patterns.
AR: Augmented reality.
area under the PR curve: Summary of precision-recall tradeoff.
area under the ROC curve: Probability that a classifier ranks a random positive above a random negative.
artificial general intelligence: Hypothetical human-level general AI.
artificial intelligence: Field building machines that reason or perceive.
attention: Mechanism weighting input parts when producing output.
attention mechanism: Query-key-value weighted sum used in transformers.
attribute: Synonym for feature.
attribute sampling: Random feature subset chosen at each tree split.
AUC (Area under the ROC curve): Threshold-independent classifier metric.
augmented reality: Computer imagery overlaid on a real-world view.
automation bias: Favoring automated suggestions over other sources.
autoencoder: Network learning to reconstruct input through a bottleneck.
average precision: Single-number precision-recall curve summary.
axis-aligned condition: Tree split on one feature against a threshold.
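
Two of the metrics above, accuracy and area under the ROC curve, reduce to short computations. The following is a minimal sketch in plain Python; the function names and the toy labels and scores are illustrative assumptions, and the pairwise AUC loop is written for clarity rather than speed.

```python
from itertools import product


def accuracy(labels, predictions):
    """Fraction of predictions that match their labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. The loop is quadratic in the number of
    examples, so this is for illustration only.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p, n in product(positives, negatives):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.1]
# Apply a classification threshold of 0.5 to get hard predictions.
predictions = [1 if s >= 0.5 else 0 for s in scores]
print(accuracy(labels, predictions))  # 0.75
print(roc_auc(labels, scores))        # 0.75
```

Accuracy depends on the chosen threshold, while the AUC value is computed from the scores alone.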

B

backpropagation: Chain-rule gradient computation through a network.
bagging: Ensemble averaging models trained on bootstrapped subsets.
bag of words: Text representation counting words, ignoring order.
baseline: Simple reference model for comparison.
batch: Group of examples processed together.
batch normalization: Normalizing activations per batch to stabilize training.
batch size: Examples per gradient update.
Bayesian neural network: Network with distributions over weights.
Bayesian optimization: Hyperparameter tuning with a probabilistic surrogate.
Bellman equation: Recursive equation for optimal state value.
BERT (Bidirectional Encoder Representations from Transformers): Transformer pretrained with masked language modeling.
bias (ethics/fairness): Unfair model preference across groups.
bias (math) or bias term: Learned constant added to a weighted sum.
bidirectional: Processing sequences from both directions.
bidirectional language model: Model using context on both sides of each token.
bigram: Pair of adjacent tokens.
binary classification: Classification with two possible classes.
binary condition: Tree test with two outcomes.
binning: Mapping continuous values into discrete buckets.
BLEU (Bilingual Evaluation Understudy): Translation metric on n-gram overlap.
boosting: Sequential ensemble correcting prior errors.
bounding box: Rectangle around a detected object.
broadcasting: Aligning tensor shapes for element-wise math.
bucketing: Synonym for binning.
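
Bag of words, bigrams, and binning are all simple transformations of raw input. As a rough sketch using only the Python standard library (the helper names, the whitespace tokenizer, and the example boundaries are assumptions made for illustration):

```python
import bisect
from collections import Counter


def bag_of_words(text):
    """Count words, ignoring their order (naive whitespace tokenization)."""
    return Counter(text.lower().split())


def bigrams(tokens):
    """Return every pair of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))


def bin_index(value, boundaries):
    """Map a continuous value to a bucket index given sorted boundaries.

    A value equal to a boundary falls into the bucket to its right.
    """
    return bisect.bisect_right(boundaries, value)


print(bag_of_words("the dog chased the cat")["the"])  # 2
print(bigrams(["the", "dog", "barked"]))
print(bin_index(25, [18, 30, 50]))  # 1
```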

C

calibration layer: Adjusts predicted probabilities to match observed frequencies.
candidate generation: First recommender stage selecting an item subset.
candidate sampling: Using a subset of negative classes in loss.
categorical data: Features with a small set of discrete values.
causal language model: Predicts next token from past context only.
centroid: Center point of a cluster.
centroid-based clustering: Clustering represented by central points.
chain of thought: Prompting that elicits intermediate reasoning steps.
checkpoint: Saved snapshot of model parameters.
class: One discrete output category.
classification model: Model predicting class labels.
classification threshold: Probability cutoff for the positive class.
class-imbalanced dataset: Dataset with very unequal class frequencies.
clipping: Bounding values within a fixed range.
Cloud TPU: Google's cloud-hosted TPU service.^[1]
clustering: Unsupervised grouping of similar examples.
co-adaptation: Neurons depending on specific neighbors.
collaborative filtering: Recommendations from patterns of many users.
condition: Tree feature test choosing the next branch.
confirmation bias: Interpreting evidence to confirm existing beliefs.
confusion matrix: Table of predicted versus actual labels.
continuous feature: Feature with any real value in an interval.
convenience sampling: Sampling from easily accessible sources.
convergence: State where further training barely changes loss.
convex function: Function whose graph lies on or below every chord between two of its points.
convex optimization: Minimizing convex functions over convex sets.
convex set: Set containing every segment between its points.
convolution: Sliding a filter over data to produce feature maps.
convolutional filter: Small weight matrix detecting a feature.
convolutional layer: Layer applying convolutional filters.
convolutional neural network: Network using convolutional layers, common for images.
convolutional operation: Multiply-and-sum as a filter slides over input.
cost: Synonym for loss.
co-training: Two models on different views label data for each other.
counterfactual fairness: Predictions unchanged under altered sensitive attributes.
coverage bias: Bias when sampling omits relevant subgroups.
crash blossom: Ambiguous headline illustrating language parsing challenges.
critic: Actor-critic value estimator guiding the actor.
cross-entropy: Loss comparing predicted and true distributions.
cross-validation: Rotating which data fold is held out.
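
A confusion matrix and a cross-entropy value can both be computed in a few lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, the confusion matrix is stored as a counter keyed by (actual, predicted) pairs, and cross-entropy is measured in nats.

```python
import math
from collections import Counter


def confusion_matrix(labels, predictions):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(labels, predictions))


def cross_entropy(true_dist, predicted_dist):
    """Cross-entropy in nats between a true and a predicted distribution.

    Terms where the true probability is zero contribute nothing.
    """
    return -sum(
        t * math.log(p) for t, p in zip(true_dist, predicted_dist) if t > 0
    )


cm = confusion_matrix([1, 0, 1, 1], [1, 0, 0, 1])
print(cm[(1, 1)], cm[(1, 0)], cm[(0, 0)], cm[(0, 1)])  # 2 1 1 0

# One-hot true distribution: the loss is -log of the probability
# assigned to the correct class.
print(cross_entropy([0, 1, 0], [0.2, 0.5, 0.3]))
```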

D

data analysis: Inspecting data to guide modeling.
data augmentation: Generating extra examples by transformation.
DataFrame: Tabular labeled data structure used in pandas.
data parallelism: Replicating a model across devices splitting the batch.
data set or dataset: Collection of examples used for training or evaluation.
Dataset API (tf.data): TensorFlow API for input pipelines.^[1]
decision boundary: Surface separating predicted classes.
decision forest: Ensemble of decision trees.
decision threshold: Synonym for classification threshold.
decision tree: Model with branching feature tests.
decoder: Encoder-decoder part generating outputs from a representation.
deep model: Network with many hidden layers.
deep neural network: Network with many stacked hidden layers.^[4]
Deep Q-Network (DQN): Deep network approximating the Q-function.
demographic parity: Equal positive prediction rates across groups.
denoising: Removing noise or reconstructing clean signal.
dense feature: Feature with mostly nonzero values.
dense layer: Layer connecting every input to every output.
depth: Number of layers in a network.
depthwise separable convolutional neural network (sepCNN): Efficient CNN factorizing convolutions.
derived label: Label computed from other data.
device: CPU, GPU, or TPU running operations.
diffusion model: Generative model reversing a noising process.
dimension reduction: Mapping data to fewer dimensions preserving structure.
dimensions: Independent axes of a tensor.
discrete feature: Feature with countable possible values.
discriminative model: Model learning conditional probability of labels.
discriminator: GAN network distinguishing real from fake.
disparate impact: Decisions disproportionately harming a protected group.
disparate treatment: Direct use of sensitive attributes in decisions.
divisive clustering: Top-down hierarchical clustering.
downsampling: Reducing data rate or majority-class count.
DQN: Abbreviation for Deep Q-Network.
dropout regularization: Randomly zeroing neurons during training.
dynamic: Continuously updated rather than static.
dynamic model: Model retrained continuously as data arrives.

E

eager execution: TensorFlow mode running ops immediately.^[1]
early stopping: Stopping when validation loss stops improving.
earth mover's distance (EMD): Distance based on minimum transport cost between distributions.
embedding: Learned dense vector for a discrete object.
embedding layer: Maps discrete tokens to dense vectors.
embedding space: Vector space where geometry reflects similarity.
embedding vector: Dense vector for one item.
empirical risk minimization (ERM): Choosing a model that minimizes average training loss.
encoder: Maps inputs to an internal representation.
ensemble: Combination of several models' predictions.
entropy: Uncertainty measure of a distribution.
environment: System an RL agent acts within.
episode: One full RL run from start to terminal.
epoch: One full pass through the training data.
epsilon greedy policy: Greedy mostly but explores with probability epsilon.
equality of opportunity: Equal true positive rates across groups.
equalized odds: Equal true and false positive rates across groups.
Estimator: TensorFlow high-level model API.^[1]
example: One data instance fed to a model.
experience replay: Sampling stored RL transitions for training.
experimenter's bias: Researcher expectations influencing data or analysis.
exploding gradient problem: Gradients growing uncontrollably during backprop.
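
Entropy and the epsilon greedy policy are both easy to state in code. The sketch below is illustrative only: the function names are assumptions, entropy is measured in bits, and the random number generator is passed in explicitly so that behavior can be made reproducible.

```python
import math
import random


def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def epsilon_greedy(action_values, epsilon, rng):
    """With probability epsilon pick a random action, else the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])


print(entropy([0.5, 0.5]))  # 1.0, a fair coin carries one bit

rng = random.Random(0)
# With epsilon = 0 the policy is purely greedy and picks index 1.
print(epsilon_greedy([0.1, 0.9, 0.3], 0.0, rng))  # 1
```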

F

fairness constraint: Formal restriction enforcing a fairness criterion.
fairness metric: Measure of behavior across protected groups.
false negative (FN): Negative prediction with positive truth.
false negative rate: Fraction of actual positives missed.
false positive (FP): Positive prediction with negative truth.
false positive rate (FPR): Fraction of negatives misclassified positive.
feature: Measurable input variable used by a model.
feature cross: Synthetic feature from combining features.
feature engineering: Designing features to improve performance.
feature extraction: Computing informative variables from raw data.
feature importances: Scores of each feature's contribution.
feature set: Features used by a particular model.
feature spec: Description of feature names and types.
feature vector: Ordered list of feature values for one example.
federated learning: Training across devices that keep data local.
feedback loop: When predictions influence future training data.
feedforward neural network (FFN): Network with connections only forward.
few-shot learning: Task performance from a few examples.
fine tuning: Continuing training of a pretrained model.
forget gate: LSTM gate discarding cell-state information.
foundation model: Large pretrained model adaptable to many tasks.
full softmax: Softmax over the entire vocabulary.
fully connected layer: Synonym for dense layer.

G

GAN: Abbreviation for generative adversarial network.
generalization: Performance on unseen data.
generalization curve: Plot of training versus validation loss.
generalized linear model: Linear model with link function.
generative adversarial network (GAN): Generator trained against a discriminator.
generative AI: AI producing new content like text or images.
generative model: Model of the joint distribution able to synthesize samples.
generator: GAN network producing synthetic data.
gini impurity: Misclassification probability at a tree node.
GPT (Generative Pre-trained Transformer): Autoregressive transformer family from OpenAI.
gradient: Vector of partial derivatives indicating steepest ascent.
gradient boosting: Boosting fitting each model to loss gradient.
gradient boosted (decision) trees (GBT): Gradient boosting using shallow trees.
gradient clipping: Capping gradient magnitudes during training.
gradient descent: Optimization stepping opposite the loss gradient.
graph: TensorFlow computation representation of nodes and edges.^[1]
graph execution: Running a precompiled TensorFlow graph.^[1]
greedy policy: Always picking the highest-valued action.
ground truth: Correct label for an example.
group attribution bias: Assuming individuals match their group's traits.
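
Gradient descent and gradient clipping can be illustrated on a one-parameter problem. The sketch below assumes a toy loss, (w - 3)^2, chosen only because its minimum is known to be at w = 3; the function names, learning rate, and clip limit are likewise illustrative.

```python
def gradient(w):
    """Derivative of the toy loss (w - 3)^2 with respect to w."""
    return 2.0 * (w - 3.0)


def clip(value, limit):
    """Bound value to the range [-limit, limit]."""
    return max(-limit, min(limit, value))


def gradient_descent(w, learning_rate, steps, clip_limit=None):
    """Repeatedly step opposite the gradient, optionally clipping it."""
    for _ in range(steps):
        g = gradient(w)
        if clip_limit is not None:
            g = clip(g, clip_limit)
        w -= learning_rate * g
    return w


# Both runs approach the minimizer w = 3; clipping only slows early steps.
print(gradient_descent(0.0, learning_rate=0.1, steps=200))
print(gradient_descent(0.0, learning_rate=0.1, steps=200, clip_limit=1.0))
```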

H

hallucination: Fluent but factually incorrect generative output.
hashing: Mapping inputs to fixed-size integers.
heuristic: Rule of thumb without optimality guarantee.
hidden layer: Layer between input and output.
hierarchical clustering: Clustering forming a tree of nested groups.
hinge loss: SVM margin-based loss.
holdout data: Examples reserved for final evaluation.
hyperparameter: Training setting chosen before optimization.
hyperplane: Flat decision surface in feature space.

I

i.i.d.: Independent and identically distributed.
image recognition: Identifying content within images.
imbalanced dataset: Dataset with very unequal class counts.
implicit bias: Unconscious attitudes affecting data work.
incompatibility of fairness metrics: Theorem that some fairness criteria cannot coexist.
independently and identically distributed (i.i.d.): Examples drawn independently from one distribution.
individual fairness: Similar individuals get similar predictions.
inference: Using a trained model to predict.
inference path: Tree conditions an example follows to a leaf.
information gain: Entropy reduction from splitting on a feature.
in-group bias: Favoring members of one's own group.
input layer: First network layer receiving features.
in-set condition: Tree test on set membership.
instance: Synonym for example.
interpretability: How understandable model reasoning is.
inter-rater agreement: How often labelers agree on labels.
intersection over union (IoU): Overlap-to-union ratio between regions.
IoU: Abbreviation for intersection over union.
item matrix: Factor matrix for items in matrix factorization.
items: Objects suggested by a recommender.
iteration: One parameter update from one batch.
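
Intersection over union has a direct formula for axis-aligned bounding boxes. As a minimal sketch, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples (a convention chosen here for illustration, as is the function name):

```python
def intersection_over_union(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Overlap area is 1 and union area is 4 + 4 - 1 = 7.
print(intersection_over_union((0, 0, 2, 2), (1, 1, 3, 3)))
```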

K

Keras: High-level neural network Python API.^[1]
keypoints: Distinctive image points used for pose or matching.
Kernel Support Vector Machines (KSVMs): SVMs using kernels in implicit feature spaces.
k-means: Clustering by alternating assignment and centroid update.
k-median: Clustering using medians instead of means.
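
The alternating assignment and centroid update behind k-means can be sketched on one-dimensional data. This is an illustrative toy, not a production implementation: the function name, the fixed iteration count, and the rule of keeping an empty cluster's old centroid are all assumptions.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate assignment and centroid update on 1-D data."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(x - centroids[i])
            )
            clusters[nearest].append(x)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = sum(members) / len(members)
    return centroids


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(k_means_1d(points, centroids=[0.0, 5.0]))  # [2.0, 11.0]
```

Replacing the mean in the update step with a median would give a k-median variant of the same loop.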

L

L0 regularization: Penalty on nonzero weight count.
L1 loss: Sum of absolute prediction errors.
L1 regularization: Penalty on absolute weights producing sparsity.
L2 loss: Sum of squared prediction errors.
L2 regularization: Penalty on squared weights encouraging small values.
label: Correct output for an example.
labeled example: Example with features and a known label.
LaMDA (Language Model for Dialogue Applications): Google conversational language model.^[1]
lambda: Hyperparameter controlling regularization strength.
landmarks: Named coordinate points annotated on images.
language model: Assigns probability to token sequences.
large language model: Language model with billions of parameters.
layer: Group of neurons or ops at a network stage.
Layers API (tf.layers): TensorFlow reusable building blocks.^[1]
leaf: Terminal tree node producing a prediction.
learning rate: Step size for gradient updates.
least squares regression: Regression minimizing squared residuals.
linear: Expressible as a weighted sum.
linear model: Predictions as a weighted sum plus bias.
linear regression: Linear function predicting a continuous value.
logistic regression: Classifier using a logistic function on a linear sum.
logits: Raw scores before softmax or sigmoid.
Log Loss: Logistic regression loss, equivalent to binary cross-entropy.
log-odds: Log ratio of two outcome probabilities.
Long Short-Term Memory (LSTM): Recurrent cell with gates for learning long-range dependencies.
LoRA: Low-rank adapter for parameter-efficient fine-tuning.
loss: Quantity measuring prediction error.
loss curve: Plot of loss over training steps.
loss function: Formula aggregating prediction errors into a loss.
loss surface: Landscape of loss across parameter values.
LSTM: Abbreviation for Long Short-Term Memory.
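
The L1, L2, and Log Loss entries above, along with the step from logits to probabilities, can be written out directly. The following is a sketch in plain Python; the function names are assumptions, and Log Loss is averaged over examples here, which is one common convention.

```python
import math


def l1_loss(labels, predictions):
    """Sum of absolute prediction errors."""
    return sum(abs(y - p) for y, p in zip(labels, predictions))


def l2_loss(labels, predictions):
    """Sum of squared prediction errors."""
    return sum((y - p) ** 2 for y, p in zip(labels, predictions))


def sigmoid(logit):
    """Map a raw score (logit) to a probability between zero and one."""
    return 1.0 / (1.0 + math.exp(-logit))


def log_loss(labels, logits):
    """Mean binary cross-entropy computed from raw logits."""
    total = 0.0
    for y, z in zip(labels, logits):
        p = sigmoid(z)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


print(l1_loss([3, 5], [2, 7]))  # 3
print(l2_loss([3, 5], [2, 7]))  # 5
print(sigmoid(0.0))             # 0.5
print(log_loss([1, 0], [0.0, 0.0]))
```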

M

machine learning: Building systems that learn patterns from data.^[3]^[5]
majority class: Most frequent class in an imbalanced dataset.
Markov decision process (MDP): RL model of states, actions, transitions, rewards.
Markov property: Future depends only on current state.
masked language model: Predicts hidden tokens given context.
matplotlib: Python plotting library.
matrix factorization: Decomposing a matrix into smaller factor matrices.
Mean Absolute Error (MAE): Average of absolute prediction errors.
Mean Squared Error (MSE): Average of squared prediction errors.
meta-learning: Learning algorithms that learn new tasks quickly.
metric: Scalar summarizing model performance.
Metrics API (tf.metrics): TensorFlow standard metrics module.^[1]
mini-batch: Small group of examples for one update.
mini-batch stochastic gradient descent: Gradient descent using mini-batches.
minimax loss: Two-player adversarial GAN objective.
minority class: Less frequent class in imbalanced data.
mixture of experts: Architecture routing inputs to specialized sub-networks.
ML: Abbreviation for machine learning.
MNIST: Classic handwritten digit benchmark dataset.
modality: A data type such as text, image, or audio.
model: Object mapping inputs to predictions via learned parameters.
model capacity: Range of functions a model can represent.
model parallelism: Splitting a model across devices.
model training: Fitting model parameters to data.
Momentum: Optimizer using moving average of gradients.
multi-class classification: Classification with more than two classes.
multi-class logistic regression: Logistic regression with softmax across classes.
multi-head self-attention: Parallel self-attention operations concatenated.
multimodal model: Model handling multiple data modalities.
multinomial classification: Synonym for multi-class classification.
multinomial regression: Regression for probabilities across categories.

N

NaN trap: Activations becoming NaN and spreading.
natural language understanding: NLP subfield extracting meaning from text.
negative class: Class without the tested condition.
neural network: Connected layers of weighted units.
neuron: Single unit computing weighted sum plus activation.
N-gram: Sequence of n adjacent tokens.
NLU: Abbreviation for natural language understanding.
node (neural network): Synonym for a neuron.
node (TensorFlow graph): Single op in a TensorFlow graph.^[1]
node (decision tree): Point in a tree with a test or leaf.
noise: Random variation not reflecting the underlying signal.
non-binary condition: Tree test with more than two outcomes.
nonlinear: Not expressible as a weighted sum.
non-response bias: Bias from groups responding to surveys less often.
nonstationarity: Data statistics changing over time.
normalization: Rescaling values to a standard range.
novelty detection: Identifying observations unlike training cases.
numerical data: Features represented as numbers.
NumPy: Python array computing library.

O

objective: Quantity an algorithm optimizes.
objective function: Function optimized during training.
oblique condition: Tree split on a linear feature combination.
offline: Performed in batch, not in real time.
offline inference: Predicting in advance and storing results.
one-hot encoding: Category as a vector with a single 1.
one-shot learning: Learning from a single labeled example.
one-vs.-all: Multi-class via one binary classifier per class.
online: Processing data as it arrives.
online inference: Predicting in real time on requests.
operation (op): Named computation in a TensorFlow graph.^[1]
optimizer: Algorithm updating parameters from gradients.
out-of-bag evaluation (OOB evaluation): Ensemble accuracy from unsampled examples.
out-group homogeneity bias: Seeing outside groups as more uniform.
outlier detection: Identifying examples far from typical data.
outliers: Examples with very atypical values.
output layer: Final network layer producing predictions.
overfitting: Fitting training data but failing to generalize.
oversampling: Adding minority examples to balance classes.
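
One-hot encoding is simple enough to show in full. A minimal sketch, assuming the vocabulary is a fixed list of categories (the function name and the example colors are illustrative):

```python
def one_hot(value, vocabulary):
    """Return a list with a single 1 at the index of value in vocabulary."""
    index = vocabulary.index(value)  # raises ValueError for unknown categories
    return [1 if i == index else 0 for i in range(len(vocabulary))]


print(one_hot("green", ["red", "green", "blue"]))  # [0, 1, 0]
```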

P

pandas: Python tabular data analysis library.
parameter: Value learned during training.
Parameter Server (PS): Distributed training architecture with shared servers.
parameter update: One optimizer step changing parameters.
partial derivative: Derivative with respect to one variable.
participation bias: Bias when individuals choose to be included.
partitioning strategy: Method splitting variables across machines.
perceptron: Single-layer binary classifier.
performance: Model quality or running speed.
permutation variable importances: Importance from shuffling a feature.
perplexity: Language model surprise metric on held-out text.
pipeline: Sequence of processing and modeling steps.
pipelining: Overlapping execution stages for throughput.
policy: RL mapping from states to actions.
pooling: CNN downsampling over local regions.
positive class: Class with the tested condition present.
post-processing: Steps applied to model output.
PR AUC (area under the PR curve): Summary of precision-recall curve.
precision: Fraction of predicted positives that are correct.
precision-recall curve: Plot of precision versus recall tradeoff.
prediction: Output a model produces for an input.
prediction bias: Average prediction minus average label.
predictive parity: Equal precision across groups.
predictive rate parity: Equal positive predictive values across groups.
preprocessing: Transforming raw data for training.
pre-trained model: Model previously trained on a large dataset.
prior belief: Initial assumption about parameters.
probabilistic regression model: Regression outputting a target distribution.
prompt engineering: Designing inputs to guide language model output.
proxy (sensitive attributes): Feature correlating with a sensitive attribute.
proxy labels: Substitute targets when true labels are unavailable.
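
Precision, recall, and prediction bias follow directly from their definitions above. The sketch below uses hypothetical function names and returns zero when a denominator would be empty, which is a convention assumed here rather than part of the definitions.

```python
def precision_recall(labels, predictions):
    """Return (precision, recall) for binary labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def prediction_bias(labels, predictions):
    """Average prediction minus average label."""
    return sum(predictions) / len(predictions) - sum(labels) / len(labels)


# Two true positives, one false positive, one false negative.
print(precision_recall([1, 1, 0, 0, 1], [1, 0, 1, 0, 1]))
print(prediction_bias([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))  # 0.0
```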

Q

Q-function: Expected return from taking an action in a state and then following a policy.
Q-learning: Value-based RL learning the Q-function.
quantile: Value below which a fraction of observations fall.
quantile bucketing: Binning with equal counts per bucket.
quantization: Reducing numerical precision of weights or activations.
queue: Structure holding inputs awaiting processing.
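
Q-learning can be sketched as a single table update. The example assumes the standard update rule, Q(s, a) += alpha * (reward + gamma * max Q(s', a') - Q(s, a)), with a step size alpha and discount factor gamma; the function name, the dictionary layout, and the state and action labels are hypothetical.

```python
def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.5, gamma=0.9):
    """Apply one Q-learning update in place and return the new value."""
    best_next = max(q_table[next_state].values())
    target = reward + gamma * best_next
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]


q_table = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 1.0, "right": 2.0},
}
# Target is 1.0 + 0.9 * 2.0 = 2.8, so the value moves halfway there.
print(q_learning_update(q_table, "s0", "right", reward=1.0, next_state="s1"))
```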

R

random forest: Ensemble of trees with random feature subsets.
random policy: Policy choosing actions uniformly at random.
ranking: Ordering items by relevance.
rank (ordinality): Position of an item in a sorted list.
rank (Tensor): Number of tensor dimensions.
rater: Person labeling examples for training data.
recall: Fraction of actual positives correctly identified.
recommendation system: System suggesting items of interest to users.
Rectified Linear Unit (ReLU): Activation returning input if positive, else zero.
recurrent neural network: Sequence network with hidden state.
regression model: Model predicting continuous values.
regularization: Penalizing complexity to reduce overfitting.
regularization rate: Hyperparameter for regularization strength.
reinforcement learning (RL): Learning by interaction to maximize reward.
reinforcement learning from human feedback: Fine-tuning models from human preference data.
ReLU: Abbreviation for Rectified Linear Unit.
replay buffer: Memory of past RL transitions.
reporting bias: Bias from unrepresentative event reporting.
representation: How examples are encoded as numbers.
re-ranking: Second pass refining candidate order.
retrieval augmented generation: Combining language models with document retrieval.
return: Cumulative discounted RL reward.
reward: RL feedback signal after an action.
ridge regularization: Synonym for L2 regularization.
RLHF: Abbreviation for reinforcement learning from human feedback.
RNN: Abbreviation for recurrent neural network.
ROC (receiver operating characteristic) Curve: Plot of TPR versus FPR across thresholds.
root: Top decision tree node.
root directory: Base folder for files like checkpoints.
Root Mean Squared Error (RMSE): Square root of mean squared error.
rotational invariance: Predictions unchanged under input rotation.
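
ReLU, RMSE, and the discounted return each fit in a few lines. This is an illustrative sketch with assumed function names; the discount factor gamma is a parameter chosen by the caller.

```python
import math


def relu(x):
    """Return the input if positive, else zero."""
    return x if x > 0 else 0.0


def rmse(labels, predictions):
    """Square root of the mean squared error."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma per elapsed step."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(relu(-2.0), relu(3.0))                   # 0.0 3.0
print(rmse([0, 0], [3, 4]))
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```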

S

sampling bias: Sampled data unrepresentative of population.
sampling with replacement: Sampling where examples can repeat.
SavedModel: TensorFlow serialization format for complete models.^[1]
Saver: TensorFlow class for variable checkpoints.^[1]
scalar: Tensor of rank zero, one number.
scaling: Adjusting feature value ranges.
scikit-learn: Python classical machine learning library.^[6]
scoring: Producing numeric scores for candidates.
selection bias: Selection correlating with outcomes.
self-attention (also called self-attention layer): Attention within one sequence.
self-supervised learning: Generating labels from the input itself.
self-training: Using confident predictions as new labels.
semi-supervised learning: Mixing labeled and unlabeled data.
sensitive attribute: Feature treated specially for fairness.
sentiment analysis: Classifying text by emotional attitude.
sequence model: Model for ordered sequences.
sequence-to-sequence task: Task with sequence input and output.
serving: Deploying a model to answer requests.
shape (Tensor): Sizes along each tensor dimension.
shrinkage: Reducing boosting step contributions.
sigmoid function: Activation mapping reals to between zero and one.
similarity measure: Function quantifying example likeness.
size invariance: Predictions unchanged under input rescaling.
sketching: Approximate summaries of large datasets.
softmax: Turning a vector into a probability distribution.
sparse feature: Feature with mostly zero values.
sparse representation: Storing only the positions and values of nonzero entries.
sparse vector: Vector whose values are mostly zero.
sparsity: Proportion of zero entries.
spatial pooling: Pooling across spatial feature map dimensions.
split: Subset of data such as train or test.
splitter: Tree component selecting best node conditions.
squared hinge loss: Hinge loss with squared margin violation.
squared loss: Squared prediction-target difference.
stability: Consistency under small changes.
staged training: Training in increasing-complexity phases.
state: Description of RL environment at a moment.
state-action value function: Synonym for Q-function.
static: Trained once and not updated.
static inference: Synonym for offline inference.
stationarity: Data statistics unchanged over time.
step: One training iteration on one batch.
step size: Synonym for learning rate.
stochastic gradient descent (SGD): Gradient descent on random examples or batches.
stride: Convolutional filter step size.
structural risk minimization (SRM): Balancing training error and complexity.
subsampling: Selecting a smaller data subset.
summary: TensorFlow value recorded for TensorBoard.^[1]
supervised machine learning: Learning from labeled input-output pairs.
synthetic feature: Feature derived from existing ones.
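
Softmax is a good candidate for a short worked example. The sketch below subtracts the largest logit before exponentiating, a common numerical-stability step that does not change the result; the function name is an assumption.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into a probability distribution."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([1.0, 2.0, 3.0])
print(probabilities)
print(softmax([1000.0, 1000.0]))  # [0.5, 0.5]
```

The outputs sum to one and preserve the ordering of the input scores.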

T

tabular Q-learning: Q-learning storing values in a table.
target: Synonym for label.
target network: Delayed Q-network copy for stability.
temporal data: Data indexed by time.
Tensor: Multidimensional array, the deep learning data structure.
TensorBoard: TensorFlow visualization toolkit for training runs.^[1]
TensorFlow: Open-source machine learning platform from Google.^[1]
TensorFlow Playground: Interactive browser tool for small networks.^[1]
TensorFlow Serving: Production serving system for TensorFlow models.^[1]
Tensor Processing Unit (TPU): Google's neural network hardware accelerator.^[1]
Tensor rank: Number of tensor dimensions.
Tensor shape: Sizes along each tensor dimension.
Tensor size: Total number of tensor elements.
termination condition: Criterion ending an iterative process.
test: Tree condition or held-out evaluation set.
test loss: Loss on the test set.
test set: Held-out data for final evaluation.
tf.Example: TensorFlow protocol buffer for one example.^[1]
tf.keras: TensorFlow bundled Keras API.^[1]
threshold (for decision trees): Cutoff value in a tree split.
time series analysis: Studying data sequenced in time.
timestep: One position in a sequence or RL tick.
token: Discrete unit such as a word or subword.
tower: Model replica in distributed training.
TPU: Abbreviation for Tensor Processing Unit.^[1]
TPU chip: Single integrated circuit performing TPU computation.^[1]
TPU device: Board with one or more TPU chips.^[1]
TPU master: Coordinator dispatching to TPU workers.^[1]
TPU node: Cloud TPU resource exposed to a VM.^[1]
TPU Pod: Cluster of many TPU chips on fast networking.^[1]
TPU resource: Allocation of TPU compute in Google Cloud.^[1]
TPU slice: Subset of TPU chips from a Pod.^[1]
TPU type: TPU configuration label like v3 or v4.^[1]
TPU worker: Process running TPU computation.^[1]
training: Fitting model parameters to data.
training loss: Loss on training data.
training-serving skew: Gap between training and serving data processing.
training set: Data used to fit model parameters.
trajectory: Sequence of RL states, actions, rewards.
transfer learning: Reusing knowledge across related tasks.
Transformer: Self-attention-based architecture dominant in language modeling.
translational invariance: Predictions unchanged under input shifts.
trigram: Sequence of three adjacent tokens.
true negative (TN): Correct negative prediction.
true positive (TP): Correct positive prediction.
true positive rate (TPR): Synonym for recall.
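
Tensor rank, shape, and size can be illustrated with nested Python lists standing in for a tensor. This sketch assumes rectangular nesting (every sublist at a given depth has the same length) and does not validate it; the function names are illustrative.

```python
import math


def shape(tensor):
    """Sizes along each dimension of a rectangular nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0] if tensor else None
    return tuple(dims)


def rank(tensor):
    """Number of dimensions."""
    return len(shape(tensor))


def size(tensor):
    """Total number of elements."""
    return math.prod(shape(tensor))


matrix = [[1, 2, 3], [4, 5, 6]]
print(shape(matrix), rank(matrix), size(matrix))  # (2, 3) 2 6
print(shape(7), rank(7), size(7))                 # () 0 1
```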

U

unawareness (to a sensitive attribute): Fairness approach hiding sensitive attributes.
underfitting: Model too simple to capture data patterns.
undersampling: Reducing majority class examples.
unidirectional: Sequence processing in one direction only.
unidirectional language model: Language model conditioning only on past tokens.
unlabeled example: Example without a known label.
unsupervised machine learning: Learning structure without labels.
uplift modeling: Causal modeling of an action's incremental effect.
upweighting: Increasing example influence in loss.
user matrix: Factor matrix for users in matrix factorization.

V

validation: Evaluating on a held-out set for tuning.
validation loss: Loss on the validation set.
validation set: Data for model selection during development.
vanishing gradient problem: Gradients shrinking to zero in deep networks.
variable importances: Scores estimating feature influence.
vector database: Storage for similarity search over embeddings.

W

Wasserstein loss: GAN loss based on earth mover's distance.
weight: Learned parameter scaling an input or activation.
Weighted Alternating Least Squares (WALS): Matrix factorization by alternating least squares.
weighted sum: Linear combination of inputs and weights.
wide model: Few-layer model with many features and feature crosses.
width: Number of units in a layer.
wisdom of the crowd: Aggregated independent estimates often beat single ones.
word embedding: Dense vector representation of a word.

Z

zero-shot learning: Performing a task with no labeled examples.
Z-score normalization: Rescaling to zero mean and unit variance.
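
Z-score normalization can be sketched with the standard library. The example assumes the population standard deviation and maps constant inputs to zeros to avoid dividing by zero; both choices, and the function name, are assumptions made for illustration.

```python
import statistics


def z_score_normalize(values):
    """Rescale values to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# The mean is 5 and the population standard deviation is 2.
print(z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9]))
```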

References

1. Google Developers: Machine Learning Glossary, https://developers.google.com/machine-learning/glossary
2. Wikipedia: Glossary of artificial intelligence, https://en.wikipedia.org/wiki/Glossary_of_artificial_intelligence
3. Wikipedia: Machine learning, https://en.wikipedia.org/wiki/Machine_learning
4. Wikipedia: Deep learning, https://en.wikipedia.org/wiki/Deep_learning
5. IBM: What is machine learning?, https://www.ibm.com/topics/machine-learning
6. scikit-learn glossary, https://scikit-learn.org/stable/glossary.html
