Machine learning terms/Fundamentals

introduction to machine learning fundamentals

Machine learning (ML) is the branch of artificial intelligence concerned with building systems that learn patterns from data rather than following explicitly programmed rules. The field draws on statistics, optimization theory, and computer science. A canonical definition comes from Tom Mitchell (1997): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

This page collects the foundational vocabulary that underpins almost every ML system, from a simple linear regression baseline to a billion-parameter neural network. Each term is linked to its own dedicated wiki article. The fundamentals chapter is the entry point for readers working through the Machine learning terms glossary, which is grouped by topic into chapters such as fundamentals, neural networks and deep learning, language evaluation, and advanced topics.

A practical ML system has three ingredients: a model (a parameterized function family), a loss function (a measure of how far predictions are from ground truth), and an optimization procedure (a way to adjust the parameters so that the loss decreases). The remainder of this article walks through the categories, components, and metrics that make this loop work in practice.

categories of machine learning

ML problems are usually grouped by the kind of supervision signal the training set provides. The five main categories are summarized below.

Category	Supervision signal	Typical tasks	Example algorithms
Supervised learning	Each example has an input and a label	Classification, regression	Linear regression, logistic regression, random forest, gradient boosted trees
Unsupervised learning	Inputs only, no labels	Clustering, density estimation, dimensionality reduction	k-means, Gaussian mixture, PCA, autoencoders
Semi-supervised learning	A small labeled set plus a much larger unlabeled set	Classification with scarce labels	Label propagation, pseudo-labeling, consistency regularization
Self-supervised learning	Labels are constructed automatically from the input itself	Pretraining language and vision models	Masked language modeling (BERT), next-token prediction (GPT), contrastive learning (SimCLR)
Reinforcement learning	A reward signal received after taking actions in an environment	Game playing, robotics, recommendation	Q-learning, policy gradients, actor-critic, PPO

supervised learning

In supervised machine learning the dataset consists of pairs of inputs and target outputs. The learner fits a function that maps inputs to outputs and is evaluated on held-out examples. Most production ML systems, including spam filters, fraud detection, ad ranking, and medical image triage, are supervised.

unsupervised learning

Unsupervised machine learning discovers structure in unlabeled data. Clustering groups similar items, dimensionality reduction projects high-dimensional data into a lower-dimensional space for visualization or compression, and density estimation models the probability distribution of the inputs.

semi-supervised and self-supervised learning

Semi-supervised learning is useful when labels are expensive but raw data is plentiful. Self-supervised learning has become the dominant paradigm for foundation models: the model is pretrained on large unlabeled corpora using a pretext task (predicting masked tokens, the next token, or the relative position of image patches) and is then fine-tuned on smaller supervised datasets.

reinforcement learning

Reinforcement learning (RL) trains an agent to take actions in an environment so as to maximize cumulative reward. The training signal is sparse and delayed, which makes the credit assignment problem central. Notable RL systems include DeepMind's AlphaGo and AlphaZero, OpenAI Five, and the reinforcement learning from human feedback (RLHF) stage used to align large language models.

regression versus classification

Within supervised learning, the two main task types are regression and classification.

Task	Output	Typical loss	Examples
Regression	A continuous numeric value	Squared loss (MSE), L1 loss	House price, temperature, time-to-failure
Binary classification	One of two classes (positive or negative class)	Log loss	Spam detection, loan default, click prediction
Multi-class classification	One of K mutually exclusive classes	Categorical cross-entropy	Digit recognition, image labeling
Multi-label classification	A subset of K classes (any number can be active)	Binary cross-entropy per label	Tagging, content moderation

Classification models often output probabilities via a softmax (multi-class) or sigmoid (binary) head, and a classification threshold converts the probability into a discrete decision.
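
As a rough illustration of the output heads and the threshold step, the following plain-Python sketch may help. The function names and the example scores are made up for illustration, and the default threshold of 0.5 is an assumption rather than a recommendation.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Map K real-valued scores to K probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def decide(prob, threshold=0.5):
    """Turn a positive-class probability into a 0/1 decision."""
    return 1 if prob >= threshold else 0

p = sigmoid(0.8)                         # binary head
probs = softmax([2.0, 1.0, 0.1])         # multi-class head
predicted_class = probs.index(max(probs))
```

Raising the threshold trades recall for precision, which is why it is usually tuned on validation data rather than fixed at 0.5.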

bias, variance, overfitting, and generalization

The central goal of ML is generalization: producing a model that performs well on data it has not seen. Two failure modes block this goal.

Underfitting occurs when the model is too simple to capture the structure in the data. Both training error and test error are high. Remedies include using a richer model, adding features, or training longer.
Overfitting occurs when the model memorizes idiosyncrasies of the training set. Training error is low but test error is high. Remedies include collecting more data, simplifying the model, or applying regularization.

bias-variance trade-off

The expected squared error of a regression estimator can be decomposed into three terms: a bias term (how far the average prediction is from the truth), a variance term (how much predictions fluctuate across training sets), and irreducible noise. High-bias models (like a linear model on a curved relationship) underfit; high-variance models (like a deep tree on a small dataset) overfit. Choosing model capacity and regularization is largely the art of balancing these two sources of error.
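
Written out, the decomposition takes the following standard form. The notation is an assumption introduced here: the data are generated as y = f(x) + epsilon with zero-mean noise of variance sigma^2, \hat{f} is the model fitted on a random training set, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```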

generalization curve

A generalization curve plots training loss and validation loss against training time or model capacity. As capacity grows, training loss falls monotonically while validation loss typically falls and then rises again, forming a U shape whose minimum suggests the right level of capacity or the right number of training epochs. The point where validation loss is minimum motivates early stopping.
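
One common way to act on this curve is patience-based early stopping. The sketch below is a minimal version under stated assumptions: the validation losses are already recorded per epoch, and the `patience` parameter and function name are illustrative choices, not a standard API.

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, stopped_epoch) for a list of per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs after the best epoch seen so far.
    """
    best_epoch, best_loss = 0, float("inf")
    stopped_epoch = len(val_losses) - 1  # default: ran to the end
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            stopped_epoch = epoch
            break
    return best_epoch, stopped_epoch

# Hypothetical validation curve: falls, then rises.
best_epoch, stopped_epoch = early_stopping([1.0, 0.8, 0.7, 0.72, 0.75, 0.6])
```

In this made-up curve the run stops before seeing the later, lower loss, which illustrates the risk of setting the patience too small.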

train, validation, and test splits

To estimate generalization honestly, the dataset is partitioned into disjoint subsets.

Split	Purpose	Typical share
Training set	Fit the model parameters	60 to 80 percent
Validation set	Tune hyperparameters, choose architectures, decide when to stop	10 to 20 percent
Test set	Final unbiased evaluation, used at most once	10 to 20 percent

Splits should be drawn so that examples are independently and identically distributed (i.i.d.). For time-series problems use a temporal split; for grouped data (patients, users) use a group-aware split to avoid leakage.
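
The sketch below shows one way a random three-way split and a group-aware split might be written in plain Python. The fractions, the function names, and the (patient_id, measurement) records are all illustrative assumptions.

```python
import random

def train_val_test_split(examples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into disjoint train / validation / test lists."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

def group_split(examples, group_of, test_frac=0.2, seed=0):
    """Hold out whole groups so that no group appears on both sides."""
    rng = random.Random(seed)
    groups = sorted({group_of(e) for e in examples})
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

data = list(range(100))
train, val, test = train_val_test_split(data)

# Hypothetical (patient_id, measurement_index) records.
records = [(pid, i) for pid in range(10) for i in range(3)]
g_train, g_test = group_split(records, group_of=lambda r: r[0])
```

A temporal split would replace the shuffle with a sort by timestamp, so that validation and test examples always come after the training examples.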

cross-validation

When data is scarce, k-fold cross-validation gives a more stable estimate of validation performance. The training data is split into k folds; the model is trained k times, each time holding out one fold as the validation set and fitting on the remaining k - 1 folds, and the k validation scores are averaged. Common choices are 5-fold and 10-fold. Stratified k-fold preserves class proportions in each fold and is preferred for class-imbalanced datasets. Leave-one-out cross-validation (LOOCV) is the limiting case where k equals the number of examples; its estimate is nearly unbiased, but it is expensive to compute and can have high variance.
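
The procedure can be sketched in a few lines of plain Python. This version assumes the data has already been shuffled, does no stratification, and uses a toy "predict the training mean" model; the `fit` and `score` callables are hypothetical placeholders for a real model.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_idx, val_idx) pairs; every index is held out exactly once."""
    indices = list(range(n_examples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_val_score(xs, ys, k, fit, score):
    """Average the validation score over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return sum(scores) / len(scores)

# Toy model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mse(model, xs, ys):
    return sum((y - model) ** 2 for y in ys) / len(ys)

folds = list(k_fold_indices(10, 3))
avg_mse = cross_val_score(list(range(10)), [2.0] * 10, k=5, fit=fit_mean, score=mse)
```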

loss functions

A loss function measures the cost of a prediction. Training minimizes the average loss over the training set, sometimes called the empirical risk.

Loss	Formula (per example)	Typical use
Squared loss (MSE)	(y - y_hat)^2	Regression with Gaussian noise
L1 loss (MAE)	abs(y - y_hat)	Regression robust to outliers
Huber loss	Quadratic for small errors, linear for large	Regression with heavy-tailed noise
Binary log loss	-[y log(p) + (1-y) log(1-p)]	Binary classification
Categorical cross-entropy	-sum(y_k log p_k)	Multi-class classification
Hinge loss	max(0, 1 - y * f(x))	Support vector machines
Kullback-Leibler divergence	sum(p log(p/q))	Distribution matching, variational inference
Contrastive / triplet loss	Margin-based pair or triplet objectives	Metric learning, embeddings

Alongside the per-example loss, training adds a regularization term to the total objective, weighted by a lambda coefficient, which is itself a hyperparameter often tuned by validation.
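
Several of the per-example losses in the table can be written directly in plain Python, as in the sketch below. The Huber threshold `delta`, the 0.5 factor in its quadratic branch (the usual convention), and the clipping constant `eps` are choices made for this illustration.

```python
import math

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def l1_loss(y, y_hat):
    return abs(y - y_hat)

def huber_loss(y, y_hat, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    err = abs(y - y_hat)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)

def binary_log_loss(y, p, eps=1e-12):
    """y is 0 or 1; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def hinge_loss(y, score):
    """y is -1 or +1; score is the raw model output f(x)."""
    return max(0.0, 1.0 - y * score)
```

For example, `huber_loss(0, 3)` grows linearly (2.5) where `squared_loss(0, 3)` grows quadratically (9), which is the sense in which Huber loss is less sensitive to large errors.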

optimization algorithms

Once the loss is defined, an optimizer iteratively updates the model parameters to reduce it. Almost all modern ML uses first-order gradient methods.

gradient descent

Vanilla gradient descent computes the gradient of the loss over the entire training set and takes a step opposite to that gradient, scaled by a learning rate. For large datasets this is impractical because each step requires a full pass over the data.
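
A minimal sketch of full-batch gradient descent, assuming a one-feature linear model y = w*x + b trained with mean squared error. The learning rate, step count, and toy data are arbitrary choices for illustration.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Each step uses the gradient over the entire training set.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step opposite to the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = gradient_descent(xs, ys)
```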

stochastic and mini-batch gradient descent

Stochastic gradient descent (SGD) approximates the full-batch gradient with the gradient computed on a single example. Mini-batch SGD, the workhorse of deep learning, computes the gradient on a small batch of examples (the batch size is a key hyperparameter, typically between 32 and 8,192). Mini-batches give a lower-variance estimate of the true gradient than a single example while still allowing many updates per epoch.
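
The mini-batch variant can be sketched as follows, again assuming a one-feature linear model with squared loss. The batch size, learning rate, epoch count, and noise-free toy data are illustrative assumptions; setting `batch_size=1` would give single-example SGD.

```python
import random

def minibatch_sgd(xs, ys, batch_size=4, learning_rate=0.05, epochs=500, seed=0):
    """Fit y = w*x + b with mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # new random batches every epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Gradient estimated from this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

xs = [i / 10 for i in range(20)]   # 0.0, 0.1, ..., 1.9
ys = [3.0 * x - 1.0 for x in xs]   # noise-free targets from y = 3x - 1
w, b = minibatch_sgd(xs, ys)
```

With 20 examples and a batch size of 4, each epoch performs five parameter updates instead of the single update a full-batch step would make.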

momentum and adaptive optimizers

Momentum keeps an exponentially weighted average of past gradients, accelerating descent in consistent directions and damping oscillations. Nesterov accelerated gradient (NAG) computes the gradient at a look-ahead position. AdaGrad adapts the learning rate per parameter using the sum of squared past gradients, suiting sparse features but eventually shrinking the rate to zero. RMSProp replaces that cumulative sum with an exponentially decayed moving average. Adam (Adaptive Moment Estimation), introduced by Kingma and Ba in 2014, combines momentum and RMSProp with bias-corrected moments and is the default for many deep learning tasks. AdamW decouples weight decay from the gradient step and is preferred for transformers. Learning rate schedules (step decay, cosine decay, warmup followed by decay) are used together with optimizers to improve convergence.
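
The Adam update can be sketched for a single scalar parameter as below. The default values of beta1, beta2, and eps follow the ones commonly quoted for Adam; the learning rate, step count, and the quadratic test function are assumptions chosen for illustration.

```python
import math

def adam_minimize(grad_fn, x0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a scalar function given its gradient, using the Adam update."""
    x = x0
    m = 0.0  # exponentially decayed average of gradients (momentum term)
    v = 0.0  # exponentially decayed average of squared gradients (RMSProp term)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
def grad(x):
    return 2.0 * (x - 3.0)

x_min = adam_minimize(grad, x0=0.0)
x_after_one_step = adam_minimize(grad, x0=0.0, steps=1)
```

Because of the bias correction, the very first step has magnitude close to the learning rate regardless of the gradient's scale, which is one reason Adam is relatively insensitive to how the loss is scaled.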

regularization techniques

Regularization adds inductive bias toward simpler functions to fight overfitting.

Technique	Effect	Notes
L1 regularization	Sum of absolute weights, encourages sparsity	Used in lasso regression, can drive some weights to exactly zero
L2 regularization	Sum of squared weights, shrinks all weights toward zero	Used in ridge regression and weight decay
L0 regularization	Penalizes the count of non-zero weights	Combinatorial, usually approximated
Elastic net	Mixture of L1 and L2	Compromise between sparsity and stability
Dropout	Randomly zeros activations during training	Acts as model averaging across sub-networks
Early stopping	Stops training when validation loss stops improving	Implicit regularization, near-free
Data augmentation	Generates synthetic training examples	Standard in vision (flips, crops, color jitter), audio (pitch, time stretch), and text (back-translation)
Batch normalization	Normalizes activations within a batch	Often acts as regularization in addition to its optimization benefits
Label smoothing	Softens hard labels toward a uniform distribution	Reduces overconfidence in classification
Gradient clipping	Bounds gradient norms	Stabilizes training when gradients explode

The strength of L1 or L2 regularization is controlled by the regularization rate (lambda), tuned on the validation set.
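
A small sketch of how the L1 and L2 penalties enter the objective. The `soft_threshold` helper is the standard proximal step for an L1 penalty and is included only to illustrate why L1 can produce exact zeros; the function names and example weights are made up.

```python
def l1_penalty(weights):
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    return sum(w * w for w in weights)

def regularized_objective(data_loss, weights, lam, kind="l2"):
    """Total objective = data loss + lambda * penalty."""
    penalty = l1_penalty(weights) if kind == "l1" else l2_penalty(weights)
    return data_loss + lam * penalty

def soft_threshold(w, t):
    """Shrink w toward zero by t, and set it to exactly zero if |w| <= t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

weights = [1.0, -2.0, 0.5]
total_l2 = regularized_objective(1.0, weights, lam=0.1, kind="l2")
total_l1 = regularized_objective(1.0, weights, lam=0.1, kind="l1")
```

An L2 penalty rescales every weight by a factor slightly below one at each step, so small weights shrink but rarely reach zero, whereas the soft-threshold step zeroes any weight whose magnitude falls below the threshold.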

feature engineering

Feature engineering converts raw data into the feature vectors consumed by the model. Even with deep learning, careful preparation of inputs often makes a measurable difference.

scaling and normalization

Numerical features typically require scaling so that no single feature dominates distance and gradient computations.

Method	Formula	When to use
Min-max scaling	(x - min) / (max - min)	Bounded outputs, image pixels
Z-score normalization	(x - mean) / std	Default choice for most numeric features
Robust scaling	(x - median) / IQR	Outlier-prone features
Log transform	log(1 + x)	Skewed positive values, counts
Bucketing	Map continuous values into discrete bins	Capture non-linearity, simplify interactions
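
The transformations in the table might be implemented as in the following sketch, which uses only the Python standard library. It assumes non-constant features (so the denominators are non-zero), uses the population standard deviation for the z-score, and relies on the default quartile method of `statistics.quantiles`; the sample values are invented.

```python
import bisect
import math
import statistics

def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def robust_scale(values):
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return [(v - median) / (q3 - q1) for v in values]

def log_transform(values):
    return [math.log1p(v) for v in values]  # log(1 + x)

def bucketize(value, boundaries):
    """Return the index of the bin that value falls into."""
    return bisect.bisect_right(boundaries, value)

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # one large outlier
scaled = min_max_scale(values)
standardized = z_score(values)
robust = robust_scale(values)
```

In practice the statistics (min, max, mean, standard deviation, quartiles) are computed on the training set only and then reused to transform validation and test data.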

encoding categorical features

Categorical features must be turned into numbers.

One-hot encoding creates a 0/1 indicator per category, producing a sparse representation for high-cardinality fields.
Ordinal encoding assigns integers when categories have a natural order.
Target encoding replaces a category with a statistic of the target (mean, median) computed within cross-validation folds to avoid leakage.
Hashing trick maps categories into a fixed-size space, useful at very high cardinalities.
Learned embeddings, used by an embedding layer in a neural network, map each category to a low-dimensional dense vector.
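
Three of these encodings are simple enough to sketch in plain Python. The vocabulary, the size ordering, and the choice of SHA-256 as the hash function are illustrative assumptions; production systems often use faster non-cryptographic hashes.

```python
import hashlib

def one_hot(value, vocabulary):
    """Return a 0/1 indicator vector with a single 1 at the value's position."""
    return [1 if value == v else 0 for v in vocabulary]

def ordinal_encode(value, ordered_categories):
    """Map a category to its rank in a known ordering."""
    return ordered_categories.index(value)

def hash_bucket(value, num_buckets):
    """Hashing trick: map any string to one of num_buckets indices."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

colors = ["red", "green", "blue"]       # hypothetical vocabulary
sizes = ["small", "medium", "large"]    # hypothetical natural order

encoded_color = one_hot("green", colors)
encoded_size = ordinal_encode("large", sizes)
bucket = hash_bucket("user_12345", num_buckets=8)
```

The hashing trick needs no vocabulary at all, at the cost of occasional collisions where two categories share a bucket.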

feature selection and crossing

Feature selection removes irrelevant or redundant features to reduce variance and training cost. Methods include filter (mutual information, chi-squared), wrapper (recursive feature elimination), and embedded approaches (L1 regularization, tree feature importance). Feature crosses combine two or more features into a synthetic feature that captures their interaction, for example combining latitude and longitude into a grid cell.

linear models

Linear models are the foundation of supervised learning and remain strong baselines.

Linear regression predicts a continuous target as a weighted sum of features. The closed-form solution uses the normal equation; large problems use SGD. With L2 regularization it becomes ridge regression; with L1 it becomes lasso.
Logistic regression maps a linear combination of features through the sigmoid function to produce a probability for binary classification. Despite its name it is a classifier. The multinomial extension uses softmax and is sometimes called softmax regression. The log-odds (logit) is the inverse of the sigmoid.
Generalized linear models (GLMs) extend linear regression to other distributions, for example Poisson regression for count data and gamma regression for positive continuous responses.
Linear discriminant analysis (LDA) fits class-conditional Gaussians with shared covariance and produces linear decision boundaries.
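
For a single feature, the closed-form least-squares solution and the sigmoid/logit pair can be sketched as below. This is a one-dimensional special case chosen for brevity; the function names and toy data are illustrative.

```python
import math

def fit_simple_linear_regression(xs, ys):
    """Closed-form least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds; the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))

slope, intercept = fit_simple_linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
round_trip = logit(sigmoid(1.7))
```

A logistic regression prediction is then `sigmoid(slope * x + intercept)` with the coefficients fitted by minimizing log loss rather than squared error.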

tree-based models

Decision trees split the feature space along axis-aligned thresholds and predict the majority class or mean target in each leaf. They handle mixed feature types, are invariant to monotone transformations, and are easy to inspect.

Random forests (Breiman, 2001) train many trees on bootstrap samples with feature subsampling and average their predictions. They reduce variance with little tuning.
Gradient boosted decision trees (GBDT) fit each new tree to the negative gradient of the loss with respect to the current ensemble's predictions (the pseudo-residuals, which for squared loss are the ordinary residuals). Modern implementations include XGBoost (Chen and Guestrin, 2016), LightGBM (Microsoft, 2017), and CatBoost (Yandex, 2017). GBDT remains the dominant approach on tabular data.
Isolation forests are an unsupervised variant used for anomaly detection.

Tree models are non-linear and can capture interactions automatically, which is why they often outperform linear models on tabular tasks.

distance-based and kernel models

k-nearest neighbors (kNN) is a non-parametric method that classifies an example by majority vote of its k nearest neighbors in feature space, with distance computed under a chosen metric (Euclidean, cosine, Mahalanobis). kNN has no training phase but expensive prediction; structures like KD-trees or approximate nearest neighbor (ANN) indexes accelerate it.
Support vector machines (SVMs), introduced by Cortes and Vapnik (1995), find the maximum-margin hyperplane separating classes. The kernel trick allows SVMs to fit non-linear boundaries by implicitly mapping inputs into a higher-dimensional space using a kernel function such as the radial basis function (RBF) or polynomial kernel. SVMs use the hinge loss.
Kernel ridge regression, Gaussian processes, and kernel PCA generalize the kernel idea beyond classification.
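
A brute-force kNN classifier fits in a few lines, as the sketch below shows. It assumes Euclidean distance and a linear scan over the training set (no KD-tree or ANN index); the points and labels are made up.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # "Training" is just storing the data; all the work happens at prediction time.
    by_distance = sorted(
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]

near_origin = knn_predict(points, labels, query=(0.2, 0.1))
near_far_cluster = knn_predict(points, labels, query=(5.5, 5.5))
```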

probabilistic models

Probabilistic models represent uncertainty through joint or conditional distributions. Naive Bayes assumes feature independence given the class and applies Bayes' rule; despite the strong assumption it is fast and effective for text classification with bag-of-words features (Multinomial, Bernoulli, Gaussian variants). Bayesian networks are directed acyclic graphs whose nodes are random variables and whose edges encode conditional dependence; they were popularized by Judea Pearl in the 1980s. Hidden Markov models (HMMs) and conditional random fields (CRFs) model sequence data and were workhorses for speech and NLP before neural networks took over. Bayesian linear regression, Gaussian processes, and variational autoencoders (VAEs) bring Bayesian principles to modern ML.

neural networks: a brief introduction

A neural network stacks layers of neurons. Each neuron computes a weighted sum of its inputs plus a bias, then applies a non-linear activation function such as the Rectified Linear Unit (ReLU), sigmoid, or hyperbolic tangent. A network has an input layer, one or more hidden layers, and an output layer; the number of hidden layers gives the model's depth and a network with more than one hidden layer is called a deep model. Training uses backpropagation, which computes gradients by applying the chain rule from the output back through the layers. The fundamentals chapter introduces these concepts, while the dedicated neural networks and deep learning chapter covers convolutional, recurrent, and transformer architectures in more depth.
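
The forward pass through a small network can be sketched without any library. The weights and biases below are arbitrary numbers picked so the arithmetic is easy to follow; they are not trained values.

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def dense_layer(inputs, weights, biases, activation):
    """Each neuron: weighted sum of inputs plus bias, then the activation."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

x = [1.0, 2.0]

# Hidden layer with two ReLU neurons.
# Neuron 1: 0.5*1 - 1.0*2 + 0.0 = -1.5, ReLU gives 0.0
# Neuron 2: 1.0*1 + 1.0*2 - 1.0 =  2.0, ReLU gives 2.0
hidden = dense_layer(x, [[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0], relu)

# Linear output layer with one neuron: 1.0*0.0 + 0.5*2.0 + 0.1 = 1.1
output = dense_layer(hidden, [[1.0, 0.5]], [0.1], identity)
```

Without the non-linear activation, the two layers would collapse into a single linear map, which is why the activation function is essential to depth.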

evaluation metrics

The choice of metric depends on the task and on what kinds of errors matter most.

classification metrics

Metric	Definition	Notes
Accuracy	(TP + TN) / total	Misleading on imbalanced data
Precision	TP / (TP + FP)	Fraction of predicted positives that are truly positive
Recall (sensitivity, TPR)	TP / (TP + FN)	Fraction of real positives that are recovered
Specificity	TN / (TN + FP)	Fraction of negatives correctly rejected
F1 score	Harmonic mean of precision and recall	Balances the two when one class is rare
F-beta	Weighted harmonic mean	Use beta>1 to weight recall more, beta<1 to weight precision
False positive rate	FP / (FP + TN)	Used in ROC curves
AUC (ROC AUC)	Area under the ROC curve	Probability a random positive ranks above a random negative
PR AUC	Area under the precision-recall curve	Often more informative than ROC AUC under heavy class imbalance
Log loss	Average negative log probability of the true class	Rewards calibrated probabilities
Brier score	Mean squared error between predicted probability and outcome	Used for calibration assessment

A confusion matrix is the standard visualization that shows TP, FP, TN, and FN counts and is the source of all the metrics above.
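
The count-based metrics can be computed directly from the four confusion-matrix cells, as in this sketch. The counts are invented, and the function assumes every denominator is non-zero (at least one predicted positive, one real positive, and one real negative).

```python
def classification_metrics(tp, fp, tn, fn, beta=1.0):
    """Derive common metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        # F-beta reduces to F1 when beta == 1.
        "f_beta": (1 + b2) * precision * recall / (b2 * precision + recall),
    }

# Hypothetical imbalanced problem: 50 real positives among 1,000 examples.
m = classification_metrics(tp=40, fp=10, tn=940, fn=10)
```

Here accuracy is 0.98 even though one in five real positives is missed, which illustrates why accuracy alone is misleading on imbalanced data.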

regression metrics

Metric	Formula	Notes
Mean absolute error (MAE)	mean(abs(y - y_hat))	Same units as the target, robust to outliers
Mean squared error (MSE)	mean((y - y_hat)^2)	Penalizes large errors more
Root mean squared error (RMSE)	sqrt(MSE)	Same units as the target
R-squared	1 - SS_res / SS_tot	Fraction of variance explained, can be negative
MAPE	mean(abs((y - y_hat) / y))	Scale-free, undefined when y is zero
Quantile loss	Pinball loss	Used to fit conditional quantile estimators
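
The regression metrics above can be sketched in plain Python as follows. The targets and predictions are toy values, and the pinball loss is shown per example for a quantile level `tau` between 0 and 1.

```python
import math

def regression_metrics(ys, preds):
    n = len(ys)
    mae = sum(abs(y - p) for y, p in zip(ys, preds)) / n
    mse = sum((y - p) ** 2 for y, p in zip(ys, preds)) / n
    mean_y = sum(ys) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return {
        "mae": mae,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2": 1.0 - ss_res / ss_tot,
    }

def pinball_loss(y, q_hat, tau):
    """Quantile loss: under-prediction costs tau, over-prediction costs 1 - tau."""
    diff = y - q_hat
    return max(tau * diff, (tau - 1.0) * diff)

metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
```

With `tau=0.9`, under-predicting by 2 costs 1.8 while over-predicting by 2 costs only 0.2, which pushes the fitted value toward the 90th percentile.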

ranking and probabilistic calibration

Ranking tasks (search, recommendation) use Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (NDCG). Calibration plots and Expected Calibration Error (ECE) check whether predicted probabilities match observed frequencies.
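
Two of the ranking metrics can be sketched as below. Each input list holds the relevance grades of the returned items in ranked order. The DCG here uses the linear-gain form (relevance divided by log2 of rank + 1); an exponential-gain variant also exists, so treat this as one convention among several.

```python
import math

def reciprocal_rank(relevances):
    """1 / rank of the first relevant item, or 0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    return sum(reciprocal_rank(r) for r in queries) / len(queries)

def dcg(relevances):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

mrr = mean_reciprocal_rank([[1, 0], [0, 1]])  # first hit at rank 1, then at rank 2
perfect = ndcg([3, 2, 1])
reversed_order = ndcg([1, 2, 3])
```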

the machine learning pipeline

A real ML system is more than a model. The end-to-end pipeline typically includes the following stages.

Stage	Activities
Data	Collection, cleaning, deduplication, labeling, proxy labels, rater management, schema validation
Feature engineering	Transformations, scaling, encoding, feature selection, synthetic features
Training	Choosing model, loss, optimizer, hyperparameters, distributed training
Validation	Cross-validation, hyperparameter search (grid, random, Bayesian, Hyperband)
Evaluation	Held-out test metrics, slice-based analysis, fairness audits
Deployment	Online inference (low latency) versus offline batch scoring, static versus dynamic models, A/B testing, canary releases
Monitoring	Drift detection, training-serving skew, feedback loops, nonstationarity, retraining cadence, alerting
Governance	Interpretability, bias and fairness audits, lineage, model cards, documentation

MLOps is the engineering discipline that productionizes this loop, with tools such as MLflow, Kubeflow, Vertex AI, SageMaker, and Tecton handling experiment tracking, feature stores, and continuous training.

hyperparameter tuning

Hyperparameters are configuration values set before training (learning rate, batch size, regularization strength, depth, hidden width, dropout rate). Common search strategies include grid search (all combinations on a discrete grid), random search (often outperforms grid search for the same budget; Bergstra and Bengio, 2012), Bayesian optimization (a probabilistic surrogate model guides the search by picking configurations that maximize an acquisition function such as expected improvement), Hyperband (random search with early stopping via successive halving), BOHB (Hyperband with its random sampling replaced by Bayesian optimization), and population-based training. Tools include Optuna, Ray Tune, Weights and Biases Sweeps, and Vertex AI Vizier.
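
Random search is simple enough to sketch without a tuning library. In the example below the search space, the trial budget, and `toy_objective` (a stand-in for "train a model and return its validation loss") are all hypothetical.

```python
import math
import random

def random_search(objective, n_trials=50, seed=0):
    """Sample configurations at random and keep the one with the lowest score."""
    rng = random.Random(seed)
    best_score, best_config = None, None
    for _ in range(n_trials):
        config = {
            # Learning rates are usually sampled log-uniformly.
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "batch_size": rng.choice([32, 64, 128, 256]),
        }
        score = objective(config)
        if best_score is None or score < best_score:
            best_score, best_config = score, config
    return best_score, best_config

def toy_objective(config):
    """Stand-in for validation loss; a real objective would train a model."""
    lr_term = (math.log10(config["learning_rate"]) + 2.5) ** 2
    batch_term = abs(config["batch_size"] - 128) / 128
    return lr_term + batch_term

best_score, best_config = random_search(toy_objective)
```

Unlike a grid, every trial here tries a new learning rate, which is one informal explanation of why random search tends to do well when only a few hyperparameters really matter.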

online versus offline learning

Offline learning trains a model on a fixed snapshot of data and then deploys it; the model is static until the next retraining cycle. Online learning updates the model as new examples arrive, which is essential when the data distribution changes (a property called nonstationarity). Online inference produces predictions on demand, while offline (batch) inference scores entire datasets on a schedule. The choice between static and dynamic inference depends on freshness requirements, latency budgets, and cost.

class imbalance

When one class is much more frequent than the other, naive accuracy is misleading. Strategies include resampling (oversampling the minority class with SMOTE or ADASYN, or undersampling the majority class), class weighting (scaling the loss by inverse frequency), tuning the classification threshold, cost-sensitive learning (asymmetric costs in the loss), and anomaly detection framing when the positive class is extremely rare.
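
Class weighting by inverse frequency can be sketched as follows. The formula n / (k * count), which makes every class contribute the same total weight, is one common heuristic and is an assumption here; the label counts are invented.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count) so classes contribute equally."""
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

def weighted_mean_loss(losses, labels, weights):
    """Average per-example losses after scaling each by its class weight."""
    total = sum(weights[label] * loss for loss, label in zip(losses, labels))
    return total / len(losses)

labels = [0] * 90 + [1] * 10          # 9:1 imbalance
weights = inverse_frequency_weights(labels)
loss = weighted_mean_loss([1.0] * 100, labels, weights)
```

With these weights a mistake on a minority example costs nine times as much as a mistake on a majority example, which offsets the 9:1 imbalance.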

interpretability and responsible ML

Interpretability matters for debugging, fairness, and regulatory compliance. Common techniques include feature importance from tree ensembles (Gini, permutation, SHAP), partial dependence plots, local surrogates such as LIME (Ribeiro et al., 2016), SHAP values (Lundberg and Lee, 2017), counterfactual explanations, and attention visualization for deep models.

Algorithmic bias can creep in through skewed data, label noise, or proxy labels that correlate with protected attributes. Fairness metrics include demographic parity, equalized odds, and predictive parity; no single metric satisfies all desiderata simultaneously (Chouldechova, 2017; Kleinberg et al., 2017). Mitigation spans pre-processing (reweighting), in-processing (constraints), and post-processing (group-specific thresholds). Stability and feedback loops are also material risks. The NIST AI Risk Management Framework (2023) and EU AI Act (2024) formalize many of these requirements.

tooling and software ecosystem

Most practitioners use the Python data stack: pandas (and its DataFrame abstraction), NumPy, scikit-learn for classical models, and PyTorch, TensorFlow, or JAX for deep learning. R, Julia, and Spark MLlib are also widely used. ONNX provides a portable format for deployed models, and Hugging Face Hub hosts pretrained checkpoints.

notable textbooks and courses

The following resources are widely used and define the canon of fundamentals.

Resource	Authors / instructors	Year	Notes
The Elements of Statistical Learning (ESL)	Trevor Hastie, Robert Tibshirani, Jerome Friedman	2nd ed. 2009	Mathematical, dense, free PDF on the authors' Stanford site
An Introduction to Statistical Learning (ISL)	Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani	2nd ed. 2021	Gentler companion to ESL with R and Python labs
Pattern Recognition and Machine Learning (PRML)	Christopher M. Bishop	2006	Bayesian and probabilistic perspective
Machine Learning: A Probabilistic Perspective	Kevin Murphy	2012; expanded as Probabilistic Machine Learning, 2022 and 2023	Comprehensive modern reference
Deep Learning	Ian Goodfellow, Yoshua Bengio, Aaron Courville	2016	The standard deep learning textbook, free online
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow	Aurelien Geron	3rd ed. 2022	Practical, code-first
Mathematics for Machine Learning	Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong	2020	Linear algebra, calculus, probability for ML
Coursera Machine Learning	Andrew Ng (Stanford / DeepLearning.AI)	2011 (original); refreshed 2022	One of the most-taken online courses ever, an entry point for millions
Coursera Deep Learning Specialization	Andrew Ng	2017	Five-course sequence on neural networks
Stanford CS229 Machine Learning	Andrew Ng, Christopher Re, Tengyu Ma	ongoing	Graduate ML lectures, notes online
Stanford CS231n Convolutional Neural Networks for Visual Recognition	Fei-Fei Li, Andrej Karpathy, Justin Johnson	ongoing	Foundational deep learning for vision
Stanford CS224n Natural Language Processing with Deep Learning	Christopher Manning	ongoing	Modern NLP curriculum
MIT 6.036 Introduction to Machine Learning	MIT OCW	ongoing	Undergraduate ML
fast.ai Practical Deep Learning for Coders	Jeremy Howard, Rachel Thomas	2017 onwards	Top-down, hands-on, free
Dive into Deep Learning (D2L)	Aston Zhang, Zachary Lipton, Mu Li, Alex Smola	2020 onwards	Free interactive book with PyTorch, MXNet, TensorFlow, JAX
Reinforcement Learning: An Introduction	Richard S. Sutton, Andrew G. Barto	2nd ed. 2018	The canonical RL textbook, free PDF
Speech and Language Processing	Daniel Jurafsky, James H. Martin	3rd ed. draft, ongoing	NLP reference, free chapters online

The Google Machine Learning Glossary, on which much of the terminology in this wiki draws, is also a high-quality starting point.

index of fundamentals terms

The terms below have dedicated wiki pages in the fundamentals chapter of Machine learning terms. Where multiple notations exist (for example ReLU and Rectified Linear Unit), separate entries are linked.

references

Mitchell, Tom M. Machine Learning. McGraw-Hill, 1997.
Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome. The Elements of Statistical Learning. 2nd ed. Springer, 2009.
Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
Murphy, Kevin P. Probabilistic Machine Learning: An Introduction. MIT Press, 2022.
Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron. Deep Learning. MIT Press, 2016.
Sutton, Richard S.; Barto, Andrew G. Reinforcement Learning: An Introduction. 2nd ed. MIT Press, 2018.
Geron, Aurelien. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. 3rd ed. O'Reilly, 2022.
James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert. An Introduction to Statistical Learning. 2nd ed. Springer, 2021.
Deisenroth, Marc Peter; Faisal, A. Aldo; Ong, Cheng Soon. Mathematics for Machine Learning. Cambridge University Press, 2020.
Breiman, Leo. "Random Forests." Machine Learning 45, no. 1 (2001): 5-32.
Chen, Tianqi; Guestrin, Carlos. "XGBoost: A Scalable Tree Boosting System." KDD, 2016.
Cortes, Corinna; Vapnik, Vladimir. "Support-Vector Networks." Machine Learning 20, no. 3 (1995): 273-297.
Kingma, Diederik P.; Ba, Jimmy. "Adam: A Method for Stochastic Optimization." ICLR, 2015.
Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. "ImageNet Classification with Deep Convolutional Neural Networks." NeurIPS, 2012.
Vaswani, Ashish, et al. "Attention Is All You Need." NeurIPS, 2017.
Bergstra, James; Bengio, Yoshua. "Random Search for Hyper-Parameter Optimization." Journal of Machine Learning Research 13 (2012): 281-305.
Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. "Learning Representations by Back-Propagating Errors." Nature 323, no. 6088 (1986): 533-536.
Ribeiro, Marco Tulio; Singh, Sameer; Guestrin, Carlos. "Why Should I Trust You? Explaining the Predictions of Any Classifier." KDD, 2016.
Lundberg, Scott M.; Lee, Su-In. "A Unified Approach to Interpreting Model Predictions." NeurIPS, 2017.
Chouldechova, Alexandra. "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments." Big Data 5, no. 2 (2017): 153-163.
Kleinberg, Jon; Mullainathan, Sendhil; Raghavan, Manish. "Inherent Trade-Offs in the Fair Determination of Risk Scores." ITCS, 2017.
Google Developers. "Machine Learning Glossary." https://developers.google.com/machine-learning/glossary, accessed 2026.
National Institute of Standards and Technology. AI Risk Management Framework (AI RMF 1.0). NIST AI 100-1, 2023.
European Union. Regulation (EU) 2024/1689 (EU AI Act). Official Journal of the European Union, 2024.
