# Machine learning terms/Fundamentals

> Source: https://aiwiki.ai/wiki/machine_learning_terms_fundamentals
> Updated: 2026-07-13
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## What are the fundamental terms in machine learning?

[Machine learning](/wiki/machine_learning) (ML) is the branch of [artificial intelligence](/wiki/artificial_intelligence) concerned with building systems that learn patterns from [data](/wiki/dataset) rather than following explicitly programmed rules. The field draws on statistics, optimization theory, and computer science, and rests on a small vocabulary of foundational terms: a [model](/wiki/model) (a parameterized function family), [features](/wiki/feature) and [labels](/wiki/label) (the inputs and target outputs), a [loss function](/wiki/loss_function) (a measure of prediction error), and an optimizer such as [gradient descent](/wiki/gradient_descent) that adjusts the [parameters](/wiki/parameter) to reduce that loss. A canonical formal definition comes from Tom Mitchell (1997): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."[1]

The term "machine learning" itself was coined by IBM researcher Arthur Samuel in 1959, in his paper "Some Studies in Machine Learning Using the Game of Checkers," published in the IBM Journal of Research and Development.[25] Samuel framed the goal of the field through his checkers program, writing that "a computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program."[25] The broader paraphrase often attributed to him, that machine learning is the field of study that gives computers the ability to learn without being explicitly programmed, captures the same idea but is a later reformulation rather than a verbatim line from the 1959 paper.

This page collects the foundational vocabulary that underpins almost every ML system, from a simple [linear regression](/wiki/linear_regression) baseline to a billion-parameter [neural network](/wiki/neural_network). Each term is linked to its own dedicated wiki article. The fundamentals chapter is the entry point for readers working through the [Machine learning terms](/wiki/machine_learning_terms) glossary, which is grouped by topic into chapters such as fundamentals, [neural networks and deep learning](/wiki/machine_learning_terms_neural_networks), [language evaluation](/wiki/machine_learning_terms_language_evaluation), and [advanced topics](/wiki/machine_learning_terms_advanced).

A practical ML system has three ingredients: a [model](/wiki/model) (a parameterized function family), a [loss function](/wiki/loss_function) (a measure of how far predictions are from [ground truth](/wiki/ground_truth)), and an optimization procedure (a way to adjust the [parameters](/wiki/parameter) so that the loss decreases). The remainder of this article walks through the categories, components, and metrics that make this loop work in practice.

## What are the main categories of machine learning?

ML problems are usually grouped by the kind of supervision signal the [training set](/wiki/training_set) provides. The five main categories are summarized below.

| Category | Supervision signal | Typical tasks | Example algorithms |
|---|---|---|---|
| [Supervised learning](/wiki/supervised_machine_learning) | Each [example](/wiki/example) has an input and a [label](/wiki/label) | [Classification](/wiki/classification_model), [regression](/wiki/regression_model) | [Linear regression](/wiki/linear_regression), [logistic regression](/wiki/logistic_regression), random forest, gradient boosted trees |
| [Unsupervised learning](/wiki/unsupervised_machine_learning) | Inputs only, no labels | Clustering, density estimation, dimensionality reduction | k-means, Gaussian mixture, PCA, autoencoders |
| Semi-supervised learning | A small labeled set plus a much larger [unlabeled](/wiki/unlabeled_example) set | Classification with scarce labels | Label propagation, pseudo-labeling, consistency regularization |
| Self-supervised learning | Labels are constructed automatically from the input itself | [Pretraining](/wiki/training) language and vision models | Masked language modeling (BERT), next-token prediction (GPT), contrastive learning (SimCLR) |
| Reinforcement learning | A reward signal received after taking actions in an environment | Game playing, robotics, recommendation | Q-learning, policy gradients, actor-critic, PPO |

### What is supervised learning?

In [supervised machine learning](/wiki/supervised_machine_learning) the dataset consists of pairs of inputs and target outputs. The learner fits a function that maps inputs to outputs and is evaluated on held-out examples. Most production ML systems, including spam filters, fraud detection, ad ranking, and medical image triage, are supervised.

### What is unsupervised learning?

[Unsupervised machine learning](/wiki/unsupervised_machine_learning) discovers structure in unlabeled data. Clustering groups similar items, dimensionality reduction projects high-dimensional data into a lower-dimensional space for visualization or compression, and density estimation models the probability distribution of the inputs.

### What is the difference between semi-supervised and self-supervised learning?

Semi-supervised learning combines a small labeled set with a much larger unlabeled set, which is useful when labels are expensive but raw data is plentiful. Self-supervised learning needs no human-provided labels during pretraining, because the targets are derived from the data itself. It has become the dominant paradigm for foundation models: the model is pretrained on large unlabeled corpora using a pretext task (predicting masked tokens, the next token, or the relative position of image patches) and is then fine-tuned on smaller supervised datasets.

### What is reinforcement learning?

Reinforcement learning (RL) trains an agent to take actions in an environment so as to maximize cumulative reward. The training signal is sparse and delayed, which makes the credit assignment problem central.[6] Notable RL systems include DeepMind's AlphaGo and AlphaZero, OpenAI Five, and the reinforcement learning from human feedback (RLHF) stage used to align large language models.

## What is the difference between regression and classification?

Within supervised learning, the two main task types are regression and classification.

| Task | Output | Typical loss | Examples |
|---|---|---|---|
| [Regression](/wiki/regression_model) | A continuous numeric value | [Squared loss](/wiki/squared_loss) (MSE), [L1 loss](/wiki/l1_loss) | House price, temperature, time-to-failure |
| [Binary classification](/wiki/binary_classification) | One of two classes (positive or [negative class](/wiki/negative_class)) | [Log loss](/wiki/log_loss) | Spam detection, loan default, click prediction |
| [Multi-class classification](/wiki/multi-class_classification) | One of K mutually exclusive [classes](/wiki/class) | Categorical cross-entropy | Digit recognition, image labeling |
| Multi-label classification | A subset of K classes (any number can be active) | Binary cross-entropy per label | Tagging, content moderation |

Classification models often output [probabilities](/wiki/prediction) via a [softmax](/wiki/softmax) (multi-class) or [sigmoid](/wiki/sigmoid_function) (binary) head, and a [classification threshold](/wiki/classification_threshold) converts the probability into a discrete decision.
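
A minimal sketch of the two output heads and the thresholding step, in plain Python. The function names and the example scores are illustrative assumptions rather than the API of any particular library.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn K scores into K probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(probability, threshold=0.5):
    """Convert a positive-class probability into a 0/1 decision."""
    return 1 if probability >= threshold else 0

p_spam = sigmoid(1.2)                     # binary head
class_probs = softmax([2.0, 1.0, 0.1])    # multi-class head
predicted_class = class_probs.index(max(class_probs))
```

Raising the threshold makes the classifier more conservative: a probability accepted as positive at 0.5 can be rejected at 0.9.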

## What are bias, variance, overfitting, and generalization?

The central goal of ML is [generalization](/wiki/generalization): producing a model that performs well on data it has not seen. Two failure modes block this goal.

- [Underfitting](/wiki/underfitting) occurs when the model is too simple to capture the structure in the data. Both training error and test error are high. Remedies include using a richer model, adding features, or training longer.
- [Overfitting](/wiki/overfitting) occurs when the model memorizes idiosyncrasies of the training set. Training error is low but test error is high. Remedies include collecting more data, simplifying the model, or applying [regularization](/wiki/regularization).

### What is the bias-variance trade-off?

The expected squared error of a regression estimator can be decomposed into three terms: a squared bias term (how far the average prediction is from the truth), a variance term (how much predictions fluctuate across training sets), and irreducible noise.[2] High-bias models (like a linear model on a curved relationship) underfit; high-variance models (like a deep tree on a small dataset) overfit. Choosing model capacity and regularization is largely the art of balancing these two sources of error.
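
Written out, the decomposition takes the following form. This is a sketch under stated assumptions that are not spelled out in the text above: the target is generated as $$y = f(x) + \varepsilon$$ with zero-mean noise of variance $$\sigma^2$$, and the expectation is taken over both the noise and the random draw of the training set used to produce the prediction $$\hat{y}$$ at a fixed input $$x$$.

$$\mathbb{E}\big[(y - \hat{y})^2\big] = \underbrace{\big(\mathbb{E}[\hat{y}] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat{y} - \mathbb{E}[\hat{y}])^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$$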

### What does a generalization curve show?

A [generalization curve](/wiki/generalization_curve) plots [training loss](/wiki/training_loss) and [validation loss](/wiki/validation_loss) against training time or model capacity. As capacity or training time grows, training loss typically keeps falling while validation loss falls and then rises again, forming a U shape whose minimum suggests the right level of capacity or the right number of training epochs. Halting training near the minimum of the validation loss is the idea behind [early stopping](/wiki/early_stopping).
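
One common way to act on such a curve is patience-based early stopping. The sketch below assumes a list of per-epoch validation losses is already available; the `patience` parameter and the function name are illustrative choices, not a fixed convention.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve on its best value
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Patience never ran out: training used every epoch.
    return best_epoch, len(val_losses) - 1

# A U-shaped validation curve: improves until epoch 3, then degrades.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.60]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
```

In practice the parameters saved at `best_epoch` are the ones kept, not those from the epoch at which training stopped.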

## What are training, validation, and test splits?

To estimate generalization honestly, the dataset is partitioned into disjoint subsets.

| Split | Purpose | Typical share |
|---|---|---|
| [Training set](/wiki/training_set) | Fit the model parameters | 60 to 80 percent |
| [Validation set](/wiki/validation_set) | Tune [hyperparameters](/wiki/hyperparameter), choose architectures, decide when to stop | 10 to 20 percent |
| Test set | Final unbiased evaluation, used at most once | 10 to 20 percent |

Splits should be drawn so that examples are [independently and identically distributed](/wiki/independently_and_identically_distributed_i_i_d) (i.i.d.). For time-series problems use a temporal split; for grouped data (patients, users) use a group-aware split to avoid leakage.
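
The sketch below illustrates a random three-way split and a group-aware split in plain Python. The record layout (`patient`, `visit`) and the 70/15/15 fractions are assumptions chosen for the example.

```python
import random

def random_split(examples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle once, then cut into disjoint train / validation / test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

def group_split(examples, group_of, test_frac=0.2, seed=0):
    """Hold out whole groups so no group appears on both sides of the split."""
    groups = sorted({group_of(e) for e in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# Hypothetical data: 10 patients with 10 visits each.
records = [{"patient": i % 10, "visit": i} for i in range(100)]
train, val, test = random_split(records)
g_train, g_test = group_split(records, group_of=lambda r: r["patient"])
```

With the random split, visits from the same patient can land in both training and test data; the group-aware split rules that out by construction.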

### What is cross-validation?

When data is scarce, k-fold cross-validation gives a more stable estimate of validation performance. The data available for training is split into k folds; the model is trained k times, each time holding out one fold as the validation set, and the k validation scores are averaged. Common choices are 5-fold and 10-fold. Stratified k-fold preserves class proportions in each fold and is preferred for [class-imbalanced datasets](/wiki/class-imbalanced_dataset). Leave-one-out cross-validation (LOOCV) is the limiting case where k equals the number of examples; its estimate is nearly unbiased, but it is expensive to compute and can have high variance.[8]
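
The k-fold procedure can be sketched as follows. The helper names are hypothetical, the folds are contiguous (so the data is assumed to be shuffled beforehand), and the "model" is a deliberately trivial mean predictor.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_idx, val_idx) pairs; each index is held out exactly once."""
    indices = list(range(n_examples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_val_score(fit, score, xs, ys, k=5):
    """Average the validation score over k train/validate rounds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return sum(scores) / len(scores)

def fit_mean(xs, ys):
    """Toy model: always predict the mean training target."""
    return sum(ys) / len(ys)

def mse(model, xs, ys):
    """Mean squared error of the constant prediction `model`."""
    return sum((y - model) ** 2 for y in ys) / len(ys)

folds = list(k_fold_indices(10, 5))
avg_mse = cross_val_score(fit_mean, mse, list(range(10)), [3.0] * 10, k=5)
```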

## What is a loss function?

A [loss function](/wiki/loss_function) measures the cost of a prediction. Training minimizes the average loss over the training set, sometimes called the empirical risk.[3]

| Loss | Formula (per example) | Typical use |
|---|---|---|
| [Squared loss](/wiki/squared_loss) (MSE) | $$(y - \hat{y})^2$$ | Regression with Gaussian noise |
| [L1 loss](/wiki/l1_loss) (MAE) | $$\lvert y - \hat{y} \rvert$$ | Regression robust to outliers |
| Huber loss | Quadratic for small errors, linear for large | Regression with heavy-tailed noise |
| Binary [log loss](/wiki/log_loss) | $$-[y \log(p) + (1-y) \log(1-p)]$$ | Binary classification |
| Categorical cross-entropy | $$-\sum_k y_k \log p_k$$ | Multi-class classification |
| Hinge loss | $$\max(0, 1 - y \cdot f(x))$$ | Support vector machines |
| Kullback-Leibler divergence | $$\sum p \log(p/q)$$ | Distribution matching, variational inference |
| Contrastive / triplet loss | Margin-based pair or triplet objectives | Metric learning, embeddings |
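
Several of the per-example losses in the table translate directly into code. This is a minimal sketch: the Huber loss uses the common convention with a factor of one half on the quadratic branch and a hypothetical default `delta=1.0`, and the hinge loss assumes labels in {-1, +1}.

```python
import math

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def l1_loss(y, y_hat):
    return abs(y - y_hat)

def huber_loss(y, y_hat, delta=1.0):
    """Quadratic for errors up to delta, linear beyond it."""
    err = abs(y - y_hat)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)

def log_loss(y, p, eps=1e-12):
    """Binary log loss for label y in {0, 1} and predicted probability p."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def hinge_loss(y, score):
    """Hinge loss for label y in {-1, +1} and raw model score f(x)."""
    return max(0.0, 1.0 - y * score)
```

For an error of 3, the squared loss is 9 while the Huber loss with `delta=1.0` is 2.5, which is why the latter is less sensitive to outliers.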

Alongside the per-example loss, training often adds a [regularization](/wiki/regularization) term to the total objective, weighted by a [lambda](/wiki/lambda) coefficient; lambda is itself a hyperparameter, typically tuned by [validation](/wiki/validation).

## How do optimization algorithms train a model?

Once the loss is defined, an optimizer iteratively updates the model parameters to reduce it. Almost all modern ML uses first-order gradient methods.

### What is gradient descent?

Vanilla [gradient descent](/wiki/gradient_descent) computes the gradient of the loss over the entire training set and takes a step opposite to that gradient, scaled by a [learning rate](/wiki/learning_rate). For large datasets this is impractical because each step requires a full pass over the data.
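
As an illustration, the sketch below runs full-batch gradient descent on a one-feature linear model with mean squared error. The data, learning rate, and step count are assumptions picked so the example converges quickly.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Every step touches the ENTIRE training set.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step opposite to the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated by y = 2x + 1
w, b = gradient_descent(xs, ys)
```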

### What is the difference between stochastic and mini-batch gradient descent?

[Stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) (SGD) approximates the full-batch gradient with the gradient computed on a single example. [Mini-batch](/wiki/mini-batch) SGD, the workhorse of deep learning, computes the gradient on a small [batch](/wiki/batch) of examples (the [batch size](/wiki/batch_size) is a key hyperparameter, typically between 32 and 8,192). Mini-batches give a lower-variance estimate of the true gradient than a single example while still allowing many updates per [epoch](/wiki/epoch).[5]
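
A minimal mini-batch SGD loop for a one-feature linear model is sketched below. The batch size, learning rate, epoch count, and noiseless synthetic data are all assumptions for illustration; setting `batch_size=1` gives single-example SGD.

```python
import random

def minibatch_sgd(xs, ys, batch_size=4, learning_rate=0.1, epochs=200, seed=0):
    """Fit y = w * x + b with mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # new batch composition every epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Gradient estimated from this batch only.
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            grad_w = 2.0 / len(batch) * sum(e * xs[i] for e, i in zip(errors, batch))
            grad_b = 2.0 / len(batch) * sum(errors)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

xs = [i / 10 for i in range(20)]   # 0.0, 0.1, ..., 1.9
ys = [3.0 * x - 1.0 for x in xs]   # noiseless targets from y = 3x - 1
w, b = minibatch_sgd(xs, ys)
```

With 20 examples and a batch size of 4, each epoch performs five parameter updates instead of the single update a full-batch step would give.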

### What are momentum and adaptive optimizers?

Momentum keeps an exponentially weighted average of past gradients, accelerating descent in consistent directions and damping oscillations. Nesterov accelerated gradient (NAG) computes the gradient at a look-ahead position. AdaGrad adapts the learning rate per parameter using the sum of squared past gradients, which suits sparse features but steadily shrinks the effective rate toward zero. RMSProp replaces that cumulative sum with an exponentially decayed moving average. Adam (Adaptive Moment Estimation), introduced by Kingma and Ba in 2014, combines momentum and RMSProp with bias-corrected moments and is the default for many deep learning tasks.[13] AdamW decouples weight decay from the gradient step and is preferred for transformers. Learning rate schedules (step decay, cosine decay, warmup followed by decay) are used together with optimizers to improve [convergence](/wiki/convergence).
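
The update rules for momentum and Adam can be sketched for a single scalar parameter as follows. The momentum step uses the exponentially-weighted-average form described above (other formulations omit the `1 - beta` factor), the Adam defaults follow the commonly quoted values, and the quadratic test function is an assumption for illustration.

```python
import math

def momentum_step(param, grad, velocity, learning_rate=0.01, beta=0.9):
    """One SGD-with-momentum update; velocity is an EWA of past gradients."""
    velocity = beta * velocity + (1 - beta) * grad
    param = param - learning_rate * velocity
    return param, velocity

def adam_step(param, grad, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 3001):
    grad = 2.0 * (x - 3.0)
    x, m, v = adam_step(x, grad, m, v, t, learning_rate=0.01)
```

Because of the bias correction, the very first Adam step has magnitude close to the learning rate regardless of the gradient's scale.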

## What are the main regularization techniques?

Regularization adds inductive bias toward simpler functions to fight overfitting.

| Technique | Effect | Notes |
|---|---|---|
| [L1 regularization](/wiki/l1_regularization) | Sum of absolute weights, encourages sparsity | Used in lasso regression, can drive some weights to exactly zero |
| [L2 regularization](/wiki/l2_regularization) | Sum of squared weights, shrinks all weights toward zero | Used in ridge regression and weight decay |
| [L0 regularization](/wiki/l0_regularization) | Penalizes the count of non-zero weights | Combinatorial, usually approximated |
| Elastic net | Mixture of L1 and L2 | Compromise between sparsity and stability |
| Dropout | Randomly zeros activations during training | Acts as model averaging across sub-networks |
| [Early stopping](/wiki/early_stopping) | Stops training when validation loss stops improving | Implicit regularization, near-free |
| Data augmentation | Generates synthetic training examples | Standard in vision (flips, crops, color jitter), audio (pitch, time stretch), and text (back-translation) |
| Batch normalization | Normalizes activations within a batch | Often acts as regularization in addition to its optimization benefits |
| Label smoothing | Softens hard labels toward a uniform distribution | Reduces overconfidence in classification |
| Gradient [clipping](/wiki/clipping) | Bounds gradient norms | Stabilizes training when gradients explode |

The strength of L1 or L2 regularization is controlled by the [regularization rate](/wiki/regularization_rate) (lambda), tuned on the validation set.
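
The penalty terms themselves are simple to compute. In the sketch below the function names, the `l1_ratio` mixing parameter, and the example weights are illustrative assumptions.

```python
def l1_penalty(weights):
    """Sum of absolute weights (lasso-style penalty)."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of squared weights (ridge-style penalty)."""
    return sum(w * w for w in weights)

def elastic_net_penalty(weights, l1_ratio=0.5):
    """Convex mixture of the L1 and L2 penalties."""
    return l1_ratio * l1_penalty(weights) + (1 - l1_ratio) * l2_penalty(weights)

def regularized_objective(mean_loss, weights, lam, penalty=l2_penalty):
    """Total objective: data loss plus lambda times the chosen penalty."""
    return mean_loss + lam * penalty(weights)

weights = [0.5, -2.0, 0.0]
objective = regularized_objective(mean_loss=1.0, weights=weights, lam=0.1)
```

Note how the squared penalty is dominated by the largest weight (-2.0 contributes 4.0 of the 4.25 total), whereas the absolute penalty treats magnitudes linearly.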

## What is feature engineering?

Feature engineering converts raw data into the [feature vectors](/wiki/feature_vector) consumed by the model. Even with deep learning, careful preparation of inputs often makes a measurable difference.[7]

### Why scale and normalize features?

[Numerical features](/wiki/numerical_data) typically require scaling so that no single feature dominates distance and gradient computations.

| Method | Formula | When to use |
|---|---|---|
| Min-max scaling | $$(x - \min) / (\max - \min)$$ | Bounded outputs, image pixels |
| [Z-score normalization](/wiki/z-score_normalization) | $$(x - \text{mean}) / \text{std}$$ | Default choice for most numeric features |
| Robust scaling | $$(x - \text{median}) / \text{IQR}$$ | Outlier-prone features |
| Log transform | $$\log(1 + x)$$ | Skewed positive values, counts |
| [Bucketing](/wiki/bucketing) | Map continuous values into discrete bins | Capture non-linearity, simplify interactions |
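
The first four rows of the table can be sketched with the Python standard library. The quartiles come from `statistics.quantiles` with its default method, which is one of several quartile conventions, and the example values are made up. None of the functions guard against constant features, where the denominator would be zero.

```python
import math
import statistics

def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

def robust_scale(values):
    q1, median, q3 = statistics.quantiles(values, n=4)
    return [(v - median) / (q3 - q1) for v in values]

def log_transform(values):
    return [math.log1p(v) for v in values]  # log(1 + x)

incomes = [20.0, 30.0, 40.0, 50.0, 1000.0]  # one extreme outlier
scaled_minmax = min_max_scale(incomes)
scaled_robust = robust_scale(incomes)
```

With min-max scaling the outlier squeezes the four ordinary values into a narrow band near zero, which is the situation robust scaling and the log transform are meant to handle.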

### How are categorical features encoded?

[Categorical features](/wiki/categorical_data) must be turned into numbers.

- [One-hot encoding](/wiki/one-hot_encoding) creates a 0/1 indicator per category, producing a [sparse representation](/wiki/sparse_representation) for high-cardinality fields.
- Ordinal encoding assigns integers when categories have a natural order.
- Target encoding replaces a category with a statistic of the target (mean, median) computed within cross-validation folds to avoid leakage.
- Hashing trick maps categories into a fixed-size space, useful at very high cardinalities.
- Learned embeddings, used by an [embedding layer](/wiki/embedding_layer) in a neural network, map each category to a low-dimensional dense vector.
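
Three of these encodings are sketched below in plain Python. The vocabularies, the bucket count, and the choice of SHA-256 as the hash function are assumptions made for the example.

```python
import hashlib

def one_hot(value, vocabulary):
    """0/1 indicator vector with a single 1 at the value's position."""
    return [1 if value == category else 0 for category in vocabulary]

def ordinal(value, ordered_categories):
    """Integer rank of the value in an explicitly ordered category list."""
    return ordered_categories.index(value)

def hash_bucket(value, num_buckets=8):
    """Hashing trick: a stable bucket id without storing a vocabulary."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

colors = ["red", "green", "blue"]
sizes = ["small", "medium", "large"]

encoded_color = one_hot("green", colors)
encoded_size = ordinal("large", sizes)
encoded_user = hash_bucket("user_12345")
```

A cryptographic hash is used here only because it is deterministic across runs; Python's built-in `hash` is salted per process for strings and would not give stable bucket ids.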

### What are feature selection and feature crossing?

[Feature selection](/wiki/feature_engineering) removes irrelevant or redundant features to reduce variance and training cost. Methods include filter (mutual information, chi-squared), wrapper (recursive feature elimination), and embedded approaches (L1 regularization, tree feature importance). [Feature crosses](/wiki/feature_cross) combine two or more features into a synthetic feature that captures their interaction, for example combining latitude and longitude into a grid cell.
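
The latitude and longitude cross can be sketched as a bucketing step followed by concatenation. The one-degree cell size and the id format are hypothetical choices.

```python
import math

def grid_cell(latitude, longitude, cell_degrees=1.0):
    """Cross latitude and longitude into a single categorical grid-cell id."""
    row = math.floor(latitude / cell_degrees)
    col = math.floor(longitude / cell_degrees)
    return f"cell_{row}_{col}"

cell_a = grid_cell(37.77, -122.42)
cell_b = grid_cell(37.10, -122.90)   # falls in the same one-degree cell
cell_c = grid_cell(38.20, -122.42)   # one cell further north
```

The resulting id is a categorical feature, so it would still need one of the categorical encodings before being fed to a model.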

## What are linear models?

Linear models are the foundation of supervised learning and remain strong baselines.

- [Linear regression](/wiki/linear_regression) predicts a continuous target as a [weighted sum](/wiki/weighted_sum) of features. The closed-form solution uses the normal equation; large problems use SGD.[9] With [L2](/wiki/l2_regularization) regularization it becomes ridge regression; with [L1](/wiki/l1_regularization) it becomes lasso.
- [Logistic regression](/wiki/logistic_regression) maps a linear combination of features through the [sigmoid](/wiki/sigmoid_function) function to produce a probability for [binary classification](/wiki/binary_classification). Despite its name it is a classifier. The multinomial extension uses [softmax](/wiki/softmax) and is sometimes called softmax regression. The [log-odds](/wiki/log-odds) (logit) is the inverse of the sigmoid.
- Generalized linear models (GLMs) extend linear regression to other distributions, for example Poisson regression for count data and gamma regression for positive continuous responses.
- Linear discriminant analysis (LDA) fits class-conditional Gaussians with shared covariance and produces linear decision boundaries.
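
Two of the ideas above can be sketched in a few lines: the closed-form least-squares fit, shown here for the single-feature special case of the normal equation, and the sigmoid and logit pair used by logistic regression. Function names and data are illustrative assumptions.

```python
import math

def fit_simple_linear_regression(xs, ys):
    """Closed-form least squares for one feature: slope = cov(x, y) / var(x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds; the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))

def predict_proba(features, weights, bias):
    """Logistic regression: sigmoid of the weighted sum plus bias."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

slope, intercept = fit_simple_linear_regression([0.0, 1.0, 2.0, 3.0],
                                                [1.0, 3.0, 5.0, 7.0])
```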

## What are tree-based models?

Decision trees split the feature space along axis-aligned thresholds and predict the majority class or mean target in each leaf. They handle mixed feature types, are invariant to monotone transformations of individual features, and are easy to inspect.

- Random forests (Breiman, 2001) train many trees on bootstrap samples with feature subsampling and average their predictions. They reduce variance with little tuning.[10]
- Gradient boosted decision trees (GBDT) fit each new tree to the negative gradient of the loss with respect to the current ensemble's predictions (the pseudo-residuals, which for squared loss are the ordinary residuals). Modern implementations include XGBoost (Chen and Guestrin, 2016)[11], LightGBM (Microsoft, 2017), and CatBoost (Yandex, 2017). GBDT remains the dominant approach on tabular data.
- Isolation forests are an unsupervised variant used for anomaly detection.

Tree models are non-linear and can capture interactions automatically, which is why they often outperform linear models on tabular tasks.
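
A toy version of gradient boosting for squared loss is sketched below, using depth-one trees (stumps) on a single feature. It is a teaching sketch under those simplifying assumptions, not how XGBoost, LightGBM, or CatBoost are implemented; the data, tree count, and learning rate are made up.

```python
def fit_stump(xs, targets):
    """Best single-threshold split on one feature under squared error."""
    best = None
    # Excluding the largest value keeps both sides of the split non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Boosting for squared loss: each stump is fit to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)
    return predict

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]   # a step function with small noise
model = gradient_boost(xs, ys)
```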

## What are distance-based and kernel models?

- k-nearest neighbors (kNN) is a non-parametric method that classifies an example by majority vote of its k nearest neighbors in feature space, with distance computed under a chosen metric (Euclidean, cosine, Mahalanobis). kNN has no training phase but expensive prediction; structures like KD-trees or approximate nearest neighbor (ANN) indexes accelerate it.
- Support vector machines (SVMs), introduced by Cortes and Vapnik (1995), find the maximum-margin hyperplane separating classes.[12] The kernel trick allows SVMs to fit non-linear boundaries by implicitly mapping inputs into a higher-dimensional space using a kernel function such as the radial basis function (RBF) or polynomial kernel. SVMs use the hinge loss.
- Kernel ridge regression, Gaussian processes, and kernel PCA generalize the kernel idea beyond classification.
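
A brute-force kNN classifier fits in a few lines. The sketch assumes Euclidean distance and a tiny made-up dataset, and it scans every training point per query, which is the cost that KD-trees and ANN indexes are designed to reduce.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

near_origin = knn_predict(points, labels, (0.5, 0.5))
near_corner = knn_predict(points, labels, (5.5, 5.5))
```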

## What are probabilistic models?

Probabilistic models represent uncertainty through joint or conditional distributions. Naive Bayes assumes feature independence given the class and applies Bayes' rule; despite the strong assumption it is fast and effective for text classification with bag-of-words features, and it comes in Multinomial, Bernoulli, and Gaussian variants. Bayesian networks are directed acyclic graphs whose nodes are random variables and whose edges encode conditional dependence; they were popularized by Judea Pearl in the 1980s. Hidden Markov models (HMMs) and conditional random fields (CRFs) model sequence data and were workhorses for speech and NLP before neural networks took over. Bayesian linear regression, Gaussian processes, and variational autoencoders (VAEs) bring Bayesian principles to modern ML.[4]
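
A small multinomial Naive Bayes text classifier is sketched below. The add-one (Laplace) smoothing, the handling of unseen words, and the four-document corpus are assumptions added for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in zip(documents, labels):
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def predict_naive_bayes(model, words):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (class, word) pairs above zero.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [["win", "money", "now"], ["cheap", "money"],
        ["meeting", "at", "noon"], ["lunch", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(docs, labels)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.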

## What is a neural network?

A [neural network](/wiki/neural_network) stacks layers of [neurons](/wiki/neuron). Each neuron computes a [weighted sum](/wiki/weighted_sum) of its inputs plus a [bias](/wiki/bias), then applies a non-linear [activation function](/wiki/activation_function) such as the [Rectified Linear Unit](/wiki/rectified_linear_unit_relu) (ReLU), [sigmoid](/wiki/sigmoid_function), or hyperbolic tangent. A network has an [input layer](/wiki/input_layer), one or more [hidden layers](/wiki/hidden_layer), and an [output layer](/wiki/output_layer); the number of hidden layers gives the model's [depth](/wiki/depth) and a network with more than one hidden layer is called a [deep model](/wiki/deep_model).[5] Training uses [backpropagation](/wiki/backpropagation), which computes gradients by applying the chain rule from the output back through the layers.[17] The fundamentals chapter introduces these concepts, while the dedicated [neural networks and deep learning](/wiki/machine_learning_terms_neural_networks) chapter covers convolutional[14], recurrent, and transformer architectures[15] in more depth.
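
The forward pass of such a network can be sketched in plain Python. The weights below are set by hand (not learned) so that a network with one hidden ReLU layer computes XOR; the layer representation and function names are illustrative assumptions, and backpropagation is not shown.

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def dense_layer(inputs, weights, biases, activation):
    """Each neuron: activation(weighted sum of inputs + bias).

    weights[j] holds the incoming weights of neuron j.
    """
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Feed the inputs through each (weights, biases, activation) layer in turn."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = dense_layer(activations, weights, biases, activation)
    return activations

# Hand-set weights: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1), out = h1 - 2 * h2.
xor_network = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0], relu),   # hidden layer
    ([[1.0, -2.0]], [0.0], identity),                # output layer
]

outputs = {pair: forward(list(pair), xor_network)[0]
           for pair in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]}
```

XOR is not linearly separable, so no single-layer linear model can reproduce these outputs; the non-linear activation in the hidden layer is what makes it possible.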

## How are machine learning models evaluated?

The choice of metric depends on the task and on what kinds of errors matter most.

### What metrics are used for classification?

| Metric | Definition | Notes |
|---|---|---|
| [Accuracy](/wiki/accuracy) | $$(\text{TP} + \text{TN}) / \text{total}$$ | Misleading on [imbalanced data](/wiki/class-imbalanced_dataset) |
| Precision | [TP](/wiki/true_positive_tp) / (TP + [FP](/wiki/false_positive_fp)) | Fraction of predicted positives that are truly positive |
| Recall (sensitivity, [TPR](/wiki/true_positive_rate_tpr)) | TP / (TP + [FN](/wiki/false_negative_fn)) | Fraction of real positives that are recovered |
| Specificity | [TN](/wiki/true_negative_tn) / (TN + FP) | Fraction of negatives correctly rejected |
| F1 score | Harmonic mean of precision and recall | Balances the two when one class is rare |
| F-beta | Weighted harmonic mean | Use beta>1 to weight recall more, beta<1 to weight precision |
| [False positive rate](/wiki/false_positive_rate_fpr) | $$\text{FP} / (\text{FP} + \text{TN})$$ | Used in [ROC curves](/wiki/roc_receiver_operating_characteristic_curve) |
| [AUC](/wiki/auc_area_under_the_curve) (ROC AUC) | Area under the [ROC](/wiki/roc_receiver_operating_characteristic_curve) curve | Probability a random positive ranks above a random negative |
| PR AUC | Area under the precision-recall curve | Often more informative than ROC AUC under heavy class imbalance |
| [Log loss](/wiki/log_loss) | Average negative log probability of the true class | Rewards calibrated probabilities |
| Brier score | Mean squared error between predicted probability and outcome | Used for calibration assessment |

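A [confusion matrix](/wiki/confusion_matrix) is the standard tabulation of TP, FP, TN, and FN counts at a given classification threshold. The count-based metrics above (accuracy, precision, recall, specificity, F1, F-beta, and false positive rate) are computed directly from it, whereas AUC, log loss, and the Brier score depend on the predicted scores or probabilities themselves.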
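
The count-based metrics follow directly from the four confusion-matrix cells, as this sketch shows. The labels and predictions are made up, and returning 0.0 when a denominator is zero is one possible convention among several.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def classification_metrics(tp, fp, tn, fn):
    def ratio(num, den):
        return num / den if den else 0.0  # convention: 0.0 when undefined

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "false_positive_rate": ratio(fp, fp + tn),
        "f1": ratio(2 * precision * recall, precision + recall),
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
```

Here accuracy is 0.8 while precision and recall are both two thirds, a small instance of accuracy looking better than the positive-class metrics on imbalanced data.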

### What metrics are used for regression?

| Metric | Formula | Notes |
|---|---|---|
| Mean absolute error (MAE) | $$\text{mean}(\lvert y - \hat{y} \rvert)$$ | Same units as the target, robust to outliers |
| Mean squared error (MSE) | $$\text{mean}((y - \hat{y})^2)$$ | Penalizes large errors more |
| [Root mean squared error](/wiki/root_mean_squared_error_rmse) (RMSE) | $$\sqrt{\text{MSE}}$$ | Same units as the target |
| R-squared | $$1 - \text{SS}_{\text{res}} / \text{SS}_{\text{tot}}$$ | Fraction of variance explained, can be negative |
| MAPE | $$\text{mean}(\lvert (y - \hat{y}) / y \rvert)$$ | Scale-free, undefined when y is zero |
| Quantile loss | Pinball loss | Used to fit conditional quantile estimators |
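
The first four rows of the table can be computed together, as in this sketch with made-up targets and predictions.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R-squared for paired targets and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return {
        "mae": mae,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r2": 1.0 - ss_res / ss_tot,  # undefined if all targets are equal
    }

scores = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
```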

### How are ranking quality and probability calibration measured?

Ranking tasks (search, recommendation) use Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (NDCG). Calibration plots and [Expected Calibration Error](/wiki/expected_calibration_error) (ECE) check whether predicted probabilities match observed frequencies.
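
Two of the ranking metrics are sketched below. Each query is represented as a list of relevance grades in ranked order, and the DCG uses the linear-gain form (some definitions use an exponential gain instead); both choices are assumptions of this example.

```python
import math

def reciprocal_rank(relevances):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    return sum(reciprocal_rank(q) for q in queries) / len(queries)

def dcg(relevances, k=None):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg(relevances, k=None):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

mrr = mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]])
perfect = ndcg([3, 2, 1])
reversed_order = ndcg([1, 2, 3])
```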

## What are the stages of a machine learning pipeline?

A real ML system is more than a model. The end-to-end pipeline typically includes the following stages.

| Stage | Activities |
|---|---|
| Data | Collection, cleaning, deduplication, labeling, [proxy labels](/wiki/proxy_labels), [rater](/wiki/rater) management, schema validation |
| Feature engineering | Transformations, [scaling](/wiki/normalization), [encoding](/wiki/one-hot_encoding), [feature selection](/wiki/feature_engineering), [synthetic features](/wiki/synthetic_feature) |
| Training | Choosing model, loss, optimizer, hyperparameters, distributed training |
| Validation | Cross-validation, hyperparameter search (grid, random, Bayesian, Hyperband) |
| Evaluation | Held-out [test](/wiki/test_loss) metrics, slice-based analysis, fairness audits |
| Deployment | [Online inference](/wiki/online_inference) (low latency) versus [offline](/wiki/offline_inference) batch scoring, [static](/wiki/static_inference) versus [dynamic](/wiki/dynamic_model) models, A/B testing, canary releases |
| Monitoring | Drift detection, [training-serving skew](/wiki/training-serving_skew), [feedback loops](/wiki/feedback_loop), [nonstationarity](/wiki/nonstationarity), retraining cadence, alerting |
| Governance | [Interpretability](/wiki/interpretability), [bias and fairness](/wiki/bias_ethics_fairness) audits, lineage, model cards, documentation |

MLOps is the engineering discipline that productionizes this loop, with tools such as MLflow, Kubeflow, Vertex AI, SageMaker, and Tecton handling experiment tracking, feature stores, and continuous training.

## How are hyperparameters tuned?

[Hyperparameters](/wiki/hyperparameter) are configuration values set before training (learning rate, batch size, regularization strength, depth, hidden width, dropout rate). Common search strategies include grid search (all combinations on a discrete grid), random search (often outperforms grid search for the same budget; Bergstra and Bengio, 2012)[16], Bayesian optimization (a probabilistic surrogate that picks configurations maximizing expected improvement), Hyperband (random search with early stopping via successive halving), BOHB (Hyperband with configurations proposed by Bayesian optimization), and population-based training. Tools include Optuna, Ray Tune, Weights and Biases Sweeps, and Vertex AI Vizier.
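
Grid search and random search can be sketched with the standard library alone. The `validation_loss` function below is a hypothetical stand-in for "train a model with this configuration and return its validation loss", and the search ranges are made up.

```python
import itertools
import math
import random

def validation_loss(config):
    """Hypothetical objective with its minimum at lr = 0.01, dropout = 0.3."""
    return ((math.log10(config["learning_rate"]) + 2.0) ** 2
            + (config["dropout"] - 0.3) ** 2)

def grid_search(objective, grid):
    """Evaluate every combination on a discrete grid and keep the best."""
    names = list(grid)
    candidates = (dict(zip(names, values))
                  for values in itertools.product(*(grid[name] for name in names)))
    return min(candidates, key=objective)

def random_search(objective, sample, n_trials=9, seed=0):
    """Evaluate n_trials independently sampled configurations."""
    rng = random.Random(seed)
    return min((sample(rng) for _ in range(n_trials)), key=objective)

def sample_config(rng):
    return {
        "learning_rate": 10 ** rng.uniform(-4, 0),  # log-uniform over [1e-4, 1]
        "dropout": rng.uniform(0.0, 0.6),
    }

grid = {"learning_rate": [1e-3, 1e-2, 1e-1], "dropout": [0.1, 0.3, 0.5]}
best_grid = grid_search(validation_loss, grid)
best_random = random_search(validation_loss, sample_config, n_trials=9)
```

Both searches spend nine evaluations here, but the grid only ever tries three distinct values per hyperparameter, while random search tries nine.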

## What is the difference between online and offline learning?

[Offline learning](/wiki/offline) trains a model on a fixed snapshot of data and then deploys it; the model is [static](/wiki/static) until the next retraining cycle. [Online learning](/wiki/online_learning) updates the model as new examples arrive, which is essential when the data distribution changes (a property called [nonstationarity](/wiki/nonstationarity)). Online inference produces predictions on demand, while offline (batch) inference scores entire datasets on a schedule. The choice between [static](/wiki/static_inference) and [dynamic](/wiki/dynamic) inference depends on freshness requirements, latency budgets, and cost.

## How is class imbalance handled?

When one [class](/wiki/class) is much more frequent than the other, naive accuracy is misleading. Strategies include resampling (oversampling the [minority class](/wiki/minority_class) with SMOTE or ADASYN, or undersampling the [majority class](/wiki/majority_class)), class weighting (scaling the loss by inverse frequency), tuning the [classification threshold](/wiki/classification_threshold), cost-sensitive learning (asymmetric costs in the loss), and anomaly detection framing when the positive class is extremely rare.
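
Class weighting can be sketched as follows. The weights use the common "balanced" heuristic (total count divided by the number of classes times the class count), which is one of several inverse-frequency schemes; the 90/10 label split is made up.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n / (num_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

def weighted_log_loss(y_true, probs, weights, eps=1e-12):
    """Binary log loss where each example is scaled by its class weight."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * loss
    return total / len(y_true)

labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)
```

With these weights each class contributes the same total weight to the loss (90 × 0.556 ≈ 10 × 5.0), so errors on the rare class are no longer drowned out.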

## What is interpretability and responsible machine learning?

[Interpretability](/wiki/interpretability) matters for debugging, fairness, and regulatory compliance. Common techniques include feature importance from tree ensembles (Gini, permutation), partial dependence plots, local surrogates such as LIME (Ribeiro et al., 2016)[18], SHAP values (Lundberg and Lee, 2017)[19], counterfactual explanations, and attention visualization for deep models.

[Algorithmic bias](/wiki/bias_ethics_fairness) can creep in through skewed data, label noise, or [proxy labels](/wiki/proxy_labels) that correlate with protected attributes. Fairness metrics include demographic parity, equalized odds, and predictive parity; no single metric satisfies all desiderata simultaneously (Chouldechova, 2017; Kleinberg et al., 2017).[20][21] Mitigation spans pre-processing (reweighting), in-processing (constraints), and post-processing (group-specific thresholds). [Stability](/wiki/stability) and [feedback loops](/wiki/feedback_loop) are also material risks. The NIST AI Risk Management Framework (2023)[23] and EU AI Act (2024)[24] formalize many of these requirements.

## What software and tools are used for machine learning?

Most practitioners use the Python data stack: [pandas](/wiki/pandas) (and its [DataFrame](/wiki/dataframe) abstraction), NumPy, scikit-learn for classical models, and PyTorch, TensorFlow, or JAX for deep learning. R, Julia, and Spark MLlib are also widely used. ONNX provides a portable format for deployed models, and Hugging Face Hub hosts pretrained checkpoints.

## What are the best textbooks and courses for machine learning fundamentals?

The following resources are widely used and define the canon of fundamentals.

| Resource | Authors / instructors | Year | Notes |
|---|---|---|---|
| The Elements of Statistical Learning (ESL) | Trevor Hastie, Robert Tibshirani, Jerome Friedman | 2nd ed. 2009 | Mathematical, dense, free PDF on the authors' Stanford site |
| An Introduction to Statistical Learning (ISL) | Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani | 2nd ed. 2021 | Gentler companion to ESL with R and Python labs |
| Pattern Recognition and Machine Learning (PRML) | Christopher M. Bishop | 2006 | Bayesian and probabilistic perspective |
| Machine Learning: A Probabilistic Perspective | Kevin Murphy | 2012; expanded as Probabilistic Machine Learning, 2022 and 2023 | Comprehensive modern reference |
| Deep Learning | Ian Goodfellow, Yoshua Bengio, Aaron Courville | 2016 | The standard deep learning textbook, free online |
| Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow | Aurélien Géron | 3rd ed. 2022 | Practical, code-first |
| Mathematics for Machine Learning | Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong | 2020 | Linear algebra, calculus, probability for ML |
| Coursera Machine Learning | Andrew Ng (Stanford / DeepLearning.AI) | 2011 (original); refreshed 2022 | One of the most-taken online courses ever, an entry point for millions |
| Coursera Deep Learning Specialization | Andrew Ng | 2017 | Five-course sequence on neural networks |
| Stanford CS229 Machine Learning | Andrew Ng, Christopher Ré, Tengyu Ma | ongoing | Graduate ML lectures, notes online |
| Stanford CS231n Convolutional Neural Networks for Visual Recognition | Fei-Fei Li, Andrej Karpathy, Justin Johnson | ongoing | Foundational deep learning for vision |
| Stanford CS224n Natural Language Processing with Deep Learning | Christopher Manning | ongoing | Modern NLP curriculum |
| MIT 6.036 Introduction to Machine Learning | MIT OCW | ongoing | Undergraduate ML |
| fast.ai Practical Deep Learning for Coders | Jeremy Howard, Rachel Thomas | 2017 onwards | Top-down, hands-on, free |
| Dive into Deep Learning (D2L) | Aston Zhang, Zachary Lipton, Mu Li, Alex Smola | 2020 onwards | Free interactive book with PyTorch, MXNet, TensorFlow, JAX |
| Reinforcement Learning: An Introduction | Richard S. Sutton, Andrew G. Barto | 2nd ed. 2018 | The canonical RL textbook, free PDF |
| Speech and Language Processing | Daniel Jurafsky, James H. Martin | 3rd ed. draft, ongoing | NLP reference, free chapters online |

The Google Machine Learning Glossary, on which much of the terminology in this wiki draws, is also a high-quality starting point.[22]

## Index of fundamentals terms

The terms below have dedicated wiki pages in the fundamentals chapter of [Machine learning terms](/wiki/machine_learning_terms). Where a term has more than one name (for example ReLU and Rectified Linear Unit), each name has its own entry.

- [accuracy](/wiki/accuracy)
- [activation function](/wiki/activation_function)
- [artificial intelligence](/wiki/artificial_intelligence)
- [AUC (Area Under the Curve)](/wiki/auc_area_under_the_curve)
- [backpropagation](/wiki/backpropagation)
- [batch](/wiki/batch)
- [batch size](/wiki/batch_size)
- [bias](/wiki/bias)
- [bias (ethics/fairness)](/wiki/bias_ethics_fairness)
- [binary classification](/wiki/binary_classification)
- [bucketing](/wiki/bucketing)
- [categorical data](/wiki/categorical_data)
- [class](/wiki/class)
- [classification model](/wiki/classification_model)
- [classification threshold](/wiki/classification_threshold)
- [class-imbalanced dataset](/wiki/class-imbalanced_dataset)
- [clipping](/wiki/clipping)
- [confusion matrix](/wiki/confusion_matrix)
- [continuous feature](/wiki/continuous_feature)
- [convergence](/wiki/convergence)
- [DataFrame](/wiki/dataframe)
- [dataset](/wiki/dataset)
- [deep model](/wiki/deep_model)
- [dense feature](/wiki/dense_feature)
- [depth](/wiki/depth)
- [discrete feature](/wiki/discrete_feature)
- [dynamic](/wiki/dynamic)
- [dynamic model](/wiki/dynamic_model)
- [early stopping](/wiki/early_stopping)
- [embedding layer](/wiki/embedding_layer)
- [epoch](/wiki/epoch)
- [example](/wiki/example)
- [false negative (FN)](/wiki/false_negative_fn)
- [false positive (FP)](/wiki/false_positive_fp)
- [false positive rate (FPR)](/wiki/false_positive_rate_fpr)
- [feature](/wiki/feature)
- [feature cross](/wiki/feature_cross)
- [feature engineering](/wiki/feature_engineering)
- [feature set](/wiki/feature_set)
- [feature vector](/wiki/feature_vector)
- [feedback loop](/wiki/feedback_loop)
- [generalization](/wiki/generalization)
- [generalization curve](/wiki/generalization_curve)
- [gradient descent](/wiki/gradient_descent)
- [ground truth](/wiki/ground_truth)
- [hidden layer](/wiki/hidden_layer)
- [hyperparameter](/wiki/hyperparameter)
- [independently and identically distributed (i.i.d.)](/wiki/independently_and_identically_distributed_i_i_d)
- [inference](/wiki/inference)
- [input layer](/wiki/input_layer)
- [interpretability](/wiki/interpretability)
- [iteration](/wiki/iteration)
- [L0 regularization](/wiki/l0_regularization)
- [L1 loss](/wiki/l1_loss)
- [L1 regularization](/wiki/l1_regularization)
- [L2 loss](/wiki/l2_loss)
- [L2 regularization](/wiki/l2_regularization)
- [label](/wiki/label)
- [labeled example](/wiki/labeled_example)
- [lambda](/wiki/lambda)
- [layer](/wiki/layer)
- [learning rate](/wiki/learning_rate)
- [linear](/wiki/linear)
- [linear model](/wiki/linear_model)
- [linear regression](/wiki/linear_regression)
- [logistic regression](/wiki/logistic_regression)
- [Log Loss](/wiki/log_loss)
- [log-odds](/wiki/log-odds)
- [loss](/wiki/loss)
- [loss curve](/wiki/loss_curve)
- [loss function](/wiki/loss_function)
- [machine learning](/wiki/machine_learning)
- [majority class](/wiki/majority_class)
- [mini-batch](/wiki/mini-batch)
- [minority class](/wiki/minority_class)
- [model](/wiki/model)
- [multi-class classification](/wiki/multi-class_classification)
- [negative class](/wiki/negative_class)
- [neural network](/wiki/neural_network)
- [neuron](/wiki/neuron)
- [node (neural network)](/wiki/node_neural_network)
- [nonlinear](/wiki/nonlinear)
- [nonstationarity](/wiki/nonstationarity)
- [normalization](/wiki/normalization)
- [numerical data](/wiki/numerical_data)
- [offline](/wiki/offline)
- [offline inference](/wiki/offline_inference)
- [one-hot encoding](/wiki/one-hot_encoding)
- [one-vs.-all](/wiki/one-vs_-all)
- [online inference](/wiki/online_inference)
- [online learning](/wiki/online_learning)
- [output layer](/wiki/output_layer)
- [overfitting](/wiki/overfitting)
- [pandas](/wiki/pandas)
- [parameter](/wiki/parameter)
- [positive class](/wiki/positive_class)
- [post-processing](/wiki/post-processing)
- [prediction](/wiki/prediction)
- [proxy labels](/wiki/proxy_labels)
- [rater](/wiki/rater)
- [Rectified Linear Unit (ReLU)](/wiki/rectified_linear_unit_relu)
- [regression model](/wiki/regression_model)
- [regularization](/wiki/regularization)
- [regularization rate](/wiki/regularization_rate)
- [ReLU](/wiki/relu)
- [ROC (receiver operating characteristic) Curve](/wiki/roc_receiver_operating_characteristic_curve)
- [Root Mean Squared Error (RMSE)](/wiki/root_mean_squared_error_rmse)
- [sigmoid function](/wiki/sigmoid_function)
- [softmax](/wiki/softmax)
- [sparse feature](/wiki/sparse_feature)
- [sparse representation](/wiki/sparse_representation)
- [sparse vector](/wiki/sparse_vector)
- [squared loss](/wiki/squared_loss)
- [stability](/wiki/stability)
- [static](/wiki/static)
- [static inference](/wiki/static_inference)
- [stationarity](/wiki/stationarity)
- [stochastic gradient descent (SGD)](/wiki/stochastic_gradient_descent_sgd)
- [supervised machine learning](/wiki/supervised_machine_learning)
- [synthetic feature](/wiki/synthetic_feature)
- [test loss](/wiki/test_loss)
- [training](/wiki/training)
- [training loss](/wiki/training_loss)
- [training-serving skew](/wiki/training-serving_skew)
- [training set](/wiki/training_set)
- [true negative (TN)](/wiki/true_negative_tn)
- [true positive (TP)](/wiki/true_positive_tp)
- [true positive rate (TPR)](/wiki/true_positive_rate_tpr)
- [underfitting](/wiki/underfitting)
- [unlabeled example](/wiki/unlabeled_example)
- [unsupervised machine learning](/wiki/unsupervised_machine_learning)
- [validation](/wiki/validation)
- [validation loss](/wiki/validation_loss)
- [validation set](/wiki/validation_set)
- [weight](/wiki/weight)
- [weighted sum](/wiki/weighted_sum)
- [Z-score normalization](/wiki/z-score_normalization)

## See also

- [Machine learning terms](/wiki/machine_learning_terms)
- [Machine learning terms/Neural networks](/wiki/machine_learning_terms_neural_networks)
- [Machine learning terms/Language evaluation](/wiki/machine_learning_terms_language_evaluation)
- [Machine learning terms/Advanced](/wiki/machine_learning_terms_advanced)
- [Artificial intelligence](/wiki/artificial_intelligence)
- [Deep learning](/wiki/deep_learning)
- [Neural network](/wiki/neural_network)

## References

1. Mitchell, Tom M. *Machine Learning*. McGraw-Hill, 1997.
2. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome. *The Elements of Statistical Learning*. 2nd ed. Springer, 2009.
3. Bishop, Christopher M. *Pattern Recognition and Machine Learning*. Springer, 2006.
4. Murphy, Kevin P. *Probabilistic Machine Learning: An Introduction*. MIT Press, 2022.
5. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron. *Deep Learning*. MIT Press, 2016.
6. Sutton, Richard S.; Barto, Andrew G. *Reinforcement Learning: An Introduction*. 2nd ed. MIT Press, 2018.
7. Géron, Aurélien. *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*. 3rd ed. O'Reilly, 2022.
8. James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert. *An Introduction to Statistical Learning*. 2nd ed. Springer, 2021.
9. Deisenroth, Marc Peter; Faisal, A. Aldo; Ong, Cheng Soon. *Mathematics for Machine Learning*. Cambridge University Press, 2020.
10. Breiman, Leo. "Random Forests." *Machine Learning* 45, no. 1 (2001): 5-32.
11. Chen, Tianqi; Guestrin, Carlos. "XGBoost: A Scalable Tree Boosting System." *KDD*, 2016.
12. Cortes, Corinna; Vapnik, Vladimir. "Support-Vector Networks." *Machine Learning* 20, no. 3 (1995): 273-297.
13. Kingma, Diederik P.; Ba, Jimmy. "Adam: A Method for Stochastic Optimization." *ICLR*, 2015.
14. Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. "ImageNet Classification with Deep Convolutional Neural Networks." *NeurIPS*, 2012.
15. Vaswani, Ashish, et al. "Attention Is All You Need." *NeurIPS*, 2017.
16. Bergstra, James; Bengio, Yoshua. "Random Search for Hyper-Parameter Optimization." *Journal of Machine Learning Research* 13 (2012): 281-305.
17. Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. "Learning Representations by Back-Propagating Errors." *Nature* 323, no. 6088 (1986): 533-536.
18. Ribeiro, Marco Tulio; Singh, Sameer; Guestrin, Carlos. "Why Should I Trust You? Explaining the Predictions of Any Classifier." *KDD*, 2016.
19. Lundberg, Scott M.; Lee, Su-In. "A Unified Approach to Interpreting Model Predictions." *NeurIPS*, 2017.
20. Chouldechova, Alexandra. "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments." *Big Data* 5, no. 2 (2017): 153-163.
21. Kleinberg, Jon; Mullainathan, Sendhil; Raghavan, Manish. "Inherent Trade-Offs in the Fair Determination of Risk Scores." *ITCS*, 2017.
22. Google Developers. "Machine Learning Glossary." https://developers.google.com/machine-learning/glossary, accessed 2026.
23. National Institute of Standards and Technology. *AI Risk Management Framework (AI RMF 1.0)*. NIST AI 100-1, 2023.
24. European Union. *Regulation (EU) 2024/1689 (EU AI Act)*. Official Journal of the European Union, 2024.
25. Samuel, Arthur L. "Some Studies in Machine Learning Using the Game of Checkers." *IBM Journal of Research and Development* 3, no. 3 (1959): 210-229.