# Machine Learning

> Source: https://aiwiki.ai/wiki/machine_learning
> Updated: 2026-07-09
> Categories: Artificial Intelligence, Computer Science, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

See also: [Machine learning terms](/wiki/machine_learning_terms) and [artificial intelligence](/wiki/artificial_intelligence)

**Machine learning** (ML) is a branch of [artificial intelligence](/wiki/artificial_intelligence) that gives computers the ability to learn from data and improve their performance on tasks without being explicitly programmed. Rather than following rigid, hand-coded rules, ML systems build mathematical models from sample data (known as training data) in order to make predictions or decisions. The term was coined by Arthur Samuel at IBM in 1959, and machine learning has since become the dominant approach to artificial intelligence: as of 2024, 72% of organizations surveyed by McKinsey reported using AI in at least one business function, and 65% reported regularly using generative AI built on machine learning [16].

Machine learning is classically described as the "field of study that gives computers the ability to learn without being explicitly programmed," a definition almost universally attributed to Arthur Samuel and his checkers-playing work at IBM [1]. That exact wording does not appear in Samuel's 1959 paper, however; it is a later paraphrase of his ideas, traceable to a 1996 paper by John Koza and colleagues that posed the question "How can computers learn to solve problems without being explicitly programmed?" while explicitly paraphrasing Samuel [22]. In his 1959 paper, Samuel framed the goal vividly: he wanted to show that "a computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program" [1]. A more precise and widely cited definition was later provided by Tom Mitchell in his 1997 textbook *Machine Learning*: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E" [2]. Mitchell's formulation is valued for its rigor because it decomposes the concept of learning into three measurable components: experience (the data), the task (what the program should accomplish), and the performance measure (how success is quantified).

Machine learning sits at the intersection of computer science, statistics, and mathematics. It is closely related to data science, which focuses on extracting knowledge from data, and to computational statistics, which focuses on making predictions using computers. As of 2026, the field is dominated by foundation models and large language models, but classical ML techniques remain widely used in industry for tabular data, time-series forecasting, and many production systems.

## Explain like I'm 5 (ELI5)

Imagine you have a toy box full of different toys. Every day, you play with some of them and eventually decide which ones are your favorites. Now imagine a computer program trying to figure out which toys you like best. At first, it does not know anything and just makes guesses. But as you play with more toys and tell it "I liked this one" or "I didn't like that one," the program gets better at guessing which toys you will enjoy next time.

This process of getting better from experience is called "machine learning." Just as you keep learning and discovering new favorites, the computer program keeps improving at figuring out what you like. That is what makes machine learning useful: the computer teaches itself by looking at lots of examples, without someone having to write out every single rule.

## What is the difference between machine learning and AI?

Machine learning is best understood as a subfield of artificial intelligence, not a synonym for it. Artificial intelligence is the broader discipline concerned with creating systems that can perform tasks that typically require human intelligence. AI encompasses many approaches beyond ML, including symbolic reasoning, expert systems, rule-based logic, and classical search algorithms. Machine learning is the specific subset of AI in which systems improve at a task by learning patterns from data rather than being explicitly programmed with hand-written rules.

Deep learning, in turn, is a subset of machine learning, which is a subset of AI. This nesting relationship is often illustrated as concentric circles: AI on the outside, ML inside it, and deep learning at the core. A useful rule of thumb: all machine learning is AI, but not all AI is machine learning, and all deep learning is machine learning, but not all machine learning is deep learning. The detailed relationships to statistics, data science, and data mining are covered in the dedicated section below.

## History

The intellectual roots of machine learning stretch back to the mid-twentieth century, with several decades of breakthroughs building on one another.

### Early foundations (1940s-1950s)

In 1943, Warren McCulloch and Walter Pitts published a paper describing a computational model of neural networks based on mathematics and threshold logic, establishing one of the earliest theoretical frameworks for how brain-like computation could work [3]. In 1949, Donald Hebb published *The Organization of Behavior*, introducing a learning rule ("Hebbian learning") that proposed how neural pathways strengthen through repeated activation.

In 1950, Alan Turing published "Computing Machinery and Intelligence" in the journal *Mind*, posing the question "Can machines think?" and proposing what became known as the Turing test [4]. The paper also anticipated machine learning directly: rather than hand-program an adult mind, Turing suggested building a "child machine" that could be taught, writing that "we cannot expect to find a good child machine at the first attempt" and that "one must experiment with teaching one such machine and see how well it learns" [4]. He even sketched a reward-and-punishment scheme that prefigured reinforcement learning.

In 1952, Arthur Samuel at IBM began developing a checkers-playing program that could improve its play over time by learning from past games. He demonstrated the program publicly in 1956 and published his landmark paper, "Some Studies in Machine Learning Using the Game of Checkers," in 1959 [1]. The program was one of the first successful demonstrations of self-learning software, and the paper is generally credited with coining the term "machine learning."

### The perceptron and first neural networks (1958-1969)

In 1958, Frank Rosenblatt at the Cornell Aeronautical Laboratory unveiled the perceptron, the first algorithm that could learn weights from input data to perform binary classification [5]. The U.S. Office of Naval Research demonstrated it publicly on July 7, 1958, using an IBM 704 computer that taught itself to distinguish cards marked on the left from cards marked on the right after 50 trials. Press coverage was extravagant: *The New York Times* reported that the perceptron was "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence" [17]. Those claims proved wildly premature, but the device launched the field of neural-network research.

In 1960, Rosenblatt's team built the Mark I Perceptron, a physical machine with an array of photocells that could learn to recognize simple shapes. However, in 1969, Marvin Minsky and Seymour Papert published *Perceptrons*, which mathematically demonstrated limitations of single-layer perceptrons (they could not learn the XOR function, for example). This contributed to a decline in [neural network](/wiki/neural_network) research funding, a period often called the first "AI winter."

### Statistical methods and backpropagation (1980s)

Interest in neural networks revived in the 1980s. The most significant development was the 1986 publication of "Learning representations by back-propagating errors" by David Rumelhart, Geoffrey Hinton, and Ronald Williams in *Nature* [6]. While the mathematical foundations of backpropagation had been explored earlier by Seppo Linnainmaa (1970) and Paul Werbos (1974), the 1986 paper demonstrated that multi-layer networks trained with backpropagation could learn useful internal representations, overcoming the limitations identified by Minsky and Papert.

During this same period, researchers explored other approaches. Decision tree algorithms such as ID3 (1986) and C4.5 (1993), developed by Ross Quinlan, became popular for their interpretability. The Probably Approximately Correct (PAC) learning framework, introduced by Leslie Valiant in 1984, provided a theoretical foundation for computational learning theory, formalizing what it means for an algorithm to learn a concept from examples with quantifiable guarantees [7].

### SVMs and ensemble methods (1990s-2000s)

In 1995, Corinna Cortes and Vladimir Vapnik published "Support-Vector Networks" in *Machine Learning*, introducing support vector machines (SVMs) for classification [8]. SVMs found optimal separating hyperplanes in high-dimensional feature spaces using the "kernel trick" and became one of the most widely used algorithms throughout the late 1990s and 2000s.

In 2001, Leo Breiman published his paper on random forests in *Machine Learning*, describing an ensemble learning method that combines many decision trees trained on random subsets of data and features [9]. The paper became one of the most cited in the field. Breiman's method corrected for the tendency of individual decision trees to overfit, and random forests proved effective across a wide range of problems.

Boosting methods also gained prominence during this era. AdaBoost was introduced by Freund and Schapire in 1997, and gradient boosting, which casts boosting as [gradient descent](/wiki/gradient_descent) on a loss function, was formalized by Jerome Friedman in 2001.

### The deep learning revolution (2012-present)

The modern era of machine learning began in earnest in 2012 when Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered their [deep learning](/wiki/deep_model) model, AlexNet, in the ImageNet Large Scale Visual Recognition Challenge [10]. AlexNet achieved a top-5 error rate of 15.3%, compared with 26.2% for the second-best entry, a margin of nearly 11 percentage points over a field that had previously improved by fractions of a percent per year [10][18]. Previous winners had typically used hand-engineered features fed into SVMs.

AlexNet's success was enabled by three converging factors: large-scale labeled datasets, general-purpose GPU computing via NVIDIA's CUDA platform, and improved training techniques for deep networks. The full ImageNet dataset, curated by Fei-Fei Li and collaborators and introduced in 2009, contains more than 14 million hand-labeled images spanning over 20,000 categories; the ILSVRC competition used a subset of roughly 1.2 million training images across 1,000 classes [18]. This result triggered massive investment in deep learning research.

Subsequent milestones include:

- **2014:** Generative adversarial networks (GANs) introduced by Ian Goodfellow and colleagues.
- **2015:** ResNet achieved superhuman performance on ImageNet with 152 layers using residual connections.
- **2017:** The transformer architecture introduced by Vaswani et al. in "Attention Is All You Need" [11], which became the basis for modern language models. The paper went on to become one of the most cited works in modern AI, surpassing 100,000 citations.
- **2018:** BERT from Google and GPT from OpenAI demonstrated the power of pre-trained language models.
- **2020:** OpenAI released GPT-3, a 175-billion-parameter language model trained on roughly 300 billion tokens, demonstrating strong few-shot learning across tasks without task-specific fine-tuning [19].
- **2020-2023:** Rapid scaling of large language models including GPT-3, GPT-4, Claude, Llama, and others.
- **2025-2026:** Foundation models are treated as enterprise infrastructure, with smaller specialized mixture-of-experts models, multi-agent systems, and efficient fine-tuning techniques like LoRA becoming standard practice [12].

### Historical timeline summary

| Year | Milestone |
|---|---|
| 1943 | McCulloch and Pitts publish a computational model of neural networks |
| 1949 | Donald Hebb proposes Hebbian learning in *The Organization of Behavior* |
| 1950 | Alan Turing publishes "Computing Machinery and Intelligence" |
| 1952 | Arthur Samuel begins developing a self-learning checkers program at IBM |
| 1957 | Frank Rosenblatt designs the perceptron |
| 1959 | Arthur Samuel coins the term "machine learning" |
| 1969 | Minsky and Papert publish *Perceptrons*, contributing to the first AI winter |
| 1979 | Stanford Cart navigates a room of obstacles using machine vision |
| 1984 | Leslie Valiant introduces the PAC learning framework |
| 1986 | Rumelhart, Hinton, and Williams publish the backpropagation paper |
| 1995 | Cortes and Vapnik introduce support vector machines |
| 1997 | Tom Mitchell publishes a formal definition of machine learning |
| 2001 | Leo Breiman publishes the random forests paper |
| 2006 | Geoffrey Hinton popularizes the term "deep learning" and demonstrates deep belief networks |
| 2009 | ImageNet dataset released, enabling large-scale visual learning |
| 2012 | AlexNet wins ImageNet competition, sparking the deep learning revolution |
| 2014 | Ian Goodfellow introduces GANs |
| 2016 | AlphaGo defeats world champion Lee Sedol at Go |
| 2017 | Vaswani et al. introduce the transformer architecture |
| 2018 | BERT and GPT demonstrate large-scale pre-training for NLP |
| 2020 | GPT-3 (175B parameters) and AlphaFold 2 demonstrate the power of scale |
| 2022 | ChatGPT brings large language models into mainstream public awareness |
| 2025 | DeepSeek demonstrates efficiency breakthroughs; reasoning models emerge |

## What are the main types of machine learning?

Machine learning methods are typically categorized by the type of signal or feedback available during training. The five most common paradigms are supervised, unsupervised, semi-supervised, self-supervised, and reinforcement learning.

### Supervised learning

[Supervised machine learning](/wiki/supervised_machine_learning) is the most common paradigm. The algorithm is trained on a labeled dataset where each input example is paired with a known output (the label or target). The goal is to learn a mapping function from inputs to outputs so the model can predict labels for new, unseen data.

Supervised learning problems fall into two main categories:

- **Classification:** Predicting a discrete category. Examples include email spam detection (spam or not spam), medical image diagnosis (benign or malignant), and sentiment analysis (positive or negative).
- **Regression:** Predicting a continuous numerical value. Examples include house price prediction, temperature forecasting, and stock price estimation.
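
As a minimal sketch of classification, the snippet below implements a k-nearest-neighbors classifier in plain Python. The two-feature spam dataset and the function name `knn_predict` are made up for illustration and do not come from any library.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Pair each training label with its Euclidean distance to the query, nearest first
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    top_labels = [label for _, label in distances[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical labeled data: (exclamation marks, links) per email
train_X = [(0, 0), (1, 0), (0, 1), (5, 3), (6, 4), (4, 5)]
train_y = ["ham", "ham", "ham", "spam", "spam", "spam"]

print(knn_predict(train_X, train_y, (5, 4)))    # query close to the "spam" examples
print(knn_predict(train_X, train_y, (0, 0.5)))  # query close to the "ham" examples
```

The labels supply the supervision here: the model never sees a rule for what makes an email spam, only examples paired with known answers.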

### Unsupervised learning

[Unsupervised machine learning](/wiki/unsupervised_machine_learning) involves training on data without labels. The algorithm attempts to find hidden patterns, groupings, or structure in the data on its own. Common unsupervised tasks include:

- **Clustering:** Grouping similar data points together. Customer segmentation and document grouping are typical applications. Algorithms include k-means, DBSCAN, and hierarchical clustering.
- **Dimensionality reduction:** Reducing the number of features while preserving important information. Principal component analysis (PCA), t-SNE, and UMAP are widely used techniques.
- **Density estimation:** Modeling the probability distribution of data to understand how data points are spread across the feature space. Kernel density estimation and Gaussian mixture models are common approaches.
- **Association rule learning:** Discovering relationships between variables in large databases, such as market basket analysis ("customers who buy X also tend to buy Y").
- **Anomaly detection:** Identifying unusual data points that do not conform to expected patterns.
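
Clustering, the first task in the list above, can be sketched with a bare-bones k-means loop. This is an illustrative implementation under simplifying assumptions (fixed iteration count, caller-supplied starting centroids, made-up 2-D points), not a production algorithm.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and recomputing centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        centroids = new_centroids
    return labels, centroids

# Hypothetical unlabeled data forming two visible groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

No labels are provided at any point; the grouping emerges only from distances between the points.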

### Semi-supervised learning

Semi-supervised learning uses a small amount of labeled data combined with a large amount of unlabeled data during training. This approach is practical because labeling data is often expensive and time-consuming, while unlabeled data is abundant. For instance, a medical imaging system might have millions of X-ray images but only a few thousand with expert annotations. Semi-supervised methods, such as self-training and co-training, leverage the structure of the unlabeled data to improve learning beyond what the labeled examples alone could provide.

### Self-supervised learning

Self-supervised learning generates its own supervisory signals from the input data itself, without requiring human-provided labels. The model learns by solving a "pretext task" derived from the data structure. For example, a language model might learn to predict the next word in a sentence, or an image model might learn to fill in masked patches of an image.

This approach underpins modern foundation models like GPT and BERT, which are pre-trained on massive text corpora using self-supervised objectives before being fine-tuned for specific tasks. Self-supervised learning has proven remarkably effective because it allows models to learn rich, general-purpose representations from virtually unlimited unlabeled data.
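
The next-word pretext task can be illustrated in a few lines: raw text is turned into (context, target) training pairs with no human labeling. The helper `next_word_pairs` and its whitespace tokenization are simplifying assumptions; real language models use learned subword tokenizers and far longer contexts.

```python
def next_word_pairs(text, context_size=2):
    """Turn unlabeled text into (context, target) pairs for next-word prediction."""
    tokens = text.lower().split()  # naive whitespace tokenization (an assumption)
    return [
        (tuple(tokens[i - context_size:i]), tokens[i])
        for i in range(context_size, len(tokens))
    ]

pairs = next_word_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
```

Each target word acts as a label, but it comes from the data itself, which is what distinguishes self-supervised from supervised learning.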

### Reinforcement learning

[Reinforcement learning](/wiki/reinforcement_learning_rl) (RL) takes a fundamentally different approach. An agent learns to make decisions by taking actions in an environment and receiving feedback in the form of rewards or penalties. The goal is to learn a policy that maximizes cumulative reward over time. The agent must balance exploration (trying new actions to discover their consequences) with exploitation (choosing actions known to yield high rewards).

RL has achieved remarkable results in game-playing (DeepMind's AlphaGo defeated world champion Lee Sedol at Go in 2016), robotics control, and resource management. Reinforcement learning from human feedback (RLHF) has become a standard technique for aligning large language models with human preferences.
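
The exploration-exploitation balance can be sketched with an epsilon-greedy agent on a multi-armed bandit, a simplified RL setting with only one state. The win probabilities, the value of `epsilon`, and the function name are illustrative assumptions.

```python
import random

def epsilon_greedy_bandit(win_probs, steps=5000, epsilon=0.1, seed=0):
    """Estimate each arm's average reward while mostly pulling the best-looking arm."""
    rng = random.Random(seed)
    counts = [0] * len(win_probs)
    estimates = [0.0] * len(win_probs)
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(win_probs))  # explore: pick a random arm
        else:
            # exploit: pick the arm with the highest estimated reward so far
            arm = max(range(len(win_probs)), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < win_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# Hypothetical environment: three arms with hidden win probabilities
estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print(counts)
print([round(e, 2) for e in estimates])
```

With `epsilon=0` the agent would never try the other arms and could stay stuck on the first one that paid out; with `epsilon=1` it would never use what it has learned.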

### Comparison of learning types

| Learning type | Training data | Goal | Example applications |
|---|---|---|---|
| Supervised | Labeled (input-output pairs) | Learn a mapping from inputs to outputs | Spam detection, price prediction, medical diagnosis |
| Unsupervised | Unlabeled | Discover hidden structure or patterns | Customer segmentation, anomaly detection, topic modeling |
| Semi-supervised | Small labeled set + large unlabeled set | Improve learning by leveraging unlabeled data | Medical imaging, web content classification |
| Self-supervised | Unlabeled (labels derived from data) | Learn general representations via pretext tasks | Language model pre-training (GPT, BERT), image pre-training |
| Reinforcement | Reward signals from environment | Learn a policy to maximize cumulative reward | Game playing, robotics, recommendation systems |

## Key algorithms

The table below summarizes widely used machine learning algorithms, organized by learning type and typical use cases.

| Algorithm | Type | Task | Description |
|---|---|---|---|
| Linear regression | Supervised | Regression | Models the relationship between input features and a continuous output using a linear equation. One of the simplest and most interpretable ML methods. |
| Logistic regression | Supervised | Classification | Despite the name, it is a classification method that estimates the probability of a binary outcome using the logistic (sigmoid) function. |
| Decision tree | Supervised | Both | Builds a tree-like structure of if-then rules to split data based on feature values. Highly interpretable but prone to [overfitting](/wiki/overfitting). |
| Random forest | Supervised | Both | An ensemble of many decision trees, each trained on a random subset of data and features. Reduces overfitting compared to individual trees. Introduced by Breiman in 2001 [9]. |
| Support vector machine (SVM) | Supervised | Classification | Finds the optimal hyperplane that maximizes the margin between classes. Effective in high-dimensional spaces using kernel functions [8]. |
| K-nearest neighbors (k-NN) | Supervised | Both | Predicts from the k closest training points in the feature space: the majority label for classification, or the average target value for regression. Simple but can be slow for large datasets. |
| Naive Bayes | Supervised | Classification | Applies Bayes' theorem with an assumption of feature independence. Fast and effective for text classification tasks like spam filtering. |
| Gradient boosting (XGBoost, LightGBM) | Supervised | Both | Sequentially builds trees where each new tree corrects errors made by the previous ones. Often achieves state-of-the-art results on tabular data. |
| Neural network | Supervised / Self-supervised | Both | Models inspired by biological neurons, consisting of layers of interconnected nodes. Deep [neural networks](/wiki/neural_network) with many layers form the basis of deep learning. |
| K-means | Unsupervised | Clustering | Partitions data into k clusters by iteratively assigning points to the nearest cluster centroid and updating centroids. |
| Principal component analysis (PCA) | Unsupervised | Dimensionality reduction | Projects data onto a lower-dimensional subspace that captures the most variance. |
| DBSCAN | Unsupervised | Clustering | Density-based clustering algorithm that groups together closely packed points and marks points in low-density regions as outliers. Does not require specifying the number of clusters in advance. |

## How does the machine learning pipeline work?

Building a machine learning system involves a sequence of steps, often called the ML pipeline. Each step is important; poor data preparation or incorrect evaluation can undermine even the most sophisticated algorithm.

### 1. Data collection

The process begins with gathering relevant data. Sources vary widely: databases, APIs, web scraping, sensors, surveys, or public datasets. The quantity and quality of data have a direct impact on model performance. Andrew Ng has frequently emphasized that for many practical applications, improving the data yields better results than improving the algorithm.

### 2. Data preprocessing

Raw data is rarely clean. Preprocessing includes handling missing values (imputation or removal), removing duplicates, correcting errors, encoding categorical variables (one-hot encoding, label encoding), and normalizing or standardizing numerical features so they share a common scale. Outlier detection and treatment is also a common preprocessing step.
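
Three of these steps (mean imputation, standardization, and one-hot encoding) can be sketched with the standard library alone. The helper names and sample values are hypothetical; real pipelines usually rely on a dataframe or ML library.

```python
import statistics

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def one_hot(values):
    """Encode each category as a 0/1 indicator vector, columns in sorted order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

ages = impute_mean([20, None, 40, 30])  # the missing value becomes 30.0
print(ages)
print(standardize(ages))
print(one_hot(["red", "green", "red", "blue"]))  # columns: blue, green, red
```

In practice the imputation mean and the scaling statistics should be computed on the training split only and then reused on validation and test data, so that no information leaks from held-out examples.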

### 3. Feature engineering

[Feature engineering](/wiki/feature_engineering) is the process of creating, selecting, or transforming input variables to improve model performance. This might involve combining existing features (e.g., calculating a price-per-square-foot feature from price and area), extracting date components (day of week, month), or applying domain-specific transformations.
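
A short sketch of the two examples just mentioned, using a hypothetical housing record. The field names (`price`, `area_sqft`, `sold_on`) are invented for illustration.

```python
from datetime import date

def engineer_features(listing):
    """Derive new model inputs from a raw listing record (hypothetical field names)."""
    sold = listing["sold_on"]
    return {
        "price_per_sqft": listing["price"] / listing["area_sqft"],
        "sale_month": sold.month,
        "sale_day_of_week": sold.weekday(),  # Monday = 0 ... Sunday = 6
        "sold_on_weekend": sold.weekday() >= 5,
    }

listing = {"price": 450_000, "area_sqft": 1_500, "sold_on": date(2024, 3, 16)}
features = engineer_features(listing)
print(features)
```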

Although deep learning has reduced the need for manual feature engineering in some domains (images, text, audio), it remains critically important for tabular data problems. Good feature engineering requires domain knowledge and can often make the difference between a mediocre model and an excellent one.

**Representation learning** is a closely related concept where the model itself learns useful features from raw data rather than relying on human-designed transformations. Neural networks excel at representation learning: convolutional layers learn visual features, and transformer layers learn contextual text representations. This shift from manual feature engineering to learned representations is one of the defining characteristics of the deep learning era.

### 4. Model selection

Choosing an appropriate algorithm depends on the problem type (classification, regression, clustering), dataset size, number of features, interpretability requirements, and computational constraints. Practitioners often try several algorithms and compare their performance. The "no free lunch" theorem states that no single algorithm is universally best across all problems, so empirical comparison is essential.

### 5. Training

During training, the model learns parameters from the training data. For supervised learning, this means minimizing a loss function that measures the difference between predictions and actual labels. Gradient descent and its variants (stochastic gradient descent, Adam, AdaGrad) are the most common optimization algorithms. Training may take seconds for simple models on small datasets, or weeks on clusters of GPUs for large neural networks.
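
To make loss minimization concrete, here is a minimal sketch of batch gradient descent fitting a line by mean squared error. The learning rate, epoch count, and toy data are illustrative choices, not recommended defaults.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training example under the current parameters
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of MSE = (1/n) * sum(error^2) with respect to w and b
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient to reduce the loss
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))  # should land close to 2 and 1
```

Stochastic gradient descent follows the same update rule but estimates the gradient from a single example or a small batch at each step instead of the full dataset.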

### 6. Evaluation

After training, the model is evaluated on a held-out test set that it has never seen before. The choice of evaluation metric depends on the task (see the Evaluation metrics section below). It is essential to evaluate on data separate from the training set to get an honest estimate of how the model will perform in production.

### 7. Hyperparameter tuning

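Most ML algorithms have hyperparameters (settings that are not learned from data but set before training), such as the learning rate, number of trees, regularization strength, or network depth. Hyperparameter tuning involves searching for the best combination of these settings. Common approaches include grid search, random search, and Bayesian optimization. Candidate settings should be compared on a separate validation set or with cross-validation rather than on the test set, so that the final test estimate remains an honest measure of generalization.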
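
Grid search can be sketched as a loop over candidate values, scoring each on validation data. The example below tunes the regularization strength `lam` of a one-feature ridge model; the model, data, and grid are all hypothetical and chosen only to keep the example small.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge estimate for a one-feature model y = w*x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    """Mean squared error of the model y = w*x on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grid_search(train, valid, grid):
    """Return (validation MSE, lam) for the candidate with the lowest validation error."""
    results = []
    for lam in grid:
        w = fit_ridge_1d(*train, lam)          # fit on training data only
        results.append((mse(w, *valid), lam))  # score on held-out validation data
    return min(results)

# Hypothetical toy data, roughly y = 2x with noisy training targets
train = ([1.0, 2.0, 3.0], [2.2, 3.9, 6.3])
valid = ([4.0, 5.0], [8.0, 10.0])
best_error, best_lam = grid_search(train, valid, grid=[0.0, 0.1, 0.5, 1.0, 10.0])
print(best_lam, round(best_error, 4))
```

Random search replaces the fixed grid with sampled values, and Bayesian optimization uses earlier results to decide which value to try next; the evaluate-and-compare loop stays the same.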

### 8. Deployment and monitoring

Once a model meets performance requirements, it is deployed into a production environment where it makes predictions on new data. Deployment methods range from REST APIs to embedded systems to batch processing jobs. After deployment, ongoing monitoring is needed to detect performance degradation over time, commonly called model drift, which often arises when the statistical properties of the input data change.

## Evaluation metrics

Different tasks call for different metrics. Using the wrong metric can give a misleading picture of model quality.

### Classification metrics

| Metric | Formula | When to use |
|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced classes; overall correctness |
| Precision | TP / (TP + FP) | When false positives are costly (e.g., spam filtering) |
| Recall (Sensitivity) | TP / (TP + FN) | When false negatives are costly (e.g., disease screening) |
| F1 score | 2 * (Precision * Recall) / (Precision + Recall) | Imbalanced classes; balance of precision and recall |
| AUC-ROC | Area under the ROC curve | Evaluating performance across all classification thresholds |

TP = true positives, TN = true negatives, FP = false positives, FN = false negatives.

The precision-recall tradeoff is a common consideration: increasing precision typically reduces recall and vice versa. The right balance depends on the application. A cancer screening system should prioritize recall (catching all true cases), while a recommendation system might prioritize precision (avoiding irrelevant suggestions).
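
The formulas in the table can be computed directly from predicted and true labels. In this sketch, returning 0.0 when a denominator is zero is a convention chosen for the example, not a universal rule, and the label lists are made up.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention for 0/0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels: 3 true positives, 1 false negative, 1 false positive, 5 true negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```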

### Regression metrics

| Metric | Description |
|---|---|
| Mean Squared Error (MSE) | Average of the squared differences between predicted and actual values. Penalizes large errors heavily. |
| Root Mean Squared Error (RMSE) | Square root of MSE; in the same units as the target variable. |
| Mean Absolute Error (MAE) | Average of absolute differences. Less sensitive to outliers than MSE. |
| R-squared (R²) | Proportion of variance explained by the model. Typically between 0 and 1; it can be negative when the model predicts worse than always outputting the mean of the target. |
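
A minimal sketch computing the four regression metrics from the table on made-up values:

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R-squared for paired true and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1 - ss_res / ss_tot,
    }

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
scores = regression_metrics(y_true, y_pred)
print(scores)
```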

## Bias-variance tradeoff

The bias-variance tradeoff is a fundamental concept in machine learning that describes the tension between two sources of prediction error.

- **Bias** is the error introduced by simplifying assumptions in the model. A model with high bias pays too little attention to the training data and oversimplifies the underlying pattern. This leads to underfitting, where the model performs poorly on both training and test data. For example, fitting a straight line to data that follows a quadratic curve produces high bias.
- **Variance** is the error introduced by the model's sensitivity to small fluctuations in the training data. A model with high variance fits the training data very closely, capturing noise as if it were signal. This leads to [overfitting](/wiki/overfitting), where the model performs well on training data but poorly on unseen test data. A very deep decision tree that memorizes every training example is a high-variance model.

The expected squared prediction error can be decomposed as: **Error = Bias² + Variance + Irreducible Noise**. The irreducible noise is inherent randomness in the data that no model can eliminate.
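
Written out for squared error, the decomposition takes the form below. It assumes a setup not spelled out in this article: targets are generated as y = f(x) + ε, where ε is zero-mean noise with variance σ² that is independent of the training set D, and the expectation is taken over both D and ε.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\bigl(y - \hat{f}_D(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\bigr)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\bigl(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\bigr)^2\right]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible noise}}
```

Here \hat{f}_D denotes the model fitted on training set D, so the bias term measures how far the average fitted model is from the true function, and the variance term measures how much individual fitted models scatter around that average.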

The goal is to find a model complexity that is low enough to avoid overfitting but high enough to capture the true underlying patterns in the data. In practice, this balance is managed through regularization and [cross-validation](/wiki/cross-validation).

## Regularization

Regularization is a set of techniques that constrain or penalize model complexity to reduce overfitting.

- **L1 regularization (Lasso):** Adds the sum of absolute values of model weights to the loss function. This encourages sparsity, effectively performing feature selection by driving some weights to exactly zero.
- **L2 regularization (Ridge):** Adds the sum of squared weights to the loss function. This discourages large weight values but does not force them to zero.
- **Elastic Net:** Combines L1 and L2 penalties, offering a middle ground that can handle correlated features better than Lasso alone.
- **Dropout:** Used in neural networks; randomly sets a fraction of neuron activations to zero during training, forcing the network to not rely on any single neuron.
- **Early stopping:** Monitors performance on a validation set during training and stops when performance begins to degrade, preventing the model from memorizing the training data.
- **Data augmentation:** Artificially expands the training set by creating modified versions of existing data (e.g., rotating, flipping, or cropping images), which helps the model generalize better.
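
The different effects of L1 and L2 penalties can be seen in a one-weight toy problem. Assuming the data loss is 0.5·(w − z)², where z is the unregularized weight, each penalty has a closed-form solution; this setup is an illustrative simplification, not how full models are trained.

```python
def lasso_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w), known as soft-thresholding."""
    if abs(z) <= lam:
        return 0.0  # small weights are set exactly to zero
    return z - lam if z > 0 else z + lam

def ridge_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*w**2: every weight is scaled down."""
    return z / (1 + 2 * lam)

unregularized = [3.0, -0.4, 0.2, -2.5]  # hypothetical unpenalized weights
lam = 0.5
lasso_weights = [lasso_shrink(z, lam) for z in unregularized]
ridge_weights = [ridge_shrink(z, lam) for z in unregularized]
print(lasso_weights)  # two of the four weights become exactly 0.0
print(ridge_weights)  # all weights shrink, none reach zero
```

This is the sparsity behavior described in the first two bullets: the L1 penalty removes weak features entirely, while the L2 penalty only reduces their influence.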

## Cross-validation

Cross-validation is a resampling technique used to evaluate model performance more reliably than a single train-test split. The most common form is k-fold cross-validation: the dataset is divided into k equally sized subsets (folds). The model is trained k times, each time using k-1 folds for training and the remaining fold for validation. The final performance estimate is the average across all k runs. Common choices for k are 5 and 10.

Cross-validation helps detect overfitting and provides a more robust performance estimate, especially when data is limited. Stratified k-fold cross-validation preserves the proportion of each class in every fold, which is important for imbalanced datasets. Leave-one-out cross-validation (LOOCV) is an extreme case where k equals the number of data points; it is computationally expensive but useful for very small datasets.
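
The k-fold procedure can be sketched as follows. For brevity the folds are contiguous and unshuffled (real data is usually shuffled first), and the "model" is a hypothetical baseline that always predicts the training mean.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def cross_validate(ys, k):
    """Average validation MSE of a predict-the-training-mean baseline over k folds."""
    fold_scores = []
    for train, val in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train) / len(train)  # "fit" on k-1 folds
        fold_mse = sum((ys[i] - prediction) ** 2 for i in val) / len(val)
        fold_scores.append(fold_mse)
    return sum(fold_scores) / k

ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
folds = list(k_fold_indices(len(ys), 3))
cv_score = cross_validate(ys, 3)
print(folds)
print(cv_score)
```

Every example is used for validation exactly once and for training k-1 times, which is why the averaged score is less dependent on one lucky or unlucky split.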

## How does deep learning relate to machine learning?

[Deep learning](/wiki/deep_model) is a subset of machine learning that uses neural networks with many layers (hence "deep"). While classical ML algorithms like random forests and SVMs typically rely on hand-crafted features, deep learning models can automatically learn hierarchical representations from raw data. The term itself predates the modern boom: Rina Dechter introduced "deep learning" to the machine learning literature in 1986 and Igor Aizenberg and colleagues applied it to artificial neural networks around 2000, but the phrase only entered mainstream use after Geoffrey Hinton's 2006 work on deep belief networks popularized it [23].

Key deep learning architectures include:

- **Convolutional neural networks (CNNs):** Designed for grid-like data such as images. They use convolutional filters to detect local patterns (edges, textures) and build increasingly abstract representations in deeper layers.
- **Recurrent neural networks (RNNs):** Designed for sequential data. Variants like LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) address the vanishing gradient problem that affects basic RNNs.
- **Transformers:** Based on the self-attention mechanism, transformers process all input tokens in parallel rather than sequentially. They are the architecture behind GPT, BERT, Claude, and virtually all modern large language models [11].
- **Generative adversarial networks (GANs):** Consist of a generator and discriminator trained adversarially. Used for image synthesis, style transfer, and data augmentation.
- **Diffusion models:** Generate data by learning to reverse a gradual noising process. They power modern image generation systems like Stable Diffusion and DALL-E 2.
- **Autoencoders and variational autoencoders (VAEs):** Learn compressed representations of data by training an encoder-decoder architecture. VAEs add a probabilistic framework that enables generation of new data samples.
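The self-attention mechanism mentioned for transformers can be illustrated with a minimal scaled dot-product attention in plain Python. This is a sketch under simplifying assumptions: a real transformer derives queries, keys, and values from learned linear projections of the tokens and uses multiple heads, whereas here the token vectors are passed in directly and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention: every query is compared with every key,
    and the output is a weighted average of the value vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Two toy token vectors; queries, keys, and values are all the tokens themselves.
tokens = [[1.0, 0.0], [0.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
```

Because each output row depends only on the full set of keys and values, the rows can be computed independently of one another, which is what allows all positions to be processed in parallel.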

## How does machine learning relate to statistics and data science?

Machine learning is best understood as a subfield of artificial intelligence, which itself is the broader discipline concerned with creating systems that can perform tasks that typically require human intelligence. AI encompasses many approaches beyond ML, including symbolic reasoning, expert systems, and search algorithms.

**Statistics** and machine learning share substantial overlap but differ in emphasis. Statistics traditionally focuses on inference: drawing conclusions about populations from samples, quantifying uncertainty, and testing hypotheses. Machine learning prioritizes prediction: building models that generalize well to unseen data, often with less concern for interpretability or inferential guarantees. Leo Breiman articulated this distinction in his influential 2001 paper "Statistical Modeling: The Two Cultures," contrasting the "data modeling" culture of classical statistics with the "algorithmic modeling" culture of machine learning, and arguing that the latter had been unfairly neglected by statisticians [14].

**Computational learning theory** provides the mathematical foundations for understanding what can and cannot be learned efficiently. Key results include the PAC (Probably Approximately Correct) learning framework introduced by Leslie Valiant in 1984 [7] and the VC (Vapnik-Chervonenkis) dimension, which quantifies the capacity of a class of models.
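As a concrete instance of a PAC guarantee, a standard textbook result (given here as an illustration, not quoted from Valiant's paper) covers a finite hypothesis class H in the realizable setting, where some hypothesis in H labels every example correctly. A learner that returns any hypothesis consistent with m independent training examples has true error at most ε with probability at least 1 − δ whenever:

```latex
m \;\ge\; \frac{1}{\epsilon}\left(\ln\lvert H\rvert + \ln\frac{1}{\delta}\right)
```

The bound grows only logarithmically in the size of the hypothesis class and in 1/δ, but linearly in 1/ε. For infinite hypothesis classes, the VC dimension takes over the role played by ln|H|.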

**Data science** is a related but distinct field that focuses on extracting insights and knowledge from data using a combination of statistics, programming, and domain expertise. Machine learning provides many of the predictive modeling tools that data scientists use, but data science also includes data cleaning, exploratory analysis, visualization, and communication of results. Not every data science project requires machine learning, and not every ML project fits neatly into data science.

**Data mining** focuses on discovering previously unknown patterns in large datasets. Machine learning is typically evaluated on how well it reproduces known knowledge (for example, the labels of held-out data), whereas data mining emphasizes finding novel, useful patterns. In practice, the two disciplines use many of the same techniques.

Deep learning is a subset of machine learning, which is a subset of AI. This nesting relationship is sometimes illustrated as concentric circles: AI on the outside, ML inside it, and deep learning at the core.

## Tools and frameworks

The machine learning ecosystem has matured significantly. Most ML development happens in Python, with a rich set of open-source libraries.

| Framework | Developer | Primary use | Notes |
|---|---|---|---|
| scikit-learn | Community (originally Inria) | Classical ML | The standard library for non-deep-learning tasks: classification, regression, clustering, preprocessing, and model evaluation. Stable API, excellent documentation. |
| PyTorch | Meta (Facebook) | Deep learning | Known for its dynamic computation graph and Pythonic design. Dominant in research. Serves as the foundation for models like GPT and Llama. |
| TensorFlow | Google | Deep learning | Production-focused framework with strong deployment tools (TensorFlow Serving, TensorFlow Lite for mobile). Widely used in industry. |
| Keras | Francois Chollet / Google | Deep learning (high-level API) | User-friendly API that can run on top of TensorFlow, PyTorch, or JAX. Good for prototyping and beginners. |
| XGBoost | Tianqi Chen | Gradient boosting | Extremely popular for tabular data competitions and production systems. Offers speed and regularization improvements over earlier boosting implementations. |
| LightGBM | Microsoft | Gradient boosting | Uses histogram-based algorithms for faster training on large datasets. |
| JAX | Google | Numerical computing / DL | Combines NumPy-like syntax with automatic differentiation and XLA compilation. Increasingly used for ML research. |
| Hugging Face Transformers | Hugging Face | NLP / Foundation models | Provides pre-trained transformer models and tools for fine-tuning. The de facto hub for sharing and using language models. |
| MLflow | Databricks / Community | Experiment tracking | Open-source platform for managing the ML lifecycle, including experiment tracking, model registry, and deployment. |

## ML infrastructure and MLOps

As machine learning has moved from research to production, a new engineering discipline called MLOps (Machine Learning Operations) has emerged. MLOps applies DevOps principles to ML systems, addressing the unique challenges of managing data-dependent, continuously evolving models in production environments.

Key components of ML infrastructure include:

- **Experiment tracking:** Recording hyperparameters, metrics, and artifacts for every training run so experiments are reproducible. Tools include MLflow, Weights & Biases, and Neptune.
- **Model versioning and registry:** Storing trained models with metadata, lineage information, and deployment status. This allows teams to roll back to previous versions if a new model underperforms.
- **Feature stores:** Centralized repositories for computing, storing, and serving features consistently across training and inference. Feast and Tecton are popular open-source and commercial options, respectively.
- **Model serving:** Deploying models to serve predictions in real time (via REST APIs or gRPC) or in batch mode. Tools like TensorFlow Serving, Triton Inference Server, and BentoML handle model serving at scale.
- **Monitoring and observability:** Tracking model performance in production to detect data drift (changes in input data distributions), concept drift (changes in the relationship between inputs and outputs), and performance degradation. Prometheus and Grafana are commonly used for infrastructure-level monitoring, while specialized tools like Evidently and WhyLabs focus on ML-specific monitoring.
- **CI/CD for ML:** Continuous integration and continuous deployment pipelines adapted for ML, including automated testing of data quality, model performance, and serving infrastructure. Kubeflow Pipelines and Apache Airflow are widely used orchestration tools.
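Data drift monitoring, as described above, comes down to comparing the distribution of a feature in production against a reference sample. One commonly used statistic for this is the population stability index (PSI). The sketch below is a plain-Python illustration with hypothetical names and hand-picked bin edges, not the method used by any of the tools listed.

```python
import math


def population_stability_index(expected, actual, bin_edges, eps=1e-6):
    """PSI = sum over bins of (q - p) * ln(q / p), where p and q are the
    fractions of reference and production values falling in each bin."""

    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges the value has reached or passed.
            idx = sum(1 for edge in bin_edges if v >= edge)
            counts[idx] += 1
        # Floor at eps so empty bins do not cause division by zero or log(0).
        return [max(c / len(values), eps) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


reference = [0.1 * i for i in range(100)]      # feature values seen at training time
production = [v + 3.0 for v in reference]      # same feature, shifted in production
edges = [2.5, 5.0, 7.5]

psi_same = population_stability_index(reference, reference, edges)
psi_shifted = population_stability_index(reference, production, edges)
```

Identical distributions give a PSI of zero, and the value grows as the two samples diverge. The threshold at which an alert fires is a deployment-specific choice.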

As of 2025, surveys indicate that over 70% of enterprises have adopted or are actively implementing MLOps practices, reflecting the maturity of the field.

## Current state (2025-2026)

As of early 2026, machine learning is defined by several intersecting trends.

**Foundation models as infrastructure.** Large pre-trained models, particularly large language models, have shifted from experimental curiosities to production infrastructure. Companies fine-tune pre-trained backbones with lightweight adapters (such as LoRA) for specialized domains rather than training from scratch [12].
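The idea behind a LoRA adapter can be sketched in a few lines. The pre-trained weight matrix W stays frozen and only a low-rank update is trained, giving an effective weight of W + (alpha / r) · B · A, where r is the adapter rank. The plain-Python code below uses hypothetical helper names and a tiny 4×4 matrix purely for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_effective_weight(w, a, b, alpha):
    """w: frozen (d_out x d_in) weight; b: (d_out x r); a: (r x d_in).
    Returns w + (alpha / r) * (b @ a)."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


d = 4
w = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen weight
a = [[0.0, 2.0, 0.0, 0.0]]            # rank r = 1, shape (1 x 4)
b = [[1.0], [0.0], [0.0], [0.0]]      # shape (4 x 1)

adapted = lora_effective_weight(w, a, b, alpha=1.0)

full_params = d * d                   # parameters updated by full fine-tuning
lora_params = len(a) * d + d * len(a) # parameters in A and B combined
```

At this toy scale the saving is small (8 trainable values instead of 16), but for a layer with thousands of input and output dimensions and a rank in the single or low double digits, the adapter holds a tiny fraction of the layer's parameters. B is usually initialized to zero so that training starts from the unmodified pre-trained model.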

**Efficiency gains.** In January 2025, DeepSeek released models that reportedly matched Western frontier systems using roughly one-tenth the training compute, suggesting that efficiency improvements can be as impactful as raw scale [12]. Mixture-of-experts (MoE) architectures route each input to a small number of specialized subnetworks instead of activating every parameter for every input, reducing inference cost.
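The routing step in a mixture-of-experts layer can be sketched as top-k gating: only the k experts with the highest gate scores are evaluated, and their outputs are combined with softmax-normalized weights. In the plain-Python sketch below the experts are stand-in scalar functions and the gate logits are supplied by hand; in a real model both would be learned networks, and all names here are hypothetical.

```python
import math


def top_k_route(gate_logits, k):
    """Select the k experts with the highest gate logits and return a
    mapping from expert index to its softmax-normalized weight."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, gate_logits, k=2):
    """Evaluate only the selected experts and mix their outputs."""
    weights = top_k_route(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four toy experts; only two of them run for any given input.
experts = [lambda x: x, lambda x: 2.0 * x, lambda x: -x, lambda x: 10.0 * x]
gate_logits = [0.0, 5.0, 5.0, -3.0]

output = moe_forward(1.0, experts, gate_logits, k=2)  # 0.5 * 2.0 + 0.5 * (-1.0) = 0.5
```

The total parameter count covers all four experts, but the compute per input scales with k, which is the source of the inference savings.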

**Reasoning and inference-time compute.** A notable shift in 2025 was the move from simply scaling training compute to investing more compute at inference time. "Thinking" models that spend more time reasoning through problems before answering showed significant gains on complex tasks, and this trend is expected to continue through 2026 [12].

**Multimodal models.** Modern foundation models increasingly handle text, images, audio, and video within a single architecture, blurring the boundaries between what were previously separate ML subfields.

**Agentic systems.** ML-powered agents that can plan, use tools, write code, and take actions autonomously represent a growing area of development, moving beyond simple prompt-response interaction.

**AutoML and MLaaS.** Automated machine learning (AutoML) tools are making ML more accessible to non-experts by automating model selection, hyperparameter tuning, and feature engineering. Machine Learning as a Service (MLaaS) platforms, offered by major cloud providers, allow organizations to build and deploy models without managing infrastructure. The AutoML market is projected to grow from $2.34 billion in 2025 to $3.43 billion in 2026, an increase of roughly 46.5% in a single year.
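The hyperparameter-tuning part of AutoML can be illustrated with its simplest form, an exhaustive grid search. This sketch assumes a hypothetical `objective` function that stands in for "train a model with these settings and return its validation loss"; production AutoML systems generally use more sample-efficient search strategies.

```python
import itertools


def grid_search(objective, space):
    """Try every combination in `space` and return the one with the lowest loss."""
    names = list(space)
    best_params, best_loss = None, float("inf")
    for combo in itertools.product(*(space[name] for name in names)):
        params = dict(zip(names, combo))
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


def objective(params):
    """Hypothetical stand-in for a validation loss; a real objective would
    train and evaluate a model using `params`."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["max_depth"] - 4) ** 2


space = {"learning_rate": [0.01, 0.1, 1.0], "max_depth": [2, 4, 8]}
best_params, best_loss = grid_search(objective, space)
```

The number of trials is the product of the option counts (nine here), which is why grid search becomes impractical as the number of hyperparameters grows.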

**Classical ML remains relevant.** For tabular data, time series, and many production applications, gradient-boosted trees (XGBoost, LightGBM) often match or outperform neural approaches while being faster to train, easier to interpret, and cheaper to deploy.

## What is machine learning used for?

Machine learning is applied across nearly every industry. Some prominent areas include:

- **Computer vision:** Image recognition, object detection, facial recognition, medical image analysis, autonomous vehicle perception, and video understanding.
- **Natural language processing (NLP):** Machine translation, text summarization, question answering, chatbots, and sentiment analysis.
- **Speech and audio:** Speech recognition (transcription), text-to-speech synthesis, music generation, and speaker identification.
- **Recommendation systems:** Powering suggestions on platforms like Netflix, Spotify, Amazon, and YouTube by predicting user preferences from behavior data.
- **Healthcare:** Drug discovery, protein structure prediction, medical diagnosis from imaging, electronic health record analysis, and clinical trial optimization. DeepMind's AlphaFold 2 reached a median GDT_TS accuracy of about 92 (on a 0-100 scale) at the CASP14 assessment in 2020, a result widely described as solving the decades-old protein-folding problem [20]. The AI healthcare market is projected to grow from $26.5 billion in 2024 to nearly $188 billion within a decade.
- **Finance:** Fraud detection, credit scoring, algorithmic trading, risk assessment, and anti-money-laundering monitoring. JPMorgan's COIN platform uses NLP to review legal documents, reportedly saving 360,000 hours annually.
- **Science:** Particle physics analysis, climate modeling, genomics, materials discovery, and astronomical survey classification.
- **Robotics:** Control policies for manipulation, locomotion, and navigation learned through reinforcement learning and imitation learning.
- **Autonomous vehicles:** Perception (identifying objects), planning (deciding routes), and control (steering and braking) all rely on ML models.
- **Agriculture:** Crop yield prediction, disease detection in plants, precision farming through drone imagery analysis, and soil health monitoring.
- **Manufacturing:** Predictive maintenance, quality control through visual inspection, supply chain optimization, and process automation.

## Ethical considerations

As machine learning systems are deployed in consequential domains such as hiring, criminal justice, healthcare, and lending, ethical considerations have become a central concern for researchers, practitioners, and policymakers.

### Fairness and bias

ML models can perpetuate or amplify existing societal biases present in training data. For example, a hiring algorithm trained on historical data may discriminate against certain demographic groups if past hiring decisions were biased. A notable case was Amazon's experimental recruiting tool, which was found to penalize resumes containing the word "women's" because the training data reflected the male-dominated composition of the tech industry. Ensuring fairness requires careful attention to data collection, model design, and outcome measurement. Techniques for bias mitigation include re-sampling training data, applying fairness constraints during training, and auditing model outputs across demographic groups.
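Auditing model outputs across demographic groups can start with something as simple as comparing selection rates. The sketch below computes the demographic parity difference, one of several fairness metrics; the function names and the toy predictions are hypothetical.

```python
def selection_rates(predictions, groups):
    """Fraction of positive (1) predictions within each group."""
    totals, positives = {}, {}
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if pred == 1 else 0)
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_difference(predictions, groups):
    """Gap between the highest and lowest group selection rates (0 means parity)."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Toy audit: 1 = candidate shortlisted, 0 = rejected.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = selection_rates(predictions, groups)               # {"a": 0.75, "b": 0.25}
gap = demographic_parity_difference(predictions, groups)   # 0.5
```

A large gap does not by itself prove unfair treatment, and a small gap does not rule it out. Other metrics, such as differences in error rates between groups, capture different notions of fairness and can conflict with one another.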

### Transparency and explainability

Many high-performing ML models, particularly deep neural networks, function as "black boxes" whose internal decision-making processes are difficult to interpret. This lack of transparency is problematic in high-stakes applications where people need to understand why a decision was made. The field of Explainable AI (XAI) addresses this through techniques such as SHAP (SHapley Additive exPlanations), LIME (Local Interpretable Model-agnostic Explanations), and attention visualization.

### Privacy

ML models trained on personal data raise privacy concerns. Models can sometimes memorize and reveal individual training examples. Differential privacy, a mathematical framework that provides formal guarantees about information leakage, has been adopted by organizations like Apple and Google to train models while protecting individual data. Federated learning, where models are trained across decentralized devices without sharing raw data, is another approach to preserving privacy.
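The basic building block of differential privacy is easiest to see on a counting query rather than on model training. The sketch below uses the Laplace mechanism: noise with scale sensitivity / ε is added to the true count. The function name, the toy records, and the choice of ε are assumptions for illustration; differentially private model training relies on related but more involved techniques.

```python
import random


def private_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy via the Laplace mechanism."""
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # adding or removing one record changes a count by at most 1
    scale = sensitivity / epsilon
    # The difference of two independent exponential draws with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


ages = [23, 35, 41, 29, 52, 61, 38, 45]
rng = random.Random(0)
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
```

Smaller values of ε add more noise and give stronger privacy, while larger values give more accurate answers with weaker guarantees.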

### Accountability

Determining responsibility when ML systems make harmful decisions is an ongoing challenge. The question of who is accountable when an autonomous vehicle causes an accident, or when a medical diagnosis system provides an incorrect recommendation, remains unresolved in many legal frameworks.

### Regulatory landscape

Governments are increasingly regulating ML and AI systems. The European Union's AI Act, which entered into force in August 2024, establishes a risk-based framework for AI regulation. Prohibited AI practices and AI literacy obligations took effect in February 2025, and governance rules for general-purpose AI models became applicable in August 2025 [13]. The compliance deadlines for high-risk AI systems were subsequently deferred by the "Digital Omnibus on AI" simplification package, endorsed by the European Parliament on June 16, 2026, and given final approval by the Council of the EU on June 29, 2026 [21]. Under the amended timeline, stand-alone high-risk systems listed in Annex III, originally due to comply by August 2, 2026, now have until December 2, 2027, while high-risk AI embedded in regulated products under Annex I has until August 2, 2028 [21]. Other jurisdictions, including the United States, Canada, China, and Brazil, are developing their own regulatory approaches.

### Environmental impact

Training large ML models consumes significant energy. Training a single large language model can emit hundreds of tons of CO2 equivalent. The growing emphasis on model efficiency (smaller models, distillation, mixture-of-experts architectures) is partly motivated by environmental concerns, alongside cost reduction.

## Limitations and challenges

Despite its successes, machine learning faces several fundamental challenges:

- **Data dependence:** ML models are only as good as their training data. Biased, incomplete, or noisy data leads to poor or unfair models.
- **Generalization:** Models that perform well on training data may fail on data from different distributions (domain shift). Robust generalization remains an active research area.
- **Interpretability:** Complex models (especially deep networks) are difficult to interpret, limiting trust and adoption in regulated industries.
- **Catastrophic forgetting:** When neural networks are trained on new tasks, they tend to forget previously learned information. Continual learning methods attempt to address this.
- **Adversarial vulnerability:** Small, carefully crafted perturbations to input data can cause ML models to make confident but incorrect predictions. Adversarial robustness is an active area of research.
- **Computational cost:** Training state-of-the-art models requires enormous computational resources, limiting access to well-funded organizations.
- **Reproducibility:** Differences in software versions, hardware, random seeds, and data splits can make it difficult to reproduce ML results exactly.
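The adversarial vulnerability described above can be demonstrated on a very small model. The sketch below applies a fast-gradient-sign-style perturbation to a two-feature logistic classifier: each input feature is nudged by `eps` in the direction that increases the loss. The weights, input, and `eps` are made up for illustration, and the technique shown is one well-known attack, not the only one.

```python
import math


def predict(weights, bias, x):
    """Probability of class 1 under a logistic model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def sign(value):
    """Return -1, 0, or 1 according to the sign of `value`."""
    return (value > 0) - (value < 0)


def gradient_sign_attack(weights, bias, x, y, eps):
    """Move each feature by eps in the direction that increases the log loss.
    For a logistic model the input gradient of the loss is (p - y) * w."""
    p = predict(weights, bias, x)
    grad = [(p - y) * w for w in weights]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


weights, bias = [2.0, -3.0], 0.0
x, y = [0.3, 0.1], 1                      # correctly classified as class 1

x_adv = gradient_sign_attack(weights, bias, x, y, eps=0.2)
p_clean = predict(weights, bias, x)       # above 0.5
p_adv = predict(weights, bias, x_adv)     # below 0.5: the prediction flips
```

With only two features the perturbation has to be fairly large to flip the prediction. In high-dimensional inputs such as images, many tiny per-feature changes add up, which is why perturbations that are imperceptible to people can still change a model's output.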

For the foundational ideas and modern frontier systems, see also the AI Wiki articles on [deep learning](/wiki/deep_model), [neural networks](/wiki/neural_network), [reinforcement learning](/wiki/reinforcement_learning_rl), and [artificial intelligence](/wiki/artificial_intelligence).

## References

[1] Samuel, A.L. (1959). "Some Studies in Machine Learning Using the Game of Checkers." *IBM Journal of Research and Development*, 3(3), 210-229.

[2] Mitchell, T.M. (1997). *Machine Learning*. McGraw-Hill. p. 2.

[3] McCulloch, W.S. and Pitts, W. (1943). "A Logical Calculus of the Ideas Immanent in Nervous Activity." *Bulletin of Mathematical Biophysics*, 5(4), 115-133.

[4] Turing, A.M. (1950). "Computing Machinery and Intelligence." *Mind*, 59(236), 433-460.

[5] Rosenblatt, F. (1958). "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain." *Psychological Review*, 65(6), 386-408.

[6] Rumelhart, D.E., Hinton, G.E., and Williams, R.J. (1986). "Learning representations by back-propagating errors." *Nature*, 323, 533-536.

[7] Valiant, L.G. (1984). "A Theory of the Learnable." *Communications of the ACM*, 27(11), 1134-1142.

[8] Cortes, C. and Vapnik, V. (1995). "Support-Vector Networks." *Machine Learning*, 20, 273-297.

[9] Breiman, L. (2001). "Random Forests." *Machine Learning*, 45, 5-32.

[10] Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." *Advances in Neural Information Processing Systems*, 25.

[11] Vaswani, A. et al. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems*, 30.

[12] MIT Sloan Management Review, Epoch AI, and industry trend analyses (2025-2026).

[13] European Parliament and Council of the EU (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (AI Act). *Official Journal of the European Union*.

[14] Breiman, L. (2001). "Statistical Modeling: The Two Cultures." *Statistical Science*, 16(3), 199-231.

[15] Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*. MIT Press.

[16] McKinsey & Company (2024). "The state of AI in early 2024: Gen AI adoption spikes and starts to generate value." QuantumBlack / McKinsey Global Survey.

[17] "New Navy Device Learns By Doing." *The New York Times*, July 8, 1958. (Coverage of the Rosenblatt perceptron demonstration by the U.S. Office of Naval Research.)

[18] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). "ImageNet: A Large-Scale Hierarchical Image Database." *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.

[19] Brown, T.B. et al. (2020). "Language Models are Few-Shot Learners." *Advances in Neural Information Processing Systems*, 33. (GPT-3, 175 billion parameters.)

[20] Jumper, J. et al. (2021). "Highly accurate protein structure prediction with AlphaFold." *Nature*, 596, 583-589. (AlphaFold 2, CASP14.)

[21] Council of the European Union (2026). "Artificial intelligence: Council gives final green light to simplify and streamline rules." Press release, June 29, 2026. consilium.europa.eu. (Digital Omnibus on AI, endorsed by the European Parliament on June 16, 2026: high-risk obligations deferred to December 2, 2027 for stand-alone Annex III systems and August 2, 2028 for Annex I product-embedded systems.)

[22] Koza, J.R., Bennett, F.H., Andre, D., and Keane, M.A. (1996). "Automated Design of Both the Topology and Sizing of Analog Electrical Circuits Using Genetic Programming." *Artificial Intelligence in Design '96*. Springer, Dordrecht, 151-170. (Source of the widely quoted "without being explicitly programmed" paraphrase of Samuel 1959.)

[23] Schmidhuber, J. (2015). "Deep Learning." *Scholarpedia*, 10(11), 32832. (Notes that the term "deep learning" was introduced to machine learning by Dechter in 1986 and to artificial neural networks by Aizenberg et al. in 2000, before Hinton's 2006 deep belief network work brought it to prominence.)