See also: Machine learning terms and artificial intelligence
Machine learning (ML) is a branch of artificial intelligence that gives computers the ability to learn from data and improve their performance on tasks without being explicitly programmed. Rather than following rigid, hand-coded rules, ML systems build mathematical models from sample data (known as training data) in order to make predictions or decisions.
The term was popularized by Arthur Samuel in 1959, who defined it as a "field of study that gives computers the ability to learn without being explicitly programmed" while working on a checkers-playing program at IBM [1]. A more precise and widely cited definition was later provided by Tom Mitchell in his 1997 textbook Machine Learning: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E" [2]. Mitchell's formulation is valued for its rigor because it decomposes the concept of learning into three measurable components: experience (the data), the task (what the program should accomplish), and the performance measure (how success is quantified).
Machine learning sits at the intersection of computer science, statistics, and mathematics. It is closely related to data science, which focuses on extracting knowledge from data, and to computational statistics, which emphasizes making predictions with computers. As of 2026, the field is dominated by foundation models and large language models, but classical ML techniques remain widely used in industry for tabular data, time-series forecasting, and many production systems.
Imagine you have a toy box full of different toys. Every day, you play with some of them and eventually decide which ones are your favorites. Now imagine a computer program trying to figure out which toys you like best. At first, it does not know anything and just makes guesses. But as you play with more toys and tell it "I liked this one" or "I didn't like that one," the program gets better at guessing which toys you will enjoy next time.
This process of getting better from experience is called "machine learning." Just as you keep learning and discovering new favorites, the computer program keeps improving at figuring out what you like. That is what makes machine learning useful: the computer teaches itself by looking at lots of examples, without someone having to write out every single rule.
The intellectual roots of machine learning stretch back to the mid-twentieth century, with several decades of breakthroughs building on one another.
In 1943, Warren McCulloch and Walter Pitts published a paper describing a computational model of neural networks based on mathematics and threshold logic, establishing one of the earliest theoretical frameworks for how brain-like computation could work [3]. In 1949, Donald Hebb published The Organization of Behavior, introducing a learning rule ("Hebbian learning") that proposed how neural pathways strengthen through repeated activation.
In 1950, Alan Turing published "Computing Machinery and Intelligence" in the journal Mind, posing the question "Can machines think?" and proposing what became known as the Turing test [4]. The paper also discussed the concept of a "learning machine" that could be taught through experience, laying philosophical groundwork for the field.
In 1952, Arthur Samuel at IBM began developing a checkers-playing program that could improve its play over time by learning from past games. He demonstrated the program publicly in 1956 and published his landmark paper, "Some Studies in Machine Learning Using the Game of Checkers," in 1959 [1]. The program was one of the first successful demonstrations of self-learning software.
In 1958, Frank Rosenblatt at the Cornell Aeronautical Laboratory unveiled the perceptron, the first algorithm that could learn weights from input data to perform binary classification [5]. The U.S. Office of Naval Research demonstrated it publicly on July 7, 1958, using an IBM 704 computer that taught itself to distinguish cards marked on the left from cards marked on the right after 50 trials.
In 1960, Rosenblatt's team built the Mark I Perceptron, a physical machine with an array of photocells that could learn to recognize simple shapes. However, in 1969, Marvin Minsky and Seymour Papert published Perceptrons, which mathematically demonstrated limitations of single-layer perceptrons (they could not learn the XOR function, for example). This contributed to a decline in neural network research funding, a period often called the first "AI winter."
Interest in neural networks revived in the 1980s. The most significant development was the 1986 publication of "Learning representations by back-propagating errors" by David Rumelhart, Geoffrey Hinton, and Ronald Williams in Nature [6]. While the mathematical foundations of backpropagation had been explored earlier by Seppo Linnainmaa (1970) and Paul Werbos (1974), the 1986 paper demonstrated that multi-layer networks trained with backpropagation could learn useful internal representations, overcoming the limitations identified by Minsky and Papert.
During this same period, researchers explored other approaches. Decision tree algorithms such as ID3 (1986) and C4.5 (1993), developed by Ross Quinlan, became popular for their interpretability. The Probably Approximately Correct (PAC) learning framework, introduced by Leslie Valiant in 1984, provided a theoretical foundation for computational learning theory, formalizing what it means for an algorithm to learn a concept from examples with quantifiable guarantees [7].
In 1995, Corinna Cortes and Vladimir Vapnik published "Support-Vector Networks" in Machine Learning, introducing support vector machines (SVMs) for classification [8]. SVMs found optimal separating hyperplanes in high-dimensional feature spaces using the "kernel trick" and became one of the most widely used algorithms throughout the late 1990s and 2000s.
In 2001, Leo Breiman published his paper on random forests in Machine Learning, describing an ensemble learning method that combines many decision trees trained on random subsets of data and features [9]. The paper became one of the most cited in the field. Breiman's method corrected for the tendency of individual decision trees to overfit, and random forests proved effective across a wide range of problems.
Boosting methods also gained prominence during this era. AdaBoost was introduced by Yoav Freund and Robert Schapire in 1997, and gradient boosting was formalized by Jerome Friedman in 2001.
The modern era of machine learning began in earnest in 2012 when Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered their deep learning model, AlexNet, in the ImageNet Large Scale Visual Recognition Challenge [10]. AlexNet achieved a top-5 error rate of 15.3%, outperforming the runner-up by more than 10 percentage points. Previous winners had typically used hand-engineered features fed into SVMs, with progress measured in fractions of a percent.
AlexNet's success was enabled by three converging factors: large-scale labeled datasets (ImageNet contained over 1.2 million images), general-purpose GPU computing via NVIDIA's CUDA platform, and improved training techniques for deep networks. This result triggered massive investment in deep learning research.
The table below summarizes major milestones in the history of machine learning:
| Year | Milestone |
|---|---|
| 1943 | McCulloch and Pitts publish a computational model of neural networks |
| 1949 | Donald Hebb proposes Hebbian learning in The Organization of Behavior |
| 1950 | Alan Turing publishes "Computing Machinery and Intelligence" |
| 1952 | Arthur Samuel begins developing a self-learning checkers program at IBM |
| 1957 | Frank Rosenblatt designs the perceptron |
| 1959 | Arthur Samuel coins the term "machine learning" |
| 1969 | Minsky and Papert publish Perceptrons, contributing to the first AI winter |
| 1979 | Stanford Cart navigates a room of obstacles using machine vision |
| 1984 | Leslie Valiant introduces the PAC learning framework |
| 1986 | Rumelhart, Hinton, and Williams publish the backpropagation paper |
| 1995 | Cortes and Vapnik introduce support vector machines |
| 1997 | Tom Mitchell publishes formal definition of machine learning |
| 2001 | Leo Breiman publishes the random forests paper |
| 2006 | Geoffrey Hinton and colleagues introduce deep belief networks, helping popularize deep learning |
| 2012 | AlexNet wins ImageNet competition, sparking the deep learning revolution |
| 2014 | Ian Goodfellow and colleagues introduce generative adversarial networks (GANs) |
| 2016 | AlphaGo defeats world champion Lee Sedol at Go |
| 2017 | Vaswani et al. introduce the transformer architecture |
| 2018 | BERT and GPT demonstrate large-scale pre-training for NLP |
| 2022 | ChatGPT brings large language models into mainstream public awareness |
| 2025 | DeepSeek demonstrates efficiency breakthroughs; reasoning models emerge |
Machine learning methods are typically categorized by the type of signal or feedback available during training.
Supervised machine learning is the most common paradigm. The algorithm is trained on a labeled dataset where each input example is paired with a known output (the label or target). The goal is to learn a mapping function from inputs to outputs so the model can predict labels for new, unseen data.
Supervised learning problems fall into two main categories: classification, where the output is a discrete category (for example, spam or not spam), and regression, where the output is a continuous value (for example, a house price).
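The following minimal sketch illustrates the supervised workflow with scikit-learn; the synthetic dataset, model choice, and split ratio are illustrative assumptions rather than recommendations.

```python
# Minimal supervised-learning sketch (illustrative only): learn a mapping from
# labeled examples, then predict labels for held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Labeled data: each row of X is an input example, y holds the known labels.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                            # learn the input-to-output mapping
print(accuracy_score(y_test, model.predict(X_test)))   # evaluate on unseen data
```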
Unsupervised machine learning involves training on data without labels. The algorithm attempts to find hidden patterns, groupings, or structure in the data on its own. Common unsupervised tasks include clustering, dimensionality reduction, and anomaly detection.
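A comparable minimal sketch of unsupervised learning, here k-means clustering on unlabeled points (the blob data and the choice of k = 3 are illustrative assumptions):

```python
# Minimal unsupervised-learning sketch: group unlabeled points into clusters.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # true labels are discarded
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)    # assign each point to one of 3 discovered clusters
print(kmeans.cluster_centers_)         # learned cluster centroids
```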
Semi-supervised learning uses a small amount of labeled data combined with a large amount of unlabeled data during training. This approach is practical because labeling data is often expensive and time-consuming, while unlabeled data is abundant. For instance, a medical imaging system might have millions of X-ray images but only a few thousand with expert annotations. Semi-supervised methods, such as self-training and co-training, leverage the structure of the unlabeled data to improve learning beyond what the labeled examples alone could provide.
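The sketch below shows one semi-supervised approach, self-training, using scikit-learn's wrapper; the synthetic data and the fraction of hidden labels are illustrative assumptions (the library marks unlabeled examples with -1):

```python
# Semi-supervised sketch: self-training, where the model's confident predictions
# on unlabeled data are used as pseudo-labels for further training.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, random_state=0)
y_partial = y.copy()
rng = np.random.default_rng(0)
y_partial[rng.random(len(y)) < 0.9] = -1      # pretend 90% of labels are unavailable

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)                       # learns from both labeled and unlabeled data
```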
Self-supervised learning generates its own supervisory signals from the input data itself, without requiring human-provided labels. The model learns by solving a "pretext task" derived from the data structure. For example, a language model might learn to predict the next word in a sentence, or an image model might learn to fill in masked patches of an image.
This approach underpins modern foundation models like GPT and BERT, which are pre-trained on massive text corpora using self-supervised objectives before being fine-tuned for specific tasks. Self-supervised learning has proven remarkably effective because it allows models to learn rich, general-purpose representations from virtually unlimited unlabeled data.
Reinforcement learning (RL) takes a fundamentally different approach. An agent learns to make decisions by taking actions in an environment and receiving feedback in the form of rewards or penalties. The goal is to learn a policy that maximizes cumulative reward over time. The agent must balance exploration (trying new actions to discover their consequences) with exploitation (choosing actions known to yield high rewards).
RL has achieved remarkable results in game-playing (DeepMind's AlphaGo defeated world champion Lee Sedol at Go in 2016), robotics control, and resource management. Reinforcement learning from human feedback (RLHF) has become a standard technique for aligning large language models with human preferences.
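The sketch below shows tabular Q-learning, one of the simplest RL algorithms, on a toy five-state corridor; the environment, rewards, and hyperparameters are illustrative assumptions:

```python
# Tabular Q-learning sketch: the agent learns, by trial and error, that stepping
# right leads to the rewarded goal state.
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = step left, 1 = step right
Q = np.zeros((n_states, n_actions))   # value estimate for each (state, action) pair
alpha, gamma, epsilon = 0.1, 0.9, 0.1 # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

def pick_action(state):
    # Exploration vs. exploitation: occasionally act randomly; ties broken at random.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    best = np.flatnonzero(Q[state] == Q[state].max())
    return int(rng.choice(best))

for episode in range(300):
    state = 0
    for _ in range(100):                                  # cap episode length
        action = pick_action(state)
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0   # reward only at the goal
        # Q-learning update: move the estimate toward reward + discounted future value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state
        if state == n_states - 1:
            break

print(Q.argmax(axis=1))   # learned policy: prefers "step right" in states 0-3
```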
| Learning type | Training data | Goal | Example applications |
|---|---|---|---|
| Supervised | Labeled (input-output pairs) | Learn a mapping from inputs to outputs | Spam detection, price prediction, medical diagnosis |
| Unsupervised | Unlabeled | Discover hidden structure or patterns | Customer segmentation, anomaly detection, topic modeling |
| Semi-supervised | Small labeled set + large unlabeled set | Improve learning by leveraging unlabeled data | Medical imaging, web content classification |
| Self-supervised | Unlabeled (labels derived from data) | Learn general representations via pretext tasks | Language model pre-training (GPT, BERT), image pre-training |
| Reinforcement | Reward signals from environment | Learn a policy to maximize cumulative reward | Game playing, robotics, recommendation systems |
The table below summarizes widely used machine learning algorithms, organized by learning type and typical use cases.
| Algorithm | Type | Task | Description |
|---|---|---|---|
| Linear regression | Supervised | Regression | Models the relationship between input features and a continuous output using a linear equation. One of the simplest and most interpretable ML methods. |
| Logistic regression | Supervised | Classification | Despite the name, it is a classification method that estimates the probability of a binary outcome using the logistic (sigmoid) function. |
| Decision tree | Supervised | Both | Builds a tree-like structure of if-then rules to split data based on feature values. Highly interpretable but prone to overfitting. |
| Random forest | Supervised | Both | An ensemble of many decision trees, each trained on a random subset of data and features. Reduces overfitting compared to individual trees. Introduced by Breiman in 2001 [9]. |
| Support vector machine (SVM) | Supervised | Classification | Finds the optimal hyperplane that maximizes the margin between classes. Effective in high-dimensional spaces using kernel functions [8]. |
| K-nearest neighbors (k-NN) | Supervised | Both | Classifies a data point based on the majority label among its k closest neighbors in the feature space (or averages their values for regression). Simple but can be slow for large datasets. |
| Naive Bayes | Supervised | Classification | Applies Bayes' theorem with an assumption of feature independence. Fast and effective for text classification tasks like spam filtering. |
| Gradient boosting (XGBoost, LightGBM) | Supervised | Both | Sequentially builds trees where each new tree corrects errors made by the previous ones. Often achieves state-of-the-art results on tabular data. |
| Neural network | Supervised / Self-supervised | Both | Models inspired by biological neurons, consisting of layers of interconnected nodes. Deep neural networks with many layers form the basis of deep learning. |
| K-means | Unsupervised | Clustering | Partitions data into k clusters by iteratively assigning points to the nearest cluster centroid and updating centroids. |
| Principal component analysis (PCA) | Unsupervised | Dimensionality reduction | Projects data onto a lower-dimensional subspace that captures the most variance. |
| DBSCAN | Unsupervised | Clustering | Density-based clustering algorithm that groups together closely packed points and marks points in low-density regions as outliers. Does not require specifying the number of clusters in advance. |
Building a machine learning system involves a sequence of steps, often called the ML pipeline. Each step is important; poor data preparation or incorrect evaluation can undermine even the most sophisticated algorithm.
The process begins with gathering relevant data. Sources vary widely: databases, APIs, web scraping, sensors, surveys, or public datasets. The quantity and quality of data have a direct impact on model performance. Andrew Ng has frequently emphasized that for many practical applications, improving the data yields better results than improving the algorithm.
Raw data is rarely clean. Preprocessing includes handling missing values (imputation or removal), removing duplicates, correcting errors, encoding categorical variables (one-hot encoding, label encoding), and normalizing or standardizing numerical features so they share a common scale. Outlier detection and treatment is also a common preprocessing step.
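A common way to assemble these preprocessing steps is a scikit-learn pipeline; the toy DataFrame and column names below are illustrative assumptions:

```python
# Preprocessing sketch: impute missing values, scale numeric columns, and
# one-hot encode a categorical column in a single transformer.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, None, 47, 35],                 # missing value to impute
    "income": [40000, 52000, None, 61000],
    "city": ["Paris", "Lyon", "Paris", "Nice"],
})

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])

preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
X = preprocess.fit_transform(df)               # numeric matrix ready for a model
```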
Feature engineering is the process of creating, selecting, or transforming input variables to improve model performance. This might involve combining existing features (e.g., calculating a price-per-square-foot feature from price and area), extracting date components (day of week, month), or applying domain-specific transformations.
Although deep learning has reduced the need for manual feature engineering in some domains (images, text, audio), it remains critically important for tabular data problems. Good feature engineering requires domain knowledge and can often make the difference between a mediocre model and an excellent one.
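A small pandas sketch of the transformations mentioned above (the housing columns and values are illustrative assumptions):

```python
# Feature engineering sketch: derive new features from existing columns.
import pandas as pd

df = pd.DataFrame({
    "price": [300000, 450000, 210000],
    "area_sqft": [1500, 2200, 950],
    "sale_date": pd.to_datetime(["2024-03-02", "2024-07-15", "2024-11-30"]),
})

df["price_per_sqft"] = df["price"] / df["area_sqft"]   # combine existing features
df["sale_month"] = df["sale_date"].dt.month            # extract date components
df["sale_dayofweek"] = df["sale_date"].dt.dayofweek
```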
Representation learning is a closely related concept where the model itself learns useful features from raw data rather than relying on human-designed transformations. Neural networks excel at representation learning: convolutional layers learn visual features, and transformer layers learn contextual text representations. This shift from manual feature engineering to learned representations is one of the defining characteristics of the deep learning era.
Choosing an appropriate algorithm depends on the problem type (classification, regression, clustering), dataset size, number of features, interpretability requirements, and computational constraints. Practitioners often try several algorithms and compare their performance. The "no free lunch" theorem states that no single algorithm is universally best across all problems, so empirical comparison is essential.
During training, the model learns parameters from the training data. For supervised learning, this means minimizing a loss function that measures the difference between predictions and actual labels. Gradient descent and its variants (stochastic gradient descent, Adam, AdaGrad) are the most common optimization algorithms. Training may take seconds for simple models on small datasets, or weeks on clusters of GPUs for large neural networks.
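As a minimal illustration of gradient descent (not the exact optimizer of any particular library), the sketch below fits a one-variable linear model by repeatedly stepping against the gradient of the mean squared error; the synthetic data, learning rate, and step count are assumptions:

```python
# Gradient descent sketch: learn slope w and intercept b that minimize MSE.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=200)    # true slope 3, intercept 2

w, b, lr = 0.0, 0.0, 0.01
for step in range(2000):
    error = (w * x + b) - y
    grad_w = 2 * np.mean(error * x)    # gradient of MSE with respect to w
    grad_b = 2 * np.mean(error)        # gradient of MSE with respect to b
    w -= lr * grad_w                   # step against the gradient
    b -= lr * grad_b

print(round(w, 2), round(b, 2))        # should end up close to 3 and 2
```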
After training, the model is evaluated on a held-out test set that it has never seen before. The choice of evaluation metric depends on the task (see the Evaluation metrics section below). It is essential to evaluate on data separate from the training set to get an honest estimate of how the model will perform in production.
Most ML algorithms have hyperparameters (settings that are not learned from data but set before training), such as the learning rate, number of trees, regularization strength, or network depth. Hyperparameter tuning involves searching for the best combination of these settings. Common approaches include grid search, random search, and Bayesian optimization.
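A common way to run such a search is scikit-learn's GridSearchCV, which evaluates every combination with cross-validation; the parameter grid below is an illustrative assumption:

```python
# Hyperparameter tuning sketch: exhaustive grid search with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5,                         # 5-fold cross-validation for each combination
    scoring="f1",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```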
Once a model meets performance requirements, it is deployed into a production environment where it makes predictions on new data. Deployment methods range from REST APIs to embedded systems to batch processing jobs. After deployment, ongoing monitoring is needed to detect performance degradation (model drift), where the statistical properties of the input data change over time.
Different tasks call for different metrics. Using the wrong metric can give a misleading picture of model quality.
| Metric | Formula | When to use |
|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced classes; overall correctness |
| Precision | TP / (TP + FP) | When false positives are costly (e.g., spam filtering) |
| Recall (Sensitivity) | TP / (TP + FN) | When false negatives are costly (e.g., disease screening) |
| F1 score | 2 * (Precision * Recall) / (Precision + Recall) | Imbalanced classes; balance of precision and recall |
| AUC-ROC | Area under the ROC curve | Evaluating performance across all classification thresholds |
TP = true positives, TN = true negatives, FP = false positives, FN = false negatives.
The precision-recall tradeoff is a common consideration: increasing precision typically reduces recall and vice versa. The right balance depends on the application. A cancer screening system should prioritize recall (catching all true cases), while a recommendation system might prioritize precision (avoiding irrelevant suggestions).
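These metrics can be computed directly with scikit-learn; the small label arrays below are illustrative:

```python
# Classification metrics sketch: accuracy, precision, recall, F1, and AUC-ROC.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                     # hard class predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]     # predicted probabilities

print("accuracy ", accuracy_score(y_true, y_pred))
print("precision", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("recall   ", recall_score(y_true, y_pred))       # TP / (TP + FN)
print("F1       ", f1_score(y_true, y_pred))
print("AUC-ROC  ", roc_auc_score(y_true, y_score))     # needs scores, not hard labels
```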
| Metric | Description |
|---|---|
| Mean Squared Error (MSE) | Average of the squared differences between predicted and actual values. Penalizes large errors heavily. |
| Root Mean Squared Error (RMSE) | Square root of MSE; in the same units as the target variable. |
| Mean Absolute Error (MAE) | Average of absolute differences. Less sensitive to outliers than MSE. |
| R-squared (R²) | Proportion of variance in the target explained by the model. At most 1 (a perfect fit); can be negative when the model performs worse than predicting the mean. |
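The regression metrics above are likewise available in scikit-learn; the toy values below are illustrative:

```python
# Regression metrics sketch: MSE, RMSE, MAE, and R².
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mse = mean_squared_error(y_true, y_pred)
print("MSE ", mse)
print("RMSE", np.sqrt(mse))                            # same units as the target
print("MAE ", mean_absolute_error(y_true, y_pred))
print("R²  ", r2_score(y_true, y_pred))
```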
The bias-variance tradeoff is a fundamental concept in machine learning that describes the tension between two sources of prediction error: bias, error from overly simplistic assumptions that cause the model to underfit, and variance, error from excessive sensitivity to the training data that causes the model to overfit.
The total prediction error can be decomposed as: Error = Bias² + Variance + Irreducible Noise. The irreducible noise is inherent randomness in the data that no model can eliminate.
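For squared-error loss, one standard way to write this decomposition, with f the true function, f̂ the learned model, and σ² the irreducible noise, is:

```latex
\mathbb{E}\!\left[\big(y - \hat{f}(x)\big)^{2}\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^{2}}_{\text{Bias}^{2}}
  + \underbrace{\mathbb{E}\!\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^{2}\right]}_{\text{Variance}}
  + \sigma^{2}
```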
The goal is to find a model complexity that is low enough to avoid overfitting but high enough to capture the true underlying patterns in the data. In practice, this balance is managed through regularization and cross-validation.
Regularization is a set of techniques that constrain or penalize model complexity to reduce overfitting. Common examples include L1 (lasso) and L2 (ridge) penalties on model weights, dropout in neural networks, and early stopping.
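A minimal sketch of L2 and L1 penalties using scikit-learn's Ridge and Lasso estimators (the synthetic data and penalty strength alpha are illustrative assumptions):

```python
# Regularization sketch: ridge (L2) shrinks weights; lasso (L1) can zero them out.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=20, n_informative=10,
                       noise=10.0, random_state=0)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)     # L2 penalty shrinks weights toward zero
lasso = Lasso(alpha=1.0).fit(X, y)     # L1 penalty drives some weights exactly to zero

print("non-zero weights:",
      (plain.coef_ != 0).sum(), (ridge.coef_ != 0).sum(), (lasso.coef_ != 0).sum())
```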
Cross-validation is a resampling technique used to evaluate model performance more reliably than a single train-test split. The most common form is k-fold cross-validation: the dataset is divided into k equally sized subsets (folds). The model is trained k times, each time using k-1 folds for training and the remaining fold for validation. The final performance estimate is the average across all k runs. Common choices for k are 5 and 10.
Cross-validation helps detect overfitting and provides a more robust performance estimate, especially when data is limited. Stratified k-fold cross-validation preserves the proportion of each class in every fold, which is important for imbalanced datasets. Leave-one-out cross-validation (LOOCV) is an extreme case where k equals the number of data points; it is computationally expensive but useful for very small datasets.
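A minimal sketch of stratified 5-fold cross-validation with scikit-learn (the imbalanced synthetic dataset is an illustrative assumption):

```python
# Cross-validation sketch: 5 stratified folds, reporting mean and spread of F1.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # preserves class ratios
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv, scoring="f1")
print(scores.mean(), scores.std())     # average performance and its variability
```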
Deep learning is a subset of machine learning that uses neural networks with many layers (hence "deep"). While classical ML algorithms like random forests and SVMs require hand-crafted features, deep learning models can automatically learn hierarchical representations from raw data.
Key deep learning architectures include convolutional neural networks (CNNs) for images, recurrent neural networks (RNNs) for sequential data, transformers for language and, increasingly, other modalities, and generative adversarial networks (GANs) for data generation. A minimal training example appears below.
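A minimal sketch of defining a small feedforward network and taking one training step in PyTorch (layer sizes and the random batch are illustrative assumptions):

```python
# Deep learning sketch: forward pass, loss, backpropagation, and a weight update.
import torch
from torch import nn

model = nn.Sequential(               # a small multi-layer ("deep") network
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1),
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X = torch.randn(64, 10)              # a batch of 64 examples with 10 features
y = torch.randn(64, 1)

loss = loss_fn(model(X), y)          # forward pass and loss computation
optimizer.zero_grad()
loss.backward()                      # backpropagation computes gradients
optimizer.step()                     # optimizer updates the weights
print(loss.item())
```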
Machine learning is best understood as a subfield of artificial intelligence, which itself is the broader discipline concerned with creating systems that can perform tasks that typically require human intelligence. AI encompasses many approaches beyond ML, including symbolic reasoning, expert systems, and search algorithms.
Statistics and machine learning share substantial overlap but differ in emphasis. Statistics traditionally focuses on inference: drawing conclusions about populations from samples, quantifying uncertainty, and testing hypotheses. Machine learning prioritizes prediction: building models that generalize well to unseen data, often with less concern for interpretability or inferential guarantees. Leo Breiman articulated this distinction in his influential 2001 paper "Statistical Modeling: The Two Cultures," contrasting the "data modeling" culture of statistics with the "algorithmic modeling" culture of machine learning.
Computational learning theory provides the mathematical foundations for understanding what can and cannot be learned efficiently. Key results include the PAC (Probably Approximately Correct) learning framework introduced by Leslie Valiant in 1984 [7] and the VC (Vapnik-Chervonenkis) dimension, which quantifies the capacity of a class of models.
Data science is a related but distinct field that focuses on extracting insights and knowledge from data using a combination of statistics, programming, and domain expertise. Machine learning provides many of the predictive modeling tools that data scientists use, but data science also includes data cleaning, exploratory analysis, visualization, and communication of results. Not every data science project requires machine learning, and not every ML project fits neatly into data science.
Data mining focuses on discovering previously unknown patterns in large datasets. Whereas machine learning is typically evaluated on how well it predicts known outcomes learned from training data, data mining emphasizes finding novel, useful patterns. In practice, the two disciplines use many of the same techniques.
Deep learning is a subset of machine learning, which is a subset of AI. This nesting relationship is sometimes illustrated as concentric circles: AI on the outside, ML inside it, and deep learning at the core.
The machine learning ecosystem has matured significantly. Most ML development happens in Python, with a rich set of open-source libraries.
| Framework | Developer | Primary use | Notes |
|---|---|---|---|
| scikit-learn | Community (originally Inria) | Classical ML | The standard library for non-deep-learning tasks: classification, regression, clustering, preprocessing, and model evaluation. Stable API, excellent documentation. |
| PyTorch | Meta (Facebook) | Deep learning | Known for its dynamic computation graph and Pythonic design. Dominant in research. Serves as the foundation for models like GPT and Llama. |
| TensorFlow | Google | Deep learning | Production-focused framework with strong deployment tools (TensorFlow Serving, TensorFlow Lite for mobile). Widely used in industry. |
| Keras | Francois Chollet / Google | Deep learning (high-level API) | User-friendly API that can run on top of TensorFlow, PyTorch, or JAX. Good for prototyping and beginners. |
| XGBoost | Tianqi Chen | Gradient boosting | Extremely popular for tabular data competitions and production systems. Offers speed and regularization improvements over earlier boosting implementations. |
| LightGBM | Microsoft | Gradient boosting | Uses histogram-based algorithms for faster training on large datasets. |
| JAX | Google | Numerical computing / DL | Combines NumPy-like syntax with automatic differentiation and XLA compilation. Increasingly used for ML research. |
| Hugging Face Transformers | Hugging Face | NLP / Foundation models | Provides pre-trained transformer models and tools for fine-tuning. The de facto hub for sharing and using language models. |
| MLflow | Databricks / Community | Experiment tracking | Open-source platform for managing the ML lifecycle, including experiment tracking, model registry, and deployment. |
As machine learning has moved from research to production, a new engineering discipline called MLOps (Machine Learning Operations) has emerged. MLOps applies DevOps principles to ML systems, addressing the unique challenges of managing data-dependent, continuously evolving models in production environments.
Key components of ML infrastructure include experiment tracking, model registries and versioning, automated training and deployment pipelines, and monitoring of deployed models for drift and performance degradation.
As of 2025, surveys indicate that over 70% of enterprises have adopted or are actively implementing MLOps practices, reflecting the maturity of the field.
As of early 2026, machine learning is defined by several intersecting trends.
Foundation models as infrastructure. Large pre-trained models, particularly large language models, have shifted from experimental curiosities to production infrastructure. Companies fine-tune pre-trained backbones with lightweight adapters (such as LoRA) for specialized domains rather than training from scratch [12].
Efficiency gains. In January 2025, DeepSeek released models that matched Western frontier systems using roughly one-tenth the training compute, demonstrating that efficiency improvements can be as impactful as raw scale [12]. Mixture-of-experts (MoE) architectures route inputs to specialized subnetworks instead of activating every parameter for every input, reducing inference cost.
Reasoning and inference-time compute. A notable shift in 2025 was the move from simply scaling training compute to investing more compute at inference time. "Thinking" models that spend more time reasoning through problems before answering showed significant gains on complex tasks, and this trend is expected to continue through 2026 [12].
Multimodal models. Modern foundation models increasingly handle text, images, audio, and video within a single architecture, blurring the boundaries between what were previously separate ML subfields.
Agentic systems. ML-powered agents that can plan, use tools, write code, and take actions autonomously represent a growing area of development, moving beyond simple prompt-response interaction.
AutoML and MLaaS. Automated machine learning (AutoML) tools are making ML more accessible to non-experts by automating model selection, hyperparameter tuning, and feature engineering. Machine Learning as a Service (MLaaS) platforms, offered by major cloud providers, allow organizations to build and deploy models without managing infrastructure. The AutoML market is projected to grow from $2.34 billion in 2025 to $3.43 billion in 2026, reflecting a 46.5% CAGR.
Classical ML remains relevant. For tabular data, time series, and many production applications, gradient boosted trees (XGBoost, LightGBM) continue to outperform or match neural approaches while being faster to train, easier to interpret, and cheaper to deploy.
Machine learning is applied across nearly every industry. Prominent areas include healthcare (medical imaging and diagnosis), finance (fraud and anomaly detection, credit scoring), e-commerce (recommendation systems and customer segmentation), email and web services (spam filtering and content classification), robotics, and autonomous vehicles.
As machine learning systems are deployed in consequential domains such as hiring, criminal justice, healthcare, and lending, ethical considerations have become a central concern for researchers, practitioners, and policymakers.
ML models can perpetuate or amplify existing societal biases present in training data. For example, a hiring algorithm trained on historical data may discriminate against certain demographic groups if past hiring decisions were biased. A notable case was Amazon's experimental recruiting tool, which was found to penalize resumes containing the word "women's" because the training data reflected the male-dominated composition of the tech industry. Ensuring fairness requires careful attention to data collection, model design, and outcome measurement. Techniques for bias mitigation include re-sampling training data, applying fairness constraints during training, and auditing model outputs across demographic groups.
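As a simple illustration of auditing outcomes across groups, the sketch below compares prediction accuracy per group with pandas; the column names and tiny dataset are illustrative assumptions:

```python
# Fairness audit sketch: compare a model's accuracy across demographic groups.
import pandas as pd

results = pd.DataFrame({
    "group":     ["A", "A", "A", "B", "B", "B"],
    "label":     [1, 0, 1, 1, 0, 0],
    "predicted": [1, 0, 1, 0, 0, 1],
})
results["correct"] = results["label"] == results["predicted"]
print(results.groupby("group")["correct"].mean())   # large gaps warrant investigation
```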
Many high-performing ML models, particularly deep neural networks, function as "black boxes" whose internal decision-making processes are difficult to interpret. This lack of transparency is problematic in high-stakes applications where people need to understand why a decision was made. The field of Explainable AI (XAI) addresses this through techniques such as SHAP (SHapley Additive exPlanations), LIME (Local Interpretable Model-agnostic Explanations), and attention visualization.
ML models trained on personal data raise privacy concerns. Models can sometimes memorize and reveal individual training examples. Differential privacy, a mathematical framework that provides formal guarantees about information leakage, has been adopted by organizations like Apple and Google to train models while protecting individual data. Federated learning, where models are trained across decentralized devices without sharing raw data, is another approach to preserving privacy.
Determining responsibility when ML systems make harmful decisions is an ongoing challenge. The question of who is accountable when an autonomous vehicle causes an accident, or when a medical diagnosis system provides an incorrect recommendation, remains unresolved in many legal frameworks.
Governments are increasingly regulating ML and AI systems. The European Union's AI Act, which entered into force in August 2024, establishes a risk-based framework for AI regulation. Prohibited AI practices and AI literacy obligations took effect in February 2025, governance rules for general-purpose AI models became applicable in August 2025, and comprehensive compliance requirements for high-risk AI systems are scheduled for August 2026 [13]. Other jurisdictions, including the United States, Canada, China, and Brazil, are developing their own regulatory approaches.
Training large ML models consumes significant energy. Training a single large language model can emit hundreds of tons of CO2 equivalent. The growing emphasis on model efficiency (smaller models, distillation, mixture-of-experts architectures) is partly motivated by environmental concerns, alongside cost reduction.
Despite its successes, machine learning faces several fundamental challenges: the need for large quantities of high-quality data, the limited interpretability of complex models, vulnerability to distribution shift between training and deployment data, the energy cost of training large models, and the risk of encoding and amplifying societal biases present in data.