Neural Network
Last reviewed
Sources
No citations yet
Review status
Needs citations
Revision
v9 · 6,293 words
A neural network (also called an artificial neural network or ANN) is a computational model, loosely inspired by the networks of biological neurons in animal brains, that learns to perform tasks by adjusting the strengths of connections between simple processing units rather than by following explicit programmed rules. It consists of layered groups of artificial neurons that transmit numerical signals along weighted connections; by tuning these weights on data, the network can approximate almost any function, from classifying images to generating text. Since the early 2010s, deep neural networks (networks with many stacked layers) have become the dominant model class in machine learning and the technological foundation of deep learning and modern artificial intelligence[1][2]. The earliest mathematical model of an artificial neuron was published by Warren McCulloch and Walter Pitts in 1943, the first trainable network (the perceptron) by Frank Rosenblatt in 1958, and the backpropagation learning algorithm that made deep networks practical was popularized in 1986[3][5][9].
This article is a high-level survey. It traces the eight-decade history of neural networks, sketches their mathematical structure and training procedure, and links out to dedicated articles for each major architecture (such as the convolutional neural network, recurrent neural network, and transformer), training technique, and theoretical concept.
Imagine a huge team of tiny helpers, where each helper can only do one very simple thing: look at some numbers coming in, multiply them, add them up, and pass the result to the next helper. Alone, none of them are very smart. But when you line up thousands of these helpers in rows and connect them together, something amazing happens. You can show the whole team a picture of a cat, and after the numbers pass through all the helpers, the team says "cat!" at the end.
How does the team get so good? By practice. At first, the helpers give wrong answers. Each time they are wrong, a coach goes backward through the team and tells each helper to nudge its multiplication number a tiny bit. After seeing thousands of pictures of cats, dogs, and cars, the helpers settle on numbers that work, and the team can recognize new things it has never seen before.
That is essentially how a neural network works. The "helpers" are artificial neurons, the "multiplication numbers" are weights, and the "coach going backward" is an algorithm called backpropagation.
At a high level, a neural network turns an input (such as the pixels of an image or the tokens of a sentence) into an output (such as a label or the next word) by passing numbers through a sequence of layers. Each artificial neuron multiplies its inputs by learned weights, adds a bias, and applies a nonlinear activation function; stacking many such layers lets the network build up increasingly abstract representations of the data. The network is not programmed with the rules for a task. Instead, it is shown many examples and learns the weights that minimize its errors, using the forward pass, a loss function, and the backward pass described in the Training section below. The mathematical formulation that follows makes this picture precise.
The history of neural networks spans more than eighty years and is conventionally divided into periods of progress separated by stretches of reduced funding and attention often called AI winters.
In 1943, neurophysiologist Warren McCulloch and logician Walter Pitts published "A Logical Calculus of the Ideas Immanent in Nervous Activity" in the Bulletin of Mathematical Biophysics. They proposed the first mathematical model of an artificial neuron: a binary-threshold unit that fires if and only if a weighted sum of its inputs exceeds a fixed threshold. The paper opens with the premise that motivated the whole field: "Because of the 'all-or-none' character of nervous activity, neural events and the relations among them can be treated by means of propositional logic."[3] They proved that networks of such units, with appropriately chosen connections, could in principle compute any function expressible in propositional logic[3].
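The binary-threshold unit described above can be sketched in a few lines. This is an illustrative sketch, not McCulloch and Pitts's own notation: the function name `threshold_unit` and the particular weights and thresholds are assumptions chosen so that the unit reproduces three basic logic gates.

```python
def threshold_unit(inputs, weights, threshold):
    """Binary-threshold neuron: fires (1) only if the weighted sum exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# Hand-chosen (hypothetical) weights and thresholds that realise logic gates.
def AND(a, b):
    return threshold_unit([a, b], [1, 1], threshold=1.5)

def OR(a, b):
    return threshold_unit([a, b], [1, 1], threshold=0.5)

def NOT(a):
    return threshold_unit([a], [-1], threshold=-0.5)

truth_table = [(a, b, AND(a, b), OR(a, b)) for a in (0, 1) for b in (0, 1)]
```

Because AND, OR, and NOT can each be built from one such unit, wiring units together is enough to express any formula of propositional logic, which is the sense of the result cited above.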
In 1949, psychologist Donald Hebb published The Organization of Behavior, in which he formulated what became known as the Hebbian learning principle. Hebb proposed that when one neuron repeatedly participates in firing another, the connection between them strengthens, often paraphrased as "cells that fire together wire together." Hebb's principle provided a theoretical basis for how synaptic strengths might be modified by experience and remains the conceptual root of many learning rules used in both neuroscience and artificial neural networks[4].
In 1957 and 1958, Frank Rosenblatt of the Cornell Aeronautical Laboratory introduced the perceptron, the first trainable neural network model. Rosenblatt described the perceptron in a 1957 technical report and in the widely cited 1958 Psychological Review paper "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain." The perceptron is a single-layer network of threshold neurons whose weights are learned from data via an iterative correction rule. Rosenblatt and colleagues also built the Mark I Perceptron, a custom hardware implementation with photocell inputs[5].
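A minimal sketch of an iterative correction rule of this kind is shown below. It is the textbook form of the perceptron rule rather than a transcription of Rosenblatt's report; the function names, the learning rate of 1.0, the epoch count, and the AND dataset are all assumptions made for illustration.

```python
def predict(weights, bias, x):
    """Threshold unit: output 1 when the weighted sum plus bias is positive."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

def train_perceptron(examples, epochs=20, lr=1.0):
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in examples:
            error = target - predict(weights, bias, x)   # -1, 0, or +1
            # The update is zero whenever the prediction is already correct.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Logical AND is linearly separable, so the rule converges on it.
and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_examples)
predictions = [predict(weights, bias, x) for x, _ in and_examples]
```

On linearly separable data such as this, the rule stops changing the weights after a finite number of mistakes; on data that is not linearly separable it never settles.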
In 1960, Bernard Widrow and his student Marcian "Ted" Hoff at Stanford introduced ADALINE (ADAptive LInear NEuron) and the closely related MADALINE network. ADALINE used a linear output unit and was trained by the Widrow-Hoff (or least-mean-square, LMS) learning rule, which adjusts weights in proportion to the difference between target and actual output. The LMS rule is a precursor to modern gradient descent and remains a foundational algorithm in adaptive signal processing[6].
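Written out, the rule described above takes roughly the following form. The symbols $\eta$ (step size), $t$ (target), and $y$ (linear output) are notation chosen here for illustration and are not taken from Widrow and Hoff's paper.

```latex
y = w^\top x + b, \qquad
w \leftarrow w + \eta\,(t - y)\,x, \qquad
b \leftarrow b + \eta\,(t - y)
```

Each update is a gradient-descent step on the squared error $\tfrac{1}{2}(t - y)^2$ for a single example, which is why the rule is regarded as a precursor of modern gradient-based training.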
These early successes generated intense optimism. After a US Navy press conference in 1958, The New York Times reported that the perceptron was "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."[35] Such coverage set expectations that the technology of the era could not meet.
In 1969, Marvin Minsky and Seymour Papert published Perceptrons: An Introduction to Computational Geometry (MIT Press). The book gave a rigorous mathematical analysis of what single-layer perceptrons can and cannot compute. Most famously, Minsky and Papert showed that a single-layer perceptron cannot represent the exclusive-or (XOR) function because XOR is not linearly separable. They also raised pessimistic, though more nuanced, questions about whether multi-layer networks could be trained efficiently[7].
The book's influence, combined with the unrealistic expectations set during the perceptron boom, contributed to a sharp decline in neural network research funding through the 1970s. Together with the broader 1973 Lighthill report and contemporaneous funding cuts, this period is widely considered part of the first AI winter. The XOR limitation applies strictly to single-layer perceptrons; multi-layer networks can represent XOR and arbitrary Boolean functions, but a practical learning algorithm for multi-layer networks would not become widely known until the mid-1980s.
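The claim that a multi-layer network can represent XOR is easy to check by hand. The sketch below uses one hidden layer of two threshold units with hand-picked weights; the function names and the specific weights are illustrative assumptions, and no learning is involved.

```python
def step(z):
    """Threshold activation: 1 for positive input, 0 otherwise."""
    return 1 if z > 0 else 0

def xor_net(a, b):
    h_or = step(a + b - 0.5)          # hidden unit 1 fires for a OR b
    h_and = step(a + b - 1.5)         # hidden unit 2 fires for a AND b
    return step(h_or - h_and - 0.5)   # output fires for OR but not AND

outputs = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
```

The hidden layer re-describes the inputs so that the output unit faces a linearly separable problem, which a single-layer perceptron cannot arrange for itself.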
Throughout the late 1970s and early 1980s, a small group of researchers continued to study learning in multi-layer networks. In 1974, Paul Werbos described the backpropagation algorithm in his Harvard PhD thesis, framing it as an application of the chain rule to layered systems[8]. Independent rediscoveries and extensions appeared throughout the early 1980s.
The decisive event was the 1986 publication of David Rumelhart, Geoffrey Hinton, and Ronald Williams's "Learning Representations by Back-Propagating Errors" in Nature, which clearly described backpropagation and demonstrated that gradient-trained multi-layer networks could learn useful internal representations[9]. In parallel, Rumelhart, James McClelland, and the broader Parallel Distributed Processing (PDP) Research Group published the two-volume Parallel Distributed Processing: Explorations in the Microstructure of Cognition, which laid out the connectionist research program: cognition as the emergent behavior of large networks of simple units. The PDP volumes helped seed a generation of researchers, including Hinton and Yoshua Bengio[10].
Other landmark contributions of the 1980s include John Hopfield's 1982 associative memory networks (Hopfield networks), the introduction of Boltzmann machines by Hinton and Terrence Sejnowski, and Teuvo Kohonen's self-organizing maps.
In 1989, Yann LeCun and colleagues at AT&T Bell Labs applied backpropagation to a convolutional neural network for reading handwritten digits, demonstrating that depth, weight sharing, and local receptive fields could solve a practical pattern-recognition problem[11]. Subsequent refinements led to the LeNet-5 system described in LeCun et al.'s 1998 Proceedings of the IEEE paper "Gradient-Based Learning Applied to Document Recognition," which was deployed at scale to read checks and ZIP codes in the United States[12].
The same period produced fundamental theoretical results. In 1989, George Cybenko proved that a feedforward network with a single hidden layer of sigmoidal units can approximate any continuous function on a compact domain to arbitrary accuracy[13]. In 1991, Kurt Hornik generalized this universal approximation result to a wide class of activation functions, establishing approximation as a property of the multi-layer architecture rather than any particular nonlinearity[14]. The 1997 introduction of Long Short-Term Memory (LSTM) by Sepp Hochreiter and Jürgen Schmidhuber provided a sequence model that mitigated the vanishing-gradient problem in recurrent networks and would later dominate speech and language modeling until transformers displaced it after 2017[15].
Despite these advances, neural networks fell out of favor through much of the 1990s and early 2000s. Vladimir Vapnik and colleagues' support vector machine (SVM), with strong theoretical guarantees and effective kernel methods, became the default tool for classification on moderate-sized datasets. Ensemble methods such as random forests and gradient-boosted trees also outperformed neural networks on many benchmarks. Compute and data were scarce by modern standards, deep networks were difficult to train end-to-end, and many researchers abandoned neural approaches. This stretch is sometimes called the second AI winter for neural networks (though the broader AI field experienced its own cycles)[16].
The modern era of neural networks began in 2006 with Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh's "A Fast Learning Algorithm for Deep Belief Nets" in Neural Computation, which showed how to pre-train deep networks layer-by-layer using restricted Boltzmann machines. Hinton and Ruslan Salakhutdinov's companion Science paper, "Reducing the Dimensionality of Data with Neural Networks," further popularized the term "deep learning"[17][18].
The decisive watershed came in 2012, when Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered AlexNet in the ImageNet Large Scale Visual Recognition Challenge. AlexNet, a deep CNN with 60 million parameters and 650,000 neurons trained on two NVIDIA GTX 580 GPUs using ReLU activations and dropout, achieved a top-5 error rate of 15.3%, compared to 26.2% for the second-best (non-neural) entry[19]. The result was widely viewed as the unambiguous victory of deep learning over hand-engineered features, and triggered the rapid adoption of deep neural networks across vision, speech, and natural language processing.
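Subsequent years produced rapid architectural advances, including generative adversarial networks (Ian Goodfellow et al., 2014) and the 152-layer ResNet (Kaiming He et al., 2015).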
In 2017, Ashish Vaswani and colleagues at Google Brain introduced the transformer in "Attention Is All You Need." The transformer replaced recurrence and convolution with multi-head self-attention; the authors proposed "a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely," and reported 28.4 BLEU on the WMT 2014 English-to-German task and 41.8 BLEU on English-to-French, setting new state-of-the-art results while being far more parallelizable[22]. Within a few years, transformers displaced LSTMs as the default sequence model and spread to vision (Vision Transformer), speech (Whisper), code, and scientific data.
In 2019, Hinton, LeCun, and Yoshua Bengio received the 2018 ACM A.M. Turing Award for "conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing." The citation specifically credits backpropagation, convolutional networks, probabilistic models, and the broader connectionist research program[23].
Beginning in 2018, large transformer language models pre-trained on web-scale corpora became the dominant paradigm. Google's BERT demonstrated that bidirectional masked-language-modeling pre-training produced state-of-the-art results across many NLP tasks, while OpenAI's GPT series, particularly GPT-3 (2020, 175 billion parameters, described by its authors as "10x more than any previous non-sparse language model") and GPT-4 (2023), showed that autoregressive language models scaled to extreme sizes exhibited in-context learning and broad generalization[24][25]. The Stanford CRFM research group coined the term "foundation model" in 2021 to refer to any model trained on broad data at scale that can be adapted to a wide range of downstream tasks[26].
By 2025-2026, frontier neural systems, including Anthropic's Claude, OpenAI's GPT-4 and GPT-5 lines, Google's Gemini, Meta's Llama family, and DeepMind's AlphaFold and Gemini Robotics, are routinely multimodal, mix dense and sparse mixture-of-experts layers, and rely on enormous distributed training systems running for weeks on tens of thousands of accelerators.
| Year | Milestone | Key contributors |
|---|---|---|
| 1943 | First mathematical neuron model | Warren McCulloch, Walter Pitts |
| 1949 | Hebbian learning principle | Donald Hebb |
| 1958 | Perceptron | Frank Rosenblatt |
| 1960 | ADALINE / Widrow-Hoff rule | Bernard Widrow, Marcian Hoff |
| 1969 | Perceptrons book identifies XOR limitation | Marvin Minsky, Seymour Papert |
| 1974 | Backpropagation in thesis form | Paul Werbos |
| 1982 | Hopfield networks | John Hopfield |
| 1986 | Backpropagation popularized; PDP volumes | Rumelhart, Hinton, Williams; PDP Group |
| 1989 | Backprop-trained CNN for digit recognition | Yann LeCun et al. |
| 1989 | Universal approximation theorem | George Cybenko (and Kurt Hornik, 1991) |
| 1997 | LSTM | Sepp Hochreiter, Jürgen Schmidhuber |
| 2006 | Deep belief networks; "deep learning" rises | Geoffrey Hinton et al. |
| 2012 | AlexNet wins ImageNet (60M parameters) | Krizhevsky, Sutskever, Hinton |
| 2014 | GANs | Ian Goodfellow et al. |
| 2015 | ResNet (152 layers) | Kaiming He et al. |
| 2017 | Transformer architecture | Vaswani et al. |
| 2018 | BERT and GPT released | Google AI; OpenAI |
| 2018 | ACM A.M. Turing Award (announced 2019) to the "fathers of the deep-learning revolution" | Geoffrey Hinton, Yann LeCun, Yoshua Bengio |
| 2020 | GPT-3 (175B parameters) | OpenAI |
| 2022 | Chinchilla scaling laws; ChatGPT | DeepMind; OpenAI |
| 2023-2026 | Multimodal foundation models, sparse MoE, agentic systems | Multiple labs |
A standard feedforward neural network defines a parameterized function $f_\theta : \mathbb{R}^d \to \mathbb{R}^k$, where $\theta$ collects all weights and biases.
A single artificial neuron computes:
$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b, \qquad a = \varphi(z)$$
Here $x_i$ are the inputs, $w_i$ the corresponding weights, $b$ a bias term, $\varphi$ a nonlinear activation function, and $a$ the unit's output (also called its activation).
A neural network organizes neurons into layers. Writing the activations of layer $\ell$ as a vector $a^{(\ell)}$, a fully-connected layer applies an affine transformation followed by an elementwise nonlinearity:
$$a^{(\ell)} = \varphi^{(\ell)}\big( W^{(\ell)} a^{(\ell-1)} + b^{(\ell)} \big)$$
with weight matrix $W^{(\ell)}$ and bias vector $b^{(\ell)}$, taking $a^{(0)} = x$. Stacking $L$ such layers, and reading each $a^{(\ell)}$ as a function of the previous layer's output, gives:
$$f_\theta(x) = a^{(L)}\big(a^{(L-1)}(\cdots(a^{(1)}(x))\cdots)\big)$$
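Layers play three roles: the input layer receives the raw features, the hidden layers compute intermediate representations, and the output layer produces the prediction.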
For non-feedforward networks (recurrent, graph, attention-based), the same building blocks reappear but the connectivity pattern differs.
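For the feedforward case, the layer equation above translates almost directly into code. The following is a minimal sketch using plain Python lists; the helper names `dense` and `forward` and the 2-3-1 network with hand-picked parameters are assumptions for illustration only.

```python
def relu(z):
    return max(0.0, z)

def dense(weights, biases, inputs, activation):
    """One fully-connected layer: activation(W @ inputs + b), using plain lists."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(layers, x):
    """Apply each (W, b, activation) layer in turn; the final vector is the output."""
    a = x
    for weights, biases, activation in layers:
        a = dense(weights, biases, a, activation)
    return a

# Hypothetical 2-3-1 network with hand-picked parameters.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.1, -0.5], relu),
    ([[1.0, -2.0, 0.5]], [0.2], lambda z: z),   # linear output unit
]
y = forward(layers, [1.0, 2.0])
```

For the input `[1.0, 2.0]` the hidden activations are 0, 1.6, and 2.5, and the single output is -1.75. In practice each layer is a matrix-vector product executed by a numerical library rather than a Python loop.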
Training a neural network is the process of choosing parameters $\theta$ to make $f_\theta$ agree with a dataset. The standard recipe is empirical risk minimization by stochastic gradient descent on a differentiable loss, with gradients computed via backpropagation.
Given a training example $x$, the network is evaluated layer-by-layer to produce a prediction $\hat{y} = f_\theta(x)$. This is the forward pass. Because each layer is a matrix-vector product followed by an elementwise nonlinearity, the forward pass is a sequence of dense linear algebra operations, which is why GPUs and TPUs, hardware specialized for parallel matrix arithmetic, are so effective.
The prediction $\hat{y}$ is compared against the target $y$ using a loss function $\mathcal{L}(\hat{y}, y)$. Common choices are mean squared error for regression, cross-entropy for classification, and contrastive or sequence-level losses for self-supervised and generative tasks. The training objective is the expected loss over the training distribution, approximated as the average loss over a mini-batch of examples.
To improve $\theta$, the network computes $\nabla_\theta \mathcal{L}$: the gradient of the loss with respect to every parameter. Backpropagation is the efficient algorithm for doing so. It applies the chain rule of calculus to the computation graph of the forward pass, starting from the output and propagating partial derivatives backward layer-by-layer. Backpropagation is the dominant training algorithm for essentially all modern neural networks[9].
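To make the chain-rule bookkeeping concrete, the sketch below backpropagates through a very small network (one scalar input, two tanh hidden units, one linear output, squared-error loss) and compares one of the resulting derivatives with a finite-difference estimate. The network shape, parameter values, and function names are assumptions chosen for illustration.

```python
import math

def forward(params, x):
    w1, b1, w2, b2 = params
    h = [math.tanh(w * x + b) for w, b in zip(w1, b1)]   # hidden activations
    y = sum(v * hj for v, hj in zip(w2, h)) + b2          # linear output
    return h, y

def loss(params, x, t):
    _, y = forward(params, x)
    return 0.5 * (y - t) ** 2

def backprop(params, x, t):
    """Chain rule applied from the output back to the first layer."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    dy = y - t                                             # dL/dy
    dw2 = [dy * hj for hj in h]                            # dL/dw2_j
    db2 = dy
    dz = [dy * v * (1 - hj ** 2) for v, hj in zip(w2, h)]  # back through tanh
    dw1 = [d * x for d in dz]
    db1 = dz
    return dw1, db1, dw2, db2

# Hypothetical parameters and a single training example.
params = ([0.5, -0.3], [0.1, 0.2], [0.7, -1.2], 0.05)
x, t = 0.8, 1.0
dw1, db1, dw2, db2 = backprop(params, x, t)

# Central finite difference on the first hidden weight, as an independent check.
eps = 1e-6
w1_plus = [params[0][0] + eps, params[0][1]]
w1_minus = [params[0][0] - eps, params[0][1]]
numeric = (loss((w1_plus, *params[1:]), x, t)
           - loss((w1_minus, *params[1:]), x, t)) / (2 * eps)
```

The analytic value `dw1[0]` and the estimate `numeric` should agree to many decimal places. Backpropagation obtains all derivatives from one forward and one backward pass, whereas finite differences need two extra forward passes per parameter.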
Gradient descent uses the gradient to update parameters:
$$\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}$$
where $\eta$ is the learning rate. In practice, the gradient is estimated from a mini-batch of typically 32 to 4,096 examples; this is mini-batch stochastic gradient descent. Modern training almost always uses momentum-based and adaptive variants. The most popular optimizers are SGD with momentum, Adam, and AdamW, the last being the de facto standard for transformer training.
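The update rule can be seen at work on a problem small enough to follow by eye. The sketch below fits a line to noiseless synthetic data with plain mini-batch SGD (no momentum or adaptive scaling); the dataset, learning rate, batch size, and epoch count are all assumptions chosen so that the example converges quickly.

```python
import random

random.seed(0)

# Hypothetical noiseless data drawn from y = 2x + 1.
xs = [i / 10 - 1.0 for i in range(21)]           # 21 points in [-1, 1]
data = [(x, 2.0 * x + 1.0) for x in xs]

w, b = 0.0, 0.0
lr = 0.1           # learning rate (eta)
batch_size = 4

for epoch in range(300):
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradient of the mean squared error over this mini-batch.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w -= lr * grad_w       # theta <- theta - eta * gradient
        b -= lr * grad_b
```

After training, `w` and `b` are very close to 2 and 1. Optimizers such as Adam keep running statistics of past gradients and rescale each step, but the outer loop has the same shape.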
Large networks overfit easily, so training combines several techniques to improve generalization and stabilize optimization.
These techniques, together with careful initialization (He, Xavier/Glorot) and the use of ReLU-family activations, are what make networks of hundreds or thousands of layers and billions of parameters trainable in practice.
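The two initialization schemes named above differ only in the variance they assign to the initial weights. The sketch below uses the commonly quoted normal-distribution variants, with variance 2 / (fan_in + fan_out) for Xavier/Glorot and 2 / fan_in for He; these formulas are not stated in this article and are included as an assumption, as are the function names and layer sizes.

```python
import math
import random

def xavier_normal(fan_in, fan_out, rng):
    """Glorot/Xavier: variance 2 / (fan_in + fan_out)."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def he_normal(fan_in, fan_out, rng):
    """He: variance 2 / fan_in, usually paired with ReLU-family activations."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def variance(matrix):
    values = [v for row in matrix for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
fan_in, fan_out = 256, 128        # hypothetical layer shape
xavier_var = variance(xavier_normal(fan_in, fan_out, rng))
he_var = variance(he_normal(fan_in, fan_out, rng))
```

Both schemes aim to keep the scale of activations and gradients roughly constant from layer to layer, so that signals neither die out nor blow up as depth grows.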
Neural networks come in many architectural families; each tailors connectivity and parameterization to a class of data. This section is a high-level tour with links to dedicated articles.
The classical fully-connected feedforward network, historically called a multilayer perceptron or feedforward neural network, applies a stack of dense linear layers and nonlinear activations. MLPs remain ubiquitous as components inside larger models: the position-wise "MLP block" in a transformer is essentially a two-layer MLP applied independently to each token.
Convolutional neural networks introduce weight sharing and local receptive fields well-suited to spatial data such as images, video, and audio spectrograms. Key operations are convolution (feature detection), pooling (spatial downsampling), and fully connected classification heads. Landmark architectures include LeNet (1989/1998), AlexNet (2012), VGG (2014), GoogLeNet/Inception (2014), and ResNet (2015)[11][19][21].
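Convolution and pooling can be sketched without any library. In the example below, one small kernel is slid over every patch of a toy image (weight sharing with a local receptive field), and the resulting feature map is then downsampled by max pooling. The function names, the 5x5 image, and the edge-detecting kernel are illustrative assumptions; as in most deep learning libraries, the operation is technically a cross-correlation.

```python
def conv2d(image, kernel):
    """Valid cross-correlation: the same kernel weights are applied to every patch."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling that downsamples by `size` along each axis."""
    return [
        [
            max(feature_map[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]

# Hypothetical 5x5 image with a vertical edge, and a 2x2 vertical-edge detector.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]
features = conv2d(image, kernel)     # 4x4 feature map
pooled = max_pool2d(features)        # 2x2 map after pooling
```

Every row of `features` is `[0, 2, 0, 0]`: the kernel responds only where the intensity changes. The kernel has four weights regardless of image size, which is the parameter saving that weight sharing provides.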
Recurrent neural networks maintain a hidden state that is updated as a sequence is consumed, providing a natural model for time series, text, and speech. Vanilla RNNs suffer from vanishing and exploding gradients, motivating gated variants: LSTM (Hochreiter and Schmidhuber, 1997) and the simpler GRU (Cho et al., 2014). RNNs dominated machine translation and speech recognition from roughly 2014 until they were largely displaced by transformers after 2017[15].
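The hidden-state update, and the reason gradients vanish, can both be seen in a scalar toy model. The sketch below assumes the standard vanilla update h_t = tanh(w_h * h_{t-1} + w_x * x_t + b); the weights, input sequence, and function names are hypothetical.

```python
import math

def rnn_step(h_prev, x, w_h, w_x, b):
    """Scalar vanilla RNN update: h_t = tanh(w_h * h_{t-1} + w_x * x_t + b)."""
    return math.tanh(w_h * h_prev + w_x * x + b)

def run_rnn(xs, w_h=0.5, w_x=1.0, b=0.0):
    h = 0.0
    grad = 1.0    # d h_t / d h_0, accumulated with the chain rule
    for x in xs:
        h = rnn_step(h, x, w_h, w_x, b)
        grad *= w_h * (1.0 - h ** 2)    # recurrent weight times tanh derivative
    return h, grad

h_final, grad_wrt_h0 = run_rnn([0.1] * 50)
```

Each step multiplies the gradient by a factor of magnitude at most 0.5 here, so after 50 steps the influence of the initial state is below 1e-12. Gated units such as LSTM and GRU add paths along which this product stays close to one.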
The transformer has been the dominant architecture for sequence modeling since 2017. Its core ingredient is multi-head self-attention: each position in the sequence computes a learned, weighted combination of the representations at all positions. Transformers are highly parallelizable, scale gracefully to enormous parameter counts, and now power large language models, image models (Vision Transformer), speech models (Whisper), and protein models (AlphaFold 2/3)[22].
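A single attention head can be sketched in plain Python. The example below implements scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, for one short sequence. To keep it brief, the learned query, key, and value projections are assumed to be the identity, and the three-token input is hypothetical; a real transformer learns those projections and runs several heads in parallel.

```python
import math

def softmax(scores):
    m = max(scores)                                  # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over one sequence."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)                    # one weight per position
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 3-token sequence with 2-dimensional embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to one, and the loop over queries has no dependence between iterations, which is why the computation parallelizes across positions in a way that a recurrent update does not.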
Graph neural networks generalize convolutions to arbitrary graphs by iteratively passing messages between neighboring nodes. They are central to molecular property prediction, drug discovery, recommendation systems, traffic forecasting (Google's road ETA models), and combinatorial optimization.
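The message-passing idea can be shown with a deliberately stripped-down sketch: each node replaces its feature with the mean of its own and its neighbors' features. Real graph neural networks also apply learned weight matrices and nonlinearities at each round, which are omitted here; the three-node path graph and all names are assumptions.

```python
def message_passing_round(features, neighbors):
    """One round: each node averages its own feature with its neighbours' features."""
    updated = {}
    for node, feat in features.items():
        incoming = [features[n] for n in neighbors[node]] + [feat]
        updated[node] = sum(incoming) / len(incoming)
    return updated

# Hypothetical path graph a - b - c with scalar node features.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
features = {"a": 1.0, "b": 0.0, "c": 0.0}

after_one = message_passing_round(features, neighbors)
after_two = message_passing_round(after_one, neighbors)
```

After one round node `c` still has feature 0, because it is two hops from `a`; after two rounds the signal from `a` has reached it. Stacking k rounds lets each node see its k-hop neighborhood.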
State-space models (SSMs) such as S4 and Mamba are a more recent family that models long sequences via linear recurrences with structured kernels, achieving subquadratic scaling in sequence length. SSMs are increasingly used as transformer alternatives or complements for long-context modeling.
Autoencoders (including variational autoencoders) learn compressed latent representations by reconstructing their inputs. Generative families include GANs, normalizing flows, autoregressive models, and (most recently) diffusion models, which dominate state-of-the-art image and video synthesis.
A mixture-of-experts (MoE) layer routes each input token to a small subset of "expert" sub-networks via a learned gating function. MoE allows total parameter count to grow much faster than per-token compute and underlies many recent frontier models (Mixtral, GPT-4-class systems, DeepSeek-V3).
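A toy version of top-k routing is sketched below. The gate here is a fixed linear scoring of a scalar input and the experts are simple functions; in a real MoE layer the gate is learned and each expert is a feedforward sub-network. All names and numbers are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, gate_weights, experts, k=2):
    """Route scalar input x to the k highest-scoring experts and mix their outputs."""
    gate_scores = [w * x for w in gate_weights]            # toy linear gate
    top = sorted(range(len(experts)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    mix = softmax([gate_scores[i] for i in top])           # renormalise over chosen experts
    output = sum(p * experts[i](x) for p, i in zip(mix, top))
    return output, top

# Hypothetical experts standing in for feedforward sub-networks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
gate_weights = [0.1, 1.0, 0.5, -1.0]

output, chosen = moe_layer(3.0, gate_weights, experts, k=2)
```

Only two of the four experts are evaluated for this input, so per-input compute depends on `k` rather than on the total number of experts. Production systems add load-balancing terms so that tokens do not all crowd into a few experts.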
| Architecture | Best suited for | Key mechanism | Examples |
|---|---|---|---|
| MLP | Tabular data, components in larger nets | Fully-connected layers | Standard MLP |
| CNN | Images, audio, video | Convolutional filters, pooling | AlexNet, ResNet, Inception |
| RNN / LSTM | Sequences, time series | Recurrent hidden state with gating | LSTM, GRU |
| Transformer | Text, sequences, multimodal | Multi-head self-attention | GPT-4, Claude, BERT, ViT |
| Graph NN | Graphs, molecules, social networks | Message passing | GCN, GAT, GraphSAGE |
| State-space model / Mamba | Very long sequences | Structured linear recurrence | S4, Mamba |
| Autoencoder | Compression, representation learning | Encoder-decoder bottleneck | VAE, denoising AE |
| GAN | Image synthesis | Generator vs. discriminator | StyleGAN, BigGAN |
| Mixture of experts | Scaling parameters cheaply | Sparse routing | Switch Transformer, Mixtral |
Activation functions inject the nonlinearity that lets stacked layers represent more than a single affine map. The most widely used today are[27]:
| Activation | Definition | Notes |
|---|---|---|
| Sigmoid | $\sigma(x) = 1 / (1 + e^{-x})$ | Bounded (0,1); historically dominant; vanishing gradients limit use in deep nets. Common at output of binary classifiers. |
| Tanh | $\tanh(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})$ | Bounded (−1,1); standard in early RNNs and LSTMs. |
| ReLU | $\max(0, x)$ | Default hidden activation since AlexNet (2012); simple, sparse, and accelerates convergence[19]. |
| Leaky ReLU / PReLU | $\max(\alpha x, x)$ | Avoids "dead neuron" problem of vanilla ReLU. |
| GELU | $x \cdot \Phi(x)$ | Smooth ReLU variant; default in BERT, GPT, and many modern transformers[28]. |
| SwiGLU | $\mathrm{Swish}(xW) \odot (xV)$ | Gated activation used in PaLM, LLaMA, and many recent LLMs[29]. |
| Softmax | $e^{x_i} / \sum_j e^{x_j}$ | Converts logits to a probability distribution at the output of multiclass classifiers. |
The shift from sigmoid/tanh to ReLU around 2010-2012, and then to GELU and gated variants such as SwiGLU around 2018-2020, was one of several inconspicuous changes that made deep networks reliably trainable.
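Several of the definitions in the table can be written directly with the standard library. In this sketch Φ is taken to be the standard normal cumulative distribution function (expressed through the error function), the default slope `alpha=0.01` for Leaky ReLU is an assumed value, and SwiGLU is omitted because it needs weight matrices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return max(alpha * x, x)

def gelu(x):
    # x * Phi(x), with Phi the standard normal CDF written via the error function.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax(logits):
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

Tanh is available as `math.tanh`. Note that softmax acts on a whole vector of logits, whereas the other functions are applied to each value independently.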
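The choice of loss function encodes what "wrong" means for the task: mean squared error penalizes the squared distance between prediction and target in regression, cross-entropy penalizes assigning low probability to the correct class in classification, and contrastive or sequence-level losses serve self-supervised and generative tasks.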
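Two of the most common losses, mean squared error and cross-entropy, can be sketched as follows. The function names and the example numbers are assumptions for illustration; the cross-entropy shown is the single-example form, which takes predicted class probabilities and the index of the true class.

```python
import math

def mean_squared_error(predictions, targets):
    """Average squared difference, the usual regression loss."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(probabilities, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[true_class])

mse = mean_squared_error([2.5, 0.0], [3.0, -1.0])        # (0.25 + 1.0) / 2
confident_right = cross_entropy([0.9, 0.05, 0.05], 0)
confident_wrong = cross_entropy([0.05, 0.9, 0.05], 0)
```

Cross-entropy grows without bound as the probability on the true class approaches zero, so a confidently wrong prediction is penalized far more heavily than a hesitant one.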
The universal approximation theorem is the foundational expressivity result for feedforward neural networks. George Cybenko proved in 1989 that finite linear combinations of sigmoidal activation functions are dense in the space of continuous functions on the unit cube, meaning that a single-hidden-layer feedforward network can approximate any continuous function on a compact domain to any desired accuracy, given enough hidden units[13]. In 1991, Kurt Hornik extended the result to a wide class of bounded, non-constant activation functions and clarified that the universality is a property of the multi-layer architecture rather than of sigmoid in particular[14].
The theorem is purely an existence result. It does not say how many neurons are needed, how to find the right weights, or whether gradient descent can in fact reach them. Subsequent work has shown that, for some families of functions, a deep network can represent the same function with exponentially fewer parameters than a shallow one, which is a key motivation for the modern emphasis on deep models.
Training contemporary neural networks is a large-scale systems problem.
Modern training runs on parallel matrix accelerators such as GPUs and TPUs.
Distributed training spreads a single model across many accelerators. Standard parallelism strategies include data parallelism, tensor (model) parallelism, and pipeline parallelism.
Frameworks such as Megatron-LM, DeepSpeed, FSDP (Fully Sharded Data Parallel), and JAX's pjit/shard_map orchestrate these strategies.
Modern training typically uses mixed precision where the hardware supports it: weights are stored in 32-bit or 16-bit formats while matrix multiplications run in lower-precision formats (FP16, BF16, FP8), which dramatically increases throughput. Loss scaling and stochastic rounding are used to preserve numerical stability. Mixed precision, together with sparse MoE layers and KV-cache tricks at inference time, accounts for much of the efficiency improvement in training and serving large models over the past five years.
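The need for loss scaling can be demonstrated with Python's `struct` module, whose `"e"` format rounds a value to IEEE 754 half precision. This is only a sketch of the numerical effect, not of any framework's implementation; the gradient value and the scale factor are hypothetical.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]

tiny_gradient = 1e-8          # hypothetical gradient value
loss_scale = 65536.0          # hypothetical power-of-two scale factor

unscaled = to_fp16(tiny_gradient)                 # underflows to 0.0 in FP16
scaled = to_fp16(tiny_gradient * loss_scale)      # survives the FP16 round trip
recovered = scaled / loss_scale                   # unscale in higher precision
```

Without scaling, the gradient is rounded to zero and the corresponding weight would never change. Multiplying the loss, and hence every gradient, by a constant before the low-precision backward pass and dividing it out again afterwards keeps small gradients representable.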
Scaling laws, first systematically studied by Kaplan et al. (OpenAI, 2020) and refined by the Chinchilla work of Hoffmann et al. (DeepMind, 2022), describe how validation loss falls smoothly as a power law in compute, dataset size, and parameters. Chinchilla in particular showed that for a fixed compute budget, model size and training tokens should be scaled in roughly equal proportions, implying that many earlier large models were undertrained[30][31].
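The "equal proportions" statement can be illustrated with a back-of-the-envelope calculation. The sketch below relies on two widely quoted approximations that are not stated in this article and should be treated as assumptions: training compute C ≈ 6 · N · D for N parameters and D tokens, and a compute-optimal ratio of roughly 20 tokens per parameter.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6     # assumed rule of thumb: C ~ 6 * N * D
TOKENS_PER_PARAM = 20         # assumed Chinchilla-style ratio, not from this article

def compute_optimal(compute_flops):
    """Split a FLOP budget into parameters N and tokens D with D = 20 * N."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

small = compute_optimal(1e21)
large = compute_optimal(1e23)     # 100x more compute
```

Under these assumptions, a 100-fold increase in compute calls for about 10 times more parameters and 10 times more tokens, so both grow with the square root of the budget rather than parameters alone absorbing the increase.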
Most modern neural networks are built using one of a handful of high-level frameworks, chiefly PyTorch, TensorFlow, and JAX.
All three major frameworks share core abstractions: tensors as multi-dimensional arrays, automatic differentiation, and compilation to optimized accelerator code (cuDNN, XLA, Triton).
CNNs and Vision Transformers power image classification, object detection, semantic and instance segmentation, optical character recognition, satellite and medical imaging analysis, and the perception stacks of autonomous vehicles. Architectures such as AlexNet, ResNet, Inception, YOLO, EfficientNet, ViT, and Segment Anything are all neural networks.
Transformer-based language models dominate machine translation, summarization, question answering, code generation, sentiment analysis, and conversational AI. Large language models such as Claude, GPT-4, Gemini, and Llama 3/4 perform extended chains of reasoning, function calling, and multi-step agentic behavior.
Neural networks underpin automatic speech recognition (e.g., Whisper), text-to-speech (WaveNet, VALL-E), music generation, and audio classification.
DeepMind's AlphaGo defeated world champion Lee Sedol at Go in 2016, combining deep CNNs, value/policy networks, and Monte Carlo tree search. AlphaGo Zero (2017) and MuZero (2019) generalized this approach to self-play and model-based reinforcement learning.
AlphaFold 2 (2020) predicted protein 3D structures at near-experimental accuracy, and contributions to protein structure prediction were recognized by the 2024 Nobel Prize in Chemistry (awarded jointly to David Baker, Demis Hassabis, and John Jumper)[32]. Neural networks are now used in weather forecasting (GraphCast, GenCast), materials discovery, particle physics, fluid dynamics, and quantum chemistry. Physics-informed neural networks (PINNs) embed known physical laws as soft constraints.
Applications include radiology and pathology image analysis, retinal disease screening, electronic health record modeling, drug discovery, and single-cell genomics.
A neural network is the model: a specific class of function made of layers of artificial neurons connected by weights. Deep learning is the broader field and methodology of training neural networks that have many layers (hence "deep") on large datasets, including the architectures, optimization methods, regularization tricks, and hardware practices that make such training work. Put simply, every deep learning system is built on neural networks, but the term "deep learning" emphasizes depth and scale: a single-layer perceptron is a neural network but is not usually called deep learning, whereas a 100-layer ResNet or a billion-parameter transformer is. Neural networks are also one model class within the still-broader field of machine learning, which also includes non-neural methods such as decision trees, support vector machines, and linear regression. The 2015 Nature review by LeCun, Bengio, and Hinton defines deep learning as methods that "allow computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction."[2]
The empirical success of neural networks has outpaced theoretical understanding, but several lines of work have made progress:
Open theoretical questions include why overparameterized networks generalize, how representations form during training, whether emergent capabilities reflect genuine phase transitions, and what, if any, fundamental limits constrain scaling.
Despite their success, neural networks have well-known limitations: