Neural Network

Updated Apr 5, 2026

A neural network (also called an artificial neural network or ANN) is a computing system inspired by the biological neural networks found in animal brains. It consists of interconnected groups of artificial neurons (nodes) that process information using a connectionist approach to computation. Neural networks learn to perform tasks by adjusting the numerical weights of connections between nodes, typically without being programmed with task-specific rules. They have become the foundation of modern artificial intelligence, powering applications from image recognition and natural language processing to drug discovery and autonomous driving.

Explain like I'm 5 (ELI5)

Imagine you have a huge team of tiny helpers, and each helper can only do one very simple thing: look at a number, multiply it, and pass the result to the next helper. Alone, none of them are very smart. But when you line up thousands of these helpers in rows and connect them together, something amazing happens. You can show the whole team a picture of a cat, and after the numbers pass through all the helpers, the team says "cat!" at the end.

How does the team get so smart? By practice. At first, the helpers give wrong answers. But each time they are wrong, a coach goes backward through the team and tells each helper to adjust their multiplication number a tiny bit. After seeing thousands of pictures of cats and dogs and cars, the helpers get their numbers just right, and the team can recognize things it has never seen before.

That is basically how a neural network works. The "helpers" are artificial neurons, the "multiplication numbers" are weights, and the "coach going backward" is an algorithm called backpropagation.

Historical development

The history of neural networks spans more than eight decades, with periods of intense excitement separated by stretches of reduced funding and interest sometimes called "AI winters."

Early foundations (1943 to 1960s)

In 1943, neurophysiologist Warren McCulloch and logician Walter Pitts published "A Logical Calculus of the Ideas Immanent in Nervous Activity," proposing the first mathematical model of an artificial neuron. Their model showed that simple binary-threshold units, when connected in networks, could compute any logical function ^[1].

In 1949, psychologist Donald Hebb published The Organization of Behavior, introducing what became known as Hebbian learning: the principle that the connection between two neurons strengthens when they are repeatedly active together. This idea, often summarized as "cells that fire together wire together," provided a theoretical basis for how connection strengths between neurons could be adjusted through experience, forming the conceptual root of modern learning rules ^[2].

In 1958, Frank Rosenblatt developed the perceptron, a single-layer neural network that could learn to classify linearly separable patterns. Rosenblatt and colleagues built the Mark I Perceptron, one of the first hardware implementations of a neural network, at the Cornell Aeronautical Laboratory. The perceptron generated enormous enthusiasm about the potential of thinking machines ^[3].

The first AI winter (1969 to early 1980s)

In 1969, Marvin Minsky and Seymour Papert published Perceptrons: An Introduction to Computational Geometry, a rigorous mathematical analysis that exposed fundamental limitations of single-layer perceptrons. Most famously, they proved that a single-layer perceptron cannot learn the XOR (exclusive OR) function because XOR is not linearly separable. This result, combined with Minsky and Papert's broader skepticism about extending perceptrons to multiple layers, contributed to a sharp decline in neural network research funding and interest throughout the 1970s ^[4].

Although the limitations applied specifically to single-layer networks and multi-layer architectures were already known to be more powerful in principle, the book's influence led many funding agencies and researchers to abandon neural network research for over a decade.

The backpropagation revival (1974 to 1990s)

In 1974, Paul Werbos described the backpropagation algorithm in his doctoral dissertation, providing a method for training multi-layer networks by propagating error gradients backward from the output layer to earlier layers. However, the technique did not gain widespread attention until 1986, when David Rumelhart, Geoffrey Hinton, and Ronald Williams published "Learning Representations by Back-propagating Errors" in Nature, demonstrating the practical effectiveness of backpropagation for training multi-layer networks ^[5].

This revival reignited interest in neural networks. In 1989, Yann LeCun and colleagues successfully applied backpropagation to a convolutional neural network (LeNet) for recognizing handwritten ZIP codes, demonstrating that deep neural networks could solve real-world pattern recognition tasks ^[6].

Also in 1989, George Cybenko proved the universal approximation theorem, showing that a feedforward network with a single hidden layer containing a finite number of sigmoid neurons can approximate any continuous function on a compact subset of R^n to arbitrary accuracy. Kurt Hornik extended this result in 1991, demonstrating that the universal approximation property is not specific to sigmoid activations but is instead a fundamental consequence of the multi-layer feedforward architecture itself ^[7]^[8].

The deep learning revolution (2006 to present)

In 2006, Geoffrey Hinton and colleagues published a breakthrough paper showing how to effectively train deep networks using layer-wise unsupervised pre-training with restricted Boltzmann machines, helping to popularize the term "deep learning" for networks with many hidden layers ^[9].

The watershed moment came in 2012, when Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered AlexNet in the ImageNet Large Scale Visual Recognition Challenge. AlexNet achieved a top-5 error rate of 15.3%, dramatically outperforming the second-best entry at 26.2%. This result proved that deep convolutional neural networks trained on GPUs could vastly surpass traditional computer vision methods ^[10]. After AlexNet's victory, nearly every subsequent ImageNet competitor adopted deep learning approaches.

In 2017, Vaswani et al. introduced the transformer architecture in "Attention Is All You Need," replacing recurrence and convolution with self-attention mechanisms. The transformer enabled far greater parallelism during training and superior performance on sequence tasks. This architecture became the foundation for models like BERT (2018), GPT-2 (2019), GPT-3 (2020), GPT-4 (2023), and Claude, and has been cited over 173,000 times as of 2025 ^[11].

In 2019, Hinton, LeCun, and Yoshua Bengio were named recipients of the 2018 ACM A.M. Turing Award for their foundational contributions to deep learning.

Year	Milestone	Key contributors
1943	First mathematical neuron model	Warren McCulloch, Walter Pitts
1949	Hebbian learning rule	Donald Hebb
1958	Perceptron	Frank Rosenblatt
1960	ADALINE (adaptive linear neuron)	Bernard Widrow, Marcian Hoff
1969	Perceptrons book highlights limitations	Marvin Minsky, Seymour Papert
1974	Backpropagation described	Paul Werbos
1986	Backpropagation popularized	David Rumelhart, Geoffrey Hinton, Ronald Williams
1989	Convolutional neural network (LeNet) for digit recognition	Yann LeCun et al.
1989	Universal approximation theorem	George Cybenko
1997	Long short-term memory (LSTM)	Sepp Hochreiter, Jürgen Schmidhuber
2006	Deep belief networks, "deep learning" popularized	Geoffrey Hinton et al.
2012	AlexNet wins ImageNet	Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton
2014	Generative adversarial networks (GANs)	Ian Goodfellow et al.
2017	Transformer architecture	Vaswani et al.
2018	BERT pre-trained language model	Google AI
2019	Turing Award for deep learning	Geoffrey Hinton, Yann LeCun, Yoshua Bengio
2020	GPT-3 (175 billion parameters)	OpenAI
2023	GPT-4, Claude, multimodal foundation models	OpenAI, Anthropic, Google

How neural networks work

A neural network processes information through a series of interconnected layers of artificial neurons. Understanding how this processing works requires examining the building blocks (neurons, layers, weights), the forward pass, loss computation, and the training loop.

Neurons, layers, and connections

The fundamental unit of a neural network is the artificial neuron (also called a node or unit). Each neuron receives one or more input values, multiplies each input by a corresponding weight, sums the weighted inputs together with a bias term, and passes the result through a nonlinear activation function to produce an output.

Mathematically, the output of a single neuron can be expressed as:

output = f(w_1*x_1 + w_2*x_2 + ... + w_n*x_n + b)

where x_1 through x_n are the inputs, w_1 through w_n are the weights, b is the bias, and f is the activation function.
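
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the choice of sigmoid as the default activation, and the example values are all assumptions made for the sketch.

```python
import math

def sigmoid(z):
    """Logistic activation: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Compute f(w_1*x_1 + ... + w_n*x_n + b) for a single neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

# Illustrative values: 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, and sigmoid(0.0) = 0.5
y = neuron_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
```

Passing a different function as activation (for example, one that returns its argument unchanged) turns the same code into a purely linear unit.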

Neurons are organized into layers:

Input layer: Receives the raw data (pixel values, word embeddings, sensor readings). Nodes in this layer do not perform computation; they simply pass values to the next layer.
Hidden layers: One or more intermediate layers where the actual computation takes place. Each neuron in a hidden layer applies weights, biases, and an activation function to its inputs. Networks with two or more hidden layers are commonly referred to as "deep" neural networks.
Output layer: Produces the final result, such as a class probability, a predicted number, or a generated token. The number of output neurons depends on the task (one for binary classification, multiple for multi-class classification, etc.).

Activation functions

Activation functions introduce nonlinearity into the network, allowing it to learn complex, non-linear relationships. Without activation functions, a multi-layer network would collapse into a single linear transformation regardless of its depth.

Activation function	Formula	Typical use
Sigmoid	1 / (1 + e^(-x))	Output layer for binary classification
Tanh	(e^x - e^(-x)) / (e^x + e^(-x))	Hidden layers in recurrent neural networks
ReLU	max(0, x)	Most common in hidden layers of deep networks
Leaky ReLU	max(0.01x, x)	Variant of ReLU that avoids "dead neurons"
Softmax	e^(x_i) / sum_j(e^(x_j))	Output layer for multi-class classification
GELU	x * Phi(x)	Transformer architectures
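
The formulas in the table translate fairly directly into code. The sketch below uses only Python's standard library; the function names are illustrative, the branch in sigmoid and the max-subtraction in softmax are numerical-stability choices added for the sketch, and Phi is taken to be the standard normal cumulative distribution function.

```python
import math

def sigmoid(x):
    # Two algebraically equal forms; picking by sign avoids overflow in exp()
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Same as max(slope * x, x) whenever 0 < slope < 1
    return x if x > 0 else slope * x

def softmax(xs):
    # Subtracting the maximum leaves the result unchanged but keeps exp() small
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gelu(x):
    # Phi(x), the standard normal CDF, expressed through the error function
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi
```

Softmax is the only function here that acts on a whole vector rather than on a single value, because each output depends on every input.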

The forward pass

During the forward pass, data flows from the input layer through each hidden layer to the output layer. At each layer, the inputs are multiplied by weights, summed with biases, and transformed by activation functions. The final layer produces the network's prediction. This entire computation is a series of matrix multiplications and element-wise nonlinear transformations, which is why GPUs (designed for parallel matrix arithmetic) are so effective at accelerating neural network computation.
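
A forward pass through a small fully connected network might look like the following sketch. The layer representation (a list of weight rows, a list of biases, and an activation function) and all numeric values are assumptions chosen to keep the arithmetic easy to follow; practical implementations use matrix libraries rather than nested lists.

```python
def relu(x):
    return max(0.0, x)

def identity(x):
    return x

def dense(inputs, weights, biases, activation):
    """One fully connected layer; weights holds one row per output neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(x, layers):
    """Feed x through each (weights, biases, activation) triple in order."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x

layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),   # hidden layer with 2 neurons
    ([[1.0, 2.0]], [0.1], identity),                 # output layer with 1 neuron
]
# Hidden activations are [1.0, 1.5]; the output is 1.0*1.0 + 2.0*1.5 + 0.1 = 4.1
prediction = forward([2.0, 1.0], layers)
```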

Loss computation

After the forward pass, the network's prediction is compared to the true target value using a loss function (also called a cost function or objective function). The loss function quantifies how wrong the network's prediction is. Common loss functions include:

Mean squared error (MSE): Used for regression tasks. Computes the average of squared differences between predictions and true values.
Cross-entropy loss: Used for classification tasks. Measures the divergence between the predicted probability distribution and the true distribution.
Binary cross-entropy: A special case of cross-entropy for two-class problems.

The goal of training is to minimize this loss across the entire training dataset.
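
The three loss functions listed above can be written as short Python functions. This is a sketch under simple assumptions: predictions are plain lists, the true class is given as an index, and the small eps clamp is added only to avoid taking the logarithm of zero.

```python
import math

def mean_squared_error(predictions, targets):
    """Average of squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(predicted_probs, true_class, eps=1e-12):
    """Negative log-probability that the model assigns to the correct class."""
    return -math.log(max(predicted_probs[true_class], eps))

def binary_cross_entropy(p, y, eps=1e-12):
    """Two-class case: p is the predicted probability of class 1, y is 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
```

With two classes, binary_cross_entropy(0.8, 1) and cross_entropy([0.2, 0.8], 1) give the same value, which is the sense in which the binary form is a special case.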

Backpropagation and gradient descent

Training a neural network involves repeatedly adjusting weights to minimize the loss function. This is accomplished through two key mechanisms working together.

Backpropagation computes the gradient (partial derivative) of the loss function with respect to each weight in the network. It applies the chain rule of calculus, starting at the output layer and working backward through each hidden layer. This process determines how much each individual weight contributed to the overall error.
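
To make the chain-rule step concrete, the sketch below backpropagates through a deliberately tiny network with one tanh hidden unit and one linear output, using a squared-error loss. The network shape, the loss, and every name are assumptions chosen for illustration; the finite-difference helper is included only as a sanity check on the analytic gradients.

```python
import math

def forward(w1, w2, x):
    h = math.tanh(w1 * x)      # hidden activation
    y = w2 * h                 # linear output
    return h, y

def loss(w1, w2, x, t):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - t) ** 2

def backward(w1, w2, x, t):
    """Return (dL/dw1, dL/dw2), working from the output back toward the input."""
    h, y = forward(w1, w2, x)
    dL_dy = y - t                         # derivative of 0.5 * (y - t)^2
    dL_dw2 = dL_dy * h                    # because y = w2 * h
    dL_dh = dL_dy * w2                    # error signal passed back to the hidden unit
    dL_dw1 = dL_dh * (1.0 - h * h) * x    # tanh'(z) = 1 - tanh(z)^2
    return dL_dw1, dL_dw2

def numerical_gradient(f, value, step=1e-6):
    """Central finite difference, used here only to check the analytic result."""
    return (f(value + step) - f(value - step)) / (2.0 * step)
```

The intermediate quantity dL_dh is reused when computing dL_dw1; in larger networks this reuse of backward-flowing error signals is what keeps backpropagation cheap compared with differentiating each weight independently.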

Gradient descent then uses these gradients to update the weights. Each weight is adjusted in the direction that reduces the loss, with the size of the adjustment controlled by a parameter called the learning rate:

w_new = w_old - learning_rate * (dL/dw)

where dL/dw is the partial derivative of the loss L with respect to weight w.
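
As a worked illustration of the update rule, the following sketch fits a single weight w so that w*x matches toy data generated by y = 2x. The data, the learning rate of 0.05, and the step count are assumptions picked so that the loop converges quickly; they are not recommended defaults.

```python
def gradient(w, xs, ys):
    """dL/dw for the loss L = mean((w*x - y)^2)."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(w, xs, ys, learning_rate=0.05, steps=100):
    for _ in range(steps):
        # w_new = w_old - learning_rate * (dL/dw)
        w = w - learning_rate * gradient(w, xs, ys)
    return w

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # toy data generated by y = 2x
w_fit = gradient_descent(0.0, xs, ys)        # approaches 2.0
```

On this data, a learning rate above roughly 0.21 makes each step overshoot by more than it corrects, so the iterates grow instead of shrinking, which is a small-scale version of the divergence that an overly large learning rate causes in real training.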

This cycle of forward pass, loss computation, backpropagation, and weight update is repeated for many iterations, usually spanning multiple epochs (full passes over the training data), until the network converges to acceptable performance.

Training neural networks

Training a neural network involves many practical choices that significantly affect performance, convergence speed, and generalization ability.

Optimization algorithms

Plain (vanilla) gradient descent computes gradients over the entire training set before making a single weight update, which can be extremely slow for large datasets. Several variants address this:

Optimizer	Description	Key property
Stochastic gradient descent (SGD)	Updates weights after each individual training example	Noisy but fast updates
Mini-batch SGD	Updates weights after a small batch of examples (typically 32 to 512)	Balances noise and stability
SGD with momentum	Accumulates a velocity term to accelerate convergence	Helps escape shallow local minima
AdaGrad	Adapts the learning rate for each parameter based on past gradients	Good for sparse data
RMSProp	Uses an exponentially decaying average of squared gradients	Handles non-stationary objectives
Adam	Combines momentum and adaptive learning rates	Most popular general-purpose optimizer
AdamW	Adam with decoupled weight decay	Standard for transformer training
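
Two of the update rules in the table can be sketched for a single scalar parameter as follows. The momentum form shown is one common formulation among several, and the default hyperparameter values (0.9, 0.999, 1e-8, and the learning rates) are conventional choices used here as assumptions; the function names are illustrative.

```python
import math

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """Accumulate a velocity from past gradients, then move along it."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

On the very first Adam step the bias-corrected ratio reduces to grad / |grad|, so the parameter moves by almost exactly lr regardless of the gradient's magnitude. This normalization is what the table means by an adaptive learning rate.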

Hyperparameters

Several hyperparameters must be tuned during training:

Learning rate: Controls the step size during weight updates. If set too high, training may diverge; if set too low, training will be extremely slow. Learning rate schedules (e.g., cosine annealing, warm-up followed by decay) are commonly used to adjust the rate during training.
Batch size: The number of training examples processed before a weight update. Smaller batches introduce more noise, which can act as a regularizer but may slow convergence. Larger batches are more computationally efficient on GPUs but may lead to sharper, less generalizable minima.
Number of epochs: One epoch means the model has seen every training example once. Training typically continues for tens to hundreds of epochs, depending on the dataset size and model complexity.
Network architecture: The number of layers, the number of neurons per layer, and the choice of activation functions all affect the network's capacity and performance.
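
A learning rate schedule of the "warm-up followed by decay" kind can be sketched as a function of the training step. The linear warm-up shape, the cosine decay, and the default values below are assumptions for illustration; frameworks differ in the exact conventions they use.

```python
import math

def lr_schedule(step, total_steps, peak_lr=1e-3, warmup_steps=100, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; the last warm-up step reaches peak_lr
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With the defaults and total_steps=1000, the rate starts at 1e-5, reaches 1e-3 after 100 steps, falls to half of the peak at step 550, and ends near zero.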

Regularization

Neural networks with many parameters can easily memorize training data instead of learning general patterns, a problem known as overfitting. Regularization techniques help the model generalize to unseen data:

Dropout: During training, randomly sets a fraction (commonly 20% to 50%) of neuron outputs to zero at each forward pass, preventing co-adaptation of neurons.
L1 and L2 regularization: Adds a penalty term to the loss function proportional to the absolute values (L1) or squared values (L2) of the weights, discouraging large weights. The L2 form is closely related to weight decay.
Batch normalization: Normalizes activations within each mini-batch, stabilizing and accelerating training.
Early stopping: Monitors performance on a validation set and stops training when validation loss begins to increase, preventing the model from overfitting to the training data.
Data augmentation: Artificially expands the training set by applying transformations (rotation, cropping, flipping for images; paraphrasing, synonym replacement for text).
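
Two of the techniques above, dropout and early stopping, are simple enough to sketch directly. The dropout shown is the "inverted" variant, which rescales surviving activations during training so that nothing needs to change at inference time; this variant, the patience parameter, and all names are assumptions of the sketch rather than details given in the text.

```python
import random

def dropout(activations, drop_prob=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability drop_prob, rescale the rest."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or None if it never would."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best:
            best = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None
```

For example, with validation losses [1.0, 0.8, 0.7, 0.75, 0.72, 0.9] and a patience of 3, the best value occurs at epoch 2 and the function reports that training stops at epoch 5.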

Deep vs. shallow networks

A "shallow" neural network has one hidden layer, while a "deep" neural network has two or more hidden layers. Although the universal approximation theorem guarantees that even a single-hidden-layer network can approximate any continuous function on a compact domain, deeper networks can represent some functions using exponentially fewer neurons.

Deep networks learn hierarchical representations: early layers detect simple features (edges, basic shapes), intermediate layers combine these into more complex patterns (textures, object parts), and later layers recognize high-level concepts (faces, objects, scenes). This hierarchical feature extraction is one of the main reasons deep learning has outperformed shallow models on tasks involving images, speech, and natural language.

However, deeper networks are harder to train. They are more susceptible to vanishing gradients (where gradient signals shrink to near zero in early layers, stalling learning) and exploding gradients (where gradients grow exponentially). Techniques such as ReLU activations, batch normalization, residual connections (as in ResNet), and careful weight initialization (e.g., He initialization, Xavier initialization) have been developed to address these challenges.
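
The arithmetic behind vanishing and exploding gradients can be illustrated with a simplified model: along a single path through the network, the gradient is multiplied at each layer by a weight times the activation's derivative. The sketch below assumes sigmoid activations, the same weight and pre-activation at every layer, and a single path, which is far simpler than a real network but shows the exponential trend.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # reaches its maximum, 0.25, at z = 0

def gradient_scale_through_layers(depth, pre_activation=0.0, weight=1.0):
    """Product of the per-layer factors weight * sigmoid'(z) along one path."""
    scale = 1.0
    for _ in range(depth):
        scale *= weight * sigmoid_derivative(pre_activation)
    return scale

shrinking = gradient_scale_through_layers(10)              # 0.25**10, under 1e-6
growing = gradient_scale_through_layers(10, weight=5.0)    # 1.25**10, above 9
```

ReLU has a derivative of exactly 1 for positive inputs, so the corresponding factor does not shrink the gradient, which is one reason it is listed among the remedies.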

Types of neural network architectures

Neural networks come in many architectural variants, each suited to different types of data and tasks.

Feedforward neural networks

The simplest architecture, in which information flows in one direction from input to output with no cycles or loops. Multi-layer perceptrons (MLPs) are the classic example. Feedforward networks are used for tabular data classification, regression, and as building blocks within more complex architectures.

Convolutional neural networks (CNNs)

CNNs use convolutional layers that apply learnable filters (kernels) to local regions of the input, making them especially effective for spatial data like images. Key operations include convolution (feature detection), pooling (spatial downsampling), and fully connected layers for final classification. Landmark CNN architectures include LeNet (1989), AlexNet (2012), VGGNet (2014), GoogLeNet/Inception (2014), and ResNet (2015). CNNs are the backbone of modern computer vision systems ^[6]^[10].
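
The two core operations, convolution and pooling, can be sketched with nested lists. The sketch assumes a single-channel input, no padding, and a stride of 1 for the convolution, and it computes cross-correlation (no kernel flip), which is what deep learning libraries generally call convolution. The example kernel is hand-picked for illustration, whereas a CNN would learn its kernels.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and take a weighted sum at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows or columns that do not fit are dropped."""
    return [
        [
            max(feature_map[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]

image = [[0, 0, 1, 1]] * 4            # dark left half, bright right half
edges = conv2d(image, [[-1, 1]])      # responds only where brightness increases
```

Here every row of edges is [0, 1, 0]: the filter fires at the boundary between the dark and bright halves and nowhere else, which is the sense in which a convolutional filter detects a feature.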

Recurrent neural networks (RNNs)

RNNs process sequential data by maintaining a hidden state that carries information from previous time steps. At each step, the network takes the current input along with the previous hidden state to produce a new output and updated state. Standard RNNs struggle with long sequences due to vanishing gradients. Two important variants address this problem: Long Short-Term Memory (LSTM) networks, introduced by Hochreiter and Schmidhuber in 1997, use gating mechanisms to selectively remember or forget information; Gated Recurrent Units (GRUs) provide a simplified alternative with similar performance ^[12].
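
The recurrence of a standard (ungated) RNN can be sketched as below. The tanh activation, the parameter names W_xh, W_hh, and b_h, and the omission of a separate output layer are assumptions of this sketch; LSTM and GRU cells add gating on top of the same basic loop.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """New hidden state: tanh(W_xh @ x + W_hh @ h_prev + b_h), using plain lists."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_xh[k], x))
            + sum(w * hi for w, hi in zip(W_hh[k], h_prev))
            + b_h[k]
        )
        for k in range(len(b_h))
    ]

def run_rnn(sequence, h0, W_xh, W_hh, b_h):
    """Apply the same step, with the same weights, at every time step."""
    h = h0
    states = []
    for x in sequence:
        h = rnn_step(x, h, W_xh, W_hh, b_h)
        states.append(h)
    return states
```

Because each state depends on the previous one, the loop over the sequence cannot be parallelized across time steps, and gradients must flow back through every step, which is where the vanishing-gradient problem arises.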

Transformers

Transformers process entire sequences in parallel using self-attention mechanisms, computing relationships between all pairs of elements in a sequence simultaneously. Unlike RNNs, transformers have no recurrence, which makes them highly parallelizable and efficient on modern hardware. The multi-head attention mechanism allows the model to attend to different types of relationships simultaneously.
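
The core computation, scaled dot-product attention as described in the transformer paper ^[11], can be sketched with plain lists. The sketch assumes the queries, keys, and values are already given; in a real self-attention layer they are produced from the same input sequence by learned linear projections, and multi-head attention runs several such computations side by side. Both are omitted here.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query, return a weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)     # attention weights, summing to 1
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs
```

Each query is handled independently of the others, so all positions in the sequence can be computed at once. This independence is the source of the parallelism noted above.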

Transformers have become the dominant architecture for natural language processing and are increasingly used in computer vision (Vision Transformer, or ViT), speech recognition (Whisper), protein structure prediction (AlphaFold 2), and other domains. Nearly all modern large language models, including GPT-4, Claude, Gemini, and LLaMA, are based on the transformer architecture ^[11].

Autoencoders

Autoencoders are unsupervised networks that learn compressed representations of data. They consist of an encoder that maps input to a lower-dimensional latent space and a decoder that reconstructs the original input from the compressed representation. Variants include denoising autoencoders (trained to reconstruct clean data from noisy input), sparse autoencoders (that enforce sparsity in the latent representation), and variational autoencoders (VAEs, which learn a probabilistic latent space and can generate new data samples).

Generative adversarial networks (GANs)

Introduced by Ian Goodfellow and colleagues in 2014, GANs consist of two networks trained in opposition: a generator that creates synthetic data and a discriminator that tries to distinguish real data from generated data. Through this adversarial training process, the generator learns to produce increasingly realistic outputs. GANs have been used for image synthesis, style transfer, super-resolution, and data augmentation ^[13].

Graph neural networks (GNNs)

GNNs are designed to operate on graph-structured data, where entities (nodes) are connected by relationships (edges). They work through message passing: each node aggregates information from its neighbors to update its own representation. GNNs are used in social network analysis, molecular property prediction, recommendation systems, traffic forecasting, and combinatorial optimization.
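
One round of message passing can be sketched as follows. The sketch uses mean aggregation and a fixed 50/50 blend of a node's own features with the aggregated neighbor features. Both choices are simplifying assumptions; practical GNN layers apply learned weight matrices and a nonlinearity at this point.

```python
def message_passing_step(features, neighbors):
    """Update every node from the mean of its neighbors' current features."""
    updated = {}
    for node, own in features.items():
        nbrs = neighbors.get(node, [])
        if nbrs:
            aggregated = [
                sum(features[n][c] for n in nbrs) / len(nbrs)
                for c in range(len(own))
            ]
        else:
            aggregated = list(own)    # isolated node: nothing to aggregate
        updated[node] = [0.5 * (o + a) for o, a in zip(own, aggregated)]
    return updated

# Hypothetical 3-node graph in which "a" is connected to both "b" and "c"
features = {"a": [1.0], "b": [3.0], "c": [5.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
after_one_round = message_passing_step(features, neighbors)
```

Applying the step k times lets information travel k hops, so after two rounds node "b" has been influenced by node "c" even though the two are not directly connected.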

Architecture	Best suited for	Key mechanism	Example models
Feedforward (MLP)	Tabular data, regression	Weighted sum + activation	Standard MLP
Convolutional neural network (CNN)	Images, spatial data	Convolutional filters	AlexNet, ResNet, VGGNet
Recurrent neural network (RNN/LSTM)	Sequential data, time series	Hidden state with gating	LSTM, GRU
Transformer	Text, sequences, multimodal	Self-attention	GPT-4, BERT, ViT
Autoencoder	Dimensionality reduction, generation	Encoder-decoder bottleneck	VAE, denoising AE
GAN	Image synthesis, data augmentation	Adversarial training	StyleGAN, DCGAN
Graph neural network	Graphs, molecules, social networks	Message passing	GCN, GAT, GraphSAGE

The universal approximation theorem

The universal approximation theorem is one of the most important theoretical results in neural network research. George Cybenko proved in 1989 that a feedforward neural network with a single hidden layer of sigmoid neurons can approximate any continuous function on a compact subset of R^n to any desired degree of accuracy, provided the hidden layer has sufficiently many neurons ^[7].

Kurt Hornik, Maxwell Stinchcombe, and Halbert White extended this result in 1989 and 1991, showing that the approximation property holds for a wide class of activation functions, not just sigmoids. Hornik's 1991 paper demonstrated that the universal approximation capability is an inherent property of the multi-layer feedforward architecture itself, not of any specific activation function ^[8].

The theorem has important caveats. It is an existence result: it guarantees that such a network exists but says nothing about how to find the right weights, how many neurons are needed, or whether gradient-based training can reach the optimal solution. In practice, deeper networks with fewer neurons per layer often learn more efficiently than the very wide, shallow networks the theorem describes.

Biological inspiration vs. reality

While neural networks draw inspiration from the brain, artificial and biological neural networks differ profoundly in their structure, mechanisms, and capabilities.

Feature	Biological neural networks	Artificial neural networks
Scale	Approximately 86 billion neurons in the human brain	Millions to billions of parameters, but far fewer distinct processing units
Connectivity	Each neuron connects to roughly 7,000 other neurons via synapses	Typically structured in layers with dense inter-layer connections
Learning mechanism	Synaptic plasticity (Hebbian learning, spike-timing-dependent plasticity)	Backpropagation and gradient descent
Signal type	Electrical spikes (action potentials) with precise timing	Continuous floating-point numbers
Processing speed	Individual neurons fire slowly (milliseconds)	Electronic gates switch in nanoseconds
Energy efficiency	Brain uses roughly 20 watts	Training a large model may require megawatts of power
Fault tolerance	Highly fault-tolerant; minor neuron loss causes no significant memory loss	Less inherently fault-tolerant; model parameters must be preserved
Adaptability	Continuous, lifelong learning and adaptation	Typically trained once on fixed datasets (though fine-tuning and continual learning are active research areas)

It is important to note that backpropagation, the primary training algorithm for artificial neural networks, has no known biological equivalent. Biological brains do not appear to propagate error gradients backward through neural pathways. This discrepancy has motivated research into biologically plausible learning rules, but none has yet achieved the practical effectiveness of backpropagation for training artificial systems.

Hardware for neural network training

The modern deep learning revolution has been driven as much by hardware advances as by algorithmic breakthroughs.

Graphics processing units (GPUs): Originally designed for rendering graphics, GPUs contain thousands of small cores optimized for parallel matrix arithmetic, making them ideal for the matrix multiplications at the heart of neural network computation. NVIDIA's CUDA platform (released in 2007) enabled researchers to use GPUs for general-purpose computing, and AlexNet's 2012 success was made possible by training on two NVIDIA GTX 580 GPUs. Modern training clusters use thousands of high-end GPUs (e.g., NVIDIA A100, H100, H200, B200) connected by high-bandwidth interconnects.

Tensor processing units (TPUs): Google developed TPUs as custom application-specific integrated circuits (ASICs) designed specifically for neural network workloads. TPUs are optimized for the tensor operations (multi-dimensional matrix operations) central to deep learning and are available through Google Cloud.

Other accelerators: Alternatives include Intel's Habana Gaudi processors, AMD Instinct GPUs, Cerebras wafer-scale engines (containing hundreds of thousands of cores on a single wafer), and Graphcore's Intelligence Processing Units (IPUs). The growing demand for neural network training and inference has created intense hardware competition.

The scale of hardware required for training large models has grown exponentially. Training GPT-3 (2020) required an estimated 3,640 petaflop/s-days of compute, where one petaflop/s-day is 10^15 operations per second sustained for a full day. Modern frontier models require orders of magnitude more, driving investment in massive data centers and raising questions about energy consumption and environmental impact.
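
To put the GPT-3 figure in more familiar units, the short calculation below converts it to a total operation count. It reads the unit as petaflop/s-days, meaning one petaflop per second (10^15 floating-point operations per second) sustained for one day; the constant and function names are illustrative.

```python
PETAFLOP_PER_SECOND = 1e15          # floating-point operations per second
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400

def petaflop_s_days_to_flops(pf_days):
    """Total floating-point operations represented by a petaflop/s-day figure."""
    return pf_days * PETAFLOP_PER_SECOND * SECONDS_PER_DAY

gpt3_total_flops = petaflop_s_days_to_flops(3640)   # about 3.14e23 operations
```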

Applications

Neural networks have achieved state-of-the-art performance across a wide range of domains.

Computer vision

CNNs and vision transformers power image classification, object detection, image segmentation, facial recognition, medical imaging analysis, and autonomous vehicle perception systems. Models like ResNet, YOLO, and ViT have set benchmarks in the field.

Natural language processing

Transformer-based models dominate language understanding and generation tasks, including machine translation, text summarization, question answering, sentiment analysis, and conversational AI. Large language models such as GPT-4, Claude, and Gemini can perform complex reasoning, write code, and engage in multi-turn dialogue.

Speech and audio

Neural networks are central to automatic speech recognition (e.g., OpenAI's Whisper), text-to-speech synthesis (e.g., WaveNet), music generation, and audio classification.

Game playing and decision making

DeepMind's AlphaGo defeated world champion Go player Lee Sedol in 2016 using a combination of deep neural networks and Monte Carlo tree search. AlphaGo Zero later surpassed AlphaGo by training entirely through self-play. Beyond games, AlphaFold and AlphaFold 2 applied deep learning to protein structure prediction, with AlphaFold 2 reaching near-experimental accuracy on many targets; Demis Hassabis and John Jumper of DeepMind shared the 2024 Nobel Prize in Chemistry for this work.

Scientific computing

Neural networks are used in weather forecasting (e.g., Google DeepMind's GraphCast), molecular dynamics simulations, materials discovery, particle physics, and genomics. Physics-informed neural networks (PINNs) incorporate known physical laws as constraints during training.

Healthcare

Applications include medical image analysis (detecting tumors in radiology scans, analyzing retinal images for diabetic retinopathy), drug discovery (predicting molecular properties, generating candidate drug molecules), genomics, and clinical decision support systems.

Finance

Neural networks are used for algorithmic trading, fraud detection, credit scoring, risk assessment, and financial forecasting.

Scaling laws and emergent capabilities

Recent research has revealed remarkably predictable relationships between neural network performance and the resources used for training. Neural scaling laws, first studied systematically by Kaplan et al. at OpenAI in 2020, describe how a model's loss decreases as a power law with increases in model size (number of parameters), dataset size, and amount of compute ^[14].
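
A power-law relationship of this kind can be sketched as a one-line function. The functional form (n_c / n_params) ** alpha follows the general shape described above, but the constants n_c and alpha below are placeholder values chosen for illustration, not fitted values taken from the cited paper.

```python
def power_law_loss(n_params, n_c=1e14, alpha=0.08):
    """Loss as a power law in parameter count; n_c and alpha are placeholders."""
    return (n_c / n_params) ** alpha

# Under a power law, every 10x increase in model size multiplies the loss by the
# same factor, 10 ** -alpha, no matter where on the curve the increase starts.
ratio_small = power_law_loss(1e9) / power_law_loss(1e8)
ratio_large = power_law_loss(1e12) / power_law_loss(1e11)
```

The constant ratio per tenfold increase is what makes such curves appear as straight lines on a log-log plot, and it is why they can be extrapolated to estimate the performance of larger models before training them.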

These scaling laws have guided the development of increasingly large models. The Chinchilla scaling laws (Hoffmann et al., 2022) refined earlier estimates by showing that for a given compute budget, model size and dataset size should be scaled roughly equally, suggesting that many earlier models were under-trained relative to their size.

Perhaps the most striking finding from scaling research is the emergence of capabilities that appear suddenly at certain scales. Emergent abilities are defined as capabilities that are absent in smaller models but appear in larger ones without explicit training for that skill. Examples include chain-of-thought reasoning, in-context learning, and multi-step arithmetic. The mechanisms behind emergence remain an active area of research, with some scholars debating whether the phenomenon reflects true discontinuities or artifacts of evaluation metrics ^[15].

A related trend, described as the "densing law" in recent literature, suggests that capability density (the performance achievable per parameter) doubles approximately every 3.5 months, meaning that equivalent model performance can be achieved with exponentially fewer parameters over time.

Limitations and challenges

Despite their remarkable success, neural networks face several important limitations:

Interpretability: Neural networks are often described as "black boxes" because understanding exactly why a network makes a particular prediction is difficult. The field of explainable AI (XAI) seeks to address this through techniques such as attention visualization, saliency maps, and SHAP values.
Data requirements: Neural networks typically require large amounts of labeled training data. Techniques like transfer learning, few-shot learning, and self-supervised learning aim to reduce this requirement.
Computational cost: Training large models requires enormous computational resources and energy. The environmental impact of training frontier AI models has become a growing concern.
Adversarial vulnerability: Neural networks can be fooled by adversarial examples: small, carefully crafted perturbations to inputs that cause the model to make incorrect predictions with high confidence.
Catastrophic forgetting: When trained on new tasks, neural networks tend to forget previously learned tasks, a problem addressed by continual learning and elastic weight consolidation.
Hallucination: Large language models can generate plausible-sounding but factually incorrect content, presenting challenges for reliability in high-stakes applications.
Bias: Neural networks can learn and amplify biases present in their training data, potentially leading to unfair or discriminatory outputs.

Current state and future directions

As of 2025, neural networks, particularly transformer-based architectures, are at the center of the most significant advances in AI. Foundation models trained on broad datasets can be adapted to a wide range of downstream tasks through fine-tuning or prompting. Multimodal models that process text, images, audio, and video within a single architecture are becoming standard.

Active research areas include improving model efficiency (through quantization, pruning, distillation, and mixture-of-experts architectures), developing better evaluation methods, making neural networks more interpretable, reducing training costs, exploring alternative architectures (such as state-space models like Mamba), and ensuring that increasingly capable systems remain safe and aligned with human values.

References

Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press.
Haykin, S. (2009). *Neural Networks and Learning Machines* (3rd ed.). Pearson.
Rosenblatt, F. (1958). "The perceptron: A probabilistic model for information storage and organization in the brain." *Psychological Review*, 65(6), 386-408.
Minsky, M., & Papert, S. (1969). *Perceptrons: An Introduction to Computational Geometry*. MIT Press.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). "Learning representations by back-propagating errors." *Nature*, 323(6088), 533-536.
LeCun, Y., Boser, B., Denker, J. S., et al. (1989). "Backpropagation applied to handwritten zip code recognition." *Neural Computation*, 1(4), 541-551.
Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function." *Mathematics of Control, Signals, and Systems*, 2(4), 303-314.
Hornik, K. (1991). "Approximation capabilities of multilayer feedforward networks." *Neural Networks*, 4(2), 251-257.
Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). "A fast learning algorithm for deep belief nets." *Neural Computation*, 18(7), 1527-1554.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet classification with deep convolutional neural networks." *Advances in Neural Information Processing Systems*, 25.
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention is all you need." *Advances in Neural Information Processing Systems*, 30.
Hochreiter, S., & Schmidhuber, J. (1997). "Long short-term memory." *Neural Computation*, 9(8), 1735-1780.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., et al. (2014). "Generative adversarial nets." *Advances in Neural Information Processing Systems*, 27.
Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). "Scaling laws for neural language models." arXiv:2001.08361.
Wei, J., Tay, Y., Bommasani, R., et al. (2022). "Emergent abilities of large language models." *Transactions on Machine Learning Research*.
