In machine learning and neural networks, a weight is a learnable numerical parameter that determines the strength of the connection between two neurons. Weights are the primary values that a network adjusts during training in order to map inputs to correct outputs. Together with biases, weights constitute the bulk of a model's trainable parameters, and their values encode everything the network has learned from data.
Imagine you and your friends are voting on what game to play at recess. Some friends are really good at picking fun games, so you trust their opinion more. Other friends always pick boring games, so you listen to them less. A weight is like how much you trust each friend's vote. The computer does the same thing: it gives a bigger "trust number" (weight) to inputs that matter more and a smaller one to inputs that matter less. While learning, the computer keeps adjusting those trust numbers until it gets really good at making the right choice.
A single artificial neuron receives one or more input values and produces a single output. Each input x_i is multiplied by a corresponding weight w_i, and the products are summed together with a bias term b. This weighted sum is then passed through an activation function to produce the neuron's output.
The mathematical expression for a single neuron with n inputs is:
y = f(w₁x₁ + w₂x₂ + ... + wₙxₙ + b)
where f is the activation function, w₁ through wₙ are the weights, x₁ through xₙ are the inputs, and b is the bias. The weight w_i controls how much influence input x_i has on the neuron's output. A large positive weight amplifies the corresponding input, a large negative weight inverts and amplifies it, and a weight near zero effectively silences that input.
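As a minimal illustration, the following sketch (plain Python with NumPy; the weights, inputs, and bias are made-up values) computes a single neuron's output with a sigmoid activation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values: three inputs, three weights, one bias
x = np.array([0.5, -1.2, 3.0])   # inputs x1, x2, x3
w = np.array([0.8, 0.1, -0.4])   # weights w1, w2, w3
b = 0.2                          # bias

# y = f(w1*x1 + w2*x2 + w3*x3 + b)
y = sigmoid(np.dot(w, x) + b)
print(y)
```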
In practice, neurons are organized into layers, and the connections between two adjacent layers are represented compactly as a weight matrix. If a layer has m input neurons and n output neurons, the weight matrix W has dimensions n x m. The output of the layer (before the activation function) is computed as:
z = Wx + b
where x is the input vector, b is the bias vector, and z is the pre-activation output vector. Organizing weights into matrices makes computation efficient because modern hardware (GPUs and TPUs) is optimized for matrix multiplication. A deep neural network with L layers will have L weight matrices, one per layer, plus additional weights inside specialized components such as attention heads or normalization layers.
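Under the same assumptions (NumPy, with illustrative sizes), a full layer reduces to one matrix-vector product:

```python
import numpy as np

m, n = 4, 3                      # m input neurons, n output neurons
W = np.random.randn(n, m)        # weight matrix of shape (n, m)
b = np.zeros(n)                  # bias vector, one entry per output neuron
x = np.random.randn(m)           # input vector

z = W @ x + b                    # pre-activation output, shape (n,)
print(z.shape)                   # (3,)
```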
The terms "weight," "parameter," and "bias" are related but distinct.
| Term | Definition | Role in a neuron |
|---|---|---|
| Weight | A value that multiplies an input signal | Controls the strength of a connection between two neurons |
| Bias | An additive constant applied after the weighted sum | Shifts the activation function left or right, allowing the neuron to fire even when all inputs are zero |
| Parameter | Any trainable value in the model | A superset that includes all weights, all biases, and any other learned values (e.g., scale factors in batch normalization, learned positional encodings) |
When researchers say a model has "175 billion parameters," they are counting every weight and every bias in the network. Weights typically account for the vast majority of parameters because each connection between neurons requires its own weight, whereas each neuron has only a single bias rather than one per connection.
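A rough illustration (the layer sizes are made up) of why weights dominate the parameter count of a fully connected network:

```python
# Hypothetical fully connected network: 784 -> 256 -> 128 -> 10
layer_sizes = [784, 256, 128, 10]

# One weight per connection between adjacent layers
weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
# One bias per neuron in each non-input layer
biases = sum(layer_sizes[1:])

print(weights, biases)  # 234752 weights vs. 394 biases
```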
Before training begins, the weights of a neural network must be assigned initial values. The choice of initialization strategy has a significant impact on whether the network trains successfully, how quickly it converges, and the quality of the final model.
If all weights are initialized to the same value (for example, zero), every neuron in a layer will compute the same output and receive the same gradient during backpropagation. This is known as the symmetry problem: the neurons remain identical throughout training and the network cannot learn diverse features. Random initialization breaks this symmetry by giving each neuron a unique starting point.
However, the scale of the random values also matters. If weights are initialized too large, activations can explode in magnitude as they propagate forward, leading to the exploding gradient problem. If weights are too small, activations shrink toward zero, causing the vanishing gradient problem. Both scenarios make training extremely slow or cause it to fail entirely.
| Method | Distribution | Variance | Best suited for |
|---|---|---|---|
| Random normal | N(0, sigma²) | User-defined | Simple baselines |
| Xavier / Glorot (Glorot and Bengio, 2010) | U(-a, a) or N(0, sigma²) | 2 / (n_in + n_out) | Sigmoid, tanh activations |
| He / Kaiming (He et al., 2015) | N(0, sigma²) | 2 / n_in | ReLU and variants |
| Orthogonal (Saxe et al., 2014) | Semi-orthogonal matrix | Preserves norms | Deep linear networks, RNNs |
| LeCun (LeCun et al., 1998) | N(0, sigma²) | 1 / n_in | SELU activation |
Xavier (Glorot) initialization sets weights so that the variance of activations remains roughly constant across layers. It draws weights from a distribution with variance 2 / (n_in + n_out), where n_in is the number of inputs and n_out is the number of outputs. This approach was designed for networks using sigmoid or tanh activations, under the assumption that the activation behaves roughly linearly around zero.
He (Kaiming) initialization adapts the Xavier approach for ReLU activation functions, which zero out roughly half of their inputs. To compensate, He initialization uses a variance of 2 / n_in, roughly doubling the variance relative to Xavier when fan-in and fan-out are similar. This method has become the default for most modern networks that use ReLU or its variants (Leaky ReLU, ELU, GELU).
Orthogonal initialization constructs weight matrices that are orthogonal (or semi-orthogonal for non-square matrices). Because orthogonal matrices preserve the norm of vectors they multiply, gradients neither grow nor shrink as they flow through the network. This property is especially useful in recurrent neural networks and very deep feedforward networks.
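The following sketch (NumPy, with an illustrative layer shape) draws weights according to the Xavier and He formulas from the table above:

```python
import numpy as np

n_in, n_out = 512, 256                     # illustrative fan-in and fan-out

# Xavier / Glorot: variance 2 / (n_in + n_out), intended for sigmoid/tanh layers
xavier_std = np.sqrt(2.0 / (n_in + n_out))
W_xavier = np.random.randn(n_out, n_in) * xavier_std

# He / Kaiming: variance 2 / n_in, intended for ReLU layers
he_std = np.sqrt(2.0 / n_in)
W_he = np.random.randn(n_out, n_in) * he_std

print(W_xavier.std(), W_he.std())          # empirical stds close to the targets
```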
Weights are updated through an iterative optimization process. During each training step, the network makes a prediction, a loss function measures the error, and backpropagation computes the gradient of the loss with respect to each weight. An optimizer then updates the weights to reduce the loss.
The simplest update rule is stochastic gradient descent (SGD):
w = w - lr * (dL/dw)
where w is the weight, lr is the learning rate, and dL/dw is the partial derivative of the loss L with respect to that weight. The learning rate controls the step size: too large and the optimization overshoots; too small and convergence is slow.
Modern optimizers like Adam, AdaGrad, and RMSProp maintain additional state (such as running averages of gradients and squared gradients) to adapt the effective learning rate for each weight individually. These adaptive methods often converge faster than plain SGD, especially in the early stages of training.
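A bare-bones sketch of the SGD rule above, fitting a single weight to toy data with a squared-error loss (all values are illustrative):

```python
import numpy as np

# Toy data generated from y = 3x; the learned weight should approach 3.0
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

w = 0.0        # initial weight
lr = 0.01      # learning rate

for step in range(200):
    pred = w * x
    grad = np.mean(2.0 * (pred - y) * x)   # dL/dw for mean squared error
    w = w - lr * grad                      # SGD update: w = w - lr * dL/dw

print(w)       # close to 3.0
```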
Weight decay is a regularization technique that penalizes large weight values, encouraging the network to find simpler solutions that generalize better. The standard approach adds a term proportional to the sum of squared weights to the loss function:
L_total = L_original + (lambda / 2) * sum(w²)
where lambda is the regularization strength. The effect on the gradient is straightforward: each weight receives an additional gradient contribution of lambda * w, so a plain gradient-descent update multiplies each weight by (1 - lr * lambda) before applying the usual gradient step. This is why the technique is called "weight decay": the weights literally shrink at each step.
For standard SGD, L2 regularization and weight decay are mathematically equivalent. However, Loshchilov and Hutter (2019) demonstrated that this equivalence breaks down with adaptive optimizers like Adam, because the L2 penalty's gradient is rescaled by the optimizer's per-weight step sizes. They introduced AdamW, which applies weight decay directly to the weights rather than through the gradient, restoring the intended regularization behavior and improving generalization on image classification benchmarks.
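The distinction can be sketched with a single simplified update step (plain Python; the per-weight scaling below is a deliberately crude stand-in for Adam's moment estimates):

```python
import numpy as np

w = np.array([0.5, -1.5, 2.0])       # current weights (illustrative)
grad = np.array([0.1, -0.2, 0.3])    # gradient of the original loss
lr, lam = 0.1, 0.01                  # learning rate and weight-decay strength

# Crude stand-in for an adaptive optimizer's per-weight step sizes
# (real Adam uses running averages of gradients and squared gradients)
scale = 1.0 / (np.abs(grad) + 1e-8)

# L2 regularization: the penalty gradient lam * w is rescaled along with the
# loss gradient, so strongly-updated weights end up regularized less
w_l2 = w - lr * scale * (grad + lam * w)

# Decoupled weight decay (the AdamW idea): only the loss gradient is scaled;
# the decay term is applied directly to the weights
w_decoupled = w - lr * scale * grad - lr * lam * w

print(w_l2)
print(w_decoupled)   # the two updates differ once adaptive scaling is involved
```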
Weight sharing is a design principle in which multiple parts of a network use the same set of weights rather than maintaining independent copies. This reduces the total number of parameters, acts as a form of regularization, and encodes useful assumptions about the structure of the data.
The most well-known example of weight sharing occurs in convolutional neural networks (CNNs). A convolutional filter (kernel) is a small weight matrix that slides across the entire input image. The same weights are applied at every spatial location, which means the network learns to detect features (edges, textures, shapes) regardless of where they appear. This translation invariance is a direct consequence of weight sharing and is one of the key reasons CNNs are so effective for image tasks. Without sharing, a CNN with a 3x3 filter applied to a 224x224 image would need millions of additional parameters.
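A minimal sketch of this sharing (pure NumPy, no padding or strides; the image and kernel values are random): the same nine weights are reused at every spatial position.

```python
import numpy as np

image = np.random.rand(224, 224)     # illustrative grayscale input
kernel = np.random.randn(3, 3)       # one shared 3x3 weight matrix: 9 weights in total

H, W = image.shape
out = np.zeros((H - 2, W - 2))       # "valid" output, no padding

for i in range(H - 2):
    for j in range(W - 2):
        # The same 9 weights are applied at every (i, j) location
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)

print(kernel.size, out.shape)        # 9 shared weights produce a 222x222 feature map
```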
In transformer architectures, weight sharing has been explored as a compression technique. Research on models like ALBERT (Lan et al., 2020) showed that sharing weights across all transformer layers, combined with a factorized embedding, can reduce the parameter count dramatically (for example, ALBERT-base has about 12 million parameters compared with BERT-base's 108 million) while retaining most of the model's performance. The trade-off is typically a modest decrease in accuracy in exchange for a large reduction in memory and storage requirements.
Weight tying is a specific form of weight sharing in which two logically distinct components of a model share the same weight matrix. The most common example in large language models is tying the input embedding matrix to the output projection matrix.
In a language model, the input embedding converts token IDs into dense vectors, while the output projection converts hidden states back into a probability distribution over the vocabulary. Press and Wolf (2017) showed that using the same matrix for both operations reduces perplexity and cuts the number of parameters significantly, because the embedding matrix is often one of the largest components of the model. Weight tying is used in many prominent architectures including GPT-2, T5, and several BERT variants.
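A schematic NumPy illustration of the idea (vocabulary size, model dimension, and token IDs are made up): the same matrix E serves as the input embedding and, transposed, as the output projection.

```python
import numpy as np

vocab_size, d_model = 1000, 64
E = np.random.randn(vocab_size, d_model) * 0.02   # the single shared weight matrix

token_ids = np.array([3, 17, 42])

# Input side: embedding lookup turns token IDs into dense vectors
embeddings = E[token_ids]                          # shape (3, d_model)

# ... the transformer layers would transform these hidden states ...
hidden = embeddings                                # placeholder for the model body

# Output side: the same matrix, transposed, maps hidden states to vocabulary logits
logits = hidden @ E.T                              # shape (3, vocab_size)
print(logits.shape)
```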
More recent research has revealed a subtle trade-off: tied embeddings tend to be shaped primarily by output gradients during training, which can negatively affect input representations in the early layers of the network.
Weight pruning is a model compression technique that removes weights from a trained network to reduce its size and computational cost without significantly harming accuracy. The simplest approach is magnitude-based pruning, which sets weights below a certain absolute threshold to zero, effectively deleting those connections.
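A minimal sketch of magnitude-based pruning on one weight matrix (the shape and threshold are arbitrary):

```python
import numpy as np

W = np.random.randn(256, 256)        # illustrative trained weight matrix
threshold = 0.5                      # magnitude cutoff (chosen arbitrarily here)

mask = np.abs(W) >= threshold        # keep only large-magnitude weights
W_pruned = W * mask                  # small weights are zeroed: those connections are removed

sparsity = 1.0 - mask.mean()
print(f"{sparsity:.1%} of weights pruned")   # roughly 38% for standard-normal weights
```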
Han et al. (2015) demonstrated that pruning can be remarkably effective. They reduced the parameter count of AlexNet from 61 million to 6.7 million (a 9x compression) with no loss in accuracy on ImageNet. Further work on VGG-16 achieved similar results. The typical pruning workflow follows a train-prune-retrain cycle: the network is first trained to convergence, small-magnitude weights are removed, and the remaining weights are fine-tuned to recover any lost accuracy.
The lottery ticket hypothesis (Frankle and Carbin, 2019) offered a compelling theoretical perspective on pruning. It proposes that dense, randomly initialized networks contain sparse subnetworks ("winning tickets") that, when trained in isolation from their original initialization, can match the full network's accuracy. The authors consistently found winning tickets that were less than 10 to 20 percent of the size of the original network.
Weight quantization reduces the numerical precision of weights to decrease memory usage and accelerate inference. Neural network weights are typically stored as 32-bit floating-point numbers (FP32). Quantization converts them to lower-precision formats, most commonly 16-bit (FP16/BF16), 8-bit integers (INT8), or even 4-bit integers (INT4).
| Precision | Bits per weight | Memory vs. FP32 | Typical accuracy impact |
|---|---|---|---|
| FP32 | 32 | 1x (baseline) | None (full precision) |
| FP16 / BF16 | 16 | 2x reduction | Negligible in most cases |
| INT8 | 8 | 4x reduction | Minimal with calibration |
| INT4 | 4 | 8x reduction | Noticeable; requires careful tuning |
Two main approaches are used. Post-training quantization (PTQ) converts a pre-trained FP32 model to lower precision without additional training. It is fast and easy to apply but may suffer accuracy loss, especially at very low bit widths. Quantization-aware training (QAT) simulates quantization effects during training, allowing the network to adapt and maintain higher accuracy at lower precision.
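A simplified post-training quantization sketch (symmetric, per-tensor INT8 in NumPy; production schemes typically use calibration data and per-channel scales):

```python
import numpy as np

W = np.random.randn(64, 64).astype(np.float32)   # FP32 weights (illustrative)

# Symmetric quantization: map the largest absolute weight to 127
scale = np.abs(W).max() / 127.0
W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)

# Dequantize for computation (or keep INT8 and use integer kernels)
W_dequant = W_int8.astype(np.float32) * scale

print(W.nbytes, W_int8.nbytes)       # 16384 vs. 4096 bytes: a 4x memory reduction
print(np.abs(W - W_dequant).max())   # small per-weight quantization error
```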
Quantization has become essential for deploying large language models on consumer hardware. A model that occupies 32 GB in FP32 can be compressed to just 4 GB in INT4, making it feasible to run on devices with limited memory.
Pre-trained weights are the saved weight values from a network that has already been trained on a large dataset. Rather than training a new network from scratch, practitioners can load these pre-trained weights and adapt them to a new task through transfer learning.
Transfer learning works because the early layers of neural networks tend to learn general features (edges, textures, basic language patterns) that are useful across many tasks. Only the later, task-specific layers need significant adjustment. Two common strategies exist: feature extraction, in which the pre-trained weights are frozen and only a new task-specific output layer is trained, and fine-tuning, in which some or all of the pre-trained weights continue to be updated on the new task, typically with a small learning rate.
The rise of foundation models (BERT, GPT, ViT, CLIP) has made pre-trained weights the standard starting point for almost all modern deep learning applications. Organizations like Hugging Face host thousands of pre-trained weight files (often called "checkpoints") that researchers and developers download and fine-tune for specific applications.
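For instance, assuming the Hugging Face transformers library (and its PyTorch backend) is installed, a pre-trained checkpoint can be loaded in a few lines; the model name here is just one example:

```python
from transformers import AutoModel, AutoTokenizer

# Downloads the pre-trained weights (checkpoint) on first use and caches them locally
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# These weights can now be fine-tuned on a new task instead of training from scratch
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.0f}M parameters loaded")
```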
Visualizing weights can provide insight into what a neural network has learned. In convolutional neural networks, the filters in the first layer can be displayed as small images. Well-trained networks typically learn filters that resemble edge detectors, color blobs, and Gabor-like patterns, which aligns with how the human visual system processes information.
For deeper layers and other architectures, direct visualization of weight matrices is less informative because individual weights do not correspond to interpretable features. Researchers instead use techniques such as activation maximization (synthesizing inputs that strongly activate a particular neuron), saliency and attribution maps that highlight which input features drive a prediction, and probing or dimensionality reduction of the learned representations.
These visualization and interpretation tools are part of the broader field of explainable AI and are important for building trust in neural network predictions.