A weighted sum is a mathematical operation that combines multiple input values by multiplying each value by a corresponding weight (coefficient) and then summing the results. In machine learning and deep learning, weighted sums serve as the foundational computation inside neural networks, linear regression models, attention mechanisms, ensemble methods, and many other algorithms. Nearly every prediction a machine learning model makes can be traced back to one or more weighted sum operations.
Imagine you are making a smoothie. You add three kinds of fruit: strawberries, bananas, and blueberries. But you don't add the same amount of each fruit. You add a big scoop of strawberries because you love them, a medium scoop of bananas, and just a tiny handful of blueberries. The "scoop size" for each fruit is like a weight. A weighted sum is what you get when you multiply each fruit amount by its scoop size and then mix everything together. Computers do the same thing with numbers: they take a bunch of inputs, decide how much each one matters (the weight), multiply, and add it all up to get one answer.
Given an input vector x = [x₁, x₂, ..., xₙ] and a weight vector w = [w₁, w₂, ..., wₙ], the weighted sum (also called a linear combination) is defined as:
z = w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ = Σᵢ wᵢ·xᵢ
In vector notation this is the dot product:
z = wᵀx
When a bias term b is included, the expression becomes:
z = wᵀx + b = w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ + b
The bias allows the output to be nonzero even when all inputs are zero, giving the model additional flexibility.
| Property | Description |
|---|---|
| Linearity | A weighted sum is a linear operation. Scaling any input by a constant scales its contribution proportionally. |
| Commutativity of addition | The order in which terms are summed does not affect the result. |
| Associativity | Grouping of terms can be rearranged without changing the outcome. |
| Dimensionality reduction | A weighted sum maps a vector of n values down to a single scalar. |
| Differentiability | The weighted sum is differentiable with respect to both the weights and the inputs, which is why gradient descent can optimize it. |
Consider three input values and their corresponding weights:
| Input (xᵢ) | Weight (wᵢ) | Product (wᵢ · xᵢ) |
|---|---|---|
| 3.0 | 2.1 | 6.30 |
| 1.5 | 0.7 | 1.05 |
| -2.0 | 1.3 | -2.60 |
| Sum | | 4.75 |
The weighted sum is 4.75. If a bias of 0.5 were added, the result would be 5.25.
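A minimal sketch of this calculation in NumPy, using the same values as the table above:

```python
import numpy as np

x = np.array([3.0, 1.5, -2.0])   # inputs
w = np.array([2.1, 0.7, 1.3])    # weights
b = 0.5                          # bias

z = np.dot(w, x)        # 2.1*3.0 + 0.7*1.5 + 1.3*(-2.0) = 4.75
z_with_bias = z + b     # 4.75 + 0.5 = 5.25
print(z, z_with_bias)
```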
The idea of combining values with different weights has roots in statistics and operations research stretching back centuries; the concept of a weighted average appears in the work of early astronomers who combined observations of differing reliability. In the context of artificial intelligence, the weighted sum became central with the development of the first artificial neuron models.
In 1943, Warren McCulloch and Walter Pitts proposed the McCulloch-Pitts neuron, a binary threshold model where a neuron fires if the sum of its excitatory inputs exceeds a threshold. While this model did not yet use continuously valued weights, it established the principle of summing inputs and comparing against a threshold.
In 1957, Frank Rosenblatt introduced the perceptron, which extended the McCulloch-Pitts model by assigning learned, continuously valued weights to each input. The perceptron computes a weighted sum of its inputs, adds a bias term, and passes the result through a step function. This was the first trainable model based on the weighted sum, and it laid the groundwork for all subsequent neural network architectures.
In 1960, Bernard Widrow and his student Ted Hoff introduced the ADALINE (Adaptive Linear Neuron), which formalized the bias as an additional weight on a constant input of +1 and used the Widrow-Hoff (least mean squares) learning rule to adjust the weights. This made the weighted sum computation and its optimization through gradient-based methods a standard approach in adaptive signal processing and, later, in backpropagation-trained networks.
The weighted sum is the core computation performed by every neuron (node) in a neural network. Understanding how neurons use weighted sums is essential to understanding how neural networks learn.
A single artificial neuron receives a set of input values, multiplies each by a learned weight, sums the products, adds a bias, and passes the result through an activation function:
output = φ(w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ + b)
where φ is the activation function (such as ReLU, sigmoid, or tanh).
The weighted sum portion (before the activation function) is sometimes called the pre-activation value or logit. The activation function introduces nonlinearity, allowing the network to learn complex patterns that a purely linear weighted sum cannot represent.
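As an illustrative sketch (not drawn from any particular library), a single neuron with ReLU as the activation function φ and made-up weights can be written as:

```python
import numpy as np

def neuron(x, w, b):
    """Compute φ(w·x + b) for a single neuron, with ReLU as φ."""
    z = np.dot(w, x) + b          # weighted sum plus bias (pre-activation)
    return np.maximum(z, 0.0)     # ReLU activation

x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, 0.4])
b = 0.2
print(neuron(x, w, b))            # 1.68
```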
In a feedforward neural network with multiple layers, each neuron in each layer computes a weighted sum of outputs from the previous layer. For a layer with m neurons receiving input from n neurons in the previous layer, the computation can be expressed in matrix form:
z = Wx + b
where W is an m × n weight matrix, x is the input vector of length n, and b is the bias vector of length m. Each row of W contains the weights for one neuron, and each element of z is the weighted sum (pre-activation) for that neuron.
This matrix formulation makes it possible to compute all weighted sums in a layer simultaneously using optimized linear algebra libraries, which is why modern deep learning runs efficiently on GPUs.
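A brief sketch of the matrix form z = Wx + b, assuming NumPy and arbitrary example sizes (m = 4 neurons, n = 3 inputs):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4                      # input size, number of neurons in the layer
W = rng.normal(size=(m, n))      # one row of weights per neuron
b = np.zeros(m)                  # bias vector
x = rng.normal(size=n)           # input vector

z = W @ x + b                    # all m weighted sums computed at once
print(z.shape)                   # (4,) -- one pre-activation per neuron
```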
In a convolutional neural network (CNN), the convolution operation is itself a localized weighted sum. A small filter (kernel) slides across the input (such as an image), and at each position the filter values are multiplied element-wise with the corresponding input values. Those products are then summed to produce a single output value. Formally, for a 2D convolution at position (i, j) with a kernel K of size k × k:
output(i,j) = Σₘ Σₙ K(m,n) · input(i+m, j+n) + b
The key difference from a fully connected layer is that the same set of weights (the kernel) is shared across all spatial positions, reducing the total number of parameters. But the underlying operation at each position is still a weighted sum.
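To make the "localized weighted sum" concrete, here is a deliberately naive sketch of a valid 2D convolution (in the cross-correlation form of the formula above); real frameworks use heavily optimized implementations instead:

```python
import numpy as np

def conv2d_naive(image, kernel, bias=0.0):
    """Slide a k x k kernel over a 2D input; each output value is a weighted sum."""
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + k, j:j + k]
            out[i, j] = np.sum(kernel * patch) + bias   # weighted sum at (i, j)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0   # simple averaging filter
print(conv2d_naive(image, kernel))
```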
In a recurrent neural network (RNN), each time step computes a weighted sum that combines the current input with the previous hidden state:
hₜ = φ(Wₓ·xₜ + Wₕ·hₜ₋₁ + b)
Here, two separate weighted sums are computed (one for the input xₜ and one for the previous hidden state hₜ₋₁) and then added together before the activation function φ (typically tanh).
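A minimal sketch of one recurrent step under these definitions, assuming NumPy, tanh as φ, and illustrative dimensions:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One RNN time step: two weighted sums (input and hidden) added, then tanh."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(1)
input_size, hidden_size = 3, 5
W_x = rng.normal(size=(hidden_size, input_size))
W_h = rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(4, input_size)):   # a sequence of 4 input vectors
    h = rnn_step(x_t, h, W_x, W_h, b)
print(h)
```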
LSTM networks extend this with gating mechanisms. Each gate (input gate, forget gate, output gate) computes its own weighted sum of the input and previous hidden state, then applies a sigmoid activation to produce values between 0 and 1 that control information flow. The cell state update itself is a weighted combination of the old cell state and a candidate value, where the "weights" are the gate outputs. As the LSTM Wikipedia article notes, each gate "can be thought of as a standard neuron in a feed-forward neural network: that is, they compute an activation of a weighted sum."
Linear regression is perhaps the simplest machine learning model, and its prediction is exactly a weighted sum plus a bias:
ŷ = w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ + b = wᵀx + b
The weights determine the influence of each input feature on the predicted output. The bias b (also called the intercept) determines the predicted value when all features are zero. During training, the weights and bias are optimized to minimize a loss function such as mean squared error.
Logistic regression uses a weighted sum followed by the sigmoid function to produce a probability for binary classification:
p(y=1|x) = σ(wᵀx + b)
where σ(z) = 1 / (1 + e⁻ᶻ). The weighted sum z = wᵀx + b is called the log-odds (logit), and the sigmoid converts it to a probability between 0 and 1. For multi-class classification, the weighted sum is extended with one set of weights per class, and the softmax function replaces the sigmoid.
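A sketch of the logistic regression forward pass (weighted sum followed by the sigmoid), with made-up weights and inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """p(y=1 | x) = sigmoid(w·x + b); the weighted sum z is the log-odds."""
    z = np.dot(w, x) + b
    return sigmoid(z)

w = np.array([1.2, -0.7, 0.3])
b = -0.5
x = np.array([2.0, 1.0, 0.5])
print(predict_proba(x, w, b))   # a probability between 0 and 1 (about 0.79)
```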
The attention mechanism, introduced by Bahdanau et al. (2014) and refined in the Transformer architecture (Vaswani et al., 2017), computes its output as a weighted sum of value vectors.
In scaled dot-product attention, the output for each query is a weighted sum of the value vectors:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
The process works as follows:
1. Each query vector is compared against every key vector via dot products, producing a matrix of similarity scores QKᵀ.
2. The scores are scaled by 1/√dₖ to keep their magnitude stable as the key dimension grows.
3. A softmax over each row converts the scaled scores into attention weights that are non-negative and sum to 1.
4. Each output is the weighted sum of the value vectors, using those attention weights.
This means that each output token in a Transformer is literally a weighted sum of value representations, with the weights determined dynamically based on query-key similarity.
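A compact NumPy sketch of scaled dot-product attention for a single head (the dimensions are arbitrary examples):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted sum of rows of V, weighted by softmax(QKᵀ/√dₖ)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # query-key similarity scores
    weights = softmax(scores, axis=-1)   # attention weights, each row sums to 1
    return weights @ V                   # weighted sums of the value vectors

rng = np.random.default_rng(2)
Q = rng.normal(size=(4, 8))    # 4 queries, d_k = 8
K = rng.normal(size=(6, 8))    # 6 keys
V = rng.normal(size=(6, 16))   # 6 values, d_v = 16
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 16)
```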
In self-attention, the queries, keys, and values all come from the same input sequence. Each token attends to every other token (including itself), producing a context-aware representation that is a weighted sum of all token representations in the sequence. This allows the model to capture long-range dependencies without the sequential processing constraints of RNNs.
In cross-attention (used in encoder-decoder models for machine translation, summarization, and similar tasks), the queries come from the decoder and the keys and values come from the encoder. The decoder output at each position is a weighted sum of encoder representations, where the attention weights indicate which parts of the source input are most relevant.
Ensemble learning combines the predictions of multiple models to produce a single prediction. Many ensemble strategies use weighted sums.
In a weighted average ensemble, the final prediction is a weighted sum of individual model predictions:
ŷ_ensemble = w₁·ŷ₁ + w₂·ŷ₂ + ... + wₖ·ŷₖ
where ŷᵢ is the prediction of the i-th model and wᵢ is its weight. Typically, the weights are constrained to be non-negative and sum to 1, so they can be interpreted as the fraction of trust placed in each model. More accurate models receive higher weights.
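A minimal sketch of a weighted average ensemble over hypothetical model predictions:

```python
import numpy as np

# Hypothetical predictions from three regression models on the same five inputs.
preds = np.array([
    [2.1, 3.4, 0.9, 5.0, 1.2],   # model 1
    [2.0, 3.6, 1.1, 4.8, 1.0],   # model 2
    [2.4, 3.1, 0.8, 5.3, 1.5],   # model 3
])

weights = np.array([0.5, 0.3, 0.2])   # non-negative and summing to 1
ensemble = weights @ preds            # weighted sum of the models' predictions
print(ensemble)
```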
The weights can be determined through several methods:
| Method | Description |
|---|---|
| Uniform averaging | All models receive equal weight (wᵢ = 1/k). This is a simple baseline. |
| Performance-based | Weights are set proportional to each model's accuracy on a validation set. |
| Optimization-based | Weights are found by minimizing ensemble error on a validation set using numerical optimization (e.g., scipy.optimize). |
| Stacking | A meta-learner (often linear regression or logistic regression) is trained to learn optimal weights from model outputs. |
In gradient boosting, the final prediction is a weighted sum of predictions from sequentially trained weak learners (typically decision trees):
F(x) = Σₘ γₘ · hₘ(x)
where hₘ is the m-th tree and γₘ is its weight (shrinkage coefficient or learning rate). Each new tree is trained to correct the residual errors of the weighted sum of all previous trees.
Random forests use an unweighted average (equal weights) of predictions from many independently trained decision trees. For classification, each tree votes, and the class with the most votes wins. For regression, predictions are simply averaged. While the weights are uniform, the operation is still a weighted sum where all weights equal 1/k.
In support vector machines (SVMs), the decision function for a new input x is a weighted sum of kernel evaluations over the training data:
f(x) = Σᵢ αᵢ · yᵢ · K(xᵢ, x) + b
where αᵢ are the learned dual coefficients, yᵢ are the training labels, K is the kernel function, and the sum is taken over all training examples (though in practice most αᵢ are zero, and only the support vectors contribute). This is a weighted sum where the weights αᵢ·yᵢ are determined by the optimization problem that maximizes the margin between classes.
The mixture of experts (MoE) architecture uses a gating network to compute a weighted sum of outputs from multiple expert sub-networks:
y = Σᵢ g(x)ᵢ · Eᵢ(x)
where Eᵢ(x) is the output of the i-th expert network and g(x)ᵢ is the gating weight for that expert, produced by a gating network (typically a linear layer followed by softmax). The gating weights sum to 1 and determine how much each expert contributes to the final output.
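A toy sketch of a dense mixture of experts, in which the gating network's softmax output weights each expert's prediction (the experts here are simple linear maps, and all names and sizes are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(3)
num_experts, in_dim, out_dim = 4, 6, 2

expert_W = rng.normal(size=(num_experts, out_dim, in_dim))   # one linear expert each
gate_W = rng.normal(size=(num_experts, in_dim))              # gating network weights

x = rng.normal(size=in_dim)
g = softmax(gate_W @ x)                          # gating weights, sum to 1
expert_outputs = expert_W @ x                    # shape (num_experts, out_dim)
y = (g[:, None] * expert_outputs).sum(axis=0)    # weighted sum of expert outputs
print(g, y)
```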
In sparse MoE models (such as those described by Shazeer et al., 2017), only the top-k experts receive nonzero gating weights, making the weighted sum sparse. This allows the model to have a very large total number of parameters while activating only a small subset for any given input, achieving computational efficiency.
In graph neural networks (GNNs), the message-passing framework uses weighted sums to aggregate information from neighboring nodes. Each node updates its representation by computing a weighted sum of messages from its neighbors:
hᵥ = φ(Σᵤ∈N(v) wᵥᵤ · mᵤ)
where N(v) is the set of neighbors of node v, mᵤ is the message from neighbor u, and wᵥᵤ is the aggregation weight.
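A sketch of one round of this aggregation on a small hand-written graph, with the identity used in place of φ for simplicity; the weight values are arbitrary examples:

```python
import numpy as np

# weights[v, u] is the aggregation weight w_vu (zero when u is not a neighbor of v).
weights = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.7],
    [0.0, 0.0, 1.0, 0.0],
])

messages = np.array([   # one message vector per node
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [2.0, -1.0],
])

# Each node's new representation is a weighted sum of its neighbors' messages.
h = weights @ messages
print(h)
```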
Different GNN architectures define these weights differently:
| Architecture | Aggregation weights |
|---|---|
| GCN (Kipf & Welling, 2017) | Weights are determined by the graph structure (inverse square root of node degrees). |
| GAT (Velickovic et al., 2018) | Weights are learned attention coefficients, computed from node features. |
| GraphSAGE (Hamilton et al., 2017) | Uses mean, LSTM, or max pooling aggregation; mean aggregation is an equally weighted sum. |
| GIN (Xu et al., 2019) | Uses sum aggregation (all weights equal to 1) for maximum expressiveness. |
Weighted sums appear in the formulation and modification of loss functions.
For imbalanced classification problems, weighted cross-entropy assigns different weights to different classes:
L = -Σᵢ wᵧᵢ · [yᵢ · log(ŷᵢ) + (1-yᵢ) · log(1-ŷᵢ)]
where wᵧᵢ is the class weight for the true label of sample i. Underrepresented classes receive higher weights, which penalizes misclassifications of minority classes more heavily and helps the model learn to classify them correctly.
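A small sketch of weighted binary cross-entropy, assuming NumPy and made-up class weights (here the minority class 1 is weighted three times as heavily):

```python
import numpy as np

def weighted_bce(y_true, y_pred, class_weights, eps=1e-12):
    """Binary cross-entropy where each term is scaled by the weight of the sample's true class."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_sample = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    sample_weights = np.where(y_true == 1, class_weights[1], class_weights[0])
    # Sum over samples, matching the formula above (frameworks often average instead).
    return np.sum(sample_weights * per_sample)

y_true = np.array([1, 0, 0, 0, 1])
y_pred = np.array([0.7, 0.2, 0.1, 0.4, 0.3])
print(weighted_bce(y_true, y_pred, class_weights={0: 1.0, 1: 3.0}))
```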
When training a model on multiple tasks simultaneously, the overall loss is often a weighted sum of individual task losses:
L_total = w₁·L₁ + w₂·L₂ + ... + wₖ·Lₖ
The weights control how much the model prioritizes each task during training. These weights can be set manually, or learned automatically using methods such as uncertainty weighting (Kendall et al., 2018).
The softmax function, used extensively in classification output layers and attention mechanisms, is closely related to weighted sums. Before softmax is applied, the network computes a weighted sum (logit) for each class:
zⱼ = wⱼᵀ · h + bⱼ
where h is the representation from the previous layer. The softmax then converts these logits into a probability distribution:
p(class j) = exp(zⱼ) / Σₖ exp(zₖ)
The output probabilities sum to 1, and the class with the highest logit (weighted sum) receives the highest probability.
The process of learning weights through gradient descent relies on computing gradients of the loss with respect to the weights in each weighted sum.
For a weighted sum z = wᵀx + b, the partial derivatives are straightforward:
∂z/∂wᵢ = xᵢ
∂z/∂b = 1
∂z/∂xᵢ = wᵢ
These simple gradients are what make backpropagation computationally tractable. The gradient of the loss with respect to each weight is proportional to the corresponding input, and the gradient with respect to each input is proportional to the corresponding weight.
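These derivatives are easy to verify numerically; a sketch comparing a finite-difference estimate against the analytic gradient, using the earlier example values:

```python
import numpy as np

w = np.array([2.1, 0.7, 1.3])
x = np.array([3.0, 1.5, -2.0])
b = 0.5

def z(w, x, b):
    return np.dot(w, x) + b

# Analytic gradients of z = w·x + b
grad_w = x          # ∂z/∂wᵢ = xᵢ
grad_x = w          # ∂z/∂xᵢ = wᵢ
grad_b = 1.0        # ∂z/∂b = 1

# Finite-difference check for the first weight
eps = 1e-6
numeric = (z(w + eps * np.eye(3)[0], x, b) - z(w, x, b)) / eps
print(numeric, grad_w[0])   # both close to x₁ = 3.0
```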
During training, weights are updated in the direction that reduces the loss:
wᵢ ← wᵢ - η · ∂L/∂wᵢ
where η is the learning rate. The chain rule connects the loss gradient to the weighted sum gradient through intervening activation functions and subsequent layers.
Outside of machine learning, the weighted sum model (WSM) is one of the most widely used methods in multi-criteria decision analysis (MCDA). Given m alternatives evaluated on n criteria, the overall score for alternative i is:
Sᵢ = Σⱼ wⱼ · aᵢⱼ
where wⱼ is the importance weight for criterion j and aᵢⱼ is the performance score of alternative i on criterion j. The alternative with the highest weighted sum is selected as the best option. This method, also known as simple additive weighting (SAW), requires all criteria to be expressed in the same unit for results to be meaningful.
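A brief sketch of simple additive weighting on a made-up decision problem (three alternatives scored on three criteria, all on a common 0-10 scale):

```python
import numpy as np

criteria_weights = np.array([0.5, 0.3, 0.2])   # importance weights, sum to 1

# Rows: alternatives; columns: performance scores aᵢⱼ on each criterion.
scores = np.array([
    [8.0, 6.0, 7.0],   # alternative A
    [5.0, 9.0, 8.0],   # alternative B
    [7.0, 7.0, 5.0],   # alternative C
])

overall = scores @ criteria_weights              # Sᵢ = Σⱼ wⱼ·aᵢⱼ
print(overall, "best alternative:", np.argmax(overall))
```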
| Operation | Formula | Key difference from weighted sum |
|---|---|---|
| Weighted sum | z = Σ wᵢxᵢ | Baseline operation |
| Weighted average | z = Σ wᵢxᵢ / Σ wᵢ | Normalized by sum of weights; output scale matches input scale |
| Dot product | z = aᵀb | Equivalent to weighted sum when one vector is treated as weights |
| Convex combination | z = Σ wᵢxᵢ, where wᵢ ≥ 0 and Σ wᵢ = 1 | Weights are non-negative and sum to 1; result lies within the convex hull of inputs |
| Element-wise product (Hadamard) | z = w ⊙ x | Produces a vector, not a scalar; no summation step |
| Outer product | Z = wxᵀ | Produces a matrix; captures all pairwise products |