# Weighted Sum

> Source: https://aiwiki.ai/wiki/weighted_sum
> Updated: 2026-04-28
> Categories: Machine Learning, Mathematics, Neural Networks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **weighted sum** is a mathematical operation that combines multiple input values by multiplying each value by a corresponding weight (coefficient) and then summing the results. In [machine learning](/wiki/machine_learning) and [deep learning](/wiki/deep_learning), weighted sums serve as the foundational computation inside [neural networks](/wiki/neural_network), [linear regression](/wiki/linear_regression) models, [attention mechanisms](/wiki/attention), [ensemble methods](/wiki/ensemble_learning), and many other algorithms. Almost every prediction made by a neural network or linear model can be traced back to one or more weighted sum operations.

## Explain like I'm 5 (ELI5)

Imagine you are making a smoothie. You add three kinds of fruit: strawberries, bananas, and blueberries. But you don't add the same amount of each fruit. You add a big scoop of strawberries because you love them, a medium scoop of bananas, and just a tiny handful of blueberries. The "scoop size" for each fruit is like a weight. A weighted sum is what you get when you multiply each fruit amount by its scoop size and then mix everything together. Computers do the same thing with numbers: they take a bunch of inputs, decide how much each one matters (the weight), multiply, and add it all up to get one answer.

## Mathematical definition

Given an input vector **x** = [x₁, x₂, ..., xₙ] and a weight vector **w** = [w₁, w₂, ..., wₙ], the weighted sum (also called a **linear combination**) is defined as:

```
z = w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ = Σᵢ wᵢ·xᵢ
```

In vector notation this is the [dot product](/wiki/dot_product):

```
z = wᵀx
```

When a [bias](/wiki/bias) term *b* is included, the expression becomes:

```
z = wᵀx + b = w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ + b
```

The bias allows the output to be nonzero even when all inputs are zero, giving the model additional flexibility.

### Properties

| Property | Description |
|---|---|
| Linearity | A weighted sum is a linear operation. Scaling any input by a constant scales its contribution proportionally. |
| Commutativity of addition | The order in which terms are summed does not affect the result. |
| Associativity | Grouping of terms can be rearranged without changing the outcome. |
| Dimensionality reduction | A weighted sum maps a vector of *n* values down to a single scalar. |
| Differentiability | The weighted sum is differentiable with respect to both the weights and the inputs, which is why [gradient descent](/wiki/gradient_descent) can optimize it. |

## Worked example

Consider three input values and their corresponding weights:

| Input (xᵢ) | Weight (wᵢ) | Product (wᵢ · xᵢ) |
|---|---|---|
| 3.0 | 2.1 | 6.30 |
| 1.5 | 0.7 | 1.05 |
| -2.0 | 1.3 | -2.60 |
| **Sum** | | **4.75** |

The weighted sum is 4.75. If a bias of 0.5 were added, the result would be 5.25.
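The arithmetic in the table can be checked with a few lines of Python. This is a minimal sketch; the function name `weighted_sum` is chosen here for illustration and is not part of any library.

```python
def weighted_sum(inputs, weights, bias=0.0):
    """Return sum(w_i * x_i) + bias for two equal-length sequences."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Values from the worked example table.
x = [3.0, 1.5, -2.0]
w = [2.1, 0.7, 1.3]

z = weighted_sum(x, w)                 # approximately 4.75
z_with_bias = weighted_sum(x, w, 0.5)  # approximately 5.25
print(round(z, 2), round(z_with_bias, 2))
```

The results are compared after rounding because floating-point multiplication introduces tiny errors in the last digits.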

## Historical background

The idea of combining values with different weights goes back centuries in statistics and was later taken up in operations research; the concept of a weighted average appears in the work of early astronomers who combined observations of differing reliability. In artificial intelligence, the weighted sum became central with the development of the first artificial neuron models.

In 1943, Warren McCulloch and Walter Pitts proposed the McCulloch-Pitts neuron, a binary threshold model where a neuron fires if the sum of its excitatory inputs exceeds a threshold. While this model did not yet use continuously valued weights, it established the principle of summing inputs and comparing against a threshold.

In 1957, Frank Rosenblatt introduced the [perceptron](/wiki/perceptron), which extended the McCulloch-Pitts model by assigning learned, continuously valued weights to each input. The perceptron computes a weighted sum of its inputs, adds a bias term, and passes the result through a step function. It was among the first trainable models based on the weighted sum, and it laid the groundwork for later neural network architectures.

In 1960, Bernard Widrow and Marcian Hoff introduced ADALINE (Adaptive Linear Neuron), which formalized the bias as an additional weight on a constant input of +1 and used the Widrow-Hoff (least mean squares) learning rule to adjust the weights. This made the weighted sum, optimized by gradient-based methods, a standard approach in adaptive signal processing and later in [backpropagation](/wiki/backpropagation)-trained networks.

## Weighted sum in neural networks

The weighted sum is the core computation performed by every neuron (node) in a [neural network](/wiki/neural_network). Understanding how neurons use weighted sums is essential to understanding how neural networks learn.

### Single neuron computation

A single artificial neuron receives a set of input values, multiplies each by a learned weight, sums the products, adds a bias, and passes the result through an [activation function](/wiki/activation_function):

```
output = φ(w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ + b)
```

where φ is the activation function (such as [ReLU](/wiki/relu), [sigmoid](/wiki/sigmoid_function), or [tanh](/wiki/tanh)).

The weighted sum portion (before the activation function) is called the **pre-activation** value; at the output layer of a classifier, where it feeds a sigmoid or softmax, it is usually called a **logit**. The activation function introduces nonlinearity, allowing the network to learn complex patterns that a purely linear weighted sum cannot represent.
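A single neuron can be sketched in plain Python as a weighted sum followed by an activation. The helper names (`neuron`, `relu`) and the example numbers are illustrative assumptions, not taken from a specific framework.

```python
import math


def relu(z):
    """Rectified linear unit: passes positive values, clips negatives to 0."""
    return max(0.0, z)


def neuron(inputs, weights, bias, activation):
    """Compute activation(w . x + b) for one artificial neuron."""
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(pre_activation)


x = [3.0, 1.5, -2.0]
w = [2.1, 0.7, 1.3]

out_relu = neuron(x, w, 0.5, relu)       # pre-activation is about 5.25, kept by ReLU
out_clipped = neuron(x, w, -10.0, relu)  # pre-activation is about -5.25, clipped to 0
out_tanh = neuron(x, w, 0.5, math.tanh)  # squashed into the interval (-1, 1)
```

Swapping the activation changes the output range but not the weighted sum underneath it.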

### Feedforward networks

In a feedforward neural network with multiple layers, each neuron in each layer computes a weighted sum of outputs from the previous layer. For a layer with *m* neurons receiving input from *n* neurons in the previous layer, the computation can be expressed in matrix form:

```
z = Wx + b
```

where **W** is an *m × n* weight matrix, **x** is the input vector of length *n*, and **b** is the bias vector of length *m*. Each row of **W** contains the weights for one neuron, and each element of **z** is the weighted sum (pre-activation) for that neuron.

This matrix formulation makes it possible to compute all weighted sums in a layer simultaneously using optimized linear algebra libraries, which is why modern [deep learning](/wiki/deep_learning) runs efficiently on GPUs.
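The matrix form can be illustrated without any linear algebra library. In the sketch below, `dense_layer` is a hypothetical helper that computes one weighted sum per row of **W**; real frameworks perform the same arithmetic with optimized matrix routines.

```python
def dense_layer(W, x, b):
    """Compute z = W x + b, where each row of W holds one neuron's weights."""
    if len(W) != len(b) or any(len(row) != len(x) for row in W):
        raise ValueError("shape mismatch between W, x and b")
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


# A layer with m = 2 neurons and n = 3 inputs (illustrative values).
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
x = [2.0, 4.0, 6.0]
b = [0.0, 1.0]

z = dense_layer(W, x, b)  # one pre-activation per neuron
print(z)
```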

### Convolutional neural networks

In a [convolutional neural network](/wiki/convolutional_neural_network) (CNN), the [convolution](/wiki/convolution) operation is itself a localized weighted sum. A small filter (kernel) slides across the input (such as an image), and at each position the filter values are multiplied element-wise with the corresponding input values. Those products are then summed to produce a single output value. Formally, for a 2D convolution at position (i, j) with a kernel **K** of size *k × k*, where *m* and *n* each range over 0, ..., *k* − 1 (written, as is conventional in deep learning, without flipping the kernel, which strictly makes it a cross-correlation):

```
output(i,j) = Σₘ Σₙ K(m,n) · input(i+m, j+n) + b
```

The key difference from a fully connected layer is that the same set of weights (the kernel) is shared across all spatial positions, reducing the total number of parameters. But the underlying operation at each position is still a weighted sum.
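The sliding weighted sum can be written out with explicit loops. This is a minimal sketch assuming no padding and a stride of 1; `conv2d_valid` is an illustrative name, and the kernel is applied without flipping, as in the formula above.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide the kernel over the image and take a weighted sum at each position."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            total = bias
            for m in range(k_rows):
                for n in range(k_cols):
                    # Same kernel weights are reused at every (i, j).
                    total += kernel[m][n] * image[i + m][j + n]
            row.append(total)
        output.append(row)
    return output


image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
kernel = [[1.0, 0.0],
          [0.0, -1.0]]

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The four kernel weights are the only parameters, regardless of how large the image is.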

### Recurrent neural networks and LSTMs

In a [recurrent neural network](/wiki/recurrent_neural_network) (RNN), each time step computes a weighted sum that combines the current input with the previous hidden state:

```
hₜ = φ(Wₓ·xₜ + Wₕ·hₜ₋₁ + b)
```

Here, two separate weighted sums are computed (one for the input xₜ and one for the previous hidden state hₜ₋₁) and then added together before the activation function φ (typically tanh).
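One recurrent step can be sketched as two matrix-vector products added together. The weight values and the helper names `matvec` and `rnn_step` below are assumptions chosen for illustration.

```python
import math


def matvec(M, v):
    """Matrix-vector product: one weighted sum per row of M."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_step(W_x, W_h, b, x_t, h_prev):
    """Compute h_t = tanh(W_x x_t + W_h h_prev + b)."""
    from_input = matvec(W_x, x_t)
    from_hidden = matvec(W_h, h_prev)
    return [math.tanh(a + c + bi) for a, c, bi in zip(from_input, from_hidden, b)]


W_x = [[0.5, -0.5], [1.0, 1.0]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]

h = [0.0, 0.0]  # initial hidden state
for x_t in ([1.0, 1.0], [0.0, 2.0]):
    h = rnn_step(W_x, W_h, b, x_t, h)  # the same weights are reused at every step
```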

[LSTM](/wiki/lstm) networks extend this with gating mechanisms. Each gate (input gate, forget gate, output gate) computes its own weighted sum of the input and previous hidden state, then applies a sigmoid activation to produce values between 0 and 1 that control information flow. The cell state update itself is a weighted combination of the old cell state and a candidate value, where the "weights" are the gate outputs. As the LSTM Wikipedia article notes, each gate "can be thought of as a standard neuron in a feed-forward neural network: that is, they compute an activation of a weighted sum."

## Weighted sum in linear models

### Linear regression

[Linear regression](/wiki/linear_regression) is perhaps the simplest machine learning model, and its prediction is exactly a weighted sum plus a bias:

```
ŷ = w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ + b = wᵀx + b
```

The [weights](/wiki/weights) determine the influence of each input [feature](/wiki/feature) on the predicted output. The bias *b* (also called the intercept) determines the predicted value when all features are zero. During [training](/wiki/training), the weights and bias are optimized to minimize a [loss function](/wiki/loss_function) such as mean squared error.

### Logistic regression

[Logistic regression](/wiki/logistic_regression) uses a weighted sum followed by the [sigmoid](/wiki/sigmoid_function) function to produce a probability for binary [classification](/wiki/classification):

```
p(y=1|x) = σ(wᵀx + b)
```

where σ(z) = 1 / (1 + e⁻ᶻ). The weighted sum z = wᵀx + b is called the log-odds (logit), and the sigmoid converts it to a probability between 0 and 1. For multi-class classification, the weighted sum is extended with one set of weights per class, and the [softmax](/wiki/softmax) function replaces the sigmoid.
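The relationship between the weighted sum and the predicted probability can be sketched as follows. The function names and the example weights are illustrative assumptions; the sigmoid is written in a form that avoids overflow for large negative inputs.

```python
import math


def sigmoid(z):
    """Logistic function, split by sign so that exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, w, b):
    """Return p(y=1 | x) = sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # the log-odds (logit)
    return sigmoid(z)


p_even = predict_proba([2.0, -1.0], [0.5, 1.0], 0.0)  # logit is 0, so p is 0.5
p_high = predict_proba([1.0, 2.0], [1.0, 1.0], -1.0)  # logit is 2

# Taking the log-odds of the probability recovers the weighted sum.
recovered_logit = math.log(p_high / (1.0 - p_high))
```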

## Weighted sum in attention mechanisms

The [attention mechanism](/wiki/attention), introduced by Bahdanau et al. (2014) and refined in the [Transformer](/wiki/transformer) architecture (Vaswani et al., 2017), relies on weighted sums as its output computation.

### Scaled dot-product attention

In scaled dot-product attention, the output for each query is a weighted sum of the value vectors:

```
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
```

The process works as follows:

1. **Score computation.** For each query vector, compute dot products with all key vectors to obtain raw similarity scores.
2. **Normalization.** Divide by √dₖ (the square root of the key dimension) to prevent large dot products from pushing softmax into regions with very small gradients. Apply softmax to convert scores into attention weights that sum to 1.
3. **Weighted sum.** Multiply each value vector by its corresponding attention weight and sum the results. The output is a weighted combination of all value vectors, where the weights reflect how relevant each key-value pair is to the query.

This means that each output vector of an attention head in a Transformer is literally a weighted sum of value representations, with the weights determined dynamically by query-key similarity.
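The three steps above can be sketched for a single query in plain Python. The function names and the toy keys and values are assumptions for illustration; production implementations operate on whole matrices at once.

```python
import math


def softmax(scores):
    """Convert scores to weights that sum to 1 (max is subtracted for stability)."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    # Step 1 and 2: scaled similarity scores, then softmax.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Step 3: weighted sum of the value vectors, one output dimension at a time.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attention([1.0, 0.0], keys, values)
```

Because the query matches the first key more closely, the first value vector receives the larger weight in the output.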

### Self-attention

In self-attention, the queries, keys, and values all come from the same input sequence. Each token attends to every other token (including itself), producing a context-aware representation that is a weighted sum of all token representations in the sequence. This allows the model to capture long-range dependencies without the sequential processing constraints of RNNs.

### Cross-attention

In cross-attention (used in encoder-decoder models for machine translation, summarization, and similar tasks), the queries come from the decoder and the keys and values come from the encoder. The decoder output at each position is a weighted sum of encoder representations, where the attention weights indicate which parts of the source input are most relevant.

## Weighted sum in ensemble methods

[Ensemble learning](/wiki/ensemble_learning) combines the predictions of multiple models to produce a single prediction. Many ensemble strategies use weighted sums.

### Weighted average ensemble

In a weighted average ensemble, the final prediction is a weighted sum of individual model predictions:

```
ŷ_ensemble = w₁·ŷ₁ + w₂·ŷ₂ + ... + wₖ·ŷₖ
```

where ŷᵢ is the prediction of the *i*-th model and wᵢ is its weight. Typically, the weights are constrained to be non-negative and sum to 1, so they can be interpreted as the fraction of trust placed in each model. More accurate models receive higher weights.

The weights can be determined through several methods:

| Method | Description |
|---|---|
| Uniform averaging | All models receive equal weight (wᵢ = 1/k). This is a simple baseline. |
| Performance-based | Weights are set proportional to each model's accuracy on a validation set. |
| Optimization-based | Weights are found by minimizing ensemble error on a validation set using numerical optimization (e.g., scipy.optimize). |
| Stacking | A meta-learner (often [linear regression](/wiki/linear_regression) or [logistic regression](/wiki/logistic_regression)) is trained to learn optimal weights from model outputs. |
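As an illustration of the performance-based row, the sketch below normalizes validation scores into weights and combines three predictions. The accuracy figures and predictions are made-up numbers, and the helper names are hypothetical.

```python
def performance_weights(scores):
    """Normalize non-negative scores so that the weights sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]


def weighted_ensemble(predictions, weights):
    """Combine model predictions as a weighted sum."""
    return sum(w * p for w, p in zip(weights, predictions))


validation_accuracy = [0.9, 0.6, 0.5]  # hypothetical scores for three models
weights = performance_weights(validation_accuracy)

predictions = [10.0, 14.0, 8.0]        # hypothetical regression outputs
y_ensemble = weighted_ensemble(predictions, weights)

uniform = weighted_ensemble(predictions, [1 / 3] * 3)  # uniform-averaging baseline
```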

### Gradient boosting

In gradient boosting, the final prediction is a weighted sum of predictions from sequentially trained weak learners (typically [decision trees](/wiki/decision_tree)):

```
F(x) = Σₘ γₘ · hₘ(x)
```

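where hₘ is the *m*-th tree and γₘ is its weight. In practice γₘ is a step size that is further scaled by a shrinkage coefficient, also called the [learning rate](/wiki/learning_rate). Each new tree is fitted to the residual errors (more generally, the negative gradient of the loss) of the weighted sum of all previous trees.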

### Random forests

[Random forests](/wiki/random_forest) use an unweighted average (equal weights) of predictions from many independently trained decision trees. For classification, each tree votes, and the class with the most votes wins. For regression, predictions are simply averaged. While the weights are uniform, the operation is still a weighted sum where all weights equal 1/k.

## Weighted sum in support vector machines

In [support vector machines](/wiki/support_vector_machine_svm) (SVMs), the decision function for a new input **x** is a weighted sum of kernel evaluations over the training data:

```
f(x) = Σᵢ αᵢ · yᵢ · K(xᵢ, x) + b
```

where αᵢ are the learned dual coefficients, yᵢ are the training labels, K is the kernel function, and the sum is taken over all training examples (though in practice most αᵢ are zero, and only the support vectors contribute). This is a weighted sum where the weights αᵢ·yᵢ are determined by the optimization problem that maximizes the margin between classes.
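The decision function can be evaluated directly once the dual coefficients are known. In this sketch the support vectors, coefficients, and RBF kernel width are invented for illustration; a trained SVM would obtain them from the margin-maximization problem.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * squared distance)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def svm_decision(x, support_vectors, alphas, labels, bias, kernel=rbf_kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b over the support vectors."""
    return sum(
        a * y * kernel(sv, x)
        for a, y, sv in zip(alphas, labels, support_vectors)
    ) + bias


# Hypothetical model with one support vector per class.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
alphas = [1.0, 1.0]
labels = [-1, 1]

near_positive = svm_decision([2.0, 2.0], support_vectors, alphas, labels, 0.0)
near_negative = svm_decision([0.0, 0.0], support_vectors, alphas, labels, 0.0)
midpoint = svm_decision([1.0, 1.0], support_vectors, alphas, labels, 0.0)
```

The sign of the result gives the predicted class; the midpoint between the two support vectors lies on the decision boundary in this toy setup.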

## Weighted sum in mixture of experts

The [mixture of experts](/wiki/mixture_of_experts) (MoE) architecture uses a gating network to compute a weighted sum of outputs from multiple expert sub-networks:

```
y = Σᵢ g(x)ᵢ · Eᵢ(x)
```

where Eᵢ(x) is the output of the *i*-th expert network and g(x)ᵢ is the gating weight for that expert, produced by a gating network (typically a linear layer followed by softmax). The gating weights sum to 1 and determine how much each expert contributes to the final output.

In sparse MoE models (such as those described by Shazeer et al., 2017), only the top-k experts receive nonzero gating weights, making the weighted sum sparse. This allows the model to have a very large total number of parameters while activating only a small subset for any given input, achieving computational efficiency.
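A sparse gate can be sketched by keeping the top-k gating logits and applying softmax to those alone. The experts here are stand-in scalar functions and the gating logits are passed in directly; in a real model both would be learned networks that depend on the input.

```python
import math


def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits, k):
    """Softmax over the k largest logits; all other gates are exactly 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    gates = [0.0] * len(logits)
    for i, p in zip(top, probs):
        gates[i] = p
    return gates


def mixture_output(x, experts, gate_logits, k):
    """Weighted sum of expert outputs, skipping experts with zero gate."""
    gates = top_k_gate(gate_logits, k)
    y = sum(g * expert(x) for g, expert in zip(gates, experts) if g > 0.0)
    return y, gates


# Hypothetical experts and gating logits.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x]
y, gates = mixture_output(3.0, experts, gate_logits=[1.0, 1.0, -3.0], k=2)
```

Only the selected experts are evaluated, which is where the computational saving comes from.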

## Weighted sum in graph neural networks

In [graph neural networks](/wiki/graph_neural_network) (GNNs), the message-passing framework uses weighted sums to aggregate information from neighboring nodes. Each node updates its representation by computing a weighted sum of messages from its neighbors:

```
hᵥ = φ(Σᵤ∈N(v) wᵥᵤ · mᵤ)
```

where N(v) is the set of neighbors of node v, mᵤ is the message from neighbor u, and wᵥᵤ is the aggregation weight.

Different GNN architectures define these weights differently:

| Architecture | Aggregation weights |
|---|---|
| GCN (Kipf & Welling, 2017) | Weights are fixed by the graph structure: 1/√(dᵥ·dᵤ), the inverse square root of the product of the two nodes' degrees (computed with self-loops added). |
| GAT (Velickovic et al., 2018) | Weights are learned attention coefficients, computed from node features. |
| GraphSAGE (Hamilton et al., 2017) | Uses mean, LSTM, or max pooling aggregation; mean aggregation is an equally weighted sum. |
| GIN (Xu et al., 2019) | Uses sum aggregation (all weights equal to 1) for maximum expressiveness. |
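The effect of the weighting scheme can be seen on a tiny graph. This sketch uses scalar messages and omits the learned transformation, the activation φ, and the self-loops that GCN adds, so it illustrates only the aggregation step; all names and values are assumptions.

```python
import math

# A star graph: node 0 is connected to nodes 1 and 2.
neighbors = {0: [1, 2], 1: [0], 2: [0]}
messages = {0: 1.0, 1: 4.0, 2: 6.0}
degree = {v: len(ns) for v, ns in neighbors.items()}


def sum_weight(v, u):
    """GIN-style sum aggregation: every neighbor has weight 1."""
    return 1.0


def mean_weight(v, u):
    """Mean aggregation: every neighbor has weight 1 / degree(v)."""
    return 1.0 / degree[v]


def gcn_style_weight(v, u):
    """Symmetric degree normalization (simplified: no self-loops)."""
    return 1.0 / math.sqrt(degree[v] * degree[u])


def aggregate(v, weight_fn):
    """Weighted sum of the messages from the neighbors of node v."""
    return sum(weight_fn(v, u) * messages[u] for u in neighbors[v])


agg_sum = aggregate(0, sum_weight)
agg_mean = aggregate(0, mean_weight)
agg_gcn = aggregate(0, gcn_style_weight)
```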

## Weighted sum in loss functions

Weighted sums appear in the formulation and modification of [loss functions](/wiki/loss_function).

### Weighted cross-entropy

For imbalanced [classification](/wiki/classification) problems, weighted cross-entropy assigns different weights to different classes:

```
L = -Σᵢ w(yᵢ) · [yᵢ · log(ŷᵢ) + (1-yᵢ) · log(1-ŷᵢ)]
```

where w(yᵢ) is the class weight assigned to the true label yᵢ of sample i. Underrepresented classes receive higher weights, which penalizes misclassifications of minority classes more heavily and helps the model learn to classify them correctly.
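The weighted loss can be sketched for binary labels as follows. The class weights and predicted probabilities are invented for illustration, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def weighted_bce(y_true, y_prob, class_weights, eps=1e-12):
    """Sum of per-sample binary cross-entropy terms, each scaled by its class weight."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        sample_loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += class_weights[y] * sample_loss
    return total


y_true = [1, 0]
y_prob = [0.5, 0.5]

unweighted = weighted_bce(y_true, y_prob, {0: 1.0, 1: 1.0})
# Hypothetical weighting that treats the positive class as three times as important.
weighted = weighted_bce(y_true, y_prob, {0: 1.0, 1: 3.0})
```

With these numbers the positive sample contributes three times as much to the weighted loss as the negative one.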

### Multi-task learning

When training a model on multiple tasks simultaneously, the overall loss is often a weighted sum of individual task losses:

```
L_total = w₁·L₁ + w₂·L₂ + ... + wₖ·Lₖ
```

The weights control how much the model prioritizes each task during training. These weights can be set manually, or learned automatically using methods such as uncertainty weighting (Kendall et al., 2018).

## Weighted sum in the softmax function

The [softmax](/wiki/softmax) function, used extensively in classification output layers and attention mechanisms, is closely related to weighted sums. Before softmax is applied, the network computes a weighted sum (logit) for each class:

```
zⱼ = wⱼᵀ · h + bⱼ
```

where h is the representation from the previous layer. The softmax then converts these logits into a probability distribution:

```
p(class j) = exp(zⱼ) / Σₖ exp(zₖ)
```

The output probabilities sum to 1, and the class with the highest logit (weighted sum) receives the highest probability.
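The two stages can be sketched together: one weighted sum per class, then softmax over the resulting logits. The weight matrix and hidden vector are illustrative values, and the softmax subtracts the largest logit before exponentiating, a common way to avoid overflow.

```python
import math


def softmax(logits):
    """Turn logits into probabilities; subtracting the max avoids overflow in exp."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(W, h, b):
    """Compute z_j = w_j . h + b_j for each class j, then apply softmax."""
    logits = [sum(w * x for w, x in zip(row, h)) + bj for row, bj in zip(W, b)]
    return softmax(logits), logits


# Three classes, two hidden features (illustrative values).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [1.0, 2.0]
b = [0.0, 0.0, 0.0]

probs, logits = class_probabilities(W, h, b)
predicted = max(range(len(probs)), key=lambda j: probs[j])  # index of the largest logit
```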

## Weighted sum in optimization

The process of learning weights through [gradient descent](/wiki/gradient_descent) relies on computing gradients of the loss with respect to the weights in each weighted sum.

### Gradient computation

For a weighted sum z = wᵀx + b, the partial derivatives are straightforward:

```
∂z/∂wᵢ = xᵢ
∂z/∂b = 1
∂z/∂xᵢ = wᵢ
```

These simple gradients are what make [backpropagation](/wiki/backpropagation) computationally tractable. The gradient of the loss with respect to each weight is proportional to the corresponding input, and the gradient with respect to each input is proportional to the corresponding weight.

### Weight update rule

During training, weights are updated in the direction that reduces the loss:

```
wᵢ ← wᵢ - η · ∂L/∂wᵢ
```

where η is the [learning rate](/wiki/learning_rate). The chain rule connects the loss gradient to the weighted sum gradient through intervening activation functions and subsequent layers.
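The gradients and the update rule can be combined into a small training loop. This sketch fits a single weighted sum to toy data with mean squared error and full-batch gradient descent; the learning rate, epoch count, and data are assumptions chosen so the loop converges.

```python
def train_linear(samples, targets, lr=0.05, epochs=1000):
    """Fit z = w . x + b by gradient descent on mean squared error."""
    n_features = len(samples[0])
    n = len(samples)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, targets):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            dL_dz = 2.0 * (z - y) / n        # derivative of the mean squared error
            for i, xi in enumerate(x):
                grad_w[i] += dL_dz * xi      # chain rule with dz/dw_i = x_i
            grad_b += dL_dz                  # chain rule with dz/db = 1
        w = [wi - lr * g for wi, g in zip(w, grad_w)]  # w_i <- w_i - eta * dL/dw_i
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1.
samples = [[0.0], [1.0], [2.0], [3.0]]
targets = [1.0, 3.0, 5.0, 7.0]

w, b = train_linear(samples, targets)
print(round(w[0], 3), round(b, 3))
```

The learned weight and bias should approach the slope and intercept used to generate the data.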

## Weighted sum in decision analysis

Outside of machine learning, the weighted sum model (WSM) is one of the most widely used methods in multi-criteria decision analysis (MCDA). Given *m* alternatives evaluated on *n* criteria, the overall score for alternative *i* is:

```
Sᵢ = Σⱼ wⱼ · aᵢⱼ
```

where wⱼ is the importance weight for criterion j and aᵢⱼ is the performance score of alternative i on criterion j. The alternative with the highest weighted sum is selected as the best option. This method, also known as simple additive weighting (SAW), requires all criteria to be expressed in the same unit for results to be meaningful.
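A small scoring example shows how the model ranks alternatives. The alternatives, criterion weights, and performance scores below are invented, and all criteria are assumed to be on the same 0 to 10 scale.

```python
def wsm_scores(performance, weights):
    """Return the weighted sum score S_i for each alternative."""
    return {
        name: sum(w * a for w, a in zip(weights, scores))
        for name, scores in performance.items()
    }


# Hypothetical criterion weights (summing to 1) and scores per alternative.
weights = [0.5, 0.3, 0.2]
performance = {
    "A": [7.0, 9.0, 6.0],
    "B": [8.0, 7.0, 8.0],
    "C": [6.0, 8.0, 9.0],
}

scores = wsm_scores(performance, weights)
best = max(scores, key=scores.get)  # alternative with the highest weighted sum
```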

## Comparison with related operations

| Operation | Formula | Key difference from weighted sum |
|---|---|---|
| Weighted sum | z = Σ wᵢxᵢ | Baseline operation |
| Weighted average | z = Σ wᵢxᵢ / Σ wᵢ | Normalized by sum of weights; output scale matches input scale |
| Dot product | z = aᵀb | Equivalent to weighted sum when one vector is treated as weights |
| Convex combination | z = Σ wᵢxᵢ, where wᵢ ≥ 0 and Σ wᵢ = 1 | Weights are non-negative and sum to 1; result lies within the convex hull of inputs |
| Element-wise product (Hadamard) | z = w ⊙ x | Produces a vector, not a scalar; no summation step |
| Outer product | Z = wxᵀ | Produces a matrix; captures all pairwise products |

## Advantages

- **Simplicity.** The weighted sum is easy to implement and computationally inexpensive. It requires only multiply-and-accumulate operations, which modern hardware (CPUs, GPUs, TPUs) executes very efficiently.
- **Differentiability.** Because the weighted sum is a smooth, differentiable function, it integrates naturally with gradient-based optimization algorithms like [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) and [Adam](/wiki/adam_optimizer).
- **Flexibility.** By learning different weight values, the same weighted sum structure can represent a wide variety of functions. Combined with nonlinear activation functions, layers of weighted sums can approximate any continuous function on a bounded (compact) domain to arbitrary accuracy, given enough hidden units (universal approximation theorem).
- **Interpretability.** In linear models, the sign of each weight gives the direction of a feature's influence and, when features are on comparable scales, its magnitude indicates the feature's importance, making the model relatively easy to interpret.
- **Composability.** Weighted sums can be stacked in layers. The output of one weighted sum can serve as input to the next, enabling the construction of deep networks with millions of parameters.

## Limitations

- **Linearity.** A single weighted sum can only represent linear relationships between inputs and outputs. Nonlinear patterns require the addition of activation functions or other nonlinear transformations.
- **Independence assumption.** A simple weighted sum treats each input independently. It cannot capture interactions between features unless feature crosses or polynomial terms are explicitly added.
- **Sensitivity to scale.** If input features have very different scales, the weights may be difficult to interpret and optimization may converge slowly. Feature scaling (such as standardization) or normalization layers (such as [batch normalization](/wiki/batch_normalization)) are often applied to address this.
- **Vulnerability to [overfitting](/wiki/overfitting).** With many features and limited data, the model can learn weight values that fit noise in the training data rather than the true underlying pattern. [Regularization](/wiki/regularization) techniques such as L1 and L2 penalties are used to mitigate this.
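- **Numerical stability.** Very large weighted sums can cause overflow in subsequent exponential operations (e.g., softmax), which implementations usually avoid by subtracting the maximum logit before exponentiating. Large values also saturate softmax and shrink its gradients; dividing by √dₖ in Transformer attention keeps the scores in a range where this is less of a problem.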

## See also

- [Activation function](/wiki/activation_function)
- [Backpropagation](/wiki/backpropagation)
- [Bias](/wiki/bias)
- [Dot product](/wiki/dot_product)
- [Gradient descent](/wiki/gradient_descent)
- [Linear regression](/wiki/linear_regression)
- [Loss function](/wiki/loss_function)
- [Neural network](/wiki/neural_network)
- [Perceptron](/wiki/perceptron)
- [Softmax](/wiki/softmax)
- [Weights](/wiki/weights)

## References

1. McCulloch, W. S., & Pitts, W. (1943). "A logical calculus of the ideas immanent in nervous activity." *Bulletin of Mathematical Biophysics*, 5(4), 115-133.
2. Rosenblatt, F. (1958). "The perceptron: A probabilistic model for information storage and organization in the brain." *Psychological Review*, 65(6), 386-408.
3. Widrow, B., & Hoff, M. E. (1960). "Adaptive switching circuits." *IRE WESCON Convention Record*, Part 4, 96-104.
4. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). "Learning representations by back-propagating errors." *Nature*, 323(6088), 533-536.
5. Bahdanau, D., Cho, K., & Bengio, Y. (2014). "Neural machine translation by jointly learning to align and translate." *arXiv preprint arXiv:1409.0473*.
6. Vaswani, A., et al. (2017). "Attention is all you need." *Advances in Neural Information Processing Systems*, 30.
7. Shazeer, N., et al. (2017). "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." *arXiv preprint arXiv:1701.06538*.
8. Kipf, T. N., & Welling, M. (2017). "Semi-supervised classification with graph convolutional networks." *International Conference on Learning Representations (ICLR)*.
9. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press.
10. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer.
11. Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2023). *Dive into Deep Learning*. Cambridge University Press.
12. Fishburn, P. C. (1967). "Additive utilities with incomplete product sets: Application to priorities and assignments." *Operations Research*, 15(3), 537-542.
13. Kendall, A., Gal, Y., & Cipolla, R. (2018). "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.
14. Hochreiter, S., & Schmidhuber, J. (1997). "Long short-term memory." *Neural Computation*, 9(8), 1735-1780.
