In machine learning and statistics, a parameter is an internal variable of a model whose value is learned from data during the training process. Parameters define the behavior of a model and determine how it maps inputs to outputs. In a neural network, the most familiar parameters are weights and biases; in a linear regression, they are the coefficients and intercept. Unlike hyperparameters, which are set by the practitioner before training begins, parameters are estimated automatically by an optimization algorithm such as gradient descent.
The total number of parameters in a model is one of the most commonly cited indicators of its capacity, or its ability to represent complex functions. Modern large language models contain hundreds of billions, or even trillions, of parameters.
An intuitive analogy: imagine you have a big machine with lots of little knobs. Each knob controls something different about what the machine does. When you first build the machine, all the knobs are set randomly. Then you show the machine thousands of examples, and it slowly turns each knob a tiny bit until it gets really good at its job. Those knobs are the parameters. The more knobs the machine has, the more complicated things it can learn, but it also needs more examples and more time to get all the knobs just right.
A common source of confusion for newcomers is the difference between parameters and hyperparameters. The table below summarizes the key distinctions.
| Aspect | Parameter | Hyperparameter |
|---|---|---|
| Set by | Learned from data during training | Chosen by the practitioner before training |
| Examples | Weights, biases, embedding vectors | Learning rate, batch size, number of layers |
| Quantity | Can number in the billions or trillions | Typically a handful to a few dozen |
| Optimized via | Gradient descent, backpropagation | Grid search, random search, Bayesian optimization |
| Stored after training | Yes, saved in model checkpoint files | Usually recorded in experiment config files |
In short, parameters are what the model learns; hyperparameters are what the engineer decides.
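As a concrete illustration, here is a minimal sketch (assuming PyTorch; the layer sizes, data, and training settings are arbitrary) of where the two kinds of values appear in code: hyperparameters are arguments the practitioner passes in, while parameters are the tensors the optimizer updates.

```python
import torch
from torch import nn

# Hyperparameters: chosen by the practitioner before training.
hidden_size = 32      # number of hidden units (architecture choice)
learning_rate = 1e-3  # step size for the optimizer
num_epochs = 10       # how long to train

# Parameters: created inside the model and learned from data.
model = nn.Sequential(
    nn.Linear(4, hidden_size),  # weights (4 x 32) and biases (32)
    nn.ReLU(),
    nn.Linear(hidden_size, 1),  # weights (32 x 1) and biases (1)
)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# Toy data and training loop: gradient descent adjusts the parameters.
x, y = torch.randn(64, 4), torch.randn(64, 1)
for _ in range(num_epochs):
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(sum(p.numel() for p in model.parameters()))  # 193 learnable parameters
```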
Different model architectures use different kinds of learnable parameters. The following sections describe the most important types.
Weights are scalar values associated with connections between neurons in a neural network, or with input features in linear models. Each weight controls how strongly one input signal influences the next layer's computation. During training, weights are adjusted to minimize the loss function. In a fully connected (dense) layer with n inputs and m outputs, there are n x m weight parameters.
A bias is an additive constant included in each neuron's computation. It allows the neuron's activation to shift, making it possible to fit data that does not pass through the origin. Each neuron typically has one bias term, so a dense layer with m outputs has m bias parameters.
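These shapes are easy to verify directly. A minimal sketch, assuming PyTorch, for a dense layer with n = 5 inputs and m = 3 outputs:

```python
from torch import nn

layer = nn.Linear(in_features=5, out_features=3)  # n = 5 inputs, m = 3 outputs

print(layer.weight.shape)  # torch.Size([3, 5]) -> n x m = 15 weights (stored as out x in)
print(layer.bias.shape)    # torch.Size([3])    -> m = 3 biases
print(sum(p.numel() for p in layer.parameters()))  # 18 parameters in total
```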
In natural language processing, embedding tables map discrete tokens (words or subwords) to dense, continuous vectors. An embedding table for a vocabulary of size V with embedding dimension d contains V x d parameters. In large language models, embedding tables often account for a significant share of total parameters.
Layers such as batch normalization and layer normalization include learnable scale (gamma) and shift (beta) parameters. For a feature dimension of size d, these layers contribute 2d trainable parameters. While small in number compared to weight matrices, normalization parameters play an important role in stabilizing training dynamics.
In convolutional neural networks, each filter is a small tensor of learnable weights that slides across the input to detect spatial patterns such as edges, textures, and shapes. A convolutional layer with C_in input channels, C_out output filters, and kernel size k x k contains C_in x C_out x k x k + C_out parameters (including biases).
In transformer architectures, the self-attention mechanism relies on query, key, and value projection matrices, plus an output projection matrix. For a model dimension d_model and h attention heads with head dimension d_k = d_model / h, each attention sub-layer contains 4 x d_model^2 weight parameters (plus biases).
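The 4 x d_model^2 count can be confirmed with a standard attention module; the sketch below assumes PyTorch, whose nn.MultiheadAttention stores the query, key, and value projections as one stacked matrix plus a separate output projection.

```python
from torch import nn

d_model, n_heads = 768, 12
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)

# Q, K, V projections form one (3*d_model, d_model) matrix; the output
# projection adds another (d_model, d_model) matrix, plus bias vectors.
total = sum(p.numel() for p in attn.parameters())
print(total)                         # 2362368
print(4 * d_model**2 + 4 * d_model)  # 2362368 -- matches the formula above
```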
Knowing how to count parameters is essential for estimating memory requirements and comparing architectures.
| Layer type | Formula (with bias) | Example |
|---|---|---|
| Dense (fully connected) | (n_in x n_out) + n_out | 768 inputs, 3072 outputs: 2,362,368 |
| Convolutional 2D | (k x k x C_in x C_out) + C_out | 3x3 kernel, 64 in, 128 out: 73,856 |
| Embedding | V x d | Vocab 50,257, dim 768: 38,597,376 |
| Layer normalization | 2 x d | d = 768: 1,536 |
| Multi-head attention | 4 x d_model^2 + 4 x d_model | d_model = 768: 2,362,368 |
| Transformer FFN | 2 x d_model x d_ff + d_model + d_ff | d_model = 768, d_ff = 3072: 4,722,432 |
To find the total parameter count for a full model, sum the parameters from every layer. For a transformer with L layers, the rough formula is: L x (4 x d_model^2 + 2 x d_model x d_ff) + V x d_model, ignoring biases and normalization terms for a first-order estimate.
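These formulas can be checked mechanically. The sketch below (assuming PyTorch; sizes taken from the table above) counts the parameters of each layer type and evaluates the rough whole-model estimate.

```python
from torch import nn

def n_params(module: nn.Module) -> int:
    """Count all learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

print(n_params(nn.Linear(768, 3072)))               # 2362368
print(n_params(nn.Conv2d(64, 128, kernel_size=3)))  # 73856
print(n_params(nn.Embedding(50257, 768)))           # 38597376
print(n_params(nn.LayerNorm(768)))                  # 1536

# Transformer FFN: two dense layers, 768 -> 3072 -> 768.
ffn = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
print(n_params(ffn))                                # 4722432

# Rough whole-model estimate for L layers plus the embedding table.
L, d_model, d_ff, V = 12, 768, 3072, 50257
print(L * (4 * d_model**2 + 2 * d_model * d_ff) + V * d_model)
# 123532032 -- roughly the scale of GPT-2 small, which uses these dimensions
```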
The number of parameters varies enormously across different architectures and application domains.
| Model | Type | Approximate parameters |
|---|---|---|
| Simple linear regression (10 features) | Linear | 11 |
| Logistic regression (MNIST, 10 classes) | Linear | 7,850 |
| LeNet-5 (1998) | CNN | 60,000 |
| AlexNet (2012) | CNN | 60 million |
| VGG-16 (2014) | CNN | 138 million |
| ResNet-50 (2015) | CNN | 25.6 million |
| BERT-Base (2018) | Transformer (encoder) | 110 million |
| BERT-Large (2018) | Transformer (encoder) | 340 million |
| GPT-2 (2019) | Transformer (decoder) | 1.5 billion |
| GPT-3 (2020) | Transformer (decoder) | 175 billion |
| GPT-4 (2023) | Transformer (MoE, decoder) | ~1.8 trillion (estimated) |
| LLaMA 3.1 405B (2024) | Transformer (decoder) | 405 billion |
| Mixtral 8x22B (2024) | Transformer (MoE) | ~141 billion (39B active) |
As the table shows, parameter counts have grown by roughly ten orders of magnitude from early linear models to modern large language models. However, larger does not always mean better; architectural innovations, data quality, and training methodology also matter greatly.
In statistics and machine learning, several frameworks exist for estimating optimal parameter values from observed data.
Maximum likelihood estimation (MLE) finds the parameter values that maximize the probability (likelihood) of the observed data under the assumed model. Formally, given data D and parameters theta, MLE solves: theta_MLE = argmax P(D | theta). MLE is the most widely used estimation method in practice. Standard gradient-based training of neural networks performs MLE: minimizing cross-entropy loss maximizes the likelihood of the observed labels, and minimizing mean squared error corresponds to MLE under a Gaussian noise assumption.
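This equivalence can be checked numerically: the negative log-likelihood of the observed labels under the model's predicted probabilities is exactly the cross-entropy loss. A minimal sketch, assuming PyTorch:

```python
import torch
from torch import nn

logits = torch.randn(8, 3)          # model outputs for 8 examples, 3 classes
labels = torch.randint(0, 3, (8,))  # observed classes

# Negative log-likelihood of the data under the model, written out directly...
probs = logits.softmax(dim=-1)
nll = -probs[torch.arange(8), labels].log().mean()

# ...is the same quantity the standard cross-entropy loss computes.
ce = nn.functional.cross_entropy(logits, labels)
print(torch.allclose(nll, ce))  # True: minimizing this loss maximizes likelihood
```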
Maximum a posteriori (MAP) estimation extends MLE by incorporating a prior distribution over parameters. It finds: theta_MAP = argmax P(theta | D) = argmax P(D | theta) x P(theta). When the prior is uniform, MAP reduces to MLE. When the prior is a zero-mean Gaussian, MAP is equivalent to L2 regularization (weight decay). MAP provides a principled way to encode domain knowledge or preferences about parameter values.
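The Gaussian-prior case can be written out the same way: the negative log of a zero-mean Gaussian prior adds an L2 penalty to the data loss. A sketch, assuming PyTorch; in practice the same effect is usually obtained by passing a weight_decay argument to the optimizer.

```python
import torch
from torch import nn

model = nn.Linear(4, 1)
x, y = torch.randn(16, 4), torch.randn(16, 1)
weight_decay = 1e-2  # plays the role of the Gaussian prior's precision

# MAP objective: data loss (negative log-likelihood) plus the negative log-prior,
# which for a zero-mean Gaussian reduces to an L2 penalty on the parameters.
nll = nn.functional.mse_loss(model(x), y)
l2_penalty = 0.5 * weight_decay * sum((p ** 2).sum() for p in model.parameters())
map_objective = nll + l2_penalty
map_objective.backward()  # gradients now include a pull toward the prior's mean (zero)
```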
Instead of finding a single point estimate, Bayesian inference computes the full posterior distribution P(theta | D). This provides a complete picture of uncertainty over parameter values, not just the most likely setting. While theoretically appealing, exact Bayesian inference is computationally intractable for large neural networks. Approximate methods such as variational inference, Markov chain Monte Carlo (MCMC), and Monte Carlo dropout are used in practice.
| Method | Output | Incorporates prior? | Computational cost |
|---|---|---|---|
| MLE | Point estimate | No | Low |
| MAP | Point estimate | Yes | Low to moderate |
| Bayesian inference | Full posterior distribution | Yes | High |
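Of the approximate methods mentioned above, Monte Carlo dropout is the simplest to sketch: dropout is left active at prediction time, and the spread of repeated stochastic forward passes is read as an uncertainty estimate. An illustrative sketch, assuming PyTorch and an arbitrary toy network:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.2), nn.Linear(64, 1))
x = torch.randn(5, 10)

model.train()  # keep dropout active so each forward pass samples a different sub-network
with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(100)])  # 100 stochastic passes

mean_prediction = samples.mean(dim=0)  # approximate posterior predictive mean
uncertainty = samples.std(dim=0)       # spread across passes as an uncertainty estimate
print(mean_prediction.shape, uncertainty.shape)  # torch.Size([5, 1]) twice
```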
The parameter space of a model is the high-dimensional space defined by all its learnable parameters. For a model with N parameters, the parameter space is an N-dimensional real-valued space. The loss function defines a surface (or landscape) over this space, and training amounts to finding a low point on that surface.
Key features of this optimization landscape include local minima, saddle points, flat plateaus, and wide versus sharp minima; in the very high-dimensional parameter spaces of modern networks, many different parameter settings can achieve similarly low loss.
Parameter sharing is a design technique in which multiple parts of a model use the same set of parameters rather than each maintaining separate copies. This reduces the total number of learnable parameters, acts as a form of regularization, and encodes useful inductive biases.
In convolutional neural networks, the same filter weights are applied at every spatial position of the input. A single 3x3 filter with 64 input channels has only 576 weight parameters, yet it is applied hundreds or thousands of times across the image. This sharing encodes the assumption of translation equivariance: the same local pattern can appear anywhere in the image.
In recurrent neural networks, the same weight matrices are applied at every time step. This keeps the parameter count constant regardless of sequence length and encodes the assumption that the same transformation is useful at every position in the sequence.
Weight tying is a technique where two distinct components of a model are forced to share the same weight matrix. A common example is tying the input embedding matrix and the output softmax projection matrix in language models. Since both matrices relate the same vocabulary to the same vector space, sharing them reduces memory usage and often improves performance. The original transformer paper and many subsequent language models use this technique.
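A minimal sketch of how such tying is typically done (assuming PyTorch; the class and sizes here are illustrative, not taken from any specific library):

```python
import torch
from torch import nn

class TinyLM(nn.Module):
    """Toy language model with tied input embedding and output projection."""

    def __init__(self, vocab_size: int = 50257, d_model: int = 768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tie: one (V x d) matrix, not two

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.embed(token_ids)  # a real model would apply transformer blocks here
        return self.lm_head(hidden)     # logits over the vocabulary

model = TinyLM()
print(sum(p.numel() for p in model.parameters()))  # 38597376: the V x d matrix is counted once
```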
Full fine-tuning of a large pretrained model requires updating all parameters, which is expensive in terms of compute and memory. Parameter-efficient fine-tuning (PEFT) methods adapt a model to new tasks by modifying only a small fraction of parameters while keeping the rest frozen.
| Method | Approach | Trainable params (typical) | Key advantage |
|---|---|---|---|
| LoRA (Low-Rank Adaptation) | Injects trainable low-rank matrices into attention layers | 0.1% to 1% of original | No inference latency overhead after merging |
| Adapters | Inserts small bottleneck modules between existing layers | 1% to 5% of original | Modular; easy to swap for different tasks |
| Prefix tuning | Prepends trainable continuous vectors to transformer inputs | Less than 1% of original | Effective in low-data and few-shot settings |
| QLoRA | Combines LoRA with 4-bit quantization of the base model | 0.1% to 1% of original | Enables fine-tuning on a single consumer GPU |
| BitFit | Only trains bias terms; all weights are frozen | Less than 0.1% of original | Extremely lightweight |
LoRA has become the most widely adopted PEFT method. It decomposes each weight update into two low-rank matrices (A and B), so that the update delta_W = A x B has rank at most r, where r is typically 4 to 64. After training, the low-rank updates can be merged back into the original weights, adding zero latency at inference time.
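The idea fits in a few lines. The sketch below is illustrative only (not the reference implementation from the LoRA paper or the peft library); it wraps a frozen linear layer with a trainable low-rank update, assuming PyTorch.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                           # freeze the pretrained weights
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.zeros(d_out, r))          # delta_W = A @ B starts at zero
        self.B = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to x @ (A @ B).T, but computed without forming the full matrix.
        return self.base(x) + self.scale * ((x @ self.B.T) @ self.A.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12288 trainable parameters vs 590592 in the frozen base layer
```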
In modern deep learning workflows, it is common to freeze some parameters while training others.
Frozen parameters are parameters whose values are not updated during training. Freezing is achieved by disabling gradient computation for those parameters. Common scenarios include transfer learning, where a pretrained backbone is frozen while a new task-specific head is trained; parameter-efficient fine-tuning, where most of a large model stays frozen; and staged training schedules that unfreeze layers gradually.
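A minimal sketch of the transfer-learning scenario (assuming PyTorch; the layer sizes are arbitrary stand-ins for a pretrained encoder and a new classifier head):

```python
import torch
from torch import nn

# A pretrained "backbone" (stand-in for a vision or language encoder).
backbone = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
head = nn.Linear(256, 10)  # new task-specific classifier

for p in backbone.parameters():
    p.requires_grad = False  # frozen: no gradients, no optimizer state, no updates

model = nn.Sequential(backbone, head)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"{trainable} trainable / {total} total")  # 2570 trainable / 101386 total

# Only the trainable parameters are handed to the optimizer.
optimizer = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)
```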
Trainable parameters are parameters that receive gradient updates during optimization. The number of trainable parameters directly affects GPU memory consumption during training, since each trainable parameter requires storage for its gradient and optimizer state (momentum, variance in Adam).
In practice, with the Adam optimizer, each trainable parameter requires roughly 12 to 16 bytes of memory (4 bytes for the parameter, 4 bytes for its gradient, and 4 to 8 bytes for Adam's two moment estimates). A model with 7 billion trainable parameters therefore requires approximately 84 to 112 GB for parameters, gradients, and optimizer state alone, before counting activations.
Before training begins, parameters must be assigned initial values. The choice of initialization strategy significantly affects training speed, stability, and final model quality. Poor initialization can cause vanishing or exploding gradients, making training fail entirely.
| Strategy | Distribution | Variance | Best for |
|---|---|---|---|
| Zero initialization | All zeros | 0 | Biases only (never for weights) |
| Random uniform/normal | Uniform or Gaussian | Fixed | Simple networks |
| Xavier (Glorot) | Gaussian or uniform | 2 / (n_in + n_out) | Sigmoid, tanh activations |
| He (Kaiming) | Gaussian or uniform | 2 / n_in | ReLU and variants |
| Orthogonal | Orthogonal matrix | 1 | RNNs, very deep networks |
Xavier initialization, proposed by Xavier Glorot and Yoshua Bengio in 2010, sets weights so that the variance of activations remains approximately constant across layers during both the forward and backward passes. It works well with symmetric activation functions such as tanh and sigmoid.
He initialization, proposed by Kaiming He and colleagues in 2015, accounts for the fact that ReLU activations zero out roughly half of their inputs. It uses a larger variance (2 / n_in instead of 2 / (n_in + n_out)) to compensate, preventing the signal from vanishing in deep networks that use ReLU.
Modern transformer models often use a scaled normal initialization where the standard deviation is reduced by a factor of 1 / sqrt(2L) for residual connections, where L is the number of layers. This prevents the residual signal from growing too large in very deep models.
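These strategies map directly onto standard initializer functions. A short sketch, assuming PyTorch; each call below simply overwrites the previous initialization, and in practice you would pick one per layer.

```python
import math
import torch
from torch import nn

layer = nn.Linear(512, 512)

# Xavier/Glorot: keeps activation variance roughly constant for tanh/sigmoid networks.
nn.init.xavier_uniform_(layer.weight)

# He/Kaiming: larger variance (2 / n_in) to compensate for ReLU zeroing half the signal.
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")

# Biases are typically set to zero.
nn.init.zeros_(layer.bias)

# Scaled normal initialization for residual projections in a deep transformer:
# a base standard deviation shrunk by 1 / sqrt(2L), with L the number of layers.
num_layers = 24
nn.init.normal_(layer.weight, mean=0.0, std=0.02 / math.sqrt(2 * num_layers))
```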
Machine learning models are sometimes classified based on whether they have a fixed number of parameters.
Parametric models assume a specific functional form and have a fixed, finite number of parameters regardless of the amount of training data. Examples include linear regression, logistic regression, and neural networks. Once trained, predictions can be made using only the learned parameters; the training data is no longer needed.
Non-parametric models do not assume a fixed functional form. Their complexity grows with the amount of training data, and they may effectively retain the training data itself as part of the model. Examples include k-nearest neighbors, kernel density estimation, and decision trees (which grow deeper with more data). Gaussian processes are another example: while they have hyperparameters, the number of effective parameters scales with the dataset size.
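The contrast is easy to see in code. A small sketch, assuming scikit-learn and NumPy are available; the data is synthetic and the models are arbitrary representatives of each family.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X, y = np.random.randn(1000, 10), np.random.randn(1000)

# Parametric: after fitting, only 10 coefficients + 1 intercept are needed to predict.
lin = LinearRegression().fit(X, y)
print(lin.coef_.size + 1)   # 11 parameters, regardless of dataset size

# Non-parametric: the fitted model effectively retains the training set itself.
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print(knn.n_samples_fit_)   # 1000 -- grows with the amount of training data
```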
| Property | Parametric | Non-parametric |
|---|---|---|
| Fixed number of parameters | Yes | No (grows with data) |
| Assumptions about data distribution | Strong (specific functional form) | Weak or none |
| Data efficiency | Higher (fewer samples needed) | Lower (needs more data) |
| Computational cost at prediction | Constant (independent of training set size) | Often scales with training set size |
| Risk of underfitting | Higher (if form is wrong) | Lower |
| Risk of overfitting | Lower (with appropriate regularization) | Higher (especially with small datasets) |
| Examples | Linear regression, neural networks | k-NN, kernel methods, decision trees |
It is worth noting that neural networks, despite having a fixed parameter count, can be extremely flexible due to their large number of parameters and nonlinear activation functions. In practice, large neural networks occupy a middle ground: they are technically parametric but can approximate nearly any function given enough parameters.
The total number of parameters is often used as a rough proxy for a model's capacity, which is its ability to represent complex functions. Models with more parameters can, in principle, memorize more training data and capture finer-grained patterns.
However, parameter count alone is an imperfect measure of capacity: parameters may be shared or largely redundant, mixture-of-experts models activate only a fraction of their parameters for any given input, and regularization, data quality, and training duration all constrain what a model actually learns.
The trend in recent years has been toward ever-larger models, but there is also growing interest in building smaller, more efficient models that achieve competitive performance through better architectures, data curation, and training recipes.