In machine learning and statistics, a parameter is an internal variable of a model whose value is learned from data during the training process. Parameters define the behavior of a model and determine how it maps inputs to outputs. In a neural network, the most familiar parameters are weights and biases; in a linear regression, they are the coefficients and intercept. Unlike hyperparameters, which are set by the practitioner before training begins, parameters are estimated automatically by an optimization algorithm such as gradient descent.
The total number of parameters in a model is one of the most commonly cited indicators of its capacity, or its ability to represent complex functions. Modern large language models contain hundreds of billions, or even trillions, of parameters.
An intuitive analogy: imagine you have a big machine with lots of little knobs. Each knob controls something different about what the machine does. When you first build the machine, all the knobs are set randomly. Then you show the machine thousands of examples, and it slowly turns each knob a tiny bit until it gets really good at its job. Those knobs are the parameters. The more knobs the machine has, the more complicated things it can learn, but it also needs more examples and more time to get all the knobs just right.
A common source of confusion for newcomers is the difference between parameters and hyperparameters. The table below summarizes the key distinctions.
| Aspect | Parameter | Hyperparameter |
|---|---|---|
| Set by | Learned from data during training | Chosen by the practitioner before training |
| Examples | Weights, biases, embedding vectors | Learning rate, batch size, number of layers |
| Quantity | Can number in the billions or trillions | Typically a handful to a few dozen |
| Optimized via | Gradient descent, backpropagation | Grid search, random search, Bayesian optimization |
| Stored after training | Yes, saved in model checkpoint files | Usually recorded in experiment config files |
In short, parameters are what the model learns; hyperparameters are what the engineer decides.
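As a concrete illustration, here is a minimal sketch (assuming PyTorch; the layer sizes, data, and training settings are arbitrary) of where the two kinds of values appear in code: hyperparameters are arguments the practitioner passes in, while parameters are the tensors the optimizer updates.

```python
import torch
from torch import nn

# Hyperparameters: chosen by the practitioner before training.
hidden_size = 32      # number of hidden units (architecture choice)
learning_rate = 1e-3  # step size for the optimizer
num_epochs = 10       # how long to train

# Parameters: created inside the model and learned from data.
model = nn.Sequential(
    nn.Linear(4, hidden_size),  # weights (4 x 32) and biases (32)
    nn.ReLU(),
    nn.Linear(hidden_size, 1),  # weights (32 x 1) and biases (1)
)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# Toy data and training loop: gradient descent adjusts the parameters.
x, y = torch.randn(64, 4), torch.randn(64, 1)
for _ in range(num_epochs):
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(sum(p.numel() for p in model.parameters()))  # 193 learnable parameters
```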
Different model architectures use different kinds of learnable parameters. The following sections describe the most important types.
Weights are scalar values associated with connections between neurons in a neural network, or with input features in linear models. Each weight controls how strongly one input signal influences the next layer's computation. During training, weights are adjusted to minimize the loss function. In a fully connected (dense) layer with n inputs and m outputs, there are n x m weight parameters.
A bias is an additive constant included in each neuron's computation. It allows the neuron's activation to shift, making it possible to fit data that does not pass through the origin. Each neuron typically has one bias term, so a dense layer with m outputs has m bias parameters.
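These shapes are easy to verify directly. A minimal sketch, assuming PyTorch, for a dense layer with n = 5 inputs and m = 3 outputs:

```python
from torch import nn

layer = nn.Linear(in_features=5, out_features=3)  # n = 5 inputs, m = 3 outputs

print(layer.weight.shape)  # torch.Size([3, 5]) -> n x m = 15 weights (stored as out x in)
print(layer.bias.shape)    # torch.Size([3])    -> m = 3 biases
print(sum(p.numel() for p in layer.parameters()))  # 18 parameters in total
```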
In natural language processing, embedding tables map discrete tokens (words or subwords) to dense, continuous vectors. An embedding table for a vocabulary of size V with embedding dimension d contains V x d parameters. In large language models, embedding tables often account for a significant share of total parameters.
Layers such as batch normalization and layer normalization include learnable scale (gamma) and shift (beta) parameters. For a feature dimension of size d, these layers contribute 2d trainable parameters. While small in number compared to weight matrices, normalization parameters play an important role in stabilizing training dynamics.
In convolutional neural networks, each filter is a small tensor of learnable weights that slides across the input to detect spatial patterns such as edges, textures, and shapes. A convolutional layer with C_in input channels, C_out output filters, and kernel size k x k contains C_in x C_out x k x k + C_out parameters (including biases).
In transformer architectures, the self-attention mechanism relies on query, key, and value projection matrices, plus an output projection matrix. For a model dimension d_model and h attention heads with head dimension d_k = d_model / h, each attention sub-layer contains 4 x d_model^2 weight parameters (plus biases).
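The 4 x d_model^2 count can be confirmed with a standard attention module; the sketch below assumes PyTorch, whose nn.MultiheadAttention stores the query, key, and value projections as one stacked matrix plus a separate output projection.

```python
from torch import nn

d_model, n_heads = 768, 12
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)

# Q, K, V projections form one (3*d_model, d_model) matrix; the output
# projection adds another (d_model, d_model) matrix, plus bias vectors.
total = sum(p.numel() for p in attn.parameters())
print(total)                         # 2362368
print(4 * d_model**2 + 4 * d_model)  # 2362368 -- matches the formula above
```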
Knowing how to count parameters is essential for estimating memory requirements and comparing architectures.
| Layer type | Formula (with bias) | Example |
|---|---|---|
| Dense (fully connected) | (n_in x n_out) + n_out | 768 inputs, 3072 outputs: 2,362,368 |
| Convolutional 2D | (k x k x C_in x C_out) + C_out | 3x3 kernel, 64 in, 128 out: 73,856 |
| Embedding | V x d | Vocab 50,257, dim 768: 38,597,376 |
| Layer normalization | 2 x d | d = 768: 1,536 |
| Multi-head attention | 4 x d_model^2 + 4 x d_model | d_model = 768: 2,362,368 |
| Transformer FFN | 2 x d_model x d_ff + d_model + d_ff | d_model = 768, d_ff = 3072: 4,722,432 |
To find the total parameter count for a full model, sum the parameters from every layer. For a transformer with L layers, the rough formula is: L x (4 x d_model^2 + 2 x d_model x d_ff) + V x d_model, ignoring biases and normalization terms for a first-order estimate.
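These formulas can be checked mechanically. The sketch below (assuming PyTorch; sizes taken from the table above) counts the parameters of each layer type and evaluates the rough whole-model estimate.

```python
from torch import nn

def n_params(module: nn.Module) -> int:
    """Count all learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

print(n_params(nn.Linear(768, 3072)))               # 2362368
print(n_params(nn.Conv2d(64, 128, kernel_size=3)))  # 73856
print(n_params(nn.Embedding(50257, 768)))           # 38597376
print(n_params(nn.LayerNorm(768)))                  # 1536

# Transformer FFN: two dense layers, 768 -> 3072 -> 768.
ffn = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
print(n_params(ffn))                                # 4722432

# Rough whole-model estimate for L layers plus the embedding table.
L, d_model, d_ff, V = 12, 768, 3072, 50257
print(L * (4 * d_model**2 + 2 * d_model * d_ff) + V * d_model)
# 123532032 -- roughly the scale of GPT-2 small, which uses these dimensions
```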
The number of parameters varies enormously across different architectures and application domains.
| Model | Type | Approximate parameters |
|---|---|---|
| Simple linear regression (10 features) | Linear | 11 |
| Logistic regression (MNIST, 10 classes) | Linear | 7,850 |
| LeNet-5 (1998) | CNN | 60,000 |
| AlexNet (2012) | CNN | 60 million |
| VGG-16 (2014) | CNN | 138 million |
| ResNet-50 (2015) | CNN | 25.6 million |
| BERT-Base (2018) | Transformer (encoder) | 110 million |
| BERT-Large (2018) | Transformer (encoder) | 340 million |
| GPT-2 (2019) | Transformer (decoder) | 1.5 billion |
| GPT-3 (2020) | Transformer (decoder) | 175 billion |
| GPT-4 (2023) | Transformer (MoE, decoder) | ~1.8 trillion (estimated) |
| LLaMA 3.1 405B (2024) | Transformer (decoder) | 405 billion |
| Mixtral 8x22B (2024) | Transformer (MoE) | ~141 billion (39B active) |
As the table shows, parameter counts have grown by roughly ten orders of magnitude from early linear models to modern large language models. However, larger does not always mean better; architectural innovations, data quality, and training methodology also matter greatly.
In statistics and machine learning, several frameworks exist for estimating optimal parameter values from observed data.
Maximum likelihood estimation (MLE) finds the parameter values that maximize the probability (likelihood) of the observed data under the assumed model. Formally, given data D and parameters theta, MLE solves: theta_MLE = argmax P(D | theta). MLE is the most widely used estimation method in practice. Standard gradient-based training of neural networks performs MLE: minimizing cross-entropy loss maximizes the likelihood of the observed labels, and minimizing mean squared error corresponds to MLE under a Gaussian noise assumption.
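This equivalence can be checked numerically: the negative log-likelihood of the observed labels under the model's predicted probabilities is exactly the cross-entropy loss. A minimal sketch, assuming PyTorch:

```python
import torch
from torch import nn

logits = torch.randn(8, 3)          # model outputs for 8 examples, 3 classes
labels = torch.randint(0, 3, (8,))  # observed classes

# Negative log-likelihood of the data under the model, written out directly...
probs = logits.softmax(dim=-1)
nll = -probs[torch.arange(8), labels].log().mean()

# ...is the same quantity the standard cross-entropy loss computes.
ce = nn.functional.cross_entropy(logits, labels)
print(torch.allclose(nll, ce))  # True: minimizing this loss maximizes likelihood
```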
Maximum a posteriori (MAP) estimation extends MLE by incorporating a prior distribution over parameters. It finds: theta_MAP = argmax P(theta | D) = argmax P(D | theta) x P(theta). When the prior is uniform, MAP reduces to MLE. When the prior is a zero-mean Gaussian, MAP is equivalent to L2 regularization (weight decay). MAP provides a principled way to encode domain knowledge or preferences about parameter values.
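The Gaussian-prior case can be written out the same way: the negative log of a zero-mean Gaussian prior adds an L2 penalty to the data loss. A sketch, assuming PyTorch; in practice the same effect is usually obtained by passing a weight_decay argument to the optimizer.

```python
import torch
from torch import nn

model = nn.Linear(4, 1)
x, y = torch.randn(16, 4), torch.randn(16, 1)
weight_decay = 1e-2  # plays the role of the Gaussian prior's precision

# MAP objective: data loss (negative log-likelihood) plus the negative log-prior,
# which for a zero-mean Gaussian reduces to an L2 penalty on the parameters.
nll = nn.functional.mse_loss(model(x), y)
l2_penalty = 0.5 * weight_decay * sum((p ** 2).sum() for p in model.parameters())
map_objective = nll + l2_penalty
map_objective.backward()  # gradients now include a pull toward the prior's mean (zero)
```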
Instead of finding a single point estimate, Bayesian inference computes the full posterior distribution P(theta | D). This provides a complete picture of uncertainty over parameter values, not just the most likely setting. While theoretically appealing, exact Bayesian inference is computationally intractable for large neural networks. Approximate methods such as variational inference, Markov chain Monte Carlo (MCMC), and Monte Carlo dropout are used in practice.
| Method | Output | Incorporates prior? | Computational cost |
|---|---|---|---|
| MLE | Point estimate | No | Low |
| MAP | Point estimate | Yes | Low to moderate |
| Bayesian inference | Full posterior distribution | Yes | High |
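Of the approximate methods mentioned above, Monte Carlo dropout is the simplest to sketch: dropout is left active at prediction time, and the spread of repeated stochastic forward passes is read as an uncertainty estimate. An illustrative sketch, assuming PyTorch and an arbitrary toy network:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.2), nn.Linear(64, 1))
x = torch.randn(5, 10)

model.train()  # keep dropout active so each forward pass samples a different sub-network
with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(100)])  # 100 stochastic passes

mean_prediction = samples.mean(dim=0)  # approximate posterior predictive mean
uncertainty = samples.std(dim=0)       # spread across passes as an uncertainty estimate
print(mean_prediction.shape, uncertainty.shape)  # torch.Size([5, 1]) twice
```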
The parameter space of a model is the high-dimensional space defined by all its learnable parameters. For a model with N parameters, the parameter space is an N-dimensional real-valued space. The loss function defines a surface (or landscape) over this space, and training amounts to finding a low point on that surface.
Key features of this optimization landscape include local minima, saddle points, flat plateaus, and wide versus sharp minima; in the very high-dimensional parameter spaces of modern networks, many different parameter settings can achieve similarly low loss.
Parameter sharing is a design technique in which multiple parts of a model use the same set of parameters rather than each maintaining separate copies. This reduces the total number of learnable parameters, acts as a form of regularization, and encodes useful inductive biases.
In convolutional neural networks, the same filter weights are applied at every spatial position of the input. A single 3x3 filter with 64 input channels has only 576 weight parameters, yet it is applied hundreds or thousands of times across the image. This sharing encodes the assumption of translation equivariance: the same local pattern can appear anywhere in the image.
In recurrent neural networks, the same weight matrices are applied at every time step. This keeps the parameter count constant regardless of sequence length and encodes the assumption that the same transformation is useful at every position in the sequence.
Weight tying is a technique where two distinct components of a model are forced to share the same weight matrix. A common example is tying the input embedding matrix and the output softmax projection matrix in language models. Since both matrices relate the same vocabulary to the same vector space, sharing them reduces memory usage and often improves performance. The original transformer paper and many subsequent language models use this technique.
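A minimal sketch of how such tying is typically done (assuming PyTorch; the class and sizes here are illustrative, not taken from any specific library):

```python
import torch
from torch import nn

class TinyLM(nn.Module):
    """Toy language model with tied input embedding and output projection."""

    def __init__(self, vocab_size: int = 50257, d_model: int = 768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tie: one (V x d) matrix, not two

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.embed(token_ids)  # a real model would apply transformer blocks here
        return self.lm_head(hidden)     # logits over the vocabulary

model = TinyLM()
print(sum(p.numel() for p in model.parameters()))  # 38597376: the V x d matrix is counted once
```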
Full fine-tuning of a large pretrained model requires updating all parameters, which is expensive in terms of compute and memory. Parameter-efficient fine-tuning (PEFT) methods adapt a model to new tasks by modifying only a small fraction of parameters while keeping the rest frozen.
| Method | Approach | Trainable params (typical) | Key advantage |
|---|---|---|---|
| LoRA (Low-Rank Adaptation) | Injects trainable low-rank matrices into attention layers | 0.1% to 1% of original | No inference latency overhead after merging |
| Adapters | Inserts small bottleneck modules between existing layers | 1% to 5% of original | Modular; easy to swap for different tasks |
| Prefix tuning | Prepends trainable continuous vectors to transformer inputs | Less than 1% of original | Effective in low-data and few-shot settings |
| QLoRA | Combines LoRA with 4-bit quantization of the base model | 0.1% to 1% of original | Enables fine-tuning on a single consumer GPU |
| BitFit | Only trains bias terms; all weights are frozen | Less than 0.1% of original | Extremely lightweight |
LoRA has become the most widely adopted PEFT method. It decomposes each weight update into two low-rank matrices (A and B), so that the update delta_W = A x B has rank at most r, where r is typically 4 to 64. After training, the low-rank updates can be merged back into the original weights, adding zero latency at inference time.
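The idea fits in a few lines. The sketch below is illustrative only (not the reference implementation from the LoRA paper or the peft library); it wraps a frozen linear layer with a trainable low-rank update, assuming PyTorch.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                           # freeze the pretrained weights
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.zeros(d_out, r))          # delta_W = A @ B starts at zero
        self.B = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to x @ (A @ B).T, but computed without forming the full matrix.
        return self.base(x) + self.scale * ((x @ self.B.T) @ self.A.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12288 trainable parameters vs 590592 in the frozen base layer
```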
In modern deep learning workflows, it is common to freeze some parameters while training others.
Frozen parameters are parameters whose values are not updated during training. Freezing is achieved by disabling gradient computation for those parameters. Common scenarios include transfer learning, where a pretrained backbone is frozen while a new task-specific head is trained; parameter-efficient fine-tuning, where most of a large model stays frozen; and staged training schedules that unfreeze layers gradually.
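A minimal sketch of the transfer-learning scenario (assuming PyTorch; the layer sizes are arbitrary stand-ins for a pretrained encoder and a new classifier head):

```python
import torch
from torch import nn

# A pretrained "backbone" (stand-in for a vision or language encoder).
backbone = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
head = nn.Linear(256, 10)  # new task-specific classifier

for p in backbone.parameters():
    p.requires_grad = False  # frozen: no gradients, no optimizer state, no updates

model = nn.Sequential(backbone, head)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"{trainable} trainable / {total} total")  # 2570 trainable / 101386 total

# Only the trainable parameters are handed to the optimizer.
optimizer = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)
```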
Trainable parameters are parameters that receive gradient updates during optimization. The number of trainable parameters directly affects GPU memory consumption during training, since each trainable parameter requires storage for its gradient and optimizer state (momentum, variance in Adam).
In practice, with the Adam optimizer, each trainable parameter requires roughly 12 to 16 bytes of memory (4 bytes for the parameter, 4 bytes for its gradient, and 4 to 8 bytes for Adam's two moment estimates). A model with 7 billion trainable parameters therefore requires approximately 84 to 112 GB for parameters, gradients, and optimizer state alone, before counting activations.
Before training begins, parameters must be assigned initial values. The choice of initialization strategy significantly affects training speed, stability, and final model quality. Poor initialization can cause vanishing or exploding gradients, making training fail entirely.
| Strategy | Distribution | Variance | Best for |
|---|---|---|---|
| Zero initialization | All zeros | 0 | Biases only (never for weights) |
| Random uniform/normal | Uniform or Gaussian | Fixed | Simple networks |
| Xavier (Glorot) | Gaussian or uniform | 2 / (n_in + n_out) | Sigmoid, tanh activations |
| He (Kaiming) | Gaussian or uniform | 2 / n_in | ReLU and variants |
| Orthogonal | Orthogonal matrix | 1 | RNNs, very deep networks |
Xavier initialization, proposed by Xavier Glorot and Yoshua Bengio in 2010, sets weights so that the variance of activations remains approximately constant across layers during both the forward and backward passes. It works well with symmetric activation functions such as tanh and sigmoid.
He initialization, proposed by Kaiming He and colleagues in 2015, accounts for the fact that ReLU activations zero out roughly half of their inputs. It uses a larger variance (2 / n_in instead of 2 / (n_in + n_out)) to compensate, preventing the signal from vanishing in deep networks that use ReLU.
Modern transformer models often use a scaled normal initialization where the standard deviation is reduced by a factor of 1 / sqrt(2L) for residual connections, where L is the number of layers. This prevents the residual signal from growing too large in very deep models.
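These strategies map directly onto standard initializer functions. A short sketch, assuming PyTorch; each call below simply overwrites the previous initialization, and in practice you would pick one per layer.

```python
import math
import torch
from torch import nn

layer = nn.Linear(512, 512)

# Xavier/Glorot: keeps activation variance roughly constant for tanh/sigmoid networks.
nn.init.xavier_uniform_(layer.weight)

# He/Kaiming: larger variance (2 / n_in) to compensate for ReLU zeroing half the signal.
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")

# Biases are typically set to zero.
nn.init.zeros_(layer.bias)

# Scaled normal initialization for residual projections in a deep transformer:
# a base standard deviation shrunk by 1 / sqrt(2L), with L the number of layers.
num_layers = 24
nn.init.normal_(layer.weight, mean=0.0, std=0.02 / math.sqrt(2 * num_layers))
```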
Machine learning models are sometimes classified based on whether they have a fixed number of parameters.
Parametric models assume a specific functional form and have a fixed, finite number of parameters regardless of the amount of training data. Examples include linear regression, logistic regression, and neural networks. Once trained, predictions can be made using only the learned parameters; the training data is no longer needed.
Non-parametric models do not assume a fixed functional form. Their complexity grows with the amount of training data, and they may effectively retain the training data itself as part of the model. Examples include k-nearest neighbors, kernel density estimation, and decision trees (which grow deeper with more data). Gaussian processes are another example: while they have hyperparameters, the number of effective parameters scales with the dataset size.
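The contrast is easy to see in code. A small sketch, assuming scikit-learn and NumPy are available; the data is synthetic and the models are arbitrary representatives of each family.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X, y = np.random.randn(1000, 10), np.random.randn(1000)

# Parametric: after fitting, only 10 coefficients + 1 intercept are needed to predict.
lin = LinearRegression().fit(X, y)
print(lin.coef_.size + 1)   # 11 parameters, regardless of dataset size

# Non-parametric: the fitted model effectively retains the training set itself.
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print(knn.n_samples_fit_)   # 1000 -- grows with the amount of training data
```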
| Property | Parametric | Non-parametric |
|---|---|---|
| Fixed number of parameters | Yes | No (grows with data) |
| Assumptions about data distribution | Strong (specific functional form) | Weak or none |
| Data efficiency | Higher (fewer samples needed) | Lower (needs more data) |
| Computational cost at prediction | Constant (independent of training set size) | Often scales with training set size |
| Risk of underfitting | Higher (if form is wrong) | Lower |
| Risk of overfitting | Lower (with appropriate regularization) | Higher (especially with small datasets) |
| Examples | Linear regression, neural networks | k-NN, kernel methods, decision trees |
It is worth noting that neural networks, despite having a fixed parameter count, can be extremely flexible due to their large number of parameters and nonlinear activation functions. In practice, large neural networks occupy a middle ground: they are technically parametric but can approximate nearly any function given enough parameters.
The total number of parameters is often used as a rough proxy for a model's capacity, which is its ability to represent complex functions. Models with more parameters can, in principle, memorize more training data and capture finer-grained patterns.
However, parameter count alone is an imperfect measure of capacity: parameters may be shared or largely redundant, mixture-of-experts models activate only a fraction of their parameters for any given input, and regularization, data quality, and training duration all constrain what a model actually learns.
The trend in recent years has been toward ever-larger models, but there is also growing interest in building smaller, more efficient models that achieve competitive performance through better architectures, data curation, and training recipes.