# Parameter

> Source: https://aiwiki.ai/wiki/parameter
> Updated: 2026-06-21
> Categories: Machine Learning, Neural Networks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

In [machine learning](/wiki/machine_learning) and statistics, a **parameter** is an internal variable of a [model](/wiki/model) whose value is learned from data during the [training](/wiki/training) process.[1] Parameters define the behavior of a model and determine how it maps inputs to outputs. In a [neural network](/wiki/neural_network), the most familiar parameters are [weights](/wiki/weight) and [biases](/wiki/bias_math_or_bias_term); in a linear regression, they are the coefficients and intercept. Unlike [hyperparameters](/wiki/hyperparameter), which are set by the practitioner before training begins, parameters are estimated automatically by an optimization algorithm such as [gradient descent](/wiki/gradient_descent).

The total number of parameters in a model is one of the most commonly cited indicators of its capacity, that is, its ability to represent complex functions. Modern [large language models](/wiki/large_language_model) contain hundreds of billions, or even trillions, of parameters: GPT-3, released in 2020, has 175 billion parameters,[8] while GPT-4 is estimated at roughly 1.8 trillion. That is an increase of more than seven orders of magnitude over the roughly 60,000 parameters in LeNet-5, the convolutional network from 1998.

## ELI5 (Explain like I'm 5)

Imagine you have a big machine with lots of little knobs. Each knob controls something different about what the machine does. When you first build the machine, all the knobs are set randomly. Then you show the machine thousands of examples, and it slowly turns each knob a tiny bit until it gets really good at its job. Those knobs are the parameters. The more knobs the machine has, the more complicated things it can learn, but it also needs more examples and more time to get all the knobs just right.

## What is the difference between a parameter and a hyperparameter?

A common source of confusion for newcomers is the difference between parameters and hyperparameters. The table below summarizes the key distinctions.

| Aspect | Parameter | [Hyperparameter](/wiki/hyperparameter) |
|---|---|---|
| Set by | Learned from data during training | Chosen by the practitioner before training |
| Examples | Weights, biases, embedding vectors | [Learning rate](/wiki/learning_rate), batch size, number of layers |
| Quantity | Can number in the billions or trillions | Typically few (tens to hundreds at most) |
| Optimized via | Gradient descent (gradients from backpropagation) | Grid search, random search, Bayesian optimization |
| Stored after training | Yes, saved in model checkpoint files | Usually recorded in experiment config files |

In short, parameters are what the model learns; hyperparameters are what the engineer decides.
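The distinction can be made concrete with a minimal sketch in plain Python. The toy data, the function name `train_linear_model`, and the specific learning rate and epoch count are illustrative assumptions, not values from any particular library or paper. The learning rate and epoch count are fixed by hand (hyperparameters), while `w` and `b` are found by gradient descent (parameters).

```python
# Hyperparameters: chosen by hand before training starts.
LEARNING_RATE = 0.05
EPOCHS = 2000


def train_linear_model(xs, ys, learning_rate=LEARNING_RATE, epochs=EPOCHS):
    """Fit y ~ w * x + b with batch gradient descent on mean squared error."""
    # Parameters: start at arbitrary values and are learned from the data.
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 3x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3 * x + 1 for x in xs]

w, b = train_linear_model(xs, ys)
print(f"learned parameters: w={w:.3f}, b={b:.3f}")
```

Changing `LEARNING_RATE` or `EPOCHS` changes how training proceeds, but only `w` and `b` are stored as the trained model.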

## What types of parameters exist?

Different model architectures use different kinds of learnable parameters. The following sections describe the most important types.

### Weights

Weights are scalar values associated with connections between neurons in a neural network, or with input features in linear models. Each weight controls how strongly one input signal influences the next layer's computation. During training, weights are adjusted to minimize the [loss function](/wiki/loss_function). In a fully connected (dense) layer with *n* inputs and *m* outputs, there are *n x m* weight parameters.

### Biases

A bias is an additive constant included in each neuron's computation. It allows the neuron's activation to shift, making it possible to fit data that does not pass through the origin. Each neuron typically has one bias term, so a dense layer with *m* outputs has *m* bias parameters.

### Embedding tables

In natural language processing, [embedding](/wiki/embedding_vector) tables map discrete tokens (words or subwords) to dense, continuous vectors. An embedding table for a vocabulary of size *V* with embedding dimension *d* contains *V x d* parameters. In smaller language models, embedding tables can account for a significant share of total parameters; because the table grows linearly with *d* while the transformer layers grow quadratically, that share shrinks as models scale up.

### Normalization parameters

Layers such as [batch normalization](/wiki/batch_normalization) and layer normalization include learnable scale (gamma) and shift (beta) parameters. For a feature dimension of size *d*, these layers contribute *2d* trainable parameters. While small in number compared to weight matrices, normalization parameters play an important role in stabilizing training dynamics.

### Convolutional filters (kernels)

In [convolutional neural networks](/wiki/convolutional_neural_network), each filter is a small tensor of learnable weights that slides across the input to detect spatial patterns such as edges, textures, and shapes. A convolutional layer with *C_in* input channels, *C_out* output filters, and kernel size *k x k* contains *C_in x C_out x k x k + C_out* parameters (including biases).

### Attention projection matrices

In [transformer](/wiki/transformer) architectures, the self-attention mechanism relies on query, key, and value projection matrices, plus an output projection matrix.[4] For a model dimension *d_model* and *h* attention heads with head dimension *d_k = d_model / h*, each attention sub-layer contains *4 x d_model^2* weight parameters (plus biases).

## How do you count the parameters in a layer?

Knowing how to count parameters is essential for estimating memory requirements and comparing architectures.

| Layer type | Formula (with bias) | Example |
|---|---|---|
| Dense (fully connected) | (n_in x n_out) + n_out | 768 inputs, 3072 outputs: 2,362,368 |
| Convolutional 2D | (k x k x C_in x C_out) + C_out | 3x3 kernel, 64 in, 128 out: 73,856 |
| Embedding | V x d | Vocab 50,257, dim 768: 38,597,376 |
| Layer normalization | 2 x d | d = 768: 1,536 |
| Multi-head attention | 4 x d_model^2 + 4 x d_model | d_model = 768: 2,362,368 |
| Transformer FFN | 2 x d_model x d_ff + d_model + d_ff | d_model = 768, d_ff = 3072: 4,722,432 |

To find the total parameter count for a full model, sum the parameters from every layer. For a transformer with *L* layers, the rough formula is: L x (4 x d_model^2 + 2 x d_model x d_ff) + V x d_model, ignoring biases and normalization terms for a first-order estimate.
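The formulas above translate directly into a few helper functions. The sketch below is plain Python with hypothetical function names; it reproduces the table's example values and then applies the first-order transformer estimate to an assumed 12-layer configuration that reuses the table's dimensions.

```python
def dense_params(n_in, n_out, bias=True):
    """Weights plus optional biases of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def conv2d_params(k, c_in, c_out, bias=True):
    """Weights plus optional biases of a k x k 2D convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def embedding_params(vocab_size, dim):
    return vocab_size * dim


def layer_norm_params(dim):
    # One scale (gamma) and one shift (beta) per feature.
    return 2 * dim


def attention_params(d_model, bias=True):
    # Query, key, value, and output projections, each d_model x d_model.
    return 4 * dense_params(d_model, d_model, bias)


def ffn_params(d_model, d_ff, bias=True):
    # Up-projection followed by down-projection.
    return dense_params(d_model, d_ff, bias) + dense_params(d_ff, d_model, bias)


def transformer_estimate(n_layers, d_model, d_ff, vocab_size):
    """First-order estimate: ignores biases and normalization terms."""
    per_layer = attention_params(d_model, bias=False) + ffn_params(d_model, d_ff, bias=False)
    return n_layers * per_layer + embedding_params(vocab_size, d_model)


print(dense_params(768, 3072))         # 2362368
print(conv2d_params(3, 64, 128))       # 73856
print(embedding_params(50257, 768))    # 38597376
print(layer_norm_params(768))          # 1536
print(attention_params(768))           # 2362368
print(ffn_params(768, 3072))           # 4722432

# Assumed configuration: 12 layers with the table's dimensions.
print(transformer_estimate(12, 768, 3072, 50257))  # 123532032
```

Because the estimate omits biases, normalization terms, and any positional embeddings, published counts for real models with these dimensions will be slightly higher.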

## How many parameters do common models have?

The number of parameters varies enormously across different architectures and application domains.

| Model | Type | Approximate parameters |
|---|---|---|
| Simple linear regression (10 features) | Linear | 11 |
| Logistic regression (MNIST, 10 classes) | Linear | 7,850 |
| LeNet-5 (1998) | CNN | 60,000 |
| AlexNet (2012) | CNN | 60 million |
| VGG-16 (2014) | CNN | 138 million |
| [ResNet](/wiki/resnet)-50 (2015) | CNN | 25.6 million |
| [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers)-Base (2018) | Transformer (encoder) | 110 million |
| BERT-Large (2018) | Transformer (encoder) | 340 million |
| [GPT-2](/wiki/gpt2) (2019) | Transformer (decoder) | 1.5 billion |
| [GPT-3](/wiki/gpt3) (2020) | Transformer (decoder) | 175 billion |
| [GPT-4](/wiki/gpt4) (2023) | Transformer (MoE, decoder) | ~1.8 trillion (estimated) |
| [LLaMA](/wiki/llama) 3.1 405B (2024) | Transformer (decoder) | 405 billion |
| Mixtral 8x22B (2024) | Transformer (MoE) | ~141 billion (39B active) |

As the table shows, parameter counts span roughly eleven orders of magnitude, from 11 in a small linear regression to an estimated 1.8 trillion in the largest language models.[8][9] However, larger does not always mean better; architectural innovations, data quality, and training methodology also matter greatly. The Chinchilla study from DeepMind demonstrated this directly: a 70 billion parameter model trained on 1.4 trillion tokens outperformed the 175 billion parameter GPT-3 and the 280 billion parameter Gopher, because those larger models had been substantially under-trained relative to their size.[7]

## How are parameter values estimated?

In statistics and machine learning, several frameworks exist for estimating optimal parameter values from observed data.[10]

### Maximum likelihood estimation (MLE)

MLE finds the parameter values that maximize the probability (likelihood) of the observed data under the assumed model. Formally, given data *D* and parameters *theta*, MLE solves: theta_MLE = argmax P(D | theta), where the maximum is taken over *theta*. MLE is the most widely used estimation method in practice, and it is usually carried out by minimizing the negative log-likelihood. Training a neural network by minimizing a cross-entropy loss is equivalent to MLE under a categorical output distribution, and minimizing mean squared error is equivalent to MLE under Gaussian noise with fixed variance.[1]

### Maximum a posteriori estimation (MAP)

MAP estimation extends MLE by incorporating a prior distribution over parameters. It finds: theta_MAP = argmax P(theta | D) = argmax P(D | theta) x P(theta). When the prior is uniform, MAP reduces to MLE. When the prior is Gaussian, MAP is equivalent to L2 [regularization](/wiki/regularization) (weight decay).[1] MAP provides a principled way to encode domain knowledge or preferences about parameter values.
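A small worked case may help: estimating the mean of a Gaussian with known noise variance. The sketch below assumes a zero-mean Gaussian prior on the mean; the function names, sample values, and variances are hypothetical. Under these assumptions the MAP estimate has a closed form that shrinks the sample mean toward zero, which is the same effect an L2 penalty has.

```python
def mle_mean(samples):
    """MLE of a Gaussian mean: the sample average."""
    return sum(samples) / len(samples)


def map_mean(samples, noise_var, prior_var):
    """MAP estimate of a Gaussian mean under a zero-mean Gaussian prior.

    Maximizes log P(D | theta) + log P(theta), which is equivalent to
    minimizing squared error plus (noise_var / prior_var) * theta ** 2.
    """
    n = len(samples)
    return (sum(samples) / noise_var) / (n / noise_var + 1.0 / prior_var)


# Made-up observations for illustration.
samples = [2.1, 1.9, 2.4, 1.6, 2.0]

theta_mle = mle_mean(samples)
theta_map = map_mean(samples, noise_var=1.0, prior_var=1.0)
theta_map_flat = map_mean(samples, noise_var=1.0, prior_var=1e12)

print(f"MLE:                 {theta_mle:.4f}")
print(f"MAP (tight prior):   {theta_map:.4f}")
print(f"MAP (nearly flat):   {theta_map_flat:.4f}")
```

With a tight prior the estimate is pulled toward zero; as the prior variance grows, the prior becomes nearly uniform and the MAP estimate approaches the MLE.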

### Bayesian inference

Instead of finding a single point estimate, Bayesian inference computes the full posterior distribution P(theta | D). This provides a complete picture of uncertainty over parameter values, not just the most likely setting.[10] While theoretically appealing, exact Bayesian inference is computationally intractable for large neural networks. Approximate methods such as variational inference, Markov chain Monte Carlo (MCMC), and Monte Carlo dropout are used in practice.

| Method | Output | Incorporates prior? | Computational cost |
|---|---|---|---|
| MLE | Point estimate | No | Low |
| MAP | Point estimate | Yes | Low to moderate |
| Bayesian inference | Full posterior distribution | Yes | High |

## Parameter space and optimization landscape

The parameter space of a model is the high-dimensional space defined by all its learnable parameters. For a model with *N* parameters, the parameter space is an *N*-dimensional real-valued space. The loss function defines a surface (or landscape) over this space, and training amounts to finding a low point on that surface.

Key features of the optimization landscape include:

- **Local minima:** Points where the loss is lower than at all nearby points, but not necessarily the global minimum. In high-dimensional spaces, most local minima are thought to be close in loss value to the global minimum, so reaching any of them often yields a good solution.
- **Saddle points:** Points where the gradient is zero but the loss increases in some directions and decreases in others. Saddle points are far more common than local minima in high-dimensional landscapes and can slow down optimization.
- **Loss plateaus:** Flat regions of the landscape where gradients are very small, making progress slow. Momentum-based optimizers and adaptive learning rate methods help traverse these regions.
- **Sharp vs. flat minima:** Research suggests that flat minima (where the loss changes slowly around the minimum) tend to generalize better than sharp minima. The gradient noise of stochastic gradient descent, which is larger at small batch sizes, is thought to bias training toward flat minima.
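The behavior around a saddle point can be sketched on a two-parameter toy loss. The function below is a made-up example, not a real network loss: it has a saddle at (0, 0) and minima at (0, 1) and (0, -1). Plain gradient descent started exactly on the saddle's ridge stays there, while a tiny offset in the starting point (standing in here for gradient noise) is enough to reach a minimum.

```python
def loss(x, y):
    # Saddle at (0, 0) with loss 1; minima at (0, 1) and (0, -1) with loss 0.
    return x ** 2 + (y ** 2 - 1) ** 2


def gradient(x, y):
    # Partial derivatives of the loss with respect to x and y.
    return 2 * x, 4 * y * (y ** 2 - 1)


def gradient_descent(x, y, learning_rate=0.1, steps=200):
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y


# Start with y exactly 0: the y-gradient is zero, so y never moves.
stuck = gradient_descent(1.0, 0.0)

# Start with a tiny offset in y: the iterate slides off the saddle.
escaped = gradient_descent(1.0, 1e-3)

print("stuck at", stuck, "loss", loss(*stuck))
print("escaped to", escaped, "loss", loss(*escaped))
```

Both runs use the same hyperparameters; only the starting point differs, which is why random initialization and noisy gradients matter in practice.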

## What is parameter sharing?

Parameter sharing is a design technique in which multiple parts of a model use the same set of parameters rather than each maintaining separate copies. This reduces the total number of learnable parameters, acts as a form of regularization, and encodes useful inductive biases.

### Convolutional weight sharing

In convolutional neural networks, the same filter weights are applied at every spatial position of the input. A single 3x3 filter with 64 input channels has only 576 weight parameters, yet it is applied hundreds or thousands of times across the image. This sharing encodes the assumption of translation equivariance: the same local pattern can appear anywhere in the image.
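The saving from sharing can be estimated with simple arithmetic. The sketch below assumes a hypothetical 32 x 32 input with stride 1 and one pixel of padding, and compares the shared filter against a layer that learned a separate set of weights for every output position.

```python
def conv_filter_weights(kernel_size, in_channels):
    """Weights in one k x k filter spanning all input channels (no bias)."""
    return kernel_size * kernel_size * in_channels


def output_positions(height, width, kernel_size, stride=1, padding=0):
    """Number of spatial positions at which the filter is applied."""
    out_h = (height + 2 * padding - kernel_size) // stride + 1
    out_w = (width + 2 * padding - kernel_size) // stride + 1
    return out_h * out_w


shared = conv_filter_weights(kernel_size=3, in_channels=64)
positions = output_positions(32, 32, kernel_size=3, padding=1)
unshared = shared * positions  # a separate filter at every position

print(shared)     # 576
print(positions)  # 1024
print(unshared)   # 589824
```

Under these assumptions, sharing reduces the weights for a single output channel by a factor equal to the number of positions.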

### Recurrent weight sharing

In [recurrent neural networks](/wiki/recurrent_neural_network), the same weight matrices are applied at every time step. This keeps the parameter count constant regardless of sequence length and encodes the assumption that the same transformation is useful at every position in the sequence.

### Weight tying

Weight tying is a technique where two distinct components of a model are forced to share the same weight matrix. A common example is tying the input embedding matrix and the output softmax projection matrix in language models. Since both matrices relate the same vocabulary to the same vector space, sharing them reduces memory usage and often improves performance. The original transformer paper and many subsequent language models use this technique.[4]
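In code, tying amounts to having two components reference the same matrix object. The following is a minimal sketch in plain Python with nested lists standing in for tensors; the vocabulary size, dimension, and function names are hypothetical.

```python
def count_unique_parameters(matrices):
    """Count scalars across matrices, counting each shared object only once."""
    seen = set()
    total = 0
    for matrix in matrices:
        if id(matrix) in seen:
            continue
        seen.add(id(matrix))
        total += sum(len(row) for row in matrix)
    return total


def output_logits(hidden, output_matrix):
    """One logit per vocabulary entry: dot product of the hidden state with each row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in output_matrix]


vocab_size, dim = 1000, 64
embedding = [[0.01] * dim for _ in range(vocab_size)]

# Untied: the output projection is a separate V x d matrix.
untied_output = [[0.01] * dim for _ in range(vocab_size)]
untied_total = count_unique_parameters([embedding, untied_output])

# Tied: the output projection is the embedding table itself.
tied_output = embedding
tied_total = count_unique_parameters([embedding, tied_output])

logits = output_logits([1.0] * dim, tied_output)

print(untied_total)  # 128000
print(tied_total)    # 64000
print(len(logits))   # 1000
```

Because `tied_output` and `embedding` are the same object, any update to one is an update to the other, and the saving is exactly *V x d* parameters.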

## What are parameter-efficient fine-tuning methods?

Full [fine-tuning](/wiki/fine_tuning) of a large pretrained model requires updating all parameters, which is expensive in terms of compute and memory. Parameter-efficient fine-tuning (PEFT) methods adapt a model to new tasks by modifying only a small fraction of parameters while keeping the rest frozen.

| Method | Approach | Trainable params (typical) | Key advantage |
|---|---|---|---|
| LoRA (Low-Rank Adaptation) | Injects trainable low-rank matrices into attention layers | 0.1% to 1% of original | No inference latency overhead after merging |
| Adapters | Inserts small bottleneck modules between existing layers | 1% to 5% of original | Modular; easy to swap for different tasks |
| Prefix tuning | Prepends trainable continuous vectors to transformer inputs | Less than 1% of original | Effective in low-data and few-shot settings |
| QLoRA | Combines LoRA with 4-bit quantization of the base model | 0.1% to 1% of original | Enables fine-tuning on a single consumer GPU |
| BitFit | Only trains bias terms; all weights are frozen | Less than 0.1% of original | Extremely lightweight |

LoRA has become the most widely adopted PEFT method. As its authors describe it, the method "freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks."[5] It decomposes each weight update into two low-rank matrices (A and B), so that the update delta_W = A x B has rank at most *r*, where *r* is typically 4 to 64. The original paper reported that LoRA can reduce the number of trainable parameters by up to 10,000 times and the GPU memory requirement by 3 times compared to full fine-tuning of GPT-3 175B.[5] After training, the low-rank updates can be merged back into the original weights, adding zero latency at inference time.[5]
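The parameter saving and the merge step can be sketched in plain Python. This is an illustration under stated assumptions, not the reference implementation: it follows this article's delta_W = A x B ordering, with A of shape d x r and B of shape r x k (the LoRA paper writes the product with the letters swapped), and the 4096 x 4096 layer size and rank 8 are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)] for row in a]


def lora_trainable_params(d, k, r):
    """Trainable scalars in A (d x r) plus B (r x k)."""
    return r * (d + k)


def merge(w, a, b):
    """Fold the low-rank update into the frozen weight: W + A x B."""
    delta = matmul(a, b)
    return [[w_ij + d_ij for w_ij, d_ij in zip(w_row, d_row)] for w_row, d_row in zip(w, delta)]


# Parameter count for one assumed 4096 x 4096 projection with rank 8.
full = 4096 * 4096
low_rank = lora_trainable_params(4096, 4096, 8)
print(full, low_rank, full // low_rank)  # 16777216 65536 256

# Tiny merge example: d = 3, k = 2, r = 1.
w = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
a = [[1.0], [2.0], [3.0]]
b = [[0.5, -1.0]]
merged = merge(w, a, b)
print(merged)  # [[1.5, -1.0], [1.0, -1.0], [1.5, -3.0]]
```

After merging, the layer is a single matrix of the original shape, which is why inference cost is unchanged.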

## What is the difference between trainable and frozen parameters?

In modern deep learning workflows, it is common to freeze some parameters while training others.

**Frozen parameters** are parameters whose values are not updated during training. Freezing is achieved by disabling gradient computation for those parameters. Common scenarios include:

- Transfer learning, where a pretrained backbone's parameters are frozen and only a new classification head is trained.
- PEFT methods, where the vast majority of a large model's parameters remain frozen.
- Feature extraction, where a pretrained encoder produces fixed representations for a downstream task.

**Trainable parameters** are parameters that receive gradient updates during optimization. The number of trainable parameters directly affects GPU memory consumption during training, since each trainable parameter requires storage for its gradient and optimizer state (momentum, variance in Adam).

In practice, with the Adam optimizer and 32-bit values, each trainable parameter requires roughly 12 to 16 bytes of memory (4 bytes for the parameter, 4 bytes for the gradient, and 4 to 8 bytes for optimizer states). A model with 7 billion trainable parameters therefore requires approximately 84 to 112 GB for parameters, gradients, and optimizer states alone, before counting activations.
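The byte budget above can be turned into a quick estimator. The sketch below assumes 32-bit storage throughout, treats 1 GB as 10^9 bytes, and ignores activations; the function name and the 0.5% trainable fraction in the last example are hypothetical. Frozen parameters are charged only for their 4-byte value, since they need no gradient or optimizer state.

```python
BYTES_WEIGHT = 4      # 32-bit parameter value
BYTES_GRADIENT = 4    # 32-bit gradient
BYTES_ADAM_STATE = 8  # momentum and variance estimates, 4 bytes each


def training_memory_gb(n_params, trainable_fraction=1.0, optimizer_bytes=BYTES_ADAM_STATE):
    """Memory for weights, gradients, and optimizer state, in GB (10^9 bytes)."""
    n_trainable = round(n_params * trainable_fraction)
    total_bytes = n_params * BYTES_WEIGHT + n_trainable * (BYTES_GRADIENT + optimizer_bytes)
    return total_bytes / 1e9


n = 7_000_000_000
low = training_memory_gb(n, optimizer_bytes=4)   # 12 bytes per parameter
high = training_memory_gb(n, optimizer_bytes=8)  # 16 bytes per parameter
mostly_frozen = training_memory_gb(n, trainable_fraction=0.005)

print(low, high)      # 84.0 112.0
print(mostly_frozen)
```

The last figure shows why freezing most parameters cuts training memory so sharply: the cost approaches the 4 bytes per parameter needed just to hold the weights.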

## How are parameters initialized before training?

Before training begins, parameters must be assigned initial values. The choice of initialization strategy significantly affects training speed, stability, and final model quality. Poor initialization can cause vanishing or exploding gradients, making training fail entirely.

### Common initialization strategies

| Strategy | Distribution | Variance | Best for |
|---|---|---|---|
| Zero initialization | All zeros | 0 | Biases only (never for weights) |
| Random uniform/normal | Uniform or Gaussian | Fixed | Simple networks |
| Xavier (Glorot) | Gaussian or uniform | 2 / (n_in + n_out) | Sigmoid, tanh activations |
| He (Kaiming) | Gaussian or uniform | 2 / n_in | ReLU and variants |
| Orthogonal | Orthogonal matrix | 1 | RNNs, very deep networks |

**Xavier initialization**, proposed by Xavier Glorot and Yoshua Bengio in 2010, sets weights so that the variance of activations remains approximately constant across layers during both the forward and backward passes.[2] It works well with symmetric activation functions such as tanh and sigmoid.

**He initialization**, proposed by Kaiming He and colleagues in 2015, accounts for the fact that ReLU activations zero out roughly half of their inputs.[3] It uses a larger variance (2 / n_in instead of 2 / (n_in + n_out)) to compensate, preventing the signal from vanishing in deep networks that use ReLU.

Modern transformer models often use a scaled normal initialization in which the standard deviation of weights that feed into residual connections is multiplied by 1 / sqrt(2L), where *L* is the number of layers. This keeps the accumulated residual signal from growing too large in very deep models.
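The variance formulas in the table can be checked numerically. The sketch below uses only the Python standard library; the layer sizes, the seed, and the base standard deviation of 0.02 in the residual-scaling example are assumptions chosen for illustration.

```python
import math
import random


def xavier_std(n_in, n_out):
    """Standard deviation for Xavier (Glorot) normal initialization."""
    return math.sqrt(2.0 / (n_in + n_out))


def he_std(n_in):
    """Standard deviation for He (Kaiming) normal initialization."""
    return math.sqrt(2.0 / n_in)


def scaled_residual_std(base_std, n_layers):
    """Base standard deviation multiplied by 1 / sqrt(2L)."""
    return base_std / math.sqrt(2 * n_layers)


def init_weights(n_in, n_out, std, rng):
    """Sample an n_in x n_out weight matrix from a zero-mean Gaussian."""
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]


rng = random.Random(0)  # fixed seed so the run is repeatable
weights = init_weights(256, 256, he_std(256), rng)

flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
empirical_var = sum((w - mean) ** 2 for w in flat) / len(flat)

print("Xavier std (256, 256):", xavier_std(256, 256))  # 0.0625
print("He std (256):", he_std(256))
print("He target variance:", 2.0 / 256, "empirical:", empirical_var)
print("Residual-scaled std (L=12):", scaled_residual_std(0.02, 12))
```

The empirical variance of the sampled matrix should land close to the 2 / n_in target, and the He standard deviation is larger than the Xavier one for the same layer shape.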

## What is the difference between parametric and non-parametric models?

Machine learning models are sometimes classified based on whether they have a fixed number of parameters.

**Parametric models** assume a specific functional form and have a fixed, finite number of parameters regardless of the amount of training data. Examples include linear regression, logistic regression, and neural networks. Once trained, predictions can be made using only the learned parameters; the training data is no longer needed.

**Non-parametric models** do not assume a fixed functional form. Their complexity grows with the amount of training data, and they may effectively retain the training data itself as part of the model. Examples include k-nearest neighbors, kernel density estimation, and decision trees (which grow deeper with more data). Gaussian processes are another example: while they have hyperparameters, the number of effective parameters scales with the dataset size.

| Property | Parametric | Non-parametric |
|---|---|---|
| Fixed number of parameters | Yes | No (grows with data) |
| Assumptions about data distribution | Strong (specific functional form) | Weak or none |
| Data efficiency | Higher (fewer samples needed) | Lower (needs more data) |
| Computational cost at prediction | Constant (independent of training set size) | Often scales with training set size |
| Risk of underfitting | Higher (if form is wrong) | Lower |
| Risk of [overfitting](/wiki/overfitting) | Lower (with appropriate regularization) | Higher (especially with small datasets) |
| Examples | Linear regression, neural networks | k-NN, kernel methods, decision trees |
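The contrast in the table can be seen in a few lines of plain Python. In this sketch (toy data and hypothetical function names), the fitted line keeps only two numbers after training, while the k-nearest-neighbors predictor must keep the entire training set to answer a query.

```python
def fit_line(xs, ys):
    """Least-squares line: a parametric model with exactly two parameters."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


def knn_predict(train_xs, train_ys, query, k=3):
    """Non-parametric prediction: average the targets of the k nearest points."""
    nearest = sorted(zip(train_xs, train_ys), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / k


# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2 * x + 1 for x in xs]

slope, intercept = fit_line(xs, ys)
line_prediction = slope * 2.0 + intercept   # needs only (slope, intercept)
knn_prediction = knn_predict(xs, ys, 2.0)   # needs all of xs and ys

print(slope, intercept)                     # 2.0 1.0
print(line_prediction, knn_prediction)      # 5.0 5.0
```

Adding more training points leaves the line at two parameters but makes the data that k-NN has to store and search grow with the dataset.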

It is worth noting that neural networks, despite having a fixed parameter count, can be extremely flexible due to their large number of parameters and nonlinear activation functions. In practice, large neural networks occupy a middle ground: they are technically parametric but can approximate nearly any function given enough parameters.

## Do more parameters always mean a better model?

The total number of parameters is often used as a rough proxy for a model's capacity, which is its ability to represent complex functions. Models with more parameters can, in principle, memorize more training data and capture finer-grained patterns. However, parameter count alone is an imperfect measure of capacity, and more parameters do not guarantee a better model.

Several factors weaken the link between parameter count and capacity:

- **Architecture matters.** A transformer with 1 billion parameters may outperform a fully connected network with the same count because of better inductive biases (attention, residual connections, layer normalization).
- **Effective parameters.** In mixture-of-experts models, only a subset of parameters is active for any given input. GPT-4, with an estimated 1.8 trillion total parameters, is reported in widely cited unofficial analyses to route only 2 of its 16 experts per token, so that on the order of a few hundred billion parameters (not the full 1.8 trillion) are active in any single forward pass. OpenAI has not published official figures.
- **Regularization.** Techniques like dropout, weight decay, and early stopping reduce the effective capacity of a model below what the raw parameter count might suggest.[1]
- **Scaling laws.** Empirical scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) show that model performance depends on the interplay between parameter count, dataset size, and compute budget, not on parameters alone.[6][7] Kaplan and colleagues found that test loss "scales as a power-law with model size, dataset size, and the amount of compute used for training."[6] Hoffmann's Chinchilla work refined this by showing that, for compute-optimal training, model size and the number of training tokens should be scaled in roughly equal proportion, with about 20 training tokens per parameter.[7]
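The 20-tokens-per-parameter rule of thumb from the Chinchilla result gives a quick way to size a training set. The sketch below is a rough calculator only; the function name is hypothetical and the ratio is the approximate figure quoted above, not an exact constant.

```python
TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio quoted above


def compute_optimal_tokens(n_params, tokens_per_param=TOKENS_PER_PARAM):
    """Rule-of-thumb training-token budget for a given parameter count."""
    return n_params * tokens_per_param


for n_params in (70_000_000_000, 175_000_000_000):
    tokens = compute_optimal_tokens(n_params)
    print(f"{n_params / 1e9:.0f}B parameters -> about {tokens / 1e12:.1f}T tokens")
```

For 70 billion parameters the rule gives 1.4 trillion tokens, matching the Chinchilla configuration described earlier in this article.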

The trend in recent years has been toward ever-larger models, but there is also growing interest in building smaller, more efficient models that achieve competitive performance through better architectures, data curation, and training recipes.

## References

1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 5 (Machine Learning Basics) and Chapter 7 (Regularization).
2. Glorot, X., & Bengio, Y. (2010). "Understanding the difficulty of training deep feedforward neural networks." *Proceedings of AISTATS*.
3. He, K., Zhang, X., Ren, S., & Sun, J. (2015). "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification." *Proceedings of ICCV*.
4. Vaswani, A., et al. (2017). "Attention is all you need." *Advances in Neural Information Processing Systems (NeurIPS)*.
5. Hu, E. J., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." *arXiv:2106.09685* (ICLR 2022). https://arxiv.org/abs/2106.09685
6. Kaplan, J., et al. (2020). "Scaling laws for neural language models." *arXiv:2001.08361*. https://arxiv.org/abs/2001.08361
7. Hoffmann, J., et al. (2022). "Training compute-optimal large language models." *arXiv:2203.15556* (Chinchilla paper). https://arxiv.org/abs/2203.15556
8. Brown, T., et al. (2020). "Language models are few-shot learners." *NeurIPS 2020* (GPT-3 paper). https://arxiv.org/abs/2005.14165
9. Devlin, J., et al. (2019). "BERT: Pre-training of deep bidirectional transformers for language understanding." *NAACL-HLT 2019*.
10. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. Chapters 3-4 (Parameter Estimation).

