# Hyperparameter

> Source: https://aiwiki.ai/wiki/hyperparameter
> Updated: 2026-06-20
> Categories: Deep Learning, Machine Learning, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

A hyperparameter is a configuration setting in a [machine learning](/wiki/machine_learning) algorithm that is fixed by the practitioner before training begins and is not learned from the [training data](/wiki/training_data). Examples include the [learning rate](/wiki/learning_rate), the [batch size](/wiki/batch_size), the number of layers in a [neural network](/wiki/neural_network), and the number of trees in a [random forest](/wiki/random_forest). Hyperparameters govern how a model learns rather than what it learns, and they are distinct from the model's [parameters](/wiki/parameter) (the [weights](/wiki/weights) and [biases](/wiki/biases) that the [optimizer](/wiki/optimizer) fits to the data). The process of selecting good values is called hyperparameter tuning or hyperparameter optimization, and it is widely regarded as one of the most consequential steps in building a machine learning system: a well-designed model with poorly chosen hyperparameters often performs worse than a simpler model that is carefully tuned.

## Introduction

Training a [machine learning](/wiki/machine_learning) [model](/wiki/model) means finding parameter values that let it make accurate predictions on new data. Some settings, however, cannot be learned from the [training data](/wiki/training_data) and must be chosen before training starts. These are known as hyperparameters, and they play a significant role in determining the model's performance. Choosing appropriate hyperparameters is one of the most important and time-consuming parts of building a machine learning system, because even a well-designed architecture can fail if its hyperparameters are poorly configured.

Yoshua Bengio described the [learning rate](/wiki/learning_rate) as "the single most important hyper-parameter" for gradient-based training of deep architectures in his widely cited 2012 guide [3], advising that "if there is only time to optimize one hyper-parameter and one uses stochastic gradient descent, then this is the hyper-parameter that is worth tuning" [3]. That observation extends to a broader principle: a small number of hyperparameters typically account for most of the variation in model performance. Understanding which hyperparameters matter, how they interact, and how to search for good values efficiently is a core practical skill in applied machine learning.

## What is a hyperparameter? (Definition)

Hyperparameters are configuration variables that are set before the training process begins and control the behavior of the learning algorithm. Unlike regular parameters (such as [weights](/wiki/weights) and [biases](/wiki/biases)), which are learned from data during training, hyperparameters must be specified by the practitioner and are not updated by the optimizer. Most stay fixed for the whole training run, although some, such as the learning rate, may follow a schedule chosen in advance. They govern how the model learns rather than what the model learns.

More formally, a hyperparameter is any setting whose value is used to control the learning process itself. Hyperparameters are external to the model in the sense that their optimal values cannot be estimated from the training data alone. Instead, they are typically chosen through experimentation, domain expertise, or automated search procedures.

Hyperparameters can be divided into two broad categories:

- **Model hyperparameters** define the structure of the model itself. Examples include the number of hidden layers in a [neural network](/wiki/neural_network), the number of trees in a [random forest](/wiki/random_forest), or the kernel type in a support vector machine.
- **Algorithm hyperparameters** control the training procedure rather than the model structure. Examples include the learning rate, [batch size](/wiki/batch_size), number of training [epochs](/wiki/epoch), and the choice of [optimizer](/wiki/optimizer).

Hyperparameters can influence a wide range of model behaviors, including the complexity of the model, how quickly it learns, how well it generalizes to unseen data, and how long training takes. Because of this broad influence, hyperparameter selection is closely tied to the [bias-variance tradeoff](/wiki/bias_variance_tradeoff): hyperparameters that increase model complexity tend to reduce bias but increase variance, while those that constrain the model tend to have the opposite effect.

## How do hyperparameters differ from parameters?

The distinction between parameters and hyperparameters is fundamental in machine learning. Parameters are the internal variables of the model that are learned directly from the training data through optimization algorithms like [gradient descent](/wiki/gradient_descent). Hyperparameters, by contrast, are set externally and dictate how the learning process operates.

| Aspect | [Parameter](/wiki/parameter) | Hyperparameter |
|---|---|---|
| Definition | Internal variable learned from data | External configuration set before training |
| When set | During training (learned automatically) | Before training (set by the practitioner) |
| Source of value | Estimated from the training data | Chosen via experimentation, heuristics, or search |
| Role | Captures patterns and relationships in the data | Controls the learning process and model complexity |
| Examples | Weights in a neural network, coefficients in linear regression, support vectors in [SVM](/wiki/support_vector_machine_svm) | Learning rate, number of hidden layers, [regularization](/wiki/regularization) strength, batch size |
| Updated during training | Yes, via [backpropagation](/wiki/backpropagation) or other optimization | Not by the optimizer; fixed, or following a preset schedule, for a given training run |
| Present in prediction | Yes, parameters define the final model | Not as learned values, although structural choices (e.g., number of layers) shape the trained model |
| Number | Can be millions or billions (e.g., [deep learning](/wiki/deep_learning) models) | Typically a handful to a few dozen |
| Optimization method | Gradient-based (e.g., gradient descent, Adam) | Derivative-free search (grid, random, Bayesian) |

A simple way to remember the distinction: parameters are what the model learns; hyperparameters are what you tell the model before it starts learning. Another useful framing is that parameters are optimized with respect to the training loss, while hyperparameters are optimized with respect to validation performance.
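The distinction can be illustrated with a minimal sketch, assuming a one-variable linear model fitted by plain gradient descent. The function name `fit_line` and the toy data are hypothetical, chosen only for illustration: `learning_rate` and `n_epochs` are hyperparameters passed in from outside, while `w` and `b` are parameters that the loop learns.

```python
def fit_line(xs, ys, learning_rate=0.05, n_epochs=500):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # parameters: start arbitrary, learned from the data
    n = len(xs)
    for _ in range(n_epochs):  # n_epochs: hyperparameter, fixed by the caller
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # learning_rate: hyperparameter controlling the step size.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys, learning_rate=0.05, n_epochs=2000)
print(f"learned parameters: w={w:.3f}, b={b:.3f}")
```

Changing `learning_rate` or `n_epochs` changes how (and whether) the loop converges, but neither value is adjusted by the loop itself.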

## Common hyperparameters by model type

Different families of machine learning algorithms expose different sets of hyperparameters. Below is a survey of the most commonly tuned hyperparameters organized by model type.

### Neural networks

Neural networks have a large number of hyperparameters that interact in complex ways. Tuning them effectively requires both understanding their individual effects and recognizing how they influence each other.

| Hyperparameter | Description | Typical range | Effect |
|---|---|---|---|
| Learning rate | Step size for weight updates during gradient descent | 1e-5 to 1e-1 | Too high causes divergence; too low causes slow convergence or getting stuck in poor local minima |
| Batch size | Number of training samples processed before each weight update | 16 to 512 | Smaller batches add regularizing noise and may generalize better; larger batches enable faster computation but may converge to sharper minima |
| Number of epochs | Number of complete passes through the training dataset | 10 to 1000+ | Too few epochs lead to [underfitting](/wiki/underfitting); too many lead to [overfitting](/wiki/overfitting) |
| Optimizer choice | Algorithm used to update weights (SGD, [Adam](/wiki/adam_optimizer), AdamW, RMSProp) | Categorical | Different optimizers suit different problems; Adam is a common default for its adaptive learning rates |
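| Weight decay | Penalty that shrinks weights toward zero at each update (equivalent to L2 regularization under plain SGD; applied separately from the gradient in AdamW) | 1e-6 to 1e-2 | Prevents weights from growing too large, reducing overfitting; interacts with learning rate |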
| [Dropout](/wiki/dropout_regularization) rate | Fraction of neurons randomly disabled during training | 0.0 to 0.5 | Higher values provide stronger regularization but may discard useful information; too low may not prevent overfitting |
| Number of layers | Depth of the network (number of hidden layers) | 1 to 100+ | Deeper networks can represent more complex functions but are harder to train and more prone to overfitting |
| Hidden units per layer | Width of each hidden layer | 32 to 4096 | More units increase the representational capacity but also increase computational cost and overfitting risk |
| Activation function | Non-linearity applied after each layer (ReLU, GELU, Tanh) | Categorical | Affects gradient flow and model expressiveness; ReLU is a widely used default |

Leslie Smith's 2018 paper "A Disciplined Approach to Neural Network Hyper-Parameters" provides practical guidance on tuning learning rate, batch size, momentum, and weight decay for neural networks [6]. A related and well-known heuristic, popularized by Goyal et al. (2017) for large-batch SGD training, is the linear scaling rule: when the batch size is multiplied by a factor k, the learning rate should also be multiplied by k to maintain similar training dynamics.
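The linear scaling rule amounts to a one-line computation. The sketch below uses a hypothetical helper name, `scale_learning_rate`, and illustrative baseline values; it is a starting point to be validated empirically, not a guarantee of identical training dynamics.

```python
def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Apply the linear scaling rule: multiply the learning rate by the
    same factor k by which the batch size was multiplied."""
    k = new_batch_size / base_batch_size
    return base_lr * k


# Illustrative values: a recipe tuned at batch size 256 with lr 0.1,
# moved to batch size 1024 (k = 4).
new_lr = scale_learning_rate(base_lr=0.1, base_batch_size=256, new_batch_size=1024)
print(new_lr)
```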

### Convolutional neural networks

[Convolutional neural networks](/wiki/convolutional_neural_network) (CNNs) inherit all the general neural network hyperparameters listed above and add several architecture-specific ones related to their convolutional layers.

| Hyperparameter | Description | Typical range | Effect |
|---|---|---|---|
| Kernel size | Height and width of the convolutional filter | 1x1 to 7x7 | Larger kernels capture wider spatial context but add more parameters; 3x3 is the most common choice since Simonyan and Zisserman (2015) showed that stacking small filters is more efficient than using large ones |
| Stride | Step size of the filter as it moves across the input | 1 to 3 | Larger strides reduce spatial dimensions more aggressively, lowering computation at the cost of spatial resolution |
| Padding | Number of pixels added around the input borders | 0 (valid) or same | "Same" padding preserves spatial dimensions; "valid" padding reduces them |
| Number of filters | Number of distinct filters (channels) in each convolutional layer | 16 to 1024 | More filters allow the network to learn more feature maps but increase memory and computation |
| Pooling size and type | Spatial downsampling operation (max pooling, average pooling) | 2x2 or 3x3 | Reduces spatial dimensions and provides translation invariance; max pooling is the most common default |
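
Kernel size, stride, and padding jointly determine the spatial size of a layer's output. The following sketch applies the standard formula for one spatial dimension, floor((n + 2p - k) / s) + 1, assuming no dilation; the function name is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution
    (no dilation): floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# A 3x3 kernel with padding 1 and stride 1 preserves the input size ("same").
print(conv_output_size(32, kernel_size=3, stride=1, padding=1))
# Without padding ("valid"), the same kernel shrinks the output.
print(conv_output_size(32, kernel_size=3, stride=1, padding=0))
# A 7x7 kernel with stride 2 and padding 3 halves a 224-pixel input.
print(conv_output_size(224, kernel_size=7, stride=2, padding=3))
```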

### Tree-based models

Tree-based models, including [decision trees](/wiki/decision_tree), random forests, and [gradient boosting](/wiki/gradient_boosting) methods, have their own distinct set of hyperparameters.

| Hyperparameter | Description | Typical range | Effect |
|---|---|---|---|
| max_depth | Maximum depth of each tree | 3 to 20 | Deeper trees can fit more complex patterns but are more prone to overfitting |
| min_samples_split | Minimum number of samples required to split an internal node | 2 to 20 | Higher values constrain the tree, acting as regularization |
| min_samples_leaf | Minimum number of samples required in a leaf node | 1 to 20 | Higher values smooth the model and prevent learning noise |
| n_estimators | Number of trees in the [ensemble](/wiki/ensemble_learning) (for forests and boosting) | 50 to 5000 | More trees generally improve performance up to a point, then plateau |
| max_features | Number or fraction of features considered for each split | sqrt(n), log2(n), or a fraction | Lower values increase diversity among trees and can reduce overfitting |
| learning_rate (boosting) | Shrinkage factor applied to each tree's contribution | 0.001 to 0.3 | Lower values require more trees but often yield better generalization |
| subsample (boosting) | Fraction of training samples used per tree | 0.5 to 1.0 | Introduces stochasticity; values below 1.0 can reduce overfitting |

For random forests, the n_estimators hyperparameter generally shows diminishing returns: performance improves sharply with more trees initially but plateaus after a certain point. A practical rule of thumb is to increase n_estimators until the validation error stops improving. The max_depth and min_samples_leaf hyperparameters are more critical for controlling overfitting.

For gradient boosting, the interaction between learning_rate and n_estimators is particularly important. A lower learning rate requires more estimators to achieve the same training loss but typically produces a model that generalizes better. Trees in gradient boosting are usually kept shallow (3 to 8 levels) compared to the deep trees used in random forests.

### Support vector machines

[Support vector machines](/wiki/support_vector_machine_svm) have a small but impactful set of hyperparameters.

| Hyperparameter | Description | Typical range | Effect |
|---|---|---|---|
| C (regularization) | Penalty parameter for misclassified training examples | 1e-3 to 1e3 | Low C allows a wider margin with more misclassifications (higher bias, lower variance); high C enforces a narrower margin with fewer misclassifications (lower bias, higher variance) |
| Kernel | Function used to map data into a higher-dimensional space (linear, RBF, polynomial, sigmoid) | Categorical | Determines the type of decision boundary; RBF (radial basis function) is a common default for non-linear problems |
| Gamma (RBF kernel) | Controls how far the influence of a single training example reaches (inversely related to the influence radius) | 1e-4 to 1e1, or 'scale' | Low gamma means each point has a broad influence (smoother boundary); high gamma means each point has a narrow influence (more complex boundary, risk of overfitting) |
| Degree (polynomial kernel) | Degree of the polynomial kernel function | 2 to 5 | Higher degree allows more complex boundaries but increases computation and overfitting risk |

When tuning SVMs, it is common to search over C and gamma jointly using a logarithmic grid (e.g., C in {0.01, 0.1, 1, 10, 100} and gamma in {0.001, 0.01, 0.1, 1}). Feature scaling is essential before training an SVM, because the algorithm is sensitive to the magnitude of input features.
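A logarithmic grid like the one above can be generated from exponents rather than typed by hand. This is a minimal sketch using only the standard library; the variable names are illustrative, and the resulting pairs would be passed to whatever training routine is in use.

```python
import itertools

# Powers of ten: C in {0.01, ..., 100}, gamma in {0.001, ..., 1}.
C_values = [10.0 ** e for e in range(-2, 3)]
gamma_values = [10.0 ** e for e in range(-3, 1)]

# Joint grid: every (C, gamma) pair is one candidate configuration.
grid = list(itertools.product(C_values, gamma_values))
print(len(grid), "candidate (C, gamma) pairs")
```

Spacing values by powers of ten reflects the fact that C and gamma typically matter on a multiplicative scale: the difference between 0.01 and 0.1 is usually more informative than the difference between 10.0 and 10.1.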

### Other models

Other machine learning algorithms also have important hyperparameters:

- **K-Nearest Neighbors (KNN):** The number of neighbors k, the distance metric (Euclidean, Manhattan, Minkowski), and whether to weight neighbors by distance.
- **Logistic and Linear Regression:** The regularization type (L1, L2, or Elastic Net), regularization strength (alpha or lambda), and solver choice.
- **[Naive Bayes](/wiki/naive_bayes):** Smoothing parameter (Laplace smoothing alpha), prior probabilities.
- **[Clustering](/wiki/clustering) (K-Means):** Number of clusters k, initialization method, maximum iterations.

## How are hyperparameters tuned?

Hyperparameter tuning (also called hyperparameter optimization) is the process of finding the combination of hyperparameter values that produces the best model performance on a validation set. Several strategies exist, ranging from simple exhaustive approaches to sophisticated model-based methods.

### Manual tuning

Manual tuning is the most basic approach, where the practitioner adjusts hyperparameters by hand based on intuition, experience, and observed results. Although it may seem outdated, manual tuning remains common in practice, especially for initial exploration. Experienced practitioners often know reasonable starting ranges for common hyperparameters and can narrow down the search space before applying automated methods. The main drawback is that manual tuning does not scale well to high-dimensional hyperparameter spaces and is difficult to reproduce.

### Grid search

Grid search is the most straightforward automated approach to hyperparameter tuning. The practitioner defines a finite set of values for each hyperparameter, and the algorithm evaluates every possible combination. For example, if there are two hyperparameters with 5 values each, grid search evaluates all 25 combinations.

**Advantages:**
- Simple to implement and understand.
- Guarantees that the best combination within the specified grid is found.
- Embarrassingly parallel, meaning each combination can be evaluated independently on separate machines.

**Disadvantages:**
- Suffers from the curse of dimensionality: the number of evaluations grows exponentially with the number of hyperparameters.
- Wastes evaluations on unimportant hyperparameters because it allocates equal effort to every dimension.
- Requires the practitioner to define the grid boundaries and granularity in advance.

Grid search is implemented in scikit-learn as `GridSearchCV`, which combines the exhaustive search with [cross-validation](/wiki/cross-validation) to produce robust performance estimates.
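
The core loop of grid search is small enough to sketch without any library. The code below is an illustration, not scikit-learn's implementation: `grid_search` is a hypothetical helper, and `toy_validation_score` is an assumed stand-in for "train a model with this configuration and return its validation score".

```python
import itertools
import math


def grid_search(objective, grid):
    """Evaluate every combination in `grid` (a dict of name -> list of
    values) and return the best configuration, its score, and the number
    of evaluations performed."""
    names = list(grid)
    best_config, best_score = None, float("-inf")
    n_evaluated = 0
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = objective(config)
        n_evaluated += 1
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score, n_evaluated


def toy_validation_score(config):
    # Hypothetical objective: peaks at learning_rate=1e-2, dropout=0.2.
    lr_term = (math.log10(config["learning_rate"]) + 2) ** 2
    dropout_term = (config["dropout"] - 0.2) ** 2
    return -(lr_term + dropout_term)


grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1, 1.0],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.4],
}
best_config, best_score, n_evaluated = grid_search(toy_validation_score, grid)
print(best_config, n_evaluated)
```

With two hyperparameters of five values each, the loop performs 25 evaluations; adding a third hyperparameter with five values would raise this to 125.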

### Random search

Random search samples hyperparameter values randomly from specified distributions rather than evaluating a fixed grid. Bergstra and Bengio demonstrated in their influential 2012 paper "Random Search for Hyper-Parameter Optimization" (published in the Journal of Machine Learning Research, volume 13, pages 281-305) that random search is significantly more efficient than grid search, especially when only a few hyperparameters actually matter for model performance [1]. The paper reports that "random search over the same domain is able to find models that are as good or better within a small fraction of the computation time" [1].

The key insight is that in most machine learning problems, only a small subset of hyperparameters have a large effect on performance. Grid search wastes many evaluations varying hyperparameters that do not matter, while random search explores a wider range of values for the important hyperparameters by distributing samples across the full space.

**Advantages:**
- More efficient than grid search when some hyperparameters are more important than others.
- Easy to implement and parallelize.
- Adding new hyperparameters does not reduce search efficiency for the other dimensions.
- A computational budget can be set independently of the number of hyperparameters.

**Disadvantages:**
- Does not guarantee finding the optimal combination.
- Does not learn from previous evaluations (each sample is independent).

Random search is implemented in scikit-learn as `RandomizedSearchCV`.
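
A minimal sketch of random search follows, again independent of scikit-learn. The helper names and the toy objective are assumptions made for illustration. The learning rate is sampled log-uniformly (uniform in the exponent), a common choice for hyperparameters that span several orders of magnitude.

```python
import math
import random


def sample_config(rng):
    """Draw one configuration from the search distributions."""
    return {
        # Log-uniform over [1e-5, 1e-1]: uniform in the exponent.
        "learning_rate": 10 ** rng.uniform(-5, -1),
        # Plain uniform over [0, 0.5].
        "dropout": rng.uniform(0.0, 0.5),
    }


def toy_validation_score(config):
    # Hypothetical objective: peaks at learning_rate=1e-2, dropout=0.2.
    lr_term = (math.log10(config["learning_rate"]) + 2) ** 2
    dropout_term = (config["dropout"] - 0.2) ** 2
    return -(lr_term + dropout_term)


def random_search(objective, sampler, n_trials, seed=0):
    """Evaluate n_trials independently sampled configurations."""
    rng = random.Random(seed)  # seeded for reproducibility
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sampler(rng)
        score = objective(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


best_config, best_score = random_search(toy_validation_score, sample_config, n_trials=25)
print(best_config, best_score)
```

Unlike a 5 x 5 grid, which tries only five distinct learning rates in 25 evaluations, this loop tries 25 distinct values of each hyperparameter within the same budget.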

### Bayesian optimization

[Bayesian optimization](/wiki/bayesian_optimization) is a sequential, model-based approach that builds a probabilistic surrogate model of the objective function (typically validation performance as a function of hyperparameters) and uses it to decide which configurations to evaluate next. The foundational work by Snoek, Larochelle, and Adams (2012), "Practical Bayesian Optimization of Machine Learning Algorithms," published at [NeurIPS](/wiki/neurips), demonstrated that this approach can match or exceed expert-level tuning with far fewer evaluations [2].

The process works as follows:

1. **Surrogate model:** A probabilistic model (often a Gaussian process) is fitted to the observed hyperparameter-performance pairs.
2. **Acquisition function:** A function (such as Expected Improvement or Upper Confidence Bound) uses the surrogate model to balance exploration of uncertain regions with exploitation of promising regions.
3. **Evaluation:** The configuration that maximizes the acquisition function is evaluated on the actual objective.
4. **Update:** The surrogate model is updated with the new observation, and the process repeats.
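
Step 2 can be made concrete with the Expected Improvement acquisition function. The sketch below assumes a maximization problem and a surrogate that returns a Gaussian predictive mean and standard deviation at each candidate; fitting the surrogate itself is omitted, and the function names and candidate values are illustrative.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best_so_far):
    """Expected Improvement (maximization) at a candidate whose surrogate
    prediction is Gaussian with the given mean and standard deviation."""
    if std <= 0.0:
        # No uncertainty: improvement is deterministic.
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    return (mean - best_so_far) * normal_cdf(z) + std * normal_pdf(z)


best_so_far = 0.80
# Candidate A: predicted equal to the incumbent, very little uncertainty.
ei_a = expected_improvement(mean=0.80, std=0.01, best_so_far=best_so_far)
# Candidate B: predicted slightly worse, but highly uncertain.
ei_b = expected_improvement(mean=0.78, std=0.10, best_so_far=best_so_far)
print(ei_a, ei_b)
```

In this example candidate B receives the larger score despite its lower predicted mean, which is the exploration side of the tradeoff: the acquisition function rewards uncertainty as well as a high prediction.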

An alternative to Gaussian processes is the Tree-structured Parzen Estimator (TPE), which models the conditional probability of the hyperparameters given good and bad performance separately. TPE is used by Optuna and Hyperopt and tends to scale better to higher-dimensional spaces than Gaussian process-based approaches.

**Advantages:**
- Learns from previous evaluations, becoming more efficient over time.
- Typically finds good hyperparameters with fewer total evaluations than grid or random search.
- Naturally balances exploration and exploitation.

**Disadvantages:**
- Overhead of fitting and querying the surrogate model.
- Standard Gaussian process models scale poorly to very high-dimensional search spaces.
- Sequential by nature, making parallelization less straightforward (though batch variants exist).

Popular implementations include Spearmint, SMAC (Sequential Model-based Algorithm Configuration), and the TPE used in Optuna and Hyperopt.

### Hyperband

Hyperband, introduced by Li, Jamieson, DeSalvo, Rostamizadeh, and Talwalkar in their 2018 JMLR paper "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization," takes a fundamentally different approach [4]. Instead of trying to be smarter about which configurations to evaluate, Hyperband focuses on being smarter about how much resource (such as training epochs or data subset size) to allocate to each configuration.

Hyperband is built on the Successive Halving Algorithm (SHA):

1. Start with a large number of randomly sampled configurations, each given a small resource budget.
2. Train all configurations for the allocated budget.
3. Evaluate their performance and discard the worst-performing half (or other fraction).
4. Double the resource budget for the surviving configurations.
5. Repeat until one configuration remains.
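
The five steps above can be sketched as a short loop. This is an illustration under stated assumptions: `train_and_score(config, budget)` is a hypothetical callback that trains a configuration for the given budget and returns a validation score (higher is better), and the toy scorer below simply stands in for real training.

```python
import math


def successive_halving(configs, train_and_score, min_budget=1, eta=2):
    """Keep the best 1/eta of the configurations each round while
    multiplying the per-configuration budget by eta."""
    survivors = list(configs)
    budget = min_budget
    rounds = []  # (number of configurations, budget) per round
    while len(survivors) > 1:
        rounds.append((len(survivors), budget))
        scored = [(train_and_score(c, budget), c) for c in survivors]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        keep = max(1, len(scored) // eta)
        survivors = [config for _, config in scored[:keep]]
        budget *= eta
    return survivors[0], rounds


def toy_train_and_score(learning_rate, budget):
    # Hypothetical scorer: quality peaks at learning_rate=1e-2, and a
    # penalty for being under-trained shrinks as the budget grows.
    quality = -abs(math.log10(learning_rate) + 2)
    return quality - 1.0 / budget


candidates = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.3, 0.5, 1.0]
best, rounds = successive_halving(candidates, toy_train_and_score)
print(best, rounds)
```

With eight candidates and `eta=2`, the loop runs three rounds (8, 4, then 2 configurations) at budgets 1, 2, and 4, so most of the total budget is spent on configurations that survived earlier cuts.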

Hyperband improves on Successive Halving by running it multiple times with different tradeoffs between the number of initial configurations and the minimum resource per configuration. This addresses the uncertainty in how aggressively to prune early.
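
The different tradeoffs are usually called brackets. The sketch below computes a bracket schedule, assuming the sizing rule n = ceil((s_max + 1) / (s + 1) * eta^s) initial configurations with a starting resource of R / eta^s, where R is the maximum resource per configuration; implementations differ in rounding details, so the exact counts should be read as illustrative.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def hyperband_brackets(max_resource, eta=3):
    """Return (s, initial configurations, starting resource) for each
    bracket, from most aggressive (many configs, little resource each)
    to least aggressive (few configs, full resource each)."""
    # s_max = floor(log_eta(max_resource)), computed with integers.
    s_max = 0
    while eta ** (s_max + 1) <= max_resource:
        s_max += 1
    brackets = []
    for s in range(s_max, -1, -1):
        n_configs = ceil_div((s_max + 1) * eta ** s, s + 1)
        start_resource = max_resource / eta ** s
        brackets.append((s, n_configs, start_resource))
    return brackets


for s, n_configs, start_resource in hyperband_brackets(max_resource=81, eta=3):
    print(f"bracket s={s}: {n_configs} configs starting at resource {start_resource}")
```

Each bracket is then run as one round of Successive Halving; the last bracket (s = 0) is equivalent to plain random search with the full resource for every configuration.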

**Advantages:**
- In the paper's experiments, achieves over an order-of-magnitude speedup compared to random search and Bayesian optimization on a variety of deep-learning and kernel-based problems, with reported speedups of roughly 5x to 30x [4].
- Simple and theoretically grounded.
- Highly parallelizable.

**Disadvantages:**
- Relies on the assumption that early performance is a reasonable predictor of final performance.
- Does not use a surrogate model to guide the search.

### Population-based training

Population-Based Training (PBT), introduced by Jaderberg et al. at DeepMind in 2017, combines hyperparameter optimization with training by maintaining a population of models trained in parallel [8]. Periodically, underperforming models copy weights and hyperparameters from better-performing models and apply random perturbations. This allows hyperparameters to change during training rather than being fixed, which is especially valuable for hyperparameters whose optimal values shift over the course of training (such as learning rate).

PBT discovers a schedule of hyperparameter settings rather than a single fixed configuration. This is a meaningful distinction from other tuning methods, because the best value of a hyperparameter such as the learning rate often changes as training progresses. PBT effectively learns this schedule adaptively from the population dynamics.
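
The periodic "copy and perturb" step can be sketched as follows. This is a simplified illustration, not the implementation from the PBT paper: the population is a list of dictionaries, the bottom quarter copies from the top quarter, and the perturbation multiplies the learning rate by 0.8 or 1.2. All of these choices, and the name `pbt_step`, are assumptions.

```python
import random


def pbt_step(population, rng, fraction=0.25, factors=(0.8, 1.2)):
    """One exploit/explore step: each bottom member copies the weights
    and learning rate of a top member, then perturbs the learning rate."""
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    cutoff = max(1, int(len(ranked) * fraction))
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for member in bottom:
        source = rng.choice(top)
        member["weights"] = list(source["weights"])  # exploit: copy weights
        member["lr"] = source["lr"] * rng.choice(factors)  # explore: perturb
        # The stale score is refreshed after the next stretch of training.
    return population


rng = random.Random(0)
population = [
    {"lr": 0.1, "weights": [1.0], "score": 0.9},
    {"lr": 0.01, "weights": [2.0], "score": 0.5},
    {"lr": 0.001, "weights": [3.0], "score": 0.4},
    {"lr": 1.0, "weights": [4.0], "score": 0.1},
]
pbt_step(population, rng)
print(population[3])
```

Calling this step repeatedly between stretches of training means that each surviving model's learning rate traces out a schedule over time rather than staying at its initial value.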

### Neural architecture search

[Neural architecture search](/wiki/neural_architecture_search) (NAS) extends the concept of hyperparameter optimization to the model architecture itself. Rather than choosing from a predefined set of architectures, NAS algorithms automatically search for the optimal network topology, including the number of layers, types of operations, and connectivity patterns.

NAS methods can be categorized by their search strategy:

- **Reinforcement learning-based NAS:** A controller network generates candidate architectures and is trained using the validation accuracy as a reward signal. Zoph and Le (2017) demonstrated this approach but at enormous computational cost (800 GPUs for weeks).
- **Differentiable NAS (DARTS):** Relaxes the discrete architecture choices into continuous ones, enabling gradient-based optimization of the architecture. This is orders of magnitude more efficient than RL-based search.
- **One-shot NAS:** Trains a single overparameterized "supernet" that contains all candidate architectures as subnetworks, then evaluates subnetworks by inheriting shared weights.

NAS can be viewed as optimizing a very complex hyperparameter (the architecture graph) and is sometimes combined with standard hyperparameter tuning in a joint optimization pipeline.

### ASHA (Asynchronous Successive Halving Algorithm)

ASHA extends Successive Halving to work efficiently in distributed and asynchronous settings. In standard SHA, all configurations at a given resource level (a "rung") must finish before any can be promoted or pruned. ASHA removes this synchronization barrier: as soon as a worker becomes free, it checks whether a configuration can be promoted to the next rung. If not, the worker begins evaluating a new random configuration.

This asynchronous design dramatically improves resource utilization in parallel computing environments, where different configurations may take different amounts of time to train. ASHA was described in the 2020 paper "A System for Massively Parallel Hyperparameter Tuning" by Li et al. [12].
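
The decision a free worker makes can be sketched as a single function. This is a simplified illustration of the promotion rule, not the reference implementation: `rungs[k]` is assumed to hold the `(score, config)` pairs that have finished at rung k, `promoted[k]` the configurations already promoted out of rung k, and `sample_config` a hypothetical callback that draws a new random configuration.

```python
def asha_next_job(rungs, promoted, eta, sample_config):
    """Return (config, rung) for a free worker: promote a configuration
    if one qualifies, otherwise start a new one at the bottom rung."""
    # Check the highest rungs first; the top rung cannot promote further.
    for k in range(len(rungs) - 2, -1, -1):
        finished = sorted(rungs[k], key=lambda pair: pair[0], reverse=True)
        top = finished[: len(finished) // eta]  # top 1/eta seen so far
        for _, config in top:
            if config not in promoted[k]:
                promoted[k].add(config)
                return config, k + 1
    return sample_config(), 0


# Four configurations have finished rung 0; nothing has finished above it.
rungs = [[(0.9, "a"), (0.5, "b"), (0.7, "c"), (0.2, "d")], [], []]
promoted = [set(), set(), set()]
jobs = [asha_next_job(rungs, promoted, 2, lambda: "fresh") for _ in range(3)]
print(jobs)
```

Because the top 1/eta is computed from whatever has finished so far, a worker never waits for stragglers; the cost is that an early promotion may later turn out not to belong in the final top fraction.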

### Other methods

- **Multi-fidelity methods (BOHB):** Combine Bayesian optimization with Hyperband to get the benefits of both informed search and adaptive resource allocation. BOHB uses a kernel density estimator as the surrogate model and the Hyperband scheduling scheme to allocate resources.
- **Evolutionary/Genetic Algorithms:** Use mutation, crossover, and selection operators to evolve a population of hyperparameter configurations over generations.
- **Gradient-based hyperparameter optimization:** Computes gradients of the validation loss with respect to hyperparameters using techniques like implicit differentiation or unrolled optimization. This approach can be efficient for continuous hyperparameters but is limited by the requirement that the training procedure be differentiable end-to-end.

### Comparison of tuning methods

| Method | Guided by prior evaluations | Handles early stopping | Parallelizable | Best for |
|---|---|---|---|---|
| Grid search | No | No | Easily | Small search spaces (1-2 hyperparameters) |
| Random search | No | No | Easily | Initial exploration, moderate search spaces |
| Bayesian optimization | Yes | No (unless combined) | With batch variants | Fine-tuning after narrowing the search space |
| Hyperband | No | Yes | Easily | Large search spaces with expensive evaluations |
| BOHB | Yes | Yes | Yes | General-purpose, when compute is limited |
| PBT | Yes (population) | Implicit | Yes (population) | Long training runs with schedule-sensitive hyperparameters |
| NAS | Varies | Varies | Varies | Architecture design when compute is available |

## Cross-validation for hyperparameter selection

Cross-validation is the standard statistical technique for evaluating hyperparameter configurations. Rather than evaluating a single train/validation split (which can be noisy), k-fold cross-validation partitions the data into k subsets (folds), trains on k-1 folds, validates on the remaining fold, and rotates through all k choices. The average validation score across folds provides a more robust estimate of generalization performance.
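
The fold rotation can be sketched in a few lines. The helper names are hypothetical, and the split below is contiguous and unshuffled for clarity; in practice the data is usually shuffled (or stratified) first. `fit_and_score` is an assumed callback that trains on the training indices and returns a score on the validation indices.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_val_score(n_samples, k, fit_and_score):
    """Average the validation score over all k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


for train, validation in k_fold_indices(n_samples=10, k=3):
    print("train:", train, "validation:", validation)
```

During hyperparameter search, `cross_val_score` would be called once per candidate configuration, which is why k-fold evaluation multiplies the cost of every search method by roughly k.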

A critical concern in hyperparameter optimization is overfitting the validation set. When many hyperparameter combinations are evaluated against the same validation data, the selected configuration may exploit idiosyncrasies of that specific data partition rather than reflecting true generalization ability. Nested cross-validation addresses this by using an inner loop for hyperparameter selection and an outer loop for performance estimation, ensuring that the test data used for final evaluation is never seen during hyperparameter tuning.

In practice, the choice of k involves a tradeoff. Larger k (e.g., 10 or leave-one-out) trains each model on more of the data and gives lower-bias estimates, but is computationally expensive. Smaller k (e.g., 3 or 5) is faster, but each model sees less data, so the estimate tends to be more pessimistic. For large datasets, a single held-out validation set is often sufficient because the validation estimate is already stable.

## Hyperparameter optimization tools

Several software frameworks and libraries have been developed to make hyperparameter optimization more accessible and scalable.

| Tool | Developer | Key features | Search algorithms | Language |
|---|---|---|---|---|
| [Optuna](https://optuna.org/) | Preferred Networks | Define-by-run API, pruning of unpromising trials, visualization dashboard | TPE, CMA-ES, Grid, Random | Python |
| [Ray Tune](https://docs.ray.io/en/latest/tune/) | Anyscale | Distributed execution across clusters, integrates with many frameworks | Supports Optuna, HyperOpt, Bayesian, PBT, ASHA | Python |
| [W&B Sweeps](https://wandb.ai/) | Weights & Biases | Integrated experiment tracking, collaborative visualization, cloud-based | Bayesian, Grid, Random | Python |
| [Keras Tuner](https://keras.io/keras_tuner/) | Google / Keras | Tight integration with Keras and [TensorFlow](/wiki/tensorflow), built-in tuners | Bayesian, Hyperband, Random | Python |
| [Hyperopt](http://hyperopt.github.io/hyperopt/) | James Bergstra et al. | Mature library, supports MongoDB for distributed trials | TPE, Random, Adaptive TPE | Python |
| [SMAC](https://automl.github.io/SMAC3/) | AutoML Freiburg | Strong on combinatorial and conditional spaces | Random Forest-based Bayesian | Python |

**Optuna** has gained significant popularity since its introduction by Akiba et al. in 2019 [5]. Its define-by-run API allows the search space to be constructed dynamically within the objective function, which makes it easy to define conditional hyperparameters (for example, only tuning gamma when the kernel is set to RBF). Optuna also features built-in pruning of unpromising trials using algorithms like Median Pruning and Successive Halving, which stop training early if intermediate results look poor. Optuna is released under the MIT license and was developed at the Japanese AI company Preferred Networks [5].

**Ray Tune** focuses on scalability, enabling distributed hyperparameter searches across hundreds of machines. It acts as a unified interface for multiple search algorithms and scheduling strategies, including ASHA, PBT, and Bayesian optimization via Optuna or HyperOpt backends.

**Weights & Biases Sweeps** integrates hyperparameter optimization with experiment tracking, making it easy to visualize how different hyperparameter combinations affect metrics over time. Its collaborative features are well suited for team-based machine learning projects.

**Keras Tuner** is designed for practitioners working within the Keras ecosystem and provides a simple interface for common tuning workflows, including Hyperband scheduling.

## Learning rate schedules

The learning rate is often considered the single most important hyperparameter in deep learning (Bengio, 2012) [3]. Rather than keeping the learning rate fixed throughout training, practitioners commonly adjust it over time using a learning rate schedule. This technique can significantly improve both convergence speed and final model performance.

### Common schedules

**Step Decay:** The learning rate is reduced by a fixed factor at predetermined intervals. For example, the learning rate might be multiplied by 0.1 every 30 epochs. This is one of the simplest schedules and was widely used in early [convolutional neural network](/wiki/convolutional_neural_network) research.

**Exponential Decay:** The learning rate decreases exponentially over time according to the formula lr(t) = lr_0 * e^(-kt), where k is the decay rate. This produces a smooth, continuous reduction.
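
Step decay and exponential decay can each be written as a one-line function of the epoch. The sketch below uses hypothetical function names and the example values from the text (a factor of 0.1 every 30 epochs); the decay rate `k` in the second call is an arbitrary illustrative value.

```python
import math


def step_decay(initial_lr, epoch, drop_factor=0.1, epochs_per_drop=30):
    """Multiply the learning rate by drop_factor every epochs_per_drop epochs."""
    return initial_lr * drop_factor ** (epoch // epochs_per_drop)


def exponential_decay(initial_lr, t, k):
    """lr(t) = lr_0 * e^(-k t)."""
    return initial_lr * math.exp(-k * t)


for epoch in (0, 29, 30, 60):
    print(epoch, step_decay(0.1, epoch), exponential_decay(0.1, epoch, k=0.05))
```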

**Cosine Annealing:** The learning rate follows a cosine curve from its initial value down to a minimum (often near zero). First proposed by Loshchilov and Hutter in 2016 [7], cosine annealing decreases the learning rate slowly at first, then more rapidly in the middle of training, and slowly again near the end. This schedule has become popular for training [transformers](/wiki/transformer) and vision models.

**Warmup:** Training begins with a very small learning rate that linearly increases to the target value over a set number of steps or epochs. Warmup helps stabilize training in the early stages when gradients can be large and erratic, particularly with adaptive optimizers such as Adam or with large batch sizes. Most modern transformer training recipes combine a warmup phase with a subsequent decay schedule.

**Warmup + Cosine Annealing:** A common combined schedule, especially in transformer training, starts with linear warmup over a few thousand steps followed by cosine decay for the remainder of training. This combination has become a de facto standard for large language model pre-training.
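As a rough illustration of the combined schedule, the sketch below computes the learning rate for a given step. The function name, its arguments, and the choice to count warmup from `(step + 1)` are assumptions made for this example; real training recipes differ in such details.

```python
import math


def warmup_cosine_lr(step: int, total_steps: int, warmup_steps: int,
                     peak_lr: float, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp; reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Inspect the schedule at the start, end of warmup, midpoint, and final step.
for step in (0, 99, 550, 1000):
    print(step, warmup_cosine_lr(step, total_steps=1000, warmup_steps=100, peak_lr=1e-3))
```

The cosine term equals 1 at the start of the decay phase and 0 at the end, which produces the slow-fast-slow shape described under cosine annealing.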

**Cyclical Learning Rates:** Proposed by Leslie Smith in 2017, cyclical learning rates oscillate between a minimum and maximum value in a triangular or other repeating pattern. The intuition is that periodically increasing the learning rate can help the optimizer escape sharp local minima and find flatter, more generalizable solutions.

**One-Cycle Policy:** Also proposed by Smith, this schedule uses a single cycle of learning rate increase followed by decrease over the entire training run. It often achieves faster convergence and better final performance than fixed schedules, a phenomenon Smith termed "super-convergence" [6].
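The triangular form of a cyclical learning rate can be sketched as follows. The function and parameter names are hypothetical, and only the basic triangular pattern is shown, without the amplitude-shrinking variants.

```python
def triangular_clr(step: int, base_lr: float, max_lr: float,
                   half_cycle_steps: int) -> float:
    """Triangular cyclical learning rate oscillating between base_lr and max_lr."""
    # Position inside the current cycle, in [0, 2 * half_cycle_steps).
    cycle_position = step % (2 * half_cycle_steps)
    # Normalized distance from the peak of the triangle: 1 at the ends, 0 at the peak.
    distance_from_peak = abs(cycle_position - half_cycle_steps) / half_cycle_steps
    return base_lr + (max_lr - base_lr) * (1.0 - distance_from_peak)


# One full cycle lasts 200 steps here: up for 100 steps, down for 100 steps.
for step in (0, 50, 100, 150, 200):
    print(step, triangular_clr(step, base_lr=1e-4, max_lr=1e-2, half_cycle_steps=100))
```

A one-cycle schedule can be thought of as the special case where a single cycle is stretched over the whole training run, although practical one-cycle recipes usually add a final phase with an even lower learning rate.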

### Choosing a schedule

The optimal learning rate schedule depends on the model architecture, dataset size, and training budget. For most modern deep learning tasks, warmup combined with cosine annealing provides a strong baseline. For quick experiments or smaller models, step decay remains effective and easy to configure.

## Which hyperparameters matter most?

Not all hyperparameters are equally important. Research by Hutter, Hoos, and Leyton-Brown (2014) on functional ANOVA decomposition showed that in many machine learning algorithms, most of the performance variation can be attributed to just a few hyperparameters [10]. This observation has practical implications: practitioners should identify and focus their tuning effort on the hyperparameters that matter most for their specific problem.

For neural networks, the learning rate is consistently the most influential hyperparameter. Probst, Boulesteix, and Bischl (2019), in a large-scale study of hyperparameter tunability across 38 datasets and six algorithm families published in the Journal of Machine Learning Research, found that the relative importance of hyperparameters varies by algorithm family but is often quite concentrated [14]. For example, in gradient boosting, the learning rate and number of estimators tend to dominate; in random forests, the number of features considered at each split (max_features) and minimum samples per leaf are most consequential.

Sensitivity analysis can be performed in several ways:

- **One-at-a-time (OAT) analysis** varies each hyperparameter independently while holding others at their default values. This is simple but misses interactions.
- **Functional ANOVA** decomposes the total variance in performance into contributions from individual hyperparameters and their interactions. This provides a more complete picture but requires enough evaluations to fit the decomposition.
- **Ablation studies** compare a fully tuned configuration against configurations where individual hyperparameters are reset to their defaults. This reveals which hyperparameters contributed most to the final performance gain.
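The one-at-a-time approach from the list above is simple enough to sketch directly. In this example the objective is a toy stand-in for a real train-and-validate run, and all names and candidate values are hypothetical.

```python
import math
from typing import Callable, Dict, List


def one_at_a_time(objective: Callable[[Dict[str, float]], float],
                  defaults: Dict[str, float],
                  candidates: Dict[str, List[float]]) -> Dict[str, float]:
    """For each hyperparameter, vary it alone and report the spread of the objective."""
    spreads = {}
    for name, values in candidates.items():
        scores = []
        for value in values:
            config = dict(defaults)   # all other hyperparameters stay at their defaults
            config[name] = value
            scores.append(objective(config))
        spreads[name] = max(scores) - min(scores)
    return spreads


def toy_validation_loss(config: Dict[str, float]) -> float:
    # Hypothetical loss surface: very sensitive to learning rate, barely to momentum.
    lr_term = (math.log10(config["learning_rate"]) + 3.0) ** 2
    momentum_term = 0.1 * (config["momentum"] - 0.9) ** 2
    return lr_term + momentum_term


defaults = {"learning_rate": 1e-3, "momentum": 0.9}
candidates = {
    "learning_rate": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    "momentum": [0.0, 0.5, 0.9, 0.99],
}
spreads = one_at_a_time(toy_validation_loss, defaults, candidates)
print(spreads)
```

A larger spread suggests a more sensitive hyperparameter. Because every other setting is pinned to its default, this procedure cannot detect interactions, which is the limitation functional ANOVA is meant to address.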

For [LSTMs](/wiki/long_short-term_memory_lstm) specifically, Greff et al. (2017) performed a large-scale hyperparameter study spanning 5,400 experimental runs, roughly 15 years of CPU time, and concluded that the learning rate was "by far the most important hyperparameter, followed by the network size," while "the momentum was found to be unimportant" [13]. The authors described this as the largest study of its kind on LSTM networks at the time [13].

## How can hyperparameters be reused across tasks and scales?

A long-standing goal in machine learning is to reduce the cost of hyperparameter tuning by transferring knowledge from previous experiments. There are several approaches to this problem.

### Warm-starting from related tasks

When tuning hyperparameters for a new task that is similar to a previously solved task, the hyperparameter configurations that worked well on the old task can serve as strong starting points. Meta-learning approaches formalize this idea by training a model (or building a database) that predicts good hyperparameter configurations based on dataset characteristics. For example, Auto-sklearn uses meta-learning to warm-start Bayesian optimization by initializing it with configurations that performed well on similar datasets [11].
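The warm-starting idea can be illustrated with a nearest-neighbor lookup over dataset meta-features. This is a simplified sketch in that spirit, not Auto-sklearn's actual implementation; the meta-features, the prior runs, and every name below are hypothetical.

```python
import math
from typing import Dict, List


def nearest_prior_configs(new_meta: Dict[str, float],
                          prior_runs: List[dict], k: int = 2) -> List[dict]:
    """Return the best configs of the k prior datasets closest in meta-feature space."""
    def distance(a: Dict[str, float], b: Dict[str, float]) -> float:
        return math.sqrt(sum((a[key] - b[key]) ** 2 for key in a))

    ranked = sorted(prior_runs, key=lambda run: distance(new_meta, run["meta"]))
    return [run["best_config"] for run in ranked[:k]]


# Hypothetical record of earlier tuning runs, described by log10 dataset sizes.
prior_runs = [
    {"meta": {"log_samples": 3.0, "log_features": 1.0}, "best_config": {"max_depth": 4}},
    {"meta": {"log_samples": 5.0, "log_features": 2.0}, "best_config": {"max_depth": 10}},
    {"meta": {"log_samples": 6.0, "log_features": 3.0}, "best_config": {"max_depth": 16}},
]
new_dataset = {"log_samples": 5.2, "log_features": 2.1}

# These configurations would be evaluated first, before the optimizer proposes its own.
initial_configs = nearest_prior_configs(new_dataset, prior_runs, k=2)
print(initial_configs)
```

The returned configurations only seed the search; the optimizer is still free to move away from them if they turn out to be poor on the new task.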

### Maximal update parameterization (muP)

One of the most significant recent developments in hyperparameter transfer is the Maximal Update Parameterization (muP), introduced by Greg Yang and Edward Hu and applied to hyperparameter tuning in the 2022 paper "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer" (Yang et al.) [9]. muP addresses a specific and expensive problem: when scaling a neural network to a larger size, hyperparameters (especially the learning rate and initialization scale) that worked at the smaller size often do not transfer, requiring costly re-tuning at each new scale.

muP re-parameterizes the model so that activation scales remain consistent across different model widths during training. Under this parameterization, the optimal learning rate, initialization variance, and other key hyperparameters remain stable as the model width increases. This enables a workflow called muTransfer:

1. Define the model architecture with muP-compliant scaling rules.
2. Tune hyperparameters on a small proxy model (e.g., 40M parameters).
3. Transfer those hyperparameters directly to the full-scale model (e.g., 6.7B parameters) without any additional tuning.
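To give a flavor of what "muP-compliant scaling rules" look like, the sketch below shows one rule as it is commonly summarized for Adam-style optimizers: the learning rate of hidden weight matrices shrinks in inverse proportion to width. This is a deliberate simplification under that assumption. The full parameterization also changes initialization and output-layer scaling, none of which is captured here, and the function name is hypothetical.

```python
def scaled_hidden_lr(base_lr: float, base_width: int, target_width: int) -> float:
    """Scale a hidden-layer learning rate tuned at base_width to target_width.

    Assumption: Adam-style optimizer, hidden weight matrices only.
    """
    width_multiplier = target_width / base_width
    return base_lr / width_multiplier


# Learning rate tuned on a narrow proxy model, reused on a model 4x wider.
proxy_lr = 1e-3
print(scaled_hidden_lr(proxy_lr, base_width=256, target_width=1024))
```

The point of the parameterization is that the tuned value on the proxy (`proxy_lr` here) stays the quantity the practitioner reasons about, while the width-dependent factors are applied mechanically.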

In experiments on GPT-3 scale models, the authors report that "by transferring from 40M parameters, we outperform published numbers of the 6.7B GPT-3 model, with tuning cost only 7% of total pretraining cost" [9]. They report a similar result for [BERT](/wiki/bert): transferring pretraining hyperparameters from a 13M-parameter model outperformed published numbers for BERT-large (350M parameters) "with a total tuning cost equivalent to pretraining BERT-large once" [9]. In both cases the search ran on a proxy model far smaller than the target, which is what keeps the tuning cost at or below that of a single full pretraining run.

muP has been adopted or studied by several organizations training large language models, and an improved variant called u-muP (unit-scaled muP) has been proposed to further simplify the parameterization by combining muP with unit scaling, ensuring that activations, weights, and gradients all begin training at a scale of one.

## Impact on model performance

Hyperparameters affect virtually every aspect of model performance.

**Convergence speed:** The learning rate, optimizer, and batch size directly determine how quickly the model's [loss function](/wiki/loss_function) decreases during training. Poorly chosen values can cause training to stall, oscillate, or diverge entirely.

**Generalization:** Hyperparameters like dropout rate, weight decay, and regularization strength control how well the model performs on unseen data. These regularization hyperparameters help manage the overfitting/underfitting tradeoff.

**Model capacity:** The number of layers, hidden units, and (for tree models) maximum depth determine the complexity of functions the model can represent. Insufficient capacity leads to underfitting; excessive capacity leads to overfitting when not paired with adequate regularization.

**Training stability:** Certain hyperparameter combinations can cause training instabilities such as exploding or vanishing gradients, mode collapse in generative models, or numerical overflow. Techniques like learning rate warmup, gradient clipping, and careful initialization are hyperparameter-driven solutions to these problems.

**Computational cost:** Larger batch sizes, more layers, and longer training schedules all increase the computational resources required. In practice, hyperparameter selection often involves a tradeoff between model quality and compute budget.

Hyperparameter optimization can make as large a difference in model performance as architectural changes, and sometimes a larger one. A well-tuned simple model often outperforms a poorly tuned complex model.

## Best practices

The following guidelines can help practitioners tune hyperparameters more effectively:

1. **Start with established defaults.** Most frameworks and papers provide recommended default values. Begin with these and adjust based on validation performance rather than starting from scratch.

2. **Tune the most important hyperparameters first.** For neural networks, the learning rate almost always has the largest effect on performance. Focus on it first before tuning other hyperparameters. For tree-based models, max_depth and learning_rate (for boosting) are typically the highest-priority hyperparameters.

3. **Use logarithmic scales for hyperparameters that span orders of magnitude.** Learning rate, regularization strength, and weight decay should be searched on a log scale (e.g., 1e-5, 1e-4, 1e-3, 1e-2) rather than a linear scale, because their effect is roughly proportional to their order of magnitude. Bounded hyperparameters such as the dropout rate can usually be searched on a linear scale.

4. **Always use a separate validation set.** Evaluate hyperparameter combinations on a held-out validation set, not the training set. For small datasets, use k-fold cross-validation to get a more reliable estimate of generalization performance.

5. **Prefer random search over grid search for initial exploration.** As Bergstra and Bengio (2012) showed, random search is more efficient when only a subset of hyperparameters are truly important, which is the typical case [1].

6. **Graduate to Bayesian optimization for fine-tuning.** After random search narrows the promising region, Bayesian optimization can efficiently zoom in on the best values with fewer evaluations.

7. **Use early stopping.** Monitor validation performance during training and stop when it begins to degrade. This acts as an implicit regularizer and saves computation. Many frameworks (including Optuna) support pruning of unpromising trials based on intermediate results.

8. **Document and track experiments.** Record every hyperparameter configuration and its corresponding performance. Tools like Weights & Biases, [MLflow](/wiki/mlflow), and TensorBoard make this easier and enable retrospective analysis.

9. **Consider computational budget.** The choice of tuning method should reflect the available resources. Grid search may be acceptable for one or two hyperparameters, but random search or Bayesian optimization is more appropriate for larger search spaces.

10. **Be aware of hyperparameter interactions.** Some hyperparameters interact strongly. For example, learning rate and batch size are closely coupled, and learning rate and weight decay jointly affect regularization. Tuning them independently can miss the optimal combination.

11. **Watch for validation set overfitting.** When evaluating many configurations against the same validation data, the selected hyperparameters can overfit to that specific split. Use nested cross-validation or a separate test set to get an unbiased performance estimate.
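Two of the guidelines above, log-scale sampling and random search, combine naturally. The following is a minimal sketch using only the standard library; the objective is a toy stand-in for validation loss, and the search ranges and names are illustrative assumptions.

```python
import math
import random
from typing import Callable, Dict, Tuple


def sample_log_uniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value whose logarithm is uniform between log10(low) and log10(high)."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


def random_search(objective: Callable[[Dict[str, float]], float],
                  n_trials: int, seed: int = 0) -> Tuple[Dict[str, float], float]:
    """Return the sampled configuration with the lowest objective value."""
    rng = random.Random(seed)  # fixed seed so the search is reproducible
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = {
            "learning_rate": sample_log_uniform(rng, 1e-5, 1e-1),
            "weight_decay": sample_log_uniform(rng, 1e-6, 1e-2),
        }
        score = objective(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_loss(config: Dict[str, float]) -> float:
    # Hypothetical surface: learning rate matters far more than weight decay.
    lr_term = (math.log10(config["learning_rate"]) + 3.0) ** 2
    wd_term = 0.01 * (math.log10(config["weight_decay"]) + 4.0) ** 2
    return lr_term + wd_term


best_config, best_score = random_search(toy_validation_loss, n_trials=200)
print(best_config, best_score)
```

In a real workflow the objective would train a model and return its score on the held-out validation set, and the region around `best_config` would then be a reasonable starting range for a Bayesian optimizer.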

## Explain like I'm 5 (ELI5)

Imagine you are teaching a robot how to play a game, but first it needs to know the rules it should follow. These rules are like hyperparameters in machine learning: they dictate how the robot should act and react when playing the game.

For example, you could instruct the robot to take one step at a time or two steps at once. This is similar to setting the learning rate hyperparameter on machine learning models; it tells them how much to adjust their predictions based on new data they encounter.

You could instruct the robot to pay close attention to other players or focus solely on itself. This is similar to setting a regularization strength hyperparameter in machine learning models, controlling how much attention should be paid to training data to prevent overfitting.

The robot does not get to choose these rules itself. You have to decide them before the game starts. If you pick bad rules, the robot will play poorly no matter how many games it practices. If you pick good rules, the robot will learn quickly and play well. Finding the best rules is what hyperparameter tuning is all about.

## References

1. Bergstra, J. and Bengio, Y. (2012). "Random Search for Hyper-Parameter Optimization." *Journal of Machine Learning Research*, 13, 281-305.
2. Snoek, J., Larochelle, H., and Adams, R.P. (2012). "Practical Bayesian Optimization of Machine Learning Algorithms." *Advances in Neural Information Processing Systems 25 (NeurIPS 2012)*.
3. Bengio, Y. (2012). "Practical Recommendations for Gradient-Based Training of Deep Architectures." In *Neural Networks: Tricks of the Trade*, Springer, 437-478. (arXiv:1206.5533)
4. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2018). "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization." *Journal of Machine Learning Research*, 18(185), 1-52.
5. Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. (2019). "Optuna: A Next-generation Hyperparameter Optimization Framework." *Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*.
6. Smith, L.N. (2018). "A Disciplined Approach to Neural Network Hyper-Parameters: Part 1 -- Learning Rate, Batch Size, Momentum, and Weight Decay." *arXiv preprint arXiv:1803.09820*.
7. Loshchilov, I. and Hutter, F. (2016). "SGDR: Stochastic Gradient Descent with Warm Restarts." *arXiv preprint arXiv:1608.03983*.
8. Jaderberg, M. et al. (2017). "Population Based Training of Neural Networks." *arXiv preprint arXiv:1711.09846*.
9. Yang, G., Hu, E.J., et al. (2022). "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer." *arXiv preprint arXiv:2203.03466*.
10. Hutter, F., Hoos, H., and Leyton-Brown, K. (2014). "An Efficient Approach for Assessing Hyperparameter Importance." *Proceedings of the 31st International Conference on Machine Learning (ICML 2014)*.
11. Feurer, M. and Hutter, F. (2019). "Hyperparameter Optimization." In *Automated Machine Learning: Methods, Systems, Challenges*, Springer.
12. Li, L. et al. (2020). "A System for Massively Parallel Hyperparameter Tuning." *Proceedings of Machine Learning and Systems (MLSys)*.
13. Greff, K., Srivastava, R.K., Koutnik, J., Steunebrink, B.R., and Schmidhuber, J. (2017). "LSTM: A Search Space Odyssey." *IEEE Transactions on Neural Networks and Learning Systems*, 28(10), 2222-2232. (arXiv:1503.04069)
14. Probst, P., Boulesteix, A.-L., and Bischl, B. (2019). "Tunability: Importance of Hyperparameters of Machine Learning Algorithms." *Journal of Machine Learning Research*, 20(53), 1-32.

