See also: Machine learning terms
Machine learning involves finding the set of parameters that allows a model to make accurate predictions on new data. However, some settings cannot be learned from the training data at all and must be chosen before training begins. These are known as hyperparameters, and they play a significant role in determining the model's performance. Choosing appropriate hyperparameters is one of the most important and time-consuming parts of building a machine learning system, because even a well-designed architecture can fail if its hyperparameters are poorly configured.
Yoshua Bengio described the learning rate as "the single most important hyper-parameter" for gradient-based training of deep architectures in his widely cited 2012 guide, and that observation extends to a broader principle: a small number of hyperparameters typically account for most of the variation in model performance. Understanding which hyperparameters matter, how they interact, and how to search for good values efficiently is a core practical skill in applied machine learning.
Hyperparameters are configuration variables that are set before the training process begins and control the behavior of the learning algorithm. Unlike regular parameters (such as weights and biases), which are learned from data during training, hyperparameters must be specified by the practitioner and remain fixed throughout the training process. They govern how the model learns rather than what the model learns.
More formally, a hyperparameter is any setting whose value is used to control the learning process itself. Hyperparameters are external to the model in the sense that their optimal values cannot be estimated from the training data alone. Instead, they are typically chosen through experimentation, domain expertise, or automated search procedures.
Hyperparameters can be divided into two broad categories:
- Model hyperparameters, which determine the structure and capacity of the model itself, such as the number of hidden layers in a neural network or the maximum depth of a decision tree.
- Algorithm (or training) hyperparameters, which control how the learning process runs, such as the learning rate, batch size, or number of epochs.
Hyperparameters can influence a wide range of model behaviors, including the complexity of the model, how quickly it learns, how well it generalizes to unseen data, and how long training takes. Because of this broad influence, hyperparameter selection is closely tied to the bias-variance tradeoff: hyperparameters that increase model complexity tend to reduce bias but increase variance, while those that constrain the model tend to have the opposite effect.
The distinction between parameters and hyperparameters is fundamental in machine learning. Parameters are the internal variables of the model that are learned directly from the training data through optimization algorithms like gradient descent. Hyperparameters, by contrast, are set externally and dictate how the learning process operates.
| Aspect | Parameter | Hyperparameter |
|---|---|---|
| Definition | Internal variable learned from data | External configuration set before training |
| When set | During training (learned automatically) | Before training (set by the practitioner) |
| Source of value | Estimated from the training data | Chosen via experimentation, heuristics, or search |
| Role | Captures patterns and relationships in the data | Controls the learning process and model complexity |
| Examples | Weights in a neural network, coefficients in linear regression, support vector coefficients in an SVM | Learning rate, number of hidden layers, regularization strength, batch size |
| Updated during training | Yes, via backpropagation or other optimization | No, remains fixed for a given training run |
| Present in prediction | Yes, parameters define the final model | No, hyperparameters are not part of the trained model |
| Number | Can be millions or billions (e.g., deep learning models) | Typically a handful to a few dozen |
| Optimization method | Gradient-based (e.g., gradient descent, Adam) | Derivative-free search (grid, random, Bayesian) |
A simple way to remember the distinction: parameters are what the model learns; hyperparameters are what you tell the model before it starts learning. Another useful framing is that parameters are optimized with respect to the training loss, while hyperparameters are optimized with respect to validation performance.
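The distinction is easy to see in code. The hedged sketch below uses scikit-learn's LogisticRegression purely as an illustration: the regularization strength C is a hyperparameter passed in before training, while the coefficients in coef_ are parameters learned by fit.

```python
# Minimal sketch of the parameter/hyperparameter split using scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Hyperparameter: chosen by the practitioner before training starts.
model = LogisticRegression(C=1.0, max_iter=5000)

# Parameters: learned from the data during fit().
model.fit(X, y)
print(model.coef_.shape)        # learned weights (parameters)
print(model.intercept_)         # learned bias (parameter)
print(model.get_params()["C"])  # the hyperparameter, unchanged by training
```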
Different families of machine learning algorithms expose different sets of hyperparameters. Below is a survey of the most commonly tuned hyperparameters organized by model type.
Neural networks have a large number of hyperparameters that interact in complex ways. Tuning them effectively requires both understanding their individual effects and recognizing how they influence each other.
| Hyperparameter | Description | Typical range | Effect |
|---|---|---|---|
| Learning rate | Step size for weight updates during gradient descent | 1e-5 to 1e-1 | Too high causes divergence; too low causes slow convergence or getting stuck in poor local minima |
| Batch size | Number of training samples processed before each weight update | 16 to 512 | Smaller batches add regularizing noise and may generalize better; larger batches enable faster computation but may converge to sharper minima |
| Number of epochs | Number of complete passes through the training dataset | 10 to 1000+ | Too few epochs lead to underfitting; too many lead to overfitting |
| Optimizer choice | Algorithm used to update weights (SGD, Adam, AdamW, RMSProp) | Categorical | Different optimizers suit different problems; Adam is a common default for its adaptive learning rates |
| Weight decay | L2 regularization penalty on weight magnitudes | 1e-6 to 1e-2 | Prevents weights from growing too large, reducing overfitting; interacts with learning rate |
| Dropout rate | Fraction of neurons randomly disabled during training | 0.0 to 0.5 | Higher values provide stronger regularization but may discard useful information; too low may not prevent overfitting |
| Number of layers | Depth of the network (number of hidden layers) | 1 to 100+ | Deeper networks can represent more complex functions but are harder to train and more prone to overfitting |
| Hidden units per layer | Width of each hidden layer | 32 to 4096 | More units increase the representational capacity but also increase computational cost and overfitting risk |
| Activation function | Non-linearity applied after each layer (ReLU, GELU, Tanh) | Categorical | Affects gradient flow and model expressiveness; ReLU is a widely used default |
Leslie Smith's 2018 paper "A Disciplined Approach to Neural Network Hyper-Parameters" provides practical guidance on tuning learning rate, batch size, momentum, and weight decay for neural networks. A related, well-known heuristic is the linear scaling rule (popularized by Goyal et al., 2017): when the batch size is multiplied by a factor k, the learning rate should also be multiplied by k to maintain similar training dynamics.
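As a hedged illustration of the linear scaling rule, the function below simply rescales a baseline recipe; the function name and the baseline values are arbitrary choices, not from the cited papers.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linear scaling rule: scale the learning rate by the same factor as the batch size."""
    return base_lr * (new_batch_size / base_batch_size)

# A recipe tuned at batch size 256 with learning rate 0.1, moved to batch size 1024:
print(scale_learning_rate(0.1, 256, 1024))  # 0.4
```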
Convolutional neural networks (CNNs) inherit all the general neural network hyperparameters listed above and add several architecture-specific ones related to their convolutional layers.
| Hyperparameter | Description | Typical range | Effect |
|---|---|---|---|
| Kernel size | Height and width of the convolutional filter | 1x1 to 7x7 | Larger kernels capture wider spatial context but add more parameters; 3x3 is the most common choice since Simonyan and Zisserman (2015) showed that stacking small filters is more efficient than using large ones |
| Stride | Step size of the filter as it moves across the input | 1 to 3 | Larger strides reduce spatial dimensions more aggressively, lowering computation at the cost of spatial resolution |
| Padding | Number of pixels added around the input borders | 0 (valid) or same | "Same" padding preserves spatial dimensions; "valid" padding reduces them |
| Number of filters | Number of distinct filters (channels) in each convolutional layer | 16 to 1024 | More filters allow the network to learn more feature maps but increase memory and computation |
| Pooling size and type | Spatial downsampling operation (max pooling, average pooling) | 2x2 or 3x3 | Reduces spatial dimensions and provides translation invariance; max pooling is the most common default |
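A hedged Keras sketch showing where these CNN hyperparameters appear in code; the specific values are illustrative, not recommendations.

```python
# Illustrative CNN: kernel size, stride, padding, filter count, and pooling
# are all architecture hyperparameters fixed before training.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(filters=32, kernel_size=3, strides=1, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(filters=64, kernel_size=3, strides=1, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),                      # dropout rate: a regularization hyperparameter
    layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),  # learning rate and optimizer choice
    loss="sparse_categorical_crossentropy",
)
```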
Tree-based models, including decision trees, random forests, and gradient boosting methods, have their own distinct set of hyperparameters.
| Hyperparameter | Description | Typical range | Effect |
|---|---|---|---|
| max_depth | Maximum depth of each tree | 3 to 20 | Deeper trees can fit more complex patterns but are more prone to overfitting |
| min_samples_split | Minimum number of samples required to split an internal node | 2 to 20 | Higher values constrain the tree, acting as regularization |
| min_samples_leaf | Minimum number of samples required in a leaf node | 1 to 20 | Higher values smooth the model and prevent learning noise |
| n_estimators | Number of trees in the ensemble (for forests and boosting) | 50 to 5000 | More trees generally improve performance up to a point, then plateau |
| max_features | Number or fraction of features considered for each split | sqrt(n), log2(n), or a fraction | Lower values increase diversity among trees and can reduce overfitting |
| learning_rate (boosting) | Shrinkage factor applied to each tree's contribution | 0.001 to 0.3 | Lower values require more trees but often yield better generalization |
| subsample (boosting) | Fraction of training samples used per tree | 0.5 to 1.0 | Introduces stochasticity; values below 1.0 can reduce overfitting |
For random forests, the n_estimators hyperparameter generally shows diminishing returns: performance improves sharply with more trees initially but plateaus after a certain point. A practical rule of thumb is to increase n_estimators until the validation error stops improving. The max_depth and min_samples_leaf hyperparameters are more critical for controlling overfitting.
For gradient boosting, the interaction between learning_rate and n_estimators is particularly important. A lower learning rate requires more estimators to achieve the same training loss but typically produces a model that generalizes better. Trees in gradient boosting are usually kept shallow (3 to 8 levels) compared to the deep trees used in random forests.
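The sketch below illustrates this interaction with scikit-learn's GradientBoostingClassifier; the two configurations and the dataset are illustrative, and in practice the comparison would be made against a held-out validation set.

```python
# Two boosting configurations trading learning_rate against n_estimators.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

fast = GradientBoostingClassifier(learning_rate=0.3, n_estimators=100, max_depth=3)
slow = GradientBoostingClassifier(learning_rate=0.03, n_estimators=1000, max_depth=3)

# The low-learning-rate model needs many more trees but often generalizes better.
print(cross_val_score(fast, X, y, cv=5).mean())
print(cross_val_score(slow, X, y, cv=5).mean())
```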
Support vector machines have a small but impactful set of hyperparameters.
| Hyperparameter | Description | Typical range | Effect |
|---|---|---|---|
| C (regularization) | Penalty parameter for misclassified training examples | 1e-3 to 1e3 | Low C allows a wider margin with more misclassifications (higher bias, lower variance); high C enforces a narrower margin with fewer misclassifications (lower bias, higher variance) |
| Kernel | Function used to map data into a higher-dimensional space (linear, RBF, polynomial, sigmoid) | Categorical | Determines the type of decision boundary; RBF (radial basis function) is a common default for non-linear problems |
| Gamma (RBF kernel) | Defines the influence radius of a single training example | 1e-4 to 1e1, or 'scale' | Low gamma means each point has a broad influence (smoother boundary); high gamma means each point has a narrow influence (more complex boundary, risk of overfitting) |
| Degree (polynomial kernel) | Degree of the polynomial kernel function | 2 to 5 | Higher degree allows more complex boundaries but increases computation and overfitting risk |
When tuning SVMs, it is common to search over C and gamma jointly using a logarithmic grid (e.g., C in {0.01, 0.1, 1, 10, 100} and gamma in {0.001, 0.01, 0.1, 1}). Feature scaling is essential before training an SVM, because the algorithm is sensitive to the magnitude of input features.
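A hedged scikit-learn sketch of this joint search, with the feature scaling step included; the dataset is used only for illustration.

```python
# Joint logarithmic grid over C and gamma, with scaling inside the pipeline
# so each cross-validation fold is scaled independently.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
param_grid = {
    "svc__C": [0.01, 0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```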
Other machine learning algorithms also have important hyperparameters:
- k-nearest neighbors: the number of neighbors k and the choice of distance metric.
- k-means clustering: the number of clusters k and the initialization scheme.
- Ridge and lasso regression: the regularization strength (alpha).
- Logistic regression: the regularization strength (C) and the penalty type (L1 or L2).
Hyperparameter tuning (also called hyperparameter optimization) is the process of finding the combination of hyperparameter values that produces the best model performance on a validation set. Several strategies exist, ranging from simple exhaustive approaches to sophisticated model-based methods.
Manual tuning is the most basic approach, where the practitioner adjusts hyperparameters by hand based on intuition, experience, and observed results. Although it may seem outdated, manual tuning remains common in practice, especially for initial exploration. Experienced practitioners often know reasonable starting ranges for common hyperparameters and can narrow down the search space before applying automated methods. The main drawback is that manual tuning does not scale well to high-dimensional hyperparameter spaces and is difficult to reproduce.
Grid search is the most straightforward automated approach to hyperparameter tuning. The practitioner defines a finite set of values for each hyperparameter, and the algorithm evaluates every possible combination. For example, if there are two hyperparameters with 5 values each, grid search evaluates all 25 combinations.
Advantages:
- Simple to implement, understand, and reproduce.
- Exhaustive within the defined grid, so the best combination on the grid is guaranteed to be found.
- Trivially parallelizable, since every evaluation is independent.
Disadvantages:
- The number of combinations grows exponentially with the number of hyperparameters.
- Many evaluations are wasted varying hyperparameters that have little effect on performance.
- Only the discrete values chosen in advance are ever tried; good values between grid points are missed.
Grid search is implemented in scikit-learn as GridSearchCV, which combines the exhaustive search with cross-validation to produce robust performance estimates.
Random search samples hyperparameter values randomly from specified distributions rather than evaluating a fixed grid. Bergstra and Bengio demonstrated in their influential 2012 paper "Random Search for Hyper-Parameter Optimization" (published in the Journal of Machine Learning Research, volume 13, pages 281-305) that random search is significantly more efficient than grid search, especially when only a few hyperparameters actually matter for model performance.
The key insight is that in most machine learning problems, only a small subset of hyperparameters have a large effect on performance. Grid search wastes many evaluations varying hyperparameters that do not matter, while random search explores a wider range of values for the important hyperparameters by distributing samples across the full space.
Advantages:
- For a fixed budget, it tries more distinct values of each individual hyperparameter than a grid, which matters when only a few hyperparameters are important.
- The budget (number of samples) can be set independently of the number of hyperparameters.
- Trivially parallelizable, and more samples can be added at any time without discarding earlier results.
Disadvantages:
- Does not use the results of previous evaluations to guide the search.
- With a small budget it can miss narrow regions of good performance.
- Results vary between runs unless the random seed is fixed.
Random search is implemented in scikit-learn as RandomizedSearchCV.
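A hedged sketch of random search in scikit-learn, sampling the continuous hyperparameters from log-uniform distributions; the estimator, dataset, and ranges are illustrative.

```python
# Random search over a random forest's main hyperparameters.
from scipy.stats import loguniform, randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_distributions = {
    "n_estimators": randint(50, 500),
    "max_depth": randint(3, 20),
    "max_features": loguniform(0.1, 1.0),   # fraction of features per split
}
search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions,
    n_iter=50,          # the budget is independent of the number of hyperparameters
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```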
Bayesian optimization is a sequential, model-based approach that builds a probabilistic surrogate model of the objective function (typically validation performance as a function of hyperparameters) and uses it to decide which configurations to evaluate next. The foundational work by Snoek, Larochelle, and Adams (2012), "Practical Bayesian Optimization of Machine Learning Algorithms," published at NeurIPS, demonstrated that this approach can match or exceed expert-level tuning with far fewer evaluations.
The process works as follows:
1. Evaluate a small number of initial configurations, often chosen at random.
2. Fit a surrogate model (commonly a Gaussian process) to the observed pairs of configurations and validation scores.
3. Use an acquisition function (such as expected improvement) to select the next configuration, trading off exploration of uncertain regions against exploitation of promising ones.
4. Evaluate the selected configuration, add the result to the observations, and repeat from step 2 until the budget is exhausted.
An alternative to Gaussian processes is the Tree-structured Parzen Estimator (TPE), which models the conditional probability of the hyperparameters given good and bad performance separately. TPE is used by Optuna and Hyperopt and tends to scale better to higher-dimensional spaces than Gaussian process-based approaches.
Advantages:
- Sample-efficient: it typically finds good configurations in far fewer evaluations than grid or random search.
- Uses all previous evaluations to guide the search, which is especially valuable when each evaluation is expensive.
Disadvantages:
- Inherently sequential, so parallelizing it requires batch or asynchronous variants.
- Gaussian process surrogates scale poorly with the number of evaluations and the dimensionality of the search space.
- More complex to implement and configure than grid or random search.
Popular implementations include Spearmint, SMAC (Sequential Model-based Algorithm Configuration), and the TPE used in Optuna and Hyperopt.
Hyperband, introduced by Li, Jamieson, DeSalvo, Rostamizadeh, and Talwalkar in their 2018 JMLR paper "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization," takes a fundamentally different approach. Instead of trying to be smarter about which configurations to evaluate, Hyperband focuses on being smarter about how much resource (such as training epochs or data subset size) to allocate to each configuration.
Hyperband is built on the Successive Halving Algorithm (SHA):
1. Sample n configurations and allocate a small budget (for example, a few training epochs) to each.
2. Evaluate all configurations on that budget and keep only the best fraction (for example, the top third).
3. Multiply the budget for the survivors and repeat until only one or a few configurations remain, trained on the full budget.
Hyperband improves on Successive Halving by running it multiple times with different tradeoffs between the number of initial configurations and the minimum resource per configuration. This addresses the uncertainty in how aggressively to prune early.
Advantages:
- Allocates most of the compute to promising configurations, making it far cheaper than evaluating every configuration on the full budget.
- Requires no surrogate model and has few settings of its own.
- Easily parallelizable, with strong anytime performance.
Disadvantages:
- Configurations are still sampled randomly, so nothing learned about good regions of the search space carries over between samples.
- Assumes that performance on a small budget is predictive of performance on the full budget, which does not hold for every hyperparameter (a small learning rate, for example, can look poor early yet win in the end).
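A minimal sketch of the Successive Halving core that Hyperband builds on; `evaluate` is a hypothetical function that trains a configuration for the given budget and returns a validation score, and the toy objective below is made up purely for illustration.

```python
import random

def successive_halving(sample_config, evaluate, n=27, min_budget=1, eta=3):
    """Keep the top 1/eta of configurations at each rung, multiplying the budget by eta."""
    configs = [sample_config() for _ in range(n)]
    budget = min_budget
    while len(configs) > 1:
        scores = [(evaluate(c, budget), c) for c in configs]
        scores.sort(key=lambda pair: pair[0], reverse=True)       # higher score is better
        configs = [c for _, c in scores[: max(1, len(configs) // eta)]]
        budget *= eta
    return configs[0]

# Toy usage: learning rates near 0.05 score best, and noise shrinks as the budget grows.
best = successive_halving(
    sample_config=lambda: {"lr": 10 ** random.uniform(-5, -1)},
    evaluate=lambda c, b: -abs(c["lr"] - 0.05) + random.gauss(0, 0.01 / b),
)
print(best)
```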
Population-Based Training (PBT), introduced by Jaderberg et al. at DeepMind in 2017, combines hyperparameter optimization with training by maintaining a population of models trained in parallel. Periodically, underperforming models copy weights and hyperparameters from better-performing models and apply random perturbations. This allows hyperparameters to change during training rather than being fixed, which is especially valuable for hyperparameters whose optimal values shift over the course of training (such as learning rate).
PBT discovers a schedule of hyperparameter settings rather than a single fixed configuration. This is a meaningful distinction from other tuning methods, because research has shown that the optimal learning rate, for instance, often changes as training progresses. PBT effectively learns this schedule adaptively from the population dynamics.
Neural architecture search (NAS) extends the concept of hyperparameter optimization to the model architecture itself. Rather than choosing from a predefined set of architectures, NAS algorithms automatically search for the optimal network topology, including the number of layers, types of operations, and connectivity patterns.
NAS methods can be categorized by their search strategy:
- Reinforcement learning methods, in which a controller is trained to propose architectures.
- Evolutionary methods, which mutate and recombine a population of candidate architectures.
- Gradient-based (differentiable) methods such as DARTS, which relax the discrete architecture choice into a continuous one so it can be optimized with gradient descent.
- One-shot and weight-sharing methods, which train a single over-parameterized supernetwork and evaluate candidate sub-architectures within it.
NAS can be viewed as optimizing a very complex hyperparameter (the architecture graph) and is sometimes combined with standard hyperparameter tuning in a joint optimization pipeline.
ASHA extends Successive Halving to work efficiently in distributed and asynchronous settings. In standard SHA, all configurations in a bracket must finish before any can be promoted or pruned. ASHA removes this synchronization barrier: as soon as a worker finishes evaluating a configuration, it checks whether that configuration can be promoted to the next rung. If not, the worker begins evaluating a new random configuration.
This asynchronous design dramatically improves resource utilization in parallel computing environments, where different configurations may take different amounts of time to train. ASHA was described in the 2020 paper "A System for Massively Parallel Hyperparameter Tuning" by Li et al.
| Method | Guided by prior evaluations | Handles early stopping | Parallelizable | Best for |
|---|---|---|---|---|
| Grid search | No | No | Easily | Small search spaces (1-2 hyperparameters) |
| Random search | No | No | Easily | Initial exploration, moderate search spaces |
| Bayesian optimization | Yes | No (unless combined) | With batch variants | Fine-tuning after narrowing the search space |
| Hyperband | No | Yes | Easily | Large search spaces with expensive evaluations |
| BOHB (Bayesian optimization + Hyperband) | Yes | Yes | Yes | General-purpose, when compute is limited |
| PBT | Yes (population) | Implicit | Yes (population) | Long training runs with schedule-sensitive hyperparameters |
| NAS | Varies | Varies | Varies | Architecture design when compute is available |
Cross-validation is the standard statistical technique for evaluating hyperparameter configurations. Rather than evaluating a single train/validation split (which can be noisy), k-fold cross-validation partitions the data into k subsets (folds), trains on k-1 folds, validates on the remaining fold, and rotates through all k choices. The average validation score across folds provides a more robust estimate of generalization performance.
A critical concern in hyperparameter optimization is overfitting the validation set. When many hyperparameter combinations are evaluated against the same validation data, the selected configuration may exploit idiosyncrasies of that specific data partition rather than reflecting true generalization ability. Nested cross-validation addresses this by using an inner loop for hyperparameter selection and an outer loop for performance estimation, ensuring that the test data used for final evaluation is never seen during hyperparameter tuning.
In practice, the choice of k involves a tradeoff. Larger k (e.g., 10 or leave-one-out) gives lower-bias estimates but is computationally expensive, while smaller k (e.g., 3 or 5) is faster but noisier. For large datasets, a single held-out validation set is often sufficient because the validation estimate is already stable.
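A hedged sketch of nested cross-validation in scikit-learn: the inner GridSearchCV selects hyperparameters, while the outer cross_val_score estimates the performance of the whole tuning procedure. The estimator, grid, and dataset are illustrative.

```python
# Nested cross-validation: hyperparameter selection never sees the outer test folds.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

inner_search = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]},
    cv=3,                                                   # inner loop: hyperparameter selection
)
outer_scores = cross_val_score(inner_search, X, y, cv=5)    # outer loop: unbiased estimate
print(outer_scores.mean())
```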
Several software frameworks and libraries have been developed to make hyperparameter optimization more accessible and scalable.
| Tool | Developer | Key features | Search algorithms | Language |
|---|---|---|---|---|
| Optuna | Preferred Networks | Define-by-run API, pruning of unpromising trials, visualization dashboard | TPE, CMA-ES, Grid, Random | Python |
| Ray Tune | Anyscale | Distributed execution across clusters, integrates with many frameworks | Supports Optuna, HyperOpt, Bayesian, PBT, ASHA | Python |
| W&B Sweeps | Weights & Biases | Integrated experiment tracking, collaborative visualization, cloud-based | Bayesian, Grid, Random | Python |
| Keras Tuner | Google / Keras | Tight integration with Keras and TensorFlow, built-in tuners | Bayesian, Hyperband, Random | Python |
| Hyperopt | James Bergstra et al. | Mature library, supports MongoDB for distributed trials | TPE, Random, Adaptive TPE | Python |
| SMAC | AutoML Freiburg | Strong on combinatorial and conditional spaces | Random Forest-based Bayesian | Python |
Optuna has gained significant popularity since its introduction by Akiba et al. in 2019. Its define-by-run API allows the search space to be constructed dynamically within the objective function, which makes it easy to define conditional hyperparameters (for example, only tuning gamma when the kernel is set to RBF). Optuna also features built-in pruning of unpromising trials using algorithms like Median Pruning and Successive Halving, which stops training early if intermediate results look poor.
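A hedged sketch of Optuna's define-by-run style for the conditional-space example mentioned above; the dataset and ranges are illustrative.

```python
# gamma is only part of the search space when the sampled kernel is RBF.
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    kernel = trial.suggest_categorical("kernel", ["linear", "rbf"])
    C = trial.suggest_float("C", 1e-3, 1e3, log=True)
    if kernel == "rbf":
        gamma = trial.suggest_float("gamma", 1e-4, 1e1, log=True)
        model = SVC(kernel=kernel, C=C, gamma=gamma)
    else:
        model = SVC(kernel=kernel, C=C)
    return cross_val_score(model, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")  # uses the TPE sampler by default
study.optimize(objective, n_trials=50)
print(study.best_params)
```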
Ray Tune focuses on scalability, enabling distributed hyperparameter searches across hundreds of machines. It acts as a unified interface for multiple search algorithms and scheduling strategies, including ASHA, PBT, and Bayesian optimization via Optuna or HyperOpt backends.
Weights & Biases Sweeps integrates hyperparameter optimization with experiment tracking, making it easy to visualize how different hyperparameter combinations affect metrics over time. Its collaborative features are well suited for team-based machine learning projects.
Keras Tuner is designed for practitioners working within the Keras ecosystem and provides a simple interface for common tuning workflows, including Hyperband scheduling.
The learning rate is often considered the single most important hyperparameter in deep learning (Bengio, 2012). Rather than keeping the learning rate fixed throughout training, practitioners commonly adjust it over time using a learning rate schedule. This technique can significantly improve both convergence speed and final model performance.
Step Decay: The learning rate is reduced by a fixed factor at predetermined intervals. For example, the learning rate might be multiplied by 0.1 every 30 epochs. This is one of the simplest schedules and was widely used in early convolutional neural network research.
Exponential Decay: The learning rate decreases exponentially over time according to the formula lr(t) = lr_0 * e^(-kt), where k is the decay rate. This produces a smooth, continuous reduction.
Cosine Annealing: The learning rate follows a cosine curve from its initial value down to a minimum (often near zero). First proposed by Loshchilov and Hutter in 2016, cosine annealing decreases the learning rate slowly at first, then more rapidly in the middle of training, and slowly again near the end. This schedule has become popular for training transformers and vision models.
Warmup: Training begins with a very small learning rate that linearly increases to the target value over a set number of steps or epochs. Warmup helps stabilize training in the early stages when gradients can be large and erratic, particularly for models with batch normalization or large batch sizes. Most modern transformer training recipes combine a warmup phase with a subsequent decay schedule.
Warmup + Cosine Annealing: A common combined schedule, especially in transformer training, starts with linear warmup over a few thousand steps followed by cosine decay for the remainder of training. This combination has become a de facto standard for large language model pre-training.
Cyclical Learning Rates: Proposed by Leslie Smith in 2017, cyclical learning rates oscillate between a minimum and maximum value in a triangular or other repeating pattern. The intuition is that periodically increasing the learning rate can help the optimizer escape sharp local minima and find flatter, more generalizable solutions.
One-Cycle Policy: Also proposed by Smith, this schedule uses a single cycle of learning rate increase followed by decrease over the entire training run. It often achieves faster convergence and better final performance than fixed schedules, a phenomenon Smith termed "super-convergence."
The optimal learning rate schedule depends on the model architecture, dataset size, and training budget. For most modern deep learning tasks, warmup combined with cosine annealing provides a strong baseline. For quick experiments or smaller models, step decay remains effective and easy to configure.
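A hedged sketch of a warmup-plus-cosine schedule as a plain Python function; the step counts and rates in the example are placeholders.

```python
import math

def warmup_cosine_lr(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr over the remaining steps."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: peak learning rate 3e-4, 1,000 warmup steps, 100,000 total steps.
for s in [0, 500, 1000, 50000, 100000]:
    print(s, warmup_cosine_lr(s, 3e-4, 1000, 100000))
```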
Not all hyperparameters are equally important. Research by Hutter, Hoos, and Leyton-Brown (2014) on functional ANOVA decomposition showed that in many machine learning algorithms, most of the performance variation can be attributed to just a few hyperparameters. This observation has practical implications: practitioners should identify and focus their tuning effort on the hyperparameters that matter most for their specific problem.
For neural networks, the learning rate is consistently the most influential hyperparameter. Probst, Boulesteix, and Bischl (2019) conducted a large-scale study of hyperparameter tunability across many algorithms and found that the relative importance of hyperparameters varies by algorithm family but is often quite concentrated. For example, in gradient boosting, the learning rate and number of estimators tend to dominate; in random forests, the number of features considered at each split (max_features) and minimum samples per leaf are most consequential.
Sensitivity analysis can be performed in several ways:
- Varying one hyperparameter at a time around a known good configuration and recording the change in validation performance.
- Plotting validation performance against each hyperparameter using the results of a random search.
- Functional ANOVA and related decompositions, which attribute the variance in performance to individual hyperparameters and their interactions.
For LSTMs specifically, Greff et al. (2017) performed a large-scale hyperparameter study and found that learning rate and network size were the most critical hyperparameters, while batch size and momentum had relatively little impact on performance.
A long-standing goal in machine learning is to reduce the cost of hyperparameter tuning by transferring knowledge from previous experiments. There are several approaches to this problem.
When tuning hyperparameters for a new task that is similar to a previously solved task, the hyperparameter configurations that worked well on the old task can serve as strong starting points. Meta-learning approaches formalize this idea by training a model (or building a database) that predicts good hyperparameter configurations based on dataset characteristics. For example, Auto-sklearn uses meta-learning to warm-start Bayesian optimization by initializing it with configurations that performed well on similar datasets.
One of the most significant recent developments in hyperparameter transfer is the Maximal Update Parameterization (muP), introduced by Yang and Hu in their 2022 paper "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer." muP addresses a specific and expensive problem: when scaling a neural network to a larger size, hyperparameters (especially the learning rate and initialization scale) that worked at the smaller size often do not transfer, requiring costly re-tuning at each new scale.
muP re-parameterizes the model so that activation scales remain consistent across different model widths during training. Under this parameterization, the optimal learning rate, initialization variance, and other key hyperparameters remain stable as the model width increases. This enables a workflow called muTransfer:
1. Parameterize the model family with muP.
2. Tune hyperparameters on a small, inexpensive proxy version of the model.
3. Transfer the tuned hyperparameters directly to the full-size model and train it once, without further tuning at scale.
In experiments on GPT-3 scale models, muTransfer used only about 7% of pre-training compute for tuning and achieved performance comparable to a model twice its size that was tuned conventionally. This represents roughly an order of magnitude in compute savings for hyperparameter search.
muP has been adopted or studied by several organizations training large language models, and an improved variant called u-muP (unit-scaled muP) has been proposed to further simplify the parameterization by combining muP with unit scaling, ensuring that activations, weights, and gradients all begin training at a scale of one.
Hyperparameters affect virtually every aspect of model performance.
Convergence speed: The learning rate, optimizer, and batch size directly determine how quickly the model's loss function decreases during training. Poorly chosen values can cause training to stall, oscillate, or diverge entirely.
Generalization: Hyperparameters like dropout rate, weight decay, and regularization strength control how well the model performs on unseen data. These regularization hyperparameters help manage the overfitting/underfitting tradeoff.
Model capacity: The number of layers, hidden units, and (for tree models) maximum depth determine the complexity of functions the model can represent. Insufficient capacity leads to underfitting; excessive capacity leads to overfitting when not paired with adequate regularization.
Training stability: Certain hyperparameter combinations can cause training instabilities such as exploding or vanishing gradients, mode collapse in generative models, or numerical overflow. Techniques like learning rate warmup, gradient clipping, and careful initialization are hyperparameter-driven solutions to these problems.
Computational cost: Larger batch sizes, more layers, and longer training schedules all increase the computational resources required. In practice, hyperparameter selection often involves a tradeoff between model quality and compute budget.
Research has repeatedly shown that hyperparameter optimization can matter as much as, or more than, architectural changes: a well-tuned simple model often outperforms a poorly tuned complex model.
The following guidelines can help practitioners tune hyperparameters more effectively:
Start with established defaults. Most frameworks and papers provide recommended default values. Begin with these and adjust based on validation performance rather than starting from scratch.
Tune the most important hyperparameters first. For neural networks, the learning rate almost always has the largest effect on performance. Focus on it first before tuning other hyperparameters. For tree-based models, max_depth and learning_rate (for boosting) are typically the highest-priority hyperparameters.
Use logarithmic scales for continuous hyperparameters. Learning rate, regularization strength, and weight decay should be searched on a log scale (e.g., 1e-5, 1e-4, 1e-3, 1e-2) rather than a linear scale, because their effect is roughly proportional to their order of magnitude.
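For example, a log-spaced candidate set can be generated directly; the small snippet below uses NumPy and is purely illustrative.

```python
import numpy as np

# Five learning-rate candidates evenly spaced in order of magnitude: 1e-5 ... 1e-1.
log_candidates = np.logspace(-5, -1, num=5)
print(log_candidates)      # [1.e-05 1.e-04 1.e-03 1.e-02 1.e-01]

# A linear grid over the same interval would concentrate almost all points near 0.1.
linear_candidates = np.linspace(1e-5, 1e-1, num=5)
print(linear_candidates)
```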
Always use a separate validation set. Evaluate hyperparameter combinations on a held-out validation set, not the training set. For small datasets, use k-fold cross-validation to get a more reliable estimate of generalization performance.
Prefer random search over grid search for initial exploration. As Bergstra and Bengio (2012) showed, random search is more efficient when only a subset of hyperparameters are truly important, which is the typical case.
Graduate to Bayesian optimization for fine-tuning. After random search narrows the promising region, Bayesian optimization can efficiently zoom in on the best values with fewer evaluations.
Use early stopping. Monitor validation performance during training and stop when it begins to degrade. This acts as an implicit regularizer and saves computation. Many frameworks (including Optuna) support pruning of unpromising trials based on intermediate results.
Document and track experiments. Record every hyperparameter configuration and its corresponding performance. Tools like Weights & Biases, MLflow, and TensorBoard make this easier and enable retrospective analysis.
Consider computational budget. The choice of tuning method should reflect the available resources. Grid search may be acceptable for one or two hyperparameters, but random search or Bayesian optimization is more appropriate for larger search spaces.
Be aware of hyperparameter interactions. Some hyperparameters interact strongly. For example, learning rate and batch size are closely coupled, and learning rate and weight decay jointly affect regularization. Tuning them independently can miss the optimal combination.
Watch for validation set overfitting. When evaluating many configurations against the same validation data, the selected hyperparameters can overfit to that specific split. Use nested cross-validation or a separate test set to get an unbiased performance estimate.
Imagine you are teaching a robot how to play a game, but first it needs to know the rules it should follow. These instructions serve as hyperparameters in machine learning: they dictate how the robot should act and react when playing the game.
For example, you could instruct the robot to take one step at a time or two steps at once. This is similar to setting the learning rate hyperparameter in a machine learning model: it tells the model how much to adjust its predictions based on new data it encounters.
You could also instruct the robot to pay close attention to the other players or to focus solely on itself. This is similar to setting a regularization strength hyperparameter, which controls how closely the model is allowed to fit the training data so that it does not overfit.
The robot does not get to choose these rules itself. You have to decide them before the game starts. If you pick bad rules, the robot will play poorly no matter how many games it practices. If you pick good rules, the robot will learn quickly and play well. Finding the best rules is what hyperparameter tuning is all about.