Hyperparameter tuning (also called hyperparameter optimization or hyperparameter search) is the process of finding the optimal configuration parameters for a machine learning model. Unlike model parameters (such as weights and biases) that are learned automatically during training, hyperparameters are set before training begins and control aspects of the learning process itself, including network architecture, optimization behavior, and regularization strength.
Choosing good hyperparameters is critical because they directly influence a model's ability to learn effectively. A learning rate that is too high causes training to diverge; one that is too low makes training painfully slow or leaves it trapped in poor local minima. Similarly, the wrong combination of batch size, regularization, and architectural choices can be the difference between a model that generalizes well and one that overfits or underfits [1].
The following table lists the most frequently tuned hyperparameters across deep learning models:
| Hyperparameter | Typical Range | Controls | Impact |
|---|---|---|---|
| Learning rate | 1e-5 to 1e-1 | Step size for weight updates during gradient descent | Most influential single hyperparameter; wrong values cause divergence or stagnation |
| Batch size | 8 to 4096+ | Number of training examples per gradient update | Affects training speed, memory usage, and generalization; larger batches need higher learning rates |
| Number of layers (depth) | 2 to 100+ | Depth of the network architecture | Deeper networks can learn more complex functions but are harder to train and prone to vanishing gradients |
| Hidden size (width) | 64 to 16384 | Dimensionality of hidden representations | Wider layers increase model capacity but also parameter count and compute |
| Dropout rate | 0.0 to 0.5 | Fraction of neurons randomly deactivated during training | Regularization technique; too high reduces model capacity, too low allows overfitting |
| Weight decay | 1e-6 to 1e-1 | L2 regularization penalty on weights | Prevents weights from growing too large; acts as implicit regularization |
| Number of attention heads | 1 to 128 | Parallel attention computations in transformer models | More heads allow attending to different representation subspaces |
| Warmup steps | 0 to 10000 | Gradual increase of learning rate at training start | Stabilizes early training, especially important for transformers |
| Optimizer choice | Adam, AdamW, SGD, etc. | Algorithm for updating weights | Different optimizers suit different problems; AdamW is standard for LLMs |
| Sequence length | 128 to 128000+ | Maximum input length for sequence models | Longer sequences capture more context but require quadratically more memory for standard attention |
Hyperparameter tuning methods range from simple exhaustive approaches to sophisticated model-based optimization strategies. The evolution of these methods reflects the growing computational cost of evaluating each configuration.
Grid search is the simplest approach: define a discrete set of values for each hyperparameter and evaluate every possible combination. For example, if searching over 5 learning rates and 4 batch sizes, grid search evaluates all 20 combinations.
The main advantage is completeness within the defined grid. The main disadvantage is that the number of evaluations grows exponentially with the number of hyperparameters (the curse of dimensionality). With 6 hyperparameters each taking 5 values, grid search requires 15,625 evaluations, which is typically impractical for deep learning models that take hours or days to train.
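As an illustration, a grid like the one above can be enumerated with `itertools.product`; the `evaluate` function here is a hypothetical stand-in for a full training run:

```python
import itertools

# Hypothetical search space: 5 learning rates x 4 batch sizes = 20 combinations.
search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2],
    "batch_size": [16, 32, 64, 128],
}

def evaluate(config):
    # Stand-in for a real training run; returns a toy validation loss
    # minimized at lr=1e-3, batch_size=64.
    return abs(config["learning_rate"] - 1e-3) + abs(config["batch_size"] - 64) / 1000

keys = list(search_space)
grid = [dict(zip(keys, values)) for values in itertools.product(*search_space.values())]
best = min(grid, key=evaluate)
```

Adding a third hyperparameter with 5 values would multiply the grid to 100 evaluations, which is the exponential blow-up described above.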
Bergstra and Bengio demonstrated in their influential 2012 paper that random search is significantly more efficient than grid search for hyperparameter optimization [2]. Rather than evaluating a fixed grid, random search samples hyperparameter values from specified distributions (e.g., log-uniform for learning rate, uniform for dropout).
The key insight is that not all hyperparameters are equally important. In practice, model performance is often sensitive to only a few hyperparameters. Grid search wastes evaluations by exhaustively varying unimportant hyperparameters, while random search explores the important dimensions more thoroughly per trial. With the same computational budget, random search typically finds better configurations than grid search.
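A random search over the same kind of space takes only a few lines; note the log-uniform sampling of the learning rate and the toy objective standing in for training, where (as in the argument above) one dimension matters far more than the other:

```python
import math
import random

random.seed(0)

def sample_config():
    # Log-uniform for learning rate, uniform for dropout -- matching
    # the distributions suggested in the text.
    return {
        "learning_rate": 10 ** random.uniform(-5, -1),
        "dropout": random.uniform(0.0, 0.5),
    }

def evaluate(config):
    # Toy objective: performance depends strongly on the learning rate
    # (optimum at 1e-3) and only weakly on dropout.
    return (math.log10(config["learning_rate"]) + 3) ** 2 + 0.01 * config["dropout"]

trials = [sample_config() for _ in range(20)]
best = min(trials, key=evaluate)
```

Because every trial draws a fresh learning rate, 20 random trials probe 20 distinct values of the important dimension, whereas a 5 x 4 grid probes only 5.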
Bayesian optimization uses a probabilistic surrogate model to guide the search toward promising regions of the hyperparameter space. The approach maintains a model (typically a Gaussian process or Tree-structured Parzen Estimator) of the objective function and uses an acquisition function to decide which configuration to evaluate next [3].
The process works as follows:

1. Evaluate a small number of initial configurations (often randomly sampled) to seed the surrogate model.
2. Fit the surrogate model to all observed (configuration, score) pairs.
3. Maximize an acquisition function (such as expected improvement) to select the next configuration, balancing exploration of uncertain regions against exploitation of promising ones.
4. Evaluate the selected configuration, add the result to the observations, and repeat until the budget is exhausted.
Bayesian optimization is substantially more sample-efficient than random search because it learns from previous evaluations. However, the surrogate model itself has limitations: Gaussian processes scale poorly beyond a few thousand observations, and the method can struggle with very high-dimensional hyperparameter spaces.
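The Tree-structured Parzen Estimator variant of this idea can be sketched in pure Python. This is a deliberately simplified toy: a one-dimensional space, a fixed kernel bandwidth, and a toy objective standing in for training; real TPE implementations use adaptive bandwidths and more careful density models:

```python
import math
import random
import statistics

random.seed(0)

def objective(log_lr):
    # Toy objective with its optimum at log_lr = -3 (i.e., lr = 1e-3).
    return (log_lr + 3) ** 2

def parzen_density(x, points, bw=0.5):
    # Kernel density estimate: a Gaussian bump centred on each observed point.
    return sum(math.exp(-0.5 * ((x - p) / bw) ** 2) for p in points) / len(points)

# Warm-up: a few random evaluations before the model kicks in.
observations = [(log_lr, objective(log_lr))
                for log_lr in (random.uniform(-5, -1) for _ in range(5))]

for _ in range(20):
    losses = [loss for _, loss in observations]
    threshold = statistics.median(losses)
    good = [x for x, loss in observations if loss <= threshold]
    bad = [x for x, loss in observations if loss > threshold]
    # Sample candidates near good points and pick the one maximising the
    # density ratio l(x)/g(x) -- TPE's acquisition criterion.
    candidates = [random.gauss(random.choice(good), 0.5) for _ in range(32)]
    nxt = max(candidates,
              key=lambda x: parzen_density(x, good) / (parzen_density(x, bad) + 1e-12))
    observations.append((nxt, objective(nxt)))

best_log_lr, best_loss = min(observations, key=lambda t: t[1])
```

After 25 evaluations the search has concentrated around the optimum, whereas pure random search would keep sampling the whole interval uniformly.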
Hyperband addresses a different aspect of efficiency: rather than being smarter about which configurations to try, it is smarter about how much compute to allocate to each configuration [4]. The algorithm is based on the successive halving principle:

1. Allocate a small training budget to many randomly sampled configurations.
2. Evaluate them and keep only the best-performing fraction (e.g., the top half or third).
3. Increase the budget for the survivors and repeat until few configurations remain.
This approach exploits the observation that bad configurations can often be identified early, long before they have been trained to completion. Hyperband runs multiple rounds of successive halving with different initial budget allocations, hedging against the possibility that some good configurations are slow starters.
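One round of successive halving can be sketched as follows; the `train` function is a noisy toy stand-in for partial training, and real Hyperband runs several such brackets with different starting budgets:

```python
import random

random.seed(0)

def train(config, budget):
    # Stand-in for training `config` for `budget` epochs; lower score is
    # better. Noise shrinks with budget, mimicking the fact that longer
    # training gives a more reliable estimate of final quality.
    return config["quality"] + random.gauss(0, 1.0 / budget)

# Successive halving: start many configs on a small budget, then repeatedly
# keep the best half and double the budget for the survivors.
configs = [{"id": i, "quality": random.random()} for i in range(16)]
budget = 1
while len(configs) > 1:
    scored = sorted(configs, key=lambda c: train(c, budget))
    configs = scored[: len(scored) // 2]
    budget *= 2

winner = configs[0]
```

With 16 starting configurations, only the single survivor receives the full budget of 8, so most of the compute is spent on cheap early-round eliminations.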
BOHB combines the sample efficiency of Bayesian optimization with the early-stopping capabilities of Hyperband [5]. It replaces Hyperband's random configuration sampling with a Tree-structured Parzen Estimator (TPE) that learns from completed evaluations to propose better configurations.
BOHB achieves the best of both worlds: it starts finding good solutions as quickly as Hyperband (by discarding poor configurations early) and converges toward strong final solutions as efficiently as Bayesian optimization (by directing the search toward promising regions). In benchmarks, BOHB often finds good solutions over an order of magnitude faster than standard Bayesian optimization and reaches better final solutions than Hyperband alone.
Population-Based Training (PBT), developed by DeepMind, takes an evolutionary approach to hyperparameter optimization [6]. Instead of training each configuration independently, PBT trains a population of models simultaneously and periodically:

1. Evaluates each population member's performance.
2. Replaces the weights of poorly performing members with copies from stronger members (exploit).
3. Perturbs the copied hyperparameters to continue the search (explore).
PBT's key advantage is that it adapts hyperparameters during training rather than fixing them at the start. This enables learning rate schedules, regularization changes, and other dynamic adjustments to emerge automatically. The method has been particularly successful for reinforcement learning and large-scale language model training.
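A minimal exploit/explore loop might look like this. The `train_step` function is a toy stand-in for real training, and the perturbation factors and truncation sizes are illustrative choices, not DeepMind's exact settings:

```python
import math
import random

random.seed(0)

def train_step(member):
    # Stand-in for a few epochs of training: performance improves faster
    # when the learning rate is close to a (hidden) optimum of 1e-3.
    gain = 1.0 / (1.0 + (math.log10(member["lr"]) + 3) ** 2)
    member["score"] += gain

population = [{"lr": 10 ** random.uniform(-5, -1), "score": 0.0} for _ in range(8)]

for step in range(10):
    for member in population:
        train_step(member)
    population.sort(key=lambda m: m["score"], reverse=True)
    # Exploit: the bottom members copy the state of a top member.
    # Explore: the copied learning rate is perturbed by a random factor,
    # so effective learning rate schedules emerge over the run.
    for loser in population[-2:]:
        winner = random.choice(population[:2])
        loser["score"] = winner["score"]
        loser["lr"] = winner["lr"] * random.choice([0.8, 1.25])

best = max(population, key=lambda m: m["score"])
```

Because the learning rate keeps mutating during training, the best member ends the run with a different learning rate than it started with, which is exactly the dynamic adjustment described above.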
| Method | Sample Efficiency | Parallelizability | Handles Early Stopping | Adaptive Scheduling | Best For |
|---|---|---|---|---|---|
| Grid search | Low | High | No | No | Small, low-dimensional spaces |
| Random search | Medium | High | No | No | General baseline; high-dimensional spaces |
| Bayesian optimization | High | Limited (sequential) | No | No | Expensive evaluations; small-to-medium spaces |
| Hyperband | Medium | High | Yes | No | Large search spaces with cheap evaluations |
| BOHB | High | High | Yes | No | General purpose; best overall efficiency |
| Population-based training | Medium-High | Medium | Implicit | Yes | Long training runs; dynamic hyperparameters |
Several mature frameworks implement these search methods and provide infrastructure for managing hyperparameter tuning experiments:
Optuna is an open-source hyperparameter optimization framework that provides an imperative, "define-by-run" API. Rather than declaring the search space upfront, users define hyperparameter sampling within the objective function itself, making it easy to create conditional and dynamic search spaces [7].
Optuna implements several search algorithms, including TPE (Tree-structured Parzen Estimator), CMA-ES (Covariance Matrix Adaptation Evolution Strategy), and grid/random search. Its pruning feature can terminate unpromising trials early using algorithms like Median Pruning or Hyperband. Optuna also provides a built-in dashboard for visualizing optimization history, parameter importance, and parallel coordinate plots.
Ray Tune is a distributed hyperparameter tuning library built on the Ray framework. Its primary strength is seamless scaling across clusters of machines, supporting hundreds of parallel trials [8].
Ray Tune integrates with multiple search algorithms (including Optuna, HyperOpt, and Bayesian optimization libraries) and scheduling algorithms (including ASHA, a scalable variant of Hyperband, and Population-Based Training). This modularity allows users to combine their preferred search algorithm with their preferred scheduling strategy.
Weights & Biases (W&B) Sweeps provides hyperparameter tuning integrated with the W&B experiment tracking platform. It supports grid, random, and Bayesian search methods, with results automatically logged alongside training metrics, system metrics, and artifacts.
The integration with W&B's visualization tools makes it straightforward to analyze sweep results, compare runs, and identify which hyperparameters most influence performance.
Keras Tuner is designed specifically for users of the Keras framework. It provides RandomSearch, Hyperband, BayesianOptimization, and Sklearn tuners, with a clean API for defining tunable parameters within Keras model-building functions. It is well-suited for rapid prototyping but less flexible than Optuna or Ray Tune for complex, distributed search scenarios.
| Tool | Search Algorithms | Distributed | Early Stopping | Visualization | Best For |
|---|---|---|---|---|---|
| Optuna | TPE, CMA-ES, Grid, Random | Via integration with Ray | Median, Hyperband-style | Built-in dashboard | General purpose; flexible API |
| Ray Tune | Any (via integrations) | Native (cluster-scale) | ASHA, PBT, Hyperband | Via W&B or TensorBoard | Large-scale distributed tuning |
| W&B Sweeps | Grid, Random, Bayesian | Agent-based | Manual | Integrated W&B platform | Teams already using W&B |
| Keras Tuner | Random, Hyperband, Bayesian | Limited | Hyperband built-in | TensorBoard | Keras users; quick prototyping |
AutoML extends hyperparameter tuning to encompass the entire machine learning pipeline, including feature engineering, model selection, and architecture design. AutoML systems like Google's AutoML, Auto-sklearn, and AutoGluon automate the process of finding the best model configuration for a given dataset.
Neural Architecture Search (NAS), a subfield of AutoML, optimizes the architecture itself (number of layers, layer types, connectivity patterns) in addition to training hyperparameters. While NAS was initially prohibitively expensive (requiring thousands of GPU hours), efficient NAS methods like DARTS (Differentiable Architecture Search) and one-shot approaches have made it more practical [9].
For large language models, full AutoML is rarely applied because the architecture is typically fixed (transformer decoder) and the cost of each training run is enormous. Instead, practitioners focus on tuning a smaller set of critical hyperparameters: learning rate, learning rate schedule, warmup steps, weight decay, and batch size.
Decades of collective experience have produced several reliable heuristics for hyperparameter tuning:
Start with the learning rate. It is consistently the most impactful hyperparameter. Use a learning rate finder (which sweeps through learning rates and plots loss) to identify a reasonable range, then tune within that range.
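A toy learning rate finder can be sketched as follows, sweeping log-spaced rates over a stand-in objective (here, a few SGD steps on f(w) = w², which diverges once the rate exceeds 1):

```python
import math

def lr_finder(train_fn, lr_min=1e-6, lr_max=1.0, num=50):
    # Sweep learning rates on a log scale and record the resulting loss;
    # a reasonable range sits just below where the loss starts to blow up.
    results = []
    for i in range(num):
        lr = lr_min * (lr_max / lr_min) ** (i / (num - 1))
        results.append((lr, train_fn(lr)))
    return results

def toy_train(lr, steps=20):
    # A few gradient steps on f(w) = w^2; the gradient is 2w.
    w = 1.0
    for _ in range(steps):
        w -= lr * 2 * w
        if not math.isfinite(w) or abs(w) > 1e6:
            return float("inf")  # diverged
    return w * w

results = lr_finder(toy_train)
best_lr, best_loss = min(results, key=lambda t: t[1])
```

In practice the sweep runs over real mini-batches and the loss curve is plotted rather than minimized directly, but the shape of the procedure is the same.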
Scale learning rate with batch size. The linear scaling rule suggests that when doubling the batch size, the learning rate should also be doubled. This is an approximation that works well in practice, particularly for SGD. For Adam-based optimizers, the relationship is weaker but still directionally useful.
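The rule itself is simple arithmetic; treat the result as a starting point to be validated, especially with Adam-style optimizers:

```python
def scaled_lr(base_lr, base_batch, new_batch):
    # Linear scaling rule: learning rate grows in proportion to batch size.
    # Works best for SGD; for Adam/AdamW, use it only as a rough guide.
    return base_lr * (new_batch / base_batch)

print(scaled_lr(0.1, 256, 1024))  # 0.4
```

For example, a recipe tuned at batch size 256 with learning rate 0.1 suggests 0.4 at batch size 1024; a short warmup is usually needed to make such large rates stable.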
Use established defaults as starting points. Transformer models have well-known good defaults (learning rate around 1e-4 to 3e-4 for Adam/AdamW, weight decay of 0.01 to 0.1, warmup of 1% to 5% of total steps). Starting from these defaults and tuning around them is far more efficient than searching from scratch.
Tune in stages. First identify the right order of magnitude for each hyperparameter using coarse random search. Then refine within the promising range using Bayesian optimization or targeted random search.
Log everything. Experiment tracking tools like W&B, MLflow, or Neptune are not optional luxuries. Without systematic logging, it becomes impossible to reproduce results or understand which changes mattered.
Budget-aware methods first. For expensive models, start with Hyperband or BOHB to quickly eliminate bad configurations before investing compute in full evaluations.
As of early 2026, hyperparameter tuning practices in the AI community have consolidated around several patterns:
For LLM pretraining, the enormous cost of each run (millions of dollars) means that extensive hyperparameter searches are impractical. Instead, teams rely on scaling laws (as established by Chinchilla and subsequent work) to predict optimal hyperparameters from smaller proxy experiments. Critical parameters like learning rate, batch size schedule, and weight decay are determined from runs at smaller scale and extrapolated.
For fine-tuning and smaller models, Optuna with TPE has emerged as the de facto standard for single-machine experiments, while Ray Tune handles distributed scenarios. The combination of Optuna's search algorithms with Ray Tune's distributed infrastructure is increasingly common.
BOHB remains the strongest general-purpose algorithm in benchmarks, though in practice many practitioners find that random search with early stopping (via Hyperband or ASHA) provides most of the benefit with simpler implementation.
The integration of hyperparameter tuning with experiment tracking platforms has become standard practice, with Weights & Biases and MLflow serving as the most widely adopted platforms for logging, comparing, and analyzing tuning results [10].