Hyperparameter tuning (also called hyperparameter optimization or hyperparameter search) is the process of finding the optimal configuration parameters for a machine learning model. Unlike model parameters (such as weights and biases) that are learned automatically during training, hyperparameters are set before training begins and control aspects of the learning process itself, including network architecture, optimization behavior, and regularization strength.
Choosing good hyperparameters is critical because they directly influence a model's ability to learn effectively. A learning rate that is too high causes training to diverge; one that is too low makes training painfully slow or leaves it trapped in poor local minima. Similarly, the wrong combination of batch size, regularization, and architectural choices can be the difference between a model that generalizes well and one that overfits or underfits [1].
The following table lists the most frequently tuned hyperparameters across deep learning models:
| Hyperparameter | Typical Range | Controls | Impact |
|---|---|---|---|
| Learning rate | 1e-5 to 1e-1 | Step size for weight updates during gradient descent | Most influential single hyperparameter; wrong values cause divergence or stagnation |
| Batch size | 8 to 4096+ | Number of training examples per gradient update | Affects training speed, memory usage, and generalization; larger batches need higher learning rates |
| Number of layers (depth) | 2 to 100+ | Depth of the network architecture | Deeper networks can learn more complex functions but are harder to train and prone to vanishing gradients |
| Hidden size (width) | 64 to 16384 | Dimensionality of hidden representations | Wider layers increase model capacity but also parameter count and compute |
| Dropout rate | 0.0 to 0.5 | Fraction of neurons randomly deactivated during training | Regularization technique; too high reduces model capacity, too low allows overfitting |
| Weight decay | 1e-6 to 1e-1 | L2 regularization penalty on weights | Prevents weights from growing too large; acts as implicit regularization |
| Number of attention heads | 1 to 128 | Parallel attention computations in transformer models | More heads allow attending to different representation subspaces |
| Warmup steps | 0 to 10000 | Gradual increase of learning rate at training start | Stabilizes early training, especially important for transformers |
| Optimizer choice | Adam, AdamW, SGD, etc. | Algorithm for updating weights | Different optimizers suit different problems; AdamW is standard for LLMs |
| Sequence length | 128 to 128000+ | Maximum input length for sequence models | Longer sequences capture more context but require quadratically more memory for standard attention |
Hyperparameter tuning methods range from simple exhaustive approaches to sophisticated model-based optimization strategies. The evolution of these methods reflects the growing computational cost of evaluating each configuration.
Grid search is the simplest approach: define a discrete set of values for each hyperparameter and evaluate every possible combination. For example, if searching over 5 learning rates and 4 batch sizes, grid search evaluates all 20 combinations.
The main advantage is completeness within the defined grid. The main disadvantage is that the number of evaluations grows exponentially with the number of hyperparameters (the curse of dimensionality). With 6 hyperparameters each taking 5 values, grid search requires 15,625 evaluations, which is typically impractical for deep learning models that take hours or days to train.
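As an illustration, a grid like the one above can be enumerated with `itertools.product`; the `evaluate` function here is a hypothetical stand-in for a full training run:

```python
import itertools

# Hypothetical search space: 5 learning rates x 4 batch sizes = 20 combinations.
search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2],
    "batch_size": [16, 32, 64, 128],
}

def evaluate(config):
    # Stand-in for a real training run; returns a toy validation loss
    # minimized at lr=1e-3, batch_size=64.
    return abs(config["learning_rate"] - 1e-3) + abs(config["batch_size"] - 64) / 1000

keys = list(search_space)
grid = [dict(zip(keys, values)) for values in itertools.product(*search_space.values())]
best = min(grid, key=evaluate)
```

Adding a third hyperparameter with 5 values would multiply the grid to 100 evaluations, which is the exponential blow-up described above.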
Bergstra and Bengio demonstrated in their influential 2012 paper that random search is significantly more efficient than grid search for hyperparameter optimization [2]. Rather than evaluating a fixed grid, random search samples hyperparameter values from specified distributions (e.g., log-uniform for learning rate, uniform for dropout).
The key insight is that not all hyperparameters are equally important. In practice, model performance is often sensitive to only a few hyperparameters. Grid search wastes evaluations by exhaustively varying unimportant hyperparameters, while random search explores the important dimensions more thoroughly per trial. With the same computational budget, random search typically finds better configurations than grid search.
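A random search over the same kind of space takes only a few lines; note the log-uniform sampling of the learning rate and the toy objective standing in for training, where (as in the argument above) one dimension matters far more than the other:

```python
import math
import random

random.seed(0)

def sample_config():
    # Log-uniform for learning rate, uniform for dropout -- matching
    # the distributions suggested in the text.
    return {
        "learning_rate": 10 ** random.uniform(-5, -1),
        "dropout": random.uniform(0.0, 0.5),
    }

def evaluate(config):
    # Toy objective: performance depends strongly on the learning rate
    # (optimum at 1e-3) and only weakly on dropout.
    return (math.log10(config["learning_rate"]) + 3) ** 2 + 0.01 * config["dropout"]

trials = [sample_config() for _ in range(20)]
best = min(trials, key=evaluate)
```

Because every trial draws a fresh learning rate, 20 random trials probe 20 distinct values of the important dimension, whereas a 5 x 4 grid probes only 5.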
Bayesian optimization uses a probabilistic surrogate model to guide the search toward promising regions of the hyperparameter space. The approach maintains a model (typically a Gaussian process or Tree-structured Parzen Estimator) of the objective function and uses an acquisition function to decide which configuration to evaluate next [3].
The process works as follows:

1. Evaluate a small number of initial configurations (often randomly sampled) to seed the surrogate model.
2. Fit the surrogate model to all observed (configuration, score) pairs.
3. Maximize an acquisition function (such as expected improvement) to select the next configuration, balancing exploration of uncertain regions against exploitation of promising ones.
4. Evaluate the selected configuration, add the result to the observations, and repeat until the budget is exhausted.
Bayesian optimization is substantially more sample-efficient than random search because it learns from previous evaluations. However, the surrogate model itself has limitations: Gaussian processes scale poorly beyond a few thousand observations, and the method can struggle with very high-dimensional hyperparameter spaces.
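The Tree-structured Parzen Estimator variant of this idea can be sketched in pure Python. This is a deliberately simplified toy: a one-dimensional space, a fixed kernel bandwidth, and a toy objective standing in for training; real TPE implementations use adaptive bandwidths and more careful density models:

```python
import math
import random
import statistics

random.seed(0)

def objective(log_lr):
    # Toy objective with its optimum at log_lr = -3 (i.e., lr = 1e-3).
    return (log_lr + 3) ** 2

def parzen_density(x, points, bw=0.5):
    # Kernel density estimate: a Gaussian bump centred on each observed point.
    return sum(math.exp(-0.5 * ((x - p) / bw) ** 2) for p in points) / len(points)

# Warm-up: a few random evaluations before the model kicks in.
observations = [(log_lr, objective(log_lr))
                for log_lr in (random.uniform(-5, -1) for _ in range(5))]

for _ in range(20):
    losses = [loss for _, loss in observations]
    threshold = statistics.median(losses)
    good = [x for x, loss in observations if loss <= threshold]
    bad = [x for x, loss in observations if loss > threshold]
    # Sample candidates near good points and pick the one maximising the
    # density ratio l(x)/g(x) -- TPE's acquisition criterion.
    candidates = [random.gauss(random.choice(good), 0.5) for _ in range(32)]
    nxt = max(candidates,
              key=lambda x: parzen_density(x, good) / (parzen_density(x, bad) + 1e-12))
    observations.append((nxt, objective(nxt)))

best_log_lr, best_loss = min(observations, key=lambda t: t[1])
```

After 25 evaluations the search has concentrated around the optimum, whereas pure random search would keep sampling the whole interval uniformly.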
Hyperband addresses a different aspect of efficiency: rather than being smarter about which configurations to try, it is smarter about how much compute to allocate to each configuration [4]. The algorithm is based on the successive halving principle:

1. Allocate a small training budget to many randomly sampled configurations.
2. Evaluate them and keep only the best-performing fraction (e.g., the top half or third).
3. Increase the budget for the survivors and repeat until few configurations remain.
This approach exploits the observation that bad configurations can often be identified early, long before they have been trained to completion. Hyperband runs multiple rounds of successive halving with different initial budget allocations, hedging against the possibility that some good configurations are slow starters.
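One round of successive halving can be sketched as follows; the `train` function is a noisy toy stand-in for partial training, and real Hyperband runs several such brackets with different starting budgets:

```python
import random

random.seed(0)

def train(config, budget):
    # Stand-in for training `config` for `budget` epochs; lower score is
    # better. Noise shrinks with budget, mimicking the fact that longer
    # training gives a more reliable estimate of final quality.
    return config["quality"] + random.gauss(0, 1.0 / budget)

# Successive halving: start many configs on a small budget, then repeatedly
# keep the best half and double the budget for the survivors.
configs = [{"id": i, "quality": random.random()} for i in range(16)]
budget = 1
while len(configs) > 1:
    scored = sorted(configs, key=lambda c: train(c, budget))
    configs = scored[: len(scored) // 2]
    budget *= 2

winner = configs[0]
```

With 16 starting configurations, only the single survivor receives the full budget of 8, so most of the compute is spent on cheap early-round eliminations.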
BOHB combines the sample efficiency of Bayesian optimization with the early-stopping capabilities of Hyperband [5]. It replaces Hyperband's random configuration sampling with a Tree-structured Parzen Estimator (TPE) that learns from completed evaluations to propose better configurations.
BOHB achieves the best of both worlds: it starts finding good solutions as quickly as Hyperband (by discarding poor configurations early) and converges toward strong final solutions as efficiently as Bayesian optimization (by directing the search toward promising regions). In benchmarks, BOHB often finds good solutions over an order of magnitude faster than standard Bayesian optimization and reaches better final solutions than Hyperband alone.
Population-Based Training (PBT), developed by DeepMind, takes an evolutionary approach to hyperparameter optimization [6]. Instead of training each configuration independently, PBT trains a population of models simultaneously and periodically:

1. Evaluates each population member's performance.
2. Replaces the weights of poorly performing members with copies from stronger members (exploit).
3. Perturbs the copied hyperparameters to continue the search (explore).
PBT's key advantage is that it adapts hyperparameters during training rather than fixing them at the start. This enables learning rate schedules, regularization changes, and other dynamic adjustments to emerge automatically. The method has been particularly successful for reinforcement learning and large-scale language model training.
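A minimal exploit/explore loop might look like this. The `train_step` function is a toy stand-in for real training, and the perturbation factors and truncation sizes are illustrative choices, not DeepMind's exact settings:

```python
import math
import random

random.seed(0)

def train_step(member):
    # Stand-in for a few epochs of training: performance improves faster
    # when the learning rate is close to a (hidden) optimum of 1e-3.
    gain = 1.0 / (1.0 + (math.log10(member["lr"]) + 3) ** 2)
    member["score"] += gain

population = [{"lr": 10 ** random.uniform(-5, -1), "score": 0.0} for _ in range(8)]

for step in range(10):
    for member in population:
        train_step(member)
    population.sort(key=lambda m: m["score"], reverse=True)
    # Exploit: the bottom members copy the state of a top member.
    # Explore: the copied learning rate is perturbed by a random factor,
    # so effective learning rate schedules emerge over the run.
    for loser in population[-2:]:
        winner = random.choice(population[:2])
        loser["score"] = winner["score"]
        loser["lr"] = winner["lr"] * random.choice([0.8, 1.25])

best = max(population, key=lambda m: m["score"])
```

Because the learning rate keeps mutating during training, the best member ends the run with a different learning rate than it started with, which is exactly the dynamic adjustment described above.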
| Method | Sample Efficiency | Parallelizability | Handles Early Stopping | Adaptive Scheduling | Best For |
|---|---|---|---|---|---|
| Grid search | Low | High | No | No | Small, low-dimensional spaces |
| Random search | Medium | High | No | No | General baseline; high-dimensional spaces |
| Bayesian optimization | High | Limited (sequential) | No | No | Expensive evaluations; small-to-medium spaces |
| Hyperband | Medium | High | Yes | No | Large search spaces with cheap evaluations |
| BOHB | High | High | Yes | No | General purpose; best overall efficiency |
| Population-based training | Medium-High | Medium | Implicit | Yes | Long training runs; dynamic hyperparameters |
Several mature frameworks implement these search methods and provide infrastructure for managing hyperparameter tuning experiments:
Optuna is an open-source hyperparameter optimization framework that provides an imperative, "define-by-run" API. Rather than declaring the search space upfront, users define hyperparameter sampling within the objective function itself, making it easy to create conditional and dynamic search spaces [7].
Optuna implements several search algorithms, including TPE (Tree-structured Parzen Estimator), CMA-ES (Covariance Matrix Adaptation Evolution Strategy), and grid/random search. Its pruning feature can terminate unpromising trials early using algorithms like Median Pruning or Hyperband. Optuna also provides a built-in dashboard for visualizing optimization history, parameter importance, and parallel coordinate plots.
Ray Tune is a distributed hyperparameter tuning library built on the Ray framework. Its primary strength is seamless scaling across clusters of machines, supporting hundreds of parallel trials [8].
Ray Tune integrates with multiple search algorithms (including Optuna, HyperOpt, and Bayesian optimization libraries) and scheduling algorithms (including ASHA, a scalable variant of Hyperband, and Population-Based Training). This modularity allows users to combine their preferred search algorithm with their preferred scheduling strategy.
Weights & Biases (W&B) Sweeps provides hyperparameter tuning integrated with the W&B experiment tracking platform. It supports grid, random, and Bayesian search methods, with results automatically logged alongside training metrics, system metrics, and artifacts.
The integration with W&B's visualization tools makes it straightforward to analyze sweep results, compare runs, and identify which hyperparameters most influence performance.
Keras Tuner is designed specifically for users of the Keras framework. It provides RandomSearch, Hyperband, BayesianOptimization, and Sklearn tuners, with a clean API for defining tunable parameters within Keras model-building functions. It is well-suited for rapid prototyping but less flexible than Optuna or Ray Tune for complex, distributed search scenarios.
| Tool | Search Algorithms | Distributed | Early Stopping | Visualization | Best For |
|---|---|---|---|---|---|
| Optuna | TPE, CMA-ES, Grid, Random | Via integration with Ray | Median, Hyperband-style | Built-in dashboard | General purpose; flexible API |
| Ray Tune | Any (via integrations) | Native (cluster-scale) | ASHA, PBT, Hyperband | Via W&B or TensorBoard | Large-scale distributed tuning |
| W&B Sweeps | Grid, Random, Bayesian | Agent-based | Manual | Integrated W&B platform | Teams already using W&B |
| Keras Tuner | Random, Hyperband, Bayesian | Limited | Hyperband built-in | TensorBoard | Keras users; quick prototyping |
AutoML extends hyperparameter tuning to encompass the entire machine learning pipeline, including feature engineering, model selection, and architecture design. AutoML systems like Google's AutoML, Auto-sklearn, and AutoGluon automate the process of finding the best model configuration for a given dataset.
Neural Architecture Search (NAS), a subfield of AutoML, optimizes the architecture itself (number of layers, layer types, connectivity patterns) in addition to training hyperparameters. While NAS was initially prohibitively expensive (requiring thousands of GPU hours), efficient NAS methods like DARTS (Differentiable Architecture Search) and one-shot approaches have made it more practical [9].
For large language models, full AutoML is rarely applied because the architecture is typically fixed (transformer decoder) and the cost of each training run is enormous. Instead, practitioners focus on tuning a smaller set of critical hyperparameters: learning rate, learning rate schedule, warmup steps, weight decay, and batch size.
Decades of collective experience have produced several reliable heuristics for hyperparameter tuning:
Start with the learning rate. It is consistently the most impactful hyperparameter. Use a learning rate finder (which sweeps through learning rates and plots loss) to identify a reasonable range, then tune within that range.
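A toy learning rate finder can be sketched as follows, sweeping log-spaced rates over a stand-in objective (here, a few SGD steps on f(w) = w², which diverges once the rate exceeds 1):

```python
import math

def lr_finder(train_fn, lr_min=1e-6, lr_max=1.0, num=50):
    # Sweep learning rates on a log scale and record the resulting loss;
    # a reasonable range sits just below where the loss starts to blow up.
    results = []
    for i in range(num):
        lr = lr_min * (lr_max / lr_min) ** (i / (num - 1))
        results.append((lr, train_fn(lr)))
    return results

def toy_train(lr, steps=20):
    # A few gradient steps on f(w) = w^2; the gradient is 2w.
    w = 1.0
    for _ in range(steps):
        w -= lr * 2 * w
        if not math.isfinite(w) or abs(w) > 1e6:
            return float("inf")  # diverged
    return w * w

results = lr_finder(toy_train)
best_lr, best_loss = min(results, key=lambda t: t[1])
```

In practice the sweep runs over real mini-batches and the loss curve is plotted rather than minimized directly, but the shape of the procedure is the same.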
Scale learning rate with batch size. The linear scaling rule suggests that when doubling the batch size, the learning rate should also be doubled. This is an approximation that works well in practice, particularly for SGD. For Adam-based optimizers, the relationship is weaker but still directionally useful.
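The rule itself is simple arithmetic; treat the result as a starting point to be validated, especially with Adam-style optimizers:

```python
def scaled_lr(base_lr, base_batch, new_batch):
    # Linear scaling rule: learning rate grows in proportion to batch size.
    # Works best for SGD; for Adam/AdamW, use it only as a rough guide.
    return base_lr * (new_batch / base_batch)

print(scaled_lr(0.1, 256, 1024))  # 0.4
```

For example, a recipe tuned at batch size 256 with learning rate 0.1 suggests 0.4 at batch size 1024; a short warmup is usually needed to make such large rates stable.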
Use established defaults as starting points. Transformer models have well-known good defaults (learning rate around 1e-4 to 3e-4 for Adam/AdamW, weight decay of 0.01 to 0.1, warmup of 1% to 5% of total steps). Starting from these defaults and tuning around them is far more efficient than searching from scratch.
Tune in stages. First identify the right order of magnitude for each hyperparameter using coarse random search. Then refine within the promising range using Bayesian optimization or targeted random search.
Log everything. Experiment tracking tools like W&B, MLflow, or Neptune are not optional luxuries. Without systematic logging, it becomes impossible to reproduce results or understand which changes mattered.
Budget-aware methods first. For expensive models, start with Hyperband or BOHB to quickly eliminate bad configurations before investing compute in full evaluations.
As of early 2026, hyperparameter tuning practices in the AI community have consolidated around several patterns:
For LLM pretraining, the enormous cost of each run (millions of dollars) means that extensive hyperparameter searches are impractical. Instead, teams rely on scaling laws (as established by Chinchilla and subsequent work) to predict optimal hyperparameters from smaller proxy experiments. Critical parameters like learning rate, batch size schedule, and weight decay are determined from runs at smaller scale and extrapolated.
For fine-tuning and smaller models, Optuna with TPE has emerged as the de facto standard for single-machine experiments, while Ray Tune handles distributed scenarios. The combination of Optuna's search algorithms with Ray Tune's distributed infrastructure is increasingly common.
BOHB remains the strongest general-purpose algorithm in benchmarks, though in practice many practitioners find that random search with early stopping (via Hyperband or ASHA) provides most of the benefit with simpler implementation.
The integration of hyperparameter tuning with experiment tracking platforms has become standard practice, with Weights & Biases and MLflow serving as the most widely adopted platforms for logging, comparing, and analyzing tuning results [10].