See also: Machine learning terms
Machine learning involves finding the set of parameters that allows a model to make accurate predictions on new data. However, some settings cannot be learned from the training data at all and must be chosen before training begins. These are known as hyperparameters, and they play a significant role in determining the model's performance. Choosing appropriate hyperparameters is one of the most important and time-consuming parts of building a machine learning system, because even a well-designed architecture can fail if its hyperparameters are poorly configured.
Yoshua Bengio described the learning rate as "the single most important hyper-parameter" for gradient-based training of deep architectures in his widely cited 2012 guide, and that observation extends to a broader principle: a small number of hyperparameters typically account for most of the variation in model performance. Understanding which hyperparameters matter, how they interact, and how to search for good values efficiently is a core practical skill in applied machine learning.
Hyperparameters are configuration variables that are set before the training process begins and control the behavior of the learning algorithm. Unlike regular parameters (such as weights and biases), which are learned from data during training, hyperparameters must be specified by the practitioner and remain fixed throughout the training process. They govern how the model learns rather than what the model learns.
More formally, a hyperparameter is any setting whose value is used to control the learning process itself. Hyperparameters are external to the model in the sense that their optimal values cannot be estimated from the training data alone. Instead, they are typically chosen through experimentation, domain expertise, or automated search procedures.
Hyperparameters can be divided into two broad categories:
- Model hyperparameters, which determine the structure and capacity of the model itself, such as the number of hidden layers in a neural network or the maximum depth of a decision tree.
- Algorithm (or training) hyperparameters, which control how the learning process runs, such as the learning rate, batch size, or number of epochs.
Hyperparameters can influence a wide range of model behaviors, including the complexity of the model, how quickly it learns, how well it generalizes to unseen data, and how long training takes. Because of this broad influence, hyperparameter selection is closely tied to the bias-variance tradeoff: hyperparameters that increase model complexity tend to reduce bias but increase variance, while those that constrain the model tend to have the opposite effect.
The distinction between parameters and hyperparameters is fundamental in machine learning. Parameters are the internal variables of the model that are learned directly from the training data through optimization algorithms like gradient descent. Hyperparameters, by contrast, are set externally and dictate how the learning process operates.
| Aspect | Parameter | Hyperparameter |
|---|---|---|
| Definition | Internal variable learned from data | External configuration set before training |
| When set | During training (learned automatically) | Before training (set by the practitioner) |
| Source of value | Estimated from the training data | Chosen via experimentation, heuristics, or search |
| Role | Captures patterns and relationships in the data | Controls the learning process and model complexity |
| Examples | Weights in a neural network, coefficients in linear regression, support vector coefficients in an SVM | Learning rate, number of hidden layers, regularization strength, batch size |
| Updated during training | Yes, via backpropagation or other optimization | No, remains fixed for a given training run |
| Present in prediction | Yes, parameters define the final model | No, hyperparameters are not part of the trained model |
| Number | Can be millions or billions (e.g., deep learning models) | Typically a handful to a few dozen |
| Optimization method | Gradient-based (e.g., gradient descent, Adam) | Derivative-free search (grid, random, Bayesian) |
A simple way to remember the distinction: parameters are what the model learns; hyperparameters are what you tell the model before it starts learning. Another useful framing is that parameters are optimized with respect to the training loss, while hyperparameters are optimized with respect to validation performance.
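The distinction is easy to see in code. The hedged sketch below uses scikit-learn's LogisticRegression purely as an illustration: the regularization strength C is a hyperparameter passed in before training, while the coefficients in coef_ are parameters learned by fit.

```python
# Minimal sketch of the parameter/hyperparameter split using scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Hyperparameter: chosen by the practitioner before training starts.
model = LogisticRegression(C=1.0, max_iter=5000)

# Parameters: learned from the data during fit().
model.fit(X, y)
print(model.coef_.shape)        # learned weights (parameters)
print(model.intercept_)         # learned bias (parameter)
print(model.get_params()["C"])  # the hyperparameter, unchanged by training
```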
Different families of machine learning algorithms expose different sets of hyperparameters. Below is a survey of the most commonly tuned hyperparameters organized by model type.
Neural networks have a large number of hyperparameters that interact in complex ways. Tuning them effectively requires both understanding their individual effects and recognizing how they influence each other.
| Hyperparameter | Description | Typical range | Effect |
|---|---|---|---|
| Learning rate | Step size for weight updates during gradient descent | 1e-5 to 1e-1 | Too high causes divergence; too low causes slow convergence or getting stuck in poor local minima |
| Batch size | Number of training samples processed before each weight update | 16 to 512 | Smaller batches add regularizing noise and may generalize better; larger batches enable faster computation but may converge to sharper minima |
| Number of epochs | Number of complete passes through the training dataset | 10 to 1000+ | Too few epochs lead to underfitting; too many lead to overfitting |
| Optimizer choice | Algorithm used to update weights (SGD, Adam, AdamW, RMSProp) | Categorical | Different optimizers suit different problems; Adam is a common default for its adaptive learning rates |
| Weight decay | L2 regularization penalty on weight magnitudes | 1e-6 to 1e-2 | Prevents weights from growing too large, reducing overfitting; interacts with learning rate |
| Dropout rate | Fraction of neurons randomly disabled during training | 0.0 to 0.5 | Higher values provide stronger regularization but may discard useful information; too low may not prevent overfitting |
| Number of layers | Depth of the network (number of hidden layers) | 1 to 100+ | Deeper networks can represent more complex functions but are harder to train and more prone to overfitting |
| Hidden units per layer | Width of each hidden layer | 32 to 4096 | More units increase the representational capacity but also increase computational cost and overfitting risk |
| Activation function | Non-linearity applied after each layer (ReLU, GELU, Tanh) | Categorical | Affects gradient flow and model expressiveness; ReLU is a widely used default |
Leslie Smith's 2018 paper "A Disciplined Approach to Neural Network Hyper-Parameters" provides practical guidance on tuning learning rate, batch size, momentum, and weight decay for neural networks. A related, well-known heuristic is the linear scaling rule (popularized by Goyal et al., 2017): when the batch size is multiplied by a factor k, the learning rate should also be multiplied by k to maintain similar training dynamics.
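As a hedged illustration of the linear scaling rule, the function below simply rescales a baseline recipe; the function name and the baseline values are arbitrary choices, not from the cited papers.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linear scaling rule: scale the learning rate by the same factor as the batch size."""
    return base_lr * (new_batch_size / base_batch_size)

# A recipe tuned at batch size 256 with learning rate 0.1, moved to batch size 1024:
print(scale_learning_rate(0.1, 256, 1024))  # 0.4
```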
Convolutional neural networks (CNNs) inherit all the general neural network hyperparameters listed above and add several architecture-specific ones related to their convolutional layers.
| Hyperparameter | Description | Typical range | Effect |
|---|---|---|---|
| Kernel size | Height and width of the convolutional filter | 1x1 to 7x7 | Larger kernels capture wider spatial context but add more parameters; 3x3 is the most common choice since Simonyan and Zisserman (2015) showed that stacking small filters is more efficient than using large ones |
| Stride | Step size of the filter as it moves across the input | 1 to 3 | Larger strides reduce spatial dimensions more aggressively, lowering computation at the cost of spatial resolution |
| Padding | Number of pixels added around the input borders | 0 (valid) or same | "Same" padding preserves spatial dimensions; "valid" padding reduces them |
| Number of filters | Number of distinct filters (channels) in each convolutional layer | 16 to 1024 | More filters allow the network to learn more feature maps but increase memory and computation |
| Pooling size and type | Spatial downsampling operation (max pooling, average pooling) | 2x2 or 3x3 | Reduces spatial dimensions and provides translation invariance; max pooling is the most common default |
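A hedged Keras sketch showing where these CNN hyperparameters appear in code; the specific values are illustrative, not recommendations.

```python
# Illustrative CNN: kernel size, stride, padding, filter count, and pooling
# are all architecture hyperparameters fixed before training.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(filters=32, kernel_size=3, strides=1, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(filters=64, kernel_size=3, strides=1, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),                      # dropout rate: a regularization hyperparameter
    layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),  # learning rate and optimizer choice
    loss="sparse_categorical_crossentropy",
)
```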
Tree-based models, including decision trees, random forests, and gradient boosting methods, have their own distinct set of hyperparameters.
| Hyperparameter | Description | Typical range | Effect |
|---|---|---|---|
| max_depth | Maximum depth of each tree | 3 to 20 | Deeper trees can fit more complex patterns but are more prone to overfitting |
| min_samples_split | Minimum number of samples required to split an internal node | 2 to 20 | Higher values constrain the tree, acting as regularization |
| min_samples_leaf | Minimum number of samples required in a leaf node | 1 to 20 | Higher values smooth the model and prevent learning noise |
| n_estimators | Number of trees in the ensemble (for forests and boosting) | 50 to 5000 | More trees generally improve performance up to a point, then plateau |
| max_features | Number or fraction of features considered for each split | sqrt(n), log2(n), or a fraction | Lower values increase diversity among trees and can reduce overfitting |
| learning_rate (boosting) | Shrinkage factor applied to each tree's contribution | 0.001 to 0.3 | Lower values require more trees but often yield better generalization |
| subsample (boosting) | Fraction of training samples used per tree | 0.5 to 1.0 | Introduces stochasticity; values below 1.0 can reduce overfitting |
For random forests, the n_estimators hyperparameter generally shows diminishing returns: performance improves sharply with more trees initially but plateaus after a certain point. A practical rule of thumb is to increase n_estimators until the validation error stops improving. The max_depth and min_samples_leaf hyperparameters are more critical for controlling overfitting.
For gradient boosting, the interaction between learning_rate and n_estimators is particularly important. A lower learning rate requires more estimators to achieve the same training loss but typically produces a model that generalizes better. Trees in gradient boosting are usually kept shallow (3 to 8 levels) compared to the deep trees used in random forests.
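The sketch below illustrates this interaction with scikit-learn's GradientBoostingClassifier; the two configurations and the dataset are illustrative, and in practice the comparison would be made against a held-out validation set.

```python
# Two boosting configurations trading learning_rate against n_estimators.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

fast = GradientBoostingClassifier(learning_rate=0.3, n_estimators=100, max_depth=3)
slow = GradientBoostingClassifier(learning_rate=0.03, n_estimators=1000, max_depth=3)

# The low-learning-rate model needs many more trees but often generalizes better.
print(cross_val_score(fast, X, y, cv=5).mean())
print(cross_val_score(slow, X, y, cv=5).mean())
```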
Support vector machines have a small but impactful set of hyperparameters.
| Hyperparameter | Description | Typical range | Effect |
|---|---|---|---|
| C (regularization) | Penalty parameter for misclassified training examples | 1e-3 to 1e3 | Low C allows a wider margin with more misclassifications (higher bias, lower variance); high C enforces a narrower margin with fewer misclassifications (lower bias, higher variance) |
| Kernel | Function used to map data into a higher-dimensional space (linear, RBF, polynomial, sigmoid) | Categorical | Determines the type of decision boundary; RBF (radial basis function) is a common default for non-linear problems |
| Gamma (RBF kernel) | Defines the influence radius of a single training example | 1e-4 to 1e1, or 'scale' | Low gamma means each point has a broad influence (smoother boundary); high gamma means each point has a narrow influence (more complex boundary, risk of overfitting) |
| Degree (polynomial kernel) | Degree of the polynomial kernel function | 2 to 5 | Higher degree allows more complex boundaries but increases computation and overfitting risk |
When tuning SVMs, it is common to search over C and gamma jointly using a logarithmic grid (e.g., C in {0.01, 0.1, 1, 10, 100} and gamma in {0.001, 0.01, 0.1, 1}). Feature scaling is essential before training an SVM, because the algorithm is sensitive to the magnitude of input features.
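A hedged scikit-learn sketch of this joint search, with the feature scaling step included; the dataset is used only for illustration.

```python
# Joint logarithmic grid over C and gamma, with scaling inside the pipeline
# so each cross-validation fold is scaled independently.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
param_grid = {
    "svc__C": [0.01, 0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```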
Other machine learning algorithms also have important hyperparameters:
- k-nearest neighbors: the number of neighbors k and the choice of distance metric.
- k-means clustering: the number of clusters k and the initialization scheme.
- Ridge and lasso regression: the regularization strength (alpha).
- Logistic regression: the regularization strength (C) and the penalty type (L1 or L2).
Hyperparameter tuning (also called hyperparameter optimization) is the process of finding the combination of hyperparameter values that produces the best model performance on a validation set. Several strategies exist, ranging from simple exhaustive approaches to sophisticated model-based methods.
Manual tuning is the most basic approach, where the practitioner adjusts hyperparameters by hand based on intuition, experience, and observed results. Although it may seem outdated, manual tuning remains common in practice, especially for initial exploration. Experienced practitioners often know reasonable starting ranges for common hyperparameters and can narrow down the search space before applying automated methods. The main drawback is that manual tuning does not scale well to high-dimensional hyperparameter spaces and is difficult to reproduce.
Grid search is the most straightforward automated approach to hyperparameter tuning. The practitioner defines a finite set of values for each hyperparameter, and the algorithm evaluates every possible combination. For example, if there are two hyperparameters with 5 values each, grid search evaluates all 25 combinations.
Advantages:
- Simple to implement, understand, and reproduce.
- Exhaustive within the defined grid, so the best combination on the grid is guaranteed to be found.
- Trivially parallelizable, since every evaluation is independent.
Disadvantages:
- The number of combinations grows exponentially with the number of hyperparameters.
- Many evaluations are wasted varying hyperparameters that have little effect on performance.
- Only the discrete values chosen in advance are ever tried; good values between grid points are missed.
Grid search is implemented in scikit-learn as GridSearchCV, which combines the exhaustive search with cross-validation to produce robust performance estimates.
Random search samples hyperparameter values randomly from specified distributions rather than evaluating a fixed grid. Bergstra and Bengio demonstrated in their influential 2012 paper "Random Search for Hyper-Parameter Optimization" (published in the Journal of Machine Learning Research, volume 13, pages 281-305) that random search is significantly more efficient than grid search, especially when only a few hyperparameters actually matter for model performance.
The key insight is that in most machine learning problems, only a small subset of hyperparameters have a large effect on performance. Grid search wastes many evaluations varying hyperparameters that do not matter, while random search explores a wider range of values for the important hyperparameters by distributing samples across the full space.
Advantages:
- For a fixed budget, it tries more distinct values of each individual hyperparameter than a grid, which matters when only a few hyperparameters are important.
- The budget (number of samples) can be set independently of the number of hyperparameters.
- Trivially parallelizable, and more samples can be added at any time without discarding earlier results.
Disadvantages:
- Does not use the results of previous evaluations to guide the search.
- With a small budget it can miss narrow regions of good performance.
- Results vary between runs unless the random seed is fixed.
Random search is implemented in scikit-learn as RandomizedSearchCV.
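A hedged sketch of random search in scikit-learn, sampling the continuous hyperparameters from log-uniform distributions; the estimator, dataset, and ranges are illustrative.

```python
# Random search over a random forest's main hyperparameters.
from scipy.stats import loguniform, randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_distributions = {
    "n_estimators": randint(50, 500),
    "max_depth": randint(3, 20),
    "max_features": loguniform(0.1, 1.0),   # fraction of features per split
}
search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions,
    n_iter=50,          # the budget is independent of the number of hyperparameters
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```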
Bayesian optimization is a sequential, model-based approach that builds a probabilistic surrogate model of the objective function (typically validation performance as a function of hyperparameters) and uses it to decide which configurations to evaluate next. The foundational work by Snoek, Larochelle, and Adams (2012), "Practical Bayesian Optimization of Machine Learning Algorithms," published at NeurIPS, demonstrated that this approach can match or exceed expert-level tuning with far fewer evaluations.
The process works as follows:
1. Evaluate a small number of initial configurations, often chosen at random.
2. Fit a surrogate model (commonly a Gaussian process) to the observed pairs of configurations and validation scores.
3. Use an acquisition function (such as expected improvement) to select the next configuration, trading off exploration of uncertain regions against exploitation of promising ones.
4. Evaluate the selected configuration, add the result to the observations, and repeat from step 2 until the budget is exhausted.
An alternative to Gaussian processes is the Tree-structured Parzen Estimator (TPE), which models the conditional probability of the hyperparameters given good and bad performance separately. TPE is used by Optuna and Hyperopt and tends to scale better to higher-dimensional spaces than Gaussian process-based approaches.
Advantages:
- Sample-efficient: it typically finds good configurations in far fewer evaluations than grid or random search.
- Uses all previous evaluations to guide the search, which is especially valuable when each evaluation is expensive.
Disadvantages:
- Inherently sequential, so parallelizing it requires batch or asynchronous variants.
- Gaussian process surrogates scale poorly with the number of evaluations and the dimensionality of the search space.
- More complex to implement and configure than grid or random search.
Popular implementations include Spearmint, SMAC (Sequential Model-based Algorithm Configuration), and the TPE used in Optuna and Hyperopt.
Hyperband, introduced by Li, Jamieson, DeSalvo, Rostamizadeh, and Talwalkar in their 2018 JMLR paper "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization," takes a fundamentally different approach. Instead of trying to be smarter about which configurations to evaluate, Hyperband focuses on being smarter about how much resource (such as training epochs or data subset size) to allocate to each configuration.
Hyperband is built on the Successive Halving Algorithm (SHA):
1. Sample n configurations and allocate a small budget (for example, a few training epochs) to each.
2. Evaluate all configurations on that budget and keep only the best fraction (for example, the top third).
3. Multiply the budget for the survivors and repeat until only one or a few configurations remain, trained on the full budget.
Hyperband improves on Successive Halving by running it multiple times with different tradeoffs between the number of initial configurations and the minimum resource per configuration. This addresses the uncertainty in how aggressively to prune early.
Advantages:
- Allocates most of the compute to promising configurations, making it far cheaper than evaluating every configuration on the full budget.
- Requires no surrogate model and has few settings of its own.
- Easily parallelizable, with strong anytime performance.
Disadvantages:
- Configurations are still sampled randomly, so nothing learned about good regions of the search space carries over between samples.
- Assumes that performance on a small budget is predictive of performance on the full budget, which does not hold for every hyperparameter (a small learning rate, for example, can look poor early yet win in the end).
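A minimal sketch of the Successive Halving core that Hyperband builds on; `evaluate` is a hypothetical function that trains a configuration for the given budget and returns a validation score, and the toy objective below is made up purely for illustration.

```python
import random

def successive_halving(sample_config, evaluate, n=27, min_budget=1, eta=3):
    """Keep the top 1/eta of configurations at each rung, multiplying the budget by eta."""
    configs = [sample_config() for _ in range(n)]
    budget = min_budget
    while len(configs) > 1:
        scores = [(evaluate(c, budget), c) for c in configs]
        scores.sort(key=lambda pair: pair[0], reverse=True)       # higher score is better
        configs = [c for _, c in scores[: max(1, len(configs) // eta)]]
        budget *= eta
    return configs[0]

# Toy usage: learning rates near 0.05 score best, and noise shrinks as the budget grows.
best = successive_halving(
    sample_config=lambda: {"lr": 10 ** random.uniform(-5, -1)},
    evaluate=lambda c, b: -abs(c["lr"] - 0.05) + random.gauss(0, 0.01 / b),
)
print(best)
```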
Population-Based Training (PBT), introduced by Jaderberg et al. at DeepMind in 2017, combines hyperparameter optimization with training by maintaining a population of models trained in parallel. Periodically, underperforming models copy weights and hyperparameters from better-performing models and apply random perturbations. This allows hyperparameters to change during training rather than being fixed, which is especially valuable for hyperparameters whose optimal values shift over the course of training (such as learning rate).
PBT discovers a schedule of hyperparameter settings rather than a single fixed configuration. This is a meaningful distinction from other tuning methods, because research has shown that the optimal learning rate, for instance, often changes as training progresses. PBT effectively learns this schedule adaptively from the population dynamics.
Neural architecture search (NAS) extends the concept of hyperparameter optimization to the model architecture itself. Rather than choosing from a predefined set of architectures, NAS algorithms automatically search for the optimal network topology, including the number of layers, types of operations, and connectivity patterns.
NAS methods can be categorized by their search strategy:
- Reinforcement learning methods, in which a controller is trained to propose architectures.
- Evolutionary methods, which mutate and recombine a population of candidate architectures.
- Gradient-based (differentiable) methods such as DARTS, which relax the discrete architecture choice into a continuous one so it can be optimized with gradient descent.
- One-shot and weight-sharing methods, which train a single over-parameterized supernetwork and evaluate candidate sub-architectures within it.
NAS can be viewed as optimizing a very complex hyperparameter (the architecture graph) and is sometimes combined with standard hyperparameter tuning in a joint optimization pipeline.
ASHA extends Successive Halving to work efficiently in distributed and asynchronous settings. In standard SHA, all configurations in a bracket must finish before any can be promoted or pruned. ASHA removes this synchronization barrier: as soon as a worker finishes evaluating a configuration, it checks whether that configuration can be promoted to the next rung. If not, the worker begins evaluating a new random configuration.
This asynchronous design dramatically improves resource utilization in parallel computing environments, where different configurations may take different amounts of time to train. ASHA was described in the 2020 paper "A System for Massively Parallel Hyperparameter Tuning" by Li et al.
| Method | Guided by prior evaluations | Handles early stopping | Parallelizable | Best for |
|---|---|---|---|---|
| Grid search | No | No | Easily | Small search spaces (1-2 hyperparameters) |
| Random search | No | No | Easily | Initial exploration, moderate search spaces |
| Bayesian optimization | Yes | No (unless combined) | With batch variants | Fine-tuning after narrowing the search space |
| Hyperband | No | Yes | Easily | Large search spaces with expensive evaluations |
| BOHB (Bayesian optimization + Hyperband) | Yes | Yes | Yes | General-purpose, when compute is limited |
| PBT | Yes (population) | Implicit | Yes (population) | Long training runs with schedule-sensitive hyperparameters |
| NAS | Varies | Varies | Varies | Architecture design when compute is available |
Cross-validation is the standard statistical technique for evaluating hyperparameter configurations. Rather than evaluating a single train/validation split (which can be noisy), k-fold cross-validation partitions the data into k subsets (folds), trains on k-1 folds, validates on the remaining fold, and rotates through all k choices. The average validation score across folds provides a more robust estimate of generalization performance.
A critical concern in hyperparameter optimization is overfitting the validation set. When many hyperparameter combinations are evaluated against the same validation data, the selected configuration may exploit idiosyncrasies of that specific data partition rather than reflecting true generalization ability. Nested cross-validation addresses this by using an inner loop for hyperparameter selection and an outer loop for performance estimation, ensuring that the test data used for final evaluation is never seen during hyperparameter tuning.
In practice, the choice of k involves a tradeoff. Larger k (e.g., 10 or leave-one-out) gives lower-bias estimates but is computationally expensive, while smaller k (e.g., 3 or 5) is faster but noisier. For large datasets, a single held-out validation set is often sufficient because the validation estimate is already stable.
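A hedged sketch of nested cross-validation in scikit-learn: the inner GridSearchCV selects hyperparameters, while the outer cross_val_score estimates the performance of the whole tuning procedure. The estimator, grid, and dataset are illustrative.

```python
# Nested cross-validation: hyperparameter selection never sees the outer test folds.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

inner_search = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]},
    cv=3,                                                   # inner loop: hyperparameter selection
)
outer_scores = cross_val_score(inner_search, X, y, cv=5)    # outer loop: unbiased estimate
print(outer_scores.mean())
```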
Several software frameworks and libraries have been developed to make hyperparameter optimization more accessible and scalable.
| Tool | Developer | Key features | Search algorithms | Language |
|---|---|---|---|---|
| Optuna | Preferred Networks | Define-by-run API, pruning of unpromising trials, visualization dashboard | TPE, CMA-ES, Grid, Random | Python |
| Ray Tune | Anyscale | Distributed execution across clusters, integrates with many frameworks | Supports Optuna, HyperOpt, Bayesian, PBT, ASHA | Python |
| W&B Sweeps | Weights & Biases | Integrated experiment tracking, collaborative visualization, cloud-based | Bayesian, Grid, Random | Python |
| Keras Tuner | Google / Keras | Tight integration with Keras and TensorFlow, built-in tuners | Bayesian, Hyperband, Random | Python |
| Hyperopt | James Bergstra et al. | Mature library, supports MongoDB for distributed trials | TPE, Random, Adaptive TPE | Python |
| SMAC | AutoML Freiburg | Strong on combinatorial and conditional spaces | Random Forest-based Bayesian | Python |
Optuna has gained significant popularity since its introduction by Akiba et al. in 2019. Its define-by-run API allows the search space to be constructed dynamically within the objective function, which makes it easy to define conditional hyperparameters (for example, only tuning gamma when the kernel is set to RBF). Optuna also features built-in pruning of unpromising trials using algorithms like Median Pruning and Successive Halving, which stops training early if intermediate results look poor.
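A hedged sketch of Optuna's define-by-run style for the conditional-space example mentioned above; the dataset and ranges are illustrative.

```python
# gamma is only part of the search space when the sampled kernel is RBF.
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    kernel = trial.suggest_categorical("kernel", ["linear", "rbf"])
    C = trial.suggest_float("C", 1e-3, 1e3, log=True)
    if kernel == "rbf":
        gamma = trial.suggest_float("gamma", 1e-4, 1e1, log=True)
        model = SVC(kernel=kernel, C=C, gamma=gamma)
    else:
        model = SVC(kernel=kernel, C=C)
    return cross_val_score(model, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")  # uses the TPE sampler by default
study.optimize(objective, n_trials=50)
print(study.best_params)
```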
Ray Tune focuses on scalability, enabling distributed hyperparameter searches across hundreds of machines. It acts as a unified interface for multiple search algorithms and scheduling strategies, including ASHA, PBT, and Bayesian optimization via Optuna or HyperOpt backends.
Weights & Biases Sweeps integrates hyperparameter optimization with experiment tracking, making it easy to visualize how different hyperparameter combinations affect metrics over time. Its collaborative features are well suited for team-based machine learning projects.
Keras Tuner is designed for practitioners working within the Keras ecosystem and provides a simple interface for common tuning workflows, including Hyperband scheduling.
The learning rate is often considered the single most important hyperparameter in deep learning (Bengio, 2012). Rather than keeping the learning rate fixed throughout training, practitioners commonly adjust it over time using a learning rate schedule. This technique can significantly improve both convergence speed and final model performance.
Step Decay: The learning rate is reduced by a fixed factor at predetermined intervals. For example, the learning rate might be multiplied by 0.1 every 30 epochs. This is one of the simplest schedules and was widely used in early convolutional neural network research.
Exponential Decay: The learning rate decreases exponentially over time according to the formula lr(t) = lr_0 * e^(-kt), where k is the decay rate. This produces a smooth, continuous reduction.
Cosine Annealing: The learning rate follows a cosine curve from its initial value down to a minimum (often near zero). First proposed by Loshchilov and Hutter in 2016, cosine annealing decreases the learning rate slowly at first, then more rapidly in the middle of training, and slowly again near the end. This schedule has become popular for training transformers and vision models.
Warmup: Training begins with a very small learning rate that linearly increases to the target value over a set number of steps or epochs. Warmup helps stabilize training in the early stages when gradients can be large and erratic, particularly for models with batch normalization or large batch sizes. Most modern transformer training recipes combine a warmup phase with a subsequent decay schedule.
Warmup + Cosine Annealing: A common combined schedule, especially in transformer training, starts with linear warmup over a few thousand steps followed by cosine decay for the remainder of training. This combination has become a de facto standard for large language model pre-training.
Cyclical Learning Rates: Proposed by Leslie Smith in 2017, cyclical learning rates oscillate between a minimum and maximum value in a triangular or other repeating pattern. The intuition is that periodically increasing the learning rate can help the optimizer escape sharp local minima and find flatter, more generalizable solutions.
One-Cycle Policy: Also proposed by Smith, this schedule uses a single cycle of learning rate increase followed by decrease over the entire training run. It often achieves faster convergence and better final performance than fixed schedules, a phenomenon Smith termed "super-convergence."
The optimal learning rate schedule depends on the model architecture, dataset size, and training budget. For most modern deep learning tasks, warmup combined with cosine annealing provides a strong baseline. For quick experiments or smaller models, step decay remains effective and easy to configure.
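A hedged sketch of a warmup-plus-cosine schedule as a plain Python function; the step counts and rates in the example are placeholders.

```python
import math

def warmup_cosine_lr(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr over the remaining steps."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: peak learning rate 3e-4, 1,000 warmup steps, 100,000 total steps.
for s in [0, 500, 1000, 50000, 100000]:
    print(s, warmup_cosine_lr(s, 3e-4, 1000, 100000))
```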
Not all hyperparameters are equally important. Research by Hutter, Hoos, and Leyton-Brown (2014) on functional ANOVA decomposition showed that in many machine learning algorithms, most of the performance variation can be attributed to just a few hyperparameters. This observation has practical implications: practitioners should identify and focus their tuning effort on the hyperparameters that matter most for their specific problem.
For neural networks, the learning rate is consistently the most influential hyperparameter. Probst, Boulesteix, and Bischl (2019) conducted a large-scale study of hyperparameter tunability across many algorithms and found that the relative importance of hyperparameters varies by algorithm family but is often quite concentrated. For example, in gradient boosting, the learning rate and number of estimators tend to dominate; in random forests, the number of features considered at each split (max_features) and minimum samples per leaf are most consequential.
Sensitivity analysis can be performed in several ways:
- Varying one hyperparameter at a time around a known good configuration and recording the change in validation performance.
- Plotting validation performance against each hyperparameter using the results of a random search.
- Functional ANOVA and related decompositions, which attribute the variance in performance to individual hyperparameters and their interactions.
For LSTMs specifically, Greff et al. (2017) performed a large-scale hyperparameter study and found that learning rate and network size were the most critical hyperparameters, while batch size and momentum had relatively little impact on performance.
A long-standing goal in machine learning is to reduce the cost of hyperparameter tuning by transferring knowledge from previous experiments. There are several approaches to this problem.
When tuning hyperparameters for a new task that is similar to a previously solved task, the hyperparameter configurations that worked well on the old task can serve as strong starting points. Meta-learning approaches formalize this idea by training a model (or building a database) that predicts good hyperparameter configurations based on dataset characteristics. For example, Auto-sklearn uses meta-learning to warm-start Bayesian optimization by initializing it with configurations that performed well on similar datasets.
One of the most significant recent developments in hyperparameter transfer is the Maximal Update Parameterization (muP), introduced by Yang and Hu in their 2022 paper "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer." muP addresses a specific and expensive problem: when scaling a neural network to a larger size, hyperparameters (especially the learning rate and initialization scale) that worked at the smaller size often do not transfer, requiring costly re-tuning at each new scale.
muP re-parameterizes the model so that activation scales remain consistent across different model widths during training. Under this parameterization, the optimal learning rate, initialization variance, and other key hyperparameters remain stable as the model width increases. This enables a workflow called muTransfer:
1. Parameterize the model family with muP.
2. Tune hyperparameters on a small, inexpensive proxy version of the model.
3. Transfer the tuned hyperparameters directly to the full-size model and train it once, without further tuning at scale.
In experiments on GPT-3 scale models, muTransfer used only about 7% of pre-training compute for tuning and achieved performance comparable to a model twice its size that was tuned conventionally. This represents roughly an order of magnitude in compute savings for hyperparameter search.
muP has been adopted or studied by several organizations training large language models, and an improved variant called u-muP (unit-scaled muP) has been proposed to further simplify the parameterization by combining muP with unit scaling, ensuring that activations, weights, and gradients all begin training at a scale of one.
Hyperparameters affect virtually every aspect of model performance.
Convergence speed: The learning rate, optimizer, and batch size directly determine how quickly the model's loss function decreases during training. Poorly chosen values can cause training to stall, oscillate, or diverge entirely.
Generalization: Hyperparameters like dropout rate, weight decay, and regularization strength control how well the model performs on unseen data. These regularization hyperparameters help manage the overfitting/underfitting tradeoff.
Model capacity: The number of layers, hidden units, and (for tree models) maximum depth determine the complexity of functions the model can represent. Insufficient capacity leads to underfitting; excessive capacity leads to overfitting when not paired with adequate regularization.
Training stability: Certain hyperparameter combinations can cause training instabilities such as exploding or vanishing gradients, mode collapse in generative models, or numerical overflow. Techniques like learning rate warmup, gradient clipping, and careful initialization are hyperparameter-driven solutions to these problems.
Computational cost: Larger batch sizes, more layers, and longer training schedules all increase the computational resources required. In practice, hyperparameter selection often involves a tradeoff between model quality and compute budget.
Research has repeatedly shown that hyperparameter optimization can matter as much as, or more than, architectural changes: a well-tuned simple model often outperforms a poorly tuned complex model.
The following guidelines can help practitioners tune hyperparameters more effectively:
Start with established defaults. Most frameworks and papers provide recommended default values. Begin with these and adjust based on validation performance rather than starting from scratch.
Tune the most important hyperparameters first. For neural networks, the learning rate almost always has the largest effect on performance. Focus on it first before tuning other hyperparameters. For tree-based models, max_depth and learning_rate (for boosting) are typically the highest-priority hyperparameters.
Use logarithmic scales for continuous hyperparameters. Learning rate, regularization strength, and weight decay should be searched on a log scale (e.g., 1e-5, 1e-4, 1e-3, 1e-2) rather than a linear scale, because their effect is roughly proportional to their order of magnitude.
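For example, a log-spaced candidate set can be generated directly; the small snippet below uses NumPy and is purely illustrative.

```python
import numpy as np

# Five learning-rate candidates evenly spaced in order of magnitude: 1e-5 ... 1e-1.
log_candidates = np.logspace(-5, -1, num=5)
print(log_candidates)      # [1.e-05 1.e-04 1.e-03 1.e-02 1.e-01]

# A linear grid over the same interval would concentrate almost all points near 0.1.
linear_candidates = np.linspace(1e-5, 1e-1, num=5)
print(linear_candidates)
```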
Always use a separate validation set. Evaluate hyperparameter combinations on a held-out validation set, not the training set. For small datasets, use k-fold cross-validation to get a more reliable estimate of generalization performance.
Prefer random search over grid search for initial exploration. As Bergstra and Bengio (2012) showed, random search is more efficient when only a subset of hyperparameters are truly important, which is the typical case.
Graduate to Bayesian optimization for fine-tuning. After random search narrows the promising region, Bayesian optimization can efficiently zoom in on the best values with fewer evaluations.
Use early stopping. Monitor validation performance during training and stop when it begins to degrade. This acts as an implicit regularizer and saves computation. Many frameworks (including Optuna) support pruning of unpromising trials based on intermediate results.
Document and track experiments. Record every hyperparameter configuration and its corresponding performance. Tools like Weights & Biases, MLflow, and TensorBoard make this easier and enable retrospective analysis.
Consider computational budget. The choice of tuning method should reflect the available resources. Grid search may be acceptable for one or two hyperparameters, but random search or Bayesian optimization is more appropriate for larger search spaces.
Be aware of hyperparameter interactions. Some hyperparameters interact strongly. For example, learning rate and batch size are closely coupled, and learning rate and weight decay jointly affect regularization. Tuning them independently can miss the optimal combination.
Watch for validation set overfitting. When evaluating many configurations against the same validation data, the selected hyperparameters can overfit to that specific split. Use nested cross-validation or a separate test set to get an unbiased performance estimate.
Imagine you are teaching a robot how to play a game, but first it needs to know the rules it should follow. These instructions serve as hyperparameters in machine learning: they dictate how the robot should act and react when playing the game.
For example, you could instruct the robot to take one step at a time or two steps at once. This is similar to setting the learning rate hyperparameter in a machine learning model: it tells the model how much to adjust its predictions based on new data it encounters.
You could also instruct the robot to pay close attention to the other players or to focus solely on itself. This is similar to setting a regularization strength hyperparameter, which controls how closely the model is allowed to fit the training data so that it does not overfit.
The robot does not get to choose these rules itself. You have to decide them before the game starts. If you pick bad rules, the robot will play poorly no matter how many games it practices. If you pick good rules, the robot will learn quickly and play well. Finding the best rules is what hyperparameter tuning is all about.