In machine learning and artificial intelligence, scaling is a broad term that refers to several related but distinct concepts. In data preprocessing, scaling (often called feature scaling) is the process of adjusting the range or distribution of input features so they share a common scale. This normalization step is essential for many algorithms and can significantly affect model performance. In a separate but equally important sense, scaling describes how neural network performance changes as researchers increase model size, dataset size, or compute budget, a line of research known as neural scaling laws. More recently, the term also covers model scaling strategies (adjusting network width, depth, and resolution) and test-time scaling (allocating additional compute during inference to improve reasoning).
This article covers all of these uses, starting with feature scaling in data preprocessing and continuing through scaling laws, model architecture scaling, and test-time compute scaling.
Feature scaling is a critical preprocessing step because many machine learning algorithms are sensitive to the relative magnitudes of input features. Without scaling, a feature measured in thousands (such as annual income) can dominate a feature measured in single digits (such as number of bedrooms), leading to poor model behavior. The primary reasons scaling matters are outlined below.
Algorithms that use gradient descent for optimization, including linear regression, logistic regression, and neural networks, converge much faster when features are on a similar scale. When feature magnitudes differ widely, the loss surface becomes elongated, causing the optimizer to zigzag slowly toward the minimum. Scaling produces a more spherical loss surface, allowing the optimizer to take more direct steps.
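The effect can be reproduced with a hand-rolled optimizer (a minimal sketch: synthetic data, plain-NumPy gradient descent rather than any library routine):

```python
import numpy as np

def gd_iterations(X, y, lr, tol=1e-6, max_iter=100_000):
    """Gradient descent on mean squared error; returns iterations used."""
    w = np.zeros(X.shape[1])
    for i in range(max_iter):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        if np.linalg.norm(grad) < tol:
            return i
        w -= lr * grad
    return max_iter

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 500),         # small-scale feature
                     rng.uniform(0, 10_000, 500)])   # large-scale feature
y = X @ np.array([3.0, 0.002]) + rng.normal(0, 0.1, 500)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Raw features: badly conditioned loss surface forces a tiny learning rate.
print(gd_iterations(X, y, lr=1e-8))     # hits the iteration cap
# Standardized features: near-spherical surface, large stable steps.
print(gd_iterations(X_std, y, lr=0.1))  # converges in well under 100 steps
```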
Algorithms that compute distances between data points, such as k-nearest neighbors, k-means clustering, and support vector machines, are directly affected by feature magnitudes. A feature with a range of 0 to 100,000 will dominate Euclidean distance calculations over a feature with a range of 0 to 1. Scaling ensures every feature contributes proportionally to distance computations.
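A two-feature example (values invented for illustration) makes the dominance concrete:

```python
import numpy as np

# Two houses: (annual income in dollars, number of bedrooms)
a = np.array([50_000.0, 2.0])
b = np.array([52_000.0, 5.0])

# Raw Euclidean distance: the income gap swamps the bedroom gap.
print(np.linalg.norm(a - b))  # ~2000.0

# After min-max scaling with assumed ranges, both features matter.
lo = np.array([30_000.0, 1.0])
hi = np.array([200_000.0, 6.0])
a_s, b_s = (a - lo) / (hi - lo), (b - lo) / (hi - lo)
print(np.linalg.norm(a_s - b_s))  # ~0.6, driven by the bedroom difference
```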
Regularization techniques such as L1 regularization and L2 regularization add penalty terms based on the magnitudes of model weights. If features are on different scales, the corresponding weights will also be on different scales, and the regularization penalty will be applied unevenly. For instance, a feature with a large numeric range will have a small weight, receiving a disproportionately small penalty. Scaling ensures that the regularization parameter applies uniformly across all features, so feature selection and weight shrinkage reflect genuine predictive signal rather than differences in units.
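The sketch below (synthetic data; the 1000x unit difference is invented for illustration) fits ridge regression to two copies of the same signal expressed in different units:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x * 1000.0])   # identical signal, two different "units"
y = x.ravel() + rng.normal(0, 0.1, size=200)

model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_)  # roughly [0.0, 0.001]
# Ridge routes the fit through the large-scale copy, whose tiny weight
# incurs an almost-zero L2 penalty -- the penalty is applied unevenly.
```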
Scaling helps prevent issues related to floating-point arithmetic. Very large or very small feature values can lead to numerical overflow or underflow during matrix operations, gradient computations, and activation functions. Keeping values in a moderate range improves the stability of training.
Several techniques are commonly used to scale features. Each has its own strengths and is suited to different situations. The table below summarizes the most widely used methods.
| Method | Formula | Output Range | Handles Outliers | Best Used When |
|---|---|---|---|---|
| Min-Max Scaling | x' = (x - x_min) / (x_max - x_min) | [0, 1] | No | Bounded data with no extreme outliers |
| Standardization (Z-score) | x' = (x - mean) / std | Unbounded (mean=0, std=1) | Partially | Data is roughly Gaussian; most general-purpose choice |
| Max-Abs Scaling | x' = x / max(abs(x)) | [-1, 1] | No | Sparse data (preserves zero entries) |
| Robust Scaling | x' = (x - median) / IQR | Unbounded | Yes | Data with significant outliers |
| Unit Vector (L2 Norm) | x' = x / ‖x‖ | Unit length | No | When direction matters more than magnitude |
Min-max scaling, also called normalization, transforms each feature to a fixed range, typically [0, 1]. The formula is:
x' = (x - x_min) / (x_max - x_min)
This method preserves the original distribution shape and produces bounded outputs. It works well when features have a known, finite range and the data contains no extreme outliers. However, a single outlier can compress the remaining values into a narrow band, reducing the effective resolution of the feature. In scikit-learn, this is implemented as MinMaxScaler.
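A quick illustration (toy values) of both the basic transform and the outlier failure mode:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
print(MinMaxScaler().fit_transform(x).ravel())
# [0.   0.25 0.5  0.75 1.  ]

# A single outlier compresses the remaining values into a narrow band.
x_out = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])
print(MinMaxScaler().fit_transform(x_out).ravel().round(3))
# [0.    0.001 0.002 0.003 0.004 1.   ]
```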
Standardization, also known as z-score normalization, centers each feature to have a mean of zero and scales it to have a standard deviation of one:
x' = (x - mean) / std
This is the most widely used scaling method and a sensible default for algorithms such as support vector machines and logistic regression, which work best when features are centered around zero with comparable variance. Unlike min-max scaling, standardization does not bound the output to a specific range, so a single extreme value compresses the remaining data less severely. In scikit-learn, this is implemented as StandardScaler.
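A minimal check (toy values) confirming the zero-mean, unit-variance property and the gentler treatment of an outlier:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_std = StandardScaler().fit_transform(X)

print(X_std.mean().round(6), X_std.std().round(6))  # ~0.0 1.0
print(X_std.ravel().round(2))
# [-0.54 -0.51 -0.49 -0.46  2.  ] -- the outlier sits ~2 std out
# instead of pinning the other values into a sliver of [0, 1]
```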
Max-abs scaling divides each feature by its maximum absolute value:
x' = x / max(|x|)
This produces values in the range [-1, 1] without shifting or centering the data. Its key advantage is that it preserves sparsity: zero values remain zero after transformation. This makes it particularly useful for sparse datasets, such as those produced by bag of words representations or TF-IDF vectorization. In scikit-learn, this is implemented as MaxAbsScaler.
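Because MaxAbsScaler accepts scipy sparse matrices directly, a tiny count matrix (toy values) shows the sparsity preservation:

```python
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# A small bag-of-words-style count matrix (toy values)
X = csr_matrix([[0, 3, 0],
                [2, 0, 0],
                [0, 9, 1]])

X_scaled = MaxAbsScaler().fit_transform(X)  # stays sparse; zeros untouched
print(X_scaled.toarray().round(3))
# [[0.    0.333 0.   ]
#  [1.    0.    0.   ]
#  [0.    1.    1.   ]]
```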
Robust scaling uses the median and interquartile range (IQR) instead of the mean and standard deviation:
x' = (x - median) / IQR
Because the median and IQR are robust statistics, this method is much less sensitive to outliers than standardization or min-max scaling. It is the preferred choice when the dataset contains significant outliers that would otherwise distort the scaling. In scikit-learn, this is implemented as RobustScaler.
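A side-by-side comparison (toy values, one extreme outlier) shows why:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

# The outlier inflates the mean and std, squashing the inliers toward zero.
print(StandardScaler().fit_transform(x).ravel().round(2))
# [-0.45 -0.45 -0.45 -0.44 -0.44  2.24]

# Median and IQR barely move, so the inliers keep a usable spread.
print(RobustScaler().fit_transform(x).ravel().round(2))
# [ -1.   -0.6  -0.2   0.2   0.6 398.6]
```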
Mean normalization centers each feature by subtracting the mean and dividing by the range:
x' = (x - mean) / (x_max - x_min)
This produces values centered around zero, typically in the range [-1, 1]. It combines aspects of both min-max scaling and standardization.
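scikit-learn does not ship a dedicated mean-normalization scaler, but the transform is a one-liner in NumPy (a minimal sketch with toy values):

```python
import numpy as np

def mean_normalize(X):
    """Center each column by its mean, then divide by its range."""
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
print(mean_normalize(X))
# [[-0.5 -0.5]
#  [ 0.   0. ]
#  [ 0.5  0.5]]
```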
Not all machine learning models are equally sensitive to feature scale. The table below summarizes which categories of models benefit from scaling and which do not.
| Model Category | Examples | Scaling Needed? | Reason |
|---|---|---|---|
| Linear models | Linear regression, logistic regression | Yes | Weights are directly influenced by feature magnitudes; regularization requires uniform scales |
| Distance-based models | k-NN, k-means, SVM | Yes | Distance metrics are dominated by large-scale features |
| Neural networks | MLP, CNN, Transformer | Yes | Gradient-based optimization converges faster with scaled inputs |
| Gradient boosting | XGBoost, LightGBM, CatBoost | No | Splits are threshold-based; monotonic scaling leaves the ordering of feature values unchanged |
| Tree-based models | Decision tree, random forest | No | Splits compare values within each feature independently; relative ordering is unaffected by scaling |
| Naive Bayes | Gaussian NB, Multinomial NB | No | Parameters are estimated per feature independently |
Tree-based models do not need scaling because they make decisions by finding threshold values that best split the data at each node. These splits depend only on the relative ordering of values within each feature, not on their absolute magnitudes. Whether a feature ranges from 0 to 1 or from 0 to 1,000,000, the same split points produce the same partitioning of data.
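This invariance is easy to verify empirically (a sketch on synthetic data; the 50,000 threshold is arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 100_000, size=(200, 2))
y = (X[:, 0] > 50_000).astype(int)

scaler = StandardScaler().fit(X)
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X), y)

# The trees induce the same partitioning, so predictions agree exactly.
X_new = rng.uniform(0, 100_000, size=(50, 2))
print((tree_raw.predict(X_new) ==
       tree_scaled.predict(scaler.transform(X_new))).all())  # True
```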
The scikit-learn library provides a consistent API for feature scaling through its preprocessing module. All scalers follow the fit-transform pattern:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the example runs end to end
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)

# Split data first to prevent data leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit on training data only, then transform both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
A critical best practice is to fit the scaler only on the training data and then use the fitted scaler to transform the test data. Fitting on the full dataset (including test data) causes data leakage, because the scaler parameters (mean, standard deviation, min, max) would incorporate information from the test set. In production pipelines, the scaler should be serialized alongside the model so that new data receives the same transformation.
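Continuing the snippet above, a common pattern is to bundle the scaler and model into a single Pipeline and serialize that one object (a sketch using joblib; the file name is arbitrary):

```python
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", LogisticRegression())])
pipe.fit(X_train, y_train)          # scaler is fit on training data only

joblib.dump(pipe, "model.joblib")   # scaler + model saved together
pipe_loaded = joblib.load("model.joblib")
pipe_loaded.predict(X_test)         # new data receives the same transform
```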
Beyond data preprocessing, "scaling" in AI also refers to how model performance improves as key resources (parameters, data, and compute) increase. Researchers have discovered that neural network loss follows predictable power-law relationships with these resources, a finding with profound implications for how large models are trained.
In January 2020, researchers at OpenAI published "Scaling Laws for Neural Language Models," establishing that the cross-entropy loss of language models decreases as a smooth power law with respect to three factors: the number of model parameters (N), the size of the training dataset (D), and the amount of training compute (C). The relationship takes the form:
L(x) = (x_0 / x)^alpha
where x represents any of the three factors and alpha is a scaling exponent. For model parameters, alpha was found to range from approximately 0.07 to 0.08, meaning that each tenfold increase in parameters yields a consistent reduction in loss.
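For alpha ≈ 0.076, the improvement per order of magnitude can be computed directly (a back-of-envelope sketch; the exponent is Kaplan et al.'s approximate value for model parameters):

```python
# L(x) = (x_0 / x) ** alpha, so scaling x by 10 multiplies loss by 10 ** -alpha.
alpha = 0.076
factor = 10 ** -alpha
print(f"loss multiplier per 10x parameters: {factor:.3f}")   # ~0.839
print(f"i.e. a {(1 - factor) * 100:.1f}% reduction in loss")  # ~16.1%
```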
A key finding was that, for a fixed compute budget, optimal performance comes from training a very large model on a relatively modest amount of data and stopping training well before convergence. Kaplan et al. proposed that optimal model size should scale as N_opt proportional to C^0.7, while dataset size should scale as D_opt proportional to C^0.3. This implied that scaling up model parameters was more important than scaling up training data.
In 2022, a team at DeepMind led by Jordan Hoffmann challenged the Kaplan findings with the paper "Training Compute-Optimal Large Language Models." Their key contribution was the concept of compute-optimal training: for a given compute budget C, the model size N and dataset size D should be scaled in equal proportions.
The Chinchilla loss model is:
L(N, D) = A/N^alpha + B/D^beta + L_0
where A = 406.4, B = 410.7, alpha = 0.34, beta = 0.28, and L_0 = 1.69. The total compute is approximated as C = 6ND (six floating-point operations per parameter per token).
The critical finding was that both N and D should scale proportionally with compute: N_opt proportional to C^0.5 and D_opt proportional to C^0.5. In practical terms, this means approximately 20 training tokens per parameter for optimal compute efficiency. The Chinchilla model (70B parameters, 1.4 trillion tokens) outperformed the much larger Gopher (280B parameters, 300 billion tokens) while using the same compute budget, demonstrating that many existing large models were significantly undertrained.
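Combining the ~20 tokens-per-parameter rule with C = 6ND pins down the optimal allocation. The sketch below plugs in roughly the Gopher/Chinchilla training budget (5.76e23 FLOPs, as reported in the Chinchilla paper) and recovers Chinchilla-scale numbers:

```python
# With D = 20 * N and C = 6 * N * D, we get C = 120 * N**2,
# so N_opt = sqrt(C / 120) and D_opt = 20 * N_opt.
C = 5.76e23  # training budget in FLOPs (roughly Gopher/Chinchilla scale)
N_opt = (C / 120) ** 0.5
D_opt = 20 * N_opt
print(f"N_opt ~ {N_opt:.2e} parameters")  # ~6.9e10, i.e. about 70B
print(f"D_opt ~ {D_opt:.2e} tokens")      # ~1.4e12, i.e. about 1.4T
```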
The discrepancy between Kaplan and Chinchilla scaling laws stems from methodological differences. Kaplan et al. excluded embedding parameters from their parameter counts and used a fixed learning rate schedule that did not fully account for differences in training duration. When these differences are corrected, the two sets of results become more consistent. The Chinchilla framework has become the standard reference for compute-optimal training and has directly influenced the design of models such as LLaMA, which was explicitly trained following compute-optimal principles with significantly more data than earlier models of similar size.
Neural scaling laws have shaped the strategies of major AI laboratories. Rather than training the largest possible model, teams now balance model size with data quantity and training duration to maximize performance per unit of compute. This has led to a generation of models (LLaMA, Mistral, Gemma) that achieve strong performance at smaller sizes by training on much larger datasets than their predecessors.
Model scaling refers to strategies for adjusting the dimensions of a neural network architecture to improve performance. For convolutional neural networks, the three primary scaling dimensions are:

- Depth: the number of layers in the network
- Width: the number of channels (or units) in each layer
- Resolution: the height and width of the input images
The 2019 paper "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" by Mingxing Tan and Quoc Le introduced compound scaling, which scales all three dimensions simultaneously using a single compound coefficient. Instead of scaling depth, width, or resolution independently, the method uses the relationships:
depth: d = alpha^phi
width: w = beta^phi
resolution: r = gamma^phi

under the constraint alpha * beta^2 * gamma^2 approximately equals 2, where phi is the compound coefficient. The optimal base values were found to be alpha = 1.2, beta = 1.1, and gamma = 1.15.
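A short sketch (plain Python, base values from the EfficientNet paper) evaluating the multipliers for a few settings of the compound coefficient:

```python
alpha, beta, gamma = 1.2, 1.1, 1.15   # base values from the EfficientNet paper

# Check the FLOPs constraint: each unit increase in phi
# should roughly double compute.
print(f"alpha * beta^2 * gamma^2 = {alpha * beta**2 * gamma**2:.3f}")  # ~1.92

for phi in range(1, 4):
    d, w, r = alpha**phi, beta**phi, gamma**phi
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```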
The resulting EfficientNet family (B0 through B7) demonstrated that compound scaling produces consistently better accuracy per unit of compute compared to scaling any single dimension. EfficientNet-B7 achieved 84.4% top-1 accuracy on ImageNet while being 8.4 times smaller and 6.1 times faster than previous state-of-the-art models. EfficientNet V2, published in 2021, further improved training speed by incorporating progressive learning and adaptive regularization.
Compound scaling principles have since been extended beyond image classification to object detection, segmentation, and other vision tasks.
Test-time compute scaling (also called inference-time scaling) is a more recent paradigm in which additional computation is applied during inference, rather than during training, to improve model outputs. This approach has become a major research direction since 2024.
Traditional scaling focuses on making models larger or training them on more data. Test-time scaling takes a different approach: given a fixed pretrained model, spend more compute at inference time to produce better answers. This is analogous to a human thinking longer and more carefully about a difficult problem before answering.
OpenAI's o1 model, released in September 2024, demonstrated that test-time scaling could produce dramatic improvements in reasoning tasks. The model uses an extended internal chain of thought, generated through reinforcement learning, to "think" through problems step by step before producing a final answer. Performance scales smoothly with the amount of test-time compute allocated: more thinking time yields better results on mathematics, coding, and scientific reasoning benchmarks.
The o3 model, previewed in December 2024, pushed this further by achieving 87.5% on the ARC-AGI benchmark. OpenAI released o3-mini in early 2025 as a more cost-efficient variant optimized for strong reasoning at lower compute budgets.
Several approaches to test-time scaling have been explored; a minimal sketch of best-of-N sampling follows the table:
| Approach | Description | Example |
|---|---|---|
| Chain-of-thought | Model generates intermediate reasoning steps before the final answer | OpenAI o1, o3 |
| Best-of-N sampling | Generate multiple candidate answers and select the best one using a verifier | AlphaCode |
| Tree search | Explore a tree of reasoning paths and select the most promising branches | AlphaProof |
| Budget forcing | Control the length of reasoning chains to trade off between compute cost and accuracy | s1 model (2025) |
| Self-refinement | Model iteratively critiques and improves its own output | Various research papers |
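As one concrete illustration, best-of-N sampling is straightforward to sketch. Everything below is a toy stand-in: in practice `generate` samples from a language model and `verify` is a learned verifier or reward model.

```python
import random

def best_of_n(prompt, generate, verify, n=8):
    """Generate n candidates and return the one the verifier scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verify)

# Toy stand-ins so the sketch runs: "answers" are numbers and the
# verifier prefers values close to an arbitrary target of 0.5.
generate = lambda prompt: random.gauss(0.0, 1.0)
verify = lambda answer: -abs(answer - 0.5)

print(best_of_n("What is x?", generate, verify, n=16))  # close to 0.5
```

Increasing n spends more inference compute and, with a reliable verifier, yields better selected answers; this compute-for-quality trade is the essence of test-time scaling.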
Research from Snell et al. (2024) showed that scaling inference compute with the right strategies can be more effective than scaling model parameters. A smaller model with optimal test-time compute allocation can match or exceed the performance of a model that is 14 times larger, suggesting that compute-optimal inference may shift the balance between training and deployment costs.
Test-time scaling has significant cost implications. Reasoning models such as o1 and DeepSeek-R1 generate orders of magnitude more tokens than standard models. OpenAI's 2024 inference spending reportedly reached $2.3 billion, roughly 15 times the training cost for GPT-4.5. This has prompted research into more efficient inference strategies, including distillation of reasoning capabilities into smaller models.
Imagine you have a recipe that calls for 2 cups of flour and 1 teaspoon of salt. If you tried to compare those amounts directly, the flour would seem way more important because 2 cups is a bigger number than 1 teaspoon. But they are just measured in different units. Feature scaling is like converting everything to the same unit so the computer can fairly compare all the ingredients.
Scaling laws are a different idea. They are like noticing that if you give a student twice as many textbooks and twice as much study time, their test scores improve by a predictable amount. Scientists have found that AI models follow similar patterns: making them bigger or giving them more data improves their performance in a very predictable way.
Test-time scaling is like giving a student extra time on an exam. The student already knows what they know, but with more time to think carefully, check their work, and try different approaches, they get better scores.