In machine learning and artificial intelligence, scaling is a broad term that refers to several related but distinct concepts. In data preprocessing, scaling (often called feature scaling) is the process of adjusting the range or distribution of input features so they share a common scale. This normalization step is essential for many algorithms and can significantly affect model performance. In a separate but equally important sense, scaling describes how neural network performance changes as researchers increase model size, dataset size, or compute budget, a line of research known as neural scaling laws. More recently, the term also covers model scaling strategies (adjusting network width, depth, and resolution) and test-time scaling (allocating additional compute during inference to improve reasoning).
This article covers all of these uses, starting with feature scaling in data preprocessing and continuing through scaling laws, model architecture scaling, and test-time compute scaling.
Feature scaling is a critical preprocessing step because many machine learning algorithms are sensitive to the relative magnitudes of input features. Without scaling, a feature measured in thousands (such as annual income) can dominate a feature measured in single digits (such as number of bedrooms), leading to poor model behavior. The primary reasons scaling matters are outlined below.
Algorithms that use gradient descent for optimization, including linear regression, logistic regression, and neural networks, converge much faster when features are on a similar scale. When feature magnitudes differ widely, the loss surface becomes elongated, causing the optimizer to zigzag slowly toward the minimum. Scaling produces a more spherical loss surface, allowing the optimizer to take more direct steps.
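The effect can be reproduced with a hand-rolled optimizer (a minimal sketch: synthetic data, plain-NumPy gradient descent rather than any library routine):

```python
import numpy as np

def gd_iterations(X, y, lr, tol=1e-6, max_iter=100_000):
    """Gradient descent on mean squared error; returns iterations used."""
    w = np.zeros(X.shape[1])
    for i in range(max_iter):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        if np.linalg.norm(grad) < tol:
            return i
        w -= lr * grad
    return max_iter

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 500),         # small-scale feature
                     rng.uniform(0, 10_000, 500)])   # large-scale feature
y = X @ np.array([3.0, 0.002]) + rng.normal(0, 0.1, 500)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Raw features: badly conditioned loss surface forces a tiny learning rate.
print(gd_iterations(X, y, lr=1e-8))     # hits the iteration cap
# Standardized features: near-spherical surface, large stable steps.
print(gd_iterations(X_std, y, lr=0.1))  # converges in well under 100 steps
```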
Algorithms that compute distances between data points, such as k-nearest neighbors, k-means clustering, and support vector machines, are directly affected by feature magnitudes. A feature with a range of 0 to 100,000 will dominate Euclidean distance calculations over a feature with a range of 0 to 1. Scaling ensures every feature contributes proportionally to distance computations.
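A two-feature example (values invented for illustration) makes the dominance concrete:

```python
import numpy as np

# Two houses: (annual income in dollars, number of bedrooms)
a = np.array([50_000.0, 2.0])
b = np.array([52_000.0, 5.0])

# Raw Euclidean distance: the income gap swamps the bedroom gap.
print(np.linalg.norm(a - b))  # ~2000.0

# After min-max scaling with assumed ranges, both features matter.
lo = np.array([30_000.0, 1.0])
hi = np.array([200_000.0, 6.0])
a_s, b_s = (a - lo) / (hi - lo), (b - lo) / (hi - lo)
print(np.linalg.norm(a_s - b_s))  # ~0.6, driven by the bedroom difference
```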
Regularization techniques such as L1 regularization and L2 regularization add penalty terms based on the magnitudes of model weights. If features are on different scales, the corresponding weights will also be on different scales, and the regularization penalty will be applied unevenly. For instance, a feature with a large numeric range will have a small weight, receiving a disproportionately small penalty. Scaling ensures that the regularization parameter applies uniformly across all features, so feature selection and weight shrinkage reflect genuine predictive signal rather than differences in units.
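The sketch below (synthetic data; the 1000x unit difference is invented for illustration) fits ridge regression to two copies of the same signal expressed in different units:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x * 1000.0])   # identical signal, two different "units"
y = x.ravel() + rng.normal(0, 0.1, size=200)

model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_)  # roughly [0.0, 0.001]
# Ridge routes the fit through the large-scale copy, whose tiny weight
# incurs an almost-zero L2 penalty -- the penalty is applied unevenly.
```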
Scaling helps prevent issues related to floating-point arithmetic. Very large or very small feature values can lead to numerical overflow or underflow during matrix operations, gradient computations, and activation functions. Keeping values in a moderate range improves the stability of training.
Several techniques are commonly used to scale features. Each has its own strengths and is suited to different situations. The table below summarizes the most widely used methods.
| Method | Formula | Output Range | Handles Outliers | Best Used When |
|---|---|---|---|---|
| Min-Max Scaling | x' = (x - x_min) / (x_max - x_min) | [0, 1] | No | Bounded data with no extreme outliers |
| Standardization (Z-score) | x' = (x - mean) / std | Unbounded (mean=0, std=1) | Partially | Data is roughly Gaussian; most general-purpose choice |
| Max-Abs Scaling | x' = x / max(abs(x)) | [-1, 1] | No | Sparse data (preserves zero entries) |
| Robust Scaling | x' = (x - median) / IQR | Unbounded | Yes | Data with significant outliers |
| Unit Vector (L2 Norm) | x' = x / ‖x‖ | Unit length | No | When direction matters more than magnitude |
Min-max scaling, also called normalization, transforms each feature to a fixed range, typically [0, 1]. The formula is:
x' = (x - x_min) / (x_max - x_min)
This method preserves the original distribution shape and produces bounded outputs. It works well when features have a known, finite range and the data contains no extreme outliers. However, a single outlier can compress the remaining values into a narrow band, reducing the effective resolution of the feature. In scikit-learn, this is implemented as MinMaxScaler.
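A quick illustration (toy values) of both the basic transform and the outlier failure mode:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
print(MinMaxScaler().fit_transform(x).ravel())
# [0.   0.25 0.5  0.75 1.  ]

# A single outlier compresses the remaining values into a narrow band.
x_out = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])
print(MinMaxScaler().fit_transform(x_out).ravel().round(3))
# [0.    0.001 0.002 0.003 0.004 1.   ]
```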
Standardization, also known as z-score normalization, centers each feature to have a mean of zero and scales it to have a standard deviation of one:
x' = (x - mean) / std
This is the most widely used scaling method and a sensible default for algorithms such as support vector machines and logistic regression, which work best when features are centered around zero with comparable variance. Unlike min-max scaling, standardization does not bound the output to a specific range, so a single extreme value compresses the remaining data less severely. In scikit-learn, this is implemented as StandardScaler.
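A minimal check (toy values) confirming the zero-mean, unit-variance property and the gentler treatment of an outlier:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_std = StandardScaler().fit_transform(X)

print(X_std.mean().round(6), X_std.std().round(6))  # ~0.0 1.0
print(X_std.ravel().round(2))
# [-0.54 -0.51 -0.49 -0.46  2.  ] -- the outlier sits ~2 std out
# instead of pinning the other values into a sliver of [0, 1]
```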
Max-abs scaling divides each feature by its maximum absolute value:
x' = x / max(|x|)
This produces values in the range [-1, 1] without shifting or centering the data. Its key advantage is that it preserves sparsity: zero values remain zero after transformation. This makes it particularly useful for sparse datasets, such as those produced by bag of words representations or TF-IDF vectorization. In scikit-learn, this is implemented as MaxAbsScaler.
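Because MaxAbsScaler accepts scipy sparse matrices directly, a tiny count matrix (toy values) shows the sparsity preservation:

```python
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# A small bag-of-words-style count matrix (toy values)
X = csr_matrix([[0, 3, 0],
                [2, 0, 0],
                [0, 9, 1]])

X_scaled = MaxAbsScaler().fit_transform(X)  # stays sparse; zeros untouched
print(X_scaled.toarray().round(3))
# [[0.    0.333 0.   ]
#  [1.    0.    0.   ]
#  [0.    1.    1.   ]]
```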
Robust scaling uses the median and interquartile range (IQR) instead of the mean and standard deviation:
x' = (x - median) / IQR
Because the median and IQR are robust statistics, this method is much less sensitive to outliers than standardization or min-max scaling. It is the preferred choice when the dataset contains significant outliers that would otherwise distort the scaling. In scikit-learn, this is implemented as RobustScaler.
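A side-by-side comparison (toy values, one extreme outlier) shows why:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

# The outlier inflates the mean and std, squashing the inliers toward zero.
print(StandardScaler().fit_transform(x).ravel().round(2))
# [-0.45 -0.45 -0.45 -0.44 -0.44  2.24]

# Median and IQR barely move, so the inliers keep a usable spread.
print(RobustScaler().fit_transform(x).ravel().round(2))
# [ -1.   -0.6  -0.2   0.2   0.6 398.6]
```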
Mean normalization centers each feature by subtracting the mean and dividing by the range:
x' = (x - mean) / (x_max - x_min)
This produces values centered around zero, typically in the range [-1, 1]. It combines aspects of both min-max scaling and standardization.
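scikit-learn does not ship a dedicated mean-normalization scaler, but the transform is a one-liner in NumPy (a minimal sketch with toy values):

```python
import numpy as np

def mean_normalize(X):
    """Center each column by its mean, then divide by its range."""
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
print(mean_normalize(X))
# [[-0.5 -0.5]
#  [ 0.   0. ]
#  [ 0.5  0.5]]
```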
Not all machine learning models are equally sensitive to feature scale. The table below summarizes which categories of models benefit from scaling and which do not.
| Model Category | Examples | Scaling Needed? | Reason |
|---|---|---|---|
| Linear models | Linear regression, logistic regression | Yes | Weights are directly influenced by feature magnitudes; regularization requires uniform scales |
| Distance-based models | k-NN, k-means, SVM | Yes | Distance metrics are dominated by large-scale features |
| Neural networks | MLP, CNN, Transformer | Yes | Gradient-based optimization converges faster with scaled inputs |
| Gradient boosting | XGBoost, LightGBM, CatBoost | No | Splits are threshold-based; monotonic scaling leaves the ordering of feature values unchanged |
| Tree-based models | Decision tree, random forest | No | Splits compare values within each feature independently; relative ordering is unaffected by scaling |
| Naive Bayes | Gaussian NB, Multinomial NB | No | Parameters are estimated per feature independently |
Tree-based models do not need scaling because they make decisions by finding threshold values that best split the data at each node. These splits depend only on the relative ordering of values within each feature, not on their absolute magnitudes. Whether a feature ranges from 0 to 1 or from 0 to 1,000,000, the same split points produce the same partitioning of data.
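This invariance is easy to verify empirically (a sketch on synthetic data; the 50,000 threshold is arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 100_000, size=(200, 2))
y = (X[:, 0] > 50_000).astype(int)

scaler = StandardScaler().fit(X)
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X), y)

# The trees induce the same partitioning, so predictions agree exactly.
X_new = rng.uniform(0, 100_000, size=(50, 2))
print((tree_raw.predict(X_new) ==
       tree_scaled.predict(scaler.transform(X_new))).all())  # True
```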
The scikit-learn library provides a consistent API for feature scaling through its preprocessing module. All scalers follow the fit-transform pattern:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the example runs end to end
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)

# Split data first to prevent data leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit on training data only, then transform both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
A critical best practice is to fit the scaler only on the training data and then use the fitted scaler to transform the test data. Fitting on the full dataset (including test data) causes data leakage, because the scaler parameters (mean, standard deviation, min, max) would incorporate information from the test set. In production pipelines, the scaler should be serialized alongside the model so that new data receives the same transformation.
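Continuing the snippet above, a common pattern is to bundle the scaler and model into a single Pipeline and serialize that one object (a sketch using joblib; the file name is arbitrary):

```python
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", LogisticRegression())])
pipe.fit(X_train, y_train)          # scaler is fit on training data only

joblib.dump(pipe, "model.joblib")   # scaler + model saved together
pipe_loaded = joblib.load("model.joblib")
pipe_loaded.predict(X_test)         # new data receives the same transform
```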
Beyond data preprocessing, "scaling" in AI also refers to how model performance improves as key resources (parameters, data, and compute) increase. Researchers have discovered that neural network loss follows predictable power-law relationships with these resources, a finding with profound implications for how large models are trained.
In January 2020, researchers at OpenAI published "Scaling Laws for Neural Language Models," establishing that the cross-entropy loss of language models decreases as a smooth power law with respect to three factors: the number of model parameters (N), the size of the training dataset (D), and the amount of training compute (C). The relationship takes the form:
L(x) = (x_0 / x)^alpha
where x represents any of the three factors and alpha is a scaling exponent. For model parameters, alpha was found to range from approximately 0.07 to 0.08, meaning that each tenfold increase in parameters yields a consistent reduction in loss.
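For alpha ≈ 0.076, the improvement per order of magnitude can be computed directly (a back-of-envelope sketch; the exponent is Kaplan et al.'s approximate value for model parameters):

```python
# L(x) = (x_0 / x) ** alpha, so scaling x by 10 multiplies loss by 10 ** -alpha.
alpha = 0.076
factor = 10 ** -alpha
print(f"loss multiplier per 10x parameters: {factor:.3f}")   # ~0.839
print(f"i.e. a {(1 - factor) * 100:.1f}% reduction in loss")  # ~16.1%
```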
A key finding was that, for a fixed compute budget, optimal performance comes from training a very large model on a relatively modest amount of data and stopping training well before convergence. Kaplan et al. proposed that optimal model size should scale as N_opt proportional to C^0.7, while dataset size should scale as D_opt proportional to C^0.3. This implied that scaling up model parameters was more important than scaling up training data.
In 2022, a team at DeepMind led by Jordan Hoffmann challenged the Kaplan findings with the paper "Training Compute-Optimal Large Language Models." Their key contribution was the concept of compute-optimal training: for a given compute budget C, the model size N and dataset size D should be scaled in equal proportions.
The Chinchilla loss model is:
L(N, D) = A/N^alpha + B/D^beta + L_0
where A = 406.4, B = 410.7, alpha = 0.34, beta = 0.28, and L_0 = 1.69. The total compute is approximated as C = 6ND (six floating-point operations per parameter per token).
The critical finding was that both N and D should scale proportionally with compute: N_opt proportional to C^0.5 and D_opt proportional to C^0.5. In practical terms, this means approximately 20 training tokens per parameter for optimal compute efficiency. The Chinchilla model (70B parameters, 1.4 trillion tokens) outperformed the much larger Gopher (280B parameters, 300 billion tokens) while using the same compute budget, demonstrating that many existing large models were significantly undertrained.
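Combining the ~20 tokens-per-parameter rule with C = 6ND pins down the optimal allocation. The sketch below plugs in roughly the Gopher/Chinchilla training budget (5.76e23 FLOPs, as reported in the Chinchilla paper) and recovers Chinchilla-scale numbers:

```python
# With D = 20 * N and C = 6 * N * D, we get C = 120 * N**2,
# so N_opt = sqrt(C / 120) and D_opt = 20 * N_opt.
C = 5.76e23  # training budget in FLOPs (roughly Gopher/Chinchilla scale)
N_opt = (C / 120) ** 0.5
D_opt = 20 * N_opt
print(f"N_opt ~ {N_opt:.2e} parameters")  # ~6.9e10, i.e. about 70B
print(f"D_opt ~ {D_opt:.2e} tokens")      # ~1.4e12, i.e. about 1.4T
```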
The discrepancy between Kaplan and Chinchilla scaling laws stems from methodological differences. Kaplan et al. excluded embedding parameters from their parameter counts and used a fixed learning rate schedule that did not fully account for differences in training duration. When these differences are corrected, the two sets of results become more consistent. The Chinchilla framework has become the standard reference for compute-optimal training and has directly influenced the design of models such as LLaMA, which was explicitly trained following compute-optimal principles with significantly more data than earlier models of similar size.
Neural scaling laws have shaped the strategies of major AI laboratories. Rather than training the largest possible model, teams now balance model size with data quantity and training duration to maximize performance per unit of compute. This has led to a generation of models (LLaMA, Mistral, Gemma) that achieve strong performance at smaller sizes by training on much larger datasets than their predecessors.
Model scaling refers to strategies for adjusting the dimensions of a neural network architecture to improve performance. For convolutional neural networks, the three primary scaling dimensions are:

- Depth: the number of layers in the network
- Width: the number of channels (or units) in each layer
- Resolution: the height and width of the input images
The 2019 paper "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" by Mingxing Tan and Quoc Le introduced compound scaling, which scales all three dimensions simultaneously using a single compound coefficient. Instead of scaling depth, width, or resolution independently, the method uses the relationships:
depth: d = alpha^phi
width: w = beta^phi
resolution: r = gamma^phi

under the constraint alpha * beta^2 * gamma^2 approximately equals 2, where phi is the compound coefficient. The optimal base values were found to be alpha = 1.2, beta = 1.1, and gamma = 1.15.
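A short sketch (plain Python, base values from the EfficientNet paper) evaluating the multipliers for a few settings of the compound coefficient:

```python
alpha, beta, gamma = 1.2, 1.1, 1.15   # base values from the EfficientNet paper

# Check the FLOPs constraint: each unit increase in phi
# should roughly double compute.
print(f"alpha * beta^2 * gamma^2 = {alpha * beta**2 * gamma**2:.3f}")  # ~1.92

for phi in range(1, 4):
    d, w, r = alpha**phi, beta**phi, gamma**phi
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```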
The resulting EfficientNet family (B0 through B7) demonstrated that compound scaling produces consistently better accuracy per unit of compute compared to scaling any single dimension. EfficientNet-B7 achieved 84.4% top-1 accuracy on ImageNet while being 8.4 times smaller and 6.1 times faster than previous state-of-the-art models. EfficientNet V2, published in 2021, further improved training speed by incorporating progressive learning and adaptive regularization.
Compound scaling principles have since been extended beyond image classification to object detection, segmentation, and other vision tasks.
Test-time compute scaling (also called inference-time scaling) is a more recent paradigm in which additional computation is applied during inference, rather than during training, to improve model outputs. This approach has become a major research direction since 2024.
Traditional scaling focuses on making models larger or training them on more data. Test-time scaling takes a different approach: given a fixed pretrained model, spend more compute at inference time to produce better answers. This is analogous to a human thinking longer and more carefully about a difficult problem before answering.
OpenAI's o1 model, released in September 2024, demonstrated that test-time scaling could produce dramatic improvements in reasoning tasks. The model uses an extended internal chain of thought, generated through reinforcement learning, to "think" through problems step by step before producing a final answer. Performance scales smoothly with the amount of test-time compute allocated: more thinking time yields better results on mathematics, coding, and scientific reasoning benchmarks.
The o3 model, previewed in December 2024, pushed this further by achieving 87.5% on the ARC-AGI benchmark. OpenAI released o3-mini in early 2025 as a more cost-efficient variant optimized for strong reasoning at lower compute budgets.
Several approaches to test-time scaling have been explored; a minimal sketch of best-of-N sampling follows the table:
| Approach | Description | Example |
|---|---|---|
| Chain-of-thought | Model generates intermediate reasoning steps before the final answer | OpenAI o1, o3 |
| Best-of-N sampling | Generate multiple candidate answers and select the best one using a verifier | AlphaCode |
| Tree search | Explore a tree of reasoning paths and select the most promising branches | AlphaProof |
| Budget forcing | Control the length of reasoning chains to trade off between compute cost and accuracy | s1 model (2025) |
| Self-refinement | Model iteratively critiques and improves its own output | Various research papers |
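As one concrete illustration, best-of-N sampling is straightforward to sketch. Everything below is a toy stand-in: in practice `generate` samples from a language model and `verify` is a learned verifier or reward model.

```python
import random

def best_of_n(prompt, generate, verify, n=8):
    """Generate n candidates and return the one the verifier scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verify)

# Toy stand-ins so the sketch runs: "answers" are numbers and the
# verifier prefers values close to an arbitrary target of 0.5.
generate = lambda prompt: random.gauss(0.0, 1.0)
verify = lambda answer: -abs(answer - 0.5)

print(best_of_n("What is x?", generate, verify, n=16))  # close to 0.5
```

Increasing n spends more inference compute and, with a reliable verifier, yields better selected answers; this compute-for-quality trade is the essence of test-time scaling.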
Research from Snell et al. (2024) showed that scaling inference compute with the right strategies can be more effective than scaling model parameters. A smaller model with optimal test-time compute allocation can match or exceed the performance of a model that is 14 times larger, suggesting that compute-optimal inference may shift the balance between training and deployment costs.
Test-time scaling has significant cost implications. Reasoning models such as o1 and DeepSeek-R1 generate orders of magnitude more tokens than standard models. OpenAI's 2024 inference spending reportedly reached $2.3 billion, roughly 15 times the training cost for GPT-4.5. This has prompted research into more efficient inference strategies, including distillation of reasoning capabilities into smaller models.
Imagine you have a recipe that calls for 2 cups of flour and 1 teaspoon of salt. If you tried to compare those amounts directly, the flour would seem way more important because 2 cups is a bigger number than 1 teaspoon. But they are just measured in different units. Feature scaling is like converting everything to the same unit so the computer can fairly compare all the ingredients.
Scaling laws are a different idea. They are like noticing that if you give a student twice as many textbooks and twice as much study time, their test scores improve by a predictable amount. Scientists have found that AI models follow similar patterns: making them bigger or giving them more data improves their performance in a very predictable way.
Test-time scaling is like giving a student extra time on an exam. The student already knows what they know, but with more time to think carefully, check their work, and try different approaches, they get better scores.