Bucketing, also called binning, is a feature engineering technique in machine learning that converts continuous features into discrete categories by dividing a range of values into intervals known as "buckets" or "bins." Each data point is then assigned to the bucket whose interval contains its value. By grouping numerical data into a smaller number of categories, bucketing simplifies the input space, reduces the influence of outliers, and can help models capture non-linear relationships that would otherwise be difficult to learn.
Bucketing is widely used across data preprocessing, data visualization, and model training pipelines. It appears not only as a standalone preprocessing step but also as a built-in mechanism in frameworks such as TensorFlow and XGBoost, and as a batching strategy for sequence-to-sequence models in natural language processing.
Imagine you have a big pile of crayons in all sorts of colors. Instead of keeping track of every single shade, you sort them into a few boxes: one box for reds, one for blues, one for greens, and so on. Now when someone asks for a color, you just pick a box instead of searching through every crayon. Bucketing does the same thing with numbers. If you have ages like 3, 7, 12, 25, and 41, you might make a "kid" box (0 to 12), a "young adult" box (13 to 30), and an "adult" box (31 and up). The computer can then work with a few simple groups instead of every possible number.
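In code, the same grouping can be expressed with pandas in a few lines (a minimal sketch; the edges and labels are the ones from the example above):

```python
import pandas as pd

ages = pd.Series([3, 7, 12, 25, 41])
# Right edges 12 and 30 mirror the "kid", "young adult", "adult" boxes.
boxes = pd.cut(ages, bins=[0, 12, 30, 120],
               labels=["kid", "young adult", "adult"])
print(boxes.tolist())  # ['kid', 'kid', 'kid', 'young adult', 'adult']
```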
The general procedure for bucketing a continuous feature involves three steps:

1. Choose the number of buckets and the boundary values that separate them.
2. Assign each data point to the bucket whose interval contains its value.
3. Encode bucket membership, typically as an ordinal index or a one-hot vector, so the model can consume it.
Once bucketed, a single numerical feature with many unique values becomes a small set of categorical features. This allows models, particularly linear models, to learn a separate weight for each bucket rather than fitting a single slope across the entire range.
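As an illustrative sketch (the data here is made up), one-hot encoding the buckets lets a linear model fit one coefficient per bucket:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical feature and target, for illustration only.
df = pd.DataFrame({"age": [3, 7, 12, 25, 41, 60],
                   "spend": [5.0, 8.0, 9.0, 40.0, 55.0, 30.0]})

# One numeric column becomes one indicator column per bucket, so the
# linear model learns an independent weight for each age range.
buckets = pd.cut(df["age"], bins=[0, 12, 30, 120])
X = pd.get_dummies(buckets)
model = LinearRegression().fit(X, df["spend"])
```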
Several strategies exist for choosing bucket boundaries. The choice depends on the data distribution, the downstream model, and the domain.
Equal-width bucketing (also called uniform binning) divides the range of a feature into a fixed number of intervals of the same size. If the minimum value is v_min, the maximum is v_max, and the desired number of buckets is k, then the width of each bucket is:
width = (v_max - v_min) / k
Advantages. The method is simple to implement and easy to interpret. When data is approximately uniformly distributed, equal-width bins produce balanced groups.
Disadvantages. When data is skewed, most observations may fall into one or two bins while the remaining bins are nearly empty. Outliers stretch the range, making the problem worse. This imbalance can reduce a model's ability to distinguish between common values.
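A minimal NumPy sketch of equal-width bucketing, using the formula above:

```python
import numpy as np

values = np.array([1.2, 3.4, 5.0, 7.7, 9.9, 10.0])
k = 4
# k + 1 evenly spaced edges; the interior edges are the boundaries.
edges = np.linspace(values.min(), values.max(), k + 1)
bucket_ids = np.digitize(values, edges[1:-1])  # ids in 0..k-1
```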
Equal-frequency bucketing, sometimes called quantile-based binning, sets boundaries so that each bucket contains approximately the same number of data points. For k buckets, the boundaries correspond to the 1/k, 2/k, ..., (k-1)/k quantiles of the feature distribution.
Advantages. It handles skewed distributions well because every bucket receives a roughly equal share of examples. This gives the model sufficient training signal in every bucket and avoids the sparse-bin problem of equal-width methods.
Disadvantages. Bucket widths can vary dramatically. In heavily skewed data the widest bucket may span a range orders of magnitude larger than the narrowest, which can obscure meaningful differences among values in the long tail.
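A minimal NumPy sketch of equal-frequency bucketing, using synthetic skewed data for illustration:

```python
import numpy as np

values = np.random.lognormal(mean=9, sigma=0.5, size=1000)  # skewed
k = 5
# Boundaries at the 1/k, ..., (k-1)/k quantiles, so each bucket
# receives roughly the same number of observations.
boundaries = np.quantile(values, [i / k for i in range(1, k)])
bucket_ids = np.digitize(values, boundaries)  # ids in 0..k-1
```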
For features with right-skewed distributions (common with counts, monetary values, and web traffic metrics), bucket boundaries can be placed on a logarithmic scale. For example, boundaries at 1, 10, 100, 1000, and 10000 produce buckets that grow exponentially in width. This approach devotes more resolution to the dense lower range of values while still distinguishing among rare high values, combining some benefits of both equal-width and equal-frequency strategies.
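For illustration, a short sketch with powers of ten as boundaries:

```python
import numpy as np

counts = np.array([3, 42, 850, 12_000, 250_000])
# Powers of ten as boundaries: each bucket is 10x wider than the last.
bucket_ids = np.digitize(counts, [1, 10, 100, 1_000, 10_000])
# -> [1, 2, 3, 5, 5]
```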
Domain experts often define boundaries based on meaningful thresholds rather than statistical properties. Examples include:

- age groups aligned with legal or business cutoffs, such as 18 or 65
- clinical categories, such as standard blood pressure or BMI ranges
- income brackets that mirror tax or regulatory thresholds
Custom bucketing can capture domain knowledge that purely data-driven methods miss, but it requires human judgment and may not generalize across datasets.
| Method | Boundary rule | Best suited for | Main weakness |
|---|---|---|---|
| Equal-width | Fixed interval size | Uniformly distributed data | Sparse bins with skewed data |
| Equal-frequency | Quantile-based | Skewed distributions | Variable bin widths can obscure tail differences |
| Log-scale | Exponentially growing intervals | Right-skewed data (counts, prices) | Not suitable for data with negative values |
| Custom | Domain knowledge | Regulated or well-understood domains | Requires expert input; may not transfer |
The methods above are all unsupervised: they consider only the feature values, not the target variable. Supervised binning methods use the relationship between the feature and the target to find boundaries that maximize predictive power.
A single decision tree trained on one feature and the target variable naturally produces splits that minimize impurity (Gini or entropy). The resulting leaf nodes define bins whose boundaries are chosen to separate target classes as cleanly as possible. Scikit-learn's DecisionTreeClassifier or DecisionTreeRegressor with a constrained max_leaf_nodes parameter is a common way to implement this approach.
ChiMerge is a bottom-up merging algorithm that starts with one bin per distinct value and iteratively merges adjacent bins whose target distributions are statistically similar according to the chi-squared test. Merging stops when all remaining adjacent pairs are significantly different. This produces bins that are homogeneous with respect to the target.
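The following is a compact, illustrative sketch of the merging loop for a binary target, not a faithful reproduction of the published algorithm; the 0.5 smoothing term is an assumption added to keep the chi-squared test defined when a class count is zero:

```python
import numpy as np
from scipy.stats import chi2_contingency

def chimerge(values, labels, threshold=3.84):
    """ChiMerge sketch; threshold ~ chi2(df=1) critical value at p=0.05."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    # Start with one bin per distinct value, tracking class counts per bin.
    bins = []
    for v in np.unique(values):
        mask = values == v
        bins.append([v, [int(np.sum(labels[mask] == c)) for c in (0, 1)]])
    while len(bins) > 1:
        # Chi-squared statistic for every adjacent pair of bins.
        stats = [
            chi2_contingency(np.array([a[1], b[1]]) + 0.5, correction=False)[0]
            for a, b in zip(bins, bins[1:])
        ]
        i = int(np.argmin(stats))
        if stats[i] > threshold:  # all adjacent pairs differ: stop merging
            break
        # Merge the most similar adjacent pair into the left bin.
        bins[i][1] = [x + y for x, y in zip(bins[i][1], bins[i + 1][1])]
        del bins[i + 1]
    return [b[0] for b in bins[1:]]  # interior bucket boundaries
```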
Entropy-based methods recursively split the feature range at the point that maximizes information gain, similar to how a decision tree selects splits. The minimum description length principle (MDLP) criterion, proposed by Fayyad and Irani in 1993, adds a stopping condition that prevents over-splitting by penalizing model complexity.
Widely used in credit scoring and financial modeling, WoE binning groups feature values into bins that maximize the separation between positive and negative target classes. The weight of evidence for each bin is calculated as ln(proportion of positives / proportion of negatives). Optimal WoE bins are monotonic with respect to the target, which is often a regulatory requirement in financial applications.
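A minimal sketch of the WoE calculation on an already-binned feature (the data is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical binned feature and binary default flag.
df = pd.DataFrame({"bin": ["a", "a", "b", "b", "b", "c", "c"],
                   "default": [0, 1, 0, 0, 1, 0, 1]})

counts = pd.crosstab(df["bin"], df["default"])  # rows: bins, cols: 0/1
# WoE per bin = ln(bin's share of positives / bin's share of negatives)
woe = np.log((counts[1] / counts[1].sum()) / (counts[0] / counts[0].sum()))
```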
In natural language processing and speech recognition, "bucketing" refers to a different but related idea: grouping training sequences by length so that each mini-batch contains sequences of similar length. This usage is common in recurrent neural network and transformer training pipelines.
When sequences of varying lengths are combined in a single batch, shorter sequences must be padded with zeros to match the longest sequence. These padding tokens waste computation during forward and backward passes because the model processes them even though they carry no information.
Sequences are sorted by length and assigned to buckets (for example, lengths 1 to 20, 21 to 40, 41 to 60). During training, a batch is drawn from a single bucket so that all sequences in the batch have similar lengths. This reduces the amount of padding, speeds up training, and can lower memory usage.
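A sketch using TensorFlow's tf.data API (the sequences, boundaries, and batch sizes are illustrative):

```python
import tensorflow as tf

# Toy variable-length integer sequences, for illustration.
sequences = [[1, 2], [3, 4, 5], [6] * 25, [7] * 30, [8] * 45]
ds = tf.data.Dataset.from_generator(
    lambda: iter(sequences),
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.int32))

# Group sequences into length buckets (<21, 21-40, >=41) so each batch
# is padded only to the longest sequence in its own bucket.
ds = ds.apply(tf.data.experimental.bucket_by_sequence_length(
    element_length_func=lambda seq: tf.shape(seq)[0],
    bucket_boundaries=[21, 41],
    bucket_batch_sizes=[2, 2, 2]))  # one batch size per bucket
```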
Frameworks that support sequence bucketing include:
- tf.data.experimental.bucket_by_sequence_length() in TensorFlow's tf.data API
- PyTorch BucketBatchSampler implementations used with DataLoader

TensorFlow provides tf.feature_column.bucketized_column as a built-in way to bucketize numerical features for use with Estimator-based models. The function accepts a numeric column and a sorted list of boundary values, then produces a one-hot encoded categorical column.
For example, given boundaries [0., 1., 2.], TensorFlow creates four buckets: (negative infinity, 0), [0, 1), [1, 2), and [2, positive infinity). Each input value is mapped to the corresponding bucket and represented as a one-hot vector. Bucketized columns can be crossed with other categorical columns using tf.feature_column.crossed_column to capture interaction effects.
Note: the tf.feature_column API is deprecated in TensorFlow 2.x in favor of Keras preprocessing layers such as tf.keras.layers.Discretization.
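A minimal sketch of the Keras replacement, using the same boundaries as the example above:

```python
import tensorflow as tf

# Boundaries [0., 1., 2.] yield the four buckets described above.
discretize = tf.keras.layers.Discretization(bin_boundaries=[0.0, 1.0, 2.0])
discretize(tf.constant([[-1.5], [0.5], [1.5], [3.0]]))
# -> bucket indices [[0], [1], [2], [3]]
```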
For high-cardinality categorical features, hash-based bucketing (also called feature hashing or the hashing trick) applies a hash function to map feature values into a fixed number of buckets. Unlike standard bucketing of numerical ranges, hash-based bucketing operates on categorical values and does not require an explicit mapping table.
How it works. A hash function converts each feature value to an integer, and a modulo operation maps that integer to one of n buckets. The result is a fixed-size vector regardless of how many distinct values the feature has.
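A minimal sketch in plain Python; hashlib is used instead of the built-in hash() because the latter is randomized across processes:

```python
import hashlib

def hash_bucket(value: str, num_buckets: int) -> int:
    """Map a categorical value to one of num_buckets with a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

hash_bucket("user_12345", num_buckets=1000)  # always the same bucket
```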
Trade-offs. Hash collisions are inevitable: distinct values may land in the same bucket, causing some information loss. Increasing the number of buckets reduces collisions but increases dimensionality. In practice, moderate collision rates have a small impact on model accuracy while providing large savings in memory and computation.
Bucketing affects different model families in different ways.
| Model type | Effect of bucketing | Notes |
|---|---|---|
| Linear models and logistic regression | Often beneficial | Allows piecewise linear relationships; especially useful when the raw feature has a non-linear relationship with the target |
| Decision trees and ensemble methods | Rarely needed | Trees find their own split points; pre-bucketing may reduce granularity without adding value. Histogram-based methods like LightGBM already bin features internally. |
| Neural networks | Sometimes helpful | Can stabilize training when input distributions are highly skewed, but networks can learn non-linear mappings on their own given enough data |
| K-nearest neighbors | Generally harmful | Bucketing collapses distinct values, destroying distance information that KNN relies on |
Every bucketing scheme discards some information because all values within a bucket are treated identically. The key trade-off is between reducing noise (and overfitting) and losing signal. Using too few buckets leads to underfitting; using too many reintroduces the problems bucketing was meant to solve. Cross-validation is the standard way to select an appropriate number of buckets.
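A sketch of such a search with scikit-learn, assuming a feature matrix X and target y; the candidate bin counts are illustrative:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("buckets", KBinsDiscretizer(encode="onehot", strategy="quantile")),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Let cross-validation pick the bucket count.
search = GridSearchCV(pipe, {"buckets__n_bins": [3, 5, 10, 20, 50]}, cv=5)
# search.fit(X, y)  # X, y: your feature matrix and target
```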
One practical use of bucketing is to contain the influence of extreme values. Instead of clipping outliers or applying a log transform, a practitioner can place all values above (or below) a threshold into a single bucket. For example, if income values range from 0 to 10,000,000 but 99% of observations fall below 200,000, a boundary at 200,000 groups all extreme incomes together. This prevents outliers from dominating feature scaling or distorting model coefficients while preserving the distinction among more common values.
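A minimal pandas sketch of this capping strategy (the incomes and boundaries are illustrative):

```python
import numpy as np
import pandas as pd

income = pd.Series([12_000, 45_000, 88_000, 150_000, 9_500_000])
# Everything above 200,000 lands in a single open-ended bucket.
buckets = pd.cut(income, bins=[0, 50_000, 100_000, 200_000, np.inf])
```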
Outside of machine learning, bucketing is the foundation of histograms, one of the most common tools in exploratory data analysis. A histogram groups continuous values into bins and plots the count (or density) of observations per bin. The choice of bin width or bin count directly affects the visual impression of the data distribution; too few bins obscure structure, while too many bins introduce visual noise. Rules of thumb for choosing histogram bins, such as Sturges' rule (k = 1 + log2(n)) and the Freedman-Diaconis rule (bin width = 2 * IQR * n^(-1/3)), are in effect bucketing strategies optimized for visualization rather than prediction.
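NumPy implements both rules by name; a minimal sketch on synthetic data:

```python
import numpy as np

data = np.random.lognormal(mean=0, sigma=1, size=500)
sturges_edges = np.histogram_bin_edges(data, bins="sturges")
fd_edges = np.histogram_bin_edges(data, bins="fd")  # Freedman-Diaconis
```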
Consider a dataset of house prices where the feature "lot area" ranges from 1,300 to 215,000 square feet but is heavily right-skewed, with most lots between 5,000 and 15,000 square feet.
Equal-width bucketing with 5 bins would create intervals roughly 42,740 square feet wide. Nearly all observations would fall into the first bin, leaving the other four almost empty.
Equal-frequency bucketing with 5 bins (quintiles) would place about 20% of the data in each bin. Boundaries might fall near 7,500, 9,500, 11,100, and 13,600 square feet, giving the model useful resolution in the dense region.
Log-scale bucketing with boundaries at 1,000, 3,000, 10,000, 30,000, and 100,000 would provide finer resolution at the low end and coarser grouping for rare large lots.
Decision tree-based binning trained against the sale price target might place splits at 7,200, 10,800, and 16,000 square feet, reflecting the points where lot area has the strongest effect on price.
In Python using pandas and scikit-learn:
```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# df is assumed to contain 'lot_area' and 'sale_price' columns.

# Equal-width: 5 intervals of identical size
df['lot_area_eqw'] = pd.cut(df['lot_area'], bins=5)

# Equal-frequency (quantile): roughly 20% of rows per bin
df['lot_area_eqf'] = pd.qcut(df['lot_area'], q=5)

# Log-scale: fixed boundaries growing by roughly an order of magnitude
log_boundaries = [0, 3000, 10000, 30000, 100000, np.inf]
df['lot_area_log'] = pd.cut(df['lot_area'], bins=log_boundaries)

# Decision tree-based: split points chosen to best predict sale price
tree = DecisionTreeRegressor(max_leaf_nodes=5)
tree.fit(df[['lot_area']], df['sale_price'])
# Internal nodes splitting on feature 0 hold the learned thresholds
tree_boundaries = sorted(tree.tree_.threshold[tree.tree_.feature == 0])
```