Bucketing, also called binning, is a feature engineering technique in machine learning that converts continuous features into discrete categories by dividing a range of values into intervals known as "buckets" or "bins." Each data point is then assigned to the bucket whose interval contains its value. By grouping numerical data into a smaller number of categories, bucketing simplifies the input space, reduces the influence of outliers, and can help models capture non-linear relationships that would otherwise be difficult to learn.
Bucketing is widely used across data preprocessing, data visualization, and model training pipelines. It appears not only as a standalone preprocessing step but also as a built-in mechanism in frameworks such as TensorFlow and XGBoost, and as a batching strategy for sequence-to-sequence models in natural language processing.
Imagine you have a big pile of crayons in all sorts of colors. Instead of keeping track of every single shade, you sort them into a few boxes: one box for reds, one for blues, one for greens, and so on. Now when someone asks for a color, you just pick a box instead of searching through every crayon. Bucketing does the same thing with numbers. If you have ages like 3, 7, 12, 25, and 41, you might make a "kid" box (0 to 12), a "young adult" box (13 to 30), and an "adult" box (31 and up). The computer can then work with a few simple groups instead of every possible number.
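In code, the same grouping can be expressed with pandas in a few lines (a minimal sketch; the edges and labels are the ones from the example above):

```python
import pandas as pd

ages = pd.Series([3, 7, 12, 25, 41])
# Right edges 12 and 30 mirror the "kid", "young adult", "adult" boxes.
boxes = pd.cut(ages, bins=[0, 12, 30, 120],
               labels=["kid", "young adult", "adult"])
print(boxes.tolist())  # ['kid', 'kid', 'kid', 'young adult', 'adult']
```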
The general procedure for bucketing a continuous feature involves three steps:

1. Choose the number of buckets and the boundary values that separate them.
2. Assign each data point to the bucket whose interval contains its value.
3. Encode bucket membership, typically as an ordinal index or a one-hot vector, so the model can consume it.
Once bucketed, a single numerical feature with many unique values becomes a small set of categorical features. This allows models, particularly linear models, to learn a separate weight for each bucket rather than fitting a single slope across the entire range.
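As an illustrative sketch (the data here is made up), one-hot encoding the buckets lets a linear model fit one coefficient per bucket:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical feature and target, for illustration only.
df = pd.DataFrame({"age": [3, 7, 12, 25, 41, 60],
                   "spend": [5.0, 8.0, 9.0, 40.0, 55.0, 30.0]})

# One numeric column becomes one indicator column per bucket, so the
# linear model learns an independent weight for each age range.
buckets = pd.cut(df["age"], bins=[0, 12, 30, 120])
X = pd.get_dummies(buckets)
model = LinearRegression().fit(X, df["spend"])
```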
Several strategies exist for choosing bucket boundaries. The choice depends on the data distribution, the downstream model, and the domain.
Equal-width bucketing (also called uniform binning) divides the range of a feature into a fixed number of intervals of the same size. If the minimum value is v_min, the maximum is v_max, and the desired number of buckets is k, then the width of each bucket is:
width = (v_max - v_min) / k
Advantages. The method is simple to implement and easy to interpret. When data is approximately uniformly distributed, equal-width bins produce balanced groups.
Disadvantages. When data is skewed, most observations may fall into one or two bins while the remaining bins are nearly empty. Outliers stretch the range, making the problem worse. This imbalance can reduce a model's ability to distinguish between common values.
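A minimal NumPy sketch of equal-width bucketing, using the formula above:

```python
import numpy as np

values = np.array([1.2, 3.4, 5.0, 7.7, 9.9, 10.0])
k = 4
# k + 1 evenly spaced edges; the interior edges are the boundaries.
edges = np.linspace(values.min(), values.max(), k + 1)
bucket_ids = np.digitize(values, edges[1:-1])  # ids in 0..k-1
```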
Equal-frequency bucketing, sometimes called quantile-based binning, sets boundaries so that each bucket contains approximately the same number of data points. For k buckets, the boundaries correspond to the 1/k, 2/k, ..., (k-1)/k quantiles of the feature distribution.
Advantages. It handles skewed distributions well because every bucket receives a roughly equal share of examples. This gives the model sufficient training signal in every bucket and avoids the sparse-bin problem of equal-width methods.
Disadvantages. Bucket widths can vary dramatically. In heavily skewed data the widest bucket may span a range orders of magnitude larger than the narrowest, which can obscure meaningful differences among values in the long tail.
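A minimal NumPy sketch of equal-frequency bucketing, using synthetic skewed data for illustration:

```python
import numpy as np

values = np.random.lognormal(mean=9, sigma=0.5, size=1000)  # skewed
k = 5
# Boundaries at the 1/k, ..., (k-1)/k quantiles, so each bucket
# receives roughly the same number of observations.
boundaries = np.quantile(values, [i / k for i in range(1, k)])
bucket_ids = np.digitize(values, boundaries)  # ids in 0..k-1
```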
For features with right-skewed distributions (common with counts, monetary values, and web traffic metrics), bucket boundaries can be placed on a logarithmic scale. For example, boundaries at 1, 10, 100, 1000, and 10000 produce buckets that grow exponentially in width. This approach devotes more resolution to the dense lower range of values while still distinguishing among rare high values, combining some benefits of both equal-width and equal-frequency strategies.
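For illustration, a short sketch with powers of ten as boundaries:

```python
import numpy as np

counts = np.array([3, 42, 850, 12_000, 250_000])
# Powers of ten as boundaries: each bucket is 10x wider than the last.
bucket_ids = np.digitize(counts, [1, 10, 100, 1_000, 10_000])
# -> [1, 2, 3, 5, 5]
```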
Domain experts often define boundaries based on meaningful thresholds rather than statistical properties. Examples include:

- age groups aligned with legal or business cutoffs, such as 18 or 65
- clinical categories, such as standard blood pressure or BMI ranges
- income brackets that mirror tax or regulatory thresholds
Custom bucketing can capture domain knowledge that purely data-driven methods miss, but it requires human judgment and may not generalize across datasets.
| Method | Boundary rule | Best suited for | Main weakness |
|---|---|---|---|
| Equal-width | Fixed interval size | Uniformly distributed data | Sparse bins with skewed data |
| Equal-frequency | Quantile-based | Skewed distributions | Variable bin widths can obscure tail differences |
| Log-scale | Exponentially growing intervals | Right-skewed data (counts, prices) | Not suitable for data with negative values |
| Custom | Domain knowledge | Regulated or well-understood domains | Requires expert input; may not transfer |
The methods above are all unsupervised: they consider only the feature values, not the target variable. Supervised binning methods use the relationship between the feature and the target to find boundaries that maximize predictive power.
A single decision tree trained on one feature and the target variable naturally produces splits that minimize impurity (Gini or entropy). The resulting leaf nodes define bins whose boundaries are chosen to separate target classes as cleanly as possible. Scikit-learn's DecisionTreeClassifier or DecisionTreeRegressor with a constrained max_leaf_nodes parameter is a common way to implement this approach.
ChiMerge is a bottom-up merging algorithm that starts with one bin per distinct value and iteratively merges adjacent bins whose target distributions are statistically similar according to the chi-squared test. Merging stops when all remaining adjacent pairs are significantly different. This produces bins that are homogeneous with respect to the target.
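The following is a compact, illustrative sketch of the merging loop for a binary target, not a faithful reproduction of the published algorithm; the 0.5 smoothing term is an assumption added to keep the chi-squared test defined when a class count is zero:

```python
import numpy as np
from scipy.stats import chi2_contingency

def chimerge(values, labels, threshold=3.84):
    """ChiMerge sketch; threshold ~ chi2(df=1) critical value at p=0.05."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    # Start with one bin per distinct value, tracking class counts per bin.
    bins = []
    for v in np.unique(values):
        mask = values == v
        bins.append([v, [int(np.sum(labels[mask] == c)) for c in (0, 1)]])
    while len(bins) > 1:
        # Chi-squared statistic for every adjacent pair of bins.
        stats = [
            chi2_contingency(np.array([a[1], b[1]]) + 0.5, correction=False)[0]
            for a, b in zip(bins, bins[1:])
        ]
        i = int(np.argmin(stats))
        if stats[i] > threshold:  # all adjacent pairs differ: stop merging
            break
        # Merge the most similar adjacent pair into the left bin.
        bins[i][1] = [x + y for x, y in zip(bins[i][1], bins[i + 1][1])]
        del bins[i + 1]
    return [b[0] for b in bins[1:]]  # interior bucket boundaries
```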
Entropy-based methods recursively split the feature range at the point that maximizes information gain, similar to how a decision tree selects splits. The minimum description length principle (MDLP) criterion, proposed by Fayyad and Irani in 1993, adds a stopping condition that prevents over-splitting by penalizing model complexity.
Widely used in credit scoring and financial modeling, WoE binning groups feature values into bins that maximize the separation between positive and negative target classes. The weight of evidence for each bin is calculated as ln(proportion of positives / proportion of negatives). Optimal WoE bins are monotonic with respect to the target, which is often a regulatory requirement in financial applications.
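A minimal sketch of the WoE calculation on an already-binned feature (the data is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical binned feature and binary default flag.
df = pd.DataFrame({"bin": ["a", "a", "b", "b", "b", "c", "c"],
                   "default": [0, 1, 0, 0, 1, 0, 1]})

counts = pd.crosstab(df["bin"], df["default"])  # rows: bins, cols: 0/1
# WoE per bin = ln(bin's share of positives / bin's share of negatives)
woe = np.log((counts[1] / counts[1].sum()) / (counts[0] / counts[0].sum()))
```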
In natural language processing and speech recognition, "bucketing" refers to a different but related idea: grouping training sequences by length so that each mini-batch contains sequences of similar length. This usage is common in recurrent neural network and transformer training pipelines.
When sequences of varying lengths are combined in a single batch, shorter sequences must be padded with zeros to match the longest sequence. These padding tokens waste computation during forward and backward passes because the model processes them even though they carry no information.
Sequences are sorted by length and assigned to buckets (for example, lengths 1 to 20, 21 to 40, 41 to 60). During training, a batch is drawn from a single bucket so that all sequences in the batch have similar lengths. This reduces the amount of padding, speeds up training, and can lower memory usage.
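A sketch using TensorFlow's tf.data API (the sequences, boundaries, and batch sizes are illustrative):

```python
import tensorflow as tf

# Toy variable-length integer sequences, for illustration.
sequences = [[1, 2], [3, 4, 5], [6] * 25, [7] * 30, [8] * 45]
ds = tf.data.Dataset.from_generator(
    lambda: iter(sequences),
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.int32))

# Group sequences into length buckets (<21, 21-40, >=41) so each batch
# is padded only to the longest sequence in its own bucket.
ds = ds.apply(tf.data.experimental.bucket_by_sequence_length(
    element_length_func=lambda seq: tf.shape(seq)[0],
    bucket_boundaries=[21, 41],
    bucket_batch_sizes=[2, 2, 2]))  # one batch size per bucket
```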
Frameworks that support sequence bucketing include:
- tf.data.experimental.bucket_by_sequence_length() in TensorFlow's tf.data API
- PyTorch BucketBatchSampler implementations used with DataLoader

TensorFlow provides tf.feature_column.bucketized_column as a built-in way to bucketize numerical features for use with Estimator-based models. The function accepts a numeric column and a sorted list of boundary values, then produces a one-hot encoded categorical column.
For example, given boundaries [0., 1., 2.], TensorFlow creates four buckets: (negative infinity, 0), [0, 1), [1, 2), and [2, positive infinity). Each input value is mapped to the corresponding bucket and represented as a one-hot vector. Bucketized columns can be crossed with other categorical columns using tf.feature_column.crossed_column to capture interaction effects.
Note: the tf.feature_column API is deprecated in TensorFlow 2.x in favor of Keras preprocessing layers such as tf.keras.layers.Discretization.
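A minimal sketch of the Keras replacement, using the same boundaries as the example above:

```python
import tensorflow as tf

# Boundaries [0., 1., 2.] yield the four buckets described above.
discretize = tf.keras.layers.Discretization(bin_boundaries=[0.0, 1.0, 2.0])
discretize(tf.constant([[-1.5], [0.5], [1.5], [3.0]]))
# -> bucket indices [[0], [1], [2], [3]]
```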
For high-cardinality categorical features, hash-based bucketing (also called feature hashing or the hashing trick) applies a hash function to map feature values into a fixed number of buckets. Unlike standard bucketing of numerical ranges, hash-based bucketing operates on categorical values and does not require an explicit mapping table.
How it works. A hash function converts each feature value to an integer, and a modulo operation maps that integer to one of n buckets. The result is a fixed-size vector regardless of how many distinct values the feature has.
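A minimal sketch in plain Python; hashlib is used instead of the built-in hash() because the latter is randomized across processes:

```python
import hashlib

def hash_bucket(value: str, num_buckets: int) -> int:
    """Map a categorical value to one of num_buckets with a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

hash_bucket("user_12345", num_buckets=1000)  # always the same bucket
```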
Trade-offs. Hash collisions are inevitable: distinct values may land in the same bucket, causing some information loss. Increasing the number of buckets reduces collisions but increases dimensionality. In practice, moderate collision rates have a small impact on model accuracy while providing large savings in memory and computation.
Bucketing affects different model families in different ways.
| Model type | Effect of bucketing | Notes |
|---|---|---|
| Linear models and logistic regression | Often beneficial | Allows piecewise linear relationships; especially useful when the raw feature has a non-linear relationship with the target |
| Decision trees and ensemble methods | Rarely needed | Trees find their own split points; pre-bucketing may reduce granularity without adding value. Histogram-based methods like LightGBM already bin features internally. |
| Neural networks | Sometimes helpful | Can stabilize training when input distributions are highly skewed, but networks can learn non-linear mappings on their own given enough data |
| K-nearest neighbors | Generally harmful | Bucketing collapses distinct values, destroying distance information that KNN relies on |
Every bucketing scheme discards some information because all values within a bucket are treated identically. The key trade-off is between reducing noise (and overfitting) and losing signal. Using too few buckets leads to underfitting; using too many reintroduces the problems bucketing was meant to solve. Cross-validation is the standard way to select an appropriate number of buckets.
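A sketch of such a search with scikit-learn, assuming a feature matrix X and target y; the candidate bin counts are illustrative:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("buckets", KBinsDiscretizer(encode="onehot", strategy="quantile")),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Let cross-validation pick the bucket count.
search = GridSearchCV(pipe, {"buckets__n_bins": [3, 5, 10, 20, 50]}, cv=5)
# search.fit(X, y)  # X, y: your feature matrix and target
```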
One practical use of bucketing is to contain the influence of extreme values. Instead of clipping outliers or applying a log transform, a practitioner can place all values above (or below) a threshold into a single bucket. For example, if income values range from 0 to 10,000,000 but 99% of observations fall below 200,000, a boundary at 200,000 groups all extreme incomes together. This prevents outliers from dominating feature scaling or distorting model coefficients while preserving the distinction among more common values.
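A minimal pandas sketch of this capping strategy (the incomes and boundaries are illustrative):

```python
import numpy as np
import pandas as pd

income = pd.Series([12_000, 45_000, 88_000, 150_000, 9_500_000])
# Everything above 200,000 lands in a single open-ended bucket.
buckets = pd.cut(income, bins=[0, 50_000, 100_000, 200_000, np.inf])
```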
Outside of machine learning, bucketing is the foundation of histograms, one of the most common tools in exploratory data analysis. A histogram groups continuous values into bins and plots the count (or density) of observations per bin. The choice of bin width or bin count directly affects the visual impression of the data distribution; too few bins obscure structure, while too many bins introduce visual noise. Rules of thumb for choosing histogram bins, such as Sturges' rule (k = 1 + log2(n)) and the Freedman-Diaconis rule (bin width = 2 * IQR * n^(-1/3)), are in effect bucketing strategies optimized for visualization rather than prediction.
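NumPy implements both rules by name; a minimal sketch on synthetic data:

```python
import numpy as np

data = np.random.lognormal(mean=0, sigma=1, size=500)
sturges_edges = np.histogram_bin_edges(data, bins="sturges")
fd_edges = np.histogram_bin_edges(data, bins="fd")  # Freedman-Diaconis
```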
Consider a dataset of house prices where the feature "lot area" ranges from 1,300 to 215,000 square feet but is heavily right-skewed, with most lots between 5,000 and 15,000 square feet.
Equal-width bucketing with 5 bins would create intervals roughly 42,740 square feet wide. Nearly all observations would fall into the first bin, leaving the other four almost empty.
Equal-frequency bucketing with 5 bins (quintiles) would place about 20% of the data in each bin. Boundaries might fall near 7,500, 9,500, 11,100, and 13,600 square feet, giving the model useful resolution in the dense region.
Log-scale bucketing with boundaries at 1,000, 3,000, 10,000, 30,000, and 100,000 would provide finer resolution at the low end and coarser grouping for rare large lots.
Decision tree-based binning trained against the sale price target might place splits at 7,200, 10,800, and 16,000 square feet, reflecting the points where lot area has the strongest effect on price.
In Python using pandas and scikit-learn:
```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# df is assumed to contain 'lot_area' and 'sale_price' columns.

# Equal-width: 5 intervals of identical size
df['lot_area_eqw'] = pd.cut(df['lot_area'], bins=5)

# Equal-frequency (quantile): roughly 20% of rows per bin
df['lot_area_eqf'] = pd.qcut(df['lot_area'], q=5)

# Log-scale: fixed boundaries growing by roughly an order of magnitude
log_boundaries = [0, 3000, 10000, 30000, 100000, np.inf]
df['lot_area_log'] = pd.cut(df['lot_area'], bins=log_boundaries)

# Decision tree-based: split points chosen to best predict sale price
tree = DecisionTreeRegressor(max_leaf_nodes=5)
tree.fit(df[['lot_area']], df['sale_price'])
# Internal nodes splitting on feature 0 hold the learned thresholds
tree_boundaries = sorted(tree.tree_.threshold[tree.tree_.feature == 0])
```