# Bucketing

> Source: https://aiwiki.ai/wiki/bucketing
> Updated: 2026-06-01
> Categories: Data & Datasets, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Bucketing**, also called [binning](/wiki/binning), is a [feature engineering](/wiki/feature_engineering) technique in [machine learning](/wiki/machine_learning) that converts [continuous features](/wiki/continuous_feature) into discrete categories by dividing a range of values into intervals known as "buckets" or "bins."[4] Each data point is then assigned to the bucket whose interval contains its value. By grouping [numerical data](/wiki/numerical_data) into a smaller number of categories, bucketing simplifies the input space, reduces the influence of [outliers](/wiki/outliers), and can help models capture non-linear relationships that would otherwise be difficult to learn.[5]

Bucketing is widely used across data preprocessing, [data visualization](/wiki/data_visualization), and model training pipelines. It appears not only as a standalone preprocessing step but also as a built-in mechanism in frameworks such as [TensorFlow](/wiki/tensorflow) and [XGBoost](/wiki/xgboost), and as a batching strategy for [sequence-to-sequence](/wiki/sequence-to-sequence_task) models in [natural language processing](/wiki/natural_language_processing).[10]

## ELI5 (explain like I'm 5)

Imagine you have a big pile of crayons in all sorts of colors. Instead of keeping track of every single shade, you sort them into a few boxes: one box for reds, one for blues, one for greens, and so on. Now when someone asks for a color, you just pick a box instead of searching through every crayon. Bucketing does the same thing with numbers. If you have ages like 3, 7, 12, 25, and 41, you might make a "kid" box (0 to 12), a "young adult" box (13 to 30), and an "adult" box (31 and up). The computer can then work with a few simple groups instead of every possible number.

## How bucketing works

The general procedure for bucketing a continuous feature involves three steps:

1. **Define bucket boundaries.** Choose a set of boundary values that partition the feature's range into non-overlapping intervals. For example, given the boundaries [0, 10, 20, 30], the resulting buckets are (negative infinity, 0), [0, 10), [10, 20), [20, 30), and [30, positive infinity).
2. **Assign each value to a bucket.** For every observation, determine which interval contains its value and label it accordingly.
3. **Encode the bucket labels.** The bucket assignments are typically [one-hot encoded](/wiki/one-hot_encoding) so that each bucket becomes a separate binary feature, or they are represented as an ordinal integer.[4]

Once bucketed, a single numerical feature with many unique values becomes a small set of [categorical features](/wiki/categorical_data). This allows models, particularly [linear models](/wiki/linear_model), to learn a separate weight for each bucket rather than fitting a single slope across the entire range.[5]
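The three steps above can be sketched with NumPy. This is a minimal illustration, assuming the boundaries `[0, 10, 20, 30]` from step 1; the sample `values` are made up for the example.

```python
import numpy as np

# Step 1: boundaries from the example above.
boundaries = np.array([0, 10, 20, 30])

# Step 2: np.digitize returns the index of the interval containing each value.
# With the default right=False, intervals are closed on the left:
# index 0 is (-inf, 0), index 1 is [0, 10), ..., index 4 is [30, +inf).
values = np.array([-5.0, 0.0, 9.9, 10.0, 25.0, 30.0, 120.0])  # illustrative sample
bucket_ids = np.digitize(values, boundaries)

# Step 3: one-hot encode. There are len(boundaries) + 1 buckets in total.
n_buckets = len(boundaries) + 1
one_hot = np.eye(n_buckets, dtype=int)[bucket_ids]

print(bucket_ids.tolist())  # [0, 1, 1, 2, 3, 4, 4]
```

Keeping `bucket_ids` as-is corresponds to the ordinal-integer encoding; `one_hot` corresponds to the one-hot encoding.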

## Types of bucketing

Several strategies exist for choosing bucket boundaries. The choice depends on the data distribution, the downstream model, and the domain.

### Equal-width bucketing

Equal-width bucketing (also called uniform binning) divides the range of a feature into a fixed number of intervals of the same size.[5] If the minimum value is *v_min*, the maximum is *v_max*, and the desired number of buckets is *k*, then the width of each bucket is:

```
width = (v_max - v_min) / k
```

**Advantages.** The method is simple to implement and easy to interpret. When data is approximately uniformly distributed, equal-width bins produce balanced groups.

**Disadvantages.** When data is skewed, most observations may fall into one or two bins while the remaining bins are nearly empty. Outliers stretch the range, making the problem worse. This imbalance can reduce a model's ability to distinguish between common values.
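A small sketch of the width formula and of the sparse-bin problem it can cause. The helper name `equal_width_edges` and the sample data are hypothetical, chosen only to show how two large values stretch the range.

```python
import numpy as np

def equal_width_edges(values, k):
    """Return k + 1 edges that split [v_min, v_max] into k equal-width buckets."""
    v_min, v_max = float(np.min(values)), float(np.max(values))
    width = (v_max - v_min) / k
    return np.array([v_min + i * width for i in range(k + 1)])

# Illustrative right-skewed sample: two large values stretch the range.
values = np.array([1, 2, 3, 4, 5, 6, 8, 10, 50, 100], dtype=float)
edges = equal_width_edges(values, k=5)

# Assign using interior edges only, so the maximum lands in the last bucket.
bucket_ids = np.digitize(values, edges[1:-1])
counts = np.bincount(bucket_ids, minlength=5)

print(counts.tolist())  # [8, 0, 1, 0, 1]
```

Eight of the ten values land in the first bucket, while two buckets stay empty.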

### Equal-frequency (quantile) bucketing

Equal-frequency bucketing, sometimes called [quantile](/wiki/quantile)-based binning, sets boundaries so that each bucket contains approximately the same number of data points.[10] For *k* buckets, the boundaries correspond to the 1/k, 2/k, ..., (k-1)/k quantiles of the feature distribution.

**Advantages.** It handles skewed distributions well because every bucket receives a roughly equal share of examples. This gives the model sufficient training signal in every bucket and avoids the sparse-bin problem of equal-width methods.

**Disadvantages.** Bucket widths can vary dramatically. In heavily skewed data the widest bucket may span a range orders of magnitude larger than the narrowest, which can obscure meaningful differences among values in the long tail.
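As a rough sketch, the quantile boundaries can be computed with `np.quantile`. The helper name `equal_frequency_edges` and the sample data are illustrative assumptions; NumPy's default linear interpolation between order statistics is used.

```python
import numpy as np

def equal_frequency_edges(values, k):
    """Return the k - 1 interior boundaries at the 1/k, ..., (k-1)/k quantiles."""
    probs = np.arange(1, k) / k
    return np.quantile(values, probs)

# Illustrative right-skewed sample.
values = np.array([1, 2, 3, 4, 5, 6, 8, 10, 50, 100], dtype=float)
interior = equal_frequency_edges(values, k=5)

bucket_ids = np.digitize(values, interior)
counts = np.bincount(bucket_ids, minlength=5)

print(counts.tolist())  # [2, 2, 2, 2, 2]
```

Every bucket holds two values, but the last bucket covers everything from 18 upward while the first few are about two units wide, which illustrates the variable-width drawback. With many tied values the counts would only be approximately equal.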

### Log-scale bucketing

For features with right-skewed distributions (common with counts, monetary values, and web traffic metrics), bucket boundaries can be placed on a logarithmic scale. For example, boundaries at 1, 10, 100, 1000, and 10000 produce buckets that grow exponentially in width. This approach devotes more resolution to the dense lower range of values while still distinguishing among rare high values, combining some benefits of both equal-width and equal-frequency strategies.[10]
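A minimal sketch of the example boundaries above, built as powers of ten. The sample `values` are made up; values below the first boundary fall into bucket 0 and values at or above the last boundary fall into the final bucket.

```python
import numpy as np

# Boundaries at 1, 10, 100, 1000, 10000, as in the example above.
boundaries = 10.0 ** np.arange(0, 5)

# Illustrative values spanning several orders of magnitude.
values = np.array([0.5, 3, 42, 750, 9_999, 250_000], dtype=float)
bucket_ids = np.digitize(values, boundaries)

print(bucket_ids.tolist())  # [0, 1, 2, 3, 4, 5]
```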

### Custom (domain-specific) bucketing

Domain experts often define boundaries based on meaningful thresholds rather than statistical properties. Examples include:

- **Age groups:** 0 to 12 (child), 13 to 17 (adolescent), 18 to 64 (adult), 65 and above (senior).
- **Income brackets:** tax brackets or poverty-line thresholds.
- **Credit scoring:** risk bands defined by regulatory or business rules.

Custom bucketing can capture domain knowledge that purely data-driven methods miss, but it requires human judgment and may not generalize across datasets.[8]
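The age groups listed above could be expressed with `pd.cut` and explicit labels. This is a sketch: the sample ages are invented, and `right=False` is chosen so that each interval includes its lower bound (for instance, 13 counts as adolescent).

```python
import numpy as np
import pandas as pd

ages = pd.Series([3, 12, 13, 17, 18, 40, 64, 65, 90])  # illustrative sample

# Left-closed intervals: [0, 13), [13, 18), [18, 65), [65, +inf).
age_group = pd.cut(
    ages,
    bins=[0, 13, 18, 65, np.inf],
    labels=["child", "adolescent", "adult", "senior"],
    right=False,
)

print(age_group.astype(str).tolist())
```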

### Comparison of bucketing methods

| Method | Boundary rule | Best suited for | Main weakness |
|---|---|---|---|
| Equal-width | Fixed interval size | Uniformly distributed data | Sparse bins with skewed data |
| Equal-frequency | Quantile-based | Skewed distributions | Variable bin widths can obscure tail differences |
| Log-scale | Exponentially growing intervals | Right-skewed data (counts, prices) | Not suitable for data with negative values |
| Custom | Domain knowledge | Regulated or well-understood domains | Requires expert input; may not transfer |

## Supervised (optimal) binning methods

The methods above are all unsupervised: they consider only the feature values, not the target variable.[1] Supervised binning methods use the relationship between the feature and the target to find boundaries that maximize predictive power.[7]

### Decision tree-based binning

A single [decision tree](/wiki/decision_tree) trained on one feature and the target variable naturally produces splits that minimize impurity (Gini or entropy for classification, variance for regression). The resulting leaf nodes define bins whose boundaries are chosen to separate target values as cleanly as possible.[1] Scikit-learn's `DecisionTreeClassifier` or `DecisionTreeRegressor` with a constrained `max_leaf_nodes` parameter is a common way to implement this approach.

### ChiMerge

ChiMerge is a bottom-up merging algorithm that starts with one bin per distinct value and iteratively merges adjacent bins whose target distributions are statistically similar according to the chi-squared test.[3] Merging stops when all remaining adjacent pairs are significantly different. This produces bins that are homogeneous with respect to the target.
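The merging loop can be sketched in a few lines of NumPy. This is a simplified illustration rather than a reference implementation: the function names are hypothetical, and the default threshold of 3.841 (the 0.05 critical value of the chi-squared distribution with one degree of freedom, appropriate for two classes) is an assumption made for the example.

```python
import numpy as np

def chi2_pair(a, b):
    """Chi-squared statistic for the class-count vectors of two adjacent bins."""
    observed = np.array([a, b], dtype=float)
    row = observed.sum(axis=1, keepdims=True)
    col = observed.sum(axis=0, keepdims=True)
    expected = row * col / observed.sum()
    mask = expected > 0  # skip classes absent from both bins
    return float((((observed - expected) ** 2)[mask] / expected[mask]).sum())

def chimerge(x, y, threshold=3.841):
    """Merge adjacent bins until every adjacent pair has chi2 >= threshold.

    Returns the lower edge of each remaining bin.
    """
    x, y = np.asarray(x), np.asarray(y)
    classes = np.unique(y)
    lowers = list(np.unique(x))  # start with one bin per distinct value
    counts = [np.array([np.sum((x == v) & (y == c)) for c in classes])
              for v in lowers]
    while len(lowers) > 1:
        chis = [chi2_pair(counts[i], counts[i + 1])
                for i in range(len(lowers) - 1)]
        i = int(np.argmin(chis))
        if chis[i] >= threshold:
            break  # all adjacent pairs are significantly different
        counts[i] = counts[i] + counts[i + 1]  # merge bin i+1 into bin i
        del counts[i + 1], lowers[i + 1]
    return [float(v) for v in lowers]

# Illustrative data: the class flips between x = 5 and x = 6.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(chimerge(x, y))  # [1.0, 6.0]
```

On this toy data the ten initial bins collapse into two, with the boundary at the point where the class changes.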

### Entropy-based (minimum description length) binning

Entropy-based methods recursively split the feature range at the point that maximizes [information gain](/wiki/information_gain), similar to how a decision tree selects splits. The minimum description length principle (MDLP) criterion, proposed by Fayyad and Irani in 1993, adds a stopping condition that prevents over-splitting by penalizing model complexity.[2]
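The core step, choosing the single cut point with the highest information gain, might look like the sketch below. The helper names are hypothetical, and the recursion and the MDLP stopping test are deliberately omitted, so this shows only one level of the procedure.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(x, y):
    """Return (cut_point, information_gain) for the best single binary split."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    order = np.argsort(x)
    x, y = x[order], y[order]
    base = entropy(y)
    best_cut, best_gain = None, 0.0
    # Candidate cuts are midpoints between consecutive distinct values.
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        left, right = y[:i], y[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = base - weighted
        if gain > best_gain:
            best_cut, best_gain = (x[i] + x[i - 1]) / 2, gain
    return best_cut, best_gain

# Illustrative data: classes separate cleanly between 4 and 10.
cut, gain = best_split([1, 2, 3, 4, 10, 11, 12, 13], [0, 0, 0, 0, 1, 1, 1, 1])
print(cut, gain)  # 7.0 1.0
```

A full entropy-based discretizer would apply `best_split` recursively to each side and stop when the MDLP criterion rejects a further split.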

### Weight of evidence (WoE) binning

Widely used in credit scoring and financial modeling, WoE binning groups feature values into bins that maximize the separation between positive and negative target classes. The weight of evidence for each bin is calculated as ln(share of all positives that fall in the bin / share of all negatives that fall in the bin); some references invert the ratio, which only flips the sign. WoE bins are often constrained to be monotonic with respect to the target, which is frequently a regulatory or business requirement in financial applications.[7]
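A sketch of the per-bin calculation, assuming bins have already been assigned and that WoE is defined as the log of the bin's share of all positives over its share of all negatives (some references use the inverse ratio). The function name is hypothetical, and the code assumes every bin contains at least one positive and one negative; practical implementations usually add smoothing to avoid division by zero.

```python
import numpy as np

def weight_of_evidence(bin_ids, y, n_bins):
    """WoE per bin: ln(share of positives in bin / share of negatives in bin)."""
    bin_ids, y = np.asarray(bin_ids), np.asarray(y)
    pos = np.bincount(bin_ids[y == 1], minlength=n_bins).astype(float)
    neg = np.bincount(bin_ids[y == 0], minlength=n_bins).astype(float)
    return np.log((pos / pos.sum()) / (neg / neg.sum()))

# Illustrative data: three bins whose positive rate rises from 25% to 75%.
bin_ids = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y       = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1]
woe = weight_of_evidence(bin_ids, y, n_bins=3)

print(np.round(woe, 3).tolist())
```

Here the WoE values increase from one bin to the next, which is the monotonic pattern that credit-scoring applications typically look for.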

## Bucketing in sequence-to-sequence models

In [natural language processing](/wiki/natural_language_processing) and speech recognition, "bucketing" refers to a different but related idea: grouping training sequences by length so that each mini-batch contains sequences of similar length. This usage is common in [recurrent neural network](/wiki/recurrent_neural_network) and [transformer](/wiki/transformer) training pipelines.

### The padding problem

When sequences of varying lengths are combined in a single batch, shorter sequences must be padded (typically with zeros or a dedicated padding token) to match the longest sequence in the batch. These padding positions waste computation during forward and backward passes because the model processes them even though they carry no information.

### How sequence bucketing works

Sequences are sorted by length and assigned to buckets (for example, lengths 1 to 20, 21 to 40, 41 to 60). During training, a batch is drawn from a single bucket so that all sequences in the batch have similar lengths. This reduces the amount of padding, speeds up training, and can lower memory usage.[10]
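The idea can be sketched in plain Python, independent of any framework. The function name `bucket_batches` and its parameters are hypothetical; a real training pipeline would typically also shuffle within and across buckets.

```python
from collections import defaultdict

def bucket_batches(sequences, bucket_width=20, batch_size=2, pad_value=0):
    """Group sequences into length buckets, then pad each batch to its own max length."""
    buckets = defaultdict(list)
    for seq in sequences:
        # Lengths 1-20 -> bucket 0, 21-40 -> bucket 1, and so on.
        buckets[(len(seq) - 1) // bucket_width].append(seq)

    batches = []
    for key in sorted(buckets):
        members = buckets[key]
        for start in range(0, len(members), batch_size):
            batch = members[start:start + batch_size]
            max_len = max(len(s) for s in batch)
            batches.append([s + [pad_value] * (max_len - len(s)) for s in batch])
    return batches

# Illustrative token-id sequences of lengths 3, 25, 5, and 38.
sequences = [[1] * 3, [2] * 25, [3] * 5, [4] * 38]
batches = bucket_batches(sequences)
padding = sum(row.count(0) for batch in batches for row in batch)

print(len(batches), padding)  # 2 15
```

Batching the same four sequences in their original order (lengths 3 with 25, then 5 with 38) would need 22 + 33 = 55 padding positions instead of 15.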

Frameworks that support sequence bucketing include:

- **TensorFlow:** `tf.data.experimental.bucket_by_sequence_length()`
- **MXNet:** native bucketing module for sequence models
- **[PyTorch](/wiki/pytorch):** custom `BucketBatchSampler` implementations used with `DataLoader`

## Bucketing in TensorFlow's feature columns

[TensorFlow](/wiki/tensorflow) provides `tf.feature_column.bucketized_column` as a built-in way to bucketize numerical features for use with Estimator-based models. The function accepts a numeric column and a sorted list of boundary values, then produces a one-hot encoded categorical column.[9]

For example, given boundaries `[0., 1., 2.]`, TensorFlow creates four buckets: (negative infinity, 0), [0, 1), [1, 2), and [2, positive infinity). Each input value is mapped to the corresponding bucket and represented as a one-hot vector.[9] Bucketized columns can be crossed with other categorical columns using `tf.feature_column.crossed_column` to capture interaction effects.

Note: the `tf.feature_column` API is deprecated in TensorFlow 2.x in favor of Keras preprocessing layers such as `tf.keras.layers.Discretization`.

## Hash-based bucketing

For high-cardinality categorical features, hash-based bucketing (also called [feature hashing](/wiki/feature_hashing) or the hashing trick) applies a hash function to map feature values into a fixed number of buckets.[6] Unlike standard bucketing of numerical ranges, hash-based bucketing operates on categorical values and does not require an explicit mapping table.

**How it works.** A hash function converts each feature value to an integer, and a modulo operation maps that integer to one of *n* buckets. The result is a fixed-size vector regardless of how many distinct values the feature has.

**Trade-offs.** Hash collisions are inevitable: distinct values may land in the same bucket, causing some information loss. Increasing the number of buckets reduces collisions but increases dimensionality. In practice, moderate collision rates have a small impact on model accuracy while providing large savings in memory and computation.[6]
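A minimal sketch of the hash-then-modulo step using the standard library. The function names are hypothetical, and SHA-256 is used only because it gives the same result in every process; Python's built-in `hash()` is randomized per process for strings, so it is unsuitable when training and serving must agree.

```python
import hashlib

def hash_bucket(value, n_buckets):
    """Map a string to a bucket index in [0, n_buckets) with a stable hash."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def hashed_one_hot(value, n_buckets):
    """Fixed-size one-hot vector, regardless of how many distinct values exist."""
    vec = [0] * n_buckets
    vec[hash_bucket(value, n_buckets)] = 1
    return vec

# Illustrative categorical values; no vocabulary or lookup table is needed.
for user in ["user_42", "user_43", "user_9000"]:
    print(user, hash_bucket(user, n_buckets=16))
```

Two different values may print the same bucket index; that is the collision trade-off described above.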

## Impact on model performance

Bucketing affects different model families in different ways.

| Model type | Effect of bucketing | Notes |
|---|---|---|
| [Linear models](/wiki/linear_model) and [logistic regression](/wiki/logistic_regression) | Often beneficial | Lets the model fit a piecewise-constant (step) function with one weight per bucket; especially useful when the raw feature has a non-linear relationship with the target |
| [Decision trees](/wiki/decision_tree) and ensemble methods | Rarely needed | Trees find their own split points; pre-bucketing may reduce granularity without adding value. Histogram-based methods like [LightGBM](/wiki/lightgbm) already bin features internally. |
| [Neural networks](/wiki/neural_network) | Sometimes helpful | Can stabilize training when input distributions are highly skewed, but networks can learn non-linear mappings on their own given enough data |
| [K-nearest neighbors](/wiki/k_nearest_neighbors) | Generally harmful | Bucketing collapses distinct values, destroying distance information that KNN relies on |

### Information loss

Every bucketing scheme discards some information because all values within a bucket are treated identically. The key trade-off is between reducing noise (and overfitting) and losing signal.[5] Too few buckets merge values that behave differently and lead to underfitting; too many leave few examples per bucket, so the per-bucket weights become noisy and the overfitting that bucketing was meant to curb returns. Cross-validation is a standard way to select an appropriate number of buckets.

## Bucketing for handling outliers

One practical use of bucketing is to contain the influence of extreme values. Instead of clipping outliers or applying a log transform, a practitioner can place all values above (or below) a threshold into a single bucket. For example, if income values range from 0 to 10,000,000 but 99% of observations fall below 200,000, a boundary at 200,000 groups all extreme incomes together. This prevents outliers from dominating feature scaling or distorting model coefficients while preserving the distinction among more common values.[4]

## Bucketing in data visualization

Outside of machine learning, bucketing is the foundation of [histograms](/wiki/histogram), one of the most common tools in exploratory data analysis. A histogram groups continuous values into bins and plots the count (or density) of observations per bin. The choice of bin width or bin count directly affects the visual impression of the data distribution; too few bins obscure structure, while too many bins introduce visual noise. Rules of thumb for choosing histogram bins, such as Sturges' rule (k = 1 + log2(n)) and the Freedman-Diaconis rule (bin width = 2 * IQR * n^(-1/3)), are in effect bucketing strategies optimized for visualization rather than prediction.[10]
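Both rules of thumb are short enough to compute directly. In this sketch the function names are hypothetical, rounding Sturges' result up to a whole number of bins is an assumption, and the sample is simply the integers 1 to 1000.

```python
import math
import numpy as np

def sturges_bins(n):
    """Sturges' rule: k = 1 + log2(n), rounded up to a whole number of bins."""
    return math.ceil(1 + math.log2(n))

def freedman_diaconis_width(values):
    """Freedman-Diaconis rule: bin width = 2 * IQR * n^(-1/3)."""
    q75, q25 = np.percentile(values, [75, 25])
    return 2 * (q75 - q25) * len(values) ** (-1 / 3)

values = np.arange(1, 1001, dtype=float)  # illustrative sample: 1..1000
k = sturges_bins(len(values))
width = freedman_diaconis_width(values)

print(k, round(width, 1))  # 11 99.9
```

NumPy exposes the same rules through `np.histogram_bin_edges(values, bins="sturges")` and `bins="fd"`.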

## Practical example

Consider a dataset of house prices where the feature "lot area" ranges from 1,300 to 215,000 square feet but is heavily right-skewed, with most lots between 5,000 and 15,000 square feet.

**Equal-width bucketing** with 5 bins would create intervals roughly 42,740 square feet wide. Nearly all observations would fall into the first bin, leaving the other four almost empty.

**Equal-frequency bucketing** with 5 bins (quintiles) would place about 20% of the data in each bin. Boundaries might fall near 7,500, 9,500, 11,100, and 13,600 square feet, giving the model useful resolution in the dense region.

**Log-scale bucketing** with boundaries at 1,000, 3,000, 10,000, 30,000, and 100,000 would provide finer resolution at the low end and coarser grouping for rare large lots.

**Decision tree-based binning** trained against the sale price target might place splits at 7,200, 10,800, and 16,000 square feet, reflecting the points where lot area has the strongest effect on price.

In Python using pandas and scikit-learn:

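```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data for illustration only; replace with a real DataFrame
# that has 'lot_area' and 'sale_price' columns.
rng = np.random.default_rng(0)
lot_area = rng.lognormal(mean=9.2, sigma=0.5, size=1_000).clip(1_300, 215_000)
df = pd.DataFrame({
    "lot_area": lot_area,
    "sale_price": 50_000 + 12 * lot_area + rng.normal(0, 20_000, size=1_000),
})

# Equal-width: 5 bins of identical width across the observed range
equal_width = pd.cut(df["lot_area"], bins=5)

# Equal-frequency (quantile): 5 bins with roughly 20% of rows each
equal_freq = pd.qcut(df["lot_area"], q=5)

# Log-scale: boundaries from the text; np.inf catches lots above 100,000
log_edges = [1_000, 3_000, 10_000, 30_000, 100_000, np.inf]
log_scale = pd.cut(df["lot_area"], bins=log_edges)

# Decision tree-based: a tree with at most 5 leaves yields at most 4 thresholds
tree = DecisionTreeRegressor(max_leaf_nodes=5, random_state=0)
tree.fit(df[["lot_area"]], df["sale_price"])
# Leaf nodes carry a placeholder feature index (-2), so keep internal nodes only
tree_edges = sorted(tree.tree_.threshold[tree.tree_.feature == 0])
tree_binned = pd.cut(df["lot_area"], bins=[-np.inf, *tree_edges, np.inf])
```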

## Best practices

- **Start with domain knowledge.** If meaningful thresholds exist, use them before trying automated methods.
- **Use cross-validation to choose the number of buckets.** There is no universal optimal count; it depends on the dataset size, feature distribution, and model type.
- **Prefer supervised binning when a target variable is available.** Methods like decision tree-based binning or WoE binning often outperform unsupervised approaches for prediction tasks, although the gain depends on the dataset.[1][8]
- **Avoid bucketing for tree-based models.** Gradient-boosted trees and random forests already perform internal splitting and usually gain little from pre-bucketing. Pre-bucketing may even reduce performance by limiting the splits the tree can consider.
- **Combine bucketing with feature crossing.** Bucketed features can be combined with other categorical features to create interaction features, a technique used extensively in large-scale ad-click prediction systems.
- **Document bucket boundaries.** In production systems, the same boundaries must be applied to training and serving data. Store boundaries as configuration or as part of the preprocessing pipeline.
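
One way to keep training and serving consistent, as the last point suggests, is to store the boundaries in a small configuration file and reload them at serving time. The sketch below is an illustration under stated assumptions: the function names, the JSON layout, and the file name `buckets.json` are hypothetical, and the boundary values are the illustrative quintile edges from the practical example.

```python
import json
import os
import tempfile

import numpy as np

def save_boundaries(path, feature_name, boundaries):
    """Write bucket boundaries to a JSON file so serving can reload them."""
    with open(path, "w") as f:
        json.dump({feature_name: [float(b) for b in boundaries]}, f)

def load_and_bucketize(path, feature_name, values):
    """Reload stored boundaries and apply them to new values."""
    with open(path) as f:
        boundaries = json.load(f)[feature_name]
    return np.digitize(values, boundaries)

# Training time: persist the boundaries that were chosen.
path = os.path.join(tempfile.mkdtemp(), "buckets.json")
save_boundaries(path, "lot_area", [7_500, 9_500, 11_100, 13_600])

# Serving time: apply exactly the same boundaries to incoming values.
serving_ids = load_and_bucketize(path, "lot_area", [5_000, 9_500, 20_000])
print(serving_ids.tolist())  # [0, 2, 4]
```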

## References

1. Dougherty, J., Kohavi, R., & Sahami, M. (1995). "Supervised and unsupervised discretization of continuous features." *Proceedings of the 12th International Conference on Machine Learning (ICML)*, pp. 194-202.
2. Fayyad, U. M., & Irani, K. B. (1993). "Multi-interval discretization of continuous-valued attributes for classification learning." *Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI)*, pp. 1022-1027.
3. Kerber, R. (1992). "ChiMerge: Discretization of numeric attributes." *Proceedings of the 10th National Conference on Artificial Intelligence (AAAI)*, pp. 123-128.
4. Google Developers. "Bucketing." *Machine Learning Data Preparation*. https://developers.google.com/machine-learning/data-prep/transform/bucketing
5. Google Developers. "Numerical data: Binning." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/numerical-data/binning
6. Weinberger, K. Q., Dasgupta, A., Langford, J., Smola, A., & Attenberg, J. (2009). "Feature hashing for large scale multitask learning." *Proceedings of the 26th International Conference on Machine Learning (ICML)*, pp. 1113-1120.
7. Navas-Palencia, G. (2020). "Optimal binning: Mathematical programming formulation." *arXiv preprint arXiv:2001.08025*.
8. Khomtchouk, B. B., Hennessy, J. R., & Bhargava, R. (2020). "Discretization of continuous features in clinical datasets." *Journal of Biomedical Informatics*, 102, 103385.
9. TensorFlow Documentation. "tf.feature_column.bucketized_column." https://www.tensorflow.org/api_docs/python/tf/feature_column/bucketized_column
10. Soczewica, R. (2019). "Nice bucket challenge: Overview of data binning techniques." *Medium*. https://robertsoczewica.medium.com/nice-bucket-challenge-5b511c00f1b3

## See also

- [Binning](/wiki/binning)
- [Feature engineering](/wiki/feature_engineering)
- [Data preprocessing](/wiki/data_preprocessing)
- [One-hot encoding](/wiki/one-hot_encoding)
- [Decision tree](/wiki/decision_tree)
- [Histogram](/wiki/histogram)
- [Feature hashing](/wiki/feature_hashing)
- [Quantile](/wiki/quantile)
