# Bucketing

> Source: https://aiwiki.ai/wiki/bucketing
> Updated: 2026-07-12
> Categories: Data & Datasets, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Bucketing**, also called [binning](/wiki/binning) or discretization, is a [feature engineering](/wiki/feature_engineering) technique in [machine learning](/wiki/machine_learning) that converts a [continuous feature](/wiki/continuous_feature) into a small number of discrete categories by dividing its range of values into intervals known as "buckets" or "bins."[4][5] Google's Machine Learning Glossary defines it as "converting a single feature into multiple binary features called buckets or bins, typically based on a value range," while the Machine Learning Crash Course describes binning as a method that "groups different numerical subranges into bins or buckets," in many cases turning numerical data into categorical data.[11][5] Each data point is assigned to the bucket whose interval contains its value, so a feature with many unique values becomes a handful of groups.[4]

Bucketing is used because it lets simple models capture non-linear patterns, makes a [feature set](/wiki/feature_set) more robust to [outliers](/wiki/outliers), and enables bucketed values to be combined through a [feature cross](/wiki/feature_cross).[5] The main cost is information loss: every value inside a bucket is treated identically, so boundaries must be chosen carefully.[5] The two most common strategies are equal-width bucketing (intervals of identical size) and quantile, or equal-frequency, bucketing (intervals chosen so each holds roughly the same number of examples).[5]

Bucketing is widely used across data preprocessing, [data visualization](/wiki/data_visualization), and model training pipelines. It appears not only as a standalone preprocessing step but also as a built-in mechanism in frameworks such as [TensorFlow](/wiki/tensorflow) and [XGBoost](/wiki/xgboost), and as a batching strategy for [sequence-to-sequence](/wiki/sequence-to-sequence_task) models in [natural language processing](/wiki/natural_language_processing).[10]

## ELI5 (explain like I'm 5)

Imagine you have a big pile of crayons in all sorts of colors. Instead of keeping track of every single shade, you sort them into a few boxes: one box for reds, one for blues, one for greens, and so on. Now when someone asks for a color, you just pick a box instead of searching through every crayon. Bucketing does the same thing with numbers. If you have ages like 3, 7, 12, 25, and 41, you might make a "kid" box (0 to 12), a "young adult" box (13 to 30), and an "adult" box (31 and up). The computer can then work with a few simple groups instead of every possible number.

## What is bucketing in machine learning?

Bucketing (binning, discretization) transforms a [continuous feature](/wiki/continuous_feature) into discrete categories so that downstream models work with intervals instead of raw values. Google's Machine Learning Crash Course gives a worked example with car prices, noting that you could create "ten buckets where each of the ten buckets represents a span of exactly 10,000 dollars."[5] The transformation matters most when the relationship between a feature and the target is not a straight line: per the Crash Course, binning is useful when "the overall linear relationship between the feature and the label is weak or nonexistent" or "when the feature values are clustered."[5]

## How does bucketing work?

The general procedure for bucketing a continuous feature involves three steps:

1. **Define bucket boundaries.** Choose a set of boundary values that partition the feature's range into non-overlapping intervals. For example, given the boundaries [0, 10, 20, 30], the resulting buckets are $$(-\infty, 0)$$, $$[0, 10)$$, $$[10, 20)$$, $$[20, 30)$$, and $$[30, +\infty)$$.
2. **Assign each value to a bucket.** For every observation, determine which interval contains its value and label it accordingly.
3. **Encode the bucket labels.** The bucket assignments are typically [one-hot encoded](/wiki/one-hot_encoding) so that each bucket becomes a separate binary feature, or they are represented as an ordinal integer.[4]

Once bucketed, a single numerical feature with many unique values becomes a small set of [categorical features](/wiki/categorical_data). This allows models, particularly [linear models](/wiki/linear_model), to learn a separate weight for each bucket rather than fitting a single slope across the entire range.[5] As Google's Crash Course puts it, binning lets a model "learn separate weights for each bin," which is how a linear model can fit a piecewise, non-linear shape.[5]
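
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the sample values are hypothetical and chosen so that every bucket is hit, while the boundaries are the `[0, 10, 20, 30]` example from the list.

```python
import numpy as np

# Step 1: boundaries from the example above; they define five buckets.
boundaries = np.array([0, 10, 20, 30])

# Hypothetical feature values chosen to land in every bucket.
values = np.array([-5.0, 0.0, 9.9, 10.0, 25.0, 30.0, 120.0])

# Step 2: np.digitize returns i such that boundaries[i-1] <= x < boundaries[i],
# so index 0 is (-inf, 0) and index 4 is [30, +inf).
bucket_ids = np.digitize(values, boundaries)

# Step 3: one-hot encode, giving one binary column per bucket.
num_buckets = len(boundaries) + 1
one_hot = np.eye(num_buckets, dtype=int)[bucket_ids]

print(bucket_ids)  # [0 1 1 2 3 4 4]
print(one_hot.shape)  # (7, 5)
```

Each row of `one_hot` contains a single 1, which is what lets a linear model learn one weight per bucket.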

## Why use bucketing?

Bucketing offers three commonly cited benefits, each described in Google's documentation:[5]

- **Captures non-linearity in linear models.** A [linear model](/wiki/linear_model) fits one weight per feature. Splitting a continuous feature into buckets and one-hot encoding them lets the model assign an independent weight to each interval, approximating a non-linear curve as a piecewise-constant function.[5]
- **Robustness to outliers.** Extreme values can be absorbed into a boundary bucket instead of dominating feature scaling. The information loss inside a bucket is exactly what blunts the effect of an outlier.[4]
- **Enables feature crosses.** Bucketed features can be combined into a [feature cross](/wiki/feature_cross). In Google's words, "feature crosses are created by crossing (taking the Cartesian product of) two or more categorical or bucketed features of the dataset," a technique used heavily in large-scale ad-click and recommendation models.[12]

The trade-off is information loss: too few buckets underfit, while too many reintroduce sparsity. The Crash Course warns that with too many bins "none of the bins would contain enough examples for the model to train on."[5]

## What is the difference between equal-width and quantile bucketing?

Several strategies exist for choosing bucket boundaries. The choice depends on the data distribution, the downstream model, and the domain.

### Equal-width bucketing

Equal-width bucketing (also called uniform or equal-interval binning) divides the range of a feature into a fixed number of intervals of the same size.[5] If the minimum value is $$v_{\min}$$, the maximum is $$v_{\max}$$, and the desired number of buckets is $$k$$, then the width of each bucket is:

$$
\text{width} = \frac{v_{\max} - v_{\min}}{k}
$$

**Advantages.** The method is simple to implement and easy to interpret. When data is approximately uniformly distributed, equal-width bins produce balanced groups.

**Disadvantages.** When data is skewed, most observations may fall into one or two bins while the remaining bins are nearly empty. Outliers stretch the range, making the problem worse. As Google notes, equal intervals "give extra information space to the long tail while compacting the large torso into a single bucket," which can reduce a model's ability to distinguish between common values.[5]
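
The width formula and the skew problem can be seen together in a small sketch. The helper name `equal_width_edges` and the sample data are hypothetical; the sample deliberately includes one large value to stretch the range.

```python
import numpy as np

def equal_width_edges(values, k):
    """Return k + 1 edges that split [min, max] into k equal-width buckets."""
    v_min, v_max = float(np.min(values)), float(np.max(values))
    return np.linspace(v_min, v_max, k + 1)

# Hypothetical right-skewed sample: one large value stretches the range.
values = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 101.0])
k = 5
edges = equal_width_edges(values, k)
width = edges[1] - edges[0]  # (101 - 1) / 5 = 20

# Digitize against the interior edges only, so ids run from 0 to k - 1
# and the maximum value falls in the last bucket.
bucket_ids = np.digitize(values, edges[1:-1])
counts = np.bincount(bucket_ids, minlength=k)

print(width)   # 20.0
print(counts)  # [7 0 0 0 1]
```

Seven of the eight values share the first bucket and three buckets are empty, which is the failure mode described above.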

### Equal-frequency (quantile) bucketing

Equal-frequency bucketing, sometimes called [quantile](/wiki/quantile)-based binning, sets boundaries so that each bucket contains approximately the same number of data points. Google defines quantile bucketing as a method that "creates bucketing boundaries such that the number of examples in each bucket is exactly or nearly equal."[5] For $$k$$ buckets, the boundaries correspond to the $$1/k, 2/k, \ldots, (k-1)/k$$ quantiles of the feature distribution.

**Advantages.** It handles skewed distributions well because every bucket receives a roughly equal share of examples. In Google's framing, "quantile buckets give extra information space to the large torso while compacting the long tail into a single bucket," the opposite of the equal-width failure mode.[5] This gives the model sufficient training signal in every bucket and avoids the sparse-bin problem.

**Disadvantages.** Bucket widths can vary dramatically. In heavily skewed data the widest bucket may span a range orders of magnitude larger than the narrowest, which can obscure meaningful differences among values in the long tail.
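
As a rough sketch of the quantile rule, the interior boundaries can be computed with `np.quantile`. The helper name `quantile_edges` and the sample data are hypothetical, and the example assumes NumPy's default linear interpolation between order statistics.

```python
import numpy as np

def quantile_edges(values, k):
    """Interior boundaries at the 1/k, 2/k, ..., (k-1)/k quantiles."""
    probs = np.arange(1, k) / k
    return np.quantile(values, probs)

# Hypothetical right-skewed sample with one large value.
values = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 101.0])
k = 4
edges = quantile_edges(values, k)
bucket_ids = np.digitize(values, edges)
counts = np.bincount(bucket_ids, minlength=k)

print(edges)   # [2.375 3.25  4.25 ]
print(counts)  # [2 2 2 2]
```

The last bucket spans everything from 4.25 upward, so 5 and 101 are treated identically. With many tied values the counts may also come out uneven, because a boundary cannot split identical values.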

### Log-scale bucketing

For features with right-skewed distributions (common with counts, monetary values, and web traffic metrics), bucket boundaries can be placed on a logarithmic scale. For example, boundaries at 1, 10, 100, 1000, and 10000 produce buckets that grow exponentially in width. This approach devotes more resolution to the dense lower range of values while still distinguishing among rare high values, combining some benefits of both equal-width and equal-frequency strategies.[10]
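
A small sketch of the power-of-ten boundaries mentioned above, applied to hypothetical count-like values:

```python
import numpy as np

# Boundaries at 1, 10, 100, 1000, and 10000, as in the text.
boundaries = np.array([1, 10, 100, 1_000, 10_000])

# Hypothetical positive values spanning several orders of magnitude.
values = np.array([0.5, 3, 42, 99, 100, 2_500, 50_000])
bucket_ids = np.digitize(values, boundaries)

# Width of each bounded bucket grows by a factor of ten.
widths = np.diff(boundaries)

print(bucket_ids)  # [0 1 2 2 3 4 5]
print(widths)      # [   9   90  900 9000]
```

Values below the first boundary and above the last one fall into the two open-ended buckets (ids 0 and 5 here).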

### Custom (domain-specific) bucketing

Domain experts often define boundaries based on meaningful thresholds rather than statistical properties. Examples include:

- **Age groups:** 0 to 12 (child), 13 to 17 (adolescent), 18 to 64 (adult), 65 and above (senior).
- **Income brackets:** tax brackets or poverty-line thresholds.
- **Credit scoring:** risk bands defined by regulatory or business rules.

Custom bucketing can capture domain knowledge that purely data-driven methods miss, but it requires human judgment and may not generalize across datasets.[8]

### Comparison of bucketing methods

| Method | Boundary rule | Best suited for | Main weakness |
|---|---|---|---|
| Equal-width | Fixed interval size | Uniformly distributed data | Sparse bins with skewed data |
| Equal-frequency | Quantile-based | Skewed distributions | Variable bin widths can obscure tail differences |
| Log-scale | Exponentially growing intervals | Right-skewed data (counts, prices) | Not suitable for data with negative values |
| Custom | Domain knowledge | Regulated or well-understood domains | Requires expert input; may not transfer |

## What are supervised (optimal) binning methods?

The methods above are all unsupervised: they consider only the feature values, not the target variable.[1] Supervised binning methods use the relationship between the feature and the target to find boundaries that maximize predictive power.[7]

### Decision tree-based binning

A single [decision tree](/wiki/decision_tree) trained on one feature and the target variable naturally produces splits that minimize impurity (Gini or entropy). The resulting leaf nodes define bins whose boundaries are chosen to separate target classes as cleanly as possible.[1] Scikit-learn's `DecisionTreeClassifier` or `DecisionTreeRegressor` with a constrained `max_leaf_nodes` parameter is a common way to implement this approach.

### ChiMerge

ChiMerge is a bottom-up merging algorithm that starts with one bin per distinct value and iteratively merges adjacent bins whose target distributions are statistically similar according to the chi-squared test.[3] Merging stops when all remaining adjacent pairs are significantly different. This produces bins that are homogeneous with respect to the target.
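
The merging loop can be sketched as follows. This is a simplified illustration, not Kerber's full procedure: the function names are hypothetical, the default threshold of 3.841 is the 0.05 critical value of the chi-squared distribution with one degree of freedom (appropriate for a two-class target), and cells with zero expected count are simply skipped.

```python
import numpy as np

def chi2_pair(a, b):
    """Chi-squared statistic for the class counts of two adjacent bins."""
    table = np.array([a, b], dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()
    mask = expected > 0  # simplification: ignore cells with zero expectation
    return float((((table - expected) ** 2)[mask] / expected[mask]).sum())

def chimerge(values, labels, threshold=3.841):
    """Return the lower edge of each bin left after merging."""
    classes = np.unique(labels)
    # Start with one bin per distinct value: [lower_edge, class_counts].
    bins = [
        [v, np.array([np.sum((values == v) & (labels == c)) for c in classes])]
        for v in np.unique(values)
    ]
    while len(bins) > 1:
        stats = [chi2_pair(bins[i][1], bins[i + 1][1]) for i in range(len(bins) - 1)]
        i = int(np.argmin(stats))
        if stats[i] >= threshold:
            break  # every adjacent pair is now significantly different
        bins[i][1] = bins[i][1] + bins[i + 1][1]
        del bins[i + 1]
    return [float(b[0]) for b in bins]

# Hypothetical data: the class flips between 6 and 7.
values = np.arange(1.0, 13.0)
labels = np.array([0] * 6 + [1] * 6)
print(chimerge(values, labels))  # [1.0, 7.0]
```

On this toy data, all same-class neighbours merge (their statistic is 0), leaving two bins whose statistic of 12 exceeds the threshold.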

### Entropy-based (minimum description length) binning

Entropy-based methods recursively split the feature range at the point that maximizes [information gain](/wiki/information_gain), similar to how a decision tree selects splits. The minimum description length principle (MDLP) criterion, proposed by Fayyad and Irani in 1993, adds a stopping condition that prevents over-splitting by penalizing model complexity.[2]
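
The core step, choosing the single cut point with the highest information gain, can be sketched as below. The function names are hypothetical, and the sketch deliberately omits both the recursion and the MDLP stopping test; it only shows how one split is scored.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy, in bits, of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(values, labels):
    """Return (cut_point, information_gain) for the best single binary split."""
    order = np.argsort(values)
    v, y = values[order], labels[order]
    base = entropy(y)
    best_cut, best_gain = None, 0.0
    # Candidate cuts are midpoints between consecutive distinct values.
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue
        left, right = y[:i], y[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = base - weighted
        if gain > best_gain:
            best_cut, best_gain = float((v[i - 1] + v[i]) / 2), gain
    return best_cut, best_gain

# Hypothetical data: the class changes between 4 and 10.
values = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
cut, gain = best_split(values, labels)
print(cut, gain)  # 7.0 1.0
```

A full entropy-based discretizer would apply `best_split` recursively to each side and stop when the MDLP criterion rejects a further split.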

### Weight of evidence (WoE) binning

Widely used in credit scoring and financial modeling, WoE binning groups feature values into bins that maximize the separation between positive and negative target classes. The weight of evidence for each bin is calculated as ln(share of all positives that fall in the bin / share of all negatives that fall in the bin); some references invert the ratio, which only flips the sign. Bins are typically constrained so that WoE is monotonic across them, which is often a regulatory requirement in financial applications.[7]
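
A minimal sketch of the per-bin calculation, assuming a binary target and bins that have already been chosen. Here each "proportion" is read as the bin's share of all positives (or of all negatives) in the dataset; the function name and data are hypothetical.

```python
import numpy as np

def weight_of_evidence(bucket_ids, target, num_buckets):
    """WoE per bucket: ln(share of positives in bucket / share of negatives in bucket)."""
    pos = np.bincount(bucket_ids, weights=target, minlength=num_buckets)
    neg = np.bincount(bucket_ids, weights=1 - target, minlength=num_buckets)
    return np.log((pos / pos.sum()) / (neg / neg.sum()))

# Hypothetical data: three bins of four rows, with 1, 2, and 3 positives.
bucket_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
target = np.array([0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1])

woe = weight_of_evidence(bucket_ids, target, num_buckets=3)
print(woe)  # approximately [-1.0986  0.      1.0986]
```

The values rise monotonically from ln(1/3) to ln(3). A bin with no positives or no negatives would produce an infinite value, so practical implementations usually smooth the counts or merge such bins.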

## How does bucketing work in sequence-to-sequence models?

In [natural language processing](/wiki/natural_language_processing) and speech recognition, "bucketing" refers to a different but related idea: grouping training sequences by length so that each mini-batch contains sequences of similar length. This usage is common in [recurrent neural network](/wiki/recurrent_neural_network) and [transformer](/wiki/transformer) training pipelines.

### The padding problem

When sequences of varying lengths are combined in a single batch, shorter sequences must be padded with zeros to match the longest sequence. These padding tokens waste computation during forward and backward passes because the model processes them even though they carry no information.

### How sequence bucketing works

Sequences are sorted by length and assigned to buckets (for example, lengths 1 to 20, 21 to 40, 41 to 60). During training, a batch is drawn from a single bucket so that all sequences in the batch have similar lengths. This reduces the amount of padding, speeds up training, and can lower memory usage.[10]

Frameworks that support sequence bucketing include:

- **TensorFlow:** `tf.data.experimental.bucket_by_sequence_length()`
- **MXNet:** native bucketing module for sequence models
- **[PyTorch](/wiki/pytorch):** custom `BucketBatchSampler` implementations used with `DataLoader`
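
Independently of any framework, the idea can be sketched in plain Python. The function `bucket_batches` and its parameters are hypothetical and are not the API of the libraries listed above; sequences are assumed to be lists of token ids, with 0 reserved for padding.

```python
import random
from collections import defaultdict

def bucket_batches(sequences, bucket_width, batch_size, pad=0, seed=0):
    """Group sequences into length buckets, then build padded batches per bucket."""
    buckets = defaultdict(list)
    for seq in sequences:
        # Lengths 1..bucket_width share bucket 0, the next range shares bucket 1, ...
        buckets[(len(seq) - 1) // bucket_width].append(seq)

    rng = random.Random(seed)
    batches = []
    for bucket in buckets.values():
        rng.shuffle(bucket)  # randomize order inside each bucket
        for i in range(0, len(bucket), batch_size):
            batch = bucket[i:i + batch_size]
            longest = max(len(s) for s in batch)
            # Pad only up to the longest sequence in this batch.
            batches.append([s + [pad] * (longest - len(s)) for s in batch])
    rng.shuffle(batches)  # mix buckets across training steps
    return batches

lengths = [2, 19, 20, 21, 38, 40, 41, 60]
sequences = [[1] * n for n in lengths]  # token id 1 everywhere; 0 is padding
batches = bucket_batches(sequences, bucket_width=20, batch_size=2)

padding = sum(row.count(0) for batch in batches for row in batch)
naive_padding = sum(max(lengths) - n for n in lengths)  # pad everything to 60
print(padding, "<", naive_padding)
```

Because no sequence is padded by more than `bucket_width - 1` tokens, total padding stays well below what padding every sequence to the global maximum would cost.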

## How does bucketing work in TensorFlow's feature columns?

[TensorFlow](/wiki/tensorflow) provides `tf.feature_column.bucketized_column` as a built-in way to bucketize numerical features for use with Estimator-based models. The function accepts a numeric column and a sorted list of boundary values, then produces a one-hot encoded categorical column.[9]

For example, given boundaries `[0., 1., 2.]`, TensorFlow creates four buckets: $$(-\infty, 0)$$, $$[0, 1)$$, $$[1, 2)$$, and $$[2, +\infty)$$. Each input value is mapped to the corresponding bucket and represented as a one-hot vector.[9] Bucketized columns can be crossed with other categorical columns using `tf.feature_column.crossed_column` to capture interaction effects, the API-level realization of a [feature cross](/wiki/feature_cross).

Note: the `tf.feature_column` API is deprecated in TensorFlow 2.x in favor of Keras preprocessing layers such as `tf.keras.layers.Discretization`.

## What is hash-based bucketing?

For high-cardinality categorical features, hash-based bucketing (also called [feature hashing](/wiki/feature_hashing) or the hashing trick) applies a hash function to map feature values into a fixed number of buckets.[6] Unlike standard bucketing of numerical ranges, hash-based bucketing operates on categorical values and does not require an explicit mapping table.

**How it works.** A hash function converts each feature value to an integer, and a modulo operation maps that integer to one of *n* buckets. The result is a fixed-size vector regardless of how many distinct values the feature has.

**Trade-offs.** Hash collisions are inevitable: distinct values may land in the same bucket, causing some information loss. Increasing the number of buckets reduces collisions but increases dimensionality. In practice, moderate collision rates have a small impact on model accuracy while providing large savings in memory and computation.[6]
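
A minimal sketch of the hash-then-modulo step, using only the standard library. The helper name `hash_bucket` and the sample values are hypothetical. A stable digest from `hashlib` is used instead of Python's built-in `hash()`, because string hashes from `hash()` are randomized per process by default.

```python
import hashlib

def hash_bucket(value, num_buckets):
    """Map a string to a bucket id in [0, num_buckets) with a stable hash."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Hypothetical high-cardinality feature values.
cities = ["london", "paris", "tokyo", "lima", "cairo"]
ids = [hash_bucket(c, num_buckets=4) for c in cities]

# Five values into four buckets: at least one collision is guaranteed.
print(ids, "distinct buckets used:", len(set(ids)))
```

No lookup table is stored, and a value never seen during training still maps to a valid bucket at serving time.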

## How does bucketing affect model performance?

Bucketing affects different model families in different ways.

| Model type | Effect of bucketing | Notes |
|---|---|---|
| [Linear models](/wiki/linear_model) and [logistic regression](/wiki/logistic_regression) | Often beneficial | Allows piecewise-constant relationships; especially useful when the raw feature has a non-linear relationship with the target |
| [Decision trees](/wiki/decision_tree) and ensemble methods | Rarely needed | Trees find their own split points; pre-bucketing may reduce granularity without adding value. Histogram-based methods like [LightGBM](/wiki/lightgbm) already bin features internally. |
| [Neural networks](/wiki/neural_network) | Sometimes helpful | Can stabilize training when input distributions are highly skewed, but networks can learn non-linear mappings on their own given enough data |
| [K-nearest neighbors](/wiki/k_nearest_neighbors) | Generally harmful | Bucketing collapses distinct values, destroying distance information that KNN relies on |

### Information loss

Every bucketing scheme discards some information because all values within a bucket are treated identically. The key trade-off is between reducing noise (and overfitting) and losing signal.[5] Using too few buckets leads to underfitting; using too many reintroduces the problems bucketing was meant to solve, and per Google can leave bins with too few examples to train on.[5] Cross-validation is the standard way to select an appropriate number of buckets.

## How is bucketing used to handle outliers?

One practical use of bucketing is to contain the influence of extreme values. Instead of clipping outliers or applying a log transform, a practitioner can place all values above (or below) a threshold into a single bucket. For example, if income values range from 0 to 10,000,000 but 99% of observations fall below 200,000, a boundary at 200,000 groups all extreme incomes together. This prevents outliers from dominating feature scaling or distorting model coefficients while preserving the distinction among more common values.[4]
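
The effect can be illustrated with a handful of hypothetical incomes. The 200,000 boundary comes from the example above; the lower boundaries are assumptions added for illustration.

```python
import numpy as np

# Hypothetical incomes, including one extreme value.
incomes = np.array([20_000, 45_000, 80_000, 150_000, 10_000_000], dtype=float)

# Min-max scaling of the raw values: the single extreme income squeezes
# every other value into a narrow band near zero.
scaled = (incomes - incomes.min()) / (incomes.max() - incomes.min())

# Bucketing with a top boundary at 200,000: every income at or above it
# shares the last bucket, however extreme it is.
boundaries = np.array([50_000, 100_000, 200_000])
bucket_ids = np.digitize(incomes, boundaries)

print(np.round(scaled, 4))
print(bucket_ids)  # [0 0 1 2 3]
```

After scaling, an income of 150,000 sits at about 0.013 on a 0-to-1 range, whereas the bucketed feature keeps it clearly separated from both lower incomes and the outlier.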

## How is bucketing used in data visualization?

Outside of machine learning, bucketing is the foundation of [histograms](/wiki/histogram), one of the most common tools in exploratory data analysis. A histogram groups continuous values into bins and plots the count (or density) of observations per bin. The choice of bin width or bin count directly affects the visual impression of the data distribution; too few bins obscure structure, while too many bins introduce visual noise. Rules of thumb for choosing histogram bins, such as Sturges' rule ($$k = 1 + \log_2(n)$$) and the Freedman-Diaconis rule ($$\text{bin width} = 2 \cdot \text{IQR} \cdot n^{-1/3}$$), are in effect bucketing strategies optimized for visualization rather than prediction.[10]
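
Both rules are short enough to compute directly. In this sketch the function names are hypothetical, and rounding Sturges' result up to an integer is an assumption, since the formula itself yields a real number.

```python
import numpy as np

def sturges_bins(n):
    """Sturges' rule: k = 1 + log2(n), rounded up to a whole number of bins."""
    return int(np.ceil(1 + np.log2(n)))

def freedman_diaconis_width(values):
    """Freedman-Diaconis rule: bin width = 2 * IQR * n^(-1/3)."""
    q75, q25 = np.percentile(values, [75, 25])
    return 2 * (q75 - q25) * len(values) ** (-1 / 3)

# Hypothetical sample: the integers 1 to 1000.
values = np.arange(1, 1001, dtype=float)
k = sturges_bins(len(values))
width = freedman_diaconis_width(values)

print(k)                # 11
print(round(width, 1))  # 99.9
```

NumPy exposes the same rules through `np.histogram_bin_edges(values, bins="sturges")` and `bins="fd"`.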

## Practical example

Consider a dataset of house prices where the feature "lot area" ranges from 1,300 to 215,000 square feet but is heavily right-skewed, with most lots between 5,000 and 15,000 square feet.

**Equal-width bucketing** with 5 bins would create intervals roughly 42,740 square feet wide. Nearly all observations would fall into the first bin, leaving the other four almost empty.

**Equal-frequency bucketing** with 5 bins (quintiles) would place about 20% of the data in each bin. Boundaries might fall near 7,500, 9,500, 11,100, and 13,600 square feet, giving the model useful resolution in the dense region.

**Log-scale bucketing** with boundaries at 1,000, 3,000, 10,000, 30,000, and 100,000 would provide finer resolution at the low end and coarser grouping for rare large lots.

**Decision tree-based binning** trained against the sale price target might place splits at 7,200, 10,800, and 16,000 square feet, reflecting the points where lot area has the strongest effect on price.

In Python using pandas and scikit-learn:

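```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (columns: lot_area, sale_price)
rng = np.random.default_rng(0)
lot_area = np.clip(rng.lognormal(mean=9.2, sigma=0.5, size=1000), 1300, 215000)
sale_price = 50_000 + 12 * lot_area + rng.normal(0, 20_000, size=1000)
df = pd.DataFrame({"lot_area": lot_area, "sale_price": sale_price})

# Equal-width: 5 intervals of identical size
equal_width = pd.cut(df["lot_area"], bins=5)

# Equal-frequency (quantile): about 20% of rows per bin
equal_freq = pd.qcut(df["lot_area"], q=5)

# Log-scale: boundaries from the example above, with an open-ended top bin
log_boundaries = [1000, 3000, 10000, 30000, 100000, np.inf]
log_scale = pd.cut(df["lot_area"], bins=log_boundaries)

# Decision tree-based: split thresholds become the bin boundaries
tree = DecisionTreeRegressor(max_leaf_nodes=5, random_state=0)
tree.fit(df[["lot_area"]], df["sale_price"])
# Leaf nodes carry feature == -2, so this keeps internal split thresholds only
tree_boundaries = sorted(tree.tree_.threshold[tree.tree_.feature == 0])
tree_binned = pd.cut(df["lot_area"], bins=[-np.inf, *tree_boundaries, np.inf])
```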

## Best practices

- **Start with domain knowledge.** If meaningful thresholds exist, use them before trying automated methods.
- **Use cross-validation to choose the number of buckets.** There is no universal optimal count; it depends on the dataset size, feature distribution, and model type.
- **Prefer supervised binning when a target variable is available.** Methods like decision tree-based binning or WoE binning often outperform unsupervised approaches for prediction tasks.[1][8]
- **Avoid bucketing for tree-based models.** Gradient-boosted trees and random forests already choose their own split points and rarely benefit from pre-bucketing. Pre-bucketing may even reduce performance by limiting the splits the tree can consider.
- **Combine bucketing with feature crossing.** Bucketed features can be combined with other categorical features to create a [feature cross](/wiki/feature_cross), a technique used extensively in large-scale ad-click prediction systems.[12]
- **Document bucket boundaries.** In production systems, the same boundaries must be applied to training and serving data. Store boundaries as configuration or as part of the preprocessing pipeline.

## References

1. Dougherty, J., Kohavi, R., & Sahami, M. (1995). "Supervised and unsupervised discretization of continuous features." *Proceedings of the 12th International Conference on Machine Learning (ICML)*, pp. 194-202.
2. Fayyad, U. M., & Irani, K. B. (1993). "Multi-interval discretization of continuous-valued attributes for classification learning." *Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI)*, pp. 1022-1027.
3. Kerber, R. (1992). "ChiMerge: Discretization of numeric attributes." *Proceedings of the 10th National Conference on Artificial Intelligence (AAAI)*, pp. 123-128.
4. Google Developers. "Bucketing." *Machine Learning Data Preparation*. https://developers.google.com/machine-learning/data-prep/transform/bucketing
5. Google Developers. "Numerical data: Binning." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/numerical-data/binning
6. Weinberger, K. Q., Dasgupta, A., Langford, J., Smola, A., & Attenberg, J. (2009). "Feature hashing for large scale multitask learning." *Proceedings of the 26th International Conference on Machine Learning (ICML)*, pp. 1113-1120.
7. Navas-Palencia, G. (2020). "Optimal binning: Mathematical programming formulation." *arXiv preprint arXiv:2001.08025*.
8. Khomtchouk, B. B., Hennessy, J. R., & Bhargava, R. (2020). "Discretization of continuous features in clinical datasets." *Journal of Biomedical Informatics*, 102, 103385.
9. TensorFlow Documentation. "tf.feature_column.bucketized_column." https://www.tensorflow.org/api_docs/python/tf/feature_column/bucketized_column
10. Soczewica, R. (2019). "Nice bucket challenge: Overview of data binning techniques." *Medium*. https://robertsoczewica.medium.com/nice-bucket-challenge-5b511c00f1b3
11. Google Developers. "Machine Learning Glossary." *Google for Developers*. https://developers.google.com/machine-learning/glossary
12. Google Developers. "Categorical data: Feature crosses." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/categorical-data/feature-crosses

## See also

- [Binning](/wiki/binning)
- [Feature engineering](/wiki/feature_engineering)
- [Feature cross](/wiki/feature_cross)
- [Data preprocessing](/wiki/data_preprocessing)
- [One-hot encoding](/wiki/one-hot_encoding)
- [Decision tree](/wiki/decision_tree)
- [Histogram](/wiki/histogram)
- [Feature hashing](/wiki/feature_hashing)
- [Quantile](/wiki/quantile)