See also: Discretization, Feature engineering, Quantile
Quantile bucketing, also called quantile binning, equal-frequency binning, or quantile discretization, is a feature engineering technique that converts a continuous numeric variable into a small number of ordered, discrete buckets, where each bucket contains roughly the same number of training examples. The bucket edges are placed at sample quantiles of the training distribution: for ten buckets the splits sit at the 10th, 20th, ..., 90th percentiles, so each bucket receives about 10 percent of the data.
Quantile bucketing is one of several strategies for discretization, the broader family of techniques that turn numeric features into categorical ones. It is a workhorse preprocessing step in tabular machine learning, and a quantile-style binning step also runs internally in modern histogram-based gradient boosting libraries such as LightGBM and XGBoost.
The defining property of quantile bucketing is that bin widths adapt to the data. Where many examples cluster (a dense part of the distribution), bins are narrow. Where examples are spread out (the tails), bins are wide. This is the opposite of equal-width binning, where bins all span the same range and the bin counts vary. Adaptive widths are what make quantile bucketing robust to skew and to extreme values: a long right tail simply ends up in one wide bin rather than producing dozens of nearly empty bins as it would under equal-width binning.
Given a continuous training column x with n observations and a desired number of buckets K, quantile bucketing produces K - 1 cut points and assigns every value to one of K ordered buckets. The standard recipe has three steps.
1. Sort x in ascending order.
2. Compute the 1/K, 2/K, ..., (K-1)/K sample quantiles; these are the cut points. For K = 4 (quartiles), the cut points are the 25th, 50th, and 75th percentiles. For K = 10 (deciles), they are the 10th through 90th percentiles.
3. Assign every value to the bucket whose cut points enclose it, yielding K ordered buckets from the K - 1 cut points.

With ties in the data the cut points may not produce exactly equal counts. Libraries differ in how they break ties: pandas raises an error by default if duplicate edges occur and offers a duplicates='drop' option, while scikit-learn merges adjacent identical edges and may return fewer than K buckets when too many values land at the same number.
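The recipe can be sketched directly with numpy; the helper name `quantile_bucketize` is illustrative, and np.quantile handles the sorting internally:

```python
import numpy as np

def quantile_bucketize(x, k):
    # Interior cut points at the 1/K, 2/K, ..., (K-1)/K sample quantiles.
    edges = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])
    # np.digitize maps each value to a bucket index 0 .. K-1.
    return np.digitize(x, edges), edges

rng = np.random.default_rng(0)
x = rng.lognormal(size=10_000)              # right-skewed, like income
buckets, edges = quantile_bucketize(x, 10)
counts = np.bincount(buckets, minlength=10)
```

On continuous data with no ties, every bucket receives the same count up to rounding; with heavy ties the counts can diverge badly, which is the complication discussed next.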
A practical complication is that real-world features often contain repeated values. If 30 percent of the column is exactly zero (a common pattern in spend, click, and dosage features), several of the requested cut points will collide on the value zero. Handling this gracefully is one of the more annoying details in production pipelines.
A second complication is computational cost. Naive sorting is O(n log n) per feature, which is fine for tens of millions of rows but starts to dominate training time on very wide tabular datasets. Streaming approximations such as the Greenwald-Khanna sketch (Greenwald and Khanna, 2001), the t-digest (Dunning and Ertl, 2014), and weighted quantile sketches (Chen and Guestrin, 2016) compute approximate quantiles in a single pass with bounded error. These sketches are what histogram-based gradient boosting libraries use under the hood.
The result is an ordinal feature with K levels. It is common to follow bucketing with a downstream encoder such as one-hot encoding for linear models, ordinal encoding for tree models, weight of evidence for credit scoring, or an entity embedding lookup for neural networks.
Quantile bucketing is one of several discretization methods. Each has different assumptions about where bin edges should fall.
| Strategy | How edges are chosen | Uses target labels | Strengths | Weaknesses |
|---|---|---|---|---|
| Equal-width | Range [min, max] divided into K intervals of equal width | No | Simple, easy to explain, preserves notion of distance | Very sensitive to outliers, often produces nearly empty bins for skewed data |
| Equal-frequency (quantile) | Cut points placed at sample quantiles so each bin has about n/K examples | No | Robust to outliers, balanced counts, good default for skewed data | Adjacent bins may cover very different value ranges; loses the original distance scale |
| K-means | One-dimensional k-means clustering of the values; bin edges sit between cluster centers | No | Adapts to multimodal distributions | Sensitive to initialization, can produce empty bins, less interpretable |
| Decision tree | A shallow decision tree is fit on the single feature and the leaf splits become bin edges | Yes | Edges align with target signal, tunable by depth | Risk of overfitting; needs cross-validation |
| MDLP (Fayyad and Irani, 1993) | Recursive binary splits chosen to minimize class entropy, with a Minimum Description Length stopping rule | Yes | Picks the number of bins automatically; strong empirical track record | Designed for classification only, harder to extend to regression |
| ChiMerge | Adjacent intervals are merged if a chi-square test cannot reject independence with the target | Yes | Statistically grounded, monotonic with target | Slower on large data; assumes a meaningful chi-square test |
| Custom or domain-driven | Edges chosen by hand, e.g., age groups 0 to 12, 13 to 17, 18 to 64, 65 plus | No | Encodes prior knowledge, very interpretable | Fragile if the underlying distribution shifts |
Garcia and colleagues' 2013 survey in IEEE Transactions on Knowledge and Data Engineering catalogued more than 80 discretization algorithms and concluded that supervised methods like MDLP usually beat unsupervised ones for classification accuracy, while unsupervised methods like quantile binning remain popular because they are fast, label-free, and reusable across tasks.
MDLP, introduced by Fayyad and Irani (1993) in their paper "Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning," is the classical supervised discretizer. It sorts the feature, considers every midpoint between adjacent values as a candidate cut, picks the cut that most reduces the class entropy of the partition, and recurses on each side. The recursion stops when the minimum description length principle says further splits would cost more bits to encode than they save in entropy. MDLP often produces fewer, more meaningful bins than quantile bucketing because it can ignore parts of the range where the target is constant. The price is that the cut points depend on the labels, so they must be fit on training data only and held fixed at inference time.
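The core move of MDLP, choosing the midpoint cut that most reduces class entropy, can be sketched in a few lines; this sketch performs a single split and omits the recursion and the MDL stopping rule, and the function names are illustrative:

```python
import numpy as np

def entropy(y):
    # Class entropy in bits.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_entropy_cut(x, y):
    # One MDLP-style step: try every midpoint between adjacent sorted
    # values and keep the cut that most reduces weighted class entropy.
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(xs)
    base = entropy(ys)
    best_gain, best_cut = 0.0, None
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue                       # no valid cut between ties
        gain = base - (i / n) * entropy(ys[:i]) - ((n - i) / n) * entropy(ys[i:])
        if gain > best_gain:
            best_gain, best_cut = gain, (xs[i] + xs[i - 1]) / 2
    return best_cut, best_gain

# Two well-separated classes: the best cut falls between 3 and 10.
cut, gain = best_entropy_cut(np.array([1.0, 2, 3, 10, 11, 12]),
                             np.array([0, 0, 0, 1, 1, 1]))
```

MDLP proper recurses on each side of the cut and stops when the description-length test fails; this sketch shows only why the chosen cut points track the target rather than the marginal distribution.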
Weight-of-evidence binning, common in credit scoring since the 1980s, takes the supervised idea further by replacing the bin index with the weight of evidence of that bin, a log ratio of good to bad outcomes within it. The resulting feature is unitless, monotone with respect to the target by construction (after merging non-monotone bins), and well suited for the logistic regression scorecards that regulators expect (Refaat, 2011). WoE binning is usually preceded by a quantile or supervised binning step that proposes the initial cut points.
Discretization throws information away on purpose. In return it offers several practical benefits that show up especially for linear models, simple Bayesian classifiers, and any pipeline where features need to be categorical.
The most important is to let a linear model capture nonlinear behavior. A linear regression or logistic regression on the raw feature x assumes the response is monotonic and roughly proportional to x. After quantile bucketing into ten bins and one-hot encoding, the model fits ten separate intercepts, one per decile, with no shape assumption between them. This is the cheapest possible nonlinear transform.
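This can be demonstrated on synthetic data: with a pure one-hot design, the least-squares fit recovers exactly the per-bucket mean of the target, one free intercept per decile with no shape assumption between them.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 3, 5_000)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)   # nonlinear in x

# Decile-bucket x, then one-hot encode the bucket index.
edges = np.quantile(x, np.linspace(0, 1, 11)[1:-1])
buckets = np.digitize(x, edges)
onehot = np.eye(10)[buckets]

# Linear least squares on the indicators: one intercept per decile.
coef, *_ = np.linalg.lstsq(onehot, y, rcond=None)

# Each fitted coefficient equals the mean of y within its bucket.
bucket_means = np.array([y[buckets == b].mean() for b in range(10)])
```

The fitted step function tracks sin(x) without the model ever being told the relationship is nonlinear.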
A second motivation is robustness to outliers. A single observation with an unrealistic value, perhaps a data-entry typo of 1,000,000 for a typical age column, can dominate the parameter estimates in a regression. The same value, after quantile bucketing, simply falls into the top bin alongside all the other large values. Quantile bucketing also dampens the influence of long tails: the rich and the very rich both end up in the top decile of income. This is what Google's Machine Learning Crash Course means when it says quantile buckets give extra information space to the large torso of a distribution while compressing the long tail.
A third motivation is regularization. Replacing a continuous feature with a small number of bin indicators reduces the effective resolution and can mitigate overfitting on small datasets. The K hyperparameter becomes a knob: smaller K gives more regularization at the cost of information loss; larger K approaches the original feature with more parameters to learn.
A fourth motivation is interpretability. Categorical buckets are easier to communicate to non-technical stakeholders. Saying "customers in the top quartile of recency are 3.2x more likely to churn" is more intuitive than reporting a logistic coefficient on log(days_since_last_purchase).
A fifth motivation is non-parametric handling of skew. Skewed features are often handled with a log transform or standardization. Quantile bucketing is a non-parametric alternative that does not assume positivity and does not need a closed-form transform.
Finally, certain algorithm families require or strongly benefit from discrete inputs. Classical Naive Bayes implementations assume categorical features. Rule-mining and association-rule algorithms operate on discrete items. Histogram-based gradient boosting performs its own internal discretization for speed, which is itself a form of quantile bucketing.
The largest single user of quantile-style binning today is the histogram split finder used inside modern boosting libraries.
LightGBM bins each continuous feature into at most max_bin discrete bins, with a default of 255, before training begins. The bins are constructed from a sample of size bin_construct_sample_cnt (default 200,000) and the bin layout is fixed for the rest of training. LightGBM uses an unsigned 8-bit integer for feature values when max_bin = 255, which keeps the histogram compact. During training, finding the best split for a feature reduces to scanning a histogram of gradient and Hessian sums over those bins, which collapses the per-feature complexity from O(n log n) to O(max_bin) and is the main reason LightGBM is fast (Ke et al., 2017).
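The histogram scan can be illustrated with a simplified numpy sketch. This is a variance-gain proxy on gradients only; real implementations also accumulate Hessians and apply regularization terms, so treat this as the shape of the algorithm rather than LightGBM's actual internals:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=2_000)
# Pseudo-gradients that flip sign at x = 0.3, plus noise.
g = np.where(x > 0.3, -1.0, 1.0) + rng.normal(scale=0.1, size=x.size)

max_bin = 16
edges = np.quantile(x, np.linspace(0, 1, max_bin + 1)[1:-1])
bins = np.digitize(x, edges)                  # quantile bin index per row

# One pass over the data: gradient sum and example count per bin.
grad_hist = np.bincount(bins, weights=g, minlength=max_bin)
cnt_hist = np.bincount(bins, minlength=max_bin)

# Scan max_bin - 1 candidate splits instead of n - 1 raw thresholds.
G_total, n_total = grad_hist.sum(), cnt_hist.sum()
best_gain, best_cut = -np.inf, None
g_left, n_left = 0.0, 0
for b in range(max_bin - 1):
    g_left += grad_hist[b]
    n_left += cnt_hist[b]
    n_right = n_total - n_left
    if n_left == 0 or n_right == 0:
        continue
    gain = (g_left**2 / n_left
            + (G_total - g_left)**2 / n_right
            - G_total**2 / n_total)
    if gain > best_gain:
        best_gain, best_cut = gain, b
```

The chosen split lands at the quantile edge nearest the true changepoint at 0.3, and the scan touched 15 candidates instead of 1,999.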
XGBoost offers tree_method='hist', which has been the default since XGBoost 2.0. Continuous features are bucketed using a weighted quantile sketch (Chen and Guestrin, 2016), the original contribution of the XGBoost paper that made distributed, sparsity-aware quantile estimation tractable. The max_bin parameter defaults to 256. Unlike the older tree_method='approx', which recomputes bins per iteration, hist reuses the bins across all iterations, which is what makes it the fastest tree method in the library (XGBoost developers, 2024).
CatBoost performs a similar quantization controlled by border_count (default 254 for the GPU implementation, 32 to 254 for CPU depending on mode). For numeric features it offers several border_type strategies including uniform, median, min-entropy, and uniform-and-quantiles, the last of which combines equal-width and equal-frequency cuts to better cover both dense and sparse regions of the distribution.
Scikit-learn's own HistGradientBoostingClassifier and HistGradientBoostingRegressor implement the same approach in pure Python plus Cython. The internal bin mapper performs equal-frequency bucketing into at most max_bins (default 255) bins per feature.
Lowering max_bin speeds up training and adds regularization at the cost of fine-grained splits. The LightGBM documentation suggests values from 63 to 255 in practice, with 63 commonly used on GPU. The consequence is that practitioners who use modern tree ensembles get quantile bucketing for free and almost never need to do it manually before training. Where manual bucketing still pays off is in the linear and additive models that sit alongside or downstream of those tree ensembles, in feature stores that need a stable categorical representation across many models, and in scorecards that must be human-readable.
There is no universal best K. The trade-off is straightforward: more buckets retain more information but reduce regularization and slow down downstream models; fewer buckets generalize better but can wash out useful structure.
| Use case | Typical K |
|---|---|
| Coarse exposure features in scorecards (income band, age band) | 4 to 10 |
| Generic tabular feature engineering for linear models | 10 (deciles) |
| WOE binning for credit risk | 5 to 20, often optimized via monotonicity constraints |
| Histogram boosting (max_bin) | 63 to 255 |
| Embedding lookup keys for tabular deep learning | 32, 64, 128, or 256 |
Very small K (2 to 4) gives strong regularization and produces interpretable buckets such as low-medium-high. It loses most of the variance in the original feature, which is acceptable when the feature is weakly informative or when the downstream model is itself low-capacity. Medium K (10 to 20) is a common default. Deciles are easy to communicate and capture most of the shape of a typical continuous distribution. Large K (100 to 256) is what histogram-based tree libraries use internally; at this granularity, the discretization is essentially lossless for the purpose of finding a good split, and the speed and memory benefits of histogram aggregation dominate.
In practice K is treated as a hyperparameter and tuned with cross-validation, often jointly with the encoder choice (one-hot vs ordinal vs WOE). When the downstream model already includes regularization, validation performance is usually flat across a wide range of K, which means K can be picked for interpretability rather than predictive accuracy.
| Library | API | Notes |
|---|---|---|
| scikit-learn | KBinsDiscretizer(n_bins=K, strategy='quantile', encode='ordinal') | Default strategy is quantile; alternatives are uniform and kmeans. Encoding can be ordinal, onehot, or onehot-dense. Default subsample=200_000 since version 1.3 to keep fitting fast on large arrays |
| pandas | pd.qcut(x, q=K, labels=False, duplicates='drop') | Returns a categorical or integer codes; duplicates='drop' handles ties; retbins=True to capture edges for reuse |
| scikit-learn | quantile_transform | Maps the feature to a uniform or normal distribution via its empirical CDF; continuous output rather than discrete bins |
| numpy | np.quantile(x, np.linspace(0, 1, K + 1)) plus np.digitize | Manual but flexible; useful for custom pipelines |
| LightGBM | max_bin=255, bin_construct_sample_cnt=200000 | Histogram bins are built from a sample of the training data |
| XGBoost | tree_method='hist', max_bin=256 | Sketch is run once at the start of training |
| BigQuery ML | ML.QUANTILE_BUCKETIZE(value, num_buckets) | SQL-side quantile bucketing for in-warehouse feature engineering |
| TensorFlow | tf.feature_column.bucketized_column (TF1) and tf.keras.layers.Discretization (TF2) | Edges supplied as a list; can be derived from quantile statistics |
| Spark MLlib | QuantileDiscretizer | Distributed quantile binning for large datasets |
| feature-engine | EqualFrequencyDiscretiser(q=10) | Drop-in scikit-learn-compatible transformer; pairs naturally with WoEEncoder for credit-risk pipelines |
For very large datasets the subsample parameter on KBinsDiscretizer matters in practice: scikit-learn changed the default from None (use all rows) to 200,000 in version 1.3 specifically because sorting tens of millions of values per column was making fit slow enough that users were skipping the step (scikit-learn developers, 2024).
Suppose a dataset of 100,000 home sales has prices ranging from $30,000 to $48,000,000, with most homes between $200,000 and $700,000 and a long tail of luxury properties. Equal-width binning into ten buckets would put almost every house into the lowest bucket and leave the top nine buckets nearly empty.
With quantile bucketing into ten buckets, the cut points might look like this.
| Decile | Price range ($) | Bucket size |
|---|---|---|
| 1 | 0 to 145,000 | 10,000 |
| 2 | 145,000 to 195,000 | 10,000 |
| 3 | 195,000 to 240,000 | 10,000 |
| 4 | 240,000 to 285,000 | 10,000 |
| 5 | 285,000 to 335,000 | 10,000 |
| 6 | 335,000 to 395,000 | 10,000 |
| 7 | 395,000 to 475,000 | 10,000 |
| 8 | 475,000 to 600,000 | 10,000 |
| 9 | 600,000 to 900,000 | 10,000 |
| 10 | 900,000 to 48,000,000 | 10,000 |
Notice how the bottom buckets are narrow and the top bucket is huge. That is the point: the model gets fine resolution where most of the data lives, and the long tail of mansions all fall into one bucket where their exact price matters less.
A second example shows the failure mode of duplicate values. Suppose pageviews_last_7_days is zero for 60 percent of users. Asking for ten quantile bins forces six of the cut points to land at zero. With duplicates='drop', pandas silently returns five bins instead of ten: one for users with zero pageviews and four for the active 40 percent. This is usually the right behavior, but downstream code that hard-codes ten bins will break.
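The scenario can be reproduced with pd.qcut; the synthetic data below is chosen so that exactly 60 percent of values are zero and the remaining values are distinct:

```python
import numpy as np
import pandas as pd

# 6,000 zero-pageview users plus 4,000 active users with distinct counts.
pageviews = pd.Series([0] * 6_000 + list(range(1, 4_001)))

# Ask for 10 quantile bins; duplicate zero-valued edges are dropped.
codes, edges = pd.qcut(pageviews, q=10, labels=False,
                       duplicates='drop', retbins=True)
```

All zero-pageview users share bucket 0, and the remaining buckets split the active users; code that assumed ten buckets would index past the end.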
A bucketed feature is just an ordinal integer. To feed it into a model it usually needs another encoding step.
The most common choice is one-hot encoding, which expands one categorical column into K indicator columns and lets a linear model fit a separate intercept per bin. With K in the tens this is straightforward; with K in the hundreds the resulting matrix becomes wide and sparse, which favors L1-regularized models such as the lasso.
Ordinal encoding keeps the integer bin index as a single column. This is cheap and is the right choice for tree models, which split on order rather than identity. For linear models it implicitly assumes that bin indices are equally spaced on the response scale, which defeats most of the reason to bucket in the first place.
Weight-of-evidence encoding, common in finance, replaces each bin with its weight of evidence computed on the training data: the log ratio of the bin's share of non-events to its share of events, log(P(bucket | non-event) / P(bucket | event)). The result is a single continuous, monotone, target-aware column that plugs directly into a logistic regression. WoE encoding leaks target information into the feature, so it must be fit only on training folds and applied to validation and test folds, the same hygiene rule that applies to any target encoder.
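A minimal WoE computation can be sketched in numpy. The helper name, the additive smoothing `eps`, and the exact non-event-over-event sign convention are assumptions of this sketch rather than a universal standard:

```python
import numpy as np

def woe_per_bucket(buckets, y, eps=0.5):
    # WoE for bucket b: log(share of non-events in b / share of events in b),
    # with additive smoothing `eps` so pure buckets do not yield log(0).
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    woe = {}
    for b in np.unique(buckets):
        mask = buckets == b
        pos = y[mask].sum() + eps                # events in this bucket
        neg = mask.sum() - y[mask].sum() + eps   # non-events in this bucket
        woe[b] = float(np.log((neg / (n_neg + eps)) / (pos / (n_pos + eps))))
    return woe

# Bucket 0 is mostly non-events, bucket 1 is mostly events.
buckets = np.array([0] * 100 + [1] * 100)
y = np.array([0] * 90 + [1] * 10 + [0] * 50 + [1] * 50)
woe_map = woe_per_bucket(buckets, y)
```

Under this convention, buckets dominated by non-events get positive WoE and buckets dominated by events get negative WoE; flipping the ratio flips the signs but not the information content.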
Learned embeddings turn each bin into a low-dimensional vector trained jointly with the rest of the network. Tabular deep learning architectures such as TabNet (Arik and Pfister, 2021) and FT-Transformer (Gorishniy et al., 2021) use this idea for both categorical and numeric features, often combined with quantile bucketing of the numeric inputs as a tokenization step.
Quantile bucketing is simple, but a few failure modes recur often enough to be worth flagging.
The first cost is information loss. Two values inside the same bucket are indistinguishable to the downstream model. With only a handful of buckets this can erase a real signal.
The second cost is boundary artifacts. Two values just on either side of a cut point are treated as completely different. Whether 199,999 and 200,001 belong in different buckets is somewhat arbitrary. Some practitioners use overlapping windows or a soft binning encoding (sometimes called binning with a sigmoid) to soften these jumps.
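One version of soft binning can be sketched as sigmoid ramps; the function name and the `temperature` parameterization are assumptions of the sketch:

```python
import numpy as np

def soft_bin(x, edges, temperature=0.1):
    # Membership of bin k = sigmoid((x - e_{k-1}) / t) - sigmoid((x - e_k) / t):
    # values near a cut point get partial weight in both neighboring bins.
    x = np.asarray(x, dtype=float)[:, None]
    s = 1.0 / (1.0 + np.exp(-(x - edges) / temperature))   # (n, K-1) ramps
    ones = np.ones((x.shape[0], 1))
    zeros = np.zeros((x.shape[0], 1))
    left = np.concatenate([ones, s], axis=1)
    right = np.concatenate([s, zeros], axis=1)
    return left - right        # rows telescope, summing exactly to 1

# Three bins with edges at 0 and 1; 0.5 sits well inside the middle bin.
memberships = soft_bin([-2.0, 0.5, 3.0], np.array([0.0, 1.0]))
```

Far from any edge the encoding is effectively one-hot; near an edge the weight is shared, so the 199,999-vs-200,001 discontinuity becomes a smooth transition.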
The third cost is distribution shift. The cut points are fit on training data and held fixed at inference time. If production data drifts (a common occurrence in deployed systems), the bucket counts no longer match the original quantiles and the model sees out-of-distribution buckets. Periodic re-fitting is needed, and monitoring the per-bin frequencies on production traffic is one way to detect drift.
The fourth cost is data leakage. Cut points must be computed from the training fold only. Computing quantiles over train and test together leaks test information into the model.
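The correct hygiene is mechanical: fit the edges on the training fold, then reuse them frozen everywhere else. A numpy sketch with illustrative variable names:

```python
import numpy as np

rng = np.random.default_rng(5)
x_train = rng.lognormal(size=8_000)
x_test = rng.lognormal(size=2_000)

# Fit: quantile edges come from the training fold ONLY.
edges = np.quantile(x_train, np.linspace(0, 1, 11)[1:-1])

# Transform: the same frozen edges are applied to both folds.
train_codes = np.digitize(x_train, edges)
test_codes = np.digitize(x_test, edges)
```

Train buckets are balanced by construction; test buckets will only be approximately balanced, and monitoring how far they drift is the production check described above.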
The fifth cost is ties and discrete features. Variables with many repeated values, like a binary indicator or a count with lots of zeros, may end up with fewer than K distinct buckets after deduplication.
The sixth cost is loss of monotonic interpretation. A coefficient on the raw feature in linear regression has a clear meaning. After bucketing and one-hot encoding, that meaning splits across K coefficients.
The seventh cost is redundancy with tree models. Modern gradient boosted trees can split directly on raw numeric features. Pre-bucketing them rarely helps and sometimes hurts because the tree loses the option to split between two adjacent buckets. The exception is internal histogram binning for speed, which is a separate concern from preprocessing.
Finally, quantile bucketing is not standard in image and text models. Convolutional networks and transformers operate on raw pixel intensities or token embeddings. Quantile bucketing is almost exclusive to tabular pipelines.
Despite being a decades-old idea, quantile bucketing has not gone away. Three places where it still matters in 2026 stand out.
In tabular deep learning, TabNet and FT-Transformer explicitly bucket numeric features and learn an embedding per bucket, which often outperforms feeding raw normalized values into the network. Gorishniy et al. (2021) showed that quantile-based numeric tokenization is a key ingredient in FT-Transformer's competitive performance against gradient boosting on standard tabular benchmarks.
In credit scoring and other regulated industries, WoE on quantile bins remains the dominant feature engineering pattern for logistic regression scorecards because regulators expect interpretable, monotonic transforms. The pattern dates to the 1980s and has survived several waves of fashion in machine learning.
Inside boosted trees, histogram split finding is the default in LightGBM, in XGBoost's hist and gpu_hist modes, and in CatBoost. Every tree these libraries grow is built on top of a quantile-style bin layout that was computed once at the start of training. In other words, even data scientists who never write a qcut call still rely on quantile bucketing every time they call lightgbm.train.
Quantile bucketing is rare in computer vision and natural language processing pipelines, where the dominant representations are continuous (pixel intensities, learned embeddings) and the model architectures handle continuous signals natively. Where it does appear in those domains, it is usually as part of a tabular side-channel: structured metadata about an image or a document being mixed with the learned features.
Imagine you are sorting a giant pile of marbles into ten boxes. You could decide ahead of time that box 1 holds tiny marbles, box 10 holds huge marbles, and the boxes in between cover sizes evenly. That is equal-width bucketing. The trouble is that almost all your marbles are small, so most of them pile into box 1 and the other boxes sit empty.
Quantile bucketing fixes this. You line up all the marbles from smallest to largest, then you grab the first tenth of the line and put them in box 1, the next tenth in box 2, and so on. Now every box has the same number of marbles. The boxes that hold tiny marbles cover only a narrow size range, and the box that holds the giant marbles covers a huge size range, but each box has the same count.
Machine learning models like this because they get to learn one rule per box, and every box has enough examples to learn from.