See also: Discretization, Feature engineering, Quantile
Quantile bucketing, also called quantile binning, equal-frequency binning, or quantile discretization, is a feature engineering technique that converts a continuous numeric variable into a small number of ordered, discrete buckets, where each bucket contains roughly the same number of training examples. The bucket edges are placed at sample quantiles of the training distribution: for ten buckets the splits sit at the 10th, 20th, ..., 90th percentiles, so each bucket receives about 10 percent of the data.
Quantile bucketing is one of several strategies for discretization, the broader family of techniques that turn numeric features into categorical ones. It is a workhorse preprocessing step in tabular machine learning, and a quantile-style binning step also runs internally in modern histogram-based gradient boosting libraries such as LightGBM and XGBoost.
The defining property of quantile bucketing is that bin widths adapt to the data. Where many examples cluster (a dense part of the distribution), bins are narrow. Where examples are spread out (the tails), bins are wide. This is the opposite of equal-width binning, where bins all span the same range and the bin counts vary. Adaptive widths are what make quantile bucketing robust to skew and to extreme values: a long right tail simply ends up in one wide bin rather than producing dozens of nearly empty bins as it would under equal-width binning.
Given a continuous training column x with n observations and a desired number of buckets K, quantile bucketing produces K - 1 cut points and assigns every value to one of K ordered buckets. The standard recipe has three steps.
1. Sort x in ascending order.
2. Compute the 1/K, 2/K, ..., (K-1)/K sample quantiles; these are the cut points. For K = 4 (quartiles), the cut points are the 25th, 50th, and 75th percentiles. For K = 10 (deciles), they are the 10th through 90th percentiles.
3. Assign every value to the bucket whose cut points enclose it, yielding K ordered buckets from the K - 1 cut points.

With ties in the data the cut points may not produce exactly equal counts. Libraries differ in how they break ties: pandas raises an error by default if duplicate edges occur and offers a duplicates='drop' option, while scikit-learn merges adjacent identical edges and may return fewer than K buckets when too many values land at the same number.
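The recipe can be sketched directly with numpy; the helper name `quantile_bucketize` is illustrative, and np.quantile handles the sorting internally:

```python
import numpy as np

def quantile_bucketize(x, k):
    # Interior cut points at the 1/K, 2/K, ..., (K-1)/K sample quantiles.
    edges = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])
    # np.digitize maps each value to a bucket index 0 .. K-1.
    return np.digitize(x, edges), edges

rng = np.random.default_rng(0)
x = rng.lognormal(size=10_000)              # right-skewed, like income
buckets, edges = quantile_bucketize(x, 10)
counts = np.bincount(buckets, minlength=10)
```

On continuous data with no ties, every bucket receives the same count up to rounding; with heavy ties the counts can diverge badly, which is the complication discussed next.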
A practical complication is that real-world features often contain repeated values. If 30 percent of the column is exactly zero (a common pattern in spend, click, and dosage features), several of the requested cut points will collide on the value zero. Handling this gracefully is one of the more annoying details in production pipelines.
A second complication is computational cost. Naive sorting is O(n log n) per feature, which is fine for tens of millions of rows but starts to dominate training time on very wide tabular datasets. Streaming approximations such as the Greenwald-Khanna sketch (Greenwald and Khanna, 2001), the t-digest (Dunning and Ertl, 2014), and weighted quantile sketches (Chen and Guestrin, 2016) compute approximate quantiles in a single pass with bounded error. These sketches are what histogram-based gradient boosting libraries use under the hood.
The result is an ordinal feature with K levels. It is common to follow bucketing with a downstream encoder such as one-hot encoding for linear models, ordinal encoding for tree models, weight of evidence for credit scoring, or an entity embedding lookup for neural networks.
Quantile bucketing is one of several discretization methods. Each has different assumptions about where bin edges should fall.
| Strategy | How edges are chosen | Uses target labels | Strengths | Weaknesses |
|---|---|---|---|---|
| Equal-width | Range [min, max] divided into K intervals of equal width | No | Simple, easy to explain, preserves notion of distance | Very sensitive to outliers, often produces nearly empty bins for skewed data |
| Equal-frequency (quantile) | Cut points placed at sample quantiles so each bin has about n/K examples | No | Robust to outliers, balanced counts, good default for skewed data | Adjacent bins may cover very different value ranges; loses the original distance scale |
| K-means | One-dimensional k-means clustering of the values; bin edges sit between cluster centers | No | Adapts to multimodal distributions | Sensitive to initialization, can produce empty bins, less interpretable |
| Decision tree | A shallow decision tree is fit on the single feature and the leaf splits become bin edges | Yes | Edges align with target signal, tunable by depth | Risk of overfitting; needs cross-validation |
| MDLP (Fayyad and Irani, 1993) | Recursive binary splits chosen to minimize class entropy, with a Minimum Description Length stopping rule | Yes | Picks the number of bins automatically; strong empirical track record | Designed for classification only, harder to extend to regression |
| ChiMerge | Adjacent intervals are merged if a chi-square test cannot reject independence with the target | Yes | Statistically grounded, monotonic with target | Slower on large data; assumes a meaningful chi-square test |
| Custom or domain-driven | Edges chosen by hand, e.g., age groups 0 to 12, 13 to 17, 18 to 64, 65 plus | No | Encodes prior knowledge, very interpretable | Fragile if the underlying distribution shifts |
Garcia and colleagues' 2013 survey in IEEE Transactions on Knowledge and Data Engineering catalogued more than 80 discretization algorithms and concluded that supervised methods like MDLP usually beat unsupervised ones for classification accuracy, while unsupervised methods like quantile binning remain popular because they are fast, label-free, and reusable across tasks.
MDLP, introduced by Fayyad and Irani (1993) in their paper "Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning," is the classical supervised discretizer. It sorts the feature, considers every midpoint between adjacent values as a candidate cut, picks the cut that most reduces the class entropy of the partition, and recurses on each side. The recursion stops when the minimum description length principle says further splits would cost more bits to encode than they save in entropy. MDLP often produces fewer, more meaningful bins than quantile bucketing because it can ignore parts of the range where the target is constant. The price is that the cut points depend on the labels, so they must be fit on training data only and held fixed at inference time.
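The core move of MDLP, choosing the midpoint cut that most reduces class entropy, can be sketched in a few lines; this sketch performs a single split and omits the recursion and the MDL stopping rule, and the function names are illustrative:

```python
import numpy as np

def entropy(y):
    # Class entropy in bits.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_entropy_cut(x, y):
    # One MDLP-style step: try every midpoint between adjacent sorted
    # values and keep the cut that most reduces weighted class entropy.
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(xs)
    base = entropy(ys)
    best_gain, best_cut = 0.0, None
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue                       # no valid cut between ties
        gain = base - (i / n) * entropy(ys[:i]) - ((n - i) / n) * entropy(ys[i:])
        if gain > best_gain:
            best_gain, best_cut = gain, (xs[i] + xs[i - 1]) / 2
    return best_cut, best_gain

# Two well-separated classes: the best cut falls between 3 and 10.
cut, gain = best_entropy_cut(np.array([1.0, 2, 3, 10, 11, 12]),
                             np.array([0, 0, 0, 1, 1, 1]))
```

MDLP proper recurses on each side of the cut and stops when the description-length test fails; this sketch shows only why the chosen cut points track the target rather than the marginal distribution.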
Weight-of-evidence binning, common in credit scoring since the 1980s, takes the supervised idea further by replacing the bin index with the weight of evidence of that bin, a log ratio of good to bad outcomes within it. The resulting feature is unitless, monotone with respect to the target by construction (after merging non-monotone bins), and well suited for the logistic regression scorecards that regulators expect (Refaat, 2011). WoE binning is usually preceded by a quantile or supervised binning step that proposes the initial cut points.
Discretization throws information away on purpose. In return it offers several practical benefits that show up especially for linear models, simple Bayesian classifiers, and any pipeline where features need to be categorical.
The most important is to let a linear model capture nonlinear behavior. A linear regression or logistic regression on the raw feature x assumes the response is monotonic and roughly proportional to x. After quantile bucketing into ten bins and one-hot encoding, the model fits ten separate intercepts, one per decile, with no shape assumption between them. This is the cheapest possible nonlinear transform.
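This can be demonstrated on synthetic data: with a pure one-hot design, the least-squares fit recovers exactly the per-bucket mean of the target, one free intercept per decile with no shape assumption between them.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 3, 5_000)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)   # nonlinear in x

# Decile-bucket x, then one-hot encode the bucket index.
edges = np.quantile(x, np.linspace(0, 1, 11)[1:-1])
buckets = np.digitize(x, edges)
onehot = np.eye(10)[buckets]

# Linear least squares on the indicators: one intercept per decile.
coef, *_ = np.linalg.lstsq(onehot, y, rcond=None)

# Each fitted coefficient equals the mean of y within its bucket.
bucket_means = np.array([y[buckets == b].mean() for b in range(10)])
```

The fitted step function tracks sin(x) without the model ever being told the relationship is nonlinear.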
A second motivation is robustness to outliers. A single observation with an unrealistic value, perhaps a data-entry typo of 1,000,000 for a typical age column, can dominate the parameter estimates in a regression. The same value, after quantile bucketing, simply falls into the top bin alongside all the other large values. Quantile bucketing also dampens the influence of long tails: the rich and the very rich both end up in the top decile of income. This is what Google's Machine Learning Crash Course means when it says quantile buckets give extra information space to the large torso of a distribution while compressing the long tail.
A third motivation is regularization. Replacing a continuous feature with a small number of bin indicators reduces the effective resolution and can mitigate overfitting on small datasets. The K hyperparameter becomes a knob: smaller K gives more regularization at the cost of information loss; larger K approaches the original feature with more parameters to learn.
A fourth motivation is interpretability. Categorical buckets are easier to communicate to non-technical stakeholders. Saying "customers in the top quartile of recency are 3.2x more likely to churn" is more intuitive than reporting a logistic coefficient on log(days_since_last_purchase).
A fifth motivation is non-parametric handling of skew. Skewed features are often handled with a log transform or standardization. Quantile bucketing is a non-parametric alternative that does not assume positivity and does not need a closed-form transform.
Finally, certain algorithm families require or strongly benefit from discrete inputs. Classical Naive Bayes implementations assume categorical features. Rule-mining and association-rule algorithms operate on discrete items. Histogram-based gradient boosting performs its own internal discretization for speed, which is itself a form of quantile bucketing.
The largest single user of quantile-style binning today is the histogram split finder used inside modern boosting libraries.
LightGBM bins each continuous feature into at most max_bin discrete bins, with a default of 255, before training begins. The bins are constructed from a sample of size bin_construct_sample_cnt (default 200,000) and the bin layout is fixed for the rest of training. LightGBM uses an unsigned 8-bit integer for feature values when max_bin = 255, which keeps the histogram compact. During training, finding the best split for a feature reduces to scanning a histogram of gradient and Hessian sums over those bins, which collapses the per-feature complexity from O(n log n) to O(max_bin) and is the main reason LightGBM is fast (Ke et al., 2017).
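The histogram scan can be illustrated with a simplified numpy sketch. This is a variance-gain proxy on gradients only; real implementations also accumulate Hessians and apply regularization terms, so treat this as the shape of the algorithm rather than LightGBM's actual internals:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=2_000)
# Pseudo-gradients that flip sign at x = 0.3, plus noise.
g = np.where(x > 0.3, -1.0, 1.0) + rng.normal(scale=0.1, size=x.size)

max_bin = 16
edges = np.quantile(x, np.linspace(0, 1, max_bin + 1)[1:-1])
bins = np.digitize(x, edges)                  # quantile bin index per row

# One pass over the data: gradient sum and example count per bin.
grad_hist = np.bincount(bins, weights=g, minlength=max_bin)
cnt_hist = np.bincount(bins, minlength=max_bin)

# Scan max_bin - 1 candidate splits instead of n - 1 raw thresholds.
G_total, n_total = grad_hist.sum(), cnt_hist.sum()
best_gain, best_cut = -np.inf, None
g_left, n_left = 0.0, 0
for b in range(max_bin - 1):
    g_left += grad_hist[b]
    n_left += cnt_hist[b]
    n_right = n_total - n_left
    if n_left == 0 or n_right == 0:
        continue
    gain = (g_left**2 / n_left
            + (G_total - g_left)**2 / n_right
            - G_total**2 / n_total)
    if gain > best_gain:
        best_gain, best_cut = gain, b
```

The chosen split lands at the quantile edge nearest the true changepoint at 0.3, and the scan touched 15 candidates instead of 1,999.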
XGBoost offers tree_method='hist', which has been the default since XGBoost 2.0. Continuous features are bucketed using a weighted quantile sketch (Chen and Guestrin, 2016), the original contribution of the XGBoost paper that made distributed, sparsity-aware quantile estimation tractable. The max_bin parameter defaults to 256. Unlike the older tree_method='approx', which recomputes bins per iteration, hist reuses the bins across all iterations, which is what makes it the fastest tree method in the library (XGBoost developers, 2024).
CatBoost performs a similar quantization controlled by border_count (default 254 for the GPU implementation, 32 to 254 for CPU depending on mode). For numeric features it offers several border_type strategies including uniform, median, min-entropy, and uniform-and-quantiles, the last of which combines equal-width and equal-frequency cuts to better cover both dense and sparse regions of the distribution.
Scikit-learn's own HistGradientBoostingClassifier and HistGradientBoostingRegressor implement the same approach in pure Python plus Cython. The internal bin mapper performs equal-frequency bucketing into at most max_bins (default 255) bins per feature.
Lowering max_bin speeds up training and adds regularization at the cost of fine-grained splits. The LightGBM documentation suggests values from 63 to 255 in practice, with 63 commonly used on GPU. The consequence is that practitioners who use modern tree ensembles get quantile bucketing for free and almost never need to do it manually before training. Where manual bucketing still pays off is in the linear and additive models that sit alongside or downstream of those tree ensembles, in feature stores that need a stable categorical representation across many models, and in scorecards that must be human-readable.
There is no universal best K. The trade-off is straightforward: more buckets retain more information but reduce regularization and slow down downstream models; fewer buckets generalize better but can wash out useful structure.
| Use case | Typical K |
|---|---|
| Coarse exposure features in scorecards (income band, age band) | 4 to 10 |
| Generic tabular feature engineering for linear models | 10 (deciles) |
| WOE binning for credit risk | 5 to 20, often optimized via monotonicity constraints |
| Histogram boosting (max_bin) | 63 to 255 |
| Embedding lookup keys for tabular deep learning | 32, 64, 128, or 256 |
Very small K (2 to 4) gives strong regularization and produces interpretable buckets such as low-medium-high. It loses most of the variance in the original feature, which is acceptable when the feature is weakly informative or when the downstream model is itself low-capacity. Medium K (10 to 20) is a common default. Deciles are easy to communicate and capture most of the shape of a typical continuous distribution. Large K (100 to 256) is what histogram-based tree libraries use internally; at this granularity, the discretization is essentially lossless for the purpose of finding a good split, and the speed and memory benefits of histogram aggregation dominate.
In practice K is treated as a hyperparameter and tuned with cross-validation, often jointly with the encoder choice (one-hot vs ordinal vs WOE). When the downstream model already includes regularization, validation performance is usually flat across a wide range of K, which means K can be picked for interpretability rather than predictive accuracy.
| Library | API | Notes |
|---|---|---|
| scikit-learn | KBinsDiscretizer(n_bins=K, strategy='quantile', encode='ordinal') | Default strategy is quantile; alternatives are uniform and kmeans. Encoding can be ordinal, onehot, or onehot-dense. Default subsample=200_000 since version 1.3 to keep fitting fast on large arrays |
| pandas | pd.qcut(x, q=K, labels=False, duplicates='drop') | Returns a categorical or integer codes; duplicates='drop' handles ties; retbins=True to capture edges for reuse |
| scikit-learn | quantile_transform | Maps the feature to a uniform or normal distribution via its empirical CDF; continuous output rather than discrete bins |
| numpy | np.quantile(x, np.linspace(0, 1, K + 1)) plus np.digitize | Manual but flexible; useful for custom pipelines |
| LightGBM | max_bin=255, bin_construct_sample_cnt=200000 | Histogram bins are built from a sample of the training data |
| XGBoost | tree_method='hist', max_bin=256 | Sketch is run once at the start of training |
| BigQuery ML | ML.QUANTILE_BUCKETIZE(value, num_buckets) | SQL-side quantile bucketing for in-warehouse feature engineering |
| TensorFlow | tf.feature_column.bucketized_column (TF1) and tf.keras.layers.Discretization (TF2) | Edges supplied as a list; can be derived from quantile statistics |
| Spark MLlib | QuantileDiscretizer | Distributed quantile binning for large datasets |
| feature-engine | EqualFrequencyDiscretiser(q=10) | Drop-in scikit-learn-compatible transformer; pairs naturally with WoEEncoder for credit-risk pipelines |
For very large datasets the subsample parameter on KBinsDiscretizer matters in practice: scikit-learn changed the default from None (use all rows) to 200,000 in version 1.3 specifically because sorting tens of millions of values per column was making fit slow enough that users were skipping the step (scikit-learn developers, 2024).
Suppose a dataset of 100,000 home sales has prices ranging from $30,000 to $48,000,000, with most homes between $200,000 and $700,000 and a long tail of luxury properties. Equal-width binning into ten buckets would put almost every house into the lowest bucket and leave the top nine buckets nearly empty.
With quantile bucketing into ten buckets, the cut points might look like this.
| Decile | Price range ($) | Bucket size |
|---|---|---|
| 1 | 0 to 145,000 | 10,000 |
| 2 | 145,000 to 195,000 | 10,000 |
| 3 | 195,000 to 240,000 | 10,000 |
| 4 | 240,000 to 285,000 | 10,000 |
| 5 | 285,000 to 335,000 | 10,000 |
| 6 | 335,000 to 395,000 | 10,000 |
| 7 | 395,000 to 475,000 | 10,000 |
| 8 | 475,000 to 600,000 | 10,000 |
| 9 | 600,000 to 900,000 | 10,000 |
| 10 | 900,000 to 48,000,000 | 10,000 |
Notice how the bottom buckets are narrow and the top bucket is huge. That is the point: the model gets fine resolution where most of the data lives, and the long tail of mansions all fall into one bucket where their exact price matters less.
A second example shows the failure mode of duplicate values. Suppose pageviews_last_7_days is zero for 60 percent of users. Asking for ten quantile bins forces six of the cut points to land at zero. With duplicates='drop', pandas silently returns five bins instead of ten: one for users with zero pageviews and four for the active 40 percent. This is usually the right behavior, but downstream code that hard-codes ten bins will break.
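The scenario can be reproduced with pd.qcut; the synthetic data below is chosen so that exactly 60 percent of values are zero and the remaining values are distinct:

```python
import numpy as np
import pandas as pd

# 6,000 zero-pageview users plus 4,000 active users with distinct counts.
pageviews = pd.Series([0] * 6_000 + list(range(1, 4_001)))

# Ask for 10 quantile bins; duplicate zero-valued edges are dropped.
codes, edges = pd.qcut(pageviews, q=10, labels=False,
                       duplicates='drop', retbins=True)
```

All zero-pageview users share bucket 0, and the remaining buckets split the active users; code that assumed ten buckets would index past the end.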
A bucketed feature is just an ordinal integer. To feed it into a model it usually needs another encoding step.
The most common choice is one-hot encoding, which expands one categorical column into K indicator columns and lets a linear model fit a separate intercept per bin. With K in the tens this is straightforward; with K in the hundreds the resulting matrix becomes wide and sparse, which favors L1-regularized models such as the lasso.
Ordinal encoding keeps the integer bin index as a single column. This is cheap and is the right choice for tree models, which split on order rather than identity. For linear models it implicitly assumes that bin indices are equally spaced on the response scale, which defeats most of the reason to bucket in the first place.
Weight-of-evidence encoding, common in finance, replaces each bin with its weight of evidence computed on the training data: the log ratio of the bin's share of non-events to its share of events, log(P(bucket | non-event) / P(bucket | event)). The result is a single continuous, monotone, target-aware column that plugs directly into a logistic regression. WoE encoding leaks target information into the feature, so it must be fit only on training folds and applied to validation and test folds, the same hygiene rule that applies to any target encoder.
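A minimal WoE computation can be sketched in numpy. The helper name, the additive smoothing `eps`, and the exact non-event-over-event sign convention are assumptions of this sketch rather than a universal standard:

```python
import numpy as np

def woe_per_bucket(buckets, y, eps=0.5):
    # WoE for bucket b: log(share of non-events in b / share of events in b),
    # with additive smoothing `eps` so pure buckets do not yield log(0).
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    woe = {}
    for b in np.unique(buckets):
        mask = buckets == b
        pos = y[mask].sum() + eps                # events in this bucket
        neg = mask.sum() - y[mask].sum() + eps   # non-events in this bucket
        woe[b] = float(np.log((neg / (n_neg + eps)) / (pos / (n_pos + eps))))
    return woe

# Bucket 0 is mostly non-events, bucket 1 is mostly events.
buckets = np.array([0] * 100 + [1] * 100)
y = np.array([0] * 90 + [1] * 10 + [0] * 50 + [1] * 50)
woe_map = woe_per_bucket(buckets, y)
```

Under this convention, buckets dominated by non-events get positive WoE and buckets dominated by events get negative WoE; flipping the ratio flips the signs but not the information content.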
Learned embeddings turn each bin into a low-dimensional vector trained jointly with the rest of the network. Tabular deep learning architectures such as TabNet (Arik and Pfister, 2021) and FT-Transformer (Gorishniy et al., 2021) use this idea for both categorical and numeric features, often combined with quantile bucketing of the numeric inputs as a tokenization step.
Quantile bucketing is simple, but a few failure modes recur often enough to be worth flagging.
The first cost is information loss. Two values inside the same bucket are indistinguishable to the downstream model. With only a handful of buckets this can erase a real signal.
The second cost is boundary artifacts. Two values just on either side of a cut point are treated as completely different. Whether 199,999 and 200,001 belong in different buckets is somewhat arbitrary. Some practitioners use overlapping windows or a soft binning encoding (sometimes called binning with a sigmoid) to soften these jumps.
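One version of soft binning can be sketched as sigmoid ramps; the function name and the `temperature` parameterization are assumptions of the sketch:

```python
import numpy as np

def soft_bin(x, edges, temperature=0.1):
    # Membership of bin k = sigmoid((x - e_{k-1}) / t) - sigmoid((x - e_k) / t):
    # values near a cut point get partial weight in both neighboring bins.
    x = np.asarray(x, dtype=float)[:, None]
    s = 1.0 / (1.0 + np.exp(-(x - edges) / temperature))   # (n, K-1) ramps
    ones = np.ones((x.shape[0], 1))
    zeros = np.zeros((x.shape[0], 1))
    left = np.concatenate([ones, s], axis=1)
    right = np.concatenate([s, zeros], axis=1)
    return left - right        # rows telescope, summing exactly to 1

# Three bins with edges at 0 and 1; 0.5 sits well inside the middle bin.
memberships = soft_bin([-2.0, 0.5, 3.0], np.array([0.0, 1.0]))
```

Far from any edge the encoding is effectively one-hot; near an edge the weight is shared, so the 199,999-vs-200,001 discontinuity becomes a smooth transition.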
The third cost is distribution shift. The cut points are fit on training data and held fixed at inference time. If production data drifts (a common occurrence in deployed systems), the bucket counts no longer match the original quantiles and the model sees out-of-distribution buckets. Periodic re-fitting is needed, and monitoring the per-bin frequencies on production traffic is one way to detect drift.
The fourth cost is data leakage. Cut points must be computed from the training fold only. Computing quantiles over train and test together leaks test information into the model.
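The correct hygiene is mechanical: fit the edges on the training fold, then reuse them frozen everywhere else. A numpy sketch with illustrative variable names:

```python
import numpy as np

rng = np.random.default_rng(5)
x_train = rng.lognormal(size=8_000)
x_test = rng.lognormal(size=2_000)

# Fit: quantile edges come from the training fold ONLY.
edges = np.quantile(x_train, np.linspace(0, 1, 11)[1:-1])

# Transform: the same frozen edges are applied to both folds.
train_codes = np.digitize(x_train, edges)
test_codes = np.digitize(x_test, edges)
```

Train buckets are balanced by construction; test buckets will only be approximately balanced, and monitoring how far they drift is the production check described above.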
The fifth cost is ties and discrete features. Variables with many repeated values, like a binary indicator or a count with lots of zeros, may end up with fewer than K distinct buckets after deduplication.
The sixth cost is loss of monotonic interpretation. A coefficient on the raw feature in linear regression has a clear meaning. After bucketing and one-hot encoding, that meaning splits across K coefficients.
The seventh cost is redundancy with tree models. Modern gradient boosted trees can split directly on raw numeric features. Pre-bucketing them rarely helps and sometimes hurts because the tree loses the option to split between two adjacent buckets. The exception is internal histogram binning for speed, which is a separate concern from preprocessing.
Finally, quantile bucketing is not standard in image and text models. Convolutional networks and transformers operate on raw pixel intensities or token embeddings. Quantile bucketing is almost exclusive to tabular pipelines.
Despite being a decades-old idea, quantile bucketing has not gone away. Three places where it still matters in 2026 stand out.
In tabular deep learning, TabNet and FT-Transformer explicitly bucket numeric features and learn an embedding per bucket, which often outperforms feeding raw normalized values into the network. Gorishniy et al. (2021) showed that quantile-based numeric tokenization is a key ingredient in FT-Transformer's competitive performance against gradient boosting on standard tabular benchmarks.
In credit scoring and other regulated industries, WoE on quantile bins remains the dominant feature engineering pattern for logistic regression scorecards because regulators expect interpretable, monotonic transforms. The pattern dates to the 1980s and has survived several waves of fashion in machine learning.
Inside boosted trees, histogram split finding is the default in LightGBM, in XGBoost's hist and gpu_hist modes, and in CatBoost. Every tree these libraries grow is built on top of a quantile-style bin layout that was computed once at the start of training. In other words, even data scientists who never write a qcut call still rely on quantile bucketing every time they call lightgbm.train.
Quantile bucketing is rare in computer vision and natural language processing pipelines, where the dominant representations are continuous (pixel intensities, learned embeddings) and the model architectures handle continuous signals natively. Where it does appear in those domains, it is usually as part of a tabular side-channel: structured metadata about an image or a document being mixed with the learned features.
Imagine you are sorting a giant pile of marbles into ten boxes. You could decide ahead of time that box 1 holds tiny marbles, box 10 holds huge marbles, and the boxes in between cover sizes evenly. That is equal-width bucketing. The trouble is that almost all your marbles are small, so most of them pile into box 1 and the other boxes sit empty.
Quantile bucketing fixes this. You line up all the marbles from smallest to largest, then you grab the first tenth of the line and put them in box 1, the next tenth in box 2, and so on. Now every box has the same number of marbles. The boxes that hold tiny marbles cover only a narrow size range, and the box that holds the giant marbles covers a huge size range, but each box has the same count.
Machine learning models like this because they get to learn one rule per box, and every box has enough examples to learn from.