See also: Machine learning terms
In machine learning and statistics, rank or ordinality describes data whose values have a meaningful order but where the spacing between values is not necessarily equal or quantitatively meaningful. A movie can be rated 1 to 5 stars, a customer can be "very dissatisfied" through "very satisfied", and a tumor can be classified as stage I, II, III, or IV; in each case the labels carry order, but the gap between a 4-star and a 5-star rating is not the same kind of quantity as the gap between 4.0 and 5.0 kilograms. Treating ordinal data correctly matters because the wrong choice of model or loss function can throw away information about order, or invent a metric structure that does not exist. This article covers the place of ordinal data in measurement theory, the main models and metrics used to handle it, and how rank-based methods turn up in modern systems such as learning to rank, preference learning, and RLHF reward modeling.
The modern vocabulary for talking about ordinality comes from the psychophysicist S. S. Stevens. In a 1946 paper in Science, Stevens proposed that all measurement falls into one of four scales: nominal, ordinal, interval, and ratio. Each scale supports a different set of admissible operations and a different family of statistics, and Stevens argued that statistics whose values change under permutations of arbitrary labels are not appropriate for nominal data, statistics that depend on the size of differences are not appropriate for ordinal data, and so on.
The table below summarizes the four scales and the operations they support.
| Scale | Distinguishes | Order | Equal intervals | True zero | Examples | Permitted statistics |
|---|---|---|---|---|---|---|
| Nominal | Yes | No | No | No | Eye color, ZIP code, categorical data labels | Mode, frequency, chi-square |
| Ordinal | Yes | Yes | No | No | Star ratings, Likert scale responses, education level, cancer stage | Median, percentiles, Spearman correlation, Kendall tau |
| Interval | Yes | Yes | Yes | No | Celsius temperature, calendar year, IQ score | Mean, standard deviation, Pearson correlation |
| Ratio | Yes | Yes | Yes | Yes | Mass, length, count, Kelvin temperature | All arithmetic, geometric mean, coefficient of variation |
Stevens' typology has been criticized for being too rigid in practice; many real measurements sit between scales, and some statisticians, including John Tukey and Frederick Lord, argued that the choice of statistic should depend on the question, not just the label on the scale. Even with those caveats, the nominal/ordinal/interval/ratio split is still the standard way to introduce the topic in statistics and machine learning courses, and the ordinal category remains the one most often mishandled.
Ordinal data carries less structure than interval or ratio data but more than nominal data. Three properties matter in practice.
The order is real but the spacing is not. A respondent moving from "agree" to "strongly agree" on a five-point scale is not necessarily moving the same psychological distance as another respondent moving from "neutral" to "agree". Treating the labels 1 through 5 as if they were measured in the same units as a thermometer reading bakes in an assumption that is rarely tested.
Arithmetic on raw labels is suspect. The mean of "poor", "fair", "good", "very good" is not well defined unless you commit to a numerical encoding, and the choice of encoding can change the answer. Many practitioners do average Likert items anyway, especially after summing several items into a composite score, because the sum often behaves close enough to interval for downstream parametric methods to be useful. The debate over whether this is justified is more than fifty years old and shows no sign of ending.
Standard regression and classification leave information on the table. Ordinary least squares regression treats the labels as numbers on a continuous scale, which assumes equal spacing. A multi-class logistic regression classifier ignores the ordering entirely and predicts each class as if it were nominal. Both work, but neither uses the structure that the labels are ordered.
Ordinal regression, sometimes called ordinal classification, is the family of supervised methods designed for an ordered discrete response. The predicted output is one of $K$ ordered categories, and the model is judged not just by whether it picks the right category but by how far off it is when it misses.
The most cited formulation is the proportional odds model introduced by Peter McCullagh in 1980 in the Journal of the Royal Statistical Society, Series B. Instead of modeling each class probability directly, the model writes the cumulative probabilities $\gamma_j = P(Y \le j \mid X)$ as
$$\text{logit}(\gamma_j) = \theta_j - \beta^\top X, \quad j = 1, \ldots, K-1,$$
where $\theta_1 < \theta_2 < \cdots < \theta_{K-1}$ are class thresholds and $\beta$ is a shared coefficient vector. The key constraint is that $\beta$ is the same across all $K-1$ logits; only the intercepts shift. This is the proportional odds assumption, and it implies that the effect of any predictor on the log odds of "being above category $j$" is the same regardless of which $j$ you pick.
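The individual class probabilities follow by differencing the cumulative probabilities, $P(Y = j \mid X) = \gamma_j - \gamma_{j-1}$ with $\gamma_0 = 0$ and $\gamma_K = 1$. A minimal numeric sketch of this computation; the thresholds and coefficients below are made up purely for illustration:

```python
import numpy as np

def proportional_odds_probs(x, thetas, beta):
    """Class probabilities under a cumulative logit model.

    P(Y <= j | x) = sigmoid(theta_j - beta @ x); the per-class
    probabilities are successive differences of the cumulative ones.
    """
    z = thetas - x @ beta                       # one logit per threshold
    gamma = 1.0 / (1.0 + np.exp(-z))            # cumulative P(Y <= j)
    gamma = np.concatenate([[0.0], gamma, [1.0]])
    return np.diff(gamma)                       # P(Y = 1), ..., P(Y = K)

# Hypothetical 3-feature model with K = 5 classes (4 thresholds).
thetas = np.array([-2.0, -0.5, 0.8, 2.2])
beta = np.array([0.4, -0.3, 0.1])
x = np.array([1.0, 2.0, 0.5])
print(proportional_odds_probs(x, thetas, beta))  # sums to 1
```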
Replacing the logit with a probit gives the ordered probit model, often associated with the econometrics literature. More generally, any cumulative link model uses an inverse cumulative distribution function as the link, and the package ordinal in R supports logit, probit, complementary log-log, cauchit, and others. In Python, statsmodels.miscmodels.ordinal_model.OrderedModel provides a comparable implementation, and the proportional odds parameterization comes out of the box.
Eibe Frank and Mark Hall, at the University of Waikato, proposed a different angle in 2001. Their ECML paper, A Simple Approach to Ordinal Classification, showed that any standard classification algorithm can be turned into an ordinal classifier by training $K-1$ binary problems, each one asking "is the rank greater than $j$?". At prediction time, the probabilities from the binary classifiers are combined to recover a probability distribution over the $K$ ordered classes. The approach is attractive because it works with any underlying learner, including decision trees, random forests, and neural networks, and the original paper showed it consistently improved over treating the labels as nominal when the base learner was a C4.5 tree.
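A compact sketch of the decomposition, using scikit-learn logistic regressions as the base learners; the combination rule follows the paper ($P(Y = 1) = 1 - P(Y > 1)$, $P(Y = k) = P(Y > k-1) - P(Y > k)$, $P(Y = K) = P(Y > K-1)$), but the class and helper names here are illustrative, not a published implementation, and assume integer-coded labels:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

class FrankHallOrdinal:
    """Frank-Hall decomposition: K-1 binary 'is rank > j?' classifiers."""

    def __init__(self, base=None):
        self.base = base if base is not None else LogisticRegression(max_iter=1000)

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.sort(np.unique(y))
        # One binary problem per threshold j: is the rank greater than j?
        self.models_ = [clone(self.base).fit(X, (y > j).astype(int))
                        for j in self.classes_[:-1]]
        return self

    def predict_proba(self, X):
        # P(Y > j) for each threshold, framed by P(Y > min-1) = 1 and
        # P(Y > max) = 0; successive differences recover P(Y = k).
        gt = np.column_stack([m.predict_proba(X)[:, 1] for m in self.models_])
        cum = np.hstack([np.ones((gt.shape[0], 1)), gt,
                         np.zeros((gt.shape[0], 1))])
        # Clip: the K-1 classifiers are fit independently and can disagree.
        return np.clip(cum[:, :-1] - cum[:, 1:], 0.0, None)

    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]
```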
Ling Li and Hsuan-Tien Lin generalized the idea in 2006 with their paper Ordinal Regression by Extended Binary Classification, presented at NIPS (the conference now called NeurIPS), which derives loss bounds and shows that the choice of binary loss controls the ordinal loss being optimized.
Wei Chu and S. Sathiya Keerthi presented two SVM formulations for ordinal regression at ICML 2005. SVOREX (explicit constraints) trains $K-1$ parallel hyperplanes with margin constraints only against patterns from the two adjacent classes, while SVORIM (implicit constraints) constrains each hyperplane against all higher and lower classes simultaneously. A useful property of SVORIM is that the ordering of thresholds $\theta_1 < \theta_2 < \cdots < \theta_{K-1}$ is satisfied automatically at the optimal solution, even though the constraint is not written explicitly. Both methods extended the SVM framework cleanly and have been benchmarked widely as baselines in the ordinal regression literature.
Jianlin Cheng, Zheng Wang, and Gianluca Pollastri proposed a neural network approach to ordinal regression in 2007, later published in IEEE proceedings in 2008. Their NNRank model uses $K$ output units and reframes the ordinal classification as a series of binary classifications using a cumulative encoding: a label of $k$ becomes the target vector $(1, 1, \ldots, 1, 0, \ldots, 0)$ with the first $k$ entries set to one. Each output unit predicts whether the rank is at least its position, which mirrors the binary decomposition idea but inside a single end-to-end network. Modern variants of this trick, including the CORAL and CORN methods of Cao and Raschka, are common in deep ordinal regression today.
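The cumulative encoding itself is a one-liner; a sketch, assuming labels coded 1 through $K$:

```python
import numpy as np

def cumulative_targets(y, K):
    """Encode label k as (1,...,1,0,...,0) with the first k entries one,
    in the style of NNRank's cumulative ordinal targets."""
    return (np.arange(1, K + 1) <= np.asarray(y)[:, None]).astype(float)

print(cumulative_targets([1, 3, 5], K=5))
# [[1. 0. 0. 0. 0.]
#  [1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 1.]]
```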
| Method | Year | Approach | Strengths | Limitations |
|---|---|---|---|---|
| Proportional odds (McCullagh) | 1980 | Cumulative logit, shared $\beta$ | Interpretable, well studied, default in R MASS::polr | Proportional odds assumption may fail |
| Ordered probit | 1980s econometrics | Cumulative probit | Natural for latent normal variables | Same shared-slope assumption |
| Frank and Hall ordinal classifier | 2001 | $K-1$ binary problems wrapping any classifier | Works with any base learner | Probabilities can be inconsistent across thresholds |
| Support vector ordinal regression (Chu and Keerthi) | 2005 | SVOREX, SVORIM | Strong margin-based baseline | Computationally heavier, kernel choice matters |
| Cheng NNRank | 2007 | Cumulative target vector in a neural net | End-to-end, scales to deep models | Needs care with monotonicity at inference |
| OrdinalRidge / regression on labels | older | Treat labels as numeric, fit ridge | Simple, fast | Assumes equal spacing |
| Ordinal extension of CRF, GP, deep ResNet | various | Probabilistic and deep variants | Match the assumptions of richer models | Implementation complexity |
When the goal is to measure agreement between two ordered variables, ordinary Pearson correlation is awkward because it depends on the numerical encoding. Two rank-based measures dominate.
Charles Spearman introduced the rank correlation now named after him in a 1904 paper in the American Journal of Psychology. The Spearman coefficient is just the Pearson correlation computed on the ranks of the two variables, and it captures monotonic association rather than strictly linear association. It ranges from $-1$ (perfectly opposite orderings) to $+1$ (identical orderings), with $0$ meaning no monotonic relationship. Spearman is well suited when the data have outliers or non-normal marginal distributions, since ranks are insensitive to the actual values.
Maurice Kendall introduced his $\tau$ statistic in 1938 in Biometrika. The construction is more combinatorial than Spearman's: for every pair of observations $(i, j)$, the pair is concordant if the two variables agree on which member of the pair is larger and discordant otherwise. The Kendall $\tau_a$ statistic is
$$\tau_a = \frac{C - D}{\binom{n}{2}},$$
where $C$ is the number of concordant pairs and $D$ the number of discordant pairs out of $\binom{n}{2}$ total pairs. Kendall $\tau_b$ adjusts for ties and is the version commonly reported by statistical software.
Leo Goodman and William Kruskal proposed the $\gamma$ coefficient in a series of papers starting in the 1950s. Goodman-Kruskal $\gamma$ uses the same concordant minus discordant numerator as Kendall $\tau$ but ignores tied pairs in the denominator. Somers' $D$ is an asymmetric variant that treats one variable as the predictor and the other as the response. These measures are part of the standard output of ordinal regression software and show up in clinical and social-science papers.
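All of these are a single call in SciPy; a short sketch (the arrays here are made up, and scipy.stats.kendalltau computes $\tau_b$ by default):

```python
from scipy import stats

x = [1, 2, 2, 3, 4, 5]   # e.g. predicted ordinal grades
y = [1, 3, 2, 3, 5, 4]   # ground-truth grades

rho, p_rho = stats.spearmanr(x, y)     # average ranks assigned to ties
tau, p_tau = stats.kendalltau(x, y)    # tau-b (tie-corrected) by default
d = stats.somersd(x, y).statistic      # Somers' D (SciPy >= 1.7)
print(rho, tau, d)
```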
| Measure | Year | Formula sketch | Range | Handles ties |
|---|---|---|---|---|
| Spearman $\rho$ | 1904 | Pearson correlation of ranks | $[-1, 1]$ | Average ranks for ties |
| Kendall $\tau_a$ | 1938 | $(C - D) / \binom{n}{2}$ | $[-1, 1]$ | No |
| Kendall $\tau_b$ | 1945 | $(C - D)$ over corrected denominator | $[-1, 1]$ | Yes |
| Goodman-Kruskal $\gamma$ | 1954 | $(C - D) / (C + D)$, ignoring ties | $[-1, 1]$ | Excludes from denominator |
| Somers' $D$ | 1962 | Asymmetric variant of $\gamma$ | $[-1, 1]$ | Yes |
Classification accuracy treats every misclassification as equally bad, which is the wrong default for ordinal data: predicting "4 stars" when the truth is "5 stars" is clearly less wrong than predicting "1 star". A few metrics are common in practice.
Mean absolute error of ranks treats the integer-coded labels as a number line and reports the average $|y - \hat{y}|$. It is easy to compute and easy to communicate but pretends the encoding is interval.
Spearman $\rho$ and Kendall $\tau$ between predictions and ground truth measure the rank agreement of a model's outputs against the labels and are often used in ranking-style benchmarks.
Quadratic weighted kappa, introduced by Jacob Cohen in 1968 as part of his weighted kappa family, has become the de facto metric in ordinal Kaggle competitions, including the 2012 Hewlett Foundation automated essay scoring contest, the 2015 Diabetic Retinopathy Detection contest, and several follow-on challenges. The metric is
$$\kappa = 1 - \frac{\sum_{i,j} w_{ij} O_{ij}}{\sum_{i,j} w_{ij} E_{ij}},$$
where $O$ is the observed confusion matrix, $E$ is the expected matrix under chance, and $w_{ij} = (i - j)^2 / (K - 1)^2$ is the quadratic weight that penalizes far-off predictions more than near-off ones. Practitioners often optimize models that output a continuous score and then tune cut points between classes to maximize quadratic weighted kappa on a validation set.
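scikit-learn exposes the quadratic weighting directly through cohen_kappa_score; a minimal sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

y_true = np.array([5, 3, 4, 2, 1, 4, 5])
y_pred = np.array([4, 3, 4, 1, 1, 5, 5])

qwk = cohen_kappa_score(y_true, y_pred, weights='quadratic')
mae = np.mean(np.abs(y_true - y_pred))   # MAE of ranks, for comparison
print(qwk, mae)
```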
Ranking can also be approached without ever assigning a class label. Instead, the data come as pairwise comparisons or full rankings of items, and the model learns a real-valued score that respects them.
The foundational model is the Bradley-Terry model, presented by Ralph Bradley and Milton Terry in a 1952 Biometrika paper on incomplete block designs and paired comparisons. Each item $i$ gets a positive score $\pi_i$, and the probability that $i$ beats $j$ is
$$P(i \succ j) = \frac{\pi_i}{\pi_i + \pi_j}.$$
Ernst Zermelo described an essentially equivalent model for ranking chess players in 1929, and Louis Thurstone's 1927 law of comparative judgment is closely related. Bradley-Terry remains the workhorse for converting wins-and-losses data into a global ranking.
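Parameterizing the scores as $\pi_i = e^{s_i}$ turns the likelihood into logistic regression on score differences, which can be maximized directly. A minimal maximum-likelihood sketch on hypothetical win/loss pairs:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical paired comparisons among 4 items: (winner, loser).
pairs = [(0, 1), (0, 2), (1, 2), (2, 3), (0, 3), (1, 3), (2, 1)]

def neg_log_lik(s):
    # With pi_i = exp(s_i), P(i beats j) = sigmoid(s_i - s_j), so the
    # negative log-likelihood is the sum of log(1 + exp(-(s_w - s_l))).
    diffs = np.array([s[w] - s[l] for w, l in pairs])
    return np.log1p(np.exp(-diffs)).sum()

res = minimize(neg_log_lik, np.zeros(4), method='BFGS')
scores = res.x - res.x.mean()          # identifiable only up to a shift
print(np.argsort(-scores), scores)     # global ranking, best first
```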
In machine learning, Christopher Burges and colleagues at Microsoft Research introduced RankNet in 2005. RankNet trains a neural network to score documents and uses a pairwise cross-entropy loss derived from a logistic preference model that is mathematically a cousin of Bradley-Terry. Its successors are LambdaRank, which scales pairwise gradients by the change in NDCG that swapping the pair would cause, and LambdaMART, which uses gradient-boosted regression trees as the underlying learner. An ensemble of LambdaMART models won track 1 of the 2010 Yahoo Learning to Rank Challenge.
These ideas show up in modern preference-based language model training. Reward models for RLHF are typically trained on pairs of responses with a Bradley-Terry-style log-likelihood, and the Direct Preference Optimization (DPO) algorithm of Rafailov and colleagues, presented at NeurIPS 2023, derives a closed-form policy from the Bradley-Terry preference model and replaces the explicit reward training step with a single binary cross-entropy objective. The popularity of DPO and its descendants means that the math of ordinal pairwise preference is now sitting under a large fraction of post-training pipelines for frontier models.
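Concretely, the reward-model objective is the Bradley-Terry log-likelihood applied to score differences; a sketch of the pairwise loss in PyTorch, with hypothetical reward values standing in for a model's outputs:

```python
import torch
import torch.nn.functional as F

def bt_reward_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss used for RLHF reward models:
    -log sigmoid(r_chosen - r_rejected), averaged over pairs."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical scalar rewards for three preference pairs.
r_w = torch.tensor([1.2, 0.3, 2.0])
r_l = torch.tensor([0.7, 0.9, -0.5])
print(bt_reward_loss(r_w, r_l))
```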
| Tool | Language | Capability |
|---|---|---|
| mord | Python | Threshold, regression, and classification ordinal models with a scikit-learn API |
| statsmodels.OrderedModel | Python | Proportional odds and probit cumulative link models |
| ordinal | R | Cumulative link models with logit, probit, cloglog, cauchit links |
| MASS::polr | R | Proportional odds logistic regression |
| LightGBM (objective='lambdarank') | Python, R, C++ | Pairwise learning to rank with NDCG-aware gradients |
| XGBoost (rank:pairwise, rank:ndcg) | Python, R | Pairwise and listwise ranking objectives |
| coral-pytorch | Python | Cheng-style cumulative ordinal regression heads for PyTorch models |
| BradleyTerry2 | R | Fitting Bradley-Terry models for paired comparisons |
Ordinal data and rank-based models are common across machine learning systems.
Recommender systems frequently predict ordinal star ratings. The Netflix Prize famously framed rating prediction as a regression problem on a 1-to-5 scale, and many production recommenders predict ratings, which are then used to rank items.
Search and ad ranking depend on relevance grades, often on a five-level scale from "perfect" down to "bad", that are themselves ordinal. NDCG, mean reciprocal rank, and other ranking metrics are designed around this ordering.
Clinical scoring is full of ordinal scales: pain scales from 0 to 10, cancer staging from I to IV, the Glasgow Coma Scale, NYHA heart failure classes. Models that predict these benefit from cumulative link or ordinal classification techniques rather than treating the labels as nominal.
Content moderation and toxicity scoring often use ordinal severity tiers, for example "safe", "borderline", "harmful", "egregious". A multi-class model that ignores ordering can confidently confuse "safe" with "egregious" and lose the partial-credit signal that an ordinal model would preserve.
Language model alignment turns ordinal preferences into training signal. Annotators rank multiple completions, and reward models or DPO objectives use those rankings to fit a policy. Even where the data look like pairs, they are often the result of asking annotators for an ordinal quality score and converting to pairs after the fact.
Suppose you run a SaaS product and want to predict customer satisfaction on a five-point scale from 1 = very dissatisfied to 5 = very satisfied, given features such as monthly logins, support tickets opened, and tenure in months. A naive approach would treat the score as a continuous target and fit linear regression, then round to the nearest integer. That works but it ignores the ordinal nature of the scale and is sensitive to how you encode the labels.
A proportional odds model is a more principled fit. Using statsmodels.miscmodels.ordinal_model.OrderedModel, you can write
```python
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Cumulative logit (proportional odds) model; X_train holds the
# features, y_train the 1-5 satisfaction labels.
mod = OrderedModel(y_train, X_train, distr='logit')
res = mod.fit(method='bfgs')
print(res.summary())
```
The output gives a single $\beta$ for each predictor (interpreted as the change in log-odds of being in a higher satisfaction class for a one-unit increase in the predictor) and four threshold parameters that separate the five categories. To evaluate, compute quadratic weighted kappa, mean absolute error, and Spearman $\rho$ between predictions and held-out labels. If the proportional odds assumption fails, options include the partial proportional odds model, the Frank-Hall binary decomposition, or a neural network with a cumulative output head as in NNRank.
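A sketch of that evaluation step, assuming res from the fit above plus held-out X_test and y_test in the same 1-to-5 coding (OrderedModel predictions are per-class probabilities, so the predicted class is an argmax over them):

```python
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

probs = np.asarray(res.predict(X_test))   # per-class probabilities, shape (n, 5)
y_hat = probs.argmax(axis=1) + 1          # back to the 1-5 label coding
rho, _ = stats.spearmanr(y_test, y_hat)

print(cohen_kappa_score(y_test, y_hat, weights='quadratic'))
print(np.mean(np.abs(y_test - y_hat)))    # MAE of ranks
print(rho)
```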
A few mistakes appear over and over again in applied work.
Treating ordinal as nominal is the most common error. Plugging a 1-to-5 satisfaction score into a softmax classifier without any ordinal structure throws away the information that 4 is closer to 5 than to 1.
Treating ordinal as interval is the opposite error. Computing the mean of a five-point Likert item and reporting it as if the difference between 1 and 2 has the same meaning as the difference between 4 and 5 is an assumption that should be tested, not assumed.
Ignoring ties matters when computing rank correlation on data with many duplicate values. Use Kendall $\tau_b$, Spearman with averaged ranks, or Goodman-Kruskal $\gamma$ depending on how you want ties to be handled.
Accuracy as the only metric for ordinal classification hides whether the model fails by one class or by four. Always pair it with mean absolute error of ranks or quadratic weighted kappa.
Forgetting to monotonize cumulative predictions can produce inconsistent class probabilities in binary-decomposition or NNRank-style models. The standard fix is to constrain or post-process the per-threshold probabilities so that $P(Y > 1) \ge P(Y > 2) \ge \cdots \ge P(Y > K-1)$.
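A running minimum over the per-threshold probabilities is a simple post-hoc fix; a sketch:

```python
import numpy as np

def monotonize(p_gt):
    """Enforce P(Y > 1) >= P(Y > 2) >= ... via a running minimum over
    the per-threshold probabilities (one row per example)."""
    return np.minimum.accumulate(p_gt, axis=1)

p = np.array([[0.9, 0.6, 0.7, 0.2]])   # inconsistent at the third threshold
print(monotonize(p))                    # [[0.9 0.6 0.6 0.2]]
```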
Ordinal regression is the dedicated supervised learning subfield. Multinomial classification is the nominal cousin that ignores order. The ranking article covers ranking metrics and tasks more broadly. Likert scale data is the classic ordinal source from survey research.