See also: Machine learning terms
In machine learning and statistics, rank or ordinality describes data whose values have a meaningful order but where the spacing between values is not necessarily equal or quantitatively meaningful. A movie can be rated 1 to 5 stars, a customer can be "very dissatisfied" through "very satisfied", and a tumor can be classified as stage I, II, III, or IV; in each case the labels carry order, but the gap between a 4-star and a 5-star rating is not the same kind of quantity as the gap between 4.0 and 5.0 kilograms. Treating ordinal data correctly matters because the wrong choice of model or loss function can throw away information about order, or invent a metric structure that does not exist. This article covers the place of ordinal data in measurement theory, the main models and metrics used to handle it, and how rank-based methods turn up in modern systems such as learning to rank, preference learning, and RLHF reward modeling.
The modern vocabulary for talking about ordinality comes from the psychophysicist S. S. Stevens. In a 1946 paper in Science, Stevens proposed that all measurement falls into one of four scales: nominal, ordinal, interval, and ratio. Each scale supports a different set of admissible operations and a different family of statistics, and Stevens argued that statistics whose values change under permutations of arbitrary labels are not appropriate for nominal data, statistics that depend on the size of differences are not appropriate for ordinal data, and so on.
The table below summarizes the four scales and the operations they support.
| Scale | Distinguishes | Order | Equal intervals | True zero | Examples | Permitted statistics |
|---|---|---|---|---|---|---|
| Nominal | Yes | No | No | No | Eye color, ZIP code, categorical data labels | Mode, frequency, chi-square |
| Ordinal | Yes | Yes | No | No | Star ratings, Likert scale responses, education level, cancer stage | Median, percentiles, Spearman correlation, Kendall tau |
| Interval | Yes | Yes | Yes | No | Celsius temperature, calendar year, IQ score | Mean, standard deviation, Pearson correlation |
| Ratio | Yes | Yes | Yes | Yes | Mass, length, count, Kelvin temperature | All arithmetic, geometric mean, coefficient of variation |
Stevens' typology has been criticized for being too rigid in practice; many real measurements sit between scales, and some statisticians, including John Tukey and Frederick Lord, argued that the choice of statistic should depend on the question, not just the label on the scale. Even with those caveats, the nominal/ordinal/interval/ratio split is still the standard way to introduce the topic in statistics and machine learning courses, and the ordinal category remains the one most often mishandled.
Ordinal data carries less structure than interval or ratio data but more than nominal data. Three properties matter in practice.
The order is real but the spacing is not. A respondent moving from "agree" to "strongly agree" on a five-point scale is not necessarily moving the same psychological distance as another respondent moving from "neutral" to "agree". Treating the labels 1 through 5 as if they were measured in the same units as a thermometer reading bakes in an assumption that is rarely tested.
Arithmetic on raw labels is suspect. The mean of "poor", "fair", "good", "very good" is not well defined unless you commit to a numerical encoding, and the choice of encoding can change the answer. Many practitioners do average Likert items anyway, especially after summing several items into a composite score, because the sum often behaves close enough to interval for downstream parametric methods to be useful. The debate over whether this is justified is more than fifty years old and shows no sign of ending.
Standard regression and classification leave information on the table. Ordinary least squares regression treats the labels as numbers on a continuous scale, which assumes equal spacing. A multi-class logistic regression classifier ignores the ordering entirely and predicts each class as if it were nominal. Both work, but neither uses the structure that the labels are ordered.
Ordinal regression, sometimes called ordinal classification, is the family of supervised methods designed for an ordered discrete response. The predicted output is one of $K$ ordered categories, and the model is judged not just by whether it picks the right category but by how far off it is when it misses.
The most cited formulation is the proportional odds model introduced by Peter McCullagh in 1980 in the Journal of the Royal Statistical Society, Series B. Instead of modeling each class probability directly, the model writes the cumulative probabilities $\gamma_j = P(Y \le j \mid X)$ as
$$\text{logit}(\gamma_j) = \theta_j - \beta^\top X, \quad j = 1, \ldots, K-1,$$
where $\theta_1 < \theta_2 < \cdots < \theta_{K-1}$ are class thresholds and $\beta$ is a shared coefficient vector. The key constraint is that $\beta$ is the same across all $K-1$ logits; only the intercepts shift. This is the proportional odds assumption, and it implies that the effect of any predictor on the log odds of "being above category $j$" is the same regardless of which $j$ you pick.
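The individual class probabilities follow by differencing the cumulative probabilities, $P(Y = j \mid X) = \gamma_j - \gamma_{j-1}$ with $\gamma_0 = 0$ and $\gamma_K = 1$. A minimal numeric sketch of this computation; the thresholds and coefficients below are made up purely for illustration:

```python
import numpy as np

def proportional_odds_probs(x, thetas, beta):
    """Class probabilities under a cumulative logit model.

    P(Y <= j | x) = sigmoid(theta_j - beta @ x); the per-class
    probabilities are successive differences of the cumulative ones.
    """
    z = thetas - x @ beta                       # one logit per threshold
    gamma = 1.0 / (1.0 + np.exp(-z))            # cumulative P(Y <= j)
    gamma = np.concatenate([[0.0], gamma, [1.0]])
    return np.diff(gamma)                       # P(Y = 1), ..., P(Y = K)

# Hypothetical 3-feature model with K = 5 classes (4 thresholds).
thetas = np.array([-2.0, -0.5, 0.8, 2.2])
beta = np.array([0.4, -0.3, 0.1])
x = np.array([1.0, 2.0, 0.5])
print(proportional_odds_probs(x, thetas, beta))  # sums to 1
```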
Replacing the logit with a probit gives the ordered probit model, often associated with the econometrics literature. More generally, any cumulative link model uses an inverse cumulative distribution function as the link, and the package ordinal in R supports logit, probit, complementary log-log, cauchit, and others. In Python, statsmodels.miscmodels.ordinal_model.OrderedModel provides a comparable implementation, and the proportional odds parameterization comes out of the box.
Eibe Frank and Mark Hall, at the University of Waikato, proposed a different angle in 2001. Their ECML paper, A Simple Approach to Ordinal Classification, showed that any standard classification algorithm can be turned into an ordinal classifier by training $K-1$ binary problems, each one asking "is the rank greater than $j$?". At prediction time, the probabilities from the binary classifiers are combined to recover a probability distribution over the $K$ ordered classes. The approach is attractive because it works with any underlying learner, including decision trees, random forests, and neural networks, and the original paper showed it consistently improved over treating the labels as nominal when the base learner was a C4.5 tree.
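A compact sketch of the decomposition, using scikit-learn logistic regressions as the base learners; the combination rule follows the paper ($P(Y = 1) = 1 - P(Y > 1)$, $P(Y = k) = P(Y > k-1) - P(Y > k)$, $P(Y = K) = P(Y > K-1)$), but the class and helper names here are illustrative, not a published implementation, and assume integer-coded labels:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

class FrankHallOrdinal:
    """Frank-Hall decomposition: K-1 binary 'is rank > j?' classifiers."""

    def __init__(self, base=None):
        self.base = base if base is not None else LogisticRegression(max_iter=1000)

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.sort(np.unique(y))
        # One binary problem per threshold j: is the rank greater than j?
        self.models_ = [clone(self.base).fit(X, (y > j).astype(int))
                        for j in self.classes_[:-1]]
        return self

    def predict_proba(self, X):
        # P(Y > j) for each threshold, framed by P(Y > min-1) = 1 and
        # P(Y > max) = 0; successive differences recover P(Y = k).
        gt = np.column_stack([m.predict_proba(X)[:, 1] for m in self.models_])
        cum = np.hstack([np.ones((gt.shape[0], 1)), gt,
                         np.zeros((gt.shape[0], 1))])
        # Clip: the K-1 classifiers are fit independently and can disagree.
        return np.clip(cum[:, :-1] - cum[:, 1:], 0.0, None)

    def predict(self, X):
        return self.classes_[self.predict_proba(X).argmax(axis=1)]
```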
Ling Li and Hsuan-Tien Lin generalized the idea in 2006 with their paper Ordinal Regression by Extended Binary Classification, presented at NIPS (the conference now called NeurIPS), which derives loss bounds and shows that the choice of binary loss controls the ordinal loss being optimized.
Wei Chu and S. Sathiya Keerthi presented two SVM formulations for ordinal regression at ICML 2005. SVOREX (explicit constraints) trains $K-1$ parallel hyperplanes with margin constraints only against patterns from the two adjacent classes, while SVORIM (implicit constraints) constrains each hyperplane against all higher and lower classes simultaneously. A useful property of SVORIM is that the ordering of thresholds $\theta_1 < \theta_2 < \cdots < \theta_{K-1}$ is satisfied automatically at the optimal solution, even though the constraint is not written explicitly. Both methods extended the SVM framework cleanly and have been benchmarked widely as baselines in the ordinal regression literature.
Jianlin Cheng, Zheng Wang, and Gianluca Pollastri proposed a neural network approach to ordinal regression in 2007, later published in IEEE proceedings in 2008. Their NNRank model uses $K$ output units and reframes the ordinal classification as a series of binary classifications using a cumulative encoding: a label of $k$ becomes the target vector $(1, 1, \ldots, 1, 0, \ldots, 0)$ with the first $k$ entries set to one. Each output unit predicts whether the rank is at least its position, which mirrors the binary decomposition idea but inside a single end-to-end network. Modern variants of this trick, including the CORAL and CORN methods of Cao and Raschka, are common in deep ordinal regression today.
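The cumulative encoding itself is a one-liner; a sketch, assuming labels coded 1 through $K$:

```python
import numpy as np

def cumulative_targets(y, K):
    """Encode label k as (1,...,1,0,...,0) with the first k entries one,
    in the style of NNRank's cumulative ordinal targets."""
    return (np.arange(1, K + 1) <= np.asarray(y)[:, None]).astype(float)

print(cumulative_targets([1, 3, 5], K=5))
# [[1. 0. 0. 0. 0.]
#  [1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 1.]]
```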
| Method | Year | Approach | Strengths | Limitations |
|---|---|---|---|---|
| Proportional odds (McCullagh) | 1980 | Cumulative logit, shared $\beta$ | Interpretable, well studied, default in R MASS::polr | Proportional odds assumption may fail |
| Ordered probit | 1980s econometrics | Cumulative probit | Natural for latent normal variables | Same shared-slope assumption |
| Frank and Hall ordinal classifier | 2001 | $K-1$ binary problems wrapping any classifier | Works with any base learner | Probabilities can be inconsistent across thresholds |
| Support vector ordinal regression (Chu and Keerthi) | 2005 | SVOREX, SVORIM | Strong margin-based baseline | Computationally heavier, kernel choice matters |
| Cheng NNRank | 2007 | Cumulative target vector in a neural net | End-to-end, scales to deep models | Needs care with monotonicity at inference |
| OrdinalRidge / regression on labels | older | Treat labels as numeric, fit ridge | Simple, fast | Assumes equal spacing |
| Ordinal extension of CRF, GP, deep ResNet | various | Probabilistic and deep variants | Match the assumptions of richer models | Implementation complexity |
When the goal is to measure agreement between two ordered variables, ordinary Pearson correlation is awkward because it depends on the numerical encoding. Two rank-based measures dominate.
Charles Spearman introduced the rank correlation now named after him in a 1904 paper in the American Journal of Psychology. The Spearman coefficient is just the Pearson correlation computed on the ranks of the two variables, and it captures monotonic association rather than strictly linear association. It ranges from $-1$ (perfectly opposite orderings) to $+1$ (identical orderings), with $0$ meaning no monotonic relationship. Spearman is well suited when the data have outliers or non-normal marginal distributions, since ranks are insensitive to the actual values.
Maurice Kendall introduced his $\tau$ statistic in 1938 in Biometrika. The construction is more combinatorial than Spearman's: for every pair of observations $(i, j)$, the pair is concordant if the two variables agree on which member of the pair is larger and discordant otherwise. The Kendall $\tau_a$ statistic is
$$\tau_a = \frac{C - D}{\binom{n}{2}},$$
where $C$ is the number of concordant pairs and $D$ the number of discordant pairs out of $\binom{n}{2}$ total pairs. Kendall $\tau_b$ adjusts for ties and is the version commonly reported by statistical software.
Leo Goodman and William Kruskal proposed the $\gamma$ coefficient in a series of papers starting in the 1950s. Goodman-Kruskal $\gamma$ uses the same concordant minus discordant numerator as Kendall $\tau$ but ignores tied pairs in the denominator. Somers' $D$ is an asymmetric variant that treats one variable as the predictor and the other as the response. These measures are part of the standard output of ordinal regression software and show up in clinical and social-science papers.
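All of these are a single call in SciPy; a short sketch (the arrays here are made up, and scipy.stats.kendalltau computes $\tau_b$ by default):

```python
from scipy import stats

x = [1, 2, 2, 3, 4, 5]   # e.g. predicted ordinal grades
y = [1, 3, 2, 3, 5, 4]   # ground-truth grades

rho, p_rho = stats.spearmanr(x, y)     # average ranks assigned to ties
tau, p_tau = stats.kendalltau(x, y)    # tau-b (tie-corrected) by default
d = stats.somersd(x, y).statistic      # Somers' D (SciPy >= 1.7)
print(rho, tau, d)
```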
| Measure | Year | Formula sketch | Range | Handles ties |
|---|---|---|---|---|
| Spearman $\rho$ | 1904 | Pearson correlation of ranks | $[-1, 1]$ | Average ranks for ties |
| Kendall $\tau_a$ | 1938 | $(C - D) / \binom{n}{2}$ | $[-1, 1]$ | No |
| Kendall $\tau_b$ | 1945 | $(C - D)$ over corrected denominator | $[-1, 1]$ | Yes |
| Goodman-Kruskal $\gamma$ | 1954 | $(C - D) / (C + D)$, ignoring ties | $[-1, 1]$ | Excludes from denominator |
| Somers' $D$ | 1962 | Asymmetric variant of $\gamma$ | $[-1, 1]$ | Yes |
Classification accuracy treats every misclassification as equally bad, which is the wrong default for ordinal data: predicting "4 stars" when the truth is "5 stars" is clearly less wrong than predicting "1 star". A few metrics are common in practice.
Mean absolute error of ranks treats the integer-coded labels as a number line and reports the average $|y - \hat{y}|$. It is easy to compute and easy to communicate but pretends the encoding is interval.
Spearman $\rho$ and Kendall $\tau$ between predictions and ground truth measure the rank agreement of a model's outputs against the labels and are often used in ranking-style benchmarks.
Quadratic weighted kappa, introduced by Jacob Cohen in 1968 as part of his weighted kappa family, has become the de facto metric in ordinal Kaggle competitions, including the 2012 Hewlett Foundation automated essay scoring contest, the 2015 Diabetic Retinopathy Detection contest, and several follow-on challenges. The metric is
$$\kappa = 1 - \frac{\sum_{i,j} w_{ij} O_{ij}}{\sum_{i,j} w_{ij} E_{ij}},$$
where $O$ is the observed confusion matrix, $E$ is the expected matrix under chance, and $w_{ij} = (i - j)^2 / (K - 1)^2$ is the quadratic weight that penalizes far-off predictions more than near-off ones. Practitioners often optimize models that output a continuous score and then tune cut points between classes to maximize quadratic weighted kappa on a validation set.
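scikit-learn exposes the quadratic weighting directly through cohen_kappa_score; a minimal sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

y_true = np.array([5, 3, 4, 2, 1, 4, 5])
y_pred = np.array([4, 3, 4, 1, 1, 5, 5])

qwk = cohen_kappa_score(y_true, y_pred, weights='quadratic')
mae = np.mean(np.abs(y_true - y_pred))   # MAE of ranks, for comparison
print(qwk, mae)
```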
Ranking can also be approached without ever assigning a class label. Instead, the data come as pairwise comparisons or full rankings of items, and the model learns a real-valued score that respects them.
The foundational model is the Bradley-Terry model, presented by Ralph Bradley and Milton Terry in a 1952 Biometrika paper on incomplete block designs and paired comparisons. Each item $i$ gets a positive score $\pi_i$, and the probability that $i$ beats $j$ is
$$P(i \succ j) = \frac{\pi_i}{\pi_i + \pi_j}.$$
Ernst Zermelo described an essentially equivalent model for ranking chess players in 1929, and Louis Thurstone's 1927 law of comparative judgment is closely related. Bradley-Terry remains the workhorse for converting wins-and-losses data into a global ranking.
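Parameterizing the scores as $\pi_i = e^{s_i}$ turns the likelihood into logistic regression on score differences, which can be maximized directly. A minimal maximum-likelihood sketch on hypothetical win/loss pairs:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical paired comparisons among 4 items: (winner, loser).
pairs = [(0, 1), (0, 2), (1, 2), (2, 3), (0, 3), (1, 3), (2, 1)]

def neg_log_lik(s):
    # With pi_i = exp(s_i), P(i beats j) = sigmoid(s_i - s_j), so the
    # negative log-likelihood is the sum of log(1 + exp(-(s_w - s_l))).
    diffs = np.array([s[w] - s[l] for w, l in pairs])
    return np.log1p(np.exp(-diffs)).sum()

res = minimize(neg_log_lik, np.zeros(4), method='BFGS')
scores = res.x - res.x.mean()          # identifiable only up to a shift
print(np.argsort(-scores), scores)     # global ranking, best first
```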
In machine learning, Christopher Burges and colleagues at Microsoft Research introduced RankNet in 2005. RankNet trains a neural network to score documents and uses a pairwise cross-entropy loss derived from a logistic preference model that is mathematically a cousin of Bradley-Terry. Its successors are LambdaRank, which scales pairwise gradients by the change in NDCG that swapping the pair would cause, and LambdaMART, which uses gradient-boosted regression trees as the underlying learner. An ensemble of LambdaMART models won track 1 of the 2010 Yahoo Learning to Rank Challenge.
These ideas show up in modern preference-based language model training. Reward models for RLHF are typically trained on pairs of responses with a Bradley-Terry-style log-likelihood, and the Direct Preference Optimization (DPO) algorithm of Rafailov and colleagues, presented at NeurIPS 2023, derives a closed-form policy from the Bradley-Terry preference model and replaces the explicit reward training step with a single binary cross-entropy objective. The popularity of DPO and its descendants means that the math of ordinal pairwise preference is now sitting under a large fraction of post-training pipelines for frontier models.
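Concretely, the reward-model objective is the Bradley-Terry log-likelihood applied to score differences; a sketch of the pairwise loss in PyTorch, with hypothetical reward values standing in for a model's outputs:

```python
import torch
import torch.nn.functional as F

def bt_reward_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss used for RLHF reward models:
    -log sigmoid(r_chosen - r_rejected), averaged over pairs."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical scalar rewards for three preference pairs.
r_w = torch.tensor([1.2, 0.3, 2.0])
r_l = torch.tensor([0.7, 0.9, -0.5])
print(bt_reward_loss(r_w, r_l))
```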
| Tool | Language | Capability |
|---|---|---|
| mord | Python | Threshold, regression, and classification ordinal models with a scikit-learn API |
| statsmodels.OrderedModel | Python | Proportional odds and probit cumulative link models |
| ordinal | R | Cumulative link models with logit, probit, cloglog, cauchit links |
| MASS::polr | R | Proportional odds logistic regression |
| LightGBM (objective='lambdarank') | Python, R, C++ | Pairwise learning to rank with NDCG-aware gradients |
| XGBoost (rank:pairwise, rank:ndcg) | Python, R | Pairwise and listwise ranking objectives |
| coral-pytorch | Python | Cheng-style cumulative ordinal regression heads for PyTorch models |
| BradleyTerry2 | R | Fitting Bradley-Terry models for paired comparisons |
Ordinal data and rank-based models are common across machine learning systems.
Recommender systems frequently predict ordinal star ratings. The Netflix Prize famously framed rating prediction as a regression problem on a 1-to-5 scale, and many production recommenders predict ratings, which are then used to rank items.
Search and ad ranking depend on relevance grades, often on a five-level scale from "perfect" down to "bad", that are themselves ordinal. NDCG, mean reciprocal rank, and other ranking metrics are designed around this ordering.
Clinical scoring is full of ordinal scales: pain scales from 0 to 10, cancer staging from I to IV, the Glasgow Coma Scale, NYHA heart failure classes. Models that predict these benefit from cumulative link or ordinal classification techniques rather than treating the labels as nominal.
Content moderation and toxicity scoring often use ordinal severity tiers, for example "safe", "borderline", "harmful", "egregious". A multi-class model that ignores ordering can confidently confuse "safe" with "egregious" and lose the partial-credit signal that an ordinal model would preserve.
Language model alignment turns ordinal preferences into training signal. Annotators rank multiple completions, and reward models or DPO objectives use those rankings to fit a policy. Even where the data look like pairs, they are often the result of asking annotators for an ordinal quality score and converting to pairs after the fact.
Suppose you run a SaaS product and want to predict customer satisfaction on a five-point scale from 1 = very dissatisfied to 5 = very satisfied, given features such as monthly logins, support tickets opened, and tenure in months. A naive approach would treat the score as a continuous target and fit linear regression, then round to the nearest integer. That works but it ignores the ordinal nature of the scale and is sensitive to how you encode the labels.
A proportional odds model is a more principled fit. Using statsmodels.miscmodels.ordinal_model.OrderedModel, you can write
```python
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Cumulative logit (proportional odds) model; X_train holds the
# features, y_train the 1-5 satisfaction labels.
mod = OrderedModel(y_train, X_train, distr='logit')
res = mod.fit(method='bfgs')
print(res.summary())
```
The output gives a single $\beta$ for each predictor (interpreted as the change in log-odds of being in a higher satisfaction class for a one-unit increase in the predictor) and four threshold parameters that separate the five categories. To evaluate, compute quadratic weighted kappa, mean absolute error, and Spearman $\rho$ between predictions and held-out labels. If the proportional odds assumption fails, options include the partial proportional odds model, the Frank-Hall binary decomposition, or a neural network with a cumulative output head as in NNRank.
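A sketch of that evaluation step, assuming res from the fit above plus held-out X_test and y_test in the same 1-to-5 coding (OrderedModel predictions are per-class probabilities, so the predicted class is an argmax over them):

```python
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

probs = np.asarray(res.predict(X_test))   # per-class probabilities, shape (n, 5)
y_hat = probs.argmax(axis=1) + 1          # back to the 1-5 label coding
rho, _ = stats.spearmanr(y_test, y_hat)

print(cohen_kappa_score(y_test, y_hat, weights='quadratic'))
print(np.mean(np.abs(y_test - y_hat)))    # MAE of ranks
print(rho)
```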
A few mistakes appear over and over again in applied work.
Treating ordinal as nominal is the most common error. Plugging a 1-to-5 satisfaction score into a softmax classifier without any ordinal structure throws away the information that 4 is closer to 5 than to 1.
Treating ordinal as interval is the opposite error. Computing the mean of a five-point Likert item and reporting it as if the difference between 1 and 2 has the same meaning as the difference between 4 and 5 is an assumption that should be tested, not assumed.
Ignoring ties matters when computing rank correlation on data with many duplicate values. Use Kendall $\tau_b$, Spearman with averaged ranks, or Goodman-Kruskal $\gamma$ depending on how you want ties to be handled.
Accuracy as the only metric for ordinal classification hides whether the model fails by one class or by four. Always pair it with mean absolute error of ranks or quadratic weighted kappa.
Forgetting to monotonize cumulative predictions can produce inconsistent class probabilities in binary-decomposition or NNRank-style models. The standard fix is to constrain or post-process the per-threshold probabilities so that $P(Y > 1) \ge P(Y > 2) \ge \cdots \ge P(Y > K-1)$.
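A running minimum over the per-threshold probabilities is a simple post-hoc fix; a sketch:

```python
import numpy as np

def monotonize(p_gt):
    """Enforce P(Y > 1) >= P(Y > 2) >= ... via a running minimum over
    the per-threshold probabilities (one row per example)."""
    return np.minimum.accumulate(p_gt, axis=1)

p = np.array([[0.9, 0.6, 0.7, 0.2]])   # inconsistent at the third threshold
print(monotonize(p))                    # [[0.9 0.6 0.6 0.2]]
```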
Ordinal regression is the dedicated supervised learning subfield. Multinomial classification is the nominal cousin that ignores order. The ranking article covers ranking metrics and tasks more broadly. Likert scale data is the classic ordinal source from survey research.