Scoring
Last reviewed
May 11, 2026
Sources
12 citations
Review status
Source-backed
Revision
v2 · 2,318 words
See also: Machine learning terms
In machine learning, scoring has two closely related meanings that often get blurred in practice. The first is evaluation: applying a metric to a model's predictions on labeled data to measure how well the model is performing. The second is inference: running a trained model on new data points to produce predictions, probabilities, or ranking scores that downstream systems can use. The same word covers both because the underlying mechanic is the same. The model emits a number, and that number gets interpreted as either a quality measurement (in evaluation) or a decision input (in production).
The inference sense of scoring is the older one. In credit risk, fraud detection, and direct marketing, "scoring" has meant running a fitted model against a customer record to produce a number since at least the 1980s. The evaluation sense became dominant later, popularized by tools like scikit-learn that exposed a scoring parameter for selecting models against a chosen metric. Both senses share a convention: higher scores mean better outcomes for the receiver, whether that is a more accurate model or a higher predicted likelihood of the event of interest.
When a trained classifier sees a new input, it usually produces more than a hard class label. Most modern classifiers expose a continuous score for each class. In scikit-learn, the predict_proba method returns an array with one column per class, where each column holds the model's estimated probability that the observation belongs to that class. The companion method decision_function, where an estimator provides it, returns the raw, uncalibrated score. For many linear models that score is proportional to the signed distance from the decision boundary; the analogous quantity in a neural network is the pre-softmax logits.
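The difference between the two methods can be sketched with a binary logistic regression. This is a minimal illustration on synthetic data; the dataset and its parameters are assumptions chosen only to make the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem; the data and its parameters are illustrative only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X[:3])        # shape (3, 2): one column per class
margin = clf.decision_function(X[:3])   # shape (3,): raw linear score

# For binary logistic regression, the positive-class probability is the
# sigmoid of the raw score.
sigmoid = 1.0 / (1.0 + np.exp(-margin))
print(proba[:, 1])
print(sigmoid)
```

The two printed arrays should agree, which is the sense in which the probability is just a transformed score for this model family.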
The relationship between these scores and probabilities depends on the model family. A logistic regression maps its linear score through a sigmoid (or softmax for multiclass) and the output is genuinely a probability under the model's assumptions. A random forest returns the proportion of trees voting for each class. A support vector machine has no native probability and only fits one if you ask for it (scikit-learn uses Platt scaling for that). Gradient-boosted tree ensembles such as XGBoost output a raw margin that becomes a probability after a sigmoid or softmax step. The scikit-learn documentation explicitly warns that for many estimators with a predict_proba method, including tree-based methods, ensembles, kNN, and Naive Bayes, the output should be thought of as a score rather than a true probability.
That distinction matters because scores are not automatically calibrated. A well-calibrated classifier is one where samples it labels with 0.8 confidence actually belong to the positive class about 80 percent of the time. Many high-performing models are systematically over- or under-confident. Probability calibration techniques such as Platt scaling and isotonic regression are post-processing steps that fit a one-dimensional function from raw scores to calibrated probabilities, typically on a held-out set. Calibration is measured with strictly proper scoring rules: the Brier score (mean squared error between predicted probability and outcome) and log loss (cross-entropy), both of which jointly capture calibration and discrimination.
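As a rough sketch of the post-processing step described above, the example below wraps a Gaussian Naive Bayes classifier in scikit-learn's CalibratedClassifierCV with isotonic regression and reports the Brier score and log loss before and after. The synthetic dataset, the choice of base model, and the variable names are assumptions for illustration; whether calibration helps on a given problem has to be checked empirically.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant features, which tends to make Naive Bayes
# over-confident. All parameters here are illustrative.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, n_redundant=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

raw = GaussianNB().fit(X_train, y_train)
# Internal 5-fold cross-validation keeps the calibration map from being
# fit on the same samples the base model was trained on.
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

results = {}
for name, model in [("raw", raw), ("isotonic", calibrated)]:
    p = model.predict_proba(X_test)[:, 1]
    results[name] = (brier_score_loss(y_test, p), log_loss(y_test, p))
    print(name, results[name])
```

Lower is better for both numbers, since each is a loss.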
In production systems, scoring is usually described as either batch or online. Batch scoring runs the model on a large set of records on a schedule, perhaps overnight, and writes predictions to a database for downstream use. A credit card issuer might re-score every active account weekly. A retailer might generate next-day product recommendations for every customer once per day. Batch jobs trade latency for throughput and tend to be cheaper per prediction because the infrastructure can be sized for steady utilization.
Online scoring (also called real-time inference or dynamic inference) responds to individual requests as they arrive, typically through a REST API or RPC endpoint. The scoring path needs to be fast, often under 100 milliseconds, and the system has to handle traffic spikes. Online scoring is the default for use cases where the prediction depends on signals that change in real time: fraud detection during checkout, ranking the next video to autoplay, or routing a support ticket. Some platforms combine both modes in a lambda-style architecture where a batch layer scores everything regularly and a speed layer handles fresh requests in real time.
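The two serving modes can be sketched with the same fitted model behind two entry points. The function names score_batch and score_one, the chunk size, and the synthetic data are hypothetical; a real deployment would add storage, an API layer, and monitoring around them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_hist = rng.normal(size=(500, 4))
y_hist = (X_hist[:, 0] + X_hist[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X_hist, y_hist)

def score_batch(model, records, chunk_size=1000):
    """Batch path: score a large array in chunks and return every score."""
    chunks = [
        model.predict_proba(records[i:i + chunk_size])[:, 1]
        for i in range(0, len(records), chunk_size)
    ]
    return np.concatenate(chunks)

def score_one(model, record):
    """Online path: score a single request as it arrives."""
    row = np.asarray(record, dtype=float).reshape(1, -1)
    return float(model.predict_proba(row)[0, 1])

nightly = rng.normal(size=(2500, 4))            # stand-in for a scheduled job's input
batch_scores = score_batch(model, nightly)      # results would be written to a database
online_score = score_one(model, nightly[0])     # same model, one record per call
```

Both paths produce the same number for the same record; what differs is when the work happens and how much of it is amortized per call.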
Recommender systems are where the inference sense of scoring is most visible, because production pipelines are almost always built as a two-stage funnel. Candidate generation comes first. Given a user and some context, the system retrieves a few hundred or a few thousand items from a catalog that may contain millions or billions of entries. Candidate generators have to be cheap. They typically rely on approximate nearest-neighbor lookups against embeddings, collaborative filtering, or simple heuristics like "recently popular in your region."
Then comes the scoring stage, often just called ranking. A second, heavier model takes the candidate set and assigns each item a relevance score for the specific user in the specific context. Because the candidate set is small, the ranker can afford to consume hundreds of features per item: user history, item metadata, time of day, device, recent session behavior, and cross features built from all of these. Google's recommender systems documentation gives the standard reason for this split: scores from different candidate generators may not be comparable, and a small candidate set lets you use a model rich enough to capture nuance.
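A toy version of the two-stage funnel is sketched below. Everything in it is an assumption for illustration: random embeddings stand in for learned ones, exact dot-product retrieval stands in for an approximate nearest-neighbor index, and a linear scorer with hypothetical weights stands in for a deep ranking model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 10_000, 16
item_emb = rng.normal(size=(n_items, dim))   # stand-in for learned item embeddings
user_emb = rng.normal(size=dim)              # stand-in for a learned user embedding

def generate_candidates(user_vec, item_matrix, k=200):
    """Stage 1: cheap retrieval of the k items with the largest dot product."""
    sims = item_matrix @ user_vec
    return np.argpartition(-sims, k)[:k]     # unordered top-k indices

def rank(candidates, user_vec, item_matrix, weights):
    """Stage 2: score only the candidates with a richer (here, toy) model."""
    feats = item_matrix[candidates] * user_vec   # elementwise user-item cross features
    scores = feats @ weights
    order = np.argsort(-scores)
    return candidates[order], scores[order]

candidates = generate_candidates(user_emb, item_emb, k=200)
ranker_weights = rng.uniform(0.5, 1.5, size=dim)   # hypothetical learned ranker weights
ranked_items, ranked_scores = rank(candidates, user_emb, item_emb, ranker_weights)
top10 = ranked_items[:10]
```

The ranker touches 200 rows rather than 10,000, which is what makes a more expensive per-item model affordable.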
The canonical reference for this architecture is the 2016 paper "Deep Neural Networks for YouTube Recommendations" by Paul Covington, Jay Adams, and Emre Sargin. The paper describes candidate generation as an extreme multiclass classification problem solved with a softmax over millions of videos, followed by a separate deep ranking model. YouTube's ranker famously optimizes expected watch time rather than click probability, a choice that changes which videos win and which lose. Google explicitly warns that the choice of scoring objective changes the system: optimizing pure click rate tends to surface clickbait, optimizing watch time tends to favor long videos, and combinations are usually needed to balance engagement against diversity.
The machine learning literature on how to train rankers is grouped into three families based on the granularity of the training signal. Pointwise approaches treat ranking as a regression or classification problem, predicting a relevance score for each item independently of the others. Pairwise approaches train on pairs of items and try to predict which item in the pair is more relevant; RankNet is the classic example. LambdaRank and LambdaMART build on RankNet's pairwise gradients but weight each pair by the change it would cause in a list-level metric such as NDCG, so they are often classed as listwise. Listwise approaches optimize a loss over the full ranked list, directly targeting ranking metrics like NDCG. In practice, listwise methods tend to outperform pairwise, which tends to outperform pointwise, but listwise training is more expensive and harder to scale.
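The three levels of granularity can be made concrete with small NumPy functions: a squared-error pointwise loss, a RankNet-style pairwise logistic loss, and NDCG as the list-level metric that listwise methods target. This is a sketch under simple assumptions (graded relevance labels, one query, the common 2^rel - 1 gain), not a training loop.

```python
import numpy as np

def pointwise_loss(scores, relevance):
    """Pointwise: squared error between each score and its own label."""
    return float(np.mean((scores - relevance) ** 2))

def pairwise_logistic_loss(scores, relevance):
    """Pairwise: for each pair where i is more relevant than j,
    penalize log(1 + exp(-(s_i - s_j)))."""
    score_diff = scores[:, None] - scores[None, :]
    i_beats_j = relevance[:, None] > relevance[None, :]
    return float(np.mean(np.log1p(np.exp(-score_diff[i_beats_j]))))

def ndcg(scores, relevance, k=None):
    """List-level metric: discounted gain of the predicted order over the ideal order."""
    k = len(scores) if k is None else min(k, len(scores))
    order = np.argsort(-scores)
    ideal = np.sort(relevance)[::-1]
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = np.sum((2.0 ** relevance[order][:k] - 1.0) * discounts)
    idcg = np.sum((2.0 ** ideal[:k] - 1.0) * discounts)
    return float(dcg / idcg) if idcg > 0 else 0.0

relevance = np.array([3.0, 2.0, 0.0, 1.0])
good_scores = np.array([2.5, 1.8, 0.1, 0.9])   # orders items exactly by relevance
bad_scores = good_scores[::-1].copy()          # scrambles that order

print(ndcg(good_scores, relevance), ndcg(bad_scores, relevance))
print(pairwise_logistic_loss(good_scores, relevance),
      pairwise_logistic_loss(bad_scores, relevance))
```

Note that the pairwise loss and NDCG depend only on the order and gaps between scores, while the pointwise loss also cares about their absolute values.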
Online platforms commonly mix these. A pointwise click-probability model can serve as a strong baseline ranker, with a pairwise or listwise model layered on top for fine-grained sorting. Position bias is another concern at scoring time. Items at the top of a feed naturally attract more clicks regardless of their true relevance, so production rankers either model the position explicitly as a feature, apply inverse-propensity weighting, or score every candidate as if it were in the top slot.
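Inverse-propensity weighting can be illustrated with a small calculation. The sketch below assumes a known examination propensity per position (the values are hypothetical; in practice they have to be estimated) and weights each click by the inverse of the propensity of the slot it was shown in.

```python
import numpy as np

# Hypothetical probability that a user even looks at each position.
propensity = np.array([1.0, 0.5, 0.25])

def naive_ctr(clicks):
    """Raw click-through rate, which rewards items for being shown high."""
    return float(np.mean(clicks))

def ipw_relevance(clicks, positions, propensity):
    """Clicks reweighted by 1 / P(examined at that position)."""
    weights = 1.0 / propensity[positions]
    return float(np.mean(clicks * weights))

# Item A: always shown in the top slot, clicked 2 times out of 4.
clicks_a = np.array([1, 1, 0, 0])
positions_a = np.array([0, 0, 0, 0])

# Item B: always shown in the third slot, clicked 1 time out of 8.
clicks_b = np.array([1, 0, 0, 0, 0, 0, 0, 0])
positions_b = np.array([2, 2, 2, 2, 2, 2, 2, 2])

print(naive_ctr(clicks_a), naive_ctr(clicks_b))
print(ipw_relevance(clicks_a, positions_a, propensity),
      ipw_relevance(clicks_b, positions_b, propensity))
```

Under these assumed propensities, item B looks four times worse on raw click rate but comes out equal to item A once its placement is accounted for.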
Credit scoring is the original "scoring" application and is still the largest by economic impact. The Fair Isaac Corporation (now FICO) was founded in 1956 and introduced the modern general-purpose FICO Score at Equifax in 1989. FICO scores run from 300 to 850, a three-digit range that was chosen partly because storage was expensive at the credit bureaus when the format was designed and a three-digit field hit the right balance of precision and compactness.
FICO models are built on what the company calls Scorecard Module technology, which produces interpretable scorecards composed of binned features and additive weights. The resulting score is essentially a calibrated log-odds, mapped to the familiar 300 to 850 range. FICO has used machine learning in its development pipeline for decades, primarily to identify candidate variables and validate features, but the final scoring model that gets shipped is still an interpretable scorecard rather than a neural network or gradient boosted tree. In published research, FICO has compared scorecards against neural networks and gradient boosted trees on identical credit bureau data and found the accuracy gap to be under two percent, which the company argues does not justify giving up interpretability and the ability to generate adverse-action reasons required by the U.S. Equal Credit Opportunity Act.
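The general shape of an additive scorecard can be sketched as follows. This is not FICO's formula: the bins, weights, intercept, and the "points to double the odds" scaling constants are all hypothetical values chosen for illustration, using a scaling convention common in scorecard practice.

```python
import math

# Hypothetical scaling constants (not FICO's).
PDO = 40.0          # points added when the good:bad odds double
BASE_SCORE = 600.0  # score assigned at BASE_ODDS
BASE_ODDS = 30.0    # good:bad odds that map to BASE_SCORE

FACTOR = PDO / math.log(2)
OFFSET = BASE_SCORE - FACTOR * math.log(BASE_ODDS)

# Hypothetical scorecard: (upper bin edge, additive log-odds weight) per feature.
SCORECARD = {
    "utilization": [(0.30, 0.8), (0.70, 0.1), (float("inf"), -0.9)],
    "years_on_file": [(2, -0.6), (10, 0.2), (float("inf"), 0.7)],
}
INTERCEPT = 3.0

def bin_weight(bins, value):
    """Return the weight of the first bin whose upper edge covers the value."""
    for upper, weight in bins:
        if value <= upper:
            return weight
    raise ValueError("value not covered by any bin")

def log_odds_to_score(log_odds):
    """Linear map from log-odds to points, clamped to the 300 to 850 range."""
    raw = OFFSET + FACTOR * log_odds
    return int(round(min(850.0, max(300.0, raw))))

def score(applicant):
    """Sum the bin weights, then scale the resulting log-odds to a score."""
    log_odds = INTERCEPT + sum(
        bin_weight(bins, applicant[name]) for name, bins in SCORECARD.items()
    )
    return log_odds_to_score(log_odds)

low_risk = {"utilization": 0.10, "years_on_file": 15}
high_risk = {"utilization": 0.90, "years_on_file": 1}
print(score(low_risk), score(high_risk))
```

Because every feature contributes a fixed number of points per bin, the bins that cost an applicant the most points can be read off directly, which is the property that makes adverse-action reasons straightforward to generate.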
VantageScore, the credit score developed jointly by the three major U.S. credit bureaus and introduced in 2006, has used the same 300 to 850 range as modern FICO scores since version 3.0 and has shifted more aggressively toward machine learning in recent versions. Credit scoring is a useful example of why "score" is the right word: the output is a single number per applicant, intended to be ranked and thresholded by lenders, and the model behind it has to be defensible to regulators on both fairness and adverse-action grounds.
Scikit-learn standardized a particular interpretation of "scoring" through its scoring parameter, which appears in cross_val_score, cross_validate, GridSearchCV, RandomizedSearchCV, and related model-selection tools. The argument tells the tool which metric to optimize when comparing candidate models. For a single metric it accepts three forms: None (use the estimator's default), a string naming a predefined metric, or a callable scorer. cross_validate and the search classes additionally accept a list or dict of these to evaluate several metrics at once.
The convention is that higher return values are always better. To handle error metrics that are naturally minimized, scikit-learn provides negated versions. For instance, mean squared error is exposed as the string 'neg_mean_squared_error', which returns the negative of the metric so that a higher number still means a better model. The same pattern produces 'neg_log_loss', 'neg_brier_score', 'neg_mean_absolute_error', and friends.
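The sign convention is easy to see in a short run. The sketch below cross-validates a ridge regression on synthetic data (the dataset and alpha value are illustrative assumptions) and flips the sign back to recover the ordinary mean squared error.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Each fold returns the negated MSE, so every value is <= 0 and
# values closer to zero indicate a better model.
neg_mse = cross_val_score(
    Ridge(alpha=1.0), X, y, cv=5, scoring="neg_mean_squared_error"
)
mse = -neg_mse   # flip the sign to report the usual, positive MSE
print(neg_mse)
print(mse.mean())
```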
The table below lists frequently used scoring strings in scikit-learn alongside their typical use cases.
| Metric string | Underlying function | Problem type | Notes |
|---|---|---|---|
| accuracy | accuracy_score | Classification | Fraction of correct predictions; misleading for imbalanced classes. |
| balanced_accuracy | balanced_accuracy_score | Classification | Macro-average of per-class recall; better for imbalance. |
| roc_auc | roc_auc_score | Binary classification | Threshold-free measure of separability. |
| average_precision | average_precision_score | Binary classification | Summary of the precision-recall curve. |
| f1, f1_macro, f1_weighted | f1_score | Classification | Harmonic mean of precision and recall. |
| neg_log_loss | log_loss | Probabilistic classification | Strictly proper; needs predict_proba. |
| neg_brier_score | brier_score_loss | Probabilistic classification | Mean squared error on probabilities. |
| r2 | r2_score | Regression | Coefficient of determination. |
| neg_mean_squared_error | mean_squared_error | Regression | Negated so higher is better. |
| neg_root_mean_squared_error | root_mean_squared_error | Regression | Same units as the target. |
| neg_mean_absolute_error | mean_absolute_error | Regression | More robust to outliers than MSE. |
| adjusted_rand_score | adjusted_rand_score | Clustering | Compares two partitions, chance-corrected. |
| v_measure_score | v_measure_score | Clustering | Harmonic mean of homogeneity and completeness. |
For custom needs, sklearn.metrics.make_scorer wraps any metric function into a callable scorer. For loss functions, passing greater_is_better=False makes the scorer negate the result so that the higher-is-better convention still holds.
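Two typical uses of make_scorer are sketched below: passing extra arguments to a metric (an F-beta score with beta=2) and wrapping a loss with greater_is_better=False. The datasets, models, and scorer names are illustrative assumptions.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import fbeta_score, make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

# A score where higher is already better: extra keyword arguments are
# forwarded to the metric function.
f2_scorer = make_scorer(fbeta_score, beta=2)

# A loss where lower is better: the scorer negates the metric's value.
neg_mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)

Xc, yc = make_classification(n_samples=300, random_state=0)
f2_scores = cross_val_score(
    LogisticRegression(max_iter=1000), Xc, yc, cv=5, scoring=f2_scorer
)

Xr, yr = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
neg_mae_scores = cross_val_score(Ridge(), Xr, yr, cv=5, scoring=neg_mae_scorer)

print(f2_scores.mean(), neg_mae_scores.mean())
```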
The choice of scoring metric is rarely neutral. Accuracy is the default for classification but quietly fails on imbalanced data. A model that always predicts "no fraud" on a dataset where fraud is one percent of transactions scores 99 percent accuracy and detects zero fraud. ROC AUC sidesteps that by measuring how well the model ranks positives above negatives at every threshold, which is why it dominates as a default for binary classification with skewed classes. Average precision (the area under the precision-recall curve) tends to be more informative than ROC AUC when the positive class is very rare, because precision-recall focuses on the region of the threshold space where positives live.
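The fraud example can be reproduced in a few lines. The sketch uses a made-up label vector with a one percent positive rate and a model that always predicts the negative class, with a constant score standing in for its ranking output.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    recall_score,
    roc_auc_score,
)

# 1,000 transactions, 10 of them fraudulent (1 percent).
y_true = np.array([1] * 10 + [0] * 990)

# A "model" that always says "no fraud" and gives everyone the same score.
y_pred = np.zeros_like(y_true)
y_score = np.zeros(len(y_true))

acc = accuracy_score(y_true, y_pred)            # 0.99: looks excellent
rec = recall_score(y_true, y_pred)              # 0.0: catches no fraud at all
auc = roc_auc_score(y_true, y_score)            # 0.5: no ranking ability
ap = average_precision_score(y_true, y_score)   # 0.01: equals the positive rate
print(acc, rec, auc, ap)
```

Accuracy is the only one of the four that makes this model look good, which is why it is a poor sole metric for rare-event problems.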
For regression, R-squared is comparable across problems but can be misleading when the target variance differs between training and test sets. RMSE and MAE are in the same units as the target, which makes them easier to communicate. MAE is more robust to outliers; RMSE penalizes large errors more heavily. Probabilistic predictions need probabilistic metrics. If a downstream decision uses the probability directly (for example, an insurance premium that scales with predicted loss probability), then log loss or Brier score is the right evaluation target rather than accuracy.
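The different sensitivity of MAE and RMSE to a single large error shows up in a small worked example. The numbers are invented for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])

# Every prediction is off by exactly 1.
y_clean = y_true + 1.0

# Same predictions, except the last one is off by 21.
y_outlier = y_clean.copy()
y_outlier[-1] = y_true[-1] + 21.0

print(mae(y_true, y_clean), rmse(y_true, y_clean))       # 1.0 1.0
print(mae(y_true, y_outlier), rmse(y_true, y_outlier))   # 5.0 and sqrt(89), about 9.43
```

One bad prediction raises MAE from 1 to 5 but raises RMSE from 1 to roughly 9.4, because squaring lets the single large error dominate the average.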
Scoring is the part where the model actually gives you an answer. You feed in a new example, the model spits out a number, and that number tells you what it thinks. Sometimes the number is a probability, like "there is a 73 percent chance this email is spam." Sometimes it is a ranking, like "this video is a better match for you than that one." Sometimes it is a credit score, like 720 out of 850. The same word, scoring, is also used to grade the model itself. You give the model a test, compare its answers to the real ones, and the test result is a score that tells you how good the model is at its job.