Class
Last reviewed
May 11, 2026
Sources
10 citations
Review status
Source-backed
Revision
v2 · 2,182 words
See also: Machine learning terms
In machine learning, a class is one of the discrete labels that a classification model can assign to an input. Google's Machine Learning Glossary defines a class as "a category that a label can belong to," giving the spam filter as the canonical example: the two classes are spam and not spam. A dog breed identifier might use classes such as poodle, beagle, and pug. The class is the answer the model is choosing among, and in supervised learning the training data carries the correct class as the target.
Classes appear in nearly every classification setting: image recognition picks one class per image, sentiment models choose positive, neutral, or negative, medical screening tools predict disease present or disease absent, and language modeling treats each possible next token as its own class. The shape of the class set drives most of the architectural and loss function choices that follow.
A class is a discrete value drawn from a fixed, known set, often written as the label space Y. For a binary task Y might be {0, 1} or {spam, not spam}; for a multiclass task Y might be {cat, dog, bird, fish}. Picking a class for a given input means choosing one element of Y.
In practice the words class, label, and category are used almost interchangeably. The compound class label is common and treats the two as a single thing. A subtle distinction sometimes drawn is that a class is the group itself, while a label is the human-readable name attached to that group, but in code the words tend to point to the same idea.
Class differs from feature. Features are the inputs the model reads; the class is the output it produces. In a dataset the class lives in the target column, often called y or label, and the features live in the remaining columns. Class also differs from a continuous target: when the prediction is a real number such as price or temperature, the task is regression, not classification.
Binary classification uses exactly two classes. By convention one is called the positive class and the other the negative class. The positive class is usually the outcome the model is trying to detect: spam in spam filtering, fraud in transaction monitoring, cancer in a screening test. The negative class is the absence of that condition. Google's glossary puts it simply: the positive class is "the class you're trying to detect or predict."
Mathematically, the labels are arbitrary. Whether you call the rare event 1 or 0 does not change what the classifier learns, but it changes how every downstream metric reads. The four cells of a confusion matrix (true positive, false positive, true negative, false negative) and the precision, recall, and F1 numbers built on top of them are all defined relative to which class you have named positive. Most teams pick the positive class to be the rare or actionable outcome, so that recall and precision describe how well the model finds the thing that matters.
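The dependence on the choice of positive class can be seen in a minimal sketch. The helper name `binary_metrics` and the toy labels are illustrative assumptions, not part of any library.

```python
def binary_metrics(y_true, y_pred, positive):
    """Confusion-matrix cells and derived metrics, relative to `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy data (made up for illustration): same predictions, two naming choices.
y_true = ["spam", "not spam", "not spam", "not spam", "spam", "not spam"]
y_pred = ["spam", "not spam", "spam", "not spam", "not spam", "not spam"]

as_spam = binary_metrics(y_true, y_pred, positive="spam")
as_not_spam = binary_metrics(y_true, y_pred, positive="not spam")
```

The predictions are identical in both calls, yet precision and recall differ, because the true positives of one framing are the true negatives of the other.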
Binary classifiers usually emit a single probability for the positive class, and a decision threshold (often 0.5) converts that probability into a hard prediction.
A multi-class classification task assigns each sample to exactly one of three or more classes. Wikipedia defines it as "the problem of classifying instances into one of three or more classes," with binary classification reserved for the two-class case. Classic examples include MNIST digit recognition (ten classes), ImageNet image classification (1,000 classes), and the Iris dataset (three flower species).
The key constraint is that the classes are mutually exclusive: a sample is one class or another, never both.
Neural networks usually handle multiclass output with a softmax layer. Softmax turns a vector of raw scores (logits) into a vector of probabilities that sums to one, and the predicted class is the index of the largest probability. Many non-neural algorithms also support multiclass natively, including decision trees, k-nearest neighbors, naive Bayes, and logistic regression with most solvers.
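As a rough illustration of the softmax step described above, the following sketch uses only the standard library. The logit values are made up, and subtracting the maximum logit is a common numerical-stability trick rather than part of the definition.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    m = max(logits)  # shift by the max so exp() cannot overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["cat", "dog", "bird", "fish"]
logits = [2.0, 1.0, 0.1, -1.0]  # illustrative values, not from a real model

probs = softmax(logits)
predicted = classes[probs.index(max(probs))]  # index of the largest probability
```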
When a base algorithm only knows how to draw a boundary between two groups (the classic example is a linear support vector machine), there are two standard recipes for stretching it to many classes.
One vs rest (also called one vs all) trains one binary classifier per class. For class k, the positive class is k and the negative class is every other class combined. With K classes you train K classifiers, and at prediction time the class whose classifier scores highest wins: y_hat = argmax_k f_k(x). It is the default behavior in many scikit-learn estimators.
One vs one trains a classifier for every pair of classes, which is K(K - 1)/2 classifiers in total. Each one sees only the samples from its two classes. Prediction uses a vote: the class with the most wins is predicted. With ten classes that is 45 classifiers, with 100 it is 4,950, so one vs one is best when K is small or when the base learner scales badly with sample count. Kernel SVMs often use one vs one, and scikit-learn's SVC defaults to it.
A third option, error correcting output codes, assigns each class a bitstring and trains one classifier per bit; redundant bits let the ensemble recover from individual classifier errors. Scikit-learn exposes this as OutputCodeClassifier.
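The first two recipes can be sketched in plain Python. The scoring function `pairwise_winner` and the scores below are hypothetical stand-ins for trained binary classifiers; this is not how scikit-learn implements either strategy internally.

```python
from itertools import combinations


def n_one_vs_rest(k):
    """One binary classifier per class."""
    return k


def n_one_vs_one(k):
    """One binary classifier per unordered pair of classes: K(K - 1)/2."""
    return k * (k - 1) // 2


def predict_one_vs_rest(scores):
    """scores maps class k to f_k(x); the highest score wins."""
    return max(scores, key=scores.get)


def predict_one_vs_one(classes, pairwise_winner):
    """Each pairwise classifier casts one vote; the most-voted class wins."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[pairwise_winner(a, b)] += 1
    return max(votes, key=votes.get)


classes = ["cat", "dog", "bird"]
scores = {"cat": 0.2, "dog": 1.3, "bird": -0.5}  # made-up f_k(x) values

ovr_prediction = predict_one_vs_rest(scores)
# Hypothetical pairwise classifiers: here they simply side with the higher score.
ovo_prediction = predict_one_vs_one(
    classes, lambda a, b: a if scores[a] >= scores[b] else b
)
```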
Multilabel classification is the variant where a single sample can carry several class labels at once. Wikipedia describes it as "a variant of the classification problem where multiple nonexclusive labels may be assigned to each instance." A news article might be tagged politics and finance; a photo might contain cat, sofa, and window together.
The distinction from multiclass is the mutual exclusivity rule. In multiclass exactly one class fires per sample; in multilabel any subset can fire, including the empty set. Output neurons typically use a sigmoid per class rather than a shared softmax, so each class probability is independent. Binary cross-entropy applied per label is the common loss.
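A minimal sketch of the per-class sigmoid and per-label binary cross-entropy follows. The label names, logits, and the 0.5 cutoff are illustrative assumptions; production code would normally use a numerically stable loss that works directly on logits.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy, computed independently for each label."""
    losses = []
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses)


labels = ["cat", "sofa", "window", "dog"]
logits = [3.0, 1.2, 0.4, -2.5]  # made-up raw scores, one per label
targets = [1, 1, 1, 0]          # the photo contains cat, sofa, and window

probs = [sigmoid(z) for z in logits]
predicted = [name for name, p in zip(labels, probs) if p >= 0.5]
loss = multilabel_bce(logits, targets)
```

Unlike softmax outputs, these probabilities do not need to sum to one, which is what allows several labels to fire at once.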
Evaluation needs metrics that account for partial overlap between predicted and true label sets. Common choices are Hamming loss, the Jaccard index, micro and macro F1, and exact match (subset accuracy). Scikit-learn also distinguishes multiclass-multioutput, where the model predicts several non-binary properties at once, such as both the type and color of a fruit.
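Three of these metrics can be sketched on binary indicator rows, where each row holds one 0/1 entry per label. The data is made up, and scoring an empty true-and-predicted set as 1.0 in the Jaccard index is an assumption of this sketch; libraries may handle that case differently.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for tr, pr in zip(y_true, y_pred)
                for t, p in zip(tr, pr))
    return wrong / total


def jaccard_samples(y_true, y_pred):
    """Mean per-sample intersection over union of the label sets."""
    scores = []
    for tr, pr in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(tr, pr) if t and p)
        union = sum(1 for t, p in zip(tr, pr) if t or p)
        scores.append(inter / union if union else 1.0)  # assumption: empty vs empty = 1
    return sum(scores) / len(scores)


def exact_match(y_true, y_pred):
    """Subset accuracy: the whole label set must match."""
    return sum(tr == pr for tr, pr in zip(y_true, y_pred)) / len(y_true)


y_true = [[1, 1, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 1, 1]]

h = hamming_loss(y_true, y_pred)
j = jaccard_samples(y_true, y_pred)
e = exact_match(y_true, y_pred)
```

Here two of twelve label slots are wrong, so Hamming loss is low, while exact match is only 0.5 because any single wrong slot fails the whole row.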
Most models cannot consume the string "cat" directly, so classes have to be turned into numbers. Two encodings dominate.
Integer or label encoding maps each class to an integer: cat = 0, dog = 1, bird = 2. This is compact and works for tree based models, which only care whether two samples share a class. For models that treat inputs as ordered numbers, label encoding lies, since it implies dog sits between cat and bird.
One hot encoding sidesteps that by giving each class its own column. Three classes become three columns; the row for a dog is [0, 1, 0]. Wikipedia defines one hot as "a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0)." One hot is the standard target representation for softmax classifiers and is required for cross-entropy when the loss expects a probability distribution. Pandas exposes it via get_dummies, scikit-learn via OneHotEncoder.
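Both encodings can be written by hand in a few lines. This is a sketch of the idea, not of how `get_dummies` or `OneHotEncoder` work internally; the names `class_to_id` and `one_hot` are illustrative.

```python
classes = ["cat", "dog", "bird"]

# Integer (label) encoding: each class gets an arbitrary integer id.
class_to_id = {name: i for i, name in enumerate(classes)}


def one_hot(name):
    """One column per class, with a single 1 in the column for `name`."""
    row = [0] * len(classes)
    row[class_to_id[name]] = 1
    return row


dog_id = class_to_id["dog"]
dog_row = one_hot("dog")
```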
For very high cardinality class sets, such as a vocabulary of a hundred thousand tokens, one hot vectors are still the conceptual target, but implementations usually compute cross-entropy directly from the integer class id and the model's logits.
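The equivalence can be checked with a small sketch: cross-entropy against a one-hot target reduces to the negative log-probability of the true class, so the one-hot vector never has to be built. The function names and logits are illustrative.

```python
import math


def log_sum_exp(logits):
    m = max(logits)  # shift for numerical stability
    return m + math.log(sum(math.exp(z - m) for z in logits))


def cross_entropy_from_id(logits, class_id):
    """-log softmax(logits)[class_id], without building a one-hot vector."""
    return log_sum_exp(logits) - logits[class_id]


def cross_entropy_one_hot(logits, target):
    """The same loss written as a sum over a one-hot target distribution."""
    lse = log_sum_exp(logits)
    return -sum(t * (z - lse) for t, z in zip(target, logits))


logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores over a 4-class set
from_id = cross_entropy_from_id(logits, 1)
from_one_hot = cross_entropy_one_hot(logits, [0, 1, 0, 0])
```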
A dataset is class imbalanced when one class has far more examples than another. Google's glossary uses an extreme illustration: "1,000,000 negative labels" against "10 positive labels," a ratio of 100,000 to 1. Credit card fraud is a textbook case, with fraudulent transactions often well under 1% of the data. Medical screening, defect detection, and rare disease prediction all share the same shape.
The trouble is that a model trained on raw imbalanced data can score very high accuracy by always predicting the majority class, while missing the rare class that matters. The terms majority class and minority class name the two sides of the gap, and the minority class is usually the one of interest.
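The effect is easy to reproduce with made-up numbers. The sketch below assumes 10 positives among 1,000 examples and a "model" that always predicts the majority class.

```python
# Illustrative data: 10 minority (positive) examples, 990 majority (negative).
y_true = [1] * 10 + [0] * 990
y_pred = [0] * len(y_true)  # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)  # share of real positives that were found
```

Accuracy comes out at 0.99 while recall on the minority class is zero.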
The simplest moves are oversampling the minority class (duplicating examples) and undersampling the majority class (discarding examples). Both shift the training distribution toward balance, with the risk that duplication does not add real information and undersampling throws information away.
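Both moves can be sketched with the standard library. The function names and the toy records are assumptions for illustration; dedicated libraries offer more careful implementations.

```python
import random


def random_oversample(majority, minority, rng):
    """Duplicate random minority examples until the classes are the same size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def random_undersample(majority, minority, rng):
    """Keep only as many majority examples as there are minority examples."""
    return rng.sample(majority, len(minority)) + minority


rng = random.Random(0)  # fixed seed so the sketch is repeatable
majority = [("legit", i) for i in range(90)]
minority = [("fraud", i) for i in range(10)]

oversampled = random_oversample(majority, minority, rng)
undersampled = random_undersample(majority, minority, rng)
```

Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the real class distribution.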
SMOTE (Synthetic Minority Over-sampling Technique), introduced by Chawla, Bowyer, Hall, and Kegelmeyer in 2002 in the Journal of Artificial Intelligence Research, takes a more interesting approach. For each minority example it finds the k nearest minority neighbors and generates synthetic points along the line segments connecting them. The synthetic points sit between real points in feature space, so the minority class fills out its region rather than stacking on itself. The original paper showed that combining SMOTE with majority undersampling improved classifier performance in ROC space, and the method has spawned variants like Borderline SMOTE, ADASYN, and SMOTE-NC.
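The interpolation step can be sketched as follows. This is a simplified illustration, not the reference algorithm: it draws base points at random instead of iterating over every minority example, uses Euclidean distance, and assumes purely numeric features. The name `smote_sketch` is hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k, rng):
    """Generate n_new synthetic points between minority points and their neighbors."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x, excluding x itself
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from x to the neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbor)))
    return synthetic


rng = random.Random(0)
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 1.8)]  # made-up 2-D points
synthetic = smote_sketch(minority, n_new=5, k=2, rng=rng)
```

Because each synthetic point lies on a segment between two real minority points, it stays inside the region the minority class already occupies.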
Many training algorithms accept a per class weight that scales how much each example contributes to the loss. Setting the minority class weight higher tells the optimizer that missing a minority example costs more, pushing the decision boundary in the minority direction. Scikit-learn classifiers expose this through the class_weight argument, often with class_weight="balanced" to set weights automatically from class frequencies. The advantage is that you keep every example; the disadvantage is sensitivity to how the weights are chosen.
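The "balanced" heuristic documented by scikit-learn sets each weight to n_samples / (n_classes * class_count). The sketch below reimplements that formula by hand on made-up labels; the helper name `balanced_class_weights` is an assumption, not a library function.

```python
from collections import Counter


def balanced_class_weights(y):
    """weight_c = n_samples / (n_classes * count_c)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


y = ["fraud"] * 10 + ["legit"] * 990  # illustrative 1% minority class
weights = balanced_class_weights(y)
```

With these counts each fraud example weighs 50.0 and each legitimate one about 0.505, so the two classes contribute equally to the total loss.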
Focal loss was proposed by Lin, Goyal, Girshick, He, and Dollár in the 2017 paper Focal Loss for Dense Object Detection, which introduced the RetinaNet detector. The paper identified that dense object detection suffers from extreme foreground-background class imbalance: most candidate boxes are easy negatives, and standard cross-entropy is dominated by their sheer volume.
Focal loss reshapes cross-entropy with a modulating factor (1 - p_t)^gamma that down-weights examples the model already classifies confidently. As p_t approaches 1 the factor goes to zero, so a well-classified example contributes almost nothing; the gradient concentrates on hard, misclassified examples instead. With gamma = 2, RetinaNet matched the speed of earlier one-stage detectors while exceeding the accuracy of two-stage detectors. Focal loss is now used outside object detection wherever extreme imbalance appears.
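For a single binary example the modulating factor can be sketched as below. The sketch omits the optional alpha class-weighting term that the paper also uses, and the probabilities are made up.

```python
import math


def focal_loss(p, y, gamma=2.0):
    """-(1 - p_t)^gamma * log(p_t), where p_t is the probability of the true class."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def cross_entropy(p, y):
    """Focal loss with gamma = 0 is plain binary cross-entropy."""
    return focal_loss(p, y, gamma=0.0)


easy = 0.9  # model is confident and right about a positive example
hard = 0.1  # model is confident and wrong about a positive example

easy_ratio = focal_loss(easy, 1) / cross_entropy(easy, 1)  # (1 - 0.9)^2
hard_ratio = focal_loss(hard, 1) / cross_entropy(hard, 1)  # (1 - 0.1)^2
```

With gamma = 2 the easy example keeps about 1% of its cross-entropy loss while the hard example keeps about 81%, which is how the loss shifts attention toward hard cases.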
A simpler move is to keep the model and data as they are and adjust the decision threshold that converts probability into class. The default of 0.5 is convenient but not principled; lowering it for the positive class trades precision for recall. A 2024 comparison study on binary classification ("Balancing the Scales") found threshold calibration among the most consistently effective approaches, often matching or beating SMOTE and class weighting.
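The trade-off can be seen on a handful of made-up scores. The helper names and the 0.3 threshold below are illustrative; in practice the threshold would be chosen on a validation set.

```python
def predict(probs, threshold=0.5):
    """Convert positive-class probabilities into hard 0/1 predictions."""
    return [1 if p >= threshold else 0 for p in probs]


def precision_recall(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.35, 0.4, 0.3, 0.2, 0.1, 0.05]  # illustrative model outputs

p_default, r_default = precision_recall(y_true, predict(probs, 0.5))
p_lowered, r_lowered = precision_recall(y_true, predict(probs, 0.3))
```

Lowering the threshold from 0.5 to 0.3 recovers the missed positive, raising recall to 1.0, at the cost of two false positives that pull precision down to 0.6.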
The shape of the class set drives the rest of the system. A binary class set lets you use a single sigmoid and a threshold. A multiclass set with K classes calls for softmax and categorical cross-entropy. A multilabel set with K classes calls for K sigmoids and per label binary cross-entropy. The same input data can be reframed across these regimes by changing how classes are defined.
Class definitions are a design decision, not a fact about the world. Sentiment can be cast as binary (positive vs negative), three-way (with neutral), five-way, or as an ordinal regression problem. Each framing yields a different model and a different way to be wrong. Class boundaries that look natural to a researcher may collapse important distinctions for the end user, which is why thoughtful class design often matters more than algorithm choice.
Think of class like a sticker you put on something so you can sort it later. A teacher hands out toys and asks the kids to put each toy in one of three boxes: cars, dolls, or animals. Each box has a sticker on it, and that sticker is the class. A machine learning model does the same thing: it looks at a picture or a message and decides which sticker to put on it. Sometimes there are only two stickers, like spam and not spam. Sometimes there are a thousand. And sometimes one thing gets more than one sticker, like a movie that is both funny and scary.