Class

Updated Jun 25, 2026

See also: Machine learning terms

In machine learning, a class is one of the discrete categories that a classification model can assign to an input. Google's Machine Learning Glossary defines a class as "a category that a label can belong to," using a spam filter as the canonical example: the two classes are spam and not spam ^[1]. A class is the answer the model chooses among, drawn from a fixed and known set of possibilities, and in supervised learning the training data carries the correct class as the target label. A dog-breed identifier might use classes such as poodle, beagle, and pug; the ImageNet benchmark uses 1,000 classes; MNIST uses 10 ^[2]^[9].

Classes appear in nearly every classification setting: image recognition picks one class per image, sentiment models choose positive, neutral, or negative, medical screening tools predict disease present or disease absent, and language modeling treats each possible next token as its own class. The shape of the class set (how many classes, and whether a sample can belong to more than one) drives most of the architectural and loss-function choices that follow.

What is a class in machine learning?

A class is a discrete value drawn from a fixed, known set, often written as the label space Y. For a binary task Y might be {0, 1} or {spam, not spam}; for a multiclass task Y might be {cat, dog, bird, fish}. Picking a class for a given input means choosing one element of Y. A classification model predicts a class, whereas a regression model predicts a continuous number ^[1].

In practice the words class, label, and category are used almost interchangeably. The compound class label is common and treats the two as a single thing. A subtle distinction sometimes drawn is that a class is the group itself, while a label is the human-readable name attached to that group, but in code the words tend to point to the same idea.

Class differs from feature. Features are the inputs the model reads; the class is the output it produces. In a dataset the class lives in the target column, often called y or label, and the features live in the remaining columns. Class also differs from a continuous target: when the prediction is a real number such as price or temperature, the task is regression, not classification.

What is a positive class versus a negative class in binary classification?

Binary classification uses exactly two classes. By convention one is called the positive class and the other the negative class. The positive class is usually the outcome the model is trying to detect: spam in spam filtering, fraud in transaction monitoring, cancer in a screening test. The negative class is the absence of that condition. Google's glossary defines the positive class as "the class that your model is testing for," the outcome you are trying to detect or predict ^[1]^[8].

Mathematically, the labels are arbitrary. Whether you call the rare event 1 or 0 does not change what the classifier can learn, but it changes how every downstream metric reads. The four cells of a confusion matrix (true positive, false positive, true negative, false negative) and the precision, recall, and F1 numbers built on top of them are all defined relative to which class you have named positive ^[8]. Most teams pick the positive class to be the rare or actionable outcome, so that recall and precision describe how well the model finds the thing that matters.
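As a rough illustration of how the choice of positive class changes the reported numbers, the sketch below scores the same predictions twice, once with spam as the positive class and once with not spam. The helper names (confusion_counts, precision_recall) and the toy labels are invented for this example.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN with `positive` treated as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def precision_recall(y_true, y_pred, positive):
    """Precision and recall relative to the chosen positive class."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data: 2 spam messages, 6 legitimate ones.
y_true = ["spam", "not spam", "not spam", "not spam",
          "spam", "not spam", "not spam", "not spam"]
y_pred = ["spam", "not spam", "not spam", "spam",
          "not spam", "not spam", "not spam", "not spam"]

spam_view = precision_recall(y_true, y_pred, positive="spam")
other_view = precision_recall(y_true, y_pred, positive="not spam")
print("spam as positive:    ", spam_view)
print("not spam as positive:", other_view)
```

The predictions are identical in both calls; only the naming of the positive class changes, and the metrics move with it.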

Binary classifiers usually emit a single probability for the positive class, and a decision threshold (often 0.5) converts that probability into a hard prediction.

What is the difference between binary and multiclass classification?

Binary classification has exactly two classes; a multiclass classification task assigns each sample to exactly one of three or more classes. Wikipedia defines multiclass as "the problem of classifying instances into one of three or more classes," with binary classification reserved for the two-class case ^[2]. The classes are mutually exclusive: a sample is one class or another, never both.

Classic multiclass benchmarks make the idea concrete:

Dataset	Number of classes	Samples	Notes
MNIST (handwritten digits)	10	60,000 train + 10,000 test (about 7,000 per class on average)	28x28 grayscale images ^[11]
Iris (Fisher, 1936)	3	150 (50 per class)	setosa, versicolor, virginica ^[12]
ImageNet (ILSVRC 2012)	1,000	1,281,167 train + 50,000 validation + 100,000 test	leaf-node categories from the ImageNet hierarchy ^[9]

Neural networks usually handle multiclass output with a softmax layer. Softmax turns a vector of raw scores (logits) into a vector of probabilities that sums to one, and the predicted class is the index of the largest probability ^[10]. Many non-neural algorithms also support multiclass natively, including decision trees, k-nearest neighbors, naive Bayes, and logistic regression with most solvers ^[5].
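A minimal sketch of the softmax step, assuming four made-up classes and hand-picked logits. Subtracting the maximum logit before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical class names and logits, for illustration only.
classes = ["cat", "dog", "bird", "fish"]
logits = [2.0, 1.0, 0.1, -1.0]

probs = softmax(logits)
best_index = max(range(len(probs)), key=probs.__getitem__)
predicted = classes[best_index]
print(probs, predicted)
```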

How do one-vs-rest and one-vs-one extend binary classifiers to many classes?

When a base algorithm only knows how to draw a boundary between two groups (the classic example is a linear support vector machine), there are two standard recipes for stretching it to many classes.

One vs rest (also called one vs all) trains one binary classifier per class. For class k, the positive class is k and the negative class is every other class combined. With K classes you train K classifiers, and at prediction time the class whose classifier scores highest wins: y_hat = argmax_k f_k(x). Scikit-learn implements it as OneVsRestClassifier, and some of its linear estimators, such as LinearSVC, apply the same strategy by default ^[5].
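The argmax rule can be sketched in a few lines. The three scoring functions below are hand-written stand-ins for trained binary classifiers (they are assumptions for illustration, not output from any real model), and predict_one_vs_rest is a hypothetical helper name.

```python
def predict_one_vs_rest(scorers, x):
    """Return the class whose binary scorer f_k(x) is highest."""
    return max(scorers, key=lambda k: scorers[k](x))


# Hypothetical per-class scorers: each answers "class k versus the rest".
scorers = {
    "cat": lambda x: 1.0 * x[0] - 0.5 * x[1],
    "dog": lambda x: -0.2 * x[0] + 0.8 * x[1],
    "bird": lambda x: 0.1 * x[0] + 0.1 * x[1],
}

first = predict_one_vs_rest(scorers, (2.0, 1.0))
second = predict_one_vs_rest(scorers, (0.5, 3.0))
print(first, second)
```

With K classes the dictionary holds K scorers, which is why the cost of one vs rest grows linearly in the number of classes.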

One vs one trains a classifier for every pair of classes, which is K(K - 1)/2 classifiers in total. Each one sees only the samples from its two classes. Prediction uses a vote: the class with the most wins is predicted. With 10 classes that is 45 classifiers, with 100 it is 4,950, so one vs one is best when K is small or when the base learner scales badly with sample count ^[5]. Kernel SVMs often use one vs one, and scikit-learn's SVC defaults to it.
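A small sketch of the pairwise count and the voting step. The pairwise "classifiers" here are toy nearest-center rules on a one-dimensional input, chosen only so the example runs; the names centers, make_nearest, and predict_one_vs_one are assumptions.

```python
from collections import Counter
from itertools import combinations


def num_pairwise_classifiers(k):
    """Number of classifiers one vs one needs for k classes: k(k - 1)/2."""
    return k * (k - 1) // 2


def predict_one_vs_one(pairwise, classes, x):
    """Let every pairwise classifier vote and return the class with most wins."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](x)] += 1
    # Ties are broken by first-counted order here; real libraries may differ.
    return votes.most_common(1)[0][0]


# Toy setup: each class has a center on the number line.
centers = {"cat": 0.0, "dog": 5.0, "bird": 10.0}
classes = list(centers)


def make_nearest(a, b):
    """Build a toy binary classifier that picks the nearer of two centers."""
    return lambda x: a if abs(x - centers[a]) <= abs(x - centers[b]) else b


pairwise = {(a, b): make_nearest(a, b) for a, b in combinations(classes, 2)}

print(num_pairwise_classifiers(10), num_pairwise_classifiers(100))
print(predict_one_vs_one(pairwise, classes, 4.0))
```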

A third option, error-correcting output codes, assigns each class a bitstring and trains one classifier per bit; redundant bits let the ensemble recover from individual classifier errors. Scikit-learn exposes this as OutputCodeClassifier ^[5].
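The recovery property can be illustrated with a small codebook. The 7-bit codewords below are a hypothetical choice in which every pair of classes differs in 4 bits, and decoding by nearest Hamming distance is one common rule assumed here; neither is taken from scikit-learn's implementation.

```python
from itertools import combinations

# Hypothetical codebook: one bitstring per class, one binary classifier per bit.
codebook = {
    "cat":  (1, 1, 1, 1, 1, 1, 1),
    "dog":  (0, 0, 0, 0, 1, 1, 1),
    "bird": (0, 0, 1, 1, 0, 0, 1),
    "fish": (0, 1, 0, 1, 0, 1, 0),
}


def hamming_distance(a, b):
    """Number of positions where two bitstrings differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bits):
    """Pick the class whose codeword is closest to the predicted bits."""
    return min(codebook, key=lambda c: hamming_distance(codebook[c], bits))


min_distance = min(
    hamming_distance(codebook[a], codebook[b])
    for a, b in combinations(codebook, 2)
)

# The codeword for "bird" with its first bit flipped by a faulty classifier.
noisy = (1, 0, 1, 1, 0, 0, 1)
print(min_distance, decode(noisy))
```

Because the closest pair of codewords is 4 bits apart, any single wrong bit still leaves the true class as the unique nearest codeword.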

What is multilabel classification, and how is it different from multiclass?

Multilabel classification is the variant where a single sample can carry several class labels at once. Wikipedia describes it as "a variant of the classification problem where multiple nonexclusive labels may be assigned to each instance" ^[3]. A news article might be tagged politics and finance; a photo might contain cat, sofa, and window together.

The distinction from multiclass is the mutual-exclusivity rule. In multiclass exactly one class fires per sample; in multilabel any subset can fire, including the empty set. Output neurons typically use a sigmoid per class rather than a shared softmax, so each class probability is independent. Binary cross-entropy applied per label is the common loss.
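The per-class sigmoid and per-label binary cross-entropy can be sketched as follows. The label names, logits, and the 0.5 threshold are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Squash one logit into an independent probability."""
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_predict(logits, labels, threshold=0.5):
    """Every label whose own probability clears the threshold fires."""
    return {name for name, z in zip(labels, logits) if sigmoid(z) >= threshold}


def binary_cross_entropy(logits, targets):
    """Mean of the per-label binary cross-entropy terms."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for z, t in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(logits)


labels = ["cat", "sofa", "window", "dog"]
logits = [2.0, 0.3, 1.2, -3.0]

predicted = multilabel_predict(logits, labels)
nothing = multilabel_predict([-1.0, -2.0, -3.0, -4.0], labels)  # empty set is allowed
loss = binary_cross_entropy(logits, [1, 1, 1, 0])
print(predicted, nothing, loss)
```

Unlike softmax outputs, the sigmoid probabilities are not forced to sum to one, which is what lets several labels, or none, fire together.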

Evaluation needs metrics that account for partial overlap between predicted and true label sets. Common choices are Hamming loss, the Jaccard index, micro and macro F1, and exact match (subset accuracy). Scikit-learn also distinguishes multiclass-multioutput, where the model predicts several non-binary properties at once, such as both the type and color of a fruit ^[5].
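Three of these metrics are simple enough to sketch directly on binary indicator rows (one row per sample, one column per label). The function names and the toy data are assumptions, and scoring an empty-versus-empty Jaccard comparison as 1.0 is a convention chosen for this sketch.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are wrong."""
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    total = sum(len(row) for row in y_true)
    return wrong / total


def jaccard_per_sample(y_true, y_pred):
    """Mean over samples of |intersection| / |union| of the label sets."""
    scores = []
    for row_t, row_p in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(row_t, row_p) if t == 1 and p == 1)
        union = sum(1 for t, p in zip(row_t, row_p) if t == 1 or p == 1)
        scores.append(inter / union if union else 1.0)  # assumed convention
    return sum(scores) / len(scores)


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label set is predicted exactly."""
    exact = sum(1 for row_t, row_p in zip(y_true, y_pred) if row_t == row_p)
    return exact / len(y_true)


y_true = [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]

print(hamming_loss(y_true, y_pred))
print(jaccard_per_sample(y_true, y_pred))
print(subset_accuracy(y_true, y_pred))
```

Exact match is the strictest of the three: two of the three samples above are mostly right, yet only one counts as correct.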

How are classes encoded as numbers?

Most models cannot consume the string "cat" directly, so classes have to be turned into numbers. Two encodings dominate.

Integer or label encoding maps each class to an integer: cat = 0, dog = 1, bird = 2. This is compact and works for tree-based models, which only care whether two samples share a class. For models that treat inputs as ordered numbers, label encoding is misleading, since it implies an ordering in which dog sits between cat and bird.

One-hot encoding sidesteps that by giving each class its own column. Three classes become three columns; the row for a dog is [0, 1, 0]. Wikipedia defines one-hot as "a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0)" ^[4]. One-hot is the standard target representation for softmax classifiers and is required for cross-entropy when the loss expects a probability distribution. Pandas exposes it via get_dummies, scikit-learn via OneHotEncoder.
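Both encodings fit in a few lines of plain Python. This is a sketch with hypothetical helper names (label_encode, one_hot); libraries such as pandas and scikit-learn provide their own versions.

```python
def label_encode(values, classes):
    """Map each class name to its position in the `classes` list."""
    index = {name: i for i, name in enumerate(classes)}
    return [index[v] for v in values]


def one_hot(ids, num_classes):
    """One column per class; exactly one 1 in each row."""
    return [[1 if col == i else 0 for col in range(num_classes)] for i in ids]


classes = ["cat", "dog", "bird"]          # cat = 0, dog = 1, bird = 2
samples = ["dog", "cat", "bird", "dog"]

ids = label_encode(samples, classes)
rows = one_hot(ids, len(classes))
print(ids)
print(rows)
```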

For very high-cardinality class sets, such as a vocabulary of a hundred thousand tokens, one-hot vectors are still the conceptual target, but implementations usually compute cross-entropy directly from the integer class id and the model's logits.
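The shortcut works because a one-hot target zeroes out every term of the cross-entropy sum except the one for the true class. The sketch below, with made-up logits, computes the loss both ways and shows that they agree; the function names are assumptions.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log of the softmax denominator."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))


def cross_entropy_from_id(logits, class_id):
    """Cross-entropy using only the integer id of the true class."""
    return log_sum_exp(logits) - logits[class_id]


def cross_entropy_from_one_hot(logits, target):
    """The same loss written as a full sum over a one-hot target vector."""
    lse = log_sum_exp(logits)
    return -sum(t * (z - lse) for t, z in zip(target, logits))


logits = [2.0, 1.0, 0.1, -1.0]
class_id = 2
target = [0, 0, 1, 0]

sparse = cross_entropy_from_id(logits, class_id)
dense = cross_entropy_from_one_hot(logits, target)
print(sparse, dense)
```

The integer form does one lookup instead of a sum over the whole class set, which matters when the set holds a hundred thousand entries.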

What is class imbalance?

A dataset is class imbalanced when one class has far more examples than another. Google's glossary uses an extreme illustration: a dataset with "1,000,000 negative labels" against "10 positive labels," a ratio of 100,000 to 1 ^[1]. Credit-card fraud is a textbook case, with fraudulent transactions often well under 1% of the data. Medical screening, defect detection, and rare-disease prediction all share the same shape.

The trouble is that a model trained on raw imbalanced data can score very high accuracy by always predicting the majority class, while missing the rare class that matters. The terms majority class and minority class name the two sides of the gap, and the minority class is usually the one of interest ^[1].
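A toy calculation makes the problem concrete. The 990-to-10 split below is an invented example, and the "model" simply predicts the majority class every time.

```python
# Invented dataset: 990 negatives (0) and 10 positives (1).
y_true = [0] * 990 + [1] * 10

# A degenerate "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy looks excellent while recall on the minority class is zero, which is why imbalanced problems are usually judged on precision, recall, or similar metrics instead.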

How does SMOTE handle class imbalance?

The simplest moves are oversampling the minority class (duplicating examples) and undersampling the majority class (discarding examples). Both shift the training distribution toward balance, with the risk that duplication does not add real information and undersampling throws information away.

SMOTE (Synthetic Minority Over-sampling Technique), introduced by Chawla, Bowyer, Hall, and Kegelmeyer in 2002 in the Journal of Artificial Intelligence Research (volume 16, pages 321-357), takes a more interesting approach ^[6]. For each minority example it finds the k nearest minority neighbors and generates synthetic points along the line segments connecting them. The synthetic points sit between real points in feature space, so the minority class fills out its region rather than stacking on itself. The original paper concluded that "a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class" ^[6]. The method has since spawned variants such as Borderline-SMOTE, ADASYN, and SMOTE-NC.
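The interpolation idea can be sketched in plain Python. This is a simplified illustration, not the reference implementation: it picks base points at random, uses Euclidean distance, and the function name smote_sketch, the default k, and the seed are all assumptions.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority points on segments between near neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Toy 2-D minority class.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (2.5, 2.5), (3.0, 1.0)]
synthetic = smote_sketch(minority, n_synthetic=5)
for point in synthetic:
    print(point)
```

Each synthetic point is a convex combination of two real minority points, so it always lands between them rather than on top of an existing example.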

What are class weights?

Many training algorithms accept a per-class weight that scales how much each example contributes to the loss. Setting the minority-class weight higher tells the optimizer that missing a minority example costs more, pushing the decision boundary in the minority direction. Scikit-learn classifiers expose this through the class_weight argument, often with class_weight="balanced" to set weights automatically from class frequencies. The advantage is that you keep every example; the disadvantage is sensitivity to how the weights are chosen.
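For the "balanced" setting, scikit-learn's documentation gives the heuristic n_samples / (n_classes * count of that class). The sketch below reproduces that arithmetic in plain Python on an invented 90-to-10 split; balanced_class_weights is a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Invented labels: 90 majority examples (0) and 10 minority examples (1).
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights)

# Total weight per class comes out equal, so neither class dominates the loss.
total_per_class = {c: weights[c] * y.count(c) for c in weights}
print(total_per_class)
```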

What is focal loss?

Focal loss was proposed by Lin, Goyal, Girshick, He, and Dollár in the 2017 paper Focal Loss for Dense Object Detection, which introduced the RetinaNet detector ^[7]. The paper identified "the extreme foreground-background class imbalance encountered during training of dense detectors" as the central cause of lower one-stage accuracy: most candidate boxes are easy negatives, and standard cross-entropy is dominated by their sheer volume ^[7].

Focal loss reshapes cross-entropy with a modulating factor (1 - p_t)^gamma that down-weights examples the model already classifies confidently. As p_t approaches 1 the factor goes to zero, so a well-classified example contributes almost nothing; the gradient concentrates on hard, misclassified examples instead. With gamma = 2 (the value the authors found best), RetinaNet matched the speed of earlier one-stage detectors while exceeding the accuracy of the existing two-stage detectors ^[7]. Focal loss is now used well outside object detection wherever extreme imbalance appears.
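A minimal sketch of the modulating factor for a single example, where p_t is the probability the model assigns to the true class. It omits the optional class-balancing weight and uses gamma = 2 as the default; the example probabilities are made up.

```python
import math


def cross_entropy(p_t):
    """Standard cross-entropy for one example."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """Cross-entropy scaled by the modulating factor (1 - p_t)^gamma."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy, hard = 0.9, 0.1  # confident-and-correct versus badly wrong

# Ratio of focal loss to cross-entropy equals the modulating factor.
easy_ratio = focal_loss(easy) / cross_entropy(easy)
hard_ratio = focal_loss(hard) / cross_entropy(hard)
print(easy_ratio, hard_ratio)
```

The easy example keeps about 1% of its original loss while the hard example keeps about 81%, which is how the gradient ends up concentrated on the hard cases. Setting gamma to 0 recovers plain cross-entropy.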

How well does threshold calibration work compared to SMOTE?

A simpler move is to keep the model and data as they are and adjust the decision threshold that converts probability into class. The default of 0.5 is convenient but not principled; lowering it for the positive class trades precision for recall. A 2024 comparison study by Abdelhamid and Desai, Balancing the Scales: A Comprehensive Study on Tackling Class Imbalance in Binary Classification, ran 9,000 experiments across 15 machine learning models and 30 datasets and found decision threshold calibration to be "the most consistently effective technique," often matching or beating SMOTE and class-weight tuning; the same study reported that SMOTE produced the least well-calibrated probabilities of the three ^[13].
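The precision-for-recall trade can be seen on a handful of scores. The probabilities and labels below are invented, and the two thresholds are arbitrary points chosen for illustration rather than tuned values.

```python
def precision_recall_at(probs, y_true, threshold):
    """Precision and recall when scores >= threshold are called positive."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented positive-class probabilities and true labels.
probs = [0.9, 0.6, 0.45, 0.4, 0.35, 0.2, 0.1, 0.05]
y_true = [1, 1, 1, 0, 1, 0, 0, 0]

default_cut = precision_recall_at(probs, y_true, threshold=0.5)
lower_cut = precision_recall_at(probs, y_true, threshold=0.3)
print("threshold 0.5:", default_cut)
print("threshold 0.3:", lower_cut)
```

In practice the threshold would be chosen on a validation set against whatever metric the application cares about, not read off the training data.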

Why does the shape of the class set matter?

The shape of the class set drives the rest of the system. A binary class set lets you use a single sigmoid and a threshold. A multiclass set with K classes calls for softmax and categorical cross-entropy. A multilabel set with K classes calls for K sigmoids and per-label binary cross-entropy. The same input data can be reframed across these regimes by changing how classes are defined.

Class definitions are a design decision, not a fact about the world. Sentiment can be cast as binary (positive vs negative), three-way (with neutral), five-way, or ordinal regression. Each framing yields a different model and a different way to be wrong. Class boundaries that look natural to a researcher may collapse important distinctions for the end user, which is why thoughtful class design often matters more than algorithm choice.

Explain like I am 5

Think of class like a sticker you put on something so you can sort it later. A teacher hands out toys and asks the kids to put each toy in one of three boxes: cars, dolls, or animals. Each box has a sticker on it, and that sticker is the class. A machine learning model does the same thing: it looks at a picture or a message and decides which sticker to put on it. Sometimes there are only two stickers, like spam and not spam. Sometimes there are a thousand. And sometimes one thing gets more than one sticker, like a movie that is both funny and scary.

References

1. Google for Developers. "Machine Learning Glossary." https://developers.google.com/machine-learning/glossary
2. Wikipedia. "Multiclass classification." https://en.wikipedia.org/wiki/Multiclass_classification
3. Wikipedia. "Multi-label classification." https://en.wikipedia.org/wiki/Multi-label_classification
4. Wikipedia. "One-hot." https://en.wikipedia.org/wiki/One-hot
5. Scikit-learn. "Multiclass and multioutput algorithms." https://scikit-learn.org/stable/modules/multiclass.html
6. Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. "SMOTE: Synthetic Minority Over-sampling Technique." Journal of Artificial Intelligence Research, 16:321-357 (2002). https://www.jair.org/index.php/jair/article/view/10302
7. Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. "Focal Loss for Dense Object Detection." ICCV 2017. https://arxiv.org/abs/1708.02002
8. Sharp Sight. "Positive and Negative Classes, Explained." https://sharpsight.ai/blog/positive-and-negative-classes/
9. Google for Developers. "Neural networks: Multi-class classification." https://developers.google.com/machine-learning/crash-course/neural-networks/multi-class
10. Wikipedia. "Softmax function." https://en.wikipedia.org/wiki/Softmax_function
11. Wikipedia. "MNIST database." https://en.wikipedia.org/wiki/MNIST_database
12. Wikipedia. "Iris flower data set." https://en.wikipedia.org/wiki/Iris_flower_data_set
13. Abdelhamid, M., and Desai, A. "Balancing the Scales: A Comprehensive Study on Tackling Class Imbalance in Binary Classification." arXiv:2409.19751 (2024). https://arxiv.org/abs/2409.19751
