Focal loss
Last reviewed
Apr 30, 2026
Sources
25 citations
Review status
Source-backed
Revision
v3 · 3,557 words
Focal loss is a loss function introduced by Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár at Facebook AI Research in their 2017 paper "Focal Loss for Dense Object Detection." [1] It was designed to address extreme class imbalance in one-stage object detection models and was a central component of the RetinaNet architecture introduced in the same paper. The paper received the Best Student Paper Award at ICCV 2017. [2]
Focal loss reshapes the standard cross-entropy loss so that well-classified examples contribute much less to the gradient, focusing training on hard, misclassified examples. The reshaping is governed by a focusing parameter gamma, which scales a (1 - p_t)^gamma modulating factor in front of the log-likelihood term. When gamma = 0, focal loss reduces exactly to cross-entropy. When gamma > 0, easy examples are down-weighted and the bulk of the gradient signal comes from hard examples.
The motivation is specific to dense prediction. A one-stage detector like YOLO or SSD evaluates roughly 10^4 to 10^5 candidate locations per image, and the overwhelming majority correspond to background. Cross-entropy summed across so many easy negatives drowns out the gradient from the few foreground objects. Two-stage detectors such as Faster R-CNN sidestep this by sampling a balanced mini-batch from region proposals. Focal loss offered a simpler alternative that required no sampling heuristics, and it allowed RetinaNet to match the accuracy of two-stage detectors at one-stage speeds. [1]
Since its introduction, focal loss has been adopted across computer vision tasks beyond detection, including instance segmentation, semantic segmentation, and long-tail imbalanced learning. It has also seeded follow-up losses including Quality and Distribution Focal Loss in Generalized Focal Loss, the Class-Balanced Loss of Cui et al. (2019), asymmetric focal variants, and PolyLoss. As of early 2026, the paper has been cited more than 30,000 times on Google Scholar. [3]
The class imbalance problem in object detection comes from how dense detectors enumerate candidate locations. A typical anchor-based one-stage detector tiles the input image with anchor boxes at multiple scales and aspect ratios, yielding on the order of 100,000 candidate locations per image, of which only a few dozen actually overlap with ground-truth objects. The ratio of negatives to positives is therefore roughly 1000:1 or higher. [1]
With vanilla binary cross-entropy summed over all anchors, two issues emerge. First, the loss is dominated by negatives in raw magnitude: even if each individual easy negative has a small loss value, the sheer count means the cumulative loss from negatives is much larger than from positives. Second, the gradients are dominated by easy negatives in the same way, so the optimizer spends most of its updates pushing already-correct background predictions slightly more toward 0, rather than learning to detect the rare foreground objects.
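The two effects can be checked with simple arithmetic. The sketch below assumes a hypothetical image with 100,000 easy background anchors (p_t = 0.99) and 50 harder foreground anchors (p_t = 0.30); these counts and probabilities are illustrative, not taken from the paper. It uses the fact that, for a sigmoid output, the magnitude of the cross-entropy gradient with respect to the logit is 1 - p_t.

```python
import math

# Hypothetical anchor mix for one image (illustrative values only).
N_NEG, P_T_NEG = 100_000, 0.99   # easy background anchors
N_POS, P_T_POS = 50, 0.30        # harder foreground anchors


def ce_loss(p_t):
    """Cross-entropy for one example with true-class probability p_t."""
    return -math.log(p_t)


def ce_grad_magnitude(p_t):
    """|dCE/dlogit| for a sigmoid output, which equals 1 - p_t."""
    return 1.0 - p_t


def negative_share(per_example):
    """Fraction of the summed quantity that comes from the negatives."""
    neg = N_NEG * per_example(P_T_NEG)
    pos = N_POS * per_example(P_T_POS)
    return neg / (neg + pos)


loss_share = negative_share(ce_loss)
grad_share = negative_share(ce_grad_magnitude)
print(f"negatives' share of loss: {loss_share:.3f}")
print(f"negatives' share of gradient magnitude: {grad_share:.3f}")
```

Under these assumptions the easy negatives account for well over 90 percent of both the summed loss and the summed gradient magnitude, even though each one is individually almost correct.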
Previous approaches addressed this through sampling and mining. Two-stage detectors handle imbalance via region proposals that prune the candidate set down to roughly 1,000 to 2,000 boxes per image, followed by sampling a fixed-size mini-batch (typically 256 in Faster R-CNN) with a controlled positive-to-negative ratio of 1:3. SSD used hard negative mining, sorting negatives by their loss and keeping the top-k highest-loss negatives so the negative-to-positive ratio is at most 3:1. [4] Online Hard Example Mining (OHEM), introduced by Shrivastava, Gupta, and Girshick in 2016, automated the selection of hard examples within mini-batches based on their current loss values. [5] All of these methods added pipeline complexity, hyperparameters, and sometimes extra forward passes.
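As a rough illustration of the mining approach, the sketch below keeps every positive anchor plus the highest-loss negatives, capped at three negatives per positive. The function name and the toy loss values are hypothetical; real detectors apply this per image on tensors rather than Python lists.

```python
def hard_negative_mining(losses, is_positive, neg_pos_ratio=3):
    """Return sorted indices of anchors kept for the loss.

    Keeps all positives and the hardest negatives, with at most
    neg_pos_ratio negatives per positive.
    """
    pos_idx = [i for i, pos in enumerate(is_positive) if pos]
    neg_idx = [i for i, pos in enumerate(is_positive) if not pos]
    # Sort negatives from hardest (largest loss) to easiest.
    neg_idx.sort(key=lambda i: losses[i], reverse=True)
    num_neg = min(len(neg_idx), neg_pos_ratio * len(pos_idx))
    return sorted(pos_idx + neg_idx[:num_neg])


# Toy example: one positive anchor and seven negatives.
losses = [2.1, 0.02, 0.9, 0.01, 0.4, 0.03, 0.6, 0.05]
is_positive = [True, False, False, False, False, False, False, False]
kept = hard_negative_mining(losses, is_positive)
print(kept)  # [0, 2, 4, 6]
```

The discarded negatives contribute nothing to the update, which is the hard cutoff that focal loss replaces with a smooth down-weighting.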
Focal loss instead changes the loss function so that no sampling is needed. Every anchor contributes to the loss, but easy examples contribute almost nothing. The key insight: rather than fix imbalance at the data level, fix it at the loss level. [1]
Focal loss starts from binary cross-entropy loss. For a binary problem with label y in {0, 1} and predicted probability p for class 1:
CE(p, y) = -log(p)      if y = 1
CE(p, y) = -log(1 - p)  if y = 0
Defining p_t = p if y = 1 else 1 - p, this becomes CE(p_t) = -log(p_t). [1]
A first attempt at handling imbalance is to weight the loss differently for the two classes. Define alpha in [0, 1] for class 1 and 1 - alpha for class 0, and let alpha_t = alpha if y = 1 else 1 - alpha. The alpha-balanced cross-entropy is -alpha_t * log(p_t). This rebalances positive and negative classes but does not distinguish easy from hard examples. [1]
Focal loss adds a multiplicative modulating factor (1 - p_t)^gamma:
FL(p_t) = -(1 - p_t)^gamma * log(p_t)
where gamma >= 0 is the focusing parameter. The alpha-balanced form, used in the paper's main experiments, combines both:
FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
When gamma = 0, focal loss reduces to alpha-balanced cross-entropy. As gamma increases, the modulating factor sharply down-weights well-classified examples (p_t close to 1) while leaving hard examples (p_t close to 0) almost unchanged. [1]
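The definitions above translate directly into a few lines of code. This is a minimal scalar sketch for a single prediction, written for readability rather than numerical robustness; the function name and the alpha=None convention for dropping the class weight are choices made here for illustration.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Alpha-balanced focal loss for one binary prediction.

    p is the predicted probability of class 1 and y is the label (0 or 1).
    Pass alpha=None to drop the class-balancing weight.
    """
    p_t = p if y == 1 else 1.0 - p
    if alpha is None:
        alpha_t = 1.0
    else:
        alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# With gamma = 0 and no alpha, the result is plain cross-entropy, -log(p_t).
assert math.isclose(focal_loss(0.9, 1, gamma=0.0, alpha=None), -math.log(0.9))

easy = focal_loss(0.9, 1, gamma=2.0, alpha=None)  # well-classified positive
hard = focal_loss(0.1, 1, gamma=2.0, alpha=None)  # misclassified positive
print(f"easy: {easy:.4f}, hard: {hard:.4f}")
```

The easy example's loss is scaled by (1 - 0.9)^2 = 0.01, while the hard example's loss is scaled by (1 - 0.1)^2 = 0.81, so the gap between them widens by roughly two orders of magnitude relative to cross-entropy.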
The paper reports that gamma = 2 and alpha = 0.25 worked best for RetinaNet on COCO. For alpha-balanced cross-entropy without the modulating factor (gamma = 0), the best alpha was around 0.75, which up-weights the rare positives; with gamma = 2 the best value drops to 0.25. The intuition is that once the modulating factor has down-weighted the easy negatives, positives need less explicit emphasis from the class weight. [1]
RetinaNet uses sigmoid activations rather than softmax, treating each class as an independent binary classifier. The image-level loss is summed over all anchors and divided by the number of anchors assigned to ground-truth boxes (not the total anchor count). Dividing by foreground count gives a stable gradient magnitude as easy negatives are down-weighted. [1]
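The normalization can be sketched as follows. This is an illustrative pure-Python version, not the reference implementation: logits and targets are nested lists with one row per anchor and one column per class, and the max(1, ...) guard against images with no foreground anchors is an assumption added here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def retinanet_cls_loss(logits, targets, num_foreground, gamma=2.0, alpha=0.25):
    """Sum sigmoid focal loss over every anchor and class, then divide
    by the number of foreground anchors rather than by all anchors."""
    total = 0.0
    for anchor_logits, anchor_targets in zip(logits, targets):
        for z, y in zip(anchor_logits, anchor_targets):
            p = sigmoid(z)  # each class is an independent binary problem
            p_t = p if y == 1 else 1.0 - p
            alpha_t = alpha if y == 1 else 1.0 - alpha
            total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    # Guard against division by zero (an assumption, not from the paper).
    return total / max(1, num_foreground)


# Two anchors, two classes: the first anchor is foreground for class 0.
logits = [[0.0, -4.0], [-4.0, -4.0]]
targets = [[1, 0], [0, 0]]
print(retinanet_cls_loss(logits, targets, num_foreground=1))
```

Because the three easy negative entries add almost nothing to the sum, the result is dominated by the single foreground entry, and the normalizer tracks the number of examples that actually matter.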
The modulating factor (1 - p_t)^gamma down-weights the loss for well-classified examples and widens the range of p_t values that receive low loss. The table below shows this for gamma = 2.
| p_t | CE = -log(p_t) | (1 - p_t)^2 | FL (gamma = 2) | Ratio FL / CE |
|---|---|---|---|---|
| 0.99 | 0.0101 | 0.0001 | 0.0000 | 0.0001 |
| 0.95 | 0.0513 | 0.0025 | 0.0001 | 0.0025 |
| 0.90 | 0.1054 | 0.0100 | 0.0011 | 0.0100 |
| 0.75 | 0.2877 | 0.0625 | 0.0180 | 0.0625 |
| 0.50 | 0.6931 | 0.2500 | 0.1733 | 0.2500 |
| 0.25 | 1.3863 | 0.5625 | 0.7798 | 0.5625 |
| 0.10 | 2.3026 | 0.8100 | 1.8651 | 0.8100 |
A confidently correct example with p_t = 0.9 has its loss reduced by a factor of 100. A very confident example with p_t = 0.99 has its loss reduced by a factor of 10,000. By contrast, a hard example with p_t = 0.1 is barely down-weighted at all (factor 0.81). Examples right at the decision boundary (p_t = 0.5) are reduced by a factor of 4. [1]
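The table values follow directly from the formula and can be regenerated with a short script. The helper name focal_table is chosen here for illustration.

```python
import math


def focal_table(p_values, gamma=2.0):
    """Return (p_t, CE, modulating factor, FL) for each p_t."""
    rows = []
    for p_t in p_values:
        ce = -math.log(p_t)
        factor = (1.0 - p_t) ** gamma
        rows.append((p_t, ce, factor, factor * ce))
    return rows


rows = focal_table([0.99, 0.95, 0.90, 0.75, 0.50, 0.25, 0.10])
for p_t, ce, factor, fl in rows:
    # FL / CE equals the modulating factor, so it is not printed separately.
    print(f"{p_t:.2f}  CE={ce:.4f}  factor={factor:.4f}  FL={fl:.4f}")
```

Changing gamma in the call shows how quickly the factor column collapses for well-classified examples as the focusing parameter grows.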
No example is fully ignored, but the cumulative effect of summing across 100,000 anchors is that easy negatives contribute negligibly to the total. The paper supports this empirically: with cross-entropy, the bottom 70 percent of negatives by loss contribute about half of the total loss, while with focal loss at gamma = 2 they contribute essentially zero. [1]
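The cumulative effect can be illustrated with a toy calculation. The sketch assumes a hypothetical image with 100,000 easy negatives at p_t = 0.99 and 50 positives at p_t = 0.30; these values are made up for illustration and are not the paper's measurements.

```python
import math


def total_loss(count, p_t, gamma):
    """Summed focal loss of `count` identical examples (gamma=0 gives CE)."""
    return count * (1.0 - p_t) ** gamma * -math.log(p_t)


def negative_share(gamma, n_neg=100_000, p_t_neg=0.99, n_pos=50, p_t_pos=0.30):
    """Fraction of the total loss contributed by the easy negatives."""
    neg = total_loss(n_neg, p_t_neg, gamma)
    pos = total_loss(n_pos, p_t_pos, gamma)
    return neg / (neg + pos)


for gamma in (0.0, 1.0, 2.0):
    print(f"gamma={gamma}: negatives' share = {negative_share(gamma):.4f}")
```

Under these assumptions the negatives' share of the total falls from above 90 percent at gamma = 0 to below 1 percent at gamma = 2, without discarding a single anchor.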
The paper sweeps gamma from 0 to 5 and reports COCO AP with RetinaNet using a ResNet-50 FPN backbone. With the best alpha for each setting, the reported numbers were approximately: 31.1 AP at gamma = 0 (alpha-balanced cross-entropy with alpha = 0.75), 32.9 at gamma = 0.5, 33.7 at gamma = 1, 34.0 at gamma = 2, and 32.2 at gamma = 5; unweighted cross-entropy reached about 30.2. The accuracy peaks around gamma = 2, which became the de facto default. [1] Pushing gamma too high under-weights examples that are merely hard rather than misclassified, slowing convergence.
RetinaNet is the one-stage object detector proposed in the same 2017 paper. It combines a ResNet backbone, a Feature Pyramid Network (FPN) neck, and two task-specific subnetworks for classification and box regression. [1]
The FPN constructs feature maps at five pyramid levels (P3 through P7) with strides of 8, 16, 32, 64, and 128 pixels. At each level, anchors with three scales and three aspect ratios are tiled across the feature map. With nine anchors per spatial location across five pyramid levels, an input image with a shorter side of 600 to 800 pixels yields tens of thousands to roughly 100,000 anchors, the source of the extreme class imbalance. [1] The classification subnet is a small fully convolutional network with four 3-by-3 conv layers (256 channels, ReLU) and a final conv layer predicting K * A channels, where K is the number of object classes and A = 9 is the number of anchors per location.
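The anchor count can be estimated from the strides alone. The sketch below assumes each pyramid level has ceil(size / stride) cells per side; actual implementations may pad or round differently, so the exact totals are approximate.

```python
import math


def count_anchors(height, width, strides=(8, 16, 32, 64, 128),
                  anchors_per_location=9):
    """Estimate the number of anchors tiled over pyramid levels P3 to P7."""
    total = 0
    for stride in strides:
        # Assumed feature-map size: ceil(input size / stride) per side.
        cells = math.ceil(height / stride) * math.ceil(width / stride)
        total += cells * anchors_per_location
    return total


for size in (600, 800):
    print(size, count_anchors(size, size))
```

Most of the anchors come from the finest level (stride 8), and the totals land in the 10^4 to 10^5 range for typical input sizes, against a few dozen ground-truth objects at most.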
A crucial implementation detail is the bias initialization of the final classification layer. With a default initialization, training is unstable: in the first iterations the model assigns roughly equal probability to foreground and background, the loss is dominated by the many background anchors, and training diverges. The fix is to initialize the bias so that the predicted foreground probability is pi = 0.01 at the start of training, which gives b = -log((1 - pi) / pi). The paper emphasizes this trick as essential, not optional. [1]
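The effect of the prior is easy to verify numerically. The sketch below computes the bias for pi = 0.01 and compares the initial cross-entropy of a background anchor under that bias with the value under a zero bias; it assumes the layer's weights contribute nothing at initialization, so the logit equals the bias.

```python
import math


def prior_bias(pi=0.01):
    """Bias b such that sigmoid(b) equals the foreground prior pi."""
    return -math.log((1.0 - pi) / pi)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


b = prior_bias(0.01)
p_init = sigmoid(b)  # initial foreground probability for every anchor

# Initial cross-entropy of one background anchor (true class: background).
loss_with_prior = -math.log(1.0 - p_init)
loss_zero_bias = -math.log(1.0 - sigmoid(0.0))

print(f"b = {b:.4f}, initial foreground probability = {p_init:.4f}")
print(f"background loss with prior: {loss_with_prior:.4f}")
print(f"background loss with zero bias: {loss_zero_bias:.4f}")
```

With the prior, each background anchor starts with a loss near 0.01 instead of about 0.69, so the tens of thousands of negatives no longer swamp the first updates.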
With ResNet-101 plus FPN as the backbone, RetinaNet achieved 39.1 COCO AP when trained with scale jitter, exceeding all then-published one-stage detectors and matching or surpassing two-stage detectors of the time. The single-scale ResNet-50 RetinaNet achieved 32.5 to 34.0 AP at 11 to 13 fps on a Titan X GPU. [1]
Focal loss is straightforward to implement and has been added to most deep learning frameworks and detection libraries.
Detectron and Detectron2 are the official reference implementations from FAIR. Detectron2 uses sigmoid_focal_loss_jit from fvcore.nn, which serves as a common PyTorch reference implementation. [6] MMDetection from OpenMMLab provides FocalLoss and SigmoidFocalLoss modules with both Python and CUDA implementations. [7]
PyTorch core does not ship focal loss in torch.nn; torchvision.ops.sigmoid_focal_loss provides a standalone implementation. [8] TensorFlow Addons provided tfa.losses.SigmoidFocalCrossEntropy, but the library reached end of life in 2024. [9] Kornia includes kornia.losses.FocalLoss and BinaryFocalLossWithLogits. [10]
A minimal binary focal loss in PyTorch:
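```python
import torch
import torch.nn.functional as F


def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # targets is a float tensor of 0s and 1s with the same shape as logits.
    p = torch.sigmoid(logits)
    # Per-element cross-entropy, -log(p_t), computed stably from the logits.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    loss = alpha_t * (1 - p_t) ** gamma * ce
    # Plain mean for simplicity; RetinaNet divides the sum by the foreground anchor count.
    return loss.mean()
```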
Production implementations use fused CUDA kernels or torch.jit for speed, but the math is identical.
The success of focal loss spawned a family of related losses.
Cui, Jia, Lin, Song, and Belongie introduced the Class-Balanced Loss in 2019. They proposed weighting each class inversely to its effective number of samples, (1 - beta^n) / (1 - beta), where n is the per-class sample count and beta is a hyperparameter close to 1. The class-balanced focal loss replaces alpha_t with this weight, showing gains over vanilla focal loss and inverse-frequency weighting on long-tailed CIFAR-10 and CIFAR-100, iNaturalist, and ImageNet. [11]
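A minimal sketch of the weighting, assuming the class weight is the inverse of the effective number and that the weights are rescaled to sum to the number of classes. The rescaling is a common convention adopted here as an assumption; the function name and the sample counts are illustrative.

```python
def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class weights proportional to 1 / effective number of samples."""
    effective = [(1.0 - beta ** n) / (1.0 - beta) for n in samples_per_class]
    weights = [1.0 / e for e in effective]
    # Rescale so the weights sum to the number of classes (assumed convention).
    scale = len(weights) / sum(weights)
    return [w * scale for w in weights]


# A head class with 5,000 samples and a tail class with 50.
print(class_balanced_weights([5000, 50], beta=0.999))

# beta = 0 makes every effective number 1, so there is no re-weighting.
print(class_balanced_weights([5000, 50], beta=0.0))
```

As beta approaches 1 the weights approach inverse class frequency, so beta interpolates between no re-weighting and full inverse-frequency weighting.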
Li et al. proposed Generalized Focal Loss (GFL) in 2020 to address two limits of focal loss: treating classification and localization quality as independent tasks, and being defined only for {0, 1} targets while quality labels can be continuous in [0, 1]. The paper introduced Quality Focal Loss (QFL), which extends focal loss to continuous targets by replacing (1 - p_t)^gamma with |y - sigma|^beta, and Distribution Focal Loss (DFL), which treats box coordinates as a discrete distribution over candidate values. The combination has been adopted in detectors including PP-PicoDet and YOLOv6. [12]
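Quality Focal Loss can be sketched for a single prediction as follows, assuming sigma is the sigmoid output strictly between 0 and 1 and y is a soft quality target in [0, 1]. This scalar version is for illustration only and does not cover Distribution Focal Loss.

```python
import math


def quality_focal_loss(sigma, y, beta=2.0):
    """Cross-entropy against a soft target y, scaled by |y - sigma|^beta."""
    bce = -((1.0 - y) * math.log(1.0 - sigma) + y * math.log(sigma))
    return abs(y - sigma) ** beta * bce


# With a hard target y = 1 this matches focal loss with gamma = beta.
print(quality_focal_loss(0.9, 1.0))

# The loss vanishes when the prediction equals the soft target.
print(quality_focal_loss(0.7, 0.7))

# A prediction far from the quality target is penalized much more.
print(quality_focal_loss(0.2, 0.7))
```

The modulating factor now measures distance from the continuous target rather than from 0 or 1, which is what lets the classification score double as a localization-quality estimate.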
Asymmetric focal loss generalizes the focusing parameter to two values, gamma_pos and gamma_neg. With gamma_neg > gamma_pos, easy negatives are down-weighted more aggressively than easy positives, improving recall on imbalanced multi-label classification. Ridnik et al. (2021) developed such a loss with shifted thresholds and probability margin terms, used in medical image classification and audio tagging. [13]
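The two-parameter idea can be sketched as below. The default values of gamma_pos and gamma_neg are illustrative, and the sketch deliberately omits the shifted-threshold and probability-margin terms mentioned above, so it is a simplified variant rather than the published loss.

```python
import math


def asymmetric_focal_loss(p, y, gamma_pos=1.0, gamma_neg=4.0):
    """Focal loss with separate focusing parameters per label.

    p is the predicted probability of the positive class, y is 0 or 1.
    """
    if y == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(p)
    return -(p ** gamma_neg) * math.log(1.0 - p)


easy_pos = asymmetric_focal_loss(0.9, 1)  # positive predicted at 0.9
easy_neg = asymmetric_focal_loss(0.1, 0)  # negative predicted at 0.1
print(f"easy positive: {easy_pos:.6f}, easy negative: {easy_neg:.6f}")
```

Both examples are equally well classified (p_t = 0.9), yet the negative's loss is smaller by a factor of 1,000 because its modulating factor is 0.1^4 instead of 0.1^1.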
PolyLoss, introduced by Leng et al. (2022), starts from the Taylor series expansion of cross-entropy around p_t = 1: -log(p_t) = (1 - p_t) + (1 - p_t)^2 / 2 + (1 - p_t)^3 / 3 + ..., and treats the coefficients of these polynomial terms as adjustable. Focal loss falls naturally into this family: multiplying by (1 - p_t)^gamma shifts the power of every term up by gamma. [14]
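The expansion can be checked numerically, along with the simplest member of the family, which perturbs only the coefficient of the first polynomial term (the Poly-1 form, cross-entropy plus epsilon * (1 - p_t)). The function names and the epsilon value below are illustrative.

```python
import math


def taylor_cross_entropy(p_t, num_terms):
    """Partial sum of the expansion of -log(p_t) around p_t = 1."""
    return sum((1.0 - p_t) ** j / j for j in range(1, num_terms + 1))


def poly1_loss(p_t, epsilon=1.0):
    """Cross-entropy with the leading polynomial coefficient raised by epsilon."""
    return -math.log(p_t) + epsilon * (1.0 - p_t)


p_t = 0.9
print(taylor_cross_entropy(p_t, 1))   # first term only
print(taylor_cross_entropy(p_t, 20))  # close to the exact value
print(-math.log(p_t))                 # exact cross-entropy
print(poly1_loss(p_t, epsilon=1.0))
```

The series converges quickly for well-classified examples and slowly as p_t approaches 0, which is why adjusting the leading coefficients mostly changes how confident predictions are treated.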
Reduced focal loss caps the modulating factor at low confidence to avoid amplifying noise from mislabeled examples. Tversky-focal hybrids combine focal modulation with the Tversky index for medical segmentation, where false positives and false negatives have asymmetric clinical costs. [15]
While focal loss was proposed for dense object detection, its utility for any imbalanced classification task became clear quickly.
Instance segmentation: Mask R-CNN variants use focal loss in the classification branch when paired with FPN backbones. Mask-aware variants apply the modulation per pixel for the mask head.
Semantic segmentation: Pixel-level segmentation often suffers from class imbalance, especially in medical imaging where lesion pixels can be a tiny fraction of the total. Focal loss is commonly combined with Dice or Tversky loss, with consistent improvements over cross-entropy on BraTS and LiTS. [16]
Long-tail image classification: On LVIS, iNaturalist, and ImageNet-LT, focal loss is a standard baseline alongside class-balanced loss, equalization loss, balanced softmax, and decoupled training. It is not always the top performer on extreme long-tail problems but is robust and needs no extra sampling infrastructure. [11]
Imbalanced tabular classification: In credit card fraud detection, where positives can be 0.1 percent or less, focal loss is used in deep tabular models as an alternative to SMOTE-style oversampling.
NLP and audio: Toxic comment detection, rare-intent classification, and sound event detection (DCASE challenges) all use focal loss in highly imbalanced settings, though benefits are smaller than in detection.
The table below compares focal loss with other strategies for handling class imbalance.
| Method | Mechanism | Typical setting |
|---|---|---|
| Standard cross-entropy | Equal weighting | Balanced classification |
| Weighted cross-entropy / alpha-balanced | Per-class weights | Mild imbalance |
| Hard negative mining | Keep top-k highest-loss negatives | SSD, early R-CNN |
| OHEM | Online selection of hard examples | Two-stage detectors |
| Two-stage proposal sampling | 1:3 fg:bg from RPN | Faster R-CNN, Mask R-CNN |
| Focal loss | Modulating factor in loss | Dense detection, imbalance |
| Class-balanced loss | Effective-number weighting | Long-tail classification |
| Equalization loss | Suppresses negative gradients to rare classes | LVIS, long-tail detection |
| Balanced softmax | Adjusts logits by class prior | Long-tail classification |
| GHM (gradient harmonizing) | Down-weights very easy and very hard | Noisy-label detection |
| Quality / Distribution focal loss | Continuous targets, joint quality | GFL, modern detectors |
Gradient Harmonizing Mechanism (GHM), proposed by Li, Liu, and Wang (2019), is related to focal loss but down-weights examples with both very low and very high difficulty, on the grounds that very-hard examples are often noisy or mislabeled. [17] For severely long-tail problems (LVIS, OpenImages), focal loss alone is usually outperformed by equalization loss, seesaw loss, and balanced softmax. Focal loss assumes difficulty alone identifies imbalance, while the long-tail problem also has a label-frequency component. Combining focal modulation with class-balanced weighting recovers some of the gap. [11]
Focal loss has become standard but is not without criticism.
Hyperparameter sensitivity. The choice of gamma and alpha interacts with model architecture, imbalance ratio, and optimization schedule. The gamma = 2, alpha = 0.25 combination is robust for COCO-scale detection but may need tuning elsewhere. Reports vary: some find gamma between 1 and 3 works across tasks, others find it sensitive at the percent level. [18]
Less effective when data is balanced. When the positive-to-negative ratio is balanced or only mildly skewed, focal loss often matches or slightly underperforms cross-entropy, so the extra hyperparameters add complexity without benefit. The original paper presents focal loss as a remedy for extreme imbalance, not as a universal upgrade.
Sensitivity to label noise. Because focal loss concentrates the loss on examples the model is currently misclassifying, a mislabeled example receives a disproportionately large share of the gradient. GHM was developed in part to address this. [17] In practice, cleaned datasets, label smoothing, or self-distillation mitigate the problem.
Not optimal for very long-tail problems. For hundreds or thousands of classes with extreme imbalance, focal loss alone is usually beaten by methods that explicitly model class frequency. It can still help as part of a combined approach.
Reduced role in DETR-style detectors. Focal loss was designed for the anchor-based dense prediction paradigm. Anchor-free dense detectors such as FCOS apply focal loss directly to per-pixel classifications, and it works well there. DETR instead uses Hungarian matching between predicted queries and ground-truth boxes, then applies cross-entropy or focal loss to the matched pairs, with unmatched queries assigned to a "no object" class. The imbalance problem in DETR is much milder because there are only around 100 queries, so the role of focal loss shifts from essential to helpful. Many DETR variants use focal loss for the classification head, but the gains over cross-entropy are smaller. [19]
Calibration. Models trained with focal loss are sometimes less well calibrated than cross-entropy models. Mukhoti et al. (2020) argue this is partly an artifact of evaluation and that focal-trained models can be better calibrated under careful analysis. The practical implication is that post-hoc calibration (temperature scaling, Platt scaling) may be needed for probabilistic use. [20]
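Temperature scaling, one of the post-hoc methods mentioned above, can be sketched for a binary classifier as a one-parameter search: divide the logits by a temperature T and pick the T that minimizes negative log-likelihood on held-out data. The grid of candidate temperatures and the toy validation sets are assumptions made for illustration; practical implementations usually optimize T with a gradient-based method.

```python
import math


def nll(logits, labels, temperature):
    """Mean negative log-likelihood after dividing logits by temperature."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z / temperature))
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(logits)


def fit_temperature(logits, labels, candidates=None):
    """Grid search for the temperature with the lowest validation NLL."""
    if candidates is None:
        candidates = [0.25 + 0.05 * i for i in range(96)]  # 0.25 to 5.0
    return min(candidates, key=lambda t: nll(logits, labels, t))


# Under-confident toy model: predicts about 0.60 but is right 90% of the time.
t_under = fit_temperature([0.4] * 10, [1] * 9 + [0])

# Over-confident toy model: predicts about 0.98 but is right 70% of the time.
t_over = fit_temperature([4.0] * 10, [1] * 7 + [0] * 3)

print(f"under-confident: T = {t_under:.2f}, over-confident: T = {t_over:.2f}")
```

A fitted temperature below 1 sharpens under-confident probabilities and a temperature above 1 softens over-confident ones; the ranking of predictions, and therefore accuracy, is unchanged.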
Focal loss has had substantial influence on detection research and on deep learning more broadly.
Award. The paper received the Best Student Paper Award at ICCV 2017, reflecting both the practical impact of RetinaNet and the elegance of the reformulation. [2]
Adoption in detection. Most one-stage detectors after 2017 use focal loss or a variant. FCOS, CenterNet, and ATSS adopted it for classification. YOLOv5 uses a focal-loss variant in some configurations; YOLOv7 and YOLOv8 use loss components with focal-style modulation. Quality Focal Loss is a default in PP-PicoDet, NanoDet, and YOLOv6. [12]
Segmentation. Focal loss is standard in nnU-Net and U-Net++ medical recipes, often combined with Dice loss. [16]
Citation count. As of early 2026, the paper has accumulated more than 30,000 citations on Google Scholar, placing it among the most-cited deep learning papers of the late 2010s. [3]
Conceptual influence. Reshaping the loss to focus on hard examples has been imported into contrastive learning (hard negative weighting in InfoNCE-style losses), knowledge distillation, and self-supervised learning. "Reweight rather than resample" is now a standard tool for imbalanced data.