Focal loss

Updated Jun 24, 2026

Focal loss is a loss function that reshapes standard cross-entropy loss by adding a (1 - p_t)^gamma modulating factor, which down-weights well-classified (easy) examples so that training concentrates on hard, misclassified examples. It was introduced by Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar at Facebook AI Research in the 2017 paper "Focal Loss for Dense Object Detection," which won the Best Student Paper Award at ICCV 2017. ^[1] ^[2] The loss was designed to fix the roughly 1:1000 foreground-background class imbalance in one-stage object detection, and it powered the RetinaNet detector, which reached 39.1 AP on COCO test-dev while running at 5 fps, surpassing all previously published one-stage and two-stage detectors of its time. ^[1] The recommended default settings are gamma = 2 and alpha = 0.25. ^[1]

The full alpha-balanced form is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the model's estimated probability for the ground-truth class, gamma >= 0 is the focusing parameter, and alpha_t is a class-balancing weight. When gamma = 0, focal loss reduces exactly to (alpha-balanced) cross-entropy. When gamma > 0, easy examples are down-weighted and the bulk of the gradient signal comes from hard examples. The authors describe the goal plainly: "We propose to reshape the loss function to down-weight easy examples and thus focus training on hard negatives." ^[1]

The motivation is specific to dense prediction. A one-stage detector like YOLO or SSD evaluates roughly 10^4 to 10^5 candidate locations per image, and the overwhelming majority correspond to background. Cross-entropy summed across so many easy negatives drowns out the gradient from the few foreground objects. Two-stage detectors such as Faster R-CNN sidestep this by sampling a balanced mini-batch from region proposals. Focal loss offered a simpler alternative that required no sampling heuristics, and it allowed RetinaNet to match the accuracy of two-stage detectors at one-stage speeds. ^[1]

Since its introduction, focal loss has been adopted across computer vision tasks beyond detection, including instance segmentation, semantic segmentation, and long-tailed learning on imbalanced datasets. It has also seeded follow-up losses including Quality and Distribution Focal Loss in Generalized Focal Loss, the Class-Balanced Loss of Cui et al. (2019), asymmetric focal variants, and PolyLoss. As of early 2026, the paper has been cited more than 30,000 times on Google Scholar. ^[3]

What problem does focal loss solve?

The class imbalance problem in object detection comes from how dense detectors enumerate candidate locations. A typical anchor-based one-stage detector tiles the input image with anchor boxes at multiple scales and aspect ratios, yielding on the order of 100,000 candidate locations per image, of which only a few dozen actually overlap with ground-truth objects. The paper frames it directly: "The Focal Loss is designed to address the one-stage object detection scenario in which there is an extreme imbalance between foreground and background classes during training (e.g., 1:1000)." ^[1]

The authors identify this imbalance as the root cause of the historical accuracy gap between one-stage and two-stage detectors: "We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause." ^[1]

With vanilla binary cross-entropy summed over all anchors, two issues emerge. First, the loss is dominated by negatives in raw magnitude: even if each individual easy negative has a small loss value, the sheer count means the cumulative loss from negatives is much larger than from positives. Second, the gradients are dominated by easy negatives in the same way, so the optimizer spends most of its updates pushing already-correct background predictions slightly more toward 0, rather than learning to detect the rare foreground objects.

Previous approaches addressed this through sampling and mining. Two-stage detectors handle imbalance via region proposals that prune the candidate set down to roughly 1,000 to 2,000 boxes per image, followed by sampling a fixed-size mini-batch (typically 256 in Faster R-CNN) with a controlled positive-to-negative ratio of 1:3. SSD used hard negative mining, sorting negatives by their loss and keeping the top-k highest-loss negatives so the negative-to-positive ratio is at most 3:1. ^[4] Online Hard Example Mining (OHEM), introduced by Shrivastava, Gupta, and Girshick in 2016, automated the selection of hard examples within mini-batches based on their current loss values. ^[5] All of these methods added pipeline complexity, hyperparameters, and sometimes extra forward passes.
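The SSD-style mining step described above can be sketched in a few lines. This is a minimal illustration, not SSD's actual code: the function name `mine_hard_negatives`, its arguments, and the toy loss values are all hypothetical.

```python
def mine_hard_negatives(losses, is_positive, neg_pos_ratio=3):
    """Return indices of all positives plus the highest-loss negatives.

    losses: per-anchor classification losses (list of floats)
    is_positive: per-anchor booleans, True for foreground anchors
    neg_pos_ratio: maximum negatives kept per positive (3 mirrors the 3:1 cap)
    """
    pos_idx = [i for i, pos in enumerate(is_positive) if pos]
    neg_idx = [i for i, pos in enumerate(is_positive) if not pos]
    # Sort negatives from highest to lowest loss and keep the top-k.
    neg_idx.sort(key=lambda i: losses[i], reverse=True)
    k = neg_pos_ratio * len(pos_idx)
    return pos_idx, neg_idx[:k]


# Toy example: one foreground anchor and seven background anchors.
losses = [2.1, 0.02, 0.9, 0.01, 0.5, 0.03, 1.4, 0.04]
is_positive = [True, False, False, False, False, False, False, False]
pos, neg = mine_hard_negatives(losses, is_positive)
print(pos, neg)  # [0] [6, 2, 4]
```

Only the selected anchors would contribute to the loss; every other negative is discarded, which is the sampling step that focal loss makes unnecessary.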

Focal loss instead changes the loss function so that no sampling is needed. Every anchor contributes to the loss, but easy examples contribute almost nothing. The key insight: rather than fix imbalance at the data level, fix it at the loss level. ^[1]

How is focal loss defined mathematically?

Focal loss starts from binary cross-entropy loss. For a binary problem with label y in {0, 1} and predicted probability p for class 1:

CE(p, y) = -log(p)        if y = 1
         = -log(1 - p)    if y = 0

Defining p_t = p if y = 1 else 1 - p, this becomes CE(p_t) = -log(p_t). ^[1]

Alpha-balanced cross-entropy

A first attempt at handling imbalance is to weight the loss differently for the two classes. Define alpha in [0, 1] for class 1 and 1 - alpha for class 0, and let alpha_t = alpha if y = 1 else 1 - alpha. The alpha-balanced cross-entropy is -alpha_t * log(p_t). This rebalances positive and negative classes but does not distinguish easy from hard examples. ^[1]

Focal loss

Focal loss adds a multiplicative modulating factor (1 - p_t)^gamma:

FL(p_t) = -(1 - p_t)^gamma * log(p_t)

where gamma >= 0 is the focusing parameter. The alpha-balanced form, used in the paper's main experiments, combines both:

FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)

When gamma = 0, focal loss reduces to alpha-balanced cross-entropy. As gamma increases, the modulating factor sharply down-weights well-classified examples (p_t close to 1) while leaving hard examples (p_t close to 0) almost unchanged. ^[1]
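The definitions above translate directly into a scalar function. The sketch below is plain Python for a single prediction, with hypothetical function names, and is meant only to make the formula concrete; it checks that gamma = 0 recovers alpha-balanced cross-entropy.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Alpha-balanced focal loss for one binary prediction.

    p: predicted probability of class 1, strictly between 0 and 1
    y: ground-truth label, 0 or 1
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def alpha_balanced_ce(p, y, alpha=0.25):
    """Alpha-balanced cross-entropy, the gamma = 0 special case."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * math.log(p_t)


# gamma = 0 recovers alpha-balanced cross-entropy.
assert math.isclose(focal_loss(0.8, 1, gamma=0.0), alpha_balanced_ce(0.8, 1))

easy = focal_loss(0.95, 1)  # well-classified positive
hard = focal_loss(0.10, 1)  # badly misclassified positive
print(f"easy positive: {easy:.6f}, hard positive: {hard:.6f}")
```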

The paper reports that gamma = 2 and alpha = 0.25 worked best for RetinaNet on COCO, stating: "we use gamma = 2.0 with alpha = .25 for all experiments" while noting that alpha = .5 works nearly as well (0.4 AP lower). ^[1] With gamma = 2, the optimal alpha shifts down from the standard imbalance-correcting value (around 0.75) to 0.25. The paper's explanation is that once the modulating factor has down-weighted the easy negatives, less emphasis needs to be placed on the positives, so the class weighting can be milder, not stronger. ^[1]

Multi-class extension

RetinaNet uses sigmoid activations rather than softmax, treating each class as an independent binary classifier. The image-level loss is summed over all anchors and divided by the number of anchors assigned to ground-truth boxes (not the total anchor count). Dividing by foreground count gives a stable gradient magnitude as easy negatives are down-weighted. ^[1]
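A rough sketch of this per-image reduction, in plain Python rather than tensor code, is shown below. The label encoding (class index for foreground anchors, -1 for background) and the guard for images with no foreground anchors are assumptions made for illustration, not details taken from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def image_focal_loss(logits, labels, gamma=2.0, alpha=0.25):
    """Sum of per-class sigmoid focal losses, divided by the foreground anchor count.

    logits: one list of K class logits per anchor
    labels: class index in [0, K) for foreground anchors, -1 for background
    """
    total = 0.0
    for anchor_logits, label in zip(logits, labels):
        for k, x in enumerate(anchor_logits):
            y = 1 if k == label else 0  # background anchors get all-zero targets
            p = sigmoid(x)
            p_t = p if y == 1 else 1.0 - p
            alpha_t = alpha if y == 1 else 1.0 - alpha
            total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    num_foreground = sum(1 for label in labels if label >= 0)
    # Assumed guard: avoid dividing by zero on images with no objects.
    return total / max(num_foreground, 1)


# Three anchors, two classes: one foreground anchor (class 0), two background.
logits = [[2.0, -3.0], [-4.0, -5.0], [-6.0, -4.5]]
labels = [0, -1, -1]
print(image_focal_loss(logits, labels))
```

Each of the K outputs is treated as its own binary problem, so a background anchor simply has a target of 0 for every class.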

How does the modulating factor behave?

The modulating factor (1 - p_t)^gamma down-weights the loss for well-classified examples and widens the range of p_t values that receive low loss. The table below shows this for gamma = 2.

p_t	CE = -log(p_t)	(1 - p_t)^2	FL (gamma = 2)	Ratio FL / CE
0.99	0.0101	0.0001	0.0000	0.0001
0.95	0.0513	0.0025	0.0001	0.0025
0.90	0.1054	0.0100	0.0011	0.0100
0.75	0.2877	0.0625	0.0180	0.0625
0.50	0.6931	0.2500	0.1733	0.2500
0.25	1.3863	0.5625	0.7798	0.5625
0.10	2.3026	0.8100	1.8651	0.8100

A confidently correct example with p_t = 0.9 has its loss reduced by a factor of 100. A truly confident example with p_t = 0.99 has its loss reduced by 10,000. By contrast, a hard example with p_t = 0.1 is barely down-weighted at all (factor 0.81). Examples right at the decision boundary (p_t = 0.5) are reduced by a factor of 4. ^[1]
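The table values can be recomputed with a short script. This is only a convenience sketch using the unweighted form of the loss (no alpha term), matching the table's columns.

```python
import math

gamma = 2.0
rows = []
for p_t in (0.99, 0.95, 0.90, 0.75, 0.50, 0.25, 0.10):
    ce = -math.log(p_t)              # cross-entropy
    factor = (1.0 - p_t) ** gamma    # modulating factor, also the FL / CE ratio
    rows.append((p_t, ce, factor, factor * ce))

for p_t, ce, factor, fl in rows:
    print(f"{p_t:.2f}  CE={ce:.4f}  factor={factor:.4f}  FL={fl:.4f}")
```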

No example is fully ignored, but the cumulative effect of summing across 100,000 anchors is that easy negatives contribute negligibly to the total. The paper supports this empirically by plotting the cumulative distribution of the loss over negative examples for a converged model: as gamma increases, more of the loss concentrates on the hard negatives, and at gamma = 2 the vast majority of it comes from a small fraction of samples. ^[1]
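The aggregate effect is easy to see with made-up numbers. The sketch below assumes a hypothetical image with 100,000 easy anchors at p_t = 0.99 and 100 hard anchors at p_t = 0.30; these counts and probabilities are illustrative, not measurements from the paper.

```python
import math


def ce(p_t):
    return -math.log(p_t)


def fl(p_t, gamma=2.0):
    return (1.0 - p_t) ** gamma * ce(p_t)


# Hypothetical image: many easy anchors, few hard ones.
n_easy, p_easy = 100_000, 0.99
n_hard, p_hard = 100, 0.30


def easy_share(loss_fn):
    """Fraction of the summed loss that comes from the easy anchors."""
    easy = n_easy * loss_fn(p_easy)
    hard = n_hard * loss_fn(p_hard)
    return easy / (easy + hard)


print(f"easy share under CE: {easy_share(ce):.3f}")
print(f"easy share under FL: {easy_share(fl):.3f}")
```

Under these assumptions the easy anchors account for roughly 89 percent of the summed cross-entropy but well under 1 percent of the summed focal loss.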

How do you choose gamma?

The paper sweeps gamma from 0 to 5 and reports COCO AP with RetinaNet using a ResNet-50 FPN backbone at 600-pixel scale. The reported numbers (Table 1b), each using the best alpha found for that gamma, were approximately: 31.1 AP at gamma = 0, 32.9 at gamma = 0.5, 33.7 at gamma = 1, 34.0 at gamma = 2, and 32.2 at gamma = 5. The accuracy peaks around gamma = 2, which became the de facto default. ^[1] Pushing gamma much higher also suppresses moderately hard examples that the model still needs to learn from, which is consistent with the drop at gamma = 5.
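One way to build intuition for the sweep is to print the modulating factor for a few p_t values at the gamma settings quoted above. The chosen p_t values are arbitrary illustrations.

```python
def modulating_factor(p_t, gamma):
    """Fraction of the cross-entropy loss that survives at a given gamma."""
    return (1.0 - p_t) ** gamma


gammas = (0.0, 0.5, 1.0, 2.0, 5.0)
for p_t in (0.9, 0.7, 0.3):
    factors = [modulating_factor(p_t, g) for g in gammas]
    print(p_t, [f"{f:.4f}" for f in factors])
```

At p_t = 0.7, for example, gamma = 2 keeps 9 percent of the cross-entropy loss while gamma = 5 keeps about 0.24 percent, so a large gamma nearly silences examples that are only moderately well classified.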

RetinaNet (where focal loss debuted)

RetinaNet is the one-stage object detector proposed in the same 2017 paper. It combines a ResNet backbone, a Feature Pyramid Network (FPN) neck, and two task-specific subnetworks for classification and box regression. ^[1]

The FPN constructs feature maps at five pyramid levels (P3 through P7) with strides of 8, 16, 32, 64, and 128 pixels. At each level, anchors with three scales and three aspect ratios are tiled across the feature map. With nine anchors per spatial location across five pyramid levels, an input image resized to a 600-pixel shorter side yields on the order of 100,000 anchors, the source of the extreme class imbalance. ^[1] The classification subnet is a small fully convolutional network with four 3-by-3 conv layers (256 channels, ReLU) and a final 3-by-3 conv predicting K * A channels, where K is the number of object classes and A = 9 is the number of anchors per location; sigmoid activations turn these outputs into per-class probabilities.
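The anchor count follows from the strides and the nine anchors per location. The sketch below assumes each level's feature map has ceil(size / stride) cells per side, which depends on padding choices in a real implementation; the function name is hypothetical.

```python
import math


def count_anchors(height, width, strides=(8, 16, 32, 64, 128), anchors_per_location=9):
    """Total anchors over all pyramid levels for one input image."""
    total = 0
    for stride in strides:
        # Assumption: each level's feature map is ceil(size / stride) per side.
        cells = math.ceil(height / stride) * math.ceil(width / stride)
        total += cells * anchors_per_location
    return total


print(count_anchors(600, 600))   # 67995
print(count_anchors(600, 1000))  # 113193
```

The exact figure depends on the aspect ratio: a square 600-pixel image gives about 68,000 anchors, while a 600-by-1000 image gives about 113,000, so "roughly 100,000" is an order-of-magnitude figure.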

A crucial implementation detail is the bias initialization of the final classification layer. Under a default initialization the model starts out assigning roughly equal probability to foreground and background, so the loss in the first iterations is dominated by the many background anchors and training can diverge. The fix is to initialize the bias so the prior probability of the foreground class is pi = 0.01 at the start of training, with bias b = -log((1 - pi) / pi); the authors note they "use pi = .01 in all experiments." ^[1] The paper reports that RetinaNet trained with plain cross-entropy diverges quickly without this prior, and that the initialization improves training stability for both cross-entropy and focal loss under heavy class imbalance. ^[1]
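The bias formula can be checked numerically: passing b through a sigmoid should give back the prior pi. The helper names below are hypothetical.

```python
import math


def prior_bias(pi=0.01):
    """Bias that makes sigmoid(bias) equal the foreground prior pi."""
    return -math.log((1.0 - pi) / pi)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


b = prior_bias(0.01)
print(f"bias = {b:.4f}, initial foreground probability = {sigmoid(b):.4f}")
# bias = -4.5951, initial foreground probability = 0.0100

# Per-anchor cross-entropy on a background anchor at initialization:
print(f"with prior: {-math.log(1.0 - sigmoid(b)):.4f}, with zero bias: {-math.log(0.5):.4f}")
```

With the prior in place, each background anchor starts at a loss of about 0.01 instead of about 0.69, so the initial total is no longer swamped by background. In PyTorch the value would typically be written into the final layer's bias with `torch.nn.init.constant_`.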

With ResNet-101 plus FPN as the backbone, RetinaNet achieved 39.1 COCO test-dev AP as a single model at an 800-pixel image scale, trained with scale jitter and a longer schedule, exceeding all then-published one-stage detectors and matching or surpassing two-stage detectors of the time. As the authors summarize, "our best model, based on a ResNet-101-FPN backbone, achieves a COCO test-dev AP of 39.1 while running at 5 fps, surpassing the previously best published single-model results from both one and two-stage detectors." ^[1] The smaller single-scale ResNet-50 variants trade accuracy for speed, with AP in the low to mid 30s at roughly 10 to 14 fps; the paper's timings were measured on an Nvidia M40 GPU. ^[1]

How is focal loss implemented in code?

Focal loss is straightforward to implement and has been added to most deep learning frameworks and detection libraries.

Detectron and Detectron2 are the official reference implementations from FAIR. Detectron2's sigmoid_focal_loss_jit in fvcore.nn is the canonical PyTorch implementation. ^[6] MMDetection from OpenMMLab provides FocalLoss and SigmoidFocalLoss modules with both Python and CUDA implementations. ^[7]

PyTorch core does not ship focal loss in torch.nn; torchvision.ops.sigmoid_focal_loss provides a standalone implementation. ^[8] TensorFlow Addons provided tfa.losses.SigmoidFocalCrossEntropy, but the library reached end of life in 2024. ^[9] Kornia includes kornia.losses.FocalLoss and BinaryFocalLossWithLogits. ^[10]

A minimal binary focal loss in PyTorch:

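import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Alpha-balanced binary focal loss; targets are floats in {0, 1} shaped like logits."""
    p = torch.sigmoid(logits)
    # Per-element cross-entropy, computed from logits for numerical stability.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # p_t is the probability assigned to the ground-truth class.
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    loss = alpha_t * (1 - p_t) ** gamma * ce
    # Mean over all elements; RetinaNet instead sums and divides by the foreground anchor count.
    return loss.mean()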

Production implementations use fused CUDA kernels or torch.jit for speed, but the math is identical.
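Working from logits rather than probabilities matters numerically, because log(p_t) blows up once a saturated sigmoid rounds to exactly 0 or 1. The scalar sketch below uses only the standard library and hypothetical helper names to show one common way of keeping the computation stable; it illustrates the idea and is not how any particular library implements it.

```python
import math


def softplus(z):
    """log(1 + exp(z)) computed without overflow for large |z|."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def stable_sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def focal_loss_from_logit(x, y, gamma=2.0, alpha=0.25):
    """x: raw logit; y: label 0 or 1."""
    z = x if y == 1 else -x            # p_t = sigmoid(z)
    ce = softplus(-z)                  # equals -log(p_t)
    one_minus_pt = stable_sigmoid(-z)  # equals 1 - p_t
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return alpha_t * one_minus_pt ** gamma * ce


# An extreme logit that would overflow a naive exp-based sigmoid.
print(focal_loss_from_logit(-800.0, 1))  # 200.0
```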

What variants and extensions exist?

The success of focal loss spawned a family of related losses.

Class-balanced focal loss

Cui, Jia, Lin, Song, and Belongie introduced the Class-Balanced Loss in 2019. They define the effective number of samples for a class as E_n = (1 - beta^n) / (1 - beta), where n is the per-class sample count and beta in [0, 1) is close to 1, and weight each class by the inverse, 1 / E_n. The class-balanced focal loss replaces alpha_t with this weight, showing gains over vanilla focal loss and inverse-frequency weighting on long-tailed CIFAR-10 and CIFAR-100, iNaturalist 2017 and 2018, and ImageNet. ^[11]
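A small sketch of the effective-number weighting follows. Each class weight is the inverse of its effective number; the class counts are hypothetical, and rescaling the weights to sum to the number of classes is an assumed normalization convention rather than part of the definition.

```python
def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class weights proportional to the inverse effective number of samples."""
    effective = [(1.0 - beta ** n) / (1.0 - beta) for n in samples_per_class]
    weights = [1.0 / e for e in effective]
    # Assumed convention: rescale so the weights sum to the number of classes.
    scale = len(weights) / sum(weights)
    return [w * scale for w in weights]


counts = [5000, 500, 50]  # hypothetical long-tailed class counts
weights = class_balanced_weights(counts)
print([f"{w:.3f}" for w in weights])
```

As beta approaches 1 the weights approach inverse class frequency, and at beta = 0 every class gets the same weight, so beta interpolates between no reweighting and full inverse-frequency reweighting.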

Generalized Focal Loss

Li et al. proposed Generalized Focal Loss (GFL) in 2020 to address two limits of focal loss: treating classification and localization quality as independent tasks, and being defined only for {0, 1} targets while quality labels can be continuous in [0, 1]. The paper introduced Quality Focal Loss (QFL), which extends focal loss to continuous targets by replacing (1 - p_t)^gamma with |y - sigma|^beta, and Distribution Focal Loss (DFL), which treats box coordinates as a discrete distribution over candidate values. The combination has been adopted in detectors including PP-PicoDet and YOLOv6. ^[12]
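Quality Focal Loss can be written for a single prediction as below. This is a sketch of the formula QFL(sigma) = -|y - sigma|^beta * ((1 - y) * log(1 - sigma) + y * log(sigma)); the function name is hypothetical and beta = 2 is used as an illustrative setting.

```python
import math


def quality_focal_loss(sigma, y, beta=2.0):
    """sigma: predicted score in (0, 1); y: soft quality target in [0, 1]."""
    # Cross-entropy against a soft target instead of a hard 0/1 label.
    bce = -((1.0 - y) * math.log(1.0 - sigma) + y * math.log(sigma))
    # The modulating factor is the distance between prediction and target.
    return abs(y - sigma) ** beta * bce


print(quality_focal_loss(0.2, 0.9))  # far from the target: large loss
print(quality_focal_loss(0.8, 0.9))  # close to the target: small loss
print(quality_focal_loss(0.7, 0.7))  # exactly on target: zero loss
```

With a hard target y = 1 the expression reduces to (1 - sigma)^beta * -log(sigma), which is the unweighted focal loss with gamma = beta.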

Asymmetric focal loss

Asymmetric focal loss generalizes the focusing parameter to two values, gamma_pos and gamma_neg. With gamma_neg > gamma_pos, easy negatives are down-weighted more aggressively than easy positives, improving recall on imbalanced multi-label classification. Ridnik et al. (2021) developed such a loss (ASL) and added a probability margin that shifts negative probabilities down, so that very easy negatives contribute no loss at all. It has been used in medical image classification and audio tagging. ^[13]
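A per-label sketch of an asymmetric loss in the style of Ridnik et al. is given below. The function name is hypothetical, and the parameter values (gamma_pos = 0, gamma_neg = 4, margin = 0.05) are illustrative settings rather than a recommendation for any particular dataset.

```python
import math


def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05):
    """p: predicted probability for one label; y: 1 if the label is present, else 0."""
    if y == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(p)
    # Shift negative probabilities down by the margin and clip at zero.
    p_m = max(p - margin, 0.0)
    if p_m == 0.0:
        return 0.0  # very easy negatives are discarded entirely
    return -(p_m ** gamma_neg) * math.log(1.0 - p_m)


print(asymmetric_loss(0.03, 0))  # 0.0: below the margin, so no loss
print(asymmetric_loss(0.90, 0))  # confident false positive: large loss
print(asymmetric_loss(0.30, 1))  # missed positive: plain -log(p) when gamma_pos = 0
```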

PolyLoss

PolyLoss, introduced by Leng et al. (2022), views classification losses as linear combinations of polynomial terms (1 - p_t)^j. Cross-entropy corresponds to the Taylor expansion of -log(p_t) around p_t = 1: -log(p_t) = (1 - p_t) + (1 - p_t)^2 / 2 + (1 - p_t)^3 / 3 + .... Focal loss fits the same framework: multiplying by (1 - p_t)^gamma shifts every power up by gamma. PolyLoss instead makes the leading coefficients adjustable; its simplest form, Poly-1, adds a single term epsilon * (1 - p_t) to cross-entropy. ^[14]
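The series view can be checked numerically. The sketch below sums the first few terms of the expansion of -log(p_t) and also defines the Poly-1 form, which adds one adjustable first-order term to cross-entropy; the function names and the value epsilon = 1.0 are illustrative assumptions.

```python
import math


def ce_series(p_t, terms):
    """Partial sum of -log(p_t) = sum over j >= 1 of (1 - p_t)^j / j."""
    return sum((1.0 - p_t) ** j / j for j in range(1, terms + 1))


def focal_series(p_t, gamma, terms):
    """Same series with every power shifted up by gamma."""
    return sum((1.0 - p_t) ** (j + gamma) / j for j in range(1, terms + 1))


def poly1_loss(p_t, epsilon=1.0):
    """Poly-1: cross-entropy plus an adjustable first-order term."""
    return -math.log(p_t) + epsilon * (1.0 - p_t)


p_t = 0.8
print(ce_series(p_t, 5), ce_series(p_t, 50), -math.log(p_t))
print(focal_series(p_t, 2.0, 50), (1.0 - p_t) ** 2 * -math.log(p_t))
print(poly1_loss(p_t))
```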

Other variants

Reduced focal loss caps the modulating factor at low confidence to avoid amplifying noise from mislabeled examples. Tversky-focal hybrids combine focal modulation with the Tversky index for medical segmentation, where false positives and false negatives have asymmetric clinical costs. ^[15]

What is focal loss used for beyond detection?

While focal loss was proposed for dense object detection, its utility for any imbalanced classification task became clear quickly.

Instance segmentation: Mask R-CNN variants use focal loss in the classification branch when paired with FPN backbones. Mask-aware variants apply the modulation per pixel for the mask head.

Semantic segmentation: Pixel-level segmentation often suffers from class imbalance, especially in medical imaging where lesion pixels can be a tiny fraction of the total. Focal loss is commonly combined with Dice or Tversky loss, with reported improvements over plain cross-entropy on benchmarks such as BraTS and LiTS. ^[16]

Long-tail image classification: On LVIS, iNaturalist, and ImageNet-LT, focal loss is a standard baseline alongside class-balanced loss, equalization loss, balanced softmax, and decoupled training. It is not always the top performer on extreme long-tail problems but is robust and needs no extra sampling infrastructure. ^[11]

Imbalanced tabular classification: In credit card fraud detection, where positives can be 0.1 percent or less, focal loss is used in deep tabular models as an alternative to SMOTE-style oversampling.

NLP and audio: Toxic comment detection, rare-intent classification, and sound event detection (DCASE challenges) all use focal loss in highly imbalanced settings, though benefits are smaller than in detection.

How does focal loss compare with other imbalance methods?

The table below compares focal loss with other strategies for handling class imbalance.

Method	Mechanism	Typical setting
Standard cross-entropy	Equal weighting	Balanced classification
Weighted cross-entropy / alpha-balanced	Per-class weights	Mild imbalance
Hard negative mining	Keep top-k highest-loss negatives	SSD, early R-CNN
OHEM	Online selection of hard examples	Two-stage detectors
Two-stage proposal sampling	1:3 fg:bg from RPN	Faster R-CNN, Mask R-CNN
Focal loss	Modulating factor in loss	Dense detection, imbalance
Class-balanced loss	Effective-number weighting	Long-tail classification
Equalization loss	Suppresses negative gradients to rare classes	LVIS, long-tail detection
Balanced softmax	Adjusts logits by class prior	Long-tail classification
GHM (gradient harmonizing)	Down-weights very easy and very hard	Noisy-label detection
Quality / Distribution focal loss	Continuous targets, joint quality	GFL, modern detectors

Gradient Harmonizing Mechanism (GHM), proposed by Li, Liu, and Wang (2019), is related to focal loss but down-weights examples with both very low and very high difficulty, on the grounds that very-hard examples are often noisy or mislabeled. ^[17] For severely long-tail problems (LVIS, OpenImages), focal loss alone is usually outperformed by equalization loss, seesaw loss, and balanced softmax. Focal loss assumes difficulty alone identifies imbalance, while the long-tail problem also has a label-frequency component. Combining focal modulation with class-balanced weighting recovers some of the gap. ^[11]

What are the limitations of focal loss?

Focal loss has become standard but is not without criticism.

Hyperparameter sensitivity. The choice of gamma and alpha interacts with model architecture, imbalance ratio, and optimization schedule. The gamma = 2, alpha = 0.25 combination is robust for COCO-scale detection but may need tuning elsewhere. Reports vary: some find gamma between 1 and 3 works across tasks, others find it sensitive at the percent level. ^[18]

Less effective when data is balanced. When the positive-to-negative ratio is balanced or only mildly skewed, focal loss often matches or slightly underperforms cross-entropy, and the extra hyperparameters add complexity without benefit. The original paper presents focal loss as a remedy for extreme imbalance, not as a universal upgrade.

Sensitivity to label noise. Because focal loss concentrates the loss on examples the model is currently misclassifying, a mislabeled example keeps a disproportionately large share of the gradient. GHM was developed in part to address this. ^[17] In practice, cleaned datasets, label smoothing, or self-distillation mitigate the problem.

Not optimal for very long-tail problems. For hundreds or thousands of classes with extreme imbalance, focal loss alone is usually beaten by methods that explicitly model class frequency. It can still help as part of a combined approach.

Reduced role in DETR-style detectors. Focal loss was designed for the anchor-based dense prediction paradigm. It carries over directly to anchor-free dense detectors: FCOS applies focal loss to per-pixel classifications and it works well. ^[21] DETR is different. It uses Hungarian matching between predicted queries and ground-truth boxes; matched queries are trained toward their object class and unmatched queries toward a "no object" class, which the original DETR handles with a down-weighted cross-entropy term. The imbalance problem in DETR is much milder because there are only around 100 queries, so the role of focal loss shifts from essential to helpful. Many DETR variants, Deformable DETR among them, use focal loss for the classification head, but the gains over cross-entropy are smaller. ^[19]

Calibration. Focal loss changes how confident a model's predictions are, so its output probabilities should not be assumed to be calibrated. Mukhoti et al. (2020) find that, for multi-class classification, training with focal loss tends to produce better-calibrated models than cross-entropy, and that it combines well with temperature scaling. The practical implication is that calibration should be measured, and post-hoc calibration (temperature scaling, Platt scaling) applied where probabilistic outputs matter. ^[20]

Influence

Focal loss has had substantial influence on detection research and on deep learning more broadly.

Award. The paper won the Best Student Paper Award at ICCV 2017, reflecting both the practical impact of RetinaNet and the elegance of the reformulation. ^[2]

Adoption in detection. Most one-stage detectors after 2017 use focal loss or a variant. FCOS, CenterNet, and ATSS adopted it for classification. YOLOv5 and YOLOv7 expose focal loss as an optional setting, and YOLOv8 uses Distribution Focal Loss for box regression. Quality Focal Loss is used in PP-PicoDet, NanoDet, and YOLOv6. ^[12]

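Segmentation. Focal loss and its Tversky hybrids are common in medical segmentation recipes built on U-Net-style architectures, often combined with Dice loss, although the default nnU-Net objective is Dice plus cross-entropy. ^[15] ^[16]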

Citation count. As of early 2026, the paper has accumulated more than 30,000 citations on Google Scholar, placing it among the most-cited deep learning papers of the late 2010s. ^[3]

Conceptual influence. Reshaping the loss to focus on hard examples has been imported into contrastive learning (hard negative weighting in InfoNCE-style losses), knowledge distillation, and self-supervised learning. "Reweight rather than resample" is now a standard tool for imbalanced data.

References

1. Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2017). "Focal Loss for Dense Object Detection." *ICCV 2017*. arXiv:1708.02002. https://arxiv.org/abs/1708.02002
2. ICCV 2017 Awards. "Best Student Paper Award." https://openaccess.thecvf.com/ICCV2017
3. Google Scholar citations for "Focal Loss for Dense Object Detection." https://scholar.google.com/scholar?q=focal+loss+for+dense+object+detection
4. Liu, W., Anguelov, D., Erhan, D., et al. (2016). "SSD: Single Shot MultiBox Detector." *ECCV 2016*. arXiv:1512.02325. https://arxiv.org/abs/1512.02325
5. Shrivastava, A., Gupta, A., & Girshick, R. (2016). "Training Region-Based Object Detectors with Online Hard Example Mining." *CVPR 2016*. arXiv:1604.03540. https://arxiv.org/abs/1604.03540
6. Detectron2 source, fvcore.nn.focal_loss. https://github.com/facebookresearch/fvcore/blob/main/fvcore/nn/focal_loss.py
7. MMDetection documentation, FocalLoss. https://mmdetection.readthedocs.io/en/latest/api.html
8. PyTorch torchvision.ops.sigmoid_focal_loss. https://pytorch.org/vision/stable/generated/torchvision.ops.sigmoid_focal_loss.html
9. TensorFlow Addons SigmoidFocalCrossEntropy (archived). https://www.tensorflow.org/addons/api_docs/python/tfa/losses/SigmoidFocalCrossEntropy
10. Kornia documentation, kornia.losses.FocalLoss. https://kornia.readthedocs.io/en/latest/losses.html
11. Cui, Y., Jia, M., Lin, T.-Y., Song, Y., & Belongie, S. (2019). "Class-Balanced Loss Based on Effective Number of Samples." *CVPR 2019*. arXiv:1901.05555. https://arxiv.org/abs/1901.05555
12. Li, X., Wang, W., Wu, L., et al. (2020). "Generalized Focal Loss." *NeurIPS 2020*. arXiv:2006.04388. https://arxiv.org/abs/2006.04388
13. Ridnik, T., Ben-Baruch, E., Zamir, N., et al. (2021). "Asymmetric Loss for Multi-Label Classification." *ICCV 2021*. arXiv:2009.14119. https://arxiv.org/abs/2009.14119
14. Leng, Z., Tan, M., Liu, C., et al. (2022). "PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions." *ICLR 2022*. arXiv:2204.12511. https://arxiv.org/abs/2204.12511
15. Abraham, N., & Khan, N. M. (2019). "A Novel Focal Tversky Loss Function with Improved Attention U-Net for Lesion Segmentation." *ISBI 2019*. arXiv:1810.07842. https://arxiv.org/abs/1810.07842
16. Isensee, F., Jaeger, P. F., Kohl, S. A. A., Petersen, J., & Maier-Hein, K. H. (2021). "nnU-Net: A Self-Configuring Method for Deep Learning-Based Biomedical Image Segmentation." *Nature Methods*. https://www.nature.com/articles/s41592-020-01008-z
17. Li, B., Liu, Y., & Wang, X. (2019). "Gradient Harmonized Single-Stage Detector." *AAAI 2019*. arXiv:1811.05181. https://arxiv.org/abs/1811.05181
18. Tan, J., Wang, C., Li, B., et al. (2020). "Equalization Loss for Long-Tailed Object Recognition." *CVPR 2020*. arXiv:2003.05176. https://arxiv.org/abs/2003.05176
19. Carion, N., Massa, F., Synnaeve, G., et al. (2020). "End-to-End Object Detection with Transformers." *ECCV 2020*. arXiv:2005.12872. https://arxiv.org/abs/2005.12872
20. Mukhoti, J., Kulharia, V., Sanyal, A., et al. (2020). "Calibrating Deep Neural Networks Using Focal Loss." *NeurIPS 2020*. arXiv:2002.09437. https://arxiv.org/abs/2002.09437
21. Tian, Z., Shen, C., Chen, H., & He, T. (2019). "FCOS: Fully Convolutional One-Stage Object Detection." *ICCV 2019*. arXiv:1904.01355. https://arxiv.org/abs/1904.01355
22. Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). "Feature Pyramid Networks for Object Detection." *CVPR 2017*. arXiv:1612.03144. https://arxiv.org/abs/1612.03144
23. He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep Residual Learning for Image Recognition." *CVPR 2016*. arXiv:1512.03385. https://arxiv.org/abs/1512.03385
24. Ren, S., He, K., Girshick, R., & Sun, J. (2015). "Faster R-CNN." *NeurIPS 2015*. arXiv:1506.01497. https://arxiv.org/abs/1506.01497
25. Lin, T.-Y., Maire, M., Belongie, S., et al. (2014). "Microsoft COCO: Common Objects in Context." *ECCV 2014*. arXiv:1405.0312. https://arxiv.org/abs/1405.0312
