# Focal loss

> Source: https://aiwiki.ai/wiki/focal_loss
> Updated: 2026-06-24
> Categories: Computer Vision, Deep Learning, Machine Learning, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Focal loss** is a [loss function](/wiki/loss_function) that reshapes standard [cross-entropy loss](/wiki/cross_entropy_loss) by adding a `(1 - p_t)^gamma` modulating factor, which down-weights well-classified (easy) examples so that training concentrates on hard, misclassified examples. It was introduced by [Tsung-Yi Lin](/wiki/tsung_yi_lin), [Priya Goyal](/wiki/priya_goyal), [Ross Girshick](/wiki/ross_girshick), [Kaiming He](/wiki/kaiming_he), and [Piotr Dollar](/wiki/piotr_dollar) at [Facebook AI Research](/wiki/meta_ai) in the 2017 paper "Focal Loss for Dense Object Detection," which won the Best Student Paper Award at [ICCV](/wiki/iccv) 2017. [1] [2] The loss was designed to fix the roughly 1:1000 foreground-background [class imbalance](/wiki/class_imbalance) in one-stage [object detection](/wiki/object_detection), and it powered the [RetinaNet](/wiki/retinanet) detector, which reached 39.1 AP on [COCO](/wiki/coco_dataset) test-dev while running at 5 fps, surpassing all previously published one-stage and two-stage detectors of its time. [1] The recommended default settings are `gamma = 2` and `alpha = 0.25`. [1]

The full alpha-balanced form is `FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)`, where `p_t` is the model's estimated probability for the ground-truth class, `gamma >= 0` is the focusing parameter, and `alpha_t` is a class-balancing weight. When `gamma = 0`, focal loss reduces exactly to (alpha-balanced) cross-entropy. When `gamma > 0`, easy examples are down-weighted and the bulk of the gradient signal comes from hard examples. The authors describe the goal plainly: "We propose to reshape the loss function to down-weight easy examples and thus focus training on hard negatives." [1]

The motivation is specific to dense prediction. A one-stage detector like [YOLO](/wiki/yolo) or [SSD](/wiki/ssd_object_detection) evaluates roughly 10^4 to 10^5 candidate locations per image, and the overwhelming majority correspond to background. Cross-entropy summed across so many easy negatives drowns out the gradient from the few foreground objects. Two-stage detectors such as [Faster R-CNN](/wiki/faster_r_cnn) sidestep this by sampling a balanced mini-batch from region proposals. Focal loss offered a simpler alternative that required no sampling heuristics, and it allowed RetinaNet to match the accuracy of two-stage detectors at one-stage speeds. [1]

Since its introduction, focal loss has been adopted across [computer vision](/wiki/computer_vision) tasks beyond detection, including [instance segmentation](/wiki/instance_segmentation), [semantic segmentation](/wiki/semantic_segmentation), and long-tail [imbalanced learning](/wiki/imbalanced_learning) on [imbalanced datasets](/wiki/imbalanced_dataset). It has also seeded follow-up losses including Quality and Distribution Focal Loss in [Generalized Focal Loss](/wiki/generalized_focal_loss), the [Class-Balanced Loss](/wiki/class_balanced_loss) of Cui et al. (2019), asymmetric focal variants, and [PolyLoss](/wiki/polyloss). As of early 2026, the paper has been cited more than 30,000 times on Google Scholar. [3]

## What problem does focal loss solve?

The class imbalance problem in object detection comes from how dense detectors enumerate candidate locations. A typical anchor-based one-stage detector tiles the input image with anchor boxes at multiple scales and aspect ratios, yielding on the order of 100,000 candidate locations per image, of which only a few dozen actually overlap with ground-truth objects. The paper frames it directly: "The Focal Loss is designed to address the one-stage object detection scenario in which there is an extreme imbalance between foreground and background classes during training (e.g., 1:1000)." [1]

The authors identify this imbalance as the root cause of the historical accuracy gap between one-stage and two-stage detectors: "We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause." [1]

With vanilla [binary cross-entropy](/wiki/binary_cross_entropy) summed over all anchors, two issues emerge. First, the loss is dominated by negatives in raw magnitude: even if each individual easy negative has a small loss value, the sheer count means the cumulative loss from negatives is much larger than from positives. Second, the gradients are dominated by easy negatives in the same way, so the optimizer spends most of its updates pushing already-correct background predictions slightly more toward 0, rather than learning to detect the rare foreground objects.

Previous approaches addressed this through sampling and mining. Two-stage detectors handle imbalance via region proposals that prune the candidate set down to roughly 1,000 to 2,000 boxes per image, followed by sampling a fixed-size mini-batch (typically 256 in Faster R-CNN) with a controlled positive-to-negative ratio of 1:3. SSD used hard negative mining, sorting negatives by their loss and keeping the top-k highest-loss negatives so the negative-to-positive ratio is at most 3:1. [4] [Online Hard Example Mining](/wiki/ohem) (OHEM), introduced by Shrivastava, Gupta, and Girshick in 2016, automated the selection of hard examples within mini-batches based on their current loss values. [5] All of these methods added pipeline complexity, hyperparameters, and sometimes extra forward passes.

Focal loss instead changes the loss function so that no sampling is needed. Every anchor contributes to the loss, but easy examples contribute almost nothing. The key insight: rather than fix imbalance at the data level, fix it at the loss level. [1]
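The effect on the summed loss can be illustrated with a small back-of-the-envelope calculation. The sketch below uses made-up numbers (100,000 easy background anchors at `p_t = 0.99` and 50 hard foreground anchors at `p_t = 0.30`); these counts and probabilities are illustrative assumptions, not figures from the paper, and alpha weighting is omitted for simplicity.

```python
import math

def ce(p_t):
    """Cross-entropy for one example, given the probability of its true class."""
    return -math.log(p_t)

def fl(p_t, gamma=2.0):
    """Focal loss without alpha weighting."""
    return (1.0 - p_t) ** gamma * ce(p_t)

# Hypothetical image: many easy background anchors, a few hard foreground anchors.
n_neg, p_t_neg = 100_000, 0.99   # background anchors already classified well
n_pos, p_t_pos = 50, 0.30        # foreground anchors still mostly missed

def negative_share(loss_fn):
    """Fraction of the summed loss that comes from the easy negatives."""
    neg = n_neg * loss_fn(p_t_neg)
    pos = n_pos * loss_fn(p_t_pos)
    return neg / (neg + pos)

ce_share = negative_share(ce)
fl_share = negative_share(fl)
print(f"negatives' share of total loss: CE {ce_share:.1%}, FL {fl_share:.1%}")
```

Under these assumed numbers, the easy negatives account for roughly 94 percent of the summed cross-entropy but well under 1 percent of the summed focal loss, even though no anchor was discarded.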

## How is focal loss defined mathematically?

Focal loss starts from binary [cross-entropy loss](/wiki/cross_entropy_loss). For a binary problem with label `y` in `{0, 1}` and predicted probability `p` for class 1:

```
CE(p, y) = -log(p)        if y = 1
         = -log(1 - p)    if y = 0
```

Defining `p_t = p` if `y = 1` else `1 - p`, this becomes `CE(p_t) = -log(p_t)`. [1]

### Alpha-balanced cross-entropy

A first attempt at handling imbalance is to weight the loss differently for the two classes. Define `alpha` in `[0, 1]` for class 1 and `1 - alpha` for class 0, and let `alpha_t = alpha` if `y = 1` else `1 - alpha`. The alpha-balanced cross-entropy is `-alpha_t * log(p_t)`. This rebalances positive and negative classes but does not distinguish easy from hard examples. [1]

### Focal loss

Focal loss adds a multiplicative modulating factor `(1 - p_t)^gamma`:

```
FL(p_t) = -(1 - p_t)^gamma * log(p_t)
```

where `gamma >= 0` is the focusing parameter. The alpha-balanced form, used in the paper's main experiments, combines both:

```
FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
```

When `gamma = 0`, focal loss reduces to alpha-balanced cross-entropy. As `gamma` increases, the modulating factor sharply down-weights well-classified examples (`p_t` close to 1) while leaving hard examples (`p_t` close to 0) almost unchanged. [1]
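As a minimal sketch of the definitions above, the following scalar version evaluates the alpha-balanced focal loss for a single binary example and checks that `gamma = 0` recovers alpha-balanced cross-entropy. The function names are hypothetical, and `p` is assumed to lie strictly between 0 and 1.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Alpha-balanced focal loss for one binary example.

    p: predicted probability of class 1 (0 < p < 1), y: label in {0, 1}.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def alpha_balanced_ce(p, y, alpha=0.25):
    """Alpha-balanced cross-entropy, i.e. focal loss without the modulating factor."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * math.log(p_t)

# gamma = 0 removes the modulating factor entirely.
assert math.isclose(focal_loss(0.7, 1, gamma=0.0), alpha_balanced_ce(0.7, 1))

# With gamma = 2, an easy positive (p = 0.9) is penalized far less than a hard one (p = 0.1).
print(focal_loss(0.9, 1), focal_loss(0.1, 1))
```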

The paper reports that `gamma = 2` and `alpha = 0.25` worked best for RetinaNet on COCO, stating: "we use gamma = 2.0 with alpha = .25 for all experiments" while noting that `alpha = .5` works nearly as well (0.4 AP lower). [1] With `gamma = 2`, the optimal `alpha` shifts *down* from the standard imbalance-correcting value (around 0.75) to 0.25. The intuition is that the modulating factor itself already shifts attention to positives, so the class weighting needs to be milder, not stronger. [1]

### Multi-class extension

RetinaNet uses sigmoid activations rather than softmax, treating each class as an independent binary classifier. The image-level loss is summed over all anchors and divided by the number of anchors assigned to ground-truth boxes (not the total anchor count). The reason is that most anchors are easy negatives whose focal loss is negligible, so dividing by the total anchor count would shrink the loss far below the scale set by the examples that actually contribute. [1]
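The normalization can be sketched in plain Python as follows. This is an illustrative toy, not RetinaNet's implementation: the nested-list layout, the function name `image_focal_loss`, and the `max(1, ...)` guard for images without foreground anchors are assumptions made for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def image_focal_loss(logits, targets, num_foreground, alpha=0.25, gamma=2.0):
    """Sum sigmoid focal loss over all anchors and classes, then divide by
    the number of foreground anchors.

    logits, targets: lists of per-anchor lists, one entry per class.
    """
    total = 0.0
    for anchor_logits, anchor_targets in zip(logits, targets):
        for x, y in zip(anchor_logits, anchor_targets):
            p = sigmoid(x)
            p_t = p if y == 1 else 1.0 - p
            alpha_t = alpha if y == 1 else 1.0 - alpha
            total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    # Guard against division by zero on images with no foreground anchors (assumption).
    return total / max(1, num_foreground)

# Three anchors, two classes; only the first anchor is foreground (class 0).
logits = [[2.0, -3.0], [-4.0, -4.0], [-5.0, -3.5]]
targets = [[1, 0], [0, 0], [0, 0]]
print(image_focal_loss(logits, targets, num_foreground=1))
```

Because each class is scored independently, a background anchor simply has an all-zero target row; appending more confidently rejected background anchors barely changes the result.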

## How does the modulating factor behave?

The modulating factor `(1 - p_t)^gamma` down-weights the loss for well-classified examples and widens the range of `p_t` values over which an example receives low loss. The table below shows this for `gamma = 2`.

| `p_t` | `CE = -log(p_t)` | `(1 - p_t)^2` | `FL` (`gamma = 2`) | Ratio FL / CE |
|---|---|---|---|---|
| 0.99 | 0.0101 | 0.0001 | 0.0000 | 0.0001 |
| 0.95 | 0.0513 | 0.0025 | 0.0001 | 0.0025 |
| 0.90 | 0.1054 | 0.0100 | 0.0011 | 0.0100 |
| 0.75 | 0.2877 | 0.0625 | 0.0180 | 0.0625 |
| 0.50 | 0.6931 | 0.2500 | 0.1733 | 0.2500 |
| 0.25 | 1.3863 | 0.5625 | 0.7798 | 0.5625 |
| 0.10 | 2.3026 | 0.8100 | 1.8651 | 0.8100 |

A confidently correct example with `p_t = 0.9` has its loss reduced by a factor of 100. A truly confident example with `p_t = 0.99` has its loss reduced by 10,000. By contrast, a hard example with `p_t = 0.1` is barely down-weighted at all (factor 0.81). Examples right at the decision boundary (`p_t = 0.5`) are reduced by a factor of 4. [1]
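The table values follow directly from the definition and can be recomputed with a few lines of Python. The snippet below is a small check, assuming the natural logarithm and no alpha weighting, as in the table.

```python
import math

gamma = 2.0
rows = []
for p_t in (0.99, 0.95, 0.90, 0.75, 0.50, 0.25, 0.10):
    ce = -math.log(p_t)              # cross-entropy
    factor = (1.0 - p_t) ** gamma    # modulating factor, also the ratio FL / CE
    rows.append((p_t, ce, factor, factor * ce))

for p_t, ce, factor, fl in rows:
    print(f"{p_t:.2f}  CE={ce:.4f}  factor={factor:.4f}  FL={fl:.4f}")
```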

No example is fully ignored, but the cumulative effect of summing across 100,000 anchors is that easy negatives contribute negligibly to the total. The paper supports this empirically with cumulative distribution plots of the loss over negative anchors: as `gamma` increases, the loss concentrates on the hardest negatives, and at `gamma = 2` the vast majority of the negative loss comes from a small fraction of samples. [1]

### How do you choose gamma?

The paper sweeps `gamma` from 0 to 5 and reports COCO AP with RetinaNet using a ResNet-50 FPN backbone at 600-pixel scale. The reported numbers (Table 1b), each at the best `alpha` found for that `gamma`, are 31.1 AP at `gamma = 0`, 32.9 at `gamma = 0.5`, 33.7 at `gamma = 1`, 34.0 at `gamma = 2`, and 32.2 at `gamma = 5`. The accuracy peaks around `gamma = 2`, which became the de facto default. [1] A likely explanation for the drop at high `gamma` is that the modulating factor then also suppresses moderately hard examples, leaving too little training signal.
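One way to see how `gamma` changes the training signal is to look at the gradient with respect to the logit. For a positive example with logit `x` and `p_t = sigmoid(x)`, applying the chain rule to the definition gives `dFL/dx = (1 - p_t)^gamma * (gamma * p_t * log(p_t) + p_t - 1)`, which reduces to the familiar `p_t - 1` of cross-entropy at `gamma = 0`. The sketch below (alpha omitted, function names hypothetical) checks this expression against a finite-difference estimate and prints gradient magnitudes for a hard, a borderline, and an easy example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal(x, gamma):
    """Focal loss (no alpha) for a positive example with logit x."""
    p_t = sigmoid(x)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def focal_grad(x, gamma):
    """Analytic derivative of the focal loss with respect to the logit x."""
    p_t = sigmoid(x)
    return (1.0 - p_t) ** gamma * (gamma * p_t * math.log(p_t) + p_t - 1.0)

def numeric_grad(x, gamma, eps=1e-6):
    """Central finite-difference estimate of the same derivative."""
    return (focal(x + eps, gamma) - focal(x - eps, gamma)) / (2.0 * eps)

for gamma in (0.0, 2.0, 5.0):
    # x = -2 is a hard example, x = 0 is on the boundary, x = 3 is an easy one.
    grads = [abs(focal_grad(x, gamma)) for x in (-2.0, 0.0, 3.0)]
    print(gamma, [f"{g:.4f}" for g in grads])
```

Larger `gamma` shrinks the gradient of the easy example by orders of magnitude, but it also reduces the gradient of the borderline example, which is consistent with accuracy falling off when `gamma` is pushed too high.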

## RetinaNet (where focal loss debuted)

[RetinaNet](/wiki/retinanet) is the one-stage object detector proposed in the same 2017 paper. It combines a [ResNet](/wiki/resnet) backbone, a [Feature Pyramid Network](/wiki/feature_pyramid_network) (FPN) neck, and two task-specific subnetworks for classification and box regression. [1]

The FPN constructs feature maps at five pyramid levels (P3 through P7) with strides of 8, 16, 32, 64, and 128 pixels. At each level, anchors with three scales and three aspect ratios are tiled across the feature map. With nine anchors per spatial location across five pyramid levels, an image resized to a 600-pixel shorter side yields on the order of 100,000 anchors, the source of the extreme class imbalance. [1] The classification subnet is a small fully-convolutional network with four 3-by-3 conv layers (256 channels, ReLU) and a final 3-by-3 conv predicting `K * A` channels followed by sigmoid activations, where `K` is the number of object classes and `A = 9` is the number of anchors per location.
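The anchor count can be estimated from the strides alone. The sketch below is a rough count under stated assumptions: each pyramid level is taken to have `ceil(size / stride)` cells per side (the exact value depends on padding in a real implementation), and the 600-by-1000 input is a hypothetical example of an image with a 600-pixel shorter side.

```python
import math

def count_anchors(height, width, strides=(8, 16, 32, 64, 128), anchors_per_location=9):
    """Count anchors over all pyramid levels for one image.

    Assumes each level's feature map has ceil(size / stride) cells per side.
    """
    total = 0
    for stride in strides:
        cells = math.ceil(height / stride) * math.ceil(width / stride)
        total += cells * anchors_per_location
    return total

# Hypothetical input with a 600-pixel shorter side.
print(count_anchors(600, 1000))
```

Most of the anchors come from the finest level (stride 8), since each coarser level has roughly a quarter as many cells as the one before it.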

A crucial implementation detail is the bias initialization of the final classification layer. With a default initialization, training is unstable: in the first iterations the model assigns roughly equal probability to foreground and background, the summed loss is dominated by the many background anchors, and training can diverge. The fix is to initialize the bias so that the prior probability of the foreground class is `pi = 0.01` at the start of training, with bias `b = -log((1 - pi) / pi)`; the authors note they "use pi = .01 in all experiments." [1] The paper emphasizes this trick as essential, not optional.
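The bias formula is just the inverse of the sigmoid evaluated at `pi`, which a short calculation confirms. The helper name `prior_bias` is hypothetical.

```python
import math

def prior_bias(pi=0.01):
    """Bias b such that sigmoid(b) equals the desired foreground prior pi."""
    return -math.log((1.0 - pi) / pi)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

b = prior_bias(0.01)
print(f"b = {b:.3f}, sigmoid(b) = {sigmoid(b):.4f}")
```

In a PyTorch model, a value computed this way would typically be written into the last classification layer with `torch.nn.init.constant_(layer.bias, b)`, where `layer` stands for whatever module holds that final convolution in a given codebase.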

With ResNet-101 plus FPN as the backbone, RetinaNet achieved 39.1 COCO test-dev AP as a single model trained with scale jitter, exceeding all then-published one-stage detectors and matching or surpassing two-stage detectors of the time. As the authors summarize, "our best model, based on a ResNet-101-FPN backbone, achieves a COCO test-dev AP of 39.1 while running at 5 fps, surpassing the previously best published single-model results from both one and two-stage detectors." [1] The single-scale ResNet-50 RetinaNet achieved 32.5 to 34.0 AP at 11 to 13 fps on a Titan X GPU. [1]

## How is focal loss implemented in code?

Focal loss is straightforward to implement and has been added to most deep learning frameworks and detection libraries.

**Detectron and [Detectron2](/wiki/detectron)** are the official reference implementations from FAIR. Detectron2's `sigmoid_focal_loss_jit` in `fvcore.nn` is the canonical PyTorch implementation. [6] **[MMDetection](/wiki/mmdetection)** from OpenMMLab provides `FocalLoss` and `SigmoidFocalLoss` modules with both Python and CUDA implementations. [7]

**PyTorch core** does not ship focal loss in `torch.nn`; `torchvision.ops.sigmoid_focal_loss` provides a standalone implementation. [8] **TensorFlow Addons** provided `tfa.losses.SigmoidFocalCrossEntropy`, but the Addons project itself was deprecated in 2024. [9] **Kornia** includes `kornia.losses.FocalLoss` and `BinaryFocalLossWithLogits`. [10]

A minimal binary focal loss in PyTorch:

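```
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Alpha-balanced sigmoid focal loss; `targets` is a float tensor of 0s and 1s."""
    p = torch.sigmoid(logits)
    # Per-element cross-entropy -log(p_t), computed stably from the logits.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # p_t is the probability assigned to the true class; alpha_t is its class weight.
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    loss = alpha_t * (1 - p_t) ** gamma * ce
    # Averages over all elements; RetinaNet instead sums and divides by the foreground anchor count.
    return loss.mean()
```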

Production implementations use fused CUDA kernels or `torch.jit` for speed, but the math is identical.

## What variants and extensions exist?

The success of focal loss spawned a family of related losses.

### Class-balanced focal loss

Cui, Jia, Lin, Song, and Belongie introduced the [Class-Balanced Loss](/wiki/class_balanced_loss) in 2019. They proposed weighting each class by the inverse of its *effective number of samples* `(1 - beta^n) / (1 - beta)`, where `n` is the per-class sample count and `beta` in `[0, 1)` is a hyperparameter, typically close to 1. The class-balanced focal loss replaces `alpha_t` with this effective-number weight, showing gains over vanilla focal loss and inverse-frequency weighting on long-tailed CIFAR-10 and CIFAR-100, iNaturalist, and ImageNet. [11]
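A minimal sketch of the effective-number weighting is shown below. The function name, the example class counts, and the choice to rescale the weights so they sum to the number of classes are assumptions made for illustration.

```python
def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class weights proportional to 1 / effective number of samples.

    The weights are rescaled to sum to the number of classes (a normalization
    choice assumed here so the overall loss scale stays comparable).
    """
    effective = [(1.0 - beta ** n) / (1.0 - beta) for n in samples_per_class]
    raw = [1.0 / e for e in effective]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

# Hypothetical long-tailed dataset with three classes.
weights = class_balanced_weights([5000, 500, 50])
print([round(w, 3) for w in weights])
```

With `beta = 0` every class has an effective number of 1 and the weights are uniform; as `beta` approaches 1 the weighting approaches inverse class frequency, so `beta` interpolates between the two.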

### Generalized Focal Loss

Li et al. proposed [Generalized Focal Loss](/wiki/generalized_focal_loss) (GFL) in 2020 to address two limits of focal loss: treating classification and localization quality as independent tasks, and being defined only for `{0, 1}` targets while quality labels can be continuous in `[0, 1]`. The paper introduced Quality Focal Loss (QFL), which extends focal loss to continuous targets by replacing `(1 - p_t)^gamma` with `|y - sigma|^beta`, and Distribution Focal Loss (DFL), which treats box coordinates as a discrete distribution over candidate values. The combination has been adopted in detectors including PP-PicoDet and YOLOv6. [12]
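Quality Focal Loss can be written for a single prediction as a soft-label binary cross-entropy scaled by `|y - sigma|^beta`. The scalar sketch below follows that description; the function name is hypothetical, `sigma` is assumed to lie strictly between 0 and 1, and `beta = 2` is used as an illustrative default.

```python
import math

def quality_focal_loss(sigma, y, beta=2.0):
    """QFL for one prediction sigma in (0, 1) and a soft target y in [0, 1]."""
    # Binary cross-entropy against a continuous target.
    bce = -((1.0 - y) * math.log(1.0 - sigma) + y * math.log(sigma))
    # Modulating factor: distance between the prediction and the quality label.
    return abs(y - sigma) ** beta * bce

print(quality_focal_loss(0.3, 1.0))   # hard target: same as focal loss with gamma = 2
print(quality_focal_loss(0.6, 0.6))   # prediction matches the soft target: zero loss
```

For a hard target `y = 1` the expression reduces to `(1 - sigma)^beta * -log(sigma)`, which is ordinary focal loss without alpha weighting.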

### Asymmetric focal loss

Asymmetric focal loss generalizes the focusing parameter to two values, `gamma_pos` and `gamma_neg`. With `gamma_neg > gamma_pos`, easy negatives are down-weighted more aggressively than easy positives, improving recall on imbalanced multi-label classification. Ridnik et al. (2021) developed such a loss and added a probability margin that shifts negative probabilities down, so that very easy negatives contribute exactly zero loss; it has been used in medical image classification and audio tagging. [13]
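The asymmetric idea can be sketched for a single label as follows. This is a simplified scalar illustration rather than the reference implementation: the function name and the default values of `gamma_pos`, `gamma_neg`, and `margin` are illustrative choices, and `p` is assumed to lie strictly between 0 and 1.

```python
import math

def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05):
    """Asymmetric focal-style loss for one label with predicted probability p."""
    if y == 1:
        # Positives: ordinary focal term with its own focusing parameter.
        return -((1.0 - p) ** gamma_pos) * math.log(p)
    # Negatives: shift the probability down by the margin and clip at zero,
    # so negatives with p <= margin contribute no loss at all.
    p_m = max(p - margin, 0.0)
    return -(p_m ** gamma_neg) * math.log(1.0 - p_m)

print(asymmetric_loss(0.03, 0))  # very easy negative: zero loss
print(asymmetric_loss(0.20, 0))  # moderately easy negative: strongly down-weighted
print(asymmetric_loss(0.90, 1))  # positive: plain cross-entropy when gamma_pos = 0
```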

### PolyLoss

[PolyLoss](/wiki/polyloss), introduced by Leng et al. (2022), views cross-entropy as a Taylor series expansion of `-log(p_t)` in powers of `(1 - p_t)`: `-log(p_t) = (1 - p_t) + (1 - p_t)^2 / 2 + (1 - p_t)^3 / 3 + ...`. PolyLoss makes the coefficients of these terms adjustable. Focal loss falls naturally into this family: multiplying the series by `(1 - p_t)^gamma` shifts every power up by `gamma`. The simplest variant, Poly-1, adjusts only the coefficient of the leading `(1 - p_t)` term. [14]
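The polynomial view is easy to check numerically. The sketch below sums the series for cross-entropy, shifts the powers by `gamma` to recover focal loss, and defines a Poly-1 style loss that adds `epsilon * (1 - p_t)` to cross-entropy. Function names and the value of `epsilon` are illustrative assumptions.

```python
import math

def ce_series(p_t, terms):
    """Partial sum of the expansion of -log(p_t) in powers of (1 - p_t)."""
    return sum((1.0 - p_t) ** j / j for j in range(1, terms + 1))

def focal_series(p_t, gamma, terms):
    """Same expansion with every power shifted up by gamma."""
    return sum((1.0 - p_t) ** (j + gamma) / j for j in range(1, terms + 1))

def poly1_loss(p_t, epsilon=1.0):
    """Cross-entropy with the coefficient of the first polynomial term raised by epsilon."""
    return -math.log(p_t) + epsilon * (1.0 - p_t)

p_t = 0.8
print(ce_series(p_t, 50), -math.log(p_t))                           # series vs closed form
print(focal_series(p_t, 2.0, 50), (1 - p_t) ** 2 * -math.log(p_t))  # shifted series vs focal loss
print(poly1_loss(p_t))
```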

### Other variants

Reduced focal loss caps the modulating factor at low confidence to avoid amplifying noise from mislabeled examples. Tversky-focal hybrids combine focal modulation with the Tversky index for medical segmentation, where false positives and false negatives have asymmetric clinical costs. [15]

## What is focal loss used for beyond detection?

While focal loss was proposed for dense object detection, its utility for any imbalanced classification task became clear quickly.

**[Instance segmentation](/wiki/instance_segmentation)**: [Mask R-CNN](/wiki/mask_r_cnn) variants use focal loss in the classification branch when paired with FPN backbones. Mask-aware variants apply the modulation per pixel for the mask head.

**[Semantic segmentation](/wiki/semantic_segmentation)**: Pixel-level segmentation often suffers from class imbalance, especially in medical imaging where lesion pixels can be a tiny fraction of the total. Focal loss is commonly combined with Dice or Tversky loss, and such combinations are frequently reported to improve over plain cross-entropy on benchmarks such as BraTS and LiTS. [16]

**Long-tail image classification**: On LVIS, iNaturalist, and ImageNet-LT, focal loss is a standard baseline alongside class-balanced loss, equalization loss, balanced softmax, and decoupled training. It is not always the top performer on extreme long-tail problems but is robust and needs no extra sampling infrastructure. [11]

**Imbalanced tabular classification**: In credit card fraud detection, where positives can be 0.1 percent or less, focal loss is used in deep tabular models as an alternative to SMOTE-style oversampling.

**NLP and audio**: Toxic comment detection, rare-intent classification, and sound event detection (DCASE challenges) all use focal loss in highly imbalanced settings, though benefits are smaller than in detection.

## How does focal loss compare with other imbalance methods?

The table below compares focal loss with other strategies for handling class imbalance.

| Method | Mechanism | Typical setting |
|---|---|---|
| Standard cross-entropy | Equal weighting | Balanced classification |
| Weighted cross-entropy / alpha-balanced | Per-class weights | Mild imbalance |
| [Hard negative mining](/wiki/hard_negative_mining) | Keep top-k highest-loss negatives | SSD, early R-CNN |
| [OHEM](/wiki/ohem) | Online selection of hard examples | Two-stage detectors |
| Two-stage proposal sampling | 1:3 fg:bg from RPN | Faster R-CNN, Mask R-CNN |
| Focal loss | Modulating factor in loss | Dense detection, imbalance |
| [Class-balanced loss](/wiki/class_balanced_loss) | Effective-number weighting | Long-tail classification |
| Equalization loss | Suppresses negative gradients to rare classes | LVIS, long-tail detection |
| Balanced softmax | Adjusts logits by class prior | Long-tail classification |
| GHM (gradient harmonizing) | Down-weights very easy and very hard | Noisy-label detection |
| Quality / Distribution focal loss | Continuous targets, joint quality | GFL, modern detectors |

Gradient Harmonizing Mechanism (GHM), proposed by Li, Liu, and Wang (2019), is related to focal loss but down-weights examples with both very low and very high difficulty, on the grounds that very-hard examples are often noisy or mislabeled. [17] For severely long-tail problems (LVIS, OpenImages), focal loss alone is usually outperformed by equalization loss, seesaw loss, and balanced softmax. Focal loss assumes difficulty alone identifies imbalance, while the long-tail problem also has a label-frequency component. Combining focal modulation with class-balanced weighting recovers some of the gap. [11]

## What are the limitations of focal loss?

Focal loss has become standard but is not without criticism.

**Hyperparameter sensitivity**. The choice of `gamma` and `alpha` interacts with model architecture, imbalance ratio, and optimization schedule. The `gamma = 2`, `alpha = 0.25` combination is robust for COCO-scale detection but may need tuning elsewhere. Reports vary: some find `gamma` between 1 and 3 works across tasks, others find it sensitive at the percent level. [18]

**Less effective when data is balanced**. When the positive-to-negative ratio is balanced or only mildly skewed, focal loss often matches or slightly underperforms cross-entropy, and the extra hyperparameters add complexity without benefit. The original paper presents focal loss as a tool for extreme imbalance, not a universal upgrade.

**Sensitivity to label noise**. Because focal loss concentrates the gradient on examples the model is currently misclassifying, a mislabeled example receives a disproportionately large share of it. GHM was developed in part to address this. [17] In practice, cleaned datasets, label smoothing, or self-distillation mitigate the problem.

**Not optimal for very long-tail problems**. For hundreds or thousands of classes with extreme imbalance, focal loss alone is usually beaten by methods that explicitly model class frequency. It can still help as part of a combined approach.

**Smaller role in DETR-style detectors**. Focal loss was designed for the anchor-based dense prediction paradigm, though it transfers well to anchor-free dense detectors: FCOS applies focal loss directly to per-pixel classifications. [21] [DETR](/wiki/detr) instead uses Hungarian matching between predicted queries and ground-truth boxes; the original model trains the classification head with cross-entropy and a down-weighted "no object" class for unmatched queries, while later variants such as Deformable DETR switch to focal loss. The imbalance problem in DETR is much milder because there are only around 100 queries, so the role of focal loss shifts from essential to helpful, and the gains over cross-entropy are smaller. [19]

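**Calibration**. Focal loss changes the confidence of the predicted probabilities, so they should not be assumed to be calibrated out of the box. Mukhoti et al. (2020) report that networks trained with focal loss are often *better* calibrated than cross-entropy models, which they attribute to focal loss implicitly regularizing the entropy of the predicted distribution, and that combining it with temperature scaling improves calibration further. In practice, post-hoc calibration (temperature scaling, Platt scaling) is still worth checking when the outputs are used as probabilities. [20]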

## Influence

Focal loss has had substantial influence on detection research and on deep learning more broadly.

**Award**. The paper won the Best Student Paper Award at ICCV 2017, reflecting both the practical impact of RetinaNet and the elegance of the reformulation. [2]

**Adoption in detection**. Most one-stage detectors after 2017 use focal loss or a variant. FCOS, CenterNet, and ATSS adopted it for classification. YOLOv5 uses a focal-loss variant in some configurations; YOLOv7 and YOLOv8 use loss components with focal-style modulation. Quality Focal Loss is a default in PP-PicoDet, NanoDet, and YOLOv6. [12]

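**Segmentation**. Focal loss is widely used in medical segmentation recipes built on U-Net-style architectures, usually combined with Dice loss. The default nnU-Net recipe itself pairs Dice loss with cross-entropy rather than focal loss. [16]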

**Citation count**. As of early 2026, the paper has accumulated more than 30,000 citations on Google Scholar, placing it among the most-cited deep learning papers of the late 2010s. [3]

**Conceptual influence**. Reshaping the loss to focus on hard examples has been imported into contrastive learning (hard negative weighting in InfoNCE-style losses), knowledge distillation, and self-supervised learning. "Reweight rather than resample" is now a standard tool for imbalanced data.

## See also

- [Cross-entropy loss](/wiki/cross_entropy_loss)
- [Binary cross-entropy](/wiki/binary_cross_entropy)
- [Loss function](/wiki/loss_function)
- [Class imbalance](/wiki/class_imbalance)
- [Imbalanced dataset](/wiki/imbalanced_dataset)
- [Object detection](/wiki/object_detection)
- [RetinaNet](/wiki/retinanet)
- [Feature Pyramid Network](/wiki/feature_pyramid_network)
- [COCO dataset](/wiki/coco_dataset)
- [Pascal VOC](/wiki/pascal_voc)
- [YOLO](/wiki/yolo)
- [SSD](/wiki/ssd_object_detection)
- [Faster R-CNN](/wiki/faster_r_cnn)
- [Mask R-CNN](/wiki/mask_r_cnn)
- [Detectron](/wiki/detectron)
- [MMDetection](/wiki/mmdetection)
- [Online Hard Example Mining](/wiki/ohem)
- [PolyLoss](/wiki/polyloss)
- [Generalized Focal Loss](/wiki/generalized_focal_loss)
- [Class-Balanced Loss](/wiki/class_balanced_loss)
- [DETR](/wiki/detr)
- [Imbalanced learning](/wiki/imbalanced_learning)
- [Meta AI](/wiki/meta_ai)

## References

[1] Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2017). "Focal Loss for Dense Object Detection." *ICCV 2017*. arXiv:1708.02002. https://arxiv.org/abs/1708.02002

[2] ICCV 2017 Awards. "Best Student Paper Award." https://openaccess.thecvf.com/ICCV2017

[3] Google Scholar citations for "Focal Loss for Dense Object Detection." https://scholar.google.com/scholar?q=focal+loss+for+dense+object+detection

[4] Liu, W., Anguelov, D., Erhan, D., et al. (2016). "SSD: Single Shot MultiBox Detector." *ECCV 2016*. arXiv:1512.02325. https://arxiv.org/abs/1512.02325

[5] Shrivastava, A., Gupta, A., & Girshick, R. (2016). "Training Region-Based Object Detectors with Online Hard Example Mining." *CVPR 2016*. arXiv:1604.03540. https://arxiv.org/abs/1604.03540

[6] Detectron2 source, fvcore.nn.focal_loss. https://github.com/facebookresearch/fvcore/blob/main/fvcore/nn/focal_loss.py

[7] MMDetection documentation, FocalLoss. https://mmdetection.readthedocs.io/en/latest/api.html

[8] PyTorch torchvision.ops.sigmoid_focal_loss. https://pytorch.org/vision/stable/generated/torchvision.ops.sigmoid_focal_loss.html

[9] TensorFlow Addons SigmoidFocalCrossEntropy (archived). https://www.tensorflow.org/addons/api_docs/python/tfa/losses/SigmoidFocalCrossEntropy

[10] Kornia documentation, kornia.losses.FocalLoss. https://kornia.readthedocs.io/en/latest/losses.html

[11] Cui, Y., Jia, M., Lin, T.-Y., Song, Y., & Belongie, S. (2019). "Class-Balanced Loss Based on Effective Number of Samples." *CVPR 2019*. arXiv:1901.05555. https://arxiv.org/abs/1901.05555

[12] Li, X., Wang, W., Wu, L., et al. (2020). "Generalized Focal Loss." *NeurIPS 2020*. arXiv:2006.04388. https://arxiv.org/abs/2006.04388

[13] Ridnik, T., Ben-Baruch, E., Zamir, N., et al. (2021). "Asymmetric Loss for Multi-Label Classification." *ICCV 2021*. arXiv:2009.14119. https://arxiv.org/abs/2009.14119

[14] Leng, Z., Tan, M., Liu, C., et al. (2022). "PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions." *ICLR 2022*. arXiv:2204.12511. https://arxiv.org/abs/2204.12511

[15] Abraham, N., & Khan, N. M. (2019). "A Novel Focal Tversky Loss Function with Improved Attention U-Net for Lesion Segmentation." *ISBI 2019*. arXiv:1810.07842. https://arxiv.org/abs/1810.07842

[16] Isensee, F., Jaeger, P. F., Kohl, S. A. A., Petersen, J., & Maier-Hein, K. H. (2021). "nnU-Net: A Self-Configuring Method for Deep Learning-Based Biomedical Image Segmentation." *Nature Methods*. https://www.nature.com/articles/s41592-020-01008-z

[17] Li, B., Liu, Y., & Wang, X. (2019). "Gradient Harmonized Single-Stage Detector." *AAAI 2019*. arXiv:1811.05181. https://arxiv.org/abs/1811.05181

[18] Tan, J., Wang, C., Li, B., et al. (2020). "Equalization Loss for Long-Tailed Object Recognition." *CVPR 2020*. arXiv:2003.05176. https://arxiv.org/abs/2003.05176

[19] Carion, N., Massa, F., Synnaeve, G., et al. (2020). "End-to-End Object Detection with Transformers." *ECCV 2020*. arXiv:2005.12872. https://arxiv.org/abs/2005.12872

[20] Mukhoti, J., Kulharia, V., Sanyal, A., et al. (2020). "Calibrating Deep Neural Networks Using Focal Loss." *NeurIPS 2020*. arXiv:2002.09437. https://arxiv.org/abs/2002.09437

[21] Tian, Z., Shen, C., Chen, H., & He, T. (2019). "FCOS: Fully Convolutional One-Stage Object Detection." *ICCV 2019*. arXiv:1904.01355. https://arxiv.org/abs/1904.01355

[22] Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). "Feature Pyramid Networks for Object Detection." *CVPR 2017*. arXiv:1612.03144. https://arxiv.org/abs/1612.03144

[23] He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep Residual Learning for Image Recognition." *CVPR 2016*. arXiv:1512.03385. https://arxiv.org/abs/1512.03385

[24] Ren, S., He, K., Girshick, R., & Sun, J. (2015). "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." *NeurIPS 2015*. arXiv:1506.01497. https://arxiv.org/abs/1506.01497

[25] Lin, T.-Y., Maire, M., Belongie, S., et al. (2014). "Microsoft COCO: Common Objects in Context." *ECCV 2014*. arXiv:1405.0312. https://arxiv.org/abs/1405.0312