Florence-2
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,170 words
Florence-2 is a vision foundation model developed by Microsoft Research that handles a wide range of computer vision and vision-language tasks through a single unified, prompt-based sequence-to-sequence interface. The model accepts a text prompt describing the desired task (such as captioning, object detection, optical character recognition, dense region captioning, region proposal, or phrase grounding) along with an image, and emits a structured text sequence that can include natural language and quantized location tokens representing bounding boxes, quadrilaterals, or polygons. Florence-2 was introduced in the November 2023 technical report "Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks" by Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan, and the paper was accepted to the 2024 Conference on Computer Vision and Pattern Recognition (CVPR).[^1][^2] Microsoft released open weights for four variants (Florence-2-base, Florence-2-large, Florence-2-base-ft, Florence-2-large-ft) on the Hugging Face Hub in mid-2024 under the MIT License, with the base variant at 232 million parameters and the large variant at 771 million parameters.[^3][^4]
| Field | Value |
|---|---|
| Developer | Microsoft Research |
| First arXiv release | 2023-11-10 (arXiv:2311.06242) |
| Conference venue | CVPR 2024 |
| Hugging Face release | 2024-06-16 |
| Architecture | DaViT vision encoder plus transformer encoder-decoder (BART-style) |
| Parameter counts | 232M (base), 771M (large) |
| Training dataset | FLD-5B (126M images, 5.4B annotations) |
| Open weights variants | Florence-2-base, Florence-2-large, Florence-2-base-ft, Florence-2-large-ft |
| License | MIT |
| Default precision | float16 |
The Florence project is a line of research at Microsoft aimed at producing a single vision model that can serve as a backbone for diverse downstream tasks, in contrast to the prevailing pattern of training one specialized model per task (one detector, one captioner, one OCR system, and so on). The original Florence model, released in 2021, focused on contrastive image-text pre-training and adaptation heads for classification, retrieval, and detection. Florence-2, published two years later, departs from that approach by reformulating every supported vision task as a sequence-to-sequence problem with text in and text out, so that one set of weights and one decoding procedure handles tasks that previously required heterogeneous architectures and losses.[^1]
The motivation given by the authors is that prior vision foundation models tended to specialize at one of three levels of spatial granularity, namely image-level understanding (such as classification and image-text retrieval), region-level recognition (such as object detection), and pixel-level prediction (such as segmentation). They argue that the lack of a single representation across these levels constrained the transfer of large-scale pretraining benefits to fine-grained tasks like detection and grounding. Florence-2 attempts to bridge the three levels by encoding spatial outputs as discrete location tokens that share the same vocabulary as the natural language decoder, so that a caption, a bounding box, a quadrilateral, or a polygon are all produced through the same next-token prediction step.[^1][^2]
A second contribution of the paper is a data engine that produces FLD-5B, a corpus of 5.4 billion annotations distributed across 126 million images. The dataset is generated by an iterative pipeline that runs an ensemble of specialist models over web-collected images, fuses their outputs, applies filtering and consistency checks, then retrains improved annotators on the cleaned labels. FLD-5B was not released publicly, but the trained Florence-2 weights and the description of the pipeline have made the work influential in subsequent open-source vision-language efforts.[^1][^2][^5]
Florence-2 is built from two main components: a hierarchical vision encoder that converts an input image into a sequence of visual tokens, and a transformer encoder-decoder that ingests visual tokens together with text prompt tokens and produces a target text sequence.[^1][^4]
The vision backbone is the Dual Attention Vision Transformer (DaViT), introduced at ECCV 2022 by Mingyu Ding and collaborators, including several authors who later worked on Florence-2. DaViT alternates two attention operations within each block: spatial-window self-attention, which attends across image patches inside a local window, and channel-group self-attention, which treats each channel as a token and attends among the channels within a group. The channel attention captures global interactions implicitly because each channel token spans every spatial position, while the spatial attention preserves fine-grained local structure. DaViT-Tiny, DaViT-Small, and DaViT-Base reach 82.8 percent, 84.2 percent, and 84.6 percent top-1 accuracy respectively on ImageNet-1K classification, and a scaled-up DaViT-Giant trained with 1.5 billion weakly supervised image-text pairs reaches 90.4 percent top-1.[^6]
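The two attention patterns can be illustrated with a small pure-Python sketch. It is a simplification, not the DaViT implementation: it assumes identity projections (Q = K = V), a single head, a 1-D row of patches instead of 2-D windows, and a scaling factor chosen for convenience. All function names are hypothetical.

```python
import math


def attention(tokens):
    """Single-head self-attention with identity projections (Q = K = V)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([sum(w * value[d] for w, value in zip(weights, tokens)) / total
                        for d in range(dim)])
    return outputs


def transpose(matrix):
    return [list(row) for row in zip(*matrix)]


def spatial_window_attention(patches, window):
    """Attend among patch tokens, but only inside each local window."""
    out = []
    for start in range(0, len(patches), window):
        out.extend(attention(patches[start:start + window]))
    return out


def channel_group_attention(patches, groups):
    """Treat each channel as a token of length N and attend within channel groups."""
    channels = transpose(patches)  # C tokens, each spanning all N positions
    if len(channels) % groups:
        raise ValueError("channel count must be divisible by groups")
    size = len(channels) // groups
    out = []
    for start in range(0, len(channels), size):
        out.extend(attention(channels[start:start + size]))
    return transpose(out)  # back to N patches x C channels


patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # N=4 patches, C=2 channels
edited = patches[:3] + [[2.0, 2.0]]  # change only the last patch
```

In this toy setting, editing the last patch leaves the spatial-window output for the first window untouched, while the channel-group output at the first position changes, which is the sense in which channel attention mixes information globally.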
Florence-2 inherits the hierarchical pyramid of DaViT and uses it to produce a flattened sequence of visual token embeddings of shape Nv x Dv, where Nv is the number of visual tokens after the final stage and Dv is the channel dimension. These embeddings are then projected and fed into the multimodal transformer alongside the text prompt embeddings. The base variant of Florence-2 uses a DaViT backbone configured to match the 232 million parameter budget of the overall model; the large variant uses a wider DaViT with channel dimension 1024, contributing the majority of the increase to 771 million parameters.[^1][^6]
The Florence-2 paper notes that the input image is resized to a fixed square resolution during pretraining, with most of the training samples processed at 384 by 384 pixels and a smaller fraction at 768 by 768 pixels during a high-resolution refinement stage. The hierarchical DaViT structure means that visual tokens at the deepest stage correspond to comparatively coarse image patches, which keeps the visual sequence short enough that the downstream multimodal transformer does not become a bottleneck for high-resolution inputs.[^1]
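As a rough illustration of how the visual sequence length scales with resolution, the sketch below counts final-stage tokens for a square input. The overall downsampling factor of 32 is an assumption (a 4x patch embedding followed by three 2x stages), not a figure given above, and the function name is hypothetical.

```python
def visual_token_count(image_size: int, total_stride: int = 32) -> int:
    """Number of final-stage tokens for a square image.

    total_stride=32 is an assumed overall downsampling factor.
    """
    if image_size % total_stride != 0:
        raise ValueError("image_size must be divisible by total_stride")
    side = image_size // total_stride
    return side * side


for size in (384, 768):
    print(size, visual_token_count(size))
```

Under that assumption, doubling the side length from 384 to 768 pixels quadruples the token count from 144 to 576, which is still a short sequence for the multimodal transformer.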
After the vision backbone, Florence-2 concatenates visual tokens with the embeddings of the textual task prompt and feeds the combined sequence into a transformer encoder-decoder modeled after the BART architecture introduced for sequence-to-sequence text generation. The encoder produces context-aware representations of both modalities, and the decoder autoregressively emits the target sequence one token at a time using cross-attention into the encoder output and causal self-attention over previously generated tokens. The model is trained with a standard cross-entropy language modeling loss, identical to the loss used for purely textual sequence-to-sequence systems, and no task-specific heads are added.[^1][^7]
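The training objective described above can be sketched as a per-token cross-entropy averaged over the target sequence. This is a minimal illustration with toy logits, assuming teacher forcing and no padding or label smoothing; the function name is hypothetical and nothing here is taken from the released training code.

```python
import math


def sequence_cross_entropy(logits, targets):
    """Mean negative log-likelihood of target token ids.

    logits: one list of vocabulary scores per target position.
    targets: the correct token id at each position.
    """
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # stabilise the log-sum-exp
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]
    return total / len(targets)


# Uniform logits over a 4-token vocabulary give a loss of ln(4).
uniform_loss = sequence_cross_entropy([[0.0] * 4, [0.0] * 4], [1, 3])
```

Because location tokens are ordinary vocabulary entries, a box coordinate contributes to this loss in exactly the same way as a word in a caption.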
The text tokenizer follows the BART vocabulary for natural language tokens, augmented with a special set of location tokens described below. Because every output, whether a caption, an object detection result, or a polygon, is a sequence of these tokens, Florence-2 can be invoked at inference time with the same generation API across tasks; only the leading prompt string differs.[^4][^7]
The key device that makes the sequence-to-sequence formulation cover region-level and pixel-level tasks is the discrete location token. Florence-2 quantizes the image coordinate space into 1000 bins and adds 1000 new tokens to the vocabulary, written <loc_0> through <loc_999>, each corresponding to a position along an axis after normalization to the image dimensions. Coordinates are emitted as ordered pairs or longer sequences of these tokens to represent different geometric primitives:[^7][^8]
- Box: (x0, y0, x1, y1), the top-left and bottom-right corners of an axis-aligned bounding box. This format is used for object detection, dense region captioning, region proposal, and phrase grounding.[^7]
- Quad box: (x0, y0, x1, y1, x2, y2, x3, y3), the four vertices of a quadrilateral, used to localize text regions in OCR with rotated or perspective-distorted glyphs.[^7]
- Polygon: a sequence of (x_i, y_i) location token pairs in clockwise order describing the vertices of a polygon, used for referring expression segmentation and other mask-like outputs.[^7]

Because all three formats share the same underlying 1000-bin vocabulary and the same decoder, the model can switch among them based purely on the task prompt and the data distribution it was trained on. The discrete representation also avoids the need to predict continuous coordinates with regression heads, simplifying the training objective.[^1][^7]
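A minimal sketch of the box encoding follows. It assumes floor binning when encoding and bin-centre decoding; the exact rounding rule used by Florence-2 is not stated here, so treat both choices, and all function names, as illustrative.

```python
NUM_BINS = 1000  # one bin per location token, <loc_0> through <loc_999>


def coord_to_bin(value: float, extent: float) -> int:
    """Map a pixel coordinate in [0, extent] to a bin index (assumed floor rule)."""
    index = int(value * NUM_BINS / extent)
    return min(max(index, 0), NUM_BINS - 1)


def bin_to_coord(index: int, extent: float) -> float:
    """Decode a bin index to the pixel position of the bin centre (assumed rule)."""
    return (index + 0.5) * extent / NUM_BINS


def box_to_tokens(box, width: float, height: float) -> str:
    """Encode (x0, y0, x1, y1) in pixels as four location tokens."""
    x0, y0, x1, y1 = box
    bins = (coord_to_bin(x0, width), coord_to_bin(y0, height),
            coord_to_bin(x1, width), coord_to_bin(y1, height))
    return "".join(f"<loc_{b}>" for b in bins)


tokens = box_to_tokens((64, 48, 320, 240), width=640, height=480)
print(tokens)  # <loc_100><loc_100><loc_500><loc_500>
```

Quad boxes and polygons would reuse the same per-coordinate mapping, only with eight or more tokens per shape instead of four.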
Florence-2 exposes a fixed set of task-specifier prompts during pretraining; each prompt is a special string that conditions the decoder to produce output in a particular format. The set documented in the official Hugging Face model cards includes the following:[^4]
| Prompt | Task | Output format |
|---|---|---|
| <CAPTION> | Brief image caption | Free text |
| <DETAILED_CAPTION> | Detailed image caption | Free text |
| <MORE_DETAILED_CAPTION> | Extended image description | Free text |
| <OD> | Object detection | Boxes and class labels |
| <DENSE_REGION_CAPTION> | Per-region captioning | Boxes and free text |
| <REGION_PROPOSAL> | Class-agnostic regions | Boxes only |
| <CAPTION_TO_PHRASE_GROUNDING> | Phrase grounding given a caption | Phrase plus boxes |
| <OCR> | Optical character recognition | Concatenated text |
| <OCR_WITH_REGION> | OCR with rotated boxes | Quad boxes and text |
Several additional prompts are used during fine-tuning, including <REFERRING_EXPRESSION_SEGMENTATION> for polygon outputs and <REGION_TO_SEGMENTATION> for converting a given box into a refined polygon mask. Florence-2 does not natively support general visual question answering through the pretrained prompts; the Hugging Face tutorial on DocVQA notes that VQA capabilities have to be introduced by fine-tuning the model with a new task prefix.[^5]
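To make the "text in, text out" interface concrete, the sketch below parses a decoded string of labels and location tokens back into pixel boxes. The raw string layout (a label followed by exactly four location tokens) is an assumption for illustration, and `parse_boxes` is a hypothetical helper; the released processor ships its own post-processing, which this does not reproduce.

```python
import re

LOC_RE = re.compile(r"<loc_(\d+)>")
# One labelled region: free text followed by exactly four location tokens.
REGION_RE = re.compile(r"([^<]+)((?:<loc_\d+>){4})")


def parse_boxes(text: str, width: float, height: float, num_bins: int = 1000):
    """Return a list of (label, (x0, y0, x1, y1)) with boxes in pixel units."""
    results = []
    for label, locs in REGION_RE.findall(text):
        x0, y0, x1, y1 = (int(b) for b in LOC_RE.findall(locs))
        box = (x0 * width / num_bins, y0 * height / num_bins,
               x1 * width / num_bins, y1 * height / num_bins)
        results.append((label.strip(), box))
    return results


sample = "car<loc_100><loc_200><loc_500><loc_800>person<loc_0><loc_0><loc_250><loc_999>"
regions = parse_boxes(sample, width=1000, height=500)
```

This simple pattern does not handle a phrase that is followed by several boxes, as can happen in phrase grounding; a fuller parser would group consecutive sets of four tokens under the preceding label.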
A central claim of the Florence-2 paper is that competitive vision foundation model behavior at the 232M-771M parameter scale comes primarily from the breadth and density of the training annotations rather than from architectural novelty. The authors built FLD-5B, a corpus of 5.4 billion annotations spanning 126 million images, to support this multi-task pretraining.[^1][^2]
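According to the paper and accompanying summaries, the total is partitioned into approximately 500 million text annotations, 1.3 billion region-text annotations, and 3.6 billion text-phrase-region annotations.[^1][^7]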
The annotations are produced by a data engine described in the paper as a three-stage iterative loop. The first stage runs an ensemble of pretrained specialist annotators over web-collected images, including caption generators, object detectors, OCR models, and region grounding models such as DETR variants. The second stage filters these synthetic labels using a combination of confidence thresholding, non-maximum suppression for boxes, and a text complexity filter based on dependency parsing implemented with the spaCy natural language toolkit. The third stage retrains improved annotators on the filtered labels, then reruns the pipeline so that successive iterations replace noisier labels with more accurate ones from the updated annotators.[^7][^5]
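The box-filtering part of the second stage can be sketched as confidence thresholding followed by greedy non-maximum suppression. The thresholds below are hypothetical placeholders, and greedy NMS is shown as the standard technique rather than as the paper's exact procedure.

```python
def iou(a, b) -> float:
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_boxes(detections, min_score=0.5, iou_threshold=0.5):
    """Drop low-confidence boxes, then suppress overlapping lower-scored ones.

    detections: list of (box, score). Both thresholds are illustrative values.
    """
    confident = [d for d in detections if d[1] >= min_score]
    confident.sort(key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in confident:
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


detections = [
    ((0, 0, 10, 10), 0.9),
    ((1, 1, 11, 11), 0.8),    # overlaps the first box heavily
    ((20, 20, 30, 30), 0.7),
    ((50, 50, 60, 60), 0.2),  # below the confidence threshold
]
kept = filter_boxes(detections)
```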
For the text-phrase-region annotations specifically, the pipeline extracts candidate noun phrases from each caption, queries a grounding model to localize each phrase, and then optionally uses a segmentation model, the Segment Anything Model (SAM), to convert boxes into masks where needed for downstream polygon supervision. The combination of caption generation, phrase grounding, and segmentation refinement gives FLD-5B a higher per-image annotation density than earlier large-scale vision datasets such as WIT or LAION.[^1][^5]
FLD-5B has not been released to the public; only the trained Florence-2 model weights and the textual description of the construction pipeline are publicly available. This contrasts with the SA-1B dataset, which Meta released alongside SAM with full image and mask data.[^4][^5]
Florence-2 is trained from scratch with a uniform cross-entropy loss across all task prompts using the AdamW optimizer. The paper specifies a maximum learning rate of 1e-4 for the base model and 1e-5 for the large model, a 5000-step linear warmup, and cosine decay thereafter. The base model uses a mini-batch size of 2048 examples and the large model uses 3072, both distributed across many GPUs.[^7]
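The schedule described above, linear warmup followed by cosine decay, can be sketched as a function of the optimizer step. The total step count and the decay floor of zero are assumptions made for illustration; only the peak rate and the 5000-step warmup come from the text.

```python
import math


def learning_rate(step: int, max_lr: float = 1e-4, warmup_steps: int = 5000,
                  total_steps: int = 1_000_000, min_lr: float = 0.0) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr.

    total_steps and min_lr are hypothetical values.
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 2500, 5000, 502_500, 1_000_000):
    print(step, learning_rate(step))
```

Halfway through warmup the rate is half the peak, it reaches the peak at step 5000, and it falls to the floor at the final step.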
Training proceeds in two resolution stages. The first stage processes approximately three billion effective samples at 384 by 384 input resolution, where one effective sample is one image-annotation pair drawn from FLD-5B. The second stage refines the model at 768 by 768 resolution for an additional 0.5 billion samples for the base model and 0.1 billion samples for the large model. Within each stage, batches mix prompts from every supported task in proportions roughly matching the annotation distribution of FLD-5B, so the model learns all tasks simultaneously rather than in a curriculum.[^7]
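For a sense of scale, the sample counts and batch sizes quoted above translate into optimizer steps as sketched below. The calculation assumes the batch size is unchanged in the high-resolution stage, which the text does not state, and the helper name is hypothetical.

```python
def optimizer_steps(samples: int, batch_size: int) -> int:
    """Number of optimizer steps needed to see `samples` examples (rounded up)."""
    return -(-samples // batch_size)


stages = {
    "base, 384px": (3_000_000_000, 2048),
    "large, 384px": (3_000_000_000, 3072),
    "base, 768px": (500_000_000, 2048),   # assumes the same batch size as stage one
    "large, 768px": (100_000_000, 3072),  # assumes the same batch size as stage one
}
for name, (samples, batch) in stages.items():
    print(f"{name}: {optimizer_steps(samples, batch):,} steps")
```

Under these assumptions the first stage is roughly 1.46 million steps for the base model and just under one million for the large model, so the 5000-step warmup is a small fraction of training.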
The *-ft variants released on Hugging Face are produced by additional supervised fine-tuning on a mixture of standard public benchmarks, including COCO captioning and detection, RefCOCO, RefCOCO+, RefCOCOg, and various VQA datasets. The Hugging Face fine-tuning tutorial recommends a very small learning rate of 1e-6 for downstream task adaptation, warning that higher rates lead to overfitting because the pretrained representations are already dense. The same tutorial reports that unfreezing the DaViT vision tower yields better performance than freezing it when sufficient GPU memory is available.[^5]
Microsoft released four variants on Hugging Face, two pretrained and two fine-tuned, all in float16 precision under the MIT License:[^3][^4]
| Variant | Parameters | Stage | Hugging Face slug |
|---|---|---|---|
| Florence-2-base | 232M | Pretrained on FLD-5B | microsoft/Florence-2-base |
| Florence-2-large | 771M | Pretrained on FLD-5B | microsoft/Florence-2-large |
| Florence-2-base-ft | 232M | FLD-5B plus downstream fine-tuning | microsoft/Florence-2-base-ft |
| Florence-2-large-ft | 771M | FLD-5B plus downstream fine-tuning | microsoft/Florence-2-large-ft |
The pretrained variants are intended primarily as starting points for further task-specific adaptation, while the *-ft variants are tuned for direct use on common benchmarks. Community redistribution of the weights has produced ONNX exports (under the onnx-community organization on the Hub) and various quantized formats suitable for execution under llama.cpp, Ollama, LM Studio, and similar runtimes, although the original weights remain the canonical reference.[^4]
The Florence-2 paper reports both zero-shot and fine-tuned numbers across a wide range of vision and vision-language benchmarks. Selected results from the paper and from the Hugging Face model cards include the following:[^1][^4]
| Benchmark | Metric | Florence-2-base | Florence-2-large |
|---|---|---|---|
| COCO captioning | CIDEr | 133.0 | 135.6 |
| NoCaps | CIDEr | 118.7 | 120.8 |
| TextCaps | CIDEr | 70.1 | 72.8 |
| COCO detection | mAP | 34.7 | 37.5 |
| Flickr30k phrase grounding | Recall@1 | 83.6 | 84.4 |
The authors emphasize two comparisons in the zero-shot setting. Florence-2-large achieves a higher COCO captioning CIDEr score (135.6) than DeepMind's 80-billion-parameter Flamingo model, despite having less than one percent of Flamingo's parameter count. It also outperforms Microsoft's 1.6-billion-parameter Kosmos-2 across the reported zero-shot benchmarks, including a 5.7-point gain in Flickr30k Recall@1 and approximate absolute gains of 4, 8, and 8 points on RefCOCO, RefCOCO+, and RefCOCOg respectively.[^1][^8]
After downstream fine-tuning on the relevant benchmark datasets, the large variant reaches additional headline numbers reported in the paper and on the Hugging Face model card:[^1][^4]
| Benchmark | Metric | Florence-2-large-ft |
|---|---|---|
| COCO captioning | CIDEr | 143.3 |
| COCO object detection | mAP | 43.4 |
| VQAv2 | Accuracy | 81.7 |
| TextVQA | Accuracy | 73.5 |
| RefCOCO | Accuracy | 93.4 |
TextVQA is notable because the paper reports 81.5 percent accuracy on the test split (the 73.5 figure above corresponds to the validation split reported on the model card) without using an external OCR system, indicating that the OCR capabilities induced by training on FLD-5B's textual annotations transfer to a downstream visual question answering setting.[^1][^8]
The paper also reports competitive results on referring expression segmentation, dense captioning, and ADE20K semantic segmentation when fine-tuned, although the model's segmentation outputs are polygon-based rather than per-pixel masks, which limits accuracy on benchmarks with very fine boundary structure.[^1]
Florence-2 occupies a distinctive position in the landscape of large vision-language systems. Most contemporaneous open vision-language models, including LLaVA and similar architectures, follow a pattern of plugging a frozen or lightly tuned vision encoder (often CLIP ViT) into a pretrained large language model and training only a projector and the language model on instruction-following image-text data. Such models excel at conversational image understanding and visual question answering but typically do not produce structured spatial outputs such as bounding boxes or polygons without specialized fine-tuning.[^9]
Florence-2 makes the opposite design choice: it trains a comparatively small encoder-decoder from scratch on a very large corpus of structured annotations, sacrificing the open-ended conversational behavior of LLM-based systems in exchange for direct support for dense detection, OCR, and grounding outputs. The two designs are complementary rather than competing; subsequent work has used Florence-2 as a fast spatial backbone whose detections feed into larger language models, or as a baseline against which to compare LLM-based grounding capabilities.[^4][^9]
Compared to closed vision-language models such as GPT-4V, Florence-2 is much smaller and slower-evolving but is freely redistributable under MIT and runs on consumer hardware. Roboflow's deployment notes report that Florence-2 produces inference results in approximately one second per image on an NVIDIA T4 GPU and in several seconds per image on CPU, which is fast enough for moderate-throughput document processing or batch annotation pipelines without specialized accelerators.[^9]
The following table summarizes the broad design contrast among open vision-language systems available in mid-2024:
| System | Approx. params | Architecture | Native spatial outputs | License |
|---|---|---|---|---|
| Florence-2-large | 771M | DaViT plus BART-style decoder | Yes (boxes, quads, polygons) | MIT |
| LLaVA-1.5 (13B) | 13B | ViT-L plus Vicuna LLM | No | Apache 2.0 / LLaMA |
| Kosmos-2 | 1.6B | Encoder plus decoder LLM | Boxes via inline location tokens | MIT |
| Flamingo (80B) | 80B | NFNet plus cross-attention into Chinchilla | No | Closed (not released) |
The presence of native spatial outputs in Florence-2 and Kosmos-2 reflects a shared design choice of expressing coordinates as quantized tokens in the language model vocabulary, an idea that traces back to Pix2Seq and earlier object-detection-as-language-modeling work.[^1][^8]
Because Florence-2 is small enough to deploy without specialized infrastructure and exposes a single API for many tasks, it has been adopted as a building block in several application domains. Reported uses include:
The CVPR 2024 paper also describes downstream evaluation in which Florence-2's pretrained vision tower is transferred to dense prediction tasks (semantic segmentation on ADE20K, object detection on COCO with Mask R-CNN-style heads) and used as a drop-in replacement for ImageNet-pretrained backbones, yielding competitive results despite the model's much smaller compute budget than backbones trained specifically for those benchmarks.[^1]
Several limitations of Florence-2 have been documented by the original authors and by subsequent users.
First, the model's spatial output format is intrinsically discrete and limited to 1000 quantization bins per axis. For very high-resolution images or applications that require sub-pixel localization, the quantization grid imposes an irreducible error of approximately one part in a thousand of the image dimension, which translates to roughly two pixels at 2048-pixel-wide inputs. The polygon segmentation output is similarly coarser than mask-based representations used by Mask R-CNN or SAM and may miss fine boundary detail on benchmarks like ADE20K.[^1][^7]
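The arithmetic behind the quantization limit is short enough to check directly. The sketch below computes the bin width for a few image widths; the worst-case figure assumes coordinates are decoded to bin centres, which is an illustrative convention rather than a documented one.

```python
def bin_width_px(extent_px: float, num_bins: int = 1000) -> float:
    """Width of one quantization bin, in pixels, along an axis of the given size."""
    return extent_px / num_bins


for width in (384, 768, 2048):
    step = bin_width_px(width)
    # With bin-centre decoding (assumed), the worst-case error is half a bin.
    print(f"{width}px wide: bin width {step:.3f}px, worst-case error {step / 2:.3f}px")
```

At 2048 pixels the bin width is 2.048 pixels, matching the "roughly two pixels" figure, while at the 768-pixel training resolution each bin is narrower than a pixel.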
Second, Florence-2 does not natively support open-ended visual question answering at the level achieved by larger multimodal LLMs. The pretrained prompts cover a fixed taxonomy of tasks, and the Hugging Face fine-tuning tutorial confirms that adding a VQA capability requires defining a new task prefix and fine-tuning on labeled examples, with a frozen-vision-encoder baseline scoring zero on DocVQA before fine-tuning and 57.0 Levenshtein similarity afterward.[^5]
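A Levenshtein-based similarity score of the kind mentioned above can be sketched as one minus the edit distance divided by the longer string's length. Whether the tutorial normalizes in exactly this way, or rescales to a 0 to 100 range, is an assumption; the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(prediction: str, reference: str) -> float:
    """Normalised similarity in [0, 1]; 1.0 means an exact match."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / longest


print(similarity("invoice 2024", "invoice 2023"))
```

A score like this gives partial credit for near-miss answers, which matters for document questions where a single misread character would otherwise count as a complete failure.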
Third, the FLD-5B dataset is not released, which limits reproducibility and the ability of the community to verify the exact contents on which the model was trained. The data engine pipeline is described at a high level in the paper but cannot be exactly reproduced without access to the same web image corpus and specialist annotators used by Microsoft. Independent reproductions of FLD-5B-style annotation pipelines have begun to appear in open-source projects, but no public replacement of comparable scale exists as of mid-2026.[^1][^4]
Fourth, like other models trained on web-scale annotation pipelines, Florence-2 inherits biases from its source data and annotators. The model card on Hugging Face notes that Microsoft did not perform extensive bias evaluation and recommends task-specific testing before deployment. Failures on long-tail categories, non-English text in OCR, and unusual image domains (medical, satellite, microscopy) have been reported by community users.[^4]
Fifth, the float16 precision of the released weights is sufficient for inference but may require care when fine-tuning. The recommended workflow involves either mixed-precision training with a float32 master copy of the weights or full-precision conversion before fine-tuning, particularly when unfreezing the DaViT vision tower.[^5]
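The reason a float32 master copy helps can be shown with plain rounding, using the half-precision and single-precision formats of Python's `struct` module. The weight, gradient, and step count below are hypothetical, and the loop is a toy stand-in for an optimizer, not a training recipe.

```python
import struct


def round_to(fmt: str, x: float) -> float:
    """Round a Python float to the precision of the given struct format."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def to_fp16(x: float) -> float:
    return round_to("<e", x)  # IEEE 754 half precision


def to_fp32(x: float) -> float:
    return round_to("<f", x)  # IEEE 754 single precision


lr, grad, steps = 1e-6, 0.5, 1000  # hypothetical values
fp16_only = to_fp16(1.0)  # weight stored and updated in float16
master = to_fp32(1.0)     # float32 master copy of the same weight
for _ in range(steps):
    fp16_only = to_fp16(fp16_only - lr * grad)
    master = to_fp32(master - lr * grad)

print(fp16_only, master, to_fp16(master))
```

Each individual update is far smaller than the spacing between float16 values near 1.0, so the float16-only weight never moves, whereas the float32 master accumulates the updates and its float16 cast eventually changes.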
Florence-2 has had measurable influence on vision foundation model design in three respects. First, it provides a credible demonstration that compact encoder-decoder models trained on dense multi-task supervision can rival or exceed much larger LLM-based vision-language models on structured benchmarks, validating the approach of investing in data engineering rather than parameter count alone. Second, its uniform sequence-to-sequence interface for captioning, detection, OCR, and grounding has been imitated by subsequent unified vision systems and has reinforced the trend of treating coordinates as language tokens. Third, its release under a permissive MIT license with weights freely downloadable on Hugging Face made high-quality spatial vision capabilities available to a much broader community of researchers and developers than the closed Florence-1 era.[^1][^4][^9]
The model is now widely used as a benchmark for compact open vision-language systems. Comparisons in subsequent papers and product launches frequently cite Florence-2 as the reference point for what a sub-billion-parameter unified vision model can achieve on COCO captioning, COCO detection, and TextVQA without external OCR.[^9]
Florence-2 sits at the intersection of several research lines.