Florence-2

Updated Jul 11, 2026

Florence-2 is a vision foundation model developed by Microsoft Research that handles a wide range of computer vision and vision-language tasks through a single unified, prompt-based sequence-to-sequence interface. The model accepts a text prompt describing the desired task (such as captioning, object detection, optical character recognition, dense region captioning, region proposal, or phrase grounding) along with an image, and emits a structured text sequence that can include natural language and quantized location tokens representing bounding boxes, quadrilaterals, or polygons. Florence-2 was introduced in the November 2023 technical report "Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks" by Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan, and the paper was accepted to the 2024 Conference on Computer Vision and Pattern Recognition (CVPR).^[1]^[2] Microsoft released open weights for four variants (Florence-2-base, Florence-2-large, Florence-2-base-ft, Florence-2-large-ft) on the Hugging Face Hub in mid-2024 under the MIT License, with the base variant at 232 million parameters and the large variant at 771 million parameters.^[3]^[4]

The paper's authors describe Florence-2 as "a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks," and trained it on FLD-5B, a purpose-built corpus of 5.4 billion visual annotations across 126 million images.^[1] Despite its compact size, Florence-2 competes with far larger systems: in the zero-shot setting the 771-million-parameter Florence-2-large scores 135.6 CIDEr on COCO captioning, higher than DeepMind's 80-billion-parameter Flamingo, and the paper reports the fine-tuned model reaching an accuracy of 81.5 on TextVQA "without any external OCR token input."^[1]^[8] It has become one of the most widely used open vision backbones: as of July 2026 the Hugging Face model cards recorded roughly 2.66 million downloads of Florence-2-base and over 813,000 of Florence-2-large in the preceding month.^[4]^[8]

Infobox

Field	Value
Developer	Microsoft Research
First arXiv release	2023-11-10 (arXiv:2311.06242)
Conference venue	CVPR 2024
Hugging Face release	2024-06-16
Architecture	DaViT vision encoder plus transformer encoder-decoder (BART-style)
Parameter counts	232M (base), 771M (large)
Training dataset	FLD-5B (126M images, 5.4B annotations)
Open weights variants	Florence-2-base, Florence-2-large, Florence-2-base-ft, Florence-2-large-ft
License	MIT
Default precision	float16
Transformers integration	Native (Florence2ForConditionalGeneration)

What problem does Florence-2 solve?

The Florence project is a line of research at Microsoft aimed at producing a single vision model that can serve as a backbone for diverse downstream tasks, in contrast to the prevailing pattern of training one specialized model per task (one detector, one captioner, one OCR system, and so on). The original Florence model, released in 2021, focused on contrastive image-text pre-training and adaptation heads for classification, retrieval, and detection. Florence-2, published two years later, departs from that approach by reformulating every supported vision task as a sequence-to-sequence problem with text in and text out, so that one set of weights and one decoding procedure handles tasks that previously required heterogeneous architectures and losses.^[1]

The motivation given by the authors is that prior vision foundation models tended to specialize at one of three levels of spatial granularity, namely image-level understanding (such as classification and image-text retrieval), region-level recognition (such as object detection), and pixel-level prediction (such as segmentation). They argue that the lack of a single representation across these levels constrained the transfer of large-scale pretraining benefits to fine-grained tasks like detection and grounding. Florence-2 attempts to bridge the three levels by encoding spatial outputs as discrete location tokens that share the same vocabulary as the natural language decoder, so that a caption, a bounding box, a quadrilateral, or a polygon are all produced through the same next-token prediction step.^[1]^[2]

A second contribution of the paper is a data engine that produces FLD-5B, a corpus of 5.4 billion annotations distributed across 126 million images. The dataset is generated by an iterative pipeline that runs an ensemble of specialist models over web-collected images, fuses their outputs, applies filtering and consistency checks, then retrains improved annotators on the cleaned labels. FLD-5B was not released publicly, but the trained Florence-2 weights and the description of the pipeline have made the work influential in subsequent open-source vision-language efforts.^[1]^[2]^[5]

How does Florence-2 work?

Florence-2 is built from two main components: a hierarchical vision encoder that converts an input image into a sequence of visual tokens, and a transformer encoder-decoder that ingests visual tokens together with text prompt tokens and produces a target text sequence.^[1]^[4]

Vision encoder (DaViT)

The vision backbone is the Dual Attention Vision Transformer (DaViT), originally introduced at ECCV 2022 by Mingyu Ding and collaborators including several authors who would later work on Florence-2. DaViT alternates two attention operations within each block: spatial-window self-attention, which attends across image patches inside a local window, and channel-group self-attention, which attends across channel groups while pooling spatially. The channel attention captures global interactions implicitly because each channel token aggregates information from every spatial position, while the spatial attention preserves fine-grained local structure. DaViT-Tiny, DaViT-Small, and DaViT-Base reach 82.8 percent, 84.2 percent, and 84.6 percent top-1 accuracy respectively on ImageNet-1K classification, and a scaled-up DaViT-Giant trained with 1.5 billion weakly supervised image-text pairs reaches 90.4 percent top-1.^[6]

Florence-2 inherits the hierarchical pyramid of DaViT and uses it to produce a flattened sequence of visual token embeddings of shape Nv x Dv, where Nv is the number of visual tokens after the final stage and Dv is the channel dimension. These embeddings are then projected and fed into the multimodal transformer alongside the text prompt embeddings. The base variant of Florence-2 uses a DaViT backbone configured to match the 232 million parameter budget of the overall model; the large variant uses a wider DaViT with channel dimension 1024, contributing the majority of the increase to 771 million parameters.^[1]^[6]

The Florence-2 paper notes that the input image is resized to a fixed square resolution during pretraining, with most of the training samples processed at 384 by 384 pixels and a smaller fraction at 768 by 768 pixels during a high-resolution refinement stage. The hierarchical DaViT structure means that visual tokens at the deepest stage correspond to comparatively coarse image patches, which keeps the visual sequence short enough that the downstream multimodal transformer does not become a bottleneck for high-resolution inputs.^[1]

Multimodal encoder-decoder

After the vision backbone, Florence-2 concatenates visual tokens with the embeddings of the textual task prompt and feeds the combined sequence into a transformer encoder-decoder modeled after the BART architecture introduced for sequence-to-sequence text generation. The encoder produces context-aware representations of both modalities, and the decoder autoregressively emits the target sequence one token at a time using cross-attention into the encoder output and causal self-attention over previously generated tokens. The model is trained with a standard cross-entropy language modeling loss, identical to the loss used for purely textual sequence-to-sequence systems, and no task-specific heads are added.^[1]^[7]
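The training objective can be illustrated with a toy calculation. The sketch below computes a teacher-forced, token-level cross-entropy over a made-up vocabulary that mixes a word with location-style tokens. The vocabulary, logits, and function names are hypothetical and are not taken from the Florence-2 code base; the point is only that text and coordinates fall under one loss.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sequence_cross_entropy(step_logits, target_ids):
    """Mean negative log-likelihood of the target tokens (teacher forcing)."""
    if len(step_logits) != len(target_ids):
        raise ValueError("one logit vector is needed per target token")
    nll = [-log_softmax(logits)[t] for logits, t in zip(step_logits, target_ids)]
    return sum(nll) / len(nll)

# Hypothetical toy vocabulary: a word, two location tokens, and end-of-sequence.
VOCAB = ["car", "<loc_10>", "<loc_20>", "</s>"]
target = [VOCAB.index(t) for t in ["car", "<loc_10>", "<loc_20>", "</s>"]]

# Uniform logits: every token is equally likely, so the loss equals ln(|V|).
uniform = [[0.0] * len(VOCAB) for _ in target]
print(sequence_cross_entropy(uniform, target))

# Logits that favor the correct token at every step give a lower loss.
confident = [[5.0 if i == t else 0.0 for i in range(len(VOCAB))] for t in target]
print(sequence_cross_entropy(confident, target))
```

With uniform logits the loss is the natural logarithm of the vocabulary size, which is a handy sanity check for any implementation of this objective.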

The text tokenizer follows the BART vocabulary for natural language tokens, augmented with a special set of location tokens described below. Because every output, whether a caption, an object detection result, or a polygon, is a sequence of these tokens, Florence-2 can be invoked at inference time with the same generation API across tasks; only the leading prompt string differs.^[4]^[7]

Location tokens

The key device that makes the sequence-to-sequence formulation cover region-level and pixel-level tasks is the discrete location token. Florence-2 quantizes the image coordinate space into 1000 bins and adds 1000 new tokens to the vocabulary, written <loc_0> through <loc_999>, each corresponding to a position along an axis after normalization to the image dimensions. Coordinates are emitted as ordered pairs or longer sequences of these tokens to represent different geometric primitives:^[7]^[8]

Box representation uses four location tokens (x0, y0, x1, y1) corresponding to the top-left and bottom-right corners of an axis-aligned bounding box. This format is used for object detection, dense region captioning, region proposal, and phrase grounding.^[7]
Quad box representation uses eight location tokens (x0, y0, x1, y1, x2, y2, x3, y3) describing the four vertices of a quadrilateral, used to localize text regions in OCR with rotated or perspective-distorted glyphs.^[7]
Polygon representation uses an arbitrary-length sequence of (x_i, y_i) location token pairs in clockwise order to describe the vertices of a polygon, used for referring expression segmentation and other mask-like outputs.^[7]

Because all three formats share the same underlying 1000-bin vocabulary and the same decoder, the model can switch among them based purely on the task prompt and the data distribution it was trained on. The discrete representation also avoids the need to predict continuous coordinates with regression heads, simplifying the training objective.^[1]^[7]
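The box format can be sketched in a few lines of Python. The example maps pixel coordinates to one of 1000 bins per axis and back. The rounding convention used here (floor when quantizing, bin center when decoding) is an assumption for illustration, and the function names are hypothetical rather than part of any released API.

```python
NUM_BINS = 1000  # the article states 1000 location bins per axis

def quantize(coord, size, bins=NUM_BINS):
    """Map a pixel coordinate in [0, size] to a bin index in [0, bins - 1]."""
    idx = int(coord * bins / size)  # int() floors for non-negative input
    return min(max(idx, 0), bins - 1)

def dequantize(idx, size, bins=NUM_BINS):
    """Map a bin index back to the pixel coordinate at the bin center."""
    return (idx + 0.5) * size / bins

def box_to_tokens(box, width, height):
    """Encode (x0, y0, x1, y1) in pixels as four location tokens."""
    x0, y0, x1, y1 = box
    ids = [
        quantize(x0, width),
        quantize(y0, height),
        quantize(x1, width),
        quantize(y1, height),
    ]
    return "".join(f"<loc_{i}>" for i in ids)

print(box_to_tokens((64, 48, 320, 240), width=640, height=480))
# <loc_100><loc_100><loc_500><loc_500>
```

Because coordinates are normalized by the image size before binning, the same token sequence describes the same relative box at any input resolution.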

What tasks can Florence-2 perform?

Florence-2 is pretrained with a fixed set of task-specifier prompts; each prompt is a special string that conditions the decoder to produce output in a particular format. The set documented in the official Hugging Face model cards includes the following:^[4]

Prompt	Task	Output format
`<CAPTION>`	Brief image caption	Free text
`<DETAILED_CAPTION>`	Detailed image caption	Free text
`<MORE_DETAILED_CAPTION>`	Extended image description	Free text
`<OD>`	Object detection	Boxes and class labels
`<DENSE_REGION_CAPTION>`	Per-region captioning	Boxes and free text
`<REGION_PROPOSAL>`	Class-agnostic regions	Boxes only
`<CAPTION_TO_PHRASE_GROUNDING>`	Phrase grounding given a caption	Phrase plus boxes
`<OCR>`	Optical character recognition	Concatenated text
`<OCR_WITH_REGION>`	OCR with rotated boxes	Quad boxes and text

Several additional prompts are used during fine-tuning, including `<REFERRING_EXPRESSION_SEGMENTATION>` for polygon outputs and `<REGION_TO_SEGMENTATION>` for converting a given box into a refined polygon mask. Florence-2 does not natively support general visual question answering through the pretrained prompts; the Hugging Face tutorial on DocVQA notes that VQA capabilities have to be introduced by fine-tuning the model with a new task prefix.^[5]

The Hugging Face Transformers implementation additionally documents region-analysis prompts such as `<OPEN_VOCABULARY_DETECTION>`, `<REGION_TO_CATEGORY>`, `<REGION_TO_DESCRIPTION>`, and `<REGION_TO_OCR>`, which return an open-vocabulary detection, a category label, a description, or a transcription for a supplied region.^[10]
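For box-producing prompts, the decoded output interleaves labels with location tokens. The sketch below parses such a string into pixel-space boxes using only the standard library. The raw string layout, the special tokens stripped, and the bin-center decoding are assumptions made for illustration; the released processor ships its own post-processing, which should be preferred in practice.

```python
import re

SPECIAL_TOKENS = ("<s>", "</s>", "<pad>")
# A text label followed by one or more location tokens.
LABELED_LOCS = re.compile(r"([^<]+)((?:<loc_\d+>)+)")

def parse_boxes(text, width, height, bins=1000):
    """Return (label, (x0, y0, x1, y1)) pairs in pixels from a decoded string."""
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    results = []
    for label, locs in LABELED_LOCS.findall(text):
        ids = [int(n) for n in re.findall(r"<loc_(\d+)>", locs)]
        # Consecutive groups of four tokens are boxes that share the same label.
        for i in range(0, len(ids) - len(ids) % 4, 4):
            x0, y0, x1, y1 = ids[i:i + 4]
            box = (
                (x0 + 0.5) * width / bins,
                (y0 + 0.5) * height / bins,
                (x1 + 0.5) * width / bins,
                (y1 + 0.5) * height / bins,
            )
            results.append((label.strip(), box))
    return results

# Hypothetical decoded output for an object-detection prompt.
raw = "</s><s>car<loc_100><loc_200><loc_500><loc_600>wheel<loc_150><loc_450><loc_250><loc_600></s>"
for label, box in parse_boxes(raw, width=1000, height=500):
    print(label, box)
```

Quad boxes and polygons could be handled the same way by grouping the location tokens in eights or in pairs instead of fours.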

What is the FLD-5B dataset?

A central claim of the Florence-2 paper is that competitive vision foundation model behavior at the 232M-771M parameter scale comes primarily from the breadth and density of the training annotations rather than from architectural novelty. The authors built FLD-5B, a corpus of 5.4 billion annotations spanning 126 million images, to support this multi-task pretraining.^[1]^[2]

The total annotation count is partitioned as follows, according to the paper and accompanying summaries:^[1]^[7]

approximately 500 million text annotations, split into 235 million brief captions, 126 million detailed captions, and 126 million more-detailed captions;
approximately 1.3 billion region-text annotations, each pairing a bounding box with an associated short textual label or description;
approximately 3.6 billion text-phrase-region annotations, where a noun phrase extracted from a caption is grounded to a specific image region, supporting tasks such as phrase grounding and dense region captioning.
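A quick arithmetic check shows how the three categories add up to the headline figure and what they imply per image. The numbers are the rounded values quoted above, so the derived ratios are approximate.

```python
IMAGES = 126_000_000

# Rounded annotation counts per category, as quoted in the text.
annotations = {
    "text": 500_000_000,
    "region-text": 1_300_000_000,
    "text-phrase-region": 3_600_000_000,
}

total = sum(annotations.values())
print(f"total annotations: {total / 1e9:.1f}B")         # 5.4B
print(f"annotations per image: {total / IMAGES:.1f}")  # 42.9

for name, count in annotations.items():
    print(f"{name}: {count / total:.1%} of all annotations")
```

Roughly two thirds of the corpus is text-phrase-region supervision, which is consistent with the emphasis on grounding in the pretraining mix.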

The annotations are produced by a data engine described in the paper as a three-stage iterative loop. The first stage runs an ensemble of pretrained specialist annotators over web-collected images, including caption generators, object detectors, OCR models, and region grounding models such as DETR variants. The second stage filters these synthetic labels using a combination of confidence thresholding, non-maximum suppression for boxes, and a text complexity filter based on dependency parsing implemented with the spaCy natural language toolkit. The third stage retrains improved annotators on the filtered labels, then reruns the pipeline so that successive iterations replace noisier labels with more accurate ones from the updated annotators.^[7]^[5]
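The box-filtering step of the second stage can be sketched as confidence thresholding followed by greedy non-maximum suppression. The thresholds, data layout, and function names below are hypothetical; the paper's exact settings are not given in this article.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def filter_boxes(detections, min_score=0.5, max_iou=0.5):
    """Keep confident boxes, dropping any that overlap a higher-scoring kept box.

    detections: list of (box, score) pairs. Both thresholds are illustrative.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if score < min_score:
            break  # sorted by score, so everything after this is also too low
        if all(iou(box, kept_box) <= max_iou for kept_box, _ in kept):
            kept.append((box, score))
    return kept

detections = [
    ((0, 0, 10, 10), 0.9),
    ((1, 1, 11, 11), 0.8),    # overlaps the first box heavily, suppressed
    ((20, 20, 30, 30), 0.7),
    ((40, 40, 50, 50), 0.2),  # below the confidence threshold
]
print(filter_boxes(detections))
```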

For the text-phrase-region annotations specifically, the pipeline extracts candidate noun phrases from each caption, queries a grounding model to localize each phrase, and then optionally uses a segmentation model, the Segment Anything Model (SAM), to convert boxes into masks where needed for downstream polygon supervision. The combination of caption generation, phrase grounding, and segmentation refinement gives FLD-5B a higher per-image annotation density than earlier large-scale vision datasets such as WIT or LAION.^[1]^[5]

FLD-5B has not been released to the public; only the trained Florence-2 model weights and the textual description of the construction pipeline are publicly available. This contrasts with the SA-1B dataset, which Meta released alongside SAM with full image and mask data.^[4]^[5]

How was Florence-2 trained?

Florence-2 is trained from scratch with a uniform cross-entropy loss across all task prompts using the AdamW optimizer. The paper specifies a maximum learning rate of 1e-4 for the base model and 1e-5 for the large model, a 5000-step linear warmup, and cosine decay thereafter. The base model uses a mini-batch size of 2048 examples and the large model uses 3072, both distributed across many GPUs.^[7]
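The schedule described above, a linear warmup followed by cosine decay, can be written as a small function. The total step count and the final learning rate of zero are assumptions for this sketch, since the article does not state them.

```python
import math

def learning_rate(step, max_lr, total_steps, warmup_steps=5000, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr (assumed to be 0)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 1_000_000  # hypothetical training length
for step in (0, 2500, 5000, 500_000, TOTAL_STEPS):
    print(step, learning_rate(step, max_lr=1e-4, total_steps=TOTAL_STEPS))
```

The two branches meet at the end of warmup, where both evaluate to the maximum learning rate, so the schedule is continuous.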

Training proceeds in two resolution stages. The first stage processes approximately three billion effective samples at 384 by 384 input resolution, where one effective sample is one image-annotation pair drawn from FLD-5B. The second stage refines the model at 768 by 768 resolution for an additional 0.5 billion samples for the base model and 0.1 billion samples for the large model. Within each stage, batches mix prompts from every supported task in proportions roughly matching the annotation distribution of FLD-5B, so the model learns all tasks simultaneously rather than in a curriculum.^[7]
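The sample counts and batch sizes quoted in this section imply rough optimizer step counts per stage. The calculation below assumes that each step consumes exactly one full mini-batch, which the article does not confirm.

```python
# Sample counts and batch sizes as quoted in the text.
CONFIGS = {
    "base": {"batch": 2048, "stage1": 3_000_000_000, "stage2": 500_000_000},
    "large": {"batch": 3072, "stage1": 3_000_000_000, "stage2": 100_000_000},
}

def steps(samples, batch):
    """Number of optimizer steps needed to see `samples` examples (rounded up)."""
    return -(-samples // batch)  # integer ceiling division

for name, cfg in CONFIGS.items():
    s1 = steps(cfg["stage1"], cfg["batch"])
    s2 = steps(cfg["stage2"], cfg["batch"])
    print(f"{name}: stage 1 ~{s1:,} steps, stage 2 ~{s2:,} steps")
```

Under that assumption the base model takes about 1.46 million steps at 384 by 384 resolution and about 244 thousand at 768 by 768, while the large model takes about 977 thousand and 33 thousand respectively.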

The *-ft variants released on Hugging Face are produced by additional supervised fine-tuning on a mixture of standard public benchmarks, including COCO captioning and detection, RefCOCO, RefCOCO+, RefCOCOg, and various VQA datasets. The Hugging Face fine-tuning tutorial recommends a very small learning rate of 1e-6 for downstream task adaptation, warning that higher rates lead to overfitting because the pretrained representations are already dense. The same tutorial reports that unfreezing the DaViT vision tower yields better performance than freezing it when sufficient GPU memory is available.^[5]

What Florence-2 variants are available?

Microsoft released four variants on Hugging Face, two pretrained and two fine-tuned, all in float16 precision under the MIT License:^[3]^[4]

Variant	Parameters	Stage	Hugging Face slug
Florence-2-base	232M	Pretrained on FLD-5B	microsoft/Florence-2-base
Florence-2-large	771M	Pretrained on FLD-5B	microsoft/Florence-2-large
Florence-2-base-ft	232M	FLD-5B plus downstream fine-tuning	microsoft/Florence-2-base-ft
Florence-2-large-ft	771M	FLD-5B plus downstream fine-tuning	microsoft/Florence-2-large-ft

The pretrained variants are intended primarily as starting points for further task-specific adaptation, while the *-ft variants are tuned for direct use on common benchmarks. Community redistribution of the weights has produced ONNX exports (under the onnx-community organization on the Hub) and various quantized formats suitable for execution under llama.cpp, Ollama, LM Studio, and similar runtimes, although the original weights remain the canonical reference.^[4]

The base and large checkpoints are among the most downloaded vision models on the Hugging Face Hub. As of July 2026 the Hugging Face model cards reported roughly 2.66 million downloads of Florence-2-base and more than 813,000 downloads of Florence-2-large in the preceding month.^[4]^[8]

How well does Florence-2 perform?

The Florence-2 paper reports both zero-shot and fine-tuned numbers across a wide range of vision and vision-language benchmarks. Selected results from the paper and from the Hugging Face model cards include the following:^[1]^[4]

Zero-shot

Benchmark	Metric	Florence-2-base	Florence-2-large
COCO captioning	CIDEr	133.0	135.6
NoCaps	CIDEr	118.7	120.8
TextCaps	CIDEr	70.1	72.8
COCO detection	mAP	34.7	37.5
Flickr30k phrase grounding	Recall@1	83.6	84.4

The authors emphasize two comparisons in the zero-shot setting. Florence-2-large achieves a higher COCO captioning CIDEr score (135.6) than DeepMind's 80-billion-parameter Flamingo model, despite having less than one percent of Flamingo's parameter count. It also outperforms Microsoft's 1.6-billion-parameter Kosmos-2 across the reported zero-shot benchmarks, including a 5.7-point gain in Flickr30k Recall@1 and approximate absolute gains of 4, 8, and 8 points on RefCOCO, RefCOCO+, and RefCOCOg respectively.^[1]^[8]

Fine-tuned

After downstream fine-tuning on the relevant benchmark datasets, the large variant reaches additional headline numbers reported in the paper and on the Hugging Face model card:^[1]^[4]

Benchmark	Metric	Florence-2-large-ft
COCO captioning	CIDEr	143.3
COCO object detection	mAP	43.4
VQAv2	Accuracy	81.7
TextVQA	Accuracy	73.5
RefCOCO	Accuracy	93.4

On the COCO Caption Karpathy test split, the paper reports a CIDEr score of 140.0 for the fine-tuned large model, which it notes surpasses the 80-billion-parameter Flamingo (138.1 CIDEr); the 143.3 in the table above is the figure listed on the Hugging Face model card. On RefCOCO referring expression comprehension, Florence-2-large reaches 93.4 accuracy on the validation split, with 95.3 and 92.0 on the test-A and test-B splits.^[1]^[4]

TextVQA is a particular highlight. The 73.5 figure in the table above is the TextVQA number listed on the Hugging Face model card, while the paper itself states that after task-specific fine-tuning, "Florence-2-L sets a new state-of-the-art performance with an accuracy of 81.5 without any external OCR token input."^[1]^[8] Either way, the model reads and reasons over scene text without a separate OCR module, which the authors present as evidence that the OCR ability learned from FLD-5B's textual annotations transfers to visual question answering.^[1]

The paper also reports competitive results on referring expression segmentation, dense captioning, and ADE20K semantic segmentation when fine-tuned, although the model's segmentation outputs are polygon-based rather than per-pixel masks, which limits accuracy on benchmarks with very fine boundary structure.^[1]

How does Florence-2 compare to other vision-language models?

Florence-2 occupies a distinctive position in the landscape of large vision-language systems. Most contemporaneous open vision-language models, including LLaVA and similar architectures, follow a pattern of plugging a frozen or lightly tuned vision encoder (often CLIP ViT) into a pretrained large language model and training only a projector and the language model on instruction-following image-text data. Such models excel at conversational image understanding and visual question answering but typically do not produce structured spatial outputs such as bounding boxes or polygons without specialized fine-tuning.^[9]

Florence-2 takes the opposite design choice: it trains a comparatively small encoder-decoder from scratch on a very large corpus of structured annotations, sacrificing the open-ended conversational behavior of LLM-based systems in exchange for direct support for dense detection, OCR, and grounding outputs. The two designs are complementary rather than competing; subsequent work has used Florence-2 as a fast spatial backbone whose detections feed into larger language models, or as a baseline against which to compare LLM-based grounding capabilities.^[4]^[9]

Compared to closed vision-language models such as GPT-4V, Florence-2 is much smaller and slower-evolving but is freely redistributable under MIT and runs on consumer hardware. Roboflow's deployment notes report that Florence-2 produces inference results in approximately one second per image on an NVIDIA T4 GPU and in several seconds per image on CPU, which is fast enough for moderate-throughput document processing or batch annotation pipelines without specialized accelerators.^[9]

The following table summarizes the broad design contrast among open vision-language systems available in mid-2024:

System	Approx. params	Architecture	Native spatial outputs	License
Florence-2-large	771M	DaViT plus BART-style decoder	Yes (boxes, quads, polygons)	MIT
LLaVA-1.5 (13B)	13B	ViT-L plus Vicuna LLM	No	Apache 2.0 / LLaMA
Kosmos-2	1.6B	Encoder plus decoder LLM	Boxes via inline location tokens	MIT
Flamingo (80B)	80B	NFNet plus cross-attention into Chinchilla	No	Closed (not released)

The presence of native spatial outputs in Florence-2 and Kosmos-2 reflects a shared design choice of expressing coordinates as quantized tokens in the language model vocabulary, an idea that traces back to Pix2Seq and earlier object-detection-as-language-modeling work.^[1]^[8]

What is Florence-2 used for?

Because Florence-2 is small enough to deploy without specialized infrastructure and exposes a single API for many tasks, it has been adopted as a building block in several application domains. Reported uses include:

automated content tagging and accessibility tooling, where the captioning prompts produce alt-text candidates and the OCR prompt extracts text from images;^[4]
document understanding pipelines, where Florence-2 is fine-tuned on DocVQA, FUNSD, or layout-specific data and provides region-level reading plus reasoning;^[5]
data labeling and bootstrapping, where Florence-2's zero-shot grounding and detection outputs serve as initial annotations that are then reviewed and refined by humans, reducing labeling cost compared to manual-from-scratch workflows;^[9]
robotics and edge perception, where the relatively small parameter count of Florence-2-base allows real-time inference on commodity GPUs and even mobile-class accelerators after quantization;^[9]
preprocessing for larger vision-language pipelines, where Florence-2 is used as a fast spatial extractor whose detections, captions, and OCR outputs are passed as structured context to a downstream LLM-based reasoner.^[5]

The CVPR 2024 paper also describes downstream evaluation in which Florence-2's pretrained vision tower is transferred to dense prediction tasks (semantic segmentation on ADE20K, object detection on COCO with Mask R-CNN-style heads) and used as a drop-in replacement for ImageNet-pretrained backbones, yielding competitive results despite the model's much smaller compute budget than backbones trained specifically for those benchmarks.^[1]

Is Florence-2 open source?

Yes. Florence-2 is released under the permissive MIT License, which allows commercial use, modification, and redistribution, and all four checkpoints (base, large, and their fine-tuned -ft variants) are downloadable from the Hugging Face Hub.^[4]^[8] Roboflow describes it as "a lightweight vision-language model open-sourced by Microsoft under the MIT license."^[9]

The original Microsoft checkpoints ship as custom modeling code that must be loaded with trust_remote_code=True. Florence-2 was later added to the Hugging Face Transformers library as a natively supported architecture through the Florence2ForConditionalGeneration class, with community-maintained mirrors published under the florence-community organization, so current versions load through the standard AutoModel API without remote code.^[10]
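A minimal loading and inference sketch for the original checkpoints follows the pattern shown on the Hugging Face model cards. It assumes that `torch`, `transformers`, and `Pillow` are installed and that the installed Transformers version is compatible with the checkpoint's remote code. The `build_prompt` and `run_florence2` helpers are hypothetical names introduced for this example.

```python
def build_prompt(task, text_input=""):
    """Task token first, then any extra text (for example a caption to ground)."""
    return task + text_input

def run_florence2(image_path, task="<OD>", text_input="",
                  model_id="microsoft/Florence-2-base"):
    """Run one task prompt on one image and return the parsed result."""
    # Heavy imports stay local so build_prompt can be used without them.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    # The original Microsoft checkpoints ship custom code, hence trust_remote_code.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=dtype, trust_remote_code=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=build_prompt(task, text_input), images=image, return_tensors="pt"
    ).to(device, dtype)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=False,
    )
    # Keep special tokens: the post-processor needs the location tokens.
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        text, task=task, image_size=(image.width, image.height)
    )

print(build_prompt("<CAPTION_TO_PHRASE_GROUNDING>", "a green car"))
```

Calling `run_florence2("photo.jpg", task="<CAPTION>")` on a local image of your own would download the weights on first use. With the natively supported class, the same flow should work without `trust_remote_code`, although the exact repository names to load from are not covered here.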

The one component that is not open is the training data. FLD-5B, the corpus of 5.4 billion annotations on 126 million images that the model was trained on, was never released, so the weights and a description of the annotation pipeline are public while the exact training set is not reproducible from first-party materials.^[1]^[4]

What are the limitations of Florence-2?

Several limitations of Florence-2 have been documented by the original authors and by subsequent users.

First, the model's spatial output format is intrinsically discrete and limited to 1000 quantization bins per axis. For very high-resolution images or applications that require sub-pixel localization, the grid cannot resolve positions more finely than one part in a thousand of the image dimension, which corresponds to bins roughly two pixels wide on a 2048-pixel-wide input. The polygon segmentation output is likewise coarser than the mask-based representations used by Mask R-CNN or SAM and may miss fine boundary detail on benchmarks like ADE20K.^[1]^[7]
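The size of the quantization grid at different resolutions follows directly from the 1000-bin vocabulary. The short calculation below prints the bin width and the worst-case rounding error, assuming coordinates are decoded at the bin center; that decoding convention is an assumption for illustration.

```python
def bin_width(image_size, bins=1000):
    """Width in pixels of one location bin along an axis of the given size."""
    return image_size / bins

for size in (384, 768, 2048, 4096):
    width = bin_width(size)
    # With bin-center decoding, the worst case is half a bin.
    print(f"{size:>5} px: bin width {width:.3f} px, max error {width / 2:.3f} px")
```

At the 384- and 768-pixel training resolutions the bins are narrower than a pixel, so the grid only becomes a practical constraint on much larger inputs.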

Second, Florence-2 does not natively support open-ended visual question answering at the level achieved by larger multimodal LLMs. The pretrained prompts cover a fixed taxonomy of tasks, and the Hugging Face fine-tuning tutorial confirms that adding a VQA capability requires defining a new task prefix and fine-tuning on labeled examples, with a frozen-vision-encoder baseline scoring zero on DocVQA before fine-tuning and 57.0 Levenshtein similarity afterward.^[5]
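A Levenshtein-based similarity score of the kind quoted above can be computed as follows. The normalization (one minus edit distance over the longer string's length) is an assumption for this sketch and may differ from the exact metric used in the tutorial.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]

def similarity(prediction, truth):
    """Normalized similarity in [0, 1]; 1.0 means an exact match."""
    if not prediction and not truth:
        return 1.0
    return 1.0 - levenshtein(prediction, truth) / max(len(prediction), len(truth))

print(levenshtein("kitten", "sitting"))           # 3
print(round(similarity("kitten", "sitting"), 3))  # 0.571
```

Averaging such a score over a validation set and multiplying by 100 yields a figure on the same scale as the 57.0 quoted above.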

Third, the FLD-5B dataset is not released, which limits reproducibility and the ability of the community to verify the exact contents on which the model was trained. The data engine pipeline is described at a high level in the paper but cannot be exactly reproduced without access to the same web image corpus and specialist annotators used by Microsoft. Independent reproductions of FLD-5B-style annotation pipelines have begun to appear in open-source projects, but no public replacement of comparable scale exists as of mid-2026.^[1]^[4]

Fourth, like other models trained on web-scale annotation pipelines, Florence-2 inherits biases from its source data and annotators. The model card on Hugging Face notes that Microsoft did not perform extensive bias evaluation and recommends task-specific testing before deployment. Failures on long-tail categories, non-English text in OCR, and unusual image domains (medical, satellite, microscopy) have been reported by community users.^[4]

Fifth, the float16 precision of the released weights is sufficient for inference but may require care when fine-tuning. The recommended workflow involves either mixed-precision training with a float32 master copy of the weights or full-precision conversion before fine-tuning, particularly when unfreezing the DaViT vision tower.^[5]

Why does Florence-2 matter?

Florence-2 has had measurable influence on vision foundation model design in three respects. First, it provides a credible demonstration that compact encoder-decoder models trained on dense multi-task supervision can rival or exceed much larger LLM-based vision-language models on structured benchmarks, validating the approach of investing in data engineering rather than parameter count alone. Second, its uniform sequence-to-sequence interface for captioning, detection, OCR, and grounding has been imitated by subsequent unified vision systems and has reinforced the trend of treating coordinates as language tokens. Third, its release under a permissive MIT license with weights freely downloadable on Hugging Face made high-quality spatial vision capabilities available to a much broader community of researchers and developers than was possible in the closed Florence-1 era.^[1]^[4]^[9]

The model is now widely used as a benchmark for compact open vision-language systems. Comparisons in subsequent papers and product launches frequently cite Florence-2 as the reference point for what a sub-billion-parameter unified vision model can achieve on COCO captioning, COCO detection, and TextVQA without external OCR.^[9] Its uptake is visible both in its integration into the mainline Hugging Face Transformers library and in monthly download counts in the millions across the base and large checkpoints.^[8]^[10]

Related work

Florence-2 sits at the intersection of several research lines.

It builds on Vision Transformer-family backbones, specifically the DaViT dual attention transformer that combines local spatial attention with channel attention.^[6]
It uses a sequence-to-sequence decoder modeled after BART, adapted to mix natural language with quantized location tokens.^[1]^[7]
It shares its language-model-based detection philosophy with Kosmos-2, Pix2Seq, and earlier "detection as language modeling" systems.^[1]
It uses Segment Anything Model (SAM) as part of its annotation pipeline to convert boxes into masks for polygon supervision.^[5]
It is often deployed alongside or compared with LLM-based vision-language systems such as LLaVA and PaLM-E for tasks that benefit from both spatial outputs and conversational reasoning.^[9]
It can be combined with classic detection heads such as DETR or Mask R-CNN when downstream applications need denser or more precise spatial outputs than the polygon decoder produces.^[1]
It shares architectural elements with other hierarchical vision transformers including the Swin Transformer, although DaViT's channel attention distinguishes it from the shifted-window approach.^[6]

References

1. Xiao, Bin et al., "Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks", arXiv preprint, 2023-11-10. https://arxiv.org/abs/2311.06242. Accessed 2026-07-12.
2. Xiao, Bin et al., "Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks", Microsoft Research publications, 2024-06-01. https://www.microsoft.com/en-us/research/publication/florence-2-advancing-a-unified-representation-for-a-variety-of-vision-tasks/. Accessed 2026-05-20.
3. Xiao, Bin et al., "Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (CVPR 2024)", CVF Open Access, 2024-06-17. https://openaccess.thecvf.com/content/CVPR2024/papers/Xiao_Florence-2_Advancing_a_Unified_Representation_for_a_Variety_of_Vision_CVPR_2024_paper.pdf. Accessed 2026-05-20.
4. Microsoft, "Florence-2-base model card", Hugging Face, 2024-06-16. https://huggingface.co/microsoft/Florence-2-base. Accessed 2026-07-12.
5. Marafioti, Andres and contributors, "Fine-tuning Florence-2: Microsoft's Cutting-edge Vision Language Models", Hugging Face blog, 2024-06-24. https://huggingface.co/blog/finetune-florence2. Accessed 2026-05-20.
6. Ding, Mingyu et al., "DaViT: Dual Attention Vision Transformers", European Conference on Computer Vision (ECCV) 2022 proceedings, Springer, 2022-10-23. https://link.springer.com/chapter/10.1007/978-3-031-20053-3_5. Accessed 2026-05-20.
7. Tsang, Sik-Ho, "Brief Review: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks", Medium, 2024-08-12. https://sh-tsang.medium.com/brief-review-florence-2-advancing-a-unified-representation-for-a-variety-of-vision-tasks-f2ab66fc7415. Accessed 2026-05-20.
8. Microsoft, "Florence-2-large model card", Hugging Face, 2024-06-16. https://huggingface.co/microsoft/Florence-2-large. Accessed 2026-07-12.
9. Skalski, Piotr, "Florence-2: Open Source Vision Foundation Model by Microsoft", Roboflow blog, 2024-06-20. https://blog.roboflow.com/florence-2/. Accessed 2026-07-12.
10. Hugging Face, "Florence-2", Transformers documentation (model_doc/florence2). https://huggingface.co/docs/transformers/en/model_doc/florence2. Accessed 2026-07-12.
