PaliGemma
Last reviewed
Jun 3, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,334 words
PaliGemma is an open vision-language model developed by Google and first released on May 14, 2024, alongside that year's Google I/O developer conference. [1][2] It pairs a SigLIP image encoder with a Gemma language model and is built as a base model intended for fine-tuning rather than for general-purpose chat. The original release is a single 3-billion-parameter model, and a second generation, PaliGemma 2, arrived in December 2024 with a wider range of sizes built on Gemma 2. [3][4] The model belongs to the broader Gemma family of open-weight models, which Google describes as built from the same research and technology behind its Gemini systems.
By mid-2024 Google had released several open text-only models under the Gemma name, but none could read images. PaliGemma filled that gap as the family's first vision-language model (VLM), giving developers an openly distributed system that takes both an image and a text prompt as input and produces text as output. [2][5]
The design follows the recipe described for PaLI-3, an earlier line of vision-language research from Google. PaLI ("Pathways Language and Image") models had shown that a comparatively small VLM, trained carefully, could match or beat much larger systems on transfer tasks. PaliGemma applied that lesson with publicly available components: a SigLIP vision backbone and a Gemma text decoder, both already released openly, combined and trained together. [3][5]
The accompanying technical report, "PaliGemma: A versatile 3B VLM for transfer," was posted to arXiv on July 10, 2024 by Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, and a large group of co-authors from Google Research. The paper evaluates the model across nearly 40 tasks, spanning standard captioning and question-answering benchmarks as well as less common domains such as remote-sensing question answering and referring-expression segmentation. [3]
PaliGemma is the composition of two parts: a Vision Transformer image encoder and a Transformer text decoder. The image encoder is a SigLIP "shape-optimized" model, SigLIP-So400m/14, that was contrastively pretrained at scale using a sigmoid loss. The text decoder is initialized from the Gemma-2B checkpoint. Together the two components total roughly 3 billion parameters. [1][3]
The two halves are joined by a single linear projection layer. An input image is split into patches of 14 by 14 pixels and passed through the SigLIP encoder, which produces a sequence of visual features. The projection layer maps those features into the same embedding space the Gemma decoder uses for text. The resulting image tokens, sometimes called "soft tokens," are concatenated with the tokenized text prompt, and the combined sequence is fed to the decoder. The decoder then generates the answer autoregressively. [5][6]
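The data flow described above can be sketched in a few lines of plain Python. This is only an illustration of the shapes involved, not the model's actual implementation: the function names are hypothetical, the weights are random, and the feature widths are toy values chosen for readability (the real encoder and decoder use much wider vectors).

```python
import random

def linear_project(features, weight, bias):
    """Map each visual feature vector (length d_vision) to length d_text.

    weight holds one row of d_vision coefficients per output dimension.
    """
    return [
        [sum(f * w for f, w in zip(feat, row)) + b for row, b in zip(weight, bias)]
        for feat in features
    ]

def build_decoder_input(image_features, text_embeddings, weight, bias):
    """Project the image features, then prepend them to the prompt embeddings."""
    image_tokens = linear_project(image_features, weight, bias)
    return image_tokens + text_embeddings  # image tokens first, then text

rng = random.Random(0)
D_VISION, D_TEXT = 6, 4          # toy widths, assumed for illustration only
NUM_PATCHES, NUM_TEXT = 256, 5   # 256 patches corresponds to a 224 x 224 input

image_features = [[rng.gauss(0, 1) for _ in range(D_VISION)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(D_TEXT)] for _ in range(NUM_TEXT)]
weight = [[rng.gauss(0, 0.1) for _ in range(D_VISION)] for _ in range(D_TEXT)]
bias = [0.0] * D_TEXT

sequence = build_decoder_input(image_features, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # 261 4
```

After projection the image tokens and the text embeddings have the same width, which is what allows them to be concatenated into one sequence for the decoder.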
Image tokens and the text prompt are processed with full (bidirectional) attention, so every prompt token can attend across the entire image, while the generated text uses standard causal masking. The number of image tokens depends on the input resolution, since the patch size is fixed at 14 pixels.
| Input resolution | Image tokens |
|---|---|
| 224 x 224 | 256 |
| 448 x 448 | 1,024 |
| 896 x 896 | 4,096 |
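The counts in the table follow from the fixed 14-pixel patch size: a square image yields (resolution / 14) patches per side, and one token per patch. A short check, with a hypothetical helper name:

```python
PATCH_SIZE = 14  # patch edge length in pixels, as stated in the text

def num_image_tokens(resolution, patch_size=PATCH_SIZE):
    """Return the number of image tokens for a square input of the given size."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side

for resolution in (224, 448, 896):
    print(resolution, num_image_tokens(resolution))
# 224 256
# 448 1024
# 896 4096
```

Each doubling of the resolution quadruples the number of image tokens.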
Higher resolutions give the model more detail to work with, which helps on tasks that depend on small text or fine structure, at the cost of a longer token sequence and more compute. [6]
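The attention pattern described earlier, full attention over the image tokens and causal masking over the generated text, can be written as a boolean mask. The sketch below is a minimal illustration with a hypothetical function name; it places the text prompt in the bidirectional prefix together with the image tokens, following the prefix-LM setup described in the PaliGemma report.

```python
def prefix_lm_mask(num_prefix, num_suffix):
    """Build a mask where mask[i][j] is True if position i may attend to j.

    The prefix (image tokens plus prompt) is fully visible to every position.
    The suffix (generated text) is causal: each position sees the prefix and
    the suffix positions up to and including itself.
    """
    total = num_prefix + num_suffix
    return [
        [j < num_prefix or j <= i for j in range(total)]
        for i in range(total)
    ]

# Toy example: 2 image tokens + 1 prompt token as prefix, 2 generated tokens.
mask = prefix_lm_mask(num_prefix=3, num_suffix=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# 11100
# 11100
# 11100
# 11110
# 11111
```

The top-left block of ones is the bidirectional prefix; the lower-right triangle is the causal part used during generation.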
PaliGemma is designed for class-leading fine-tuned performance on a wide range of vision-language tasks rather than as a finished conversational assistant. The tasks the model targets include image and short-video captioning, visual question answering, reading text in images (OCR), object detection, and object segmentation. [1][2]
Google released the model in three forms to suit different needs. [5]
| Checkpoint type | Purpose |
|---|---|
| Pretrained (pt) | Base models meant to be fine-tuned on a downstream task |
| Mix | Pretrained models fine-tuned on a mixture of tasks, for general inference with free-text prompts |
| Fine-tuned (ft) | Models specialized for individual research benchmarks |
The pretrained checkpoints are the main artifact. The mix checkpoints, offered at 224 and 448 pixels, are tuned on a blend of tasks so they can respond to open prompts out of the box, and Google positions them as suitable for exploration. The single-task fine-tuned checkpoints reproduce results on specific academic datasets. [1][5]
Benchmark results for PaliGemma were reported on captioning sets such as COCO, NoCaps, TextCaps, and SciCap; question-answering sets including VQAv2, OKVQA, GQA, TextVQA, and DocVQA; segmentation on the RefCOCO datasets; and video tasks such as MSR-VTT, ActivityNet, and VATEX. Scores generally improved at higher input resolutions. [1]
PaliGemma 2 was published on December 5, 2024, with the technical report "PaliGemma 2: A Family of Versatile VLMs for Transfer" posted to arXiv the day before. [4][7] The update keeps the same SigLIP-So400m vision encoder but replaces the original Gemma-2B decoder with the full lineup of Gemma 2 language models, giving the family three sizes that span a much wider range of capability. [4][8]
| Model | Text decoder (Gemma 2) | Supported resolutions |
|---|---|---|
| PaliGemma 2 3B | Gemma 2 2B | 224, 448, 896 |
| PaliGemma 2 10B | Gemma 2 9B | 224, 448, 896 |
| PaliGemma 2 28B | Gemma 2 27B | 224, 448, 896 |
Because each of the three sizes is offered at each of the three resolutions, the release included nine pretrained checkpoints. Google also published two fine-tuned variants trained on the DOCCI image-caption dataset, covering the 3B and 10B sizes at 448-pixel resolution, which produce long and detailed captions with strong text rendering and spatial description. [4][8]
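The checkpoint count is simply the product of sizes and resolutions, which a short enumeration makes explicit. The labels below are descriptive strings taken from the table, not official repository names.

```python
from itertools import product

# Model size -> text decoder, as listed in the table above.
DECODERS = {"3B": "Gemma 2 2B", "10B": "Gemma 2 9B", "28B": "Gemma 2 27B"}
RESOLUTIONS = (224, 448, 896)

# One pretrained checkpoint per (size, resolution) pair.
pretrained = list(product(DECODERS, RESOLUTIONS))

# The two DOCCI fine-tuned variants: 3B and 10B at 448 pixels.
docci_finetuned = [(size, 448) for size in ("3B", "10B")]

for size, resolution in pretrained:
    print(f"PaliGemma 2 {size} ({DECODERS[size]}) at {resolution} px")

print(len(pretrained), len(docci_finetuned))  # 9 2
```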
The second-generation report extends the model into several specialized domains beyond the original task set. The authors evaluate it on table structure recognition (using the FinTabNet and PubTabNet datasets), molecular structure recognition (PubChem), optical music score recognition (GrandStaff), long-form captioning, spatial reasoning, and radiography report generation from chest X-rays (MIMIC-CXR), and they claim state-of-the-art transfer performance in several of these areas. [7][9] Google also presented the new generation as a drop-in replacement for the original, so existing PaliGemma users could upgrade with minimal changes to their fine-tuning workflows. [8]
PaliGemma's vision approach influenced later Google releases. Gemma 3, the multimodal generation of the core Gemma text models, adopted a similar pattern of feeding SigLIP image features into the language model as a fixed number of soft tokens.
Both generations are openly distributed. At launch PaliGemma was made available through GitHub, Hugging Face, Kaggle, Vertex AI Model Garden, and NVIDIA's platform, with reference integrations for JAX and Hugging Face Transformers and runnable examples in Colab and Kaggle notebooks. Google also offered cloud credits to academic researchers working with the model. [2][5]
The weights are released under the Gemma license rather than a standard open-source license such as Apache 2.0. The terms permit redistribution, commercial use, fine-tuning, and the creation of derivative models, subject to Google's prohibited-use policy. [4] Because access on the Hugging Face Hub is gated, users must accept the Gemma license terms before downloading the checkpoints. [5] PaliGemma sits alongside other openly released Gemma derivatives such as CodeGemma, and it is distinct from Google's image-generation models like Imagen, since PaliGemma reads images and outputs text rather than generating pictures.
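As a rough illustration of the Hugging Face Transformers integration and the gated Hub access mentioned above, the sketch below captions a local image with a mix checkpoint. Several details are assumptions rather than facts from this article: the repository naming pattern in `checkpoint_id`, the `caption en` prompt, and the helper names themselves. It also assumes that the license has already been accepted on the Hub and that the environment is authenticated; prompt formatting may differ between library versions, so the current model documentation should be checked before relying on it.

```python
def checkpoint_id(kind: str, resolution: int) -> str:
    """Build a Hub repository name (assumed naming pattern, verify on the Hub)."""
    return f"google/paligemma-3b-{kind}-{resolution}"

def caption_image(image_path, prompt="caption en", kind="mix",
                  resolution=224, max_new_tokens=30):
    """Generate text for one image; downloads the gated weights on first use."""
    # Imported lazily so the heavy dependencies are only needed when called.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    model_id = checkpoint_id(kind, resolution)
    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )

    # Drop the prompt and image positions so only the generated answer remains.
    return processor.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

Calling `caption_image("photo.jpg")` with a hypothetical local file would return a short caption. The mix checkpoint is used here because, as noted earlier, the pretrained checkpoints are meant to be fine-tuned before use.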