# PaliGemma

> Source: https://aiwiki.ai/wiki/paligemma
> Updated: 2026-06-23
> Categories: Google, Multimodal AI, Open Source AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

PaliGemma is an open [vision-language model](/wiki/vision_language_model) developed by [Google](/wiki/google_deepmind) that pairs the [SigLIP](/wiki/siglip) image encoder with a [Gemma](/wiki/gemma) language model, takes an image plus a text prompt as input, and produces text as output. It was first released on May 14, 2024, alongside that year's Google I/O developer conference. [1][2] The original release is a single 3-billion-parameter model built as a base for fine-tuning rather than for general-purpose chat, and a second generation, PaliGemma 2, arrived in December 2024 with three sizes built on [Gemma 2](/wiki/gemma_2). [3][4] Its technical report describes it as "an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model," trained "to be a versatile and broadly knowledgeable base model that is effective to transfer." [3] The model belongs to the broader Gemma family of open-weight models, which Google describes as built from the same research and technology behind its [Gemini](/wiki/gemini) systems.

## Background

By mid-2024 Google had released several open text-only models under the Gemma name, but none could read images. PaliGemma filled that gap as the family's first vision-language model (VLM), giving developers an openly distributed system that accepts an image together with a text prompt. [2][5]

The design follows the recipe described for PaLI-3, an earlier line of vision-language research from Google. PaLI ("Pathways Language and Image") models had shown that a comparatively small VLM, trained carefully, could match or beat much larger systems on transfer tasks. PaliGemma applied that lesson with publicly available components: a SigLIP vision backbone and a Gemma text decoder, both already released openly, combined and trained together. [3][5]

The accompanying technical report, "PaliGemma: A versatile 3B VLM for transfer," was posted to arXiv on July 10, 2024 by Lucas Beyer, Andreas Steiner, Andre Susano Pinto, Alexander Kolesnikov, Xiao Wang, and a large group of co-authors from Google Research. [3] In the words of its abstract, "We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation." The evaluation spans standard captioning and question-answering benchmarks as well as less common domains such as remote-sensing question answering and referring-expression segmentation. [3]

## What is PaliGemma's architecture?

PaliGemma is the composition of two parts: a Vision Transformer image encoder and a Transformer text decoder. The image encoder is a SigLIP "shape-optimized" model, SigLIP-So400m/14, that was contrastively pretrained at scale using a sigmoid loss. The text decoder is initialized from the Gemma-2B checkpoint. Together the two components total roughly 3 billion parameters. [1][3]

The two halves are joined by a single linear projection layer. An input image is split into patches of 14 by 14 pixels and passed through the SigLIP encoder, which produces a sequence of visual features with 1,152 dimensions per patch. The projection layer maps those features to 2,048 dimensions, the same embedding size the Gemma decoder uses for text, so the two streams can be concatenated. [6] The resulting image tokens, sometimes called "soft tokens," are placed ahead of the tokenized text prompt, and the combined sequence is fed to the decoder. The decoder then generates the answer autoregressively. [5][6] To support detection and segmentation outputs, PaliGemma extends Gemma's vocabulary to 257,216 tokens, adding 1,024 location tokens for bounding-box coordinates and 128 codeword tokens for segmentation masks. [6]
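The projection-and-concatenation step can be sketched in a few lines of plain Python. This is an illustrative sketch, not Google's implementation: the helper names (`make_linear`, `project`, `build_decoder_input`), the random initialization, and the presence of a bias term are assumptions. The demo uses tiny dimensions so it runs instantly, while the real widths of 1,152 and 2,048 are used only to count the projection's parameters.

```python
import random

SIGLIP_DIM = 1152  # per-patch feature width produced by the SigLIP encoder
GEMMA_DIM = 2048   # embedding width expected by the Gemma decoder


def make_linear(in_dim, out_dim, seed=0):
    """Create a randomly initialized weight matrix and a zero bias (hypothetical init)."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(features, weight, bias):
    """Apply y = x @ W + b to every patch feature vector."""
    out_dim = len(bias)
    return [
        [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j] for j in range(out_dim)]
        for x in features
    ]


def build_decoder_input(image_tokens, text_embeddings):
    """Place the projected image tokens ahead of the embedded text prompt."""
    return image_tokens + text_embeddings


# Tiny stand-in sizes: 4 patches of width 6, projected to width 4, plus 3 text tokens.
patch_features = [[0.1 * (p + i) for i in range(6)] for p in range(4)]
text_embeddings = [[0.0] * 4 for _ in range(3)]

weight, bias = make_linear(in_dim=6, out_dim=4)
image_tokens = project(patch_features, weight, bias)
sequence = build_decoder_input(image_tokens, text_embeddings)
print(len(sequence), len(sequence[0]))

# At the real widths, a single linear layer with bias would hold this many parameters.
projector_params = SIGLIP_DIM * GEMMA_DIM + GEMMA_DIM
print(projector_params)
```

Because the projection only changes the width of each patch vector, the number of image tokens is unchanged by this step. Only the feature dimension is aligned with the text embeddings.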

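Image tokens are processed with full (bidirectional) attention, so every image token can attend to every other, while the generated text uses standard causal masking, with each output token attending only to the input and to earlier output tokens. The technical report describes this as a prefix-LM setup in which the text prompt shares the full-attention block with the image. [3] The number of image tokens depends on the input resolution: because the patch size is fixed at 14 pixels, a square image of side R yields (R / 14)² tokens.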

| Input resolution | Image tokens |
|---|---|
| 224 x 224 | 256 |
| 448 x 448 | 1,024 |
| 896 x 896 | 4,096 |

Higher resolutions give the model more detail to work with, which helps on tasks that depend on small text or fine structure, at the cost of a longer token sequence and more compute. [6]
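The token counts in the table, and the attention pattern applied to the resulting sequence, can be reproduced with a short sketch. The function names are hypothetical. The mask follows a prefix-LM layout in which the prompt is treated as part of the full-attention prefix alongside the image tokens, as in the technical report. It illustrates the masking rule and is not code from the PaliGemma release.

```python
PATCH_SIZE = 14  # fixed patch edge length in pixels


def num_image_tokens(resolution, patch_size=PATCH_SIZE):
    """Number of patches (and therefore image tokens) for a square input image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    side = resolution // patch_size
    return side * side


def prefix_lm_mask(num_prefix, num_suffix):
    """Return mask[q][k], True when query position q may attend to key position k.

    Prefix positions (image tokens and prompt) see the whole prefix.
    Suffix positions (generated text) see the prefix plus earlier suffix tokens.
    """
    total = num_prefix + num_suffix
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            if k < num_prefix:
                row.append(True)  # the prefix is visible to every position
            else:
                row.append(q >= num_prefix and k <= q)  # causal within the suffix
        mask.append(row)
    return mask


for resolution in (224, 448, 896):
    print(resolution, num_image_tokens(resolution))

# Small example: 3 prefix positions followed by 2 generated tokens.
small_mask = prefix_lm_mask(num_prefix=3, num_suffix=2)
for row in small_mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Doubling the resolution quadruples the number of image tokens. With standard self-attention, whose cost grows with the square of sequence length, this is why the higher-resolution checkpoints need noticeably more compute.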

## What can PaliGemma do?

Google describes PaliGemma as designed for class-leading fine-tuned performance on a wide range of vision-language tasks, rather than as a finished conversational assistant. The tasks the model targets include image and short-video captioning, visual question answering, reading text in images (OCR), object detection, and object segmentation. [1][2]
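For object detection, the model emits bounding boxes as text using the location tokens added to its vocabulary. The sketch below shows one way such output might be decoded into pixel coordinates. The token spelling `<locNNNN>`, the `y_min, x_min, y_max, x_max` ordering, the `;` separator between objects, and the normalization by 1,024 bins are assumptions drawn from public PaliGemma documentation, not from the text above. The function name is hypothetical.

```python
import re

LOC_PATTERN = re.compile(r"<loc(\d{4})>")
NUM_BINS = 1024  # one bin per location token


def parse_detection(text, image_width, image_height):
    """Turn '<loc....><loc....><loc....><loc....> label' segments into pixel boxes.

    Assumes four location tokens per object in y_min, x_min, y_max, x_max order,
    with multiple objects separated by ';'. Malformed segments are skipped.
    """
    detections = []
    for segment in text.split(";"):
        bins = [int(value) for value in LOC_PATTERN.findall(segment)]
        if len(bins) != 4:
            continue
        label = LOC_PATTERN.sub("", segment).strip()
        y_min, x_min, y_max, x_max = bins
        box = (
            x_min / NUM_BINS * image_width,
            y_min / NUM_BINS * image_height,
            x_max / NUM_BINS * image_width,
            y_max / NUM_BINS * image_height,
        )
        detections.append((label, box))
    return detections


# Hand-written example string, not real model output.
example = "<loc0256><loc0128><loc0768><loc0512> cat ; <loc0000><loc0512><loc0512><loc1023> dog"
boxes = parse_detection(example, image_width=1024, image_height=512)
for label, box in boxes:
    print(label, box)
```

Each box is returned as `(x_min, y_min, x_max, y_max)` in pixels, which is the order most drawing libraries expect.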

Google released the model in three forms to suit different needs. [5]

| Checkpoint type | Purpose |
|---|---|
| Pretrained (pt) | Base models meant to be fine-tuned on a downstream task |
| Mix | Pretrained models fine-tuned on a mixture of tasks, for general inference with free-text prompts |
| Fine-tuned (ft) | Models specialized for individual research benchmarks |

The pretrained checkpoints are the main artifact. The mix checkpoints, offered at 224 and 448 pixels, are tuned on a blend of tasks so they can respond to open prompts out of the box, and Google positions them as suitable for exploration. The single-task fine-tuned checkpoints reproduce results on specific academic datasets. [1][5]

Benchmark results were reported for captioning sets such as COCO, NoCaps, TextCaps, and SciCap; question-answering sets including VQAv2, OKVQA, GQA, TextVQA, and DocVQA; segmentation on the RefCOCO datasets; and video tasks such as MSR-VTT, ActivityNet, and VATEX. Scores generally improved at higher input resolutions. [1] Despite having fewer than 3 billion parameters, the model was shown to rival the performance of much larger predecessors such as PaLI-X and PaLM-E, as well as PaLI-3. [3][5]

## What is PaliGemma 2?

PaliGemma 2 was published on December 5, 2024, with the technical report "PaliGemma 2: A Family of Versatile VLMs for Transfer" posted to arXiv the day before. [4][7] The update keeps the same SigLIP-So400m vision encoder but replaces the original Gemma-2B decoder with the full lineup of [Gemma 2](/wiki/gemma_2) language models, giving the family three sizes that span a much wider range of capability. [4][8]

| Model | Text decoder (Gemma 2) | Supported resolutions |
|---|---|---|
| PaliGemma 2 3B | Gemma 2 2B | 224, 448, 896 |
| PaliGemma 2 10B | Gemma 2 9B | 224, 448, 896 |
| PaliGemma 2 28B | Gemma 2 27B | 224, 448, 896 |

Because each of the three sizes is offered at each of the three resolutions, the release included nine pretrained checkpoints in bfloat16 precision. Google also published two fine-tuned variants trained on the DOCCI image-caption dataset, covering the 3B and 10B sizes at 448-pixel resolution, which produce long and detailed captions with strong text rendering and spatial description. [4][8]
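The nine pretrained checkpoints are simply every pairing of a size with a resolution, which a short sketch can enumerate. The repository naming pattern used here (`google/paligemma2-<size>-pt-<resolution>`) is an assumption based on the Hugging Face Hub listing and should be checked against the Hub before use.

```python
from itertools import product

SIZES = ("3b", "10b", "28b")    # PaliGemma 2 model sizes
RESOLUTIONS = (224, 448, 896)   # supported input resolutions in pixels


def pretrained_checkpoints(prefix="google/paligemma2"):
    """List one pretrained ("pt") checkpoint name per size and resolution pair."""
    return [f"{prefix}-{size}-pt-{res}" for size, res in product(SIZES, RESOLUTIONS)]


checkpoints = pretrained_checkpoints()
for name in checkpoints:
    print(name)
print(len(checkpoints))
```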

The second-generation report extends the model into several specialized domains beyond the original task set. The authors present results on table structure recognition (using the FinTabNet and PubTabNet datasets), molecular structure recognition (PubChem), optical music score recognition (GrandStaff), long-form captioning, spatial reasoning, and radiography report generation from chest X-rays (MIMIC-CXR), with state-of-the-art transfer performance in several of these areas. [7][9] Google also presented the new generation as a drop-in replacement for the original, so existing PaliGemma users could upgrade with minimal changes to their fine-tuning workflows. [8]

On February 19, 2025, Google followed up with PaliGemma 2 mix, a set of ready-to-use checkpoints tuned on a mixture of tasks at the 3B, 10B, and 28B sizes and 224 and 448 pixels. Google describes these as models "tuned to a mixture of tasks that allow directly exploring the model capabilities and using it out-of-the-box for common use cases," covering short and long captioning, OCR, image question answering, object detection, and image segmentation without task-specific fine-tuning. [10]

PaliGemma's vision approach influenced later Google releases. [Gemma 3](/wiki/gemma_3), the multimodal generation of the core Gemma text models, adopted a similar pattern of feeding SigLIP image features into the language model as a fixed number of soft tokens.

## Is PaliGemma open source?

Both generations are distributed as open-weight models, under Google's own license rather than a standard open-source license such as Apache 2.0. At launch PaliGemma was made available through GitHub, Hugging Face, Kaggle, Vertex AI Model Garden, and NVIDIA's platform, with reference integrations for JAX and Hugging Face Transformers and runnable examples in Colab and Kaggle notebooks. Google also offered cloud credits to academic researchers working with the model. [2][5]

The weights are released under the Gemma license. The terms permit redistribution, commercial use, fine-tuning, and the creation of derivative models, subject to Google's prohibited-use policy. [4] Because access on the Hugging Face Hub is gated, users must accept the Gemma license terms before downloading the checkpoints. [5] PaliGemma sits alongside other openly released Gemma derivatives such as [CodeGemma](/wiki/codegemma), and it is distinct from Google's image-generation models like [Imagen](/wiki/imagen), since PaliGemma reads images and outputs text rather than generating pictures.

## References

1. Google AI for Developers. "PaliGemma model card." https://ai.google.dev/gemma/docs/paligemma/model-card
2. Google Developers Blog. "Introducing PaliGemma, Gemma 2, and an Upgraded Responsible AI Toolkit." May 14, 2024. https://developers.googleblog.com/en/gemma-family-and-toolkit-expansion-io-2024/
3. Beyer, Lucas, et al. "PaliGemma: A versatile 3B VLM for transfer." arXiv:2407.07726. July 10, 2024. https://arxiv.org/abs/2407.07726
4. Hugging Face Blog. "Welcome PaliGemma 2 - New vision language models by Google." December 5, 2024. https://huggingface.co/blog/paligemma2
5. Hugging Face Blog. "PaliGemma - Google's Cutting-Edge Open Vision Language Model." May 14, 2024. https://huggingface.co/blog/paligemma
6. Google Developers Blog. "Gemma explained: PaliGemma architecture." https://developers.googleblog.com/gemma-explained-paligemma-architecture/
7. Steiner, Andreas, et al. "PaliGemma 2: A Family of Versatile VLMs for Transfer." arXiv:2412.03555. December 4, 2024. https://arxiv.org/abs/2412.03555
8. Google Developers Blog. "Introducing PaliGemma 2: Powerful Vision-Language Models, Simple Fine-Tuning." December 5, 2024. https://developers.googleblog.com/en/introducing-paligemma-2-powerful-vision-language-models-simple-fine-tuning/
9. Google AI for Developers. "PaliGemma 2 model card." https://ai.google.dev/gemma/docs/paligemma/model-card-2
10. Google Developers Blog. "Introducing PaliGemma 2 mix: A vision-language model for multiple tasks." February 19, 2025. https://developers.googleblog.com/en/introducing-paligemma-2-mix/

