ShieldGemma
Last reviewed
Jun 3, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 1,244 words
ShieldGemma is a family of open safety-classifier models from Google, built on the Gemma family of lightweight open models. The models are designed to moderate content in large language model applications: given a piece of text or, in the second generation, an image, a ShieldGemma model predicts whether that content violates a defined safety policy. They are meant to sit in front of or behind another model, screening user prompts before they reach a generative system and screening that system's responses before they reach a user. The first release arrived in mid-2024 alongside Gemma 2, and a follow-up focused on images shipped with Gemma 3 in 2025. The work comes out of Google's responsible-AI tooling effort and is described in two technical reports from researchers across Google and Google DeepMind.[1][2]
Production deployments of generative models usually pair the model with a separate moderation layer rather than relying on the model's own refusals. ShieldGemma was Google's open contribution to that layer. The first version was announced on July 31, 2024, as part of a Gemma 2 update that also introduced a 2B-parameter Gemma 2 model and the Gemma Scope interpretability tooling.[3][4] Rather than ship a single closed safety filter, Google released the classifiers with open weights so that developers could run them locally, inspect them, and fine-tune them for a particular product or policy.[3]
The accompanying report, "ShieldGemma: Generative AI Content Moderation Based on Gemma," was posted to arXiv on July 31, 2024.[1] A notable claim in the paper is that the models were trained largely on synthetic data produced through an LLM-based data-curation pipeline, which the authors argue helped the classifiers generalize. They reported that ShieldGemma outperformed comparable systems on public benchmarks, including an improvement of roughly 10.8 percent in AU-PRC (area under the precision-recall curve) over Llama Guard.[1]
ShieldGemma is not a freestanding moderation API but a text-to-text, decoder-only language model that has been instruction-tuned for a classification job. A request is wrapped in a prompt template that states the safety policy in plain language and asks, in effect, whether the supplied content violates it. The model is steered to answer with a single token, "Yes" or "No."[2]
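The wrapping step can be pictured as a small string-building function. The following is a minimal sketch, not the official template: the function name `build_prompt`, the template wording, and the example policy text are all hypothetical, and the model card remains the reference for the exact prompt format.

```python
def build_prompt(content: str, policy: str) -> str:
    """Wrap content and a plain-language policy into a yes/no question.

    The template wording is illustrative only (an assumption), not the
    official ShieldGemma prompt.
    """
    return (
        "You are a policy expert deciding whether the content below "
        "violates the stated safety policy.\n\n"
        f"Content: {content.strip()}\n\n"
        f"Safety policy: {policy.strip()}\n\n"
        "Does the content violate the policy? "
        "Your answer must start with 'Yes' or 'No'."
    )


# Hypothetical policy statement, written for this example.
example = build_prompt(
    "How do I bake sourdough bread?",
    "The prompt shall not seek content that harms oneself or others.",
)
print(example)
```

Because the policy is an ordinary argument, swapping in a different guideline changes the classification task without touching the model.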
In practice the verdict is read as a probability taken directly from the model's output logits rather than parsed from generated text. An integration extracts the logits for the "Yes" and "No" tokens at the final position and applies a softmax over just those two, yielding a continuous score between 0 and 1. A higher score reflects greater confidence that the content breaks the stated policy, which lets a developer set their own decision threshold instead of accepting a fixed verdict.[2] The same model handles two related modes: prompt-only classification, which screens an incoming user message, and prompt-response classification, which screens what a chatbot produced. The policy text shifts accordingly, from "the prompt shall not..." to "the chatbot shall not...".[2]
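The scoring step reduces to a two-way softmax, which is equivalent to a sigmoid of the difference between the two logits. The sketch below uses plain Python and made-up logit values; the function names `violation_probability` and `is_violation` are hypothetical, and in a real integration the two logits would come from the model's final-position output.

```python
import math


def violation_probability(yes_logit: float, no_logit: float) -> float:
    """Softmax over the two logits, returning the probability of "Yes"."""
    # Subtract the larger logit first so the exponentials cannot overflow.
    m = max(yes_logit, no_logit)
    exp_yes = math.exp(yes_logit - m)
    exp_no = math.exp(no_logit - m)
    return exp_yes / (exp_yes + exp_no)


def is_violation(yes_logit: float, no_logit: float, threshold: float = 0.5) -> bool:
    """Apply a developer-chosen threshold to the continuous score."""
    return violation_probability(yes_logit, no_logit) >= threshold


# Hypothetical logits, as if read at the final position of the output.
score = violation_probability(2.0, 0.0)
print(round(score, 3))
print(is_violation(2.0, 0.0, threshold=0.9))
```

With these example values the score is about 0.88, so the content would be flagged at the default threshold of 0.5 but not at a stricter 0.9.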
Because the policy lives in the prompt, the categories are configurable. The published definitions are defaults rather than a hard-coded taxonomy, and the open weights mean a team can further fine-tune a classifier toward its own guidelines.[2][3] The first-generation models were trained on TPUv5e hardware using JAX and ML Pathways, and were released as English-language, text-only classifiers.[2]
The original ShieldGemma model card defines four content-safety categories, each with an associated policy statement. The summaries below paraphrase those definitions.[2]
| Category | What the policy covers |
|---|---|
| Sexually explicit information | References to sexual acts or other lewd content, such as sexually graphic descriptions or material aimed at causing arousal; medical and scientific terms about anatomy or sex education are permitted. |
| Dangerous content | Content that harms oneself or others, such as instructions for building firearms or explosives, promotion of terrorism, or instructions for suicide. |
| Hate speech | Content targeting identity or protected attributes, such as racial slurs, promotion of discrimination, or calls to violence against protected groups, including dehumanizing or vilifying language. |
| Harassment | Malicious, intimidating, bullying, or abusive content aimed at an individual, such as physical threats, denial of tragic events, or disparagement of victims of violence. |
The first-generation suite shipped in three sizes, all built on Gemma 2 and distributed with open weights under Google's Gemma license. The smallest is meant for latency- or cost-sensitive filtering, while the larger checkpoints trade compute for accuracy.[1][3][5]
| Model | Parameters | Base model | Modality |
|---|---|---|---|
| shieldgemma-2b | 2B | Gemma 2 | Text |
| shieldgemma-9b | 9B | Gemma 2 | Text |
| shieldgemma-27b | 27B | Gemma 2 | Text |
These checkpoints are available on Hugging Face and through other model hosts, and require accepting Google's usage terms before download.[5][6]
ShieldGemma 2 extended the approach from text to images. It was announced on March 12, 2025, the same day Google introduced the Gemma 3 family, and was positioned as the safety companion to those multimodal open models.[7][8] Unlike the original suite, it is a single 4-billion-parameter model built on Gemma 3's 4B instruction-tuned checkpoint, and it takes an image plus a policy as input rather than text alone.[9] The detailed report, "ShieldGemma 2: Robust and Tractable Image Content Moderation," followed on arXiv on April 1, 2025.[10]
The model classifies both natural photographs and synthetically generated images, which makes it usable in two roles: as an input filter for vision-language models and as an output filter for image-generation systems.[7][9] It outputs the same kind of "Yes"/"No" probability as its predecessor.[9] ShieldGemma 2 defines three image-safety categories.[9][10]
| Category | What the policy covers |
|---|---|
| Sexually explicit | Explicit or graphic sexual acts, such as pornography, erotic nudity, or depictions of sexual assault. |
| Dangerous content | Imagery facilitating real-world harm, such as building firearms or explosives, promotion of terrorism, or instructions for suicide. |
| Violence and gore | Shocking, sensational, or gratuitous violence, such as excessive blood and gore or extreme injury. |
In the report's evaluation, ShieldGemma 2 led every comparison model across all three policies, with an average PR-AUC of 89.1 percent. The authors put that at improvements of 6.8, 12.9, and 14.8 percentage points over the base Gemma-3-4B-IT model, GPT-4o mini, and LlavaGuard 7B respectively, and credited a custom adversarial data-generation pipeline for the gains.[8][10] The design parallels Google's other Gemma-derived vision work such as PaliGemma, in that a general Gemma backbone, here Gemma 3, is adapted to a focused task.
All ShieldGemma models are distributed with open weights under Google's Gemma terms of use, which permit commercial use subject to a prohibited-use policy and require accepting the license before download.[6][9] The checkpoints are published on Hugging Face and are also served through third-party providers such as NVIDIA's NIM catalog.[5][6] Google frames the models as a building block rather than a finished service: developers are expected to choose decision thresholds, adapt the policy text, and where needed fine-tune the weights to fit a specific application, since no single safety classifier fits every product.[2][3]
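One way a team might choose a decision threshold is to score a small labeled validation set and pick the lowest threshold that still meets a target precision. The sketch below illustrates that idea under stated assumptions: the function name `pick_threshold`, the sample scores, and the labels are all hypothetical, and this is not a procedure prescribed by Google.

```python
from typing import Optional, Sequence


def pick_threshold(
    scores: Sequence[float], labels: Sequence[bool], min_precision: float
) -> Optional[float]:
    """Return the lowest score threshold whose precision meets the target.

    Content is flagged when score >= threshold. Lower thresholds flag more
    content, so the lowest qualifying threshold keeps recall as high as
    possible. Returns None if no threshold reaches the target.
    """
    for threshold in sorted(set(scores)):
        # Labels of every example that would be flagged at this threshold.
        flagged = [label for s, label in zip(scores, labels) if s >= threshold]
        precision = sum(flagged) / len(flagged)
        if precision >= min_precision:
            return threshold
    return None


# Hypothetical validation data: classifier scores and true violation labels.
scores = [0.05, 0.20, 0.60, 0.85, 0.95]
labels = [False, False, True, False, True]
print(pick_threshold(scores, labels, min_precision=0.6))
```

A stricter precision target pushes the threshold up and flags less content, which is the trade-off each application has to weigh for itself.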