ShieldGemma

Updated Jun 3, 2026

ShieldGemma is a family of open safety-classifier models from Google, built on the Gemma family of lightweight open models. The models are designed to moderate content in large language model applications: given a piece of text or, in the second generation, an image, a ShieldGemma model predicts whether that content violates a defined safety policy. They are meant to sit in front of or behind another model, screening user prompts before they reach a generative system and screening that system's responses before they reach a user. The first release arrived in mid-2024 alongside Gemma 2, and a follow-up focused on images shipped with Gemma 3 in 2025. The work comes out of Google's responsible-AI tooling effort and is described in two technical reports from researchers across Google and Google DeepMind.^[1]^[2]

Background

Production deployments of generative models usually pair the model with a separate moderation layer rather than relying on the model's own refusals. ShieldGemma was Google's open contribution to that layer. The first version was announced on July 31, 2024, as part of a Gemma 2 update that also introduced a 2B-parameter Gemma 2 model and the Gemma Scope interpretability tooling.^[3]^[4] Rather than ship a single closed safety filter, Google released the classifiers with open weights so that developers could run them locally, inspect them, and fine-tune them for a particular product or policy.^[3]

The accompanying report, "ShieldGemma: Generative AI Content Moderation Based on Gemma," was posted to arXiv on July 31, 2024.^[1] A notable claim in the paper is that the models were trained largely on synthetic data produced through an LLM-based data-curation pipeline, which the authors argue helped the classifiers generalize. They reported that ShieldGemma outperformed comparable systems on public benchmarks, including a roughly 10.8 percent improvement in AU-PRC over Llama Guard.^[1]

How it works

ShieldGemma is not a freestanding moderation API but a text-to-text, decoder-only language model that has been instruction-tuned for a classification job. A request is wrapped in a prompt template that states the safety policy in plain language and asks, in effect, whether the supplied content violates it. The model is steered to answer with a single token, "Yes" or "No."^[2]
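A minimal sketch of how such a prompt might be assembled for the prompt-only case. The function name `build_prompt`, the template wording, and the sample policy text are illustrative assumptions, not the official template; a real integration should use the template published with the checkpoint.

```python
def build_prompt(user_prompt: str, policy: str) -> str:
    """Wrap content and a plain-language policy in a Yes/No classification prompt.

    The wording below is an illustrative assumption, not the official template.
    """
    return (
        "You are a policy expert trying to help determine whether a user "
        "prompt is in violation of the defined safety policies.\n\n"
        f"Human Question: {user_prompt.strip()}\n\n"
        "Our safety principle is defined in the below:\n\n"
        f"* {policy.strip()}\n\n"
        "Does the human question violate the above principle? "
        "Your answer must start with 'Yes' or 'No'."
    )


# Hypothetical policy text, paraphrasing the harassment category.
policy = (
    '"No Harassment": The prompt shall not contain or seek generation of '
    "content that is malicious, intimidating, bullying, or abusive toward "
    "another individual."
)
prompt = build_prompt("How do I reset my router?", policy)
```

Because the policy is an ordinary string argument, swapping in a different guideline changes what the classifier is asked without touching the model.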

In practice the verdict is read from the model's output logits rather than from parsed text. An integration extracts the logits for the "Yes" and "No" tokens at the final position and applies a softmax over just those two, yielding a continuous score between 0 and 1. A higher score reflects greater confidence that the content breaks the stated policy, which lets a developer set their own decision threshold instead of accepting a fixed verdict.^[2] The same model handles two related modes: prompt-only classification, which screens an incoming user message, and prompt-response classification, which screens what a chatbot produced. The policy text shifts accordingly, from "the prompt shall not..." to "the chatbot shall not...".^[2]
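The scoring step can be sketched in a few lines of plain Python. This assumes the two logits have already been extracted from the model's final position; the function names, the sample logit values, and the 0.5 default threshold are hypothetical choices for illustration.

```python
import math


def violation_score(yes_logit: float, no_logit: float) -> float:
    """Softmax restricted to the "Yes" and "No" tokens, returning P("Yes")."""
    # Subtract the larger logit before exponentiating for numerical stability.
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


def is_violation(score: float, threshold: float = 0.5) -> bool:
    """Apply a developer-chosen threshold; 0.5 is an arbitrary default."""
    return score >= threshold


# Hypothetical logits, for illustration only.
score = violation_score(yes_logit=2.0, no_logit=-1.0)
flagged = is_violation(score, threshold=0.8)
```

Raising the threshold trades recall for precision: fewer borderline items are flagged, at the cost of letting more violations through.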

Because the policy lives in the prompt, the categories are configurable. The published definitions are defaults rather than a hard-coded taxonomy, and the open weights mean a team can further fine-tune a classifier toward its own guidelines.^[2]^[3] The first-generation models were trained on TPUv5e hardware using JAX and ML Pathways, and were released as English-language, text-only classifiers.^[2]
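One way to picture this configurability is a table of policy strings, each checked with its own threshold. The sketch below assumes one policy per classifier call and takes the scorer as a parameter; `POLICIES`, `flagged_categories`, and `fake_score` are hypothetical names, and `fake_score` is a stand-in that returns fixed values rather than calling a model.

```python
from typing import Callable, Dict, List

# Hypothetical policy texts; a team would substitute its own guidelines.
POLICIES: Dict[str, str] = {
    "dangerous_content": (
        "The prompt shall not contain or seek generation of content "
        "that facilitates harm to oneself or others."
    ),
    "harassment": (
        "The prompt shall not contain or seek generation of content "
        "that is malicious, intimidating, bullying, or abusive."
    ),
}


def flagged_categories(
    text: str,
    policies: Dict[str, str],
    score_fn: Callable[[str, str], float],
    thresholds: Dict[str, float],
    default_threshold: float = 0.5,
) -> List[str]:
    """Score `text` against each policy separately and list those exceeded."""
    flagged = []
    for name, policy in policies.items():
        score = score_fn(text, policy)  # one classifier call per policy
        if score >= thresholds.get(name, default_threshold):
            flagged.append(name)
    return flagged


def fake_score(text: str, policy: str) -> float:
    """Stand-in scorer for demonstration only; not a real classifier."""
    return 0.9 if "harm" in policy else 0.1


result = flagged_categories(
    "example input", POLICIES, fake_score, thresholds={"dangerous_content": 0.7}
)
```

Adding a product-specific category is then a matter of adding one dictionary entry and, optionally, a threshold for it.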

Harm categories

The original ShieldGemma model card defines four content-safety categories, each with an associated policy statement. The summaries below paraphrase those definitions.^[2]

Category	What the policy covers
Sexually explicit information	References to sexual acts or other lewd content, such as sexually graphic descriptions or material aimed at causing arousal; medical and scientific terms about anatomy or sex education are permitted.
Dangerous content	Content that facilitates harm to oneself or others, such as instructions for building firearms or explosives, promotion of terrorism, or instructions for suicide.
Hate speech	Content targeting identity or protected attributes, such as racial slurs, promotion of discrimination, or calls to violence against protected groups, including dehumanizing or vilifying language.
Harassment	Malicious, intimidating, bullying, or abusive content aimed at an individual, such as physical threats, denial of tragic events, or disparagement of victims of violence.

Model sizes

The first-generation suite shipped in three sizes, all built on Gemma 2 and distributed with open weights under Google's Gemma license. The smallest is meant for latency- or cost-sensitive filtering, while the larger checkpoints trade compute for accuracy.^[1]^[3]^[5]

Model	Parameters	Base model	Modality
shieldgemma-2b	2B	Gemma 2	Text
shieldgemma-9b	9B	Gemma 2	Text
shieldgemma-27b	27B	Gemma 2	Text

These checkpoints are available on Hugging Face and through other model hosts, and require accepting Google's usage terms before download.^[5]^[6]

ShieldGemma 2 (image safety)

ShieldGemma 2 extended the approach from text to images. It was announced on March 12, 2025, the same day Google introduced the Gemma 3 family, and was positioned as the safety companion to those multimodal open models.^[7]^[8] Unlike the original suite, it is a single 4-billion-parameter model built on Gemma 3's 4B instruction-tuned checkpoint, and it takes an image plus a policy as input rather than text alone.^[9] The detailed report, "ShieldGemma 2: Robust and Tractable Image Content Moderation," followed on arXiv on April 1, 2025.^[10]

The model classifies both natural photographs and synthetically generated images, which makes it usable in two roles: as an input filter for vision-language models and as an output filter for image-generation systems.^[7]^[9] It outputs the same kind of "Yes"/"No" probability as its predecessor.^[9] ShieldGemma 2 defines three image-safety categories.^[9]^[10]

Category	What the policy covers
Sexually explicit	Explicit or graphic sexual acts, such as pornography, erotic nudity, or depictions of sexual assault.
Dangerous content	Imagery facilitating real-world harm, such as building firearms or explosives, promotion of terrorism, or instructions for suicide.
Violence and gore	Shocking, sensational, or gratuitous violence, such as excessive blood and gore or extreme injury.

In the report's evaluation, ShieldGemma 2 led every comparison model across all three policies, with an average PR-AUC of 89.1 percent. The authors put that at improvements of 6.8, 12.9, and 14.8 percentage points over the base Gemma-3-4B-IT model, GPT-4o mini, and LlavaGuard 7B respectively, and credited a custom adversarial data-generation pipeline for the gains.^[8]^[10] The design parallels Google's other Gemma-derived vision work, such as PaliGemma, in adapting a Gemma backbone (here Gemma 3) to a focused task.

Availability and licensing

All ShieldGemma models are distributed with open weights under Google's Gemma terms of use, which permit commercial use subject to a prohibited-use policy and require accepting the license before download.^[6]^[9] The checkpoints are published on Hugging Face and are also served through third-party providers such as NVIDIA's NIM catalog.^[5]^[6] Google frames the models as a building block rather than a finished service: developers are expected to choose decision thresholds, adapt the policy text, and where needed fine-tune the weights to fit a specific application, since no single safety classifier fits every product.^[2]^[3]
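The building-block framing can be illustrated with a thin wrapper that screens a prompt before generation and the response after it. This is a sketch under stated assumptions: `guarded_reply`, `REFUSAL`, and the three callables are hypothetical, and in a real deployment the two scorers would call a safety classifier in prompt-only and prompt-response mode respectively.

```python
from typing import Callable

# Hypothetical fallback message returned when either check fails.
REFUSAL = "This request was blocked by the safety filter."


def guarded_reply(
    user_prompt: str,
    generate_fn: Callable[[str], str],
    prompt_score_fn: Callable[[str], float],
    response_score_fn: Callable[[str, str], float],
    prompt_threshold: float = 0.5,
    response_threshold: float = 0.5,
) -> str:
    """Screen the prompt, generate, then screen the response."""
    # Input filter: stop before the generative model is ever called.
    if prompt_score_fn(user_prompt) >= prompt_threshold:
        return REFUSAL
    response = generate_fn(user_prompt)
    # Output filter: check what the generative model produced.
    if response_score_fn(user_prompt, response) >= response_threshold:
        return REFUSAL
    return response


# Stub callables for demonstration only; none of them calls a real model.
reply = guarded_reply(
    "hello",
    generate_fn=lambda p: "hi there",
    prompt_score_fn=lambda p: 0.02,
    response_score_fn=lambda p, r: 0.01,
)
```

Keeping the two thresholds separate lets an application be stricter about what it emits than about what it accepts, or the reverse.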

References
