Purple Llama
Last reviewed
Apr 28, 2026
Sources
18 citations
Review status
Source-backed
Revision
v3 ยท 4,535 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Apr 28, 2026
Sources
18 citations
Review status
Source-backed
Revision
v3 ยท 4,535 words
Add missing citations, update stale details, or suggest a clearer explanation.
Purple Llama is an umbrella project from Meta that gathers open trust and safety tools and evaluations for generative AI. It was announced on December 7, 2023, alongside the release of Llama 2 Chat as the dominant open-weight chat model, and it has grown into the standard reference stack for developers who want to put input filters, output filters, code filters, and adversarial benchmarks around an open large language model before shipping it. Components include the Llama Guard family of safety classifiers, the CyberSec Eval (CyberSecEval) benchmark suite, Code Shield for inference-time code filtering, and Prompt Guard for jailbreak and prompt-injection detection. Most components are released under permissive licenses for research and commercial use [1][2].
The name borrows directly from the cybersecurity convention of "purple teaming," where a red team (offense) and a blue team (defense) work together. Meta argued that closing the gap between attack and defense is the only way to make generative AI safe at scale, and that this requires open tools that the wider community can audit, fork, and extend [1]. At launch, Meta said Purple Llama would be developed in collaboration with more than twenty industry partners, including AMD, AWS, Google Cloud, Hugging Face, IBM, Intel, Microsoft, MLCommons, NVIDIA, and Scale AI, and it later became one of the centerpiece projects of the AI Alliance [3].
Purple Llama was launched on December 7, 2023, four months after Meta released Llama 2 in July 2023. Llama 2 had quickly become the most-downloaded open-weight LLM family of 2023, but the open release also raised a practical question: how do downstream developers add the kind of safety scaffolding that closed-API providers like OpenAI and Anthropic maintained internally? Meta's answer was to ship safety as a separate, openly licensed product line rather than bake it only into the base model.
The initial Purple Llama release contained two components: the Llama Guard input and output safety classifier, and the first CyberSec Eval benchmark for evaluating cybersecurity risk in LLMs. The Llama Guard model was a fine-tune of Llama 2 7B, and CyberSec Eval drew its tests from established sources like the MITRE ATT&CK framework and the Common Weakness Enumeration (CWE). Both arrived with model cards, evaluation code, and prompt templates so that other teams could plug them into existing pipelines [1][4].
Meta framed the initial release as a starting point. The company said it would expand the project across the categories laid out in the U.S. National Institute of Standards and Technology (NIST) AI Risk Management Framework, with an explicit focus on the CBRNE category (chemical, biological, radiological, nuclear, and explosive risks), cyber risks, and child safety. The blog post also positioned the work as a contribution to the AI Alliance, the open-AI consortium that Meta and IBM had announced two days earlier on December 5, 2023 [1][3].
Purple teaming is a concept Meta borrowed from the cybersecurity world, where it has been used since the late 2000s. A red team plays the attacker, looking for ways to break a system. A blue team plays the defender, hardening the system against attack. A purple team is the institutional structure that makes the two camps work together rather than competing for credit. Meta argued that the same model fits generative AI: red-team-style adversarial testing of a model and blue-team-style mitigations like input and output filtering have to evolve in lockstep, because safety claims based on either alone are easy to bypass [1].
In the Purple Llama context, the red-team components are the CyberSec Eval benchmarks, which probe a model's willingness to assist with cyberattacks and its tendency to produce vulnerable code. The blue-team components are Llama Guard, Code Shield, and Prompt Guard, which sit in front of and behind the model in production and filter content. The naming choice also gestures at the Llama brand: the cover image used for the launch was a stylized purple llama, which became the project's visual identity on GitHub and Hugging Face.
Purple Llama groups its tools into two broad categories: safeguards (Llama Guard, Code Shield, Prompt Guard), which are runtime filters meant to be used in production deployments, and evaluations (CyberSec Eval), which are benchmark suites meant to be run during model development and regression testing.
| Component | Type | Role | First released | License |
|---|---|---|---|---|
| Llama Guard | Safeguard | Input and output safety classifier (text) | Dec 7, 2023 | Llama 2 Community License |
| Llama Guard 2 | Safeguard | 8B input/output classifier on Llama 3 | Apr 18, 2024 | Llama 3 Community License |
| Llama Guard 3 (8B and 1B) | Safeguard | Multilingual input/output classifier on Llama 3.1 / 3.2 | Jul 23, 2024 | Llama 3.2 Community License |
| Llama Guard 3 11B Vision | Safeguard | Multimodal input/output classifier with image understanding | Sep 25, 2024 | Llama 3.2 Community License |
| Llama Guard 4 12B | Safeguard | Natively multimodal classifier pruned from Llama 4 Scout | Apr 30, 2025 | Llama 4 Community License |
| Code Shield | Safeguard | Inference-time filter for insecure code outputs | 2024 | MIT |
| Prompt Guard 86M / 22M | Safeguard | Prompt-injection and jailbreak detector | Jul 23, 2024 | Llama 3.2 Community License |
| CyberSec Eval 1 | Evaluation | Insecure code suggestions and cyberattack helpfulness | Dec 7, 2023 | MIT |
| CyberSec Eval 2 | Evaluation | Adds prompt injection and code interpreter abuse | Apr 19, 2024 | MIT |
| CyberSec Eval 3 | Evaluation | Adds visual prompt injection, spear phishing, autonomous offensive cyber | Aug 1, 2024 | MIT |
Llama Guard is the original safety classifier in Purple Llama. It is an LLM-based input and output safeguard model designed for human-AI conversation use cases, and it works by reading either a user prompt or an assistant response and emitting a structured "safe" or "unsafe" verdict, along with the safety category that was violated when the verdict is unsafe. The first Llama Guard was a fine-tune of Llama 2 7B trained by Meta and described in a December 2023 paper led by Hakan Inan and ten coauthors [4].
The original Llama Guard taxonomy contained six in-policy categories: violence and hate, sexual content, criminal planning, guns and illegal weapons, regulated or controlled substances, and self-harm. A seventh class covered "safe" content. The architecture choice (a fine-tuned LLM rather than a small dedicated classifier) was a deliberate departure from the conventional content-moderation toolkit, which had typically relied on smaller transformer-based classifiers like the OpenAI Moderation API or Google's Perspective API. By using an LLM, Llama Guard could read free-form policy descriptions at inference time and adapt its classifications without retraining, a capability the paper called "taxonomy customization" or "zero-shot policy following" [4].
On the OpenAI Moderation Evaluation dataset, Llama Guard matched OpenAI's own moderation API in F1 score without any task-specific fine-tuning. On ToxicChat, a public dataset of real user-AI interactions, it outperformed all baselines including GPT-4 [4]. The combination of competitive accuracy, an open license, and the ability to swap in custom policies through prompting made it the default open-source safety classifier almost immediately after release.
Llama Guard 2 was released on April 18, 2024, alongside the launch of Llama 3. It was an 8 billion-parameter model based on Llama 3 8B and was specifically optimized to support the MLCommons AI Safety hazards taxonomy, which had just been published as part of the v0.5 proof-of-concept benchmark. Llama Guard 2 covered eleven of the thirteen MLCommons hazard categories out of the box and reported substantial accuracy improvements over the original Llama Guard, in part because of the larger and more capable base model and in part because of a higher-quality training set built around the MLCommons taxonomy [5][6].
Llama Guard 3 was released on July 23, 2024, alongside Llama 3.1, with a smaller 1B variant added on September 25, 2024 alongside Llama 3.2. Two changes mattered most. First, Llama Guard 3 became multilingual: it provides content moderation in eight languages, including English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai. Second, it expanded its taxonomy coverage to all thirteen MLCommons categories and was specifically trained to handle safety for tool calls, including code interpreter and search tool invocations [7][8].
The 1B variant exists for edge deployment. Meta reported that quantized 1B versions can run on mobile-class hardware while still providing useful filtering for chat applications, which made it a popular choice for on-device assistants where calling out to a larger guard model would add unacceptable latency [7].
Llama Guard 3 11B Vision was the first Purple Llama component to handle images directly. Released alongside Llama 3.2 in late September 2024, it accepts a text prompt and one or more images and classifies the combined input as safe or unsafe under the MLCommons taxonomy. Internally, it uses the same vision encoder architecture as Llama 3.2 11B Vision and a similar tokenized representation, which lets it pick up on hazards that a text-only guard would miss, such as a prompt asking for instructions about a dangerous object visible only in the attached image [9].
Llama Guard 4 12B was released on April 30, 2025, several weeks after the launch of Llama 4. It is a 12 billion-parameter natively multimodal safety classifier that combines the capabilities of the previous Llama Guard 3 8B and Llama Guard 3 11B Vision models in a single model. Architecturally, Llama Guard 4 is unusual: rather than fine-tuning Llama 4 Scout directly, Meta pruned Scout's mixture-of-experts layers down to dense feedforward layers, keeping only the shared expert in each MoE layer and discarding the routed experts and routing networks. The result is a dense early-fusion transformer with the same depth as Scout but a much smaller serving footprint [10][11].
Llama Guard 4 supports the same eight languages as Llama Guard 3 for text and accepts mixed text-and-image inputs. Meta reported that it roughly matches or exceeds the overall performance of Llama Guard 3 8B on text and Llama Guard 3 11B Vision on multimodal inputs, while collapsing the safety stack into a single deployment artifact instead of two [10].
CyberSec Eval, also written CyberSecEval, is the cybersecurity evaluation suite within Purple Llama. It is a benchmark and not a runtime filter, intended to be used during model development to characterize how a model behaves on cybersecurity-relevant tasks. The evaluations are released as open-source code with prompt templates, scoring scripts, and reference results. Each version has expanded the test surface significantly.
The first CyberSec Eval was released alongside Purple Llama on December 7, 2023, and was described as the "first industry-wide set of cybersecurity safety evaluations for LLMs." It focused on two domains: the propensity of LLMs to suggest insecure code when used as coding assistants, and the willingness of LLMs to comply when asked to help with a cyberattack [12]. The insecure code tests covered eight programming languages and were derived from CWE entries; the cyberattack helpfulness tests were derived from MITRE ATT&CK techniques.
In Meta's initial run, the tested LLMs (including Llama 2, Code Llama, OpenAI's GPT-3.5 and GPT-4, and others) suggested vulnerable code in roughly 30 percent of relevant test cases on average, a finding that made CyberSec Eval a frequently cited reference point for the security community in early 2024 [12].
CyberSec Eval 2 was released on April 19, 2024 and added two new categories: prompt injection susceptibility and code interpreter abuse. The prompt-injection portion alone became one of the most influential pieces of the suite. It tested whether models would follow injected instructions buried in tool outputs or supposedly trusted documents, and reported that all evaluated models were vulnerable, with success rates between 13 and 47 percent and an average around 28 percent. The paper concluded that conditioning LLMs against prompt injection remained an unsolved problem and that application developers should not assume LLMs can be trusted to follow system prompts in the face of basic adversarial input [13].
CyberSec Eval 3 was released on August 1, 2024 and added three more test suites: visual prompt injection, spear phishing, and autonomous offensive cyber operations. The visual prompt injection tests probe whether multimodal LLMs follow instructions hidden inside images supplied as part of a prompt. The spear phishing tests measure how convincingly an LLM can craft targeted phishing messages when given background information on a target. The autonomous offensive cyber tests measure whether an LLM can plan and execute multi-step intrusion sequences in a sandboxed environment [14].
Code Shield is the inference-time filter for insecure code in Purple Llama. Unlike Llama Guard, which filters general content, Code Shield is purpose-built to scan code that an LLM produces and reject or annotate suggestions that contain known insecure patterns. It uses static analysis tools and pattern matching against the Insecure Code Detector dataset (the same dataset used by CyberSec Eval) to flag matches, and it can run as a streaming filter so that an integrated development environment can stop or warn on a suggestion before the user accepts it [2][15].
Code Shield is released under the MIT license, which makes it the most permissively licensed component of Purple Llama and the easiest to embed in commercial code editors. It is intended as the runtime counterpart to CyberSec Eval: where CyberSec Eval characterizes a model's offline behavior on insecure-code tasks, Code Shield reduces the rate at which insecure code reaches users in production. Meta also positioned it as a defense against "code interpreter abuse," the scenario where an attacker uses a code interpreter tool attached to an LLM to execute malicious code in a hosted environment [2].
Prompt Guard is the prompt-injection and jailbreak detector in Purple Llama. It is a small classifier model (the original Prompt Guard 86M, with a smaller 22M variant added later) trained on a large corpus of prompt-injection and jailbreak attempts from public datasets and Meta-generated adversarial data. It outputs three labels per input: "benign," "injection" (suspected prompt-injection content embedded in tool output or retrieved documents), and "jailbreak" (suspected attempts by an end user to override a system prompt) [16].
Prompt Guard's design point is different from Llama Guard. Llama Guard is a slow, accurate classifier meant to be invoked once per turn for nuanced safety judgments; Prompt Guard is a small, fast classifier meant to be invoked on every chunk of incoming text, including retrieved documents and tool outputs, before that text reaches the main LLM. The 22M variant in particular is small enough to run inline on every API request without meaningful latency. Prompt Guard supports attacks machine-translated into the same eight languages as Llama Guard 3, although Meta explicitly notes that translated attacks are a moving target and that custom fine-tuning is recommended for production deployments [16].
Purple Llama components share a common architectural philosophy: each safeguard is itself an LLM (or a pruned LLM, in the case of Llama Guard 4), trained with instruction tuning to produce a structured safety verdict rather than free-form text. This design has three consequences.
First, the safeguards inherit the strengths of the base model. Llama Guard 3 8B understands eight languages because Llama 3.1 understands those languages; Llama Guard 4 understands images because Llama 4 Scout was pretrained jointly on text and images. Each generation gets a free upgrade in raw capability whenever the underlying Llama family advances.
Second, customization is straightforward. Because the safeguard reads its policy at inference time, a developer can change the categories the model checks for by editing the prompt rather than retraining the model. The Llama Guard paper argued that this is the most important practical advantage over fixed-policy moderation APIs, because real applications need policies tuned to their specific risk surface [4].
Third, the safeguard is itself subject to the same failure modes as a generative LLM, including hallucination, miscalibration on out-of-distribution input, and degraded performance on languages not seen during fine-tuning. Meta has consistently warned that Purple Llama components are starting points and not final mitigations, and that production deployments should add layered defenses, evaluation pipelines, and human review for high-stakes use [2][7].
Training details vary by component but generally follow a pattern of supervised fine-tuning on Meta-curated and partner-contributed safety data, often with an additional preference-tuning step. The original Llama Guard was trained on roughly 13,000 prompt-response examples covering its six in-policy categories, a deliberately small dataset that was easier to audit and curate carefully than a web-scale safety corpus [4]. Later generations expanded both the dataset and the taxonomy, but the philosophy of working from a small high-quality corpus rather than a large noisy one carried through.
Purple Llama was launched with an unusually broad partner list for an AI safety project. Meta named the AI Alliance, AMD, Anyscale, AWS, Bain, Cloudflare, Databricks, Dell Technologies, Dropbox, Google Cloud, Hugging Face, IBM, Intel, Lightning AI, Microsoft, MLCommons, NVIDIA, Oracle, Orange, Scale AI, and Together.AI as launch partners, with more expected to join [1][3]. The partner list was designed to signal that Purple Llama would be hosted, distributed, and integrated through every major cloud and inference platform from day one, lowering the barrier for developers to actually use the tools.
That positioning paid off quickly. Within weeks of launch, Llama Guard was available on AWS, Azure, Google Cloud, Hugging Face, and Together.AI, and within a few months it had become the default open-source safety classifier referenced in the prompt-engineering documentation of major LLM frameworks. NVIDIA shipped a NIM (NVIDIA Inference Microservice) container for Llama Guard, and IBM integrated it into watsonx.governance. Hugging Face Spaces hosted public demos that let any developer run a free safety check against an arbitrary prompt without any setup.
The MLCommons relationship was particularly important. MLCommons used Llama Guard as the automated evaluator for its AI Safety v0.5 proof-of-concept benchmark, released in April 2024, which used more than 43,000 prompts to evaluate model safety against the thirteen-category MLCommons taxonomy. That benchmark fed into the AILuminate v1.1 release that followed in 2025, which became the most widely cited cross-vendor AI safety benchmark in the industry [17][18].
Purple Llama was also folded into Meta's broader open-source AI strategy through the AI Alliance, the consortium Meta and IBM founded in December 2023. The alliance's safety working group has used Purple Llama components as reference implementations and benchmarks for member-contributed safety tooling.
The most-cited benchmark numbers for Llama Guard come from the original 2023 paper and from the model cards for later generations. Reported F1 scores on standard moderation benchmarks are summarized below. All numbers are taken from Meta's published results and should be interpreted as Meta's own benchmarking, not independent reproduction.
| Benchmark | Llama Guard (7B) | Llama Guard 2 (8B) | Llama Guard 3 (8B) | OpenAI Moderation API |
|---|---|---|---|---|
| OpenAI Mod. (F1, English) | 0.761 | 0.788 | 0.825 | 0.794 |
| ToxicChat (F1, English) | 0.626 | 0.713 | 0.752 | 0.252 |
| BeaverTails (F1, English) | 0.702 | 0.778 | 0.823 | 0.539 |
| XSTest (F1, English) | n/a | 0.880 | 0.904 | 0.661 |
| Multilingual avg. (F1) | n/a | n/a | 0.745 | 0.398 |
Several patterns are visible. The Llama Guard family substantially outperforms the OpenAI Moderation API on ToxicChat, the dataset built from real user-AI dialogue, where the Moderation API was trained on a different distribution and underperforms. Llama Guard's advantage holds across multilingual evaluations. The Moderation API was not designed for non-English inputs, which means the gap widens further outside English [4][7].
Llama Guard 4 was reported to roughly match or exceed Llama Guard 3 8B on text and Llama Guard 3 11B Vision on multimodal evaluations, with the additional advantage of unifying the two safety models into a single deployment. Meta did not release a head-to-head F1 number against the OpenAI Moderation API in the Llama Guard 4 model card, but the trend across generations has been steady improvement on most benchmarks [10].
| Date | Release |
|---|---|
| Dec 7, 2023 | Purple Llama announced; Llama Guard (7B) and CyberSec Eval 1 released |
| Apr 18, 2024 | Llama Guard 2 (8B) released alongside Llama 3 |
| Apr 19, 2024 | CyberSec Eval 2 released |
| Jul 23, 2024 | Llama Guard 3 (8B), Prompt Guard (86M and 22M) released alongside Llama 3.1 |
| Aug 1, 2024 | CyberSec Eval 3 released |
| Sep 25, 2024 | Llama Guard 3 1B and Llama Guard 3 11B Vision released alongside Llama 3.2 |
| Apr 30, 2025 | Llama Guard 4 12B released following Llama 4 launch |
Code Shield was released as part of the safety stack accompanying Llama 3 in April 2024, and has continued to receive updates alongside subsequent Llama generations. Purple Llama also receives smaller continuous updates through its GitHub repository, including new evaluations, refreshed datasets, and prompt template changes.
Purple Llama is one of several public safety toolkits that emerged or expanded in 2023 and 2024. The closest comparisons are toolkits that ship with weights or that expose the policy and prompts they use rather than hiding them behind an API.
| Toolkit | Vendor | Type | Open weights | Multimodal | Prompt injection | Code filter |
|---|---|---|---|---|---|---|
| Purple Llama | Meta | Open ecosystem (filters + benchmarks) | Yes (Llama Community License) | Yes (Guard 4) | Yes (Prompt Guard, CyberSec Eval) | Yes (Code Shield) |
| OpenAI Moderation API | OpenAI | API service | No | Limited (text-only as of 2024) | No | No |
| Google Perspective API | Google Jigsaw | API service | No | No | No | No |
| Google ShieldGemma | Open weights classifier | Yes (Gemma license) | Limited | No | No | |
| NVIDIA NeMo Guardrails | NVIDIA | Open SDK / framework | Bring your own model | Bring your own | Yes (rule-based) | Limited |
| Anthropic Constitutional Classifiers | Anthropic | Internal API + paper | Partial | Yes | Some | No |
| IBM Granite Guardian | IBM | Open weights classifier | Yes (Apache 2.0) | Yes | Yes | No |
| LlamaFirewall | Meta (community) | Open SDK | Bring your own | Yes | Yes | Limited |
The distinguishing features of Purple Llama are the combination of open weights for the safeguards, a permissive license for the evaluation suites, and the breadth of the project (Llama Guard, Prompt Guard, Code Shield, and CyberSec Eval all under one umbrella). Most competing toolkits cover one or two of those areas. The Llama Guard paper's argument that an LLM-based safeguard can adapt to new policies through prompting alone is the central design choice that differentiates the project from classifier-based moderation APIs [4].
The MIT license on the CyberSec Eval suites and Code Shield is also notable. Most other vendors release benchmarks under more restrictive licenses or behind sign-up walls, which makes CyberSec Eval easier to integrate into open evaluation harnesses like LM Evaluation Harness and Inspect.
Purple Llama has practical limitations that Meta has acknowledged in model cards and that practitioners have surfaced in independent reviews: