Purple Llama

Purple Llama is an umbrella project from Meta that gathers open trust and safety tools and evaluations for generative AI. It was announced on December 7, 2023, alongside the release of Llama 2 Chat as the dominant open-weight chat model, and it has grown into the standard reference stack for developers who want to put input filters, output filters, code filters, and adversarial benchmarks around an open large language model before shipping it. Components include the Llama Guard family of safety classifiers, the CyberSec Eval (CyberSecEval) benchmark suite, Code Shield for inference-time code filtering, and Prompt Guard for jailbreak and prompt-injection detection. Most components are released under permissive licenses for research and commercial use ^[1]^[2].

The name borrows directly from the cybersecurity convention of "purple teaming," where a red team (offense) and a blue team (defense) work together. Meta argued that closing the gap between attack and defense is the only way to make generative AI safe at scale, and that this requires open tools that the wider community can audit, fork, and extend ^[1]. At launch, Meta said Purple Llama would be developed in collaboration with more than twenty industry partners, including AMD, AWS, Google Cloud, Hugging Face, IBM, Intel, Microsoft, MLCommons, NVIDIA, and Scale AI, and it later became one of the centerpiece projects of the AI Alliance ^[3].

Origin

Purple Llama was launched on December 7, 2023, four months after Meta released Llama 2 in July 2023. Llama 2 had quickly become the most-downloaded open-weight LLM family of 2023, but the open release also raised a practical question: how do downstream developers add the kind of safety scaffolding that closed-API providers like OpenAI and Anthropic maintained internally? Meta's answer was to ship safety as a separate, openly licensed product line rather than bake it only into the base model.

The initial Purple Llama release contained two components: the Llama Guard input and output safety classifier, and the first CyberSec Eval benchmark for evaluating cybersecurity risk in LLMs. The Llama Guard model was a fine-tune of Llama 2 7B, and CyberSec Eval drew its tests from established sources like the MITRE ATT&CK framework and the Common Weakness Enumeration (CWE). Both arrived with model cards, evaluation code, and prompt templates so that other teams could plug them into existing pipelines ^[1]^[4].

Meta framed the initial release as a starting point. The company said it would expand the project across the categories laid out in the U.S. National Institute of Standards and Technology (NIST) AI Risk Management Framework, with an explicit focus on the CBRNE category (chemical, biological, radiological, nuclear, and explosive risks), cyber risks, and child safety. The blog post also positioned the work as a contribution to the AI Alliance, the open-AI consortium that Meta and IBM had announced two days earlier on December 5, 2023 ^[1]^[3].

The purple teaming concept

Purple teaming is a concept Meta borrowed from the cybersecurity world, where it has been used since the late 2000s. A red team plays the attacker, looking for ways to break a system. A blue team plays the defender, hardening the system against attack. A purple team is the institutional structure that makes the two camps work together rather than competing for credit. Meta argued that the same model fits generative AI: red-team-style adversarial testing of a model and blue-team-style mitigations like input and output filtering have to evolve in lockstep, because safety claims based on either alone are easy to bypass ^[1].

In the Purple Llama context, the red-team components are the CyberSec Eval benchmarks, which probe a model's willingness to assist with cyberattacks and its tendency to produce vulnerable code. The blue-team components are Llama Guard, Code Shield, and Prompt Guard, which sit in front of and behind the model in production and filter content. The naming choice also gestures at the Llama brand: the cover image used for the launch was a stylized purple llama, which became the project's visual identity on GitHub and Hugging Face.

Components

Purple Llama groups its tools into two broad categories: safeguards (Llama Guard, Code Shield, Prompt Guard), which are runtime filters meant to be used in production deployments, and evaluations (CyberSec Eval), which are benchmark suites meant to be run during model development and regression testing.

Component	Type	Role	First released	License
Llama Guard	Safeguard	Input and output safety classifier (text)	Dec 7, 2023	Llama 2 Community License
Llama Guard 2	Safeguard	8B input/output classifier on Llama 3	Apr 18, 2024	Llama 3 Community License
Llama Guard 3 (8B and 1B)	Safeguard	Multilingual input/output classifier on Llama 3.1 / 3.2	Jul 23, 2024	Llama 3.2 Community License
Llama Guard 3 11B Vision	Safeguard	Multimodal input/output classifier with image understanding	Sep 25, 2024	Llama 3.2 Community License
Llama Guard 4 12B	Safeguard	Natively multimodal classifier pruned from Llama 4 Scout	Apr 30, 2025	Llama 4 Community License
Code Shield	Safeguard	Inference-time filter for insecure code outputs	2024	MIT
Prompt Guard 86M / 22M	Safeguard	Prompt-injection and jailbreak detector	Jul 23, 2024	Llama 3.2 Community License
CyberSec Eval 1	Evaluation	Insecure code suggestions and cyberattack helpfulness	Dec 7, 2023	MIT
CyberSec Eval 2	Evaluation	Adds prompt injection and code interpreter abuse	Apr 19, 2024	MIT
CyberSec Eval 3	Evaluation	Adds visual prompt injection, spear phishing, autonomous offensive cyber	Aug 1, 2024	MIT

Llama Guard

Llama Guard is the original safety classifier in Purple Llama. It is an LLM-based input and output safeguard model designed for human-AI conversation use cases, and it works by reading either a user prompt or an assistant response and emitting a structured "safe" or "unsafe" verdict, along with the safety category that was violated when the verdict is unsafe. The first Llama Guard was a fine-tune of Llama 2 7B trained by Meta and described in a December 2023 paper led by Hakan Inan and ten coauthors ^[4].

The original Llama Guard taxonomy contained six in-policy categories: violence and hate, sexual content, criminal planning, guns and illegal weapons, regulated or controlled substances, and self-harm. A seventh class covered "safe" content. The architecture choice (a fine-tuned LLM rather than a small dedicated classifier) was a deliberate departure from the conventional content-moderation toolkit, which had typically relied on smaller transformer-based classifiers like the OpenAI Moderation API or Google's Perspective API. By using an LLM, Llama Guard could read free-form policy descriptions at inference time and adapt its classifications without retraining, a capability the paper called "taxonomy customization" or "zero-shot policy following" ^[4].

On the OpenAI Moderation Evaluation dataset, Llama Guard matched OpenAI's own moderation API in F1 score without any task-specific fine-tuning. On ToxicChat, a public dataset of real user-AI interactions, it outperformed all baselines including GPT-4 ^[4]. The combination of competitive accuracy, an open license, and the ability to swap in custom policies through prompting made it the default open-source safety classifier almost immediately after release.

Llama Guard 2

Llama Guard 2 was released on April 18, 2024, alongside the launch of Llama 3. It was an 8 billion-parameter model based on Llama 3 8B and was specifically optimized to support the MLCommons AI Safety hazards taxonomy, which had just been published as part of the v0.5 proof-of-concept benchmark. Llama Guard 2 covered eleven of the thirteen MLCommons hazard categories out of the box and reported substantial accuracy improvements over the original Llama Guard, in part because of the larger and more capable base model and in part because of a higher-quality training set built around the MLCommons taxonomy ^[5]^[6].

Llama Guard 3

Llama Guard 3 was released on July 23, 2024, alongside Llama 3.1, with a smaller 1B variant added on September 25, 2024 alongside Llama 3.2. Two changes mattered most. First, Llama Guard 3 became multilingual: it provides content moderation in eight languages, including English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai. Second, it expanded its taxonomy coverage to all thirteen MLCommons categories and was specifically trained to handle safety for tool calls, including code interpreter and search tool invocations ^[7]^[8].

The 1B variant exists for edge deployment. Meta reported that quantized 1B versions can run on mobile-class hardware while still providing useful filtering for chat applications, which made it a popular choice for on-device assistants where calling out to a larger guard model would add unacceptable latency ^[7].

Llama Guard 3 Vision (11B Vision)

Llama Guard 3 11B Vision was the first Purple Llama component to handle images directly. Released alongside Llama 3.2 in late September 2024, it accepts a text prompt and one or more images and classifies the combined input as safe or unsafe under the MLCommons taxonomy. Internally, it uses the same vision encoder architecture as Llama 3.2 11B Vision and a similar tokenized representation, which lets it pick up on hazards that a text-only guard would miss, such as a prompt asking for instructions about a dangerous object visible only in the attached image ^[9].

Llama Guard 4

Llama Guard 4 12B was released on April 30, 2025, several weeks after the launch of Llama 4. It is a 12 billion-parameter natively multimodal safety classifier that combines the capabilities of the previous Llama Guard 3 8B and Llama Guard 3 11B Vision models in a single model. Architecturally, Llama Guard 4 is unusual: rather than fine-tuning Llama 4 Scout directly, Meta pruned Scout's mixture-of-experts layers down to dense feedforward layers, keeping only the shared expert in each MoE layer and discarding the routed experts and routing networks. The result is a dense early-fusion transformer with the same depth as Scout but a much smaller serving footprint ^[10]^[11].

Llama Guard 4 supports the same eight languages as Llama Guard 3 for text and accepts mixed text-and-image inputs. Meta reported that it roughly matches or exceeds the overall performance of Llama Guard 3 8B on text and Llama Guard 3 11B Vision on multimodal inputs, while collapsing the safety stack into a single deployment artifact instead of two ^[10].

CyberSec Eval (CyberSecEval)

CyberSec Eval, also written CyberSecEval, is the cybersecurity evaluation suite within Purple Llama. It is a benchmark and not a runtime filter, intended to be used during model development to characterize how a model behaves on cybersecurity-relevant tasks. The evaluations are released as open-source code with prompt templates, scoring scripts, and reference results. Each version has expanded the test surface significantly.

CyberSec Eval 1

The first CyberSec Eval was released alongside Purple Llama on December 7, 2023, and was described as the "first industry-wide set of cybersecurity safety evaluations for LLMs." It focused on two domains: the propensity of LLMs to suggest insecure code when used as coding assistants, and the willingness of LLMs to comply when asked to help with a cyberattack ^[12]. The insecure code tests covered eight programming languages and were derived from CWE entries; the cyberattack helpfulness tests were derived from MITRE ATT&CK techniques.

In Meta's initial run, the tested LLMs (including Llama 2, Code Llama, OpenAI's GPT-3.5 and GPT-4, and others) suggested vulnerable code in roughly 30 percent of relevant test cases on average, a finding that made CyberSec Eval a frequently cited reference point for the security community in early 2024 ^[12].

CyberSec Eval 2

CyberSec Eval 2 was released on April 19, 2024 and added two new categories: prompt injection susceptibility and code interpreter abuse. The prompt-injection portion alone became one of the most influential pieces of the suite. It tested whether models would follow injected instructions buried in tool outputs or supposedly trusted documents, and reported that all evaluated models were vulnerable, with success rates between 13 and 47 percent and an average around 28 percent. The paper concluded that conditioning LLMs against prompt injection remained an unsolved problem and that application developers should not assume LLMs can be trusted to follow system prompts in the face of basic adversarial input ^[13].

CyberSec Eval 3

CyberSec Eval 3 was released on August 1, 2024 and added three more test suites: visual prompt injection, spear phishing, and autonomous offensive cyber operations. The visual prompt injection tests probe whether multimodal LLMs follow instructions hidden inside images supplied as part of a prompt. The spear phishing tests measure how convincingly an LLM can craft targeted phishing messages when given background information on a target. The autonomous offensive cyber tests measure whether an LLM can plan and execute multi-step intrusion sequences in a sandboxed environment ^[14].

Code Shield

Code Shield is the inference-time filter for insecure code in Purple Llama. Unlike Llama Guard, which filters general content, Code Shield is purpose-built to scan code that an LLM produces and reject or annotate suggestions that contain known insecure patterns. It uses static analysis tools and pattern matching against the Insecure Code Detector dataset (the same dataset used by CyberSec Eval) to flag matches, and it can run as a streaming filter so that an integrated development environment can stop or warn on a suggestion before the user accepts it ^[2]^[15].

Code Shield is released under the MIT license, which makes it the most permissively licensed component of Purple Llama and the easiest to embed in commercial code editors. It is intended as the runtime counterpart to CyberSec Eval: where CyberSec Eval characterizes a model's offline behavior on insecure-code tasks, Code Shield reduces the rate at which insecure code reaches users in production. Meta also positioned it as a defense against "code interpreter abuse," the scenario where an attacker uses a code interpreter tool attached to an LLM to execute malicious code in a hosted environment ^[2].

Prompt Guard

Prompt Guard is the prompt-injection and jailbreak detector in Purple Llama. It is a small classifier model (the original Prompt Guard 86M, with a smaller 22M variant added later) trained on a large corpus of prompt-injection and jailbreak attempts from public datasets and Meta-generated adversarial data. It outputs three labels per input: "benign," "injection" (suspected prompt-injection content embedded in tool output or retrieved documents), and "jailbreak" (suspected attempts by an end user to override a system prompt) ^[16].

Prompt Guard's design point is different from Llama Guard. Llama Guard is a slow, accurate classifier meant to be invoked once per turn for nuanced safety judgments; Prompt Guard is a small, fast classifier meant to be invoked on every chunk of incoming text, including retrieved documents and tool outputs, before that text reaches the main LLM. The 22M variant in particular is small enough to run inline on every API request without meaningful latency. Prompt Guard supports attacks machine-translated into the same eight languages as Llama Guard 3, although Meta explicitly notes that translated attacks are a moving target and that custom fine-tuning is recommended for production deployments ^[16].

Architecture and training

Purple Llama components share a common architectural philosophy: each safeguard is itself an LLM (or a pruned LLM, in the case of Llama Guard 4), trained with instruction tuning to produce a structured safety verdict rather than free-form text. This design has three consequences.

First, the safeguards inherit the strengths of the base model. Llama Guard 3 8B understands eight languages because Llama 3.1 understands those languages; Llama Guard 4 understands images because Llama 4 Scout was pretrained jointly on text and images. Each generation gets a free upgrade in raw capability whenever the underlying Llama family advances.

Second, customization is straightforward. Because the safeguard reads its policy at inference time, a developer can change the categories the model checks for by editing the prompt rather than retraining the model. The Llama Guard paper argued that this is the most important practical advantage over fixed-policy moderation APIs, because real applications need policies tuned to their specific risk surface ^[4].

Third, the safeguard is itself subject to the same failure modes as a generative LLM, including hallucination, miscalibration on out-of-distribution input, and degraded performance on languages not seen during fine-tuning. Meta has consistently warned that Purple Llama components are starting points and not final mitigations, and that production deployments should add layered defenses, evaluation pipelines, and human review for high-stakes use ^[2]^[7].

Training details vary by component but generally follow a pattern of supervised fine-tuning on Meta-curated and partner-contributed safety data, often with an additional preference-tuning step. The original Llama Guard was trained on roughly 13,000 prompt-response examples covering its six in-policy categories, a deliberately small dataset that was easier to audit and curate carefully than a web-scale safety corpus ^[4]. Later generations expanded both the dataset and the taxonomy, but the philosophy of working from a small high-quality corpus rather than a large noisy one carried through.

Partners and adoption

Purple Llama was launched with an unusually broad partner list for an AI safety project. Meta named the AI Alliance, AMD, Anyscale, AWS, Bain, Cloudflare, Databricks, Dell Technologies, Dropbox, Google Cloud, Hugging Face, IBM, Intel, Lightning AI, Microsoft, MLCommons, NVIDIA, Oracle, Orange, Scale AI, and Together.AI as launch partners, with more expected to join ^[1]^[3]. The partner list was designed to signal that Purple Llama would be hosted, distributed, and integrated through every major cloud and inference platform from day one, lowering the barrier for developers to actually use the tools.

That positioning paid off quickly. Within weeks of launch, Llama Guard was available on AWS, Azure, Google Cloud, Hugging Face, and Together.AI, and within a few months it had become the default open-source safety classifier referenced in the prompt-engineering documentation of major LLM frameworks. NVIDIA shipped a NIM (NVIDIA Inference Microservice) container for Llama Guard, and IBM integrated it into watsonx.governance. Hugging Face Spaces hosted public demos that let any developer run a free safety check against an arbitrary prompt without any setup.

The MLCommons relationship was particularly important. MLCommons used Llama Guard as the automated evaluator for its AI Safety v0.5 proof-of-concept benchmark, released in April 2024, which used more than 43,000 prompts to evaluate model safety against the thirteen-category MLCommons taxonomy. That benchmark fed into the AILuminate v1.1 release that followed in 2025, which became the most widely cited cross-vendor AI safety benchmark in the industry ^[17]^[18].

Purple Llama was also folded into Meta's broader open-source AI strategy through the AI Alliance, the consortium Meta and IBM founded in December 2023. The alliance's safety working group has used Purple Llama components as reference implementations and benchmarks for member-contributed safety tooling.

Benchmark comparisons

The most-cited benchmark numbers for Llama Guard come from the original 2023 paper and from the model cards for later generations. Reported F1 scores on standard moderation benchmarks are summarized below. All numbers are taken from Meta's published results and should be interpreted as Meta's own benchmarking, not independent reproduction.

Benchmark	Llama Guard (7B)	Llama Guard 2 (8B)	Llama Guard 3 (8B)	OpenAI Moderation API
OpenAI Mod. (F1, English)	0.761	0.788	0.825	0.794
ToxicChat (F1, English)	0.626	0.713	0.752	0.252
BeaverTails (F1, English)	0.702	0.778	0.823	0.539
XSTest (F1, English)	n/a	0.880	0.904	0.661
Multilingual avg. (F1)	n/a	n/a	0.745	0.398

Several patterns are visible. The Llama Guard family substantially outperforms the OpenAI Moderation API on ToxicChat, the dataset built from real user-AI dialogue, where the Moderation API was trained on a different distribution and underperforms. Llama Guard's advantage holds across multilingual evaluations. The Moderation API was not designed for non-English inputs, which means the gap widens further outside English ^[4]^[7].

Llama Guard 4 was reported to roughly match or exceed Llama Guard 3 8B on text and Llama Guard 3 11B Vision on multimodal evaluations, with the additional advantage of unifying the two safety models into a single deployment. Meta did not release a head-to-head F1 number against the OpenAI Moderation API in the Llama Guard 4 model card, but the trend across generations has been steady improvement on most benchmarks ^[10].

Releases timeline

Date	Release
Dec 7, 2023	Purple Llama announced; Llama Guard (7B) and CyberSec Eval 1 released
Apr 18, 2024	Llama Guard 2 (8B) released alongside Llama 3
Apr 19, 2024	CyberSec Eval 2 released
Jul 23, 2024	Llama Guard 3 (8B), Prompt Guard (86M and 22M) released alongside Llama 3.1
Aug 1, 2024	CyberSec Eval 3 released
Sep 25, 2024	Llama Guard 3 1B and Llama Guard 3 11B Vision released alongside Llama 3.2
Apr 30, 2025	Llama Guard 4 12B released following Llama 4 launch

Code Shield was released as part of the safety stack accompanying Llama 3 in April 2024, and has continued to receive updates alongside subsequent Llama generations. Purple Llama also receives smaller continuous updates through its GitHub repository, including new evaluations, refreshed datasets, and prompt template changes.

Comparison to other safety toolkits

Purple Llama is one of several public safety toolkits that emerged or expanded in 2023 and 2024. The closest comparisons are toolkits that ship with weights or that expose the policy and prompts they use rather than hiding them behind an API.

Toolkit	Vendor	Type	Open weights	Multimodal	Prompt injection	Code filter
Purple Llama	Meta	Open ecosystem (filters + benchmarks)	Yes (Llama Community License)	Yes (Guard 4)	Yes (Prompt Guard, CyberSec Eval)	Yes (Code Shield)
OpenAI Moderation API	OpenAI	API service	No	Limited (text-only as of 2024)	No	No
Google Perspective API	Google Jigsaw	API service	No	No	No	No
Google ShieldGemma	Google	Open weights classifier	Yes (Gemma license)	Limited	No	No
NVIDIA NeMo Guardrails	NVIDIA	Open SDK / framework	Bring your own model	Bring your own	Yes (rule-based)	Limited
Anthropic Constitutional Classifiers	Anthropic	Internal API + paper	Partial	Yes	Some	No
IBM Granite Guardian	IBM	Open weights classifier	Yes (Apache 2.0)	Yes	Yes	No
LlamaFirewall	Meta (community)	Open SDK	Bring your own	Yes	Yes	Limited

The distinguishing features of Purple Llama are the combination of open weights for the safeguards, a permissive license for the evaluation suites, and the breadth of the project (Llama Guard, Prompt Guard, Code Shield, and CyberSec Eval all under one umbrella). Most competing toolkits cover one or two of those areas. The Llama Guard paper's argument that an LLM-based safeguard can adapt to new policies through prompting alone is the central design choice that differentiates the project from classifier-based moderation APIs ^[4].

The MIT license on the CyberSec Eval suites and Code Shield is also notable. Most other vendors release benchmarks under more restrictive licenses or behind sign-up walls, which makes CyberSec Eval easier to integrate into open evaluation harnesses like LM Evaluation Harness and Inspect.

Limitations

Purple Llama has practical limitations that Meta has acknowledged in model cards and that practitioners have surfaced in independent reviews:

Inherits base-model failure modes. The Llama Guard family is itself an LLM, so it can hallucinate, misclassify edge cases, and degrade on out-of-distribution input. The model card explicitly recommends layered defenses and human review for high-stakes deployments ^[7].
English bias. Even Llama Guard 3, which adds seven non-English languages, performs noticeably better in English than in lower-resource languages it was trained on, and provides only weak protection in languages outside its eight-language list ^[7].
Latency cost. Running an 8B-parameter classifier on every input and output roughly doubles the inference cost of a chat application unless the application uses the smaller 1B variant, which trades accuracy for speed.
Adversarial robustness is bounded. CyberSec Eval 2's own results showed that no model, including Meta's, fully resists prompt injection. Prompt Guard catches a meaningful fraction of attacks but is itself a classifier that can be evaded by novel attack patterns ^[13].
Fixed taxonomies bias coverage. The MLCommons hazard taxonomy that Llama Guard 3 and 4 follow does not cover every harm category that any application might care about. Customizing the policy through prompting helps but does not fully bridge the gap, especially for narrow regulated domains.
Licensing nuance. Most Llama Guard models are released under the Llama Community License, which the Open Source Initiative has consistently said does not meet the open source definition, mainly because of the 700 million monthly active user threshold and acceptable use policy. Code Shield and CyberSec Eval are MIT-licensed and unaffected by this distinction ^[2].
Vision and multimodal coverage is newer. Llama Guard 3 11B Vision and Llama Guard 4 are the only multimodal options in the suite, and both have less independent evaluation than the long-standing text-only models ^[9]^[10].
Not a substitute for system design. Meta consistently positions Purple Llama as a layer in a larger safety architecture, not a complete solution. Production deployments still need rate limiting, abuse detection, monitoring, and policies, none of which Purple Llama provides on its own ^[1]^[2].

References

Meta. (2023, December 7). "Introducing Purple Llama for Safe and Responsible AI Development." https://about.fb.com/news/2023/12/purple-llama-safe-responsible-ai-development/
Meta. "PurpleLlama GitHub repository." https://github.com/meta-llama/PurpleLlama
The AI Alliance. (2023, December 5). "Launch announcement." https://thealliance.ai/
Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., & Khabsa, M. (2023). "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations." arXiv:2312.06674. https://arxiv.org/abs/2312.06674
Meta. (2024). "Meta-Llama-Guard-2-8B Model Card." https://huggingface.co/meta-llama/Meta-Llama-Guard-2-8B
MLCommons. (2024, April 16). "Announcing MLCommons AI Safety v0.5 Proof of Concept." https://mlcommons.org/2024/04/mlc-aisafety-v0-5-poc/
Meta. (2024). "Llama-Guard-3-8B Model Card." https://huggingface.co/meta-llama/Llama-Guard-3-8B
Meta. "Llama Guard 3 Documentation." https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-3/
Meta. (2024). "Llama Guard 3 11B Vision Model Card." https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard3/11B-vision/MODEL_CARD.md
Meta. (2025). "Llama-Guard-4-12B Model Card." https://huggingface.co/meta-llama/Llama-Guard-4-12B
Meta. "Llama Guard 4 Documentation." https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-4/
Bhatt, M., et al. (2023). "Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models." arXiv:2312.04724. https://arxiv.org/abs/2312.04724
Bhatt, M., et al. (2024). "CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models." arXiv:2404.13161. https://arxiv.org/abs/2404.13161
Wan, S., et al. (2024). "CyberSecEval 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models." arXiv:2408.01605. https://arxiv.org/abs/2408.01605
Meta. "Purple Llama Code Shield documentation." https://github.com/meta-llama/PurpleLlama/tree/main/CodeShield
Meta. (2024). "Prompt-Guard-86M Model Card." https://huggingface.co/meta-llama/Prompt-Guard-86M
Vidgen, B., et al. (2024). "Introducing v0.5 of the AI Safety Benchmark from MLCommons." arXiv:2404.12241. https://arxiv.org/abs/2404.12241
MLCommons. "AILuminate v1.1 benchmark suite." https://github.com/mlcommons/ailuminate

Origin

The purple teaming concept

Components

Llama Guard

Llama Guard 2

Llama Guard 3

Llama Guard 3 Vision (11B Vision)

Llama Guard 4

CyberSec Eval (CyberSecEval)

CyberSec Eval 1

CyberSec Eval 2

CyberSec Eval 3

Code Shield

Prompt Guard

Architecture and training

Partners and adoption

Benchmark comparisons

Releases timeline

Comparison to other safety toolkits

Limitations

See also

References

Improve this article

Related Articles

OpenClaw

LLaMA

LLaMA 3

Code Llama

AI 2027

Situational Awareness

Origin

The purple teaming concept

Components

Llama Guard

Llama Guard 2

Llama Guard 3

Llama Guard 3 Vision (11B Vision)

Llama Guard 4

CyberSec Eval (CyberSecEval)

CyberSec Eval 1

CyberSec Eval 2

CyberSec Eval 3

Code Shield

Prompt Guard

Architecture and training

Partners and adoption

Benchmark comparisons

Releases timeline

Comparison to other safety toolkits

Limitations

See also

References

Related Articles

OpenClaw

LLaMA

LLaMA 3

Code Llama

AI 2027

Situational Awareness