Red teaming (artificial intelligence)

AI Safety Artificial Intelligence

30 min read

Updated Jun 21, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 21, 2026

Fact-checked

In review queue

Sources

24 citations

Revision

v4 · 6,036 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Red teaming in artificial intelligence is the systematic, adversarial testing of an AI system to find vulnerabilities, biases, harmful outputs, and other failure modes before deployment or as part of ongoing monitoring. U.S. Executive Order 14110 defined the practice as "a structured testing effort to find flaws and vulnerabilities in an AI system, often in a controlled environment and in collaboration with developers of AI" ^[17]. Borrowed from military strategy and cybersecurity, the practice involves dedicated teams or individuals who adopt an attacker's mindset, deliberately probing AI models to discover weaknesses that standard quality assurance and benchmarking might miss. As large language models (LLMs) and other generative AI systems have become widely deployed, red teaming has emerged as one of the most important practical tools in AI safety, referenced in government executive orders, international regulations, and the internal safety protocols of every major AI laboratory. By January 2025, Microsoft's AI Red Team alone had red teamed more than 100 generative AI products, and prompt injection (a core red teaming target) held the number one spot on the OWASP Top 10 for LLM Applications for a second consecutive year ^[13] ^[10].

Origin of the term

The concept of red teaming traces back to Cold War-era military exercises in the 1960s. The think tank RAND Corporation conducted simulations for the United States military in which a "red team" represented the Soviet Union and a "blue team" represented the United States. The color coding reflected the geopolitical alignment of the era: red for communist adversaries, blue for NATO allies ^[1]. United States Secretary of Defense Robert McNamara also assembled red and blue teams to evaluate competing proposals from government contractors for experimental aircraft programs.

More broadly, the idea of structured adversarial opposition in military planning dates to the early 19th century, when armies employed officers to challenge battle plans and identify weaknesses. By the mid-20th century, red teaming had become a formalized component of military doctrine, particularly for anticipating enemy strategies during the Cold War ^[1].

In the 1980s and 1990s, red teaming migrated into cybersecurity. The National Security Agency (NSA) pioneered the use of dedicated red teams to assess the security of classified systems. Private sector companies and government agencies adopted the practice through the 1990s and 2000s, with red teams simulating cyberattacks to test network defenses, penetrate perimeters, and expose vulnerabilities before real adversaries could exploit them ^[1].

The application of red teaming to AI systems began in earnest around 2020 and 2021, as the rapid growth in LLM capabilities made it clear that conventional software testing was insufficient. Unlike traditional software, where bugs produce deterministic failures, AI models can fail in unpredictable, context-dependent ways: generating toxic content, leaking private training data, or providing instructions for dangerous activities. Red teaming adapted to address these novel failure modes, and by 2023 the term had become standard vocabulary in AI safety discussions.

How does AI red teaming work?

AI red teaming differs from traditional software testing in several important respects. Standard software testing typically checks whether a program produces correct outputs for given inputs according to a specification. AI red teaming, by contrast, involves creative, open-ended exploration of a system's behavior under adversarial conditions, where the goal is to discover failure modes that the developers did not anticipate.

A typical AI red teaming exercise proceeds through several phases:

Phase	Description
Scoping	Define the target system, threat model, and categories of risk to be tested (e.g., harmful content, bias, jailbreaks, privacy leaks)
Reconnaissance	Understand the system's architecture, intended use cases, known guardrails, and any documentation of safety measures
Attack design	Develop adversarial prompts, scenarios, and interaction strategies intended to bypass safety measures or elicit undesirable outputs
Execution	Systematically probe the system using the designed attacks, documenting all inputs and outputs
Analysis	Categorize and prioritize discovered vulnerabilities by severity, exploitability, and potential real-world impact
Reporting	Deliver structured findings to the development team with recommendations for remediation
Retesting	After mitigations are implemented, verify that the identified vulnerabilities have been addressed

Red teamers employ a wide range of techniques. Manual probing involves human experts crafting adversarial prompts based on their understanding of model behavior, domain knowledge, and creative intuition. Automated attacks use algorithms or other AI models to generate large volumes of adversarial inputs at scale. Structured evaluations follow predefined taxonomies of risk, testing the model systematically against known categories of harmful behavior ^[2].

The Georgetown University Center for Security and Emerging Technology (CSET) has noted that the term "AI red teaming" encompasses a broader range of activities than traditional cybersecurity red teaming. In cybersecurity, red teams typically simulate specific threat actors with defined capabilities and objectives. In AI, red teaming often serves a dual purpose: identifying security vulnerabilities (like prompt injection) and evaluating responsible AI concerns (like bias and harmful content generation) ^[3].

Types of AI red teaming

AI red teaming takes several forms, each suited to different objectives and stages of the development lifecycle.

Internal versus external red teaming

Internal red teaming is conducted by employees of the organization that developed the AI system. Internal teams have deep knowledge of the model's architecture, training data, and safety measures, which allows them to craft highly targeted attacks. The disadvantage is that internal teams may share the same blind spots as the developers.

External red teaming brings in independent experts, often with specialized domain knowledge, to test the system. External red teamers offer fresh perspectives and are less likely to overlook issues that internal teams have normalized. OpenAI's Red Teaming Network, for instance, recruits domain experts from fields including biosecurity, political science, and education to evaluate new models before release. For GPT-4o, OpenAI worked with more than 100 external red teamers speaking 45 languages and representing 29 countries, with testing running from early March through late June 2024 ^[4] ^[23]. Anthropic similarly engages external biosecurity and cybersecurity experts for its frontier red teaming program ^[5].

Many organizations use both approaches. Internal teams conduct continuous testing during development, while external teams provide independent assessments at key milestones such as pre-deployment reviews.

Manual versus automated red teaming

Manual red teaming relies on human creativity and expertise. Skilled red teamers can identify subtle, context-dependent vulnerabilities that automated systems would miss: cultural sensitivities, nuanced forms of bias, or complex multi-turn manipulation strategies. The DEF CON AI Village red teaming challenges have demonstrated the value of human ingenuity, with participants discovering vulnerabilities through techniques like role prompting (asking the model to assume a persona such as a professor researching hate speech) that achieved high attack success rates ^[6].

Automated red teaming uses algorithms, scripts, or other AI models to generate adversarial inputs at scale. This approach can test thousands of attack variations in the time it takes a human to test dozens. Anthropic has described an approach where one model generates attacks and another model defends, running in an iterative loop to progressively discover and harden against vulnerabilities ^[5]. Automated methods are especially valuable for regression testing, ensuring that previously discovered vulnerabilities remain fixed as models are updated.

Hybrid approaches combine human expertise with automated scaling. A common pattern involves human experts designing attack strategies and taxonomies, which are then implemented as automated test suites. Results from automated testing are reviewed by humans to assess severity and develop mitigations.

Domain-specific red teaming

Some categories of risk require specialized knowledge to evaluate effectively. Domain-specific red teaming brings in subject matter experts to assess whether an AI system could facilitate harm in their area of expertise.

Domain	What red teamers test for
Biosecurity	Whether the model can provide actionable instructions for synthesizing biological agents or toxins
Cybersecurity	Whether the model can assist in writing malware, discovering exploits, or conducting cyberattacks
Nuclear and radiological	Whether the model reveals sensitive information about weapons design, enrichment processes, or safeguards circumvention
Chemistry	Whether the model provides synthesis routes for chemical weapons or precursor chemicals
Persuasion and manipulation	Whether the model can generate convincing disinformation, propaganda, or social engineering attacks
Child safety	Whether the model can be manipulated into producing child sexual abuse material or grooming content

Anthropic's Frontier Red Team, a group of approximately 15 researchers, has spent over 150 hours with biosecurity experts evaluating models' ability to output harmful biological information. In April 2025, Anthropic partnered with the U.S. Department of Energy's National Nuclear Security Administration (NNSA) to assess models for nuclear proliferation risks, with NNSA staff red teaming Claude models in a classified environment ^[5] ^[7].

What do red teams test for?

The scope of AI red teaming has expanded significantly since its early days. Modern red teaming programs typically evaluate AI systems across multiple categories of risk.

Harmful content generation

Red teamers test whether models can be induced to generate content that is violent, sexually explicit, hateful, or otherwise harmful. This includes testing the robustness of content filters and refusal mechanisms under adversarial pressure. Even models with strong safety training can sometimes be manipulated into producing harmful content through creative prompting strategies.

Bias and discrimination

Red teams evaluate whether AI systems produce outputs that reflect or amplify societal biases based on race, gender, religion, nationality, sexual orientation, disability, or other protected characteristics. This testing goes beyond simple keyword detection to examine subtle forms of bias in recommendations, characterizations, and differential treatment of demographic groups.

Jailbreaks

A jailbreak is an adversarial prompt or sequence of prompts designed to bypass an AI model's safety restrictions, causing it to produce outputs it was trained to refuse. Jailbreak techniques have grown increasingly sophisticated. A systematic 2025 evaluation found that prompt injections exploiting roleplay dynamics achieved the highest attack success rate of 89.6%, often bypassing filters by deflecting responsibility away from the model (for example, framing a request as dialogue for a movie script); by comparison, logic-trap attacks reached 81.4% and encoding tricks 76.2% ^[8] ^[21]. Joint testing by the UK and US AI Safety Institutes found that safety guardrails built into frontier models could be "routinely circumvented" through jailbreaking techniques, and the UK AI Security Institute's 2025 report stated that universal jailbreaks, techniques that override safeguards across a range of harmful request categories, "were found in every system tested" ^[9].

Common jailbreak categories include:

Technique	Description
Role prompting	Instructing the model to adopt a persona that would not be bound by safety rules
Payload splitting	Breaking a harmful request across multiple messages so no single message triggers a refusal
Encoding attacks	Expressing harmful requests in code, base64, or other formats that bypass text-level filters
Crescendo attacks	Gradually escalating the sensitivity of requests over a multi-turn conversation
Few-shot manipulation	Providing examples of harmful outputs to establish a pattern the model continues
Translation attacks	Requesting harmful content in low-resource languages where safety training is weaker

Prompt injection

Prompt injection attacks involve inserting malicious instructions into inputs that an AI system processes, causing it to override its original instructions or behave in unintended ways. This is particularly dangerous for AI systems that process external data, such as web content or user-uploaded documents. OWASP's 2025 Top 10 for LLM Applications ranked prompt injection (LLM01:2025) as the number one vulnerability for the second consecutive year ^[10].

Hallucinations

Red teams test the extent to which models generate plausible-sounding but factually incorrect information, often called hallucinations. This is especially critical for applications in medicine, law, finance, and other domains where inaccurate information could cause real harm. Red teamers craft questions that are likely to elicit confident but wrong answers, such as questions about obscure topics, requests for specific citations, or queries that combine real and fabricated premises.

Privacy leaks

Models trained on large datasets may memorize and reproduce personal information, confidential data, or copyrighted material from their training corpus. Red teamers probe for these leaks by crafting prompts designed to extract specific types of sensitive information. OWASP's 2025 Top 10 elevated sensitive information disclosure from sixth to second place (LLM02:2025), reflecting growing concern about this risk ^[10].

CBRN risks

Chemical, biological, radiological, and nuclear (CBRN) risk assessment has become a central focus of frontier model red teaming. The concern is that advanced AI models might lower the barrier to creating weapons of mass destruction by providing detailed technical guidance to individuals who lack specialized training. Multiple AI laboratories now conduct pre-deployment CBRN evaluations as standard practice, often engaging external domain experts ^[5] ^[11].

Key organizations and initiatives

Numerous organizations have established dedicated red teaming programs or frameworks for AI systems.

NIST AI Risk Management Framework

The National Institute of Standards and Technology (NIST) AI Risk Management Framework (AI RMF) provides a structured approach for organizations to manage AI risks throughout the system lifecycle. The framework emphasizes continuous testing and evaluation, including red teaming, as a core component of responsible AI development. In July 2024, NIST released updated guidelines and a global engagement plan for AI security testing pursuant to Executive Order 14110 ^[12].

NIST also developed Dioptra, a security testbed designed to help users assess which types of attacks would degrade their AI model's performance. Dioptra supports AI model testing, research, evaluations, and red teaming exercises. Between December 2024 and January 2025, NIST ran the ARIA (Assessing Risks and Impacts of AI) 0.1 pilot program, in which 51 red teamers and 19 field testers evaluated seven AI applications across three scenarios (TV Spoilers, Meal Planner, and Pathfinder), producing 508 testing sessions ^[12].

DEF CON AI Village

The DEF CON hacking conference's AI Village has hosted some of the largest public AI red teaming events. At DEF CON 31 in August 2023, the inaugural Generative AI Red Team Challenge brought together 2,244 participants who evaluated eight LLMs over two and a half days, producing 17,469 conversations and 164,208 messages across 21 challenges spanning cybersecurity, misinformation, bias, and human rights ^[6]. The eight models came from OpenAI, Anthropic, Meta, Google, Hugging Face, NVIDIA, Stability AI, and Cohere, and organizers described it as "the first instance of a live hacking event of a generative AI system at scale" ^[6].

At DEF CON 32 in August 2024, the Generative Red Team (GRT-2) returned with a revised format. Participants performed real evaluations of LLM flaws and vulnerabilities, with bounties offered for each finding. CSET Research Fellow Colin Shea-Blymyer won the challenge by discovering significant vulnerabilities through creative attack strategies ^[6]. These events, organized in collaboration with the nonprofit Humane Intelligence, have demonstrated the value of large-scale public participation in AI safety testing.

Anthropic's red teaming program

Anthropic maintains one of the most publicly documented red teaming programs in the AI industry. The company's Frontier Red Team focuses on CBRN risks, cybersecurity capabilities, and autonomous AI behavior. Anthropic's approach involves both automated red teaming (using AI models to attack other AI models in iterative loops) and extensive engagement with external domain experts ^[5].

In 2025, Anthropic published research on strengthening red teams through a modular scaffold for control evaluations. The company studies risks through what it calls the AI control framework, testing systems in adversarial settings by having a red team design attack policies that allow an AI model to intentionally pursue hidden, harmful goals ^[7]. Separately, a prototype of Anthropic's Constitutional Classifiers withstood over 3,000 hours of expert red teaming from a HackerOne program (405 participants invited, with bounties of up to 15,000 USD) without a single universal jailbreak being found; the classifiers reduced the jailbreak success rate from 86% on an unguarded model to 4.4% ^[22]. In May 2025, Anthropic released Claude Opus 4 under "AI Safety Level 3" (ASL-3), the first model deployed under that standard of its Responsible Scaling Policy, which includes real-time classifiers targeting CBRN misuse ^[7] ^[22].

Anthropic also maintains the site red.anthropic.com, where it publishes detailed findings from its red teaming work, including evaluations of AI capabilities in areas such as cybersecurity and nuclear safeguards ^[7].

OpenAI's Red Teaming Network

OpenAI established its Red Teaming Network to deepen collaborations with external experts for systematic pre-deployment testing of new models. The network recruits domain specialists from diverse fields to evaluate harmful capabilities and stress-test mitigations ^[4].

For the GPT-4o evaluation, OpenAI conducted external red teaming in four phases. In the first three phases, testers used an internal tool; in the final phase, they tested the full consumer experience. Red teamers performed exploratory capability discovery, assessed novel potential risks, and stress-tested mitigations, with particular attention to multimodal capabilities including audio input and generation. The testing uncovered instances where the model would unintentionally generate output emulating the user's voice, a finding that informed the development of robust detection mitigations designed to keep the assistant voice on its preset reference voices ^[4] ^[23].

In March 2025, OpenAI published a detailed paper describing its approach to external red teaming, covering methodology, lessons learned, and evolving best practices ^[4]. The company's frontier model testing process now includes rigorous internal safety testing, external red teaming through the network, and collaborations with third-party testing organizations and government AI safety institutes.

Microsoft's AI Red Team

In 2018, Microsoft established what is considered the industry's first dedicated AI red team: an interdisciplinary group of security, adversarial machine learning, and responsible AI experts, founded under Ram Shankar Siva Kumar. The team's mandate extends beyond traditional security testing to include probing for harmful content generation, bias, and other responsible AI failures ^[13].

Since its founding, Microsoft's AI Red Team has tested over 100 generative AI applications, including every flagship model available on Azure OpenAI, every flagship Copilot product, and every release of the Phi model series. The team also draws on resources from across the Microsoft ecosystem, including the Fairness Center in Microsoft Research and AETHER (Microsoft's cross-company initiative on AI Ethics and Effects in Engineering and Research) ^[13]. In a January 2025 paper, "Lessons From Red Teaming 100 Generative AI Products," the team distilled eight lessons, concluding that "the work of securing AI systems will never be complete" ^[13] ^[24].

Key milestones in Microsoft's red teaming work include:

Year	Milestone
2018	Established the industry's first AI-focused red team
2020	Collaborated with MITRE to develop the Adversarial Machine Learning Threat Matrix
2020	Created and open-sourced Microsoft Counterfit, an automation tool for AI security testing
2021	Released the AI Security Risk Assessment Framework
2024	Open-sourced PyRIT (Python Risk Identification Tool) for orchestrating LLM attack suites
2025	Published findings from red teaming 100 generative AI products

Automated red teaming tools

The growth of AI red teaming has spawned a rich ecosystem of open-source and commercial tools designed to automate vulnerability discovery in AI systems.

Garak

Garak, developed by NVIDIA, is an open-source LLM vulnerability scanner that probes generative AI models for a wide range of weaknesses. Named as a reference to the Star Trek character (and backronymized as "Generative AI Red-teaming and Assessment Kit"), Garak uses a modular plugin architecture to test for prompt injection, data leakage, toxicity, hallucination, and other failure modes. It supports multiple model backends and can be extended with custom probes and detectors ^[14].

PyRIT

PyRIT (Python Risk Identification Tool), developed by Microsoft, is an open-source framework for orchestrating multi-turn attacks against AI systems. PyRIT excels at sophisticated, multi-turn attack strategies that single-pass scanners cannot replicate. It has become a de facto standard for organizations seeking to automate complex adversarial testing scenarios, supporting techniques like crescendo attacks and Tree of Attacks with Pruning (TAP) ^[14] ^[15].

PAIR

PAIR (Prompt Automatic Iterative Refinement) is an automated jailbreaking technique developed by researchers at Carnegie Mellon University and other institutions. PAIR uses an attacker LLM to iteratively refine adversarial prompts against a target LLM, automatically generating jailbreaks without requiring human-crafted templates. The attacker model queries the target, evaluates the response, and refines its approach over multiple iterations until it succeeds in bypassing safety measures ^[14].

TAP

TAP (Tree of Attacks with Pruning) extends the iterative approach of PAIR by maintaining a tree-structured search over possible attack strategies. At each step, TAP generates multiple candidate attack prompts, evaluates their effectiveness, prunes unpromising branches, and expands the most successful approaches. This tree search enables more efficient exploration of the attack space compared to linear iterative methods ^[14].

CyberSecEval

CyberSecEval, developed by Meta as part of its Purple Llama project, is a benchmark suite designed to assess cybersecurity vulnerabilities in LLMs. First released as what was described as the industry's first comprehensive set of cybersecurity safety evaluations for LLMs, CyberSecEval has reached version 3 as of 2025. The benchmarks are based on industry standards including CWE (Common Weakness Enumeration) and MITRE ATT&CK, and they evaluate the frequency of insecure code suggestions, the ease of generating malicious code, and multilingual prompt injection resilience ^[16].

Promptfoo

Promptfoo is an open-source tool for testing and evaluating LLM applications. While broader in scope than pure red teaming, it includes adversarial testing capabilities and integrates with other red teaming datasets and methodologies, including CyberSecEval. Promptfoo supports automated scanning for common vulnerability classes and can be integrated into continuous integration pipelines ^[14].

Comparison of tools

Tool	Developer	Primary strength	Open source	Key technique
Garak	NVIDIA	Broad vulnerability scanning with modular plugins	Yes	Plugin-based probing across multiple failure categories
PyRIT	Microsoft	Multi-turn attack orchestration	Yes	Crescendo and TAP attack strategies
PAIR	Academic researchers	Automated jailbreak generation	Yes	Iterative prompt refinement using attacker LLM
TAP	Academic researchers	Efficient attack space exploration	Yes	Tree-structured search with pruning
CyberSecEval	Meta	Cybersecurity-focused benchmarking	Yes	Standards-based evaluation (CWE, MITRE ATT&CK)
Promptfoo	Promptfoo Inc.	LLM application testing and evaluation	Yes	Integration with multiple scanning frameworks

A recommended layered approach to automated red teaming involves three tiers: Layer 1 uses broad scanners like Garak or Promptfoo for initial vulnerability discovery; Layer 2 applies compliance-focused scans aligned with specific standards; Layer 3 employs deep exploitation tools like PyRIT for sophisticated multi-turn campaigns ^[14].

Regulatory requirements

Biden Executive Order 14110

On October 30, 2023, U.S. President Joe Biden signed Executive Order 14110, titled "Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence." The order placed AI red teaming at the center of federal AI governance by requiring companies developing dual-use foundation models to conduct red team testing and report results to the government ^[17].

The executive order defined AI red teaming as "a structured testing effort to find flaws and vulnerabilities in an AI system, often in a controlled environment and in collaboration with developers of AI." It specified that red teams should adopt adversarial methods to identify flaws such as harmful or discriminatory outputs, unforeseen system behaviors, limitations, and potential misuse risks ^[17].

Key provisions included:

Within 90 days, the Secretary of Commerce was required to establish reporting requirements for companies developing or intending to develop dual-use foundation models.
Companies had to report the results of red team tests following NIST guidance, along with a description of safety measures taken in response.
NIST was directed to develop guidelines for AI red teaming, including testing for misuse risks involving CBRN weapons, cyber operations, and generation of child sexual abuse material.
The order mandated the development of testing environments (testbeds) to support safe and trustworthy AI development ^[17].

However, on January 20, 2025, President Donald Trump revoked Executive Order 14110 within hours of taking office. Three days later, Trump signed Executive Order 14179, "Removing Barriers to American Leadership in Artificial Intelligence," which shifted emphasis from oversight and mandatory testing to deregulation and innovation promotion. The revocation effectively halted the mandatory red teaming and reporting requirements that had been established under the Biden order ^[18].

EU AI Act

The EU AI Act, which entered into force on August 1, 2024, includes explicit requirements for adversarial testing of general-purpose AI (GPAI) models. For GPAI models designated as posing systemic risk (those exceeding 10^25 floating-point operations in training), Article 55 requires providers to ^[19]:

Conduct and document adversarial testing, including red teaming, to identify and mitigate systemic risks.
Provide detailed documentation of measures for conducting internal and external adversarial testing as part of their technical documentation (Annex XI).
Perform model evaluations throughout the product lifecycle, not only before initial deployment.
Track and report serious incidents and ensure cybersecurity protections.

The European Commission launched the GPAI Code of Practice in 2025, providing voluntary benchmarks for companies building or deploying foundation models. Full enforcement of the AI Act's provisions, including penalties of up to 35 million EUR or 7% of global annual turnover (whichever is higher) for noncompliance, is expected to take effect in stages through 2026, following formal designation of systemic risk models by the European Commission ^[19].

Relationship to AI safety evaluation

Red teaming is one component of a broader ecosystem of AI safety evaluation methods. While red teaming focuses on adversarial testing (actively trying to make systems fail), safety evaluation also encompasses capability assessments, alignment testing, and dangerous capability evaluations.

METR (Model Evaluation and Threat Research)

METR is a nonprofit research institute based in Berkeley, California, that evaluates frontier AI models' capabilities to carry out long-horizon, agentic tasks. Originally formed as ARC Evals within the Alignment Research Center (founded by Paul Christiano in 2021), the group spun off as an independent nonprofit in December 2023 and was renamed METR (a reference to metrology, the science of measurement) ^[20].

METR conducts pre-deployment evaluations of frontier models in partnership with AI developers including Anthropic and OpenAI. The organization has contributed to system cards for models including OpenAI's o3, o4-mini, GPT-4o, and GPT-4.5, as well as Anthropic's Claude models. METR's research focuses on measuring how AI agents perform on progressively longer and more complex tasks. In a March 2025 study, METR found that the length of tasks (measured by how long they take human professionals) that frontier model agents can complete autonomously with 50% reliability has been doubling approximately every seven months over roughly six years ^[20].

As of late 2025, twelve companies had published frontier AI safety policies that reference external evaluation partnerships: Anthropic, OpenAI, Google DeepMind, Magic, Naver, Meta, G42, Cohere, Microsoft, Amazon, xAI, and NVIDIA ^[20].

UK AI Security Institute (formerly AI Safety Institute)

The UK AI Security Institute, established following the Bletchley Park AI Safety Summit in November 2023 with approximately 100 million GBP in funding, has built one of the world's largest safety evaluation teams. The institute conducts pre-deployment evaluations of frontier models and has performed joint evaluations with its U.S. counterpart. Its 2025 Frontier AI Trends Report, which drew on roughly two years of evaluations across more than 30 state-of-the-art models, found that while model capabilities in cybersecurity and scientific domains had improved dramatically, safety guardrails could still be routinely circumvented ^[9].

U.S. AI Safety Institute

Created within NIST following Executive Order 14110, the U.S. AI Safety Institute collaborates with the UK institute on joint frontier model evaluations. Following the change in administration in January 2025, the institute was reorganized and renamed the Center for AI Standards and Innovation (CAISI) ^[9].

Challenges and limitations

Despite its importance, AI red teaming faces several significant challenges.

Lack of standardization. Different organizations use different techniques, taxonomies, and success criteria when red teaming the same types of models. This makes it difficult to compare the safety of different AI systems or to aggregate findings across the industry. The Georgetown CSET has noted that the term "red teaming" is used so broadly in the AI context that it can refer to activities ranging from informal prompt testing to rigorous, structured evaluations ^[3].

Arms race dynamics. Red teaming exists in an adversarial relationship with safety measures. As safety training improves, red teamers develop more sophisticated attacks; as attacks become known, developers patch them. This dynamic means that a clean bill of health from one round of red teaming provides limited assurance about future safety. VentureBeat reported in 2025 that this pattern "exposes a harsh truth about the AI security arms race": no amount of testing can guarantee that a model is safe against all possible attacks ^[8].

Scalability. Manual red teaming is labor-intensive and cannot cover the vast space of possible inputs to a modern AI system. Automated tools improve coverage but may miss subtle, context-dependent vulnerabilities that require human understanding. Finding the right balance between human expertise and automated scale remains an open problem.

Evaluation of evaluators. Assessing the quality and completeness of a red teaming exercise is itself difficult. There are no universally accepted benchmarks for red team performance. A red team that finds no vulnerabilities might be testing a genuinely safe system, or it might simply not be creative or persistent enough.

Dual-use concerns. Publishing detailed descriptions of successful attacks can help defenders improve their systems, but it also provides a roadmap for malicious actors. The red teaming community must balance transparency (which advances the field) against the risk of enabling harm.

Access and resources. Effective red teaming of frontier models requires access to those models, often before public deployment. This creates a dependency on AI developers being willing to share access with external testers. It also requires significant computational resources and domain expertise that not all organizations can afford.

Best practices

Based on guidance from NIST, major AI laboratories, and the broader research community, several best practices have emerged for AI red teaming.

Best practice	Description
Define clear scope and objectives	Specify which risks are in scope, what constitutes a successful attack, and how findings will be prioritized
Use diverse red teams	Include people with varied backgrounds, skills, languages, and cultural perspectives to maximize the range of discovered vulnerabilities
Combine manual and automated methods	Use automation for breadth and human expertise for depth
Test throughout the lifecycle	Conduct red teaming before deployment, after major updates, and on an ongoing basis
Engage domain experts	For high-risk areas like CBRN, cybersecurity, and child safety, involve specialists with relevant expertise
Document everything	Maintain detailed records of all inputs, outputs, and contextual factors for each test
Establish responsible disclosure	Define clear processes for handling and communicating discovered vulnerabilities
Iterate on mitigations	After addressing vulnerabilities, retest to verify that fixes are effective and have not introduced new problems
Consider real-world deployment context	Test the system as it will actually be used, including with the specific user interfaces, system prompts, and integrations that will be in production
Follow established taxonomies	Use frameworks like OWASP Top 10 for LLMs or MITRE ATLAS to ensure comprehensive coverage

Microsoft's AI Red Team has emphasized three key lessons from testing over 100 generative AI products: security and responsible AI risks should be tested together rather than in silos; red teaming must account for the full system (not just the model in isolation); and the most impactful vulnerabilities often arise from the interaction between the model and its deployment context rather than from the model alone ^[13].

What is the current state of AI red teaming (2025-2026)?

As of early 2026, AI red teaming is transitioning from an ad hoc practice to an expected component of responsible AI development and, in some jurisdictions, a legal requirement.

Regulatory pressure is growing. The EU AI Act's requirements for adversarial testing of GPAI models with systemic risk took effect in August 2025, with full enforcement approaching in 2026. While the U.S. federal government stepped back from mandatory red teaming requirements after revoking Executive Order 14110, individual states have enacted their own AI safety laws, and industry self-regulation through frameworks like Anthropic's Responsible Scaling Policy and OpenAI's Preparedness Framework continues to incorporate red teaming as a core practice ^[18] ^[19].

Tooling is maturing. The ecosystem of automated red teaming tools has grown substantially. PyRIT, Garak, Promptfoo, and other open-source tools are increasingly integrated into continuous development pipelines. Microsoft launched an AI Red Teaming Agent in 2025, and the layered approach to automated testing (broad scanning, compliance testing, deep exploitation) has become a recognized methodology ^[14] ^[15].

Public participation is expanding. Events like the DEF CON AI Village's Generative Red Team challenges have demonstrated the value of crowdsourced adversarial testing. The 2024 Generative AI Red Teaming Transparency Report, published by Humane Intelligence, documented findings from large-scale public red teaming exercises and provided recommendations for future events ^[6].

CBRN and national security testing is intensifying. Partnerships between AI laboratories and government agencies for CBRN evaluation have deepened. Anthropic's partnership with the National Nuclear Security Administration, expanded CBRN evaluations across the industry, and the FORTRESS (Frontier Risk Evaluation for National Security and Public Safety) framework published in 2025 all reflect the growing emphasis on national security dimensions of AI red teaming ^[7].

The OWASP Top 10 for LLMs continues to evolve. The 2025 edition introduced new vulnerability categories specific to generative AI, including excessive agency, system prompt leakage, vector and embedding weaknesses, misinformation, and unbounded consumption. Prompt injection retained its top position for the second year, while sensitive information disclosure rose sharply to second ^[10].

Frontier model capabilities are outpacing defenses. The UK AI Security Institute's 2025 report found that frontier models could complete apprentice-level cybersecurity tasks 50% of the time, up from roughly 10% in early 2024. In chemistry and biology, AI models exceeded expert (PhD-level) baselines by up to 60% on some evaluations, with lab-troubleshooting performance exceeding human experts by as much as 90%. Yet guardrails remained routinely circumventable, with universal jailbreaks found in every system tested ^[9]. This widening gap between model capability and safety assurance underscores the ongoing importance of red teaming, even as practitioners acknowledge that it cannot, on its own, guarantee safety.

References

Wikipedia. "Red team." https://en.wikipedia.org/wiki/Red_team ↩
Confident AI. "Red Teaming LLMs: A Step-by-Step Guide." https://www.confident-ai.com/blog/red-teaming-llms-a-step-by-step-guide ↩
Georgetown CSET. "What Does AI Red-Teaming Actually Mean?" https://cset.georgetown.edu/article/what-does-ai-red-teaming-actually-mean/ ↩
OpenAI. "OpenAI's Approach to External Red Teaming for AI Models and Systems." March 2025. https://arxiv.org/html/2503.16431v1 ↩
Anthropic. "Frontier Threats Red Teaming for AI Safety." https://www.anthropic.com/news/frontier-threats-red-teaming-for-ai-safety ↩
Humane Intelligence. "Generative AI Red Teaming Challenge." https://www.humane-intelligence.org/grt ↩
Anthropic. "Strengthening Red Teams: A Modular Scaffold for Control Evaluations." 2025. https://alignment.anthropic.com/2025/strengthening-red-teams/ ↩
VentureBeat. "Red teaming LLMs exposes a harsh truth about the AI security arms race." 2025. https://venturebeat.com/security/red-teaming-llms-harsh-truth-ai-security-arms-race ↩
UK AI Security Institute. "Frontier AI Trends Report." 2025. https://www.aisi.gov.uk/frontier-ai-trends-report ↩
OWASP. "OWASP Top 10 for LLM Applications 2025." https://genai.owasp.org/ ↩
UC Berkeley CLTC. "Benchmark Early and Red Team Often." https://cltc.berkeley.edu/publication/benchmark-early-and-red-team-often-a-framework-for-assessing-and-managing-dual-use-hazards-of-ai-foundation-models/ ↩
NIST. "AI Risk Management Framework." https://www.nist.gov/artificial-intelligence/ai-risk-management-framework ↩
Microsoft. "Microsoft AI Red Team building future of safer AI." August 2023. https://www.microsoft.com/en-us/security/blog/2023/08/07/microsoft-ai-red-team-building-future-of-safer-ai/ ↩
Promptfoo. "Top Open Source AI Red-Teaming and Fuzzing Tools in 2025." https://www.promptfoo.dev/blog/top-5-open-source-ai-red-teaming-tools-2025/ ↩
Amine Raji. "LLM Red Teaming Tools: PyRIT & Garak (2025 Guide)." https://aminrj.com/posts/attack-patterns-red-teaming/ ↩
Meta. "Purple Llama CyberSecEval." https://github.com/meta-llama/PurpleLlama ↩
The White House. "Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence." October 30, 2023. https://www.federalregister.gov/documents/2023/11/01/2023-24283/safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence ↩
Federal Register. "Removing Barriers to American Leadership in Artificial Intelligence." January 2025. https://www.federalregister.gov/documents/2025/01/31/2025-02172/removing-barriers-to-american-leadership-in-artificial-intelligence ↩
European Commission. "EU AI Act." https://artificialintelligenceact.eu/high-level-summary/ ↩
METR. "Measuring AI Ability to Complete Long Tasks." March 2025. https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ ↩
Verma et al. "Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs." arXiv, May 2025. https://arxiv.org/abs/2505.04806 ↩
Anthropic. "Constitutional Classifiers: Defending against universal jailbreaks." 2025. https://www.anthropic.com/research/constitutional-classifiers ↩
OpenAI. "GPT-4o System Card." August 2024. https://openai.com/index/gpt-4o-system-card/ ↩
Bullwinkel et al. "Lessons From Red Teaming 100 Generative AI Products." arXiv, January 2025. https://arxiv.org/abs/2501.07238 ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

3 revisions by 1 contributors · full history

Suggest edit