Red teaming in artificial intelligence refers to the systematic, adversarial testing of AI systems to identify vulnerabilities, biases, harmful outputs, and other failure modes before those systems are deployed or after deployment as part of ongoing monitoring. Borrowed from military strategy and cybersecurity, the practice involves dedicated teams or individuals who adopt an attacker's mindset, deliberately probing AI models to discover weaknesses that standard quality assurance and benchmarking might miss. As large language models (LLMs) and other generative AI systems have become widely deployed, red teaming has emerged as one of the most important practical tools in AI safety, referenced in government executive orders, international regulations, and the internal safety protocols of every major AI laboratory.
The concept of red teaming traces back to Cold War-era military exercises in the 1960s. The think tank RAND Corporation conducted simulations for the United States military in which a "red team" represented the Soviet Union and a "blue team" represented the United States. The color coding reflected the geopolitical alignment of the era: red for communist adversaries, blue for NATO allies [1]. United States Secretary of Defense Robert McNamara also assembled red and blue teams to evaluate competing proposals from government contractors for experimental aircraft programs.
More broadly, the idea of structured adversarial opposition in military planning dates to the early 19th century, when armies employed officers to challenge battle plans and identify weaknesses. By the mid-20th century, red teaming had become a formalized component of military doctrine, particularly for anticipating enemy strategies during the Cold War [1].
In the 1980s and 1990s, red teaming migrated into cybersecurity. The National Security Agency (NSA) pioneered the use of dedicated red teams to assess the security of classified systems. Private sector companies and government agencies adopted the practice through the 1990s and 2000s, with red teams simulating cyberattacks to test network defenses, penetrate perimeters, and expose vulnerabilities before real adversaries could exploit them [1].
The application of red teaming to AI systems began in earnest around 2020 and 2021, as the rapid growth in LLM capabilities made it clear that conventional software testing was insufficient. Unlike traditional software, where bugs produce deterministic failures, AI models can fail in unpredictable, context-dependent ways: generating toxic content, leaking private training data, or providing instructions for dangerous activities. Red teaming adapted to address these novel failure modes, and by 2023 the term had become standard vocabulary in AI safety discussions.
AI red teaming differs from traditional software testing in several important respects. Standard software testing typically checks whether a program produces correct outputs for given inputs according to a specification. AI red teaming, by contrast, involves creative, open-ended exploration of a system's behavior under adversarial conditions, where the goal is to discover failure modes that the developers did not anticipate.
A typical AI red teaming exercise proceeds through several phases:
| Phase | Description |
|---|---|
| Scoping | Define the target system, threat model, and categories of risk to be tested (e.g., harmful content, bias, jailbreaks, privacy leaks) |
| Reconnaissance | Understand the system's architecture, intended use cases, known guardrails, and any documentation of safety measures |
| Attack design | Develop adversarial prompts, scenarios, and interaction strategies intended to bypass safety measures or elicit undesirable outputs |
| Execution | Systematically probe the system using the designed attacks, documenting all inputs and outputs |
| Analysis | Categorize and prioritize discovered vulnerabilities by severity, exploitability, and potential real-world impact |
| Reporting | Deliver structured findings to the development team with recommendations for remediation |
| Retesting | After mitigations are implemented, verify that the identified vulnerabilities have been addressed |
Red teamers employ a wide range of techniques. Manual probing involves human experts crafting adversarial prompts based on their understanding of model behavior, domain knowledge, and creative intuition. Automated attacks use algorithms or other AI models to generate large volumes of adversarial inputs at scale. Structured evaluations follow predefined taxonomies of risk, testing the model systematically against known categories of harmful behavior [2].
The Georgetown University Center for Security and Emerging Technology (CSET) has noted that the term "AI red teaming" encompasses a broader range of activities than traditional cybersecurity red teaming. In cybersecurity, red teams typically simulate specific threat actors with defined capabilities and objectives. In AI, red teaming often serves a dual purpose: identifying security vulnerabilities (like prompt injection) and evaluating responsible AI concerns (like bias and harmful content generation) [3].
AI red teaming takes several forms, each suited to different objectives and stages of the development lifecycle.
Internal red teaming is conducted by employees of the organization that developed the AI system. Internal teams have deep knowledge of the model's architecture, training data, and safety measures, which allows them to craft highly targeted attacks. The disadvantage is that internal teams may share the same blind spots as the developers.
External red teaming brings in independent experts, often with specialized domain knowledge, to test the system. External red teamers offer fresh perspectives and are less likely to overlook issues that internal teams have normalized. OpenAI's Red Teaming Network, for instance, recruits domain experts from fields including biosecurity, political science, and education to evaluate new models before release. For GPT-4o, OpenAI worked with more than 100 external red teamers speaking 45 languages from 29 countries [4]. Anthropic similarly engages external biosecurity and cybersecurity experts for its frontier red teaming program [5].
Many organizations use both approaches. Internal teams conduct continuous testing during development, while external teams provide independent assessments at key milestones such as pre-deployment reviews.
Manual red teaming relies on human creativity and expertise. Skilled red teamers can identify subtle, context-dependent vulnerabilities that automated systems would miss: cultural sensitivities, nuanced forms of bias, or complex multi-turn manipulation strategies. The DEF CON AI Village red teaming challenges have demonstrated the value of human ingenuity, with participants discovering vulnerabilities through techniques like role prompting (asking the model to assume a persona such as a professor researching hate speech) that achieved high attack success rates [6].
Automated red teaming uses algorithms, scripts, or other AI models to generate adversarial inputs at scale. This approach can test thousands of attack variations in the time it takes a human to test dozens. Anthropic has described an approach where one model generates attacks and another model defends, running in an iterative loop to progressively discover and harden against vulnerabilities [5]. Automated methods are especially valuable for regression testing, ensuring that previously discovered vulnerabilities remain fixed as models are updated.
Hybrid approaches combine human expertise with automated scaling. A common pattern involves human experts designing attack strategies and taxonomies, which are then implemented as automated test suites. Results from automated testing are reviewed by humans to assess severity and develop mitigations.
Some categories of risk require specialized knowledge to evaluate effectively. Domain-specific red teaming brings in subject matter experts to assess whether an AI system could facilitate harm in their area of expertise.
| Domain | What red teamers test for |
|---|---|
| Biosecurity | Whether the model can provide actionable instructions for synthesizing biological agents or toxins |
| Cybersecurity | Whether the model can assist in writing malware, discovering exploits, or conducting cyberattacks |
| Nuclear and radiological | Whether the model reveals sensitive information about weapons design, enrichment processes, or safeguards circumvention |
| Chemistry | Whether the model provides synthesis routes for chemical weapons or precursor chemicals |
| Persuasion and manipulation | Whether the model can generate convincing disinformation, propaganda, or social engineering attacks |
| Child safety | Whether the model can be manipulated into producing child sexual abuse material or grooming content |
Anthropic's Frontier Red Team, a group of approximately 15 researchers, has spent over 150 hours with biosecurity experts evaluating models' ability to output harmful biological information. In April 2025, Anthropic partnered with the U.S. Department of Energy's National Nuclear Security Administration to assess models for nuclear proliferation risks [5] [7].
The scope of AI red teaming has expanded significantly since its early days. Modern red teaming programs typically evaluate AI systems across multiple categories of risk.
Red teamers test whether models can be induced to generate content that is violent, sexually explicit, hateful, or otherwise harmful. This includes testing the robustness of content filters and refusal mechanisms under adversarial pressure. Even models with strong safety training can sometimes be manipulated into producing harmful content through creative prompting strategies.
Red teams evaluate whether AI systems produce outputs that reflect or amplify societal biases based on race, gender, religion, nationality, sexual orientation, disability, or other protected characteristics. This testing goes beyond simple keyword detection to examine subtle forms of bias in recommendations, characterizations, and differential treatment of demographic groups.
A jailbreak is an adversarial prompt or sequence of prompts designed to bypass an AI model's safety restrictions, causing it to produce outputs it was trained to refuse. Jailbreak techniques have grown increasingly sophisticated. Research published in 2025 found that prompt injections exploiting roleplay dynamics achieved attack success rates as high as 89.6%, often bypassing filters by deflecting responsibility away from the model [8]. Joint testing by the UK and US AI Safety Institutes found that safety guardrails built into frontier models could be "routinely circumvented" through jailbreaking techniques [9].
Common jailbreak categories include:
| Technique | Description |
|---|---|
| Role prompting | Instructing the model to adopt a persona that would not be bound by safety rules |
| Payload splitting | Breaking a harmful request across multiple messages so no single message triggers a refusal |
| Encoding attacks | Expressing harmful requests in code, base64, or other formats that bypass text-level filters |
| Crescendo attacks | Gradually escalating the sensitivity of requests over a multi-turn conversation |
| Few-shot manipulation | Providing examples of harmful outputs to establish a pattern the model continues |
| Translation attacks | Requesting harmful content in low-resource languages where safety training is weaker |
Prompt injection attacks involve inserting malicious instructions into inputs that an AI system processes, causing it to override its original instructions or behave in unintended ways. This is particularly dangerous for AI systems that process external data, such as web content or user-uploaded documents. OWASP's 2025 Top 10 for LLM Applications ranked prompt injection as the number one vulnerability for the second consecutive year [10].
Red teams test the extent to which models generate plausible-sounding but factually incorrect information, often called hallucinations. This is especially critical for applications in medicine, law, finance, and other domains where inaccurate information could cause real harm. Red teamers craft questions that are likely to elicit confident but wrong answers, such as questions about obscure topics, requests for specific citations, or queries that combine real and fabricated premises.
Models trained on large datasets may memorize and reproduce personal information, confidential data, or copyrighted material from their training corpus. Red teamers probe for these leaks by crafting prompts designed to extract specific types of sensitive information. OWASP's 2025 Top 10 elevated sensitive information disclosure from sixth to second place, reflecting growing concern about this risk [10].
Chemical, biological, radiological, and nuclear (CBRN) risk assessment has become a central focus of frontier model red teaming. The concern is that advanced AI models might lower the barrier to creating weapons of mass destruction by providing detailed technical guidance to individuals who lack specialized training. Multiple AI laboratories now conduct pre-deployment CBRN evaluations as standard practice, often engaging external domain experts [5] [11].
Numerous organizations have established dedicated red teaming programs or frameworks for AI systems.
The National Institute of Standards and Technology (NIST) AI Risk Management Framework (AI RMF) provides a structured approach for organizations to manage AI risks throughout the system lifecycle. The framework emphasizes continuous testing and evaluation, including red teaming, as a core component of responsible AI development. In July 2024, NIST released updated guidelines and a global engagement plan for AI security testing pursuant to Executive Order 14110 [12].
NIST also developed Dioptra, a security testbed designed to help users assess which types of attacks would degrade their AI model's performance. Dioptra supports AI model testing, research, evaluations, and red teaming exercises. Between December 2024 and January 2025, NIST ran the ARIA (Assessing Risks and Impacts of AI) 0.1 pilot program, in which 51 red teamers attempted to induce application failures across defined risk scenarios [12].
The DEF CON hacking conference's AI Village has hosted some of the largest public AI red teaming events. At DEF CON 31 in August 2023, the inaugural Generative AI Red Team Challenge brought together 2,244 participants who evaluated eight LLMs over two and a half days, producing over 17,000 conversations across 21 topics spanning cybersecurity, misinformation, bias, and human rights [6].
At DEF CON 32 in August 2024, the Generative Red Team (GRT-2) returned with a revised format. Participants performed real evaluations of LLM flaws and vulnerabilities, with bounties offered for each finding. CSET Research Fellow Colin Shea-Blymyer won the challenge by discovering significant vulnerabilities through creative attack strategies [6]. These events, organized in collaboration with the nonprofit Humane Intelligence, have demonstrated the value of large-scale public participation in AI safety testing.
Anthropic maintains one of the most publicly documented red teaming programs in the AI industry. The company's Frontier Red Team focuses on CBRN risks, cybersecurity capabilities, and autonomous AI behavior. Anthropic's approach involves both automated red teaming (using AI models to attack other AI models in iterative loops) and extensive engagement with external domain experts [5].
In 2025, Anthropic published research on strengthening red teams through a modular scaffold for control evaluations. The company studies risks through what it calls the AI control framework, testing systems in adversarial settings by having a red team design attack policies that allow an AI model to intentionally pursue hidden, harmful goals. A prototype system withstood over 3,000 hours of red teaming with no universal jailbreak discovered. Anthropic released Claude Opus 4 under what it called "AI Safety Level 3" (ASL-3), the first model released under that designation [7].
Anthropic also maintains the site red.anthropic.com, where it publishes detailed findings from its red teaming work, including evaluations of AI capabilities in areas such as cybersecurity and nuclear safeguards [7].
OpenAI established its Red Teaming Network to deepen collaborations with external experts for systematic pre-deployment testing of new models. The network recruits domain specialists from diverse fields to evaluate harmful capabilities and stress-test mitigations [4].
For the GPT-4o evaluation, OpenAI conducted external red teaming in four phases. In the first three phases, testers used an internal tool; in the final phase, they tested the full consumer experience. Red teamers performed exploratory capability discovery, assessed novel potential risks, and stress-tested mitigations, with particular attention to multimodal capabilities including audio input and generation. The testing uncovered instances where the model would unintentionally generate output emulating the user's voice, leading to the development of robust detection mitigations [4].
In March 2025, OpenAI published a detailed paper describing its approach to external red teaming, covering methodology, lessons learned, and evolving best practices [4]. The company's frontier model testing process now includes rigorous internal safety testing, external red teaming through the network, and collaborations with third-party testing organizations and government AI safety institutes.
In 2018, Microsoft established what is considered the industry's first dedicated AI red team: an interdisciplinary group of security, adversarial machine learning, and responsible AI experts. The team's mandate extends beyond traditional security testing to include probing for harmful content generation, bias, and other responsible AI failures [13].
Since its founding, Microsoft's AI Red Team has tested over 100 generative AI applications, including every flagship model available on Azure OpenAI, every flagship Copilot product, and every release of the Phi model series. The team also draws on resources from across the Microsoft ecosystem, including the Fairness Center in Microsoft Research and AETHER (Microsoft's cross-company initiative on AI Ethics and Effects in Engineering and Research) [13].
Key milestones in Microsoft's red teaming work include:
| Year | Milestone |
|---|---|
| 2018 | Established the industry's first AI-focused red team |
| 2020 | Collaborated with MITRE to develop the Adversarial Machine Learning Threat Matrix |
| 2020 | Created and open-sourced Microsoft Counterfit, an automation tool for AI security testing |
| 2021 | Released the AI Security Risk Assessment Framework |
| 2024 | Open-sourced PyRIT (Python Risk Identification Tool) for orchestrating LLM attack suites |
| 2025 | Published findings from red teaming 100 generative AI products |
The growth of AI red teaming has spawned a rich ecosystem of open-source and commercial tools designed to automate vulnerability discovery in AI systems.
Garak, developed by NVIDIA, is an open-source LLM vulnerability scanner that probes generative AI models for a wide range of weaknesses. Named as a reference to the Star Trek character (and backronymized as "Generative AI Red-teaming and Assessment Kit"), Garak uses a modular plugin architecture to test for prompt injection, data leakage, toxicity, hallucination, and other failure modes. It supports multiple model backends and can be extended with custom probes and detectors [14].
PyRIT (Python Risk Identification Tool), developed by Microsoft, is an open-source framework for orchestrating multi-turn attacks against AI systems. PyRIT excels at sophisticated, multi-turn attack strategies that single-pass scanners cannot replicate. It has become a de facto standard for organizations seeking to automate complex adversarial testing scenarios, supporting techniques like crescendo attacks and Tree of Attacks with Pruning (TAP) [14] [15].
PAIR (Prompt Automatic Iterative Refinement) is an automated jailbreaking technique developed by researchers at Carnegie Mellon University and other institutions. PAIR uses an attacker LLM to iteratively refine adversarial prompts against a target LLM, automatically generating jailbreaks without requiring human-crafted templates. The attacker model queries the target, evaluates the response, and refines its approach over multiple iterations until it succeeds in bypassing safety measures [14].
TAP (Tree of Attacks with Pruning) extends the iterative approach of PAIR by maintaining a tree-structured search over possible attack strategies. At each step, TAP generates multiple candidate attack prompts, evaluates their effectiveness, prunes unpromising branches, and expands the most successful approaches. This tree search enables more efficient exploration of the attack space compared to linear iterative methods [14].
CyberSecEval, developed by Meta as part of its Purple Llama project, is a benchmark suite designed to assess cybersecurity vulnerabilities in LLMs. First released as what was described as the industry's first comprehensive set of cybersecurity safety evaluations for LLMs, CyberSecEval has reached version 3 as of 2025. The benchmarks are based on industry standards including CWE (Common Weakness Enumeration) and MITRE ATT&CK, and they evaluate the frequency of insecure code suggestions, the ease of generating malicious code, and multilingual prompt injection resilience [16].
Promptfoo is an open-source tool for testing and evaluating LLM applications. While broader in scope than pure red teaming, it includes adversarial testing capabilities and integrates with other red teaming datasets and methodologies, including CyberSecEval. Promptfoo supports automated scanning for common vulnerability classes and can be integrated into continuous integration pipelines [14].
| Tool | Developer | Primary strength | Open source | Key technique |
|---|---|---|---|---|
| Garak | NVIDIA | Broad vulnerability scanning with modular plugins | Yes | Plugin-based probing across multiple failure categories |
| PyRIT | Microsoft | Multi-turn attack orchestration | Yes | Crescendo and TAP attack strategies |
| PAIR | Academic researchers | Automated jailbreak generation | Yes | Iterative prompt refinement using attacker LLM |
| TAP | Academic researchers | Efficient attack space exploration | Yes | Tree-structured search with pruning |
| CyberSecEval | Meta | Cybersecurity-focused benchmarking | Yes | Standards-based evaluation (CWE, MITRE ATT&CK) |
| Promptfoo | Promptfoo Inc. | LLM application testing and evaluation | Yes | Integration with multiple scanning frameworks |
A recommended layered approach to automated red teaming involves three tiers: Layer 1 uses broad scanners like Garak or Promptfoo for initial vulnerability discovery; Layer 2 applies compliance-focused scans aligned with specific standards; Layer 3 employs deep exploitation tools like PyRIT for sophisticated multi-turn campaigns [14].
On October 30, 2023, U.S. President Joe Biden signed Executive Order 14110, titled "Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence." The order placed AI red teaming at the center of federal AI governance by requiring companies developing dual-use foundation models to conduct red team testing and report results to the government [17].
The executive order defined AI red teaming as "a structured testing effort to find flaws and vulnerabilities in an AI system, often in a controlled environment and in collaboration with developers of AI." It specified that red teams should adopt adversarial methods to identify flaws such as harmful or discriminatory outputs, unforeseen system behaviors, limitations, and potential misuse risks [17].
Key provisions included:
However, on January 20, 2025, President Donald Trump revoked Executive Order 14110 within hours of taking office. Three days later, Trump signed Executive Order 14179, "Removing Barriers to American Leadership in Artificial Intelligence," which shifted emphasis from oversight and mandatory testing to deregulation and innovation promotion. The revocation effectively halted the mandatory red teaming and reporting requirements that had been established under the Biden order [18].
The EU AI Act, which entered into force on August 1, 2024, includes explicit requirements for adversarial testing of general-purpose AI (GPAI) models. For GPAI models designated as posing systemic risk (those exceeding 10^25 floating-point operations in training), providers must [19]:
The European Commission launched the GPAI Code of Practice in 2025, providing voluntary benchmarks for companies building or deploying foundation models. Full enforcement of the AI Act's provisions, including penalties of up to 35 million EUR or 7% of global revenue for noncompliance, is expected to take effect in stages through 2026, following formal designation of systemic risk models by the European Commission [19].
Red teaming is one component of a broader ecosystem of AI safety evaluation methods. While red teaming focuses on adversarial testing (actively trying to make systems fail), safety evaluation also encompasses capability assessments, alignment testing, and dangerous capability evaluations.
METR is a nonprofit research institute based in Berkeley, California, that evaluates frontier AI models' capabilities to carry out long-horizon, agentic tasks. Originally formed as ARC Evals within the Alignment Research Center (founded by Paul Christiano in 2021), the group spun off as an independent nonprofit in December 2023 and was renamed METR (a reference to metrology, the science of measurement) [20].
METR conducts pre-deployment evaluations of frontier models in partnership with AI developers including Anthropic and OpenAI. The organization has contributed to system cards for models including OpenAI's o3, o4-mini, GPT-4o, and GPT-4.5, as well as Anthropic's Claude models. METR's research focuses on measuring how AI agents perform on progressively longer and more complex tasks, and the organization has found that this capability has been doubling approximately every seven months over the past six years [20].
As of late 2025, twelve companies had published frontier AI safety policies that reference external evaluation partnerships: Anthropic, OpenAI, Google DeepMind, Magic, Naver, Meta, G42, Cohere, Microsoft, Amazon, xAI, and NVIDIA [20].
The UK AI Security Institute, established following the Bletchley Park AI Safety Summit in November 2023 with approximately 100 million GBP in funding, has built one of the world's largest safety evaluation teams. The institute conducts pre-deployment evaluations of frontier models and has performed joint evaluations with its U.S. counterpart. Its 2025 Frontier AI Trends Report found that while model capabilities in cybersecurity and scientific domains had improved dramatically, safety guardrails could still be routinely circumvented [9].
Created within NIST following Executive Order 14110, the U.S. AI Safety Institute collaborates with the UK institute on joint frontier model evaluations. Following the change in administration in January 2025, the institute was reorganized and renamed the Center for AI Standards and Innovation (CAISI) [9].
Despite its importance, AI red teaming faces several significant challenges.
Lack of standardization. Different organizations use different techniques, taxonomies, and success criteria when red teaming the same types of models. This makes it difficult to compare the safety of different AI systems or to aggregate findings across the industry. The Georgetown CSET has noted that the term "red teaming" is used so broadly in the AI context that it can refer to activities ranging from informal prompt testing to rigorous, structured evaluations [3].
Arms race dynamics. Red teaming exists in an adversarial relationship with safety measures. As safety training improves, red teamers develop more sophisticated attacks; as attacks become known, developers patch them. This dynamic means that a clean bill of health from one round of red teaming provides limited assurance about future safety. VentureBeat reported in 2025 that this pattern "exposes a harsh truth about the AI security arms race": no amount of testing can guarantee that a model is safe against all possible attacks [8].
Scalability. Manual red teaming is labor-intensive and cannot cover the vast space of possible inputs to a modern AI system. Automated tools improve coverage but may miss subtle, context-dependent vulnerabilities that require human understanding. Finding the right balance between human expertise and automated scale remains an open problem.
Evaluation of evaluators. Assessing the quality and completeness of a red teaming exercise is itself difficult. There are no universally accepted benchmarks for red team performance. A red team that finds no vulnerabilities might be testing a genuinely safe system, or it might simply not be creative or persistent enough.
Dual-use concerns. Publishing detailed descriptions of successful attacks can help defenders improve their systems, but it also provides a roadmap for malicious actors. The red teaming community must balance transparency (which advances the field) against the risk of enabling harm.
Access and resources. Effective red teaming of frontier models requires access to those models, often before public deployment. This creates a dependency on AI developers being willing to share access with external testers. It also requires significant computational resources and domain expertise that not all organizations can afford.
Based on guidance from NIST, major AI laboratories, and the broader research community, several best practices have emerged for AI red teaming.
| Best practice | Description |
|---|---|
| Define clear scope and objectives | Specify which risks are in scope, what constitutes a successful attack, and how findings will be prioritized |
| Use diverse red teams | Include people with varied backgrounds, skills, languages, and cultural perspectives to maximize the range of discovered vulnerabilities |
| Combine manual and automated methods | Use automation for breadth and human expertise for depth |
| Test throughout the lifecycle | Conduct red teaming before deployment, after major updates, and on an ongoing basis |
| Engage domain experts | For high-risk areas like CBRN, cybersecurity, and child safety, involve specialists with relevant expertise |
| Document everything | Maintain detailed records of all inputs, outputs, and contextual factors for each test |
| Establish responsible disclosure | Define clear processes for handling and communicating discovered vulnerabilities |
| Iterate on mitigations | After addressing vulnerabilities, retest to verify that fixes are effective and have not introduced new problems |
| Consider real-world deployment context | Test the system as it will actually be used, including with the specific user interfaces, system prompts, and integrations that will be in production |
| Follow established taxonomies | Use frameworks like OWASP Top 10 for LLMs or MITRE ATLAS to ensure comprehensive coverage |
Microsoft's AI Red Team has emphasized three key lessons from testing over 100 generative AI products: security and responsible AI risks should be tested together rather than in silos; red teaming must account for the full system (not just the model in isolation); and the most impactful vulnerabilities often arise from the interaction between the model and its deployment context rather than from the model alone [13].
As of early 2026, AI red teaming is transitioning from an ad hoc practice to an expected component of responsible AI development and, in some jurisdictions, a legal requirement.
Regulatory pressure is growing. The EU AI Act's requirements for adversarial testing of GPAI models with systemic risk took effect in August 2025, with full enforcement approaching in 2026. While the U.S. federal government stepped back from mandatory red teaming requirements after revoking Executive Order 14110, individual states have enacted their own AI safety laws, and industry self-regulation through frameworks like Anthropic's Responsible Scaling Policy and OpenAI's Preparedness Framework continues to incorporate red teaming as a core practice [18] [19].
Tooling is maturing. The ecosystem of automated red teaming tools has grown substantially. PyRIT, Garak, Promptfoo, and other open-source tools are increasingly integrated into continuous development pipelines. Microsoft launched an AI Red Teaming Agent in 2025, and the layered approach to automated testing (broad scanning, compliance testing, deep exploitation) has become a recognized methodology [14] [15].
Public participation is expanding. Events like the DEF CON AI Village's Generative Red Team challenges have demonstrated the value of crowdsourced adversarial testing. The 2024 Generative AI Red Teaming Transparency Report, published by Humane Intelligence, documented findings from large-scale public red teaming exercises and provided recommendations for future events [6].
CBRN and national security testing is intensifying. Partnerships between AI laboratories and government agencies for CBRN evaluation have deepened. Anthropic's partnership with the National Nuclear Security Administration, expanded CBRN evaluations across the industry, and the FORTRESS (Frontier Risk Evaluation for National Security and Public Safety) framework published in 2025 all reflect the growing emphasis on national security dimensions of AI red teaming [7].
The OWASP Top 10 for LLMs continues to evolve. The 2025 edition introduced five new vulnerability categories specific to generative AI: excessive agency, system prompt leakage, vector and embedding weaknesses, misinformation, and unbounded consumption. Prompt injection retained its top position for the second year, while sensitive information disclosure rose sharply [10].
Frontier model capabilities are outpacing defenses. The UK AI Security Institute's 2025 report found that frontier models could complete apprentice-level cybersecurity tasks 50% of the time, up from roughly 10% in early 2024. In chemistry and biology, AI models exceeded PhD-level expert performance by up to 60% on some benchmarks. Yet guardrails remained routinely circumventable [9]. This widening gap between model capability and safety assurance underscores the ongoing importance of red teaming, even as practitioners acknowledge that it cannot, on its own, guarantee safety.