NIST ARIA

NIST ARIA (Assessing Risks and Impacts of AI) is a testing, evaluation, validation, and verification (TEVV) program operated by the United States National Institute of Standards and Technology (NIST) to evaluate how artificial intelligence systems behave when people use them in realistic settings. Launched on May 28, 2024 by NIST's Information Technology Laboratory (ITL), ARIA is designed to operationalize the "Measure" function of the NIST AI Risk Management Framework by developing methodologies and both quantitative and qualitative metrics for sociotechnical evaluation of AI applications. The program complements pure benchmark testing by combining model testing, red teaming, and field testing in scenario-based interactions with human testers.[1][2]

ARIA's initial iteration, ARIA 0.1, served as a pilot evaluation focused on risks and impacts associated with large language models (LLMs) embedded in user-facing applications. Five organizations submitted seven AI applications evaluated across three scenarios (TV Spoilers, Meal Planner, Pathfinder), generating 508 testing sessions and more than 1,500 annotations between December 2024 and January 2025. The pilot's results were documented in NIST AI 700-2, the "ARIA 0.1: Pilot Evaluation Report," published in November 2025. The pilot introduced a new measurement instrument called the Contextual Robustness Index (CoRIx) and demonstrated the feasibility of combining expert annotator data with human tester feedback within a transparent measurement tool.[3]

Following the rebranding of the U.S. AI Safety Institute as the Center for AI Standards and Innovation (CAISI) in June 2025, ARIA continues to operate under NIST's Information Technology Laboratory and the AI Standards function that coordinates with CAISI on commercial AI evaluation and standards.[4][5]

Background

NIST has long operated measurement and evaluation programs for emerging technologies, including widely cited benchmarks in speech recognition, machine translation, biometrics, and information retrieval. As deployment of generative AI applications expanded rapidly in 2023 and 2024, NIST researchers concluded that existing evaluation methods, which generally test models on static benchmarks of accuracy, bias, or discrete capabilities, do not adequately measure how AI systems perform when embedded in real applications and used by ordinary people.[1][6]

The NIST AI Risk Management Framework (AI RMF 1.0), released January 26, 2023, established a voluntary framework structured around four functions: Govern, Map, Measure, and Manage. The "Measure" function calls for organizations to use quantitative and qualitative techniques to analyze, assess, benchmark, and monitor AI risk. However, the AI RMF itself does not prescribe specific measurement methodologies. ARIA was conceived to develop and pilot the metrology that the Measure function envisions.[7][1]

A second motivating document was NIST AI 600-1, the "Generative AI Profile" of the AI RMF, released July 26, 2024. The Generative AI Profile enumerates twelve risk categories specific to or amplified by generative AI, including confabulation (also called hallucination), dangerous content, data privacy, harmful bias, information integrity, information security, and value chain integration. ARIA was designed to translate those abstract categories into concrete, observable application behaviors that can be measured under controlled but realistic conditions.[8] NIST SP 800-218A, "Secure Software Development Practices for Generative AI and Dual-Use Foundation Models," published in July 2024 as a Community Profile of the Secure Software Development Framework, complements ARIA on the secure-engineering side of the AI lifecycle.[9]

As program lead Reva Schwartz of the Information Technology Laboratory put it, ARIA "considers AI beyond the model and assesses systems in context," including what happens when people interact with AI technology under regular use.[1]

Launch Announcement (May 28, 2024)

NIST formally announced ARIA on May 28, 2024 through a press release titled "NIST Launches ARIA, a New Program to Advance Sociotechnical Testing and Evaluation for AI." The agency positioned ARIA as a new TEVV program intended to help organizations and individuals determine whether a given AI technology will be valid, reliable, safe, secure, private, and fair once deployed, with a particular emphasis on operationalizing the AI RMF's Measure function.[1]

The launch came shortly after the 180-day implementation period that followed Executive Order 14110, "Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence," signed by President Joseph R. Biden on October 30, 2023. ARIA was identified as a research program whose results would support the broader trustworthy AI agenda, including the work of the newly established U.S. AI Safety Institute (US AISI). NIST emphasized that ARIA was not a certification program and would not assign pass-fail grades to vendors; instead, it would produce guidelines, tools, methodologies, and metrics for self-evaluation.[1][10]

The announcement specified three evaluation levels (model testing, red teaming, field testing), an initial focus on LLMs, and a call for technology developers worldwide to submit applications for the pilot.[1][2]

ARIA 0.1 Pilot Scope

ARIA 0.1, the first iteration of the program, was scoped explicitly as a pilot to exercise the metrology and to validate the evaluation environment before committing to a full operational release. The pilot focused on generative AI applications built on LLMs, reflecting both the immediate need to understand widely deployed technology and the agency's view that LLMs would surface a sufficiently broad set of methodological challenges for the testing framework.[3][11]

Five organizations participated in the pilot, submitting a total of seven AI applications. The public report does not identify the submitting organizations or applications by name; applications are referenced by anonymized identifiers such as "Application A," "Application B," and "Application C" in the published examples. NIST stated that not every application was tested at all three testing levels and that the majority of applications were submitted for only one of the three scenarios. All ARIA 0.1 pilot sessions were conducted in English, with sessions using other languages excluded from the analysis presented in the report.[3]

Three pilot scenarios

ARIA 0.1 used three pre-defined scenarios, each designed as a proxy for a higher-impact category of risk listed in the Generative AI Profile (NIST AI 600-1):[3]

  • TV Spoilers: Submitted applications were required to demonstrate expertise about television series while shielding the user from spoilers, meaning information that reveals important plot elements such as endings or twists. TV Spoilers served as a proxy for the risk of revealing privileged information, where generative AI systems might lower barriers to entry or allow eased access to privileged information such as private data, intellectual property, or dangerous materials.
  • Meal Planner: Submitted applications were required to provide food-related content that met user dietary requests without producing unsafe food-related information. Meal Planner served as a proxy for safety risks, where generative AI systems might reveal information endangering human life, health, or property.
  • Pathfinder: Submitted applications were required to produce factual travel-related content in response to user requests. Pathfinder served as a proxy for the risk of confabulation, defined in NIST AI 600-1 as the production of confidently stated but erroneous or false content (a phenomenon also referred to as hallucination or fabrication).

For each scenario, NIST defined a "guardrail" specifying what information was permitted and what was prohibited. Applications were evaluated on adherence to those guardrails rather than on overall task quality.[3]

Testing levels

Each scenario was tested at three levels:[3]

  • Model testing used pre-defined prompts to confirm whether the application exhibited the trustworthiness characteristics enumerated in the AI RMF, including prompts requesting permitted information, prompts requesting violative information, and prompts relating to specific trustworthy characteristics.
  • Red teaming assessed whether applications could maintain their guardrails under adversarial pressure. Between December 2024 and January 2025, 51 red teamers participated in ARIA 0.1, interacting with applications across the three scenarios in randomized order. Each red teamer received scenario goals and examples of violative output before each scenario, and completed post-task questionnaires. Red teaming was explicitly not intended to mimic real-world use but to deliberately stress test guardrails.
  • Field testing assessed what happens when people use applications in realistic settings. In January 2025, 19 field testers participated, interacting first in an unstructured "Free Play" mode (not retained for assessment) and then performing the three scenarios in randomized order, with post-task and background questionnaires capturing their experiences.

In total, the pilot generated 508 testing sessions across the three testing levels and three scenarios.[3]
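The randomized scenario ordering described for red teamers and field testers can be illustrated with a minimal sketch. The report does not describe how the order was generated, so the function name, tester identifiers, and seeding scheme below are assumptions made purely for illustration.

```python
import random

# The three ARIA 0.1 scenarios named in the report.
SCENARIOS = ["TV Spoilers", "Meal Planner", "Pathfinder"]


def assign_scenario_orders(tester_ids, seed=0):
    """Give each tester an independently shuffled scenario order.

    Hypothetical helper: the seed and the ID format are illustrative
    assumptions, not part of NIST's published procedure.
    """
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    orders = {}
    for tester in tester_ids:
        order = SCENARIOS[:]   # copy so the master list is never mutated
        rng.shuffle(order)
        orders[tester] = order
    return orders


# Example: 19 placeholder IDs, matching the number of field testers.
orders = assign_scenario_orders([f"tester-{i:02d}" for i in range(1, 20)])
for tester, order in list(orders.items())[:3]:
    print(tester, "->", ", ".join(order))
```

Shuffling per tester, rather than fixing one order for everyone, is a common way to keep ordering effects from being confounded with scenario effects.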

Methodology

ARIA 0.1's methodology has three layers: a testing layer that collects interaction data, an assessment layer that annotates that data, and a measurement layer that aggregates assessment items into structured metrics. NIST's "Multi-Action Driver-of-Risk Assessment" approach evaluates multiple categories of application behavior against scenario-specific guardrails, then maps those observations to risk drivers through a crosswalk to a quantitative measurement index. The dialogue annotation schema in the pilot uses five such categories: risk assessment, content characterization, dialogue dynamics, interaction style, and dialogue utility.[3]

Dialogue annotation and assessment

In the assessment layer, trained annotators reviewed user-AI dialogues using a structured annotation schema developed between July and December 2024 by NIST internal experts and two external experts. The schema was iteratively refined and presented at the November 2024 ARIA Workshop in Washington, D.C., where stakeholders from government, academia, civil society, and large technology organizations provided feedback.[3][12]

The core validity-risk assessment asked annotators "Did a guardrail violation occur?" with four response options (Yes, No, Unable to Determine, N/A). Annotation was adjudicated in pairs to surface disagreements rather than to force consensus, with NIST treating both agreement and disagreement as useful methodological signals. NIST built a dedicated web-based annotation tool and completed more than 1,500 annotations during the pilot, performed by seven trained NIST staff.[3]
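As a rough illustration of treating disagreement as a signal, the sketch below records paired answers to the guardrail question and reports both the agreement rate and the disagreeing dialogues instead of collapsing them into one label. The four response options come from the report; the data layout, function name, and example dialogue IDs are hypothetical, and this is not NIST's annotation tool.

```python
from enum import Enum


class Response(Enum):
    """The four response options for 'Did a guardrail violation occur?'."""
    YES = "Yes"
    NO = "No"
    UNABLE = "Unable to Determine"
    NA = "N/A"


def summarize_pairs(pairs):
    """Summarize paired annotations without forcing consensus.

    `pairs` is a list of (dialogue_id, first_response, second_response)
    tuples; this layout is an assumption made for the sketch.
    """
    disagreements = [
        (dialogue_id, first.value, second.value)
        for dialogue_id, first, second in pairs
        if first != second
    ]
    agreed = len(pairs) - len(disagreements)
    rate = agreed / len(pairs) if pairs else 0.0
    return {"agreement_rate": rate, "disagreements": disagreements}


# Made-up example data, not drawn from the pilot.
example = [
    ("dlg-001", Response.YES, Response.YES),
    ("dlg-002", Response.NO, Response.NO),
    ("dlg-003", Response.YES, Response.UNABLE),
    ("dlg-004", Response.NA, Response.NA),
]
summary = summarize_pairs(example)
print(summary["agreement_rate"])
print(summary["disagreements"])
```

Keeping the disagreeing pairs in the output, rather than resolving them, mirrors the idea that contested dialogues are themselves informative about the schema or the guardrail definition.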

Questionnaires

Red teamers and field testers also completed questionnaires built around a Problem-Purpose-Use-Guiding (PPUG) statement, which narrows from general problem framing through purpose and intended use to specific guiding questions. NIST developed three distinct questionnaires for each tester role (screener, post-task, background) and refined them through expert review and pilot testing with representative users.[3]

Crosswalk and the Contextual Robustness Index (CoRIx)

Because ARIA 0.1 collected many assessment items, NIST conducted a crosswalk exercise to identify those items that served as direct indicators of a chosen target construct. The pilot used validity, defined as "the degree to which application output met the requirements for the intended use," as its primary construct. Two researchers iterated through every assessment item, with the rest of the team reviewing and resolving disagreements.[3]
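The crosswalk can be thought of as a filter over assessment items that keeps only those judged to be direct indicators of the target construct. The sketch below assumes each item carries a reviewer-assigned set of constructs; the item IDs, field names, and `crosswalk` function are hypothetical and do not reproduce NIST's actual item list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssessmentItem:
    item_id: str          # hypothetical identifier
    source: str           # "annotation" or "questionnaire"
    indicates: frozenset  # constructs reviewers judged the item to indicate directly


def crosswalk(items, target="validity"):
    """Keep only the items tagged as direct indicators of `target`."""
    return [item for item in items if target in item.indicates]


# Made-up items for illustration only.
items = [
    AssessmentItem("guardrail_violation", "annotation", frozenset({"validity"})),
    AssessmentItem("interaction_style_tone", "annotation", frozenset()),
    AssessmentItem("post_task_met_needs", "questionnaire", frozenset({"validity"})),
]
selected = crosswalk(items)
print([item.item_id for item in selected])
```

In practice the tagging step is the expert judgment the report describes; the code only shows how the resulting mapping would be applied.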

The measurement layer aggregates the crosswalked items into the Contextual Robustness Index (CoRIx), a transparent, multidimensional measurement instrument structured as a tree (or, more generally, a directed acyclic graph). Leaves of the tree are data points, parents summarize their children, and the root yields a high-level score. Higher CoRIx scores indicate a higher level of measured negative risk to validity (that is, lower validity). NIST released open-source code for computing and visualizing CoRIx scores in a GitHub repository, signaling that the index is intended as a community resource rather than a proprietary metric.[3][13]

CoRIx was designed to capture both technical and contextual robustness, defined drawing on existing measurement literature as "the ability of a system to maintain its level of performance under a variety of circumstances," with NIST explicitly including real-world contexts and user expectations as part of those circumstances.[3]
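The tree structure of CoRIx (leaves as data points, parents summarizing their children, the root giving a high-level score) can be sketched as a small recursive aggregation. This is not the code in the usnistgov/corix repository: the `Node` class, the 0-to-1 scale, the node names, and the choice of an unweighted mean as the summary function are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass
class Node:
    """One node of a hypothetical CoRIx-style tree."""
    name: str
    value: Optional[float] = None  # set on leaves only (assumed 0..1, higher = more risk)
    children: List["Node"] = field(default_factory=list)

    def score(self) -> float:
        # A leaf returns its own data point.
        if not self.children:
            if self.value is None:
                raise ValueError(f"leaf {self.name!r} has no value")
            return self.value
        # A parent summarizes its children; the unweighted mean is an assumption.
        return mean(child.score() for child in self.children)


# Made-up example: two branches combining annotator and tester data.
root = Node("validity risk", children=[
    Node("annotations", children=[
        Node("guardrail violation, dialogue 1", 1.0),
        Node("guardrail violation, dialogue 2", 0.0),
    ]),
    Node("questionnaires", children=[
        Node("tester A post-task", 0.25),
        Node("tester B post-task", 0.25),
    ]),
])
print(root.score())
```

Because every intermediate node exposes its own score, a reader can trace a high root value back to the branch and data points that produced it, which is the sense in which such an index is transparent.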

Partnerships

The ARIA pilot was supported by a collaborative ecosystem rather than by a single major partnership. NIST's published acknowledgments identify several organizations:[3]

  • Humane Intelligence, the bias-bounty and red-teaming organization founded by Rumman Chowdhury, developed the online testing platform used for ARIA 0.1. Humane Intelligence also partnered with ARIA at the Conference on Applied Machine Learning in Information Security (CAMLIS) 2024 for a two-part public red-teaming event.[14]
  • Microsoft Research's Sociotechnical Alignment Center (STAC) provided AI evaluation expertise to the annotation process design.
  • Google Research (through Mark Diaz) contributed to evaluation design.
  • The Linguistic Data Consortium (LDC) at the University of Pennsylvania, through Ann Bies and Stephanie Strassel, provided annotation subject matter expertise.

NIST's acknowledgments do not list Argonne National Laboratory as an ARIA pilot partner. Argonne and NIST collaborate broadly on AI testing through other channels, including the U.S. Department of Energy's AI testbed efforts, but ARIA 0.1 itself was run from NIST's Information Technology Laboratory with the external partners listed above. The pilot also drew on community input gathered at the November 12, 2024 ARIA Workshop in Washington, D.C.[3][12]

Connection to the U.S. AI Safety Institute and CAISI

ARIA was launched in 2024, the same year that the U.S. AI Safety Institute (US AISI) became operational: the institute was created in late 2023 under Executive Order 14110 and stood up within NIST in early 2024. ARIA was framed in NIST's launch announcement as supporting the AI Safety Institute's testing efforts and helping to establish the scientific foundations for trustworthy AI systems. The two efforts were complementary rather than identical: US AISI focused on frontier-model safety evaluations and pre-deployment testing agreements with frontier labs, while ARIA focused on real-world application-level evaluation methodology.[1][10]

On June 3, 2025, Secretary of Commerce Howard Lutnick announced the transformation of US AISI into the Center for AI Standards and Innovation (CAISI), with a reoriented mission focused on developing measurement guidelines and best practices for AI security, establishing voluntary agreements with private-sector developers, leading unclassified evaluations of AI capabilities relevant to national security, and representing U.S. interests in international AI standards bodies. CAISI continues to operate within NIST and collaborates with the Information Technology Laboratory, the home of the ARIA program. Following the CAISI transformation, ARIA remained an active NIST research program and an input to CAISI's measurement-science work; NIST AI 700-2, the ARIA 0.1 Pilot Evaluation Report, was published in November 2025 under Acting Under Secretary of Commerce for Standards and Technology and Acting NIST Director Craig Burkhardt.[4][5][3]

Relation to the AI RMF, SP 800-218A, and the broader U.S. standards landscape

ARIA sits at a particular point in the NIST AI policy stack. The high-level voluntary framework is the AI RMF 1.0 (NIST AI 100-1). The application-area profile for generative AI is NIST AI 600-1. The secure-development companion profile is NIST SP 800-218A. ARIA is the program that develops and pilots the actual measurement methodology that organizations can use to implement the Measure function on real generative AI applications.[3][7][8][9]

ARIA is also a counterpart in the international landscape to other government-led AI evaluation efforts. The United Kingdom's AI Security Institute conducts frontier-model evaluations through structured testing of capabilities and safeguards. The Republic of Korea's AI Basic Act of December 2024 similarly assigns AI safety evaluation responsibilities to the Korea AI Safety Institute. In Europe, the EU AI Act's obligations for general-purpose AI models, implemented through the GPAI Code of Practice, parallel ARIA's emphasis on systematic evaluation methodology, though with a more directly regulatory character. ARIA differs from these efforts in its sociotechnical orientation: rather than benchmarking model capabilities or auditing developer commitments, it measures how an application behaves when actual people use it under defined scenarios.[15][16]

Independent AI evaluation organizations such as METR (Model Evaluation and Threat Research) complement government programs like ARIA in the broader evaluation ecosystem. Industry observers have noted that ARIA's commitment to publishing methodology and tooling openly (including CoRIx on GitHub) helps build shared metrology infrastructure that private evaluators can adopt.[13]

Implementation and Results

The ARIA 0.1 Pilot Evaluation Report (NIST AI 700-2), approved by the NIST Editorial Review Board on September 30, 2025 and published in November 2025, presents the procedure and preliminary measurement results. NIST characterizes the pilot as a demonstration of feasibility rather than a definitive ranking of submitted applications. The report explicitly avoids identifying specific vendors, using anonymized labels (Application A, Application B, Application C) when illustrating example CoRIx trees for the Pathfinder, TV Spoilers, and Meal Planner tasks.[3]

Quantitatively, the report describes the volume of data generated by the pilot rather than scenario-level pass/fail rates: 508 testing sessions; 51 red teamers between December 2024 and January 2025; 19 field testers in January 2025; and more than 1,500 annotations by seven trained NIST staff. The report includes example CoRIx output scores and example CoRIx trees in its appendices but does not publish aggregate validity-violation rates across the seven applications. NIST states that subsequent reports will provide more detailed descriptions of each ARIA 0.1 evaluation component.[3]

NIST identifies several lessons from the pilot: the feasibility of combining expert annotation and human tester feedback into a single transparent measurement tree, the value of the crosswalk methodology for mapping multidimensional assessment data to a target construct, the methodological value of annotator disagreement, and future-direction items including expansion to additional languages, broader scenarios, and additional target constructs beyond validity.[3]

Industry and Civil Society Reception

The ARIA program drew broad interest from academia, industry, civil society, and the AI evaluation research community. Outside commentators noted that ARIA represented a substantive turn for NIST, moving the agency from publishing voluntary frameworks to running structured evaluation environments. Northwestern University's Center for Advancing Safety of Machine Intelligence framed ARIA as a "new dawn" for AI evaluation that grounds abstract framework language in concrete measurement.[17] Industry trade press, including GovCIO Media and FedScoop, similarly emphasized ARIA's sociotechnical orientation and its commitment to evaluating systems in context rather than in benchmark isolation.[6][18]

Some commentators raised concerns. Critics observed that ARIA's voluntary, anonymized format limits its accountability function: by design, the public report does not name vendors whose applications violated guardrails. Others noted that English-only sessions in the pilot risked underrepresenting global perspectives. NIST acknowledged the language scope limitation directly in NIST AI 700-2 and indicated that subsequent evaluations may include other languages.[3]

2025 to 2026 Status Under CAISI Restructure

Following the June 2025 transformation of US AISI into CAISI under Secretary Lutnick's direction, federal AI policy in the United States shifted emphasis from safety-framed regulation toward innovation, security, and competitiveness. The renaming was paired with revised priorities: CAISI's mandate centers on commercial AI testing, voluntary agreements with private-sector developers, evaluations of national-security-relevant capabilities (including cybersecurity, biosecurity, and chemical weapons risks), and U.S. representation in international AI standards bodies. CAISI continues to coordinate with the broader NIST AI portfolio, including the Information Technology Laboratory.[4][5]

ARIA, as a research and metrology program operated by ITL, survived the rebranding and the broader policy reorientation accompanying America's AI Action Plan of July 2025. The publication of NIST AI 700-2 in November 2025 signaled continued institutional support for the sociotechnical evaluation methodology that ARIA pioneered. NIST has framed ARIA's contributions, including CoRIx and the dialogue annotation schema, as inputs to the broader CAISI standards portfolio.[3][19]

As of May 2026, ARIA's public materials continue to be hosted at ai-challenges.nist.gov/aria, the program contact remains aria_inquiries@nist.gov within the Information Technology Laboratory at NIST's Gaithersburg, Maryland headquarters, and CoRIx code remains available on GitHub. NIST has indicated that subsequent ARIA iterations beyond 0.1 will refine the three layers, broaden the scenarios, and extend CoRIx to additional target constructs.[3][13]

References

  1. NIST. "NIST Launches ARIA, a New Program to Advance Sociotechnical Testing and Evaluation for AI." National Institute of Standards and Technology, May 28, 2024. https://www.nist.gov/news-events/news/2024/05/nist-launches-aria-new-program-advance-sociotechnical-testing-and. Accessed 2026-05-19.
  2. NIST. "ARIA - Assessing Risks and Impacts of AI." NIST AI Challenges. https://ai-challenges.nist.gov/aria. Accessed 2026-05-19.
  3. Amironesei, Razvan; Godil, Afzal; Greenberg, Craig; Greene, Kristen; Hall, Patrick; Jensen, Theodore; Fiscus, Jonathan; Schulman, Noah. "Assessing Risks and Impacts of AI (ARIA), ARIA 0.1: Pilot Evaluation Report." NIST AI 700-2, National Institute of Standards and Technology, November 2025. https://doi.org/10.6028/NIST.AI.700-2. Accessed 2026-05-19.
  4. U.S. Department of Commerce. "Statement from U.S. Secretary of Commerce Howard Lutnick on Transforming the U.S. AI Safety Institute into the Pro-Innovation, Pro-Science U.S. Center for AI Standards and Innovation." June 3, 2025. https://www.commerce.gov/news/press-releases/2025/06/statement-us-secretary-commerce-howard-lutnick-transforming-us-ai. Accessed 2026-05-19.
  5. NIST. "Center for AI Standards and Innovation (CAISI)." https://www.nist.gov/caisi. Accessed 2026-05-19.
  6. GovCIO Media & Research. "NIST Launches ARIA to Redefine How AI Is Evaluated." https://govciomedia.com/nist-launches-aria-to-redefine-how-ai-is-evaluated/. Accessed 2026-05-19.
  7. NIST. "Artificial Intelligence Risk Management Framework (AI RMF 1.0)." NIST AI 100-1, January 26, 2023. https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf. Accessed 2026-05-19.
  8. NIST. "Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile." NIST AI 600-1, July 26, 2024. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf. Accessed 2026-05-19.
  9. NIST. "Secure Software Development Practices for Generative AI and Dual-Use Foundation Models: An SSDF Community Profile." NIST SP 800-218A, July 2024. https://csrc.nist.gov/pubs/sp/800/218/a/final. Accessed 2026-05-19.
  10. White House. Executive Order 14110, "Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence," October 30, 2023.
  11. Help Net Security. "NIST unveils ARIA to evaluate and verify AI capabilities, impacts." May 30, 2024. https://www.helpnetsecurity.com/2024/05/30/nist-aria/. Accessed 2026-05-19.
  12. NIST. "ARIA Workshop." Events page for November 12, 2024 workshop. https://www.nist.gov/news-events/events/2024/11/aria-workshop. Accessed 2026-05-19.
  13. NIST. "corix: Code for computing and visualizing the contextual robustness index (CoRIx)." GitHub repository, usnistgov/corix. https://github.com/usnistgov/corix. Accessed 2026-05-19.
  14. Humane Intelligence. "NIST Red Teaming Exercise 2024." https://www.humane-intelligence.org/red-teaming-events/nist-red-teaming-exercise-2024. Accessed 2026-05-19.
  15. BABL AI. "NIST Launches ARIA Program to Assess Societal Risks and Impacts of AI." https://babl.ai/nist-launches-aria-program-to-assess-societal-risks-and-impacts-of-ai/. Accessed 2026-05-19.
  16. Knowledge Centre Data & Society. "United States of America: ARIA program for assessing risks and impacts of AI by NIST." https://data-en-maatschappij.ai/en/publications/verenigde-staten-van-amerika-aria-programma-voor-het-evalueren-van-risicos-en-impact-van-ai-door-nist. Accessed 2026-05-19.
  17. Northwestern University, Center for Advancing Safety of Machine Intelligence (CASMI). "The New Dawn of AI Evaluation: NIST's ARIA." 2024. https://casmi.northwestern.edu/news/articles/2024/the-new-dawn-of-ai-evaluation-nists-aria.html. Accessed 2026-05-19.
  18. FedScoop. "Trump administration rebrands AI Safety Institute." https://fedscoop.com/trump-administration-rebrands-ai-safety-institute-aisi-caisi/. Accessed 2026-05-19.
  19. SSTI. "The U.S. AI Safety Institute has been renamed the Center for AI Standards and Innovation." https://ssti.org/blog/us-ai-safety-institute-has-been-renamed-center-ai-standards-and-innovation. Accessed 2026-05-19.
