NIST ARIA
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,500 words
NIST ARIA (Assessing Risks and Impacts of AI) is a testing, evaluation, validation, and verification (TEVV) program operated by the United States National Institute of Standards and Technology (NIST) to evaluate how artificial intelligence systems behave when people use them in realistic settings. Launched on May 28, 2024, by NIST's Information Technology Laboratory (ITL), ARIA is designed to operationalize the "Measure" function of the NIST AI Risk Management Framework by developing methodologies and quantitative and qualitative metrics for sociotechnical evaluation of AI applications. The program complements pure benchmark testing by combining model testing, red teaming, and field testing in scenario-based interactions with human testers.[^1][^2]
ARIA's initial iteration, ARIA 0.1, served as a pilot evaluation focused on risks and impacts associated with large language models (LLMs) embedded in user-facing applications. Five organizations submitted seven AI applications evaluated across three scenarios (TV Spoilers, Meal Planner, Pathfinder), generating 508 testing sessions and more than 1,500 annotations between December 2024 and January 2025. The pilot's results were documented in NIST AI 700-2, the "ARIA 0.1: Pilot Evaluation Report," published in November 2025. The pilot introduced a new measurement instrument called the Contextual Robustness Index (CoRIx) and demonstrated the feasibility of combining expert annotator data with human tester feedback within a transparent measurement tool.[^3]
Following the rebranding of the U.S. AI Safety Institute as the Center for AI Standards and Innovation (CAISI) in June 2025, ARIA continues to operate under NIST's Information Technology Laboratory and the AI Standards function that coordinates with CAISI on commercial AI evaluation and standards.[^4][^5]
NIST has long operated measurement and evaluation programs for emerging technologies, including widely cited benchmarks in speech recognition, machine translation, biometrics, and information retrieval. As deployment of generative AI applications expanded rapidly in 2023 and 2024, NIST researchers concluded that existing evaluation methods, which generally test models on static benchmarks of accuracy, bias, or discrete capabilities, do not adequately measure how AI systems perform when embedded in real applications and used by ordinary people.[^1][^6]
The NIST AI Risk Management Framework (AI RMF 1.0), released January 26, 2023, established a voluntary framework structured around four functions: Govern, Map, Measure, and Manage. The "Measure" function calls for organizations to use quantitative and qualitative techniques to analyze, assess, benchmark, and monitor AI risk. However, the AI RMF itself does not prescribe specific measurement methodologies. ARIA was conceived to develop and pilot the metrology that the Measure function envisions.[^7][^1]
A second motivating document was NIST AI 600-1, the "Generative AI Profile" of the AI RMF, released July 26, 2024. The Generative AI Profile enumerates twelve risk categories specific to or amplified by generative AI, including confabulation (also called hallucination), dangerous content, data privacy, harmful bias, information integrity, information security, and value chain integration. ARIA was designed to translate those abstract categories into concrete, observable application behaviors that can be measured under controlled but realistic conditions.[^8] NIST SP 800-218A, "Secure Software Development Practices for Generative AI and Dual-Use Foundation Models," published in July 2024 as a Community Profile of the Secure Software Development Framework, complements ARIA on the secure-engineering side of the AI lifecycle.[^9]
As program lead Reva Schwartz of the Information Technology Laboratory put it, ARIA "considers AI beyond the model and assesses systems in context," including what happens when people interact with AI technology under regular use.[^1]
NIST formally announced ARIA on May 28, 2024 through a press release titled "NIST Launches ARIA, a New Program to Advance Sociotechnical Testing and Evaluation for AI." The agency positioned ARIA as a new TEVV program intended to help organizations and individuals determine whether a given AI technology will be valid, reliable, safe, secure, private, and fair once deployed, with a particular emphasis on operationalizing the AI RMF's Measure function.[^1]
The launch came shortly after the 180-day mark of Executive Order 14110, "Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence," signed by President Joseph R. Biden on October 30, 2023. ARIA was identified as a research program whose results would support the broader trustworthy AI agenda, including the work of the newly established U.S. AI Safety Institute (US AISI). NIST emphasized that ARIA was not a certification program and would not assign pass-fail grades to vendors; instead, it would produce guidelines, tools, methodologies, and metrics for self-evaluation.[^1][^10]
The announcement specified three evaluation levels (model testing, red teaming, field testing), an initial focus on LLMs, and a call for technology developers worldwide to submit applications for the pilot.[^1][^2]
ARIA 0.1, the first iteration of the program, was scoped explicitly as a pilot to exercise the metrology and to validate the evaluation environment before committing to a full operational release. The pilot focused on generative AI applications built on LLMs, reflecting both the immediate need to understand widely deployed technology and the agency's view that LLMs would surface a sufficiently broad set of methodological challenges for the testing framework.[^3][^11]
Five organizations participated in the pilot, submitting a total of seven AI applications. NIST has not publicly identified the specific submitting organizations or applications by name in the public report; applications are referenced by anonymized identifiers such as "Application A," "Application B," and "Application C" in the published examples. NIST stated that not every application was tested at all three testing levels and that the majority of applications were submitted for only one of three scenarios. All ARIA 0.1 pilot sessions were conducted in English, with sessions using other languages excluded from the analysis presented in the report.[^3]
ARIA 0.1 used three pre-defined scenarios (TV Spoilers, Meal Planner, and Pathfinder), each designed as a proxy for a higher-impact category of risk listed in the Generative AI Profile (NIST AI 600-1).[^3]
For each scenario, NIST defined a "guardrail" specifying what information was permitted and what was prohibited. Applications were evaluated on adherence to those guardrails rather than on overall task quality.[^3]
Each scenario was tested across three testing levels: model testing, red teaming, and field testing.[^3]
In total, the pilot generated 508 testing sessions across the three testing levels and three scenarios.[^3]
ARIA 0.1's methodology has three layers: a testing layer that collects interaction data, an assessment layer that annotates that data, and a measurement layer that aggregates assessment items into structured metrics. The "Multi-Action Driver-of-Risk Assessment" approach that NIST has applied evaluates multiple categories of application behavior against scenario-specific guardrails, then maps those observations to risk drivers through a crosswalk to a quantitative measurement index. The dialogue annotation schema in the pilot uses five such categories: risk assessment, content characterization, dialogue dynamics, interaction style, and dialogue utility.[^3]
In the assessment layer, trained annotators reviewed user-AI dialogues using a structured annotation schema developed between July and December 2024 by NIST internal experts and two external experts. The schema was iteratively refined and presented at the November 2024 ARIA Workshop in Washington, D.C., where stakeholders from government, academia, civil society, and large technology organizations provided feedback.[^3][^12]
The core validity-risk assessment asked annotators "Did a guardrail violation occur?" with four response options (Yes, No, Unable to Determine, N/A). Annotation was adjudicated in pairs to surface disagreements rather than to force consensus, with NIST treating both agreement and disagreement as useful methodological signals. NIST built a dedicated web-based annotation tool and completed more than 1,500 annotations during the pilot, performed by seven trained NIST staff.[^3]
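The paired review described above can be illustrated with a minimal sketch. It assumes each dialogue receives one response from each of two annotators and simply reports the agreement rate while keeping the disagreements for inspection. The names `Response`, `pair_agreement`, and the dialogue identifiers are hypothetical and are not taken from NIST's annotation tool.

```python
from enum import Enum


class Response(Enum):
    """The four response options for 'Did a guardrail violation occur?'."""
    YES = "Yes"
    NO = "No"
    UNABLE = "Unable to Determine"
    NA = "N/A"


def pair_agreement(pairs):
    """Return (agreement_rate, disagreements) for paired annotations.

    `pairs` is a list of (dialogue_id, response_a, response_b) tuples.
    Disagreements are returned rather than resolved, so they can be
    studied as a signal instead of being forced into consensus.
    """
    disagreements = [(d, a, b) for d, a, b in pairs if a != b]
    rate = 1 - len(disagreements) / len(pairs) if pairs else 0.0
    return rate, disagreements


# Hypothetical example data, for illustration only.
pairs = [
    ("dlg-001", Response.YES, Response.YES),
    ("dlg-002", Response.NO, Response.UNABLE),
    ("dlg-003", Response.NO, Response.NO),
    ("dlg-004", Response.YES, Response.NO),
]
rate, disagreements = pair_agreement(pairs)
print(f"agreement rate: {rate:.2f}")
for dialogue_id, a, b in disagreements:
    print(f"{dialogue_id}: {a.value} vs {b.value}")
```

A raw agreement rate is the simplest possible summary; a chance-corrected statistic could be substituted, but the report as summarized here does not say which statistic, if any, NIST used.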
Red teamers and field testers also completed questionnaires built around a Problem-Purpose-Use-Guiding (PPUG) statement, which narrows from general problem framing through purpose and intended use to specific guiding questions. NIST developed three distinct questionnaires for each tester role (screener, post-task, background) and refined them through expert review and pilot testing with representative users.[^3]
Because ARIA 0.1 collected many assessment items, NIST conducted a crosswalk exercise to identify those items that served as direct indicators of a chosen target construct. The pilot used validity, defined as "the degree to which application output met the requirements for the intended use," as its primary construct. Two researchers iterated through every assessment item, with the rest of the team reviewing and resolving disagreements.[^3]
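As a rough sketch of the crosswalk step, the following assumes each of two reviewers records a yes/no judgment on whether an assessment item is a direct indicator of the target construct. Items both reviewers accept are included, and items on which they differ are set aside for team review. The function name, item identifiers, and the boolean encoding are illustrative assumptions, not NIST's actual procedure or data.

```python
def crosswalk(judgments_a, judgments_b):
    """Split assessment items into (included, needs_review).

    Each argument maps item_id -> bool, where True means the reviewer
    judged the item a direct indicator of the target construct.
    """
    if judgments_a.keys() != judgments_b.keys():
        raise ValueError("both reviewers must judge the same set of items")

    included, needs_review = [], []
    for item_id in sorted(judgments_a):
        a, b = judgments_a[item_id], judgments_b[item_id]
        if a != b:
            needs_review.append(item_id)   # escalate to the wider team
        elif a:
            included.append(item_id)       # both agree: direct indicator
        # both False: item is dropped from the crosswalk for this construct
    return included, needs_review


# Hypothetical item identifiers, for illustration only.
reviewer_a = {"ann.guardrail_violation": True, "q.post_task_trust": True, "q.background_age": False}
reviewer_b = {"ann.guardrail_violation": True, "q.post_task_trust": False, "q.background_age": False}

included, needs_review = crosswalk(reviewer_a, reviewer_b)
print("included:", included)
print("needs review:", needs_review)
```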
The measurement layer aggregates the crosswalked items into the Contextual Robustness Index (CoRIx), a transparent, multidimensional measurement instrument structured as a tree (or, more generally, a directed acyclic graph). Leaves of the tree are data points, parents summarize their children, and the root yields a high-level score. Higher CoRIx scores indicate a higher level of measured negative risk to validity (that is, lower validity). NIST released open-source code for computing and visualizing CoRIx scores in a GitHub repository, signaling that the index is intended as a community resource rather than a proprietary metric.[^3][^13]
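The tree structure can be sketched in a few lines. This is not NIST's released CoRIx code: the `Node` class, the node names, the 0-to-1 leaf scale, and the use of an unweighted mean as the parent summary are all assumptions made for illustration. The sketch also handles only a plain tree, not the more general directed acyclic graph.

```python
from dataclasses import dataclass, field
from statistics import fmean
from typing import List, Optional


@dataclass
class Node:
    """One node of an index tree; leaves carry a value, parents carry children."""
    name: str
    value: Optional[float] = None            # set on leaves only
    children: List["Node"] = field(default_factory=list)

    def score(self) -> float:
        """Return the leaf value, or the mean of the children's scores.

        The unweighted mean is an assumed summary function; higher scores
        stand for higher measured risk to validity.
        """
        if not self.children:
            if self.value is None:
                raise ValueError(f"leaf {self.name!r} has no value")
            return self.value
        return fmean([child.score() for child in self.children])


# Hypothetical tree: leaf values are invented, on an assumed 0-to-1 risk scale.
root = Node("validity risk", children=[
    Node("expert annotations", children=[
        Node("dialogue 1: guardrail violation", value=1.0),
        Node("dialogue 2: guardrail violation", value=0.0),
    ]),
    Node("tester feedback", children=[
        Node("post-task questionnaire item", value=0.25),
    ]),
])

print(root.score())  # 0.375
```

Because every parent is computed only from its children, any intermediate node can be inspected to see which branch drives the root score, which is the transparency property the paragraph above describes.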
CoRIx was designed to capture both technical and contextual robustness, defined drawing on existing measurement literature as "the ability of a system to maintain its level of performance under a variety of circumstances," with NIST explicitly including real-world contexts and user expectations as part of those circumstances.[^3]
The ARIA pilot was supported by a collaborative ecosystem rather than by a single major partnership. NIST's published acknowledgments identify several organizations:[^3]
NIST's acknowledgments do not list Argonne National Laboratory as an ARIA pilot partner. Argonne and NIST collaborate broadly on AI testing through other channels, including the U.S. Department of Energy's AI testbed efforts, but ARIA 0.1 itself was run from NIST's Information Technology Laboratory with the external partners listed above. The pilot also drew on community input gathered at the November 12, 2024 ARIA Workshop in Washington, D.C.[^3][^12]
ARIA was launched in the same year that the U.S. AI Safety Institute (US AISI) began operating within NIST; the institute was created in late 2023 under Executive Order 14110 and stood up in early 2024. ARIA was framed in NIST's launch announcement as supporting the AI Safety Institute's testing efforts and helping to establish the scientific foundations for trustworthy AI systems. The two efforts were complementary rather than identical: US AISI focused on frontier-model safety evaluations and pre-deployment testing agreements with frontier labs, while ARIA focused on real-world application-level evaluation methodology.[^1][^10]
On June 3, 2025, Secretary of Commerce Howard Lutnick announced the transformation of US AISI into the Center for AI Standards and Innovation (CAISI), with a reoriented mission focused on developing measurement guidelines and best practices for AI security, establishing voluntary agreements with private-sector developers, leading unclassified evaluations of AI capabilities relevant to national security, and representing U.S. interests in international AI standards bodies. CAISI continues to operate within NIST and collaborates with the Information Technology Laboratory, the home of the ARIA program. Following the CAISI transformation, ARIA remained an active NIST research program and an input to CAISI's measurement-science work; NIST AI 700-2, the ARIA 0.1 Pilot Evaluation Report, was published in November 2025 under Acting Under Secretary of Commerce for Standards and Technology and Acting NIST Director Craig Burkhardt.[^4][^5][^3]
ARIA sits at a particular point in the NIST AI policy stack. The high-level voluntary framework is the AI RMF 1.0. The application-area profile for generative AI is NIST AI 600-1. The secure-development companion profile is NIST SP 800-218A. ARIA is the program that develops and pilots the actual measurement methodology that organizations can use to implement the Measure function on real generative AI applications.[^3][^7][^8][^9]
ARIA is also a counterpart in the international landscape to other government-led AI evaluation efforts. The United Kingdom's AI Security Institute conducts frontier-model evaluations through structured testing of capabilities and safeguards. The Republic of Korea's AI Basic Act of December 2024 similarly assigns AI safety evaluation responsibilities to the Korea AI Safety Institute. In Europe, the EU AI Act's obligations for general-purpose AI models, implemented through the GPAI Code of Practice, parallel ARIA's emphasis on systematic evaluation methodology, though with a more directly regulatory character. ARIA differs from these efforts in its sociotechnical orientation: rather than benchmarking model capabilities or auditing developer commitments, it measures how an application behaves when actual people use it under defined scenarios.[^15][^16]
Independent AI evaluation organizations such as METR (Model Evaluation and Threat Research) complement government programs like ARIA in the broader evaluation ecosystem. Industry observers have noted that ARIA's commitment to publishing methodology and tooling openly (including CoRIx on GitHub) helps build shared metrology infrastructure that private evaluators can adopt.[^13]
The ARIA 0.1 Pilot Evaluation Report (NIST AI 700-2), approved by the NIST Editorial Review Board on September 30, 2025 and published in November 2025, presents the procedure and preliminary measurement results. NIST characterizes the pilot as a demonstration of feasibility rather than a definitive ranking of submitted applications. The report explicitly avoids identifying specific vendors, using anonymized labels (Application A, Application B, Application C) when illustrating example CoRIx trees for the Pathfinder, TV Spoilers, and Meal Planner tasks.[^3]
Quantitatively, the report describes the volume of data generated by the pilot rather than scenario-level pass/fail rates: 508 testing sessions; 51 red teamers between December 2024 and January 2025; 19 field testers in January 2025; more than 1,500 annotations by seven trained NIST staff. The report includes example CoRIx output scores and example CoRIx trees in its appendices but does not publish aggregate validity-violation rates across the seven applications. NIST states that subsequent reports will provide more detailed descriptions of each ARIA 0.1 evaluation component.[^3]
NIST identifies several lessons from the pilot: the feasibility of combining expert annotation and human tester feedback into a single transparent measurement tree, the value of the crosswalk methodology for mapping multidimensional assessment data to a target construct, the methodological value of annotator disagreement, and future-direction items including expansion to additional languages, broader scenarios, and additional target constructs beyond validity.[^3]
The ARIA program drew broad interest from academia, industry, civil society, and the AI evaluation research community. Outside commentators noted that ARIA represented a substantive turn for NIST, moving the agency from publishing voluntary frameworks to running structured evaluation environments. Northwestern University's Center for Advancing Safety of Machine Intelligence framed ARIA as a "new dawn" for AI evaluation that grounds abstract framework language in concrete measurement.[^17] Industry trade press, including GovCIO Media and FedScoop, similarly emphasized ARIA's sociotechnical orientation and its commitment to evaluating systems in context rather than in benchmark isolation.[^6][^18]
Some commentators raised concerns. Critics observed that ARIA's voluntary, anonymized format limits its accountability function: by design, the public report does not name vendors whose applications violated guardrails. Others noted that English-only sessions in the pilot risked underrepresenting global perspectives. NIST acknowledged the language scope limitation directly in NIST AI 700-2 and indicated that subsequent evaluations may include other languages.[^3]
Following the June 2025 transformation of US AISI into CAISI under Secretary Lutnick's direction, federal AI policy in the United States shifted emphasis from safety-framed regulation toward innovation, security, and competitiveness. The renaming was paired with revised priorities: CAISI's mandate centers on commercial AI testing, voluntary agreements with private-sector developers, evaluations of national-security-relevant capabilities (including cybersecurity, biosecurity, and chemical weapons risks), and U.S. representation in international AI standards bodies. CAISI continues to coordinate with the broader NIST AI portfolio, including the Information Technology Laboratory.[^4][^5]
ARIA, as a research and metrology program operated by ITL, survived the rebranding and the broader policy reorientation accompanying America's AI Action Plan of July 2025. The publication of NIST AI 700-2 in November 2025 signaled continued institutional support for the sociotechnical evaluation methodology that ARIA pioneered. NIST has framed ARIA's contributions, including CoRIx and the dialogue annotation schema, as inputs to the broader CAISI standards portfolio.[^3][^19]
As of May 2026, ARIA's public materials continue to be hosted at ai-challenges.nist.gov/aria, the program contact remains aria_inquiries@nist.gov within the Information Technology Laboratory at NIST's Gaithersburg, Maryland headquarters, and CoRIx code remains available on GitHub. NIST has indicated that subsequent ARIA iterations beyond 0.1 will refine the three layers, broaden the scenarios, and extend CoRIx to additional target constructs.[^3][^13]