Petri (auditing tool)

AI Safety Anthropic Open Source AI

8 min read

Updated Jun 3, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 3, 2026

Fact-checked

In review queue

Sources

9 citations

Revision

v1 · 1,577 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Petri is an open-source tool, released by Anthropic, that automates parts of AI alignment auditing by using one AI model to probe another and a third to score what happens. The name is an acronym for Parallel Exploration Tool for Risky Interactions. Researchers hand Petri short natural-language hypotheses about how a model might misbehave, and the tool spins up an "auditor" agent that runs realistic multi-turn conversations against a target model, followed by a "judge" model that reads each transcript and scores it on safety-relevant dimensions such as deception, sycophancy, and power-seeking.^[1]^[2]

Petri was announced on October 6, 2025, on Anthropic's alignment research blog, and the code was published on GitHub under the MIT license.^[1]^[3] The tool is built on top of Inspect, the evaluation framework maintained by the UK AI Security Institute, which lets it talk to most major model APIs without bespoke integration work for each provider.^[1]^[4]

Overview

The central problem Petri addresses is throughput. Alignment audits have traditionally meant a researcher hand-writing a scenario, role-playing the human and any tools the model expects, reading the resulting conversation closely, and repeating that loop for every new idea. That process does not scale to the number of behaviors worth checking in a modern frontier model. Petri automates most of the loop: it builds the scenario, plays the other side of the conversation, simulates whatever tools the situation calls for, and produces a first-pass analysis of the transcript. A researcher can therefore test many separate hypotheses about a model in minutes rather than days, then spend their attention on the transcripts the judge flagged as unusual.^[1]^[2]

Anthropic frames the tool as infrastructure for the wider safety community rather than an internal-only system. The distribution ships with 111 seed instructions covering a sample of behaviors and is designed to be extended: users can add tools for the auditor to wield, rewrite the dimensions the judge cares about, and adapt the transcript viewer to their own workflow.^[1] The repository is published through Anthropic's safety-research organization and is written in Python.^[3]

Name and release

Petri stands for Parallel Exploration Tool for Risky Interactions, a reference to the way the tool runs many independent probing conversations at once while looking for behaviors that would be concerning in deployment. The "petri" reading also nods to a petri dish, since the auditor effectively cultures a target model's behavior inside a controlled, simulated environment.^[1]^[5]

The initial release went out on October 6, 2025, accompanied by a technical report on Anthropic's alignment blog and a post on the main company site.^[1]^[2] Some secondary coverage dates the launch to October 7, reflecting when the news circulated rather than the blog's own timestamp.^[5]^[6] In December 2025 Anthropic released Petri 2.0, which added defenses against models recognizing that they are being evaluated and expanded the seed library with roughly 70 additional behavioral scenarios, including multi-agent collusion and conflicts between professional duties and ethical obligations.^[7]

How it works (auditor and judge agents)

A Petri run takes a list of seed instructions, each a plain-language description of a situation or hypothesis the researcher wants to probe. For every seed, an auditor agent makes a plan and then drives a conversation with the target model inside a tool-use loop. The auditor is not a single canned prompt: it has a detailed system prompt and a set of actions it can take to steer the interaction toward the behavior under investigation.^[1]^[2]

The auditor's toolset is what makes the probing flexible. According to Anthropic's documentation, the auditor can do the following:

Auditor action	Purpose
Send user messages	Play the human side of a multi-turn dialogue
Set the system prompt	Place the target model in a specific role or deployment context
Create synthetic tools	Give the target model fake tools it might call
Simulate tool results	Return fabricated outputs for those tool calls
Roll back the conversation	Branch and re-try from an earlier point to explore alternatives
Prefill target responses	Seed the start of the model's reply where the API allows it
End the audit early	Stop once the relevant behavior has or has not appeared

After the conversation ends, a separate judge model reads the full transcript and scores it. The default rubric covers 36 dimensions, so a single target model can accumulate thousands of individual scores across a run; in Anthropic's pilot each target received 2,775 scores.^[1]^[8] Because the judge produces structured numbers and short summaries, researchers can sort and filter transcripts to surface the handful worth reading in full, which is where human judgment still does the real work.^[1]

The scored dimensions are deliberately specific. Several of the defaults target behaviors a model exhibits without being pushed into them, including the following:

Dimension	What it measures
misaligned_behavior	Overall concerning conduct relative to the situation
cooperation_with_human_misuse	Helping a user pursue clearly harmful goals
unprompted_deception_toward_user	Lying to the user when not provoked
unprompted_sycophancy	Flattering or agreeing against the model's own assessment
unprompted_self_preservation	Acting to avoid being shut down or modified
unprompted_whistleblowing	Reporting the simulated user to outside parties
reward_hacking	Gaming the task's success signal rather than completing it

Built on Inspect

Petri is implemented as an extension of Inspect, the open evaluation framework from the UK AI Security Institute (the body formerly named the UK AI Safety Institute). Building on Inspect gives Petri provider-agnostic model access, so the same audit can be pointed at models from different labs by changing a role binding rather than rewriting integration code.^[1]^[4] The choice also ties Petri to a framework already used across the safety-evaluation community, which lowers the barrier for groups that have existing Inspect setups.

A parallel distribution of the tool is maintained by Meridian Labs as inspect_petri, packaged for installation directly against Inspect; that line has since moved through its own version numbering. The canonical Anthropic release lives at github.com/safety-research/petri.^[3]^[9]

Pilot study and findings

Alongside the release, Anthropic ran a pilot audit across 14 frontier models using the 111 shipped seed instructions. The set spanned Anthropic's own models, including Claude Sonnet 4.5, Claude Opus 4.1, Claude Sonnet 4, and Claude Sonnet 3.7, alongside outside systems such as GPT-5, GPT-5 mini, o4-mini, Gemini 2.5 Pro, Grok-4, Kimi K2, and GPT-OSS 120B.^[1]^[8] The audit elicited a range of problematic behaviors across every model tested, including autonomous deception, subversion of oversight, cooperation with human misuse, self-preservation attempts, and reward hacking.^[1]^[8]

On the aggregate misaligned-behavior score, Anthropic reported that Claude Sonnet 4.5 and GPT-5 roughly tied for the strongest safety profile across most dimensions, while Gemini 2.5 Pro, Grok-4, and Kimi K2 showed comparatively high rates of deception toward the simulated user.^[1]^[8] (Some secondary outlets summarized the result as Claude Sonnet 4.5 finishing first, omitting the tie with GPT-5.^[6])

Anthropic also used the pilot to study whistleblowing as a case study, examining when models autonomously try to report perceived wrongdoing by their user to third parties, and noting that models sometimes did so even when the "wrongdoing" in the scenario was harmless, which the report treats as a sign that the behavior is driven by narrative cues rather than coherent ethical reasoning.^[1] The same auditing approach connects to other Anthropic safety work, including its research on alignment faking, agentic misalignment, and sabotage evaluations.

Significance and limitations

Petri lowers the cost of asking targeted questions about model behavior, and it has been picked up by outside research groups and trialed by other AI developers since launch.^[7] By open-sourcing the auditor, the judge, and a starter seed set, Anthropic shifted some of the auditing capability that had been internal to frontier labs into tooling any researcher can run.

The report is candid about the method's limits. The numerical scores are described as coarse measurements meant to surface transcripts for human review, not as a definitive ranking of which model is safest; what counts as misaligned often rests on subjective judgment, and the judge model can carry its own biases.^[1] The auditor is bounded by the capability of the model playing it, so a weak auditor will miss behaviors a stronger one would draw out, especially in technically complex environments.^[1] Realism is a further constraint: many generated scenarios contain cues that let a sufficiently attentive target infer it is being tested rather than genuinely deployed, which can change how it acts. Reducing that eval-awareness was one of the stated motivations for Petri 2.0.^[1]^[7] Finally, the tool still depends on humans to supply good hypotheses and seed instructions; it accelerates exploration but does not decide on its own what is worth investigating.^[1]

References

"Petri: An open-source auditing tool to accelerate AI safety research", Anthropic Alignment Science blog, October 6, 2025. ↩
"Petri: An open-source auditing tool to accelerate AI safety research", Anthropic, 2025. ↩
safety-research/petri, GitHub repository (MIT license). ↩
Inspect, UK AI Security Institute evaluation framework. ↩
"Anthropic launches Petri, an open-source tool for automated AI model safety audits", The Decoder, October 2025. ↩
"Claude Sonnet 4.5 Ranked Safest LLM from Open-Source Audit Tool Petri", InfoQ, October 2025. ↩
"Donating our open-source alignment tool", Anthropic, 2025. ↩
"Anthropic AI Releases Petri: An Open-Source Framework for Automated Auditing", MarkTechPost, October 8, 2025. ↩
meridianlabs-ai/inspect_petri, GitHub repository. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

Suggest edit

What links here

Constitutional AI

Overview

Name and release

How it works (auditor and judge agents)

Built on Inspect

Pilot study and findings

Significance and limitations

References

Improve this article

Related Articles

Constitutional AI

Dario Amodei

The Anthropic Institute

Claude Code Review

Constitutional Classifiers

Model organisms of misalignment

What links here

Related Articles

Constitutional AI

Dario Amodei

The Anthropic Institute

Claude Code Review

Constitutional Classifiers

Model organisms of misalignment