Patronus AI is an automated LLM evaluation, observability, and guardrails platform founded in 2023 and headquartered in San Francisco. The company builds infrastructure that helps organizations test, monitor, and safeguard large language model (LLM) applications in both development and production environments. Its core offering combines proprietary evaluation models, a self-serve API, and a growing suite of open-source benchmarks and model releases. Patronus AI is used by enterprises in regulated industries including finance, healthcare, and legal services, where factual errors and model failures carry meaningful consequences. As of early 2026, the company has raised approximately $20 million across a seed round and a Series A.
Patronus AI was founded in March 2023 by Anand Kannappan (CEO) and Rebecca Qian (CTO), two former Meta researchers who met while working on responsible AI. Kannappan spent nearly a decade at Meta, where he built and led AI teams at Meta Reality Labs and developed explainable machine learning frameworks for augmented reality applications. Qian was a research engineer and team lead on responsible natural language processing at Meta AI Research (FAIR), where she trained and released FairBERTa, a fairness-focused language model, among other safety-oriented research contributions. Both co-founders hold computer science degrees from the University of Chicago.
The impetus for founding Patronus AI came directly from the rapid commercialization of LLMs after OpenAI released ChatGPT in late 2022. Kannappan and Qian observed firsthand that while organizations were rushing to deploy generative AI, the tooling to evaluate whether those systems were actually safe and accurate lagged far behind. Having spent years building guardrails and evaluation systems inside a large technology company, they saw an opportunity to package that institutional knowledge as an independent product.
The company's name is a reference to the Patronus charm from the Harry Potter series, a protective spell conjured to ward off dark forces. The founders intended the metaphor to communicate the product's core purpose: shielding AI-driven applications from dangerous or incorrect outputs.
Patronus AI operated in stealth from March 2023 until September 2023, when it simultaneously launched publicly and announced its seed funding round. At launch the company had six full-time employees and an initial platform capable of evaluating LLM outputs against criteria including hallucination rate, toxicity, copyright infringement, and PII leakage. The September 2023 TechCrunch coverage described the company as the first automated evaluation platform specifically targeting regulated industries, where wrong answers can have legal, financial, or medical consequences.
Early customer traction came from financial services and education companies. The company established partnerships with Nvidia, MongoDB, and IBM as distribution and integration partners during its first year of operation.
Patronus AI raised a $3 million seed round led by Lightspeed Venture Partners, with additional participation from Factorial Capital and a group of industry angels. The round was announced at the same time as the company's public launch, an unusual sequencing that reflected the founders' decision to reveal a working product alongside the fundraise.
The seed capital was used primarily to hire engineering talent, expand the evaluation model library, and release FinanceBench, the company's first major open-source benchmark.
In May 2024, Patronus AI raised a $17 million Series A round led by Glenn Solomon at Notable Capital (formerly GGV Capital). The round included continued participation from Lightspeed Venture Partners, strategic investment from Datadog, angel investment from Gokul Rajaram, and contributions from Factorial Capital and several AI and software executives. Glenn Solomon joined Patronus AI's board of directors as part of the transaction.
Notable Capital's investment thesis for the round stated that "every company that deploys LLMs needs Patronus," citing growing enterprise demand for reliability tooling and high-profile AI failures at companies including Google and a major automotive brand. The $17 million total brought cumulative funding to approximately $20 million.
The Series A was used to expand engineering and research headcount, launch the self-serve Patronus API, open-source the Lynx hallucination detection model, and release the Glider small language model judge.
The core Patronus platform is an end-to-end system for evaluating, monitoring, and improving LLM applications. It supports two primary deployment patterns: offline evaluation during development and real-time monitoring in production.
In offline mode, developers send batches of model inputs and outputs to the Patronus API or SDK, run them through a suite of evaluators, and receive structured scores with explanations. The platform supports A/B testing across different prompt configurations, models, and retrieval strategies, allowing teams to compare system variants before deploying.
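In code, an offline evaluation call reduces to posting inputs, outputs, and optional retrieved context to an evaluation endpoint and collecting structured verdicts. The sketch below uses Python's requests library; the endpoint path, header name, payload fields, and the "lynx" evaluator identifier are illustrative assumptions rather than the documented API schema.

```python
import os
import requests

# Hypothetical endpoint and payload shape -- illustrative only, not the
# documented Patronus API schema.
PATRONUS_EVALUATE_URL = "https://api.patronus.ai/v1/evaluate"

def evaluate_batch(records, evaluators, api_key):
    """Score a batch of {input, output, context} records with named evaluators."""
    results = []
    for record in records:
        response = requests.post(
            PATRONUS_EVALUATE_URL,
            headers={"X-API-KEY": api_key},  # assumed auth header
            json={
                "evaluators": [{"evaluator": name} for name in evaluators],
                "evaluated_model_input": record["input"],
                "evaluated_model_output": record["output"],
                "evaluated_model_retrieved_context": record.get("context"),
            },
            timeout=30,
        )
        response.raise_for_status()
        results.append(response.json())
    return results

if __name__ == "__main__":
    batch = [{
        "input": "What was the company's 2023 revenue?",
        "output": "Revenue was $4.2 billion in fiscal 2023.",
        "context": "FY2023 revenue totaled $4.2 billion, up 8% year over year.",
    }]
    scores = evaluate_batch(batch, ["lynx"], api_key=os.environ["PATRONUS_API_KEY"])
    print(scores)
```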
In production mode, the platform acts as a monitoring and alerting layer. Every LLM interaction can be logged, scored by evaluators, and visualized in a dashboard. Operators configure alert thresholds so that anomalous behavior -- a spike in hallucination rate, a safety policy violation, or an unusual pattern of prompt injection attempts -- triggers notifications before users are affected at scale.
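The alerting logic itself amounts to tracking recent evaluator verdicts and firing when the failure rate in a rolling window crosses a configured threshold. The following sketch is a generic illustration of that pattern rather than Patronus's implementation; the notify hook is a placeholder for an integration such as Slack or PagerDuty.

```python
from collections import deque

class HallucinationRateAlert:
    """Track recent pass/fail verdicts and fire when the failure rate spikes."""

    def __init__(self, window_size=500, threshold=0.05, notify=print):
        self.window = deque(maxlen=window_size)  # most recent verdicts (True = pass)
        self.threshold = threshold               # e.g. alert above a 5% failure rate
        self.notify = notify                     # placeholder for Slack, PagerDuty, etc.

    def record(self, passed: bool) -> None:
        self.window.append(passed)
        if len(self.window) == self.window.maxlen:
            failure_rate = 1 - sum(self.window) / len(self.window)
            if failure_rate > self.threshold:
                self.notify(
                    f"Hallucination rate {failure_rate:.1%} exceeds "
                    f"threshold {self.threshold:.1%}"
                )
```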
Patronus offers both Python and TypeScript/Node.js SDKs. The API is programming-language-agnostic, taking JSON-formatted requests over HTTPS. New users receive $5 in free credits. Enterprise tiers include higher rate limits, custom evaluation model training, webhook integrations, and AI professional services.
The platform ships with a large library of pre-built evaluators organized across several categories:

- Hallucination and groundedness, including context adherence for RAG pipelines
- Safety and toxicity
- PII leakage and business-sensitive information
- Copyright reproduction
- Prompt injection and other adversarial inputs
Developers can also define custom evaluators using natural language rubrics, Likert-scale criteria, or binary pass/fail rules. These custom judges are powered by Glider or other configurable LLM backends.
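A rubric-based custom judge can be sketched as a small configuration object plus a prompt template rendered for the judging model. The rubric structure and the build_judge_prompt helper below are hypothetical, not the Patronus SDK's documented interface.

```python
# Illustrative rubric for a custom pass/fail judge; the structure and helper
# are hypothetical, not the Patronus SDK's documented interface.
TONE_RUBRIC = {
    "name": "brand-tone-check",
    "pass_criteria": (
        "The response is professional, avoids slang, and never promises "
        "outcomes the product cannot guarantee."
    ),
    "output_format": "binary",  # pass/fail; Likert scales are another option
}

def build_judge_prompt(rubric: dict, model_input: str, model_output: str) -> str:
    """Render a rubric into a prompt for an LLM judge such as Glider."""
    return (
        f"You are an evaluation judge. Criteria: {rubric['pass_criteria']}\n"
        f"User input: {model_input}\n"
        f"Model output: {model_output}\n"
        "Answer PASS or FAIL, then explain your reasoning."
    )
```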
Glider is an open-source small language model (SLM) designed specifically for use as an LLM evaluation judge. Patronus AI released Glider in December 2024 as the smallest model in its class to outperform GPT-4o-mini as an evaluator, providing a fast and cost-effective alternative to using large frontier models for evaluation workloads.
Glider is based on a fine-tuned version of Phi-3.5-mini-instruct and has 3.8 billion parameters. It was trained on a combination of synthetic and domain-adapted data covering more than 183 evaluation metrics and 685 subject domains, including finance, medicine, legal documents, and customer service. This breadth of training data gives Glider strong generalization to new domains without requiring task-specific fine-tuning.
Key performance characteristics at the time of release:

- Outperformed GPT-4o-mini as an evaluation judge despite its 3.8 billion-parameter size
- Low enough inference cost and latency to serve as a practical alternative to frontier-model judges for high-volume evaluation workloads
- Strong generalization to unseen domains and metrics without task-specific fine-tuning
Glider accepts flexible output formats: binary pass/fail, 1-3 Likert scales, and 1-5 Likert scales. It generates natural-language reasoning chains alongside its scores, which makes its decisions auditable and actionable. The model is available on Hugging Face under a research-friendly license and can be accessed through the Patronus SDK with an API key.
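A minimal local-inference sketch using the Hugging Face transformers library follows; the repository ID PatronusAI/glider and the prompt format are assumptions, and the model card should be treated as authoritative.

```python
# Minimal sketch of running Glider locally via Hugging Face transformers.
# The repository ID and prompt format are assumptions; consult the model card.
from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="PatronusAI/glider",  # assumed repo ID
    device_map="auto",
)

prompt = (
    "Evaluate the RESPONSE against the CRITERIA on a 1-5 Likert scale and "
    "explain your reasoning.\n"
    "CRITERIA: The response must be grounded in the provided context.\n"
    "CONTEXT: FY2023 revenue totaled $4.2 billion.\n"
    "RESPONSE: Revenue was $4.2 billion in fiscal 2023.\n"
)

result = judge(prompt, max_new_tokens=256, return_full_text=False)
print(result[0]["generated_text"])
```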
The technical report accompanying Glider's release (arXiv:2412.14140) describes the training methodology and benchmark results in detail.
Lynx is an open-source hallucination detection model released by Patronus AI in July 2024. It was designed to address one of the most pervasive failure modes in production LLM systems: the tendency of language models to generate plausible-sounding but factually unsupported statements, particularly in RAG pipelines where retrieved context should constrain the model's responses.
Lynx is a fine-tuned version of Meta's Llama 3 Instruct family and is available in two sizes: an 8B parameter variant suitable for real-time inference and a 70B parameter variant optimized for maximum accuracy in offline evaluation. Both models produce hallucination scores accompanied by natural-language reasoning, functioning similarly to a human expert reviewer who explains their judgment rather than just rendering a verdict.
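Local inference with the 8B variant can be sketched in a few lines of Python using Hugging Face transformers; the repository ID and the PASS/FAIL prompt convention below are assumptions, so the model card remains the authoritative reference.

```python
# Illustrative local hallucination check with the Lynx 8B variant.
# The repo ID and prompt/output conventions are assumptions; see the model card.
from transformers import pipeline

lynx = pipeline(
    "text-generation",
    model="PatronusAI/Llama-3-Patronus-Lynx-8B-Instruct",  # assumed repo ID
    device_map="auto",
)

def check_faithfulness(question: str, context: str, answer: str) -> str:
    prompt = (
        "Given the QUESTION, CONTEXT and ANSWER, decide whether the ANSWER is "
        "faithful to the CONTEXT. Reply with PASS or FAIL and your reasoning.\n"
        f"QUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}\n"
    )
    out = lynx(prompt, max_new_tokens=256, return_full_text=False)
    return out[0]["generated_text"]

print(check_faithfulness(
    question="What was FY2023 revenue?",
    context="FY2023 revenue totaled $4.2 billion.",
    answer="Revenue was $5.1 billion in fiscal 2023.",  # unsupported by the context
))
```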
Benchmark results at release:
Lynx and the associated HaluBench dataset are publicly available on Hugging Face. Lynx is also integrated into the Patronus API as the recommended hallucination detection evaluator for RAG applications. The Databricks blog published a technical case study in 2024 detailing how Patronus collaborated with Databricks MLflow to train and deploy Lynx on enterprise data pipelines.
On October 31, 2024, Patronus AI launched what it described as the industry's first self-serve API for AI evaluation and guardrails. The Patronus API is positioned as a production reliability layer that developers embed directly into their LLM application code to catch failures before they reach end users.
The API operates in two latency modes. A fast-response mode is designed for real-time guardrailing, with small evaluator models that can return a verdict within the response latency budget of a live user interaction. A thoroughness mode uses larger, more capable models and is appropriate for offline batch evaluation or asynchronous post-processing where latency is less constrained.
Guardrail capabilities at launch included hallucination detection via Lynx, prompt injection detection, safety and toxicity classification, and custom rule-based filters. Patronus claimed significantly higher precision and recall compared to competing open-source solutions including Meta's LlamaGuard and Microsoft's Prompt Guard. Enterprise customers can configure industry-specific guardrail policies: a financial services firm might specify rules about regulatory disclosure requirements, while a healthcare provider could configure constraints around medical advice boundaries.
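In a serving path, a fast-mode guardrail typically sits between the model's draft response and the user, substituting a fallback message when any configured check fails. The sketch below illustrates that placement only; evaluate_output is a hypothetical stand-in for a call to the real-time evaluators.

```python
# Illustrative guardrail placement in a request handler. evaluate_output is a
# hypothetical stand-in for a real-time evaluator call; its name, signature,
# and verdict shape are assumptions for this sketch.
FALLBACK = "I'm not able to answer that reliably right now."

def evaluate_output(user_input: str, draft: str, context: str) -> dict:
    """Placeholder verdict: a real deployment would call the fast-response
    evaluators (hallucination, prompt injection, safety) here."""
    return {"pass": bool(draft.strip())}

def guarded_reply(user_input: str, draft: str, context: str) -> str:
    verdict = evaluate_output(user_input, draft, context)
    if not verdict.get("pass", False):
        return FALLBACK  # block the draft if any guardrail failed
    return draft
```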
The API is accessed at app.patronus.ai, supports pay-as-you-go billing, and is available on the AWS Marketplace.
Percival is Patronus AI's agentic observability and debugging tool, launched in 2025. It targets the growing complexity of multi-step AI agent systems, where a single user request may span dozens of tool calls, LLM completions, retrieval operations, and branching decision points.
Percival inspects the full execution trace of an agent workflow, detecting more than 20 distinct failure modes including planning errors (wrong sequencing or skipping required steps), tool misuse (calling the wrong API or passing malformed arguments), hallucinated tool results, context loss across long conversation threads, and repetitive looping behavior.
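One of these failure modes, repetitive looping, can be illustrated with a simple detector over a list of tool-call spans. This is a generic illustration of the idea, not Percival's detection logic, and the span structure is assumed.

```python
from collections import Counter

def detect_repetitive_tool_calls(trace: list[dict], max_repeats: int = 3) -> list[str]:
    """Flag tools called with identical arguments more than max_repeats times.

    `trace` is assumed to be a list of spans like
    {"tool": "search", "args": {"q": "latest filing"}} with string argument values.
    Generic illustration only, not Percival's internals.
    """
    counts = Counter(
        (span["tool"], tuple(sorted(span.get("args", {}).items())))
        for span in trace
        if "tool" in span
    )
    return [
        f"{tool} called {n} times with identical arguments"
        for (tool, _args), n in counts.items()
        if n > max_repeats
    ]
```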
Rather than requiring developers to manually read through long execution logs, Percival applies an AI analysis layer to the trace and surfaces a prioritized diagnosis with suggested prompt refinements and architectural improvements. According to Patronus, early customers reduced the time spent analyzing a failing agent workflow from roughly an hour to between one and one-and-a-half minutes.
Percival integrates with major agentic frameworks including LangChain, LangGraph, CrewAI, Hugging Face SmolAgents, Pydantic AI, and the OpenAI Agents SDK. This framework-agnostic design allows teams to add observability to existing codebases without rebuilding around a new orchestration layer.
In December 2025, Patronus AI announced Generative Simulators, a framework for creating adaptive synthetic training environments for AI agents. The announcement addressed a known weakness in agent development: static test sets and benchmarks do not reflect the dynamic, stateful nature of real-world environments, so agents that perform well on fixed evaluations often fail when deployed in production.
Generative Simulators produce self-refreshing task environments in which the assignment, the surrounding world state, and the evaluation criteria are all dynamically generated and can evolve in response to the agent's behavior. Patronus built a multi-agent architecture that generates curriculum-based tasks at appropriate difficulty levels relative to the current agent's capabilities, paired with tool stacks of commensurate complexity.
The framework introduced what Patronus calls Open Recursive Self-Improvement (ORSI): a regime in which an agent can improve through interaction and feedback without requiring a full retraining cycle between attempts. Patronus reported that training agents in these environments increased task completion rates by 10 to 20 percentage points on real-world tasks spanning software engineering, customer service simulation, and financial analysis workflows.
FinanceBench is a financial question-answering benchmark released by Patronus AI in November 2023, developed in collaboration with 15 financial industry domain experts. It is notable for being the first large-scale benchmark specifically designed to evaluate LLM performance on open-book financial document tasks.
The full dataset contains 10,231 question-and-answer pairs derived from publicly available financial documents including SEC 10-K annual reports, SEC 10-Q quarterly reports, SEC 8-K current reports, earnings releases, and earnings call transcripts. Questions span a range of financial reasoning capabilities:
The benchmark was designed to represent a minimum performance standard: the questions are intended to be clear-cut and unambiguous, so that a competent human financial analyst would answer them correctly with high reliability. FinanceBench therefore measures the floor of what an enterprise finance AI system should achieve, not an aspirational frontier.
In testing 16 model configurations at release -- including GPT-4 Turbo, Llama 2, and Claude 2 with both retrieval-augmented and long-context approaches -- Patronus found sobering results. GPT-4 Turbo paired with a retrieval system incorrectly answered or refused to answer 81% of questions. These findings highlighted a large gap between the perceived capabilities of frontier models and their actual reliability on domain-specific professional tasks.
FinanceBench is published under an open license on GitHub (github.com/patronus-ai/financebench) and as a dataset on Hugging Face. The accompanying research paper was published in November 2023 (arXiv:2311.11944).
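For local evaluation runs, the dataset can be loaded with the Hugging Face datasets library; the dataset ID and column names below are assumptions based on the public release, and the dataset card is the authoritative reference.

```python
# Minimal sketch of pulling FinanceBench for a local evaluation run.
# The dataset ID, split name, and column names are assumptions.
from datasets import load_dataset

financebench = load_dataset("PatronusAI/financebench", split="train")

example = financebench[0]
print(example["question"])  # assumed column name
print(example["answer"])    # assumed column name
```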
HaluBench is a hallucination detection benchmark released alongside Lynx in July 2024. It consists of 15,000 samples drawn from real-world domains including medicine, finance, and general knowledge, constructed to support evaluation of hallucination detection models rather than the base LLMs themselves.
Each sample provides a context document, a model-generated answer, and a ground-truth label indicating whether the answer is supported by the context. The benchmark is structured to enable chain-of-thought reasoning in evaluation models, allowing judges like Lynx to trace their detection reasoning step by step.
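In code, each sample reduces to a small record, and scoring a detection model against the benchmark is a matter of comparing its verdicts to the ground-truth labels. The field names in this sketch are illustrative rather than the published schema.

```python
# Illustrative HaluBench-style records and a detector-accuracy computation.
# Field names are illustrative, not the published schema.
samples = [
    {
        "context": "FY2023 revenue totaled $4.2 billion.",
        "answer": "Revenue was $4.2 billion in fiscal 2023.",
        "label": "PASS",  # answer is supported by the context
    },
    {
        "context": "FY2023 revenue totaled $4.2 billion.",
        "answer": "Revenue was $5.1 billion in fiscal 2023.",
        "label": "FAIL",  # answer contradicts the context
    },
]

def detector_accuracy(detect, samples) -> float:
    """`detect(context, answer)` should return "PASS" or "FAIL"."""
    correct = sum(detect(s["context"], s["answer"]) == s["label"] for s in samples)
    return correct / len(samples)
```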
HaluBench was designed to address a gap in existing hallucination benchmarks, most of which were either too narrow in domain coverage or based on synthetic constructions that did not reflect real retrieval and generation patterns. The dataset is hosted on Hugging Face and is freely available for evaluation research.
Patronus AI's platform is structured as a layered evaluation and observability stack. At its foundation sit the evaluation models -- Glider for general-purpose LLM judging and Lynx for hallucination detection -- which function as the core inference engines. Above those are the evaluator abstractions exposed through the API: named evaluators with configurable parameters, rubrics, and output formats that developers compose into evaluation pipelines.
The tracing layer, used by both the offline evaluation and Percival's agentic observability features, captures structured execution records. Each trace records the full input/output sequence for every LLM call, tool invocation, and retrieval operation within a run, tagged with timestamps, latency, and token counts. These traces are stored and indexed to support both real-time alerting and retrospective analysis.
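A trace record of this shape might be represented as follows; the exact fields Patronus stores are not public, so this dataclass is an illustrative schema only.

```python
from dataclasses import dataclass, field

@dataclass
class TraceSpan:
    """Illustrative schema for one step of a traced run (not Patronus's actual schema)."""
    span_id: str
    kind: str            # "llm_call", "tool_invocation", or "retrieval"
    input: str
    output: str
    started_at: float    # UNIX timestamp
    latency_ms: float
    token_count: int = 0
    tags: dict = field(default_factory=dict)

@dataclass
class Trace:
    trace_id: str
    spans: list[TraceSpan] = field(default_factory=list)

    def total_tokens(self) -> int:
        return sum(span.token_count for span in self.spans)
```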
The dataset layer holds curated evaluation sets including FinanceBench, HaluBench, EnterprisePII, and SimpleSafetyTests. Developers can also upload custom datasets and use Patronus's proprietary dataset generation algorithms to synthesize additional evaluation cases from their own application logs.
The dashboard provides visualization, side-by-side comparison of system variants, performance trend tracking, and alert management. Enterprise deployments can configure webhook callbacks to pipe evaluation results into external observability stacks, ticketing systems, or CI/CD pipelines.
The stack is cloud-hosted with a multi-tenant architecture for the standard API and isolated single-tenant deployment options for enterprises with strict data residency or compliance requirements. The Python SDK wraps the REST API with idiomatic helpers for common evaluation patterns, and the TypeScript SDK enables native integration for Node.js-based backend applications.
Patronus AI has maintained an active open-source and research publication program throughout its history. In addition to Glider, Lynx, FinanceBench, and HaluBench, the company has released CopyrightCatcher and EnterprisePII.
CopyrightCatcher, launched in early 2024, is an API and evaluation dataset for detecting when LLMs reproduce copyrighted text verbatim. An analysis using CopyrightCatcher found that GPT-4's outputs contained copyrighted content at a rate of 44%, the highest among the major models tested at the time. The associated evaluation dataset and scoring methodology are published on GitHub (github.com/patronus-ai/copyright-evals).
EnterprisePII is an evaluation dataset and API for detecting business-sensitive information in LLM outputs, covering categories of confidential data that appear in commercial contracts, meeting notes, performance reviews, marketing communications, and other enterprise document types. It extends standard PII detection beyond consumer-facing personally identifiable information to include the kinds of institutional data that carry legal and competitive sensitivity in a business context.
Patronus has also maintained an Enterprise Scenarios Leaderboard on Hugging Face, which provides a public benchmark of LLM performance on enterprise-specific evaluation tasks, enabling companies to compare models before committing to deployment.
The LLM evaluation and observability space has grown rapidly since 2023, with several well-funded companies offering overlapping but differentiated products. The table below compares Patronus AI against its most frequently cited competitors.
| Feature | Patronus AI | LangSmith | Galileo AI | Confident AI |
|---|---|---|---|---|
| Primary focus | Evaluation models + guardrails + observability | LangChain tracing and evaluation | Evaluation intelligence + hallucination monitoring | Python-native LLM unit testing |
| Hallucination detection | Lynx (purpose-built model) | LLM-as-judge via configuration | Hallucination Index metric | G-Eval and custom metrics |
| Agentic tracing | Percival (20+ failure modes) | Full LangChain trace capture | Agent workflow evaluation | Limited |
| Open-source models | Glider (3.8B), Lynx (8B, 70B) | No proprietary models | No proprietary models | DeepEval open-source framework |
| Framework lock-in | Framework-agnostic | Tightly coupled to LangChain | Framework-agnostic | Framework-agnostic |
| Guardrails API | Yes (real-time, self-serve) | Limited | Yes (Protect product) | No native guardrails API |
| Benchmark contributions | FinanceBench, HaluBench, EnterprisePII, CopyrightCatcher | None | Hallucination Index | RAGAS metric suite |
| Pricing | Pay-as-you-go + enterprise | Pay-per-trace + enterprise | Enterprise contract | Free open-source + enterprise |
| AWS Marketplace | Yes | No | No | No |
LangSmith, developed by the LangChain team, is the dominant choice for organizations already building on the LangChain framework. Its deep integration with LangChain primitives makes it the path of least resistance for LangChain users, but that same tight coupling creates meaningful vendor lock-in for teams considering framework migration.
Arize Phoenix, the open-source observability tool from Arize AI, occupies a slightly different position: it provides trace visualization and evaluation infrastructure but is primarily a monitoring and debugging tool rather than a guardrails platform.
Galileo AI positions itself as an evaluation intelligence platform with particular strength in RAG evaluation and a dedicated Hallucination Index leaderboard. Its Evaluate/Observe/Protect product suite covers a similar lifecycle to Patronus, but Galileo has not released proprietary open-source evaluation models.
Confident AI's DeepEval is an open-source framework designed to fit into Python test suites in a manner analogous to pytest. It is popular among teams that prefer a code-first, self-hosted evaluation workflow and provides a broad set of metrics including G-Eval, a prompt-based LLM judge methodology. Confident AI trades the managed infrastructure and specialized evaluation models of Patronus for maximum developer control.
Patronus AI's primary differentiators are its proprietary evaluation models (Glider and Lynx), its domain-specific benchmarks (particularly FinanceBench for financial applications), its production guardrails API with claimed industry-leading precision and recall, and its agentic observability through Percival. The company's research heritage from Meta AI's responsible AI work informs a deeper investment in specialized judge models compared to platforms that rely entirely on general-purpose LLMs as evaluators.
Patronus AI has publicly named customers including AngelList, Pearson, and HP. The broader enterprise customer base spans financial services, education, healthcare, and enterprise software. The company reports that Fortune 500 enterprises across automotive, financial, and software industries use Patronus to deploy LLMs safely in production.
In financial services, the predominant use case is RAG application validation: firms building document-grounded question-answering systems on top of SEC filings, earnings reports, research notes, or regulatory documents use FinanceBench as a pre-deployment benchmark and Lynx as a production hallucination guard. The cost of a model confidently stating an incorrect revenue figure or misquoting a regulatory requirement motivates investment in evaluation infrastructure.
In education, companies like Pearson use Patronus to evaluate AI tutoring and content generation systems for factual accuracy and age-appropriate safety. Educational deployments require consistent tone and pedagogical alignment in addition to factual correctness, making multi-criteria custom evaluators a common configuration.
In enterprise software, AngelList and similar companies use Patronus to monitor AI-powered search and summarization features. At scale, even low individual error rates translate into large absolute volumes of incorrect outputs, so continuous monitoring with automated alerting provides an early warning system before error patterns compound.
The Patronus API's availability on the AWS Marketplace has been a distribution channel for cloud-native companies that prefer to consolidate vendor billing and procurement through their existing AWS relationships.
Like all automated evaluation systems, Patronus AI's evaluators are probabilistic and not perfectly accurate. An evaluator that is 95% accurate still misclassifies 5% of the outputs it scores, which at production scale can mean thousands of miscategorized responses per day. Teams must calibrate acceptable thresholds for their use case rather than treating evaluator scores as ground truth.
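The arithmetic is worth making concrete; the traffic volume in this sketch is an arbitrary illustration.

```python
# Concrete example of how a small error rate compounds at scale.
daily_requests = 200_000           # arbitrary illustrative traffic volume
evaluator_accuracy = 0.95
misclassified_per_day = daily_requests * (1 - evaluator_accuracy)
print(misclassified_per_day)       # 10000.0 outputs scored incorrectly per day
```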
The Glider and Lynx models, while strong performers on their respective benchmarks, were trained on data distributions that may not fully represent novel enterprise domains. Organizations in highly specialized fields -- niche areas of insurance law, rare-disease clinical documentation, or proprietary financial instruments -- may find that generic evaluation model performance degrades on their specific content and requires custom fine-tuning.
Real-time guardrailing introduces latency overhead. Even the fast-response Patronus evaluators add measurable delay to the response path of a live application, which constrains deployment patterns in latency-sensitive contexts such as real-time voice interfaces or high-frequency trading applications.
The platform's pricing model, while more accessible than legacy enterprise AI vendors, is still consumption-based, meaning evaluation costs scale linearly with API traffic. High-volume production deployments may require careful cost modeling to avoid surprising infrastructure bills, particularly when using the larger Lynx 70B evaluator.
More broadly, the LLM evaluation field lacks universal standards. Different evaluation frameworks use different metric definitions for concepts like "hallucination" and "groundedness," making it difficult to compare scores across platforms or to establish industry-wide thresholds. Patronus AI's benchmarks have contributed to improving this situation, but the field remains fragmented.