CRMArena / CRMArena-Pro
Last reviewed
Jun 8, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,444 words
CRMArena is an AI benchmark for evaluating large language model agents on professional customer relationship management (CRM) tasks inside a realistic, schema-faithful Salesforce environment. It was introduced by Kung-Hsiang Huang and colleagues at Salesforce AI Research in a paper first posted to arXiv on November 4, 2024, and accepted to the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025) [1][2]. The benchmark grounds agent evaluation in an actual Salesforce organization populated with synthetic but realistic data, and measures whether agents can complete tasks faced by customer-service personas such as service agents, analysts, and managers [1].
A follow-up benchmark, CRMArena-Pro, was released by the same group in May 2025. It broadens coverage beyond customer service to sales and configure-price-quote (CPQ) workflows, adds both single-turn and multi-turn interactions, spans business-to-business (B2B) and business-to-consumer (B2C) settings, and introduces a confidentiality-awareness evaluation [3][4]. Across both benchmarks, frontier LLM agents complete only a modest fraction of tasks, and they perform especially poorly on multi-turn dialogue and on recognizing confidential information, which the authors frame as evidence of a substantial gap between current model capabilities and the demands of real enterprise work [1][3].
Enterprises increasingly want to deploy LLM agents to handle customer-relationship and customer-service work: routing and resolving support cases, looking up knowledge articles, checking policy compliance, and analyzing operational trends. CRM platforms such as Salesforce sit at the center of this work, organizing accounts, cases, orders, and knowledge into highly interconnected data models. Salesforce and other vendors have positioned autonomous agents as a major product direction and revenue opportunity, which makes credible evaluation commercially important [4][5].
The authors of CRMArena argue that prior agent benchmarks did not capture this setting well. Many existing benchmarks used simplified or simulated environments that did not reflect the schemas, object relationships, and rule-following demands of a production CRM system, so they could not reliably predict whether an agent would be useful when connected to real enterprise data [1]. CRMArena was designed to close that gap by building tasks on top of an actual Salesforce organization and validating them with CRM domain experts, so that success on the benchmark corresponds more closely to the work an agent would do in deployment [1].
CRMArena is built on a Salesforce organization (Org) that acts as the agent's sandbox. The environment models 16 commonly used business objects, such as account, contact, case, order, and knowledge article, with high interconnectivity between them. To make the data realistic, the authors generate records using latent variables that induce realistic distributions rather than uniform random values, so that patterns in the data resemble those an analyst would actually encounter [1]. Agents interact with this environment through tools backed by the Salesforce API, querying and acting on the underlying records.
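The idea of driving record generation from latent variables can be illustrated with a minimal sketch. It is not the authors' generation pipeline: the agent names, the hidden `AGENT_SKILL` values, the `BASE_HOURS` constant, and the exponential handle-time model are all assumptions chosen for illustration. The point is only that a hidden per-agent variable shapes the visible records, so an analyst querying the data finds a real pattern rather than uniform noise.

```python
import random
from collections import defaultdict

# Hypothetical latent variable: a hidden skill level per service agent.
# These names and values are illustrative, not taken from the benchmark.
AGENT_SKILL = {"agent_a": 0.25, "agent_b": 0.5, "agent_c": 1.0}
BASE_HOURS = 4.0  # assumed mean handle time for an agent with skill 1.0


def generate_cases(n_cases, seed=0):
    """Generate toy case records whose handle time depends on hidden skill."""
    rng = random.Random(seed)
    names = sorted(AGENT_SKILL)
    cases = []
    for case_id in range(n_cases):
        agent = rng.choice(names)
        skill = AGENT_SKILL[agent]
        # expovariate takes a rate (1 / mean); here mean = BASE_HOURS / skill,
        # so more skilled agents close cases faster on average.
        handle_hours = rng.expovariate(skill / BASE_HOURS)
        cases.append(
            {"case_id": case_id, "agent": agent, "handle_hours": handle_hours}
        )
    return cases


def mean_handle_hours(cases):
    """Aggregate the visible records, as an analyst-style query might."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for case in cases:
        totals[case["agent"]] += case["handle_hours"]
        counts[case["agent"]] += 1
    return {agent: totals[agent] / counts[agent] for agent in totals}


if __name__ == "__main__":
    for agent, hours in sorted(mean_handle_hours(generate_cases(3000)).items()):
        print(f"{agent}: {hours:.1f} h average handle time")
```

The skill values never appear in the generated records, yet they determine which agent looks fastest in the aggregate. Sampling handle times uniformly at random would leave no such pattern to find.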
On top of this environment, CRMArena defines nine tasks representative of CRM use cases, organized around three professional personas [1]:
| Persona | Example responsibilities |
|---|---|
| Service Agent | Resolving and routing individual customer cases, looking up knowledge articles |
| Service Analyst | Identifying trends and patterns across cases and other records |
| Service Manager | Higher-level oversight tasks that depend on aggregated data and policy |
Each task was designed with guidance from CRM practitioners and industry conventions, which the authors describe as making CRMArena an expert-validated benchmark spanning both a realistic environment and realistic work tasks [1][2]. The benchmark's code and data are released for research use, with datasets hosted on Hugging Face under a Creative Commons Attribution-NonCommercial (CC BY-NC 4.0) license [6].
CRMArena-Pro, titled "CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions," extends the original benchmark along several axes [3]. Where CRMArena focused on customer service, CRMArena-Pro covers 19 expert-validated tasks spanning three domains: customer service, sales, and configure-price-quote (CPQ). The tasks are grouped into four business-skill categories: workflow execution, policy compliance, information retrieval and textual reasoning, and database querying and numerical computation [3][4].
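Several additions distinguish CRMArena-Pro from its predecessor: it evaluates multi-turn as well as single-turn interactions, covers both B2B and B2C settings, and adds a confidentiality-awareness evaluation [3][4].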
Counting query instances across tasks, contexts, and the confidentiality probes, the benchmark comprises roughly 4,000 queries in total [4]. Like CRMArena, CRMArena-Pro is released for research purposes under a non-commercial license, with data available on Hugging Face [6].
On the original CRMArena, the authors report that state-of-the-art LLM agents succeed on less than 58% of tasks when using ReAct prompting, and less than 65% even when provided with manually crafted function-calling tools [1][2]. The paper concludes that current agents need stronger function-calling and rule-following abilities before they can be deployed reliably in real CRM work environments [1].
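For readers unfamiliar with ReAct prompting, the following is a minimal sketch of the general thought, action, observation loop. It is not the benchmark's harness: `scripted_model` stands in for an LLM call, `CASES` is a toy in-memory table rather than a Salesforce org, and the `lookup_case` tool and the `Action: name[argument]` text format are hypothetical choices made for this example.

```python
import re

# Toy records standing in for CRM data (hypothetical, not the benchmark schema).
CASES = {
    "C-1": {"status": "Open", "owner": "agent_a"},
    "C-2": {"status": "Closed", "owner": "agent_b"},
}


def lookup_case(case_id):
    """Hypothetical tool: return a case record as text."""
    record = CASES.get(case_id)
    return str(record) if record else "no such case"


TOOLS = {"lookup_case": lookup_case}
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")


def scripted_model(transcript):
    """Stand-in for an LLM: returns the next Thought/Action text."""
    if "Observation:" not in transcript:
        return "Thought: I need the case record.\nAction: lookup_case[C-1]"
    return "Thought: The record names the owner.\nAction: finish[agent_a]"


def react_loop(question, model, max_steps=5):
    """Alternate model steps and tool observations until the model finishes."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)
        transcript += step + "\n"
        match = ACTION_RE.search(step)
        if match is None:
            transcript += "Observation: could not parse an action\n"
            continue
        name, arg = match.groups()
        if name == "finish":
            return arg, transcript
        tool = TOOLS.get(name)
        observation = tool(arg) if tool else f"unknown tool {name}"
        transcript += f"Observation: {observation}\n"
    return None, transcript  # step budget exhausted without an answer


if __name__ == "__main__":
    answer, log = react_loop("Who owns case C-1?", scripted_model)
    print(log)
    print("Answer:", answer)
```

In this setup the model must emit correctly formatted action text that the harness then parses. In a function-calling setup, tools are instead exposed to the model as structured, predefined functions.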
CRMArena-Pro reports similarly limited performance, with a sharp drop from single-turn to multi-turn settings. Among the models evaluated, the strongest, Gemini 2.5 Pro, reaches roughly 58% accuracy in single-turn tasks but only about 35% in multi-turn settings [3][4][5]. Performance is uneven across skills: agents do comparatively well on workflow execution, where the top model exceeds 83% in single-turn settings, but struggle more on tasks that require reasoning over the database or following nuanced policy [4].
The confidentiality-awareness results are a central finding. The authors report that, without explicit instruction, agents exhibit near-zero confidentiality awareness, failing to recognize when information should be protected. Targeted prompting can raise this awareness, but doing so tends to degrade overall task performance, exposing a tension between safeguarding sensitive data and completing the assigned work [3][4][5]. Coverage of the benchmark summarized the headline numbers as roughly 58% success on single-step tasks and 35% on multi-step tasks, alongside the confidentiality gap [5].
The following figures summarize commonly cited results; precise per-model and per-domain numbers appear in the source papers.
| Setting | Reported result |
|---|---|
| CRMArena, ReAct prompting | Under 58% of tasks succeed [1] |
| CRMArena, function-calling tools | Under 65% of tasks succeed [1] |
| CRMArena-Pro, single-turn (best model) | About 58% accuracy [3][4] |
| CRMArena-Pro, multi-turn (best model) | About 35% accuracy [3][4] |
| CRMArena-Pro, confidentiality awareness | Near zero without explicit prompting [3][4] |
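The size of the single-turn to multi-turn gap can be worked out from the rounded figures in the table. The short calculation below is illustrative only: the inputs are the approximate 58% and 35% values, so the derived numbers are approximate too, and the function name is a label chosen for this example.

```python
def turn_gap(single_turn_pct, multi_turn_pct):
    """Return the absolute drop (percentage points) and the relative drop."""
    absolute = single_turn_pct - multi_turn_pct
    relative = absolute / single_turn_pct
    return absolute, relative


if __name__ == "__main__":
    absolute, relative = turn_gap(58.0, 35.0)
    # prints "23 points, 39.7% relative"
    print(f"{absolute:.0f} points, {relative:.1%} relative")
```

On these rounded inputs, the best model loses about 23 percentage points, or roughly two fifths of its single-turn accuracy, when moving to multi-turn interaction.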
CRMArena and CRMArena-Pro are notable as grounded, expert-validated evaluations of enterprise agents, built on real Salesforce organizations rather than abstracted or fully simulated settings. By tying tasks to authentic CRM schemas, interconnected objects, and realistic data distributions, they offer a closer proxy for production customer-relationship work than many earlier agent benchmarks [1]. Because they originate from Salesforce AI Research, they also reflect first-hand knowledge of how CRM systems are structured and used.
The benchmarks have become a reference point for claims about enterprise and business agents, and their results temper optimism about near-term autonomous deployment. The headline conclusion, that frontier agents complete only a modest fraction of realistic tasks and falter on multi-turn interaction and confidentiality, supports the authors' framing of a significant gap between current LLM capabilities and the multifaceted demands of real-world enterprise scenarios [3][5]. CRMArena-Pro's addition of confidentiality-awareness evaluation is particularly distinctive, since most agent benchmarks focus on task success and largely overlook whether an agent appropriately handles sensitive information [4][5].