CRMArena / CRMArena-Pro

Updated Jun 8, 2026

Overview

CRMArena is an AI benchmark for evaluating large language model agents on professional customer relationship management (CRM) tasks inside a realistic, schema-faithful Salesforce environment. It was introduced by Kung-Hsiang Huang and colleagues at Salesforce AI Research in a paper first posted to arXiv on November 4, 2024, and accepted to the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025) ^[1]^[2]. The benchmark grounds agent evaluation in an actual Salesforce organization populated with synthetic but realistic data, and measures whether agents can complete tasks faced by customer-service personas such as service agents, analysts, and managers ^[1].

A follow-up benchmark, CRMArena-Pro, was released by the same group in May 2025. It broadens coverage beyond customer service to sales and configure-price-quote (CPQ) workflows, adds both single-turn and multi-turn interactions, spans business-to-business (B2B) and business-to-consumer (B2C) settings, and introduces a confidentiality-awareness evaluation ^[3]^[4]. Across both benchmarks, frontier LLM agents complete only a modest fraction of tasks, and they perform especially poorly on multi-turn dialogue and on recognizing confidential information, which the authors frame as evidence of a substantial gap between current model capabilities and the demands of real enterprise work ^[1]^[3].

Motivation: enterprise CRM agents

Enterprises increasingly want to deploy LLM agents to handle customer-relationship and customer-service work: routing and resolving support cases, looking up knowledge articles, checking policy compliance, and analyzing operational trends. CRM platforms such as Salesforce sit at the center of this work, organizing accounts, cases, orders, and knowledge into highly interconnected data models. Salesforce and other vendors have positioned autonomous agents as a major product direction and revenue opportunity, which makes credible evaluation commercially important ^[4]^[5].

The authors of CRMArena argue that prior agent benchmarks did not capture this setting well. Many existing benchmarks used simplified or simulated environments that did not reflect the schemas, object relationships, and rule-following demands of a production CRM system, so they could not reliably predict whether an agent would be useful when connected to real enterprise data ^[1]. CRMArena was designed to close that gap by building tasks on top of an actual Salesforce organization and validating them with CRM domain experts, so that success on the benchmark corresponds more closely to the work an agent would do in deployment ^[1].

What CRMArena is

CRMArena is built on a Salesforce organization (Org) that acts as the agent's sandbox. The environment models 16 commonly used business objects, such as account, contact, case, order, and knowledge article, with high interconnectivity between them. To make the data realistic, the authors generate records using latent variables that induce realistic distributions rather than uniform random values, so that patterns in the data resemble those an analyst would actually encounter ^[1]. Agents interact with this environment through tools backed by the Salesforce API, querying and acting on the underlying records.
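The latent-variable idea can be illustrated with a minimal sketch. This is not the authors' generation pipeline: the names `SKILL_LEVELS`, `generate_cases`, and `mean_handle_time_by_agent`, the "agent skill" latent variable, and all numeric values are assumptions chosen for illustration. The point is only that a hidden per-agent variable shapes an observed field, so the resulting records carry a pattern that an analyst could discover.

```python
import random
from statistics import mean

# Hypothetical latent values; a real generator would use richer distributions.
SKILL_LEVELS = (0.25, 0.5, 1.0)


def generate_cases(num_agents=6, cases_per_agent=200, seed=0):
    """Create synthetic case records whose handle time depends on a hidden skill."""
    rng = random.Random(seed)
    skills = {}
    records = []
    for agent_id in range(num_agents):
        # Latent variable: sampled once per agent and never written to a record.
        skill = rng.choice(SKILL_LEVELS)
        skills[agent_id] = skill
        for _ in range(cases_per_agent):
            # Higher skill gives a shorter expected handle time, plus Gaussian noise.
            handle_minutes = max(1.0, rng.gauss(60.0 / skill, 10.0))
            records.append({"agent_id": agent_id, "handle_minutes": handle_minutes})
    return records, skills


def mean_handle_time_by_agent(records):
    """Aggregate the observed field the way an analyst query might."""
    by_agent = {}
    for record in records:
        by_agent.setdefault(record["agent_id"], []).append(record["handle_minutes"])
    return {agent_id: mean(values) for agent_id, values in by_agent.items()}
```

With uniform random handle times, every agent would look the same on average; here the per-agent means differ in a way that reflects the hidden skill, which is the kind of structure an analysis task can then ask about.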

On top of this environment, CRMArena defines nine tasks representative of CRM use cases, organized around three professional personas ^[1]:

Persona	Example responsibilities
Service Agent	Resolving and routing individual customer cases, looking up knowledge articles
Service Analyst	Identifying trends and patterns across cases and other records
Service Manager	Higher-level oversight tasks that depend on aggregated data and policy

Each task was designed with guidance from CRM practitioners and industry conventions, which the authors describe as making CRMArena an expert-validated benchmark spanning both a realistic environment and realistic work tasks ^[1]^[2]. The benchmark's code and data are released for research use, with datasets hosted on Hugging Face under a Creative Commons Attribution-NonCommercial (CC BY-NC 4.0) license ^[6].

CRMArena-Pro

CRMArena-Pro, titled "CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions," extends the original benchmark along several axes ^[3]. Where CRMArena focused on customer service, CRMArena-Pro covers 19 expert-validated tasks spanning three domains: customer service, sales, and configure-price-quote (CPQ). The tasks are grouped into four business-skill categories: workflow execution, policy compliance, information retrieval and textual reasoning, and database querying and numerical computation ^[3]^[4].

Several additions distinguish CRMArena-Pro from its predecessor ^[3]^[4]:

B2B and B2C contexts. The benchmark provides separate Salesforce organizations and datasets for B2B and B2C businesses, reported as roughly 29,000 records for the B2B org and roughly 55,000 records for the B2C org, allowing comparison across organization types ^[4]^[6].
Single-turn and multi-turn interactions. In addition to single-turn queries, CRMArena-Pro evaluates multi-turn settings in which an agent must gather information incrementally through dialogue with a simulated user, rather than receiving a fully specified request up front ^[3]^[4].
Confidentiality awareness. A dedicated evaluation tests whether agents recognize and appropriately withhold sensitive information, covering categories such as private customer data, internal operational data, and confidential company knowledge ^[4].

Counting query instances across tasks, contexts, and the confidentiality probes, the benchmark comprises roughly 4,000 queries in total ^[4]. Like CRMArena, CRMArena-Pro is released for research purposes under a non-commercial license, with data available on Hugging Face ^[6].
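The multi-turn setting described above can be sketched as a simple loop in which an agent asks a simulated user for each missing detail. The class `SimulatedUser`, the function `run_multi_turn`, and the slot names are hypothetical and are not taken from the benchmark's code; in the actual benchmark the simulated user and the agent are both driven by language models rather than by dictionary lookups.

```python
from dataclasses import dataclass


@dataclass
class SimulatedUser:
    """Hypothetical user that reveals one hidden detail per clarifying question."""
    hidden_details: dict

    def answer(self, slot):
        # Reveal only the requested detail; return None if the user does not know it.
        return self.hidden_details.get(slot)


def run_multi_turn(user, required_slots, max_turns=5):
    """Ask for each required slot in order until all are known or turns run out."""
    known = {}
    turns = 0
    for slot in required_slots:
        if turns >= max_turns:
            break
        turns += 1
        value = user.answer(slot)
        if value is not None:
            known[slot] = value
    complete = all(slot in known for slot in required_slots)
    return {"known": known, "turns": turns, "complete": complete}


user = SimulatedUser({"account": "Acme", "product": "Widget"})
result = run_multi_turn(user, ["account", "product"])
```

In the single-turn setting the same details would arrive in one fully specified request; the loop makes explicit that a multi-turn agent can fail simply by not asking for, or not retaining, a detail it needs.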

Results

On the original CRMArena, the authors report that state-of-the-art LLM agents succeed on less than 40% of tasks when using ReAct prompting, and less than 55% even when provided with manually crafted function-calling tools ^[1]^[2]. The paper concludes that current agents need stronger function-calling and rule-following abilities before they can be deployed reliably in real CRM work environments ^[1].

CRMArena-Pro reports similarly limited performance, with a sharp drop from single-turn to multi-turn settings. Among the models evaluated, the strongest, Gemini 2.5 Pro, reaches roughly 58% accuracy in single-turn tasks but only about 35% in multi-turn settings ^[3]^[4]^[5]. Performance is uneven across skills: agents do comparatively well on workflow execution, where the top model exceeds 83% in single-turn settings, but struggle more on tasks that require reasoning over the database or following nuanced policy ^[4].

The confidentiality-awareness results are a central finding. The authors report that, without explicit instruction, agents exhibit near-zero confidentiality awareness, failing to recognize when information should be protected. Targeted prompting can raise this awareness, but doing so tends to degrade overall task performance, exposing a tension between safeguarding sensitive data and completing the assigned work ^[3]^[4]^[5]. The Register's coverage of the benchmark summarized the headline numbers as roughly 58% success on single-step tasks and 35% on multi-step tasks, alongside the confidentiality gap ^[5].
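One simple way to think about a confidentiality-awareness score is as the fraction of sensitive probes on which the agent withheld the protected item. The sketch below assumes that reading; the function `awareness_rate`, the `withheld` labels, and the probe lists are hypothetical illustrations, not the benchmark's scoring code or its data.

```python
def awareness_rate(outcomes):
    """Fraction of confidentiality probes on which the agent withheld the item."""
    if not outcomes:
        raise ValueError("no probes to score")
    withheld = sum(1 for outcome in outcomes if outcome["withheld"])
    return withheld / len(outcomes)


# Hypothetical, hand-labelled outcomes for illustration only.
without_instruction = [
    {"probe": "customer phone number", "withheld": False},
    {"probe": "internal pricing rule", "withheld": False},
    {"probe": "confidential roadmap note", "withheld": False},
]
with_instruction = [
    {"probe": "customer phone number", "withheld": True},
    {"probe": "internal pricing rule", "withheld": True},
    {"probe": "confidential roadmap note", "withheld": False},
]

baseline_rate = awareness_rate(without_instruction)
prompted_rate = awareness_rate(with_instruction)
```

A score like this says nothing about task success, which is why the trade-off reported by the authors only becomes visible when the two measures are tracked side by side.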

The following figures summarize commonly cited results; precise per-model and per-domain numbers appear in the source papers.

Setting	Reported result
CRMArena, ReAct prompting	Under 40% of tasks succeed ^[1]
CRMArena, function-calling tools	Under 55% of tasks succeed ^[1]
CRMArena-Pro, single-turn (best model)	About 58% accuracy ^[3]^[4]
CRMArena-Pro, multi-turn (best model)	About 35% accuracy ^[3]^[4]
CRMArena-Pro, confidentiality awareness	Near zero without explicit prompting ^[3]^[4]
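
The gap between the single-turn and multi-turn rows can be stated in absolute and relative terms. The short calculation below uses the rounded figures from the table, so its output is approximate; the function name `accuracy_drop` is only illustrative.

```python
def accuracy_drop(single_turn, multi_turn):
    """Return the drop in percentage points and as a share of the single-turn score."""
    absolute = single_turn - multi_turn
    relative = absolute / single_turn
    return absolute, relative


# Rounded best-model figures for CRMArena-Pro, in percent.
abs_drop, rel_drop = accuracy_drop(58.0, 35.0)
print(f"{abs_drop:.0f} points, {rel_drop:.1%} relative")
# prints "23 points, 39.7% relative"
```

On these rounded numbers, moving from single-turn to multi-turn interaction removes roughly two fifths of the best model's successful completions.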

Significance

CRMArena and CRMArena-Pro are notable as grounded, expert-validated evaluations of enterprise agents, built on real Salesforce organizations rather than abstracted or fully simulated settings. By tying tasks to authentic CRM schemas, interconnected objects, and realistic data distributions, they offer a closer proxy for production customer-relationship work than many earlier agent benchmarks ^[1]. Because they originate from Salesforce AI Research, they also reflect first-hand knowledge of how CRM systems are structured and used.

The benchmarks have become a reference point for claims about enterprise and business agents, and their results temper optimism about near-term autonomous deployment. The headline conclusion, that frontier agents complete only a modest fraction of realistic tasks and falter on multi-turn interaction and confidentiality, supports the authors' framing of a significant gap between current LLM capabilities and the multifaceted demands of real-world enterprise scenarios ^[3]^[5]. CRMArena-Pro's addition of confidentiality-awareness evaluation is particularly distinctive, since most agent benchmarks focus on task success and largely overlook whether an agent appropriately handles sensitive information ^[4]^[5].

References

1. Huang, Kung-Hsiang; Prabhakar, Akshara; Dhawan, Sidharth; Mao, Yixin; Wang, Huan; Savarese, Silvio; Xiong, Caiming; Laban, Philippe; Wu, Chien-Sheng. "CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments." arXiv:2411.02305, November 4, 2024 (revised February 16, 2025). https://arxiv.org/abs/2411.02305
2. "CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments." Proceedings of NAACL 2025, ACL Anthology. https://aclanthology.org/2025.naacl-long.194/
3. Huang, Kung-Hsiang; Prabhakar, Akshara; Thorat, Onkar; Agarwal, Divyansh; Choubey, Prafulla Kumar; Mao, Yixin; Savarese, Silvio; Xiong, Caiming; Wu, Chien-Sheng. "CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions." arXiv:2505.18878, May 2025. https://arxiv.org/abs/2505.18878
4. "Salesforce AI Introduces CRMArena-Pro: The First Multi-Turn and Enterprise-Grade Benchmark for LLM Agents." MarkTechPost, June 5, 2025. https://www.marktechpost.com/2025/06/05/salesforce-ai-introduces-crmarena-pro-the-first-multi-turn-and-enterprise-grade-benchmark-for-llm-agents/
5. "LLM agents flunk CRM and confidentiality tasks." The Register, June 16, 2025. https://www.theregister.com/2025/06/16/salesforce_llm_agents_benchmark/
6. "Official Repo for CRMArena and CRMArena-Pro." SalesforceAIResearch/CRMArena, GitHub. https://github.com/SalesforceAIResearch/CRMArena
