MultiChallenge

Updated Jun 8, 2026

Overview

MultiChallenge is a benchmark for evaluating large language models (LLMs) on realistic multi-turn conversations. It was introduced in the paper "MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs" by Ved Sirdeshmukh, Kaustubh Deshpande, and colleagues at Scale AI, first posted to arXiv in January 2025 and later published in Findings of the Association for Computational Linguistics: ACL 2025, held in Vienna ^[1]^[2]^[3]. The benchmark isolates four categories of conversational challenge that are common in everyday human-assistant interaction yet remain difficult for current frontier systems: instruction retention, inference memory of user information across turns, reliable versioned editing, and self-coherence ^[1].

The central finding is stark: although leading models score near-perfectly on older multi-turn evaluations, every frontier model tested at the time of release scored below 50 percent on MultiChallenge, with the top performer, Claude 3.5 Sonnet (June 2024), reaching only 41.4 percent average accuracy ^[1]^[2]. MultiChallenge is maintained as one of the public benchmarks on Scale's SEAL leaderboards, where it is updated as new models are released ^[4].

Motivation: multi-turn conversation is hard

Most widely used LLM benchmarks measure a model's response to a single prompt in isolation. Real assistant use, however, is overwhelmingly conversational: a user issues an instruction, refines it, provides new constraints, asks for edits, and expects the assistant to keep all of that consistent over many turns. In practice, models frequently lose track of earlier instructions, forget facts the user mentioned several turns ago, or contradict their own previous answers.

The authors argue that conducting a coherent multi-turn conversation demands three capabilities at the same time, which single-turn tests do not jointly exercise: accurate instruction following, careful allocation of attention across the full conversation context, and in-context reasoning ^[1]. Each of the four MultiChallenge categories is deliberately constructed so that a model must combine all three at once; doing well on only one of them is not enough.

A further motivation is benchmark saturation. Earlier multi-turn evaluations such as MT-Bench, the two-turn conversation benchmark from the LMSYS team, have become easy enough that frontier models cluster near the ceiling, limiting their ability to discriminate between strong systems ^[1]. MultiChallenge was designed specifically to remain unsaturated and to reveal failure modes that gentler benchmarks no longer surface.

The four challenge categories

MultiChallenge organizes its conversations into four distinct skills, each targeting a recognizable way that assistants fail over extended dialogue ^[1]^[2]:

Category	What it tests
Instruction retention	Whether the model continues to honor a directive given early in the conversation across many subsequent turns, rather than silently dropping it.
Inference memory of user information	Whether the model recalls and correctly uses details the user scattered across earlier turns to inform a later response.
Reliable versioned editing	Whether the model can manage a sequence of evolving edits to a document or artifact without losing earlier changes or reintroducing fixed problems.
Self-coherence	Whether the model stays internally consistent and avoids contradicting statements it made earlier in the same conversation.

The paper emphasizes that all four categories require instruction following, context allocation, and in-context reasoning together, which is what makes them resistant to shortcuts ^[1]. A model cannot, for example, pass the inference-memory items merely by following the final instruction well; it must also have correctly attended to and retained the relevant facts from earlier turns.

Evaluation methodology

Building the conversations

MultiChallenge consists of 273 carefully curated multi-turn test conversations, distributed across the four categories: 113 for inference memory, 69 for instruction retention, 50 for self-coherence, and 41 for reliable versioned editing ^[2]. The conversations were produced through a hybrid pipeline. An initial multi-agent generation system, using planner, user, and responder agents, drafted candidate conversations, which were then reviewed and revised by trained human annotators across multiple review layers ^[2]. According to the authors, seeding the process with machine-generated drafts cut the human effort required to build each item by roughly half compared with writing from scratch, while the human review preserved realism and difficulty ^[2].
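The category split above can be checked with a few lines of arithmetic. The sketch below uses only the four counts stated in the text; the variable names are illustrative and not taken from the benchmark's code.

```python
# Conversation counts per category, as stated in the text above.
CATEGORY_COUNTS = {
    "inference memory": 113,
    "instruction retention": 69,
    "self-coherence": 50,
    "reliable versioned editing": 41,
}

# The four categories should add up to the full benchmark size.
total = sum(CATEGORY_COUNTS.values())

# Share of the benchmark contributed by each category, in percent.
shares = {name: round(100 * n / total, 1) for name, n in CATEGORY_COUNTS.items()}

print(total)
for name, share in shares.items():
    print(f"{name}: {share}%")
```

Inference memory makes up roughly two fifths of the items, so a plain per-conversation average would weight that category more heavily than an average taken over the four category scores.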

LLM-as-a-judge with instance-level rubrics

Because the target behaviors (such as "did the model contradict itself?") are open-ended, MultiChallenge scores model responses using an LLM-as-a-judge approach. A key methodological contribution is that each conversation ships with its own instance-level rubric: a specific, item-tailored checklist describing what a correct final response must and must not do ^[1]. The judge model evaluates a candidate response against this targeted rubric rather than against a single generic prompt.
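A minimal sketch of how instance-level rubric judging might be wired up is shown below. It assumes each rubric reduces to a yes/no question with a known passing answer. The `RubricItem` layout, the `passes` helper, and the keyword-based `toy_judge` are all hypothetical stand-ins for illustration, not the benchmark's actual data format or judge.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """Hypothetical layout for one test conversation and its rubric."""
    category: str
    conversation: List[str]   # earlier turns, oldest first
    rubric_question: str      # yes/no question about the final response
    passing_answer: str       # the verdict ("yes" or "no") that counts as a pass


# A judge takes (rubric question, candidate response) and returns "yes" or "no".
Judge = Callable[[str, str], str]


def passes(item: RubricItem, response: str, judge: Judge) -> bool:
    """Return True when the judge's verdict matches the rubric's passing answer."""
    verdict = judge(item.rubric_question, response).strip().lower()
    return verdict == item.passing_answer


def toy_judge(question: str, response: str) -> str:
    """Keyword stand-in for an LLM judge; a real judge would be a model call."""
    return "yes" if "peanut" in response.lower() else "no"


item = RubricItem(
    category="inference memory",
    conversation=[
        "User: I have a peanut allergy.",
        "Assistant: Noted, I will keep that in mind.",
        "User: Can you suggest a quick snack?",
    ],
    rubric_question="Does the response recommend a food containing peanuts?",
    passing_answer="no",
)

print(passes(item, "Try apple slices with cheddar.", toy_judge))      # True
print(passes(item, "Peanut butter crackers are quick.", toy_judge))   # False
```

The point of the design is that the judge only has to answer one narrow question about the final response, which is an easier task than grading an entire conversation against a generic prompt.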

The authors report that this rubric-based judging agrees with experienced human raters 93.95 percent of the time, compared with only 37.33 percent for a naive baseline that asks a judge to assess the full conversation without instance-level guidance ^[2]. That large gap is central to the paper's claim that MultiChallenge can be scored automatically while still tracking human judgment closely, which is what makes the benchmark practical to run at leaderboard scale.
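One simple way to read an agreement figure like this is as the fraction of items on which the automatic judge and a human rater reach the same pass/fail verdict. The sketch below assumes that definition; the paper's exact protocol may differ, and the labels are invented purely for illustration.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Fraction of items where the judge's verdict equals the human verdict."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labelled item")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Made-up pass/fail verdicts for eight responses (not data from the paper).
judge = [True, False, True, True, False, True, False, True]
human = [True, False, True, False, False, True, False, True]

print(agreement_rate(judge, human))  # 0.875
```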

Results

When the benchmark was released, no frontier model reached the 50 percent mark, and several scored far lower, underscoring how much headroom remains in multi-turn conversation despite saturation on simpler tests ^[1]^[2]. The table below reproduces the human-evaluated scores reported in the paper for a representative set of models ^[2].

Model	MultiChallenge accuracy
Claude 3.5 Sonnet (June 2024)	41.4%
o1-preview	37.2%
Gemini 1.5 Pro (Aug 2024)	20.0%
Llama 3.1 405B	14.9%
Mistral Large	14.6%
GPT-4o (Aug 2024)	12.5%

The spread is notable: the best model scored more than three times as high as GPT-4o (41.4 versus 12.5 percent) and nearly three times as high as Llama 3.1 405B and Mistral Large, indicating that multi-turn robustness is not simply a function of a model's general single-turn strength ^[2]. The same models that lead on conventional reasoning or instruction-following benchmarks did not uniformly lead here.
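The spread can be quantified directly from the table. The short calculation below divides the top score by each of the others; it uses only the figures listed above.

```python
# Accuracy in percent, copied from the results table above.
SCORES = {
    "Claude 3.5 Sonnet (June 2024)": 41.4,
    "o1-preview": 37.2,
    "Gemini 1.5 Pro (Aug 2024)": 20.0,
    "Llama 3.1 405B": 14.9,
    "Mistral Large": 14.6,
    "GPT-4o (Aug 2024)": 12.5,
}

best_model = max(SCORES, key=SCORES.get)
best = SCORES[best_model]

# How many times higher the best score is than each other model's score.
ratios = {m: round(best / s, 2) for m, s in SCORES.items() if m != best_model}

for model, ratio in ratios.items():
    print(f"{best_model} vs {model}: {ratio}x")
```

On these numbers the ratio is about 3.3 against GPT-4o, about 2.8 against Llama 3.1 405B and Mistral Large, about 2.1 against Gemini 1.5 Pro, and about 1.1 against o1-preview.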

Scale keeps MultiChallenge as a living leaderboard, re-running it on newer releases ^[4]. In a February 2026 snapshot of the Scale Labs leaderboard, scores had risen substantially as model capabilities advanced, with the leading entries scoring in roughly the high-60s to mid-70s percent range, a large improvement over the sub-42 percent ceiling at the benchmark's launch in early 2025 ^[4]. Even at those scores the benchmark still separates strong models from one another and has not been retired as saturated, which is consistent with its design goal of staying difficult.

Significance

MultiChallenge filled a recognized gap in LLM evaluation by targeting the conversational dynamics that dominate real assistant use but that single-turn and lightly multi-turn benchmarks largely ignore. By decomposing multi-turn competence into instruction retention, inference memory, versioned editing, and self-coherence, it gave developers a structured way to diagnose where a model breaks down over a long conversation, rather than reporting a single opaque score ^[1].

Two aspects have been especially influential. First, the demonstration that frontier models score below 50 percent on realistic multi-turn tasks, despite near-ceiling results on prior benchmarks such as MT-Bench, provided concrete evidence that conversational robustness lagged behind headline capabilities ^[1]. Second, the instance-level rubric method for LLM-as-a-judge, validated at roughly 94 percent agreement with human raters, offered a reusable recipe for scoring open-ended conversational behavior automatically and credibly ^[2].

As a component of Scale's SEAL leaderboard suite, MultiChallenge sits alongside other Scale-built evaluations and complements broader multi-turn and instruction-following benchmarks, giving the research community a standardized, regularly updated measure of how well successive model generations sustain coherent, instruction-faithful dialogue ^[4].

References

[1] Ved Sirdeshmukh, Kaustubh Deshpande, et al. "MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs." arXiv:2501.17399, January 2025. https://arxiv.org/abs/2501.17399
[2] "MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs." Scale Labs (paper page). https://labs.scale.com/papers/multichallenge
[3] "MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs." Findings of the Association for Computational Linguistics: ACL 2025. https://aclanthology.org/2025.findings-acl.958/
[4] "MultiChallenge." SEAL leaderboard, Scale AI. https://scale.com/leaderboard/multichallenge
