OpenAI MRCR (Multi-Round Co-reference Resolution)

AI Benchmarks AI Safety

8 min read

Updated Jun 8, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 8, 2026

Fact-checked

In review queue

Sources

7 citations

Revision

v1 · 1,531 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Overview

OpenAI MRCR (Multi-Round Co-reference Resolution) is a long-context evaluation dataset published by OpenAI that measures a language model's ability to distinguish between multiple near-identical "needles" buried inside a very long, multi-turn conversation. Rather than asking a model to find a single distinctive fact, MRCR scatters several almost-identical user requests (for example, several requests to write a poem about the same topic) throughout a synthetic dialogue and then asks the model to reproduce one specific instance, such as the second poem about a given subject. The difficulty comes from disambiguating among highly similar items spread across contexts that scale up to roughly one million tokens. ^[1]^[2]

OpenAI open-sourced the dataset on Hugging Face under an MIT license, with the initial upload on April 12, 2025 and a bugfix revision on December 5, 2025. ^[1] It was introduced publicly alongside the GPT-4.1 model family, which launched in the OpenAI API on April 14, 2025. ^[3]^[4] MRCR expands on earlier work by a Google DeepMind team, which first defined a multi-round co-reference resolution task in the "Michelangelo" long-context evaluation paper. ^[1]^[5] As an AI benchmark, OpenAI MRCR has become a commonly cited probe of whether a model with a large advertised context window can actually use that window reliably across its full length.

Background: long-context evaluation

Early evaluations of long-context models relied heavily on the needle in a haystack (NIAH) test, in which a single distinctive sentence (the "needle") is inserted at a known position inside a large block of filler text (the "haystack"), and the model is asked to retrieve it. NIAH is simple and easy to score, and frontier models quickly began passing it at long context lengths. OpenAI itself reported that GPT-4.1 retrieves the needle accurately at all tested positions up to one million tokens in the NIAH setting, while acknowledging the test's limitations. ^[3]^[6]

Because plain retrieval saturated, researchers moved toward harder synthetic tasks that require disambiguation, ordering, and reasoning rather than a single keyword match. Two influential efforts framed this shift. NVIDIA's RULER benchmark generalized NIAH into a family of configurable synthetic tasks, including multi-key, multi-value, and multi-query retrieval, variable tracking, aggregation, and question answering, to measure how quality degrades as input length grows. ^[7] Separately, a Google DeepMind team proposed the "Latent Structure Queries" framework in the Michelangelo paper, arguing that meaningful long-context evaluation should force a model to "chisel away" irrelevant content to reveal a latent structure, rather than locate an isolated fact. ^[5] MRCR was one of the three Michelangelo tasks, designed to test understanding of ordering in natural text, the ability to tell apart similar drafts, and faithful reproduction of a specified passage. OpenAI's release builds directly on that formulation while increasing the difficulty and providing open, reproducible data. ^[1]^[5]

What MRCR tests

In an MRCR sample, the model is given a long, synthetically generated, multi-turn conversation between a user and an assistant. The user repeatedly asks for a piece of writing about a topic, for example "write a poem about tapirs" or "write a blog post about volcanoes," and the assistant responds each time with generated content. Several of these requests are near-duplicates: the same kind of writing about the same entity appears two, four, or eight times across the dialogue, interleaved with many distractor requests on other topics. At the end, the user asks the model to return a specific instance, such as the i-th poem about the target topic. ^[1]^[2]

The task is deliberately resistant to shortcuts. Because the needles are drawn from the same distribution as the distractors, simple keyword search does not isolate the correct answer; the model must both locate the relevant cluster of requests and reason about their order to pick the right one. OpenAI describes the challenge as requiring the model to "distinguish order amongst the needles" that are statistically indistinguishable from the surrounding content. ^[1] This stresses retrieval, co-reference resolution, and ordered reproduction simultaneously, which is why performance tends to fall off well before the maximum context length is reached. ^[5]^[6]

Structure and scoring

The dataset is organized into subsets by the number of repeated requests, or needles, that the model must disambiguate: a 2-needle subset, a 4-needle subset, and an 8-needle subset. More needles make the disambiguation task harder. Samples are bucketed into eight token-length bins so that performance can be reported as a function of context size. ^[1]

Property	Value
Task	Retrieve the i-th of several near-identical requests in a long multi-turn dialogue
Needle subsets	2, 4, and 8 needles
Token bins	[4096, 8192], (8192, 16384], (16384, 32768], (32768, 65536], (65536, 131072], (131072, 262144], (262144, 524288], (524288, 1048576]
Samples per bin	100
Total samples	2,400
Distinct entities	438
Writing formats	10
Dataset size	About 1.39 GB
License	MIT

Token counts in the released code are computed with tiktoken using the o200k_base encoding, so the bin boundaries reflect OpenAI tokenizer tokens rather than raw characters. ^[1]

Scoring uses a string-similarity ratio rather than exact match. The reference grading code measures the SequenceMatcher ratio from Python's difflib library, comparing the model's answer against the target piece of writing; the score is a continuous value between 0 and 1. To prevent models from succeeding without genuinely identifying the correct instance, MRCR also requires the model to prepend a specific randomly generated alphanumeric hash to the beginning of its answer. If that hash is missing, the match ratio for that sample is set to 0. ^[1] Reported MRCR results are therefore typically averages of these per-sample SequenceMatcher ratios, grouped by needle count and by context-length bin.

Results

The consistent finding across MRCR is that models degrade substantially as context length grows and as the number of needles increases, even when the same model passes simpler NIAH retrieval at the same lengths. When OpenAI introduced the dataset with GPT-4.1, it reported that GPT-4.1 outperformed GPT-4o at context lengths up to 128,000 tokens and maintained relatively strong performance out to one million tokens, while still showing a clear decline. Coverage of the launch noted GPT-4.1's MRCR accuracy falling from roughly 80 percent toward about 50 percent as the full context window was used. ^[4]^[6] OpenAI presented MRCR results for GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano as part of the same long-context evaluation suite that also included its Graphwalks multi-hop reasoning benchmark, on which GPT-4.1 scored 61.7 percent at contexts under 128,000 tokens versus about 42 percent for GPT-4o, with accuracy dropping sharply at longer lengths. ^[4]^[6]

The Michelangelo paper that originated the task evaluated GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, Gemini 1.5 Flash, and Gemini 1.5 Pro on MRCR from 2,000 up to 128,000 tokens and found that every model experienced significant fall-off, in many cases beginning before 32,000 tokens. ^[5] Because OpenAI's open-source MRCR and the original Gemini-team MRCR differ in their exact data and difficulty, scores are not directly comparable across the two versions, and care is needed when reading third-party leaderboards: some report numbers against the Michelangelo formulation while others use the OpenAI dataset. ^[1]^[5] In general, the 8-needle subsets and the longest context bins are where even leading models lose the most accuracy.

Relationship to other long-context benchmarks

OpenAI MRCR sits in a family of synthetic long-context evaluations that share the goal of stressing models beyond simple retrieval.

Needle in a haystack (NIAH): The baseline single-fact retrieval test. MRCR can be viewed as a multi-needle, disambiguation-focused successor that frontier models cannot solve by keyword matching alone. ^[3]^[6]
Michelangelo (Latent Structure Queries): The Google DeepMind paper that first defined MRCR, alongside the Latent List and "I do not know" (IDK) tasks. OpenAI MRCR is a harder, open-sourced reimplementation of the MRCR component. ^[1]^[5]
RULER: NVIDIA's configurable synthetic suite of 13 tasks spanning retrieval, multi-hop variable tracking, aggregation, and question answering, used to chart quality degradation as sequence length increases. RULER and MRCR are complementary, with MRCR concentrating specifically on co-reference and ordering among near-identical items. ^[7]
LongBench and similar suites: Broader long-context benchmarks that combine more naturalistic tasks. MRCR differs by being fully synthetic, automatically scored by string similarity, and resistant to data contamination because each sample is generated fresh. ^[5]

By focusing narrowly on disambiguating many similar needles, OpenAI MRCR provides a sharper signal than plain NIAH about whether a model's advertised million-token context translates into reliable in-context recall and ordering, which is why it is frequently cited in long-context model reports and leaderboards. ^[1]^[6]

References

OpenAI. "openai/mrcr." Hugging Face Datasets. https://huggingface.co/datasets/openai/mrcr ↩
OpenAI. "openai/mrcr: README.md." Hugging Face. https://huggingface.co/datasets/openai/mrcr/blob/main/README.md ↩
OpenAI. "Introducing GPT-4.1 in the API." April 14, 2025. https://openai.com/index/gpt-4-1/ ↩
The Decoder. "OpenAI launches GPT-4.1: New model family to improve agents, long contexts and coding." April 2025. https://the-decoder.com/openai-launches-gpt-4-1-new-model-family-to-improve-agents-long-contexts-and-coding/ ↩
Vodrahalli, K., et al. "Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries." arXiv:2409.12640. https://arxiv.org/abs/2409.12640 ↩
Analytics Vidhya. "All About OpenAI's Latest GPT 4.1 Family." April 2025. https://www.analyticsvidhya.com/blog/2025/04/open-ai-gpt-4-1/ ↩
Hsieh, C., et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?" arXiv:2404.06654. https://github.com/NVIDIA/RULER ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

Suggest edit

What links here

Humanity's Last Exam METR

Overview

Background: long-context evaluation

What MRCR tests

Structure and scoring

Results

Relationship to other long-context benchmarks

References

Improve this article

Related Articles

Humanity's Last Exam

METR

SimpleQA

TruthfulQA

HaluEval

MACHIAVELLI (benchmark)

What links here

Related Articles

Humanity's Last Exam

METR

SimpleQA

TruthfulQA

HaluEval

MACHIAVELLI (benchmark)

What links here