OpenAI MRCR (Multi-Round Co-reference Resolution)
Last reviewed
Jun 8, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,531 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 8, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,531 words
Add missing citations, update stale details, or suggest a clearer explanation.
OpenAI MRCR (Multi-Round Co-reference Resolution) is a long-context evaluation dataset published by OpenAI that measures a language model's ability to distinguish between multiple near-identical "needles" buried inside a very long, multi-turn conversation. Rather than asking a model to find a single distinctive fact, MRCR scatters several almost-identical user requests (for example, several requests to write a poem about the same topic) throughout a synthetic dialogue and then asks the model to reproduce one specific instance, such as the second poem about a given subject. The difficulty comes from disambiguating among highly similar items spread across contexts that scale up to roughly one million tokens. [1][2]
OpenAI open-sourced the dataset on Hugging Face under an MIT license, with the initial upload on April 12, 2025 and a bugfix revision on December 5, 2025. [1] It was introduced publicly alongside the GPT-4.1 model family, which launched in the OpenAI API on April 14, 2025. [3][4] MRCR expands on earlier work by a Google DeepMind team, which first defined a multi-round co-reference resolution task in the "Michelangelo" long-context evaluation paper. [1][5] As an AI benchmark, OpenAI MRCR has become a commonly cited probe of whether a model with a large advertised context window can actually use that window reliably across its full length.
Early evaluations of long-context models relied heavily on the needle in a haystack (NIAH) test, in which a single distinctive sentence (the "needle") is inserted at a known position inside a large block of filler text (the "haystack"), and the model is asked to retrieve it. NIAH is simple and easy to score, and frontier models quickly began passing it at long context lengths. OpenAI itself reported that GPT-4.1 retrieves the needle accurately at all tested positions up to one million tokens in the NIAH setting, while acknowledging the test's limitations. [3][6]
Because plain retrieval saturated, researchers moved toward harder synthetic tasks that require disambiguation, ordering, and reasoning rather than a single keyword match. Two influential efforts framed this shift. NVIDIA's RULER benchmark generalized NIAH into a family of configurable synthetic tasks, including multi-key, multi-value, and multi-query retrieval, variable tracking, aggregation, and question answering, to measure how quality degrades as input length grows. [7] Separately, a Google DeepMind team proposed the "Latent Structure Queries" framework in the Michelangelo paper, arguing that meaningful long-context evaluation should force a model to "chisel away" irrelevant content to reveal a latent structure, rather than locate an isolated fact. [5] MRCR was one of the three Michelangelo tasks, designed to test understanding of ordering in natural text, the ability to tell apart similar drafts, and faithful reproduction of a specified passage. OpenAI's release builds directly on that formulation while increasing the difficulty and providing open, reproducible data. [1][5]
In an MRCR sample, the model is given a long, synthetically generated, multi-turn conversation between a user and an assistant. The user repeatedly asks for a piece of writing about a topic, for example "write a poem about tapirs" or "write a blog post about volcanoes," and the assistant responds each time with generated content. Several of these requests are near-duplicates: the same kind of writing about the same entity appears two, four, or eight times across the dialogue, interleaved with many distractor requests on other topics. At the end, the user asks the model to return a specific instance, such as the i-th poem about the target topic. [1][2]
The task is deliberately resistant to shortcuts. Because the needles are drawn from the same distribution as the distractors, simple keyword search does not isolate the correct answer; the model must both locate the relevant cluster of requests and reason about their order to pick the right one. OpenAI describes the challenge as requiring the model to "distinguish order amongst the needles" that are statistically indistinguishable from the surrounding content. [1] This stresses retrieval, co-reference resolution, and ordered reproduction simultaneously, which is why performance tends to fall off well before the maximum context length is reached. [5][6]
The dataset is organized into subsets by the number of repeated requests, or needles, that the model must disambiguate: a 2-needle subset, a 4-needle subset, and an 8-needle subset. More needles make the disambiguation task harder. Samples are bucketed into eight token-length bins so that performance can be reported as a function of context size. [1]
| Property | Value |
|---|---|
| Task | Retrieve the i-th of several near-identical requests in a long multi-turn dialogue |
| Needle subsets | 2, 4, and 8 needles |
| Token bins | [4096, 8192], (8192, 16384], (16384, 32768], (32768, 65536], (65536, 131072], (131072, 262144], (262144, 524288], (524288, 1048576] |
| Samples per bin | 100 |
| Total samples | 2,400 |
| Distinct entities | 438 |
| Writing formats | 10 |
| Dataset size | About 1.39 GB |
| License | MIT |
Token counts in the released code are computed with tiktoken using the o200k_base encoding, so the bin boundaries reflect OpenAI tokenizer tokens rather than raw characters. [1]
Scoring uses a string-similarity ratio rather than exact match. The reference grading code measures the SequenceMatcher ratio from Python's difflib library, comparing the model's answer against the target piece of writing; the score is a continuous value between 0 and 1. To prevent models from succeeding without genuinely identifying the correct instance, MRCR also requires the model to prepend a specific randomly generated alphanumeric hash to the beginning of its answer. If that hash is missing, the match ratio for that sample is set to 0. [1] Reported MRCR results are therefore typically averages of these per-sample SequenceMatcher ratios, grouped by needle count and by context-length bin.
The consistent finding across MRCR is that models degrade substantially as context length grows and as the number of needles increases, even when the same model passes simpler NIAH retrieval at the same lengths. When OpenAI introduced the dataset with GPT-4.1, it reported that GPT-4.1 outperformed GPT-4o at context lengths up to 128,000 tokens and maintained relatively strong performance out to one million tokens, while still showing a clear decline. Coverage of the launch noted GPT-4.1's MRCR accuracy falling from roughly 80 percent toward about 50 percent as the full context window was used. [4][6] OpenAI presented MRCR results for GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano as part of the same long-context evaluation suite that also included its Graphwalks multi-hop reasoning benchmark, on which GPT-4.1 scored 61.7 percent at contexts under 128,000 tokens versus about 42 percent for GPT-4o, with accuracy dropping sharply at longer lengths. [4][6]
The Michelangelo paper that originated the task evaluated GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, Gemini 1.5 Flash, and Gemini 1.5 Pro on MRCR from 2,000 up to 128,000 tokens and found that every model experienced significant fall-off, in many cases beginning before 32,000 tokens. [5] Because OpenAI's open-source MRCR and the original Gemini-team MRCR differ in their exact data and difficulty, scores are not directly comparable across the two versions, and care is needed when reading third-party leaderboards: some report numbers against the Michelangelo formulation while others use the OpenAI dataset. [1][5] In general, the 8-needle subsets and the longest context bins are where even leading models lose the most accuracy.
OpenAI MRCR sits in a family of synthetic long-context evaluations that share the goal of stressing models beyond simple retrieval.
By focusing narrowly on disambiguating many similar needles, OpenAI MRCR provides a sharper signal than plain NIAH about whether a model's advertised million-token context translates into reliable in-context recall and ordering, which is why it is frequently cited in long-context model reports and leaderboards. [1][6]