MRCR

AI Benchmarks Large Language Models Model Evaluation

10 min read

Updated Jun 2, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 2, 2026

Fact-checked

In review queue

Sources

8 citations

Revision

v1 · 2,062 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

MRCR (Multi-Round Co-reference Resolution) is a synthetic long-context evaluation that tests whether a large language model can locate and disambiguate among several near-identical "needles" buried inside a long, multi-turn conversation, then reproduce one specific instance on request. The task was introduced by Google DeepMind in 2024 as one of three diagnostics in the Michelangelo long-context benchmark, and a public version was later released by OpenAI as the dataset openai/mrcr on Hugging Face.^[1]^[2]^[3] Unlike a plain retrieval probe, MRCR forces a model to reason about ordering and identity among confusable items, which is why it has become a standard figure in the long-context sections of recent model cards.^[3]^[4]

Overview

Most early long-context tests followed the needle in a haystack pattern: a single distinctive fact is hidden in a long body of filler text, and the model is asked to retrieve it. That format measures recall but is easy to saturate, because the needle stands out from its surroundings. MRCR raises the difficulty by planting multiple needles that are deliberately similar to one another and to the surrounding distractor content, so the model cannot succeed by simple pattern matching. It must instead track which instance is which, count occurrences, and resolve a co-reference such as "the second poem about tapirs" against the full conversation history.^[1]^[2]

The evaluation is built from synthetic multi-turn dialogues in which a user repeatedly asks an assistant to write short pieces (poems, essays, blog posts, riddles, and similar formats) on various topics. Several of those requests are identical, and the final instruction asks the model to return a particular one of them by index. Because the needles share topic and format with both each other and the distractors, the model has to perform genuine disambiguation rather than keyword lookup.^[1]^[3]

Origin and the Gemini 1.5 long-context work

The conceptual lineage of MRCR runs through Google's push toward million-token context windows. The Gemini 1.5 technical report (2024) demonstrated near-perfect "needle" recall, above 99 percent, on synthetic retrieval tasks at context lengths up to 1 million tokens, and showed the same behavior holding out to 10 million tokens across text, audio, and video.^[5] That report established that single-needle retrieval was largely solved for the Gemini 1.5 family, but it did not yet define a task under the name "multi-round co-reference resolution"; its synthetic probes were single-needle haystack experiments inspired by earlier needle-in-a-haystack work.^[5]

The MRCR task itself was formally introduced in the paper "Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries" by Kiran Vodrahalli and colleagues at Google DeepMind, posted to arXiv in September 2024.^[1] Michelangelo proposes a framework called Latent Structure Queries (LSQ), in which a model must "chisel away" irrelevant context to expose a latent structure and then answer queries about that structure. MRCR is one of three LSQ evaluations released with the paper, spanning natural-language and code domains, and it specifically targets the ability to retrieve and reproduce a designated response from a long conversation while resisting confusion from similar responses.^[1] In the Michelangelo setup the assistant turns were generated with PaLM 2, and adversarially similar "confounding" outputs sharing the same topic or format as the query were inserted to make the disambiguation harder.^[1]

This distinction matters for attribution: although MRCR is frequently described as a "Gemini" eval and is closely tied to the Gemini long-context program, the named task and its scoring originate in the Michelangelo paper rather than the Gemini 1.5 report.^[1]^[5] OpenAI's own dataset card credits "the MRCR eval first introduced by Gemini" and links directly to the Michelangelo arXiv entry.^[3]

The OpenAI public dataset

In April 2025, OpenAI released a public reimplementation as the dataset openai/mrcr on Hugging Face, described as a "long context multiple needle in a haystack benchmark."^[3] The card frames the task plainly: the model is given a long, multi-turn, synthetically generated conversation in which the user asks for writing on a topic, for example "write a poem about tapirs" or "write a blog post about rocks." Hidden in the conversation are 2, 4, or 8 identical requests, and the model is ultimately prompted to return the i-th instance, such as "Return the 2nd poem about tapirs."^[3] All of the content was generated by GPT-4o so that the needles blend in with the distractors, and the difficulty scales with both the number of needles and the length of the context.^[3]

The released dataset contains 2,400 rows organized into bins by the combined token count of prompt plus answer, spanning eight ranges from 4,096 tokens up to roughly 1,048,576 tokens, with 100 samples per bin. It draws on 438 distinct entities and 10 distinct writing formats. Each row exposes fields including the message list (prompt), the ground-truth answer, the random string to prepend, the number of needles n_needles, and the index of the desired message.^[3] In December 2025, OpenAI shipped a bug-fix revision after finding that roughly 10 percent of datapoints had contained too many target needles and about 5 percent had incorrect ground truth; corrected versions were uploaded with an added date_added field.^[3] Google DeepMind separately open-sourced its own internal version, "MRCR v2," through the eval_hub repository, supporting 2-to-8 needle configurations and context lengths scaling well beyond 1 million tokens, and again pointing back to the Michelangelo paper.^[6]

Task design and how needles work

In each MRCR sample, a synthetic conversation alternates between user requests and assistant responses. The user asks for several pieces of writing, and certain requests are repeated verbatim so that the conversation contains multiple distinct responses that all answer the same query. These repeated responses are the needles; the request that selects among them is the key. The remaining turns act as distractors, and they are chosen to overlap in topic, format, or both with the needles so that surface similarity cannot be used to shortcut the answer.^[1]^[3]

A simplified two-needle example from the OpenAI card illustrates the structure:^[3]

Turn	Role	Content
1	User	Write a poem about tapirs
2	Assistant	(first poem about tapirs)
3	User	Write a blog post about rocks
4	Assistant	(first blog post about rocks)
5	User	Write a poem about tapirs
6	Assistant	(second poem about tapirs)
...	...	(further distractor turns)
final	User	Prepend `aYooSG8CQg` to the 2nd (1-indexed) poem about tapirs. Do not include any other text.

To answer, the model must identify every poem about tapirs in order, select the second one, and emit it verbatim with the required prefix string attached. The co-reference resolution is the heart of the task: "the 2nd poem about tapirs" is an expression whose referent can only be fixed by scanning the whole dialogue and counting matching instances, while ignoring blog posts about rocks, social-media posts about tapirs, and other near-misses.^[1]^[3]

Scoring

MRCR is scored with a string-similarity measure rather than exact match, which lets it give partial credit for a mostly-correct reproduction of a long passage.^[1]^[3] The OpenAI implementation uses the SequenceMatcher ratio from Python's difflib library, returning a value between 0 and 1.^[3] Two rules govern grading: the model must prepend a specified random alphanumeric hash to its answer, and if that prefix is missing the score is forced to 0; otherwise the prefix is stripped from both strings and the similarity ratio is computed against the ground truth.^[3] The reference grading function from the dataset card is:

from difflib import SequenceMatcher

def grade(response, answer, random_string_to_prepend) -> float:
    if not response.startswith(random_string_to_prepend):
        return 0
    response = response.removeprefix(random_string_to_prepend)
    answer = answer.removeprefix(random_string_to_prepend)
    return float(SequenceMatcher(None, response, answer).ratio())

The prepended-hash requirement serves as a cheap signal that the model committed to a specific answer before producing it, and it lets the grader reject responses that hedge or refuse.^[3] DeepMind's MRCR v2 documentation notes two useful noise baselines for interpreting scores: roughly 1 percent if a model returns any assistant response at random, and about 1 divided by the number of relevant needles if it picks randomly among the matching responses, which works out to approximately 51 percent for 2 needles, 27 percent for 4 needles, and 15 percent for 8 needles.^[6]

Use in long-context model evaluations

MRCR has become a recurring entry in the long-context portions of frontier model cards and third-party leaderboards. Because there are several configurations (2, 4, or 8 needles, and various context-length bins), reported numbers are only comparable when the variant matches. Google typically reports a cumulative average up to 128k tokens alongside a pointwise score at 1M tokens, and the OpenAI dataset is most often quoted in its 2-needle 128k form.^[4]^[7] The table below collects reported figures from the indicated sources; values are scores on a 0-to-1 scale (or the equivalent percentage).

Model	Variant	Score	Source
Gemini 2.5 Pro	128k average	0.930	^[4]^[8]
Gemini 2.5 Pro	1M pointwise	~0.829	^[4]
Gemini 1.5 Pro	128k (OpenAI-MRCR style)	0.826	^[8]
Gemini 1.5 Flash	128k	0.719	^[8]
Gemini 2.0 Flash	128k	0.692	^[8]
GPT-5	2-needle 128k	0.952	^[7]
GPT-4.1	2-needle 128k	0.572	^[7]
GPT-4.1 mini	2-needle 128k	0.472	^[7]
GPT-4.5	2-needle 128k	0.385	^[7]
GPT-4.1 nano	2-needle 128k	0.366	^[7]
GPT-4o	2-needle 128k	0.319	^[7]
o3-mini	2-needle 128k	0.187	^[7]

Two qualitative patterns recur across reports. First, the Gemini 1.5 and 2.5 families show comparatively flat curves: after an initial drop, performance is roughly non-decreasing from 128k out to 1M tokens, whereas GPT and Claude models tend to decay more steeply as context grows.^[1]^[2] Second, OpenAI describes GPT-4.1 as retrieving the needle accurately at all positions and all context lengths up to 1 million tokens, and as outperforming GPT-4o at lengths up to 128k while holding strong performance to 1M.^[2] These cross-family comparisons should be read with care, since vendors do not always run identical needle counts or token bins.^[4]^[7]

Significance

MRCR addressed a real gap in long-context evaluation. By 2024, single-needle retrieval had become close to trivial for the strongest models, so it could no longer separate them.^[5] Embedding multiple confusable needles and demanding a specific, ordered instance restored discriminative power and pushed the test from pure recall toward retrieval plus reasoning.^[1] The use of a graded string-similarity score, rather than binary match, also makes the metric sensitive to partial degradation in very long reproductions, which is informative for studying how attention quality falls off with distance.^[3] The release of both a Google version (Michelangelo and MRCR v2) and an independent OpenAI dataset gave the community a reproducible, openly available probe that multiple labs cite when reporting context-length capabilities.^[1]^[3]^[6]

Limitations

MRCR is a synthetic, narrow task and is not a measure of general long-context understanding. Its inputs are machine-generated writing requests, so performance need not transfer to natural documents, code, or genuinely structured retrieval.^[1] Because scoring depends on a mandatory prepended hash, a model that simply forgets or reformats the prefix can score zero despite reproducing the correct passage, which can understate capability.^[3] The SequenceMatcher ratio is also a surface comparison that can reward partially correct strings or penalize harmless paraphrase, and it does not check semantic equivalence.^[3] Comparability across published numbers is fragile: different needle counts, token bins, and cumulative-versus-pointwise conventions are easy to conflate.^[4]^[7] Finally, the data-quality fix OpenAI applied in late 2025, after discovering mislabeled needles and incorrect ground truth in a fraction of rows, is a reminder that synthetic benchmarks can carry construction errors that affect early reported scores.^[3]

References

Vodrahalli, Kiran, et al. "Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries." arXiv:2409.12640, September 2024. https://arxiv.org/abs/2409.12640 ↩
"Introducing GPT-4.1 in the API." OpenAI, April 2025. https://openai.com/index/gpt-4-1/ ↩
"openai/mrcr." Hugging Face Datasets. https://huggingface.co/datasets/openai/mrcr ↩
"Gemini 2.5 Pro Model Card." Google DeepMind. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Pro-Model-Card.pdf ↩
Gemini Team, Google. "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context." arXiv:2403.05530. https://arxiv.org/abs/2403.05530 ↩
"eval_hub: MRCR v2." Google DeepMind, GitHub. https://github.com/google-deepmind/eval_hub/blob/master/eval_hub/mrcr_v2/README.md ↩
"OpenAI-MRCR: 2 needle 128k Benchmark Leaderboard." LLM-Stats. https://llm-stats.com/benchmarks/openai-mrcr:-2-needle-128k ↩
"MRCR Benchmark Leaderboard." LLM-Stats. https://llm-stats.com/benchmarks/mrcr ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

Suggest edit

What links here

LongBench Needle in a Haystack (NIAH)

Overview

Origin and the Gemini 1.5 long-context work

The OpenAI public dataset

Task design and how needles work

Scoring

Use in long-context model evaluations

Significance

Limitations

References

Improve this article

Related Articles

LLM-as-a-judge

FACTS Grounding

NoLiMa

LongBench v2

BABILong

LLM Benchmark Comparison (Leaderboard Overview)

What links here

Related Articles

LLM-as-a-judge

FACTS Grounding

NoLiMa

LongBench v2

BABILong

LLM Benchmark Comparison (Leaderboard Overview)

What links here