MuSR

Updated Jun 28, 2026

MuSR (Multistep Soft Reasoning) is a benchmark for evaluating multistep reasoning in large language models, built around long free-text narratives such as murder mysteries, object-placement scenarios, and team-allocation problems. It was introduced in the 2023 paper "MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning" by Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett of the University of Texas at Austin, and presented as a spotlight at ICLR 2024.^[1]^[2] Each instance is a story of roughly 1,000 words, and answering the attached question requires chaining together commonsense facts and deductive steps that are spread across the narrative. The benchmark contains 756 instances total across three domains, and even GPT-4 with its best chain-of-thought prompting trailed human annotators on all three.^[1] MuSR is best known to practitioners today as one of the six tasks in version 2 of Hugging Face's Open LLM Leaderboard.^[3]

What is MuSR?

MuSR is a dataset that tests whether a model can read a realistic narrative, extract the facts that matter, and combine them with unstated everyday knowledge to reach a single correct answer. The authors describe it in the paper's abstract as "a dataset for evaluating language models on multistep soft reasoning tasks specified in a natural language narrative," with "two crucial features": it is "created through a novel neurosymbolic synthetic-to-natural generation algorithm," and its "instances are free text narratives corresponding to real-world domains of reasoning."^[1] The word "soft" refers to inference steps that lean on commonsense priors (a person seen leaving the building cannot also be in the kitchen, a witness who saw an object moved knows where it now is) rather than on formal rules alone.

Why was MuSR built?

Many reasoning benchmarks present problems in a clean, abstract form: a short logic puzzle, a math word problem with all the relevant quantities stated, or a multiple-choice question with the premises laid out. MuSR was designed to push in a different direction. The authors wanted reasoning that is embedded in realistic prose, where the model has to read a full story, pick out the facts that matter, and combine them with everyday knowledge that the text never states explicitly. They argue that earlier datasets such as bAbI, RuleTakers, and CLUTRR are "only challenging for 'pure' LLM approaches" and "many are solvable with rule-based methods," while datasets like SocialIQA and StrategyQA "involve more nuanced commonsense" but "are often structurally simple (i.e., only involve 1-2 steps of reasoning)." MuSR aims to combine "sophisticated natural language and sophisticated reasoning" in one benchmark.^[1]

A second goal was resistance to memorization. Static, hand-written reasoning datasets tend to leak into pretraining corpora over time, and once a test set is memorized its scores stop measuring reasoning. Because MuSR instances are generated by an algorithm, the authors can produce fresh examples and, in principle, scale the difficulty as models improve. The paper frames this as a way to keep a benchmark hard even as frontier systems advance, since the same pipeline can be re-run to create new stories that no model has seen.^[1]

The design also targets a specific weakness. Short prompts let a model reason in a few tokens, but a 1,000-word story forces it to track many entities and partial conclusions at once. MuSR therefore mixes commonsense reasoning with long-context bookkeeping, and the two together turn out to be much harder than either alone.

How is MuSR generated?

The core technical contribution is a neurosymbolic synthetic-to-natural generation algorithm. Rather than writing stories by hand or asking a model to invent puzzles freely (which tends to produce inconsistent or unsolvable items), MuSR builds each example from a logical scaffold and only then renders it into prose.^[1]

The process runs in three stages. First, the pipeline instantiates a set of gold facts that fix the answer: in a murder mystery, for instance, which suspect is guilty and the underlying reasons. Second, a prompted language model (GPT-4 in the original work) recursively expands those gold facts into a reasoning tree. Each node is a conclusion, and its children are the sub-facts that, taken together, would let a reader deduce it. The tree mixes scenario facts that the story will state outright with commonsense facts that a human is expected to supply, so the leaves of the tree become the raw material a solver must work from. Third, the leaf facts are grouped and handed back to a language model that writes them into natural-language chunks, a step the authors call "chaptering," stitching the pieces into a coherent narrative. A validation step checks that the generated text still preserves the facts it was supposed to encode, which keeps the stories solvable and internally consistent.^[1]
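The tree structure described above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' pipeline: the `Node` class, the `kind` labels, and the example facts are all hypothetical, and the real system builds its trees by prompting a language model rather than by hand.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One fact in a toy reasoning tree (hypothetical structure)."""
    text: str
    kind: str = "deduced"  # "deduced", "scenario", or "commonsense"
    children: List["Node"] = field(default_factory=list)


def leaves(node):
    """Return the leaf facts under a node, left to right."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def count_steps(node):
    """Count internal nodes, i.e. conclusions a solver has to derive."""
    if not node.children:
        return 0
    return 1 + sum(count_steps(child) for child in node.children)


# Toy tree, invented for illustration only.
root = Node("The suspect had the opportunity", children=[
    Node("The suspect was at the house at the time of the crime", children=[
        Node("A neighbor saw the suspect's car in the driveway that night", "scenario"),
        Node("A car parked at a house suggests its owner is there", "commonsense"),
    ]),
    Node("The crime happened at the house", "scenario"),
])

# Scenario leaves are what the narrative would state outright;
# commonsense leaves are left for the reader to supply.
story_facts = [n.text for n in leaves(root) if n.kind == "scenario"]
unstated = [n.text for n in leaves(root) if n.kind == "commonsense"]

print(count_steps(root), len(story_facts), len(unstated))
```

In this toy, only the scenario leaves would be passed to the narrative-writing stage, while the commonsense leaf stays implicit. That split is what makes the final question require "soft" reasoning.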

This structure is what makes the difficulty controllable. The depth and branching of the reasoning tree set how many inference steps the question requires. The paper's dataset statistics report that each domain demands roughly ten reasoning steps per instance: 10 steps for murder mysteries, 11 for object placements, and 10 for team allocation, plus 9, 6, and 9 unstated commonsense facts respectively.^[1] Because the symbolic tree guarantees a correct answer while the language model supplies fluent prose, MuSR avoids the usual trade-off between natural text (often inconsistent) and synthetic logic (often artificial-sounding). The authors note that a final MuSR murder-mystery narrative reaches about 900 words on average, against roughly 280 for a basic one-shot prompt, while keeping a fact recall of 95 percent of the gold leaf facts.^[1]

What are the three MuSR domains?

MuSR covers three domains, each chosen to exercise a different style of reasoning while keeping the same tree-to-narrative recipe. The dataset contains 250 murder mysteries, 256 object-placement stories, and 250 team-allocation problems, for 756 instances in total.^[1]^[4]

Domain	Reasoning type	Task	Answer format	Instances
Murder mysteries	Social deduction	Identify the murderer from a short list of suspects by weighing means, motive, and opportunity	Choose 1 of 2 suspects	250
Object placements	Observational and theory-of-mind	Determine where a given person believes an object is, based on who saw it move and when	Choose 1 of 5 locations	256
Team allocation	Constraint satisfaction	Assign people to tasks to maximize combined skill and teamwork, given individual abilities and pairwise dynamics	Choose 1 of 3 assignments	250

The murder mysteries read like compact detective stories. Each describes a crime and two suspects, and the model has to reason about which suspect had the means, the motive, and the opportunity, then pick the one the evidence points to. The object-placement stories test a form of theory of mind: several people move items around a space and observe one another, and the question asks where a particular character would look for an object, which depends on the last move that person actually witnessed rather than on where the object truly is. The team-allocation problems are small optimization puzzles dressed as workplace scenarios, where the solver assigns individuals to jobs so that the total of skill levels and how well people work together is as high as possible; the authors enforce that the optimal assignment "outperforms all other assignments by a score of at least 2."^[1]
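The team-allocation margin rule can be illustrated with a small brute-force check. The sketch below is a toy under stated assumptions: the names, skill numbers, teamwork numbers, and the "one person on task A, a pair on task B" layout are invented for illustration and are not taken from the dataset.

```python
from itertools import combinations

people = ["Ana", "Ben", "Cy"]

# Hypothetical skill levels for each (person, task) pair.
skill = {
    ("Ana", "A"): 3, ("Ana", "B"): 1,
    ("Ben", "A"): 1, ("Ben", "B"): 3,
    ("Cy", "A"): 2, ("Cy", "B"): 2,
}

# Hypothetical pairwise teamwork scores.
teamwork = {
    frozenset(["Ana", "Ben"]): 1,
    frozenset(["Ana", "Cy"]): 2,
    frozenset(["Ben", "Cy"]): 3,
}


def score(solo, pair):
    """Total of individual skills plus the teamwork score of the pair."""
    pair_skill = sum(skill[(person, "B")] for person in pair)
    return skill[(solo, "A")] + pair_skill + teamwork[frozenset(pair)]


def candidates():
    """Yield the three ways to pick a pair for task B (the third person takes A)."""
    for pair in combinations(people, 2):
        solo = next(p for p in people if p not in pair)
        yield solo, pair


scored = sorted(((score(s, p), s, p) for s, p in candidates()), reverse=True)
best, runner_up = scored[0], scored[1]
margin = best[0] - runner_up[0]

# Keep the puzzle only if the best assignment wins by at least 2.
print(best, margin, margin >= 2)
```

With these made-up numbers the best assignment puts Ana alone on task A and scores 11, against 7 for the runner-up, so the instance would pass a margin-of-2 filter. A margin like this keeps the intended answer unambiguous even if a reader estimates one skill level slightly differently.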

How do models score on MuSR?

MuSR was built to probe chain-of-thought prompting, and the headline finding is that even the strongest model at the time of the study fell well short of people. With its best chain-of-thought prompting (a variant the paper calls "CoT+"), GPT-4 reached 80.4% on murder mysteries, 60.9% on object placements, and 68.4% on team allocation. Human annotators scored 94.1%, 95.0%, and 100.0% on the same three domains.^[1] The paper concludes that GPT-4 "performs the best out of all the models we tested, but still underperforms compared to humans," and that "although GPT-4 was instrumental in creating this dataset, it does not have the reasoning capabilities to solve it end-to-end."^[1] Object placement and team allocation in particular stayed far from human level, trailing annotators by about 34 and 32 percentage points respectively.

The gap looks larger once the chance baselines are taken into account, since the three domains have different numbers of answer options. The table below pairs the human and GPT-4 numbers with the random baseline and GPT-3.5 for each domain.^[1]

Domain	Random baseline	GPT-3.5 (CoT+)	GPT-4 (CoT+)	Human
Murder mysteries	50.0%	61.6%	80.4%	94.1%
Object placements	24.6%	46.9%	60.9%	95.0%
Team allocation	33.3%	40.4%	68.4%	100.0%
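One way to take the chance baselines into account is to rescale each score so that the random baseline maps to 0 and a perfect score to 100, then compare the human and GPT-4 gaps. The snippet below is a minimal sketch using the figures from the table. The `chance_adjusted` helper is a hypothetical name, and this rescaling is one common convention rather than something the paper itself reports.

```python
# Figures copied from the table above (percent accuracy).
rows = {
    "Murder mysteries": {"random": 50.0, "gpt4": 80.4, "human": 94.1},
    "Object placements": {"random": 24.6, "gpt4": 60.9, "human": 95.0},
    "Team allocation": {"random": 33.3, "gpt4": 68.4, "human": 100.0},
}


def chance_adjusted(score, baseline):
    """Rescale so the random baseline maps to 0 and a perfect score to 100."""
    return 100.0 * (score - baseline) / (100.0 - baseline)


gaps = {}
for domain, r in rows.items():
    raw_gap = r["human"] - r["gpt4"]
    adjusted_gap = (chance_adjusted(r["human"], r["random"])
                    - chance_adjusted(r["gpt4"], r["random"]))
    gaps[domain] = (raw_gap, adjusted_gap)
    print(f"{domain}: raw gap {raw_gap:.1f}, chance-adjusted gap {adjusted_gap:.1f}")
```

Under this rescaling the murder-mystery gap roughly doubles, from 13.7 raw points to about 27.4, because half of the raw scale in a two-way task is reachable by guessing.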

Open-weight chat models of the period trailed GPT-4 by a wide margin. Llama 2 70B Chat scored 48.8%, 42.2%, and 44.8% across the three domains, which puts it slightly below the 50.0% random baseline on murder mysteries, and the Vicuna variants often hovered near their respective random baselines, especially on object placement and team allocation.^[1] The authors note that chain-of-thought helps but does not close the distance to human readers, and that errors frequently come from dropping or misapplying one of the many intermediate facts rather than from a single wrong leap.

What role does MuSR play in the Open LLM Leaderboard?

MuSR gained wide visibility when Hugging Face rebuilt its Open LLM Leaderboard in June 2024. The original leaderboard had relied on benchmarks such as ARC, HellaSwag, MMLU, and GSM8K, several of which had become saturated or contaminated as models improved and as test data leaked into training sets. Version 2 replaced them with six harder tasks: IFEval, BIG-Bench Hard, MATH (level 5 subset), GPQA, MuSR, and MMLU-Pro.^[3]

On the leaderboard, MuSR is evaluated zero-shot with EleutherAI's Language Model Evaluation Harness, which scores each of the three subtasks by normalized accuracy (acc_norm); the subtask scores are then averaged. The murder-mystery subtask is treated as a two-way choice, object placement as a five-way choice, and team allocation as a three-way choice.^[3] Leaderboard scores are normalized so that random guessing maps to 0 and a perfect score maps to 100, with each subtask normalized against its own chance level before averaging, which keeps domains with different numbers of options on a comparable scale.^[5] Hugging Face's own documentation states that the MuSR "problems include murder mysteries, object placement questions, and team allocation optimizations," that "solving these problems requires models to integrate reasoning with long-range context parsing," and that "few models achieve better than random performance on this dataset," which is part of why it earned a place among the version 2 tasks.^[3] MuSR stayed difficult at a point when older benchmarks no longer separated strong models from weak ones. Because the stories are long, it also tends to reward models with larger context windows and better long-range tracking.
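The per-subtask normalization described above can be sketched as follows. This is an illustration, not the leaderboard's actual code: the function names and the example accuracies are hypothetical, and flooring below-chance scores at 0 is an assumption made here for the sketch.

```python
# Chance level per subtask, from the two-, five-, and three-way framing above.
CHANCE = {
    "murder_mysteries": 1 / 2,
    "object_placements": 1 / 5,
    "team_allocation": 1 / 3,
}


def normalize(acc, chance):
    """Map chance-level accuracy to 0 and perfect accuracy to 100.

    Scores at or below chance are floored at 0 (an assumption of this sketch).
    """
    if acc <= chance:
        return 0.0
    return 100.0 * (acc - chance) / (1.0 - chance)


def musr_score(acc_by_subtask):
    """Normalize each subtask against its own chance level, then average."""
    parts = [normalize(acc_by_subtask[name], chance)
             for name, chance in CHANCE.items()]
    return sum(parts) / len(parts)


# Hypothetical raw accuracies for some model, as fractions in [0, 1].
example = {
    "murder_mysteries": 0.60,
    "object_placements": 0.30,
    "team_allocation": 0.30,
}

print(round(musr_score(example), 2))
```

In this made-up example a raw 30% counts for something on the five-way object-placement subtask but for nothing on the three-way team-allocation subtask, where it sits below chance. Normalizing before averaging is what makes that difference visible.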

How does MuSR compare to other reasoning benchmarks?

MuSR sits alongside several other reasoning evaluations but differs in format. BIG-Bench Hard gathers a set of short, varied tasks that earlier models found difficult, while GSM8K focuses on grade-school math word problems. GPQA targets expert knowledge with graduate-level science questions, and MMLU-Pro expands multiple-choice knowledge testing to ten options per question. Compared with these, MuSR's distinguishing features are the length of each item (roughly 1,000 words of narrative) and its reliance on unstated commonsense, which is why it pairs naturally with the others on the Open LLM Leaderboard rather than duplicating them.^[3] Its emphasis on commonsense embedded in prose also connects it to the broader tradition of commonsense reasoning datasets, though MuSR adds the multistep, long-context dimension that short commonsense items usually lack.

The benchmark has also become a reference point for evaluating reasoning models, the class of systems that generate extended internal deliberation before answering, since MuSR's design specifically rewards models that can carry many intermediate conclusions through a long chain.

What are MuSR's limitations?

MuSR's reliance on a generation model is a double-edged feature. The same neurosymbolic pipeline that keeps stories solvable also means the prose, and some of the commonsense framing, is shaped by the language model used to write it, so the distribution of stories reflects that model's tendencies. The dataset is also modest in size, with 756 instances split across three domains, which limits fine-grained statistical comparisons between closely matched systems.^[1]^[4]
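To give a rough sense of what the dataset size means for comparisons, the sketch below computes a normal-approximation 95% interval for an accuracy measured on a single domain. It assumes independent items and uses a hypothetical accuracy of 70%; `accuracy_interval` is an illustrative helper, not part of any MuSR tooling.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation interval for an accuracy measured on n items."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error


# Domain sizes from the article; 0.70 is a hypothetical measured accuracy.
domains = [
    ("murder mysteries", 250),
    ("object placements", 256),
    ("team allocation", 250),
]
for name, n in domains:
    low, high = accuracy_interval(0.70, n)
    half_width = 100.0 * (high - low) / 2.0
    print(f"{name}: 70.0% +/- {half_width:.1f} points")
```

With about 250 items per domain the interval is several points wide on each side, so two systems whose single-domain scores differ by only a few points cannot be reliably separated on that domain alone.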

The three domains, while varied, are narrow slices of reasoning, and strong performance on MuSR does not by itself demonstrate general reasoning ability. There is a further tension in any synthetic benchmark: holding the published test set fixed makes results comparable over time but reopens the door to contamination, whereas regenerating fresh items preserves difficulty but changes what is being measured from one release to the next. Finally, multiple-choice scoring captures whether a model lands on the right answer but not whether it reached that answer through sound reasoning, so a model can sometimes guess correctly for the wrong reasons. These constraints are reasons to read MuSR as one informative signal among several rather than as a single verdict on a model's reasoning.^[1]

References

1. Sprague, Z., Ye, X., Bostrom, K., Chaudhuri, S., & Durrett, G. (2023). "MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning." arXiv:2310.16049. https://arxiv.org/abs/2310.16049
2. "MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning." OpenReview (ICLR 2024 spotlight). https://openreview.net/forum?id=jenyYQzue1
3. "About the Open LLM Leaderboard." Hugging Face. https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about
4. "MuSR." Inspect Evals documentation, UK AI Safety Institute. https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/musr/
5. "Scores Normalization." Open LLM Leaderboard documentation, Hugging Face. https://huggingface.co/docs/leaderboards/open_llm_leaderboard/normalization
6. Fourrier, C., Habib, N., Lozovskaya, A., Szafer, K., & Wolf, T. (2024). "Open-LLM performances are plateauing, let's make the leaderboard steep again." Hugging Face Blog. https://huggingface.co/spaces/open-llm-leaderboard/blog
7. "Hugging Face Releases Open LLM Leaderboard 2." MarkTechPost, June 27, 2024. https://www.marktechpost.com/2024/06/27/hugging-face-releases-open-llm-leaderboard-2-a-major-upgrade-featuring-tougher-benchmarks-fairer-scoring-and-enhanced-community-collaboration-for-evaluating-language-models/
8. "TAUR-Lab/MuSR." Datasets at Hugging Face. https://huggingface.co/datasets/TAUR-Lab/MuSR
9. "MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning." ICLR 2024 Poster. https://iclr.cc/virtual/2024/poster/18015
