MuSR
Last reviewed
May 31, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,912 words
MuSR is a benchmark for evaluating multistep reasoning in large language models, built around long free-text narratives such as murder mysteries, object-placement scenarios, and team-allocation problems. The name stands for Multistep Soft Reasoning. It was introduced in the 2023 paper "MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning" by Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett, and presented as a spotlight at ICLR 2024.[1][2] Each instance is a story of roughly 1,000 words, and answering the attached question requires chaining together commonsense facts and deductive steps that are spread across the narrative. MuSR is best known to practitioners today as one of the six tasks in version 2 of Hugging Face's Open LLM Leaderboard.[3]
Many reasoning benchmarks present problems in a clean, abstract form: a short logic puzzle, a math word problem with all the relevant quantities stated, or a multiple-choice question with the premises laid out. MuSR was designed to push in a different direction. The authors wanted reasoning that is embedded in realistic prose, where the model has to read a full story, pick out the facts that matter, and combine them with everyday knowledge that the text never states explicitly. They call this "soft" reasoning because the inference steps lean on commonsense priors (a person who was seen leaving the building cannot also be in the kitchen, a witness who saw an object moved knows where it now is) rather than on formal rules alone.[1]
A second goal was resistance to memorization. Static, hand-written reasoning datasets tend to leak into pretraining corpora over time, and once a test set is memorized its scores stop measuring reasoning. Because MuSR instances are generated by an algorithm, the authors can produce fresh examples and, in principle, scale the difficulty as models improve. The paper frames this as a way to keep a benchmark hard even as frontier systems advance, since the same pipeline can be re-run to create new stories that no model has seen.[1]
The design also targets a specific weakness. Short prompts let a model reason in a few tokens, but a 1,000-word story forces it to track many entities and partial conclusions at once. MuSR therefore mixes commonsense reasoning with long-context bookkeeping, and the two together turn out to be much harder than either alone.
The core technical contribution is a neurosymbolic synthetic-to-natural generation algorithm. Rather than writing stories by hand or asking a model to invent puzzles freely (which tends to produce inconsistent or unsolvable items), MuSR builds each example from a logical scaffold and only then renders it into prose.[1]
The process runs in three stages. First, the pipeline instantiates a set of gold facts that fix the answer: in a murder mystery, for instance, which suspect is guilty and the underlying reasons. Second, a prompted language model recursively expands those gold facts into a reasoning tree. Each node is a conclusion, and its children are the sub-facts that, taken together, would let a reader deduce it. The tree mixes facts that the story will state outright with commonsense facts that a human is expected to supply, so the leaves of the tree become the raw material a solver must work from. Third, the leaf facts are grouped and handed back to a language model that writes them into natural-language chunks, stitching the pieces into a coherent narrative. A validation step checks that the generated text still preserves the facts it was supposed to encode, which keeps the stories solvable and internally consistent.[1]
This structure is what makes the difficulty controllable. The depth and branching of the reasoning tree set how many inference steps the question requires, and the authors report instances that demand around ten reasoning steps and several unstated commonsense facts each. Because the symbolic tree guarantees a correct answer while the language model supplies fluent prose, MuSR avoids the usual trade-off between natural text (often inconsistent) and synthetic logic (often artificial-sounding).[1]
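The tree structure described above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the `FactNode` class, the `kind` labels, and the example facts are all hypothetical. It shows how leaves split into stated and unstated facts and how depth and node count track the amount of inference a question requires.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class FactNode:
    """Hypothetical reasoning-tree node: a conclusion plus its supporting sub-facts."""
    text: str
    # "deduced", "story" (stated in the narrative), or "commonsense" (left unstated)
    kind: str = "deduced"
    children: list[FactNode] = field(default_factory=list)


def leaves(node: FactNode) -> list[FactNode]:
    """Collect the leaf facts, the raw material a solver must work from."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def depth(node: FactNode) -> int:
    """Number of deduction levels below this node (a leaf has depth 0)."""
    if not node.children:
        return 0
    return 1 + max(depth(child) for child in node.children)


def deduction_steps(node: FactNode) -> int:
    """Count internal nodes: each one is a conclusion the solver has to derive."""
    if not node.children:
        return 0
    return 1 + sum(deduction_steps(child) for child in node.children)


# Invented example facts, not taken from the dataset.
tree = FactNode("Alex is the murderer", children=[
    FactNode("Alex had the opportunity", children=[
        FactNode("Alex was seen at the marina at 9 pm", kind="story"),
        FactNode("The victim died at the marina around 9 pm", kind="story"),
        FactNode("Someone present at the time of death could have acted", kind="commonsense"),
    ]),
    FactNode("Alex had a motive", children=[
        FactNode("The victim cut Alex out of a business deal", kind="story"),
        FactNode("Losing money to someone can breed resentment", kind="commonsense"),
    ]),
])

story_facts = [n.text for n in leaves(tree) if n.kind == "story"]
unstated = [n.text for n in leaves(tree) if n.kind == "commonsense"]
print(depth(tree), deduction_steps(tree), len(story_facts), len(unstated))  # 2 3 3 2
```

In this toy tree only the three `story` leaves would be rendered into prose; the two `commonsense` leaves are the part a reader has to supply. Deepening the tree or adding children per node raises the number of deductions needed.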
MuSR covers three domains, each chosen to exercise a different style of reasoning while keeping the same tree-to-narrative recipe. The dataset contains 250 murder mysteries, 256 object-placement stories, and 250 team-allocation problems.[1][4]
| Domain | Reasoning type | Task | Answer format |
|---|---|---|---|
| Murder mysteries | Social deduction | Identify the murderer from a short list of suspects by weighing means, motive, and opportunity | Choose 1 of 2 suspects |
| Object placements | Observational and theory-of-mind | Determine where a given person believes an object is, based on who saw it move and when | Choose 1 of 5 locations |
| Team allocation | Constraint satisfaction | Assign people to tasks to maximize combined skill and teamwork, given individual abilities and pairwise dynamics | Choose 1 of 3 assignments |
The murder mysteries read like compact detective stories. Each describes a crime and two suspects, and the model has to reason about which suspect had the means, the motive, and the opportunity, then pick the one the evidence points to. The object-placement stories test a form of theory of mind: several people move items around a space and observe one another, and the question asks where a particular character would look for an object, which depends on the last move that person actually witnessed rather than on where the object truly is. The team-allocation problems are small optimization puzzles dressed as workplace scenarios, where the solver assigns individuals to jobs so that the total of skill levels and how well people work together is as high as possible.[1]
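The logic behind the last two domains can be made concrete with a small sketch. Both functions below are hypothetical illustrations with invented names, data, and scoring; the benchmark's actual scoring rules may differ. The first tracks a character's belief as the destination of the last move that character witnessed. The second brute-forces a team assignment, assuming teamwork counts only for pairs placed on the same task.

```python
from itertools import product


def believed_location(events, person, start):
    """Where `person` thinks the object is: the destination of the last move they saw.

    `events` is a chronological list of (destination, witnesses) pairs for one object.
    """
    belief = start  # assumption: everyone saw the initial placement
    for destination, witnesses in events:
        if person in witnesses:
            belief = destination
    return belief


def best_assignment(people, tasks, skill, teamwork):
    """Brute-force the assignment that maximizes skill plus same-task teamwork.

    skill[(person, task)] and teamwork[frozenset({a, b})] are illustrative scores.
    Every task must receive at least one person.
    """
    best, best_score = None, float("-inf")
    for choice in product(tasks, repeat=len(people)):
        if set(choice) != set(tasks):
            continue  # skip assignments that leave a task unstaffed
        assignment = dict(zip(people, choice))
        score = sum(skill[(p, t)] for p, t in assignment.items())
        for i, a in enumerate(people):
            for b in people[i + 1:]:
                if assignment[a] == assignment[b]:
                    score += teamwork[frozenset({a, b})]
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


# Object placement: Ana misses the second move, so her belief is stale.
events = [("drawer", {"Ana", "Ben"}), ("shelf", {"Ben"})]
print(believed_location(events, "Ana", "desk"))  # drawer
print(believed_location(events, "Ben", "desk"))  # shelf

# Team allocation with made-up scores.
people = ["Ana", "Ben", "Cy"]
tasks = ["cooking", "serving"]
skill = {
    ("Ana", "cooking"): 3, ("Ana", "serving"): 1,
    ("Ben", "cooking"): 1, ("Ben", "serving"): 2,
    ("Cy", "cooking"): 2, ("Cy", "serving"): 2,
}
teamwork = {
    frozenset({"Ana", "Ben"}): 1,
    frozenset({"Ana", "Cy"}): 3,
    frozenset({"Ben", "Cy"}): 1,
}
plan, total = best_assignment(people, tasks, skill, teamwork)
print(plan, total)
```

The hard part for a model is not this computation but extracting the inputs: who witnessed which move, or how skilled each person is, has to be inferred from a long narrative rather than read from a table.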
MuSR was built to probe chain-of-thought prompting, and the headline finding is that even the strongest model at the time of the study fell well short of people. With its best chain-of-thought prompting, GPT-4 reached 80.4% on murder mysteries, 60.9% on object placements, and 68.4% on team allocation. Human annotators scored 94.1%, 95.0%, and 100.0% on the same three domains.[1] Object placement and team allocation in particular stayed far from human level, trailing annotators by about 34 and 32 points respectively, compared with about 14 points on murder mysteries.
The gap looks larger once the chance baselines are taken into account, since the three domains have different numbers of answer options. The table below pairs the human and GPT-4 numbers with the random baseline for each domain.[1]
| Domain | Random baseline | GPT-4 (best chain-of-thought) | Human |
|---|---|---|---|
| Murder mysteries | 50.0% | 80.4% | 94.1% |
| Object placements | 24.6% | 60.9% | 95.0% |
| Team allocation | 33.3% | 68.4% | 100.0% |
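One way to make the comparison concrete is to rescale each accuracy so that the random baseline maps to 0 and a perfect score maps to 100. This rescaling is an illustrative calculation on the table values, not a figure reported in the paper:

$$
\text{adjusted} = 100 \times \frac{\text{accuracy} - \text{chance}}{100 - \text{chance}}
$$

Applied to the GPT-4 column:

$$
\begin{aligned}
\text{Murder mysteries:} \quad & 100 \times \frac{80.4 - 50.0}{100 - 50.0} = 60.8 \\
\text{Object placements:} \quad & 100 \times \frac{60.9 - 24.6}{100 - 24.6} \approx 48.1 \\
\text{Team allocation:} \quad & 100 \times \frac{68.4 - 33.3}{100 - 33.3} \approx 52.6
\end{aligned}
$$

The same formula puts the human scores at 88.2, about 93.4, and 100.0. On this scale the human-to-GPT-4 gaps are roughly 27, 45, and 47 points, against raw gaps of 13.7, 34.1, and 31.6 points.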
Other models tested in the paper trailed GPT-4 by a wide margin. GPT-3.5 landed in the low-to-mid range across the three domains, and open-weight chat models of the period, including Llama 2 70B Chat and the Vicuna variants, often hovered near their respective random baselines, especially on object placement and team allocation.[1] The authors note that chain-of-thought helps, but that it does not close the distance to human readers, and that errors frequently come from dropping or misapplying one of the many intermediate facts rather than from a single wrong leap.
MuSR gained wide visibility when Hugging Face rebuilt its Open LLM Leaderboard in June 2024. The original leaderboard had relied on benchmarks such as ARC, HellaSwag, MMLU, and GSM8K, several of which had become saturated or contaminated as models improved and as test data leaked into training sets. Version 2 replaced them with six harder tasks: IFEval, BIG-Bench Hard, MATH (level 5 subset), GPQA, MuSR, and MMLU-Pro.[3]
On the leaderboard, MuSR is evaluated zero-shot using EleutherAI's Language Model Evaluation Harness, scoring each of the three subtasks by normalized accuracy and then averaging them. The murder-mystery subtask is treated as a two-way choice, object placement as a five-way choice, and team allocation as a three-way choice.[3] Leaderboard scores are normalized so that random guessing maps to 0 and a perfect score maps to 100, with each subtask normalized against its own chance level before averaging, which keeps domains with different numbers of options on a comparable scale.[5] Hugging Face's own description notes that few models score better than random on MuSR, which is part of why it earned a place among the version 2 tasks: it stayed difficult when older benchmarks no longer separated strong models from weak ones.[3] Because the stories are long, MuSR also tends to reward models with larger context windows and better long-range tracking.
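The normalization just described can be sketched as follows. This is a minimal illustration, not the leaderboard's actual code: the function and dictionary names are hypothetical, flooring below-chance scores at 0 is an assumption, and the example accuracies are made up.

```python
def normalize(accuracy, num_choices):
    """Rescale accuracy in [0, 1] so chance maps to 0 and a perfect score to 100."""
    chance = 1.0 / num_choices
    # Assumption: scores below chance are floored at 0 rather than going negative.
    return max(accuracy - chance, 0.0) / (1.0 - chance) * 100.0


# Number of answer options per subtask, as described above.
NUM_CHOICES = {
    "murder_mysteries": 2,
    "object_placements": 5,
    "team_allocation": 3,
}


def musr_score(raw):
    """Normalize each subtask against its own chance level, then average."""
    normalized = [normalize(raw[name], k) for name, k in NUM_CHOICES.items()]
    return sum(normalized) / len(normalized)


# Made-up raw accuracies for a hypothetical model.
raw = {"murder_mysteries": 0.60, "object_placements": 0.40, "team_allocation": 0.50}
print(round(musr_score(raw), 2))  # 23.33
```

Normalizing before averaging matters because a raw 50% means chance-level guessing on the two-way task but a clearly above-chance result on the five-way task.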
MuSR sits alongside several other reasoning evaluations but differs in format. BIG-Bench Hard gathers a set of short, varied tasks that earlier models found difficult, while GSM8K focuses on grade-school math word problems. GPQA targets expert knowledge with graduate-level science questions, and MMLU-Pro expands multiple-choice knowledge testing to ten options per question. Compared with these, MuSR's distinguishing features are the length of each item (roughly 1,000 words of narrative) and its reliance on unstated commonsense, which is why it pairs naturally with the others on the Open LLM Leaderboard rather than duplicating them.[3] Its emphasis on commonsense embedded in prose also connects it to the broader tradition of commonsense reasoning datasets, though MuSR adds the multistep, long-context dimension that short commonsense items usually lack.
The benchmark has also become a reference point for evaluating reasoning models, the class of systems that generate extended internal deliberation before answering, since MuSR's design specifically rewards models that can carry many intermediate conclusions through a long chain.
MuSR's reliance on a generation model is a double-edged feature. The same neurosymbolic pipeline that keeps stories solvable also means the prose, and some of the commonsense framing, is shaped by the language model used to write it, so the distribution of stories reflects that model's tendencies. The dataset is also modest in size, with 756 instances split across three domains, which limits fine-grained statistical comparisons between closely matched systems.[1][4]
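A rough calculation shows why the sample size matters. The sketch below uses the normal approximation to a binomial proportion, which is a simplifying assumption chosen for illustration and not an analysis from the paper. It estimates the 95% confidence half-width for an accuracy measured on one 250-item domain.

```python
import math


def ci_half_width(accuracy, n, z=1.96):
    """Approximate 95% confidence half-width for an accuracy measured on n items."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# GPT-4's reported 80.4% on the 250 murder mysteries.
print(round(100 * ci_half_width(0.804, 250), 1))  # 4.9
```

Under this approximation, two systems whose scores on a single domain differ by only a few points cannot be reliably told apart.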
The three domains, while varied, are narrow slices of reasoning, and strong performance on MuSR does not by itself demonstrate general reasoning ability. There is a further tension in any synthetic benchmark: holding the published test set fixed makes results comparable over time but reopens the door to contamination, whereas regenerating fresh items preserves difficulty but changes what is being measured from one release to the next. Finally, multiple-choice scoring captures whether a model lands on the right answer but not whether it reached that answer through sound reasoning, so a model can sometimes guess correctly for the wrong reasons. These constraints are reasons to read MuSR as one informative signal among several rather than as a single verdict on a model's reasoning.[1]