Marco-o1
Last reviewed
Jun 3, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,438 words
Marco-o1 is an open reasoning model released in November 2024 by the MarcoPolo team at Alibaba International Digital Commerce (AIDC). It was introduced in the paper "Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions" (arXiv:2411.14405) and is built on Qwen2-7B-Instruct. The project combines chain-of-thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), and a reflection mechanism, with an explicit goal of extending o1-style reasoning to open-ended problems that lack a single standard answer.[1][2]
The authors position Marco-o1 as exploratory rather than a finished frontier system. The paper carries a "Work in Progress" note stating that the model "primarily exhibits o1-like reasoning characteristics and its performance still fall short of a fully realized 'o1' model," and that the work is intended to shed light on an otherwise unclear technical roadmap for large reasoning models.[1]
Marco-o1 was published shortly after OpenAI released o1, a large language model trained to spend additional inference-time compute on internal reasoning before answering. Models of this type, sometimes called large reasoning models, performed well on tasks with verifiable answers such as competition mathematics and programming, where reinforcement learning can use a clear reward signal. The Marco-o1 authors framed their central question as whether such models can generalize to broader domains "where clear standards are absent and rewards are challenging to quantify," for example open-ended generation and translation.[1]
The name is derived from o1, and the paper acknowledges OpenAI's o1 as the inspiration. The model is associated with Alibaba's Qwen lineage only through its base checkpoint; it is a separate research effort from the MarcoPolo team rather than part of the main Qwen release series. The listed authors are Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, and Kaifu Zhang, all of the MarcoPolo Team at Alibaba International Digital Commerce.[1]
Marco-o1 is produced by supervised fine-tuning of Qwen2-7B-Instruct. The paper reports full-parameter fine-tuning on a mixture of three datasets totaling 60,266 samples:[1]
| Dataset | Samples |
|---|---|
| Open-O1 CoT Dataset (filtered) | 45,125 |
| Marco-o1 CoT Dataset (synthetic) | 10,000 |
| Marco Instruction Dataset | 5,141 |
| Total | 60,266 |
The Open-O1 CoT data comes from the community Open-O1 project and was refined with heuristic and quality filtering so the model would adopt structured reasoning patterns. The Marco-o1 CoT data is synthetic, generated using MCTS to produce longer and more complex reasoning chains. The Marco Instruction Dataset adds general instruction-following examples so the model retains broad task competence rather than overfitting to reasoning traces. The model fine-tuned on this mixture is referred to in the paper as Marco-o1-CoT.[1]
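As a quick check on the mixture, each dataset's share can be worked out from the sample counts in the table. The snippet below is only illustrative arithmetic; the function and variable names are hypothetical and do not come from the paper or its code.

```python
# Sample counts as reported in the table above.
dataset_sizes = {
    "Open-O1 CoT Dataset (filtered)": 45_125,
    "Marco-o1 CoT Dataset (synthetic)": 10_000,
    "Marco Instruction Dataset": 5_141,
}


def mixture_shares(sizes):
    """Return each dataset's fraction of the combined sample count."""
    combined = sum(sizes.values())
    return {name: count / combined for name, count in sizes.items()}


total = sum(dataset_sizes.values())
shares = mixture_shares(dataset_sizes)

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"Total samples: {total}")
```

On these counts the filtered Open-O1 data accounts for roughly three quarters of the fine-tuning mixture, with the synthetic MCTS-generated data and the instruction data making up the remainder.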
The full Marco-o1 system layers two further components on top of the fine-tuned model: solution-space expansion through MCTS, and a set of reasoning action strategies including reflection.[1]
In the MCTS formulation, each node represents a reasoning state and the actions available from a node are candidate continuations generated by the language model. During rollout the model continues reasoning to a terminal state, and a reward score is used to select promising paths. Rather than training a separate reward model, the paper derives the reward from the model's own token confidence. For each token in a rollout, a confidence score is computed by applying a softmax over the log probability of the chosen token and the log probabilities of its top five alternatives:
$$c_i = \frac{\exp(p(t_i))}{\sum_{k=1}^{5} \exp(p(t_k))}$$
The per-token confidences are averaged across the rollout to give an overall value v for that path, where a higher v indicates a more confident and presumably more reliable reasoning chain. This value guides selection within the tree.[1]
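The confidence-based reward described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the example probabilities are hypothetical, and the sketch assumes the denominator runs over the log probabilities of the five top-ranked candidates at each position, with the chosen token among them in the example data.

```python
import math


def token_confidence(chosen_logprob, top5_logprobs):
    """Return c_i = exp(p(t_i)) / sum_k exp(p(t_k)) for one generated token."""
    denominator = sum(math.exp(lp) for lp in top5_logprobs)
    return math.exp(chosen_logprob) / denominator


def rollout_value(token_records):
    """Average the per-token confidences of a rollout to get its value v."""
    if not token_records:
        raise ValueError("a rollout needs at least one token")
    confidences = [token_confidence(chosen, top5) for chosen, top5 in token_records]
    return sum(confidences) / len(confidences)


# Hypothetical two-token rollout (made-up probabilities):
# each record is (chosen log prob, log probs of the five candidates).
rollout = [
    (math.log(0.60), [math.log(p) for p in (0.60, 0.20, 0.10, 0.05, 0.03)]),
    (math.log(0.30), [math.log(p) for p in (0.40, 0.30, 0.10, 0.10, 0.05)]),
]

v = rollout_value(rollout)
print(f"v = {v:.3f}")
```

In this toy rollout the second token is not the top-ranked candidate, so its confidence is lower and pulls the path value down. That is the behaviour the reward relies on to prefer paths the model is more certain about.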
A distinctive part of the work is varying the granularity of the actions used inside MCTS. The paper observes that treating a full reasoning step as a single action is relatively coarse and can cause the search to miss finer reasoning paths. To address this, the authors experiment with smaller units called mini-steps, defined as fixed spans of 32 or 64 tokens, so the search tree can branch at a finer resolution. Token-level search is noted as the theoretical maximum granularity but is treated as impractical given the compute cost and the difficulty of designing a good reward at that level. The three configurations evaluated are Marco-o1-MCTS (step), Marco-o1-MCTS (mini-step of 64 tokens), and Marco-o1-MCTS (mini-step of 32 tokens).[1]
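To make the granularity difference concrete, the sketch below cuts a token sequence into fixed-size mini-steps and counts the resulting segments. It is only an illustration of the idea: the helper name is hypothetical, the handling of a shorter final span is an assumption, and in an actual search the model would generate each mini-step incrementally rather than split a finished sequence.

```python
def split_into_mini_steps(token_ids, span):
    """Split a token sequence into consecutive spans of at most `span` tokens."""
    if span <= 0:
        raise ValueError("span must be a positive number of tokens")
    return [token_ids[i:i + span] for i in range(0, len(token_ids), span)]


# Hypothetical reasoning trace of 150 token ids.
trace = list(range(150))

for span in (64, 32):
    mini_steps = split_into_mini_steps(trace, span)
    lengths = [len(step) for step in mini_steps]
    print(f"span={span}: {len(mini_steps)} mini-steps with lengths {lengths}")
```

Halving the span from 64 to 32 tokens gives the tree more points at which it can branch over the same trace, which is the trade-off the paper explores: finer resolution at the cost of more search nodes to score.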
The reflection mechanism appends the phrase "Wait! Maybe I made some mistakes! I need to rethink from scratch." at the end of a thought process, prompting the model to re-examine and revise its reasoning. The paper reports that this self-critique step helps most on harder problems the base model initially gets wrong, and that with reflection added "approximately half of these challenging problems are answered correctly."[1]
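Mechanically, the reflection step amounts to appending a fixed string to the model's reasoning and letting it continue generating. The sketch below shows only that string manipulation; the function name, the newline separator, and the example trace are assumptions, and the call that would send the extended text back to the model is left out.

```python
# Reflection phrase quoted from the paper.
REFLECTION_PROMPT = "Wait! Maybe I made some mistakes! I need to rethink from scratch."


def add_reflection(thought_process: str) -> str:
    """Append the reflection phrase to the end of a reasoning trace.

    The newline separator is an assumption made for this sketch.
    """
    return thought_process.rstrip() + "\n" + REFLECTION_PROMPT


# Hypothetical reasoning trace used purely for illustration.
trace = "Step 1: 12 apples minus 5 eaten leaves 8 apples."
extended = add_reflection(trace)
print(extended)
```

The extended text would then serve as the prefix for further generation, giving the model a chance to notice the arithmetic slip in the example trace and revise it.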
Marco-o1 was evaluated on MGSM (Multilingual Grade School Math), using the English (En) and Chinese (Zh) subsets, with a CoT prompt applied at test time for consistency. The headline result reported in the abstract is an accuracy improvement of +6.17% on MGSM English and +5.60% on MGSM Chinese over the Qwen2-7B-Instruct baseline, both measured in absolute percentage points.[1] The per-configuration accuracies are:[1]
| Model | MGSM-En | MGSM-Zh |
|---|---|---|
| Qwen2-7B-Instruct | 84.23% | 76.80% |
| Marco-o1-CoT | 85.60% | 71.20% |
| Marco-o1-MCTS (step) | 90.40% | 80.00% |
| Marco-o1-MCTS (mini-step of 64 tokens) | 88.40% | 80.40% |
| Marco-o1-MCTS (mini-step of 32 tokens) | 87.60% | 82.40% |
The paper draws two cautious conclusions from these numbers. First, all three MCTS-enhanced variants improve over Marco-o1-CoT, indicating that expanding the solution space helps. Second, no single action granularity is uniformly best: the step-level strategy is strongest on MGSM English, while the 32-token mini-step is strongest on MGSM Chinese, and the authors attribute some of this variation to randomness introduced by using a confidence score as the reward. They explicitly state they "cannot draw definitive conclusions about which action strategy is superior." Notably, the CoT-only Marco-o1-CoT regresses on MGSM Chinese (71.20% versus the 76.80% baseline), which the paper attributes to the English-language CoT fine-tuning data not transferring well to Chinese.[1]
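Both observations can be checked directly against the table. The snippet below uses only the Marco-o1-CoT and MCTS rows as reported above; the dictionary layout and variable names are hypothetical.

```python
# Accuracies (%) copied from the table above.
cot = {"en": 85.60, "zh": 71.20}
mcts = {
    "step": {"en": 90.40, "zh": 80.00},
    "mini-step 64": {"en": 88.40, "zh": 80.40},
    "mini-step 32": {"en": 87.60, "zh": 82.40},
}

# Gain of each MCTS variant over Marco-o1-CoT, in percentage points.
gains = {
    name: {lang: round(acc[lang] - cot[lang], 2) for lang in cot}
    for name, acc in mcts.items()
}

# Best-performing action granularity for each language.
best = {lang: max(mcts, key=lambda name: mcts[name][lang]) for lang in cot}

for name, gain in gains.items():
    print(f"{name}: +{gain['en']:.2f} (En), +{gain['zh']:.2f} (Zh)")
print(f"Best granularity: {best}")
```

Every gain is positive, and the best granularity differs between the two languages, which matches the paper's reluctance to name a single superior action strategy.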
The authors also report Test@N results (the share of problems solved correctly at least once across N independent attempts) at Test@1, Test@8, and Test@32, observing that MCTS shows its largest advantage at Test@1, that is, when only a single attempt is allowed.[1]
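The Test@N metric as defined here is straightforward to compute from per-attempt correctness flags. The following is a minimal sketch under that definition; the function name and the attempt data are made up for illustration and are not results from the paper.

```python
def solved_at_n(attempt_results, n):
    """Share of problems with at least one correct answer in the first n attempts.

    `attempt_results` holds one list of booleans per problem, one flag per attempt.
    """
    if not attempt_results:
        raise ValueError("need at least one problem")
    solved = sum(1 for attempts in attempt_results if any(attempts[:n]))
    return solved / len(attempt_results)


# Hypothetical outcomes: four problems, four independent attempts each.
results = [
    [True, True, False, True],
    [False, False, True, False],
    [False, False, False, False],
    [False, True, True, True],
]

for n in (1, 2, 4):
    print(f"Test@{n} = {solved_at_n(results, n):.2f}")
```

Because a problem counts as solved once any attempt succeeds, the metric can only stay level or rise as N grows, so differences between methods tend to narrow at larger N. That is consistent with the largest MCTS advantage appearing at Test@1.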
Beyond standard-answer math, the paper presents a machine-translation case study, described as among the first applications of a large reasoning model to translation and to inference-time scaling in the multilingual and translation domain. The example translates the colloquial Chinese sentence "这个鞋拥有踩屎感,很舒服,推荐购买" into English. The phrase "踩屎感" literally means a "stepping-on-poop sensation" but is positive slang in Chinese e-commerce describing a soft, cushioned sole. The paper shows the model reasoning through the literal meaning, recognizing it as crude in English, and rendering it as "This shoe has a comfortable sole and is highly recommended for purchase," which the authors contrast favorably with a standard tool such as Google Translate.[1] The translation focus is consistent with the team's e-commerce remit at Alibaba International Digital Commerce.
The paper is candid that the reward signal is a source of noise and lists improving it through outcome reward modeling and process reward modeling, and exploring reinforcement learning, as future work rather than completed results. In one illustrative case (the "how many r's in strawberry" question shown in the paper), the model returns the correct answer but its trace does not explicitly account for every letter, which the authors note rather than present as a clean success.[1] Coverage in the technical press in late November 2024 generally described Marco-o1 as an early, openly released attempt to reproduce o1-style reasoning, useful as a research artifact and for translation and reasoning experiments, while noting it was not competitive with the strongest closed reasoning systems.[3][4]
Marco-o1 is released under the Apache License 2.0. The model weights are distributed on Hugging Face as AIDC-AI/Marco-o1, and code plus a portion of the data are provided in the AIDC-AI/Marco-o1 GitHub repository.[2][5] The Hugging Face card includes a standard disclaimer that automated compliance checks were applied during training but that the model cannot be guaranteed to be entirely free of copyright or other issues.[5]