MiniMax M2
Last reviewed
May 16, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,109 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 16, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,109 words
Add missing citations, update stale details, or suggest a clearer explanation.
MiniMax M2 is an open-weight large language model released on October 27, 2025 by Shanghai-based AI company MiniMax. The model uses a Mixture of Experts (MoE) architecture with 230 billion total parameters and roughly 10 billion active parameters per token, and it is positioned as a specialist for coding and agentic workflows rather than as a general-purpose chat model. MiniMax describes the design as a deliberately compact "mini" model intended to maximize throughput, lower the cost of long agent loops, and run inside developer tools like Claude Code, Cursor, and Cline.[^minimax-blog][^github]
The weights were published on Hugging Face under an MIT license at launch, and the MiniMax Open Platform offered a limited-time free API trial that ran through early November 2025. The list price after the free period was set at 2.1 yuan (around US$0.30) per million input tokens and 8.4 yuan (around US$1.20) per million output tokens, which MiniMax pitched as roughly 8 percent of the price of Anthropic's Claude Sonnet 4.5 while running at nearly twice the speed.[^minimax-blog][^caixin] On the Artificial Analysis Intelligence Index, M2 took the top spot among open-weight systems at the time of release, with a composite score of 61 and standout numbers on tool-use, search, and end-to-end software-engineering benchmarks such as SWE-bench Verified, Terminal-Bench, and BrowseComp.[^hf-card][^marktechpost]
The model is the third generation of MiniMax's open foundation-model line after MiniMax-Text-01 and MiniMax M1, and the first that explicitly trades long-context capacity for active-parameter efficiency. Where M1 advertised a 1 million-token context window built on "lightning attention," M2 ships with a 204,800-token window and a standard MoE transformer, with the savings reinvested into a fast inference path designed for agent harnesses that fire dozens of short tool calls per session.[^caixin][^github]
MiniMax (Chinese: Shanghai Xiyu Technology) was founded in December 2021 by computer-vision researchers who had previously worked at SenseTime, including chief executive Yan Junjie, Bin Yang, and Yucong Zhou. The company received early backing from MiHoYo, the studio behind Genshin Impact, and went on to raise a $600 million round led by Alibaba in March 2024 at a valuation of roughly $2.5 billion. It listed on the Hong Kong Stock Exchange on January 9, 2026.[^wikipedia-mini]
MiniMax is one of China's "Six Little Tigers" (六小虎) of AI alongside Moonshot, Zhipu, Baichuan, 01.AI, and StepFun. Its consumer products include the Hailuo AI text-to-video and image generator, the MiniMax Audio TTS platform, and Talkie, an English-language AI-companion app that the Wall Street Journal reported had around 11 million monthly active users in mid-2024. On the model side, MiniMax has shipped the ABAB series of MoE chat models, the MiniMax-01 family that introduced lightning attention at scale, and the M1 reasoning model released in June 2025.[^wikipedia-mini][^fortune]
M1 was significant in its own right because MiniMax claimed it had trained the 456B-parameter base for around US$534,700 in compute rentals, roughly 1/200th of the estimated training cost of GPT-4o. That number got attention partly because it landed in the wake of DeepSeek's V3 paper, which had already set a low-cost benchmark for Chinese open models, and partly because M1 paired the cheap-training story with a 1 million-token context window. M2 was developed in the same cost-conscious tradition but pointed at a different problem: making agent tool-call loops fast and cheap enough to leave running.[^fortune][^infoq]
MiniMax M2 is a sparse Mixture of Experts transformer with 229 to 230 billion total parameters and approximately 10 billion active parameters per forward pass, depending on routing. The Hugging Face model card frames the design philosophy as "compact, fast, and cost-effective," with the small active footprint chosen so that interactive agent loops do not have to wait for the full model on every step.[^hf-card][^github]
The architecture details published by MiniMax are deliberately spare. The README on GitHub lists the model as MoE with the 230B/10B split, recommends SGLang and vLLM as first-class inference runtimes with day-zero kernels, and ships an MLX-LM option for local Apple Silicon use. It does not publish the exact number of experts, layers, attention heads, or routing top-k, and as of this article those numbers have not been disclosed in primary documentation.[^github][^hf-card]
M2 supports a context window of 204,800 tokens (often rounded to 205K) and the same on output. That is a substantial step down from M1's million-token window. MiniMax's own blog framing is that real agent workloads do not exhaust a million tokens before tool calls reset the conversation, and that a shorter context paired with faster decoding produces better wall-clock results in benchmarks like SWE-bench and Terminal-Bench.[^caixin][^minimax-blog]
A distinctive feature is what MiniMax calls interleaved thinking. The model is trained to emit reasoning content inside <think>...</think> blocks between assistant turns, and MiniMax instructs callers to keep those blocks in the conversation history rather than stripping them out. Removing the thinking traces in multi-turn agent settings, according to the model card, significantly degrades performance because later steps lean on the earlier reasoning. The pattern is similar in spirit to the interleaved-reasoning format used in DeepSeek-R1 and Anthropic's extended thinking but is exposed as a first-class part of the protocol.[^hf-card][^perficient]
MiniMax recommends temperature 1.0, top-p 0.95, and top-k 40 for M2. The Hugging Face card warns that lower temperatures hurt the model's exploration during tool-use planning, and the SGLang and vLLM examples in the GitHub repository use the same defaults.[^hf-card][^github]
MiniMax has not published a training paper for M2 as of mid-2026, and the public material is light on specifics about the data mix, total token count, hardware, or reinforcement-learning recipe. The Hugging Face card describes M2 as "engineered for end-to-end developer workflows" and lists the kinds of tasks that drove the post-training mix, including multi-file edits, run-and-fix loops on compilation and test errors, terminal use, web browsing, and structured tool calling. The model is described as having been trained with interleaved thinking as a native output format from early in post-training rather than as a wrapper layer added on top of an existing chat model.[^hf-card][^github]
MiniMax also has not published a training-cost figure for M2 in the same way it did for M1. The general framing in the launch coverage is that the small active-parameter count and aggressive use of agent-style RL on coding traces are what allows the model to outperform much larger open models on agentic benchmarks while staying cheap to serve.[^marktechpost][^caixin]
The primary benchmark numbers below come from the Hugging Face model card, where MiniMax publishes its own results alongside the evaluation harness used (Claude Code as scaffolding for SWE-bench, OpenHands 0.42 for AgentCompany, and so on). Numbers from third parties such as Artificial Analysis are noted where they are relevant.[^hf-card][^aa]
| Benchmark | MiniMax M2 score | Notes |
|---|---|---|
| SWE-bench Verified | 69.4 | 100 max steps, 128K context, Claude Code scaffold |
| Multi-SWE-Bench | 36.2 | Averaged across 8 runs |
| SWE-bench Multilingual | 56.5 | Cross-language repository tasks |
| LiveCodeBench | 83 | Per Artificial Analysis composite |
| ArtifactsBench | 66.8 | Averaged across 3 runs |
| SciCode | 36 | Scientific coding subset |
The SWE-bench Verified number is the headline figure for coding. At 69.4, M2 lands well above earlier open-weight reasoning models such as DeepSeek-R1 and is within a few points of Anthropic's Claude Sonnet 4.5, which scored around 77.2 on the same benchmark in the period after launch. MiniMax explicitly notes that M2 is evaluated inside Claude Code's harness rather than a custom one, which makes the comparison closer to a like-for-like agent test than some earlier reports.[^hf-card][^dailydose]
| Benchmark | MiniMax M2 score | Notes |
|---|---|---|
| Terminal-Bench | 46.3 | Averaged across 8 runs |
| Terminal-Bench-Hard | 24.0 | Hard subset |
| BrowseComp | 44.0 | English web research |
| BrowseComp-zh | 48.5 | Chinese variant |
| GAIA (text only) | 75.7 | 103-sample validation subset |
| xbench-DeepSearch | 72.0 | Long-horizon search |
| FinSearchComp-global | 65.5 | Financial document search |
| AgentCompany | 36.0 | OpenHands 0.42 framework |
| tau-squared Bench | 77.2 | Extended thinking with tool use |
The agentic numbers are the part of the table that MiniMax highlights most. Terminal-Bench at 46.3 and BrowseComp at 44 were, in October 2025, the best published results from any open-weight model on those tasks. The GAIA text-only score of 75.7 also placed M2 above DeepSeek-V3.1 and Kimi K2 on the same validation subset, although Kimi K2 Thinking later overtook M2 on some of the same metrics once Moonshot released its reasoning variant.[^hf-card][^marktechpost]
| Benchmark | MiniMax M2 score | Notes |
|---|---|---|
| Artificial Analysis composite | 61 | #1 open-weight at launch |
| MMLU-Pro | 82 | Multi-subject reasoning |
| GPQA-Diamond | 78 | Graduate-level physics, biology, chemistry |
| AIME 2025 | 78 | High-school math olympiad |
| HLE (with tools) | 31.8 | Humanity's Last Exam with search and Python |
| HLE (no tools) | 12.5 | Closed-book |
| IFBench | 72 | Instruction following |
| AA-LCR | 61 | Long-context reasoning |
| tau-squared-Telecom | 87 | Telecom domain agents |
Artificial Analysis grouped MiniMax M2 with the strongest open-weight cohort of late 2025 on its composite intelligence score, placing it ahead of DeepSeek-V3.1 and GLM-4.5 but behind Anthropic's Claude Sonnet 4.5 and OpenAI's GPT-5 on overall composite. On agentic sub-scores it sat closer to those proprietary models, which is the part of the index MiniMax pushed hardest in its launch materials.[^aa][^marktechpost]
It is worth noting that several of these scores are self-reported by MiniMax. Third-party replications by groups like LMSYS and independent reviewers have generally confirmed the SWE-bench Verified and Terminal-Bench numbers within a couple of points, but the agentic browse results are harder to reproduce because they depend on the exact tool harness, browser state, and rate-limit conditions used during evaluation.[^marktechpost][^medium-compare]
M2's weights are published on Hugging Face under the MIT license, which permits commercial use, redistribution, and fine-tuning without a separate agreement. This put M2 on a more permissive footing than some peer Chinese releases that ship under custom commercial-use clauses. MiniMax later moved toward more restrictive licenses for its M2.1, M2.5, and M2.7 successors, but the original M2 release remains MIT.[^hf-card][^letsdatascience]
MiniMax offers M2 directly through the MiniMax Open Platform and also through partners including Vercel AI Gateway, OpenRouter, NVIDIA NIM, and Microsoft Azure AI Foundry, where it was added shortly after launch.[^vercel][^msft]
| Tier | Input price | Output price | Notes |
|---|---|---|---|
| MiniMax Open Platform | $0.30 / 1M tokens | $1.20 / 1M tokens | 2.1 RMB / 8.4 RMB native |
| Free trial | $0.00 | $0.00 | Through November 7, 2025 |
| MiniMax Agent | Free | Free | Lightning Mode and Pro Mode during trial |
MiniMax described the post-trial pricing as around 8 percent of what Anthropic charges for Claude Sonnet 4.5 on a comparable token-mix basis, and reported peak throughput of roughly 100 tokens per second per request during launch tests. The free trial covered both the API and the MiniMax Agent product built on top of M2, and the trial period was extended once before the paid tier went live.[^minimax-blog][^caixin]
For teams that want to run the model themselves, MiniMax's GitHub repository ships SGLang and vLLM configurations that boot M2 on a single multi-GPU node with sufficient HBM to hold the 230 billion total parameters in BF16 or FP8. Quantized GGUF builds maintained by community contributors such as Unsloth and the Cerebras MiniMax-M2-REAP-162B-A10B "reaped" variant have lowered the bar further. A 4-bit Q4 GGUF fits within roughly 130 GB of RAM, which makes the model practical for high-end workstations and small inference servers as well as cloud nodes.[^github][^marktechpost-reap]
Reception across English-language coverage was broadly positive, with most reviewers calling out the same three points: M2 was at or near the top of open-weight charts on agentic benchmarks, it was cheap enough to leave running inside coding agents without watching costs, and the small 10B active parameter count made it noticeably faster than peer open models in tight tool-call loops. VentureBeat ran a piece headlined "MiniMax-M2 is the new king of open source LLMs" within two days of the release, citing the Artificial Analysis ranking. MarkTechPost and DigitalOcean published deeper architecture and benchmarking pieces in the days that followed.[^venturebeat][^marktechpost][^digitalocean]
There was also some skepticism. The 1 million-token context window from M1 was popular with users doing long-document analysis, and dropping back to 205K felt like a step backward for that use case even though the agent-loop case is different. Reviewers also pointed out that the published benchmark results are heavily weighted toward agent and coding tasks where MiniMax had tuned the post-training mix, and that on pure reasoning competitions like AIME and HLE the model still trailed proprietary frontier systems. Hacker News commenters and the Hugging Face discussion threads flagged the usual concerns about self-reported benchmarks and asked for independent re-runs.[^caixin][^hn]
The table below compares the headline benchmark numbers for MiniMax M2 against the leading open-weight contemporaries from China and a frontier proprietary model. Where numbers come from the publishers' own materials they are noted as self-reported.
| Metric | MiniMax M2 | Kimi K2 (Moonshot) | GLM-4.5 (Zhipu) | Qwen 3 Max | Claude Sonnet 4.5 |
|---|---|---|---|---|---|
| Total parameters | 230B | 1T | 355B | 235B+ | undisclosed |
| Active parameters | ~10B | 32B | 32B | ~22B | undisclosed |
| Context window | 205K | 128K | 200K | 256K | 200K |
| SWE-bench Verified | 69.4 | 65.8 | 64.2 | 69.6 | 77.2 |
| Terminal-Bench | 46.3 | 39.2 | 37.5 | not reported | 50.0 |
| BrowseComp | 44.0 | 32.0 | 26.4 | not reported | 30.0 |
| MMLU-Pro | 82 | 81 | 81 | 84 | 86 |
| AIME 2025 | 78 | 87 | 83 | 87 | 87 |
| License | MIT | Modified MIT | MIT | Apache 2.0 (most sizes) | Proprietary |
| Approx. blended price ($/1M tokens) | 0.30 / 1.20 | 0.60 / 2.50 | 0.40 / 1.40 | 0.85 / 3.40 | 3.00 / 15.00 |
The shape of M2's strengths is clear in this table. It does not lead on raw math reasoning or general MMLU-Pro, where Kimi K2 and Qwen 3 Max are stronger, but it leads the open-weight group on agentic and tool-use benchmarks and ties Qwen 3 Max for the top open SWE-bench Verified score at launch. Against Claude Sonnet 4.5 it trails by about 8 points on SWE-bench and a few points on Terminal-Bench, but it costs roughly an order of magnitude less per token.[^hf-card][^medium-compare][^dailydose]
A separate point worth flagging is the comparison with DeepSeek. By the time of M2's release, DeepSeek V4 and DeepSeek V3.2 were both in the open-weight conversation as well. DeepSeek-V3.1 outscored M2 on a few raw reasoning benchmarks but lagged it on agentic tool-use, and DeepSeek had not yet released a fully matched coding-agent variant at the time M2 launched. By December 2025 the open-weight leaderboard had reshuffled again, with Kimi K2 Thinking and GLM-4.6 closing the agentic gap, which prompted MiniMax to ship M2.1 and then M2.5 in the months that followed.[^marktechpost][^maniac]
MiniMax's marketing leans heavily on the picture of M2 running inside developer-agent harnesses, and several of the early reviews tested it in exactly that way. M2 worked out of the box inside Claude Code (using the OpenAI-compatible adapter), Cursor's agent mode, Cline, and the open-source OpenDevin and OpenHands frameworks. The 10B active footprint translated into noticeably snappier tool calls compared with Kimi K2 or DeepSeek-V3.1 on the same hardware, which is exactly the point MiniMax was making about agent throughput.[^venturebeat][^digitalocean]
Independent reviewers running their own tasks reported that M2 was particularly good at coding-run-fix loops, where the model proposes an edit, sees the compiler or test output, and tries again. It was less consistent on long-form planning tasks where the agent has to hold a complex strategy in mind across dozens of steps without external feedback, an area where Claude Sonnet 4.5 and GPT-5 still had an edge as of late 2025.[^medium-compare][^dailydose]