MiniMax M1
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,098 words
MiniMax M1 (stylised MiniMax-M1) is an open-weight large language reasoning model developed by Shanghai-based artificial-intelligence company MiniMax, released on 16 June 2025 alongside an accompanying technical report on arXiv.[1][2] At launch MiniMax described the model as "the world's first open-weight, large-scale hybrid-attention reasoning model," combining a Mixture-of-Experts (MoE) backbone with the company's proprietary Lightning Attention linear-attention variant.[1][2] M1 contains 456 billion total parameters with 45.9 billion activated per token, natively supports a one-million-token input context window, and can emit up to 80,000 thinking-and-output tokens per response, both figures matching or exceeding those of any other open-weight model available at the time of release.[1][2][3]
The release attracted unusually broad attention because of three intertwined claims. First, M1 is distributed under the permissive Apache 2.0 license, allowing unrestricted commercial use, modification, and redistribution.[3][4] Second, MiniMax reported that the full reinforcement-learning (RL) phase used just 512 NVIDIA H800 GPUs over three weeks at a rental cost of US$534,700, an order of magnitude cheaper than the widely cited US$5–6 million figure for DeepSeek-R1 and roughly 0.5 % of GPT-4-era estimates.[1][2][5][6] Third, the team introduced a new RL algorithm called CISPO (Clipped Importance Sampling Policy Optimization) which it said converged about twice as fast as ByteDance's DAPO and outperformed the GRPO used in early DeepSeek-R1 training.[1][2][7]
Two variants were released simultaneously, MiniMax-M1-40k and MiniMax-M1-80k, differentiated by their maximum "thinking budgets" of 40,000 and 80,000 tokens respectively, with the 40k checkpoint representing an intermediate phase of the 80k training run.[1][2] Independent benchmarking placed M1 ahead of all other open-weight models on long-context reasoning and agentic τ-bench tool use, and found it competitive with closed proprietary systems such as Gemini 2.5 Pro, OpenAI o3, and Claude 4 Opus on those specific axes, while trailing the strongest proprietary and open competitors on standalone mathematics and pure-coding evaluations.[2][7][8][9] M1 was followed in late October 2025 by MiniMax M2, a smaller and more agent-focused successor.
MiniMax (officially MiniMax AI; Chinese: 稀宇科技) was founded in Shanghai in December 2021 by SenseTime alumni Yan Junjie (CEO), Yang Bin, and Zhou Yucong; its name is borrowed from the classical minimax algorithm of game theory.[10] The company became known for the international AI-companion app Talkie (2023), its Chinese-market sibling Xing Ye, and the Hailuo AI multimodal text-image-video-audio platform launched in March 2024.[10] In its early phase, MiniMax used proprietary "abab" foundation models; it transitioned to a self-styled "MiniMax-01" series in early 2025 with the open-weight release of MiniMax-Text-01 (a 456B-parameter MoE foundation model introduced in a January 2025 technical report) and the corresponding multimodal MiniMax-VL-01.[11]
By the time of the M1 launch, MiniMax had raised roughly US$850 million across multiple rounds, including a Series A of about US$250 million led by Tencent in 2023 and a Series B of approximately US$600 million in March 2024 led by Alibaba Group that valued the firm at around US$2.5 billion.[10][12] Other backers include HongShan (formerly Sequoia China), IDG Capital, Hillhouse Investment, and the videogame studio MiHoYo.[10] MiniMax filed for a Hong Kong initial public offering during 2025 and ultimately listed on the Hong Kong Stock Exchange in early 2026, with Alibaba and Abu Dhabi's sovereign wealth fund taking cornerstone positions; the offering reportedly raised about US$619 million at a valuation near US$6.5 billion.[13][14] MiniMax operates internationally through a Singapore-registered entity and is generally grouped with DeepSeek, Zhipu, Moonshot, Baichuan, and 01.AI as one of China's "AI Tigers."[10][15]
M1 is not a clean-room model. It is built by continual pretraining and reinforcement-learning fine-tuning on top of MiniMax-Text-01, the company's January 2025 foundation model.[1][2] MiniMax-Text-01 was the company's first model based on Lightning Attention at scale: a 456 B-parameter MoE design comprising 32 experts with top-2 routing per token, 45.9 B active parameters, and 80 transformer layers in which one standard softmax-attention block follows every seven TransNormer Lightning-Attention blocks (a 7:1 ratio).[11][16] That architecture was designed specifically to make a one-million-token training context tractable; MiniMax claimed inference extrapolation up to four million tokens.[11][16]
For M1, MiniMax continued pretraining MiniMax-Text-01 on an additional 7.5 trillion tokens of a reasoning-intensive corpus weighted roughly 70 % STEM, code, books, and reasoning material; the learning-rate schedule held a constant 8 × 10⁻⁵ for the first 2.5 T tokens and then decayed over 5 T tokens to 8 × 10⁻⁶.[2] After this continual-pretraining stage, the team applied supervised fine-tuning followed by the large-scale RL run that produced both M1-40k and M1-80k.[2]
MiniMax-M1 keeps the underlying MiniMax-Text-01 architecture unchanged: the model retains the 7:1 hybrid pattern of Lightning Attention blocks (TransNormer blocks using a linearised attention variant) interleaved with standard softmax-attention blocks, the 32-expert MoE feed-forward stacks, and the 456 B / 45.9 B parameter counts.[1][2][11] What changes in M1 is the training data, the RL objective, and the inference-time behaviour optimised for very long chains of thought.
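The 7:1 interleaving can be made concrete with a short sketch. It only enumerates block types for the 80 layers described above; the function name and the string labels are illustrative and are not taken from MiniMax's code.

```python
def hybrid_layer_pattern(num_layers=80, linear_per_softmax=7):
    """Return one block-type label per layer.

    Each run of `linear_per_softmax` linear-attention blocks is followed
    by a single softmax-attention block, as the text describes.
    """
    period = linear_per_softmax + 1
    return [
        "softmax" if (i + 1) % period == 0 else "lightning"
        for i in range(num_layers)
    ]


pattern = hybrid_layer_pattern()
counts = {kind: pattern.count(kind) for kind in ("lightning", "softmax")}
print(counts)       # {'lightning': 70, 'softmax': 10}
print(pattern[:8])  # seven 'lightning' labels, then 'softmax'
```

Under this reading, only 10 of the 80 layers pay the quadratic cost of full attention.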
Lightning Attention is an I/O-aware, hardware-efficient implementation of a linear-attention variant that drops the softmax so that the attention product can be reordered into a running sequence of matrix multiplications, allowing computation cost to scale roughly linearly in sequence length rather than quadratically.[2][17] In practice the MiniMax team reports that this drives dramatic FLOPs savings as generation length grows: at a 100 K-token generation length, M1 reportedly uses only about 25 % of the floating-point operations DeepSeek-R1 requires for the same task, and under 50 % at 64 K tokens.[1][2][17] The hybrid pattern – keeping one softmax block per seven linear blocks – is intended to preserve the long-range information-retrieval behaviour of full attention while letting Lightning Attention carry the bulk of token-level processing.[11][16][17]
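The linear-versus-quadratic scaling can be illustrated with a generic, un-normalised causal linear attention. This is a minimal sketch of the general technique, not MiniMax's kernel: it omits any feature map, decay, normalisation, or I/O-aware tiling a production implementation may use, and all function names are illustrative. The point is that the masked quadratic form and a recurrent form with a fixed-size running state give the same outputs.

```python
import math
import random


def quadratic_causal(Q, K, V):
    """O(n^2) form: o_t = sum over s <= t of (q_t . k_s) * v_s."""
    d_v = len(V[0])
    out = []
    for t in range(len(Q)):
        o = [0.0] * d_v
        for s in range(t + 1):
            score = sum(qi * ki for qi, ki in zip(Q[t], K[s]))
            for j in range(d_v):
                o[j] += score * V[s][j]
        out.append(o)
    return out


def linear_causal(Q, K, V):
    """O(n) form: maintain a d_k x d_v state S = sum of outer(k_s, v_s)."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += k[i] * v[j]  # fold the new token into the state
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


rng = random.Random(0)
n, d_k, d_v = 6, 3, 2
Q = [[rng.uniform(-1, 1) for _ in range(d_k)] for _ in range(n)]
K = [[rng.uniform(-1, 1) for _ in range(d_k)] for _ in range(n)]
V = [[rng.uniform(-1, 1) for _ in range(d_v)] for _ in range(n)]

same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(quadratic_causal(Q, K, V), linear_causal(Q, K, V))
    for a, b in zip(row_a, row_b)
)
print(same)  # True
```

The recurrent form does a constant amount of work per token, because the state `S` never grows with sequence length. That property is what makes very long generations cheaper than under softmax attention.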
A vLLM engineering write-up published two weeks after the model release reported that, with Lightning Attention deployed through vLLM's Triton kernels and combined with PagedAttention, memory usage on a 100 K-token code-completion task dropped by roughly 83 % and end-to-end latency by 67 % relative to a standard softmax-only baseline.[17]
The 456 B / 45.9 B parameter ratio comes from M1's Mixture-of-Experts feed-forward layers, which use 32 experts and a top-2 routing strategy inherited unchanged from MiniMax-Text-01.[11] Approximately 10 % of total parameters are active for any given token, a ratio chosen to balance training cost with inference quality and broadly comparable to other large open-weight MoEs such as DeepSeek-V3.[11][17]
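A minimal sketch of top-2 routing over 32 experts follows, together with the active-parameter arithmetic. Renormalising the two selected logits with a softmax is an assumption made for illustration; the source does not describe MiniMax's gating function or any load-balancing terms, and `top2_route` is a hypothetical helper.

```python
import math


def top2_route(router_logits):
    """Pick the two highest-scoring experts and renormalise their weights."""
    ranked = sorted(
        range(len(router_logits)), key=lambda e: router_logits[e], reverse=True
    )
    chosen = ranked[:2]
    exps = [math.exp(router_logits[e]) for e in chosen]
    total = sum(exps)
    return [(expert, weight / total) for expert, weight in zip(chosen, exps)]


logits = [0.0] * 32            # one router score per expert
logits[5], logits[17] = 2.0, 1.0
routes = top2_route(logits)
print([expert for expert, _ in routes])  # [5, 17]

# Share of parameters active per token, from the figures quoted above.
active_fraction = 45.9 / 456
print(round(active_fraction * 100, 1))   # 10.1
```

Only the two selected experts run their feed-forward computation for a given token, so compute per token tracks the 45.9 B active parameters rather than the 456 B total.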
The model natively supports inputs of up to one million tokens, a window roughly eight times that of DeepSeek-R1 and matching the closed-source Gemini 2.5 Pro.[1][2] The "thinking budget" – effectively the maximum number of tokens the model may consume on internal chain-of-thought before producing a final answer – is set per checkpoint: 40,000 tokens for M1-40k and 80,000 tokens for M1-80k, both of which exceed the explicit reasoning budgets exposed by any other open-weight model as of June 2025.[1][2][3] Independent reviewers later observed that, despite the advertised 1 M-token context, MiniMax's hosted chat product imposed a stricter ceiling (one reviewer reported refusals beyond 500,000 characters of prompt) – a deployment-level limit distinct from the model's architectural ceiling.[18]
The first stage of M1 training was a 7.5-trillion-token continual-pretraining pass on top of MiniMax-Text-01's base PLM (not the instruction-tuned variant).[2] MiniMax described the data mixture as "reasoning-intensive," with roughly 70 % of tokens drawn from STEM, code, books, and explicit reasoning corpora; the team emphasised that the data came from natural sources rather than synthetic generation.[2] The learning-rate schedule first held flat at 8 × 10⁻⁵ for 2.5 T tokens, then decayed over 5 T tokens to 8 × 10⁻⁶.[2]
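The two-phase schedule can be written as a small function of tokens seen. The source gives the endpoints and the phase lengths but not the decay shape, so the linear decay below is an assumption; the constant and function names are illustrative.

```python
PEAK_LR = 8e-5
FINAL_LR = 8e-6
CONSTANT_TOKENS = 2.5e12  # flat phase
DECAY_TOKENS = 5.0e12     # decay phase


def learning_rate(tokens_seen):
    """Learning rate after `tokens_seen` tokens (linear decay is assumed)."""
    if tokens_seen <= CONSTANT_TOKENS:
        return PEAK_LR
    progress = min((tokens_seen - CONSTANT_TOKENS) / DECAY_TOKENS, 1.0)
    return PEAK_LR + progress * (FINAL_LR - PEAK_LR)


for tokens in (0, 2.5e12, 5.0e12, 7.5e12):
    print(f"{tokens:.1e} tokens -> lr {learning_rate(tokens):.2e}")
```

The two phases together account for the full 7.5 T tokens of continual pretraining (2.5 T + 5 T).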
After continual pretraining and supervised fine-tuning, M1 was trained with large-scale reinforcement learning across a deliberately heterogeneous mix of rule-verified and model-judged environments.[2][7]
Verifiable rule-based tasks dominated the mix.
Where rule-based verification was unavailable, the team used model-based feedback on around 25,000 samples covering instruction following and creative writing, with a learned reward model adjudicating quality.[2]
The most distinctive technical contribution of the M1 paper is CISPO (Clipped Importance Sampling Policy Optimization), an RL objective designed to address a specific failure mode that MiniMax identified in GRPO and DAPO when training reasoning models with long chains of thought.[2][7]
In conventional PPO-derived objectives such as GRPO, the importance-sampling ratio is multiplied into the per-token loss and is clipped at the token level; tokens whose ratios fall outside the trust region are effectively zeroed out. The MiniMax team argued that on long reasoning rollouts this disproportionately clips precisely the low-frequency but logically critical "pivot" tokens – words like however, wait, recheck, but, or aha that mark self-correction in chain-of-thought.[2][7] CISPO instead clips the importance-sampling weights themselves while preserving the gradient contribution of every token, formally $$\hat r_{i,t}(\theta) = \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\varepsilon^{\mathrm{IS}}_{\mathrm{low}},\, 1+\varepsilon^{\mathrm{IS}}_{\mathrm{high}}\big)$$ with the policy gradient computed against the stop-gradient of $\hat r_{i,t}$ multiplied by the group-relative advantage borrowed from GRPO and applied at the token level.[2][7] In a controlled ablation on AIME problems, MiniMax reported that CISPO matched DAPO's final performance using about half as many training steps and converged "roughly twice as fast."[1][2][7]
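The difference between the two clipping styles is easiest to see in the coefficient each one places on a token's log-probability gradient. The sketch below is a simplified reading of the description above, not MiniMax's implementation. The epsilon values are hypothetical, the advantage is a plain group-normalised reward, and the helper names are illustrative.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """GRPO-style advantage: reward normalised within its rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def ppo_clip_token_weight(logp_new, logp_old, advantage, eps):
    """Coefficient on grad log pi under PPO-style clipping.

    Once the ratio leaves the trust region in the direction the advantage
    pushes, the clipped branch is selected and the token's gradient is zero.
    """
    ratio = math.exp(logp_new - logp_old)
    outside = (advantage > 0 and ratio > 1.0 + eps) or (
        advantage < 0 and ratio < 1.0 - eps
    )
    return 0.0 if outside else ratio * advantage


def cispo_token_weight(logp_new, logp_old, advantage, eps_low, eps_high):
    """Coefficient on grad log pi under CISPO.

    The importance weight is clipped and treated as a constant
    (stop-gradient), so every token keeps a non-zero gradient.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return clipped * advantage


advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
adv = advantages[0]  # a rewarded rollout, advantage close to +1

# A rare "pivot" token whose probability rose from 0.02 to 0.05 (ratio 2.5).
logp_old, logp_new = math.log(0.02), math.log(0.05)

ppo_w = ppo_clip_token_weight(logp_new, logp_old, adv, eps=0.2)
cispo_w = cispo_token_weight(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.2)
print(ppo_w)              # 0.0, the token is dropped from the update
print(round(cispo_w, 3))  # 1.2, the token still contributes, with a capped weight
```

In this toy case the PPO-style rule discards the token entirely, while CISPO keeps it with its weight capped at the upper clip bound. That matches the behaviour the paragraph above attributes to the two objectives.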
The headline efficiency claim of the M1 paper is that the full reinforcement-learning phase ran on 512 NVIDIA H800 GPUs for three weeks, with a rental cost of US$534,700.[1][2] The team described this as roughly an order of magnitude below initial budget expectations.[1] Multiple secondary outlets contrasted the figure with the widely reported US$5–6 million attributed to DeepSeek-R1's training and the >US$100 million sometimes cited for GPT-4-class pretraining.[5][6]
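As a sanity check, the quoted figures imply an hourly rental rate. The sketch assumes "three weeks" means exactly 21 days of continuous use on all 512 GPUs, which the source does not state.

```python
GPUS = 512
DAYS = 21                  # "three weeks", assumed to be exactly 21 days
RENTAL_COST_USD = 534_700

gpu_hours = GPUS * DAYS * 24
implied_rate = RENTAL_COST_USD / gpu_hours

print(gpu_hours)               # 258048
print(round(implied_rate, 2))  # 2.07 (US$ per GPU-hour)
```

Under that assumption the headline number corresponds to roughly US$2 per H800-hour.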
Several caveats are important and were emphasised both by independent commentators on Hacker News and by careful readers of the paper: the US$534,700 figure covers only the reinforcement-learning phase, not the underlying MiniMax-Text-01 pretraining or the 7.5 T-token continual-pretraining pass; it covers GPU rental at market rates rather than fully loaded internal cost (electricity, engineer salaries, data pipelines, dataset licensing); and it does not include the supervised fine-tuning step interposed between continual pretraining and RL.[6][15][20] The headline therefore measures the marginal cost of converting a strong base model into a frontier-grade reasoner via RL, not the all-in cost of producing M1 from scratch. Even granted these qualifications, several engineering blogs treated the result as a meaningful data point for the proposition that the post-training stage of frontier reasoning models can be made dramatically cheaper than first imagined.[7][15][17][20]
MiniMax released two checkpoints simultaneously, distinguished only by their maximum reasoning budgets.[1][2][3]
MiniMax-M1-40k corresponds to an intermediate snapshot taken during the larger RL run, with the rollout length capped at 40,000 tokens during training. It is otherwise identical in architecture and parameter count to the 80k variant.[1][2]
MiniMax-M1-80k is the headline release, trained to use up to 80,000 tokens of reasoning per response. MiniMax reports that the 80k variant outperforms 40k on the most demanding mathematics and coding tasks, "further demonstrating the benefits of scaling test-time compute," consistent with the test-time-compute scaling hypothesis explored by reasoning models such as OpenAI o3 and DeepSeek-R1.[1][2][21]
Both checkpoints are published on Hugging Face (as MiniMaxAI/MiniMax-M1-40k and MiniMaxAI/MiniMax-M1-80k) and on GitHub, with vLLM, Hugging Face Transformers, and SGLang all explicitly supported.[3][17] Recommended inference parameters are temperature 1.0 and top-p 0.95, with a task-specific system prompt template provided for general use, web development, and mathematical reasoning.[3]
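For readers unfamiliar with the two recommended parameters, the following sketch shows what temperature and top-p (nucleus) sampling do to a next-token distribution. It is a generic, standard-library illustration and not the sampler used by vLLM, Transformers, or SGLang; `sample_top_p` is a hypothetical helper.

```python
import math
import random


def sample_top_p(logits, temperature=1.0, top_p=0.95, rng=random):
    """Sample a token index with temperature scaling and nucleus filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the nucleus in proportion to the original probabilities.
    draw = rng.random() * mass
    running = 0.0
    for i in nucleus:
        running += probs[i]
        if draw < running:
            return i
    return nucleus[-1]


# Toy distribution: probabilities 0.60, 0.30, 0.08, 0.02.
logits = [math.log(p) for p in (0.60, 0.30, 0.08, 0.02)]
rng = random.Random(0)
draws = [sample_top_p(logits, temperature=1.0, top_p=0.95, rng=rng) for _ in range(2000)]
print(sorted(set(draws)))  # [0, 1, 2]; the 0.02 tail token is never sampled
```

With temperature 1.0 the distribution is left unscaled, and top-p 0.95 removes only the lowest-probability tail.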
The M1 technical report and the accompanying model cards provide a detailed evaluation table comparing the two M1 variants against a set of open-weight and proprietary frontier reasoning models. Selected representative numbers below are sourced from the paper (Table 2 / the Hugging Face model card), with comparator models cited where MiniMax included them; readers should consult the original arXiv paper for the full table.[2][3]
The M1 variants trail DeepSeek-R1 (especially the May-2025 0528 refresh) and the strongest proprietary reasoning models on pure mathematics benchmarks, but the gap from M1-40k to M1-80k consistently widens with problem difficulty, which MiniMax cites as evidence that the model benefits substantively from the larger thinking budget.[2]
On SWE-bench Verified, both M1 variants land within a couple of points of DeepSeek-R1-0528 and well above other open-weight peers, which MiniMax repeatedly cites as M1's most commercially relevant strength: complex agentic software engineering rather than competitive coding.[1][2][7]
On the GPQA Diamond science benchmark and MMLU-Pro, the M1 series lags both DeepSeek-R1 and the strongest closed models by a clear margin, suggesting M1's training tilted explicitly toward long-context, software, and tool-use scenarios at some cost to general factual knowledge.[2][15][18]
Long-context retrieval and reasoning is M1's strongest category in MiniMax's evaluation. On the 1 M-token OpenAI-MRCR setting the only model that beats M1-80k is Gemini 2.5 Pro, and on the 128 K setting M1 substantially outscores OpenAI o3.[1][2][9]
On τ-bench agentic tool-use evaluations, both M1 variants lead all open-weight models and outperform Gemini 2.5 Pro, a result MiniMax positions as one of M1's two flagship strengths along with long-context performance.[1][2][7]
Artificial Analysis independently rated MiniMax-M1-80k at 24 on its composite Intelligence Index, with the 40k variant at 21; in both cases Artificial Analysis described the scores as "below the median" of open-weight models of comparable size in its evaluation suite.[22] Independent reviewers reported that M1's coding ability is broadly competitive with Claude in real-world programming sessions but that it is slower and more prone to over-thinking, sometimes consuming hundreds of seconds for tasks that proprietary reasoning models dispatch in seconds, and that its factuality on benchmarks such as SimpleQA is "mid-tier."[15][18]
MiniMax-M1 is published on Hugging Face (MiniMaxAI/MiniMax-M1-40k, MiniMaxAI/MiniMax-M1-80k) and on GitHub (MiniMax-AI/MiniMax-M1) under the Apache 2.0 license, permitting commercial use, modification, and redistribution with attribution.[3][4] At launch, MiniMax explicitly contrasted Apache 2.0 with the more restrictive community license attached to Meta's Llama family and with DeepSeek's partial open-source posture.[4][5]
For users who prefer a hosted endpoint, MiniMax offers M1 through its own MiniMax Platform and chat product (chat.minimax.io), and the model is also exposed through resellers such as OpenRouter. Reported list pricing from MiniMax is roughly US$0.40 per million input tokens for context windows up to 200 K, US$1.30 per million input tokens for the 200 K–1 M tier, and US$2.20 per million output tokens at either tier.[4][23] Artificial Analysis cited a blended (3:1 input-to-output) rate of about US$0.96 per million tokens for M1-80k.[22] The hosted chat product was free of charge at launch.[4][5]
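The tiered list prices translate into per-request costs as sketched below. The sketch assumes the whole prompt is billed at the tier selected by its length, which the source does not spell out; the function and dictionary names are illustrative.

```python
INPUT_USD_PER_M = {"up_to_200k": 0.40, "200k_to_1m": 1.30}
OUTPUT_USD_PER_M = 2.20


def request_cost(input_tokens, output_tokens):
    """Estimated US$ cost of one request (tier chosen by prompt length)."""
    tier = "up_to_200k" if input_tokens <= 200_000 else "200k_to_1m"
    total = input_tokens * INPUT_USD_PER_M[tier] + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


def blended_rate(input_price, output_price, input_parts=3, output_parts=1):
    """Weighted average price per million tokens for a given input:output mix."""
    return (input_parts * input_price + output_parts * output_price) / (
        input_parts + output_parts
    )


print(round(request_cost(100_000, 40_000), 3))  # 0.128
print(round(request_cost(500_000, 80_000), 3))  # 0.826
print(round(blended_rate(0.40, 2.20), 2))       # 0.85
```

A 3:1 blend of the lower-tier list prices comes to about US$0.85 per million tokens. The US$0.96 figure attributed to Artificial Analysis therefore presumably rests on different underlying prices, which the source does not specify.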
Deployment is officially supported via vLLM (version 0.9.2 or higher), Hugging Face Transformers (with trust_remote_code=True), and SGLang; MiniMax recommends vLLM for production use and publishes a function-calling guide alongside an MCP-compatible server (MiniMax-MCP) for tool-use scenarios.[3][17] At full precision the model requires roughly 8 NVIDIA H200 GPUs (or equivalent) to serve, although community-quantised variants reduce that footprint considerably.[20]
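A back-of-the-envelope estimate shows why an eight-GPU node is the natural unit. The sketch assumes "full precision" means 16-bit weights and that each H200 provides 141 GB of memory. It counts weights only and ignores KV cache, activations, and the per-block overhead of real quantisation formats.

```python
TOTAL_PARAMS = 456e9
H200_MEMORY_GB = 141  # assumed per-GPU memory capacity
NUM_GPUS = 8


def weight_memory_gb(bits_per_param):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return TOTAL_PARAMS * bits_per_param / 8 / 1e9


node_memory_gb = NUM_GPUS * H200_MEMORY_GB
for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
    print(label, round(weight_memory_gb(bits)), "GB of", node_memory_gb, "GB")
```

At 16 bits the weights take about 912 GB of the node's 1,128 GB, leaving limited headroom for long-context state. Idealised 8-bit and 4-bit weights would need roughly 456 GB and 228 GB respectively.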
M1's most explicit point of comparison is DeepSeek-R1; the M1 technical report references DeepSeek-R1 dozens of times and frames Lightning Attention as a direct response to the quadratic-attention compute costs that R1 incurs at long generation lengths.[2][15] M1 trails DeepSeek-R1-0528 on most pure mathematics and coding benchmarks (AIME, MATH-500, LiveCodeBench) and by a wider margin on the GPQA Diamond science benchmark, stays within a couple of points on SWE-bench Verified, and matches or exceeds R1 on long-context (LongBench-v2) and agentic-tool (τ-bench) tasks while offering a context window roughly eight times larger.[1][2] M1 is built on the MiniMax-Text-01 foundation, a large MoE in the same class as the DeepSeek-V3 base underlying R1, so the architectural families are comparable, with the principal differentiator being Lightning Attention versus DeepSeek's Multi-head Latent Attention.[11]
Qwen3-235B-A22B (Alibaba) is the most direct open-weight peer in parameter scale and was used by MiniMax as a comparator on most benchmarks; M1 slightly trails Qwen3 on LiveCodeBench but leads on long-context and tool-use evaluations.[2] Kimi K2 from Moonshot AI is another notable Chinese open-weight competitor in the post-M1 landscape, though it post-dates the M1 release and is not in the original comparison table.[15]
MiniMax's headline marketing claim is that M1 matches or beats OpenAI o3 and Claude 4 Opus on long-context understanding and ranks "second globally" behind only Gemini 2.5 Pro on a range of long-context tasks, while claiming only comparable, not superior, performance on the benchmarks where the strongest proprietary models excel.[1][2] On AIME 2024, M1-80k's 86.0 % score is roughly five percentage points below the reported OpenAI o3 figure; on the 128 K OpenAI-MRCR long-context test, M1-40k's 76.1 % is roughly 20 percentage points above OpenAI o3's 56.5 %.[2]
MiniMax M2 – the follow-up model released by MiniMax in late October 2025 – is positioned as a smaller, more agent-focused successor optimised for tool use and code rather than for raw long-context reasoning, and it does not preserve the 1 M-token context window of M1. Although M2 has received more independent benchmark attention than M1, the M1 architecture and CISPO training methodology remain the foundation that the company iterated on.
Reaction to M1's launch divided fairly cleanly along three axes. On the technical novelty axis, both VentureBeat and InfoQ singled out the CISPO algorithm and Lightning Attention's reported use of about 25 % of DeepSeek-R1's FLOPs at a 100 K-token generation length as the most interesting contributions of the paper, describing M1 as a credible engineering advance over previous open-weight reasoning models.[5][7] Several technical blogs and Substacks – notably The Sequence Radar – described M1 as "a very impressive model" and emphasised that the combination of architectural originality and training economy makes it a useful reference point even if it is not the absolute strongest open-weight reasoner.[20]
On the cost-claim axis, South China Morning Post, The Register, and Computerworld all foregrounded the US$534,700 figure, with SCMP framing it as evidence that Chinese labs can continue to undercut Western frontier-training costs in the wake of DeepSeek-R1's January 2025 release.[4][6][24] The Register noted carefully that the figure covered only the RL phase rather than full pretraining, and Hacker News commentary was particularly attentive to that distinction and to the question of whether community-quantised variants could make the model affordable to self-host on commodity hardware.[6][20]
On the head-to-head usability axis, reports were more mixed. Decrypt's hands-on review praised M1's coding output as "matching Claude" for game-development tasks and found it strong on long-document information retrieval, but criticised its creative writing (mechanical pacing, structural issues), its tendency to over-reason on simple prompts (700-plus seconds of latency for tasks proprietary reasoning models complete in seconds), and the practical gap between the advertised 1 M-token context and the lower per-prompt limits enforced by the hosted chat product.[18] Artificial Analysis's quantitative Intelligence Index placed M1 below the median for open-weight models of comparable size, with output-token consumption during evaluation that Artificial Analysis described as higher than average.[22]
Several limitations were noted at or shortly after release.
Mathematics and pure-coding gap. On standalone mathematics (AIME 2024, AIME 2025, MATH-500) and pure-coding (LiveCodeBench, FullStackBench) benchmarks, M1 trails the strongest open-weight comparator (DeepSeek-R1-0528) and the strongest proprietary reasoning models, generally by 1–5 percentage points on coding and 5–15 points on pure mathematics.[2]
Knowledge benchmarks. On GPQA Diamond and MMLU-Pro the gap to DeepSeek-R1 and proprietary frontier models is larger – 10 percentage points or more in GPQA Diamond's case – suggesting the M1 training recipe traded general-knowledge depth for long-context and tool-use specialisation.[2]
Practical context-window ceiling. Although the architectural context window is 1 M tokens, independent reviewers reported that the hosted MiniMax chat product enforced lower per-prompt ceilings; one reviewer documented refusals beyond roughly 500,000 characters of prompt input.[18]
Over-thinking and latency. Reviewers reported very long reasoning rollouts and corresponding wall-clock latencies on simple prompts – the same trade-off that affects most large reasoning models, but amplified by M1's deliberately generous thinking budgets.[18][22]
Hardware footprint. At full precision the model is reported to require roughly 8 H200-class GPUs to serve, putting unquantised deployment out of reach for hobbyists; community quantisations to Q4 / Q8 have substantially reduced that footprint but at some quality cost.[20]
Self-reported cost figure. The widely cited US$534,700 RL-training figure has not been independently audited; it reflects MiniMax's internal accounting of GPU-rental cost only, excludes the cost of the underlying MiniMax-Text-01 base model and of the 7.5 T-token continual pretraining, and does not capture personnel, data-licensing, or electricity costs.[6][15][20]
Creative writing. Hands-on reviewers consistently described M1's creative-writing output as mechanically structured and below the quality bar set by Claude and Gemini, despite the model's strong performance on instruction following and software-engineering tasks.[18]