# MiniMax M1

> Source: https://aiwiki.ai/wiki/minimax_m1
> Updated: 2026-06-28
> Categories: Chinese AI, Large Language Models, Reasoning Models
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**MiniMax M1** (stylised **MiniMax-M1**) is an open-weight large language [reasoning model](/wiki/reasoning_models) released on 16 June 2025 by the Shanghai-based artificial-intelligence company [MiniMax](/wiki/minimax); it pairs a 456-billion-parameter [Mixture-of-Experts](/wiki/mixture_of_experts) backbone with the company's proprietary **Lightning Attention** to deliver a one-million-token context window, and its developers describe it as "the world's first open-weight, large-scale hybrid-attention reasoning model."[^1][^2][^4] The model was published under the permissive [Apache 2.0 license](/wiki/mit_license) alongside a technical report on arXiv, and is most notable for two claims: a native 1M-token input window matching the closed-source [Gemini 2.5 Pro](/wiki/gemini_2_5_pro), and a reported reinforcement-learning training cost of just **US$534,700**.[^1][^2][^3]

M1 contains 456 billion total parameters with 45.9 billion activated per token, natively supports a one-million-token input context window, and can emit up to 80,000 thinking-and-output tokens per response, both figures matching or exceeding any other open-weight model available at the time of release.[^1][^2][^3] The release attracted unusually broad attention because of three intertwined claims. First, M1 is distributed under the permissive [Apache 2.0 license](/wiki/mit_license), allowing unrestricted commercial use, modification, and redistribution.[^3][^4] Second, MiniMax reported that the full reinforcement-learning (RL) phase used just 512 [NVIDIA](/wiki/nvidia_h100) H800 GPUs over three weeks at a rental cost of **US$534,700**, an order of magnitude cheaper than the widely cited US$5-6 million figure for [DeepSeek-R1](/wiki/deepseek_r1) and roughly 0.5 % of GPT-4-era estimates.[^1][^2][^5][^6] Third, the team introduced a new RL algorithm called **CISPO** (Clipped IS-weight Policy Optimization) which it said converged about twice as fast as ByteDance's DAPO and outperformed the [GRPO](/wiki/grpo) used in early DeepSeek-R1 training.[^1][^2][^7]

Two variants were released simultaneously, **MiniMax-M1-40k** and **MiniMax-M1-80k**, differentiated by their maximum "thinking budgets" of 40,000 and 80,000 tokens respectively, with the 40k checkpoint representing an intermediate phase of the 80k training run.[^1][^2] Independent benchmarking placed M1 ahead of all other open-weight models on long-context reasoning, agentic [τ-bench](/wiki/tau_bench) tool use, and competitive with closed proprietary systems such as [Gemini 2.5 Pro](/wiki/gemini_2_5_pro), [OpenAI o3](/wiki/o3), and Claude 4 Opus on those specific axes, while trailing the strongest proprietary and open competitors on standalone mathematics and pure-coding evaluations.[^2][^7][^8][^9] M1 was followed in late October 2025 by [MiniMax M2](/wiki/minimax_m2), a smaller and more agent-focused successor.

## Key facts at a glance

| Attribute | Value |
| --- | --- |
| Developer | [MiniMax](/wiki/minimax) (Shanghai) |
| Release date | 16 June 2025 |
| Model type | Open-weight hybrid-attention [reasoning model](/wiki/reasoning_models) |
| Architecture | [Mixture-of-Experts](/wiki/mixture_of_experts) + Lightning Attention (7:1 hybrid) |
| Total parameters | 456 billion |
| Activated parameters | 45.9 billion per token |
| Experts / routing | 32 experts, top-2 routing |
| Context window | 1,000,000 tokens (input) |
| Output / thinking budget | up to 80,000 tokens |
| Variants | MiniMax-M1-40k, MiniMax-M1-80k |
| RL algorithm | CISPO (Clipped IS-weight Policy Optimization) |
| Reported RL training cost | US$534,700 (512 [NVIDIA](/wiki/nvidia_h100) H800 GPUs, 3 weeks) |
| License | [Apache 2.0](/wiki/mit_license) |
| Base model | [MiniMax-Text-01](/wiki/minimax) |
| Successor | [MiniMax M2](/wiki/minimax_m2) (Oct 2025) |

## What is MiniMax M1?

MiniMax-M1 is an open-weight large language [reasoning model](/wiki/reasoning_models) built to scale test-time compute efficiently. It is the first model to combine a large-scale [Mixture-of-Experts](/wiki/mixture_of_experts) backbone with a hybrid linear-attention design at frontier scale, and at launch MiniMax described it as "the world's first open-weight, large-scale hybrid-attention reasoning model."[^1][^2][^4] It is designed for tasks that require very long inputs or very long chains of internal reasoning: long-document analysis, multi-step agentic tool use, and complex software engineering.

### Who made MiniMax M1, and when?

[MiniMax](/wiki/minimax) (officially MiniMax AI; Chinese: 稀宇科技) was founded in Shanghai in December 2021 by [SenseTime](/wiki/sensetime) alumni Yan Junjie (CEO), Yang Bin, and Zhou Yucong; its name is borrowed from the classical minimax algorithm of game theory.[^10] The company became known for the international AI-companion app *Talkie* (2023), its Chinese-market sibling *Xing Ye*, and the *[Hailuo AI](/wiki/hailuo)* multimodal text-image-video-audio platform launched in March 2024.[^10] In its early phase, MiniMax used proprietary "abab" foundation models; it transitioned to a self-styled "MiniMax-01" series in early 2025 with the open-weight release of MiniMax-Text-01 (a 456B-parameter MoE foundation model introduced in a January 2025 technical report) and the corresponding multimodal MiniMax-VL-01.[^11] MiniMax-M1 itself was released on 16 June 2025.[^1][^2]

By the time of the M1 launch, MiniMax had raised roughly US$850 million across multiple rounds, including a Series A of about US$250 million led by [Tencent](/wiki/tencent) in 2023 and a Series B of approximately US$600 million in March 2024 led by [Alibaba Group](/wiki/alibaba) that valued the firm at around US$2.5 billion.[^10][^12] Other backers include HongShan (formerly Sequoia China), IDG Capital, Hillhouse Investment, and the videogame studio MiHoYo.[^10] MiniMax filed for a Hong Kong initial public offering during 2025 and ultimately listed on the Hong Kong Stock Exchange in early 2026, with Alibaba and Abu Dhabi's sovereign wealth fund taking cornerstone positions; the offering reportedly raised about US$619 million at a valuation near US$6.5 billion.[^13][^14] MiniMax operates internationally through a Singapore-registered entity and is generally grouped with [DeepSeek](/wiki/deepseek), Zhipu, Moonshot, Baichuan, and 01.AI as one of China's "AI Tigers."[^10][^15]

### How does MiniMax M1 relate to MiniMax-Text-01?

M1 is not a clean-room model. It is built by continual pretraining and reinforcement-learning fine-tuning on top of MiniMax-Text-01, the company's January 2025 foundation model.[^1][^2] MiniMax-Text-01 was the company's first model based on Lightning Attention at scale: a 456 B-parameter MoE design comprising 32 experts with top-2 routing per token, 45.9 B active parameters, and 80 transformer layers in which one standard softmax-attention block follows every seven transnormer Lightning-Attention blocks (a 7:1 ratio).[^11][^16] That architecture was designed specifically to make a one-million-token training context tractable; MiniMax claimed inference extrapolation up to four million tokens.[^11][^16]

For M1, MiniMax continued pretraining MiniMax-Text-01 on an additional 7.5 trillion tokens of a reasoning-intensive corpus weighted roughly 70 % STEM, code, books, and reasoning material; the learning-rate schedule held a constant 8 × 10⁻⁵ for the first 2.5 T tokens and then decayed over 5 T tokens to 8 × 10⁻⁶.[^2] After this continual-pretraining stage, the team applied supervised fine-tuning followed by the large-scale RL run that produced both M1-40k and M1-80k.[^2]

## How does MiniMax M1 work?

### What is Lightning Attention and the hybrid-attention design?

MiniMax-M1 keeps the underlying MiniMax-Text-01 architecture unchanged: the model retains the 7:1 hybrid pattern of Lightning Attention blocks (transnormer blocks using a linearised attention variant) interleaved with standard softmax-[attention](/wiki/attention) blocks, the 32-expert MoE feed-forward stacks, and the 456 B / 45.9 B parameter counts.[^1][^2][^11] What changes in M1 is the training data, the RL objective, and the inference-time behaviour optimised for very long chains of thought.

Lightning Attention is an I/O-aware, hardware-efficient implementation of a linear-attention variant that approximates softmax attention by re-writing it as a sequence of matrix multiplications, allowing computation cost to scale roughly linearly in sequence length rather than quadratically.[^2][^17] In practice the MiniMax team reports that this drives dramatic FLOPs savings as generation length grows: at a 100 K-token generation length, M1 reportedly uses only about 25 % of the floating-point operations [DeepSeek-R1](/wiki/deepseek_r1) requires for the same task, and under 50 % at 64 K tokens.[^1][^2][^17] The hybrid pattern, keeping one softmax block per seven linear blocks, is intended to preserve the long-range information-retrieval behaviour of full attention while letting Lightning Attention carry the bulk of token-level processing.[^11][^16][^17]

A vLLM engineering write-up published two weeks after the model release reported that, with Lightning Attention deployed through [vLLM](/wiki/vllm)'s Triton kernels and combined with [PagedAttention](/wiki/paged_attention), memory usage on a 100 K-token code-completion task dropped by roughly 83 % and end-to-end latency by 67 % relative to a standard softmax-only baseline.[^17]

### How is the Mixture of Experts structured?

The 456 B / 45.9 B parameter ratio comes from M1's [Mixture-of-Experts](/wiki/mixture_of_experts) feed-forward layers, which use 32 experts and a top-2 routing strategy inherited unchanged from MiniMax-Text-01.[^11] Approximately 10 % of total parameters are active for any given token, a ratio chosen to balance training cost with inference quality and broadly comparable to other large open-weight MoEs such as DeepSeek-V3.[^11][^17]

### How big is the context window and the "thinking budget"?

The model natively supports inputs of up to one million tokens, a window roughly eight times that of DeepSeek-R1 and matching the closed-source [Gemini 2.5 Pro](/wiki/gemini_2_5_pro).[^1][^2] The "thinking budget," effectively the maximum number of tokens the model may consume on internal chain-of-thought before producing a final answer, is set per checkpoint: 40,000 tokens for M1-40k and 80,000 tokens for M1-80k, both of which exceed the explicit reasoning budgets exposed by any other open-weight model as of June 2025.[^1][^2][^3] Independent reviewers later observed that, despite the advertised 1 M-token context, MiniMax's hosted chat product imposed a stricter ceiling (one reviewer reported refusals beyond 500,000 characters of prompt), a deployment-level limit distinct from the model's architectural ceiling.[^18]

## How was MiniMax M1 trained?

### What happened during continual pretraining?

The first stage of M1 training was a 7.5-trillion-token continual-pretraining pass on top of MiniMax-Text-01's base PLM (not the instruction-tuned variant).[^2] MiniMax described the data mixture as "reasoning-intensive," with roughly 70 % of tokens drawn from STEM, code, books, and explicit reasoning corpora; the team emphasised that the data came from natural sources rather than synthetic generation.[^2] The learning-rate schedule first held flat at 8 × 10⁻⁵ for 2.5 T tokens, then decayed over 5 T tokens to 8 × 10⁻⁶.[^2]

### What reinforcement-learning environments were used?

After continual pretraining and supervised fine-tuning, M1 was trained with large-scale [reinforcement learning](/wiki/reinforcement_learning) across a deliberately heterogeneous mix of rule-verified and model-judged environments.[^2][^7]

Verifiable rule-based tasks dominated:

* **Mathematics.** Approximately 50,000 mathematical problems with checkable final answers were used.[^2]
* **Logical reasoning.** Roughly 53,000 problems were synthesised through MiniMax's own **SynLogic** framework, which covers 41 logical-reasoning task families (ciphers, Sudoku, Game of 24, arrow mazes, and similar puzzles whose solutions can be programmatically verified).[^2][^19] SynLogic was released as a separate companion paper at NeurIPS 2025.[^19]
* **Competitive programming.** About 30,000 samples drawn from competitive-programming style problems with executable test suites.[^2]
* **Real-world software engineering.** Containerised sandbox environments derived from [SWE-bench](/wiki/swe_bench_verified)-style repositories, allowing M1 to perform code edits and have its patches validated by automated tests.[^2][^7]

Where rule-based verification was unavailable, the team used model-based feedback on around 25,000 samples covering instruction following and creative writing, with a learned reward model adjudicating quality.[^2]

### What is the CISPO algorithm?

The most distinctive technical contribution of the M1 paper is **CISPO** (Clipped IS-weight Policy Optimization), an RL objective designed to address a specific failure mode that MiniMax identified in [GRPO](/wiki/grpo) and DAPO when training reasoning models with long chains of thought.[^2][^7] ("IS" stands for importance sampling; the paper introduces the method as "CISPO, a novel RL algorithm that clips importance sampling weights instead of token updates.")[^4]

In conventional [PPO](/wiki/ppo)-derived objectives such as GRPO, the importance-sampling ratio is multiplied into the per-token loss and is clipped at the token level; tokens whose ratios fall outside the trust region are effectively zeroed out. The MiniMax team argued that on long reasoning rollouts this disproportionately clips precisely the low-frequency but logically critical "pivot" tokens, words like *however*, *wait*, *recheck*, *but*, or *aha* that mark self-correction in chain-of-thought.[^2][^7] CISPO instead **clips the importance-sampling weights themselves while preserving the gradient contribution of every token**, formally
$$\hat r_{i,t}(\theta) = \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\varepsilon^{\mathrm{IS}}_{\mathrm{low}},\, 1+\varepsilon^{\mathrm{IS}}_{\mathrm{high}}\big)$$
with the policy gradient computed against the stop-gradient of $\hat r_{i,t}$ multiplied by the group-relative advantage borrowed from GRPO and applied at the token level.[^2][^7] In a controlled ablation on AIME problems, MiniMax reported that CISPO matched DAPO's final performance using about half as many training steps and converged "roughly twice as fast."[^1][^2][^7]

### How much did MiniMax M1 cost to train?

The headline efficiency claim of the M1 paper is that the **full reinforcement-learning phase ran on 512 NVIDIA H800 GPUs for three weeks, with a rental cost of US$534,700**.[^1][^2] The technical report states plainly that "our efficient RL framework enables us to complete a full RL run of MiniMax-M1 within 3 weeks using 512 H800 GPUs," equivalent to a rental cost of approximately US$0.53M.[^2] The team described this as roughly an order of magnitude below initial budget expectations.[^1] Multiple secondary outlets contrasted the figure with the widely reported US$5-6 million attributed to DeepSeek-R1's training and the >US$100 million sometimes cited for GPT-4-class pretraining.[^5][^6]

Several caveats are important and were emphasised both by independent commentators on Hacker News and by careful readers of the paper: the US$534,700 figure covers **only the reinforcement-learning phase**, not the underlying MiniMax-Text-01 pretraining or the 7.5 T-token continual-pretraining pass; it covers GPU rental at market rates rather than fully loaded internal cost (electricity, engineer salaries, data pipelines, dataset licensing); and it does not include the supervised fine-tuning step interposed between continual pretraining and RL.[^6][^15][^20] The headline therefore measures the marginal cost of converting a strong base model into a frontier-grade reasoner via RL, not the all-in cost of producing M1 from scratch. Even granted these qualifications, several engineering blogs treated the result as a meaningful data point for the proposition that the post-training stage of frontier reasoning models can be made dramatically cheaper than first imagined.[^7][^15][^17][^20]

## What are the MiniMax M1 variants?

### How do MiniMax-M1-40k and MiniMax-M1-80k differ?

MiniMax released two checkpoints simultaneously, distinguished only by their maximum reasoning budgets.[^1][^2][^3]

**MiniMax-M1-40k** corresponds to an intermediate snapshot taken during the larger RL run, with the rollout length capped at 40,000 tokens during training. It is otherwise identical in architecture and parameter count to the 80k variant.[^1][^2]

**MiniMax-M1-80k** is the headline release, trained to use up to 80,000 tokens of reasoning per response. MiniMax reports that the 80k variant outperforms 40k on the most demanding mathematics and coding tasks, "further demonstrating the benefits of scaling test-time compute," consistent with the test-time-compute scaling hypothesis explored by [reasoning models](/wiki/reasoning_models) such as [OpenAI o3](/wiki/o3) and DeepSeek-R1.[^1][^2][^21]

Both checkpoints are published on [Hugging Face](/wiki/hugging_face) (as `MiniMaxAI/MiniMax-M1-40k` and `MiniMaxAI/MiniMax-M1-80k`) and on GitHub, with vLLM, [Hugging Face Transformers](/wiki/transformers_library), and [SGLang](/wiki/sglang) all explicitly supported.[^3][^17] Recommended inference parameters are temperature 1.0 and top-p 0.95, with a task-specific system prompt template provided for general use, web development, and mathematical reasoning.[^3]

## How does MiniMax M1 perform on benchmarks?

The M1 technical report and the accompanying model cards provide a detailed evaluation table comparing the two M1 variants against a set of open-weight and proprietary frontier reasoning models. Selected representative numbers below are sourced from the paper (Table 2 / the HuggingFace model card), with comparator models cited where MiniMax included them; readers should consult the original arXiv paper for the full table.[^2][^3]

### How good is MiniMax M1 at mathematics?

* **[AIME 2024](/wiki/aime_2024)**: M1-40k 83.3 %, M1-80k 86.0 %; DeepSeek-R1-0528 91.4 %; OpenAI o3 ~91.6 %.[^2][^3]
* **[AIME 2025](/wiki/aime_2025)**: M1-40k 74.6 %, M1-80k 76.9 %; DeepSeek-R1-0528 87.5 %.[^2]
* **[MATH](/wiki/math)-500**: M1-40k 96.0 %, M1-80k 96.8 %; DeepSeek-R1-0528 98.0 %.[^2][^3]

The M1 variants trail [DeepSeek-R1](/wiki/deepseek_r1) (especially the May-2025 0528 refresh) and the strongest proprietary reasoning models on pure mathematics benchmarks, but the gap from M1-40k to M1-80k consistently widens with problem difficulty, which MiniMax cites as evidence that the model benefits substantively from the larger thinking budget.[^2]

### How does MiniMax M1 perform on coding and software engineering?

* **[LiveCodeBench](/wiki/livecodebench)**: M1-40k 62.3 %, M1-80k 65.0 %; [Qwen3](/wiki/qwen_3)-235B 65.9 %.[^2]
* **FullStackBench**: M1-40k 67.6 %, M1-80k 68.3 %; DeepSeek-R1-0528 69.4 %.[^2]
* **[SWE-bench Verified](/wiki/swe_bench_verified)**: M1-40k 55.6 %, M1-80k 56.0 %; DeepSeek-R1-0528 57.6 %.[^2][^3]

On SWE-bench Verified, both M1 variants land within a couple of points of DeepSeek-R1-0528 and well above other open-weight peers, which MiniMax repeatedly cites as M1's most commercially relevant strength: complex agentic software engineering rather than competitive coding.[^1][^2][^7]

### How does MiniMax M1 perform on knowledge and general reasoning?

* **[GPQA Diamond](/wiki/gpqa_diamond)**: M1-40k 69.2 %, M1-80k 70.0 %; DeepSeek-R1-0528 81.0 %.[^2]
* **[MMLU-Pro](/wiki/mmlu-pro)**: M1-40k 80.6 %, M1-80k 81.1 %; DeepSeek-R1-0528 85.0 %.[^2]

On the GPQA Diamond science benchmark and MMLU-Pro, the M1 series lags both DeepSeek-R1 and the strongest closed models by a clear margin, suggesting M1's training tilted explicitly toward long-context, software, and tool-use scenarios at some cost to general factual knowledge.[^2][^15][^18]

### How does MiniMax M1 handle long context?

* **OpenAI-MRCR (128 K)**: M1-40k 76.1 %, M1-80k 73.4 %; OpenAI o3 56.5 %.[^2]
* **OpenAI-MRCR (1 M)**: M1-80k 56.2 %; Gemini 2.5 Pro 58.8 %.[^2][^3]
* **[LongBench](/wiki/longbench)-v2**: M1-40k 61.0 %, M1-80k 61.5 %; DeepSeek-R1-0528 52.1 %.[^2]

Long-context retrieval and reasoning is M1's strongest category in MiniMax's evaluation. On the 1 M-token OpenAI-MRCR setting the only model that beats M1-80k is Gemini 2.5 Pro, and on the 128 K setting M1 substantially outscores OpenAI o3.[^1][^2][^9]

### How does MiniMax M1 perform on agentic tool use?

* **[τ-bench](/wiki/tau_bench) (airline)**: M1-40k 60.0 %, M1-80k 62.0 %; Gemini 2.5 Pro 50.0 %.[^2]
* **τ-bench (retail)**: M1-40k 67.8 %, M1-80k 63.5 %; Qwen3-235B 58.6 %.[^2]

On τ-bench agentic tool-use evaluations, both M1 variants lead all open-weight models and outperform Gemini 2.5 Pro, a result MiniMax positions as one of M1's two flagship strengths along with long-context performance.[^1][^2][^7]

### What does independent benchmarking say?

[Artificial Analysis](/wiki/artificial_analysis) independently rated MiniMax-M1-80k at 24 on its composite *Intelligence Index*, with the 40k variant at 21; in both cases the company described the scores as "below the median" of open-weight models of comparable size in its evaluation suite.[^22] Independent reviewers reported that M1's coding ability is broadly competitive with Claude in real-world programming sessions but that it is slower and more prone to over-thinking, sometimes consuming hundreds of seconds for tasks that proprietary reasoning models dispatch in seconds, and that its factuality on benchmarks such as SimpleQA is "mid-tier."[^15][^18]

## Is MiniMax M1 open source, and how can I use it?

MiniMax-M1 is published on [Hugging Face](/wiki/hugging_face) (`MiniMaxAI/MiniMax-M1-40k`, `MiniMaxAI/MiniMax-M1-80k`) and on GitHub (`MiniMax-AI/MiniMax-M1`) under the [Apache 2.0 license](/wiki/mit_license), permitting commercial use, modification, and redistribution with attribution.[^3][^4] At launch, MiniMax explicitly contrasted Apache 2.0 with the more restrictive community license attached to Meta's Llama family and with DeepSeek's partial open-source posture.[^4][^5]

For users who prefer a hosted endpoint, MiniMax offers M1 through its own *MiniMax Platform* and chat product (`chat.minimax.io`), and the model is also exposed through resellers such as [OpenRouter](/wiki/openrouter). Reported list pricing from MiniMax is roughly US$0.40 per million input tokens for context windows up to 200 K, US$1.30 per million input tokens for the 200 K-1 M tier, and US$2.20 per million output tokens at either tier.[^4][^23] Artificial Analysis cited a blended (3:1 input-to-output) rate of about US$0.96 per million tokens for M1-80k.[^22] The hosted chat product was free of charge at launch.[^4][^5]

Deployment is officially supported via [vLLM](/wiki/vllm) (version 0.9.2 or higher), [Hugging Face Transformers](/wiki/transformers_library) (with `trust_remote_code=True`), and [SGLang](/wiki/sglang); MiniMax recommends vLLM for production use and publishes a function-calling guide alongside an MCP-compatible server (`MiniMax-MCP`) for tool-use scenarios.[^3][^17] At full precision the model requires roughly 8 NVIDIA H200 GPUs (or equivalent) to serve, although community-quantised variants reduce that footprint considerably.[^20]

## How does MiniMax M1 compare to other models?

### How does MiniMax M1 compare to DeepSeek-R1 and DeepSeek-V3?

M1's most explicit point of comparison is [DeepSeek-R1](/wiki/deepseek_r1); the M1 technical report references DeepSeek-R1 dozens of times and frames Lightning Attention as a direct response to the quadratic-attention compute costs that R1 incurs at long generation lengths.[^2][^15] M1 trails DeepSeek-R1-0528 by 1-5 percentage points on most pure math/code benchmarks (AIME, GPQA Diamond, MATH-500, SWE-bench Verified) but matches or exceeds R1 on long-context (LongBench-v2) and agentic-tool (τ-bench) tasks, and offers an eight-times-larger context window.[^1][^2] M1 is also based on the same MiniMax-Text-01 foundation, a [DeepSeek-V3](/wiki/deepseek_v3)-class large MoE, so the architectural family is comparable, with the principal differentiator being Lightning Attention versus DeepSeek's [Multi-head Latent Attention](/wiki/multi-head_latent_attention).[^11]

### How does MiniMax M1 compare to Qwen3 and Kimi K2?

[Qwen3](/wiki/qwen_3)-235B-A22B (Alibaba) is the most direct open-weight peer in parameter scale and was used by MiniMax as a comparator on most benchmarks; M1 slightly trails Qwen3 on LiveCodeBench but leads on long-context and tool-use evaluations.[^2] [Kimi K2](/wiki/kimi_k2) from Moonshot AI is another notable Chinese open-weight competitor in the post-M1 landscape, though it post-dates the M1 release and is not in the original comparison table.[^15]

### How does MiniMax M1 compare to OpenAI o3, Claude 4 Opus, and Gemini 2.5 Pro?

MiniMax's headline marketing claim is that M1 matches or beats [OpenAI o3](/wiki/o3) and Claude 4 Opus on long-context understanding and ranks "second globally" behind only [Gemini 2.5 Pro](/wiki/gemini_2_5_pro) on a range of long-context tasks, with comparable but not superior performance on the strongest proprietary models' home benchmarks.[^1][^2] On AIME 2024, M1-80k's 86.0 % score is roughly five percentage points below the reported OpenAI o3 figure; on the 128 K OpenAI-MRCR long-context test, M1-40k's 76.1 % is roughly 20 percentage points above OpenAI o3's 56.5 %.[^2]

### What is MiniMax M2 and how does it differ?

[MiniMax M2](/wiki/minimax_m2), the follow-up model released by MiniMax in late October 2025, is positioned as a smaller, more agent-focused successor optimised for tool use and code rather than for raw long-context reasoning, and it does not preserve the 1 M-token context window of M1. Although M2 has received more independent benchmark attention than M1, the M1 architecture and CISPO training methodology remain the foundation that the company iterated on.

## How was MiniMax M1 received?

Reaction to M1's launch divided fairly cleanly along three axes. On the technical novelty axis, both VentureBeat and InfoQ singled out Lightning Attention's reported 25 %-of-DeepSeek-R1 FLOPs at 100 K-token generation and the CISPO algorithm as the most interesting contributions of the paper, describing M1 as a credible engineering advance over previous open-weight reasoning models.[^5][^7] Several technical blogs and Substacks, notably *The Sequence Radar*, described M1 as "a very impressive model" and emphasised that the combination of architectural originality and training economy makes it a useful reference point even if it is not the absolute strongest open-weight reasoner.[^20]

On the cost-claim axis, *South China Morning Post*, *The Register*, and Computerworld all foregrounded the US$534,700 figure, with SCMP framing it as evidence that Chinese labs can continue to undercut Western frontier-training costs in the wake of DeepSeek-R1's January 2025 release.[^4][^6][^24] *The Register* noted carefully that the figure covered only the RL phase rather than full pretraining, and Hacker News commentary was particularly attentive to that distinction and to the question of whether community-quantised variants could make the model affordable to self-host on commodity hardware.[^6][^20]

On the head-to-head usability axis, hands-on reviewers gave more mixed reports. Decrypt's hands-on review praised M1's coding output as "matching Claude" for game-development tasks and found it strong on long-document information retrieval, but criticised its creative writing (mechanical pacing, structural issues), its tendency to over-reason on simple prompts (700-plus seconds of latency for tasks proprietary reasoning models complete in seconds), and the practical gap between the advertised 1 M-token context and the lower per-prompt limits enforced by the hosted chat product.[^18] Artificial Analysis's quantitative *Intelligence Index* placed M1 below the median for open-weight models of comparable size, with output-token consumption during evaluation that the company described as higher than average.[^22]

## What are the limitations of MiniMax M1?

Several limitations were noted at or shortly after release.

**Mathematics and pure-coding gap.** On standalone mathematics (AIME 2024, AIME 2025, MATH-500) and pure-coding (LiveCodeBench, FullStackBench) benchmarks, M1 trails the strongest open-weight comparator (DeepSeek-R1-0528) and the strongest proprietary reasoning models, generally by 1-5 percentage points on coding and 5-15 points on pure mathematics.[^2]

**Knowledge benchmarks.** On GPQA Diamond and MMLU-Pro the gap to DeepSeek-R1 and proprietary frontier models is larger, 10 percentage points or more in GPQA Diamond's case, suggesting the M1 training recipe traded general-knowledge depth for long-context and tool-use specialisation.[^2]

**Practical context-window ceiling.** Although the architectural context window is 1 M tokens, independent reviewers reported that the hosted MiniMax chat product enforced lower per-prompt ceilings; one reviewer documented refusals beyond roughly 500,000 characters of prompt input.[^18]

**Over-thinking and latency.** Reviewers reported very long reasoning rollouts and corresponding wall-clock latencies on simple prompts, the same trade-off that affects most large reasoning models, but amplified by M1's deliberately generous thinking budgets.[^18][^22]

**Hardware footprint.** At full precision the model is reported to require roughly 8 H200-class GPUs to serve, putting unquantised deployment out of reach for hobbyists; community quantisations to Q4 / Q8 have substantially reduced that footprint but at some quality cost.[^20]

**Self-reported cost figure.** The widely cited US$534,700 RL-training figure has not been independently audited; it reflects MiniMax's internal accounting of GPU-rental cost only, excludes the cost of the underlying MiniMax-Text-01 base model and of the 7.5 T-token continual pretraining, and does not capture personnel, data-licensing, or electricity costs.[^6][^15][^20]

**Creative writing.** Hands-on reviewers consistently described M1's creative-writing output as mechanically structured and below the quality bar set by Claude and Gemini, despite the model's strong performance on instruction following and software-engineering tasks.[^18]

## See also

* [MiniMax](/wiki/minimax) - parent company
* [MiniMax M2](/wiki/minimax_m2) - successor model
* [DeepSeek-R1](/wiki/deepseek_r1) - primary open-weight reasoning-model comparator
* [Mixture of Experts (MoE)](/wiki/mixture_of_experts)
* [GRPO](/wiki/grpo) and [Proximal Policy Optimization (PPO)](/wiki/ppo) - RL algorithms compared with CISPO
* [Test-time compute](/wiki/test_time_compute)
* [Reasoning models](/wiki/reasoning_models)

## References

[^1]: MiniMax (16 June 2025). "MiniMax-M1, the World's First Open-Source, Large-Scale, Hybrid-Attention Reasoning Model." *MiniMax News*. https://www.minimax.io/news/minimaxm1
[^2]: Chen, A. *et al.* (16 June 2025). "MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention." *arXiv*:2506.13585. https://arxiv.org/abs/2506.13585 and full HTML version https://arxiv.org/html/2506.13585v1
[^3]: MiniMax AI. "MiniMax-M1-80k model card." *Hugging Face*. https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
[^4]: MiniMax-AI (16 June 2025). "MiniMax-M1: the world's first open-weight, large-scale hybrid-attention reasoning model" (GitHub README). https://github.com/MiniMax-AI/MiniMax-M1
[^5]: Franzen, C. (17 June 2025). "MiniMax-M1 is a new open source model with 1 MILLION TOKEN context and new, hyper efficient reinforcement learning." *VentureBeat*. https://venturebeat.com/ai/minimax-m1-is-a-new-open-source-model-with-1-million-token-context-and-new-hyper-efficient-reinforcement-learning
[^6]: Quach, K. (17 June 2025). "MiniMax M1 model claims Chinese LLM crown from DeepSeek." *The Register*. https://www.theregister.com/2025/06/17/minimax_m1_model_chinese_llm/
[^7]: Bhutani, A. (19 June 2025). "MiniMax AI Releases MiniMax-M1: A 456B Parameter Hybrid Model for Long-Context and Reinforcement Learning RL Tasks." *MarkTechPost*. https://www.marktechpost.com/2025/06/19/minimax-ai-releases-minimax-m1-a-456b-parameter-hybrid-model-for-long-context-and-reinforcement-learning-rl-tasks/
[^8]: Sharma, A. (June 2025). "MiniMax Releases M1: a 456B Hybrid-Attention Model for Long-Context Reasoning and Software Tasks." *InfoQ*. https://www.infoq.com/news/2025/06/minimax-m1/
[^9]: "MiniMax M1 80k: Intelligence, Performance & Price Analysis." *Artificial Analysis*. https://artificialanalysis.ai/models/minimax-m1-80k
[^10]: "MiniMax Group." *Wikipedia*. https://en.wikipedia.org/wiki/MiniMax_Group
[^11]: Li, A. *et al.* (January 2025). "MiniMax-01: Scaling Foundation Models with Lightning Attention." *arXiv*:2501.08313. https://arxiv.org/abs/2501.08313
[^12]: "Startup Company MiniMax Completes Series B Funding, Doubling Its Valuation." *Pandaily* (March 2024). https://pandaily.com/startup-company-minimax-completes-series-b-funding-doubling-its-valuation
[^13]: "Alibaba-Backed 'AI Dragon' MiniMax Plans Hong Kong IPO." *Bloomberg* (18 June 2025). https://www.bloomberg.com/news/articles/2025-06-18/alibaba-backed-ai-dragon-minimax-is-said-to-plan-hong-kong-ipo
[^14]: "Alibaba, Abu Dhabi Set to Invest in MiniMax's $600 Million IPO." *Bloomberg* (30 December 2025). https://www.bloomberg.com/news/articles/2025-12-30/alibaba-abu-dhabi-set-to-invest-in-minimax-s-600-million-ipo
[^15]: Soto, J. (16 June 2025). "DeepSeek rival MiniMax says its first AI reasoning model halves compute of R1." *South China Morning Post*. https://www.scmp.com/tech/tech-trends/article/3314819/deepseek-rival-minimax-says-its-first-ai-reasoning-model-halves-compute-r1
[^16]: Huang, J. (January 2025). "Summary on MiniMax-01." https://jianyuh.github.io/minimax-01/2025/01/18/minimax-01.html
[^17]: "MiniMax-M1 Hybrid Architecture Meets vLLM: Long Context, Fast Inference." *vLLM Blog* (30 June 2025). https://blog.vllm.ai/2025/06/30/minimax-m1.html
[^18]: Decrypt (June 2025). "Can China's MiniMax-M1 AI Topple US Rivals? We Put It to the Test." https://decrypt.co/327569/can-china-minimax-m1-ai-topple-us-rivals-review
[^19]: MiniMax-AI. "SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond." NeurIPS 2025 / GitHub. https://github.com/MiniMax-AI/SynLogic and *arXiv*:2505.19641 https://arxiv.org/abs/2505.19641
[^20]: "MiniMax-M1 open-weight, large-scale hybrid-attention reasoning model." *Hacker News* discussion (June 2025). https://news.ycombinator.com/item?id=44307290
[^21]: "The Sequence Radar #669: MiniMax-M1 is a Very Impressive Model." *The Sequence* (June 2025). https://thesequence.substack.com/p/the-sequence-radar-minimax-m1-is
[^22]: "MiniMax M1 40k / 80k: API Provider Performance Benchmarking & Price Analysis." *Artificial Analysis*. https://artificialanalysis.ai/models/minimax-m1-40k
[^23]: MiniMax Platform pricing documentation. https://platform.minimax.io/docs/guides/pricing-token-plan
[^24]: "China's MiniMax launches M1: A reasoning model to rival GPT-4 at 0.5% the cost." *Computerworld* (June 2025). https://www.computerworld.com/article/4008870/chinas-minimax-launches-m1-a-reasoning-model-to-rival-gpt-4-at-0-5-the-cost.html