MiniMax-Text-01
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,940 words
MiniMax-Text-01 is an open-weights, large-scale mixture-of-experts (MoE) language model released by Shanghai-based AI company MiniMax on January 14, 2025.[^1][^2] The model contains 456 billion total parameters with 45.9 billion activated per token, and is notable for being among the first openly released foundation models to deploy a "lightning attention" linear-attention variant at production scale, supporting a native training context of 1 million tokens and inference extrapolation up to 4 million tokens.[^3][^4][^5] Together with its multimodal sibling MiniMax-VL-01, the model forms the "MiniMax-01" series and was accompanied by an extensive technical report ("MiniMax-01: Scaling Foundation Models with Lightning Attention," arXiv:2501.08313) co-authored by more than one hundred researchers.[^5][^6]
Positioned by MiniMax as on par with frontier closed-weight systems such as GPT-4o and Claude 3.5 Sonnet while offering a context window 20 to 32 times longer, MiniMax-Text-01 attracted attention as one of two flagship Chinese open-weights launches in January 2025, alongside DeepSeek V3.[^1][^4][^7] The model is distributed on Hugging Face under the proprietary "MiniMax Model Agreement" — not a standard open-source license — and the accompanying inference code is released separately under the MIT License.[^8][^9] MiniMax-Text-01 later served as the pretraining base for the company's reinforcement-learning-driven reasoning model MiniMax-M1 (June 2025) and, indirectly, for subsequent generations including the agentic MiniMax M2.[^10][^11]
MiniMax (full name MiniMax AI; Chinese: 稀宇科技, "Xiyu Technology") was incorporated in Shanghai in December 2021 by CEO Yan Junjie, a former vice president of research and development at SenseTime, along with co-founders Yang Bin, Zhou Yucong and Yun Yeyi, also drawn from SenseTime.[^12][^13] Yan had received his PhD from the Chinese Academy of Sciences' Institute of Automation in 2015 before joining SenseTime as one of its youngest VPs; the company's name is a reference to the minimax search algorithm used in classical game-theoretic AI.[^12][^13]
Within the Chinese AI ecosystem, MiniMax is commonly grouped with DeepSeek, Zhipu, Moonshot, Baichuan and 01.AI as one of the country's so-called "Six Little Tigers" (六小虎) of large-language-model development.[^13][^14] Prior to the MiniMax-01 series, the company had built consumer products including the AI character-chat platforms Talkie (international) and Xingye (China), the multimodal assistant Hailuo AI, and a proprietary MoE family branded "abab," with abab 6.5 (released in April 2024) being the immediate technical predecessor of MiniMax-01.[^13][^15] By July 2024 The Wall Street Journal reported Talkie had reached approximately 11 million monthly active users in the United States.[^13]
In March 2024 Alibaba Group led a roughly $600 million funding round that valued MiniMax at approximately $2.5 billion; other backers across earlier rounds include Tencent, Hillhouse Investment, HongShan, IDG Capital and game studio miHoYo.[^12][^14] The company would later list on the Hong Kong Stock Exchange on January 9, 2026, with shares more than doubling on debut — but at the time MiniMax-Text-01 was released in January 2025, MiniMax was still privately held and considerably less internationally visible than DeepSeek, which dominated headlines later that month with the release of DeepSeek-R1.[^16][^17]
MiniMax announced the MiniMax-01 series on its corporate news page on January 14, 2025 (Beijing time) and pushed model weights and a 68-page technical report to Hugging Face, GitHub and arXiv on January 15.[^1][^2][^5] The release comprised two open-weight models — the text-only MiniMax-Text-01 and the vision-language MiniMax-VL-01 — together with an API offering at $0.20 per million input tokens and $1.10 per million output tokens, plus consumer access through Hailuo AI.[^1][^4] MiniMax framed the release as targeting "the AI Agent era," arguing that the dramatic context-length lead over GPT-4o (128K) and Claude 3.5 Sonnet (200K) would enable long-horizon agentic workflows requiring extensive memory and tool use.[^1][^4]
The launch came about two weeks before DeepSeek-R1 rattled global financial markets on January 27, 2025, and it fell within the same broader news cycle of Chinese open-weight models being measured against comparable Western releases. Both TechCrunch and VentureBeat covered the launch on January 15, with TechCrunch framing the release in the context of MiniMax being "Alibaba- and Tencent-backed" and having raised approximately $850 million in venture funding.[^4][^7][^14]
MiniMax-Text-01 is a decoder-only transformer with several architectural choices that diverge meaningfully from contemporary MoE models such as DeepSeek V3 (671B/37B) and Llama-style dense designs.[^3][^5][^18]
The headline specifications, as documented in both the model card and the arXiv report, are:[^3][^5][^6]
Compared with DeepSeek V3, which uses 256 routed experts plus one shared expert with eight routed experts active per token, MiniMax-Text-01 favors a smaller number of wider experts with simpler routing and no shared expert.[^18] An external Hugging Face analysis notes that the two designs activate a comparable total expert hidden width per MoE layer (18,432), but that MiniMax-Text-01 uses a noticeably deeper 80-layer stack versus DeepSeek V3's 61, a choice the authors attribute to the depth benefits of mostly-linear attention.[^18]
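The 18,432 figure can be reproduced with simple arithmetic. The sketch below is illustrative only: the MiniMax-Text-01 numbers (top-2 routing, expert hidden dimension 9,216) are the ones given in this article, while the DeepSeek V3 per-expert width of 2,048 is an assumption taken from that model's public configuration rather than from the sources cited here.

```python
# MiniMax-Text-01: two active experts per token, each 9,216 wide, no shared expert.
minimax_active_width = 2 * 9_216

# DeepSeek V3: eight routed experts plus one shared expert per token.
# The 2,048 per-expert width is an assumption (not stated in this article).
deepseek_active_width = (8 + 1) * 2_048

print(minimax_active_width, deepseek_active_width)
```

Under those assumptions both products come out equal, which is the sense in which the two MoE layers are described as comparable despite their very different expert counts.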
The model's defining feature is its hybrid attention pattern. Rather than using softmax attention in every layer, MiniMax-Text-01 employs Lightning Attention — an I/O-aware implementation of Lightning Attention-2, a linear-attention variant — in seven out of every eight layers, with a single softmax attention layer inserted as the eighth.[^5][^6][^18] This 7:1 ratio repeats throughout the 80-layer model, giving the model 70 lightning-attention layers and 10 softmax-attention layers in total.[^3][^6]
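The 7:1 interleaving is easy to express as a layer schedule. The following is a minimal sketch, not MiniMax's configuration code; the function name `attention_layer_types` and the string labels are hypothetical, and the only facts taken from the article are the 80-layer depth and the rule that every eighth layer uses softmax attention.

```python
def attention_layer_types(num_layers: int = 80, period: int = 8) -> list[str]:
    """Return the attention type of each layer: softmax on every `period`-th layer."""
    return [
        "softmax" if (i + 1) % period == 0 else "lightning"
        for i in range(num_layers)
    ]


layer_types = attention_layer_types()
counts = {kind: layer_types.count(kind) for kind in ("lightning", "softmax")}
print(counts)  # {'lightning': 70, 'softmax': 10}
```

The counts match the 70 lightning-attention and 10 softmax-attention layers quoted above.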
Lightning Attention-2 replaces the standard quadratic O(n²d) softmax attention with an approximately linear O(d²n) computation by reformulating the attention computation as a sequence of matrix products on gated SiLU-projected queries, keys and values, followed by RMS normalization and a sigmoid gate.[^6][^18] Because complexity is linear in sequence length, the per-token cost of attention stops growing with context, which is what permits million-token training contexts on practical hardware budgets.[^4][^6]
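The complexity argument rests on a reordering of matrix products that applies to any causal linear attention. The sketch below illustrates only that core identity in plain Python; it deliberately omits the SiLU projections, gating, normalization and block-wise tiling described above, so it should be read as a toy model of the idea rather than as Lightning Attention itself. All function names are hypothetical.

```python
import random


def causal_linear_attention_quadratic(q, k, v):
    """Reference form: out_t = sum over s <= t of (q_t . k_s) * v_s, cost O(n^2 d)."""
    n, dv = len(q), len(v[0])
    out = []
    for t in range(n):
        row = [0.0] * dv
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(q[t], k[s]))
            for j in range(dv):
                row[j] += score * v[s][j]
        out.append(row)
    return out


def causal_linear_attention_recurrent(q, k, v):
    """Same result using a running d x d state S = sum of k_s v_s^T, cost O(n d^2)."""
    n, dk, dv = len(q), len(k[0]), len(v[0])
    state = [[0.0] * dv for _ in range(dk)]
    out = []
    for t in range(n):
        # Fold the current key/value pair into the fixed-size state.
        for i in range(dk):
            for j in range(dv):
                state[i][j] += k[t][i] * v[t][j]
        # Read out with the current query; work per token is independent of t.
        out.append([sum(q[t][i] * state[i][j] for i in range(dk)) for j in range(dv)])
    return out


random.seed(0)
n, d = 6, 4
q = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
k = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
v = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

slow = causal_linear_attention_quadratic(q, k, v)
fast = causal_linear_attention_recurrent(q, k, v)
max_diff = max(abs(a - b) for ra, rb in zip(slow, fast) for a, b in zip(ra, rb))
print(max_diff)
```

Because the recurrent form carries only a d × d state from one token to the next, the work done for each new token does not depend on how many tokens preceded it.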
The retained softmax layers, equipped with RoPE on half of each head's dimension, act as periodic "global" mixing layers; ablations in the MiniMax-01 paper indicate that pure lightning-attention models suffered substantial accuracy losses on retrieval tasks such as Needle-in-a-Haystack, while sliding-window softmax alternatives lagged on broader generation quality. The 7:1 hybrid achieved the best overall trade-off in MiniMax's evaluations.[^18] An external deep dive by Hugging Face engineer Elie Bakouch describes this as "the first commercial-grade implementation of linear attention at scale," noting that prior linear-attention systems (Mamba, Cosformer2, HGRN2) had not been deployed in production-class MoE models.[^18]
To make the architecture efficient in practice, MiniMax pairs it with custom tooling including Linear Attention Sequence Parallelism Plus (LASP+), variable-length ring attention, expert tensor parallelism and optimized CUDA kernels; the company reports achieving over 75% Model FLOPs Utilization (MFU) on NVIDIA H20 GPUs for end-to-end inference.[^2][^4]
MiniMax-Text-01's MoE block uses 32 experts of hidden dimension 9,216 with top-2 routing, meaning each token activates two of the 32 experts per MoE layer. Routing is governed by a "global router" that distributes tokens across each expert-parallel group, paired with a conventional auxiliary load-balancing loss — differing from DeepSeek V3's auxiliary-loss-free dropless routing.[^18] The architecture has no shared (always-on) expert, a deliberate simplification relative to the DeepSeek family.[^18]
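Top-2 routing over 32 experts can be sketched in a few lines. This is a generic illustration under stated assumptions, not MiniMax's router: the function name `top2_route` is hypothetical, the renormalization of the two selected weights is a common convention that the sources here do not confirm, and the global-router and auxiliary-loss machinery is left out.

```python
import math


def top2_route(router_logits: list[float]) -> list[tuple[int, float]]:
    """Pick the two highest-scoring experts and renormalize their softmax weights."""
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]


NUM_EXPERTS = 32
logits = [0.0] * NUM_EXPERTS
logits[5], logits[17] = 2.0, 1.0  # made-up router scores for one token
routes = top2_route(logits)
print(routes)
```

Each token's output would then be the weighted sum of the two selected experts' outputs, while the other 30 experts do no work for that token.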
A frequently misreported detail is the model's context window. MiniMax-Text-01 was trained at sequence lengths up to 1 million tokens in a three-stage long-context curriculum, and the company demonstrates that the model extrapolates at inference time to 4 million tokens under affordable compute budgets — but the 4M figure is an inference-time extrapolation rather than a natively trained window.[^3][^5][^6] MiniMax reports 100% retrieval accuracy on a 4-million-token vanilla Needle-in-a-Haystack task to substantiate the extrapolation claim.[^1][^4]
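A vanilla Needle-in-a-Haystack trial has a simple shape: hide one fact at a chosen depth in a long filler text, ask for it back, and check the answer. The sketch below shows that shape only; the helper names, filler sentence and needle wording are all hypothetical, and it is not MiniMax's evaluation harness.

```python
def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Insert `needle` at fractional `depth` (0.0 = start, 1.0 = end) of a filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    sentences = [filler] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)


def is_retrieved(model_answer: str, secret: str) -> bool:
    """Score one trial: did the answer reproduce the hidden value?"""
    return secret in model_answer


secret = "7421"
needle = f"The magic number is {secret}."
prompt = build_haystack(
    filler="The sky was a uniform grey that day.",
    needle=needle,
    n_sentences=1000,
    depth=0.5,
)
question = "What is the magic number mentioned in the text above?"
```

A full evaluation would sweep both the haystack length and the needle depth and report the fraction of trials for which `is_retrieved` returns true.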
The long-context training proceeded in three stages: an initial 8K-token pretraining stage (RoPE base 10,000), a 128K stage on roughly 300 billion tokens (RoPE base 5,000,000), and a final stage at 512K → 1M tokens (RoPE base 10,000,000), with linear interpolation of the data-source mixing weights used to smooth the distribution shift between phases.[^18]
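One way to see why the RoPE base is raised at each stage is to compute the wavelength of the slowest-rotating frequency pair, which grows with the base. The sketch below uses the standard RoPE frequency formula; the rotary dimension of 64 is an assumption (RoPE on half of an assumed 128-wide head), the exact token counts are rounded powers of two chosen for illustration, and the stage labels are hypothetical.

```python
import math

ROTARY_DIM = 64  # assumption: RoPE applied to half of a 128-wide attention head

# (label, approximate training length in tokens, RoPE base from the text above)
STAGES = [
    ("8K pretraining", 8_192, 10_000),
    ("128K extension", 131_072, 5_000_000),
    ("1M extension", 1_048_576, 10_000_000),
]


def longest_wavelength(base: float, rotary_dim: int = ROTARY_DIM) -> float:
    """Wavelength, in tokens, of the slowest-rotating RoPE frequency pair."""
    slowest_freq = base ** (-(rotary_dim - 2) / rotary_dim)
    return 2 * math.pi / slowest_freq


for label, length, base in STAGES:
    print(f"{label}: length={length:,} longest wavelength={longest_wavelength(base):,.0f}")
```

Under these assumptions the longest wavelength exceeds the training length at every stage, so the slowest rotary component never completes a full rotation within a training sequence.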
The MiniMax-01 technical report does not disclose the precise composition of its pretraining corpus, but third-party analyses converge on a figure of approximately 12 trillion training tokens for MiniMax-Text-01, processed on a cluster of roughly 2,000 NVIDIA H800 GPUs using AdamW with WSD-like (warmup-stable-decay) learning rate scheduling.[^18][^19] The batch size was progressively warmed up from 16 million to 128 million tokens during training — a scaling technique the authors describe as critical for stability at this size.[^18]
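A warmup-stable-decay schedule and a stepped batch-size ramp can be sketched as follows. Every numeric setting here is hypothetical except the 16-million and 128-million-token endpoints quoted above: the warmup and decay fractions, the floor ratio, the intermediate batch sizes and the progress thresholds are placeholders, not values from the technical report.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1,
           min_lr_ratio: float = 0.1) -> float:
    """Warmup-stable-decay: linear warmup, flat plateau, linear decay to a floor."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start) / decay_steps
    return peak_lr * (1.0 - (1.0 - min_lr_ratio) * progress)


# (fraction of training completed, batch size in tokens); thresholds are hypothetical.
BATCH_RAMP = [(0.00, 16_000_000), (0.05, 32_000_000),
              (0.15, 64_000_000), (0.30, 128_000_000)]


def batch_tokens(progress: float) -> int:
    """Return the batch size in effect at a given fraction of training."""
    size = BATCH_RAMP[0][1]
    for start, tokens in BATCH_RAMP:
        if progress >= start:
            size = tokens
    return size


for step in (0, 500, 999):
    print(step, wsd_lr(step, total_steps=1000, peak_lr=1e-3))
print(batch_tokens(0.0), batch_tokens(0.5))
```

The long flat plateau is what distinguishes this family of schedules from cosine decay: the learning rate stays at its peak for most of training and is lowered only in a final phase.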
For data curation, the team used an earlier 60B-total / 5B-active "MoE classifier" to label content, deduplicated high-quality data at a higher ratio (4×) than low-quality data (2×), and tracked a custom byte-normalized accuracy metric (acc_norm²) alongside conventional held-out perplexity to guide data mixing decisions.[^18] Post-training combined short-context supervised fine-tuning, long-context SFT, offline DPO and online GRPO, applied iteratively to lift long-context capability without sacrificing short-context performance.[^18]
Released alongside MiniMax-Text-01, MiniMax-VL-01 is the vision-language variant of the system. It adopts the ViT-MLP-LLM framework: a 303-million-parameter Vision Transformer encoder trained from scratch on 694 million image-caption pairs, a two-layer randomly initialized MLP projector, and MiniMax-Text-01 as the underlying language model.[^20][^21] The ViT supports dynamic-resolution inputs from 336×336 to 2016×2016, with images split into non-overlapping patches whose features are concatenated with a 336×336 thumbnail representation.[^20]
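The effect of dynamic resolution on input size can be estimated by counting tiles. The sketch below assumes that the non-overlapping patches are 336×336 tiles matching the thumbnail size, which the description above suggests but does not state outright; the function name and the rounding-up behavior for partially filled tiles are likewise assumptions for illustration.

```python
import math

TILE = 336        # assumption: tiles match the 336x336 thumbnail size
MAX_SIDE = 2016   # largest supported side length, from the text above


def tile_count(width: int, height: int, include_thumbnail: bool = True) -> int:
    """Count the 336x336 tiles covering an image, plus the optional thumbnail."""
    if not (TILE <= width <= MAX_SIDE and TILE <= height <= MAX_SIDE):
        raise ValueError("image side outside the supported 336..2016 range")
    tiles = math.ceil(width / TILE) * math.ceil(height / TILE)
    return tiles + (1 if include_thumbnail else 0)


print(tile_count(336, 336))    # 2
print(tile_count(2016, 2016))  # 37
```

Under these assumptions the largest supported image yields a 6 × 6 grid of tiles plus the thumbnail, so the visual token count can vary by more than an order of magnitude between the smallest and largest inputs.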
The combined VL system was trained on a total of 512 billion vision-language tokens across four pipeline stages, and MiniMax reported strong results on document understanding (DocVQA 96.4) and diagram QA (AI2D 91.7), as well as the multimodal MMMU benchmark.[^20][^21] TechCrunch noted that while MiniMax-VL-01 matched Claude 3.5 Sonnet on chart-understanding tasks like ChartQA, it did not consistently beat Gemini 2.0 Flash, GPT-4o or the open InternVL 2.5 across all multimodal benchmarks.[^7]
MiniMax-Text-01 was benchmarked head-to-head against GPT-4o (0806), Claude 3.5 Sonnet (1022), Gemini 2.0 Flash, DeepSeek V3, Qwen2.5-72B-Instruct and Llama-3.1-405B-Instruct.[^3][^22] Selected scores from the model card and arXiv report:
Long-context evaluation is where MiniMax-Text-01 most clearly differentiates itself. On the Needle-in-a-Haystack retrieval task it reports 100% accuracy out to 4 million tokens, and on the RULER synthetic long-context benchmark it scores 0.910 at the 1-million-token setting, substantially ahead of contemporaries whose effective contexts were limited to 128K–200K tokens.[^3][^4] On LongBench v2 (with chain-of-thought) it scored an overall 56.5, leading most competitors on the 1M-token tasks.[^3]
The consensus reading in technical coverage was that MiniMax-Text-01 reached a "frontier-comparable" tier on general knowledge and instruction following, lagged somewhat on math, reasoning and coding benchmarks relative to DeepSeek V3 and Claude 3.5 Sonnet, but decisively led on long-context evaluations — a profile aligned with its hybrid linear-softmax design.[^4][^7][^22]
MiniMax-Text-01 is distributed on Hugging Face at MiniMaxAI/MiniMax-Text-01 with a two-license structure: model weights under a proprietary MiniMax Model Agreement (file LICENSE-MODEL) and accompanying inference code under the standard MIT License (file LICENSE-CODE).[^8][^9] The Model Agreement is widely considered "open weights" rather than "open source" because it imposes several use-restricting clauses, including:[^9][^23]
A community discussion on the Hugging Face repository explicitly objected that these clauses make the license incompatible with the Open Source Initiative's Open Source Definition and with free-software distribution policies; MiniMax closed the discussion in March 2025 stating that its restrictions focus on illegal activities and large-scale commercial use.[^9][^23] TechCrunch's initial coverage flagged the license as "restrictive," noting it "prohibit[s] using the models to improve rival AI systems" and requires special licensing for the largest platforms.[^7]
Deployment paths supported at launch included downloading raw weights from Hugging Face, running locally via Hugging Face Transformers or vLLM (vLLM was the officially recommended option), accessing the official MiniMax API at $0.20/M input and $1.10/M output tokens, and using the model through the Hailuo AI consumer interface at hailuo.ai.[^1][^2][^4] The model also became available through OpenRouter and a number of third-party hosting providers.[^4]
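The launch pricing makes the cost of a long-context request easy to estimate. The helper below is a hypothetical convenience function, not part of any MiniMax SDK; it uses only the two per-million-token prices quoted above and ignores any caching discounts, rate limits or later price changes.

```python
INPUT_USD_PER_M = 0.20   # launch price per million input tokens
OUTPUT_USD_PER_M = 1.10  # launch price per million output tokens


def api_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in US dollars of one request at launch pricing."""
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# A one-million-token prompt with a 2,000-token answer.
print(round(api_cost_usd(1_000_000, 2_000), 4))
```

At these rates a full one-million-token prompt costs about twenty cents in input charges, which is the basis for the context-per-dollar comparisons made elsewhere in this article.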
The most prominent first-party deployment of MiniMax-Text-01 has been Hailuo AI, MiniMax's consumer assistant. On January 15, 2025, the official Hailuo AI account on X announced that MiniMax-01 had gone live across the platform, positioning the 4M-token context window as the differentiating feature for long-document analysis and agentic use cases.[^24] MiniMax also pointed to use of the model inside its enterprise API platform (intl.minimaxi.com) for developers building agentic workflows and long-document analysis tools.[^1]
Beyond first-party deployment, the model has been integrated by third-party API aggregators — including OpenRouter, AIMLAPI and others — typically positioned as a long-context-specialist tier rather than a general-purpose default.[^25] Several deep-research and document-summarization startups adopted MiniMax-Text-01 specifically for use cases involving hundreds of thousands to millions of tokens of input — for example, codebase-wide reasoning, legal-discovery review and large-collection summarization — where competitor APIs either could not match the context window or charged substantially more.[^4]
Initial press coverage emphasized two themes: the context-length advantage and the model's status as a Chinese open-weights entrant. TechCrunch described MiniMax-Text-01 as "competitive with the industry's best" on selected benchmarks but cautioned about the license and copyright lawsuits MiniMax was facing from Chinese streaming company iQiyi.[^7] VentureBeat called the release an "industry-leading 4M token context" launch and highlighted the price-per-token advantage.[^26] MarkTechPost emphasized the architectural novelty of deploying lightning attention at the 456B-parameter scale.[^2]
Within the technical community, the Hugging Face deep dive by Elie Bakouch praised the depth of ablations in the technical report and the elegance of the 7:1 hybrid attention scheme, but criticized the relative lack of long-context benchmarks beyond Needle-in-a-Haystack, the use of fixed learning rates that could bias scaling-law comparisons, and the absence of direct head-to-head comparisons against DeepSeek V3 in the report's own tables.[^18]
The most consistent point of criticism, however, was the license. Open-source advocates objected that calling the release "open-source" was misleading given the MAU cap, attribution requirements and anti-distillation clauses.[^9][^23] These critiques would resurface, considerably amplified, around MiniMax's later models — most prominently when MiniMax-M2.7 launched under a license that explicitly required commercial authorization.[^27]
MiniMax-Text-01 served as the literal pretraining backbone for MiniMax-M1, released on June 16, 2025 and described in the arXiv preprint "MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention" (arXiv:2506.13585).[^10][^11] To build M1, MiniMax continued pretraining MiniMax-Text-01 on an additional 7.5 trillion tokens of reasoning-intensive data, then performed supervised fine-tuning on chain-of-thought traces, and finally trained the model with large-scale reinforcement learning using a novel CISPO (Clipped Importance Sampling Policy Optimization) algorithm.[^10][^11] M1 retained the 456B/45.9B MoE structure and the 7:1 hybrid attention pattern of MiniMax-Text-01, demonstrating the durability of the underlying foundation.[^10][^11]
M1 was released in two thinking-budget variants (40K and 80K tokens). The team emphasized FLOP efficiency at long generation lengths, reporting that at a generation length of 100K tokens M1 consumed roughly 25% of the FLOPs of DeepSeek-R1 — a direct payoff of the lightning-attention architecture inherited from MiniMax-Text-01.[^11] MiniMax later released MiniMax M2 (October 2025) and M2.5/M2.7 variants in 2026, though by M2 the company had moved to a different base architecture optimized for agentic tool use, marking the end of the direct MiniMax-Text-01 lineage as the active production system.[^11][^28]
At the time of release in January 2025, MiniMax-Text-01 sat in a small cluster of frontier-comparable Chinese open-weights MoE models:
Within this cohort MiniMax-Text-01 occupied a distinct niche: matching frontier general performance on most benchmarks, lagging slightly on the hardest reasoning and coding evaluations, but offering a context-window-per-dollar advantage that no other open-weights model approached in early 2025.[^4][^7][^18]
MiniMax-Text-01 is significant for three reasons that technical commentary has repeatedly emphasized.
First, it was the first publicly released MoE model at the hundreds-of-billions scale to make a serious commitment to a linear-attention variant — placing lightning attention in 70 of its 80 layers rather than as an isolated research artifact. The success of this architecture in matching softmax-attention models on general benchmarks while enabling 1M-token training contexts has been read as a partial validation of the linear-attention research program that produced systems like Mamba, RWKV, HGRN and Cosformer.[^6][^18]
Second, the release demonstrated a practical engineering path to multi-million-token inference contexts in a commodity-deployable, open-weights model — establishing a higher bar that subsequent releases (Gemini 2.5 Pro, DeepSeek V3.2, later MiniMax models) would have to compete against on long-context evaluations.[^4][^10]
Third, the license structure that accompanied MiniMax-Text-01 — open weights with a 100M-MAU commercial cap, anti-distillation clause and attribution branding requirement — became a template that MiniMax would refine and tighten across later models, and is frequently cited in discussions of what "open" should mean for foundation models in the Chinese AI ecosystem.[^9][^23][^27]
Subsequent to the MiniMax-Text-01 release, MiniMax has been the subject of two notable legal and reputational controversies. In September 2025, Disney, Universal Pictures and Warner Bros. Discovery jointly sued MiniMax in the U.S. District Court for the Central District of California, alleging that Hailuo AI (the consumer surface that hosts MiniMax-Text-01 and MiniMax's video generators) "pirates and plunders" the studios' copyrighted works on a massive scale, by generating high-fidelity videos of characters such as Darth Vader, the Minions and Superman in response to user prompts.[^29] The complaint was widely seen as an extension of the studios' earlier suit against Midjourney into the Chinese AI sector.[^29]
In February 2026, Anthropic separately accused MiniMax (along with DeepSeek and Moonshot AI) of running "industrial-scale" distillation campaigns against Claude, alleging that MiniMax used fraudulent accounts to generate the majority of an aggregate 16 million Claude interactions for use in training; Anthropic stated that MiniMax alone accounted for more than 13 million of those queries.[^30] Anthropic has not, as of mid-2026, filed suit but said evidence packages are ready if negotiations fail.[^30] These controversies concern MiniMax's broader practices rather than MiniMax-Text-01 specifically, but they have shaped subsequent commentary on the company's models, including reassessments of the MiniMax-Text-01 release.