MiniMax-Text-01

Updated Jun 9, 2026

MiniMax-Text-01 is an open-weights, large-scale mixture-of-experts (MoE) language model released by Shanghai-based AI company MiniMax on January 14, 2025.^[1]^[2] The model contains 456 billion total parameters with 45.9 billion activated per token, and is notable for being among the first openly released foundation models to deploy a "lightning attention" linear-attention variant at production scale, supporting a native training context of 1 million tokens and inference extrapolation up to 4 million tokens.^[3]^[4]^[5] Together with its multimodal sibling MiniMax-VL-01, the model forms the "MiniMax-01" series and was accompanied by an extensive technical report ("MiniMax-01: Scaling Foundation Models with Lightning Attention," arXiv:2501.08313) co-authored by more than one hundred researchers.^[5]^[6]

Positioned by MiniMax as on par with frontier closed-weight systems such as GPT-4o and Claude 3.5 Sonnet while offering a context window 20 to 32 times longer, MiniMax-Text-01 attracted attention as one of two flagship Chinese open-weights launches in January 2025, alongside DeepSeek V3.^[1]^[4]^[7] The model is distributed on Hugging Face under the proprietary "MiniMax Model Agreement" (not a standard open-source license) and the accompanying inference code is released separately under the MIT License.^[8]^[9] MiniMax-Text-01 later served as the pretraining base for the company's reinforcement-learning-driven reasoning model MiniMax-M1 (June 2025) and, indirectly, for subsequent generations including the agentic MiniMax M2.^[10]^[11]

Background

MiniMax the company

MiniMax (full name MiniMax AI; Chinese: 稀宇科技, "Xiyu Technology") was incorporated in Shanghai in December 2021 by CEO Yan Junjie, a former vice president of research and development at SenseTime, along with co-founders Yang Bin, Zhou Yucong and Yun Yeyi, also drawn from SenseTime.^[12]^[13] Yan had received his PhD from the Chinese Academy of Sciences' Institute of Automation in 2015 before joining SenseTime as one of its youngest VPs; the company's name is a reference to the minimax search algorithm used in classical game-theoretic AI.^[12]^[13]

Within the Chinese AI ecosystem, MiniMax is commonly grouped with DeepSeek, Zhipu, Moonshot, Baichuan and 01.AI as one of the country's so-called "Six Little Tigers" (六小虎) of large-language-model development.^[13]^[14] Prior to the MiniMax-01 series, the company had built consumer products including the AI character-chat platforms Talkie (international) and Xingye (China), the multimodal assistant Hailuo AI, and a proprietary MoE family branded "abab," with abab 6.5 (released in April 2024) being the immediate technical predecessor of MiniMax-01.^[13]^[15] By July 2024 The Wall Street Journal reported Talkie had reached approximately 11 million monthly active users in the United States.^[13]

In March 2024 Alibaba Group led a roughly $600 million funding round that valued MiniMax at approximately $2.5 billion; other backers across earlier rounds include Tencent, Hillhouse Investment, HongShan, IDG Capital and game studio miHoYo.^[12]^[14] The company later listed on the Hong Kong Stock Exchange on January 9, 2026, with shares more than doubling on debut. When MiniMax-Text-01 was released in January 2025, however, MiniMax was still privately held and considerably less internationally visible than DeepSeek, which dominated headlines later that month with the release of DeepSeek-R1.^[16]^[17]

The MiniMax-01 launch

MiniMax announced the MiniMax-01 series on its corporate news page on January 14, 2025 (Beijing time) and pushed model weights and a 68-page technical report to Hugging Face, GitHub and arXiv on January 15.^[1]^[2]^[5] The release comprised two open-weight models, the text-only MiniMax-Text-01 and the vision-language MiniMax-VL-01, together with an API offering at $0.20 per million input tokens and $1.10 per million output tokens, plus consumer access through Hailuo AI.^[1]^[4] MiniMax framed the release as targeting "the AI Agent era," arguing that the dramatic context-length lead over GPT-4o (128K) and Claude 3.5 Sonnet (200K) would enable long-horizon agentic workflows requiring extensive memory and tool use.^[1]^[4]

The launch came about two weeks before DeepSeek-R1 caused global financial-market reverberations on January 27, 2025, and it belonged to the same broader news cycle of Chinese open-weight models exceeding the capabilities of comparable Western releases. Both TechCrunch and VentureBeat covered the launch on January 15, with TechCrunch noting that MiniMax was "Alibaba- and Tencent-backed" and had raised approximately $850 million in venture funding.^[4]^[7]^[14]

Architecture

MiniMax-Text-01 is a decoder-only transformer with several architectural choices that diverge meaningfully from contemporary MoE models such as DeepSeek V3 (671B/37B) and Llama-style dense designs.^[3]^[5]^[18]

Overall configuration

The headline specifications, as documented in both the model card and the arXiv report, are:^[3]^[5]^[6]

Total parameters: 456 billion
Active parameters per token: 45.9 billion
Layers: 80 transformer blocks
Hidden size: 6,144
Attention heads (softmax layers): 64, each of dimension 128
Vocabulary: 200,064 tokens
Positional encoding: Rotary Position Embedding (RoPE) applied to half of each attention head's dimension; base frequency 10,000,000 in the final long-context training stage
MoE configuration: 32 experts per MoE block, expert hidden dimension 9,216, top-2 routing with no shared expert
Activated MLP width per layer: 18,432

Compared with DeepSeek V3, which uses 256 routed experts plus one shared expert with eight active per token, MiniMax-Text-01 favors a smaller number of wider experts with simpler routing and no shared expert.^[18] An external Hugging Face analysis notes that this produces a comparable activated MLP width per layer (18,432, i.e. two experts of width 9,216) but a noticeably "deeper" 80-layer stack versus DeepSeek V3's 61, a choice the authors attribute to the depth benefits of mostly-linear attention.^[18]
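The headline figures in the configuration list can be roughly cross-checked with a back-of-envelope count. The sketch below is not taken from the report: it assumes gated three-matrix expert MLPs, four dense attention projections in every layer, and untied input and output embeddings, and it ignores norm, router and gate weights. The function name `estimate_params` is hypothetical.

```python
# Back-of-envelope parameter count from the published configuration.
# Assumptions (not from the report): gated three-matrix expert MLPs,
# four dense attention projections per layer, untied embeddings,
# and no norm, router, or gate weights.

LAYERS = 80
HIDDEN = 6_144
HEADS, HEAD_DIM = 64, 128
VOCAB = 200_064
EXPERTS, TOP_K = 32, 2
EXPERT_DIM = 9_216


def estimate_params() -> dict:
    expert = 3 * HIDDEN * EXPERT_DIM            # gate, up, down projections
    attention = 4 * HIDDEN * HEADS * HEAD_DIM   # Q, K, V, output projections
    embeddings = 2 * VOCAB * HIDDEN             # input table + output head
    total = LAYERS * (EXPERTS * expert + attention) + embeddings
    active = LAYERS * (TOP_K * expert + attention) + embeddings
    return {
        "total": total,
        "active": active,
        "activated_mlp_width": TOP_K * EXPERT_DIM,
    }


if __name__ == "__main__":
    est = estimate_params()
    print(f"total  ~ {est['total'] / 1e9:.1f}B")    # ~ 453.4B
    print(f"active ~ {est['active'] / 1e9:.1f}B")   # ~ 45.7B
    print(f"activated MLP width = {est['activated_mlp_width']}")  # 18432
```

Under these simplifications the estimate lands within about 1% of the published 456 billion total and 45.9 billion active parameters; the remaining gap is plausibly the weights the sketch leaves out.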

Lightning Attention and the hybrid stack

The model's defining feature is its hybrid attention pattern. Rather than using softmax attention in every layer, MiniMax-Text-01 employs Lightning Attention (an I/O-aware implementation of Lightning Attention-2, a linear-attention variant) in seven out of every eight layers, with a single softmax attention layer inserted as the eighth.^[5]^[6]^[18] This 7:1 ratio repeats throughout the 80-layer model, giving the model 70 lightning-attention layers and 10 softmax-attention layers in total.^[3]^[6]
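The 7:1 interleaving described above can be written down as a short layer schedule. This is an illustrative sketch rather than MiniMax's configuration code; `layer_types` is a hypothetical helper, and it assumes the softmax layer is the last layer of each group of eight.

```python
def layer_types(num_layers: int = 80, period: int = 8) -> list[str]:
    """Return the attention type of each layer, assuming the softmax
    layer is the last one in every group of `period` layers."""
    return [
        "softmax" if (i + 1) % period == 0 else "lightning"
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    types = layer_types()
    print(types[:8])
    print(types.count("lightning"), types.count("softmax"))  # 70 10
```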

Lightning Attention-2 replaces the standard quadratic O(n²d) softmax attention with an approximately linear O(d²n) computation by reformulating the attention computation as a sequence of matrix products on gated SiLU-projected queries, keys and values, followed by RMS normalization and a sigmoid gate.^[6]^[18] Because complexity is linear in sequence length, the per-token cost of attention stops growing with context, which is what permits million-token training contexts on practical hardware budgets.^[4]^[6]
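The change in complexity comes from reordering the matrix products. The sketch below shows plain causal linear attention computed two ways: once by forming every query-key score (quadratic in sequence length) and once by carrying a running key-value state (linear in sequence length). It deliberately omits the SiLU projections, decay, normalization and gating mentioned above, so it illustrates the general idea rather than the Lightning Attention kernel; both function names are hypothetical.

```python
# Plain causal linear attention, computed two ways.
# Simplifying assumptions: no softmax, decay, gating, or normalization,
# so this is NOT the Lightning Attention kernel itself.

Matrix = list[list[float]]


def quadratic_form(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Left product: build every (q_t . k_s) score, O(n^2 d) work."""
    n, d_v = len(q), len(v[0])
    out = [[0.0] * d_v for _ in range(n)]
    for t in range(n):
        for s in range(t + 1):                  # causal mask: s <= t
            score = sum(a * b for a, b in zip(q[t], k[s]))
            for j in range(d_v):
                out[t][j] += score * v[s][j]
    return out


def recurrent_form(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Right product: keep a running d x d state, O(n d^2) work."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]   # running sum of k_s^T v_s
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k_t[i] * v_t[j]
        out.append([
            sum(q_t[i] * state[i][j] for i in range(d_k))
            for j in range(d_v)
        ])
    return out


if __name__ == "__main__":
    q = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
    k = [[0.0, 1.0], [2.0, 1.0], [-1.0, 0.5]]
    v = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [4.0, -2.0, 0.5]]
    print(quadratic_form(q, k, v))
    print(recurrent_form(q, k, v))              # same values as above
```

The recurrent form does a fixed amount of work per token regardless of how many tokens came before, which is the property the paragraph above relies on.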

The retained softmax layers, equipped with RoPE on half of each head's dimension, act as periodic "global" mixing layers; ablations in the MiniMax-01 paper indicate that pure lightning-attention models suffered substantial accuracy losses on retrieval tasks such as Needle-in-a-Haystack, while sliding-window softmax alternatives lagged on broader generation quality. The 7:1 hybrid achieved the best overall trade-off in MiniMax's evaluations.^[18] An external deep dive by Hugging Face engineer Elie Bakouch describes this as "the first commercial-grade implementation of linear attention at scale," noting that prior linear-attention systems (Mamba, Cosformer2, HGRN2) had not been deployed in production-class MoE models.^[18]

For training and inference, MiniMax pairs the architecture with custom tooling including Linear Attention Sequence Parallelism Plus (LASP+), variable-length ring attention, expert tensor parallelism and optimized CUDA kernels; the company reports achieving over 75% Model FLOPs Utilization (MFU) on NVIDIA H20 GPUs for end-to-end inference.^[2]^[4]

Mixture of experts and routing

MiniMax-Text-01's MoE block uses 32 experts of hidden dimension 9,216 with top-2 routing, meaning each token activates two of the 32 experts per MoE layer. Routing is governed by a "global router" that distributes tokens across each expert-parallel group, paired with a conventional auxiliary load-balancing loss, differing from DeepSeek V3's auxiliary-loss-free dropless routing.^[18] The architecture has no shared (always-on) expert, a deliberate simplification relative to the DeepSeek family.^[18]
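A minimal sketch of top-2 routing with an auxiliary balancing term follows. The source only says the auxiliary loss is "conventional"; the Switch-Transformer-style formulation used here (number of experts times the sum of dispatch fraction times mean router probability) is an assumption, as are the function names and the renormalization of the two selected gates. It does not model the global router across expert-parallel groups.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)                          # subtract max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(logits: list[float]) -> list[tuple[int, float]]:
    """Pick the two highest-scoring experts and renormalize their gates."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:2]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def load_balance_loss(batch_logits: list[list[float]]) -> float:
    """Assumed Switch-style auxiliary loss; equals 1.0 when balanced."""
    num_experts = len(batch_logits[0])
    counts = [0] * num_experts
    mean_prob = [0.0] * num_experts
    for logits in batch_logits:
        for i, p in enumerate(softmax(logits)):
            mean_prob[i] += p / len(batch_logits)
        for i, _ in top2_route(logits):
            counts[i] += 1
    slots = 2 * len(batch_logits)               # two expert slots per token
    return num_experts * sum(
        c / slots * p for c, p in zip(counts, mean_prob)
    )


if __name__ == "__main__":
    logits = [0.0] * 32                         # one router score per expert
    logits[5], logits[9] = 3.0, 2.0
    print(top2_route(logits))                   # experts 5 and 9
    print(load_balance_loss([logits] * 4))      # > 1: all tokens hit 5 and 9
```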

Context length: training vs. extrapolation

A frequently misreported detail is the model's context window. MiniMax-Text-01 was trained at sequence lengths up to 1 million tokens in a three-stage long-context curriculum, and the company demonstrates that the model extrapolates at inference time to 4 million tokens under affordable compute budgets; the 4M figure is an inference-time extrapolation rather than a natively trained window.^[3]^[5]^[6] MiniMax reports 100% retrieval accuracy on a 4-million-token vanilla Needle-in-a-Haystack task to substantiate the extrapolation claim.^[1]^[4]

The long-context training proceeded in three stages: an initial 8K-token pretraining stage (RoPE base 10,000), a 128K stage on roughly 300 billion tokens (RoPE base 5,000,000), and a final stage at 512K → 1M tokens (RoPE base 10,000,000), with linear interpolation between weight checkpoints used to prevent distribution shift between phases.^[18]
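One way to see why the RoPE base rises with the training length is to compute the longest rotary wavelength for each stage. The sketch below uses the standard RoPE frequency formula with a rotary dimension of 64 (half of the 128-dimensional head, per the configuration above). Whether MiniMax chose its bases by this criterion is not stated in the source, and `longest_wavelength` is a hypothetical helper.

```python
import math


def longest_wavelength(base: float, rotary_dim: int = 64) -> float:
    """Longest RoPE wavelength in tokens, using the standard frequencies
    theta_i = base ** (-2 * i / rotary_dim) for i in [0, rotary_dim / 2)."""
    return 2 * math.pi * base ** ((rotary_dim - 2) / rotary_dim)


# (sequence length, RoPE base) for the three stages described above.
STAGES = [(8_192, 10_000), (131_072, 5_000_000), (1_048_576, 10_000_000)]

if __name__ == "__main__":
    for seq_len, base in STAGES:
        wave = longest_wavelength(base)
        print(f"seq_len={seq_len:>9,}  base={base:>10,}  "
              f"longest wavelength ~ {wave:,.0f} tokens")
```

In each stage the longest wavelength exceeds the training sequence length, so the slowest-rotating dimensions never complete a full cycle inside the window.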

Training data and infrastructure

The MiniMax-01 technical report does not disclose the precise composition of its pretraining corpus, but third-party analyses converge on a figure of approximately 12 trillion training tokens for MiniMax-Text-01, processed on a cluster of roughly 2,000 NVIDIA H800 GPUs using AdamW with WSD-like (warmup-stable-decay) learning rate scheduling.^[18]^[19] The batch size was progressively warmed up from 16 million to 128 million tokens during training, a scaling technique the authors describe as critical for stability at this size.^[18]
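For readers unfamiliar with warmup-stable-decay scheduling, the shape can be sketched as below. The warmup and decay fractions, the linear decay and the function name `wsd_lr` are all illustrative assumptions; the source does not give MiniMax's actual schedule parameters.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1,
           min_lr: float = 0.0) -> float:
    """Warmup-stable-decay schedule with linear warmup and linear decay.
    All fractions here are hypothetical defaults, not MiniMax's values."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:                     # ramp up to the peak
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:                      # long flat plateau
        return peak_lr
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    for step in (0, 9, 500, 950, 1000):
        print(step, wsd_lr(step, total_steps=1000, peak_lr=1.0))
```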

For data curation, the team used an earlier 60B-total / 5B-active "MoE classifier" to label content, deduplicated high-quality data at a higher ratio (4×) than low-quality data (2×), and tracked a custom byte-normalized accuracy metric (acc_norm²) alongside conventional held-out perplexity to guide data mixing decisions.^[18] Post-training combined short-context supervised fine-tuning, long-context SFT, offline DPO and online GRPO, applied iteratively to lift long-context capability without sacrificing short-context performance.^[18]

MiniMax-VL-01

Released alongside MiniMax-Text-01, MiniMax-VL-01 is the vision-language variant of the system. It adopts the ViT-MLP-LLM framework: a 303-million-parameter Vision Transformer encoder trained from scratch on 694 million image-caption pairs, a two-layer randomly initialized MLP projector, and MiniMax-Text-01 as the underlying language model.^[20]^[21] The ViT supports dynamic-resolution inputs from 336×336 to 2016×2016, with images split into non-overlapping patches whose features are concatenated with a 336×336 thumbnail representation.^[20]
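The dynamic-resolution scheme can be illustrated by counting the views produced for a given input size. The sketch assumes the tiles are 336×336 (matching the thumbnail size) and that the resized image dimensions are exact multiples of 336; neither detail is confirmed by the source, and `tile_grid` is a hypothetical helper.

```python
def tile_grid(width: int, height: int, tile: int = 336) -> dict:
    """Count non-overlapping tiles plus one thumbnail view.
    Assumes the image was already resized to multiples of `tile`."""
    if width <= 0 or height <= 0 or width % tile or height % tile:
        raise ValueError("dimensions must be positive multiples of the tile size")
    cols, rows = width // tile, height // tile
    tiles = cols * rows
    return {"cols": cols, "rows": rows, "tiles": tiles, "views": tiles + 1}


if __name__ == "__main__":
    print(tile_grid(336, 336))      # smallest supported input
    print(tile_grid(2016, 2016))    # largest: 6 x 6 tiles plus the thumbnail
```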

The combined VL system was trained on a total of 512 billion vision-language tokens across four pipeline stages, and MiniMax reported strong results on document understanding (DocVQA 96.4) and diagram QA (AI2D 91.7), as well as the multimodal MMMU benchmark.^[20]^[21] TechCrunch noted that while MiniMax-VL-01 matched Claude 3.5 Sonnet on chart-understanding tasks like ChartQA, it did not consistently beat Gemini 2.0 Flash, GPT-4o or the open InternVL 2.5 across all multimodal benchmarks.^[7]

Benchmark performance

MiniMax-Text-01 was benchmarked head-to-head against GPT-4o (0806), Claude 3.5 Sonnet (1022), Gemini 2.0 Flash, DeepSeek V3, Qwen2.5-72B-Instruct and Llama-3.1-405B-Instruct.^[3]^[22] Selected scores from the model card and arXiv report:

General knowledge and instruction following

MMLU (0-shot CoT): 88.5, comparable to DeepSeek V3 (88.5), Claude 3.5 Sonnet (88.3) and Llama-3.1-405B (88.6).^[3]^[22]
MMLU-Pro (5-shot CoT): 75.7, narrowly behind DeepSeek V3 (75.9 in some splits).^[22]
IFEval (average accuracy): 89.1, outperforming DeepSeek V3 (87.3) but trailing Claude 3.5 Sonnet (90.1).^[3]^[22]
C-SimpleQA (Chinese factual QA): 67.4, leading GPT-4o (64.8).^[3]
Arena-Hard: 89.1, below DeepSeek V3 (91.4).^[22]

Reasoning and math

GPQA Diamond: 54.4, trailing DeepSeek V3 (59.1) and Claude 3.5 Sonnet (65.0).^[22]
DROP (F1): 87.8, behind DeepSeek V3 (91.0).^[22]
GSM8K (8-shot CoT): 94.8, behind Claude 3.5 Sonnet (96.9) and DeepSeek V3 (96.7).^[3]^[22]
MATH (0-shot): 77.4, behind DeepSeek V3 (84.6) and Gemini 1.5 Pro (84.6).^[22]

Coding

HumanEval: 77.4, slightly behind DeepSeek V3 in MiniMax's reported tables.^[22]
MBPP+: 71.7, trailing DeepSeek V3 (78.8).^[22]

Long-context

This is where MiniMax-Text-01 most clearly differentiates itself. On the Needle-in-a-Haystack retrieval task it reports 100% accuracy out to 4 million tokens, and on the Ruler synthetic long-context benchmark it scores 0.910 at the 1-million-token setting, substantially ahead of contemporaries whose effective contexts were limited to 128K-200K tokens.^[3]^[4] On LongBench v2 (with chain-of-thought) it scored an overall 56.5, leading most competitors on the 1M-token tasks.^[3]

The consensus reading in technical coverage was that MiniMax-Text-01 reached a "frontier-comparable" tier on general knowledge and instruction following, lagged somewhat on math, reasoning and coding benchmarks relative to DeepSeek V3 and Claude 3.5 Sonnet, but decisively led on long-context evaluations, a profile aligned with its hybrid linear-softmax design.^[4]^[7]^[22]

License and availability

MiniMax-Text-01 is distributed on Hugging Face at MiniMaxAI/MiniMax-Text-01 with a two-license structure: model weights under a proprietary MiniMax Model Agreement (file LICENSE-MODEL) and accompanying inference code under the standard MIT License (file LICENSE-CODE).^[8]^[9] The Model Agreement is widely considered "open weights" rather than "open source" because it imposes several use-restricting clauses, including:^[9]^[23]

MAU cap. Products or services with more than 100 million monthly active users require a separate commercial license negotiated directly with MiniMax.^[9]^[23]
Anti-distillation clause prohibiting use of model outputs to improve other large language models.^[9]^[23]
Attribution and branding requirements to "prominently display 'Built with MiniMax AI'" in user-facing surfaces, and to prefix derivative model names with "MiniMax."^[9]^[23]
Use-based restrictions prohibiting illegal use, generation of misinformation intended to harm, unauthorized handling of personally identifiable information, harassment, and hate speech.^[9]^[23]
IP-litigation termination clause that automatically terminates the license if the licensee sues MiniMax for IP infringement.^[9]^[23]

A community discussion on the Hugging Face repository explicitly objected that these clauses make the license incompatible with the Open Source Initiative's Open Source Definition and with free-software distribution policies; MiniMax closed the discussion in March 2025 stating that its restrictions focus on illegal activities and large-scale commercial use.^[9]^[23] TechCrunch's initial coverage flagged the license as "restrictive," noting it "prohibit[s] using the models to improve rival AI systems" and requires special licensing for the largest platforms.^[7]

Deployment paths supported at launch included downloading raw weights from Hugging Face, running locally via Hugging Face Transformers or vLLM (vLLM was the officially recommended option), accessing the official MiniMax API at $0.20/M input and $1.10/M output tokens, and using the model through the Hailuo AI consumer interface at hailuo.ai.^[1]^[2]^[4] The model also became available through OpenRouter and a number of third-party hosting providers.^[4]
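At the launch prices quoted above, the cost of a single long-context request is straightforward to estimate. The sketch below only does the arithmetic; the function name is hypothetical, and it assumes billing is linear in tokens with no caching discounts or minimum charges.

```python
INPUT_USD_PER_M = 0.20    # launch price per million input tokens
OUTPUT_USD_PER_M = 1.10   # launch price per million output tokens


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request, assuming simple linear billing."""
    return (
        input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    ) / 1_000_000


if __name__ == "__main__":
    # A 1M-token document with a 2,000-token answer.
    print(f"${request_cost_usd(1_000_000, 2_000):.4f}")  # $0.2022
```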

Use cases and deployment

The most prominent first-party deployment of MiniMax-Text-01 has been Hailuo AI, MiniMax's consumer assistant. On January 15, 2025, the official Hailuo AI account on X announced that MiniMax-01 had gone live across the platform, positioning the 4M-token context window as the differentiating feature for long-document analysis and agentic use cases.^[24] MiniMax also pointed to use of the model inside its enterprise API platform (intl.minimaxi.com) for developers building agentic workflows and long-document analysis tools.^[1]

Beyond first-party deployment, the model has been integrated by third-party API aggregators, including OpenRouter, AIMLAPI and others, typically positioned as a long-context-specialist tier rather than a general-purpose default.^[25] Several deep-research and document-summarization startups adopted MiniMax-Text-01 specifically for use cases involving hundreds of thousands to millions of tokens of input (for example, codebase-wide reasoning, legal-discovery review and large-collection summarization) where competitor APIs either could not match the context window or charged substantially more.^[4]

Reception

Initial press coverage emphasized two themes: the context-length advantage and the model's status as a Chinese open-weights entrant. TechCrunch described MiniMax-Text-01 as "competitive with the industry's best" on selected benchmarks but cautioned about the license and copyright lawsuits MiniMax was facing from Chinese streaming company iQiyi.^[7] VentureBeat called the release an "industry-leading 4M token context" launch and highlighted the price-per-token advantage.^[26] MarkTechPost emphasized the architectural novelty of deploying lightning attention at the 456B-parameter scale.^[2]

Within the technical community, the Hugging Face deep dive by Elie Bakouch praised the depth of ablations in the technical report and the elegance of the 7:1 hybrid attention scheme, but criticized the relative lack of long-context benchmarks beyond Needle-in-a-Haystack, the use of fixed learning rates that could bias scaling-law comparisons, and the absence of direct head-to-head comparisons against DeepSeek V3 in the report's own tables.^[18]

The most consistent point of criticism, however, was the license. Open-source advocates objected that calling the release "open-source" was misleading given the MAU cap, attribution requirements and anti-distillation clauses.^[9]^[23] These critiques would resurface, considerably amplified, around MiniMax's later models, most prominently when MiniMax-M2.7 launched under a license that explicitly required commercial authorization.^[27]

MiniMax-M1 and successors

MiniMax-Text-01 served as the literal pretraining backbone for MiniMax-M1, released on June 16, 2025 and described in the arXiv preprint "MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention" (arXiv:2506.13585).^[10]^[11] To build M1, MiniMax continued pretraining MiniMax-Text-01 on an additional 7.5 trillion tokens of reasoning-intensive data, then performed supervised fine-tuning on chain-of-thought traces, and finally trained the model with large-scale reinforcement learning using a novel CISPO (Clipped Importance Sampling Policy Optimization) algorithm.^[10]^[11] M1 retained the 456B/45.9B MoE structure and the 7:1 hybrid attention pattern of MiniMax-Text-01, demonstrating the durability of the underlying foundation.^[10]^[11]

M1 was released in two thinking-budget variants (40K and 80K tokens). The team emphasized FLOP efficiency at long generation lengths, reporting that at a generation length of 100K tokens M1 consumed roughly 25% of the FLOPs of DeepSeek-R1, a direct payoff of the lightning-attention architecture inherited from MiniMax-Text-01.^[11] MiniMax later released MiniMax M2 (October 2025) and M2.5/M2.7 variants in 2026, though by M2 the company had moved to a different base architecture optimized for agentic tool use, marking the end of the direct MiniMax-Text-01 lineage as the active production system.^[11]^[28]

Comparison to contemporaries

At the time of release in January 2025, MiniMax-Text-01 sat in a small cluster of frontier-comparable Chinese open-weights MoE models:

DeepSeek V3 (released December 26, 2024): 671B total / 37B active, 128K context, scoring slightly higher than MiniMax-Text-01 on math, coding and GPQA but with a 31× smaller maximum context window. DeepSeek V3 used a fine-grained MoE with 256 routed experts plus a shared expert; MiniMax-Text-01 used a coarser-grained 32-expert design with no shared expert.^[18]^[22]
Qwen2.5-72B-Instruct (from Alibaba's Qwen team, 2024): dense 72B model used by MiniMax as a comparison point on general benchmarks; far smaller context window but competitive on MMLU and IFEval.^[22]
GPT-4o (0806 snapshot) and Claude 3.5 Sonnet (1022 snapshot): closed-weights frontier systems with 128K and 200K context windows respectively, against which MiniMax-Text-01 trailed on reasoning and coding but led decisively on long context.^[3]^[22]
Gemini 1.5 Pro / 2.0 Flash: Google's long-context family, with Gemini 1.5 Pro offering 1M (and experimentally 2M) tokens. MiniMax claimed parity with or superiority to Gemini 2.0 Flash on MMLU and SimpleQA, but Gemini's contexts at this point did not match the 4M-token extrapolation claim.^[4]^[7]
Llama-3.1-405B-Instruct: Meta's flagship open-weights dense model with 128K context; broadly comparable on MMLU but substantially behind on long-context performance.^[22]

Within this cohort MiniMax-Text-01 occupied a distinct niche: matching frontier general performance on most benchmarks, lagging slightly on the hardest reasoning and coding evaluations, but offering a context-window-per-dollar advantage that no other open-weights model approached in early 2025.^[4]^[7]^[18]
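The context-window multiples quoted in this article follow from simple division, as the sketch below shows. It assumes decimal token counts (128K = 128,000, 200K = 200,000, 4M = 4,000,000) and compares against the 4M extrapolated window rather than the 1M trained window.

```python
EXTRAPOLATED_CONTEXT = 4_000_000   # MiniMax-Text-01 inference extrapolation

# Context windows of contemporaries, assuming decimal "K".
COMPETITORS = {
    "GPT-4o": 128_000,
    "DeepSeek V3": 128_000,
    "Llama-3.1-405B-Instruct": 128_000,
    "Claude 3.5 Sonnet": 200_000,
}


def context_ratio(other_window: int) -> float:
    """How many times longer the 4M window is than another window."""
    return EXTRAPOLATED_CONTEXT / other_window


if __name__ == "__main__":
    for name, window in COMPETITORS.items():
        print(f"{name}: {context_ratio(window):.2f}x")
```

Against the natively trained 1M-token window the same ratios shrink by a factor of four.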

Significance

MiniMax-Text-01 is significant for three reasons that commentary in the technical literature has repeatedly emphasized.

First, it was the first publicly released MoE model at the hundreds-of-billions scale to make a serious commitment to a linear-attention variant, placing lightning attention in 70 of its 80 layers rather than as an isolated research artifact. The success of this architecture in matching softmax-attention models on general benchmarks while enabling 1M-token training contexts has been read as a partial validation of the linear-attention research program that produced systems like Mamba, RWKV, HGRN and Cosformer.^[6]^[18]

Second, the release demonstrated a practical engineering path to multi-million-token inference contexts in a commodity-deployable, open-weights model, establishing a higher bar that subsequent releases (Gemini 2.5 Pro, DeepSeek V3.2, later MiniMax models) would have to compete against on long-context evaluations.^[4]^[10]

Third, the license structure that accompanied MiniMax-Text-01, open weights with a 100M-MAU commercial cap, anti-distillation clause and attribution branding requirement, became a template that MiniMax would refine and tighten across later models, and is frequently cited in discussions of what "open" should mean for foundation models in the Chinese AI ecosystem.^[9]^[23]^[27]

Controversies

Subsequent to the MiniMax-Text-01 release, MiniMax has been the subject of two notable legal and reputational controversies. In September 2025, Disney, Universal Pictures and Warner Bros. Discovery jointly sued MiniMax in the U.S. District Court for the Central District of California, alleging that Hailuo AI (the consumer surface that hosts MiniMax-Text-01 and MiniMax's video generators) "pirates and plunders" the studios' copyrighted works on a massive scale, by generating high-fidelity videos of characters such as Darth Vader, the Minions and Superman in response to user prompts.^[29] The complaint was widely seen as an extension of the studios' earlier suit against Midjourney into the Chinese AI sector.^[29]

In February 2026, Anthropic separately accused MiniMax (along with DeepSeek and Moonshot AI) of running "industrial-scale" distillation campaigns against Claude, alleging that MiniMax used fraudulent accounts to generate the majority of an aggregate 16 million Claude interactions for use in training; Anthropic stated that MiniMax alone accounted for more than 13 million of those queries.^[30] As of mid-2026 Anthropic has not filed suit, but it has said that evidence packages are ready if negotiations fail.^[30] These controversies concern MiniMax's broader practices rather than MiniMax-Text-01 specifically, but they have shaped subsequent commentary on the company's models, including reassessments of the MiniMax-Text-01 release.

References

1. MiniMax, "MiniMax-01 is Now Open-Source: Scaling Lightning Attention for the AI Agent Era," MiniMax News, January 14, 2025. https://www.minimax.io/news/minimax-01-series-2
2. MarkTechPost, "MiniMax-Text-01 and MiniMax-VL-01 Released: Scalable Models with Lightning Attention, 456B Parameters, 4M Token Contexts, and State-of-the-Art Accuracy," January 15, 2025. https://www.marktechpost.com/2025/01/15/minimax-text-01-and-minimax-vl-01-released-scalable-models-with-lightning-attention-456b-parameters-4b-token-contexts-and-state-of-the-art-accuracy/
3. Hugging Face, "MiniMaxAI/MiniMax-Text-01" model card. https://huggingface.co/MiniMaxAI/MiniMax-Text-01
4. VentureBeat, "MiniMax unveils its own open-source LLM with industry-leading 4M token context," January 15, 2025. https://venturebeat.com/ai/minimax-unveils-its-own-open-source-llm-with-industry-leading-4m-token-context/
5. MiniMax, "MiniMax-01: Scaling Foundation Models with Lightning Attention," arXiv:2501.08313, January 14, 2025. https://arxiv.org/abs/2501.08313
6. Hugging Face Papers, "MiniMax-01: Scaling Foundation Models with Lightning Attention." https://huggingface.co/papers/2501.08313
7. TechCrunch, "Chinese AI company MiniMax releases new models it claims are competitive with the industry's best," January 15, 2025. https://techcrunch.com/2025/01/15/chinese-ai-company-minimax-releases-new-models-it-claims-are-competitive-with-the-industrys-best/
8. GitHub, MiniMax-AI/MiniMax-01 repository. https://github.com/MiniMax-AI/MiniMax-01
9. Hugging Face discussion, "MiniMaxAI/MiniMax-Text-01 · Consider making MiniMax Text free software, as license is proprietary," opened January 2025, closed March 2025. https://huggingface.co/MiniMaxAI/MiniMax-Text-01/discussions/2
10. MiniMax, "MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention," arXiv:2506.13585, June 2025. https://arxiv.org/abs/2506.13585
11. VentureBeat, "MiniMax-M1 is a new open source model with 1 MILLION TOKEN context and new, hyper efficient reinforcement learning," June 2025. https://venturebeat.com/ai/minimax-m1-is-a-new-open-source-model-with-1-million-token-context-and-new-hyper-efficient-reinforcement-learning/
12. 36Kr / EU 36Kr, "Unicorns Emerge from SenseTime Ecosystem, but YAN Junjie Is Irreplicable," 2025. https://eu.36kr.com/en/p/3609258452141057
13. Wikipedia, "MiniMax (company)." https://en.wikipedia.org/wiki/MiniMax_(company)
14. South China Morning Post, "MiniMax, the 'world-class' AI start-up lauded by Jensen Huang, applies for Hong Kong IPO." https://www.scmp.com/business/banking-finance/article/3318485/minimax-world-class-ai-start-lauded-jensen-huang-applies-hong-kong-ipo
15. Yahoo Finance / KrASIA, "Chinese AI 'tiger' MiniMax launches text-to-video-generating model to rival OpenAI's Sora." https://finance.yahoo.com/news/chinese-ai-tiger-minimax-launches-093000322.html
16. CNBC, "MiniMax doubles in Hong Kong debut, marking yet another Chinese AI listing," January 9, 2026. https://www.cnbc.com/2026/01/09/minimax-hong-kong-ipo-ai-tigers-zhipu.html
17. Implicator AI, "MiniMax Files for $4 Billion IPO, Testing Global Appetite for China's AI Ambitions." https://www.implicator.ai/minimax-files-for-4-billion-ipo-testing-global-appetite-for-chinas-ai-ambitions/
18. Hugging Face Blog, Elie Bakouch, "Diving into MiniMax-01 405B MoE," January 2025. https://huggingface.co/blog/eliebak/minimax01-deepdive
19. Analytics Vidhya, "4M Tokens? MiniMax-Text-01 Raises the Bar, Beating DeepSeek V3," January 2025. https://www.analyticsvidhya.com/blog/2025/01/minimax-text-01/
20. MiniMax, "MiniMax-VL-01: A New Milestone in Multimodal AI Models." https://www.minimax01.com/en/blog/minimax-vl-01-introduction
21. Hugging Face, "MiniMaxAI/MiniMax-VL-01" model card. https://huggingface.co/MiniMaxAI/MiniMax-VL-01
22. Analytics Vidhya, benchmark-table comparison between MiniMax-Text-01 and competitors. https://www.analyticsvidhya.com/blog/2025/01/minimax-text-01/
23. Hugging Face, "LICENSE-MODEL" file in MiniMaxAI/MiniMax-Text-01. https://huggingface.co/MiniMaxAI/MiniMax-Text-01/blob/main/LICENSE-MODEL
24. Hailuo AI (MiniMax) on X, announcement of MiniMax-01 deployment on Hailuo platform, January 15, 2025. https://x.com/Hailuo_AI/status/1879229798649856343
25. OpenRouter, "MiniMax-01: API Pricing & Providers." https://openrouter.ai/minimax/minimax-01
26. VentureBeat coverage of MiniMax-01 release, summary citing 4M-token context window and pricing. https://venturebeat.com/ai/minimax-unveils-its-own-open-source-llm-with-industry-leading-4m-token-context/
27. Decrypt, "MiniMax Drops State-of-the-Art AI Agent Model, Then Quietly Changes the License." https://decrypt.co/364225/minimax-m27-agent-model-license-change
28. VentureBeat, "MiniMax-M2 is the new king of open source LLMs (especially for agentic tool calling)," October 2025. https://venturebeat.com/ai/minimax-m2-is-the-new-king-of-open-source-llms-especially-for-agentic-tool/
29. Variety, "Disney, Warner Bros. Discovery, NBCU Sue Chinese AI Company MiniMax, Alleging It 'Pirates and Plunders' Studios' Copyrighted Works on 'Massive Scale,'" September 2025. https://variety.com/2025/digital/news/disney-warner-bros-discovery-nbcu-lawsuit-minimax-chinese-ai-company-1236520395/
30. VentureBeat, "Anthropic alleges Chinese AI labs including DeepSeek, Moonshot and MiniMax used fake accounts to distill Claude," February 24, 2026. https://venturebeat.com/technology/anthropic-says-deepseek-moonshot-and-minimax-used-24-000-fake-accounts-to/
