Kimi K2.5
Last reviewed
May 16, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 ยท 3,348 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 16, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 ยท 3,348 words
Add missing citations, update stale details, or suggest a clearer explanation.
Kimi K2.5 is an open-weights, natively multimodal large language model developed by Moonshot AI, released on January 27, 2026. It extends the company's Kimi K2 line from a text-only system to a vision and text system, while keeping the same one trillion parameter Mixture of Experts backbone. The model uses 32 billion active parameters per token, a 256K context window, and a custom 400 million parameter vision encoder called MoonViT. Moonshot positioned K2.5 as a model built for autonomous agent work rather than chat, with a headline feature called Agent Swarm that can orchestrate up to 100 sub-agents in parallel.
The release sits inside a wave of frontier-class models from Chinese labs in late 2025 and early 2026, alongside DeepSeek V4 and the various GLM and Qwen updates. K2.5 was the first open-weights model to post scores competitive with closed frontier systems on Humanity's Last Exam when tools were allowed, and the first major Chinese release to ship a model card explicitly written around agentic deployments rather than benchmarks alone. Moonshot published weights, inference code, and a tech blog on the same day, under a Modified MIT License that permits commercial use with attribution requirements above a certain user threshold.
Reception was strong on capability but mixed on practical reliability. Independent reviewers praised the math and visual reasoning scores and the price, which lands roughly an order of magnitude below Claude Opus 4.7 at the API level. Critics flagged a high hallucination rate, slow inference speeds at full context, and a peculiar identity confusion problem where the model would sometimes refer to itself as Claude without a system prompt, suggesting heavy distillation from Anthropic outputs in the training mix.
Moonshot AI was founded in Beijing in 2023 by Yang Zhilin and several collaborators from Tsinghua University. The company started with the Kimi chat product, a long-context assistant aimed at Chinese consumers, and built early credibility around handling very long documents, hundreds of thousands of tokens at a time, well before that was common in the West. By mid-2024 the company had taken meaningful funding from Alibaba and other Chinese investors, and Kimi had become one of the most-used domestic chat assistants in China alongside Baidu's Ernie products and ByteDance's Doubao line.
The Kimi model series went through several phases. The early Kimi Chat models were proprietary. In January 2025 Moonshot published Kimi K1.5, a closed-weights multimodal reasoning model that posted competitive scores against OpenAI's o1 and was notable for releasing a detailed technical report even though the weights stayed private. In July 2025 the company changed strategy and released Kimi K2 with open weights, a one trillion parameter Mixture of Experts text model trained on roughly 15.5 trillion tokens. K2 was Moonshot's first true open-source release and was specifically optimized for tool use and agentic tasks, with a custom Muon-based optimizer and a 128K context.
After K2 came two refinements. K2 Instruct shipped first, then in November 2025 came Kimi K2 Thinking, a reasoning-tuned variant of K2 that was the first open model to credibly compete with GPT-5 and Claude Opus 4.6 on agent benchmarks. K2 Thinking introduced native INT4 quantization through quantization-aware training, which let the full model run on a single 8x H100 node despite its trillion-parameter total. K2.5 builds directly on this lineage. The base weights are continued from Kimi-K2-Base, the same backbone used for K2 Instruct and K2 Thinking, but with native vision pre-training mixed in from the start of the continual pretraining run rather than added as a late adapter.
Moonshot framed K2.5 as the company's first proper bid for global frontier status. The Kimi product had been growing inside China but was relatively unknown abroad. K2.5's open weights, OpenAI-compatible API, and listings on Hugging Face, AWS Bedrock, OpenRouter, Together AI, and NVIDIA NIM made it accessible to international developers in a way no previous Moonshot release had been.
Kimi K2.5 keeps the high-level shape of K2 and adds vision. The model is a sparse Mixture of Experts transformer with 1 trillion total parameters and 32 billion active parameters per forward pass. There are 61 layers in total, one of which is a standard dense layer, with the remaining 60 layers running the MoE routing. Each MoE layer contains 384 routed experts plus one shared expert, and the router selects 8 experts per token. The expert hidden dimension is 2,048.
For attention, K2.5 uses Multi-head Latent Attention with an attention hidden dimension of 7,168 and 64 attention heads. MLA was popularized by DeepSeek V4 and earlier DeepSeek releases and lets the model compress key-value caches by projecting them into a lower-dimensional latent space, which is crucial for keeping inference memory manageable at 256K context. The activation function inside the experts is SwiGLU, and the vocabulary is 160,000 tokens, the same tokenizer used in K2.
The context window is 262,144 tokens. Output is capped well below that in practice, with most providers limiting completions to around 98,000 tokens. Moonshot trains the model for stable performance across the entire context range rather than the partial degradation pattern common in earlier long-context releases, where models would technically accept 200K tokens but lose track of details past 32K.
The vision side is handled by MoonViT, a Moonshot-built vision encoder with 400 million parameters. MoonViT is not a separately trained encoder bolted onto the language model. Vision tokens are tokenized through MoonViT and then mixed with text tokens in the same context, so the model sees images as just another modality inside its sequence. This native multimodal design follows the pattern set by Gemini 3 Pro and GPT-4o, where vision is trained end-to-end with text rather than fine-tuned in after a text-only base. K2.5 accepts both still images and video frames, though video support varies by provider.
An unusual technical detail is the model's INT4 weights. Following the approach introduced in Kimi K2 Thinking, K2.5 uses native INT4 quantization on the MoE expert weights, applied through quantization-aware training rather than post-hoc rounding. This roughly halves the memory footprint compared to a BF16 deployment without the quality loss that usually comes from naive INT4 conversion. In practice the model can serve from an 8x H100 or 8x H200 node when most other trillion-parameter MoE systems need 16 GPUs minimum.
Moonshot disclosed broad strokes of the training procedure but not the full recipe. The model is built through continual pretraining on top of the Kimi-K2-Base checkpoint, the same base used for K2 Instruct and K2 Thinking. The continual pretraining run processes approximately 15 trillion mixed visual and text tokens. The exact mix between modalities and the data sources are not specified in the public release, though the model card and tech blog mention web crawls, code repositories, math corpora, and a curated visual instruction set.
Post-training uses a multi-stage pipeline that Moonshot calls Parallel Agent Reinforcement Learning. The basic idea is that during reinforcement learning, the model is trained to coordinate multiple instances of itself running on the same problem rather than just optimizing a single rollout. Moonshot argues that this approach is what gives K2.5 its Agent Swarm capability, since the model has actually been trained on the dynamics of multiple agents sharing context and tools rather than just generalizing from single-agent traces.
The model has two reasoning modes baked into the same weights. Thinking mode runs an internal chain of reasoning before producing the visible answer and is the default. Instant mode skips the visible reasoning trace and is recommended for short queries where latency matters more than depth. The two modes use different sampling defaults, with thinking mode preferring temperature 1.0 and top-p 0.95, and instant mode dropping the temperature to 0.6. Moonshot reports both modes in the published benchmark tables, with the thinking mode scores generally several points higher.
Moonshot has not published a full count of GPU-hours or a hardware setup for K2.5. The K2 paper from July 2025 described a custom Muon optimizer variant called MuonClip, designed to handle the optimization instability that hit larger MoE runs, and K2.5 presumably uses an updated version. The company also has not released the post-training data or the reinforcement learning reward models.
Moonshot published a long benchmark table with K2.5 in both standard and Agent Swarm configurations. The scores below come from the official model card and tech blog. Where the model card reports both thinking and instant numbers, the table uses the higher thinking-mode score.
| Benchmark | Category | Kimi K2.5 score |
|---|---|---|
| Humanity's Last Exam (with tools) | General reasoning | 50.2 |
| BrowseComp (no Swarm) | Web navigation | 74.9 |
| BrowseComp (with Agent Swarm) | Web navigation | 78.4 |
| GPQA-Diamond | Graduate-level science | 87.6 |
| MMLU-Pro | General knowledge | 87.1 |
| AIME 2025 | Math olympiad | 96.1 |
| HMMT 2025 | Math olympiad | 95.4 |
| SWE-Bench Verified | Software engineering | 76.8 |
| SWE-Bench Multilingual | Software engineering | 73.0 |
| LiveCodeBench v6 | Competitive programming | 85.0 |
| MMMU-Pro | Multimodal reasoning | 78.5 |
| MathVision | Visual math | 84.2 |
| MathVista (mini) | Visual math | 90.1 |
| OCRBench | Text in images | 92.3 |
| OmniDocBench 1.5 | Document understanding | 88.8 |
| VideoMMMU | Video reasoning | 86.6 |
The Humanity's Last Exam result is the headline number. HLE is a 3,000-question exam built by Scale AI and the Center for AI Safety, with questions designed by domain experts to be genuinely hard for current models. K2.5's 50.2 score with tools allowed beat the GPT-5 family's 45.5 and Claude Opus 4.5's 43.2 by several points, the first time an open-weights model had taken the top position on a frontier benchmark.
The BrowseComp numbers are also notable. BrowseComp measures a model's ability to navigate the open web through tool calls, find specific information, and report it accurately. K2.5 scores 74.9 in single-agent mode and 78.4 with Agent Swarm enabled. The Agent Swarm boost of 3.5 points is smaller than the marketing might suggest but real, and the underlying single-agent number is already well ahead of the competition's 57 to 60 range as reported by Moonshot.
On coding the picture is more mixed. K2.5 posts 76.8 on SWE-Bench Verified, which is strong but trails Claude Opus 4.7's 80.9 and roughly matches GPT-5's scores. The LiveCodeBench v6 score of 85.0 is closer to the front of the pack. K2.5 looks particularly good on multilingual SWE-Bench, where it leads several frontier models by a few points.
Math benchmarks are where K2.5 looks essentially uncatchable. AIME 2025 at 96.1 and HMMT 2025 at 95.4 are both within touching distance of perfect, on tests where most frontier models still score in the 80s. The visual math scores on MathVision and MathVista also lead or tie all published competitors at release time.
Kimi K2.5 ships under a Modified MIT License. The base license is MIT, but Moonshot adds an attribution clause that kicks in for derivatives exceeding 100 million monthly active users or 20 million in monthly revenue, where the deriving party must prominently display "Kimi K2.5" attribution. Below those thresholds the license behaves like a standard MIT license. The terms apply to both code and model weights.
Weights are published at moonshotai/Kimi-K2.5 on Hugging Face. Inference code lives in the MoonshotAI/Kimi-K2.5 GitHub repository. The model card recommends vLLM, SGLang, or KTransformers as inference engines, and requires transformers 4.57.1 or later.
The official Moonshot API is at platform.moonshot.ai and is OpenAI-compatible, which means existing code written against the OpenAI Python SDK can target Moonshot by changing the base URL and key. Anthropic-compatible endpoints are also offered for tools that talk to the Anthropic API. The API exposes the thinking and instant modes through a chat template parameter.
Pricing on the official API at launch was 0.60 dollars per million input tokens on a cache miss, 0.10 dollars per million on a cache hit, and 3.00 dollars per million output tokens. That compares to GPT-5 launch pricing of around 1.25 input and 10.00 output per million, and Claude Opus 4.7 at roughly 15.00 input and 75.00 output. On those numbers K2.5 is between four and twenty-five times cheaper per token than the closed frontier models, before accounting for the fact that thinking mode causes the model to emit substantially more tokens per task than a non-reasoning baseline.
Third-party hosts list slightly different rates. OpenRouter shows K2.5 at 0.40 per million input and 1.90 per million output. Together AI and NVIDIA NIM also host the model with their own pricing. AWS Bedrock added K2.5 to its catalog within a few weeks of launch, though without video input support in the initial Bedrock integration.
The table below shows how Kimi K2.5 lines up against the closed frontier models available at the same time and against its open-source peers. Numbers are taken from each vendor's own benchmark reporting where available; cross-lab benchmark comparison is always imperfect, since labs choose different prompts and tool configurations.
| Model | Developer | Total / active params | Context | License | HLE (tools) | SWE-Bench Verified | API input price (per 1M tokens) |
|---|---|---|---|---|---|---|---|
| Kimi K2.5 | Moonshot AI | 1T / 32B (MoE) | 256K | Modified MIT | 50.2 | 76.8 | 0.60 USD |
| Kimi K2 | Moonshot AI | 1T / 32B (MoE) | 128K | Modified MIT | not reported | 71.6 | 0.60 USD |
| DeepSeek V4 | DeepSeek | 671B / 37B (MoE) | 128K | DeepSeek License | mid 40s reported | 73 reported | low |
| GPT-5 | OpenAI | undisclosed | 400K | Proprietary | 45.5 | 74 reported | around 1.25 USD |
| Claude Opus 4.7 | Anthropic | undisclosed | 200K | Proprietary | reported in 40s | 80.9 | around 15 USD |
| Gemini 3 Pro | Google DeepMind | undisclosed | 1M | Proprietary | reported in 40s | reported in 70s | around 1.25 USD |
Versus its predecessor Kimi K2, K2.5 adds native vision, doubles the context window from 128K to 256K, lifts SWE-Bench Verified by about five points, and adds the Agent Swarm capability. The two share the same base weights and tokenizer, so K2-specific fine-tunes and tooling generally transfer to K2.5 with minor changes.
Versus GPT-5, K2.5 has higher published HLE-with-tools and BrowseComp scores and is significantly cheaper, but trails on certain coding evaluations and on the calibration and hallucination measures from third-party trackers like Artificial Analysis. GPT-5 also has more mature tool-calling infrastructure inside the ChatGPT product and more sophisticated multimodal output (image generation, voice) built around it.
Versus Claude Opus 4.7, the cost gap is the most striking gap. K2.5 is roughly a tenth of the per-token price and runs longer context. Claude still leads on SWE-Bench Verified and on most third-party coding evaluations, and on subjective ratings for prose quality and reliability of long agent runs. Several reviewers describe K2.5 as around ninety percent of Opus 4.7 quality at a fraction of the cost.
Versus Gemini 3 Pro, K2.5 trades a smaller context window (256K vs 1M) for stronger math and competitive coding scores. Gemini 3 Pro keeps the lead on document and video tasks at very long context, where its million-token window can hold an entire codebase or a multi-hour video without chunking.
Versus DeepSeek V4, the two are the two leading Chinese open-weights models of early 2026. DeepSeek V4 has a smaller active parameter count (37B vs 32B is close), a tighter context window, and a stronger reputation for very large-batch inference efficiency. K2.5 leads on vision, on agentic benchmarks, and on math.
Initial reception was strongly positive among developers and frontier-model trackers. The HLE result was treated as a meaningful milestone, the first time an open-weights model had taken the top of a hard reasoning benchmark with tools. Maxime Labonne, who runs an active model evaluation series, wrote a two-week followup describing K2.5 as a real competitor to Claude Opus on most tasks at one tenth the cost, with the caveat that the model is noticeably slower at full context. Zvi Mowshowitz called it the leading open-weights model on release and an excellent value, while flagging the model's lack of any published safety strategy as concerning given the agent swarm capability.
The Agent Swarm feature drew the most attention from the agentic-AI community. Demos of K2.5 spawning fifty or more parallel sub-agents to research a topic, write a long report, or build a website circulated widely. Moonshot's own measurements claim a 4.5x speedup on parallel research tasks compared to a single-agent baseline. Independent reproductions generally confirmed the speedup but found it harder to achieve in non-research workflows where coordination overhead between sub-agents starts to dominate.
Negative reactions clustered around three problems. First, hallucination. Artificial Analysis reported a hallucination rate measured at 64 percent on their evaluation suite, well above the 20 to 35 percent range typical for current frontier models. Reviewers found that K2.5 would confidently produce fabricated citations, made-up function signatures, and incorrect factual claims with a frequency that made the model unsuitable for some research and engineering workflows without careful verification. Several writers connected this to Moonshot's RL setup, arguing that the heavy emphasis on tool-calling success may have over-rewarded confident outputs.
Second, the identity confusion. Without a system prompt, K2.5 would frequently introduce itself as Claude, occasionally as ChatGPT, and refer to Anthropic or OpenAI as its developer. This strongly suggested that a substantial portion of the post-training data came from Claude completions, likely through synthetic distillation. Moonshot did not directly address the issue, though community probing suggested that adding a system prompt that explicitly identified the model as Kimi resolved most of the behavior.
Third, a licensing dispute with Cursor. The Modified MIT License's attribution clause triggered a public disagreement between Moonshot and the IDE company Cursor in February 2026, after Cursor integrated K2.5 without prominent attribution. The dispute was resolved within a few weeks but raised concerns among other commercial integrators about the practical scope of the attribution requirement.
The Chinese AI community treated K2.5 as the strongest entry yet in the global open-weights race. Coverage in Chinese tech press emphasized the HLE result and the comparison with American frontier labs, particularly on cost. Western coverage was more cautious but generally agreed that K2.5 had narrowed the gap between open Chinese models and the closed Western frontier substantially. Two months after launch, Maxime Labonne's followup post asked whether the model was "still worth it" and answered yes for cost-sensitive deployments but no for production code agents, where Claude's reliability still won out.
In April 2026 Moonshot released Kimi K2.6, an update that extended the context window and improved tool-calling consistency. K2.5 remained in active use as the more accessible open release, with K2.6 functioning as a paid tier improvement for users who needed the additional capabilities.