Kimi K2 is a large-scale Mixture of Experts (MoE) language model developed by Moonshot AI, a Beijing-based AI startup. Released on July 11, 2025, as an open-weights model under a Modified MIT License, Kimi K2 features 1.04 trillion total parameters with 32.6 billion activated per token. The model was designed with agentic applications as its primary focus, targeting autonomous task execution, multi-step tool use, and software engineering workflows. At launch it was the highest-ranked open-weights model on the LMSYS Chatbot Arena, placing fifth overall among proprietary and open models alike on the strength of more than 3,000 user votes.
The K2 release matters for two overlapping reasons. The first is architectural: the model demonstrated that a sparse transformer at trillion-parameter scale could be trained without loss spikes using a new optimizer called MuonClip. The second is licensing: the weights were released under terms permissive enough that downstream labs, including Cursor, later trained their own commercial coding models on top of K2 derivatives. By the time the K2.6 update shipped in April 2026, the broader Kimi K2 family had become the de facto reference open-weights agentic model and the only open release routinely compared against Claude Opus 4, GPT-5, and Gemini 3 in agentic coding evaluations.
Moonshot AI was founded in March 2023 by Yang Zhilin, Zhou Xinyu, and Wu Yuxin, three Tsinghua University alumni. Yang chose the company name as a reference to Pink Floyd's album The Dark Side of the Moon, his favorite record. The company's stated goal is to build foundation models toward artificial general intelligence, with Yang identifying three milestones: long context length, a multimodal world model, and a scalable general architecture capable of continuous self-improvement.
Moonshot raised $60 million in its initial funding round at a $300 million valuation. In February 2024, Alibaba Group led a $1 billion round that brought the valuation to $2.5 billion. Alibaba acquired approximately 36% of the company. By 2025, a further $2 billion round led by Meituan Dragon Ball pushed the post-money valuation past $20 billion, making Moonshot the most heavily funded large language model startup in China with over $3.9 billion raised.
Moonshot's consumer-facing product is Kimi, a chatbot first released in October 2023. The initial version could process up to 200,000 Chinese characters per conversation, a context window that set it apart from most contemporaries. By March 2024, Moonshot claimed Kimi could handle 2 million Chinese characters in a single prompt.
The model lineage leading to K2 ran from the original Kimi chatbot models through the K1.5 reasoning model. K1.5 and K2 were designed for different use cases within the Moonshot product ecosystem: K1.5 targeted reasoning-intensive and multimodal tasks, while K2 emphasized automation, software development, and scalable deployment for developers.
Kimi K2 uses a sparse Mixture of Experts (MoE) transformer architecture. Rather than activating all model parameters for every input token, the MoE design routes each token to a subset of specialized sub-networks called experts. This means the model has 1.04 trillion total parameters but only activates 32.6 billion per token during inference, making the per-token computation cost comparable to a much smaller dense model while still benefiting from the knowledge encoded in the full parameter count.
The routing is controlled by a learned gating network. For each token, the network selects the top 8 experts from a pool of 384 total experts, plus one shared expert that is always active. This gives a sparsity ratio of 48 (384 divided by 8), which is higher than the sparsity used in DeepSeek-V3. Moonshot chose this higher sparsity to reduce overfitting and improve expert specialization. In their ablations, they found that increasing the expert count from 256 (the V3 figure) to 384 produced clean gains on knowledge benchmarks at a fixed activated parameter budget.
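A minimal PyTorch sketch of this routing pattern, using the published K2 hyperparameters (384 routed experts, top-8 selection, one always-on shared expert), is shown below. The plain linear gate, softmax renormalization over the selected experts, and simple SiLU expert MLPs are illustrative assumptions; Moonshot's actual gating and SwiGLU expert implementation differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Top-k expert routing with K2-like hyperparameters (illustrative sketch)."""

    def __init__(self, d_model=7168, d_expert=2048, n_experts=384, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # learned router
        def make_mlp():  # SiLU MLP stands in for the SwiGLU experts
            return nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(),
                                 nn.Linear(d_expert, d_model))
        self.experts = nn.ModuleList(make_mlp() for _ in range(n_experts))
        self.shared_expert = make_mlp()  # always active for every token

    def forward(self, x):                  # x: (n_tokens, d_model)
        scores = self.gate(x)              # (n_tokens, 384)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)  # 8 of 384: sparsity 48
        weights = F.softmax(top_vals, dim=-1)                # renormalize over top-8
        routed = torch.zeros_like(x)
        for t in range(x.size(0)):         # naive per-token loop, for clarity only
            for s in range(self.top_k):
                e = int(top_idx[t, s])
                routed[t] = routed[t] + weights[t, s] * self.experts[e](x[t])
        return self.shared_expert(x) + routed
```

Production kernels group tokens by expert rather than looping per token, but the routing math is the same: only 8 of the 384 expert MLPs run for any given token.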
The detailed architecture specifications are:
| Parameter | Value |
|---|---|
| Total parameters | 1.04 trillion |
| Activated parameters per token | 32.6 billion |
| Transformer layers | 61 (including 1 dense layer) |
| Total experts | 384 |
| Active experts per token | 8 |
| Shared experts | 1 |
| Attention heads | 64 |
| Hidden dimension | 7,168 |
| Expert hidden dimension | 2,048 |
| Context length | 128K tokens (K2) / 256K tokens (0905 and later) |
| Vocabulary size | 160K |
| Attention mechanism | Multi-head Latent Attention (MLA) |
| Activation function | SwiGLU |
| Storage format | Block-FP8 |
Kimi K2 uses Multi-head Latent Attention (MLA), the same attention variant introduced by DeepSeek in DeepSeek-V2. MLA compresses the key and value projections into low-dimensional latent vectors, substantially reducing the size of the KV cache during inference. This is particularly valuable for agentic use cases that involve long conversations and many sequential tool calls, where standard multi-head attention would accumulate large caches.
Kimi K2 uses 64 attention heads, compared to 128 in DeepSeek-V3. Moonshot made this choice deliberately to prioritize inference speed for agentic tasks, accepting a modest reduction in attention capacity in exchange for lower latency. The technical report describes this as an explicit tradeoff: agentic workloads tend to read long sequences once, then emit short structured outputs (tool calls, reasoning steps), so the marginal value of more heads is lower than in pure long-document understanding.
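The cache savings are easy to quantify with a back-of-envelope calculation. The sketch below compares per-token cache sizes for standard multi-head attention against an MLA-style compressed cache. The latent dimensions (a 512-dim compressed KV vector plus a 64-dim decoupled rotary key) are DeepSeek-V2's published values, and the even head-dimension split is an assumption, so the numbers illustrate the mechanism rather than K2's exact footprint.

```python
# Back-of-envelope KV-cache comparison, per token, in BF16 (2 bytes/element).
layers    = 61
heads     = 64
head_dim  = 7168 // heads   # 112, assuming hidden dim split evenly across heads
bytes_per = 2               # BF16

# Standard MHA caches a full K and V vector per head per layer.
mha_per_token = 2 * layers * heads * head_dim * bytes_per
# MLA caches one shared latent (assumed 512 dims) plus a rotary key (64 dims).
mla_per_token = layers * (512 + 64) * bytes_per

print(f"MHA : {mha_per_token / 1024:.0f} KiB/token")     # ~1708 KiB
print(f"MLA : {mla_per_token / 1024:.0f} KiB/token")     # ~69 KiB
print(f"ratio: {mha_per_token / mla_per_token:.0f}x")    # ~25x smaller
```

At a 256K-token context, a ratio of this magnitude is the difference between a cache measured in hundreds of gigabytes and one measured in tens, which is exactly where long agentic sessions live.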
The pre-training context length was extended progressively. Training began at 4,096 tokens, then expanded to 32K tokens, and finally to 128K tokens during an annealing phase using the YaRN method (Yet Another RoPE extensioN). The 128K token context window is sufficient for most software development and document analysis tasks. The September 2025 0905 update doubled this to 256K tokens, which became standard for every subsequent K2 release.
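A simplified sketch of YaRN-style frequency adjustment is shown below. The band thresholds (alpha=1, beta=32) are the YaRN paper's defaults, not values confirmed for K2, and the small attention-temperature correction that full YaRN also applies is omitted for brevity.

```python
import numpy as np

def yarn_frequencies(dim=128, base=10000.0, orig_len=32768, scale=4.0,
                     alpha=1.0, beta=32.0):
    """YaRN-style RoPE frequency adjustment (simplified sketch).

    Dimensions that rotate many times within the original context are left
    untouched; dimensions that rotate less than once are fully interpolated
    (divided by `scale`); dimensions in between get a linear blend.
    """
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # standard RoPE frequencies
    n_rot = orig_len * inv_freq / (2 * np.pi)         # rotations over orig context
    ramp = np.clip((n_rot - alpha) / (beta - alpha), 0.0, 1.0)
    # ramp=1 -> keep frequency; ramp=0 -> full position interpolation (freq/scale)
    return inv_freq * (ramp + (1.0 - ramp) / scale)

# The final 32K -> 128K extension in K2's schedule is a 4x scale:
new_freqs = yarn_frequencies(orig_len=32768, scale=4.0)
```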
Kimi K2 was pre-trained on 15.5 trillion tokens spanning four primary domains: web text, code, mathematics, and knowledge-intensive content. The training corpus covered both English and Chinese, with additional multilingual coverage.
A notable data preparation technique involved rephrasing rather than repeating data. For the knowledge and mathematics domains, Moonshot generated stylistically diverse reformulations of high-quality source material using chunk-wise autoregressive generation with fidelity verification. An ablation comparing ten rephrased versions of a dataset against ten training epochs on the original showed the rephrased approach scoring 28.94% on SimpleQA versus 23.76% for the repeated-epoch approach, a substantial improvement in factual retention.
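The pipeline can be pictured as the sketch below, in which `call_llm` is a hypothetical stand-in for any instruction-tuned rewriter and the prompts are illustrative; Moonshot's actual generation and fidelity-verification prompts are not public.

```python
# Sketch of chunk-wise rephrasing with fidelity verification. `call_llm` is a
# hypothetical placeholder, not Moonshot's pipeline.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for any instruction-tuned model")

def rephrase_document(text: str, chunk_size: int = 2000) -> str | None:
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    rewritten, context = [], ""
    for chunk in chunks:
        # Chunk-wise autoregressive generation: each rewrite conditions on the
        # rewritten text so far, keeping style consistent across chunk boundaries.
        out = call_llm(
            "Rewrite the passage in a fresh style, preserving every fact.\n"
            f"Rewritten so far:\n{context}\n\nPassage:\n{chunk}")
        rewritten.append(out)
        context = out
    candidate = "".join(rewritten)
    # Fidelity verification: an LLM judge checks semantic equivalence and the
    # sample is dropped if any facts drifted during rewriting.
    verdict = call_llm(
        "Do these two texts state the same facts? Answer YES or NO.\n"
        f"A:\n{text}\n\nB:\n{candidate}")
    return candidate if verdict.strip().upper().startswith("YES") else None
```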
For mathematics data specifically, text was converted to a "learning note" style and high-quality materials were translated into English to expand the language coverage of mathematical reasoning examples.
One of the most technically significant contributions of Kimi K2 is the MuonClip optimizer. Standard adaptive optimizers such as AdamW become increasingly difficult to tune at trillion-parameter scale. The Muon optimizer, which applies matrix orthogonalization (Newton-Schulz iterations) to gradient updates, had shown strong token efficiency on smaller models. Moonshot's earlier work on a 16B-total/3B-active MoE model called Moonlight had applied Muon successfully at that scale. However, scaling Muon to a trillion parameters introduced a new failure mode: exploding attention logits that destabilized training.
Without intervention, training with spectral-norm-constrained optimizers like Muon may see attention logits grow beyond 1,000, resulting in loss spikes and possibly catastrophic instability. MuonClip addresses this with a QK-Clip mechanism. The technique rescales query and key projection weights on a per-head basis to prevent attention logit explosion. Rather than applying a global clipping threshold, it uses per-head scaling factors. For the Multi-head Latent Attention implementation in K2, it clips head-specific components (qC, kC, qR) while preserving shared rotary components. The clipping threshold was set to 100 for K2's training run.
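In pseudocode terms, QK-Clip is a small hook that runs after each optimizer step. The sketch below simplifies to plain per-head query/key projections (the `q_proj`/`k_proj` attribute names and layout are assumptions); K2's version operates on the MLA decomposition described above.

```python
import torch

@torch.no_grad()
def qk_clip(attn_layer, max_logit_per_head, tau=100.0):
    """QK-Clip sketch: rescale per-head Q/K weights after the optimizer step
    so the maximum attention logit stays below `tau` (100 in K2's run).

    `max_logit_per_head` holds the per-head maximum of q.k / sqrt(d) observed
    in the forward pass. The rescale is split evenly between Q and K so the
    product of the two projections shrinks by exactly tau / max_logit.
    """
    H, d = attn_layer.num_heads, attn_layer.head_dim
    Wq = attn_layer.q_proj.weight.view(H, d, -1)  # (heads, head_dim, hidden)
    Wk = attn_layer.k_proj.weight.view(H, d, -1)
    for h in range(H):
        m = float(max_logit_per_head[h])
        if m > tau:
            gamma = (tau / m) ** 0.5  # sqrt-split across the Q and K weights
            Wq[h].mul_(gamma)
            Wk[h].mul_(gamma)
```

Because the scaling is per-head, only the heads whose logits actually explode are touched; well-behaved heads train unmodified.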
The result was a 15.5 trillion token pre-training run with zero loss spikes. Maximum attention logits stabilized after approximately 30% of the training steps, and the run completed without any training restarts. Compared to AdamW on equivalent hardware, MuonClip is reported to roughly double compute efficiency (the loss reached per FLOP), a meaningful saving when each step touches 67 million tokens.
The headline training-configuration details are the ones already noted: a global batch of roughly 67 million tokens per step and a QK-Clip threshold of 100, with the remainder of the setup documented in the technical report.
One outcome of the MuonClip release is its influence on other open-weights labs. The Muon family of optimizers, which originated as a niche choice in Keller Jordan's research, gained significant credibility from Moonshot's trillion-parameter result, and follow-on training runs at competing labs have since cited MuonClip-style stabilization in their technical reports.
Post-training proceeded through two main stages: supervised fine-tuning and reinforcement learning.
The supervised fine-tuning stage used a large-scale agentic data synthesis pipeline. Moonshot generated tool-use demonstrations involving over 3,000 real MCP (Model Context Protocol) tools and more than 20,000 synthetic tools. Multi-turn trajectories were generated in both simulated and real execution environments, with models interacting with actual tool APIs to produce training examples grounded in realistic behavior.
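A synthesized trajectory in this style might look like the following; the tool schema and message format are hypothetical illustrations loosely following MCP conventions, not Moonshot's actual training format.

```python
# Illustrative shape of a synthesized agentic training example. Names and
# fields are hypothetical; the schema loosely follows MCP conventions.
trajectory = {
    "tools": [{
        "name": "run_tests",
        "description": "Run the project's test suite and return failures.",
        "input_schema": {"type": "object",
                         "properties": {"path": {"type": "string"}},
                         "required": ["path"]},
    }],
    "messages": [
        {"role": "user", "content": "Fix the failing test in utils/."},
        {"role": "assistant", "tool_call": {"name": "run_tests",
                                            "arguments": {"path": "utils/"}}},
        {"role": "tool", "name": "run_tests",
         "content": "1 failure: test_parse_date expects ISO-8601"},
        {"role": "assistant", "content": "The parser drops the timezone; patching..."},
    ],
}
```

Grounding examples like this in real tool execution, rather than imagined tool outputs, is what keeps the model's tool-call behavior calibrated at inference time.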
The reinforcement learning stage used a Verifiable Rewards Gym covering mathematics and STEM tasks, logic puzzles, instruction following, faithfulness, coding, and safety. For subjective tasks where there is no single correct answer, a Self-Critique Rubric Reward mechanism provided training signal by having the model evaluate its own outputs against defined rubrics. Budget control was applied to manage response length. PTX loss (preserving pre-training gradients) was included to prevent catastrophic forgetting of pre-training knowledge. The temperature schedule during RL started high to encourage exploration and decayed to promote exploitation as training converged.
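For the verifiable slice, rewards reduce to programmatic checks. A minimal sketch for the mathematics domain, using symbolic equality as the checker, might look like this (the real gym uses task-specific verifiers across all the listed domains):

```python
from sympy import simplify, sympify

def math_reward(model_answer: str, reference: str) -> float:
    """Verifiable reward sketch: 1.0 if the model's final answer is
    symbolically equal to the reference, else 0.0."""
    try:
        return float(simplify(sympify(model_answer) - sympify(reference)) == 0)
    except Exception:
        return 0.0  # unparseable answers earn no reward

assert math_reward("2/4", "0.5") == 1.0
assert math_reward("x + 1", "x + 2") == 0.0
```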
The combination of agentic synthetic data and verifiable-reward RL is what distinguishes K2 from its closest open peers. DeepSeek's V3 and R1 models lean more on math and code as a primary signal source; K2's training pipeline treats tool-use trajectories as a first-class objective. This shows up in the benchmark results: K2's gap over peers is widest on Tau-bench and SWE-bench, narrower on pure math and coding benchmarks.
Moonshot released Kimi K2 weights on July 11, 2025, under a Modified MIT License. Both the base model checkpoint (Kimi-K2-Base) and the post-trained checkpoint (Kimi-K2-Instruct) were published on Hugging Face at moonshotai/Kimi-K2-Base and moonshotai/Kimi-K2-Instruct. Weights were released in Block-FP8 format.
The Modified MIT License is a standard MIT License with one additional clause. If a commercial product or service exceeds either 100 million monthly active users or 20 million US dollars in monthly revenue, the operator must prominently display "Kimi K2" on the user interface. Below those thresholds, all standard MIT permissions apply: free use, modification, redistribution, and commercial deployment without attribution. This puts K2 in a similar licensing category to DeepSeek V3 and R1, which helped drive adoption by developers who wanted to self-host or fine-tune without restrictive commercial terms.
Kimi K2 became the top-trending model on Hugging Face within 24 hours of release. The technical report was submitted to arXiv on July 28, 2025 (arXiv:2507.20534).
For deployment, Moonshot documented compatibility with vLLM, SGLang, KTransformers, and TensorRT-LLM inference frameworks. Quantized versions for llama.cpp, Ollama, LM Studio, and Jan were made available through the community.
The Kimi K2 family grew quickly after the initial release. Each variant kept the same core MoE architecture (1T total / 32B active) but adjusted post-training, context length, multimodality, and quantization.
| Variant | Released | Context | Notable changes | License |
|---|---|---|---|---|
| Kimi-K2-Base | July 11, 2025 | 128K | Raw pre-trained foundation, no instruction tuning | Modified MIT |
| Kimi-K2-Instruct | July 11, 2025 | 128K | SFT + RL, reflex-grade (no chain-of-thought) | Modified MIT |
| Kimi-K2-Instruct-0905 | September 5, 2025 | 256K | Improved coding and tool calling, expanded context | Modified MIT |
| Kimi-K2-Thinking | November 6, 2025 | 256K | End-to-end RL on reasoning, 200-300 tool calls per task, native INT4 | Modified MIT |
| Kimi-K2.5 | January 27, 2026 | 256K | Native multimodal via MoonViT, Agent Swarm (100 sub-agents) | Modified MIT |
| Kimi-K2.6 | April 20, 2026 | 256K | Agent Swarm 300 sub-agents, 4,000 step coordination, video input | Modified MIT |
Kimi-K2-Base is the raw pre-trained foundation model without instruction tuning or post-training. It is intended for researchers and developers who want to apply their own fine-tuning for specialized tasks. The base model is available on Hugging Face and supports the same deployment frameworks as the instruct variant. It became the most-forked open base model in the second half of 2025, with notable derivatives including Cursor's Composer 2 (which started from the later Kimi-K2.5 base, part of the same lineage).
Kimi-K2-Instruct is the post-trained model released for general use. It has undergone supervised fine-tuning on agentic data and reinforcement learning as described above. The default system prompt is "You are Kimi, an AI assistant created by Moonshot AI." This is the variant typically accessed through the Moonshot API and the Kimi.com chat interface.
The instruct model does not include extended chain-of-thought reasoning. Moonshot classified it as a "reflex-grade" model, meaning it generates responses directly without an internal scratchpad or thinking phase. This keeps latency lower but limits performance on problems that benefit from step-by-step deliberation.
The September 5, 2025 update of K2 Instruct kept the underlying architecture and weights structure intact while improving coding capability, tool-calling reliability, and frontend code aesthetics. The most visible change was a doubled context window of 256K tokens (262,144 to be exact). Moonshot also tuned compatibility with downstream agentic scaffolds: the 0905 model works as a drop-in backend for Claude Code, Cline, and Roo Code through their OpenAI- or Anthropic-compatible interfaces.
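In practice, drop-in use looks like pointing an Anthropic-style client at Moonshot's compatible endpoint. The base URL and model ID below follow Moonshot's documented conventions but are assumptions that should be verified against current platform docs:

```python
# Using the 0905 model through an Anthropic-style client. Note that on this
# endpoint temperatures are remapped automatically (effective = requested * 0.6).
from anthropic import Anthropic

client = Anthropic(
    api_key="YOUR_MOONSHOT_KEY",
    base_url="https://api.moonshot.ai/anthropic",  # assumption: check the docs
)
resp = client.messages.create(
    model="kimi-k2-0905-preview",                  # assumption: check model list
    max_tokens=1024,
    messages=[{"role": "user", "content": "Refactor this function for clarity: ..."}],
)
print(resp.content[0].text)
```

Scaffolds such as Claude Code typically reach endpoints like this through their base-URL environment settings rather than code changes, which is what makes the 0905 model a drop-in backend.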
On LiveCodeBench v6, the 0905 release scored higher than the original K2 Instruct, primarily because of additional RL on coding tasks. The updated model card and weights were published under the moonshotai/Kimi-K2-Instruct-0905 namespace on Hugging Face.
Kimi K2 Thinking, released on November 6, 2025, extends the K2 architecture with end-to-end RL training of chain-of-thought reasoning interleaved with function calls. The model was trained to maintain coherent behavior across 200 to 300 consecutive tool invocations, a capability that Nathan Lambert of Interconnects noted was "previously limited to closed-source models." The context window was held at 256K tokens.
A distinctive feature is native INT4 quantization via Quantization-Aware Training. Most quantized variants of large open models are produced post-hoc; K2 Thinking was trained with quantization in mind, which Moonshot reports yields roughly a 2x speedup in low-latency mode and substantial GPU memory savings, with no measurable quality loss versus the FP8 reference.
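The core mechanism in quantization-aware training is fake quantization with a straight-through estimator: the forward pass sees INT4-rounded weights while gradients flow as if the weights were unquantized. The sketch below is a generic version of that trick, not Moonshot's specific recipe:

```python
import torch

def fake_quant_int4(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Straight-through INT4 fake quantization (generic QAT sketch).

    Weights are rounded to 4-bit levels with a per-group scale in the forward
    pass; the `.detach()` trick lets gradients bypass the rounding.
    Assumes w.numel() is divisible by group_size.
    """
    orig_shape = w.shape
    g = w.reshape(-1, group_size)                                   # per-group scaling
    scale = g.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0  # int4: [-8, 7]
    q = torch.clamp(torch.round(g / scale), -8, 7) * scale
    return w + (q.reshape(orig_shape) - w).detach()  # straight-through estimator
```

Training through this rounding is what lets the released INT4 weights match the FP8 reference, instead of absorbing the usual post-hoc quantization loss.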
Benchmark scores for Kimi K2 Thinking:
| Benchmark | Score |
|---|---|
| AIME 2025 (with Python) | 99.1% |
| HMMT 2025 (with Python) | 95.1% |
| GPQA-Diamond | 84.5% |
| MMLU-Pro | 84.6% |
| MMLU-Redux | 94.4% |
| SWE-bench Verified | 71.3% |
| SWE-bench Multilingual | 61.1% |
| LiveCodeBench v6 | 83.1% |
| BrowseComp | 60.2% |
| BrowseComp-ZH | 62.3% |
| HLE (text-only with tools) | 44.9% |
| Tau2-Bench Telecom | 93% (independently measured) |
In December 2025 the U.S. Center for AI Standards and Innovation (CAISI) at NIST published an evaluation of Kimi K2 Thinking, calling it the most capable AI model from a PRC-based developer at the time of its release while noting that it still trailed leading U.S. models on cyber, software engineering, scientific knowledge, and mathematical reasoning. The same CAISI report observed that K2 Thinking remained "highly censored in Chinese," with refusal patterns similar to DeepSeek R1-0528, but was relatively uncensored in English, Spanish, and Arabic.
The CAISI report also flagged adoption: a month after release, K2 Thinking had been downloaded from Hugging Face roughly 10% as often as DeepSeek R1 had been a month after its release, and less than 5% as often as gpt-oss in the equivalent period. K2's mindshare was strong among power users and Chinese developers but had not yet matched DeepSeek's broader pull.
Released January 27, 2026, Kimi K2.5 added native multimodal capabilities to the K2 architecture. The key addition was MoonViT, a 400-million-parameter vision encoder developed internally at Moonshot. Unlike approaches that graft a separate vision adapter onto a text-only foundation, MoonViT was integrated natively and the model was trained on approximately 15 trillion mixed visual and text tokens. The result was vision and language capabilities that developed together rather than being combined post-hoc.
K2.5 introduced Agent Swarm functionality, allowing the model to coordinate up to 100 specialized sub-agents working in parallel. This was the first K2-series model to support image input natively. The architecture otherwise maintained the same 1 trillion total / 32 billion active parameter structure as K2, with context held at 256K tokens.
K2.5 became notable in March 2026 as the foundation Cursor used for its Composer 2 coding model. Cursor publicly acknowledged that Composer 2 started from the K2.5 base, with about three quarters of the final training compute coming from Cursor's own continued pretraining and reinforcement learning, and the remaining quarter from the Moonshot base. The episode was one of the more visible commercial integrations of an open Chinese model into a Western developer tool, and Moonshot publicly celebrated it as evidence of the open-weights ecosystem the company wants to support.
Kimi K2.6 was released on April 20, 2026. It expanded Agent Swarm capacity from 100 to 300 simultaneous sub-agents, increased maximum coordinated steps from 1,500 to 4,000, and added native video input (MP4, MOV, AVI, and WebM formats, recommended up to 2K resolution). The context window remained at 256K tokens.
K2.6 introduced two inference modes: Thinking (chain-of-thought) and Instant (low-latency). A "Skills" feature allowed users to convert PDFs and spreadsheets into reusable task templates for recurring workflows. Moonshot demonstrated 12 to 13 hour autonomous coding sessions with K2.6, and agent swarm runs spanning five days.
Benchmark scores for K2.6 at release:
| Benchmark | K2.6 | GPT-5.4 | Claude Opus 4.6 |
|---|---|---|---|
| SWE-Bench Pro | 58.6% | 57.7% | 53.4% |
| HLE with tools | 54.0% | 52.1% | 53.0% |
| BrowseComp | 86.3% | -- | -- |
On the Artificial Analysis Intelligence Index, K2.6 scored 54, leading all open-weights models and trailing GPT-5.5 (60), Claude Opus 4.7, and Gemini 3.1 Pro Preview. K2.6 was released under a Modified MIT License.
Moonshot publicly teased a Kimi K3 model in late March 2026, with members of the team describing K3 as built around longer context (up to 1 million tokens), larger parameter counts (potentially in the 3-4 trillion range), and incorporation of an architecture line called Kimi Linear for efficient long-context attention. As of May 2026, K3 has not been released, no firm release date has been announced, and most concrete details remain speculative. The K2 family continues to receive the bulk of Moonshot's development effort.
The following scores are from the Kimi-K2-Instruct model under non-thinking evaluation settings (no extended chain-of-thought reasoning).
Coding and agentic benchmarks:

| Benchmark | Score | Notes |
|---|---|---|
| SWE-bench Verified (agentic, single attempt) | 65.8% | |
| SWE-bench Verified (agentic, multiple attempts) | 71.6% | With parallel test-time compute |
| SWE-bench Multilingual | 47.3% | |
| Tau2-Bench | 66.1 (Pass@1) | Tool-use |
| ACEBench (English) | 76.5% | Tool-use |
| LiveCodeBench v6 | 53.7% | |
| OJBench | 27.1% | |
| MultiPL-E | 85.7% | Open-source state of the art at release |
Math and STEM benchmarks:

| Benchmark | Score | Notes |
|---|---|---|
| MATH-500 | 97.4% | |
| AIME 2024 | 69.6% (Avg@64) | |
| AIME 2025 | 49.5% (Avg@64) | |
| HMMT 2025 | 38.8% (Avg@32) | |
| GPQA-Diamond | 75.1% (Avg@8) |
Knowledge and instruction-following benchmarks:

| Benchmark | Score |
|---|---|
| MMLU | 89.5% |
| MMLU-Redux | 92.7% |
| MMLU-Pro | 81.1% |
| IFEval (Prompt Strict) | 89.8% |
| SimpleQA | 31.0% |
Long-context and reading-comprehension benchmarks:

| Benchmark | Score |
|---|---|
| DROP | 93.5% |
| MRCR | 55.0% |
| LongBench v2 | 49.1% |
On the LMSYS Chatbot Arena, Kimi K2 ranked first among open-source models and fifth overall (including closed proprietary models) based on over 3,000 user votes as of July 17, 2025. On LiveCodeBench v6, the 53.7% score was state of the art among non-reasoning models at release, surpassing GPT-4.1's 44.7% and DeepSeek V3's 46.9% on the same benchmark.
Kimi K2 is one of the most widely available open-weights models. Within weeks of the July 2025 release, every major third-party inference provider had a hosted version, and the K2 lineage continues to be a default offering across the industry.
Moonshot AI offers Kimi K2 through its commercial API at platform.moonshot.ai. The API is compatible with both OpenAI-style and Anthropic-style client libraries. For Anthropic-style clients, temperature values are remapped automatically (effective temperature equals the request temperature multiplied by 0.6). The recommended temperature for K2 is 0.6.
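A minimal OpenAI-style call against the Moonshot endpoint looks like the following; the model ID shown is an assumption based on launch-era naming and should be checked against the platform's model list:

```python
# Minimal OpenAI-style call to the Moonshot API (endpoint/model ID assumed).
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MOONSHOT_KEY",
    base_url="https://api.moonshot.ai/v1",
)
resp = client.chat.completions.create(
    model="kimi-k2-0711-preview",     # assumption: launch-era model ID
    temperature=0.6,                  # Moonshot's recommended setting for K2
    messages=[{"role": "user", "content": "Summarize the MuonClip optimizer."}],
)
print(resp.choices[0].message.content)
```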
Direct API pricing at launch was approximately $0.55 per million input tokens and $2.20 per million output tokens, with cached input at a discount. Kimi.com hosts the chat interface and Kimi App provides a mobile experience. The Kimi Code CLI, launched in January 2026, is the company's coding-focused command-line interface and crossed 6,400 GitHub stars by mid-2026 with K2.6 as its default backend.
The full provider list grew quickly through late 2025 and 2026. By April 2026, Artificial Analysis tracked 11 API providers offering K2.6 access. Major providers and their typical roles:
| Provider | Notes |
|---|---|
| OpenRouter | Aggregator routing requests to multiple K2 backends |
| Together AI | Hosts FP4 quantizations for cost efficiency |
| Fireworks AI | Low time-to-first-token (around 0.72s on K2.6); commercial partner of Moonshot |
| DeepInfra | FP4 hosting at among the lowest blended prices |
| Novita | Available through Hugging Face Inference Providers |
| Parasail | Lowest blended price at $1.15/1M tokens for K2.6 at launch |
| Cloudflare Workers AI | Added K2.5 in early 2026; reported 77% cost cut for some customers |
| Azure | Available through the Azure AI catalog |
| SiliconFlow | FP8 quantizations |
| Clarifai | Highest output throughput for K2.6 (around 141 tokens/sec) |
| Weights & Biases | Hosting integration alongside training observability |
Moonshot published a tool called K2 Vendor Verifier (K2VV) to help users compare the quality of tool-call outputs across these providers, since lossy quantization can degrade structured outputs in ways not visible in standard chat benchmarks.
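A stripped-down version of the kind of check K2VV performs might look like the sketch below, which sends an identical tool-call prompt to several providers and verifies that the returned arguments are valid JSON matching the schema. The provider endpoints are placeholders, and this illustrates the idea rather than K2VV's actual code:

```python
# K2VV-style check sketch: identical tool-call prompt, multiple providers,
# validate the structured output. Endpoints below are hypothetical.
import json
from openai import OpenAI

TOOL = {"type": "function", "function": {
    "name": "get_weather",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]}}}

providers = {
    "provider_a": ("https://provider-a.example/v1", "kimi-k2-instruct"),
    "provider_b": ("https://provider-b.example/v1", "kimi-k2-instruct"),
}

for name, (url, model) in providers.items():
    client = OpenAI(base_url=url, api_key="...")
    resp = client.chat.completions.create(
        model=model, tools=[TOOL],
        messages=[{"role": "user", "content": "Weather in Beijing?"}])
    calls = resp.choices[0].message.tool_calls or []
    ok = False
    if calls:
        try:
            args = json.loads(calls[0].function.arguments)
            ok = isinstance(args.get("city"), str)
        except json.JSONDecodeError:
            ok = False  # lossy quantization often surfaces as malformed JSON
    print(name, "valid tool call" if ok else "DEGRADED")
```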
Self-hosting the native Block-FP8 checkpoint requires roughly 1 TB of GPU memory once KV cache and activations are accounted for, which in practice means 8 H200 GPUs with 141 GB of memory each, or an equivalent multi-node setup. Quantized variants compress this dramatically: community GGUF quantizations from Unsloth and others reduce K2 to a size that runs on a high-end workstation, though with quality tradeoffs that are most visible in long agentic chains.
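The arithmetic behind that sizing, under stated assumptions, is straightforward:

```python
# Back-of-envelope memory budget for self-hosting K2 on one 8xH200 node.
params_total = 1.04e12
weights_fp8  = params_total * 1        # Block-FP8 ~1 byte/param -> ~1.04 TB
h200_node    = 8 * 141e9               # 8 x 141 GB = 1,128 GB

kv_overhead  = 0.05 * weights_fp8      # rough allowance for KV cache and
                                       # activations; workload-dependent
print(f"weights : {weights_fp8 / 1e12:.2f} TB")   # 1.04 TB
print(f"node    : {h200_node / 1e12:.2f} TB")     # 1.13 TB
print(f"headroom: {(h200_node - weights_fp8 - kv_overhead) / 1e9:.0f} GB")
```

The headroom comes out to a few tens of gigabytes, which is why the 8xH200 configuration is quoted as the practical minimum for the unquantized checkpoint.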
The officially supported inference engines are vLLM, SGLang, KTransformers, and TensorRT-LLM. Self-hosting guides from the K2.5 and K2.6 releases describe vLLM as the lowest-friction option for OpenAI-compatible serving and SGLang as the fastest for high-concurrency batch workloads.
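For the offline path, a minimal vLLM sketch looks like this, assuming a node with enough GPU memory for the FP8 checkpoint (see the estimate above):

```python
# Minimal vLLM offline-inference sketch for K2 (hardware assumptions as above).
from vllm import LLM, SamplingParams

llm = LLM(
    model="moonshotai/Kimi-K2-Instruct",
    tensor_parallel_size=8,        # spread the experts and weights over 8 GPUs
    trust_remote_code=True,        # K2 ships custom model/tokenizer code on the Hub
)
out = llm.generate(
    ["Write a function that merges two sorted lists."],
    SamplingParams(temperature=0.6, max_tokens=512),
)
print(out[0].outputs[0].text)
```

The same checkpoint can instead be exposed as an OpenAI-compatible server with vLLM's serving mode, which is the configuration the self-hosting guides describe as lowest-friction.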
Kimi K2 competed primarily in the same tier as DeepSeek V3 and Qwen 3 among open-weights models. As the K2 family progressed and peers shipped their own updates, the comparison sharpened.
| Model | SWE-bench Verified | LiveCodeBench v6 | GPQA-Diamond | MMLU |
|---|---|---|---|---|
| Kimi K2 Instruct | 65.8% | 53.7% | 75.1% | 89.5% |
| DeepSeek V3 | ~49% | 46.9% | ~65% | ~88% |
| Qwen 3 235B (non-thinking) | ~55% | ~48% | ~68% | ~89% |
| GPT-4.1 | -- | 44.7% | -- | -- |
By May 2026, Artificial Analysis Intelligence Index scores tell the story more clearly than any single benchmark. Open-weights models clustered as follows:
| Model | Intelligence Index | Notes |
|---|---|---|
| Kimi K2.6 (reasoning) | 54 | Top open-weights model |
| DeepSeek V4 Pro (max reasoning) | 52 | Released early 2026 |
| Qwen 3.6 Max Preview (max reasoning) | 52 | Alibaba |
| GLM-5.1 (reasoning) | ~53 (Code Arena Elo 1,534) | Zhipu AI |
| Claude Opus 4.7 (max reasoning) | ~58 | Closed |
| GPT-5.5 (xhigh reasoning) | 60 | Closed, OpenAI |
| Gemini 3.1 Pro Preview (reasoning) | ~57 | Closed, Google |
The pattern across 2025 and 2026 is consistent: K2 holds a narrow lead among open-weights models, particularly on agentic and coding evaluations, while the leading closed models from OpenAI, Anthropic, and Google maintain a gap of roughly 5-10 points on the index. Nathan Lambert estimated this gap as roughly 4-6 months of model development time, while noting that the practical importance of the gap depends on what closed-model access actually buys you.
| Model | Total params | Active params | Experts | Active experts | Sparsity | Attention |
|---|---|---|---|---|---|---|
| Kimi K2 | 1.04T | 32.6B | 384 | 8 | 48 | MLA, 64 heads |
| DeepSeek V3 | 671B | 37B | 256 | 8 | 32 | MLA, 128 heads |
| Qwen 3 235B | 235B | ~22B | 128 | 8 | 16 | GQA |
| GLM-4.6 | ~355B | ~45B | varies | varies | -- | MLA |
| Mixtral 8x22B | 141B | 39B | 8 | 2 | 4 | GQA |
Kimi K2 and DeepSeek V3 share the most architecturally: both use MoE with MLA attention and were trained with Muon-family optimizers. K2's higher sparsity (48 versus 32) and larger total parameter count distinguish the two. K2's choice of 64 attention heads versus V3's 128 reflects the inference-speed tradeoff for agentic workloads.
The community response to Kimi K2's release was broadly positive, particularly among developers interested in open-weights alternatives to proprietary frontier models. The model reached number one on Hugging Face's trending list within 24 hours and accumulated strong rankings on the LMSYS Chatbot Arena.
VentureBeat covered the original K2 launch with a piece titled "Moonshot AI's Kimi K2 outperforms GPT-4 in key benchmarks, and it's free," emphasizing the cost asymmetry against proprietary peers. HPCwire framed it as "China's Moonshot AI Releases Trillion Parameter Model." Hugging Face's own blog ran an explainer titled "5 Things You Need to Know About Moonshot AI and Kimi K2." CNBC covered the K2 Thinking release in November, focusing on its performance against GPT-5 and Claude Sonnet 4.5 on specific benchmarks. TechCrunch covered the Cursor Composer 2 controversy in March 2026, which surfaced the K2.5 lineage as the base model.
Nathan Lambert at Interconnects has written multiple posts on the K2 family, including "5 Thoughts on Kimi K2 Thinking," arguing that K2 Thinking's ability to execute 200 to 300 sequential tool calls was a meaningful capability milestone for the open-weights ecosystem.
Kimi K2's training pipeline focused on tool-use demonstrations and real-environment interactions, which translated into wide adoption in developer-facing agentic frameworks. Documented integrations include Claude Code, Cline, and Roo Code (as drop-in backends via the 0905 release's OpenAI- and Anthropic-compatible interfaces) and Moonshot's own Kimi Code CLI.
A recurring theme across early adopters was cost. In one reported case, a startup swapped Claude for K2 via an OpenAI-compatible proxy and cut its monthly AI bill by over 90% while still meeting its quality requirements. Cloudflare reported a 77% cost reduction from switching to K2.5 for agent workloads on Workers AI. K2.6 inference lands at roughly 12% of Claude Opus 4.7's per-token cost, thanks to the sparse MoE architecture and aggressive provider competition.
For approximately 80% of standard developer tasks (code generation, unit tests, refactors, UI prototyping), K2.6 delivers 80 to 90% of leading closed-model quality at about 12% of the cost. The remaining 20% (long-horizon planning, novel research, tasks where the closed models' superior calibration and instruction-following matter) still favor closed models.
The NIST CAISI evaluation in December 2025 noted that K2 Thinking still trailed leading U.S. models on the four domains tested (cyber, software engineering, scientific knowledge, mathematical reasoning). The report also highlighted high refusal rates in Chinese language usage, comparable to DeepSeek R1-0528, while finding the model relatively uncensored in English, Spanish, and Arabic. For Western enterprise deployment this is largely a non-issue; for Chinese consumer applications it shapes which questions the model will engage.
Some practitioners noted that K2's inference speed of approximately 37 tokens per second was on the lower end for an open-weights non-reasoning model, with the median for comparable models around 53 tokens per second at launch. The gap narrowed in later K2 family releases as providers tuned their inference stacks and Moonshot shipped INT4 weights for K2 Thinking.
Moonshot positioned Kimi K2 primarily for agentic and developer-facing applications, and it performs best on tasks requiring sequential tool calls, code generation, and autonomous problem-solving. Documented use cases include long autonomous coding sessions (Moonshot demonstrated 12-to-13-hour runs with K2.6), multi-day agent swarm workloads, agentic web research, and backend duty for coding scaffolds such as Claude Code, Cline, and Roo Code.
Developers can deploy K2 locally using vLLM or SGLang on systems with sufficient GPU memory for the quantized model, or access it through the Moonshot API, OpenRouter, Fireworks, DeepInfra, Together AI, and other providers.
On third-party routing platforms such as OpenRouter, launch pricing was approximately $0.57 per million input tokens and $2.30 per million output tokens, slightly above Moonshot's direct API rates noted earlier.
By April 2026 (K2.6 era), provider pricing had compressed further. The most affordable providers for K2.6 by blended price were Parasail at $1.15 per million tokens, DeepInfra at $1.44 per million tokens (FP4), and Fireworks at $1.71 per million tokens. Time-to-first-token favored Fireworks (0.72s), DeepInfra FP4 (0.76s), and Together.ai FP4 (0.80s). Output throughput leaders were Clarifai (around 141 t/s), Azure (around 98 t/s), and Fireworks (around 81 t/s).
At launch, Kimi K2 Instruct had several documented limitations, some of which were addressed in later variants: no extended chain-of-thought reasoning (added with K2 Thinking), a 128K context window (doubled to 256K by the 0905 update), inference throughput of roughly 37 tokens per second that trailed comparable open models, and tool-calling reliability that the 0905 release specifically targeted.