Kimi K2 is a large-scale Mixture of Experts (MoE) language model developed by Moonshot AI, a Beijing-based AI startup. Released on July 11, 2025, as an open-weights model under a Modified MIT License, Kimi K2 features 1.04 trillion total parameters with 32.6 billion activated per token. The model was designed with agentic applications as its primary focus, targeting autonomous task execution, multi-step tool use, and software engineering workflows. At launch it was the highest-ranked open-weights model on the LMSYS Chatbot Arena, placing fifth overall among proprietary and open models alike on the strength of more than 3,000 user votes.
The K2 release matters for two overlapping reasons. The first is architectural: the model demonstrated that a sparse transformer at trillion-parameter scale could be trained without loss spikes using a new optimizer called MuonClip. The second is licensing: the weights were released under terms permissive enough that downstream labs, including Cursor, later trained their own commercial coding models on top of K2 derivatives. By the time the K2.6 update shipped in April 2026, the broader Kimi K2 family had become the de facto reference open-weights agentic model and the only open release routinely compared against Claude Opus 4, GPT-5, and Gemini 3 in agentic coding evaluations.
Moonshot AI was founded in March 2023 by Yang Zhilin, Zhou Xinyu, and Wu Yuxin, three Tsinghua University alumni. Yang chose the company name as a reference to Pink Floyd's album The Dark Side of the Moon, his favorite record. The company's stated goal is to build foundation models toward artificial general intelligence, with Yang identifying three milestones: long context length, a multimodal world model, and a scalable general architecture capable of continuous self-improvement.
Moonshot raised $60 million in its initial funding round at a $300 million valuation. In February 2024, Alibaba Group led a $1 billion round that brought the valuation to $2.5 billion. Alibaba acquired approximately 36% of the company. By 2025, a further $2 billion round led by Meituan Dragon Ball pushed the post-money valuation past $20 billion, making Moonshot the most heavily funded large language model startup in China with over $3.9 billion raised.
Moonshot's consumer-facing product is Kimi, a chatbot first released in October 2023. The initial version could process up to 200,000 Chinese characters per conversation, a context window that set it apart from most contemporaries. By March 2024, Moonshot claimed Kimi could handle 2 million Chinese characters in a single prompt.
The model lineage leading to K2 ran from the original Kimi chatbot models through the K1.5 reasoning model. K1.5 and K2 were designed for different use cases within the Moonshot product ecosystem: K1.5 targeted reasoning-intensive and multimodal tasks, while K2 emphasized automation, software development, and scalable deployment for developers.
Kimi K2 uses a sparse Mixture of Experts (MoE) transformer architecture. Rather than activating all model parameters for every input token, the MoE design routes each token to a subset of specialized sub-networks called experts. This means the model has 1.04 trillion total parameters but only activates 32.6 billion per token during inference, making the per-token computation cost comparable to a much smaller dense model while still benefiting from the knowledge encoded in the full parameter count.
The routing is controlled by a learned gating network. For each token, the network selects the top 8 experts from a pool of 384 total experts, plus one shared expert that is always active. This gives a sparsity ratio of 48 (384 divided by 8), which is higher than the sparsity used in DeepSeek-V3. Moonshot chose this higher sparsity to reduce overfitting and improve expert specialization. In their ablations, they found that increasing the expert count from 256 (the V3 figure) to 384 produced clean gains on knowledge benchmarks at a fixed activated parameter budget.
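A minimal PyTorch sketch of this routing pattern, using the published K2 hyperparameters (384 routed experts, top-8 selection, one always-on shared expert), is shown below. The plain linear gate, softmax renormalization over the selected experts, and simple SiLU expert MLPs are illustrative assumptions; Moonshot's actual gating and SwiGLU expert implementation differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Top-k expert routing with K2-like hyperparameters (illustrative sketch)."""

    def __init__(self, d_model=7168, d_expert=2048, n_experts=384, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # learned router
        def make_mlp():  # SiLU MLP stands in for the SwiGLU experts
            return nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(),
                                 nn.Linear(d_expert, d_model))
        self.experts = nn.ModuleList(make_mlp() for _ in range(n_experts))
        self.shared_expert = make_mlp()  # always active for every token

    def forward(self, x):                  # x: (n_tokens, d_model)
        scores = self.gate(x)              # (n_tokens, 384)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)  # 8 of 384: sparsity 48
        weights = F.softmax(top_vals, dim=-1)                # renormalize over top-8
        routed = torch.zeros_like(x)
        for t in range(x.size(0)):         # naive per-token loop, for clarity only
            for s in range(self.top_k):
                e = int(top_idx[t, s])
                routed[t] = routed[t] + weights[t, s] * self.experts[e](x[t])
        return self.shared_expert(x) + routed
```

Production kernels group tokens by expert rather than looping per token, but the routing math is the same: only 8 of the 384 expert MLPs run for any given token.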
The detailed architecture specifications are:
| Parameter | Value |
|---|---|
| Total parameters | 1.04 trillion |
| Activated parameters per token | 32.6 billion |
| Transformer layers | 61 (including 1 dense layer) |
| Total experts | 384 |
| Active experts per token | 8 |
| Shared experts | 1 |
| Attention heads | 64 |
| Hidden dimension | 7,168 |
| Expert hidden dimension | 2,048 |
| Context length | 128K tokens (K2) / 256K tokens (0905 and later) |
| Vocabulary size | 160K |
| Attention mechanism | Multi-head Latent Attention (MLA) |
| Activation function | SwiGLU |
| Storage format | Block-FP8 |
Kimi K2 uses Multi-head Latent Attention (MLA), the same attention variant introduced by DeepSeek in DeepSeek-V2. MLA compresses the key and value projections into low-dimensional latent vectors, substantially reducing the size of the KV cache during inference. This is particularly valuable for agentic use cases that involve long conversations and many sequential tool calls, where standard multi-head attention would accumulate large caches.
Kimi K2 uses 64 attention heads, compared to 128 in DeepSeek-V3. Moonshot made this choice deliberately to prioritize inference speed for agentic tasks, accepting a modest reduction in attention capacity in exchange for lower latency. The technical report describes this as an explicit tradeoff: agentic workloads tend to read long sequences once, then emit short structured outputs (tool calls, reasoning steps), so the marginal value of more heads is lower than in pure long-document understanding.
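The cache savings are easy to quantify with a back-of-envelope calculation. The sketch below compares per-token cache sizes for standard multi-head attention against an MLA-style compressed cache. The latent dimensions (a 512-dim compressed KV vector plus a 64-dim decoupled rotary key) are DeepSeek-V2's published values, and the even head-dimension split is an assumption, so the numbers illustrate the mechanism rather than K2's exact footprint.

```python
# Back-of-envelope KV-cache comparison, per token, in BF16 (2 bytes/element).
layers    = 61
heads     = 64
head_dim  = 7168 // heads   # 112, assuming hidden dim split evenly across heads
bytes_per = 2               # BF16

# Standard MHA caches a full K and V vector per head per layer.
mha_per_token = 2 * layers * heads * head_dim * bytes_per
# MLA caches one shared latent (assumed 512 dims) plus a rotary key (64 dims).
mla_per_token = layers * (512 + 64) * bytes_per

print(f"MHA : {mha_per_token / 1024:.0f} KiB/token")     # ~1708 KiB
print(f"MLA : {mla_per_token / 1024:.0f} KiB/token")     # ~69 KiB
print(f"ratio: {mha_per_token / mla_per_token:.0f}x")    # ~25x smaller
```

At a 256K-token context, a ratio of this magnitude is the difference between a cache measured in hundreds of gigabytes and one measured in tens, which is exactly where long agentic sessions live.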
The pre-training context length was extended progressively. Training began at 4,096 tokens, then expanded to 32K tokens, and finally to 128K tokens during an annealing phase using the YaRN method (Yet Another RoPE extensioN). The 128K token context window is sufficient for most software development and document analysis tasks. The September 2025 0905 update doubled this to 256K tokens, which became standard for every subsequent K2 release.
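A simplified sketch of YaRN-style frequency adjustment is shown below. The band thresholds (alpha=1, beta=32) are the YaRN paper's defaults, not values confirmed for K2, and the small attention-temperature correction that full YaRN also applies is omitted for brevity.

```python
import numpy as np

def yarn_frequencies(dim=128, base=10000.0, orig_len=32768, scale=4.0,
                     alpha=1.0, beta=32.0):
    """YaRN-style RoPE frequency adjustment (simplified sketch).

    Dimensions that rotate many times within the original context are left
    untouched; dimensions that rotate less than once are fully interpolated
    (divided by `scale`); dimensions in between get a linear blend.
    """
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # standard RoPE frequencies
    n_rot = orig_len * inv_freq / (2 * np.pi)         # rotations over orig context
    ramp = np.clip((n_rot - alpha) / (beta - alpha), 0.0, 1.0)
    # ramp=1 -> keep frequency; ramp=0 -> full position interpolation (freq/scale)
    return inv_freq * (ramp + (1.0 - ramp) / scale)

# The final 32K -> 128K extension in K2's schedule is a 4x scale:
new_freqs = yarn_frequencies(orig_len=32768, scale=4.0)
```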
Kimi K2 was pre-trained on 15.5 trillion tokens spanning four primary domains: web text, code, mathematics, and knowledge-intensive content. The training corpus covered both English and Chinese, with additional multilingual coverage.
A notable data preparation technique involved rephrasing rather than repeating data. For the knowledge and mathematics domains, Moonshot generated stylistically diverse reformulations of high-quality source material using chunk-wise autoregressive generation with fidelity verification. An ablation comparing ten rephrased versions of a dataset against ten training epochs on the original showed the rephrased approach scoring 28.94% on SimpleQA versus 23.76% for the repeated-epoch approach, a substantial improvement in factual retention.
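The pipeline can be pictured as the sketch below, in which `call_llm` is a hypothetical stand-in for any instruction-tuned rewriter and the prompts are illustrative; Moonshot's actual generation and fidelity-verification prompts are not public.

```python
# Sketch of chunk-wise rephrasing with fidelity verification. `call_llm` is a
# hypothetical placeholder, not Moonshot's pipeline.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for any instruction-tuned model")

def rephrase_document(text: str, chunk_size: int = 2000) -> str | None:
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    rewritten, context = [], ""
    for chunk in chunks:
        # Chunk-wise autoregressive generation: each rewrite conditions on the
        # rewritten text so far, keeping style consistent across chunk boundaries.
        out = call_llm(
            "Rewrite the passage in a fresh style, preserving every fact.\n"
            f"Rewritten so far:\n{context}\n\nPassage:\n{chunk}")
        rewritten.append(out)
        context = out
    candidate = "".join(rewritten)
    # Fidelity verification: an LLM judge checks semantic equivalence and the
    # sample is dropped if any facts drifted during rewriting.
    verdict = call_llm(
        "Do these two texts state the same facts? Answer YES or NO.\n"
        f"A:\n{text}\n\nB:\n{candidate}")
    return candidate if verdict.strip().upper().startswith("YES") else None
```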
For mathematics data specifically, text was converted to a "learning note" style and high-quality materials were translated into English to expand the language coverage of mathematical reasoning examples.
One of the most technically significant contributions of Kimi K2 is the MuonClip optimizer. Standard adaptive optimizers such as AdamW become increasingly difficult to tune at trillion-parameter scale. The Muon optimizer, which applies matrix orthogonalization (Newton-Schulz iterations) to gradient updates, had shown strong token efficiency on smaller models. Moonshot's earlier work on a 16B-total/3B-active MoE model called Moonlight had applied Muon successfully at that scale. However, scaling Muon to a trillion parameters introduced a new failure mode: exploding attention logits that destabilized training.
Without intervention, training with spectral-norm-constrained optimizers like Muon may see attention logits grow beyond 1,000, resulting in loss spikes and possibly catastrophic instability. MuonClip addresses this with a QK-Clip mechanism. The technique rescales query and key projection weights on a per-head basis to prevent attention logit explosion. Rather than applying a global clipping threshold, it uses per-head scaling factors. For the Multi-head Latent Attention implementation in K2, it clips head-specific components (qC, kC, qR) while preserving shared rotary components. The clipping threshold was set to 100 for K2's training run.
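In pseudocode terms, QK-Clip is a small hook that runs after each optimizer step. The sketch below simplifies to plain per-head query/key projections (the `q_proj`/`k_proj` attribute names and layout are assumptions); K2's version operates on the MLA decomposition described above.

```python
import torch

@torch.no_grad()
def qk_clip(attn_layer, max_logit_per_head, tau=100.0):
    """QK-Clip sketch: rescale per-head Q/K weights after the optimizer step
    so the maximum attention logit stays below `tau` (100 in K2's run).

    `max_logit_per_head` holds the per-head maximum of q.k / sqrt(d) observed
    in the forward pass. The rescale is split evenly between Q and K so the
    product of the two projections shrinks by exactly tau / max_logit.
    """
    H, d = attn_layer.num_heads, attn_layer.head_dim
    Wq = attn_layer.q_proj.weight.view(H, d, -1)  # (heads, head_dim, hidden)
    Wk = attn_layer.k_proj.weight.view(H, d, -1)
    for h in range(H):
        m = float(max_logit_per_head[h])
        if m > tau:
            gamma = (tau / m) ** 0.5  # sqrt-split across the Q and K weights
            Wq[h].mul_(gamma)
            Wk[h].mul_(gamma)
```

Because the scaling is per-head, only the heads whose logits actually explode are touched; well-behaved heads train unmodified.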
The result was a 15.5 trillion token pre-training run with zero loss spikes. Maximum attention logits stabilized after approximately 30% of the training steps, and the run completed without any training restarts. Compared to AdamW on equivalent hardware, MuonClip is reported to roughly double compute efficiency (the loss reached per FLOP), a meaningful saving when each step touches 67 million tokens.
The headline training-configuration details are the ones already noted: a global batch of roughly 67 million tokens per step and a QK-Clip threshold of 100, with the remainder of the setup documented in the technical report.
One outcome of the MuonClip release is its influence on other open-weights labs. The Muon family of optimizers, which originated as a niche choice in Keller Jordan's research, gained significant credibility from Moonshot's trillion-parameter result, and follow-on training runs at competing labs have since cited MuonClip-style stabilization in their technical reports.
Post-training proceeded through two main stages: supervised fine-tuning and reinforcement learning.
The supervised fine-tuning stage used a large-scale agentic data synthesis pipeline. Moonshot generated tool-use demonstrations involving over 3,000 real MCP (Model Context Protocol) tools and more than 20,000 synthetic tools. Multi-turn trajectories were generated in both simulated and real execution environments, with models interacting with actual tool APIs to produce training examples grounded in realistic behavior.
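A synthesized trajectory in this style might look like the following; the tool schema and message format are hypothetical illustrations loosely following MCP conventions, not Moonshot's actual training format.

```python
# Illustrative shape of a synthesized agentic training example. Names and
# fields are hypothetical; the schema loosely follows MCP conventions.
trajectory = {
    "tools": [{
        "name": "run_tests",
        "description": "Run the project's test suite and return failures.",
        "input_schema": {"type": "object",
                         "properties": {"path": {"type": "string"}},
                         "required": ["path"]},
    }],
    "messages": [
        {"role": "user", "content": "Fix the failing test in utils/."},
        {"role": "assistant", "tool_call": {"name": "run_tests",
                                            "arguments": {"path": "utils/"}}},
        {"role": "tool", "name": "run_tests",
         "content": "1 failure: test_parse_date expects ISO-8601"},
        {"role": "assistant", "content": "The parser drops the timezone; patching..."},
    ],
}
```

Grounding examples like this in real tool execution, rather than imagined tool outputs, is what keeps the model's tool-call behavior calibrated at inference time.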
The reinforcement learning stage used a Verifiable Rewards Gym covering mathematics and STEM tasks, logic puzzles, instruction following, faithfulness, coding, and safety. For subjective tasks where there is no single correct answer, a Self-Critique Rubric Reward mechanism provided training signal by having the model evaluate its own outputs against defined rubrics. Budget control was applied to manage response length. PTX loss (preserving pre-training gradients) was included to prevent catastrophic forgetting of pre-training knowledge. The temperature schedule during RL started high to encourage exploration and decayed to promote exploitation as training converged.
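For the verifiable slice, rewards reduce to programmatic checks. A minimal sketch for the mathematics domain, using symbolic equality as the checker, might look like this (the real gym uses task-specific verifiers across all the listed domains):

```python
from sympy import simplify, sympify

def math_reward(model_answer: str, reference: str) -> float:
    """Verifiable reward sketch: 1.0 if the model's final answer is
    symbolically equal to the reference, else 0.0."""
    try:
        return float(simplify(sympify(model_answer) - sympify(reference)) == 0)
    except Exception:
        return 0.0  # unparseable answers earn no reward

assert math_reward("2/4", "0.5") == 1.0
assert math_reward("x + 1", "x + 2") == 0.0
```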
The combination of agentic synthetic data and verifiable-reward RL is what distinguishes K2 from its closest open peers. DeepSeek's V3 and R1 models lean more on math and code as a primary signal source; K2's training pipeline treats tool-use trajectories as a first-class objective. This shows up in the benchmark results: K2's gap over peers is widest on Tau-bench and SWE-bench, narrower on pure math and coding benchmarks.
Moonshot released Kimi K2 weights on July 11, 2025, under a Modified MIT License. Both the base model checkpoint (Kimi-K2-Base) and the post-trained checkpoint (Kimi-K2-Instruct) were published on Hugging Face at moonshotai/Kimi-K2-Base and moonshotai/Kimi-K2-Instruct. Weights were released in Block-FP8 format.
The Modified MIT License is a standard MIT License with one additional clause. If a commercial product or service exceeds either 100 million monthly active users or 20 million US dollars in monthly revenue, the operator must prominently display "Kimi K2" on the user interface. Below those thresholds, all standard MIT permissions apply: free use, modification, redistribution, and commercial deployment without attribution. This puts K2 in a similar licensing category to DeepSeek V3 and R1, which helped drive adoption by developers who wanted to self-host or fine-tune without restrictive commercial terms.
Kimi K2 became the top-trending model on Hugging Face within 24 hours of release. The technical report was submitted to arXiv on July 28, 2025 (arXiv:2507.20534).
For deployment, Moonshot documented compatibility with vLLM, SGLang, KTransformers, and TensorRT-LLM inference frameworks. Quantized versions for llama.cpp, Ollama, LM Studio, and Jan were made available through the community.
The Kimi K2 family grew quickly after the initial release. Each variant kept the same core MoE architecture (1T total / 32B active) but adjusted post-training, context length, multimodality, and quantization.
| Variant | Released | Context | Notable changes | License |
|---|---|---|---|---|
| Kimi-K2-Base | July 11, 2025 | 128K | Raw pre-trained foundation, no instruction tuning | Modified MIT |
| Kimi-K2-Instruct | July 11, 2025 | 128K | SFT + RL, reflex-grade (no chain-of-thought) | Modified MIT |
| Kimi-K2-Instruct-0905 | September 5, 2025 | 256K | Improved coding and tool calling, expanded context | Modified MIT |
| Kimi-K2-Thinking | November 6, 2025 | 256K | End-to-end RL on reasoning, 200-300 tool calls per task, native INT4 | Modified MIT |
| Kimi-K2.5 | January 27, 2026 | 256K | Native multimodal via MoonViT, Agent Swarm (100 sub-agents) | Modified MIT |
| Kimi-K2.6 | April 20, 2026 | 256K | Agent Swarm 300 sub-agents, 4,000 step coordination, video input | Modified MIT |
Kimi-K2-Base is the raw pre-trained foundation model without instruction tuning or post-training. It is intended for researchers and developers who want to apply their own fine-tuning for specialized tasks. The base model is available on Hugging Face and supports the same deployment frameworks as the instruct variant. It became the most-forked open base model in the second half of 2025, with notable derivatives including Cursor's Composer 2 (which started from the later Kimi-K2.5 base, part of the same lineage).
Kimi-K2-Instruct is the post-trained model released for general use. It has undergone supervised fine-tuning on agentic data and reinforcement learning as described above. The default system prompt is "You are Kimi, an AI assistant created by Moonshot AI." This is the variant typically accessed through the Moonshot API and the Kimi.com chat interface.
The instruct model does not include extended chain-of-thought reasoning. Moonshot classified it as a "reflex-grade" model, meaning it generates responses directly without an internal scratchpad or thinking phase. This keeps latency lower but limits performance on problems that benefit from step-by-step deliberation.
The September 5, 2025 update of K2 Instruct kept the underlying architecture and weights structure intact while improving coding capability, tool-calling reliability, and frontend code aesthetics. The most visible change was a doubled context window of 256K tokens (262,144 to be exact). Moonshot also tuned compatibility with downstream agentic scaffolds: the 0905 model works as a drop-in backend for Claude Code, Cline, and Roo Code through their OpenAI- or Anthropic-compatible interfaces.
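In practice, drop-in use looks like pointing an Anthropic-style client at Moonshot's compatible endpoint. The base URL and model ID below follow Moonshot's documented conventions but are assumptions that should be verified against current platform docs:

```python
# Using the 0905 model through an Anthropic-style client. Note that on this
# endpoint temperatures are remapped automatically (effective = requested * 0.6).
from anthropic import Anthropic

client = Anthropic(
    api_key="YOUR_MOONSHOT_KEY",
    base_url="https://api.moonshot.ai/anthropic",  # assumption: check the docs
)
resp = client.messages.create(
    model="kimi-k2-0905-preview",                  # assumption: check model list
    max_tokens=1024,
    messages=[{"role": "user", "content": "Refactor this function for clarity: ..."}],
)
print(resp.content[0].text)
```

Scaffolds such as Claude Code typically reach endpoints like this through their base-URL environment settings rather than code changes, which is what makes the 0905 model a drop-in backend.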
On LiveCodeBench v6, the 0905 release scored higher than the original K2 Instruct, primarily because of additional RL on coding tasks. The updated model card and weights were published under the moonshotai/Kimi-K2-Instruct-0905 namespace on Hugging Face.
Kimi K2 Thinking, released on November 6, 2025, extends the K2 architecture with end-to-end RL training of chain-of-thought reasoning interleaved with function calls. The model was trained to maintain coherent behavior across 200 to 300 consecutive tool invocations, a capability that Nathan Lambert of Interconnects noted was "previously limited to closed-source models." The context window was held at 256K tokens.
A distinctive feature is native INT4 quantization via Quantization-Aware Training. Most quantized variants of large open models are produced post-hoc; K2 Thinking was trained with quantization in mind, which Moonshot reports yields roughly a 2x speedup in low-latency mode and substantial GPU memory savings, with no measurable quality loss versus the FP8 reference.
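The core mechanism in quantization-aware training is fake quantization with a straight-through estimator: the forward pass sees INT4-rounded weights while gradients flow as if the weights were unquantized. The sketch below is a generic version of that trick, not Moonshot's specific recipe:

```python
import torch

def fake_quant_int4(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Straight-through INT4 fake quantization (generic QAT sketch).

    Weights are rounded to 4-bit levels with a per-group scale in the forward
    pass; the `.detach()` trick lets gradients bypass the rounding.
    Assumes w.numel() is divisible by group_size.
    """
    orig_shape = w.shape
    g = w.reshape(-1, group_size)                                   # per-group scaling
    scale = g.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0  # int4: [-8, 7]
    q = torch.clamp(torch.round(g / scale), -8, 7) * scale
    return w + (q.reshape(orig_shape) - w).detach()  # straight-through estimator
```

Training through this rounding is what lets the released INT4 weights match the FP8 reference, instead of absorbing the usual post-hoc quantization loss.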
Benchmark scores for Kimi K2 Thinking:
| Benchmark | Score |
|---|---|
| AIME 2025 (with Python) | 99.1% |
| HMMT 2025 (with Python) | 95.1% |
| GPQA-Diamond | 84.5% |
| MMLU-Pro | 84.6% |
| MMLU-Redux | 94.4% |
| SWE-bench Verified | 71.3% |
| SWE-bench Multilingual | 61.1% |
| LiveCodeBench v6 | 83.1% |
| BrowseComp | 60.2% |
| BrowseComp-ZH | 62.3% |
| HLE (text-only with tools) | 44.9% |
| Tau2-Bench Telecom | 93% (independently measured) |
In December 2025 the U.S. Center for AI Standards and Innovation (CAISI) at NIST published an evaluation of Kimi K2 Thinking, calling it the most capable AI model from a PRC-based developer at the time of its release while noting that it still trailed leading U.S. models on cyber, software engineering, scientific knowledge, and mathematical reasoning. The same CAISI report observed that K2 Thinking remained "highly censored in Chinese," with refusal patterns similar to DeepSeek R1-0528, but was relatively uncensored in English, Spanish, and Arabic.
The CAISI report also flagged adoption: a month after release, K2 Thinking had been downloaded from Hugging Face roughly 10% as often as DeepSeek R1 had been a month after its release, and less than 5% as often as gpt-oss in the equivalent period. K2's mindshare was strong among power users and Chinese developers but had not yet matched DeepSeek's broader pull.
Released January 27, 2026, Kimi K2.5 added native multimodal capabilities to the K2 architecture. The key addition was MoonViT, a 400-million-parameter vision encoder developed internally at Moonshot. Unlike approaches that graft a separate vision adapter onto a text-only foundation, MoonViT was integrated natively and the model was trained on approximately 15 trillion mixed visual and text tokens. The result was vision and language capabilities that developed together rather than being combined post-hoc.
K2.5 introduced Agent Swarm functionality, allowing the model to coordinate up to 100 specialized sub-agents working in parallel. This was the first K2-series model to support image input natively. The architecture otherwise maintained the same 1 trillion total / 32 billion active parameter structure as K2, with context held at 256K tokens.
K2.5 became notable in March 2026 as the foundation Cursor used for its Composer 2 coding model. Cursor publicly acknowledged that Composer 2 started from the K2.5 base, with about three quarters of the final training compute coming from Cursor's own continued pretraining and reinforcement learning, and the remaining quarter from the Moonshot base. The episode was one of the more visible commercial integrations of an open Chinese model into a Western developer tool, and Moonshot publicly celebrated it as evidence of the open-weights ecosystem the company wants to support.
Kimi K2.6 was released on April 20, 2026. It expanded Agent Swarm capacity from 100 to 300 simultaneous sub-agents, increased maximum coordinated steps from 1,500 to 4,000, and added native video input (MP4, MOV, AVI, and WebM formats, recommended up to 2K resolution). The context window remained at 256K tokens.
K2.6 introduced two inference modes: Thinking (chain-of-thought) and Instant (low-latency). A "Skills" feature allowed users to convert PDFs and spreadsheets into reusable task templates for recurring workflows. Moonshot demonstrated 12 to 13 hour autonomous coding sessions with K2.6, and agent swarm runs spanning five days.
Benchmark scores for K2.6 at release:
| Benchmark | K2.6 | GPT-5.4 | Claude Opus 4.6 |
|---|---|---|---|
| SWE-Bench Pro | 58.6% | 57.7% | 53.4% |
| HLE with tools | 54.0% | 52.1% | 53.0% |
| BrowseComp | 86.3% | -- | -- |
On the Artificial Analysis Intelligence Index, K2.6 scored 54, leading all open-weights models and trailing GPT-5.5 (60), Claude Opus 4.7, and Gemini 3.1 Pro Preview. K2.6 was released under a Modified MIT License.
Moonshot publicly teased a Kimi K3 model in late March 2026, with members of the team describing K3 as built around longer context (up to 1 million tokens), larger parameter counts (potentially in the 3-4 trillion range), and incorporation of an architecture line called Kimi Linear for efficient long-context attention. As of May 2026, K3 has not been released, no firm release date has been announced, and most concrete details remain speculative. The K2 family continues to receive the bulk of Moonshot's development effort.
The following scores are from the Kimi-K2-Instruct model under non-thinking evaluation settings (no extended chain-of-thought reasoning).
Coding and agentic benchmarks:

| Benchmark | Score | Notes |
|---|---|---|
| SWE-bench Verified (agentic, single attempt) | 65.8% | |
| SWE-bench Verified (agentic, multiple attempts) | 71.6% | With parallel test-time compute |
| SWE-bench Multilingual | 47.3% | |
| Tau2-Bench | 66.1 (Pass@1) | Tool-use |
| ACEBench (English) | 76.5% | Tool-use |
| LiveCodeBench v6 | 53.7% | |
| OJBench | 27.1% | |
| MultiPL-E | 85.7% | Open-source state of the art at release |
Math and STEM benchmarks:

| Benchmark | Score | Notes |
|---|---|---|
| MATH-500 | 97.4% | |
| AIME 2024 | 69.6% (Avg@64) | |
| AIME 2025 | 49.5% (Avg@64) | |
| HMMT 2025 | 38.8% (Avg@32) | |
| GPQA-Diamond | 75.1% (Avg@8) |
Knowledge and instruction-following benchmarks:

| Benchmark | Score |
|---|---|
| MMLU | 89.5% |
| MMLU-Redux | 92.7% |
| MMLU-Pro | 81.1% |
| IFEval (Prompt Strict) | 89.8% |
| SimpleQA | 31.0% |
Long-context and reading-comprehension benchmarks:

| Benchmark | Score |
|---|---|
| DROP | 93.5% |
| MRCR | 55.0% |
| LongBench v2 | 49.1% |
On the LMSYS Chatbot Arena, Kimi K2 ranked first among open-source models and fifth overall (including closed proprietary models) based on over 3,000 user votes as of July 17, 2025. On LiveCodeBench v6, the 53.7% score was state of the art among non-reasoning models at release, surpassing GPT-4.1's 44.7% and DeepSeek V3's 46.9% on the same benchmark.
Kimi K2 is one of the most widely available open-weights models. Within weeks of the July 2025 release, every major third-party inference provider had a hosted version, and the K2 lineage continues to be a default offering across the industry.
Moonshot AI offers Kimi K2 through its commercial API at platform.moonshot.ai. The API is compatible with both OpenAI-style and Anthropic-style client libraries. For Anthropic-style clients, temperature values are remapped automatically (effective temperature equals the request temperature multiplied by 0.6). The recommended temperature for K2 is 0.6.
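A minimal OpenAI-style call against the Moonshot endpoint looks like the following; the model ID shown is an assumption based on launch-era naming and should be checked against the platform's model list:

```python
# Minimal OpenAI-style call to the Moonshot API (endpoint/model ID assumed).
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MOONSHOT_KEY",
    base_url="https://api.moonshot.ai/v1",
)
resp = client.chat.completions.create(
    model="kimi-k2-0711-preview",     # assumption: launch-era model ID
    temperature=0.6,                  # Moonshot's recommended setting for K2
    messages=[{"role": "user", "content": "Summarize the MuonClip optimizer."}],
)
print(resp.choices[0].message.content)
```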
Direct API pricing at launch was approximately $0.55 per million input tokens and $2.20 per million output tokens, with cached input at a discount. Kimi.com hosts the chat interface and Kimi App provides a mobile experience. The Kimi Code CLI, launched in January 2026, is the company's coding-focused command-line interface and crossed 6,400 GitHub stars by mid-2026 with K2.6 as its default backend.
The full provider list grew quickly through late 2025 and 2026. By April 2026, Artificial Analysis tracked 11 API providers offering K2.6 access. Major providers and their typical roles:
| Provider | Notes |
|---|---|
| OpenRouter | Aggregator routing requests to multiple K2 backends |
| Together AI | Hosts FP4 quantizations for cost efficiency |
| Fireworks AI | Low time-to-first-token (around 0.72s on K2.6); commercial partner of Moonshot |
| DeepInfra | FP4 hosting at among the lowest blended prices |
| Novita | Available through Hugging Face Inference Providers |
| Parasail | Lowest blended price at $1.15/1M tokens for K2.6 at launch |
| Cloudflare Workers AI | Added K2.5 in early 2026; reported 77% cost cut for some customers |
| Azure | Available through the Azure AI catalog |
| SiliconFlow | FP8 quantizations |
| Clarifai | Highest output throughput for K2.6 (around 141 tokens/sec) |
| Weights & Biases | Hosting integration alongside training observability |
Moonshot published a tool called K2 Vendor Verifier (K2VV) to help users compare the quality of tool-call outputs across these providers, since lossy quantization can degrade structured outputs in ways not visible in standard chat benchmarks.
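A stripped-down version of the kind of check K2VV performs might look like the sketch below, which sends an identical tool-call prompt to several providers and verifies that the returned arguments are valid JSON matching the schema. The provider endpoints are placeholders, and this illustrates the idea rather than K2VV's actual code:

```python
# K2VV-style check sketch: identical tool-call prompt, multiple providers,
# validate the structured output. Endpoints below are hypothetical.
import json
from openai import OpenAI

TOOL = {"type": "function", "function": {
    "name": "get_weather",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]}}}

providers = {
    "provider_a": ("https://provider-a.example/v1", "kimi-k2-instruct"),
    "provider_b": ("https://provider-b.example/v1", "kimi-k2-instruct"),
}

for name, (url, model) in providers.items():
    client = OpenAI(base_url=url, api_key="...")
    resp = client.chat.completions.create(
        model=model, tools=[TOOL],
        messages=[{"role": "user", "content": "Weather in Beijing?"}])
    calls = resp.choices[0].message.tool_calls or []
    ok = False
    if calls:
        try:
            args = json.loads(calls[0].function.arguments)
            ok = isinstance(args.get("city"), str)
        except json.JSONDecodeError:
            ok = False  # lossy quantization often surfaces as malformed JSON
    print(name, "valid tool call" if ok else "DEGRADED")
```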
Self-hosting the native Block-FP8 checkpoint requires roughly 1 TB of GPU memory once KV cache and activations are accounted for, which in practice means 8 H200 GPUs with 141 GB of memory each, or an equivalent multi-node setup. Quantized variants compress this dramatically: community GGUF quantizations from Unsloth and others reduce K2 to a size that runs on a high-end workstation, though with quality tradeoffs that are most visible in long agentic chains.
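The arithmetic behind that sizing, under stated assumptions, is straightforward:

```python
# Back-of-envelope memory budget for self-hosting K2 on one 8xH200 node.
params_total = 1.04e12
weights_fp8  = params_total * 1        # Block-FP8 ~1 byte/param -> ~1.04 TB
h200_node    = 8 * 141e9               # 8 x 141 GB = 1,128 GB

kv_overhead  = 0.05 * weights_fp8      # rough allowance for KV cache and
                                       # activations; workload-dependent
print(f"weights : {weights_fp8 / 1e12:.2f} TB")   # 1.04 TB
print(f"node    : {h200_node / 1e12:.2f} TB")     # 1.13 TB
print(f"headroom: {(h200_node - weights_fp8 - kv_overhead) / 1e9:.0f} GB")
```

The headroom comes out to a few tens of gigabytes, which is why the 8xH200 configuration is quoted as the practical minimum for the unquantized checkpoint.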
The officially supported inference engines are vLLM, SGLang, KTransformers, and TensorRT-LLM. Self-hosting guides from the K2.5 and K2.6 releases describe vLLM as the lowest-friction option for OpenAI-compatible serving and SGLang as the fastest for high-concurrency batch workloads.
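For the offline path, a minimal vLLM sketch looks like this, assuming a node with enough GPU memory for the FP8 checkpoint (see the estimate above):

```python
# Minimal vLLM offline-inference sketch for K2 (hardware assumptions as above).
from vllm import LLM, SamplingParams

llm = LLM(
    model="moonshotai/Kimi-K2-Instruct",
    tensor_parallel_size=8,        # spread the experts and weights over 8 GPUs
    trust_remote_code=True,        # K2 ships custom model/tokenizer code on the Hub
)
out = llm.generate(
    ["Write a function that merges two sorted lists."],
    SamplingParams(temperature=0.6, max_tokens=512),
)
print(out[0].outputs[0].text)
```

The same checkpoint can instead be exposed as an OpenAI-compatible server with vLLM's serving mode, which is the configuration the self-hosting guides describe as lowest-friction.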
Kimi K2 competed primarily in the same tier as DeepSeek V3 and Qwen 3 among open-weights models. As the K2 family progressed and peers shipped their own updates, the comparison sharpened.
| Model | SWE-bench Verified | LiveCodeBench v6 | GPQA-Diamond | MMLU |
|---|---|---|---|---|
| Kimi K2 Instruct | 65.8% | 53.7% | 75.1% | 89.5% |
| DeepSeek V3 | ~49% | 46.9% | ~65% | ~88% |
| Qwen 3 235B (non-thinking) | ~55% | ~48% | ~68% | ~89% |
| GPT-4.1 | -- | 44.7% | -- | -- |
By May 2026, Artificial Analysis Intelligence Index scores tell the story more clearly than any single benchmark. Open-weights models clustered as follows:
| Model | Intelligence Index | Notes |
|---|---|---|
| Kimi K2.6 (reasoning) | 54 | Top open-weights model |
| DeepSeek V4 Pro (max reasoning) | 52 | Released early 2026 |
| Qwen 3.6 Max Preview (max reasoning) | 52 | Alibaba |
| GLM-5.1 (reasoning) | ~53 (Code Arena Elo 1,534) | Zhipu AI |
| Claude Opus 4.7 (max reasoning) | ~58 | Closed |
| GPT-5.5 (xhigh reasoning) | 60 | Closed, OpenAI |
| Gemini 3.1 Pro Preview (reasoning) | ~57 | Closed, Google |
The pattern across 2025 and 2026 is consistent: K2 holds a narrow lead among open-weights models, particularly on agentic and coding evaluations, while the leading closed models from OpenAI, Anthropic, and Google maintain a gap of roughly 5-10 points on the index. Nathan Lambert estimated this gap as roughly 4-6 months of model development time, while noting that the practical importance of the gap depends on what closed-model access actually buys you.
| Model | Total params | Active params | Experts | Active experts | Sparsity | Attention |
|---|---|---|---|---|---|---|
| Kimi K2 | 1.04T | 32.6B | 384 | 8 | 48 | MLA, 64 heads |
| DeepSeek V3 | 671B | 37B | 256 | 8 | 32 | MLA, 128 heads |
| Qwen 3 235B | 235B | ~22B | 128 | 8 | 16 | GQA |
| GLM-4.6 | ~355B | ~45B | varies | varies | -- | MLA |
| Mixtral 8x22B | 141B | 39B | 8 | 2 | 4 | GQA |
Kimi K2 and DeepSeek V3 share the most architecturally: both use MoE with MLA attention and were trained with Muon-family optimizers. K2's higher sparsity (48 versus 32) and larger total parameter count distinguish the two. K2's choice of 64 attention heads versus V3's 128 reflects the inference-speed tradeoff for agentic workloads.
The community response to Kimi K2's release was broadly positive, particularly among developers interested in open-weights alternatives to proprietary frontier models. The model reached number one on Hugging Face's trending list within 24 hours and accumulated strong rankings on the LMSYS Chatbot Arena.
VentureBeat covered the original K2 launch with a piece titled "Moonshot AI's Kimi K2 outperforms GPT-4 in key benchmarks, and it's free," emphasizing the cost asymmetry against proprietary peers. HPCwire framed it as "China's Moonshot AI Releases Trillion Parameter Model." Hugging Face's own blog ran an explainer titled "5 Things You Need to Know About Moonshot AI and Kimi K2." CNBC covered the K2 Thinking release in November, focusing on its performance against GPT-5 and Claude Sonnet 4.5 on specific benchmarks. TechCrunch covered the Cursor Composer 2 controversy in March 2026, which surfaced the K2.5 lineage as the base model.
Nathan Lambert at Interconnects has written multiple posts on the K2 family, including "5 Thoughts on Kimi K2 Thinking," arguing that K2 Thinking's ability to execute 200 to 300 sequential tool calls was a meaningful capability milestone for the open-weights ecosystem.
Kimi K2's training pipeline focused on tool-use demonstrations and real-environment interactions, which translated into wide adoption in developer-facing agentic frameworks. Documented integrations include Claude Code, Cline, and Roo Code (as drop-in backends via the 0905 release's OpenAI- and Anthropic-compatible interfaces) and Moonshot's own Kimi Code CLI.
A recurring theme across early adopters was cost. In one reported case, a startup swapped Claude for K2 via an OpenAI-compatible proxy and cut its monthly AI bill by over 90% while still meeting its quality requirements. Cloudflare reported a 77% cost reduction from switching to K2.5 for agent workloads on Workers AI. K2.6 inference lands at roughly 12% of Claude Opus 4.7's per-token cost, thanks to the sparse MoE architecture and aggressive provider competition.
For approximately 80% of standard developer tasks (code generation, unit tests, refactors, UI prototyping), K2.6 delivers 80 to 90% of leading closed-model quality at about 12% of the cost. The remaining 20% (long-horizon planning, novel research, tasks where the closed models' superior calibration and instruction-following matter) still favor closed models.
The NIST CAISI evaluation in December 2025 noted that K2 Thinking still trailed leading U.S. models on the four domains tested (cyber, software engineering, scientific knowledge, mathematical reasoning). The report also highlighted high refusal rates in Chinese language usage, comparable to DeepSeek R1-0528, while finding the model relatively uncensored in English, Spanish, and Arabic. For Western enterprise deployment this is largely a non-issue; for Chinese consumer applications it shapes which questions the model will engage.
Some practitioners noted that K2's inference speed of approximately 37 tokens per second was on the lower end for an open-weights non-reasoning model, with the median for comparable models around 53 tokens per second at launch. The gap narrowed in later K2 family releases as providers tuned their inference stacks and Moonshot shipped INT4 weights for K2 Thinking.
Moonshot positioned Kimi K2 primarily for agentic and developer-facing applications, and it performs best on tasks requiring sequential tool calls, code generation, and autonomous problem-solving. Documented use cases include long autonomous coding sessions (Moonshot demonstrated 12-to-13-hour runs with K2.6), multi-day agent swarm workloads, agentic web research, and backend duty for coding scaffolds such as Claude Code, Cline, and Roo Code.
Developers can deploy K2 locally using vLLM or SGLang on systems with sufficient GPU memory for the quantized model, or access it through the Moonshot API, OpenRouter, Fireworks, DeepInfra, Together AI, and other providers.
On third-party routing platforms such as OpenRouter, launch pricing was approximately $0.57 per million input tokens and $2.30 per million output tokens, slightly above Moonshot's direct API rates noted earlier.
By April 2026 (K2.6 era), provider pricing had compressed further. The most affordable providers for K2.6 by blended price were Parasail at $1.15 per million tokens, DeepInfra at $1.44 per million tokens (FP4), and Fireworks at $1.71 per million tokens. Time-to-first-token favored Fireworks (0.72s), DeepInfra FP4 (0.76s), and Together.ai FP4 (0.80s). Output throughput leaders were Clarifai (around 141 t/s), Azure (around 98 t/s), and Fireworks (around 81 t/s).
At launch, Kimi K2 Instruct had several documented limitations, some of which were addressed in later variants: no extended chain-of-thought reasoning (added with K2 Thinking), a 128K context window (doubled to 256K by the 0905 update), inference throughput of roughly 37 tokens per second that trailed comparable open models, and tool-calling reliability that the 0905 release specifically targeted.