Kimi K2 Thinking
Last reviewed
May 31, 2026
Sources
14 citations
Review status
Source-backed
Revision
v1 ยท 2,439 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 31, 2026
Sources
14 citations
Review status
Source-backed
Revision
v1 ยท 2,439 words
Add missing citations, update stale details, or suggest a clearer explanation.
Kimi K2 Thinking is a reasoning and agentic large language model released by the Chinese startup Moonshot AI on November 6, 2025. It is a thinking variant built on top of the company's Kimi K2 foundation model, adding long chains of internal reasoning and the ability to plan and run long sequences of tool calls on its own. Moonshot published the weights openly on Hugging Face under the moonshotai account, and the model arrived with a detailed technical writeup describing how it interleaves step by step thought with function calls across hundreds of actions. [1][2][3]
The release drew attention well beyond the usual model launch coverage. On several agentic and reasoning benchmarks reported by Moonshot, Kimi K2 Thinking matched or edged ahead of frontier proprietary systems such as GPT-5 and Claude Sonnet 4.5, and it did so as a freely downloadable model. [4][5] A reported training cost figure of roughly 4.6 million US dollars circulated widely in the press, although Moonshot's leadership later said that number was not official. [6][7] The United States National Institute of Standards and Technology, through its Center for AI Standards and Innovation, published its own evaluation of the model in December 2025, which gave a more measured picture of where it stands against American systems. [8]
Moonshot AI is a Beijing based company founded in 2023 and backed by investors including Alibaba. Its consumer assistant is called Kimi, and the K2 line is the family of large models that powers it. [6][9] Kimi K2 itself shipped earlier in 2025 as a mixture of experts model aimed at general instruction following and agentic coding. Kimi K2 Thinking is the reasoning oriented sibling of that base model, post trained to produce explicit thought before it answers and to keep reasoning between each tool call rather than only at the start of a task. [1][3]
The relationship mirrors the split that DeepSeek drew between DeepSeek V3 and DeepSeek-R1, where a shared foundation model gained a separate reasoning focused release. K2 Thinking sits in the broader category of reasoning models, systems that spend extra compute at inference time on a visible or hidden scratchpad before committing to a final response. [3][5]
Kimi K2 Thinking keeps the architecture of the base K2 model. It is a sparse mixture of experts transformer with about one trillion total parameters, of which roughly 32 billion are active for any given token. The published configuration lists 61 layers, 384 routed experts with 8 selected per token plus 1 shared expert, a hidden size of 7168, and a vocabulary of about 160 thousand tokens. Attention uses a multi head latent attention scheme, and the feed forward blocks use a SwiGLU style activation. [2][10]
The model supports a context window of 256 thousand tokens, which gives it room to hold long tool transcripts, large documents, and extended reasoning traces in a single session. [1][4]
| Specification | Value |
|---|---|
| Developer | Moonshot AI |
| Release date | November 6, 2025 |
| Model type | Mixture of experts, reasoning and agentic |
| Total parameters | About 1 trillion |
| Active parameters per token | About 32 billion |
| Layers | 61 |
| Routed experts | 384 (8 active per token) |
| Shared experts | 1 |
| Hidden size | 7168 |
| Vocabulary | About 160 thousand tokens |
| Attention | Multi head latent attention |
| Context length | 256 thousand tokens |
| Native precision | INT4 (quantization aware training) |
| Weights | Open, on Hugging Face |
| License | Modified MIT |
A distinctive engineering choice is that K2 Thinking ships natively in INT4 rather than as a higher precision model that users quantize afterward. Moonshot applied quantization aware training to the mixture of experts components during the post training stage, so the four bit weights are part of the trained model rather than a lossy afterthought. The published configuration uses 4 bit integer weights in groups of 32, stored in the compressed tensors format. [2][3]
The payoff is efficiency. INT4 roughly halves the memory footprint compared with the FP8 checkpoints of earlier K2 releases, bringing the on disk size to around 594 gigabytes, and Moonshot reports close to a 2 times speedup in low latency generation. Because reasoning models emit many tokens of internal thought, faster and cheaper generation matters more here than for a plain chat model. Moonshot states that the published benchmark results were all measured under INT4 precision, so the reported quality reflects the quantized model rather than a separate full precision version. [2][3]
The headline capability of Kimi K2 Thinking is stable long horizon agency. The model was trained end to end to weave chain-of-thought reasoning together with function calling, so it can act as an autonomous agent that reasons, calls a tool, reflects on the result, and decides what to do next. Moonshot describes this pattern as interleaved thinking, where the model produces fresh reasoning between every tool use step instead of planning once and then executing blindly. [3][5]
Moonshot reports that the model can sustain coherent, goal directed behavior across roughly 200 to 300 consecutive tool calls without human intervention, where earlier systems tended to drift or lose the thread after about 30 to 50 steps. [1][3] That stamina is what makes it suited to work as one of the more capable open AI agents for tasks such as multi step web research, software repair, and long document synthesis. In the benchmark harness the agentic search tasks were allowed up to 300 steps with a reasoning budget of about 24 thousand tokens per step, while the harder exam style tasks with tools were capped near 120 steps with a larger per step budget. [4]
The model exposes its reasoning through a separate reasoning content field in the API response, and Moonshot offers an interface that is compatible with both the OpenAI and Anthropic style chat formats, which lowers the switching cost for developers who already build against those clients. [2][3]
Moonshot published a set of benchmark numbers comparing Kimi K2 Thinking with GPT-5 in its high reasoning setting and with Claude Sonnet 4.5 in its thinking setting. The table below reproduces figures from the official model card. Entries marked with an asterisk were reported by Moonshot using its own runs or reproductions rather than vendor published scores, and blank cells indicate a configuration the source did not report. All K2 Thinking numbers were measured at INT4 precision. [2][4]
| Benchmark | Setting | Kimi K2 Thinking | GPT-5 (high) | Claude Sonnet 4.5 (thinking) |
|---|---|---|---|---|
| Humanity's Last Exam (text) | with tools | 44.9 | 41.7* | 32.0* |
| Humanity's Last Exam (text) | heavy | 51.0 | 42.0 | |
| Humanity's Last Exam (text) | no tools | 23.9 | 26.3 | 19.8* |
| BrowseComp | with tools | 60.2 | 54.9 | 24.1 |
| BrowseComp-ZH | with tools | 62.3 | 63.0* | 42.4* |
| SWE-bench Verified | with tools | 71.3 | 74.9 | 77.2 |
| SWE-bench Multilingual | with tools | 61.1 | 55.3* | 68.0 |
| LiveCodeBench v6 | no tools | 83.1 | 87.0* | 64.0* |
| AIME 2025 | with python | 99.1 | 99.6 | 100.0 |
| HMMT 2025 | with python | 95.1 | 96.7 | 88.8* |
| GPQA | no tools | 84.5 | 85.7 | 83.4 |
| MMLU-Pro | 84.6 | 87.1 | 87.5 |
The pattern in these results is uneven by design. On agentic web search, K2 Thinking posts the strongest numbers of the three on BrowseComp, and on Humanity's Last Exam with tools it leads the reported field, with a further lift in a heavy mode that aggregates several reasoning trajectories. [4][5] On pure coding measured by SWE-bench Verified and on knowledge tests such as MMLU-Pro and GPQA it trails the two proprietary models, though by modest margins. Math contests such as AIME 2025 are close to saturated for all three when a Python tool is allowed. [2][4]
The strongest claim made around the launch was that K2 Thinking was the first open weight model to lead frontier proprietary systems on several of these agentic and reasoning benchmarks at the same time, rather than on any single isolated test. [5][11] Independent coverage noted one important caveat: the model is unusually verbose, generating more tokens than rivals on the same problems, which raises real inference cost and latency even though the per token price is low. [12]
In December 2025 the United States National Institute of Standards and Technology, working through its Center for AI Standards and Innovation, released an evaluation of Kimi K2 Thinking. The assessment covered cybersecurity, software engineering, scientific knowledge, and mathematical reasoning, using benchmark suites that included SWE-bench Verified, MMLU-Pro, GPQA, and several math and security tests. [8]
The evaluation reached a more cautious conclusion than the launch coverage. It found that across the tested domains, and in cyber and mathematical reasoning in particular, K2 Thinking was only a modest improvement over DeepSeek V3.1, and that its agentic performance in cybersecurity and software engineering remained below that of leading American models, including GPT-5 and Claude Opus 4. It also reported that the model is heavily censored when prompted in Chinese, with refusal patterns close to those of DeepSeek's R1 release, while staying relatively open in English, Spanish, and Arabic. The same report noted that, one month after release, K2 Thinking had seen far lower download volume than DeepSeek-R1 did at a comparable point. [8]
Moonshot released both the model weights and the supporting code under a modified MIT license. The terms are permissive in the usual MIT spirit, allowing commercial use, modification, and redistribution, with an added attribution clause that applies to very large deployments. Products that pass thresholds such as 100 million monthly active users or a high level of monthly revenue are asked to display Kimi attribution. For most developers and smaller companies the license behaves like a standard open license. [2][13]
The weights are distributed on Hugging Face in the compressed tensors INT4 format, and Moonshot documents support for serving stacks such as vLLM, SGLang, and KTransformers. The same checkpoints can be converted to FP8 or BF16 if a deployment prefers higher precision. The model is also reachable through Moonshot's hosted API and through third party providers. [2][3] Because it is released as open weights, K2 Thinking can be run on private infrastructure, fine tuned, and audited, which is part of why it became a reference point in discussions about open-source AI.
Kimi K2 Thinking landed in a year when Chinese labs were increasingly competitive at the open frontier, following DeepSeek's earlier momentum and parallel releases from groups behind models such as Qwen and GLM. Commentators framed it as evidence that near frontier agentic and reasoning performance was no longer limited to a small set of heavily funded Western labs. [5][11][12] The independent benchmarking group Artificial Analysis scored the model at 67 on its Intelligence Index, the highest for any open weight model at the time and second overall only to GPT-5, and measured a 93 percent result on the agentic tool use benchmark Tau2-bench Telecom, the strongest score it had recorded there. [14] The reported 4.6 million dollar training figure fed that narrative, echoing the cost story that surrounded DeepSeek, but it should be read with care. Moonshot's chief executive said the number was not an official figure, and company researchers explained that a large share of real cost goes into research and failed experiments that are hard to attribute to a single training run. [6][7]
The launch also fits a wider shift toward agentic systems, models judged less on single turn answers and more on whether they can carry out long, tool driven tasks. By open sourcing a model tuned for exactly that, and by shipping it in an efficient INT4 form, Moonshot gave the open community a strong base for building autonomous agents. [3][5] The base Kimi K2 remains a separate model aimed at general use and coding, and K2 Thinking is the reasoning specialized release rather than a replacement for it.
Several limitations temper the strongest claims. The model is verbose, which inflates inference cost and latency even when the headline price per token is low. Artificial Analysis found it generated about 140 million tokens to complete its Intelligence Index suite, roughly 2.5 times the count used by DeepSeek V3.2, which makes it the most token hungry model in that test. [12][14] The most impressive scores often rely on tool access or on a heavy aggregation mode that runs multiple reasoning passes, so single pass results without tools are more ordinary and in some cases trail GPT-5. [2][4] The NIST CAISI evaluation found agentic coding and cybersecurity performance below leading American models and only incremental gains over DeepSeek V3.1, alongside strong Chinese language censorship. [8] As a roughly one trillion parameter model, it also demands substantial hardware to self host despite the INT4 packaging, which keeps full local deployment out of reach for most individual users. [2][3]