Kimi K2 Thinking

Chinese AI Large Language Models Open Source AI Reasoning Models

12 min read

Updated May 31, 2026

Suggest edit History Talk

RawGraph

Last edited

May 31, 2026

Fact-checked

In review queue

Sources

14 citations

Revision

v1 · 2,439 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Kimi K2 Thinking is a reasoning and agentic large language model released by the Chinese startup Moonshot AI on November 6, 2025. It is a thinking variant built on top of the company's Kimi K2 foundation model, adding long chains of internal reasoning and the ability to plan and run long sequences of tool calls on its own. Moonshot published the weights openly on Hugging Face under the moonshotai account, and the model arrived with a detailed technical writeup describing how it interleaves step by step thought with function calls across hundreds of actions. ^[1]^[2]^[3]

The release drew attention well beyond the usual model launch coverage. On several agentic and reasoning benchmarks reported by Moonshot, Kimi K2 Thinking matched or edged ahead of frontier proprietary systems such as GPT-5 and Claude Sonnet 4.5, and it did so as a freely downloadable model. ^[4]^[5] A reported training cost figure of roughly 4.6 million US dollars circulated widely in the press, although Moonshot's leadership later said that number was not official. ^[6]^[7] The United States National Institute of Standards and Technology, through its Center for AI Standards and Innovation, published its own evaluation of the model in December 2025, which gave a more measured picture of where it stands against American systems. ^[8]

Relation to Kimi K2 and Moonshot AI

Moonshot AI is a Beijing based company founded in 2023 and backed by investors including Alibaba. Its consumer assistant is called Kimi, and the K2 line is the family of large models that powers it. ^[6]^[9] Kimi K2 itself shipped earlier in 2025 as a mixture of experts model aimed at general instruction following and agentic coding. Kimi K2 Thinking is the reasoning oriented sibling of that base model, post trained to produce explicit thought before it answers and to keep reasoning between each tool call rather than only at the start of a task. ^[1]^[3]

The relationship mirrors the split that DeepSeek drew between DeepSeek V3 and DeepSeek-R1, where a shared foundation model gained a separate reasoning focused release. K2 Thinking sits in the broader category of reasoning models, systems that spend extra compute at inference time on a visible or hidden scratchpad before committing to a final response. ^[3]^[5]

Architecture

Kimi K2 Thinking keeps the architecture of the base K2 model. It is a sparse mixture of experts transformer with about one trillion total parameters, of which roughly 32 billion are active for any given token. The published configuration lists 61 layers, 384 routed experts with 8 selected per token plus 1 shared expert, a hidden size of 7168, and a vocabulary of about 160 thousand tokens. Attention uses a multi head latent attention scheme, and the feed forward blocks use a SwiGLU style activation. ^[2]^[10]

The model supports a context window of 256 thousand tokens, which gives it room to hold long tool transcripts, large documents, and extended reasoning traces in a single session. ^[1]^[4]

Specification	Value
Developer	Moonshot AI
Release date	November 6, 2025
Model type	Mixture of experts, reasoning and agentic
Total parameters	About 1 trillion
Active parameters per token	About 32 billion
Layers	61
Routed experts	384 (8 active per token)
Shared experts	1
Hidden size	7168
Vocabulary	About 160 thousand tokens
Attention	Multi head latent attention
Context length	256 thousand tokens
Native precision	INT4 (quantization aware training)
Weights	Open, on Hugging Face
License	Modified MIT

Native INT4 quantization

A distinctive engineering choice is that K2 Thinking ships natively in INT4 rather than as a higher precision model that users quantize afterward. Moonshot applied quantization aware training to the mixture of experts components during the post training stage, so the four bit weights are part of the trained model rather than a lossy afterthought. The published configuration uses 4 bit integer weights in groups of 32, stored in the compressed tensors format. ^[2]^[3]

The payoff is efficiency. INT4 roughly halves the memory footprint compared with the FP8 checkpoints of earlier K2 releases, bringing the on disk size to around 594 gigabytes, and Moonshot reports close to a 2 times speedup in low latency generation. Because reasoning models emit many tokens of internal thought, faster and cheaper generation matters more here than for a plain chat model. Moonshot states that the published benchmark results were all measured under INT4 precision, so the reported quality reflects the quantized model rather than a separate full precision version. ^[2]^[3]

Long horizon agentic design

The headline capability of Kimi K2 Thinking is stable long horizon agency. The model was trained end to end to weave chain-of-thought reasoning together with function calling, so it can act as an autonomous agent that reasons, calls a tool, reflects on the result, and decides what to do next. Moonshot describes this pattern as interleaved thinking, where the model produces fresh reasoning between every tool use step instead of planning once and then executing blindly. ^[3]^[5]

Moonshot reports that the model can sustain coherent, goal directed behavior across roughly 200 to 300 consecutive tool calls without human intervention, where earlier systems tended to drift or lose the thread after about 30 to 50 steps. ^[1]^[3] That stamina is what makes it suited to work as one of the more capable open AI agents for tasks such as multi step web research, software repair, and long document synthesis. In the benchmark harness the agentic search tasks were allowed up to 300 steps with a reasoning budget of about 24 thousand tokens per step, while the harder exam style tasks with tools were capped near 120 steps with a larger per step budget. ^[4]

The model exposes its reasoning through a separate reasoning content field in the API response, and Moonshot offers an interface that is compatible with both the OpenAI and Anthropic style chat formats, which lowers the switching cost for developers who already build against those clients. ^[2]^[3]

Benchmark results

Moonshot published a set of benchmark numbers comparing Kimi K2 Thinking with GPT-5 in its high reasoning setting and with Claude Sonnet 4.5 in its thinking setting. The table below reproduces figures from the official model card. Entries marked with an asterisk were reported by Moonshot using its own runs or reproductions rather than vendor published scores, and blank cells indicate a configuration the source did not report. All K2 Thinking numbers were measured at INT4 precision. ^[2]^[4]

Benchmark	Setting	Kimi K2 Thinking	GPT-5 (high)	Claude Sonnet 4.5 (thinking)
Humanity's Last Exam (text)	with tools	44.9	41.7*	32.0*
Humanity's Last Exam (text)	heavy	51.0	42.0
Humanity's Last Exam (text)	no tools	23.9	26.3	19.8*
BrowseComp	with tools	60.2	54.9	24.1
BrowseComp-ZH	with tools	62.3	63.0*	42.4*
SWE-bench Verified	with tools	71.3	74.9	77.2
SWE-bench Multilingual	with tools	61.1	55.3*	68.0
LiveCodeBench v6	no tools	83.1	87.0*	64.0*
AIME 2025	with python	99.1	99.6	100.0
HMMT 2025	with python	95.1	96.7	88.8*
GPQA	no tools	84.5	85.7	83.4
MMLU-Pro		84.6	87.1	87.5

The pattern in these results is uneven by design. On agentic web search, K2 Thinking posts the strongest numbers of the three on BrowseComp, and on Humanity's Last Exam with tools it leads the reported field, with a further lift in a heavy mode that aggregates several reasoning trajectories. ^[4]^[5] On pure coding measured by SWE-bench Verified and on knowledge tests such as MMLU-Pro and GPQA it trails the two proprietary models, though by modest margins. Math contests such as AIME 2025 are close to saturated for all three when a Python tool is allowed. ^[2]^[4]

The strongest claim made around the launch was that K2 Thinking was the first open weight model to lead frontier proprietary systems on several of these agentic and reasoning benchmarks at the same time, rather than on any single isolated test. ^[5]^[11] Independent coverage noted one important caveat: the model is unusually verbose, generating more tokens than rivals on the same problems, which raises real inference cost and latency even though the per token price is low. ^[12]

Independent evaluation

In December 2025 the United States National Institute of Standards and Technology, working through its Center for AI Standards and Innovation, released an evaluation of Kimi K2 Thinking. The assessment covered cybersecurity, software engineering, scientific knowledge, and mathematical reasoning, using benchmark suites that included SWE-bench Verified, MMLU-Pro, GPQA, and several math and security tests. ^[8]

The evaluation reached a more cautious conclusion than the launch coverage. It found that across the tested domains, and in cyber and mathematical reasoning in particular, K2 Thinking was only a modest improvement over DeepSeek V3.1, and that its agentic performance in cybersecurity and software engineering remained below that of leading American models, including GPT-5 and Claude Opus 4. It also reported that the model is heavily censored when prompted in Chinese, with refusal patterns close to those of DeepSeek's R1 release, while staying relatively open in English, Spanish, and Arabic. The same report noted that, one month after release, K2 Thinking had seen far lower download volume than DeepSeek-R1 did at a comparable point. ^[8]

Open weight release and licensing

Moonshot released both the model weights and the supporting code under a modified MIT license. The terms are permissive in the usual MIT spirit, allowing commercial use, modification, and redistribution, with an added attribution clause that applies to very large deployments. Products that pass thresholds such as 100 million monthly active users or a high level of monthly revenue are asked to display Kimi attribution. For most developers and smaller companies the license behaves like a standard open license. ^[2]^[13]

The weights are distributed on Hugging Face in the compressed tensors INT4 format, and Moonshot documents support for serving stacks such as vLLM, SGLang, and KTransformers. The same checkpoints can be converted to FP8 or BF16 if a deployment prefers higher precision. The model is also reachable through Moonshot's hosted API and through third party providers. ^[2]^[3] Because it is released as open weights, K2 Thinking can be run on private infrastructure, fine tuned, and audited, which is part of why it became a reference point in discussions about open-source AI.

Reception and significance

Kimi K2 Thinking landed in a year when Chinese labs were increasingly competitive at the open frontier, following DeepSeek's earlier momentum and parallel releases from groups behind models such as Qwen and GLM. Commentators framed it as evidence that near frontier agentic and reasoning performance was no longer limited to a small set of heavily funded Western labs. ^[5]^[11]^[12] The independent benchmarking group Artificial Analysis scored the model at 67 on its Intelligence Index, the highest for any open weight model at the time and second overall only to GPT-5, and measured a 93 percent result on the agentic tool use benchmark Tau2-bench Telecom, the strongest score it had recorded there. ^[14] The reported 4.6 million dollar training figure fed that narrative, echoing the cost story that surrounded DeepSeek, but it should be read with care. Moonshot's chief executive said the number was not an official figure, and company researchers explained that a large share of real cost goes into research and failed experiments that are hard to attribute to a single training run. ^[6]^[7]

The launch also fits a wider shift toward agentic systems, models judged less on single turn answers and more on whether they can carry out long, tool driven tasks. By open sourcing a model tuned for exactly that, and by shipping it in an efficient INT4 form, Moonshot gave the open community a strong base for building autonomous agents. ^[3]^[5] The base Kimi K2 remains a separate model aimed at general use and coding, and K2 Thinking is the reasoning specialized release rather than a replacement for it.

Limitations

Several limitations temper the strongest claims. The model is verbose, which inflates inference cost and latency even when the headline price per token is low. Artificial Analysis found it generated about 140 million tokens to complete its Intelligence Index suite, roughly 2.5 times the count used by DeepSeek V3.2, which makes it the most token hungry model in that test. ^[12]^[14] The most impressive scores often rely on tool access or on a heavy aggregation mode that runs multiple reasoning passes, so single pass results without tools are more ordinary and in some cases trail GPT-5. ^[2]^[4] The NIST CAISI evaluation found agentic coding and cybersecurity performance below leading American models and only incremental gains over DeepSeek V3.1, alongside strong Chinese language censorship. ^[8] As a roughly one trillion parameter model, it also demands substantial hardware to self host despite the INT4 packaging, which keeps full local deployment out of reach for most individual users. ^[2]^[3]

References

Moonshot AI. "Introducing Kimi K2 Thinking." moonshotai.github.io/Kimi-K2/thinking.html, November 6, 2025. ↩
Moonshot AI. "moonshotai/Kimi-K2-Thinking." Hugging Face model card, 2025. https://huggingface.co/moonshotai/Kimi-K2-Thinking ↩
Kimi (Moonshot AI). "Introducing Kimi K2 Thinking." Medium, November 2025. https://medium.com/@kimi_moonshot/introducing-kimi-k2-thinking-a61c95f6e59a ↩
Moonshot AI. "Kimi-K2-Thinking config.json." Hugging Face repository, 2025. https://huggingface.co/moonshotai/Kimi-K2-Thinking/raw/main/config.json ↩
Turing Post. "AI 101: Kimi K2 Thinking: Inside Moonshot AI's Agentic Reasoning Model." 2025. https://www.turingpost.com/p/kimik2thinking ↩
CNBC. "Alibaba-backed Moonshot releases new AI model Kimi K2 Thinking." November 6, 2025. https://www.cnbc.com/2025/11/06/alibaba-backed-moonshot-releases-new-ai-model-kimi-k2-thinking.html ↩
Yicai Global. "Kimi K2 Thinking's Reported USD4.6 Million Training Cost Isn't Official, Moonshot CEO Says." 2025. https://www.yicaiglobal.com/news/kimi-k2-thinkings-reported-usd46-million-training-cost-isnt-official-moonshot-ceo-says ↩
National Institute of Standards and Technology. "CAISI Evaluation of Kimi K2 Thinking." December 12, 2025. https://www.nist.gov/news-events/news/2025/12/caisi-evaluation-kimi-k2-thinking ↩
Wikipedia. "Moonshot AI." https://en.wikipedia.org/wiki/Moonshot_AI ↩
MoonshotAI. "Kimi-K2 GitHub repository." https://github.com/moonshotai/Kimi-K2 ↩
SiliconAngle. "Moonshot launches open-source Kimi K2 Thinking AI with trillion parameters and reasoning capabilities." November 7, 2025. https://siliconangle.com/2025/11/07/moonshot-launches-open-source-kimi-k2-thinking-ai-trillion-parameters-reasoning-capabilities/ ↩
RecodeChina AI. "Kimi K2 Thinking: The 4.6M Model Shifting AI Narratives." 2025. https://www.recodechinaai.com/p/kimi-k2-thinking-the-46m-model-shifting ↩
bdtechtalks. "Kimi K2 thinking: The open-source model giving closed AI labs a run for their money." November 8, 2025. https://bdtechtalks.com/2025/11/08/kimi-k2-thinking/ ↩
Artificial Analysis. "Kimi K2 Thinking: Everything you need to know." 2025. https://artificialanalysis.ai/articles/kimi-k2-thinking-everything-you-need-to-know ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

Suggest edit

What links here

Kimi K2.5 Kimi K2.6

Relation to Kimi K2 and Moonshot AI

Architecture

Native INT4 quantization

Long horizon agentic design

Benchmark results

Independent evaluation

Open weight release and licensing

Reception and significance

Limitations

See also

References

Improve this article

Related Articles

DeepSeek-R1-Distill

DeepSeek V3.1

QwQ

Marco-o1

DeepSeek-R1

MiniMax M1

What links here

Related Articles

DeepSeek-R1-Distill

DeepSeek V3.1

QwQ

Marco-o1

DeepSeek-R1

MiniMax M1

What links here