Kimi K1.5

Updated Jul 16, 2026

Kimi K1.5 is a multimodal reasoning large language model developed by Moonshot AI, a Beijing-based artificial intelligence company. The model was announced on January 20, 2025,^[2] the same day DeepSeek published DeepSeek R1,^[3] together with a technical report posted to arXiv as paper 2501.12599.^[1] K1.5 was Moonshot's first publicly described reasoning-oriented model and one of the earliest non-OpenAI systems to demonstrate that long chain-of-thought reasoning could be trained at scale through reinforcement learning rather than through purely supervised methods. The report lists scores that matched or beat the September 2024 release of OpenAI o1 on several mathematics and code benchmarks,^[1] a result that surprised much of the Western research community and helped trigger the wave of reasoning-model releases that defined the first half of 2025.

The model occupies a specific niche in the Moonshot lineup. It sits between the long-context Kimi chatbot, which the company had been shipping since late 2023, and Kimi K2, the trillion-parameter Mixture of Experts model that arrived in July 2025. Unlike K2, which is an open-weights model released under a Modified MIT license, K1.5 was offered primarily through Moonshot's own Kimi application and through a partial open release of smaller derivatives on Hugging Face.^[4] The K1.5 paper is often cited together with the DeepSeek R1 paper as one of the two contemporaneous documents that established a public template for training reasoning models with simple, scalable RL on verifiable rewards.

Background

Moonshot AI and the Kimi product line

Moonshot AI was founded in March 2023 by Yang Zhilin and two co-founders from Tsinghua University. The company's name is a reference to Pink Floyd's album The Dark Side of the Moon, which inspired the Chinese legal name (Beijing Moonshot AI Technology Co., Ltd.). Moonshot was one of the so-called Chinese AI Tigers, a loose grouping of six well-funded large language model startups that included Zhipu AI, MiniMax, 01.AI, Baichuan, and StepFun.

The company's consumer product is Kimi, a chatbot first released in October 2023 with a 200,000 character context window. By March 2024 the company announced support for two million Chinese characters in a single prompt, and the long-context positioning became the early Kimi identity. The model family powering Kimi went through several internal generations before K1.5 was introduced. A preview called Kimi k1 was used internally during late 2024 and was referenced briefly in the K1.5 paper as an earlier short-CoT baseline.^[1] K1.5 was the first generation to be described publicly in detail with a technical report.^[1]

The 2025 reasoning model wave

Reasoning models were the dominant story in early 2025. OpenAI had released o1-preview in September 2024,^[5] then the full o1 in December 2024,^[6] and had been deliberately vague about how the system was trained. The accompanying OpenAI material described long chains of internal thought generated before the final answer, very high inference cost, and strong gains on mathematics and competitive programming, but it did not describe the training method in any detail.

The vacuum that this opacity created was filled by two Chinese papers on the same day. DeepSeek published the R1 paper, describing a recipe that started from a base model and used RL on verifiable rewards to elicit long chain-of-thought reasoning.^[3] Moonshot published the K1.5 paper, describing a closely related recipe with a different multimodal training corpus and a deliberate focus on both long-CoT and short-CoT variants.^[1] The simultaneous release was largely coincidental. Both groups had been working on reasoning training for months, and the publication windows aligned at the start of the Chinese New Year season. The pairing was significant because it undercut the assumption that replicating o1 would be prohibitively expensive, and it shifted the conversation toward a question of training discipline rather than secret architecture.

Architecture

The Kimi K1.5 technical report does not publish a full parameter count for the deployed model.^[1] The authors describe it as a dense transformer language model that was fine-tuned for long chain-of-thought reasoning and extended to handle vision-language inputs through a vision encoder. The model supports a 128,000 token context window during reinforcement learning, which the paper highlights as a key design choice because it allows the policy to revise long reasoning traces without truncating them midway through.^[1]

K1.5 is presented as a single model with two operating modes. The long-CoT mode produces extended internal reasoning before answering, similar to OpenAI o1 and DeepSeek R1. The short-CoT mode produces concise answers with shorter reasoning traces, closer to a conventional instruction-tuned model. Both modes share weights and were trained together rather than as separate checkpoints.^[1]

The multimodal component is built around a vision tower that produces token embeddings the language model consumes alongside text tokens. The paper describes a joint text and image training mixture, and reports vision-and-text scores on benchmarks like MathVista and MMMU.^[1] The K1.5 paper is one of the first published reasoning model reports to include vision benchmarks alongside math and code, and the authors argue that joint multimodal training improved transfer on text-only reasoning tasks rather than competing with it.^[1]
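
The data flow described above can be sketched in a few lines. This is a minimal illustration only: the projection matrix, the dimensions, and the names `project` and `build_input_sequence` are assumptions made for the example, and the report does not specify how image tokens are ordered relative to text tokens.

```python
from typing import List

Vector = List[float]


def project(patch: Vector, weights: List[Vector]) -> Vector:
    """Linear map from vision-feature space to the language model's embedding size."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]


def build_input_sequence(
    image_patches: List[Vector],
    text_embeddings: List[Vector],
    weights: List[Vector],
) -> List[Vector]:
    """Turn image patches into embedding vectors and place them before the text.

    Putting image tokens first is an assumption made for this sketch.
    """
    image_tokens = [project(patch, weights) for patch in image_patches]
    return image_tokens + text_embeddings


# Toy sizes: 2-dimensional vision features, 3-dimensional model embeddings.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sequence = build_input_sequence([[2.0, 3.0]], [[0.1, 0.2, 0.3]], weights)
```

After projection every element of `sequence` has the same width, so the language model can attend over image and text positions uniformly.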

Long-CoT and short-CoT variants

Variant	Behavior	Typical use
K1.5 long-CoT	Produces extended reasoning before answering, often thousands of tokens	Olympiad math, competitive programming, multi-step proofs
K1.5 short-CoT	Produces concise answers with shorter reasoning traces	General chat, summarization, routine question answering

The paper reports separate benchmark numbers for the two variants. The long-CoT variant is the one most often compared with o1 and R1. The short-CoT variant is closer to a stronger version of the existing Kimi chat model.^[1]

Training

The K1.5 technical report focuses heavily on the training pipeline. The authors describe four main stages: pretraining of the base model, supervised fine-tuning on a curated reasoning corpus, a long-CoT supervised warm-up, and a reinforcement learning phase on prompts with verifiable answers.^[1]

Pretraining and supervised fine-tuning

The paper does not publish full pretraining details. It refers to a pretrained base language model and a vision encoder trained on internal data. Supervised fine-tuning uses a mixture of high-quality reasoning examples covering mathematics, code, and general knowledge questions. The long-CoT warm-up stage is positioned as a separate phase that exposes the model to extended reasoning traces sourced from a combination of curated data and earlier model rollouts. This stage is described as a small but important conditioning step that shapes how the policy uses its context window during the subsequent RL stage.^[1]

Reinforcement learning on verifiable rewards

The core training contribution of K1.5 is the RL stage. The authors argue that simple, scalable algorithms outperform more complex methods when the reward signal is verifiable.^[1] The policy receives a binary or near-binary reward based on whether the final answer matches a ground truth, with no learned reward model in the main loop. Prompts are filtered to ensure that the answers can be checked automatically, which restricts the training mix to mathematics with numerical answers, code problems with test cases, and structured logic puzzles.
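
A verifiable reward of this kind can be illustrated with a short sketch. The `Answer:` marker, the numeric comparison, and all function names below are assumptions chosen for the example; the report does not publish its checker.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if there is none."""
    matches = re.findall(r"Answer:\s*(.+)", response)
    return matches[-1].strip() if matches else None


def parse_number(text: str) -> Optional[Fraction]:
    """Parse integers, decimals, and simple fractions; return None on failure."""
    try:
        return Fraction(text.replace(",", ""))
    except (ValueError, ZeroDivisionError):
        return None


def verifiable_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the final answer matches the ground truth, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0  # no checkable answer was produced
    predicted = parse_number(answer)
    expected = parse_number(ground_truth)
    if predicted is None or expected is None:
        # Fall back to exact string comparison for non-numeric answers.
        return 1.0 if answer == ground_truth.strip() else 0.0
    return 1.0 if predicted == expected else 0.0
```

Only the last `Answer:` line is scored, so a rollout that revises an earlier guess is judged on its final commitment rather than on intermediate attempts.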

The paper presents two algorithmic ideas as central to its results.

The first is what the authors call long-context RL. Standard RL fine-tuning of language models typically uses short contexts because long rollouts are expensive and unstable. K1.5 was trained with a 128,000 token RL context, which the authors say is essential for letting the policy learn behaviors like backtracking, self-verification, and exploration of multiple solution paths within a single rollout.^[1]

The second is a partial rollout system. Long rollouts are wasteful when they share long common prefixes. The authors describe a system that caches partial rollouts and reuses them across training steps, reducing the wall-clock cost of long-context RL. Engineering details are partially redacted in the public report, but the conceptual claim is that long-CoT training is bottlenecked by infrastructure rather than by algorithmic novelty.^[1]

The authors deliberately avoid more complex RL recipes that were popular at the time. They report that they did not use Monte Carlo tree search at inference, did not train a value function, and did not use a process reward model that scores intermediate steps.^[1] The argument is that these methods are sometimes useful but are not necessary to reach o1-level performance on the benchmarks they tested, provided the verifiable-reward RL loop is scaled properly.
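
One common way to train without a value function is to use the mean reward of several responses sampled for the same prompt as the baseline. The sketch below illustrates only that idea; it is an assumption for illustration rather than the exact estimator in the report, and `mean_baseline_advantages` is a hypothetical name.

```python
from statistics import mean
from typing import List


def mean_baseline_advantages(rewards: List[float]) -> List[float]:
    """Advantage of each sampled response relative to the group's mean reward."""
    baseline = mean(rewards)
    return [reward - baseline for reward in rewards]


# Four responses to one prompt, two of which reached the correct answer.
advantages = mean_baseline_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive a positive weight and incorrect ones a negative weight. When every sample gets the same reward the advantages are all zero, which is why prompts that are always solved or never solved contribute no gradient signal under this scheme.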

Short-CoT distillation

The paper describes a long-to-short distillation step (called long2short in the report) in which the long-CoT model is used to generate compact reasoning traces that are then used to fine-tune the short-CoT variant. The authors report that this transfer improves the short-CoT model on the same benchmarks where the long-CoT model is strong, including AIME and MATH-500, without incurring the same inference cost.^[1] The short-CoT variant is the one used by default in the consumer Kimi application for routine queries.
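
One simple way to obtain compact traces for the distillation step above is to sample several responses per prompt and keep the shortest one that is correct. The sketch below assumes that selection rule for illustration. The correctness check (a suffix match) and the names `shortest_correct` and `build_sft_pairs` are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple


def shortest_correct(samples: List[str],
                     is_correct: Callable[[str], bool]) -> Optional[str]:
    """Return the correct sample with the fewest whitespace-separated words."""
    correct = [sample for sample in samples if is_correct(sample)]
    return min(correct, key=lambda s: len(s.split())) if correct else None


def build_sft_pairs(samples_by_prompt: Dict[str, List[str]],
                    answers: Dict[str, str]) -> List[Tuple[str, str]]:
    """Build (prompt, compact trace) pairs for supervised fine-tuning."""
    pairs = []
    for prompt, samples in samples_by_prompt.items():
        target = answers[prompt]
        best = shortest_correct(samples, lambda s: s.strip().endswith(target))
        if best is not None:  # skip prompts with no correct sample
            pairs.append((prompt, best))
    return pairs


samples = {
    "2+2?": [
        "Try 3... no. Try 4. Check: 2+2=4. Answer: 4",
        "2+2=4. Answer: 4",
        "Answer: 5",
    ],
    "hard": ["Answer: 0"],
}
answers = {"2+2?": "Answer: 4", "hard": "Answer: 1"}
pairs = build_sft_pairs(samples, answers)
```

Prompts for which no sample is correct are dropped, so the fine-tuning set contains only traces that are both short and verified.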

Benchmark performance

The K1.5 paper published benchmark numbers for both variants alongside contemporaneous scores for OpenAI o1 and other competitors. The numbers below are taken directly from the technical report posted to arXiv in January 2025.^[1] Scores from external evaluators that ran the model after release have varied slightly because of decoding settings and prompt formats, but the headline ranking has held up in independent reproductions.

Benchmark	K1.5 long-CoT	K1.5 short-CoT	OpenAI o1 (reported)
AIME 2024 (pass@1)	77.5	60.8	74.4
MATH-500 (pass@1)	96.2	94.6	94.8
Codeforces percentile	94	47	89
LiveCodeBench (pass@1)	47.3	40.3	not reported in K1.5 paper
MathVista (pass@1)	74.9	not reported	71.0
MMMU (pass@1)	70.0	not reported	78.2

The AIME and MATH-500 numbers are the most cited results. AIME is the American Invitational Mathematics Examination, a high school competition used widely as a reasoning benchmark, and the 77.5 that K1.5 long-CoT scored on AIME 2024 was slightly above the 74.4 reported for o1.^[1] MATH-500 is a 500-problem subset of the MATH benchmark, also used heavily for reasoning evaluation. Codeforces is a competitive programming platform, and the 94th percentile rating put K1.5 long-CoT slightly ahead of the 89th percentile listed for o1 in the table above.

MathVista and MMMU are multimodal benchmarks. The K1.5 report was among the first reasoning-model reports to publish results in this area. K1.5 leads the reported o1 score on MathVista (74.9 versus 71.0) but trails it on MMMU (70.0 versus 78.2). The MMMU gap may reflect a broader general-knowledge training mix on OpenAI's side, although OpenAI has not published its training data.
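
The margins between K1.5 long-CoT and the reported o1 scores can be read off the table with a few lines of arithmetic. The numbers below are copied from the table above; only the dictionary layout and variable names are introduced for the example.

```python
# (K1.5 long-CoT, OpenAI o1 as reported) for rows where both scores are listed.
scores = {
    "AIME 2024": (77.5, 74.4),
    "MATH-500": (96.2, 94.8),
    "Codeforces percentile": (94.0, 89.0),
    "MathVista": (74.9, 71.0),
    "MMMU": (70.0, 78.2),
}

# Positive margin means K1.5 long-CoT is ahead of the reported o1 score.
margins = {name: round(k15 - o1, 1) for name, (k15, o1) in scores.items()}
behind = [name for name, margin in margins.items() if margin < 0]

for name, margin in margins.items():
    print(f"{name}: {margin:+.1f}")
```

MMMU is the only listed benchmark on which the long-CoT variant trails the reported o1 number.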

Comparison to o1 and DeepSeek R1

The three contemporaneous reasoning models share more than they differ. All three rely on long chain-of-thought generation, all three use reinforcement learning on tasks with verifiable rewards as a central training step, and all three were positioned at launch as targeting mathematical reasoning and competitive coding. The differences are about openness, multimodality, and deployment.

Model	Developer	Release	Open weights	Multimodal	Context length
OpenAI o1	OpenAI	September and December 2024	No	No (text only at launch)	128K
DeepSeek R1	DeepSeek	January 20, 2025	Yes (MIT)	No	128K (inherited from V3)
Kimi K1.5	Moonshot AI	January 20, 2025	Partial (smaller derivatives only at launch)	Yes (vision and text)	128K

The most prominent contrast is openness. DeepSeek R1 shipped with full weights for the main 671-billion parameter model under the MIT license, which made it the easiest of the three to study externally.^[3] K1.5 was deployed primarily through Moonshot's Kimi product, with smaller distilled variants released later on Hugging Face under the Moonshot AI organization rather than the full base model.^[4]

The second contrast is multimodality. K1.5 was the only one of the three that supported image inputs at launch, which mattered for the vision-and-math benchmarks but had limited day-to-day visibility because most reasoning workloads were text-only at the time.

The third contrast is reproducibility of the training recipe. The R1 paper described its full training pipeline, including the rule-based rewards and the cold-start data generation.^[3] The K1.5 paper was more guarded about specific infrastructure details and dataset composition.^[1] External groups that tried to reproduce both papers in early 2025 reported that R1 was easier to replicate because the open weights served as a check on intermediate results, while K1.5 reproductions had to rely entirely on the paper plus public Moonshot blog posts.

API and access

At launch in January 2025, Kimi K1.5 was available through Moonshot's own platform. The Kimi consumer application running at kimi.ai exposed the long-CoT and short-CoT modes through a model selector.^[2] Users could submit text or image inputs and could optionally enable the long-thinking mode for harder questions. The Moonshot Platform API at platform.moonshot.cn provided developer access on a pay-per-token basis, with the Kimi K1.5 model identifier exposed alongside earlier Kimi chat endpoints.^[7]

The API rolled out in stages. Text-only K1.5 was available first, and multimodal inputs were enabled for selected developers in the following months. Long-CoT requests were billed differently from short-CoT requests because of the much larger output token volume, and the Moonshot documentation warned that long-CoT responses could exceed twenty thousand output tokens on hard problems.^[7]
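
The effect of output volume on a pay-per-token bill can be sketched with simple arithmetic. The prices below are placeholders, not Moonshot's actual rates, and the sketch applies the same rate to both modes for simplicity even though the two were billed differently. `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_million: float,
                  price_out_per_million: float) -> float:
    """Cost of one request in the same currency units as the prices."""
    total = (input_tokens * price_in_per_million
             + output_tokens * price_out_per_million)
    return total / 1_000_000


# Placeholder prices per million tokens (hypothetical, for illustration only).
PRICE_IN, PRICE_OUT = 2.0, 8.0

short_cost = estimate_cost(500, 500, PRICE_IN, PRICE_OUT)
long_cost = estimate_cost(500, 20_000, PRICE_IN, PRICE_OUT)
ratio = long_cost / short_cost
```

With a 500-token prompt, a 20,000-token long-CoT answer costs roughly thirty times as much as a 500-token short answer at these placeholder rates, because output tokens dominate the bill.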

A partial open release accompanied the API launch. Moonshot uploaded several smaller K1.5 derivatives to Hugging Face under the moonshotai organization.^[4] The full base model was not released publicly at launch. The available checkpoints were intended for research use and were licensed under the Moonshot Model License, which allowed non-commercial use and required attribution.^[4] The full open-weights story for the Moonshot lineup did not arrive until Kimi K2 in July 2025, when the company shifted to a Modified MIT license for the trillion-parameter base model.

Reception

The response to K1.5 was shaped by the simultaneous DeepSeek R1 release. Most coverage in the week of January 20, 2025 grouped the two papers together and treated them as a single event: the first credible non-OpenAI reasoning models with public training recipes. The benchmark numbers were widely reported in technical media, and the AIME 2024 result in particular was repeated as evidence that o1-level performance was reproducible outside OpenAI.

Researchers focused on the methodological contributions. The long-context RL claim from the K1.5 paper was discussed alongside the simpler GRPO-based recipe in the R1 paper, and several groups noted that the two recipes were more similar than different. The K1.5 emphasis on multimodal training was also discussed, though the lack of full open weights limited how much external work could be built directly on top of the model.

In the consumer market, K1.5 was a strong refresh of the Kimi product. The Kimi chatbot's user base grew during early 2025, and the long-thinking mode became a marketing feature for harder math and coding questions inside the app. The Chinese press coverage emphasized the parity with o1 as a national prestige story, and Moonshot's valuation continued to rise through subsequent funding rounds.

Criticism of K1.5 generally targeted three areas. The first was the partial open release, which contrasted unfavorably with the fully open R1 weights. The second was the absence of full pretraining details in the paper, which made independent replication harder. The third was concern about benchmark saturation. AIME 2024 and MATH-500 were already widely used for reasoning evaluation, and several commentators noted that competitive math benchmarks were close to ceiling for the top models. Later reasoning benchmarks, including AIME 2025 and harder math sets, replaced these as the discriminating tests in subsequent quarters.

Legacy and succession to K2

Kimi K1.5 was the bridge between Moonshot's long-context chat lineage and its later open-weights agentic models. The lessons from K1.5, including the long-context RL infrastructure, the verifiable-reward training loop, and the joint multimodal mixture, fed directly into K2 and its successors.

Kimi K2 was released on July 11, 2025, six months after K1.5. K2 was a 1.04 trillion parameter Mixture of Experts model, a different architectural direction from the dense K1.5. The most visible inherited piece was the training discipline. K2 used a similar verifiable-reward RL approach for its post-training, scaled to the larger base model and reoriented around agentic tasks like multi-step tool use and software engineering. The K2 launch was the moment when Moonshot fully embraced the open-weights model under a Modified MIT license, a step that the partial K1.5 release had stopped short of.

Later additions to the family extended this trajectory. Kimi K2.5 followed in late 2025 with refinements to the agentic post-training, and a subsequent K2.6 update arrived in April 2026 with further coding improvements. The Moonshot lineup as of 2026 still cites K1.5 in the company materials as the first model that established reasoning as a distinct capability inside the Kimi ecosystem.

Outside Moonshot, K1.5 had a broader influence on the public conversation around reasoning models. The same week that K1.5 and R1 appeared, several research groups outside of China started replicating the long-CoT RL recipe on smaller open models, and a steady stream of small reasoning models appeared during the first half of 2025. The combination of the two January papers established a template that later releases including Qwen QwQ, the early Mistral reasoning variants, and several academic projects all referenced. The reasoning model wave that followed reshaped the open-source landscape and helped close the perceived gap between Chinese frontier labs and US ones.

References

1. Moonshot AI team. "Kimi k1.5: Scaling Reinforcement Learning with LLMs." arXiv preprint arXiv:2501.12599, January 22, 2025. https://arxiv.org/abs/2501.12599
2. Moonshot AI. "Kimi k1.5 release announcement." Moonshot AI official site, January 20, 2025. https://moonshot.ai
3. DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv preprint arXiv:2501.12948, January 22, 2025. https://arxiv.org/abs/2501.12948
4. Moonshot AI organization on Hugging Face. https://huggingface.co/moonshotai
5. OpenAI. "Learning to reason with LLMs." OpenAI blog, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms
6. OpenAI. "OpenAI o1 System Card." December 2024. https://openai.com/index/openai-o1-system-card
7. Moonshot AI Platform documentation. https://platform.moonshot.cn
