LongCat-Flash
Last reviewed
Jun 8, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 1,635 words
LongCat-Flash is an open-weight large language model developed by the LongCat team at Meituan, the Chinese on-demand local-services and food-delivery company. Released in late August 2025, it is a mixture-of-experts (MoE) model with roughly 560 billion total parameters that activates only about 18.6 billion to 31.3 billion parameters per token, averaging around 27 billion. [1][2] Its signature innovation is a "zero-computation experts" mechanism that lets the MoE router send unimportant tokens to no-op experts, so the amount of compute spent varies dynamically per token. [1] LongCat-Flash marked Meituan's entry into the frontier of open large language models, and it became the foundation for a fast-moving lineage that includes the reasoning-focused LongCat-Flash-Thinking, the multimodal LongCat-Flash-Omni, and the later trillion-parameter LongCat-2.0-Preview. [2][3][4]
LongCat-Flash was conceived as an efficient, high-throughput foundation model with a particular emphasis on agentic and tool-use tasks. [1] Rather than competing purely on raw scale, its design tries to maximize useful capability per unit of activated compute. The model's MoE architecture holds about 560 billion total parameters but only activates a small, variable fraction of them for each token, which keeps inference cheap relative to the model's total size. [1][2] Meituan reported inference throughput of more than 100 tokens per second and an output cost on the order of 0.70 US dollars per million tokens. [1][2]
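The efficiency claim can be made concrete with some back-of-the-envelope arithmetic. The sketch below uses only the figures quoted above (560 billion total parameters, 18.6 to 31.3 billion activated, more than 100 tokens per second, about 0.70 US dollars per million output tokens). The function names and the 10,000-token workload are hypothetical and chosen purely for illustration, and a flat per-token price is an assumption.

```python
# Rough arithmetic on the reported figures; not an official pricing or sizing tool.
TOTAL_PARAMS_B = 560.0        # total parameters, in billions
ACTIVE_MIN_B = 18.6           # fewest activated parameters per token, in billions
ACTIVE_AVG_B = 27.0           # average activated parameters per token, in billions
ACTIVE_MAX_B = 31.3           # most activated parameters per token, in billions
COST_PER_MILLION_USD = 0.70   # reported output cost per million tokens
TOKENS_PER_SECOND = 100.0     # reported lower bound on throughput


def active_fraction(active_b: float, total_b: float = TOTAL_PARAMS_B) -> float:
    """Share of the model's parameters that take part in one token."""
    return active_b / total_b


def output_cost_usd(num_tokens: int, price_per_million: float = COST_PER_MILLION_USD) -> float:
    """Cost of generating num_tokens, assuming a flat per-million-token price."""
    return num_tokens / 1_000_000 * price_per_million


def generation_seconds(num_tokens: int, tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Time to generate num_tokens at a constant decoding rate."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    for label, active in [("min", ACTIVE_MIN_B), ("avg", ACTIVE_AVG_B), ("max", ACTIVE_MAX_B)]:
        print(f"{label}: {active_fraction(active):.1%} of parameters active")
    n = 10_000  # hypothetical workload size
    print(f"{n} tokens: ${output_cost_usd(n):.4f}, about {generation_seconds(n):.0f} s")
```

On these numbers, roughly 3 to 6 percent of the parameters are active for any given token, which is the sense in which inference is cheap relative to total model size.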
The first public release, LongCat-Flash-Chat, is a non-reasoning ("non-thinking") instruction-tuned model. [5] It was followed within weeks by LongCat-Flash-Thinking, an explicit reasoning variant, and later by omni-modal and trillion-parameter siblings. [3][4] The release placed Meituan, a company best known for food delivery and local commerce rather than artificial intelligence research, alongside DeepSeek, Alibaba's Qwen family, and Moonshot AI's Kimi as a contributor to China's open-weight model ecosystem. [6][7]
LongCat-Flash was built by Meituan's LongCat team. [1] Meituan is one of China's largest internet platforms, centered on food delivery, restaurant reviews, travel booking, and other local services; the move into frontier-scale language models represented a significant diversification for the company. [6][7] Caixin and other outlets framed the launch as Meituan formally entering the open-source AI race, joining incumbents such as DeepSeek and Alibaba. [6]
The LongCat-Flash Technical Report, posted to arXiv on September 1, 2025, is credited to the "Meituan LongCat Team" along with a large list of contributing authors. [1] Sources differ slightly on the release date: the Chinese state outlet China Youth International and the model's Hugging Face card both date the open-source release to early September 2025, while some English-language summaries cite late August 2025 for the initial announcement. [5][8] Meituan released the model openly with weights, code, and a technical report, positioning LongCat as a continuing line rather than a one-off project. [1][5]
One widely noted detail is training efficiency: Meituan reported that LongCat-Flash was trained on more than 20 trillion tokens and that this pre-training run completed within roughly 30 days, supported by a scaling framework built for stability at large cluster sizes. [1]
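To give a sense of scale, the reported figures imply a sustained data rate that can be worked out directly. The sketch below treats "more than 20 trillion tokens" as exactly 20 trillion and "roughly 30 days" as exactly 30, so the result is a rough lower-bound estimate. The compute estimate uses the common rule of thumb of about 6 floating-point operations per active parameter per training token; that approximation is an assumption introduced here and is not taken from Meituan's report.

```python
# Rough estimates only; inputs are rounded versions of the reported figures.
SECONDS_PER_DAY = 24 * 60 * 60


def tokens_per_second(total_tokens: float, days: float) -> float:
    """Average training throughput needed to consume total_tokens in the given days."""
    return total_tokens / (days * SECONDS_PER_DAY)


def approx_training_flops(active_params: float, total_tokens: float) -> float:
    """Rule-of-thumb training compute: about 6 FLOPs per active parameter per token.

    This is a generic approximation (an assumption), not a figure from the report.
    """
    return 6.0 * active_params * total_tokens


if __name__ == "__main__":
    rate = tokens_per_second(20e12, 30)
    flops = approx_training_flops(27e9, 20e12)  # 27B average activated parameters
    print(f"average throughput: {rate / 1e6:.1f} million tokens per second")
    print(f"approximate training compute: {flops:.2e} FLOPs")
```

Under these assumptions the run averages several million tokens per second across the cluster, which helps explain the emphasis on stability at large cluster sizes.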
LongCat-Flash uses a mixture-of-experts transformer with approximately 560 billion total parameters. [1] In a standard MoE, a router selects a fixed number of expert sub-networks to process each token, so the activated parameter count is constant. LongCat-Flash departs from this in two main ways. [1]
The first is zero-computation experts. The model adds expert "slots" that perform no computation (an identity or no-op transformation). Because not all tokens require the same amount of processing, the router can assign these no-op experts to less important tokens, spending little or no extra compute on them, while routing more important tokens to real experts. [1] As a result, the number of activated parameters varies per token, ranging from about 18.6 billion to 31.3 billion and averaging roughly 27 billion, far below the 560 billion total. [1][2] An auxiliary control keeps the average activation budget stable so that the dynamic routing does not destabilize training. [1]
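The routing idea can be illustrated with a toy layer. The following is a minimal sketch, not Meituan's implementation: the class `ToyZeroComputeMoE`, its sizes, the softmax top-k router, and the use of a single linear map per real expert are all hypothetical simplifications, and the budget-control mechanism is omitted. It shows only the core point that when a token's top-k choices include identity experts, fewer real experts run for that token.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


class ToyZeroComputeMoE:
    """Toy MoE layer: experts 0..n_real-1 are real, the rest are identity (no-op)."""

    def __init__(self, d_model=4, n_real=2, n_zero=2, top_k=2, seed=0):
        rng = random.Random(seed)
        self.n_real = n_real
        self.top_k = top_k
        # router[e] scores expert e for a token (hypothetical linear router).
        self.router = [[rng.gauss(0, 1) for _ in range(d_model)]
                       for _ in range(n_real + n_zero)]
        # Each real expert is one d_model x d_model linear map (a simplification).
        self.experts = [[[rng.gauss(0, 1) for _ in range(d_model)]
                         for _ in range(d_model)] for _ in range(n_real)]

    def forward_token(self, x):
        """Return (output vector, number of real experts that ran) for one token."""
        probs = softmax(matvec(self.router, x))
        chosen = sorted(range(len(probs)), key=lambda e: -probs[e])[:self.top_k]
        out = [0.0] * len(x)
        real_used = 0
        for e in chosen:
            if e < self.n_real:
                y = matvec(self.experts[e], x)  # real expert: costs a matrix multiply
                real_used += 1
            else:
                y = x  # zero-computation expert: identity, no matrix multiply
            out = [o + probs[e] * yi for o, yi in zip(out, y)]
        return out, real_used


if __name__ == "__main__":
    layer = ToyZeroComputeMoE()
    rng = random.Random(1)
    for i in range(5):
        token = [rng.gauss(0, 1) for _ in range(4)]
        _, used = layer.forward_token(token)
        print(f"token {i}: {used} real expert(s) ran")
```

In this toy, the per-token compute is proportional to the number of real experts that ran, which is the analogue of the variable activated-parameter count described above.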
The second is Shortcut-connected MoE (ScMoE). This design rewires the network so that the heavy all-to-all communication required by MoE expert routing can be overlapped with computation, widening the "computation-communication overlap window." [1] This overlap is a major reason the model can sustain high inference throughput (over 100 tokens per second) despite its size. [1][2]
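The benefit of overlapping communication with computation can be seen in a simple timing model. The sketch below is illustrative only: the function names and the millisecond values are hypothetical, and it assumes that the all-to-all communication can be fully hidden behind an independent block of dense computation, which is an idealization rather than a measured property of LongCat-Flash.

```python
def layer_time_sequential(dense_ms: float, comm_ms: float, expert_ms: float) -> float:
    """Per-layer time when communication waits for dense compute to finish."""
    return dense_ms + comm_ms + expert_ms


def layer_time_overlapped(dense_ms: float, comm_ms: float, expert_ms: float) -> float:
    """Per-layer time when communication runs concurrently with dense compute.

    The overlapped phase lasts as long as the slower of the two activities.
    """
    return max(dense_ms, comm_ms) + expert_ms


if __name__ == "__main__":
    # Hypothetical timings in milliseconds, chosen only to show the effect.
    dense_ms, comm_ms, expert_ms = 2.0, 3.0, 4.0
    seq = layer_time_sequential(dense_ms, comm_ms, expert_ms)
    ovl = layer_time_overlapped(dense_ms, comm_ms, expert_ms)
    print(f"sequential: {seq} ms, overlapped: {ovl} ms, saved: {seq - ovl} ms")
```

In this model the saving equals the smaller of the dense-compute and communication times, so a wider overlap window (more independent computation available to hide communication behind) translates directly into lower per-layer latency.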
To train a model of this scale stably, Meituan reported a comprehensive framework combining hyperparameter transfer, model-growth initialization (growing a larger model from a smaller trained one), a "multi-pronged stability suite," and deterministic computation for reproducibility. [1] The released LongCat-Flash-Chat supports a 128,000-token context window. [5]
| Attribute | Value |
|---|---|
| Developer | Meituan (LongCat team) [1] |
| Initial release | Late August to September 1, 2025 (LongCat-Flash-Chat) [1][5] |
| Model type | Mixture-of-experts large language model [1] |
| Total parameters | ~560 billion [1] |
| Activated parameters per token | ~18.6B to 31.3B (avg ~27B) [1] |
| Key architecture features | Zero-computation experts; Shortcut-connected MoE (ScMoE) [1] |
| Training data | >20 trillion tokens (pre-training ~30 days) [1] |
| Context length | 128,000 tokens (Chat) [5] |
| Reported inference speed | >100 tokens per second [1][2] |
| Reported output cost | ~$0.70 per million tokens [1] |
| License | MIT [5] |
| Availability | Hugging Face, GitHub [5] |
Meituan positioned LongCat-Flash as competitive with leading open and closed models while standing out on agentic and tool-use tasks. [1][2] Unless otherwise noted, the benchmark figures in this section come from Meituan's own technical report and model card.
On the LongCat-Flash-Chat model card, Meituan reported scores including 89.71 on MMLU, 86.50 on ArenaHard-V2, 89.65 on IFEval (instruction following), and 48.02 on LiveCodeBench (coding). [5] On agentic and tool-use evaluations it reported 73.68 on the telecom split of τ²-Bench (tau-squared), 39.51 on TerminalBench, and 24.30 on VitaBench. [5] Independent commentary noted that LongCat-Flash-Chat tended to outperform several mainstream models on agentic tasks while lagging somewhat on coding benchmarks. [2][7]
Multiple summaries described LongCat-Flash as performing on roughly the same tier as models from DeepSeek, Alibaba's Qwen3 family, and Moonshot AI, as well as some prominent US models, with its efficiency (low activated-parameter count and high throughput) cited as the main differentiator rather than top-of-the-leaderboard accuracy. [2][6][7] As with all vendor-reported benchmarks, these numbers reflect Meituan's evaluation conditions and should be read with that caveat.
LongCat-Flash-Chat was released as an open-weight model under the permissive MIT license, with weights distributed on Hugging Face and supporting code on GitHub. [5] The permissive license allows commercial use, redistribution, and modification, consistent with the broader trend of Chinese labs releasing capable models under permissive terms. Inference support was added by community runtimes; for example, the LMSYS team published guidance on serving LongCat-Flash with SGLang shortly after release. [9]
Not every model in the lineage is open. While LongCat-Flash-Chat, LongCat-Flash-Thinking, and LongCat-Flash-Omni were released openly, Meituan opened the later LongCat-2.0-Preview only for free testing rather than as downloadable weights. [4]
LongCat-Flash is significant on two fronts. First, it represented a major new and somewhat unexpected entrant to the frontier open-weight landscape: a food-delivery and local-services company building and openly releasing a 560-billion-parameter model placed Meituan in the same conversation as dedicated AI labs. [6][7] Second, its zero-computation experts and ScMoE designs offered concrete architectural ideas for making very large MoE models cheaper to run, contributing to the broader open-MoE trend pioneered by models such as DeepSeek-V3 and Qwen's MoE releases. [1][2]
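The model anchored a rapidly expanding family released over late 2025 and into 2026, including the reasoning-focused LongCat-Flash-Thinking, the multimodal LongCat-Flash-Omni, and the trillion-parameter LongCat-2.0-Preview. [2][3][4]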
Taken together, the LongCat releases established Meituan as a persistent open-model contributor and a notable example of a Chinese consumer-internet company scaling frontier AI, including on domestic hardware, alongside the better-known efforts of DeepSeek, Qwen, and Kimi. [4][6][7]