Ling-1T
Last reviewed
May 31, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 2,779 words
Ling-1T is a trillion-parameter open-weight language model released by Ant Group through its Inclusion AI research group on October 9, 2025. It is a mixture of experts model with about one trillion total parameters, of which only around 50 billion are active for any given token. Ant Group put it forward as one of the largest open-source language models available at the time, and as the flagship non-thinking model of a family branded Ling 2.0. [1][2][3]
The release matters mostly for what it signals. A trillion-parameter model is the kind of thing that until recently lived behind closed APIs at a handful of well-funded labs. Ant Group, the fintech company best known for the Alipay payments platform, shipped one with open weights under a permissive license, joining a wave of Chinese groups pushing large open-source AI systems into public hands. Ling-1T sits alongside models like DeepSeek V3 and Kimi K2 as part of that wave, and it leans hard on sparsity to keep the compute cost of a trillion-scale model within reach. [2][4]
Ling-1T carries a very large parameter count but is highly sparse in operation. The headline figure is the roughly one trillion total parameters, which is what makes it eye-catching. The number that matters for cost is the active parameter count, which the technical report puts at 51 billion per token, because a mixture-of-experts model runs only a fraction of its weights on each forward pass. The activation ratio at the expert level works out to about 3.5 percent, so the network holds a very large pool of expert sub-networks and routes each token to a small subset of them. [1][3][5]
The model is a large language model in the now-standard transformer mold, trained for text generation, coding, math, and general instruction following. Inclusion AI describes it as a non-thinking model, meaning it answers directly rather than emitting a long visible chain of reasoning before the final response. That label is deliberate, because it distinguishes Ling-1T from a sibling model, Ring-1T, that is tuned to reason step by step. More on that distinction below. [1][2]
Ling-1T supports a context window of up to 128,000 tokens, extended from a 32,000-token base using YaRN, which is in line with other recent frontier-class models and enough to take in long documents or large codebases in a single prompt. [1][3]
Inclusion AI is the open-source research group that publishes Ant Group's foundation models on platforms like Hugging Face and ModelScope. The Chinese branding for the family is Bailing, which translates roughly to lark, the songbird, and the Ling in Ling-1T comes from that name. [2][6]
The group has shipped more than one model line. The Ling series covers non-thinking general-purpose models at several sizes, with Ling-mini-2.0 at 16 billion total parameters, Ling-flash-2.0 at 103 billion, and Ling-1T at the top. A parallel Ring series covers reasoning models at matching sizes, and a separate Ming line handles multimodal work. Ling-1T is the largest member of the lineup and the one that drew the most attention, because a trillion-parameter open release is rare. [3][5][6][9]
This fits a broader pattern in China's AI sector, where several companies have decided that releasing strong open weights is a good way to build developer mindshare and to compete with closed models from OpenAI and others. Ant Group framed the Ling-1T release in those terms. Its chief technology officer He Zhengyu said the company believes artificial general intelligence should be a public good, and the open release of Ling-1T together with a preview of the reasoning model Ring-1T was put forward as a step toward that. [2][4]
Ling-1T is built on what Inclusion AI calls the Ling 2.0 architecture, a mixture-of-experts design shared across the whole family. The same recipe scales from the 16 billion parameter mini model up to the trillion-parameter flagship, which is part of the point. The team says it used a set of internally derived scaling rules, which it calls Ling Scaling Laws, to pick the architecture and the expert configuration so that the design choices would hold up as the model grew. It reports roughly sevenfold active-compute efficiency against a dense model of comparable quality. The full description is laid out in a technical report titled Every Activation Boosted, posted to arXiv on October 24, 2025. [1][3][5]
The expert layout is uniform across the family and specific in its numbers. Every mixture-of-experts layer holds 256 routed experts plus 1 shared expert, and 8 of the routed experts fire for a given token. Counting the always-on shared expert, that is 9 of 257 experts per token, which is what yields the 3.5 percent activation ratio. The shared expert is always on, while the router picks which of the 256 specialists to wake up. Ling-1T stacks 80 layers and uses 64 attention heads, and the first four layers are kept dense rather than routed, a choice the team says improves routing balance while trimming parameters. The vocabulary is large, about 156,000 tokens with byte-level byte-pair encoding, which helps multilingual coverage. [3][5]
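The layout figures can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that every layer after the first four dense ones is a mixture-of-experts layer, and the constant and function names are made up for this example rather than taken from any Ling codebase.

```python
# Figures quoted in the article text.
ROUTED_EXPERTS = 256   # routed experts per MoE layer
SHARED_EXPERTS = 1     # always-on shared expert per MoE layer
ROUTED_ACTIVE = 8      # routed experts selected per token
TOTAL_LAYERS = 80
DENSE_LAYERS = 4       # first four layers are dense


def expert_activation_ratio(count_shared: bool) -> float:
    """Fraction of the experts in one MoE layer that run for a single token."""
    extra = SHARED_EXPERTS if count_shared else 0
    return (ROUTED_ACTIVE + extra) / (ROUTED_EXPERTS + extra)


# Assumption: all non-dense layers are MoE layers.
moe_layers = TOTAL_LAYERS - DENSE_LAYERS

print(f"MoE layers: {moe_layers}")
print(f"routed experts only: {expert_activation_ratio(False):.3%}")
print(f"including shared expert: {expert_activation_ratio(True):.3%}")
```

Counting only routed experts gives 8 of 256, or exactly 1/32. Including the shared expert gives 9 of 257, which rounds to the 3.5 percent figure quoted for the family.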
The attention design uses standard grouped-query attention, the memory-saving scheme that most recent large models settled on, paired with SwiGLU activations and RMSNorm pre-normalization. The router uses a sigmoid scoring function together with an aux-loss-free balancing strategy, meant to spread tokens evenly across experts without the auxiliary load-balancing loss that older MoE models relied on. The model also applies QK normalization, a stabilization trick on the query and key projections that the team says matters a lot for low-precision training, plus a partial rotary embedding applied to only the first 64 dimensions of each head to help with length extrapolation. It also includes multi-token prediction (MTP) layers, in which the model learns to predict more than one future token at a time. DeepSeek V3 used the same family of technique, which can strengthen the training signal and speed up inference. [3][5]
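To make the routing description concrete, here is a minimal sketch of sigmoid scoring with bias-based, aux-loss-free balancing. It follows the general scheme publicly described for DeepSeek V3, in which a per-expert bias affects only expert selection and is nudged up or down according to load. Whether Ling-1T uses exactly this update rule is an assumption, and `BIAS_STEP`, the batch size, and the random logits are hypothetical values chosen for illustration.

```python
import math
import random

NUM_EXPERTS = 256   # routed experts per MoE layer (from the article)
TOP_K = 8           # routed experts selected per token (from the article)
BIAS_STEP = 0.001   # hypothetical bias update speed


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route(logits: list[float], bias: list[float]) -> tuple[list[int], list[float]]:
    """Pick TOP_K experts for one token.

    The bias influences only which experts are selected. Gate weights come
    from the unbiased sigmoid scores, normalized over the chosen experts.
    """
    scores = [sigmoid(z) for z in logits]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:TOP_K]
    total = sum(scores[i] for i in chosen)
    gates = [scores[i] / total for i in chosen]
    return chosen, gates


def update_bias(bias: list[float], load: list[int]) -> None:
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    for i, count in enumerate(load):
        if count > mean_load:
            bias[i] -= BIAS_STEP
        elif count < mean_load:
            bias[i] += BIAS_STEP


# Route one hypothetical batch of 512 tokens with random router logits.
rng = random.Random(0)
bias = [0.0] * NUM_EXPERTS
load = [0] * NUM_EXPERTS
for _ in range(512):
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    chosen, gates = route(logits, bias)
    for i in chosen:
        load[i] += 1
update_bias(bias, load)
print("tokens routed:", sum(load), "gate sum:", round(sum(gates), 6))
```

Because balancing is handled by adjusting the selection bias between steps, no extra loss term competes with the language-modeling objective, which is the motivation usually given for this family of methods.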
The high-sparsity ratio is the heart of the efficiency story. Because only a small slice of the trillion parameters fires per token, the model behaves at inference time more like a 50 billion parameter dense model in terms of compute, while keeping the representational capacity of a much larger network. That is the standard mixture-of-experts bargain, and Ling-1T pushes it to a trillion-parameter total. [3][5]
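A back-of-the-envelope estimate shows the size of that bargain. The sketch below uses the common rule of thumb of about two floating-point operations per weight used in a forward pass. It ignores attention cost over the context and other overheads, so the result is an order-of-magnitude illustration and not a measurement of Ling-1T.

```python
TOTAL_PARAMS = 1.0e12   # about one trillion (from the article)
ACTIVE_PARAMS = 51e9    # active parameters per token (from the article)


def forward_flops_per_token(params_used: float) -> float:
    """Rule-of-thumb estimate: one multiply and one add per weight used."""
    return 2.0 * params_used


sparse = forward_flops_per_token(ACTIVE_PARAMS)
dense = forward_flops_per_token(TOTAL_PARAMS)  # hypothetical dense 1T model

print(f"sparse forward pass: {sparse:.3g} FLOPs per token")
print(f"hypothetical dense model: {dense:.3g} FLOPs per token")
print(f"ratio: {dense / sparse:.1f}x")
```

Under these assumptions, a dense model of the same total size would need roughly 20 times the per-token compute.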
Ling-1T was pre-trained on a large token budget, reported at more than 20 trillion tokens. Inclusion AI says it leaned the mix toward reasoning over the course of the run, raising the share of reasoning-dense data from about 32 percent early on to about 46 percent later, with dedicated math and code corpora doing a lot of the work. A budget in that range is comparable to other recent frontier-class pretraining runs and is one reason the model performs as well as it does on knowledge and reasoning tasks. [1][3][5]
The pretraining used FP8 mixed-precision, an eight-bit floating-point format that cuts memory and bandwidth costs relative to the older 16-bit formats. Inclusion AI calls Ling-1T part of the largest open-source effort trained entirely in FP8, and reports that the fine-grained quantization stayed within about a quarter of a percent of full BF16 accuracy after 900 billion tokens while cutting memory use by more than 15 percent. FP8 training at trillion-parameter scale is demanding to get right, and its use here is part of how the team kept the run tractable while holding training stable. The combination of a sparse MoE design, FP8 precision, and the scaling-law-driven architecture is what Inclusion AI points to when it describes Ling-1T as an efficient way to train and serve a trillion-parameter model. [1][3][5]
On the post-training side, the team describes a sequence of named methods. It starts from a decoupled fine-tuning step to set up a reasoning-focused initialization, then applies Evo-CoT, short for evolutionary chain-of-thought, to progressively strengthen the model's reasoning. After that comes LPO, or linguistic-unit policy optimization, a reinforcement learning method that treats whole sentences as the unit of optimization rather than individual tokens, which the team says is more stable than token-level or sequence-level alternatives. A group-based preference step handles the human-alignment part. These are the stages that turn the pretrained base into the instruction-following model that ships. [1][5]
The split between Ling and Ring is the easiest way to understand the family, and it is worth getting right because the names are close. Ling models are non-thinking. They take a prompt and produce an answer directly, which makes them fast and well suited to general chat, coding, and everyday tasks. Ling-1T is the trillion-parameter member of that group. [1][2]
Ring models are the reasoning side of the family. Ring-1T is built on the same trillion-parameter Ling 2.0 base, with the same roughly 50 billion active parameters, but it is trained further with reinforcement learning so that it works through problems step by step before answering. It first appeared as Ring-1T-preview alongside Ling-1T in October 2025, with a full release later that month. Inclusion AI describes Ring-1T's training in terms of reinforcement learning with verifiable rewards on top of its own RL stack, which includes a stabilization method it calls Icepop and a training framework it calls ASystem. The result is a model aimed at hard math, competition coding, and multi-step logic, the kinds of tasks where explicit reasoning helps. Inclusion AI reported that Ring-1T reached silver-medal-level performance on the 2025 International Mathematical Olympiad problems through natural-language reasoning. [7]
So the relationship is that Ling is the base and the fast responder, while Ring is the reasoning model grown from the same trunk. A user who wants quick answers reaches for Ling-1T, and a user who wants the model to deliberate reaches for Ring-1T. Both share the architecture, the parameter counts, and the open license. [2][7]
Inclusion AI evaluated Ling-1T against a set of strong recent models, both open and closed. The comparison set on the official model card includes DeepSeek-V3.1-Terminus, Kimi-K2-Instruct, Qwen3-235B-A22B-Instruct-2507, GPT-5-main, and Gemini-2.5-Pro, spanning code generation, mathematics, knowledge, and instruction following. The company positions Ling-1T as a leading open-source non-thinking model, with particular strength in code generation and competition mathematics, and reports that it generally came out ahead of the open and closed models in that comparison set with thinking modes disabled. [1][4][8]
The most-cited single number is the AIME 2025 competition math result, where Ling-1T scored 70.42 percent while using a relatively low token budget per problem. The team frames this as extending the Pareto frontier of accuracy versus token cost rather than just topping a leaderboard. The full model card presents the rest of the scores in a results figure that did not transcribe cleanly into the secondary write-ups available, so the table below lists the headline number and the evaluation areas and comparison models that Inclusion AI reported. Readers who need cell-by-cell figures should consult the model card and the technical report directly. [1][3][8]
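The Pareto framing can be illustrated with a small sketch. A model sits on the frontier if no other model is both at least as accurate and at least as cheap in tokens. The data points below are placeholders invented for illustration, not reported results for Ling-1T or any other model, and the `Run` type and `pareto_frontier` function are hypothetical names.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    avg_tokens: float   # average output tokens per problem
    accuracy: float     # percent of problems solved


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return the runs that no other run dominates.

    A run is dominated if another run is at least as accurate with no more
    tokens, and strictly better on at least one of the two measures.
    """
    frontier = []
    for r in runs:
        dominated = any(
            o.avg_tokens <= r.avg_tokens
            and o.accuracy >= r.accuracy
            and (o.avg_tokens < r.avg_tokens or o.accuracy > r.accuracy)
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.avg_tokens)


# Placeholder numbers for illustration only, not measured results.
runs = [
    Run("model_a", 2000, 50.0),
    Run("model_b", 4000, 65.0),
    Run("model_c", 6000, 60.0),   # dominated by model_b
    Run("model_d", 9000, 72.0),
]
print([r.name for r in pareto_frontier(runs)])
# → ['model_a', 'model_b', 'model_d']
```

Extending the frontier, in this sense, means adding a point that is more accurate than anything else at its token cost, or cheaper than anything else at its accuracy.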
| Item | Detail |
|---|---|
| AIME 2025 (competition math) | 70.42 percent for Ling-1T [4][8] |
| Reported strengths | Code generation, competition math, logical reasoning [1][8] |
| Code benchmarks cited | LiveCodeBench, CodeForces, MultiPL-E [1][8] |
| Math benchmarks cited | AIME 2025 and related competition sets [1][8] |
| Comparison models | DeepSeek-V3.1-Terminus, Kimi-K2-Instruct, Qwen3-235B-A22B-Instruct-2507, GPT-5-main, Gemini-2.5-Pro [1] |
The headline claim from Ant Group is that Ling-1T reaches top-tier results among open models and trades blows with leading closed systems on coding and math. As always with vendor-reported numbers, the right read is that the model is competitive in the upper tier of open releases rather than a settled winner, and independent evaluation across more benchmarks is the way to confirm where it actually lands. [2][4]
| Specification | Value |
|---|---|
| Developer | Ant Group, via Inclusion AI [1][2] |
| Family | Ling 2.0 (Bailing) [2][6] |
| Release date | October 9, 2025 [2] |
| Model type | Non-thinking mixture-of-experts language model [1] |
| Total parameters | About 1 trillion [1][3] |
| Active parameters | 51 billion per token [5] |
| Layers | 80, first 4 dense [5] |
| Expert layout | 256 routed experts plus 1 shared, 8 routed experts active per token [3][5] |
| Activation ratio | About 3.5 percent [5] |
| Attention | Grouped-query attention, 64 heads [5] |
| Vocabulary | About 156,000 tokens (byte-level BPE) [5] |
| Context length | Up to 128,000 tokens, extended from 32,000 via YaRN [1][3] |
| Training tokens | More than 20 trillion [1][3] |
| Training precision | FP8 mixed-precision [1][3][5] |
| Notable techniques | Aux-loss-free sigmoid routing, QK normalization, partial RoPE, multi-token prediction, Ling Scaling Laws [3][5] |
| Post-training | Decoupled fine-tuning, Evo-CoT, LPO reinforcement learning [1][5] |
| License | MIT [1][2] |
| Availability | Hugging Face, ModelScope, GitHub [1][2] |
Ling-1T ships under the MIT license, one of the most permissive licenses around, which the model card and the launch coverage both confirm. That is a meaningful choice. MIT puts very few restrictions on commercial use, modification, or redistribution, so the weights are genuinely usable by companies and researchers, not just available to look at. For a model at this scale, that level of openness is unusual and is part of why the release drew attention. [1][2]
The significance is less about any single benchmark and more about the precedent. A trillion-parameter open-weight model under a permissive license lowers the bar for who can study and build on frontier-scale systems. It strengthens the case that the open ecosystem can keep pace with closed labs on raw scale, and it adds Ant Group to the short list of organizations that have actually released models in this size class. Alongside DeepSeek and others, it is part of the reason the gap between open and closed models narrowed sharply through 2025. [2][4]
Ling-1T belongs to the same lineage of large sparse models as DeepSeek V3, which popularized the approach of pairing a very large expert pool with a small active footprint and training it efficiently at scale. The two share several specific ideas, including a shared-plus-routed expert layout, aux-loss-free load balancing, multi-token prediction, and an emphasis on FP8 training, and they target the same goal of getting frontier-level quality without frontier-level serving cost. The technical report explicitly positions Ling-1T against DeepSeek V3, which at 671 billion total parameters is smaller, with Ling-1T pushing the total up to a trillion while keeping the active count in a similar range. [3][5]
The broader point is that mixture-of-experts has become the default way to build very large open models. Sparsity decouples capacity from per-token compute, and the recent crop of Chinese open releases, Ling-1T and Kimi K2 included, leans on that decoupling to hit large headline parameter counts at manageable inference cost. [3][5]
A few cautions are worth keeping in mind. First, the most detailed benchmark numbers come from the developer, and at the time of release independent third-party evaluation was still thin, so the precise standing of the model against its rivals should be treated as provisional. Second, serving a trillion-parameter model, even a sparse one, still needs substantial hardware. The roughly 50 billion active parameters set the per-token compute, but the full weight set has to be held in memory across an accelerator cluster, which puts self-hosting out of reach for most individuals. Third, Ling-1T is a non-thinking model by design, so for the hardest multi-step reasoning the reasoning-tuned Ring-1T is the intended tool, and Ling-1T should not be expected to match a dedicated reasoning model on those tasks. Finally, as a recently released model the surrounding tooling, fine-tunes, and long-term support are less mature than for older and more widely deployed systems. [1][2][7]
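The hardware point can be made concrete with a rough weight-memory estimate. The sketch below counts only the bytes needed to store about one trillion parameters at a given precision. The 80 GB device size is a hypothetical choice, and the estimate leaves out the KV cache, activations, and runtime overhead, so real deployments need more.

```python
import math

TOTAL_PARAMS = 1.0e12                     # about one trillion (from the article)
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}   # storage width of each format
DEVICE_MEMORY_GB = 80                     # hypothetical accelerator memory size


def weight_memory_gb(fmt: str) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[fmt] / 1e9


def min_devices(fmt: str) -> int:
    """Lower bound on devices needed to hold the weights and nothing else."""
    return math.ceil(weight_memory_gb(fmt) / DEVICE_MEMORY_GB)


for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: {weight_memory_gb(fmt):.0f} GB of weights, at least {min_devices(fmt)} devices")
```

Even at one byte per parameter the weights alone come to about a terabyte, which is why the sparse active-parameter count helps with compute but not with the memory needed to host the model.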