Hunyuan-A13B
Last reviewed: May 31, 2026
Sources: 10 citations
Review status: Source-backed
Revision: v1 · 2,063 words
Hunyuan-A13B is an open-weight mixture-of-experts large language model released by Tencent in late June 2025. It has about 80 billion total parameters but activates only around 13 billion of them for any single token, which is where the "A13B" name comes from. The model ships with open weights, a native 256K-token context window, and a switchable "fast versus slow" reasoning mode that lets a caller decide whether the model answers directly or works through an explicit chain of thought first. Tencent positions it as a model that punches above its active-parameter weight, especially on reasoning and agentic tasks, while staying cheap enough to run on a small number of GPUs.[1][2][3]
The release is part of Tencent's broader Hunyuan program, the company's family of foundation models that spans text, image, video, and 3D generation. Hunyuan-A13B sits in that family as the flagship open text model of mid-2025, and it follows the much larger Hunyuan-Large MoE that Tencent opened in late 2024.[3][4]
Hunyuan-A13B is a decoder-only transformer that uses a fine-grained mixture-of-experts design. Instead of one big feed-forward block per layer, each layer holds a pool of smaller expert networks, and a learned router sends each token to a subset of them. The result is a model whose stored capacity is large while the compute spent per token stays small. Tencent reports a total parameter count near 80 billion with roughly 13 billion active per token, a ratio that keeps inference closer in cost to a 13B dense model than to an 80B dense one.[1][2]
The team released the model in several forms. There is a base pretrained checkpoint, an instruction-tuned chat checkpoint called Hunyuan-A13B-Instruct, and quantized builds for cheaper serving. The weights and the inference code live on GitHub and Hugging Face under Tencent's own community license, and the design choices are written up in a technical report posted to arXiv in June 2025.[1][2][5]
Hunyuan-A13B is built around 32 transformer layers. Each MoE layer contains 64 specialized (non-shared) experts plus 1 shared expert that every token always passes through. For a given token the router activates 8 of the 64 specialized experts on top of the shared one, so most of the parameter pool stays idle on any single forward pass. The shared expert captures general patterns that every token needs, while the routed experts specialize. The feed-forward blocks use the SwiGLU activation, and the vocabulary holds about 128,000 tokens.[1][6]
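The routing described above can be illustrated with a small pure-Python sketch. It is not Tencent's implementation: the softmax gate, the renormalization of the top-8 weights, and the scalar "experts" are assumptions made for illustration, and the names `route_token` and `moe_layer` are hypothetical. Only the counts (64 routed experts, 8 active, 1 shared) come from the article.

```python
import math

NUM_ROUTED_EXPERTS = 64  # specialized experts per MoE layer (from the article)
TOP_K = 8                # routed experts activated per token (from the article)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k routed experts.

    Assumption: a softmax gate with weights renormalized over the selected
    experts. The real router may score and normalize differently.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, router_logits, routed_experts, shared_expert):
    """Combine the always-on shared expert with the weighted routed experts."""
    out = shared_expert(x)
    for idx, weight in route_token(router_logits):
        out += weight * routed_experts[idx](x)
    return out


if __name__ == "__main__":
    # Toy scalar experts: expert k multiplies its input by k (hypothetical).
    experts = [lambda x, k=k: x * k for k in range(NUM_ROUTED_EXPERTS)]
    logits = [float(i) for i in range(NUM_ROUTED_EXPERTS)]
    chosen = route_token(logits)
    print(sorted(i for i, _ in chosen))  # the 8 highest-scoring experts
    print(moe_layer(1.0, logits, experts, shared_expert=lambda x: x))
```

In this sketch 56 of the 64 routed experts are never called for the token, which is the sense in which most of the parameter pool stays idle on a forward pass.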
For attention the model uses grouped-query attention, which lets multiple query heads share a smaller set of key and value heads. That choice shrinks the key-value cache, which matters a lot at long context lengths because the cache is what dominates memory once a prompt grows into the hundreds of thousands of tokens. Tencent pretrained the model on a corpus of roughly 20 trillion tokens, then ran post-training stages for reasoning and general chat behavior.[1][6]
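A rough calculation shows why sharing key and value heads matters at long context. The layer count (32) and the 256K window come from the article. The head counts and head dimension below are hypothetical values chosen only for illustration, as is the 2-byte (16-bit) cache precision, so the absolute sizes should not be read as Hunyuan-A13B's real figures.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


NUM_LAYERS = 32              # from the article
CONTEXT_TOKENS = 256 * 1024  # 256K tokens, taking K = 1024

# Hypothetical head configuration, NOT taken from the article.
NUM_QUERY_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128

GIB = 1024 ** 3
# Standard multi-head attention: one K/V head per query head.
mha = kv_cache_bytes(CONTEXT_TOKENS, NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM)
# Grouped-query attention: query heads share a smaller set of K/V heads.
gqa = kv_cache_bytes(CONTEXT_TOKENS, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)

print(f"MHA cache: {mha / GIB:.1f} GiB")  # 128.0 GiB
print(f"GQA cache: {gqa / GIB:.1f} GiB")  # 32.0 GiB
```

Under these assumed shapes the cache shrinks by the ratio of query heads to key-value heads, here a factor of four.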
The efficiency story is the point of the whole design. Because only about 13 billion parameters fire per token, Hunyuan-A13B can serve interactively on a modest GPU setup rather than the multi-node clusters that dense models of similar quality tend to need. Coverage at launch framed it as a model that brings frontier-style reasoning into reach for teams without large GPU fleets, which is the gap Tencent says it built the model to fill. Tencent and downstream packagers also published FP8 and INT4 (GPTQ) quantized versions, produced with the AngelSlim compression toolkit, which push the memory footprint down further. The model runs on the common open serving stacks including vLLM, SGLang, and TensorRT-LLM.[1][2][8]
Hunyuan-A13B natively supports a 256K-token context window, which is enough to hold a small book, a large codebase slice, or a long multi-turn agent trace in a single prompt. Tencent reports that the model keeps stable behavior on long-text tasks rather than degrading sharply past some shorter limit, and the company evaluated it on long-context suites alongside the shorter benchmarks.[1][2]
The long window pairs naturally with the agentic use cases Tencent emphasizes. Tool-calling agents tend to accumulate long histories of observations, tool outputs, and intermediate reasoning, so a model that can read 256K tokens without falling apart is more useful as the planning core of an agent loop.[1][3]
The most distinctive feature of Hunyuan-A13B is its dual-mode chain-of-thought. The model can answer in a "fast thinking" mode that returns a direct response with little or no visible deliberation, or in a "slow thinking" mode where it writes out an explicit chain-of-thought before committing to a final answer. The caller picks the mode at request time. Slow thinking is the default, and a user turns on fast thinking by prefixing the query with the string /no_think. The /think prefix forces slow thinking back on.[1][2]
The trade-off is the usual one for reasoning models. Slow thinking spends more tokens and more latency to get better answers on hard math, science, and multi-step problems, while fast thinking is cheaper and quicker for routine queries where long deliberation would just waste compute. Folding both behaviors into one set of weights means a deployment does not have to host a separate "thinking" model and a separate "chat" model, and an application can route easy and hard traffic to the same endpoint by changing a single prefix.[1][2]
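The prefix switch can be wrapped in a small helper on the application side. This is a minimal sketch: the function name `apply_mode` and the single space between prefix and query are assumptions, and only the /no_think and /think prefixes and the slow-by-default behavior come from the article.

```python
FAST_PREFIX = "/no_think"  # turns on fast thinking (from the article)
SLOW_PREFIX = "/think"     # forces slow thinking (from the article)


def apply_mode(query, mode="default"):
    """Prefix a user query to select the reasoning mode.

    mode="default" leaves the query untouched, which the article says
    results in slow thinking. The space separator is an assumption.
    """
    if mode == "default":
        return query
    if mode == "fast":
        return f"{FAST_PREFIX} {query}"
    if mode == "slow":
        return f"{SLOW_PREFIX} {query}"
    raise ValueError(f"unknown mode: {mode!r}")


print(apply_mode("What is the capital of France?", mode="fast"))
print(apply_mode("Prove that the square root of 2 is irrational.", mode="slow"))
```

An application could call this helper just before sending the request, choosing "fast" for routine lookups and "slow" for multi-step problems while still targeting one endpoint.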
This approach echoes the broader 2025 trend of reasoning-first models, where the chain-of-thought is trained in through reinforcement learning rather than only prompted at inference. Tencent describes a post-training pipeline that mixes supervised fine-tuning with reinforcement learning aimed at both the reasoning traces and general helpfulness, so the slow mode is a learned behavior of the weights rather than a wrapper bolted on afterward.[3][9]
Tencent reports that Hunyuan-A13B is competitive with much larger reasoning systems despite its small active footprint, and that it leads on several agentic tool-use benchmarks. The instruction-tuned model's published scores put it near OpenAI's o1 and DeepSeek-R1 on math and science, and Tencent reports it ahead of them on some agent tasks, although the table below gives only Hunyuan-A13B's own scores for the reasoning and agent rows. The figures come from Tencent's model card and technical report. As with any vendor-reported numbers, they reflect the publisher's own evaluation setup and should be read with that caveat.[1][2]
| Benchmark | Category | OpenAI o1 | DeepSeek-R1 | Qwen3-A22B | Hunyuan-A13B-Instruct |
|---|---|---|---|---|---|
| AIME 2024 | Math | 74.3 | 79.8 | 85.7 | 87.3 |
| AIME 2025 | Math | 79.2 | 70.0 | 81.5 | 76.8 |
| MATH | Math | 96.4 | 97.3 | 94.0 | 94.3 |
| GPQA-Diamond | Science | 78.0 | 71.5 | 71.1 | 71.2 |
| LiveCodeBench | Coding | 63.9 | 65.9 | 70.7 | 63.9 |
| BBH | Reasoning | n/a | n/a | n/a | 89.1 |
| ZebraLogic | Reasoning | n/a | n/a | n/a | 84.7 |
| BFCL v3 | Agent | n/a | n/a | n/a | 78.3 |
| tau-Bench | Agent | n/a | n/a | n/a | 54.7 |
| C3-Bench | Agent | n/a | n/a | n/a | 63.5 |
The agentic results are the ones Tencent calls out most. The abstract of the technical report names BFCL-v3, tau-Bench, C3-Bench, and ComplexFuncBench as benchmarks where the model leads on challenging tasks, which fits the company's framing of Hunyuan-A13B as a model meant to drive tool-using agents rather than just chat.[2][3]
The base pretrained checkpoint also posts strong general-knowledge and reasoning numbers before any instruction tuning, which gives a sense of the raw model quality.[1]
| Benchmark | Hunyuan-A13B-Pretrain |
|---|---|
| MMLU | 88.17 |
| MMLU-Pro | 67.23 |
| BBH | 87.56 |
| GSM8K | 91.83 |
| MATH | 72.35 |
| C-Eval | 84.00 |
| CMMLU | 88.20 |

| Property | Value |
|---|---|
| Developer | Tencent (Hunyuan team) |
| Release | Late June 2025 |
| Architecture | Fine-grained mixture-of-experts, decoder-only transformer |
| Total parameters | ~80 billion |
| Active parameters per token | ~13 billion |
| Layers | 32 |
| Experts | 64 specialized + 1 shared, 8 specialized activated per token |
| Activation | SwiGLU |
| Attention | Grouped-query attention |
| Vocabulary | ~128,000 tokens |
| Pretraining data | ~20 trillion tokens |
| Context window | 256K tokens |
| Reasoning modes | Dual mode: fast (/no_think) and slow (/think) |
| Quantization | FP8, INT4 (GPTQ) via AngelSlim |
| Serving | vLLM, SGLang, TensorRT-LLM |
| License | Tencent Hunyuan Community License |
Hunyuan-A13B is open in the practical sense that the weights and inference code are downloadable and runnable by anyone, which makes it part of the wider open-source AI movement out of China. It is not, however, under a standard permissive license. Tencent ships it under the Tencent Hunyuan Community License Agreement, the same custom license the company applies across its open Hunyuan releases.[1][7]
That license allows commercial use but carries two notable limits. First, products or services that exceed 100 million monthly active users in a given month must request a separate license from Tencent, which the company may grant at its discretion. Second, the granted territory is worldwide but excludes the European Union, the United Kingdom, and South Korea. These terms put Hunyuan-A13B in the same "open weights with strings attached" category as several other large-vendor releases rather than in the fully permissive Apache or MIT camp.[7]
Hunyuan is Tencent's umbrella brand for its in-house foundation models, and the lineup reaches well past text. The company has released Hunyuan models for image generation, the HunyuanVideo system for video, and Hunyuan3D for 3D asset generation, alongside the language models. On the text side, Hunyuan-A13B follows Hunyuan-Large, a much bigger mixture-of-experts model with about 389 billion total parameters and around 52 billion active that Tencent opened in November 2024 with a 256K context window of its own.[3][4]
After Hunyuan-A13B, Tencent continued to fill out the open lineup with a set of smaller dense models in sizes such as 0.5B, 1.8B, 4B, and 7B aimed at edge and on-device use, released later in 2025. Those smaller checkpoints carry over the same long context and dual-mode reasoning ideas at sizes that fit on a single consumer card.[10] Together with the DeepSeek and Qwen3 families, these releases are part of a wave of Chinese labs publishing capable open-weight models through 2025, and Hunyuan-A13B is often grouped with them as an example of the mixture-of-experts approach that Mixtral helped popularize for open models.[3]
The headline efficiency claim comes with the standard mixture-of-experts caveat. Active parameters per token are small, but the full 80 billion parameters still have to be loaded into memory to serve the model, so the GPU memory bill scales with total size even though the compute bill scales with active size. The framing of the model as one that runs on a modest GPU setup therefore applies most cleanly to the quantized FP8 and INT4 builds rather than to the full-precision weights.[1][2]
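The gap between memory and compute can be made concrete with back-of-the-envelope arithmetic. This sketch uses the article's approximate parameter counts and assumes that "full precision" means 16-bit weights. It counts weights only, ignoring quantization scales, the key-value cache, activations, and runtime overhead, so real deployments need more memory than shown.

```python
TOTAL_PARAMS = 80e9   # approximate total parameters (from the article)
ACTIVE_PARAMS = 13e9  # approximate active parameters per token (from the article)

# Assumed storage width per parameter for each format, in bytes.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params, bytes_per_param):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return total_params * bytes_per_param / 1e9


def active_fraction(active_params, total_params):
    """Share of the parameters that participate in one token's forward pass."""
    return active_params / total_params


for name, width in BYTES_PER_PARAM.items():
    print(f"{name}: ~{weight_memory_gb(TOTAL_PARAMS, width):.0f} GB of weights")
# bf16: ~160 GB, fp8: ~80 GB, int4: ~40 GB

print(f"active share per token: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.2%}")
```

Memory follows the 80 billion figure in every row, while per-token compute follows the much smaller active share, which is why quantization rather than sparsity is what shrinks the hardware footprint.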
The reported benchmark numbers are Tencent's own. Independent evaluations and leaderboard placements can differ from a vendor's published table because of prompt formatting, sampling settings, and the exact benchmark versions used, so the comparisons against o1, DeepSeek-R1, and Qwen3-A22B are best treated as the publisher's view rather than a settled ranking. The custom license is also a real constraint for some users, since the monthly-active-user cap and the excluded territories rule out certain deployments that a permissive license would allow.[1][7]
Finally, the dual-mode reasoning helps with cost control but does not remove the failure modes common to reasoning models. Slow thinking can still produce confident wrong chains, and long contexts do not guarantee perfect recall across the whole 256K window, so applications that depend on retrieval accuracy at extreme lengths should test against their own data.[1][2]