GLM-4.6
Last reviewed
May 16, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 ยท 2,651 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 16, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 ยท 2,651 words
Add missing citations, update stale details, or suggest a clearer explanation.
GLM-4.6 is a flagship open-weight large language model released by Zhipu AI under its international brand Z.ai on September 30, 2025. The model is the successor to GLM-4.5 and is built on a sparse Mixture of Experts (MoE) architecture with roughly 357 billion total parameters and about 32 billion active parameters per token. It expands the context window from 128,000 tokens in GLM-4.5 to 200,000 tokens, with a maximum output of 128,000 tokens, and is published under the MIT License on Hugging Face and ModelScope [1][2][3].
GLM-4.6 is positioned by Zhipu as a frontier-class general purpose model with particular strength in real-world coding, long-context processing, reasoning, search, writing, and agentic tool use. On the company's CC-Bench evaluation, an extended human-graded coding harness run in isolated Docker containers, GLM-4.6 reaches a 48.6 percent win rate against Claude Sonnet 4 and uses approximately 15 percent fewer tokens than GLM-4.5 to complete the same tasks. It still trails Claude Sonnet 4.5 on the hardest coding evaluations, which Zhipu's own release post acknowledges [1][2][4]. The model is available through Z.ai's API, OpenRouter, Together AI, and a growing list of inference partners, and it ships with integrations for Claude Code, Cline, Roo Code, Kilo Code, and OpenCode out of the box [3][5].
The GLM series was started by Zhipu AI, a Beijing-based research lab that spun out of the Knowledge Engineering Group at Tsinghua University in 2019. Early GLM models used a General Language Model objective that combined autoregressive and span-corruption pretraining, and the team published the open-source ChatGLM-6B chatbot in March 2023, which became one of the most widely downloaded Chinese language models on Hugging Face during the year. GLM-4 followed in January 2024 as a closed proprietary model, and the GLM-4.5 family in July 2025 returned the line to open weights with a 355B-A32B MoE flagship and a lighter 106B-A12B "Air" variant [3][6].
GLM-4.6 was announced under the Z.ai brand, which Zhipu adopted in mid-2025 for its international product line. The release was timed against a busy week in late September that also saw the launch of DeepSeek V3.2 and the steady ramp of Qwen3-Max, and Zhipu framed GLM-4.6 as the strongest open-weight Chinese coding model on the market at the time. The team explicitly compared itself to Anthropic and to the rest of the domestic Chinese pack, calling the model the top performing Chinese coding model in head to head evaluations [1][7]. The release also coincided with broader investor attention on Zhipu, which had raised substantial funding from Saudi Arabian and other strategic backers earlier in 2025 and was positioning itself for a possible Hong Kong listing.
GLM-4.6 keeps the broad architectural template of GLM-4.5 but tunes several details and pushes context length to a new ceiling. The model is a decoder-only transformer with a sparse MoE feed-forward layer at every block. Of the 357 billion total parameters, only about 32 billion are activated per token, giving it roughly an 11 to 1 sparsity ratio. That choice mirrors the design space staked out by DeepSeek V3 and Qwen3-235B-A22B and reflects the field's growing consensus that very large MoE models with modest active parameter counts are the most cost-efficient way to push toward frontier quality on commodity inference hardware [1][8][9].
The attention stack uses Grouped-Query Attention with 96 query heads, which keeps the inference memory footprint manageable for long contexts. Position information is encoded with a partial Rotary Position Embedding scheme, and attention logits are stabilized with QK-Norm. Routing inside the MoE layers uses loss-free balance routing with sigmoid gates, an approach designed to encourage roughly uniform expert utilization without introducing the explicit auxiliary balancing loss that earlier MoE designs depended on. Zhipu reports BF16 and F32 tensor types as native, and the published weights on Hugging Face are distributed in BF16 [1][8].
The headline architectural change relative to GLM-4.5 is the expansion of the context window from 128K tokens to 200K tokens. The model can still emit up to 128,000 output tokens in a single response, which makes it usable for long-form code generation, end-to-end document drafting, and multi-turn agentic workflows that need to keep large amounts of state in context. Zhipu reports that the longer window was achieved primarily through changes to position embeddings and continued pre-training on long sequence data, rather than through purely inference-time tricks like YaRN extrapolation [1][2][5].
A notable behavioral change is that GLM-4.6 is trained to invoke tools during its internal reasoning trace rather than only after producing a final chain of thought. In practice that means an agent built on GLM-4.6 can interleave search queries, code execution, and file operations with reasoning steps without breaking out of the model loop. Zhipu credits this design with much of the model's improvement on agentic and search-based benchmarks, and the same pattern shows up in third-party reviews of how the model behaves inside coding harnesses like Claude Code and Kilo Code [1][4][10].
Zhipu has not published a full standalone technical report for GLM-4.6. The model card on Hugging Face references the GLM-4.5 technical report on arXiv (2508.06471) as the primary architectural and training source, and notes that GLM-4.6 inherits most of the GLM-4.5 recipe with continued pre-training on additional code, agent, and long-context data [1][8]. Public material describes three high level stages: pre-training on a large multilingual corpus with heavy emphasis on English and Chinese, a long-context continued pre-training phase that extended the effective window to 200,000 tokens, and a post-training phase combining supervised fine-tuning with reinforcement learning aimed at coding, tool use, and reasoning quality [1][8].
Beyond those high-level claims, Zhipu has not disclosed the size of the GLM-4.6 training corpus, the exact mix of data sources, or the reinforcement learning algorithm used in post training. The release blog focuses instead on outcome measures, in particular the CC-Bench win rate against Claude Sonnet 4 and the 15 percent token efficiency improvement over GLM-4.5 on multi-turn coding tasks. The model is open weight under the MIT license, but it is not open data, and the training corpus is not published [1][2][4].
GLM-4.6 is a refresh rather than a new architecture. The headline improvements are concentrated in long context, coding, agentic tool use, and writing quality. The table below summarizes the most-cited differences between the two generations as reported in Zhipu's own materials and in independent reviews.
| Area | GLM-4.5 | GLM-4.6 | Notes |
|---|---|---|---|
| Total parameters | 355B | 357B | Slight increase, same broad architecture [1][8] |
| Active parameters | ~32B | ~32B | Sparsity ratio unchanged [8][9] |
| Context window | 128K tokens | 200K tokens | About 56 percent longer input window [1][2] |
| Maximum output | 96K tokens | 128K tokens | Allows longer single-turn generations [2][5] |
| CC-Bench vs Claude Sonnet 4 | Lower win rate | 48.6 percent win rate | Near parity in extended human-graded coding tests [1][4] |
| Token efficiency on CC-Bench | Baseline | About 15 percent fewer tokens | Same tasks completed with fewer tokens [1][4] |
| LiveCodeBench v6 | 63.3 percent | 82.8 percent | Large jump on competitive coding benchmark [10] |
| Tool use inside reasoning | Limited | Native | Tools can be called mid-reasoning [1][4] |
| Writing alignment | GLM-4.5 baseline | Better human preference scores | Reported by Zhipu and reviewers [1][7] |
The model is otherwise drop-in compatible with GLM-4.5 deployments. Inference servers like vLLM and SGLang added day-one support for GLM-4.6, and the chat template, tokenizer, and tool-calling schema remain compatible with the GLM-4.5 specification [1][3].
Zhipu reported GLM-4.6 results across eight public benchmarks at launch, covering math, science, coding, agentic reasoning, and general knowledge. Independent measurement by Artificial Analysis and other third party trackers followed within days. The table below collects the most widely cited numbers and identifies which are vendor reported and which are independent. Benchmarks where Zhipu has not published a number are omitted rather than estimated.
| Benchmark | Score | Source |
|---|---|---|
| AIME 2025 (math, standard) | 93.9 percent | Vendor reported [1][10] |
| AIME 2025 (with tools enabled) | 98.6 percent | Vendor reported [10] |
| GPQA Diamond | 81.0 percent | Vendor reported [3][10] |
| GPQA (with tools) | 82.9 percent | Vendor reported [10] |
| LiveCodeBench v6 | 82.8 percent | Vendor reported [3][10] |
| SWE-bench Verified | 68.0 percent | Vendor reported [10] |
| Humanity's Last Exam (with tools) | 30.4 percent | Vendor reported [10] |
| CC-Bench win rate vs Claude Sonnet 4 | 48.6 percent | Vendor reported [1][4] |
| CC-Bench win rate vs DeepSeek V3.1-Terminus | 64.9 percent | Vendor reported [10] |
| Artificial Analysis Intelligence Index | 33 | Independent [11] |
| Output speed (Artificial Analysis) | 42.2 tokens per second | Independent [11] |
| Time to first token (Artificial Analysis) | 1.16 seconds | Independent [11] |
The LiveCodeBench v6 result is a roughly 19 point jump over GLM-4.5, which scored 63.3 on the same benchmark. The SWE-bench Verified score of 68.0 is competitive with frontier open-weight models from the same generation but trails Claude Sonnet 4.5, which Anthropic reported in the high 70s on the same benchmark. AIME 2025 at 93.9 percent without tools, and 98.6 percent with a code interpreter, is comparable to top reasoning-focused models in the period. Humanity's Last Exam at 30.4 percent with tools is well below frontier reasoning models, which is consistent with GLM-4.6 being a non reasoning flagship rather than a dedicated chain of thought specialist [3][10].
Artificial Analysis placed GLM-4.6 at 33 on its Intelligence Index in October 2025, slightly above the median of 30 for comparable open weight models. The same harness measured output speed at 42.2 tokens per second, which the analysts described as below average for its class, and time to first token at 1.16 seconds, which they called very competitive. Total output volume across the full Intelligence Index run was about 57 million tokens, on the verbose end of the range [11].
The weights are released under the MIT license, which permits commercial use, fine-tuning, redistribution, and derivative works without royalties. Zhipu publishes the safetensors files on Hugging Face under the zai-org/GLM-4.6 repository and mirrors them on ModelScope. Local deployment is supported through vLLM, SGLang, and the standard Hugging Face Transformers library, and the recommended inference settings are a temperature of 1.0 for general evaluation and top_p 0.95 with top_k 40 for code-focused workloads [1][3].
Commercial access goes through the Z.ai API at https://api.z.ai/api/paas/v4/chat/completions, which uses an OpenAI-compatible request schema and accepts an explicit thinking parameter that can be set to enabled or disabled per request. The same endpoint is mirrored by external gateways. OpenRouter lists GLM-4.6 at $0.43 per million input tokens and $1.74 per million output tokens, while Together AI lists it at $0.60 input and $2.20 output. Artificial Analysis reports a blended price of about $0.96 per million tokens at a 3 to 1 input output ratio, which they describe as slightly expensive relative to comparable open weight peers but cheaper than most closed Western alternatives [5][11][12].
| Provider | Input ($/M tokens) | Output ($/M tokens) | Context |
|---|---|---|---|
| Z.ai direct API | Tiered (varies by plan) | Tiered (varies by plan) | 200K |
| OpenRouter | 0.43 | 1.74 | 203K |
| Together AI | 0.60 | 2.20 | 200K |
| Artificial Analysis blended | 0.96 (3:1 mix) | n/a | 200K |
Zhipu also offers a coding-focused subscription product called GLM Coding Plan that bundles GLM-4.6 access with coding tool integrations at a fixed monthly rate, aimed at developers who use the model inside Claude Code, Cline, or similar harnesses for everyday work [1][5].
GLM-4.6 was received as a strong incremental release rather than a step change. The CC-Bench result against Claude Sonnet 4 attracted the most attention, since reaching a 48.6 percent win rate on a Zhipu-defined harness is close to parity with one of the leading Western coding models of the same period. Several reviewers noted the same harness still puts the model behind Claude Sonnet 4.5, which Zhipu itself acknowledged in the release post [1][4][7]. Inside the open weight community, the MIT license drew positive attention because it imposes fewer restrictions than the licenses used by some other large Chinese releases.
The table below compares GLM-4.6 with three of its closest 2025 peers based on the most widely cited publicly reported numbers. Where a benchmark was not reported by a given vendor, the cell is left blank rather than filled with an estimate.
| Feature | GLM-4.6 | DeepSeek V3.1 | Qwen3-Max-Instruct | Kimi K2 |
|---|---|---|---|---|
| Total parameters | 357B | 671B | 1T+ | 1T |
| Active parameters | ~32B | ~37B | Not disclosed | 32B |
| Context window | 200K | 128K | 262K | 128K |
| Open weights | Yes (MIT) | Yes | No (closed) | Yes |
| LiveCodeBench v6 | 82.8 | Comparable | 57-75 reported range | Lower reported |
| SWE-bench Verified | 68.0 | Comparable | 69.6 | Lower reported |
| Released | September 30, 2025 | August 2025 | September 24, 2025 | July 2025 |
| Primary distribution | Z.ai API, Hugging Face | DeepSeek API, Hugging Face | Alibaba Model Studio API | Moonshot API, Hugging Face |
The most consistent praise for GLM-4.6 in third party reviews concerns its front-end coding output, which several reviewers said produces more visually polished web pages than its open weight peers, and its agentic behavior inside coding harnesses, where the mid-reasoning tool use design is described as producing fewer broken tool calls and tighter end to end traces. The most consistent criticism is that pure reasoning benchmarks like Humanity's Last Exam still leave a clear gap between GLM-4.6 and the strongest closed reasoning models, and that output speed is below average for the open weight tier [4][7][10][11].
Zhipu followed up GLM-4.6 with continued iteration on the line over the following months, including a smaller GLM-4.6-Air variant and an updated GLM-4.7 release in early 2026. GLM-4.6 nonetheless remains an important reference point as the model that first established 200K context on the open-weight Chinese frontier and that made coding parity with Claude Sonnet 4 a credible claim for an open weight model.