GLM-5.1
Last reviewed
Jun 2, 2026
Sources
18 citations
Review status
Source-backed
Revision
v1 · 2,149 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 2, 2026
Sources
18 citations
Review status
Source-backed
Revision
v1 · 2,149 words
Add missing citations, update stale details, or suggest a clearer explanation.
GLM-5.1 is an open-weight large language model developed by the Chinese AI company Zhipu AI, which markets its products internationally under the brand Z.ai. Released on April 7, 2026, with weights published the following day, it is the company's flagship model for agentic engineering and the direct successor to GLM-5.[1][2][3] GLM-5.1 is a mixture-of-experts (MoE) model with roughly 754 billion total parameters and about 40 billion active per token, distributed on Hugging Face under the permissive MIT License.[2][4] Its headline feature is long-horizon autonomy: Z.ai says the model can work continuously on a single coding task for up to eight hours, refining its own strategy across hundreds of iterations rather than answering in one shot.[1][5]
GLM-5.1 is a text-in, text-out model positioned squarely at software engineering and tool-driven agent work.[1][6] Z.ai describes it as a refinement of the GLM-5 series tuned for "long-horizon" execution, the ability to keep making forward progress on a hard problem long after a typical model has, in the company's words, run out of ideas.[5] Where earlier releases were built for minute-scale interactions, GLM-5.1 is designed to plan, run experiments, read the results, find the blockers, and try again over many rounds without a human stepping in.[1][5]
The model reaches state-of-the-art results among open models on several agentic and coding benchmarks, and Z.ai claims it edges past closed frontier systems including GPT-5.1-Codex-Max's lineage, Claude Opus 4.6, and Gemini 3 Pro on the SWE-Bench Pro software-engineering benchmark.[1][3][7] It ships with a 200,000-token context window and supports up to 128,000 output tokens in a single response.[6][8]
Zhipu AI is a Beijing-based AI company that spun out of Tsinghua University and develops the GLM (General Language Model) family.[3][9] The company uses the consumer and developer brand Z.ai for its international platform and API, while operating the BigModel.cn platform in China.[3][7] Zhipu has been on the United States Entity List since January 2025, which restricts its access to American-made AI accelerators, a constraint that shaped how the GLM-5 series was trained.[10][9]
The GLM line had already moved through several generations before GLM-5.1. The wiki documents GLM-4.5 and GLM-4.6 from the 4.x series, the multimodal GLM-4-Voice, and the GLM-5 base model. GLM-5.1 is not a separate architecture from GLM-5 so much as the next iteration of it: a model in the same MoE family, retrained and post-trained to push much harder on coding, tool use, and sustained autonomous execution.[1][5][2]
Z.ai announced GLM-5.1 on April 7, 2026, and open-sourced the weights on Hugging Face on April 8, 2026, alongside an API rollout.[1][2][11] Regional and English-language press carried the launch over April 7 through April 9, 2026, with most outlets dating the announcement to April 7 or 8.[3][7][12] The release was framed by the company as its "most intelligent flagship model to date" and, in Z.ai's own marketing, the strongest open-source model then available.[12][7]
The launch coincided with a price increase for Zhipu's cloud API. The company raised pricing on model calls by roughly 8 to 17 percent relative to its earlier GLM-5 Turbo offering, a move several outlets noted ran against the broader industry trend of falling token prices.[11][12] Coverage tied the pricing and release to Zhipu's business position: the company had reported a widening loss for 2025 and had raised prices repeatedly through early 2026, and its shares moved sharply on launch-week disclosures.[11]
At launch GLM-5.1 was distributed as a single flagship model rather than a family of differently sized checkpoints. The base weights are published in BF16, and a large ecosystem of quantized builds (for runtimes such as llama.cpp, Ollama, and LM Studio) appeared on Hugging Face shortly after release.[2][4]
| Variant | Description | Source |
|---|---|---|
| GLM-5.1 (base weights) | Full BF16 weights, ~754B total parameters, MIT-licensed | [2][4] |
| Quantized community builds | Numerous GGUF and other quantizations for local runtimes (llama.cpp, Ollama, LM Studio, Jan) | [2][4] |
API model glm-5.1 | Hosted model served via the Z.ai / BigModel API and third-party clouds | [6][7] |
Z.ai also exposed the model through its GLM Coding Plan tiers (Lite, Pro, and Max) and made it compatible with agentic coding clients including Claude Code, Cline, and other OpenAI-compatible tools.[1][13]
GLM-5.1 uses the architecture Z.ai labels glm_moe_dsa, a mixture-of-experts design combined with a sparse-attention mechanism the company derived from research on efficient long-context attention.[2][8] Independent technical summaries and the published model configuration describe a network of roughly 754 billion total parameters with about 40 billion active per token, built from 256 routed experts plus one shared expert (with on the order of eight routed experts plus the shared expert active per token) spread across roughly 78 layers.[8][14] The attention stack mixes a linear-attention component with conventional attention, which is what keeps the 200,000-token context window tractable.[8][14] These configuration-level details come from the model card and third-party analyses rather than a Z.ai datasheet, so the exact expert-routing numbers should be read as the disclosed configuration rather than a fully audited specification.[8][14]
The GLM-5 series is notable for the hardware it ran on. Z.ai disclosed that the GLM-5 line, including GLM-5.1, was trained without Nvidia accelerators, using a large cluster of Huawei Ascend 910-series chips together with Huawei's MindSpore framework.[10][15] Reporting on the cluster put it among the largest non-Nvidia training deployments disclosed to date, which several outlets framed as evidence that competitive frontier models can be built outside the Nvidia ecosystem under export-control pressure.[10][15] For context, the GLM-5 base model scaled the family from 355 billion parameters (32 billion active) up to 744 billion (40 billion active) and was trained on about 28.5 trillion tokens of pre-training data, integrating a sparse-attention mechanism for the first time.[16] GLM-5.1 carries that lineage forward with a post-training and reinforcement learning recipe aimed at long-horizon agency, using what Z.ai describes as asynchronous RL infrastructure.[1][5]
The defining capability of GLM-5.1 is sustained autonomous execution. Rather than producing a one-shot answer, the model runs an experiment, analyzes, and optimizes loop, repeatedly reviewing its own strategy, recognizing dead ends, and trying new approaches.[1][5] Z.ai reports demonstrations in which the model worked for more than eight hours on a single task, including building a complete Linux desktop environment from scratch through hundreds of autonomous plan-execute-analyze-optimize iterations in one uninterrupted session.[1][5] In other demonstrations it iterated on performance tuning over hundreds of rounds and thousands of tool calls, for example pushing a CUDA kernel from a roughly 2.6x to a far larger speedup and improving a vector-database workload through hundreds of iterations.[5][6]
On coding and agentic AI workflows, GLM-5.1 is optimized for clients such as Claude Code and supports function calling, structured tool use, and OpenAI-SDK-compatible access.[6][7] Beyond software engineering, the model posts strong scores on general reasoning, mathematics, and web-browsing agent benchmarks, indicating it is a broadly capable system rather than a coding-only specialist.[2][4]
The figures below are drawn from Z.ai's published model card and release materials together with independent coverage. Scores are self-reported by Z.ai unless otherwise noted, and benchmark results for any model should be read with that caveat.[2][4][1]
| Benchmark | GLM-5.1 | Notes | Source |
|---|---|---|---|
| SWE-Bench Pro | 58.4 | Reported as #1 globally at release, ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (57.3) | [1][3][7] |
| SWE-bench Verified | 77.8 | Competitive with leading closed models, a few points behind Claude Opus 4.6 | [3][16] |
| Terminal-Bench 2.0 | 63.5 (69.0 with Claude Code scaffolding) | Real-world terminal tasks | [2][4] |
| NL2Repo | 42.7 | Repository generation | [2][4] |
| CyberGym | 68.7 | Up sharply from GLM-5's 48.3 | [2][4] |
| BrowseComp | 68.0 | Web-browsing agent benchmark | [2][4] |
| GPQA Diamond | 86.2 | Graduate-level science questions | [2][4][5] |
| AIME 2026 | 95.3 | Competition mathematics | [2][4] |
| τ³-Bench | 70.6 | Tool-use agent benchmark | [4] |
Z.ai summarized the coding results by saying that across the combined average of SWE-Bench Pro, Terminal-Bench, and NL2Repo, GLM-5.1 ranked third globally, first among Chinese models, and first among open-source models.[12][2] Reviewers measuring it against Claude Opus 4.6 reported coding performance in the low-to-mid-90s percent range relative to that model, with Z.ai's CEO stating that coding capability "approaches" Claude Opus 4.6.[11][17]
| Aspect | Detail | Source |
|---|---|---|
| License | MIT License (commercial use allowed, no royalty) | [2][4] |
| Weights | Published on Hugging Face at zai-org/GLM-5.1 (BF16) | [2] |
| Context window | 200,000 tokens | [6][8] |
| Max output | 128,000 tokens | [6][8] |
| API access | Z.ai (api.z.ai) and BigModel.cn; OpenAI-SDK compatible | [6][7] |
| API model ID | glm-5.1 | [6] |
| Coding plans | GLM Coding Plan Lite / Pro / Max tiers | [1][13] |
| Client compatibility | Claude Code, Cline, OpenClaw, and OpenAI-compatible coding tools | [1][7] |
| Third-party hosting | Offered through clouds such as Together AI and others | [18] |
The permissive MIT license, rather than a custom community license, was repeatedly highlighted as a draw: it allows download, modification, fine-tuning, and commercial deployment without usage restrictions or fees, subject to the practical reality that self-hosting a 754B model requires substantial hardware.[2][17]
Coverage of GLM-5.1 was broadly positive about its open-weight, low-cost challenge to closed frontier labs. Outlets framed it as the first open-source model to lead the SWE-Bench Pro leaderboard, and several described it as the strongest open-weight model available at the time of release.[1][3][7] Technical writers emphasized two themes in particular: the long-horizon autonomy, which they treated as a meaningful step beyond minute-scale chat interactions, and the training story, since the model reached frontier-adjacent results on Huawei Ascend hardware with no Nvidia chips, an outcome read as significant given U.S. export controls.[5][10][15]
Constellation Research and others noted that GLM-5.1 topped public leaderboards on Hugging Face and chat arenas around launch, while some reviewers tempered the enthusiasm by stressing that the most striking claims came from Z.ai's own benchmarks and demonstrations.[3][1] The simultaneous price increase drew comment as an unusual move in a market trending toward cheaper tokens, and was generally read against Zhipu's financial position rather than as a sign of weakness in the model.[11][12]
Several limitations and caveats apply. Many of the most dramatic capability claims, the eight-hour autonomous runs and the leaderboard-topping benchmark scores, are self-reported by Z.ai or shown in company demonstrations, and independent reproduction of the long-horizon results at the scale Z.ai describes was limited at launch.[1][5] Detailed architecture figures such as the exact expert-routing configuration come from the model card and third-party analyses rather than a comprehensive Z.ai technical datasheet, so specific internal numbers should be treated as the disclosed configuration rather than independently audited facts.[8][14]
Practical access is constrained by scale: at roughly 754 billion parameters the full model is out of reach for typical local hardware, so most users rely on the hosted API or quantized community builds, and the MIT license does not change those hardware requirements.[2][17] The model is text-only, without the image or audio input found in some rival systems, and its knowledge and benchmark coverage reflect an early-2026 snapshot.[6][2] Finally, as with any frontier coding agent, sustained autonomous operation raises the usual concerns about unsupervised tool use, which the long-horizon framing makes more salient rather than less.[1][5]