GLM-4.5 and GLM-4.6 are open-weights large language models developed by Zhipu AI (operating internationally as Z.ai), a Chinese AI company founded as a spinoff of Tsinghua University. Released in July and September 2025 respectively, the two models represent the most capable entries in the General Language Model (GLM) family and are designed specifically for agentic tasks, coding, and complex reasoning. Both use a Mixture of Experts architecture and are available under the MIT license, making them freely usable for commercial applications.
The models are positioned as open alternatives to proprietary systems from Anthropic, Google, and OpenAI, and at launch GLM-4.5 ranked third globally across twelve representative benchmarks, claiming first place among all open-source models at the time. GLM-4.6, released roughly two months later, extended the context window to 200,000 tokens and made substantial gains in real-world coding and agentic evaluation tasks.
Zhipu AI was founded in 2019 by Tang Jie and Li Juanzi, both professors at Tsinghua University's Knowledge Engineering Group (KEG). The company grew out of academic research into knowledge graphs and language models conducted at Tsinghua, and was registered as an independent commercial entity shortly after. Zhang Peng serves as CEO. Internationally, the company rebranded itself as Z.ai in mid-2025 around the time of GLM-4.5's release.
Tang Jie is also credited as the creator of AMiner, an academic research database platform that Z.ai acquired and now operates. The Tsinghua KEG lab had published early work on pre-trained language models that would eventually feed into the GLM architecture, including contributions to what became the autoregressive blank infilling training objective that distinguishes GLM models from the causal language modeling approach used in GPT-style models.
The company has raised approximately $1.5 billion in total funding since founding, with major investors including Alibaba, Ant Group, Tencent, Meituan, Xiaomi, and HongShan. In June 2024, Prosperity7 Ventures (a Saudi Aramco venture capital arm) led a $400 million funding round that valued the company at roughly $3 billion. A subsequent round in December 2024 raised another $411 million. As of mid-2025, the company's valuation had reached approximately 40 billion yuan ($5.6 billion). In April 2025, Zhipu pre-filed for an IPO, with plans to list in Hong Kong in 2026.
As of mid-2025, Z.ai reported over 40 million cumulative downloads of its open-source models globally.
The GLM family traces back to foundational research completed at Tsinghua before Zhipu AI's commercial launch. In late 2020, the Tsinghua KEG group developed the GLM pre-training architecture, and by 2021 they had trained a GLM-10B model. In 2022, the lab open-sourced GLM-130B, a bilingual (Chinese-English) model with 130 billion parameters trained on 400 billion tokens, which achieved performance comparable to GPT-3.
The commercial trajectory of the GLM family then proceeded as follows:
ChatGLM-6B (March 2023): The first generation of aligned dialogue models, released the same day that the GLM-130B service went live at chatglm.cn. ChatGLM-6B attracted substantially more attention than anticipated, partly because it ran on consumer hardware.
ChatGLM2-6B (June 2023): The second generation, pre-trained on more and better data. Achieved a 23% improvement on MMLU, a 571% improvement on GSM8K, and a 60% improvement on BBH compared to its predecessor. Introduced FlashAttention to extend the context window to 32,000 tokens and incorporated Multi-Query Attention for a 42% inference speed increase.
ChatGLM3-6B (October 2023): Topped 42 benchmarks across semantics, mathematics, reasoning, code, and knowledge at the time of release. Introduced function calling support, a code interpreter, and the ability to handle complex agentic tasks.
GLM-4 series (2024): The fourth generation, including GLM-4, GLM-4-Air, and GLM-4-9B. Pre-trained on ten trillion tokens, primarily in Chinese and English, with a small corpus from 24 other languages. GLM-4 and GLM-4 All Tools support a 128,000-token context, and a separate 1M-token context variant was also released. The GLM-4 series introduced several architectural improvements: removal of bias terms (except in QKV layers), RMSNorm replacing LayerNorm, Group Query Attention to reduce KV cache requirements, and extended Rotary Position Embeddings in 2D format.
GLM-4.5 / GLM-4.6 (2025): See sections below.
GLM-4.5, released on July 28, 2025, uses a Mixture of Experts transformer architecture with 355 billion total parameters and 32 billion active parameters per forward pass. The companion model GLM-4.5-Air uses 106 billion total parameters with 12 billion active.
The technical report (arXiv:2508.06471) describes the architecture under the heading "Agentic, Reasoning, and Coding (ARC) Foundation Models." Several design choices distinguish GLM-4.5 from its predecessors and from contemporaries:
MoE routing: The model uses loss-free balance routing with sigmoid gating. The design philosophy favors depth over width, stacking more layers with smaller hidden dimensions to improve reasoning capacity under an equivalent compute budget.
Attention configuration: Grouped-Query Attention (GQA) with partial Rotary Position Embeddings. GLM-4.5 has roughly 2.5 times more attention heads than standard configurations (96 heads at 5,120 hidden dimension). QK-Norm stabilizes attention logits.
Multi-Token Prediction (MTP) layers: The MoE layers also function as MTP draft layers, enabling speculative decoding and substantially faster inference. Z.ai reports generation speeds exceeding 100 tokens per second on the API.
Muon optimizer: The Muon optimizer (rather than AdamW) was used during training to enable faster convergence at scale.
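The draft-and-verify control flow behind speculative decoding, which the MTP layers enable, can be illustrated with a toy sketch. This is not GLM's implementation: both "models" below are stand-in arithmetic functions rather than neural networks, with the draft model deliberately cheaper and sometimes wrong (it wraps token values at 10 instead of 100).

```python
def draft_tokens(prefix, k):
    # Cheap draft model (stand-in): proposes the next k tokens, wrapping at 10.
    return [(prefix[-1] + 1 + i) % 10 for i in range(k)]

def target_next(prefix):
    # Expensive target model (stand-in): the token it would actually emit,
    # wrapping at 100, so it disagrees with the draft past token value 9.
    return (prefix[-1] + 1) % 100

def speculative_step(prefix, k=4):
    """Draft k tokens, keep the longest prefix the target model agrees with,
    then append one guaranteed token from the target model itself."""
    accepted = []
    for tok in draft_tokens(prefix, k):
        if target_next(prefix + accepted) == tok:
            accepted.append(tok)   # agreement: draft token accepted "for free"
        else:
            break                  # first disagreement ends the speculation
    accepted.append(target_next(prefix + accepted))
    return accepted

print(speculative_step([7], k=4))  # -> [8, 9, 10]: two drafts accepted, one rejected
```

In a real deployment, each accepted draft token saves a full forward pass of the large model, which is where the reported speedups come from.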
The model supports a 128,000-token context window. Maximum output length is 96,000 tokens.
GLM-4.5 was trained in three distinct stages:
Pre-training (23 trillion tokens): The training corpus comprised roughly 15 trillion tokens of general text (web documents, books, multilingual text) plus approximately 7 trillion tokens of code and reasoning-focused material (source code repositories, scientific papers, and mathematics). Fill-in-the-middle objectives were used for bidirectional code understanding.
Mid-training specialization: Repository-level code training extended context to 32,000 tokens with cross-file dependencies. Synthetic reasoning data generated challenging math and science problems. Agent trajectories were extended to 128,000 tokens for tool-use alignment.
Post-training via expert distillation and RL: Three specialized expert models were trained separately: a reasoning expert (single-stage reinforcement learning with a difficulty curriculum), an agentic expert (information-seeking QA and software engineering tasks), and a chat expert (safety and helpfulness alignment). These three experts were then distilled into a single unified generalist model through supervised fine-tuning followed by reinforcement learning.
The RL infrastructure is called Slime, an open-source SGLang-native post-training framework that Z.ai released alongside the model weights. Slime supports two modes: a colocated synchronous mode more effective for reasoning tasks, and a disaggregated asynchronous mode that decouples generation from training for agentic tasks, which tend to produce longer rollouts.
GLM-4.5-Air is the smaller variant of the GLM-4.5 family, with 106 billion total parameters and 12 billion active parameters. It targets deployments where hardware or cost is a constraint.
Despite its smaller size, GLM-4.5-Air performs strongly on reasoning benchmarks. On AIME 2024, it scored 89.4%, which Z.ai notes exceeds Claude 4 Opus (75.7%) on that benchmark. Z.ai also claims GLM-4.5-Air surpasses Gemini 2.5 Flash, Qwen 3-235B, and Claude 4 Opus on reasoning benchmarks in its comparison tests.
In terms of hardware requirements, GLM-4.5-Air in BF16 precision requires four H100 or two H200 GPUs for typical inference (batch size up to 8). The FP8-quantized version runs on two H100 or a single H200. With quantization support, the model can run on consumer hardware with 32 to 64 GB VRAM, making it accessible without data center infrastructure.
API pricing for GLM-4.5-Air on the Z.ai platform is $0.20 per million input tokens and $1.10 per million output tokens. Cached input pricing is $0.03 per million tokens.
Both GLM-4.5 and GLM-4.6 implement a hybrid reasoning design that lets the model switch between two operating modes within a single model, without requiring separate model weights:
Thinking mode: The model generates an explicit chain-of-thought reasoning trace before producing its final answer. This is suited for complex mathematics, multi-step logical problems, and multi-turn planning in agentic workflows.
Non-thinking mode: The model responds directly without a reasoning preamble. This is faster and appropriate for conversational exchanges, well-scoped coding tasks, and structured output generation.
The Z.ai API exposes this through a thinking.type parameter set to either enabled or disabled. By default, dynamic thinking is enabled, and the model decides which mode to use based on the apparent complexity of the request.
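A request toggling this parameter might look like the following sketch. The endpoint URL and exact payload schema here are assumptions based on the description above; Z.ai's API reference is authoritative.

```python
import json

# Assumed endpoint; check Z.ai's API documentation for the real URL.
API_URL = "https://api.z.ai/api/paas/v4/chat/completions"

def build_request(prompt: str, thinking: bool) -> dict:
    """Build a chat-completions payload; `thinking` selects the mode."""
    return {
        "model": "glm-4.6",
        "messages": [{"role": "user", "content": prompt}],
        # "enabled": the model may emit a chain-of-thought trace first;
        # "disabled": it answers directly with no reasoning preamble.
        "thinking": {"type": "enabled" if thinking else "disabled"},
    }

payload = build_request("Prove that the square root of 2 is irrational.", thinking=True)
print(json.dumps(payload, indent=2))
# An actual call would POST this to API_URL with an Authorization: Bearer header.
```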
The technical report explains the motivation: a prolonged thinking process is unnecessary in domains that demand quick responses (such as casual chat), so the training data was deliberately balanced between examples containing full reasoning traces and examples lacking explicit thought processes. The hybrid approach is presented as a middle path between fully deliberative reasoning models and pure instruction-following models.
Both GLM-4.5 and GLM-4.6 are released under the MIT license, one of the most permissive open-source licenses available. This allows unrestricted commercial use, modification, redistribution, and sublicensing without any usage threshold or attribution requirements beyond preserving the license text.
Model weights are hosted on Hugging Face under the zai-org organization and on ModelScope. Available variants include:
| Model | Parameters | Precision | Notes |
|---|---|---|---|
| GLM-4.5 | 355B-A32B | BF16 | Flagship |
| GLM-4.5-FP8 | 355B-A32B | FP8 | Reduced memory |
| GLM-4.5-Base | 355B-A32B | BF16 | No alignment/chat |
| GLM-4.5-Air | 106B-A12B | BF16 | Compact variant |
| GLM-4.5-Air-FP8 | 106B-A12B | FP8 | Reduced memory |
| GLM-4.5-Air-Base | 106B-A12B | BF16 | No alignment/chat |
| GLM-4.6 | 357B | BF16 | September 2025 update |
The models are compatible with multiple inference frameworks: Transformers (HuggingFace), vLLM, SGLang, and for fine-tuning, LlamaFactory and Swift. There are also 41 quantized versions available in GGUF format for llama.cpp, Ollama, and LM Studio.
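As an example, serving the FP8 Air variant with vLLM on two GPUs might look like the sketch below. The model identifier follows the Hugging Face listing above; the flags and parallelism settings are assumptions to be checked against the model card.

```shell
pip install vllm

# Shard the model across two GPUs and cap the context length.
vllm serve zai-org/GLM-4.5-Air-FP8 \
    --tensor-parallel-size 2 \
    --max-model-len 128000
```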
Minimum GPU requirements for GLM-4.5 (BF16, batch size up to 8) are sixteen H100s or eight H200s. The FP8 version reduces this to eight H100s or four H200s. Full 128,000-token context support requires thirty-two H100s or sixteen H200s in BF16, or sixteen H100s or eight H200s in FP8.
The following table summarizes GLM-4.5 and GLM-4.6 scores across key benchmarks, alongside selected comparison models. Scores are from the GLM-4.5 technical report (arXiv:2508.06471) and GLM-4.6 model documentation.
| Benchmark | GLM-4.5 | GLM-4.6 | Claude Sonnet 4 | DeepSeek V3 | Kimi K2 |
|---|---|---|---|---|---|
| AIME 2024 | 91.0% | N/A | N/A | N/A | N/A |
| AIME 2025 | 85.4% | 93.9% | 87.0% | N/A | N/A |
| AIME 2025 (with tools) | N/A | 98.6% | N/A | N/A | N/A |
| GPQA Diamond | 79.1% | 82.9% (w/ tools) | 83.4% | N/A | N/A |
| SWE-bench Verified | 64.2% | 68.0% | 67.8% | N/A | ~65% |
| LiveCodeBench (v6) | 63.3% | 82.8% | mid-80s | N/A | 83% |
| TAU-Bench (avg) | 70.1% | N/A | N/A | N/A | N/A |
| BFCL v3 / Tool-calling avg | 90.6% | N/A | 89.5% | N/A | N/A |
| BrowseComp | 26.4% | N/A | 18.8% (Opus 4) | N/A | N/A |
| HLE | 14.4% | 30.4% | 17.3% | N/A | N/A |
Across the 12-benchmark composite used at launch, GLM-4.5 scored 63.2 and GLM-4.5-Air scored 59.8.
GLM-4.5 is particularly strong in tool-calling, where its 90.6% success rate across TAU-Bench, BFCL v3, and BrowseComp surpassed Claude 4 Sonnet, Kimi K2, and most contemporaries at the time of release. The model also achieves 37.5% on Terminal-Bench, outperforming GPT-4.1 and Gemini 2.5 Pro.
GLM-4.6 shows the most dramatic improvements in LiveCodeBench (+19.5 percentage points over GLM-4.5) and AIME 2025 (+8.5 points). On HLE, it more than doubles GLM-4.5's score. The SWE-bench improvement is more modest (+3.8 points), and GLM-4.6 still trails Claude 4.5 Sonnet (77.2%) on that benchmark.
GLM-4.6 was released on September 30, 2025. It shares the same basic MoE architecture as GLM-4.5 but is listed with 357 billion total parameters (a small increase) and significantly expanded context handling.
Key changes from GLM-4.5 to GLM-4.6:
Context window expansion: Increased from 128,000 to 200,000 input tokens. Maximum output length is now 128,000 tokens (up from 96,000).
Real-world coding: Z.ai reports approximately 27% improvement in code generation quality versus GLM-4.5, primarily in real-world multi-turn tasks rather than isolated benchmark items. On CC-Bench, a human-evaluated multi-turn coding benchmark, GLM-4.6 reached a 48.6% win rate against Claude Sonnet 4 (near parity). Z.ai specifically cited improved performance inside Claude Code, Cline, Roo Code, and Kilo Code.
Token efficiency: GLM-4.6 uses roughly 15% fewer tokens than GLM-4.5 for equivalent task completion, and about 26% fewer than Kimi K2 on comparable tasks.
Reasoning: AIME 2025 climbs from 85.4% to 93.9% (98.6% with tools). GPQA Diamond climbs from 79.1% to 82.9% with tools. HLE more than doubles from 14.4% to 30.4%.
Alignment and writing: The update notes improvements in natural language quality, human preference alignment, and role-playing scenarios.
Maintained open weights: Released under MIT license, same as GLM-4.5, with weights on Hugging Face under zai-org/GLM-4.6.
Pricing is the same as GLM-4.5: $0.60 per million input tokens, $0.11 per million cached input tokens, and $2.20 per million output tokens.
Alongside the model releases, Z.ai launched the GLM Coding Plan, a subscription service designed to give developers access to GLM models for use with third-party AI coding agents and IDEs.
The Coding Plan is not an IDE or standalone coding assistant. It provides an API endpoint, including an Anthropic-compatible endpoint, that lets users substitute GLM models into tools like Claude Code, Cline, Roo Code, Kilo Code, and OpenCode. This means a developer can point their Cline installation at Z.ai's endpoint and use GLM-4.5 or GLM-4.6 instead of paying Anthropic's API rates.
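In practice this usually means overriding the coding agent's base URL. The sketch below shows the idea for Claude Code; the variable names follow common Anthropic-client conventions and the URL is an assumption, so the Coding Plan documentation should be treated as authoritative.

```shell
# Route Claude Code's requests to Z.ai's Anthropic-compatible endpoint.
export ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic"  # assumed URL
export ANTHROPIC_AUTH_TOKEN="<your-zai-api-key>"
claude  # Claude Code now talks to GLM models instead of Anthropic's API
```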
Pricing tiers (as of late 2025 / early 2026):
| Plan | Monthly cost | Usage quota | Models included |
|---|---|---|---|
| Lite | ~$10/month | ~120 prompts per 5-hour window | GLM-4.5-Air, GLM-4.6, GLM-4.7 |
| Pro | ~$50/month | ~600 prompts per 5-hour window | All models including GLM-5 |
| Max | Higher | ~4x Pro quota | All models |
Note: Z.ai bills these plans quarterly (every three months), not monthly. Prices drifted upward through late 2025 and early 2026; the Lite plan launched at $6/month before increasing.
Free models (GLM-4.7-Flash and GLM-4.5-Flash) are available to all registered users with no subscription required.
The Coding Plan attracted developer interest primarily because GLM-4.6 and later GLM-4.7 offered competitive performance at a fraction of the cost of comparable Anthropic or OpenAI API access. GLM-4.5-Air in particular can run at 100+ tokens per second on Z.ai's infrastructure, which matters for agentic workflows that make many sequential API calls.
Full API pricing for the GLM-4.5 family on the Z.ai platform:
| Model | Input (per 1M tokens) | Cached input | Output (per 1M tokens) |
|---|---|---|---|
| GLM-4.5 | $0.60 | $0.11 | $2.20 |
| GLM-4.5-Air | $0.20 | $0.03 | $1.10 |
| GLM-4.5-X (extended thinking) | $2.20 | $0.45 | $8.90 |
| GLM-4.5-AirX (extended thinking) | $1.10 | $0.22 | $4.50 |
| GLM-4.6 | $0.60 | $0.11 | $2.20 |
| GLM-4.5-Flash (free tier) | $0.00 | $0.00 | $0.00 |
Cached input storage carries no charge as of early 2026 (listed as a limited-time promotion).
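As a worked illustration of the table above (caching ignored), a hypothetical workload of 10 million input and 2 million output tokens per month costs:

```python
# Per-million-token prices from the Z.ai pricing table above. The
# 10M-input / 2M-output workload is an arbitrary illustration, not a
# usage figure from Z.ai.
PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "GLM-4.5":     (0.60, 2.20),
    "GLM-4.5-Air": (0.20, 1.10),
}

def monthly_cost(model: str, input_m: float, output_m: float) -> float:
    """Dollar cost for input_m / output_m million tokens."""
    in_rate, out_rate = PRICES[model]
    return in_rate * input_m + out_rate * output_m

for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 10, 2):.2f}")
# GLM-4.5:     10 * 0.60 + 2 * 2.20 = $10.40
# GLM-4.5-Air: 10 * 0.20 + 2 * 1.10 = $4.20
```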
At $0.20 per million input tokens for GLM-4.5-Air and $0.60 for the full GLM-4.5, these prices were highly competitive at launch. In comparison, Kimi K2 was priced at $0.15 per million input tokens, Qwen 3 at $0.35 to $0.60, and Claude Sonnet 4 at significantly higher rates.
GLM-4.5 and GLM-4.6 sit within a group of Chinese open-weights models released in 2025 that collectively challenged the assumption that competitive frontier performance required proprietary infrastructure. The closest comparisons are DeepSeek V3, Qwen 3, and Kimi K2.
| Feature | GLM-4.5 | DeepSeek V3 | Qwen 3 (235B) | Kimi K2 |
|---|---|---|---|---|
| Total parameters | 355B | 671B | 235B | 1T |
| Active parameters | 32B | ~37B | 22B | 32B |
| Context window | 128K | 128K | 256K (1M ext.) | 130K (256K ext.) |
| SWE-bench Verified | 64.2% | ~59% | 67% | ~65% |
| Tool-calling accuracy | 90.6% | N/A | N/A | N/A |
| License | MIT | MIT | Apache 2.0 | Modified MIT |
| Input price (per 1M) | $0.60 | ~$0.27 | $0.35–0.60 | $0.15 |
| Inference speed | 100+ tok/s | N/A | variable | ~47 tok/s |
Z.ai's primary positioning claims are efficiency and speed. GLM-4.5 achieves competitive benchmark performance with 32 billion active parameters, while DeepSeek V3 uses 37 billion active parameters and Kimi K2 also uses 32 billion but from a 1-trillion-parameter pool. The company frames this as a "Pareto frontier" efficiency advantage.
On coding tasks, the models compete closely. Kimi K2 leads on greenfield feature implementation (93% task completion in one independent evaluation), Qwen 3 excels at large-scale refactoring across very long contexts, and GLM-4.5 tends to perform best in debugging workflows and tool-dependent pipelines. For inference speed, GLM-4.5 runs at 100+ tokens per second (reportedly peaking at 200 tok/s) versus Kimi K2's approximately 47 tok/s.
The 2025 crop of Chinese open-weights models demonstrated that strong agentic and reasoning performance was no longer confined to frontier proprietary labs, though the exact practical gap between these models and Claude 4.5 Sonnet or GPT-4.5 for complex long-horizon tasks remained a point of debate.
In January 2025, the U.S. Department of Commerce added Zhipu AI and ten of its subsidiaries to the Entity List, citing the company's role in advancing the People's Republic of China's military modernization through AI development. The Federal Register notice stated that the entities were added because their activities are "contrary to the national security and foreign policy interests of the United States."
The specific subsidiaries listed include Beijing Zhipu Huazhang Technology Co., Ltd.; Beijing Lingxin Intelligent Technology Co., Ltd.; Beijing Yuanyin Intelligent Technology Co., Ltd.; Beijing Zhipu Future Technology Co., Ltd.; Beijing Zhipu Linghang Technology Co., Ltd.; Beijing Zhipu Qingyan Technology Co., Ltd.; Hangzhou Zhipu Huazhang Technology Co., Ltd.; Nanjing Zhihu Information Technology Co., Ltd.; Shanghai Zhipu Huanyu Technology Co., Ltd.; and Shenzhen Zhipu Future Technology Co., Ltd.
Zhipu AI said the designation lacked a "factual basis" and "will not have a substantial impact" on operations. The listing restricts U.S. companies from exporting certain technology to Z.ai without a license, but because the GLM models run on Zhipu's own hardware infrastructure rather than U.S. exports, ongoing operations were largely unaffected.
Z.ai's valuation roughly doubled between the entity list designation in January 2025 and the GLM-4.5 launch in July 2025, and the company continued to attract major investment from Chinese tech companies and government-linked funds despite the sanctions.
The company responded by publicly stating it is "not relying on U.S. large-model technology" and that it would continue participating in global AI competition. Several Western developers noted that using Z.ai's API or open weights raises data governance questions given the company's China-based operations, though the open-weight models themselves can be run locally without any data passing through Z.ai's servers.
The GLM-4.5 and GLM-4.6 model family targets several use cases:
Agentic software development: The high tool-calling accuracy (90.6% on the combined TAU-Bench / BFCL v3 / BrowseComp average) makes GLM-4.5 particularly suited for multi-step coding agents. The models integrate with Cline, Roo Code, Claude Code (via the Anthropic-compatible endpoint), Kilo Code, and OpenCode.
Code debugging and production tooling: Independent testing found GLM-4.5 fastest at diagnosing memory leaks and production issues in tool-integrated testing scenarios.
High-volume API pipelines: The combination of low token pricing ($0.20 per million input tokens for Air) and high generation speed (100+ tok/s) suits applications that make frequent API calls, such as code review pipelines, test generation, or document processing.
On-premise deployment: Because weights are freely downloadable under MIT, organizations can run GLM-4.5-Air internally without ongoing API costs. This is attractive for enterprises with data residency requirements or those concerned about sending code to external APIs.
Chinese-language tasks: The GLM family has historically been strong on Chinese-language benchmarks. On AlignBench, a Chinese language evaluation, GLM-4 scored 8.0 overall, outperforming GPT-4 Turbo at the time. GLM-4.5 continues this trend, though the models are fully bilingual.
Multimodal applications: A separate GLM-4.5V model (released August 2025) extends GLM-4.5's capabilities to image understanding, making it applicable to visual question answering, document analysis, and screenshot-based workflows.
The developer community response to GLM-4.5 and GLM-4.6 was broadly positive, though with notable qualifications.
GLM-4.6 reached the top ranking among open-weight models on LMArena (a community human preference leaderboard) shortly after its release in September 2025. The model received praise for "polished, human-like" code outputs that required minimal manual corrections, and for its performance on creative and hard-prompt evaluations.
The DeepLearning.AI newsletter described Z.ai as "one of the few global companies capable of building rival models at competitive prices," echoing a framing that OpenAI had applied to Zhipu in internal assessments that later became public.
Not all experiences were uniformly positive. One developer published a detailed account of switching fully to GLM-4.6 for two weeks before returning to Claude, citing GLM-4.6's weakness in 30+ hour continuous agentic sessions and limited integration with browser automation and GUI control. Occasional reports of API latency issues and reasoning overhead (particularly when using extended thinking mode for simple queries that did not require it) also appeared in developer forums.
The open-source release of Slime, the asynchronous RL training framework used to train GLM-4.5 and subsequent models, attracted favorable attention from the research community as a practical contribution beyond the model weights themselves.
Z.ai has raised approximately $1.5 billion in total funding across multiple rounds:
| Round | Date | Amount | Notable investors |
|---|---|---|---|
| Seed to Series B | 2019-2023 | — | Alibaba, Tencent, Meituan, Ant Group, Xiaomi, HongShan |
| Strategic | June 2024 | $400M | Prosperity7 Ventures (Saudi Aramco) |
| Series D | December 2024 | $411M | Multiple Chinese institutional investors |
| Shanghai government round | July 2025 | $140M | Shanghai state-linked funds |
The company filed pre-IPO documents in April 2025 with plans for a Hong Kong listing in 2026. Tang Jie (co-founder) and Liu Debing (chairman) are the key controlling shareholders.
Z.ai describes its mission as "Inspiring AGI to Benefit Humanity." The company operates zhipuai.cn (Chinese-language site) and z.ai (English-language site with developer documentation and API access).
Several limitations were documented or acknowledged around GLM-4.5 and GLM-4.6:
Long-horizon agentic performance: Claude 4.5 Sonnet maintains a clear advantage in tasks requiring 30+ hours of continuous agent execution; Z.ai acknowledged that GLM-4.6 "still lags Claude Sonnet 4.5 on coding tasks" in complex, long-horizon contexts.
GUI and computer control: Both models have limited native support for browser automation and desktop GUI control, which constrains their use in web scraping and OS-level automation agents.
SWE-bench ceiling: While GLM-4.6's SWE-bench Verified score (68.0%) is competitive, it remains below Claude 4.5 Sonnet's 77.2% on that benchmark.
Reasoning overhead: The thinking mode can produce 50 seconds of reasoning trace for queries that do not require deliberate reasoning, increasing latency and token cost.
Infrastructure requirements: Running the full GLM-4.5 locally requires a minimum of eight H200 GPUs or sixteen H100 GPUs, placing self-hosted deployment out of reach for most organizations. GLM-4.5-Air is more accessible but still requires enterprise-grade hardware in its full-precision form.
Geopolitical considerations: Z.ai's Entity List designation and China-based operations raise data governance questions for some enterprise users. The open-weight models mitigate this concern for on-premise deployments, but the hosted API routes all traffic through Z.ai's infrastructure.
Ecosystem maturity: Third-party integrations, documentation quality, and community tooling lag behind OpenAI and Anthropic ecosystems, though the gap narrowed significantly through 2025.