GLM-130B
Last reviewed
Jun 3, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,875 words
GLM-130B is a 130-billion-parameter bilingual (English and Chinese) large language model released in August 2022 by the Knowledge Engineering Group (KEG) and the Data Mining research group at Tsinghua University, together with the startup Zhipu AI.[1][2] Its stated goal was to open-source a model at the 100-billion-parameter scale that was at least as capable as GPT-3 (the 175B davinci model) and to document, in detail, how a dense model of that size can actually be trained to convergence.[1] The accompanying paper, "GLM-130B: An Open Bilingual Pre-trained Model" (arXiv:2210.02414), was published as a conference paper at ICLR 2023.[1][3] GLM-130B became the technical foundation for Zhipu's later ChatGLM and GLM-4 model lines.[4]
The project was notable on three counts: it released weights and training code for a 100B-scale model at a time when most models of that size were closed; it described concrete techniques for keeping such a large model from diverging during training; and it shipped INT4 quantization that let the full model run on comparatively modest consumer-grade GPUs.[1][2]
GLM-130B is a dense (not mixture-of-experts) transformer with 130 billion parameters. It was trained on more than 400 billion text tokens, split roughly evenly between English and Chinese (about 200 billion tokens each).[2][5] The English corpus drew on the Pile and the Chinese corpus on Wikipedia and large web crawls (WudaoCorpora and similar collections).[1] The model uses a bilingual tokenizer with a vocabulary of 150,000 tokens and a maximum sequence length of 2,048 tokens.[5]
The lead authors of the paper are Aohan Zeng, Xiao Liu, Zhengxiao Du, and Zihan Wang, with a senior author list that includes Yuxiao Dong and Jie Tang, who leads the KEG group at Tsinghua and co-founded Zhipu AI.[1] The same group had earlier built the GLM-10B and CogView models, and GLM-130B was the largest model they had attempted by an order of magnitude.
Rather than the left-to-right next-token prediction used by GPT-style decoders or the masked-token prediction used by BERT-style encoders, GLM-130B is pretrained with autoregressive blank infilling, the objective introduced in the earlier GLM paper, "GLM: General Language Model Pretraining with Autoregressive Blank Infilling" (arXiv:2103.10360), presented at ACL 2022.[6] In this scheme, spans of text are masked out and the model learns to regenerate them autoregressively, so a single model learns both a bidirectional understanding of the surrounding context (Part A) and a unidirectional generation of the masked spans (Part B). The original GLM paper reported that this unified objective outperformed BERT, T5, and GPT at comparable model sizes and data budgets on a range of natural language understanding and generation tasks.[6]
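The attention pattern implied by this objective can be sketched in a few lines of Python. This is an illustrative sketch rather than the project's code: the function name `blank_infilling_mask` and the toy lengths are assumptions, and Part B is treated as a single masked span.

```python
def blank_infilling_mask(len_a: int, len_b: int) -> list[list[int]]:
    """Return mask[i][j] = 1 if position i may attend to position j.

    Positions 0..len_a-1 are Part A (the corrupted context);
    positions len_a.. are Part B (the span being regenerated).
    """
    total = len_a + len_b
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < len_a:
                # Every position sees all of Part A (bidirectional context).
                mask[i][j] = 1
            elif i >= len_a and j <= i:
                # Part B positions see themselves and earlier Part B positions.
                mask[i][j] = 1
    return mask


mask = blank_infilling_mask(len_a=3, len_b=2)
for row in mask:
    print(row)
# [1, 1, 1, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 1, 0]
# [1, 1, 1, 1, 1]
```

Part A rows never attend to Part B, so the context is encoded without seeing the answer, while Part B rows form the familiar causal triangle on top of full access to Part A.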
GLM-130B uses two distinct mask tokens to control behaviour. A short [MASK] is used for blanks within a sentence, while [gMASK] marks a long blank at the end of a sequence, which is what drives open-ended generation.[5] Pretraining was divided into two parts: roughly 95% self-supervised blank infilling on the bilingual corpora, plus about 5% multi-task instruction pretraining on a mix of supervised datasets to improve downstream and zero-shot behaviour.[5]
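The difference between the two mask tokens can be illustrated with a toy example. The sketch below is an assumption-laden simplification: the helper `infill_example` and the `<sop>`/`<eop>` marker names are hypothetical placeholders, it works on whitespace-split words rather than real tokenizer output, and it masks only one span per sequence.

```python
SOP, EOP = "<sop>", "<eop>"  # hypothetical start/end-of-piece markers


def infill_example(tokens, start, end, mask_token):
    """Split tokens into a corrupted context (Part A) and the span to regenerate (Part B)."""
    part_a = tokens[:start] + [mask_token] + tokens[end:]
    part_b_input = [SOP] + tokens[start:end]    # what the decoder side is fed
    part_b_target = tokens[start:end] + [EOP]   # what it must predict, shifted by one
    return part_a, part_b_input, part_b_target


words = "the model was trained on bilingual text".split()

# Short blank inside the sentence: uses [MASK].
short_blank = infill_example(words, 2, 4, "[MASK]")
# Long blank running to the end of the sequence: uses [gMASK].
long_blank = infill_example(words, 3, len(words), "[gMASK]")

print(short_blank[0])  # ['the', 'model', '[MASK]', 'on', 'bilingual', 'text']
print(long_blank[0])   # ['the', 'model', 'was', '[gMASK]']
```

In the `[gMASK]` case Part A is simply a prefix, which is why that token behaves like a prompt for open-ended generation.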
GLM-130B is a 70-layer transformer with a hidden dimension of 12,288.[5][7] It departs from a vanilla transformer in several ways. Positional information is supplied by rotary positional embeddings (RoPE) rather than learned absolute positions.[5] The feed-forward blocks use a gated linear unit with GeLU activation (GeGLU) in place of the standard two-layer feed-forward network.[5] Layer normalization follows the DeepNorm formulation, applied in a Post-LayerNorm arrangement, which the authors found essential for stability (see below).[7]
| Property | Value |
|---|---|
| Parameters | 130 billion |
| Layers | 70 |
| Hidden dimension | 12,288 |
| Maximum sequence length | 2,048 tokens |
| Vocabulary | 150,000 (bilingual) |
| Position encoding | Rotary (RoPE) |
| Activation | GeGLU |
| Normalization | DeepNorm (Post-LN) |
| Training tokens | ~400 billion (~200B EN, ~200B ZH) |
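As a rough sanity check, the figures in the table approximately reproduce the headline parameter count. The sketch below is back-of-the-envelope arithmetic, not an official breakdown: the feed-forward inner width `FFN_INNER` is an assumption (8/3 of the hidden size, a common choice for gated feed-forward blocks), and biases, normalization parameters, and any vocabulary padding are ignored.

```python
LAYERS = 70
HIDDEN = 12_288
VOCAB = 150_000
FFN_INNER = 32_768  # assumption: 8/3 * HIDDEN, not stated in the table

attention = 4 * HIDDEN * HIDDEN   # query, key, value and output projections
ffn = 3 * HIDDEN * FFN_INNER      # gate, up and down projections of a GeGLU block
embedding = VOCAB * HIDDEN        # token embedding matrix

total = LAYERS * (attention + ffn) + embedding
print(f"{total / 1e9:.1f}B parameters")  # 128.7B parameters
```

Under these assumptions the estimate lands within a few percent of 130 billion, with almost all of the weights sitting in the 70 transformer layers rather than in the embedding table.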
GLM-130B was trained between May 6 and July 3, 2022, on a cluster of 96 DGX-A100 nodes, each with eight 40 GB A100 GPUs, for a total of 768 GPUs.[7] The team used a 3D parallelism strategy that combined data parallelism, 4-way tensor (model) parallelism, and 8-way pipeline parallelism, implemented on top of Megatron-LM and DeepSpeed.[7] Activation re-materialization (recomputing forward activations during the backward pass instead of storing them) was needed to fit the model in memory. The paper reports about 43.3% hardware FLOPs utilization, which counts that recomputation, and 32.5% model FLOPs utilization, which does not.[7]
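The cluster figures can be cross-checked with simple arithmetic. The sketch below only uses numbers quoted in this article (including the final batch size of 4,224); the variable names are illustrative, and the 4/3 interpretation of the utilization ratio is a common rule of thumb rather than a claim from the paper.

```python
nodes, gpus_per_node = 96, 8
tensor_parallel, pipeline_parallel = 4, 8
final_batch_size = 4_224

total_gpus = nodes * gpus_per_node                       # 768
gpus_per_replica = tensor_parallel * pipeline_parallel   # 32 GPUs hold one model copy
data_parallel = total_gpus // gpus_per_replica           # 24 model replicas
batch_per_replica = final_batch_size // data_parallel    # 176 sequences per replica

# Hardware vs model FLOPs utilization: if the forward pass (roughly 1/3 of
# forward + backward cost) is run twice, the expected ratio is about 4/3.
hfu_over_mfu = 43.3 / 32.5

print(total_gpus, data_parallel, batch_per_replica)  # 768 24 176
print(round(hfu_over_mfu, 2))                        # 1.33
```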
The optimizer was AdamW with beta values of 0.9 and 0.95 and weight decay of 0.1. The learning rate warmed up to a peak of 8 x 10^-5 and then decayed on a cosine schedule, while the batch size was warmed from 192 up to 4,224 over the early part of training.[7] Gradient clipping was set to 1.0 and dropout to 0.1.[7]
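A warmup-then-cosine schedule of this kind can be sketched as follows. Only the peak value of 8 x 10^-5 comes from the text above; the linear warmup shape, the step counts, and the floor `min_lr` are assumptions chosen for illustration.

```python
import math

PEAK_LR = 8e-5  # peak learning rate quoted above


def learning_rate(step, total_steps, warmup_steps, min_lr):
    """Linear warmup to PEAK_LR, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1 + math.cos(math.pi * progress))


# Hypothetical run: 1,000 steps, 100 warmup steps, floor at one tenth of the peak.
schedule = [learning_rate(s, 1_000, 100, 8e-6) for s in range(1_001)]
print(max(schedule), schedule[-1])
```

The same shape could be used for the batch-size ramp from 192 to 4,224, with a monotone increase in place of the decay.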
A large part of the GLM-130B paper is given over to a candid account of how unstable a 100B-scale model can be. The authors report that training "faces frequent loss spikes" that tended to become more common as training progressed, and that several training runs diverged outright before they found a stable recipe.[7] Two specific interventions were central.
The first is embedding gradient shrink (EGS). The team observed that gradients on the word-embedding layer were an early warning sign of divergence, spiking before the loss did. They damped those gradients with the operation word_embedding = word_embedding * alpha + word_embedding.detach() * (1 - alpha), which leaves the forward value unchanged but scales the gradient reaching the embedding layer by alpha. According to the paper, "setting alpha = 0.1 wipes out most spikes we would have met, with negligible latency."[7]
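The effect of the EGS expression can be checked numerically without a deep-learning framework. In the sketch below, which is a scalar toy and not the training code, the argument `w_detached` stands in for `word_embedding.detach()`: it carries the same value but is held constant when the derivative is taken. The function name and the finite-difference check are assumptions made for illustration.

```python
ALPHA = 0.1


def shrunk_embedding(w, w_detached, alpha=ALPHA):
    # w_detached plays the role of word_embedding.detach(): same value,
    # but no gradient flows through it.
    return w * alpha + w_detached * (1 - alpha)


w = 2.0

# Forward pass: both copies hold the same value, so the output equals w.
value = shrunk_embedding(w, w_detached=w)

# Backward pass, approximated by a central difference that perturbs only
# the differentiable copy and keeps the detached copy fixed.
eps = 1e-6
grad = (shrunk_embedding(w + eps, w) - shrunk_embedding(w - eps, w)) / (2 * eps)

print(round(value, 6), round(grad, 6))  # 2.0 0.1
```

The output is unchanged while the gradient is one tenth of what it would otherwise be, which is the damping the authors describe.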
The second is DeepNorm, a Post-LayerNorm variant that up-scales the residual (skip) input before normalization: DeepNorm(x) = LayerNorm(alpha * x + Network(x)), with alpha = (2N)^(1/2) for an N-layer network. The authors tested Pre-LN, plain Post-LN, and Sandwich-LN and reported that all three were "incapable of stabilizing" their test runs, whereas DeepNorm-based Post-LN trained stably.[7] For mixed precision they followed an Apex O2 style setup, keeping forward and backward passes in FP16 while holding optimizer states and master weights in FP32.[7] These engineering lessons, rather than any single benchmark win, were the part of the paper most often cited by other groups training large models.
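For a sense of scale, the DeepNorm factor for a 70-layer network is close to 12. The sketch below applies a DeepNorm-style residual step to plain Python lists. It is a simplified illustration under stated assumptions: the layer normalization has no learnable gain or bias, the sublayer output is a made-up vector, and the accompanying weight-initialization scaling used by DeepNorm is omitted.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def deepnorm_residual(x, sublayer_out, num_layers):
    """Post-LN residual step with the skip input scaled by (2N)^(1/2)."""
    alpha = math.sqrt(2 * num_layers)
    return layer_norm([alpha * xi + fi for xi, fi in zip(x, sublayer_out)])


alpha_glm = math.sqrt(2 * 70)
print(round(alpha_glm, 2))  # 11.83

x = [0.5, -1.0, 2.0, 0.0]
sublayer_out = [3.0, 3.0, -3.0, 1.0]  # hypothetical attention or FFN output
print(deepnorm_residual(x, sublayer_out, num_layers=70))
```

Because the skip input is multiplied by roughly 12, a single sublayer's output has a comparatively small effect on the normalized result.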
GLM-130B is frequently cited as the first 100B-scale model to reach INT4 quantization without post-training (without a separate calibration or fine-tuning step) and with almost no loss in quality.[1][2] The authors attribute this to a scaling property they observed in GLM-130B's weight distributions that made the model unusually tolerant of aggressive quantization.[1] In practice the model weights are quantized to 4-bit integers while activations and matrix multiplications are computed at higher precision.
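Weight-only quantization of this kind can be sketched with a simple symmetric scheme. The example below is an assumption for illustration, not the project's quantization kernel: it uses absmax scaling with one scale per row, maps weights to integers in [-7, 7], and dequantizes back to floats before any matrix multiplication would take place.

```python
def quantize_row_int4(row):
    """Symmetric absmax quantization of one weight row to integers in [-7, 7]."""
    scale = max(abs(w) for w in row) / 7 or 1.0  # fall back to 1.0 for an all-zero row
    q = [max(-7, min(7, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate float weights for use in higher-precision matmuls."""
    return [v * scale for v in q]


row = [0.70, -0.33, 0.12, 0.0]
q, scale = quantize_row_int4(row)
restored = dequantize_row(q, scale)

print(q)  # [7, -3, 1, 0]
print([round(v, 2) for v in restored])  # [0.7, -0.3, 0.1, 0.0]
```

No calibration data or retraining is involved: each row is quantized from its own values alone, and the rounding error per weight is at most half of that row's scale.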
The practical payoff is hardware accessibility. At full FP16 precision the model needs roughly 8 A100 (40 GB) or 8 V100 (32 GB) GPUs to run inference.[2] Quantization brings that requirement down substantially:
| Configuration | GPUs required |
|---|---|
| Full precision (FP16) | 8 x A100 40 GB, or 8 x V100 32 GB |
| INT8 | 8 x RTX 3090 (24 GB) |
| INT4 | 4 x RTX 3090 (24 GB), or 8 x RTX 2080 Ti (11 GB) |
The INT4 path, running on four RTX 3090 cards or eight RTX 2080 Ti cards, put a 100B-scale model within reach of a single workstation or a small server, which was unusual for a model of this size in 2022.[1][2]
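The hardware figures follow from the storage cost of the weights alone. The sketch below is rough arithmetic under stated assumptions: it counts only weight storage for 130 billion parameters, treats 1 GB as 10^9 bytes, and ignores activations, attention caches, and quantization scales, so real headroom is smaller than shown.

```python
PARAMS = 130e9
BYTES_PER_WEIGHT = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_gb(precision):
    """Weight storage in GB (10^9 bytes), ignoring activations and caches."""
    return PARAMS * BYTES_PER_WEIGHT[precision] / 1e9


configs = [
    ("FP16", 8, 40),  # 8 x A100 40 GB
    ("INT8", 8, 24),  # 8 x RTX 3090
    ("INT4", 4, 24),  # 4 x RTX 3090
    ("INT4", 8, 11),  # 8 x RTX 2080 Ti
]

for precision, count, gb_each in configs:
    need, have = weight_gb(precision), count * gb_each
    print(f"{precision}: weights {need:.0f} GB, GPUs {have} GB, fits={need <= have}")
```

At 4 bits per weight the model needs about 65 GB, which is why it fits in the 96 GB offered by four RTX 3090 cards or the 88 GB offered by eight RTX 2080 Ti cards.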
On English benchmarks, GLM-130B reported results competitive with or ahead of GPT-3 175B. On the LAMBADA last-word prediction task it scored 80.2% zero-shot accuracy, against 76.2% for GPT-3 175B (davinci).[8] On the MMLU multitask benchmark it reached 44.8% in the 5-shot setting after seeing 400 billion tokens.[8] The paper frames its English results relative to three contemporaneous 175B-class models (GPT-3 175B, OPT-175B, and BLOOM-176B), noting that GLM-130B outperformed GPT-3 175B on a range of benchmarks while OPT-175B and BLOOM-176B did not show a comparable advantage over GPT-3.[1]
| Benchmark (setting) | GLM-130B | Reported comparison |
|---|---|---|
| LAMBADA (zero-shot) | 80.2% | GPT-3 175B 76.2%; better than GPT-3 (+5.0%), OPT-175B (+6.5%), BLOOM-176B (+13.0%) |
| MMLU (5-shot) | 44.8% | Better than GPT-3 175B (+0.9%); ahead of BLOOM-176B by a wide margin |
| BIG-bench (zero-shot) | N/A | Reported to outperform GPT-3 175B on BIG-bench-lite |
The comparison percentages above are the relative deltas stated by the authors on the project's pages and in the paper.[2][8] Because OPT-175B and BLOOM-176B were the only other openly released models at this scale, GLM-130B's published Pile and LAMBADA numbers became a common reference point for the open community.
On Chinese benchmarks, where the most relevant baseline was ERNIE TITAN 3.0 260B (the largest Chinese language model at the time), GLM-130B reported large gains: about +24.26% over ERNIE TITAN 3.0 across seven zero-shot CLUE datasets and about +12.75% across five zero-shot FewCLUE datasets.[2] As with all self-reported benchmark figures, these numbers reflect the authors' own evaluation setup and should be read with that caveat.
The GLM-130B source code is released under the Apache 2.0 license.[2] The model weights are governed by a separate model license (the "GLM-130B License") and were distributed on application: users requested access through a form, after which they could download the checkpoints.[2] This split, a permissive open-source license on the code plus a separate gated license on the weights, was common for large models released in 2022. The repository, originally under the THUDM organization on GitHub and now hosted under zai-org, includes the inference code, the quantization toolkit, and training logs.[2]
GLM-130B's most lasting impact came through the models that descended from it. In March 2023 the group released ChatGLM-6B, a 6.2-billion-parameter conversational model that could run locally on a consumer GPU under INT4 quantization, alongside an aligned 130B-scale chat model on the chatglm.cn service.[4][9] ChatGLM-6B was downloaded heavily and helped seed a wider ecosystem of Chinese open models. The lineage continued through the ChatGLM2 and ChatGLM3 generations and into the GLM-4 family, which Zhipu AI documents in its 2024 paper "ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools" (arXiv:2406.12793).[4]
Beyond the specific models, GLM-130B is remembered for the transparency of its training write-up. By publishing the loss curves, the failed configurations, and the fixes (embedding gradient shrink and DeepNorm in particular), the authors gave other teams a rare, detailed look at what training a 100B-scale dense model actually involves, at a moment when very few such accounts existed in public.[1][7]