GLM-130B

Updated Jun 28, 2026

GLM-130B is a 130-billion-parameter bilingual (English and Chinese) large language model released in August 2022 by the Knowledge Engineering Group (KEG) and the Data Mining research group at Tsinghua University, together with the startup Zhipu AI.^[1]^[2] It was, in the authors' own words, "an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained."^[1] GLM-130B uses the General Language Model (GLM) autoregressive blank-infilling architecture and was trained on more than 400 billion bilingual tokens. It reached 80.2% zero-shot accuracy on LAMBADA, against 76.2% for GPT-3 175B, and was the first 100B-scale model to support INT4 quantization with almost no loss in quality, which lets the full model run on as few as four consumer RTX 3090 GPUs.^[1]^[2]^[8]

The accompanying paper, "GLM-130B: An Open Bilingual Pre-trained Model" (arXiv:2210.02414), was published as a conference paper at ICLR 2023.^[1]^[3] GLM-130B became the technical foundation for Zhipu's later ChatGLM and GLM-4 model lines, a lineage the authors trace explicitly in their 2024 follow-up paper.^[4]

What is GLM-130B?

GLM-130B is a dense (not mixture-of-experts) transformer with 130 billion parameters. It was trained on more than 400 billion text tokens, split roughly evenly between English and Chinese (about 200 billion tokens each).^[2]^[5] The English corpus drew on the Pile and the Chinese corpus on Wikipedia and large web crawls (WudaoCorpora and similar collections).^[1] The model uses a bilingual tokenizer with a vocabulary of 150,000 tokens and a maximum sequence length of 2,048 tokens.^[5]

The project was notable on three counts: it released weights and training code for a 100B-scale model at a time when most models of that size were closed; it described concrete techniques for keeping such a large model from diverging during training; and it shipped INT4 quantization that let the full model run on comparatively modest consumer-grade GPUs.^[1]^[2]

The lead authors of the paper are Aohan Zeng, Xiao Liu, Zhengxiao Du, and Zihan Wang, with a senior author list that includes Yuxiao Dong and Jie Tang, who leads the KEG group at Tsinghua and co-founded Zhipu AI.^[1] The same group had earlier built the GLM-10B and CogView models, and GLM-130B was the largest model they had attempted by an order of magnitude.

What is the GLM architecture?

Rather than the left-to-right next-token prediction used by GPT-style decoders or the masked-token prediction used by BERT-style encoders, GLM-130B is pretrained with autoregressive blank infilling, the objective introduced in the earlier GLM paper, "GLM: General Language Model Pretraining with Autoregressive Blank Infilling" (arXiv:2103.10360), presented at ACL 2022.^[6] In this scheme, spans of text are masked out and the model learns to regenerate them autoregressively, so a single model learns both a bidirectional understanding of the surrounding context (Part A) and a unidirectional generation of the masked spans (Part B). The original GLM paper reported that this unified objective outperformed BERT, T5, and GPT at comparable model sizes and data budgets on a range of natural language understanding and generation tasks.^[6]
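The split between bidirectional context and unidirectional generation can be pictured as an attention mask. The following is a minimal sketch, not the project's code: the function name and the 0/1 list-of-lists representation are illustrative assumptions. It builds a mask in which every position sees all of Part A, while Part B positions additionally see only earlier Part B positions and themselves.

```python
def glm_attention_mask(len_a: int, len_b: int) -> list[list[int]]:
    """Return mask[i][j] == 1 when position i may attend to position j.

    Positions 0..len_a-1 are Part A (context); the rest are Part B (masked spans).
    """
    total = len_a + len_b
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < len_a:
                # Every position, in Part A or Part B, sees all of Part A.
                mask[i][j] = 1
            elif i >= len_a and j <= i:
                # Part B positions see earlier Part B positions and themselves.
                mask[i][j] = 1
    return mask


for row in glm_attention_mask(len_a=3, len_b=2):
    print(row)
# [1, 1, 1, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 1, 0]
# [1, 1, 1, 1, 1]
```

The upper-right block of zeros is what keeps the context from peeking at the spans it is supposed to predict.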

GLM-130B uses two distinct mask tokens to control behaviour. A short [MASK] is used for blanks within a sentence, while [gMASK] marks a long blank at the end of a sequence, which is what drives open-ended generation.^[5] Pretraining was divided into two parts: roughly 95% self-supervised blank infilling on the bilingual corpora, plus about 5% multi-task instruction pretraining on a mix of supervised datasets to improve downstream and zero-shot behaviour.^[5]
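To make the role of the two mask tokens concrete, here is a small sketch of how a blank-infilling training example might be assembled from a token list. The helper name and the "<sop>" / "<eop>" start and end markers are placeholders chosen for illustration, and real preprocessing (span sampling, tokenization, multiple spans) is more involved.

```python
def make_infilling_example(tokens, span_start, span_end, mask_token="[MASK]"):
    """Split tokens into Part A (corrupted context) and Part B (span to generate)."""
    span = tokens[span_start:span_end]
    # Part A: the original text with the span replaced by a single mask token.
    part_a = tokens[:span_start] + [mask_token] + tokens[span_end:]
    # Part B: the span, shifted by one position for autoregressive training.
    # "<sop>" and "<eop>" are assumed placeholder names for the start/end markers.
    part_b_input = ["<sop>"] + span
    part_b_target = span + ["<eop>"]
    return part_a, part_b_input, part_b_target


tokens = "the model learns to fill in blanks".split()

# A short blank inside the sentence uses [MASK].
print(make_infilling_example(tokens, 2, 4))

# A long blank running to the end of the sequence uses [gMASK],
# which is the open-ended generation case.
print(make_infilling_example(tokens, 3, len(tokens), mask_token="[gMASK]"))
```

In the second call Part A is just the prompt followed by [gMASK], so the model's job reduces to continuing the text, much as a left-to-right decoder would.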

How is GLM-130B built?

GLM-130B is a 70-layer transformer with a hidden dimension of 12,288.^[5]^[7] It departs from a vanilla transformer in several ways. Positional information is supplied by rotary positional embeddings (RoPE) rather than learned absolute positions.^[5] The feed-forward blocks use a gated GeLU activation (GeGLU) in place of a standard MLP.^[5] Layer normalization follows the DeepNorm formulation, applied in a Post-LayerNorm arrangement, which the authors found essential for stability (see below).^[7]

Property	Value
Parameters	130 billion
Layers	70
Hidden dimension	12,288
Maximum sequence length	2,048 tokens
Vocabulary	150,000 (bilingual)
Position encoding	Rotary (RoPE)
Activation	GeGLU
Normalization	DeepNorm (Post-LN)
Training tokens	~400 billion (~200B EN, ~200B ZH)
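As a rough sanity check, the figures in the table can be turned into a back-of-the-envelope parameter count. This sketch makes two assumptions that are not stated above: a feed-forward inner dimension of 32,768, and a GeGLU block made of three weight matrices (gate, up, down). Biases and normalization parameters are ignored.

```python
HIDDEN = 12_288
LAYERS = 70
VOCAB = 150_000
FFN_INNER = 32_768  # assumption: the inner dimension is not given in the table above


def estimate_params(hidden: int, layers: int, vocab: int, ffn_inner: int) -> int:
    """Approximate weight count of a dense transformer with GeGLU feed-forward blocks."""
    attention = 4 * hidden * hidden      # query, key, value and output projections
    geglu_ffn = 3 * hidden * ffn_inner   # gate, up and down projections (assumed layout)
    embeddings = vocab * hidden          # token embedding table
    return layers * (attention + geglu_ffn) + embeddings


total = estimate_params(HIDDEN, LAYERS, VOCAB, FFN_INNER)
print(f"{total / 1e9:.1f}B parameters")  # 128.7B parameters
```

Under these assumptions the estimate lands close to the stated 130 billion, with the transformer layers accounting for nearly all of it and the embedding table for under 2 billion.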

How was GLM-130B trained?

GLM-130B was trained between May 6 and July 3, 2022, on a cluster of 96 DGX-A100 nodes, each with eight 40 GB A100 GPUs, for a total of 768 GPUs.^[7] The team used a 3D parallelism strategy that combined data parallelism, 4-way tensor (model) parallelism, and 8-way pipeline parallelism, implemented on top of Megatron-LM and DeepSpeed.^[7] Activations had to be re-materialized (recomputed during the backward pass) to fit the model in memory, so the reported hardware FLOPs utilization of about 43.3% is higher than the model FLOPs utilization of about 32.5%: the recomputation keeps the hardware busy but does not count as useful model work.^[7]
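The cluster figures imply a data-parallel degree that the text does not spell out, and the two utilization numbers can be related to each other. The arithmetic below is a sketch; the reading of the utilization ratio as "one extra forward pass" relies on the common accounting assumption that a backward pass costs about twice a forward pass.

```python
NODES = 96
GPUS_PER_NODE = 8
TENSOR_PARALLEL = 4
PIPELINE_PARALLEL = 8

total_gpus = NODES * GPUS_PER_NODE                      # GPUs in the cluster
gpus_per_replica = TENSOR_PARALLEL * PIPELINE_PARALLEL  # GPUs holding one model copy
data_parallel = total_gpus // gpus_per_replica          # model copies trained in parallel

print(total_gpus, gpus_per_replica, data_parallel)  # 768 32 24

# Hardware vs. model FLOPs utilization, as reported in the text.
HFU, MFU = 43.3, 32.5
recompute_overhead = HFU / MFU
print(round(recompute_overhead, 2))  # 1.33
```

A ratio near 4/3 is what one would expect if every forward pass is recomputed once: forward (1) plus backward (2) plus recomputed forward (1) gives 4 units of hardware work for 3 units of model work.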

The optimizer was AdamW with beta values of 0.9 and 0.95 and weight decay of 0.1. The learning rate warmed up to a peak of 8 x 10^-5 and then decayed on a cosine schedule, while the batch size was warmed from 192 up to 4,224 over the early part of training.^[7] Gradient clipping was set to 1.0 and dropout to 0.1.^[7]
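The shape of these schedules can be sketched in a few lines. Only the peak learning rate (8 x 10^-5) and the batch-size endpoints (192 and 4,224) come from the text; the linear warm-up shape, the step counts, and the minimum learning rate are illustrative assumptions.

```python
import math

PEAK_LR = 8e-5


def learning_rate(step, total_steps, warmup_steps, peak_lr=PEAK_LR, min_lr=8e-6):
    """Linear warm-up to peak_lr, then cosine decay to min_lr (min_lr is an assumption)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (peak_lr - min_lr) * cosine


def batch_size(step, rampup_steps, start=192, end=4224):
    """Linearly ramp the batch size from start to end, then hold it constant."""
    if step >= rampup_steps:
        return end
    return int(start + (end - start) * step / rampup_steps)


# Hypothetical step counts, chosen only to exercise the functions.
TOTAL, WARMUP, RAMPUP = 10_000, 50, 250
for step in (0, WARMUP, RAMPUP, TOTAL // 2, TOTAL):
    print(step, f"{learning_rate(step, TOTAL, WARMUP):.2e}", batch_size(step, RAMPUP))
```

A real run would also round the ramped batch size to a multiple that the parallel layout can divide evenly, which this sketch leaves out.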

How did the team stabilize training at 100B scale?

A large part of the GLM-130B paper is given over to a candid account of how unstable a 100B-scale model can be. The authors write that in pretraining the model "we face numerous unexpected technical and engineering challenges, particularly on loss spikes and divergence," and they report that several training runs diverged outright before they found a stable recipe.^[1]^[7] Two specific interventions were central.

The first is embedding gradient shrink (EGS). The team observed that gradients on the word-embedding layer were an early warning sign of divergence, spiking before the loss did. They damped those gradients with the operation word_embedding = word_embedding * alpha + word_embedding.detach() * (1 - alpha), using a shrink factor of alpha = 0.1. With that change, "setting alpha = 0.1 wipes out most spikes we would have met, with negligible latency."^[7]
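Why that one-line operation shrinks gradients without changing the forward pass is easy to miss. The sketch below illustrates it without a deep-learning framework, using a tiny value-plus-derivative class written for this example; the `Dual` class and function name are illustrative and are not part of the GLM-130B code base.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    """A value together with its derivative with respect to the raw embedding."""
    value: float
    grad: float

    def __mul__(self, k: float) -> "Dual":
        return Dual(self.value * k, self.grad * k)

    def __add__(self, other: "Dual") -> "Dual":
        return Dual(self.value + other.value, self.grad + other.grad)

    def detach(self) -> "Dual":
        # Same value, but treated as a constant: no gradient flows through it.
        return Dual(self.value, 0.0)


def embedding_gradient_shrink(word_embedding: Dual, alpha: float = 0.1) -> Dual:
    """Apply the operation quoted above to a single scalar embedding entry."""
    return word_embedding * alpha + word_embedding.detach() * (1 - alpha)


x = Dual(value=2.0, grad=1.0)
y = embedding_gradient_shrink(x)
print(y.value, y.grad)  # value stays (approximately) 2.0, gradient becomes 0.1
```

The two terms always sum to the original value, but only the `alpha` share carries a gradient, so the embedding layer receives one tenth of the gradient it otherwise would.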

The second is DeepNorm, a Post-LayerNorm variant that scales the input carried by the residual (skip) connection by a factor of alpha = (2N)^(1/2) for an N-layer network before adding the sublayer output and normalizing, that is, LayerNorm(alpha * x + Network(x)). The authors tested Pre-LN, Post-LN, and Sandwich-LN and reported that all three were "incapable of stabilizing" their test runs, whereas DeepNorm-based Post-LN trained stably.^[7] For mixed precision they followed an Apex O2 style setup, keeping forward and backward passes in FP16 while holding optimizer states and master weights in FP32.^[7] These engineering lessons, rather than any single benchmark win, were the part of the paper most often cited by other groups training large models.
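For a sense of scale, the DeepNorm factor for a 70-layer network can be computed directly. The sketch below follows the formulation LayerNorm(alpha * x + Network(x)) with alpha = (2N)^(1/2); the plain-list layer norm has no learnable gain or bias, and the toy sublayer is a stand-in for an attention or feed-forward block.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def deepnorm_alpha(num_layers: int) -> float:
    """Residual scaling factor alpha = (2N)^(1/2) for an N-layer network."""
    return math.sqrt(2 * num_layers)


def deepnorm(x, sublayer, num_layers):
    """Post-LN residual block with the skip input up-weighted by alpha."""
    alpha = deepnorm_alpha(num_layers)
    out = sublayer(x)
    return layer_norm([alpha * xi + oi for xi, oi in zip(x, out)])


print(round(deepnorm_alpha(70), 2))  # 11.83


def toy_sublayer(x):
    # Hypothetical stand-in for attention or feed-forward output.
    return [0.5 * v + 0.1 for v in x]


print(deepnorm([1.0, -2.0, 0.5, 3.0], toy_sublayer, num_layers=70))
```

With 70 layers the skip path is weighted almost twelve times more heavily than the sublayer output, which limits how much any single layer can perturb the residual stream.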

What is INT4 quantization in GLM-130B?

GLM-130B is frequently cited as the first 100B-scale model to reach INT4 quantization without post-training (without a separate calibration or fine-tuning step) and with almost no loss in quality.^[1]^[2] The authors attribute this to a scaling property they observed in GLM-130B's weight distributions that made the model unusually tolerant of aggressive quantization.^[1] In practice the model weights are quantized to 4-bit integers while activations and matrix multiplications are computed at higher precision.
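The idea of weight-only 4-bit quantization can be sketched generically. The example below uses symmetric absmax scaling over one row of weights; whether the GLM-130B toolkit uses exactly this scheme, range, or granularity is an assumption here, and the weights are made-up numbers.

```python
def quantize_int4(row):
    """Symmetric absmax quantization of one weight row to integers in [-7, 7]."""
    absmax = max(abs(w) for w in row)
    scale = absmax / 7 if absmax > 0 else 1.0
    quantized = [max(-7, min(7, round(w / scale))) for w in row]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the stored integers back to approximate floating-point weights."""
    return [q * scale for q in quantized]


weights = [0.12, -0.50, 0.33, 0.07, -0.21]  # illustrative values only
quantized, scale = quantize_int4(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)
print(f"scale={scale:.4f}, max error={max_error:.4f}")
```

The rounding error per weight is at most half the scale, so rows whose values sit in a narrow range lose little; a single large outlier in a row inflates the scale and hurts every other weight in it.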

The practical payoff is hardware accessibility. At full FP16 precision the weights alone occupy roughly 260 GB (130 billion parameters at 2 bytes each), so inference needs about 8 A100 (40 GB) GPUs, or 8 V100 (32 GB) GPUs with part of the weights offloaded.^[2] Quantization brings that requirement down substantially:

Configuration	GPUs required
Full precision (FP16)	8 x A100 40 GB, or 8 x V100 32 GB (with weight offloading)
INT8	8 x RTX 3090 (24 GB), or 8 x V100 32 GB
INT4	4 x RTX 3090 (24 GB), or 8 x RTX 2080 Ti (11 GB)

The INT4 path, running on four RTX 3090 cards or eight RTX 2080 Ti cards, put a 100B-scale model within reach of a single workstation or a small server, which was unusual for a model of this size in 2022.^[1]^[2]
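The hardware figures follow from simple memory arithmetic. This sketch counts weight storage only and ignores activations, caches, and framework overhead, so it gives a lower bound on what a configuration needs rather than a guarantee that it will run.

```python
PARAMS = 130e9
BYTES_PER_WEIGHT = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_gb(precision: str) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return PARAMS * BYTES_PER_WEIGHT[precision] / 1e9


def fits(precision: str, num_gpus: int, gb_per_gpu: int) -> bool:
    """True when the weights fit in the combined memory of the GPUs."""
    return weight_gb(precision) <= num_gpus * gb_per_gpu


configs = [
    ("FP16", 8, 40),  # 8 x A100 40 GB
    ("INT8", 8, 24),  # 8 x RTX 3090
    ("INT4", 4, 24),  # 4 x RTX 3090
    ("INT4", 8, 11),  # 8 x RTX 2080 Ti
]
for precision, num_gpus, gb_per_gpu in configs:
    need, have = weight_gb(precision), num_gpus * gb_per_gpu
    print(f"{precision}: need {need:.0f} GB, have {have} GB, fits={fits(precision, num_gpus, gb_per_gpu)}")
```

INT4 cuts the weights to about 65 GB, which is why four 24 GB cards (96 GB) or eight 11 GB cards (88 GB) are enough.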

How does GLM-130B perform on benchmarks?

On English benchmarks, GLM-130B reported results ahead of GPT-3 175B. On the LAMBADA last-word prediction task it scored 80.2% zero-shot accuracy, against 76.2% for GPT-3 175B (davinci) and 77.9% for PaLM 540B.^[8] On the MMLU multitask benchmark it reached 44.8% in the 5-shot setting after seeing 400 billion tokens, slightly ahead of GPT-3 175B (+0.9%).^[8] The paper also compares against the two other openly released models of this scale, OPT-175B and BLOOM-176B, and notes that the advantage GLM-130B shows over GPT-3 175B is not observed for either of them.^[1]

Benchmark (setting)	GLM-130B	Reported comparison
LAMBADA (zero-shot)	80.2%	GPT-3 175B 76.2%; PaLM 540B 77.9%; better than GPT-3 (+4.0%), OPT-175B (+5.5%), BLOOM-176B (+13.0%)
MMLU (5-shot)	44.8%	Better than GPT-3 175B (+0.9%); ahead of BLOOM-176B by a wide margin
BIG-bench (zero-shot)	N/A	Reported ~3x better than GPT-3 175B on BIG-bench-lite

The comparison percentages above are the differences stated by the authors on the project's pages and in the paper, expressed in accuracy points; the LAMBADA gap over GPT-3 175B, for example, is 80.2 - 76.2 = 4.0.^[2]^[8] Because OPT-175B and BLOOM-176B were the only other openly released models at this scale, GLM-130B's published Pile and LAMBADA numbers became a common reference point for the open community.

On Chinese benchmarks, where the most relevant baseline was ERNIE TITAN 3.0 260B (the largest Chinese language model at the time), GLM-130B reported large gains: about +24.26% over ERNIE TITAN 3.0 across seven zero-shot CLUE datasets and about +12.75% across five zero-shot FewCLUE datasets.^[2] As with all self-reported benchmark figures, these numbers reflect the authors' own evaluation setup and should be read with that caveat.

Is GLM-130B open source?

The GLM-130B source code is released under the Apache 2.0 license.^[2] The model weights are governed by a separate model license (the "GLM-130B License") and were distributed on application: users requested access through a form, after which they could download the checkpoints.^[2] This split, a permissive open-source license on the code plus a separate gated license on the weights, was common for large models released in 2022. The repository, originally under the THUDM organization on GitHub and now hosted under zai-org, includes the inference code, the quantization toolkit, and training logs.^[2]

Why was GLM-130B significant?

GLM-130B's most lasting impact came through the models that descended from it. In March 2023 the group released ChatGLM-6B, a 6.2-billion-parameter conversational model that could run locally on a consumer GPU under INT4 quantization, alongside an aligned 130B-scale chat model on the chatglm.cn service.^[4]^[9] ChatGLM-6B was downloaded heavily and helped seed a wider ecosystem of Chinese open models. The lineage continued through the ChatGLM2 and ChatGLM3 generations and into the GLM-4 family, which Zhipu AI documents in its 2024 paper "ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools" (arXiv:2406.12793).^[4]

Beyond the specific models, GLM-130B is remembered for the transparency of its training write-up. By publishing the loss curves, the failed configurations, and the fixes (embedding gradient shrink and DeepNorm in particular), the authors gave other teams a rare, detailed look at what training a 100B-scale dense model actually involves, at a moment when very few such accounts existed in public.^[1]^[7]

References

[1] Aohan Zeng et al., "GLM-130B: An Open Bilingual Pre-trained Model," arXiv:2210.02414, https://arxiv.org/abs/2210.02414
[2] zai-org (formerly THUDM), "GLM-130B: An Open Bilingual Pre-Trained Model (ICLR 2023)," GitHub repository, https://github.com/zai-org/GLM-130B
[3] OpenReview, "GLM-130B: An Open Bilingual Pre-trained Model," ICLR 2023 conference paper, https://openreview.net/forum?id=-Aw0rrrPUF
[4] Team GLM (Zhipu AI / Tsinghua KEG), "ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools," arXiv:2406.12793, https://arxiv.org/abs/2406.12793
[5] KEG, Tsinghua University, "GLM-130B: An Open Bilingual Pre-Trained Model," project blog post, https://keg.cs.tsinghua.edu.cn/glm-130b/posts/glm-130b/
[6] Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, Jie Tang, "GLM: General Language Model Pretraining with Autoregressive Blank Infilling," ACL 2022, https://aclanthology.org/2022.acl-long.26/
[7] Aohan Zeng et al., "GLM-130B: An Open Bilingual Pre-trained Model" (full paper, training and stability sections), arXiv:2210.02414, https://ar5iv.labs.arxiv.org/html/2210.02414
[8] KEG, Tsinghua University, "GLM-130B" results and benchmarks, https://keg.cs.tsinghua.edu.cn/glm-130b/posts/glm-130b/
[9] zai-org, "ChatGLM-6B," GitHub repository, https://github.com/zai-org/ChatGLM-6B
