CodeGeeX

Updated Jul 16, 2026

CodeGeeX is an open series of multilingual code generation models developed by the Knowledge Engineering Group (KEG) and Data Mining lab at Tsinghua University together with Zhipu AI. The first model, a 13-billion-parameter large language model trained on a corpus of 23 programming languages, was described in a paper submitted to arXiv in March 2023 and presented at the KDD 2023 conference ^[1]^[2]. Alongside the model the team released a free code-assistant extension for Visual Studio Code and JetBrains IDEs, and introduced HumanEval-X, a benchmark for evaluating code generation and translation across several languages ^[1]^[3]. Later versions reused the lab's general-purpose chat models as their base: CodeGeeX2 is built on ChatGLM2-6B, and CodeGeeX4 is built on GLM-4-9B ^[4]^[5].

The original 13B model

The first CodeGeeX is a left-to-right autoregressive transformer with 13 billion parameters. The architecture is a 39-layer transformer decoder with a hidden size of 5,120 and 40 attention heads, followed by an extra "top query layer" that conditions the final prediction on the embedding of the target position rather than reusing the last token's input. The vocabulary contains 52,224 tokens ^[1].
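
As a rough sanity check on the 13-billion figure, the dimensions above can be plugged into the usual back-of-the-envelope formula for transformer size. This is only a sketch: the 12·d² weights per layer (attention plus an MLP with 4x expansion) and the idea that the top query layer costs about as much as one ordinary layer are assumptions made here, not figures from the paper, and biases, layer norms, and position embeddings are ignored.

```python
# Rough parameter estimate from the published dimensions.
# Assumptions (not from the paper): 12 * d^2 weights per layer, and a
# top query layer that costs about the same as one ordinary layer.

LAYERS = 39      # transformer decoder layers
HIDDEN = 5_120   # hidden size
HEADS = 40       # attention heads
VOCAB = 52_224   # vocabulary size


def estimate_parameters(layers: int, hidden: int, heads: int, vocab: int) -> dict:
    """Return an approximate parameter breakdown for a decoder-only model."""
    per_layer = 12 * hidden * hidden   # 4*d^2 attention + 8*d^2 MLP (assumed)
    decoder = layers * per_layer
    embedding = vocab * hidden         # token embedding table
    top_query = per_layer              # assumed size of the extra layer
    return {
        "head_dim": hidden // heads,
        "per_layer": per_layer,
        "decoder": decoder,
        "embedding": embedding,
        "total": decoder + embedding + top_query,
    }


estimate = estimate_parameters(LAYERS, HIDDEN, HEADS, VOCAB)
print(f"head dimension: {estimate['head_dim']}")
print(f"approx. total: {estimate['total'] / 1e9:.2f}B parameters")
```

Under these assumptions the estimate lands a little under 13 billion, which is consistent with the model's stated size.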

The pretraining corpus held about 158 billion tokens covering 23 programming languages, including Python, C++, Java, JavaScript, Go, and Rust. It was assembled from two parts: open-source code datasets, namely the Pile and CodeParrot, and supplementary Python, Java, and C++ code scraped directly from public GitHub repositories that did not already appear in the first part. The scraped data was filtered to drop automatically generated files, files with very long lines, and files outside a 1 KB to 100 KB size range. Over the full run the model saw roughly 850 billion tokens, a little over five passes through the corpus (850 / 158 ≈ 5.4), with a maximum sequence length of 2,048 tokens. Training ran between April and June 2022 on a cluster of 1,536 Huawei Ascend 910 AI processors, for about 213,000 steps, using the MindSpore framework (version 1.7.0) ^[1]^[3]. This reliance on Ascend hardware and MindSpore, rather than NVIDIA GPUs and PyTorch, distinguishes CodeGeeX from most contemporary code models and reflects the compute available to the Tsinghua and Zhipu team at the time.
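
The three filtering rules applied to the scraped data can be sketched as a single predicate. This is an illustrative sketch, not the team's pipeline: only the 1 KB to 100 KB size range comes from the description above, while the 1,024-byte kilobyte, the `MAX_LINE_CHARS` threshold, and the `GENERATED_MARKERS` strings are hypothetical values chosen for the example.

```python
# Illustrative file filter. Only the 1 KB - 100 KB range is taken from the
# article; every other constant below is a hypothetical placeholder.

MIN_BYTES = 1 * 1024      # lower size bound (assumes 1 KB = 1,024 bytes)
MAX_BYTES = 100 * 1024    # upper size bound
MAX_LINE_CHARS = 1_000    # hypothetical cutoff for "very long lines"
GENERATED_MARKERS = ("auto-generated", "autogenerated", "do not edit")  # hypothetical


def keep_file(source: str) -> bool:
    """Return True if a source file passes all three filters."""
    size = len(source.encode("utf-8"))
    if not MIN_BYTES <= size <= MAX_BYTES:
        return False  # outside the allowed size range

    lines = source.splitlines()
    if any(len(line) > MAX_LINE_CHARS for line in lines):
        return False  # contains a very long line (often minified or data)

    header = "\n".join(lines[:5]).lower()
    if any(marker in header for marker in GENERATED_MARKERS):
        return False  # header suggests the file was generated by a tool

    return True


sample = "def f():\n    return 1\n" * 100  # 2,200 bytes of ordinary code
print(keep_file(sample))
```

Checking the cheap size test first avoids splitting very large files into lines before rejecting them.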

The paper, "CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X," is credited to Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Jie Tang, and collaborators. It was submitted to arXiv on 30 March 2023 (arXiv:2303.17568) and accepted to the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) in 2023 ^[1]^[2].

HumanEval-X benchmark

To measure code generation beyond Python, the authors built HumanEval-X. The widely used HumanEval benchmark from OpenAI contains 164 hand-written Python problems, each with a function signature, a docstring, and unit tests. The CodeGeeX team manually translated all 164 problems into four additional languages (C++, Java, JavaScript, and Go), producing 820 problem-and-solution pairs: 164 problems, each with a reference solution in 5 languages ^[1]^[3].

HumanEval-X supports two tasks. The first is code generation, where a model completes a function from its signature and docstring, scored with the pass@k metric on hidden tests. The second is code translation, where a model rewrites a reference solution from one language into another. Because each problem exists in five languages with shared tests, the benchmark allows direct comparison of a model's competence across languages and of its ability to move code between them.
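
For readers unfamiliar with pass@k, the sketch below shows the unbiased estimator that was introduced alongside the original HumanEval benchmark: draw n samples per problem, count the c samples that pass all tests, and estimate the chance that at least one of k randomly chosen samples passes. The article does not say which sample counts were used for the CodeGeeX scores, so the numbers in the usage lines are made up for illustration, and the function names are this example's own.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass every test, k: budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up results for three problems, 10 samples each.
toy_results = [(10, 3), (10, 0), (10, 10)]
print(f"pass@1 = {benchmark_pass_at_k(toy_results, 1):.3f}")
```

With k = 1 the estimator reduces to the fraction of passing samples, so a pass@1 score can be read as the expected success rate of a single generation.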

The table below shows the original CodeGeeX-13B pass@1 scores on the code generation task ^[1].

Language	CodeGeeX-13B pass@1
Python	22.89%
C++	17.06%
Java	20.04%
JavaScript	17.59%
Go	14.43%

On HumanEval-X the original model outperformed other multilingual code models of comparable scale at the time of publication, for both generation and translation ^[1].

IDE extension

The project shipped a free coding assistant as an editor extension rather than only as research weights. The extension provides inline code completion, function generation from comments, and translation between languages, calling the CodeGeeX model as a backend.

The Visual Studio Code extension was published to the VS Code Marketplace, and JetBrains support was added in December 2022, covering IDEs such as IntelliJ IDEA, PyCharm, GoLand, CLion, WebStorm, and others (requiring a 2021.1 or newer IDE). The model was also integrated into Tencent's Cloud Studio environment in February 2023 ^[3]. According to the paper, the deployed extensions generated billions of tokens per week for tens of thousands of weekly active users ^[1].

CodeGeeX2 and CodeGeeX4

After the first model, the team rebuilt CodeGeeX on top of its general-purpose chat models, which let it inherit a stronger and more efficient base while shrinking the parameter count.

CodeGeeX2, released on 24 July 2023, is a 6-billion-parameter model implemented on the ChatGLM2 architecture (see ChatGLM). It was further pretrained on 600 billion code tokens on top of the ChatGLM2-6B base. Despite being less than half the size of the original, CodeGeeX2-6B improved sharply on HumanEval-X in every language: Python +57%, C++ +71%, Java +54%, JavaScript +83%, Go +56%, and Rust +321% relative to the first CodeGeeX, reaching 35.9% pass@1 in Python. At that size it surpassed the larger StarCoder-15B on the same benchmark. The model supports a maximum sequence length of 8,192 tokens, handles more than 100 programming languages in practice, and can run in an INT4-quantized form using roughly 6 GB of GPU memory ^[4].
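
The percentages above are relative gains, not percentage-point differences. A short check using the two Python pass@1 figures quoted in this article (22.89% for the original model and 35.9% for CodeGeeX2-6B) shows how the +57% is obtained; the helper name is this example's own.

```python
def relative_gain(old: float, new: float) -> float:
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old


python_13b = 22.89   # original CodeGeeX-13B, Python pass@1 (%)
python_v2 = 35.9     # CodeGeeX2-6B, Python pass@1 (%)

gain = relative_gain(python_13b, python_v2)
print(f"absolute gain: {python_v2 - python_13b:.2f} points")
print(f"relative gain: {gain:.1%}")

# Size comparison: 6B parameters against the original 13B.
print(f"size ratio: {6 / 13:.2f}")
```

The same 13-point absolute gain would be a much smaller relative gain on a higher baseline, which is why a language with a low starting score, such as Rust, can show a very large percentage.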

CodeGeeX4-ALL-9B, released in July 2024, is continually trained on the GLM-4-9B base, giving a 9-billion-parameter model with a 128K-token context window. The "ALL" in its name refers to the abilities it adds beyond plain completion and generation: a code interpreter, web search, function calling, and repository-level code question answering. The long context lets it ingest an entire project, so completions and answers can take account of code spread across many files. The model also supports fill-in-the-middle completion, in which it generates code between a given prefix and suffix rather than only continuing from the end. The team reported that it was the strongest code generation model under 10 billion parameters at release, and that on BigCodeBench it posted the highest scores of any model under 20 billion parameters ^[5]. As with earlier versions, CodeGeeX4 is meant to be used both as raw weights and through the CodeGeeX editor extensions, which were updated to route requests to the newer model.
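
Fill-in-the-middle completion is easiest to see from the editor's side: the text before the cursor becomes the prefix, the text after it becomes the suffix, and the model's output is spliced between them. The sketch below shows that flow in a generic prefix-suffix-middle layout. The sentinel strings are placeholders invented for this example and are not the special tokens CodeGeeX4 uses, and the "completion" is hard-coded rather than produced by a model.

```python
# Generic fill-in-the-middle prompt assembly.
# The sentinels below are hypothetical placeholders, NOT CodeGeeX4's tokens.
PREFIX_TOKEN = "<PREFIX>"
SUFFIX_TOKEN = "<SUFFIX>"
MIDDLE_TOKEN = "<MIDDLE>"


def split_at_cursor(source: str, cursor: int) -> tuple[str, str]:
    """Split a buffer into the text before and after the cursor offset."""
    return source[:cursor], source[cursor:]


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out prefix and suffix so the model continues with the middle."""
    return f"{PREFIX_TOKEN}{prefix}{SUFFIX_TOKEN}{suffix}{MIDDLE_TOKEN}"


def splice_completion(prefix: str, middle: str, suffix: str) -> str:
    """Insert the generated middle back between prefix and suffix."""
    return prefix + middle + suffix


buffer = "def add(a, b):\n    \n\nprint(add(1, 2))\n"
cursor = buffer.index("    \n") + 4          # cursor sits on the empty body line
prefix, suffix = split_at_cursor(buffer, cursor)

prompt = build_fim_prompt(prefix, suffix)
middle = "return a + b"                       # stand-in for a model's output
print(splice_completion(prefix, middle, suffix))
```

Because the suffix is part of the prompt, the model can take the later call to `add(1, 2)` into account, which a purely left-to-right completion cannot do.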

Version	Release	Base model	Parameters	Context length
CodeGeeX	2022 (paper Mar 2023)	Trained from scratch	13B	2,048
CodeGeeX2-6B	Jul 2023	ChatGLM2-6B	6B	8,192
CodeGeeX4-ALL-9B	Jul 2024	GLM-4-9B	9B	128K

Benchmarks

CodeGeeX4-ALL-9B was evaluated on a broad set of code benchmarks. The figures the maintainers reported are listed below ^[5].

Benchmark	CodeGeeX4-ALL-9B
HumanEval	82.3
MBPP	75.7
NaturalCodeBench (NCB)	40.4
BigCodeBench (complete)	48.9
BigCodeBench (instruct)	40.4
HumanEval-FIM	85.0
CRUXEval-O	47.1

On the Code Needle In A Haystack (NIAH) evaluation, the model reportedly achieved 100% retrieval accuracy for Python code within its 128K-token context ^[5]. The fill-in-the-middle and repository-level results reflect the shift in later versions from single-function completion toward longer-context, whole-project assistance.

Licensing

Across versions the project splits its license between code and weights. The source code of the CodeGeeX repositories is released under the Apache License 2.0, while the model weights are distributed under a separate model license. The weights may be used for academic research, and commercial use requires registering through a form provided by the maintainers ^[3]^[4]^[5]. The repositories and weights are hosted on GitHub and Hugging Face under the project's organization (formerly THUDM, now zai-org).

References

1. Zheng, Qinkai et al. "CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X." arXiv:2303.17568. https://arxiv.org/abs/2303.17568
2. "CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X." arXiv HTML (v2), KDD 2023. https://arxiv.org/html/2303.17568v2
3. "CodeGeeX: An Open Multilingual Code Generation Model (KDD 2023)." GitHub repository. https://github.com/zai-org/CodeGeeX
4. "CodeGeeX2: A More Powerful Multilingual Code Generation Model." GitHub repository (README, EN). https://github.com/zai-org/CodeGeeX2/blob/main/README_EN.md
5. "CodeGeeX4-ALL-9B: A versatile model for AI software development." GitHub repository. https://github.com/zai-org/CodeGeeX4
