CodeGeeX
Last reviewed
Jun 3, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,372 words
CodeGeeX is an open series of multilingual code generation models developed by the Knowledge Engineering Group (KEG) and Data Mining lab at Tsinghua University together with Zhipu AI. The first model, a 13-billion-parameter large language model trained on a corpus spanning 23 programming languages, was described in a paper submitted to arXiv in March 2023 and presented at the KDD 2023 conference [1][2]. Alongside the model, the team released a free code-assistant extension for Visual Studio Code and JetBrains IDEs and introduced HumanEval-X, a benchmark for evaluating code generation and translation across several languages [1][3]. Later versions reused the lab's general-purpose chat models as their base: CodeGeeX2 is built on ChatGLM2-6B, and CodeGeeX4 is built on GLM-4-9B [4][5].
The first CodeGeeX is a left-to-right autoregressive transformer with 13 billion parameters. The architecture is a 39-layer transformer decoder with a hidden size of 5,120 and 40 attention heads, followed by an extra "top query layer" that conditions the final prediction on the embedding of the target position rather than reusing the last token's input. The vocabulary contains 52,224 tokens [1].
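As a rough sanity check on the 13-billion figure, the sketch below applies the common approximation of about 12·h² weights per decoder layer (4·h² for the attention projections plus 8·h² for a feed-forward block with 4× expansion). The 4× expansion, the treatment of the top query layer as one extra layer-sized block, and the omission of biases, layer norms, and position embeddings are assumptions made for illustration, not details taken from the paper.

```python
# Back-of-the-envelope parameter count for a GPT-style decoder.
# Assumption: each layer holds ~12 * h^2 weights
# (4 * h^2 for attention projections + 8 * h^2 for a 4x feed-forward block).

def estimate_params(layers: int, hidden: int, vocab: int, extra_layers: int = 0) -> int:
    """Approximate weight count; ignores biases, layer norms, position embeddings."""
    per_layer = 12 * hidden * hidden
    embedding = vocab * hidden
    return (layers + extra_layers) * per_layer + embedding

# Figures quoted in the text for the first CodeGeeX model.
HIDDEN, LAYERS, HEADS, VOCAB = 5120, 39, 40, 52224

head_dim = HIDDEN // HEADS
# Assumption: the top query layer is counted as one additional layer-sized block.
total = estimate_params(LAYERS, HIDDEN, VOCAB, extra_layers=1)

print(f"head dimension: {head_dim}")
print(f"estimated parameters: {total / 1e9:.2f}B")
```

Under these assumptions the estimate comes to roughly 12.9 billion weights, which is consistent with the model being described as 13B.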
The pretraining corpus held about 158 billion tokens covering 23 programming languages, including Python, C++, Java, JavaScript, Go, and Rust. It was assembled from two parts: open-source code datasets, namely the Pile and CodeParrot, and supplementary Python, Java, and C++ code, not already present in those datasets, scraped directly from public GitHub repositories. The scraped data was filtered to drop automatically generated files, files with very long lines, and files outside a 1 KB to 100 KB size range. Over the full run the model saw roughly 850 billion tokens, which corresponds to more than five passes over the corpus, with a maximum sequence length of 2,048 tokens. Training ran on a cluster of 1,536 Huawei Ascend 910 AI processors between April and June 2022, for about 213,000 steps, using the MindSpore framework (version 1.7.0) [1][3]. This reliance on Ascend hardware and MindSpore, rather than NVIDIA GPUs and PyTorch, sets CodeGeeX apart from most contemporary code models and reflects the compute available to the Tsinghua and Zhipu team at the time.
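The filtering rules described for the scraped data can be sketched as a simple predicate over file contents. Only the 1 KB to 100 KB size range comes from the text. The line-length threshold, the marker strings used to spot generated files, and the function name `keep_file` are hypothetical choices for illustration and are not the project's actual pipeline.

```python
MIN_BYTES = 1 * 1024      # from the text: files under 1 KB are dropped
MAX_BYTES = 100 * 1024    # from the text: files over 100 KB are dropped
MAX_LINE_CHARS = 1000     # hypothetical cutoff for "very long lines"
# Hypothetical markers for automatically generated files.
GENERATED_MARKERS = ("auto-generated", "autogenerated", "do not edit")


def keep_file(text: str) -> bool:
    """Return True if a source file passes all three illustrative filters."""
    size = len(text.encode("utf-8"))
    if not (MIN_BYTES <= size <= MAX_BYTES):
        return False

    lines = text.splitlines()
    if any(len(line) > MAX_LINE_CHARS for line in lines):
        return False

    # Generated files usually announce themselves in the first few lines.
    header = "\n".join(lines[:5]).lower()
    if any(marker in header for marker in GENERATED_MARKERS):
        return False

    return True


if __name__ == "__main__":
    sample = "x = 1\n" * 300  # about 1.8 KB of short lines
    print(keep_file(sample))
```

A real pipeline would likely add deduplication and language detection, neither of which is shown here.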
The paper, "CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X," is credited to Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Jie Tang, and collaborators. It was submitted to arXiv on 30 March 2023 (arXiv:2303.17568) and accepted to the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) in 2023 [1][2].
To measure code generation beyond Python, the authors built HumanEval-X. The widely used HumanEval benchmark from OpenAI contains 164 hand-written Python problems, each with a function signature, a docstring, and unit tests. The CodeGeeX team manually translated all 164 problems into four additional languages (C++, Java, JavaScript, and Go), producing 820 problem-and-solution pairs: 164 problems with reference solutions in each of 5 languages [1][3].
HumanEval-X supports two tasks. The first is code generation, where a model completes a function from its signature and docstring, scored with the pass@k metric on hidden tests. The second is code translation, where a model rewrites a reference solution from one language into another. Because each problem exists in five languages with shared tests, the benchmark allows direct comparison of a model's competence across languages and of its ability to move code between them.
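The text does not spell out how pass@k is computed. A minimal sketch, assuming the standard unbiased estimator introduced alongside the original HumanEval benchmark: for each problem, draw n samples, count the c that pass the tests, and estimate the chance that at least one of k randomly chosen samples is correct. The helper names below are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Two hypothetical problems: 3 of 10 samples pass on one, none on the other.
    print(benchmark_pass_at_k([(10, 3), (10, 0)], k=1))
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n, which is why pass@1 is often read as a plain success rate.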
The table below shows the original CodeGeeX-13B pass@1 scores on the code generation task [1].
| Language | CodeGeeX-13B pass@1 |
|---|---|
| Python | 22.89% |
| C++ | 17.06% |
| Java | 20.04% |
| JavaScript | 17.59% |
| Go | 14.43% |
On HumanEval-X the original model outperformed other multilingual code models of comparable scale at the time of publication, for both generation and translation [1].
The project shipped a free coding assistant as an editor extension rather than only as research weights. The extension provides inline code completion, function generation from comments, and translation between languages, calling the CodeGeeX model as a backend.
The Visual Studio Code extension was published to the VS Code Marketplace, and JetBrains support was added in December 2022, covering IntelliJ IDEA, PyCharm, GoLand, CLion, WebStorm, and other IDEs (version 2021.1 or newer). The model was also integrated into Tencent's Cloud Studio environment in February 2023 [3]. According to the paper, the deployed extensions generated billions of tokens per week for tens of thousands of weekly active users [1].
After the first model, the team rebuilt CodeGeeX on top of its general-purpose chat models, which let it inherit a stronger and more efficient base while shrinking the parameter count.
CodeGeeX2, released on 24 July 2023, is a 6-billion-parameter model implemented on the ChatGLM2 architecture (see ChatGLM). It was further pretrained on 600 billion code tokens on top of the ChatGLM2-6B base. Despite being less than half the size of the original, CodeGeeX2-6B improved sharply on HumanEval-X in all six reported languages (the original five plus Rust): Python +57%, C++ +71%, Java +54%, JavaScript +83%, Go +56%, and Rust +321% relative to the first CodeGeeX, reaching 35.9% pass@1 in Python. Despite its smaller size, it surpassed StarCoder-15B on the same benchmark. The model supports a maximum sequence length of 8,192 tokens, handles more than 100 programming languages in practice, and can run in an INT4-quantized form using roughly 6 GB of GPU memory [4].
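The relative gains quoted here are percentage changes over the first model's pass@1 scores, not percentage-point differences. A short check using the two Python figures that appear in this article (22.89% in the HumanEval-X table for the original model and 35.9% for CodeGeeX2-6B) illustrates the arithmetic; the function name is illustrative.

```python
def relative_gain(old: float, new: float) -> float:
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100.0


codegeex_13b_python = 22.89  # pass@1 (%) of the first model, from the table above
codegeex2_6b_python = 35.9   # pass@1 (%) quoted for CodeGeeX2-6B

gain = relative_gain(codegeex_13b_python, codegeex2_6b_python)
print(f"relative gain: {gain:.1f}%")                                  # about 56.8%
print(f"absolute gain: {codegeex2_6b_python - codegeex_13b_python:.2f} points")
```

The result rounds to the reported +57%, while the absolute difference is about 13 percentage points.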
CodeGeeX4-ALL-9B, released in July 2024, is continually trained on the GLM-4-9B base, giving a 9-billion-parameter model with a 128K-token context window. Beyond plain completion and generation, it adds agent-style abilities: a code interpreter, web search, function calling, and repository-level code question answering. The "ALL" in its name refers to this combined feature set. The long context lets it ingest an entire project, so completions and answers can take account of code spread across many files. The model also supports fill-in-the-middle completion, where it generates code between a given prefix and suffix rather than only continuing from the end. The team reported that it was the strongest code generation model under 10 billion parameters at release, and that on BigCodeBench it posted the highest scores of any model under 20 billion parameters [5]. As with earlier versions, CodeGeeX4 is meant to be used both as raw weights and through the CodeGeeX editor extensions, which were updated to route requests to the newer model.
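Fill-in-the-middle completion can be illustrated with a generic prompt layout: the editor splits the file at the cursor, sends the prefix and suffix to the model, and splices the generated middle back in. The sentinel strings and function names below are hypothetical placeholders chosen for this sketch. They are not CodeGeeX4's actual special tokens or prompt template, which should be taken from the model's own documentation.

```python
# Hypothetical sentinel strings; real models define their own special tokens.
PREFIX_TOKEN = "<PREFIX>"
SUFFIX_TOKEN = "<SUFFIX>"
MIDDLE_TOKEN = "<MIDDLE>"


def split_at_hole(source: str, start: int, end: int) -> tuple[str, str, str]:
    """Split source into (prefix, middle, suffix) around the span [start, end)."""
    return source[:start], source[start:end], source[end:]


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out prefix and suffix so the model continues with the missing middle."""
    return f"{PREFIX_TOKEN}{prefix}{SUFFIX_TOKEN}{suffix}{MIDDLE_TOKEN}"


def splice(prefix: str, generated_middle: str, suffix: str) -> str:
    """Insert the generated text back between prefix and suffix."""
    return prefix + generated_middle + suffix


source = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))\n"
hole_start = source.index("return")
hole_end = source.index("\n\nprint")

prefix, middle, suffix = split_at_hole(source, hole_start, hole_end)
prompt = build_fim_prompt(prefix, suffix)

# A model would be asked to continue `prompt`; here the known middle stands in
# for its output to show that splicing restores the original file.
restored = splice(prefix, middle, suffix)
print(restored == source)
```

Benchmarks such as HumanEval-FIM score this setting by hiding a span of a known solution and checking whether the spliced result still passes the tests.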
| Version | Release | Base model | Parameters | Context length |
|---|---|---|---|---|
| CodeGeeX | 2022 (paper Mar 2023) | Trained from scratch | 13B | 2,048 |
| CodeGeeX2-6B | Jul 2023 | ChatGLM2-6B | 6B | 8,192 |
| CodeGeeX4-ALL-9B | Jul 2024 | GLM-4-9B | 9B | 128K |
CodeGeeX4-ALL-9B was evaluated on a broad set of code benchmarks. The figures the maintainers reported are listed below [5].
| Benchmark | CodeGeeX4-ALL-9B |
|---|---|
| HumanEval | 82.3 |
| MBPP | 75.7 |
| NaturalCodeBench (NCB) | 40.4 |
| BigCodeBench (complete) | 48.9 |
| BigCodeBench (instruct) | 40.4 |
| HumanEval-FIM | 85.0 |
| CRUXEval-O | 47.1 |
On the Code Needle In A Haystack (NIAH) evaluation, the model reportedly achieved 100% retrieval accuracy for Python code within its 128K-token context [5]. The fill-in-the-middle and repository-level results reflect the shift in later versions from single-function completion toward longer-context, whole-project assistance.
Across versions the project splits its license between code and weights. The source code of the CodeGeeX repositories is released under the Apache License 2.0, while the model weights are distributed under a separate model license. Use of the weights for academic research is permitted, and commercial use requires registering through a form provided by the maintainers [3][4][5]. The repositories and weights are hosted on GitHub and Hugging Face under the project's organization (formerly THUDM, later zai-org).