GLM-4-Voice
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,297 words
GLM-4-Voice is an open-weights end-to-end speech-to-speech large language model released in October 2024 by Zhipu AI together with the Knowledge Engineering Group (KEG) at Tsinghua University.[^1][^2] The system accepts raw speech as input and generates speech as output without going through a separate automatic speech recognition (ASR) and text-to-speech (TTS) pipeline, instead converting audio to discrete tokens, processing them with a large language model derived from GLM-4-9B, and resynthesizing speech with a flow-matching decoder.[^1][^3] It is bilingual in Chinese and English, supports voice attribute control (emotion, intonation, speech rate, dialect), and was published under Apache 2.0 (code) and a separate model license for the weights.[^2][^3] A companion paper, "GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot" (arXiv:2412.02612), describes the design and training approach.[^1]
| Field | Value |
|---|---|
| Developer | Zhipu AI, Tsinghua University KEG |
| Initial release | October 2024[^4] |
| Paper | arXiv:2412.02612, 3 December 2024[^1] |
| Components | Tokenizer, GLM-4-Voice-9B, Decoder |
| Speech tokenizer | Whisper-large-v3 encoder + vector quantization, 12.5 Hz, 175 bps[^1][^5] |
| LLM backbone | GLM-4-9B (9B parameters)[^3][^6] |
| Decoder | Flow Matching, based on CosyVoice[^3][^7] |
| Languages | Chinese, English[^2][^3] |
| Pre-training scale | ~1 trillion tokens combining speech and text[^1] |
| License | Apache 2.0 (code); separate Model License Agreement (weights)[^2] |
| Repository | github.com/THUDM/GLM-4-Voice (mirrored at zai-org/GLM-4-Voice)[^2][^3] |
Zhipu AI was founded in 2019 in Beijing by Tsinghua University professors Tang Jie and Li Juanzi, both members of the university's Knowledge Engineering Group (KEG).[^8] The company spun out of academic research on knowledge graphs and large language models that had been underway at KEG for more than a decade, and it began developing the General Language Model (GLM) pre-training architecture from 2020 onwards.[^8][^9] In 2022, Zhipu and KEG jointly released GLM-130B, a 130-billion-parameter bilingual model that was one of the earliest large open foundation models released from China.[^9] The GLM-4 family followed in 2024, with the open-weights GLM-4-9B variant trained on roughly ten trillion tokens across 26 languages.[^6] Subsequent generations include GLM-4.5, GLM-4.6, and GLM-5, the last of which was announced in 2026 alongside Zhipu AI's IPO as Z.ai.[^10]
GLM-4-Voice was developed by the same team that produced the GLM-4 text models, with first authors Aohan Zeng and Zhengxiao Du (both of Tsinghua's Department of Computer Science) and senior authors Yuxiao Dong and Jie Tang.[^1][^11] The work was positioned as the speech modality extension of the GLM-4 family, analogous to how GLM-4V extended GLM-4 into vision.[^4]
Until 2024, voice assistants typically combined three separately trained models: an ASR system to transcribe input speech (often based on OpenAI Whisper or commercial ASR), a text LLM to produce a reply, and a TTS system to synthesize speech. This cascaded design introduces latency, loses paralinguistic information (tone, emotion, prosody), and prevents the LLM from reasoning about how something was said.[^12]
The release of GPT-4o in May 2024 demonstrated a proprietary alternative in which a single neural network handled speech-to-speech generation with sub-second latency, opening a wave of open research into similar architectures.[^13] Among the open competitors that appeared between mid-2024 and the end of 2024 were Mini-Omni from a separate Tsinghua-affiliated group (August 2024), Moshi from Kyutai (September 2024), and LLaMA-Omni from ICT-CAS (September 2024). GLM-4-Voice, announced on 25 October 2024, was Zhipu AI's entry into this category.[^4][^14]
The first public materials for GLM-4-Voice appeared at the China National Computer Conference (CNCC) in October 2024, after which the team published code, model weights, and a web demo on GitHub and Hugging Face on 25 October 2024.[^4][^2] The accompanying preprint, "GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot," was submitted to arXiv on 3 December 2024.[^1]
The repository was originally hosted under the THUDM (Tsinghua University Data Mining group) organization on GitHub. Following Zhipu AI's rebranding around its international product Z.ai, the same repository is mirrored at the zai-org organization, and the Hugging Face model cards have been republished under both THUDM and zai-org namespaces.[^2][^3]
GLM-4-Voice decomposes the speech-to-speech task into three components: a speech tokenizer that turns audio into discrete tokens, an autoregressive language model that consumes and emits these tokens interleaved with text, and a flow-matching decoder that converts model outputs back into audio.[^1][^3] Unlike Moshi, which uses parallel streams of fine and coarse acoustic codes, GLM-4-Voice operates on a single sequence of tokens.[^14][^15]
The tokenizer is built by adding a vector-quantized bottleneck to the encoder of Whisper-large-v3, the largest publicly released variant of OpenAI's automatic speech recognition model.[^1][^5] During training, the Whisper encoder is fine-tuned for two epochs at batch size 4096 and learning rate 1e-5, with codebook vectors updated by an exponential moving average (decay 0.99) and a commitment loss coefficient of 10.0.[^5] The supervised-to-pseudo-labeled sample ratio is 1:3, drawing on a combination of paired ASR data and self-labeled audio.[^5]
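The quantization step described above can be illustrated with a toy, pure-Python sketch. It assigns each encoder frame to its nearest codebook vector, computes a commitment loss scaled by 10.0, and updates the codebook with an exponential moving average at decay 0.99. The function names, the mean reduction of the loss, and the tiny 2-dimensional codebook are assumptions made for illustration; this is not the released training code.

```python
from typing import List, Tuple

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code(frame: Vector, codebook: List[Vector]) -> int:
    """Index of the codebook vector closest to the frame (squared L2)."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(frame, codebook[k]))


def vq_ema_step(frames: List[Vector], codebook: List[Vector],
                ema_count: List[float], ema_sum: List[Vector],
                decay: float = 0.99, beta: float = 10.0,
                eps: float = 1e-5) -> Tuple[List[int], float]:
    """Quantize frames, return (ids, commitment loss), update codebook in place."""
    dim = len(codebook[0])
    ids = [nearest_code(f, codebook) for f in frames]

    # Commitment loss: pulls encoder outputs toward their assigned codes.
    # Mean reduction over frames and dimensions is an assumption.
    sq_err = sum(squared_distance(f, codebook[k]) for f, k in zip(frames, ids))
    commit_loss = beta * sq_err / (len(frames) * dim)

    # EMA update: each code drifts toward the mean of the frames assigned to it.
    for k in range(len(codebook)):
        members = [f for f, i in zip(frames, ids) if i == k]
        batch_sum = [sum(f[d] for f in members) for d in range(dim)]
        ema_count[k] = decay * ema_count[k] + (1 - decay) * len(members)
        ema_sum[k] = [decay * s + (1 - decay) * b
                      for s, b in zip(ema_sum[k], batch_sum)]
        codebook[k] = [s / (ema_count[k] + eps) for s in ema_sum[k]]
    return ids, commit_loss


# Toy data: two codes, three 2-dimensional "encoder frames".
codebook = [[0.0, 0.0], [1.0, 1.0]]
ema_count = [1.0, 1.0]
ema_sum = [row[:] for row in codebook]
frames = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8]]
ids, loss = vq_ema_step(frames, codebook, ema_count, ema_sum)
print(ids, round(loss, 6))
```

In a real tokenizer the frames would be high-dimensional Whisper encoder outputs and the gradient of the commitment loss would flow into the encoder; the sketch only shows the bookkeeping.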
To enable streaming, the original convolutional and bidirectional attention layers of the Whisper encoder are replaced with causal convolution and block-causal attention, so that tokens can be produced as audio arrives rather than only after a full utterance.[^5] Although the team evaluated multiple frame rates and pooling configurations, GLM-4-Voice ultimately uses the 12.5 Hz variant: each second of audio is compressed to 12.5 discrete tokens drawn from a single codebook.[^1][^3] Combined with the codebook size used in the final model, this corresponds to an ultra-low bitrate of approximately 175 bps, an order of magnitude lower than typical neural audio codecs at comparable quality, while still retaining enough semantic and paralinguistic detail for downstream generation.[^1] The tokenizer itself contains around 0.4 billion parameters and is distributed in float32 format on Hugging Face.[^16]
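The bitrate figure follows from the frame rate and the codebook size: a single-codebook tokenizer spends log2(codebook size) bits per frame. The article does not state the codebook size, so the value 16,384 below is an assumption chosen because it reproduces the quoted 175 bps; the helper name is hypothetical.

```python
import math


def bitrate_bps(frame_rate_hz: float, codebook_size: int,
                num_codebooks: int = 1) -> float:
    """Bits per second for a discrete tokenizer with equally sized codebooks."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


# Assumption: 2**14 entries, which matches the ~175 bps figure in the text.
ASSUMED_CODEBOOK_SIZE = 16_384

rate = bitrate_bps(12.5, ASSUMED_CODEBOOK_SIZE)
print(rate)  # 175.0
```

The same function shows why multi-codebook codecs land at much higher bitrates: every extra parallel codebook multiplies the per-frame cost.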
The central language model, GLM-4-Voice-9B, is initialized from GLM-4-9B, an open-weights bilingual text LLM with roughly 9 billion parameters and a transformer decoder architecture in the ChatGLM lineage.[^3][^6] Speech tokens emitted by the tokenizer are added to the model's vocabulary as additional discrete tokens, so that the autoregressive model treats text and speech identically and can predict either modality conditional on either modality.[^1][^3]
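One common way to add discrete speech codes to a text vocabulary is to append them as a contiguous block of new token ids. The sketch below shows that mapping under stated assumptions: the text vocabulary size and codebook size are placeholder values, and the helper names are hypothetical rather than taken from the repository.

```python
from typing import Optional

# Placeholder sizes for illustration only; neither is taken from the article.
TEXT_VOCAB_SIZE = 150_000
SPEECH_CODEBOOK_SIZE = 16_384


def speech_code_to_token_id(code: int) -> int:
    """Map a tokenizer codebook index to an id in the extended vocabulary."""
    if not 0 <= code < SPEECH_CODEBOOK_SIZE:
        raise ValueError(f"speech code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def token_id_to_speech_code(token_id: int) -> Optional[int]:
    """Inverse mapping; returns None when the id is an ordinary text token."""
    code = token_id - TEXT_VOCAB_SIZE
    return code if 0 <= code < SPEECH_CODEBOOK_SIZE else None


first_speech_id = speech_code_to_token_id(0)
print(first_speech_id, token_id_to_speech_code(first_speech_id))
```

With this layout the language model needs no modality-specific logic: a speech token is just another id in its softmax, which is what lets it predict either modality from either modality.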
Continued pre-training scales to approximately 1 trillion tokens combining four data sources: large unsupervised speech corpora, synthetic speech-text interleaved data generated from existing text pre-training data using a text-to-token model, supervised paired ASR and TTS data, and raw text data retained to prevent linguistic regression.[^1] The interleaved data trick is central to the design: the team observed that simply training a text LLM on raw audio tokens degrades reasoning quality, while mixing text and speech tokens at sentence granularity preserves the underlying language model's intelligence while teaching it to handle speech tokens fluently.[^1]
After pre-training, GLM-4-Voice-9B is fine-tuned on high-quality multi-turn spoken dialogues authored to elicit attribute-controlled speech (e.g., dialogues that request a sad tone, a Sichuan dialect, or a faster speech rate). The supervised fine-tuning data is tagged so that the model learns to respect style instructions provided as text prompts.[^1]
The decoder is responsible for converting discrete speech tokens back into a continuous audio waveform. GLM-4-Voice-Decoder is retrained from the open CosyVoice architecture from Alibaba's FunAudioLLM team, which uses a conditional flow matching model to map semantic tokens to mel-spectrograms, followed by a HiFi-GAN-style vocoder to produce a waveform.[^3][^7][^17] Flow matching trains a continuous-time generative model by regressing an ordinary-differential-equation velocity field, and the variant used by CosyVoice supports streaming inference: outputs can begin once a small chunk of tokens is available.[^7][^17]
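The flow-matching idea can be sketched in a few lines, assuming the simplest straight-line probability path between a noise sample and a target frame. The regression target is the constant velocity between the two endpoints, and sampling integrates a velocity field with Euler steps. This is a generic illustration, not the CosyVoice or GLM-4-Voice decoder: the real model conditions on speech tokens and speaker information and learns the field with a neural network, whereas the "oracle" field here is a stand-in.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Regression target for the velocity field along the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted: Vector, x0: Vector, x1: Vector) -> float:
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_sample(velocity_fn: Callable[[Vector, float], Vector],
                 x0: Vector, steps: int = 8) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.5, -1.0, 2.0]
mel_frame = [1.0, 0.0, -1.0]          # toy stand-in for a mel-spectrogram frame
oracle = lambda x, t: target_velocity(noise, mel_frame)  # perfect "model"
sample = euler_sample(oracle, noise)
print(sample)
```

A trained decoder replaces `oracle` with a network evaluated at each step, so the number of Euler steps becomes a quality/latency knob.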
Because GLM-4-Voice operates at 12.5 Hz, the decoder can start producing audio after as few as ten speech tokens have been generated, which the authors report as reducing first-audio latency to a fraction of a second.[^3] Acoustic details such as speaker timbre are supplied to the decoder through reference embeddings rather than baked into the discrete tokens, which is what enables zero-shot voice cloning from a short reference clip.[^7][^17]
A distinguishing feature of GLM-4-Voice over the simpler Mini-Omni approach is its "streaming thoughts" inference: the model first generates a short text response in its context, then begins emitting speech tokens that match the text. Text and speech are interleaved at a fine grain so that text generation does not need to finish before speech begins.[^14][^18] In practice this means the model thinks in text, which keeps reasoning quality close to the underlying GLM-4-9B base, but speaks in audio without an explicit external TTS step.[^18]
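The interleaved output layout can be pictured with a small generator that alternates fixed-size text and speech segments. This only illustrates the shape of the sequence: in the real system the model generates both kinds of tokens itself, and the chunk sizes used here (13 text tokens, 26 speech tokens) are illustrative placeholders rather than values stated in this article.

```python
from typing import Iterator, List, Sequence, Tuple


def interleave(text_tokens: Sequence[int], speech_tokens: Sequence[int],
               text_chunk: int = 13,
               speech_chunk: int = 26) -> Iterator[Tuple[str, List[int]]]:
    """Yield alternating ('text', chunk) and ('speech', chunk) segments."""
    t = s = 0
    while t < len(text_tokens) or s < len(speech_tokens):
        if t < len(text_tokens):
            yield "text", list(text_tokens[t:t + text_chunk])
            t += text_chunk
        if s < len(speech_tokens):
            yield "speech", list(speech_tokens[s:s + speech_chunk])
            s += speech_chunk


# Toy ids: 30 text tokens and 70 speech tokens.
segments = list(interleave(range(30), range(100, 170)))
layout = [(kind, len(chunk)) for kind, chunk in segments]
print(layout)
```

Because the first speech segment follows the first short text segment, audio can start long before the full textual reply exists.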
The paper reports pre-training on roughly 1 trillion tokens, with a large unsupervised speech corpus as one of the main sources. Independent secondary coverage cites approximately 700,000 hours of speech audio as the unsupervised component, alongside supervised ASR and TTS pairs and synthetic interleaved data.[^1][^14][^18] Although the paper does not list specific dataset names, the unsupervised speech is drawn primarily from publicly available Chinese and English audio corpora consistent with the model's bilingual focus.[^1] The fine-tuning corpus consists of high-quality multi-turn spoken dialogues that are tailored to specific speech style requirements, allowing the model to learn style-controllable spoken dialogue.[^1][^18]
Because paired speech-text data on the scale required for LLM pre-training is impractical to collect, the authors describe a key trick for transferring text-domain knowledge into the speech tokenizer's vocabulary: synthesizing speech-text interleaved data from the existing text pre-training corpus of GLM-4-9B. A text-to-token model (a small TTS-style system trained to output GLM-4-Voice speech tokens rather than audio) is run over chunks of text, producing artificial sequences in which spoken renditions of certain phrases are inserted into otherwise textual context.[^1] This is conceptually similar to the use of synthetic data in vision LLMs to teach a model what image patches correspond to which captions, and it preserves the reasoning capacity of the base text LLM while exposing it to large quantities of speech tokens. The technique was credited in later work as one of the key contributors to GLM-4-Voice's strong spoken-QA results relative to Mini-Omni, which omits this step.[^14]
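A minimal sketch of how such interleaved samples might be assembled is shown below, assuming a word-level view of the text and a random choice of which spans to "speak". The `text_to_token_stub` function is a hypothetical placeholder for the text-to-token model (it emits two fake speech tokens per word), and the span length and sampling probability are arbitrary illustration values.

```python
import random
from typing import List, Optional


def text_to_token_stub(words: List[str]) -> List[str]:
    """Hypothetical stand-in for the text-to-token model: 2 tokens per word."""
    tokens = []
    for word in words:
        code = sum(ord(ch) for ch in word) % 16_384
        tokens += [f"<speech_{code}>", f"<speech_{(code + 1) % 16_384}>"]
    return tokens


def build_interleaved(words: List[str], span_len: int = 3,
                      speech_prob: float = 0.5,
                      rng: Optional[random.Random] = None) -> List[str]:
    """Replace randomly chosen spans of text with synthetic speech tokens."""
    rng = rng or random.Random(0)
    out: List[str] = []
    for start in range(0, len(words), span_len):
        span = words[start:start + span_len]
        if rng.random() < speech_prob:
            out.extend(text_to_token_stub(span))   # spoken rendition
        else:
            out.extend(span)                       # keep as plain text
    return out


sentence = "the decoder turns discrete tokens back into audio".split()
mixed = build_interleaved(sentence)
print(mixed)
```

No audio is ever synthesized in this scheme, which is what makes it cheap enough to run over a text pre-training corpus.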
After the synthetic interleaved phase, the model is trained on a mixture of raw text, unsupervised speech, supervised ASR pairs, and supervised TTS pairs, with the sampling ratios chosen so that no single modality dominates. The total budget is approximately 1 trillion tokens. The unsupervised speech component is counted at 700,000 hours of audio, equivalent to roughly 31.5 billion speech tokens at 12.5 Hz, or about 3% of the budget for a single pass over the corpus, so the bulk of the token count has to come from the other sources, notably the synthetic interleaved data.[^1][^14] Supervised fine-tuning then uses curated, multi-turn spoken dialogues with explicit style tags, taught alongside text instructions that describe target attributes like dialect, emotion, and rate.[^1]
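The hours-to-tokens conversion is plain arithmetic and is easy to check. The sketch below assumes each hour of audio is tokenized once at 12.5 Hz and compares the result with the roughly 1 trillion token budget; the helper name is hypothetical.

```python
def speech_tokens_for_hours(hours: float, token_rate_hz: float = 12.5) -> float:
    """Number of speech tokens produced by tokenizing `hours` of audio once."""
    return hours * 3600 * token_rate_hz


tokens = speech_tokens_for_hours(700_000)
share_of_budget = tokens / 1e12   # fraction of a 1-trillion-token budget

print(f"{tokens:.3e} tokens, {share_of_budget:.2%} of 1T")
```

Under these assumptions the 700,000 hours correspond to about 31.5 billion tokens, a few percent of the total for one pass over the data.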
The headline capability is direct speech-in, speech-out conversation. A user speaks into a microphone, the audio is tokenized in real time at 12.5 Hz, the GLM-4-Voice-9B model consumes the tokens and generates a textual reply followed by speech tokens, and the decoder streams the spoken response back. Because no intermediate transcription is exposed, the model can in principle respond to non-lexical cues such as sighs, laughter, or interruptions, although the authors note that fully duplex behavior (talking and listening simultaneously) is not as developed as in Moshi.[^14][^18]
By passing a short reference clip to the flow-matching decoder, GLM-4-Voice can resynthesize the model's output in a similar voice without retraining.[^3][^7] This inherits the zero-shot voice-cloning behavior of CosyVoice. Voice control is limited by the quality and length of the reference: the system targets a reasonable match to speaker timbre rather than strict perceptual identity, and it does not include any explicit watermarking or anti-cloning safeguard.[^7]
Because supervised fine-tuning includes dialogues annotated for emotion, intonation, speech rate, and dialect, GLM-4-Voice can be steered with natural-language instructions. Prompts such as "please answer slowly in a sad tone" or "respond in Sichuan dialect" alter the speech tokens before they reach the decoder, producing audibly different output.[^1][^3] The set of supported dialects is centered on regional varieties of Mandarin, including Sichuan and Cantonese-influenced styles, although the underlying model also covers standard English.[^18]
The model handles Chinese and English natively in both input and output, including code-switching within a single utterance. Performance is strongest on Mandarin, reflecting the composition of the training corpus, but the system answers conversational English queries fluently.[^1][^3]
Because the tokenizer is causal and the decoder is streaming, the system can in principle start emitting audio while the model is still computing the rest of its reply. The repository documents a default configuration where the model emits a short text segment, then produces speech tokens for that segment while continuing to plan the next segment, with the decoder beginning playback after 10 speech tokens are available (roughly 0.8 seconds of audio buffered, but with sub-second time to first audible byte).[^3][^14] This approach contrasts with cascaded systems, which must complete the full LLM reply before the TTS engine can begin.
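The buffering behaviour described here can be sketched as a generator that groups a stream of speech tokens into fixed-size chunks for the decoder. The chunk size of 10 and the 12.5 Hz rate come from the text; the function names are hypothetical and the token stream is a toy range rather than model output.

```python
from typing import Iterable, Iterator, List

TOKEN_RATE_HZ = 12.5


def chunk_for_decoder(tokens: Iterable[int],
                      chunk_size: int = 10) -> Iterator[List[int]]:
    """Yield a chunk as soon as `chunk_size` tokens are buffered."""
    buffer: List[int] = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) == chunk_size:
            yield buffer
            buffer = []
    if buffer:            # flush the final partial chunk at end of reply
        yield buffer


def audio_seconds(num_tokens: int) -> float:
    """Duration of audio represented by `num_tokens` speech tokens."""
    return num_tokens / TOKEN_RATE_HZ


chunks = list(chunk_for_decoder(range(25)))
print([len(c) for c in chunks], audio_seconds(10))
```

Each 10-token chunk covers 0.8 seconds of speech, so playback of one chunk gives the model time to generate the next.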
Because the LLM backbone is a chat-style GLM-4-9B variant, GLM-4-Voice respects custom system prompts in the same way text chatbots do. Operators can constrain the assistant's persona, voice, or content policy by editing the system prompt; the model preserves these instructions across long conversations to the extent the underlying GLM-4-9B context window allows.[^3] The default web demo bundles example system prompts that demonstrate how to elicit different speaking styles.
The arXiv paper reports state-of-the-art results among open speech LLMs on two main settings.[^1] In a speech-language modeling benchmark covering speech-to-speech and speech-to-text generation, GLM-4-Voice outperforms previous open systems. In spoken question answering, an aggregate evaluation that measures answer correctness when both the question and answer are spoken, later third-party studies report a General QA score of 5.40 for GLM-4-Voice against 2.44 for Mini-Omni and 2.42 for Moshi. The authors attribute this wide gap to the text-grounded interleaved generation that preserves the reasoning of the GLM-4-9B base model.[^15][^18] At the same time, third-party comparisons consistently show that cascaded systems combining a strong ASR model with a frontier text LLM still outperform any open end-to-end model, including GLM-4-Voice, on knowledge-heavy spoken QA tasks.[^15] On standard ASR and TTS metrics the model is competitive but not state of the art, and later work from the same team, such as GLM-TTS (released in late 2025), materially improves Chinese ASR character-error-rate on dialect-heavy benchmarks.[^19]
The paper evaluates on several axes. For ASR, it reports word-error-rate on English (LibriSpeech-style benchmarks) and character-error-rate on Chinese (AISHELL-style benchmarks), with GLM-4-Voice competitive with dedicated ASR systems though not state of the art on either.[^1] For TTS, it reports character-error-rate of synthesized output (a measure of intelligibility) and UTMOS, an automated mean-opinion-score predictor for perceived speech quality.[^1] For end-to-end performance, it relies on ChatGPT scoring of spoken answers, treating an LLM judge as the proxy for human evaluation. Across these axes the model is presented as the first open release that simultaneously matches dedicated cascades on text quality and dedicated TTS systems on naturalness.[^1][^18]
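Word-error-rate and character-error-rate are both edit distances normalized by the reference length, computed over words and characters respectively. The sketch below is a generic implementation of that definition, not the paper's evaluation script; real evaluations usually also normalize punctuation and casing first, which is omitted here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits per reference character."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits per reference word."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(cer("今天天气很好", "今天天气真好"))        # one substitution over six characters
print(wer("the cat sat", "the cat sat down"))  # one insertion over three words
```

For TTS intelligibility the hypothesis is typically an ASR transcript of the synthesized audio, so the metric also inherits errors from the ASR system used for scoring.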
Later papers that benchmark open speech systems against each other generally use GLM-4-Voice as a baseline. The SOVA-Bench evaluation, MinMo, DeepTalk, and others all cite the model and reproduce its public checkpoints, finding that the released weights perform broadly in line with what the paper reports while also confirming the gap relative to cascaded GPT-4o-class systems on knowledge tasks.[^15][^18] By 2026, both the GLM-4-Voice tokenizer and the underlying GLM-4-9B base have been superseded inside Zhipu AI by their successors, although the released checkpoints continue to be widely used as a reference baseline.[^19]
| System | Released | Backbone | Speech rep. | Open weights | Languages |
|---|---|---|---|---|---|
| GPT-4o / Realtime API | May 2024[^13] | Proprietary | Proprietary tokens | No | Multilingual |
| Mini-Omni | Aug 2024[^14] | Qwen2-0.5B / 7B | SNAC tokens | Yes | English |
| Moshi | Sept 2024[^14][^20] | Helium 7B | Mimi codec, dual stream | Yes | English, French |
| LLaMA-Omni | Sept 2024[^14] | Llama-3.1-8B | HuBERT units | Yes | English |
| GLM-4-Voice | Oct 2024[^4] | GLM-4-9B | Whisper-VQ 12.5 Hz | Yes | Chinese, English |
GLM-4-Voice is the only model in this cohort built on a bilingual Chinese/English LLM backbone, and it emits notably fewer tokens per second than Moshi: both run at a 12.5 Hz frame rate, but Moshi's Mimi codec adds parallel acoustic streams on top of its semantic stream, whereas GLM-4-Voice uses a single 12.5 Hz codebook. This simplifies modeling at the cost of acoustic detail.[^1][^20] Mini-Omni and LLaMA-Omni use ASR-derived tokens but rely on smaller LLM backbones, while Moshi pursues full-duplex conversation but at the cost of weaker multilingual coverage.[^14][^15]
GLM-4-Voice helped establish that an open lab could match the architectural approach taken by GPT-4o, at least at the level of system design, using fully open building blocks: an OpenAI Whisper-derived tokenizer, an open Chinese/English LLM backbone, and a CosyVoice-derived decoder.[^1][^3] It also demonstrated that interleaved speech-text training on a modest amount of additional compute (relative to text pre-training) is sufficient to give a strong text LLM a usable voice modality, which influenced subsequent work such as LLaMA-Omni 2 and MinMo.[^14] In the Chinese AI ecosystem, where commercial APIs from OpenAI are unavailable, the model also serves as a reference implementation for voice assistants that can run domestically on commodity GPUs.[^4]
For Zhipu AI specifically, GLM-4-Voice fit into the company's broader strategy of releasing open-weights models in every major modality, a strategy that culminated in the IPO of the Z.ai entity in early 2026 and the release of larger general-purpose models GLM-4.5, GLM-4.6, and ultimately GLM-5.[^10]
A number of subsequent open systems explicitly cite GLM-4-Voice and reproduce or modify its design.[^14][^15][^18] LLaMA-Omni 2 adopts a similar interleaved text-speech generation strategy but pairs it with a Llama backbone and an autoregressive TTS head; MinMo adopts the same overall three-component pipeline but expands the supervised fine-tuning corpus and adds a more sophisticated voice activity detection module to enable closer-to-duplex operation.[^14] DeepTalk and DeepOmni explore mixture-of-experts adaptations to the same paradigm.[^15] In each case, the GLM-4-Voice ablation that "synthetic interleaved data preserves text reasoning while still teaching speech" has become a default assumption rather than a contested design choice.[^14][^18]
Because GLM-4-Voice fits on a single 24 GB GPU at INT4 precision and runs comfortably on a 40 GB GPU at bfloat16, it has been adopted by a number of Chinese voice assistant projects and educational platforms as a self-hostable alternative to commercial APIs. The Apache 2.0 license on the code and the absence of usage tracking in the weights distribution make it suitable for offline or air-gapped deployments, although the proprietary Model License Agreement on the weights imposes some restrictions on certain commercial uses.[^2] Hugging Face download statistics consistently rank the GLM-4-Voice tokenizer and GLM-4-Voice-9B among the most-downloaded Chinese speech models, with the tokenizer alone seeing tens of thousands of downloads per month.[^16][^21]
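A back-of-the-envelope estimate shows why these GPU sizes are plausible. The sketch below counts only the language model's weights, using a rounded 9 billion parameters; it ignores the tokenizer, the decoder, the KV cache, activations, and any quantization overhead, so real memory use will be higher. The helper name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for model weights only, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 9e9  # rounded parameter count of the 9B backbone

bf16_gb = weight_memory_gb(PARAMS, 16)
int4_gb = weight_memory_gb(PARAMS, 4)
print(bf16_gb, int4_gb)  # 18.0 4.5
```

Roughly 18 GB of bfloat16 weights leaves headroom on a 40 GB card, and roughly 4.5 GB of INT4 weights leaves room on a 24 GB card for the other components and runtime buffers.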
The authors and third-party reviewers note several limitations of GLM-4-Voice as released in late 2024.[^1][^15][^18]
First, end-to-end speech systems still underperform cascaded pipelines that combine a strong ASR model and a frontier text LLM on knowledge-heavy spoken QA. While GLM-4-Voice is much better than Moshi or Mini-Omni at general QA, it can lose factual accuracy compared with a cascade of Whisper plus a frontier text model.[^15]
Second, the model is not full-duplex. The generation loop produces speech in response to a completed user utterance rather than continuously processing incoming audio alongside its own outgoing speech, in contrast to Moshi's full-duplex design.[^14] This limits how naturally the system can handle real-time interruptions, although the streaming decoder helps reduce first-audio latency.
Third, the use of a single-codebook 12.5 Hz tokenizer makes the model efficient but limits acoustic fidelity. The 175 bps representation discards much of the fine-grained acoustic information that a higher-rate codec would preserve, which is acceptable for conversational quality but inferior for music or expressive performance.[^1] Subsequent work by the same team (GLM-TTS) reports that replacing the tokenizer with a higher-capacity variant materially improves ASR character-error-rate on Sichuan dialect (from 54.11% to 24.40%) and TTS metrics, indicating that the GLM-4-Voice tokenizer is a bottleneck.[^19]
Fourth, the model's understanding and generation in English is good but lags Chinese, reflecting the bilingual-but-Chinese-leaning training mix.[^1]
Fifth, the weights are released under a proprietary Model License Agreement rather than a fully permissive license such as the MIT License used by later models in the family (GLM-4.5 onward), which constrains certain commercial uses.[^2][^10]
GLM-4-Voice is distributed as three independent checkpoints on Hugging Face and ModelScope, with reference code on GitHub.[^2][^3]
| Artifact | Repository | Role |
|---|---|---|
| GLM-4-Voice-Tokenizer | huggingface.co/THUDM/glm-4-voice-tokenizer (mirror at zai-org)[^16] | Whisper-VQ speech encoder |
| GLM-4-Voice-9B | huggingface.co/THUDM/glm-4-voice-9b[^21] | Speech-aware GLM-4-9B language model |
| GLM-4-Voice-Decoder | huggingface.co/THUDM/glm-4-voice-decoder[^22] | CosyVoice-style flow-matching decoder |
| Code | github.com/THUDM/GLM-4-Voice (mirror at zai-org/GLM-4-Voice)[^2] | Reference implementation, web demo, Docker image |
Inference is supported in bfloat16 and INT4 precision on a single CUDA GPU, with a Docker image (zhipuai/glm-4-voice:0.1) provided for reproducibility.[^2][^3] A Gradio-based web demo is included in the repository to exercise the full speech-to-speech loop.[^2]
The release received attention in the Chinese and English AI press as one of the most credible open answers to GPT-4o's voice modality, with MarkTechPost calling it "a new open-source end-to-end speech large language model" that "integrates speech recognition, language understanding, and speech generation" with "lower latency compared to predecessor models."[^4] Subsequent papers on end-to-end speech LLMs use GLM-4-Voice as a standard baseline alongside Moshi, Mini-Omni, and LLaMA-Omni, treating it as the representative open-weights system for Chinese-language voice conversation.[^15][^18]
The GLM-4 model family from Zhipu AI / Tsinghua KEG spans several modalities and several generations:[^6][^10]
| Generation | Year | Notes |
|---|---|---|
| GLM-130B | 2022 | 130B bilingual base model[^9] |
| ChatGLM | 2023 | Dialogue-tuned series leading to GLM-4 |
| GLM-4 / GLM-4-9B | 2024 | Open 9B text model trained on ~10T tokens[^6] |
| GLM-4-Voice | Oct 2024 | Speech-to-speech extension of GLM-4-9B[^4] |
| GLM-4.5 | 2025 | Larger reasoning-focused successor[^10] |
| GLM-4.6 | 2025-2026 | Iterative improvement before GLM-5 |
| GLM-5 | 2026 | Announced at the time of Zhipu's IPO[^10] |
GLM-4-Voice is the only speech-modality member of the lineage as of 2026, although the GLM-TTS technical report (December 2025) describes a successor TTS-focused system built by the same group.[^19]
GLM-4-Voice sits in the family of end-to-end neural multimodal AI systems that operate jointly on text and audio. Direct comparison points include Moshi from Kyutai (full-duplex, dual-stream acoustic codes), the proprietary voice mode of GPT-4o and the Realtime API from OpenAI, and the CosyVoice family from Alibaba, whose decoder GLM-4-Voice reuses.[^7][^13][^14][^20] The tokenizer is a derivative of OpenAI Whisper, and the decoder uses flow matching for continuous-time speech generation.[^1][^7][^17] The base language model belongs to the same large language model family as GLM-4.5, GLM-4.6, and GLM-5, and the wider Chinese open-weights ecosystem includes models from Qwen and DeepSeek that pursue similar bilingual goals in text and reasoning rather than speech.