Wu Dao
Last reviewed
Jun 3, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 1,757 words
Wu Dao (Chinese: 悟道, roughly "enlightenment" or "understanding the way") is a series of large pretrained models built by the Beijing Academy of Artificial Intelligence (BAAI), a non-profit research institute founded in 2018 with backing from the Beijing municipal government and the Chinese Ministry of Science and Technology. The project was led by Tang Jie (Jie Tang), a Tsinghua University professor and a vice president at BAAI, and drew on more than 100 researchers from institutions including Tsinghua, Peking University, and the Chinese Academy of Sciences.[1][2] Wu Dao 1.0 was released in March 2021, and Wu Dao 2.0, announced at the BAAI Conference at the start of June 2021, was a multimodal model widely reported at 1.75 trillion parameters. That figure, a sparse mixture-of-experts count rather than a dense one, made it the largest reported neural network in the world at the time, ahead of Google's Switch Transformer and OpenAI's GPT-3.[3][4]
Wu Dao was a direct response to GPT-3, which OpenAI had unveiled in 2020. BAAI began the effort in late 2020 with the goal of building Chinese-language foundation models at a scale comparable to, or larger than, the leading American systems, and of doing so on infrastructure that did not depend on Google's proprietary stack.[2][5] The institute framed the work as a step toward general artificial intelligence and toward an "AI application ecosystem" that Chinese developers could build on, language that recurred in BAAI's announcements and in state media coverage.[3]
The series takes its name from a Daoist and Buddhist term for awakening or grasping the underlying principle of things. Rather than a single model, Wu Dao was an umbrella for several research lines pursued in parallel, spanning text, images, and even protein sequences.
BAAI introduced Wu Dao 1.0 in March 2021, presenting it not as one network but as a collection of four sub-projects, each named with the character "Wen" (文, "writing" or "culture") and each led by a different team:[1][2]
| Sub-project | Focus | Reported scale |
|---|---|---|
| Wen Yuan (文渊) | Chinese language model (CPM) | 2.6 billion parameters |
| Wen Lan (文澜) | Image-text multimodal model (BriVL) | trained on tens of millions of image-text pairs |
| Wen Hui (文汇) | Cognitive / bilingual model (GLM) | about 11.3 billion parameters |
| Wen Su (文溯) | Biomolecular model for protein and gene data | trained on the UniParc protein database |
The Wen Yuan model, CPM (Chinese Pretrained Model), was described by BAAI as China's largest pretrained Chinese language model at the time and was reported to match GPT-3-style performance on a set of Chinese natural-language tasks.[1] The Wen Hui line introduced the General Language Model (GLM) architecture, which aimed to unify different pretraining objectives in a single framework and which later became the basis for the open GLM-130B model and the ChatGLM chatbots from Tang Jie's startup Zhipu AI.[5] Wu Dao 1.0 thus established the components that the 2.0 release would scale up: the GLM framework and the multimodal pieces (BriVL and the CogView text-to-image model).
Wu Dao 2.0 was unveiled at the 2021 BAAI Conference in Beijing at the start of June 2021. BAAI reported that it contained 1.75 trillion parameters, roughly ten times the 175 billion of GPT-3, and presented it as a single multimodal model able to handle both language and vision.[3][4][6] Unlike GPT-3, which is text-only, Wu Dao 2.0 was trained jointly on text and images and was demonstrated on tasks that crossed the two modalities.
The training corpus was reported at 4.9 terabytes of high-quality data, including roughly 1.2 terabytes of Chinese text, 1.2 terabytes of English text, and about 2.5 terabytes of Chinese image data.[4][6] BAAI contrasted this with GPT-3's curated text set, although the two figures are not measured the same way. The model incorporated the GLM architecture, the CogView text-to-image system, and an algorithm BAAI called P-Tuning for adapting the model to downstream tasks with few examples.[6]
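As a quick consistency check on the reported corpus, the three components can be summed and expressed as shares of the total. The sketch below uses only the figures quoted above. The dictionary keys and the `share` helper are illustrative names, and exact fractions are used so that the sum is not blurred by floating-point rounding.

```python
from fractions import Fraction

# Reported corpus components in terabytes (figures as quoted in the text).
corpus_tb = {
    "Chinese text": Fraction("1.2"),
    "English text": Fraction("1.2"),
    "Chinese image data": Fraction("2.5"),
}

# Exact sum of the three reported components.
total_tb = sum(corpus_tb.values())


def share(name):
    """Fraction of the reported total contributed by one component."""
    return corpus_tb[name] / total_tb


for name, size in corpus_tb.items():
    print(f"{name}: {float(size)} TB ({float(share(name)):.1%})")
print(f"total: {float(total_tb)} TB")
```

On these figures the components add up to the reported 4.9 terabytes, and image data accounts for slightly more than half of the corpus by size.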
BAAI said Wu Dao 2.0 reached or beat the state of the art on nine benchmarks spanning language and vision, with comparisons drawn against systems such as OpenAI's CLIP and DALL-E, Google's ALIGN, and Microsoft's Turing-NLG on tasks including SuperGLUE, LAMBADA, zero-shot ImageNet classification, and image-text retrieval on MS COCO.[4][6] These results were announced rather than published in a peer-reviewed paper, and contemporaneous coverage noted that the absence of a full technical report made the numbers hard to verify independently.[4]
The headline 1.75 trillion figure is a sparse mixture-of-experts count, and understanding that is essential to reading it correctly. In a mixture-of-experts model, the network contains many parallel "expert" sub-networks, but only a small subset is activated for any given input, so the total parameter count can be very large while the parameters actually used per token, and the compute per token, remain far smaller. This is the same design Google used to cross the trillion-parameter line with its Switch Transformer (reported at 1.6 trillion parameters) earlier in 2021. A dense model such as GPT-3 activates all of its parameters for every token, so a sparse trillion-parameter count is not directly comparable to a dense one of similar size.[3][4]
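The gap between stored and active parameters can be illustrated with a toy example. The following is a minimal sketch, not Wu Dao 2.0's actual router: each "expert" is reduced to a single scalar weight, the class and function names are hypothetical, and the expert counts and sizes in the usage lines are made-up numbers chosen only to show the arithmetic.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ToyMoELayer:
    """Toy sparse layer; each expert is one scalar weight (illustrative only)."""

    def __init__(self, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.expert_weights = [rng.uniform(-1.0, 1.0) for _ in range(num_experts)]
        self.gate_weights = [rng.uniform(-1.0, 1.0) for _ in range(num_experts)]

    def route(self, x):
        """Return (expert index, renormalized gate weight) for the top_k experts."""
        probs = softmax([g * x for g in self.gate_weights])
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[: self.top_k]
        norm = sum(probs[i] for i in chosen)
        return [(i, probs[i] / norm) for i in chosen]

    def forward(self, x):
        """Evaluate only the chosen experts; the rest are never touched."""
        return sum(w * self.expert_weights[i] * x for i, w in self.route(x))


def param_counts(num_experts, params_per_expert, shared_params, top_k):
    """Total parameters stored vs. parameters used for a single token."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active


layer = ToyMoELayer(num_experts=8, top_k=1)
print(layer.route(0.5))    # one (expert, weight) pair out of eight experts
print(layer.forward(0.5))

# Hypothetical sizes: 64 experts of 1M parameters each, 2M shared parameters.
print(param_counts(64, 1_000_000, 2_000_000, top_k=1))  # (66000000, 3000000)
```

In this made-up configuration the layer stores 66 million parameters but touches only 3 million per token, which is why a sparse headline count overstates per-token capacity and compute relative to a dense model of the same nominal size.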
What was genuinely novel was the training infrastructure. Wu Dao 2.0 was trained using FastMoE, an open-source mixture-of-experts training system described in a March 2021 paper (arXiv:2103.13262) by Jiaao He, Jiezhong Qiu, Aohan Zeng, Zhilin Yang, Jidong Zhai, and Tang Jie, affiliated with Tsinghua and BAAI.[7] The paper's motivation is explicit: the only existing platform capable of training trillion-scale mixture-of-experts models depended on Google's TPU hardware and its Mesh-TensorFlow software, and was not available to the wider GPU and PyTorch community. FastMoE provided a PyTorch-based system that ran on commodity GPUs and let the number of experts scale with the number of GPUs, making large MoE training accessible without Google's stack.[7] Reporting at the time noted that Wu Dao 2.0 was trained partly on China's Sunway TaihuLight supercomputer.[5]
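The idea that the expert count grows with the GPU count can be sketched in a few lines. This is not FastMoE's API: the function names below are hypothetical, and the layout (a fixed number of experts per GPU, numbered contiguously) is an assumption made for illustration. The sketch shows only the bookkeeping of which GPU owns which expert and how tokens would be grouped before being sent there.

```python
def total_experts(experts_per_gpu, num_gpus):
    """With a fixed number of experts per GPU, adding GPUs adds experts."""
    return experts_per_gpu * num_gpus


def owner_gpu(expert_id, experts_per_gpu):
    """GPU that holds a given expert, assuming contiguous numbering."""
    return expert_id // experts_per_gpu


def dispatch(token_to_expert, experts_per_gpu, num_gpus):
    """Group token indices by the GPU that owns their assigned expert."""
    buckets = {gpu: [] for gpu in range(num_gpus)}
    for token_idx, expert_id in enumerate(token_to_expert):
        buckets[owner_gpu(expert_id, experts_per_gpu)].append(token_idx)
    return buckets


# Four GPUs with two experts each give eight experts in total.
print(total_experts(experts_per_gpu=2, num_gpus=4))  # 8

# Tokens 0..3 were routed to experts 0, 3, 7 and 2 respectively.
print(dispatch([0, 3, 7, 2], experts_per_gpu=2, num_gpus=4))
# {0: [0], 1: [1, 3], 2: [], 3: [2]}
```

In a real system the grouped tokens would be exchanged between devices, processed by the local experts, and sent back. That communication step is omitted here.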
The comparison below summarizes how the reported scale of Wu Dao 2.0 related to its contemporaries.
| Model | Developer | Reported parameters | Type | Year |
|---|---|---|---|---|
| GPT-3 | OpenAI | 175 billion | Dense | 2020 |
| Switch Transformer | Google | 1.6 trillion | Sparse (MoE) | 2021 |
| Wu Dao 2.0 | BAAI | 1.75 trillion | Sparse (MoE) | 2021 |
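The ratios implied by these reported figures can be checked directly. The sketch below takes the parameter counts at face value. It does not correct for the dense-versus-sparse distinction, so the ratios describe headline size only, and the `ratio` helper is an illustrative name.

```python
from fractions import Fraction

# Reported parameter counts, taken from the table at face value.
reported_params = {
    "GPT-3": 175 * 10**9,
    "Switch Transformer": 1600 * 10**9,
    "Wu Dao 2.0": 1750 * 10**9,
}


def ratio(model_a, model_b):
    """Exact ratio of two reported parameter counts."""
    return Fraction(reported_params[model_a], reported_params[model_b])


print(float(ratio("Wu Dao 2.0", "GPT-3")))               # 10.0
print(float(ratio("Wu Dao 2.0", "Switch Transformer")))  # 1.09375
```

On paper, Wu Dao 2.0 was ten times GPT-3's reported size but only about 9 percent larger than the Switch Transformer.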
At launch BAAI showed Wu Dao 2.0 writing essays, poems, and couplets in classical Chinese, generating descriptive captions from images, and producing images from text prompts via CogView.[4][6] The institute also said the underlying technology could be applied to areas such as protein structure prediction, echoing the Wen Su line from Wu Dao 1.0.
The most publicized demonstration was Hua Zhibing (华智冰), described as China's first AI-powered virtual student. On June 3, 2021, she was presented as enrolling in the Department of Computer Science and Technology at Tsinghua University. Her appearance, voice, and the paintings and other content attributed to her were generated using Wu Dao 2.0, with the virtual human built jointly by BAAI, Zhipu AI, and Xiaoice (Microsoft's former chatbot unit, by then a separate company). BAAI said Hua Zhibing would continue to learn and improve over time.[8][9] Coverage of her reasoning abilities was promotional, and outside observers urged caution about how much real capability the demonstration reflected.[4]
Wu Dao 2.0 drew wide attention as a marker of how quickly Chinese institutions were matching and, on paper, exceeding the scale of leading Western models. Press accounts often led with the contrast that it was ten times the size of GPT-3 and larger than the Switch Transformer.[3][4] OpenAI later used the episode as an example of "model diffusion," the spread of frontier capabilities beyond the labs that pioneered them.[10]
Skeptics raised several caveats that have aged well. The 1.75 trillion number is a sparse count and not comparable to GPT-3's dense parameters, a distinction that much of the popular coverage blurred.[3][4] No peer-reviewed paper or full evaluation accompanied the launch, so the benchmark claims could not be checked, and the model itself was never released for general use.[4] More fundamentally, the sheer size made the model impractical: a later retrospective reported that Wu Dao 2.0 occupied roughly 20 terabytes, required hundreds of A100 GPUs just for inference, suffered from catastrophic forgetting, and in practice performed worse than the much smaller GLM-10B from the 1.0 generation.[5] In other words, the record-setting parameter count delivered limited usable capability.
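The "hundreds of A100 GPUs" figure is consistent with simple capacity arithmetic on the reported 20-terabyte footprint. The sketch below assumes decimal units, the two A100 memory configurations (40 GB and 80 GB), and that the whole footprint must be resident in GPU memory. It ignores activations and any offloading to host memory, so it is a rough lower bound rather than a description of the actual deployment, and `min_gpus_to_hold` is an illustrative helper.

```python
TB = 10**12  # decimal terabyte, an assumption about how the figure was reported
GB = 10**9


def min_gpus_to_hold(model_bytes, gpu_memory_bytes):
    """Lower bound on GPU count from memory capacity alone (ceiling division)."""
    return -(-model_bytes // gpu_memory_bytes)


reported_footprint = 20 * TB  # figure quoted in the retrospective

for label, memory in (("A100 40 GB", 40 * GB), ("A100 80 GB", 80 * GB)):
    print(label, min_gpus_to_hold(reported_footprint, memory))
# A100 40 GB 500
# A100 80 GB 250
```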
The Wu Dao name continued, but BAAI's strategy shifted decisively from headline scale toward practical, open models. The GLM architecture introduced in Wu Dao 1.0 fed directly into GLM-130B, a 130-billion-parameter bilingual model that Tang Jie's startup Zhipu AI trained over several months in 2022 and that, with the ChatGLM chatbots, became one of China's more widely used open model families.[5]
Under the Wu Dao 3.0 banner announced in June 2023, BAAI released the Aquila series of open bilingual large language models along with vision and multimodal models and the FlagEval evaluation suite.[5] BAAI also became known for the BGE (BAAI General Embedding) text-embedding models, which were widely adopted in retrieval and retrieval-augmented generation pipelines. Seen in that light, Wu Dao 2.0 reads less as a lasting product than as a moment: a demonstration that Chinese labs could build infrastructure such as FastMoE and reach frontier scale, after which the field, BAAI included, moved on to models that people could actually run.