ERNIE 4.5
Last reviewed
May 31, 2026
Sources
14 citations
Review status
Source-backed
Revision
v1 · 2,239 words
ERNIE 4.5 is a family of large language models released by the Chinese technology company Baidu, open-sourced on June 30, 2025 under the Apache 2.0 license [1][2]. The family spans ten distinct variants, from a compact dense model with a few hundred million parameters up to a Mixture-of-Experts (MoE) model with 424 billion total parameters [1][3]. Several of the models are multimodal, meaning they process images and video alongside text. ERNIE 4.5 is the latest entry in Baidu's long-running ERNIE line and, more notably, marks the first time the company released its flagship foundation models as open weights. For broader background on the family's predecessors, see Baidu ERNIE.
The release mattered well beyond Baidu's own product roadmap. For years the company's founder and chief executive Robin Li had argued in public that closed, proprietary development was the only sensible path for frontier models [9]. Opening the ERNIE 4.5 weights reversed that position, and it placed one of China's largest AI labs directly inside the open-weight ecosystem that DeepSeek, Alibaba's Qwen, and others had been building through 2024 and 2025.
ERNIE 4.5 did not start out as an open model. Baidu first launched it on March 16, 2025 as a proprietary, natively multimodal foundation model, announced alongside a deep-thinking reasoning model called ERNIE X1 [10][11]. At that same launch the company said it would make its ERNIE Bot assistant free to individual users ahead of schedule, another break from its earlier paid, closed approach [10].
The open-sourcing had been signposted in advance. In February 2025 Baidu said it planned to roll the ERNIE 4.5 series out gradually and to release the code openly on June 30 [9]. That announcement came only weeks after DeepSeek-R1 drew global attention to how competitive open Chinese models had become, and the timing was hard to miss. Baidu had been one of the more vocal defenders of the proprietary approach, so the reversal read as a direct response to a shifting market rather than a long-held plan.
When the weights did land at the end of June, they went up on Hugging Face, on GitHub, and on Baidu's own AI Studio and PaddlePaddle ecosystem [1][2][4]. Western outlets sometimes dated the drop to July 1 because of time-zone differences, but the official ERNIE blog gives June 30, 2025 [1][5].
The ERNIE 4.5 family is built as five model designs, each shipped in two forms: a Base version that has only been pre-trained, and a post-trained version tuned for instruction following and chat. That pairing is how Baidu reaches the figure of ten variants [1][3]. The designs cover three categories: a small dense model, two text-only Mixture-of-Experts models, and two multimodal vision-language Mixture-of-Experts models.
The naming follows a pattern. A label like 300B-A47B means roughly 300 billion total parameters with about 47 billion activated for any given token, which is the hallmark of a sparse MoE design where only a slice of the network runs on each forward pass. The VL tag in a model name marks the vision-language models that handle images and video.
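The naming pattern is regular enough to parse mechanically. The following is a minimal sketch, assuming names shaped like the ones in this article; the `SizeLabel` type and `parse_size_label` function are hypothetical helpers for illustration, not part of any Baidu tooling.

```python
import re
from typing import NamedTuple, Optional


class SizeLabel(NamedTuple):
    total_b: float             # total parameters, in billions
    active_b: Optional[float]  # parameters activated per token, in billions (None if dense)
    vision_language: bool      # True when the name carries the VL tag


# Matches names such as "ERNIE-4.5-VL-424B-A47B", "ERNIE-4.5-21B-A3B-Base", "ERNIE-4.5-0.3B".
_PATTERN = re.compile(
    r"^ERNIE-4\.5-(?P<vl>VL-)?(?P<total>\d+(?:\.\d+)?)B(?:-A(?P<active>\d+(?:\.\d+)?)B)?"
)


def parse_size_label(name: str) -> SizeLabel:
    """Split a model name into total size, active size, and modality flag."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name!r}")
    active = match.group("active")
    return SizeLabel(
        total_b=float(match.group("total")),
        active_b=float(active) if active is not None else None,
        vision_language=match.group("vl") is not None,
    )


label = parse_size_label("ERNIE-4.5-300B-A47B")
# Share of the network that runs for any one token: 47 / 300, a little under 16 percent.
active_share = label.active_b / label.total_b
```

A dense name such as ERNIE-4.5-0.3B has no `-A…B` part, so the sketch reports `active_b` as `None` rather than guessing a value.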
| Model | Type | Total params | Active params | Layers | Modality |
|---|---|---|---|---|---|
| ERNIE-4.5-0.3B | Dense | ~0.36B | ~0.36B (dense) | 18 | Text |
| ERNIE-4.5-21B-A3B | MoE | 21B | 3B | 28 | Text |
| ERNIE-4.5-300B-A47B | MoE | 300B | 47B | 54 | Text |
| ERNIE-4.5-VL-28B-A3B | MoE | 28B | 3B | n/a | Text + vision |
| ERNIE-4.5-VL-424B-A47B | MoE | 424B | 47B | 54 | Text + vision |
Each row above exists in both a Base and a post-trained release, which is what brings the count to ten [3][6][7][8]. The smallest model, ERNIE-4.5-0.3B, is a conventional dense transformer of about 360 million parameters with 18 layers, light enough to run on modest hardware [8]. At the other end, ERNIE-4.5-VL-424B-A47B is a 424 billion parameter multimodal MoE with 54 layers that activates roughly 47 billion parameters per token [6]. The two A3B models, with only about 3 billion active parameters each, target setups where memory and serving cost matter more than peak capability.
The most distinctive idea in ERNIE 4.5 is what Baidu calls a heterogeneous Mixture-of-Experts structure for the multimodal models [1][2]. A standard MoE routes every token through a shared pool of expert sub-networks. The problem when you mix text and images in one model is cross-modal interference: the two kinds of input pull the shared parameters in different directions, and training one modality can degrade the other.
ERNIE 4.5 handles this by splitting the experts. The architecture keeps some parameters shared across modalities while giving text and vision their own dedicated experts, and it routes each token to the set built for its modality [1][2]. The Hugging Face model cards for the large variants describe an expert layout with 64 text experts and 64 vision experts, of which a small number are activated per token, plus a couple of shared experts in the smaller 21B design [6][7]. Baidu reports that the visual experts hold about a third of the parameters of the text experts, reflecting that the vision side needs less capacity [2].
Keeping the two modalities from interfering took more than separate experts. The training used modality-isolated routing so tokens never leak into the wrong expert set, together with a router orthogonal loss and a multimodal token-balanced loss that keep the routing decisions distinct and the token load even across modalities [1][2]. The vision side is fed through a vision transformer and an adapter that connect image and video features into the language backbone [6].
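The idea of modality-isolated routing can be illustrated with a toy router. This is a sketch under stated assumptions, not Baidu's implementation: the eight-expert layout, the top-2 selection, and the names `EXPERTS_BY_MODALITY` and `route_token` are all invented for illustration, and the router orthogonal loss and token-balanced loss are left out because their exact form is not given here.

```python
import math
import random
from typing import Dict, List, Tuple

# Hypothetical layout for illustration: experts 0-3 serve text, experts 4-7 serve vision.
EXPERTS_BY_MODALITY: Dict[str, List[int]] = {
    "text": [0, 1, 2, 3],
    "vision": [4, 5, 6, 7],
}
TOP_K = 2  # assumed number of experts activated per token


def route_token(logits: List[float], modality: str, top_k: int = TOP_K) -> List[Tuple[int, float]]:
    """Pick top_k experts from the token's own modality group only.

    Returns (expert_index, gate_weight) pairs. The gate weights are a softmax
    over the selected logits, so they sum to 1.
    """
    allowed = EXPERTS_BY_MODALITY[modality]
    # Only experts of the matching modality are candidates; the others are never ranked.
    chosen = sorted(allowed, key=lambda e: logits[e], reverse=True)[:top_k]
    peak = max(logits[e] for e in chosen)  # subtract the max for numerical stability
    exps = [math.exp(logits[e] - peak) for e in chosen]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(chosen, exps)]


# Route a small mixed batch with random router logits.
rng = random.Random(0)
modalities = ["vision" if i % 3 == 0 else "text" for i in range(12)]
assignments = [
    route_token([rng.gauss(0.0, 1.0) for _ in range(8)], modality)
    for modality in modalities
]
```

Because the candidate set is restricted before the top-k step, a text token cannot land on a vision expert even when a vision expert has the highest logit, which is the property the isolation is meant to guarantee.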
Training ran in stages. The text parameters were trained first, and the visual experts, vision transformer, and adapter were brought in during a later multimodal stage, so the model learned language before it learned to ground that language in images [6]. Baidu trained with FP8 mixed-precision and reports reaching 47 percent Model FLOPs Utilization, a measure of how efficiently the hardware was kept busy during training [1][2]. The whole stack was built on Baidu's own PaddlePaddle deep learning framework, with a training toolkit called ERNIEKit and an inference toolkit called FastDeploy [2][4].
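Model FLOPs Utilization is a ratio: the floating-point work the model actually needs per second, divided by the peak the hardware could deliver. The sketch below uses the common rule of thumb of about 6 FLOPs per parameter per training token, applied to the parameters active per token and ignoring attention cost; that approximation, the function name, and every number in the example are assumptions for illustration and not Baidu's accounting.

```python
def model_flops_utilization(
    active_params: float,
    tokens_per_second: float,
    num_devices: int,
    peak_flops_per_device: float,
) -> float:
    """Estimate MFU as achieved model FLOP/s over peak hardware FLOP/s.

    Uses the rough estimate of 6 FLOPs per active parameter per token
    (forward plus backward pass), which leaves out attention FLOPs.
    """
    achieved_flops_per_second = 6.0 * active_params * tokens_per_second
    peak_flops_per_second = num_devices * peak_flops_per_device
    return achieved_flops_per_second / peak_flops_per_second


# Made-up cluster and throughput figures, chosen only to show the arithmetic.
mfu = model_flops_utilization(
    active_params=47e9,           # about 47B parameters active per token
    tokens_per_second=1.5e6,      # hypothetical cluster-wide training throughput
    num_devices=1000,             # hypothetical accelerator count
    peak_flops_per_device=1e15,   # hypothetical peak FLOP/s per accelerator
)
print(f"estimated MFU: {mfu:.1%}")
```

For a sparse MoE the active parameter count, not the total, is the relevant input, since only the activated experts contribute compute for a given token.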
For deployment, the team developed a quantization approach that pushes the weights down to 4-bit and even 2-bit precision while keeping the loss in quality small, using what they describe as a convolutional code quantization algorithm together with a multi-expert parallel collaboration scheme for inference [1][2]. The language models support a context window of 128K tokens [4]. The post-trained vision-language models also offer a choice between a thinking mode, which spends extra computation on step-by-step reasoning, and a faster non-thinking mode [1].
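To give a feel for what pushing weights down to 4 or 2 bits involves, here is a deliberately simple round-to-nearest scheme. It is not the convolutional code quantization algorithm described above, whose details are not given here; it is a generic baseline, and the function names and sample weights are made up for illustration.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one group of weights.

    Returns integer codes in [-qmax, qmax] plus the scale needed to map them back.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 1 for 2-bit
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Sample values invented for the example.
weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]
for bits in (4, 2):
    codes, scale = quantize_group(weights, bits)
    restored = dequantize_group(codes, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: codes={codes} max abs error={worst:.4f}")
```

With this naive scheme the 2-bit error is several times the 4-bit error, which suggests why a more elaborate method is needed to keep quality loss small at very low precision.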
Weights ship in two flavors. The -Paddle releases carry PaddlePaddle tensors, while the -PT releases use PyTorch-style weights compatible with the Hugging Face transformers library, version 4.54.0 or newer [8]. That dual packaging lets the models run inside common open-source serving stacks such as vLLM rather than only inside Baidu's own ecosystem.
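A minimal sketch of loading a -PT checkpoint with the transformers library follows. The repository id `baidu/ERNIE-4.5-0.3B-PT` is an assumption based on the naming described here and should be checked against the model hub; `version_tuple`, `supports_ernie_pt`, and `load_and_generate` are hypothetical helper names. The loading function downloads weights when called, so it is defined here but not invoked.

```python
from typing import Tuple

MIN_TRANSFORMERS = "4.54.0"  # minimum version stated for the -PT weights


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '4.54.0' or '4.55.0.dev0' into a tuple of its leading integer parts."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def supports_ernie_pt(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """Compare versions numerically, so that '4.100.0' counts as newer than '4.54.0'."""
    return version_tuple(installed) >= version_tuple(minimum)


def load_and_generate(prompt: str, repo_id: str = "baidu/ERNIE-4.5-0.3B-PT") -> str:
    """Load the tokenizer and model, then generate a short continuation."""
    # Imported lazily so the version helpers above work without transformers installed.
    import transformers
    from transformers import AutoModelForCausalLM, AutoTokenizer

    if not supports_ernie_pt(transformers.__version__):
        raise RuntimeError(
            f"transformers {transformers.__version__} is older than {MIN_TRANSFORMERS}"
        )
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Starting with the smallest dense model is a cheap way to confirm that the environment is set up correctly before attempting one of the MoE checkpoints.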
Baidu reports a broad benchmark suite in the ERNIE 4.5 technical report and on the model cards. The headline result for the text models is that ERNIE-4.5-300B-A47B-Base beats DeepSeek-V3-671B-A37B-Base on 22 of 28 benchmarks, which is striking given that DeepSeek's base model carries more than twice the total parameters [2][3]. The post-trained ERNIE-4.5-300B-A47B is reported to reach state-of-the-art scores among the compared models on instruction-following and knowledge tasks including IFEval, Multi-IF, SimpleQA, and ChineseSimpleQA [2][6].
The smaller text MoE punches above its size. Baidu reports that ERNIE-4.5-21B-A3B performs competitively with Alibaba's Qwen3-30B-A3B despite having roughly 30 percent fewer total parameters, and that it comes out ahead on reasoning and math benchmarks such as BBH and CMATH [1][2].
On the multimodal side, Baidu reports that ERNIE-4.5-VL is competitive on vision-language benchmarks like MathVista, MMMU, and VisualPuzzle, and that switching on the thinking mode narrows or in some cases closes the gap to OpenAI's o1 reasoning model on those harder, reasoning-heavy tasks [1][2]. The table below summarizes the comparative claims that Baidu states in primary materials. Exact per-benchmark figures are published as charts in the technical report rather than as a single transcribed table, so the entries here record the verified comparative claim rather than invented decimals.
| Model | Benchmark area | Reported result | Compared against |
|---|---|---|---|
| ERNIE-4.5-300B-A47B-Base | 28-benchmark suite | Wins on 22 of 28 | DeepSeek-V3-671B-A37B-Base [2][3] |
| ERNIE-4.5-300B-A47B (post-trained) | Instruction following, knowledge | State of the art on IFEval, Multi-IF, SimpleQA, ChineseSimpleQA | DeepSeek-V3, Qwen3, GPT-4.1, OpenAI-o1 [2][6] |
| ERNIE-4.5-21B-A3B | Reasoning, math (BBH, CMATH) | Competitive overall, ahead on several tasks at ~30% fewer params | Qwen3-30B-A3B [1][2] |
| ERNIE-4.5-VL-424B-A47B (thinking) | MathVista, MMMU, VisualPuzzle | Narrows or in some cases closes the gap on reasoning-heavy visual tasks | OpenAI-o1 [1][2] |
As with any vendor-published numbers, these figures come from Baidu's own evaluation runs and reflect the specific prompts, settings, and comparison checkpoints the team chose. They are useful as a guide to where the models are strong, not as a neutral ranking.
ERNIE 4.5 sits at the top of a model line that goes back several years. The ERNIE name first appeared on Baidu's pretraining research in 2019, the ERNIE Bot consumer assistant launched in March 2023, and proprietary releases such as ERNIE 3.0 and ERNIE 4.0 preceded the 4.5 generation [9][12]. What changed with 4.5 was less the lineage and more the openness: earlier ERNIE foundation models were served through Baidu's API and apps, while 4.5 put the weights themselves into the open. The general history and architecture of the line are covered in the Baidu ERNIE article, which is a separate, broader topic from this 4.5-specific family.
In the wider field, ERNIE 4.5 arrived into a crowded year for Chinese open releases. DeepSeek had shipped its V3 and R1 models, Alibaba had pushed out the Qwen3 series, and the open-weight conversation had clearly moved beyond Meta's Llama 4 and into a competitive Chinese cohort. Baidu's entry added a large lab that had previously stayed closed, and it leaned on a multimodal-first design and an Apache 2.0 license to stand apart from rivals whose flagship open models were often text-first or carried more restrictive terms.
Every model in the ERNIE 4.5 family is released under the Apache 2.0 license, which permits commercial use, modification, and redistribution subject to the usual attribution terms [3][6][8]. That is a permissive choice. It means companies can build products on ERNIE 4.5 without negotiating a separate commercial agreement with Baidu, the same footing that made models like DeepSeek and Qwen attractive to downstream builders.
The significance is partly technical and partly strategic. Technically, the family gives the open community a multimodal MoE at genuinely large scale, plus small dense and small-active variants that are cheap to run, all under one consistent design and license. Strategically, it marks a major Chinese lab abandoning a closed stance it had defended for years, and it strengthens the pattern in which the most widely available open-weight frontier models increasingly come from Chinese labs rather than from the US firms that pioneered the modern open-source AI wave.
There are real caveats. The benchmark numbers are Baidu's own and have not, as a body, been independently reproduced at the time of release, so the comparative claims against DeepSeek-V3, Qwen3, GPT-4.1, and OpenAI-o1 should be read as vendor results [2]. Running the largest variants is demanding: a 424 billion parameter model, even one that activates only about 47 billion parameters per token, needs substantial GPU memory to serve, which is why the 4-bit and 2-bit quantization work and the smaller A3B and 0.3B models matter for practical use [1][6].
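A back-of-the-envelope calculation shows why precision matters so much for the large variants. The sketch below counts weight storage only, assuming every expert must be resident in memory even though only some are active per token; it ignores the KV cache, activations, and quantization metadata, and the function name is a hypothetical helper.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes).

    One billion parameters at 8 bits each occupy one gigabyte.
    """
    return total_params_billion * bits_per_weight / 8


# Sizes taken from the model table in this article; 16-bit stands in for unquantized weights.
for name, params_b in [
    ("ERNIE-4.5-VL-424B-A47B", 424),
    ("ERNIE-4.5-300B-A47B", 300),
    ("ERNIE-4.5-21B-A3B", 21),
]:
    sizes = ", ".join(
        f"{bits}-bit: {weight_memory_gb(params_b, bits):g} GB" for bits in (16, 4, 2)
    )
    print(f"{name}: {sizes}")
```

On these assumptions the 424B model needs roughly 848 GB for 16-bit weights, 212 GB at 4-bit, and 106 GB at 2-bit, while the 21B model fits in about 42 GB at 16-bit. Memory follows the total parameter count; the active count mainly determines compute per token.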
The tooling story also has rough edges. The training and inference toolkits are tied to Baidu's PaddlePaddle framework, and although PyTorch-format weights exist for the transformers library, teams standardized on other stacks may hit friction they would not encounter with models built PyTorch-first [4][8]. And like the rest of the ERNIE line, the models were trained with heavy Chinese-language and Chinese-context data, which helps on Chinese benchmarks but means behavior, knowledge coverage, and content handling can differ from Western-developed models in ways that are worth testing before deployment [2].