Vidu (video generation)
Last reviewed: Jun 3, 2026
Sources: 18 citations
Review status: Source-backed
Revision: v1 · 1,716 words
Vidu is an AI video generation model and platform developed by Shengshu Technology (Chinese: 生数科技), a Beijing startup that grew out of research at Tsinghua University. It was first unveiled in April 2024 as one of China's earliest answers to OpenAI's Sora, and it generates short clips from text, images, or reference photographs. Vidu is built on a diffusion transformer backbone that Shengshu traces to its own U-ViT research, which the team published in September 2022. The platform has gone through several model generations, from the original Vidu through Vidu 1.5, 2.0, and the "Q" series (Q1, Q2, Q3). In April 2026 Shengshu raised roughly 290 million US dollars in a round led by Alibaba to fund a broader "world model" effort. [1][2][3]
Shengshu Technology was founded in March 2023 by a team connected to Tsinghua University's Institute for AI Industry Research. The chief scientist is Zhu Jun, a Tsinghua professor and deputy dean of the university's institute for artificial intelligence, who is widely cited as the key academic figure behind the company. Tang Jiayu, a Tsinghua computer science graduate, is a co-founder and served as the original chief executive; the chief technology officer is Bao Fan (also rendered Fan Bao), a doctoral researcher from Zhu Jun's group who specialises in diffusion models. [1][4][5]
In early 2025 Shengshu brought in Luo Yihang, a former AI executive at ByteDance who had led the AI unit at Volcano Engine, as chief executive. Tang Jiayu moved to the role of president with responsibility for strategy, branding, finance, and administration, while Luo took over research and development, product, and commercialisation. The change was reported in March 2025. [5]
Vidu's underlying design descends from U-ViT, short for "U-shaped Vision Transformer," introduced in the paper "All are Worth Words: A ViT Backbone for Diffusion Models." The paper was authored by Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu, first posted to arXiv on 25 September 2022 and later accepted to CVPR 2023. [6]
U-ViT replaces the convolutional U-Net that earlier diffusion models used with a pure Vision Transformer. It treats the noisy image patches, the timestep, and any conditioning signal all as tokens, and it adds long skip connections between shallow and deep transformer layers (the feature that gives the architecture its "U" shape). In the original experiments, a latent diffusion model with a U-ViT backbone reached an FID score of 2.29 for class-conditional image generation on ImageNet at 256 by 256 resolution and 5.48 for text-to-image generation on MS-COCO. [6]
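The token layout and long skip connections described above can be illustrated with a minimal NumPy sketch. It is not Shengshu's or the paper authors' code: the function names (`patchify`, `attention_block`, `uvit_forward`), the toy sinusoidal timestep embedding, the single-head attention without normalisation or MLP layers, and the random untrained weights are all assumptions made for illustration. Only the wiring follows the description: the timestep, the condition, and the image patches are all tokens, and each deep block receives the output of its mirrored shallow block, here merged by concatenation followed by a linear projection.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(d_in, d_out):
    """Random, untrained weights standing in for a learned linear layer."""
    return rng.normal(0.0, d_in ** -0.5, size=(d_in, d_out))

def patchify(image, patch):
    """Split an (H, W, C) array into flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def attention_block(tokens, w_qkv, w_out):
    """Toy single-head self-attention with a residual connection."""
    q, k, v = np.split(tokens @ w_qkv, 3, axis=-1)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return tokens + (weights @ v) @ w_out

def uvit_forward(image, timestep, cond, patch=4, dim=32, depth=2):
    """Hypothetical U-ViT-style forward pass over one noisy image."""
    # 1. Everything becomes a token: timestep, condition, and image patches.
    patches = patchify(image, patch) @ linear(patch * patch * image.shape[-1], dim)
    t_token = np.sin(timestep * np.arange(1, dim + 1))[None, :]  # toy embedding
    c_token = (cond @ linear(cond.shape[-1], dim))[None, :]
    h = np.concatenate([t_token, c_token, patches], axis=0)

    def block(x):
        # Fresh random weights per block; a real model would learn these.
        return attention_block(x, linear(dim, 3 * dim), linear(dim, dim))

    # 2. Shallow half: remember each block's output for the long skips.
    skips = []
    for _ in range(depth):
        h = block(h)
        skips.append(h)

    h = block(h)  # middle block

    # 3. Deep half: merge the mirrored shallow output, then run the block.
    for _ in range(depth):
        h = np.concatenate([h, skips.pop()], axis=-1) @ linear(2 * dim, dim)
        h = block(h)

    return h[2:]  # drop the timestep and condition tokens

out = uvit_forward(np.zeros((16, 16, 3)), timestep=0.5, cond=np.ones(8))
print(out.shape)  # (16, 32): 16 patch tokens, each of width 32
```

Popping from the `skips` list pairs the last shallow block with the first deep block, the first shallow block with the last deep block, and so on, which is the mirrored wiring that gives the architecture its "U" shape. A full model would also project the output tokens back to pixel or latent patches, which this sketch omits.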
Shengshu emphasises that U-ViT predates the DiT (Diffusion Transformer) architecture that underlies several Western video and image models. The DiT paper, "Scalable Diffusion Models with Transformers" by William Peebles and Saining Xie, first appeared on arXiv on 19 December 2022, roughly three months after U-ViT. Both papers converge on the same broad idea, replacing the U-Net with a transformer, and Shengshu uses the timing to argue that its transformer-plus-diffusion roadmap for video was set independently and early. Zhu Jun has said that when Sora appeared the team found it "closely aligned" with their existing technical direction. The company describes Vidu as combining diffusion and transformer methods into a single architecture trained end to end on video. [3][6][7]
Vidu was first shown publicly at the Zhongguancun Forum in Beijing in late April 2024, where Shengshu and Tsinghua presented it as China's first text-to-video model on the level of Sora, capable of 16-second clips at 1080p. A global public launch followed on 30 July 2024, with text-to-video and image-to-video in both Chinese and English and clip lengths of 4 or 8 seconds depending on the plan. Large-scale training for the model was run with support from Baidu's AI cloud. In September 2024 Vidu added a subject-consistency feature, letting users lock a character or object across a clip from a reference image. [2][3][8]
Subsequent releases moved quickly. Vidu 1.5 (November 2024) introduced what the company called Multiple-Entity Consistency, the ability to combine several unrelated reference images (people, objects, and backgrounds) into one coherent video. Vidu 2.0 (January 2025) focused on speed and price, generating a clip in under ten seconds. The "Q" generation began with Vidu Q1 in spring 2025, which added natively generated, synchronised audio, and continued through Q2 and Q3, the last of which generates video and sound together in a single pass.
| Version | Date | Notable additions |
|---|---|---|
| Vidu (initial) | Unveiled April 2024; global launch 30 July 2024 | Text-to-video and image-to-video, up to 1080p; 4 to 8 second clips; subject consistency added September 2024 [2][3] |
| Vidu 1.5 | 13 November 2024 | Multiple-Entity Consistency from multiple reference images; longer-context understanding [9] |
| Vidu 2.0 | 15 January 2025 | Generation in under 10 seconds per clip; lower price; one-click templates [10] |
| Vidu Q1 | Announced 29 March 2025; global launch 21 April 2025 | Synchronised 48 kHz audio; cinematic effects; later multi-reference up to seven images [11][12] |
| Vidu Q2 | 21 October 2025 | Reference-to-Video with up to seven images; "cinematic" and "lightning" presets; 2 to 8 second clips [13] |
| Vidu Q3 | 13 April 2026 | Simultaneous audio-visual generation in a single pass; expanded cinematic effects [14] |
A higher-tier Vidu Q3 Pro model was also released and, according to coverage of the 2026 funding round, ranked among the top models for generating videos from text and images. [3]
Across its generations Vidu supports several input modes. Text-to-video turns a written prompt into a clip; image-to-video animates a still image; and reference-to-video uses one or more supplied photographs so that a specific face, character, prop, or setting stays consistent through the generated footage. Shengshu describes the reference-to-video capability, first offered in 2024, as an industry first, and later versions raised the number of reference images to as many as seven and allowed several distinct subjects to appear together in one scene. [2][13]
From the Q1 generation onward, Vidu generates audio as well as video. Q1 produced synchronised sound effects from a text prompt at a 48 kHz sampling rate, and Q3 generates dialogue, sound effects, and background music together with the picture in a single pass so that they line up with the on-screen action. The models also support start-and-end-frame control, where the user supplies a first and last frame and Vidu fills in a smooth transition between them, along with camera moves, depth of field, and a set of cinematic visual effects. [11][12][14]
Shengshu has reported strong early adoption, including roughly 10 million users within the first 100 days after the global launch. It also offers Vidu through an API and a model-as-a-service platform aimed at businesses and developers, so studios can build the generation features into their own pipelines. The company says its tools have reached users across more than 200 countries and regions, in areas such as advertising, film, animation, and short-form entertainment. [10][15]
In a Series B round led by Alibaba Cloud and announced on 10 April 2026, Shengshu raised roughly 2 billion yuan. Coverage put the US dollar figure at between 290 million and 293 million, and Bloomberg framed it as a roughly 300 million dollar bet. Reported participants and existing backers included TAL Education and Baidu Ventures, among others. [1][3][16]
The company said the money would go toward a "general-purpose world model," a system meant to connect digital worlds (such as gaming and AI video) with the physical world (such as autonomous driving and robotics), trained on multimodal data spanning vision, audio, and other signals. The round followed a Series A+ of more than 600 million yuan (about 86 million US dollars) reported in early 2026; together with earlier financing, Shengshu's total funding across 2024 to 2026 was reported at close to 380 million US dollars. Investors in the earlier round included Zhongguancun Science City, LINK-X Capital, Qiming Venture Partners, and the Beijing Artificial Intelligence Industry Investment Fund. [1][3][17]
Vidu competes in a crowded field of text-to-video and image-to-video systems. Its most direct rivals among Chinese products are Kuaishou's Kling, MiniMax's Hailuo, ByteDance's Jimeng and Seedance, and Alibaba's own Wan models; internationally it is positioned against OpenAI's Sora and Google's Veo. Coverage of Vidu has repeatedly framed it as a Sora challenger, and Shengshu has pointed out that it launched Vidu globally before OpenAI made Sora broadly available. The company tends to compete on inference speed, price, and its reference-based consistency features rather than on raw clip length alone. [3][13][18]