Vidu (video generation)

Updated Jul 16, 2026

Vidu is an AI video generation model and platform developed by Shengshu Technology (Chinese: 生数科技), a Beijing startup that grew out of research at Tsinghua University. It was first unveiled in April 2024 as one of China's earliest answers to OpenAI's Sora, and it generates short clips from text, images, or reference photographs. Vidu is built on a diffusion transformer backbone that Shengshu traces to its own U-ViT research, which the team published in September 2022. The platform has gone through several model generations, from the original Vidu through Vidu 1.5, 2.0, and the "Q" series (Q1, Q2, Q3). In April 2026 Shengshu raised roughly 290 million US dollars in a round led by Alibaba to fund a broader "world model" effort. ^[1]^[2]^[3]

Shengshu Technology

Shengshu Technology was founded in March 2023 by a team connected to Tsinghua University's Institute for AI Industry Research. Its chief scientist is Zhu Jun, a Tsinghua professor and deputy dean of the university's Institute for Artificial Intelligence, who is widely cited as the key academic figure behind the company. Tang Jiayu, a Tsinghua computer science graduate, is a co-founder and was the original chief executive. The chief technology officer is Bao Fan (also rendered Fan Bao), a doctoral researcher from Zhu Jun's group who specialises in diffusion models. ^[1]^[4]^[5]

In early 2025 Shengshu brought in Luo Yihang, a former AI executive at ByteDance who had led the AI unit at Volcano Engine, as chief executive. Tang Jiayu moved to the role of president with responsibility for strategy, branding, finance, and administration, while Luo took over research and development, product, and commercialisation. The change was reported in March 2025. ^[5]

The U-ViT architecture

Vidu's underlying design descends from U-ViT, short for "U-shaped Vision Transformer," introduced in the paper "All are Worth Words: A ViT Backbone for Diffusion Models." The paper was authored by Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu, first posted to arXiv on 25 September 2022 and later accepted to CVPR 2023. ^[6]

U-ViT replaces the convolutional U-Net used by earlier diffusion models with a pure Vision Transformer. It treats the noisy image patches, the timestep, and any conditioning signal all as tokens, and it adds long skip connections between shallow and deep transformer layers (the feature that gives the architecture its "U" shape). In the original experiments, a latent diffusion model with a U-ViT backbone reached an FID score (lower is better) of 2.29 for class-conditional image generation on ImageNet at 256×256 resolution and 5.48 for text-to-image generation on MS-COCO. ^[6]
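
The token layout and skip wiring described above can be illustrated with a minimal Python sketch. It is a structural illustration only, not the paper's or Shengshu's code: the function names (`patchify`, `build_sequence`, `skip_pairs`, `forward_trace`) are hypothetical, the toy 4 by 4 grid stands in for a real image or latent, an odd number of blocks with a single middle block is assumed, and attention, token embeddings, and the operator that merges a skip with a deep block's input are not modelled.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    return [
        [image[r][c] for r in range(top, top + patch) for c in range(left, left + patch)]
        for top in range(0, height, patch)
        for left in range(0, width, patch)
    ]


def build_sequence(timestep, condition, patches):
    """Place the timestep and the condition in the same sequence as the patches."""
    sequence = [("time", timestep), ("cond", condition)]
    sequence += [("patch", p) for p in patches]
    return sequence


def skip_pairs(depth):
    """Pair shallow block i with deep block depth - 1 - i around one middle block."""
    if depth < 1 or depth % 2 == 0:
        raise ValueError("this sketch assumes an odd number of blocks")
    return [(i, depth - 1 - i) for i in range(depth // 2)]


def forward_trace(depth):
    """List the block order, showing where each skip is saved and later consumed."""
    half = len(skip_pairs(depth))  # also validates depth
    saved, trace = [], []
    for i in range(half):
        saved.append(i)
        trace.append(f"block {i}: save output for a long skip")
    trace.append(f"block {half}: middle block")
    for j in range(half + 1, depth):
        # Skips are consumed in reverse order, so the shallowest feeds the deepest.
        trace.append(f"block {j}: merge skip from block {saved.pop()}")
    return trace


if __name__ == "__main__":
    image = [[row * 4 + col for col in range(4)] for row in range(4)]
    sequence = build_sequence(timestep=500, condition="cat", patches=patchify(image, 2))
    print(len(sequence), "tokens:", [kind for kind, _ in sequence])
    print("skip pairs:", skip_pairs(5))
    for step in forward_trace(5):
        print(step)
```

With five blocks the sketch pairs block 0 with block 4 and block 1 with block 3, the symmetric wiring that the "U" in the name refers to.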

Shengshu emphasises that U-ViT predates the DiT (Diffusion Transformer) architecture that underlies several Western video and image models. The DiT paper, "Scalable Diffusion Models with Transformers" by William Peebles and Saining Xie, first appeared on arXiv on 19 December 2022, 85 days (just under three months) after U-ViT. Both papers converge on the same broad idea, replacing the U-Net with a transformer, and Shengshu uses the timing to argue that its transformer-plus-diffusion roadmap for video was set independently and early. Zhu Jun has said that when Sora appeared the team found it "closely aligned" with their existing technical direction. The company describes Vidu as combining diffusion and transformer methods into a single architecture trained end to end on video. ^[3]^[6]^[7]

Vidu versions

Vidu was first shown publicly at the Zhongguancun Forum in Beijing in late April 2024, where Shengshu and Tsinghua presented it as China's first text-to-video model on the level of Sora, capable of 16-second clips at 1080p. A global public launch followed on 30 July 2024, with text-to-video and image-to-video in both Chinese and English and clip lengths of 4 or 8 seconds depending on the plan. Large-scale training for the model was run with support from Baidu's AI cloud. In September 2024 Vidu added a subject-consistency feature, letting users lock a character or object across a clip from a reference image. ^[2]^[3]^[8]^[15]

Subsequent releases moved quickly. Vidu 1.5 (November 2024) introduced what the company called Multiple-Entity Consistency, the ability to combine several unrelated reference images (people, objects, and backgrounds) into one coherent video. Vidu 2.0 (January 2025) focused on speed and price, generating a clip in under ten seconds. The "Q" generation began with Vidu Q1 in spring 2025, which added natively generated, synchronised audio, and continued through Q2 and Q3, the last of which generates video and sound together in a single pass. ^[9]^[10]^[11]^[13]^[14]

Version	Date	Notable additions
Vidu (initial)	Unveiled April 2024; global launch 30 July 2024	Text-to-video and image-to-video, up to 1080p; 4- or 8-second clips at global launch; subject consistency added September 2024 ^[2]^[3]
Vidu 1.5	13 November 2024	Multiple-Entity Consistency from multiple reference images; longer-context understanding ^[9]
Vidu 2.0	15 January 2025	Generation in under 10 seconds per clip; lower price; one-click templates ^[10]
Vidu Q1	Announced 29 March 2025; global launch 21 April 2025	Synchronised 48 kHz audio; cinematic effects; later multi-reference up to seven images ^[11]^[12]
Vidu Q2	21 October 2025	Reference-to-Video with up to seven images; "cinematic" and "lightning" presets; 2 to 8 second clips ^[13]
Vidu Q3	13 April 2026	Simultaneous audio-visual generation in a single pass; expanded cinematic effects ^[14]

A higher-tier Vidu Q3 Pro model was also released and, according to coverage of the 2026 funding round, ranked among the top models for generating videos from text and images. ^[3]

Features

Across its generations Vidu supports several input modes. Text-to-video turns a written prompt into a clip; image-to-video animates a still image; and reference-to-video uses one or more supplied photographs so that a specific face, character, prop, or setting stays consistent through the generated footage. Shengshu describes the reference-to-video capability, first offered in 2024, as an industry first, and later versions raised the number of reference images to as many as seven and allowed several distinct subjects to appear together in one scene. ^[2]^[13]

From the Q1 generation onward, Vidu generates audio as well as video. Q1 produced synchronised sound effects from a text prompt at a 48 kHz sampling rate, and Q3 generates dialogue, sound effects, and background music together with the picture in a single pass so that they line up with the on-screen action. The models also support start-and-end-frame control, where the user supplies a first and last frame and Vidu fills in a smooth transition between them, along with camera moves, depth of field, and a set of cinematic visual effects. ^[11]^[12]^[14]

Shengshu has reported strong early adoption, including roughly 10 million users within the first 100 days after the global launch. It also offers Vidu through an API and a model-as-a-service platform aimed at businesses and developers, so studios can build the generation features into their own pipelines. The company says its tools have reached users across more than 200 countries and regions, in areas such as advertising, film, animation, and short-form entertainment. ^[10]^[15]

Funding

In April 2026 Shengshu raised about 290 million US dollars, reported as roughly 2 billion yuan, in a Series B round led by Alibaba Cloud. The announcement was made on 10 April 2026. Reported participants and existing backers included TAL Education and Baidu Ventures, among others. Coverage put the figure variously at around 290 million to 293 million US dollars, and Bloomberg framed it as a roughly 300 million dollar bet. ^[1]^[3]^[16]

The company said the money would go toward a "general-purpose world model," a system meant to connect digital worlds (such as gaming and AI video) with the physical world (such as autonomous driving and robotics), trained on multimodal data spanning vision, audio, and other signals. The round followed a Series A+ of more than 600 million yuan (about 86 million US dollars) reported in early 2026; together with earlier financing, Shengshu's total funding across 2024 to 2026 was reported at close to 380 million US dollars. Investors in the earlier round included Zhongguancun Science City, LINK-X Capital, Qiming Venture Partners, and the Beijing Artificial Intelligence Industry Investment Fund. ^[1]^[3]^[17]

Competition

Vidu competes in a crowded field of text-to-video and image-to-video systems. Its most direct rivals among Chinese products are Kuaishou's Kling, MiniMax's Hailuo, ByteDance's Jimeng and Seedance, and Alibaba's own Wan models; internationally it is positioned against OpenAI's Sora and Google's Veo. Coverage of Vidu has repeatedly framed it as a Sora challenger, and Shengshu has pointed out that it launched Vidu globally before OpenAI made Sora broadly available. The company tends to compete on inference speed, price, and its reference-based consistency features rather than on raw clip length alone. ^[3]^[13]^[18]

References

1. Alibaba leads $290m investment for Shengshu Vidu AI world model - CNBC
2. China-developed Text-to-video Large Model Launched for Global Users - TMTPost
3. China's ShengShu raises $290 million led by Alibaba to speed world model development - Digital Today
4. Vidu | ShengShu Technology - Shengshu Technology
5. Shengshu Technology Appoints Former ByteDance AI Executive as CEO - TMTPost
6. All are Worth Words: A ViT Backbone for Diffusion Models (arXiv:2209.12152) - Fan Bao et al., arXiv
7. Scalable Diffusion Models with Transformers (arXiv:2212.09748) - William Peebles and Saining Xie, arXiv
8. China's Vidu Challenges Sora with High-Definition 16-Second AI Video Clips in 1080p - MarkTechPost
9. Vidu 1.5 Launch Marks New Emergence in Multimodal AI - PR Newswire
10. ShengShu Technology Announces Vidu 2.0, Offering the Industry's Fastest Generative Video - PR Newswire
11. Vidu Q1 Model Launches Globally Offering Unmatched Realistic VFX Capabilities - PR Newswire
12. Vidu Q1 Model Update Unveils Multi-Reference Feature, Supporting Up to Seven Image Inputs - PR Newswire
13. Vidu Launches Q2 "Reference-to-Video", Pioneering a New Era of High Consistency and Creative Control - PR Newswire
14. ShengShu Launches Vidu Q3 Reference-to-Video with Expanded Visual and Audio Capabilities - PR Newswire
15. Vidu Launches Globally, with Baidu's AIHC Support for Large-Scale Video Model Training - Pandaily
16. Alibaba Leads $300 Million Bet on AI Video Platform ShengShu - Bloomberg
17. ShengShu Technology Completes Series A+ Funding of Over RMB 600 Million - PR Newswire via Yahoo Finance
18. Chinese AI start-up Shengshu unveils Vidu Q2 in challenge to OpenAI's Sora - South China Morning Post
