HappyHorse-1.0

Updated Jun 3, 2026

HappyHorse-1.0 is an AI video generation model developed by Alibaba that briefly became the top-ranked system on the Artificial Analysis Video Arena in April 2026 ^[1]^[2]. The model is a unified roughly 15-billion-parameter Transformer that generates video and its accompanying soundtrack jointly in a single forward pass, producing 1080p clips with dialogue, ambient sound, and lip-sync across seven languages ^[2]^[3]. It first appeared on the arena anonymously, climbing to first place in both text-to-video and image-to-video blind tests before Alibaba confirmed it was the developer ^[1]^[4].

What it is

HappyHorse-1.0 is a closed, API-only video generator that takes text prompts, still images, or reference clips as input and returns short videos with synchronized audio ^[2]^[3]. Its defining feature is that audio and video are not produced in separate stages. Most text-to-video systems generate silent footage and then bolt on speech or sound effects afterward, often with a second model. HappyHorse instead treats sound as another modality inside the same network, so a generated character's lip movements, the dialogue audio, and background Foley all come out of one inference pass ^[2]^[5].

Alibaba positioned the model as a research preview rather than a finished consumer product when it was revealed, and access has been gated through the inference platform fal.ai rather than a public Alibaba app ^[3]^[6]. Independent press coverage treated the launch as significant partly because it was the first arena-topping video model attributed to Alibaba's commerce-focused Taotian unit rather than to its Qwen or Tongyi research groups ^[1]^[4].

Joint audio and video in a single pass

The model is built around a single self-attention Transformer rather than the more common two-tower design that pairs a separate audio model with a video diffusion backbone. According to the developer's technical description relayed by fal, the network has 40 layers: the first four and last four handle modality-specific encoding and decoding, while the middle 32 layers share parameters across text, image, video, and audio tokens with no cross-attention branches ^[2]. Every token type flows through the same stack, which the team argues is what keeps speech, motion, and lighting consistent across a clip ^[2]^[5].
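The reported layer split can be sketched as a simple index-to-role mapping. This only illustrates the 4 + 32 + 4 layout described above and is not the developer's code: the zero-based indexing, the role names, and the reading that the first four layers encode while the last four decode are all assumptions.

```python
# Illustrative sketch of the reported 40-layer layout (not the real implementation).
NUM_LAYERS = 40   # total layers, per the description relayed by fal
EDGE_LAYERS = 4   # modality-specific layers at each end of the stack


def layer_role(index: int) -> str:
    """Return a hypothetical role label for a zero-based layer index."""
    if not 0 <= index < NUM_LAYERS:
        raise ValueError(f"layer index {index} out of range")
    if index < EDGE_LAYERS:
        return "modality_specific_encode"  # assumed: first four layers encode
    if index >= NUM_LAYERS - EDGE_LAYERS:
        return "modality_specific_decode"  # assumed: last four layers decode
    return "shared"  # one set of weights for text, image, video, and audio tokens


roles = [layer_role(i) for i in range(NUM_LAYERS)]
shared_count = roles.count("shared")
print(shared_count)  # number of layers in the shared middle block
```

Under these assumptions the shared block covers layer indices 4 through 35, which is where tokens from every modality would pass through the same parameters.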

Because all four modalities live in one token sequence, the model is described as matching generated speech to mouth movement at the phoneme level rather than approximating the alignment after the fact ^[3]^[5]. The lip-sync covers seven languages: English, Mandarin, Cantonese, Japanese, Korean, German, and French ^[2]^[3]. The audio track can include spoken dialogue, environmental ambience, and Foley-style sound effects tied to on-screen action ^[2]^[5]. The seven-language list, weighted toward Chinese, Japanese, and Korean, was one of the early clues that prompted speculation about an Asia-based origin before Alibaba came forward ^[7].

For speed, HappyHorse uses a DMD-2 distillation step that cuts the diffusion process down to roughly eight denoising steps and drops classifier-free guidance ^[2]. The team reported generating a 1080p clip in about 38 seconds on a single NVIDIA H100 GPU, a figure that had not been independently reproduced at the time of the arena debut ^[2].
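A rough way to see why fewer steps and no classifier-free guidance matter is to count network forward passes. The sketch below assumes the common implementation in which classifier-free guidance runs the network twice per denoising step (once conditioned, once unconditioned). The 50-step guided baseline is a purely illustrative assumption and does not come from the developer or from fal.

```python
def forward_passes(steps: int, use_cfg: bool) -> int:
    """Count network evaluations for one sampling run.

    Assumes classifier-free guidance costs two evaluations per step
    (conditional plus unconditional), as in typical implementations.
    """
    if steps <= 0:
        raise ValueError("steps must be positive")
    return steps * (2 if use_cfg else 1)


# Hypothetical undistilled baseline: 50 guided steps (illustrative only).
baseline_passes = forward_passes(steps=50, use_cfg=True)

# Reported distilled setting: roughly 8 steps, no classifier-free guidance.
distilled_passes = forward_passes(steps=8, use_cfg=False)

reduction = baseline_passes / distilled_passes
print(baseline_passes, distilled_passes, reduction)
```

Against that hypothetical baseline the distilled sampler needs 8 evaluations instead of 100. Wall-clock speedup would not track this ratio exactly, since encoding, decoding, and memory transfer also take time.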

Specifications

The table below summarizes the model's reported characteristics. Several figures come from the developer or from fal and should be read as vendor claims unless noted as independently measured.

Attribute	Detail
Developer	Alibaba (Taotian Group, Future Life Lab / ATH AI Innovation Unit) ^[1]^[3]
Architecture	Unified 40-layer self-attention Transformer, no cross-attention ^[2]
Parameters	~15 billion ^[2]^[3]
Modalities	Joint audio plus video in a single forward pass ^[2]^[5]
Resolution	Up to 1080p ^[2]^[3]
Clip length	About 5 to 8 seconds ^[2]
Lip-sync	Phoneme-level across 7 languages (EN, Mandarin, Cantonese, JA, KO, DE, FR) ^[2]^[3]
Sampling	DMD-2 distillation, ~8 steps, no classifier-free guidance ^[2]
Reported speed	~38 s for a 1080p clip on one H100 (vendor claim) ^[2]
Availability	API-only via fal, from April 2026 ^[3]^[6]
License	Closed source ^[2]^[3]

Arena ranking

HappyHorse-1.0 made its name on the Artificial Analysis Video Arena, a leaderboard that ranks generative video models using an Elo system derived from blind human preference votes. Users are shown two clips generated from the same prompt or image and pick the one they prefer, and the model identities are hidden ^[1]^[8].

At its debut the model reached #1 in both the text-to-video and image-to-video categories on the no-audio boards, the first system to lead both at once ^[1]^[2]. Reported Elo figures from the debut window varied with the snapshot: fal cited about 1333 for text-to-video and 1392 for image-to-video on the no-audio boards, and contemporary write-ups quoted scores in the 1350 to 1400 range with a lead of roughly 40 to 110 points over the next model, then ByteDance's Dreamina Seedance 2.0 ^[2]^[7]. On the more closely contested boards that include generated audio, HappyHorse placed second behind Seedance 2.0 rather than first ^[2]^[9]. As more votes accumulated and rival models were added, its margin narrowed: the live text-to-video leaderboard later showed the two models nearly tied near the top ^[9].
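To give a sense of what a 40 to 110 point lead means in head-to-head votes, the sketch below applies the standard Elo expected-score formula with the conventional scale of 400. Whether Artificial Analysis uses exactly this formula and scale is an assumption; the article's sources only say the ranking is Elo-based.

```python
def expected_win_rate(rating_gap: float, scale: float = 400.0) -> float:
    """Standard Elo expected score for the higher-rated model.

    rating_gap is the leader's rating minus the rival's rating.
    The 400-point scale is the conventional Elo default (assumed here).
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Gaps quoted in contemporary write-ups: roughly 40 to 110 points.
low_gap_rate = expected_win_rate(40)
high_gap_rate = expected_win_rate(110)
print(round(low_gap_rate, 3), round(high_gap_rate, 3))
```

Under that assumption, a 40-point lead corresponds to the leader being preferred in about 56 percent of pairings, and a 110-point lead to about 65 percent, which helps explain how the margin could shrink as more votes came in.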

The arena result drew outsized attention because the entry was, in Artificial Analysis's wording, pseudonymous, meaning no verifiable organization was attached when it was submitted ^[7]. A model with no public team leading a benchmark over established systems from Kling, Veo, and Seedance was unusual enough to become a story in its own right before anyone knew who built it ^[4]^[7].

Team and origin

HappyHorse-1.0 appeared on the arena around April 7, 2026 without any stated affiliation ^[1]^[7]. Within days the developers opened an account on X and acknowledged the project, and on April 10 CNBC reported that Alibaba had confirmed it was behind the model, describing it as part of the company's ATH AI Innovation Unit and still under development ^[1]. Alibaba's Hong Kong-listed shares closed about 2 percent higher the day the involvement was reported ^[1].

Subsequent coverage attributed the work specifically to a team inside Alibaba's Taotian Group, sometimes called the Future Life Lab, led by Zhang Di ^[3]^[4]. Zhang is described as a former vice president of Kuaishou and a technical lead on Kling AI, Kuaishou's video model, who rejoined Alibaba in late 2025 to work on multimodal generation ^[3]^[4]. That lineage, with an arena-topping video model shipped within months by someone who had previously led a rival system, was the basis for the widely repeated framing that HappyHorse came from an ex-Kling team ^[3]^[4]. The precise composition of the group beyond Zhang has not been published, and the "Future Life Lab" and "ATH AI Innovation Unit" labels appear to describe the same Taotian organization under slightly different names across sources ^[1]^[3].

Availability

After the anonymous arena run, the model became available to developers through fal as an official API partner. fal announced that HappyHorse-1.0 went live on its platform in late April 2026, with sources dating the launch to April 26 or 27 ^[3]^[6]. Access is offered through several endpoints covering text-to-video, image-to-video, reference-based generation, and video editing ^[3]. Listed pricing was about $0.14 per second of video at 720p and $0.28 per second at 1080p, with no subscription or minimum spend ^[3]^[10]. Alibaba did not, at launch, release model weights or a public consumer app, so the API remained the primary route to use the system ^[2]^[3].
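The per-second pricing makes clip cost easy to estimate. The sketch below uses the listed rates and the 5 to 8 second clip lengths from the specifications table. The function name is hypothetical, and the sketch assumes billing is strictly proportional to duration with no rounding or minimum charge, which the sources do not spell out.

```python
from decimal import Decimal

# Listed rates in US dollars per second of generated video.
PRICE_PER_SECOND = {
    "720p": Decimal("0.14"),
    "1080p": Decimal("0.28"),
}


def clip_cost(seconds: int, resolution: str) -> Decimal:
    """Estimate the cost of one clip (hypothetical helper, linear billing assumed)."""
    if resolution not in PRICE_PER_SECOND:
        raise ValueError(f"unknown resolution: {resolution}")
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return PRICE_PER_SECOND[resolution] * Decimal(seconds)


short_720p = clip_cost(5, "720p")
long_1080p = clip_cost(8, "1080p")
print(short_720p, long_1080p)
```

On these assumptions a 5-second 720p clip would cost about $0.70 and an 8-second 1080p clip about $2.24.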

References

1. CNBC, "Alibaba revealed as creator of AI video generation model 'HappyHorse-1.0'," April 10, 2026. https://www.cnbc.com/2026/04/10/alibaba-happyhorse-ai-video-model-benchmark-reveal.html
2. fal, "HappyHorse-1.0 AI Goes Live on fal: April 26, 9 PM PST." https://fal.ai/learn/devs/happyhorse-1-0-what-do-we-know-so-far
3. fal, "HappyHorse-1.0 | AI Video Generator | Official API Partner." https://fal.ai/happyhorse-1.0
4. WaveSpeed, "What Is HappyHorse-1.0? The Mystery #1 AI Video Model." https://wavespeed.ai/blog/posts/what-is-happyhorse-1-0-ai-video-model/
5. Morphic, "Happy Horse 1.0 from Alibaba: Joint Video and Audio with 7-Language Lip-Sync." https://morphic.com/resources/models/happy-horse
6. Morningstar / PR Newswire, "fal Launches HappyHorse-1.0, the #1-Ranked AI Video Model, as Official API Partner," April 27, 2026. https://www.morningstar.com/news/pr-newswire/20260427sf45051/fal-launches-happyhorse-10-the-1-ranked-ai-video-model-as-official-api-partner
7. WaveSpeed, "Why Is HappyHorse-1.0 Suddenly #1 on Video Leaderboard?" https://wavespeed.ai/blog/posts/why-happyhorse-top-ai-video-leaderboard-2026/
8. Artificial Analysis, "Image to Video Leaderboard." https://artificialanalysis.ai/video/leaderboard/image-to-video
9. Artificial Analysis, "Text to Video Leaderboard." https://artificialanalysis.ai/video/leaderboard/text-to-video
10. fal, "How to use HappyHorse-1.0 in 2026?" https://fal.ai/learn/tools/how-to-use-happyhorse-1-0
