In a move that has caught the global AI community off guard, Alibaba has officially unmasked itself as the developer behind HappyHorse-1.0, a mysterious video-generation model that quietly topped global leaderboards earlier this week.
The model, which debuted anonymously on the Artificial Analysis Video Arena, surged to the #1 spot in both text-to-video and image-to-video categories, outperforming established heavyweights like ByteDance’s Seedance 2.0 and Google’s Veo 3.1.
1. The “Mystery Model” Reveal
For the first 48 hours of its existence, “HappyHorse” was a total enigma. The name—a nod to the Year of the Horse (2026) and a play on the surname of Alibaba founder Jack Ma (“Ma” means “horse” in Chinese)—led to intense speculation before Alibaba’s Token Hub (its new AI-focused unit) claimed credit today.
- The Pedigree: The model was developed by an independent team within Alibaba’s Taotian Group Future Life Lab, led by Zhang Di, the former VP of Kuaishou and the technical architect behind the famous Kling AI.
- The Strategy: By releasing the model anonymously, Alibaba allowed “blind” user preference tests to prove its quality without the influence of brand bias.
2. Performance: A “Clean Sweep”
On the Artificial Analysis leaderboard—the “gold standard” for AI evaluation—HappyHorse-1.0 achieved a record-breaking Elo score, leading the second-place model by a significant margin.
| Category | HappyHorse-1.0 Rank | Elo Score | Nearest Competitor (Elo) |
|---|---|---|---|
| Text-to-Video | #1 | 1379 | ByteDance Seedance 2.0 (1273) |
| Image-to-Video | #1 | 1392 | Kling 3.0 (1242) |
| Audio-Sync Video | #2 | 1205 | Google Veo 3.1 (1219) |
- Cinematic Quality: Users in blind tests favored HappyHorse nearly 90% of the time over competitors, citing its “film-like” lighting, nuanced textures, and superior physical consistency (e.g., realistic liquid splashes and hair movement).

3. Native Audio-Video Synchronization
Unlike older models that “add” sound after the video is generated, HappyHorse uses a unified single-stream Transformer architecture.
- One-Pass Generation: It generates video and audio tokens simultaneously in a single pass, leading to perfect synchronization (e.g., the sound of a foot hitting ice occurs exactly as the ice cracks).
- Multilingual Lip-Sync: It natively supports 7 languages (Mandarin, Cantonese, English, Japanese, Korean, French, and German). It doesn’t just dub; it adapts the mouth’s physical movement to the phonetic nuances of each language.
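The single-stream idea can be illustrated with a minimal sketch: video and audio tokens are interleaved per frame into one sequence, so every layer of one transformer attends across both modalities at once. All names, dimensions, and the interleaving scheme below are illustrative assumptions, not Alibaba’s published design:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a unified single-stream audio-video transformer.
# Toy dimensions throughout; not HappyHorse's actual architecture.
D_MODEL = 64
VIDEO_TOKENS_PER_FRAME = 4   # assumed patch tokens per video frame
AUDIO_TOKENS_PER_FRAME = 1   # assumed audio tokens per frame interval

class SingleStreamAV(nn.Module):
    def __init__(self):
        super().__init__()
        self.video_proj = nn.Linear(16, D_MODEL)  # toy video patch embedding
        self.audio_proj = nn.Linear(8, D_MODEL)   # toy audio frame embedding
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.video_head = nn.Linear(D_MODEL, 16)
        self.audio_head = nn.Linear(D_MODEL, 8)

    def forward(self, video_patches, audio_frames):
        # video_patches: (B, F, VIDEO_TOKENS_PER_FRAME, 16)
        # audio_frames:  (B, F, AUDIO_TOKENS_PER_FRAME, 8)
        B, F, _, _ = video_patches.shape
        v = self.video_proj(video_patches)
        a = self.audio_proj(audio_frames)
        # Interleave per frame: [v_f0..., a_f0, v_f1..., a_f1, ...]
        # so each step is conditioned on both modalities jointly.
        stream = torch.cat([v, a], dim=2).reshape(B, -1, D_MODEL)
        h = self.backbone(stream)
        h = h.reshape(B, F, VIDEO_TOKENS_PER_FRAME + AUDIO_TOKENS_PER_FRAME, D_MODEL)
        video_out = self.video_head(h[:, :, :VIDEO_TOKENS_PER_FRAME])
        audio_out = self.audio_head(h[:, :, VIDEO_TOKENS_PER_FRAME:])
        return video_out, audio_out

model = SingleStreamAV()
video = torch.randn(2, 3, VIDEO_TOKENS_PER_FRAME, 16)
audio = torch.randn(2, 3, AUDIO_TOKENS_PER_FRAME, 8)
v_out, a_out = model(video, audio)
print(v_out.shape, a_out.shape)
```

Because the two token types share one attention stream rather than being fused after the fact, the sound of an event and its visual frame are generated from the same context, which is the claimed source of the tight synchronization.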
4. Open-Source vs. API Access
In a major disruption of the “closed-source” trend led by OpenAI (Sora) and Google (Veo), Alibaba is taking a hybrid approach.
- Open Weights: Alibaba has pledged to make the 15-billion-parameter model weights publicly available on GitHub (scheduled for full release today, April 10).
- Speed: The model is highly optimized, requiring only 8 inference steps to generate high-quality clips. A single NVIDIA H100 can generate a 1080p cinematic clip in roughly 38 seconds.
- Commercial API: For enterprises without massive GPU clusters, Alibaba’s Token Hub is launching a paid API in the “near future.”
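The “8 inference steps” claim points to few-step sampling: instead of running hundreds of denoising iterations, the sampler walks a short, fixed noise schedule. The sketch below shows the generic shape of such a loop; the toy denoiser, schedule, and latent shape are assumptions for illustration, not HappyHorse’s actual sampler:

```python
import numpy as np

def toy_denoiser(x, sigma):
    # Stand-in for the model: pull the noisy sample toward the origin.
    # A real denoiser would predict the clean latent from (x, sigma).
    return x / (1.0 + sigma)

def few_step_sample(shape, num_steps=8, sigma_max=10.0, sigma_min=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Log-spaced noise levels from high to low, as few-step samplers use.
    sigmas = np.geomspace(sigma_max, sigma_min, num_steps)
    x = rng.standard_normal(shape) * sigma_max
    for i, sigma in enumerate(sigmas):
        denoised = toy_denoiser(x, sigma)
        next_sigma = sigmas[i + 1] if i + 1 < num_steps else 0.0
        # Euler-style update: step from the current sample toward the
        # denoised estimate, scaled by the ratio of noise levels.
        x = denoised + (x - denoised) * (next_sigma / sigma)
    return x

# Eight steps total, matching the article's claim for the real model.
clip_latent = few_step_sample(shape=(4, 8, 8), num_steps=8)
print(clip_latent.shape)
```

The speed advantage comes directly from the step count: with 8 forward passes instead of the 50–100 typical of undistilled diffusion samplers, per-clip latency drops roughly an order of magnitude on the same hardware.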
5. Why the Name “HappyHorse”?
The name is part of a 2026 trend of Chinese tech firms using “Zodiac-themed” stealth launches.
- The Joke: With 2026 being the Year of the Horse, “HappyHorse” (Kuàilè Mǎ) is a lighthearted reference to the success of the project.
- The Trend: Earlier this year, Tencent-backed developers used a similar “Pony Alpha” moniker for a coding model, signaling a new era of playful, stealthy competition between China’s tech giants.
“HappyHorse-1.0 proves that true innovation doesn’t need a closed-source wall,” the development team stated. “By focusing on real user preference rather than benchmark hype, we’ve set a new standard for accessible, high-performance video.”