
The AI video sector has cooled recently: Seedance 2.0 is embroiled in copyright disputes and OpenAI has shut down Sora, casting a shadow over the field.
Just at this moment, Alibaba brought out a dark horse.
In April 2026, HappyHorse-1.0 topped the Artificial Analysis rankings, surpassing competitors like ByteDance and Kuaishou in both text-to-video and image-to-video (without audio) categories.
Zhang Di returned to Alibaba in November 2025 as head of Taotian Group's Future Life Lab, reporting directly to Alimama CTO Zheng Bo.
In other words, it only took Zhang Di about five months from his return to making a name for himself.
The key point is that, like Qwen, HappyHorse also released an open-source version that can be used commercially.
What is Qwen’s position within Alibaba now? It serves as the core general large model foundation for the entire Alibaba Group and is the absolute central carrier of its AI strategy. Everything Alibaba does now revolves around Qwen.
Therefore, the significance of HappyHorse to Alibaba might go far beyond being just a model showcasing technical prowess.
However, before understanding Alibaba's thinking, we should first talk about who Zhang Di is.
01
From Alibaba to Kuaishou and back to Alibaba
Zhang Di graduated from Shanghai Jiao Tong University with a degree in computer science, completing both his bachelor's and master's consecutively. After graduating in 2010, he joined Alibaba and was responsible for the big data and machine learning engineering architecture of Alimama for a long time.
Alimama focuses on advertising, recommendations, search, and conversion, all powered by massive-scale data, large-scale distribution, and complex engineering systems. These might not sound as exciting as large models, but they are precisely where Chinese internet companies train AI talent.
Many people who can turn models into products don't come purely from labs. They have earlier experience with systems like search, recommendation, advertising, and content distribution.
Let me give you a few examples to make it clear. Google CEO Sundar Pichai rose through Google Toolbar and Chrome, both products built around search and distribution, while Microsoft CEO Satya Nadella led Bing search and Microsoft's online advertising business before moving to cloud.
These systems handle massive amounts of user behavior every day and require models to operate stably in real-world business environments. Engineers can't just create a visually appealing demo; they are forced to build something genuinely useful, repeatedly making trade-offs between latency, cost, effectiveness, and feedback.
During Zhang Di’s ten years at Alibaba, he largely worked in such an environment. Back then, the outside world didn’t call everything a 'large model,' but internally, Alibaba had already established a training ground centered around data, algorithms, and engineering practices.
In 2020, Zhang Di left Alibaba and joined Kuaishou.
By that time, short-video platforms had moved from competing on traffic to competing on technology. At Kuaishou, Zhang Di served as Vice President of Technology and head of the large model and multimedia technology team, eventually leading foundational architecture R&D and application deployment for the Kling large model.

Kling is of great significance to Kuaishou.
Kuaishou has evolved from being a 'content distribution platform' in the past to becoming a 'content production infrastructure provider,' establishing a complete closed loop of 'creative generation - video production - one-click distribution - traffic monetization - data iteration'.
In April 2025, Kuaishou established the Kling AI Division, upgraded to a top-level company department reporting directly to CEO Cheng Yixiao, placing it on par with the main short-video business.
That is why his brief stint at Bilibili starting in September 2025, followed by a return to Alibaba two months later, is hard to read as ordinary talent migration.
Bilibili needs video technology, and so does Alibaba, but Alibaba's requirements are more complex.
For Kuaishou, video generation mainly involves distribution. However, if Alibaba engages in video generation, the associated links become much more extensive, including e-commerce, advertising, live streaming, cloud services, and overseas merchants.
As previously mentioned, after Zhang Di returned to Alibaba in November 2025, he took charge of Taotian's Future Life Lab at level P11.
The arrangement itself is telling. Alibaba did not park the video model inside a pure research department; instead, it sits close to Taobao, the group's transactional hub.
In other words, HappyHorse was conceived as a product emphasizing practical implementation and closely tied to Alibaba's existing ecosystem.
Five months later, HappyHorse emerged.
This pace is indeed fast. Alibaba gave Zhang Di a new business scenario and a new team, and he once again built out a complete video-model pipeline.
He neither entered the AI video field from scratch nor was simply parachuted into Alibaba from outside.
His career path resembles a line that loops out and then loops back. First, he learned how large-scale commercial systems operate at Alibaba, then moved to Kuaishou to turn video generation into a product, and subsequently returned to Alibaba to integrate this capability into a larger commercial machine.
Many companies are scrambling for large model talent, but what is truly scarce are individuals who can simultaneously understand models, business, and organizations.
There are plenty of people who know how to train models, and plenty who can talk strategy, but what's difficult is finding someone who knows where each step—from technical roadmap to architecture design, training inference, product output, and finally adoption by merchants and users—might get stuck.
HappyHorse has brought Zhang Di back into the spotlight and provided Alibaba’s relatively fragmented AI narrative over the past few years with a more concrete character entry point.
02
How open-source models can defeat closed-source giants
What truly drew attention to HappyHorse was how suddenly its victory arrived.
In the video generation space, overseas competitors include Runway, Pika, Luma, and Google's Veo, while domestically there are ByteDance's Seedance and Kuaishou's Kling. Alibaba didn't even make the list.
So when HappyHorse first topped the charts, many assumed it came from some startup rather than from Alibaba.
HappyHorse ranks in the top tier in both text-to-video and image-to-video categories, with an Elo score of 1333 for text-to-video and 1392 for image-to-video.
The Artificial Analysis leaderboard is dynamic, driven by blind user-preference tests, so the scores shown on the page shift over time. Even so, HappyHorse has outperformed a number of earlier, well-known closed-source models in those preference tests.
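Leaderboards built on blind pairwise votes typically aggregate them with an Elo-style rating, which is why the scores drift as new votes arrive. A minimal sketch of the standard Elo update (the K-factor and the exact aggregation Artificial Analysis uses are assumptions for illustration, not published details):

```python
# Elo-style update after one blind pairwise preference vote.
# K-factor and formula are the textbook Elo scheme, assumed here for
# illustration; the leaderboard's real aggregation may differ.

def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    """Return updated ratings after one vote in favor of `r_winner`."""
    expected_win = 1.0 / (1.0 + 10.0 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)  # winner gains exactly what the loser loses
    return r_winner + delta, r_loser - delta

# One hypothetical vote between two models rated 1333 and 1392:
# an upset win by the lower-rated model moves both ratings noticeably.
new_a, new_b = elo_update(1333.0, 1392.0)
print(new_a > 1333.0, new_b < 1392.0)  # winner rises, loser falls
```

Because the total rating is conserved per vote, a model's position only climbs if voters keep preferring its clips head-to-head.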
This is quite unusual. Typically, video generation is one of the most resource-intensive areas in terms of money, data, and computational power.
Large closed-source companies can conceal their data, model details, inference systems, and product experience within their own platforms, enabling continuous internal iteration.
Open-source models face more practical constraints, as their parameters must be publicly available, inference must be executable, the community must be able to reproduce results, and performance must withstand cross-comparison.
Therefore, prior to HappyHorse, most open-source video models were rudimentary toys that generated unstable videos with frequent character drifts.
HappyHorse features a 15-billion-parameter, 40-layer unified self-attention Transformer architecture that integrates tokens from three modalities—text, video, and audio—into a single sequence for joint modeling.
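The core idea of that unified design is that text, video, and audio tokens live in one sequence and pass through the same self-attention, rather than separate encoders joined by cross-attention. A toy sketch with made-up dimensions (the real model is reported as 15B parameters and 40 layers; everything below is illustrative):

```python
import numpy as np

# Illustrative sizes only; the actual tokenizers and dimensions are unknown.
d = 64
rng = np.random.default_rng(0)

# Token sequences from three modalities (lengths are arbitrary here).
text  = rng.standard_normal((8,  d))   # text tokens
video = rng.standard_normal((16, d))   # video patch tokens
audio = rng.standard_normal((6,  d))   # audio frame tokens

# Modality embeddings let one shared Transformer tell the streams apart.
mod_emb = rng.standard_normal((3, d))
seq = np.concatenate([
    text  + mod_emb[0],
    video + mod_emb[1],
    audio + mod_emb[2],
])  # one joint sequence of shape (8 + 16 + 6, d)

def self_attention(x, d_k=d):
    """Single-head scaled dot-product self-attention over the full sequence.
    In a unified design there is no cross-attention: every token (text,
    video, audio) attends to every other token in one pass."""
    scores = x @ x.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

out = self_attention(seq)
print(out.shape)  # (30, 64)
```

Because audio tokens attend directly to the video tokens of the same moment, alignment between mouth motion and sound is learned inside one model instead of being stitched together afterward.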
This approach is very similar to Qwen, which explains why Zhang Di was able to develop HappyHorse in just five months, likely leveraging high-quality native multimodal training methods inherited from Qwen.
Non-native multimodal video generation models like Sora often exhibit issues such as lips moving while the sound lags behind. Sometimes characters display exaggerated expressions but with mismatched tones, or they may act before the sound is produced.
The reason HappyHorse scores highly lies in its use of native multimodality to address these problems.
HappyHorse natively supports lip-sync in multiple languages including English, Mandarin, Cantonese, Japanese, Korean, German, and French, with a lip-sync word error rate reported to be competitive among comparable open-source models.
Why is Zhang Di doing this? My understanding is that if Alibaba wants this video generation technology to enter advertising, e-commerce, short dramas, education, and live streaming, it can't rely solely on good visuals.
It needs to be able to talk, add voiceovers, and make the sound and visuals work together seamlessly.
Another key point is cost and speed.
HappyHorse takes about 38 seconds to generate a 5-second 1080p video on a single H100 GPU, using DMD-2 distillation to cut the number of denoising steps to 8.
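The reported wall time translates directly into unit economics. A back-of-envelope calculation (the H100 hourly price and the pre-distillation step count below are assumptions for illustration; only the 38-second figure and the 8 steps come from the article):

```python
# Back-of-envelope economics for the reported generation speed.
GEN_SECONDS = 38.0          # reported: one 5s 1080p clip on one H100
H100_USD_PER_HOUR = 3.0     # hypothetical on-demand GPU rate, an assumption

clips_per_gpu_hour = 3600.0 / GEN_SECONDS
cost_per_clip = GEN_SECONDS / 3600.0 * H100_USD_PER_HOUR

# DMD-2-style distillation cuts denoising steps; a typical undistilled
# sampler might use ~50 steps (assumed), versus the reported 8.
step_speedup = 50 / 8

print(f"~{clips_per_gpu_hour:.0f} clips per GPU-hour, ~${cost_per_clip:.3f} each")
```

At roughly three cents per clip under these assumptions, batch-generating dozens of candidate materials per product stops being a budgeting decision and becomes routine.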
This is an unavoidable hurdle for the commercialization of video generation. No matter how good the model's performance is, if generating a short video is too costly or requires too long a wait, it will be difficult to integrate into merchants' daily workflows.

Merchants won't wait half a day for each product, nor will they pay excessive costs for dozens of test materials.
Therefore, the significance of HappyHorse lies not only in 'being able to generate' but also in its attempt to push generation speed and inference costs into a usable range.
For developers, open-sourcing means they can self-host, fine-tune, and integrate the model into their own products. For platforms, open-sourcing also brings more community feedback.
The progress of a closed-source model mainly depends on the internal team of a company, while an open-source model will be used by developers for various unusual tests, exposing issues quickly and creating more directions for improvement.
The video arena of Artificial Analysis uses user preference voting. Often, it's not just about looking at a single technical indicator but rather which of the two videos users prefer.
Of course, Zhang Di should not be too proud. Topping the list once does not mean permanent leadership.
Competitors won't stay in place. HappyHorse has only won an open test, not the entire war.
If HappyHorse is merely a model that can top charts, its significance is limited. However, if it can become the foundational video generation model used by both Alibaba Cloud and Taotian businesses, it will become an entry point.
Therefore, the most interesting aspect of HappyHorse defeating closed-source giants is not just about leading in scores. What truly deserves attention is that it allows Alibaba to find a way to re-enter the video generation arena.
It didn't first create an app for end consumers, nor did it only conduct internal demonstrations; instead, it directly put the open-source model under full industry scrutiny.
This victory may not last long, but Zhang Di has changed outsiders' perception of Alibaba's video generation models.
The new question becomes: where does Alibaba plan to apply this capability?
03
The significance of HappyHorse to Alibaba
The most direct application of HappyHorse is e-commerce.
In the past, when people talked about AI-generated videos, they most easily thought of films, short dramas, big-budget advertisements, and creator tools. Indeed, these are all substantial markets, but they are still somewhat distant from Alibaba's core business.
Alibaba's strength does not lie in building its own video community or in having ordinary users open an AI video app every day to pass time. Where Alibaba truly excels is in its possession of China’s densest network of products, merchants, transactions, and advertising systems.
This is also why many people are paying attention to the fact that HappyHorse was born in Taotian Group's 'Future Life Lab'.
Taotian faces daily questions like how merchants can sell goods, how products can be seen, why users click in, and why they place orders. With HappyHorse placed here, it naturally raises the question of whether it can improve the efficiency of product content creation, boost conversion rates, and help the platform generate more business.
For an ordinary merchant, video content has always been a troublesome issue.
To shoot a 30-second product video, you need to find a location, hire models, set up lighting, edit, and add voiceovers. Big brands can afford to hire teams, but small and medium-sized merchants often have to manage with limited resources.
Many product selling points are not complicated; the problem lies in the fact that no one has visually captured those selling points. On plain white backgrounds, they may look ordinary, but once placed in a specific scenario, users realize what they can be used for.
Recently overseas, solar-powered fountain pumps became a massive hit. Originally just small garden items with modest effects, when packaged by AI videos as birdbaths, fish ponds, and children’s bathtubs with cool water-spraying features, everyone rushed to buy them.

AI hasn’t changed the product itself, but it has changed the way users understand it. It turns 'functional descriptions' into 'usage scenarios.'
This directly addresses the pain point of e-commerce content.
Product pages stuffed with specifications rarely hold users' patience; even after a livestream host talks at length, users may still not trust the pitch. But a short video of about ten seconds can make the usage scenario clear, and the conversion efficiency can be much higher.
More importantly, AI videos can be generated in batches. Merchants can create children’s versions, family versions, holiday versions, outdoor versions for the same product, or generate different languages, characters, and scenarios for different countries.
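The batch-generation workflow described above is essentially a Cartesian product over a variant matrix. A sketch of how a merchant tool might enumerate the prompts (the axes and naming are hypothetical, not an actual Alibaba API):

```python
from itertools import product

# Hypothetical variant matrix for one product listing.
scenes    = ["garden", "balcony", "kids_pool"]
languages = ["en", "es", "de"]
themes    = ["everyday", "holiday"]

# One prompt per combination, ready to feed a video generation model.
prompts = [
    f"{scene} scene, {lang} voiceover, {theme} theme"
    for scene, lang, theme in product(scenes, languages, themes)
]
print(len(prompts))  # 3 * 3 * 2 = 18 variants for batch generation
```

The platform can then run all variants through real ad placement and keep whichever converts best, which is the feedback loop the rest of this section describes.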
The significance of this to Alibaba is greater than simply creating a video generation tool. Whether it’s Taobao or Tmall, both have a large number of merchants as well as substantial product data and transaction feedback.
If an AI video tool only knows how to generate beautiful visuals, it will quickly become mere material software; if it knows under which scenarios the product is more likely to be clicked, what copywriting is more likely to drive cart additions, and what video opening seconds are more likely to retain users, it will approach becoming part of an e-commerce operating system.
What sets Alibaba apart from other video generation model companies is precisely this feedback loop.
Product images, detail pages, reviews, Q&A, search terms, click-through rates, cart addition rates, refund reasons, live streaming dwell times—these items may seem fragmented, but they all serve as fuel for training e-commerce content capabilities.
If HappyHorse taps into this feedback, it can evolve from 'helping merchants generate a video' to 'helping merchants generate videos that are more likely to sell products'.
For Taobao and Tmall, it can produce main image videos, product scenario shorts, live stream clips, virtual hosts, and marketing materials.
In the past, when merchants launched new products, they might have only uploaded a few pictures or, at most, shot a rough short video. In the future, they can hand over product images, selling points, reviews, and audience tags to the system, allowing it to generate multiple versions of videos, then use real-world placement and transaction data to screen out the most effective one.
If this process runs smoothly, the platform's content supply will significantly increase, and the content threshold for small and medium-sized merchants will also decrease.
However, AI video marketing comes with risks. It can amplify selling points, but it may also magnify illusions. A fountain pump might appear to spray very high in an AI-generated video, but in reality, it cannot achieve such an effect.
Alibaba's opportunity should not lie in allowing merchants to use AI to create fantasies. The focus should be on product specifications, real-shot materials, buyer reviews, and platform verification to ensure that the generated content has boundaries.
In late March, OpenAI announced the shutdown of the standalone Sora application and related APIs. The reason was pragmatic: video generation is too costly, and user retention could not justify the expenses. OpenAI decided to redirect its computing power towards coding, enterprise services, and robotics.
Sora stumbled over commercial calculations.
ByteDance encountered trouble on another front. Although Seedance 2.0 also delivered impressive results, ByteDance suspended its global launch due to copyright issues.
The stronger a model becomes, the more likely it is to run into quagmires involving copyright, portrait rights, and training data.
Looking now at HappyHorse, developed under Zhang Di’s leadership, it features a clear commercial application. Moreover, Alibaba’s collection of product images, merchant materials, real-shot videos, and transaction feedback naturally makes it more suitable for controlled generation than film IPs.
Therefore, the value of HappyHorse extends beyond just rankings. It provides AI video with a more stable foothold.
