Alibaba unveils Wan2.2-S2V, open-source speech-to-video tool
Alibaba has released Wan2.2-S2V, an open-source speech-to-video model for generating digital human videos.
The Wan2.2-S2V model enables the creation of high-quality animated videos from a single portrait photo and an audio clip. It produces film-quality avatars that can speak, sing, and deliver other performances, catering to a range of professional and creative needs.
Belonging to Alibaba's Wan2.2 video generation series, Wan2.2-S2V supports a variety of framing options including portrait, bust, and full-body perspectives. The model can dynamically generate character actions and environmental elements based on user-provided prompts, allowing content creators to guide visual representations and adapt character performances for specific storytelling or design needs.
Using advanced audio-driven animation techniques, Wan2.2-S2V delivers lifelike motion for characters, supporting everything from conversational dialogue to musical performances. It can animate multiple characters within a scene, interpreting vocal input to inform movement and expression, and it accommodates a range of avatar styles, including cartoon characters, animals, and other stylised figures.
To support the varied output needs of professional creators, Wan2.2-S2V provides output resolutions of 480p and 720p, ensuring compatibility with both social media formats and professional presentation standards.
Technical approach
Wan2.2-S2V employs a combination of text-guided global motion control and audio-driven fine-grained local movement to provide expressive performances, particularly in complex or challenging animation scenarios. This approach is designed to move beyond traditional talking-head animation, giving creators more diverse and realistic results.
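To illustrate the general pattern of combining two conditioning signals, the sketch below shows a denoising block in which frame latents attend to text prompt tokens for global motion and to time-aligned audio features for local detail. All module names, shapes, and the overall structure are hypothetical; Alibaba has not published this as Wan2.2-S2V's implementation.

```python
import torch
import torch.nn as nn

class DualConditionBlock(nn.Module):
    """Illustrative denoising block: text tokens steer global motion via
    cross-attention, while per-frame audio features inject local detail.
    Hypothetical structure, not the published Wan2.2-S2V architecture."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, frame_latents, text_tokens, audio_features):
        # Global control: every frame latent attends to the prompt tokens.
        h, _ = self.text_attn(self.norm1(frame_latents), text_tokens, text_tokens)
        frame_latents = frame_latents + h
        # Local control: frame latents attend to time-aligned audio features
        # (e.g. speech embeddings) for lip sync and micro-expressions.
        h, _ = self.audio_attn(self.norm2(frame_latents), audio_features, audio_features)
        return frame_latents + h

# Toy shapes: 16 frame latents, 8 prompt tokens, 64 audio frames.
block = DualConditionBlock()
frames = torch.randn(1, 16, 256)
text = torch.randn(1, 8, 256)
audio = torch.randn(1, 64, 256)
print(block(frames, text, audio).shape)  # torch.Size([1, 16, 256])
```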
The model introduces a new frame processing method in which a history of animation frames, however long, is compressed into a single, compact latent representation. This technique reduces computational demands and, in turn, supports more stable long-form video generation, which has been a persistent challenge within extended animated content production.
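One common way to realise this kind of compression is attention pooling with a learned query, where any number of past frame latents is summarised into one fixed-size vector. The sketch below shows that idea only; it is an assumed illustration, not Wan2.2-S2V's actual method.

```python
import torch
import torch.nn as nn

class HistoryCompressor(nn.Module):
    """Sketch: compress an arbitrary-length frame history into one
    fixed-size latent via a learned query (perceiver-style pooling).
    Hypothetical illustration, not Wan2.2-S2V's published technique."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # one summary slot
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, history_latents: torch.Tensor) -> torch.Tensor:
        # history_latents: (batch, n_frames, dim) with any n_frames.
        b = history_latents.size(0)
        q = self.query.expand(b, -1, -1)
        summary, _ = self.attn(q, history_latents, history_latents)
        # (batch, 1, dim): the cost of conditioning on history no longer
        # grows with the number of past frames.
        return summary

compressor = HistoryCompressor()
for n_frames in (8, 120, 1000):          # arbitrary history lengths
    h = torch.randn(2, n_frames, 256)
    print(n_frames, compressor(h).shape)  # always torch.Size([2, 1, 256])
```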
The training methodology for Wan2.2-S2V incorporates a large-scale audio-visual dataset targeted at film and television production environments. Alibaba reports that a multi-resolution training approach was used, enabling flexible output formats spanning both vertical short-form content and more traditional horizontal film and television productions.
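Multi-resolution training is often implemented by drawing each batch from a randomly chosen resolution "bucket", so the model sees both vertical and horizontal aspect ratios. The bucket sizes and loader below are illustrative guesses, not Alibaba's published training configuration.

```python
import random
import torch

# Illustrative resolution buckets covering vertical short-form and
# horizontal film aspect ratios (assumed values, not Alibaba's config).
BUCKETS = [(480, 832), (832, 480), (720, 1280), (1280, 720)]  # (H, W)

def sample_batch(batch_size: int = 2, frames: int = 16) -> torch.Tensor:
    h, w = random.choice(BUCKETS)
    # Stand-in for loading real clips resized/cropped to this bucket.
    return torch.randn(batch_size, frames, 3, h, w)

for step in range(4):
    batch = sample_batch()
    print(f"step {step}: batch shape {tuple(batch.shape)}")
```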
Open-source rollout
Wan2.2-S2V is now available for download through Hugging Face, GitHub, and Alibaba Cloud's open-source community, ModelScope. This continues Alibaba's series of open-source releases; Wan2.1 was shared in February 2025 and Wan2.2 in July. To date, the Wan generative video model series has attracted more than 6.9 million downloads on Hugging Face and ModelScope.
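For creators who want to try the model, weights hosted on Hugging Face can typically be fetched with the standard huggingface_hub client, as in the short sketch below. The repository id shown is an assumption based on the Wan-AI organisation's naming; confirm the exact name on the Hugging Face or ModelScope model page.

```python
# Fetching model weights with the standard huggingface_hub client.
from huggingface_hub import snapshot_download

# Repo id is an assumed example; check the model page for the exact name.
local_dir = snapshot_download(repo_id="Wan-AI/Wan2.2-S2V-14B")
print("Model files downloaded to:", local_dir)
```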
"Wan2.2-S2V offers versatile character animation capabilities, enabling the creation of videos across multiple framing options including portrait, bust, and full-body perspectives. It can generate character actions and environmental factors dynamically based on prompt instructions, allowing professional content creators to capture precise visual representations tailored to specific storytelling and design requirements."
Another statement from the release explained the technical foundation:
"Wan2.2-S2V transcends traditional talking-head animations by combining text-guided global motion control with audio-driven fine-grained local movements. This enables natural and expressive character performances across complex and challenging scenarios."
The release further noted the benefits of the frame processing approach:
"By compressing historical frames of arbitrary length into a single, compact latent representation, the technology significantly reduces computational overhead. This approach allows for remarkably stable long-video generation, addressing a critical challenge in extended animated content production."
Alibaba's research team highlighted the unique training approach:
"Alibaba's research team constructed a large-scale audio-visual dataset specifically tailored to film and television production scenarios. Using a multi-resolution training approach, Wan2.2-S2V supports flexible video generation across diverse formats – from vertical short-form content to traditional horizontal film and television productions."
As with its previous models, Alibaba has positioned Wan2.2-S2V for open-source community access, providing tools to both professional and independent creators in the global marketplace.