What Is Wan 2.5? Unlock Audio-Visual Synced Video Magic
At the 2025 Apsara Conference in Hangzhou, Alibaba introduced the Wan 2.5 Preview model series, introducing audio-visual synced video generation for the first time and further empowering cinematic-level video creation.
Accurate Audio-Visual Sync
Wan 2.5 delivers precise audio-visual synchronization, generating sound effects, background music, ambient audio, and even ASMR based on prompts.
Wan 2.5 seamlessly aligns voice with on-screen visuals, perfectly matching lip movements and facial expressions while blending into the scene’s atmosphere. It also supports using audio as a reference input, enabling more accurate and context-aware video sound generation.
Native Multimodal Architecture
Unlike traditional models that handle only text or images, Wan 2.5 natively supports text, images, video, and audio as both inputs and outputs. This means creators are no longer limited to a single content format.
Through this multimodal approach, Wan 2.5 can understand and merge multiple content types, processing them together to generate coherent, audio-visual synced videos that align with the creative intent.
Longer Duration, Higher Quality
Wan 2.5 takes AI video generation to the next level, producing videos up to 10 seconds long at 1080p resolution and 24 frames per second for a cinematic feel in short-form content.
With this upgrade, the generated videos feature finer details and more complete content, allowing creators to produce scenes with richer storytelling, smoother motion, and more immersive visual experiences.
Enhanced Image Editing
Wan 2.5 offers powerful and versatile image generation and editing capabilities, including:
Bilingual charts and tables: create clear, professional charts in both Chinese and English.
Complex layouts: design multi-element graphics with precise arrangement.
Artistic text effects: generate stylized typography for posters or banners.
Flowcharts and architecture diagrams: visualize processes, structures, and systems seamlessly.
Wan 2.5 Prompt Guide: Audio Prompts Formula

To generate audio that perfectly matches your video, you only need to enhance your video prompts with detailed sound descriptions.
Prompt Formula:
Subject + Scene + Motion + Sound Description
(Sound description can include voice, sound effects, and background audio)
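As a rough illustration of the formula above, the sketch below assembles a prompt string from its four parts. The VideoPrompt class, its field names, and the sample text are hypothetical helpers made up for this example; they are not part of any Wan 2.5 API, and the comma-joined output is only one reasonable way to combine the parts.

```python
from dataclasses import dataclass


@dataclass
class VideoPrompt:
    """Hypothetical helper: Subject + Scene + Motion + Sound Description."""
    subject: str
    scene: str
    motion: str
    sound: str = ""  # voice, sound effects, and/or background audio

    def render(self) -> str:
        # Join the non-empty parts in formula order, separated by commas.
        parts = [self.subject, self.scene, self.motion, self.sound]
        return ", ".join(p.strip() for p in parts if p.strip())


prompt = VideoPrompt(
    subject="A street musician",
    scene="on a rainy sidewalk at night",
    motion="strumming an acoustic guitar",
    sound="soft guitar chords with rain pattering in the background",
)
print(prompt.render())
```

Leaving the sound field empty yields an ordinary video prompt, which makes it easy to compare results with and without a sound description.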
Voice Prompt Structure:
Content + Emotion + Tone + Speed + Timbre + Accent
Example:
"An alarm clock confidently says, 'Alibaba has launched a new model, go try it now!' with excitement, moderate speed, clear voice."
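The voice structure above can also be filled in programmatically. This is a minimal sketch, assuming a hypothetical voice_description function whose parameter names simply mirror the structure's fields; it is not an official Wan 2.5 interface, and any field can be left out.

```python
def voice_description(speaker, content, *, emotion="", tone="",
                      speed="", timbre="", accent=""):
    """Hypothetical helper: Content + Emotion + Tone + Speed + Timbre + Accent."""
    # Keep only the qualifiers that were actually provided, in structure order.
    qualifiers = [q for q in (emotion, tone, speed, timbre, accent) if q]
    sentence = f"{speaker} says, '{content}'"
    if qualifiers:
        sentence += " with " + ", ".join(qualifiers)
    return sentence + "."


print(voice_description(
    "An alarm clock",
    "Alibaba has launched a new model, go try it now!",
    emotion="excitement",
    tone="a confident tone",
    speed="moderate speed",
    timbre="clear voice",
))
```

The resulting sentence can be used as the sound description part of a full video prompt.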
Sound Effect Prompt Structure:
Material + Action + Ambient Sound
Example:
"Eggshell cracking, egg falling into a hot pan, producing a 'sizzle' sound, with faint kitchen hood noise in the background."
Background Music Prompt Structure:
Visual Context + Style + Background Music or Sound
Example:
"At dusk, the sun is about to set below the horizon, accompanied by mysterious background music."
Wan 2.5 is ideal for anyone who wants to produce cinematic-level audio-visual content with ease. Whether you’re looking to enhance storytelling, streamline content creation, or explore new creative formats, Wan 2.5 unlocks new possibilities in short films, advertising, e-learning, gaming, and creative projects. Discover the potential of Wan 2.5 and start creating immersive, cinematic-quality videos now!