Alibaba's Qwen3.5-Omni Handles Text, Audio, and Video in Real Time

Published: April 1, 2026 at 12:35 AM

Updated: April 1, 2026 at 12:35 AM

What happened

Alibaba launched Qwen3.5-Omni, a multimodal model that processes text, images, audio, and video simultaneously. The flagship feature: sketch a user interface while talking, and get working front-end code in real time. The model handles up to 256,000 tokens natively (1 million via API) and decodes long documents 19 times faster than its predecessor. It supports web search, tool use, and live audio-visual translation. One catch: sources dispute whether the model is truly open source or proprietary.
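For developers, access to a model like this typically goes through an OpenAI-compatible chat-completions API, where a single message mixes typed content parts (text alongside an image, for instance). The sketch below shows that request shape; the endpoint URL, the `qwen3.5-omni` model identifier, and the exact field names are assumptions for illustration, not confirmed by this article, so check Alibaba's official documentation before use.

```python
import json
import urllib.request

# Assumed values -- verify against Alibaba's official API docs.
API_URL = "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions"
MODEL = "qwen3.5-omni"  # hypothetical model identifier


def build_omni_request(prompt: str, image_url: str, max_tokens: int = 1024) -> dict:
    """Build an OpenAI-style multimodal request mixing text and an image.

    The message body is a list of typed content parts, the common
    convention for OpenAI-compatible multimodal endpoints.
    """
    return {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def send(request_body: dict, api_key: str) -> dict:
    """POST the request; needs a valid API key and network access."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(request_body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Build (but don't send) a request mirroring the article's demo:
# a UI sketch plus a spoken/typed instruction, returning front-end code.
payload = build_omni_request(
    "Turn this UI sketch into working front-end code.",
    "https://example.com/sketch.png",
)
```

The long-context claim (256K native, 1 million via API) would matter here mainly for how much document or transcript content you can pack into `messages` before the request is rejected.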

Why it matters

Either way, Alibaba is making a serious play for the multimodal crown, putting pressure on OpenAI and Google to match real-time voice-vision capabilities.