Krux
April 1, 2026
Alibaba's Qwen3.5-Omni Handles Text, Audio, Video in Real Time
Published: April 1, 2026 at 12:35 AM
Updated: April 1, 2026 at 12:35 AM
What happened
Alibaba launched Qwen3.5-Omni, a multimodal model that processes text, images, audio, and video simultaneously. The flagship feature: sketch a user interface while talking, and get working front-end code in real time. The model handles up to 256,000 tokens of context natively (1 million via API) and decodes long documents 19 times faster than its predecessor. It also supports web search, tool use, and live audio-visual translation. One catch: sources disagree on whether the model is genuinely open-source or proprietary.
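To make the sketch-to-code feature concrete, here is a minimal sketch of how a developer might package a UI drawing plus an instruction into a single multimodal request, assuming an OpenAI-compatible chat format of the kind Alibaba's cloud APIs have offered for earlier Qwen models. The model identifier "qwen3.5-omni" and the exact request shape are assumptions for illustration, not confirmed details from the launch.

```python
import base64
import json

def build_ui_to_code_request(sketch_png: bytes, instruction: str) -> dict:
    """Pair a UI sketch image with a text instruction so a multimodal
    model can return front-end code for it (request payload only;
    nothing is sent over the network here)."""
    b64 = base64.b64encode(sketch_png).decode("ascii")
    return {
        "model": "qwen3.5-omni",  # assumed model identifier
        "messages": [
            {
                "role": "user",
                "content": [
                    # The sketch, inlined as a base64 data URL
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                    # The accompanying instruction (typed or transcribed speech)
                    {"type": "text", "text": instruction},
                ],
            }
        ],
        # Streaming is what makes "real time" possible: the front-end
        # code arrives token by token as the model generates it.
        "stream": True,
    }

payload = build_ui_to_code_request(b"\x89PNG...", "Turn this sketch into HTML/CSS.")
print(json.dumps(payload)[:80])
```

In practice the payload would be POSTed to the provider's chat-completions endpoint and the streamed response consumed chunk by chunk; the point of the sketch is simply that image and text travel together in one user turn.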
Why it matters
Either way, Alibaba is making a serious play for the multimodal crown, putting pressure on OpenAI and Google to match real-time voice-vision capabilities.