Mistral's Voxtral Understands Speech Without Transcribing First

Published: March 28, 2026 at 12:38 AM

Updated: March 28, 2026 at 12:38 AM

100-word summary

Mistral released Voxtral, its first open-source speech model, with a twist: it skips transcription and understands audio directly. The model can summarize a 30-minute meeting or trigger actions from voice commands without converting speech to text first. It handles 13 languages and costs $0.001 per minute via API, undercutting closed alternatives. Both versions (24B for servers, 3B for phones) are available under Apache 2.0 on Hugging Face. Companies worried about sending call recordings to third parties can run it on their own hardware. The catch: you'll need someone who knows how to wrangle a 24-billion-parameter model on-premises.

What happened

Why it matters

The catch: you'll need someone who knows how to wrangle a 24-billion-parameter model on-premises.
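
To give a rough sense of what running it on-premises involves, here is a back-of-the-envelope sketch of the memory needed just to hold the weights of each version. The bytes-per-parameter figures are assumptions for illustration (2 bytes for 16-bit weights, 0.5 bytes for a hypothetical 4-bit quantized copy), not numbers published by Mistral, and the estimate leaves out activations, audio buffers, and runtime overhead.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes.

    bytes_per_param is an assumption about the storage precision:
    2.0 for 16-bit weights, 0.5 for a 4-bit quantized copy.
    """
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / 1e9


if __name__ == "__main__":
    # Parameter counts come from the article; precisions are assumed.
    for name, size in [("24B", 24), ("3B", 3)]:
        for label, width in [("16-bit", 2.0), ("4-bit", 0.5)]:
            gb = weight_memory_gb(size, width)
            print(f"{name} at {label}: about {gb:.1f} GB for weights")
```

Under these assumptions the 24B version needs roughly 48 GB for weights at 16-bit precision, which is more than a single typical workstation GPU holds, while the 3B version needs about 6 GB.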

Sources

Mistral AI, TechCrunch