Krux

March 28, 2026
Mistral's Voxtral Understands Speech Without Transcribing First
Published: March 28, 2026 at 12:38 AM
Updated: March 28, 2026 at 12:38 AM
What happened
Mistral released Voxtral, its first open-source speech model, with a twist: it skips transcription and understands audio directly. The model can summarize a 30-minute meeting or trigger actions from voice commands without first converting speech to text. It handles 13 languages and costs $0.001 per minute via API, undercutting closed alternatives. It comes in two sizes, 24B parameters for servers and 3B for phones, both available under Apache 2.0 on Hugging Face. Companies worried about sending call recordings to third parties can run it on their own hardware.
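To put the per-minute price in concrete terms, here is a minimal sketch of the arithmetic. It assumes the quoted $0.001 rate is flat, with no minimum charge, rounding, or volume tiers. The function name and the 10,000-hour workload are hypothetical and used only for illustration.

```python
from decimal import Decimal

# Per-minute API price quoted in the article.
PRICE_PER_MINUTE_USD = Decimal("0.001")


def api_cost_usd(audio_minutes):
    """Cost of processing `audio_minutes` of audio at a flat per-minute rate."""
    return PRICE_PER_MINUTE_USD * Decimal(audio_minutes)


# One 30-minute meeting, the example length mentioned in the article.
meeting_cost = api_cost_usd(30)

# Hypothetical workload (not from the article): 10,000 hours of call recordings.
archive_cost = api_cost_usd(10_000 * 60)

print(f"30-minute meeting: ${meeting_cost}")
print(f"10,000 hours of calls: ${archive_cost}")
```

At that rate a 30-minute meeting works out to three cents, and the hypothetical 10,000-hour archive to $600.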
Why it matters
The catch: running it yourself means having someone on staff who knows how to deploy and operate a 24-billion-parameter model on-premises.
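A rough sense of what that involves comes from the memory needed just to hold the weights. This is a back-of-the-envelope sketch, not a sizing guide. It takes the nominal parameter counts (24B and 3B) at face value and counts weights only, ignoring activations, caches, and runtime overhead. The precision options are illustrative assumptions, not formats the article says are published.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weights-only memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts from the article.
MODELS = {"24B": 24e9, "3B": 3e9}

# Assumed storage widths per parameter (illustrative, not from the article).
PRECISIONS = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}

estimates = {
    (name, precision): weight_memory_gb(params, width)
    for name, params in MODELS.items()
    for precision, width in PRECISIONS.items()
}

for (name, precision), gb in estimates.items():
    print(f"{name} at {precision}: ~{gb:.1f} GB")
```

Under these assumptions the 24B model's weights alone come to roughly 48 GB at 16-bit precision, which is more than most single consumer GPUs offer. The 3B model needs about 6 GB at the same precision.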