Mistral releases new open-weights speech model ‘Voxtral TTS’

Completing its full-stack voice AI ecosystem, Mistral AI has officially released Voxtral 4B TTS, a high-performance, open-weights text-to-speech model. The launch marks Mistral’s first major move into speech generation, following its successful Voxtral Transcribe series for speech-to-text.

Designed for real-time applications and edge devices, the model is being positioned as a direct, open-weights competitor to proprietary leaders such as ElevenLabs and OpenAI.

1. Key Features of Voxtral TTS

Mistral has optimized the model for speed and naturalness, prioritizing “time-to-first-audio” (latency) to make voice agents feel more human.

Zero-Shot Voice Cloning: The model can clone any voice from a reference sample as short as 2–5 seconds, capturing the speaker’s unique tone, accent, and emotional cadence.
Ultra-Low Latency: Achieves a 90ms processing time, allowing for near-instantaneous speech generation in conversational AI pipelines.
Multilingual Support: Natively supports nine languages: English, French, Spanish, German, Italian, Portuguese, Dutch, Hindi, and Arabic.
“Voice-as-an-Instruction”: Instead of using complex prosody tags (like SSML), the model automatically follows the rhythm and emotional “vibe” of the provided audio prompt.
Compact Footprint: With only 4 billion parameters, the model runs comfortably on consumer-grade hardware and modern mobile chips, requiring approximately 3GB of RAM.
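
The parameter count and the memory figure above can be related with simple arithmetic. The sketch below is a back-of-the-envelope estimate only: the article does not state which numeric precision the 3GB figure assumes, so the precisions listed are illustrative assumptions, and the estimate counts weights alone (no activations, caches, or runtime overhead).

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# The precisions below are illustrative assumptions, not figures from Mistral.
PARAMS = 4_000_000_000  # "4 billion parameters" as stated in the article


def weight_gb(params: int, bits_per_param: float) -> float:
    """Weight storage in gigabytes (10^9 bytes), ignoring activations and caches."""
    return params * bits_per_param / 8 / 1e9


for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label:>10}: {weight_gb(PARAMS, bits):.1f} GB")

# Bits per parameter a 3 GB budget would allow if it covered the weights alone.
implied_bits = 3e9 * 8 / PARAMS
print(f"3 GB over 4B parameters = {implied_bits:.1f} bits per parameter")
```

Under these assumptions, 16-bit weights alone would take about 8 GB, so a footprint near 3GB would be consistent with some form of reduced-precision weights. That is an inference from the arithmetic, not something the article states.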

2. Licensing and Availability

In a departure from its tradition of Apache 2.0 releases, Mistral has published the weights under a more restrictive license to protect its commercial interests.

Aspect	Detail
Model Name	`Voxtral-4B-TTS-2603`
License	CC BY-NC 4.0 (Non-Commercial)
Platform	Available now on Hugging Face and Mistral AI Studio.
Deployment	Optimized for vLLM-Omni for production-grade throughput.

Commercial Note: While the weights are free for research and personal use, enterprise and commercial applications require a Mistral Commercial License.

3. Benchmarks: Mistral vs. ElevenLabs

In its launch announcement, Mistral claimed that Voxtral TTS already outperforms established competitors in human preference tests.

vs. ElevenLabs Flash v2.5: Mistral reports higher “naturalness” scores in blind testing, specifically in non-English languages like French and Hindi.
Hardware Efficiency: Unlike many high-fidelity TTS models that require 12GB+ of VRAM, Voxtral’s 3GB requirement makes it one of the most accessible “frontier-grade” speech models available for local hosting.

4. The “Voxtral” Ecosystem

The addition of TTS allows developers to build end-to-end voice agents entirely within the Mistral framework:

Voxtral Realtime (STT): Transcribes the user’s voice (sub-200ms).
Mistral Small/Large (LLM): Processes the text and generates a response.
Voxtral TTS: Converts the response back into speech (90ms).
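
The three stages above can be sketched as a simple chain with per-stage timing. This is a minimal illustration, not Mistral's API: the functions `transcribe`, `generate_reply`, and `synthesize` are hypothetical placeholders that return canned values, and the latency targets are the figures quoted in this article (none is quoted for the LLM step).

```python
import time
from typing import Callable, List, Optional, Tuple


# Hypothetical stand-ins: real code would call an STT model, an LLM, and a TTS model.
def transcribe(audio: bytes) -> str:
    return "what time is it"


def generate_reply(text: str) -> str:
    return f"You asked: {text}"


def synthesize(text: str) -> bytes:
    return text.encode("utf-8")  # placeholder for audio samples


# (stage name, function, latency target in ms; None where the article quotes none)
STAGES: List[Tuple[str, Callable, Optional[float]]] = [
    ("stt", transcribe, 200.0),
    ("llm", generate_reply, None),
    ("tts", synthesize, 90.0),
]


def run_turn(audio: bytes) -> Tuple[bytes, List[Tuple[str, float]]]:
    """Pass one user turn through every stage, recording elapsed ms per stage."""
    value = audio
    timings = []
    for name, fn, _target in STAGES:
        start = time.perf_counter()
        value = fn(value)
        timings.append((name, (time.perf_counter() - start) * 1000.0))
    return value, timings


reply_audio, timings = run_turn(b"\x00\x01")
for (name, elapsed), (_, _, target) in zip(timings, STAGES):
    limit = "n/a" if target is None else f"{target:.0f} ms"
    print(f"{name}: {elapsed:.3f} ms (target {limit})")
```

Taking the quoted figures at face value, speech-to-text and text-to-speech together would account for under 290ms of a conversational turn, with the LLM step making up the remainder.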

“Voice is the original UI,” noted Mistral CEO Arthur Mensch. “With Voxtral TTS, we are removing the final barrier to building truly responsive, local, and private voice interfaces that don’t rely on expensive cloud APIs.”
