ElevenLabs Launches Scribe v2 Realtime

November 12, 2025

ElevenLabs has launched Scribe v2 Realtime, a streaming speech-to-text model aimed at real-time voice applications. The company cites latency under 150 milliseconds and support for more than 90 languages, including many Indian regional languages, and positions the model for live voice agents, meeting transcription, and other interactive experiences.

What Is Scribe v2 Realtime?

Scribe v2 Realtime is a streaming, real-time speech recognition model built for live use cases. Key details:

Latency: under 150 ms from speech to text, according to ElevenLabs.
Multilingual: Support for 90+ languages, including major Indian languages like Hindi, Tamil, Malayalam, Kannada, Telugu, Gujarati, Bengali, Marathi, Punjabi and Sindhi.
Features:
- Negative latency (predicting next word/punctuation) for smoother streaming.
- Automatic language detection and text conditioning (handles connection resets).
- Voice Activity Detection (VAD) to detect speech segments and silence.
- Streaming support via API for live audio chunks.
- Enterprise-grade compliance: SOC 2, HIPAA, GDPR, data-residency options (e.g., India).
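
To make the VAD item above concrete, here is a minimal energy-threshold sketch in Python. It is not ElevenLabs' implementation, which the announcement does not describe. The function names, the 20 ms frame size, and the RMS threshold of 500 are all assumptions chosen for illustration, and the input is assumed to be 16-bit little-endian mono PCM.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian mono PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def detect_speech_frames(pcm: bytes, sample_rate: int = 16000,
                         frame_ms: int = 20,
                         threshold: float = 500.0) -> list[bool]:
    """Label each full frame True (speech-like) or False (silence).

    The threshold is an illustrative assumption; real systems usually
    adapt it to the noise floor or use a trained model instead.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    flags = []
    # Step through whole frames only; a trailing partial frame is ignored.
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        frame = pcm[start:start + frame_bytes]
        flags.append(frame_rms(frame) >= threshold)
    return flags
```

A fixed energy threshold is the simplest possible detector and will misfire on background noise, so treat it only as a way to see what "detecting speech segments and silence" means in practice.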

Why This Launch Matters

Real-Time Voice Applications Get a Big Boost

Until now, many speech-to-text models were optimized for post-processing (e.g., uploading a recorded file). Scribe v2 Realtime transcribes speech as it is spoken, which is critical for voice agents, live captioning, meeting assistants, and interactive voice applications.

Extensive Language & Regional Coverage

By supporting many Indian regional languages and offering India-data-residency options, ElevenLabs is positioning strongly for global and regional markets, not just English-dominant ones.

Competitive Edge in Latency & Accuracy

ElevenLabs claims that Scribe v2 Realtime outperforms major competitors: bar charts on their site show it ahead of models like Gemini 2.5 Flash and GPT-4o mini in accuracy. Combined with the low latency, this matters for applications where milliseconds of delay count, such as voice agents that have to respond almost immediately.

Enterprise-Friendly Features

The availability of zero-retention modes, data-residency options, support for domain-specific vocabularies and high-quality streaming means Scribe v2 Realtime is not just a toy. It is built for enterprise use across regulated industries (healthcare, finance, education).

Background: ElevenLabs & Speech-to-Text Journey

ElevenLabs started primarily with high-quality text-to-speech (TTS) and voice-synthesis models. Over time, they added transcription (speech-to-text) capabilities, starting with the earlier “Scribe v1” model. The release of Scribe v2 Realtime signals a deeper push into live voice AI and interactive audio experiences.

Their broader strategy: voice as the next major interface, enabling developers and enterprises to build immersive voice-enabled applications. The addition of this model aligns well with that goal.

What Developers & Businesses Should Know

If you’re a developer, startup or enterprise considering using Scribe v2 Realtime, here are some key points:

Integration: It’s available via API and supports streaming (WebSocket or REST) for live audio chunks.
Language support: If you work with Indian languages (Hindi, Tamil, etc.), check the published benchmark accuracy for your language; ElevenLabs reports strong performance.
Use Cases:
- Voice agents (support bots, sales assistants) — use the model to transcribe user speech in real time and respond.
- Live captioning for meetings, webinars, streaming video.
- Accessibility tools: real-time subtitles for hearing-impaired users.
- Compliance & transcription for calls in regulated industries.
Pricing/Plans: Their docs mention pricing based on usage (for example, “$0.28 per hour & lower on annual Business plans”) for this real-time STT model.
Data Security: For sensitive audio (medical, legal), leverage zero-retention and data-residency options.
Performance Considerations: While <150 ms latency is excellent, actual performance will depend on network conditions, audio quality, accents, background noise, etc. Test in a production-like environment before relying on the headline figure.
Streaming Setup: Use supported audio formats (PCM 8 kHz–48 kHz, μ-law) for best compatibility.
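
For the streaming setup, a client typically splits raw audio into short fixed-duration chunks before sending them. The sketch below shows only that chunking step in Python. It deliberately does not model the ElevenLabs message format or endpoint, which are not covered here. The function name `chunk_pcm`, the 100 ms default chunk size, and the 16-bit sample width are assumptions; the 8 kHz to 48 kHz range check comes from the supported formats listed above.

```python
from typing import Iterator


def chunk_pcm(pcm: bytes, sample_rate: int, chunk_ms: int = 100,
              sample_width: int = 2) -> Iterator[bytes]:
    """Yield consecutive chunks of raw mono PCM, each about chunk_ms long.

    The final chunk may be shorter. Because this is a generator, the
    validation below runs when iteration starts, not at call time.
    """
    if not 8000 <= sample_rate <= 48000:
        raise ValueError("sample_rate must be between 8000 and 48000 Hz")
    chunk_bytes = sample_rate * chunk_ms // 1000 * sample_width
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


if __name__ == "__main__":
    # 250 ms of 16 kHz, 16-bit silence as a stand-in for microphone audio.
    audio = b"\x00" * 8000
    for i, chunk in enumerate(chunk_pcm(audio, sample_rate=16000)):
        print(f"chunk {i}: {len(chunk)} bytes")
```

Smaller chunks reduce the delay before audio reaches the server but increase per-message overhead, so the chunk duration is worth tuning against measured end-to-end latency.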

Challenges & Considerations

While the latency is impressive, real-world deployments may face constraints such as network jitter, packet loss, degraded audio quality, and accent or dialect variation, especially in noisy environments.
Even though 90+ languages are supported, depth of support per language may vary. For Indian regional languages, check whether speaker diarization and domain vocabularies are supported, and how well accuracy holds up on your own audio.
Cost and scaling: Real-time models tend to consume resources (streaming, compute). Ensure you budget and test for concurrency & throughput. The docs mention concurrency limits tied to subscription plan.
Privacy/regulation: For jurisdictions like India, data-residency and compliance still need operational checks (e.g., does the audio stay in India? How is it processed?).
Integration complexity: Streaming systems, voice activity detection, and low-latency feedback loops must be built and deployed carefully.
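
The cost and concurrency point can be turned into a rough back-of-the-envelope estimate. The Python sketch below uses the "$0.28 per hour" figure quoted earlier purely as an example rate; actual pricing depends on your plan. The function names and the traffic numbers in the usage example are hypothetical, and the concurrency figure is a simple average that assumes sessions are spread evenly over the active hours, so real peaks will be higher.

```python
def monthly_audio_hours(sessions_per_day: float,
                        avg_minutes_per_session: float,
                        days: int = 30) -> float:
    """Total hours of streamed audio over the billing period."""
    return sessions_per_day * avg_minutes_per_session / 60 * days


def estimate_monthly_cost(sessions_per_day: float,
                          avg_minutes_per_session: float,
                          rate_per_hour: float = 0.28,
                          days: int = 30) -> float:
    """Usage cost in dollars, rounded to cents (the rate is an example)."""
    hours = monthly_audio_hours(sessions_per_day, avg_minutes_per_session, days)
    return round(hours * rate_per_hour, 2)


def average_concurrency(sessions_per_day: float,
                        avg_minutes_per_session: float,
                        active_hours_per_day: float = 24.0) -> float:
    """Average number of simultaneous streams (arrival rate x duration)."""
    return (sessions_per_day * avg_minutes_per_session
            / (active_hours_per_day * 60))


if __name__ == "__main__":
    # Hypothetical workload: 1,000 six-minute calls per day.
    print(estimate_monthly_cost(1000, 6))
    print(average_concurrency(1000, 6, active_hours_per_day=8))
```

Comparing the average concurrency against the limit on your subscription plan, with generous headroom for peaks, is a quick way to spot a mismatch before load testing.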

Looking Ahead: What To Watch

Adoption: How quickly voice-AI products start using Scribe v2 Realtime for real-time captioning, voice agents and meetings will show whether the latency and accuracy advantages convert into real-world value.
Competitive responses: Other STT providers (Google Cloud Speech-to-Text, Microsoft Azure Speech, Deepgram, etc.) may accelerate their own real-time models, so watch how the market shifts.
Language coverage improvements: Whether more Indian regional languages and dialects get full parity (with speaker diarization, custom vocab support) will be key for Indian/Asia markets.
Pricing and tiers: How sustainable the pricing is for startups vs enterprise — if cost becomes a barrier, adoption may slow.
End-to-end voice-agent ecosystems: With Scribe v2 Realtime plus ElevenLabs’ voice-synthesis tools and Agents platform, the company is building an integrated voice stack. How developers adopt it end-to-end (v2 transcription + voice generation + agent logic) will be interesting.
Performance benchmarks and transparency: Independent users will test word error rates (WER) in various conditions; being able to trust the numbers will be important.
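
Word error rate is simple enough to compute on your own recordings rather than relying only on published charts. The sketch below is a standard word-level edit-distance calculation in Python, not a tool provided by ElevenLabs. The function name is an assumption, and normalization here is limited to lowercasing and whitespace splitting; real evaluations usually also normalize punctuation and numerals.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    truth = "please transfer five hundred rupees to my savings account"
    transcript = "please transfer five hundred rupees to savings account"
    print(f"WER: {word_error_rate(truth, transcript):.3f}")
```

Running this over a few dozen clips that reflect your actual accents, microphones, and noise levels gives a more trustworthy picture than any single headline number.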

Conclusion

The launch of ElevenLabs’ Scribe v2 Realtime marks a significant milestone in real-time speech-to-text technology. With sub-150 ms latency, support for 90+ languages (including multiple Indian languages), and enterprise features such as zero-retention and data-residency, it offers developers and companies a powerful tool for voice-enabled applications. Together, these point to a shift toward faster, more inclusive, live voice interactions.

For anyone building voice assistants, live captioning systems, meeting tools or interactive voice apps, this model should be on the radar. The key will be how it performs in real-world settings, how cost scales, and how language/regional support evolves.
