OpenAI has officially launched GPT-Realtime, its most advanced speech-to-speech AI model, now available through the production-ready Realtime API. The updates include a roughly 20% price cut, two new voices, and new integrations aimed at building voice-based AI agents.
What’s New in GPT-Realtime and Realtime API
Advanced Speech-To-Speech Model
- GPT-Realtime processes and generates audio directly in a single model instead of chaining separate speech-to-text and text-to-speech stages, which lowers latency and better preserves vocal nuances such as laughs and pauses.
- The model demonstrates improved ability to follow complex instructions, perform accurate function calls, and handle multi-step reasoning.
Natural, Expressive Voices
- OpenAI introduced two new voices, Cedar and Marin, intended to make voice output more expressive and realistic. Existing voices also receive quality upgrades.
- GPT-Realtime can interpret non-verbal cues like laughter, switch languages mid-sentence, and adapt tone based on instructions (e.g., “speak empathetically in a French accent”).
Enhanced Developer Tools & API Features
- The updated Realtime API includes:
  - MCP (Model Context Protocol) support for connecting voice agents to external tools and data sources.
  - Image input support: developers can feed images alongside audio or text for richer interactions.
  - SIP (Session Initiation Protocol) support, so voice agents can be connected directly to phone calls.
  - Reusable prompts to streamline voice agent deployment across sessions.
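As a rough illustration of how the features listed above might be combined, the sketch below builds a session configuration payload as plain JSON. The event shape (`session.update`), the field names (`voice`, `instructions`, `tools`, `server_label`, `server_url`), the lowercase voice identifier, and the example URL are assumptions for illustration and are not confirmed by this article. Check the official Realtime API reference before relying on any of them.

```python
import json


def build_session_update(voice: str, instructions: str, mcp_server_url: str) -> dict:
    """Build a hypothetical session configuration event.

    All keys below are assumed for illustration; the real API schema may differ.
    """
    return {
        "type": "session.update",  # assumed event name
        "session": {
            "voice": voice,  # e.g. one of the new voices (assumed identifier)
            "instructions": instructions,  # tone and accent guidance for the model
            "tools": [
                {
                    "type": "mcp",  # assumed tool type for an MCP server
                    "server_label": "example-tools",  # hypothetical label
                    "server_url": mcp_server_url,  # hypothetical endpoint
                }
            ],
        },
    }


payload = build_session_update(
    voice="marin",
    instructions="Speak empathetically in a French accent.",
    mcp_server_url="https://example.com/mcp",
)

# Serialize to the JSON text that would be sent over the connection.
message = json.dumps(payload, indent=2)
print(message)
```

The sketch makes no network calls; it only shows how voice selection, tone instructions, and an MCP tool entry could sit together in one configuration message.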
Lower Latency and Pricing
- OpenAI has cut audio token prices by 20%: $32 per million audio input tokens (down from $40) and $64 per million audio output tokens (down from $80).
- Combined with the new integrations and real-time responses, the lower price makes the model better suited to production deployments.
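To make the pricing change concrete, here is a small worked calculation using the per-million-token prices quoted above. The token volumes (500,000 input and 250,000 output audio tokens) are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioPricing:
    """Prices in US dollars per one million audio tokens."""
    input_per_million: float
    output_per_million: float


# Prices quoted in the article.
NEW_PRICING = AudioPricing(input_per_million=32.0, output_per_million=64.0)
OLD_PRICING = AudioPricing(input_per_million=40.0, output_per_million=80.0)


def estimate_cost(pricing: AudioPricing, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated audio cost in dollars for the given token counts."""
    total = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


# Hypothetical monthly usage.
input_tokens = 500_000
output_tokens = 250_000

new_cost = estimate_cost(NEW_PRICING, input_tokens, output_tokens)
old_cost = estimate_cost(OLD_PRICING, input_tokens, output_tokens)
savings = 1 - new_cost / old_cost

print(f"new: ${new_cost:.2f}, old: ${old_cost:.2f}, savings: {savings:.0%}")
```

Because both the input and output prices dropped by the same proportion, the overall saving is 20% regardless of the mix of input and output tokens.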
What This Means for Developers & Businesses
| Benefit | Details |
|---|---|
| Speed & Fidelity | Real-time voice response with preserved human-like nuances |
| Expressiveness | Customizable tone, accent, and language-switching capabilities |
| Versatility | Supports images, tool calls, SIP phone integration, and reusable prompts |
| Cost-Effectiveness | 20% lower audio token pricing makes voice agents more affordable |
Together, these changes bring voice agents closer to mainstream adoption, enabling more natural and engaging applications in customer support, virtual assistants, education, and content creation.