Tuesday, October 7, 2025

OpenAI Launches gpt-realtime: Advanced Speech-to-Speech Model for Low-Latency Voice Agents

OpenAI has introduced gpt-realtime, its most advanced speech-to-speech model to date, designed for low-latency, natural voice interactions in production applications. Announced at DevDay on October 6, 2025, gpt-realtime ships alongside the general availability of the Realtime API, which lets developers build voice agents with SIP phone calling, image inputs, and remote MCP server support. The model scores 30.5% on the MultiChallenge instruction-following benchmark, up from 20.6% for the December 2024 preview, and 82.8% on Big Bench Audio reasoning, up from 65.6%. It also speaks with more natural intonation and emotion and recognizes alphanumeric sequences more accurately across languages. Priced at about $0.06 per minute of audio input and $0.24 per minute of audio output, 20% cheaper than the preview, it is optimized for real-time applications such as customer service and voice assistants.

The Realtime API now supports gpt-4o-mini-realtime-preview for cost-effective streaming, making voice AI more accessible.

gpt-realtime: Key Capabilities and Benchmarks

gpt-realtime excels in natural speech synthesis and comprehension. It supports multilingual interactions, with improved accuracy for phone numbers, VINs, and other alphanumeric sequences in languages such as Spanish, Chinese, Japanese, and French.

  • Instruction Following: 30.5% on the MultiChallenge benchmark, up from 20.6%.
  • Reasoning: 82.8% on Big Bench Audio, up from 65.6% for the previous model.
  • Voice Quality: New Cedar and Marin voices for empathetic, professional tones.
  • Latency: A single model handles speech-to-speech directly, avoiding the delay of a chained speech-to-text, LLM, and text-to-speech pipeline.

Developers can input text or audio and receive text, audio, or both, with tool calling for enhanced functionality.
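As a rough illustration of how a voice, output type, and tool might be configured together, the sketch below builds a session-configuration event as plain JSON. The event type `session.update` and every field name inside it are assumptions modeled on the Realtime API's event style, and the lowercase voice identifier and the `lookup_order` tool are hypothetical; check the official API reference before relying on any of them.

```python
import json


def build_session_update(output_modalities: list[str], voice: str = "marin") -> dict:
    """Build a session-configuration event (field names are assumptions)."""
    return {
        "type": "session.update",  # assumed event name
        "session": {
            "type": "realtime",  # assumed field
            "model": "gpt-realtime",
            "output_modalities": output_modalities,  # e.g. ["audio"] or ["text"]
            "audio": {"output": {"voice": voice}},  # assumed nesting
            "tools": [
                {
                    # Hypothetical tool, for illustration only.
                    "type": "function",
                    "name": "lookup_order",
                    "description": "Look up an order by its ID.",
                    "parameters": {
                        "type": "object",
                        "properties": {"order_id": {"type": "string"}},
                        "required": ["order_id"],
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    event = build_session_update(["audio"], voice="cedar")
    # The event would be sent as a JSON text frame over the session connection.
    print(json.dumps(event, indent=2))
```

Building the event as a plain dictionary keeps the example independent of any particular client library; only the serialized JSON would go over the wire.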

Benchmark | gpt-realtime Score | Previous Model Score
MultiChallenge (Instructions) | 30.5% | 20.6%
Big Bench Audio (Reasoning) | 82.8% | 65.6%
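To put the benchmark scores in perspective, the short sketch below computes both the absolute gain (in percentage points) and the relative gain for each benchmark. It uses only the scores quoted above; the helper name `gains` is made up for this example.

```python
def gains(new: float, old: float) -> tuple[float, float]:
    """Return (gain in percentage points, relative gain in percent)."""
    points = round(new - old, 1)
    relative = round((new - old) / old * 100, 1)
    return points, relative


# Scores as reported in the article: (gpt-realtime, previous model).
scores = {
    "MultiChallenge (Instructions)": (30.5, 20.6),
    "Big Bench Audio (Reasoning)": (82.8, 65.6),
}

if __name__ == "__main__":
    for name, (new, old) in scores.items():
        points, relative = gains(new, old)
        print(f"{name}: +{points} points, +{relative}% relative")
```

By this arithmetic, the instruction-following score rises by 9.9 points (about 48% relative) and the reasoning score by 17.2 points (about 26% relative).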

Realtime API Updates: SIP, Images, and MCP for Production

The Realtime API’s general availability includes:

  • SIP Phone Calling: Integrate with telephony systems for voice agents.
  • Image Inputs: Multimodal support for vision-language tasks.
  • Remote MCP Servers: External tool access for complex workflows.
  • gpt-4o-mini-realtime-preview: Cost-effective streaming option.

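Pricing: $5 per 1M text input tokens and $20 per 1M text output tokens; $100 per 1M audio input tokens and $200 per 1M audio output tokens (approximately $0.06 per minute of audio input and $0.24 per minute of audio output). Upcoming: GPT-5 Pro and Sora 2 integrations.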
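The per-minute figures follow from the per-token prices only if a fixed number of audio tokens per minute is assumed; the article does not state that rate. The sketch below back-computes the implied rate from the quoted prices and estimates the audio cost of a call. The function names and the example call lengths are illustrative, not from the source.

```python
def implied_tokens_per_minute(price_per_million: float, price_per_minute: float) -> int:
    """Tokens per minute implied by a per-token and a per-minute price."""
    return round(price_per_minute * 1_000_000 / price_per_million)


def audio_call_cost(minutes_in: float, minutes_out: float,
                    in_rate: float = 0.06, out_rate: float = 0.24) -> float:
    """Estimated audio cost in dollars, using the quoted per-minute rates."""
    return round(minutes_in * in_rate + minutes_out * out_rate, 2)


if __name__ == "__main__":
    # Quoted prices: $100 and $200 per 1M audio tokens; $0.06 and $0.24 per minute.
    print(implied_tokens_per_minute(100, 0.06))  # implied input rate
    print(implied_tokens_per_minute(200, 0.24))  # implied output rate
    # Example: caller speaks for 3 minutes, the agent replies for 2 minutes.
    print(audio_call_cost(3, 2))
```

Under these figures the implied rates are about 600 audio tokens per minute for input and 1,200 for output, and the example call costs roughly $0.66 in audio charges, excluding any text tokens.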

Conclusion: gpt-realtime’s Voice Revolution

OpenAI’s gpt-realtime launch on October 6, 2025, advances speech-to-speech AI, raising instruction-following accuracy to 30.5% and adding production features such as SIP phone calling. At about $0.06 per minute of audio input, it is priced for production use. Whether it will redefine voice conversations remains to be seen. Source: TechCrunch
