Low-latency speech-to-speech for voice agents
Pros
- Industry-leading conversational latency
- Tight tool-use integration
- Caching dramatically cuts cost
Cons
- Token-based pricing is hard to forecast
- Uncached cost can exceed $0.30/min
- Requires engineering to deploy
✓ Where it shines / best for
- Developers building real-time voice agents and assistants
- Voice-enabled customer support and phone automation
- Low-latency conversational apps requiring tool use
✕ Not the best fit for
- No-code users wanting a ready-made app
- Cost-sensitive batch TTS where flat per-character pricing is cheaper
- Offline/on-device speech needs
Features
- ✓ Speech-to-speech over a single WebSocket/WebRTC connection (low latency)
- ✓ Native audio in/out without separate STT and TTS pipelines
- ✓ Function/tool calling during live voice conversations
- ✓ Built-in voices with expressive, natural intonation
- ✓ Interruption handling and turn detection (VAD)
- ✓ Supports SIP for phone-call integration
- ✓ Image input alongside audio in realtime sessions
- ✓ Multilingual speech understanding and generation
Pricing
| Token type | Price (USD) | Unit | Notes |
|---|---|---|---|
| Text input | $4.00 | per 1M tokens | gpt-realtime text input tokens |
| Text output | $16.00 | per 1M tokens | gpt-realtime text output tokens |
| Cached text input | $0.40 | per 1M tokens | Cached input discount |
| Audio input | $32.00 | per 1M tokens | gpt-realtime audio input tokens |
| Cached audio input | $0.40 | per 1M tokens | Cached audio input |
| Audio output | $64.00 | per 1M tokens | gpt-realtime audio output tokens |
Pricing verified from the official source. Prices change often — confirm on the vendor's site before buying.
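Because billing is per token rather than per minute, a small calculator can help with forecasting. The sketch below multiplies token counts by the rates listed in the table above. The `estimate_cost` helper, the category names, and the token counts are all hypothetical and chosen for illustration; real sessions will report their own usage figures, and the rates should be rechecked against the vendor's current pricing.

```python
# Rates in USD per 1M tokens, copied from the pricing table above.
# Category names are illustrative labels, not official API field names.
RATES_PER_MILLION = {
    "text_input": 4.00,
    "text_output": 16.00,
    "cached_text_input": 0.40,
    "audio_input": 32.00,
    "cached_audio_input": 0.40,
    "audio_output": 64.00,
}


def estimate_cost(token_counts: dict[str, int]) -> float:
    """Return the estimated USD cost for token counts keyed by category."""
    total = 0.0
    for category, tokens in token_counts.items():
        if category not in RATES_PER_MILLION:
            raise KeyError(f"unknown token category: {category}")
        total += tokens / 1_000_000 * RATES_PER_MILLION[category]
    return total


# Hypothetical token counts for one short exchange (assumed, not measured).
uncached = {"audio_input": 10_000, "audio_output": 5_000, "text_input": 2_000}
cached = {
    "cached_audio_input": 10_000,
    "audio_output": 5_000,
    "cached_text_input": 2_000,
}

if __name__ == "__main__":
    print(f"uncached: ${estimate_cost(uncached):.4f}")
    print(f"cached:   ${estimate_cost(cached):.4f}")
```

In this made-up scenario, caching the input side cuts the total roughly in half; audio output is billed at the full rate either way, so it tends to dominate the cost of cached sessions.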
Specifications
| Spec | Value |
|---|---|
| Model | gpt-realtime |
| Modality | speech-to-speech |