Real production data from testing 12 providers with Gulf Arabic callers. Not synthetic benchmarks — actual calls from a live real estate voice agent.
All benchmarks come from a production real estate voice agent handling real inbound calls from Gulf Arabic speakers in the UAE. This is not a lab test with clean audio and Modern Standard Arabic (MSA): these are real callers with background noise, dialect variations, and natural conversation patterns.
- EOU delay: time from when the caller stops speaking to when the STT emits a final transcript. Lower is better; under 500ms feels real-time.
- Response latency: total time from end of speech to the agent's response audio starting. Includes STT + LLM + TTS pipeline latency.
- Quality: 1–5 score based on transcription accuracy, dialect handling, and whether callers needed to repeat themselves.
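To make the two latency metrics concrete, here is a simplified Python sketch of how they can be computed from per-turn timestamps. This is illustrative only, not the production measurement code, and the timestamp names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class TurnTimestamps:
    """Wall-clock timestamps (seconds) captured for one caller turn."""
    speech_end: float        # VAD detects the caller stopped speaking
    final_transcript: float  # STT emits the final transcript
    first_tts_audio: float   # first agent TTS audio reaches the caller

def eou_delay_ms(t: TurnTimestamps) -> float:
    """End-of-utterance delay: end of speech -> final transcript."""
    return (t.final_transcript - t.speech_end) * 1000

def response_latency_ms(t: TurnTimestamps) -> float:
    """Response latency: end of speech -> agent audio starts
    (covers the whole STT + LLM + TTS pipeline)."""
    return (t.first_tts_audio - t.speech_end) * 1000

# Example turn: transcript finalized 424ms after speech ends,
# agent audio starts 1.2s after speech ends.
turn = TurnTimestamps(speech_end=10.00, final_transcript=10.424, first_tts_audio=11.20)
print(f"EOU delay: {eou_delay_ms(turn):.0f} ms")              # 424 ms
print(f"Response latency: {response_latency_ms(turn):.0f} ms")  # 1200 ms
```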
| Provider | Category | Avg EOU Delay | Best-Case EOU Delay | Verdict |
|---|---|---|---|---|
| Deepgram Nova-3 | Speech-to-Text | 424ms | 0ms | Recommended |
| ElevenLabs TTS | Text-to-Speech | N/A | N/A | Recommended |
| Groq — Llama 4 Maverick | Voice LLMs | N/A | N/A | Recommended |
| LiveKit BVC (Background Voice Cancellation) | Noise Cancellation | N/A | N/A | Recommended |
| Silero VAD | Voice Activity Detection | N/A | N/A | Recommended |
| Soniox STT RT v3 | Speech-to-Text | 1678ms | 773ms | Good |
| Google Cloud STT — Chirp 3 | Speech-to-Text | 2376ms | 2000ms | Acceptable |
| ElevenLabs Scribe v2 | Speech-to-Text | 2000ms–2500ms | 2000ms | Not Recommended |
| Groq Whisper Large v3 Turbo | Speech-to-Text | 284ms–3388ms | 284ms | Not Recommended |
| Groq Whisper Large v3 | Speech-to-Text | 32ms–3494ms | 32ms | Not Recommended |
| Speechmatics | Speech-to-Text | 460ms | 0ms | Not Recommended |
| Mistral Voxtral Mini | Speech-to-Text | N/A | N/A | Non-functional |
Average end-of-utterance delay in milliseconds. Lower is better. Under 500ms recommended for real-time agents.
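Taken together, the Recommended rows form a complete stack. Below is a minimal sketch of wiring them up with the LiveKit Agents Python SDK and its Deepgram, Groq (OpenAI-compatible), ElevenLabs, Silero, and noise-cancellation plugins. Constructor arguments, language settings, and the Groq model identifier are assumptions that vary across SDK versions, so treat this as a starting point rather than the production configuration.

```python
from livekit.agents import (
    Agent, AgentSession, JobContext, RoomInputOptions, WorkerOptions, cli,
)
from livekit.plugins import deepgram, elevenlabs, noise_cancellation, openai, silero

async def entrypoint(ctx: JobContext):
    await ctx.connect()

    # The "Recommended" rows from the table above. Model names and
    # constructor arguments are assumptions -- verify against the
    # current plugin docs for your SDK version.
    session = AgentSession(
        vad=silero.VAD.load(),             # Silero VAD for turn detection
        stt=deepgram.STT(model="nova-3"),  # Deepgram Nova-3; set language per current Arabic support
        llm=openai.LLM.with_groq(          # Groq-hosted Llama 4 Maverick (model id is an assumption)
            model="meta-llama/llama-4-maverick-17b-128e-instruct",
        ),
        tts=elevenlabs.TTS(),              # ElevenLabs TTS
    )

    await session.start(
        agent=Agent(instructions="You are a real estate assistant for UAE callers."),
        room=ctx.room,
        room_input_options=RoomInputOptions(
            # LiveKit BVC background voice cancellation
            noise_cancellation=noise_cancellation.BVC(),
        ),
    )

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```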
Deepgram Nova-3 offers the best combination of latency and quality: 424ms average EOU delay with excellent transcription accuracy, and no caller repetitions needed. It is the only STT tested where both speed and quality deliver.
Both Groq Whisper variants produced poor transcription quality for Arabic, and the Turbo variant added wildly inconsistent latency (284ms to 3.4s). The Whisper architecture is fundamentally weak for Arabic dialects.
Speechmatics endpointing is fast (~460ms average), but Arabic transcription quality is unacceptable: callers had to repeat themselves. Raw speed is meaningless if the transcript is wrong.
Arabic support ranges from excellent (Deepgram, Soniox) to completely non-functional (Voxtral Mini — zero output). Marketing claims about "multilingual support" are unreliable. Always test with real Arabic audio before committing.
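The cheapest way to do that is a small batch script that runs a few real caller recordings through a provider's prerecorded API and lets you read the transcripts yourself. Here is a rough sketch against Deepgram's REST endpoint; the model and language query parameters and the file paths are assumptions, and the same pattern applies to any other provider's batch API.

```python
import os
import requests

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"

def transcribe(path: str, model: str = "nova-3", language: str = "multi") -> str:
    """Send one local audio file to Deepgram's prerecorded endpoint and
    return the top transcript. Model/language values are assumptions --
    confirm which models actually support Arabic before relying on them."""
    with open(path, "rb") as f:
        resp = requests.post(
            DEEPGRAM_URL,
            params={"model": model, "language": language},
            headers={
                "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
                "Content-Type": "audio/wav",
            },
            data=f,
        )
    resp.raise_for_status()
    return resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]

if __name__ == "__main__":
    # Placeholder paths: swap in a handful of real Gulf Arabic caller
    # recordings, not clean MSA studio audio.
    for clip in ["calls/caller_01.wav", "calls/caller_02.wav"]:
        print(clip, "->", transcribe(clip))
```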