Arabic AI Voice Benchmarks

Real production data from testing 12 providers with Gulf Arabic callers. Not synthetic benchmarks — actual calls from a live real estate voice agent.

How We Benchmark

Test Environment

All benchmarks come from a production real estate voice agent handling real inbound calls from Gulf Arabic speakers in the UAE. This is not a lab test with clean audio and Modern Standard Arabic (MSA) — these are real callers with background noise, dialect variations, and natural conversation patterns.

Key Metrics
EOU Delay

End-of-utterance delay — time from when the caller stops speaking to when the STT emits a final transcript. Lower is better. Under 500ms feels real-time.
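Measuring EOU delay in practice amounts to timestamping two events: the VAD's end-of-speech signal and the STT's final transcript. A minimal sketch (the `EouTimer` class and its callback names are hypothetical, not from any provider SDK):

```python
import time

class EouTimer:
    """Measures end-of-utterance delay: the gap between the moment
    the caller stops speaking and the STT emitting a final transcript."""

    def __init__(self):
        self.speech_ended_at = None

    def on_end_of_speech(self):
        # Fired by the VAD when the caller stops talking.
        self.speech_ended_at = time.monotonic()

    def on_final_transcript(self):
        # Fired by the STT when the final transcript arrives.
        if self.speech_ended_at is None:
            return None
        delay_ms = (time.monotonic() - self.speech_ended_at) * 1000
        self.speech_ended_at = None
        return delay_ms
```

Using a monotonic clock avoids wall-clock jumps skewing the measurement during long calls.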

Full Turn Time

Total time from end-of-speech to agent response audio starting. Includes STT + LLM + TTS pipeline latency.
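Since full turn time is the sum of the pipeline stages, a quick back-of-envelope model looks like this (the LLM and TTS figures below are illustrative assumptions, not measured results; only the 424ms STT figure comes from the benchmarks):

```python
# Per-stage latencies for one conversational turn, in milliseconds.
# stt_eou is Deepgram Nova-3's measured average; the other two
# stage values are hypothetical placeholders for illustration.
stages = {
    "stt_eou": 424,          # end-of-speech -> final transcript
    "llm_first_token": 350,  # transcript -> first LLM token (assumed)
    "tts_first_byte": 180,   # first token -> first audio byte (assumed)
}

full_turn_ms = sum(stages.values())
print(full_turn_ms)  # 954
```

The takeaway: STT endpointing is typically the largest single contributor, which is why EOU delay dominates the provider comparison below.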

Quality Rating

A 1–5 score based on transcription accuracy, dialect handling, and whether callers needed to repeat themselves.

Full Results

| Provider | Category | Avg EOU Delay | Best Case | Verdict |
| --- | --- | --- | --- | --- |
| Deepgram Nova-3 | Speech-to-Text | 424ms | 0ms | Recommended |
| ElevenLabs TTS | Text-to-Speech | N/A | N/A | Recommended |
| Groq — Llama 4 Maverick | Voice LLMs | N/A | N/A | Recommended |
| LiveKit BVC (Background Voice Cancellation) | Noise Cancellation | N/A | N/A | Recommended |
| Silero VAD | Voice Activity Detection | N/A | N/A | Recommended |
| Soniox STT RT v3 | Speech-to-Text | 1678ms | 773ms | Good |
| Google Cloud STT — Chirp 3 | Speech-to-Text | 2376ms | 2000ms | Acceptable |
| ElevenLabs Scribe v2 | Speech-to-Text | 2000ms–2500ms | 2000ms | Not Recommended |
| Groq Whisper Large v3 Turbo | Speech-to-Text | 284ms–3388ms | 284ms | Not Recommended |
| Groq Whisper Large v3 | Speech-to-Text | 32ms–3494ms | 32ms | Not Recommended |
| Speechmatics | Speech-to-Text | 460ms | 0ms | Not Recommended |
| Mistral Voxtral Mini | Speech-to-Text | N/A | N/A | Non-functional |

STT Latency Comparison

- Groq Whisper Large v3: 32ms–3494ms
- Groq Whisper Large v3 Turbo: 284ms–3388ms
- Deepgram Nova-3: 424ms
- Speechmatics: 460ms
- Soniox STT RT v3: 1678ms
- ElevenLabs Scribe v2: 2000ms–2500ms
- Google Cloud STT — Chirp 3: 2376ms

Average end-of-utterance delay in milliseconds. Lower is better. Under 500ms recommended for real-time agents.
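Applying the sub-500ms threshold to the measured averages is straightforward (the figures below are the average EOU delays from the results table; ranged results are represented by their midpoint-free published average and are omitted here since only single-average providers have a stable figure):

```python
# Average EOU delays (ms) taken from the benchmark results above.
avg_eou_ms = {
    "Deepgram Nova-3": 424,
    "Speechmatics": 460,
    "Soniox STT RT v3": 1678,
    "ElevenLabs Scribe v2": 2000,
    "Google Cloud STT - Chirp 3": 2376,
}

# Providers meeting the sub-500ms real-time threshold, fastest first.
real_time = [
    name
    for name, ms in sorted(avg_eou_ms.items(), key=lambda kv: kv[1])
    if ms < 500
]
print(real_time)  # ['Deepgram Nova-3', 'Speechmatics']
```

Note that passing the latency filter says nothing about quality: Speechmatics clears the threshold yet lands in "Not Recommended" because of its Arabic transcription accuracy.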

Key Findings

Deepgram Nova-3 Wins

Best combination of latency and quality: a 424ms average EOU delay with excellent transcription accuracy — no caller repetitions needed. It was the only STT tested where both speed and quality delivered.

Whisper Models Fail on Arabic

Both Groq Whisper variants produced poor transcription quality for Arabic. The Turbo variant added wildly inconsistent latency (284ms to 3.4s). Whisper architecture is fundamentally weak for Arabic dialects.

Speed vs Quality Tradeoff

Speechmatics delivers the fastest endpointing (~460ms) but Arabic transcription quality is unacceptable — callers had to repeat themselves. Raw speed is meaningless if the transcript is wrong.

Quality Varies Wildly

Arabic support ranges from excellent (Deepgram, Soniox) to completely non-functional (Voxtral Mini — zero output). Marketing claims about "multilingual support" are unreliable. Always test with real Arabic audio before committing.