Real production data from testing 12 providers with Gulf Arabic callers. Not synthetic benchmarks — actual calls from a live real estate voice agent.
All benchmarks come from a production real estate voice agent handling real inbound calls from Gulf Arabic speakers in the UAE. This is not a lab test with clean audio and Modern Standard Arabic (MSA): these are real callers with background noise, dialect variations, and natural conversation patterns.
- EOU delay: time from when the caller stops speaking to when the STT emits a final transcript. Lower is better; under 500ms feels real-time.
- Response latency: total time from end of speech to the agent's response audio starting. Includes STT + LLM + TTS pipeline latency.
- Quality: 1–5 score based on transcription accuracy, dialect handling, and whether callers needed to repeat themselves.
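To make the two latency metrics concrete, here is a simplified Python sketch of how they can be computed from per-turn timestamps. This is illustrative only, not the production measurement code, and the timestamp names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class TurnTimestamps:
    """Wall-clock timestamps (seconds) captured for one caller turn."""
    speech_end: float        # VAD detects the caller stopped speaking
    final_transcript: float  # STT emits the final transcript
    first_tts_audio: float   # first agent TTS audio reaches the caller

def eou_delay_ms(t: TurnTimestamps) -> float:
    """End-of-utterance delay: end of speech -> final transcript."""
    return (t.final_transcript - t.speech_end) * 1000

def response_latency_ms(t: TurnTimestamps) -> float:
    """Response latency: end of speech -> agent audio starts
    (covers the whole STT + LLM + TTS pipeline)."""
    return (t.first_tts_audio - t.speech_end) * 1000

# Example turn: transcript finalized 424ms after speech ends,
# agent audio starts 1.2s after speech ends.
turn = TurnTimestamps(speech_end=10.00, final_transcript=10.424, first_tts_audio=11.20)
print(f"EOU delay: {eou_delay_ms(turn):.0f} ms")              # 424 ms
print(f"Response latency: {response_latency_ms(turn):.0f} ms")  # 1200 ms
```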
| Provider | Category | Avg EOU Delay | Best-Case EOU Delay | Verdict |
|---|---|---|---|---|
| Deepgram Nova-3 | Speech-to-Text | 424ms | 0ms | Recommended |
| ElevenLabs TTS | Text-to-Speech | N/A | N/A | Recommended |
| Groq — Llama 4 Maverick | Voice LLMs | N/A | N/A | Recommended |
| LiveKit BVC (Background Voice Cancellation) | Noise Cancellation | N/A | N/A | Recommended |
| Silero VAD | Voice Activity Detection | N/A | N/A | Recommended |
| Soniox STT RT v3 | Speech-to-Text | 1678ms | 773ms | Good |
| Google Cloud STT — Chirp 3 | Speech-to-Text | 2376ms | 2000ms | Acceptable |
| ElevenLabs Scribe v2 | Speech-to-Text | 2000ms–2500ms | 2000ms | Not Recommended |
| Groq Whisper Large v3 Turbo | Speech-to-Text | 284ms–3388ms | 284ms | Not Recommended |
| Groq Whisper Large v3 | Speech-to-Text | 32ms–3494ms | 32ms | Not Recommended |
| Speechmatics | Speech-to-Text | 460ms | 0ms | Not Recommended |
| Mistral Voxtral Mini | Speech-to-Text | N/A | N/A | Non-functional |
Average end-of-utterance delay in milliseconds. Lower is better. Under 500ms recommended for real-time agents.
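Taken together, the Recommended rows form a complete stack. Below is a minimal sketch of wiring them up with the LiveKit Agents Python SDK and its Deepgram, Groq (OpenAI-compatible), ElevenLabs, Silero, and noise-cancellation plugins. Constructor arguments, language settings, and the Groq model identifier are assumptions that vary across SDK versions, so treat this as a starting point rather than the production configuration.

```python
from livekit.agents import (
    Agent, AgentSession, JobContext, RoomInputOptions, WorkerOptions, cli,
)
from livekit.plugins import deepgram, elevenlabs, noise_cancellation, openai, silero

async def entrypoint(ctx: JobContext):
    await ctx.connect()

    # The "Recommended" rows from the table above. Model names and
    # constructor arguments are assumptions -- verify against the
    # current plugin docs for your SDK version.
    session = AgentSession(
        vad=silero.VAD.load(),             # Silero VAD for turn detection
        stt=deepgram.STT(model="nova-3"),  # Deepgram Nova-3; set language per current Arabic support
        llm=openai.LLM.with_groq(          # Groq-hosted Llama 4 Maverick (model id is an assumption)
            model="meta-llama/llama-4-maverick-17b-128e-instruct",
        ),
        tts=elevenlabs.TTS(),              # ElevenLabs TTS
    )

    await session.start(
        agent=Agent(instructions="You are a real estate assistant for UAE callers."),
        room=ctx.room,
        room_input_options=RoomInputOptions(
            # LiveKit BVC background voice cancellation
            noise_cancellation=noise_cancellation.BVC(),
        ),
    )

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```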
Deepgram Nova-3 offers the best combination of latency and quality: 424ms average EOU delay with excellent transcription accuracy, and no caller repetitions needed. It is the only STT tested where both speed and quality deliver.
Both Groq Whisper variants produced poor transcription quality for Arabic, and the Turbo variant added wildly inconsistent latency (284ms to 3.4s). The Whisper architecture is fundamentally weak for Arabic dialects.
Speechmatics endpointing is fast (~460ms average), but Arabic transcription quality is unacceptable: callers had to repeat themselves. Raw speed is meaningless if the transcript is wrong.
Arabic support ranges from excellent (Deepgram, Soniox) to completely non-functional (Voxtral Mini — zero output). Marketing claims about "multilingual support" are unreliable. Always test with real Arabic audio before committing.
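The cheapest way to do that is a small batch script that runs a few real caller recordings through a provider's prerecorded API and lets you read the transcripts yourself. Here is a rough sketch against Deepgram's REST endpoint; the model and language query parameters and the file paths are assumptions, and the same pattern applies to any other provider's batch API.

```python
import os
import requests

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"

def transcribe(path: str, model: str = "nova-3", language: str = "multi") -> str:
    """Send one local audio file to Deepgram's prerecorded endpoint and
    return the top transcript. Model/language values are assumptions --
    confirm which models actually support Arabic before relying on them."""
    with open(path, "rb") as f:
        resp = requests.post(
            DEEPGRAM_URL,
            params={"model": model, "language": language},
            headers={
                "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
                "Content-Type": "audio/wav",
            },
            data=f,
        )
    resp.raise_for_status()
    return resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]

if __name__ == "__main__":
    # Placeholder paths: swap in a handful of real Gulf Arabic caller
    # recordings, not clean MSA studio audio.
    for clip in ["calls/caller_01.wav", "calls/caller_02.wav"]:
        print(clip, "->", transcribe(clip))
```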