Arabic Voice Agent Architecture Guide
How to build a production Arabic voice agent with LiveKit, Deepgram, Groq, and ElevenLabs.
This guide covers the architecture we used to build a production Arabic voice agent for real estate. The stack: LiveKit for real-time communication, Deepgram Nova-3 for STT, Groq Llama 4 Maverick for the LLM, and ElevenLabs for TTS.
The Pipeline
Caller → LiveKit Room → Silero VAD → Deepgram Nova-3 (STT) → Groq Llama 4 (LLM) → ElevenLabs (TTS) → Caller
Each component adds latency. The goal is minimizing total turn time — the time from when the user stops speaking to when they hear the agent's response.
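As a concrete sketch, here is roughly how this pipeline wires together in the LiveKit Agents framework. The structure follows the public plugin APIs, but the model identifiers, language code, and instructions string are assumptions — verify them against your plugin versions and provider docs.

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import deepgram, elevenlabs, openai, silero


async def entrypoint(ctx: agents.JobContext):
    # Each stage of the cascade is a pluggable component.
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-3", language="ar"),  # language code is an assumption
        llm=openai.LLM.with_groq(
            model="meta-llama/llama-4-maverick-17b-128e-instruct"  # verify model id
        ),
        tts=elevenlabs.TTS(model="eleven_multilingual_v2"),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a real-estate assistant. Reply in Gulf Arabic."),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

The sections below go component by component.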
Component Selection
Voice Activity Detection: Silero VAD
VAD detects when the user starts and stops speaking. We use Silero VAD because:
- Free and open-source
- Lightweight (model under 2 MB)
- Language-agnostic (works for Arabic)
- Integrated with LiveKit
Key learning: VAD tuning has diminishing returns. After aggressive tuning (50ms silence duration, 0.30 activation threshold), the bottleneck shifted to STT transcription time.
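For reference, the aggressive settings mentioned above map to two knobs on the Silero plugin loader (the defaults are considerably more conservative, around 0.55s silence and a 0.5 threshold):

```python
from livekit.plugins import silero

# Aggressive end-of-speech tuning from the text.
# These values trade robustness for speed: short pauses mid-sentence
# are more likely to be treated as end of turn.
vad = silero.VAD.load(
    min_silence_duration=0.05,   # 50ms of silence closes the speech segment
    activation_threshold=0.30,   # lower speech-probability bar to trigger
)
```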
Speech-to-Text: Deepgram Nova-3
After testing 8 providers, Deepgram Nova-3 won decisively:
- 424ms average end-of-utterance (EOU) delay, 75% faster than the alternatives
- Excellent Gulf Arabic quality
- LiveKit plugin for zero-config integration
- Streaming for real-time partial results
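A minimal configuration via the LiveKit Deepgram plugin might look like the following (the `language` code is an assumption — check it against Deepgram's Nova-3 language list):

```python
from livekit.plugins import deepgram

stt = deepgram.STT(
    model="nova-3",
    language="ar",         # assumption: Arabic language hint
    interim_results=True,  # stream partial transcripts as the user speaks
)
```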
LLM: Groq Llama 4 Maverick
Groq's hardware-accelerated inference gives us the lowest time-to-first-token for the conversational LLM. Combined with streaming, responses begin playing while the model is still generating.
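Because Groq exposes an OpenAI-compatible endpoint, the LiveKit OpenAI plugin ships a Groq helper. The model id below is our best guess at the Llama 4 Maverick identifier and should be checked against Groq's current model list:

```python
from livekit.plugins import openai

# Groq's API is OpenAI-compatible, so the OpenAI plugin handles it.
llm = openai.LLM.with_groq(
    model="meta-llama/llama-4-maverick-17b-128e-instruct",  # verify id
)
```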
Text-to-Speech: ElevenLabs
ElevenLabs eleven_multilingual_v2 with the Sultan voice provides the most natural Arabic TTS. The streaming API means we can start playback as soon as the first audio chunk is ready.
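Configured through the ElevenLabs plugin, this looks roughly like the snippet below. The voice id is a placeholder for the Sultan voice — resolve it from your ElevenLabs voice library — and kwarg names differ slightly across plugin versions:

```python
from livekit.plugins import elevenlabs

tts = elevenlabs.TTS(
    model="eleven_multilingual_v2",
    voice_id="SULTAN_VOICE_ID",  # placeholder: look up the real id
)
```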
Noise Cancellation: LiveKit BVC
Background Voice Cancellation cleans up the audio before it hits STT, improving transcription accuracy in noisy environments.
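BVC is enabled on the room input rather than as a pipeline stage. Extending the `entrypoint` sketch above (note that BVC requires LiveKit Cloud):

```python
from livekit.agents import Agent, RoomInputOptions
from livekit.plugins import noise_cancellation

# Inside entrypoint(), replacing the plain session.start(...) call:
await session.start(
    room=ctx.room,
    agent=Agent(instructions="..."),
    room_input_options=RoomInputOptions(
        noise_cancellation=noise_cancellation.BVC(),  # requires LiveKit Cloud
    ),
)
```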
Latency Optimization
Total turn time = VAD end-of-speech detection + STT finalization + LLM time-to-first-token + TTS time-to-first-audio. With streaming at every stage, what matters is the time to the first audio chunk, not full generation.
Our optimizations:
- Streaming everywhere: STT, LLM, and TTS all use streaming APIs
- Preemptive generation: Start LLM inference on partial STT results (see the sketch below)
- Aggressive VAD: Minimize silence detection delay
- Provider selection: Choose the fastest provider at each stage
Result: 787ms best-case full turn time with Deepgram Nova-3.
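Of these, preemptive generation has a dedicated switch in recent LiveKit Agents releases; the other optimizations fall out of the provider and parameter choices shown earlier. A sketch, assuming the components configured above:

```python
# vad, stt, llm, tts as configured in the previous sections.
session = AgentSession(
    vad=vad, stt=stt, llm=llm, tts=tts,
    preemptive_generation=True,  # begin LLM inference on interim STT results;
                                 # speculative output is dropped if the final
                                 # transcript turns out to differ
)
```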
What's Next: End-to-End Voice Models
The cascade architecture (STT → LLM → TTS) has inherent latency from inter-stage hops. End-to-end models like GPT-4o Realtime and Ultravox could eliminate these hops entirely. We're evaluating:
- Ultravox: Audio-in, text-out (skips STT)
- GPT-4o Realtime: Audio-in, audio-out (no pipeline at all)
- Gemini 2.0 Flash: Native audio understanding
The key question: do these models handle Gulf Arabic as well as our current pipeline?