Arabic Voice Agent Architecture Guide
How to build a production Arabic voice agent with LiveKit, Deepgram, Groq, and ElevenLabs.
This guide covers the architecture we used to build a production Arabic voice agent for real estate. The stack: LiveKit for real-time communication, Deepgram Nova-3 for STT, Groq Llama 4 Maverick for the LLM, and ElevenLabs for TTS.
The Pipeline
Caller → LiveKit Room → Silero VAD → Deepgram Nova-3 (STT) → Groq Llama 4 (LLM) → ElevenLabs (TTS) → Caller
Each component adds latency. The goal is minimizing total turn time — the time from when the user stops speaking to when they hear the agent's response.
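As a concrete sketch, here is roughly how this pipeline wires together in the LiveKit Agents framework. The structure follows the public plugin APIs, but the model identifiers, language code, and instructions string are assumptions — verify them against your plugin versions and provider docs.

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import deepgram, elevenlabs, openai, silero


async def entrypoint(ctx: agents.JobContext):
    # Each stage of the cascade is a pluggable component.
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-3", language="ar"),  # language code is an assumption
        llm=openai.LLM.with_groq(
            model="meta-llama/llama-4-maverick-17b-128e-instruct"  # verify model id
        ),
        tts=elevenlabs.TTS(model="eleven_multilingual_v2"),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a real-estate assistant. Reply in Gulf Arabic."),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

The sections below go component by component.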
Component Selection
Voice Activity Detection: Silero VAD
VAD detects when the user starts and stops speaking. We use Silero VAD because:
- Free and open-source
- Lightweight (model under 2 MB)
- Language-agnostic (works for Arabic)
- Integrated with LiveKit
Key learning: VAD tuning has diminishing returns. After aggressive tuning (50ms silence duration, 0.30 activation threshold), the bottleneck shifted to STT transcription time.
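For reference, the aggressive settings mentioned above map to two knobs on the Silero plugin loader (the defaults are considerably more conservative, around 0.55s silence and a 0.5 threshold):

```python
from livekit.plugins import silero

# Aggressive end-of-speech tuning from the text.
# These values trade robustness for speed: short pauses mid-sentence
# are more likely to be treated as end of turn.
vad = silero.VAD.load(
    min_silence_duration=0.05,   # 50ms of silence closes the speech segment
    activation_threshold=0.30,   # lower speech-probability bar to trigger
)
```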
Speech-to-Text: Deepgram Nova-3
After testing 8 providers, Deepgram Nova-3 won decisively:
- 424ms average end-of-utterance (EOU) delay, 75% faster than the alternatives
- Excellent Gulf Arabic quality
- LiveKit plugin for zero-config integration
- Streaming for real-time partial results
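A minimal configuration via the LiveKit Deepgram plugin might look like the following (the `language` code is an assumption — check it against Deepgram's Nova-3 language list):

```python
from livekit.plugins import deepgram

stt = deepgram.STT(
    model="nova-3",
    language="ar",         # assumption: Arabic language hint
    interim_results=True,  # stream partial transcripts as the user speaks
)
```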
LLM: Groq Llama 4 Maverick
Groq's hardware-accelerated inference gives us the lowest time-to-first-token for the conversational LLM. Combined with streaming, responses begin playing while the model is still generating.
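Because Groq exposes an OpenAI-compatible endpoint, the LiveKit OpenAI plugin ships a Groq helper. The model id below is our best guess at the Llama 4 Maverick identifier and should be checked against Groq's current model list:

```python
from livekit.plugins import openai

# Groq's API is OpenAI-compatible, so the OpenAI plugin handles it.
llm = openai.LLM.with_groq(
    model="meta-llama/llama-4-maverick-17b-128e-instruct",  # verify id
)
```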
Text-to-Speech: ElevenLabs
ElevenLabs eleven_multilingual_v2 with the Sultan voice provides the most natural Arabic TTS. The streaming API means we can start playback as soon as the first audio chunk is ready.
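Configured through the ElevenLabs plugin, this looks roughly like the snippet below. The voice id is a placeholder for the Sultan voice — resolve it from your ElevenLabs voice library — and kwarg names differ slightly across plugin versions:

```python
from livekit.plugins import elevenlabs

tts = elevenlabs.TTS(
    model="eleven_multilingual_v2",
    voice_id="SULTAN_VOICE_ID",  # placeholder: look up the real id
)
```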
Noise Cancellation: LiveKit BVC
Background Voice Cancellation cleans up the audio before it hits STT, improving transcription accuracy in noisy environments.
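BVC is enabled on the room input rather than as a pipeline stage. Extending the `entrypoint` sketch above (note that BVC requires LiveKit Cloud):

```python
from livekit.agents import Agent, RoomInputOptions
from livekit.plugins import noise_cancellation

# Inside entrypoint(), replacing the plain session.start(...) call:
await session.start(
    room=ctx.room,
    agent=Agent(instructions="..."),
    room_input_options=RoomInputOptions(
        noise_cancellation=noise_cancellation.BVC(),  # requires LiveKit Cloud
    ),
)
```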
Latency Optimization
Total turn time = VAD end-of-speech detection + STT finalization + LLM time-to-first-token + TTS time-to-first-audio. With streaming at every stage, what matters is the time to the first audio chunk, not full generation.
Our optimizations:
- Streaming everywhere: STT, LLM, and TTS all use streaming APIs
- Preemptive generation: Start LLM inference on partial STT results (see the sketch below)
- Aggressive VAD: Minimize silence detection delay
- Provider selection: Choose the fastest provider at each stage
Result: 787ms best-case full turn time with Deepgram Nova-3.
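Of these, preemptive generation has a dedicated switch in recent LiveKit Agents releases; the other optimizations fall out of the provider and parameter choices shown earlier. A sketch, assuming the components configured above:

```python
# vad, stt, llm, tts as configured in the previous sections.
session = AgentSession(
    vad=vad, stt=stt, llm=llm, tts=tts,
    preemptive_generation=True,  # begin LLM inference on interim STT results;
                                 # speculative output is dropped if the final
                                 # transcript turns out to differ
)
```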
What's Next: End-to-End Voice Models
The cascade architecture (STT → LLM → TTS) has inherent latency from inter-stage hops. End-to-end models like GPT-4o Realtime and Ultravox could eliminate these hops entirely. We're evaluating:
- Ultravox: Audio-in, text-out (skips STT)
- GPT-4o Realtime: Audio-in, audio-out (no pipeline at all)
- Gemini 2.0 Flash: Native audio understanding
The key question: do these models handle Gulf Arabic as well as our current pipeline?