Oftentimes, when companies approach Hamming for voice agent testing, they're aware of issues affecting their voice agents, such as latency spikes or agent hallucinations, but not the root cause.
Sometimes they assume the model needs fine-tuning; sometimes they write the issues off as edge cases. But often, the real issue is architectural.
The initial stack that works for a demo or proof-of-concept often falls short when it comes to reliability, scale, and real-world complexity.
The issue usually isn't a lack of powerful components; it's the absence of a suitable framework for selecting and integrating those components into a system that can hold up in production.
This guide provides that framework, breaking down the core architectural patterns, key component trade-offs, and testing strategies that turn a working demo into a production-grade system built for scale.
Foundational Architectures: Cascading vs. Speech-to-Speech
Your first architectural decision isn't about vendors; it's the choice between a cascading architecture and a speech-to-speech (S2S) model. This choice affects:
- Control: Cascading gives more granular control over each step; S2S trades that control for simplicity and speed.
- Latency: Cascading can be optimized per component, but may introduce delays between steps. S2S is often faster end-to-end but harder to tune.
- Observability: Cascading offers clearer logs and metrics at each stage. S2S is more opaque; it's harder to pinpoint where a failure occurred.

Cascading Architecture: The Production Standard
Most production voice agents use this architecture's three-step flow. It offers maximum control and provides clear points for debugging and logging.
- Speech-to-Text (100-300ms): A provider like Deepgram processes the user's speech into a text transcript.
- LLM Processing (300-900ms): An LLM like GPT-4o interprets the text, determines intent, and generates a text response.
- Text-to-Speech (300-1200ms): A service like ElevenLabs converts the LLM's text response into audio.
Total Latency: 2000-4000ms (2-4 seconds) end-to-end once network overhead, turn detection, and orchestration are added on top of the component steps, and processing variability can push this past 4000ms.
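To see where those milliseconds go in your own system, it helps to time each hop explicitly. Below is a minimal sketch of a single cascading turn with per-stage timing; the three provider calls are placeholder stubs (hypothetical helpers, not any specific SDK) to be swapped for your actual STT, LLM, and TTS clients.

```python
import time

def transcribe(audio: bytes) -> str:
    return "what's my account balance"      # placeholder for an STT SDK call

def generate_reply(transcript: str) -> str:
    return "Your balance is $42.10."        # placeholder for an LLM call

def synthesize(text: str) -> bytes:
    return b"\x00" * 16000                  # placeholder for a TTS call

def handle_turn(audio: bytes) -> bytes:
    """Run one cascading turn and record how long each stage took."""
    timings = {}
    t0 = time.perf_counter()
    transcript = transcribe(audio)
    timings["stt_ms"] = (time.perf_counter() - t0) * 1000

    t1 = time.perf_counter()
    reply = generate_reply(transcript)
    timings["llm_ms"] = (time.perf_counter() - t1) * 1000

    t2 = time.perf_counter()
    speech = synthesize(reply)
    timings["tts_ms"] = (time.perf_counter() - t2) * 1000

    timings["total_ms"] = (time.perf_counter() - t0) * 1000
    print(timings)  # in production, ship these to your metrics pipeline
    return speech

handle_turn(b"\x00" * 32000)
```

Per-stage numbers like these are what make the cascading architecture debuggable in the first place.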
Speech-to-Speech (S2S) Models: The Emerging Future
Newer models, such as those powering OpenAI's Realtime API, bypass the text layer for an audio-in, audio-out flow. This drops average latency to 500ms. However, this speed comes at the cost of control. Some companies have reverted from S2S to cascading architectures because they lose the ability to inject compliance logic and audit trails at the intermediate text layer.
The Pragmatic Choice: Hybrid Architectures
A hybrid approach balances speed and control. Use a low-latency S2S model for greetings and conversational turn-taking, then switch to a cascading architecture for complex transactions that require strict logic, tool use, and auditability. Platforms like LiveKit enable this mid-conversation switching.
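As a rough illustration of the idea (not any particular platform's API), a per-turn router can keep small talk on the S2S path and escalate transactional intents to the cascading pipeline. The keyword check and both handlers below are simplified stand-ins.

```python
# Hedged sketch of hybrid routing: casual turns stay on the low-latency S2S
# path, while transactional intents fall back to the cascading pipeline where
# tool calls and audit logging live.
TRANSACTIONAL_KEYWORDS = {"refund", "cancel", "payment", "appointment", "order"}

def run_s2s_turn(audio: bytes) -> bytes:
    return audio  # placeholder: forward audio to the S2S model

def run_cascading_turn(audio: bytes) -> bytes:
    return audio  # placeholder: STT -> LLM (with tools + audit log) -> TTS

def route_turn(partial_transcript: str, audio: bytes) -> bytes:
    # A lightweight streaming STT pass (or the S2S model's own transcript)
    # supplies the text used for routing.
    if any(word in partial_transcript.lower() for word in TRANSACTIONAL_KEYWORDS):
        return run_cascading_turn(audio)
    return run_s2s_turn(audio)

route_turn("I need to cancel my appointment", b"\x00" * 16000)
```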
Breaking Down the Essential Components
The stack breaks down into four core components, each with its own trade-offs:
1. Speech-to-Text (STT): The First Critical Decision
The STT layer is a primary source of agent failures. Focus evaluation on accuracy for specific user demographics and latency under real-world conditions.
Provider | Avg. Latency | Key Differentiator | Best For |
---|---|---|---|
Deepgram Nova-3 | Under 300ms | 94.74% accuracy for pre-recorded, 93.16% for real-time (6.84% WER). Industry-leading performance in noisy environments. | Multilingual real-time transcription, noisy environments, self-serve customization. |
AssemblyAI Universal-1 | ~300ms (streaming) | 93.4% accuracy (6.6% WER) for English. Immutable streaming transcripts with reliable confidence scores. | Live captioning, real-time applications needing stable transcripts. |
Whisper Large-v3 | ~300ms (streaming API) | 92% accuracy (8% WER) for English. Supports a wide range of languages at under 50% WER, strongest in ~10 major languages. | Multilingual applications, zero-shot transcription for diverse languages. |
Engineering Insight: Fine-tuning STT for domain-specific vocabulary (medical terms, product names, internal jargon) is a common and necessary step for production-grade accuracy.
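One way to ground that evaluation is a small WER benchmark over your own domain utterances. The sketch below uses the open-source jiwer package (pip install jiwer) to compute word error rate; the transcribe_with_provider helper and the sample utterances are placeholders for whichever STT SDK you're testing.

```python
import jiwer

# (reference transcript, audio file) pairs drawn from real calls; illustrative only
test_set = [
    ("metformin 500 milligrams twice daily", "audio/rx_001.wav"),
    ("transfer me to the claims department", "audio/claims_004.wav"),
]

def transcribe_with_provider(path: str) -> str:
    # Replace with a real call to the STT provider under test.
    return "placeholder transcript"

references = [ref for ref, _ in test_set]
hypotheses = [transcribe_with_provider(path) for _, path in test_set]

wer = jiwer.wer(references, hypotheses)
print(f"Domain-term WER: {wer:.2%}")
```

Run the same harness against each candidate provider so the comparison reflects your vocabulary and audio conditions, not a vendor's benchmark set.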
2. Large Language Models (LLMs)
In production voice contexts, tool-calling reliability, instruction-following precision, and time-to-first-token matter more than benchmarks.
Model | Time-to-First-Token | Key Differentiator | Best For |
---|---|---|---|
Gemini 2.5 Flash | 300ms TTFT, 252.9 tokens/sec | Strict instruction adherence and reliable tool use. | Workflows demanding predictable, structured outputs. |
GPT-4.1 | 400-600ms TTFT | Strong tool calling and reasoning capabilities. | The default choice for complex, multi-turn conversations. |
Key Consideration: Most voice agent conversations use less than 4K tokens. Don't pay for large context windows you won't use.
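Time-to-first-token is easy to measure directly with streaming. Here is a minimal sketch using the OpenAI Python SDK (v1+, assuming OPENAI_API_KEY is set); swap the model name for whichever candidates you're comparing.

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure_ttft(model: str, prompt: str) -> float:
    """Return milliseconds from request start to the first streamed token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return (time.perf_counter() - start) * 1000
    return float("nan")

print(measure_ttft("gpt-4.1", "Confirm my appointment for Tuesday at 3pm."))
```

Average several runs at the times of day you expect traffic; TTFT varies with provider load.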
3. Text-to-Speech (TTS): Where Milliseconds and Brand Matter
The TTS voice represents your brand. Balance vocal naturalness, latency, and cost.
Provider | Latency | Key Differentiator |
---|---|---|
ElevenLabs Flash v2.5 | ~75ms | Ultra-low latency with support for 32 languages. |
ElevenLabs Turbo v2.5 | 250-300ms | Natural voice quality with balanced speed. |
Cartesia Sonic | 90ms streaming (40ms model) | Very realistic voices at competitive pricing. |
GPT-4o TTS | 220-320ms | Promptable emotional control and dynamic voice modulation. |
Rime AI | Sub-200ms | Best price-to-performance ratio at scale ($40/million chars). |
Key Consideration: Test TTS options with your target user demographic. What sounds natural to one group may feel robotic to another. Budget $5-10K for a custom voice if brand consistency is a top priority.
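For TTS, the latency number that matters for perceived snappiness is time-to-first-audio-byte on a streaming request. A generic sketch is below; the URL, headers, and payload are placeholders, so check your provider's docs for the real request shape.

```python
import time
import requests

def ttfb_ms(url: str, headers: dict, payload: dict) -> float:
    """Milliseconds until the first audio bytes arrive on a streaming response."""
    start = time.perf_counter()
    with requests.post(url, headers=headers, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for _chunk in resp.iter_content(chunk_size=1024):
            return (time.perf_counter() - start) * 1000
    return float("nan")

# Example usage (placeholder endpoint and credentials):
# print(ttfb_ms("https://tts.example.com/v1/stream",
#               {"Authorization": "Bearer <key>"},
#               {"text": "Thanks for calling, how can I help?"}))
```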
4. Conversation Orchestration
Making conversational flow feel natural goes beyond simple silence detection. This is a common point of failure for internal builds.
- LiveKit Agents: Provides hardware-accelerated Voice Activity Detection (VAD) that handles interruptions gracefully. Best for teams needing a production-ready orchestration layer quickly.
- Pipecat: An open-source framework with full customization but requires significant DevOps and engineering expertise. Ideal for teams with unique requirements that off-the-shelf platforms can't meet.
Engineering Insight: Measure false-positive interruptions closely—being cut off drives user frustration. Implement dynamic silence thresholds (300ms for quick exchanges, 800ms for users who speak more slowly).
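A dynamic threshold can be as simple as adapting to the speaker's recent pause lengths, clamped between the fast and slow limits above. A minimal sketch follows; the 1.5x multiplier and 500ms default are assumptions to tune against your own call data.

```python
def silence_threshold_ms(recent_pause_ms: list[float]) -> float:
    """Pick an endpointing threshold based on the caller's recent pause behavior."""
    if not recent_pause_ms:
        return 500.0  # neutral default before we know the speaker's rhythm
    avg_pause = sum(recent_pause_ms) / len(recent_pause_ms)
    # Clamp between the aggressive (300ms) and patient (800ms) limits.
    return max(300.0, min(800.0, avg_pause * 1.5))

print(silence_threshold_ms([180.0, 220.0, 250.0]))  # fast speaker -> ~325ms
print(silence_threshold_ms([450.0, 600.0, 700.0]))  # slow speaker -> 800ms
```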
Platform Comparison: Real Numbers, Real Trade-offs
Several platforms help teams deploy faster by abstracting away the complexity of integrating these components.
Platform | Time to First Call | Monthly Cost (10K mins) | Key Strength | Key Limitation |
---|---|---|---|---|
Retell | 3 hours | $500 - $3,100 | Visual workflow builder for non-technical teams. | Least flexible; adds 50-100ms of latency overhead. |
Vapi | 2 hours | $500 - $1,300 | Strong developer experience and API design. | Can become costly at high scale ($0.05-0.13/min total). |
Custom (LiveKit/Pipecat) | 2+ weeks | ~$800 + eng. time | Complete control, lowest latency, up to 80% cost savings at scale. | Large up-front investment (2+ engineers for 3+ months). |
- Choose a Platform (Retell/Vapi) if: You're a SaaS company adding voice features, validating a voice UX, or need to build and iterate without deep engineering dependency.
- Choose a Custom Build (LiveKit) if: You have unique architectural requirements, expect high volume, and have a strong engineering team ready for a significant, long-term commitment.
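A quick back-of-the-envelope comparison makes the volume threshold concrete. All figures below are illustrative assumptions (a blended platform rate of $0.10/min, custom infrastructure at $0.02/min, and $10K/month of ongoing engineering for the custom stack); substitute your own quotes before deciding.

```python
PLATFORM_PER_MIN = 0.10       # assumed blended platform rate, $/minute
CUSTOM_PER_MIN = 0.02         # assumed infra cost for a custom stack, $/minute
CUSTOM_ENG_MONTHLY = 10_000   # assumed ongoing engineering/maintenance, $/month

def monthly_cost(minutes: int) -> tuple[float, float]:
    platform = minutes * PLATFORM_PER_MIN
    custom = minutes * CUSTOM_PER_MIN + CUSTOM_ENG_MONTHLY
    return platform, custom

for minutes in (10_000, 100_000, 1_000_000):
    platform, custom = monthly_cost(minutes)
    print(f"{minutes:>9,} min/mo  platform=${platform:>9,.0f}  custom=${custom:>9,.0f}")
```

Under these assumptions the platform wins comfortably at 10K minutes/month, while the custom build pulls ahead well before 1M minutes/month, which is the shape of trade-off behind the guidance above.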
The 4-Layer Quality Framework for Production Systems
A successful voice agent isn't just functional—it's observable, reliable, and testable. A production-ready stack must support a quality assurance process. This 4-layer framework outlines the critical areas to monitor.
Layer 1: Infrastructure Health
- What to Test: Time-to-first-word (<500ms), audio quality scores (no artifacts in 99.9% of calls), and SIP trunk reliability (connection stability, packet loss).
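A simple way to enforce the latency target is a gate over logged call metrics, for example a p95 check in your test suite. The sample values below are illustrative stand-ins for whatever your logging pipeline produces.

```python
import statistics

time_to_first_word_ms = [310, 420, 385, 290, 510, 330]  # sample call metrics

p95 = statistics.quantiles(time_to_first_word_ms, n=20)[-1]
assert p95 < 500, f"p95 time-to-first-word {p95:.0f}ms exceeds the 500ms target"
```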
Layer 2: Agent Execution Accuracy
- What to Test: Prompt compliance, entity extraction accuracy (names, dates, addresses), and context retention across multiple turns. Most agents score 95% on happy paths but drop to 60% with background noise or accents, which is why teams need our solution: what one customer described as a "powerhouse regression" system.
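A minimal version of that regression check compares hand-labelled expectations against what the agent extracted on a recorded call. The record structure below is hypothetical, not a specific platform's schema.

```python
expected = {"name": "Maria Lopez", "date": "2025-07-14", "phone": "555-0142"}
extracted = {"name": "Maria Lopez", "date": "2025-07-15", "phone": "555-0142"}

mismatches = {
    field: (want, extracted.get(field))
    for field, want in expected.items()
    if extracted.get(field) != want
}
accuracy = 1 - len(mismatches) / len(expected)
print(f"Entity accuracy: {accuracy:.0%}, mismatches: {mismatches}")
```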
Layer 3: User Satisfaction Signals
- What to Test: Frustration markers (tracking words like "What?" or "I already said..."), conversation efficiency (turns vs. optimal path), and emotional trajectory. Many technically "successful" calls contain user frustration.
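Frustration markers can be caught with a simple transcript scan; the phrase list below is illustrative and should be tuned against your own call data.

```python
import re

FRUSTRATION_PATTERNS = [
    r"\bwhat\?",
    r"\bi already said\b",
    r"\bthat's not what i\b",
    r"\blet me speak to a (person|human|agent)\b",
]

def frustration_hits(transcript: str) -> list[str]:
    """Return the frustration patterns found in a call transcript."""
    lower = transcript.lower()
    return [p for p in FRUSTRATION_PATTERNS if re.search(p, lower)]

print(frustration_hits("What? I already said my account number."))
```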
Layer 4: Business Outcome Achievement
- What to Test: End-to-end validation (did the order get created, appointment scheduled, or patient record updated correctly?), business logic verification (was the right product offered?), and revenue impact.
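A hedged sketch of a Layer 4 check: after a simulated booking call, verify the appointment actually landed in the system of record. The fetch_appointments helper is a hypothetical stand-in for a query against your real scheduling API or database.

```python
def fetch_appointments(patient_id: str) -> list[dict]:
    # Replace with a real API call or database query.
    return [{"patient_id": patient_id, "date": "2025-07-14", "status": "confirmed"}]

def appointment_created(patient_id: str, expected_date: str) -> bool:
    return any(
        a["date"] == expected_date and a["status"] == "confirmed"
        for a in fetch_appointments(patient_id)
    )

assert appointment_created("pt_123", "2025-07-14"), "Call 'succeeded' but no appointment exists"
```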
Your 30-Day Implementation Plan
A four-week sprint takes you from architectural decisions to a validated stack.

Week 1: Define Your Non-Negotiables
- Document Constraints: Budget (per-minute cost at scale), latency targets, and deal-breakers like HIPAA compliance.
- Map Critical User Journeys: List the top 5 conversation flows and define success/failure for each.
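One lightweight way to capture those journeys is as test fixtures you can reuse during Weeks 2-4. The flows and criteria below are purely illustrative placeholders.

```python
CRITICAL_JOURNEYS = {
    "book_appointment": {
        "success": "appointment exists in the scheduling system with the correct date/time",
        "failure": "double-booking, wrong provider, or the request is silently dropped",
        "max_turns": 8,
    },
    "check_order_status": {
        "success": "correct order status read back, verified against the order system",
        "failure": "wrong order referenced or caller asked to repeat the order number twice",
        "max_turns": 5,
    },
}
```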
Week 2: Hands-On Platform Testing
- Build Your Core Flow: Implement the same primary user journey on Retell and Vapi to compare developer experience and flexibility. If considering a custom build, scope a LiveKit demo.
- Evaluate Platform Maturity (Red Flags):
- Lack of a Self-Serve Trial: Platforms should let their product speak for itself. Gated demos can hide limitations.
- Poor or Outdated Documentation: This indicates the quality of support and developer experience to expect.
- Inactive Developer Community: A vibrant community is a resource for troubleshooting and identifying common issues.
Week 3: Component Deep Dive
- STT Shootout: Use 50+ recorded utterances from real users (with accents, background noise, and domain terms) to test each provider's accuracy and latency under your specific conditions.
- LLM & TTS Gauntlet: Create 20 complex, multi-turn conversation scripts to test instruction following. Blind test TTS options with target users and ask which sounds more "trustworthy."
Week 4: Make Your Decision
- Architecture Choice:
- Need production in < 1 month? → Platform (Retell/Vapi).
- Processing < 10K minutes/month? → Platform.
- Unique requirements + engineering resources? → Custom.
- Compliance is critical? → A provider like Bland AI or a custom build with a clear audit trail.
- Final Validation: Build a proof-of-concept of your most complex use case with the chosen stack.
Conclusion
Most teams discover within 90 days of launch that their architecture isn't scaling. The most successful teams aren't those with the most hyped components, but those who treat their voice stack as an engineering system—one that's architected for observability and continuous improvement.
Your users don't care about your stack. They care about conversations that feel natural, responses that arrive instantly, and agents that solve their problems. Making the right architectural choice is the first step. The next is implementing the monitoring and testing required to maintain that quality at scale. Choose accordingly.