Oftentimes, when companies approach Hamming for voice agent testing, they're aware of issues affecting their voice agents, such as latency spikes or agent hallucinations, but not the root cause.
Sometimes they assume the model needs fine-tuning; sometimes they write the issues off as edge cases. But often, the real issue is architectural.
The initial stack that works for a demo or proof-of-concept often falls short when it comes to reliability, scale, and real-world complexity.
The issue usually isn't a lack of powerful components; it's the absence of a suitable framework for selecting and integrating those components into a system that can hold up in production.
This guide provides that framework, breaking down the core architectural patterns, key component trade-offs, and testing strategies that turn a working demo into a production-grade system built for scale.
Foundational Architectures: Cascading vs. Speech-to-Speech
Your first architectural decision isn't about vendors; it's the choice between a cascading architecture and a speech-to-speech (S2S) model. This choice affects:
- Control: Cascading gives more granular control over each step; S2S trades that control for simplicity and speed.
- Latency: Cascading can be optimized per component, but may introduce delays between steps. S2S is often faster end-to-end but harder to tune.
- Observability: Cascading offers clearer logs and metrics at each stage. S2S is more opaque; it's harder to pinpoint where a failure occurred.

Cascading Architecture: The Production Standard
Most production voice agents use this architecture's three-step flow. It offers maximum control and provides clear points for debugging and logging.
- Speech-to-Text (100-300ms): A provider like Deepgram processes the user's speech into a text transcript.
- LLM Processing (300-900ms): An LLM like GPT-4o interprets the text, determines intent, and generates a text response.
- Text-to-Speech (300-1200ms): A service like ElevenLabs converts the LLM's text response into audio.
Total Latency: 2000-4000ms (2-4 seconds) end-to-end once network overhead, turn detection, and orchestration are added on top of the component steps, and processing variability can push this past 4000ms.
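To see where those milliseconds go in your own system, it helps to time each hop explicitly. Below is a minimal sketch of a single cascading turn with per-stage timing; the three provider calls are placeholder stubs (hypothetical helpers, not any specific SDK) to be swapped for your actual STT, LLM, and TTS clients.

```python
import time

def transcribe(audio: bytes) -> str:
    return "what's my account balance"      # placeholder for an STT SDK call

def generate_reply(transcript: str) -> str:
    return "Your balance is $42.10."        # placeholder for an LLM call

def synthesize(text: str) -> bytes:
    return b"\x00" * 16000                  # placeholder for a TTS call

def handle_turn(audio: bytes) -> bytes:
    """Run one cascading turn and record how long each stage took."""
    timings = {}
    t0 = time.perf_counter()
    transcript = transcribe(audio)
    timings["stt_ms"] = (time.perf_counter() - t0) * 1000

    t1 = time.perf_counter()
    reply = generate_reply(transcript)
    timings["llm_ms"] = (time.perf_counter() - t1) * 1000

    t2 = time.perf_counter()
    speech = synthesize(reply)
    timings["tts_ms"] = (time.perf_counter() - t2) * 1000

    timings["total_ms"] = (time.perf_counter() - t0) * 1000
    print(timings)  # in production, ship these to your metrics pipeline
    return speech

handle_turn(b"\x00" * 32000)
```

Per-stage numbers like these are what make the cascading architecture debuggable in the first place.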
Speech-to-Speech (S2S) Models: The Emerging Future
Newer models, such as those powering OpenAI's Realtime API, bypass the text layer for an audio-in, audio-out flow. This drops average latency to 500ms. However, this speed comes at the cost of control. Some companies have reverted from S2S to cascading architectures because they lose the ability to inject compliance logic and audit trails at the intermediate text layer.
The Pragmatic Choice: Hybrid Architectures
A hybrid approach balances speed and control. Use a low-latency S2S model for greetings and conversational turn-taking, then switch to a cascading architecture for complex transactions that require strict logic, tool use, and auditability. Platforms like LiveKit enable this mid-conversation switching.
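As a rough illustration of the idea (not any particular platform's API), a per-turn router can keep small talk on the S2S path and escalate transactional intents to the cascading pipeline. The keyword check and both handlers below are simplified stand-ins.

```python
# Hedged sketch of hybrid routing: casual turns stay on the low-latency S2S
# path, while transactional intents fall back to the cascading pipeline where
# tool calls and audit logging live.
TRANSACTIONAL_KEYWORDS = {"refund", "cancel", "payment", "appointment", "order"}

def run_s2s_turn(audio: bytes) -> bytes:
    return audio  # placeholder: forward audio to the S2S model

def run_cascading_turn(audio: bytes) -> bytes:
    return audio  # placeholder: STT -> LLM (with tools + audit log) -> TTS

def route_turn(partial_transcript: str, audio: bytes) -> bytes:
    # A lightweight streaming STT pass (or the S2S model's own transcript)
    # supplies the text used for routing.
    if any(word in partial_transcript.lower() for word in TRANSACTIONAL_KEYWORDS):
        return run_cascading_turn(audio)
    return run_s2s_turn(audio)

route_turn("I need to cancel my appointment", b"\x00" * 16000)
```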
Breaking Down the Essential Components
The stack breaks down into four core components, each with its own trade-offs:
1. Speech-to-Text (STT): The First Critical Decision
The STT layer is a primary source of agent failures. Focus evaluation on accuracy for specific user demographics and latency under real-world conditions.
Provider | Avg. Latency | Key Differentiator | Best For |
---|---|---|---|
Deepgram Nova-3 | Under 300ms | 94.74% accuracy for pre-recorded, 93.16% for real-time (6.84% WER). Industry-leading performance in noisy environments. | Multilingual real-time transcription, noisy environments, self-serve customization. |
AssemblyAI Universal-1 | ~300ms (streaming) | 93.4% accuracy (6.6% WER) for English. Immutable streaming transcripts with reliable confidence scores. | Live captioning, real-time applications needing stable transcripts. |
Whisper Large-v3 | ~300ms (streaming API) | 92% accuracy (8% WER) for English. Supports a wide range of languages at under 50% WER, strongest in ~10 major languages. | Multilingual applications, zero-shot transcription for diverse languages. |
Engineering Insight: Fine-tuning STT for domain-specific vocabulary (medical terms, product names, internal jargon) is a common and necessary step for production-grade accuracy.
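One way to ground that evaluation is a small WER benchmark over your own domain utterances. The sketch below uses the open-source jiwer package (pip install jiwer) to compute word error rate; the transcribe_with_provider helper and the sample utterances are placeholders for whichever STT SDK you're testing.

```python
import jiwer

# (reference transcript, audio file) pairs drawn from real calls; illustrative only
test_set = [
    ("metformin 500 milligrams twice daily", "audio/rx_001.wav"),
    ("transfer me to the claims department", "audio/claims_004.wav"),
]

def transcribe_with_provider(path: str) -> str:
    # Replace with a real call to the STT provider under test.
    return "placeholder transcript"

references = [ref for ref, _ in test_set]
hypotheses = [transcribe_with_provider(path) for _, path in test_set]

wer = jiwer.wer(references, hypotheses)
print(f"Domain-term WER: {wer:.2%}")
```

Run the same harness against each candidate provider so the comparison reflects your vocabulary and audio conditions, not a vendor's benchmark set.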
2. Large Language Models (LLMs)
In production voice contexts, tool-calling reliability, instruction-following precision, and time-to-first-token matter more than benchmarks.
Model | Time-to-First-Token | Key Differentiator | Best For |
---|---|---|---|
Gemini 2.5 Flash | 300ms TTFT, 252.9 tokens/sec | Strict instruction adherence and reliable tool use. | Workflows demanding predictable, structured outputs. |
GPT-4.1 | 400-600ms TTFT | Strong tool calling and reasoning capabilities. | The default choice for complex, multi-turn conversations. |
Key Consideration: Most voice agent conversations use less than 4K tokens. Don't pay for large context windows you won't use.
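Time-to-first-token is easy to measure directly with streaming. Here is a minimal sketch using the OpenAI Python SDK (v1+, assuming OPENAI_API_KEY is set); swap the model name for whichever candidates you're comparing.

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure_ttft(model: str, prompt: str) -> float:
    """Return milliseconds from request start to the first streamed token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return (time.perf_counter() - start) * 1000
    return float("nan")

print(measure_ttft("gpt-4.1", "Confirm my appointment for Tuesday at 3pm."))
```

Average several runs at the times of day you expect traffic; TTFT varies with provider load.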
3. Text-to-Speech (TTS): Where Milliseconds and Brand Matter
The TTS voice represents your brand. Balance vocal naturalness, latency, and cost.
Provider | Latency | Key Differentiator |
---|---|---|
ElevenLabs Flash v2.5 | ~75ms | Ultra-low latency with support for 32 languages. |
ElevenLabs Turbo v2.5 | 250-300ms | Natural voice quality with balanced speed. |
Cartesia Sonic | 90ms streaming (40ms model) | Very realistic voices at competitive pricing. |
GPT-4o TTS | 220-320ms | Promptable emotional control and dynamic voice modulation. |
Rime AI | Sub-200ms | Best price-to-performance ratio at scale ($40/million chars). |
Key Consideration: Test TTS options with your target user demographic. What sounds natural to one group may feel robotic to another. Budget $5-10K for a custom voice if brand consistency is a top priority.
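For TTS, the latency number that matters for perceived snappiness is time-to-first-audio-byte on a streaming request. A generic sketch is below; the URL, headers, and payload are placeholders, so check your provider's docs for the real request shape.

```python
import time
import requests

def ttfb_ms(url: str, headers: dict, payload: dict) -> float:
    """Milliseconds until the first audio bytes arrive on a streaming response."""
    start = time.perf_counter()
    with requests.post(url, headers=headers, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for _chunk in resp.iter_content(chunk_size=1024):
            return (time.perf_counter() - start) * 1000
    return float("nan")

# Example usage (placeholder endpoint and credentials):
# print(ttfb_ms("https://tts.example.com/v1/stream",
#               {"Authorization": "Bearer <key>"},
#               {"text": "Thanks for calling, how can I help?"}))
```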
4. Conversation Orchestration
Making conversational flow feel natural goes beyond simple silence detection. This is a common point of failure for internal builds.
- LiveKit Agents: Provides hardware-accelerated Voice Activity Detection (VAD) that handles interruptions gracefully. Best for teams needing a production-ready orchestration layer quickly.
- Pipecat: An open-source framework with full customization but requires significant DevOps and engineering expertise. Ideal for teams with unique requirements that off-the-shelf platforms can't meet.
Engineering Insight: Measure false-positive interruptions closely—being cut off drives user frustration. Implement dynamic silence thresholds (300ms for quick exchanges, 800ms for users who speak more slowly).
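A dynamic threshold can be as simple as adapting to the speaker's recent pause lengths, clamped between the fast and slow limits above. A minimal sketch follows; the 1.5x multiplier and 500ms default are assumptions to tune against your own call data.

```python
def silence_threshold_ms(recent_pause_ms: list[float]) -> float:
    """Pick an endpointing threshold based on the caller's recent pause behavior."""
    if not recent_pause_ms:
        return 500.0  # neutral default before we know the speaker's rhythm
    avg_pause = sum(recent_pause_ms) / len(recent_pause_ms)
    # Clamp between the aggressive (300ms) and patient (800ms) limits.
    return max(300.0, min(800.0, avg_pause * 1.5))

print(silence_threshold_ms([180.0, 220.0, 250.0]))  # fast speaker -> ~325ms
print(silence_threshold_ms([450.0, 600.0, 700.0]))  # slow speaker -> 800ms
```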
Platform Comparison: Real Numbers, Real Trade-offs
Several platforms help teams deploy faster by abstracting away the complexity of integrating these components.
Platform | Time to First Call | Monthly Cost (10K mins) | Key Strength | Key Limitation |
---|---|---|---|---|
Retell | 3 hours | $500 - $3,100 | Visual workflow builder for non-technical teams. | Least flexible; adds 50-100ms of latency overhead. |
Vapi | 2 hours | $500 - $1,300 | Strong developer experience and API design. | Can become costly at high scale ($0.05-0.13/min total). |
Custom (LiveKit/Pipecat) | 2+ weeks | ~$800 + eng. time | Complete control, lowest latency, up to 80% cost savings at scale. | Large up-front investment (2+ engineers for 3+ months). |
- Choose a Platform (Retell/Vapi) if: You're a SaaS company adding voice features, validating a voice UX, or need to build and iterate without deep engineering dependency.
- Choose a Custom Build (LiveKit) if: You have unique architectural requirements, expect high volume, and have a strong engineering team ready for a significant, long-term commitment.
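A quick back-of-the-envelope comparison makes the volume threshold concrete. All figures below are illustrative assumptions (a blended platform rate of $0.10/min, custom infrastructure at $0.02/min, and $10K/month of ongoing engineering for the custom stack); substitute your own quotes before deciding.

```python
PLATFORM_PER_MIN = 0.10       # assumed blended platform rate, $/minute
CUSTOM_PER_MIN = 0.02         # assumed infra cost for a custom stack, $/minute
CUSTOM_ENG_MONTHLY = 10_000   # assumed ongoing engineering/maintenance, $/month

def monthly_cost(minutes: int) -> tuple[float, float]:
    platform = minutes * PLATFORM_PER_MIN
    custom = minutes * CUSTOM_PER_MIN + CUSTOM_ENG_MONTHLY
    return platform, custom

for minutes in (10_000, 100_000, 1_000_000):
    platform, custom = monthly_cost(minutes)
    print(f"{minutes:>9,} min/mo  platform=${platform:>9,.0f}  custom=${custom:>9,.0f}")
```

Under these assumptions the platform wins comfortably at 10K minutes/month, while the custom build pulls ahead well before 1M minutes/month, which is the shape of trade-off behind the guidance above.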
The 4-Layer Quality Framework for Production Systems
A successful voice agent isn't just functional—it's observable, reliable, and testable. A production-ready stack must support a quality assurance process. This 4-layer framework outlines the critical areas to monitor.
Layer 1: Infrastructure Health
- What to Test: Time-to-first-word (<500ms), audio quality scores (no artifacts in 99.9% of calls), and SIP trunk reliability (connection stability, packet loss).
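A simple way to enforce the latency target is a gate over logged call metrics, for example a p95 check in your test suite. The sample values below are illustrative stand-ins for whatever your logging pipeline produces.

```python
import statistics

time_to_first_word_ms = [310, 420, 385, 290, 510, 330]  # sample call metrics

p95 = statistics.quantiles(time_to_first_word_ms, n=20)[-1]
assert p95 < 500, f"p95 time-to-first-word {p95:.0f}ms exceeds the 500ms target"
```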
Layer 2: Agent Execution Accuracy
- What to Test: Prompt compliance, entity extraction accuracy (names, dates, addresses), and context retention across multiple turns. Most agents score 95% on happy paths but drop to 60% with background noise or accents, which is why teams need our solution: what one customer described as a "powerhouse regression" system.
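A minimal version of that regression check compares hand-labelled expectations against what the agent extracted on a recorded call. The record structure below is hypothetical, not a specific platform's schema.

```python
expected = {"name": "Maria Lopez", "date": "2025-07-14", "phone": "555-0142"}
extracted = {"name": "Maria Lopez", "date": "2025-07-15", "phone": "555-0142"}

mismatches = {
    field: (want, extracted.get(field))
    for field, want in expected.items()
    if extracted.get(field) != want
}
accuracy = 1 - len(mismatches) / len(expected)
print(f"Entity accuracy: {accuracy:.0%}, mismatches: {mismatches}")
```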
Layer 3: User Satisfaction Signals
- What to Test: Frustration markers (tracking words like "What?" or "I already said..."), conversation efficiency (turns vs. optimal path), and emotional trajectory. Many technically "successful" calls contain user frustration.
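Frustration markers can be caught with a simple transcript scan; the phrase list below is illustrative and should be tuned against your own call data.

```python
import re

FRUSTRATION_PATTERNS = [
    r"\bwhat\?",
    r"\bi already said\b",
    r"\bthat's not what i\b",
    r"\blet me speak to a (person|human|agent)\b",
]

def frustration_hits(transcript: str) -> list[str]:
    """Return the frustration patterns found in a call transcript."""
    lower = transcript.lower()
    return [p for p in FRUSTRATION_PATTERNS if re.search(p, lower)]

print(frustration_hits("What? I already said my account number."))
```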
Layer 4: Business Outcome Achievement
- What to Test: End-to-end validation (did the order get created, appointment scheduled, or patient record updated correctly?), business logic verification (was the right product offered?), and revenue impact.
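A hedged sketch of a Layer 4 check: after a simulated booking call, verify the appointment actually landed in the system of record. The fetch_appointments helper is a hypothetical stand-in for a query against your real scheduling API or database.

```python
def fetch_appointments(patient_id: str) -> list[dict]:
    # Replace with a real API call or database query.
    return [{"patient_id": patient_id, "date": "2025-07-14", "status": "confirmed"}]

def appointment_created(patient_id: str, expected_date: str) -> bool:
    return any(
        a["date"] == expected_date and a["status"] == "confirmed"
        for a in fetch_appointments(patient_id)
    )

assert appointment_created("pt_123", "2025-07-14"), "Call 'succeeded' but no appointment exists"
```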
Your 30-Day Implementation Plan
A four-week sprint takes you from architectural decisions to a validated stack.

Week 1: Define Your Non-Negotiables
- Document Constraints: Budget (per-minute cost at scale), latency targets, and deal-breakers like HIPAA compliance.
- Map Critical User Journeys: List the top 5 conversation flows and define success/failure for each.
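One lightweight way to capture those journeys is as test fixtures you can reuse during Weeks 2-4. The flows and criteria below are purely illustrative placeholders.

```python
CRITICAL_JOURNEYS = {
    "book_appointment": {
        "success": "appointment exists in the scheduling system with the correct date/time",
        "failure": "double-booking, wrong provider, or the request is silently dropped",
        "max_turns": 8,
    },
    "check_order_status": {
        "success": "correct order status read back, verified against the order system",
        "failure": "wrong order referenced or caller asked to repeat the order number twice",
        "max_turns": 5,
    },
}
```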
Week 2: Hands-On Platform Testing
- Build Your Core Flow: Implement the same primary user journey on Retell and Vapi to compare developer experience and flexibility. If considering a custom build, scope a LiveKit demo.
- Evaluate Platform Maturity (Red Flags):
- Lack of a Self-Serve Trial: Platforms should let their product speak for itself. Gated demos can hide limitations.
- Poor or Outdated Documentation: This indicates the quality of support and developer experience to expect.
- Inactive Developer Community: A vibrant community is a resource for troubleshooting and identifying common issues.
Week 3: Component Deep Dive
- STT Shootout: Use 50+ recorded utterances from real users (with accents, background noise, and domain terms) to test each provider's accuracy and latency under your specific conditions.
- LLM & TTS Gauntlet: Create 20 complex, multi-turn conversation scripts to test instruction following. Blind test TTS options with target users and ask which sounds more "trustworthy."
Week 4: Make Your Decision
- Architecture Choice:
- Need production in < 1 month? → Platform (Retell/Vapi).
- Processing < 10K minutes/month? → Platform.
- Unique requirements + engineering resources? → Custom.
- Compliance is critical? → A provider like Bland AI or a custom build with a clear audit trail.
- Final Validation: Build a proof-of-concept of your most complex use case with the chosen stack.
Conclusion
Most teams discover within 90 days of launch that their architecture isn't scaling. The most successful teams aren't those with the most hyped components, but those who treat their voice stack as an engineering system—one that's architected for observability and continuous improvement.
Your users don't care about your stack. They care about conversations that feel natural, responses that arrive instantly, and agents that solve their problems. Making the right architectural choice is the first step. The next is implementing the monitoring and testing required to maintain that quality at scale. Choose accordingly.