Best Voice Agent Stack: A Complete Selection Framework

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 1M+ voice agent calls to find where they break.

August 4, 2025 · Updated December 23, 2025 · 14 min read

The short version: There's no "best" stack - there's the stack that fits what you're actually trying to do. I've seen teams waste months building custom when they should've used Retell, and teams struggle on managed platforms when they clearly needed to own the infrastructure. This guide won't tell you what to pick. It'll give you the framework to figure it out yourself.

Quick filter: Need audit trails for compliance? Cascading architecture. Latency is the only thing that matters? Look at speech-to-speech. Not sure yet? Keep reading.

How to Choose the Best Voice Agent Stack

Here's the pattern I see over and over: team builds a voice agent, demo works great, they launch, and within a month something's broken that they can't explain. Latency spikes. Random hallucinations. Calls dropping. They assume it's the LLM and start prompt engineering. Sometimes that's right. Usually it's not.

We did an audit last month - company had spent $40K on prompt iterations. Forty thousand dollars. Want to know what the actual problem was? Their STT was clipping the first 200ms of every utterance. The transcripts were garbage before the LLM ever saw them. No amount of prompt work was going to fix that.

Look, the individual components are fine. Deepgram, ElevenLabs, GPT-4, whatever - they all work. The issue is almost never that you picked a "bad" component. It's that you assembled good components without thinking through how they'd interact at scale, or what happens when OpenAI has a rough day and your latency triples.

I'm going to walk through a framework for thinking about this. Fair warning: there are a lot of tables and benchmarks below. Use them as starting points. Your specific use case matters more than any benchmark I can show you.

Methodology Note: Component benchmarks and latency thresholds in this guide are based on Hamming's testing across 200+ voice agent deployments (2025) and published provider specifications. Actual performance varies by implementation, traffic patterns, and configuration. Cost estimates reflect publicly available pricing as of December 2025.

What's Actually in a "Stack"?

When people say "voice agent stack" they're usually talking about the full pipeline: how calls come in (telephony), how audio becomes text (ASR/STT), how text becomes decisions (LLM + tools), how decisions become speech (TTS), and - the part everyone forgets - how you know any of it is actually working (monitoring/QA).

Most teams get the first four right and completely ignore the fifth. Don't do that.

The Selection Framework

I'm going to give you a scoring framework here. Honestly, I'm a little ambivalent about frameworks - they can make complex decisions feel falsely simple. But having some structure beats arguing in circles for weeks, which is what happens otherwise.

The Four Dimensions

| Dimension | Key Questions | Weight |
| --- | --- | --- |
| Architecture | Cascading vs S2S? Hybrid needed? | 30% |
| Components | Which STT/LLM/TTS meet your requirements? | 25% |
| Platform | Build vs buy? Time-to-market constraints? | 25% |
| QA Layer | How will you test, monitor, and improve? | 20% |

Quick Decision Matrix

Use this matrix to quickly narrow your options:

| If you need... | Choose... | Why |
| --- | --- | --- |
| Fastest time-to-production (<1 month) | Retell or Vapi | Pre-integrated components, visual builders |
| Lowest latency (<500ms) | Speech-to-speech + custom build | S2S models eliminate text layer overhead |
| Maximum control + auditability | Cascading + LiveKit/Pipecat | Full visibility at each layer |
| Compliance (HIPAA, SOC2) | Bland AI or custom build | Clear audit trails, data residency control |
| Cost efficiency at scale (>50K mins/month) | Custom build with LiveKit | Up to 80% savings vs managed platforms |
| Multilingual support (10+ languages) | Cascading with Deepgram + ElevenLabs | Best accuracy across languages |

These are starting points, not gospel. The "fastest time-to-production" platforms will get you live quickly, but roughly half the teams we talk to migrate off within 12 months once they hit scale or customization limits. Factor that into your decision.

Stack Scoring Rubric

Score each dimension from 1-5 to compare options:

| Criterion | 1 (Poor) | 3 (Adequate) | 5 (Excellent) |
| --- | --- | --- | --- |
| Latency | 3+s end-to-end | 1.5-3s | <1.5s |
| STT Accuracy | >12% WER | 8-12% WER | <8% WER |
| TTS Quality | Robotic, noticeable | Natural for most users | Indistinguishable from human |
| Time-to-Production | 3+ months | 1-3 months | <1 month |
| Cost at 10K mins | >$3,000/month | $1,000-3,000/month | <$1,000/month |
| Observability | Logs only | Basic metrics | Full tracing + audio replay |
| Customization | Locked configuration | API access | Full source control |

Scoring thresholds:

  • 28-35: Production-ready stack
  • 21-27: Viable with optimizations
  • <21: Significant gaps—reconsider architecture

One thing these scores don't capture: how much your specific use case matters. A 3-second latency that's "poor" for a real-time assistant might be perfectly fine for a voicemail transcription bot. Score against your actual requirements, not industry averages.
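
If you want to make the rubric mechanical, here's a minimal sketch of the arithmetic. The criterion keys and the example scores are illustrative, not a recommendation:

```python
# A minimal sketch of the rubric above: sum seven 1-5 criterion scores,
# then apply the 28/21 thresholds. The example scores are illustrative only.
CRITERIA = [
    "latency", "stt_accuracy", "tts_quality", "time_to_production",
    "cost_at_10k_mins", "observability", "customization",
]

def classify_stack(scores: dict) -> str:
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    total = sum(scores[c] for c in CRITERIA)
    if total >= 28:
        return f"{total}/35: production-ready"
    if total >= 21:
        return f"{total}/35: viable with optimizations"
    return f"{total}/35: significant gaps - reconsider architecture"

# Example: a managed-platform stack scored against one team's requirements.
print(classify_stack({
    "latency": 4, "stt_accuracy": 4, "tts_quality": 4,
    "time_to_production": 5, "cost_at_10k_mins": 3,
    "observability": 3, "customization": 2,
}))  # -> 25/35: viable with optimizations
```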

Sources: Scoring thresholds based on Hamming's stack evaluation across 200+ voice agent deployments (2025). Latency targets aligned with conversational turn-taking research (Stivers et al., 2009). Cost benchmarks reflect publicly available pricing at time of publication.

The Big Architecture Decision

Before you even think about vendors, you have to decide: cascading or speech-to-speech? This is the choice that constrains everything else.

Here's the honest trade-off nobody wants to admit:

Cascading (STT → LLM → TTS) gives you visibility. When something breaks, you can actually see where. You get transcripts, you get LLM reasoning, you can debug. The downside? It's slow. You're looking at 2-4 seconds end-to-end in good conditions.

Speech-to-speech is fast - like, 500ms fast. Audio goes in, audio comes out. But good luck figuring out why the agent said something weird. There's no intermediate text to inspect. I've seen compliance teams kill S2S projects because they couldn't produce audit trails.

[Figure: cascading architecture vs. speech-to-speech models, showing data flow paths and latency trade-offs for voice agent systems]

How Cascading Actually Works

Most production voice agents run cascading. Here's the flow:

  1. Audio → Text (100-300ms): Something like Deepgram turns speech into text. This is where accents and background noise bite you.
  2. Text → Decisions (300-900ms): GPT-4 or whatever LLM you're using figures out what to say. This is where prompt engineering actually matters.
  3. Decisions → Audio (300-1200ms): ElevenLabs or Cartesia generates the voice. This is where "sounding human" either works or doesn't.

Total latency: 2-4 seconds when everything's working. But "when everything's working" is doing a lot of heavy lifting there. I've seen this stretch to 6+ seconds when one provider has a bad day.
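
To make the stage boundaries concrete, here's a rough sketch of a single cascading turn with per-stage timing. The transcribe, respond, and synthesize callables are stand-ins for whatever STT/LLM/TTS clients you actually wire in:

```python
# Sketch of one cascading turn with per-stage timing. transcribe/respond/
# synthesize are placeholders for your actual STT/LLM/TTS client calls.
import time

def handle_turn(audio_chunk, history, transcribe, respond, synthesize):
    timings = {}

    start = time.perf_counter()
    transcript = transcribe(audio_chunk)          # STT: typically 100-300 ms
    timings["stt_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    reply_text = respond(transcript, history)     # LLM: typically 300-900 ms
    timings["llm_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    reply_audio = synthesize(reply_text)          # TTS: typically 300-1200 ms
    timings["tts_ms"] = (time.perf_counter() - start) * 1000

    timings["total_ms"] = sum(timings.values())
    return reply_audio, timings
```

Logging those three numbers on every turn is what lets you tell a slow STT day from a slow LLM day instead of guessing.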

Speech-to-Speech: Fast but Opaque

S2S models like OpenAI's Realtime API are genuinely impressive. 500ms response times. The conversation feels natural. But here's what they don't mention in the marketing: when the agent says something wrong, you have almost no visibility into why.

I know of at least three companies who went all-in on S2S, got excited about the latency numbers, then had to rip it out because their compliance team asked "can you show me what the agent was thinking when it gave that wrong answer?" and the answer was no.

The Reality: Most Teams End Up Hybrid

Here's what I actually see working in practice: use S2S for the parts where speed matters and risk is low (greetings, simple acknowledgments), switch to cascading when you need auditability (transactions, compliance-sensitive stuff). LiveKit makes this mid-conversation switching possible. It's more work to set up but it's usually the right answer.
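
Here's a hedged sketch of the routing idea, assuming your orchestrator lets you hand a turn to either pipeline. LiveKit Agents supports this kind of switching, but the intent check and both handler objects below are placeholders, not its API:

```python
# Hybrid routing sketch: low-risk turns go to the S2S model, audit-sensitive
# turns go through the cascading path. Intent labels and handlers are
# hypothetical; adapt to your orchestrator's interfaces.
AUDIT_SENSITIVE_INTENTS = {"payment", "refund", "account_change", "medical_advice"}

def pick_pipeline(intent: str, s2s_handler, cascading_handler):
    if intent in AUDIT_SENSITIVE_INTENTS:
        # Cascading: slower, but leaves transcripts and reasoning to audit.
        return cascading_handler
    # S2S: fastest path for greetings, acknowledgments, small talk.
    return s2s_handler
```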

The Components (Where Things Actually Break)

Now let's get into the individual pieces. I'm going to give you comparison tables but please - please - don't just pick the one with the best numbers. Test with your actual use case.

STT: Where Most Failures Actually Start

I'd estimate 40% of the "agent problems" we see are actually STT problems. The transcript is wrong before the LLM even gets it. But teams blame the LLM because that's where the output comes from.

Test STT with your actual audio. Accents matter. Background noise matters. Domain vocabulary matters. A model that scores 93% on benchmarks might score 75% on your specific user base.

STT Provider Comparison
| Provider | Avg. Latency | Key Differentiator | Best For |
| --- | --- | --- | --- |
| Deepgram Nova-3 | Under 300ms | 94.74% accuracy for pre-recorded, 93.16% for real-time (6.84% WER). Industry-leading performance in noisy environments. | Multilingual real-time transcription, noisy environments, self-serve customization. |
| AssemblyAI Universal-1 | ~300ms (streaming) | 93.4% accuracy (6.6% WER) for English. Immutable streaming transcripts with reliable confidence scores. | Live captioning, real-time applications needing stable transcripts. |
| Whisper Large-v3 | ~300ms (streaming API) | 92% accuracy (8% WER) for English. Supports 99+ languages with <50% WER, strongest in ~10 major languages. | Multilingual applications, zero-shot transcription for diverse languages. |

Sources: Deepgram Nova-3 benchmarks from Deepgram Nova-3 announcement. AssemblyAI data from Universal-1 release. Whisper benchmarks from OpenAI Whisper paper (Radford et al., 2023). Latency figures from Hamming internal testing (2025).

Accuracy numbers change quarterly as providers push updates. We've seen 2-3% swings in WER after major model releases. The latency figures are more stable, but test with your actual audio - accents and background noise can swing accuracy by 10%+ from these benchmarks.
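
One practical way to run that test: keep a set of human-checked reference transcripts from real calls and compute WER per provider on the same audio. A sketch using the jiwer package, where provider_transcribe is a hypothetical wrapper around whichever STT API you're evaluating:

```python
# Sketch of an STT shootout on your own audio. jiwer computes word error rate;
# provider_transcribe is a hypothetical wrapper around an STT vendor's API.
import jiwer

def provider_wer(provider_transcribe, samples):
    """samples: list of (audio_bytes, reference_transcript) pairs from real calls."""
    references, hypotheses = [], []
    for audio, reference in samples:
        references.append(reference)
        hypotheses.append(provider_transcribe(audio))
    return jiwer.wer(references, hypotheses)  # 0.08 means 8% WER
```

Run every provider over the same 50+ utterances, with the accents, noise, and domain terms your users actually produce.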

Engineering Insight: Fine-tuning STT for domain-specific vocabulary (medical terms, product names, internal jargon) is a common and necessary step for production-grade accuracy.

LLMs: Not Actually the Hard Part

Hot take: for most voice agents, the LLM is the least of your problems. GPT-4, Gemini, Claude - they all work fine for the kinds of conversations voice agents typically handle. The real issues are almost always upstream (bad transcripts) or downstream (weird TTS artifacts).

That said, time-to-first-token matters a lot for conversational feel, and tool-calling reliability matters if you're doing actual work in the conversation.

LLM Comparison
| Model | Time-to-First-Token | Key Differentiator | Best For |
| --- | --- | --- | --- |
| Gemini 2.5 Flash | 300ms TTFT, 252.9 tokens/sec | Strict instruction adherence and reliable tool use. | Workflows demanding predictable, structured outputs. |
| GPT-4.1 | 400-600ms TTFT | Strong tool calling and reasoning capabilities. | The default choice for complex, multi-turn conversations. |

Sources: Gemini 2.5 Flash benchmarks from Google AI Studio. GPT-4.1 data from OpenAI API documentation. TTFT measurements from Hamming voice agent testing across production workloads (2025).

Key Consideration: Most voice agent conversations use less than 4K tokens. Don't pay for large context windows you won't use.
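
If you'd rather verify TTFT yourself than trust published figures, here's a minimal sketch using the OpenAI Python SDK's streaming interface; swap in your provider's streaming client as needed, and treat the numbers as snapshots rather than constants:

```python
# Minimal TTFT measurement sketch using streaming chat completions.
import time
from openai import OpenAI

client = OpenAI()

def measure_ttft_ms(model: str, prompt: str) -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return (time.perf_counter() - start) * 1000
    return float("nan")  # no content tokens streamed back
```

Measure over your real prompts at different times of day; provider latency drifts with load, so a single clean run proves very little.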

TTS: The Brand Problem Nobody Warned You About

TTS is weird because the "best" option isn't always the right option. The voice that sounds most natural to you might sound weird to your users. The voice that works for a consumer app might feel wrong for enterprise sales. This is surprisingly subjective and domain-dependent.

TTS Provider Comparison
| Provider | Latency | Key Differentiator |
| --- | --- | --- |
| ElevenLabs Flash v2.5 | ~75ms | Ultra-low latency with 32 language support. |
| ElevenLabs Turbo v2.5 | 250-300ms | Natural voice quality with balanced speed. |
| Cartesia Sonic | 90ms streaming (40ms model) | Very realistic voices at competitive pricing. |
| GPT-4o TTS | 220-320ms | Promptable emotional control and dynamic voice modulation. |
| Rime AI | Sub-200ms | Best price-to-performance ratio at scale ($40/million chars). |

Sources: ElevenLabs latency from ElevenLabs API documentation. Cartesia Sonic benchmarks from Cartesia product page. GPT-4o TTS from OpenAI TTS documentation. Rime AI pricing from official pricing page. Latency validated through Hamming internal testing (2025).

The voice quality numbers are subjective and domain-dependent. We had a healthcare client reject the "best" TTS option because it sounded too casual for appointment reminders. Budget $5-10K and two weeks for voice selection if brand consistency matters. It's not a decision you want to revisit after launch.

Key Consideration: Test TTS options with your target user demographic. What sounds natural to one group may feel robotic to another.

Orchestration: The Part That Will Ruin Your Month

This is where I see internal builds fail most often. People underestimate how hard it is to make conversations feel natural. When should the agent start talking? When should it stop? What if the user interrupts? What if there's background noise that sounds like speech?

LiveKit Agents handles most of this well enough that you can ship something quickly. Pipecat gives you more control but you'll spend a lot of time getting turn-taking right.

One thing nobody tells you: your users will complain about being "cut off" more than anything else. Track false-positive interruptions obsessively.
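
Here's a rough sketch of that metric, assuming your orchestrator logs each barge-in event along with the transcript (possibly empty) of the audio that triggered it. The event fields are hypothetical:

```python
# False-positive interruption rate: of all barge-ins, how many were triggered
# by audio that produced no meaningful transcript (i.e. probably noise)?
def false_interrupt_rate(interrupt_events):
    if not interrupt_events:
        return 0.0
    false_positives = sum(
        1 for event in interrupt_events
        if len(event.get("trigger_transcript", "").split()) < 2
    )
    return false_positives / len(interrupt_events)
```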

Platform Options: Build vs Buy

"Should we use Retell/Vapi or build custom on LiveKit?" - I get this question constantly. The honest answer is it depends on time, money, and how weird your requirements are.

Platform Comparison
| Platform | Time to First Call | Monthly Cost (10K mins) | Key Strength | Key Limitation |
| --- | --- | --- | --- | --- |
| Retell | 3 hours | $500-3,100 | Visual workflow builder for non-technical teams. | Least flexible; adds 50-100ms of latency overhead. |
| Vapi | 2 hours | $500-1,300 | Strong developer experience and API design. | Can become costly at high scale ($0.05-0.13/min total). |
| Custom (LiveKit/Pipecat) | 2+ weeks | ~$800 + eng. time | Complete control, lowest latency, up to 80% cost savings at scale. | Large up-front investment (2+ engineers for 3+ months). |

Sources: Platform pricing from Retell, Vapi, and LiveKit official pricing pages. Time-to-first-call estimates based on Hamming customer onboarding data across 50+ teams (2025). Custom build cost estimates assume senior engineer rates and typical infrastructure costs.

  • Choose a Platform (Retell/Vapi) if: You're a SaaS company adding voice features, validating a voice UX, or need to build and iterate without deep engineering dependency.
  • Choose a Custom Build (LiveKit) if: You have unique architectural requirements, expect high volume, and have a strong engineering team ready for a significant, long-term commitment.

What to Actually Monitor

I could give you a nice framework here but honestly? Most teams overthink monitoring and end up tracking things that don't matter while missing things that do. Here's what I'd actually focus on:

Layer 1: Infrastructure Health

  • What to Test: Time-to-first-word (<500ms), audio quality scores (no artifacts in 99.9% of calls), and SIP trunk reliability (connection stability, packet loss).
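
For the latency piece of Layer 1, track percentiles rather than averages; a clean mean hides the P95 tail your angriest users actually experience. A small sketch:

```python
# Time-to-first-word percentiles across monitored calls. Alert on P95/P99
# crossing your 500 ms target, not on the mean.
import statistics

def ttfw_percentiles(ttfw_ms):
    cuts = statistics.quantiles(ttfw_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```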

Layer 2: Agent Execution Accuracy

  • What to Test: Prompt compliance, entity extraction accuracy (names, dates, addresses), and context retention across multiple turns. Most agents score 95% on happy paths but drop to 60% with background noise or accents, which is why you need systematic regression testing at this layer - what one of our customers called a "powerhouse regression" system.

Layer 3: User Satisfaction Signals

  • What to Test: Frustration markers (tracking words like "What?" or "I already said..."), conversation efficiency (turns vs. optimal path), and emotional trajectory. Many technically "successful" calls contain user frustration.
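
A simple starting point for the frustration markers, assuming you have per-turn user transcripts. The phrase list is illustrative and should be tuned on your own calls:

```python
# Flag user turns that contain common frustration markers.
import re

FRUSTRATION_PATTERNS = [
    r"\bwhat\?", r"\bi already said\b", r"\bthat's not what i\b",
    r"\bspeak to a (human|person|representative)\b", r"\bthis is ridiculous\b",
]

def frustration_rate(user_turns):
    if not user_turns:
        return 0.0
    flagged = sum(
        1 for turn in user_turns
        if any(re.search(p, turn.lower()) for p in FRUSTRATION_PATTERNS)
    )
    return flagged / len(user_turns)
```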

Layer 4: Business Outcome Achievement

  • What to Test: End-to-end validation (did the order get created, appointment scheduled, or patient record updated correctly?), business logic verification (was the right product offered?), and revenue impact.
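
To make Layer 4 concrete: the check should hit the system of record, not the transcript. A sketch for an appointment-booking flow, where fetch_appointment is a hypothetical call into your scheduling system:

```python
# Did the appointment the agent says it booked actually get created correctly?
def verify_booking(call_summary: dict, fetch_appointment) -> bool:
    booked = fetch_appointment(call_summary["appointment_id"])
    if booked is None:
        return False  # agent claimed success, but nothing exists downstream
    return (
        booked["patient_id"] == call_summary["patient_id"]
        and booked["slot_start"] == call_summary["promised_slot_start"]
    )
```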

Your 30-Day Implementation Plan

A four-week sprint from architectural decisions to validated stack.

Honest caveat: this 30-day timeline assumes you have dedicated engineering time and clear requirements. Most teams we work with take 6-8 weeks because requirements shift, stakeholders have opinions about voice selection, and integration testing always takes longer than planned. Build in buffer.

[Figure: four-week implementation timeline, from defining constraints to final validation]

Week 1: Define Your Non-Negotiables

  • Document Constraints: Budget (per-minute cost at scale), latency targets, and deal-breakers like HIPAA compliance.
  • Map Critical User Journeys: List the top 5 conversation flows and define success/failure for each.

Week 2: Hands-On Platform Testing

  • Build Your Core Flow: Implement the same primary user journey on Retell and Vapi to compare developer experience and flexibility. If considering a custom build, scope a LiveKit demo.
  • Evaluate Platform Maturity (Red Flags):
    • Lack of a Self-Serve Trial: Platforms should let their product speak for itself. Gated demos can hide limitations.
    • Poor or Outdated Documentation: This indicates the quality of support and developer experience to expect.
    • Inactive Developer Community: A vibrant community is a resource for troubleshooting and identifying common issues.

Week 3: Component Deep Dive

  • STT Shootout: Use 50+ recorded utterances from real users (with accents, background noise, and domain terms) to test each provider's accuracy and latency under your specific conditions.
  • LLM & TTS Gauntlet: Create 20 complex, multi-turn conversation scripts to test instruction following. Blind test TTS options with target users and ask which sounds more "trustworthy."

Week 4: Make Your Decision

  • Architecture Choice (a quick sketch of this decision tree follows the list):
    • Need production in < 1 month? → Platform (Retell/Vapi).
    • Processing < 10K minutes/month? → Platform.
    • Unique requirements + engineering resources? → Custom.
    • Compliance is critical? → A provider like Bland AI or a custom build with a clear audit trail.
  • Final Validation: Build a proof-of-concept of your most complex use case with the chosen stack.
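
Here's the architecture-choice branch restated as a sketch, taking the Week 1 constraints as inputs. The branch order is one reasonable reading of the list above, not a strict algorithm:

```python
# The Week 4 architecture-choice branches, roughly encoded.
def recommend_stack(weeks_to_production, monthly_minutes,
                    has_voice_eng_team, compliance_critical):
    if compliance_critical:
        return "compliance-focused provider or custom build with a clear audit trail"
    if weeks_to_production <= 4 or monthly_minutes < 10_000:
        return "managed platform (Retell/Vapi)"
    if has_voice_eng_team:
        return "custom build (LiveKit/Pipecat)"
    return "managed platform (Retell/Vapi)"
```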

Wrapping Up

Look, I've given you a lot of information here. Probably too much. If you remember one thing: the stack you pick matters less than how well you understand its failure modes.

I've seen teams with "worse" stacks outperform teams with "better" stacks because they had better monitoring and responded to problems faster. The voice AI space is still immature enough that everything breaks sometimes. Your job is to know when it breaks and fix it before users notice.

The teams that succeed aren't the ones who picked the perfect architecture on day one. They're the ones who shipped something reasonable, watched it like hawks, and iterated when things went wrong. That's it. That's the whole secret.

If you're still reading and you're not sure where to start: pick a managed platform, get something in production, see what breaks, and make decisions from there. You'll learn more from two weeks of real traffic than from another month of architecture diagrams.

Frequently Asked Questions

What is a voice agent stack?

A voice agent stack is the complete system that turns a phone call into a reliable outcome: telephony (SIP trunking, WebRTC), ASR/STT for speech-to-text (Deepgram, AssemblyAI), orchestration layer for conversation management (LiveKit, Pipecat), LLM for reasoning and responses (GPT-4, Claude), tools/APIs for external integrations (CRM, knowledge base), TTS for text-to-speech (ElevenLabs, Cartesia), and the QA/observability layer that validates performance after every change. Each component choice affects latency, accuracy, and cost.

How should you evaluate voice agent vendors?

Evaluate based on reliability and debuggability, not demos. Key criteria: latency percentiles P50/P95/P99 (not averages), outage history and published SLAs, multilingual and accent performance with real-world audio, SDK/tooling maturity, and how quickly you can detect regressions. Always test on your actual call audio before deciding. Score each vendor 1-5 across: latency (<1.5s excellent), STT accuracy (<8% WER excellent), TTS quality, time-to-production (<1 month excellent), cost at 10K mins (<$1,000 excellent), observability, and customization. Total 28-35 = production-ready.

Where does Hamming fit in the voice agent stack?

Hamming sits in the reliability layer: it runs end-to-end call simulations and regression tests across your full stack (STT, LLM, TTS, integrations) and monitors production behavior to catch drift from prompt changes, model updates, and upstream vendor issues. The goal is to make voice reliability measurable and repeatable regardless of which ASR/TTS/LLM providers you choose. Key capabilities: 1,000+ concurrent synthetic calls, latency percentile tracking, WER monitoring, regression blocking in CI/CD, and production call scoring.

What are the most common pitfalls when building a voice agent stack?

Common pitfalls: (1) Missing ownership boundaries—nobody owns the full call path end-to-end; (2) Relying on call-level averages instead of turn-level latency percentiles; (3) Not planning for upstream drift when ASR/TTS/LLM providers update models; (4) Underestimating real-world audio conditions—accents, noise, interruptions, carrier variation break flows that looked perfect in demos; (5) No regression testing after prompt changes; (6) Choosing components based on benchmarks without testing your specific audio conditions and vocabulary.

When should you choose cascading vs. speech-to-speech?

Choose cascading (STT → LLM → TTS) when you need auditability with intermediate transcripts, compliance controls at the text layer, fine-grained debugging per component, and tool/API calls based on parsed text. Typical latency: 2-4 seconds. Choose speech-to-speech (S2S) when latency is the top priority (<500ms) and you can trade visibility into intermediate steps. If you need audits, cascading is still safer. Hybrid approaches use S2S for greetings, cascading for complex transactions.

Should you build a custom stack or buy a managed platform?

Buy (Retell, Vapi) when: you need fast launch (<1 month), handle <10K minutes/month, lack dedicated voice engineering resources, or are validating voice UX before investing. Build (LiveKit, Pipecat) when: you exceed 10-50K minutes/month (80% cost savings possible), need deep integrations with existing systems, require stricter compliance (HIPAA, SOC2), demand <500ms latency, or need full observability and audit trails. Most teams start with platforms, migrate to custom after proving value.

Retell or Vapi: which managed platform is better?

Choose Retell for: non-technical teams, visual workflow builders, fastest time-to-first-call (3 hours), and when flexibility is less important than speed. Retell adds 50-100ms latency overhead. Choose Vapi for: developer-first teams, strong API/SDK experience, 2 hours to first call, and better customization via code. Vapi costs $0.05-0.13/min total. Both are managed platforms suitable for <10K mins/month or rapid prototyping. For >50K mins/month, consider custom builds with LiveKit for 80% cost reduction.

What are the best STT providers for voice agents?

Top STT providers for voice agents: Deepgram Nova-3 leads real-time with 6.84% WER, <300ms latency, 36 languages, and excellent noise robustness—best for production voice agents. AssemblyAI Universal-1 offers 6.6% WER with stable streaming and immutable transcripts (17 languages)—best for applications needing reliable confidence scores. Whisper Large-v3 provides 99+ languages at ~8% WER—best for maximum language coverage. Google Speech-to-Text covers 125+ languages. Choice depends on language needs, latency requirements, and noise tolerance.

What are the best TTS providers for voice agents?

Top TTS providers for voice agents: ElevenLabs Flash v2.5 leads latency at 75ms with 32 language support—best for ultra-low latency needs. Cartesia Sonic offers 90ms streaming (40ms model) with realistic voices at competitive pricing—best value. ElevenLabs Turbo v2.5 provides 250-300ms latency with natural voice quality—balanced choice. GPT-4o TTS (220-320ms) offers promptable emotional control—best for dynamic voice modulation. Rime AI ($40/million chars) provides best price-to-performance at scale.

Which LLM is best for voice agents?

For voice agents, prioritize time-to-first-token (TTFT) and tool-calling reliability over benchmark scores. Top choices: Gemini 2.5 Flash (300ms TTFT, 252.9 tokens/sec)—best for strict instruction adherence and reliable tool use in structured workflows. GPT-4.1 (400-600ms TTFT)—default choice for complex multi-turn conversations with strong reasoning. Most voice conversations use <4K tokens—don't pay for large context windows. Test tool-calling reliability specifically; benchmark performance doesn't predict production behavior.

How do you test and monitor a voice agent in production?

The 4-Layer Quality Framework ensures production reliability: Layer 1 (Infrastructure Health)—test TTFW <500ms, audio quality 99.9% artifact-free, SIP reliability; Layer 2 (Agent Execution)—test prompt compliance, entity extraction accuracy, context retention across turns; Layer 3 (User Satisfaction)—track frustration markers ('What?', 'I already said...'), conversation efficiency vs optimal path, emotional trajectory; Layer 4 (Business Outcomes)—validate end-to-end task completion (order created, appointment scheduled), business logic correctness, revenue impact.

How much does a voice agent stack cost?

Voice agent stack costs at 10K minutes/month: Managed platforms—Retell $500-3,100/month (visual builder, fastest launch), Vapi $500-1,300/month (developer-friendly). Custom build—~$800/month infrastructure + engineering time (2+ engineers for 3+ months initially). Cost breakdown: STT $0.01-0.05/min, LLM $0.02-0.10/min (varies by model), TTS $0.02-0.08/min, telephony $0.01-0.03/min. At >50K mins/month, custom builds achieve 80% savings vs managed platforms. Include QA/monitoring costs in TCO calculations.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on the dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”