How to Optimize Latency in Voice Agents
Skip this if you're running a demo agent with a handful of test calls. Average latency and basic logging work fine at that scale. But once you're past 100 calls per day, or operating in situations where users hang up after a few seconds of silence, you'll need what's in here.
The telltale sign: Go listen to some call recordings. If you hear "Hello? Are you there?" more than once, you have a latency problem. We've seen agents with "great" 400ms averages get this constantly because of tail latency spikes.
Latency is the silent killer in voice AI. Users don't wait, they just hang up. And unlike text chat, there's no visual "typing..." indicator to buy you time.
The research is clear on this: anything over 800ms feels sluggish, and beyond 1.5 seconds, users start mentally checking out of the conversation. Yet despite latency's importance, it remains widely misunderstood by teams building voice applications.
Here's what we found after analyzing 1M+ production calls: many teams track latency averages and think they're covered. They're not. A 300ms average can mask the fact that 10% of users experience 1500ms delays. We call this the "average latency trap," and it's one of the most common failure modes we see.
To optimize latency in a voice agent, you need to track the right latency metrics with a voice observability tool that breaks latency down by pipeline stage.
Why Average Latency Metrics Fail
I used to think averages were good enough. We had a customer showing 400ms average latency, dashboard looked great, everyone felt good about it. Then we dug into why 15% of their calls were ending with users hanging up. Turns out the tail latency was brutal - some calls were hitting 2+ seconds.
The average hid everything. A 300ms average can mean "everyone gets 300ms" or it can mean "90% get 150ms and 10% wait 1.5 seconds." These are completely different situations that look identical on a dashboard.
The other thing that trips teams up: latency compounds through the pipeline. STT takes a bit longer. LLM queues for a moment. TTS adds a few hundred milliseconds. Each step looks fine in isolation. Then you realize they're all adding together and your users are sitting in silence for 2 seconds wondering if the line went dead. Average latency per component doesn't tell you this - you need end-to-end percentiles.
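Here's a tiny, self-contained way to see both problems at once (the numbers are synthetic, not from any real deployment): each stage's occasional spike looks harmless on its own, but the end-to-end percentiles tell a very different story than the average.

```python
import random
import statistics

random.seed(7)

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    return ordered[max(0, int(round(p / 100 * len(ordered))) - 1)]

def stage_ms(base_ms, spike_ms, spike_rate):
    """A stage that is usually fast but occasionally spikes."""
    return spike_ms if random.random() < spike_rate else base_ms

turns = []
for _ in range(1000):
    stt = stage_ms(120, 600, 0.05)   # STT occasionally falls back to a slower path
    llm = stage_ms(150, 900, 0.05)   # LLM occasionally sits in a queue
    tts = stage_ms(90, 400, 0.05)    # TTS occasionally cold-starts
    turns.append(stt + llm + tts)

print("average:", round(statistics.mean(turns)), "ms")   # looks fine
print("p50:", percentile(turns, 50), "ms")
print("p90:", percentile(turns, 90), "ms")               # noticeably worse
print("p99:", percentile(turns, 99), "ms")               # the calls users complain about
```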
Understanding P50, P90, and P99
Think of percentiles as a way to understand what different users actually experience, not what the "average user" experiences (who doesn't exist).
P50 is your median turn: what a typical exchange feels like. When p50 exceeds ~200ms, it often points to deeper architectural inefficiencies. If even the typical case is slow, something fundamental needs addressing.
P90 shows the delay that 1 in 10 turns exceeds. Here's a way to make this concrete: in a 10-minute conversation with 20 turns, a user hits p90-level latency about twice. Those delays are noticeable. They're the moments where users start wondering if something is wrong.
P99 exposes the extreme tail. This is where complaints come from. Spikes here often reveal infrastructure limits: cold starts, network congestion, model queue depths, memory pressure. We've seen teams with "great" p50 metrics discover that their p99 was causing 5% of users to abandon calls entirely.
| Percentile | What it represents | How to use it |
|---|---|---|
| P50 | Typical (median) turn | Reveals baseline architecture limits |
| P90 | 1 in 10 turns | Detects noticeable UX degradation |
| P99 | Worst-case tail | Finds outage-level slowdowns |
Building Production Monitoring Systems
Effective latency management requires continuous observability and measurement infrastructure.
Essential metrics by call and component:
- Time to first word: call connection to agent speech
- Turn-level latency: each exchange, not call averages
- Component timing: STT duration, LLM processing, TTS generation
- Queue depths: predictive indicators of future delays
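Here's a minimal per-turn instrumentation sketch covering the component-timing and turn-level items; `transcribe`, `generate_reply`, and `synthesize` are stubs standing in for your real STT, LLM, and TTS calls:

```python
import time
from contextlib import contextmanager

# Stubs that simulate real STT / LLM / TTS providers for this sketch.
def transcribe(audio):
    time.sleep(0.12)
    return "where is my order"

def generate_reply(text):
    time.sleep(0.18)
    return "It shipped yesterday."

def synthesize(reply):
    time.sleep(0.09)
    return b"<audio>"

@contextmanager
def timed(stage, timings):
    """Record the wall-clock duration of one pipeline stage, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

def handle_turn(audio_chunk):
    timings = {}
    with timed("stt", timings):
        text = transcribe(audio_chunk)
    with timed("llm", timings):
        reply = generate_reply(text)
    with timed("tts", timings):
        audio = synthesize(reply)
    timings["end_to_end"] = sum(timings[k] for k in ("stt", "llm", "tts"))
    return audio, timings   # export timings per turn, not per call

print(handle_turn(b"raw audio")[1])
```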
Percentile tracking configuration:
- Bucketed histograms replace simple averages
- Separate metrics by component and operation type
- Tag with context: device type, audio quality, model version, geographic region
Histogram buckets:
- 50ms buckets: 0-500ms (conversation threshold)
- 100ms buckets: 500-1000ms (degradation zone)
- 250ms buckets: 1000-3000ms (abandonment risk)
- Single bucket: >3000ms (complete failure)
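If you're on a Prometheus-style metrics stack, the bucket scheme above maps onto a labeled histogram roughly like this; the metric name, label names, and label values are illustrative, not a prescribed schema:

```python
from prometheus_client import Histogram

# Bucket boundaries in seconds, mirroring the scheme above: 50 ms steps to
# 500 ms, 100 ms steps to 1 s, 250 ms steps to 3 s. Prometheus adds an
# implicit +Inf bucket that catches everything over 3 s.
BUCKETS = (
    [i / 1000 for i in range(50, 501, 50)]
    + [i / 1000 for i in range(600, 1001, 100)]
    + [i / 1000 for i in range(1250, 3001, 250)]
)

TURN_LATENCY = Histogram(
    "voice_turn_latency_seconds",
    "Per-stage and end-to-end voice agent turn latency",
    labelnames=["component", "region", "device_type", "model_version"],
    buckets=BUCKETS,
)

# One observation per stage per turn, tagged with context.
TURN_LATENCY.labels(
    component="stt", region="us-west", device_type="mobile", model_version="2025-01"
).observe(0.230)
```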
Alert thresholds based on user impact:
- p90 > 500ms: noticeable degradation
- p99 > 1000ms: user frustration
- Any percentile > 2000ms: abandonment risk
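These thresholds usually live as alerting rules in whatever monitoring system you run; here's the same logic as a plain-Python sketch, with the percentile values assumed to come from your metrics backend:

```python
def latency_alerts(percentiles_ms):
    """Map computed percentiles, e.g. {"p50": 310, "p90": 520, "p99": 1400},
    to the alert conditions above. All values are in milliseconds."""
    alerts = []
    if percentiles_ms.get("p90", 0) > 500:
        alerts.append("p90 > 500 ms: noticeable degradation")
    if percentiles_ms.get("p99", 0) > 1000:
        alerts.append("p99 > 1000 ms: user frustration")
    if any(value > 2000 for value in percentiles_ms.values()):
        alerts.append("percentile > 2000 ms: abandonment risk")
    return alerts

print(latency_alerts({"p50": 310, "p90": 520, "p99": 1400}))
```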
Hamming's Latency Optimization Cycle
Monitoring identifies the issues. But monitoring alone doesn't fix anything.
Based on our analysis of 1M+ production voice agent calls across 50+ deployments, we've developed this cycle:
- Monitor percentiles across all stages
- Identify the component causing spikes
- Apply targeted fixes to that component
- Verify improvement in production
- Watch for regression or new bottlenecks
- Return to monitoring
The order matters. Prioritize by user impact, not technical severity. Remember the p99 issue we mentioned earlier? P99 spikes affecting greetings matter more than p50 delays in rarely-used functions. The greeting is when users form their first impression.
A/B testing validates provider decisions:
- Route 10% of traffic through new STT providers
- Compare p90 and p99 latencies, not averages
- Calculate cost per millisecond saved
- Make data-driven switches
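A rough sketch of the split and the comparison: the crc32 routing keeps the assignment stable per call, and the per-call price premium is entirely hypothetical.

```python
import random
import zlib

def percentile(samples, p):
    ordered = sorted(samples)
    return ordered[max(0, int(round(p / 100 * len(ordered))) - 1)]

def pick_stt_provider(call_id: str, canary_share: float = 0.10) -> str:
    """Deterministically route ~10% of calls to the candidate STT provider."""
    return "candidate" if zlib.crc32(call_id.encode()) % 100 < canary_share * 100 else "primary"

def compare(primary_ms, candidate_ms, extra_cost_per_call_usd):
    """Compare tail latency between providers, then put a price on the win.
    extra_cost_per_call_usd is the candidate's price premium (hypothetical)."""
    report = {
        f"p{p}_delta_ms": round(percentile(primary_ms, p) - percentile(candidate_ms, p))
        for p in (90, 99)
    }
    saved = report["p99_delta_ms"]
    report["usd_per_ms_saved"] = round(extra_cost_per_call_usd / saved, 5) if saved > 0 else None
    return report

random.seed(1)
primary = [random.gauss(320, 60) for _ in range(2000)]    # synthetic turn latencies
candidate = [random.gauss(290, 35) for _ in range(2000)]
print(pick_stt_provider("call-8421"), compare(primary, candidate, extra_cost_per_call_usd=0.002))
```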
Regression detection prevents backsliding. Every model update, deployment, or configuration change should trigger an automatic latency comparison. A 50ms p99 increase in a single stage compounds across the whole pipeline.
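The corresponding deploy gate can be this small (a sketch, assuming you can pull baseline and candidate turn latencies from your metrics store):

```python
def p99_regression(baseline_ms, candidate_ms, budget_ms=50):
    """Return (regressed, delta_ms): did the candidate's p99 exceed the
    baseline's by more than the budget? Inputs are lists of end-to-end
    turn latencies in milliseconds."""
    def p99(samples):
        ordered = sorted(samples)
        return ordered[max(0, int(round(0.99 * len(ordered))) - 1)]
    delta = p99(candidate_ms) - p99(baseline_ms)
    return delta > budget_ms, delta
```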
Hamming's approach monitors three dimensions: what is said, how it's said, and what's done. Comprehensive monitoring ensures latency improvements maintain accuracy and user experience.
Engineering Sub-Second Response Times
Consistent sub-second response requires architectural optimization beyond basic improvements.
Regional deployment removes baseline network latency. Hosting in US-East for US-West users adds 60-80ms of round trip, multiplied by every service hop. Geographic distribution is mandatory for production voice agents.
Prewarmed instances prevent cold starts. First calls shouldn't wait 2-3 seconds for model loading. Maintain warm instances with synthetic health checks.
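A keep-warm loop can be this small (a sketch assuming `httpx`; the endpoint and interval are placeholders you'd tune to your provider's idle timeout):

```python
import asyncio
import httpx

WARMUP_URL = "https://tts.example.internal/health"   # placeholder endpoint
WARMUP_INTERVAL_S = 30                               # tune to the provider's idle timeout

async def keep_warm():
    """Send a lightweight synthetic request on a timer so the first real
    call never pays the cold-start penalty."""
    async with httpx.AsyncClient(timeout=5.0) as client:
        while True:
            try:
                await client.get(WARMUP_URL)
            except httpx.HTTPError:
                pass  # a failed warmup ping is not fatal; the next one retries
            await asyncio.sleep(WARMUP_INTERVAL_S)

# asyncio.run(keep_warm())  # run alongside your agent process
```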
Connection pooling saves 20-50ms per request. Persistent connection pools skip the TCP and TLS handshakes that otherwise precede every HTTPS request, and that setup cost repeats at every service hop.
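With `httpx`, for example, one long-lived client per process does the job; the URL is a placeholder:

```python
import httpx

# One long-lived client per process: connections (and TLS sessions) are
# pooled and reused instead of being re-established on every request.
client = httpx.Client(
    timeout=httpx.Timeout(5.0, connect=1.0),
    limits=httpx.Limits(max_keepalive_connections=20, max_connections=50),
)

def call_tts(text: str) -> bytes:
    # Placeholder URL; the point is that this call rides an already-open
    # connection from the pool instead of paying TCP/TLS setup again.
    response = client.post("https://tts.example.internal/synthesize", json={"text": text})
    response.raise_for_status()
    return response.content
```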
Protocol selection impacts resilience. WebRTC beats WebSockets by roughly 50% in lossy networks: it runs over UDP, so it can drop or conceal late packets instead of stalling on TCP retransmissions, and quality holds up during congestion.
Pipeline parallelization breaks sequential bottlenecks:
- Start TTS rendering before the LLM finishes - on completed sentences, or when the ending is predictable (see the sketch after this list)
- Prepare common responses while processing unique requests
- Launch speculative execution for frequent paths
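Here's a minimal asyncio sketch of the sentence-level version; `stream_llm_sentences` and `synthesize` are simulated stand-ins for your streaming LLM and TTS clients:

```python
import asyncio

async def stream_llm_sentences(prompt):
    """Stand-in for a streaming LLM client that yields sentences as they finish."""
    for sentence in ("Sure, I can help with that.", "Your order shipped yesterday."):
        await asyncio.sleep(0.3)   # simulated generation time
        yield sentence

async def synthesize(sentence):
    """Stand-in for a TTS call; returns audio bytes for one sentence."""
    await asyncio.sleep(0.2)       # simulated synthesis time
    return f"<audio:{sentence}>".encode()

async def respond(prompt):
    # Kick off TTS for each sentence as soon as the LLM finishes it,
    # instead of waiting for the full reply before rendering any audio.
    tts_tasks = []
    async for sentence in stream_llm_sentences(prompt):
        tts_tasks.append(asyncio.create_task(synthesize(sentence)))
    for task in tts_tasks:          # play back in order as clips complete
        print("play", await task)

asyncio.run(respond("where is my order?"))
```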
Intelligent caching extends beyond static responses:
- Cache partial LLM outputs for similar queries
- Store preprocessed audio for common phrases
- Remember user preferences to skip discovery delays
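For the common-phrase case, the cache can be as simple as this sketch (in-memory here; in production you'd likely back it with Redis or disk, and `synthesize` is your real TTS call):

```python
import hashlib

# In-memory audio cache keyed by normalized phrase.
_audio_cache: dict[str, bytes] = {}

def normalize(phrase: str) -> str:
    return " ".join(phrase.lower().split())

def cached_tts(phrase: str, synthesize) -> bytes:
    """Return cached audio for common phrases (greetings, disclaimers,
    "One moment please...") and only hit the TTS provider on a miss."""
    key = hashlib.sha256(normalize(phrase).encode()).hexdigest()
    if key not in _audio_cache:
        _audio_cache[key] = synthesize(phrase)
    return _audio_cache[key]
```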
Provider redundancy ensures consistency:
- Route to backup providers when primary latency spikes
- Maintain hot standbys for critical components
- Switch automatically based on real-time p99 monitoring
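A sketch of what latency-based failover can look like; the window size and p99 budget are illustrative defaults, not recommendations:

```python
from collections import deque

class ProviderRouter:
    """Route to the primary provider unless its rolling p99 breaches the
    budget, then fail over to the hot standby."""

    def __init__(self, window=200, p99_budget_ms=1000):
        self.samples = {"primary": deque(maxlen=window), "standby": deque(maxlen=window)}
        self.p99_budget_ms = p99_budget_ms

    def record(self, provider, latency_ms):
        # Feed this from the same per-turn timings you already export.
        self.samples[provider].append(latency_ms)

    def _rolling_p99(self, provider):
        data = sorted(self.samples[provider])
        if len(data) < 20:   # not enough signal yet; don't flap
            return 0
        return data[max(0, int(round(0.99 * len(data))) - 1)]

    def choose(self):
        # Switch automatically when the primary's recent tail gets bad.
        return "standby" if self._rolling_p99("primary") > self.p99_budget_ms else "primary"
```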
Managing Edge Cases
Edge cases are where p99 goes to die. You can optimize your main path beautifully, then hit an edge case and watch latency spike to 3 seconds.
The edge cases that bite us most often:
- Background noise so bad that the STT provider switches to expensive "enhanced" processing
- Accents the primary model struggles with, triggering slower fallback transcription
- Cross-talk where two people are speaking and the system tries to separate them
- Elderly users with loud TV audio bleeding into the microphone
Here's something counterintuitive we learned: fast-but-imperfect often beats slow-but-accurate. A quick "I didn't catch that, could you say that again?" at 200ms is better than a perfect transcription at 2 seconds. Users forgive "could you repeat that?" They don't forgive dead air.
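In code, that trade-off is just a latency budget around STT; `slow_stt` below is a simulated stand-in and the 800ms budget is illustrative:

```python
import asyncio

async def slow_stt(audio):
    """Stand-in for an STT call that occasionally takes far too long."""
    await asyncio.sleep(2.0)
    return "perfect transcript"

async def listen(audio, budget_s=0.8):
    # If STT blows the latency budget, bail out with a quick reprompt
    # instead of leaving the caller in dead air.
    try:
        text = await asyncio.wait_for(slow_stt(audio), timeout=budget_s)
        return {"action": "respond", "text": text}
    except asyncio.TimeoutError:
        return {"action": "reprompt", "text": "I didn't catch that, could you say that again?"}

print(asyncio.run(listen(b"noisy audio")))
```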
Test with weird audio. Accents your main demo voice doesn't have. Background noise you wouldn't normally encounter. We started monitoring p99 separately for these edge cases and found issues that would have taken months to surface through aggregate metrics.
WebRTC achieves sub-100ms latency in perfect conditions. Production users don't provide perfect conditions.
Hamming's Continuous Latency Optimization Framework
Latency optimization requires ongoing monitoring. Every change affects percentiles. We've formalized this into Hamming's Continuous Latency Optimization Framework:
- Baseline: Establish current p50/p90/p99 across all stages
- Identify: Find components causing p99 spikes
- Optimize: Apply targeted fixes
- Verify: Confirm percentile improvements
- Monitor: Watch for regression
- Iterate: Return to identification
Warning signs requiring immediate action:
- User complaints despite "good" averages—check p99
- Performance gaps between testing and production—monitor real calls
- Model updates causing slowdowns—compare percentile trends
- Intermittent slowness—analyze percentile distributions
Flaws but Not Dealbreakers
Before you set up comprehensive latency monitoring, know what you're getting into:
Histograms cost more than averages. Full percentile tracking means more storage and compute. If you're under 1,000 calls/month, it might not be worth it. But at minimum, track P95 - one tail metric is cheap and catches what averages miss.
Setting up baselines takes time. Plan for a couple hours to get your first monitoring setup working properly. It's worth it, but it's not "flip a switch and you're done."
Too many charts is a real problem. We've watched teams set up latency dashboards with 47 panels and then never look at them because it's overwhelming. Start with three metrics: TTFW, end-to-end P95, and whatever stage is slowest. Add more when you need them, not before.
What's Next
In follow-up posts, we'll walk through how to run a latency audit and cover advanced optimization strategies for managing regressions and edge cases in production.
Hamming provides stage-by-stage latency breakdown purpose-built for voice agents. Track percentiles at every pipeline stage. Compare performance across provider changes. Catch regressions before users notice. Transform production issues into test cases.
Voice agents have milliseconds to maintain natural conversation. Average metrics fail when p99 breaks that expectation. Start with a latency audit. Implement percentile monitoring today.
Ready to see where latency lives in your voice agent pipeline? Explore how Hamming measures and monitors performance bottlenecks.

