TL;DR
The 300ms Rule: Research shows human conversation operates on a 200-300ms response window that is hardwired across languages and cultures. Delays beyond this threshold disrupt conversational flow; delays past roughly 1.5 seconds trigger measurable stress responses.
Real-World Voice AI Latency:
Based on analysis of 2M+ voice agent calls in production:
| Percentile | Response Time | User Experience |
|---|---|---|
| P50 (median) | 1.4-1.7s | Noticeable delay, but functional |
| P90 | 3.3-3.8s | Significant delay, user frustration |
| P95 | 4.3-5.4s | Severe delay, many interruptions |
| P99 | 8.4-15.3s | Complete breakdown |
Key Reality Check:
- Industry median: 1.4-1.7 seconds - 5x slower than the 300ms human expectation
- 10% of calls exceed 3-5 seconds - causing severe user frustration
- 1% of calls exceed 8-15 seconds - complete conversation breakdown
Key Insight: Users never complain about "latency"—they report agents that "feel slow," "keep getting interrupted," or "don't understand when I'm done talking." This disconnect makes latency issues hard to diagnose without proper testing. Research shows 68% of customers abandon calls when systems feel sluggish.
Introduction
"What latency should we be targeting?" is one of the first questions engineering teams ask when building voice AI agents. The answer isn't just a number—it's the difference between a natural conversation and a frustrating experience that users abandon.
Latency is often the hidden culprit behind "bad conversations." Users might not articulate it as a latency problem, but when your agent feels unresponsive, gets interrupted constantly, or creates awkward pauses, you're dealing with a latency issue.
This guide provides concrete benchmarks, measurement techniques, and optimization strategies based on real-world voice AI deployments. We'll break down exactly what causes latency, how to measure it accurately, and most importantly, how to fix it.
What is good latency for voice AI?
The Science Behind the 300ms Rule
Neurological Foundation: Research in conversational psychology reveals that the average gap between speakers in natural dialogue is approximately 200 milliseconds—about the time it takes to blink. This timing is hardwired into human communication across all languages and cultures, refined over evolutionary timescales.
Psychological Impact:
- Less than 300ms: Perceived as instantaneous, maintains natural conversation flow
- 300-400ms: Beginning of awkwardness detection
- Over 500ms: Users wonder if they were heard
- Over 1000ms: Assumption of connection failure or system breakdown
- Over 1500ms: Neurological stress response triggered (amygdala activation)
Understanding Production Latencies
What these real-world latencies mean for users:
| Latency Range | What Actually Happens | User Impact | Business Reality |
|---|---|---|---|
| Under 1s | Theoretical ideal | Natural conversation | Rarely achieved in production |
| 1.4-1.7s | Industry standard (median) | Users notice slowness, some interruptions | Where 50% of voice AI operates today |
| 3-5s | Common experience (P90-P95) | Frequent talk-overs, user frustration | 10-20% of all interactions |
| 8-15s | Worst-case (P99) | Complete breakdown, immediate hangup | 1% failure rate = thousands of bad experiences daily |
The harsh truth: While humans expect 300ms responses, production voice AI delivers 1,400-1,700ms at median—explaining why users consistently report agents that "feel slow" or "don't understand when I'm done talking."
Real-World Implications
Under 300ms: At this speed, your agent feels magical. Users can't distinguish it from talking to a highly responsive human. This requires significant infrastructure investment but delivers exceptional user satisfaction.
300-800ms: This is the sweet spot to target for production deployments. Users maintain natural conversation flow without adjusting their speaking patterns. Interruptions are rare, and the experience feels smooth.
800-1200ms: Users start to notice the delay but adapt unconsciously. They might pause slightly longer between utterances or speak more deliberately. Still functional for many use cases but requires careful turn detection tuning.
Above 1500ms: The conversation breaks down. Users consistently talk over the agent, repeat themselves, or abandon the call. Even with perfect accuracy, the experience feels broken.
Source: Analysis of 2M+ voice agent calls in production
What does latency actually mean in voice AI?
End-to-End vs Component Latency
Voice AI latency isn't a single metric—it's a chain of sequential operations:
User stops speaking → STT processes → LLM generates → TTS synthesizes → Audio plays
End-to-End Latency: The total time from when a user finishes speaking until they hear the agent's response. This is what users actually experience.
Component Breakdown:
- Speech-to-Text (STT): Time to transcribe audio to text
- Language Model (LLM): Time to generate response
- Text-to-Speech (TTS): Time to synthesize audio
- Network/Transport: Cumulative network round trips
- Processing Overhead: Serialization, queuing, context switching
Time-to-First-Byte vs Full Response
Time-to-First-Byte (TTFB): When the first audio sample reaches the user. This is what creates the perception of responsiveness.
Full Response Time: When the complete response finishes playing. Less critical for perceived latency but affects conversation pacing.
User Perception Factors
Users don't experience latency uniformly. Perception varies based on:
- Context: A 500ms delay feels fast for complex questions but slow for simple acknowledgments
- Expectation: Users expect instant responses to "yes/no" but tolerate delays for calculations
- Audio Cues: Filler sounds ("um," "let me check") can make 1000ms feel like 500ms
- Turn Signals: Clear end-of-turn detection prevents interruptions even with higher latency
Why is my voice agent slow?
The Latency Stack Breakdown
Understanding where time is spent is crucial for optimization. Here's a typical latency budget:
| Component | Typical Range | Optimized Range | Notes |
|---|---|---|---|
| STT | 200-400ms | 100-200ms | Streaming STT can reduce this |
| LLM Inference | 300-1000ms | 200-400ms | Highly model-dependent |
| TTS | 150-500ms | 100-250ms | TTFB, not full synthesis |
| Network (Total) | 100-300ms | 50-150ms | Multiple round trips |
| Processing | 50-200ms | 20-50ms | Queuing, serialization |
| Turn Detection | 200-800ms | 200-400ms | Configurable silence threshold |
| Total | 1000-3200ms | 670-1450ms | End-to-end latency |
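A useful sanity check is to compare a measured per-turn breakdown against a budget like the one above. A minimal sketch in Python, using the upper bounds of the "Optimized Range" column as the budget (the measured values below are illustrative):

```python
# Upper bounds of the "Optimized Range" column above, in milliseconds.
OPTIMIZED_BUDGET_MS = {
    "stt": 200,
    "llm": 400,
    "tts_ttfb": 250,
    "network": 150,
    "processing": 50,
    "turn_detection": 400,
}

def check_budget(measured_ms: dict[str, float]) -> None:
    """Print each component's measured latency and flag anything over budget."""
    total = 0.0
    for component, budget in OPTIMIZED_BUDGET_MS.items():
        value = measured_ms.get(component, 0.0)
        total += value
        status = "OK" if value <= budget else f"OVER by {value - budget:.0f}ms"
        print(f"{component:>15}: {value:6.0f}ms (budget {budget}ms) {status}")
    print(f"{'total':>15}: {total:6.0f}ms")

# Illustrative measurement for a single turn.
check_budget({"stt": 180, "llm": 520, "tts_ttfb": 140,
              "network": 120, "processing": 35, "turn_detection": 450})
```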
Detailed Component Analysis
Speech-to-Text (STT) Latency:
- Standard models: 200-400ms for final transcript
- Streaming models: 100-200ms with partial results
- Factors: Audio quality, accent, background noise
- Optimization: Use streaming APIs and optimize audio encoding (see the streaming sketch below)
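For the streaming pattern itself, here is a minimal sketch of pushing audio frames to a streaming STT endpoint over a websocket while reading partial transcripts concurrently. The endpoint URL, end-of-stream marker, and message format are placeholders; adapt them to your provider's streaming API:

```python
import asyncio
import time
import websockets  # pip install websockets

async def stream_audio(audio_chunks, uri="wss://stt.example.com/stream"):
    """Send audio frames while concurrently reading partial transcripts."""
    start = time.perf_counter()
    async with websockets.connect(uri) as ws:

        async def send_frames():
            for chunk in audio_chunks:  # e.g. 20ms PCM frames from the mic
                await ws.send(chunk)
            await ws.send(b"")  # placeholder end-of-stream marker; provider-specific

        async def read_partials():
            async for message in ws:
                elapsed_ms = (time.perf_counter() - start) * 1000
                print(f"partial transcript after {elapsed_ms:.0f}ms: {message}")

        await asyncio.gather(send_frames(), read_partials())
```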
LLM Inference Breakdown:
In 2025, the most popular models for voice agents prioritize the balance between speed and cost:
Fast Tier (200-500ms TTFT):
- GPT-4o-mini: The go-to for high-volume applications, ~400ms latency
- Gemini 2.5 Flash: 10x cheaper for audio processing than GPT-4o, similar speed
- Claude 3.5 Haiku: ~360ms, optimized specifically for conversational AI
Balanced Tier (500-800ms TTFT):
- GPT-4o: Industry standard with native audio I/O and WebRTC support
- Qwen 2.5: Popular in e-commerce and Asian markets
- Llama 3.3 70B: Self-hosted option for privacy-sensitive deployments
Premium Tier (800ms+ TTFT):
- Claude 3.5 Sonnet: Higher accuracy but ~2x slower than GPT-4o
- Gemini 2.5 Pro: Best for complex reasoning tasks
- Large open models: 100B+ parameters for specialized use cases
The industry consensus: 500ms TTFT or less is sufficient for most voice AI applications. The LLM is usually the largest single contributor to total latency, making model selection critical.
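To compare candidate models yourself, TTFT can be measured directly by timing the first streamed token. A minimal sketch assuming the OpenAI Python SDK (v1+); the model name and prompt are illustrative, and the same pattern applies to any streaming chat API:

```python
import time
from openai import OpenAI  # pip install openai (v1+ SDK)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure_ttft(model: str, prompt: str) -> float:
    """Return time-to-first-token in milliseconds for a streamed completion."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no content (e.g. role-only deltas); skip them.
        if chunk.choices and chunk.choices[0].delta.content:
            return (time.perf_counter() - start) * 1000
    return float("nan")

# Illustrative call; swap in whichever model you are evaluating.
print(f"TTFT: {measure_ttft('gpt-4o-mini', 'What are your opening hours?'):.0f}ms")
```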
Text-to-Speech (TTS) Latency:
Modern TTS systems have made remarkable progress, with time-to-first-byte (TTFB) now approaching human reaction speeds:
Performance Tiers:
- Ultra-fast (40-100ms): Achieved by specialized providers using streaming architectures
- Standard (100-250ms): Most production TTS systems fall in this range
- Neural/Premium (250-500ms): Higher quality voices with more natural prosody
Key Factors:
- Voice quality vs speed tradeoff: Neural voices sound better but add 100-200ms
- Streaming vs batch: Streaming can cut TTFB by 50-70%
- Geographic proximity: Add 20-50ms per thousand miles from TTS servers
- Caching: Pre-synthesized common phrases deliver instant audio
The sweet spot for production: 100-200ms TTFB with streaming enabled. This keeps the TTS component from becoming a bottleneck while maintaining good voice quality.
Network Round Trips:
- WebRTC connection: Typically adds 100-250ms total in production
- Geographic impact: US East-West +60-80ms, US-Europe +80-150ms, US-Asia +150-250ms
- Multiple API calls: Each hop adds 20-100ms depending on provider and region
- Target for voice: Keep total network overhead under 200ms
- Reality check: Most production deployments see 100-300ms of network latency total (a quick timing sketch follows this list)
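A rough way to verify this reality check is to time small requests to each provider from your serving region. A minimal sketch; the health-check URLs are placeholders, and each call opens a fresh connection, so it approximates a cold, non-keep-alive hop:

```python
import statistics
import time
import urllib.request

def round_trip_ms(url: str, samples: int = 5) -> float:
    """Median round-trip time for a small HTTPS GET, in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        # Each call includes connection setup and TLS handshake.
        with urllib.request.urlopen(url, timeout=5) as response:
            response.read()
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)

# Placeholder health-check endpoints; substitute your providers' URLs.
for name, url in {"stt": "https://stt.example.com/health",
                  "llm": "https://llm.example.com/health",
                  "tts": "https://tts.example.com/health"}.items():
    print(f"{name}: {round_trip_ms(url):.0f}ms median round trip")
```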
Common Bottlenecks
- Sequential Processing: Not starting TTS until LLM completes
- Poor Region Selection: Users connecting to distant servers
- Cold Starts: Serverless functions adding 500-2000ms
- Unoptimized Models: Using a premium-tier model when a fast-tier model (e.g., GPT-4o-mini) would suffice
- Excessive Context: Large conversation history slowing inference
How to measure voice AI latency correctly
Measurement Methodologies
Critical Timestamps to Capture:
| Measurement Point | Event Description | Why It Matters |
|---|---|---|
| userSpeechEnd | When user stops speaking | Start of end-to-end latency |
| sttStarted | STT processing begins | Start of transcription latency |
| sttCompleted | Transcript ready | End of STT, start of business logic |
| llmRequestSent | LLM API call initiated | Start of inference latency |
| llmResponseReceived | LLM response complete | End of LLM processing |
| ttsRequestSent | TTS synthesis started | Start of speech synthesis |
| firstAudioByte | First audio sent to user | User-perceived response time |
| responseComplete | Full response delivered | Total interaction time |
Key Latency Calculations:
| Metric | Calculation | Target | What It Measures |
|---|---|---|---|
| End-to-End | firstAudioByte - userSpeechEnd | Under 800ms | Total user-perceived latency |
| STT Latency | sttCompleted - sttStarted | 100-200ms | Speech recognition speed |
| LLM Latency | llmResponseReceived - llmRequestSent | 200-500ms | Model inference time |
| TTS Latency | firstAudioByte - ttsRequestSent | 40-200ms | Speech synthesis TTFB |
| Turn Detection | sttStarted - userSpeechEnd | 200-400ms | Silence detection delay |
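Once these timestamps are captured per turn, the calculations above reduce to simple subtractions. A minimal sketch (field names mirror the measurement points; the sample values are illustrative):

```python
from dataclasses import dataclass

@dataclass
class TurnTimestamps:
    # Epoch or monotonic milliseconds captured at each measurement point.
    user_speech_end: float
    stt_started: float
    stt_completed: float
    llm_request_sent: float
    llm_response_received: float
    tts_request_sent: float
    first_audio_byte: float

def latency_metrics(t: TurnTimestamps) -> dict[str, float]:
    """Per-turn latency metrics, matching the calculations in the table above."""
    return {
        "end_to_end": t.first_audio_byte - t.user_speech_end,
        "turn_detection": t.stt_started - t.user_speech_end,
        "stt": t.stt_completed - t.stt_started,
        "llm": t.llm_response_received - t.llm_request_sent,
        "tts_ttfb": t.first_audio_byte - t.tts_request_sent,
    }

# Illustrative values for a single turn, in milliseconds.
turn = TurnTimestamps(0, 350, 520, 530, 930, 940, 1080)
print(latency_metrics(turn))
# {'end_to_end': 1080, 'turn_detection': 350, 'stt': 170, 'llm': 400, 'tts_ttfb': 140}
```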
Common Measurement Mistakes
- Measuring from wrong start point: Starting from when audio arrives vs when user stops speaking
- Ignoring turn detection: Not accounting for silence detection delay
- Testing with ideal conditions: Perfect network, no background noise, simple queries
- Averaging without percentiles: Missing tail latencies that ruin user experience
- Not measuring in production: Lab results don't reflect real-world conditions
Measurement Best Practices
Use Percentiles, Not Averages (a short calculation sketch follows this list):
- P50 (median): Your typical experience
- P90: What 10% of users experience
- P95: Critical for user satisfaction
- P99: Identifies systemic issues
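These percentiles can be computed from collected end-to-end samples with the standard library alone. A minimal sketch (the sample data is illustrative):

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """P50/P90/P95/P99 from a list of per-turn end-to-end latencies."""
    # quantiles(n=100) returns the 99 cut points P1..P99.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p90": cuts[89], "p95": cuts[94], "p99": cuts[98]}

# Illustrative sample of end-to-end latencies in milliseconds.
samples = [820, 950, 1100, 1400, 1550, 1700, 2300, 3400, 5100, 9200]
print(latency_percentiles(samples))
```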
Track Component Waterfalls:
| Component | Start Time | End Time | Duration | Cumulative | Status |
|---|---|---|---|---|---|
| STT | 0ms | 300ms | 300ms | 300ms | ✓ On target |
| LLM | 300ms | 700ms | 400ms | 700ms | ✓ On target |
| TTS | 700ms | 900ms | 200ms | 900ms | ⚠️ Slightly high |
| Network | 900ms | 1100ms | 200ms | 1100ms | ❌ Over budget |
| Total | 0ms | 1100ms | 1100ms | - | ⚠️ Above 800ms target |
Visual Timeline:
```
0ms          300ms        700ms        900ms        1100ms
|------------|------------|------------|------------|
     STT           LLM          TTS       Network
```
Production Monitoring Checklist:
- Instrument every component with timestamps
- Track percentiles, not just averages
- Measure by geography/region
- Monitor during peak load
- Alert on P95 degradation
- Correlate with user feedback
Quick wins for reducing voice AI latency
1. Implement Streaming Where Possible
Streaming STT: Start processing before user finishes speaking
- Benefit: Save 100-200ms
- Implementation: Use streaming WebSocket APIs
- Tradeoff: Slightly lower accuracy on partial results
Streaming TTS: Start audio playback before full synthesis
- Benefit: Save 200-400ms on TTFB
- Implementation: Use chunked audio streaming (see the sketch after this list)
- Tradeoff: Can't know total duration upfront
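A minimal sketch of the chunked-playback pattern, recording TTFB on the first chunk. `synthesize_stream` and `play_chunk` are placeholders for your streaming TTS client and audio output:

```python
import time

def play_streaming_tts(text: str, synthesize_stream, play_chunk) -> float:
    """Start playback on the first chunk and return TTS TTFB in milliseconds.

    synthesize_stream(text) is assumed to yield audio chunks as they are
    synthesized; play_chunk(chunk) is assumed to enqueue audio for playback.
    """
    start = time.perf_counter()
    ttfb_ms = None
    for chunk in synthesize_stream(text):
        if ttfb_ms is None:
            ttfb_ms = (time.perf_counter() - start) * 1000
        play_chunk(chunk)  # playback begins before full synthesis completes
    return ttfb_ms if ttfb_ms is not None else float("nan")
```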
2. Optimize Turn Detection
VAD Configuration Tuning:
| Parameter | Default | Optimized | Impact | Risk |
|---|---|---|---|---|
| Silence Threshold | 800ms | 500ms | -300ms latency | May cut off pauses |
| Speech Threshold | 0.3 | 0.5 | Faster detection | May miss soft speech |
| Min Speech Duration | 200ms | 100ms | Quicker response | False positives on noise |
| End-of-Turn Delay | 1000ms | 400-600ms | -400ms perceived | Interruption risk |
Configuration by Use Case:
| Use Case | Silence (ms) | Threshold | Min Duration | Best For |
|---|---|---|---|---|
| Fast Q&A | 400 | 0.6 | 50ms | Quick exchanges |
| Conversation | 500-600 | 0.5 | 100ms | Natural dialogue |
| Thoughtful | 800 | 0.4 | 150ms | Complex queries |
| Noisy Environment | 600 | 0.7 | 200ms | Background noise |
- Total Benefit: Save 200-400ms on turn detection
- Implementation: Adjust based on user feedback and interruption rates (a preset sketch follows this list)
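The use-case table above maps directly to a small set of presets applied at session start. A minimal sketch; the parameter names mirror the table and will need to be translated to your VAD library's configuration object:

```python
# Presets taken from the "Configuration by Use Case" table above.
VAD_PRESETS = {
    "fast_qa":      {"silence_ms": 400, "speech_threshold": 0.6, "min_speech_ms": 50},
    "conversation": {"silence_ms": 550, "speech_threshold": 0.5, "min_speech_ms": 100},
    "thoughtful":   {"silence_ms": 800, "speech_threshold": 0.4, "min_speech_ms": 150},
    "noisy":        {"silence_ms": 600, "speech_threshold": 0.7, "min_speech_ms": 200},
}

def vad_config(use_case: str) -> dict:
    """Return VAD settings for a use case, defaulting to 'conversation'."""
    return VAD_PRESETS.get(use_case, VAD_PRESETS["conversation"])

print(vad_config("fast_qa"))  # {'silence_ms': 400, 'speech_threshold': 0.6, 'min_speech_ms': 50}
```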
3. Choose the Right Model
Model Selection Matrix (2025):
| Use Case | Recommended Model | Latency (TTFT) | Cost |
|---|---|---|---|
| High volume, budget | Gemini 2.5 Flash | ~400ms | $ (10x cheaper for audio) |
| Simple Q&A | GPT-4o-mini | 200-400ms | $ |
| Conversational AI | Claude 3.5 Haiku | 360ms | $$ |
| Industry standard | GPT-4o | 400-600ms | $$$ |
| E-commerce/Asia | Qwen 2.5 | 400-500ms | $$ |
| Self-hosted | Llama 3.3 70B | Variable | Infrastructure |
| Complex reasoning | Claude 3.5 Sonnet | 800-1200ms | $$$$ |
4. Geographic Distribution
Deploy Closer to Users:
- US East to West Coast: +60-80ms
- US to Europe: +80-150ms
- US to Asia: +150-250ms
Multi-Region Setup:
```yaml
regions:
  us-east: Primary for East Coast users
  us-west: Primary for West Coast users
  eu-west: Primary for European users
```
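Routing a caller to the nearest deployment can be as simple as a lookup from a coarse location signal (for example, a country or region code from carrier or SIP metadata) to one of the regions above. A minimal sketch, assuming such a signal is available; the mapping is illustrative:

```python
# Illustrative mapping from a coarse caller-location code to a deployment region.
REGION_BY_LOCATION = {
    "US-NY": "us-east", "US-FL": "us-east",
    "US-CA": "us-west", "US-WA": "us-west",
    "GB": "eu-west", "DE": "eu-west", "FR": "eu-west",
}

def pick_region(location_code: str, default: str = "us-east") -> str:
    """Choose the closest deployment region for a caller."""
    return REGION_BY_LOCATION.get(location_code, default)

print(pick_region("DE"))  # -> eu-west
```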
5. Connection Pooling and Keep-Alive
Maintain Persistent Connections:
```javascript
// Reuse TCP/TLS connections across sequential API calls
const https = require('https');

const httpsAgent = new https.Agent({
  keepAlive: true,      // keep sockets open between requests
  keepAliveMsecs: 1000, // probe idle sockets every second
  maxSockets: 50        // cap concurrent sockets per host
});
```
- Benefit: Save 20-100ms per request
- Implementation: Critical for sequential API calls
6. Implement Response Caching
Cache Common Responses:
```python
# Pre-synthesized audio for common phrases, keyed by intent.
# `pre_synthesized_audio` is assumed to map phrase text to audio bytes
# generated ahead of time with your TTS provider.
response_cache = {
    "greeting": pre_synthesized_audio["Hello, how can I help you?"],
    "confirmation": pre_synthesized_audio["Got it, let me help with that."],
    "thinking": pre_synthesized_audio["Hmm, let me check..."],
}
```
- Benefit: Instant response for cached phrases
- Storage: ~1MB per minute of cached audio
7. Parallel Processing Pipeline
Process Components in Parallel When Possible:
```python
import asyncio

async def process_response(transcript: str, user_id: str) -> str:
    # Start both tasks simultaneously since they are independent of each other.
    sentiment_task = asyncio.create_task(analyze_sentiment(transcript))
    context_task = asyncio.create_task(fetch_context(user_id))
    sentiment = await sentiment_task
    context = await context_task
    # Only the LLM call depends on both results, so it runs last.
    return await generate_response(transcript, sentiment, context)
```
Latency tradeoffs: speed vs accuracy vs cost
Decision Matrix for Different Use Cases
| Use Case | Latency Target | Accuracy Priority | Cost Sensitivity | Recommended Setup |
|---|---|---|---|---|
| Customer Support | Under 800ms | High | Medium | Fast model + Streaming + Caching |
| Sales Calls | Under 500ms | Medium | Low | Premium model + Edge deployment |
| Voice IVR | Under 1200ms | Medium | High | Self-hosted model + Basic TTS |
| Medical Consultation | Under 1000ms | Very High | Low | Premium model + Verification layer |
| Food Ordering | Under 600ms | Medium | Medium | Fast model + Response cache |
| Virtual Receptionist | Under 700ms | Medium | High | Conversational model + Standard TTS |
When to Prioritize Speed
Speed-First Scenarios:
- High-volume, short interactions
- Simple decision trees
- Confirmation/acknowledgment heavy flows
- Users expecting instant responses
Speed Optimization Stack:
- Streaming STT with partial results (target: 100-200ms)
- Fast LLM tier (200-400ms TTFT)
- Ultra-fast TTS with streaming (target: 40-100ms TTFB)
- Edge deployment to minimize network hops
- Pre-synthesized common responses
When to Accept Higher Latency
Accuracy-First Scenarios:
- Complex reasoning required
- High-stakes conversations (medical, financial)
- Multi-turn context critical
- Need for verification/safety checks
Accuracy Optimization Stack:
- High-accuracy STT with post-processing
- Premium LLM tier (800ms+ but higher accuracy)
- Neural TTS for natural speech
- Additional safety/verification layers
- Rich context retrieval
Cost Optimization Strategies
Balancing Cost and Performance:
- Tiered Model Selection:
```python
def select_model(query_complexity: str, latency_requirement_ms: int) -> str:
    """Pick a model tier from the latency budget (in ms) and query complexity."""
    if latency_requirement_ms < 400:
        return "fast_tier"       # 200-400ms TTFT
    elif query_complexity == "simple":
        return "fast_tier"       # optimize for speed
    elif query_complexity == "medium":
        return "balanced_tier"   # 500-800ms TTFT
    else:
        return "premium_tier"    # accuracy over speed
```
- Hybrid Approach (a sketch follows this list):
  - Use the fast model for the initial response
  - Upgrade to the accurate model for complex queries
  - Cache frequently used responses
  - Batch non-urgent processing
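A minimal sketch of the hybrid approach: answer from the fast tier by default and escalate only when a simple complexity heuristic fires. The `generate` callable and the heuristic are placeholders for your own stack:

```python
async def hybrid_response(transcript: str, generate) -> str:
    """Answer with the fast tier by default; escalate complex queries.

    generate(tier, transcript) is assumed to call the selected model tier
    and return its text response.
    """
    # Placeholder heuristic: long or explanation-style queries go premium.
    looks_complex = len(transcript.split()) > 40 or "explain" in transcript.lower()
    tier = "premium_tier" if looks_complex else "fast_tier"
    return await generate(tier, transcript)
```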
The future: speech-to-speech models
How Speech-to-Speech Changes Everything
Traditional pipeline: Audio → Text → LLM → Text → Audio (1000-2000ms)
Speech-to-speech: Audio → Model → Audio (200-500ms)
Current State of Speech-to-Speech:
Speech-to-speech models are achieving 160-400ms end-to-end latency, compared to 1000-2000ms for traditional pipelines. These models process audio directly without intermediate text conversion.
Key Characteristics:
- Latency: 160-400ms typical (vs 1000-2000ms traditional)
- Quality: Preserves emotion, tone, and prosody
- Availability: Limited but growing rapidly
- Requirements: Significant computational resources
Benefits of Speech-to-Speech
- Dramatic Latency Reduction: 70-80% faster than traditional pipeline
- Preserves Prosody: Maintains emotion, tone, emphasis
- Natural Turn-Taking: Better interruption handling
- No Transcription Errors: Bypasses STT failures
- Native Multimodal: Can process voice characteristics directly
Current Limitations
Technical Challenges:
- Limited control over response content
- Difficult to integrate business logic
- No intermediate text for logging/analysis
- Challenging to implement guardrails
- Higher computational requirements
Practical Considerations:
- Most models still in research/beta
- Limited language and accent support
- Unclear pricing models
- Requires new evaluation frameworks
- Integration complexity with existing systems
Implementation Readiness Timeline
Note: Timeline projections based on current adoption rates and technology maturity as of January 2026
- 2024 (Past): Early adopters experimenting, limited production use
- 2025 (Recent): Broader API availability, hybrid approaches emerged
- 2026 (Current): Mainstream adoption for specific use cases
- 2027+ (Projected): Default approach for most voice AI applications
Conclusion
Key Takeaways
- Target under 800ms end-to-end latency for production voice agents
- Measure from user speech end to first audio byte for accurate metrics
- Optimize the slowest component first - usually LLM inference
- Use streaming APIs wherever possible for 20-40% improvement
- Choose models based on use case, not default to most powerful
- Monitor P95 latency, not averages, for user satisfaction
- Consider speech-to-speech for next-generation experiences
Action Items for Engineering Teams
Immediate Actions (Week 1):
- Implement comprehensive latency monitoring
- Measure current P50, P90, P95 latencies
- Identify biggest bottleneck component
- Test streaming STT/TTS if not already using
Short-term Improvements (Month 1):
- Optimize turn detection settings
- Implement response caching for common phrases
- Evaluate faster model alternatives
- Set up multi-region deployment if needed
Long-term Strategy (Quarter):
- Design for under 500ms latency target
- Evaluate speech-to-speech models
- Build latency testing into CI/CD
- Establish latency SLAs with alerts
Further Resources
Tools and Services:
- Deepgram Streaming STT - Low-latency speech recognition
- ElevenLabs Streaming TTS - High-quality, low-latency synthesis
- Hamming - Voice Agent Testing Platform - Automated voice agent testing with built-in latency profiling
  - Component-level latency breakdown (STT, LLM, TTS)
  - P50/P90/P99 percentile tracking
  - Geographic latency testing across regions
  - Automated regression detection for latency spikes
- WebRTC Stats API - Browser-based latency measurement
Open Source Projects:
- Pipecat - Framework for building real-time voice agents
- LiveKit - WebRTC infrastructure for voice/video
Remember: Users don't complain about milliseconds—they complain about conversations that feel broken. Focus on the experience, measure religiously, and optimize systematically. The difference between good and great voice AI is often just a few hundred milliseconds.
References and Citations
Academic Papers & Research:
- ArXiv: Moshi - Speech-Text Foundation Model - Real-time dialogue with 160-200ms latency
- ArXiv: Human Latency Conversational Turns - Psychological basis for 200-300ms response window
- ArXiv: Low-Latency Voice Agents for Telecommunications - Streaming ASR, quantized LLMs, real-time TTS
- ArXiv: X-Talk Modular Speech-to-Speech - Event-driven architecture for real-time dialogue
- ArXiv: Sub-millisecond Speech Enhancement - 0.32-1.25ms algorithmic latency achievements
- ArXiv: SpeakStream - 30ms TTS + 15ms vocoder latency breakthrough
- ArXiv: ASR Latency Assessment - Methodological framework for real-time measurement
Industry Benchmarks & Standards:
- MLPerf Inference v5.1 - Industry standard LLM performance benchmarks
- MLPerf Interactive Benchmark - 450ms TTFT, 40ms TPOT requirements
- Hugging Face LLM Performance Leaderboard - Real-time performance tracking
- Hugging Face LLM-Perf Leaderboard - Hardware-specific optimizations
- Artificial Analysis - Continuous API performance monitoring
TTS Latency Research:
- Jambonz TTS Latency Leaderboard - Comprehensive TTS vendor comparison
- Podcastle TTS Benchmark - Quality vs latency tradeoff analysis
- ElevenLabs Latency Optimization - Official optimization guide
- Deepgram Aura-2 Launch - Enterprise TTS benchmarks
- GitHub - Picovoice TTS Benchmark - Open source benchmarking tool
Network and WebRTC Analysis:
- VideoSDK WebRTC Low Latency Guide - 2025 WebRTC optimization techniques
- Cyara - RTT and Network Latency - RTT impact on voice quality
- 100ms - WebRTC Call Quality - Quality measurement best practices
Voice AI System Design:
- Dev.to - Sub-1-Second Voice Loop - 30+ stack benchmarks and findings
- Softcery Lab - STT/TTS Comparison 2025 - Comprehensive vendor guide
- Databricks - LLM Inference Best Practices - Enterprise optimization strategies

