TL;DR: Troubleshoot voice agent failures using this symptom-to-diagnosis approach:
| Symptom | Likely Layer | First Check | Production Threshold |
|---|---|---|---|
| Calls not connecting | Telephony | SIP registration, network | ICE state: "connected" |
| No sound or garbled audio | Audio/ASR | Codec, WebRTC, VAD | Packet loss <1%, jitter <20ms |
| Wrong responses or timeouts | Intelligence/LLM | LLM endpoint, prompts | Response <1s, no 429 errors |
| No agent speech | Output/TTS | TTS service, audio encoding | TTFB <200ms |
| Agent cuts off users | Turn Detection | VAD threshold, endpointing | Silence threshold 400-600ms |
| High latency (>2s) | Multiple layers | Component-level traces | P95 end-to-end <5s |
Start at the infrastructure layer. Move up only when that layer is verified working. Most issues (50%+) are in telephony or audio—don't jump to LLM debugging first.
Methodology Note: Diagnostic frameworks and thresholds in this guide are derived from Hamming's analysis of 4M+ production voice agent calls and incident response patterns across 10K+ voice agents (2025-2026).
Related Guides:
- Voice Agent Incident Response Runbook — 4-Stack framework for production outages
- Debug WebRTC Voice Agents — ICE, RTP, and pipeline debugging
- How to Evaluate and Test Voice Agents — 4-Layer QA Framework
- Voice Agent Observability — 4-Layer Observability Framework
- Voice AI Latency Guide — Latency benchmarks and optimization
- Debugging Voice Agents: Logs, Missed Intents & Error Dashboards — Real-time log analysis, error dashboards, and continuous improvement loops
If You're Debugging an AI Voice Agent...
AI voice agents add complexity layers on top of traditional VoIP infrastructure. Before debugging ASR, LLM, or TTS issues, verify your network and telephony stack is healthy. Most AI voice agent problems trace back to underlying VoIP issues.
| VoIP Symptom | AI Agent Impact | What Breaks |
|---|---|---|
| High jitter (>30ms) | ASR receives corrupted audio frames | Transcription errors, wrong words, gibberish |
| Packet loss (>1%) | Audio gaps confuse speech recognition | Missed utterances, incomplete sentences |
| Poor MOS (<3.5) | Degraded audio quality throughout pipeline | ASR confidence drops, user frustration |
| NAT/firewall issues | WebRTC ICE failures, one-way audio | Agent can't hear user or vice versa |
| SIP registration failures | Calls don't connect | Complete call failure before agent loads |
| Codec mismatch | Audio format incompatibility | Garbled audio, no audio, echo |
Bottom line: If your VoIP layer has problems, your AI pipeline will magnify them. A 2% packet loss that's "acceptable" for human calls causes 10-15% ASR word error rate increases for voice agents.
→ Skip to: VoIP Call Quality Checklist if you suspect network issues
Why Do Voice Agents Fail?
Voice agents fail across multiple interdependent layers: telephony, ASR, LLM orchestration, tool execution, and TTS. Single component failures cascade through subsequent decisions, making root cause diagnosis difficult. Systematic troubleshooting requires isolating whether issues stem from audio quality, semantic understanding, model failures, API integrations, or synthesis latency.
What you'll learn:
- How to identify which component (ASR, LLM, tool execution, TTS) causes specific failure patterns
- Diagnostic techniques using logs, traces, and component-level testing to isolate root causes
- Production monitoring strategies to catch issues before they impact users
Quick filter: If you're restarting services before understanding which layer failed, you're wasting time.
What Are the Common Voice Agent Failure Categories?
Voice agents combine STT (speech-to-text), NLU (natural language understanding), decision logic, response generation, and TTS. Each layer depends on previous outputs: ASR errors corrupt LLM inputs, causing downstream tool execution failures.
Failure Category Reference Table
| Category | Layer | Symptoms | Root Causes | Diagnostic Priority |
|---|---|---|---|---|
| Retrieval failures | Intelligence | Irrelevant responses, wrong facts | RAG returning wrong context | Medium |
| Instruction adherence | Intelligence | Ignoring guidelines, scope creep | Prompt drift, temperature too high | High |
| Reasoning failures | Intelligence | Logical errors, contradictions | Context overflow, model limitations | Medium |
| Tool integration | Intelligence | API errors, timeouts, wrong calls | Auth failures, parameter issues | High |
| ASR failures | Audio | Empty transcripts, wrong words | Accents, noise, phonetic ambiguity | High |
| Latency bottlenecks | Multiple | Awkward pauses, interruptions | Slow APIs, model inference, synthesis | High |
| Context loss | Intelligence | Forgetting earlier details | Token limits, state management | Medium |
| Turn-taking errors | Audio | Cutting off users, not responding | VAD misconfiguration, endpointing | High |
How Do Failures Cascade Across Layers?
A single root-cause ASR error propagates: an incorrect transcription leads to a misclassified intent, which triggers the wrong tool selection. External service failures also cascade: a slow CRM response delays the agent's reply beyond user tolerance (1-2 seconds).
| Initial Failure | Cascade Effect | User Experience |
|---|---|---|
| Network latency (Telephony) | ASR timeouts → LLM timeouts | Call drops, no response |
| ASR returning garbage (Audio) | LLM hallucinating (Intelligence) | Wrong actions, frustration |
| LLM slow (Intelligence) | Turn-taking broken | Users talk over agent |
| TTS slow (Output) | User thinks agent died | Premature hangup |
How Do You Troubleshoot ASR (Speech Recognition) Failures?
ASR Error Types and Patterns
| Error Type | Example | Root Cause | Diagnostic Check |
|---|---|---|---|
| Accent variation | "async" → "ask key" | Regional pronunciation | Test with accent datasets |
| Background noise | Random word insertions | Poor microphone, artifacts | Check audio quality scores |
| Code-mixed speech | Mixed language confusion | Multiple languages | Enable multilingual ASR |
| Low confidence | Names, numbers wrong | Critical utterance issues | Log confidence scores |
| Truncation | Sentences cut off | Aggressive endpointing | Check silence threshold |
ASR Diagnostic Checklist
- Audio reaching server? Check for audio frames in logs, verify WebRTC connection
- Codec negotiated correctly? Expected: Opus (WebRTC) or PCMU/PCMA (SIP)
- ASR returning transcripts? Empty transcripts = no audio or VAD issue
- Confidence scores acceptable? Target >0.85, investigate <0.7
- WER within threshold? Target <5% clean audio, <10% with noise (see the sketch after this checklist)
- Provider status? Check Deepgram, AssemblyAI, Google STT status pages
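A minimal sketch of the confidence and WER checks above, assuming turn records shaped like the logging schema later in this guide and a hand-labeled reference transcript for the WER comparison; thresholds mirror the checklist:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def flag_asr_turn(turn: dict, reference: str | None = None) -> list[str]:
    """Return the threshold breaches for one logged turn."""
    issues = []
    if turn["asr_confidence"] < 0.7:
        issues.append("confidence below 0.7, investigate")
    elif turn["asr_confidence"] < 0.85:
        issues.append("confidence below 0.85 target")
    if reference is not None:
        wer = word_error_rate(reference, turn["user_transcript"])
        if wer > 0.05:
            issues.append(f"WER {wer:.0%} above 5% clean-audio target")
    return issues
```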
ASR-Specific Fixes
- Incorporate diverse training data: accented audio, noisy environments, varied speech patterns from real production calls
- Implement noise-canceling technologies: beamforming microphones, suppression algorithms, acoustic models trained on real-world audio
- Apply LLM-guided refinement to ASR output: use language models to correct transcription errors using conversational context
- Deploy hardware-accelerated VAD (voice activity detection) to filter background noise before ASR processing
For detailed ASR failure patterns, see Seven Voice Agent ASR Failure Modes in Production.
How Do You Debug LLM and Intent Recognition Failures?
LLM Failure Mode Reference
| Failure Mode | Symptoms | Root Cause | Fix |
|---|---|---|---|
| Hallucinations | Made-up facts, wrong policies | No grounding in verified data | Add RAG validation, lower temperature |
| Misclassified intent | Wrong action triggered | Ambiguous user input, poor NLU | Improve prompt, add disambiguation |
| Context overflow | Forgets earlier details | Token limit exceeded | Implement summarization, truncation |
| Cascading errors | Multiple wrong decisions | Single root mistake propagates | Add validation checkpoints |
| Rate limiting | Slow/no responses | 429 errors from provider | Implement backoff, upgrade tier |
| Prompt drift | Inconsistent behavior | Recent prompt changes | Version control prompts, A/B test |
LLM Diagnostic Checklist
- LLM endpoint responding? Direct API test, check provider status
- Rate limiting? Look for 429 errors, check tokens per minute (see the backoff sketch after this checklist)
- Prompt changes? Review recent deployments, check for injection
- Context window? Calculate tokens per conversation, approaching limit?
- Tool calls working? Check function call logs, tool timeout errors
- Response quality? Compare to baseline, check for hallucinations
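A minimal sketch of retry with exponential backoff for rate limits, kept provider-agnostic: `call_llm` and `RateLimitError` stand in for your SDK's request function and rate-limit exception.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for your provider's 429 / rate-limit exception."""


def with_backoff(call_llm, max_retries: int = 5, base_delay: float = 0.5, max_delay: float = 8.0):
    """Retry call_llm() on rate limits, doubling the wait each attempt with jitter."""
    for attempt in range(max_retries):
        try:
            return call_llm()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error (or escalate to a human)
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retry storms
```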
Mitigation Strategies
- Ground with verified data: integrate agents with reliable, up-to-date databases (CRM, knowledge bases, APIs)
- Implement prompt engineering: design prompts that constrain model outputs to factual, verified responses
- Set appropriate model configurations: lower temperature (0.3-0.5) for factual tasks, restrict token generation length
- Add validation checkpoints: verify critical information before executing irreversible actions
How Do You Fix Tool Execution and API Integration Failures?
Tool Call Failure Patterns
| Failure Type | Symptom | Investigation Steps | Fix |
|---|---|---|---|
| Tool not recognized | Agent continues instead of action | Check intent classification, tool definitions | Improve tool descriptions |
| Wrong tool selection | Email API called instead of SMS | Review tool descriptions, disambiguation | Add explicit tool routing |
| Parameter formatting | Tool rejects request | Validate data types, ranges, fields | Add parameter validation |
| Response misinterpretation | Incorrect follow-up actions | Check response parsing, schema validation | Fix response handling |
| Timeout | No response from tool | Check API latency, timeout settings | Increase timeout, add caching |
Tool Integration Diagnostic Steps
- Navigate to API Logs to monitor all requests/responses, check authentication errors, verify request payload structure
- Check webhook logs to verify deliveries, server response codes, timing, monitor event delivery failures
- Track tool execution results and errors through trace views showing input parameters and returned data
- Test tool integrations independently before end-to-end testing: verify API calls work outside agent context
- Measure API response latency to identify slow external services creating conversation pauses
Tool Execution Fixes
- Implement fallback logic for when external services fail or respond slowly: retry with exponential backoff
- Cache frequently used data to avoid unnecessary database lookups mid-conversation
- Set timeout thresholds for external API calls (500-1000ms) to prevent indefinite waiting
- Build circuit breakers to prevent small failures from cascading into system-wide problems (see the sketch below)
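A sketch of the circuit-breaker pattern for external tool calls; it assumes failing tools raise exceptions, and the threshold and cooldown values are illustrative rather than prescriptive.

```python
import time


class CircuitBreaker:
    """Trip after repeated tool failures, then short-circuit to a fallback."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, tool_fn, fallback_fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                return fallback_fn()          # still cooling down: skip the tool
            self.opened_at = None             # half-open: allow one trial call
            self.failures = 0
        try:
            result = tool_fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback_fn()
```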
How Do You Optimize TTS Latency and Quality?
TTS Performance Benchmarks
| Metric | Excellent | Good | Acceptable | Poor |
|---|---|---|---|---|
| TTS TTFB | <100ms | <200ms | <400ms | >400ms |
| Full synthesis | <150ms | <300ms | <500ms | >500ms |
| Audio quality (MOS) | >4.3 | >4.0 | >3.5 | <3.5 |
TTS Diagnostic Methods
- Measure total latency including time-to-first-byte (TTFB) and complete audio synthesis duration
- Track component-level breakdowns to isolate delays between STT, LLM inference, and TTS generation
- Monitor tail latencies (p99) as users remember worst experiences, not average performance
- Log synthesis quality metrics: audio artifacts, volume consistency, unnatural pauses in generated speech
Optimizing TTS Performance
- Use dual-streaming TTS: it accepts text incrementally (token by token) and begins speaking while the LLM generates the rest of the response
- Pre-connect and reuse SpeechSynthesizer to avoid connection latency on each request
- Implement text streaming via websocket v2 endpoints for real-time synthesis as text arrives
- Chunk long outputs at punctuation marks and stream incrementally to accelerate multi-sentence replies (see the sketch below)
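A sketch of punctuation-based chunking for incremental TTS; `llm_token_stream` and `synthesize` are placeholders for your LLM streaming iterator and TTS call, and the 20-character minimum is an assumption to avoid synthesizing tiny fragments.

```python
SENTENCE_BREAKS = (".", "!", "?", ",", ";", ":")


def chunk_for_tts(llm_token_stream, min_chars: int = 20):
    """Yield text chunks that end at punctuation so TTS can start speaking early."""
    buffer = ""
    for token in llm_token_stream:
        buffer += token
        if len(buffer) >= min_chars and buffer.rstrip().endswith(SENTENCE_BREAKS):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():        # flush whatever is left when the LLM stream ends
        yield buffer.strip()


# Usage with placeholder providers:
# for chunk in chunk_for_tts(llm_token_stream):
#     synthesize(chunk)   # audio playback starts while the LLM keeps generating
```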
For detailed latency optimization, see Voice AI Latency: What's Fast, What's Slow, and How to Fix It.
Why Are Customers Hanging Up on My Voice Bot?
Audio Quality Degradation Patterns
| Symptom | Likely Cause | Diagnostic | Fix |
|---|---|---|---|
| Choppy audio | Packet loss >5% | Check webrtc-internals stats | Improve network, enable FEC |
| Echo/feedback | AEC failure | Test different device/browser | Enable echo cancellation |
| One-way audio | Asymmetric NAT/firewall | Check inbound/outbound packets | Open UDP ports, use TURN |
| Robotic voice | High jitter | Check jitter buffer stats | Increase buffer, improve network |
| Audio cuts out | Network instability | Monitor packet loss patterns | Use wired connection |
Call Drop Root Causes
- Insufficient internet speed for VoIP bandwidth requirements (minimum 100 kbps per call)
- Network overload when multiple applications compete for bandwidth during voice calls
- Weak or disrupted Wi-Fi signals cause packet loss, forcing call termination
- Application conflicts when other apps request microphone access, breaking audio connection
Audio Quality Fixes
- Upgrade internet connection to meet VoIP requirements: minimum 100 kbps upload/download per concurrent call
- Use wired Ethernet connections for critical calls instead of Wi-Fi to reduce packet loss
- Optimize Quality of Service (QoS) settings to prioritize voice traffic over other network activity
- Implement jitter buffers to smooth packet arrival timing and reduce audio stuttering
VoIP Call Quality (Jitter/Packet Loss/MOS) Checklist
This section covers traditional VoIP diagnostics that directly impact AI voice agent performance. Fix these first before debugging ASR/LLM/TTS.
Network Quality Metrics Reference
| Metric | Measurement | Good | Acceptable | Poor | AI Agent Impact |
|---|---|---|---|---|---|
| Packet Loss | % of lost RTP packets | <0.5% | <1% | >2% | ASR misses words, sentences cut off |
| Jitter | Variance in packet arrival (ms) | <15ms | <30ms | >50ms | Audio distortion, robotic voice |
| Latency (RTT) | Round-trip time (ms) | <100ms | <150ms | >200ms | Conversation delays, overlapping speech |
| MOS Score | Mean Opinion Score (1-5) | >4.0 | >3.5 | <3.0 | User/agent audio quality degrades |
VoIP Diagnostic Checklist
Network & Bandwidth:
- Sufficient bandwidth? Minimum 100 kbps per concurrent call (G.711), 30 kbps (Opus)
- QoS configured? Voice traffic prioritized (DSCP 46/EF marking)
- Packet loss under threshold? Use `ping -c 100` or VoIP quality tools (see the sketch after this list)
- Jitter acceptable? Check with `iperf3` or RTP stream analysis
- No bandwidth contention? Other applications competing during calls
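An illustrative way to automate the packet-loss check above: shell out to `ping` and parse the loss percentage from its summary line (Linux/macOS output format; the host name is a placeholder).

```python
import re
import subprocess


def packet_loss_percent(host: str, count: int = 100) -> float:
    """Run ping and return the reported packet-loss percentage."""
    result = subprocess.run(["ping", "-c", str(count), host],
                            capture_output=True, text=True)
    match = re.search(r"([\d.]+)% packet loss", result.stdout)
    return float(match.group(1)) if match else float("nan")


# Example: alert once loss passes the 1% "acceptable" ceiling
# if packet_loss_percent("sip.provider.example") > 1.0:
#     page_the_on_call_engineer()
```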
NAT & Firewall:
- SIP ALG disabled? Router SIP ALG causes registration failures, one-way audio
- UDP ports open? SIP: 5060/5061, RTP: 10000-20000 (varies by provider)
- STUN/TURN configured? Required for WebRTC NAT traversal
- Symmetric NAT handled? May require TURN relay server
- Firewall allowing RTP? Stateful inspection may block return packets
SIP & Signaling:
- SIP registration successful? Check for 401/403/408 errors
- Correct SIP trunk credentials? Authentication failures = no calls
- DNS SRV records resolving? SIP often uses SRV lookups
- TLS/SRTP configured? Encryption may be required by provider
- SIP timers appropriate? Session timers, registration refresh
Codec & Audio:
- Codec negotiated correctly? Check SDP in SIP INVITE/200 OK
- Codec priority set? Opus > G.722 > G.711 (for quality)
- Sample rate matched? Mismatch causes audio distortion
- Echo cancellation enabled? AEC required for full-duplex
- Comfort noise configured? Prevents "dead air" during silence
Common VoIP Issues and Fixes
| Issue | Symptoms | Diagnostic Command | Fix |
|---|---|---|---|
| SIP ALG interference | One-way audio, registration drops | Check SIP ALG setting in router admin | Turn off SIP ALG on all routers/firewalls |
| NAT traversal failure | ICE connection timeout, no audio | Check webrtc-internals ICE candidates | Configure STUN/TURN, open UDP ports |
| Codec mismatch | Garbled audio, no audio | Inspect SDP in SIP traces | Force compatible codec on both ends |
| RTP packet loss | Choppy audio, words missing | tcpdump -i eth0 udp port 10000-20000 | Enable FEC, increase jitter buffer |
| DNS resolution | Intermittent call failures | dig SRV _sip._udp.provider.com | Use IP directly or fix DNS |
| TLS handshake failure | Secure calls not connecting | openssl s_client -connect sip.provider.com:5061 | Update certificates, check TLS version |
WebRTC-Specific Diagnostics
For browser-based voice agents using WebRTC:
- chrome://webrtc-internals (Chrome)
- about:webrtc (Firefox)
Key metrics to check:
- ICE connection state: Should be "connected" or "completed"
- DTLS state: Should be "connected"
- Packets lost: Incoming/outgoing RTP packet loss
- Jitter buffer: Current delay and target delay
- Audio level: Verify audio is flowing (not 0)
RTP Stream Analysis
For deep packet inspection when standard tools don't reveal issues:
Capture RTP traffic:
tcpdump -i any -w voip_capture.pcap udp portrange 10000-20000
Analyze in Wireshark:
- Navigate to: Telephony → RTP → RTP Streams
- Check for packet loss percentage, jitter, delta (inter-packet timing)
- Look for sequence number gaps indicating lost packets
Key RTP metrics:
| Metric | Where to Find | Healthy Value |
|---|---|---|
| Lost packets | RTP stream analysis | <0.5% |
| Max jitter | RTP stream analysis | <30ms |
| Mean jitter | RTP stream analysis | <15ms |
| Sequence errors | RTP stream analysis | 0 |
MOS Score Interpretation
Mean Opinion Score (MOS) predicts perceived call quality:
| MOS Score | Quality | User Experience | Typical Cause |
|---|---|---|---|
| 4.3-5.0 | Excellent | Toll quality, no perceptible issues | Good network, proper codec |
| 4.0-4.3 | Good | Minor impairments, still clear | Slight jitter, minimal loss |
| 3.5-4.0 | Fair | Noticeable issues, still usable | Moderate packet loss |
| 3.0-3.5 | Poor | Annoying, hard to understand | High jitter, significant loss |
| <3.0 | Bad | Unusable, call should be terminated | Severe network issues |
For AI voice agents: Target MOS >4.0. Below 3.5, ASR accuracy drops significantly.
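If your stack reports latency, jitter, and loss but not MOS, a commonly used simplification of the ITU-T G.107 E-model gives a rough estimate; treat the result as an approximation, not a measured score.

```python
def estimate_mos(latency_ms: float, jitter_ms: float, loss_percent: float) -> float:
    """Rough MOS estimate from network stats (simplified E-model approximation)."""
    # Effective latency weights jitter heavily and adds a fixed codec allowance
    effective_latency = latency_ms + 2 * jitter_ms + 10.0
    if effective_latency < 160:
        r = 93.2 - effective_latency / 40.0
    else:
        r = 93.2 - (effective_latency - 120.0) / 10.0
    r -= 2.5 * loss_percent            # each 1% loss costs roughly 2.5 R-factor points
    r = max(0.0, min(100.0, r))
    mos = 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)
    return round(min(mos, 5.0), 2)


# Example: 150ms RTT, 20ms jitter, 1% loss lands in the "good" band (~4.1)
# print(estimate_mos(latency_ms=150, jitter_ms=20, loss_percent=1.0))
```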
What Logging and Tracing Do You Need for Voice Agent Debugging?
Essential Logging Schema
Turn-level data (per exchange):
```json
{
  "call_id": "call_abc123",
  "turn_index": 3,
  "timestamp": "2026-01-26",
  "user_transcript": "I need to reschedule my appointment",
  "asr_confidence": 0.94,
  "intent": {"name": "reschedule_appointment", "confidence": 0.91},
  "latency_ms": {"stt": 180, "llm": 420, "tts": 150, "total": 750},
  "tool_calls": [{"name": "get_appointments", "success": true, "latency_ms": 85}],
  "agent_response": "I can help you reschedule..."
}
```
Production Monitoring Essentials
| Metric | What It Measures | Alert Threshold |
|---|---|---|
| Call success rate | Calls completing without errors | Alert if <95% |
| P95 end-to-end latency | Worst-case response time | Alert if >5s |
| ASR confidence | Transcription quality | Alert if avg <0.8 |
| Task completion | Goal achievement rate | Alert if <85% |
| Error rate | Failed calls/total calls | Alert if >0.2% |
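A sketch that evaluates two of the alert thresholds in the table above against turn-level logs, assuming one JSON object per line in the schema shown earlier; the file name is an assumption.

```python
import json
import math


def p95(values):
    ordered = sorted(values)
    return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]


def check_alerts(log_path="turn_logs.jsonl"):
    with open(log_path) as f:
        turns = [json.loads(line) for line in f if line.strip()]
    if not turns:
        return []
    latencies = [t["latency_ms"]["total"] for t in turns]
    confidences = [t["asr_confidence"] for t in turns]
    alerts = []
    if p95(latencies) > 5000:                      # P95 end-to-end latency > 5s
        alerts.append(f"P95 latency {p95(latencies):.0f}ms exceeds 5s")
    if sum(confidences) / len(confidences) < 0.8:  # average ASR confidence < 0.8
        alerts.append("average ASR confidence below 0.8")
    return alerts
```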
Tracing Voice Agent Workflows
Tracing captures every call step: audio input, ASR output, semantic interpretation, internal prompts, model generations, tool calls, TTS output. Use OpenTelemetry for metrics, logs, traces to keep data portable across observability tools.
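A minimal OpenTelemetry sketch of per-component spans inside one turn; the console exporter is for illustration, the span and attribute names are assumptions rather than a fixed schema, and `run_asr`, `run_llm`, and `stream_tts` are caller-supplied placeholders for your pipeline stages.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for illustration; swap in an OTLP exporter in production.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("voice-agent")


def handle_turn(call_id: str, audio_frame: bytes, run_asr, run_llm, stream_tts):
    """Wrap each pipeline stage in its own span so latency is attributable."""
    with tracer.start_as_current_span("turn") as turn_span:
        turn_span.set_attribute("call.id", call_id)
        with tracer.start_as_current_span("asr"):
            transcript = run_asr(audio_frame)   # your ASR client
        with tracer.start_as_current_span("llm"):
            reply = run_llm(transcript)         # your LLM call, prompt assembly included
        with tracer.start_as_current_span("tts"):
            stream_tts(reply)                   # your TTS streaming call
        return reply
```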
For detailed observability implementation, see Voice Agent Observability: The Missing Discipline.
How Do You Fix Conversation Flow and Turn-Taking Issues?
Context Loss and Memory Issues
Agents hit context window limits (4k-32k tokens), causing "forgetting" of important earlier conversation details. As conversations grow, critical information gets pushed out, leading to contradictions or lost problem tracking.
| Issue | Symptom | Fix |
|---|---|---|
| Token overflow | Forgets early details | Implement conversation summarization |
| State loss | Asks same question twice | Persist state externally |
| Context drift | Contradicts earlier statements | Add context anchoring prompts |
How Do You Prevent Agents from Interrupting Users?
- Implement dynamic silence thresholds: 300ms for quick exchanges, 800ms for slower speakers (see the sketch after this list)
- Use hardware-accelerated Voice Activity Detection (VAD) that handles interruptions gracefully
- Move beyond Voice Activity Detection to consider semantics, context, tone, conversational cues
- Tune VAD sensitivity based on use case: customer service needs longer thresholds than quick commands
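A sketch of a configurable end-of-turn check driven by the silence thresholds above; it assumes an upstream VAD has already classified each audio frame as speech or silence, and the 500ms default is illustrative.

```python
import time


class EndpointDetector:
    """End-of-turn detection from per-frame VAD decisions and a silence threshold."""

    def __init__(self, silence_threshold_ms: float = 500):
        self.silence_threshold_ms = silence_threshold_ms
        self.last_speech_at = None

    def on_frame(self, is_speech: bool, now_ms: float | None = None) -> bool:
        """Return True once the user has been silent longer than the threshold."""
        now_ms = now_ms if now_ms is not None else time.monotonic() * 1000
        if is_speech:
            self.last_speech_at = now_ms
            return False
        if self.last_speech_at is None:
            return False  # the user has not spoken yet
        return (now_ms - self.last_speech_at) >= self.silence_threshold_ms


# Example: a slower-speaker profile per the guidance above
# detector = EndpointDetector(silence_threshold_ms=800)
```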
Fixing Conversation Flow Issues
- Use hybrid context management: full server-side history for high-stakes sessions, lightweight vector summaries for general chat
- Implement explicit context anchoring: have users restate critical constraints every 3-4 turns
- Test conversation state management: verify handling of interruptions, corrections, topic changes
- Implement conversation summarization at regular intervals to maintain context within token limits (see the sketch below)
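A sketch of summarization triggered by a token budget; `count_tokens` and `summarize` are placeholders (for example, a tokenizer and an LLM summarization call), and the budget and keep-recent numbers are illustrative.

```python
def maintain_context(history, count_tokens, summarize, budget=8000, keep_recent=6):
    """history: list of {"role": ..., "content": ...} messages, oldest first."""
    if len(history) <= keep_recent:
        return history
    total = sum(count_tokens(m["content"]) for m in history)
    if total <= budget:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    summary = summarize(older)   # one summary message replaces the older turns
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + recent
```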
How Do You Build Error Handling and Recovery Patterns?
Resilience Design Patterns
| Pattern | Implementation | When to Use |
|---|---|---|
| Circuit breaker | Stop calling failed service | External API failures |
| Exponential backoff | Retry with increasing delays | Transient network issues |
| Graceful degradation | Fall back to simpler responses | Knowledge retrieval failures |
| Timeout limits | Max 500-1000ms for tool calls | Slow external services |
| Retry limits | Max 3-5 attempts | Before escalating to human |
User-Facing Error Recovery
- Provide clear, actionable feedback: "I'm having trouble accessing our product database. Let me try a different approach" instead of "Error 500"
- Build fallback logic into customer journeys: "Press 0 to speak to a live representative" when agent reaches capability limits
- Acknowledge errors transparently: "I missed that, could you repeat?" rather than guessing at misheard inputs
Continuous Improvement from Failures
- Feed production failures back into offline evaluation datasets to create continuous improvement loops
- Convert any live conversation into a replayable test case with caller audio, ASR text, and expected intent
- When a production call fails, convert it to a regression test with one click, preserving the original audio and timing
- Track failure resolution rates: measure time from issue identification to deployed fix
What Testing and Evaluation Strategies Work for Voice Agents?
Automated Testing Approaches
- Auto-generate test cases from agent prompts and documentation to ensure coverage
- Run 1000+ concurrent calls with real-world conditions: accents, background noise, interruptions, edge cases
- Test agents in multiple languages, simulate global accents and real-world noise environments
- Implement synthetic user simulation: generate varied conversation paths to stress-test agent logic
Evaluation Metrics That Matter
| Metric Category | Key Metrics | Target Threshold |
|---|---|---|
| Conversational | Latency, interruptions, turn-taking | P95 <800ms response time |
| Outcomes | Task completion, escalation rate | >85% completion |
| Quality | WER, intent accuracy, entity extraction | <5% error rate |
| Compliance | PII handling, script adherence | 100% compliance |
CI/CD Integration for Voice Agents
- Integrate testing into GitHub Actions, Jenkins, or CI/CD pipeline to trigger tests and block bad prompts automatically
- After each build, send predefined prompts to the agent; if more than 5% of responses differ from baseline, halt the deployment (see the sketch below)
- Version control agent configurations (prompts, tools, models) alongside code for reproducible deployments
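A sketch of the baseline gate described above; `run_agent`, the baseline file name, and the exact-string comparison are all assumptions to adapt (many teams compare with semantic similarity instead of string equality).

```python
import json
import sys


def regression_gate(run_agent, baseline_path="baseline_responses.json", max_drift=0.05):
    """Replay baseline prompts through the agent and fail CI if too many drift."""
    with open(baseline_path) as f:
        baseline = json.load(f)   # {prompt: expected_response}
    drifted = sum(
        1 for prompt, expected in baseline.items()
        if run_agent(prompt).strip() != expected.strip()
    )
    drift_rate = drifted / len(baseline)
    print(f"{drifted}/{len(baseline)} responses drifted ({drift_rate:.1%})")
    if drift_rate > max_drift:
        sys.exit(1)   # non-zero exit status blocks the deployment step
```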
For comprehensive testing methodology, see How to Evaluate and Test Voice Agents.
Summary and Next Steps
Systematic troubleshooting requires component-level isolation: test ASR, LLM, tool execution, TTS independently before end-to-end diagnosis. Production monitoring with comprehensive logging, tracing, and observability catches issues before they impact users.
Next steps:
- Implement structured logging capturing every component's inputs, outputs, latency, confidence scores
- Set up production monitoring with alerts for latency spikes, error rate increases, quality degradation
- Build automated testing pipelines that run diverse scenarios before deployment to catch failures early
How Hamming Helps with Voice Agent Troubleshooting
Hamming provides the observability and testing layer that makes troubleshooting faster:
- 4-Layer Visibility: Unified dashboards showing health across Telephony, Audio, Intelligence, and Output
- Instant Root Cause: One-click from alert to transcript, audio, and model logs
- Session Replay: Full audio playback with transcripts and component traces
- Regression Detection: Automated alerts when metrics deviate from baseline
- Scenario Generation: Auto-generate test cases from prompts, execute in <10 minutes
Instead of manually debugging across multiple dashboards, get automated visibility into every layer of your voice agent stack.
Debug your voice agents with Hamming →
Related Guides:
- Voice Agent Drop-Off Analysis — Funnel analysis and abandonment metrics with remediation playbook
- Slack Alerts for Voice Agents — Alert templates for latency, ASR drift, jitter, and prompt regressions
- Voice Agent Monitoring KPIs — 10 production metrics with formulas and benchmarks
- Voice Agent Incident Response Runbook — Systematic debugging framework for production outages
- Voice Agent Monitoring Platform Guide — 4-Layer monitoring architecture
- Debugging Voice Agents: Logs, Missed Intents & Error Dashboards — Real-time log analysis, error dashboards, and continuous improvement loops
- Voice Agent SEV Playbook & Postmortem Template — Severity classification, response checklists, and postmortem framework for voice AI incidents

