Voice Agent Incident Response Runbook: Debug and Fix Failures in Production

Sumanyu Sharma, Founder & CEO, Voice AI QA Pioneer
Has stress-tested 1M+ voice agent calls to find where they break.

January 20, 2026 · Updated January 25, 2026 · 17 min read

TL;DR: Respond to voice agent incidents using Hamming's 4-Stack Incident Response Framework:

| Stack | What Failed | First Check | Target Resolution |
|---|---|---|---|
| 1. Telephony | Calls not connecting | SIP registration, network | <5 min |
| 2. Audio | No sound, garbled, ASR failing | Codec, WebRTC, VAD | <10 min |
| 3. Intelligence | Wrong responses, timeouts | LLM endpoint, prompts | <15 min |
| 4. Output | No agent speech, TTS errors | TTS service, audio encoding | <10 min |

Start at Stack 1 and move up only when that stack is verified working. Most incidents (50-60%, per Hamming's incident data) are Stack 1 or 2—don't jump to LLM debugging first.

How to Debug Voice Agents in Production (Step-by-Step)

When a voice agent fails, use this symptom-based approach to quickly identify the root cause:

Symptom → Diagnosis Table

| Symptom | Likely Stack | Where to Look | Common Causes | Quick Fix |
|---|---|---|---|---|
| Agent is slow (>2s response) | Stack 3 (LLM) or Stack 2 (ASR) | LLM latency traces, ASR processing time | LLM rate limiting, cold starts, complex prompts | Implement streaming, reduce prompt length, check provider status |
| Users talk over agent | Stack 4 (TTS) or Stack 2 (VAD) | TTS latency, VAD threshold, turn detection | TTS too slow, endpointing threshold too high, barge-in not working | Reduce TTS latency, lower VAD silence threshold, verify interruption handling |
| Agent misunderstands intent | Stack 2 (ASR) or Stack 3 (NLU) | WER metrics, intent accuracy, transcript quality | High WER from noise/accents, NLU drift, prompt changes | Check ASR provider, review prompt changes, test with different accents |
| Agent repeats itself | Stack 3 (LLM) | Conversation history, context window | Context not being passed, infinite loop in dialog | Verify conversation history injection, check for circular prompt logic |
| No audio from agent | Stack 4 (TTS) or Stack 1 (Network) | TTS logs, audio encoding, network traces | TTS service down, codec mismatch, network blocking | Check TTS provider status, verify audio encoding matches telephony |
| Calls drop immediately | Stack 1 (Telephony) | SIP registration, call setup logs | SIP trunk down, credential expired, firewall blocking | Re-register SIP, rotate credentials, check firewall rules |
| Agent gives wrong information | Stack 3 (LLM) | LLM responses, knowledge base, tool calls | Hallucination, stale knowledge base, failed tool calls | Add validation against sources, update knowledge base, verify tool success |
| Call connects but no interaction | Stack 2 (Audio) | Audio frames, codec negotiation | One-way audio, codec mismatch, VAD not detecting speech | Check codec compatibility, verify bidirectional audio flow |

Minimum Logging Checklist for Voice Agents

To debug voice agents effectively, ensure you're capturing these data points for every call:

Turn-Level Data (per exchange):

  • Timestamps: User speech start/end, ASR complete, LLM start/end, TTS start/end
  • Transcripts: Raw ASR output with confidence scores
  • Intent: Classified intent with confidence and alternatives
  • Latency breakdown: STT ms, LLM ms, TTS ms, total ms

Call-Level Data:

  • Call metadata: Call ID, correlation ID, caller info, agent version
  • Session context: Conversation history passed to LLM
  • Tool calls: Function name, parameters, result, success/failure, latency
  • Outcomes: Task completion status, escalation reason, call duration

Audio Data:

  • Audio quality: MOS score or quality indicators
  • VAD events: Speech detection timestamps, silence durations
  • Barge-in events: Interruption timestamps, recovery success

Example Log Entry (JSON):

{
  "call_id": "call_abc123",
  "turn_index": 3,
  "timestamp": "2026-01-25T10:30:00Z",
  "user_transcript": "I need to reschedule my appointment",
  "asr_confidence": 0.94,
  "intent": {"name": "reschedule_appointment", "confidence": 0.91},
  "latency_ms": {"stt": 180, "llm": 420, "tts": 150, "total": 750},
  "tool_calls": [{"name": "get_appointments", "success": true, "latency_ms": 85}],
  "agent_response": "I can help you reschedule. I see you have an appointment on Tuesday. What date works better?"
}

Pro tip: Use distributed tracing (OpenTelemetry) to correlate logs across services. See Voice Agent Observability & Tracing for implementation guidance.
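For teams instrumenting from scratch, the sketch below shows what per-turn tracing can look like with the OpenTelemetry Python API (assuming the opentelemetry-api package is installed). The run_* stubs stand in for your real ASR/LLM/TTS clients, and the span and attribute names are illustrative, not a required schema.

# A sketch of per-turn spans with the OpenTelemetry Python API.
from opentelemetry import trace

tracer = trace.get_tracer("voice-agent")

def run_asr(audio: bytes):        # stub: replace with your ASR client
    return "I need to reschedule my appointment", 0.94

def run_llm(transcript: str):     # stub: replace with your LLM client
    return "I can help you reschedule."

def run_tts(text: str):           # stub: replace with your TTS client
    return b"\x00" * 320

def handle_turn(call_id: str, turn_index: int, user_audio: bytes) -> bytes:
    # One parent span per turn and one child span per stage, so a trace viewer
    # shows the same STT/LLM/TTS latency breakdown as the JSON log entry above.
    with tracer.start_as_current_span("voice_agent.turn") as turn_span:
        turn_span.set_attribute("call.id", call_id)
        turn_span.set_attribute("turn.index", turn_index)

        with tracer.start_as_current_span("stt") as stt_span:
            transcript, confidence = run_asr(user_audio)
            stt_span.set_attribute("asr.confidence", confidence)

        with tracer.start_as_current_span("llm"):
            reply = run_llm(transcript)

        with tracer.start_as_current_span("tts"):
            return run_tts(reply)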


Your voice agent is down. Calls are failing. Your on-call engineer just got paged at 2 AM.

What do they do first?

Most teams scramble—restarting services, checking logs randomly, hoping something works. Meanwhile, customers get dead air or disconnected calls. Every minute of downtime costs revenue and trust.

At Hamming, we've analyzed 1M+ voice agent calls and helped teams respond to hundreds of production incidents. Here's what separates a 15-minute resolution from a 3-hour firefight: a systematic incident response framework that diagnoses the right stack first.

This runbook gives your on-call team exactly that.

Quick filter: If you're restarting services before understanding which stack failed, you're wasting time.

Methodology Note: The frameworks, thresholds, and resolution times in this runbook are derived from Hamming's analysis of 1M+ voice agent interactions and incident response patterns across 50+ production deployments (2024-2026). Your specific thresholds should be calibrated to your baseline performance.

What Does This Runbook Cover?

This runbook applies to production voice agents using:

  • SIP or WebRTC telephony (Twilio, Vonage, Telnyx, custom)
  • Streaming ASR (Deepgram, AssemblyAI, Whisper, Google STT)
  • LLM orchestration (OpenAI, Anthropic, custom models)
  • TTS services (ElevenLabs, PlayHT, Cartesia, Azure)

Assumes:

  • Agent is deployed and was working previously
  • You have access to logs, metrics, or a monitoring dashboard
  • Basic familiarity with voice agent architecture

Definitions used:

  • Incident: Unplanned degradation affecting call quality or success rate
  • Latency: End-to-end turn latency (user silence → agent audio playback)
  • Failure: Call termination before task completion

Not an active incident? For general troubleshooting and diagnosis, see our Voice Agent Drift Detection Guide. For setting up proactive monitoring to prevent incidents, see Voice Agent Monitoring Platform Guide.

What Is the 4-Stack Voice Agent Architecture?

Voice agents consist of four interdependent stacks. Each stack has distinct failure modes and requires specific diagnostic approaches:

| Stack | Function | Components | Failure Mode |
|---|---|---|---|
| 1. Telephony | Call connectivity | SIP, WebRTC, network | Calls don't connect or drop immediately |
| 2. Audio | Sound capture & processing | Codec, VAD, ASR | No sound, garbled audio, empty transcripts |
| 3. Intelligence | Understanding & response | LLM, prompts, tools | Wrong responses, timeouts, hallucinations |
| 4. Output | Speech synthesis | TTS, audio encoding | No agent speech, robotic/garbled output |

Source: Stack architecture based on Hamming's analysis of 100+ production voice agent deployments (2025-2026). Categorization aligned with standard voice agent architecture patterns.

Key insight: Failures cascade upward. A Stack 1 (Telephony) issue makes everything else irrelevant. A Stack 2 (Audio) issue means the LLM never gets good input. Always start at Stack 1.

How Do You Classify Incident Severity?

Before diving into diagnosis, classify the incident severity to determine response urgency:

| Severity | Definition | User Impact | Response Time Target |
|---|---|---|---|
| SEV-1 (Critical) | Complete outage | No calls connecting, 100% failure | <15 min to mitigate |
| SEV-2 (Major) | Significant degradation | >25% calls affected, major feature broken | <30 min to mitigate |
| SEV-3 (Minor) | Partial degradation | <25% calls affected, edge cases broken | <2 hours to mitigate |
| SEV-4 (Low) | Cosmetic or rare issues | Minimal user impact | Next business day |

Source: Severity classification aligned with Google SRE practices and adapted for voice agent-specific signals.

SEV-1 and SEV-2 require immediate action. SEV-3 and SEV-4 can be scheduled for normal working hours.
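If you want classification to be mechanical during a page, a small helper can encode the table above. The sketch below is illustrative: the boundaries mirror the table, the SEV-3/SEV-4 cutoff of 1% is an assumption (the table only says "minimal user impact"), and the numbers should be calibrated to your own traffic.

# A minimal encoding of the severity table above.
def classify_severity(percent_calls_affected: float, complete_outage: bool = False) -> str:
    """Map incident impact to a SEV level per the table above."""
    if complete_outage or percent_calls_affected >= 100:
        return "SEV-1"   # complete outage: mitigate within 15 minutes
    if percent_calls_affected > 25:
        return "SEV-2"   # major degradation: mitigate within 30 minutes
    if percent_calls_affected > 1:
        return "SEV-3"   # partial degradation: mitigate within 2 hours
    return "SEV-4"       # cosmetic or rare: next business day

print(classify_severity(40))                          # -> "SEV-2"
print(classify_severity(0, complete_outage=True))     # -> "SEV-1"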

What Is the Incident Response Decision Tree?

Use this decision tree to identify which stack to investigate first:

INCIDENT DETECTED
        │
        ▼
┌──────────────────────────────────────────────┐
│ Can calls connect at all?                    │
│ (Check: SIP registration, call logs)         │
└──────────────────────────────────────────────┘
        │
   NO ──┴── YES
   │         │
   ▼         ▼
STACK 1     ┌──────────────────────────────────────────────┐
Telephony   │ Is audio flowing both directions?            │
            │ (Check: transcripts, audio recordings)       │
            └──────────────────────────────────────────────┘
                    │
               NO ──┴── YES
               │         │
               ▼         ▼
            STACK 2     ┌──────────────────────────────────────────────┐
            Audio       │ Is agent responding correctly to input?      │
                        │ (Check: LLM logs, response quality)          │
                        └──────────────────────────────────────────────┘
                                │
                           NO ──┴── YES
                           │         │
                           ▼         ▼
                        STACK 3     ┌──────────────────────────────────────────────┐
                        LLM         │ Is agent voice output working?               │
                                    │ (Check: TTS logs, audio playback)            │
                                    └──────────────────────────────────────────────┘
                                            │
                                       NO ──┴── YES
                                       │         │
                                       ▼         ▼
                                    STACK 4     CROSS-STACK or
                                    TTS         INTERMITTENT ISSUE

Pro tip: Run through this decision tree in order. Don't skip to Stack 3 (LLM) because it seems more likely—verify each stack first.
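One way to keep the ordering honest is to encode the tree as a script your on-call engineer can run. The sketch below is a skeleton under the assumption that you wire each check_* stub to your own health signals (SIP registration status, presence of transcripts, and so on); as written, every stub optimistically returns True.

# A skeleton of the decision tree above.
def check_calls_connect() -> bool:      # Stack 1: SIP registration, call logs
    return True                         # replace with a real check

def check_audio_flowing() -> bool:      # Stack 2: transcripts, recordings
    return True                         # replace with a real check

def check_llm_responses() -> bool:      # Stack 3: LLM logs, response quality
    return True                         # replace with a real check

def check_tts_output() -> bool:         # Stack 4: TTS logs, audio playback
    return True                         # replace with a real check

def triage() -> str:
    """Walk the stacks bottom-up and name the first one that fails its check."""
    checks = [
        ("Stack 1: Telephony", check_calls_connect),
        ("Stack 2: Audio", check_audio_flowing),
        ("Stack 3: Intelligence", check_llm_responses),
        ("Stack 4: Output", check_tts_output),
    ]
    for stack, check in checks:
        if not check():
            return stack
    return "Cross-stack or intermittent issue"

print(triage())  # -> "Cross-stack or intermittent issue" until the stubs are wired up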

How Do You Diagnose Stack 1: Telephony Failures?

Symptoms:

  • Calls don't connect at all
  • Immediate disconnect after dial
  • "Number not reachable" errors
  • SIP 4xx/5xx errors in logs
  • WebRTC ICE connection failures

What Causes Telephony Failures?

| Cause | Likelihood | How to Diagnose |
|---|---|---|
| SIP trunk down | High | Check SIP registration status in provider dashboard |
| Network/firewall blocking | High | Verify UDP/TCP ports (5060, 5061, 10000-20000) are open |
| Credential expiration | Medium | Check SIP auth errors in logs ("401 Unauthorized") |
| Provider outage | Medium | Check provider status page (Twilio, Telnyx, Vonage) |
| DNS resolution failure | Low | Test DNS resolution for SIP domain |
| Certificate expiration | Low | Check TLS cert validity for secure SIP |

Stack 1 Diagnostic Checklist

Run through these checks in order:

  • SIP Registration Active?

    • Check provider dashboard for registration status
    • Look for "REGISTER" success in SIP logs
  • Network Connectivity?

    • Ping SIP server: ping sip.provider.com
    • Check firewall rules for SIP ports (5060/5061 TCP/UDP)
    • Verify STUN/TURN servers reachable
  • Provider Status?

    • Check provider status page (Twilio, Telnyx, Vonage)
  • Recent Changes?

    • Credential rotation?
    • Firewall rule changes?
    • DNS updates?
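The DNS and connectivity checks in the list above lend themselves to automation. Below is a minimal Python sketch that verifies DNS resolution for the SIP domain and TCP reachability of the SIP-TLS port; sip.provider.com is a placeholder, and a UDP/5060 trunk still needs a real SIP OPTIONS ping that this TCP-only probe does not perform.

# A sketch of automating the DNS and connectivity checks above.
import socket

SIP_HOST = "sip.provider.com"   # placeholder for your SIP domain
SIP_TLS_PORT = 5061             # SIP over TLS; plain SIP is usually UDP/TCP 5060

def check_sip_reachability(host: str = SIP_HOST, port: int = SIP_TLS_PORT) -> bool:
    try:
        sockaddr = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)[0][4]
    except socket.gaierror as exc:
        print(f"DNS resolution failed for {host}: {exc}")        # Stack 1: DNS failure
        return False
    try:
        with socket.create_connection(sockaddr[:2], timeout=3):
            print(f"{host}:{port} is reachable")
            return True
    except OSError as exc:
        print(f"TCP connect to {host}:{port} failed: {exc}")     # firewall/NAT suspect
        return False

if __name__ == "__main__":
    check_sip_reachability()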

Stack 1 Resolution Steps

  1. If SIP registration failed: Re-register with provider, verify credentials
  2. If network blocked: Open required ports, check NAT traversal
  3. If provider outage: Failover to backup SIP trunk if available
  4. If credentials expired: Rotate credentials and update configuration

When to Escalate: If SIP registration is active but calls still fail, escalate to Stack 2 (Audio).

How Do You Diagnose Stack 2: Audio Failures?

Symptoms:

  • One-way audio (user hears agent, agent doesn't hear user)
  • "Empty transcript" errors
  • Agent not responding to user speech
  • Garbled or choppy audio
  • VAD not detecting speech

What Causes Audio Failures?

| Cause | Likelihood | How to Diagnose |
|---|---|---|
| Codec mismatch | High | Check negotiated codec vs. expected (PCMU, PCMA, Opus) |
| WebRTC ICE failure | High | Check ICE connection state in browser/client logs |
| VAD threshold too aggressive | Medium | Check silence detection cutting off speech |
| ASR service degraded | Medium | Check ASR provider status, test direct API |
| Sample rate mismatch | Low | Verify 16kHz throughout pipeline |
| Audio buffer overflow | Low | Check for dropped frames in audio processing |

How Do You Test ASR in a Voice Agent?

Test ASR independently to isolate audio issues:

  1. Check transcription output: Look for recent call transcripts in logs
  2. Verify audio reaching ASR: Look for audio events/frames being sent
  3. Test ASR endpoint directly:
    # Test Deepgram directly (example)
    curl -X POST "https://api.deepgram.com/v1/listen" \
      -H "Authorization: Token YOUR_API_KEY" \
      -H "Content-Type: audio/wav" \
      --data-binary @test-audio.wav
  4. Check Word Error Rate (WER): Target <5% for clean audio, <10% for noisy
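If your ASR provider doesn't report WER directly, here is a dependency-free sketch of the calculation (libraries such as jiwer offer production-grade implementations). It implements the standard formula, WER = (substitutions + deletions + insertions) / total reference words, via a word-level edit distance.

# A dependency-free WER sketch using word-level edit distance.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)   # fraction; multiply by 100 for %

# One substitution out of six reference words -> ~0.167 (about 17% WER)
print(word_error_rate("I need to reschedule my appointment",
                      "I need to reschedule an appointment"))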

Stack 2 Key Metrics

| Metric | Normal | Warning | Critical |
|---|---|---|---|
| ASR Latency | <300ms | 300-500ms | >500ms |
| Transcription Confidence | >0.85 | 0.7-0.85 | <0.7 |
| Audio Packet Loss | <1% | 1-3% | >3% |
| VAD False Negatives | <2% | 2-5% | >5% |

Stack 2 Diagnostic Checklist

  • Audio Reaching Server?

    • Check for audio frames in logs
    • Verify WebRTC connection established
  • Codec Negotiated Correctly?

    • Expected: Opus (WebRTC) or PCMU/PCMA (SIP)
    • Mismatch causes garbled audio
  • ASR Returning Transcripts?

    • Check ASR logs for transcription responses
    • Empty transcripts = no audio or VAD issue
  • VAD Configuration?

    • Is silence threshold too aggressive?
    • Is speech being cut off prematurely?
  • ASR Provider Status?

    • Check ASR provider status page (Deepgram, AssemblyAI, Google STT)

Stack 2 Resolution Steps

  1. If codec mismatch: Reconfigure to match expected codec
  2. If ICE failure: Check STUN/TURN servers, NAT traversal
  3. If VAD too aggressive: Increase speech detection threshold
  4. If ASR degraded: Failover to backup ASR provider if available
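As an illustration of tuning VAD sensitivity (step 3), the sketch below uses the py-webrtcvad package; this is an assumption about tooling, since your pipeline may expose VAD configuration differently, for example as a silence-duration threshold rather than an aggressiveness mode.

# A VAD tuning sketch using py-webrtcvad (assumed tooling). Mode 3 is the most
# aggressive about labeling frames as non-speech; relax it if speech is cut off.
import webrtcvad

SAMPLE_RATE = 16000                                  # 8/16/32/48 kHz, 16-bit mono PCM
FRAME_MS = 20                                        # frames must be 10, 20, or 30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2     # 640 bytes per 20 ms frame

vad = webrtcvad.Vad(1)                               # relaxed from 3 to 1

def speech_flags(pcm: bytes) -> list[bool]:
    """Per-frame speech/non-speech decisions for raw 16-bit mono PCM."""
    frames = [pcm[i:i + FRAME_BYTES]
              for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES)]
    return [vad.is_speech(frame, SAMPLE_RATE) for frame in frames]

# One second of digital silence should produce zero speech frames.
print(sum(speech_flags(b"\x00" * SAMPLE_RATE * 2)))  # -> 0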

When to Escalate: If transcripts look correct but agent responses are wrong, escalate to Stack 3 (Intelligence).

How Do You Diagnose Stack 3: Intelligence Failures?

Symptoms:

  • Agent gives wrong or nonsensical responses
  • Long pauses before agent speaks (>2 seconds)
  • Timeout errors in logs
  • Hallucinated information
  • Tool calls failing silently

What Causes LLM Failures?

| Cause | Likelihood | How to Diagnose |
|---|---|---|
| LLM rate limiting | High | Check for 429 errors in logs |
| Prompt corruption | High | Review recent prompt changes, check for injection |
| Context window overflow | Medium | Check token count per turn (approaching limit?) |
| Model endpoint down | Medium | Direct API health check to LLM provider |
| Tool calling failure | Medium | Check function call logs, tool timeout errors |
| Model regression | Low | Compare response quality to baseline |

Stack 3 Key Metrics

| Metric | Normal | Warning | Critical |
|---|---|---|---|
| LLM Response Time | <500ms | 500-1000ms | >1000ms |
| Time to First Token (TTFT) | <300ms | 300-500ms | >500ms |
| Tool Call Success Rate | >99% | 95-99% | <95% |
| Hallucination Rate | <5% | 5-10% | >10% |

Stack 3 Diagnostic Commands

Test LLM endpoint directly:

# Test OpenAI endpoint
curl -X POST https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 50
  }'

# Expected: Response in <1s, no errors
# Red flags: 429 (rate limited), 500 (server error), timeout

# Test Anthropic endpoint
curl -X POST https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 50,
    "messages": [{"role": "user", "content": "Say hello"}]
  }'

Stack 3 Diagnostic Checklist

  • LLM Endpoint Responding?

    • Direct API test (see commands above)
    • Check provider status page
  • Rate Limiting?

    • Look for 429 errors
    • Check tokens per minute usage
  • Prompt Changes?

    • Review recent prompt deployments
    • Check for prompt injection in user input
  • Context Window?

    • Calculate tokens per conversation
    • Approaching 128K/200K limit?
  • Tool Calls Working?

    • Check function call logs
    • Are tools timing out?

Stack 3 Resolution Steps

  1. If rate limited: Reduce request rate, implement backoff, upgrade tier
  2. If prompt corrupted: Revert to last known good prompt
  3. If context overflow: Implement conversation summarization or truncation
  4. If endpoint down: Failover to backup LLM provider
  5. If tools failing: Check external API dependencies, increase timeouts
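Steps 1 and 4 often ship together as retry-with-backoff plus provider failover. The sketch below is illustrative only: call_primary and call_backup are placeholders for your real LLM clients, and RateLimitError stands in for whatever 429 exception your SDK actually raises.

# A sketch of retry-with-backoff plus failover to a backup provider.
import random
import time

class RateLimitError(Exception):
    """Stand-in for a provider 429 error."""

def call_primary(prompt: str) -> str:
    raise RateLimitError("429: tokens-per-minute limit hit")   # simulated outage

def call_backup(prompt: str) -> str:
    return "Hello from the backup provider."

def generate(prompt: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            return call_primary(prompt)
        except RateLimitError:
            # Exponential backoff with jitter; keep delays short for a live call,
            # since the caller is waiting in real time.
            time.sleep(0.2 * (2 ** attempt) + random.uniform(0, 0.1))
    return call_backup(prompt)   # primary still rate limited: fail over

print(generate("Say hello"))     # -> "Hello from the backup provider."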

When to Escalate: If LLM responses look correct but users don't hear them, escalate to Stack 4 (Output).

How Do You Diagnose Stack 4: Output Failures?

Symptoms:

  • Agent silent (no audio output)
  • Garbled or robotic speech
  • Audio cuts off mid-sentence
  • TTS timeout errors
  • Unnatural prosody or pacing

What Causes TTS Failures?

| Cause | Likelihood | How to Diagnose |
|---|---|---|
| TTS service rate limited | High | Check for 429 errors |
| Audio encoding mismatch | Medium | Verify output format matches telephony (PCMU/Opus) |
| Voice ID invalid | Medium | Confirm voice ID exists and is accessible |
| TTS queue backed up | Low | Check queue depth in TTS service |
| Text too long | Low | Check character limits for TTS API |

Stack 4 Key Metrics

| Metric | Normal | Warning | Critical |
|---|---|---|---|
| TTS Latency (TTFB) | <200ms | 200-400ms | >400ms |
| TTS Error Rate | <0.1% | 0.1-1% | >1% |
| Audio Generation Success | >99.9% | 99-99.9% | <99% |

Stack 4 Diagnostic Checklist

  • TTS Service Responding?

    • Direct API test to TTS provider
    • Check provider status page
  • Voice ID Valid?

    • Confirm voice exists in provider dashboard
    • Check voice wasn't deleted or renamed
  • Audio Format Correct?

    • Output should match telephony expectations
    • Common formats: PCM 16-bit 16kHz, Opus
  • Rate Limiting?

    • Check for 429 errors in TTS logs
    • Review characters per minute usage

Stack 4 Resolution Steps

  1. If rate limited: Reduce request rate, implement caching for common phrases
  2. If encoding mismatch: Reconfigure output format to match telephony
  3. If voice ID invalid: Switch to backup voice ID
  4. If TTS down: Failover to backup TTS provider
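A phrase cache (step 1) is often the cheapest mitigation because greetings and confirmations repeat across calls. The sketch below is a minimal in-memory version; synthesize() is a placeholder for your TTS client, and a production cache might live in Redis or on disk with an eviction policy.

# A sketch of an in-memory TTS phrase cache keyed by (voice_id, text).
import hashlib

_tts_cache: dict[str, bytes] = {}

def synthesize(text: str, voice_id: str) -> bytes:            # placeholder TTS call
    return f"<audio {voice_id}:{text}>".encode()

def cached_tts(text: str, voice_id: str) -> bytes:
    key = hashlib.sha256(f"{voice_id}:{text}".encode()).hexdigest()
    if key not in _tts_cache:
        _tts_cache[key] = synthesize(text, voice_id)          # only hit TTS on a miss
    return _tts_cache[key]

cached_tts("Thanks for calling! How can I help you today?", "voice_abc")
cached_tts("Thanks for calling! How can I help you today?", "voice_abc")  # cache hit
print(len(_tts_cache))  # -> 1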

How Do You Handle Cross-Stack Failures?

Sometimes failures span multiple stacks or cascade across them. Signs of cross-stack issues:

  • Symptoms change during the call: Started with audio issues, now LLM is slow
  • Intermittent failures: Works sometimes, fails other times
  • Multiple error types in logs: SIP errors + ASR errors + LLM timeouts

Cross-Stack Diagnostic Approach

  1. Identify the timeline: When did each symptom start?
  2. Find the root cause: Which stack failed first?
  3. Trace the cascade: How did failure in Stack N affect Stack N+1?

Common cascade patterns:

| Initial Failure | Cascade Effect |
|---|---|
| Network latency (Stack 1) | ASR timeouts (Stack 2) → LLM timeouts (Stack 3) |
| ASR returning garbage (Stack 2) | LLM hallucinating (Stack 3) |
| LLM slow (Stack 3) | Turn-taking feels broken, user frustration |
| TTS slow (Stack 4) | User thinks agent died, hangs up |
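The timeline step can be as simple as pulling the first error timestamp per stack and sorting them. The sketch below uses hard-coded illustrative timestamps; in practice you would query them from your logs or traces.

# A sketch of the timeline step: earliest first error points at the root cause.
from datetime import datetime

first_errors = {
    "Stack 3: LLM timeouts":          datetime(2026, 1, 25, 10, 31, 5),
    "Stack 2: ASR timeouts":          datetime(2026, 1, 25, 10, 30, 42),
    "Stack 1: network latency spike": datetime(2026, 1, 25, 10, 30, 12),
}

for stack, ts in sorted(first_errors.items(), key=lambda item: item[1]):
    print(ts.isoformat(), stack)
# The Stack 1 spike comes first, matching the first cascade row in the table above:
# network latency (Stack 1) -> ASR timeouts (Stack 2) -> LLM timeouts (Stack 3).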

What Are the Key Thresholds for Incident Detection?

| Metric | Normal | Warning | Critical |
|---|---|---|---|
| Call Success Rate | >95% | 85-95% | <85% |
| P95 End-to-End Latency | <800ms | 800-1500ms | >1500ms |
| ASR Word Error Rate | <5% | 5-10% | >10% |
| Task Completion Rate | >85% | 70-85% | <70% |
| TTS Timeout Rate | <1% | 1-5% | >5% |
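These thresholds are easy to encode directly in your alerting layer. The sketch below is a minimal illustration using the table's values; the metric names and the evaluate() helper are hypothetical, and the numbers should be recalibrated to your baseline.

# A minimal encoding of the detection thresholds above.
THRESHOLDS = {
    # metric: (warning, critical, direction in which values get worse)
    "call_success_rate":    (0.95, 0.85, "below"),
    "p95_latency_ms":       (800,  1500, "above"),
    "asr_word_error_rate":  (0.05, 0.10, "above"),
    "task_completion_rate": (0.85, 0.70, "below"),
    "tts_timeout_rate":     (0.01, 0.05, "above"),
}

def evaluate(metric: str, value: float) -> str:
    warn, crit, direction = THRESHOLDS[metric]
    breached = (lambda t: value < t) if direction == "below" else (lambda t: value > t)
    if breached(crit):
        return "critical"
    if breached(warn):
        return "warning"
    return "ok"

print(evaluate("call_success_rate", 0.92))  # -> "warning"
print(evaluate("p95_latency_ms", 1700))     # -> "critical"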

What Is the Post-Incident Analysis Template?

After resolving an incident, document what happened to prevent recurrence:

Incident Summary Template

## Incident Summary: [TITLE]

**Date/Time:** YYYY-MM-DD HH:MM - HH:MM (duration)
**Severity:** SEV-1/2/3/4
**Impact:** X calls affected, Y% degradation

### Timeline
- HH:MM - Incident detected (how?)
- HH:MM - On-call paged
- HH:MM - Root cause identified (which stack?)
- HH:MM - Mitigation applied
- HH:MM - Full resolution confirmed

### Root Cause
[Which stack failed? What specifically broke?]

### Resolution
[What fixed it?]

### Action Items
- [ ] Preventive measure 1
- [ ] Monitoring improvement
- [ ] Documentation update

### Lessons Learned
[What would have caught this faster?]

Mean Time to Resolution (MTTR) Benchmarks

| Stack | Without Framework | With Framework | Improvement |
|---|---|---|---|
| Telephony | 45 min | 8 min | 5.6x faster |
| Audio (ASR/VAD) | 60 min | 12 min | 5x faster |
| Intelligence (LLM) | 90 min | 15 min | 6x faster |
| Output (TTS) | 30 min | 10 min | 3x faster |

Source: MTTR data from Hamming's incident response analysis across 50+ production voice agent deployments (2025-2026).

Voice Agent Incident Response Checklist

Use this checklist during any incident:

Immediate (First 5 Minutes):

  • Classify severity (SEV-1/2/3/4)
  • Page appropriate team if SEV-1/2
  • Open incident channel/war room
  • Start decision tree: Can calls connect?

Diagnosis (Next 10-15 Minutes):

  • Work through 4-Stack decision tree
  • Identify which stack is failing
  • Run stack-specific diagnostic checklist
  • Check provider status pages

Mitigation (Next 5-10 Minutes):

  • Apply stack-specific resolution steps
  • Verify fix with test calls
  • Confirm metrics returning to normal
  • Update incident channel

Post-Incident (Within 24 Hours):

  • Complete post-incident analysis template
  • Create action items for prevention
  • Update runbook if new failure mode discovered
  • Share learnings with team

Limitations of This Runbook

Not all incidents fit cleanly into one stack. Some failures cascade across multiple components. The framework helps narrow the search, but complex incidents may require investigating multiple stacks simultaneously.

Assumes basic observability is in place. If you don't have logging or metrics, your first step is instrumenting the system—not this runbook.

Generic by necessity. Your specific voice agent stack (Retell, VAPI, custom LiveKit, etc.) will have platform-specific failure modes. This framework provides the mental model; adapt the diagnostics to your stack.

How Does Hamming Help With Incident Response?

Hamming provides the observability layer that makes incident response faster:

  • 4-Stack Visibility: Unified dashboards showing health across Telephony, Audio, Intelligence, and Output
  • Instant Root Cause: One-click from alert to transcript, audio, and model logs
  • 24/7 Synthetic Calls: Catch outages before customers with continuous testing
  • Automated Alerting: Configurable thresholds with Slack, PagerDuty, and webhook integrations
  • Post-Incident Tracing: Full call traces for post-mortem analysis

Instead of scrambling through multiple dashboards during an incident, your team gets a single source of truth with the context needed to resolve issues fast.

Start monitoring your voice agents →

Frequently Asked Questions

How do you debug a voice agent failure in production?

Use Hamming's 4-Stack Incident Response Framework: check Telephony (calls connecting?), Audio (sound both ways?), Intelligence (correct responses?), Output (agent speaking?). Work stack-by-stack from bottom up. According to Hamming's incident data, 60% of failures are in Stacks 1-2 (Telephony/Audio), so don't jump to LLM debugging first. Target resolution: Stack 1 <5 min, Stack 2 <10 min, Stack 3 <15 min, Stack 4 <10 min.

What causes ASR failures in voice agents?

ASR failures have four primary causes: (1) codec mismatch between telephony and ASR service—check negotiated codec (PCMU, PCMA, Opus), (2) WebRTC ICE negotiation failure—verify STUN/TURN servers reachable, (3) VAD threshold too aggressive cutting off speech—increase detection threshold, (4) ASR provider degradation—check provider status page. Key metrics: ASR latency <300ms, transcription confidence >0.85, audio packet loss <1%.

Why is my voice agent not responding to callers?

Agent not responding typically indicates Stack 2 (Audio) or Stack 3 (Intelligence) failure. First check if audio is reaching the agent by looking for transcripts in logs. If transcripts exist but no response, check LLM logs for timeouts, 429 rate limit errors, or tool call failures. If no transcripts exist, focus on ASR/VAD configuration—VAD may be cutting off speech, or a codec mismatch may be preventing audio processing. According to Hamming data, 50% of 'not responding' issues are audio-layer problems.

What causes dead air on voice agent calls?

Dead air (silence >2 seconds) has four causes: (1) ASR processing delay—check STT latency, target <300ms, (2) LLM response time—check time-to-first-token, target <500ms, (3) TTS synthesis delay—check TTS latency, target <200ms, (4) turn detection issue—endpointing triggering too late. Check each component's latency independently. Total end-to-end latency should be <800ms P95. If one component is slow, it cascades through the entire pipeline.

How do you test whether ASR is working?

Test ASR by: (1) checking transcription output in logs for recent calls, (2) verifying audio is reaching the ASR service by looking for audio events/frames, (3) testing the ASR endpoint directly with a curl command using known audio samples, (4) checking Word Error Rate (WER) if available. Formula: WER = (Substitutions + Deletions + Insertions) / Total Words × 100. Target WER: <5% for clean audio, <10% for noisy conditions. Confidence score should be >0.85.

Why do calls drop mid-conversation?

Mid-call drops indicate: (1) WebRTC ICE failure—check TURN server availability and NAT traversal, (2) SIP session timeout—verify keepalive settings are configured, (3) resource exhaustion—check memory, CPU, connection pool limits, (4) rate limiting—check for 429 errors in ASR, LLM, or TTS services. Log connection state changes to identify the pattern. Track the exact timestamp of drops and correlate with component logs to find the failing stack.

How do you reduce mean time to resolution (MTTR) for voice agent incidents?

Reduce MTTR by: (1) using a systematic framework like Hamming's 4-Stack approach instead of random debugging, (2) setting up real-time alerting on key metrics (call success rate, P95 latency, ASR error rate), (3) creating stack-specific runbooks for common failure modes, (4) automating diagnostic checks with synthetic calls. Teams using structured incident response resolve issues 4-6x faster according to Hamming's data. Pre-populate your incident channel with quick diagnostic commands.

What is the difference between troubleshooting and incident response?

Troubleshooting is diagnostic (understanding why something failed in depth), while incident response is operational (restoring service as quickly as possible). During an active incident, prioritize mitigation over root cause analysis—restart services, failover to backup providers, scale resources. Do thorough root cause analysis after service is restored. Incident response targets: SEV-1 <15 min to mitigate, SEV-2 <30 min. Troubleshooting has no time pressure.

How do you debug slow LLM responses in a voice agent?

Measure LLM latency separately from other components. Check: (1) time from request to first token (TTFT)—target <500ms, (2) total response time—target <1000ms for short responses, (3) 429 rate limit errors in logs. If LLM latency is high but the endpoint is healthy, check prompt length (the context window may be filling up), or consider caching frequent responses. Test the LLM directly with curl to isolate the issue from other pipeline components.

What metrics should you monitor to catch voice agent incidents early?

Monitor these key metrics with alerting thresholds: (1) Call success rate—warning at <95%, critical at <85%, (2) P95 end-to-end latency—warning at >1000ms, critical at >1500ms, (3) ASR error rate/WER—warning at >5%, critical at >10%, (4) TTS timeout rate—warning at >2%, critical at >5%, (5) LLM error rate—warning at >1%, critical at >5%, (6) Task completion rate—warning at <85%, critical at <70%. Hamming provides real-time dashboards with automatic anomaly detection across all four stacks.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As a Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions of dollars in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”