TL;DR: Voice Agent Incident Response in 5 Minutes
Why voice agents need specialized incident response: A single conversation touches ASR, LLM reasoning, TTS, and dialog management simultaneously. Failure at any layer cascades through the pipeline, creating failure modes absent from traditional software systems.
SEV classification for voice agents:
| SEV Level | Scope | Response Time | Example |
|---|---|---|---|
| SEV-1 | All users affected | Immediate (15 min) | Complete service down, data loss, security breach |
| SEV-2 | Significant subset | 30-60 min | High ASR error rates, sustained latency spikes |
| SEV-3 | Isolated degradation | Business hours | Slight quality drop, no SLA violation |
Latency alert thresholds (from 4M+ production calls):
| Percentile | Target | Warning | Critical |
|---|---|---|---|
| P50 | <1.5s | 1.5-2.0s | >2.0s |
| P90 | <3.5s | 3.5-5.0s | >5.0s |
| P95 | <5.0s | 5.0-7.0s | >7.0s |
| P99 | <10s | 10-15s | >15s |
The incident lifecycle: Detect → Classify severity → Execute checklist → Identify failure layer → Communicate → Mitigate → Postmortem → Prevent recurrence
Why Voice Agents Need Specialized Incident Response
Last Updated: February 2026
Voice agents fail differently than traditional software. A web application returns an error page or times out. A voice agent creates silence, garbled speech, wrong answers, or confused loops—all while a real person is waiting on the other end of a phone line.
Standard incident response runbooks assume request-response architectures where failures are binary: the service works or it does not. Voice agents operate across four interdependent layers—ASR, LLM, TTS, and dialog management—where partial failures at one layer cascade unpredictably through the others. A degraded ASR model does not throw an exception. It returns a confident-sounding but incorrect transcript that the LLM interprets literally, generating a plausible but wrong response that TTS renders perfectly. From your monitoring dashboard, everything looks healthy. From the caller's perspective, the agent is incompetent.
This operational framework provides production teams with severity classification, response checklists, failure mode libraries, communication templates, and a postmortem structure purpose-built for voice agent incidents.
Related Guides:
- Voice Agent Incident Response Runbook — 4-Stack diagnostic framework for debugging production failures
- Voice Agent Observability & Tracing — End-to-end distributed tracing for voice systems
- How to Monitor Voice Agent Outages in Real Time — Real-time monitoring framework
- Voice Agent Troubleshooting Guide — Complete diagnostic checklist for ASR, LLM, TTS, and tool failures
- Testing Voice Agents for Production Reliability — Proactive testing to prevent incidents
- Debugging Voice Agents: Logs, Missed Intents & Error Dashboards — Error dashboards and alerting thresholds
Methodology Note: The incident patterns and thresholds in this guide are derived from Hamming's analysis of 4M+ production voice agent calls across 10K+ voice agents (2025-2026). Incident patterns were validated across LiveKit, Pipecat, Vapi, and Retell deployments.
Understanding Voice Agent Incident Response
Why Voice Agents Need Specialized Runbooks
Traditional software incident response assumes failures are observable and isolated. A database query fails and returns an error. A microservice times out and triggers a circuit breaker. Voice agents break this model in three fundamental ways.
Multi-layer cascading failures. A single voice agent turn flows through ASR, LLM reasoning, tool execution, and TTS synthesis. Each layer introduces unique failure modes that compound. An ASR transcription error does not raise an alert—it silently corrupts the LLM's input, producing a confidently wrong response. The failure is invisible until a human listens to the call.
Real-time interaction constraints. Web applications tolerate latency spikes with loading spinners. Voice conversations do not. Any response delay beyond 800ms breaks conversational flow. Users interpret silence as system failure, not processing time. A latency spike that would be a minor annoyance in a web application becomes a call-ending event in voice.
Probabilistic failure modes. ASR accuracy varies by accent, background noise, and vocabulary. LLM responses are non-deterministic. The same input can succeed on one call and fail on the next. This makes reproduction difficult and threshold-based alerting insufficient without percentile distributions.
Key Differences from Traditional Software Incidents
| Dimension | Traditional Software | Voice Agents |
|---|---|---|
| Failure visibility | Error codes, exceptions, HTTP 5xx | Silent degradation, confident wrong answers |
| Latency tolerance | Seconds acceptable with UI feedback | >800ms breaks user experience immediately |
| Reproduction | Deterministic with same input | Probabilistic—same input produces different results |
| Impact scope | Per-request or per-user | Per-conversation turn, compounding across turns |
| Monitoring | Request/response metrics | Multi-layer pipeline metrics per turn |
| Root cause | Usually single component | Often multi-layer cascade |
Incident Severity Taxonomy for Voice Agents
Defining SEV Levels
Severity classification determines response urgency, communication cadence, and escalation paths. Voice agent SEV levels must account for factors absent from traditional software: acoustic quality degradation, turn-level latency distribution, and conversational task completion rates.
SEV-1: Critical Voice Agent Incidents
Definition: Complete service unavailability, data loss, active security breach, or conditions affecting all users requiring immediate response.
Voice-specific SEV-1 triggers:
- All inbound/outbound calls failing to connect
- Zero ASR transcription output (complete STT failure)
- PII/PHI exposure in call recordings or transcripts
- P90 latency exceeding 15 seconds across all calls
- Task completion rate dropping below 10%
Response requirements:
- Acknowledge within 5 minutes
- Incident commander assigned within 10 minutes
- War room established within 15 minutes
- Stakeholder notification within 30 minutes
- Status updates every 30-60 minutes
SEV-2: Major Voice Agent Incidents
Definition: Significant subset of users impacted by sustained quality degradation, high error rates, or latency spikes that violate SLA commitments.
Voice-specific SEV-2 triggers:
- ASR Word Error Rate exceeding 25% for a user segment (accent, language, noise profile)
- P90 end-to-end latency sustained above 7 seconds
- Task completion rate below 50% for specific workflows
- TTS output producing unintelligible audio for a provider region
- Intent misclassification rate exceeding 20%
Response requirements:
- Acknowledge within 15 minutes
- Gather component-level metrics within 30 minutes
- Check recent deployments and configuration changes
- Engage relevant on-call engineers
- Status updates every 2-4 hours
SEV-3: Minor Voice Agent Incidents
Definition: Isolated issues causing slight degradation without SLA violations, addressable during standard business hours.
Voice-specific SEV-3 triggers:
- Intermittent latency spikes affecting less than 5% of calls
- Minor ASR accuracy degradation in edge conditions (heavy background noise)
- Cosmetic TTS issues (slight pronunciation errors for uncommon terms)
- Non-critical tool call failures with graceful fallback working
Response requirements:
- Document issue in tracking system within 24 hours
- Schedule fix for next deployment window
- Daily status update until resolved
Voice-Specific Severity Considerations
Standard severity frameworks rely on availability percentages and error rates. Voice agents require additional classification dimensions:
| Factor | SEV-3 | SEV-2 | SEV-1 |
|---|---|---|---|
| Acoustic quality | Minor artifacts | Intelligibility degraded | Unintelligible output |
| Turn-level latency (P90) | <3.5s (target) | 3.5-7.0s (degraded) | >7.0s (broken) |
| Task completion | >80% | 50-80% | <50% |
| User abandonment | <5% increase | 5-20% increase | >20% increase |
| Scope | Single workflow | User segment or region | All users |
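For teams that want to automate the initial classification, the dimensions above can be encoded directly in the alerting pipeline. Below is a minimal sketch in Python; the thresholds mirror the table, but the function name and inputs are illustrative rather than part of any specific framework.

```python
def classify_severity(p90_latency_s: float,
                      task_completion: float,
                      abandonment_increase: float,
                      scope: str) -> str:
    """Map voice agent impact dimensions to a SEV level (thresholds from the table above)."""
    if (p90_latency_s > 7.0 or task_completion < 0.50
            or abandonment_increase > 0.20 or scope == "all_users"):
        return "SEV-1"
    if (p90_latency_s > 3.5 or task_completion < 0.80
            or abandonment_increase > 0.05 or scope in ("segment", "region")):
        return "SEV-2"
    return "SEV-3"

# Example: degraded latency in one region with elevated abandonment -> SEV-2
print(classify_severity(4.2, 0.76, 0.08, "region"))
```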
Initial Response Checklists
SEV-1 Response Checklist
Execute these steps in order. Do not skip to diagnosis before completing initial response.
Minutes 0-5: Acknowledge and mobilize
- Acknowledge incident in on-call channel
- Page incident commander (IC) and backup
- Create incident ticket with timestamp (UTC)
- Open war room (video call link in channel)
Minutes 5-15: Assess and stabilize
- IC confirms severity classification
- Check: Did a deployment happen in the last 2 hours?
- Check: Are upstream providers (ASR, LLM, TTS) reporting issues?
- Check: Infrastructure metrics (CPU, memory, connection pools)
- Decision: Can we rollback the last deployment? If yes, initiate rollback
- Decision: Can we failover to a backup provider? If yes, initiate failover
Minutes 15-30: Diagnose by layer
- Layer 1 (Telephony): Are calls connecting? Check SIP registration, trunk status
- Layer 2 (ASR): Are transcripts being generated? Check STT latency and confidence
- Layer 3 (LLM): Are responses being generated? Check TTFT, error rates, rate limits
- Layer 4 (TTS): Is audio being synthesized? Check TTS latency and output quality
- Identify the affected layer and focus investigation there
Minutes 30+: Communicate and mitigate
- Send first stakeholder notification (use template below)
- Implement mitigation (even if temporary)
- Verify mitigation is working with synthetic test calls (see the sketch after this checklist)
- Schedule 30-minute update cadence
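A minimal sketch of that synthetic verification step, assuming the agent is reachable over PSTN and you have a Twilio-managed test number; the scenario URL would point to TwiML that plays a scripted caller utterance. All numbers, environment variables, and URLs here are placeholders.

```python
import os
import time
from twilio.rest import Client

client = Client(os.environ["TWILIO_ACCOUNT_SID"], os.environ["TWILIO_AUTH_TOKEN"])

def place_synthetic_call(agent_number: str, test_number: str, scenario_url: str) -> str:
    """Dial the voice agent with a scripted caller and return the final call status."""
    call = client.calls.create(to=agent_number, from_=test_number, url=scenario_url)
    while True:
        status = client.calls(call.sid).fetch().status
        if status in ("completed", "failed", "busy", "no-answer", "canceled"):
            return status
        time.sleep(5)  # poll until the call reaches a terminal state

# A "completed" status only proves the call connected; pair this with transcript and
# latency checks from your observability stack before declaring the mitigation successful.
```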
SEV-2 Response Checklist
Within 15 minutes: Initial assessment
- Acknowledge in on-call channel with severity classification
- Identify affected user segment (geography, language, workflow, time range)
- Pull component-level metrics for the affected segment
- Check recent deployments (prompt changes, model updates, configuration changes)
Within 30 minutes: Focused diagnosis
- Run synthetic test calls reproducing the reported conditions
- Compare current metrics against baseline (previous 24 hours, previous week)
- Check upstream provider status pages (Deepgram, OpenAI, ElevenLabs, etc.)
- Review recent prompt or configuration changes in version control
Within 60 minutes: Mitigation
- Implement fix or workaround
- Validate with test calls and metric verification
- Send stakeholder update with root cause hypothesis and ETA
SEV-3 Response Checklist
- Document issue in tracking system with reproduction steps
- Capture relevant logs, metrics, and sample call recordings
- Classify affected component (ASR, LLM, TTS, dialog, telephony)
- Assess whether issue is trending (getting worse) or stable
- Schedule fix for next planned deployment window
- Add to next team standup agenda
Voice Agent Failure Mode Library
Understanding common failure modes accelerates diagnosis. This library maps symptoms to root causes across the voice agent pipeline.
ASR/STT Layer Failures
ASR failures are insidious because they rarely produce errors. Instead, they produce incorrect transcripts that flow downstream as if they were correct.
| Failure Mode | Symptoms | Common Causes | Diagnostic Steps |
|---|---|---|---|
| Transcription errors | Wrong words in transcript, intent misclassification | Background noise, unsupported accent, low audio bitrate | Check WER against baseline, compare audio quality metrics |
| Complete STT silence | No transcript generated, agent never responds | API key rotation, provider outage, codec mismatch | Verify API connectivity, check audio format negotiation |
| High latency | Long pauses before agent responds | Provider rate limiting, large audio chunks, cold starts | Measure STT processing time per request, check provider metrics |
| Truncated transcripts | Partial recognition, cut-off sentences | VAD endpointing too aggressive, audio stream interruption | Review VAD settings, check audio buffer continuity |
Key insight: 60% of "the agent doesn't understand me" reports trace back to ASR-layer issues, not LLM problems. Always check transcription quality before investigating LLM behavior.
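A quick way to act on this insight during an incident is to spot-check WER on a handful of calls where ground-truth transcripts exist (human-reviewed calls or scripted test calls). A minimal sketch using the open-source jiwer package; the sample data is illustrative.

```python
from jiwer import wer  # pip install jiwer

# Reference transcripts (ground truth) vs. what the ASR layer actually produced
reference = [
    "i want to reschedule my appointment to friday",
    "my account number is four five six seven",
]
hypothesis = [
    "i want to reschedule my appointment to tuesday",
    "my account number is four five six",
]

incident_wer = wer(reference, hypothesis)
print(f"WER: {incident_wer:.1%}")  # compare against your rolling baseline, not an absolute number
```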
LLM/NLU Layer Failures
LLM failures in voice agents occur at 3-10x higher rates than in text systems due to noisy ASR input, real-time latency pressure, and the absence of user correction signals.
| Failure Mode | Symptoms | Common Causes | Diagnostic Steps |
|---|---|---|---|
| Prompt non-compliance | Agent ignores instructions, wrong persona | Prompt too long for context window, conflicting instructions | Review prompt token count, test prompt in isolation |
| Hallucinations | Fabricated information, non-existent options | Missing context grounding, overly creative temperature | Check grounding sources, review temperature settings |
| Intent misclassification | Correct transcript, wrong action | Ambiguous utterance, insufficient training examples | Compare transcript against intent labels, review edge cases |
| Tool call failures | Agent describes action but does not execute | Function schema mismatch, parameter validation failure | Check tool call logs, validate function schemas |
| Rate limiting | Intermittent slow responses, 429 errors | Traffic spike, insufficient API quota | Check provider rate limit headers, review usage patterns |
Latency and Timing Failures
Latency is the most common SEV-2 trigger for voice agents. Component-level tracking is essential for actionable diagnosis.
Production latency benchmarks (from 4M+ calls):
| Component | Target | Warning | Critical |
|---|---|---|---|
| STT processing | <200ms | 200-400ms | >400ms |
| LLM TTFT | <400ms | 400-800ms | >800ms |
| LLM full response | <1000ms | 1000-2000ms | >2000ms |
| TTS TTFB | <150ms | 150-300ms | >300ms |
| Turn detection | <400ms | 400-600ms | >600ms |
| Network overhead | <100ms | 100-200ms | >200ms |
End-to-end latency targets:
| Percentile | Target | Warning | Critical |
|---|---|---|---|
| P50 | <1.5s | 1.5-2.0s | >2.0s |
| P90 | <3.5s | 3.5-5.0s | >5.0s |
| P95 | <5.0s | 5.0-7.0s | >7.0s |
| P99 | <10s | 10-15s | >15s |
Diagnosis approach: Measure each component independently. If end-to-end P90 exceeds 3.5 seconds, identify which component contributes the majority of the delay. A single bottleneck (typically LLM TTFT or STT processing) usually accounts for 60-70% of total latency.
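A sketch of that diagnosis step: compute per-component P90 from turn-level latency records for the incident window and surface the dominant contributor. The record shape is illustrative; substitute whatever your tracing pipeline emits.

```python
import numpy as np

# Each record is one conversation turn with per-component latencies in milliseconds
turns = [
    {"stt_ms": 180, "llm_ttft_ms": 2600, "tts_ttfb_ms": 140, "network_ms": 90},
    {"stt_ms": 210, "llm_ttft_ms": 2900, "tts_ttfb_ms": 160, "network_ms": 110},
    # ... pulled from your traces for the incident window
]

components = ["stt_ms", "llm_ttft_ms", "tts_ttfb_ms", "network_ms"]
p90 = {c: float(np.percentile([t[c] for t in turns], 90)) for c in components}
total = sum(p90.values())  # note: summed component P90s approximate, not equal, the e2e P90

for name, value in sorted(p90.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} P90={value:7.0f}ms  ({value / total:.0%} of component total)")
```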
TTS Output Failures
| Failure Mode | Symptoms | Common Causes | Diagnostic Steps |
|---|---|---|---|
| Unnatural prosody | Robotic or stilted speech patterns | Wrong voice model, SSML rendering issues | Compare against reference audio, check voice model configuration |
| Mispronunciations | Names, acronyms, or domain terms mangled | Missing pronunciation dictionary entries | Review custom lexicon coverage, test specific terms |
| Audio artifacts | Clicks, pops, distortion in agent speech | Encoding mismatch, buffer underrun, sample rate conversion | Check audio encoding chain, verify sample rate consistency |
| Complete TTS failure | Agent generates response but no audio | API key rotation, provider outage, audio routing failure | Verify TTS API connectivity, check audio output pipeline |
Dialog Flow and Conversation Logic Failures
| Failure Mode | Symptoms | Common Causes | Diagnostic Steps |
|---|---|---|---|
| Conversational loops | Agent repeats same question or response | Missing state tracking, prompt logic error | Review conversation history injection, check for circular flows |
| Excessive interruptions | Agent talks over user, fails to yield | VAD thresholds too low, barge-in handling misconfigured | Review turn detection settings, check interruption handling |
| Poor turn-taking | Awkward pauses, overlapping speech | Endpointing timeout too high or too low | Tune VAD silence threshold, review turn detection latency |
| Context loss | Agent forgets earlier conversation context | Context window overflow, history truncation | Check token counts, verify conversation history management |
Monitoring and Alerting Reference
Critical Voice Agent Metrics
Monitor these metrics continuously with automated alerting:
| Metric | What It Measures | Alert Threshold (Warning) | Alert Threshold (Critical) |
|---|---|---|---|
| Call success rate | Calls completing without errors | <95% | <85% |
| ASR accuracy (WER) | Transcription quality | >10% | >20% |
| P90 end-to-end latency | Response speed for 90th percentile | >3.5s | >5.0s |
| Intent accuracy | Correct intent classification | <90% | <80% |
| Task completion rate | Successfully completed user tasks | <85% | <70% |
| Interruption false positive rate | Background noise triggering barge-in | >5% | >10% |
| TTS error rate | Failed audio synthesis | >2% | >5% |
| Escalation rate | Calls transferred to humans | >25% | >40% |
Alert Configuration Best Practices
Use percentile-based alerts, not averages. A P90 latency of 3.5 seconds is actionable. A mean latency of 1.2 seconds hides the fact that 10% of users are waiting 5+ seconds.
Configure component-level breakdowns. When P90 end-to-end latency alerts, you need to know which component caused it. Instrument separate metrics for STT, LLM TTFT, TTS TTFB, and network overhead.
Set baseline-relative thresholds. Static thresholds miss gradual drift. Alert when a metric deviates more than 50% from its 7-day baseline at the same time of day.
Avoid alert fatigue. Group related alerts (e.g., high latency + low task completion) into a single incident rather than separate notifications. Use severity-based routing: SEV-1 pages on-call immediately, SEV-3 creates a ticket.
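A sketch of the baseline-relative rule described above: compare the current value of a metric against its 7-day same-hour baseline and flag deviations greater than 50%. Function names and the data source are illustrative.

```python
from statistics import median

def baseline_deviation_alert(current: float, same_hour_history: list[float],
                             max_deviation: float = 0.5) -> bool:
    """Alert when a metric drifts more than max_deviation from its 7-day same-hour baseline."""
    baseline = median(same_hour_history)  # one sample per day at this hour, past 7 days
    if baseline == 0:
        return current > 0
    return abs(current - baseline) / baseline > max_deviation

# Example: P90 latency (seconds) at 14:00 UTC over the last 7 days vs. today's value
history_p90_s = [3.1, 3.3, 3.0, 3.2, 3.4, 3.1, 3.2]
print(baseline_deviation_alert(5.1, history_p90_s))  # True -> open an incident
```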
Using OpenTelemetry for Voice Agent Tracing
Instrument distributed tracing with a conversation-scoped trace ID that propagates across all components:
trace_id: conv-{uuid}
├── span: stt_processing (duration, confidence, transcript)
├── span: intent_classification (intent, confidence, entities)
├── span: llm_reasoning (model, tokens_in, tokens_out, ttft)
│ ├── span: tool_call (function, args, result, duration)
│ └── span: tool_call (function, args, result, duration)
├── span: response_generation (template, variables)
└── span: tts_synthesis (voice_id, duration, audio_length)
Key attributes to capture per span:
- conversation.id — Unique conversation identifier
- turn.index — Turn number within the conversation
- component.provider — ASR/LLM/TTS provider name and version
- component.latency_ms — Processing time for this component
- component.error — Error type and message if failed
This trace structure enables drill-down from high-level latency alerts to the specific component and provider call causing the issue.
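A minimal sketch of that instrumentation using the OpenTelemetry Python SDK, with a console exporter standing in for your real backend. Span names and attributes follow the structure above; the provider names and per-component processing calls are hypothetical placeholders.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("voice-agent")

def handle_turn(conversation_id: str, turn_index: int, audio_chunk: bytes) -> None:
    with tracer.start_as_current_span("turn") as turn_span:
        turn_span.set_attribute("conversation.id", conversation_id)
        turn_span.set_attribute("turn.index", turn_index)

        with tracer.start_as_current_span("stt_processing") as span:
            span.set_attribute("component.provider", "your-asr-provider/model")
            transcript = "..."  # hypothetical: call your STT provider here

        with tracer.start_as_current_span("llm_reasoning") as span:
            span.set_attribute("component.provider", "your-llm-provider/model")
            response = "..."    # hypothetical: call your LLM here

        with tracer.start_as_current_span("tts_synthesis") as span:
            span.set_attribute("component.provider", "your-tts-provider/voice-id")
            # hypothetical: synthesize and stream audio back to the caller
```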
Incident Communication Templates
Internal Communication Template
Use this structure for war room updates and internal Slack messages:
[SEV-{level}] {Brief description} — Update #{number}
Status: INVESTIGATING | IDENTIFIED | MITIGATING | RESOLVED
Affected: {Component} → {User impact description}
Duration: {Time since detection}
Current findings:
- {What we know}
Next steps:
- {What we are doing next}
- {Who is doing it}
ETA to mitigation: {Estimate or "Investigating"}
Next update: {Time of next update}
External Stakeholder Template
Subject: [Voice Agent Service] {Status} — {Brief description}
We are aware of an issue affecting {description of impacted functionality}.
Impact: {What users are experiencing, in non-technical terms}
Status: Our engineering team is actively investigating.
We will provide an update by {time}.
We apologize for any inconvenience.
Do not speculate on root cause in external communications. State what is happening, not why.
Update Frequency by Severity
| Severity | Update Frequency | Channel |
|---|---|---|
| SEV-1 | Every 30-60 minutes | War room + status page + stakeholder email |
| SEV-2 | Every 2-4 hours | Incident channel + stakeholder email |
| SEV-3 | Daily summary | Team channel + tracking ticket |
Resolution Communication Template
Subject: [RESOLVED] {Brief description}
The issue affecting {functionality} has been resolved as of {timestamp UTC}.
Root cause: {One-sentence technical explanation}
Fix applied: {What was changed}
Duration: {Total incident duration}
Prevention: {What we are doing to prevent recurrence}
A full postmortem will be published within {48-72 hours}.
Incident Postmortem Template
Postmortem Overview Section
# Voice Agent Incident Postmortem
**Title:** {Descriptive title}
**Date:** {YYYY-MM-DD}
**Duration:** {Start time} - {End time} UTC ({total duration})
**Severity:** SEV-{level}
**Incident Commander:** {Name}
**Participants:** {List of involved team members}
## Summary
{One paragraph: what happened, what was affected, how it was resolved}
Impact Assessment
Quantify the impact across multiple dimensions:
| Dimension | Measurement |
|------------------------|----------------------------------------------------|
| Users affected | {Number or percentage of total users} |
| Calls impacted | {Total calls during incident window} |
| Failed calls | {Calls that failed or were abandoned} |
| Revenue impact | {Estimated revenue loss if applicable} |
| Latency degradation | {P90 during incident vs. baseline P90 of 3.5s} |
| SLA consumption | {Error budget consumed by this incident} |
| Task completion impact | {Task completion rate during incident vs. baseline}|
Timeline of Events
Use UTC timestamps. Include detection, escalation, diagnosis, mitigation, and full resolution milestones.
{HH:MM UTC} — First anomalous metric detected by {monitoring system}
{HH:MM UTC} — Alert triggered: {alert name and threshold}
{HH:MM UTC} — On-call engineer acknowledged
{HH:MM UTC} — Severity classified as SEV-{level}
{HH:MM UTC} — Root cause identified: {brief description}
{HH:MM UTC} — Mitigation applied: {action taken}
{HH:MM UTC} — Metrics returning to baseline
{HH:MM UTC} — Incident resolved, monitoring confirmed recovery
Root Cause Analysis
Applying Five Whys to Voice Agent Incidents
The Five Whys technique works particularly well for voice agent incidents because cascading failures obscure the true root cause. Ask "why" iteratively until reaching a systemic process failure—not individual human error.
Example:
- Why did users experience 8-second response times? → LLM TTFT spiked to 4 seconds
- Why did LLM TTFT spike? → Context window was hitting the 128K token limit
- Why was context window full? → Conversation history was not being truncated
- Why was truncation not working? → Recent prompt refactor removed the truncation logic
- Why was the removal not caught? → No regression test for context window size
Root cause: Missing regression test coverage for conversation history management. Action: Add automated test that validates context window stays within limits after prompt changes.
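A sketch of that regression test, using tiktoken for token counting. The truncation helper is a stand-in for your actual conversation-history management, and the budget reserves headroom below the 128K limit from the example.

```python
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")
MAX_CONTEXT_TOKENS = 120_000  # headroom below the 128K model limit from the example

def truncate_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages that fit within the token budget (stand-in helper)."""
    kept, used = [], 0
    for msg in reversed(messages):
        tokens = len(ENC.encode(msg["content"]))
        if used + tokens > budget:
            break
        kept.append(msg)
        used += tokens
    return list(reversed(kept))

def test_context_stays_within_limit():
    # Simulate a pathologically long conversation, far beyond the context window
    history = [{"role": "user", "content": "please repeat that back to me " * 50}
               for _ in range(5000)]
    truncated = truncate_history(history, MAX_CONTEXT_TOKENS)
    total = sum(len(ENC.encode(m["content"])) for m in truncated)
    assert 0 < total <= MAX_CONTEXT_TOKENS
```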
The goal is always to identify a systemic improvement—a test, a check, a monitoring alert—not to assign blame.
Common Root Causes in Voice Systems
Based on analysis of production incidents across 4M+ calls:
| Root Cause Category | Frequency | Example |
|---|---|---|
| Prompt changes without regression testing | ~35% | New prompt breaks existing intent handling |
| ASR provider degradation | ~20% | Provider model update reduces accuracy for specific accents |
| Missing component-level monitoring | ~15% | Latency spike undetected because only e2e was monitored |
| Infrastructure scaling | ~12% | Connection pool exhaustion during traffic peak |
| Configuration drift | ~10% | Environment variable mismatch between staging and production |
| Third-party API changes | ~8% | Breaking change in LLM or TTS provider API |
Action Items and Owners
Structure action items with clear ownership and deadlines:
## Action Items
### Immediate (This Sprint)
- [ ] {Action} — Owner: {Name} — Due: {Date}
- [ ] {Action} — Owner: {Name} — Due: {Date}
### Short-term (Next 30 Days)
- [ ] {Action} — Owner: {Name} — Due: {Date}
### Long-term (Next Quarter)
- [ ] {Action} — Owner: {Name} — Due: {Date}
Distinguish between:
- Mitigation actions — What was done to restore service (already completed)
- Prevention actions — What will prevent recurrence (needs scheduling)
- Detection actions — What will catch this faster next time (monitoring improvements)
Lessons Learned
Document 3-5 key takeaways. Focus on system improvements and process gaps, not individual performance.
## Lessons Learned
1. {Takeaway about detection}: How could we have detected this sooner?
2. {Takeaway about response}: What would have made the response faster?
3. {Takeaway about prevention}: What systemic change prevents recurrence?
4. {Takeaway about communication}: Did stakeholders get the right info at the right time?
Post-Incident Process
Postmortem Meeting Guidelines
Timing: Conduct within 48-72 hours while details remain fresh. Waiting longer leads to rationalized narratives rather than accurate reconstructions.
Structure:
- Timeline review (15 min) — Walk through events chronologically
- Root cause analysis (20 min) — Apply Five Whys methodology
- What went well (10 min) — Identify effective response patterns to reinforce
- What could improve (10 min) — Identify gaps without assigning blame
- Action items (15 min) — Assign owners and deadlines
Ground rules:
- Blameless. Focus on systems and processes, not individuals
- Evidence-based. Reference metrics, logs, and traces rather than memory
- Forward-looking. Every discussion point should produce an action item or a confirmed non-action
Converting Production Failures to Test Cases
Every production incident should generate at least one regression test. This converts operational pain into permanent prevention.
Process:
- Extract the failing conversation from production logs (audio + transcript + metadata)
- Identify the specific failure point (which turn, which component, what input triggered the failure)
- Create a test case that replays the failing conditions
- Verify the test case fails before the fix and passes after
- Add to the automated regression suite that runs on every deployment
Test case categories from incidents:
| Incident Type | Test Case Approach |
|---|---|
| ASR failure for specific accent | Add accent-specific audio samples to test set |
| LLM misclassification | Add the misclassified utterance to intent test suite |
| Latency spike from long context | Add context window size validation test |
| Tool call failure | Add integration test for the specific tool with edge case inputs |
| Dialog loop | Add multi-turn test that detects repetitive agent responses |
Teams that systematically convert incidents to tests see a 40-60% reduction in repeat incidents within 6 months.
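A sketch of what such a regression case can look like for an intent-misclassification incident: the failing utterances are exported from production logs and replayed against the agent's intent step on every deployment. The JSON path, module, and classify_intent entry point are hypothetical; substitute your own.

```python
import json
import pytest

from your_agent.nlu import classify_intent  # hypothetical: your agent's intent step

# Utterances extracted from the failing production calls (path is illustrative)
with open("tests/regression/incident_intent_cases.json") as f:
    INCIDENT_CASES = json.load(f)  # [{"call_id": ..., "transcript": ..., "expected_intent": ...}]

@pytest.mark.parametrize("case", INCIDENT_CASES, ids=lambda c: c["call_id"])
def test_incident_regression(case):
    predicted = classify_intent(case["transcript"])
    assert predicted == case["expected_intent"], (
        f"Regression of incident behaviour on call {case['call_id']}"
    )
```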
Updating Runbooks and Monitoring
After every SEV-1 or SEV-2 postmortem, update the following:
- Severity criteria — Did this incident reveal a gap in SEV classification?
- Failure mode library — Add new failure modes discovered during diagnosis
- Alert thresholds — Adjust based on actual impact data from the incident
- Response checklists — Add steps that would have accelerated diagnosis
- Communication templates — Refine based on what worked during stakeholder communication
Tools and Resources
How Teams Implement Incident Response with Hamming
Hamming provides the observability layer that makes structured incident response possible. Teams running voice agents on LiveKit, Pipecat, Vapi, or Retell use Hamming for component-level visibility during incidents:
- Component-level latency breakdown — See STT, LLM, and TTS latency per conversation turn, enabling immediate identification of which layer is causing latency alerts
- Production call replay — Replay any production call with full audio, transcript, and timing data for incident diagnosis and regression test creation
- OpenTelemetry trace ingestion — Ingest distributed traces from your voice agent pipeline for end-to-end correlation during incidents
- Automated regression testing — Run test suites against production recordings to validate fixes before deployment
- Threshold-based alerting — Configure P90 latency, WER, and task completion alerts with Slack, PagerDuty, and webhook integrations to trigger SEV classification automatically
- Synthetic call monitoring — Run automated test calls every 5-15 minutes to detect degradation before real users are affected
- Postmortem data extraction — Export full conversation traces, latency breakdowns, and audio recordings for postmortem analysis and root cause identification
- Regression test creation from incidents — Convert any production failure into a permanent regression test case with one click, preventing recurrence after every postmortem
Essential Monitoring Stack Components
A complete voice agent monitoring stack requires capabilities beyond standard application monitoring:
| Component | Purpose | Voice-Specific Requirement |
|---|---|---|
| Distributed tracing | Correlate events across pipeline | Trace ID per conversation, span per component |
| Percentile-based alerting | Detect latency distribution shifts | P90 at 3.5s, not just average |
| Acoustic quality analysis | Evaluate audio beyond transcripts | MOS scoring, SNR monitoring, codec health |
| Turn-level metrics | Per-turn performance tracking | Latency, confidence, and accuracy per turn |
| Conversation replay | Debug specific incidents | Full audio + transcript + timing reconstruction |
Related Guides
- Voice Agent Incident Response Runbook — 4-Stack diagnostic framework for debugging production failures
- How to Evaluate Voice Agents: Complete Framework — 4-Layer quality framework with metrics and benchmarks
- Voice Agent Observability & Tracing — End-to-end distributed tracing for voice systems
- Voice Agent Monitoring KPIs — Production dashboard metrics and alerting
- Voice Agent Drift Detection Guide — Performance degradation detection
- Voice AI Latency: What's Fast, What's Slow, and How to Fix It — Deep dive on latency optimization

