Voice Agent Incident Response Runbook: SEV Playbook & Postmortem Template

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

February 11, 2026 · Updated February 11, 2026 · 21 min read

TL;DR: Voice Agent Incident Response in 5 Minutes

Why voice agents need specialized incident response: A single conversation touches ASR, LLM reasoning, TTS, and dialog management simultaneously. Failure at any layer cascades through the pipeline, creating failure modes absent from traditional software systems.

SEV classification for voice agents:

| SEV Level | Scope | Response Time | Example |
|---|---|---|---|
| SEV-1 | All users affected | Immediate (15 min) | Complete service down, data loss, security breach |
| SEV-2 | Significant subset | 30-60 min | High ASR error rates, sustained latency spikes |
| SEV-3 | Isolated degradation | Business hours | Slight quality drop, no SLA violation |

Latency alert thresholds (from 4M+ production calls):

| Percentile | Target | Warning | Critical |
|---|---|---|---|
| P50 | <1.5s | 1.5-2.0s | >2.0s |
| P90 | <3.5s | 3.5-5.0s | >5.0s |
| P95 | <5.0s | 5.0-7.0s | >7.0s |
| P99 | <10s | 10-15s | >15s |

The incident lifecycle: Detect → Classify severity → Execute checklist → Identify failure layer → Communicate → Mitigate → Postmortem → Prevent recurrence

Why Voice Agents Need Specialized Incident Response

Last Updated: February 2026

Voice agents fail differently than traditional software. A web application returns an error page or times out. A voice agent creates silence, garbled speech, wrong answers, or confused loops—all while a real person is waiting on the other end of a phone line.

Standard incident response runbooks assume request-response architectures where failures are binary: the service works or it does not. Voice agents operate across four interdependent layers—ASR, LLM, TTS, and dialog management—where partial failures at one layer cascade unpredictably through the others. A degraded ASR model does not throw an exception. It returns a confident-sounding but incorrect transcript that the LLM interprets literally, generating a plausible but wrong response that TTS renders perfectly. From your monitoring dashboard, everything looks healthy. From the caller's perspective, the agent is incompetent.

This operational framework provides production teams with severity classification, response checklists, failure mode libraries, communication templates, and a postmortem structure purpose-built for voice agent incidents.

Methodology Note: The incident patterns and thresholds in this guide are derived from Hamming's analysis of 4M+ production voice agent calls across 10K+ voice agents (2025-2026).

Incident patterns validated across LiveKit, Pipecat, Vapi, and Retell deployments.


Understanding Voice Agent Incident Response

Why Voice Agents Need Specialized Runbooks

Traditional software incident response assumes failures are observable and isolated. A database query fails and returns an error. A microservice times out and triggers a circuit breaker. Voice agents break this model in three fundamental ways.

Multi-layer cascading failures. A single voice agent turn flows through ASR, LLM reasoning, tool execution, and TTS synthesis. Each layer introduces unique failure modes that compound. An ASR transcription error does not raise an alert—it silently corrupts the LLM's input, producing a confidently wrong response. The failure is invisible until a human listens to the call.

Real-time interaction constraints. Web applications tolerate latency spikes with loading spinners. Voice conversations do not. Any response delay beyond 800ms breaks conversational flow. Users interpret silence as system failure, not processing time. A latency spike that would be a minor annoyance in a web application becomes a call-ending event in voice.

Probabilistic failure modes. ASR accuracy varies by accent, background noise, and vocabulary. LLM responses are non-deterministic. The same input can succeed on one call and fail on the next. This makes reproduction difficult and threshold-based alerting insufficient without percentile distributions.

Key Differences from Traditional Software Incidents

| Dimension | Traditional Software | Voice Agents |
|---|---|---|
| Failure visibility | Error codes, exceptions, HTTP 5xx | Silent degradation, confident wrong answers |
| Latency tolerance | Seconds acceptable with UI feedback | >800ms breaks user experience immediately |
| Reproduction | Deterministic with same input | Probabilistic—same input produces different results |
| Impact scope | Per-request or per-user | Per-conversation turn, compounding across turns |
| Monitoring | Request/response metrics | Multi-layer pipeline metrics per turn |
| Root cause | Usually single component | Often multi-layer cascade |

Incident Severity Taxonomy for Voice Agents

Defining SEV Levels

Severity classification determines response urgency, communication cadence, and escalation paths. Voice agent SEV levels must account for factors absent from traditional software: acoustic quality degradation, turn-level latency distribution, and conversational task completion rates.

SEV-1: Critical Voice Agent Incidents

Definition: Complete service unavailability, data loss, active security breach, or conditions affecting all users requiring immediate response.

Voice-specific SEV-1 triggers:

  • All inbound/outbound calls failing to connect
  • Zero ASR transcription output (complete STT failure)
  • PII/PHI exposure in call recordings or transcripts
  • P90 latency exceeding 15 seconds across all calls
  • Task completion rate dropping below 10%

Response requirements:

  • Acknowledge within 5 minutes
  • Incident commander assigned within 10 minutes
  • War room established within 15 minutes
  • Stakeholder notification within 30 minutes
  • Status updates every 30-60 minutes

SEV-2: Major Voice Agent Incidents

Definition: Significant subset of users impacted by sustained quality degradation, high error rates, or latency spikes that violate SLA commitments.

Voice-specific SEV-2 triggers:

  • ASR Word Error Rate exceeding 25% for a user segment (accent, language, noise profile)
  • P90 end-to-end latency sustained above 7 seconds
  • Task completion rate below 50% for specific workflows
  • TTS output producing unintelligible audio for a provider region
  • Intent misclassification rate exceeding 20%

Response requirements:

  • Acknowledge within 15 minutes
  • Gather component-level metrics within 30 minutes
  • Check recent deployments and configuration changes
  • Engage relevant on-call engineers
  • Status updates every 2-4 hours

SEV-3: Minor Voice Agent Incidents

Definition: Isolated issues causing slight degradation without SLA violations, addressable during standard business hours.

Voice-specific SEV-3 triggers:

  • Intermittent latency spikes affecting less than 5% of calls
  • Minor ASR accuracy degradation in edge conditions (heavy background noise)
  • Cosmetic TTS issues (slight pronunciation errors for uncommon terms)
  • Non-critical tool call failures with graceful fallback working

Response requirements:

  • Document issue in tracking system within 24 hours
  • Schedule fix for next deployment window
  • Daily status update until resolved

Voice-Specific Severity Considerations

Standard severity frameworks rely on availability percentages and error rates. Voice agents require additional classification dimensions:

| Factor | SEV-3 | SEV-2 | SEV-1 |
|---|---|---|---|
| Acoustic quality | Minor artifacts | Intelligibility degraded | Unintelligible output |
| Turn-level latency (P90) | <3.5s (target) | 3.5-7.0s (degraded) | >7.0s (broken) |
| Task completion | >80% | 50-80% | <50% |
| User abandonment | <5% increase | 5-20% increase | >20% increase |
| Scope | Single workflow | User segment or region | All users |
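
These dimensions can feed an automatic severity suggestion once monitoring has already flagged an anomaly. Below is a minimal sketch using the thresholds from the table above; the function and field names are illustrative, not a standard or Hamming API.

```python
# Illustrative sketch: suggest a SEV level from the dimensions in the table above.
# Intended to run only after an anomaly has already been flagged by monitoring.
from dataclasses import dataclass

@dataclass
class VoiceWindowMetrics:
    p90_turn_latency_s: float        # turn-level P90 latency, in seconds
    task_completion_rate: float      # 0.0 - 1.0
    abandonment_increase_pct: float  # % increase vs. baseline
    all_users_affected: bool         # True if every caller is in scope

def suggest_severity(m: VoiceWindowMetrics) -> int:
    # SEV-1: global scope, "broken" latency, or collapsed task completion
    if (m.all_users_affected or m.p90_turn_latency_s > 7.0
            or m.task_completion_rate < 0.50 or m.abandonment_increase_pct > 20):
        return 1
    # SEV-2: degraded latency, completion, or abandonment for a segment
    if (m.p90_turn_latency_s > 3.5 or m.task_completion_rate < 0.80
            or m.abandonment_increase_pct > 5):
        return 2
    # Anything else that still tripped an alert is a SEV-3 candidate
    return 3

print(suggest_severity(VoiceWindowMetrics(4.2, 0.76, 6.0, False)))  # -> 2
```

Treat the output as a starting point for the incident commander, not a final classification.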

Initial Response Checklists

SEV-1 Response Checklist

Execute these steps in order. Do not skip to diagnosis before completing initial response.

Minutes 0-5: Acknowledge and mobilize

  • Acknowledge incident in on-call channel
  • Page incident commander (IC) and backup
  • Create incident ticket with timestamp (UTC)
  • Open war room (video call link in channel)

Minutes 5-15: Assess and stabilize

  • IC confirms severity classification
  • Check: Did a deployment happen in the last 2 hours?
  • Check: Are upstream providers (ASR, LLM, TTS) reporting issues?
  • Check: Infrastructure metrics (CPU, memory, connection pools)
  • Decision: Can we roll back the last deployment? If yes, initiate rollback
  • Decision: Can we fail over to a backup provider? If yes, initiate failover

Minutes 15-30: Diagnose by layer

  • Layer 1 (Telephony): Are calls connecting? Check SIP registration, trunk status
  • Layer 2 (ASR): Are transcripts being generated? Check STT latency and confidence
  • Layer 3 (LLM): Are responses being generated? Check TTFT, error rates, rate limits
  • Layer 4 (TTS): Is audio being synthesized? Check TTS latency and output quality
  • Identify the affected layer and focus investigation there

Minutes 30+: Communicate and mitigate

  • Send first stakeholder notification (use template below)
  • Implement mitigation (even if temporary)
  • Verify mitigation is working with synthetic test calls
  • Schedule 30-minute update cadence

SEV-2 Response Checklist

Within 15 minutes: Initial assessment

  • Acknowledge in on-call channel with severity classification
  • Identify affected user segment (geography, language, workflow, time range)
  • Pull component-level metrics for the affected segment
  • Check recent deployments (prompt changes, model updates, configuration changes)

Within 30 minutes: Focused diagnosis

  • Run synthetic test calls reproducing the reported conditions
  • Compare current metrics against baseline (previous 24 hours, previous week)
  • Check upstream provider status pages (Deepgram, OpenAI, ElevenLabs, etc.)
  • Review recent prompt or configuration changes in version control

Within 60 minutes: Mitigation

  • Implement fix or workaround
  • Validate with test calls and metric verification
  • Send stakeholder update with root cause hypothesis and ETA

SEV-3 Response Checklist

  • Document issue in tracking system with reproduction steps
  • Capture relevant logs, metrics, and sample call recordings
  • Classify affected component (ASR, LLM, TTS, dialog, telephony)
  • Assess whether issue is trending (getting worse) or stable
  • Schedule fix for next planned deployment window
  • Add to next team standup agenda

Voice Agent Failure Mode Library

Understanding common failure modes accelerates diagnosis. This library maps symptoms to root causes across the voice agent pipeline.

ASR/STT Layer Failures

ASR failures are insidious because they rarely produce errors. Instead, they produce incorrect transcripts that flow downstream as if they were correct.

| Failure Mode | Symptoms | Common Causes | Diagnostic Steps |
|---|---|---|---|
| Transcription errors | Wrong words in transcript, intent misclassification | Background noise, unsupported accent, low audio bitrate | Check WER against baseline, compare audio quality metrics |
| Complete STT silence | No transcript generated, agent never responds | API key rotation, provider outage, codec mismatch | Verify API connectivity, check audio format negotiation |
| High latency | Long pauses before agent responds | Provider rate limiting, large audio chunks, cold starts | Measure STT processing time per request, check provider metrics |
| Truncated transcripts | Partial recognition, cut-off sentences | VAD endpointing too aggressive, audio stream interruption | Review VAD settings, check audio buffer continuity |

Key insight: 60% of "the agent doesn't understand me" reports trace back to ASR-layer issues, not LLM problems. Always check transcription quality before investigating LLM behavior.
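
Transcription quality can be spot-checked in a few lines. The sketch below assumes the open-source jiwer package and that you keep human-reviewed reference transcripts for a small sample of calls; the sample data and baseline value are placeholders.

```python
# Sketch: corpus-level WER on a sample of recent calls vs. a rolling baseline.
# Assumes `pip install jiwer` and human-reviewed reference transcripts.
import jiwer

def corpus_wer(pairs):
    """pairs: (reference_transcript, asr_transcript) tuples for sampled calls."""
    references = [ref for ref, _ in pairs]
    hypotheses = [hyp for _, hyp in pairs]
    return jiwer.wer(references, hypotheses)

sampled = [  # in practice, pull these from call storage for the last hour
    ("i want to change my appointment to friday", "i want to change my appointment to friday"),
    ("my account number is four five one two", "my account number is or fife one too"),
]
recent_wer = corpus_wer(sampled)
baseline_wer = 0.08  # rolling 7-day baseline from your metrics store
if recent_wer > 0.25 or recent_wer > 1.5 * baseline_wer:
    # 25% WER for a segment is a SEV-2 trigger per the taxonomy above
    print(f"ALERT: ASR WER {recent_wer:.0%} vs. 7-day baseline {baseline_wer:.0%}")
```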

LLM/NLU Layer Failures

LLM failures in voice agents occur at 3-10x higher rates than in text systems due to noisy ASR input, real-time latency pressure, and the absence of user correction signals.

| Failure Mode | Symptoms | Common Causes | Diagnostic Steps |
|---|---|---|---|
| Prompt non-compliance | Agent ignores instructions, wrong persona | Prompt too long for context window, conflicting instructions | Review prompt token count, test prompt in isolation |
| Hallucinations | Fabricated information, non-existent options | Missing context grounding, overly creative temperature | Check grounding sources, review temperature settings |
| Intent misclassification | Correct transcript, wrong action | Ambiguous utterance, insufficient training examples | Compare transcript against intent labels, review edge cases |
| Tool call failures | Agent describes action but does not execute | Function schema mismatch, parameter validation failure | Check tool call logs, validate function schemas |
| Rate limiting | Intermittent slow responses, 429 errors | Traffic spike, insufficient API quota | Check provider rate limit headers, review usage patterns |

Latency and Timing Failures

Latency is the most common SEV-2 trigger for voice agents. Component-level tracking is essential for actionable diagnosis.

Production latency benchmarks (from 4M+ calls):

| Component | Target | Warning | Critical |
|---|---|---|---|
| STT processing | <200ms | 200-400ms | >400ms |
| LLM TTFT | <400ms | 400-800ms | >800ms |
| LLM full response | <1000ms | 1000-2000ms | >2000ms |
| TTS TTFB | <150ms | 150-300ms | >300ms |
| Turn detection | <400ms | 400-600ms | >600ms |
| Network overhead | <100ms | 100-200ms | >200ms |

End-to-end latency targets:

| Percentile | Target | Warning | Critical |
|---|---|---|---|
| P50 | <1.5s | 1.5-2.0s | >2.0s |
| P90 | <3.5s | 3.5-5.0s | >5.0s |
| P95 | <5.0s | 5.0-7.0s | >7.0s |
| P99 | <10s | 10-15s | >15s |

Diagnosis approach: Measure each component independently. If end-to-end P90 exceeds 3.5 seconds, identify which component contributes the majority of the delay. A single bottleneck (typically LLM TTFT or STT processing) usually accounts for 60-70% of total latency.
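
As a rough illustration of that approach, per-turn component timings can be ranked by their P90 contribution. The field names below are ours; adapt them to your own trace or metrics schema.

```python
# Sketch: rank components by P90 latency from per-turn timing records.
# Summed per-component P90s overstate the true end-to-end P90, so read the
# output as a ranking of likely bottlenecks, not an exact attribution.
from statistics import quantiles

def p90(values):
    return quantiles(values, n=10)[-1]  # 90th percentile

turns = [  # per-turn timings in ms; in practice, pull these from traces
    {"stt_ms": 180, "llm_ttft_ms": 1900, "tts_ttfb_ms": 140, "network_ms": 90},
    {"stt_ms": 210, "llm_ttft_ms": 2400, "tts_ttfb_ms": 150, "network_ms": 110},
    {"stt_ms": 190, "llm_ttft_ms": 2100, "tts_ttfb_ms": 130, "network_ms": 95},
]

breakdown = {name: p90([t[name] for t in turns]) for name in turns[0]}
total = sum(breakdown.values())
for name, value in sorted(breakdown.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} P90 {value:7.1f} ms  ({value / total:.0%} of summed P90s)")
```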

TTS Output Failures

| Failure Mode | Symptoms | Common Causes | Diagnostic Steps |
|---|---|---|---|
| Unnatural prosody | Robotic or stilted speech patterns | Wrong voice model, SSML rendering issues | Compare against reference audio, check voice model configuration |
| Mispronunciations | Names, acronyms, or domain terms mangled | Missing pronunciation dictionary entries | Review custom lexicon coverage, test specific terms |
| Audio artifacts | Clicks, pops, distortion in agent speech | Encoding mismatch, buffer underrun, sample rate conversion | Check audio encoding chain, verify sample rate consistency |
| Complete TTS failure | Agent generates response but no audio | API key rotation, provider outage, audio routing failure | Verify TTS API connectivity, check audio output pipeline |

Dialog Flow and Conversation Logic Failures

| Failure Mode | Symptoms | Common Causes | Diagnostic Steps |
|---|---|---|---|
| Conversational loops | Agent repeats same question or response | Missing state tracking, prompt logic error | Review conversation history injection, check for circular flows |
| Excessive interruptions | Agent talks over user, fails to yield | VAD thresholds too low, barge-in handling misconfigured | Review turn detection settings, check interruption handling |
| Poor turn-taking | Awkward pauses, overlapping speech | Endpointing timeout too high or too low | Tune VAD silence threshold, review turn detection latency |
| Context loss | Agent forgets earlier conversation context | Context window overflow, history truncation | Check token counts, verify conversation history management |

Monitoring and Alerting Reference

Critical Voice Agent Metrics

Monitor these metrics continuously with automated alerting:

| Metric | What It Measures | Alert Threshold (Warning) | Alert Threshold (Critical) |
|---|---|---|---|
| Call success rate | Calls completing without errors | <95% | <85% |
| ASR accuracy (WER) | Transcription quality | >10% | >20% |
| P90 end-to-end latency | Response speed for 90th percentile | >3.5s | >5.0s |
| Intent accuracy | Correct intent classification | <90% | <80% |
| Task completion rate | Successfully completed user tasks | <85% | <70% |
| Interruption false positive rate | Background noise triggering barge-in | >5% | >10% |
| TTS error rate | Failed audio synthesis | >2% | >5% |
| Escalation rate | Calls transferred to humans | >25% | >40% |

Alert Configuration Best Practices

Use percentile-based alerts, not averages. A P90 latency of 3.5 seconds is actionable. A mean latency of 1.2 seconds hides the fact that 10% of users are waiting 5+ seconds.

Configure component-level breakdowns. When P90 end-to-end latency alerts, you need to know which component caused it. Instrument separate metrics for STT, LLM TTFT, TTS TTFB, and network overhead.

Set baseline-relative thresholds. Static thresholds miss gradual drift. Alert when a metric deviates more than 50% from its 7-day baseline at the same time of day.

Avoid alert fatigue. Group related alerts (e.g., high latency + low task completion) into a single incident rather than separate notifications. Use severity-based routing: SEV-1 pages on-call immediately, SEV-3 creates a ticket.
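
A minimal sketch of a baseline-relative P90 check is shown below, assuming you can query a P90 latency value for an arbitrary time window from your metrics store; fetch_p90 is a placeholder for that query (e.g. against Prometheus or your warehouse).

```python
# Sketch: alert when current P90 breaches the static critical threshold or
# drifts >50% above the 7-day baseline at the same time of day.
from datetime import datetime, timedelta, timezone

def check_p90(fetch_p90, now=None, deviation=0.5, critical_s=5.0):
    """fetch_p90(start, end) -> P90 end-to-end latency in seconds for that window."""
    now = now or datetime.now(timezone.utc)
    current = fetch_p90(now - timedelta(minutes=15), now)
    same_time_of_day = [
        fetch_p90(now - timedelta(days=d, minutes=15), now - timedelta(days=d))
        for d in range(1, 8)
    ]
    baseline = sum(same_time_of_day) / len(same_time_of_day)
    if current > critical_s:
        return ("critical", f"P90 {current:.1f}s exceeds {critical_s:.1f}s")
    if current > baseline * (1 + deviation):
        return ("warning", f"P90 {current:.1f}s is {current / baseline - 1:.0%} above baseline {baseline:.1f}s")
    return None

def fake_store(start, end):
    # Pretend the last 15 minutes are slow (5.6s) while history is healthy (3.4s)
    return 5.6 if end > datetime.now(timezone.utc) - timedelta(hours=1) else 3.4

print(check_p90(fake_store))  # -> ('critical', 'P90 5.6s exceeds 5.0s')
```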

Using OpenTelemetry for Voice Agent Tracing

Instrument distributed tracing with a conversation-scoped trace ID that propagates across all components:

trace_id: conv-{uuid}
├── span: stt_processing (duration, confidence, transcript)
├── span: intent_classification (intent, confidence, entities)
├── span: llm_reasoning (model, tokens_in, tokens_out, ttft)
│   ├── span: tool_call (function, args, result, duration)
│   └── span: tool_call (function, args, result, duration)
├── span: response_generation (template, variables)
└── span: tts_synthesis (voice_id, duration, audio_length)

Key attributes to capture per span:

  • conversation.id — Unique conversation identifier
  • turn.index — Turn number within the conversation
  • component.provider — ASR/LLM/TTS provider name and version
  • component.latency_ms — Processing time for this component
  • component.error — Error type and message if failed

This trace structure enables drill-down from high-level latency alerts to the specific component and provider call causing the issue.
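
For reference, a single turn instrumented with the standard OpenTelemetry Python API might look like the sketch below. Only three of the spans are shown; run_stt, run_llm, and run_tts are stubs standing in for your own pipeline calls, the provider strings are example values, and exporter configuration is omitted.

```python
# Sketch: one conversation turn emitting spans with the attributes listed above.
# Requires only opentelemetry-api; wire up an SDK + exporter to ship the traces.
from opentelemetry import trace

tracer = trace.get_tracer("voice-agent")

def run_stt(audio: bytes) -> str: return "example transcript"  # stub for your ASR call
def run_llm(text: str) -> str: return "example reply"          # stub for your LLM call
def run_tts(text: str) -> bytes: return b"\x00"                # stub for your TTS call

def handle_turn(conversation_id: str, turn_index: int, audio_chunk: bytes) -> bytes:
    common = {"conversation.id": conversation_id, "turn.index": turn_index}

    with tracer.start_as_current_span("stt_processing", attributes=common) as span:
        transcript = run_stt(audio_chunk)
        span.set_attribute("component.provider", "deepgram/nova-2")  # example value

    with tracer.start_as_current_span("llm_reasoning", attributes=common) as span:
        reply = run_llm(transcript)  # tool_call child spans would nest in here
        span.set_attribute("component.provider", "openai/gpt-4o")    # example value

    with tracer.start_as_current_span("tts_synthesis", attributes=common) as span:
        audio_out = run_tts(reply)
        span.set_attribute("component.provider", "elevenlabs")       # example value

    return audio_out

handle_turn("conv-123", 0, b"\x00")  # span duration and raised exceptions are recorded automatically
```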


Incident Communication Templates

Internal Communication Template

Use this structure for war room updates and internal Slack messages:

[SEV-{level}] {Brief description} - Update #{number}

Status: INVESTIGATING | IDENTIFIED | MITIGATING | RESOLVED
Affected: {Component} - {User impact description}
Duration: {Time since detection}

Current findings:
- {What we know}

Next steps:
- {What we are doing next}
- {Who is doing it}

ETA to mitigation: {Estimate or "Investigating"}
Next update: {Time of next update}

External Stakeholder Template

Subject: [Voice Agent Service] {Status} - {Brief description}

We are aware of an issue affecting {description of impacted functionality}.

Impact: {What users are experiencing, in non-technical terms}
Status: Our engineering team is actively investigating.

We will provide an update by {time}.

We apologize for any inconvenience.

Do not speculate on root cause in external communications. State what is happening, not why.

Update Frequency by Severity

| Severity | Update Frequency | Channel |
|---|---|---|
| SEV-1 | Every 30-60 minutes | War room + status page + stakeholder email |
| SEV-2 | Every 2-4 hours | Incident channel + stakeholder email |
| SEV-3 | Daily summary | Team channel + tracking ticket |

Resolution Communication Template

Subject: [RESOLVED] {Brief description}

The issue affecting {functionality} has been resolved as of {timestamp UTC}.

Root cause: {One-sentence technical explanation}
Fix applied: {What was changed}
Duration: {Total incident duration}
Prevention: {What we are doing to prevent recurrence}

A full postmortem will be published within {48-72 hours}.

Incident Postmortem Template

Postmortem Overview Section

# Voice Agent Incident Postmortem

**Title:** {Descriptive title}
**Date:** {YYYY-MM-DD}
**Duration:** {Start time} - {End time} UTC ({total duration})
**Severity:** SEV-{level}
**Incident Commander:** {Name}
**Participants:** {List of involved team members}

## Summary
{One paragraph: what happened, what was affected, how it was resolved}

Impact Assessment

Quantify the impact across multiple dimensions:

| Dimension              | Measurement                                        |
|------------------------|----------------------------------------------------|
| Users affected         | {Number or percentage of total users}              |
| Calls impacted         | {Total calls during incident window}               |
| Failed calls           | {Calls that failed or were abandoned}              |
| Revenue impact         | {Estimated revenue loss if applicable}             |
| Latency degradation    | {P90 during incident vs. baseline P90 of 3.5s}    |
| SLA consumption        | {Error budget consumed by this incident}           |
| Task completion impact | {Task completion rate during incident vs. baseline}|

Timeline of Events

Use UTC timestamps. Include detection, escalation, diagnosis, mitigation, and full resolution milestones.

{HH:MM UTC} - First anomalous metric detected by {monitoring system}
{HH:MM UTC} - Alert triggered: {alert name and threshold}
{HH:MM UTC} - On-call engineer acknowledged
{HH:MM UTC} - Severity classified as SEV-{level}
{HH:MM UTC} - Root cause identified: {brief description}
{HH:MM UTC} - Mitigation applied: {action taken}
{HH:MM UTC} - Metrics returning to baseline
{HH:MM UTC} - Incident resolved, monitoring confirmed recovery

Root Cause Analysis

Applying Five Whys to Voice Agent Incidents

The Five Whys technique works particularly well for voice agent incidents because cascading failures obscure the true root cause. Ask "why" iteratively until reaching a systemic process failure—not individual human error.

Example:

  1. Why did users experience 8-second response times? → LLM TTFT spiked to 4 seconds
  2. Why did LLM TTFT spike? → Context window was hitting the 128K token limit
  3. Why was context window full? → Conversation history was not being truncated
  4. Why was truncation not working? → Recent prompt refactor removed the truncation logic
  5. Why was the removal not caught? → No regression test for context window size

Root cause: Missing regression test coverage for conversation history management. Action: Add automated test that validates context window stays within limits after prompt changes.

The goal is always to identify a systemic improvement—a test, a check, a monitoring alert—not to assign blame.
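
A regression test like the one called for above can be small. The sketch below is a pytest-style example under explicit assumptions: build_messages, load_current_prompt, and count_tokens stand in for your own prompt-assembly code and tokenizer (e.g. tiktoken), and the 128K limit matches the model in the example.

```python
# Regression-test sketch for conversation-history truncation (pytest style).
# The imported helpers are hypothetical stand-ins for your own codebase.
from agent.prompting import build_messages, load_current_prompt, count_tokens

CONTEXT_LIMIT_TOKENS = 128_000
RESPONSE_HEADROOM_TOKENS = 4_000  # leave room for the model's reply

def test_long_conversation_stays_within_context_window():
    # Simulate an unusually long call: 1,000 messages of ~300 words each
    history = [
        {"role": "user", "content": "word " * 300},
        {"role": "assistant", "content": "word " * 300},
    ] * 500

    messages = build_messages(system_prompt=load_current_prompt(), history=history)
    total_tokens = sum(count_tokens(m["content"]) for m in messages)

    assert total_tokens <= CONTEXT_LIMIT_TOKENS - RESPONSE_HEADROOM_TOKENS, (
        f"Prompt assembly produced {total_tokens} tokens; history truncation is not working"
    )
```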

Common Root Causes in Voice Systems

Based on analysis of production incidents across 4M+ calls:

| Root Cause Category | Frequency | Example |
|---|---|---|
| Prompt changes without regression testing | ~35% | New prompt breaks existing intent handling |
| ASR provider degradation | ~20% | Provider model update reduces accuracy for specific accents |
| Missing component-level monitoring | ~15% | Latency spike undetected because only e2e was monitored |
| Infrastructure scaling | ~12% | Connection pool exhaustion during traffic peak |
| Configuration drift | ~10% | Environment variable mismatch between staging and production |
| Third-party API changes | ~8% | Breaking change in LLM or TTS provider API |

Action Items and Owners

Structure action items with clear ownership and deadlines:

## Action Items

### Immediate (This Sprint)
- [ ] {Action} - Owner: {Name} - Due: {Date}
- [ ] {Action} - Owner: {Name} - Due: {Date}

### Short-term (Next 30 Days)
- [ ] {Action} - Owner: {Name} - Due: {Date}

### Long-term (Next Quarter)
- [ ] {Action} - Owner: {Name} - Due: {Date}

Distinguish between:

  • Mitigation actions — What was done to restore service (already completed)
  • Prevention actions — What will prevent recurrence (needs scheduling)
  • Detection actions — What will catch this faster next time (monitoring improvements)

Lessons Learned

Document 3-5 key takeaways. Focus on system improvements and process gaps, not individual performance.

## Lessons Learned

1. {Takeaway about detection}: How could we have detected this sooner?
2. {Takeaway about response}: What would have made the response faster?
3. {Takeaway about prevention}: What systemic change prevents recurrence?
4. {Takeaway about communication}: Did stakeholders get the right info at the right time?

Post-Incident Process

Postmortem Meeting Guidelines

Timing: Conduct within 48-72 hours while details remain fresh. Waiting longer leads to rationalized narratives rather than accurate reconstructions.

Structure:

  1. Timeline review (15 min) — Walk through events chronologically
  2. Root cause analysis (20 min) — Apply Five Whys methodology
  3. What went well (10 min) — Identify effective response patterns to reinforce
  4. What could improve (10 min) — Identify gaps without assigning blame
  5. Action items (15 min) — Assign owners and deadlines

Ground rules:

  • Blameless. Focus on systems and processes, not individuals
  • Evidence-based. Reference metrics, logs, and traces rather than memory
  • Forward-looking. Every discussion point should produce an action item or a confirmed non-action

Converting Production Failures to Test Cases

Every production incident should generate at least one regression test. This converts operational pain into permanent prevention.

Process:

  1. Extract the failing conversation from production logs (audio + transcript + metadata)
  2. Identify the specific failure point (which turn, which component, what input triggered the failure)
  3. Create a test case that replays the failing conditions
  4. Verify the test case fails before the fix and passes after
  5. Add to the automated regression suite that runs on every deployment
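
As an illustration of step 3, a replayed incident can become a pytest case like the sketch below. The fixture path, the run_agent_turn harness, and the loop-detection assertion are all assumptions for illustration; the assertion should encode whatever symptom the incident actually produced.

```python
# Sketch: replay a production failure (here, a conversational loop) as a
# permanent regression test. run_agent_turn is a hypothetical harness that
# executes one agent turn against a given conversation history.
import json

from agent.harness import run_agent_turn  # hypothetical test harness

def load_incident_fixture(path: str) -> dict:
    with open(path) as f:
        return json.load(f)  # exported from the incident: user turns + metadata

def test_incident_refund_loop_does_not_repeat():
    fixture = load_incident_fixture("tests/fixtures/incident_refund_loop.json")
    history, agent_replies = [], []

    for user_turn in fixture["user_turns"]:
        reply = run_agent_turn(history=history, user_input=user_turn)
        history += [{"role": "user", "content": user_turn},
                    {"role": "assistant", "content": reply}]
        agent_replies.append(reply)

    # Incident symptom: the agent repeated an identical response (dialog loop)
    assert len(set(agent_replies)) == len(agent_replies), "Agent repeated itself verbatim"
```

Run the test before applying the fix to confirm it fails, then after to confirm it passes, and add it to the suite that gates every deployment.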

Test case categories from incidents:

| Incident Type | Test Case Approach |
|---|---|
| ASR failure for specific accent | Add accent-specific audio samples to test set |
| LLM misclassification | Add the misclassified utterance to intent test suite |
| Latency spike from long context | Add context window size validation test |
| Tool call failure | Add integration test for the specific tool with edge case inputs |
| Dialog loop | Add multi-turn test that detects repetitive agent responses |

Teams that systematically convert incidents to tests see a 40-60% reduction in repeat incidents within 6 months.

Updating Runbooks and Monitoring

After every SEV-1 or SEV-2 postmortem, update the following:

  • Severity criteria — Did this incident reveal a gap in SEV classification?
  • Failure mode library — Add new failure modes discovered during diagnosis
  • Alert thresholds — Adjust based on actual impact data from the incident
  • Response checklists — Add steps that would have accelerated diagnosis
  • Communication templates — Refine based on what worked during stakeholder communication

Tools and Resources

How Teams Implement Incident Response with Hamming

Hamming provides the observability layer that makes structured incident response possible. Teams running voice agents on LiveKit, Pipecat, Vapi, or Retell use Hamming for component-level visibility during incidents:

  • Component-level latency breakdown — See STT, LLM, and TTS latency per conversation turn, enabling immediate identification of which layer is causing latency alerts
  • Production call replay — Replay any production call with full audio, transcript, and timing data for incident diagnosis and regression test creation
  • OpenTelemetry trace ingestion — Ingest distributed traces from your voice agent pipeline for end-to-end correlation during incidents
  • Automated regression testing — Run test suites against production recordings to validate fixes before deployment
  • Threshold-based alerting — Configure P90 latency, WER, and task completion alerts with Slack, PagerDuty, and webhook integrations to trigger SEV classification automatically
  • Synthetic call monitoring — Run automated test calls every 5-15 minutes to detect degradation before real users are affected
  • Postmortem data extraction — Export full conversation traces, latency breakdowns, and audio recordings for postmortem analysis and root cause identification
  • Regression test creation from incidents — Convert any production failure into a permanent regression test case with one click, preventing recurrence after every postmortem

Essential Monitoring Stack Components

A complete voice agent monitoring stack requires capabilities beyond standard application monitoring:

| Component | Purpose | Voice-Specific Requirement |
|---|---|---|
| Distributed tracing | Correlate events across pipeline | Trace ID per conversation, span per component |
| Percentile-based alerting | Detect latency distribution shifts | P90 at 3.5s, not just average |
| Acoustic quality analysis | Evaluate audio beyond transcripts | MOS scoring, SNR monitoring, codec health |
| Turn-level metrics | Per-turn performance tracking | Latency, confidence, and accuracy per turn |
| Conversation replay | Debug specific incidents | Full audio + transcript + timing reconstruction |


Frequently Asked Questions

What is the difference between a runbook and a playbook for voice agent incidents?

Runbooks provide tactical step-by-step checklists for specific failure scenarios—they tell responders exactly what to check and in what order during an active incident. Playbooks offer strategic incident response frameworks covering severity classification, escalation paths, communication cadence, and organizational coordination. For voice agents, both are essential: runbooks handle the technical diagnosis (check ASR, then LLM, then TTS), while playbooks manage the operational response (who to notify, when to escalate, how to communicate with stakeholders).

How should severity levels be classified for voice agent incidents?

Voice agent severity classification must factor in acoustic quality, turn-level latency, and task completion rates alongside traditional availability metrics. A voice agent with 99.9% uptime but P90 latency of 7 seconds is effectively a SEV-2 incident because users cannot maintain natural conversation. SEV-1 triggers include complete service unavailability, PII exposure, or P90 latency exceeding 15 seconds. SEV-2 includes sustained P90 above 7 seconds, WER above 25%, or task completion below 50%. SEV-3 covers isolated degradation affecting less than 5% of calls without SLA violations.

Why does fast incident response (MTTR) matter more for voice agents than for traditional software?

Voice agents operate under real-time conversation constraints where every delayed or incorrect response degrades user experience immediately. Unlike web applications where users can wait, refresh, or retry, voice conversations have no visual feedback mechanisms—silence is interpreted as failure. Users abandon voice interactions 3-5x faster than text-based interactions when quality degrades. This means MTTR directly correlates with call abandonment rates and customer satisfaction. Teams using structured incident response frameworks with pre-built checklists and severity playbooks reduce MTTR by 4-6x compared to ad-hoc debugging.

How do voice agent postmortems differ from traditional software postmortems?

Voice agent postmortems must analyze a four-layer infrastructure stack (audio pipeline, STT, LLM, TTS) where failures cascade silently across layers. Traditional postmortems examine error logs and request metrics. Voice postmortems require acoustic data analysis alongside transcripts, turn-level latency breakdowns per component, and conversation replay to understand the user experience. The Five Whys technique is particularly effective because cascading failures obscure true root causes—an apparent LLM failure often traces back to ASR degradation producing corrupted input.

How often should incident communication templates be reviewed and updated?

Review and refine communication templates after every SEV-1 and SEV-2 incident as part of the postmortem process. Evaluate whether the templates provided the right level of detail, whether stakeholders received timely information, and whether the language accurately described the user impact. Common improvements include adding voice-specific impact descriptions (latency degradation, audio quality), adjusting update frequency expectations, and refining the resolution template to include prevention measures. Teams that iterate on templates after each major incident report 30% faster stakeholder alignment during subsequent incidents.

What are the most common root causes of voice agent incidents?

Based on analysis of production incidents across 4M+ calls, the most common systemic root causes are: prompt changes deployed without regression testing (35% of incidents), ASR provider degradation from upstream model updates (20%), missing component-level monitoring that allowed latency spikes to go undetected (15%), infrastructure scaling failures during traffic peaks (12%), configuration drift between staging and production environments (10%), and third-party API breaking changes (8%). The Five Whys consistently reveals that the proximate cause (e.g., high latency) traces back to a process gap (e.g., no automated regression test for context window size).

During an active incident, should teams prioritize fast mitigation or thorough root cause analysis?

Use pre-built templates and checklists for fast initial acknowledgment and structured diagnosis during the active incident—speed is critical when users are experiencing degraded service. Reserve thorough analysis for the scheduled postmortem conducted within 48-72 hours. During the incident, prioritize mitigation over root cause analysis: roll back deployments, fail over to backup providers, or scale resources first. Document observations in real-time for postmortem input but do not pause mitigation efforts for investigation. SEV-1 targets 15-minute mitigation, SEV-2 targets 60 minutes.

Which metrics should a voice agent postmortem include?

Voice agent postmortems should include component-level latency breakdown (STT processing time, LLM TTFT, TTS TTFB), end-to-end latency percentiles (P50, P90 target 3.5s, P95, P99), turn-level timing analysis showing latency distribution across conversation turns, interruption counts and false positive rates, intent classification accuracy during the incident window, entity extraction error rates, Word Error Rate compared to baseline, task completion rates by workflow type, and user abandonment rates compared to pre-incident baseline. Compare all metrics against the 7-day baseline at the same time of day to account for traffic patterns.

What tooling does effective voice agent incident response require?

Effective voice agent incident response requires tools that provide component-level latency breakdown across STT, LLM, and TTS layers, conversation replay with full audio and transcript data, distributed tracing with per-conversation trace IDs, and automated regression test creation from production failures. Hamming provides these capabilities natively for teams running voice agents on LiveKit, Pipecat, Vapi, or Retell, enabling incident responders to drill into per-turn latency data during triage and convert postmortem findings into automated prevention through one-click regression test creation.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As a Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions of dollars in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”