TL;DR: Monitor voice agent outages by tracking ASR WER, NLU intent accuracy, TTS latency percentiles, and API dependency errors in real time. Use synthetic calls and clear alert thresholds to catch silent degradations before customers notice.
Related Guides:
- How to Evaluate Voice Agents: Complete Guide — Hamming's VOICE Framework with all metrics
- Testing Voice Agents for Production Reliability — Hamming's 3-Pillar Testing Framework
- ASR Accuracy Evaluation for Voice Agents — Hamming's 5-Factor ASR Framework
- Multilingual Voice Agent Testing — Hamming's 5-Step Multilingual Testing Framework
- Best Voice Agent Stack — Architecture decisions affecting reliability
Last month, one of our customers had a 45-minute outage. Their voice agent didn't crash—it just started responding 200ms slower than usual. Doesn't sound like much, right? But that extra latency made turn-taking feel off. Users started interrupting more. The agent started cutting them off. Call completion rates dropped 23% before anyone noticed.
They found out because their CSAT scores tanked overnight. By the time they traced it back to an ASR provider having capacity issues, they'd already handled 800 frustrated calls.
The complexity of AI voice agents makes them prone to these silent failures that don't look like outages but absolutely feel like outages to users. Detecting them is difficult without the right observability platform; most teams still rely on manual QA and post-call reviews to discover issues long after users have felt the impact.
Maintaining the reliability of voice agents requires real-time insight into how each layer of the system performs in production. Without that visibility, voice agent outages often go undetected until users start complaining.
In this article, we'll walk you through how to monitor AI voice agent outages in real time.
Quick filter: If you learn about outages from customer complaints, your monitoring is too slow.
Methodology Note: Alert thresholds and monitoring benchmarks in this guide are based on Hamming's analysis of outage patterns across 100+ production voice agents (2025). Thresholds should be calibrated to your specific baseline performance and SLA requirements.
What Is Voice Agent Outage Monitoring?
Voice agent outage monitoring is the continuous measurement of conversation-level reliability signals (ASR accuracy, intent success, TTS latency, and dependency health) to detect silent failures that don’t appear as infrastructure downtime.
Why Are Voice Agent Outages Hard to Detect?
Voice agent outages are often silent: the agent doesn't crash or stop responding, it just behaves differently. That might show up as a slight delay in transcription, a missed intent, or an LLM hallucination.
This makes real-time detection difficult: what looks like a normal interaction in logs is frustrating the user on the other end. Part of the challenge in detecting voice agent outages is the architecture of the voice stack.
Most voice agents depend on multiple probabilistic components: ASR, NLU, TTS, and an LLM. If any layer experiences latency, drift, or a dependency failure, the degradation can ripple across the entire conversation without causing a visible outage.
Another reason voice agent outages are hard to detect is that most teams still rely on manual post-call quality assurance to discover them. By the time an analyst reviews transcripts or listens to recordings, the underlying issue (an API slowdown, a model regression, or an expired integration key) has often already been resolved. What's left behind is a pattern of "fallback" responses, longer pauses, and increased drop-off rates that doesn't explain what went wrong.
How Do You Monitor Voice Agent Outages in Real Time?
Effective outage monitoring requires a systematic approach. Use Hamming's 4-Layer Monitoring Framework to ensure comprehensive coverage across your entire voice stack.
Hamming's 4-Layer Monitoring Framework
Voice agents consist of four interdependent layers. Each layer has distinct failure modes and requires specific monitoring signals:
| Layer | Function | Key Metrics | Failure Mode |
|---|---|---|---|
| ASR | Speech-to-text | Word Error Rate, transcription latency | Silent degradation (mishears but doesn't fail) |
| NLU | Intent classification | Intent accuracy, fallback rate | Wrong routing, default responses |
| TTS | Text-to-speech | P90 latency, audio quality | Delayed or garbled responses |
| API | External dependencies | Error rate, latency percentiles | Timeouts, cascading failures |
Sources: 4-Layer Framework based on Hamming's incident analysis across 100+ production voice agents (2025). Layer categorization aligned with standard voice agent architecture patterns.
How Do You Monitor ASR (Speech Recognition)?
ASR failures are the hardest to detect because the system continues operating—it just mishears users.
Key Metrics:
| Metric | How to Calculate | Healthy | Warning | Critical |
|---|---|---|---|---|
| Word Error Rate (WER) | (S + D + I) / N × 100 | <8% | 8-12% | >12% |
| WER Delta | Current WER - Baseline WER | <2% | 2-5% | >5% |
| Transcription Latency | Time from audio end to transcript | <300ms | 300-500ms | >500ms |
| Confidence Score | ASR model confidence | >0.85 | 0.7-0.85 | <0.7 |
Sources: WER thresholds based on LibriSpeech evaluation standards and Hamming production monitoring data (2025). Latency targets aligned with conversational turn-taking research (Stivers et al., 2009). Confidence thresholds from ASR provider documentation (Deepgram, AssemblyAI, Google STT).
WER Formula:
WER = (Substitutions + Deletions + Insertions) / Total Reference Words × 100
Example: If your baseline WER is 6% and you detect 13%, that's a 7-point delta and a 117% relative degradation: trigger a critical alert.
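Here's a minimal Python sketch of the WER calculation and the alert mapping above; the function names are illustrative and the thresholds mirror the table.

```python
# Minimal sketch: WER and WER-delta checks, using the thresholds from the table above.
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words * 100."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard word-level edit-distance dynamic program.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1) * 100

def wer_alert_level(current_wer: float, baseline_wer: float) -> str:
    """Map absolute WER and WER delta to an alert level."""
    delta = current_wer - baseline_wer
    if current_wer > 12 or delta > 5:
        return "critical"
    if current_wer > 8 or delta > 2:
        return "warning"
    return "healthy"

# Example: baseline 6%, current 13% -> 7-point delta -> "critical"
print(wer_alert_level(current_wer=13.0, baseline_wer=6.0))
```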
How Do You Monitor NLU (Intent Classification)?
NLU failures manifest as increased fallback responses and misrouted conversations.
Key Metrics:
| Metric | Definition | Healthy | Warning | Critical |
|---|---|---|---|---|
| Intent Accuracy | Correct classifications / Total | >92% | 85-92% | <85% |
| Fallback Rate | "Unknown" intents / Total | <5% | 5-10% | >10% |
| Confidence Distribution | % of low-confidence (<0.7) classifications | <10% | 10-20% | >20% |
| Top Intent Drift | Change in top-5 intent distribution | <5% shift | 5-15% shift | >15% shift |
Sources: Intent accuracy benchmarks based on dialogue systems research (Budzianowski et al., 2019) and Hamming customer NLU monitoring data (2025). Fallback rate thresholds derived from contact center industry standards.
Alert Logic:
- If fallback rate doubles within 15 minutes → Warning
- If intent accuracy drops >5% from 24-hour baseline → Critical
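A minimal sketch of that alert logic, assuming you already aggregate fallback rate and intent accuracy into rolling windows; the parameter names are illustrative, and "drops >5%" is read as 5 percentage points.

```python
# Sketch of the NLU alert rules above; metric names and windows are illustrative.
def nlu_alert(fallback_rate_15m: float, fallback_rate_prev_15m: float,
              intent_accuracy_now: float, intent_accuracy_24h_baseline: float) -> str | None:
    # Intent accuracy drops more than 5 points from the 24-hour baseline -> critical
    if intent_accuracy_24h_baseline - intent_accuracy_now > 5:
        return "critical"
    # Fallback rate doubles within 15 minutes -> warning
    if fallback_rate_prev_15m > 0 and fallback_rate_15m >= 2 * fallback_rate_prev_15m:
        return "warning"
    return None
```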
How Do You Monitor TTS (Speech Synthesis)?
TTS failures create awkward pauses and delayed responses that frustrate users.
Key Metrics:
| Metric | Definition | Healthy | Warning | Critical |
|---|---|---|---|---|
| P50 Latency | Median time to first audio byte | <200ms | 200-400ms | >400ms |
| P90 Latency | 90th percentile latency | <500ms | 500-1000ms | >1000ms |
| P99 Latency | 99th percentile latency | <1000ms | 1-2s | >2s |
| Audio Generation Errors | Failed TTS requests / Total | <0.1% | 0.1-1% | >1% |
Sources: TTS latency benchmarks from ElevenLabs, Cartesia, and OpenAI TTS documentation. Error rate thresholds based on Hamming production monitoring across 50+ voice agent deployments (2025).
Why P90/P99 Matter:
- P50 shows typical experience
- P90 shows what 10% of users experience
- P99 catches the worst cases that drive complaints
A system with 200ms P50 but 3s P99 has a hidden outage affecting 1 in 100 calls.
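As a sketch, here's how you might compute those percentiles from raw TTS latency samples using only the standard library; the threshold values mirror the table above.

```python
# Sketch: compute P50/P90/P99 from raw latency samples (milliseconds) and flag breaches.
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}

def tts_alert(p: dict[str, float]) -> str:
    if p["p50"] > 400 or p["p90"] > 1000 or p["p99"] > 2000:
        return "critical"
    if p["p50"] > 200 or p["p90"] > 500 or p["p99"] > 1000:
        return "warning"
    return "healthy"
```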
How Do You Monitor API Dependencies?
External API failures (LLM, CRM, knowledge base) cascade through the entire system.
Key Metrics:
| Dependency | Healthy Latency | Healthy Error Rate | Timeout Threshold |
|---|---|---|---|
| LLM Inference | <500ms P90 | <0.5% | 2s |
| CRM Lookup | <200ms P90 | <0.1% | 1s |
| Knowledge Base | <300ms P90 | <0.1% | 1.5s |
| Function Calls | <400ms P90 | <1% | 2s |
Sources: API latency targets based on OpenAI API SLAs, Anthropic documentation, and typical CRM/enterprise integration benchmarks. Circuit breaker patterns from Release It! (Nygard, 2018) and Hamming infrastructure monitoring.
Circuit Breaker Thresholds:
- 5 consecutive failures → Open circuit (stop calling)
- 30 seconds → Half-open (test with single request)
- Success → Close circuit (resume normal operation)
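A minimal circuit breaker sketch implementing those thresholds (5 consecutive failures to open, a 30-second half-open probe); this is illustrative, not a drop-in client wrapper.

```python
# Minimal circuit breaker sketch: 5 consecutive failures open the circuit,
# after 30 seconds a single probe request is allowed (half-open),
# and a success closes the circuit again.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: normal operation
        if time.monotonic() - self.opened_at >= self.recovery_timeout:
            return True  # half-open: let one probe through
        return False     # open: fail fast

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # open the circuit
```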
How Do You Define What Counts as an Outage?
Before you can monitor for outages, you have to define what an outage looks like for your agent. Think in functional thresholds, not a binary up/down status.
What Is the Outage Classification Matrix?
| Severity | User Impact | Technical Signals | Response Time |
|---|---|---|---|
| SEV-1 Critical | Complete inability to complete tasks | >50% call failures, API down | <5 minutes |
| SEV-2 Major | Significant degradation | P90 latency >2s, WER >15% | <15 minutes |
| SEV-3 Minor | Noticeable but functional | P90 latency >1s, WER >10% | <1 hour |
| SEV-4 Warning | No user impact yet | Metric drift >5% from baseline | <4 hours |
Sources: Severity classification aligned with Google SRE practices and adapted for voice agent-specific signals. Response time SLAs based on Hamming customer incident response data (2025).
What Is the Conversational Downtime Definition?
An outage occurs when any of these conditions persist for >60 seconds:
- Response Latency: Agent fails to respond within 1.2 seconds
- Call Drops: Unexpected disconnections exceed 2% of calls
- Intent Failures: Fallback responses exceed 15% of turns
- Task Completion: Success rate drops below 80% baseline
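Expressed as code, that definition might look like the sketch below, assuming you compute these rates over a rolling 60-second window; the field names are illustrative.

```python
# Sketch: flag a conversational outage when any downtime condition holds
# over a rolling 60-second window.
from dataclasses import dataclass

@dataclass
class WindowStats:
    response_latency_p90_s: float  # agent response latency, 90th percentile
    call_drop_rate: float          # unexpected disconnects / calls
    fallback_rate: float           # fallback responses / turns
    task_completion_rate: float    # completed tasks / attempted tasks

def is_conversational_outage(w: WindowStats, completion_baseline: float) -> bool:
    return (
        w.response_latency_p90_s > 1.2          # agent too slow to respond
        or w.call_drop_rate > 0.02              # >2% unexpected disconnections
        or w.fallback_rate > 0.15               # >15% fallback turns
        or w.task_completion_rate < 0.80 * completion_baseline  # below 80% of baseline
    )
```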
How Do You Determine Monitoring Signals?
For each layer, define the specific patterns that indicate degradation:
What Are ASR Warning Signals?
ASR issues typically surface as transcription quality problems before they cause complete failures:
| Signal | Pattern | Action |
|---|---|---|
| WER Spike | >5% increase in 15 min | Investigate ASR provider status |
| Confidence Drop | Average confidence <0.75 | Check audio quality, background noise |
| Transcription Delays | P90 >500ms | Review ASR capacity, network latency |
| Language Detection Errors | Wrong language >5% | Validate language routing logic |
What Are NLU Warning Signals?
NLU degradation shows up in intent classification patterns—watch for sudden shifts in fallback rates:
| Signal | Pattern | Action |
|---|---|---|
| Fallback Surge | 2x increase in "unknown" | Review recent intent changes |
| Confidence Clustering | Bimodal distribution appears | Retrain or adjust thresholds |
| Intent Drift | Top intents shift >10% | Check for upstream data changes |
| Slot Fill Failures | >20% incomplete slots | Review entity extraction |
What Are TTS Warning Signals?
TTS problems directly impact user experience—delayed or garbled audio is immediately noticeable:
| Signal | Pattern | Action |
|---|---|---|
| Latency Spike | P90 doubles | Check TTS provider capacity |
| Audio Errors | Garbled/missing audio >0.5% | Validate audio encoding pipeline |
| Voice Consistency | Pitch/speed variance | Review TTS configuration |
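One way to encode the signal-to-action mappings in the tables above is a small rule table that a monitoring job evaluates every few minutes. A minimal sketch, where the metric keys and baselines are assumptions about what your pipeline already tracks:

```python
# Sketch: map observed layer metrics to the recommended actions from the tables above.
RULES = [
    ("asr", lambda m: m["wer_delta_15m"] > 5,            "Investigate ASR provider status"),
    ("asr", lambda m: m["avg_confidence"] < 0.75,        "Check audio quality, background noise"),
    ("asr", lambda m: m["transcription_p90_ms"] > 500,   "Review ASR capacity, network latency"),
    ("nlu", lambda m: m["fallback_rate"] >= 2 * m["fallback_rate_baseline"],
                                                         "Review recent intent changes"),
    ("tts", lambda m: m["tts_p90_ms"] >= 2 * m["tts_p90_baseline_ms"],
                                                         "Check TTS provider capacity"),
]

def evaluate_signals(metrics: dict[str, float]) -> list[str]:
    """Return the actions whose warning pattern currently matches."""
    return [f"[{layer}] {action}" for layer, check, action in RULES if check(metrics)]
```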
How Do You Run Synthetic Calls 24/7?
The most reliable way to detect outages before your customers do is to simulate customer calls continuously.
What Is the Synthetic Call Strategy?
| Time Period | Call Frequency | Scenario Coverage |
|---|---|---|
| Business Hours (8am-8pm) | Every 5 minutes | Full scenario rotation |
| Off-Hours (8pm-8am) | Every 15 minutes | Critical paths only |
| Weekends | Every 15 minutes | Critical paths only |
| After Deployments | Every 2 minutes for 30 min | Regression scenarios |
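A sketch of the scheduling logic implied by that table; time zone handling, deployment tracking, and the scenario-set names are assumptions.

```python
# Sketch: pick synthetic call frequency and scenario set from the schedule above.
from datetime import datetime, timedelta

def synthetic_call_plan(now: datetime, last_deploy: datetime | None) -> tuple[timedelta, str]:
    # Tighten cadence for 30 minutes after a deployment.
    if last_deploy and now - last_deploy < timedelta(minutes=30):
        return timedelta(minutes=2), "regression_scenarios"
    is_weekend = now.weekday() >= 5
    business_hours = 8 <= now.hour < 20
    if business_hours and not is_weekend:
        return timedelta(minutes=5), "full_scenario_rotation"
    return timedelta(minutes=15), "critical_paths_only"
```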
What Should You Test in Synthetic Calls?
- Happy Path Scenarios: Standard booking, inquiry, support flows
- Edge Cases: Interruptions, corrections, multi-turn context
- Load Conditions: Concurrent call handling
- Language Coverage: All supported languages/dialects
- API Dependencies: Calls that exercise external integrations
What Are Synthetic Call Success Criteria?
| Metric | Threshold | Alert If |
|---|---|---|
| Task Completion | >95% | <90% for 3 consecutive calls |
| Latency (end-to-end) | <3s average | >4s for any call |
| ASR Accuracy | >90% | <85% for any call |
| Intent Accuracy | >95% | <90% for any call |
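A minimal evaluator for those success criteria, assuming each synthetic call produces a small result record; the field names are illustrative, and "task completion <90% for 3 consecutive calls" is read here as three consecutive failed calls.

```python
# Sketch: evaluate recent synthetic call results against the criteria above.
from dataclasses import dataclass

@dataclass
class SyntheticResult:
    task_completed: bool
    latency_s: float        # end-to-end latency in seconds
    asr_accuracy: float     # percent
    intent_accuracy: float  # percent

def synthetic_alerts(recent: list[SyntheticResult]) -> list[str]:
    alerts = []
    if any(r.latency_s > 4 for r in recent):
        alerts.append("end-to-end latency >4s on a synthetic call")
    if any(r.asr_accuracy < 85 for r in recent):
        alerts.append("ASR accuracy <85% on a synthetic call")
    if any(r.intent_accuracy < 90 for r in recent):
        alerts.append("intent accuracy <90% on a synthetic call")
    last_three = recent[-3:]
    if len(last_three) == 3 and all(not r.task_completed for r in last_three):
        alerts.append("task completion failed on 3 consecutive synthetic calls")
    return alerts
```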
How Do You Automate Alerting?
Configure alerts that route to the right team with appropriate urgency.
What Is the Alerting Threshold Configuration?
| Metric | Warning | Critical | Escalation |
|---|---|---|---|
| WER | >10% | >15% | Slack → PagerDuty |
| Intent Accuracy | <90% | <85% | Slack → PagerDuty |
| P90 Latency | >1s | >2s | Slack → PagerDuty |
| Call Failure Rate | >2% | >5% | Direct PagerDuty |
| API Error Rate | >1% | >5% | Slack → PagerDuty |
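As a sketch, those thresholds can live in a small config that an evaluator maps to a severity and an escalation path; rates are expressed as fractions here, and the routing strings are placeholders.

```python
# Sketch: alert thresholds from the table, mapped to severity and escalation path.
THRESHOLDS = {
    # metric: (warning predicate, critical predicate, escalation path)
    "wer":               (lambda v: v > 10,   lambda v: v > 15,   "slack -> pagerduty"),
    "intent_accuracy":   (lambda v: v < 90,   lambda v: v < 85,   "slack -> pagerduty"),
    "p90_latency_s":     (lambda v: v > 1,    lambda v: v > 2,    "slack -> pagerduty"),
    "call_failure_rate": (lambda v: v > 0.02, lambda v: v > 0.05, "pagerduty"),
    "api_error_rate":    (lambda v: v > 0.01, lambda v: v > 0.05, "slack -> pagerduty"),
}

def classify(metric: str, value: float) -> tuple[str | None, str | None]:
    warn, crit, route = THRESHOLDS[metric]
    if crit(value):
        return "critical", route
    if warn(value):
        return "warning", route
    return None, None
```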
What Is the Alert Routing Matrix?
| Severity | Primary Channel | Escalation | Response SLA |
|---|---|---|---|
| Critical | PagerDuty | VP Engineering @ 15min | 5 minutes |
| Major | PagerDuty | Team Lead @ 30min | 15 minutes |
| Minor | Slack #alerts | None | 1 hour |
| Warning | Slack #monitoring | None | 4 hours |
How Do You Prevent Alert Fatigue?
- Deduplication: Group related alerts within 5-minute windows
- Correlation: Link ASR + NLU + TTS alerts when they co-occur
- Auto-resolve: Clear warnings if metrics recover within 10 minutes
- Scheduled Quiet: Suppress non-critical during maintenance windows
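A minimal sketch of the deduplication and auto-resolve behavior described above; the alert keys and in-memory state are assumptions made for illustration.

```python
# Sketch: suppress duplicate alerts within a 5-minute window and auto-resolve
# warnings whose metric recovers within 10 minutes.
import time

DEDUP_WINDOW_S = 5 * 60
AUTO_RESOLVE_S = 10 * 60

_last_fired: dict[str, float] = {}     # alert key -> last time it was sent
_open_warnings: dict[str, float] = {}  # alert key -> time the warning opened

def fire_warning(alert_key: str) -> bool:
    """Send a warning unless an identical one fired within the dedup window."""
    now = time.time()
    if now - _last_fired.get(alert_key, 0.0) < DEDUP_WINDOW_S:
        return False  # duplicate within the window: suppress
    _last_fired[alert_key] = now
    _open_warnings.setdefault(alert_key, now)
    return True

def check_auto_resolve(alert_key: str, metric_recovered: bool) -> bool:
    """Clear an open warning if its metric recovered within 10 minutes."""
    opened = _open_warnings.get(alert_key)
    if opened is not None and metric_recovered and time.time() - opened <= AUTO_RESOLVE_S:
        _open_warnings.pop(alert_key)
        return True
    return False
```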
What Is the Voice Agent Outage Monitoring Checklist?
Use this checklist to validate your monitoring coverage:
ASR Monitoring:
- WER tracked with baseline comparison
- Transcription latency percentiles (P50/P90/P99)
- Confidence score distribution
- Per-language WER breakdown
NLU Monitoring:
- Intent accuracy with rolling baseline
- Fallback rate alerts configured
- Top intent distribution tracking
- Slot fill success rate
TTS Monitoring:
- Latency percentiles tracked
- Audio generation error rate
- Time-to-first-byte monitoring
API Monitoring:
- All external dependencies tracked
- Circuit breakers configured
- Timeout thresholds defined
- Error rate alerting active
Synthetic Testing:
- 24/7 synthetic calls running
- Critical paths covered
- Post-deployment regression tests
- Multi-language scenarios included
Alerting:
- Severity levels defined
- Routing rules configured
- Escalation paths documented
- Alert fatigue prevention active
How Does Hamming Monitor Voice Agent Outages in Real Time?
With Hamming, teams gain continuous visibility into every layer of the voice stack, from ASR drift to API slowdowns. Real-time alerts, synthetic call testing, and detailed reliability dashboards give you the earliest possible signal when your voice agent starts to degrade.
Hamming's monitoring capabilities include:
- 24/7 Synthetic Calls: Automated testing every 5-15 minutes across all scenarios
- 4-Layer Observability: Unified dashboards for ASR, NLU, TTS, and API metrics
- Intelligent Alerting: Configurable thresholds with Slack, PagerDuty, and webhook integrations
- Root-Cause Tracing: One-click from alert to transcript, audio, and model logs
- Baseline Comparison: Automatic drift detection against rolling performance baselines
Instead of discovering failures hours later in transcripts, you'll know the moment they happen and resolve them before users ever notice.

