How to Monitor Voice Agent Outages in Real Time

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 1M+ voice agent calls to find where they break.

October 7, 2025 · Updated December 23, 2025 · 13 min read

TL;DR: Monitor voice agent outages by tracking ASR WER, NLU intent accuracy, TTS latency percentiles, and API dependency errors in real time. Use synthetic calls and clear alert thresholds to catch silent degradations before customers notice.

Last month, one of our customers had a 45-minute outage. Their voice agent didn't crash—it just started responding 200ms slower than usual. Doesn't sound like much, right? But that extra latency made turn-taking feel off. Users started interrupting more. The agent started cutting them off. Call completion rates dropped 23% before anyone noticed.

They found out because their CSAT scores tanked overnight. By the time they traced it back to an ASR provider having capacity issues, they'd already handled 800 frustrated calls.

The complexity of AI voice agents makes them prone to these silent failures that don't look like outages but absolutely feel like outages to users. Detecting them is difficult without the right observability platform; most teams still rely on manual QA and post-call reviews to discover issues long after users have felt the impact.

Maintaining the reliability of voice agents requires real-time insight into how each layer of the system performs in production. Without that visibility, voice agent outages often go undetected until users start complaining.

In this article, we'll walk you through how to monitor AI voice agent outages in real time.

Quick filter: If you learn about outages from customer complaints, your monitoring is too slow.

Methodology Note: Alert thresholds and monitoring benchmarks in this guide are based on Hamming's analysis of outage patterns across 100+ production voice agents (2025). Thresholds should be calibrated to your specific baseline performance and SLA requirements.

What Is Voice Agent Outage Monitoring?

Voice agent outage monitoring is the continuous measurement of conversation-level reliability signals (ASR accuracy, intent success, TTS latency, and dependency health) to detect silent failures that don’t appear as infrastructure downtime.

Why Are Voice Agent Outages Hard to Detect?

Voice agent outages are often silent. The agent might not crash or stop responding; it simply starts behaving differently. That can show up as a slight delay in transcription, a missed intent, or an LLM hallucination.

This makes real-time detection difficult: what looks like a normal interaction in logs is frustrating the user on the other end. Part of the challenge in detecting voice agent outages is the architecture of the voice stack.

Most voice agents depend on multiple probabilistic components (ASR, NLU, TTS, the LLM, and so on). If any layer experiences latency, drift, or a dependency failure, the effect can ripple across the entire conversation without causing a visible outage.

Another reason voice agent outages are hard to detect is that most teams still rely on manual post-call quality assurance to discover them. By the time an analyst reviews transcripts or listens to recordings, the underlying issue (an API slowdown, a model regression, or an expired integration key) has often already resolved itself. What's left behind is a pattern of fallback responses, longer pauses, and increased drop-off rates that doesn't explain what went wrong.

How Do You Monitor Voice Agent Outages in Real Time?

Effective outage monitoring requires a systematic approach. Use Hamming's 4-Layer Monitoring Framework to ensure comprehensive coverage across your entire voice stack.

Hamming's 4-Layer Monitoring Framework

Voice agents consist of four interdependent layers. Each layer has distinct failure modes and requires specific monitoring signals:

| Layer | Function | Key Metrics | Failure Mode |
|---|---|---|---|
| ASR | Speech-to-text | Word Error Rate, transcription latency | Silent degradation (mishears but doesn't fail) |
| NLU | Intent classification | Intent accuracy, fallback rate | Wrong routing, default responses |
| TTS | Text-to-speech | P90 latency, audio quality | Delayed or garbled responses |
| API | External dependencies | Error rate, latency percentiles | Timeouts, cascading failures |

Sources: 4-Layer Framework based on Hamming's incident analysis across 100+ production voice agents (2025). Layer categorization aligned with standard voice agent architecture patterns.
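
It can help to make these signals explicit in code rather than leaving them implicit in dashboards. Below is a minimal sketch of a per-layer signal registry mirroring the framework table; the field and metric names are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerSignal:
    layer: str         # ASR, NLU, TTS, or API
    metric: str        # what to measure continuously
    failure_mode: str  # what a breach of this metric typically means

# Illustrative registry mirroring the 4-layer framework table above.
MONITORING_SIGNALS = [
    LayerSignal("ASR", "word_error_rate", "silent degradation: mishears but doesn't fail"),
    LayerSignal("ASR", "transcription_latency_ms", "slow transcripts delay every turn"),
    LayerSignal("NLU", "intent_accuracy", "wrong routing, default responses"),
    LayerSignal("NLU", "fallback_rate", "wrong routing, default responses"),
    LayerSignal("TTS", "p90_latency_ms", "delayed or garbled responses"),
    LayerSignal("API", "dependency_error_rate", "timeouts, cascading failures"),
]

def signals_for(layer: str) -> list[LayerSignal]:
    """Return the signals a collector should emit for a given layer."""
    return [s for s in MONITORING_SIGNALS if s.layer == layer]
```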

How Do You Monitor ASR (Speech Recognition)?

ASR failures are the hardest to detect because the system continues operating—it just mishears users.

Key Metrics:

| Metric | How to Calculate | Healthy | Warning | Critical |
|---|---|---|---|---|
| Word Error Rate (WER) | (S + D + I) / N × 100 | <8% | 8-12% | >12% |
| WER Delta | Current WER - Baseline WER | <2% | 2-5% | >5% |
| Transcription Latency | Time from audio end to transcript | <300ms | 300-500ms | >500ms |
| Confidence Score | ASR model confidence | >0.85 | 0.7-0.85 | <0.7 |

Sources: WER thresholds based on LibriSpeech evaluation standards and Hamming production monitoring data (2025). Latency targets aligned with conversational turn-taking research (Stivers et al., 2009). Confidence thresholds from ASR provider documentation (Deepgram, AssemblyAI, Google STT).

WER Formula:

WER = (Substitutions + Deletions + Insertions) / Total Reference Words × 100

Example: If your baseline WER is 6% and you detect 11%, that's an 83% degradation—trigger a critical alert.
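
If your ASR provider doesn't report WER directly, you can compute it yourself against the reference transcripts from your synthetic calls. The sketch below is a minimal Python implementation of the formula above using word-level edit distance; the baseline value and delta thresholds come from the table and should be calibrated to your own data.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words x 100."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard edit-distance dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1) * 100

# Compare against a rolling baseline using the WER Delta thresholds above.
baseline_wer = 6.0
current_wer = wer("book me a table for two at seven", "book me a cable for two at seven")
delta = current_wer - baseline_wer
if delta > 5:
    print(f"CRITICAL: WER {current_wer:.1f}% is {delta:.1f} points above baseline")
elif delta > 2:
    print(f"WARNING: WER {current_wer:.1f}% is drifting from baseline")
```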

How Do You Monitor NLU (Intent Classification)?

NLU failures manifest as increased fallback responses and misrouted conversations.

Key Metrics:

| Metric | Definition | Healthy | Warning | Critical |
|---|---|---|---|---|
| Intent Accuracy | Correct classifications / Total | >92% | 85-92% | <85% |
| Fallback Rate | "Unknown" intents / Total | <5% | 5-10% | >10% |
| Confidence Distribution | % of low-confidence (<0.7) classifications | <10% | 10-20% | >20% |
| Top Intent Drift | Change in top-5 intent distribution | <5% shift | 5-15% shift | >15% shift |

Sources: Intent accuracy benchmarks based on dialogue systems research (Budzianowski et al., 2019) and Hamming customer NLU monitoring data (2025). Fallback rate thresholds derived from contact center industry standards.

Alert Logic:

  • If fallback rate doubles within 15 minutes → Warning
  • If intent accuracy drops >5% from 24-hour baseline → Critical
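
One way to implement that alert logic is a small rolling-window monitor. The sketch below is illustrative Python (not a specific library): it tracks fallback rate and intent accuracy over a 15-minute window and compares them against baselines you supply, such as the previous 24 hours.

```python
import time
from collections import deque

class NLUAlertMonitor:
    """Rolling-window check for the fallback-doubling and accuracy-drop rules above."""

    def __init__(self, window_seconds: int = 15 * 60):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, was_fallback, was_correct)

    def record_turn(self, was_fallback: bool, was_correct: bool) -> None:
        self.events.append((time.time(), was_fallback, was_correct))
        cutoff = time.time() - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def check(self, baseline_fallback_rate: float, baseline_accuracy: float) -> str | None:
        if not self.events:
            return None
        fallback_rate = sum(1 for _, fb, _ in self.events if fb) / len(self.events)
        accuracy = sum(1 for _, _, ok in self.events if ok) / len(self.events)
        if baseline_accuracy - accuracy > 0.05:
            return "CRITICAL: intent accuracy dropped >5% from the 24-hour baseline"
        if fallback_rate >= 2 * baseline_fallback_rate:
            return "WARNING: fallback rate doubled within the 15-minute window"
        return None
```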

How Do You Monitor TTS (Speech Synthesis)?

TTS failures create awkward pauses and delayed responses that frustrate users.

Key Metrics:

| Metric | Definition | Healthy | Warning | Critical |
|---|---|---|---|---|
| P50 Latency | Median time to first audio byte | <200ms | 200-400ms | >400ms |
| P90 Latency | 90th percentile latency | <500ms | 500-1000ms | >1000ms |
| P99 Latency | 99th percentile latency | <1000ms | 1-2s | >2s |
| Audio Generation Errors | Failed TTS requests / Total | <0.1% | 0.1-1% | >1% |

Sources: TTS latency benchmarks from ElevenLabs, Cartesia, and OpenAI TTS documentation. Error rate thresholds based on Hamming production monitoring across 50+ voice agent deployments (2025).

Why P90/P99 Matter:

  • P50 shows typical experience
  • P90 shows what 10% of users experience
  • P99 catches the worst cases that drive complaints

A system with 200ms P50 but 3s P99 has a hidden outage affecting 1 in 100 calls.
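
Percentiles are cheap to compute from the latency samples you are already logging. A minimal nearest-rank sketch (in production you would normally rely on your metrics backend's percentile functions) shows how a healthy median can hide a broken tail:

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; adequate for dashboard-style latency stats."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Hypothetical time-to-first-audio-byte samples (ms) with a slow tail.
tts_latencies_ms = [180, 190, 195, 205, 210, 220, 230, 240, 2900, 3100]
for p in (50, 90, 99):
    print(f"P{p} = {percentile(tts_latencies_ms, p):.0f}ms")
# P50 looks healthy (~210ms) while P90/P99 expose multi-second responses.
```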

How Do You Monitor API Dependencies?

External API failures (LLM, CRM, knowledge base) cascade through the entire system.

Key Metrics:

| Dependency | Healthy Latency | Healthy Error Rate | Timeout Threshold |
|---|---|---|---|
| LLM Inference | <500ms P90 | <0.5% | 2s |
| CRM Lookup | <200ms P90 | <0.1% | 1s |
| Knowledge Base | <300ms P90 | <0.1% | 1.5s |
| Function Calls | <400ms P90 | <1% | 2s |

Sources: API latency targets based on OpenAI API SLAs, Anthropic documentation, and typical CRM/enterprise integration benchmarks. Circuit breaker patterns from Release It! (Nygard, 2018) and Hamming infrastructure monitoring.

Circuit Breaker Thresholds:

  • 5 consecutive failures → Open circuit (stop calling)
  • 30 seconds → Half-open (test with single request)
  • Success → Close circuit (resume normal operation)
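
A minimal circuit breaker following those thresholds might look like the sketch below. It is single-threaded and illustrative; a production version also needs locking, per-dependency state, and metrics on open/close transitions.

```python
import time

class CircuitBreaker:
    """Opens after 5 consecutive failures, half-opens after 30s, closes on success."""

    def __init__(self, failure_threshold: int = 5, recovery_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.consecutive_failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True   # closed: traffic flows normally
        if time.time() - self.opened_at >= self.recovery_seconds:
            return True   # half-open: let a probe request through
        return False      # open: fail fast instead of calling the dependency

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.time()  # open (or re-open) the circuit
```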

How Do You Define What Counts as an Outage?

Before you can monitor for outages, you have to define what an outage looks like for your agent. Think in terms of functional thresholds (degraded conversations), not binary up/down status.

What Is the Outage Classification Matrix?

| Severity | User Impact | Technical Signals | Response Time |
|---|---|---|---|
| SEV-1 Critical | Complete inability to complete tasks | >50% call failures, API down | <5 minutes |
| SEV-2 Major | Significant degradation | P90 latency >2s, WER >15% | <15 minutes |
| SEV-3 Minor | Noticeable but functional | P90 latency >1s, WER >10% | <1 hour |
| SEV-4 Warning | No user impact yet | Metric drift >5% from baseline | <4 hours |

Sources: Severity classification aligned with Google SRE practices and adapted for voice agent-specific signals. Response time SLAs based on Hamming customer incident response data (2025).
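
The matrix translates almost directly into a classification function. The sketch below is illustrative: the inputs are assumed to be computed over a short rolling window, and the signal set is a simplification of what a real incident pipeline would consider.

```python
def classify_severity(call_failure_rate: float, p90_latency_s: float,
                      wer_pct: float, baseline_drift_pct: float) -> str | None:
    """Map observed signals to the severity matrix above; None means healthy."""
    if call_failure_rate > 0.50:
        return "SEV-1"  # complete inability to complete tasks
    if p90_latency_s > 2 or wer_pct > 15:
        return "SEV-2"  # significant degradation
    if p90_latency_s > 1 or wer_pct > 10:
        return "SEV-3"  # noticeable but functional
    if baseline_drift_pct > 5:
        return "SEV-4"  # no user impact yet
    return None
```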

What Is the Conversational Downtime Definition?

An outage occurs when any of these conditions persist for >60 seconds:

  • Response Latency: Agent fails to respond within 1.2 seconds
  • Call Drops: Unexpected disconnections exceed 2% of calls
  • Intent Failures: Fallback responses exceed 15% of turns
  • Task Completion: Success rate drops below 80% baseline
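
To enforce the 60-second persistence rule, track when each condition first became degraded and only declare an outage once it has stayed degraded past that window. A minimal sketch follows; the metric names are assumptions, and "below 80% baseline" is interpreted here as 80% of the baseline task-completion rate.

```python
import time

class OutageDetector:
    """Flags an outage only when a degraded condition persists for >60 seconds."""

    def __init__(self, persistence_seconds: float = 60.0):
        self.persistence_seconds = persistence_seconds
        self.degraded_since: dict[str, float] = {}

    def evaluate(self, m: dict[str, float]) -> list[str]:
        conditions = {
            "response_latency": m["p90_response_latency_s"] > 1.2,
            "call_drops": m["call_drop_rate"] > 0.02,
            "intent_failures": m["fallback_rate"] > 0.15,
            "task_completion": m["completion_rate"] < 0.80 * m["baseline_completion_rate"],
        }
        now = time.time()
        outages = []
        for name, degraded in conditions.items():
            if not degraded:
                self.degraded_since.pop(name, None)  # condition recovered
                continue
            started = self.degraded_since.setdefault(name, now)
            if now - started > self.persistence_seconds:
                outages.append(name)
        return outages
```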

How Do You Determine Monitoring Signals?

For each layer, define the specific patterns that indicate degradation:

What Are ASR Warning Signals?

ASR issues typically surface as transcription quality problems before they cause complete failures:

| Signal | Pattern | Action |
|---|---|---|
| WER Spike | >5% increase in 15 min | Investigate ASR provider status |
| Confidence Drop | Average confidence <0.75 | Check audio quality, background noise |
| Transcription Delays | P90 >500ms | Review ASR capacity, network latency |
| Language Detection Errors | Wrong language >5% | Validate language routing logic |

What Are NLU Warning Signals?

NLU degradation shows up in intent classification patterns—watch for sudden shifts in fallback rates:

| Signal | Pattern | Action |
|---|---|---|
| Fallback Surge | 2x increase in "unknown" | Review recent intent changes |
| Confidence Clustering | Bimodal distribution appears | Retrain or adjust thresholds |
| Intent Drift | Top intents shift >10% | Check for upstream data changes |
| Slot Fill Failures | >20% incomplete slots | Review entity extraction |

What Are TTS Warning Signals?

TTS problems directly impact user experience—delayed or garbled audio is immediately noticeable:

| Signal | Pattern | Action |
|---|---|---|
| Latency Spike | P90 doubles | Check TTS provider capacity |
| Audio Errors | Garbled/missing audio >0.5% | Validate audio encoding pipeline |
| Voice Consistency | Pitch/speed variance | Review TTS configuration |

How Do You Run Synthetic Calls 24/7?

The most reliable way to detect outages before your customers do is to simulate customer calls continuously.

What Is the Synthetic Call Strategy?

| Time Period | Call Frequency | Scenario Coverage |
|---|---|---|
| Business Hours (8am-8pm) | Every 5 minutes | Full scenario rotation |
| Off-Hours (8pm-8am) | Every 15 minutes | Critical paths only |
| Weekends | Every 15 minutes | Critical paths only |
| After Deployments | Every 2 minutes for 30 min | Regression scenarios |
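
A scheduler for this cadence can be as simple as a function that returns the next polling interval based on the clock and the most recent deployment. The sketch below is illustrative; local business hours, weekend handling, and deployment tracking are assumptions left to the caller.

```python
from datetime import datetime, timedelta

def synthetic_call_interval(now: datetime, last_deploy: datetime | None = None) -> timedelta:
    """Return how long to wait before the next synthetic call, per the schedule above."""
    if last_deploy and now - last_deploy <= timedelta(minutes=30):
        return timedelta(minutes=2)      # post-deployment regression window
    if now.weekday() >= 5:               # Saturday or Sunday
        return timedelta(minutes=15)     # critical paths only
    if 8 <= now.hour < 20:               # business hours, 8am-8pm local time
        return timedelta(minutes=5)      # full scenario rotation
    return timedelta(minutes=15)         # off-hours, critical paths only
```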

What to Test in Synthetic Calls?

  1. Happy Path Scenarios: Standard booking, inquiry, support flows
  2. Edge Cases: Interruptions, corrections, multi-turn context
  3. Load Conditions: Concurrent call handling
  4. Language Coverage: All supported languages/dialects
  5. API Dependencies: Calls that exercise external integrations

What Are Synthetic Call Success Criteria?

| Metric | Threshold | Alert If |
|---|---|---|
| Task Completion | >95% | <90% for 3 consecutive calls |
| Latency (end-to-end) | <3s average | >4s for any call |
| ASR Accuracy | >90% | <85% for any call |
| Intent Accuracy | >95% | <90% for any call |
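
In code, those criteria reduce to a small evaluator that runs after every synthetic call. The sketch below treats each call as pass/fail on task completion (one interpretation of the "<90% for 3 consecutive calls" rule) and checks the per-call limits directly; threshold values come from the table above.

```python
class SyntheticCallGate:
    """Raises alerts when synthetic calls miss the success criteria above."""

    def __init__(self):
        self.consecutive_completion_misses = 0

    def evaluate(self, completed: bool, latency_s: float,
                 asr_accuracy: float, intent_accuracy: float) -> list[str]:
        alerts = []
        self.consecutive_completion_misses = (
            0 if completed else self.consecutive_completion_misses + 1
        )
        if self.consecutive_completion_misses >= 3:
            alerts.append("Task completion failed for 3 consecutive synthetic calls")
        if latency_s > 4:
            alerts.append(f"End-to-end latency {latency_s:.1f}s exceeds 4s")
        if asr_accuracy < 0.85:
            alerts.append(f"ASR accuracy {asr_accuracy:.0%} below 85%")
        if intent_accuracy < 0.90:
            alerts.append(f"Intent accuracy {intent_accuracy:.0%} below 90%")
        return alerts
```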

How Do You Automate Alerting?

Configure alerts that route to the right team with appropriate urgency.

What Is the Alerting Threshold Configuration?

| Metric | Warning | Critical | Escalation |
|---|---|---|---|
| WER | >10% | >15% | Slack → PagerDuty |
| Intent Accuracy | <90% | <85% | Slack → PagerDuty |
| P90 Latency | >1s | >2s | Slack → PagerDuty |
| Call Failure Rate | >2% | >5% | Direct PagerDuty |
| API Error Rate | >1% | >5% | Slack → PagerDuty |
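
These thresholds are easiest to keep consistent when they live in one configuration structure that both dashboards and alert evaluators read. A minimal sketch, with illustrative metric keys and a simple comparison encoding:

```python
# Warning/critical limits from the table above; ("gt", x) means "alert if value > x".
ALERT_THRESHOLDS = {
    "wer_pct":           {"warning": ("gt", 10),   "critical": ("gt", 15)},
    "intent_accuracy":   {"warning": ("lt", 0.90), "critical": ("lt", 0.85)},
    "p90_latency_s":     {"warning": ("gt", 1.0),  "critical": ("gt", 2.0)},
    "call_failure_rate": {"warning": ("gt", 0.02), "critical": ("gt", 0.05)},
    "api_error_rate":    {"warning": ("gt", 0.01), "critical": ("gt", 0.05)},
}

def evaluate_metric(name: str, value: float) -> str | None:
    """Return 'critical', 'warning', or None for a single metric reading."""
    for severity in ("critical", "warning"):  # check the stricter level first
        op, limit = ALERT_THRESHOLDS[name][severity]
        if (op == "gt" and value > limit) or (op == "lt" and value < limit):
            return severity
    return None

print(evaluate_metric("p90_latency_s", 1.4))  # -> "warning", not "critical"
```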

What Is the Alert Routing Matrix?

| Severity | Primary Channel | Escalation | Response SLA |
|---|---|---|---|
| Critical | PagerDuty | VP Engineering @ 15min | 5 minutes |
| Major | PagerDuty | Team Lead @ 30min | 15 minutes |
| Minor | Slack #alerts | None | 1 hour |
| Warning | Slack #monitoring | None | 4 hours |

How Do You Prevent Alert Fatigue?

  • Deduplication: Group related alerts within 5-minute windows
  • Correlation: Link ASR + NLU + TTS alerts when they co-occur
  • Auto-resolve: Clear warnings if metrics recover within 10 minutes
  • Scheduled Quiet: Suppress non-critical during maintenance windows
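
Deduplication and auto-resolve are straightforward to implement: key each alert (for example by metric plus flow), suppress repeats inside the 5-minute window, and clear warnings that recover within 10 minutes. The sketch below is illustrative Python, not tied to any specific alerting tool.

```python
import time

class AlertGovernor:
    """Suppresses duplicate alerts and auto-resolves fast-recovering warnings."""

    def __init__(self, dedup_window_s: float = 300, auto_resolve_s: float = 600):
        self.dedup_window_s = dedup_window_s
        self.auto_resolve_s = auto_resolve_s
        self.last_fired: dict[str, float] = {}     # alert key -> last time emitted
        self.open_warnings: dict[str, float] = {}  # alert key -> first time seen

    def should_fire(self, key: str) -> bool:
        now = time.time()
        if now - self.last_fired.get(key, 0.0) < self.dedup_window_s:
            return False  # deduplicate: same alert fired within the window
        self.last_fired[key] = now
        self.open_warnings.setdefault(key, now)
        return True

    def auto_resolve(self, key: str) -> bool:
        """Call when the metric recovers; True means the warning clears silently."""
        started = self.open_warnings.pop(key, None)
        return started is not None and time.time() - started <= self.auto_resolve_s
```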

What Is the Voice Agent Outage Monitoring Checklist?

Use this checklist to validate your monitoring coverage:

ASR Monitoring:

  • WER tracked with baseline comparison
  • Transcription latency percentiles (P50/P90/P99)
  • Confidence score distribution
  • Per-language WER breakdown

NLU Monitoring:

  • Intent accuracy with rolling baseline
  • Fallback rate alerts configured
  • Top intent distribution tracking
  • Slot fill success rate

TTS Monitoring:

  • Latency percentiles tracked
  • Audio generation error rate
  • Time-to-first-byte monitoring

API Monitoring:

  • All external dependencies tracked
  • Circuit breakers configured
  • Timeout thresholds defined
  • Error rate alerting active

Synthetic Testing:

  • 24/7 synthetic calls running
  • Critical paths covered
  • Post-deployment regression tests
  • Multi-language scenarios included

Alerting:

  • Severity levels defined
  • Routing rules configured
  • Escalation paths documented
  • Alert fatigue prevention active

How Does Hamming Monitor Voice Agent Outages in Real Time?

With Hamming, teams gain continuous visibility into every layer of the voice stack, from ASR drift to API slowdowns. Real-time alerts, synthetic call testing, and detailed reliability dashboards give you the earliest possible signal when your voice agent starts to degrade.

Hamming's monitoring capabilities include:

  • 24/7 Synthetic Calls: Automated testing every 5-15 minutes across all scenarios
  • 4-Layer Observability: Unified dashboards for ASR, NLU, TTS, and API metrics
  • Intelligent Alerting: Configurable thresholds with Slack, PagerDuty, and webhook integrations
  • Root-Cause Tracing: One-click from alert to transcript, audio, and model logs
  • Baseline Comparison: Automatic drift detection against rolling performance baselines

Instead of discovering failures hours later in transcripts, you'll know the moment they happen and resolve them before users ever notice.

Start monitoring your voice agents →

Frequently Asked Questions

What counts as a voice agent outage?

A voice agent outage is a measurable degradation in conversation quality or task completion, even if infrastructure is technically 'up'. Common indicators: WER spike >5% from baseline, intent accuracy drop below 85%, fallback rate exceeding 15%, P90 latency >2 seconds, or task completion falling below 80%. Outages are often silent—the agent doesn't crash but behaves differently with slight delays, missed intents, or increased fallbacks. Infrastructure can show green while users experience broken conversations.

How do you detect voice agent outages in real time?

Detect outages using the 4-Layer Monitoring Framework: (1) ASR layer—track WER, transcription latency, confidence scores; (2) NLU layer—monitor intent accuracy, fallback rate, slot fill success; (3) TTS layer—track P50/P90/P99 latency, audio generation errors; (4) API layer—monitor dependency latency and error rates. Run synthetic 'heartbeat' calls every 5-15 minutes through critical flows. Alert when any layer exceeds thresholds before users notice. If you're hearing about it from customers, you're too late.

Why isn't standard infrastructure monitoring enough for voice agents?

Standard infrastructure monitoring tracks CPU, HTTP 200s, and call volume—but voice agent quality degrades silently. ASR can return empty or garbled transcripts while still returning 200 OK. Tool calls can time out without triggering infrastructure alerts. Provider updates can change behavior and spike fallbacks without any error codes. Voice monitoring needs conversational outcome signals: task completion rates, latency percentiles, WER trends, and frustration markers—not just uptime. Per-flow monitoring catches 'billing flow is broken' before overall metrics shift.

How does Hamming detect voice agent outages?

Hamming continuously runs synthetic 'heartbeat' calls through critical flows every 5-15 minutes and alerts when outcomes or voice quality signals drift: task completion drops, transfer/fallback rates spike, latency percentiles increase, or WER degrades. When something breaks, correlated call traces make it clear whether the outage came from agent logic, an integration dependency, or an upstream vendor. One-click traceability from alert → transcript → audio → model logs accelerates root cause identification.

What are the early warning signs of a voice agent outage?

Early warning signals by layer: ASR—WER spike >5% from baseline, confidence drop <0.75, empty transcripts, transcription latency >500ms; NLU—fallback rate doubling in 15 minutes, intent accuracy <90%, slot fill failures >20%; TTS—P90 latency doubling, audio generation errors >0.5%; API—dependency timeout rate >1%, consecutive failures triggering circuit breakers. Flow-specific alerts are most useful: 'billing flow broken' surfaces faster than aggregate metrics. Track sharp changes over rolling 15-minute windows.

How often should you run synthetic calls?

Synthetic call frequency by period: Business hours (8am-8pm)—every 5 minutes with full scenario rotation; Off-hours (8pm-8am)—every 15 minutes covering critical paths only; Weekends—every 15 minutes on critical paths; After deployments—every 2 minutes for 30 minutes to catch immediate regressions. Rotate through scenario variations (happy path, edge cases, language coverage, API-dependent flows) to ensure broad coverage without excessive cost. Alert if 3 consecutive synthetic calls fail any quality threshold.

What alert thresholds should you configure?

Alert thresholds by severity: CRITICAL (page immediately, <5 min response)—WER >15%, intent accuracy <85%, P90 latency >2s, call failure rate >5%, API error rate >5%; WARNING (Slack, <15 min response)—WER >10%, intent accuracy <90%, P90 latency >1s, call failure rate >2%; INFO (dashboard, <1 hour)—metrics outside normal range, unusual patterns. Configure auto-resolve for warnings that recover within 10 minutes. Deduplicate related alerts within 5-minute windows to prevent fatigue.

What is the 4-Layer Monitoring Framework?

The 4-Layer Monitoring Framework covers all voice agent components: Layer 1 (ASR)—WER <10%, transcription latency P90 <300ms, confidence >0.85; Layer 2 (NLU)—intent accuracy >92%, fallback rate <5%, low-confidence classifications <10%; Layer 3 (TTS)—P50 latency <200ms, P90 <500ms, P99 <1000ms, audio errors <0.1%; Layer 4 (API)—LLM latency P90 <500ms, CRM <200ms, function calls <400ms, all error rates <0.5%. Each layer has distinct failure modes requiring specific monitoring signals.

How do you classify outage severity?

Outage classification matrix: SEV-1 Critical (complete inability to complete tasks)—>50% call failures, API down, respond in <5 minutes; SEV-2 Major (significant degradation)—P90 latency >2s, WER >15%, respond in <15 minutes; SEV-3 Minor (noticeable but functional)—P90 latency >1s, WER >10%, respond in <1 hour; SEV-4 Warning (no user impact yet)—metric drift >5% from baseline, respond in <4 hours. An outage is confirmed when conditions persist for >60 seconds.

How do you prevent alert fatigue?

Prevent alert fatigue with: (1) Deduplication—group related alerts within 5-minute windows; (2) Correlation—link ASR + NLU + TTS alerts when they co-occur (one root cause); (3) Auto-resolve—clear warnings if metrics recover within 10 minutes; (4) Scheduled quiet—suppress non-critical alerts during maintenance windows; (5) Flow-specific routing—route billing alerts to the billing team, not everyone; (6) Severity discipline—only page for true critical issues, use Slack for warnings. Well-tuned alerts should average <3 actionable alerts per day per on-call engineer.

What causes latency spikes in voice agents?

Common latency spike causes by layer: ASR—provider capacity limits during peak hours, audio preprocessing overhead, network latency to the provider; LLM—cold starts, rate limiting, complex prompts requiring more tokens, function call chains; TTS—voice synthesis queue depth, audio encoding overhead, large response texts; API—dependency timeouts, database query slowness, third-party rate limits. Diagnose by measuring latency at each pipeline stage separately. Spikes often cascade—slow ASR delays LLM start, which delays TTS. Track P95/P99, not averages.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions of dollars in revenue per year.

He researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on the dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”