TL;DR: Monitor voice agent outages by tracking ASR WER, NLU intent accuracy, TTS latency percentiles, and API dependency errors in real time. Use synthetic calls and clear alert thresholds to catch silent degradations before customers notice.
Related Guides:
- How to Evaluate Voice Agents: Complete Guide — Hamming's VOICE Framework with all metrics
- Testing Voice Agents for Production Reliability — Hamming's 3-Pillar Testing Framework
- ASR Accuracy Evaluation for Voice Agents — Hamming's 5-Factor ASR Framework
- Multilingual Voice Agent Testing — Hamming's 5-Step Multilingual Testing Framework
- Best Voice Agent Stack — Architecture decisions affecting reliability
Last month, one of our customers had a 45-minute outage. Their voice agent didn't crash—it just started responding 200ms slower than usual. Doesn't sound like much, right? But that extra latency made turn-taking feel off. Users started interrupting more. The agent started cutting them off. Call completion rates dropped 23% before anyone noticed.
They found out because their CSAT scores tanked overnight. By the time they traced it back to an ASR provider having capacity issues, they'd already handled 800 frustrated calls.
The complexity of AI voice agents makes them prone to these silent failures that don't look like outages but absolutely feel like outages to users. Detecting them is difficult without the right observability platform; most teams still rely on manual QA and post-call reviews to discover issues long after users have felt the impact.
Maintaining the reliability of voice agents requires real-time insight into how each layer of the system performs in production. Without that visibility, voice agent outages often go undetected until users start complaining.
In this article, we'll walk you through how to monitor AI voice agent outages in real time.
Quick filter: If you learn about outages from customer complaints, your monitoring is too slow.
Methodology Note: Alert thresholds and monitoring benchmarks in this guide are based on Hamming's analysis of outage patterns across 100+ production voice agents (2025). Thresholds should be calibrated to your specific baseline performance and SLA requirements.
What Is Voice Agent Outage Monitoring?
Voice agent outage monitoring is the continuous measurement of conversation-level reliability signals (ASR accuracy, intent success, TTS latency, and dependency health) to detect silent failures that don’t appear as infrastructure downtime.
Why Are Voice Agent Outages Hard to Detect?
Voice agent outages are often silent: the agent doesn't crash or stop responding, it just behaves differently. That might show up as a slight delay in transcription, a missed intent, or an LLM hallucination.
This makes real-time detection difficult: what looks like a normal interaction in logs is frustrating the user on the other end. Part of the challenge in detecting voice agent outages is the architecture of the voice stack.
Most voice agents depend on multiple probabilistic components: ASR, NLU, TTS, and an LLM. If any layer experiences latency, drift, or a dependency failure, the degradation can ripple across the entire conversation without causing a visible outage.
Another reason voice agent outages are hard to detect is that most teams still rely on manual post-call quality assurance to discover them. By the time an analyst reviews transcripts or listens to recordings, the underlying issue (an API slowdown, a model regression, or an expired integration key) has often already been resolved. What's left behind is a pattern of "fallback" responses, longer pauses, and increased drop-off rates that doesn't explain what went wrong.
How Do You Monitor Voice Agent Outages in Real Time?
Effective outage monitoring requires a systematic approach. Use Hamming's 4-Layer Monitoring Framework to ensure comprehensive coverage across your entire voice stack.
Hamming's 4-Layer Monitoring Framework
Voice agents consist of four interdependent layers. Each layer has distinct failure modes and requires specific monitoring signals:
| Layer | Function | Key Metrics | Failure Mode |
|---|---|---|---|
| ASR | Speech-to-text | Word Error Rate, transcription latency | Silent degradation (mishears but doesn't fail) |
| NLU | Intent classification | Intent accuracy, fallback rate | Wrong routing, default responses |
| TTS | Text-to-speech | P90 latency, audio quality | Delayed or garbled responses |
| API | External dependencies | Error rate, latency percentiles | Timeouts, cascading failures |
Sources: 4-Layer Framework based on Hamming's incident analysis across 100+ production voice agents (2025). Layer categorization aligned with standard voice agent architecture patterns.
How Do You Monitor ASR (Speech Recognition)?
ASR failures are the hardest to detect because the system continues operating—it just mishears users.
Key Metrics:
| Metric | How to Calculate | Healthy | Warning | Critical |
|---|---|---|---|---|
| Word Error Rate (WER) | (S + D + I) / N × 100 | <8% | 8-12% | >12% |
| WER Delta | Current WER - Baseline WER | <2% | 2-5% | >5% |
| Transcription Latency | Time from audio end to transcript | <300ms | 300-500ms | >500ms |
| Confidence Score | ASR model confidence | >0.85 | 0.7-0.85 | <0.7 |
Sources: WER thresholds based on LibriSpeech evaluation standards and Hamming production monitoring data (2025). Latency targets aligned with conversational turn-taking research (Stivers et al., 2009). Confidence thresholds from ASR provider documentation (Deepgram, AssemblyAI, Google STT).
WER Formula:
WER = (Substitutions + Deletions + Insertions) / Total Reference Words × 100
Example: If your baseline WER is 6% and you detect 13%, that's a 7-point delta and a 117% relative degradation: trigger a critical alert.
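Here's a minimal Python sketch of the WER calculation and the alert mapping above; the function names are illustrative and the thresholds mirror the table.

```python
# Minimal sketch: WER and WER-delta checks, using the thresholds from the table above.
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words * 100."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard word-level edit-distance dynamic program.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1) * 100

def wer_alert_level(current_wer: float, baseline_wer: float) -> str:
    """Map absolute WER and WER delta to an alert level."""
    delta = current_wer - baseline_wer
    if current_wer > 12 or delta > 5:
        return "critical"
    if current_wer > 8 or delta > 2:
        return "warning"
    return "healthy"

# Example: baseline 6%, current 13% -> 7-point delta -> "critical"
print(wer_alert_level(current_wer=13.0, baseline_wer=6.0))
```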
How Do You Monitor NLU (Intent Classification)?
NLU failures manifest as increased fallback responses and misrouted conversations.
Key Metrics:
| Metric | Definition | Healthy | Warning | Critical |
|---|---|---|---|---|
| Intent Accuracy | Correct classifications / Total | >92% | 85-92% | <85% |
| Fallback Rate | "Unknown" intents / Total | <5% | 5-10% | >10% |
| Confidence Distribution | % of low-confidence (<0.7) classifications | <10% | 10-20% | >20% |
| Top Intent Drift | Change in top-5 intent distribution | <5% shift | 5-15% shift | >15% shift |
Sources: Intent accuracy benchmarks based on dialogue systems research (Budzianowski et al., 2019) and Hamming customer NLU monitoring data (2025). Fallback rate thresholds derived from contact center industry standards.
Alert Logic:
- If fallback rate doubles within 15 minutes → Warning
- If intent accuracy drops >5% from 24-hour baseline → Critical
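A minimal sketch of that alert logic, assuming you already aggregate fallback rate and intent accuracy into rolling windows; the parameter names are illustrative, and "drops >5%" is read as 5 percentage points.

```python
# Sketch of the NLU alert rules above; metric names and windows are illustrative.
def nlu_alert(fallback_rate_15m: float, fallback_rate_prev_15m: float,
              intent_accuracy_now: float, intent_accuracy_24h_baseline: float) -> str | None:
    # Intent accuracy drops more than 5 points from the 24-hour baseline -> critical
    if intent_accuracy_24h_baseline - intent_accuracy_now > 5:
        return "critical"
    # Fallback rate doubles within 15 minutes -> warning
    if fallback_rate_prev_15m > 0 and fallback_rate_15m >= 2 * fallback_rate_prev_15m:
        return "warning"
    return None
```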
How Do You Monitor TTS (Speech Synthesis)?
TTS failures create awkward pauses and delayed responses that frustrate users.
Key Metrics:
| Metric | Definition | Healthy | Warning | Critical |
|---|---|---|---|---|
| P50 Latency | Median time to first audio byte | <200ms | 200-400ms | >400ms |
| P90 Latency | 90th percentile latency | <500ms | 500-1000ms | >1000ms |
| P99 Latency | 99th percentile latency | <1000ms | 1-2s | >2s |
| Audio Generation Errors | Failed TTS requests / Total | <0.1% | 0.1-1% | >1% |
Sources: TTS latency benchmarks from ElevenLabs, Cartesia, and OpenAI TTS documentation. Error rate thresholds based on Hamming production monitoring across 50+ voice agent deployments (2025).
Why P90/P99 Matter:
- P50 shows typical experience
- P90 shows what 10% of users experience
- P99 catches the worst cases that drive complaints
A system with 200ms P50 but 3s P99 has a hidden outage affecting 1 in 100 calls.
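As a sketch, here's how you might compute those percentiles from raw TTS latency samples using only the standard library; the threshold values mirror the table above.

```python
# Sketch: compute P50/P90/P99 from raw latency samples (milliseconds) and flag breaches.
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}

def tts_alert(p: dict[str, float]) -> str:
    if p["p50"] > 400 or p["p90"] > 1000 or p["p99"] > 2000:
        return "critical"
    if p["p50"] > 200 or p["p90"] > 500 or p["p99"] > 1000:
        return "warning"
    return "healthy"
```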
How Do You Monitor API Dependencies?
External API failures (LLM, CRM, knowledge base) cascade through the entire system.
Key Metrics:
| Dependency | Healthy Latency | Healthy Error Rate | Timeout Threshold |
|---|---|---|---|
| LLM Inference | <500ms P90 | <0.5% | 2s |
| CRM Lookup | <200ms P90 | <0.1% | 1s |
| Knowledge Base | <300ms P90 | <0.1% | 1.5s |
| Function Calls | <400ms P90 | <1% | 2s |
Sources: API latency targets based on OpenAI API SLAs, Anthropic documentation, and typical CRM/enterprise integration benchmarks. Circuit breaker patterns from Release It! (Nygard, 2018) and Hamming infrastructure monitoring.
Circuit Breaker Thresholds:
- 5 consecutive failures → Open circuit (stop calling)
- 30 seconds → Half-open (test with single request)
- Success → Close circuit (resume normal operation)
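A minimal circuit breaker sketch implementing those thresholds (5 consecutive failures to open, a 30-second half-open probe); this is illustrative, not a drop-in client wrapper.

```python
# Minimal circuit breaker sketch: 5 consecutive failures open the circuit,
# after 30 seconds a single probe request is allowed (half-open),
# and a success closes the circuit again.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: normal operation
        if time.monotonic() - self.opened_at >= self.recovery_timeout:
            return True  # half-open: let one probe through
        return False     # open: fail fast

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # open the circuit
```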
How Do You Define What Counts as an Outage?
Before you can monitor for outages, you have to define what an outage looks like for your agent. Think in functional thresholds, not a binary up/down status.
What Is the Outage Classification Matrix?
| Severity | User Impact | Technical Signals | Response Time |
|---|---|---|---|
| SEV-1 Critical | Complete inability to complete tasks | >50% call failures, API down | <5 minutes |
| SEV-2 Major | Significant degradation | P90 latency >2s, WER >15% | <15 minutes |
| SEV-3 Minor | Noticeable but functional | P90 latency >1s, WER >10% | <1 hour |
| SEV-4 Warning | No user impact yet | Metric drift >5% from baseline | <4 hours |
Sources: Severity classification aligned with Google SRE practices and adapted for voice agent-specific signals. Response time SLAs based on Hamming customer incident response data (2025).
What Is the Conversational Downtime Definition?
An outage occurs when any of these conditions persist for >60 seconds:
- Response Latency: Agent fails to respond within 1.2 seconds
- Call Drops: Unexpected disconnections exceed 2% of calls
- Intent Failures: Fallback responses exceed 15% of turns
- Task Completion: Success rate drops below 80% baseline
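Expressed as code, that definition might look like the sketch below, assuming you compute these rates over a rolling 60-second window; the field names are illustrative.

```python
# Sketch: flag a conversational outage when any downtime condition holds
# over a rolling 60-second window.
from dataclasses import dataclass

@dataclass
class WindowStats:
    response_latency_p90_s: float  # agent response latency, 90th percentile
    call_drop_rate: float          # unexpected disconnects / calls
    fallback_rate: float           # fallback responses / turns
    task_completion_rate: float    # completed tasks / attempted tasks

def is_conversational_outage(w: WindowStats, completion_baseline: float) -> bool:
    return (
        w.response_latency_p90_s > 1.2          # agent too slow to respond
        or w.call_drop_rate > 0.02              # >2% unexpected disconnections
        or w.fallback_rate > 0.15               # >15% fallback turns
        or w.task_completion_rate < 0.80 * completion_baseline  # below 80% of baseline
    )
```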
How Do You Determine Monitoring Signals?
For each layer, define the specific patterns that indicate degradation:
What Are ASR Warning Signals?
ASR issues typically surface as transcription quality problems before they cause complete failures:
| Signal | Pattern | Action |
|---|---|---|
| WER Spike | >5% increase in 15 min | Investigate ASR provider status |
| Confidence Drop | Average confidence <0.75 | Check audio quality, background noise |
| Transcription Delays | P90 >500ms | Review ASR capacity, network latency |
| Language Detection Errors | Wrong language >5% | Validate language routing logic |
What Are NLU Warning Signals?
NLU degradation shows up in intent classification patterns—watch for sudden shifts in fallback rates:
| Signal | Pattern | Action |
|---|---|---|
| Fallback Surge | 2x increase in "unknown" | Review recent intent changes |
| Confidence Clustering | Bimodal distribution appears | Retrain or adjust thresholds |
| Intent Drift | Top intents shift >10% | Check for upstream data changes |
| Slot Fill Failures | >20% incomplete slots | Review entity extraction |
What Are TTS Warning Signals?
TTS problems directly impact user experience—delayed or garbled audio is immediately noticeable:
| Signal | Pattern | Action |
|---|---|---|
| Latency Spike | P90 doubles | Check TTS provider capacity |
| Audio Errors | Garbled/missing audio >0.5% | Validate audio encoding pipeline |
| Voice Consistency | Pitch/speed variance | Review TTS configuration |
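One way to encode the signal-to-action mappings in the tables above is a small rule table that a monitoring job evaluates every few minutes. A minimal sketch, where the metric keys and baselines are assumptions about what your pipeline already tracks:

```python
# Sketch: map observed layer metrics to the recommended actions from the tables above.
RULES = [
    ("asr", lambda m: m["wer_delta_15m"] > 5,            "Investigate ASR provider status"),
    ("asr", lambda m: m["avg_confidence"] < 0.75,        "Check audio quality, background noise"),
    ("asr", lambda m: m["transcription_p90_ms"] > 500,   "Review ASR capacity, network latency"),
    ("nlu", lambda m: m["fallback_rate"] >= 2 * m["fallback_rate_baseline"],
                                                         "Review recent intent changes"),
    ("tts", lambda m: m["tts_p90_ms"] >= 2 * m["tts_p90_baseline_ms"],
                                                         "Check TTS provider capacity"),
]

def evaluate_signals(metrics: dict[str, float]) -> list[str]:
    """Return the actions whose warning pattern currently matches."""
    return [f"[{layer}] {action}" for layer, check, action in RULES if check(metrics)]
```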
How Do You Run Synthetic Calls 24/7?
The most reliable way to detect outages before your customers do is to simulate customer calls continuously.
What Is the Synthetic Call Strategy?
| Time Period | Call Frequency | Scenario Coverage |
|---|---|---|
| Business Hours (8am-8pm) | Every 5 minutes | Full scenario rotation |
| Off-Hours (8pm-8am) | Every 15 minutes | Critical paths only |
| Weekends | Every 15 minutes | Critical paths only |
| After Deployments | Every 2 minutes for 30 min | Regression scenarios |
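A sketch of the scheduling logic implied by that table; time zone handling, deployment tracking, and the scenario-set names are assumptions.

```python
# Sketch: pick synthetic call frequency and scenario set from the schedule above.
from datetime import datetime, timedelta

def synthetic_call_plan(now: datetime, last_deploy: datetime | None) -> tuple[timedelta, str]:
    # Tighten cadence for 30 minutes after a deployment.
    if last_deploy and now - last_deploy < timedelta(minutes=30):
        return timedelta(minutes=2), "regression_scenarios"
    is_weekend = now.weekday() >= 5
    business_hours = 8 <= now.hour < 20
    if business_hours and not is_weekend:
        return timedelta(minutes=5), "full_scenario_rotation"
    return timedelta(minutes=15), "critical_paths_only"
```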
What Should You Test in Synthetic Calls?
- Happy Path Scenarios: Standard booking, inquiry, support flows
- Edge Cases: Interruptions, corrections, multi-turn context
- Load Conditions: Concurrent call handling
- Language Coverage: All supported languages/dialects
- API Dependencies: Calls that exercise external integrations
What Are Synthetic Call Success Criteria?
| Metric | Threshold | Alert If |
|---|---|---|
| Task Completion | >95% | <90% for 3 consecutive calls |
| Latency (end-to-end) | <3s average | >4s for any call |
| ASR Accuracy | >90% | <85% for any call |
| Intent Accuracy | >95% | <90% for any call |
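A minimal evaluator for those success criteria, assuming each synthetic call produces a small result record; the field names are illustrative, and "task completion <90% for 3 consecutive calls" is read here as three consecutive failed calls.

```python
# Sketch: evaluate recent synthetic call results against the criteria above.
from dataclasses import dataclass

@dataclass
class SyntheticResult:
    task_completed: bool
    latency_s: float        # end-to-end latency in seconds
    asr_accuracy: float     # percent
    intent_accuracy: float  # percent

def synthetic_alerts(recent: list[SyntheticResult]) -> list[str]:
    alerts = []
    if any(r.latency_s > 4 for r in recent):
        alerts.append("end-to-end latency >4s on a synthetic call")
    if any(r.asr_accuracy < 85 for r in recent):
        alerts.append("ASR accuracy <85% on a synthetic call")
    if any(r.intent_accuracy < 90 for r in recent):
        alerts.append("intent accuracy <90% on a synthetic call")
    last_three = recent[-3:]
    if len(last_three) == 3 and all(not r.task_completed for r in last_three):
        alerts.append("task completion failed on 3 consecutive synthetic calls")
    return alerts
```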
How Do You Automate Alerting?
Configure alerts that route to the right team with appropriate urgency.
What Is the Alerting Threshold Configuration?
| Metric | Warning | Critical | Escalation |
|---|---|---|---|
| WER | >10% | >15% | Slack → PagerDuty |
| Intent Accuracy | <90% | <85% | Slack → PagerDuty |
| P90 Latency | >1s | >2s | Slack → PagerDuty |
| Call Failure Rate | >2% | >5% | Direct PagerDuty |
| API Error Rate | >1% | >5% | Slack → PagerDuty |
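As a sketch, those thresholds can live in a small config that an evaluator maps to a severity and an escalation path; rates are expressed as fractions here, and the routing strings are placeholders.

```python
# Sketch: alert thresholds from the table, mapped to severity and escalation path.
THRESHOLDS = {
    # metric: (warning predicate, critical predicate, escalation path)
    "wer":               (lambda v: v > 10,   lambda v: v > 15,   "slack -> pagerduty"),
    "intent_accuracy":   (lambda v: v < 90,   lambda v: v < 85,   "slack -> pagerduty"),
    "p90_latency_s":     (lambda v: v > 1,    lambda v: v > 2,    "slack -> pagerduty"),
    "call_failure_rate": (lambda v: v > 0.02, lambda v: v > 0.05, "pagerduty"),
    "api_error_rate":    (lambda v: v > 0.01, lambda v: v > 0.05, "slack -> pagerduty"),
}

def classify(metric: str, value: float) -> tuple[str | None, str | None]:
    warn, crit, route = THRESHOLDS[metric]
    if crit(value):
        return "critical", route
    if warn(value):
        return "warning", route
    return None, None
```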
What Is the Alert Routing Matrix?
| Severity | Primary Channel | Escalation | Response SLA |
|---|---|---|---|
| Critical | PagerDuty | VP Engineering @ 15min | 5 minutes |
| Major | PagerDuty | Team Lead @ 30min | 15 minutes |
| Minor | Slack #alerts | None | 1 hour |
| Warning | Slack #monitoring | None | 4 hours |
How Do You Prevent Alert Fatigue?
- Deduplication: Group related alerts within 5-minute windows
- Correlation: Link ASR + NLU + TTS alerts when they co-occur
- Auto-resolve: Clear warnings if metrics recover within 10 minutes
- Scheduled Quiet: Suppress non-critical during maintenance windows
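A minimal sketch of the deduplication and auto-resolve behavior described above; the alert keys and in-memory state are assumptions made for illustration.

```python
# Sketch: suppress duplicate alerts within a 5-minute window and auto-resolve
# warnings whose metric recovers within 10 minutes.
import time

DEDUP_WINDOW_S = 5 * 60
AUTO_RESOLVE_S = 10 * 60

_last_fired: dict[str, float] = {}     # alert key -> last time it was sent
_open_warnings: dict[str, float] = {}  # alert key -> time the warning opened

def fire_warning(alert_key: str) -> bool:
    """Send a warning unless an identical one fired within the dedup window."""
    now = time.time()
    if now - _last_fired.get(alert_key, 0.0) < DEDUP_WINDOW_S:
        return False  # duplicate within the window: suppress
    _last_fired[alert_key] = now
    _open_warnings.setdefault(alert_key, now)
    return True

def check_auto_resolve(alert_key: str, metric_recovered: bool) -> bool:
    """Clear an open warning if its metric recovered within 10 minutes."""
    opened = _open_warnings.get(alert_key)
    if opened is not None and metric_recovered and time.time() - opened <= AUTO_RESOLVE_S:
        _open_warnings.pop(alert_key)
        return True
    return False
```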
What Is the Voice Agent Outage Monitoring Checklist?
Use this checklist to validate your monitoring coverage:
ASR Monitoring:
- WER tracked with baseline comparison
- Transcription latency percentiles (P50/P90/P99)
- Confidence score distribution
- Per-language WER breakdown
NLU Monitoring:
- Intent accuracy with rolling baseline
- Fallback rate alerts configured
- Top intent distribution tracking
- Slot fill success rate
TTS Monitoring:
- Latency percentiles tracked
- Audio generation error rate
- Time-to-first-byte monitoring
API Monitoring:
- All external dependencies tracked
- Circuit breakers configured
- Timeout thresholds defined
- Error rate alerting active
Synthetic Testing:
- 24/7 synthetic calls running
- Critical paths covered
- Post-deployment regression tests
- Multi-language scenarios included
Alerting:
- Severity levels defined
- Routing rules configured
- Escalation paths documented
- Alert fatigue prevention active
How Does Hamming Monitor Voice Agent Outages in Real Time?
With Hamming, teams gain continuous visibility into every layer of the voice stack, from ASR drift to API slowdowns. Real-time alerts, synthetic call testing, and detailed reliability dashboards give you the earliest possible signal when your voice agent starts to degrade.
Hamming's monitoring capabilities include:
- 24/7 Synthetic Calls: Automated testing every 5-15 minutes across all scenarios
- 4-Layer Observability: Unified dashboards for ASR, NLU, TTS, and API metrics
- Intelligent Alerting: Configurable thresholds with Slack, PagerDuty, and webhook integrations
- Root-Cause Tracing: One-click from alert to transcript, audio, and model logs
- Baseline Comparison: Automatic drift detection against rolling performance baselines
Instead of discovering failures hours later in transcripts, you'll know the moment they happen and resolve them before users ever notice.

