Voice Agent Monitoring: The Complete Platform Guide for Production Reliability

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 1M+ voice agent calls to find where they break.

January 17, 2026 · 20 min read

Why Voice Agents Need Specialized Monitoring

Your Datadog dashboard shows everything is green: 99.9% uptime, API latency under 200ms, database queries optimized. But customers are calling back frustrated, abandoning calls mid-conversation, and escalating to human agents.

What's happening?

Generic APM tools miss 60% of voice-specific failures. While Datadog tracks whether your servers are running, it can't tell you if your agent's response time (TTFW) crossed from 600ms to 1200ms, if ASR accuracy dropped from 95% to 88% in noisy environments, if users are saying "I already told you that" because of conversation context loss, or if your intent routing is sending 20% of calls to the wrong flow.

Voice agents need voice-native monitoring. This guide shows you how to build it.

TL;DR: Voice agents need voice-native monitoring using Hamming's 4-Layer Monitoring Stack:

  1. Infrastructure Layer: Audio quality (MOS >4.0), packet loss (<0.1%), jitter (<30ms)
  2. Execution Layer: ASR WER (<8%), LLM latency P90 (<800ms), TTFW (<500ms)
  3. Experience Layer: Retry rate (<5%), abandonment (<8%), sentiment trajectory
  4. Outcome Layer: Task completion (>85%), containment rate (>80%), FCR (>75%)

Generic APM monitors infrastructure. Voice monitoring judges conversation quality. Set alerts on P90 latency, not averages. Detect issues in less than 60 seconds.

Methodology Note: The monitoring framework, metrics, and alert thresholds in this guide are derived from Hamming's monitoring of 1M+ production voice agent calls across 50+ deployments (2024-2025). Thresholds may vary by use case, industry, and user expectations. Our benchmarks represent median performance across healthcare, financial services, e-commerce, and customer support deployments.

Quick filter: If your monitoring can't alert on "TTFW P90 > 1000ms," you're flying blind. This guide is for teams running production voice agents who need real-time visibility into conversation quality, not just infrastructure health.

Teams with simple IVR flows and predictable paths may be able to get by with built-in platform monitoring. But if you're scaling beyond a few hundred calls per day, or operating in regulated industries like healthcare or finance, the monitoring approach needs to change.

The Monitoring Gap: What Generic APM Misses

I used to think infrastructure monitoring was enough. After watching teams discover critical voice agent issues days after they started impacting customers, I changed my position.

Voice agents are live 24/7 in production, handling thousands of customer interactions. Unlike web apps, where you can monitor HTTP status codes and database queries, voice agents operate in a fundamentally different environment:

  • Voice is ephemeral: no page reloads, no retry buttons, just real-time audio.
  • Quality is subjective: "working" doesn't mean "working well."
  • Context matters: the same latency feels different in casual chat vs emergency support.
  • Failures cascade: one slow LLM response triggers missed turn-taking, which creates awkward silence, which causes user frustration, which leads to abandonment.

Generic APM tools like Datadog, New Relic, and Grafana excel at infrastructure monitoring. They track server CPU, memory, and disk. They measure API response times. They monitor database query performance. They capture error rates and exceptions. But they're blind to voice-specific quality signals.

Failure Mode | What Datadog Sees | What Voice Monitoring Sees
Slow LLM response | API latency spike | Turn-level P90 latency impact, TTFW degradation
ASR accuracy drop | Nothing | WER increase, retry rate spike, user frustration
Intent confusion | Nothing | Routing errors, escalation surge, wrong flow entry
Audio issues | Nothing | MOS drop, packet loss patterns, barge-in failures
Dialog loops | Nothing | Turn count anomaly, abandonment spike
Hallucinations | Nothing | Policy violation alerts, compliance drift
Context loss | Nothing | "I already told you" detection, retry patterns
Sentiment shift | Nothing | Frustration markers, negative trajectory

The gap is clear: Datadog tells you your servers are healthy. Voice monitoring tells you your conversations are healthy.

We started calling this the "green dashboard problem" after watching it catch team after team off guard. Everything looks perfect on the infrastructure side while customers experience broken conversations.

Monitoring vs Testing: You Need Both

There's a tension we haven't fully resolved: comprehensive testing before deployment should catch most issues, but production always finds new ways to fail. Different teams land in different places on this tradeoff, but the data is clear that you need both.

Aspect | Testing | Monitoring
When | Before deployment | After deployment (24/7)
What | Simulated scenarios | Real customer calls
Goal | Catch bugs | Detect incidents
Output | Test results | Real-time alerts
Action | Fix before launch | Fix during incident

Testing validates before deployment. Monitoring catches issues in production.

Testing prevents known issues from reaching production. Monitoring catches unknown issues and production degradation over time. A prompt change that worked perfectly in testing can drift in production as conversation patterns evolve. An LLM provider update can change behavior in subtle ways that only surface at scale.

Related: How to Evaluate Voice Agents covers Hamming's VOICE Framework for quality measurement before deployment.

Voice Agent Monitoring Metrics: The Complete List

Before diving into the framework, here's the definitive reference table for voice agent monitoring metrics. This is the table to reach for when someone asks, "What metrics should I monitor for voice agents?"

Metric | Layer | What It Indicates | How to Measure | Good | Critical | Common Causes of Issues
MOS (Mean Opinion Score) | Infrastructure | Audio quality perceived by users | Real-time audio analysis | >4.0 | <3.5 | Network congestion, codec issues
Packet Loss | Infrastructure | Network reliability | RTP stream analysis | <0.1% | >1% | ISP issues, firewall rules
Jitter | Infrastructure | Audio smoothness | RTP timestamp variance | <30ms | >50ms | Network path changes, buffering
ASR WER | Execution | Transcription accuracy | Compare ASR output to reference | <8% | >12% | Background noise, accents, audio quality
TTFW | Execution | Response speed | User silence → first agent audio | <500ms | >1000ms | LLM cold starts, STT buffering, TTS queue
LLM Latency P90 | Execution | Model response time | 90th percentile of LLM calls | <800ms | >1500ms | Provider issues, prompt length, rate limits
Intent Accuracy | Execution | Routing correctness | Predicted vs actual intent | >95% | <90% | Prompt drift, new intents, ambiguous input
Retry Rate | Experience | User not understood | Repeated user inputs per call | <5% | >15% | ASR errors, intent confusion, context loss
Abandonment Rate | Experience | User frustration | Hangups before completion | <8% | >15% | Latency, loops, unresolved issues
Task Completion | Outcome | Business goal achieved | Successful resolutions / total calls | >85% | <70% | All above, plus integration failures
Containment Rate | Outcome | Automation efficiency | Calls without human transfer | >80% | <65% | Agent capability gaps, edge cases
Compliance Adherence | Outcome | Policy followed | Automated compliance checks | >99% | <95% | Prompt injection, hallucination, edge cases
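
To make the "How to Measure" column concrete, here is a minimal sketch of two of these calculations in Python. The inputs (reference transcripts, per-call user turns) are assumptions about what your pipeline already captures, and the retry heuristic is a rough proxy, not a production detector.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words,
    via a standard edit-distance table."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def retry_rate(user_turns: list[str]) -> float:
    """Rough proxy: share of user turns that largely repeat the previous user turn."""
    retries = 0
    for prev, cur in zip(user_turns, user_turns[1:]):
        prev_words, cur_words = set(prev.lower().split()), set(cur.lower().split())
        overlap = len(prev_words & cur_words) / max(len(cur_words), 1)
        if overlap >= 0.8:  # mostly the same words, likely a repeat
            retries += 1
    return retries / max(len(user_turns), 1)

print(word_error_rate("check my account balance", "check my count balance"))  # 0.25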

Monitoring Maturity Model: Where Does Your Team Stand?

Level | Name | Characteristics | Metrics Tracked | Alert Capability
L1 | Basic | Infrastructure only | Uptime, API latency | Manual checks
L2 | Emerging | Add execution metrics | + WER, TTFW, LLM latency | Static thresholds
L3 | Established | Add experience + outcomes | + Retry, abandonment, task completion | Anomaly detection
L4 | Advanced | Predictive + correlated | All metrics + cross-call patterns | ML-based prediction, auto-remediation

Most teams we work with start at L1 (basic infrastructure monitoring) and think they're covered. The gap between L1 and L3 is where 60% of production incidents go undetected until customers complain.

Hamming's 4-Layer Voice Agent Monitoring Stack

Based on monitoring 1M+ production voice agent calls, we developed the 4-Layer Monitoring Stack that covers every dimension of voice agent health. Each layer addresses critical performance questions we've identified through extensive production monitoring.

This is the part I find most interesting. The layers aren't arbitrary: they represent the complete chain from raw infrastructure to business outcomes, and failures at any layer cascade upward in predictable ways.

Layer 1: How to Monitor Audio Quality (MOS, Jitter, Packet Loss)

What it tracks: the technical foundation (audio, network, and system health).

Metric | Definition | Target | Alert Threshold
Audio Quality (MOS) | Mean Opinion Score (1-5 scale) | >4.0 | <3.5
Packet Loss | Network reliability percentage | <0.1% | >1%
Jitter | Latency variation in milliseconds | <30ms | >50ms
Call Setup Time | Time to establish connection | <2s | >5s
Concurrent Calls | System load vs capacity | <80% capacity | >90%

Why it matters: If audio quality is poor or the network is unstable, nothing else matters. This is your foundational layer. Minor packet loss degrades audio, which reduces ASR accuracy, which causes misunderstandings, which triggers inappropriate responses. This cascade remains invisible without infrastructure monitoring.

What to alert on:

  • MOS < 3.5 for >5 minutes (immediate audio quality issue)
  • Packet loss > 1% for >2 minutes (network degradation)
  • Call setup time > 5s (capacity or routing issue)
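
For illustration, here is a rough sketch of how the packet-loss and jitter inputs behind these alerts can be derived from raw RTP packet data, following the RFC 3550 interarrival-jitter estimator. The packet tuples are assumed to come from your media layer, and MOS estimation is left to a dedicated audio-quality model.

def packet_loss(seq_numbers: list[int]) -> float:
    """Fraction of packets missing, from RTP sequence numbers (ignores wraparound)."""
    expected = max(seq_numbers) - min(seq_numbers) + 1
    return 1.0 - len(set(seq_numbers)) / expected

def interarrival_jitter(packets: list[tuple[float, float]]) -> float:
    """RFC 3550 smoothed jitter in seconds.
    packets: list of (rtp_timestamp_seconds, arrival_time_seconds)."""
    jitter = 0.0
    for (s_prev, r_prev), (s_cur, r_cur) in zip(packets, packets[1:]):
        d = abs((r_cur - r_prev) - (s_cur - s_prev))  # transit-time difference
        jitter += (d - jitter) / 16.0                 # exponential smoothing per RFC 3550
    return jitter

loss = packet_loss([1, 2, 3, 5, 6, 7, 8, 9, 10])  # one gap of 4 packets -> 10%, well past the 1% threshold
print(f"packet loss: {loss:.1%}")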

Layer 2: How to Monitor Voice Agent Latency (TTFW, ASR, LLM, TTS)

What it tracks: the AI pipeline (ASR, LLM, TTS, and tool calls).

Metric | Definition | Target | Alert Threshold
ASR WER | Word Error Rate | <8% | >12%
LLM Latency (P90) | 90th percentile response time | <800ms | >1500ms
TTS Latency (P90) | 90th percentile synthesis time | <200ms | >400ms
TTFW | Time to First Word | <500ms | >1000ms
Intent Accuracy | Correct routing percentage | >95% | <90%
Tool Call Success | External integration reliability | >99% | <95%

The end-to-end latency breakdown looks like this:

User speaks → ASR (150ms) → LLM (600ms) → TTS (120ms) → User hears
             └─────────────── TTFW: 870ms ───────────────┘
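
A minimal instrumentation sketch for this breakdown, assuming your pipeline emits a timestamp at each stage boundary (the stage names and fields are illustrative):

import math
from dataclasses import dataclass

@dataclass
class TurnTimings:
    user_stopped_speaking: float   # epoch seconds when user silence was detected
    asr_final: float               # final transcript available
    llm_first_token: float         # first LLM token received
    tts_first_audio: float         # first synthesized audio played to the user

    def ttfw_ms(self) -> float:
        return (self.tts_first_audio - self.user_stopped_speaking) * 1000

    def breakdown_ms(self) -> dict[str, float]:
        return {
            "asr": (self.asr_final - self.user_stopped_speaking) * 1000,
            "llm": (self.llm_first_token - self.asr_final) * 1000,
            "tts": (self.tts_first_audio - self.llm_first_token) * 1000,
        }

def p90(values: list[float]) -> float:
    """Nearest-rank 90th percentile. Alert on this, not the average."""
    ordered = sorted(values)
    return ordered[max(math.ceil(0.9 * len(ordered)) - 1, 0)]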

This is where voice agents live or die. Slow responses feel like awkward pauses. Low accuracy feels like the agent isn't listening. Most teams think latency issues come from the LLM. Actually, in our data, STT buffering is the culprit 60% of the time.

What to alert on:

  • TTFW > 1000ms for >5 minutes (user-visible latency)
  • Intent accuracy drop >5% from baseline (routing failures)
  • Tool call failure rate spike (integration issues)
  • WER > 12% (unacceptable ASR accuracy)

Related: How to Optimize Latency in Voice Agents covers the Latency Optimization Cycle in detail.

Layer 3: How to Monitor User Experience (Frustration, Abandonment, Sentiment)

What it tracks: user signals (frustration, effort, and satisfaction).

Metric | Definition | Target | Alert Threshold
Retry Rate | User repetitions per call | <5% | >15%
Abandonment Rate | Hangups before completion | <8% | >15%
Escalation Rate | Transfers to human | <15% | >25%
Turn Count | Average conversation length | <8 turns | >15 turns
Sentiment Trajectory | Emotion change (positive/negative) | Positive/stable | Negative trend

Frustration signals to monitor (a detection sketch follows this list):

  • "I already told you..." (context loss)
  • Raised voice detection (audio analysis)
  • Long silences after agent response (confusion)
  • Rapid repeated inputs (user frustration)
  • Call-back within 24 hours (task incomplete)
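
A minimal sketch of turning these signals into per-call flags from turn-level transcripts. The phrase lists and field names are illustrative starting points; production detectors combine text, audio, and timing features.

import re

CONTEXT_LOSS_PATTERNS = [r"\bi already told you\b", r"\bi just said\b", r"\blike i said\b"]
FRUSTRATION_PATTERNS = [r"\bthis is ridiculous\b", r"\b(talk|speak) to a (person|human|agent)\b"]

def frustration_markers(turns: list[dict]) -> dict:
    """turns: [{'role': 'user'|'agent', 'text': str, 'silence_after_s': float}, ...]"""
    user_texts = [t["text"].lower() for t in turns if t["role"] == "user"]
    return {
        "context_loss": sum(any(re.search(p, t) for p in CONTEXT_LOSS_PATTERNS) for t in user_texts),
        "explicit_frustration": sum(any(re.search(p, t) for p in FRUSTRATION_PATTERNS) for t in user_texts),
        "long_silences_after_agent": sum(
            1 for t in turns if t["role"] == "agent" and t.get("silence_after_s", 0) > 5
        ),
    }

example = [
    {"role": "user", "text": "I need to update my address"},
    {"role": "agent", "text": "Sure, what's your account number?", "silence_after_s": 1.2},
    {"role": "user", "text": "I already told you my account number"},
]
print(frustration_markers(example))  # {'context_loss': 1, 'explicit_frustration': 0, 'long_silences_after_agent': 0}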

Users tolerate technical glitches but not frustration. Experience metrics predict churn and escalation before they show up in aggregate business metrics.

What to alert on:

  • Retry rate > 15% (users not being understood)
  • Abandonment rate spike >5% above baseline
  • Sentiment trajectory negative for >10 conversations in 1 hour

Related: Voice Agent Analytics to Improve CSAT covers the Four Pillars of Experience Analytics.

Layer 4: How to Monitor Business Outcomes (Task Completion, Compliance, ROI)

What it tracks: business results (task completion, compliance, and value).

Metric | Definition | Target | Alert Threshold
Task Completion | Successful resolutions | >85% | <70%
First Contact Resolution (FCR) | Resolved without callback | >75% | <60%
Containment Rate | Handled without human | >80% | <65%
Compliance Adherence | Policy followed | >99% | <95%
Cost per Resolution | Efficiency (call cost / completed task) | <$2.00 | >$5.00

The business impact calculation makes the stakes concrete:

Hourly Revenue Impact = (Baseline Completion - Current Completion) × Calls/Hour × Revenue/Call

Example:
- Baseline: 90% completion
- Current: 80% completion
- Calls/hour: 100
- Revenue/call: $50

Impact = (0.90 - 0.80) × 100 × $50 = $500/hour lost

Infrastructure and execution don't matter if the agent doesn't accomplish business goals. A task completion drop from 90% to 80% at 100 calls per hour with $50 average revenue per call equals $500 per hour in lost value. That math should drive urgency.

What to alert on:

  • Task completion < 80% for >15 minutes (systemic failure)
  • Compliance adherence < 95% (regulatory risk)
  • Cost per resolution >2x baseline (efficiency collapse)

Real-Time Alerting Architecture

Getting the alert architecture right is harder than it looks. We got this wrong initially: our first implementation alerted on everything, which meant engineers started ignoring alerts. The simpler version with severity tiers works better.

Severity Levels

Level | Response Time | Channel | Example
P0: Critical | <5 min | PagerDuty | Success rate <50%, system down
P1: High | <15 min | Slack (urgent) | Success rate <80%, major degradation
P2: Medium | <1 hour | Slack | P90 latency >2s, minor issues
P3: Low | <4 hours | Email | Trend changes, capacity warnings

Alert Configuration Best Practices

Good Alert:

name: TTFW Degradation
condition: ttfw_p90 > 1000ms
duration: 5 minutes  # Avoid flapping
severity: P1
channels:
  - slack://voice-alerts
  - pagerduty://voice-team
context:
  - current_value
  - baseline_value
  - sample_calls
  - dashboard_link
runbook: /docs/runbooks/ttfw-degradation
cooldown: 30 minutes  # Prevent spam
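
Under the hood, the duration and cooldown fields reduce to a small state machine. A minimal sketch, assuming one TTFW P90 sample per minute; the class and method names are illustrative, not a specific platform's API.

import time

class TtfwAlert:
    def __init__(self, threshold_ms=1000, duration_s=300, cooldown_s=1800):
        self.threshold_ms = threshold_ms
        self.duration_s = duration_s      # breach must be sustained this long (avoid flapping)
        self.cooldown_s = cooldown_s      # minimum gap between notifications (prevent spam)
        self.breach_started = None
        self.last_fired = 0.0

    def observe(self, ttfw_p90_ms: float, now: float | None = None) -> bool:
        """Feed one sample; returns True when a notification should be sent."""
        now = now or time.time()
        if ttfw_p90_ms <= self.threshold_ms:
            self.breach_started = None    # breach cleared, reset the duration clock
            return False
        self.breach_started = self.breach_started or now
        sustained = (now - self.breach_started) >= self.duration_s
        cooled_down = (now - self.last_fired) >= self.cooldown_s
        if sustained and cooled_down:
            self.last_fired = now
            return True                   # caller sends to Slack/PagerDuty with context + runbook link
        return False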

Alert anti-patterns to avoid:

Anti-Pattern | Problem | Fix
No duration | Alert on momentary spikes | Require 5+ min sustained
No cooldown | Alert spam (50 alerts for 1 incident) | 15-30 min cooldown
No context | Slow investigation | Include samples, links, values
No runbook | Ad-hoc response | Link to documented procedure
Wrong severity | Alert fatigue or missed criticals | Calibrate to business impact
Alerting on averages | Miss P95/P99 latency spikes | Always alert on percentiles

Anomaly Detection: Beyond Static Thresholds

Start with static thresholds. Add anomaly detection as you mature.

Approach | Pros | Cons | When to Use
Static thresholds | Simple, predictable | Miss gradual drift | Day 1 setup
Anomaly detection | Catches drift, seasonal patterns | Complex, false positives | After 1 month baseline

Here's an example that anomaly detection catches but static thresholds miss:

  • Monday 9am baseline: 1000 calls/hour, TTFW 600ms
  • Saturday 2am baseline: 50 calls/hour, TTFW 800ms
  • Static threshold (TTFW > 1000ms) misses Saturday degradation because the absolute value is still under threshold
  • Anomaly detection catches it: alert when TTFW > baseline + 20%

The baseline calculation that works well in practice:

Alert when:
  current_ttfw > (baseline_ttfw + 20%)
  AND
  sustained for 10 minutes

Baseline calculated from:
  Same day-of-week, same hour, past 4 weeks
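
A minimal sketch of that comparison, assuming you keep hourly TTFW P90 aggregates keyed by timestamp (the storage shape is an illustrative assumption):

from datetime import datetime, timedelta

def seasonal_baseline(history: dict[datetime, float], now: datetime, weeks: int = 4) -> float | None:
    """Average TTFW for the same weekday and hour over the past `weeks` weeks."""
    samples = []
    for w in range(1, weeks + 1):
        ts = now - timedelta(weeks=w)
        if ts in history:
            samples.append(history[ts])
    return sum(samples) / len(samples) if samples else None

def is_anomalous(current_ttfw: float, baseline: float | None, tolerance: float = 0.20) -> bool:
    """Alert condition: current exceeds the seasonal baseline by more than 20%.
    (Require this to be sustained for 10 minutes before firing, as above.)"""
    return baseline is not None and current_ttfw > baseline * (1 + tolerance)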

Our recommendation: Start static for 2-4 weeks to establish baseline patterns, then layer in anomaly detection for gradual drift.

Dashboard Design: The Command Center

There's a failure mode we've started calling the "dashboard coma": teams build monitoring but never look at it until something breaks. Good dashboard design prevents this by surfacing the right information at the right level of detail.

Executive View (One Glance)

┌─────────────────────────────────────────────────────────┐
│  Voice Agent Health: 94.2%                              │
├───────────────┬───────────────┬─────────────────────────┤
│  Calls Today  │  Success Rate │  TTFW P90               │
│    12,847     │     94.2%     │    720ms                │
│     8%        │     0.3%      │    50ms                 │
├───────────────┴───────────────┴─────────────────────────┤
│  Active Alerts: 1 (P2)                                  │
│  ⚠️ Intent accuracy below baseline in "billing" flow    │
└─────────────────────────────────────────────────────────┘

What it shows:

  • Overall health score (weighted composite; calculation sketch below)
  • Top 3 KPIs with trends
  • Active alerts with severity
  • Actionable insight ("billing flow")
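
The health score in the header is a weighted composite. Here is a minimal sketch of one way to compute it; the component weights and scaling bounds are illustrative assumptions to tune against your own SLOs, not Hamming's actual formula.

def health_score(metrics: dict[str, float]) -> float:
    """Scale each component to 0..1 (1 = at or better than target), then
    combine with per-layer weights into a single percentage."""
    def scale(value, good, bad, higher_is_better=True):
        span = (value - bad) / (good - bad) if higher_is_better else (bad - value) / (bad - good)
        return max(0.0, min(1.0, span))

    components = {  # (scaled score, weight) -- weights sum to 1.0
        "mos":             (scale(metrics["mos"], good=4.0, bad=3.5), 0.15),
        "ttfw_p90_ms":     (scale(metrics["ttfw_p90_ms"], good=500, bad=1000, higher_is_better=False), 0.25),
        "task_completion": (scale(metrics["task_completion"], good=0.85, bad=0.70), 0.40),
        "retry_rate":      (scale(metrics["retry_rate"], good=0.05, bad=0.15, higher_is_better=False), 0.20),
    }
    return 100 * sum(score * weight for score, weight in components.values())

print(round(health_score({"mos": 4.2, "ttfw_p90_ms": 720, "task_completion": 0.88, "retry_rate": 0.06}), 1))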

Operator View (Drill-Down)

┌─────────────────────────────────────────────────────────┐
│  Layer 2: Execution Health                              │
├──────┬───────────────┬──────────────────────────────────┤
│ ASR  │ WER: 6.2%     │ Latency: 145ms                   │
├──────┼───────────────┼──────────────────────────────────┤
│ LLM  │ P50: 520ms    │ P90: 780ms    P99: 1.2s ⚠️       │
├──────┼───────────────┼──────────────────────────────────┤
│ TTS  │ Latency: 95ms │ Quality: 4.1                     │
├──────┴───────────────┴──────────────────────────────────┤
│  Tool Calls                                             │
│  ├─ CRM Lookup: 99.2%    (avg: 180ms)                   │
│  ├─ Calendar API: 98.5%  (avg: 220ms)                   │
│  └─ Payment: 97.8% ⚠️    (avg: 450ms, up from 300ms)    │
└─────────────────────────────────────────────────────────┘

What it shows:

  • Per-layer breakdown (Infrastructure, Execution, Experience, Outcome)
  • Component-level metrics (ASR, LLM, TTS)
  • Sub-component drill-down (each tool call)
  • Trend indicators (↑ ↓ for change detection)

Investigation View (Root Cause)

┌─────────────────────────────────────────────────────────┐
│  Call ID: call_xyz789 | Duration: 4:32 | Failed         │
├─────────────────────────────────────────────────────────┤
│  Timeline:                                              │
│  0:00  User: "I need to check my balance"               │
│  0:02  Agent: "I'd be happy to help..." (TTFW: 2.1s) ⚠️ │
│  0:15  User: "Account 12345"                            │
│  0:18  [Tool: CRM lookup - 180ms - Success]             │
│  0:19  Agent: "I see your account..."                   │
│  1:45  User: "No, that's wrong" (frustration detected)⚠️│
│  ...                                                    │
├─────────────────────────────────────────────────────────┤
│  Issues Detected:                                       │
│  • TTFW exceeded threshold (2.1s > 1s)                  │
│  • Intent confusion at turn 5 (book vs reschedule)      │
│  • User frustration markers at turns 7, 9               │
└─────────────────────────────────────────────────────────┘

What it shows:

  • Turn-by-turn conversation playback
  • Performance annotations (TTFW, tool calls)
  • Issue highlighting (⚠️ markers)
  • Root cause summary

Related: Anatomy of a Perfect Voice Agent Analytics Dashboard covers dashboard design principles in depth.

Integrating with Your Observability Stack

Best practice: Use Datadog for infrastructure, Hamming for voice-specific metrics, and unified alerting in PagerDuty or Slack.

                    ┌─────────────────────┐
                    │   Your Voice Agent  │
                    └──────────┬──────────┘
                               │
            ┌──────────────────┼──────────────────┐
            │                  │                  │
            ▼                  ▼                  ▼
    ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
    │    Datadog    │  │    Hamming    │  │    Grafana    │
    │    (Infra)    │  │    (Voice)    │  │    (Custom)   │
    └───────────────┘  └───────────────┘  └───────────────┘
            │                  │                  │
            │                  │                  │
            └──────────────────┼──────────────────┘
                               │
                    ┌──────────▼──────────┐
                    │   Unified Alerts    │
                    │    (PagerDuty)      │
                    └─────────────────────┘

What goes where:

Data Type | Datadog | Hamming
Server CPU, memory | ✓ | -
API response times | ✓ | ✓ (via OTel)
Database queries | ✓ | -
TTFW, turn latency | - | ✓
ASR accuracy, WER | - | ✓
Intent accuracy | - | ✓
Conversation flow | - | ✓
User frustration | - | ✓
Task completion | - | ✓

The integration point is OpenTelemetry. Configure dual export so both platforms receive the same trace data, then each platform focuses on what it does best.
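
A minimal sketch of dual export with the OpenTelemetry Python SDK (opentelemetry-sdk plus the OTLP/HTTP exporter package). The endpoints and header names below are placeholders; substitute the intake URLs and credentials from each platform's documentation.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()

# Exporter 1: infrastructure APM's OTLP intake (placeholder endpoint and header)
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="https://otlp.example-apm.com/v1/traces",
                     headers={"api-key": "APM_API_KEY"})))

# Exporter 2: voice-native observability intake (placeholder endpoint and header)
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="https://ingest.example-voice.com/v1/traces",
                     headers={"authorization": "Bearer VOICE_API_KEY"})))

trace.set_tracer_provider(provider)
tracer = trace.get_tracer("voice-agent")

with tracer.start_as_current_span("turn", attributes={"ttfw_ms": 720}):
    pass  # your ASR -> LLM -> TTS pipeline work happens here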

Related: Complete Voice Agent QA Platform covers platform architecture overview.

Flaws But Not Dealbreakers

No monitoring approach is perfect. A few things worth understanding upfront:

Initial setup takes time. Expect 2-3 days to configure dashboards, alerts, and baselines for your first deployment. The ROI comes from automated detection afterward, but there's an upfront cost.

Alert tuning is ongoing. You will get false positives in the first few weeks. Budget time for threshold adjustment as you learn what's normal for your specific voice agent.

Voice monitoring requires volume. Some metrics like sentiment trajectory need hundreds of calls to become statistically meaningful. If you're running 20 calls per day, aggregate experience metrics may be noisy.

Integration complexity varies. Teams using standard platforms like Retell or Vapi have smoother integration. Custom voice stacks require more instrumentation work.

Monitoring Implementation Checklist

Day 1: Foundation

  • Instrument TTFW measurement in your voice agent
  • Set up basic dashboard with 4 layers
  • Configure P0/P1 alerts to Slack/PagerDuty
  • Create runbook for top 5 failure modes
  • Establish baseline metrics (start 1 week data collection)

Week 1: Coverage

  • Add all Layer 2 metrics (ASR, LLM, TTS)
  • Implement tool call monitoring
  • Configure anomaly detection baselines
  • Set up daily digest reports
  • Tune alert thresholds based on false positive rate

Month 1: Maturity

  • Add Layer 3 experience metrics (retry rate, sentiment)
  • Implement sentiment analysis
  • Build correlation dashboards (latency → abandonment)
  • Create SLA tracking views
  • Enable production call replay for debugging

Ongoing: Optimization

  • Review alert effectiveness weekly (target false positive rate <10%)
  • Tune thresholds based on seasonal patterns
  • Add new metrics as failure modes are discovered
  • Maintain runbooks as issues are resolved
  • Conduct monthly monitoring retrospectives

Common Monitoring Mistakes

Mistake | Impact | Fix
Only infrastructure monitoring | Miss 60% of voice issues | Add all 4 layers
Static thresholds only | Alert fatigue or miss gradual drift | Add anomaly detection after baseline period
No correlation analysis | Slow root cause identification | Link metrics together
Monitoring without runbooks | Slow incident response | Document every alert with response steps
No production call replay | Can't debug customer issues | Enable call recording with consent
Alerting on averages | Miss P95/P99 spikes | Always alert on percentiles
No alert cooldowns | Alert spam | 15-30 min cooldown per alert
Wrong severity calibration | Alert fatigue or missed criticals | P0 = revenue impact, P1 = customer impact

Voice Agent Monitoring Platforms: Categories and When to Use Each

Not all monitoring tools are the same. Understanding the categories helps you choose the right combination.

Category | What It Does | Examples | When to Use
Infrastructure APM | Server CPU, memory, API latency, errors | Datadog, New Relic, Grafana | Always (baseline for any production system)
Voice-Native Observability | TTFW, WER, intent accuracy, conversation flow | Hamming | When you need to understand why conversations fail
Call Analytics | Call volume, duration, disposition codes | Twilio Insights, call center platforms | For aggregate call patterns and volume planning
Evaluation Platforms | Pre-launch testing, regression testing, scenario coverage | Hamming | Before deployment and continuous testing

Best practice: Use infrastructure APM + voice-native observability together. Datadog tells you servers are healthy; Hamming tells you conversations are healthy.

When to use Hamming specifically:

  • You need to alert on voice-specific metrics (TTFW, WER, intent accuracy)
  • You need turn-level visibility into conversations
  • You need to replay production calls for debugging
  • You need to correlate test failures with production incidents
  • You're running voice agents at scale (100+ calls/day)

How Teams Implement This with Hamming

Hamming provides native support for the 4-Layer Monitoring Stack out of the box:

  • Infrastructure Layer: Hamming tracks MOS, jitter, packet loss via audio stream analysis
  • Execution Layer: Hamming measures TTFW, ASR WER, LLM latency P50/P90/P99, TTS latency
  • Experience Layer: Hamming detects retry patterns, frustration markers, sentiment trajectory, abandonment
  • Outcome Layer: Hamming tracks task completion, containment rate, compliance adherence
  • Alerting: Hamming provides configurable alerts with severity tiers, cooldowns, and runbook links
  • Dashboards: Hamming offers executive, operator, and investigation views with drill-down
  • Call Replay: Hamming enables turn-by-turn replay of any production call with full annotations
  • Regression Testing: Failed production calls automatically become test cases

How to Choose a Voice Agent Monitoring Platform

Use this checklist when evaluating voice agent monitoring platforms:

Criterion | Weight | Questions to Ask
Metrics Coverage | 25% | Does it track all 4 layers? TTFW? WER? Sentiment? Task completion?
Real-Time Alerting | 20% | Can it detect issues in <60 seconds? Configurable severity? Cooldowns?
Investigation Tools | 20% | Can you replay calls? Turn-level annotations? Root cause analysis?
Integration | 15% | OpenTelemetry export? Works with your voice platform (Retell, Vapi, LiveKit)?
Testing + Monitoring | 10% | Does it connect pre-launch testing to production monitoring? Feedback loop?
Compliance | 10% | SOC 2? HIPAA? Data residency options?

A minimum viable monitoring platform should have:

  • TTFW tracking at P50/P90/P99
  • ASR accuracy (WER) measurement
  • Intent accuracy tracking
  • Task completion rate
  • Real-time alerting with <5 min detection
  • Call replay for debugging

Frequently Asked Questions

What is the best platform for monitoring voice agents?

The best voice agent monitoring platform tracks all 4 layers: Infrastructure (MOS, jitter, packet loss), Execution (TTFW, ASR WER, LLM latency), Experience (retry rate, abandonment, sentiment), and Outcomes (task completion, compliance). Hamming is purpose-built for voice agent monitoring with native support for all 4 layers, real-time alerting with <60 second detection, turn-level call replay, and integration with platforms like Retell, Vapi, and LiveKit. Use Datadog or New Relic for infrastructure APM alongside a voice-native platform like Hamming for conversation quality metrics.

What metrics should I monitor for voice agents?

Monitor across 4 layers: Infrastructure (MOS >4.0, packet loss <0.1%, jitter <30ms), Execution (ASR WER <8%, TTFW <500ms, LLM latency P90 <800ms, intent accuracy >95%), Experience (retry rate <5%, abandonment <8%, sentiment trajectory), Outcomes (task completion >85%, containment rate >80%, compliance >99%). Start with TTFW and task completion; these have the highest correlation with user satisfaction. Alert on P90 latency, not averages.

How should I set SLOs and alert thresholds for voice agents?

Set SLOs at the P90 level, not averages. Recommended thresholds: TTFW P90 <1000ms (P1 alert if exceeded for 5 min), task completion >80% (P0 alert if below for 15 min), ASR WER <12% (P1 alert), intent accuracy >90% (P1 alert). Use severity tiers (P0=revenue impact, P1=customer impact, P2=performance), duration filters (5+ min sustained), and cooldowns (15-30 min). Include context (current value, baseline, sample calls) and runbooks with every alert.

What is the difference between voice agent testing and monitoring?

Testing validates before deployment using simulated scenarios, catching bugs before launch. Monitoring catches issues in production 24/7 using real customer calls and real-time alerts. You need both: testing prevents known issues, monitoring catches unknown issues and production degradation over time. A voice agent that passes all tests can still drift or fail in production due to provider updates, conversation pattern changes, or edge cases that weren't simulated.

Can I monitor voice agents with Datadog?

Datadog monitors infrastructure (CPU, API latency, database queries), but misses 60% of voice-specific failures. It can't tell you if TTFW crossed from 600ms to 1200ms, if ASR accuracy dropped from 95% to 88%, if intent routing is sending calls to the wrong flow, or if users are expressing frustration. Best practice: Use Datadog for infrastructure monitoring and a voice-native platform like Hamming for conversation quality metrics. They complement each other.

Which alerts should I set up first?

Start with these 4 high-impact alerts: (1) TTFW P90 > 1000ms for 5 minutes (user-visible latency, immediate satisfaction impact), (2) Task completion rate < 80% for 15 minutes (business-critical systemic failure), (3) ASR WER > 12% for 10 minutes (unacceptable accuracy threshold), (4) Intent accuracy < 90% for 10 minutes (routing failures that cascade into escalations). Tune thresholds based on your baseline after 1 week of monitoring.

How do I avoid alert fatigue?

Use duration filters (5+ minutes sustained), cooldowns (15-30 minutes between same alert), and severity tiers (P0=revenue impact, P1=customer impact, P2=performance). Include context (current value, baseline, sample calls, dashboard link) and runbooks with every alert. Common mistakes: no duration causes alerts on momentary spikes, no cooldown causes 50 alerts for one incident, wrong severity causes everything to feel equally urgent.

What is Hamming's 4-Layer Monitoring Stack?

The 4-Layer Monitoring Stack covers every dimension of voice agent health: Layer 1 (Infrastructure) tracks audio quality (MOS >4.0), packet loss (<0.1%), jitter (<30ms); Layer 2 (Execution) tracks ASR WER (<8%), LLM latency P90 (<800ms), TTFW (<500ms); Layer 3 (Experience) tracks retry rate (<5%), abandonment (<8%), sentiment trajectory; Layer 4 (Outcomes) tracks task completion (>85%), containment rate (>80%), compliance adherence (>99%). Based on monitoring 1M+ production calls.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, grew AI-powered sales program to 100s of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”