Voice Agent Analytics & Post-Call Metrics: Definitions, Formulas & Dashboards

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

February 10, 2026 · Updated February 10, 2026 · 20 min read

Voice agent analytics is the continuous measurement of performance across telephony, ASR, LLM, and TTS layers to ensure production quality. Unlike traditional call center metrics that track averages and aggregate outcomes, voice agent analytics requires layer-by-layer observability—tracing every interaction from audio ingestion through speech recognition, language model inference, and speech synthesis to pinpoint where and why conversations succeed or fail.

| Metric Category | Key Metrics | Production Target |
| --- | --- | --- |
| Task Success | FCR, containment rate, TSR | FCR 70-85%, containment 80%+ |
| Latency | TTFW, turn latency, p90/p95 | P90 <3.5s, TTFW <500ms |
| ASR Quality | WER, confidence scores | WER <5% |
| NLU Accuracy | Intent recognition, slot filling | Intent accuracy 95%+ |
| TTS Quality | MOS, synthesis latency | MOS 4.3-4.5 |
| Safety | Hallucination rate, refusal rate | Hallucination <3% |

At Hamming, we've analyzed 4M+ voice agent calls across 10K+ production voice agents. This guide provides the standardized definitions, mathematical formulas, instrumentation approaches, and benchmark ranges for every critical voice agent metric.

Methodology Note: Metrics, benchmarks, and formula definitions in this guide are derived from Hamming's analysis of 4M+ voice agent interactions across 10K+ production voice agents (2025-2026).

Thresholds validated across healthcare, financial services, e-commerce, and customer support verticals.

Core Voice Agent Metrics and KPIs

First Call Resolution (FCR)

First Call Resolution (FCR) measures the percentage of customer issues fully resolved during the initial interaction without requiring callbacks, transfers, or follow-up contacts.

Formula:

FCR = (Issues resolved on first contact / Total interactions) × 100

| Level | FCR Target | Context |
| --- | --- | --- |
| Baseline | 70-75% | Standard voice agent deployment |
| Good | 75-80% | Optimized flows with knowledge base coverage |
| Top Performer | 85%+ | Specialized, well-defined use cases |

Measurement approach: Use 48-72 hour verification windows. If a customer contacts again within that window about the same issue, the original interaction did not achieve resolution—even if it was marked complete.

Segmentation matters: FCR varies significantly by intent category. Appointment scheduling may achieve 90%+ FCR while complex troubleshooting sits at 60-65%. Report FCR by intent to identify specific improvement opportunities rather than optimizing a blended average.
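A minimal sketch of the verification-window calculation, segmented by intent. The call-record fields (`caller_id`, `intent`, `started_at`, `marked_resolved`) are illustrative assumptions, not a prescribed schema:

```python
from datetime import timedelta

VERIFICATION_WINDOW = timedelta(hours=72)

def fcr_by_intent(calls):
    """FCR per intent with a 72-hour callback verification window.
    Each call is a dict with assumed fields: caller_id, intent,
    started_at (datetime), marked_resolved (bool)."""
    calls = sorted(calls, key=lambda c: c["started_at"])
    resolved, total = {}, {}
    for i, call in enumerate(calls):
        intent = call["intent"]
        total[intent] = total.get(intent, 0) + 1
        if not call["marked_resolved"]:
            continue
        # A resolution only counts if the same caller does NOT return
        # about the same intent within the verification window.
        callback = any(
            later["caller_id"] == call["caller_id"]
            and later["intent"] == intent
            and later["started_at"] - call["started_at"] <= VERIFICATION_WINDOW
            for later in calls[i + 1:]
        )
        if not callback:
            resolved[intent] = resolved.get(intent, 0) + 1
    return {name: 100 * resolved.get(name, 0) / n for name, n in total.items()}
```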

Containment Rate

Containment rate measures the percentage of calls handled entirely by the AI agent without escalation to a human operator.

Formula:

Containment Rate = (AI-resolved calls / Total calls) × 100

| Level | Target | Use Case Context |
| --- | --- | --- |
| Excellent | >80% | Well-defined transactional flows |
| Good | 70-80% | Standard customer service |
| Acceptable | 60-70% | Complex queries, new deployments |
| FAQ Bots | 40-60% | Simple information retrieval |

Target 80%+ after optimization, though rates vary substantially by use case complexity. Healthcare triage may appropriately target 60-70% while appointment scheduling targets 85%+.

Critical caveat: High containment with low CSAT indicates false containment—users abandoning rather than getting helped. Always pair containment tracking with task completion and satisfaction metrics.

Track escalation patterns by reason to identify actionable improvements; a tally sketch follows this list:

  • Knowledge gap (agent lacks required information)
  • Authentication failure
  • User preference (explicitly requested human)
  • Conversation breakdown (intent confusion, loops)
  • Policy requirement (regulatory escalation triggers)
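A minimal version, assuming each call record carries `escalated` and `escalation_reason` fields (illustrative names):

```python
from collections import Counter

def containment_and_reasons(calls):
    """Containment rate plus a per-reason escalation breakdown.
    Each call is a dict with assumed fields: escalated (bool) and
    escalation_reason (e.g., 'knowledge_gap', 'auth_failure')."""
    total = len(calls)
    escalated = [c for c in calls if c["escalated"]]
    containment = 100 * (total - len(escalated)) / total if total else 0.0
    reasons = Counter(c["escalation_reason"] for c in escalated)
    return containment, reasons.most_common()
```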

Intent Recognition Accuracy

Intent recognition accuracy measures how correctly the voice agent classifies customer requests into predefined intent categories.

Formula:

Intent Recognition Accuracy = (Correct intent matches / Total utterances) × 100

| Level | Accuracy | Assessment |
| --- | --- | --- |
| Production-ready | >95% | Required for customer-facing deployment |
| Acceptable | 90-95% | Requires monitoring and prompt tuning |
| Below threshold | <90% | Not production-ready without improvement |

Production requires 95%+ accuracy. Intent misclassification cascades through the entire conversation—a misrouted caller enters the wrong flow, receives irrelevant responses, and either escalates or abandons.

Track intent coverage rate alongside accuracy: the percentage of incoming calls that match a fully supported intent category. Low coverage (many "fallback" or "unknown" intents) indicates gaps in your intent taxonomy rather than classification quality.
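Both numbers fall out of a labeled test set; a sketch over `(predicted, true)` intent pairs (structure assumed for illustration):

```python
def intent_metrics(samples, supported_intents):
    """Intent accuracy and coverage from labeled (predicted, true) pairs.
    Utterances routed to 'fallback'/'unknown' count against coverage."""
    n = len(samples)
    correct = sum(1 for pred, true in samples if pred == true)
    covered = sum(1 for pred, _ in samples if pred in supported_intents)
    return {
        "intent_accuracy_pct": 100 * correct / n,   # target: 95%+
        "coverage_pct": 100 * covered / n,          # low => taxonomy gaps
    }
```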

Task Success Rate (TSR)

Task Success Rate (TSR) tracks completed objectives relative to total interaction attempts.

Formula:

TSR = (Successful task completions / Total interactions) × 100

| Use Case | TSR Benchmark |
| --- | --- |
| Appointment scheduling | 90-95% |
| Order status inquiry | 88-93% |
| Payment processing | 85-90% |
| Technical troubleshooting | 75-85% |
| Complex multi-step flows | 70-80% |

Benchmark 85-95% for specialized implementations with well-defined success criteria. TSR differs from FCR in that it measures whether the agent completed its designated task, regardless of whether the customer needed to call back for a different issue.

Define explicit completion criteria for each task type. "Appointment scheduled" is unambiguous. "Customer helped" is not.
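One way to enforce that discipline is to encode completion criteria as machine-checkable predicates per task type; the call-summary fields below are illustrative assumptions:

```python
# Explicit completion criteria per task type (assumed field names).
COMPLETION_CRITERIA = {
    "appointment_scheduling": lambda s: s.get("appointment_id") is not None,
    "order_status": lambda s: s.get("status_read_to_caller", False),
    "payment": lambda s: s.get("payment_confirmation") is not None,
}

def tsr(call_summaries):
    """TSR = (successful task completions / total interactions) x 100.
    Unknown task types count as failures rather than being skipped."""
    completed = sum(
        1 for s in call_summaries
        if COMPLETION_CRITERIA.get(s["task"], lambda _: False)(s)
    )
    return 100 * completed / len(call_summaries)
```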

Customer Satisfaction (CSAT) Score

CSAT measures satisfaction with individual interactions, typically on a 1-5 scale.

Formula:

CSAT = (Satisfied responses [4-5 rating] / Total survey responses) × 100

| Level | CSAT Target |
| --- | --- |
| Excellent | >85% |
| Good | 75-85% |
| Acceptable | 65-75% |
| Poor | <65% |

Target 75-85% with AI-based automated scoring supplementing explicit surveys. Voice agents can embed satisfaction measurement directly in conversations, achieving 30%+ higher completion rates than post-call email surveys.

Beyond explicit ratings: Track caller frustration through sentiment trajectory, interruption patterns, and repetition frequency. AI-based scoring analyzes tone, sentiment, resolution speed, and conversation ending patterns to predict CSAT without requiring explicit surveys—useful when survey response rates are low.


Latency and Response Time Metrics

Time to First Word (TTFW)

Time to First Word (TTFW) measures initial response delay from user speech completion (VAD silence detection) to the first agent audio reaching the caller.

Formula:

TTFW = Agent first audio byte time − VAD silence detection time

| Threshold | User Experience |
| --- | --- |
| <300ms | Natural, indistinguishable from human conversation |
| 300-500ms | Acceptable for most users |
| 500-800ms | Noticeable delay, users begin adapting speech patterns |
| >800ms | Conversation breakdown begins |

Target sub-500ms as ideal, with 800ms as the acceptable production threshold. Note that TTFW measures only the initial response delay—the time until the first audio byte reaches the caller. This differs from total turn latency (covered below), which measures complete end-to-end response time. Based on Hamming's analysis of production voice agents, industry median total turn latency is 1.4-1.7 seconds—significantly slower than the 300ms human conversational expectation. This gap explains why users report agents that "feel slow" or "keep getting interrupted."
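From a per-turn event log, TTFW is a single timestamp difference; the event names and fields here are illustrative:

```python
def ttfw_ms(turn_events):
    """TTFW for one turn: first agent audio byte minus VAD silence
    detection. Events are dicts with assumed 'type' and 'ts' (ms) fields."""
    t_silence = next(e["ts"] for e in turn_events if e["type"] == "vad_silence")
    t_audio = next(e["ts"] for e in turn_events if e["type"] == "agent_first_audio")
    return t_audio - t_silence  # target: <500ms ideal, <800ms acceptable
```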

Total Conversation Latency

Total conversation latency measures end-to-end response time including ASR processing, LLM inference, and TTS synthesis for each conversational turn.

Component breakdown (typical production):

  • Audio transmission: ~40ms
  • STT processing: 150-350ms
  • LLM inference (TTFT): 200-800ms (typically 70% of total)
  • TTS synthesis: 100-200ms
  • Audio playback: ~30ms

LLM inference dominates total latency, making model selection and prompt optimization the highest-leverage improvement targets.

Latency Percentiles (P50, P90, P95, P99)

Track percentile distributions rather than averages to expose performance outliers that degrade user experience for significant portions of your traffic.

Production latency benchmarks:

| Percentile | Response Time | User Experience | Action Required |
| --- | --- | --- | --- |
| P50 (median) | 1.5s | Noticeable delay, functional | Optimize LLM inference |
| P90 | 3.5s | Significant frustration, talk-overs | Investigate infrastructure and model |
| P95 | 5.0s | Severe delay, frequent abandonment | Immediate attention required |
| P99 | 10s+ | Complete breakdown | Critical incident |

Why percentiles matter: A system reporting 500ms average latency may have 10% of calls experiencing 3.5+ second delays. Those callers don't care about the average—they experience a broken product. Alert on p90 threshold breaches (3.5s), not mean degradation.

Alert configuration: Set notifications when p90 latency exceeds 3.5s for a sustained 5-minute window. This catches degradation before it becomes widespread while filtering transient spikes.
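A standard-library sketch of the percentile computation plus the sustained-window alert described above:

```python
import statistics

def latency_percentiles(latencies_ms):
    """p50/p90/p95/p99 over a batch of per-turn latencies (ms)."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p95": cuts[94], "p99": cuts[98]}

def p90_alert(per_minute_p90s, threshold_ms=3500, sustained_minutes=5):
    """Fire only when p90 breaches the threshold for N consecutive
    one-minute windows, filtering transient spikes."""
    recent = per_minute_p90s[-sustained_minutes:]
    return (len(recent) == sustained_minutes
            and all(p > threshold_ms for p in recent))
```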


Speech Recognition Quality Metrics

Word Error Rate (WER)

Word Error Rate (WER) is the industry standard metric for ASR accuracy, measuring the percentage of words incorrectly transcribed.

Formula:

WER = (S + D + I) / N × 100

Where:
S = Substitutions (wrong words)
D = Deletions (missing words)
I = Insertions (extra words)
N = Total words in reference transcript

| Level | WER | Assessment |
| --- | --- | --- |
| Enterprise | <5% | Required for production deployment |
| Acceptable | 5-8% | Standard deployment, optimization needed |
| Below threshold | 8-12% | Not production-ready for high-stakes use |
| Poor | >12% | Requires fundamental ASR changes |

Require under 5% WER for production. ASR errors cascade through the entire pipeline—a misrecognized word becomes a wrong intent becomes a failed task. Track WER segmented by noise condition, accent, and domain vocabulary to identify specific improvement areas.
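Against a reference transcript, WER reduces to word-level edit distance; a self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N x 100 via word-level edit distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = min edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100 * dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference -> 25% WER
print(wer("book a table tomorrow", "book a cable tomorrow"))  # 25.0
```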

Transcription Confidence Scores

ASR systems provide probability scores indicating transcription certainty for each word or utterance segment. These scores enable real-time quality monitoring without requiring reference transcripts.

| Confidence Level | Score Range | Action |
| --- | --- | --- |
| High | >0.85 | Process normally |
| Medium | 0.6-0.85 | Flag for monitoring |
| Low | <0.6 | Trigger re-prompting or human review |

Production use: Flag low-confidence segments (below 0.6) for re-prompting strategies—ask the caller to repeat or rephrase rather than proceeding with uncertain transcription. Monitor confidence score distributions over time to detect ASR drift.
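The table's thresholds translate directly into a routing function; the cutoffs should be tuned per ASR provider:

```python
def route_by_confidence(confidence: float) -> str:
    """Map an ASR confidence score to a handling action."""
    if confidence > 0.85:
        return "process"               # high: continue the flow
    if confidence >= 0.6:
        return "flag_for_monitoring"   # medium: log for drift analysis
    return "reprompt"                  # low: ask caller to repeat or rephrase
```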

Speaker Diarization Accuracy

Speaker diarization identifies and separates multiple speakers in a conversation, attributing each utterance to the correct participant.

Critical for:

  • Multi-party calls (caller + agent + transferred party)
  • Accurate context attribution in analytics
  • Compliance monitoring requiring speaker-specific tracking
  • Training data quality for model improvement

Track diarization error rate as the percentage of speech segments attributed to the wrong speaker. Production systems should achieve under 5% diarization error for two-speaker conversations.


Natural Language Understanding Metrics

Intent Coverage Rate

Intent coverage rate measures the percentage of incoming calls that match a fully supported intent category in your voice agent's taxonomy.

Formula:

Intent Coverage = (Calls matching supported intents / Total calls) × 100

Track coverage gaps—calls routed to "fallback" or "unknown" intent categories—to identify where your agent lacks capability. High fallback rates (above 15%) indicate taxonomy gaps rather than classification errors.

Action pattern: Review fallback utterances weekly. Cluster similar requests and evaluate whether they warrant new intent categories or expanded training data for existing intents.

Semantic Accuracy Rate

Semantic accuracy measures whether agent responses align with the user's actual meaning, going beyond keyword matching to evaluate contextual understanding.

Unlike intent accuracy (which measures classification), semantic accuracy evaluates whether the agent's response appropriately addresses what the user meant—even when the intent was correctly classified.

Validation approach: Conduct periodic manual audits or use LLM-as-judge evaluation pipelines. Sample 100-200 conversations per week, scoring whether responses were semantically appropriate given the full conversation context. LLM-as-judge approaches achieve 95%+ agreement with human evaluators when using two-step evaluation pipelines.
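A minimal two-step judge sketch; `call_llm` is a hypothetical wrapper around whatever model API you use, and the prompts are illustrative rather than a validated rubric:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your LLM provider's completion API."""
    raise NotImplementedError

def judge_semantic_accuracy(transcript: str) -> bool:
    # Step 1: restate the caller's actual goal from full context.
    goal = call_llm(
        "In one sentence, what was the caller trying to achieve?\n" + transcript
    )
    # Step 2: score the agent's responses against that restated goal.
    verdict = call_llm(
        f"Caller goal: {goal}\nTranscript:\n{transcript}\n"
        "Did the agent's responses appropriately address this goal? "
        "Answer strictly YES or NO."
    )
    return verdict.strip().upper().startswith("YES")
```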

Slot Filling Accuracy

Slot filling accuracy tracks successful extraction of required parameters (names, dates, account numbers, addresses) from user utterances before task execution.

Formula:

Slot Filling Accuracy = (Correctly extracted slots / Total required slots) × 100

Production target: 90%+ slot filling accuracy. Failed slot extraction forces repetitive re-prompting that degrades user experience. Track accuracy by slot type—dates and numbers typically achieve higher accuracy than proper nouns and addresses.


Text-to-Speech Quality Metrics

Mean Opinion Score (MOS)

Mean Opinion Score (MOS) is a subjective 1-5 scale rating of TTS naturalness, clarity, and overall quality, following the ITU-T P.800 standard.

| MOS Score | Quality Level |
| --- | --- |
| 4.5-5.0 | Excellent, indistinguishable from human |
| 4.3-4.5 | Very good, rivals human speech quality |
| 3.8-4.3 | Good, clearly synthetic but natural |
| 3.0-3.8 | Fair, robotic qualities noticeable |
| <3.0 | Poor, unacceptable for production |

Target 4.3-4.5 to rival human speech quality benchmarks. MOS remains the gold standard for TTS evaluation despite being resource-intensive, requiring crowdsourced evaluation or automated MOSNet scoring.

Voice Consistency Rate

Voice consistency measures stable prosody, tone, and pacing throughout an entire conversation. Inconsistent voice characteristics—sudden pitch shifts, pacing changes, or tonal breaks—break user immersion and erode trust.

Monitor for:

  • Pitch stability across conversation turns
  • Pacing consistency (words per minute variance)
  • Tonal alignment with conversation context (empathetic when appropriate)
  • Cross-session consistency for returning callers

Audio Synthesis Latency

TTS synthesis latency measures the time required to generate audio output from text input.

| Percentile | Target | Impact |
| --- | --- | --- |
| P50 | <150ms | Contributes to natural conversational flow |
| P90 | <300ms | Acceptable production threshold |
| P95 | <500ms | Investigate TTS provider performance |

Track TTS p90 latency under 300ms to maintain conversational rhythm. TTS latency combines with STT and LLM latency to determine total turn latency—optimizing any single component improves end-to-end experience.


Hallucination Detection and Safety Metrics

Hallucination Rate

Hallucination rate tracks instances where the voice agent generates fabricated information, invented facts, or confident responses not grounded in its knowledge base.

Formula:

Hallucination Rate = (Hallucinated responses / Total responses) × 100

Target under 3% occurrence for general deployments. Regulated industries (healthcare, financial services) should target under 1%.

Detection approaches:

  • Real-time validation against knowledge base sources
  • LLM-as-judge evaluation on sampled conversations
  • Tracking five or more consecutive transcription errors as potential hallucination signals
  • Monitoring responses with high confidence but no matching source documents

Safety Refusal Rate

Safety refusal rate measures the percentage of adversarial, inappropriate, or out-of-scope prompts correctly rejected by the voice agent.

Track both:

  • True positive refusals: Correctly blocked adversarial or policy-violating requests
  • False positive refusals: Legitimate requests incorrectly blocked (over-aggressive guardrails)

Balance is critical. Under-refusing exposes your system to misuse. Over-refusing creates frustrated users who can't complete legitimate tasks.

Source Grounding Score

Source grounding validates that agent responses are traceable to verified knowledge base content, flagging "confident answers with no matching source" as potential hallucinations.

Implementation: For each response, check whether the key claims map to retrieved knowledge base passages. Responses with high confidence but low source overlap should trigger review and potential re-prompting.
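A crude token-overlap version of that check, as a sketch only; production systems typically use embedding similarity or claim-level entailment instead:

```python
def grounding_score(response: str, retrieved_passages: list[str]) -> float:
    """Fraction of response tokens that also appear in retrieved passages."""
    resp = set(response.lower().split())
    sources = set(" ".join(retrieved_passages).lower().split())
    return len(resp & sources) / len(resp) if resp else 0.0

def needs_review(response, passages, model_confidence, floor=0.5):
    """Flag confident answers with weak source overlap (assumed thresholds)."""
    return model_confidence > 0.8 and grounding_score(response, passages) < floor
```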


Conversation Quality Scoring

Call Deflection Success

Call deflection success measures prevention of unnecessary human transfers when the AI agent could resolve the issue, calculated against baseline pre-automation escalation rates.

Formula:

Deflection Success = (Pre-automation escalations - Current escalations) / Pre-automation escalations × 100

This metric only makes sense relative to historical baselines. Compare current escalation patterns to pre-automation rates, segmented by intent category.

Interruption Frequency

Interruption frequency counts instances where the agent speaks over the user or responds before the user completes their thought. High interruption rates indicate ASR timing issues, specifically problems with Voice Activity Detection (VAD) or end-of-turn prediction.

| Level | Interruption Rate | Assessment |
| --- | --- | --- |
| Good | <5% of turns | Natural conversation flow |
| Acceptable | 5-10% of turns | Monitor VAD configuration |
| Poor | >10% of turns | Immediate VAD tuning required |

Diagnostic approach: Distinguish between agent-caused interruptions (premature response) and user-caused interruptions (barge-in). Agent-caused interruptions indicate system issues. User-caused barge-ins may indicate latency problems prompting users to repeat themselves.

Conversation Abandonment Rate

Conversation abandonment rate tracks calls ended by the user mid-conversation before reaching resolution, signaling poor experience or agent failure.

Formula:

Abandonment Rate = (Calls abandoned before resolution / Total calls) × 100

Segment abandonment by the following (a bucketing sketch follows this list):

  • Time in call: Early abandonment (under 30s) suggests greeting or routing issues
  • Intent stage: Abandonment during slot filling suggests re-prompting fatigue
  • After specific turns: Identifies exact conversation points causing drop-off
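A sketch of that segmentation, assuming `duration_s` and `stage` fields on each abandoned-call record (illustrative names):

```python
def abandonment_buckets(abandoned_calls):
    """Bucket abandoned calls by time-in-call and by conversation stage."""
    by_time = {"early_<30s": 0, "mid_30-120s": 0, "late_>120s": 0}
    by_stage = {}
    for call in abandoned_calls:
        if call["duration_s"] < 30:
            by_time["early_<30s"] += 1      # greeting/routing issues
        elif call["duration_s"] <= 120:
            by_time["mid_30-120s"] += 1     # often slot-filling fatigue
        else:
            by_time["late_>120s"] += 1
        by_stage[call["stage"]] = by_stage.get(call["stage"], 0) + 1
    return by_time, by_stage
```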

Voice Agent Analytics Framework

4-Layer Monitoring Approach

Implement Hamming's 4-layer analytics framework for comprehensive voice agent observability:

| Layer | Function | Key Metrics | Failure Modes |
| --- | --- | --- | --- |
| Layer 1: Telephony & Audio | Audio quality, transport health | Packet loss, jitter, SNR, codec latency | Garbled audio, dropouts, echo |
| Layer 2: ASR & Transcription | Speech-to-text accuracy | WER, confidence scores, transcription latency | Mishearing, silent failures, drift |
| Layer 3: LLM & Semantic | Intent and response generation | TTFT, intent accuracy, hallucination rate | Wrong routing, confabulation, scope creep |
| Layer 4: TTS & Generation | Speech synthesis quality | Synthesis latency, MOS, voice consistency | Delays, robotic speech, voice drift |

Issues cascade across layers. An audio quality problem (Layer 1) causes transcription errors (Layer 2), which cause intent misclassification (Layer 3), which causes task failure. Without layer-by-layer instrumentation, you see the task failure but not the root cause.

Real-Time vs Post-Call Analytics

Balance immediate alerting with deep post-call analysis:

| Approach | Purpose | Latency | Depth |
| --- | --- | --- | --- |
| Real-time | Detect degradation as it happens | Seconds | Surface-level indicators |
| Near-real-time | Pattern identification within sessions | Minutes | Trend analysis |
| Post-call | Root cause analysis, model improvement | Hours | Full conversation evaluation |

Real-time monitoring catches outages and severe degradation. Post-call analysis identifies systematic patterns—specific prompts that consistently underperform, intent categories with declining accuracy, or time-of-day latency variations—that inform model improvements and prompt optimization.


Dashboard Design and Visualization

Essential Dashboard Components

A production voice agent dashboard must answer four questions within 30 seconds:

  1. Is the system healthy? — Call volume trends, error rates, infrastructure status
  2. Are users satisfied? — CSAT trajectory, abandonment rates, sentiment patterns
  3. Where are the problems? — Latency percentile distributions, WER trends, containment drops
  4. What changed? — Deployment markers, model version annotations, configuration diffs

Display call volume trends, latency percentile distributions (p50/p90/p95), containment rates, and sentiment analysis with one-click drill-downs from anomaly to individual conversation transcript and audio playback.

Alert Configuration Best Practices

Set thresholds on percentile metrics, not averages. Average-based alerts mask degradation affecting minority populations of calls.

| Metric | Alert Threshold | Duration | Severity |
| --- | --- | --- | --- |
| P90 latency | >3.5s | 5 minutes | Warning |
| P95 latency | >5.0s | 5 minutes | Critical |
| Containment rate | <60% | 1 hour | Warning |
| WER | >8% | 15 minutes | Warning |
| Hallucination rate | >3% | 30 minutes | Critical |
| TTFW p95 | >800ms | 5 minutes | Warning |

Trigger notifications before customer-facing degradation becomes widespread. A p90 alert at 3.5s catches the problem when 10% of users are affected, not when the average crosses a threshold that requires 50%+ degradation.

Metric Correlation Views

Link upstream failures to downstream impacts to trace root causes efficiently:

  • High WER → Low intent accuracy → Low TSR — ASR degradation cascading to task failure
  • Latency spike → High interruption rate → High abandonment — Infrastructure issue causing conversation breakdown
  • Low confidence scores → High fallback rate → Low containment — ASR uncertainty driving escalations

Build correlation dashboards that surface these causal chains automatically, enabling operators to jump from symptom to root cause without manual investigation.


ROI and Business Impact Metrics

Cost Per Interaction

Calculate total resolution cost including infrastructure, model inference, and telephony:

| Channel | Cost Range | Context |
| --- | --- | --- |
| Human agent | $5-8 per call | Fully loaded: salary, benefits, training, facilities |
| AI voice agent | $0.01-0.25 per minute | Infrastructure, model inference, telephony |

Comparison framework: For a 3-minute average call, AI costs $0.03-0.75 versus $5-8 for human handling—a 10-250x cost reduction depending on complexity and infrastructure choices.

Automation ROI Formula

Formula:

ROI = (Containment Rate × Call Volume × Per-Call Savings - Infrastructure Costs) / Infrastructure Costs × 100

Expected returns: 200-500% ROI within 3-6 months for well-implemented deployments with sub-six-month payback periods.

Worked example: 10,000 monthly calls × 75% containment × $6 per-call savings = $45,000 monthly savings. Against $8,000 monthly infrastructure costs: ($45,000 - $8,000) / $8,000 = 462% ROI.
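The worked example reproduces directly in code:

```python
def automation_roi(monthly_calls, containment, per_call_savings, infra_cost):
    """ROI = (savings - infrastructure costs) / infrastructure costs x 100."""
    savings = monthly_calls * containment * per_call_savings
    return 100 * (savings - infra_cost) / infra_cost

# 10,000 calls x 75% containment x $6 savings vs $8,000/month infrastructure
print(automation_roi(10_000, 0.75, 6.0, 8_000))  # 462.5 (~462% ROI)
```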


Industry Benchmarks and Thresholds (2026)

Performance Benchmark Summary Table

| Metric | Baseline | Good | Top Performer | How to Measure |
| --- | --- | --- | --- | --- |
| FCR | 70% | 75-80% | 85%+ | 48-72hr callback verification |
| Containment | 60% | 70-80% | 80%+ | Escalation event tracking |
| Latency P90 | 4.5s | 3.5s | <2.5s | Percentile distribution monitoring |
| WER | 8% | 5% | <3% | Reference transcript comparison |
| CSAT | 65% | 75-80% | 85%+ | In-call + AI-predicted scoring |
| MOS | 3.8 | 4.3 | 4.5+ | Crowdsourced + MOSNet evaluation |
| Intent Accuracy | 90% | 95% | 98%+ | Labeled test set evaluation |
| Hallucination Rate | 5% | 3% | <1% | Source grounding validation |

Benchmark Variation by Use Case

Performance expectations vary significantly by voice agent complexity:

| Use Case | Typical Containment | FCR Target | Latency Tolerance |
| --- | --- | --- | --- |
| FAQ / Information | 40-60% | 80%+ | P90 <4.0s |
| Appointment Scheduling | 75-85% | 90%+ | P90 <3.0s |
| Order Management | 70-80% | 80%+ | P90 <3.5s |
| Customer Service | 65-80% | 70-80% | P90 <3.5s |
| Technical Support | 50-65% | 65-75% | P90 <4.0s |
| Healthcare Triage | 55-70% | 75-85% | P90 <3.5s |

Simpler FAQ bots achieve 40-60% containment while complex customer service targets 75-80%. Adjust expectations by complexity rather than applying uniform benchmarks.


Production Monitoring and Alerting

Critical Alert Definitions

Configure tiered notifications linking severity to response expectations:

| Alert Type | Trigger | Severity | Response SLA |
| --- | --- | --- | --- |
| Containment drop | Below 60% for 1 hour | P2 - High | 30 minutes |
| Latency spike | P90 >3.5s for 5 minutes | P2 - High | 15 minutes |
| WER degradation | Above 8% for 15 minutes | P2 - High | 30 minutes |
| Hallucination increase | Above 3% for 30 minutes | P1 - Critical | 10 minutes |
| TTFW spike | P95 >800ms for 5 minutes | P3 - Medium | 1 hour |
| Total failure | Error rate >5% for 5 minutes | P1 - Critical | 5 minutes |

Incident Response Workflows

Establish playbooks linking alert types to diagnostic steps:

Latency alerts: Check infrastructure health → Review model inference times → Verify TTS provider status → Examine traffic patterns for load spikes

Accuracy alerts: Audit recent model or prompt changes → Compare WER distributions before/after → Review confidence score trends → Check ASR provider status

Containment alerts: Analyze escalation reason distribution → Review intent coverage gaps → Check for new conversation patterns → Verify knowledge base currency

Hallucination alerts: Validate knowledge base freshness → Review recent prompt modifications → Audit source grounding scores → Check retrieval pipeline health

Each playbook should terminate in either resolution or escalation within defined SLAs, with post-incident documentation capturing root cause and prevention measures.


Building a Measurement-Driven Voice Agent Practice

Voice agent analytics is not a dashboard project—it is an operational discipline. The teams that succeed in production share three practices: they instrument all four layers (telephony, ASR, LLM, TTS) independently, they alert on percentile distributions rather than averages, and they correlate upstream failures to downstream business impact.

Start with the metrics that drive decisions: FCR (70-85%), containment rate (80%+), WER (under 5%), and p90 latency (under 3.5s). Instrument these first, set alert thresholds, and build the correlation views that connect a WER spike to a containment drop to a CSAT decline. Everything else—MOS scores, slot filling accuracy, diarization—layers on top of that foundation.

At Hamming, we help teams validate these metrics before production deployment and continuously monitor them at scale. Whether you are establishing baseline measurements for a new voice agent or debugging a latency regression in an existing deployment, the definitions, formulas, and benchmarks in this guide provide the shared vocabulary your team needs to move from transcript logging to genuine observability.

Frequently Asked Questions

Which voice agent metrics should you track first?

Prioritize containment rate, first call resolution (FCR), intent recognition accuracy, latency percentiles (p50/p90/p95), and CSAT as foundational metrics. Containment rate targets 80%+ after optimization, FCR targets 70-85%, intent accuracy requires 95%+ for production, and p90 latency should stay under 3.5 seconds. Expand dashboard scope to include WER, MOS, and hallucination rate once foundational metrics are instrumented.

How is first call resolution (FCR) calculated?

Divide verified successful resolutions on first contact by total calls, then multiply by 100. Use a 48-72 hour verification window—if a customer contacts again within that period about the same issue, the original call did not achieve resolution. Segment FCR by intent category and channel for actionable insights, as blended averages mask significant performance variation across use cases.

What word error rate is acceptable for production?

Target under 5% WER for enterprise production deployments. The formula is (Substitutions + Deletions + Insertions) / Total Reference Words × 100. Regulated industries and safety-critical applications require even lower thresholds, often under 3%. Track WER segmented by noise condition, accent, and domain vocabulary to identify specific improvement areas.

How fast should a voice agent respond?

Aim for sub-500ms Time to First Word (TTFW) as ideal, with 800ms as the acceptable production threshold. For total turn latency, production benchmarks show p50 at 1.5 seconds, p90 at 3.5 seconds, and p95 at 5 seconds. Track percentile distributions rather than averages—a system with 500ms average latency may have 10% of calls experiencing 3.5+ second delays.

How can you measure CSAT without explicit surveys?

AI-based automated scoring analyzes tone, sentiment trajectory, resolution speed, interruption patterns, and conversation ending patterns to predict CSAT without requiring explicit surveys. Voice agents can also embed satisfaction prompts directly in conversations, achieving 30%+ higher completion rates than post-call email surveys. Target 75-85% CSAT for production voice agents.

What containment rate should a voice agent target?

Target 80%+ containment after optimization for well-defined transactional flows. Rates vary by complexity: FAQ bots achieve 40-60%, appointment scheduling targets 75-85%, and complex customer service targets 65-80%. Always pair containment tracking with CSAT and task completion metrics to avoid false containment where users abandon rather than get helped.

How should voice agent dashboards and alerts be configured?

Implement real-time views with percentile tracking (p50/p90/p95), sentiment analysis, and one-click drill-downs from anomaly to individual conversation transcript and audio. Set alert thresholds on percentile metrics rather than averages—alert when p90 latency exceeds 3.5 seconds for 5 minutes, not when the mean crosses a threshold. Build metric correlation views linking upstream failures like high WER to downstream impacts like low task success rate.

What is the difference between MOS and WER?

MOS measures subjective TTS (text-to-speech) naturalness on a 1-5 scale, targeting 4.3-4.5 for production quality that rivals human speech. WER quantifies objective ASR (speech-to-text) transcription accuracy as an error percentage, targeting under 5% for production. MOS evaluates output speech quality while WER evaluates input speech recognition—they measure opposite ends of the voice pipeline.

How do you measure and control hallucinations?

Track hallucination rate as hallucinated responses divided by total responses, targeting under 3% for general deployments and under 1% for regulated industries. Implement real-time validation against verified knowledge base sources, use LLM-as-judge evaluation on sampled conversations, monitor responses with high confidence but no matching source documents, and maintain knowledge base currency to prevent outdated information from driving confabulation.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions of dollars in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on the dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”