Voice Agent Analytics & Post-Call Metrics: Definitions, Formulas & Dashboards

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

February 10, 2026 · Updated February 10, 2026 · 20 min read

Voice agent analytics is the continuous measurement of performance across telephony, ASR, LLM, and TTS layers to ensure production quality. Unlike traditional call center metrics that track averages and aggregate outcomes, voice agent analytics requires layer-by-layer observability—tracing every interaction from audio ingestion through speech recognition, language model inference, and speech synthesis to pinpoint where and why conversations succeed or fail.

| Metric Category | Key Metrics | Production Target |
| --- | --- | --- |
| Task Success | FCR, containment rate, TSR | FCR 70-85%, containment 80%+ |
| Latency | TTFW, turn latency, p90/p95 | P90 <3.5s, TTFW <500ms |
| ASR Quality | WER, confidence scores | WER <5% |
| NLU Accuracy | Intent recognition, slot filling | Intent accuracy 95%+ |
| TTS Quality | MOS, synthesis latency | MOS 4.3-4.5 |
| Safety | Hallucination rate, refusal rate | Hallucination <3% |

At Hamming, we've analyzed 4M+ voice agent calls across 10K+ production voice agents. This guide provides the standardized definitions, mathematical formulas, instrumentation approaches, and benchmark ranges for every critical voice agent metric.

Methodology Note: Metrics, benchmarks, and formula definitions in this guide are derived from Hamming's analysis of 4M+ voice agent interactions across 10K+ production voice agents (2025-2026).

Thresholds validated across healthcare, financial services, e-commerce, and customer support verticals.

Core Voice Agent Metrics and KPIs

First Call Resolution (FCR)

First Call Resolution (FCR) measures the percentage of customer issues fully resolved during the initial interaction without requiring callbacks, transfers, or follow-up contacts.

Formula:

FCR = (Issues resolved on first contact / Total interactions) × 100

| Level | FCR Target | Context |
| --- | --- | --- |
| Baseline | 70-75% | Standard voice agent deployment |
| Good | 75-80% | Optimized flows with knowledge base coverage |
| Top Performer | 85%+ | Specialized, well-defined use cases |

Measurement approach: Use 48-72 hour verification windows. If a customer contacts again within that window about the same issue, the original interaction did not achieve resolution—even if it was marked complete.

Segmentation matters: FCR varies significantly by intent category. Appointment scheduling may achieve 90%+ FCR while complex troubleshooting sits at 60-65%. Report FCR by intent to identify specific improvement opportunities rather than optimizing a blended average.
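A minimal sketch of the verification-window calculation, segmented by intent. The call-record fields (`caller_id`, `intent`, `started_at`, `marked_resolved`) are illustrative assumptions, not a prescribed schema:

```python
from datetime import timedelta

VERIFICATION_WINDOW = timedelta(hours=72)

def fcr_by_intent(calls):
    """FCR per intent with a 72-hour callback verification window.
    Each call is a dict with assumed fields: caller_id, intent,
    started_at (datetime), marked_resolved (bool)."""
    calls = sorted(calls, key=lambda c: c["started_at"])
    resolved, total = {}, {}
    for i, call in enumerate(calls):
        intent = call["intent"]
        total[intent] = total.get(intent, 0) + 1
        if not call["marked_resolved"]:
            continue
        # A resolution only counts if the same caller does NOT return
        # about the same intent within the verification window.
        callback = any(
            later["caller_id"] == call["caller_id"]
            and later["intent"] == intent
            and later["started_at"] - call["started_at"] <= VERIFICATION_WINDOW
            for later in calls[i + 1:]
        )
        if not callback:
            resolved[intent] = resolved.get(intent, 0) + 1
    return {name: 100 * resolved.get(name, 0) / n for name, n in total.items()}
```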

Containment Rate

Containment rate measures the percentage of calls handled entirely by the AI agent without escalation to a human operator.

Formula:

Containment Rate = (AI-resolved calls / Total calls) × 100

| Level | Target | Use Case Context |
| --- | --- | --- |
| Excellent | >80% | Well-defined transactional flows |
| Good | 70-80% | Standard customer service |
| Acceptable | 60-70% | Complex queries, new deployments |
| FAQ Bots | 40-60% | Simple information retrieval |

Target 80%+ after optimization, though rates vary substantially by use case complexity. Healthcare triage may appropriately target 60-70% while appointment scheduling targets 85%+.

Critical caveat: High containment with low CSAT indicates false containment—users abandoning rather than getting helped. Always pair containment tracking with task completion and satisfaction metrics.

Track escalation patterns by reason to identify actionable improvements; a tally sketch follows this list:

  • Knowledge gap (agent lacks required information)
  • Authentication failure
  • User preference (explicitly requested human)
  • Conversation breakdown (intent confusion, loops)
  • Policy requirement (regulatory escalation triggers)
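A minimal version, assuming each call record carries `escalated` and `escalation_reason` fields (illustrative names):

```python
from collections import Counter

def containment_and_reasons(calls):
    """Containment rate plus a per-reason escalation breakdown.
    Each call is a dict with assumed fields: escalated (bool) and
    escalation_reason (e.g., 'knowledge_gap', 'auth_failure')."""
    total = len(calls)
    escalated = [c for c in calls if c["escalated"]]
    containment = 100 * (total - len(escalated)) / total if total else 0.0
    reasons = Counter(c["escalation_reason"] for c in escalated)
    return containment, reasons.most_common()
```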

Intent Recognition Accuracy

Intent recognition accuracy measures how correctly the voice agent classifies customer requests into predefined intent categories.

Formula:

Intent Recognition Accuracy = (Correct intent matches / Total utterances) × 100

| Level | Accuracy | Assessment |
| --- | --- | --- |
| Production-ready | >95% | Required for customer-facing deployment |
| Acceptable | 90-95% | Requires monitoring and prompt tuning |
| Below threshold | <90% | Not production-ready without improvement |

Production requires 95%+ accuracy. Intent misclassification cascades through the entire conversation—a misrouted caller enters the wrong flow, receives irrelevant responses, and either escalates or abandons.

Track intent coverage rate alongside accuracy: the percentage of incoming calls that match a fully supported intent category. Low coverage (many "fallback" or "unknown" intents) indicates gaps in your intent taxonomy rather than classification quality.
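Both numbers fall out of a labeled test set; a sketch over `(predicted, true)` intent pairs (structure assumed for illustration):

```python
def intent_metrics(samples, supported_intents):
    """Intent accuracy and coverage from labeled (predicted, true) pairs.
    Utterances routed to 'fallback'/'unknown' count against coverage."""
    n = len(samples)
    correct = sum(1 for pred, true in samples if pred == true)
    covered = sum(1 for pred, _ in samples if pred in supported_intents)
    return {
        "intent_accuracy_pct": 100 * correct / n,   # target: 95%+
        "coverage_pct": 100 * covered / n,          # low => taxonomy gaps
    }
```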

Task Success Rate (TSR)

Task Success Rate (TSR) tracks completed objectives relative to total interaction attempts.

Formula:

TSR = (Successful task completions / Total interactions) × 100

| Use Case | TSR Benchmark |
| --- | --- |
| Appointment scheduling | 90-95% |
| Order status inquiry | 88-93% |
| Payment processing | 85-90% |
| Technical troubleshooting | 75-85% |
| Complex multi-step flows | 70-80% |

Benchmark 85-95% for specialized implementations with well-defined success criteria. TSR differs from FCR in that it measures whether the agent completed its designated task, regardless of whether the customer needed to call back for a different issue.

Define explicit completion criteria for each task type. "Appointment scheduled" is unambiguous. "Customer helped" is not.
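One way to enforce that discipline is to encode completion criteria as machine-checkable predicates per task type; the call-summary fields below are illustrative assumptions:

```python
# Explicit completion criteria per task type (assumed field names).
COMPLETION_CRITERIA = {
    "appointment_scheduling": lambda s: s.get("appointment_id") is not None,
    "order_status": lambda s: s.get("status_read_to_caller", False),
    "payment": lambda s: s.get("payment_confirmation") is not None,
}

def tsr(call_summaries):
    """TSR = (successful task completions / total interactions) x 100.
    Unknown task types count as failures rather than being skipped."""
    completed = sum(
        1 for s in call_summaries
        if COMPLETION_CRITERIA.get(s["task"], lambda _: False)(s)
    )
    return 100 * completed / len(call_summaries)
```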

Customer Satisfaction (CSAT) Score

CSAT measures satisfaction with individual interactions, typically on a 1-5 scale.

Formula:

CSAT = (Satisfied responses [4-5 rating] / Total survey responses) × 100

| Level | CSAT Target |
| --- | --- |
| Excellent | >85% |
| Good | 75-85% |
| Acceptable | 65-75% |
| Poor | <65% |

Target 75-85% with AI-based automated scoring supplementing explicit surveys. Voice agents can embed satisfaction measurement directly in conversations, achieving 30%+ higher completion rates than post-call email surveys.

Beyond explicit ratings: Track caller frustration through sentiment trajectory, interruption patterns, and repetition frequency. AI-based scoring analyzes tone, sentiment, resolution speed, and conversation ending patterns to predict CSAT without requiring explicit surveys—useful when survey response rates are low.


Latency and Response Time Metrics

Time to First Word (TTFW)

Time to First Word (TTFW) measures initial response delay from user speech completion (VAD silence detection) to the first agent audio reaching the caller.

Formula:

TTFW = Agent first audio byte time − VAD silence detection time

| Threshold | User Experience |
| --- | --- |
| <300ms | Natural, indistinguishable from human conversation |
| 300-500ms | Acceptable for most users |
| 500-800ms | Noticeable delay, users begin adapting speech patterns |
| >800ms | Conversation breakdown begins |

Target sub-500ms as ideal, with 800ms as the acceptable production threshold. Note that TTFW measures only the initial response delay—the time until the first audio byte reaches the caller. This differs from total turn latency (covered below), which measures complete end-to-end response time. Based on Hamming's analysis of production voice agents, industry median total turn latency is 1.4-1.7 seconds—significantly slower than the 300ms human conversational expectation. This gap explains why users report agents that "feel slow" or "keep getting interrupted."
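From a per-turn event log, TTFW is a single timestamp difference; the event names and fields here are illustrative:

```python
def ttfw_ms(turn_events):
    """TTFW for one turn: first agent audio byte minus VAD silence
    detection. Events are dicts with assumed 'type' and 'ts' (ms) fields."""
    t_silence = next(e["ts"] for e in turn_events if e["type"] == "vad_silence")
    t_audio = next(e["ts"] for e in turn_events if e["type"] == "agent_first_audio")
    return t_audio - t_silence  # target: <500ms ideal, <800ms acceptable
```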

Total Conversation Latency

Total conversation latency measures end-to-end response time including ASR processing, LLM inference, and TTS synthesis for each conversational turn.

Component breakdown (typical production):

  • Audio transmission: ~40ms
  • STT processing: 150-350ms
  • LLM inference (TTFT): 200-800ms (typically 70% of total)
  • TTS synthesis: 100-200ms
  • Audio playback: ~30ms

LLM inference dominates total latency, making model selection and prompt optimization the highest-leverage improvement targets.

Latency Percentiles (P50, P90, P95, P99)

Track percentile distributions rather than averages to expose performance outliers that degrade user experience for significant portions of your traffic.

Production latency benchmarks:

| Percentile | Response Time | User Experience | Action Required |
| --- | --- | --- | --- |
| P50 (median) | 1.5s | Noticeable delay, functional | Optimize LLM inference |
| P90 | 3.5s | Significant frustration, talk-overs | Investigate infrastructure and model |
| P95 | 5.0s | Severe delay, frequent abandonment | Immediate attention required |
| P99 | 10s+ | Complete breakdown | Critical incident |

Why percentiles matter: A system reporting 500ms average latency may have 10% of calls experiencing 3.5+ second delays. Those callers don't care about the average—they experience a broken product. Alert on p90 threshold breaches (3.5s), not mean degradation.

Alert configuration: Set notifications when p90 latency exceeds 3.5s for a sustained 5-minute window. This catches degradation before it becomes widespread while filtering transient spikes.
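A standard-library sketch of the percentile computation plus the sustained-window alert described above:

```python
import statistics

def latency_percentiles(latencies_ms):
    """p50/p90/p95/p99 over a batch of per-turn latencies (ms)."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p95": cuts[94], "p99": cuts[98]}

def p90_alert(per_minute_p90s, threshold_ms=3500, sustained_minutes=5):
    """Fire only when p90 breaches the threshold for N consecutive
    one-minute windows, filtering transient spikes."""
    recent = per_minute_p90s[-sustained_minutes:]
    return (len(recent) == sustained_minutes
            and all(p > threshold_ms for p in recent))
```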


Speech Recognition Quality Metrics

Word Error Rate (WER)

Word Error Rate (WER) is the industry standard metric for ASR accuracy, measuring the percentage of words incorrectly transcribed.

Formula:

WER = (S + D + I) / N × 100

Where:
S = Substitutions (wrong words)
D = Deletions (missing words)
I = Insertions (extra words)
N = Total words in reference transcript

| Level | WER | Assessment |
| --- | --- | --- |
| Enterprise | <5% | Required for production deployment |
| Acceptable | 5-8% | Standard deployment, optimization needed |
| Below threshold | 8-12% | Not production-ready for high-stakes use |
| Poor | >12% | Requires fundamental ASR changes |

Require under 5% WER for production. ASR errors cascade through the entire pipeline—a misrecognized word becomes a wrong intent becomes a failed task. Track WER segmented by noise condition, accent, and domain vocabulary to identify specific improvement areas.
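Against a reference transcript, WER reduces to word-level edit distance; a self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N x 100 via word-level edit distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = min edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100 * dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference -> 25% WER
print(wer("book a table tomorrow", "book a cable tomorrow"))  # 25.0
```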

Transcription Confidence Scores

ASR systems provide probability scores indicating transcription certainty for each word or utterance segment. These scores enable real-time quality monitoring without requiring reference transcripts.

| Confidence Level | Score Range | Action |
| --- | --- | --- |
| High | >0.85 | Process normally |
| Medium | 0.6-0.85 | Flag for monitoring |
| Low | <0.6 | Trigger re-prompting or human review |

Production use: Flag low-confidence segments (below 0.6) for re-prompting strategies—ask the caller to repeat or rephrase rather than proceeding with uncertain transcription. Monitor confidence score distributions over time to detect ASR drift.
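The table's thresholds translate directly into a routing function; the cutoffs should be tuned per ASR provider:

```python
def route_by_confidence(confidence: float) -> str:
    """Map an ASR confidence score to a handling action."""
    if confidence > 0.85:
        return "process"               # high: continue the flow
    if confidence >= 0.6:
        return "flag_for_monitoring"   # medium: log for drift analysis
    return "reprompt"                  # low: ask caller to repeat or rephrase
```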

Speaker Diarization Accuracy

Speaker diarization identifies and separates multiple speakers in a conversation, attributing each utterance to the correct participant.

Critical for:

  • Multi-party calls (caller + agent + transferred party)
  • Accurate context attribution in analytics
  • Compliance monitoring requiring speaker-specific tracking
  • Training data quality for model improvement

Track diarization error rate as the percentage of speech segments attributed to the wrong speaker. Production systems should achieve under 5% diarization error for two-speaker conversations.


Natural Language Understanding Metrics

Intent Coverage Rate

Intent coverage rate measures the percentage of incoming calls that match a fully supported intent category in your voice agent's taxonomy.

Formula:

Intent Coverage = (Calls matching supported intents / Total calls) × 100

Track coverage gaps—calls routed to "fallback" or "unknown" intent categories—to identify where your agent lacks capability. High fallback rates (above 15%) indicate taxonomy gaps rather than classification errors.

Action pattern: Review fallback utterances weekly. Cluster similar requests and evaluate whether they warrant new intent categories or expanded training data for existing intents.

Semantic Accuracy Rate

Semantic accuracy measures whether agent responses align with the user's actual meaning, going beyond keyword matching to evaluate contextual understanding.

Unlike intent accuracy (which measures classification), semantic accuracy evaluates whether the agent's response appropriately addresses what the user meant—even when the intent was correctly classified.

Validation approach: Conduct periodic manual audits or use LLM-as-judge evaluation pipelines. Sample 100-200 conversations per week, scoring whether responses were semantically appropriate given the full conversation context. LLM-as-judge approaches achieve 95%+ agreement with human evaluators when using two-step evaluation pipelines.
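A minimal two-step judge sketch; `call_llm` is a hypothetical wrapper around whatever model API you use, and the prompts are illustrative rather than a validated rubric:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your LLM provider's completion API."""
    raise NotImplementedError

def judge_semantic_accuracy(transcript: str) -> bool:
    # Step 1: restate the caller's actual goal from full context.
    goal = call_llm(
        "In one sentence, what was the caller trying to achieve?\n" + transcript
    )
    # Step 2: score the agent's responses against that restated goal.
    verdict = call_llm(
        f"Caller goal: {goal}\nTranscript:\n{transcript}\n"
        "Did the agent's responses appropriately address this goal? "
        "Answer strictly YES or NO."
    )
    return verdict.strip().upper().startswith("YES")
```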

Slot Filling Accuracy

Slot filling accuracy tracks successful extraction of required parameters (names, dates, account numbers, addresses) from user utterances before task execution.

Formula:

Slot Filling Accuracy = (Correctly extracted slots / Total required slots) × 100

Production target: 90%+ slot filling accuracy. Failed slot extraction forces repetitive re-prompting that degrades user experience. Track accuracy by slot type—dates and numbers typically achieve higher accuracy than proper nouns and addresses.


Text-to-Speech Quality Metrics

Mean Opinion Score (MOS)

Mean Opinion Score (MOS) is a subjective 1-5 scale rating of TTS naturalness, clarity, and overall quality, following the ITU-T P.800 standard.

| MOS Score | Quality Level |
| --- | --- |
| 4.5-5.0 | Excellent, indistinguishable from human |
| 4.3-4.5 | Very good, rivals human speech quality |
| 3.8-4.3 | Good, clearly synthetic but natural |
| 3.0-3.8 | Fair, robotic qualities noticeable |
| <3.0 | Poor, unacceptable for production |

Target 4.3-4.5 to rival human speech quality benchmarks. MOS remains the gold standard for TTS evaluation despite being resource-intensive, requiring crowdsourced evaluation or automated MOSNet scoring.

Voice Consistency Rate

Voice consistency measures stable prosody, tone, and pacing throughout an entire conversation. Inconsistent voice characteristics—sudden pitch shifts, pacing changes, or tonal breaks—break user immersion and erode trust.

Monitor for:

  • Pitch stability across conversation turns
  • Pacing consistency (words per minute variance)
  • Tonal alignment with conversation context (empathetic when appropriate)
  • Cross-session consistency for returning callers

Audio Synthesis Latency

TTS synthesis latency measures the time required to generate audio output from text input.

| Percentile | Target | Impact |
| --- | --- | --- |
| P50 | <150ms | Contributes to natural conversational flow |
| P90 | <300ms | Acceptable production threshold |
| P95 | <500ms | Investigate TTS provider performance |

Track TTS p90 latency under 300ms to maintain conversational rhythm. TTS latency combines with STT and LLM latency to determine total turn latency—optimizing any single component improves end-to-end experience.


Hallucination Detection and Safety Metrics

Hallucination Rate

Hallucination rate tracks instances where the voice agent generates fabricated information, invented facts, or confident responses not grounded in its knowledge base.

Formula:

Hallucination Rate = (Hallucinated responses / Total responses) × 100

Target under 3% occurrence for general deployments. Regulated industries (healthcare, financial services) should target under 1%.

Detection approaches:

  • Real-time validation against knowledge base sources
  • LLM-as-judge evaluation on sampled conversations
  • Tracking five or more consecutive transcription errors as potential hallucination signals
  • Monitoring responses with high confidence but no matching source documents

Safety Refusal Rate

Safety refusal rate measures the percentage of adversarial, inappropriate, or out-of-scope prompts correctly rejected by the voice agent.

Track both:

  • True positive refusals: Correctly blocked adversarial or policy-violating requests
  • False positive refusals: Legitimate requests incorrectly blocked (over-aggressive guardrails)

Balance is critical. Under-refusing exposes your system to misuse. Over-refusing creates frustrated users who can't complete legitimate tasks.

Source Grounding Score

Source grounding validates that agent responses are traceable to verified knowledge base content, flagging "confident answers with no matching source" as potential hallucinations.

Implementation: For each response, check whether the key claims map to retrieved knowledge base passages. Responses with high confidence but low source overlap should trigger review and potential re-prompting.
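A crude token-overlap version of that check, as a sketch only; production systems typically use embedding similarity or claim-level entailment instead:

```python
def grounding_score(response: str, retrieved_passages: list[str]) -> float:
    """Fraction of response tokens that also appear in retrieved passages."""
    resp = set(response.lower().split())
    sources = set(" ".join(retrieved_passages).lower().split())
    return len(resp & sources) / len(resp) if resp else 0.0

def needs_review(response, passages, model_confidence, floor=0.5):
    """Flag confident answers with weak source overlap (assumed thresholds)."""
    return model_confidence > 0.8 and grounding_score(response, passages) < floor
```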


Conversation Quality Scoring

Call Deflection Success

Call deflection success measures prevention of unnecessary human transfers when the AI agent could resolve the issue, calculated against baseline pre-automation escalation rates.

Formula:

Deflection Success = (Pre-automation escalations - Current escalations) / Pre-automation escalations × 100

This metric only makes sense relative to historical baselines. Compare current escalation patterns to pre-automation rates, segmented by intent category.

Interruption Frequency

Interruption frequency counts instances where the agent speaks over the user or responds before the user completes their thought. High interruption rates indicate ASR timing issues, specifically problems with Voice Activity Detection (VAD) or end-of-turn prediction.

| Level | Interruption Rate | Assessment |
| --- | --- | --- |
| Good | <5% of turns | Natural conversation flow |
| Acceptable | 5-10% of turns | Monitor VAD configuration |
| Poor | >10% of turns | Immediate VAD tuning required |

Diagnostic approach: Distinguish between agent-caused interruptions (premature response) and user-caused interruptions (barge-in). Agent-caused interruptions indicate system issues. User-caused barge-ins may indicate latency problems prompting users to repeat themselves.

Conversation Abandonment Rate

Conversation abandonment rate tracks calls ended by the user mid-conversation before reaching resolution, signaling poor experience or agent failure.

Formula:

Abandonment Rate = (Calls abandoned before resolution / Total calls) × 100

Segment abandonment by the following (a bucketing sketch follows this list):

  • Time in call: Early abandonment (under 30s) suggests greeting or routing issues
  • Intent stage: Abandonment during slot filling suggests re-prompting fatigue
  • After specific turns: Identifies exact conversation points causing drop-off
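A sketch of that segmentation, assuming `duration_s` and `stage` fields on each abandoned-call record (illustrative names):

```python
def abandonment_buckets(abandoned_calls):
    """Bucket abandoned calls by time-in-call and by conversation stage."""
    by_time = {"early_<30s": 0, "mid_30-120s": 0, "late_>120s": 0}
    by_stage = {}
    for call in abandoned_calls:
        if call["duration_s"] < 30:
            by_time["early_<30s"] += 1      # greeting/routing issues
        elif call["duration_s"] <= 120:
            by_time["mid_30-120s"] += 1     # often slot-filling fatigue
        else:
            by_time["late_>120s"] += 1
        by_stage[call["stage"]] = by_stage.get(call["stage"], 0) + 1
    return by_time, by_stage
```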

Voice Agent Analytics Framework

4-Layer Monitoring Approach

Implement Hamming's 4-layer analytics framework for comprehensive voice agent observability:

| Layer | Function | Key Metrics | Failure Modes |
| --- | --- | --- | --- |
| Layer 1: Telephony & Audio | Audio quality, transport health | Packet loss, jitter, SNR, codec latency | Garbled audio, dropouts, echo |
| Layer 2: ASR & Transcription | Speech-to-text accuracy | WER, confidence scores, transcription latency | Mishearing, silent failures, drift |
| Layer 3: LLM & Semantic | Intent and response generation | TTFT, intent accuracy, hallucination rate | Wrong routing, confabulation, scope creep |
| Layer 4: TTS & Generation | Speech synthesis quality | Synthesis latency, MOS, voice consistency | Delays, robotic speech, voice drift |

Issues cascade across layers. An audio quality problem (Layer 1) causes transcription errors (Layer 2), which cause intent misclassification (Layer 3), which causes task failure. Without layer-by-layer instrumentation, you see the task failure but not the root cause.

Real-Time vs Post-Call Analytics

Balance immediate alerting with deep post-call analysis:

| Approach | Purpose | Latency | Depth |
| --- | --- | --- | --- |
| Real-time | Detect degradation as it happens | Seconds | Surface-level indicators |
| Near-real-time | Pattern identification within sessions | Minutes | Trend analysis |
| Post-call | Root cause analysis, model improvement | Hours | Full conversation evaluation |

Real-time monitoring catches outages and severe degradation. Post-call analysis identifies systematic patterns—specific prompts that consistently underperform, intent categories with declining accuracy, or time-of-day latency variations—that inform model improvements and prompt optimization.


Dashboard Design and Visualization

Essential Dashboard Components

A production voice agent dashboard must answer four questions within 30 seconds:

  1. Is the system healthy? — Call volume trends, error rates, infrastructure status
  2. Are users satisfied? — CSAT trajectory, abandonment rates, sentiment patterns
  3. Where are the problems? — Latency percentile distributions, WER trends, containment drops
  4. What changed? — Deployment markers, model version annotations, configuration diffs

Display call volume trends, latency percentile distributions (p50/p90/p95), containment rates, and sentiment analysis with one-click drill-downs from anomaly to individual conversation transcript and audio playback.

Alert Configuration Best Practices

Set thresholds on percentile metrics, not averages. Average-based alerts mask degradation affecting minority populations of calls.

| Metric | Alert Threshold | Duration | Severity |
| --- | --- | --- | --- |
| P90 latency | >3.5s | 5 minutes | Warning |
| P95 latency | >5.0s | 5 minutes | Critical |
| Containment rate | <60% | 1 hour | Warning |
| WER | >8% | 15 minutes | Warning |
| Hallucination rate | >3% | 30 minutes | Critical |
| TTFW p95 | >800ms | 5 minutes | Warning |

Trigger notifications before customer-facing degradation becomes widespread. A p90 alert at 3.5s catches the problem when 10% of users are affected, not when the average crosses a threshold that requires 50%+ degradation.

Metric Correlation Views

Link upstream failures to downstream impacts to trace root causes efficiently:

  • High WER → Low intent accuracy → Low TSR — ASR degradation cascading to task failure
  • Latency spike → High interruption rate → High abandonment — Infrastructure issue causing conversation breakdown
  • Low confidence scores → High fallback rate → Low containment — ASR uncertainty driving escalations

Build correlation dashboards that surface these causal chains automatically, enabling operators to jump from symptom to root cause without manual investigation.


ROI and Business Impact Metrics

Cost Per Interaction

Calculate total resolution cost including infrastructure, model inference, and telephony:

| Channel | Cost Range | Context |
| --- | --- | --- |
| Human agent | $5-8 per call | Fully loaded: salary, benefits, training, facilities |
| AI voice agent | $0.01-0.25 per minute | Infrastructure, model inference, telephony |

Comparison framework: For a 3-minute average call, AI costs $0.03-0.75 versus $5-8 for human handling—a 10-250x cost reduction depending on complexity and infrastructure choices.

Automation ROI Formula

Formula:

ROI = (Containment Rate × Call Volume × Per-Call Savings - Infrastructure Costs) / Infrastructure Costs × 100

Expected returns: 200-500% ROI within 3-6 months for well-implemented deployments with sub-six-month payback periods.

Worked example: 10,000 monthly calls × 75% containment × $6 per-call savings = $45,000 monthly savings. Against $8,000 monthly infrastructure costs: ($45,000 - $8,000) / $8,000 = 462% ROI.
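The worked example reproduces directly in code:

```python
def automation_roi(monthly_calls, containment, per_call_savings, infra_cost):
    """ROI = (savings - infrastructure costs) / infrastructure costs x 100."""
    savings = monthly_calls * containment * per_call_savings
    return 100 * (savings - infra_cost) / infra_cost

# 10,000 calls x 75% containment x $6 savings vs $8,000/month infrastructure
print(automation_roi(10_000, 0.75, 6.0, 8_000))  # 462.5 (~462% ROI)
```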


Industry Benchmarks and Thresholds (2026)

Performance Benchmark Summary Table

| Metric | Baseline | Good | Top Performer | How to Measure |
| --- | --- | --- | --- | --- |
| FCR | 70% | 75-80% | 85%+ | 48-72hr callback verification |
| Containment | 60% | 70-80% | 80%+ | Escalation event tracking |
| Latency P90 | 4.5s | 3.5s | <2.5s | Percentile distribution monitoring |
| WER | 8% | 5% | <3% | Reference transcript comparison |
| CSAT | 65% | 75-80% | 85%+ | In-call + AI-predicted scoring |
| MOS | 3.8 | 4.3 | 4.5+ | Crowdsourced + MOSNet evaluation |
| Intent Accuracy | 90% | 95% | 98%+ | Labeled test set evaluation |
| Hallucination Rate | 5% | 3% | <1% | Source grounding validation |

Benchmark Variation by Use Case

Performance expectations vary significantly by voice agent complexity:

| Use Case | Typical Containment | FCR Target | Latency Tolerance |
| --- | --- | --- | --- |
| FAQ / Information | 40-60% | 80%+ | P90 <4.0s |
| Appointment Scheduling | 75-85% | 90%+ | P90 <3.0s |
| Order Management | 70-80% | 80%+ | P90 <3.5s |
| Customer Service | 65-80% | 70-80% | P90 <3.5s |
| Technical Support | 50-65% | 65-75% | P90 <4.0s |
| Healthcare Triage | 55-70% | 75-85% | P90 <3.5s |

Simpler FAQ bots achieve 40-60% containment while complex customer service targets 75-80%. Adjust expectations by complexity rather than applying uniform benchmarks.


Production Monitoring and Alerting

Critical Alert Definitions

Configure tiered notifications linking severity to response expectations:

| Alert Type | Trigger | Severity | Response SLA |
| --- | --- | --- | --- |
| Containment drop | Below 60% for 1 hour | P2 - High | 30 minutes |
| Latency spike | P90 >3.5s for 5 minutes | P2 - High | 15 minutes |
| WER degradation | Above 8% for 15 minutes | P2 - High | 30 minutes |
| Hallucination increase | Above 3% for 30 minutes | P1 - Critical | 10 minutes |
| TTFW spike | P95 >800ms for 5 minutes | P3 - Medium | 1 hour |
| Total failure | Error rate >5% for 5 minutes | P1 - Critical | 5 minutes |

Incident Response Workflows

Establish playbooks linking alert types to diagnostic steps:

Latency alerts: Check infrastructure health → Review model inference times → Verify TTS provider status → Examine traffic patterns for load spikes

Accuracy alerts: Audit recent model or prompt changes → Compare WER distributions before/after → Review confidence score trends → Check ASR provider status

Containment alerts: Analyze escalation reason distribution → Review intent coverage gaps → Check for new conversation patterns → Verify knowledge base currency

Hallucination alerts: Validate knowledge base freshness → Review recent prompt modifications → Audit source grounding scores → Check retrieval pipeline health

Each playbook should terminate in either resolution or escalation within defined SLAs, with post-incident documentation capturing root cause and prevention measures.


Building a Measurement-Driven Voice Agent Practice

Voice agent analytics is not a dashboard project—it is an operational discipline. The teams that succeed in production share three practices: they instrument all four layers (telephony, ASR, LLM, TTS) independently, they alert on percentile distributions rather than averages, and they correlate upstream failures to downstream business impact.

Start with the metrics that drive decisions: FCR (70-85%), containment rate (80%+), WER (under 5%), and p90 latency (under 3.5s). Instrument these first, set alert thresholds, and build the correlation views that connect a WER spike to a containment drop to a CSAT decline. Everything else—MOS scores, slot filling accuracy, diarization—layers on top of that foundation.

At Hamming, we help teams validate these metrics before production deployment and continuously monitor them at scale. Whether you are establishing baseline measurements for a new voice agent or debugging a latency regression in an existing deployment, the definitions, formulas, and benchmarks in this guide provide the shared vocabulary your team needs to move from transcript logging to genuine observability.

Frequently Asked Questions

Which voice agent metrics should you track first?

Prioritize containment rate, first call resolution (FCR), intent recognition accuracy, latency percentiles (p50/p90/p95), and CSAT as foundational metrics. Containment rate targets 80%+ after optimization, FCR targets 70-85%, intent accuracy requires 95%+ for production, and p90 latency should stay under 3.5 seconds. Expand dashboard scope to include WER, MOS, and hallucination rate once foundational metrics are instrumented.

How is first call resolution (FCR) calculated?

Divide verified successful resolutions on first contact by total calls, then multiply by 100. Use a 48-72 hour verification window—if a customer contacts again within that period about the same issue, the original call did not achieve resolution. Segment FCR by intent category and channel for actionable insights, as blended averages mask significant performance variation across use cases.

What word error rate is acceptable for production?

Target under 5% WER for enterprise production deployments. The formula is (Substitutions + Deletions + Insertions) / Total Reference Words × 100. Regulated industries and safety-critical applications require even lower thresholds, often under 3%. Track WER segmented by noise condition, accent, and domain vocabulary to identify specific improvement areas.

How fast should a voice agent respond?

Aim for sub-500ms Time to First Word (TTFW) as ideal, with 800ms as the acceptable production threshold. For total turn latency, production benchmarks show p50 at 1.5 seconds, p90 at 3.5 seconds, and p95 at 5 seconds. Track percentile distributions rather than averages—a system with 500ms average latency may have 10% of calls experiencing 3.5+ second delays.

How can you measure CSAT without explicit surveys?

AI-based automated scoring analyzes tone, sentiment trajectory, resolution speed, interruption patterns, and conversation ending patterns to predict CSAT without requiring explicit surveys. Voice agents can also embed satisfaction prompts directly in conversations, achieving 30%+ higher completion rates than post-call email surveys. Target 75-85% CSAT for production voice agents.

What containment rate should a voice agent target?

Target 80%+ containment after optimization for well-defined transactional flows. Rates vary by complexity: FAQ bots achieve 40-60%, appointment scheduling targets 75-85%, and complex customer service targets 65-80%. Always pair containment tracking with CSAT and task completion metrics to avoid false containment where users abandon rather than get helped.

How should voice agent dashboards and alerts be configured?

Implement real-time views with percentile tracking (p50/p90/p95), sentiment analysis, and one-click drill-downs from anomaly to individual conversation transcript and audio. Set alert thresholds on percentile metrics rather than averages—alert when p90 latency exceeds 3.5 seconds for 5 minutes, not when the mean crosses a threshold. Build metric correlation views linking upstream failures like high WER to downstream impacts like low task success rate.

What is the difference between MOS and WER?

MOS measures subjective TTS (text-to-speech) naturalness on a 1-5 scale, targeting 4.3-4.5 for production quality that rivals human speech. WER quantifies objective ASR (speech-to-text) transcription accuracy as an error percentage, targeting under 5% for production. MOS evaluates output speech quality while WER evaluates input speech recognition—they measure opposite ends of the voice pipeline.

How do you measure and control hallucinations?

Track hallucination rate as hallucinated responses divided by total responses, targeting under 3% for general deployments and under 1% for regulated industries. Implement real-time validation against verified knowledge base sources, use LLM-as-judge evaluation on sampled conversations, monitor responses with high confidence but no matching source documents, and maintain knowledge base currency to prevent outdated information from driving confabulation.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions of dollars in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on the dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”