Voice agent evaluation metrics are standardized measurements for assessing voice AI performance across accuracy, latency, task completion, quality, and safety dimensions. Unlike text-based LLM evaluation, voice agent evaluation requires end-to-end tracing across ASR, NLU, LLM, and TTS components, each of which introduces its own failure modes.
| Metric Category | Key Metrics | Why It Matters |
|---|---|---|
| ASR Accuracy | WER, CER, entity accuracy | Transcription errors cascade downstream |
| Latency | TTFB, p95/p99, end-to-end | Delays break conversational flow |
| Task Success | TSR, FCR, containment rate | Measures actual business outcomes |
| TTS Quality | MOS, MCD, naturalness | Affects user trust and experience |
| Safety | Hallucination rate, compliance score | Prevents harmful or incorrect outputs |
TL;DR: Use Hamming's Voice Agent Metrics Reference to systematically measure production voice AI across all five dimensions. This guide provides standardized definitions, mathematical formulas, instrumentation approaches, and benchmark ranges for every critical voice agent metric.
Quick filter: If you're running a demo agent with a handful of test calls, basic logging and manual review work fine. This reference is for teams preparing for production deployment or already handling real customer traffic where measurement rigor matters.
Voice Agent KPI Reference Table
This table provides the complete reference for voice agent evaluation metrics—definitions, formulas, targets, and instrumentation guidance:
| Metric | Definition | Formula | Good | Warning | Critical | How to Instrument | Alert On |
|---|---|---|---|---|---|---|---|
| WER | Word Error Rate - ASR transcription accuracy | (S + D + I) / N × 100 | <5% | 5-10% | >10% | Compare ASR output to reference transcripts | P50 >8% for 10min |
| TTFW | Time to First Word - initial response latency | Call connect → first audio byte | <400ms | 400-600ms | >800ms | Timestamp call events, measure first audio | P95 >600ms for 5min |
| Turn Latency | End-to-end response time per turn | User silence end → agent audio start | P95 <800ms | P95 800-1500ms | P95 >1500ms | Span traces across STT/LLM/TTS | P95 >1000ms for 5min |
| Intent Accuracy | Correct intent classification rate | Correct / Total × 100 | >95% | 90-95% | <90% | Compare predicted vs labeled intents | <92% for 15min |
| TSR | Task Success Rate - goal completion | Completed / Attempted × 100 | >85% | 75-85% | <75% | Define completion criteria per task type | <80% for 30min |
| FCR | First Call Resolution - no follow-up needed | Resolved first contact / Total × 100 | >75% | 65-75% | <65% | Track repeat calls within 24-48hr window | <70% for 2hr |
| Containment | Calls handled without human escalation | AI-resolved / Total × 100 | >70% | 60-70% | <60% | Tag escalation events by reason | <60% for 1hr |
| Barge-in Recovery | Successful interruption handling | Recovered / Total interruptions × 100 | >90% | 80-90% | <80% | Detect overlapping speech, measure recovery | <85% for 30min |
| MOS | Mean Opinion Score - TTS quality | Human rating 1-5 scale | >4.3 | 3.8-4.3 | <3.8 | Crowdsourced evaluation or MOSNet | N/A (periodic) |
| Hallucination Rate | Fabricated/incorrect information | Hallucinated responses / Total × 100 | <1% | 1-3% | >3% | LLM-as-judge validation against sources | >2% for 30min |
How to use this table:
- Instrument each metric using the guidance in the "How to Instrument" column
- Set alerts based on the thresholds and durations in the "Alert On" column
- Triage by severity: Critical requires immediate action, Warning requires investigation within 1 hour
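To make the "Alert On" column concrete, here is a minimal sketch of how these thresholds could be encoded as rules in your monitoring layer; the metric names and the `AlertRule` structure are illustrative rather than tied to any particular monitoring product.

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    metric: str             # metric name as emitted by your pipeline (illustrative)
    aggregation: str        # aggregation to evaluate: "p50", "p95", "mean", ...
    threshold: float        # value that triggers the alert
    direction: str          # "above" or "below"
    sustained_minutes: int  # how long the breach must persist

# Rules mirroring the "Alert On" column above (values from the table, names illustrative)
ALERT_RULES = [
    AlertRule("wer", "p50", 8.0, "above", 10),
    AlertRule("ttfw_ms", "p95", 600, "above", 5),
    AlertRule("turn_latency_ms", "p95", 1000, "above", 5),
    AlertRule("intent_accuracy", "mean", 92.0, "below", 15),
    AlertRule("task_success_rate", "mean", 80.0, "below", 30),
    AlertRule("containment_rate", "mean", 60.0, "below", 60),
    AlertRule("hallucination_rate", "mean", 2.0, "above", 30),
]

def breached(rule: AlertRule, aggregated_value: float) -> bool:
    """True if a single aggregated value violates the rule's threshold."""
    if rule.direction == "above":
        return aggregated_value > rule.threshold
    return aggregated_value < rule.threshold
```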
Benchmarks by Use Case
Different voice agent applications have different performance expectations. Use these benchmarks to calibrate your targets:
Contact Center Support
| Metric | Target | Notes |
|---|---|---|
| Task Completion | >75% | Complex queries, knowledge base dependent |
| FCR | >70% | Industry standard for support |
| Containment | >65% | Higher escalation expected for complex issues |
| Turn Latency P95 | <1000ms | Users more tolerant when seeking help |
| WER | <8% | Background noise from home environments |
Appointment Scheduling
| Metric | Target | Notes |
|---|---|---|
| Task Completion | >90% | Structured flow, clear success criteria |
| FCR | >85% | Appointment confirmed = resolved |
| Containment | >80% | Simple transactions, fewer edge cases |
| Turn Latency P95 | <800ms | Transactional, users expect speed |
| WER | <5% | Dates/times require high accuracy |
Healthcare / Clinical
| Metric | Target | Notes |
|---|---|---|
| Task Completion | >85% | Compliance and accuracy critical |
| Hallucination Rate | <0.5% | Zero tolerance for medical misinformation |
| Compliance Score | >99% | HIPAA, regulatory requirements |
| Turn Latency P95 | <1200ms | Accuracy more important than speed |
| WER | <5% | Medical terminology, patient safety |
E-commerce / Order Taking
| Metric | Target | Notes |
|---|---|---|
| Task Completion | >85% | Order placed, payment processed |
| Upsell Success | >15% | Revenue optimization |
| Containment | >75% | Handle returns, status, ordering |
| Turn Latency P95 | <700ms | Transactional, users expect fast |
| WER | <6% | Product names, order numbers |
ASR Accuracy Metrics
Speech recognition accuracy determines whether your voice agent correctly "hears" what users say. Errors at this layer cascade through the entire pipeline—a misrecognized word becomes a wrong intent becomes a failed task.
Word Error Rate (WER)
Word Error Rate (WER) is the industry standard metric for ASR accuracy, measuring the percentage of words incorrectly transcribed.
Formula:
WER = (S + D + I) / N × 100
Where:
S = Substitutions (wrong words)
D = Deletions (missing words)
I = Insertions (extra words)
N = Total words in reference transcript
Worked Example:
Reference: "I want to book a flight to Berlin"
Transcription: "I want to look at flight Berlin"
Substitutions: 2 (book→look, a→at)
Deletions: 1 (to)
Insertions: 0
Total words: 8
WER = (2 + 1 + 0) / 8 × 100 = 37.5%
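To make the formula concrete, here is a minimal Python sketch that computes WER as a word-level edit distance (the dynamic program finds the minimum S + D + I); running the same routine over character lists instead of word lists gives CER.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: minimum (S + D + I) / N over all alignments, as a percentage."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + sub)      # substitution or match
    return d[-1][-1] / len(ref) * 100

# Reproduces the worked example: 3 errors over 8 reference words = 37.5%
print(wer("I want to book a flight to Berlin", "I want to look at flight Berlin"))
```

In practice, normalize casing and punctuation consistently on both sides before scoring, or formatting differences will inflate WER.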
Important: WER can exceed 100% when errors outnumber reference words—this indicates catastrophic transcription failure requiring immediate investigation.
Character Error Rate (CER)
Character Error Rate (CER) uses the same formula but operates on characters instead of words:
CER = (S + D + I) / N × 100 (at character level)
When to use CER:
- Non-whitespace languages (Mandarin, Japanese, Thai) where word segmentation doesn't apply
- Character-level precision tasks like spelling verification
- Granular accuracy assessment for named entities
WER Benchmark Ranges
| Rating | Accuracy | WER | Production Readiness |
|---|---|---|---|
| Enterprise | 95%+ | <5% | High-stakes applications (healthcare, finance) |
| Good | 90-95% | 5-10% | Most production use cases |
| Fair | 85-90% | 10-15% | Requires improvement before production |
| Poor | <85% | >15% | Not production-ready |
Source: Benchmarks derived from Hamming's testing of 500K+ voice interactions and published ASR research including Google's Multilingual ASR studies.
Environmental Impact on ASR Performance
Real-world conditions significantly degrade ASR accuracy compared to clean benchmarks:
| Environment | WER Increase | Notes |
|---|---|---|
| Office noise | +3-5% | Typing, HVAC, distant conversations |
| Café/restaurant | +10-15% | Music, conversations, clinking |
| Street/traffic | +15-20% | Vehicle noise, crowds, wind |
| Airport | +20-25% | Announcements, crowds, echo |
| Car (hands-free) | +10-20% | Engine noise, road noise, echo |
Testing implication: Always test ASR under realistic acoustic conditions, not just clean benchmarks. LibriSpeech clean speech achieves 95%+ accuracy, but real-world conditions reduce this by 5-15 percentage points.
For comprehensive background noise testing methodology, see Background Noise Testing KPIs.
ASR Provider Performance Comparison (2024-2025)
| Provider | Strengths | Notable Benchmarks |
|---|---|---|
| OpenAI Whisper | Clean and accented speech | Lowest WER for formatted/unformatted transcriptions |
| Deepgram Nova-2 | Commercial deployment | 30% WER reduction vs previous generation |
| AssemblyAI Universal | Hallucination reduction | 30% fewer hallucinations vs Whisper Large-v3 |
| Google Speech-to-Text | Language coverage | 125+ languages supported |
Task Success & Completion Metrics
ASR accuracy alone doesn't guarantee a working voice agent. Task success metrics measure whether users actually accomplish their goals.
Task Success Rate (TSR)
Task Success Rate (TSR) measures the percentage of interactions that meet all success criteria:
TSR = (Successful Completions / Total Interactions) × 100
Success criteria must include:
- All user goals achieved
- No constraint violations (e.g., booking within allowed dates)
- Proper execution of required actions (e.g., confirmation sent)
Related metrics:
- Task Completion Time (TCT): Time from first utterance to goal achievement
- Turns-to-Success: Average turn count to completion (measures conversational efficiency)
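A minimal sketch of TSR instrumentation, assuming your pipeline records the three success signals listed above for each interaction (the `Interaction` fields are illustrative placeholders):

```python
from dataclasses import dataclass, field

@dataclass
class Interaction:
    goals_achieved: bool                      # all user goals met
    constraint_violations: list = field(default_factory=list)  # e.g. booking outside allowed dates
    required_actions_done: bool = True        # e.g. confirmation sent

def task_success_rate(interactions: list) -> float:
    """TSR = successful completions / total interactions x 100.
    An interaction succeeds only if all goals are met, no constraints
    are violated, and required follow-up actions were executed."""
    if not interactions:
        return 0.0
    successes = sum(
        1 for it in interactions
        if it.goals_achieved and not it.constraint_violations and it.required_actions_done
    )
    return successes / len(interactions) * 100
```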
First Call Resolution (FCR)
First Call Resolution (FCR) measures the percentage of issues resolved during the initial interaction without requiring callbacks:
FCR = (Resolved on First Contact / Total Contacts) × 100
| FCR Rating | Range | Assessment |
|---|---|---|
| Excellent | 85%+ | High-performing teams |
| Good | 75-85% | Industry benchmark |
| Fair | 65-75% | Room for improvement |
| Poor | <65% | Significant issues |
Measurement best practices:
- Use a 48-72 hour verification window (the issue counts as resolved if the customer doesn't return within it)
- Combine internal data with post-call surveys for external validation
- FCR directly correlates with CSAT, NPS, and customer retention
Impact: Advanced NLU and real-time data integration can reduce misrouted calls by 30%, directly improving FCR.
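A minimal sketch of the repeat-call method, assuming each call record carries a caller identifier, an issue label, and a timestamp (field names are illustrative):

```python
from datetime import timedelta

def first_call_resolution(calls: list[dict], window_hours: int = 48) -> float:
    """FCR = resolved on first contact / total contacts x 100.
    A call counts as resolved if the same caller does not call back about
    the same issue within the verification window. Repeat calls remain in
    the denominator as contacts of their own."""
    if not calls:
        return 0.0
    calls = sorted(calls, key=lambda c: c["timestamp"])
    resolved = 0
    for i, call in enumerate(calls):
        window_end = call["timestamp"] + timedelta(hours=window_hours)
        repeat = any(
            later["caller_id"] == call["caller_id"]
            and later["issue"] == call["issue"]
            and later["timestamp"] <= window_end
            for later in calls[i + 1:]
        )
        if not repeat:
            resolved += 1
    return resolved / len(calls) * 100
```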
Intent Recognition Accuracy
Intent recognition measures whether the voice agent correctly understands what users want to do:
Intent Accuracy = (Correct Classifications / Total Utterances) × 100
| Target | Threshold | Action Required |
|---|---|---|
| Production | 95%+ | Deploy with confidence |
| Acceptable | 90-95% | Monitor closely |
| Investigation | <90% | Determine if issue is ASR or NLU |
Coverage Rate measures how completely agents handle real customer goals:
Coverage Rate = (Calls in Fully Supported Intents / Total Calls) × 100
For intent recognition testing methodology, see How to Evaluate Voice Agents.
Containment Rate
Containment Rate measures the percentage of calls handled without human escalation:
Containment Rate = (Calls Handled by AI / Total Calls) × 100
| Timeframe | Conservative Target | Mature System |
|---|---|---|
| Month 1 | 40-60% | — |
| Month 3 | 60-75% | — |
| Month 6+ | 75-85% | 85%+ |
Higher containment reduces call center load and improves automation ROI. Enterprise deployments regularly achieve 80%+ containment after optimization.
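A minimal sketch of containment reporting from tagged escalation events, assuming each call record carries an optional `escalation_reason` (illustrative field name) that is empty when the AI resolved the call:

```python
from collections import Counter

def containment_report(calls: list[dict]) -> dict:
    """Containment = AI-resolved calls / total calls x 100."""
    total = len(calls)
    escalation_reasons = [c["escalation_reason"] for c in calls if c.get("escalation_reason")]
    return {
        "containment_rate": (total - len(escalation_reasons)) / total * 100 if total else 0.0,
        # Counting escalations by reason shows where to invest next
        "escalations_by_reason": Counter(escalation_reasons),
    }
```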
Latency & Performance Metrics
Latency determines whether your voice agent feels like a natural conversation or an awkward exchange with a slow robot.
Human Conversation Benchmarks
Understanding human conversational timing sets the target for voice AI:
| Behavior | Typical Latency | Source |
|---|---|---|
| Human response in conversation | ~200ms | Conversational turn-taking research |
| Natural dialogue gap | <500ms | ITU standards |
| GPT-4o audio response | 232-320ms | OpenAI benchmarks |
Production Voice AI Reality
Based on analysis of 2M+ voice agent calls in production:
| Percentile | Response Time | User Experience |
|---|---|---|
| P50 (median) | 1.4-1.7s | Noticeable delay, but functional |
| P90 | 3.3-3.8s | Significant delay, user frustration |
| P95 | 4.3-5.4s | Severe delay, many interruptions |
| P99 | 8.4-15.3s | Complete breakdown |
Key Reality Check:
- Industry median: 1.4-1.7 seconds, about 5x slower than the 300ms human expectation
- 10% of calls exceed 3-5 seconds, causing severe user frustration
- 1% of calls exceed 8-15 seconds, a complete conversation breakdown
Achievable Latency Targets
| Latency Range | What Actually Happens | Business Reality |
|---|---|---|
| Under 1s | Theoretical ideal | Rarely achieved in production |
| 1.4-1.7s | Industry standard (median) | Where 50% of voice AI operates today |
| 3-5s | Common experience (P90-P95) | 10-20% of all interactions |
| 8-15s | Worst-case (P99) | 1% failure rate = thousands of bad experiences |
Critical thresholds:
- 300ms: Human expectation for natural conversation
- 800ms: Practical target for high-quality experiences
- 1.5s: Point where users notice significant degradation
- 3s: Users frequently interrupt or repeat themselves
Component Latency Breakdown
Voice agent latency accumulates across multiple components:
| Component | Typical Range | Optimized Range | Notes |
|---|---|---|---|
| STT | 200-400ms | 100-200ms | Streaming STT can reduce this |
| LLM Inference | 300-1000ms | 200-400ms | Highly model-dependent; typically the largest contributor to total latency |
| TTS | 150-500ms | 100-250ms | TTFB, not full synthesis |
| Network (Total) | 100-300ms | 50-150ms | Multiple round trips |
| Processing | 50-200ms | 20-50ms | Queuing, serialization |
| Turn Detection | 200-800ms | 200-400ms | Configurable silence threshold |
| Total | 1000-3200ms | 670-1450ms | End-to-end latency |
Provider benchmarks (2025):
- Deepgram Voice Agent API: <250ms end-to-end
- ElevenLabs Flash: 75-135ms TTS latency
- Murf Falcon: 55ms model latency, ~130ms time-to-first-audio
The Latency Reality Gap: While providers advertise sub-300ms latencies and humans expect instant responses, production systems consistently deliver 1.4-1.7s median latency. This gap between expectation and reality explains why users report agents that "feel slow" or "keep getting interrupted."
For detailed latency analysis and optimization strategies, see Voice AI Latency: What's Fast, What's Slow, and How to Fix It.
Latency Measurement Methodologies
| Metric | Definition | When to Use |
|---|---|---|
| VART | Voice Assistant Response Time: user request to TTS first byte | End-to-end measurement |
| TTFT | Time-to-First-Token: request to first LLM token | LLM performance |
| FTTS | First Token to Speech: LLM first token to TTS first byte | Pipeline efficiency |
| Endpointing | Time to ASR finalization after silence | Turn detection speed |
Best practice: Track p50, p90, p95, and p99 latencies in production—users remember bad experiences more than average performance. With typical p50 of 1.5s and p99 of 8-15s, that 1% represents thousands of terrible experiences daily at scale.
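A minimal sketch of percentile tracking over per-turn latencies, using the nearest-rank method (swap in `numpy.percentile` if you already depend on NumPy):

```python
import math

def latency_percentiles(turn_latencies_ms: list[float]) -> dict:
    """Report the percentiles that matter for voice UX: p50, p90, p95, p99."""
    ordered = sorted(turn_latencies_ms)

    def pct(p: float) -> float:
        # Nearest-rank percentile: rank = ceil(p/100 * N), converted to a 0-based index
        rank = max(math.ceil(p / 100 * len(ordered)) - 1, 0)
        return ordered[rank]

    return {f"p{p}": pct(p) for p in (50, 90, 95, 99)}
```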
Real-Time Factor (RTF)
Real-Time Factor (RTF) measures ASR processing speed relative to audio duration:
RTF = Processing Time / Audio Duration
- RTF < 1.0: Processing faster than real-time (required for production)
- RTF = 0.5: Processing twice as fast as real-time
- RTF > 1.0: Cannot keep up with real-time audio (not production-ready)
TTS Quality Metrics
Text-to-Speech quality affects user trust and experience. Robotic or unnatural speech undermines even perfectly accurate responses.
Mean Opinion Score (MOS)
Mean Opinion Score (MOS) is the gold standard for TTS quality evaluation, using human listeners to rate synthesized speech on a 1-5 scale:
| Score | Rating | Description |
|---|---|---|
| 5 | Excellent | Completely natural speech, imperceptible issues |
| 4 | Good | Mostly natural, just perceptible but not annoying |
| 3 | Fair | Equally natural and unnatural, slightly annoying |
| 2 | Poor | Mostly unnatural, annoying but not objectionable |
| 1 | Bad | Completely unnatural, very annoying |
Benchmark targets:
- 4.3-4.5: Excellent quality rivaling human speech
- 3.8-4.2: Good quality for most production use cases
- <3.5: Requires improvement before deployment
Methodology: ITU-T P.800 guidelines provide standardized protocols for conducting MOS tests. ITU-T P.808 defines crowdsourcing protocols for scalable perceptual testing.
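Once listener votes are collected, the aggregation itself is simple arithmetic. A minimal sketch, assuming a list of 1-5 ratings for one audio sample (it covers only the math, not the P.800/P.808 test protocol):

```python
import statistics

def mean_opinion_score(ratings: list[int]) -> dict:
    """Aggregate 1-5 listener ratings into a MOS with a rough 95% confidence interval."""
    mos = statistics.mean(ratings)
    # Normal approximation; reasonable for the dozens-to-hundreds of votes per sample
    ci = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5 if len(ratings) > 1 else 0.0
    return {"mos": round(mos, 2), "ci_95": round(ci, 2), "votes": len(ratings)}

print(mean_opinion_score([4, 5, 4, 4, 3, 5, 4, 4]))  # MOS ≈ 4.12 for this sample
```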
Objective TTS Metrics
When human evaluation isn't practical, objective metrics provide automated quality assessment:
| Metric | What It Measures | Use Case |
|---|---|---|
| MCD | Mel-Cepstral Distortion: spectral differences between real and synthetic speech | Technical quality assessment |
| MOSNet | ML-predicted perceived quality score | Automated MOS approximation |
| VQM | Voice Quality Metric: aggregates naturalness, accuracy, domain fit | Comprehensive quality scoring |
VQM components:
- Naturalness accuracy
- Numerical accuracy (reading numbers correctly)
- Domain accuracy (industry terminology)
- Multilingual accuracy
- Contextual accuracy
Safety & Compliance Metrics
Voice agents in production must handle safety and compliance rigorously—natural-sounding delivery can mask dangerous errors.
Hallucination Detection
Hallucinations in voice AI are especially risky because confident, natural-sounding speech masks incorrect information.
Definition (AssemblyAI standard): Five or more consecutive insertions, substitutions, or deletions constitute a hallucination event.
| Metric | Definition | Target |
|---|---|---|
| Hallucination Rate | Percentage of responses with hallucinated content | <1% |
| HUN Rate | Hallucination-Under-Noise: responses unrelated to audio input | <2% |
| Downstream Propagation | Hallucinations leading to incorrect actions | 0% |
Testing approach:
- Test with controlled noise and non-speech audio
- Verify hallucinations don't propagate to tool calls or database writes
- Implement real-time validation against verified sources
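As a minimal sketch of the run-length definition above, assuming you already have a per-word alignment op sequence (for example, from the WER alignment earlier); the op labels are illustrative:

```python
def hallucination_events(alignment_ops: list[str], min_run: int = 5) -> int:
    """Count hallucination events: runs of `min_run` or more consecutive
    non-match edit operations (insertions, substitutions, or deletions).
    Example input: ["match", "sub", "ins", "ins", "del", "sub", "match"]."""
    events, run = 0, 0
    for op in alignment_ops:
        if op in ("sub", "ins", "del"):
            run += 1
            if run == min_run:   # count each qualifying run exactly once
                events += 1
        else:
            run = 0
    return events
```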
Compliance & Safety Scoring
| Metric | Definition | Industry Standard |
|---|---|---|
| Safety Refusal Rate | Correct refusal on adversarial prompts | 99%+ |
| PII Detection Rate | Identification of sensitive data | 99%+ |
| Compliance Score | Adherence to regulatory requirements | 100% |
Enterprise requirements:
- SOC 2 Type II certification
- HIPAA BAA for healthcare applications
- PCI DSS compliance for payment processing
- GDPR/CCPA data handling
Hamming includes 50+ built-in metrics including hallucination detection, sentiment analysis, compliance scoring, and repetition detection.
Observability & Tracing
Production voice agents require distributed tracing across all components:
Audio Input → ASR → Intent → LLM → Tool Calls → TTS → Audio Output

Each stage in this pipeline should emit its own span, so a single trace captures the entire turn.
Trace metadata to capture:
- Prompt version and model parameters
- Confidence scores at each stage
- Latency breakdown by component
- Outcome signals (success/failure/escalation)
OpenTelemetry provides the standard framework for voice agent observability. For implementation guidance, see Voice Agent Observability: End-to-End Tracing.
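A minimal OpenTelemetry sketch of per-stage spans for a single turn is shown below; the `run_asr`, `classify_intent`, `run_llm`, and `synthesize` stubs stand in for your real components, and a production setup would also configure a `TracerProvider` with an exporter.

```python
from opentelemetry import trace

tracer = trace.get_tracer("voice-agent")

# Stubs standing in for your real ASR / NLU / LLM / TTS calls
def run_asr(audio: bytes): return "i want to book a flight", 0.94
def classify_intent(text: str): return "book_flight"
def run_llm(text: str, intent: str): return "Sure, where would you like to fly?"
def synthesize(text: str): return b"\x00" * 320   # fake audio bytes

def handle_turn(audio_chunk: bytes, user_id: str) -> bytes:
    """One conversational turn, emitting a span per pipeline stage."""
    with tracer.start_as_current_span("turn") as turn:
        turn.set_attribute("user.id", user_id)
        with tracer.start_as_current_span("asr") as span:
            transcript, confidence = run_asr(audio_chunk)
            span.set_attribute("asr.confidence", confidence)
        with tracer.start_as_current_span("intent"):
            intent = classify_intent(transcript)
        with tracer.start_as_current_span("llm") as span:
            span.set_attribute("prompt.version", "v42")  # tag for cross-version comparison
            reply = run_llm(transcript, intent)
        with tracer.start_as_current_span("tts"):
            return synthesize(reply)
```

Attach the metadata listed above (prompt version, confidence scores, outcome signals) as span attributes so latency and accuracy can later be sliced by version in your observability backend.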
Cost & ROI Metrics
Understanding cost economics enables data-driven decisions about voice AI investment.
Cost Per Call Comparison
| Channel | Cost per Interaction | Notes |
|---|---|---|
| Human agent | $5-8 | Wages, benefits, overhead, facilities |
| Voice AI | $0.01-0.25 per minute | Per-interaction cost varies with call length, provider, and features |
| Blended | $2-4 | AI handles routine, humans handle complex |
Cost reduction levers:
- Containment rate improvement (fewer human escalations)
- Average Handle Time (AHT) reduction
- First Call Resolution improvement (fewer repeat contacts)
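To see how containment drives the blended figure above, here is a minimal sketch of the arithmetic, using illustrative values drawn from the table (it ignores AI minutes spent on calls that later escalate):

```python
def blended_cost_per_call(
    containment_rate: float,      # fraction of calls the AI resolves, e.g. 0.70
    ai_cost_per_minute: float,    # e.g. 0.10 (table above: $0.01-0.25/min)
    avg_ai_minutes: float,        # average AI-handled call length
    human_cost_per_call: float,   # e.g. 6.00 (table above: $5-8)
) -> float:
    """Expected cost per inbound call once the AI handles the contained share."""
    ai_cost = ai_cost_per_minute * avg_ai_minutes
    return containment_rate * ai_cost + (1 - containment_rate) * human_cost_per_call

# 70% containment, $0.10/min AI over 4 minutes, $6 human calls -> ~$2.08 per call
print(round(blended_cost_per_call(0.70, 0.10, 4.0, 6.00), 2))
```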
ROI Benchmarks
| Metric | Typical Range | Timeframe |
|---|---|---|
| ROI | 200-500% | 3-6 months |
| Payback Period | 60-90 days | — |
| Three-Year ROI | Up to 331% | Independent studies |
| OpEx Reduction | Up to 45% | Automating tier-1 tasks |
Case study benchmarks:
- 40% agent workload reduction
- 30% AHT reduction
- $95,000 annual savings (mid-sized deployment)
Scaling Economics
Traditional call centers scale linearly: more calls = more agents = proportional cost increase.
Voice AI breaks this curve:
- Handle thousands of concurrent calls without proportional cost increases
- Fixed infrastructure costs amortized across volume
- Marginal cost per call decreases with scale
Production Monitoring & Instrumentation
Key Production Metrics Dashboard
| Category | Metrics | Alert Threshold |
|---|---|---|
| Accuracy | STT confidence, intent accuracy | <90% triggers alert |
| Latency | p50, p95, p99 response time | p95 >1000ms triggers alert |
| Success | Task completion, escalation rate | <80% TSR triggers alert |
| Quality | Sentiment score, repetition rate | Negative trend triggers alert |
The 4-Layer Quality Framework
Hamming's framework for comprehensive voice agent monitoring:
| Layer | Focus | Example Metrics |
|---|---|---|
| Infrastructure | System health | Packet loss, RTF, audio quality, uptime |
| Agent Execution | Behavioral correctness | Intent accuracy, tool success, flow completion |
| User Reaction | Experience signals | Sentiment, frustration, recovery patterns |
| Business Outcome | Value delivery | TSR, FCR, containment, revenue impact |
For the complete monitoring framework, see Voice Agent Monitoring Platform Guide.
Continuous Monitoring Best Practices
- Health checks: Run golden call sets every few minutes to detect drift or outages
- Alerting: Email and Slack notifications when thresholds breached
- Version tagging: Compare metrics across prompt/model versions
- Feedback loops: Feed low-scoring conversations back into evaluation datasets
Testing & Evaluation Methodologies
Offline vs Online Evaluation
| Approach | When | What | Strengths |
|---|---|---|---|
| Offline | Before deployment | Curated datasets, systematic comparison | Catches regressions, controlled conditions |
| Online | After deployment | Live traffic, continuous scoring | Reveals real-world issues, production conditions |
Best practice: Use both. Offline evaluation catches regressions before deployment. Online evaluation reveals issues that only appear in production.
Load & Stress Testing
| Test Type | Scale | Purpose |
|---|---|---|
| Baseline | 10-50 concurrent | Establish performance benchmarks |
| Load | 100-500 concurrent | Validate scaling behavior |
| Stress | 1,000+ concurrent | Find breaking points |
Testing requirements:
- Realistic conditions: accents, background noise, interruptions
- Edge cases: silence, interruptions, off-topic requests
- Production call replay: convert real failures to regression tests
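A minimal sketch of a concurrency harness is shown below; `place_test_call` is a stub that, in a real test, would drive an actual phone or WebRTC call against your agent and return the measured latency.

```python
import asyncio, math, random, time

async def place_test_call(call_id: int) -> float:
    """Stub: a real harness would place one call and return measured turn latency (ms)."""
    await asyncio.sleep(random.uniform(0.5, 2.0))   # simulated call duration
    return random.uniform(600, 1800)                # simulated turn latency in ms

async def load_test(concurrency: int) -> None:
    start = time.perf_counter()
    latencies = sorted(await asyncio.gather(*(place_test_call(i) for i in range(concurrency))))
    p95 = latencies[max(0, math.ceil(0.95 * len(latencies)) - 1)]
    print(f"{concurrency} calls in {time.perf_counter() - start:.1f}s, p95 turn latency {p95:.0f}ms")

asyncio.run(load_test(50))   # baseline tier above; raise toward 500/1,000+ for load and stress tiers
```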
Hamming's Voice Agent Simulation Engine achieves 95%+ accuracy predicting production behavior.
Multilingual Testing
| Dimension | Approach | Target |
|---|---|---|
| Baseline WER | Clean audio per language | Language-specific thresholds |
| Environmental | Café, traffic, airport noise | <15% WER degradation |
| Code-switching | Mixed language utterances | 80%+ task completion |
| Regional variants | Dialect-specific testing | Equivalent performance |
For the complete multilingual testing framework, see How to Test Multilingual Voice Agents.
Industry Benchmarks & Standards
Speech Recognition Benchmarks
| Framework/Dataset | Purpose | Use Case |
|---|---|---|
| SUPERB | Multi-task speech evaluation | ASR, speaker ID, emotion recognition |
| LibriSpeech | Clean speech ASR | Baseline accuracy benchmarks |
| Common Voice | Accent diversity | Multilingual/accent testing |
| Switchboard | Conversational speech | Real-world ASR performance |
Conversational AI Standards
| Standard | Organization | Purpose |
|---|---|---|
| PARADISE | Academic | Task success + dialogue costs + satisfaction |
| ITU-T P.800 | ITU | MOS testing protocols |
| ITU-T P.808 | ITU | Crowdsourced perceptual testing |
Industry Performance Standards
| Application | TTFT Target | Throughput |
|---|---|---|
| Chat applications | <100ms | 40+ tokens/second |
| Voice assistants | <500ms | Real-time streaming |
| Contact centers | <800ms | 100+ concurrent calls |
What These Metrics Don't Capture
No metric perfectly captures user experience. Some limitations we've observed at Hamming:
- WER doesn't capture semantic errors: "I want to cancel" transcribed as "I want to handle" differs by a single substituted word, yet the intent is completely wrong
- MOS scores are resource-intensive: Crowdsourced testing at scale requires budget and time that teams often don't have
- Latency percentiles mask distribution shape: Two systems with identical P95 can have very different user experiences
- Task success is binary: A "failed" task where the user got 80% of what they needed scores the same as a complete failure
- Containment rate doesn't measure quality: High containment with frustrated users is worse than lower containment with satisfied users
These metrics work best in combination, not isolation. We recommend tracking 3-5 metrics per category and looking for correlations between them.
Start Measuring Your Voice Agent with Hamming
Hamming provides comprehensive voice agent evaluation with 50+ built-in metrics, automated regression detection, and production monitoring—all in one platform. Stop guessing about voice agent performance; measure what matters.
Book a Demo with Hamming to see how enterprise teams achieve 95%+ evaluation accuracy with data-driven voice agent optimization.
Related Guides:
- How to Evaluate Voice Agents: Complete Guide — Hamming's VOICE Framework
- Voice Agent Testing Maturity Model — 5 levels of testing maturity
- ASR Accuracy Evaluation for Voice Agents — Hamming's 5-Factor ASR Framework
- Voice Agent Monitoring Platform Guide — Production monitoring best practices

