A voice agent team at a healthcare company watched their Datadog dashboard show all-green metrics for weeks. Server uptime: 99.9%. API latency: under 200ms. Error rate: 0.1%.
But their CSAT scores were plummeting. Customers were calling back frustrated. Escalations to human agents doubled.
What was happening?
They were monitoring infrastructure, not conversation quality.
Their voice agent's First Call Resolution had dropped from 75% to 58%. Intent accuracy had drifted from 95% to 87%. Users were saying "I already told you that" in 23% of calls. None of this appeared in their dashboards.
According to Hamming's analysis of 4M+ production voice agent calls, the 10 KPIs that actually predict voice agent failure are completely different from traditional APM metrics. This guide defines each one—with calculation formulas, industry benchmarks, alert thresholds, and remediation strategies.
TL;DR: Monitor voice agents using Hamming's 10 Critical Production KPIs:
Outcome KPIs: First Call Resolution (greater than 75%), Task Completion (greater than 85%), Containment Rate (greater than 70%), CSAT (greater than 75%)
Execution KPIs: Intent Accuracy (greater than 95%), End-to-End Latency P95 (less than 800ms), WER (less than 8%), Prompt Compliance (greater than 95%)
Experience KPIs: Context Retention (greater than 90%), AHT (balanced with quality)
Generic APM tools miss 60% of voice-specific failures. Set alerts on P90/P95 percentiles, not averages. Use the 4-Layer Dashboard Framework (Infrastructure → Execution → User Reaction → Outcome) for complete visibility.
Methodology Note: The KPI definitions, benchmarks, and alert thresholds in this guide are derived from Hamming's analysis of 4M+ production voice agent calls across 10K+ voice agents (2025-2026). Benchmarks represent median performance across healthcare, financial services, e-commerce, and customer support verticals. Thresholds may vary by use case complexity and user expectations.
Last Updated: January 2026
Related Guides:
- Voice Agent Analytics & Post-Call Metrics: Definitions, Formulas & Dashboards — Complete KPI reference with formulas, benchmarks, and dashboard design
- Monitor Pipecat Agents in Production — OpenTelemetry tracing and alerting for Pipecat voice agents
- Voice Agent Dashboard Template — 6-Metric Framework with Executive Reports
- Voice Agent Monitoring Platform Guide — 4-Layer Monitoring Stack
- How to Evaluate Voice Agents — VOICE Framework
- Debugging Voice Agents: Logs, Missed Intents & Error Dashboards — Turn-level debugging with confidence scores and fallback monitoring
What Are Voice Agent KPIs?
Voice agent KPIs are quantitative metrics that measure the performance, quality, and business impact of AI-powered voice agents in production. Unlike traditional contact center metrics that focus on operational efficiency, voice agent KPIs must capture AI-specific quality dimensions: speech recognition accuracy, intent classification, response latency, prompt compliance, and multi-turn reasoning.
Why voice agents need specialized KPIs:
| Traditional Contact Center | Voice Agent Specific |
|---|---|
| Average Handle Time (AHT) | Time to First Word (TTFW) |
| Abandonment Rate | Intent Accuracy |
| Queue Wait Time | Word Error Rate (WER) |
| Agent Utilization | Prompt Compliance |
| Service Level | Context Retention |
The 10 KPIs in this guide span four layers of Hamming's monitoring framework:
- Infrastructure Layer: Audio quality, network reliability, system capacity
- Execution Layer: ASR accuracy, LLM latency, intent classification, tool calls
- User Reaction Layer: Sentiment, interruptions, retry patterns, frustration signals
- Outcome Layer: Task completion, containment, FCR, CSAT, business value
The 10 Critical Voice Agent Production KPIs
This master scorecard provides a reference for all 10 KPIs. Each is detailed in subsequent sections with formulas, benchmarks, alert configurations, and remediation strategies.
| KPI | Definition | Formula | Target | Warning | Critical |
|---|---|---|---|---|---|
| 1. First Call Resolution | % of issues resolved without follow-up | (resolved calls / total calls) × 100 | greater than 75% | less than 70% | less than 60% |
| 2. Average Handle Time | Mean conversation duration | (talk time + wrap-up) / total calls | 4-6 min | greater than 2x baseline | greater than 3x baseline |
| 3. Intent Accuracy | % of correctly classified intents | (correct / total) × 100 | greater than 95% | less than 92% | less than 90% |
| 4. E2E Latency (P95) | 95th percentile response time | STT + LLM + TTS | less than 800ms | greater than 1200ms | greater than 1500ms |
| 5. Word Error Rate | ASR transcription accuracy | (S + I + D) / words × 100 | less than 8% | greater than 12% | greater than 15% |
| 6. Task Completion | % completing business goal | (completed / attempts) × 100 | greater than 85% | less than 75% | less than 70% |
| 7. CSAT | % positive ratings (4-5 stars) | (positive / responses) × 100 | greater than 75% | less than 70% | less than 65% |
| 8. Containment Rate | % handled without transfer | (AI-resolved / total) × 100 | greater than 70% | less than 60% | less than 50% |
| 9. Prompt Compliance | % of instructions followed | (compliant / total) × 100 | greater than 95% | less than 90% | less than 85% |
| 10. Context Retention | % of contextual refs correct | (correct refs / total) × 100 | greater than 90% | less than 85% | less than 80% |
KPI 1: First Call Resolution (FCR)
First Call Resolution measures the percentage of customer issues resolved during the initial interaction without requiring follow-up calls, transfers, or escalations. FCR is the ultimate test of voice agent effectiveness—it indicates whether the agent understood the request, had the knowledge to address it, and executed the resolution correctly.
Definition & Calculation
Formula:
FCR = (Calls resolved without transfer or callback / Total calls) × 100
What it measures: The agent's ability to completely resolve customer needs in a single interaction. This requires accurate understanding, comprehensive knowledge base integration, and proper dialog flow execution.
Industry Benchmarks
| Level | Range | Interpretation |
|---|---|---|
| Excellent | greater than 80% | World-class resolution capability |
| Good | 75-80% | Strong performance, minor optimization needed |
| Acceptable | 65-75% | Adequate but significant improvement opportunity |
| Poor | less than 65% | Systemic issues requiring immediate attention |
Industry examples:
- Healthcare provider achieved 75% FCR handling appointment reminders and prescription queries
- Financial services targeting 80%+ FCR for account balance and transaction inquiries
- E-commerce achieving 70-75% FCR for order status and return initiation
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | less than 70% | 1 hour | Investigate intent accuracy, knowledge gaps |
| Critical | less than 60% | 30 min | Immediate review of failed resolutions |
Common failure modes:
- Knowledge base gaps — Agent lacks information to resolve specific query types
- Intent misclassification — Wrong dialog flow activated, unable to complete resolution
- Incomplete dialog flows — Flow ends without proper resolution confirmation
- Premature transfers — Agent escalates when resolution was possible
Leading indicators: Rising fallback frequency, negative sentiment shifts, and repeated intent reclassification signal FCR problems before the metric drops.
Remediation Strategies
- Analyze failed resolution transcripts — Identify patterns in queries that fail to resolve. Look for knowledge gaps, missing intents, and dialog breakdowns.
- Segment FCR by query type — Track which categories underperform. Intent X might have 90% FCR while Intent Y has 45%, revealing specific training needs.
- Monitor callback patterns — Users calling back within 24 hours indicate incomplete resolution even if the call was marked "resolved" (see the sketch below).
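To make the callback check concrete, here is a minimal sketch of the FCR calculation with a 24-hour callback window, assuming each call record exposes caller_id, timestamp, transferred, and resolved fields (the field names are illustrative, not a fixed schema):
from datetime import timedelta

CALLBACK_WINDOW = timedelta(hours=24)

def first_call_resolution(calls: list[dict]) -> float:
    """FCR = calls resolved without a transfer or a 24h callback / total calls x 100."""
    calls = sorted(calls, key=lambda c: c["timestamp"])
    resolved = 0
    for i, call in enumerate(calls):
        if call["transferred"] or not call["resolved"]:
            continue
        # Treat any repeat call from the same caller within 24 hours as a failed resolution.
        callback = any(
            other["caller_id"] == call["caller_id"]
            and timedelta(0) < other["timestamp"] - call["timestamp"] <= CALLBACK_WINDOW
            for other in calls[i + 1:]
        )
        if not callback:
            resolved += 1
    return 100.0 * resolved / len(calls) if calls else 0.0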
How Hamming helps: Hamming automatically tracks FCR across all calls, segments by intent category, and correlates with other metrics to identify root causes. Production call replay enables rapid diagnosis of failed resolutions.
KPI 2: Average Handle Time (AHT)
Average Handle Time measures the mean duration of voice agent calls from greeting to completion. Unlike traditional contact centers where lower AHT often equals better efficiency, voice agents must balance speed with quality—rushing conversations degrades CSAT and increases repeat calls.
Definition & Calculation
Formula:
AHT = (Total conversation time + After-call work time) / Total calls handled
What it measures: Overall conversation efficiency. However, optimal AHT varies significantly by use case complexity.
Industry Benchmarks
| Use Case | Target AHT | Context |
|---|---|---|
| Simple FAQs | 1-2 min | Balance inquiries, store hours |
| Account inquiries | 3-5 min | Transaction history, profile updates |
| Complex troubleshooting | 6-10 min | Technical support, multi-step resolution |
| Sales/appointments | 5-8 min | Consultative, relationship-building |
Critical insight: Research shows that conversations lasting 4-6 minutes have 67% higher satisfaction than sub-2-minute interactions. The sweet spot is thoroughness, not pure speed.
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | greater than 2x baseline | 15 min | Review for loops, verbose responses |
| Critical | greater than 3x baseline | 10 min | Investigate stuck states, infinite loops |
Common failure modes:
- Infinite loops — Agent repeating same questions or responses
- Verbose responses — Unnecessarily long explanations that don't add value
- Stuck dialog states — Conversation unable to progress to resolution
- Excessive clarification — Repeatedly asking users to confirm or repeat
Leading indicators: Track the "Longest Monologue" metric—long monologues indicate the agent is failing to provide concise responses or is misinterpreting the query.
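As a sketch of that leading indicator, the longest monologue can be computed from per-turn speaker and duration data (the field names and the 30-second threshold below are illustrative):
def longest_monologue(turns: list[dict]) -> float:
    """Longest uninterrupted agent speaking stretch, in seconds."""
    longest = current = 0.0
    for turn in turns:
        if turn["speaker"] == "agent":
            current += turn["duration_s"]   # consecutive agent turns accumulate
            longest = max(longest, current)
        else:
            current = 0.0                   # user spoke, the monologue ends
    return longest

turns = [
    {"speaker": "agent", "duration_s": 12.0},
    {"speaker": "agent", "duration_s": 21.5},   # 33.5s back-to-back -> flag
    {"speaker": "user", "duration_s": 3.0},
]
if longest_monologue(turns) > 30:               # illustrative threshold
    print("verbose or stuck response detected")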
Remediation Strategies
- Monitor turn count distribution — Calls with greater than 15 turns often indicate confusion or loops
- Optimize prompt engineering — Reduce verbosity without sacrificing task completion
- Identify AHT outliers — Replay calls with AHT greater than 3x average to detect patterns
How Hamming helps: Hamming tracks turn-level metrics including longest monologue, turn count, and silence duration to identify verbose or stuck conversations before they impact aggregate AHT.
KPI 3: Intent Classification Accuracy
Intent accuracy measures the percentage of user utterances correctly mapped to their intended action or query category. This is where voice agents face unique challenges: intent error rates run 3-10x higher than in text-only systems because ASR errors cascade into intent classification.
Definition & Calculation
Formula:
Intent Accuracy = (Correctly classified intents / Total classification attempts) × 100
What it measures: The NLU system's ability to understand what users want, accounting for ASR errors, accent variations, and natural language variability.
Industry Benchmarks
| Level | Range | Production Readiness |
|---|---|---|
| Excellent | greater than 98% | Required for critical domains (healthcare, banking) |
| Good | 95-98% | Acceptable for most production use cases |
| Acceptable | 90-95% | Requires human fallback for edge cases |
| Poor | less than 90% | Not production ready |
Scale impact: At 10,000 calls/day with a 6.9% error rate, 690 users experience intent misclassification every day. At enterprise scale, even small accuracy improvements have massive impact.
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | less than 92% | 15 min | Review confusion matrix by intent |
| Critical | less than 90% | 10 min | Immediate investigation, consider fallback |
Common failure modes:
- ASR cascade — Transcription errors cause downstream intent failures
- Intent confusion — Similar intents frequently misclassified for each other
- Model drift — Real-world inputs differ from training data over time
- Provider API changes — STT or NLU API updates silently degrade performance
Leading indicators: Confidence decay across conversation turns, repeated intent reclassification, and rising fallback frequency signal issues before aggregate accuracy drops.
Remediation Strategies
- Build intent utterance matrix — Test with 10K+ utterances, not 50. Include variations: formal, casual, accented, noisy.
- Track confusion patterns — Identify which specific intents confuse each other, not just aggregate accuracy (see the sketch below).
- Monitor confidence scores — Flag low-confidence predictions for human review and retraining.
- Version-aware tracking — Prompt updates shift behavior baselines. Track performance by prompt version.
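Here is a minimal sketch of steps 2 and 3, assuming each classification is logged with the predicted intent, a human- or eval-confirmed actual intent, and a confidence score (the record shape is illustrative):
from collections import Counter

def confusion_pairs(records: list[dict], top_n: int = 5):
    """Most frequent (actual -> predicted) intent confusions, ignoring correct hits."""
    pairs = Counter(
        (r["actual_intent"], r["predicted_intent"])
        for r in records
        if r["actual_intent"] != r["predicted_intent"]
    )
    return pairs.most_common(top_n)

def low_confidence(records: list[dict], threshold: float = 0.7):
    """Predictions to route for human review and retraining."""
    return [r for r in records if r["confidence"] < threshold]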
How Hamming helps: Hamming's automated evaluation runs intent utterance matrices across accent variations and noise conditions, tracking confusion patterns and confidence distributions at scale.
Related: Intent Recognition at Scale for detailed testing methodology.
KPI 4: End-to-End Latency (P95)
End-to-end latency measures the complete response time from when a user stops speaking until they hear the agent's first word—the "Mouth-to-Ear Turn Gap." This directly determines whether conversations feel natural or robotic. Humans expect responses within 300-500 milliseconds; exceeding this threshold makes conversations feel stilted.
Definition & Calculation
Formula:
E2E Latency = Audio Transmission + STT + LLM + TTS + Audio Playback
Typical breakdown:
- Audio transmission: 40ms
- Buffering/decoding: 55ms
- STT processing: 150-350ms
- LLM generation: 200-800ms
- TTS synthesis: 100-200ms
- Audio playback: 30ms
What it measures: The perceived responsiveness of the voice agent. Track P50, P90, P95, and P99 percentiles—averages hide critical outliers.
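A quick sketch of percentile tracking over raw per-turn latency samples, using only the standard library (the 1000ms check mirrors the P95 target in the table below; the sample values are illustrative):
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """P50/P90/P95/P99 from raw per-turn latency samples; averages hide the tail."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p95": cuts[94], "p99": cuts[98]}

samples = [520, 540, 610, 650, 700, 720, 780, 900, 1400, 2100]  # ms, illustrative
p = latency_percentiles(samples)
if p["p95"] > 1000:   # P95 target from the table below
    print(f"P95 latency {p['p95']:.0f}ms exceeds the 1000ms target")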
Industry Benchmarks
| Percentile | Target | Warning | Critical |
|---|---|---|---|
| P50 | less than 600ms | greater than 800ms | greater than 1000ms |
| P90 | less than 800ms | greater than 1200ms | greater than 1500ms |
| P95 | less than 1000ms | greater than 1500ms | greater than 2000ms |
| P99 | less than 1500ms | greater than 2000ms | greater than 3000ms |
Why percentiles matter: An average of 500ms might hide that 10% of users experience delays over 2 seconds. Those outliers drive complaints and abandonment.
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | P95 greater than 1200ms | 5 min | Investigate component latencies |
| Critical | P95 greater than 1500ms | 3 min | Immediate escalation, check provider status |
Common failure modes:
- LLM slowdown — Provider degradation or model overload
- Cold starts — First request takes 3-5x longer due to model loading
- Network variability — Mobile networks add 50-200ms vs broadband
- Geographic distance — Transcontinental calls add 100-300ms
Leading indicators: Monitor component-level TTFB (Time to First Byte). If any component exceeds 2x baseline, investigate immediately.
Remediation Strategies
- Decompose latency by component — Identify whether STT, LLM, or TTS is the bottleneck
- Implement streaming — Stream STT, LLM responses, and TTS playback to reduce perceived latency
- Geographic optimization — Deploy services closer to users via edge computing
- Model selection tradeoffs — Use faster models for simple queries, reserve larger models for complex requests
How Hamming helps: Hamming captures per-component latency traces for every call, enabling instant drill-down from aggregate P95 to specific bottleneck identification.
Related: How to Optimize Latency in Voice Agents for optimization strategies.
KPI 5: Word Error Rate (WER)
Word Error Rate is the gold standard for measuring ASR transcription accuracy. WER quantifies how accurately the speech recognition system converts spoken words to text by comparing ASR output to reference transcripts. However, WER must be evaluated in context—a 10% WER might still achieve 95% intent accuracy if errors don't affect understanding.
Definition & Calculation
Formula:
WER = (Substitutions + Insertions + Deletions) / Total words in reference × 100
Example:
Reference: "I need to check my account balance"
ASR output: "I need to check my count balance"
Substitutions: 1 ("account" → "count")
Insertions: 0
Deletions: 0
Total words: 7
WER = (1 + 0 + 0) / 7 × 100 = 14.3%
What it measures: Raw transcription accuracy. Important caveat: WER doesn't measure whether tasks were completed or the agent responded appropriately.
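For reference, WER reduces to a word-level edit distance between the reference transcript and the ASR output; a minimal sketch:
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words x 100."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("I need to check my account balance",
                      "I need to check my count balance"))  # ~14.3, matching the worked example above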
Industry Benchmarks
| Level | WER Range | Context |
|---|---|---|
| Excellent | <5% | Clean audio, native speakers |
| Good | 5-8% | Production standard for English |
| Acceptable | 8-12% | Background noise, accents present |
| Poor | >12% | Requires STT optimization |
Language variance: English typically achieves less than 8% WER while Hindi may reach 18-22% WER. Set language-specific thresholds.
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | greater than 12% | 10 min | Review audio quality, check ASR provider |
| Critical | greater than 15% | 5 min | Fallback strategies, immediate investigation |
Common failure modes:
- Background noise — Airport, café, traffic environments
- Audio quality degradation — Low bitrate, packet loss, codec issues
- Accent variation — Non-native speakers, regional dialects
- Domain vocabulary — Medical terms, product names, proper nouns
Leading indicators: A sudden drop in transcription accuracy often points to an issue with the ASR provider itself. Track WER by user segment to identify affected populations.
Remediation Strategies
- Test across conditions — Evaluate with varied noise (airport, café, traffic), device types (phones, smart speakers), and network conditions
- Monitor downstream effects — WER should be tested in context. Evaluate ASR accuracy alongside intent success, repetition rates, and recovery success.
- Implement fallback strategies — Request repetition, confirm low-confidence transcriptions, offer DTMF input for critical data
How Hamming helps: Hamming evaluates ASR accuracy alongside downstream effects like intent success, repetition, and recovery across accents and noise conditions—not just isolated WER scores.
Related: ASR Accuracy Evaluation for testing methodology.
KPI 6: Task Completion Rate
Task completion measures the percentage of calls where the agent successfully executes the intended business goal—booking an appointment, completing a purchase, resolving an inquiry, or processing a request. This is the ultimate outcome metric that connects agent performance to business value.
Definition & Calculation
Formula:
Task Completion = (Calls with successful task completion / Total calls with task intent) × 100
What it measures: Whether the voice agent actually accomplishes what users need, not just whether it responded appropriately.
Industry Benchmarks
| Complexity | Target | Typical Range |
|---|---|---|
| Simple tasks | greater than 90% | Balance check, store hours |
| Moderate complexity | 75-85% | Appointment booking, order status |
| Complex workflows | 60-75% | Multi-step troubleshooting, claims |
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | less than 75% | 15 min | Review failed task transcripts |
| Critical | less than 70% | 10 min | Immediate investigation, check integrations |
Common failure modes:
- Tool invocation errors — API calls fail or return unexpected results
- Parameter extraction failures — Agent extracts wrong values from conversation
- Context loss mid-task — Agent forgets information needed to complete task
- Premature termination — Conversation ends before task is confirmed complete
- Hallucinated tool calls — LLM claims it called a tool but didn't actually invoke it
Leading indicators: Monitor tool call success rates and parameter accuracy to detect integration issues before they impact task completion.
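A small sketch of that leading indicator, computing per-tool success rates from tool.called events (the event fields mirror the instrumentation section later in this guide; the sample data is illustrative):
from collections import defaultdict

def tool_success_rates(events: list[dict]) -> dict[str, float]:
    """Per-tool success rate from tool.called events."""
    totals = defaultdict(lambda: [0, 0])          # tool -> [successes, total]
    for e in events:
        totals[e["tool_name"]][1] += 1
        totals[e["tool_name"]][0] += e["success"]
    return {tool: 100.0 * ok / total for tool, (ok, total) in totals.items()}

events = [
    {"tool_name": "create_appointment", "success": True},
    {"tool_name": "create_appointment", "success": False},
    {"tool_name": "get_balance", "success": True},
]
print(tool_success_rates(events))   # {'create_appointment': 50.0, 'get_balance': 100.0}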
Remediation Strategies
- Trace every interaction — Capture STT output, intent classification, tool calls, response generation, and TTS input for every call
- Segment by task type — Identify which specific workflows underperform and prioritize fixes
- Monitor tool call accuracy — Track whether tools are called with correct parameters and return expected results
How Hamming helps: Hamming traces every step of task execution including tool calls, parameter extraction, and completion confirmation. Production call replay enables rapid diagnosis of failed tasks.
KPI 7: Customer Satisfaction (CSAT)
CSAT measures the percentage of customers rating their interaction positively, typically 4-5 on a 5-point scale. While it's a lagging indicator, CSAT correlates strongly with retention, referrals, and revenue—making it essential for understanding the human impact of voice agent performance.
Definition & Calculation
Formula:
CSAT = (Positive ratings (4-5 stars) / Total survey responses) × 100
What it measures: User perception of interaction quality, encompassing accuracy, speed, helpfulness, and overall experience.
Industry Benchmarks
| Level | CSAT Range | Interpretation |
|---|---|---|
| World-class | greater than 85% | Exceptional experience |
| Good | 75-84% | Strong performance |
| Acceptable | 65-74% | Room for improvement |
| Poor | less than 65% | Significant issues |
Critical insight: The worst 4% of calls drove close to 40% of complaints and early hangups. Outlier detection matters more than averages.
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | less than 70% | 2 weeks | Analyze low-CSAT call patterns |
| Critical | less than 65% | 1 week | Immediate experience review |
Common failure modes:
- Latency frustration — Slow responses create poor experience
- Repetition fatigue — Users forced to repeat themselves
- Misunderstanding impact — Intent errors visible to users
- Tone/empathy gaps — Responses feel robotic or dismissive
Leading indicators: Track sentiment trajectory within calls. Negative sentiment shifts mid-conversation predict low CSAT before surveys.
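One way to quantify sentiment trajectory is the slope of per-turn sentiment scores; a sketch, assuming each turn is already scored between -1 and +1 by whatever sentiment model you run (the -0.1 threshold and sample scores are illustrative):
def sentiment_slope(scores: list[float]) -> float:
    """Least-squares slope of per-turn sentiment; a negative slope predicts low CSAT."""
    n = len(scores)
    if n < 2:
        return 0.0
    mean_x, mean_y = (n - 1) / 2, sum(scores) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

turn_scores = [0.4, 0.2, -0.1, -0.3, -0.6]       # sentiment drifting negative
if sentiment_slope(turn_scores) < -0.1:          # illustrative threshold
    print("negative sentiment trajectory - flag for review before the CSAT survey")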
Remediation Strategies
- Correlate CSAT with metrics — Link low CSAT to specific patterns: high latency, repetition, misunderstanding, incomplete resolution
- Monitor user interruptions — Track how often users barge in or speak over the agent—a direct frustration signal
- Analyze sentiment velocity — Rate of sentiment change indicates conversation quality degradation
How Hamming helps: Hamming's speech-level analysis detects caller frustration, sentiment shifts, emotional cues, pauses, interruptions, and tone changes—evaluating how callers said things, not just what they said.
Related: Voice Agent Analytics to Improve CSAT for optimization strategies.
KPI 8: Call Containment Rate
Containment rate measures the percentage of calls handled entirely by the AI agent without requiring transfer to a human. It's an essential indicator of automation effectiveness and directly impacts operational costs—but must be balanced against quality to avoid "containing" calls by frustrating customers into giving up.
Definition & Calculation
Formula:
Containment Rate = (Calls handled entirely by AI / Total inbound calls) × 100
What it measures: Automation effectiveness—how much human labor is being offset by the voice agent.
Industry Benchmarks
| Level | Range | Context |
|---|---|---|
| Excellent | greater than 80% | Simple, well-defined use cases |
| Good | 70-80% | Standard customer service |
| Acceptable | 60-70% | Complex queries, new deployments |
| Poor | less than 60% | Significant capability gaps |
Economic impact: McKinsey research found AI automation enables companies to reduce agent headcount by 40-50% while handling 20-30% more calls—but only with quality containment.
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | less than 60% | 1 hour | Review escalation reasons |
| Critical | less than 50% | 30 min | Investigate knowledge gaps, routing issues |
Common failure modes:
- Knowledge gaps — Agent lacks information for query categories
- Complex query handling — Multi-part or ambiguous requests exceed capability
- Authentication failures — Unable to verify user identity
- User preference — Customer explicitly requests human agent
- False containment — Bot "contains" by frustrating customers into giving up
Leading indicators: Track escalation reasons by category. A spike in "user requested human" suggests capability gaps or frustration.
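A minimal sketch of escalation-reason tracking, assuming transferred calls log a categorical escalation_reason (the category names and sample data are illustrative):
from collections import Counter

def escalation_breakdown(calls: list[dict]) -> Counter:
    """Tally why calls left the AI agent; a spike in any one reason is a targeted fix."""
    return Counter(c["escalation_reason"] for c in calls if c.get("transferred"))

calls = [
    {"transferred": True, "escalation_reason": "user_requested_human"},
    {"transferred": True, "escalation_reason": "knowledge_gap"},
    {"transferred": True, "escalation_reason": "user_requested_human"},
    {"transferred": False},
]
print(escalation_breakdown(calls).most_common())
# [('user_requested_human', 2), ('knowledge_gap', 1)]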
Remediation Strategies
- Analyze escalation transcripts — Identify knowledge gaps and out-of-scope queries requiring agent expansion
- Segment by query type — AI might excel at appointment confirmations but struggle with complex product inquiries
- Balance containment with quality — High containment with low CSAT indicates false containment
How Hamming helps: Hamming tracks escalation frequency, reasons, and correlates with other metrics to distinguish quality containment from frustrated abandonment.
KPI 9: Prompt Compliance Rate
Prompt compliance measures how frequently the agent follows specific instructions in its system prompt—including scope boundaries, safety guardrails, required disclaimers, and escalation triggers. This is critical for regulatory compliance, brand safety, and preventing scope creep.
Definition & Calculation
Formula:
Prompt Compliance = (Responses following instructions / Total responses) × 100
What it measures: Whether the agent executes within defined boundaries, follows safety protocols, and maintains brand voice.
Industry Benchmarks
| Instruction Type | Target | Minimum Acceptable |
|---|---|---|
| Safety/regulatory | greater than 99% | 95% |
| Scope boundaries | greater than 95% | 90% |
| Brand voice/tone | greater than 90% | 85% |
| Required disclosures | greater than 99% | 95% |
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | less than 90% | 15 min | Review compliance failures |
| Critical | less than 85% | 5 min | Immediate prompt review, potential pause |
Common failure modes:
- Scope creep — Progressively going beyond permitted topics
- Safety guardrail bypass — Ignoring safety precautions in prompts
- Disclosure omission — Failing to provide required legal disclaimers
- Brand voice drift — Tone shifting away from guidelines
Leading indicators: Track compliance by prompt section to identify which specific instructions agents consistently violate.
Remediation Strategies
- Implement automated assertion checks — Test against prohibited topics, required disclaimers, escalation triggers, data handling policies (see the sketch below)
- Version-aware monitoring — Prompt updates shift behavior baselines. Track compliance by prompt version.
- Segment by instruction category — Safety-critical instructions need higher compliance thresholds than stylistic guidelines
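Here is a sketch of the assertion-check idea from step 1: simple string and regex checks over a call's agent turns, with placeholder policy text you would replace with your own disclaimers and prohibited topics:
import re

REQUIRED_DISCLAIMER = "this call may be recorded"                            # placeholder policy text
PROHIBITED_PATTERNS = [r"\bmedical diagnosis\b", r"\bguaranteed returns\b"]  # placeholders

def check_compliance(agent_turns: list[str]) -> dict[str, bool]:
    """Pass/fail per assertion for one call's agent turns."""
    full_text = " ".join(agent_turns).lower()
    return {
        "disclaimer_present": REQUIRED_DISCLAIMER in full_text,
        "no_prohibited_topics": not any(re.search(p, full_text) for p in PROHIBITED_PATTERNS),
    }

def compliance_rate(calls: list[list[str]]) -> float:
    """Share of calls passing every assertion."""
    results = [all(check_compliance(turns).values()) for turns in calls]
    return 100.0 * sum(results) / len(results) if results else 0.0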
How Hamming helps: Hamming includes 50+ built-in metrics including compliance scorers, plus unlimited custom assertions you define. Automated testing validates prompt compliance across thousands of scenarios.
KPI 10: Context Retention Accuracy
Context retention measures the agent's ability to retain and use relevant information across conversation turns. When users say "I already told you that," it's a context retention failure. This metric is critical for multi-turn conversations requiring information recall and reasoning.
Definition & Calculation
Formula:
Context Retention = (Correct contextual references / Total contextual reference opportunities) × 100
What it measures: Whether the agent remembers and correctly uses information provided earlier in the conversation.
Industry Benchmarks
| Scenario | Target | Context |
|---|---|---|
| Same-call retention | greater than 95% | Information from current call |
| Multi-turn reasoning | greater than 90% | Complex tasks requiring recall |
| Session persistence | greater than 85% | Information across call transfers |
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | less than 85% | 15 min | Review context loss patterns |
| Critical | less than 80% | 10 min | Investigate session management, prompts |
Common failure modes:
- Short context windows — LLM context limit exceeded, early information lost
- Session state issues — Context not properly persisted across turns
- Memory retrieval errors — RAG system fails to retrieve relevant history
- Prompt engineering gaps — Instructions don't emphasize context preservation
Leading indicators: Track user repetition patterns—users repeating information signals agent failed to retain context.
Remediation Strategies
- Test multi-turn scenarios — Create scripted conversations requiring information recall across 5+ exchanges (see the sketch below)
- Monitor repetition requests — "What was that again?" or users re-stating information indicates failures
- Optimize conversation memory — Tune sliding window size, summarization strategy, and RAG integration
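A sketch of step 1 as a scripted test: supply a fact early, keep the conversation going, and assert the fact is reused later without being re-asked. run_agent_turn is a hypothetical hook into your agent, not a real API:
def test_context_retention(run_agent_turn) -> bool:
    """Pass if the agent reuses the account number given in turn 1 without re-asking."""
    script = [
        "Hi, my account number is 44-1207.",
        "I'd like to update the phone number on that account.",
        "Yes, please confirm which account you're updating.",
    ]
    replies = [run_agent_turn(utterance) for utterance in script]  # hypothetical agent hook
    final = replies[-1].lower()
    remembered = "44-1207" in final
    re_asked = "account number" in final and "?" in final
    return remembered and not re_asked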
How Hamming helps: Hamming tracks context management as a core metric, identifying conversations where context loss occurs and correlating with user frustration signals.
How to Instrument Voice Agent KPIs
Effective KPI monitoring requires comprehensive instrumentation capturing events at every stage of the voice agent pipeline. Here's the event collection framework used by production voice agents monitored by Hamming.
Core Events to Capture
| Event | Required Fields | Purpose |
|---|---|---|
| call.started | call_id, timestamp, caller_metadata, agent_version | Session initialization |
| turn.user | transcript, confidence, audio_duration, timestamp | ASR quality tracking |
| turn.agent | text, latency_breakdown, tool_calls, intent | Execution tracking |
| intent.classified | intent, confidence, alternatives, method | Intent accuracy |
| tool.called | tool_name, params, result, latency, success | Integration health |
| call.ended | outcome, duration, metrics_summary, disposition | Outcome tracking |
Event Schema Example
{
"event": "turn.agent",
"timestamp": "2025-01-20T10:30:00Z",
"call_id": "call_abc123",
"turn_index": 3,
"latency_ms": {
"stt": 150,
"llm": 420,
"tts": 180,
"total": 750
},
"text": "I can help you check your account balance.",
"intent": {
"classified": "account_inquiry",
"confidence": 0.94,
"alternatives": [
{"intent": "balance_check", "confidence": 0.89}
]
},
"tool_calls": [
{
"tool": "get_balance",
"params": {"account_id": "12345"},
"result": {"balance": 1500.00},
"latency_ms": 180,
"success": true
}
]
}
OpenTelemetry Integration
Hamming natively ingests OpenTelemetry traces, spans, and logs for unified observability:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Configure Hamming as the trace destination
hamming_exporter = OTLPSpanExporter(
    endpoint="https://otel.hamming.ai",
    headers={"x-api-key": "your-api-key"}
)

# Register the exporter so spans are actually shipped
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(hamming_exporter))
trace.set_tracer_provider(provider)

# Instrument voice agent components
tracer = trace.get_tracer("voice-agent")
with tracer.start_as_current_span("process_turn") as span:
    # turn_index, intent, and total_latency come from your turn-processing code
    span.set_attribute("turn.index", turn_index)
    span.set_attribute("intent.classified", intent)
    span.set_attribute("latency.total_ms", total_latency)
Component-Level Latency Tracking
Break down end-to-end latency into component spans:
User speaks
│
▼ (40ms network)
┌─────────────┐
│ STT Service │ ←── Span: stt.transcribe (150ms)
└─────────────┘
│
▼
┌─────────────┐
│ LLM Service │ ←── Span: llm.generate (420ms)
└─────────────┘
│
▼
┌─────────────┐
│ TTS Service │ ←── Span: tts.synthesize (180ms)
└─────────────┘
│
▼ (30ms network)
User hears response
Total E2E: 820ms
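Continuing the OpenTelemetry example above, the per-component spans in this diagram could be emitted as children of process_turn; a sketch in which transcribe, generate, and synthesize stand in for your own pipeline calls:
from opentelemetry import trace

tracer = trace.get_tracer("voice-agent")

def handle_turn(audio_chunk):
    """One user turn, traced per component so P95 can be decomposed later."""
    with tracer.start_as_current_span("process_turn"):
        with tracer.start_as_current_span("stt.transcribe"):
            transcript = transcribe(audio_chunk)          # placeholder STT call
        with tracer.start_as_current_span("llm.generate"):
            reply_text = generate(transcript)             # placeholder LLM call
        with tracer.start_as_current_span("tts.synthesize"):
            reply_audio = synthesize(reply_text)          # placeholder TTS call
        return reply_audio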
Dashboard Design: The 4-Layer Framework
Effective voice agent dashboards organize metrics into four layers, enabling both executive overview and operational drill-down.
Executive View (One Glance)
┌─────────────────────────────────────────────────────────┐
│ Voice Agent Health: 94.2% ✓ │
├───────────────┬───────────────┬─────────────────────────┤
│ Calls Today │ Task Success │ TTFW P90 │
│ 12,847 │ 94.2% │ 720ms │
│ ↑ 8% │ ↓ 0.3% │ ↓ 50ms │
├───────────────┴───────────────┴─────────────────────────┤
│ Active Alerts: 1 (P2) │
│ ⚠️ Intent accuracy below baseline in "billing" flow │
└─────────────────────────────────────────────────────────┘
Layer 1: Infrastructure Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| Audio Quality (MOS) | >4.0 | <3.5 |
| Packet Loss | <0.1% | >1% |
| Concurrent Calls | <80% capacity | >90% |
| Call Setup Time | <2s | >5s |
Layer 2: Execution Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| Intent Accuracy | >95% | <90% |
| WER | <8% | >12% |
| LLM Latency (P90) | <800ms | >1500ms |
| Tool Call Success | >99% | <95% |
Layer 3: User Reaction Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| User Interruptions | <10% | >20% |
| Retry Rate | <5% | >15% |
| Sentiment Trajectory | Positive/Stable | Negative trend |
| Abandonment Rate | <8% | >15% |
Layer 4: Outcome Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| Task Completion | >85% | <70% |
| FCR | >75% | <60% |
| Containment Rate | >70% | <50% |
| CSAT | >75% | <65% |
Related: Anatomy of a Perfect Voice Agent Analytics Dashboard for detailed dashboard design.
Alerting Playbook: Detection to Remediation
Severity Levels
| Level | Response Time | Channel | Criteria |
|---|---|---|---|
| P0: Critical | <5 min | PagerDuty | Revenue impact, system down |
| P1: High | <15 min | Slack urgent | Customer impact, major degradation |
| P2: Medium | <1 hour | Slack | Performance degradation |
| P3: Low | <4 hours | | Trend warnings |
Alert Configuration Template
name: TTFW Degradation
condition: ttfw_p90 > 1000ms
duration: 5 minutes
severity: P1
channels:
- slack://voice-alerts
- pagerduty://voice-team
context:
- current_value
- baseline_value
- sample_calls
- dashboard_link
runbook: /docs/runbooks/ttfw-degradation
cooldown: 30 minutes
Recommended Initial Alerts
Start with these four high-impact alerts:
- TTFW P90 >1000ms (duration: 5 min) → P1
- Task completion <80% (duration: 15 min) → P1
- Intent accuracy <90% (duration: 10 min) → P1
- WER >12% (duration: 10 min) → P2
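These four rules are easy to encode; a minimal sketch that evaluates them against pre-computed rolling-window aggregates (the metric names and the metrics dict are illustrative):
# rule: (metric, comparison, threshold, window_minutes, severity)
STARTER_ALERTS = [
    ("ttfw_p90_ms",         ">", 1000, 5,  "P1"),
    ("task_completion_pct", "<", 80,   15, "P1"),
    ("intent_accuracy_pct", "<", 90,   10, "P1"),
    ("wer_pct",             ">", 12,   10, "P2"),
]

def breached(metrics: dict[str, float]) -> list[tuple[str, str]]:
    """Return (severity, message) for every rule its window aggregate violates."""
    alerts = []
    for metric, op, threshold, window, severity in STARTER_ALERTS:
        value = metrics.get(metric)
        if value is None:
            continue
        if (op == ">" and value > threshold) or (op == "<" and value < threshold):
            alerts.append((severity, f"{metric}={value} breached {op}{threshold} over {window} min"))
    return alerts

print(breached({"ttfw_p90_ms": 1340, "wer_pct": 7.1}))
# [('P1', 'ttfw_p90_ms=1340 breached >1000 over 5 min')]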
Four-Step Remediation Workflow
- Detection: ML-based anomaly detection + rule-based validation
- Alert: Instant notification with metadata (affected agent, timestamp, sample calls)
- Diagnosis: Drill down from metrics to transcripts and audio
- Remediation: Coordinate fix deployment, monitor recovery
How Hamming helps: Detection begins with ML-based anomaly detection and rule-based validation. When anomalies are detected, alerts push instantly to Slack with metadata. Engineers can drill down from high-level metrics to individual transcripts and audio for rapid diagnosis.
Starter Alert Thresholds Table
Not sure where to start? Use these conservative thresholds as your Day 1 configuration. Tune them after 2-4 weeks of baseline data.
| KPI | Conservative Start | Tighten After Baseline | Aggressive Target |
|---|---|---|---|
| FCR | Alert if <60% | Alert if <70% | Alert if <75% |
| Task Completion | Alert if <65% | Alert if <75% | Alert if <85% |
| Intent Accuracy | Alert if <88% | Alert if <92% | Alert if <95% |
| TTFW P90 | Alert if >1500ms | Alert if >1200ms | Alert if >900ms |
| WER | Alert if >15% | Alert if >12% | Alert if >8% |
| Containment | Alert if <50% | Alert if <60% | Alert if <70% |
| CSAT | Alert if <60% | Alert if <68% | Alert if <75% |
| Context Retention | Alert if <75% | Alert if <85% | Alert if <90% |
| Prompt Compliance | Alert if <85% | Alert if <92% | Alert if <95% |
How to use this table:
- Week 1: Deploy with "Conservative Start" thresholds to avoid false positives
- Week 2-4: Collect baseline data, observe normal variance
- Week 4+: Tighten to "Tighten After Baseline" based on actual performance
- Month 2+: Move to "Aggressive Target" as you optimize
Slack Alert Examples by KPI
Here's what well-structured Slack alerts look like for each critical KPI:
FCR Alert Example
🚨 P1 ALERT: First Call Resolution Below Threshold
📊 Metric: fcr_rate
Current: 62% (threshold: 70%)
Baseline (7-day): 76%
Duration: 25 minutes
🔍 Root Cause Indicators:
• Intent "billing_dispute" FCR: 41% (↓ from 72%)
• Knowledge base gaps detected in billing flow
• 3 new edge cases not covered
🔗 [Dashboard] [Failed Calls] [KB Gaps Report] [Runbook]
TTFW Alert Example
🚨 P1 ALERT: Time to First Word Degradation
📊 Metric: ttfw_p90
Current: 1,340ms (threshold: 1,000ms)
Baseline: 680ms
Duration: 12 minutes
🔍 Component Breakdown:
• STT: 185ms (normal)
• LLM: 890ms (↑ 340ms - LIKELY CAUSE)
• TTS: 145ms (normal)
🔗 [LLM Provider Status] [Sample Calls] [Runbook]
Intent Accuracy Alert Example
🚨 P1 ALERT: Intent Classification Accuracy Drop
📊 Metric: intent_accuracy
Current: 89.2% (threshold: 92%)
Baseline: 96.1%
Duration: 18 minutes
🔍 Confusion Matrix:
• "cancel_order" → "track_order": 8.2% confusion
• "refund_request" → "cancel_order": 5.1% confusion
• ASR WER normal (6.8%)
🔗 [Confusion Matrix] [Sample Misclassifications] [Runbook]
Task Completion Alert Example
🚨 P0 ALERT: Task Completion Rate Critical
📊 Metric: task_completion_rate
Current: 68% (threshold: 75%)
Baseline: 87%
Duration: 8 minutes
Revenue Impact: ~$420/hour
🔍 Failed Tasks Breakdown:
• Tool: "create_appointment" failing 23% (API timeout)
• Tool: "lookup_account" success rate normal
• 156 affected calls in last 30 min
🔗 [API Status] [Failed Calls] [Runbook] [Incident Channel]
WER Alert Example
⚠️ P2 ALERT: Word Error Rate Elevated
📊 Metric: asr_wer
Current: 13.2% (threshold: 12%)
Baseline: 7.1%
Duration: 22 minutes
🔍 Pattern Analysis:
• Mobile callers: 18.4% WER (↑ from 9.2%)
• Landline callers: 6.8% WER (normal)
• Background noise detected in 34% of affected calls
🔗 [Audio Samples] [Caller Segment Analysis] [Runbook]
After Prompt Updates: What Regresses First
Prompt and model updates are the #1 cause of production regressions. Here's the pattern we see across deployments:
Regression Order After Prompt Changes
| Order | KPI | Why It Regresses First | Detection Window |
|---|---|---|---|
| 1st | Prompt Compliance | Direct impact from instruction changes | 5-15 minutes |
| 2nd | Intent Accuracy | Changed phrasing affects classification | 15-30 minutes |
| 3rd | Context Retention | New prompts may handle context differently | 30-60 minutes |
| 4th | TTFW | Longer prompts = slower inference | 15-30 minutes |
| 5th | Task Completion | Cascades from intent/compliance issues | 1-2 hours |
| 6th | FCR | Downstream effect of task failures | 2-4 hours |
| 7th | CSAT | Lagging indicator of all above | 24-48 hours |
Canary Deployment Strategy for Prompt Updates
Never deploy prompt changes to 100% of traffic immediately. Use canary deployment:
┌─────────────────────────────────────────────────────────────────┐
│ PROMPT UPDATE DEPLOYMENT │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 1: Canary (5% traffic) │
│ Duration: 1 hour minimum │
│ Watch: Prompt Compliance, Intent Accuracy, TTFW │
│ Rollback if: Any KPI drops greater than 5% vs control │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 2: Expanded Canary (25% traffic) │
│ Duration: 2 hours minimum │
│ Watch: + Context Retention, Task Completion │
│ Rollback if: Any KPI drops greater than 3% vs control │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 3: Majority (75% traffic) │
│ Duration: 4 hours minimum │
│ Watch: All KPIs including FCR │
│ Rollback if: Any critical KPI regresses │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 4: Full Rollout (100% traffic) │
│ Continue monitoring for 24-48 hours │
│ Watch: CSAT (lagging indicator) │
└─────────────────────────────────────────────────────────────────┘
Post-Prompt-Update Monitoring Checklist
- Immediate (0-15 min): Watch Prompt Compliance rate—any drop greater than 2%?
- Early (15-60 min): Check Intent Accuracy confusion matrix—new patterns?
- Medium (1-4 hours): Monitor Task Completion and Context Retention
- Extended (4-24 hours): Track FCR trend—are users calling back?
- Lagging (24-48 hours): Review CSAT scores—user perception changed?
Rollback Triggers
| KPI | Rollback Threshold | Timeframe |
|---|---|---|
| Prompt Compliance | <90% (from 95%+) | 15 min sustained |
| Intent Accuracy | <90% (from 95%+) | 30 min sustained |
| Task Completion | <75% (from 85%+) | 30 min sustained |
| TTFW P90 | >1.5x baseline | 15 min sustained |
| FCR | <65% (from 75%+) | 2 hours sustained |
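For the canary stages, the rollback decision reduces to comparing canary KPIs against the control group; a sketch for higher-is-better KPIs, with Stage 1's 5% tolerance as the default (metric names and values are illustrative):
def should_rollback(control: dict[str, float], canary: dict[str, float],
                    max_drop_pct: float = 5.0) -> list[str]:
    """KPIs where the canary underperforms control by more than the allowed relative drop."""
    regressions = []
    for kpi, baseline in control.items():
        if baseline <= 0 or kpi not in canary:
            continue
        drop_pct = 100.0 * (baseline - canary[kpi]) / baseline
        if drop_pct > max_drop_pct:
            regressions.append(f"{kpi}: {canary[kpi]:.1f} vs control {baseline:.1f} ({drop_pct:.1f}% drop)")
    return regressions

control = {"intent_accuracy": 96.1, "prompt_compliance": 97.0, "task_completion": 87.0}
canary  = {"intent_accuracy": 89.2, "prompt_compliance": 96.5, "task_completion": 86.0}
print(should_rollback(control, canary))
# ['intent_accuracy: 89.2 vs control 96.1 (7.2% drop)']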
How Hamming helps: Hamming's A/B testing framework automatically splits traffic, tracks KPI differences between prompt versions, and can trigger automated rollback when regression thresholds are breached.
Platform Comparison: Hamming vs Alternatives
| Capability | Hamming | Braintrust | Roark | Datadog |
|---|---|---|---|---|
| Voice-native KPIs | ✅ 50+ built-in | ⚠️ Limited | ✅ Yes | ❌ No |
| Real-time alerting | ✅ ML + rules | ⚠️ Manual | ✅ Yes | ✅ Infra only |
| Call replay | ✅ One-click | ⚠️ Manual | ✅ Yes | ❌ No |
| OpenTelemetry | ✅ Native | ✅ Yes | ⚠️ Limited | ✅ Yes |
| Intent accuracy | ✅ Automated | ⚠️ Manual | ✅ Yes | ❌ No |
| Sentiment analysis | ✅ Speech-level | ❌ No | ⚠️ Basic | ❌ No |
| Automated testing | ✅ 1000+ concurrent | ⚠️ Limited | ⚠️ Limited | ❌ No |
| Prompt compliance | ✅ Built-in | ⚠️ Custom | ⚠️ Limited | ❌ No |
When to use Hamming: Production voice agent monitoring with comprehensive KPI coverage, automated testing, and speech-level analysis. Ideal for teams needing unified testing + monitoring in one platform.
Related Guides
- Post-Call Analytics for Voice Agents — 4-Layer Analytics Framework for metrics, observability, and continuous improvement
- Voice Agent Drop-Off Analysis — Framework for measuring and reducing call abandonment with funnel tracking
- Slack Alerts for Voice Agents — Alert templates for latency, ASR drift, jitter, and prompt regressions
- Voice Agent Monitoring Platform Guide — 4-Layer Monitoring Stack
- How to Evaluate Voice Agents — VOICE Framework
- Voice Agent Observability Tracing Guide — OpenTelemetry integration
- Anatomy of a Perfect Voice Agent Analytics Dashboard — Dashboard design
- Intent Recognition at Scale — Intent testing methodology
- Real-Time Voice Analytics Dashboards — End-to-end tracing, prompt drift detection, and automated evals for customer service
- Debugging Voice Agents: Logs, Missed Intents & Error Dashboards — Turn-level debugging with confidence scores and fallback monitoring