A voice agent team at a healthcare company watched their Datadog dashboard show all-green metrics for weeks. Server uptime: 99.9%. API latency: under 200ms. Error rate: 0.1%.
But their CSAT scores were plummeting. Customers were calling back frustrated. Escalations to human agents doubled.
What was happening?
They were monitoring infrastructure, not conversation quality.
Their voice agent's First Call Resolution had dropped from 75% to 58%. Intent accuracy had drifted from 95% to 87%. Users were saying "I already told you that" in 23% of calls. None of this appeared in their dashboards.
According to Hamming's analysis of 1M+ production voice agent calls, the 10 KPIs that actually predict voice agent failure are completely different from traditional APM metrics. This guide defines each one—with calculation formulas, industry benchmarks, alert thresholds, and remediation strategies.
TL;DR: Monitor voice agents using Hamming's 10 Critical Production KPIs:
Outcome KPIs: First Call Resolution (>75%), Task Completion (>85%), Containment Rate (>70%), CSAT (>75%)
Execution KPIs: Intent Accuracy (>95%), End-to-End Latency P95 (<800ms), WER (<8%), Prompt Compliance (>95%)
Experience KPIs: Context Retention (>90%), AHT (balanced with quality)
Generic APM tools miss 60% of voice-specific failures. Set alerts on P90/P95 percentiles, not averages. Use the 4-Layer Dashboard Framework (Infrastructure → Execution → User Reaction → Outcome) for complete visibility.
Methodology Note: The KPI definitions, benchmarks, and alert thresholds in this guide are derived from Hamming's analysis of 1M+ production voice agent calls across 50+ deployments (2024-2025). Benchmarks represent median performance across healthcare, financial services, e-commerce, and customer support verticals. Thresholds may vary by use case complexity and user expectations.
Last Updated: January 2026
Related Guides:
- Monitor Pipecat Agents in Production — OpenTelemetry tracing and alerting for Pipecat voice agents
- Voice Agent Dashboard Template — 6-Metric Framework with Executive Reports
- Voice Agent Monitoring Platform Guide — 4-Layer Monitoring Stack
- How to Evaluate Voice Agents — VOICE Framework
What Are Voice Agent KPIs?
Voice agent KPIs are quantitative metrics that measure the performance, quality, and business impact of AI-powered voice agents in production. Unlike traditional contact center metrics that focus on operational efficiency, voice agent KPIs must capture AI-specific quality dimensions: speech recognition accuracy, intent classification, response latency, prompt compliance, and multi-turn reasoning.
Why voice agents need specialized KPIs:
| Traditional Contact Center | Voice Agent Specific |
|---|---|
| Average Handle Time (AHT) | Time to First Word (TTFW) |
| Abandonment Rate | Intent Accuracy |
| Queue Wait Time | Word Error Rate (WER) |
| Agent Utilization | Prompt Compliance |
| Service Level | Context Retention |
The 10 KPIs in this guide span four layers of Hamming's monitoring framework:
- Infrastructure Layer: Audio quality, network reliability, system capacity
- Execution Layer: ASR accuracy, LLM latency, intent classification, tool calls
- User Reaction Layer: Sentiment, interruptions, retry patterns, frustration signals
- Outcome Layer: Task completion, containment, FCR, CSAT, business value
The 10 Critical Voice Agent Production KPIs
This master scorecard provides a reference for all 10 KPIs. Each is detailed in subsequent sections with formulas, benchmarks, alert configurations, and remediation strategies.
| KPI | Definition | Formula | Target | Warning | Critical |
|---|---|---|---|---|---|
| 1. First Call Resolution | % of issues resolved without follow-up | (resolved calls / total calls) × 100 | >75% | <70% | <60% |
| 2. Average Handle Time | Mean conversation duration | (talk time + wrap-up) / total calls | 4-6 min | >2x baseline | >3x baseline |
| 3. Intent Accuracy | % of correctly classified intents | (correct / total) × 100 | >95% | <92% | <90% |
| 4. E2E Latency (P95) | 95th percentile response time | STT + LLM + TTS | <800ms | >1200ms | >1500ms |
| 5. Word Error Rate | ASR transcription accuracy | (S + I + D) / words × 100 | <8% | >12% | >15% |
| 6. Task Completion | % completing business goal | (completed / attempts) × 100 | >85% | <75% | <70% |
| 7. CSAT | % positive ratings (4-5 stars) | (positive / responses) × 100 | >75% | <70% | <65% |
| 8. Containment Rate | % handled without transfer | (AI-resolved / total) × 100 | >70% | <60% | <50% |
| 9. Prompt Compliance | % of instructions followed | (compliant / total) × 100 | >95% | <90% | <85% |
| 10. Context Retention | % of contextual refs correct | (correct refs / total) × 100 | >90% | <85% | <80% |
KPI 1: First Call Resolution (FCR)
First Call Resolution measures the percentage of customer issues resolved during the initial interaction without requiring follow-up calls, transfers, or escalations. FCR is the ultimate test of voice agent effectiveness—it indicates whether the agent understood the request, had the knowledge to address it, and executed the resolution correctly.
Definition & Calculation
Formula:
FCR = (Calls resolved without transfer or callback / Total calls) × 100
What it measures: The agent's ability to completely resolve customer needs in a single interaction. This requires accurate understanding, comprehensive knowledge base integration, and proper dialog flow execution.
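As a minimal sketch rather than Hamming's implementation, FCR can be computed from per-call disposition records; the resolved, transferred, and callback_within_24h fields are illustrative names:
from dataclasses import dataclass

@dataclass
class CallRecord:
    call_id: str
    resolved: bool              # agent marked the issue resolved
    transferred: bool           # escalated to a human agent
    callback_within_24h: bool   # same caller returned within 24 hours

def first_call_resolution(calls: list[CallRecord]) -> float:
    """FCR = calls resolved without transfer or callback / total calls x 100."""
    if not calls:
        return 0.0
    resolved = sum(
        1 for c in calls
        if c.resolved and not c.transferred and not c.callback_within_24h
    )
    return resolved / len(calls) * 100

calls = [
    CallRecord("c1", resolved=True, transferred=False, callback_within_24h=False),
    CallRecord("c2", resolved=True, transferred=False, callback_within_24h=True),
    CallRecord("c3", resolved=False, transferred=True, callback_within_24h=False),
]
print(f"FCR: {first_call_resolution(calls):.1f}%")  # FCR: 33.3%
Treating a callback within 24 hours as "not resolved" mirrors the callback-pattern check in the remediation steps below.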
Industry Benchmarks
| Level | Range | Interpretation |
|---|---|---|
| Excellent | >80% | World-class resolution capability |
| Good | 75-80% | Strong performance, minor optimization needed |
| Acceptable | 65-75% | Adequate but significant improvement opportunity |
| Poor | <65% | Systemic issues requiring immediate attention |
Industry examples:
- Healthcare provider achieved 75% FCR handling appointment reminders and prescription queries
- Financial services targeting 80%+ FCR for account balance and transaction inquiries
- E-commerce achieving 70-75% FCR for order status and return initiation
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | <70% | 1 hour | Investigate intent accuracy, knowledge gaps |
| Critical | <60% | 30 min | Immediate review of failed resolutions |
Common failure modes:
- Knowledge base gaps — Agent lacks information to resolve specific query types
- Intent misclassification — Wrong dialog flow activated, unable to complete resolution
- Incomplete dialog flows — Flow ends without proper resolution confirmation
- Premature transfers — Agent escalates when resolution was possible
Leading indicators: Rising fallback frequency, negative sentiment shifts, and repeated intent reclassification signal FCR problems before the metric drops.
Remediation Strategies
- Analyze failed resolution transcripts — Identify patterns in queries that fail to resolve. Look for knowledge gaps, missing intents, and dialog breakdowns.
- Segment FCR by query type — Track which categories underperform. Intent X might have 90% FCR while Intent Y has 45%, revealing specific training needs.
- Monitor callback patterns — Users calling back within 24 hours indicate incomplete resolution even if the call was marked "resolved."
How Hamming helps: Hamming automatically tracks FCR across all calls, segments by intent category, and correlates with other metrics to identify root causes. Production call replay enables rapid diagnosis of failed resolutions.
KPI 2: Average Handle Time (AHT)
Average Handle Time measures the mean duration of voice agent conversations from greeting to completion. Unlike traditional contact centers where lower AHT often equals better efficiency, voice agents must balance speed with quality—rushing conversations degrades CSAT and increases repeat calls.
Definition & Calculation
Formula:
AHT = (Total conversation time + After-call work time) / Total calls handled
What it measures: Overall conversation efficiency. However, optimal AHT varies significantly by use case complexity.
Industry Benchmarks
| Use Case | Target AHT | Context |
|---|---|---|
| Simple FAQs | 1-2 min | Balance inquiries, store hours |
| Account inquiries | 3-5 min | Transaction history, profile updates |
| Complex troubleshooting | 6-10 min | Technical support, multi-step resolution |
| Sales/appointments | 5-8 min | Consultative, relationship-building |
Critical insight: Research shows that conversations lasting 4-6 minutes have 67% higher satisfaction than sub-2-minute interactions. The sweet spot is thoroughness, not pure speed.
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | >2x baseline | 15 min | Review for loops, verbose responses |
| Critical | >3x baseline | 10 min | Investigate stuck states, infinite loops |
Common failure modes:
- Infinite loops — Agent repeating same questions or responses
- Verbose responses — Unnecessarily long explanations that don't add value
- Stuck dialog states — Conversation unable to progress to resolution
- Excessive clarification — Repeatedly asking users to confirm or repeat
Leading indicators: Track the "Longest Monologue" metric—long monologues indicate the agent is failing to provide concise responses or is misinterpreting the query.
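As a minimal sketch (the turn events and field names are illustrative, not a Hamming schema), per-call handle time, turn count, and longest monologue can be derived from turn-level events and screened against the turn-count and AHT-outlier rules in the remediation steps below:
from statistics import mean

# Illustrative turn events: (call_id, speaker, duration_seconds)
turns = [
    ("c1", "agent", 12.0), ("c1", "user", 4.0), ("c1", "agent", 45.0),
    ("c2", "agent", 8.0), ("c2", "user", 3.0), ("c2", "agent", 10.0),
]

def call_metrics(turn_events):
    """Aggregate per-call handle time, turn count, and longest agent monologue."""
    metrics = {}
    for call_id, speaker, duration in turn_events:
        m = metrics.setdefault(call_id, {"aht_s": 0.0, "turns": 0, "longest_monologue_s": 0.0})
        m["aht_s"] += duration
        m["turns"] += 1
        if speaker == "agent":
            m["longest_monologue_s"] = max(m["longest_monologue_s"], duration)
    return metrics

metrics = call_metrics(turns)
baseline_aht = mean(m["aht_s"] for m in metrics.values())
for call_id, m in metrics.items():
    outlier = m["aht_s"] > 3 * baseline_aht or m["turns"] > 15
    print(f"{call_id}: AHT={m['aht_s']:.0f}s, turns={m['turns']}, "
          f"longest_monologue={m['longest_monologue_s']:.0f}s, review={outlier}")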
Remediation Strategies
- Monitor turn count distribution — Calls with >15 turns often indicate confusion or loops
- Optimize prompt engineering — Reduce verbosity without sacrificing task completion
- Identify AHT outliers — Replay calls with AHT >3x average to detect patterns
How Hamming helps: Hamming tracks turn-level metrics including longest monologue, turn count, and silence duration to identify verbose or stuck conversations before they impact aggregate AHT.
KPI 3: Intent Classification Accuracy
Intent accuracy measures the percentage of user utterances correctly mapped to their intended action or query category. This is where voice agents face unique challenges: they have 3-10x higher intent error rates than text-only systems due to ASR error cascade effects.
Definition & Calculation
Formula:
Intent Accuracy = (Correctly classified intents / Total classification attempts) × 100
What it measures: The NLU system's ability to understand what users want, accounting for ASR errors, accent variations, and natural language variability.
Industry Benchmarks
| Level | Range | Production Readiness |
|---|---|---|
| Excellent | >98% | Required for critical domains (healthcare, banking) |
| Good | 95-98% | Acceptable for most production use cases |
| Acceptable | 90-95% | Requires human fallback for edge cases |
| Poor | <90% | Not production ready |
Scale impact: At 10,000 calls/day with a 6.9% error rate, 690 users per day experience intent misclassification. At enterprise scale, even small accuracy improvements have a massive impact.
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | <92% | 15 min | Review confusion matrix by intent |
| Critical | <90% | 10 min | Immediate investigation, consider fallback |
Common failure modes:
- ASR cascade — Transcription errors cause downstream intent failures
- Intent confusion — Similar intents frequently misclassified for each other
- Model drift — Real-world inputs differ from training data over time
- Provider API changes — STT or NLU API updates silently degrade performance
Leading indicators: Confidence decay across conversation turns, repeated intent reclassification, and rising fallback frequency signal issues before aggregate accuracy drops.
Remediation Strategies
- Build intent utterance matrix — Test with 10K+ utterances, not 50. Include variations: formal, casual, accented, noisy.
- Track confusion patterns — Identify which specific intents confuse each other, not just aggregate accuracy (see the sketch after this list).
- Monitor confidence scores — Flag low-confidence predictions for human review and retraining.
- Version-aware tracking — Prompt updates shift behavior baselines. Track performance by prompt version.
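As a minimal sketch of confusion-pair tracking, with a hand-labeled toy evaluation set standing in for production data:
from collections import Counter

# Illustrative (expected_intent, predicted_intent) pairs from a labeled evaluation set
results = [
    ("balance_check", "balance_check"),
    ("balance_check", "transaction_history"),
    ("cancel_order", "order_status"),
    ("cancel_order", "cancel_order"),
    ("cancel_order", "order_status"),
]

correct = sum(1 for expected, predicted in results if expected == predicted)
print(f"Intent accuracy: {correct / len(results) * 100:.1f}%")

# Most frequent confusion pairs, worst first
confusions = Counter(
    (expected, predicted) for expected, predicted in results if expected != predicted
)
for (expected, predicted), count in confusions.most_common():
    print(f"{expected} misclassified as {predicted}: {count}x")
Tracking the pairs, not just the aggregate score, shows exactly which intents need more training utterances or clearer boundaries.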
How Hamming helps: Hamming's automated evaluation runs intent utterance matrices across accent variations and noise conditions, tracking confusion patterns and confidence distributions at scale.
Related: Intent Recognition at Scale for detailed testing methodology.
KPI 4: End-to-End Latency (P95)
End-to-end latency measures the complete response time from when a user stops speaking until they hear the agent's first word—the "Mouth-to-Ear Turn Gap." This directly determines whether conversations feel natural or robotic. Humans expect responses within 300-500 milliseconds; exceeding this threshold makes conversations feel stilted.
Definition & Calculation
Formula:
E2E Latency = Audio Transmission + STT + LLM + TTS + Audio Playback
Typical breakdown:
- Audio transmission: 40ms
- Buffering/decoding: 55ms
- STT processing: 150-350ms
- LLM generation: 200-800ms
- TTS synthesis: 100-200ms
- Audio playback: 30ms
What it measures: The perceived responsiveness of the voice agent. Track P50, P90, P95, and P99 percentiles—averages hide critical outliers.
Industry Benchmarks
| Percentile | Target | Warning | Critical |
|---|---|---|---|
| P50 | <600ms | >800ms | >1000ms |
| P90 | <800ms | >1200ms | >1500ms |
| P95 | <1000ms | >1500ms | >2000ms |
| P99 | <1500ms | >2000ms | >3000ms |
Why percentiles matter: An average of 500ms might hide that 10% of users experience delays over 2 seconds. Those outliers drive complaints and abandonment.
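As a small sketch, the table above can be checked directly against per-turn latency samples using a nearest-rank percentile; the sample values are illustrative:
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# Illustrative end-to-end latencies (ms) for recent turns
latencies_ms = [520, 610, 480, 700, 650, 1900, 560, 590, 640, 2100]

for p, warn_ms in [(50, 800), (90, 1200), (95, 1500), (99, 2000)]:
    value = percentile(latencies_ms, p)
    status = "WARNING" if value > warn_ms else "ok"
    print(f"P{p}: {value:.0f}ms ({status})")
In this sample the P50 looks healthy while P90 and above breach their warning thresholds, which is exactly the outlier pattern an average would hide.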
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | P95 >1200ms | 5 min | Investigate component latencies |
| Critical | P95 >1500ms | 3 min | Immediate escalation, check provider status |
Common failure modes:
- LLM slowdown — Provider degradation or model overload
- Cold starts — First request takes 3-5x longer due to model loading
- Network variability — Mobile networks add 50-200ms vs broadband
- Geographic distance — Transcontinental calls add 100-300ms
Leading indicators: Monitor component-level TTFB (Time to First Byte). If any component exceeds 2x baseline, investigate immediately.
Remediation Strategies
- Decompose latency by component — Identify whether STT, LLM, or TTS is the bottleneck
- Implement streaming — Stream STT, LLM responses, and TTS playback to reduce perceived latency
- Geographic optimization — Deploy services closer to users via edge computing
- Model selection tradeoffs — Use faster models for simple queries, reserve larger models for complex requests
How Hamming helps: Hamming captures per-component latency traces for every call, enabling instant drill-down from aggregate P95 to specific bottleneck identification.
Related: How to Optimize Latency in Voice Agents for optimization strategies.
KPI 5: Word Error Rate (WER)
Word Error Rate is the gold standard for measuring ASR transcription accuracy. WER quantifies how accurately the speech recognition system converts spoken words to text by comparing ASR output to reference transcripts. However, WER must be evaluated in context—a 10% WER might still achieve 95% intent accuracy if errors don't affect understanding.
Definition & Calculation
Formula:
WER = (Substitutions + Insertions + Deletions) / Total words in reference × 100
Example:
Reference: "I need to check my account balance"
ASR output: "I need to check my count balance"
Substitutions: 1 ("account" → "count")
Insertions: 0
Deletions: 0
Total words: 7
WER = (1 + 0 + 0) / 7 × 100 = 14.3%
What it measures: Raw transcription accuracy. Important caveat: WER doesn't measure whether tasks were completed or the agent responded appropriately.
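The calculation can be reproduced with a short word-level edit-distance implementation (libraries such as jiwer compute the same thing); this sketch returns 14.3% for the example above:
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words x 100,
    computed via Levenshtein edit distance over words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = minimum edits turning the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,               # deletion
                dp[i][j - 1] + 1,               # insertion
                dp[i - 1][j - 1] + substitution  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref) * 100

wer = word_error_rate(
    "I need to check my account balance",
    "I need to check my count balance",
)
print(f"WER: {wer:.1f}%")  # WER: 14.3%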
Industry Benchmarks
| Level | WER Range | Context |
|---|---|---|
| Excellent | <5% | Clean audio, native speakers |
| Good | 5-8% | Production standard for English |
| Acceptable | 8-12% | Background noise, accents present |
| Poor | >12% | Requires STT optimization |
Language variance: English typically achieves <8% WER while Hindi may reach 18-22% WER. Set language-specific thresholds.
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | >12% | 10 min | Review audio quality, check ASR provider |
| Critical | >15% | 5 min | Fallback strategies, immediate investigation |
Common failure modes:
- Background noise — Airport, café, traffic environments
- Audio quality degradation — Low bitrate, packet loss, codec issues
- Accent variation — Non-native speakers, regional dialects
- Domain vocabulary — Medical terms, product names, proper nouns
Leading indicators: A sudden drop in transcription accuracy usually points to a problem with the speech recognition service itself. Track WER by user segment to identify affected populations.
Remediation Strategies
- Test across conditions — Evaluate with varied noise (airport, café, traffic), device types (phones, smart speakers), and network conditions
- Monitor downstream effects — WER should be tested in context. Evaluate ASR accuracy alongside intent success, repetition rates, and recovery success.
- Implement fallback strategies — Request repetition, confirm low-confidence transcriptions, offer DTMF input for critical data
How Hamming helps: Hamming evaluates ASR accuracy alongside downstream effects like intent success, repetition, and recovery across accents and noise conditions—not just isolated WER scores.
Related: ASR Accuracy Evaluation for testing methodology.
KPI 6: Task Completion Rate
Task completion measures the percentage of calls where the agent successfully executes the intended business goal—booking an appointment, completing a purchase, resolving an inquiry, or processing a request. This is the ultimate outcome metric that connects agent performance to business value.
Definition & Calculation
Formula:
Task Completion = (Calls with successful task completion / Total calls with task intent) × 100
What it measures: Whether the voice agent actually accomplishes what users need, not just whether it responded appropriately.
Industry Benchmarks
| Complexity | Target | Typical Range |
|---|---|---|
| Simple tasks | >90% | Balance check, store hours |
| Moderate complexity | 75-85% | Appointment booking, order status |
| Complex workflows | 60-75% | Multi-step troubleshooting, claims |
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | <75% | 15 min | Review failed task transcripts |
| Critical | <70% | 10 min | Immediate investigation, check integrations |
Common failure modes:
- Tool invocation errors — API calls fail or return unexpected results
- Parameter extraction failures — Agent extracts wrong values from conversation
- Context loss mid-task — Agent forgets information needed to complete task
- Premature termination — Conversation ends before task is confirmed complete
- Hallucinated tool calls — LLM claims it called a tool but didn't actually invoke it
Leading indicators: Monitor tool call success rates and parameter accuracy to detect integration issues before they impact task completion.
Remediation Strategies
- Trace every interaction — Capture STT output, intent classification, tool calls, response generation, and TTS input for every call
- Segment by task type — Identify which specific workflows underperform and prioritize fixes
- Monitor tool call accuracy — Track whether tools are called with correct parameters and return expected results (a sketch follows this list)
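As a minimal sketch using the tool.called event fields defined later in this guide (the sample events are illustrative), per-tool success rates can be aggregated and compared to the >99% target from the execution-layer dashboard:
from collections import defaultdict

# Illustrative tool.called events (see the event schema later in this guide)
tool_events = [
    {"tool_name": "get_balance", "success": True},
    {"tool_name": "get_balance", "success": True},
    {"tool_name": "book_appointment", "success": False},
    {"tool_name": "book_appointment", "success": True},
]

counts = defaultdict(lambda: {"calls": 0, "failures": 0})
for event in tool_events:
    stats = counts[event["tool_name"]]
    stats["calls"] += 1
    if not event["success"]:
        stats["failures"] += 1

for tool, stats in counts.items():
    success_rate = (stats["calls"] - stats["failures"]) / stats["calls"] * 100
    flag = "  <-- below 99% target" if success_rate < 99 else ""
    print(f"{tool}: {success_rate:.1f}% success over {stats['calls']} calls{flag}")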
How Hamming helps: Hamming traces every step of task execution including tool calls, parameter extraction, and completion confirmation. Production call replay enables rapid diagnosis of failed tasks.
KPI 7: Customer Satisfaction (CSAT)
CSAT measures the percentage of customers rating their interaction positively, typically 4-5 on a 5-point scale. While it's a lagging indicator, CSAT correlates strongly with retention, referrals, and revenue—making it essential for understanding the human impact of voice agent performance.
Definition & Calculation
Formula:
CSAT = (Positive ratings (4-5 stars) / Total survey responses) × 100
What it measures: User perception of interaction quality, encompassing accuracy, speed, helpfulness, and overall experience.
Industry Benchmarks
| Level | CSAT Range | Interpretation |
|---|---|---|
| World-class | >85% | Exceptional experience |
| Good | 75-84% | Strong performance |
| Acceptable | 65-74% | Room for improvement |
| Poor | <65% | Significant issues |
Critical insight: The worst 4% of calls drove close to 40% of complaints and early hangups. Outlier detection matters more than averages.
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | <70% | 2 weeks | Analyze low-CSAT call patterns |
| Critical | <65% | 1 week | Immediate experience review |
Common failure modes:
- Latency frustration — Slow responses create poor experience
- Repetition fatigue — Users forced to repeat themselves
- Misunderstanding impact — Intent errors visible to users
- Tone/empathy gaps — Responses feel robotic or dismissive
Leading indicators: Track sentiment trajectory within calls. Negative sentiment shifts mid-conversation predict low CSAT before surveys.
Remediation Strategies
- Correlate CSAT with metrics — Link low CSAT to specific patterns: high latency, repetition, misunderstanding, incomplete resolution
- Monitor user interruptions — Track how often users barge in or speak over the agent—a direct frustration signal
- Analyze sentiment velocity — Rate of sentiment change indicates conversation quality degradation (see the sketch after this list)
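As a minimal sketch, assuming per-turn sentiment scores in the range -1 to 1 are already produced by an upstream sentiment model, sentiment velocity is simply the average change per turn:
# Illustrative per-turn sentiment scores for one call, in [-1, 1]
sentiment_by_turn = [0.4, 0.3, 0.1, -0.2, -0.5]

deltas = [
    later - earlier
    for earlier, later in zip(sentiment_by_turn, sentiment_by_turn[1:])
]
velocity = sum(deltas) / len(deltas)  # average change in sentiment per turn

print(f"Sentiment velocity: {velocity:+.2f} per turn")
if velocity < -0.1:
    print("Negative sentiment trajectory - flag this call for review before CSAT drops")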
How Hamming helps: Hamming's speech-level analysis detects caller frustration, sentiment shifts, emotional cues, pauses, interruptions, and tone changes—evaluating how callers said things, not just what they said.
Related: Voice Agent Analytics to Improve CSAT for optimization strategies.
KPI 8: Call Containment Rate
Containment rate measures the percentage of calls handled entirely by the AI agent without requiring transfer to a human. It's an essential indicator of automation effectiveness and directly impacts operational costs—but must be balanced against quality to avoid "containing" calls by frustrating customers into giving up.
Definition & Calculation
Formula:
Containment Rate = (Calls handled entirely by AI / Total inbound calls) × 100
What it measures: Automation effectiveness—how much human labor is being offset by the voice agent.
Industry Benchmarks
| Level | Range | Context |
|---|---|---|
| Excellent | >80% | Simple, well-defined use cases |
| Good | 70-80% | Standard customer service |
| Acceptable | 60-70% | Complex queries, new deployments |
| Poor | <60% | Significant capability gaps |
Economic impact: McKinsey research found AI automation enables companies to reduce agent headcount by 40-50% while handling 20-30% more calls—but only with quality containment.
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | <60% | 1 hour | Review escalation reasons |
| Critical | <50% | 30 min | Investigate knowledge gaps, routing issues |
Common failure modes:
- Knowledge gaps — Agent lacks information for query categories
- Complex query handling — Multi-part or ambiguous requests exceed capability
- Authentication failures — Unable to verify user identity
- User preference — Customer explicitly requests human agent
- False containment — Bot "contains" by frustrating customers into giving up
Leading indicators: Track escalation reasons by category. A spike in "user requested human" suggests capability gaps or frustration.
Remediation Strategies
- Analyze escalation transcripts — Identify knowledge gaps and out-of-scope queries requiring agent expansion
- Segment by query type — AI might excel at appointment confirmations but struggle with complex product inquiries
- Balance containment with quality — High containment with low CSAT indicates false containment (see the sketch after this list)
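As a rough sketch of that containment-versus-quality check, using the benchmark thresholds from this guide and illustrative call records:
# Illustrative per-call records: whether the AI handled the call end to end,
# and the CSAT rating (1-5) if the caller answered the survey
calls = [
    {"contained": True, "csat": 2},
    {"contained": True, "csat": 3},
    {"contained": True, "csat": None},
    {"contained": False, "csat": 5},
]

containment = sum(c["contained"] for c in calls) / len(calls) * 100
rated = [c["csat"] for c in calls if c["csat"] is not None]
csat = sum(1 for rating in rated if rating >= 4) / len(rated) * 100

print(f"Containment: {containment:.0f}%, CSAT: {csat:.0f}%")
# High containment paired with low CSAT suggests false containment:
# callers giving up rather than being helped.
if containment >= 70 and csat < 65:
    print("Warning: possible false containment - review contained calls with low ratings")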
How Hamming helps: Hamming tracks escalation frequency, reasons, and correlates with other metrics to distinguish quality containment from frustrated abandonment.
KPI 9: Prompt Compliance Rate
Prompt compliance measures how frequently the agent follows specific instructions in its system prompt—including scope boundaries, safety guardrails, required disclaimers, and escalation triggers. This is critical for regulatory compliance, brand safety, and preventing scope creep.
Definition & Calculation
Formula:
Prompt Compliance = (Responses following instructions / Total responses) × 100
What it measures: Whether the agent executes within defined boundaries, follows safety protocols, and maintains brand voice.
Industry Benchmarks
| Instruction Type | Target | Minimum Acceptable |
|---|---|---|
| Safety/regulatory | >99% | 95% |
| Scope boundaries | >95% | 90% |
| Brand voice/tone | >90% | 85% |
| Required disclosures | >99% | 95% |
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | <90% | 15 min | Review compliance failures |
| Critical | <85% | 5 min | Immediate prompt review, potential pause |
Common failure modes:
- Scope creep — Progressively going beyond permitted topics
- Safety guardrail bypass — Ignoring safety precautions in prompts
- Disclosure omission — Failing to provide required legal disclaimers
- Brand voice drift — Tone shifting away from guidelines
Leading indicators: Track compliance by prompt section to identify which specific instructions agents consistently violate.
Remediation Strategies
- Implement automated assertion checks — Test against prohibited topics, required disclaimers, escalation triggers, data handling policies (a sketch follows this list)
- Version-aware monitoring — Prompt updates shift behavior baselines. Track compliance by prompt version.
- Segment by instruction category — Safety-critical instructions need higher compliance thresholds than stylistic guidelines
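As a minimal sketch of rule-based assertion checks (the disclosure text and prohibited phrases are placeholders, not real policy requirements), each call is checked against required and prohibited patterns:
import re

# Placeholder policy: one required disclosure and a few prohibited content patterns
REQUIRED_DISCLAIMER = re.compile(r"this call may be recorded", re.IGNORECASE)
PROHIBITED = [
    re.compile(r"\bguaranteed returns\b", re.IGNORECASE),
    re.compile(r"\bmedical diagnosis\b", re.IGNORECASE),
]

def check_compliance(greeting: str, responses: list[str]) -> list[str]:
    """Return the list of violated assertions for one call."""
    violations = []
    if not REQUIRED_DISCLAIMER.search(greeting):
        violations.append("missing recording disclosure in greeting")
    for i, text in enumerate(responses):
        for pattern in PROHIBITED:
            if pattern.search(text):
                violations.append(f"prohibited phrase '{pattern.pattern}' in turn {i}")
    return violations

violations = check_compliance(
    greeting="Hi, thanks for calling.",
    responses=["We offer guaranteed returns on every plan."],
)
print("non-compliant" if violations else "compliant", violations)
Compliance rate is then the share of calls (or responses) with an empty violation list.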
How Hamming helps: Hamming includes 50+ built-in metrics including compliance scorers, plus unlimited custom assertions you define. Automated testing validates prompt compliance across thousands of scenarios.
KPI 10: Context Retention Accuracy
Context retention measures the agent's ability to retain and use relevant information across conversation turns. When users say "I already told you that," it's a context retention failure. This metric is critical for multi-turn conversations requiring information recall and reasoning.
Definition & Calculation
Formula:
Context Retention = (Correct contextual references / Total contextual reference opportunities) × 100
What it measures: Whether the agent remembers and correctly uses information provided earlier in the conversation.
Industry Benchmarks
| Scenario | Target | Context |
|---|---|---|
| Same-call retention | >95% | Information from current call |
| Multi-turn reasoning | >90% | Complex tasks requiring recall |
| Session persistence | >85% | Information across call transfers |
Alert Thresholds & Failure Modes
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| Warning | <85% | 15 min | Review context loss patterns |
| Critical | <80% | 10 min | Investigate session management, prompts |
Common failure modes:
- Short context windows — LLM context limit exceeded, early information lost
- Session state issues — Context not properly persisted across turns
- Memory retrieval errors — RAG system fails to retrieve relevant history
- Prompt engineering gaps — Instructions don't emphasize context preservation
Leading indicators: Track user repetition patterns—users repeating information signals agent failed to retain context.
Remediation Strategies
- Test multi-turn scenarios — Create scripted conversations requiring information recall across 5+ exchanges (see the sketch after this list)
- Monitor repetition requests — "What was that again?" or users re-stating information indicates failures
- Optimize conversation memory — Tune sliding window size, summarization strategy, and RAG integration
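As a rough sketch of a scripted retention test, built around a hypothetical agent.respond(text) interface rather than any real SDK, the check verifies that a value given early in the call is reused later instead of being asked for again:
def context_retention_check(agent) -> bool:
    """Scripted multi-turn test: the date of birth given in turn 1
    should not be requested again in later turns."""
    agent.respond("Hi, I'd like to reschedule my appointment. My date of birth is March 3rd, 1985.")
    agent.respond("Can you move it to next Tuesday?")
    reply = agent.respond("Morning works best for me.")
    # If the agent asks for the date of birth again, context was lost
    return "date of birth" not in reply.lower()

class ScriptedAgent:
    """Stand-in agent for the sketch; a real test would call the deployed agent."""
    def __init__(self, replies):
        self._replies = iter(replies)
    def respond(self, text: str) -> str:
        return next(self._replies)

agent = ScriptedAgent([
    "Thanks, I have your date of birth on file.",
    "Sure, checking Tuesday availability.",
    "What is your date of birth?",   # retention failure
])
print("context retained:", context_retention_check(agent))  # False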
How Hamming helps: Hamming tracks context management as a core metric, identifying conversations where context loss occurs and correlating with user frustration signals.
How to Instrument Voice Agent KPIs
Effective KPI monitoring requires comprehensive instrumentation capturing events at every stage of the voice agent pipeline. Here's the event collection framework used by production voice agents monitored by Hamming.
Core Events to Capture
| Event | Required Fields | Purpose |
|---|---|---|
| call.started | call_id, timestamp, caller_metadata, agent_version | Session initialization |
| turn.user | transcript, confidence, audio_duration, timestamp | ASR quality tracking |
| turn.agent | text, latency_breakdown, tool_calls, intent | Execution tracking |
| intent.classified | intent, confidence, alternatives, method | Intent accuracy |
| tool.called | tool_name, params, result, latency, success | Integration health |
| call.ended | outcome, duration, metrics_summary, disposition | Outcome tracking |
Event Schema Example
{
  "event": "turn.agent",
  "timestamp": "2025-01-20T10:30:00Z",
  "call_id": "call_abc123",
  "turn_index": 3,
  "latency_ms": {
    "stt": 150,
    "llm": 420,
    "tts": 180,
    "total": 750
  },
  "text": "I can help you check your account balance.",
  "intent": {
    "classified": "account_inquiry",
    "confidence": 0.94,
    "alternatives": [
      {"intent": "balance_check", "confidence": 0.89}
    ]
  },
  "tool_calls": [
    {
      "tool": "get_balance",
      "params": {"account_id": "12345"},
      "result": {"balance": 1500.00},
      "latency_ms": 180,
      "success": true
    }
  ]
}
OpenTelemetry Integration
Hamming natively ingests OpenTelemetry traces, spans, and logs for unified observability:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Configure Hamming as the trace destination
hamming_exporter = OTLPSpanExporter(
    endpoint="https://otel.hamming.ai",
    headers={"x-api-key": "your-api-key"}
)

# Register the exporter so spans are actually shipped
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(hamming_exporter))
trace.set_tracer_provider(provider)

# Instrument voice agent components
tracer = trace.get_tracer("voice-agent")
with tracer.start_as_current_span("process_turn") as span:
    span.set_attribute("turn.index", turn_index)
    span.set_attribute("intent.classified", intent)
    span.set_attribute("latency.total_ms", total_latency)
Component-Level Latency Tracking
Break down end-to-end latency into component spans:
User speaks
│
▼ (40ms network)
┌─────────────┐
│ STT Service │ ←── Span: stt.transcribe (150ms)
└─────────────┘
│
▼
┌─────────────┐
│ LLM Service │ ←── Span: llm.generate (420ms)
└─────────────┘
│
▼
┌─────────────┐
│ TTS Service │ ←── Span: tts.synthesize (180ms)
└─────────────┘
│
▼ (30ms network)
User hears response
Total E2E: 820ms
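Building on the OpenTelemetry setup above, each pipeline stage can be wrapped in its own child span so this per-component breakdown appears on every trace; the transcribe, generate, and synthesize stubs below are placeholders for real STT, LLM, and TTS clients:
import time
from opentelemetry import trace

tracer = trace.get_tracer("voice-agent")

# Stub components for the sketch; swap in your real STT, LLM, and TTS clients
def transcribe(audio: bytes) -> str:
    time.sleep(0.15)
    return "I need to check my account balance"

def generate(transcript: str) -> str:
    time.sleep(0.42)
    return "Your balance is $1,500."

def synthesize(text: str) -> bytes:
    time.sleep(0.18)
    return b"audio"

def process_turn(audio: bytes, turn_index: int) -> bytes:
    """One agent turn with a child span per pipeline component."""
    with tracer.start_as_current_span("process_turn") as turn_span:
        turn_span.set_attribute("turn.index", turn_index)
        with tracer.start_as_current_span("stt.transcribe"):
            transcript = transcribe(audio)
        with tracer.start_as_current_span("llm.generate"):
            reply = generate(transcript)
        with tracer.start_as_current_span("tts.synthesize"):
            return synthesize(reply)

process_turn(b"caller-audio", turn_index=3)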
Dashboard Design: The 4-Layer Framework
Effective voice agent dashboards organize metrics into four layers, enabling both executive overview and operational drill-down.
Executive View (One Glance)
┌─────────────────────────────────────────────────────────┐
│ Voice Agent Health: 94.2% ✓ │
├───────────────┬───────────────┬─────────────────────────┤
│ Calls Today │ Task Success │ TTFW P90 │
│ 12,847 │ 94.2% │ 720ms │
│ ↑ 8% │ ↓ 0.3% │ ↓ 50ms │
├───────────────┴───────────────┴─────────────────────────┤
│ Active Alerts: 1 (P2) │
│ ⚠️ Intent accuracy below baseline in "billing" flow │
└─────────────────────────────────────────────────────────┘
Layer 1: Infrastructure Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| Audio Quality (MOS) | >4.0 | <3.5 |
| Packet Loss | <0.1% | >1% |
| Concurrent Calls | <80% capacity | >90% |
| Call Setup Time | <2s | >5s |
Layer 2: Execution Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| Intent Accuracy | >95% | <90% |
| WER | <8% | >12% |
| LLM Latency (P90) | <800ms | >1500ms |
| Tool Call Success | >99% | <95% |
Layer 3: User Reaction Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| User Interruptions | <10% | >20% |
| Retry Rate | <5% | >15% |
| Sentiment Trajectory | Positive/Stable | Negative trend |
| Abandonment Rate | <8% | >15% |
Layer 4: Outcome Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| Task Completion | >85% | <70% |
| FCR | >75% | <60% |
| Containment Rate | >70% | <50% |
| CSAT | >75% | <65% |
Related: Anatomy of a Perfect Voice Agent Analytics Dashboard for detailed dashboard design.
Alerting Playbook: Detection to Remediation
Severity Levels
| Level | Response Time | Channel | Criteria |
|---|---|---|---|
| P0: Critical | <5 min | PagerDuty | Revenue impact, system down |
| P1: High | <15 min | Slack urgent | Customer impact, major degradation |
| P2: Medium | <1 hour | Slack | Performance degradation |
| P3: Low | <4 hours | | Trend warnings |
Alert Configuration Template
name: TTFW Degradation
condition: ttfw_p90 > 1000ms
duration: 5 minutes
severity: P1
channels:
- slack://voice-alerts
- pagerduty://voice-team
context:
- current_value
- baseline_value
- sample_calls
- dashboard_link
runbook: /docs/runbooks/ttfw-degradation
cooldown: 30 minutes
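As a rough sketch of how a rule like this might be evaluated (a production alerter would also handle cooldowns, deduplication, and notification delivery), the condition has to hold across the whole duration window before the alert fires; the samples below are illustrative:
from datetime import datetime, timedelta

# Illustrative (timestamp, ttfw_p90_ms) samples, one per minute
samples = [
    (datetime(2025, 1, 20, 10, 0), 950),
    (datetime(2025, 1, 20, 10, 1), 1020),
    (datetime(2025, 1, 20, 10, 2), 1100),
    (datetime(2025, 1, 20, 10, 3), 1080),
    (datetime(2025, 1, 20, 10, 4), 1150),
    (datetime(2025, 1, 20, 10, 5), 1210),
]

THRESHOLD_MS = 1000
DURATION = timedelta(minutes=5)

def should_fire(samples, threshold, duration):
    """Fire only if every sample in the trailing window breaches the threshold."""
    latest = samples[-1][0]
    window = [value for ts, value in samples if latest - ts < duration]
    return len(window) > 0 and all(value > threshold for value in window)

if should_fire(samples, THRESHOLD_MS, DURATION):
    print("P1 alert: TTFW P90 above 1000ms for 5 minutes")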
Recommended Initial Alerts
Start with these four high-impact alerts:
- TTFW P90 >1000ms (duration: 5 min) → P1
- Task completion <80% (duration: 15 min) → P1
- Intent accuracy <90% (duration: 10 min) → P1
- WER >12% (duration: 10 min) → P2
Four-Step Remediation Workflow
- Detection: ML-based anomaly detection + rule-based validation
- Alert: Instant notification with metadata (affected agent, timestamp, sample calls)
- Diagnosis: Drill down from metrics to transcripts and audio
- Remediation: Coordinate fix deployment, monitor recovery
How Hamming helps: Detection begins with ML-based anomaly detection and rule-based validation. When anomalies are detected, alerts push instantly to Slack with metadata. Engineers can drill down from high-level metrics to individual transcripts and audio for rapid diagnosis.
Platform Comparison: Hamming vs Alternatives
| Capability | Hamming | Braintrust | Roark | Datadog |
|---|---|---|---|---|
| Voice-native KPIs | ✅ 50+ built-in | ⚠️ Limited | ✅ Yes | ❌ No |
| Real-time alerting | ✅ ML + rules | ⚠️ Manual | ✅ Yes | ✅ Infra only |
| Call replay | ✅ One-click | ⚠️ Manual | ✅ Yes | ❌ No |
| OpenTelemetry | ✅ Native | ✅ Yes | ⚠️ Limited | ✅ Yes |
| Intent accuracy | ✅ Automated | ⚠️ Manual | ✅ Yes | ❌ No |
| Sentiment analysis | ✅ Speech-level | ❌ No | ⚠️ Basic | ❌ No |
| Automated testing | ✅ 1000+ concurrent | ⚠️ Limited | ⚠️ Limited | ❌ No |
| Prompt compliance | ✅ Built-in | ⚠️ Custom | ⚠️ Limited | ❌ No |
When to use Hamming: Production voice agent monitoring with comprehensive KPI coverage, automated testing, and speech-level analysis. Ideal for teams needing unified testing + monitoring in one platform.
Related Guides
- Voice Agent Monitoring Platform Guide — 4-Layer Monitoring Stack
- How to Evaluate Voice Agents — VOICE Framework
- Voice Agent Observability Tracing Guide — OpenTelemetry integration
- Anatomy of a Perfect Voice Agent Analytics Dashboard — Dashboard design
- Intent Recognition at Scale — Intent testing methodology

