Voice Agent Monitoring KPIs: 10 Production Metrics, Dashboards & Alerting Guide

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 1M+ voice agent calls to find where they break.

January 19, 2026 · Updated January 19, 2026 · 25 min read

A voice agent team at a healthcare company watched their Datadog dashboard show all-green metrics for weeks. Server uptime: 99.9%. API latency: under 200ms. Error rate: 0.1%.

But their CSAT scores were plummeting. Customers were calling back frustrated. Escalations to human agents doubled.

What was happening?

They were monitoring infrastructure, not conversation quality.

Their voice agent's First Call Resolution had dropped from 75% to 58%. Intent accuracy had drifted from 95% to 87%. Users were saying "I already told you that" in 23% of calls. None of this appeared in their dashboards.

According to Hamming's analysis of 1M+ production voice agent calls, the 10 KPIs that actually predict voice agent failure are completely different from traditional APM metrics. This guide defines each one—with calculation formulas, industry benchmarks, alert thresholds, and remediation strategies.

TL;DR: Monitor voice agents using Hamming's 10 Critical Production KPIs:

Outcome KPIs: First Call Resolution (>75%), Task Completion (>85%), Containment Rate (>70%), CSAT (>75%)

Execution KPIs: Intent Accuracy (>95%), End-to-End Latency P95 (<800ms), WER (<8%), Prompt Compliance (>95%)

Experience KPIs: Context Retention (>90%), AHT (balanced with quality)

Generic APM tools miss 60% of voice-specific failures. Set alerts on P90/P95 percentiles, not averages. Use the 4-Layer Dashboard Framework (Infrastructure → Execution → User Reaction → Outcome) for complete visibility.

Methodology Note: The KPI definitions, benchmarks, and alert thresholds in this guide are derived from Hamming's analysis of 1M+ production voice agent calls across 50+ deployments (2024-2025). Benchmarks represent median performance across healthcare, financial services, e-commerce, and customer support verticals. Thresholds may vary by use case complexity and user expectations.

Last Updated: January 2026

What Are Voice Agent KPIs?

Voice agent KPIs are quantitative metrics that measure the performance, quality, and business impact of AI-powered voice agents in production. Unlike traditional contact center metrics that focus on operational efficiency, voice agent KPIs must capture AI-specific quality dimensions: speech recognition accuracy, intent classification, response latency, prompt compliance, and multi-turn reasoning.

Why voice agents need specialized KPIs:

| Traditional Contact Center | Voice Agent Specific |
| --- | --- |
| Average Handle Time (AHT) | Time to First Word (TTFW) |
| Abandonment Rate | Intent Accuracy |
| Queue Wait Time | Word Error Rate (WER) |
| Agent Utilization | Prompt Compliance |
| Service Level | Context Retention |

The 10 KPIs in this guide span four layers of Hamming's monitoring framework:

  1. Infrastructure Layer: Audio quality, network reliability, system capacity
  2. Execution Layer: ASR accuracy, LLM latency, intent classification, tool calls
  3. User Reaction Layer: Sentiment, interruptions, retry patterns, frustration signals
  4. Outcome Layer: Task completion, containment, FCR, CSAT, business value

The 10 Critical Voice Agent Production KPIs

This master scorecard provides a reference for all 10 KPIs. Each is detailed in subsequent sections with formulas, benchmarks, alert configurations, and remediation strategies.

| KPI | Definition | Formula | Target | Warning | Critical |
| --- | --- | --- | --- | --- | --- |
| 1. First Call Resolution | % of issues resolved without follow-up | (resolved calls / total calls) × 100 | >75% | <70% | <60% |
| 2. Average Handle Time | Mean conversation duration | (talk time + wrap-up) / total calls | 4-6 min | >2x baseline | >3x baseline |
| 3. Intent Accuracy | % of correctly classified intents | (correct / total) × 100 | >95% | <92% | <90% |
| 4. E2E Latency (P95) | 95th percentile response time | STT + LLM + TTS | <800ms | >1200ms | >1500ms |
| 5. Word Error Rate | ASR transcription accuracy | (S + I + D) / words × 100 | <8% | >12% | >15% |
| 6. Task Completion | % completing business goal | (completed / attempts) × 100 | >85% | <75% | <70% |
| 7. CSAT | % positive ratings (4-5 stars) | (positive / responses) × 100 | >75% | <70% | <65% |
| 8. Containment Rate | % handled without transfer | (AI-resolved / total) × 100 | >70% | <60% | <50% |
| 9. Prompt Compliance | % of instructions followed | (compliant / total) × 100 | >95% | <90% | <85% |
| 10. Context Retention | % of contextual refs correct | (correct refs / total) × 100 | >90% | <85% | <80% |

KPI 1: First Call Resolution (FCR)

First Call Resolution measures the percentage of customer issues resolved during the initial interaction without requiring follow-up calls, transfers, or escalations. FCR is the ultimate test of voice agent effectiveness—it indicates whether the agent understood the request, had the knowledge to address it, and executed the resolution correctly.

Definition & Calculation

Formula:

FCR = (Calls resolved without transfer or callback / Total calls) × 100

What it measures: The agent's ability to completely resolve customer needs in a single interaction. This requires accurate understanding, comprehensive knowledge base integration, and proper dialog flow execution.
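
As a rough illustration, FCR can be computed directly from call logs. The sketch below assumes each call record carries resolved, transferred, and callback_within_24h flags; these field names are hypothetical, so map them to whatever your logging actually captures.

from dataclasses import dataclass

@dataclass
class CallRecord:
    # Hypothetical fields; map these to what your call logs actually store.
    resolved: bool
    transferred: bool
    callback_within_24h: bool

def first_call_resolution(calls: list[CallRecord]) -> float:
    """FCR = calls resolved without transfer or callback / total calls × 100."""
    if not calls:
        return 0.0
    resolved_first_time = sum(
        1 for c in calls
        if c.resolved and not c.transferred and not c.callback_within_24h
    )
    return resolved_first_time / len(calls) * 100

calls = [
    CallRecord(resolved=True, transferred=False, callback_within_24h=False),
    CallRecord(resolved=True, transferred=True, callback_within_24h=False),
    CallRecord(resolved=False, transferred=True, callback_within_24h=False),
    CallRecord(resolved=True, transferred=False, callback_within_24h=True),
]
print(f"FCR: {first_call_resolution(calls):.1f}%")  # FCR: 25.0%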

Industry Benchmarks

| Level | Range | Interpretation |
| --- | --- | --- |
| Excellent | >80% | World-class resolution capability |
| Good | 75-80% | Strong performance, minor optimization needed |
| Acceptable | 65-75% | Adequate but significant improvement opportunity |
| Poor | <65% | Systemic issues requiring immediate attention |

Industry examples:

  • Healthcare provider achieved 75% FCR handling appointment reminders and prescription queries
  • Financial services targeting 80%+ FCR for account balance and transaction inquiries
  • E-commerce achieving 70-75% FCR for order status and return initiation

Alert Thresholds & Failure Modes

| Severity | Threshold | Duration | Action |
| --- | --- | --- | --- |
| Warning | <70% | 1 hour | Investigate intent accuracy, knowledge gaps |
| Critical | <60% | 30 min | Immediate review of failed resolutions |

Common failure modes:

  1. Knowledge base gaps — Agent lacks information to resolve specific query types
  2. Intent misclassification — Wrong dialog flow activated, unable to complete resolution
  3. Incomplete dialog flows — Flow ends without proper resolution confirmation
  4. Premature transfers — Agent escalates when resolution was possible

Leading indicators: Rising fallback frequency, negative sentiment shifts, and repeated intent reclassification signal FCR problems before the metric drops.

Remediation Strategies

  1. Analyze failed resolution transcripts — Identify patterns in queries that fail to resolve. Look for knowledge gaps, missing intents, and dialog breakdowns.

  2. Segment FCR by query type — Track which categories underperform. Intent X might have 90% FCR while Intent Y has 45%, revealing specific training needs.

  3. Monitor callback patterns — Users calling back within 24 hours indicate incomplete resolution even if the call was marked "resolved."

How Hamming helps: Hamming automatically tracks FCR across all calls, segments by intent category, and correlates with other metrics to identify root causes. Production call replay enables rapid diagnosis of failed resolutions.

KPI 2: Average Handle Time (AHT)

Average Handle Time measures the mean duration of voice agent conversations from greeting to completion. Unlike traditional contact centers where lower AHT often equals better efficiency, voice agents must balance speed with quality—rushing conversations degrades CSAT and increases repeat calls.

Definition & Calculation

Formula:

AHT = (Total conversation time + After-call work time) / Total calls handled

What it measures: Overall conversation efficiency. However, optimal AHT varies significantly by use case complexity.

Industry Benchmarks

| Use Case | Target AHT | Context |
| --- | --- | --- |
| Simple FAQs | 1-2 min | Balance inquiries, store hours |
| Account inquiries | 3-5 min | Transaction history, profile updates |
| Complex troubleshooting | 6-10 min | Technical support, multi-step resolution |
| Sales/appointments | 5-8 min | Consultative, relationship-building |

Critical insight: Research shows conversations lasting 4-6 minutes had 67% higher satisfaction than sub-2-minute interactions. The sweet spot is thoroughness, not pure speed.

Alert Thresholds & Failure Modes

| Severity | Threshold | Duration | Action |
| --- | --- | --- | --- |
| Warning | >2x baseline | 15 min | Review for loops, verbose responses |
| Critical | >3x baseline | 10 min | Investigate stuck states, infinite loops |

Common failure modes:

  1. Infinite loops — Agent repeating same questions or responses
  2. Verbose responses — Unnecessarily long explanations that don't add value
  3. Stuck dialog states — Conversation unable to progress to resolution
  4. Excessive clarification — Repeatedly asking users to confirm or repeat

Leading indicators: Track "Longest Monologue" metric—long monologues indicate the agent is failing to provide concise responses or misinterpreting the query.

Remediation Strategies

  1. Monitor turn count distribution — Calls with >15 turns often indicate confusion or loops
  2. Optimize prompt engineering — Reduce verbosity without sacrificing task completion
  3. Identify AHT outliers — Replay calls with AHT >3x average to detect patterns

How Hamming helps: Hamming tracks turn-level metrics including longest monologue, turn count, and silence duration to identify verbose or stuck conversations before they impact aggregate AHT.

KPI 3: Intent Classification Accuracy

Intent accuracy measures the percentage of user utterances correctly mapped to their intended action or query category. This is where voice agents face unique challenges: they have 3-10x higher intent error rates than text-only systems due to ASR error cascade effects.

Definition & Calculation

Formula:

Intent Accuracy = (Correctly classified intents / Total classification attempts) × 100

What it measures: The NLU system's ability to understand what users want, accounting for ASR errors, accent variations, and natural language variability.
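
A minimal sketch of the calculation, plus the confusion-pair breakdown recommended in the remediation steps below. It assumes you have labeled (expected, predicted) intent pairs, for example from a held-out evaluation set; the intent names are illustrative.

from collections import Counter

def intent_accuracy(pairs: list[tuple[str, str]]) -> float:
    """pairs = [(expected_intent, predicted_intent), ...]"""
    if not pairs:
        return 0.0
    correct = sum(1 for expected, predicted in pairs if expected == predicted)
    return correct / len(pairs) * 100

def top_confusions(pairs: list[tuple[str, str]], n: int = 5):
    """Most frequent (expected -> predicted) misclassifications."""
    confusions = Counter(
        (expected, predicted) for expected, predicted in pairs if expected != predicted
    )
    return confusions.most_common(n)

pairs = [
    ("account_inquiry", "account_inquiry"),
    ("balance_check", "account_inquiry"),
    ("balance_check", "balance_check"),
    ("cancel_order", "return_order"),
]
print(f"Intent accuracy: {intent_accuracy(pairs):.1f}%")  # 50.0%
print("Top confusions:", top_confusions(pairs))
# [(('balance_check', 'account_inquiry'), 1), (('cancel_order', 'return_order'), 1)]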

Industry Benchmarks

| Level | Range | Production Readiness |
| --- | --- | --- |
| Excellent | >98% | Required for critical domains (healthcare, banking) |
| Good | 95-98% | Acceptable for most production use cases |
| Acceptable | 90-95% | Requires human fallback for edge cases |
| Poor | <90% | Not production ready |

Scale impact: At 10,000 calls/day with 6.9% error rate, 690 users daily experience intent misclassification. At enterprise scale, even small accuracy improvements have massive impact.

Alert Thresholds & Failure Modes

| Severity | Threshold | Duration | Action |
| --- | --- | --- | --- |
| Warning | <92% | 15 min | Review confusion matrix by intent |
| Critical | <90% | 10 min | Immediate investigation, consider fallback |

Common failure modes:

  1. ASR cascade — Transcription errors cause downstream intent failures
  2. Intent confusion — Similar intents frequently misclassified for each other
  3. Model drift — Real-world inputs differ from training data over time
  4. Provider API changes — STT or NLU API updates silently degrade performance

Leading indicators: Confidence decay across conversation turns, repeated intent reclassification, and rising fallback frequency signal issues before aggregate accuracy drops.

Remediation Strategies

  1. Build intent utterance matrix — Test with 10K+ utterances, not 50. Include variations: formal, casual, accented, noisy.

  2. Track confusion patterns — Identify which specific intents confuse each other, not just aggregate accuracy.

  3. Monitor confidence scores — Flag low-confidence predictions for human review and retraining.

  4. Version-aware tracking — Prompt updates shift behavior baselines. Track performance by prompt version.

How Hamming helps: Hamming's automated evaluation runs intent utterance matrices across accent variations and noise conditions, tracking confusion patterns and confidence distributions at scale.

Related: Intent Recognition at Scale for detailed testing methodology.

KPI 4: End-to-End Latency (P95)

End-to-end latency measures the complete response time from when a user stops speaking until they hear the agent's first word—the "Mouth-to-Ear Turn Gap." This directly determines whether conversations feel natural or robotic. Humans expect responses within 300-500 milliseconds; exceeding this threshold makes conversations feel stilted.

Definition & Calculation

Formula:

E2E Latency = Audio Transmission + STT + LLM + TTS + Audio Playback

Typical breakdown:
- Audio transmission: 40ms
- Buffering/decoding: 55ms
- STT processing: 150-350ms
- LLM generation: 200-800ms
- TTS synthesis: 100-200ms
- Audio playback: 30ms

What it measures: The perceived responsiveness of the voice agent. Track P50, P90, P95, and P99 percentiles—averages hide critical outliers.

Industry Benchmarks

| Percentile | Target | Warning | Critical |
| --- | --- | --- | --- |
| P50 | <600ms | >800ms | >1000ms |
| P90 | <800ms | >1200ms | >1500ms |
| P95 | <1000ms | >1500ms | >2000ms |
| P99 | <1500ms | >2000ms | >3000ms |

Why percentiles matter: An average of 500ms might hide that 10% of users experience delays over 2 seconds. Those outliers drive complaints and abandonment.
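
To make the point concrete, here is a small sketch (using numpy, with made-up latency samples) showing how the P95/P99 tail exposes slow turns that the mean hides:

import numpy as np

# Per-turn end-to-end latencies in ms (illustrative values, including one slow outlier).
latencies_ms = np.array([450, 480, 510, 520, 540, 560, 600, 650, 700, 2400])

print(f"Mean: {latencies_ms.mean():.0f} ms")  # ~741 ms, looks acceptable
for p in (50, 90, 95, 99):
    # The tail tells a different story: P95 is roughly 1.6s, P99 above 2.2s.
    print(f"P{p}: {np.percentile(latencies_ms, p):.0f} ms")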

Alert Thresholds & Failure Modes

| Severity | Threshold | Duration | Action |
| --- | --- | --- | --- |
| Warning | P95 >1200ms | 5 min | Investigate component latencies |
| Critical | P95 >1500ms | 3 min | Immediate escalation, check provider status |

Common failure modes:

  1. LLM slowdown — Provider degradation or model overload
  2. Cold starts — First request takes 3-5x longer due to model loading
  3. Network variability — Mobile networks add 50-200ms vs broadband
  4. Geographic distance — Transcontinental calls add 100-300ms

Leading indicators: Monitor component-level TTFB (Time to First Byte). If any component exceeds 2x baseline, investigate immediately.

Remediation Strategies

  1. Decompose latency by component — Identify whether STT, LLM, or TTS is the bottleneck

  2. Implement streaming — Stream STT, LLM responses, and TTS playback to reduce perceived latency

  3. Geographic optimization — Deploy services closer to users via edge computing

  4. Model selection tradeoffs — Use faster models for simple queries, reserve larger models for complex requests

How Hamming helps: Hamming captures per-component latency traces for every call, enabling instant drill-down from aggregate P95 to specific bottleneck identification.

Related: How to Optimize Latency in Voice Agents for optimization strategies.

KPI 5: Word Error Rate (WER)

Word Error Rate is the gold standard for measuring ASR transcription accuracy. WER quantifies how accurately the speech recognition system converts spoken words to text by comparing ASR output to reference transcripts. However, WER must be evaluated in context—a 10% WER might still achieve 95% intent accuracy if errors don't affect understanding.

Definition & Calculation

Formula:

WER = (Substitutions + Insertions + Deletions) / Total words in reference × 100

Example:
Reference: "I need to check my account balance"
ASR output: "I need to check my count balance"

Substitutions: 1 ("account" → "count")
Insertions: 0
Deletions: 0
Total words: 7

WER = (1 + 0 + 0) / 7 × 100 = 14.3%

What it measures: Raw transcription accuracy. Important caveat: WER doesn't measure whether tasks were completed or the agent responded appropriately.
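
For reference, a minimal word-level Levenshtein implementation of the formula above reproduces the 14.3% figure; production pipelines typically use a library such as jiwer, which computes the same quantity.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words × 100,
    computed with word-level edit distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref) * 100

print(f"{word_error_rate('I need to check my account balance', 'I need to check my count balance'):.1f}%")
# 14.3%  (1 substitution over 7 reference words)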

Industry Benchmarks

| Level | WER Range | Context |
| --- | --- | --- |
| Excellent | <5% | Clean audio, native speakers |
| Good | 5-8% | Production standard for English |
| Acceptable | 8-12% | Background noise, accents present |
| Poor | >12% | Requires STT optimization |

Language variance: English typically achieves <8% WER while Hindi may reach 18-22% WER. Set language-specific thresholds.

Alert Thresholds & Failure Modes

| Severity | Threshold | Duration | Action |
| --- | --- | --- | --- |
| Warning | >12% | 10 min | Review audio quality, check ASR provider |
| Critical | >15% | 5 min | Fallback strategies, immediate investigation |

Common failure modes:

  1. Background noise — Airport, café, traffic environments
  2. Audio quality degradation — Low bitrate, packet loss, codec issues
  3. Accent variation — Non-native speakers, regional dialects
  4. Domain vocabulary — Medical terms, product names, proper nouns

Leading indicators: If STT accuracy drops suddenly, the speech recognition service might have issues. Track WER by user segment to identify affected populations.

Remediation Strategies

  1. Test across conditions — Evaluate with varied noise (airport, café, traffic), device types (phones, smart speakers), and network conditions

  2. Monitor downstream effects — WER should be tested in context. Evaluate ASR accuracy alongside intent success, repetition rates, and recovery success.

  3. Implement fallback strategies — Request repetition, confirm low-confidence transcriptions, offer DTMF input for critical data

How Hamming helps: Hamming evaluates ASR accuracy alongside downstream effects like intent success, repetition, and recovery across accents and noise conditions—not just isolated WER scores.

Related: ASR Accuracy Evaluation for testing methodology.

KPI 6: Task Completion Rate

Task completion measures the percentage of calls where the agent successfully executes the intended business goal—booking an appointment, completing a purchase, resolving an inquiry, or processing a request. This is the ultimate outcome metric that connects agent performance to business value.

Definition & Calculation

Formula:

Task Completion = (Calls with successful task completion / Total calls with task intent) × 100

What it measures: Whether the voice agent actually accomplishes what users need, not just whether it responded appropriately.

Industry Benchmarks

| Complexity | Target | Examples |
| --- | --- | --- |
| Simple tasks | >90% | Balance check, store hours |
| Moderate complexity | 75-85% | Appointment booking, order status |
| Complex workflows | 60-75% | Multi-step troubleshooting, claims |

Alert Thresholds & Failure Modes

| Severity | Threshold | Duration | Action |
| --- | --- | --- | --- |
| Warning | <75% | 15 min | Review failed task transcripts |
| Critical | <70% | 10 min | Immediate investigation, check integrations |

Common failure modes:

  1. Tool invocation errors — API calls fail or return unexpected results
  2. Parameter extraction failures — Agent extracts wrong values from conversation
  3. Context loss mid-task — Agent forgets information needed to complete task
  4. Premature termination — Conversation ends before task is confirmed complete
  5. Hallucinated tool calls — LLM claims it called a tool but didn't actually invoke it

Leading indicators: Monitor tool call success rates and parameter accuracy to detect integration issues before they impact task completion.
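
A minimal sketch of that leading indicator, assuming tool.called events shaped like the instrumentation schema later in this guide (tool_name and success fields):

from collections import defaultdict

def tool_success_rates(events: list[dict]) -> dict[str, float]:
    """Per-tool success rate (%) from tool.called events with 'tool_name' and 'success' fields."""
    attempts, successes = defaultdict(int), defaultdict(int)
    for e in events:
        attempts[e["tool_name"]] += 1
        successes[e["tool_name"]] += int(bool(e["success"]))
    return {tool: successes[tool] / attempts[tool] * 100 for tool in attempts}

events = [
    {"tool_name": "get_balance", "success": True},
    {"tool_name": "get_balance", "success": True},
    {"tool_name": "book_appointment", "success": False},
    {"tool_name": "book_appointment", "success": True},
]
print(tool_success_rates(events))  # {'get_balance': 100.0, 'book_appointment': 50.0}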

Remediation Strategies

  1. Trace every interaction — Capture STT output, intent classification, tool calls, response generation, and TTS input for every call

  2. Segment by task type — Identify which specific workflows underperform and prioritize fixes

  3. Monitor tool call accuracy — Track whether tools are called with correct parameters and return expected results

How Hamming helps: Hamming traces every step of task execution including tool calls, parameter extraction, and completion confirmation. Production call replay enables rapid diagnosis of failed tasks.

KPI 7: Customer Satisfaction (CSAT)

CSAT measures the percentage of customers rating their interaction positively, typically 4-5 on a 5-point scale. While it's a lagging indicator, CSAT correlates strongly with retention, referrals, and revenue—making it essential for understanding the human impact of voice agent performance.

Definition & Calculation

Formula:

CSAT = (Positive ratings (4-5 stars) / Total survey responses) × 100

What it measures: User perception of interaction quality, encompassing accuracy, speed, helpfulness, and overall experience.

Industry Benchmarks

| Level | CSAT Range | Interpretation |
| --- | --- | --- |
| World-class | >85% | Exceptional experience |
| Good | 75-84% | Strong performance |
| Acceptable | 65-74% | Room for improvement |
| Poor | <65% | Significant issues |

Critical insight: The worst-performing 4% of calls drove close to 40% of complaints and early hang-ups. Outlier detection matters more than averages.

Alert Thresholds & Failure Modes

| Severity | Threshold | Duration | Action |
| --- | --- | --- | --- |
| Warning | <70% | 2 weeks | Analyze low-CSAT call patterns |
| Critical | <65% | 1 week | Immediate experience review |

Common failure modes:

  1. Latency frustration — Slow responses create poor experience
  2. Repetition fatigue — Users forced to repeat themselves
  3. Misunderstanding impact — Intent errors visible to users
  4. Tone/empathy gaps — Responses feel robotic or dismissive

Leading indicators: Track sentiment trajectory within calls. Negative sentiment shifts mid-conversation predict low CSAT before surveys.

Remediation Strategies

  1. Correlate CSAT with metrics — Link low CSAT to specific patterns: high latency, repetition, misunderstanding, incomplete resolution

  2. Monitor user interruptions — Track how often users barge in or speak over the agent—a direct frustration signal

  3. Analyze sentiment velocity — Rate of sentiment change indicates conversation quality degradation

How Hamming helps: Hamming's speech-level analysis detects caller frustration, sentiment shifts, emotional cues, pauses, interruptions, and tone changes—evaluating how callers said things, not just what they said.

Related: Voice Agent Analytics to Improve CSAT for optimization strategies.

KPI 8: Call Containment Rate

Containment rate measures the percentage of calls handled entirely by the AI agent without requiring transfer to a human. It's an essential indicator of automation effectiveness and directly impacts operational costs—but must be balanced against quality to avoid "containing" calls by frustrating customers into giving up.

Definition & Calculation

Formula:

Containment Rate = (Calls handled entirely by AI / Total inbound calls) × 100

What it measures: Automation effectiveness—how much human labor is being offset by the voice agent.

Industry Benchmarks

| Level | Range | Context |
| --- | --- | --- |
| Excellent | >80% | Simple, well-defined use cases |
| Good | 70-80% | Standard customer service |
| Acceptable | 60-70% | Complex queries, new deployments |
| Poor | <60% | Significant capability gaps |

Economic impact: McKinsey research found AI automation enables companies to reduce agent headcount by 40-50% while handling 20-30% more calls—but only with quality containment.

Alert Thresholds & Failure Modes

| Severity | Threshold | Duration | Action |
| --- | --- | --- | --- |
| Warning | <60% | 1 hour | Review escalation reasons |
| Critical | <50% | 30 min | Investigate knowledge gaps, routing issues |

Common failure modes:

  1. Knowledge gaps — Agent lacks information for query categories
  2. Complex query handling — Multi-part or ambiguous requests exceed capability
  3. Authentication failures — Unable to verify user identity
  4. User preference — Customer explicitly requests human agent
  5. False containment — Bot "contains" by frustrating customers into giving up

Leading indicators: Track escalation reasons by category. A spike in "user requested human" suggests capability gaps or frustration.

Remediation Strategies

  1. Analyze escalation transcripts — Identify knowledge gaps and out-of-scope queries requiring agent expansion

  2. Segment by query type — AI might excel at appointment confirmations but struggle with complex product inquiries

  3. Balance containment with quality — High containment with low CSAT indicates false containment

How Hamming helps: Hamming tracks escalation frequency, reasons, and correlates with other metrics to distinguish quality containment from frustrated abandonment.

KPI 9: Prompt Compliance Rate

Prompt compliance measures how frequently the agent follows specific instructions in its system prompt—including scope boundaries, safety guardrails, required disclaimers, and escalation triggers. This is critical for regulatory compliance, brand safety, and preventing scope creep.

Definition & Calculation

Formula:

Prompt Compliance = (Responses following instructions / Total responses) × 100

What it measures: Whether the agent executes within defined boundaries, follows safety protocols, and maintains brand voice.
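
As a rough sketch, simple rule-based assertions can approximate this metric before you adopt a full evaluation platform. The disclaimer wording and prohibited-topic patterns below are placeholders, not rules from any real prompt:

import re

# Placeholder rules; substitute the actual wording and topics from your system prompt.
REQUIRED_DISCLAIMER = re.compile(r"this call may be recorded", re.IGNORECASE)
PROHIBITED_TOPICS = [re.compile(p, re.IGNORECASE)
                     for p in (r"\bmedical advice\b", r"\blegal advice\b")]

def is_compliant(agent_turns: list[str]) -> bool:
    """True if the call includes the required disclaimer and avoids prohibited topics."""
    full_text = " ".join(agent_turns)
    if not REQUIRED_DISCLAIMER.search(full_text):
        return False
    return not any(p.search(full_text) for p in PROHIBITED_TOPICS)

def prompt_compliance_rate(calls: list[list[str]]) -> float:
    if not calls:
        return 0.0
    return sum(is_compliant(turns) for turns in calls) / len(calls) * 100

calls = [
    ["This call may be recorded.", "Your balance is $1,500."],
    ["Your balance is $1,500."],  # missing disclaimer -> non-compliant
]
print(f"Prompt compliance: {prompt_compliance_rate(calls):.1f}%")  # 50.0%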

Industry Benchmarks

| Instruction Type | Target | Minimum Acceptable |
| --- | --- | --- |
| Safety/regulatory | >99% | 95% |
| Scope boundaries | >95% | 90% |
| Brand voice/tone | >90% | 85% |
| Required disclosures | >99% | 95% |

Alert Thresholds & Failure Modes

| Severity | Threshold | Duration | Action |
| --- | --- | --- | --- |
| Warning | <90% | 15 min | Review compliance failures |
| Critical | <85% | 5 min | Immediate prompt review, potential pause |

Common failure modes:

  1. Scope creep — Progressively going beyond permitted topics
  2. Safety guardrail bypass — Ignoring safety precautions in prompts
  3. Disclosure omission — Failing to provide required legal disclaimers
  4. Brand voice drift — Tone shifting away from guidelines

Leading indicators: Track compliance by prompt section to identify which specific instructions agents consistently violate.

Remediation Strategies

  1. Implement automated assertion checks — Test against prohibited topics, required disclaimers, escalation triggers, data handling policies

  2. Version-aware monitoring — Prompt updates shift behavior baselines. Track compliance by prompt version.

  3. Segment by instruction category — Safety-critical instructions need higher compliance thresholds than stylistic guidelines

How Hamming helps: Hamming includes 50+ built-in metrics including compliance scorers, plus unlimited custom assertions you define. Automated testing validates prompt compliance across thousands of scenarios.

KPI 10: Context Retention Accuracy

Context retention measures the agent's ability to retain and use relevant information across conversation turns. When users say "I already told you that," it's a context retention failure. This metric is critical for multi-turn conversations requiring information recall and reasoning.

Definition & Calculation

Formula:

Context Retention = (Correct contextual references / Total contextual reference opportunities) × 100

What it measures: Whether the agent remembers and correctly uses information provided earlier in the conversation.

Industry Benchmarks

| Scenario | Target | Context |
| --- | --- | --- |
| Same-call retention | >95% | Information from current call |
| Multi-turn reasoning | >90% | Complex tasks requiring recall |
| Session persistence | >85% | Information across call transfers |

Alert Thresholds & Failure Modes

| Severity | Threshold | Duration | Action |
| --- | --- | --- | --- |
| Warning | <85% | 15 min | Review context loss patterns |
| Critical | <80% | 10 min | Investigate session management, prompts |

Common failure modes:

  1. Short context windows — LLM context limit exceeded, early information lost
  2. Session state issues — Context not properly persisted across turns
  3. Memory retrieval errors — RAG system fails to retrieve relevant history
  4. Prompt engineering gaps — Instructions don't emphasize context preservation

Leading indicators: Track user repetition patterns—users repeating information signals agent failed to retain context.
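
A rough sketch of that repetition signal: flagging calls where any user turn contains an "I already told you"-style phrase. The phrase list is illustrative and English-only.

import re

# Illustrative repetition phrases; extend for your domain and languages.
REPETITION_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\balready told you\b", r"\bas i (said|mentioned)\b", r"\bi just (said|gave you)\b")
]

def repetition_rate(calls: list[list[str]]) -> float:
    """% of calls where at least one user turn suggests the agent lost context."""
    if not calls:
        return 0.0
    flagged = sum(
        1 for user_turns in calls
        if any(p.search(turn) for turn in user_turns for p in REPETITION_PATTERNS)
    )
    return flagged / len(calls) * 100

calls = [
    ["I need to reschedule my appointment", "I already told you that, it's the one on Friday"],
    ["What's my balance?"],
]
print(f"Calls with repetition signals: {repetition_rate(calls):.1f}%")  # 50.0%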

Remediation Strategies

  1. Test multi-turn scenarios — Create scripted conversations requiring information recall across 5+ exchanges

  2. Monitor repetition requests — "What was that again?" or users re-stating information indicates failures

  3. Optimize conversation memory — Tune sliding window size, summarization strategy, and RAG integration

How Hamming helps: Hamming tracks context management as a core metric, identifying conversations where context loss occurs and correlating with user frustration signals.

How to Instrument Voice Agent KPIs

Effective KPI monitoring requires comprehensive instrumentation capturing events at every stage of the voice agent pipeline. Here's the event collection framework used by production voice agents monitored by Hamming.

Core Events to Capture

| Event | Required Fields | Purpose |
| --- | --- | --- |
| call.started | call_id, timestamp, caller_metadata, agent_version | Session initialization |
| turn.user | transcript, confidence, audio_duration, timestamp | ASR quality tracking |
| turn.agent | text, latency_breakdown, tool_calls, intent | Execution tracking |
| intent.classified | intent, confidence, alternatives, method | Intent accuracy |
| tool.called | tool_name, params, result, latency, success | Integration health |
| call.ended | outcome, duration, metrics_summary, disposition | Outcome tracking |

Event Schema Example

{
  "event": "turn.agent",
  "timestamp": "2025-01-20T10:30:00Z",
  "call_id": "call_abc123",
  "turn_index": 3,
  "latency_ms": {
    "stt": 150,
    "llm": 420,
    "tts": 180,
    "total": 750
  },
  "text": "I can help you check your account balance.",
  "intent": {
    "classified": "account_inquiry",
    "confidence": 0.94,
    "alternatives": [
      {"intent": "balance_check", "confidence": 0.89}
    ]
  },
  "tool_calls": [
    {
      "tool": "get_balance",
      "params": {"account_id": "12345"},
      "result": {"balance": 1500.00},
      "latency_ms": 180,
      "success": true
    }
  ]
}
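
How these events reach your analytics pipeline depends on your stack; as a sketch, a small helper can assemble the same turn.agent payload per turn (the send_event call at the end is hypothetical):

import json
from datetime import datetime, timezone

def build_turn_agent_event(call_id: str, turn_index: int, text: str,
                           stt_ms: int, llm_ms: int, tts_ms: int,
                           intent: str, confidence: float) -> dict:
    """Assemble a turn.agent event matching the schema above."""
    return {
        "event": "turn.agent",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "call_id": call_id,
        "turn_index": turn_index,
        "latency_ms": {"stt": stt_ms, "llm": llm_ms, "tts": tts_ms,
                       "total": stt_ms + llm_ms + tts_ms},
        "text": text,
        "intent": {"classified": intent, "confidence": confidence},
    }

event = build_turn_agent_event("call_abc123", 3, "I can help you check your account balance.",
                               150, 420, 180, "account_inquiry", 0.94)
print(json.dumps(event, indent=2))
# send_event(event)  # hypothetical: forward to your event pipeline or log sink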

OpenTelemetry Integration

Hamming natively ingests OpenTelemetry traces, spans, and logs for unified observability:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Configure Hamming as trace destination
hamming_exporter = OTLPSpanExporter(
    endpoint="https://otel.hamming.ai",
    headers={"x-api-key": "your-api-key"}
)

# Register the exporter so spans are actually exported (no-op otherwise)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(hamming_exporter))
trace.set_tracer_provider(provider)

# Instrument voice agent components
tracer = trace.get_tracer("voice-agent")

with tracer.start_as_current_span("process_turn") as span:
    # turn_index, intent, and total_latency come from your turn-processing pipeline
    span.set_attribute("turn.index", turn_index)
    span.set_attribute("intent.classified", intent)
    span.set_attribute("latency.total_ms", total_latency)

Component-Level Latency Tracking

Break down end-to-end latency into component spans:

User speaks
      ↓  (40ms network)
┌─────────────┐
│ STT Service │ ←── Span: stt.transcribe (150ms)
└─────────────┘
      ↓
┌─────────────┐
│ LLM Service │ ←── Span: llm.generate (420ms)
└─────────────┘
      ↓
┌─────────────┐
│ TTS Service │ ←── Span: tts.synthesize (180ms)
└─────────────┘
      ↓  (30ms network)
User hears response

Total E2E: 820ms
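
A minimal sketch of capturing that per-component breakdown in application code; the STT/LLM/TTS calls are placeholders for your real clients, and the timed helper here is purely illustrative:

import time
from contextlib import contextmanager

latency_ms: dict[str, float] = {}

@contextmanager
def timed(component: str):
    """Record wall-clock duration of a pipeline component in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latency_ms[component] = (time.perf_counter() - start) * 1000

# Placeholder component calls; swap in your real STT/LLM/TTS clients.
with timed("stt"):
    transcript = "I need to check my account balance"    # stt_client.transcribe(audio)
with timed("llm"):
    reply = "I can help you check your account balance."  # llm_client.generate(transcript)
with timed("tts"):
    audio_out = b""                                        # tts_client.synthesize(reply)

latency_ms["total"] = sum(latency_ms.values())
print(latency_ms)  # attach this breakdown to the turn.agent event or span attributes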

Dashboard Design: The 4-Layer Framework

Effective voice agent dashboards organize metrics into four layers, enabling both executive overview and operational drill-down.

Executive View (One Glance)

┌─────────────────────────────────────────────────────────┐
│  Voice Agent Health: 94.2%                              │
├───────────────┬───────────────┬─────────────────────────┤
│ Calls Today   │ Task Success  │ TTFW P90                │
│   12,847      │    94.2%      │   720ms                 │
│    8%         │    0.3%       │   50ms                  │
├───────────────┴───────────────┴─────────────────────────┤
│  Active Alerts: 1 (P2)                                  │
│  ⚠️ Intent accuracy below baseline in "billing" flow    │
└─────────────────────────────────────────────────────────┘

Layer 1: Infrastructure Metrics

| Metric | Target | Alert Threshold |
| --- | --- | --- |
| Audio Quality (MOS) | >4.0 | <3.5 |
| Packet Loss | <0.1% | >1% |
| Concurrent Calls | <80% capacity | >90% |
| Call Setup Time | <2s | >5s |

Layer 2: Execution Metrics

| Metric | Target | Alert Threshold |
| --- | --- | --- |
| Intent Accuracy | >95% | <90% |
| WER | <8% | >12% |
| LLM Latency (P90) | <800ms | >1500ms |
| Tool Call Success | >99% | <95% |

Layer 3: User Reaction Metrics

| Metric | Target | Alert Threshold |
| --- | --- | --- |
| User Interruptions | <10% | >20% |
| Retry Rate | <5% | >15% |
| Sentiment Trajectory | Positive/Stable | Negative trend |
| Abandonment Rate | <8% | >15% |

Layer 4: Outcome Metrics

| Metric | Target | Alert Threshold |
| --- | --- | --- |
| Task Completion | >85% | <70% |
| FCR | >75% | <60% |
| Containment Rate | >70% | <50% |
| CSAT | >75% | <65% |

Related: Anatomy of a Perfect Voice Agent Analytics Dashboard for detailed dashboard design.

Alerting Playbook: Detection to Remediation

Severity Levels

| Level | Response Time | Channel | Criteria |
| --- | --- | --- | --- |
| P0: Critical | <5 min | PagerDuty | Revenue impact, system down |
| P1: High | <15 min | Slack urgent | Customer impact, major degradation |
| P2: Medium | <1 hour | Slack | Performance degradation |
| P3: Low | <4 hours | Email | Trend warnings |

Alert Configuration Template

name: TTFW Degradation
condition: ttfw_p90 > 1000ms
duration: 5 minutes
severity: P1
channels:
  - slack://voice-alerts
  - pagerduty://voice-team
context:
  - current_value
  - baseline_value
  - sample_calls
  - dashboard_link
runbook: /docs/runbooks/ttfw-degradation
cooldown: 30 minutes

Start with these four high-impact alerts:

  1. TTFW P90 >1000ms (duration: 5 min) → P1
  2. Task completion <80% (duration: 15 min) → P1
  3. Intent accuracy <90% (duration: 10 min) → P1
  4. WER >12% (duration: 10 min) → P2

Four-Step Remediation Workflow

  1. Detection: ML-based anomaly detection + rule-based validation
  2. Alert: Instant notification with metadata (affected agent, timestamp, sample calls)
  3. Diagnosis: Drill down from metrics to transcripts and audio
  4. Remediation: Coordinate fix deployment, monitor recovery

How Hamming helps: Detection begins with ML-based anomaly detection and rule-based validation. When anomalies are detected, alerts push instantly to Slack with metadata. Engineers can drill down from high-level metrics to individual transcripts and audio for rapid diagnosis.

Platform Comparison: Hamming vs Alternatives

CapabilityHammingBraintrustRoarkDatadog
Voice-native KPIs✅ 50+ built-in⚠️ Limited✅ Yes❌ No
Real-time alerting✅ ML + rules⚠️ Manual✅ Yes✅ Infra only
Call replay✅ One-click⚠️ Manual✅ Yes❌ No
OpenTelemetry✅ Native✅ Yes⚠️ Limited✅ Yes
Intent accuracy✅ Automated⚠️ Manual✅ Yes❌ No
Sentiment analysis✅ Speech-level❌ No⚠️ Basic❌ No
Automated testing✅ 1000+ concurrent⚠️ Limited⚠️ Limited❌ No
Prompt compliance✅ Built-in⚠️ Custom⚠️ Limited❌ No

When to use Hamming: Production voice agent monitoring with comprehensive KPI coverage, automated testing, and speech-level analysis. Ideal for teams needing unified testing + monitoring in one platform.

Frequently Asked Questions

What are the most important KPIs for production voice agents?

The 10 most critical voice agent KPIs according to Hamming's analysis of 1M+ production calls are: First Call Resolution (FCR), Average Handle Time (AHT), Intent Classification Accuracy, End-to-End Latency (P95), Word Error Rate (WER), Task Completion Rate, CSAT, Containment Rate, Prompt Compliance, and Context Retention. These span four layers: Infrastructure, Execution, User Reaction, and Business Outcome.

What is a good First Call Resolution rate for AI voice agents?

Industry benchmarks show 75-85% FCR is excellent for AI voice agents. Healthcare providers typically achieve ~75% FCR for appointment scheduling and prescription queries. FCR below 60% indicates systemic issues requiring immediate attention—typically knowledge base gaps, intent misclassification, or incomplete dialog flows.

How should you measure voice agent latency?

Measure End-to-End Latency as 'Mouth-to-Ear Turn Gap'—the time from when the user stops speaking until they hear the agent's first word. Break this into components: STT (~150ms), LLM (~400ms), TTS (~100ms), plus network overhead. Track P50, P90, P95, and P99 percentiles, not averages. Target P95 under 800ms for natural conversation feel.

How do voice agent KPIs differ from traditional contact center metrics?

Traditional contact center metrics (AHT, abandonment rate, queue time) measure operational efficiency. Voice agent KPIs must additionally measure AI-specific quality: intent accuracy, WER, prompt compliance, context retention, and multi-turn reasoning. Generic contact center dashboards miss 60% of voice-specific failures.

Which alerts should you set up first?

Start with these four high-impact alerts: (1) TTFW P90 >1000ms for 5 minutes → P1; (2) Task completion <80% for 15 minutes → P1; (3) Intent accuracy <90% for 10 minutes → P1; (4) WER >12% for 10 minutes → P2. Tune thresholds based on your baseline after 1 week of monitoring. Use duration filters (5+ minutes) and cooldowns (30 minutes) to avoid alert fatigue.

Can you monitor voice agents with Datadog alone?

Datadog monitors infrastructure (servers, APIs, databases) but misses voice-specific metrics like TTFW, WER, intent accuracy, sentiment trajectory, and conversation flow quality. Best practice: Use Datadog for infrastructure monitoring + Hamming for voice-specific KPIs. They complement each other—Hamming natively ingests OpenTelemetry traces for unified observability.

How do you detect voice agent degradation before users complain?

Monitor these leading indicators: (1) Confidence decay across conversation turns, (2) Rising fallback/default intent frequency, (3) Intent accuracy drop for specific categories, (4) WER increases in user segments, (5) Negative sentiment trajectory. Set anomaly detection baselines after 2-4 weeks of data collection. Alert when metrics exceed baseline + 20% for sustained periods.

What is prompt compliance and what should the target be?

Prompt compliance measures how often the agent follows specific instructions in its system prompt—including scope boundaries, safety guardrails, required disclaimers, and escalation triggers. Target >95% for safety-critical instructions, >90% for brand guidelines. Low compliance indicates prompt engineering issues, scope creep, or guardrail failures that can cause regulatory or brand risk.

How often should voice agent dashboards refresh and metrics be reviewed?

Real-time dashboards should refresh every 5-15 seconds for operational metrics (concurrent calls, latency, error rates). Historical analysis should use 15-minute aggregations for trend detection. Implement weekly performance reviews for operational metrics and quarterly business reviews for strategic KPIs.

How do you instrument a voice agent for KPI monitoring?

Use OpenTelemetry for standardized trace collection across STT, LLM, TTS components. Log these events: call.started, turn.user (transcript + confidence), turn.agent (latency breakdown), intent.classified (intent + confidence), tool.called (params + result), call.ended (outcome + metrics). Hamming natively ingests OpenTelemetry traces for unified testing and production monitoring.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As a Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions of dollars in revenue per year.

He researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on the dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”