If your voice agent handles fewer than 100 calls a month, I would not start with a dashboard. I would pull ten calls, listen to them end to end, and write down the moments that made me wince. That will teach you more than a polished chart.
This guide is for the next stage, when the agent is taking enough calls that "just listen to them" stops being a real process. At that point, call volume is background noise. The useful questions get sharper: did the agent resolve the issue, did the caller get annoyed, where did the flow break, and is quality getting better for the intents that actually matter?
The scope is deliberately narrow: this is a voice agent analytics metrics guide for containment rate, sentiment analysis, drop-off analysis, flow adherence, and call quality scoring. If you need the broader KPI catalog or dashboard layout, start there instead.
Those sharper questions are where voice agent analytics earns its keep. The dashboard should make one uncomfortable thing easy to find: calls that looked fine in aggregate and still failed for the person on the phone.
Voice agent analytics is the measurement of conversation outcomes and voice-specific failure signals across production calls, including containment, sentiment, flow adherence, drop-off, latency, tool behavior, and call quality.
For years, call center voice analytics mostly meant post-call transcription and keyword search. That was enough when managers were coaching human agents after the fact. With AI agents, the failure can happen between the words. The agent decides, speaks, calls a tool, waits, recovers, and sometimes delivers the wrong answer with perfect confidence. The uncomfortable details live in the audio layer: silence, interruptions, latency, and tone.
The transcript trap: A transcript tells you what was said. It will miss some of the most important parts of a voice call: the caller talking over the agent, the awkward silence before "hello?", or the moment someone gives up after repeating the same request three times. Production voice analytics needs transcript, audio, timing, tool, and outcome signals together.
TL;DR: Use this voice agent analytics metric dictionary to measure four production outcomes:
- Containment - Did the AI resolve the call without inappropriate human escalation?
- Sentiment - Did the caller become more or less frustrated during the call?
- Flow - Did the conversation follow the expected path, or did it loop, stall, or drop off?
- Quality - Was the agent accurate, fast, policy-compliant, and easy to talk to?
Activity metrics tell you what happened. These outcome metrics tell you whether it worked.
Methodology Note: The benchmarks, formulas, and metric recommendations in this guide are derived from Hamming's analysis of 4M+ production voice agent calls across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions. We also used anonymized customer and prospect discovery patterns to identify which analytics questions teams repeatedly ask when deploying production voice agents.
Last Updated: May 2026
Related Guides:
- Voice Agent Analytics & Post-Call Metrics - broader KPI reference across the full voice stack
- Voice Agent Dashboard Template - chart layout and executive reporting template
- Voice Agent Monitoring KPIs - production alerting thresholds
- Voice Agent Observability Tracing Guide - component-level traces across ASR, LLM, tools, and TTS
- Conversational Flow Measurement - deeper guide to flow adherence and path analysis
- How to Monitor Voice Agent Outages - incident thresholds and escalation paths
- Voice Agent Evaluation Metrics - evaluation metric definitions and benchmarks
- Debugging Voice Agents - logs, missed intents, and error dashboards
Why Voice Agent Analytics Matter
Most teams start with the metrics they already know from human contact centers: average handle time, abandonment rate, call volume, transfer rate, and maybe CSAT.
Those metrics still matter. They just do not explain AI failures. We have seen dashboards where every top-line metric looked acceptable, but the failed calls were obvious after listening to five recordings.
A voice agent can keep average handle time flat while hallucinating policy answers. It can reduce transfers by trapping callers in a bad loop. It can show high containment because users gave up before reaching a human.
We call this metric theater: tracking enough numbers to feel informed, but not the numbers that change decisions.
The first useful question is not "how many calls did the agent handle?" It is:
Which calls looked successful in aggregate but failed for the caller?
That question requires four metric families.
| Metric Family | Question It Answers | Example Failure It Catches |
|---|---|---|
| Containment | Did AI resolve the call? | False containment, inappropriate deflection, missing handoff |
| Sentiment | How did the caller feel? | Rising frustration, repeated interruption, angry hangup |
| Flow | Where did the conversation break? | Authentication loop, missing slot, dead-end branch |
| Quality | Was the agent actually good? | Correct intent but wrong answer, slow response, policy miss |
This is especially important for high-volume call center voice analytics. Cross-call patterns become the product feedback loop: repeated questions reveal documentation gaps, drop-off points reveal broken flows, and sentiment spikes reveal issues before weekly CSAT reports catch up.
The best analytics reviews I have seen all end with a concrete change: add a missing FAQ to the knowledge base, fix a flaky tool, shorten an authentication step, or roll back a prompt. If a metric never leads to one of those moves, it is probably dashboard decoration.
The Voice Agent Analytics Metric Map
This metric map groups production voice agent analytics into four families:
| Category | Primary Metric | Supporting Signals | Owner |
|---|---|---|---|
| Containment | AI-resolved calls / total calls | Escalation reason, repeat contact, task success | Operations |
| Sentiment | Frustration and satisfaction signals | Repetition, interruption, tone, volume, negative language | CX / QA |
| Flow | Stage completion and path adherence | Drop-off by step, loop count, missing slots, fallback intents | Product |
| Quality | Composite call score | Accuracy, latency, tool success, policy adherence, audio quality | Engineering / QA |
The map matters because each metric family has a different remediation path.
Low containment usually means the agent is missing capability, knowledge, or authority. Negative sentiment means the caller experience is bad, even if the task completes. Poor flow means the conversation design is broken. Low quality means the system may be slow, inaccurate, noncompliant, or hard to understand.
Do not combine these too early. A single "voice agent score" is useful for executives, but the team fixing the agent needs the component metrics.
Category 1: Containment Analytics
Containment rate measures the percentage of calls handled by the AI agent without escalation to a human.
Containment rate is useful only when it means "resolved by AI," not merely "not transferred." Hamming recommends treating repeat contact, abandonment, and inappropriate deflection as containment guardrails so the metric cannot improve by making the caller experience worse.
Containment Rate = (AI-contained calls / Total calls) x 100
That formula is simple. The hard part is defining a contained call correctly.
A call should count as contained only when the user goal was resolved or appropriately completed by the AI. A caller who hangs up after three failed attempts should not improve containment. A caller who asks for a human because the task requires a licensed representative should not hurt containment.
Use this classification:
| Outcome | Count as Contained? | Why |
|---|---|---|
| Agent completes the user's task | Yes | The AI resolved the issue |
| Agent answers the question and caller confirms | Yes | The intent was satisfied |
| Agent escalates because policy requires a human | Neutral / excluded | Escalation was correct behavior |
| Caller asks for a human after repeated failures | No | The AI failed to resolve |
| Caller hangs up mid-task | No | Treat as abandonment unless verified complete |
| Caller calls back for same issue within 48-72 hours | No | Original resolution did not hold |
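That classification is mechanical enough to automate. Here is a minimal sketch, assuming your post-call pipeline already labels each call with an outcome and a same-issue repeat-contact flag; the `Outcome` values and field names are illustrative, not from any particular SDK:

```python
from dataclasses import dataclass
from enum import Enum

class Outcome(Enum):
    TASK_COMPLETED = "task_completed"        # agent completed the user's task
    ANSWER_CONFIRMED = "answer_confirmed"    # answer given, caller confirmed
    POLICY_ESCALATION = "policy_escalation"  # human required by policy
    FAILURE_ESCALATION = "failure_escalation"  # caller asked for human after failures
    ABANDONED = "abandoned"                  # hung up mid-task

@dataclass
class Call:
    outcome: Outcome
    repeat_contact_within_72h: bool  # same-issue callback inside the window

def containment_rate(calls: list[Call]) -> float:
    """Containment rate with policy-required escalations excluded from the denominator."""
    eligible = [c for c in calls if c.outcome is not Outcome.POLICY_ESCALATION]
    contained = [
        c for c in eligible
        if c.outcome in (Outcome.TASK_COMPLETED, Outcome.ANSWER_CONFIRMED)
        and not c.repeat_contact_within_72h  # a same-issue callback voids the resolution
    ]
    return 100 * len(contained) / len(eligible) if eligible else 0.0
```

Excluding policy-required escalations from the denominator matches the neutral row above: the agent should not be penalized for behaving correctly.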
Good benchmark: 70-80% containment for standard customer support flows. Narrow transactional flows can push above 80% when the backend systems are reliable. In healthcare, financial services, or legal workflows, 60-70% may be the right answer if the remaining calls are being escalated for good policy reasons.
Containment gets dangerous when it becomes the trophy metric. Make the handoff harder and the number may rise. The caller experience usually gets worse.
Containment Metrics to Track
| Metric | Formula | Good Target | Use |
|---|---|---|---|
| Containment Rate | AI-contained calls / total calls | 70-80% | Measures AI handling coverage |
| Verified Resolution Rate | Resolved calls with no repeat contact / total calls | 65-80% | Filters false containment |
| Escalation Rate | Human escalations / total calls | 10-30% | Shows remaining human workload |
| Incorrect Deflection Rate | Calls not escalated when they should have been / total calls | <2% | Safety and CX guardrail |
| Repeat Contact Rate | Same-issue repeat calls within 48-72 hours / resolved calls | <10% | Finds fake resolution |
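Repeat Contact Rate is the guardrail most teams get wrong, because it requires joining calls to each other rather than labeling one call at a time. A sketch of the 48-72 hour window match, assuming each call record carries a caller ID, resolved intent, and timestamp; all field names are placeholders:

```python
from datetime import timedelta

def repeat_contact_rate(calls: list[dict], window_hours: int = 72) -> float:
    """Share of resolved calls followed by a same-caller, same-intent call
    inside the window. Each call dict carries caller_id, intent,
    timestamp (datetime), and resolved (bool)."""
    calls = sorted(calls, key=lambda c: c["timestamp"])
    resolved = [c for c in calls if c["resolved"]]
    repeats = 0
    for r in resolved:
        cutoff = r["timestamp"] + timedelta(hours=window_hours)
        if any(
            c["caller_id"] == r["caller_id"]
            and c["intent"] == r["intent"]
            and r["timestamp"] < c["timestamp"] <= cutoff
            for c in calls
        ):
            repeats += 1
    return 100 * repeats / len(resolved) if resolved else 0.0
```

The linear scan is for clarity; a production version would index calls by caller ID before matching.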
I used to think containment was the cleanest executive metric. Too many "successful" calls changed my mind: the user had not been helped; they had just stopped trying. Containment needs repeat contact, sentiment, and task success beside it.
Category 2: Sentiment Analytics
Voice agent sentiment analysis should not stop at positive, neutral, and negative transcript labels.
In voice, frustration often shows up before the words turn negative. The caller repeats themselves, talks over the agent, waits a beat too long, then starts clipping their answers. Sometimes the best signal is painfully simple: the caller says "hello?" because the agent left too much dead air.
Those are analytics signals, not soft UX impressions.
| Frustration Signal | What to Detect | Why It Matters |
|---|---|---|
| Repetition | Same intent or phrase 3+ times | Agent is not understanding or acknowledging |
| Interruption | Caller talks over agent repeatedly | Agent is too slow, too verbose, or wrong |
| Long silence | Caller pauses after agent response | Confusion, dead air, or unclear next step |
| Volume / tone shift | Louder, sharper, or more clipped speech | Audio signal of frustration |
| Negative language | "This is wrong", "operator", "representative", "frustrated" | Explicit dissatisfaction |
| Rage clicks equivalent | Rapid DTMF presses, repeated menu choices | IVR-style escape behavior |
Frustration Signal Rate = Calls with 2+ frustration signals / Total calls x 100
For most production voice agents, a negative sentiment rate below 5-10% is healthy. A spike above 15% usually deserves investigation, especially if the spike is concentrated in one intent, one provider, or one prompt version.
How to Find Frustrated Customers in Voice Bot Calls
Start with a conservative rule:
Flag for Review =
repeated_user_intent >= 3
OR interruption_count >= 3
OR negative_sentiment_terms >= 2
OR abandonment_after_failure = true
OR explicit_human_request_after_agent_failure = true
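Implemented literally, that rule is a few lines. A minimal sketch, assuming the per-call signals are already extracted upstream; the feature names mirror the rule and stand in for whatever your pipeline emits:

```python
from dataclasses import dataclass

@dataclass
class CallSignals:
    repeated_user_intent: int
    interruption_count: int
    negative_sentiment_terms: int
    abandonment_after_failure: bool
    explicit_human_request_after_agent_failure: bool

    def fired(self) -> int:
        """How many distinct frustration signals fired on this call."""
        return sum([
            self.repeated_user_intent >= 3,
            self.interruption_count >= 3,
            self.negative_sentiment_terms >= 2,
            self.abandonment_after_failure,
            self.explicit_human_request_after_agent_failure,
        ])

def flag_for_review(s: CallSignals) -> bool:
    """Conservative rule: any single signal routes the call to the QA queue."""
    return s.fired() >= 1

def frustration_signal_rate(calls: list[CallSignals]) -> float:
    """Frustration Signal Rate: calls with 2+ signals / total calls x 100."""
    return 100 * sum(s.fired() >= 2 for s in calls) / len(calls) if calls else 0.0
```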
Then sample the flagged calls manually for one week. You will find false positives. That is fine. Tighten the rule only after you know which signals correlate with real frustration in your call types.
The goal is not to label every caller emotion perfectly. The goal is to create a reliable queue of calls that a QA lead, product manager, or support owner should inspect.
Category 3: Flow Analytics
Flow analytics measures whether the conversation moved through the expected stages.
For a simple appointment scheduling agent, the expected flow might be:
Greeting -> Intent confirmation -> Eligibility / identity check -> Slot collection -> Booking -> Confirmation
For a billing agent, it might be:
Intent -> Authentication -> Account lookup -> Explanation -> Payment / adjustment / escalation -> Confirmation
Flow adherence is the percentage of calls that follow the expected stages without skipping required steps, looping, or dropping off.
Flow Adherence = Calls completing required stages / Eligible calls x 100
Flow analytics is where aggregate dashboards usually break down. A 78% task success rate is useful, but it does not tell you whether users are failing at authentication, tool execution, payment confirmation, or final handoff.
Track stage-level conversion instead.
| Stage | Metric | Example Alert |
|---|---|---|
| Intent capture | % calls with recognized supported intent | Drops below baseline by 5% |
| Authentication | % eligible calls passing auth | Failure rate doubles |
| Required slot collection | % calls collecting all needed fields | Missing slot rate above 10% |
| Tool execution | % tool calls succeeding | Error rate above 3% |
| Confirmation | % calls ending with explicit confirmation | Falls below 85% |
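Stage-level conversion is a funnel computation over the ordered stage list. A sketch, assuming each call logs the stages it reached in order; the stage names follow the scheduling flow above and are examples:

```python
STAGES = ["greeting", "intent", "auth", "slot_collection", "booking", "confirmation"]

def stage_funnel(calls: list[list[str]]) -> dict[str, dict]:
    """Per-stage conversion. `calls` is one ordered stage log per call."""
    report = {}
    for i, stage in enumerate(STAGES):
        reached = [c for c in calls if stage in c]
        nxt = STAGES[i + 1] if i + 1 < len(STAGES) else None
        advanced = [c for c in reached if nxt is None or nxt in c]
        report[stage] = {
            "reached": len(reached),
            "conversion_pct": 100 * len(advanced) / len(reached) if reached else 0.0,
        }
    return report

def flow_adherence(calls: list[list[str]], required: list[str] = STAGES) -> float:
    """Flow Adherence: calls completing all required stages / eligible calls x 100."""
    completed = [c for c in calls if all(s in c for s in required)]
    return 100 * len(completed) / len(calls) if calls else 0.0
```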
Drop-off Analysis
Drop-off rate measures where users abandon the conversation.
Drop-off Rate = Users who abandon at stage / Users who reached stage x 100
Do not report one blended drop-off number. Segment it by stage:
| Drop-off Type | Likely Cause | First Debug Step |
|---|---|---|
| Early drop-off under 30 seconds | Bad greeting, wrong entrypoint, poor audio, caller surprise | Listen to first-turn calls |
| Mid-flow drop-off | Repetition, latency, missing intent, confusing question | Inspect loops and fallback intents |
| Authentication drop-off | Auth too strict, unclear instructions, tool failure | Compare auth failures by caller segment |
| Near-completion drop-off | Payment, confirmation, or policy friction | Review final two stages |
| Silence-driven drop-off | Dead air or long latency | Inspect turn latency and timeout handling |
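That segmentation can run automatically over abandoned calls. A sketch that buckets each abandoned call into the table's categories, assuming the record carries duration, last stage reached, and trailing silence; thresholds and field names are illustrative:

```python
from collections import Counter

EXPECTED_STAGES = ["greeting", "intent", "auth", "slot_collection", "booking", "confirmation"]

def classify_drop_off(call: dict) -> str:
    """Bucket one abandoned call into the drop-off types above."""
    if call["duration_sec"] < 30:
        return "early"            # bad greeting, wrong entrypoint, caller surprise
    if call["trailing_silence_sec"] > 5:
        return "silence_driven"   # dead air or long latency before the hangup
    if call["last_stage"] == "auth":
        return "authentication"   # auth friction, unclear instructions, tool failure
    if call["last_stage"] in EXPECTED_STAGES[-2:]:
        return "near_completion"  # payment, confirmation, or policy friction
    return "mid_flow"             # repetition, latency, or missing intent

def drop_off_breakdown(abandoned: list[dict]) -> Counter:
    """Counts per drop-off type; report these beside per-stage drop-off rates."""
    return Counter(classify_drop_off(c) for c in abandoned)
```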
This is where voice analytics becomes product analytics. If 30% of callers abandon during identity verification, the issue is not "the AI voice agent is bad." The issue is a specific flow step.
For a deeper breakdown of stage modeling and path adherence, use the Conversational Flow Measurement guide. For production incidents where a flow suddenly degrades, pair the flow funnel with voice agent outage monitoring so the alert carries the failing stage, prompt version, and recent deploy context.
Category 4: Quality Analytics
Quality analytics combines correctness, speed, policy adherence, and conversation experience.
The cleanest production pattern is a composite score with visible sub-scores. Do not hide the inputs.
Quality Score =
(Accuracy Score x 0.30)
+ (Latency Score x 0.20)
+ (Flow Score x 0.25)
+ (Containment / Resolution Score x 0.25)
Use the weights as a starting point, not doctrine. A healthcare triage agent may weight policy adherence and escalation appropriateness higher. An e-commerce order status agent may weight latency and task completion higher.
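Whatever weights you choose, keep them in one visible place so reviewers can see exactly how the number is built. A minimal sketch with the default weights above:

```python
DEFAULT_WEIGHTS = {
    "accuracy": 0.30,
    "latency": 0.20,
    "flow": 0.25,
    "resolution": 0.25,
}

def quality_score(sub_scores: dict[str, float],
                  weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted composite on a 0-100 scale. Sub-scores stay visible so the
    team fixing the agent can see which component dragged the number down."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    assert set(sub_scores) == set(weights), "every sub-score needs a weight"
    return sum(sub_scores[k] * weights[k] for k in weights)

# A healthcare triage profile might re-weight toward accuracy and resolution:
triage_weights = {"accuracy": 0.40, "latency": 0.10, "flow": 0.20, "resolution": 0.30}

print(quality_score({"accuracy": 82, "latency": 91, "flow": 74, "resolution": 68}))
# 78.3 with the default weights
```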
| Score Band | Interpretation | Action |
|---|---|---|
| 80-100 | Strong production quality | Monitor regressions and edge cases |
| 70-79 | Good but uneven | Segment by intent and fix weak paths |
| 60-69 | Risky | Prioritize top failure categories before scaling |
| <60 | Not production-ready | Pause expansion or route traffic to humans |
How to Score Voice Agent Call Quality
At call level, score quality with five checks:
| Dimension | Question | Scoring Method |
|---|---|---|
| Intent | Did the agent identify the user's goal? | Intent match against human/ground-truth label |
| Answer correctness | Was the answer or action correct? | Evaluation against policy, knowledge base, or task result |
| Conversation control | Did the agent avoid loops and recover from errors? | Flow events, repetition count, fallback count |
| Latency | Did responses arrive fast enough for natural conversation? | P50/P90/P95 turn latency |
| Experience | Did the caller sound satisfied or frustrated? | Sentiment, interruption, silence, completion pattern |
This is the minimum viable QA rubric. Add domain checks for regulated workflows: disclosure delivery, consent, PHI/PII handling, PCI redaction, escalation rules, and prohibited advice.
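One way to keep the rubric auditable is to store each review as a typed record, with domain checks as named flags rather than free text. A sketch; every field name here is a placeholder for your own rubric:

```python
from dataclasses import dataclass, field

@dataclass
class CallReview:
    intent_matched: bool            # vs human / ground-truth label
    answer_correct: bool            # vs policy, knowledge base, or task result
    conversation_controlled: bool   # no loops, recovered from errors
    latency_acceptable: bool        # e.g. P95 turn latency under threshold
    caller_experience_ok: bool      # sentiment, interruption, silence pattern
    domain_checks: dict[str, bool] = field(default_factory=dict)
    # e.g. {"disclosure_delivered": True, "pci_redaction": True}

    def passed(self) -> bool:
        core = all([self.intent_matched, self.answer_correct,
                    self.conversation_controlled, self.latency_acceptable,
                    self.caller_experience_ok])
        return core and all(self.domain_checks.values())
```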
If you need the broader evaluation taxonomy, start with Voice Agent Evaluation Metrics. If the low quality score comes from missed intents, tool failures, or error dashboards, use the Debugging Voice Agents guide to trace the failing call path.
How to Calculate Each Metric
Here is the working formula set.
| Metric | Formula | Notes |
|---|---|---|
| Containment Rate | AI-contained calls / total calls x 100 | Exclude policy-required human handoffs when comparing agent capability |
| Verified Resolution Rate | Resolved calls with no repeat contact / total calls x 100 | Use 48-72 hour repeat-contact windows |
| Negative Sentiment Rate | Negative or frustrated calls / total calls x 100 | Combine transcript and audio signals |
| Frustration Signal Rate | Calls with 2+ frustration signals / total calls x 100 | Useful for QA review queues |
| Drop-off Rate | Users abandoning at stage / users reaching stage x 100 | Always segment by stage |
| Flow Adherence | Calls completing required stages / eligible calls x 100 | Different flows need different stage maps |
| Task Completion Rate | Successful task completions / eligible calls x 100 | Define completion criteria per intent |
| Tool Success Rate | Successful tool calls / attempted tool calls x 100 | Segment by tool and provider |
| Quality Score | Weighted composite of accuracy, latency, flow, resolution | Keep sub-scores visible |
Two implementation details matter more than teams expect:
- Define the denominator. Is containment measured across all calls, eligible calls, supported intents, or non-policy escalations? Pick one and label it.
- Segment by intent. A blended metric hides the reason. Appointment scheduling, billing, eligibility, and troubleshooting should not share one undifferentiated score.
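Both details are cheap to enforce in code: name the denominator inside the metric function and group by intent before aggregating. A sketch with illustrative field names:

```python
from collections import defaultdict

def containment_by_intent(calls: list[dict]) -> dict[str, float]:
    """Containment rate per intent. Denominator: eligible calls, i.e. all
    calls for the intent minus policy-required human handoffs."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for c in calls:
        buckets[c["intent"]].append(c)
    out = {}
    for intent, group in buckets.items():
        eligible = [c for c in group if not c["policy_escalation"]]
        contained = [c for c in eligible if c["ai_contained"]]
        out[intent] = 100 * len(contained) / len(eligible) if eligible else 0.0
    return out
```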
Benchmarks by Industry and Use Case
Use benchmarks as starting points. The right threshold depends on task complexity, caller risk, and how much authority the AI agent has.
| Use Case | Healthy Containment | Healthy Task Completion | Notes |
|---|---|---|---|
| Appointment scheduling | 80-90% | 85-95% | Well-defined slots and clear completion criteria |
| Order status | 80-90% | 85-95% | Strong fit for automation if systems are reliable |
| Billing explanation | 65-80% | 75-90% | More handoffs due to account nuance and disputes |
| Healthcare intake | 60-75% | 70-85% | Correct escalation may be more important than containment |
| Financial services support | 60-80% | 75-90% | Compliance, identity, and policy checks lower containment |
| Technical troubleshooting | 55-75% | 65-85% | Multi-step diagnosis creates more drop-off |
The benchmark that matters most is your own baseline by intent and prompt version. Run two weeks of measurement, define normal ranges, then alert on deviation.
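"Define normal ranges" can be as simple as a mean and standard deviation per intent over the measurement window. A sketch, assuming one metric value per day for roughly two weeks; the numbers are made up:

```python
import statistics

def baseline(daily_values: list[float]) -> tuple[float, float, float]:
    """Return (mean, warn_floor, crit_floor) from ~14 days of daily values.
    Warn at 1 std below the mean, critical at 2 std below; tune per metric."""
    mean = statistics.mean(daily_values)
    std = statistics.stdev(daily_values)
    return mean, mean - std, mean - 2 * std

mean, warn, crit = baseline([78.2, 76.9, 79.4, 77.1, 78.8, 75.6, 77.7,
                             78.0, 76.4, 79.1, 77.5, 78.3, 76.8, 77.9])
```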
Building Your Analytics Dashboard
A good voice agent analytics dashboard has three layers:
- Executive health: Is the agent improving or hurting the business?
- Operations triage: Which intents, flows, or cohorts need attention today?
- Call-level drilldown: Which calls prove the pattern and explain root cause?
Do not put 50 charts on the first screen. More tiles usually means more arguing about which tile matters. Start with one row per metric family.
| Dashboard Row | Primary Chart | Drilldown |
|---|---|---|
| Containment | Contained vs escalated by intent | Escalation reasons and repeat-contact calls |
| Sentiment | Negative sentiment trend | Frustration-flagged call queue |
| Flow | Funnel by conversation stage | Stage-level drop-off examples |
| Quality | Quality score distribution | Lowest-scoring calls with reason codes |
Then add queryable analytics:
- "Show calls where the user asked for a human after the agent repeated itself."
- "Find all calls where billing intent reached authentication but never completed."
- "Show noisy calls with low ASR confidence and negative sentiment."
- "Cluster the top new questions from the last 500 calls."
This is the part that generic analytics tools usually miss. Voice agent teams need both dashboard metrics and natural-language exploration across calls.
Alerting and Anomaly Detection
Alert on sustained deviations, not single-call failures.
| Alert | Warning | Critical | Route To |
|---|---|---|---|
| Containment drop | >5% below baseline for 30 minutes | >15% below baseline | Ops / product |
| Negative sentiment spike | 2x baseline | >25% negative calls | CX / QA |
| Drop-off increase | >10% increase at one stage | >25% increase at one stage | Product / engineering |
| Tool failure rate | >3% for one tool | >10% for one tool | Engineering |
| P95 turn latency | >1.5x baseline | >3 seconds for 15 minutes | Engineering |
| Quality score | Median below 70 | Median below 60 | QA / product |
The alert should include the likely first debug path. "Negative sentiment is up" is too vague to help. A better alert says: billing calls are souring after authentication, interruption count is up 41%, and here are the first five calls to open.
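A sketch of what that looks like in practice: a sustained-deviation check whose output carries its own debug context, so the page arrives with the failing intent and the first calls to open. Nothing here is a specific alerting API; the structure is illustrative:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    metric: str
    intent: str
    baseline: float
    observed: float
    window_minutes: int
    sample_call_ids: list[str]   # first calls to open when triaging
    suspected_stage: str | None  # e.g. "auth" if the funnel localized it

def check_containment(observed: float, baseline: float, minutes_sustained: int,
                      intent: str, sample_ids: list[str]) -> Alert | None:
    """Warn on >5 points below baseline sustained for 30 minutes, per the table."""
    if minutes_sustained >= 30 and observed < baseline - 5:
        return Alert("containment_rate", intent, baseline, observed,
                     minutes_sustained, sample_ids[:5], suspected_stage=None)
    return None
```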
Flaws But Not Dealbreakers
Metrics can hide the truth. A high containment rate can mean the agent solved calls. It can also mean users gave up. I would take 68% honest containment over 85% containment created by trapped callers. Pair the number with repeat contact, sentiment, and task success.
Sentiment is noisy. Accent, culture, background noise, and call context all affect sentiment models. Treat sentiment as a triage signal and a trend line, not a courtroom verdict on one caller.
Benchmarks are not universal. A healthcare triage agent should not chase the same containment rate as an order-status bot. Use benchmarks to start the conversation, then set thresholds by intent and risk.
Manual review still matters. Automation tells you where to look. Humans still need to inspect samples, update rubrics, and decide whether a flagged pattern is actually harmful.
The Practical Starting Point
If you are building this from scratch, do not start with the full dashboard.
Start with 20 production calls per major intent and score them manually across the four metric families:
| Call | Contained? | Sentiment | Flow issue? | Quality issue? | Root cause |
|---|---|---|---|---|---|
| 1 | Yes | Neutral | None | Slow response | LLM latency |
| 2 | No | Negative | Auth loop | Tool error | Auth API timeout |
| 3 | Yes | Negative | Repetition | Wrong answer | Knowledge gap |
After 100-200 scored calls, automate the labels that match your real failure modes. This prevents the common mistake: instrumenting generic metrics before you know which failures matter for your agent.
For teams still validating the agent itself, pair this analytics setup with Call Center Voice Agent Testing and Background Noise Testing KPIs. Analytics tells you what failed in production; testing lets you reproduce and prevent the same failure before the next deploy.
Cite This Guide
If you reference this article, cite it as:
Hamming's voice agent analytics metric dictionary defines containment, sentiment, flow, and quality metrics with formulas, denominators, benchmarks, and alert thresholds for production voice agents.

