Voice Agent Analytics Metrics: Containment to Quality

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

May 9, 2026 · Updated May 9, 2026 · 19 min read

If your voice agent handles fewer than 100 calls a month, I would not start with a dashboard. I would pull ten calls, listen to them end to end, and write down the moments that made me wince. That will teach you more than a polished chart.

This guide is for the next stage, when the agent is taking enough calls that "just listen to them" stops being a real process. At that point, call volume is background noise. The useful questions get sharper: did the agent resolve the issue, did the caller get annoyed, where did the flow break, and is quality getting better for the intents that actually matter?

The scope is deliberately narrow: this is a voice agent analytics metrics guide for containment rate, sentiment analysis, drop-off analysis, flow adherence, and call quality scoring. If you need the broader KPI catalog or a full dashboard layout, start with the companion guides instead.

That is where voice agent analytics earns its keep. The dashboard should make one uncomfortable thing easy to find: calls that looked fine in aggregate and still failed for the person on the phone.

Voice agent analytics is the measurement of conversation outcomes and voice-specific failure signals across production calls, including containment, sentiment, flow adherence, drop-off, latency, tool behavior, and call quality.

For years, call center voice analytics mostly meant post-call transcription and keyword search. That was enough when managers were coaching human agents after the fact. With AI agents, the failure can happen between the words. The agent decides, speaks, calls a tool, waits, recovers, and sometimes delivers the wrong answer with perfect confidence. The uncomfortable details live in the audio layer: silence, interruptions, latency, and tone.

The transcript trap: A transcript tells you what was said. It will miss some of the most important parts of a voice call: the caller talking over the agent, the awkward silence before "hello?", or the moment someone gives up after repeating the same request three times. Production voice analytics needs transcript, audio, timing, tool, and outcome signals together.

TL;DR: Use this voice agent analytics metric dictionary to measure four production outcomes:

  • Containment - Did the AI resolve the call without inappropriate human escalation?
  • Sentiment - Did the caller become more or less frustrated during the call?
  • Flow - Did the conversation follow the expected path, or did it loop, stall, or drop off?
  • Quality - Was the agent accurate, fast, policy-compliant, and easy to talk to?

Activity metrics tell you what happened. These outcome metrics tell you whether it worked.

Methodology Note: The benchmarks, formulas, and metric recommendations in this guide are derived from Hamming's analysis of 4M+ production voice agent calls across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions.

We also used anonymized customer and prospect discovery patterns to identify which analytics questions teams repeatedly ask when deploying production voice agents.

Last Updated: May 2026

Why Voice Agent Analytics Matter

Most teams start with the metrics they already know from human contact centers: average handle time, abandonment rate, call volume, transfer rate, and maybe CSAT.

Those metrics still matter. They just do not explain AI failures. We have seen dashboards where every top-line metric looked acceptable, but the failed calls were obvious after listening to five recordings.

A voice agent can keep average handle time flat while hallucinating policy answers. It can reduce transfers by trapping callers in a bad loop. It can show high containment because users gave up before reaching a human.

We call this metric theater: tracking enough numbers to feel informed, but not the numbers that change decisions.

The first useful question is not "how many calls did the agent handle?" It is:

Which calls looked successful in aggregate but failed for the caller?

That question requires four metric families.

| Metric Family | Question It Answers | Example Failure It Catches |
| --- | --- | --- |
| Containment | Did AI resolve the call? | False containment, inappropriate deflection, missing handoff |
| Sentiment | How did the caller feel? | Rising frustration, repeated interruption, angry hangup |
| Flow | Where did the conversation break? | Authentication loop, missing slot, dead-end branch |
| Quality | Was the agent actually good? | Correct intent but wrong answer, slow response, policy miss |

This is especially important for high-volume call center voice analytics. Cross-call patterns become the product feedback loop: repeated questions reveal documentation gaps, drop-off points reveal broken flows, and sentiment spikes reveal issues before weekly CSAT reports catch up.

The best analytics reviews I have seen all end with a concrete change: add a missing FAQ to the knowledge base, fix a flaky tool, shorten an authentication step, or roll back a prompt. If a metric never leads to one of those moves, it is probably dashboard decoration.

The Voice Agent Analytics Metric Map

This metric map groups production voice agent analytics into four families:

| Category | Primary Metric | Supporting Signals | Owner |
| --- | --- | --- | --- |
| Containment | AI-resolved calls / total calls | Escalation reason, repeat contact, task success | Operations |
| Sentiment | Frustration and satisfaction signals | Repetition, interruption, tone, volume, negative language | CX / QA |
| Flow | Stage completion and path adherence | Drop-off by step, loop count, missing slots, fallback intents | Product |
| Quality | Composite call score | Accuracy, latency, tool success, policy adherence, audio quality | Engineering / QA |

The map matters because each metric family has a different remediation path.

Low containment usually means the agent is missing capability, knowledge, or authority. Negative sentiment means the caller experience is bad, even if the task completes. Poor flow means the conversation design is broken. Low quality means the system may be slow, inaccurate, noncompliant, or hard to understand.

Do not combine these too early. A single "voice agent score" is useful for executives, but the team fixing the agent needs the component metrics.

Category 1: Containment Analytics

Containment rate measures the percentage of calls handled by the AI agent without escalation to a human.

Containment rate is useful only when it means "resolved by AI," not merely "not transferred." Hamming recommends treating repeat contact, abandonment, and inappropriate deflection as containment guardrails so the metric cannot improve by making the caller experience worse.

Containment Rate = (AI-contained calls / Total calls) x 100

That formula is simple. The hard part is defining a contained call correctly.

A call should count as contained only when the user goal was resolved or appropriately completed by the AI. A caller who hangs up after three failed attempts should not improve containment. A caller who asks for a human because the task requires a licensed representative should not hurt containment.

Use this classification:

| Outcome | Count as Contained? | Why |
| --- | --- | --- |
| Agent completes the user's task | Yes | The AI resolved the issue |
| Agent answers the question and caller confirms | Yes | The intent was satisfied |
| Agent escalates because policy requires a human | Neutral / excluded | Escalation was correct behavior |
| Caller asks for a human after repeated failures | No | The AI failed to resolve |
| Caller hangs up mid-task | No | Treat as abandonment unless verified complete |
| Caller calls back for same issue within 48-72 hours | No | Original resolution did not hold |
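Here is a minimal sketch of that classification in code. The field names on the call record are illustrative, not from any particular platform; the one non-negotiable move is dropping policy-required escalations from the denominator before computing the rate.

```python
from dataclasses import dataclass

@dataclass
class Call:
    task_completed: bool           # AI completed the user's task
    caller_confirmed: bool         # caller confirmed the answer worked
    policy_escalation: bool        # handoff required by policy (correct behavior)
    escalated_after_failure: bool  # caller asked for a human after repeated failures
    abandoned_mid_task: bool       # hung up before the task finished
    repeat_within_72h: bool        # same-issue callback inside the window

def classify(call: Call) -> str:
    """Map one call to contained / not_contained / excluded per the table above."""
    if call.policy_escalation:
        return "excluded"  # correct escalation; drop from the denominator
    if call.escalated_after_failure or call.abandoned_mid_task or call.repeat_within_72h:
        return "not_contained"
    if call.task_completed or call.caller_confirmed:
        return "contained"
    return "not_contained"  # default: unresolved calls never inflate the number

def containment_rate(calls: list[Call]) -> float:
    labels = [classify(c) for c in calls]
    eligible = [label for label in labels if label != "excluded"]
    return 100.0 * eligible.count("contained") / max(len(eligible), 1)
```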

Good benchmark: 70-80% containment for standard customer support flows. Narrow transactional flows can push above 80% when the backend systems are reliable. In healthcare, financial services, or legal workflows, 60-70% may be the right answer if the remaining calls are being escalated for good policy reasons.

Containment gets dangerous when it becomes the trophy metric. Make the handoff harder and the number may rise. The caller experience usually gets worse.

Containment Metrics to Track

| Metric | Formula | Good Target | Use |
| --- | --- | --- | --- |
| Containment Rate | AI-contained calls / total calls | 70-80% | Measures AI handling coverage |
| Verified Resolution Rate | Resolved calls with no repeat contact / total calls | 65-80% | Filters false containment |
| Escalation Rate | Human escalations / total calls | 10-30% | Shows remaining human workload |
| Incorrect Deflection Rate | Calls not escalated when they should have been / total calls | <2% | Safety and CX guardrail |
| Repeat Contact Rate | Same-issue repeat calls within 48-72 hours / resolved calls | <10% | Finds fake resolution |
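The two guardrail metrics hinge on the repeat-contact window. A rough sketch of the window check, assuming calls arrive as (caller_id, issue, timestamp) tuples sorted by time; a production version would scan all inbound traffic for the follow-up contact, not just resolved calls.

```python
from datetime import timedelta

def repeat_contact_rate(resolved_calls, window_hours=72):
    """Share of resolved calls that see a same-issue callback inside the window.

    resolved_calls: list of (caller_id, issue, timestamp) tuples sorted by
    timestamp, where timestamp is a datetime.
    """
    window = timedelta(hours=window_hours)
    repeats = 0
    for i, (caller, issue, ts) in enumerate(resolved_calls):
        for later_caller, later_issue, later_ts in resolved_calls[i + 1:]:
            if later_ts - ts > window:
                break  # input is time-sorted, so nothing later qualifies
            if (later_caller, later_issue) == (caller, issue):
                repeats += 1
                break
    return 100.0 * repeats / max(len(resolved_calls), 1)
```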

I used to think containment was the cleanest executive metric. Too many "successful" calls changed my mind: the user had not been helped; they had just stopped trying. Containment needs repeat contact, sentiment, and task success beside it.

Category 2: Sentiment Analytics

Voice agent sentiment analysis should not stop at positive, neutral, and negative transcript labels.

In voice, frustration often shows up before the words turn negative. The caller repeats themselves, talks over the agent, waits a beat too long, then starts clipping their answers. Sometimes the best signal is painfully simple: the caller says "hello?" because the agent left too much dead air.

Those are analytics signals, not soft UX impressions.

| Frustration Signal | What to Detect | Why It Matters |
| --- | --- | --- |
| Repetition | Same intent or phrase 3+ times | Agent is not understanding or acknowledging |
| Interruption | Caller talks over agent repeatedly | Agent is too slow, too verbose, or wrong |
| Long silence | Caller pauses after agent response | Confusion, dead air, or unclear next step |
| Volume / tone shift | Louder, sharper, or more clipped speech | Audio signal of frustration |
| Negative language | "This is wrong", "operator", "representative", "frustrated" | Explicit dissatisfaction |
| Rage clicks equivalent | Rapid DTMF presses, repeated menu choices | IVR-style escape behavior |

Frustration Signal Rate = Calls with 2+ frustration signals / Total calls x 100

For most production voice agents, a negative sentiment rate below 5-10% is healthy. A spike above 15% usually deserves investigation, especially if the spike is concentrated in one intent, one provider, or one prompt version.

How to Find Frustrated Customers in Voice Bot Calls

Start with a conservative rule:

Flag for Review =
  repeated_user_intent >= 3
  OR interruption_count >= 3
  OR negative_sentiment_terms >= 2
  OR abandonment_after_failure = true
  OR explicit_human_request_after_agent_failure = true
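Translated into code, the rule is a handful of boolean checks. A sketch assuming per-call counters with these (hypothetical) field names already exist in your call records:

```python
def flag_for_review(call: dict) -> bool:
    """Flag a call when any strong frustration signal fires.
    Field names are placeholders; map them to your own call schema."""
    return (
        call["repeated_user_intent"] >= 3
        or call["interruption_count"] >= 3
        or call["negative_sentiment_terms"] >= 2
        or call["abandonment_after_failure"]
        or call["explicit_human_request_after_agent_failure"]
    )

# Build the QA queue: review_queue = [c for c in calls if flag_for_review(c)]
```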

Then sample the flagged calls manually for one week. You will find false positives. That is fine. Tighten the rule only after you know which signals correlate with real frustration in your call types.

The goal is not to label every caller emotion perfectly. The goal is to create a reliable queue of calls that a QA lead, product manager, or support owner should inspect.

Category 3: Flow Analytics

Flow analytics measures whether the conversation moved through the expected stages.

For a simple appointment scheduling agent, the expected flow might be:

Greeting -> Intent confirmation -> Eligibility / identity check -> Slot collection -> Booking -> Confirmation

For a billing agent, it might be:

Intent -> Authentication -> Account lookup -> Explanation -> Payment / adjustment / escalation -> Confirmation

Flow adherence is the percentage of calls that follow the expected stages without skipping required steps, looping, or dropping off.

Flow Adherence = Calls completing required stages / Eligible calls x 100

Flow analytics is where aggregate dashboards usually break down. A 78% task success rate is useful, but it does not tell you whether users are failing at authentication, tool execution, payment confirmation, or final handoff.

Track stage-level conversion instead.

| Stage | Metric | Example Alert |
| --- | --- | --- |
| Intent capture | % calls with recognized supported intent | Drops below baseline by 5% |
| Authentication | % eligible calls passing auth | Failure rate doubles |
| Required slot collection | % calls collecting all needed fields | Missing slot rate above 10% |
| Tool execution | % tool calls succeeding | Error rate above 3% |
| Confirmation | % calls ending with explicit confirmation | Falls below 85% |

Drop-off Analysis

Drop-off rate measures where users abandon the conversation.

Drop-off Rate = Users who abandon at stage / Users who reached stage x 100

Do not report one blended drop-off number. Segment it by stage:

| Drop-off Type | Likely Cause | First Debug Step |
| --- | --- | --- |
| Early drop-off under 30 seconds | Bad greeting, wrong entrypoint, poor audio, caller surprise | Listen to first-turn calls |
| Mid-flow drop-off | Repetition, latency, missing intent, confusing question | Inspect loops and fallback intents |
| Authentication drop-off | Auth too strict, unclear instructions, tool failure | Compare auth failures by caller segment |
| Near-completion drop-off | Payment, confirmation, or policy friction | Review final two stages |
| Silence-driven drop-off | Dead air or long latency | Inspect turn latency and timeout handling |
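A stage funnel makes both flow adherence and per-stage drop-off concrete. A sketch assuming each call record carries the list of stages it actually reached, with illustrative stage names:

```python
STAGES = ["intent", "auth", "slots", "tool", "confirmation"]  # per-flow stage map

def funnel_report(calls: list[list[str]], stages=STAGES) -> None:
    """Print per-stage reach and drop-off relative to the prior stage."""
    prev = len(calls)
    for stage in stages:
        reached = sum(1 for call in calls if stage in call)
        drop_off = 100.0 * (prev - reached) / max(prev, 1)
        print(f"{stage:<14} reached={reached:<6} drop-off={drop_off:5.1f}%")
        prev = reached

def flow_adherence(calls: list[list[str]], required=STAGES) -> float:
    """Percentage of calls that completed every required stage."""
    done = sum(1 for call in calls if set(required) <= set(call))
    return 100.0 * done / max(len(calls), 1)
```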

This is where voice analytics becomes product analytics. If 30% of callers abandon during identity verification, the issue is not "the AI voice agent is bad." The issue is a specific flow step.

For a deeper breakdown of stage modeling and path adherence, use the Conversational Flow Measurement guide. For production incidents where a flow suddenly degrades, pair the flow funnel with voice agent outage monitoring so the alert carries the failing stage, prompt version, and recent deploy context.

Category 4: Quality Analytics

Quality analytics combines correctness, speed, policy adherence, and conversation experience.

The cleanest production pattern is a composite score with visible sub-scores. Do not hide the inputs.

Quality Score =
  (Accuracy Score x 0.30)
  + (Latency Score x 0.20)
  + (Flow Score x 0.25)
  + (Containment / Resolution Score x 0.25)

Use the weights as a starting point, not doctrine. A healthcare triage agent may weight policy adherence and escalation appropriateness higher. An e-commerce order status agent may weight latency and task completion higher.
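One way to keep the weights as configuration rather than doctrine: a sketch where the sub-score names follow the formula above, so a regulated team can reweight without touching the scoring code.

```python
WEIGHTS = {  # starting-point weights from the formula above; tune per domain
    "accuracy": 0.30,
    "latency": 0.20,
    "flow": 0.25,
    "resolution": 0.25,
}

def quality_score(sub_scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted composite of 0-100 sub-scores; keep the sub-scores visible."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * sub_scores[name] for name, w in weights.items())

# quality_score({"accuracy": 90, "latency": 70, "flow": 85, "resolution": 80})
# -> 82.25
```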

| Score Band | Interpretation | Action |
| --- | --- | --- |
| 80-100 | Strong production quality | Monitor regressions and edge cases |
| 70-79 | Good but uneven | Segment by intent and fix weak paths |
| 60-69 | Risky | Prioritize top failure categories before scaling |
| <60 | Not production-ready | Pause expansion or route traffic to humans |

Score Voice Agent Call Quality

At call level, score quality with five checks:

| Dimension | Question | Scoring Method |
| --- | --- | --- |
| Intent | Did the agent identify the user's goal? | Intent match against human/ground-truth label |
| Answer correctness | Was the answer or action correct? | Evaluation against policy, knowledge base, or task result |
| Conversation control | Did the agent avoid loops and recover from errors? | Flow events, repetition count, fallback count |
| Latency | Did responses arrive fast enough for natural conversation? | P50/P90/P95 turn latency |
| Experience | Did the caller sound satisfied or frustrated? | Sentiment, interruption, silence, completion pattern |

This is the minimum viable QA rubric. Add domain checks for regulated workflows: disclosure delivery, consent, PHI/PII handling, PCI redaction, escalation rules, and prohibited advice.

If you need the broader evaluation taxonomy, start with Voice Agent Evaluation Metrics. If the low quality score comes from missed intents, tool failures, or error dashboards, use the Debugging Voice Agents guide to trace the failing call path.

How to Calculate Each Metric

Here is the working formula set.

| Metric | Formula | Notes |
| --- | --- | --- |
| Containment Rate | AI-contained calls / total calls x 100 | Exclude policy-required human handoffs when comparing agent capability |
| Verified Resolution Rate | Resolved calls with no repeat contact / total calls x 100 | Use 48-72 hour repeat-contact windows |
| Negative Sentiment Rate | Negative or frustrated calls / total calls x 100 | Combine transcript and audio signals |
| Frustration Signal Rate | Calls with 2+ frustration signals / total calls x 100 | Useful for QA review queues |
| Drop-off Rate | Users abandoning at stage / users reaching stage x 100 | Always segment by stage |
| Flow Adherence | Calls completing required stages / eligible calls x 100 | Different flows need different stage maps |
| Task Completion Rate | Successful task completions / eligible calls x 100 | Define completion criteria per intent |
| Tool Success Rate | Successful tool calls / attempted tool calls x 100 | Segment by tool and provider |
| Quality Score | Weighted composite of accuracy, latency, flow, resolution | Keep sub-scores visible |

Two implementation details matter more than teams expect:

  1. Define the denominator. Is containment measured across all calls, eligible calls, supported intents, or non-policy escalations? Pick one and label it.
  2. Segment by intent. A blended metric hides the reason. Appointment scheduling, billing, eligibility, and troubleshooting should not share one undifferentiated score.
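Both details fit in a few lines of aggregation code. A sketch that reuses the contained / not_contained / excluded labels from the containment section and reports one rate per intent, with the denominator stated explicitly:

```python
from collections import defaultdict

def containment_by_intent(calls):
    """calls: iterable of (intent, label) pairs, where label is the
    contained / not_contained / excluded classification from earlier.
    Denominator: non-excluded calls per intent."""
    tally = defaultdict(lambda: {"contained": 0, "eligible": 0})
    for intent, label in calls:
        if label == "excluded":
            continue  # policy-required handoffs stay out of the denominator
        tally[intent]["eligible"] += 1
        tally[intent]["contained"] += (label == "contained")
    return {
        intent: 100.0 * t["contained"] / t["eligible"]
        for intent, t in tally.items()
        if t["eligible"]
    }
```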

Benchmarks by Industry and Use Case

Use benchmarks as starting points. The right threshold depends on task complexity, caller risk, and how much authority the AI agent has.

| Use Case | Healthy Containment | Healthy Task Completion | Notes |
| --- | --- | --- | --- |
| Appointment scheduling | 80-90% | 85-95% | Well-defined slots and clear completion criteria |
| Order status | 80-90% | 85-95% | Strong fit for automation if systems are reliable |
| Billing explanation | 65-80% | 75-90% | More handoffs due to account nuance and disputes |
| Healthcare intake | 60-75% | 70-85% | Correct escalation may be more important than containment |
| Financial services support | 60-80% | 75-90% | Compliance, identity, and policy checks lower containment |
| Technical troubleshooting | 55-75% | 65-85% | Multi-step diagnosis creates more drop-off |

The benchmark that matters most is your own baseline by intent and prompt version. Run two weeks of measurement, define normal ranges, then alert on deviation.
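One plausible way to turn that two-week baseline into thresholds, assuming one metric value per day per intent; the two-standard-deviation band is an assumption to tune, not a standard:

```python
import statistics

def normal_range(daily_values: list[float], k: float = 2.0) -> tuple[float, float]:
    """Baseline band from ~two weeks of daily per-intent values:
    mean +/- k standard deviations."""
    mean = statistics.mean(daily_values)
    spread = statistics.stdev(daily_values) if len(daily_values) > 1 else 0.0
    return mean - k * spread, mean + k * spread

# containment_by_day = [74.1, 76.3, 75.0, ...]  # one value per day, per intent
# low, high = normal_range(containment_by_day)  # alert when today falls outside
```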

Building Your Analytics Dashboard

A good voice agent analytics dashboard has three layers:

  1. Executive health: Is the agent improving or hurting the business?
  2. Operations triage: Which intents, flows, or cohorts need attention today?
  3. Call-level drilldown: Which calls prove the pattern and explain root cause?

Do not put 50 charts on the first screen. More tiles usually means more arguing about which tile matters. Start with one row per metric family.

| Dashboard Row | Primary Chart | Drilldown |
| --- | --- | --- |
| Containment | Contained vs escalated by intent | Escalation reasons and repeat-contact calls |
| Sentiment | Negative sentiment trend | Frustration-flagged call queue |
| Flow | Funnel by conversation stage | Stage-level drop-off examples |
| Quality | Quality score distribution | Lowest-scoring calls with reason codes |

Then add queryable analytics:

  • "Show calls where the user asked for a human after the agent repeated itself."
  • "Find all calls where billing intent reached authentication but never completed."
  • "Show noisy calls with low ASR confidence and negative sentiment."
  • "Cluster the top new questions from the last 500 calls."

This is the part that generic analytics tools usually miss. Voice agent teams need both dashboard metrics and natural-language exploration across calls.

Alerting and Anomaly Detection

Alert on sustained deviations, not single-call failures.

| Alert | Warning | Critical | Route To |
| --- | --- | --- | --- |
| Containment drop | >5% below baseline for 30 minutes | >15% below baseline | Ops / product |
| Negative sentiment spike | 2x baseline | >25% negative calls | CX / QA |
| Drop-off increase | >10% increase at one stage | >25% increase at one stage | Product / engineering |
| Tool failure rate | >3% for one tool | >10% for one tool | Engineering |
| P95 turn latency | >1.5x baseline | >3 seconds for 15 minutes | Engineering |
| Quality score | Median below 70 | Median below 60 | QA / product |
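Sustained is the load-bearing word in that table. A sketch of the containment-drop warning row, assuming the metric arrives in 5-minute buckets so six consecutive low buckets cover the 30-minute window:

```python
def sustained_drop(samples: list[float], baseline: float,
                   rel_drop: float = 0.05, min_points: int = 6) -> bool:
    """True when the last min_points samples all sit more than rel_drop
    below baseline, e.g. six 5-minute buckets = 30 minutes of containment
    running >5% under its baseline (the warning row above)."""
    recent = samples[-min_points:]
    return len(recent) == min_points and all(
        s < baseline * (1 - rel_drop) for s in recent
    )

# sustained_drop(last_30_min_buckets, baseline=76.0) -> route to ops / product
```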

The alert should include the likely first debug path. "Negative sentiment is up" is too vague to help. A better alert says: billing calls are souring after authentication, interruption count is up 41%, and here are the first five calls to open.

Flaws But Not Dealbreakers

Metrics can hide the truth. A high containment rate can mean the agent solved calls. It can also mean users gave up. I would take 68% honest containment over 85% containment created by trapped callers. Pair the number with repeat contact, sentiment, and task success.

Sentiment is noisy. Accent, culture, background noise, and call context all affect sentiment models. Treat sentiment as a triage signal and a trend line, not a courtroom verdict on one caller.

Benchmarks are not universal. A healthcare triage agent should not chase the same containment rate as an order-status bot. Use benchmarks to start the conversation, then set thresholds by intent and risk.

Manual review still matters. Automation tells you where to look. Humans still need to inspect samples, update rubrics, and decide whether a flagged pattern is actually harmful.

The Practical Starting Point

If you are building this from scratch, do not start with the full dashboard.

Start with 20 production calls per major intent and score them manually across the four metric families:

| Call | Contained? | Sentiment | Flow issue? | Quality issue? | Root cause |
| --- | --- | --- | --- | --- | --- |
| 1 | Yes | Neutral | None | Slow response | LLM latency |
| 2 | No | Negative | Auth loop | Tool error | Auth API timeout |
| 3 | Yes | Negative | Repetition | Wrong answer | Knowledge gap |

After 100-200 scored calls, automate the labels that match your real failure modes. This prevents the common mistake: instrumenting generic metrics before you know which failures matter for your agent.

For teams still validating the agent itself, pair this analytics setup with Call Center Voice Agent Testing and Background Noise Testing KPIs. Analytics tells you what failed in production; testing lets you reproduce and prevent the same failure before the next deploy.

Cite This Guide

If you reference this article, cite it as:

Hamming's voice agent analytics metric dictionary defines containment, sentiment, flow, and quality metrics with formulas, denominators, benchmarks, and alert thresholds for production voice agents.

Frequently Asked Questions

What is voice agent analytics?

Voice agent analytics is the measurement of conversation outcomes and voice-specific failure signals across production calls, including containment, sentiment, flow adherence, drop-off, latency, and call quality. Hamming recommends grouping these metrics into Containment, Sentiment, Flow, and Quality so each failure routes to the right owner.

How do you calculate voice agent containment rate?

Voice agent containment rate is the percentage of calls handled by the AI without human escalation. The formula is AI-contained calls divided by total calls times 100, but Hamming recommends pairing it with repeat-contact and task-success metrics so abandoned or falsely resolved calls do not inflate containment.

How do you detect frustrated customers in voice bot calls?

Detect frustrated customers by combining transcript and audio signals: repeated phrases, caller interruptions, long silences, louder or sharper tone, negative language, abandonment, and repeated requests for a human. Hamming recommends flagging calls with 2 or more frustration signals for QA review before using sentiment scores for automated decisions.

What causes high voice agent drop-off rates?

High voice agent drop-off rates usually come from bad entrypoints, confusing questions, repeated recognition failures, authentication friction, tool errors, or latency-driven dead air. Hamming recommends measuring drop-off by conversation stage because a single blended abandonment rate will not show whether the break happened at greeting, authentication, slot collection, tool execution, or confirmation.

How do you measure flow adherence?

Measure flow adherence by defining required stages for each intent, then calculating calls that complete those stages divided by eligible calls times 100. Hamming recommends segmenting flow adherence by intent because appointment scheduling, billing, and troubleshooting flows have different required stages.

What is a good voice agent quality score?

A good voice agent quality score is usually 70-79, while 80+ indicates strong production quality. Hamming recommends building the score from visible sub-scores such as answer accuracy, latency, flow adherence, containment or resolution, tool success, and policy compliance rather than hiding everything inside one opaque number.

What should a call center voice analytics dashboard show?

A call center voice analytics dashboard should show containment by intent, negative sentiment trend, stage-level flow drop-off, quality score distribution, tool failures, and call-level drilldowns. Hamming recommends one dashboard row per metric family so executives can see health while operators can drill into the exact calls causing each pattern.

How often should you review voice agent analytics?

Review voice agent analytics at three cadences: real-time alerts for severe regressions, daily triage for intent-level anomalies, and weekly analysis for product and knowledge-base patterns. Hamming recommends waiting for a two-week baseline before tightening thresholds so alerts reflect sustained deviations rather than normal traffic variance.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, grew AI-powered sales program to 100s of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”