Voice Agent Analytics Metrics: Containment to Quality

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

May 9, 2026 · Updated May 9, 2026 · 19 min read

If your voice agent handles fewer than 100 calls a month, I would not start with a dashboard. I would pull ten calls, listen to them end to end, and write down the moments that made me wince. That will teach you more than a polished chart.

This guide is for the next stage, when the agent is taking enough calls that "just listen to them" stops being a real process. At that point, call volume is background noise. The useful questions get sharper: did the agent resolve the issue, did the caller get annoyed, where did the flow break, and is quality getting better for the intents that actually matter?

The scope is deliberately narrow: this is a voice agent analytics metrics guide for containment rate, sentiment analysis, drop-off analysis, flow adherence, and call quality scoring. If you need the broader KPI catalog or a full dashboard layout, start with the companion guides instead.

That is where voice agent analytics earns its keep. The dashboard should make one uncomfortable thing easy to find: calls that looked fine in aggregate and still failed for the person on the phone.

Voice agent analytics is the measurement of conversation outcomes and voice-specific failure signals across production calls, including containment, sentiment, flow adherence, drop-off, latency, tool behavior, and call quality.

For years, call center voice analytics mostly meant post-call transcription and keyword search. That was enough when managers were coaching human agents after the fact. With AI agents, the failure can happen between the words. The agent decides, speaks, calls a tool, waits, recovers, and sometimes delivers the wrong answer with perfect confidence. The uncomfortable details live in the audio layer: silence, interruptions, latency, and tone.

The transcript trap: A transcript tells you what was said. It will miss some of the most important parts of a voice call: the caller talking over the agent, the awkward silence before "hello?", or the moment someone gives up after repeating the same request three times. Production voice analytics needs transcript, audio, timing, tool, and outcome signals together.

TL;DR: Use this voice agent analytics metric dictionary to measure four production outcomes:

  • Containment - Did the AI resolve the call without inappropriate human escalation?
  • Sentiment - Did the caller become more or less frustrated during the call?
  • Flow - Did the conversation follow the expected path, or did it loop, stall, or drop off?
  • Quality - Was the agent accurate, fast, policy-compliant, and easy to talk to?

Activity metrics tell you what happened. These outcome metrics tell you whether it worked.

Methodology Note: The benchmarks, formulas, and metric recommendations in this guide are derived from Hamming's analysis of 4M+ production voice agent calls across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions.

We also used anonymized customer and prospect discovery patterns to identify which analytics questions teams repeatedly ask when deploying production voice agents.

Last Updated: May 2026

Why Voice Agent Analytics Matter

Most teams start with the metrics they already know from human contact centers: average handle time, abandonment rate, call volume, transfer rate, and maybe CSAT.

Those metrics still matter. They just do not explain AI failures. We have seen dashboards where every top-line metric looked acceptable, but the failed calls were obvious after listening to five recordings.

A voice agent can keep average handle time flat while hallucinating policy answers. It can reduce transfers by trapping callers in a bad loop. It can show high containment because users gave up before reaching a human.

We call this metric theater: tracking enough numbers to feel informed, but not the numbers that change decisions.

The first useful question is not "how many calls did the agent handle?" It is:

Which calls looked successful in aggregate but failed for the caller?

That question requires four metric families.

| Metric Family | Question It Answers | Example Failure It Catches |
| --- | --- | --- |
| Containment | Did AI resolve the call? | False containment, inappropriate deflection, missing handoff |
| Sentiment | How did the caller feel? | Rising frustration, repeated interruption, angry hangup |
| Flow | Where did the conversation break? | Authentication loop, missing slot, dead-end branch |
| Quality | Was the agent actually good? | Correct intent but wrong answer, slow response, policy miss |

This is especially important for high-volume call center voice analytics. Cross-call patterns become the product feedback loop: repeated questions reveal documentation gaps, drop-off points reveal broken flows, and sentiment spikes reveal issues before weekly CSAT reports catch up.

The best analytics reviews I have seen all end with a concrete change: add a missing FAQ to the knowledge base, fix a flaky tool, shorten an authentication step, or roll back a prompt. If a metric never leads to one of those moves, it is probably dashboard decoration.

The Voice Agent Analytics Metric Map

This metric map groups production voice agent analytics into four families:

| Category | Primary Metric | Supporting Signals | Owner |
| --- | --- | --- | --- |
| Containment | AI-resolved calls / total calls | Escalation reason, repeat contact, task success | Operations |
| Sentiment | Frustration and satisfaction signals | Repetition, interruption, tone, volume, negative language | CX / QA |
| Flow | Stage completion and path adherence | Drop-off by step, loop count, missing slots, fallback intents | Product |
| Quality | Composite call score | Accuracy, latency, tool success, policy adherence, audio quality | Engineering / QA |

The map matters because each metric family has a different remediation path.

Low containment usually means the agent is missing capability, knowledge, or authority. Negative sentiment means the caller experience is bad, even if the task completes. Poor flow means the conversation design is broken. Low quality means the system may be slow, inaccurate, noncompliant, or hard to understand.

Do not combine these too early. A single "voice agent score" is useful for executives, but the team fixing the agent needs the component metrics.

Category 1: Containment Analytics

Containment rate measures the percentage of calls handled by the AI agent without escalation to a human.

Containment rate is useful only when it means "resolved by AI," not merely "not transferred." Hamming recommends treating repeat contact, abandonment, and inappropriate deflection as containment guardrails so the metric cannot improve by making the caller experience worse.

Containment Rate = (AI-contained calls / Total calls) x 100

That formula is simple. The hard part is defining a contained call correctly.

A call should count as contained only when the user goal was resolved or appropriately completed by the AI. A caller who hangs up after three failed attempts should not improve containment. A caller who asks for a human because the task requires a licensed representative should not hurt containment.

Use this classification:

| Outcome | Count as Contained? | Why |
| --- | --- | --- |
| Agent completes the user's task | Yes | The AI resolved the issue |
| Agent answers the question and caller confirms | Yes | The intent was satisfied |
| Agent escalates because policy requires a human | Neutral / excluded | Escalation was correct behavior |
| Caller asks for a human after repeated failures | No | The AI failed to resolve |
| Caller hangs up mid-task | No | Treat as abandonment unless verified complete |
| Caller calls back for same issue within 48-72 hours | No | Original resolution did not hold |
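Here is a minimal sketch of that classification in code. The field names on the call record are illustrative, not from any particular platform; the one non-negotiable move is dropping policy-required escalations from the denominator before computing the rate.

```python
from dataclasses import dataclass

@dataclass
class Call:
    task_completed: bool           # AI completed the user's task
    caller_confirmed: bool         # caller confirmed the answer worked
    policy_escalation: bool        # handoff required by policy (correct behavior)
    escalated_after_failure: bool  # caller asked for a human after repeated failures
    abandoned_mid_task: bool       # hung up before the task finished
    repeat_within_72h: bool        # same-issue callback inside the window

def classify(call: Call) -> str:
    """Map one call to contained / not_contained / excluded per the table above."""
    if call.policy_escalation:
        return "excluded"  # correct escalation; drop from the denominator
    if call.escalated_after_failure or call.abandoned_mid_task or call.repeat_within_72h:
        return "not_contained"
    if call.task_completed or call.caller_confirmed:
        return "contained"
    return "not_contained"  # default: unresolved calls never inflate the number

def containment_rate(calls: list[Call]) -> float:
    labels = [classify(c) for c in calls]
    eligible = [label for label in labels if label != "excluded"]
    return 100.0 * eligible.count("contained") / max(len(eligible), 1)
```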

Good benchmark: 70-80% containment for standard customer support flows. Narrow transactional flows can push above 80% when the backend systems are reliable. In healthcare, financial services, or legal workflows, 60-70% may be the right answer if the remaining calls are being escalated for good policy reasons.

Containment gets dangerous when it becomes the trophy metric. Make the handoff harder and the number may rise. The caller experience usually gets worse.

Containment Metrics to Track

| Metric | Formula | Good Target | Use |
| --- | --- | --- | --- |
| Containment Rate | AI-contained calls / total calls | 70-80% | Measures AI handling coverage |
| Verified Resolution Rate | Resolved calls with no repeat contact / total calls | 65-80% | Filters false containment |
| Escalation Rate | Human escalations / total calls | 10-30% | Shows remaining human workload |
| Incorrect Deflection Rate | Calls not escalated when they should have been / total calls | <2% | Safety and CX guardrail |
| Repeat Contact Rate | Same-issue repeat calls within 48-72 hours / resolved calls | <10% | Finds fake resolution |
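The two guardrail metrics hinge on the repeat-contact window. A rough sketch of the window check, assuming calls arrive as (caller_id, issue, timestamp) tuples sorted by time; a production version would scan all inbound traffic for the follow-up contact, not just resolved calls.

```python
from datetime import timedelta

def repeat_contact_rate(resolved_calls, window_hours=72):
    """Share of resolved calls that see a same-issue callback inside the window.

    resolved_calls: list of (caller_id, issue, timestamp) tuples sorted by
    timestamp, where timestamp is a datetime.
    """
    window = timedelta(hours=window_hours)
    repeats = 0
    for i, (caller, issue, ts) in enumerate(resolved_calls):
        for later_caller, later_issue, later_ts in resolved_calls[i + 1:]:
            if later_ts - ts > window:
                break  # input is time-sorted, so nothing later qualifies
            if (later_caller, later_issue) == (caller, issue):
                repeats += 1
                break
    return 100.0 * repeats / max(len(resolved_calls), 1)
```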

I used to think containment was the cleanest executive metric. Too many "successful" calls changed my mind: the user had not been helped; they had just stopped trying. Containment needs repeat contact, sentiment, and task success beside it.

Category 2: Sentiment Analytics

Voice agent sentiment analysis should not stop at positive, neutral, and negative transcript labels.

In voice, frustration often shows up before the words turn negative. The caller repeats themselves, talks over the agent, waits a beat too long, then starts clipping their answers. Sometimes the best signal is painfully simple: the caller says "hello?" because the agent left too much dead air.

Those are analytics signals, not soft UX impressions.

| Frustration Signal | What to Detect | Why It Matters |
| --- | --- | --- |
| Repetition | Same intent or phrase 3+ times | Agent is not understanding or acknowledging |
| Interruption | Caller talks over agent repeatedly | Agent is too slow, too verbose, or wrong |
| Long silence | Caller pauses after agent response | Confusion, dead air, or unclear next step |
| Volume / tone shift | Louder, sharper, or more clipped speech | Audio signal of frustration |
| Negative language | "This is wrong", "operator", "representative", "frustrated" | Explicit dissatisfaction |
| Rage clicks equivalent | Rapid DTMF presses, repeated menu choices | IVR-style escape behavior |

Frustration Signal Rate = Calls with 2+ frustration signals / Total calls x 100

For most production voice agents, a negative sentiment rate below 5-10% is healthy. A spike above 15% usually deserves investigation, especially if the spike is concentrated in one intent, one provider, or one prompt version.

How to Find Frustrated Customers in Voice Bot Calls

Start with a conservative rule:

Flag for Review =
  repeated_user_intent >= 3
  OR interruption_count >= 3
  OR negative_sentiment_terms >= 2
  OR abandonment_after_failure = true
  OR explicit_human_request_after_agent_failure = true
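Translated into code, the rule is a handful of boolean checks. A sketch assuming per-call counters with these (hypothetical) field names already exist in your call records:

```python
def flag_for_review(call: dict) -> bool:
    """Flag a call when any strong frustration signal fires.
    Field names are placeholders; map them to your own call schema."""
    return (
        call["repeated_user_intent"] >= 3
        or call["interruption_count"] >= 3
        or call["negative_sentiment_terms"] >= 2
        or call["abandonment_after_failure"]
        or call["explicit_human_request_after_agent_failure"]
    )

# Build the QA queue: review_queue = [c for c in calls if flag_for_review(c)]
```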

Then sample the flagged calls manually for one week. You will find false positives. That is fine. Tighten the rule only after you know which signals correlate with real frustration in your call types.

The goal is not to label every caller emotion perfectly. The goal is to create a reliable queue of calls that a QA lead, product manager, or support owner should inspect.

Category 3: Flow Analytics

Flow analytics measures whether the conversation moved through the expected stages.

For a simple appointment scheduling agent, the expected flow might be:

Greeting -> Intent confirmation -> Eligibility / identity check -> Slot collection -> Booking -> Confirmation

For a billing agent, it might be:

Intent -> Authentication -> Account lookup -> Explanation -> Payment / adjustment / escalation -> Confirmation

Flow adherence is the percentage of calls that follow the expected stages without skipping required steps, looping, or dropping off.

Flow Adherence = Calls completing required stages / Eligible calls x 100

Flow analytics is where aggregate dashboards usually break down. A 78% task success rate is useful, but it does not tell you whether users are failing at authentication, tool execution, payment confirmation, or final handoff.

Track stage-level conversion instead.

| Stage | Metric | Example Alert |
| --- | --- | --- |
| Intent capture | % calls with recognized supported intent | Drops below baseline by 5% |
| Authentication | % eligible calls passing auth | Failure rate doubles |
| Required slot collection | % calls collecting all needed fields | Missing slot rate above 10% |
| Tool execution | % tool calls succeeding | Error rate above 3% |
| Confirmation | % calls ending with explicit confirmation | Falls below 85% |

Drop-off Analysis

Drop-off rate measures where users abandon the conversation.

Drop-off Rate = Users who abandon at stage / Users who reached stage x 100

Do not report one blended drop-off number. Segment it by stage:

| Drop-off Type | Likely Cause | First Debug Step |
| --- | --- | --- |
| Early drop-off under 30 seconds | Bad greeting, wrong entrypoint, poor audio, caller surprise | Listen to first-turn calls |
| Mid-flow drop-off | Repetition, latency, missing intent, confusing question | Inspect loops and fallback intents |
| Authentication drop-off | Auth too strict, unclear instructions, tool failure | Compare auth failures by caller segment |
| Near-completion drop-off | Payment, confirmation, or policy friction | Review final two stages |
| Silence-driven drop-off | Dead air or long latency | Inspect turn latency and timeout handling |
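A stage funnel makes both flow adherence and per-stage drop-off concrete. A sketch assuming each call record carries the list of stages it actually reached, with illustrative stage names:

```python
STAGES = ["intent", "auth", "slots", "tool", "confirmation"]  # per-flow stage map

def funnel_report(calls: list[list[str]], stages=STAGES) -> None:
    """Print per-stage reach and drop-off relative to the prior stage."""
    prev = len(calls)
    for stage in stages:
        reached = sum(1 for call in calls if stage in call)
        drop_off = 100.0 * (prev - reached) / max(prev, 1)
        print(f"{stage:<14} reached={reached:<6} drop-off={drop_off:5.1f}%")
        prev = reached

def flow_adherence(calls: list[list[str]], required=STAGES) -> float:
    """Percentage of calls that completed every required stage."""
    done = sum(1 for call in calls if set(required) <= set(call))
    return 100.0 * done / max(len(calls), 1)
```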

This is where voice analytics becomes product analytics. If 30% of callers abandon during identity verification, the issue is not "the AI voice agent is bad." The issue is a specific flow step.

For a deeper breakdown of stage modeling and path adherence, use the Conversational Flow Measurement guide. For production incidents where a flow suddenly degrades, pair the flow funnel with voice agent outage monitoring so the alert carries the failing stage, prompt version, and recent deploy context.

Category 4: Quality Analytics

Quality analytics combines correctness, speed, policy adherence, and conversation experience.

The cleanest production pattern is a composite score with visible sub-scores. Do not hide the inputs.

Quality Score =
  (Accuracy Score x 0.30)
  + (Latency Score x 0.20)
  + (Flow Score x 0.25)
  + (Containment / Resolution Score x 0.25)

Use the weights as a starting point, not doctrine. A healthcare triage agent may weight policy adherence and escalation appropriateness higher. An e-commerce order status agent may weight latency and task completion higher.
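One way to keep the weights as configuration rather than doctrine: a sketch where the sub-score names follow the formula above, so a regulated team can reweight without touching the scoring code.

```python
WEIGHTS = {  # starting-point weights from the formula above; tune per domain
    "accuracy": 0.30,
    "latency": 0.20,
    "flow": 0.25,
    "resolution": 0.25,
}

def quality_score(sub_scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted composite of 0-100 sub-scores; keep the sub-scores visible."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * sub_scores[name] for name, w in weights.items())

# quality_score({"accuracy": 90, "latency": 70, "flow": 85, "resolution": 80})
# -> 82.25
```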

| Score Band | Interpretation | Action |
| --- | --- | --- |
| 80-100 | Strong production quality | Monitor regressions and edge cases |
| 70-79 | Good but uneven | Segment by intent and fix weak paths |
| 60-69 | Risky | Prioritize top failure categories before scaling |
| <60 | Not production-ready | Pause expansion or route traffic to humans |

Score Voice Agent Call Quality

At call level, score quality with five checks:

| Dimension | Question | Scoring Method |
| --- | --- | --- |
| Intent | Did the agent identify the user's goal? | Intent match against human/ground-truth label |
| Answer correctness | Was the answer or action correct? | Evaluation against policy, knowledge base, or task result |
| Conversation control | Did the agent avoid loops and recover from errors? | Flow events, repetition count, fallback count |
| Latency | Did responses arrive fast enough for natural conversation? | P50/P90/P95 turn latency |
| Experience | Did the caller sound satisfied or frustrated? | Sentiment, interruption, silence, completion pattern |

This is the minimum viable QA rubric. Add domain checks for regulated workflows: disclosure delivery, consent, PHI/PII handling, PCI redaction, escalation rules, and prohibited advice.

If you need the broader evaluation taxonomy, start with Voice Agent Evaluation Metrics. If the low quality score comes from missed intents, tool failures, or error dashboards, use the Debugging Voice Agents guide to trace the failing call path.

How to Calculate Each Metric

Here is the working formula set.

| Metric | Formula | Notes |
| --- | --- | --- |
| Containment Rate | AI-contained calls / total calls x 100 | Exclude policy-required human handoffs when comparing agent capability |
| Verified Resolution Rate | Resolved calls with no repeat contact / total calls x 100 | Use 48-72 hour repeat-contact windows |
| Negative Sentiment Rate | Negative or frustrated calls / total calls x 100 | Combine transcript and audio signals |
| Frustration Signal Rate | Calls with 2+ frustration signals / total calls x 100 | Useful for QA review queues |
| Drop-off Rate | Users abandoning at stage / users reaching stage x 100 | Always segment by stage |
| Flow Adherence | Calls completing required stages / eligible calls x 100 | Different flows need different stage maps |
| Task Completion Rate | Successful task completions / eligible calls x 100 | Define completion criteria per intent |
| Tool Success Rate | Successful tool calls / attempted tool calls x 100 | Segment by tool and provider |
| Quality Score | Weighted composite of accuracy, latency, flow, resolution | Keep sub-scores visible |

Two implementation details matter more than teams expect:

  1. Define the denominator. Is containment measured across all calls, eligible calls, supported intents, or non-policy escalations? Pick one and label it.
  2. Segment by intent. A blended metric hides the reason. Appointment scheduling, billing, eligibility, and troubleshooting should not share one undifferentiated score.
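Both details fit in a few lines of aggregation code. A sketch that reuses the contained / not_contained / excluded labels from the containment section and reports one rate per intent, with the denominator stated explicitly:

```python
from collections import defaultdict

def containment_by_intent(calls):
    """calls: iterable of (intent, label) pairs, where label is the
    contained / not_contained / excluded classification from earlier.
    Denominator: non-excluded calls per intent."""
    tally = defaultdict(lambda: {"contained": 0, "eligible": 0})
    for intent, label in calls:
        if label == "excluded":
            continue  # policy-required handoffs stay out of the denominator
        tally[intent]["eligible"] += 1
        tally[intent]["contained"] += (label == "contained")
    return {
        intent: 100.0 * t["contained"] / t["eligible"]
        for intent, t in tally.items()
        if t["eligible"]
    }
```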

Benchmarks by Industry and Use Case

Use benchmarks as starting points. The right threshold depends on task complexity, caller risk, and how much authority the AI agent has.

| Use Case | Healthy Containment | Healthy Task Completion | Notes |
| --- | --- | --- | --- |
| Appointment scheduling | 80-90% | 85-95% | Well-defined slots and clear completion criteria |
| Order status | 80-90% | 85-95% | Strong fit for automation if systems are reliable |
| Billing explanation | 65-80% | 75-90% | More handoffs due to account nuance and disputes |
| Healthcare intake | 60-75% | 70-85% | Correct escalation may be more important than containment |
| Financial services support | 60-80% | 75-90% | Compliance, identity, and policy checks lower containment |
| Technical troubleshooting | 55-75% | 65-85% | Multi-step diagnosis creates more drop-off |

The benchmark that matters most is your own baseline by intent and prompt version. Run two weeks of measurement, define normal ranges, then alert on deviation.
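One plausible way to turn that two-week baseline into thresholds, assuming one metric value per day per intent; the two-standard-deviation band is an assumption to tune, not a standard:

```python
import statistics

def normal_range(daily_values: list[float], k: float = 2.0) -> tuple[float, float]:
    """Baseline band from ~two weeks of daily per-intent values:
    mean +/- k standard deviations."""
    mean = statistics.mean(daily_values)
    spread = statistics.stdev(daily_values) if len(daily_values) > 1 else 0.0
    return mean - k * spread, mean + k * spread

# containment_by_day = [74.1, 76.3, 75.0, ...]  # one value per day, per intent
# low, high = normal_range(containment_by_day)  # alert when today falls outside
```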

Building Your Analytics Dashboard

A good voice agent analytics dashboard has three layers:

  1. Executive health: Is the agent improving or hurting the business?
  2. Operations triage: Which intents, flows, or cohorts need attention today?
  3. Call-level drilldown: Which calls prove the pattern and explain root cause?

Do not put 50 charts on the first screen. More tiles usually means more arguing about which tile matters. Start with one row per metric family.

| Dashboard Row | Primary Chart | Drilldown |
| --- | --- | --- |
| Containment | Contained vs escalated by intent | Escalation reasons and repeat-contact calls |
| Sentiment | Negative sentiment trend | Frustration-flagged call queue |
| Flow | Funnel by conversation stage | Stage-level drop-off examples |
| Quality | Quality score distribution | Lowest-scoring calls with reason codes |

Then add queryable analytics:

  • "Show calls where the user asked for a human after the agent repeated itself."
  • "Find all calls where billing intent reached authentication but never completed."
  • "Show noisy calls with low ASR confidence and negative sentiment."
  • "Cluster the top new questions from the last 500 calls."

This is the part that generic analytics tools usually miss. Voice agent teams need both dashboard metrics and natural-language exploration across calls.

Alerting and Anomaly Detection

Alert on sustained deviations, not single-call failures.

| Alert | Warning | Critical | Route To |
| --- | --- | --- | --- |
| Containment drop | >5% below baseline for 30 minutes | >15% below baseline | Ops / product |
| Negative sentiment spike | 2x baseline | >25% negative calls | CX / QA |
| Drop-off increase | >10% increase at one stage | >25% increase at one stage | Product / engineering |
| Tool failure rate | >3% for one tool | >10% for one tool | Engineering |
| P95 turn latency | >1.5x baseline | >3 seconds for 15 minutes | Engineering |
| Quality score | Median below 70 | Median below 60 | QA / product |
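Sustained is the load-bearing word in that table. A sketch of the containment-drop warning row, assuming the metric arrives in 5-minute buckets so six consecutive low buckets cover the 30-minute window:

```python
def sustained_drop(samples: list[float], baseline: float,
                   rel_drop: float = 0.05, min_points: int = 6) -> bool:
    """True when the last min_points samples all sit more than rel_drop
    below baseline, e.g. six 5-minute buckets = 30 minutes of containment
    running >5% under its baseline (the warning row above)."""
    recent = samples[-min_points:]
    return len(recent) == min_points and all(
        s < baseline * (1 - rel_drop) for s in recent
    )

# sustained_drop(last_30_min_buckets, baseline=76.0) -> route to ops / product
```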

The alert should include the likely first debug path. "Negative sentiment is up" is too vague to help. A better alert says: billing calls are souring after authentication, interruption count is up 41%, and here are the first five calls to open.

Flaws But Not Dealbreakers

Metrics can hide the truth. A high containment rate can mean the agent solved calls. It can also mean users gave up. I would take 68% honest containment over 85% containment created by trapped callers. Pair the number with repeat contact, sentiment, and task success.

Sentiment is noisy. Accent, culture, background noise, and call context all affect sentiment models. Treat sentiment as a triage signal and a trend line, not a courtroom verdict on one caller.

Benchmarks are not universal. A healthcare triage agent should not chase the same containment rate as an order-status bot. Use benchmarks to start the conversation, then set thresholds by intent and risk.

Manual review still matters. Automation tells you where to look. Humans still need to inspect samples, update rubrics, and decide whether a flagged pattern is actually harmful.

The Practical Starting Point

If you are building this from scratch, do not start with the full dashboard.

Start with 20 production calls per major intent and score them manually across the four metric families:

| Call | Contained? | Sentiment | Flow issue? | Quality issue? | Root cause |
| --- | --- | --- | --- | --- | --- |
| 1 | Yes | Neutral | None | Slow response | LLM latency |
| 2 | No | Negative | Auth loop | Tool error | Auth API timeout |
| 3 | Yes | Negative | Repetition | Wrong answer | Knowledge gap |

After 100-200 scored calls, automate the labels that match your real failure modes. This prevents the common mistake: instrumenting generic metrics before you know which failures matter for your agent.

For teams still validating the agent itself, pair this analytics setup with Call Center Voice Agent Testing and Background Noise Testing KPIs. Analytics tells you what failed in production; testing lets you reproduce and prevent the same failure before the next deploy.

Cite This Guide

If you reference this article, cite it as:

Hamming's voice agent analytics metric dictionary defines containment, sentiment, flow, and quality metrics with formulas, denominators, benchmarks, and alert thresholds for production voice agents.

Frequently Asked Questions

What is voice agent analytics?

Voice agent analytics is the measurement of conversation outcomes and voice-specific failure signals across production calls, including containment, sentiment, flow adherence, drop-off, latency, and call quality. Hamming recommends grouping these metrics into Containment, Sentiment, Flow, and Quality so each failure routes to the right owner.

How do you calculate voice agent containment rate?

Voice agent containment rate is the percentage of calls handled by the AI without human escalation. The formula is AI-contained calls divided by total calls times 100, but Hamming recommends pairing it with repeat-contact and task-success metrics so abandoned or falsely resolved calls do not inflate containment.

How do you detect frustrated customers in voice bot calls?

Detect frustrated customers by combining transcript and audio signals: repeated phrases, caller interruptions, long silences, louder or sharper tone, negative language, abandonment, and repeated requests for a human. Hamming recommends flagging calls with 2 or more frustration signals for QA review before using sentiment scores for automated decisions.

What causes high voice agent drop-off rates?

High voice agent drop-off rates usually come from bad entrypoints, confusing questions, repeated recognition failures, authentication friction, tool errors, or latency-driven dead air. Hamming recommends measuring drop-off by conversation stage because a single blended abandonment rate will not show whether the break happened at greeting, authentication, slot collection, tool execution, or confirmation.

How do you measure flow adherence?

Measure flow adherence by defining required stages for each intent, then calculating calls that complete those stages divided by eligible calls times 100. Hamming recommends segmenting flow adherence by intent because appointment scheduling, billing, and troubleshooting flows have different required stages.

What is a good voice agent quality score?

A good voice agent quality score is usually 70-79, while 80+ indicates strong production quality. Hamming recommends building the score from visible sub-scores such as answer accuracy, latency, flow adherence, containment or resolution, tool success, and policy compliance rather than hiding everything inside one opaque number.

What should a call center voice analytics dashboard show?

A call center voice analytics dashboard should show containment by intent, negative sentiment trend, stage-level flow drop-off, quality score distribution, tool failures, and call-level drilldowns. Hamming recommends one dashboard row per metric family so executives can see health while operators can drill into the exact calls causing each pattern.

How often should you review voice agent analytics?

Review voice agent analytics at three cadences: real-time alerts for severe regressions, daily triage for intent-level anomalies, and weekly analysis for product and knowledge-base patterns. Hamming recommends waiting for a two-week baseline before tightening thresholds so alerts reflect sustained deviations rather than normal traffic variance.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, grew AI-powered sales program to 100s of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”