Voice Agent Call Review Triage Runbook

Sumanyu Sharma, Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

June 8, 2026 · Updated June 8, 2026 · 12 min read

Voice agent call review triage is the process of ranking production calls by review value so humans inspect the calls most likely to expose product bugs, unsafe behavior, compliance risk, customer-impacting failures, or missing regression coverage.

If you run fewer than 100 production voice agent calls per week, a simple manual review habit is probably fine. Listen to failed calls, review a few random successes, and keep notes.

This runbook is for teams past that point. Once the system handles hundreds, thousands, or tens of thousands of calls per day, "pick 20 recordings at random" stops being a quality program. It becomes a lottery.

We used to think the main question was, "How many calls should QA review?" The better question is, "Which calls can change what we ship, block, escalate, or test tomorrow?"

TL;DR: Build a production voice call review queue with 5 parts: selection reason, triage score, evidence packet, reviewer owner, and outcome. Keep random sampling for calibration, but prioritize calls with unsafe responses, failed tool calls, abandonment, repeat callers, low-confidence turns, compliance risk, new failure clusters, and high-impact cohorts.

Methodology Note: This triage runbook is based on Hamming's analysis of 4M+ production voice agent calls and QA review workflows across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions.

The goal is not to review more calls. The goal is to review the calls that can change reliability, safety, compliance, or regression coverage.

In Hamming's analysis of 4M+ production calls from 10K+ voice agents, review systems broke down when every call carried the same priority. The useful calls are not evenly distributed.

Last Updated: June 2026

Random Sampling Is Calibration, Not Triage

Random sampling still has a job. It catches drift in your scoring rubric, gives QA a baseline view, and prevents the team from only looking at obvious failures.

It should not be the only way calls enter review.

Salesforce describes contact center quality management as a continuous loop of monitoring, evaluation, scoring, coaching, and improvement. That loop gets weaker when selection is blind. The call that caused a duplicate refund, unsafe healthcare answer, or 12-minute abandonment should not compete equally with a clean happy-path call.

Triage rule: random samples protect calibration. Risk-based samples protect customers.

Use both:

| Selection Mode | Best For | Weakness |
| --- | --- | --- |
| Random sample | Calibration, representative baseline, evaluator drift checks | Wastes review capacity on low-signal calls |
| Stratified sample | Comparing agents, queues, languages, locations, or cohorts | Can miss rare but severe failures |
| Risk-based queue | Unsafe responses, failed tools, abandonments, repeat callers, new clusters | Needs guardrails to avoid overfitting to obvious failures |
| Customer-reported queue | Fast support follow-up and incident reconstruction | Depends on external reports and usually arrives late |
| Calibration sample | Keeping human and automated scoring aligned | Does not replace production incident review |

The mistake is pretending one mode can do every job.
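As a rough illustration of running two modes side by side, the sketch below builds a risk-based queue and a separate random calibration sample from the same day's calls. The call shape (`id`, `flags`), the `RISK_FLAGS` set, and the `calibration_rate` default are assumptions made for the example, not values from this runbook.

```python
import random

# Hypothetical subset of risk flags; names mirror this runbook's selection reasons.
RISK_FLAGS = {
    "unsafe_response",
    "failed_tool_call",
    "abandoned_before_resolution",
    "repeat_caller",
}


def select_calls(calls, calibration_rate=0.02, seed=0):
    """Return (risk_queue, calibration_sample) from a list of call dicts.

    Each call is assumed to look like {"id": str, "flags": set of str}.
    """
    rng = random.Random(seed)  # seeded so the sample can be reproduced later
    # Risk-based mode: every call that tripped at least one risk flag.
    risk_queue = [call for call in calls if call["flags"] & RISK_FLAGS]
    # Calibration mode: uniform sample over ALL calls, so it stays representative.
    sample_size = min(len(calls), max(1, round(len(calls) * calibration_rate)))
    calibration_sample = rng.sample(calls, sample_size)
    return risk_queue, calibration_sample
```

A call can land in both lists. That is deliberate in this sketch: excluding flagged calls from the calibration sample would bias the baseline toward clean calls.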

Select Calls by Reason First

Every call in the review queue should carry a selection reason. Without it, reviewers waste time rediscovering why the call mattered.

| Selection Reason | Add the Call When | Default Queue | Reviewer |
| --- | --- | --- | --- |
| unsafe_response | The agent may have given unsafe, prohibited, or policy-violating advice | P0 | Compliance + product |
| failed_tool_call | A tool timed out, returned bad data, duplicated a write, or used unsupported arguments | P0 | Engineering |
| customer_reported | A customer, support rep, or account team named a bad call | P0 | Support + engineering |
| abandoned_before_resolution | Caller hung up before success, transfer, or safe fallback | P1 | QA + product |
| repeat_caller | Same caller or account came back within the review window for the same task | P1 | QA + product |
| negative_sentiment_shift | Caller frustration rose during the call | P1 | QA |
| low_confidence_turn | ASR, intent, policy, or evaluator confidence fell below threshold | P2 | QA |
| new_failure_cluster | Several calls share a previously unseen failure pattern | P1 | Product + QA |
| cohort_regression | A language, queue, location, provider, or agent version degraded | P1 | Engineering + product |
| random_calibration | The call was sampled to keep scoring honest | P2 | QA |

The call evidence export runbook starts after this decision. It explains how to package transcripts, audio, traces, and tool evidence. Triage decides which calls deserve that packet.
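One way to keep the selection reason attached to every queued call is a small routing table keyed by reason. This is a minimal sketch: the queue and reviewer values are copied from the selection-reason table above, while `QueueEntry` and `make_entry` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# reason -> (default queue, reviewer groups), taken from the selection-reason table.
ROUTING = {
    "unsafe_response": ("P0", ("compliance", "product")),
    "failed_tool_call": ("P0", ("engineering",)),
    "customer_reported": ("P0", ("support", "engineering")),
    "abandoned_before_resolution": ("P1", ("qa", "product")),
    "repeat_caller": ("P1", ("qa", "product")),
    "negative_sentiment_shift": ("P1", ("qa",)),
    "low_confidence_turn": ("P2", ("qa",)),
    "new_failure_cluster": ("P1", ("product", "qa")),
    "cohort_regression": ("P1", ("engineering", "product")),
    "random_calibration": ("P2", ("qa",)),
}


@dataclass(frozen=True)
class QueueEntry:
    call_id: str
    selection_reason: str
    default_queue: str
    reviewers: tuple


def make_entry(call_id: str, selection_reason: str) -> QueueEntry:
    """Refuse to enqueue a call that has no recognized selection reason."""
    if selection_reason not in ROUTING:
        raise ValueError(f"unknown selection reason: {selection_reason!r}")
    queue, reviewers = ROUTING[selection_reason]
    return QueueEntry(call_id, selection_reason, queue, reviewers)
```

Rejecting unknown reasons at enqueue time is a design choice in this sketch. It prevents calls from entering review with a blank or free-text reason.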

Use a Triage Score, But Keep It Explainable

Do not build a black-box score that nobody trusts. A review score should be simple enough for an ops lead to inspect and override.

Use this starter formula:

review_priority =
  severity
  + evidence_quality
  + novelty
  + cohort_impact
  + freshness
  - reviewer_load_penalty

| Component | Score | How to Assign It |
| --- | --- | --- |
| Severity | 0-5 | Customer harm, compliance risk, money movement, safety risk, or blocked task |
| Evidence quality | 0-3 | Transcript, audio, trace, tool result, and evaluator rationale are available |
| Novelty | 0-3 | New failure cluster, new agent version, new provider, or unseen user behavior |
| Cohort impact | 0-3 | Affects many calls, high-value customers, regulated workflows, or a vulnerable segment |
| Freshness | 0-2 | Happened recently enough to debug while logs and context are still useful |
| Reviewer load penalty | 0-3 | Deduct when the queue is saturated with duplicates from the same issue |
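The starter formula can be written as a few lines of code that an ops lead can read and check by hand. In the sketch below, the ranges come from the component table; the function name `review_priority` follows the formula, and the dictionary input shape is an assumption.

```python
# Allowed (min, max) per component, copied from the component table.
COMPONENT_RANGES = {
    "severity": (0, 5),
    "evidence_quality": (0, 3),
    "novelty": (0, 3),
    "cohort_impact": (0, 3),
    "freshness": (0, 2),
    "reviewer_load_penalty": (0, 3),
}


def review_priority(components: dict) -> int:
    """Apply the starter formula after validating each component's range."""
    for name, (low, high) in COMPONENT_RANGES.items():
        if name not in components:
            raise ValueError(f"missing component: {name}")
        if not low <= components[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    # Sum every additive component, then subtract the load penalty.
    additive = sum(
        components[name]
        for name in COMPONENT_RANGES
        if name != "reviewer_load_penalty"
    )
    return additive - components["reviewer_load_penalty"]
```

For example, a fresh, well-evidenced duplicate refund (severity 5, evidence quality 3, novelty 2, cohort impact 2, freshness 2) with a load penalty of 1 scores 5 + 3 + 2 + 2 + 2 - 1 = 13. With these ranges the score runs from -3 to 16.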

Then map the score to queues:

| Score | Queue | SLA | Action |
| --- | --- | --- | --- |
| 12+ | P0 | Same day | Assign owner, inspect packet, decide block/escalate/fix |
| 8-11 | P1 | 2 business days | Review for product bug, prompt bug, workflow gap, or regression candidate |
| 4-7 | P2 | Weekly | Use for pattern review, coaching, coverage expansion, or calibration |
| 0-3 | Archive | No human review by default | Keep aggregate label unless it joins a larger cluster |

This is intentionally simple. The score is a routing tool, not a truth machine.
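A sketch of the score-to-queue mapping, including a manual override that must carry a reason, might look like this. The thresholds and SLAs are taken from the table above; the `route` function, its record shape, and the choice to archive scores below 0 (possible when the load penalty exceeds the other components) are assumptions.

```python
# (minimum score, queue, SLA) rows from the mapping table, highest first.
QUEUE_THRESHOLDS = [
    (12, "P0", "Same day"),
    (8, "P1", "2 business days"),
    (4, "P2", "Weekly"),
]
ARCHIVE = ("Archive", "No human review by default")


def route(score, override_queue=None, override_reason=None):
    """Map a score to a queue; a manual override must state why."""
    queue, sla = ARCHIVE  # anything below 4, including negative scores
    for minimum, name, deadline in QUEUE_THRESHOLDS:
        if score >= minimum:
            queue, sla = name, deadline
            break
    record = {"score": score, "queue": queue, "sla": sla, "overridden": False}
    if override_queue is not None:
        slas = {name: deadline for _, name, deadline in QUEUE_THRESHOLDS}
        slas[ARCHIVE[0]] = ARCHIVE[1]
        if override_queue not in slas:
            raise ValueError(f"unknown queue: {override_queue!r}")
        if not override_reason:
            raise ValueError("an override requires a reason")
        record.update(
            queue=override_queue,
            sla=slas[override_queue],
            overridden=True,
            override_reason=override_reason,
        )
    return record
```

Keeping the original score in the record alongside the override lets a later audit see both what the formula said and why a person disagreed.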

Review-value definition: a call is worth human review when the likely next action could change a prompt, policy, tool, product workflow, compliance queue, regression suite, or customer follow-up.

Build the Daily Review Queue

The daily queue should be small enough to finish and structured enough to trust.

| Queue | Daily Contents | Cap | Exit Criteria |
| --- | --- | --- | --- |
| P0 safety and incidents | Unsafe responses, failed writes, customer reports, compliance flags | No artificial cap | Owner assigned and decision recorded |
| P1 product failures | Abandonment, repeat caller, new cluster, cohort regression | 10-30 calls | Issue labeled and pattern counted |
| P2 QA signal | Low confidence, sentiment shift, unusual fallback, long silence | 20-50 calls | Either dismissed, clustered, or promoted |
| Calibration sample | Random calls across agents, cohorts, and outcomes | 5-10% of review volume | Human/automated scoring compared |

The right volume depends on reviewer capacity. A queue of 500 calls that nobody finishes is worse than a queue of 40 calls with clean outcomes.
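The capping rule can be sketched as a function that never drops a P0 call, fills P1 and P2 by descending score, and returns whatever did not fit instead of silently discarding it. The default caps are the upper bounds from the queue table; the candidate shape (`call_id`, `queue`, `score`) and the function name are assumptions.

```python
def build_daily_queue(candidates, p1_cap=30, p2_cap=50):
    """Return (queue, overflow) for one day's review candidates.

    Each candidate is assumed to look like
    {"call_id": str, "queue": "P0" | "P1" | "P2" | "Archive", "score": int}.
    """
    caps = {"P0": None, "P1": p1_cap, "P2": p2_cap}  # None means uncapped
    queue = {name: [] for name in caps}
    overflow = []
    # Highest score first, so caps cut the least valuable calls.
    for call in sorted(candidates, key=lambda c: c["score"], reverse=True):
        name = call["queue"]
        if name not in caps:
            continue  # archived calls never enter the daily queue
        cap = caps[name]
        if cap is None or len(queue[name]) < cap:
            queue[name].append(call)
        else:
            overflow.append(call)
    return queue, overflow
```

The overflow list is a natural input for clustering: calls that miss the cap can still contribute to aggregate counts.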

Qualtrics' quality management documentation describes rubrics, scorecards, alerts, and tickets that route coaching opportunities. Voice agents need the same control loop, but with voice-specific evidence: transcript turn, audio segment, trace, tool result, prompt or agent version, and evaluator rationale.

For urgent production failures, route the P0 item to the same operating surface you use for voice agent Slack alerts. For slow quality decay, keep the item in the QA queue and aggregate it into weekly coverage work.

Give Reviewers a Packet, Not a Dashboard Hunt

A reviewer should not need five tabs and tribal knowledge to judge one call.

Minimum packet:

| Field | Required? | Why |
| --- | --- | --- |
| Canonical call ID | Yes | Joins transcript, audio, traces, evaluations, provider IDs, and review outcome |
| Selection reason | Yes | Explains why the call is in the queue |
| Triage score | Yes | Shows priority and routing logic |
| Agent version | Yes | Connects behavior to prompt, model, tool, or deployment changes |
| Transcript excerpt | Yes | Lets reviewer inspect the relevant turn quickly |
| Audio pointer | Usually | Captures silence, interruption, tone, background noise, and ASR ambiguity |
| Trace or tool evidence | For workflow calls | Proves what happened outside the conversation text |
| Evaluation result | Yes | Includes rubric, score, evaluator version, and failure label |
| Privacy state | Yes | Shows whether the packet is redacted, restricted, or aggregate-only |
| Allowed outcomes | Yes | Prevents free-text review notes from becoming a dead end |

This should reuse your call evidence export format. The triage queue should store the selectionReason and reviewPriority; the evidence packet should store the proof.
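A completeness check run before a packet reaches a reviewer is one way to enforce the minimum packet. The sketch below is illustrative only: the snake_case field names are assumed spellings of the table's fields, not the actual export schema, and the audio pointer is left unchecked because the table marks it "Usually" rather than required.

```python
# Fields marked "Yes" in the minimum-packet table (names are assumed spellings).
REQUIRED_FIELDS = (
    "call_id",
    "selection_reason",
    "triage_score",
    "agent_version",
    "transcript_excerpt",
    "evaluation_result",
    "privacy_state",
    "allowed_outcomes",
)
PRIVACY_STATES = {"redacted", "restricted", "aggregate_only"}


def _is_blank(value):
    """None, or an empty string/list/dict; a numeric score of 0 is not blank."""
    return value is None or (hasattr(value, "__len__") and len(value) == 0)


def packet_problems(packet: dict, is_workflow_call: bool = False) -> list:
    """Return human-readable problems; an empty list means the packet is reviewable."""
    problems = [
        f"missing {name}" for name in REQUIRED_FIELDS if _is_blank(packet.get(name))
    ]
    # Trace or tool evidence is only required for workflow calls.
    if is_workflow_call and _is_blank(packet.get("trace_evidence")):
        problems.append("missing trace_evidence")
    state = packet.get("privacy_state")
    if not _is_blank(state) and state not in PRIVACY_STATES:
        problems.append(f"unknown privacy_state: {state}")
    return problems
```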

The CTAA quality assurance guide emphasizes documented QA plans, call monitoring, customer satisfaction tools, reporting, and escalation. That old operational lesson still applies: the review process has to create a record that managers and teams can act on later.

Force Every Review Into an Outcome

The review is not done when someone listens to the call. It is done when the next action is recorded.

Use a closed outcome taxonomy:

| Outcome | Use When | Next Step |
| --- | --- | --- |
| no_issue | The call was correctly handled or the signal was a false positive | Feed calibration data back to scoring |
| prompt_bug | Agent wording, instruction following, or policy interpretation caused the failure | Update prompt and add regression case |
| tool_bug | Backend action, integration, timeout, retry, or argument handling failed | File engineering issue and keep tool evidence |
| product_gap | Caller asked for a supported-but-missing workflow | Add roadmap or coverage item |
| unsupported_request | Caller asked for something the agent should not handle | Improve graceful fallback |
| policy_risk | Agent may have violated safety, compliance, legal, or brand policy | Route to compliance or policy owner |
| regression_candidate | The failure should never recur | Convert to a test using safe fixtures |
| coaching_signal | Human operations or handoff process needs change | Route to training or ops lead |

Pair regression_candidate with the failed production call regression runbook. A call review that finds a bug but does not create a durable test is only a one-time inspection.

For coverage gaps, pair product_gap and unsupported_request with response coverage. The goal is not to make the agent handle every long-tail request. The goal is to know which gaps deserve workflow coverage and which deserve a clean handoff.
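A closed taxonomy is easiest to enforce in code with an enum, so that free-text outcomes fail loudly. In this sketch the outcome values come from the taxonomy table; `record_outcome`, the required `next_owner` argument, and the `needs_regression_test` flag (set for the two outcomes whose next step mentions a test or regression case) are assumptions for illustration.

```python
from enum import Enum


class Outcome(str, Enum):
    NO_ISSUE = "no_issue"
    PROMPT_BUG = "prompt_bug"
    TOOL_BUG = "tool_bug"
    PRODUCT_GAP = "product_gap"
    UNSUPPORTED_REQUEST = "unsupported_request"
    POLICY_RISK = "policy_risk"
    REGRESSION_CANDIDATE = "regression_candidate"
    COACHING_SIGNAL = "coaching_signal"


# Outcomes whose "Next Step" column calls for a regression case or test.
TEST_OUTCOMES = {Outcome.PROMPT_BUG, Outcome.REGRESSION_CANDIDATE}


def record_outcome(call_id: str, outcome: str, next_owner: str) -> dict:
    """Close a review only with an allowed outcome and a named next owner."""
    try:
        parsed = Outcome(outcome)
    except ValueError:
        raise ValueError(f"{outcome!r} is not an allowed outcome") from None
    if not next_owner.strip():
        raise ValueError("a review needs a next owner before it can close")
    return {
        "call_id": call_id,
        "outcome": parsed.value,
        "next_owner": next_owner,
        "needs_regression_test": parsed in TEST_OUTCOMES,
    }
```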

Protect Review Quality

Review queues drift without guardrails.

| Risk | Symptom | Guardrail |
| --- | --- | --- |
| Duplicate flooding | One incident fills the whole queue with near-identical calls | Cluster first, review representative examples, keep aggregate count |
| Severity inflation | Everything becomes P0 because teams want attention | Require severity definition and owner override reason |
| Reviewer fatigue | Queue grows faster than humans can resolve | Cap P1/P2, preserve P0, archive low-score calls into clusters |
| Privacy leakage | Reviewers see raw audio or sensitive transcript text by default | Use redacted packets and role-gated raw access |
| Calibration drift | Reviewers disagree on the same call | Keep random calibration samples and compare decisions |
| Dead-end notes | Reviewers write comments but no action happens | Use closed outcomes and required next owner |
| Dashboard-only evidence | Nobody can reconstruct why the call was selected | Store selection reason, score, packet pointer, and review outcome |
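The duplicate-flooding guardrail (cluster first, review representatives, keep the aggregate count) can be sketched as below. It assumes each call already carries a `cluster_key` produced upstream, for example a failure label combined with an agent version. How that key is derived is outside this sketch, and all names here are hypothetical.

```python
from collections import defaultdict


def cluster_representatives(calls, per_cluster=2):
    """Collapse near-identical calls into clusters with a few representatives.

    Each call is assumed to look like
    {"call_id": str, "cluster_key": str, "score": int}.
    """
    groups = defaultdict(list)
    for call in calls:
        groups[call["cluster_key"]].append(call)
    clusters = []
    for key, members in groups.items():
        # Review the highest-scoring examples; count the rest.
        ranked = sorted(members, key=lambda c: c["score"], reverse=True)
        clusters.append(
            {
                "cluster_key": key,
                "count": len(members),
                "representatives": [c["call_id"] for c in ranked[:per_cluster]],
            }
        )
    # Largest clusters first, so the biggest incident is visible at a glance.
    clusters.sort(key=lambda c: c["count"], reverse=True)
    return clusters
```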

Privacy deserves special attention. Production calls may include names, account numbers, health details, addresses, payment information, and background speech. Use the log retention compliance checklist and security review questions before giving broad teams raw recordings or unredacted transcripts.

What Hamming Does in This Loop

Hamming helps teams move from "we have a dashboard" to "we know which calls need action."

Use Hamming to:

  • Analyze production voice calls across transcripts, audio, metadata, and evaluation results.
  • Rank calls by failure type, severity, confidence, cohort, and review value.
  • Give reviewers the evidence needed to decide whether the issue is a prompt bug, tool bug, policy risk, product gap, or regression candidate.
  • Turn selected failures into workflow tests, sandbox tests, or CI regression cases.
  • Measure whether fixes actually reduce repeat failures, abandonment, unsafe responses, and unresolved coverage gaps.

The practical loop is:

monitor production calls
  -> score review priority
  -> package evidence
  -> review with closed outcomes
  -> fix or escalate
  -> promote durable failures into tests
  -> watch whether the cluster disappears

This is where call review becomes a product-quality system instead of a recording audit.

Final Checklist

Before you trust a production voice call review queue, check:

  • Every reviewed call has a selection reason.
  • P0 calls bypass normal queue caps.
  • Random samples are used for calibration, not as the whole QA strategy.
  • The triage score is explainable and overrideable.
  • Duplicate calls are clustered before they consume review capacity.
  • Reviewers get transcript, audio pointer, trace/tool evidence, evaluation result, and privacy state.
  • Every review ends in a closed outcome.
  • Regression candidates become tests with source labels.
  • Compliance and safety issues route to a named owner.
  • Weekly reporting shows which clusters disappeared, persisted, or got worse.
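For the last checklist item, a weekly report can be as simple as comparing cluster counts between two weeks. The sketch below assumes each week is summarized as a mapping from a cluster key to a call count. The three status labels follow the checklist wording; treating a cluster that shrank but did not vanish as "persisted" is an assumption.

```python
def cluster_trend(last_week: dict, this_week: dict) -> dict:
    """Label each cluster as disappeared, persisted, or worse.

    Both arguments map a cluster key to that week's call count.
    """
    report = {}
    for key in sorted(set(last_week) | set(this_week)):
        before = last_week.get(key, 0)
        after = this_week.get(key, 0)
        if after == 0:
            status = "disappeared"
        elif after > before:
            status = "worse"  # includes clusters that are new this week
        else:
            status = "persisted"  # same size or smaller, but still present
        report[key] = {"before": before, "after": after, "status": status}
    return report
```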

Frequently Asked Questions

What is voice agent call review triage?

Voice agent call review triage ranks production calls by review value so humans inspect the calls most likely to change reliability, safety, compliance, or regression coverage. Hamming's runbook uses selection reason, triage score, evidence packet, reviewer owner, and closed outcome as the minimum review contract.

How do you decide which production calls to review?

Start with selection reasons such as unsafe response, failed tool call, customer report, abandonment, repeat caller, low-confidence turn, new failure cluster, or cohort regression. Hamming's triage matrix then scores severity, evidence quality, novelty, cohort impact, freshness, and reviewer load before routing the call to P0, P1, P2, or archive.

Is random sampling enough for production voice agent QA?

No. Random sampling is useful for calibration and representative baselines, but it should not be the whole production QA strategy. Hamming recommends combining random calibration samples with risk-based queues that prioritize high-impact failures and new patterns.

What should a daily call review queue include?

A daily queue should separate P0 safety or incident calls, P1 product failures, P2 QA signals, and a small calibration sample. Hamming's runbook recommends giving each call a selection reason, score, owner, SLA, evidence packet, and allowed review outcomes.

How many calls should the team review?

Review volume should be based on capacity and risk, not a fixed percentage of total calls. Hamming recommends finishing a smaller queue with clear outcomes over creating a large backlog, while keeping P0 calls uncapped and using 5-10% of review volume for calibration.

What should reviewers see for each call?

Reviewers should see canonical call ID, selection reason, triage score, agent version, transcript excerpt, audio pointer when allowed, trace or tool evidence, evaluation result, privacy state, and allowed outcomes. Hamming's call-review packet keeps the reviewer from hunting across dashboards before making a decision.

When do reviewed calls become regression tests?

Reviewed calls become regression tests when the outcome is labeled as a prompt bug, tool bug, product gap, policy risk, or regression candidate and the smallest safe reproduction is added to the test suite. Hamming recommends preserving the source label so future reviewers know which production failure the test prevents.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, grew AI-powered sales program to 100s of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”