Voice Agent Handoff and Transfer Testing Runbook

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

June 24, 2026 · Updated June 24, 2026 · 15 min read

Voice agent handoff testing proves that an AI voice agent escalates, transfers, or hands control to the right destination with the right context, at the right time, and with evidence that the next step actually happened.

If your agent never routes a caller anywhere else, this runbook is more than you need. Use normal conversation tests.

If your agent can transfer to a human, route to a queue, hand off to a specialist AI agent, leave voicemail, or escalate regulated workflows, transcript-only QA is not enough. The transcript can look clean while the caller lands in the wrong queue, the receiving agent gets no summary, the transfer fails silently, or the AI escalates a case it should have resolved.

We call this transfer theater: the system records a transfer event, but nobody proves whether the right handoff happened.

TL;DR: Test voice-agent handoffs as operational workflows:

  • Trigger: Did the caller, policy, or safety condition justify escalation?
  • Destination: Did the agent route to the right person, queue, number, SIP endpoint, or specialist agent?
  • Context: Did the handoff include caller identity, reason, collected fields, transcript summary, and next action?
  • Bridge state: Did the call connect, hold, consult, merge, fail, or fall back as expected?
  • Receipt: Did the receiving system or human path acknowledge the handoff?
  • Outcome: Did the caller avoid repeating themselves, or did the workflow fail after the transfer event?

Methodology Note: This runbook is based on Hamming's analysis of 4M+ production voice agent calls across 10K+ voice agents (2025-2026) where escalation, queue routing, human transfer, and workflow state changed the caller outcome. We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions.

Use it as a release-safety template for healthcare, financial services, insurance, BPO, QSR, and support workflows where a bad handoff creates customer effort or operational risk.

Last Updated: June 2026

What Should Handoff Testing Prove?

Handoff testing should prove that the caller moved from one responsible system or person to another without losing the reason for the call.

That means a passing test needs more than "transfer tool called." It needs trigger evidence, routing evidence, context evidence, bridge evidence, and post-transfer evidence.

| Layer | What it proves | Sample failure |
|---|---|---|
| Escalation trigger | The agent escalated only when a caller request, policy, risk, or workflow condition required it | Agent transfers every frustrated caller even when self-service should continue |
| Destination | The selected queue, number, SIP endpoint, human team, or specialist AI agent matches the policy | Billing question goes to technical support |
| Context payload | The receiver gets caller identity, reason, summary, collected fields, and next action | Human answers but asks the caller to repeat everything |
| Bridge state | Hold, consult, merge, cold transfer, warm transfer, no-answer, and cancel states behave correctly | Caller sits on hold after destination fails |
| Receipt | The destination acknowledges the handoff or emits a transfer result | System logs "transfer requested" but not "transfer connected" |
| Fallback | Failed transfers produce a safe next step | Agent says "I transferred you" after no one answered |

Handoff success: the escalation reason, destination, transferred context, bridge state, receipt, and caller-facing outcome all agree. If any one is missing, the handoff is not proven.
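
The "if any one is missing" half of that rule is easy to mechanize. The sketch below is a minimal illustration, not Hamming's implementation: the `HandoffEvidence` class and its field names are hypothetical, and it only checks that each layer was captured. Checking that the layers agree with each other still needs scenario-specific assertions.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class HandoffEvidence:
    """One field per evidence layer; None means the layer was never captured."""
    escalation_reason: Optional[str] = None
    destination: Optional[str] = None
    context_payload: Optional[dict] = None
    bridge_state: Optional[str] = None
    receipt_id: Optional[str] = None
    caller_outcome: Optional[str] = None


def missing_layers(evidence: HandoffEvidence) -> list[str]:
    """Names of layers that are absent or empty, in declaration order."""
    return [f.name for f in fields(evidence) if not getattr(evidence, f.name)]


def handoff_proven(evidence: HandoffEvidence) -> bool:
    """A handoff counts as proven only when no layer is missing."""
    return not missing_layers(evidence)


# A run where everything was logged except the destination's receipt.
evidence = HandoffEvidence(
    escalation_reason="explicit_human_request",
    destination="queue_billing_priority_2",
    context_payload={"summary": "Disputed invoice"},
    bridge_state="connected",
    caller_outcome="continued_without_repeating",
)
print(missing_layers(evidence))  # ['receipt_id']
print(handoff_proven(evidence))  # False
```

Returning the list of missing layers, rather than a bare boolean, gives the reviewer the failure reason without replaying the call.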

In Hamming workflow reviews, we found that expensive handoff failures rarely look dramatic in the transcript. The agent often says the correct sentence. The failure is in the receipt, the queue, the summary, or the transfer target.

Choose the Handoff Type Before Writing Tests

Do not write one generic "transfer test." The evidence is different for each handoff type.

| Handoff type | Use when | Required evidence | Usually blocks release? |
|---|---|---|---|
| AI-agent handoff | One specialist agent should take over from another | Source agent, destination agent, handoff reason, preserved history, active instructions | Yes for critical workflows |
| Human warm transfer | A human needs context before speaking with the caller | Consult connection, private summary, merge/bridge result, receiving participant | Yes |
| Human cold transfer | The caller can be routed directly without consult | Destination, announcement, transfer result, no-answer fallback | Sometimes |
| Queue transfer | Routing depends on skill, department, region, priority, or tenant | Queue ID, routing rule version, priority, wait behavior, receipt | Yes for support-critical queues |
| IVR or DTMF transfer | Destination requires keypad navigation or phone-tree traversal | Digits sent, menu branch, destination answer, timeout path | Pre-release |
| Voicemail fallback | Human path unavailable but a message should be captured | Voicemail detection, message policy, callback record, follow-up owner | Sometimes |
| No-transfer fallback | Transfer is forbidden or unavailable | Reason not transferred, caller explanation, safe next step | Yes when policy-sensitive |

LiveKit's supervisor pattern documentation distinguishes between keeping one supervisor agent in control and handing control to another agent with different instructions or tools. That distinction matters in tests. A task delegation test is not the same as a full handoff test.

Provider transfer behavior also varies. Retell's transfer docs describe cold and warm transfer modes for phone calls, while Vapi's dynamic transfer docs describe server-controlled destination decisions using call context. Your test should match the specific runtime behavior you use.

Build the Handoff Test Contract

Start with a contract. Then run the call.

handoff_test_contract =
    caller scenario
  + escalation trigger
  + allowed transfer mode
  + expected destination
  + required context payload
  + bridge-state assertions
  + receipt assertions
  + fallback rule
  + evidence retention

| Contract field | Required? | Sample |
|---|---|---|
| Scenario ID | Yes | support_escalation_billing_014 |
| Caller goal | Yes | Caller asks about a disputed invoice and requests a person |
| Trigger condition | Yes | Billing dispute over threshold plus explicit human request |
| Transfer mode | Yes | Warm transfer to billing queue |
| Destination | Yes | queue_billing_priority_2 or approved SIP endpoint |
| Context payload | Yes | Caller ID, account fixture, issue category, amount, summary, next action |
| Forbidden actions | Yes | No transfer to general support; no account update before human review |
| Bridge assertion | Yes | Consult call established, then customer bridged, or fallback fires |
| Receipt assertion | Yes | Transfer event, receiving participant, queue/task ID, or destination ack |
| Fallback rule | Yes | If no answer in 30 seconds, offer callback or case creation |
| Evidence | Yes | Transcript, routing rule, tool trace, transfer events, call IDs, final state |

Keep this next to your tests-as-code definitions. A teammate should be able to review the escalation policy without listening to the whole call.

Copyable Test Definition

```yaml
suite: voice_agent_handoff_testing
owner: voice-platform
environment: staging
scenarios:
  - id: support_escalation_billing_014
    caller_goal: "Dispute an invoice and ask for a person"
    trigger:
      explicit_human_request: true
      issue_category: billing
      amount_disputed_cents: 18420
    expected_handoff:
      mode: warm_transfer
      destination_type: queue
      destination_id: queue_billing_priority_2
      context_required:
        - caller_id
        - verified_account_id
        - issue_category
        - dispute_amount
        - summary
        - next_action
    forbidden:
      - transfer_to_general_support
      - account_credit_before_human_review
      - caller_repeats_identity_after_bridge
    fallback:
      no_answer_after_seconds: 30
      expected_action: offer_callback_or_case_creation
    evidence:
      retain_transcript: true
      retain_transfer_events: true
      retain_routing_rule_version: true
      retain_receipt: true
```

That is more useful than a 1-10 transcript score. It tells you what the agent was allowed to do, where the caller should land, and what proof should exist afterward.
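
To show how a definition like this can drive assertions, here is a rough sketch that checks one run against the scenario's expected handoff and forbidden list. The scenario is written as a Python dict so the example has no dependencies. The `observed` shape (`mode`, `destination_id`, `actions`) and the `check_run` helper are assumptions for illustration, not part of any provider API.

```python
# Subset of the scenario above, as a plain dict.
scenario = {
    "id": "support_escalation_billing_014",
    "expected_handoff": {
        "mode": "warm_transfer",
        "destination_id": "queue_billing_priority_2",
    },
    "forbidden": [
        "transfer_to_general_support",
        "account_credit_before_human_review",
        "caller_repeats_identity_after_bridge",
    ],
}


def check_run(scenario: dict, observed: dict) -> list[str]:
    """Compare one observed run with the scenario; return failure messages."""
    failures = []
    expected = scenario["expected_handoff"]
    # The transfer mode and destination must match the contract exactly.
    for key in ("mode", "destination_id"):
        if observed.get(key) != expected[key]:
            failures.append(
                f"{key}: expected {expected[key]!r}, got {observed.get(key)!r}"
            )
    # Any forbidden action seen during the run is a failure on its own.
    for action in observed.get("actions", []):
        if action in scenario["forbidden"]:
            failures.append(f"forbidden action: {action}")
    return failures


# Hypothetical run where the caller landed in the wrong queue.
observed = {
    "mode": "warm_transfer",
    "destination_id": "queue_general_support",
    "actions": ["transfer_to_general_support"],
}
for failure in check_run(scenario, observed):
    print(failure)
```

How `observed` gets populated depends on your stack: tool traces, webhook events, or telephony logs can all feed it.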

Test Escalation Triggers and Forbidden Transfers

Most handoff bugs start before the transfer. The agent either escalates too late, escalates too early, or escalates for the wrong reason.

| Trigger class | Should transfer when | Should not transfer when | Test assertion |
|---|---|---|---|
| Explicit caller request | Caller says "representative," "human," or "transfer me" and policy allows it | Caller asks a normal question with no escalation request | Transfer reason includes explicit request |
| Safety or compliance | Medical, legal, payment, account, or emergency policy requires human review | Low-risk FAQ or read-only status question | Policy rule version is recorded |
| Tool failure | Required lookup, booking, payment, or CRM tool fails after allowed retry | Non-critical tool times out but fallback answer is safe | Agent explains failure and follows approved route |
| Negative sentiment | Caller frustration passes threshold and self-service is failing | Caller uses mild frustration but task is progressing | Sentiment is supporting evidence, not sole trigger |
| Unsupported intent | Caller asks for something outside scope | Agent can answer safely or gather missing info | Unsupported scope maps to correct fallback |
| VIP or account rule | Tenant/customer rule requires special queue | Generic account has no special routing | Rule fixture and destination match |

The negative cases matter. A voice agent that transfers too easily can destroy containment and create unnecessary queue load. A voice agent that refuses to transfer when policy requires it creates a worse problem.
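
One way to keep positive and negative cases side by side is to model the trigger decision as a pure function and test it with fixtures. The sketch below covers four of the trigger classes. The `CallState` fields, the 0..1 frustration scale, and the 0.7 threshold are all assumptions chosen for illustration; your policy will define its own.

```python
from dataclasses import dataclass
from typing import Optional

FRUSTRATION_THRESHOLD = 0.7  # assumed value; set this from your own policy


@dataclass
class CallState:
    explicit_human_request: bool = False
    transfer_allowed_by_policy: bool = True
    policy_requires_human: bool = False
    required_tool_failed_after_retry: bool = False
    frustration_score: float = 0.0  # hypothetical 0..1 scale
    self_service_progressing: bool = True


def escalation_reason(state: CallState) -> Optional[str]:
    """Return the trigger class that justifies a transfer, or None."""
    if state.policy_requires_human:
        return "safety_or_compliance"
    if state.explicit_human_request and state.transfer_allowed_by_policy:
        return "explicit_caller_request"
    if state.required_tool_failed_after_retry:
        return "tool_failure"
    # Sentiment alone never triggers: self-service must also be failing.
    if (
        state.frustration_score >= FRUSTRATION_THRESHOLD
        and not state.self_service_progressing
    ):
        return "negative_sentiment"
    return None


# Negative case: a frustrated caller whose task is still progressing.
print(escalation_reason(CallState(frustration_score=0.9)))  # None
# Positive case: same frustration, but self-service has stalled.
print(escalation_reason(
    CallState(frustration_score=0.9, self_service_progressing=False)
))  # negative_sentiment
```

Because the function returns the reason rather than a yes/no, the same fixture can assert both that a transfer happened and why.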

For customer-specific routing, pair this with the customer workflow rules template. The same caller phrase can require different destinations depending on tenant, region, priority, or regulated workflow.

Verify Destination, Context, and Bridge State

Transfer tests should inspect the receiving side, not just the sending side.

Twilio's warm transfer documentation describes a consult pattern where the initiating agent talks to the receiving agent before the customer is bridged. Twilio Conference also exposes participant states such as waiting, hold, mute, join, and leave. Those states are testable evidence.

| Evidence | What to check | Fail when |
|---|---|---|
| Destination | Queue, number, SIP URI, assistant, or agent ID | Destination is missing, stale, or not policy-approved |
| Caller context | Verified caller ID, account, preferred language, reason, collected fields | Receiver gets no usable summary |
| Transfer mode | Cold, warm, queue, AI-agent handoff, voicemail fallback | Mode differs from policy |
| Hold and consult | Caller hold state, receiving participant answer, private context delivery | Caller hears private summary or waits after failure |
| Bridge or merge | Both parties connected, original agent leaves or stays as designed | Transfer event fires but no bridge occurs |
| Receipt | Transfer update, conference participant, queue task, destination ack | Only request event exists |
| Post-transfer outcome | Caller continues with preserved context | Caller repeats identity, reason, and details |

Transfer receipt rule: a handoff test should fail unless the destination emits an acknowledgment or bridge state that proves the caller had somewhere to go.
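
A minimal sketch of that rule as an assertion helper follows. The event type names (`transfer_requested`, `transfer_connected`, `destination_ack`, `participant_joined`) and the event dict shape are hypothetical; map them to whatever your provider or telephony logs actually emit.

```python
from typing import Iterable

# Hypothetical event names that count as proof the caller arrived somewhere.
PROOF_OF_ARRIVAL = {"transfer_connected", "destination_ack", "participant_joined"}


def receipt_failures(events: Iterable[dict], expected_destination: str) -> list[str]:
    """Return reasons the receipt rule fails; an empty list means it passes."""
    proof = [e for e in events if e.get("type") in PROOF_OF_ARRIVAL]
    if not proof:
        return ["no acknowledgment or bridge event from the destination"]
    if not any(e.get("destination") == expected_destination for e in proof):
        return [f"proof events do not reference {expected_destination}"]
    return []


# A request event alone is not a receipt.
events = [{"type": "transfer_requested", "destination": "queue_billing_priority_2"}]
print(receipt_failures(events, "queue_billing_priority_2"))

# Adding a connected event for the expected destination satisfies the rule.
events.append({"type": "transfer_connected", "destination": "queue_billing_priority_2"})
print(receipt_failures(events, "queue_billing_priority_2"))  # []
```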

Some providers expose this as a webhook event. Vapi server events include transfer-related event types such as transfer destination requests and transfer updates. Other stacks require joining telephony logs, queue events, or conference participant state. The data source can vary. The requirement does not.

Save an Evidence Envelope for Every Handoff

When a transfer fails in production, the hardest question is usually basic: where did the caller go?

Use an evidence envelope that connects the transcript, routing rule, transfer event, destination, and final state.

```json
{
  "run_id": "handoff_run_2026_06_24_0019",
  "call_id": "call_fixture_442",
  "scenario_id": "support_escalation_billing_014",
  "agent_version": "support-agent-pr-913",
  "escalation": {
    "trigger": "explicit_human_request",
    "policy_rule_version": "routing_rules_2026_06_24",
    "reason": "billing_dispute"
  },
  "handoff": {
    "mode": "warm_transfer",
    "expected_destination": "queue_billing_priority_2",
    "actual_destination": "queue_billing_priority_2",
    "context_fields_sent": [
      "caller_id",
      "verified_account_id",
      "issue_category",
      "summary",
      "next_action"
    ],
    "bridge_status": "connected",
    "receipt_id": "transfer_receipt_77"
  },
  "post_transfer": {
    "caller_repeated_identity": false,
    "receiver_acknowledged_context": true,
    "fallback_used": false
  }
}
```
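
An envelope in this shape can be compared mechanically with the scenario's `context_required` list from the test definition. The sketch below is one possible check, with a hypothetical helper name. Taken at face value, the sample values in this runbook list `dispute_amount` as required, but it does not appear in `context_fields_sent`, so this check would flag the sample envelope. Whether that gap is intentional is not stated.

```python
# From the scenario's expected_handoff.context_required list.
context_required = [
    "caller_id",
    "verified_account_id",
    "issue_category",
    "dispute_amount",
    "summary",
    "next_action",
]

# The "handoff" object from the sample envelope.
handoff = {
    "mode": "warm_transfer",
    "expected_destination": "queue_billing_priority_2",
    "actual_destination": "queue_billing_priority_2",
    "context_fields_sent": [
        "caller_id",
        "verified_account_id",
        "issue_category",
        "summary",
        "next_action",
    ],
    "bridge_status": "connected",
    "receipt_id": "transfer_receipt_77",
}


def envelope_failures(required: list[str], handoff: dict) -> list[str]:
    """Return mismatches between the contract and the recorded handoff."""
    failures = []
    if handoff["actual_destination"] != handoff["expected_destination"]:
        failures.append("destination mismatch")
    missing = [f for f in required if f not in handoff["context_fields_sent"]]
    if missing:
        failures.append("missing context fields: " + ", ".join(missing))
    if handoff["bridge_status"] != "connected":
        failures.append("bridge not connected")
    if not handoff.get("receipt_id"):
        failures.append("no receipt")
    return failures


print(envelope_failures(context_required, handoff))
# ['missing context fields: dispute_amount']
```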

Redact sensitive fields before this leaves your system. The reviewer does not need raw account data or the full transcript. They need enough structure to know whether the handoff matched policy.
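
A small recursive redactor is usually enough for this. The sketch below masks values while leaving the keys and nesting intact, so a reviewer can still see which fields were present. The set of sensitive keys and the example envelope values are assumptions; derive the real list from your own data classification.

```python
from typing import Any

# Assumed key names; replace with the fields your policy treats as sensitive.
SENSITIVE_KEYS = frozenset({"caller_id", "verified_account_id", "transcript"})


def redact(value: Any, sensitive: frozenset = SENSITIVE_KEYS) -> Any:
    """Return a copy with sensitive values masked and structure preserved."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key in sensitive else redact(item, sensitive)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, sensitive) for item in value]
    return value


# Hypothetical envelope fragment that carries raw values.
raw = {
    "call_id": "call_fixture_442",
    "context": {"caller_id": "+15550100", "summary": "Disputed invoice"},
    "turns": [{"speaker": "caller", "transcript": "I want a person."}],
}
print(redact(raw))
```

The function builds new containers instead of mutating its input, so the unredacted envelope can stay inside your system while the redacted copy is exported.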

For packaging this kind of evidence across teams, use the call evidence export runbook. For joining telephony and IVR events to a single call, use the IVR log correlation runbook.

Decide What Runs in CI Versus Pre-Release

Do not put every phone-path transfer in CI. You will create slow, flaky, expensive gates.

Block CI on deterministic logic and fixture-backed routing. Run telephony-heavy and live destination checks as pre-release or scheduled suites.

| Gate | Run when | Recommended size | Blocks merge? |
|---|---|---|---|
| Routing policy tests | Prompt, policy, tenant rule, or escalation logic changes | 8-20 fixture cases | Yes |
| Context payload tests | Summary, extracted fields, or destination contract changes | 5-12 cases | Yes |
| Mocked transfer tool tests | Tool schema, function name, destination argument, or fallback changes | 5-15 cases | Yes |
| Sandbox queue tests | Queue/task integration or CRM routing changes | 3-8 cases | Yes for critical workflows |
| Phone-path warm transfer tests | Telephony provider, SIP, queue, IVR, or human bridge path changes | 2-5 calls | Pre-release |
| Live scoped transfer checks | Production-only routing or provider behavior | 1-3 allowlisted runs | Release owner decision |
| Production monitoring | Continuous transfer quality tracking | Sample or 100% metadata | Alert, do not block CI |

The rule of thumb: if a failed handoff can affect account access, healthcare decisions, payments, legal state, safety, or a high-value customer, it deserves blocking coverage somewhere. If it only checks a rarely used queue variant, schedule it and alert the owner.
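
If scenarios are tagged with the kinds of harm a failed handoff could cause, that rule of thumb reduces to a set intersection. The tag names and gate labels below are hypothetical; the sketch only shows the shape of the decision, not where in your pipeline the blocking coverage should live.

```python
# Assumed impact tags mirroring the categories named in the rule of thumb.
HIGH_IMPACT = frozenset({
    "account_access",
    "healthcare_decision",
    "payment",
    "legal_state",
    "safety",
    "high_value_customer",
})


def gate_for(impacts: set) -> str:
    """Blocking coverage if any high-impact tag applies, else schedule and alert."""
    return "blocking" if HIGH_IMPACT & set(impacts) else "scheduled_with_alert"


print(gate_for({"payment"}))             # blocking
print(gate_for({"rare_queue_variant"}))  # scheduled_with_alert
```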

When a handoff fails in production, add it to the failed production call regression runbook. A transfer bug that escapes once should not rely on memory next time.

Provider and Runtime Caveats

Provider docs are useful, but they are not interchangeable. Test the specific surface you deploy.

| Surface | Useful public behavior | Test implication |
|---|---|---|
| LiveKit agent handoffs | Agent handoffs and supervisor/task patterns can model different control-transfer shapes | Test whether control truly changes hands or a supervisor remains in charge |
| Vapi dynamic transfers | Server-controlled destination selection can use conversation and customer context | Assert request payload, routing decision, control action, and transfer update |
| Retell transfer tools | Phone-call transfers can be cold or warm, with human-detection and transfer settings | Test phone-call-only assumptions, no-answer, human detection, and fallback |
| Retell conversation-flow transfer nodes | Transfer nodes can branch on transfer failure | Assert the failure edge, not just the successful transfer path |
| Twilio Conference | Conferences expose participant, hold, join, leave, and bridge behavior | Assert participant state and conference lifecycle for warm transfers |
| Twilio Flex warm transfer | Warm transfer includes consult, hold/unhold, bridge, and leave steps | Test the consult and bridge sequence, not only final connection |

This is why workflow testing and handoff testing are separate. Workflow tests prove the agent chose the right action. Handoff tests prove the caller landed somewhere useful after that action.

What This Runbook Cannot Prove

Handoff testing does not prove the human resolved the issue, the queue staffing was adequate, or the downstream team had the right policy.

It proves a narrower thing: the voice agent escalated for the right reason, routed to the right destination, included the right context, handled bridge/failure states, and retained enough evidence to debug the result.

Three limitations matter:

| Limitation | Why it matters | Practical response |
|---|---|---|
| Human availability changes | A transfer test can pass while real queues are understaffed | Separate workflow correctness from workforce planning |
| Provider transfer behavior differs | Caller ID, SIP REFER, warm transfer, hold, and no-answer semantics vary | Keep provider-specific tests and read official docs before assuming parity |
| Context quality is subjective | A summary can include required fields but still be hard for a human to use | Add receiver feedback and production monitoring after launch |

The goal is not to automate judgment away. The goal is to remove the easy-to-miss failure modes before a caller finds them.

Voice Agent Handoff Testing FAQ

How do I test a voice agent handoff end to end?

Test the escalation trigger, destination, context payload, bridge state, transfer receipt, fallback path, and post-transfer outcome in one contract. A passing handoff test should prove where the caller went and what context arrived, not just that the agent said "I will transfer you."

What is the difference between handoff testing and transfer testing?

Transfer testing usually checks the telephony or routing event. Handoff testing checks the full operational outcome: why the transfer happened, who received it, what context moved with it, and whether the caller could continue without starting over.

How do I test warm transfers from a voice agent to a human?

Test warm transfers by asserting the consult call, hold state, private summary, receiving participant, bridge/merge event, and original-agent leave behavior. The test should fail if the caller hears private context, if the receiver gets no summary, or if the bridge never completes.

What evidence should a voice agent transfer test save?

Save the run ID, call ID, agent version, escalation reason, routing rule version, transfer mode, expected destination, actual destination, context fields sent, bridge state, receipt ID, fallback result, and post-transfer outcome. Redact sensitive values, but keep enough structure to reproduce the failure.

Should transfer tests run in CI?

Run deterministic routing, context payload, mocked transfer-tool, and fixture-backed queue tests in CI. Keep full phone-path warm transfer, IVR traversal, live scoped destination, and provider-specific checks in pre-release or scheduled suites unless the change touches that specific path.

How do I test that a voice agent escalated to the right queue?

Seed a caller fixture, rule version, issue category, priority, and expected destination before the call. After the call, assert the queue ID, routing reason, priority, context payload, transfer receipt, and no-answer fallback instead of relying on transcript language.

How do I test failed transfers or no-answer paths?

Create a destination fixture that rejects, times out, returns busy, or never answers. The agent should explain the failure, offer an approved fallback such as callback or case creation, and avoid claiming the caller was transferred.
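
The assertions for that path can be sketched as a helper that inspects the recorded outcome of a run against a failing destination fixture. The outcome keys (`bridge_status`, `agent_claimed_transfer`, `failure_explained`, `fallback`) and the fallback names are hypothetical and would come from your own test harness.

```python
# Assumed names for the fallbacks your policy approves.
APPROVED_FALLBACKS = {"offer_callback", "create_case"}


def failed_transfer_failures(outcome: dict) -> list[str]:
    """Check agent behavior after a transfer that did not connect."""
    if outcome.get("bridge_status") == "connected":
        return []  # Not a failed transfer; nothing to check here.
    failures = []
    if outcome.get("agent_claimed_transfer"):
        failures.append("agent claimed the caller was transferred")
    if not outcome.get("failure_explained"):
        failures.append("agent did not explain the failure")
    if outcome.get("fallback") not in APPROVED_FALLBACKS:
        failures.append("no approved fallback offered")
    return failures


# Hypothetical outcome from a run against a never-answers fixture.
outcome = {
    "bridge_status": "no_answer",
    "agent_claimed_transfer": True,
    "failure_explained": False,
    "fallback": None,
}
for failure in failed_transfer_failures(outcome):
    print(failure)
```

Running the same helper against each fixture variant (reject, timeout, busy, never answers) keeps the failure-path coverage explicit.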

What is the most common handoff testing mistake?

The most common mistake is treating a transfer request as proof of a handoff. Hamming recommends failing the test unless the destination, context payload, bridge state, receipt, and fallback behavior are all verified.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on the dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”