Voice agent handoff testing proves that an AI voice agent escalates, transfers, or hands control to the right destination with the right context, at the right time, and with evidence that the next step actually happened.
If your agent never routes a caller anywhere else, this runbook is too much. Use normal conversation tests.
If your agent can transfer to a human, route to a queue, hand off to a specialist AI agent, leave voicemail, or escalate regulated workflows, transcript-only QA is not enough. The transcript can look clean while the caller lands in the wrong queue, the receiving agent gets no summary, the transfer fails silently, or the AI escalates a case it should have resolved.
We call this transfer theater: the system records a transfer event, but nobody proves whether the right handoff happened.
TL;DR: Test voice-agent handoffs as operational workflows:
- Trigger: Did the caller, policy, or safety condition justify escalation?
- Destination: Did the agent route to the right person, queue, number, SIP endpoint, or specialist agent?
- Context: Did the handoff include caller identity, reason, collected fields, transcript summary, and next action?
- Bridge state: Did the call connect, hold, consult, merge, fail, or fall back as expected?
- Receipt: Did the receiving system or human path acknowledge the handoff?
- Outcome: Did the caller avoid repeating themselves, or did the workflow fail after the transfer event?
Methodology Note: This runbook is based on Hamming's analysis of 4M+ production voice agent calls across 10K+ voice agents (2025-2026) in which escalation, queue routing, human transfer, and workflow state changed the caller outcome. We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions. Use it as a release-safety template for healthcare, financial services, insurance, BPO, QSR, and support workflows where a bad handoff creates customer effort or operational risk.
Last Updated: June 2026
Related Guides:
- Voice Agent Workflow Testing Runbook - broader tool-call, state, and side-effect testing
- Voice Agent Sandbox Testing - prove tool calls and side effects without production writes
- IVR and Voice Agent Log Correlation - connect call IDs, routing events, transcripts, and outcomes
- Voice Agent Call Evidence Export Runbook - package evidence for review
- Customer-Specific Workflow Rules Template - test account, tenant, and rule-specific routing
- Voice Agent Tests as Code - keep routing rules and expected receipts reviewable
- Voice Agent Production Readiness Checklist - launch gates for critical voice workflows
- Voice Agent Incident Response Runbook - respond when handoffs fail in production
- Failed Production Call Regression Runbook - convert escaped handoff bugs into tests
- DTMF Support for Voice Agent Testing - test keypad paths before or after transfer
What Should Handoff Testing Prove?
Handoff testing should prove that the caller moved from one responsible system or person to another without losing the reason for the call.
That means a passing test needs more than "transfer tool called." It needs trigger evidence, routing evidence, context evidence, bridge evidence, and post-transfer evidence.
| Layer | What it proves | Sample failure |
|---|---|---|
| Escalation trigger | The agent escalated only when a caller request, policy, risk, or workflow condition required it | Agent transfers every frustrated caller even when self-service should continue |
| Destination | The selected queue, number, SIP endpoint, human team, or specialist AI agent matches the policy | Billing question goes to technical support |
| Context payload | The receiver gets caller identity, reason, summary, collected fields, and next action | Human answers but asks the caller to repeat everything |
| Bridge state | Hold, consult, merge, cold transfer, warm transfer, no-answer, and cancel states behave correctly | Caller sits on hold after destination fails |
| Receipt | The destination acknowledges the handoff or emits a transfer result | System logs "transfer requested" but not "transfer connected" |
| Fallback | Failed transfers produce a safe next step | Agent says "I transferred you" after no one answered |
Handoff success: the escalation reason, destination, transferred context, bridge state, receipt, and caller-facing outcome all agree. If any one is missing, the handoff is not proven.
In Hamming workflow reviews, we found that expensive handoff failures rarely look dramatic in the transcript. The agent often says the correct sentence. The failure is in the receipt, the queue, the summary, or the transfer target.
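The handoff success rule above can be sketched as a small evaluator. This is a minimal illustration, not Hamming's implementation: the `HandoffEvidence` class, its field names, and `evaluate_handoff` are hypothetical names chosen to mirror the six parts of the rule. The key design choice is that missing evidence counts as a failure, the same as contradicting evidence.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass
class HandoffEvidence:
    """One verdict per layer: True = verified, False = contradicted, None = no evidence."""
    trigger: Optional[bool] = None
    destination: Optional[bool] = None
    context: Optional[bool] = None
    bridge_state: Optional[bool] = None
    receipt: Optional[bool] = None
    outcome: Optional[bool] = None


def evaluate_handoff(evidence: HandoffEvidence) -> Tuple[bool, List[str]]:
    """Return (proven, problems). A layer with no evidence is treated as unproven."""
    problems = []
    for layer in fields(evidence):
        verdict = getattr(evidence, layer.name)
        if verdict is None:
            problems.append(f"{layer.name}: no evidence")
        elif not verdict:
            problems.append(f"{layer.name}: failed")
    return (not problems, problems)


# A call whose transcript looks clean but has no receipt is still unproven.
clean_transcript_only = HandoffEvidence(
    trigger=True, destination=True, context=True, bridge_state=True, outcome=True
)
proven, problems = evaluate_handoff(clean_transcript_only)
# proven is False; problems == ["receipt: no evidence"]
```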
Choose the Handoff Type Before Writing Tests
Do not write one generic "transfer test." The evidence is different for each handoff type.
| Handoff type | Use when | Required evidence | Usually blocks release? |
|---|---|---|---|
| AI-agent handoff | One specialist agent should take over from another | Source agent, destination agent, handoff reason, preserved history, active instructions | Yes for critical workflows |
| Human warm transfer | A human needs context before speaking with the caller | Consult connection, private summary, merge/bridge result, receiving participant | Yes |
| Human cold transfer | The caller can be routed directly without consult | Destination, announcement, transfer result, no-answer fallback | Sometimes |
| Queue transfer | Routing depends on skill, department, region, priority, or tenant | Queue ID, routing rule version, priority, wait behavior, receipt | Yes for support-critical queues |
| IVR or DTMF transfer | Destination requires keypad navigation or phone-tree traversal | Digits sent, menu branch, destination answer, timeout path | Pre-release |
| Voicemail fallback | Human path unavailable but a message should be captured | Voicemail detection, message policy, callback record, follow-up owner | Sometimes |
| No-transfer fallback | Transfer is forbidden or unavailable | Reason not transferred, caller explanation, safe next step | Yes when policy-sensitive |
LiveKit's supervisor pattern documentation distinguishes between keeping one supervisor agent in control and handing control to another agent with different instructions or tools. That distinction matters in tests. A task delegation test is not the same as a full handoff test.
Provider transfer behavior also varies. Retell's transfer docs describe cold and warm transfer modes for phone calls, while Vapi's dynamic transfer docs describe server-controlled destination decisions using call context. Your test should match the specific runtime behavior you use.
Build the Handoff Test Contract
Start with a contract. Then run the call.
handoff_test_contract = caller scenario + escalation trigger + allowed transfer mode + expected destination + required context payload + bridge-state assertions + receipt assertions + fallback rule + evidence retention
| Contract field | Required? | Sample |
|---|---|---|
| Scenario ID | Yes | support_escalation_billing_014 |
| Caller goal | Yes | Caller asks about a disputed invoice and requests a person |
| Trigger condition | Yes | Billing dispute over threshold plus explicit human request |
| Transfer mode | Yes | Warm transfer to billing queue |
| Destination | Yes | queue_billing_priority_2 or approved SIP endpoint |
| Context payload | Yes | Caller ID, account fixture, issue category, amount, summary, next action |
| Forbidden actions | Yes | No transfer to general support; no account update before human review |
| Bridge assertion | Yes | Consult call established, then customer bridged, or fallback fires |
| Receipt assertion | Yes | Transfer event, receiving participant, queue/task ID, or destination ack |
| Fallback rule | Yes | If no answer in 30 seconds, offer callback or case creation |
| Evidence | Yes | Transcript, routing rule, tool trace, transfer events, call IDs, final state |
Keep this next to your tests-as-code definitions. A teammate should be able to review the escalation policy without listening to the whole call.
Copyable Test Definition
```yaml
suite: voice_agent_handoff_testing
owner: voice-platform
environment: staging
scenarios:
  - id: support_escalation_billing_014
    caller_goal: "Dispute an invoice and ask for a person"
    trigger:
      explicit_human_request: true
      issue_category: billing
      amount_disputed_cents: 18420
    expected_handoff:
      mode: warm_transfer
      destination_type: queue
      destination_id: queue_billing_priority_2
      context_required:
        - caller_id
        - verified_account_id
        - issue_category
        - dispute_amount
        - summary
        - next_action
    forbidden:
      - transfer_to_general_support
      - account_credit_before_human_review
      - caller_repeats_identity_after_bridge
    fallback:
      no_answer_after_seconds: 30
      expected_action: offer_callback_or_case_creation
    evidence:
      retain_transcript: true
      retain_transfer_events: true
      retain_routing_rule_version: true
      retain_receipt: true
```
That is more useful than a 1-10 transcript score. It tells you what the agent was allowed to do, where the caller should land, and what proof should exist afterward.
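Before a scenario like the one above runs, it can help to reject definitions that omit part of the contract. The sketch below assumes the scenario has already been loaded into a plain dictionary (for example by a YAML parser of your choice). `validate_scenario` and `REQUIRED_FIELDS` are hypothetical names; the keys mirror the test definition above.

```python
from typing import Any, Dict, List

# Top-level keys taken from the scenario definition above.
REQUIRED_FIELDS = (
    "id", "caller_goal", "trigger", "expected_handoff",
    "forbidden", "fallback", "evidence",
)


def validate_scenario(scenario: Dict[str, Any]) -> List[str]:
    """Return a list of contract problems; an empty list means the scenario is runnable."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in scenario]

    handoff = scenario.get("expected_handoff") or {}
    if not handoff.get("destination_id"):
        errors.append("expected_handoff.destination_id is empty")

    fallback = scenario.get("fallback") or {}
    timeout = fallback.get("no_answer_after_seconds")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, int) or timeout <= 0:
        errors.append("fallback.no_answer_after_seconds must be a positive integer")

    return errors
```

Running this as a pre-step keeps an incomplete contract from silently producing a "pass" because an assertion was never defined.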
Test Escalation Triggers and Forbidden Transfers
Most handoff bugs start before the transfer. The agent either escalates too late, escalates too early, or escalates for the wrong reason.
| Trigger class | Should transfer when | Should not transfer when | Test assertion |
|---|---|---|---|
| Explicit caller request | Caller says "representative," "human," or "transfer me" and policy allows it | Caller asks a normal question with no escalation request | Transfer reason includes explicit request |
| Safety or compliance | Medical, legal, payment, account, or emergency policy requires human review | Low-risk FAQ or read-only status question | Policy rule version is recorded |
| Tool failure | Required lookup, booking, payment, or CRM tool fails after allowed retry | Non-critical tool times out but fallback answer is safe | Agent explains failure and follows approved route |
| Negative sentiment | Caller frustration passes threshold and self-service is failing | Caller shows mild frustration but the task is progressing | Sentiment is supporting evidence, not sole trigger |
| Unsupported intent | Caller asks for something outside scope | Agent can answer safely or gather missing info | Unsupported scope maps to correct fallback |
| VIP or account rule | Tenant/customer rule requires special queue | Generic account has no special routing | Rule fixture and destination match |
The negative cases matter. A voice agent that transfers too easily can destroy containment and create unnecessary queue load. A voice agent that refuses to transfer when policy requires it creates a worse problem.
For customer-specific routing, pair this with the customer workflow rules template. The same caller phrase can require different destinations depending on tenant, region, priority, or regulated workflow.
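The trigger table can be expressed as a decision function that returns a recorded reason, or nothing when the agent should keep going. This is a sketch under stated assumptions: the `CallSignals` fields, the 0-to-1 sentiment scale, and the 0.7 threshold are illustrative and do not come from any provider. Note that sentiment alone never triggers a transfer.

```python
from dataclasses import dataclass
from typing import Optional

SENTIMENT_THRESHOLD = 0.7  # illustrative value, tune per workflow


@dataclass
class CallSignals:
    explicit_human_request: bool = False
    transfer_allowed_by_policy: bool = True
    policy_requires_human: bool = False
    required_tool_failed_after_retry: bool = False
    sentiment_score: float = 0.0  # assumed scale: 0 = calm, 1 = very frustrated
    self_service_failing: bool = False


def escalation_reason(s: CallSignals) -> Optional[str]:
    """Return the escalation reason to record, or None if no transfer is justified."""
    if s.policy_requires_human:
        return "safety_or_compliance"
    if s.explicit_human_request and s.transfer_allowed_by_policy:
        return "explicit_caller_request"
    if s.required_tool_failed_after_retry:
        return "tool_failure"
    # Sentiment is supporting evidence: it only counts when self-service is failing.
    if s.sentiment_score >= SENTIMENT_THRESHOLD and s.self_service_failing:
        return "negative_sentiment"
    return None
```

Negative cases are then ordinary assertions: a frustrated caller whose task is still progressing should yield `None`, and a test fails if a transfer happens anyway.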
Verify Destination, Context, and Bridge State
Transfer tests should inspect the receiving side, not just the sending side.
Twilio's warm transfer documentation describes a consult pattern where the initiating agent talks to the receiving agent before the customer is bridged. Twilio Conference also exposes participant states such as waiting, hold, mute, join, and leave. Those states are testable evidence.
| Evidence | What to check | Fail when |
|---|---|---|
| Destination | Queue, number, SIP URI, assistant, or agent ID | Destination is missing, stale, or not policy-approved |
| Caller context | Verified caller ID, account, preferred language, reason, collected fields | Receiver gets no usable summary |
| Transfer mode | Cold, warm, queue, AI-agent handoff, voicemail fallback | Mode differs from policy |
| Hold and consult | Caller hold state, receiving participant answer, private context delivery | Caller hears private summary or waits after failure |
| Bridge or merge | Both parties connected, original agent leaves or stays as designed | Transfer event fires but no bridge occurs |
| Receipt | Transfer update, conference participant, queue task, destination ack | Only request event exists |
| Post-transfer outcome | Caller continues with preserved context | Caller repeats identity, reason, and details |
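The caller-context row reduces to comparing what the contract requires with what the receiving side actually got. A minimal sketch, assuming the received payload is available as a dictionary; `context_gaps` is a hypothetical helper name. Blank strings count as gaps, since an empty summary is not a usable summary.

```python
from typing import Dict, List


def context_gaps(required: List[str], payload: Dict[str, object]) -> List[str]:
    """Return required fields that are absent or empty in the received payload."""
    gaps = []
    for name in required:
        value = payload.get(name)
        if value is None or (isinstance(value, str) and not value.strip()):
            gaps.append(name)
    return gaps


received = {"caller_id": "caller_fixture_1", "summary": "   "}
missing = context_gaps(["caller_id", "summary", "next_action"], received)
# missing == ["summary", "next_action"]
```

This only checks presence. Whether the summary is actually useful to a human still needs receiver feedback.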
Transfer receipt rule: a handoff test should fail unless the destination emits an acknowledgment or bridge state that proves the caller had somewhere to go.
Some providers expose this as a webhook event. Vapi server events include transfer-related event types such as transfer destination requests and transfer updates. Other stacks require joining telephony logs, queue events, or conference participant state. The data source can vary. The requirement does not.
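Whatever the data source, the receipt rule can be checked over a normalized event list. The sketch below assumes you have already mapped provider webhooks or telephony logs into dictionaries with `type` and `destination` keys. The event type names are hypothetical and are not any provider's schema.

```python
from typing import Dict, Iterable

# Hypothetical normalized event types that prove the caller had somewhere to go.
PROOF_EVENT_TYPES = {"transfer_connected", "destination_ack"}


def transfer_is_proven(events: Iterable[Dict[str, str]], expected_destination: str) -> bool:
    """True only if the expected destination acknowledged or connected the transfer."""
    return any(
        event.get("type") in PROOF_EVENT_TYPES
        and event.get("destination") == expected_destination
        for event in events
    )


request_only = [{"type": "transfer_requested", "destination": "queue_billing_priority_2"}]
connected = request_only + [
    {"type": "transfer_connected", "destination": "queue_billing_priority_2"}
]
# transfer_is_proven(request_only, "queue_billing_priority_2") is False
# transfer_is_proven(connected, "queue_billing_priority_2") is True
```

A request event on its own never satisfies the check, and neither does a connection to a different destination.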
Save an Evidence Envelope for Every Handoff
When a transfer fails in production, the hardest question is usually basic: where did the caller go?
Use an evidence envelope that connects the transcript, routing rule, transfer event, destination, and final state.
```json
{
  "run_id": "handoff_run_2026_06_24_0019",
  "call_id": "call_fixture_442",
  "scenario_id": "support_escalation_billing_014",
  "agent_version": "support-agent-pr-913",
  "escalation": {
    "trigger": "explicit_human_request",
    "policy_rule_version": "routing_rules_2026_06_24",
    "reason": "billing_dispute"
  },
  "handoff": {
    "mode": "warm_transfer",
    "expected_destination": "queue_billing_priority_2",
    "actual_destination": "queue_billing_priority_2",
    "context_fields_sent": [
      "caller_id",
      "verified_account_id",
      "issue_category",
      "summary",
      "next_action"
    ],
    "bridge_status": "connected",
    "receipt_id": "transfer_receipt_77"
  },
  "post_transfer": {
    "caller_repeated_identity": false,
    "receiver_acknowledged_context": true,
    "fallback_used": false
  }
}
```
Redact sensitive fields before this leaves your system. The reviewer does not need raw account data or the full transcript. They need enough structure to know whether the handoff matched policy.
For packaging this kind of evidence across teams, use the call evidence export runbook. For joining telephony and IVR events to a single call, use the IVR log correlation runbook.
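Redaction can preserve the envelope's structure while masking values. A minimal sketch: the set of sensitive key names below is an assumption for illustration, and your own list should come from your data-handling policy. The function returns a new structure and leaves the original untouched.

```python
from typing import Any, FrozenSet

# Illustrative key names only; replace with the fields your policy marks sensitive.
SENSITIVE_KEYS = frozenset({"caller_phone", "account_number", "transcript"})


def redact(value: Any, sensitive_keys: FrozenSet[str] = SENSITIVE_KEYS) -> Any:
    """Recursively copy dicts and lists, masking values stored under sensitive keys."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key in sensitive_keys else redact(item, sensitive_keys)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, sensitive_keys) for item in value]
    return value


envelope = {
    "call_id": "call_fixture_442",
    "caller": {"caller_phone": "+15550100", "language": "en"},
    "handoff": {"bridge_status": "connected"},
}
safe = redact(envelope)
# safe["caller"]["caller_phone"] == "[REDACTED]"; envelope itself is unchanged
```

Keeping the keys in place means a reviewer can still see that a field was sent, which is what the handoff check needs.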
Decide What Runs in CI Versus Pre-Release
Do not put every phone-path transfer in CI. You will create slow, flaky, expensive gates.
Block CI on deterministic logic and fixture-backed routing. Run telephony-heavy and live destination checks as pre-release or scheduled suites.
| Gate | Run when | Recommended size | Blocks merge? |
|---|---|---|---|
| Routing policy tests | Prompt, policy, tenant rule, or escalation logic changes | 8-20 fixture cases | Yes |
| Context payload tests | Summary, extracted fields, or destination contract changes | 5-12 cases | Yes |
| Mocked transfer tool tests | Tool schema, function name, destination argument, or fallback changes | 5-15 cases | Yes |
| Sandbox queue tests | Queue/task integration or CRM routing changes | 3-8 cases | Yes for critical workflows |
| Phone-path warm transfer tests | Telephony provider, SIP, queue, IVR, or human bridge path changes | 2-5 calls | Pre-release |
| Live scoped transfer checks | Production-only routing or provider behavior | 1-3 allowlisted runs | Release owner decision |
| Production monitoring | Continuous transfer quality tracking | Sample or 100% metadata | Alert, do not block CI |
The rule of thumb: if a failed handoff can affect account access, healthcare decisions, payments, legal state, safety, or a high-value customer, it deserves blocking coverage somewhere. If it only checks a rarely used queue variant, schedule it and alert the owner.
When a handoff fails in production, add it to the failed production call regression runbook. A transfer bug that escapes once should not rely on memory next time.
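One way to make the gate decision reviewable is to encode it. The sketch below is one simplified reading of the table and rule of thumb above, not a definitive policy: the function name, the gate labels, and the impact tags are hypothetical.

```python
from typing import Set

# Impact areas taken from the rule of thumb above; tag names are illustrative.
HIGH_IMPACT_AREAS = {
    "account_access", "healthcare", "payments",
    "legal", "safety", "high_value_customer",
}


def choose_gate(needs_live_call: bool, hits_production: bool, impact_areas: Set[str]) -> str:
    """Pick where a handoff test should run."""
    if hits_production:
        return "release_owner_decision"  # live scoped checks are never automatic
    if not needs_live_call:
        return "block_merge"             # deterministic, fixture-backed tests gate CI
    if impact_areas & HIGH_IMPACT_AREAS:
        return "pre_release"             # phone-path test that must pass before release
    return "scheduled"                   # rarely used variant: schedule and alert owner
```

For example, a mocked routing test returns `block_merge`, while a phone-path warm transfer that touches payments returns `pre_release`.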
Provider and Runtime Caveats
Provider docs are useful, but they are not interchangeable. Test the specific surface you deploy.
| Surface | Useful public behavior | Test implication |
|---|---|---|
| LiveKit agent handoffs | Agent handoffs and supervisor/task patterns can model different control-transfer shapes | Test whether control truly changes hands or a supervisor remains in charge |
| Vapi dynamic transfers | Server-controlled destination selection can use conversation and customer context | Assert request payload, routing decision, control action, and transfer update |
| Retell transfer tools | Phone-call transfers can be cold or warm, with human-detection and transfer settings | Test phone-call-only assumptions, no-answer, human detection, and fallback |
| Retell conversation-flow transfer nodes | Transfer nodes can branch on transfer failure | Assert the failure edge, not just the successful transfer path |
| Twilio Conference | Conferences expose participant, hold, join, leave, and bridge behavior | Assert participant state and conference lifecycle for warm transfers |
| Twilio Flex warm transfer | Warm transfer includes consult, hold/unhold, bridge, and leave steps | Test the consult and bridge sequence, not only final connection |
This is why workflow testing and handoff testing are separate. Workflow tests prove the agent chose the right action. Handoff tests prove the caller landed somewhere useful after that action.
What This Runbook Cannot Prove
Handoff testing does not prove the human resolved the issue, the queue staffing was adequate, or the downstream team had the right policy.
It proves a narrower thing: the voice agent escalated for the right reason, routed to the right destination, included the right context, handled bridge/failure states, and retained enough evidence to debug the result.
Three limitations matter:
| Limitation | Why it matters | Practical response |
|---|---|---|
| Human availability changes | A transfer test can pass while real queues are understaffed | Separate workflow correctness from workforce planning |
| Provider transfer behavior differs | Caller ID, SIP REFER, warm transfer, hold, and no-answer semantics vary | Keep provider-specific tests and read official docs before assuming parity |
| Context quality is subjective | A summary can include required fields but still be hard for a human to use | Add receiver feedback and production monitoring after launch |
The goal is not to automate judgment away. The goal is to remove the easy-to-miss failure modes before a caller finds them.
Voice Agent Handoff Testing FAQ
How do I test a voice agent handoff end to end?
Test the escalation trigger, destination, context payload, bridge state, transfer receipt, fallback path, and post-transfer outcome in one contract. A passing handoff test should prove where the caller went and what context arrived, not just that the agent said "I will transfer you."
What is the difference between handoff testing and transfer testing?
Transfer testing usually checks the telephony or routing event. Handoff testing checks the full operational outcome: why the transfer happened, who received it, what context moved with it, and whether the caller could continue without starting over.
How do I test warm transfers from a voice agent to a human?
Test warm transfers by asserting the consult call, hold state, private summary, receiving participant, bridge/merge event, and original-agent leave behavior. The test should fail if the caller hears private context, if the receiver gets no summary, or if the bridge never completes.
What evidence should a voice agent transfer test save?
Save the run ID, call ID, agent version, escalation reason, routing rule version, transfer mode, expected destination, actual destination, context fields sent, bridge state, receipt ID, fallback result, and post-transfer outcome. Redact sensitive values, but keep enough structure to reproduce the failure.
Should transfer tests run in CI?
Run deterministic routing, context payload, mocked transfer-tool, and fixture-backed queue tests in CI. Keep full phone-path warm transfer, IVR traversal, live scoped destination, and provider-specific checks in pre-release or scheduled suites unless the change touches that specific path.
How do I test that a voice agent escalated to the right queue?
Seed a caller fixture, rule version, issue category, priority, and expected destination before the call. After the call, assert the queue ID, routing reason, priority, context payload, transfer receipt, and no-answer fallback instead of relying on transcript language.
How do I test failed transfers or no-answer paths?
Create a destination fixture that rejects, times out, returns busy, or never answers. The agent should explain the failure, offer an approved fallback such as callback or case creation, and avoid claiming the caller was transferred.
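The assertions for a failed-transfer run can be sketched as follows. This assumes your harness summarizes each run into a dictionary. The key names, failure-state labels, and fallback labels are hypothetical, and `agent_claimed_transfer` would come from your own transcript or event analysis.

```python
from typing import Dict, List

FAILURE_STATES = ("rejected", "timeout", "busy", "no_answer")
APPROVED_FALLBACKS = {"offer_callback", "create_case"}


def check_failed_transfer(outcome: Dict[str, object]) -> List[str]:
    """Return problems for a run where the destination fixture was set up to fail."""
    problems = []
    if outcome.get("bridge_status") not in FAILURE_STATES:
        problems.append("fixture did not produce a failure state")
    if not outcome.get("agent_explained_failure"):
        problems.append("agent did not explain the failure")
    if outcome.get("fallback_action") not in APPROVED_FALLBACKS:
        problems.append("no approved fallback offered")
    if outcome.get("agent_claimed_transfer"):
        problems.append("agent claimed the transfer succeeded")
    return problems
```

Running the same check once per failure state (reject, timeout, busy, no answer) keeps each path covered without writing four separate tests.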
What is the most common handoff testing mistake?
The most common mistake is treating a transfer request as proof of a handoff. Hamming recommends failing the test unless the destination, context payload, bridge state, receipt, and fallback behavior are all verified.

