A voice agent can be fast and still feel rude. The dashboard says P95 turn latency is healthy, but callers hear the agent cut them off mid-account-number, ignore a correction, or restart after every "uh-huh."
That is why voice agent interruption handling needs its own runbook. Barge-in is not a single setting. It is a policy that decides when caller input should stop agent audio, when it should be treated as a backchannel, when it should be ignored as noise, and what evidence should be logged so QA can replay the decision later.
If you run fewer than 50 production calls a week, keep this simple. Review interrupted calls manually, pick a conservative default, and add 5-10 regression tests. This guide is for teams with enough call volume that interruption failures hide inside aggregate latency, fallback, and completion metrics.
Voice agent interruption handling is the policy and instrumentation layer that decides what happens when a caller speaks, presses DTMF, or triggers a command while the agent is speaking. A production-ready policy records the caller input, agent speech state, interruption decision, playback action, transcript result, and recovery outcome.
Quick filter: If you cannot answer "did the caller intentionally interrupt, or did we fire on noise/backchannel?" from one call record, your interruption handling is not observable enough yet.
TL;DR: Build interruption handling as a runbook, not a toggle:
- Classify the input: true correction, backchannel, accidental noise, DTMF, silence timeout, or safety escalation.
- Log the lifecycle: user speech state, agent speech state, interruption candidate, decision, playback action, transcript outcome, and recovery result.
- Test both sides: false positives cut the agent off; false negatives force callers to wait or repeat themselves.
- Tune by workflow risk: legal disclosures, payment steps, and urgent support paths need different policies than open-ended support chat.
Methodology Note: This runbook is based on Hamming's analysis of 4M+ production voice agent calls across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions. It also uses public provider documentation from LiveKit, OpenAI, Twilio, Amazon Nova, Dialogflow CX, and Agora to ground the turn-detection and event samples.
Last Updated: May 2026
Related Guides:
- Voice AI Latency: What's Fast, What's Slow, and How to Fix It - latency thresholds that interact with turn-taking
- Voice Agent Analytics and Post-Call Metrics - formulas for interruption rate, containment, and task completion
- Voice Agent Observability Tracing - trace the ASR, LLM, tool, and TTS path around an interrupted turn
- OpenTelemetry for AI Voice Agents - span and event modeling for voice pipelines
- IVR and Voice Agent Log Correlation - preserve call IDs across IVR, telephony, and agent sessions
- Debugging Voice Agents - investigate missed intents and fallback spikes
- Testing LiveKit Voice Agents - platform-specific test setup for LiveKit agents
- Voice Agent SLOs and Error Budgets - turn interruption failures into reliability targets
What Is Voice Agent Interruption Handling?
Voice agent interruption handling answers one question: when a caller does something while the agent is speaking, should the agent stop, keep talking, pause and resume, or route the input somewhere else?
The answer changes by context. A caller saying "wait, that's the wrong address" should interrupt. A caller saying "yeah" while listening usually should not. A keypad press during an IVR-like prompt may be intentional DTMF. A loud keyboard click should not cancel TTS.
| Caller Input During Agent Speech | Usually Means | Default Action | Evidence to Keep |
|---|---|---|---|
| "No, I meant Friday" | True correction | Stop playback, accept new turn, preserve partial agent transcript | speech duration, transcript, agent playback position |
| "uh-huh" or "okay" | Backchannel | Continue or briefly acknowledge without cancelling critical audio | utterance text, confidence, backchannel decision |
| DTMF key press | Menu or confirmation action | Stop or route based on prompt policy | digit class, prompt state, expected menu options |
| Short noise or echo | False interruption | Resume playback from safe point | audio energy, no transcript, resume decision |
| Long silence | No input or hesitation | Reprompt, wait, or escalate depending on step | silence duration, timeout policy, next action |
| "I need a human" | Safety or escalation interruption | Stop playback and route to handoff logic | intent, transcript, escalation outcome |
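As a rough illustration of the table above, here is a minimal Python sketch of a first-pass classifier. The phrase lists, the `CallerInput` shape, and the action names are hypothetical placeholders, not part of any provider API. A production system would likely use an intent model instead of keyword sets, and silence timeouts would be handled by a separate timer instead of this function.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class InputClass(Enum):
    CORRECTION = "true_correction"
    BACKCHANNEL = "backchannel"
    DTMF = "dtmf"
    NOISE = "noise"
    ESCALATION = "escalation"


# Hypothetical phrase lists for illustration only.
BACKCHANNELS = {"uh-huh", "okay", "ok", "yeah", "mm-hmm", "right"}
ESCALATION_WORDS = {"human", "representative"}

# Default actions paraphrased from the table above (names are assumptions).
DEFAULT_ACTION = {
    InputClass.CORRECTION: "stop_playback_and_accept_turn",
    InputClass.BACKCHANNEL: "continue_playback",
    InputClass.DTMF: "route_by_prompt_policy",
    InputClass.NOISE: "resume_from_safe_point",
    InputClass.ESCALATION: "stop_playback_and_hand_off",
}


@dataclass
class CallerInput:
    transcript: Optional[str] = None  # None when ASR produced no text
    dtmf_digit: Optional[str] = None


def classify(inp: CallerInput) -> InputClass:
    """Classify input that arrived while the agent was speaking."""
    if inp.dtmf_digit is not None:
        return InputClass.DTMF
    text = (inp.transcript or "").strip().lower().strip(".,!?")
    if not text:
        return InputClass.NOISE  # audio was detected but nothing was transcribed
    if set(re.findall(r"[a-z]+", text)) & ESCALATION_WORDS:
        return InputClass.ESCALATION
    if text in BACKCHANNELS:
        return InputClass.BACKCHANNEL
    return InputClass.CORRECTION  # any other speech is treated as a real turn
```

Escalation is checked before backchannels so that a short "human" is never mistaken for filler. Treating all remaining speech as a correction is the simplest fallback, and it is also the one most likely to need tightening per workflow.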
LiveKit's turn-detection docs split the problem into detection modes, endpointing delay, adaptive interruption handling, and VAD. OpenAI's Realtime VAD docs expose server VAD and semantic VAD settings such as threshold, prefix padding, silence duration, eagerness, and response interruption.
Those provider knobs are useful. They are not the runbook.
Working rule: Turn detection decides when the system thinks speech started or ended. Interruption handling decides what the agent does with that signal while the agent is already speaking.
Why Barge-In Fails in Production
The most common failure is treating barge-in as a boolean. Turn it on and callers can interrupt. Turn it off and they cannot.
Production is messier than that.
| Failure Mode | What the Caller Feels | Root Cause | First Check |
|---|---|---|---|
| False barge-in | Agent keeps stopping for no reason | Noise, echo, short backchannel, overly sensitive VAD | audio energy, transcript presence, false interruption events |
| Missed correction | Caller has to wait, repeat, or hang up | Interruption disabled, threshold too strict, buffered audio dropped | agent speech state, input reporting policy |
| Premature endpointing | Agent answers before caller is done | Silence threshold too short for the workflow | pause duration, partial transcript, phrase completion |
| Backchannel confusion | "okay" becomes a new task | No semantic/backchannel policy | utterance length, words, confidence, next action |
| Lost recovery | Agent stops, then forgets what it already said | Playback truncation not reflected in conversation history | heard-audio boundary, transcript truncation |
| No evidence | QA cannot prove what happened | Missing event taxonomy and call IDs | interruption event lifecycle |
Twilio's ConversationRelay docs show why this needs precision: interruptible controls whether caller input stops TTS playback, while reportInputDuringAgentSpeech controls whether the application receives input while the agent is talking. Those are different decisions. A system can listen without stopping playback, or stop playback without preserving enough application context.
Google Dialogflow CX exposes a similar separation at a different layer: advanced speech settings include end-of-speech sensitivity, smart endpointing, no-speech timeout, barge-in, and partial response cancellation. Amazon Nova Sonic's turn-taking docs make the latency tradeoff explicit with sensitivity levels that wait roughly 1.5, 1.75, or 2.0 seconds before responding.
The practical lesson is boring but important: the best policy is not "always interrupt." It is "interrupt when the user's intent is more important than the current audio, and prove that decision in the logs."
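To make the "listen" versus "stop" distinction concrete, here is a small Python sketch that models the two decisions as independent flags. The class and field names are hypothetical. They mirror the idea behind the Twilio attributes described above, but they are not Twilio's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpeechPolicy:
    stop_playback_on_input: bool  # hypothetical analogue of an "interruptible" flag
    forward_input_to_app: bool    # hypothetical analogue of reporting input during agent speech


def handle_input(policy: SpeechPolicy, transcript: str) -> dict:
    """Return what happens to playback and what the application receives."""
    playback = "stopped" if policy.stop_playback_on_input else "continues"
    app_received: Optional[str] = transcript if policy.forward_input_to_app else None
    return {"playback": playback, "app_received": app_received}


# Listen without stopping: useful for buffering input during a required disclosure.
listen_only = SpeechPolicy(stop_playback_on_input=False, forward_input_to_app=True)
# Stop without context: the audio halts, but the app has nothing to act on.
stop_only = SpeechPolicy(stop_playback_on_input=True, forward_input_to_app=False)
```

The two flags give four combinations, and only some of them are safe for a given message type. That is why a single barge-in boolean cannot express the policy.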
What Events Should a Voice Agent Log for Interruptions?
If you only log the final transcript, you will miss the interruption decision. The evidence is in the timing: when the user started speaking, where the agent was in playback, what the detector decided, and whether the agent recovered.
Use this event taxonomy as the starting point.
| Event | Required Fields | Why It Matters |
|---|---|---|
| user.speech_started | call ID, turn ID, timestamp, audio source, VAD confidence | Shows when the interruption candidate began |
| user.speech_stopped | duration, transcript status, silence duration | Separates real speech from noise |
| agent.speech_started | response ID, playback start, message type | Shows whether the agent was interruptible |
| agent.speech_interrupted | playback position, reason, heard text boundary | Reconstructs what the caller actually heard |
| interruption.candidate_detected | mode, threshold, speech duration, words detected | Explains why the detector fired |
| interruption.decision_made | decision, policy version, confidence, reason | Proves whether the app chose stop, continue, resume, or escalate |
| interruption.recovered | resume position, new user turn ID, task state | Shows whether the conversation repaired cleanly |
| interruption.false_positive | timeout, no transcript, resume behavior | Counts noise/backchannel mistakes separately |
| silence.timeout | elapsed silence, prompt state, next action | Handles no-input paths without mixing them into barge-in |
Twilio's ConversationRelay Insights event reference includes speech events, latency events, interaction events such as interrupt, and an interrupt payload type. Agora's turn-information API exposes turn starts, interrupted turn endings, ignored turns, silence timeouts, and latency segments. Those are useful samples of the evidence families to normalize even if your runtime uses a different provider.
Here is a normalized event envelope you can adapt:
{
  "eventName": "voice.interruption.decision_made",
  "eventVersion": "2026-05-20",
  "occurredAt": "2026-05-20T15:42:18.231Z",
  "canonicalCallId": "call_01JZ9W2M7K",
  "turnId": "turn_0007",
  "agentResponseId": "response_0006",
  "traceId": "9f7c2d4f0f3a4c1e8e4d2a5b7c6f9012",
  "agentSpeech": {
    "state": "speaking",
    "messageType": "billing_summary",
    "interruptible": true,
    "playbackPositionMs": 1840
  },
  "callerInput": {
    "type": "speech",
    "speechDurationMs": 420,
    "transcriptText": "no I meant Friday",
    "isBackchannel": false
  },
  "decision": {
    "action": "stop_agent_audio_and_accept_user_turn",
    "policyVersion": "interruption-policy-2026-05-20",
    "reason": "caller_correction_detected",
    "confidence": 0.87
  },
  "recovery": {
    "agentTranscriptTruncatedAtMs": 1840,
    "newTurnCommitted": true,
    "taskStatePreserved": true
  }
}
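One way to keep that envelope honest is to reject events that are missing the fields QA needs for replay. The Python sketch below checks a parsed event against a list of dotted paths. The choice of which paths count as required is an assumption for illustration, so adjust `REQUIRED_PATHS` to your own schema.

```python
from typing import Any

# Assumed minimum set of fields; not a standard, just a starting point.
REQUIRED_PATHS = [
    "eventName",
    "occurredAt",
    "canonicalCallId",
    "turnId",
    "agentSpeech.state",
    "agentSpeech.playbackPositionMs",
    "callerInput.type",
    "decision.action",
    "decision.policyVersion",
    "recovery.newTurnCommitted",
]


def missing_fields(event: dict[str, Any]) -> list[str]:
    """Return the dotted paths from REQUIRED_PATHS that are absent in event."""
    missing = []
    for path in REQUIRED_PATHS:
        node: Any = event
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing
```

Running `missing_fields` in the ingestion path, and counting how often it returns a non-empty list, gives an early signal that a provider or SDK change silently dropped part of the interruption lifecycle.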
Keep raw transcripts and audio in the right evidence store. For broad dashboards, store pointers, policy versions, and redaction state. The IVR and voice agent log correlation runbook explains how to keep provider IDs and call context attached across the call path.
How to Choose the Right Interruption Policy
The policy should be per message type, not global. A caller should be able to correct an appointment date. They should not accidentally skip a required disclosure because they breathed loudly near the phone.
| Message or Flow Type | Recommended Policy | Why |
|---|---|---|
| Greeting | Speech + DTMF interruption allowed after a short grace period | Callers already know why they called |
| Menu prompt | DTMF and speech allowed, with expected option validation | IVR-style flows depend on early selection |
| Account number or long entity capture | Patient endpointing, avoid early response | Callers pause while reading numbers |
| Legal, consent, or payment disclosure | Non-interruptible or DTMF-only until required content plays | The system may need proof that audio was delivered |
| Open-ended support answer | Adaptive speech interruption with backchannel detection | Callers correct or narrow their request |
| Long tool wait message | Allow interruption and cancellation | Caller may want a human or a different path |
| Escalation handoff | Always allow human-transfer intent | Safety and customer frustration outrank current audio |
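A per-message-type policy like the table above can be sketched as a lookup in Python. The message type keys, the 500 ms grace period, and the fallback policy are all assumptions chosen for illustration, and they are not tied to any provider's configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterruptionPolicy:
    speech_interrupts: bool
    dtmf_interrupts: bool
    grace_period_ms: int = 0  # input before this playback offset never stops audio


# Hypothetical mapping that mirrors a few rows of the table above.
POLICIES = {
    "greeting": InterruptionPolicy(True, True, grace_period_ms=500),
    "menu_prompt": InterruptionPolicy(True, True),
    "legal_disclosure": InterruptionPolicy(False, True),
    "support_answer": InterruptionPolicy(True, False),
    "tool_wait": InterruptionPolicy(True, True),
}

# Assumed fallback for unmapped message types: speech does not cancel audio.
DEFAULT_POLICY = InterruptionPolicy(speech_interrupts=False, dtmf_interrupts=True)


def should_stop_playback(message_type: str, input_kind: str,
                         playback_position_ms: int) -> bool:
    """Decide whether caller input should stop the current agent audio."""
    policy = POLICIES.get(message_type, DEFAULT_POLICY)
    if playback_position_ms < policy.grace_period_ms:
        return False
    if input_kind == "speech":
        return policy.speech_interrupts
    if input_kind == "dtmf":
        return policy.dtmf_interrupts
    return False  # unknown input kinds never stop playback
```

For example, `should_stop_playback("legal_disclosure", "speech", 3000)` returns `False`, while the same call with `"dtmf"` returns `True`. Keeping the mapping in one versioned structure also makes it easy to log a policy version with every decision.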
This is where a voice agent's conversational policy meets reliability. If you track voice agent SLOs, interruption handling should feed at least two reliability signals: task completion after interruption and escalation correctness after interruption.
Provider settings should map to that policy rather than replace it:
| Provider Surface | Useful Knob | What to Decide First |
|---|---|---|
| LiveKit Agents | turn detection mode, endpointing delay, interruption mode, false interruption resume | Is this flow realtime-model driven, STT pipeline driven, or manually controlled? |
| OpenAI Realtime | server VAD vs semantic VAD, threshold, prefix padding, silence duration, eagerness, interrupt response | Should the model decide turn completion, or should the app own it? |
| Twilio ConversationRelay | interruptible, report input during agent speech, interrupt sensitivity, speech timeout, backchannel handling | Do you need to receive caller input without stopping TTS? |
| Dialogflow CX | end-of-speech sensitivity, smart endpointing, no-speech timeout, barge-in | Which flows can be interrupted at agent, flow, page, or fulfillment level? |
| Amazon Nova Sonic | endpointing sensitivity | Are you optimizing for fast Q&A or patient, complex turns? |
| Agora Conversational AI | interrupted, ignored, silence timeout, latency segments | Do you have post-call turn records that explain the outcome? |
We used to think the right answer was mostly latency tuning: shorten the silence window, make the agent snappier, reduce dead air. That helps, but it is not enough. The hard part is distinguishing a correction from a backchannel, then preserving the state needed to recover.
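Preserving recovery state usually starts with knowing what the caller actually heard. The Python sketch below truncates an agent response at the playback position where it was interrupted. It assumes your TTS layer can supply per-word end offsets, which not every provider exposes, and the `TimedWord` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedWord:
    text: str
    end_ms: int  # playback offset at which this word finished playing


def heard_text(words: list[TimedWord], interrupted_at_ms: int) -> str:
    """Return only the words that finished playing before the interruption."""
    return " ".join(w.text for w in words if w.end_ms <= interrupted_at_ms)


response = [
    TimedWord("Your", 300),
    TimedWord("appointment", 900),
    TimedWord("is", 1100),
    TimedWord("on", 1300),
    TimedWord("Thursday", 1900),
]
```

With an interruption at 1840 ms, `heard_text(response, 1840)` yields `"Your appointment is on"`. Writing that truncated text into conversation history, not the full planned response, keeps the model from assuming the caller heard "Thursday".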
How to Test Barge-In, Backchannels, and Silence Timeouts
Do not test "interruption works" as one scenario. Split false positives from false negatives.
| Test Case | Setup | Expected Result | Failure Signal |
|---|---|---|---|
| True correction | Agent reads a date; caller says "no, Friday" after 1 second | Agent stops, accepts correction, preserves task context | Caller repeats same correction or agent continues old path |
| Short backchannel | Caller says "yeah" during a support explanation | Agent continues or acknowledges without losing place | Agent cancels answer and treats "yeah" as new intent |
| Background noise | Keyboard click or side speech during agent answer | Agent continues, logs no transcript or false interruption | Playback stops without meaningful caller transcript |
| DTMF during prompt | Caller presses 2 while menu audio plays | Agent routes to option 2 and logs digit | Digit ignored or transcript path handles it as speech |
| Legal disclosure | Caller speaks during non-interruptible message | Agent continues required audio, optionally buffers input | Required message is skipped |
| Long account number | Caller pauses in the middle of a number | Agent waits, does not respond early | Agent interrupts before entity is complete |
| Silence timeout | Caller says nothing after a question | Agent reprompts or escalates according to policy | Timeout counted as user interruption or hidden in latency |
| Escalation interrupt | Caller says "human" while agent is explaining | Agent stops and starts handoff path | Agent finishes explanation first |
For each test, assert on the same five outcomes:
Test assertion =
interruption decision is correct
AND playback action is correct
AND transcript state is correct
AND task state is preserved
AND recovery result is correct
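The assertion above can be sketched as a field-by-field comparison in Python. The `TurnOutcome` fields and their string values are hypothetical labels, so map them to whatever your test harness records.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TurnOutcome:
    decision: str              # e.g. "stop", "continue", "resume", "escalate"
    playback_action: str       # e.g. "stopped", "continued", "resumed"
    transcript_state: str      # e.g. "truncated_at_heard_boundary", "unchanged"
    task_state_preserved: bool
    recovery_result: str       # e.g. "new_turn_committed", "handoff_started"


def mismatches(expected: TurnOutcome, actual: TurnOutcome) -> list[str]:
    """Return the names of fields where actual differs from expected."""
    return [
        f.name
        for f in fields(TurnOutcome)
        if getattr(expected, f.name) != getattr(actual, f.name)
    ]


def assert_turn(expected: TurnOutcome, actual: TurnOutcome) -> None:
    """Fail with the list of mismatched fields so the cause is visible in CI."""
    bad = mismatches(expected, actual)
    assert not bad, f"interruption test failed on: {', '.join(bad)}"
```

Reporting which of the five conditions failed, instead of a single pass or fail flag, makes it easier to tell a detector problem (wrong decision) from a state problem (wrong transcript or task state).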
The Testing LiveKit Voice Agents guide is a good companion if your runtime is LiveKit. For broader release policy, use Testing Voice Agents for Production Reliability to decide which scenarios block deployment.
How to Tune Thresholds Without Breaking Latency
Tuning interruption handling is a balancing problem. Lower thresholds make the agent feel responsive, but they create false interruptions. Higher thresholds reduce false positives, but callers feel trapped.
Start with a scorecard, not a vibe check.
| Metric | What It Measures | Watch For |
|---|---|---|
| False interruption rate | Agent stopped without meaningful caller input | Noise, echo, backchannel confusion |
| Missed interruption rate | Caller tried to interrupt but agent kept speaking | Threshold too high, reporting disabled, non-interruptible segment too broad |
| Resume success rate | Agent resumes cleanly after false interruption | Broken playback state or transcript truncation |
| Repeated user speech rate | Caller repeats the same correction | Missed interruption or poor recovery |
| Silence after interruption | Dead air after agent stops | State machine did not commit next action |
| Task completion after interruption | Outcome quality for interrupted calls | Recovery path is worse than uninterrupted path |
| Escalation after interruption | Handoff rate after interruption | User frustration or correct safety routing |
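The first two rows of the scorecard can be computed from labeled turns. The Python sketch below assumes each turn has been labeled, for example by QA review, with whether the agent stopped and whether the caller meant to interrupt. The denominators used here (all stops, all intended interruptions) are one reasonable choice and not a standard definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledTurn:
    agent_stopped: bool    # the agent's audio was cut off during this turn
    caller_intended: bool  # reviewer label: the caller meant to interrupt


def interruption_rates(turns: list[LabeledTurn]) -> dict[str, float]:
    """Compute false and missed interruption rates from labeled turns."""
    stops = [t for t in turns if t.agent_stopped]
    intended = [t for t in turns if t.caller_intended]
    false_count = sum(1 for t in stops if not t.caller_intended)
    missed_count = sum(1 for t in intended if not t.agent_stopped)
    return {
        # Share of agent stops that the caller did not intend.
        "false_interruption_rate": false_count / len(stops) if stops else 0.0,
        # Share of intended interruptions where the agent kept speaking.
        "missed_interruption_rate": missed_count / len(intended) if intended else 0.0,
    }
```

Computing both rates from the same labeled set keeps the tradeoff visible: a threshold change that lowers one number while raising the other shows up in a single comparison.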
Then tune one thing at a time:
- Pick one workflow, such as appointment rescheduling or billing lookup.
- Freeze a test set with true corrections, backchannels, noise, long entities, and silence timeouts.
- Change one setting: threshold, silence duration, endpointing sensitivity, backchannel policy, or non-interruptible segment.
- Run the same test set and compare false positives against missed interruptions.
- Review the top 20 production interrupted calls after release.
For analytics, connect these signals to the voice agent metrics dictionary and voice agent dashboard template. For root cause analysis, the voice agent observability tracing guide and OpenTelemetry guide show where to attach stage timings and trace IDs.
Tuning rule: optimize for the caller-visible mistake, not the provider knob. A 300 ms silence change is good only if it reduces bad outcomes without increasing false interruptions in the flows that matter.
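The tuning rule can be written as a small release gate in Python. The sketch assumes you run the same frozen test set against the baseline and the candidate setting and count two things per run. The `RunResult` fields are hypothetical names, and `bad_outcomes` stands for whatever caller-visible failure you are targeting, such as missed corrections.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    bad_outcomes: int         # caller-visible failures the change is meant to reduce
    false_interruptions: int  # agent stops with no meaningful caller input


def accept_change(baseline: RunResult, candidate: RunResult) -> bool:
    """Accept only if bad outcomes drop and false interruptions do not rise."""
    reduces_bad = candidate.bad_outcomes < baseline.bad_outcomes
    no_new_false = candidate.false_interruptions <= baseline.false_interruptions
    return reduces_bad and no_new_false
```

A candidate that trades three fewer missed corrections for two more false interruptions is rejected by this gate. Whether that trade is acceptable in a given flow is a product decision, so teams may prefer a per-workflow tolerance over a strict comparison.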
Rollout Checklist
Before shipping a new interruption policy, make the release owner answer these questions:
- Which message types are interruptible, DTMF-only, or non-interruptible?
- Which detector owns turn completion: VAD, STT endpointing, realtime model, manual control, or a contextual turn detector?
- Are true interruptions, backchannels, noise, long entities, and silence timeouts covered in tests?
- Does the call record show user speech state, agent speech state, decision, playback action, transcript outcome, and recovery result?
- Can QA replay the specific interrupted turn with audio, transcript, and trace pointers?
- Are raw transcripts and audio protected with the same redaction rules as other call evidence?
- Do dashboards separate false interruptions from missed interruptions?
- Is task completion after interruption tracked separately from overall task completion?
- Did the owner review production interrupted calls after the change?
- Is there a rollback plan if false interruptions spike?
One unresolved tension is that the best policy is sometimes less "natural" than a demo. A demo agent that interrupts instantly feels impressive for 3 minutes. A production healthcare, finance, or support agent has to be patient enough for real callers, noisy rooms, account numbers, accents, DTMF, and compliance language.
That is the bar. Do not ship the demo behavior. Ship the behavior you can test, explain, and replay.

