WebSocket Voice Agent Testing: A No-Phone-Number Guide

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

May 27, 2026 · Updated May 27, 2026 · 12 min read

WebSocket voice agent testing answers a specific question: does the realtime endpoint work when no one is dialing a phone number?

That sounds narrower than full voice-agent QA. It is. That is the point. A phone call can fail because of SIP routing, carrier behavior, WebRTC media, recording, or contact-center configuration. A WebSocket endpoint test strips that away and asks whether your voice agent can accept audio, emit the right events, call tools, stream audio back, handle interruptions, and close cleanly.

WebSocket voice agent testing validates a realtime voice agent through its WebSocket transport: handshake, authentication, audio format, event ordering, response streaming, tool calls, interruption handling, close behavior, and replay evidence.

Quick filter: If your agent is only reachable through a phone number, start with voice agent workflow testing or WebRTC call quality testing. This guide is for teams with a WebSocket, server-side realtime, media-stream, or custom endpoint path they can hit directly.

TL;DR: Test the WebSocket endpoint before you test the phone path:

  • Prove the handshake, auth, headers, session config, and close codes.
  • Send known audio fixtures with silence, interruptions, noise, and different durations.
  • Assert event order: session started, input accepted, transcript produced, tool called, audio streamed, session closed.
  • Mock mutating tools and verify side effects outside the transcript.
  • Put the smallest endpoint suite in CI, then reserve full phone-call tests for release gates.

Methodology Note: This guide is based on Hamming's analysis of 4M+ production voice agent calls from realtime, WebRTC, SIP, and custom endpoint deployments across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions.

It also uses public provider documentation from Twilio, OpenAI, and LiveKit to ground transport-specific behavior.

We used to treat WebSocket testing as a developer shortcut. Useful, but not production proof. After reviewing workflow-heavy voice-agent failures, we changed our view: WebSocket tests are often the best first gate because they catch deterministic endpoint bugs before carrier and browser media issues make the failure harder to isolate. When a 7.4-second fixture fails at the socket layer, a phone call usually adds cost and ambiguity before it adds proof.

Last Updated: May 2026

When Should You Test WebSocket Instead of Phone Calls?

Use WebSocket tests when the risk sits inside your realtime agent endpoint, not the carrier path.

| Situation | Use WebSocket Test? | Why |
|---|---|---|
| Custom voice agent exposes a server-side realtime endpoint | Yes | You can validate auth, audio format, events, tools, and close behavior directly. |
| Browser app talks to an agent over WebSocket | Yes | You need to prove the same event contract the frontend depends on. |
| Twilio Media Streams or similar media server forwards audio to your app | Yes | Test your media server before adding live telephony variance. |
| LiveKit or WebRTC room is the production entry point | Sometimes | Use WebSocket tests for backend agent logic, then WebRTC tests for media quality. |
| PSTN caller experience is the launch blocker | No | A WebSocket test will not prove SIP routing, caller ID, carrier audio, DTMF, or transfer behavior. |
| Contact-center queue routing is the main risk | No | Test the phone or CCaaS path because routing policy is part of the product. |

Twilio Media Streams show why this matters. The WebSocket can receive call audio, and bidirectional streams can send audio back into the call. But the stream has direction, track, DTMF, and connection constraints. Your tests should make those constraints visible.

OpenAI's Agents SDK voice guide makes a similar distinction at the SDK layer: WebRTC can handle browser audio for you, while WebSocket setups require your application to manage audio input and output. That is a testable contract.

What Should a WebSocket Voice-Agent Test Prove?

A good WebSocket test proves the realtime path, not just the model response.

| Layer | What to Assert | Common Failure |
|---|---|---|
| Handshake | URL, protocol, required headers, auth token, tenant/workspace context | Socket opens in dev but fails with production headers. |
| Session config | Model, voice, language, modalities, VAD, tools, metadata | Audio reaches the model before config is applied. |
| Audio input | Encoding, sample rate, channel count, chunk size, silence handling | Agent hears garbled audio or endpoints too early. |
| Event order | Session started before audio, transcript before response, tool result before final answer | Client races ahead and renders stale state. |
| Output audio | First byte time, chunks emitted, final marker, playback cancellation | Agent speaks over interruption or never flushes final audio. |
| Tool calls | Tool name, arguments, call ID, result, timeout, retry | Agent says it booked something but no tool result exists. |
| Close behavior | Normal close, error close, reconnect, cleanup | Failed sessions leave stale locks, rooms, or test data. |

We call this failure mode the "green socket trap": the WebSocket connects, so everyone assumes the agent works. A connected socket only proves transport reachability. It does not prove the voice pipeline accepted the right audio, produced the right events, or called the right tools.

Event-stream assertion: a check that the WebSocket emitted the expected event type, in the expected order, with the expected IDs and payload fields. For voice agents, this usually matters more than final transcript text because the event stream drives playback, tool execution, UI state, and debugging.
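
An event-stream assertion of this kind can be sketched in a few lines. This is a minimal illustration, not a provider API: the event names are placeholders, and the helper `assert_event_order` is hypothetical. It treats the expected sequence as an ordered subsequence, so repeated delta frames between the required events are tolerated.

```python
from typing import Iterable, Sequence


def assert_event_order(observed: Iterable[str], expected: Sequence[str]) -> None:
    """Check that `expected` appears in `observed` as an ordered subsequence.

    Extra events (for example repeated transcript.delta frames) may sit
    between the expected ones; missing or out-of-order events fail.
    """
    remaining = iter(observed)
    for name in expected:
        # `name in remaining` consumes the iterator up to the first match,
        # so each later name must appear after the earlier ones.
        if name not in remaining:
            raise AssertionError(f"event {name!r} missing or out of order")


# Placeholder event names; real providers use their own vocabulary.
EXPECTED = [
    "session.started",
    "input_audio.accepted",
    "transcript.delta",
    "tool.call",
    "tool.result",
    "output_audio.delta",
    "session.closed",
]
```

Checking IDs and payload fields would build on the same idea by comparing event dictionaries instead of bare names.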

The WebSocket Test Contract

Start each test with a small contract. Do not begin with a vague "send audio and see what happens" script.

WebSocket voice test contract =
  connection target
  + session config
  + audio fixture
  + expected event sequence
  + tool-call assertions
  + output-audio assertions
  + cleanup rule

| Contract Field | Sample Value |
|---|---|
| Endpoint | wss://staging.acme.test/voice-agent/realtime |
| Auth | short-lived test token with workspace, agent, and environment claims |
| Session config | English, phone-support agent, VAD enabled, tools mocked |
| Audio fixture | 16 kHz mono PCM, 7.4 seconds, caller asks to reschedule |
| Expected events | session.started -> input_audio.accepted -> transcript.delta -> tool.call -> tool.result -> output_audio.delta -> session.closed |
| Tool assertion | lookup_appointment called before reschedule_appointment |
| Output assertion | First audio byte under 800 ms after final caller speech for this fixture |
| Cleanup | Delete sandbox appointment and close socket with normal code |

For OpenAI realtime WebSocket paths, public docs describe custom URLs, headers, low-level events, and server-side audio sending. For LiveKit-based agents, public testing docs describe sessions, workflows, tools, and assertions. Provider names change, but the contract stays the same: connect, configure, stream, observe, assert, clean up.
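
One way to keep the contract explicit is to encode it as data that the test harness refuses to run when fields are missing. The sketch below is an assumption about how a team might structure this; the class name, field names, fixture path, and cleanup step labels are all hypothetical, and the sample values mirror the table above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class VoiceTestContract:
    """One test case: what to connect to, send, expect, and clean up."""
    endpoint: str
    session_config: Dict[str, object]
    audio_fixture: str
    expected_events: Tuple[str, ...]
    tool_order: Tuple[str, ...] = ()
    max_first_audio_ms: int = 800
    cleanup_steps: Tuple[str, ...] = ()

    def problems(self) -> List[str]:
        """Return reasons the contract is too vague to run; empty means OK."""
        issues = []
        if not self.endpoint.startswith(("ws://", "wss://")):
            issues.append("endpoint must be a ws:// or wss:// URL")
        if not self.audio_fixture:
            issues.append("audio fixture is missing")
        if not self.expected_events:
            issues.append("expected event sequence is empty")
        if not self.cleanup_steps:
            issues.append("no cleanup rule defined")
        return issues


contract = VoiceTestContract(
    endpoint="wss://staging.acme.test/voice-agent/realtime",
    session_config={"language": "en", "vad": True, "tools": "mocked"},
    audio_fixture="fixtures/reschedule_7_4s.pcm",  # hypothetical path
    expected_events=(
        "session.started", "input_audio.accepted", "transcript.delta",
        "tool.call", "tool.result", "output_audio.delta", "session.closed",
    ),
    tool_order=("lookup_appointment", "reschedule_appointment"),
    cleanup_steps=("delete_sandbox_appointment", "close_socket_normal"),
)
```

Calling `contract.problems()` before connecting turns a vague test into an explicit failure instead of a confusing socket error later.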

How to Build Audio Fixtures and Replay Events

Do not test with a microphone first. Microphones add noise, device differences, and human timing. Start with audio fixtures.

| Fixture | Include | What It Catches |
|---|---|---|
| Clean happy path | clear speech, expected phrase, normal silence | baseline endpoint and event contract |
| Long silence | 2.7 seconds before the user continues | premature endpointing and timeout bugs |
| Self-correction | "Tuesday, sorry, Thursday" | transcript correction and state update bugs |
| Barge-in | caller interrupts while agent audio is streaming | playback cancellation and truncation bugs |
| Background noise | keyboard, traffic, low-volume side speech | false turn starts and noisy transcript bugs |
| Bad format | wrong sample rate or channel count | missing validation and confusing ASR failures |
| Early close | client disconnects mid-response | cleanup, retry, and stale session bugs |

Keep the fixture library boring. A 6-12 second WAV or PCM fixture with known words is more useful than a clever synthetic conversation that changes every run.
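
A fixture library stays boring more easily when each file is checked before it is streamed. The following sketch uses Python's standard `wave` module to validate sample rate, channel count, sample width, and duration. The helper names are hypothetical, the 16 kHz mono 16-bit target is taken from the sample contract rather than any provider requirement, and the generated tone only stands in for a real recorded phrase.

```python
import io
import math
import struct
import wave
from typing import List


def make_tone_fixture(seconds: float, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Build an in-memory 16-bit PCM WAV holding a quiet 440 Hz tone."""
    frame_count = int(round(seconds * sample_rate))
    samples = bytearray()
    for i in range(frame_count):
        value = int(8000 * math.sin(2 * math.pi * 440 * i / sample_rate))
        samples += struct.pack("<h", value) * channels  # same sample per channel
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(samples))
    return buffer.getvalue()


def check_fixture(
    wav_bytes: bytes,
    sample_rate: int = 16000,
    channels: int = 1,
    min_seconds: float = 6.0,
    max_seconds: float = 12.0,
) -> List[str]:
    """Return format problems for a WAV fixture; empty list means usable."""
    problems = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getframerate() != sample_rate:
            problems.append(f"sample rate {wav.getframerate()} != {sample_rate}")
        if wav.getnchannels() != channels:
            problems.append(f"channels {wav.getnchannels()} != {channels}")
        if wav.getsampwidth() != 2:
            problems.append("expected 16-bit samples")
        duration = wav.getnframes() / wav.getframerate()
        if not min_seconds <= duration <= max_seconds:
            problems.append(f"duration {duration:.1f}s outside allowed range")
    return problems
```

The same check doubles as the expected-failure oracle for the bad-format fixture: a file that fails `check_fixture` should also be rejected or flagged by the endpoint.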

Then replay events. The WebSocket test should record every inbound and outbound message with a timestamp, monotonic sequence number, session ID, and trace ID. That lets a failed test answer whether the bug happened before ASR, during model response, inside a tool call, or during audio playback.

{
  "testRunId": "test_run_2026_05_27_001",
  "eventIndex": 14,
  "direction": "server_to_client",
  "eventName": "tool.call",
  "occurredAtMs": 1842,
  "traceId": "trace_7f3c",
  "payloadShape": {
    "toolName": "reschedule_appointment",
    "callId": "tool_call_03",
    "argumentsHash": "sha256:..."
  }
}

Store hashes or redacted payloads when the event may contain sensitive data. The goal is replay evidence, not another uncontrolled transcript archive.
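
A recorder along these lines can be sketched with the standard library alone. The class name and the exact record layout are assumptions modeled on the JSON sample above; the payload is stored as a list of keys plus a SHA-256 hash so the raw content never lands in the log.

```python
import hashlib
import json
import time
from typing import Callable, Dict, List


class EventRecorder:
    """Records WebSocket messages as redacted, ordered replay evidence."""

    def __init__(self, test_run_id: str, session_id: str, trace_id: str,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.test_run_id = test_run_id
        self.session_id = session_id
        self.trace_id = trace_id
        self._clock = clock
        self._started = clock()
        self.events: List[Dict[str, object]] = []

    def record(self, direction: str, event_name: str, payload: dict) -> dict:
        # Canonical JSON so the same payload always hashes to the same value.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        entry = {
            "testRunId": self.test_run_id,
            "sessionId": self.session_id,
            "traceId": self.trace_id,
            "eventIndex": len(self.events),  # monotonic sequence number
            "direction": direction,
            "eventName": event_name,
            "occurredAtMs": round((self._clock() - self._started) * 1000),
            "payloadShape": {
                "keys": sorted(payload),
                "payloadHash": f"sha256:{digest}",
            },
        }
        self.events.append(entry)
        return entry
```

Injecting the clock keeps the recorder deterministic in unit tests while production runs use the monotonic clock.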

How to Test Tool Calls and Side Effects

WebSocket tests are especially useful for tool-call verification because you can drive the same request repeatedly.

| Assertion | How to Check It | Fail When |
|---|---|---|
| Tool selected | inspect emitted tool-call event or backend trace | wrong tool, missing tool, duplicate tool |
| Arguments valid | validate against schema and caller fixture | missing required field, wrong date, invented ID |
| Order correct | compare event sequence | write happens before lookup or confirmation |
| Result handled | return mocked success, retry, timeout, and error responses | agent ignores failure or confirms stale result |
| Side effect safe | route write to sandbox, mock, or dry-run endpoint | live record changes during a test |
| Response aligned | compare spoken output to tool result | agent says a booking succeeded after a failed tool |

This is where WebSocket testing connects to the broader voice agent workflow testing runbook. The WebSocket test proves the realtime endpoint can trigger and handle the tool path. The workflow test proves the full business action is correct under production-like conditions.
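
Several of the tool-call assertions in the table above can be run against a recorded event log. The sketch below assumes a simplified event shape (`type`, `name`, `call_id`, `arguments`) that is hypothetical and would need mapping from your provider's real events; it covers tool order, required arguments, duplicates, and missing results.

```python
import json
from typing import Dict, List, Sequence


def check_tool_calls(
    events: Sequence[dict],
    required_order: Sequence[str],
    required_args: Dict[str, Sequence[str]],
) -> List[str]:
    """Return human-readable failures; an empty list means the tool path passed."""
    failures: List[str] = []
    calls = [e for e in events if e["type"] == "tool.call"]
    names = [c["name"] for c in calls]

    # Order: every required tool must be called, in the required order.
    last_index = -1
    for tool in required_order:
        if tool not in names:
            failures.append(f"missing tool call: {tool}")
            continue
        index = names.index(tool)
        if index < last_index:
            failures.append(f"{tool} called out of order")
        last_index = max(last_index, index)

    # Arguments: required fields must be present in each call.
    for call in calls:
        for key in required_args.get(call["name"], ()):
            if key not in call["arguments"]:
                failures.append(f"{call['name']} missing argument: {key}")

    # Duplicates: the same tool with identical arguments suggests a retry bug.
    seen = set()
    for call in calls:
        fingerprint = (call["name"], json.dumps(call["arguments"], sort_keys=True))
        if fingerprint in seen:
            failures.append(f"duplicate tool call: {call['name']}")
        seen.add(fingerprint)

    # Results: every call needs a matching result before the agent confirms.
    result_ids = {e["call_id"] for e in events if e["type"] == "tool.result"}
    for call in calls:
        if call["call_id"] not in result_ids:
            failures.append(f"no result for call {call['call_id']}")
    return failures
```

Side-effect safety and response alignment still need checks outside the event log, such as querying the sandbox and comparing the spoken output to the tool result.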

The honest limitation: a WebSocket test cannot prove that a real caller will hear the same audio over a carrier path. It also cannot prove queue routing, caller ID, SIP headers, or DTMF behavior unless your WebSocket endpoint receives those fields from a real integration. Use it as a fast gate, not the only launch gate.

What Belongs in CI?

Put deterministic WebSocket checks in CI because they are cheaper and faster than full phone-call tests.

| CI Gate | Run On | Recommended Size | Blocks Merge? |
|---|---|---|---|
| Handshake and auth | endpoint, auth, routing, config changes | 3-5 tests | Yes |
| Audio fixture smoke suite | prompt, model, VAD, transport changes | 5-10 fixtures | Yes |
| Tool-call contract suite | tool schema or workflow change | 5-15 workflows | Yes for critical tools |
| Reconnect and close behavior | transport or session lifecycle changes | 3-6 tests | Yes |
| Longer replay suite | nightly or pre-release | 20-50 fixtures | Usually no |
| Full PSTN/WebRTC tests | release candidate | top revenue and compliance flows | Yes |

The voice agent CI/CD testing guide covers the broader release pattern. The WebSocket-specific rule is simple: if the endpoint contract breaks, do not spend money on phone-call tests yet.
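
The per-change gates in the table can be expressed as a small lookup so CI runs only the suites a change can break. This is a sketch under assumptions: the gate names and change labels are hypothetical identifiers derived from the table, and how a pipeline labels a change (file paths, PR tags) is left out.

```python
from typing import Dict, List, Set

# Which kinds of change trigger which merge-blocking gate (from the table above).
GATE_TRIGGERS: Dict[str, Set[str]] = {
    "handshake_auth": {"endpoint", "auth", "routing", "config"},
    "audio_fixture_smoke": {"prompt", "model", "vad", "transport"},
    "tool_call_contract": {"tool_schema", "workflow"},
    "reconnect_close": {"transport", "session_lifecycle"},
}


def gates_for(changes: Set[str]) -> List[str]:
    """Return the gates to run for a set of change labels, in table order."""
    return [gate for gate, triggers in GATE_TRIGGERS.items() if triggers & changes]
```

For example, a change labeled `transport` selects both the audio fixture smoke suite and the reconnect and close suite.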

For load, start with concurrency at the WebSocket layer before carrier-level load testing. A small run of 25 concurrent sockets can expose session leaks, backpressure, queue saturation, and provider rate-limit behavior. Then use voice agent load testing to test the full media and telephony path.
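
A small concurrency run like this can be driven with `asyncio`. In the sketch below, `fake_session` is a stand-in for a real connect, stream, assert, and close cycle, and the harness function name is hypothetical. It only shows the fan-out and failure collection; detecting server-side leaks still requires inspecting the endpoint's own session metrics.

```python
import asyncio
from typing import Awaitable, Callable, Dict


async def run_concurrent_sessions(
    session: Callable[[int], Awaitable[None]], count: int = 25
) -> Dict[str, object]:
    """Run `count` sessions at once and collect failures and peak concurrency."""
    in_flight = 0
    peak = 0
    failures = []

    async def one(index: int) -> None:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        try:
            await session(index)
        except Exception as exc:  # keep going so one failure does not hide others
            failures.append((index, repr(exc)))
        finally:
            in_flight -= 1

    await asyncio.gather(*(one(i) for i in range(count)))
    return {"peak": peak, "failures": failures}


async def fake_session(index: int) -> None:
    # Stand-in for: open socket, send fixture, assert events, close cleanly.
    await asyncio.sleep(0.01)
    if index == 7:
        raise ConnectionError("simulated provider rate limit")
```

Running `asyncio.run(run_concurrent_sessions(fake_session))` returns a summary; in a real run, the session callable would wrap the single-socket test you already trust.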

Troubleshooting WebSocket Voice-Agent Failures

When a WebSocket voice test fails, classify the failure before changing prompts.

| Symptom | Likely Layer | First Check | Next Action |
|---|---|---|---|
| Socket fails to open | handshake/auth | URL, token scope, headers, TLS, workspace routing | replay with same headers from CI |
| Audio accepted but no transcript | audio input or ASR | sample rate, encoding, chunk size, silence, ASR provider | send known-good fixture and compare |
| Transcript appears but no response | model/session | session config, modalities, response trigger | inspect session update and response event |
| Tool event missing | model/tool config | tool schema loaded, tool choice, prompt path | assert config before audio starts |
| Tool called twice | retry/idempotency | timeout, duplicate event, reconnect behavior | add idempotency key and duplicate assertion |
| Audio streams but playback is wrong | output audio | chunk order, final marker, local playback buffer | record output chunks and playback boundary |
| Interruption does not stop audio | turn handling | speech-start event, response cancel, playback truncation | use interruption handling tests |
| CI flakes | timing/test harness | fixed fixtures, deterministic mocks, timeout budgets | separate endpoint bug from infrastructure noise |

For production debugging, tie every WebSocket test run to trace IDs. The voice agent observability tracing guide and OpenTelemetry guide show how to connect realtime events to backend spans.
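
The first pass of that classification can be automated from the recorded event names. The sketch below is a rough triage helper, not a diagnosis: the event names are the placeholder vocabulary from the sample contract, the function is hypothetical, and it only covers the symptoms that are visible in the event stream.

```python
from typing import Sequence


def classify_failure(
    event_names: Sequence[str], expects_tool: bool = False, max_tool_calls: int = 1
) -> str:
    """Map a recorded event stream to the likely failing layer, or "ok"."""
    seen = set(event_names)
    if "session.started" not in seen:
        return "handshake/auth"
    if "input_audio.accepted" not in seen:
        return "audio input"
    if "transcript.delta" not in seen:
        return "audio input or ASR"
    has_tool = "tool.call" in seen
    has_audio = "output_audio.delta" in seen
    if not has_tool and not has_audio:
        return "model/session"  # transcript appeared but nothing followed
    if expects_tool and not has_tool:
        return "model/tool config"
    if list(event_names).count("tool.call") > max_tool_calls:
        return "retry/idempotency"
    if not has_audio:
        return "model/session"
    if "session.closed" not in seen:
        return "close behavior"
    return "ok"
```

Attaching the returned label to the CI report points the first check at the right layer before anyone edits a prompt.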

Minimum Production-Ready Checklist

  • The WebSocket endpoint rejects missing, expired, wrong-tenant, and wrong-agent tokens.
  • Session config is acknowledged before test audio is streamed.
  • Test fixtures cover clean speech, silence, self-correction, interruption, bad format, and early close.
  • Event assertions check order, IDs, payload shape, and terminal state.
  • Mutating tools are mocked, sandboxed, or dry-run in automated tests.
  • Output audio assertions cover first-byte time, chunk order, final marker, and interruption behavior.
  • Close and reconnect tests prove cleanup for sessions, locks, rooms, and fixture data.
  • CI reports link to transcript, event log, tool trace, audio fixture, and backend trace.
  • Full phone, SIP, or WebRTC tests still run before launch when those paths matter.
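
The output-audio item in the checklist can be sketched as a single check over the recorded events. Everything here is an assumption for illustration: the event types (`output_audio.delta`, `output_audio.done`, `caller.speech_started`), the `seq` and `at_ms` fields, and the 200 ms interruption grace period are hypothetical, while the 800 ms first-byte budget comes from the sample contract.

```python
from typing import List, Sequence


def check_output_audio(
    events: Sequence[dict],
    caller_speech_end_ms: int,
    max_first_byte_ms: int = 800,
    interrupt_grace_ms: int = 200,
) -> List[str]:
    """Return output-audio failures; an empty list means the checks passed."""
    chunks = [e for e in events if e["type"] == "output_audio.delta"]
    if not chunks:
        return ["no output audio chunks received"]
    failures = []

    # First-byte time: measured from the end of caller speech.
    first_byte_ms = chunks[0]["at_ms"] - caller_speech_end_ms
    if first_byte_ms >= max_first_byte_ms:
        failures.append(f"first audio byte took {first_byte_ms} ms")

    # Chunk order: sequence numbers must strictly increase.
    seqs = [c["seq"] for c in chunks]
    if any(later <= earlier for earlier, later in zip(seqs, seqs[1:])):
        failures.append("audio chunks out of order or duplicated")

    # Interruption: caller speech that starts after agent audio began.
    interrupts = [
        e for e in events
        if e["type"] == "caller.speech_started" and e["at_ms"] > chunks[0]["at_ms"]
    ]
    if interrupts:
        cutoff = interrupts[0]["at_ms"] + interrupt_grace_ms
        if any(c["at_ms"] > cutoff for c in chunks):
            failures.append("audio kept streaming after interruption")
    elif not any(e["type"] == "output_audio.done" for e in events):
        # Only uninterrupted responses are required to end with a final marker.
        failures.append("missing final audio marker")
    return failures
```

The grace period exists because a few already-queued chunks may arrive after the interruption is detected; tune it to your transport rather than treating 200 ms as a standard.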

WebSocket testing is not a replacement for production voice QA. It is the shortest path to a clean failure.

If a WebSocket fixture fails, fix the endpoint before blaming the phone path. If the WebSocket fixture passes but the phone call fails, move down the stack: media, routing, carrier, DTMF, caller identity, recording, and handoff.

That split saves time. It also keeps teams from tuning prompts when the real bug is a socket, codec, event, or playback issue.

Frequently Asked Questions

How do you test a WebSocket voice agent without a phone number?

Test the WebSocket endpoint directly with known audio fixtures and assert the full event stream: connection, session config, audio acceptance, transcript events, tool calls, response audio, interruption, and close behavior. Hamming recommends starting with 5-10 deterministic fixtures before running full phone, SIP, or WebRTC tests.

What should a WebSocket voice agent test verify?

A WebSocket voice agent test should verify handshake, authentication, audio format, event order, output audio, tool calls, backend side effects, errors, and cleanup. Hamming's checklist treats the endpoint as the system under test, not just a pass-through to the model.

Does WebSocket testing replace phone-call testing?

No. WebSocket testing proves the realtime endpoint and agent logic, but it does not prove SIP routing, carrier audio, caller ID, DTMF, queue transfer, or PSTN recording behavior. Hamming recommends using WebSocket tests as a fast CI gate and full phone-call tests as the release gate for phone-based agents.

What audio fixtures should WebSocket voice agent tests use?

Use fixed audio fixtures with known sample rate, channel count, duration, silence, background noise, and self-corrections, then assert transcript quality and event timing from the same files on every run. Hamming recommends including at least 6 fixture classes: clean speech, long silence, self-correction, interruption, background noise, and bad-format audio.

How do you test tool calls through a WebSocket voice agent?

Mock or sandbox mutating tools, then assert tool name, arguments, order, result handling, idempotency, and the spoken response after the tool result. Hamming recommends failing the test if the agent confirms a side effect without matching sandbox evidence or a verified dry-run result.

What are common WebSocket voice agent failures?

Common WebSocket-specific failures include bad auth headers, audio format mismatch, session config racing with audio, out-of-order events, missing final audio markers, duplicate tool calls after reconnect, and playback not stopping on interruption. Hamming recommends recording every inbound and outbound event with a timestamp, sequence number, session ID, and trace ID.

Which WebSocket voice agent tests belong in CI?

Run a small deterministic suite on every endpoint, prompt, model, transport, or tool-schema change: 3-5 handshake tests, 5-10 audio fixture tests, and critical tool-call contract tests. Hamming recommends saving the transcript, event log, tool trace, audio fixture, and backend trace for every CI failure.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”