How do you test a WebSocket voice agent without a phone number?

Test the WebSocket endpoint directly with known audio fixtures and assert the full event stream: connection, session config, audio acceptance, transcript events, tool calls, response audio, interruption, and close behavior. Hamming recommends starting with 5-10 deterministic fixtures before running full phone, SIP, or WebRTC tests.

What should a WebSocket voice agent test verify?

A WebSocket voice agent test should verify handshake, authentication, audio format, event order, output audio, tool calls, backend side effects, errors, and cleanup. Hamming's checklist treats the endpoint as the system under test, not just a pass-through to the model.

Is WebSocket testing enough before launching a phone-based voice agent?

No. WebSocket testing proves the realtime endpoint and agent logic, but it does not prove SIP routing, carrier audio, caller ID, DTMF, queue transfer, or PSTN recording behavior. Hamming recommends using WebSocket tests as a fast CI gate and full phone-call tests as the release gate for phone-based agents.

How do you test audio quality through a WebSocket endpoint?

Use fixed audio fixtures with known sample rate, channel count, duration, silence, background noise, and self-corrections, then assert transcript quality and event timing from the same files on every run. Hamming recommends including at least 6 fixture classes: clean speech, long silence, self-correction, interruption, background noise, and bad-format audio.

How do you test tool calls and side effects in a WebSocket voice agent?

Mock or sandbox mutating tools, then assert tool name, arguments, order, result handling, idempotency, and the spoken response after the tool result. Hamming recommends failing the test if the agent confirms a side effect without matching sandbox evidence or a verified dry-run result.

What failures are unique to WebSocket voice-agent testing?

Common WebSocket-specific failures include bad auth headers, audio format mismatch, session config racing with audio, out-of-order events, missing final audio markers, duplicate tool calls after reconnect, and playback not stopping on interruption. Hamming recommends recording every inbound and outbound event with a timestamp, sequence number, session ID, and trace ID.

How do you run WebSocket voice-agent tests in CI?

Run a small deterministic suite on every endpoint, prompt, model, transport, or tool-schema change: 3-5 handshake tests, 5-10 audio fixture tests, and critical tool-call contract tests. Hamming recommends saving the transcript, event log, tool trace, audio fixture, and backend trace for every CI failure.

WebSocket Voice Agent Testing: A No-Phone-Number Guide

WebSocket voice agent testing answers a specific question: does the realtime endpoint work when no one is dialing a phone number?

That sounds narrower than full voice-agent QA. It is. That is the point. A phone call can fail because of SIP routing, carrier behavior, WebRTC media, recording, or contact-center configuration. A WebSocket endpoint test strips that away and asks whether your voice agent can accept audio, emit the right events, call tools, stream audio back, handle interruptions, and close cleanly.

WebSocket voice agent testing validates a realtime voice agent through its WebSocket transport: handshake, authentication, audio format, event ordering, response streaming, tool calls, interruption handling, close behavior, and replay evidence.

Quick filter: If your agent is only reachable through a phone number, start with voice agent workflow testing or WebRTC call quality testing. This guide is for teams with a WebSocket, server-side realtime, media-stream, or custom endpoint path they can hit directly.

TL;DR: Test the WebSocket endpoint before you test the phone path:

Prove the handshake, auth, headers, session config, and close codes.

Send known audio fixtures with silence, interruptions, noise, and different durations.

Assert event order: session started, input accepted, transcript produced, tool called, audio streamed, session closed.

Mock mutating tools and verify side effects outside the transcript.

Put the smallest endpoint suite in CI, then reserve full phone-call tests for release gates.

Methodology Note: This guide is based on Hamming's analysis of production voice agent calls from realtime, WebRTC, SIP, and custom endpoint deployments across 10K+ voice agents (2025-2026). Hamming's platform has protected 10M+ minutes. We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions. The guide also uses public provider documentation from Twilio, OpenAI, and LiveKit to ground transport-specific behavior.

We used to treat WebSocket testing as a developer shortcut. Useful, but not production proof. After reviewing workflow-heavy voice-agent failures, we changed our view: WebSocket tests are often the best first gate because they catch deterministic endpoint bugs before carrier and browser media issues make the failure harder to isolate. When a 7.4-second fixture fails at the socket layer, a phone call usually adds cost and ambiguity before it adds proof.

Last Updated: May 2026

Related Guides:

Voice Agent Workflow Testing - prove tool calls, state, side effects, and handoffs
WebRTC Call Quality Testing for Voice Agents - test packet loss, jitter, codec behavior, and browser media
Testing LiveKit Voice Agents - LiveKit-specific test setup
Voice Agent CI/CD Testing - turn endpoint tests into release gates
Voice Agent Observability Tracing - trace ASR, LLM, tool, and TTS stages
OpenTelemetry for AI Voice Agents - model spans and events for voice pipelines
Voice Agent Interruption Handling - test barge-in and recovery policy
Voice Agent Load Testing - scale beyond single endpoint checks
Debugging Voice Agents - investigate missed intents and realtime failures
Voice Agent Production Readiness Checklist - decide what must pass before launch

When Should You Test WebSocket Instead of Phone Calls?

Use WebSocket tests when the risk sits inside your realtime agent endpoint, not the carrier path.

Situation	Use WebSocket Test?	Why
Custom voice agent exposes a server-side realtime endpoint	Yes	You can validate auth, audio format, events, tools, and close behavior directly.
Browser app talks to an agent over WebSocket	Yes	You need to prove the same event contract the frontend depends on.
Twilio Media Streams or similar media server forwards audio to your app	Yes	Test your media server before adding live telephony variance.
LiveKit or WebRTC room is the production entry point	Sometimes	Use WebSocket tests for backend agent logic, then WebRTC tests for media quality.
PSTN caller experience is the launch blocker	No	A WebSocket test will not prove SIP routing, caller ID, carrier audio, DTMF, or transfer behavior.
Contact-center queue routing is the main risk	No	Test the phone or CCaaS path because routing policy is part of the product.

Twilio Media Streams show why this matters. The WebSocket can receive call audio, and bidirectional streams can send audio back into the call. But the stream has direction, track, DTMF, and connection constraints. Your tests should make those constraints visible.

OpenAI's Agents SDK voice guide makes a similar distinction at the SDK layer: WebRTC can handle browser audio for you, while WebSocket setups require your application to manage audio input and output. That is a testable contract.

What Should a WebSocket Voice-Agent Test Prove?

A good WebSocket test proves the realtime path, not just the model response.

Layer	What to Assert	Common Failure
Handshake	URL, protocol, required headers, auth token, tenant/workspace context	Socket opens in dev but fails with production headers.
Session config	Model, voice, language, modalities, VAD, tools, metadata	Audio reaches the model before config is applied.
Audio input	Encoding, sample rate, channel count, chunk size, silence handling	Agent hears garbled audio or endpoints too early.
Event order	Session started before audio, transcript before response, tool result before final answer	Client races ahead and renders stale state.
Output audio	First byte time, chunks emitted, final marker, playback cancellation	Agent speaks over interruption or never flushes final audio.
Tool calls	Tool name, arguments, call ID, result, timeout, retry	Agent says it booked something but no tool result exists.
Close behavior	Normal close, error close, reconnect, cleanup	Failed sessions leave stale locks, rooms, or test data.

This failure mode has a name, the "green socket trap": the WebSocket connects, so everyone assumes the agent works. A connected socket only proves transport reachability. It does not prove the voice pipeline accepted the right audio, produced the right events, or called the right tools.

Event-stream guardrail: a check that the WebSocket emitted the expected event type, in the expected order, with the expected IDs and payload fields. For voice agents, this usually matters more than final transcript text because the event stream drives playback, tool execution, UI state, and debugging.
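
A minimal sketch of an event-order guardrail, assuming the test harness has already collected the event names it saw on the socket. The function name `missing_in_order` is hypothetical, and the event names are the sample names used in this guide, not a specific provider's schema. Extra events between the expected ones (for example, several `transcript.delta` messages) are allowed; only relative order is checked.

```python
from typing import Iterable, Sequence


def missing_in_order(observed: Iterable[str], expected: Sequence[str]) -> list[str]:
    """Return the expected event names that were not seen in order.

    An empty list means every expected event appeared, in the expected
    relative order. Unexpected extra events are ignored.
    """
    remaining = list(expected)
    for name in observed:
        if remaining and name == remaining[0]:
            remaining.pop(0)
    return remaining


# Illustrative event names taken from the sample contract in this guide.
EXPECTED = [
    "session.started",
    "input_audio.accepted",
    "transcript.delta",
    "tool.call",
    "tool.result",
    "output_audio.delta",
    "session.closed",
]
```

A test would fail whenever the returned list is non-empty, and the first item in that list tells you where the stream diverged from the contract.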

The WebSocket Test Contract

Start each test with a small contract. Do not begin with a vague "send audio and see what happens" script.

    WebSocket voice test contract =
      connection target
      + session config
      + audio fixture
      + expected event sequence
      + tool-call guardrails
      + output-audio guardrails
      + cleanup rule

Contract Field	Sample Value
Endpoint	`wss://staging.acme.test/voice-agent/realtime`
Auth	short-lived test token with workspace, agent, and environment claims
Session config	English, phone-support agent, VAD enabled, tools mocked
Audio fixture	16 kHz mono PCM, 7.4 seconds, caller asks to reschedule
Expected events	`session.started` -> `input_audio.accepted` -> `transcript.delta` -> `tool.call` -> `tool.result` -> `output_audio.delta` -> `session.closed`
Tool guardrail	`lookup_appointment` called before `reschedule_appointment`
Output guardrail	First audio byte under 800 ms after final caller speech for this fixture
Cleanup	Delete sandbox appointment and close socket with normal code
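
One way to keep the contract from drifting back into a vague script is to write it down as data. The sketch below is an assumption about shape, not a Hamming or provider API: the class name `VoiceTestContract`, its field names, and the fixture path are all hypothetical. The values mirror the sample table above.

```python
from dataclasses import dataclass


@dataclass
class VoiceTestContract:
    endpoint: str
    auth_claims: list[str]
    session_config: dict
    audio_fixture: str
    expected_events: list[str]
    tool_order: list[str]
    max_first_audio_ms: int
    cleanup: list[str]

    def problems(self) -> list[str]:
        """List the reasons this contract is too vague to run."""
        issues = []
        if not self.endpoint.startswith("wss://"):
            issues.append("endpoint must be a wss:// URL")
        if not self.audio_fixture:
            issues.append("no audio fixture named")
        if not self.expected_events:
            issues.append("expected event sequence is empty")
        if self.max_first_audio_ms <= 0:
            issues.append("first-audio budget must be positive")
        if not self.cleanup:
            issues.append("no cleanup rule declared")
        return issues


contract = VoiceTestContract(
    endpoint="wss://staging.acme.test/voice-agent/realtime",
    auth_claims=["workspace", "agent", "environment"],
    session_config={"language": "en", "vad": True, "tools": "mocked"},
    audio_fixture="fixtures/reschedule_16khz_mono.pcm",  # hypothetical path
    expected_events=[
        "session.started", "input_audio.accepted", "transcript.delta",
        "tool.call", "tool.result", "output_audio.delta", "session.closed",
    ],
    tool_order=["lookup_appointment", "reschedule_appointment"],
    max_first_audio_ms=800,
    cleanup=["delete sandbox appointment", "close socket with normal code"],
)
```

Running `contract.problems()` before opening the socket keeps incomplete contracts from producing ambiguous failures later.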

For OpenAI realtime WebSocket paths, public docs describe custom URLs, headers, low-level events, and server-side audio sending. For LiveKit-based agents, public testing docs describe sessions, workflows, tools, and assertions, which are the equivalent of Hamming Guardrails. Provider names change, but the contract stays the same: connect, configure, stream, observe, assert, clean up.

How to Build Audio Fixtures and Replay Events

Do not test with a microphone first. Microphones add noise, device differences, and human timing. Start with audio fixtures.

Fixture	Include	What It Catches
Clean happy path	clear speech, expected phrase, normal silence	baseline endpoint and event contract
Long silence	2.7 seconds before the user continues	premature endpointing and timeout bugs
Self-correction	"Tuesday, sorry, Thursday"	transcript correction and state update bugs
Barge-in	caller interrupts while agent audio is streaming	playback cancellation and truncation bugs
Background noise	keyboard, traffic, low-volume side speech	false turn starts and noisy transcript bugs
Bad format	wrong sample rate or channel count	missing validation and confusing ASR failures
Early close	client disconnects mid-response	cleanup, retry, and stale session bugs

Keep the fixture library boring. A 6-12 second WAV or PCM fixture with known words is more useful than a clever synthetic conversation that changes every run.
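
A small pre-flight check can confirm that a fixture really has the format the test claims before any audio reaches the socket. This is a sketch using Python's standard `wave` module, assuming fixtures are 16-bit PCM WAV files. The helper names `make_silence_wav` and `fixture_problems` are hypothetical, and the silence generator only stands in for a real recorded fixture.

```python
import io
import wave


def make_silence_wav(seconds: float, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Build an in-memory 16-bit PCM WAV of silence (stand-in for a real fixture)."""
    frames = round(seconds * sample_rate)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * frames * channels)
    return buf.getvalue()


def fixture_problems(
    wav_bytes: bytes,
    sample_rate: int = 16000,
    channels: int = 1,
    min_seconds: float = 6.0,
    max_seconds: float = 12.0,
) -> list[str]:
    """Return format problems for a WAV fixture; an empty list means it is usable."""
    problems = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
        if rate != sample_rate:
            problems.append(f"sample rate is {rate} Hz, expected {sample_rate} Hz")
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channels, expected {channels}")
        if wav.getsampwidth() != 2:
            problems.append("expected 16-bit samples")
        if not min_seconds <= duration <= max_seconds:
            problems.append(f"duration {duration:.1f}s is outside {min_seconds}-{max_seconds}s")
    return problems
```

The same check doubles as a way to confirm that a deliberate bad-format fixture is bad in the way you intended, for example 8 kHz stereo instead of 16 kHz mono.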

Then replay events. The WebSocket test should record every inbound and outbound message with a timestamp, monotonic sequence number, session ID, and trace ID. That lets a failed test answer whether the bug happened before ASR, during model response, inside a tool call, or during audio playback.

    {
      "testRunId": "test_run_2026_05_27_001",
      "eventIndex": 14,
      "direction": "server_to_client",
      "eventName": "tool.call",
      "occurredAtMs": 1842,
      "traceId": "trace_7f3c",
      "payloadShape": {
        "toolName": "reschedule_appointment",
        "callId": "tool_call_03",
        "argumentsHash": "sha256:..."
      }
    }

Store hashes or redacted payloads when the event may contain sensitive data. The goal is replay evidence, not another uncontrolled transcript archive.
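
A recorder along these lines could produce that kind of evidence. It is a simplified sketch, not Hamming's implementation: the class name `EventRecorder` is hypothetical, and it hashes the whole payload into a single `payloadHash` field rather than keeping a per-field `payloadShape` as in the sample record.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field


@dataclass
class EventRecorder:
    test_run_id: str
    trace_id: str
    records: list = field(default_factory=list)
    _start: float = field(default_factory=time.monotonic)

    def record(self, direction: str, event_name: str, payload: dict) -> dict:
        """Append one event with a sequence number, relative time, and payload hash."""
        # Sorting keys makes the hash stable regardless of key order.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        entry = {
            "testRunId": self.test_run_id,
            "eventIndex": len(self.records),  # monotonic sequence number
            "direction": direction,
            "eventName": event_name,
            "occurredAtMs": int((time.monotonic() - self._start) * 1000),
            "traceId": self.trace_id,
            "payloadHash": "sha256:" + hashlib.sha256(canonical.encode()).hexdigest(),
        }
        self.records.append(entry)
        return entry
```

Call `record()` for every message in both directions. Because only the hash is stored, two runs can be compared event by event without keeping raw caller data.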

How to Test Tool Calls and Side Effects

WebSocket tests are especially useful for tool-call verification because you can drive the same request repeatedly.

Guardrail	How to Check It	Fail When
Tool selected	inspect emitted tool-call event or backend trace	wrong tool, missing tool, duplicate tool
Arguments valid	validate against schema and caller fixture	missing required field, wrong date, invented ID
Order correct	compare event sequence	write happens before lookup or confirmation
Result handled	return mocked success, retry, timeout, and error responses	agent ignores failure or confirms stale result
Side effect safe	route write to sandbox, mock, or dry-run endpoint	live record changes during a test
Response aligned	compare spoken output to tool result	agent says a booking succeeded after a failed tool
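
Several of these guardrails can be checked directly against the recorded event log. The sketch below assumes each event is a flat dict with `eventName`, `toolName`, `callId`, and `argumentsHash` keys. That shape is an illustration loosely based on the sample record earlier in this guide, and the function name `tool_call_violations` is hypothetical.

```python
def tool_call_violations(events: list[dict], required_order: list[str]) -> list[str]:
    """Check tool selection, duplicates, order, and result handling."""
    calls = [e for e in events if e["eventName"] == "tool.call"]
    result_ids = {e["callId"] for e in events if e["eventName"] == "tool.result"}
    names = [c["toolName"] for c in calls]
    violations = []

    # Duplicate guardrail: same tool with the same arguments called twice.
    seen = set()
    for call in calls:
        key = (call["toolName"], call["argumentsHash"])
        if key in seen:
            violations.append(f"duplicate call: {call['toolName']}")
        seen.add(key)

    # Selection guardrail: every required tool must be called.
    for name in required_order:
        if name not in names:
            violations.append(f"missing call: {name}")

    # Order guardrail: first calls must follow the required order.
    positions = [names.index(n) for n in required_order if n in names]
    if positions != sorted(positions):
        violations.append("tools called out of order")

    # Result guardrail: every call needs a matching result event.
    for call in calls:
        if call["callId"] not in result_ids:
            violations.append(f"no result for {call['callId']}")
    return violations
```

This covers the event side only. Whether the spoken response matches the tool result, and whether the sandbox actually changed, still need separate checks outside the event log.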

This is where WebSocket testing connects to the broader voice agent workflow testing runbook. The WebSocket test proves the realtime endpoint can trigger and handle the tool path. The workflow test proves the full business action is correct under production-like conditions.

The honest limitation: a WebSocket test cannot prove that a real caller will hear the same audio over a carrier path. It also cannot prove queue routing, caller ID, SIP headers, or DTMF behavior unless your WebSocket endpoint receives those fields from a real integration. Use it as a fast gate, not the only launch gate.

What Belongs in CI?

Put deterministic WebSocket checks in CI because they are cheaper and faster than full phone-call tests.

CI Gate	Run On	Recommended Size	Blocks Merge?
Handshake and auth	endpoint, auth, routing, config changes	3-5 tests	Yes
Audio fixture smoke suite	prompt, model, VAD, transport changes	5-10 fixtures	Yes
Tool-call contract suite	tool schema or workflow change	5-15 workflows	Yes for critical tools
Reconnect and close behavior	transport or session lifecycle changes	3-6 tests	Yes
Longer replay suite	nightly or pre-release	20-50 fixtures	Usually no
Full PSTN/WebRTC tests	release candidate	top revenue and compliance flows	Yes

The voice agent CI/CD testing guide covers the broader release pattern. The WebSocket-specific rule is simple: if the endpoint contract breaks, do not spend money on phone-call tests yet.
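
The "Run On" column can be encoded as a lookup so the pipeline only runs the suites a change can break. This is a sketch under assumptions: the suite names and change-area labels below are hypothetical, and how you derive the changed areas (file paths, labels, commit tags) depends on your repository.

```python
# Merge-blocking gates from the table above, keyed by the change areas that trigger them.
CI_GATES = {
    "handshake_auth": {"endpoint", "auth", "routing", "config"},
    "audio_fixture_smoke": {"prompt", "model", "vad", "transport"},
    "tool_call_contract": {"tool_schema", "workflow"},
    "reconnect_close": {"transport", "session_lifecycle"},
}


def suites_for_change(changed_areas: set[str]) -> list[str]:
    """Return the suites whose triggers overlap the changed areas."""
    return [name for name, triggers in CI_GATES.items() if triggers & changed_areas]
```

For example, a transport change selects both the audio fixture smoke suite and the reconnect and close suite, while a documentation-only change selects nothing.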

For load, start with concurrency at the WebSocket layer before carrier-level load testing. A small run of 25 concurrent sockets can expose session leaks, backpressure, queue saturation, and provider rate-limit behavior. Then use voice agent load testing to test the full media and telephony path.
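
A concurrency smoke run can be sketched with `asyncio` alone. Here `fake_session` is a placeholder that only sleeps; in a real harness it would be replaced by a coroutine that opens one socket, streams a fixture, and returns an outcome label. The function names and outcome labels are assumptions for illustration.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable


async def fake_session(session_id: int) -> str:
    """Placeholder for one WebSocket session; swap in a real client call."""
    await asyncio.sleep(0.01)
    return "closed_normally"


async def run_concurrent(
    session: Callable[[int], Awaitable[str]],
    session_count: int = 25,
    timeout_s: float = 5.0,
) -> dict:
    """Run sessions concurrently and count outcomes instead of stopping at the first failure."""

    async def guarded(i: int) -> str:
        try:
            return await asyncio.wait_for(session(i), timeout=timeout_s)
        except asyncio.TimeoutError:
            return "timed_out"
        except Exception as exc:  # keep the run going and record the error type
            return f"error:{type(exc).__name__}"

    outcomes = await asyncio.gather(*(guarded(i) for i in range(session_count)))
    return dict(Counter(outcomes))
```

Calling `asyncio.run(run_concurrent(fake_session))` returns a count per outcome. Counting outcomes rather than raising on the first error makes partial failures, such as a few sessions timing out under backpressure, visible in one run.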

Troubleshooting WebSocket Voice-Agent Failures

When a WebSocket voice test fails, classify the failure before changing prompts.

Symptom	Likely Layer	First Check	Next Action
Socket fails to open	handshake/auth	URL, token scope, headers, TLS, workspace routing	replay with same headers from CI
Audio accepted but no transcript	audio input or ASR	sample rate, encoding, chunk size, silence, ASR provider	send known-good fixture and compare
Transcript appears but no response	model/session	session config, modalities, response trigger	inspect session update and response event
Tool event missing	model/tool config	tool schema loaded, tool choice, prompt path	assert config before audio starts
Tool called twice	retry/idempotency	timeout, duplicate event, reconnect behavior	add idempotency key and duplicate guardrail
Audio streams but playback is wrong	output audio	chunk order, final marker, local playback buffer	record output chunks and playback boundary
Interruption does not stop audio	turn handling	speech-start event, response cancel, playback truncation	use interruption handling tests
CI flakes	timing/test harness	fixed fixtures, deterministic mocks, timeout budgets	separate endpoint bug from infrastructure noise
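
A first-pass classifier can turn part of this table into code by asking which expected event is the first one missing from the log. This sketch covers only the rows that are visible from event names. The event names are the sample names used in this guide, the function name `classify_failure` is hypothetical, and the "close behavior" label comes from the layer table earlier in the guide, not from this one.

```python
def classify_failure(event_names: list[str], expects_tool: bool = True) -> str:
    """Map the first missing event to the likely layer to inspect."""
    seen = set(event_names)
    if "session.started" not in seen:
        return "handshake/auth"
    if "transcript.delta" not in seen:
        return "audio input or ASR"
    if expects_tool and "tool.call" not in seen:
        return "model/tool config"
    if "output_audio.delta" not in seen:
        return "model/session"
    if "session.closed" not in seen:
        return "close behavior"
    return "no failure detected"
```

Duplicate tool calls, wrong playback, and interruption failures need payload and timing data, so they are left to dedicated guardrails.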

For production debugging, tie every WebSocket test run to trace IDs. The voice agent observability tracing guide and OpenTelemetry guide show how to connect realtime events to backend spans.

Minimum Production-Ready Checklist

The WebSocket endpoint rejects missing, expired, wrong-tenant, and wrong-agent tokens.
Session config is acknowledged before test audio is streamed.
Test fixtures cover clean speech, silence, self-correction, interruption, background noise, bad format, and early close.
Event guardrails check order, IDs, payload shape, and terminal state.
Mutating tools are mocked, sandboxed, or dry-run in automated tests.
Output audio guardrails cover first-byte time, chunk order, final marker, and interruption behavior.
Close and reconnect tests prove cleanup for sessions, locks, rooms, and fixture data.
CI reports link to transcript, event log, tool trace, audio fixture, and backend trace.
Full phone, SIP, or WebRTC tests still run before launch when those paths matter.

WebSocket testing is not a replacement for production voice QA. It is the shortest path to a clean failure.

If a WebSocket fixture fails, fix the endpoint before blaming the phone path. If the WebSocket fixture passes but the phone call fails, move down the stack: media, routing, carrier, DTMF, caller identity, recording, and handoff.

That split saves time. It also keeps teams from tuning prompts when the real bug is a socket, codec, event, or playback issue.

Sumanyu Sharma

Related Resources

Voice Agent Tool Call Contract Testing Template

Long-Call Voice Agent Testing: How to Test 70+ Conversation Turns

Persistent Caller ID Testing for Inbound Voice Agents