Voice agent sandbox testing proves that a voice agent can call tools and create the right side effects without touching production systems.
If your agent only answers public FAQs, this is overkill. But if it books appointments, updates a CRM, sends SMS, starts refunds, routes calls, opens tickets, or changes account state, transcript-only testing is not enough.
The failure mode is simple: the call sounds right, the agent says "you're booked," and the real system has no appointment, two appointments, an appointment in the wrong time zone, or a record written under the wrong customer.
TL;DR: Treat every tool call as a side-effect boundary. Decide whether each dependency runs in mock, sandbox, or live mode; seed fixture data; assert tool inputs and outputs; verify the final record state; then clean up by run ID.
A passing transcript is useful evidence. It is not proof that the workflow succeeded.
Methodology Note: This runbook is based on Hamming's analysis of 4M+ production voice agent calls where tool calls, bookings, CRM updates, and workflow state changed the caller experience across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions.
Use it as a release-safety checklist. Regulated workflows, payments, and account changes need stricter approvals than low-risk notification or lookup flows.
Last Updated: May 2026
Related Guides:
- Voice Agent Workflow Testing Runbook - broader workflow and state-transition coverage
- Voice Agent Tests as Code - define side-effect policy in reviewable files
- Voice Agent Caller Identity Testing - prove trusted caller context before account-specific writes
- Voice Agent Testing in CI/CD - connect sandbox tests to release gates
- Voice Agent Production Readiness Checklist - launch gates for critical workflows
- WebSocket Voice Agent Testing - test endpoints before phone-path complexity
- Testing LiveKit Voice Agents - runtime-specific test setup
- Voice Agent Observability and Tracing - evidence that makes side-effect failures debuggable
- IVR Log Correlation Runbook - correlate call IDs, traces, and telephony events
- Questions to Ask Voice Testing Vendors - vendor checklist for trace access and sandbox support
Decide What Can Be Mocked, Sandboxed, or Live-Tested
Start by classifying each dependency. Do not let the agent decide this at runtime.
| Dependency mode | Use it for | Good signal | Release risk |
|---|---|---|---|
| Mock | Fast CI, deterministic failure cases, unit-level tool checks | Agent called the right tool with the right arguments and handled the response | Can miss provider auth, schema, latency, and real side-effect behavior |
| Sandbox | Calendar, CRM, ticketing, telephony, payment, or database integration tests with fixture data | Workflow created the right record in an isolated environment | Requires fixture hygiene, cleanup, and environment drift checks |
| Live scoped | Pre-release checks for routing, telephony, provider limits, and production-only behavior | The real path still works under tight controls | Can mutate live systems if allowlists, quotas, or cleanup are wrong |
Sandbox side-effect test: a voice agent test that lets the agent execute a real integration against isolated fixture data, then checks the durable record state and cleanup result before passing.
The decision is not "realistic or fake." The decision is which layer you are trying to prove.
Use mocks when you need speed and determinism. Use sandboxes when the failure hides in real schemas, auth, state transitions, or provider behavior. Use live scoped checks only when the production surface cannot be represented anywhere else.
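One way to keep this decision out of the agent's hands is a static policy that the test harness consults before any tool executes. The sketch below is a minimal illustration, not a prescribed implementation: the `Mode` enum, the `DEPENDENCY_MODES` mapping, and `resolve_mode` are hypothetical names, and the tool names are borrowed from the booking example used later in this runbook.

```python
from enum import Enum


class Mode(Enum):
    MOCK = "mock"
    SANDBOX = "sandbox"
    LIVE_SCOPED = "live_scoped"


# Hypothetical per-suite policy: tool name -> dependency mode.
# It lives in a reviewable file, so the mode is fixed before the call starts.
DEPENDENCY_MODES = {
    "check_availability": Mode.MOCK,
    "create_booking": Mode.SANDBOX,
    "send_confirmation": Mode.MOCK,
}


def resolve_mode(tool_name, policy=DEPENDENCY_MODES):
    """Return the declared mode for a tool, or fail closed if none is declared."""
    try:
        return policy[tool_name]
    except KeyError:
        raise PermissionError(f"No dependency mode declared for tool {tool_name!r}")
```

Failing closed on an undeclared tool is a design choice in this sketch: a tool added to the agent without a matching policy entry stops the test instead of silently running against a default backend.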
The Sandbox Test Contract
Every sandbox test should name the boundary before the call starts.
Sandbox test contract =
test run ID
+ fixture identity
+ allowed tools
+ allowed dependency mode
+ expected side effect
+ forbidden side effects
+ evidence retention
+ cleanup rule
| Contract field | Required? | Example |
|---|---|---|
| Run ID | Yes | sandbox_run_2026_05_31_0142 |
| Caller fixture | Yes | caller_fixture_appointments_07 |
| Agent target | Yes | Staging agent, prompt version, workflow branch |
| Allowed tools | Yes | lookup_customer, check_availability, hold_slot, create_booking, send_confirmation |
| Dependency mode | Yes | Mock calendar availability, sandbox calendar write, mock SMS send |
| Expected side effect | Yes | Exactly 1 booking event created for fixture customer |
| Forbidden side effects | Yes | No production calendar event, no duplicate event, no live SMS, no CRM write outside fixture account |
| Evidence | Yes | Call ID, transcript, tool trace, request/response payloads, final record query |
| Cleanup | Yes | Delete event by fixture ID and verify deletion |
This belongs near your tests-as-code definitions. The value is reviewability: a teammate can see what the call is allowed to mutate before the test runs.
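As a rough sketch, the contract fields in the table could be captured in a small immutable structure. The class name `SandboxTestContract` and the `check_tool_call` helper are assumptions made for illustration; the field values are copied from the example column above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxTestContract:
    run_id: str
    caller_fixture: str
    agent_target: str
    allowed_tools: frozenset
    dependency_modes: dict
    expected_side_effect: str
    forbidden_side_effects: tuple
    evidence: tuple
    cleanup: str

    def check_tool_call(self, tool_name):
        """Raise if the agent calls a tool that the contract does not allow."""
        if tool_name not in self.allowed_tools:
            raise PermissionError(f"{tool_name!r} is not allowed in run {self.run_id}")


# Example values taken from the contract table; the mode strings are illustrative.
contract = SandboxTestContract(
    run_id="sandbox_run_2026_05_31_0142",
    caller_fixture="caller_fixture_appointments_07",
    agent_target="staging agent, prompt version, workflow branch",
    allowed_tools=frozenset({
        "lookup_customer", "check_availability", "hold_slot",
        "create_booking", "send_confirmation",
    }),
    dependency_modes={
        "check_availability": "mock",
        "create_booking": "sandbox",
        "send_confirmation": "mock",
    },
    expected_side_effect="Exactly 1 booking event created for fixture customer",
    forbidden_side_effects=("production calendar event", "duplicate event", "live SMS"),
    evidence=("call_id", "transcript", "tool_trace", "final_record_query"),
    cleanup="Delete event by fixture ID and verify deletion",
)
```

Freezing the dataclass prevents test code from reassigning contract fields mid-run, which keeps the reviewed definition and the executed definition aligned.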
Calendar Booking Is the Best First Sandbox Test
Calendar booking looks simple and breaks in useful ways.
It touches identity, availability, time zones, tool order, duplicate writes, confirmations, and cleanup. It also gives you a concrete record to verify after the call.
| Test case | Fixture setup | Assertion | Block release when |
|---|---|---|---|
| Happy path booking | Caller has no existing appointment; requested slot is open | 1 event exists with correct start, end, attendee, owner, and fixture tag | Transcript says booked but no event exists |
| Duplicate prevention | Caller repeats the same request or call is retried | Still 1 event exists for the run ID | 2 events are created |
| Time-zone boundary | Caller asks for "Friday at 3" from a known locale | Event stores the intended local time and normalized time zone | Event appears at the wrong local time |
| Conflict handling | Requested slot is unavailable | Agent offers valid alternatives and writes nothing | Agent creates event in a busy slot |
| Identity mismatch | Caller fixture does not match account owner | Agent refuses account-specific booking or requests verification | Event is created under the wrong account |
| Tool timeout | Availability or booking tool times out | Agent gives safe fallback and no partial write remains | Agent claims success after timeout |
| Cleanup failure | Event delete or rollback fails | Test fails with cleanup evidence | CI passes while fixture data remains |
Google Calendar's API treats events as concrete objects with start and end times, attendees, recurrence, and calendar ownership rules. That is why calendar tests should assert the stored event, not just the agent's spoken confirmation.
We used to treat booking tests as conversation tests: did the caller and agent agree on a time? Now we treat them as state-transition tests: did the correct system record change exactly once?
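A state-transition check of this kind might look like the sketch below. It assumes the final record query returns a list of dictionaries with `run_id`, `start`, and `attendee` keys, which is a hypothetical shape rather than any particular calendar API's response. A fixed UTC-07:00 offset stands in for the caller's locale so the example needs no time zone database.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical caller fixture: local offset UTC-07:00, asking for Friday at 3 PM.
CALLER_TZ = timezone(timedelta(hours=-7))
EXPECTED_START = datetime(2026, 6, 5, 15, 0, tzinfo=CALLER_TZ)


def assert_single_booking(events, run_id, expected_start, expected_attendee):
    """Check stored events for one run: exactly one, right instant, right attendee."""
    tagged = [e for e in events if e.get("run_id") == run_id]
    if len(tagged) != 1:
        raise AssertionError(f"expected 1 event for {run_id}, found {len(tagged)}")
    event = tagged[0]
    # Aware datetimes compare by instant, so 15:00-07:00 equals 22:00+00:00.
    if event["start"] != expected_start:
        raise AssertionError(f"wrong start: {event['start'].isoformat()}")
    if event["attendee"] != expected_attendee:
        raise AssertionError(f"wrong attendee: {event['attendee']}")
    return event
```

An event stored as 15:00 UTC fails this check even though the transcript said "Friday at 3", which is the time-zone boundary failure from the table.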
Side-Effect Assertions Need Four Layers
Do not collapse everything into one "success" rubric.
| Assertion layer | What it checks | Example |
|---|---|---|
| Conversation | Caller reached the intended step and understood the outcome | Agent confirmed Friday at 3 PM and explained next steps |
| Tool request | Agent selected the right tool with safe arguments | create_booking used fixture customer ID, not a spoken account ID |
| Tool response | Agent handled the integration result correctly | Booking response returned created, not pending or failed |
| Durable state | The external system ended in the expected state | Calendar contains exactly 1 tagged event for the run ID |
Workflow success: the caller outcome, tool trace, and durable system state all agree. If any one of those disagrees, the test should fail.
This is where many voice-agent QA setups are too thin. They can grade the transcript, but they cannot see the internal execution trace or final database state. If your test vendor cannot see those traces, you need a callback, artifact upload, or post-run verifier that can.
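To keep the four layers from collapsing into one score, a verifier can report each layer separately and pass only when all of them agree. This is a minimal sketch; the function name `evaluate_layers` and the result shape are illustrative assumptions, and each boolean is expected to come from its own check (transcript grader, tool trace, final record query).

```python
def evaluate_layers(conversation_ok, tool_request_ok, tool_response_ok, durable_state_ok):
    """Combine the four assertion layers; any disagreement fails the test."""
    layers = {
        "conversation": conversation_ok,
        "tool_request": tool_request_ok,
        "tool_response": tool_response_ok,
        "durable_state": durable_state_ok,
    }
    failed = [name for name, ok in layers.items() if not ok]
    return {"passed": not failed, "failed_layers": failed}
```

Returning the list of failed layers, instead of a single boolean, makes the "transcript passed but no record exists" case visible in the test report.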
Evidence When Internal Traces Are Private
Some teams cannot expose internal execution traces to a vendor. That is reasonable. It does not mean you should skip side-effect evidence.
Use a redacted evidence envelope.
{
  "run_id": "sandbox_run_2026_05_31_0142",
  "call_id": "call_fixture_883",
  "agent_version": "appointments-agent-pr-128",
  "allowed_tools": ["lookup_customer", "check_availability", "create_booking"],
  "tool_results": [
    {
      "tool": "create_booking",
      "status": "created",
      "fixture_record_id": "booking_fixture_883",
      "idempotency_key": "sandbox_run_2026_05_31_0142:create_booking"
    }
  ],
  "final_state": {
    "booking_count_for_run": 1,
    "duplicate_events": 0,
    "cleanup_status": "verified"
  }
}
The vendor does not need customer PII or raw database rows. It needs enough evidence to know whether the voice test actually changed the expected fixture state.
For trace design, use the voice agent observability guide. For phone or IVR paths, connect the run ID to provider call IDs with the IVR log correlation runbook.
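One way to produce a redacted envelope is to copy only an allowlist of fields out of the raw tool trace, so that anything not explicitly named is dropped. The sketch below assumes raw tool results arrive as dictionaries; `SAFE_TOOL_FIELDS` and `build_evidence_envelope` are hypothetical names, and the output keys mirror the envelope shown in this section.

```python
# Only these keys leave the internal trace; everything else is dropped.
SAFE_TOOL_FIELDS = ("tool", "status", "fixture_record_id", "idempotency_key")


def build_evidence_envelope(run_id, call_id, agent_version, allowed_tools,
                            raw_tool_results, final_state):
    """Build a vendor-safe envelope from an internal tool trace."""
    tool_results = [
        {key: result[key] for key in SAFE_TOOL_FIELDS if key in result}
        for result in raw_tool_results
    ]
    return {
        "run_id": run_id,
        "call_id": call_id,
        "agent_version": agent_version,
        "allowed_tools": list(allowed_tools),
        "tool_results": tool_results,
        "final_state": dict(final_state),
    }
```

An allowlist is used here instead of a blocklist so that a newly added internal field stays private by default.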
Cleanup, Idempotency, and Replay Safety
Sandbox tests fail in boring ways: stale fixture data, duplicate runs, half-cleaned records, and retries without idempotency.
Make these checks blocking.
| Check | Why it matters | Evidence |
|---|---|---|
| Fixture names include run ID | Prevents one test run from matching another run's data | Record IDs or tags include run_id |
| Writes use idempotency keys | Retries should not create duplicate records | Same idempotency key returns same record or no-op |
| Cleanup runs after pass and fail | Failed tests leave the most residue | Cleanup log and final query |
| Cleanup is verified | Delete requests can fail silently | Post-cleanup count is 0 |
| Shared sandbox is reset | Old data causes false positives | Fixture snapshot or reset timestamp |
| Production writes are allowlisted | Live scoped checks need narrow blast radius | Allowlist, owner, quota, and rollback plan |
Replay-safe test: a test that can run twice with the same run ID without creating duplicate side effects or corrupting fixture state.
If a test cannot be replayed safely, do not put it in CI. Run it manually or on a schedule until the dependency supports idempotency, isolation, and verified cleanup.
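The idempotency and verified-cleanup checks can be illustrated with an in-memory stand-in for a sandbox dependency. Everything in this sketch is hypothetical: `FakeBookingStore` is not a real provider client, and a real sandbox would expose its own idempotency and delete semantics that need to be checked against its documentation.

```python
class FakeBookingStore:
    """In-memory stand-in for a sandbox booking system (illustration only)."""

    def __init__(self):
        self._by_key = {}

    def create(self, idempotency_key, run_id, payload):
        # A retry with the same key returns the existing record instead of writing again.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        record = {"id": f"booking_{len(self._by_key) + 1}", "run_id": run_id, **payload}
        self._by_key[idempotency_key] = record
        return record

    def count_for_run(self, run_id):
        return sum(1 for r in self._by_key.values() if r["run_id"] == run_id)

    def delete_for_run(self, run_id):
        self._by_key = {k: r for k, r in self._by_key.items() if r["run_id"] != run_id}


def run_with_verified_cleanup(store, run_id, test_body):
    """Run the test body, then clean up on pass or fail and verify the result."""
    try:
        test_body()
    finally:
        store.delete_for_run(run_id)
        remaining = store.count_for_run(run_id)
        if remaining != 0:
            # Raising here fails the run even if the test body passed.
            raise RuntimeError(f"cleanup left {remaining} record(s) for {run_id}")
```

In this sketch the test body is expected to assert the record count before cleanup runs, so that deleting records cannot hide a duplicate write.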
Provider and Runtime Notes
Public provider surfaces keep moving, so test against the exact feature you use.
| Surface | Useful testing behavior | What to verify |
|---|---|---|
| ElevenLabs agent testing | Official docs describe scenario tests, tool-call tests, simulation tests, dynamic variables, and tool mocking | Parameter validation, mocked tool fallback behavior, and whether system/workflow tools are mockable in your setup |
| LiveKit Agents | The runtime exposes tool, workflow, task, RPC, and testing/evaluation concepts | Room/session ID, tool trace, client RPC path, and async tool completion |
| Vapi voice testing | Voice tests use simulated phone conversations, scripts, recordings/transcripts, and rubric assessment | Whether the phone-path test also proves your external side effect |
| Twilio test credentials | Supported API paths can be exercised without charges or live account state changes, with documented limitations | Which endpoints are covered and which callbacks or downstream effects are not triggered |
| Calendar APIs | Events have owners, attendees, time zones, recurrence, and deletion behavior | Exact event state, duplicate prevention, timezone normalization, and cleanup |
Use provider docs for constraints, but keep your test contract provider-neutral. The core question does not change: what did the agent try to do, what did the dependency return, and what state exists now?
What Belongs in CI?
Put the smallest deterministic suite in CI. Push expensive or phone-path-heavy checks to scheduled runs unless they protect a launch-critical change.
| Gate | Run when | Recommended size | Blocks merge? |
|---|---|---|---|
| Tool schema tests | Tool contract, prompt, or orchestration changes | 5-15 mocked cases | Yes |
| Sandbox side-effect tests | Booking, CRM, ticketing, account, or database workflow changes | 3-8 fixture-backed cases | Yes for critical flows |
| Phone-path workflow tests | Provider, telephony, handoff, or audio-path changes | 2-5 calls | Usually pre-release |
| Live scoped checks | Production-only routing or provider behavior | 1-3 allowlisted runs | Release owner decision |
| Production sampling | Continuous monitoring | 1-5% of eligible calls | No, alert on drift |
Tie this back to the production readiness checklist: if the workflow can change money, appointments, account access, healthcare data, insurance data, or legal state, it deserves a sandbox side-effect gate before launch.
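The gate table can also be expressed as data so that a CI script chooses suites from the areas a change touches. This is a sketch under assumptions: the gate names, trigger labels, and `select_gates` function are hypothetical, and the "Yes for critical flows" nuance is simplified to a plain boolean.

```python
# Hypothetical encoding of the first three rows of the gate table.
GATES = [
    {"name": "tool_schema_tests",
     "triggers": {"tool_contract", "prompt", "orchestration"},
     "blocks_merge": True},
    {"name": "sandbox_side_effect_tests",
     "triggers": {"booking", "crm", "ticketing", "account", "database"},
     "blocks_merge": True},  # simplification: the table says "Yes for critical flows"
    {"name": "phone_path_workflow_tests",
     "triggers": {"provider", "telephony", "handoff", "audio_path"},
     "blocks_merge": False},  # usually a pre-release check
]


def select_gates(changed_areas):
    """Return (gate name, blocks_merge) pairs for every gate a change triggers."""
    changed = set(changed_areas)
    return [(g["name"], g["blocks_merge"]) for g in GATES if g["triggers"] & changed]
```

For example, a change labeled `["prompt", "booking"]` would select the tool schema tests and the sandbox side-effect tests, both as merge-blocking gates.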
What This Runbook Cannot Prove
Sandbox testing proves that the workflow behaves against controlled fixtures. It does not prove that production data is clean, provider limits are identical, or every race condition is gone.
Three limitations matter:
| Limitation | Why it matters | Practical response |
|---|---|---|
| Sandboxes drift from production | Schemas, auth, provider flags, and data quality can differ | Run drift checks and a narrow pre-release live scoped test |
| Mocks can overfit | A mock response may not match provider latency, error shape, or retries | Use mocks for CI speed and sandboxes for integration confidence |
| Cleanup can hide product bugs | Deleting bad records after the test can mask duplicate writes | Assert duplicates before cleanup, then verify cleanup after |
The point is not to make every test realistic. The point is to make each test honest about what it proves.
Voice Agent Sandbox Testing FAQ
What is voice agent sandbox testing?
Voice agent sandbox testing verifies tool calls and durable side effects against isolated fixture data instead of production systems. The test should assert the spoken outcome, tool request, tool response, final record state, and cleanup status.
When should I mock a voice agent tool instead of using a sandbox?
Mock the tool when you need deterministic CI checks, fast failure cases, or validation that the agent selected the right tool with the right arguments. Use a sandbox when you need to prove auth, schemas, provider behavior, retries, time zones, or durable writes.
How do I test calendar booking without creating real appointments?
Create fixture users and a dedicated test calendar, tag every event with a run ID, and verify the final event count, time zone, attendee, owner, and cleanup result. The test should fail if the agent claims success but no event exists, or if a retry creates duplicate events.
What evidence should a side-effect test store?
Store run ID, call ID, agent version, fixture IDs, allowed tools, tool request/response summaries, final state query, assertion results, and cleanup status. Redact sensitive fields, but keep enough structure for an engineer to reproduce the failure.
How do I test tool calls when a vendor cannot see my internal traces?
Emit a redacted evidence envelope after the run. It can include tool names, statuses, fixture record IDs, idempotency keys, final counts, and cleanup status without exposing raw database rows or private customer data.
Which side-effect tests should block a pull request?
Block on critical account, payment, booking, compliance, support-ticket, CRM, and handoff workflows where a bad write would affect a customer. Keep expensive phone-path and live-provider checks scheduled unless the change touches that exact path.
What is the most common sandbox testing mistake?
The most common mistake is treating transcript success as workflow success. A voice agent can say the right thing while writing the wrong record, skipping the write, creating duplicates, or leaving fixture data behind.
How should sandbox tests clean up after failures?
Cleanup should run after both pass and fail, then verify the final state with a post-cleanup query. The run should remain failed if cleanup cannot prove that fixture data was removed or safely isolated.

