Voice agent tests as code means defining test cases in version-controlled files so prompts, personas, call paths, assertions, and expected evidence can be reviewed before they run against production agents.
If your team changes one demo agent by hand once a month, this is probably too much process. Use the dashboard, listen to the calls, and keep moving.
This is for teams where voice-agent behavior changes through pull requests: prompt edits, tool schemas, routing rules, language support, compliance scripts, and workflow fixes. Once those changes ship through Git, the tests should live there too.
TL;DR: Put voice agent test cases in YAML, validate them against a schema, run the blocking subset in CI, and store the result with the same commit that changed the prompt or workflow.
A test that only exists in a vendor dashboard can still be useful. It is just hard to review, diff, export, or connect to the code change that made it necessary.
Methodology Note: The template in this guide is based on Hamming's analysis of 4M+ production voice agent calls across 10K+ voice agents (2025-2026). Treat this as a starting contract. Regulated flows, payments, and account changes need stricter approvals and side-effect controls than low-risk FAQ flows.
Last Updated: May 2026
Related Guides:
- Voice Agent Testing in CI/CD - broader release-gate strategy
- Testing Voice Agents for Production Reliability - regression and production-failure conversion
- Voice Agent Workflow Testing Runbook - tool-call and side-effect assertions
- Voice Agent Production Readiness Checklist - launch gates and owners
- Voice Agent Response Coverage - find gaps from production calls
- Testing LiveKit Voice Agents - runtime-specific test coverage
- Questions to Ask Voice Testing Vendors - exportability and CI/CD vendor checks
- Voice Agent Observability and Tracing - evidence that makes failures debuggable
- Debugging Voice Agents - trace failures after the gate catches them
What Belongs in a Voice Agent Test File
The file should be boring. A reviewer should understand what the call will do, what must pass, what is allowed to touch a real system, and what evidence will be saved.
| Test field | What it captures | Why it belongs in Git |
|---|---|---|
| id | Stable test identifier | Lets failures stay searchable across runs |
| owner | Team or person responsible | Prevents orphaned tests |
| agent_ref | Agent, prompt, branch, or environment | Ties the test to a deployable target |
| persona | Caller language, accent, goal, constraints | Makes caller coverage reviewable |
| setup | Fixture state before the call | Keeps tests reproducible |
| call_path | Inbound, outbound, WebRTC, SIP, or provider mode | Prevents text-only tests from pretending to be voice tests |
| assertions | Outcome, transcript, tool, latency, policy, and side-effect checks | Defines pass/fail before the run starts |
| evidence | Audio, transcript, trace, tool result, dashboard link | Makes failures debuggable |
| cleanup | How test data is removed or isolated | Prevents synthetic calls from polluting production |
GitHub Actions workflows are YAML files under .github/workflows, and GitLab CI uses YAML for pipeline configuration. That does not mean voice-agent tests must use YAML forever. It means YAML is familiar enough for code review, comments, ownership, and CI wiring.
A Copyable YAML Template
Use this as the starting shape. Keep IDs stable and names human-readable.
```yaml
version: 1
suite: appointment_booking_smoke
owner: growth-eng
agent_ref:
  environment: staging
  agent_slug: scheduling-agent
  prompt_version: pr-482
defaults:
  call_path: inbound_phone
  timeout_seconds: 180
  evidence:
    retain_audio: true
    retain_transcript: true
    retain_tool_trace: true
    retain_days: 30
tests:
  - id: booking_reschedule_verified_caller
    title: Reschedule an appointment after identity check
    risk: blocking
    persona:
      language: en-US
      caller_goal: Move my Tuesday appointment to Friday afternoon
      speech_conditions:
        accent: neutral
        background_noise: office
    setup:
      fixtures:
        caller_id: caller_qa_104
        appointment_id: appt_fixture_882
      side_effect_mode: sandbox
    call_script:
      - user: I need to move my appointment from Tuesday to Friday after 2.
      - user: Yes, Friday at 3 works.
    assertions:
      outcome:
        task_completed: true
        no_unplanned_handoff: true
      transcript:
        must_include:
          - Friday at 3
        must_not_include:
          - I cannot access your account
      tools:
        required_order:
          - lookup_identity
          - list_appointments
          - hold_slot
          - update_booking
          - send_confirmation
      side_effects:
        calendar:
          appointment_id: appt_fixture_882
          expected_status: rescheduled
          duplicate_events_allowed: false
      latency:
        turn_p95_ms_max: 1500
    cleanup:
      delete_sandbox_records: true
```
This template is intentionally more specific than a generic "run test" payload. Voice-agent failures hide in the details: wrong fixture, wrong call path, missing tool trace, duplicate calendar write, or a human handoff that technically happened but carried no context.
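To make two of those assertion groups concrete, here is a minimal Python sketch of how a runner might evaluate `tools.required_order` and `latency.turn_p95_ms_max`. The function names, the trace shape (a plain list of tool names), the nearest-rank percentile, and the choice to allow extra tool calls between required ones are all assumptions for illustration, not the behavior of any specific runner.

```python
def tools_called_in_order(required_order, tool_trace):
    """Return True if every required tool appears in the trace, in order.

    Assumption: extra tool calls between required ones are allowed.
    """
    remaining = iter(tool_trace)
    # `name in remaining` consumes the iterator up to the first match,
    # so each required tool must appear after the previous one.
    return all(name in remaining for name in required_order)


def p95_ms(turn_latencies_ms):
    """Nearest-rank 95th percentile of per-turn latencies."""
    if not turn_latencies_ms:
        raise ValueError("no turn latencies recorded")
    ordered = sorted(turn_latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def check_assertions(assertions, tool_trace, turn_latencies_ms):
    """Return a list of human-readable failures; empty means pass."""
    failures = []
    required = assertions.get("tools", {}).get("required_order", [])
    if not tools_called_in_order(required, tool_trace):
        failures.append(
            f"tool order mismatch: expected {required}, got {tool_trace}"
        )
    limit = assertions.get("latency", {}).get("turn_p95_ms_max")
    if limit is not None:
        observed = p95_ms(turn_latencies_ms)
        if observed > limit:
            failures.append(f"turn p95 {observed} ms exceeds {limit} ms")
    return failures
```

Returning a list of failures instead of a single boolean keeps the layers separate in the report: a run can fail on tool order and latency at the same time, and both show up.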
Field-by-Field Review Checklist
Review the test file like production code.
| Review question | Good answer | Block the PR when |
|---|---|---|
| Does the test have an owner? | owner maps to a team or person | Nobody can triage failures |
| Is the target clear? | Environment, agent slug, and prompt version are explicit | Test could run against the wrong agent |
| Is the caller realistic? | Persona includes goal, language, and relevant speech condition | Test only mirrors the happy-path script |
| Are assertions layered? | Outcome, transcript, tool, side-effect, and latency checks are separated | A transcript check stands in for the whole workflow |
| Are writes safe? | side_effect_mode is mock, sandbox, or explicitly allowlisted | CI can touch real customer systems |
| Is evidence retained? | Audio, transcript, tool trace, and run ID are saved | Failure cannot be debugged later |
| Is cleanup defined? | Fixtures reset or expire | Synthetic data leaks into later runs |
The important shift is that QA changes become reviewable. A teammate can ask, "Why is this test non-blocking?" or "Why does this prompt change not add a regression case?" before the agent reaches customers.
Git review rule: a voice agent test is reviewable only when the reviewer can see the caller setup, target agent, assertions, side-effect policy, and retained evidence before the test runs. If those details live only in a dashboard, the PR cannot prove what behavior it is protecting.
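Some of the checklist can run as a lint step before a human opens the PR. The sketch below assumes the suite file has already been parsed into Python dicts, that `evidence` sits under `defaults`, and that `side_effect_mode` sits under each test's `setup`. `lint_suite`, the message wording, and the literal value `allowlisted` are hypothetical; adapt them to your own schema.

```python
# Assumption: these are the only modes a CI run may use.
ALLOWED_SIDE_EFFECT_MODES = {"mock", "sandbox", "allowlisted"}


def lint_suite(suite):
    """Return review-blocking problems for a parsed suite dict."""
    problems = []

    # Owner and target must be explicit so failures can be triaged.
    if not suite.get("owner"):
        problems.append("suite: missing owner")
    agent_ref = suite.get("agent_ref") or {}
    for key in ("environment", "agent_slug", "prompt_version"):
        if not agent_ref.get(key):
            problems.append(f"suite: agent_ref.{key} is not set")

    # Evidence retention must be switched on, not just present.
    evidence = (suite.get("defaults") or {}).get("evidence") or {}
    for key in ("retain_audio", "retain_transcript", "retain_tool_trace"):
        if not evidence.get(key):
            problems.append(f"suite: evidence.{key} is not enabled")

    for test in suite.get("tests") or []:
        test_id = test.get("id", "<no id>")
        mode = (test.get("setup") or {}).get("side_effect_mode")
        if mode not in ALLOWED_SIDE_EFFECT_MODES:
            problems.append(
                f"{test_id}: side_effect_mode {mode!r} is not allowed"
            )
        if not test.get("assertions"):
            problems.append(f"{test_id}: no assertions")
        if not test.get("cleanup"):
            problems.append(f"{test_id}: no cleanup policy")
    return problems
```

A lint like this does not replace the reviewer. It only removes the mechanical questions so the review can focus on whether the persona and assertions are realistic.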
How to Import a Golden Dataset
Most teams start with a spreadsheet. That is fine. The mistake is importing the spreadsheet as loose rows with no schema.
Use an import map:
| Spreadsheet column | YAML field | Required? | Notes |
|---|---|---|---|
| case_id | tests[].id | Yes | Stable, lowercase, no spaces |
| caller_goal | persona.caller_goal | Yes | Plain English job-to-be-done |
| language | persona.language | Yes | Use locale format when possible |
| fixture_customer | setup.fixtures.caller_id | Conditional | Required for account workflows |
| opening_utterance | call_script[0].user | Yes | First caller turn |
| expected_outcome | assertions.outcome | Yes | Convert prose into typed checks |
| must_call_tool | assertions.tools.required_order | Conditional | Required for workflow tests |
| max_latency_ms | assertions.latency.turn_p95_ms_max | Optional | Use only when latency is part of the risk |
| risk | tests[].risk | Yes | blocking, scheduled, or manual |
After import, validate the file. JSON Schema is a good fit because it can require fields, constrain object shapes, and validate nested data. The point is not ceremony. The point is refusing a "golden dataset" where half the rows have no expected outcome.
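The import map can be applied with a small script before schema validation runs. This is a sketch using only the Python standard library, under several assumptions that are not in the map itself: the spreadsheet is exported as CSV, `expected_outcome` holds comma-separated check names that become `true` flags, and `must_call_tool` holds tool names separated by semicolons. `row_to_test` and `import_rows` are hypothetical helper names.

```python
import csv
import io

REQUIRED_COLUMNS = (
    "case_id", "caller_goal", "language",
    "opening_utterance", "expected_outcome", "risk",
)
ALLOWED_RISKS = {"blocking", "scheduled", "manual"}


def cell(row, name):
    """Read a cell as a stripped string; missing cells become ''."""
    return (row.get(name) or "").strip()


def row_to_test(row):
    """Map one spreadsheet row to a test dict, or raise ValueError."""
    case_id = cell(row, "case_id") or "<no id>"
    missing = [c for c in REQUIRED_COLUMNS if not cell(row, c)]
    if missing:
        raise ValueError(f"row {case_id}: missing {missing}")
    if case_id != case_id.lower() or " " in case_id:
        raise ValueError(f"row {case_id}: id must be lowercase, no spaces")
    risk = cell(row, "risk")
    if risk not in ALLOWED_RISKS:
        raise ValueError(f"row {case_id}: unknown risk {risk!r}")

    # Assumption: each comma-separated name is a boolean outcome check.
    outcome = {
        name.strip(): True
        for name in cell(row, "expected_outcome").split(",")
        if name.strip()
    }
    test = {
        "id": case_id,
        "risk": risk,
        "persona": {
            "caller_goal": cell(row, "caller_goal"),
            "language": cell(row, "language"),
        },
        "call_script": [{"user": cell(row, "opening_utterance")}],
        "assertions": {"outcome": outcome},
    }
    # Conditional and optional columns are only emitted when filled in.
    if cell(row, "fixture_customer"):
        test["setup"] = {"fixtures": {"caller_id": cell(row, "fixture_customer")}}
    tools = [t.strip() for t in cell(row, "must_call_tool").split(";") if t.strip()]
    if tools:
        test["assertions"]["tools"] = {"required_order": tools}
    if cell(row, "max_latency_ms"):
        test["assertions"]["latency"] = {
            "turn_p95_ms_max": int(cell(row, "max_latency_ms"))
        }
    return test


def import_rows(csv_text):
    """Return (tests, rejected_messages) for a CSV export."""
    tests, rejected = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            tests.append(row_to_test(row))
        except ValueError as err:
            rejected.append(str(err))
    return tests, rejected
```

Rejected rows are collected instead of silently dropped, so the person who owns the spreadsheet gets a list of what to fix. The resulting dicts can then be written out as YAML and validated against the schema.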
Blocking, Scheduled, and Ephemeral Tests
Not every test belongs in the same gate.
| Test class | When to use | Storage policy | Run cadence |
|---|---|---|---|
| Blocking | Critical account, payment, compliance, booking, or safety flows | Persist test definition and evidence | Every relevant PR |
| Scheduled | Broader regression, language, persona, and coverage suites | Persist definition, retain sampled evidence | Nightly or weekly |
| Ephemeral | One-off investigation, vendor trial, support reproduction | Store run metadata and result, not permanent suite entry | Manual or temporary CI job |
Ephemeral is a lifecycle choice, not a logging shortcut. The run still needs a record. Otherwise the team cannot tell whether it was a real test or just a dashboard click.
Ephemeral run rule: temporary voice agent tests can skip permanent suite registration, but they should not skip evidence. The minimum record is the run ID, agent version, assertion result, trace link, and cleanup status.
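The minimum record is small enough to enforce in code. A sketch, assuming each ephemeral run produces a dict; the key names below are hypothetical spellings of the five items in the rule.

```python
# Hypothetical key names for the five required items.
REQUIRED_RUN_FIELDS = (
    "run_id",
    "agent_version",
    "assertion_result",
    "trace_link",
    "cleanup_status",
)


def missing_run_fields(record):
    """Return the required fields that are absent or empty.

    A failed assertion (False) still counts as recorded, so only
    None and the empty string are treated as missing.
    """
    return [
        field
        for field in REQUIRED_RUN_FIELDS
        if record.get(field) is None or record.get(field) == ""
    ]
```

A temporary CI job could refuse to report success while this list is non-empty, which keeps ephemeral runs cheap without letting them go unrecorded.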
GitHub Actions Gate
Here is a minimal GitHub Actions shape. Use your own runner and command names.
```yaml
name: voice-agent-tests
on:
  pull_request:
    paths:
      - agents/**
      - prompts/**
      - tests/voice-agents/**
jobs:
  voice-agent-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - name: Validate voice test YAML
        run: npm run voice-tests:validate -- tests/voice-agents
      - name: Run blocking voice tests
        run: npm run voice-tests:run -- --risk=blocking --env=staging
      - name: Upload voice test evidence
        if: always()
        run: npm run voice-tests:evidence -- --format=summary
```
The gate should fail when a blocking test fails, when a required fixture is missing, when a test tries to write outside its allowed mode, or when the evidence upload fails. A test result that nobody can inspect is not a CI gate. It is a guess with a status icon.
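Those four failure conditions can be folded into one exit-code decision. The following is a sketch, not a real runner's API: it assumes each test result is a dict with hypothetical keys `id`, `risk`, `passed`, `fixture_missing`, and `wrote_outside_allowed_mode`, and that the evidence upload step reports a boolean.

```python
def evaluate_gate(results, evidence_uploaded):
    """Return (exit_code, reasons); exit_code is 0 only when the gate passes."""
    reasons = []
    for result in results:
        test_id = result["id"]
        # Fixture and side-effect violations fail the gate for any test,
        # because they mean the run itself cannot be trusted.
        if result.get("fixture_missing"):
            reasons.append(f"{test_id}: required fixture is missing")
        if result.get("wrote_outside_allowed_mode"):
            reasons.append(f"{test_id}: write outside allowed side-effect mode")
        # Only blocking tests fail the gate on an assertion failure.
        if result.get("risk") == "blocking" and not result.get("passed"):
            reasons.append(f"{test_id}: blocking test failed")
    if not evidence_uploaded:
        reasons.append("evidence upload failed")
    return (1 if reasons else 0), reasons
```

A CI entry point could print the reasons and pass the code to `sys.exit`, so the job status and the explanation come from the same place.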
Common Mistakes
| Mistake | Why it hurts | Better pattern |
|---|---|---|
| Storing tests only in a vendor dashboard | Prompt changes and test changes cannot be reviewed together | Keep critical tests in Git and sync to the runner |
| Making every test blocking | CI becomes slow and noisy | Use blocking, scheduled, and manual risk classes |
| Checking only transcript text | The agent can say the right thing while tools fail | Add tool, side-effect, latency, and handoff assertions |
| Importing spreadsheets without schema validation | Rows drift into inconsistent formats | Convert columns into typed YAML fields |
| Keeping no evidence for failed runs | Developers cannot debug the failure | Retain audio, transcript, trace, tool output, and cleanup status |
| Letting tests write to shared systems | Synthetic calls pollute calendars, CRMs, and support queues | Use mock or sandbox modes by default |
The honest limitation: tests as code will not replace exploratory QA. Voice agents still need humans listening for weird pacing, awkward recovery, and caller frustration. But once a failure matters enough to block a release, it should graduate into a file someone can review.
Voice Agent Tests as Code FAQ
Can I define voice agent tests in YAML and run them in CI?
Yes. Define the persona, setup, call path, assertions, evidence policy, and cleanup rules in YAML, validate the file against a schema, then run the blocking subset from CI. Hamming recommends keeping high-risk workflow tests in Git so prompt and test changes can be reviewed together.
How do I keep voice agent tests reviewable in Git?
Use stable IDs, explicit owners, typed assertions, and small test files grouped by agent or workflow. Reviewers should be able to see what changed in the caller goal, fixture, expected tool order, latency threshold, or side-effect policy before the test runs.
How do I import a golden dataset of voice agent personas?
Map spreadsheet columns into typed fields such as id, persona.caller_goal, persona.language, setup.fixtures, assertions.outcome, and risk. Hamming recommends rejecting imported rows that do not have an expected outcome or owner because they become untriageable later.
How should ephemeral voice agent tests work?
Ephemeral tests can be temporary, but they should still produce evidence: run ID, agent version, assertion results, trace link, and cleanup status. Use them for investigations or vendor trials, then promote repeat failures into the permanent regression suite.
Which voice agent tests should block a pull request?
Block on tests for account access, payments, compliance scripts, booking writes, workflow side effects, production handoff behavior, and any failure that previously caused customer impact. Keep long-tail coverage and expensive language sweeps scheduled unless they protect a launch-critical flow.
What evidence should CI save for failed voice agent tests?
Save audio, transcript, trace ID, tool inputs and outputs, assertion results, final state, and cleanup status. Hamming recommends storing enough context that an engineer can reproduce the failure without asking QA which dashboard filters they used.

