Voice Agent Tests as Code: YAML Template for CI

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

May 25, 2026 · Updated May 25, 2026 · 9 min read

Voice agent tests as code means defining test cases in version-controlled files so prompts, personas, call paths, assertions, and expected evidence can be reviewed before they run against production agents.

If your team changes one demo agent by hand once a month, this is probably too much process. Use the dashboard, listen to the calls, and keep moving.

This is for teams where voice-agent behavior changes through pull requests: prompt edits, tool schemas, routing rules, language support, compliance scripts, and workflow fixes. Once those changes ship through Git, the tests should live there too.

TL;DR: Put voice agent test cases in YAML, validate them against a schema, run the blocking subset in CI, and store the result with the same commit that changed the prompt or workflow.

A test that only exists in a vendor dashboard can still be useful. It is just hard to review, diff, export, or connect to the code change that made it necessary.

Methodology Note: The template in this guide is based on Hamming's analysis of 4M+ production voice agent calls across 10K+ voice agents (2025-2026).

Treat this as a starting contract. Regulated flows, payments, and account changes need stricter approvals and side-effect controls than low-risk FAQ flows.

Last Updated: May 2026

What Belongs in a Voice Agent Test File

The file should be boring. A reviewer should understand what the call will do, what must pass, what is allowed to touch a real system, and what evidence will be saved.

| Test field | What it captures | Why it belongs in Git |
| --- | --- | --- |
| id | Stable test identifier | Lets failures stay searchable across runs |
| owner | Team or person responsible | Prevents orphaned tests |
| agent_ref | Agent, prompt, branch, or environment | Ties the test to a deployable target |
| persona | Caller language, accent, goal, constraints | Makes caller coverage reviewable |
| setup | Fixture state before the call | Keeps tests reproducible |
| call_path | Inbound, outbound, WebRTC, SIP, or provider mode | Prevents text-only tests from pretending to be voice tests |
| assertions | Outcome, transcript, tool, latency, policy, and side-effect checks | Defines pass/fail before the run starts |
| evidence | Audio, transcript, trace, tool result, dashboard link | Makes failures debuggable |
| cleanup | How test data is removed or isolated | Prevents synthetic calls from polluting production |

GitHub Actions workflows are YAML files under .github/workflows, and GitLab CI uses YAML for pipeline configuration. That does not mean voice-agent tests must use YAML forever. It means YAML is familiar enough for code review, comments, ownership, and CI wiring.

A Copyable YAML Template

Use this as the starting shape. Keep IDs stable and names human-readable.

version: 1
suite: appointment_booking_smoke
owner: growth-eng
agent_ref:
  environment: staging
  agent_slug: scheduling-agent
  prompt_version: pr-482

defaults:
  call_path: inbound_phone
  timeout_seconds: 180
  evidence:
    retain_audio: true
    retain_transcript: true
    retain_tool_trace: true
    retain_days: 30

tests:
  - id: booking_reschedule_verified_caller
    title: Reschedule an appointment after identity check
    risk: blocking
    persona:
      language: en-US
      caller_goal: Move my Tuesday appointment to Friday afternoon
      speech_conditions:
        accent: neutral
        background_noise: office
    setup:
      fixtures:
        caller_id: caller_qa_104
        appointment_id: appt_fixture_882
      side_effect_mode: sandbox
    call_script:
      - user: I need to move my appointment from Tuesday to Friday after 2.
      - user: Yes, Friday at 3 works.
    assertions:
      outcome:
        task_completed: true
        no_unplanned_handoff: true
      transcript:
        must_include:
          - Friday at 3
        must_not_include:
          - I cannot access your account
      tools:
        required_order:
          - lookup_identity
          - list_appointments
          - hold_slot
          - update_booking
          - send_confirmation
      side_effects:
        calendar:
          appointment_id: appt_fixture_882
          expected_status: rescheduled
          duplicate_events_allowed: false
      latency:
        turn_p95_ms_max: 1500
    cleanup:
      delete_sandbox_records: true

This template is intentionally more specific than a generic "run test" payload. Voice-agent failures hide in the details: wrong fixture, wrong call path, missing tool trace, duplicate calendar write, or a human handoff that technically happened but carried no context.
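To make two of the template's assertion layers concrete, here is a minimal Python sketch of how a runner might evaluate `tools.required_order` and `latency.turn_p95_ms_max`. It rests on assumptions the template does not state: that `required_order` means an ordered subsequence (extra tool calls in between are tolerated), that p95 uses the nearest-rank method, and that the runner hands over a flat list of tool names and per-turn latencies. The function names are hypothetical.

```python
import math


def tools_called_in_order(required_order, tool_trace):
    """Return True if required_order appears in tool_trace as an ordered subsequence.

    Extra calls between required tools are tolerated. That is an assumption,
    not a rule stated by the template.
    """
    remaining = iter(tool_trace)
    # `name in remaining` consumes the iterator, so each match must come
    # after the previous one.
    return all(name in remaining for name in required_order)


def turn_p95_ms(turn_latencies_ms):
    """Nearest-rank 95th percentile of per-turn latencies in milliseconds."""
    if not turn_latencies_ms:
        raise ValueError("no turn latencies recorded")
    ordered = sorted(turn_latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


def check_assertions(assertions, tool_trace, turn_latencies_ms):
    """Evaluate the tool and latency layers separately and name each failure."""
    failures = []
    required = assertions.get("tools", {}).get("required_order", [])
    if not tools_called_in_order(required, tool_trace):
        failures.append("tools.required_order")
    limit = assertions.get("latency", {}).get("turn_p95_ms_max")
    if limit is not None and turn_p95_ms(turn_latencies_ms) > limit:
        failures.append("latency.turn_p95_ms_max")
    return failures


assertions = {
    "tools": {"required_order": ["lookup_identity", "list_appointments", "hold_slot",
                                 "update_booking", "send_confirmation"]},
    "latency": {"turn_p95_ms_max": 1500},
}
# Hypothetical trace: update_booking ran before hold_slot.
trace = ["lookup_identity", "list_appointments", "update_booking", "hold_slot",
         "send_confirmation"]
print(check_assertions(assertions, trace, [900, 1100, 1300, 1200]))
# prints ['tools.required_order']
```

Reporting each layer by name keeps a tool-order failure from hiding behind a transcript that sounds correct.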

Field-by-Field Review Checklist

Review the test file like production code.

| Review question | Good answer | Block the PR when |
| --- | --- | --- |
| Does the test have an owner? | owner maps to a team or person | Nobody can triage failures |
| Is the target clear? | Environment, agent slug, and prompt version are explicit | Test could run against the wrong agent |
| Is the caller realistic? | Persona includes goal, language, and relevant speech condition | Test only mirrors the happy-path script |
| Are assertions layered? | Outcome, transcript, tool, side-effect, and latency checks are separated | A transcript check stands in for the whole workflow |
| Are writes safe? | side_effect_mode is mock, sandbox, or explicitly allowlisted | CI can touch real customer systems |
| Is evidence retained? | Audio, transcript, tool trace, and run ID are saved | Failure cannot be debugged later |
| Is cleanup defined? | Fixtures reset or expire | Synthetic data leaks into later runs |

The important shift is that QA changes become reviewable. A teammate can ask, "Why is this test non-blocking?" or "Why does this prompt change not add a regression case?" before the agent reaches customers.

Git review rule: a voice agent test is reviewable only when the reviewer can see the caller setup, target agent, assertions, side-effect policy, and retained evidence before the test runs. If those details live only in a dashboard, the PR cannot prove what behavior it is protecting.
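Some of these review questions can be checked mechanically before a human looks at the PR. The sketch below is one possible lint pass over an already-parsed suite (a plain dict). The function name, the `allowlisted` spelling, and the choice to read evidence settings only from `defaults.evidence` are assumptions made for illustration.

```python
# Assumed spellings for safe modes; the checklist names mock, sandbox,
# and "explicitly allowlisted" without fixing exact values.
ALLOWED_SIDE_EFFECT_MODES = {"mock", "sandbox", "allowlisted"}


def review_findings(suite):
    """Return human-readable reasons a reviewer might block the PR."""
    findings = []
    if not suite.get("owner"):
        findings.append("suite: no owner, so nobody can triage failures")
    agent_ref = suite.get("agent_ref", {})
    for key in ("environment", "agent_slug", "prompt_version"):
        if not agent_ref.get(key):
            findings.append(f"suite: agent_ref.{key} is missing")
    evidence = suite.get("defaults", {}).get("evidence", {})
    for key in ("retain_audio", "retain_transcript", "retain_tool_trace"):
        if not evidence.get(key):
            findings.append(f"suite: defaults.evidence.{key} is not enabled")
    for test in suite.get("tests", []):
        test_id = test.get("id", "<missing id>")
        mode = test.get("setup", {}).get("side_effect_mode")
        if mode not in ALLOWED_SIDE_EFFECT_MODES:
            findings.append(f"{test_id}: side_effect_mode {mode!r} is not allowed")
        if not test.get("assertions", {}).get("outcome"):
            findings.append(f"{test_id}: no outcome assertion")
        if not test.get("cleanup"):
            findings.append(f"{test_id}: cleanup is not defined")
    return findings


suite = {
    "owner": "growth-eng",
    "agent_ref": {"environment": "staging", "agent_slug": "scheduling-agent",
                  "prompt_version": "pr-482"},
    "defaults": {"evidence": {"retain_audio": True, "retain_transcript": True,
                              "retain_tool_trace": True}},
    "tests": [{
        "id": "booking_reschedule_verified_caller",
        "setup": {"side_effect_mode": "production"},  # deliberately unsafe
        "assertions": {"outcome": {"task_completed": True}},
        # no cleanup block, also deliberate
    }],
}
for finding in review_findings(suite):
    print(finding)
```

A lint like this does not replace the reviewer. It only removes the mechanical questions, leaving attention for whether the persona and assertions are realistic.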

How to Import a Golden Dataset

Most teams start with a spreadsheet. That is fine. The mistake is importing the spreadsheet as loose rows with no schema.

Use an import map:

| Spreadsheet column | YAML field | Required? | Notes |
| --- | --- | --- | --- |
| case_id | tests[].id | Yes | Stable, lowercase, no spaces |
| caller_goal | persona.caller_goal | Yes | Plain English job-to-be-done |
| language | persona.language | Yes | Use locale format when possible |
| fixture_customer | setup.fixtures.caller_id | Conditional | Required for account workflows |
| opening_utterance | call_script[0].user | Yes | First caller turn |
| expected_outcome | assertions.outcome | Yes | Convert prose into typed checks |
| must_call_tool | assertions.tools.required_order | Conditional | Required for workflow tests |
| max_latency_ms | assertions.latency.turn_p95_ms_max | Optional | Use only when latency is part of the risk |
| risk | tests[].risk | Yes | blocking, scheduled, or manual |

After import, validate the file. JSON Schema is a good fit because it can require fields, constrain object shapes, and validate nested data. The point is not ceremony. The point is refusing a "golden dataset" where half the rows have no expected outcome.
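As an illustration of the import map, the following Python sketch converts CSV rows into the test shape and rejects rows that lack a required column. It is a hand-rolled check, not a JSON Schema validator; in practice a schema would sit behind it. The cell formats are assumptions: `expected_outcome` is treated as comma-separated outcome flags that must be true, and `must_call_tool` as a comma-separated tool list. The function names are hypothetical.

```python
import csv
import io
import re

RISK_VALUES = {"blocking", "scheduled", "manual"}
REQUIRED_COLUMNS = ("case_id", "caller_goal", "language", "opening_utterance",
                    "expected_outcome", "risk")


def row_to_test(row):
    """Map one spreadsheet row onto the YAML test shape, or raise ValueError."""
    cells = {key: value.strip() for key, value in row.items() if isinstance(value, str)}
    label = cells.get("case_id") or "<no case_id>"
    for column in REQUIRED_COLUMNS:
        if not cells.get(column):
            raise ValueError(f"{label}: missing required column {column}")
    if not re.fullmatch(r"[a-z0-9_]+", cells["case_id"]):
        raise ValueError(f"{label}: case_id must be lowercase with no spaces")
    if cells["risk"] not in RISK_VALUES:
        raise ValueError(f"{label}: unknown risk {cells['risk']!r}")

    # Assumed cell format: comma-separated outcome flags, each expected true.
    outcome = {flag.strip(): True
               for flag in cells["expected_outcome"].split(",") if flag.strip()}
    test = {
        "id": cells["case_id"],
        "risk": cells["risk"],
        "persona": {"caller_goal": cells["caller_goal"], "language": cells["language"]},
        "call_script": [{"user": cells["opening_utterance"]}],
        "assertions": {"outcome": outcome},
    }
    if cells.get("fixture_customer"):
        test["setup"] = {"fixtures": {"caller_id": cells["fixture_customer"]}}
    if cells.get("must_call_tool"):
        tools = [name.strip() for name in cells["must_call_tool"].split(",") if name.strip()]
        test["assertions"]["tools"] = {"required_order": tools}
    if cells.get("max_latency_ms"):
        test["assertions"]["latency"] = {"turn_p95_ms_max": int(cells["max_latency_ms"])}
    return test


def import_rows(csv_text):
    """Return (tests, errors) so bad rows are reported instead of silently imported."""
    tests, errors = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            tests.append(row_to_test(row))
        except ValueError as exc:
            errors.append(str(exc))
    return tests, errors


sample = """case_id,caller_goal,language,opening_utterance,expected_outcome,risk,must_call_tool
booking_reschedule,Move my appointment,en-US,I need to move my appointment.,task_completed,blocking,"lookup_identity,update_booking"
billing_question,Ask about a charge,en-US,Why was I charged twice?,,scheduled,
"""
tests, errors = import_rows(sample)
print(len(tests), errors)
# prints 1 ['billing_question: missing required column expected_outcome']
```

Returning the rejected rows alongside the accepted ones gives the dataset owner a fix list rather than a silent partial import.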

Blocking, Scheduled, and Ephemeral Tests

Not every test belongs in the same gate.

| Test class | When to use | Storage policy | Run cadence |
| --- | --- | --- | --- |
| Blocking | Critical account, payment, compliance, booking, or safety flows | Persist test definition and evidence | Every relevant PR |
| Scheduled | Broader regression, language, persona, and coverage suites | Persist definition, retain sampled evidence | Nightly or weekly |
| Ephemeral | One-off investigation, vendor trial, support reproduction | Store run metadata and result, not permanent suite entry | Manual or temporary CI job |

Ephemeral is a lifecycle choice, not a logging shortcut. The run still needs a record; otherwise the team cannot tell whether it was a real test or just a dashboard click.

Ephemeral run rule: temporary voice agent tests can skip permanent suite registration, but they should not skip evidence. The minimum record is the run ID, agent version, assertion result, trace link, and cleanup status.
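The minimum record can be expressed as a small data structure so that a missing field is caught when the run is saved. This is a sketch only: the class name, field names, and example values are hypothetical and simply mirror the five items in the rule above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EphemeralRunRecord:
    run_id: str
    agent_version: str
    assertion_result: str   # e.g. "passed" or "failed"
    trace_link: str
    cleanup_status: str     # e.g. "cleaned" or "pending"

    def missing_fields(self):
        """Names of fields left empty; an empty list means the record is complete."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


record = EphemeralRunRecord(
    run_id="run_0001",          # hypothetical identifiers
    agent_version="pr-482",
    assertion_result="failed",
    trace_link="",              # trace was not captured
    cleanup_status="cleaned",
)
print(record.missing_fields())  # prints ['trace_link']
```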

GitHub Actions Gate

Here is a minimal GitHub Actions shape. Use your own runner and command names.

name: voice-agent-tests

on:
  pull_request:
    paths:
      - agents/**
      - prompts/**
      - tests/voice-agents/**

jobs:
  voice-agent-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - name: Validate voice test YAML
        run: npm run voice-tests:validate -- tests/voice-agents
      - name: Run blocking voice tests
        run: npm run voice-tests:run -- --risk=blocking --env=staging
      - name: Upload voice test evidence
        if: always()
        run: npm run voice-tests:evidence -- --format=summary

The gate should fail when a blocking test fails, when a required fixture is missing, when a test tries to write outside its allowed mode, or when the evidence upload fails. A test result that nobody can inspect is not a CI gate. It is a guess with a status icon.
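The four failure conditions can be written down as one decision function that the runner calls before choosing its exit code. The sketch below assumes a hypothetical result shape (`VoiceTestResult`) and a boolean reported by the evidence step; neither comes from a real runner.

```python
from dataclasses import dataclass


@dataclass
class VoiceTestResult:
    test_id: str
    risk: str                        # "blocking", "scheduled", or "manual"
    passed: bool
    fixture_missing: bool = False
    wrote_outside_allowed_mode: bool = False


def gate_failures(results, evidence_uploaded):
    """Collect every reason the gate should fail; an empty list means pass."""
    reasons = []
    for result in results:
        if result.fixture_missing:
            reasons.append(f"{result.test_id}: required fixture is missing")
        if result.wrote_outside_allowed_mode:
            reasons.append(f"{result.test_id}: wrote outside its allowed side-effect mode")
        if result.risk == "blocking" and not result.passed:
            reasons.append(f"{result.test_id}: blocking test failed")
    if not evidence_uploaded:
        reasons.append("evidence upload failed")
    return reasons


results = [
    VoiceTestResult("booking_reschedule_verified_caller", "blocking", passed=True),
    VoiceTestResult("spanish_faq_sweep", "scheduled", passed=False),  # hypothetical test
]
reasons = gate_failures(results, evidence_uploaded=False)
exit_code = 1 if reasons else 0
print(exit_code, reasons)
# prints 1 ['evidence upload failed']
```

In this example the failed scheduled test does not block the PR, but the missing evidence does. Collecting all reasons instead of stopping at the first one gives the PR author the full list in a single run.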

Common Mistakes

| Mistake | Why it hurts | Better pattern |
| --- | --- | --- |
| Storing tests only in a vendor dashboard | Prompt changes and test changes cannot be reviewed together | Keep critical tests in Git and sync to the runner |
| Making every test blocking | CI becomes slow and noisy | Use blocking, scheduled, and manual risk classes |
| Checking only transcript text | The agent can say the right thing while tools fail | Add tool, side-effect, latency, and handoff assertions |
| Importing spreadsheets without schema validation | Rows drift into inconsistent formats | Convert columns into typed YAML fields |
| Keeping no evidence for failed runs | Developers cannot debug the failure | Retain audio, transcript, trace, tool output, and cleanup status |
| Letting tests write to shared systems | Synthetic calls pollute calendars, CRMs, and support queues | Use mock or sandbox modes by default |

The honest limitation: tests as code will not replace exploratory QA. Voice agents still need humans listening for weird pacing, awkward recovery, and caller frustration. But once a failure matters enough to block a release, it should graduate into a file someone can review.

Voice Agent Tests as Code FAQ

Can I define voice agent tests in YAML and run them in CI?

Yes. Define the persona, setup, call path, assertions, evidence policy, and cleanup rules in YAML, validate the file against a schema, then run the blocking subset from CI. Hamming recommends keeping high-risk workflow tests in Git so prompt and test changes can be reviewed together.

How do I keep voice agent tests reviewable in Git?

Use stable IDs, explicit owners, typed assertions, and small test files grouped by agent or workflow. Reviewers should be able to see what changed in the caller goal, fixture, expected tool order, latency threshold, or side-effect policy before the test runs.

How do I import a golden dataset of voice agent personas?

Map spreadsheet columns into typed fields such as id, persona.caller_goal, persona.language, setup.fixtures, assertions.outcome, and risk. Hamming recommends rejecting imported rows that do not have an expected outcome or owner because they become untriageable later.

How should ephemeral voice agent tests work?

Ephemeral tests can be temporary, but they should still produce evidence: run ID, agent version, assertion results, trace link, and cleanup status. Use them for investigations or vendor trials, then promote repeat failures into the permanent regression suite.

Which voice agent tests should block a pull request?

Block on tests for account access, payments, compliance scripts, booking writes, workflow side effects, production handoff behavior, and any failure that previously caused customer impact. Keep long-tail coverage and expensive language sweeps scheduled unless they protect a launch-critical flow.

What evidence should CI save for failed voice agent tests?

Save audio, transcript, trace ID, tool inputs and outputs, assertion results, final state, and cleanup status. Hamming recommends storing enough context that an engineer can reproduce the failure without asking QA which dashboard filters they used.

What should a voice agent test file include?

Include a stable test ID, owner, agent reference, persona, setup fixtures, call path, assertions, evidence retention, and cleanup rules. Hamming recommends separating transcript, tool, side-effect, latency, and handoff assertions so a spoken answer does not hide an operational failure.

Does every voice agent test need to live in Git?

No. Dashboard-created tests are useful for exploration, triage, and non-engineering workflows. Hamming recommends promoting release-blocking or repeatedly failing cases into Git so they become reviewable, portable, and connected to the code or prompt change that requires them.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on the dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”