What is a failed-call regression test?

A failed-call regression test is a repeatable voice agent test created from a real production failure. Hamming recommends preserving the source call ID, sanitized fixture, expected behavior, tool-call guardrails, side-effect checks, and promotion level so the failure cannot return silently after the next release.

Which production calls should become regression tests?

Convert calls that are repeated, high impact, safety-sensitive, or tied to a workflow contract the agent must preserve. According to Hamming's runbook, one-off confusing calls usually stay manual, while wrong tool calls, unsafe responses, duplicate writes, and failed handoffs should become blocking or scheduled regression tests.

Should I replay raw production recordings in CI?

No. Hamming recommends using production calls to identify the failure pattern, then recreating it with redacted transcripts, safe fixtures, and sandboxed side effects. Raw production recordings should not enter CI unless your security, consent, retention, and access controls explicitly allow that use.

What evidence do I need before creating a failed-call test?

At minimum, capture the source call ID, agent version, caller goal, sanitized transcript turns, failure label, severity, owner, and expected behavior. For workflow agents, Hamming's checklist also requires tool-call traces, state before the call, side effects after the call, and a trace or correlation ID when available.

How do I test failed tool calls safely?

Use a sandbox, mock, dry-run endpoint, or allowlisted test record before asserting tool behavior. Hamming recommends checking the tool name, arguments, order, result, forbidden calls, and backend side effect because transcript-only tests can pass while the real workflow remains broken.

How often should failed-call regression tests run?

Blocking failed-call tests should run on prompt changes, tool schema changes, workflow changes, and release candidates. Scheduled tests can run nightly or weekly, while manual tests should stay in launch reviews or incident follow-up until their expected behavior is clear enough to automate.

How is this different from response coverage or tests as code?

Response coverage identifies which production gaps matter, and tests as code makes test definitions reviewable in Git. The failed-call regression runbook connects those two ideas by taking one production failure and turning it into a safe, owned, promoted regression case.

How to Turn Failed Production Calls Into Regression Tests

Failed-call regression tests are repeatable voice agent tests created from real production failures. The goal is not to replay every bad call forever. The goal is to preserve the one failure pattern that matters, recreate it safely, and make sure the next prompt, model, or workflow change cannot bring it back.

If you handle 30 calls a week, this runbook is probably too heavy. Listen to the bad calls, fix the obvious bug, and add 2 or 3 dashboard tests. This is for teams with enough traffic that the same failure can show up across agents, locations, languages, or releases.

The uncomfortable lesson: a monitoring alert is not a regression test. A bug ticket is not a regression test. A saved recording is not a regression test. Until the failure has a fixture, guardrails, owner, and promotion level, it is just evidence.

TL;DR: Turn failed production calls into regression tests with 5 steps: capture the evidence packet, classify the failure, recreate it with safe fixtures, write guardrails against the transcript and backend behavior, then promote the case into blocking, scheduled, or manual regression.

Do not dump raw production calls into CI. Start with the smallest safe scenario that reproduces the failure.

Failed-call regression case: A failed-call regression case is the smallest safe test that preserves a production failure's source label, fixture, guardrails, and promotion rule. It should reproduce the behavior that mattered without replaying private caller data or unrelated conversational noise.

Methodology Note: This runbook is based on Hamming's analysis of production voice agent calls and testing workflows across 10K+ voice agents (2025-2026). Hamming's platform has 10M+ mins protected. We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions.
The sample cases focus on tool calls, workflow side effects, unsafe responses, and production monitoring because those are the failures that most often come back after a release.

In Hamming workflow reviews, we found that the cases worth keeping usually have one thing in common: the transcript looks less broken than the backend evidence. The caller may hear a reasonable response while the trace shows a duplicate write, missing lookup, stale eligibility check, or unsafe handoff. That is why this runbook starts with evidence, not prompt editing.

Last Updated: May 2026

Related Guides:

Voice Agent Response Coverage - find production gaps worth turning into tests
Voice Agent Workflow Testing - assert tool calls, state transitions, and side effects
Voice Agent Tests as Code - keep regression cases reviewable in YAML and Git
Voice Agent CI/CD Testing - decide what blocks a release
Voice Agent Observability and Tracing - preserve traces, spans, and correlation IDs
Voice Agent Incident Response - connect incidents to follow-up tests

What Makes a Production Call Test-Worthy?

A failed call is test-worthy when it teaches you something the existing suite did not know.

That sounds obvious, but teams usually get this wrong in two ways. They either convert every bad call into a bloated test suite nobody trusts, or they fix the bug once and leave no guardrail behind. Both fail.

Use this decision table:

Production signal	Make it blocking?	Make it scheduled?	Keep manual?	Why
Wrong tool called in a payment, booking, prescription, identity, or claims flow	Yes	Yes	No	The agent changed real workflow state or risked doing so.
Correct tool called with wrong arguments	Yes for high-risk flows	Yes	No	Argument mistakes tend to return after prompt and schema edits.
Unsafe medical, financial, legal, or policy response	Yes	Yes	No	Safety regressions should block release.
Repeated fallback cluster with clear caller goal	Maybe	Yes	No	It improves response coverage but may not need to block every PR.
One-off confused caller with no repeated pattern	No	No	Yes	Keep it for review, not CI.
Provider outage or telephony failure outside the agent's control	No	Maybe	Yes	Test your fallback behavior, not the vendor outage itself.
Long, messy call with 9 unrelated issues	No	Maybe after reduction	Yes	Split it into the smallest reproducible failure first.

Failed-call regression rule: a production failure should become a regression test when it is repeated, high impact, safety-sensitive, or caused by a workflow contract the agent must preserve.
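
The rule above can be sketched as a small predicate. This is a minimal illustration, not part of Hamming's tooling: the `ProductionSignal` record, its field names, and the repeat threshold of 2 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductionSignal:
    """Hypothetical summary of one failed production call (field names are assumptions)."""
    occurrences: int                # how many times this failure pattern was observed
    high_impact: bool               # e.g. payment, booking, identity, or claims flow
    safety_sensitive: bool          # unsafe medical, financial, legal, or policy response
    breaks_workflow_contract: bool  # wrong tool, wrong arguments, duplicate write


def is_test_worthy(signal: ProductionSignal, repeat_threshold: int = 2) -> bool:
    """Return True when any one of the four criteria in the rule holds."""
    return (
        signal.occurrences >= repeat_threshold
        or signal.high_impact
        or signal.safety_sensitive
        or signal.breaks_workflow_contract
    )


# A one-off confused caller stays manual; a duplicate write becomes a test.
one_off_confusion = ProductionSignal(1, False, False, False)
duplicate_write = ProductionSignal(1, True, False, True)
```

A predicate like this only decides whether a case enters the suite. Whether it blocks, runs on a schedule, or stays manual is a separate decision covered by the promotion levels.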

Capture the Evidence Packet Before You Rewrite Anything

Do not start by editing the prompt. First, freeze the evidence.

For voice agents, the failure may live in the transcript, audio, latency, tool trace, state transition, handoff summary, or post-call side effect. If you preserve only the transcript, you may create a test that passes while the real bug survives.

Evidence field	Required?	What it proves	Sample value
Source call ID	Yes	Connects the test to the original failure	`call_2026_05_29_1842`
Agent version	Yes	Shows what code, prompt, model, or workflow failed	`agent=scheduler`, `prompt=v38`
Caller goal	Yes	Keeps the test human-readable	"Reschedule appointment after identity check"
Sanitized transcript turns	Yes	Recreates the conversational path	Last 6 turns before failure
Audio condition	Usually	Explains ASR or interruption issues	office noise, accent, speakerphone
Tool-call trace	For workflow agents	Proves tool name, arguments, order, result, and error	`lookup_identity` returned missing record
State before call	For writes	Recreates fixture setup	appointment exists, payment plan disabled
Side effect after call	For writes	Proves backend outcome	duplicate calendar event created
Trace or correlation ID	Strongly recommended	Connects ASR, LLM, tools, TTS, and logs	OpenTelemetry trace ID
Severity and owner	Yes	Decides promotion and triage	`severity=high`, `owner=scheduling-eng`

OpenTelemetry context propagation exists to correlate signals across services. For voice agents, that correlation matters because one call can cross SIP, ASR, LLM, tool execution, TTS, storage, and post-call evaluation. A regression case without correlation IDs is harder to trust.

OpenTelemetry traces also model spans with timestamps, events, status, links, and context. That is the right mental model for failed-call evidence: keep the causal path, not just the final symptom.

Use the voice agent observability and tracing guide when you need the broader instrumentation layer. Use this runbook when you already have a failed call and need to preserve it as a test.

Evidence packet rule: If a teammate cannot move from the regression test back to the original call, trace, fixture, and owner, the test is not ready for CI. The point is to preserve causal evidence, not just transcript text.
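
One way to enforce the evidence packet rule is a completeness check that runs before a case is accepted. The sketch below assumes the packet is a plain dictionary; the key names are illustrative translations of the table's field names, not a fixed schema.

```python
# Required evidence fields, following the table above. Key names are assumptions.
ALWAYS_REQUIRED = (
    "source_call_id",
    "agent_version",
    "caller_goal",
    "sanitized_transcript_turns",
    "severity",
    "owner",
)
WORKFLOW_REQUIRED = ("tool_call_trace",)
WRITE_REQUIRED = ("state_before_call", "side_effect_after_call")


def missing_evidence(packet: dict, *, workflow_agent: bool, performs_writes: bool) -> list[str]:
    """List required evidence fields that are absent or empty in the packet."""
    required = list(ALWAYS_REQUIRED)
    if workflow_agent:
        required += WORKFLOW_REQUIRED
    if performs_writes:
        required += WRITE_REQUIRED
    return [name for name in required if not packet.get(name)]


packet = {
    "source_call_id": "call_2026_05_29_1842",
    "agent_version": "agent=scheduler, prompt=v38",
    "caller_goal": "Reschedule appointment after identity check",
    "sanitized_transcript_turns": ["caller: I need to move my appointment."],
    "severity": "high",
    "owner": "scheduling-eng",
    "tool_call_trace": [],  # an empty trace is treated the same as a missing one
}
gaps = missing_evidence(packet, workflow_agent=True, performs_writes=True)
```

Here `gaps` names the tool-call trace and both write-related fields, which is the signal that this packet is not yet ready to become a blocking test.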

Classify the Failure Before You Create the Test

Every failed-call test needs one primary failure type. Otherwise the guardrail becomes a grab bag.

Failure type	Failure symptom	Common fix	Promotion level
Intent miss	Agent identifies the wrong caller goal	Add phrasing variants and intent guardrails	Scheduled unless high risk
Tool selection	Agent calls the wrong tool	Tighten tool descriptions and routing policy	Blocking for critical flows
Tool arguments	Agent passes wrong or missing fields	Add entity checks and schema validation	Blocking for writes
Tool ordering	Agent calls tools in unsafe order	Add state-machine checks	Blocking
Side effect	Backend write is missing, duplicated, or wrong	Add sandbox fixture and post-call verification	Blocking
Handoff	Transfer succeeds but context is missing	Add handoff summary and destination guardrails	Scheduled or blocking
Safety/policy	Agent says or does something prohibited	Add policy guardrails and severity routing	Blocking
Latency/interruption	Agent is too slow or mishandles barge-in	Add turn-level timing and interruption checks	Scheduled unless launch-critical

The key is reducing the production call to one durable lesson.

If the real call had 14 turns, a noisy office, two interruptions, a stale CRM record, and a wrong appointment update, resist the urge to preserve all 14 turns. Keep the smallest version that still fails for the same reason. Then add separate tests for separate lessons.

Recreate the Call With Safe Fixtures

Never use raw production data as the fixture in CI. Use production to learn the pattern. Use sanitized fixtures to reproduce it.

That means:

replace real names, phone numbers, account IDs, and addresses with test records
preserve the failure-relevant shape: missing record, duplicate appointment, expired policy, unavailable slot
route writes through sandbox, dry-run, mock, or allowlisted test endpoints
keep audio or transcript snippets only when they are redacted and approved for test storage
keep the source label so future reviewers know why the case exists

Safe fixture rule: A safe fixture keeps the production failure's shape while replacing every real caller identifier and production side effect. If the test needs the real account, phone number, or live write path to fail, it is not ready for automated regression.
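
The substitution step above could look roughly like the following. This is a sketch under stated assumptions: the `replacements` mapping is something you build per call, and the phone pattern is a deliberately loose example for ten-digit numbers, not a complete PII detector.

```python
import re

# Loose pattern for ten-digit phone numbers; an assumption, tune it for your data.
PHONE_PATTERN = re.compile(r"\b\d{3}[\s.-]?\d{3}[\s.-]?\d{4}\b")


def sanitize_turn(text: str, replacements: dict[str, str]) -> str:
    """Swap known real identifiers for fixture values, then mask leftover phone numbers."""
    for real_value, fixture_value in replacements.items():
        text = text.replace(real_value, fixture_value)
    return PHONE_PATTERN.sub("[PHONE]", text)


# Hypothetical production turn; the name and number are made up for illustration.
raw_turn = "This is Jane Doe, my number is 415-555-0134."
safe_turn = sanitize_turn(raw_turn, {"Jane Doe": "caller_fixture_204"})
```

Pattern-based masking is a backstop, not the primary control. The safer default is to rebuild the turn from fixture records and have a human approve anything kept from the original transcript.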

This is where workflow testing and tests as code meet. Workflow testing tells you what to assert. Tests as code tells you how to make the case reviewable.

Write the Failed-Call Regression Test

Use a test file that names the production source, sanitized fixture, expected behavior, tool guardrails, and promotion level.

version: 1
suite: failed_call_regression
owner: scheduling-eng
source:
  type: production_call
  source_call_id: call_2026_05_29_1842
  failure_label: wrong_tool_arguments
  severity: high
  detected_by: production_monitoring
  sanitized: true
agent_ref:
  environment: staging
  agent_slug: scheduling-agent
  prompt_version: pr-491
fixture:
  caller_id: caller_fixture_204
  appointment_id: appt_fixture_771
  existing_state:
    appointment_status: confirmed
    available_slots:
      - 2026-06-03T15:00:00-05:00
  side_effect_mode: sandbox
caller:
  language: en-US
  audio_condition: speakerphone_with_office_noise
  goal: move my appointment from Tuesday morning to Wednesday after 2
scenario:
  starting_turn: "I need to move Tuesday's appointment to Wednesday afternoon."
  must_reproduce:
    - caller asks to reschedule
    - agent verifies identity
    - agent offers an allowed Wednesday slot
guardrails:
  outcome:
    task_completed: true
    no_unplanned_handoff: true
  tools:
    required_order:
      - lookup_identity
      - list_appointments
      - list_available_slots
      - update_appointment
      - send_confirmation
    arguments:
      update_appointment:
        appointment_id: appt_fixture_771
        new_start_time: 2026-06-03T15:00:00-05:00
    forbidden_tools:
      - create_new_appointment
  side_effects:
    calendar:
      appointment_id: appt_fixture_771
      expected_status: rescheduled
      duplicate_events_allowed: false
  transcript:
    must_not_include:
      - I created a new appointment
      - I cannot access your account
  evidence:
    retain_transcript: true
    retain_audio: true
    retain_tool_trace: true
    retain_days: 30
promotion:
  level: blocking
  run_on:
    - prompt_change
    - tool_schema_change
    - release_candidate

The field names are less important than the discipline. A reviewer should be able to answer 5 questions before the test runs:

What production failure created this case?
What safe fixture recreates the failure?
Which behavior proves the bug is fixed?
Which side effects are allowed?
What release path does this test block?
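
To make "which behavior proves the bug is fixed" concrete, the `guardrails.tools` block of the sample test could be evaluated against a recorded tool trace along these lines. This is a minimal sketch: the function name and the trace shape, a list of `(tool_name, arguments)` tuples, are assumptions rather than a Hamming API.

```python
def check_tool_guardrails(calls, required_order, expected_arguments, forbidden_tools):
    """Return human-readable violations; an empty list means the guardrails pass.

    `calls` is a list of (tool_name, arguments) tuples in the order the agent made them.
    """
    violations = []
    names = [name for name, _ in calls]

    for name in names:
        if name in forbidden_tools:
            violations.append(f"forbidden tool called: {name}")

    # Required tools must appear as an in-order subsequence of the actual calls.
    remaining = iter(names)
    for required in required_order:
        if not any(name == required for name in remaining):
            violations.append(f"missing or out-of-order tool: {required}")
            break

    # Every call to a constrained tool must carry the expected argument values.
    for tool, expected in expected_arguments.items():
        for name, arguments in calls:
            if name != tool:
                continue
            for key, value in expected.items():
                if arguments.get(key) != value:
                    violations.append(
                        f"{tool}.{key}: expected {value!r}, got {arguments.get(key)!r}"
                    )
    return violations


REQUIRED_ORDER = [
    "lookup_identity", "list_appointments", "list_available_slots",
    "update_appointment", "send_confirmation",
]
EXPECTED_ARGUMENTS = {
    "update_appointment": {
        "appointment_id": "appt_fixture_771",
        "new_start_time": "2026-06-03T15:00:00-05:00",
    }
}
FORBIDDEN_TOOLS = {"create_new_appointment"}
```

Returning a list of violations instead of raising on the first one keeps the failure report useful when a single release breaks both tool order and arguments.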

Validate the Test Contract

Failed-call tests become dangerous when they are half-structured. A missing fixture or vague expected outcome turns a regression case into a note.

JSON Schema validation is useful here because it can require fields, constrain nested objects, and reject malformed cases before they hit CI.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["version", "suite", "owner", "source", "agent_ref", "fixture", "guardrails", "promotion"],
  "properties": {
    "source": {
      "type": "object",
      "required": ["type", "source_call_id", "failure_label", "severity", "sanitized"],
      "properties": {
        "type": { "enum": ["production_call"] },
        "source_call_id": { "type": "string", "minLength": 1 },
        "failure_label": { "type": "string", "minLength": 1 },
        "severity": { "enum": ["low", "medium", "high", "critical"] },
        "sanitized": { "const": true }
      }
    },
    "promotion": {
      "type": "object",
      "required": ["level", "run_on"],
      "properties": {
        "level": { "enum": ["blocking", "scheduled", "manual"] },
        "run_on": {
          "type": "array",
          "items": { "type": "string" },
          "minItems": 1
        }
      }
    }
  }
}

The schema will not tell you whether the test is useful. It will catch the easy failures: no owner, no source call, no sanitization marker, no promotion level, no run trigger.
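
To show what those easy failures look like in practice, here is a hand-rolled sketch that mirrors a subset of the schema's rules using only the standard library. It is not a JSON Schema implementation, and `contract_errors` is a hypothetical helper; in a real pipeline a proper JSON Schema validator would do this work.

```python
REQUIRED_TOP_LEVEL = (
    "version", "suite", "owner", "source", "agent_ref", "fixture", "guardrails", "promotion",
)
REQUIRED_SOURCE = ("type", "source_call_id", "failure_label", "severity", "sanitized")
REQUIRED_PROMOTION = ("level", "run_on")
SEVERITIES = {"low", "medium", "high", "critical"}
LEVELS = {"blocking", "scheduled", "manual"}


def contract_errors(case: dict) -> list[str]:
    """Collect contract violations for one test case; an empty list means it passes."""
    errors = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in case]

    source = case.get("source", {})
    errors += [f"missing field: source.{key}" for key in REQUIRED_SOURCE if key not in source]
    if "sanitized" in source and source["sanitized"] is not True:
        errors.append("source.sanitized must be true")
    if "severity" in source and source["severity"] not in SEVERITIES:
        errors.append("source.severity is not an allowed value")

    promotion = case.get("promotion", {})
    errors += [f"missing field: promotion.{key}" for key in REQUIRED_PROMOTION if key not in promotion]
    if "level" in promotion and promotion["level"] not in LEVELS:
        errors.append("promotion.level is not an allowed value")
    if "run_on" in promotion and not promotion["run_on"]:
        errors.append("promotion.run_on needs at least one trigger")
    return errors


# A half-structured draft: no owner, unsanitized source, and no run trigger.
draft = {
    "version": 1,
    "suite": "failed_call_regression",
    "source": {"type": "production_call", "sanitized": False},
    "promotion": {"level": "blocking", "run_on": []},
}
draft_errors = contract_errors(draft)
```

The draft fails on exactly the kinds of gaps listed above, which is the point: a case like this should be rejected before it ever reaches a CI runner.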

Promote the Test Into the Right Gate

Not every failed-call regression test belongs in every pull request.

Promotion level	Runs when	Use for	Do not use for
Blocking	Prompt, tool schema, workflow, or release-candidate changes	High-risk tool calls, safety, payments, identity, compliance, duplicate writes	Long-tail exploratory coverage
Scheduled	Nightly, weekly, or low-traffic windows	Coverage clusters, latency drift, handoff quality, multilingual variants	Urgent release blockers
Manual	Before launches, incident follow-up, or QA review	Messy cases that need human judgment	Known critical regressions
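
The "Runs when" column maps naturally onto the `promotion.level` and `promotion.run_on` fields of each test case. A test runner could pick the cases for a given trigger with something like the sketch below; `select_cases`, the second case, and the `nightly` trigger name are assumptions made for illustration.

```python
def select_cases(cases: list[dict], trigger: str, level: str = "blocking") -> list[str]:
    """Return source call IDs of cases at `level` whose `run_on` includes `trigger`."""
    return [
        case["source"]["source_call_id"]
        for case in cases
        if case["promotion"]["level"] == level and trigger in case["promotion"]["run_on"]
    ]


cases = [
    {
        "source": {"source_call_id": "call_2026_05_29_1842"},
        "promotion": {
            "level": "blocking",
            "run_on": ["prompt_change", "tool_schema_change", "release_candidate"],
        },
    },
    {
        # Hypothetical scheduled case, not taken from a real call.
        "source": {"source_call_id": "call_fixture_latency_drift"},
        "promotion": {"level": "scheduled", "run_on": ["nightly"]},
    },
]

blocking_on_prompt_change = select_cases(cases, "prompt_change")
nightly_scheduled = select_cases(cases, "nightly", level="scheduled")
```

Keeping the level and the trigger as separate filters means a scheduled case never blocks a pull request by accident, even if someone adds a pull-request trigger to its `run_on` list.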

Use the voice agent CI/CD testing guide for broad release-gate policy. For failed-call cases, GitHub Actions workflows can keep the first gate straightforward because they are YAML files that define jobs and triggers:

name: voice-agent-failed-call-regression
on:
  pull_request:
    paths:
      - "agents/**"
      - "prompts/**"
      - "voice-tests/**"
  workflow_dispatch:
jobs:
  failed-call-regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate test cases
        run: npm run validate:voice-tests -- voice-tests/failed-calls
      - name: Run blocking failed-call regressions
        run: npm run voice-tests -- --suite=failed_call_regression --level=blocking

For larger organizations, reusable workflows can accept inputs and secrets from calling workflows. That lets application teams call the same voice regression gate while passing their own suite path, environment, or sandbox credentials.

What This Runbook Cannot Prove

This workflow is useful, but it has limits.

It does not prove the original caller would behave the same way again. A regression test recreates the failure pattern, not the entire human context.
It does not replace production monitoring. Tests catch known patterns. Monitoring catches new failures and tells you what deserves the next test.
It does not make unsafe writes safe by default. Any test that can book, charge, transfer, delete, or notify needs sandboxing or explicit allowlists.
It does not remove human review. Some calls are ambiguous. Keep those manual until the expected behavior is crisp enough to automate.
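
The sandboxing limit in particular can be enforced in code rather than by convention. The sketch below refuses to run a case automatically unless its fixture is sanitized and its writes are either sandboxed or aimed at an allowlisted test record. Only `sandbox` appears as a `side_effect_mode` in the sample test; the other mode names and the `unsafe_reasons` helper are assumptions.

```python
# "sandbox" comes from the sample test file; "dry_run" and "mock" are assumed names.
SAFE_MODES = {"sandbox", "dry_run", "mock"}


def unsafe_reasons(case: dict, allowlisted_records: frozenset = frozenset()) -> list[str]:
    """Explain why a case must not run automatically; an empty list means it may."""
    reasons = []
    if not case.get("source", {}).get("sanitized", False):
        reasons.append("fixture is not marked sanitized")

    fixture = case.get("fixture", {})
    mode = fixture.get("side_effect_mode")
    if mode not in SAFE_MODES:
        # Live writes are tolerated only against explicitly allowlisted test records.
        if fixture.get("appointment_id") not in allowlisted_records:
            reasons.append(
                f"side_effect_mode {mode!r} is not sandboxed and the record is not allowlisted"
            )
    return reasons


sandboxed_case = {
    "source": {"sanitized": True},
    "fixture": {"appointment_id": "appt_fixture_771", "side_effect_mode": "sandbox"},
}
live_case = {
    "source": {"sanitized": True},
    "fixture": {"appointment_id": "appt_fixture_771", "side_effect_mode": "live"},
}
```

A check like this belongs at the start of the runner, before any tool is invoked, so that an unsafe case fails closed instead of producing a real booking, charge, or notification.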

I used to think the hardest part was generating the test case. It is not. The hard part is preserving enough evidence that the test still means something 6 months later, after the prompt, tool schema, and owner have all changed.

Failed-Call Promotion Checklist

Before a failed production call becomes part of the regression suite or a broader production readiness checklist, check every item:

Pair this with incident response after customer-impacting failures. Pair it with voice agent troubleshooting when you still do not know the root cause. Pair it with hallucination detection when the failure is an unsupported or unsafe answer.

The loop is simple: production teaches you what broke, testing prevents the same break from returning, and CI decides which fixes are important enough to block.

Sumanyu Sharma

Related Resources

Voice Agent Tool Call Contract Testing Template

Voice Agent Structured Output Validation Checklist

Voice Agent Sandbox Testing for Tool Calls and Side Effects