Failed-production-call regression tests are repeatable voice agent tests created from real production failures. The goal is not to replay every bad call forever. The goal is to preserve the one failure pattern that matters, recreate it safely, and make sure the next prompt, model, or workflow change cannot bring it back.
If you handle 30 calls a week, this runbook is probably too heavy. Listen to the bad calls, fix the obvious bug, and add 2 or 3 dashboard tests. This is for teams with enough traffic that the same failure can show up across agents, locations, languages, or releases.
The uncomfortable lesson: a monitoring alert is not a regression test. A bug ticket is not a regression test. A saved recording is not a regression test. Until the failure has a fixture, assertions, owner, and promotion level, it is just evidence.
TL;DR: Turn failed production calls into regression tests with 5 steps: capture the evidence packet, classify the failure, recreate it with safe fixtures, write assertions against the transcript and backend behavior, then promote the case into blocking, scheduled, or manual regression.
Do not dump raw production calls into CI. Start with the smallest safe scenario that reproduces the failure.
Failed-call regression case: A failed-call regression case is the smallest safe test that preserves a production failure's source label, fixture, assertions, and promotion rule. It should reproduce the behavior that mattered without replaying private caller data or unrelated conversational noise.
Methodology Note: This runbook is based on Hamming's analysis of 4M+ production voice agent calls and testing workflows across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions. The sample cases focus on tool calls, workflow side effects, unsafe responses, and production monitoring because those are the failures that most often come back after a release.
In Hamming workflow reviews, we found that the cases worth keeping usually have one thing in common: the transcript looks less broken than the backend evidence. The caller may hear a reasonable response while the trace shows a duplicate write, missing lookup, stale eligibility check, or unsafe handoff. That is why this runbook starts with evidence, not prompt editing.
Last Updated: May 2026
Related Guides:
- Voice Agent Response Coverage - find production gaps worth turning into tests
- Voice Agent Workflow Testing - assert tool calls, state transitions, and side effects
- Voice Agent Tests as Code - keep regression cases reviewable in YAML and Git
- Voice Agent CI/CD Testing - decide what blocks a release
- Voice Agent Observability and Tracing - preserve traces, spans, and correlation IDs
- Voice Agent Incident Response - connect incidents to follow-up tests
What Makes a Production Call Test-Worthy?
A failed call is test-worthy when it teaches you something the existing suite did not know.
That sounds obvious, but teams usually get this wrong in two ways. They either convert every bad call into a bloated test suite nobody trusts, or they fix the bug once and leave no guardrail behind. Both fail.
Use this decision table:
| Production signal | Make it blocking? | Make it scheduled? | Keep manual? | Why |
|---|---|---|---|---|
| Wrong tool called in a payment, booking, prescription, identity, or claims flow | Yes | Yes | No | The agent changed real workflow state or risked doing so. |
| Correct tool called with wrong arguments | Yes for high-risk flows | Yes | No | Argument mistakes tend to return after prompt and schema edits. |
| Unsafe medical, financial, legal, or policy response | Yes | Yes | No | Safety regressions should block release. |
| Repeated fallback cluster with clear caller goal | Maybe | Yes | No | It improves response coverage but may not need to block every PR. |
| One-off confused caller with no repeated pattern | No | No | Yes | Keep it for review, not CI. |
| Provider outage or telephony failure outside the agent's control | No | Maybe | Yes | Test your fallback behavior, not the vendor outage itself. |
| Long, messy call with 9 unrelated issues | No | Maybe after reduction | Yes | Split it into the smallest reproducible failure first. |
Failed-call regression rule: a production failure should become a regression test when it is repeated, high impact, safety-sensitive, or caused by a workflow contract the agent must preserve.
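The rule above can be sketched as a small predicate. This is a minimal Python sketch rather than a Hamming API: the `FailureSignal` fields and the threshold of two occurrences for "repeated" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureSignal:
    """Hypothetical summary of one production failure pattern."""
    occurrences: int                # calls that showed the same pattern
    high_impact: bool               # e.g. payment, booking, identity, claims
    safety_sensitive: bool          # unsafe medical, financial, legal, or policy response
    breaks_workflow_contract: bool  # wrong tool, wrong arguments, bad write


def is_test_worthy(signal: FailureSignal, repeat_threshold: int = 2) -> bool:
    """Repeated, high impact, safety-sensitive, or contract-breaking."""
    return (
        signal.occurrences >= repeat_threshold
        or signal.high_impact
        or signal.safety_sensitive
        or signal.breaks_workflow_contract
    )


one_off = FailureSignal(1, False, False, False)
duplicate_write = FailureSignal(1, True, False, True)
print(is_test_worthy(one_off))          # False
print(is_test_worthy(duplicate_write))  # True
```

A one-off confused caller stays manual, while a single duplicate write qualifies on impact alone.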
Capture the Evidence Packet Before You Rewrite Anything
Do not start by editing the prompt. First, freeze the evidence.
For voice agents, the failure may live in the transcript, audio, latency, tool trace, state transition, handoff summary, or post-call side effect. If you preserve only the transcript, you may create a test that passes while the real bug survives.
| Evidence field | Required? | What it proves | Sample value |
|---|---|---|---|
| Source call ID | Yes | Connects the test to the original failure | call_2026_05_29_1842 |
| Agent version | Yes | Shows what code, prompt, model, or workflow failed | agent=scheduler, prompt=v38 |
| Caller goal | Yes | Keeps the test human-readable | "Reschedule appointment after identity check" |
| Sanitized transcript turns | Yes | Recreates the conversational path | Last 6 turns before failure |
| Audio condition | Usually | Explains ASR or interruption issues | office noise, accent, speakerphone |
| Tool-call trace | For workflow agents | Proves tool name, arguments, order, result, and error | lookup_identity returned missing record |
| State before call | For writes | Recreates fixture setup | appointment exists, payment plan disabled |
| Side effect after call | For writes | Proves backend outcome | duplicate calendar event created |
| Trace or correlation ID | Strongly recommended | Connects ASR, LLM, tools, TTS, and logs | OpenTelemetry trace ID |
| Severity and owner | Yes | Decides promotion and triage | severity=high, owner=scheduling-eng |
OpenTelemetry context propagation exists to correlate signals across services. For voice agents, that correlation matters because one call can cross SIP, ASR, LLM, tool execution, TTS, storage, and post-call evaluation. A regression case without correlation IDs is harder to trust.
OpenTelemetry traces also model spans with timestamps, events, status, links, and context. That is the right mental model for failed-call evidence: keep the causal path, not just the final symptom.
Use the voice agent observability and tracing guide when you need the broader instrumentation layer. Use this runbook when you already have a failed call and need to preserve it as a test.
Evidence packet rule: If a teammate cannot move from the regression test back to the original call, trace, fixture, and owner, the test is not ready for CI. The point is to preserve causal evidence, not just transcript text.
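One way to enforce the evidence packet rule is a readiness check that lists what is still missing. The sketch below is illustrative only: the `EvidencePacket` class and its field names are assumptions that loosely mirror the evidence table, not an existing schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Fields the table marks as required for every case.
ALWAYS_REQUIRED = ("source_call_id", "agent_version", "caller_goal",
                   "sanitized_turns", "severity", "owner")
# Extra fields needed when the failed call wrote to a backend.
REQUIRED_FOR_WRITES = ("state_before", "side_effect_after")


@dataclass
class EvidencePacket:
    """Hypothetical container for the evidence fields in the table."""
    source_call_id: str = ""
    agent_version: str = ""
    caller_goal: str = ""
    sanitized_turns: List[str] = field(default_factory=list)
    severity: str = ""
    owner: str = ""
    trace_id: Optional[str] = None  # strongly recommended, not enforced here
    state_before: Optional[dict] = None
    side_effect_after: Optional[dict] = None


def missing_evidence(packet: EvidencePacket, writes_state: bool) -> List[str]:
    """Return required fields that are empty; empty values count as missing."""
    required = ALWAYS_REQUIRED + (REQUIRED_FOR_WRITES if writes_state else ())
    return [name for name in required if not getattr(packet, name)]


packet = EvidencePacket(
    source_call_id="call_2026_05_29_1842",
    agent_version="agent=scheduler, prompt=v38",
    caller_goal="Reschedule appointment after identity check",
    sanitized_turns=["I need to move Tuesday's appointment."],
    severity="high",
    owner="scheduling-eng",
)
print(missing_evidence(packet, writes_state=True))
# ['state_before', 'side_effect_after']
```

A packet that returns a non-empty list is still evidence, not yet a test.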
Classify the Failure Before You Create the Test
Every failed-call test needs one primary failure type. Otherwise the assertion becomes a grab bag.
| Failure type | Primary assertion | Common fix | Promotion level |
|---|---|---|---|
| Intent miss | Agent identifies the wrong caller goal | Add phrasing variants and intent assertions | Scheduled unless high risk |
| Tool selection | Agent calls the wrong tool | Tighten tool descriptions and routing policy | Blocking for critical flows |
| Tool arguments | Agent passes wrong or missing fields | Add entity checks and schema validation | Blocking for writes |
| Tool ordering | Agent calls tools in unsafe order | Add state-machine checks | Blocking |
| Side effect | Backend write is missing, duplicated, or wrong | Add sandbox fixture and post-call verification | Blocking |
| Handoff | Transfer succeeds but context is missing | Add handoff summary and destination assertions | Scheduled or blocking |
| Safety/policy | Agent says or does something prohibited | Add policy assertions and severity routing | Blocking |
| Latency/interruption | Agent is too slow or mishandles barge-in | Add turn-level timing and interruption checks | Scheduled unless launch-critical |
The key is reducing the production call to one durable lesson.
If the real call had 14 turns, a noisy office, two interruptions, a stale CRM record, and a wrong appointment update, resist the urge to preserve all 14 turns. Keep the smallest version that still fails for the same reason. Then add separate tests for separate lessons.
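The promotion column of the failure-type table can be sketched as a lookup that takes exactly one primary label. This is a simplification under stated assumptions: the label strings are hypothetical, and a single `high_risk` flag stands in for the table's "critical flows", "writes", and "launch-critical" qualifiers. Where the table does not name a fallback level, the sketch assumes scheduled.

```python
# Failure types the table promotes to blocking unconditionally.
ALWAYS_BLOCKING = {"tool_ordering", "side_effect", "safety_policy"}
# Failure types that block only when the flow is high risk (assumption:
# one flag covers critical flows, writes, and launch-critical cases).
BLOCKING_WHEN_HIGH_RISK = {
    "intent_miss", "tool_selection", "tool_arguments",
    "handoff", "latency_interruption",
}


def default_promotion(failure_type: str, high_risk: bool) -> str:
    """Map one primary failure label to a starting promotion level."""
    if failure_type in ALWAYS_BLOCKING:
        return "blocking"
    if failure_type in BLOCKING_WHEN_HIGH_RISK:
        return "blocking" if high_risk else "scheduled"
    # An unknown label usually means the call has not been reduced yet.
    raise ValueError(f"unknown failure type: {failure_type}")


print(default_promotion("side_effect", high_risk=False))  # blocking
print(default_promotion("intent_miss", high_risk=False))  # scheduled
```

Because the function accepts a single label, a call with several lessons has to be split into several cases before it can be promoted.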
Recreate the Call With Safe Fixtures
Never use raw production data as the fixture in CI. Use production to learn the pattern. Use sanitized fixtures to reproduce it.
That means:
- replace real names, phone numbers, account IDs, and addresses with test records
- preserve the failure-relevant shape: missing record, duplicate appointment, expired policy, unavailable slot
- route writes through sandbox, dry-run, mock, or allowlisted test endpoints
- keep audio or transcript snippets only when they are redacted and approved for test storage
- keep the source label so future reviewers know why the case exists
Safe fixture rule: A safe fixture keeps the production failure's shape while replacing every real caller identifier and production side effect. If the test needs the real account, phone number, or live write path to fail, it is not ready for automated regression.
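As a rough illustration of the replacement step, the sketch below swaps known caller identifiers for fixture values and masks phone-like numbers. The function name, the sample caller, and the regular expression are all assumptions; the pattern is deliberately simple and is not a complete PII scrubber, so treat it as a starting point that still needs human review.

```python
import re

# Loose pattern for phone-like digit runs; illustrative, not exhaustive.
PHONE_PATTERN = re.compile(r"\+?\d[\d\-\s().]{7,}\d")


def sanitize_turn(text: str, replacements: dict) -> str:
    """Swap known identifiers for fixture values, then mask phone-like numbers."""
    for real_value, fixture_value in replacements.items():
        text = text.replace(real_value, fixture_value)
    return PHONE_PATTERN.sub("<PHONE>", text)


# Hypothetical transcript turn; the name and number are invented.
turn = "This is Dana Smith, call me back at 555-201-7788."
print(sanitize_turn(turn, {"Dana Smith": "Test Caller"}))
# This is Test Caller, call me back at <PHONE>.
```

The explicit replacement map keeps the failure-relevant shape intact: the turn still contains a caller name and a callback number, just not real ones.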
This is where workflow testing and tests as code meet. Workflow testing tells you what to assert. Tests as code tells you how to make the case reviewable.
Write the Failed-Call Regression Test
Use a test file that names the production source, sanitized fixture, expected behavior, tool assertions, and promotion level.
version: 1
suite: failed_call_regression
owner: scheduling-eng
source:
  type: production_call
  source_call_id: call_2026_05_29_1842
  failure_label: wrong_tool_arguments
  severity: high
  detected_by: production_monitoring
  sanitized: true
agent_ref:
  environment: staging
  agent_slug: scheduling-agent
  prompt_version: pr-491
fixture:
  caller_id: caller_fixture_204
  appointment_id: appt_fixture_771
  existing_state:
    appointment_status: confirmed
    available_slots:
      - 2026-06-03T15:00:00-05:00
  side_effect_mode: sandbox
caller:
  language: en-US
  audio_condition: speakerphone_with_office_noise
  goal: move my appointment from Tuesday morning to Wednesday after 2
scenario:
  starting_turn: "I need to move Tuesday's appointment to Wednesday afternoon."
  must_reproduce:
    - caller asks to reschedule
    - agent verifies identity
    - agent offers an allowed Wednesday slot
assertions:
  outcome:
    task_completed: true
    no_unplanned_handoff: true
  tools:
    required_order:
      - lookup_identity
      - list_appointments
      - list_available_slots
      - update_appointment
      - send_confirmation
    arguments:
      update_appointment:
        appointment_id: appt_fixture_771
        new_start_time: 2026-06-03T15:00:00-05:00
    forbidden_tools:
      - create_new_appointment
  side_effects:
    calendar:
      appointment_id: appt_fixture_771
      expected_status: rescheduled
      duplicate_events_allowed: false
  transcript:
    must_not_include:
      - I created a new appointment
      - I cannot access your account
evidence:
  retain_transcript: true
  retain_audio: true
  retain_tool_trace: true
  retain_days: 30
promotion:
  level: blocking
  run_on:
    - prompt_change
    - tool_schema_change
    - release_candidate
The field names are less important than the discipline. A reviewer should be able to answer 5 questions before the test runs:
- What production failure created this case?
- What safe fixture recreates the failure?
- Which behavior proves the bug is fixed?
- Which side effects are allowed?
- What release path does this test block?
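To make the tool assertions concrete, here is a minimal sketch of how a runner might evaluate a recorded tool trace against the `required_order`, `arguments`, and `forbidden_tools` fields. The trace shape (a list of dicts with `name` and `arguments`) and the function name are assumptions; a real test runner will have its own format.

```python
from typing import List


def check_tool_assertions(trace: List[dict], tools: dict) -> List[str]:
    """Return human-readable failures; an empty list means the trace passes."""
    failures = []
    called = [step["name"] for step in trace]

    # Forbidden tools must never appear in the trace.
    for name in tools.get("forbidden_tools", []):
        if name in called:
            failures.append(f"forbidden tool called: {name}")

    # Required tools must appear in order; other calls may be interleaved.
    remaining = iter(called)
    for name in tools.get("required_order", []):
        if name not in remaining:  # membership test consumes the iterator
            failures.append(f"missing or out of order: {name}")
            break

    # Expected arguments must match on every call to the named tool.
    for name, expected in tools.get("arguments", {}).items():
        for step in trace:
            if step["name"] != name:
                continue
            for key, value in expected.items():
                actual = step.get("arguments", {}).get(key)
                if actual != value:
                    failures.append(f"{name}.{key}: expected {value!r}, got {actual!r}")
    return failures


tools = {
    "required_order": ["lookup_identity", "update_appointment"],
    "arguments": {"update_appointment": {"appointment_id": "appt_fixture_771"}},
    "forbidden_tools": ["create_new_appointment"],
}
bad_trace = [
    {"name": "lookup_identity", "arguments": {"caller_id": "caller_fixture_204"}},
    {"name": "update_appointment", "arguments": {"appointment_id": "appt_fixture_000"}},
]
print(check_tool_assertions(bad_trace, tools))
# ["update_appointment.appointment_id: expected 'appt_fixture_771', got 'appt_fixture_000'"]
```

Returning a list of failures instead of raising on the first one keeps the evidence for a failed run in a single report.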
Validate the Test Contract
Failed-call tests become dangerous when they are half-structured. A missing fixture or vague expected outcome turns a regression case into a note.
JSON Schema validation is useful here because it can require fields, constrain nested objects, and reject malformed cases before they hit CI.
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["version", "suite", "owner", "source", "agent_ref", "fixture", "assertions", "promotion"],
  "properties": {
    "source": {
      "type": "object",
      "required": ["type", "source_call_id", "failure_label", "severity", "sanitized"],
      "properties": {
        "type": { "enum": ["production_call"] },
        "source_call_id": { "type": "string", "minLength": 1 },
        "failure_label": { "type": "string", "minLength": 1 },
        "severity": { "enum": ["low", "medium", "high", "critical"] },
        "sanitized": { "const": true }
      }
    },
    "promotion": {
      "type": "object",
      "required": ["level", "run_on"],
      "properties": {
        "level": { "enum": ["blocking", "scheduled", "manual"] },
        "run_on": {
          "type": "array",
          "items": { "type": "string" },
          "minItems": 1
        }
      }
    }
  }
}
The schema will not tell you whether the test is useful. It will catch the easy failures: no owner, no source call, no sanitization marker, no promotion level, no run trigger.
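For teams that want the same easy failures caught without pulling in a JSON Schema library, the checks can be mirrored in plain Python. This is a hand-rolled sketch that covers only part of the schema (it skips the string-length and array-item checks); the function name and error messages are assumptions, and a real pipeline would more likely run the schema itself through a validator.

```python
REQUIRED_TOP_LEVEL = ("version", "suite", "owner", "source",
                      "agent_ref", "fixture", "assertions", "promotion")
REQUIRED_SOURCE = ("type", "source_call_id", "failure_label", "severity", "sanitized")
SEVERITIES = {"low", "medium", "high", "critical"}
LEVELS = {"blocking", "scheduled", "manual"}


def contract_errors(case: dict) -> list:
    """Return contract violations; an empty list means the case may enter CI."""
    errors = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in case]

    source = case.get("source", {})
    errors += [f"missing field: source.{key}" for key in REQUIRED_SOURCE if key not in source]
    if "type" in source and source["type"] != "production_call":
        errors.append("source.type must be production_call")
    if "sanitized" in source and source["sanitized"] is not True:
        errors.append("source.sanitized must be true")
    if "severity" in source and source["severity"] not in SEVERITIES:
        errors.append("source.severity is not a known level")

    if "promotion" in case:
        promotion = case["promotion"]
        if promotion.get("level") not in LEVELS:
            errors.append("promotion.level must be blocking, scheduled, or manual")
        if not promotion.get("run_on"):
            errors.append("promotion.run_on needs at least one trigger")
    return errors


draft = {
    "version": 1,
    "suite": "failed_call_regression",
    "source": {"type": "production_call", "source_call_id": "call_2026_05_29_1842",
               "failure_label": "wrong_tool_arguments", "severity": "high",
               "sanitized": False},
    "agent_ref": {}, "fixture": {}, "assertions": {},
    "promotion": {"level": "blocking", "run_on": []},
}
for problem in contract_errors(draft):
    print(problem)
# missing field: owner
# source.sanitized must be true
# promotion.run_on needs at least one trigger
```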
Promote the Test Into the Right Gate
Not every failed-call regression test belongs in every pull request.
| Promotion level | Runs when | Use for | Do not use for |
|---|---|---|---|
| Blocking | Prompt, tool schema, workflow, or release-candidate changes | High-risk tool calls, safety, payments, identity, compliance, duplicate writes | Long-tail exploratory coverage |
| Scheduled | Nightly, weekly, or low-traffic windows | Coverage clusters, latency drift, handoff quality, multilingual variants | Urgent release blockers |
| Manual | Before launches, incident follow-up, or QA review | Messy cases that need human judgment | Known critical regressions |
Use the voice agent CI/CD testing guide for broad release-gate policy. For failed-call cases, GitHub Actions workflows can keep the first gate straightforward because they are YAML files that define jobs and triggers:
name: voice-agent-failed-call-regression
on:
  pull_request:
    paths:
      - "agents/**"
      - "prompts/**"
      - "voice-tests/**"
  workflow_dispatch:
jobs:
  failed-call-regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate test cases
        run: npm run validate:voice-tests -- voice-tests/failed-calls
      - name: Run blocking failed-call regressions
        run: npm run voice-tests -- --suite=failed_call_regression --level=blocking
For larger organizations, reusable workflows can accept inputs and secrets from calling workflows. That lets application teams call the same voice regression gate while passing their own suite path, environment, or sandbox credentials.
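Whatever runner sits behind the gate has to decide which cases apply to a given level and trigger. The sketch below shows one plausible filter over parsed test cases. It is an assumption about runner behavior rather than a description of any specific tool, and the second case ID and the `nightly` trigger name are hypothetical.

```python
def cases_for_gate(cases: list, level: str, trigger: str) -> list:
    """Return source call IDs of cases that should run at this gate."""
    return [
        case["source"]["source_call_id"]
        for case in cases
        if case["promotion"]["level"] == level
        and trigger in case["promotion"]["run_on"]
    ]


cases = [
    {"source": {"source_call_id": "call_2026_05_29_1842"},
     "promotion": {"level": "blocking",
                   "run_on": ["prompt_change", "release_candidate"]}},
    {"source": {"source_call_id": "call_fixture_scheduled"},
     "promotion": {"level": "scheduled", "run_on": ["nightly"]}},
]
print(cases_for_gate(cases, "blocking", "prompt_change"))
# ['call_2026_05_29_1842']
```

Filtering on both fields keeps scheduled coverage cases out of pull-request runs even when they share a suite with blocking cases.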
What This Runbook Cannot Prove
This workflow is useful, but it has limits.
- It does not prove the original caller would behave the same way again. A regression test recreates the failure pattern, not the entire human context.
- It does not replace production monitoring. Tests catch known patterns. Monitoring catches new failures and tells you what deserves the next test.
- It does not make unsafe writes safe by default. Any test that can book, charge, transfer, delete, or notify needs sandboxing or explicit allowlists.
- It does not remove human review. Some calls are ambiguous. Keep those manual until the expected behavior is crisp enough to automate.
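The sandboxing limit is the easiest one to enforce mechanically. As a sketch, a runner could refuse to execute any case whose fixture does not declare a safe write path. The `side_effect_mode` key comes from the sample test file; the other mode names, the `endpoint` key, and the allowlist parameter are assumptions.

```python
# Modes assumed to never reach a live write path.
SAFE_MODES = {"sandbox", "dry_run", "mock"}


def assert_safe_to_run(case: dict, allowlisted_endpoints: frozenset = frozenset()) -> None:
    """Raise before execution if the case could touch a live write path."""
    fixture = case.get("fixture", {})
    mode = fixture.get("side_effect_mode")
    if mode in SAFE_MODES:
        return
    # Hypothetical "allowlist" mode: the endpoint must be explicitly approved.
    if mode == "allowlist" and fixture.get("endpoint") in allowlisted_endpoints:
        return
    raise RuntimeError(
        f"refusing to run: side_effect_mode={mode!r} is not sandboxed or allowlisted"
    )


assert_safe_to_run({"fixture": {"side_effect_mode": "sandbox"}})  # passes silently
```

A case with no `side_effect_mode` at all is rejected too, so unsafe writes fail closed.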
I used to think the hardest part was generating the test case. It is not. The hard part is preserving enough evidence that the test still means something 6 months later, after the prompt, tool schema, and owner have all changed.
Failed-Call Promotion Checklist
Before a failed production call becomes part of the regression suite or a broader production readiness checklist, check every item:
- The source call ID, agent version, and detection reason are stored.
- The caller data is redacted or replaced with a safe fixture.
- The test has one primary failure label.
- The fixture recreates the failure without touching real customer records.
- Tool-call assertions include name, arguments, order, result, and forbidden calls where relevant.
- Side-effect assertions verify the backend result, not just transcript text.
- The promotion level is blocking, scheduled, or manual.
- The owner is a team that can fix failures.
- The test stores evidence for failed runs.
- The test links back to the incident, monitoring alert, or coverage cluster that created it.
Pair this with incident response after customer-impacting failures. Pair it with voice agent troubleshooting when you still do not know the root cause. Pair it with hallucination detection when the failure is an unsupported or unsafe answer.
The loop is simple: production teaches you what broke, testing prevents the same break from returning, and CI decides which fixes are important enough to block.

