How to Turn Failed Production Calls Into Regression Tests

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

May 29, 2026 | Updated May 29, 2026 | 12 min read

Failed-call regression tests are repeatable voice agent tests created from real production failures. The goal is not to replay every bad call forever. The goal is to preserve the one failure pattern that matters, recreate it safely, and make sure the next prompt, model, or workflow change cannot bring it back.

If you handle 30 calls a week, this runbook is probably too heavy. Listen to the bad calls, fix the obvious bug, and add 2 or 3 dashboard tests. This is for teams with enough traffic that the same failure can show up across agents, locations, languages, or releases.

The uncomfortable lesson: a monitoring alert is not a regression test. A bug ticket is not a regression test. A saved recording is not a regression test. Until the failure has a fixture, assertions, owner, and promotion level, it is just evidence.

TL;DR: Turn failed production calls into regression tests with 5 steps: capture the evidence packet, classify the failure, recreate it with safe fixtures, write assertions against the transcript and backend behavior, then promote the case into blocking, scheduled, or manual regression.

Do not dump raw production calls into CI. Start with the smallest safe scenario that reproduces the failure.

Failed-call regression case: A failed-call regression case is the smallest safe test that preserves a production failure's source label, fixture, assertions, and promotion rule. It should reproduce the behavior that mattered without replaying private caller data or unrelated conversational noise.

Methodology Note: This runbook is based on Hamming's analysis of 4M+ production voice agent calls and testing workflows across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions.

The sample cases focus on tool calls, workflow side effects, unsafe responses, and production monitoring because those are the failures that most often come back after a release.

In Hamming workflow reviews, we found that the cases worth keeping usually have one thing in common: the transcript looks less broken than the backend evidence. The caller may hear a reasonable response while the trace shows a duplicate write, missing lookup, stale eligibility check, or unsafe handoff. That is why this runbook starts with evidence, not prompt editing.

Last Updated: May 2026


What Makes a Production Call Test-Worthy?

A failed call is test-worthy when it teaches you something the existing suite did not know.

That sounds obvious, but teams usually get this wrong in two ways. They either convert every bad call into a bloated test suite nobody trusts, or they fix the bug once and leave no guardrail behind. Both fail.

Use this decision table:

| Production signal | Make it blocking? | Make it scheduled? | Keep manual? | Why |
| --- | --- | --- | --- | --- |
| Wrong tool called in a payment, booking, prescription, identity, or claims flow | Yes | Yes | No | The agent changed real workflow state or risked doing so. |
| Correct tool called with wrong arguments | Yes for high-risk flows | Yes | No | Argument mistakes tend to return after prompt and schema edits. |
| Unsafe medical, financial, legal, or policy response | Yes | Yes | No | Safety regressions should block release. |
| Repeated fallback cluster with clear caller goal | Maybe | Yes | No | It improves response coverage but may not need to block every PR. |
| One-off confused caller with no repeated pattern | No | No | Yes | Keep it for review, not CI. |
| Provider outage or telephony failure outside the agent's control | No | Maybe | Yes | Test your fallback behavior, not the vendor outage itself. |
| Long, messy call with 9 unrelated issues | No | Maybe after reduction | Yes | Split it into the smallest reproducible failure first. |

Failed-call regression rule: a production failure should become a regression test when it is repeated, high impact, safety-sensitive, or caused by a workflow contract the agent must preserve.
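
The rule above can be sketched as a small triage helper. This is a minimal sketch under stated assumptions, not Hamming's implementation: `FailureSignal`, `is_test_worthy`, and `suggest_promotion` are hypothetical names, and the mapping to promotion levels is a simplification of the decision table (the "Maybe" cells still need human judgment).

```python
from dataclasses import dataclass


@dataclass
class FailureSignal:
    """Hypothetical triage flags for one failed production call."""
    repeated: bool = False                  # same pattern seen across calls
    high_impact: bool = False               # changed or risked real workflow state
    safety_sensitive: bool = False          # medical, financial, legal, or policy risk
    breaks_workflow_contract: bool = False  # violates a contract the agent must preserve


def is_test_worthy(signal: FailureSignal) -> bool:
    """Apply the failed-call regression rule: any one flag is enough."""
    return any((
        signal.repeated,
        signal.high_impact,
        signal.safety_sensitive,
        signal.breaks_workflow_contract,
    ))


def suggest_promotion(signal: FailureSignal) -> str:
    """Suggest a starting promotion level (an assumption, not a fixed policy)."""
    if signal.high_impact or signal.safety_sensitive or signal.breaks_workflow_contract:
        return "blocking"
    if signal.repeated:
        return "scheduled"
    return "manual"  # one-off cases stay in human review


if __name__ == "__main__":
    print(suggest_promotion(FailureSignal(repeated=True)))
```

Treat the suggested level as a default that the owning team can override during review.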

Capture the Evidence Packet Before You Rewrite Anything

Do not start by editing the prompt. First, freeze the evidence.

For voice agents, the failure may live in the transcript, audio, latency, tool trace, state transition, handoff summary, or post-call side effect. If you preserve only the transcript, you may create a test that passes while the real bug survives.

| Evidence field | Required? | What it proves | Sample value |
| --- | --- | --- | --- |
| Source call ID | Yes | Connects the test to the original failure | call_2026_05_29_1842 |
| Agent version | Yes | Shows what code, prompt, model, or workflow failed | agent=scheduler, prompt=v38 |
| Caller goal | Yes | Keeps the test human-readable | "Reschedule appointment after identity check" |
| Sanitized transcript turns | Yes | Recreates the conversational path | Last 6 turns before failure |
| Audio condition | Usually | Explains ASR or interruption issues | office noise, accent, speakerphone |
| Tool-call trace | For workflow agents | Proves tool name, arguments, order, result, and error | lookup_identity returned missing record |
| State before call | For writes | Recreates fixture setup | appointment exists, payment plan disabled |
| Side effect after call | For writes | Proves backend outcome | duplicate calendar event created |
| Trace or correlation ID | Strongly recommended | Connects ASR, LLM, tools, TTS, and logs | OpenTelemetry trace ID |
| Severity and owner | Yes | Decides promotion and triage | severity=high, owner=scheduling-eng |

OpenTelemetry context propagation exists to correlate signals across services. For voice agents, that correlation matters because one call can cross SIP, ASR, LLM, tool execution, TTS, storage, and post-call evaluation. A regression case without correlation IDs is harder to trust.

OpenTelemetry traces also model spans with timestamps, events, status, links, and context. That is the right mental model for failed-call evidence: keep the causal path, not just the final symptom.

Use the voice agent observability and tracing guide when you need the broader instrumentation layer. Use this runbook when you already have a failed call and need to preserve it as a test.

Evidence packet rule: If a teammate cannot move from the regression test back to the original call, trace, fixture, and owner, the test is not ready for CI. The point is to preserve causal evidence, not just transcript text.
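
One way to enforce the evidence packet rule is a completeness check that runs before a case is accepted. The sketch below is illustrative only: the `EvidencePacket` class, its field names, and `missing_evidence` are assumptions modeled on the evidence table, not an existing API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvidencePacket:
    """Hypothetical container for the evidence fields in the table above."""
    source_call_id: str = ""
    agent_version: str = ""
    caller_goal: str = ""
    sanitized_turns: list = field(default_factory=list)
    severity: str = ""
    owner: str = ""
    tool_trace: Optional[list] = None         # needed for workflow agents
    state_before: Optional[dict] = None       # needed when the call writes
    side_effect_after: Optional[dict] = None  # needed when the call writes
    trace_id: Optional[str] = None            # strongly recommended, not enforced here


def missing_evidence(packet, workflow_agent=False, performs_writes=False):
    """Return the names of required fields that are empty or absent."""
    required = ["source_call_id", "agent_version", "caller_goal",
                "sanitized_turns", "severity", "owner"]
    if workflow_agent:
        required.append("tool_trace")
    if performs_writes:
        required += ["state_before", "side_effect_after"]
    # Empty strings, empty lists, empty dicts, and None all count as missing.
    return [name for name in required if not getattr(packet, name)]


if __name__ == "__main__":
    packet = EvidencePacket(
        source_call_id="call_2026_05_29_1842",
        agent_version="agent=scheduler, prompt=v38",
        caller_goal="Reschedule appointment after identity check",
        sanitized_turns=["I need to move Tuesday's appointment."],
        severity="high",
        owner="scheduling-eng",
    )
    print(missing_evidence(packet, workflow_agent=True, performs_writes=True))
```

A non-empty result means the case should stay out of CI until someone fills the gaps.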

Classify the Failure Before You Create the Test

Every failed-call test needs one primary failure type. Otherwise the assertion becomes a grab bag.

| Failure type | Primary assertion | Common fix | Promotion level |
| --- | --- | --- | --- |
| Intent miss | Agent identifies the wrong caller goal | Add phrasing variants and intent assertions | Scheduled unless high risk |
| Tool selection | Agent calls the wrong tool | Tighten tool descriptions and routing policy | Blocking for critical flows |
| Tool arguments | Agent passes wrong or missing fields | Add entity checks and schema validation | Blocking for writes |
| Tool ordering | Agent calls tools in unsafe order | Add state-machine checks | Blocking |
| Side effect | Backend write is missing, duplicated, or wrong | Add sandbox fixture and post-call verification | Blocking |
| Handoff | Transfer succeeds but context is missing | Add handoff summary and destination assertions | Scheduled or blocking |
| Safety/policy | Agent says or does something prohibited | Add policy assertions and severity routing | Blocking |
| Latency/interruption | Agent is too slow or mishandles barge-in | Add turn-level timing and interruption checks | Scheduled unless launch-critical |

The key is reducing the production call to one durable lesson.

If the real call had 14 turns, a noisy office, two interruptions, a stale CRM record, and a wrong appointment update, resist the urge to preserve all 14 turns. Keep the smallest version that still fails for the same reason. Then add separate tests for separate lessons.
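
Reducing a call to its smallest failing version can be partly automated. The following is a rough sketch of a greedy reduction, assuming you can supply a `still_fails` callable that reruns the scenario and reports whether the same failure label appears. Both `reduce_turns` and `still_fails` are hypothetical names, and the stub predicate in the usage example stands in for a real agent run.

```python
def reduce_turns(turns, still_fails):
    """Greedily drop caller turns while the same failure still reproduces.

    This is a single pass, so the result is small but not guaranteed minimal.
    """
    if not still_fails(turns):
        raise ValueError("the full scenario does not reproduce the failure")
    kept = list(turns)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if candidate and still_fails(candidate):
            kept = candidate  # turn i was noise; retry the same index
        else:
            i += 1            # turn i is needed; keep it and move on
    return kept


if __name__ == "__main__":
    turns = [
        "Hi, I have a question about parking.",
        "I need to move Tuesday's appointment.",
        "Sorry, one second, someone is talking to me.",
        "Wednesday after 2 works.",
    ]

    # Stub: pretend the bug needs the reschedule request and the new day.
    def still_fails(candidate):
        return (any("move" in t for t in candidate)
                and any("Wednesday" in t for t in candidate))

    print(reduce_turns(turns, still_fails))
```

Each call to `still_fails` costs one scenario run, so this works best on the last handful of turns rather than the whole call.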

Recreate the Call With Safe Fixtures

Never use raw production data as the fixture in CI. Use production to learn the pattern. Use sanitized fixtures to reproduce it.

That means:

  • replace real names, phone numbers, account IDs, and addresses with test records
  • preserve the failure-relevant shape: missing record, duplicate appointment, expired policy, unavailable slot
  • route writes through sandbox, dry-run, mock, or allowlisted test endpoints
  • keep audio or transcript snippets only when they are redacted and approved for test storage
  • keep the source label so future reviewers know why the case exists

Safe fixture rule: A safe fixture keeps the production failure's shape while replacing every real caller identifier and production side effect. If the test needs the real account, phone number, or live write path to fail, it is not ready for automated regression.
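
As a rough illustration of the safe fixture rule, the sketch below swaps known real identifiers for fixture IDs and masks anything that looks like a phone number. Everything here is an assumption for demonstration: the `sanitize_turn` helper, the replacement map, and the deliberately crude phone pattern, which can over-match other digit runs such as dates. It is not a substitute for an approved redaction pipeline.

```python
import re

# Crude pattern: a digit, 7+ digits/separators, then a digit. Illustrative only.
PHONE_RE = re.compile(r"\+?\d[\d\-\s()]{7,}\d")


def sanitize_turn(text, replacements):
    """Replace known real identifiers with fixture IDs, then mask phone numbers."""
    for real, fixture in replacements.items():
        # A function replacement avoids backslash handling in the fixture string.
        text = re.sub(re.escape(real), lambda _m, f=fixture: f, text,
                      flags=re.IGNORECASE)
    return PHONE_RE.sub("[PHONE_FIXTURE]", text)


if __name__ == "__main__":
    raw = "This is Dana Lee, my number is 415-555-0134."
    print(sanitize_turn(raw, {"Dana Lee": "caller_fixture_204"}))
    # prints "This is caller_fixture_204, my number is [PHONE_FIXTURE]."
```

The name "Dana Lee" and the phone number are made-up sample values. The replacement map would normally come from the fixture definition, so the same fake caller appears consistently across transcript, tool arguments, and backend state.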

This is where workflow testing and tests as code meet. Workflow testing tells you what to assert. Tests as code tells you how to make the case reviewable.

Write the Failed-Call Regression Test

Use a test file that names the production source, sanitized fixture, expected behavior, tool assertions, and promotion level.

version: 1
suite: failed_call_regression
owner: scheduling-eng
source:
  type: production_call
  source_call_id: call_2026_05_29_1842
  failure_label: wrong_tool_arguments
  severity: high
  detected_by: production_monitoring
  sanitized: true

agent_ref:
  environment: staging
  agent_slug: scheduling-agent
  prompt_version: pr-491

fixture:
  caller_id: caller_fixture_204
  appointment_id: appt_fixture_771
  existing_state:
    appointment_status: confirmed
    available_slots:
      - 2026-06-03T15:00:00-05:00
  side_effect_mode: sandbox

caller:
  language: en-US
  audio_condition: speakerphone_with_office_noise
  goal: move my appointment from Tuesday morning to Wednesday after 2

scenario:
  starting_turn: "I need to move Tuesday's appointment to Wednesday afternoon."
  must_reproduce:
    - caller asks to reschedule
    - agent verifies identity
    - agent offers an allowed Wednesday slot

assertions:
  outcome:
    task_completed: true
    no_unplanned_handoff: true
  tools:
    required_order:
      - lookup_identity
      - list_appointments
      - list_available_slots
      - update_appointment
      - send_confirmation
    arguments:
      update_appointment:
        appointment_id: appt_fixture_771
        new_start_time: 2026-06-03T15:00:00-05:00
    forbidden_tools:
      - create_new_appointment
  side_effects:
    calendar:
      appointment_id: appt_fixture_771
      expected_status: rescheduled
      duplicate_events_allowed: false
  transcript:
    must_not_include:
      - I created a new appointment
      - I cannot access your account
  evidence:
    retain_transcript: true
    retain_audio: true
    retain_tool_trace: true
    retain_days: 30

promotion:
  level: blocking
  run_on:
    - prompt_change
    - tool_schema_change
    - release_candidate

The field names are less important than the discipline. A reviewer should be able to answer 5 questions before the test runs:

  1. What production failure created this case?
  2. What safe fixture recreates the failure?
  3. Which behavior proves the bug is fixed?
  4. Which side effects are allowed?
  5. What release path does this test block?
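
To make the `tools` assertions in the test file concrete, here is a minimal sketch of how a runner might check a recorded tool-call trace. It assumes the trace is a list of dicts with `name` and `arguments` keys; that shape and the `check_tool_assertions` function are hypothetical, not the format of any specific platform.

```python
def check_tool_assertions(trace, required_order, expected_args, forbidden_tools):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    called = [call["name"] for call in trace]

    # Forbidden tools must never appear in the trace.
    for name in called:
        if name in forbidden_tools:
            failures.append(f"forbidden tool called: {name}")

    # Required tools must appear in this relative order (other calls may interleave).
    position = 0
    for name in required_order:
        try:
            position = called.index(name, position) + 1
        except ValueError:
            failures.append(f"missing or out-of-order tool: {name}")
            break

    # Every call to a tool with expected arguments must carry those values.
    for call in trace:
        for key, value in expected_args.get(call["name"], {}).items():
            actual = call.get("arguments", {}).get(key)
            if actual != value:
                failures.append(
                    f"{call['name']}.{key}: expected {value!r}, got {actual!r}")
    return failures


if __name__ == "__main__":
    trace = [
        {"name": "lookup_identity", "arguments": {"caller_id": "caller_fixture_204"}},
        {"name": "list_appointments", "arguments": {}},
        {"name": "list_available_slots", "arguments": {}},
        {"name": "create_new_appointment", "arguments": {}},  # the original bug
        {"name": "send_confirmation", "arguments": {}},
    ]
    for failure in check_tool_assertions(
        trace,
        required_order=["lookup_identity", "list_appointments",
                        "list_available_slots", "update_appointment",
                        "send_confirmation"],
        expected_args={"update_appointment": {"appointment_id": "appt_fixture_771"}},
        forbidden_tools=["create_new_appointment"],
    ):
        print(failure)
```

Returning a list of failures rather than raising on the first one keeps the full picture in the evidence for a failed run.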

Validate the Test Contract

Failed-call tests become dangerous when they are half-structured. A missing fixture or vague expected outcome turns a regression case into a note.

JSON Schema validation is useful here because it can require fields, constrain nested objects, and reject malformed cases before they hit CI.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["version", "suite", "owner", "source", "agent_ref", "fixture", "assertions", "promotion"],
  "properties": {
    "source": {
      "type": "object",
      "required": ["type", "source_call_id", "failure_label", "severity", "sanitized"],
      "properties": {
        "type": { "enum": ["production_call"] },
        "source_call_id": { "type": "string", "minLength": 1 },
        "failure_label": { "type": "string", "minLength": 1 },
        "severity": { "enum": ["low", "medium", "high", "critical"] },
        "sanitized": { "const": true }
      }
    },
    "promotion": {
      "type": "object",
      "required": ["level", "run_on"],
      "properties": {
        "level": { "enum": ["blocking", "scheduled", "manual"] },
        "run_on": {
          "type": "array",
          "items": { "type": "string" },
          "minItems": 1
        }
      }
    }
  }
}

The schema will not tell you whether the test is useful. It will catch the easy failures: no owner, no source call, no sanitization marker, no promotion level, no run trigger.
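
For readers who want to see what those easy failures look like at runtime, here is a dependency-free sketch that mirrors the schema's required fields in plain Python. It is an illustration, not a replacement for a real JSON Schema validator; `contract_errors` is a hypothetical helper and it assumes the test file has already been parsed into a dict.

```python
REQUIRED_TOP = ("version", "suite", "owner", "source",
                "agent_ref", "fixture", "assertions", "promotion")
REQUIRED_SOURCE = ("type", "source_call_id", "failure_label",
                   "severity", "sanitized")
LEVELS = ("blocking", "scheduled", "manual")


def contract_errors(case):
    """Return contract violations for one parsed test case (empty list = valid)."""
    errors = [f"missing field: {key}" for key in REQUIRED_TOP if key not in case]

    source = case.get("source", {})
    errors += [f"missing field: source.{key}"
               for key in REQUIRED_SOURCE if key not in source]
    if source.get("sanitized") is not True:
        errors.append("source.sanitized must be true")

    promotion = case.get("promotion", {})
    if promotion.get("level") not in LEVELS:
        errors.append("promotion.level must be blocking, scheduled, or manual")
    if not promotion.get("run_on"):
        errors.append("promotion.run_on must list at least one trigger")
    return errors


if __name__ == "__main__":
    draft = {"suite": "failed_call_regression",
             "source": {"sanitized": False},
             "promotion": {"level": "urgent"}}
    for error in contract_errors(draft):
        print(error)
```

Running the check as the first CI step means a half-filled case fails fast, before any agent call is made.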

Promote the Test Into the Right Gate

Not every failed-call regression test belongs in every pull request.

| Promotion level | Runs when | Use for | Do not use for |
| --- | --- | --- | --- |
| Blocking | Prompt, tool schema, workflow, or release-candidate changes | High-risk tool calls, safety, payments, identity, compliance, duplicate writes | Long-tail exploratory coverage |
| Scheduled | Nightly, weekly, or low-traffic windows | Coverage clusters, latency drift, handoff quality, multilingual variants | Urgent release blockers |
| Manual | Before launches, incident follow-up, or QA review | Messy cases that need human judgment | Known critical regressions |

Use the voice agent CI/CD testing guide for broad release-gate policy. For failed-call cases, GitHub Actions workflows can keep the first gate straightforward because they are YAML files that define jobs and triggers:

name: voice-agent-failed-call-regression

on:
  pull_request:
    paths:
      - "agents/**"
      - "prompts/**"
      - "voice-tests/**"
  workflow_dispatch:

jobs:
  failed-call-regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: npm ci  # assumes a committed package-lock.json
      - name: Validate test cases
        run: npm run validate:voice-tests -- voice-tests/failed-calls
      - name: Run blocking failed-call regressions
        run: npm run voice-tests -- --suite=failed_call_regression --level=blocking

For larger organizations, reusable workflows can accept inputs and secrets from calling workflows. That lets application teams call the same voice regression gate while passing their own suite path, environment, or sandbox credentials.
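
Behind a flag like `--level=blocking`, the runner has to decide which cases apply to the current change. A minimal sketch of that selection, assuming each parsed case carries the `promotion.level` and `promotion.run_on` fields from the test file; `select_cases` and the trigger names are illustrative, not part of any specific runner.

```python
def select_cases(cases, level, trigger):
    """Pick the cases that should run for this promotion level and trigger."""
    return [
        case for case in cases
        if case["promotion"]["level"] == level
        and trigger in case["promotion"]["run_on"]
    ]


if __name__ == "__main__":
    cases = [
        {"id": "wrong_tool_arguments",
         "promotion": {"level": "blocking",
                       "run_on": ["prompt_change", "tool_schema_change",
                                  "release_candidate"]}},
        {"id": "fallback_cluster_es",
         "promotion": {"level": "scheduled", "run_on": ["nightly"]}},
    ]
    selected = select_cases(cases, level="blocking", trigger="prompt_change")
    print([case["id"] for case in selected])
```

Keeping selection as plain data filtering makes it easy to audit why a given case did or did not block a release.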

What This Runbook Cannot Prove

This workflow is useful, but it has limits.

  • It does not prove the original caller would behave the same way again. A regression test recreates the failure pattern, not the entire human context.
  • It does not replace production monitoring. Tests catch known patterns. Monitoring catches new failures and tells you what deserves the next test.
  • It does not make unsafe writes safe by default. Any test that can book, charge, transfer, delete, or notify needs sandboxing or explicit allowlists.
  • It does not remove human review. Some calls are ambiguous. Keep those manual until the expected behavior is crisp enough to automate.

I used to think the hardest part was generating the test case. It is not. The hard part is preserving enough evidence that the test still means something 6 months later, after the prompt, tool schema, and owner have all changed.

Failed-Call Promotion Checklist

Before a failed production call becomes part of the regression suite or a broader production readiness checklist, check every item:

  • The source call ID, agent version, and detection reason are stored.
  • The caller data is redacted or replaced with a safe fixture.
  • The test has one primary failure label.
  • The fixture recreates the failure without touching real customer records.
  • Tool-call assertions include name, arguments, order, result, and forbidden calls where relevant.
  • Side-effect assertions verify the backend result, not just transcript text.
  • The promotion level is blocking, scheduled, or manual.
  • The owner is a team that can fix failures.
  • The test stores evidence for failed runs.
  • The test links back to the incident, monitoring alert, or coverage cluster that created it.

Pair this with incident response after customer-impacting failures. Pair it with voice agent troubleshooting when you still do not know the root cause. Pair it with hallucination detection when the failure is an unsupported or unsafe answer.

The loop is simple: production teaches you what broke, testing prevents the same break from returning, and CI decides which fixes are important enough to block.

Frequently Asked Questions

What is a failed-call regression test?

A failed-call regression test is a repeatable voice agent test created from a real production failure. Hamming recommends preserving the source call ID, sanitized fixture, expected behavior, tool-call assertions, side-effect checks, and promotion level so the failure cannot return silently after the next release.

Which production calls should become regression tests?

Convert calls that are repeated, high impact, safety-sensitive, or tied to a workflow contract the agent must preserve. According to Hamming's runbook, one-off confusing calls usually stay manual, while wrong tool calls, unsafe responses, duplicate writes, and failed handoffs should become blocking or scheduled regression tests.

Should raw production calls be replayed in CI?

No. Hamming recommends using production calls to identify the failure pattern, then recreating it with redacted transcripts, safe fixtures, and sandboxed side effects. Raw production recordings should not enter CI unless your security, consent, retention, and access controls explicitly allow that use.

What evidence should you capture from a failed call?

At minimum, capture the source call ID, agent version, caller goal, sanitized transcript turns, failure label, severity, owner, and expected behavior. For workflow agents, Hamming's checklist also requires tool-call traces, state before the call, side effects after the call, and a trace or correlation ID when available.

How do you test tool calls and side effects safely?

Use a sandbox, mock, dry-run endpoint, or allowlisted test record before asserting tool behavior. Hamming recommends checking the tool name, arguments, order, result, forbidden calls, and backend side effect because transcript-only tests can pass while the real workflow remains broken.

When should failed-call regression tests run?

Blocking failed-call tests should run on prompt changes, tool schema changes, workflow changes, and release candidates. Scheduled tests can run nightly or weekly, while manual tests should stay in launch reviews or incident follow-up until their expected behavior is clear enough to automate.

How does this runbook relate to response coverage and tests as code?

Response coverage identifies which production gaps matter, and tests as code makes test definitions reviewable in Git. The failed-call regression runbook connects those two ideas by taking one production failure and turning it into a safe, owned, promoted regression case.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on the dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”