Voice Agent Call Evidence Export Runbook: Transcripts, Audio, Traces, and QA Packets

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

June 3, 2026 | Updated June 3, 2026 | 14 min read

Voice agent call evidence export is the workflow for packaging transcripts, audio, traces, tool-call records, and evaluation results so QA, compliance, and engineering reviewers can inspect selected calls without searching five production systems.

The common mistake is exporting transcripts and calling the job done. A transcript may show what was said, but it will not prove which agent version ran, which provider call ID owns the recording, whether the audio was redacted, which tool call wrote to a backend, why the call was selected, or whether the reviewer saw the same artifact as everyone else.

If you only need to listen to 5 calls after a launch, this runbook is overkill. Use the dashboard. If you review hundreds or thousands of calls across agents, regions, queues, or compliance programs, you need an evidence packet.

Scope: this runbook applies to production voice agents that already capture transcripts, recordings, traces, tool calls, and evaluation results. It does not replace retention policy, legal review, recording-consent rules, or provider-specific export permissions.

TL;DR: Export voice agent calls as reviewer-safe evidence packets with 10 required fields: canonical call ID, provider aliases, selection reason, redacted transcript, audio pointer, trace ID, tool-call summary, evaluation result, redaction state, and manifest hash.

Do not batch download everything. Export the calls with a reason: failure cluster, compliance sample, regression candidate, escalation spike, customer report, or scheduled QA cohort.

Call evidence packet: a bounded bundle that lets a reviewer reconstruct one voice-agent call from transcript to audio to traces to tool behavior without broad production access.

Methodology Note: This export workflow is based on Hamming's analysis of 4M+ production voice agent calls and QA review workflows across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions.

The packet fields focus on the evidence families teams need for QA review, incident follow-up, compliance sampling, and regression-test promotion.

Across Hamming's analysis of 4M+ production voice agent calls and 10K+ agents, the painful part was rarely the download button. Review breaks when the transcript, audio, trace, tool result, and evaluation score point to different IDs.

Last Updated: June 2026

What Goes in a Call Evidence Packet

The packet is not one file. It is a manifest plus controlled references to the artifacts a reviewer needs.

Use this as the minimum schema:

| Field | Required? | Sample Value | Why It Matters |
| --- | --- | --- | --- |
| canonicalCallId | Yes | call_2026_06_03_1842 | Stable internal identity across providers and systems |
| providerAliases | Yes | Twilio CallSid, LiveKit room, Retell call ID, Vapi call ID | Lets reviewers find source artifacts |
| selectionReason | Yes | failed_tool_call, qa_sample, compliance_sample | Explains why this call was exported |
| agentVersion | Yes | billing-agent@2026-06-03.2 | Connects behavior to a shipped prompt/model/tool config |
| redactedTranscriptUri | Yes | s3://review-packets/.../transcript.json | Reviewable conversation text |
| audioUri | Usually | s3://review-packets/.../audio.wav | Lets QA hear silence, interruption, noise, tone, and playback issues |
| traceId | Usually | 4bf92f3577b34da6a3ce929d0e0e4736 | Connects logs, spans, model calls, and tool calls |
| toolEvidence | For workflow agents | tool name, call ID, argument summary, result summary | Proves whether backend actions matched the conversation |
| evaluationResult | Yes | task_failed, score 0.42, rubric refund_policy_v3 | Shows how the call was judged |
| redactionState | Yes | redacted, raw_restricted, aggregate_only | Prevents accidental broad access to sensitive content |
| manifestHash | Yes | SHA-256 of manifest | Proves the packet did not silently change after review |
| reviewStatus | Yes | queued, reviewed, promoted_to_test, ignored | Keeps offline review from becoming a dead archive |

Export rule: if the transcript, audio, trace, and tool evidence cannot be joined by one call identity, the packet is not audit-ready.

This is why the IVR log correlation runbook matters before export. A reviewer should not need to know whether the source system calls the object a CallSid, contact ID, room name, test run, or trace. Store those as aliases under one internal call ID.
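One way to hold aliases under a single internal ID is a small lookup keyed by source system and provider ID. This is a minimal in-memory sketch, not a provider API: the `AliasIndex` class, its method names, and the `CA-placeholder` value are hypothetical, and a real job would back this with a database and unique constraints.

```python
class AliasIndex:
    """Maps (source system, provider ID) pairs to one canonical call ID."""

    def __init__(self) -> None:
        self._by_alias: dict[tuple[str, str], str] = {}

    def register(self, canonical_call_id: str, system: str, provider_id: str) -> None:
        key = (system, provider_id)
        existing = self._by_alias.get(key)
        if existing is not None and existing != canonical_call_id:
            # Conflicting identity: one provider ID cannot belong to two calls.
            raise ValueError(f"{system}:{provider_id} already maps to {existing}")
        self._by_alias[key] = canonical_call_id

    def resolve(self, system: str, provider_id: str) -> str:
        """Return the canonical call ID for a provider-side identifier."""
        try:
            return self._by_alias[(system, provider_id)]
        except KeyError:
            raise LookupError(f"no canonical call ID for {system}:{provider_id}") from None

    def aliases_for(self, canonical_call_id: str) -> dict[str, str]:
        """Collect every alias stored under one canonical call ID."""
        return {
            system: provider_id
            for (system, provider_id), call_id in self._by_alias.items()
            if call_id == canonical_call_id
        }


index = AliasIndex()
# "CA-placeholder" stands in for a real provider call ID.
index.register("call_2026_06_03_1842", "twilioCallSid", "CA-placeholder")
index.register("call_2026_06_03_1842", "livekitRoom", "billing-prod-1842")
```

With this shape, a reviewer tool only ever asks for the canonical call ID, and a conflicting alias fails at registration time rather than during review.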

I used to think the transcript was the main export artifact. After reviewing production voice-agent failures, I now treat the transcript as one field in a packet; the manifest, selection reason, call identity, redaction state, and trace/tool pointers are what make the review repeatable.

Choose Calls by Review Reason, Not by Volume

Batch export is dangerous when it starts with "give me everything from yesterday." That creates large sensitive datasets and gives reviewers too much noise.

Start with a selection taxonomy:

| Selection Reason | Include When | Export Priority | Reviewer |
| --- | --- | --- | --- |
| customer_reported | A customer or support team named a bad call, timestamp, account, or symptom | Highest | Engineering + support |
| failed_tool_call | Tool failed, timed out, returned invalid data, duplicated a write, or used wrong arguments | Highest | Engineering |
| unsafe_or_noncompliant | The agent gave unsafe advice, missed a disclosure, leaked sensitive data, or ignored policy | Highest | Compliance + QA |
| regression_candidate | A production failure should become a test case | High | Engineering + QA |
| low_confidence_turn | ASR or intent confidence fell below threshold | Medium | QA |
| latency_or_silence | The call had long pauses, slow response, dead air, or interruption failures | Medium | Engineering |
| scheduled_sample | Randomized or stratified QA sample by agent, queue, language, or cohort | Medium | QA |
| executive_report_sample | Representative calls for weekly quality review | Low | Ops leadership |

For regression candidates, pair this with the failed-call regression runbook. The evidence packet explains the original failure; the regression test recreates the smallest safe version of that failure.
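The taxonomy above can be encoded so the export job refuses calls that arrive without a reason and works the highest-priority ones first. This is a sketch under assumptions: the `Priority` enum, the `SELECTION_PRIORITY` mapping, and `order_candidates` are illustrative names, and the candidate dicts only mimic the packet fields used in this runbook.

```python
from enum import IntEnum


class Priority(IntEnum):
    HIGHEST = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3


# Selection reasons and priorities copied from the taxonomy table.
SELECTION_PRIORITY = {
    "customer_reported": Priority.HIGHEST,
    "failed_tool_call": Priority.HIGHEST,
    "unsafe_or_noncompliant": Priority.HIGHEST,
    "regression_candidate": Priority.HIGH,
    "low_confidence_turn": Priority.MEDIUM,
    "latency_or_silence": Priority.MEDIUM,
    "scheduled_sample": Priority.MEDIUM,
    "executive_report_sample": Priority.LOW,
}


def order_candidates(candidates: list[dict]) -> list[dict]:
    """Reject calls without a known selection reason, then sort by priority."""
    for candidate in candidates:
        if candidate.get("selectionReason") not in SELECTION_PRIORITY:
            call_id = candidate.get("canonicalCallId")
            raise ValueError(f"{call_id}: missing or unknown selection reason")
    # sorted() is stable, so calls with equal priority keep their input order.
    return sorted(candidates, key=lambda c: SELECTION_PRIORITY[c["selectionReason"]])
```

Raising on an unknown reason is a deliberate choice here: a call nobody can justify exporting should stop the batch rather than slip into a reviewer queue.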

Export Provider Artifacts Without Losing Context

Most voice-agent stacks already expose the raw ingredients. The hard part is keeping them joined.

| Source System | Export | Join Key | Keep Out of Broad Exports |
| --- | --- | --- | --- |
| Telephony provider | call metadata, recording ID, disconnect reason, call quality metrics | provider call ID and canonical call ID | phone numbers, raw SIP headers with customer data |
| Voice agent platform | transcript, messages, recording URL, log URL, PCAP, call analysis, latency | platform call ID and trace ID | raw logs with secrets or unredacted variables |
| Media runtime | room events, participant events, track IDs, egress recordings | room name, participant ID, canonical call ID | raw media for callers outside the selected cohort |
| Evaluation system | rubric, score, assertion results, evaluator version, failure label | test run ID and call ID | model prompts with sensitive customer data |
| Tool/backend logs | tool name, request summary, result summary, retry status, side-effect proof | tool-call ID, trace ID, call ID | full payloads unless needed for restricted review |
| Object storage | transcript JSON, audio file, manifest, redaction report | object path and manifest hash | raw unredacted archive outside approved role |

Twilio's Recording resource documents metadata and media retrieval for recordings, including call SID, recording SID, status, channels, duration, and recording URL callbacks. Its Transcriptions resource represents text and metadata from transcribed recordings. Those are source artifacts, not the complete evidence packet.

Amazon Connect stores recordings and transcripts in S3-backed locations, and its Contact Lens output paths separate original transcript JSON, redacted transcript JSON, and redacted audio by contact ID. That separation is a useful pattern: raw audio, redacted text, and analytics output should not be treated as the same artifact.

LiveKit Egress can record or export rooms and tracks. LiveKit's agent observability docs also describe recordings, transcripts, traces, and logs. For export packets, keep the room or participant IDs as aliases, then tie the media pointer back to the canonical call ID used by the rest of the evidence.

OpenTelemetry context propagation is what keeps traces, metrics, and logs correlated across services. For voice agents, propagate the trace ID through ASR, LLM, tools, TTS, storage, and evaluation so the reviewer can move from a failed score to the span or tool call that explains it.
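To make the propagation idea concrete, here is a standard-library sketch that reads the trace ID out of a W3C Trace Context `traceparent` header and stamps it on each stage's record. The helper names (`trace_id_from_traceparent`, `tag_stage`) and the record shape are assumptions for illustration, the span ID in the sample header is a placeholder, and the parser is simplified. In production you would normally rely on the OpenTelemetry SDK's propagators rather than hand-parsing headers.

```python
import re

# W3C Trace Context `traceparent` layout: version-traceid-parentid-flags.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def trace_id_from_traceparent(header: str) -> str:
    """Extract the 32-hex-character trace ID from a traceparent header."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError("malformed traceparent header")
    trace_id = match.group(2)
    if trace_id == "0" * 32:
        raise ValueError("all-zero trace ID is invalid")
    return trace_id


def tag_stage(stage: str, traceparent: str, payload: dict) -> dict:
    """Stamp one pipeline stage's record with the call's trace ID."""
    return {"stage": stage, "traceId": trace_id_from_traceparent(traceparent), **payload}


# Placeholder header; the trace ID matches the sample value used in this runbook.
header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
records = [tag_stage(stage, header, {}) for stage in ("asr", "llm", "tool", "tts", "evaluation")]
```

Because every record carries the same `traceId`, a reviewer who starts from a failed evaluation can filter the other stages by that one value.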

Voice-agent platforms expose similar artifact families. Vapi artifact plans can include recordings, transcripts, logs, PCAP files, and API-accessible call artifacts. Retell's call APIs expose call details, transcript, recording, latency, and function-call data, while its dynamic variables include a per-call ID. Treat those IDs as aliases in your manifest, not as the only identity in your system.

Build the Daily Export Job

Do not let humans download files by hand from dashboards. Build a repeatable job.

| Step | Action | Output | Failure Mode to Block |
| --- | --- | --- | --- |
| 1. Select calls | Query by selection reason, cohort, severity, date, and owner | candidate list | no selection reason |
| 2. Resolve identity | Map provider aliases to one canonical call ID | identity map | duplicate or missing call ID |
| 3. Fetch transcript | Pull redacted transcript or generate redacted copy | transcript artifact | raw transcript exported to broad bucket |
| 4. Fetch audio pointer | Store controlled URI or scoped signed URL; copy raw audio only if required | audio reference | unavailable or wrong-channel recording |
| 5. Fetch traces | Attach trace ID and key span summary | trace summary | trace missing for engineering-review packet |
| 6. Fetch tool evidence | Attach tool-call summaries and side-effect proof | tool evidence JSON | transcript says success but tool proof missing |
| 7. Attach evaluation | Add rubric, score, assertion result, evaluator version | evaluation result | score without rubric/version |
| 8. Write manifest | Hash packet manifest and object references | manifest JSON | manifest changed after review |
| 9. Queue review | Assign reviewer, SLA, and outcome choices | review task | orphaned export with no owner |
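The steps in the table can be chained as plain functions that either pass the packet along or block it. The sketch below implements only three of the nine steps (select, resolve identity, queue review) to show the pattern. `ExportBlocked`, `run_export`, and the step function names are hypothetical, and the fetch steps would call your own provider clients.

```python
from typing import Callable


class ExportBlocked(Exception):
    """Raised when a step hits a failure mode that should stop the export."""


Step = Callable[[dict], dict]


def select(packet: dict) -> dict:
    if not packet.get("selectionReason"):
        raise ExportBlocked("no selection reason")
    return packet


def resolve_identity(packet: dict) -> dict:
    if not packet.get("canonicalCallId"):
        raise ExportBlocked("duplicate or missing call ID")
    return packet


def queue_review(packet: dict) -> dict:
    if not packet.get("review", {}).get("owner"):
        raise ExportBlocked("orphaned export with no owner")
    # Return a new dict so earlier snapshots of the packet stay unchanged.
    return {**packet, "review": {**packet["review"], "status": "queued"}}


def run_export(packet: dict, steps: list[Step]) -> dict:
    """Run steps in order; the first blocked step aborts the packet."""
    for step in steps:
        packet = step(packet)
    return packet


exported = run_export(
    {"canonicalCallId": "call_2026_06_03_1842", "selectionReason": "qa_sample", "review": {"owner": "qa"}},
    [select, resolve_identity, queue_review],
)
```

Keeping each failure mode as an exception raised inside its own step makes the block reason explicit in job logs.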

For workflow agents, the export job should preserve enough tool evidence for the workflow testing runbook: tool selected, arguments, result, order, retry status, and side-effect proof. The transcript alone can say "I booked that appointment" while the backend shows no appointment or two duplicates.
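A small check can compare what the agent claimed with what the backend recorded. This is a sketch that assumes each tool-call record has `tool`, `status`, and `sideEffectId` fields, where `sideEffectId` is a hypothetical identifier for the backend object the call created. Your tool logs will use different names.

```python
def check_side_effects(claimed_action: str, tool_calls: list[dict]) -> str:
    """Compare one claimed action with the backend writes recorded for it.

    Returns "confirmed", "missing", or "duplicated".
    """
    # Distinct backend objects created by successful calls to the claimed tool.
    proofs = {
        call["sideEffectId"]
        for call in tool_calls
        if call["tool"] == claimed_action
        and call["status"] == "ok"
        and call.get("sideEffectId")
    }
    if not proofs:
        return "missing"
    if len(proofs) > 1:
        return "duplicated"
    return "confirmed"


calls = [
    {"tool": "book_appointment", "status": "timeout", "sideEffectId": None},
    {"tool": "book_appointment", "status": "ok", "sideEffectId": "appt_1"},
    {"tool": "book_appointment", "status": "ok", "sideEffectId": "appt_2"},
]
```

Collecting side-effect IDs in a set means an idempotent retry that returns the same object counts once, while two distinct objects surface as a duplicate write.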

Use a Manifest as the Control Plane

The manifest is the part auditors and engineers can trust later.

Manifest rule: the manifest should answer four questions without opening a dashboard: which call was exported, why it was selected, which artifacts were attached, and whether the packet changed after review.

{
  "canonicalCallId": "call_2026_06_03_1842",
  "selectionReason": "failed_tool_call",
  "agentVersion": "billing-agent@2026-06-03.2",
  "providerAliases": {
    "twilioCallSid": "CA...",
    "livekitRoom": "billing-prod-1842",
    "traceId": "4bf92f3577b34da6a3ce929d0e0e4736"
  },
  "artifacts": {
    "redactedTranscriptUri": "s3://review-packets/2026-06-03/call_1842/transcript.json",
    "audioUri": "s3://review-packets/2026-06-03/call_1842/audio.wav",
    "toolEvidenceUri": "s3://review-packets/2026-06-03/call_1842/tools.json",
    "evaluationUri": "s3://review-packets/2026-06-03/call_1842/evaluation.json"
  },
  "redactionState": "redacted",
  "review": {
    "owner": "qa",
    "status": "queued",
    "allowedOutcomes": ["no_issue", "bug", "promote_to_regression", "needs_compliance_review"]
  }
}

Hash the manifest after the export finishes. If a transcript, audio pointer, trace ID, or tool-evidence object changes, write a new manifest version instead of mutating the old one.
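A sketch of that hashing and versioning step, using only the standard library. The canonical encoding (sorted keys, compact separators) is one reasonable choice, not a requirement of this runbook, and the `version` and `previousManifestHash` fields are hypothetical additions to the manifest shown above.

```python
import hashlib
import json


def manifest_hash(manifest: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def new_version(previous: dict, changes: dict) -> dict:
    """Build the next manifest version instead of mutating the reviewed one.

    `changes` replaces top-level keys only, so pass the full nested object
    (for example the whole "artifacts" dict) when one pointer changes.
    """
    updated = {**previous, **changes}
    updated["version"] = previous.get("version", 1) + 1
    updated["previousManifestHash"] = manifest_hash(previous)
    return updated


v1 = {
    "canonicalCallId": "call_2026_06_03_1842",
    "artifacts": {"audioUri": "s3://review-packets/2026-06-03/call_1842/audio.wav"},
}
v1_hash = manifest_hash(v1)
v2 = new_version(v1, {"artifacts": {"audioUri": "s3://review-packets/2026-06-03/call_1842/audio-v2.wav"}})
```

Sorting keys before hashing matters because two manifests with the same content but different key order would otherwise produce different hashes. Store the hash next to the manifest rather than inside it, so the hashed bytes never include their own digest.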

This is especially important for compliance review. The log retention checklist covers retention classes and legal holds; this runbook covers the packet you hand to a reviewer inside those policies.

Redact Before Broad Review

Raw audio and transcripts are not ordinary debug logs. They can contain account numbers, addresses, health details, payment information, caller names, employee names, and accidental background speech.

Use three export levels:

| Export Level | Who Gets It | Contents | Use Case |
| --- | --- | --- | --- |
| aggregate_only | Leadership, reporting | counts, cohorts, failure labels, samples without call artifacts | weekly quality report |
| redacted_packet | QA, support, product, engineering | redacted transcript, audio pointer if allowed, trace/tool summaries, evaluation results | normal call review |
| raw_restricted | approved security/compliance role | raw audio/transcript and full payloads | legal, dispute, incident, or deep RCA |

Default to redacted_packet. Use PII redaction for voice agents before broad review, and keep access to raw audio narrower than access to metrics. A reviewer usually needs to hear the relevant segment, not export every second of every call forever.

Reviewer-safe export: if a broad QA reviewer can see raw audio, unredacted transcript text, secrets, or full backend payloads by default, the export boundary is too wide.
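One way to keep that boundary narrow is to build each reviewer's view from an allow-list per export level, with `redacted_packet` as the default. The field lists below are assumptions drawn loosely from the packet schema (in particular, what `aggregate_only` may carry is a guess), `rawTranscriptUri` is a hypothetical field, and `view_for` is an illustrative helper. Storage-level permissions should still enforce access; this only shapes what the export job hands over.

```python
# Fields each export level may carry. None means no field filtering.
LEVEL_FIELDS = {
    "aggregate_only": {"selectionReason", "evaluationResult"},
    "redacted_packet": {
        "canonicalCallId",
        "selectionReason",
        "redactedTranscriptUri",
        "audioUri",
        "traceId",
        "toolEvidence",
        "evaluationResult",
    },
    "raw_restricted": None,
}


def view_for(packet: dict, level: str = "redacted_packet") -> dict:
    """Return only the packet fields the given export level is allowed to see."""
    if level not in LEVEL_FIELDS:
        raise ValueError(f"unknown export level: {level}")
    allowed = LEVEL_FIELDS[level]
    if allowed is None:
        return dict(packet)
    return {key: value for key, value in packet.items() if key in allowed}


packet = {
    "canonicalCallId": "call_2026_06_03_1842",
    "selectionReason": "qa_sample",
    "rawTranscriptUri": "s3://restricted/raw-transcript.json",  # hypothetical raw pointer
    "redactedTranscriptUri": "s3://review-packets/2026-06-03/call_1842/transcript.json",
    "evaluationResult": "task_failed",
}
```

An allow-list fails closed: a new sensitive field added to the packet later stays out of broad views until someone deliberately lists it.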

Run Quality Gates Before Sharing the Packet

Block the export when the packet cannot be trusted.

| Gate | Check | Block When |
| --- | --- | --- |
| Identity | canonical call ID maps to every provider alias | duplicate, missing, or conflicting IDs |
| Transcript | redacted transcript exists and has ordered speaker turns | missing turns or unknown redaction state |
| Audio | audio pointer exists and matches call duration/channel policy | broken link, wrong channel, or restricted audio copied broadly |
| Trace | trace ID opens the relevant call or turn | missing trace for engineering packet |
| Tool evidence | tool calls include request summary, result summary, latency, retries | transcript implies action but no backend proof |
| Evaluation | rubric, score, evaluator version, and failure label are present | score cannot be interpreted |
| Manifest | manifest hash is written after artifacts are finalized | mutable packet with no version |
| Review | owner and outcome choices are assigned | export has no reviewer or next action |

The voice agent analytics Grafana dashboard should help find cohorts and anomalies. It should not become the system of record for raw call replay or evidence review. Metrics dashboards are for detection; evidence packets are for inspection.
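Several of the gates in the table can run as pure checks on the packet itself. The sketch below covers five of them (identity, transcript, evaluation, manifest, review). The audio, trace, and tool-evidence gates need lookups against storage and tracing backends, so they are left out. The `run_gates` name and the nested `evaluation` and `review` shapes are assumptions for illustration.

```python
def run_gates(packet: dict) -> list[str]:
    """Return the names of failed gates; an empty list means the packet can be shared."""
    failed = []
    if not packet.get("canonicalCallId") or not packet.get("providerAliases"):
        failed.append("identity")
    if not packet.get("redactedTranscriptUri") or not packet.get("redactionState"):
        failed.append("transcript")
    evaluation = packet.get("evaluation", {})
    # Compare against None so a legitimate score of 0.0 still passes.
    required = ("rubric", "score", "evaluatorVersion", "failureLabel")
    if any(evaluation.get(key) is None for key in required):
        failed.append("evaluation")
    if not packet.get("manifestHash"):
        failed.append("manifest")
    review = packet.get("review", {})
    if not review.get("owner") or not review.get("allowedOutcomes"):
        failed.append("review")
    return failed


good_packet = {
    "canonicalCallId": "call_2026_06_03_1842",
    "providerAliases": {"livekitRoom": "billing-prod-1842"},
    "redactedTranscriptUri": "s3://review-packets/2026-06-03/call_1842/transcript.json",
    "redactionState": "redacted",
    "evaluation": {
        "rubric": "refund_policy_v3",
        "score": 0.42,
        "evaluatorVersion": "v1",  # placeholder version string
        "failureLabel": "task_failed",
    },
    "manifestHash": "placeholder-hash",
    "review": {"owner": "qa", "allowedOutcomes": ["no_issue", "bug"]},
}
```

Returning every failed gate, rather than stopping at the first, gives the export owner one complete list to fix before retrying.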

What Not to Export

More data does not make a better review packet.

Avoid exporting:

  • Full raw provider logs unless the reviewer needs them for a restricted incident review.
  • Secrets, API keys, auth headers, or webhook payloads with credentials.
  • Unredacted phone numbers, addresses, payment fields, or health information in broad QA packets.
  • Every trace span when a summary plus trace ID is enough.
  • Every call from a high-volume day when the selection reason is unclear.
  • Prompt chains or hidden system instructions unless the reviewer has explicit need and access.
  • Metric labels with call IDs, user IDs, phone numbers, or transcript text.

The Grafana dashboard guide covers why high-cardinality identifiers do not belong in metrics labels. Use replay pointers and trace IDs instead.
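As a rough illustration of that split, an event can be divided into low-cardinality metric labels and a separate replay pointer before anything is emitted. The key names and the `split_event` helper are hypothetical, and the forbidden set should be adapted to whatever your own events carry.

```python
# Identifiers that should never become metric labels.
HIGH_CARDINALITY_KEYS = {"call_id", "user_id", "phone_number", "transcript_text", "trace_id"}


def split_event(event: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Separate safe metric labels from the pointer a reviewer uses for replay."""
    labels = {key: value for key, value in event.items() if key not in HIGH_CARDINALITY_KEYS}
    pointer = {key: event[key] for key in ("call_id", "trace_id") if key in event}
    return labels, pointer


labels, pointer = split_event(
    {
        "agent": "billing-agent",
        "failure_label": "failed_tool_call",
        "call_id": "call_2026_06_03_1842",
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    }
)
```

The labels go to the metrics backend, while the pointer goes into the evidence packet or a log line that a reviewer can follow back to the call.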

How Hamming Fits

Hamming helps teams connect production call analysis, transcripts, audio, traces, tool evidence, and evaluation results into one review workflow. The value is not another export button. The value is deciding which calls deserve review, preserving the evidence that explains why, and turning the right failures into lasting coverage.

Use Hamming when you need to:

  • Review only the calls that need attention instead of sampling blindly.
  • Attach evaluation results to transcripts, audio, traces, and tool behavior.
  • Preserve evidence packets for QA, compliance, incident response, and regression promotion.
  • Turn production failure clusters into response coverage improvements and sandbox tests.
  • Keep reviewers focused on decision-ready packets rather than dashboard archaeology.

The operating loop is simple: detect the call, export the packet, review the evidence, fix the issue, and promote the pattern when it should never return.

Reviewer Handoff Checklist

Before a packet leaves the export job, verify:

  • Every packet has one canonical call ID and provider aliases.
  • Every packet has a selection reason and reviewer owner.
  • Transcript redaction state is explicit.
  • Audio is a controlled pointer unless raw export is approved.
  • Trace ID and key span summary are attached for engineering packets.
  • Tool-call summaries include request, result, latency, retry, and side-effect proof.
  • Evaluation result includes rubric, score, evaluator version, and failure label.
  • Manifest hash is written and immutable for the review version.
  • Review outcomes include no_issue, bug, promote_to_regression, and needs_compliance_review.
  • Retention class matches the log retention checklist.

If a customer-reported incident is involved, pair the packet with the voice agent incident response runbook. If a failure should become a test, promote it with the failed-call regression runbook.

Frequently Asked Questions

A voice agent call evidence packet is a reviewable bundle that connects one call's transcript, audio pointer, trace ID, tool-call evidence, evaluation results, redaction state, and selection reason. Hamming recommends exporting at least 10 core fields so QA or compliance reviewers can reconstruct the call without getting broad production access.

Start by selecting calls with a clear reason, then export a manifest, redacted transcript, audio pointer or file, trace ID, provider IDs, tool-call summaries, and evaluation scores. Hamming's runbook treats the manifest as the control plane because it proves which calls were exported, when, by whom, and under what redaction policy.

Each transcript should carry the canonical call ID, provider call IDs, timestamps, speaker turns, redaction state, agent version, prompt version, trace ID, tool-call references, and evaluation outcomes. Hamming recommends keeping audio and tool evidence as pointers when possible so reviewers can inspect the call without duplicating sensitive artifacts unnecessarily.

Use turn IDs, tool-call IDs, timestamps, and a shared trace or call ID to join each tool invocation and result to the transcript turn that triggered it. Hamming recommends storing tool name, arguments summary, result summary, latency, retry status, and side-effect proof so evaluators can judge workflow correctness instead of transcript text alone.

Create one canonical call ID before the simulation starts, then store provider IDs, room IDs, recording IDs, test-run IDs, and trace IDs as aliases under it. Hamming's checklist keeps those aliases in the export manifest so a reviewer can move from a test result to transcript, audio, traces, and cleanup evidence.

Run checks for missing transcript turns, unavailable audio, broken trace links, missing tool-call results, unresolved redaction state, and manifest hash mismatches. Hamming recommends blocking the export if any high-risk packet lacks a selection reason, owner, or reviewer-safe redaction marker.

Use audio pointers when reviewers can access controlled storage, and export raw audio only when the review process requires a portable package. Hamming recommends keeping raw audio, redacted transcripts, and aggregate evaluation data under separate retention and access policies because they carry different privacy and audit risks.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with engineering honors on the dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”