Multilingual Voice Agent Transcript Repository: Architecture and Schema

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

June 19, 2026 · Updated June 19, 2026 · 12 min read
A multilingual voice agent transcript repository is the system of record that centralizes native transcripts, optional translations, language confidence, audio replay pointers, QA labels, evaluation results, and analytics fields across every language and voice bot your team runs.

Most teams do not fail because they forgot to save transcripts. They fail because Spanish support calls, Hindi-English code-switching, German consent flows, and English regression tests all land in slightly different shapes. The warehouse has text, but not the language confidence. The QA dashboard has scores, but not the native transcript. The audio is in another system. Nobody can tell whether a bad score came from the agent, the speech-to-text model, or a translation layer.

If you run one English-only agent and review 10 calls a week, this architecture is probably too much. Use your provider dashboard and keep moving. This guide is for teams operating multiple agents, regions, languages, queues, or compliance programs where multilingual transcript aggregation becomes the front door for QA and analytics.

TL;DR: Build the repository around six linked records: call, turn, language, artifact, label, and evaluation.

Store native transcript text and translated review text separately. Keep language code, language confidence, audio offset, redaction state, region, agent version, trace ID, QA label, and evaluation result as typed fields. Do not hide multilingual evidence in one transcript blob.

Repository rule: a reviewer should be able to answer four questions from a single search result: what happened in the original language, how confident was the language pipeline, where is the matching audio, and what action should we take next?

Methodology Note: This repository schema is based on Hamming's analysis of 4M+ production voice agent calls, QA review workflows, and multilingual testing patterns across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions.

Treat this as an engineering architecture guide, not legal advice. Data residency, retention, consent, and deletion policies still need owner-approved controls for each market.

Last Updated: June 2026

What the Repository Must Prove

A multilingual repository is not just storage. It has to prove that five systems agree on the same call:

| System | What It Owns | What the Repository Must Keep |
| --- | --- | --- |
| Telephony or CCaaS | call identity, recording, route, region | provider aliases, canonical call ID, audio pointer |
| Speech-to-text | native transcript, language code, confidence, speaker turns | turn text, language confidence, model version, timing |
| Translation or localization | translated review text and glossary decisions | translation text, translation provider, review status |
| QA and evaluation | labels, scores, pass/fail, reviewer decisions | evaluation result, rubric version, failure reason |
| Analytics warehouse | cohorts, trends, dashboards, exports | safe aggregate fields and access policy |

The mistake is letting one layer pretend to be the system of record. A transcript service can identify language. It cannot decide retention. A warehouse can aggregate calls. It cannot prove which audio segment matched the failed turn unless the repository carries the pointer.

We found that multilingual review breaks at the handoff. A team can find the call, but not the native turn, translated text, audio offset, STT confidence, and evaluation that explain the issue.

I used to think this was mostly a warehouse design problem. It is not. The hard part is deciding which record is allowed to answer the question when the native transcript, translated review text, audio, and score disagree.

The Repository Architecture

Use a two-zone architecture: evidence storage for controlled artifacts, and query storage for searchable fields.

| Layer | Stores | Default Access | Do Not Store Here |
| --- | --- | --- | --- |
| Capture layer | provider call ID, room ID, audio URI, timestamps, consent state | platform and compliance owners | long-term QA decisions |
| Transcript layer | native turns, speaker, language code, language confidence, model version | QA and engineering, after redaction | unrestricted raw audio |
| Translation layer | translated text, glossary version, reviewer language, translation confidence | reviewers who need cross-language triage | source-of-truth decisions without native text |
| Artifact layer | audio pointers, trace links, redaction reports, transcript JSON | role-scoped users | broad searchable PII |
| Analytics layer | aggregate metrics, labels, cohorts, trends | product, QA, operations | raw transcripts or unrestricted recordings |
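The two-zone split can be sketched as a small function that keeps the full record in evidence storage and copies only selected fields into query storage. This is a minimal sketch, not a reference implementation: `QUERY_FIELDS`, `split_zones`, and the `"redacted"` state string are assumptions chosen for illustration, and your own field list will depend on policy.

```python
from typing import Any, Dict, Tuple

# Assumption: metadata fields considered safe for broad search.
QUERY_FIELDS = {
    "turnId", "canonicalCallId", "speaker", "turnStartMs", "turnEndMs",
    "languageCode", "languageConfidence", "redactionState", "region",
    "agentVersion", "traceId", "audioArtifactId",
}

def split_zones(turn: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (evidence_doc, query_doc) for one turn record."""
    # The evidence zone keeps the full record behind stricter access controls.
    evidence = dict(turn)
    # The query zone receives only the allow-listed metadata fields.
    query = {k: v for k, v in turn.items() if k in QUERY_FIELDS}
    # Transcript text enters the query zone only after redaction has run.
    if turn.get("redactionState") == "redacted":
        query["nativeText"] = turn.get("nativeText")
        query["translatedText"] = turn.get("translatedText")
    return evidence, query
```

The allow-list is deliberate: a new field added upstream stays out of broad search until someone decides it belongs there.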

Google Cloud Speech-to-Text can transcribe from a configured set of possible languages and label results with a predicted language code. Amazon Transcribe supports streaming language identification and multi-language identification, but its docs also call out constraints around language options, dialects, custom language models, and redaction. Those constraints are the reason the repository should preserve the language decision, not just the final text.

Required Fields for Every Multilingual Turn

Start with the turn record because QA reviewers work at the turn level.

| Field | Type | Why It Matters |
| --- | --- | --- |
| canonicalCallId | string | Joins every language, audio, trace, label, and evaluation record |
| turnId | string | Makes one utterance addressable |
| speaker | enum | Separates caller, voice agent, IVR, human agent, and system output |
| turnStartMs / turnEndMs | integer | Lets reviewers replay the matching audio segment |
| nativeText | text | Source transcript in the spoken language |
| translatedText | text or null | Review aid for global QA and leadership |
| languageCode | BCP 47 string | Enables language cohorts and model routing analysis |
| languageConfidence | number | Flags low-confidence language detection before scoring |
| sttProvider / sttModel | string | Explains behavior changes after provider or model updates |
| translationProvider / glossaryVersion | string or null | Makes translated review text auditable |
| redactionState | enum | Prevents broad search over raw sensitive content |
| region / residencyClass | string | Keeps storage and access aligned with market rules |
| agentVersion | string | Connects failures to prompt, workflow, and model changes |
| traceId / spanId | string | Links the turn to logs, tool calls, and latency spans |
| audioArtifactId | string | Points to controlled replay |
| qaLabel / evaluationResult | object | Connects text to review and regression decisions |

This is a repository schema, not a product feature list. You can implement it in relational tables, search documents, a lakehouse, or a hybrid index. The invariant is simpler: every multilingual search result needs text, language, confidence, replay, redaction, and action.

{
  "turnId": "turn_0017",
  "canonicalCallId": "call_2026_06_19_0942",
  "speaker": "caller",
  "turnStartMs": 84210,
  "turnEndMs": 91780,
  "nativeText": "Ya verifiqué mi cuenta, ¿por qué me preguntas otra vez?",
  "translatedText": "I already verified my account. Why are you asking again?",
  "languageCode": "es-US",
  "languageConfidence": 0.82,
  "sttProvider": "provider_name",
  "sttModel": "multilingual-prod-2026-06",
  "translationProvider": "translation_provider",
  "glossaryVersion": "support_terms_v4",
  "redactionState": "redacted",
  "region": "us",
  "residencyClass": "customer_support_us",
  "agentVersion": "billing-agent@2026-06-19.2",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "span_asr_turn_0017",
  "audioArtifactId": "audio_0942_redacted",
  "qaLabel": {
    "type": "identity_verification_confusion",
    "source": "reviewer"
  },
  "evaluationResult": {
    "rubricId": "billing_identity_v5",
    "passed": false,
    "failureReason": "repeated_identity_check"
  }
}

The sample record uses plain text for readability. In production, run redaction before broad indexing and keep raw native text behind stricter access controls when policy requires it.
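Typed fields only help if something checks them at ingestion. Here is a minimal sketch of a typed turn with a validator, covering a subset of the fields above. The allowed values in `SPEAKERS` and `REDACTION_STATES` are assumptions (the schema names the roles but not the exact strings), and `validate_turn` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Turn:
    turnId: str
    canonicalCallId: str
    speaker: str
    turnStartMs: int
    turnEndMs: int
    nativeText: str
    languageCode: str
    languageConfidence: float
    redactionState: str
    audioArtifactId: str
    translatedText: Optional[str] = None  # review aid, may be absent

# Assumed enum values for illustration only.
SPEAKERS = {"caller", "voice_agent", "ivr", "human_agent", "system"}
REDACTION_STATES = {"raw", "redacted"}

def validate_turn(turn: Turn) -> List[str]:
    """Return a list of problems; an empty list means the turn is well-formed."""
    problems: List[str] = []
    if turn.speaker not in SPEAKERS:
        problems.append(f"unknown speaker: {turn.speaker}")
    if turn.turnEndMs <= turn.turnStartMs:
        problems.append("turnEndMs must be after turnStartMs")
    if not 0.0 <= turn.languageConfidence <= 1.0:
        problems.append("languageConfidence must be between 0 and 1")
    if turn.redactionState not in REDACTION_STATES:
        problems.append(f"unknown redactionState: {turn.redactionState}")
    if not turn.nativeText:
        problems.append("nativeText is required")
    return problems
```

Returning a list of problems instead of raising on the first one lets an ingestion job log every defect in a turn at once.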

Native Text, Translations, and Language Confidence

Do not make English translation the source of truth for every market.

| Data Choice | Use It For | Failure If Misused |
| --- | --- | --- |
| Native transcript | QA, dispute review, language-specific evaluation, model debugging | Reviewers miss language-specific errors if they only inspect translations |
| Translated transcript | global triage, leadership review, cross-region issue clustering | Translation can hide tone, entity mistakes, and policy wording |
| Language confidence | routing decisions, review flags, score eligibility | Low-confidence calls get scored as if the transcript were reliable |
| Native audio pointer | ASR disputes, accent review, interruption analysis, consent checks | Text-only review cannot explain noise, barge-in, or pronunciation failures |
| Aggregate language metrics | dashboards and trend analysis | Averages hide one failing language or region |

Azure AI Video Indexer describes language identification, multi-language identification, translation, diarization, and JSON insight output with language fields. AssemblyAI's language detection docs call out expected-language lists, fallback language, confidence scores, confidence thresholds, and misdetection troubleshooting. The product details differ, but the repository rule is the same: keep the language decision observable.

Language confidence rule: if language confidence is low, the repository should flag the turn before it feeds automated scoring, dashboards, or regression promotion.

That flag matters in code-switching. A caller can start in English, switch to Spanish for an account detail, then return to English. If the transcript pipeline collapses that into one English record, QA may blame the voice agent for a failure that started in language detection.
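A rough sketch of that gate, assuming per-turn language codes and confidences are already stored. The `0.85` threshold is purely illustrative, not a recommended value, and `review_flags` and the reason strings are hypothetical names.

```python
from typing import Any, Dict, List

# Assumption: illustrative threshold; tune it per provider and language.
MIN_LANGUAGE_CONFIDENCE = 0.85

def review_flags(turns: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map turnId to the reasons a turn should be held back from auto-scoring."""
    flags: Dict[str, List[str]] = {}
    last_language_by_speaker: Dict[str, str] = {}
    for turn in sorted(turns, key=lambda t: t["turnStartMs"]):
        reasons: List[str] = []
        if turn["languageConfidence"] < MIN_LANGUAGE_CONFIDENCE:
            reasons.append("low_language_confidence")
        # Compare primary subtags so en-US followed by en-GB is not a switch.
        language = turn["languageCode"].split("-")[0].lower()
        previous = last_language_by_speaker.get(turn["speaker"])
        if previous is not None and language != previous:
            reasons.append("code_switch")
        last_language_by_speaker[turn["speaker"]] = language
        if reasons:
            flags[turn["turnId"]] = reasons
    return flags
```

Tracking the last language per speaker keeps an English-speaking agent from being counted as a switch every time it answers a Spanish-speaking caller.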

Query Cookbook for QA and Analytics

Design the repository around review questions, not storage tables.

| Review Question | Required Filters | Result Should Show |
| --- | --- | --- |
| Which Spanish calls failed identity verification after the latest prompt change? | language code, agent version, QA label, date range | native turns, translated text, audio offsets, evaluation result |
| Which calls switched languages mid-conversation? | per-turn language code, call ID, sequence | turn sequence, confidence, replay offsets |
| Which low-confidence transcripts were still auto-scored? | language confidence, evaluation status | score, rubric, owner, blocked-report flag |
| Which translated summaries disagree with native reviewer labels? | translation status, reviewer label, language | native text, translated text, reviewer decision |
| Which regions have restricted raw audio access? | region, residency class, artifact type | audio policy, redaction state, access role |
| Which production failures should become multilingual regression tests? | QA label, evaluation failure, review status | source turn, expected behavior, test persona, fixture |
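To make the first question concrete, here is a small sketch using Python's built-in `sqlite3` with an in-memory table. The `turns` table layout, column names, and sample rows are assumptions for illustration; a real repository would likely split these across call, turn, and evaluation tables and add a date-range filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE turns (
        turn_id TEXT PRIMARY KEY, canonical_call_id TEXT, language_code TEXT,
        agent_version TEXT, qa_label TEXT, passed INTEGER,
        native_text TEXT, translated_text TEXT, turn_start_ms INTEGER)"""
)
rows = [
    ("turn_0017", "call_0942", "es-US", "billing-agent@2026-06-19.2",
     "identity_verification_confusion", 0,
     "Ya verifiqué mi cuenta, ¿por qué me preguntas otra vez?",
     "I already verified my account. Why are you asking again?", 84210),
    ("turn_0003", "call_0950", "en-US", "billing-agent@2026-06-19.2",
     "identity_verification_confusion", 0, "I already did that.", None, 12000),
    ("turn_0009", "call_0961", "es-MX", "billing-agent@2026-06-19.2",
     "identity_verification_confusion", 1, "Gracias.", "Thank you.", 30500),
]
conn.executemany("INSERT INTO turns VALUES (?,?,?,?,?,?,?,?,?)", rows)

def failed_spanish_identity_turns(conn: sqlite3.Connection, agent_version: str):
    """Spanish turns that failed identity verification for one agent version."""
    query = """
        SELECT canonical_call_id, turn_id, native_text, translated_text, turn_start_ms
        FROM turns
        WHERE (language_code = 'es' OR language_code LIKE 'es-%')
          AND agent_version = ?
          AND qa_label = 'identity_verification_confusion'
          AND passed = 0
        ORDER BY canonical_call_id, turn_start_ms
    """
    return conn.execute(query, (agent_version,)).fetchall()
```

The result carries native text, the translated aid, and the audio offset together, so a reviewer does not need a second lookup to replay the turn.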

This is where the repository connects to failed production call regression tests. A multilingual failure should not become a test case until the team can preserve the native turn, translated aid, language confidence, prompt version, expected behavior, and audio pointer.
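That precondition can be checked mechanically before a failure is promoted. A minimal sketch, assuming the candidate is a dictionary shaped like the sample turn record; `expectedBehavior` is a hypothetical field, and `agentVersion` stands in here for the prompt version.

```python
from typing import Any, Dict, List

# Fields that must be present before a failure becomes a test case.
# Key names are assumptions layered on the sample turn record.
REQUIRED_FOR_PROMOTION = [
    "nativeText", "translatedText", "languageConfidence",
    "agentVersion", "expectedBehavior", "audioArtifactId",
]

def missing_for_promotion(candidate: Dict[str, Any]) -> List[str]:
    """Return the required fields that are absent, None, or empty strings."""
    missing: List[str] = []
    for field_name in REQUIRED_FOR_PROMOTION:
        value = candidate.get(field_name)
        if value is None or value == "":
            missing.append(field_name)
    return missing
```

An empty list means the candidate is eligible for promotion. Anything else tells the reviewer exactly which evidence still needs to be attached.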

For search-index details, use the voice agent transcript search schema. This page answers the multilingual repository question. The search schema answers how to index, search, and highlight turns once the records exist.

Access, Retention, and Residency Gates

Multilingual repositories cross markets, languages, and customer data classes. One global access policy will not survive contact with enterprise review.

Use separate controls for each evidence class:

| Evidence Class | Default Policy | Common Exception |
| --- | --- | --- |
| Raw audio | controlled pointer, narrow playback role | dispute, consent review, ASR investigation |
| Native transcript | redacted search by default | native-language QA or compliance review |
| Translated transcript | reviewer aid, not source of truth | executive summary or global triage |
| Language metadata | broadly queryable | remove user identifiers before export |
| QA labels and evaluations | QA and product analytics | customer-specific contract restrictions |
| Aggregate analytics | broadest access after de-identification | low-volume cohorts that could re-identify callers |
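Per-class controls can be sketched as a default-deny lookup with explicit, narrowly scoped exceptions. Every role name and grant below is hypothetical. Real policies need owner approval for each market, and this sketch does not model region or residency.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical default grants per evidence class.
DEFAULT_ROLES: Dict[str, Set[str]] = {
    "raw_audio": {"compliance"},
    "native_transcript": {"compliance", "native_qa"},
    "translated_transcript": {"compliance", "native_qa", "global_qa"},
    "language_metadata": {"compliance", "native_qa", "global_qa", "analytics"},
    "qa_labels": {"native_qa", "global_qa", "analytics"},
    "aggregate_analytics": {"native_qa", "global_qa", "analytics", "product"},
}

Grant = Tuple[str, str]  # (role, evidence_class)

def can_access(
    role: str,
    evidence_class: str,
    exception_grants: FrozenSet[Grant] = frozenset(),
) -> bool:
    """Default-deny check: unknown evidence classes grant nothing."""
    if role in DEFAULT_ROLES.get(evidence_class, set()):
        return True
    # Exceptions are explicit (role, class) pairs, e.g. an approved ASR dispute.
    return (role, evidence_class) in exception_grants
```

Keeping exceptions as explicit pairs means a dispute review can unlock raw audio for one role without widening the default policy.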

Pair this with the voice agent log retention checklist before launch. The repository should carry region, retention class, redaction state, deletion status, and legal-hold state so the analytics layer does not become an accidental archive.

The honest limitation: zero-retention and full longitudinal analytics fight each other. If policy says a vendor cannot store raw transcripts or recordings, the architecture needs customer-owned storage, push-based ingestion, scoped pointers, and aggregate-only analytics. Do not pretend those tradeoffs disappear.

Implementation Checklist

Build the repository in this order:

| Step | Action | Evidence to Keep |
| --- | --- | --- |
| 1. Normalize identity | Create one canonical call ID and store provider aliases | provider IDs, room IDs, trace IDs |
| 2. Capture language fields | Store language code, confidence, provider, model, and fallback behavior per turn | language decision record |
| 3. Split native and translated text | Keep native text as source; store translation as review aid | translation provider and glossary version |
| 4. Attach audio pointers | Store replay offsets and controlled artifact IDs | audio URI, redaction state, access role |
| 5. Join labels and evaluations | Link QA labels, scores, reviewer decisions, and rubric versions | evaluation record |
| 6. Apply residency and retention | Add region, retention class, deletion status, and legal-hold state | policy metadata |
| 7. Gate analytics | Block low-confidence or unredacted records from broad dashboards | blocked-report reason |
| 8. Promote regressions | Turn selected failures into multilingual test cases | source turn, persona, expected behavior |
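Step 1 is the one every later join depends on. Here is a minimal in-memory sketch of alias resolution. `CallIdentityIndex` and its methods are hypothetical, and the `call_` prefix and UUID-based IDs are assumptions. A production version would persist the mapping.

```python
import uuid
from typing import Dict, Tuple

class CallIdentityIndex:
    """Maps (provider, provider_call_id) aliases to one canonical call ID."""

    def __init__(self) -> None:
        self._alias_to_canonical: Dict[Tuple[str, str], str] = {}

    def resolve(self, provider: str, provider_call_id: str) -> str:
        """Return the canonical ID for an alias, minting one on first sight."""
        key = (provider, provider_call_id)
        if key not in self._alias_to_canonical:
            self._alias_to_canonical[key] = f"call_{uuid.uuid4().hex}"
        return self._alias_to_canonical[key]

    def link(self, canonical_call_id: str, provider: str, provider_call_id: str) -> None:
        """Attach another provider alias to an existing canonical ID."""
        key = (provider, provider_call_id)
        existing = self._alias_to_canonical.setdefault(key, canonical_call_id)
        if existing != canonical_call_id:
            # Refuse to silently merge two calls under one alias.
            raise ValueError(f"{key} already maps to {existing}")
```

For example, resolve the telephony call ID first, then `link` the STT session ID and trace ID to the same canonical ID, so transcripts, spans, and audio all join on one key.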

Start smaller than feels satisfying: one agent, two languages, one region, and one QA workflow. If the joins work there, expand.

If the first cohort cannot answer "what did the caller say, in which language, with what confidence, and what should we do next," adding more markets will not create clarity. It will just make the repository harder to trust.

What Not to Centralize

More centralization is not always better.

Avoid putting these into broad query storage:

  • Raw audio files.
  • Unredacted transcript text.
  • Full tool payloads with customer data.
  • Secrets, auth headers, or webhook bodies.
  • Translations with no native-text pointer.
  • Aggregate metrics for cohorts so small they identify a caller.
  • Low-confidence language-detection output with no review flag.
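The small-cohort item is easy to enforce at aggregation time. A minimal sketch, assuming per-turn records with `languageCode` and `passed` fields. The minimum cohort size of 10 is an illustrative assumption, not a compliance threshold.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Optional

# Assumption: illustrative minimum; set the real value with your privacy owner.
MIN_COHORT_SIZE = 10

def failure_rate_by_language(
    turns: Iterable[Dict[str, Any]],
    min_cohort_size: int = MIN_COHORT_SIZE,
) -> Dict[str, Optional[float]]:
    """Failure rate per language code; None where the cohort is too small."""
    totals: Counter = Counter()
    failures: Counter = Counter()
    for turn in turns:
        totals[turn["languageCode"]] += 1
        if not turn["passed"]:
            failures[turn["languageCode"]] += 1
    return {
        language: (failures[language] / total if total >= min_cohort_size else None)
        for language, total in totals.items()
    }
```

Returning `None` instead of dropping the language keeps the suppression visible: a dashboard can show "insufficient volume" without implying the language has no traffic.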

Use the call evidence export runbook when reviewers need portable packets. Use the PII redaction guide before transcript text becomes broadly searchable. Use multi-tenant dashboard requirements when the same repository feeds client-facing reports.

The repository is supposed to reduce ambiguity. If it makes raw multilingual call data easier to over-share, the architecture is moving in the wrong direction.

Frequently Asked Questions

Companies aggregate multi-language voice agent transcripts by normalizing every call into linked call, turn, language, artifact, label, and evaluation records. According to Hamming's repository schema, each turn should keep native text, optional translated text, language code, language confidence, audio pointer, redaction state, QA labels, and evaluation results under one canonical call ID.

A multilingual transcript repository should include at least 16 fields: canonical call ID, provider aliases, agent version, environment, language code, language confidence, speaker, timestamps, native text, translated text, audio pointer, trace ID, redaction state, QA label, evaluation score, and retention class. Hamming recommends storing these as typed fields instead of hiding them in one transcript blob.

Store both native and translated transcripts when your QA or analytics workflow needs global review, but treat them as different evidence classes. Hamming's checklist keeps native text as the source of truth, translated text as a reviewer aid, and aggregate metrics as a separate analytics layer with stricter quality gates when language confidence is low.

Language confidence should decide whether a turn is safe for automated scoring, needs native-speaker review, or should be retried with a constrained expected-language list. Hamming recommends flagging low-confidence or code-switched turns before they affect regression scores, dashboards, or customer-facing reports.

Raw audio usually belongs in controlled object storage or the approved recording system of record, not directly inside a broad analytics warehouse. Hamming recommends storing audio pointers, replay offsets, redaction state, and access policy in the repository so reviewers can replay approved segments without granting blanket raw-audio access.

Handle data residency and retention by assigning separate policies to raw audio, native transcript, translated transcript, metadata, QA labels, and aggregate analytics. Hamming's repository schema includes region, retention class, redaction state, and legal-hold state because one global retention window rarely works for multilingual production calls.

A transcript repository connects to regression testing by preserving the failed turn, language context, agent version, trace ID, evaluation result, and reviewer decision under one call identity. Hamming recommends promoting selected multilingual failures into tests only after the repository can prove the native transcript, translated review aid, audio pointer, and expected behavior are aligned.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with engineering honors on the dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”