How to Test Multilingual Voice Agents: The Complete Framework

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 1M+ voice agent calls to find where they break.

November 15, 2025 · Updated December 23, 2025 · 17 min read

A customer deployed their voice agent to Latin America after testing only in English. Their test suite showed 95%+ accuracy. First week in production, they saw 40% task completion rates in Mexico and 35% in Argentina.

The problem wasn't the model. It was their test cases. They'd translated English scenarios word-for-word instead of creating culturally equivalent scenarios. "I'd like to schedule an appointment" became "Me gustaría programar una cita"—grammatically correct but awkward. Real users said "Quiero sacar turno" in Argentina or "Necesito agendar" in Mexico. The ASR worked fine. The intent classifier had never seen those phrasings.

TL;DR: Multilingual voice agent testing validates ASR accuracy, intent recognition, and latency across languages using equivalent scenarios, code-switching tests, and benchmarked WER targets. Use Hamming's 5-Step Multilingual Testing Framework to baseline performance, then monitor drift by language over time. Target WER under 10% for English, with adjusted thresholds per language (e.g., under 22% for Hindi, under 15% for German).

Quick filter: If you are not testing code-switching, your multilingual agent is not production-ready.

How to Test Multilingual Voice Agents?

Voice agents serving global markets face a fundamental challenge: ASR accuracy, intent recognition, and conversational flow vary significantly across languages. A voice agent that performs flawlessly in English may struggle with German compound words, Japanese honorifics, or Hindi code-switching.

This guide provides a structured framework for testing multilingual voice agents, including specific metrics, benchmarks, and methodologies for ensuring consistent performance across languages.

What Is Multilingual Voice Agent Testing?

Multilingual voice agent testing is the process of verifying that a voice agent performs consistently across languages, accents, and real-world conditions by measuring ASR accuracy, intent recognition, latency, and task completion using language-native test cases.

Methodology Note: WER benchmarks in this guide are derived from Hamming's testing of 500K+ multilingual voice interactions across 49 languages (2025). Thresholds align with published ASR research including Google's Multilingual ASR studies and industry standards from contact center deployments. Use these as directional baselines, not guarantees—actual performance varies by ASR provider, audio quality, and domain vocabulary.

Why Is Multilingual Voice Agent Testing Different?

Testing a voice agent in multiple languages isn't simply translating test cases. Each language introduces unique challenges, and translation-only tests are where teams get surprised.

| Challenge | Description | Languages Most Affected |
|---|---|---|
| Phonetic complexity | Some languages have sounds that ASR models struggle to distinguish | Mandarin (tones), Arabic (pharyngeal consonants) |
| Word boundaries | Languages without clear word boundaries complicate transcription | Japanese, Chinese, Thai |
| Compound words | Long compound words can exceed ASR model training patterns | German, Dutch, Finnish |
| Code-switching | Users mixing languages mid-sentence breaks most models | Hindi-English, Spanish-English |
| Honorific systems | Formal/informal distinctions affect intent recognition | Japanese, Korean, Indonesian |
| Regional variants | Same language spoken differently across regions | Spanish (Castilian vs. Mexican), Portuguese (Brazilian vs. European) |

Sources: Multilingual ASR challenges documented in Google's Multilingual ASR research (Pratap et al., 2020), OpenAI Whisper paper (Radford et al., 2023), and Mozilla Common Voice dataset analysis (2024).

Hamming's 5-Step Multilingual Testing Framework

How Do You Establish Baseline WER Per Language?

Before testing real scenarios, establish baseline Word Error Rate (WER) for each language using clean, controlled audio. For deeper ASR benchmarking, see ASR Accuracy Evaluation for Voice Agents.

How to Calculate Word Error Rate:

WER = (S + D + I) / N × 100

Where:
S = Substitutions (wrong words)
D = Deletions (missing words)
I = Insertions (extra words)
N = Total words in reference

Worked Example:

Reference: "I want to book a flight to Berlin" Transcription: "I want to look at flight Berlin"

  • Substitutions: 2 (book→look, a→at)
  • Deletions: 1 (to)
  • Insertions: 0
  • Total words: 8

WER = (2 + 1 + 0) / 8 × 100 = 37.5%
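
The same calculation can be scripted so every language is scored identically. Below is a minimal sketch using a word-level edit-distance alignment; the function name and normalization are illustrative, not tied to any particular library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, computed via word-level edit distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],   # substitution
                                   dp[i - 1][j],       # deletion
                                   dp[i][j - 1])       # insertion
    return dp[len(ref)][len(hyp)] / len(ref) * 100

# Worked example from above: prints 37.5
print(word_error_rate("I want to book a flight to Berlin",
                      "I want to look at flight Berlin"))
```

For languages without whitespace word boundaries (Japanese, Chinese, Thai), teams often report character error rate (CER) instead, or run a language-specific tokenizer before scoring; whichever convention you pick, apply it consistently across baselines and regression runs.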

What Are Multilingual ASR Accuracy Benchmarks?

Use these benchmarks to evaluate whether your voice agent meets acceptable thresholds. Values indicate maximum WER for each quality tier:

| Language | Excellent | Good | Acceptable | Critical |
|---|---|---|---|---|
| Arabic | 15% | 20% | 25% | 30%+ |
| Bengali | 12% | 18% | 22% | 28%+ |
| Bulgarian | 10% | 15% | 18% | 23%+ |
| Cantonese | 15% | 20% | 25% | 30%+ |
| Catalan | 10% | 15% | 18% | 23%+ |
| Chinese (Mandarin) | 12% | 18% | 22% | 28%+ |
| Croatian | 10% | 15% | 18% | 23%+ |
| Czech | 10% | 15% | 18% | 23%+ |
| Danish | 8% | 12% | 15% | 20%+ |
| Dutch | 8% | 12% | 15% | 20%+ |
| English | 5% | 8% | 10% | 15%+ |
| Estonian | 12% | 18% | 22% | 28%+ |
| Finnish | 12% | 18% | 22% | 28%+ |
| French | 7% | 10% | 13% | 18%+ |
| Galician | 10% | 15% | 18% | 23%+ |
| German | 8% | 12% | 15% | 20%+ |
| Greek | 10% | 15% | 18% | 23%+ |
| Gujarati | 15% | 20% | 25% | 30%+ |
| Hebrew | 12% | 18% | 22% | 28%+ |
| Hindi | 12% | 18% | 22% | 28%+ |
| Hungarian | 12% | 18% | 22% | 28%+ |
| Icelandic | 15% | 20% | 25% | 30%+ |
| Indonesian | 10% | 15% | 18% | 23%+ |
| Italian | 7% | 10% | 13% | 18%+ |
| Japanese | 10% | 15% | 20% | 25%+ |
| Kannada | 15% | 20% | 25% | 30%+ |
| Korean | 10% | 15% | 20% | 25%+ |
| Latvian | 12% | 18% | 22% | 28%+ |
| Lithuanian | 12% | 18% | 22% | 28%+ |
| Malay | 10% | 15% | 18% | 23%+ |
| Malayalam | 15% | 20% | 25% | 30%+ |
| Marathi | 15% | 20% | 25% | 30%+ |
| Norwegian | 8% | 12% | 15% | 20%+ |
| Odia | 18% | 23% | 28% | 33%+ |
| Polish | 10% | 15% | 18% | 23%+ |
| Portuguese | 8% | 12% | 15% | 20%+ |
| Punjabi | 15% | 20% | 25% | 30%+ |
| Romanian | 10% | 15% | 18% | 23%+ |
| Russian | 10% | 15% | 18% | 23%+ |
| Slovak | 10% | 15% | 18% | 23%+ |
| Slovenian | 10% | 15% | 18% | 23%+ |
| Spanish | 8% | 12% | 15% | 20%+ |
| Swedish | 8% | 12% | 15% | 20%+ |
| Tamil | 15% | 20% | 25% | 30%+ |
| Telugu | 15% | 20% | 25% | 30%+ |
| Thai | 15% | 20% | 25% | 30%+ |
| Turkish | 10% | 15% | 18% | 23%+ |
| Ukrainian | 10% | 15% | 18% | 23%+ |
| Vietnamese | 15% | 20% | 25% | 30%+ |

Note: These benchmarks assume clean audio conditions. Real-world performance with background noise typically adds 5-10 percentage points to WER. If your channel is noisy or mobile-heavy, plan for worse.

Sources: WER benchmarks compiled from OpenAI Whisper multilingual evaluation, Google Speech-to-Text language benchmarks, Deepgram Nova-3 language support, Mozilla Common Voice dataset evaluations, and Hamming's multilingual testing across 500K+ voice interactions (2025). Threshold tiers calibrated based on ICSI Meeting Corpus and contact center deployment standards.
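
To make the table usable in an automated gate, the thresholds can be encoded as data and each measured WER mapped to a tier. Here is a minimal sketch, with only a few languages shown; extend the dictionary from the full table above.

```python
# Maximum WER (%) per tier, copied from the benchmark table above.
WER_TIERS = {
    "english":  {"excellent": 5,  "good": 8,  "acceptable": 10, "critical": 15},
    "spanish":  {"excellent": 8,  "good": 12, "acceptable": 15, "critical": 20},
    "german":   {"excellent": 8,  "good": 12, "acceptable": 15, "critical": 20},
    "hindi":    {"excellent": 12, "good": 18, "acceptable": 22, "critical": 28},
    "japanese": {"excellent": 10, "good": 15, "acceptable": 20, "critical": 25},
}

def wer_tier(language: str, wer_pct: float) -> str:
    """Map a measured WER (in percent) to a quality tier for one language."""
    tiers = WER_TIERS[language.lower()]
    for tier in ("excellent", "good", "acceptable"):
        if wer_pct <= tiers[tier]:
            return tier
    return "critical" if wer_pct >= tiers["critical"] else "below acceptable"

print(wer_tier("hindi", 16.4))    # -> "good"
print(wer_tier("english", 16.0))  # -> "critical"
```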

How Do You Test Intent Recognition Accuracy Across Languages?

ASR accuracy alone doesn't guarantee understanding. A voice agent might transcribe "réserver un vol" correctly but fail to map it to the BOOK_FLIGHT intent.

Intent Recognition Testing Methodology:

  1. Create equivalent test cases in each language (not literal translations)
  2. Test intent accuracy independently from ASR
  3. Measure by intent category, not just overall accuracy

| Intent Category | English Test Case | Spanish Equivalent | Expected Intent |
|---|---|---|---|
| Booking | "I'd like to book a flight" | "Quiero reservar un vuelo" | BOOK_FLIGHT |
| Cancellation | "Cancel my reservation" | "Cancela mi reserva" | CANCEL_BOOKING |
| Information | "What time does it leave?" | "¿A qué hora sale?" | GET_DEPARTURE_TIME |
| Complaint | "This is unacceptable" | "Esto es inaceptable" | ESCALATE_COMPLAINT |

Intent Accuracy Formula:

Intent Accuracy = (Correct Intent Classifications / Total Test Cases) × 100

Target: 95%+ intent accuracy per language. If any language falls below 90%, investigate whether the issue is ASR (transcription) or NLU (understanding).
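
A per-language, per-category breakdown is straightforward to script. The sketch below assumes a `classify_intent(text, language)` callable as a stand-in for your agent's NLU endpoint and feeds it reference transcripts rather than audio, so intent accuracy is measured independently of ASR.

```python
from collections import defaultdict

# (language, intent_category, utterance, expected_intent) - equivalent
# scenarios per language, not literal translations.
TEST_CASES = [
    ("en", "booking",      "I'd like to book a flight", "BOOK_FLIGHT"),
    ("es", "booking",      "Quiero reservar un vuelo",  "BOOK_FLIGHT"),
    ("es", "cancellation", "Cancela mi reserva",        "CANCEL_BOOKING"),
    ("es", "information",  "¿A qué hora sale?",         "GET_DEPARTURE_TIME"),
    # ... extend per language and intent category
]

def intent_accuracy(classify_intent):
    """Return accuracy (%) keyed by (language, intent_category)."""
    totals, correct = defaultdict(int), defaultdict(int)
    for lang, category, text, expected in TEST_CASES:
        key = (lang, category)
        totals[key] += 1
        if classify_intent(text, language=lang) == expected:
            correct[key] += 1
    return {k: 100.0 * correct[k] / totals[k] for k in totals}

# Gate: flag any (language, category) pair under the 95% target.
# failures = {k: v for k, v in intent_accuracy(my_nlu).items() if v < 95.0}
```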

How Do You Validate Code-Switching Handling?

Code-switching—mixing languages within a single utterance—is common among multilingual users but breaks most voice agents.

Common Code-Switching Patterns to Test:

| Pattern | Example | Languages |
|---|---|---|
| Noun substitution | "Quiero pagar my bill" | Spanish-English |
| Technical terms | "मुझे flight book करनी है" (I need to book a flight) | Hindi-English |
| Filler words | "So, euh, je voudrais réserver" | French-English |
| Loanwords | "予約をキャンセルしたいです" with English brand names | Japanese-English |

Sources: Code-switching patterns based on Multilingual Speech Processing research (Sitaram et al., 2020) and analysis of real-world voice interactions in bilingual markets (US Hispanic, India, Singapore). Patterns validated through Hamming customer testing data (2025).

Code-Switching Test Protocol:

  1. Create 10-20 code-switched utterances per language pair
  2. Test both directions (Language A with Language B words, and vice versa)
  3. Measure: ASR accuracy, intent recognition, and task completion
  4. Pass threshold: 80%+ task completion despite code-switching (a scoring sketch follows this list)
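
The scoring step of this protocol can be a small harness. The sketch below assumes a `run_conversation(utterance)` callable that drives your agent end to end and reports whether the task completed; both the function and the sample utterances are illustrative.

```python
# Code-switched utterances per language pair; test both directions.
CODE_SWITCH_CASES = {
    ("es", "en"): ["Quiero pagar my bill", "Necesito cancelar my appointment"],
    ("en", "es"): ["I want to pay la factura"],
    ("hi", "en"): ["मुझे flight book करनी है"],
    # ... 10-20 utterances per pair
}

PASS_THRESHOLD = 0.80  # 80%+ task completion despite code-switching

def code_switch_report(run_conversation):
    """run_conversation(utterance) -> True if the task completed end to end."""
    report = {}
    for pair, utterances in CODE_SWITCH_CASES.items():
        completed = sum(1 for u in utterances if run_conversation(u))
        rate = completed / len(utterances)
        report[pair] = {"completion": round(rate, 2),
                        "passed": rate >= PASS_THRESHOLD}
    return report
```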

How Do You Measure Latency Variance by Language?

Different languages require different processing times. Tonal languages, languages with complex scripts, or languages with less training data may exhibit higher latency.

Latency Metrics to Track:

| Metric | Definition | Target |
|---|---|---|
| Time to First Word (TTFW) | Time from user silence to agent's first audio | 500ms |
| ASR Processing Time | Time to complete transcription | 300ms |
| Total Response Latency | End-to-end from user stop to agent response | 1200ms |

Latency Benchmarks by Language:

| Language | Expected TTFW | Expected Total Latency | Notes |
|---|---|---|---|
| English | 400ms | 1000ms | Baseline |
| Spanish | 450ms | 1100ms | Slightly higher due to longer words |
| German | 500ms | 1200ms | Compound words increase processing |
| Japanese | 550ms | 1300ms | Character-based processing |
| Hindi | 500ms | 1250ms | Code-switching handling adds latency |

Alert Threshold: If any language shows >20% latency increase vs. English baseline, investigate ASR model performance.
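
One way to operationalize the alert threshold: compute P95 latency per language from collected call samples and compare against the English baseline. A minimal sketch; the data structures are illustrative.

```python
import statistics

def p95(samples_ms):
    """95th percentile of latency samples in milliseconds (needs >= 2 samples)."""
    return statistics.quantiles(samples_ms, n=100)[94]

def latency_alerts(latency_by_language, baseline="english", max_increase=0.20):
    """Flag languages whose P95 latency exceeds the baseline by more than 20%."""
    base = p95(latency_by_language[baseline])
    alerts = {}
    for lang, samples in latency_by_language.items():
        value = p95(samples)
        increase = (value - base) / base
        if increase > max_increase:
            alerts[lang] = {"p95_ms": round(value),
                            "increase_vs_baseline_pct": round(100 * increase, 1)}
    return alerts
```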

Sources: Latency benchmarks based on Hamming's multilingual testing across 49 languages (2025). Language-specific latency variations documented in Multilingual Speech Processing research (Interspeech 2021). Targets aligned with conversational turn-taking research (Stivers et al., 2009).

For latency optimization tactics and benchmarks, see How to Optimize Latency in Voice Agents.

How Do You Monitor for Language Model Drift?

ASR and NLU models are updated regularly. These updates can improve one language while degrading another—often without warning.

Drift Detection Methodology:

  1. Establish baselines for each language (WER, intent accuracy, latency)
  2. Run regression tests after any model update
  3. Compare metrics against baseline with tolerance thresholds
  4. Alert on drift exceeding acceptable variance (see the comparison sketch after the table)

| Metric | Acceptable Variance | Alert Threshold | Critical Threshold |
|---|---|---|---|
| WER | ±2% | ±5% | ±10% |
| Intent Accuracy | ±1% | ±3% | ±5% |
| Latency (P95) | ±50ms | ±100ms | ±200ms |
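
As noted in the methodology list, the comparison against baseline can be automated per language. A minimal sketch using the tolerance tiers from the table; the metric keys are illustrative.

```python
# (acceptable variance, alert threshold, critical threshold) per metric,
# from the table above.
DRIFT_THRESHOLDS = {
    "wer_pct":        (2.0, 5.0, 10.0),
    "intent_acc_pct": (1.0, 3.0, 5.0),
    "latency_p95_ms": (50.0, 100.0, 200.0),
}

def drift_status(baseline: dict, current: dict) -> dict:
    """Classify drift for one language as ok / watch / alert / critical per metric."""
    status = {}
    for metric, (acceptable, alert, critical) in DRIFT_THRESHOLDS.items():
        delta = abs(current[metric] - baseline[metric])
        if delta <= acceptable:
            status[metric] = "ok"
        elif delta <= alert:
            status[metric] = "watch"      # above acceptable variance
        elif delta <= critical:
            status[metric] = "alert"      # investigate before rollout
        else:
            status[metric] = "critical"   # block the update
    return status

# Example after a model update, for one language:
# drift_status({"wer_pct": 11.8, "intent_acc_pct": 96.1, "latency_p95_ms": 1120},
#              {"wer_pct": 15.3, "intent_acc_pct": 95.2, "latency_p95_ms": 1190})
```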

Sources: Drift detection thresholds based on Hamming's regression testing methodology across 200+ model updates (2025). Variance tolerances aligned with ML model monitoring best practices (Google MLOps, 2024).

How Do You Test for Regional Variants?

Many languages have significant regional variations that affect voice agent performance:

What Spanish Variants Should You Test?

| Variant | Key Differences | Testing Focus |
|---|---|---|
| Castilian (Spain) | Distinction of "c/z" sounds, "vosotros" form | Formal business interactions |
| Mexican | Unique vocabulary, faster speech patterns | Customer service scenarios |
| Argentine | "Vos" form, distinctive intonation | Regional-specific terms |

What Portuguese Variants Should You Test?

| Variant | Key Differences | Testing Focus |
|---|---|---|
| Brazilian | Open vowels, different vocabulary | Most common variant for Americas |
| European | Closed vowels, formal constructions | European market deployments |

What English Variants Should You Test?

| Variant | Key Differences | Testing Focus |
|---|---|---|
| American | Rhotic pronunciation, specific vocabulary | Default for US deployments |
| British | Non-rhotic, different spelling conventions | UK market |
| Indian | Distinct phonetic patterns, code-switching | High code-switching with Hindi |

How Do You Run Environmental Testing Across Languages?

Background noise affects ASR differently across languages. Test each language under these conditions:

| Condition | Description | Expected WER Impact |
|---|---|---|
| Office noise | Typing, HVAC, distant conversations | +3-5% WER |
| Street noise | Traffic, crowds, wind | +5-10% WER |
| Café/restaurant | Music, conversations, clinking | +8-15% WER |
| Car (hands-free) | Engine noise, road noise, echo | +10-20% WER |
| Speakerphone | Echo, distance, room acoustics | +5-12% WER |

Sources: Environmental WER impact ranges based on CHiME Challenge evaluation protocols, ICSI Meeting Corpus research, and Hamming production testing across diverse acoustic environments (2025). SNR testing methodology aligned with ETSI speech quality standards.

Testing Protocol (a noise-mixing sketch follows the list):

  1. Run baseline tests in quiet conditions
  2. Apply noise profiles at -10dB, -5dB, and 0dB SNR
  3. Measure WER degradation per language
  4. Flag languages with >15% degradation vs. English in same conditions
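
For the noise conditions in step 2, a common approach is to mix recorded noise profiles into clean test audio at a target SNR before replaying it through the agent. A minimal sketch with NumPy, assuming both signals are float arrays at the same sample rate.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a target signal-to-noise ratio (dB)."""
    # Tile or truncate the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_noise_power / noise_power)

# Re-run the baseline suite at each condition, then compare WER per language:
# for snr_db in (0, -5, -10):
#     noisy_audio = mix_at_snr(clean_audio, cafe_noise, snr_db)
```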

What Is the Multilingual Testing Checklist?

Use this checklist before deploying a voice agent to a new language market; a consolidated release-gate sketch follows the checklist:

ASR Validation:

  • Baseline WER established under clean conditions
  • WER tested under noisy conditions (office, street, car)
  • Regional variants tested (if applicable)
  • Domain-specific vocabulary validated

Intent Recognition:

  • All intents tested with equivalent (not literal) translations
  • Intent accuracy meets 95% threshold
  • Edge cases and ambiguous phrases tested
  • Negative testing (out-of-scope requests)

Code-Switching:

  • Common code-switching patterns tested
  • Task completion rate >80% with code-switched input
  • Graceful fallback when code-switching fails

Performance:

  • Latency within 20% of English baseline
  • P95 latency under 1500ms
  • No timeout errors under normal load

Monitoring:

  • Baseline metrics recorded for drift detection
  • Alerts configured for metric degradation
  • Regression test suite ready for model updates
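
The checklist above can double as a single automated release gate per language. A minimal sketch, with metric names chosen for illustration and thresholds taken from this guide.

```python
def release_gate(metrics: dict, wer_acceptable_pct: float) -> list:
    """Return the failed checks for one language; an empty list means go."""
    checks = [
        ("clean-audio WER within language threshold",
         metrics["wer_clean_pct"] <= wer_acceptable_pct),
        ("intent accuracy >= 95%",
         metrics["intent_accuracy_pct"] >= 95.0),
        ("code-switch task completion >= 80%",
         metrics["code_switch_completion_pct"] >= 80.0),
        ("P95 latency under 1500 ms",
         metrics["latency_p95_ms"] < 1500),
        ("latency within 20% of English baseline",
         metrics["latency_p95_ms"] <= 1.2 * metrics["english_latency_p95_ms"]),
    ]
    return [name for name, passed in checks if not passed]
```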

What Languages Does Hamming Support?

Hamming supports multilingual voice agent testing in 49 languages:

| Language | Code-Switching | Regional Variants |
|---|---|---|
| Arabic | Limited | MSA, Gulf, Levantine, Egyptian, Maghrebi |
| Bengali | Supported | Bangladeshi, West Bengali |
| Bulgarian | Limited | Standard Bulgarian |
| Cantonese | Supported | Hong Kong, Guangzhou |
| Catalan | Limited | Catalan, Valencian |
| Chinese | Supported | Mandarin (Simplified), Traditional, Cantonese |
| Croatian | Limited | Standard Croatian |
| Czech | Limited | Standard Czech |
| Danish | Limited | Standard Danish |
| Dutch | Limited | Netherlands, Belgian Dutch |
| English | Supported | US, UK, Australian, Indian, New Zealand |
| Estonian | Limited | Standard Estonian |
| Finnish | Limited | Standard Finnish |
| French | Supported | Metropolitan, Canadian, Belgian, Swiss |
| Galician | Limited | Standard Galician |
| German | Limited | DACH region (Germany, Austria, Switzerland) |
| Greek | Limited | Modern Greek |
| Gujarati | Supported | Standard Gujarati |
| Hebrew | Limited | Modern Hebrew |
| Hindi | Supported | Standard Hindi, regional variants |
| Hungarian | Limited | Standard Hungarian |
| Icelandic | Limited | Standard Icelandic |
| Indonesian | Limited | Standard Indonesian |
| Italian | Limited | Standard Italian, regional dialects |
| Japanese | Supported | Standard Japanese, Kansai, Tokyo |
| Kannada | Supported | Standard Kannada |
| Korean | Limited | Standard Korean, regional dialects |
| Latvian | Limited | Standard Latvian |
| Lithuanian | Limited | Standard Lithuanian |
| Malay | Limited | Malaysian, Indonesian influences |
| Malayalam | Supported | Standard Malayalam |
| Marathi | Supported | Standard Marathi |
| Norwegian | Limited | Bokmål, Nynorsk |
| Odia | Supported | Standard Odia |
| Polish | Limited | Standard Polish |
| Portuguese | Limited | Brazilian, European Portuguese |
| Punjabi | Supported | Eastern, Western Punjabi |
| Romanian | Limited | Standard Romanian |
| Russian | Limited | Standard Russian, regional variants |
| Slovak | Limited | Standard Slovak |
| Slovenian | Limited | Standard Slovenian |
| Spanish | Supported | Latin American, European Spanish, US Spanish |
| Swedish | Limited | Standard Swedish, Finland Swedish |
| Tamil | Supported | Indian Tamil, Sri Lankan Tamil |
| Telugu | Supported | Standard Telugu |
| Thai | Limited | Central Thai, regional variants |
| Turkish | Limited | Istanbul Turkish, regional variants |
| Ukrainian | Limited | Standard Ukrainian |
| Vietnamese | Limited | Northern, Central, Southern dialects |

How Do You Get Started with Multilingual Testing?

To begin testing your voice agent across languages:

  1. Select target languages based on your market requirements
  2. Create test scenarios using the 5-Step Framework
  3. Establish baselines for each language
  4. Run comparative tests to identify language-specific issues
  5. Monitor continuously for drift after deployment

Hamming's multilingual testing capabilities enable you to validate voice agent performance across all supported languages with language-specific metrics, benchmarks, and drift detection.

Get started with multilingual testing →

Frequently Asked Questions

Which languages does Hamming support out of the box?

Our current list includes 11+ languages: Dutch, English (US, UK, Australian, Indian variants), French (France, Canadian), German (Germany, Austrian, Swiss), Hindi, Indonesian, Italian, Japanese, Korean, Portuguese (Brazilian, European), and Spanish (Castilian, Mexican, Argentine). Each language includes regional variants, accent coverage, and code-switching support where applicable. If you need a specific locale, ask and we will confirm coverage.

Why does multilingual voice agent testing matter?

ASR accuracy varies a lot by language. English can sit under 8% WER, while languages like Hindi often land in the high teens or low 20s. If you do not set per-language baselines, issues only show up after launch. Multilingual testing catches phonetic complexity (Mandarin tones), word boundary challenges (Japanese, Chinese), compound words (German, Dutch), and code-switching failures before users feel the gaps.

How does Hamming help with multilingual testing?

Hamming runs automated test calls in multiple languages and surfaces where performance diverges by language, accent, and flow. Teams can baseline WER, test code-switching (for example, "Quiero pagar my bill"), validate localization details (dates, addresses, honorifics), detect ASR drift after model updates, and measure latency variance by language. The goal is to catch regressions before you roll changes globally.

How do you test a multilingual voice agent?

Use a simple 5-step plan: (1) Establish baseline WER per language, not a single global score; (2) Test intent recognition with equivalent scenarios, not literal translations; (3) Validate code-switching handling; (4) Measure latency variance by language (keep it within ~20% of English); (5) Monitor drift after ASR or model updates. Add accent and regional variants early, and track both outcome metrics (completion, transfer rates) and voice UX (latency, interruptions) by language.

What WER should you target for each language?

Targets vary by language. Rough directional thresholds, following the benchmark table above: English <8% good, <10% acceptable; Spanish <12% good, <15% acceptable; French <10% good, <13% acceptable; German <12% good, <15% acceptable; Hindi <18% good, <22% acceptable; Mandarin <18% good, <22% acceptable; Japanese <15% good, <20% acceptable. These assume clean audio—add 5-10 percentage points for noisy conditions. WER above critical thresholds (for example, >20% for Spanish) usually makes the experience feel unreliable.

How many test scenarios do you need per language?

Start with 50-100 scenarios per language to establish baseline WER and intent accuracy, covering happy paths, edge cases, and domain vocabulary. Scale to 300+ scenarios for production monitoring to cover regional variants, accent diversity, and code-switching patterns. Critical flows (booking, payments, escalation) should have 20+ variations each. Prioritize by call volume and business impact.

Do you need native speakers to create test cases?

Short answer: yes. Native speakers bring natural phrasing, regional variants, and realistic code-switching that non-native speakers miss. They also know formality levels (critical for Japanese and Korean) and local vocabulary differences (for example, "carro" vs "coche" in Spanish). Translation-only test sets tend to sound unnatural and miss real failure modes.

How often should you rerun multilingual tests?

Rerun after any ASR or LLM update. In production, a weekly multilingual regression pass is a solid default, with immediate tests after changes to core flows. For critical languages, run synthetic calls every 5-15 minutes during business hours. Set drift thresholds per language: WER variance >2% should trigger investigation, >5% should trigger a critical alert.

What latency targets should you set for each language?

Rule of thumb: keep total latency within ~20% of the English baseline. Benchmarks used by many teams: English TTFW <400ms, total latency <1000ms; Spanish <450ms TTFW, <1100ms total; German <500ms TTFW, <1200ms total; Japanese <550ms TTFW, <1300ms total; Hindi <500ms TTFW, <1250ms total. Alert if P95 latency exceeds 1500ms in any language. Variance above 30% usually points to provider capacity or model performance issues.

How do you test code-switching?

Build 10-20 code-switched utterances per language pair and test both directions. Common patterns include noun substitution ("Quiero pagar my bill"), technical terms (Hindi-English), filler words (French-English), and brand names embedded in local language. Measure task completion rate—aim for 80%+ despite mixed-language input. This is where most voice agents break.

Which ASR provider is best for multilingual voice agents?

There is no single best provider for every language. Popular options include Deepgram Nova-3, AssemblyAI Universal-1, Whisper Large-v3, and Google Speech-to-Text. Reported WER and latency numbers vary by language and domain, so run your own benchmarks with your actual audio. Choose based on language coverage, latency requirements, and whether you need real-time streaming or batch processing.

Which Spanish variants should you test?

If you serve Spain and LATAM, test at least three variants: Castilian (Spain), Mexican Spanish, and Argentine Spanish. Each has vocabulary differences ("ordenador" vs "computadora"), pronunciation patterns, and formality expectations. A single "Spanish" model often fails on regional vocabulary and accent recognition unless you test it directly.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on the dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”