How to Test Multilingual Voice Agents: The Complete Framework

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 1M+ voice agent calls to find where they break.

November 15, 2025 · Updated December 23, 2025 · 17 min read

A customer deployed their voice agent to Latin America after testing only in English. Their test suite showed 95%+ accuracy. First week in production, they saw 40% task completion rates in Mexico and 35% in Argentina.

The problem wasn't the model. It was their test cases. They'd translated English scenarios word-for-word instead of creating culturally equivalent scenarios. "I'd like to schedule an appointment" became "Me gustaría programar una cita"—grammatically correct but awkward. Real users said "Quiero sacar turno" in Argentina or "Necesito agendar" in Mexico. The ASR worked fine. The intent classifier had never seen those phrasings.

TL;DR: Multilingual voice agent testing validates ASR accuracy, intent recognition, and latency across languages using equivalent scenarios, code-switching tests, and benchmarked WER targets. Use Hamming's 5-Step Multilingual Testing Framework to baseline performance, then monitor drift by language over time. Target WER under 10% for English, with adjusted thresholds per language (e.g., under 22% for Hindi, under 15% for German).

Quick filter: If you are not testing code-switching, your multilingual agent is not production-ready.

How to Test Multilingual Voice Agents?

Voice agents serving global markets face a fundamental challenge: ASR accuracy, intent recognition, and conversational flow vary significantly across languages. A voice agent that performs flawlessly in English may struggle with German compound words, Japanese honorifics, or Hindi code-switching.

This guide provides a structured framework for testing multilingual voice agents, including specific metrics, benchmarks, and methodologies for ensuring consistent performance across languages.

What Is Multilingual Voice Agent Testing?

Multilingual voice agent testing is the process of verifying that a voice agent performs consistently across languages, accents, and real-world conditions by measuring ASR accuracy, intent recognition, latency, and task completion using language-native test cases.

Methodology Note: WER benchmarks in this guide are derived from Hamming's testing of 500K+ multilingual voice interactions across 49 languages (2025). Thresholds align with published ASR research including Google's Multilingual ASR studies and industry standards from contact center deployments. Use these as directional baselines, not guarantees—actual performance varies by ASR provider, audio quality, and domain vocabulary.

Why Is Multilingual Voice Agent Testing Different?

Testing a voice agent in multiple languages isn't simply translating test cases. Each language introduces unique challenges, and translation-only tests are where teams get surprised.

| Challenge | Description | Languages Most Affected |
|---|---|---|
| Phonetic complexity | Some languages have sounds that ASR models struggle to distinguish | Mandarin (tones), Arabic (pharyngeal consonants) |
| Word boundaries | Languages without clear word boundaries complicate transcription | Japanese, Chinese, Thai |
| Compound words | Long compound words can exceed ASR model training patterns | German, Dutch, Finnish |
| Code-switching | Users mixing languages mid-sentence breaks most models | Hindi-English, Spanish-English |
| Honorific systems | Formal/informal distinctions affect intent recognition | Japanese, Korean, Indonesian |
| Regional variants | Same language spoken differently across regions | Spanish (Castilian vs. Mexican), Portuguese (Brazilian vs. European) |

Sources: Multilingual ASR challenges documented in Google's Multilingual ASR research (Pratap et al., 2020), OpenAI Whisper paper (Radford et al., 2023), and Mozilla Common Voice dataset analysis (2024).

Hamming's 5-Step Multilingual Testing Framework

How Do You Establish Baseline WER Per Language?

Before testing real scenarios, establish baseline Word Error Rate (WER) for each language using clean, controlled audio. For deeper ASR benchmarking, see ASR Accuracy Evaluation for Voice Agents.

How to Calculate Word Error Rate:

WER = (S + D + I) / N × 100

Where:
S = Substitutions (wrong words)
D = Deletions (missing words)
I = Insertions (extra words)
N = Total words in reference

Worked Example:

Reference: "I want to book a flight to Berlin" Transcription: "I want to look at flight Berlin"

  • Substitutions: 2 (book→look, a→at)
  • Deletions: 1 (to)
  • Insertions: 0
  • Total words: 8

WER = (2 + 1 + 0) / 8 × 100 = 37.5%
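
The same calculation can be scripted so every language is scored identically. Below is a minimal sketch using a word-level edit-distance alignment; the function name and normalization are illustrative, not tied to any particular library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, computed via word-level edit distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],   # substitution
                                   dp[i - 1][j],       # deletion
                                   dp[i][j - 1])       # insertion
    return dp[len(ref)][len(hyp)] / len(ref) * 100

# Worked example from above: prints 37.5
print(word_error_rate("I want to book a flight to Berlin",
                      "I want to look at flight Berlin"))
```

For languages without whitespace word boundaries (Japanese, Chinese, Thai), teams often report character error rate (CER) instead, or run a language-specific tokenizer before scoring; whichever convention you pick, apply it consistently across baselines and regression runs.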

What Are Multilingual ASR Accuracy Benchmarks?

Use these benchmarks to evaluate whether your voice agent meets acceptable thresholds. Values indicate maximum WER for each quality tier:

| Language | Excellent | Good | Acceptable | Critical |
|---|---|---|---|---|
| Arabic | 15% | 20% | 25% | 30%+ |
| Bengali | 12% | 18% | 22% | 28%+ |
| Bulgarian | 10% | 15% | 18% | 23%+ |
| Cantonese | 15% | 20% | 25% | 30%+ |
| Catalan | 10% | 15% | 18% | 23%+ |
| Chinese (Mandarin) | 12% | 18% | 22% | 28%+ |
| Croatian | 10% | 15% | 18% | 23%+ |
| Czech | 10% | 15% | 18% | 23%+ |
| Danish | 8% | 12% | 15% | 20%+ |
| Dutch | 8% | 12% | 15% | 20%+ |
| English | 5% | 8% | 10% | 15%+ |
| Estonian | 12% | 18% | 22% | 28%+ |
| Finnish | 12% | 18% | 22% | 28%+ |
| French | 7% | 10% | 13% | 18%+ |
| Galician | 10% | 15% | 18% | 23%+ |
| German | 8% | 12% | 15% | 20%+ |
| Greek | 10% | 15% | 18% | 23%+ |
| Gujarati | 15% | 20% | 25% | 30%+ |
| Hebrew | 12% | 18% | 22% | 28%+ |
| Hindi | 12% | 18% | 22% | 28%+ |
| Hungarian | 12% | 18% | 22% | 28%+ |
| Icelandic | 15% | 20% | 25% | 30%+ |
| Indonesian | 10% | 15% | 18% | 23%+ |
| Italian | 7% | 10% | 13% | 18%+ |
| Japanese | 10% | 15% | 20% | 25%+ |
| Kannada | 15% | 20% | 25% | 30%+ |
| Korean | 10% | 15% | 20% | 25%+ |
| Latvian | 12% | 18% | 22% | 28%+ |
| Lithuanian | 12% | 18% | 22% | 28%+ |
| Malay | 10% | 15% | 18% | 23%+ |
| Malayalam | 15% | 20% | 25% | 30%+ |
| Marathi | 15% | 20% | 25% | 30%+ |
| Norwegian | 8% | 12% | 15% | 20%+ |
| Odia | 18% | 23% | 28% | 33%+ |
| Polish | 10% | 15% | 18% | 23%+ |
| Portuguese | 8% | 12% | 15% | 20%+ |
| Punjabi | 15% | 20% | 25% | 30%+ |
| Romanian | 10% | 15% | 18% | 23%+ |
| Russian | 10% | 15% | 18% | 23%+ |
| Slovak | 10% | 15% | 18% | 23%+ |
| Slovenian | 10% | 15% | 18% | 23%+ |
| Spanish | 8% | 12% | 15% | 20%+ |
| Swedish | 8% | 12% | 15% | 20%+ |
| Tamil | 15% | 20% | 25% | 30%+ |
| Telugu | 15% | 20% | 25% | 30%+ |
| Thai | 15% | 20% | 25% | 30%+ |
| Turkish | 10% | 15% | 18% | 23%+ |
| Ukrainian | 10% | 15% | 18% | 23%+ |
| Vietnamese | 15% | 20% | 25% | 30%+ |

Note: These benchmarks assume clean audio conditions. Real-world performance with background noise typically adds 5-10 percentage points to WER. If your channel is noisy or mobile-heavy, plan for worse.

Sources: WER benchmarks compiled from OpenAI Whisper multilingual evaluation, Google Speech-to-Text language benchmarks, Deepgram Nova-3 language support, Mozilla Common Voice dataset evaluations, and Hamming's multilingual testing across 500K+ voice interactions (2025). Threshold tiers calibrated based on ICSI Meeting Corpus and contact center deployment standards.
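
To make the table usable in an automated gate, the thresholds can be encoded as data and each measured WER mapped to a tier. Here is a minimal sketch, with only a few languages shown; extend the dictionary from the full table above.

```python
# Maximum WER (%) per tier, copied from the benchmark table above.
WER_TIERS = {
    "english":  {"excellent": 5,  "good": 8,  "acceptable": 10, "critical": 15},
    "spanish":  {"excellent": 8,  "good": 12, "acceptable": 15, "critical": 20},
    "german":   {"excellent": 8,  "good": 12, "acceptable": 15, "critical": 20},
    "hindi":    {"excellent": 12, "good": 18, "acceptable": 22, "critical": 28},
    "japanese": {"excellent": 10, "good": 15, "acceptable": 20, "critical": 25},
}

def wer_tier(language: str, wer_pct: float) -> str:
    """Map a measured WER (in percent) to a quality tier for one language."""
    tiers = WER_TIERS[language.lower()]
    for tier in ("excellent", "good", "acceptable"):
        if wer_pct <= tiers[tier]:
            return tier
    return "critical" if wer_pct >= tiers["critical"] else "below acceptable"

print(wer_tier("hindi", 16.4))    # -> "good"
print(wer_tier("english", 16.0))  # -> "critical"
```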

How Do You Test Intent Recognition Accuracy Across Languages?

ASR accuracy alone doesn't guarantee understanding. A voice agent might transcribe "réserver un vol" correctly but fail to map it to the BOOK_FLIGHT intent.

Intent Recognition Testing Methodology:

  1. Create equivalent test cases in each language (not literal translations)
  2. Test intent accuracy independently from ASR
  3. Measure by intent category, not just overall accuracy

| Intent Category | English Test Case | Spanish Equivalent | Expected Intent |
|---|---|---|---|
| Booking | "I'd like to book a flight" | "Quiero reservar un vuelo" | BOOK_FLIGHT |
| Cancellation | "Cancel my reservation" | "Cancela mi reserva" | CANCEL_BOOKING |
| Information | "What time does it leave?" | "¿A qué hora sale?" | GET_DEPARTURE_TIME |
| Complaint | "This is unacceptable" | "Esto es inaceptable" | ESCALATE_COMPLAINT |

Intent Accuracy Formula:

Intent Accuracy = (Correct Intent Classifications / Total Test Cases) × 100

Target: 95%+ intent accuracy per language. If any language falls below 90%, investigate whether the issue is ASR (transcription) or NLU (understanding).
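
A per-language, per-category breakdown is straightforward to script. The sketch below assumes a `classify_intent(text, language)` callable as a stand-in for your agent's NLU endpoint and feeds it reference transcripts rather than audio, so intent accuracy is measured independently of ASR.

```python
from collections import defaultdict

# (language, intent_category, utterance, expected_intent) - equivalent
# scenarios per language, not literal translations.
TEST_CASES = [
    ("en", "booking",      "I'd like to book a flight", "BOOK_FLIGHT"),
    ("es", "booking",      "Quiero reservar un vuelo",  "BOOK_FLIGHT"),
    ("es", "cancellation", "Cancela mi reserva",        "CANCEL_BOOKING"),
    ("es", "information",  "¿A qué hora sale?",         "GET_DEPARTURE_TIME"),
    # ... extend per language and intent category
]

def intent_accuracy(classify_intent):
    """Return accuracy (%) keyed by (language, intent_category)."""
    totals, correct = defaultdict(int), defaultdict(int)
    for lang, category, text, expected in TEST_CASES:
        key = (lang, category)
        totals[key] += 1
        if classify_intent(text, language=lang) == expected:
            correct[key] += 1
    return {k: 100.0 * correct[k] / totals[k] for k in totals}

# Gate: flag any (language, category) pair under the 95% target.
# failures = {k: v for k, v in intent_accuracy(my_nlu).items() if v < 95.0}
```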

How Do You Validate Code-Switching Handling?

Code-switching—mixing languages within a single utterance—is common among multilingual users but breaks most voice agents.

Common Code-Switching Patterns to Test:

| Pattern | Example | Languages |
|---|---|---|
| Noun substitution | "Quiero pagar my bill" | Spanish-English |
| Technical terms | "मुझे flight book करनी है" (I need to book a flight) | Hindi-English |
| Filler words | "So, euh, je voudrais réserver" | French-English |
| Loanwords | "予約をキャンセルしたいです" with English brand names | Japanese-English |

Sources: Code-switching patterns based on Multilingual Speech Processing research (Sitaram et al., 2020) and analysis of real-world voice interactions in bilingual markets (US Hispanic, India, Singapore). Patterns validated through Hamming customer testing data (2025).

Code-Switching Test Protocol:

  1. Create 10-20 code-switched utterances per language pair
  2. Test both directions (Language A with Language B words, and vice versa)
  3. Measure: ASR accuracy, intent recognition, and task completion
  4. Pass threshold: 80%+ task completion despite code-switching (a scoring sketch follows this list)
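
The scoring step of this protocol can be a small harness. The sketch below assumes a `run_conversation(utterance)` callable that drives your agent end to end and reports whether the task completed; both the function and the sample utterances are illustrative.

```python
# Code-switched utterances per language pair; test both directions.
CODE_SWITCH_CASES = {
    ("es", "en"): ["Quiero pagar my bill", "Necesito cancelar my appointment"],
    ("en", "es"): ["I want to pay la factura"],
    ("hi", "en"): ["मुझे flight book करनी है"],
    # ... 10-20 utterances per pair
}

PASS_THRESHOLD = 0.80  # 80%+ task completion despite code-switching

def code_switch_report(run_conversation):
    """run_conversation(utterance) -> True if the task completed end to end."""
    report = {}
    for pair, utterances in CODE_SWITCH_CASES.items():
        completed = sum(1 for u in utterances if run_conversation(u))
        rate = completed / len(utterances)
        report[pair] = {"completion": round(rate, 2),
                        "passed": rate >= PASS_THRESHOLD}
    return report
```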

How Do You Measure Latency Variance by Language?

Different languages require different processing times. Tonal languages, languages with complex scripts, or languages with less training data may exhibit higher latency.

Latency Metrics to Track:

| Metric | Definition | Target |
|---|---|---|
| Time to First Word (TTFW) | Time from user silence to agent's first audio | 500ms |
| ASR Processing Time | Time to complete transcription | 300ms |
| Total Response Latency | End-to-end from user stop to agent response | 1200ms |

Latency Benchmarks by Language:

| Language | Expected TTFW | Expected Total Latency | Notes |
|---|---|---|---|
| English | 400ms | 1000ms | Baseline |
| Spanish | 450ms | 1100ms | Slightly higher due to longer words |
| German | 500ms | 1200ms | Compound words increase processing |
| Japanese | 550ms | 1300ms | Character-based processing |
| Hindi | 500ms | 1250ms | Code-switching handling adds latency |

Alert Threshold: If any language shows >20% latency increase vs. English baseline, investigate ASR model performance.
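
One way to operationalize the alert threshold: compute P95 latency per language from collected call samples and compare against the English baseline. A minimal sketch; the data structures are illustrative.

```python
import statistics

def p95(samples_ms):
    """95th percentile of latency samples in milliseconds (needs >= 2 samples)."""
    return statistics.quantiles(samples_ms, n=100)[94]

def latency_alerts(latency_by_language, baseline="english", max_increase=0.20):
    """Flag languages whose P95 latency exceeds the baseline by more than 20%."""
    base = p95(latency_by_language[baseline])
    alerts = {}
    for lang, samples in latency_by_language.items():
        value = p95(samples)
        increase = (value - base) / base
        if increase > max_increase:
            alerts[lang] = {"p95_ms": round(value),
                            "increase_vs_baseline_pct": round(100 * increase, 1)}
    return alerts
```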

Sources: Latency benchmarks based on Hamming's multilingual testing across 49 languages (2025). Language-specific latency variations documented in Multilingual Speech Processing research (Interspeech 2021). Targets aligned with conversational turn-taking research (Stivers et al., 2009).

For latency optimization tactics and benchmarks, see How to Optimize Latency in Voice Agents.

How Do You Monitor for Language Model Drift?

ASR and NLU models are updated regularly. These updates can improve one language while degrading another—often without warning.

Drift Detection Methodology:

  1. Establish baselines for each language (WER, intent accuracy, latency)
  2. Run regression tests after any model update
  3. Compare metrics against baseline with tolerance thresholds
  4. Alert on drift exceeding acceptable variance (see the comparison sketch after the table)

| Metric | Acceptable Variance | Alert Threshold | Critical Threshold |
|---|---|---|---|
| WER | ±2% | ±5% | ±10% |
| Intent Accuracy | ±1% | ±3% | ±5% |
| Latency (P95) | ±50ms | ±100ms | ±200ms |
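
As noted in the methodology list, the comparison against baseline can be automated per language. A minimal sketch using the tolerance tiers from the table; the metric keys are illustrative.

```python
# (acceptable variance, alert threshold, critical threshold) per metric,
# from the table above.
DRIFT_THRESHOLDS = {
    "wer_pct":        (2.0, 5.0, 10.0),
    "intent_acc_pct": (1.0, 3.0, 5.0),
    "latency_p95_ms": (50.0, 100.0, 200.0),
}

def drift_status(baseline: dict, current: dict) -> dict:
    """Classify drift for one language as ok / watch / alert / critical per metric."""
    status = {}
    for metric, (acceptable, alert, critical) in DRIFT_THRESHOLDS.items():
        delta = abs(current[metric] - baseline[metric])
        if delta <= acceptable:
            status[metric] = "ok"
        elif delta <= alert:
            status[metric] = "watch"      # above acceptable variance
        elif delta <= critical:
            status[metric] = "alert"      # investigate before rollout
        else:
            status[metric] = "critical"   # block the update
    return status

# Example after a model update, for one language:
# drift_status({"wer_pct": 11.8, "intent_acc_pct": 96.1, "latency_p95_ms": 1120},
#              {"wer_pct": 15.3, "intent_acc_pct": 95.2, "latency_p95_ms": 1190})
```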

Sources: Drift detection thresholds based on Hamming's regression testing methodology across 200+ model updates (2025). Variance tolerances aligned with ML model monitoring best practices (Google MLOps, 2024).

How Do You Test for Regional Variants?

Many languages have significant regional variations that affect voice agent performance:

What Spanish Variants Should You Test?

| Variant | Key Differences | Testing Focus |
|---|---|---|
| Castilian (Spain) | Distinction of "c/z" sounds, "vosotros" form | Formal business interactions |
| Mexican | Unique vocabulary, faster speech patterns | Customer service scenarios |
| Argentine | "Vos" form, distinctive intonation | Regional-specific terms |

What Portuguese Variants Should You Test?

| Variant | Key Differences | Testing Focus |
|---|---|---|
| Brazilian | Open vowels, different vocabulary | Most common variant for Americas |
| European | Closed vowels, formal constructions | European market deployments |

What English Variants Should You Test?

| Variant | Key Differences | Testing Focus |
|---|---|---|
| American | Rhotic pronunciation, specific vocabulary | Default for US deployments |
| British | Non-rhotic, different spelling conventions | UK market |
| Indian | Distinct phonetic patterns, code-switching | High code-switching with Hindi |

How Do You Run Environmental Testing Across Languages?

Background noise affects ASR differently across languages. Test each language under these conditions:

| Condition | Description | Expected WER Impact |
|---|---|---|
| Office noise | Typing, HVAC, distant conversations | +3-5% WER |
| Street noise | Traffic, crowds, wind | +5-10% WER |
| Café/restaurant | Music, conversations, clinking | +8-15% WER |
| Car (hands-free) | Engine noise, road noise, echo | +10-20% WER |
| Speakerphone | Echo, distance, room acoustics | +5-12% WER |

Sources: Environmental WER impact ranges based on CHiME Challenge evaluation protocols, ICSI Meeting Corpus research, and Hamming production testing across diverse acoustic environments (2025). SNR testing methodology aligned with ETSI speech quality standards.

Testing Protocol (a noise-mixing sketch follows the list):

  1. Run baseline tests in quiet conditions
  2. Apply noise profiles at -10dB, -5dB, and 0dB SNR
  3. Measure WER degradation per language
  4. Flag languages with >15% degradation vs. English in same conditions
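
For the noise conditions in step 2, a common approach is to mix recorded noise profiles into clean test audio at a target SNR before replaying it through the agent. A minimal sketch with NumPy, assuming both signals are float arrays at the same sample rate.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a target signal-to-noise ratio (dB)."""
    # Tile or truncate the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_noise_power / noise_power)

# Re-run the baseline suite at each condition, then compare WER per language:
# for snr_db in (0, -5, -10):
#     noisy_audio = mix_at_snr(clean_audio, cafe_noise, snr_db)
```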

What Is the Multilingual Testing Checklist?

Use this checklist before deploying a voice agent to a new language market; a consolidated release-gate sketch follows the checklist:

ASR Validation:

  • Baseline WER established under clean conditions
  • WER tested under noisy conditions (office, street, car)
  • Regional variants tested (if applicable)
  • Domain-specific vocabulary validated

Intent Recognition:

  • All intents tested with equivalent (not literal) translations
  • Intent accuracy meets 95% threshold
  • Edge cases and ambiguous phrases tested
  • Negative testing (out-of-scope requests)

Code-Switching:

  • Common code-switching patterns tested
  • Task completion rate >80% with code-switched input
  • Graceful fallback when code-switching fails

Performance:

  • Latency within 20% of English baseline
  • P95 latency under 1500ms
  • No timeout errors under normal load

Monitoring:

  • Baseline metrics recorded for drift detection
  • Alerts configured for metric degradation
  • Regression test suite ready for model updates
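
The checklist above can double as a single automated release gate per language. A minimal sketch, with metric names chosen for illustration and thresholds taken from this guide.

```python
def release_gate(metrics: dict, wer_acceptable_pct: float) -> list:
    """Return the failed checks for one language; an empty list means go."""
    checks = [
        ("clean-audio WER within language threshold",
         metrics["wer_clean_pct"] <= wer_acceptable_pct),
        ("intent accuracy >= 95%",
         metrics["intent_accuracy_pct"] >= 95.0),
        ("code-switch task completion >= 80%",
         metrics["code_switch_completion_pct"] >= 80.0),
        ("P95 latency under 1500 ms",
         metrics["latency_p95_ms"] < 1500),
        ("latency within 20% of English baseline",
         metrics["latency_p95_ms"] <= 1.2 * metrics["english_latency_p95_ms"]),
    ]
    return [name for name, passed in checks if not passed]
```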

What Languages Does Hamming Support?

Hamming supports multilingual voice agent testing in 49 languages:

| Language | Code-Switching | Regional Variants |
|---|---|---|
| Arabic | Limited | MSA, Gulf, Levantine, Egyptian, Maghrebi |
| Bengali | Supported | Bangladeshi, West Bengali |
| Bulgarian | Limited | Standard Bulgarian |
| Cantonese | Supported | Hong Kong, Guangzhou |
| Catalan | Limited | Catalan, Valencian |
| Chinese | Supported | Mandarin (Simplified), Traditional, Cantonese |
| Croatian | Limited | Standard Croatian |
| Czech | Limited | Standard Czech |
| Danish | Limited | Standard Danish |
| Dutch | Limited | Netherlands, Belgian Dutch |
| English | Supported | US, UK, Australian, Indian, New Zealand |
| Estonian | Limited | Standard Estonian |
| Finnish | Limited | Standard Finnish |
| French | Supported | Metropolitan, Canadian, Belgian, Swiss |
| Galician | Limited | Standard Galician |
| German | Limited | DACH region (Germany, Austria, Switzerland) |
| Greek | Limited | Modern Greek |
| Gujarati | Supported | Standard Gujarati |
| Hebrew | Limited | Modern Hebrew |
| Hindi | Supported | Standard Hindi, regional variants |
| Hungarian | Limited | Standard Hungarian |
| Icelandic | Limited | Standard Icelandic |
| Indonesian | Limited | Standard Indonesian |
| Italian | Limited | Standard Italian, regional dialects |
| Japanese | Supported | Standard Japanese, Kansai, Tokyo |
| Kannada | Supported | Standard Kannada |
| Korean | Limited | Standard Korean, regional dialects |
| Latvian | Limited | Standard Latvian |
| Lithuanian | Limited | Standard Lithuanian |
| Malay | Limited | Malaysian, Indonesian influences |
| Malayalam | Supported | Standard Malayalam |
| Marathi | Supported | Standard Marathi |
| Norwegian | Limited | Bokmål, Nynorsk |
| Odia | Supported | Standard Odia |
| Polish | Limited | Standard Polish |
| Portuguese | Limited | Brazilian, European Portuguese |
| Punjabi | Supported | Eastern, Western Punjabi |
| Romanian | Limited | Standard Romanian |
| Russian | Limited | Standard Russian, regional variants |
| Slovak | Limited | Standard Slovak |
| Slovenian | Limited | Standard Slovenian |
| Spanish | Supported | Latin American, European Spanish, US Spanish |
| Swedish | Limited | Standard Swedish, Finland Swedish |
| Tamil | Supported | Indian Tamil, Sri Lankan Tamil |
| Telugu | Supported | Standard Telugu |
| Thai | Limited | Central Thai, regional variants |
| Turkish | Limited | Istanbul Turkish, regional variants |
| Ukrainian | Limited | Standard Ukrainian |
| Vietnamese | Limited | Northern, Central, Southern dialects |

How Do You Get Started with Multilingual Testing?

To begin testing your voice agent across languages:

  1. Select target languages based on your market requirements
  2. Create test scenarios using the 5-Step Framework
  3. Establish baselines for each language
  4. Run comparative tests to identify language-specific issues
  5. Monitor continuously for drift after deployment

Hamming's multilingual testing capabilities enable you to validate voice agent performance across all supported languages with language-specific metrics, benchmarks, and drift detection.

Get started with multilingual testing →

Frequently Asked Questions

Which languages does Hamming support out of the box?

Our current list includes 11+ languages: Dutch, English (US, UK, Australian, Indian variants), French (France, Canadian), German (Germany, Austrian, Swiss), Hindi, Indonesian, Italian, Japanese, Korean, Portuguese (Brazilian, European), and Spanish (Castilian, Mexican, Argentine). Each language includes regional variants, accent coverage, and code-switching support where applicable. If you need a specific locale, ask and we will confirm coverage.

Why does multilingual voice agent testing matter?

ASR accuracy varies a lot by language. English can sit under 8% WER, while languages like Hindi often land in the high teens or low 20s. If you do not set per-language baselines, issues only show up after launch. Multilingual testing catches phonetic complexity (Mandarin tones), word boundary challenges (Japanese, Chinese), compound words (German, Dutch), and code-switching failures before users feel the gaps.

How does Hamming help with multilingual testing?

Hamming runs automated test calls in multiple languages and surfaces where performance diverges by language, accent, and flow. Teams can baseline WER, test code-switching (for example, "Quiero pagar my bill"), validate localization details (dates, addresses, honorifics), detect ASR drift after model updates, and measure latency variance by language. The goal is to catch regressions before you roll changes globally.

How do you test a multilingual voice agent?

Use a simple 5-step plan: (1) Establish baseline WER per language, not a single global score; (2) Test intent recognition with equivalent scenarios, not literal translations; (3) Validate code-switching handling; (4) Measure latency variance by language (keep it within ~20% of English); (5) Monitor drift after ASR or model updates. Add accent and regional variants early, and track both outcome metrics (completion, transfer rates) and voice UX (latency, interruptions) by language.

What WER should you target for each language?

Targets vary by language. Rough directional thresholds, following the benchmark table above: English <8% good, <10% acceptable; Spanish <12% good, <15% acceptable; French <10% good, <13% acceptable; German <12% good, <15% acceptable; Hindi <18% good, <22% acceptable; Mandarin <18% good, <22% acceptable; Japanese <15% good, <20% acceptable. These assume clean audio—add 5-10 percentage points for noisy conditions. WER above critical thresholds (for example, >20% for Spanish) usually makes the experience feel unreliable.

How many test scenarios do you need per language?

Start with 50-100 scenarios per language to establish baseline WER and intent accuracy, covering happy paths, edge cases, and domain vocabulary. Scale to 300+ scenarios for production monitoring to cover regional variants, accent diversity, and code-switching patterns. Critical flows (booking, payments, escalation) should have 20+ variations each. Prioritize by call volume and business impact.

Do you need native speakers to create test cases?

Short answer: yes. Native speakers bring natural phrasing, regional variants, and realistic code-switching that non-native speakers miss. They also know formality levels (critical for Japanese and Korean) and local vocabulary differences (for example, "carro" vs "coche" in Spanish). Translation-only test sets tend to sound unnatural and miss real failure modes.

How often should you rerun multilingual tests?

Rerun after any ASR or LLM update. In production, a weekly multilingual regression pass is a solid default, with immediate tests after changes to core flows. For critical languages, run synthetic calls every 5-15 minutes during business hours. Set drift thresholds per language: WER variance >2% should trigger investigation, >5% should trigger a critical alert.

What latency targets should you set for each language?

Rule of thumb: keep total latency within ~20% of the English baseline. Benchmarks used by many teams: English TTFW <400ms, total latency <1000ms; Spanish <450ms TTFW, <1100ms total; German <500ms TTFW, <1200ms total; Japanese <550ms TTFW, <1300ms total; Hindi <500ms TTFW, <1250ms total. Alert if P95 latency exceeds 1500ms in any language. Variance above 30% usually points to provider capacity or model performance issues.

How do you test code-switching?

Build 10-20 code-switched utterances per language pair and test both directions. Common patterns include noun substitution ("Quiero pagar my bill"), technical terms (Hindi-English), filler words (French-English), and brand names embedded in local language. Measure task completion rate—aim for 80%+ despite mixed-language input. This is where most voice agents break.

Which ASR provider is best for multilingual voice agents?

There is no single best provider for every language. Popular options include Deepgram Nova-3, AssemblyAI Universal-1, Whisper Large-v3, and Google Speech-to-Text. Reported WER and latency numbers vary by language and domain, so run your own benchmarks with your actual audio. Choose based on language coverage, latency requirements, and whether you need real-time streaming or batch processing.

Which Spanish variants should you test?

If you serve Spain and LATAM, test at least three variants: Castilian (Spain), Mexican Spanish, and Argentine Spanish. Each has vocabulary differences ("ordenador" vs "computadora"), pronunciation patterns, and formality expectations. A single "Spanish" model often fails on regional vocabulary and accent recognition unless you test it directly.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on the dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”