Intent Recognition for Voice Agents: Testing at Scale

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 1M+ voice agent calls to find where they break.

January 5, 2026 · 13 min read

This guide is for teams testing voice agents at scale—thousands of calls, dozens of intents, production environments where small errors compound into major failures. If you're testing a simple FAQ bot with 5 intents, standard accuracy metrics will do.

Your voice agent's intent recognition looks perfect in testing. 98% accuracy on your evaluation set. You ship to production. Within hours, users report the agent "doesn't understand anything."

Here's what actually happened: ASR transcribed "Check my account balance" as "Check my cow balance." I still don't fully understand the phonetics on that one - "account" to "cow" isn't even close. But the NLU model had never seen "cow balance" in training, panicked, and guessed "livestock_inquiry" at 73% confidence. A banking customer got a confident explanation about cattle management. The support ticket was... memorable.

This is the cascade effect. Voice agents don't just have NLU errors—they have compounding ASR + NLU errors. In Hamming's analysis of 1M+ production calls across 50+ deployments, we see intent error rates in voice agents that are 3-10x higher than in text chatbots, even when both use identical NLU models. Testing intent recognition at scale means accounting for this reality.

We learned this the hard way. Early on, we shipped a banking agent that looked "clean" in text-only tests. The first week in production, a small cluster of ASR errors ("balance" → "ballots") drove a spike in abandonment. It wasn't a model problem. It was a test design problem.

The short version: Test with 10K+ utterances, not 50. Track which specific intents confuse each other, not just aggregate accuracy. Make sure your agent knows when to say "I don't understand" instead of confidently guessing wrong. And test first-turn intent separately - if you lose them on turn one, you've lost them period.

Voice agents have 3-10x higher error rates than text due to ASR cascade. The metrics that matter: Intent Classification Accuracy (>98% for critical domains), Intent Confusion Rate (<2% per pair), Out-of-Scope Detection (>95%), Slot Filling (>98% critical), and First-Turn Accuracy (>97%).

Methodology Note: The benchmarks in this guide are derived from Hamming's analysis of 1M+ voice agent interactions across 50+ production deployments (2024-2025). Voice-specific error rates account for ASR cascade effects. Domain benchmarks vary: banking requires >98% ICA, customer support accepts >93%.

What Is Intent Recognition in Voice Agents?

Intent recognition is the process of mapping user utterances to actionable intents (e.g., "book_appointment", "check_balance", "cancel_order"). In text chatbots, this is a straightforward NLU task. In voice agents, it becomes exponentially harder.

Voice agents face a unique challenge: the utterance must first pass through ASR before reaching the NLU model. Each layer introduces errors that compound. An ASR that is 95% accurate feeding an NLU model that is 98% accurate on clean text yields 0.95 × 0.98 = 93.1% combined accuracy (assuming independent errors). That's a 3.45x higher error rate than text-only.

The business impact is severe. In our analysis of 1M+ production calls across 50+ deployments, poor intent recognition consistently ranks among the top two causes of user abandonment (alongside poor conversational flow). Users won't tolerate an agent that confidently misunderstands them.

What surprised us in production

Three patterns kept repeating across deployments:

  1. High overall accuracy can still feel broken. A 97-98% ICA sounds great, but a few high-impact confusions can dominate user complaints.
  2. OOS errors are rare but memorable. A handful of hallucinated answers can undo weeks of trust-building.
  3. First-turn mistakes are sticky. Users rarely recover from a wrong first turn, even if the agent corrects later.

Voice vs Text: Why Intent Recognition Is Different

| Challenge | Text NLU | Voice NLU |
|---|---|---|
| Input quality | Clean text | ASR errors cascade |
| Variations | Typos, abbreviations | Accents, mumbling, background noise |
| Context | Full sentence visible | Partial utterances, real-time processing |
| Timing | Async, can retry | Real-time, one chance |
| Error rate | Baseline | 3-10x higher |

The Cascade Effect: Why Voice Intent Recognition Is Harder

The cascade effect is the compounding of ASR errors into NLU failures. Even minor transcription errors can trigger completely wrong intent classifications.

The Cascade in Action

User says: "I'd like to book an appointment"
ASR outputs: "I'd like to book a appointment" (minor grammatical error)
 Intent: book_appointment  (robust NLU handles it)

User says: "Check my account balance"
ASR outputs: "Check my cow balance" (phonetic confusion)
 Intent: ??? (NLU model has never seen "cow balance")
 Best guess: livestock_inquiry (wrong domain entirely)

The first example shows ASR robustness. The second shows cascade failure: a common phonetic error ("account" → "cow") derails the entire conversation.

Compound Error Rate Formula

Voice Intent Error = 1 - (ASR Accuracy × NLU Accuracy)

Worked Example:

| Input | Value |
|---|---|
| ASR Accuracy | 95% (0.95) |
| NLU Accuracy (on clean text) | 98% (0.98) |
| Combined Accuracy | 0.95 × 0.98 = 0.931 (93.1%) |
| Voice Intent Error Rate | 1 - 0.931 = 6.9% |
| Text-only NLU Error Rate | 1 - 0.98 = 2% |
| Error Rate Increase | 6.9% / 2% = 3.45x higher |

At 10,000 calls per day, that 6.9% error rate means 690 users daily experience intent misclassification. At scale, these "rare" errors become systemic problems.
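If you want to sanity-check these numbers for your own stack, the arithmetic is trivial to script. Here's a minimal sketch in plain Python using the worked example's accuracy values; swap in your own measurements:

```python
def voice_intent_error_rate(asr_accuracy: float, nlu_accuracy: float) -> float:
    """Compound intent error rate, assuming ASR and NLU errors are independent."""
    return 1.0 - (asr_accuracy * nlu_accuracy)

# Values from the worked example above.
asr, nlu = 0.95, 0.98
voice_error = voice_intent_error_rate(asr, nlu)   # 1 - 0.931 = 0.069
text_error = 1.0 - nlu                            # 0.02

print(f"Voice intent error rate: {voice_error:.1%}")                 # 6.9%
print(f"Increase vs text-only: {voice_error / text_error:.2f}x")     # 3.45x
print(f"Misclassified calls at 10,000/day: {round(10_000 * voice_error)}")  # 690
```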

Hamming's Intent Recognition Quality Framework

We used to think intent testing was simple: test set, accuracy number, ship it. After the fourth deployment where aggregate accuracy looked great but users were rage-quitting, we had to admit we were measuring the wrong things. Aggregate accuracy hides confusion between specific intent pairs. Small test sets miss rare but critical errors. And testing with clean text ignores the ASR cascade entirely.

Hamming's Intent Recognition Quality Framework measures 5 key metrics at scale. Each metric addresses a specific failure mode that small test sets miss.

| Metric | What It Measures | Target |
|---|---|---|
| Intent Classification Accuracy (ICA) | % of utterances correctly classified | >98% (critical domains) |
| Intent Confusion Rate (ICR) | Which intent pairs are confused | <2% per pair |
| Out-of-Scope Detection Rate (OSDR) | Recognition of unhandleable queries | >95% |
| Slot Filling Accuracy (SFA) | Entity extraction accuracy | >98% (critical), >90% (non-critical) |
| First-Turn Intent Accuracy (FTIA) | First-turn intent correctness | >97% |

Test at scale (100+ utterances per intent, 10K+ total) to catch rare but critical confusion patterns that small evaluation sets miss.

These targets come from production deployments, not theory. The ICA threshold in particular is controversial - some teams ship at 93% and do fine, others need 99% because their error recovery is terrible. Know your fallback before picking your threshold.

Metric 1: Intent Classification Accuracy (ICA)

Intent Classification Accuracy measures the percentage of utterances correctly classified to their intended intent.

ICA Formula

ICA = Correct Classifications / Total Utterances × 100

Worked Example

| Input | Value |
|---|---|
| Test set size | 10,000 utterances |
| Number of intents | 50 |
| Utterances per intent | 200 |
| Correct classifications | 9,450 |
| ICA | 9,450 / 10,000 × 100 = 94.5% |

Interpretation: 94.5% ICA is acceptable for customer support but needs improvement for banking or healthcare.
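Computing ICA is straightforward once ground-truth labels sit next to predictions. Here's a minimal sketch, assuming your evaluation run produces simple (true_intent, predicted_intent) pairs; that structure is hypothetical, so adapt it to however your own logs are stored:

```python
from collections import defaultdict

def intent_classification_accuracy(results):
    """ICA plus a per-intent breakdown from (true_intent, predicted_intent) pairs."""
    correct = sum(1 for true, pred in results if true == pred)
    ica = correct / len(results) * 100

    per_intent = defaultdict(lambda: [0, 0])   # intent -> [correct, total]
    for true, pred in results:
        per_intent[true][1] += 1
        if true == pred:
            per_intent[true][0] += 1

    breakdown = {intent: c / t * 100 for intent, (c, t) in per_intent.items()}
    return ica, breakdown

# Toy data, not production results.
results = [
    ("book_appointment", "book_appointment"),
    ("reschedule", "book_appointment"),   # confusion
    ("cancel", "cancel"),
    ("check_status", "check_status"),
]
ica, breakdown = intent_classification_accuracy(results)
print(f"ICA: {ica:.1f}%")                                  # 75.0% on this toy set
for intent, acc in sorted(breakdown.items(), key=lambda kv: kv[1]):
    print(f"  {intent}: {acc:.0f}%")                       # lowest-accuracy intents first
```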

ICA Benchmarks

| Rating | ICA | Production Readiness |
|---|---|---|
| Excellent | >98% | Ship to critical domains |
| Good | 95-98% | Ship with monitoring |
| Acceptable | 90-95% | Ship only with human fallback |
| Poor | <90% | Not production ready |

How to Improve ICA

  1. Add more training examples for low-performing intents
  2. Review confusion patterns and add disambiguating examples
  3. Consider merging similar intents that users don't distinguish
  4. Test with ASR output, not just clean text

Metric 2: Intent Confusion Rate (ICR)

Intent Confusion Rate identifies which specific intent pairs are commonly confused. This is more actionable than aggregate accuracy because it tells you exactly what to fix.

ICR Analysis: Confusion Matrix Example

| True Intent | book_appointment | reschedule | cancel | check_status |
|---|---|---|---|---|
| book_appointment | 94% | 4% | 1% | 1% |
| reschedule | 8% | 88% | 3% | 1% |
| cancel | 2% | 5% | 91% | 2% |
| check_status | 1% | 2% | 2% | 95% |

Insights from this matrix:

  • reschedule → book_appointment confusion at 8% is critical (users say "reschedule" but the system books new)
  • cancel → reschedule confusion at 5% needs attention
  • check_status performs well with minimal confusion
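You can derive the same per-pair view programmatically and flag anything above the 2% target. A sketch, assuming the same hypothetical (true, predicted) pairs as in the ICA example:

```python
from collections import Counter

def intent_confusion_rates(results, threshold=0.02):
    """Pairwise confusion rates from (true_intent, predicted_intent) pairs.

    Returns the pairs whose confusion rate exceeds `threshold`
    (2% here, matching the ICR target above).
    """
    totals = Counter(true for true, _ in results)
    confusions = Counter((true, pred) for true, pred in results if true != pred)

    flagged = {}
    for (true, pred), count in confusions.items():
        rate = count / totals[true]
        if rate > threshold:
            flagged[(true, pred)] = rate
    return flagged

# Toy data: 8% of "reschedule" utterances predicted as "book_appointment".
results = ([("reschedule", "book_appointment")] * 8
           + [("reschedule", "reschedule")] * 92)
for (true, pred), rate in intent_confusion_rates(results).items():
    print(f"{true} -> {pred}: {rate:.0%} confusion (above the 2% target)")
```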

ICR Threshold

Target <2% confusion rate for any intent pair. Higher confusion indicates:

  • Overlapping training data
  • Similar phrasing between intents
  • Need for disambiguation examples

How to Reduce ICR

  1. Add intent-distinguishing examples: "I need to change my existing appointment" (reschedule) vs "I need to schedule a new appointment" (book)
  2. Use slot requirements: If user mentions existing appointment ID, bias toward reschedule/cancel
  3. Add confirmation for high-confusion pairs: "Just to confirm, you want to reschedule your existing appointment, not book a new one?"

Metric 3: Out-of-Scope Detection Rate (OSDR)

Out-of-Scope Detection Rate measures how well your agent recognizes queries it can't handle. Low OSDR leads to hallucinations—the agent confidently gives wrong answers to unknown queries.

OSDR Formula

OSDR = Correctly Flagged OOS / Total OOS Utterances × 100

Worked Example

| Input | Value |
|---|---|
| OOS utterances in test set | 500 |
| Correctly flagged as OOS | 475 |
| Misclassified as in-scope | 25 |
| OSDR | 475 / 500 × 100 = 95% |

Risk Analysis:

  • 25 utterances (5%) will trigger incorrect intents
  • At 1,000 calls/day with 10% OOS queries: 5 hallucinations per day
  • User trust impact: High (agent confidently answers questions it shouldn't)
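One way to score OSDR is to treat an utterance as correctly flagged when the agent either picked its fallback intent or came back below your routing confidence threshold. Here's a sketch under that assumption; the fallback intent name and threshold are illustrative:

```python
def out_of_scope_detection_rate(oos_results, fallback_intent="unknown",
                                confidence_threshold=0.5):
    """OSDR over a test set containing only out-of-scope utterances.

    `oos_results` is a hypothetical list of (predicted_intent, confidence)
    pairs for utterances the agent *should* reject. An utterance counts as
    correctly flagged if the agent picked the fallback intent or its
    confidence fell below the routing threshold.
    """
    flagged = sum(
        1 for intent, conf in oos_results
        if intent == fallback_intent or conf < confidence_threshold
    )
    return flagged / len(oos_results) * 100

# Worked example above: 475 of 500 OOS utterances correctly flagged.
oos_results = [("unknown", 0.90)] * 475 + [("check_balance", 0.73)] * 25
print(f"OSDR: {out_of_scope_detection_rate(oos_results):.0f}%")   # 95%
```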

OSDR Benchmarks

| Rating | OSDR | Hallucination Risk |
|---|---|---|
| Excellent | >95% | Minimal |
| Good | 90-95% | Low, monitor closely |
| Acceptable | 80-90% | Moderate, needs improvement |
| Poor | <80% | High, not production ready |

Why OSDR Matters

In our datasets, poor OSDR consistently shows up as a top driver of abandonment alongside conversational flow issues. Users expect agents to know their limits. An agent that says "I can't help with that, but I can connect you to someone who can" builds trust. An agent that confidently mishandles queries destroys it.

How to Improve OSDR

  1. Add diverse OOS examples to training (at least 500)
  2. Include edge cases: weather, sports, unrelated questions
  3. Set confidence thresholds: Route low-confidence predictions to fallback
  4. Test with real user queries that fell outside expected patterns

Metric 4: Slot Filling Accuracy (SFA)

Slot Filling Accuracy measures how well your agent extracts entities from utterances. Getting the intent right but the slot wrong still breaks the conversation.

SFA Formula

SFA = Correct Slot Extractions / Total Required Slots × 100

Critical vs Non-Critical Slots

| Slot Type | Examples | Target SFA |
|---|---|---|
| Critical | Account numbers, medication names, dollar amounts | >98% |
| Non-Critical | Preferred contact time, reason for call | >90% |

Worked Example

| Utterance | Expected Slots | Extracted Slots | Correct? |
|---|---|---|---|
| "Book for Tuesday at 3pm" | date: Tuesday, time: 3pm | date: Tuesday, time: 3pm | ✓ ✓ |
| "Reschedule to next Friday" | date: next Friday | date: Friday (missing "next") | ✗ |
| "Transfer $500 to savings" | amount: $500, account: savings | amount: $500, account: savings | ✓ ✓ |

SFA Calculation: 4 correct / 5 total slots = 80% (needs improvement)
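Scoring SFA by hand gets tedious quickly, so it's worth splitting critical and non-critical slots in code. A minimal sketch, assuming expected and extracted slots are plain dicts and that the critical-slot set is something you define per domain:

```python
def slot_filling_accuracy(cases, critical_slots=frozenset({"date", "amount", "account"})):
    """SFA split into critical vs non-critical slots.

    `cases` is a hypothetical list of (expected, extracted) slot dicts; the
    default critical-slot set is illustrative, not a recommendation.
    """
    tallies = {"critical": [0, 0], "non_critical": [0, 0]}   # [correct, total]
    for expected, extracted in cases:
        for slot, value in expected.items():
            bucket = "critical" if slot in critical_slots else "non_critical"
            tallies[bucket][1] += 1
            if extracted.get(slot) == value:
                tallies[bucket][0] += 1
    return {b: (c / t * 100 if t else None) for b, (c, t) in tallies.items()}

# The three rows from the worked example above (4 of 5 slots correct overall).
cases = [
    ({"date": "Tuesday", "time": "3pm"}, {"date": "Tuesday", "time": "3pm"}),
    ({"date": "next Friday"}, {"date": "Friday"}),                        # missing "next"
    ({"amount": "$500", "account": "savings"}, {"amount": "$500", "account": "savings"}),
]
print(slot_filling_accuracy(cases))   # {'critical': 75.0, 'non_critical': 100.0}
```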

Common SFA Failures

| Failure Type | Example | Fix |
|---|---|---|
| Date normalization | "Next Friday" → "Friday" | Add relative date handling |
| Amount parsing | "$5,000" → "$5" (truncation) | Improve number parsing |
| Name extraction | "John Smith Jr." → "John Smith" | Include suffixes in entity training |
| Partial extraction | "Account ending in 1234" → null | Add partial reference patterns |

Metric 5: First-Turn Intent Accuracy (FTIA)

First-Turn Intent Accuracy measures whether the conversation starts on the right path. Wrong first-turn intent leads to misrouted conversations that are difficult to recover.

FTIA Formula

FTIA = Correct First-Turn Intents / Total Conversations × 100

Worked Example

| Input | Value |
|---|---|
| Total conversations | 1,000 |
| Correct first-turn intent | 970 |
| Wrong first-turn intent | 30 |
| FTIA | 970 / 1,000 × 100 = 97% |

Impact: 30 conversations (3%) start on wrong path, requiring recovery or abandonment.
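If your call logs record whether the first-turn intent was right and whether the caller abandoned, FTIA and the abandonment split take a few lines to compute. A sketch with hypothetical field names and made-up toy numbers:

```python
def first_turn_intent_accuracy(conversations):
    """FTIA plus abandonment rates split by first-turn correctness.

    `conversations` is a hypothetical list of dicts with boolean
    `first_turn_correct` and `abandoned` fields; adapt the names to
    whatever your call logs actually record.
    """
    correct = [c for c in conversations if c["first_turn_correct"]]
    wrong = [c for c in conversations if not c["first_turn_correct"]]

    def abandonment(group):
        return sum(c["abandoned"] for c in group) / len(group) * 100 if group else 0.0

    ftia = len(correct) / len(conversations) * 100
    return ftia, abandonment(correct), abandonment(wrong)

# Toy numbers, not production data.
conversations = (
    [{"first_turn_correct": True, "abandoned": False}] * 920
    + [{"first_turn_correct": True, "abandoned": True}] * 50
    + [{"first_turn_correct": False, "abandoned": True}] * 6
    + [{"first_turn_correct": False, "abandoned": False}] * 24
)
ftia, ok_rate, wrong_rate = first_turn_intent_accuracy(conversations)
print(f"FTIA: {ftia:.1f}%")                                        # 97.0%
print(f"Abandonment after correct first turn: {ok_rate:.1f}%")     # ~5.2%
print(f"Abandonment after wrong first turn:   {wrong_rate:.1f}%")  # 20.0%
```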

FTIA Benchmarks

| Rating | FTIA | Conversation Impact |
|---|---|---|
| Excellent | >97% | Smooth starts, high completion |
| Good | 93-97% | Occasional misdirection |
| Acceptable | 88-93% | Frequent recovery needed |
| Poor | <88% | High abandonment risk |

Why First Turn Matters Most

We noticed something striking in our data: conversations with incorrect first-turn intent have ~4x higher abandonment rates than those that start correctly. Users form impressions fast. A wrong first-turn signals "this agent doesn't understand me" and colors the entire interaction.

Recovery is possible but costly:

  • Extra turns to correct the path
  • User frustration ("That's not what I said")
  • Increased cognitive load for both user and agent

Intent Testing at Scale: Methodology

Testing intent recognition at scale requires systematic coverage. Small test sets (10-20 examples per intent) miss the confusion patterns that emerge in production.

Step 1: Build the Intent Utterance Matrix

Goal: 100+ utterances per intent with comprehensive variations

| Intent | Count | Variation Types |
|---|---|---|
| book_appointment | 150 | Formal, casual, accented, noisy, indirect |
| reschedule | 120 | Direct, indirect, with context, with reason |
| cancel | 130 | Polite, urgent, with/without reason |
| check_status | 140 | Various question phrasings |
| update_information | 125 | Field-specific variations |

Variation types to include:

  • Formal: "I would like to schedule an appointment"
  • Casual: "Can I book something?"
  • Indirect: "I need to see someone about my account"
  • With context: "After my last visit, I need another appointment"
  • Accented: Speech variations in audio (synthetic or real)
  • Noisy: Background noise injected at 10-20dB SNR
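It helps to keep the utterance matrix in a structured, labeled form from day one so the same rows can drive every metric later. Here's a minimal sketch of one way to represent it; the field names are illustrative, not a required schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TestUtterance:
    """One labeled row of the intent utterance matrix."""
    text: str                         # utterance as the user would phrase it
    intent: str                       # ground-truth intent label
    variation: str                    # formal / casual / indirect / accented / noisy ...
    slots: dict = field(default_factory=dict)    # expected slot values, if any
    audio_path: Optional[str] = None  # synthetic or recorded audio for ASR-in-the-loop runs

utterance_matrix = [
    TestUtterance("I would like to schedule an appointment", "book_appointment", "formal"),
    TestUtterance("Can I book something?", "book_appointment", "casual"),
    TestUtterance("I need to see someone about my account", "book_appointment", "indirect"),
    TestUtterance("Book for Tuesday at 3pm", "book_appointment", "casual",
                  slots={"date": "Tuesday", "time": "3pm"}),
]
print(f"{len(utterance_matrix)} labeled utterances (aim for 100+ per intent)")
```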

Step 2: Create Confusion Test Sets

Goal: Intentionally test boundary cases and ambiguous utterances

| Test Type | Example Utterances | What It Tests |
|---|---|---|
| Similar intents | "book" vs "reschedule" | Intent boundary clarity |
| Negations | "I don't want to cancel" | Negation handling (≠ cancel) |
| Compound | "Book appointment and also cancel my old one" | Multi-intent detection |
| Ambiguous | "Change my appointment" | Disambiguation (reschedule? modify?) |
| Implied | "I need to talk to someone" | Vague intent handling |
| Out-of-scope | "What's the weather?" | OOS detection |

Step 3: Run Automated Evaluation

Evaluation Pipeline:

1. Submit 10,000+ utterances to voice agent
2. Collect predicted intents + confidence scores
3. Compare to ground truth labels
4. Calculate all 5 metrics (ICA, ICR, OSDR, SFA, FTIA)
5. Generate confusion matrix
6. Flag problem areas (bottom 10% intents, confused pairs)
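The pipeline above is mostly orchestration. Here's a skeleton of the loop, where classify_utterance is a placeholder for however you actually submit utterances to your agent (API call, simulated phone call, etc.) and is assumed to return a (predicted_intent, confidence) pair:

```python
from collections import namedtuple

TestCase = namedtuple("TestCase", "text intent variation")

def run_evaluation(test_set, classify_utterance):
    """Skeleton evaluation loop: submit, collect, score."""
    results = []
    for case in test_set:
        predicted, confidence = classify_utterance(case.text)
        results.append({"true": case.intent, "predicted": predicted,
                        "confidence": confidence, "variation": case.variation})

    correct = sum(1 for r in results if r["true"] == r["predicted"])
    return {
        "ica": correct / len(results) * 100,
        "low_confidence": sum(1 for r in results if r["confidence"] < 0.5),
        "results": results,   # keep raw rows for confusion-matrix and slot analysis
    }

# Usage with a stub classifier; swap in your real agent integration.
test_set = [
    TestCase("I would like to schedule an appointment", "book_appointment", "formal"),
    TestCase("I need to change my existing appointment", "reschedule", "casual"),
]
report = run_evaluation(test_set, lambda text: ("book_appointment", 0.91))
print(f"ICA on stub run: {report['ica']:.0f}%")   # 50%
```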

Step 4: Analyze and Iterate

Improvement Loop:

  1. Identify bottom 10% intents by accuracy
  2. Analyze confusion patterns from matrix
  3. Add disambiguating training data to NLU model
  4. Re-test with same utterances to measure improvement
  5. Repeat until all metrics meet thresholds

Common patterns to look for:

  • Intents with <90% accuracy → Need more training data
  • Intent pairs with >5% confusion → Need disambiguation
  • OOS utterances classified in-scope → Need more OOS examples
  • Slot extraction failures → Need entity training

Intent Recognition Benchmarks by Domain

Different industries have different tolerance for intent errors. Banking and healthcare have near-zero tolerance. Customer support has more flexibility because human fallback is available.

| Domain | ICA Target | OSDR Target | Critical Slots | SFA Target | Notes |
|---|---|---|---|---|---|
| Banking | >98% | >95% | Account #, amounts, dates | >99% | Zero tolerance for financial errors |
| Healthcare | >97% | >98% | Medications, dosages, dates | >98% | Safety critical, high liability |
| E-commerce | >95% | >90% | Products, quantities, addresses | >95% | Speed over perfection, recoverable |
| Customer Support | >93% | >85% | Issue type, urgency, account ID | >90% | Wide intent variety, human fallback |
| Appointment Booking | >96% | >92% | Date, time, type, location | >97% | Clear intent boundaries, high expectations |
| Insurance | >97% | >93% | Policy #, claim details, dates | >98% | Regulatory compliance, accuracy critical |

Why domains differ:

  • Banking/Healthcare: Errors have severe consequences (financial loss, patient harm)
  • E-commerce: Speed matters, errors can be corrected in checkout flow
  • Customer Support: Human agents available for escalation, wide intent variety makes perfection impossible
  • Appointment Booking: Clear, well-defined intents with high user expectations

The tolerance for intent errors depends entirely on what happens next. A wrong appointment type is recoverable—the agent can clarify. A wrong medication name is not—it's a compliance incident waiting to happen.

Fair warning: these benchmarks assume you've defined your intents well. If your "check_balance" intent is actually covering 15 different ways to ask about account status, that 98% target becomes much harder to hit. We've seen teams waste months chasing accuracy numbers when the real problem was intent architecture.

Intent Architecture Best Practices

Good intent architecture makes recognition easier. Poor architecture creates inherent confusion.

Intent Granularity

Too coarse (avoid):

  • ❌ "customer_service" (too broad, hard to route)
  • ❌ "account_management" (encompasses too many actions)

Too fine (avoid):

  • ❌ "book_monday_morning_appointment" (too specific, insufficient data)
  • ❌ "reschedule_to_next_week" (conflates intent with slot value)

Just right (target):

  • ✅ "book_appointment" + slots (date, time, type)
  • ✅ "reschedule_appointment" + slots (new date, reason)
  • ✅ "cancel_appointment" + slots (appointment_id, reason)

Rule of thumb: If you can't collect 100+ training examples, the intent is too specific.

There's an unresolved tension here: finer intents give you better routing control but require more data and create more confusion opportunities. Coarser intents are easier to train but push complexity downstream. We don't have a universal answer—the right balance depends on your domain, data volume, and error tolerance.

Intent Hierarchy

Level 1: Domain
├── appointments
│   ├── book_appointment
│   ├── reschedule_appointment
│   ├── cancel_appointment
│   └── check_appointment_status
├── account
│   ├── check_balance
│   ├── update_information
│   ├── close_account
│   └── report_fraud
└── support
    ├── report_issue
    ├── get_help
    └── provide_feedback

Benefits of hierarchy:

  • Fallback routing: If "reschedule_appointment" fails, route to "appointments" domain handler
  • Cleaner confusion analysis: Group related intents
  • Easier expansion: Add new intents within existing domains
  • Multi-level confidence: Check domain first, then specific intent
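Multi-level confidence is easy to prototype once the hierarchy is explicit. Here's a routing sketch; the hierarchy mirrors the tree above, and the thresholds are illustrative, not benchmarks:

```python
# Illustrative two-level hierarchy mirroring the tree above.
INTENT_HIERARCHY = {
    "appointments": {"book_appointment", "reschedule_appointment",
                     "cancel_appointment", "check_appointment_status"},
    "account": {"check_balance", "update_information", "close_account", "report_fraud"},
    "support": {"report_issue", "get_help", "provide_feedback"},
}

def route(predicted_intent, confidence, intent_threshold=0.7, domain_threshold=0.5):
    """Two-level routing: trust the specific intent when confidence is high,
    fall back to the parent domain handler when it's middling, and hand off
    to a generic fallback otherwise."""
    domain = next(
        (d for d, intents in INTENT_HIERARCHY.items() if predicted_intent in intents),
        None,
    )
    if domain and confidence >= intent_threshold:
        return ("intent_handler", predicted_intent)
    if domain and confidence >= domain_threshold:
        return ("domain_handler", domain)   # e.g. clarify within "appointments"
    return ("fallback", None)               # "I'm not sure I understood..."

print(route("reschedule_appointment", 0.82))  # ('intent_handler', 'reschedule_appointment')
print(route("reschedule_appointment", 0.61))  # ('domain_handler', 'appointments')
print(route("reschedule_appointment", 0.31))  # ('fallback', None)
```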

Handling Low-Confidence Predictions

| Confidence Score | Action | Example Response |
|---|---|---|
| >0.9 | Execute intent directly | [Proceed with booking] |
| 0.7-0.9 | Confirm with user | "Just to confirm, you want to book an appointment?" |
| 0.5-0.7 | Offer top 2-3 options | "Did you want to book, reschedule, or check on an appointment?" |
| <0.5 | Route to fallback | "I'm not sure I understood. Could you rephrase that?" |

Never execute low-confidence intents directly in critical domains (banking, healthcare). Always confirm.
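The table above maps directly onto a small routing function. A sketch: the band boundaries follow the table, the alternatives list is a hypothetical output of your NLU model's n-best results, and the critical-domain flag enforces the "always confirm" rule:

```python
def choose_action(intent, confidence, alternatives, critical_domain=False):
    """Map a confidence score onto the action bands in the table above."""
    if confidence > 0.9 and not critical_domain:
        return ("execute", intent)
    if confidence > 0.7:
        return ("confirm", f"Just to confirm, you want to {intent.replace('_', ' ')}?")
    if confidence > 0.5:
        return ("disambiguate", [intent] + list(alternatives)[:2])
    return ("fallback", "I'm not sure I understood. Could you rephrase that?")

print(choose_action("book_appointment", 0.94, ["reschedule"]))
print(choose_action("book_appointment", 0.94, ["reschedule"], critical_domain=True))
print(choose_action("book_appointment", 0.62, ["reschedule", "check_status"]))
```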

Common Intent Recognition Failures

| Failure Type | Symptom | Root Cause | Fix |
|---|---|---|---|
| Similar intent confusion | "reschedule" classified as "book" 15% of time | Overlapping training data | Add disambiguating examples |
| OOS misclassification | User asks "What's the weather?" → Agent books appointment | No fallback intent | Add "unknown" intent with diverse OOS examples |
| Slot extraction errors | Dates extracted wrong, names misspelled | Entity recognition gaps | Expand entity training, add validation |
| First-turn failures | Greetings misclassified as intents | Incomplete greeting handling | Add greeting variations to training |
| Compound intent misses | "Book and also cancel" → Only books | Single-intent assumption | Support multi-intent parsing or sequential clarification |
| Negation failures | "I don't want to cancel" → Classified as cancel | Negation not handled | Add negation examples to training |
| Context-dependent errors | "Change it to tomorrow" without prior context | No conversation memory | Implement context tracking |

How to Implement Scale Testing

Follow this 4-step process to implement intent recognition testing at scale:

Step 1: Set Up Your Test Infrastructure

  • Choose automated testing platform (Hamming or equivalent)
  • Configure voice agent integration (API or call simulation)
  • Set up metrics collection and reporting
  • Timeline: 1-2 days

Step 2: Build Your Utterance Matrix

  • List all intents in your voice agent (aim for 20-100 intents)
  • Generate 100+ utterances per intent using:
    • Real user transcripts (best source)
    • Synthetic generation with variations
    • Team brainstorming sessions
  • Include all variation types (formal, casual, accented, noisy)
  • Label ground truth intents and slots
  • Timeline: 1-2 weeks

Step 3: Run Baseline Evaluation

  • Submit all utterances to voice agent
  • Collect predictions and confidence scores
  • Calculate all 5 metrics (ICA, ICR, OSDR, SFA, FTIA)
  • Generate confusion matrix
  • Identify problem areas
  • Timeline: 1-2 days

Step 4: Iterate Until Thresholds Met

  • For each failing metric:
    • Analyze root cause (confusion matrix, error patterns)
    • Add targeted training data to NLU model
    • Re-test to measure improvement
  • Repeat until all metrics meet domain benchmarks
  • Timeline: 2-4 weeks (iterative)

Total Timeline: 4-7 weeks from setup to production-ready (faster if you already have labeled transcripts; longer if you need to generate and validate new data).

Flaws but Not Dealbreakers

Scale testing for intent recognition isn't perfect. Some limitations worth knowing:

Synthetic utterances never fully match production. No matter how many variations you generate, real users will surprise you. In November, we saw a banking customer say "I need to look at my numbers" meaning account balance—no synthetic test set would include that phrasing. Use production transcripts to continuously expand your test set.

Confusion matrices can mislead. A 5% confusion rate between two intents looks small in aggregate, but if those intents have very different outcomes, the impact is outsized; more on that below.

The 100-utterance rule is arbitrary. Some intents need 300+ examples to catch edge cases; others plateau at 50. The right number depends on intent complexity, ASR variability, and your error tolerance.

Domain benchmarks vary more than tables suggest. A 98% target for banking assumes you've defined banking intents well; as noted earlier, a single intent that quietly absorbs a dozen different ways of asking about account status makes that target much harder to hit.

One more thing we're still figuring out: how to weight confusion by consequence. A 5% confusion rate between "transfer" and "cancel" is way worse than 5% confusion between "check_balance" and "account_summary." We don't have a good formula for this yet. If you do, seriously, email us.


Intent Recognition Checklist

Use this checklist to validate your intent recognition testing:

Scale Coverage:

  • 100+ utterances per intent
  • 10K+ total test utterances
  • All variation types included (formal, casual, accented, noisy)
  • Confusion test sets for boundary cases
  • Out-of-scope test set (500+ examples)

Metrics Tracked:

  • Intent Classification Accuracy (ICA)
  • Intent Confusion Rate (ICR) with confusion matrix
  • Out-of-Scope Detection Rate (OSDR)
  • Slot Filling Accuracy (SFA) by slot type
  • First-Turn Intent Accuracy (FTIA)

Thresholds Met:

  • ICA meets domain benchmark (>98% banking, >93% support)
  • No intent pair with >5% confusion
  • OSDR >90% minimum, >95% target
  • Critical slot SFA >98%
  • FTIA >97%

Continuous Monitoring:

  • Real-time ICA tracking in production
  • Alerts on metric degradation
  • Weekly confusion matrix review
  • Quarterly retesting with updated utterances

Frequently Asked Questions

What is intent recognition in voice agents?

Intent recognition is the process of mapping spoken user utterances to actionable intents (e.g., "book_appointment", "check_balance"). In voice agents, this is more challenging than text chatbots because ASR errors cascade to NLU, creating compound failures. According to Hamming's analysis, voice agents have 3-10x higher intent error rates than text-only systems.

How do you test intent recognition at scale?

Test intent recognition using 100+ utterances per intent (10K+ total) with Hamming's 4-step methodology: (1) Build intent utterance matrix with variations (formal, casual, accented, noisy), (2) Create confusion test sets for boundary cases, (3) Run automated evaluation with metrics tracking, (4) Analyze confusion patterns and iterate. Target >98% accuracy for critical domains.

What is a good intent classification accuracy for voice agents?

According to Hamming's benchmarks: >98% is excellent for critical domains (banking, healthcare), 95-98% is good for most use cases, 90-95% is acceptable for customer support with human fallback, and <90% is not production ready. Voice agents need higher accuracy than text chatbots due to ASR error cascade.

What is out-of-scope detection and why does it matter?

Out-of-scope detection (OSDR) measures how well your voice agent recognizes queries it can't handle. Low OSDR leads to hallucinations—the agent confidently gives wrong answers to unknown queries. Target >95% OSDR to prevent trust-destroying failures. In our datasets, poor OSDR consistently shows up alongside conversational flow as a top driver of user abandonment.

Why is first-turn intent accuracy important?

First-turn intent accuracy (FTIA) determines whether the conversation starts on the right path. Wrong first-turn intent leads to misrouted conversations that are difficult to recover. Based on Hamming's analysis, conversations with incorrect first-turn intent have ~4x higher abandonment rates. Target >97% FTIA for production voice agents.


Ready to test your voice agent's intent recognition at scale?

Hamming runs thousands of test utterances through your voice agent, measures all 5 Intent Recognition Quality Framework metrics, and identifies exactly which intents need work. Stop guessing from small test sets—test at scale before your users find the bugs.

Start your free trial →



Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, grew AI-powered sales program to 100s of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”