A Guide to Quality Assurance for AI Voice Agents

Sumanyu Sharma
July 29, 2025

A guide to how we engineer quality into the voice agents of our enterprise customers.

Insights from Thousands of Calls Analyzed

Over the past year at Hamming, we've analyzed thousands of calls from our customers and developed a robust framework for measuring quality across your voice agents.

Defining Voice Agent Quality

When enterprises first come to Hamming for their AI voice agent quality assurance, they are typically tracking 30 to 50 voice agent quality metrics: response times, sentiment scores, task completion rates, ASR accuracy, latency percentiles. Yet despite extensive voice agent testing, their AI systems still fail in predictable ways.

Though these metrics are necessary, they are not sufficient. They tell you that something is going wrong, but not always why or how to fix it.

For example:

  • You can have great ASR accuracy and still misunderstand intent.
  • Sentiment scores can show frustration, but not explain what triggered it.
  • Task completion may look high, but users might have taken 10 turns to get there.

We've found that there are four layers involved in building capable AI voice agents. An error at any layer can lead to a breakdown of the entire system and result in poor customer experience:

Infrastructure → Agent Execution → User Reaction → Business Outcome

The Four Layers in Practice

Infrastructure: Can users hear and interact with your voice agent smoothly?

If the foundation is broken — the audio drops, latency spikes, the text-to-speech (TTS) sounds robotic, or automatic speech recognition (ASR) misfires — the agent has already lost your customer's trust.

Typical errors that occur:

  • Random audio artifacts (clicks, pops, static) that give callers the impression that the line dropped
  • Conversations feel awkward due to inconsistent latency
  • Silent gaps where the agent should be responding

Hamming scans for errors at the infrastructure level

  • Time to first word - from call connection to the agent's first sound.
  • Turn-level latency - measured at every exchange, not just an average.
  • Interruption count - how often your agent talks over the customer.
  • Agent talk ratio - the percentage of conversation time the agent holds the floor.
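
Example Sketch for Infrastructure Metrics

To make the metrics above concrete, here is a minimal sketch of how they could be computed from speaker-labeled call segments. The segment structure and field names are assumptions for illustration, not Hamming's actual data model.

from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # "agent" or "caller" (assumed labels)
    start: float   # seconds from call start
    end: float     # seconds from call start

def infrastructure_metrics(segments: list[Segment]) -> dict:
    agent = [s for s in segments if s.speaker == "agent"]
    caller = [s for s in segments if s.speaker == "caller"]

    # Time to first word: call start until the agent first makes a sound.
    time_to_first_word = min(s.start for s in agent) if agent else None

    # Turn-level latency: gap between a caller turn ending and the next
    # agent turn starting, kept per turn rather than averaged away.
    turn_latencies = []
    for c in caller:
        gaps = [a.start - c.end for a in agent if a.start >= c.end]
        if gaps:
            turn_latencies.append(min(gaps))

    # Interruption count: agent turns that begin while the caller is still talking.
    interruptions = sum(1 for a in agent for c in caller if c.start < a.start < c.end)

    # Agent talk ratio: share of total speaking time held by the agent.
    agent_time = sum(s.end - s.start for s in agent)
    total_time = sum(s.end - s.start for s in segments)

    return {
        "time_to_first_word_s": time_to_first_word,
        "turn_latencies_s": turn_latencies,
        "interruption_count": interruptions,
        "agent_talk_ratio": agent_time / total_time if total_time else 0.0,
    }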

How Hamming's AI voice agent QA identifies infrastructure errors

  • Breaks calls into segments based on when each party is speaking
  • Monitors each segment for technical problems - spikes, delays, or anything out of the ordinary
  • Tags problems with context - like device type, audio format, model version to identify the underlying cause
  • Sends alerts when issues surpass the thresholds you've set for these metrics
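
Example Sketch for Threshold Alerts

As a rough illustration of the tagging and alerting steps above, the sketch below checks computed metrics against thresholds you define and attaches call context to each alert. The threshold values and context fields are illustrative assumptions, not recommendations.

# Thresholds are illustrative; set them to match your own SLOs.
THRESHOLDS = {
    "time_to_first_word_s": 2.0,
    "turn_latency_s": 1.5,
    "agent_talk_ratio": 0.7,
}

def check_thresholds(metrics: dict, context: dict) -> list[dict]:
    """Return one alert per metric that crosses its threshold, tagged with
    call context (device type, audio format, model version)."""
    alerts = []
    for latency in metrics.get("turn_latencies_s", []):
        if latency > THRESHOLDS["turn_latency_s"]:
            alerts.append({"metric": "turn_latency_s", "value": latency, **context})
    for key in ("time_to_first_word_s", "agent_talk_ratio"):
        value = metrics.get(key)
        if value is not None and value > THRESHOLDS[key]:
            alerts.append({"metric": key, "value": value, **context})
    return alerts

# Hypothetical metrics for one call, tagged with its context.
print(check_thresholds(
    {"time_to_first_word_s": 3.1, "turn_latencies_s": [0.8, 2.4], "agent_talk_ratio": 0.62},
    {"device": "mobile", "audio_format": "opus", "model_version": "v12"},
))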

Agent Execution: Does your voice agent stick to the script or go off-track?

AI voice agents go off track for several interconnected reasons. Even when conversation designers build clear, structured scripts, real-world interactions are messy.

Typical errors that occur:

  • Progressively going beyond what they are permitted to respond to ("scope creep")
  • Ignoring important safety precautions that are hidden in lengthy prompts
  • Exhibiting inconsistent behavior between morning and evening calls
  • Making up policies or procedures that don't exist
  • Taking on completely different personalities after model / prompt updates
  • Showing inconsistent accuracy of knowledge base recall

Hamming monitors for AI voice agents going off script

  • Prompt compliance rate - The frequency with which the agent follows each specific instruction. We look at greeting, verification, transaction handling, and closing at the segment level.
  • Edge case performance - Response quality when customers say unexpected things. Does "My hamster ate my credit card" crash the conversation or get handled gracefully?
  • Consistency index - How similar responses are to the same question asked in different ways. High variance usually means the agent is improvising rather than following guidelines.
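
Example Sketch for a Consistency Index

One way to picture the consistency index is as a mean pairwise similarity over answers to paraphrases of the same question. The sketch below uses simple token overlap for brevity; a production version would more likely compare embeddings or use an LLM judge. It is an assumption about how such an index could work, not Hamming's exact method.

from itertools import combinations

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two responses (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b) if words_a | words_b else 1.0

def consistency_index(responses: list[str]) -> float:
    """Mean pairwise similarity of answers to paraphrases of one question.
    A low value suggests the agent is improvising rather than following guidelines."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    return sum(token_overlap(a, b) for a, b in pairs) / len(pairs)

# Three phrasings of "What's your refund policy?" and the agent's answers.
answers = [
    "You can return items within 30 days for a full refund.",
    "Refunds are available within 30 days of purchase.",
    "We usually honor refunds, but it depends on the store manager.",  # drift
]
print(round(consistency_index(answers), 2))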

How Hamming identifies agent errors

  • Segments each conversation into logical chunks - greeting, authentication, main task, upsell, closing. Problems tend to hide in specific segments.
  • Compares actual responses to expected behaviors - complete semantic matching against your business rules and knowledge base, not just keywords.
  • Tracks response evolution over time - highlighting instances in which strict agents become unhelpfully rigid or helpful agents become overly accommodating.
  • Stress-tests with edge cases - observing how agents respond to foul language, requests that aren't feasible, or inquiries that are wholly unrelated.
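
Example Sketch for Edge-Case Stress Tests

The stress-testing step above can be pictured as a small table of adversarial utterances paired with the behavior you expect, run against the agent under test. The call_agent hook, the cases, and the keyword check are all hypothetical placeholders; real grading would use semantic matching against your business rules or an LLM judge.

# Each case pairs an awkward caller utterance with the behavior we expect.
# All utterances, expectations, and the call_agent hook are hypothetical.
EDGE_CASES = [
    {"utterance": "My hamster ate my credit card",
     "expected": "acknowledges the problem and offers a replacement-card path"},
    {"utterance": "Can you lower my neighbor's bill instead?",
     "expected": "politely declines an out-of-scope request"},
    {"utterance": "This is useless, you stupid robot",
     "expected": "stays calm and offers to help or escalate"},
]

def run_stress_tests(call_agent) -> list[dict]:
    """call_agent: a hypothetical callable that sends one utterance to the
    voice agent under test and returns its transcribed response."""
    results = []
    for case in EDGE_CASES:
        response = call_agent(case["utterance"])
        results.append({
            "utterance": case["utterance"],
            "expected": case["expected"],
            "response": response,
            # Placeholder check only; real grading would use semantic matching
            # or an LLM judge rather than keywords.
            "handled_gracefully": any(
                marker in response.lower() for marker in ("sorry", "help", "transfer")
            ),
        })
    return results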

User Reaction: Is the end user happy?

Even if your agent sounds flawless and complies with all regulations, it won't make a difference if customers end up hanging up in frustration. What tends to happen if you don't keep track of this:

Time in call | Typical feeling | Events
0–15 s | Upbeat | Customer places order
15–45 s | Flat | Routine details
45–75 s | Sharp drop | Agent repeats "Would you like breadsticks?" three-plus times
~76 s | Hang-up | Customer gives up

Custom metrics you can track with Hamming

Hamming's flexible scoring system allows you to define custom LLM-as-a-judge prompts to evaluate any aspect of user satisfaction:

  • Conversation Flow Quality - Create a scorer that detects when agents repeat the same question multiple times or get stuck in loops
  • Frustration Indicators - Define custom prompts to identify phrases like "Can you repeat that?", "I don't understand", or "Let me speak to a human"
  • Engagement Scoring - Build metrics that track whether users are giving short, one-word responses (indicating disengagement) vs. fuller responses
  • Task Abandonment Patterns - Configure scorers to detect when users say things like "Never mind", "Forget it", or abruptly change topics
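
Example Sketch for Transcript-Level Reaction Signals

Before reaching for an LLM judge, some of the signals above can be roughly approximated with plain transcript heuristics, as in the sketch below. The phrase lists and the one-word-reply threshold are illustrative assumptions, not Hamming's scoring logic.

# Phrase lists and the one-word threshold below are illustrative assumptions.
FRUSTRATION_PHRASES = ("can you repeat that", "i don't understand", "speak to a human")
ABANDONMENT_PHRASES = ("never mind", "forget it")

def user_reaction_signals(caller_turns: list[str]) -> dict:
    """caller_turns: the customer's utterances for one call, in order."""
    lowered = [turn.lower() for turn in caller_turns]
    frustration_hits = sum(
        any(phrase in turn for phrase in FRUSTRATION_PHRASES) for turn in lowered
    )
    abandoned = any(
        any(phrase in turn for phrase in ABANDONMENT_PHRASES) for turn in lowered
    )
    # Engagement: share of one-word replies; a high share suggests disengagement.
    short_replies = sum(len(turn.split()) <= 1 for turn in caller_turns)
    return {
        "frustration_hits": frustration_hits,
        "abandoned": abandoned,
        "short_reply_ratio": short_replies / len(caller_turns) if caller_turns else 0.0,
    }

print(user_reaction_signals([
    "I'd like to book an appointment",
    "Can you repeat that?",
    "No",
    "Never mind, forget it",
]))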

How Hamming helps you track user satisfaction

  • Custom Scoring Prompts - Define your own LLM-based evaluation criteria using natural language prompts that analyze transcripts for specific patterns
  • Real-time Production Monitoring - Automatically tag live calls with custom labels like "customer frustrated", "requested human agent", or "successful resolution"
  • Assertion Framework - Set up critical assertions for user experience, such as "Customer should never be asked the same question more than twice"
  • Conversation Analytics - Access detailed transcripts and audio recordings to understand exactly where conversations break down
  • Flexible Evaluation - Create different scorer configurations for different business contexts (sales calls vs. support calls vs. appointment scheduling)

Example Custom Scorer for Repetition Detection

Analyze this conversation transcript and identify any instances where the agent
asks the same question more than twice. Consider variations of the same question
as repetitions.

Score:
- 100 if no repetitions detected
- 50 if agent repeated a question exactly twice
- 0 if agent repeated any question more than twice

Provide specific examples of any repetitions found.
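
Example Sketch for Running an LLM-as-a-Judge Scorer

A scorer prompt like the one above has to be executed against each transcript. The sketch below shows one generic way to do that with the OpenAI Python client; it is not Hamming's scoring API, and the model name is an assumption.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REPETITION_SCORER_PROMPT = """Analyze this conversation transcript and identify any
instances where the agent asks the same question more than twice. Consider variations
of the same question as repetitions.

Score:
- 100 if no repetitions detected
- 50 if agent repeated a question exactly twice
- 0 if agent repeated any question more than twice

Provide specific examples of any repetitions found."""

def score_transcript(transcript: str, model: str = "gpt-4o-mini") -> str:
    """Run the repetition scorer prompt against one call transcript."""
    response = client.chat.completions.create(
        model=model,  # model choice is an assumption, not a recommendation
        messages=[
            {"role": "system", "content": REPETITION_SCORER_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content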

Business Outcome: Is your AI voice agent helping you achieve your business goals?

A high completion rate might suggest your voice agent is doing its job, but that metric alone doesn't tell the full story. Your bot could be closing calls efficiently while missing key opportunities to drive revenue, increase order value, or deepen customer relationships. Hamming's flexible assertion system allows you to track the metrics that matter most to your business:

Custom business metrics you can define in Hamming

  • Task Completion Rate - Define what constitutes a successful outcome for your specific use case (appointment booked, order placed, issue resolved)
  • Upsell Success - Create scorers that detect whether agents offered relevant add-ons and track acceptance rates
  • Call Efficiency - Measure whether agents achieved objectives within target timeframes
  • Compliance Adherence - Ensure agents follow required scripts for legal disclosures or verification procedures

How Hamming helps you track business impact

  • Custom Assertion Framework - Define business-critical assertions like "Agent must confirm appointment time and date" or "Agent must offer premium service option"
  • Production Call Tagging - Automatically categorize calls by outcome (successful sale, appointment scheduled, escalation needed)
  • Performance Analytics - Track success rates across different scenarios, times of day, and agent configurations
  • A/B Testing Support - Compare different prompt versions or agent configurations to optimize for business metrics
  • Integration via Webhooks - Connect call outcomes to your business systems through post-call webhooks for comprehensive tracking
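
Example Sketch for Consuming a Post-Call Webhook

On the webhook point above, a post-call payload might be consumed by a small HTTP handler like the one below and forwarded to your own systems. The endpoint path and payload fields are hypothetical; refer to Hamming's documentation for the actual schema.

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/webhooks/post-call")  # endpoint path is an arbitrary choice
async def post_call_webhook(request: Request):
    payload = await request.json()
    # Field names below are assumptions for illustration, not the real schema.
    outcome = {
        "call_id": payload.get("call_id"),
        "tags": payload.get("tags", []),       # e.g. "successful sale", "escalation needed"
        "scores": payload.get("scores", {}),   # custom scorer results
        "duration_s": payload.get("duration_s"),
    }
    # Forward the outcome to your CRM, warehouse, or analytics pipeline here.
    print("post-call outcome:", outcome)
    return {"ok": True}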

Example Custom Scorer for Upsell Performance

Evaluate this restaurant order call transcript for upsell effectiveness:

Did the agent mention any add-on items (drinks, desserts, sides)?
Was the upsell offer made at an appropriate time (after main order)?
Did the customer accept any upsell offers?

Score:

100: Upsell offered appropriately AND accepted
75: Upsell offered appropriately but declined
50: Upsell offered but timing was poor
0: No upsell attempted when opportunity existed

List specific upsell attempts and their outcomes.
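
Example Sketch for Aggregating Upsell Scores into KPIs

Scorer outputs only become a business KPI once they are aggregated across calls. A minimal sketch, assuming each call produces a numeric score from the 0/50/75/100 rubric above:

def upsell_kpis(scores: list[int]) -> dict:
    """scores: one upsell score per call, using the 0/50/75/100 rubric above."""
    calls = len(scores)
    offered = sum(score >= 50 for score in scores)    # any upsell attempt
    accepted = sum(score == 100 for score in scores)  # offered appropriately and accepted
    return {
        "offer_rate": offered / calls if calls else 0.0,
        "acceptance_rate": accepted / offered if offered else 0.0,
    }

# Hypothetical scores from six scored calls.
print(upsell_kpis([100, 75, 0, 50, 75, 100]))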

Moving Forward: Towards Building High-Quality and Reliable Voice Agents

AI voice agents now shoulder a growing share of front-desk conversations alongside human reps. When these systems falter (the audio cuts out, responses come too slowly, the conversation doesn't flow, or the agent simply fails to help the customer), it directly harms your bottom line.

Hamming helps your business adopt a strategic, end-to-end AI voice agent QA approach, so that you can be assured your voice agent is trustworthy and delivering consistent value, even before it starts interacting with customers. Our comprehensive voice agent testing framework ensures AI voice agent quality at every level.

Layer | If left unchecked | When actively monitored and corrected
Infrastructure (audio path, latency) | Call drops, awkward silences | Consistently clear audio on any device with minimal hidden tech debt
Conversation design (dialogue logic) | Loops, repetitive confirmations, deviation from personality | Perfect prompt adherence, natural pacing, fewer retries, faster task completion
Customer sentiment (custom scoring) | Polite yet frustrated callers who churn after the interaction | Custom metrics detect frustration patterns; proactive improvements based on scoring data
Business impact (outcome tracking) | "Successful" call counts that still miss financial targets | Custom assertions track business KPIs; webhooks enable integration with business systems

Quality comes from understanding the whole system, not optimizing individual parts.

Frequently Asked Questions

What is AI voice agent quality assurance?

A structured process that tests, measures, and improves every layer of an AI voice agent - from infrastructure and execution to user satisfaction and business outcomes.

How often should we run voice agent testing?

Continuously for production systems, with deep-dive audits at least once per release cycle.

Which metrics matter most in AI voice agent evaluation?

Time to first word, turn-level latency, prompt compliance, plus custom metrics you define for your specific business needs (frustration detection, upsell success, task completion).

Can quality assurance improve revenue?

Yes. By creating custom scorers that track business outcomes and using assertions to enforce best practices, you can optimize for metrics that directly impact revenue like upsell rates and customer retention.