A guide to engineering quality in voice agents of our enterprise customers.
Insights from Calls Analyzed
Over the past year at Hamming, we've analyzed thousands of calls from our customers and developed a robust framework for measuring quality across your voice agents.
Defining Voice Agent Quality
When enterprises first come to Hamming for their AI voice agent quality assurance, they typically track between 30-50 voice agent quality metrics: response times, sentiment scores, task completion rates, ASR accuracy, latency percentiles. Yet, despite extensive voice agent testing, their AI systems are still failing in predictable ways.
Though these metrics are necessary, they are not sufficient. They tell you that something is going wrong, but not always why or how to fix it.
For example,
- You can have great ASR accuracy and still misunderstand intent.
- Sentiment scores can show frustration, but not explain what triggered it.
- Task completion may look high, but users might have taken 10 turns to get there.
We've found that there are four layers involved in building capable AI voice agents. An error at any layer can lead to a breakdown of the entire system and result in poor customer experience:
Infrastructure → Agent Execution → User Reaction → Business Outcome
The Four Layers in Practice
Infrastructure: Can users hear and interact with your voice agent smoothly?
If the foundation is broken — the audio drops, latency lags, the text-to-speech (TTS) sounds robotic, automatic speech recognition (ASR) is misfiring — the agent has already lost your customer's trust.
Typical errors that occur:
- Random audio artifacts (clicks, pops, static) that give callers the impression that the line dropped
- Conversations feel awkward due to inconsistent latency
- Silent gaps where the agent should be responding
Hamming scans for errors at the infrastructure level
- Time to first word - from call to the agent's first sound.
- Turn-level latency - measured at every exchange, not just an average.
- Interruption count - the frequency with which your agent talks over the customer
- Agent Talk Ratio - percentage of conversation time the agent holds the floor
How Hamming's AI voice agent QA identifies infrastructure errors
- Breaks calls into segments according to the time when people are speaking
- Monitors each segment for technical problems - spikes, delays, or anything out of the ordinary
- Tags problems with context - like device type, audio format, model version to identify the underlying cause
- Sends alerts when issues surpass the thresholds you've set for these metrics
Agent Execution: Does Your Voice Agent Stick to the Script or Go Off-Track?
AI voice agents go off track for several interconnected reasons. Even when conversation designers build clear, structured scripts, real-world interactions are messy.
Typical errors that occur:
- Progressively going beyond what they are permitted to respond to ("scope creep").
- Ignoring important safety precautions that are hidden in lengthy prompts
- Exhibiting inconsistent behavior between morning and evening calls
- Making up policies or procedures that don't exist
- Taking on completely different personalities after model / prompt updates
- Showing inconsistent accuracy of knowledge base recall
Hamming monitors for AI voice agents going off script
- Prompt compliance rate - The frequency with which the agent follows each specific instruction. We look at greeting, verification, transaction handling, and closing at the segment level.
- Edge case performance - Response quality when customers say unexpected things. Does "My hamster ate my credit card" crash the conversation or get handled gracefully?
- Consistency index - How similar responses are to the same question asked in different ways. High variance usually means the agent is improvising rather than following guidelines.
How Hamming identifies agent errors
- Segments each conversation into logical chunks - greeting, authentication, main task, upsell, closing. Problems often tend to hide in specific segments.
- Compares actual responses to expected behaviors - complete semantic matching against your business rules and knowledge base, not just keywords.
- Tracks response evolution over time - highlighting instances in which strict agents become unhelpfully rigid or helpful agents become overly helpful.
- Stress-tests with edge cases - observing how agents respond to foul language, requests that aren't feasible, or inquiries that are wholly unrelated.
User Reaction: Is the end user happy?
Even if your agent sounds flawless and complies with all regulations, it won't make a difference if customers end up hanging up in frustration. What tends to happen if you don't keep track of this:
Time in call | Typical feeling | Events |
---|---|---|
0 – 15 s | Upbeat | Customer places order |
15 – 45 s | Flat | Routine details |
45 – 75 s | Sharp drop | Agent repeats "Would you like breadsticks?" three-plus times |
~76 s | Hang-up | Customer gives up |
Custom metrics you can track with Hamming
Hamming's flexible scoring system allows you to define custom LLM-as-a-judge prompts to evaluate any aspect of user satisfaction:
- Conversation Flow Quality - Create a scorer that detects when agents repeat the same question multiple times or get stuck in loops
- Frustration Indicators - Define custom prompts to identify phrases like "Can you repeat that?", "I don't understand", or "Let me speak to a human"
- Engagement Scoring - Build metrics that track whether users are giving short, one-word responses (indicating disengagement) vs. fuller responses
- Task Abandonment Patterns - Configure scorers to detect when users say things like "Never mind", "Forget it", or abruptly change topics
How Hamming helps you track user satisfaction
- Custom Scoring Prompts - Define your own LLM-based evaluation criteria using natural language prompts that analyze transcripts for specific patterns
- Real-time Production Monitoring - Automatically tag live calls with custom labels like "customer frustrated", "requested human agent", or "successful resolution"
- Assertion Framework - Set up critical assertions for user experience, such as "Customer should never be asked the same question more than twice"
- Conversation Analytics - Access detailed transcripts and audio recordings to understand exactly where conversations break down
- Flexible Evaluation - Create different scorer configurations for different business contexts (sales calls vs. support calls vs. appointment scheduling)
Example Custom Scorer for Repetition Detection
Analyze this conversation transcript and identify any instances where the agent
asks the same question more than twice. Consider variations of the same question
as repetitions.
Score:
- 100 if no repetitions detected
- 50 if agent repeated a question exactly twice
- 0 if agent repeated any question more than twice
Provide specific examples of any repetitions found.
Business Outcome: Is Your AI Voice Agent Helping You Achieve Your Business Goals?
A high completion rate might suggest your voice agent is doing its job, but that metric alone doesn't tell the full story. Your bot could be closing calls efficiently while missing key opportunities to drive revenue, increase order value, or deepen customer relationships. Hamming's flexible assertion system allows you to track the metrics that matter most to your business:
Custom business metrics you can define in Hamming
- Task Completion Rate - Define what constitutes a successful outcome for your specific use case (appointment booked, order placed, issue resolved)
- Upsell Success - Create scorers that detect whether agents offered relevant add-ons and track acceptance rates
- Call Efficiency - Measure whether agents achieved objectives within target timeframes
- Compliance Adherence - Ensure agents follow required scripts for legal disclosures or verification procedures
How Hamming helps you track business impact
- Custom Assertion Framework - Define business-critical assertions like "Agent must confirm appointment time and date" or "Agent must offer premium service option"
- Production Call Tagging - Automatically categorize calls by outcome (successful sale, appointment scheduled, escalation needed)
- Performance Analytics - Track success rates across different scenarios, times of day, and agent configurations
- A/B Testing Support - Compare different prompt versions or agent configurations to optimize for business metrics
- Integration via Webhooks - Connect call outcomes to your business systems through post-call webhooks for comprehensive tracking
Example Custom Scorer for Upsell Performance
Evaluate this restaurant order call transcript for upsell effectiveness:
Did the agent mention any add-on items (drinks, desserts, sides)?
Was the upsell offer made at an appropriate time (after main order)?
Did the customer accept any upsell offers?
Score:
100: Upsell offered appropriately AND accepted
75: Upsell offered appropriately but declined
50: Upsell offered but timing was poor
0: No upsell attempted when opportunity existed
List specific upsell attempts and their outcomes.
Moving Forward: Towards Building High-Quality and Reliable Voice Agents
AI voice agents now shoulder a growing share of front desk conversations alongside human reps. So now when these AI systems falter, like when the audio cuts out, they respond too slowly, the conversation doesn't flow, or they just don't actually help the customer, it can directly harm your bottom line.
Hamming helps your business adopt a strategic, end-to-end AI voice agent QA approach, so that you can be assured your voice agent is trustworthy and delivering consistent value, even before it starts interacting with customers. Our comprehensive voice agent testing framework ensures AI voice agent quality at every level.
Layer | If left unchecked | When actively monitored and corrected |
---|---|---|
Infrastructure (audio path, latency) | Call drops, awkward silences | Consistently clear audio on any device with minimal hidden tech debt |
Conversation design (dialogue logic) | Loops, repetitive confirmations, deviation from personality | Perfect prompt adherence, Natural pacing, fewer retries, faster task completion |
Customer sentiment (custom scoring) | Polite yet frustrated callers who churn after the interaction | Custom metrics detect frustration patterns; proactive improvements based on scoring data |
Business impact (outcome tracking) | "Successful" call counts that still miss financial targets | Custom assertions track business KPIs; webhooks enable integration with business systems |
Quality comes from understanding the whole system, not optimizing individual parts.
Frequently Asked Questions
What is AI voice agent quality assurance?
A structured process that tests, measures, and improves every layer of an AI voice agent - from infrastructure and execution to user satisfaction and business outcomes.
How often should we run voice agent testing?
Continuously for production systems, with deep-dive audits at least once per release cycle.
Which metrics matter most in AI voice agent evaluation?
Time to first word, turn-level latency, prompt compliance, plus custom metrics you define for your specific business needs (frustration detection, upsell success, task completion).
Can quality assurance improve revenue?
Yes. By creating custom scorers that track business outcomes and using assertions to enforce best practices, you can optimize for metrics that directly impact revenue like upsell rates and customer retention.