Manual QA can tell you if your voice agent sounds natural. It can verify the agent says the right things in controlled scenarios. What it cannot tell you: whether your agent will handle 1,000 concurrent calls, whether last week's model update broke 15% of your conversations, or whether version A actually outperforms version B.
We call this the "demo-to-production gap": agents that excel in controlled tests but break under real-world conditions. Not catastrophically. No error pages, no crashes. Instead, they fail through awkward pauses, repeated questions, and dropped context. Users don't file bug reports. They just hang up.
Quick filter: If you are shipping weekly or handling 100+ daily calls, you need more than manual QA.
If you're running fewer than 50 test calls per week, structured testing platforms are probably overkill. Manual QA works at that scale. If your voice agent handles a single use case with predictable flows, you can get by with spreadsheets and occasional spot-checks.
But if you're scaling beyond 100+ daily calls, operating in regulated industries, or shipping weekly updates—here's why manual QA breaks down.
At Hamming, we've analyzed over 1 million production voice agent calls. The teams with reliable agents don't test harder. They test differently. Testing voice agents for production reliability requires three distinct methodologies: load testing for scale, regression testing for consistency, and A/B evaluation for optimization.
TL;DR: Test voice agents for production reliability using Hamming's 3-Pillar Production Reliability Testing Framework: Load Testing for scale, Regression Testing for consistency, and A/B Evaluation for optimization. Manual QA verifies functionality. These three pillars verify reliability.
Methodology Note: The benchmarks in this guide are derived from Hamming's analysis of 1M+ voice agent interactions across 50+ deployments (2025). Latency thresholds align with research on conversational turn-taking showing 200-500ms as the natural pause in human dialogue.
(There's fascinating linguistics research behind why the 200-500ms threshold matters. It's baked into how humans naturally take turns in conversation—going slower feels broken at a neurological level. The short version: your voice agent is fighting millions of years of evolution if it responds too slowly.)
Related Guides:
- How to Evaluate Voice Agents: Complete Guide — Hamming's VOICE Framework
- AI Voice Agent Regression Testing — Hamming's Regression Detection Framework
- How to Monitor Voice Agent Outages in Real Time — Hamming's 4-Layer Monitoring Framework
- Voice Agent Testing Maturity Model — Hamming's 5-Level Testing Maturity Model
Why Voice Agents Need Dedicated Testing Platforms
Manual QA has fundamental limitations for voice agents.
Real-Time Performance Dependencies
Voice agents fail in ways text systems don't. A half-second delay that's imperceptible in chat feels like an eternity on a phone call.
Real-time dependencies compound across the stack:
| Component | Function | Typical Latency | Failure Mode |
|---|---|---|---|
| STT (Speech-to-Text) | Processing delay before understanding | 200-400ms | Missed words, wrong transcription |
| LLM Inference | Thinking time to generate response | 300-800ms | Slow or irrelevant responses |
| TTS (Text-to-Speech) | Time to generate audio output | 150-300ms | Delayed or garbled audio |
| Network Round-trips | Cumulative delay across calls | 50-200ms | Compounding latency |
Sources: Component latency ranges based on Hamming's production monitoring across 50+ deployments (2025). Latency budgets derived from cascading architecture benchmarks.
Manual testers can't detect P95 latency spikes. You need systematic measurement.
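Here's a minimal sketch of what systematic measurement looks like: compute P95 from per-turn latency samples instead of eyeballing individual calls. The sample values are illustrative and the function is stdlib-only, not tied to any particular platform.

```python
import statistics

def p95_turn_latency(turn_latencies_ms: list[float]) -> float:
    """Return the 95th-percentile turn latency from per-turn samples."""
    if len(turn_latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=20) returns 19 cut points; the last one (index 18) is the 95th percentile
    return statistics.quantiles(turn_latencies_ms, n=20)[18]

# Illustrative samples: a call set that feels fine on average but has a bad tail
samples = [420, 450, 480, 510, 530, 560, 600, 640, 700, 2400]
print(f"mean: {statistics.mean(samples):.0f}ms  p95: {p95_turn_latency(samples):.0f}ms")
```

The mean looks tolerable; the P95 exposes the tail your callers actually feel.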
Failure Visibility
Voice agent failures don't crash. They manifest as:
- Awkward pauses (latency issues)
- Repeated questions (context loss)
- Missed intent (ASR errors, often caused by background noise)
- Dropped calls (infrastructure failures)
Without an observability platform, you only learn about failures through customer complaints. Or worse, silent abandonment.
Manual QA Limitations
| What Manual QA Can Do | What Manual QA Cannot Do |
|---|---|
| Verify happy-path functionality | Detect P95 latency spikes |
| Check if responses sound natural | Measure performance at scale |
| Test specific edge cases | Identify regression from model updates |
| Validate conversation flow | Compare versions statistically |
Manual QA is necessary but insufficient. Production reliability requires systematic testing.
Hamming's 3-Pillar Production Reliability Testing Framework
Reliable voice agents require three distinct testing methodologies. Each pillar addresses a different failure mode:
| Pillar | Purpose | Key Question | Without It |
|---|---|---|---|
| Load Testing | Scale validation | "Can we handle peak traffic?" | Silent failures at scale |
| Regression Testing | Consistency | "Did the update break anything?" | Undetected drift |
| A/B Evaluation | Optimization | "Which version performs better?" | Shipping blind |
Pillar 1: Load Testing for Scale
Voice agents can be "up" but unusable.
Scale issues don't crash servers. They break conversations. Awkward pauses. Repeated questions. Dropped context. Your infrastructure looks healthy. Your users are frustrated. And they don't file bug reports—they just hang up.
Why Concurrency Matters for Voice
Unlike web APIs where you can queue requests, voice is real-time. Users expect immediate responses.
The Problem: Your voice agent handles 100 concurrent calls with 400ms latency. Launch day arrives. Traffic spikes to 1,000 concurrent calls.
The Reality: At 1,000 concurrent calls, latency jumps to 2,000ms, five times your baseline. Users experience:
- STT servers hitting capacity limits
- LLM inference queues backing up
- TTS synthesis delays compounding
- Telephony infrastructure saturating
The Math: At 10,000 calls per day with 2-second latency, that's 10,000 frustrated users who expected sub-second responses. Many will hang up before the agent finishes speaking.
Thousands of Concurrent Synthetic Calls
Effective load testing requires realistic simulation:
Synthetic Caller Requirements:
- Varied personas with different speech patterns and accents
- Realistic conversation flows (not just "hello, goodbye")
- Background noise injection to simulate real-world conditions
- Natural pause timing (don't hammer the system unnaturally)
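As a rough sketch of how those requirements can be expressed as data your load generator consumes, here is one possible shape. The fields and persona descriptions are assumptions for illustration, not a Hamming schema.

```python
from dataclasses import dataclass
import random

@dataclass
class SyntheticCaller:
    """One synthetic caller profile that drives a single load-test call."""
    persona: str                              # e.g. "rushed commuter", "elderly, slow speech"
    accent: str                               # accent/locale tag for the voice driving the caller
    script: list[str]                         # realistic turns, not just "hello, goodbye"
    background_noise_db: float                # injected noise level to mimic real conditions
    pause_ms_range: tuple[int, int] = (400, 1200)  # natural pauses between turns

    def next_pause_ms(self) -> int:
        # Randomized pauses keep the generator from hammering the system on a fixed cadence
        return random.randint(*self.pause_ms_range)

callers = [
    SyntheticCaller("rushed commuter", "en-US",
                    ["I need to reset my password", "yes, the work account", "thanks, bye"], 35.0),
    SyntheticCaller("elderly, slow speech", "en-GB",
                    ["hello?", "my bill looks wrong", "the March one"], 20.0,
                    pause_ms_range=(900, 2000)),
]
```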
Measurement Across the Full Pipeline:
| Metric | Excellent | Good | Acceptable | Critical |
|---|---|---|---|---|
| Time to First Word (TTFW) | <300ms | <500ms | <800ms | >800ms |
| Turn Latency P50 | <1,600ms | <1,800ms | <2,000ms | >2,000ms |
| Turn Latency P95 | <2,200ms | <2,500ms | <3,000ms | >3,000ms |
| Error Rate | <0.1% | <0.5% | <1% | >1% |
| Audio Quality (MOS) | >4.2 | >4.0 | >3.5 | <3.5 |
Sources: Latency benchmarks based on Hamming's production monitoring (2025) and conversational turn-taking research (Stivers et al., 2009). MOS thresholds from ITU-T P.800 standards.
Load Test Phases
Run five phases to catch different failure modes:
- Baseline: Measure performance at low load (10% capacity)
- Ramp: Gradually increase to 100%, 150%, 200% of expected peak
- Spike: Sudden jump to 2x load (simulates viral traffic)
- Soak: Sustained load for 4+ hours (catches memory leaks)
- Recovery: Verify system returns to baseline after load drops
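A minimal asyncio driver for the baseline, ramp, and spike phases might look like the sketch below. `place_synthetic_call()` is a placeholder for whatever actually drives one test call in your stack; the soak and recovery phases would reuse the same helper with longer durations.

```python
import asyncio

async def place_synthetic_call(call_id: int) -> dict:
    """Placeholder: drive one synthetic call and return its measured metrics."""
    await asyncio.sleep(0.1)                       # stands in for a real call
    return {"call_id": call_id, "p95_ms": 800}     # stands in for measured latency

async def run_phase(name: str, concurrency: int, duration_s: float) -> list[dict]:
    """Keep `concurrency` synthetic callers busy for `duration_s` seconds."""
    results: list[dict] = []
    deadline = asyncio.get_running_loop().time() + duration_s

    async def worker(worker_id: int) -> None:
        n = 0
        while asyncio.get_running_loop().time() < deadline:
            results.append(await place_synthetic_call(worker_id * 1_000_000 + n))
            n += 1

    await asyncio.gather(*(worker(i) for i in range(concurrency)))
    print(f"{name}: {len(results)} calls at concurrency {concurrency}")
    return results

async def main(peak_concurrency: int = 100) -> None:
    await run_phase("baseline", peak_concurrency // 10, duration_s=2)    # ~10% of capacity
    for pct in (100, 150, 200):                                          # ramp
        await run_phase(f"ramp-{pct}%", peak_concurrency * pct // 100, duration_s=2)
    await run_phase("spike-2x", peak_concurrency * 2, duration_s=2)      # sudden 2x jump
    # soak would be run_phase("soak", peak_concurrency, duration_s=4 * 3600)
    # recovery re-runs the baseline phase and compares against the original numbers

asyncio.run(main())
```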
Worked Example: Latency Degradation Under Load
Degradation % = ((Latency at Load - Baseline Latency) / Baseline Latency) × 100
Example:
- Baseline P95 latency: 800ms
- P95 latency at 2x load: 1,500ms
- Degradation = ((1500 - 800) / 800) × 100 ≈ 88%
Result: At 2x peak load, your users experience 88% worse latency.
If baseline felt responsive, 2x load will feel sluggish.
Acceptable Degradation Under Load:
| Metric | Baseline | At 100% Load | At 200% Load | Max Degradation |
|---|---|---|---|---|
| Latency P95 | 800ms | 1,000ms | 1,500ms | +88% |
| Error Rate | 0.1% | 0.2% | 0.5% | +0.4% absolute |
| Audio MOS | 4.2 | 4.0 | 3.8 | -10% |
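The same calculation as a small helper, with the table's P95 budget wired in as an assertion. The threshold constant is just a name chosen for this sketch.

```python
def degradation_pct(baseline: float, loaded: float) -> float:
    """Percent change from baseline, matching the worked example above."""
    return (loaded - baseline) / baseline * 100

# Worked example values: 800ms baseline, 1,500ms at 2x load
p95_change = degradation_pct(800, 1500)
print(f"P95 degradation at 2x load: {p95_change:.0f}%")        # ~88%

MAX_P95_DEGRADATION_PCT = 88   # the budget from the table above
assert p95_change <= MAX_P95_DEGRADATION_PCT, "P95 latency degraded beyond the accepted budget"
```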
Pillar 2: Regression Testing for Consistency
Model updates, prompt changes, and dependency upgrades can silently break your voice agent. Manual checks cannot detect drift at scale.
I used to think comprehensive regression testing was overkill for teams shipping weekly. After watching three customers hit production issues last quarter that simple regression tests would have caught, I've changed my position. Even small teams benefit from baseline tracking. The setup cost is a few hours, but catching one production regression pays for months of testing.
Baseline Metrics for Deployment
Before launching, establish numerical snapshots of current performance. For a comprehensive evaluation framework, see Hamming's VOICE Framework. Without baselines, you can't interpret current performance. "Is 82% FCR good?" Only relative to your baseline.
| Metric | Target Baseline | Drift Tolerance | Action Trigger |
|---|---|---|---|
| First Call Resolution (FCR) | >75% | +/- 3% | Below tolerance: investigate |
| Goal Completion Rate | >85% | +/- 2% | Below tolerance: alert |
| Turn Latency P95 | <800ms | +/- 10% | Above tolerance: alert |
| Word Error Rate (WER) | <10% | +/- 2% | Above tolerance: investigate |
| Hallucination Rate | <5% | +/- 1% | Above tolerance: critical |
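Drift checks against these baselines can be as simple as the sketch below. The baseline numbers are illustrative values inside the target ranges, and the rate tolerances are interpreted as percentage points; swap in your own snapshot.

```python
BASELINE = {"fcr": 0.78, "goal_completion": 0.87, "latency_p95_ms": 780, "wer": 0.08, "hallucination": 0.03}
TOLERANCE = {"fcr": 0.03, "goal_completion": 0.02, "latency_p95_ms": 0.10, "wer": 0.02, "hallucination": 0.01}
HIGHER_IS_BETTER = {"fcr", "goal_completion"}

def drift_report(current: dict) -> list[str]:
    """List the metrics that drifted beyond tolerance in the harmful direction."""
    breaches = []
    for metric, base in BASELINE.items():
        tol = TOLERANCE[metric]
        if metric == "latency_p95_ms":
            drifted = current[metric] > base * (1 + tol)       # latency tolerance is relative (+/- 10%)
        elif metric in HIGHER_IS_BETTER:
            drifted = current[metric] < base - tol             # success rates: alert on drops
        else:
            drifted = current[metric] > base + tol             # error rates: alert on rises
        if drifted:
            breaches.append(f"{metric}: baseline {base}, now {current[metric]}")
    return breaches

print(drift_report({"fcr": 0.73, "goal_completion": 0.86, "latency_p95_ms": 900, "wer": 0.09, "hallucination": 0.03}))
```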
Detecting Drift from Model Updates
According to Hamming's deployment analysis, 40% of voice agent regressions come from upstream changes. Not your code.
We used to recommend regression tests only after major updates. Now, after seeing silent model provider changes cause unexpected regressions, we suggest weekly runs as the minimum. One customer's goal completion dropped 12% overnight because their LLM provider quietly updated the model. Their code hadn't changed at all.
The Problem: Your voice agent worked perfectly last week. This week, goal completion dropped 12%. Your team didn't change anything.
The Reality: Your LLM provider released a new model version. The update improved reasoning but introduced 200ms additional latency. That latency pushed P95 over the threshold where users start interrupting, which broke conversation flow.
Monitor for these upstream changes:
- LLM provider updates: OpenAI, Anthropic release new model versions
- STT/TTS provider changes: Accuracy or latency shifts
- Infrastructure changes: New regions, capacity adjustments
Run regression suites:
- After every prompt change
- After every model update (yours or provider's)
- Weekly on production (catch silent regressions)
Call Flow Regression Suite
Create representative test sets covering your full conversation range:
| Call Type | Test Cases | Key Assertions |
|---|---|---|
| Simple (password reset) | 50+ | >90% FCR, <3 min AHT |
| Medium (billing inquiry) | 50+ | >80% FCR, correct amount stated |
| Complex (technical support) | 50+ | >70% FCR, correct diagnosis |
| Edge cases (interruptions) | 30+ | Handles gracefully, no loops |
| Adversarial (jailbreak attempts) | 20+ | Refuses appropriately |
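Each row in that table can become an executable check. A pytest-style sketch follows, with `run_call_set` as a stub standing in for whatever harness replays and scores the stored calls:

```python
from dataclasses import dataclass

@dataclass
class RegressionResult:
    fcr: float           # first-call resolution across the case set, 0-1
    aht_minutes: float   # average handle time
    loops_detected: int  # repeated-question loops observed

def run_call_set(call_type: str) -> RegressionResult:
    """Stub: replay the stored test calls for one call type and score them.
    Replace with your real harness; the numbers below are illustrative."""
    fake_scores = {
        "password_reset": RegressionResult(fcr=0.94, aht_minutes=2.4, loops_detected=0),
        "interruptions": RegressionResult(fcr=0.82, aht_minutes=4.1, loops_detected=0),
    }
    return fake_scores[call_type]

def test_simple_password_reset():
    result = run_call_set("password_reset")
    assert result.fcr > 0.90, "simple flows should resolve >90% of calls"
    assert result.aht_minutes < 3.0

def test_edge_case_interruptions():
    result = run_call_set("interruptions")
    assert result.loops_detected == 0, "agent must not loop when interrupted"
```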
Pillar 3: A/B Voice Agent Testing
Which prompt version performs better? Which LLM produces higher resolution rates? A/B testing provides statistical answers instead of opinions.
The Problem: Your team debates whether the new prompt is better than the old one. Some engineers think it's faster. Others think the old version had better resolution rates. Everyone has an opinion, but nobody has data.
The Reality: Without statistical comparison, you're shipping based on hunches. The "faster" prompt might actually have worse completion rates. The "better resolution" prompt might just seem better because you tested it on easier scenarios.
The Fix: Run controlled A/B tests with sufficient sample size. Let the data decide, not the loudest voice in the room.
Deterministic vs Blind Sampling
Two sampling protocols serve different purposes:
Deterministic Sampling:
- Route specific call types to specific versions
- Use for: Initial validation, debugging specific issues
- Limitation: May introduce selection bias
Blind (Random) Sampling:
- Route calls randomly to each version
- Use for: Production decisions, unbiased comparison
- Requirement: Sufficient sample size
Recommended Protocol:
- Start with deterministic sampling to validate both versions work
- Switch to blind sampling for statistical comparison
- Run until statistically significant (typically 1,000+ calls per variant)
- Make decision based on primary metric (usually goal completion or FCR)
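Here's a sketch of the two routing modes. Deterministic routing pins call types to a variant for validation; blind routing hashes the call ID so assignment is effectively random yet reproducible. The route table and function names are illustrative.

```python
import hashlib

DETERMINISTIC_ROUTES = {"password_reset": "A", "billing": "B"}   # initial validation only

def route_deterministic(call_type: str) -> str:
    """Pin specific call types to specific variants (good for debugging, biased for decisions)."""
    return DETERMINISTIC_ROUTES.get(call_type, "A")

def route_blind(call_id: str, split: float = 0.5) -> str:
    """Hash-based assignment: unbiased across traffic, but stable for the same call_id."""
    bucket = int(hashlib.sha256(call_id.encode()).hexdigest(), 16) % 10_000
    return "A" if bucket < split * 10_000 else "B"

print(route_blind("call-000123"))   # the same call ID always lands in the same variant
```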
What Your A/B Tests Need to Measure
The table below shows the minimum sample size needed to detect differences in each metric with adequate statistical power. These are statistical minimums—smaller samples may not reliably detect real differences.
| Metric | Why It Matters | Min Sample for Detection |
|---|---|---|
| Goal Completion | Primary success metric | 500+ per variant |
| First Call Resolution | Business impact | 1,000+ per variant |
| Latency P95 | User experience | 200+ per variant |
| Error Rate | Reliability | 500+ per variant |
| CSAT (if collected) | Customer perception | 500+ per variant |
A/B Decision Framework
Ship/no-ship decisions should be based on your primary metric (typically Goal Completion or FCR). Use the 1,000+ threshold for these decisions to ensure statistical confidence.
Declare winner when:
- Primary metric (Goal Completion or FCR) is statistically significant (p < 0.05)
- Primary metric sample size ≥ 1,000 calls per variant
- Secondary metrics don't regress more than 5%
Continue testing when:
- Results not yet significant
- Sample size insufficient
- Secondary metrics show concerning trends
Stop test early when:
- Error rate exceeds 2% in either variant
- User complaints spike
- Critical failure detected
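For a binary primary metric like goal completion, the "declare winner" check boils down to a two-proportion z-test plus the sample-size gate. A stdlib-only sketch of that check follows; it covers the primary metric only, and the secondary-metric and early-stop rules would wrap around it.

```python
import math

def two_proportion_z(successes_a: int, n_a: int, successes_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided z-test for a difference between two completion rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))   # two-sided, normal approximation
    return z, p_value

def ab_decision(successes_a: int, n_a: int, successes_b: int, n_b: int,
                min_n: int = 1000, alpha: float = 0.05) -> str:
    if min(n_a, n_b) < min_n:
        return "continue: sample size insufficient"
    _, p_value = two_proportion_z(successes_a, n_a, successes_b, n_b)
    if p_value >= alpha:
        return "continue: not yet significant"
    return "ship B" if successes_b / n_b > successes_a / n_a else "keep A"

print(ab_decision(successes_a=800, n_a=1000, successes_b=860, n_b=1000))   # "ship B"
```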
Worked Example: Sample Size Calculation
Where do these sample size numbers come from? Standard power analysis for binary outcomes (like goal completion). Here's the math:
n = (Z_α + Z_β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₁ - p₂)²
Where:
- Z_α = 1.96 (95% confidence, two-tailed)
- Z_β = 0.84 (80% power)
- p₁ = baseline rate (e.g., 80% goal completion)
- p₂ = target rate (e.g., 85% goal completion)
For detecting a 5 percentage point improvement (80% → 85%):
n = (1.96 + 0.84)² × (0.80 × 0.20 + 0.85 × 0.15) / (0.05)²
n = 7.84 × 0.2875 / 0.0025
n ≈ 902 per variant
For detecting a 10 percentage point improvement (80% → 90%):
n = 7.84 × (0.16 + 0.09) / (0.10)²
n ≈ 196 per variant
Bottom line: ~1,000 calls per variant detects 5-point differences; ~200 calls detects 10-point differences. These assume 80% baseline—adjust if your metrics differ significantly.
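The same power analysis as a small reusable helper (stdlib only), so you can plug in your own baseline and target rates:

```python
import math

def sample_size_per_variant(p1: float, p2: float,
                            z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Per-variant n to detect a change from rate p1 to p2 at 95% confidence, 80% power."""
    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(numerator / (p1 - p2) ** 2)

print(sample_size_per_variant(0.80, 0.85))   # ~902, matching the worked example
print(sample_size_per_variant(0.80, 0.90))   # ~196
```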
There's a tension we haven't fully resolved: testing thoroughness vs. shipping velocity. More testing catches more issues, but it also slows you down. Different teams land in different places on this tradeoff. High-stakes industries (healthcare, finance) tend toward comprehensive testing before every change. Fast-moving startups often start with weekly regression only and add load testing before major launches. We're still figuring out the right guidance for teams in the middle.
Remember the P95 latency threshold from Load Testing? That's exactly what you're comparing in A/B tests. A 50ms improvement might not seem like much, but at scale—with thousands of daily calls—it compounds into meaningful user experience gains.
Real-World Results: Testing in Production
Healthcare voice agent company NextDimensionAI achieved 99% production reliability using this 3-pillar framework. Before implementing structured testing, their engineers could only make approximately 20 manual test calls per day. With Hamming handling 200 concurrent test calls, they reduced latency by 40% through controlled testing.
As their co-founder Simran Khara puts it: "For us, unit tests are Hamming tests. Every time we talk about a new agent, everyone already knows: step two is Hamming."
Similarly, Podium reduced manual testing time by 90% while testing 8+ languages with accent variations across 5,000+ monthly scenarios. The common thread: replacing ad-hoc manual QA with structured, repeatable testing frameworks.
Flaws but not Dealbreakers
We got this wrong initially. Our first guidance was "test everything, all the time." After watching teams drown in test infrastructure instead of shipping product, we've simplified. Here's what we now tell teams upfront:
Load testing requires compute resources. Running thousands of concurrent synthetic calls isn't free. Budget for cloud costs during peak testing. For smaller teams, start with 100-call baseline tests before scaling up. The costs scale with your testing volume—plan accordingly.
Initial baseline setup takes time. Expect 2-4 hours to configure your first regression suite. The ROI comes from automated runs afterward, not the first test. Don't let setup time discourage you from starting.
A/B tests need volume. You need 1,000+ calls per variant for 95% statistical confidence. If you're running 100 calls/week, A/B testing isn't practical yet. Start with regression testing first, add A/B when you have the traffic to support it.
False positives happen. Automated evaluations sometimes flag conversations that humans would pass. Our LLM-based scoring achieves 95%+ agreement with human evaluators, but that means 5% disagreement. Plan for manual review of edge cases.
Not every team needs all three pillars. If you're pre-launch with low traffic, load testing is overkill. If you're not iterating on prompts, A/B testing adds complexity without value. Start with regression testing—it's the highest-ROI pillar for most teams.
Production Reliability Testing Checklist
This checklist builds on the baseline metrics from Pillar 2. If you haven't established baselines yet, start there.
Pre-Launch
- Establish baseline metrics for all key KPIs
- Complete load test at 2x expected peak capacity
- Run full regression suite (200+ test cases)
- Validate monitoring and alerting configured
Ongoing
- Weekly regression suite on production
- Regression test after every model/prompt change
- Load test quarterly or before major scaling events
- A/B test for all significant changes (don't ship blind)
Monitoring
- Real-time dashboards for latency, error rate, goal completion
- Alerts on >5% regression from baseline
- Weekly performance reports comparing to baseline
- Monthly trend analysis for gradual drift

