Voice agent load testing answers one question: can the agent keep working when many real-time conversations happen at once? A normal API load test is not enough. Voice calls are long-lived sessions with audio streaming, turn-taking, provider limits, tool calls, and users who hang up when silence gets awkward.
If you only test one happy-path call, you learn that the agent can work. You do not learn whether it still works when 100, 1,000, or 10,000 callers arrive in the same window.
Quick filter: If this is an internal demo with five users, skip the full load test and focus on functional correctness. If the agent is about to handle paid customers, regulated workflows, launch-day traffic, or contact-center volume, run the load test before launch.
TL;DR: Load test voice agents with realistic synthetic calls, not repeated HTTP requests. Establish a baseline, ramp to expected peak, stress beyond peak, spike suddenly, soak for hours, and verify recovery. The launch gate is not "the server stayed up." The launch gate is "p95 latency, call setup success, task completion, and provider error rates stayed inside tolerance under expected peak plus safety margin."
Methodology Note: This runbook is based on Hamming's analysis of 4M+ production voice agent calls and reliability workflows across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions. It also draws on public Microsoft, Twilio, and LiveKit documentation to ground performance-test planning, synthetic call generation, and realtime media testing patterns.
Last Updated: May 2026
Related Guides:
- Testing Voice Agents for Production Reliability - broad reliability framework for load, regression, and A/B testing
- Voice Agent Testing Guide - complete testing lifecycle across infrastructure, behavior, and business outcomes
- Voice Agent Observability and Tracing - trace the STT, LLM, tool, and TTS stages during load tests
- Voice Agent Analytics in Grafana - dashboard routing for load-test metrics and events
- Monitor Voice Agent Outages in Real Time - alerting once the system is live
- Voice Agent CI/CD Testing - where lighter regression and smoke tests fit in release pipelines
- Background Noise Testing KPIs - acoustic realism for synthetic callers
- Voice Agent Response Coverage - turn discovered failures into permanent test coverage
- AI Voice Agent Regression Testing - release checks after load-test failures become regressions
- Testing LiveKit Voice Agents - LiveKit-specific testing and load validation
Why Voice Agent Load Testing Is Different
HTTP load testing usually measures short request-response cycles. A voice agent call is a live conversation. The connection stays open, audio streams in real time, the agent waits for user turns, and every downstream dependency can become the bottleneck.
That means "1,000 concurrent calls" is not the same as "1,000 requests per second." It means 1,000 active sessions, each with its own audio path, conversation state, provider calls, tool calls, retry behavior, and timeout risk.
| Traditional Web Load Test | Voice Agent Load Test |
|---|---|
| Short requests | Long-lived calls |
| Stateless or lightly stateful | Stateful multi-turn conversation |
| User waits behind a spinner | User hears silence or interruption |
| Bottleneck is often app, DB, or cache | Bottleneck can be SIP/WebRTC, STT, LLM, TTS, tools, or call recording |
| Success means response returned | Success means call connects, conversation stays natural, task completes, and evidence is recorded |
Microsoft's conversational agent performance-testing guidance makes the same planning point in a broader agent context: define objectives, scenarios, KPIs, test data, tools, and success criteria before generating load. For voice agents, add media quality, turn latency, provider quotas, and call completion to that plan.
The Load Test Plan Template
Start with a plan. Do not start by cranking concurrency until something breaks.
| Plan Field | What To Define | Voice-Specific Sample |
|---|---|---|
| Objective | Why this test exists | Validate launch capacity for 500 concurrent appointment calls |
| Scope | What is in and out | In: PSTN, STT, LLM, tools, TTS, recording. Out: CRM bulk exports |
| Traffic model | How users arrive and behave | 10-minute ramp, 20-minute hold, realistic silence between turns |
| Scenario mix | Which flows run | 60% scheduling, 25% reschedule, 10% billing, 5% escalation |
| Test data | What callers and agents use | Synthetic names, safe phone numbers, seeded accounts, no real PII |
| Success criteria | What passes | p95 turn latency under 2.5s, call setup success over 99%, task completion within 5% of baseline |
| Stop criteria | When to abort | Error rate over 5%, provider throttling sustained for 3 minutes, recording loss above 1% |
| Evidence | What gets saved | Run ID, trace IDs, per-stage latency, failed-call replay pointers |
The most common mistake is running load against a vague "main flow." Voice agents degrade differently by scenario. A simple FAQ call may stay healthy while a billing call with three tool calls times out.
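One way to keep the plan from staying vague is to capture it as data that travels with the run. The sketch below is a minimal illustration, not a prescribed schema: the `LoadTestPlan` class and its field names are hypothetical, and the default thresholds are simply the sample values from the table above.

```python
from dataclasses import dataclass


@dataclass
class LoadTestPlan:
    """Hypothetical container for the plan fields in the table above."""
    objective: str
    expected_peak_calls: int
    scenario_mix: dict                    # scenario name -> share of calls (0..1)
    max_p95_turn_latency_s: float = 2.5   # success criterion
    min_setup_success: float = 0.99       # success criterion
    abort_error_rate: float = 0.05        # stop criterion
    abort_throttle_minutes: int = 3       # stop criterion
    abort_recording_loss: float = 0.01    # stop criterion

    def validate(self) -> list:
        """Return a list of human-readable problems; empty means usable."""
        problems = []
        total = sum(self.scenario_mix.values())
        if abs(total - 1.0) > 1e-6:
            problems.append(f"scenario mix sums to {total:.2f}, expected 1.00")
        if self.expected_peak_calls <= 0:
            problems.append("expected_peak_calls must be positive")
        return problems

    def should_abort(self, error_rate, throttled_minutes, recording_loss) -> bool:
        """Apply the stop criteria to the current readings."""
        return (
            error_rate > self.abort_error_rate
            or throttled_minutes >= self.abort_throttle_minutes
            or recording_loss > self.abort_recording_loss
        )


plan = LoadTestPlan(
    objective="Validate launch capacity for 500 concurrent appointment calls",
    expected_peak_calls=500,
    scenario_mix={"scheduling": 0.60, "reschedule": 0.25,
                  "billing": 0.10, "escalation": 0.05},
)
print(plan.validate())                    # an empty list means the plan is usable
print(plan.should_abort(0.02, 3, 0.0))    # sustained throttling trips the stop rule
```

Keeping the scenario mix explicit makes it harder to fall back on a single "main flow" by accident.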
The Six-Phase Voice Agent Load Testing Framework
Use six phases. The older four-phase model misses spike and recovery behavior, which is where launch incidents often show up.
| Phase | Purpose | Typical Shape | Pass Signal |
|---|---|---|---|
| Baseline | Measure clean behavior | 1-5 calls for 20-30 minutes | Stable latency, quality, and task completion |
| Ramp | Find first degradation | Step from 10% to 100% expected peak | No sharp p95 jump or error-rate cliff |
| Stress | Find ceiling | 100% to 150% or 200% expected peak | Graceful degradation and clear bottleneck |
| Spike | Test sudden traffic | Jump from baseline to peak quickly | Autoscaling and queues recover without call collapse |
| Soak | Find slow leaks | 60-80% peak for 4-8 hours | Memory, queue depth, and latency stay flat |
| Recovery | Prove cleanup | Drop back to baseline | Metrics return to baseline and no orphaned sessions remain |
If you have time for only two phases, run baseline and ramp. If you are launching into real customer traffic, run all six.
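The six phases can be turned into concrete concurrency targets once the expected peak is known. The function below is a rough sketch: `phase_schedule` is a hypothetical helper, and the specific percentages and hold times are illustrative picks from within the ranges in the table (the spike and recovery hold times are pure assumptions, since the table does not give them).

```python
def phase_schedule(expected_peak: int, baseline_calls: int = 5) -> list:
    """Return (phase, target concurrent calls, hold minutes) tuples.

    All hold times are illustrative assumptions; tune them to your system.
    """
    return [
        ("baseline", baseline_calls, 30),            # 1-5 calls, 20-30 minutes
        ("ramp", expected_peak, 70),                 # stepped up to 100% of peak
        ("stress", round(expected_peak * 1.5), 20),  # 150% of peak
        ("spike", expected_peak, 10),                # jump straight from baseline
        ("soak", round(expected_peak * 0.7), 240),   # 70% of peak for 4 hours
        ("recovery", baseline_calls, 15),            # drop back and watch cleanup
    ]


for phase, calls, minutes in phase_schedule(500):
    print(f"{phase:<9} {calls:>4} calls for {minutes} min")
```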
Baseline First: Know What Healthy Means
Before you add load, capture a clean baseline. Otherwise every later number is guesswork.
Baseline at least these metrics:
| Metric | Why It Matters | Baseline Target |
|---|---|---|
| Call setup success | Users must connect before quality matters | 99%+ |
| Time to first audio | First impression of responsiveness | Within your normal production target |
| Turn latency p50/p95/p99 | Reveals tail pain hidden by averages | Stable across a 20-30 minute run |
| STT processing time | Shows speech recognition pressure | No queue buildup |
| LLM time and queue depth | Often the first scale bottleneck | No sustained queue growth |
| TTS synthesis time | Silence can come from output generation | No provider throttling |
| Tool-call latency | Backend dependencies can dominate later turns | Stable by tool and scenario |
| Task completion | Load can change behavior, not just speed | Within expected functional baseline |
| Recording and transcript capture | Load can drop evidence silently | 99%+ evidence capture |
The baseline is the contract. Later phases should answer: how far did each metric move, which stage moved first, and did quality degrade before infrastructure failed?
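To make the baseline concrete, turn latency samples can be summarized as p50, p95, and p99. The snippet below is a minimal sketch using the nearest-rank percentile method; the `percentile` helper and the sample latencies are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical turn latencies in seconds from a short baseline run.
turn_latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 4.0]

baseline = {
    "mean": sum(turn_latencies) / len(turn_latencies),
    "p50": percentile(turn_latencies, 50),
    "p95": percentile(turn_latencies, 95),
    "p99": percentile(turn_latencies, 99),
}
print(baseline)
```

In this sample the mean is about 1.48 seconds while p95 is 4.0 seconds, which is the kind of tail pain an average hides.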
Ramp Testing: Find the First Bend in the Curve
Ramp tests should be boring. Increase load in controlled steps and watch for the first bend in the curve.
Sample ramp (steps 6 and 7 run past expected peak and overlap with the stress phase):
| Step | Concurrent Calls | Hold Time | What To Watch |
|---|---|---|---|
| 1 | 10% expected peak | 10 minutes | Baseline parity |
| 2 | 25% expected peak | 10 minutes | Provider queues |
| 3 | 50% expected peak | 15 minutes | p95 and p99 latency |
| 4 | 75% expected peak | 15 minutes | Tool-call saturation |
| 5 | 100% expected peak | 20 minutes | Task completion and evidence capture |
| 6 | 125% expected peak | 20 minutes | Headroom |
| 7 | 150% expected peak | 20 minutes | Graceful degradation |
Do not average across the whole test. Plot each step separately. Averages hide the moment p95 turns from "fine" into "call-ending silence."
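Finding the first bend can be done step by step rather than by eye. The sketch below compares each step's p95 to the previous step and reports the first concurrency level where it jumps. The `first_bend` helper, the 1.25x jump ratio, and the sample numbers are all assumptions for illustration.

```python
def first_bend(step_p95, jump_ratio=1.25):
    """step_p95: list of (concurrent_calls, p95_seconds) in ramp order.

    Returns the concurrency of the first step whose p95 exceeds the
    previous step's p95 by more than jump_ratio, or None if none does.
    """
    for (_, prev_p95), (calls, p95) in zip(step_p95, step_p95[1:]):
        if p95 > prev_p95 * jump_ratio:
            return calls
    return None


# Hypothetical per-step results for an expected peak of 500 calls.
ramp_results = [(50, 1.40), (125, 1.42), (250, 1.50), (375, 1.55), (500, 2.40)]
print(first_bend(ramp_results))
```

Here the curve bends at the 500-call step, so the next question is which stage moved first at that level.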
Stress, Spike, and Soak Testing
Stress testing is where you intentionally go past expected capacity. The point is not to pass. The point is to know the ceiling and the failure mode.
Spike testing answers a different question: can the system absorb a sudden jump in arrival rate? A system can pass a slow ramp and still fail when 300 calls arrive in the same minute.
Soak testing finds the failures that only appear after time: memory leaks, orphaned WebRTC sessions, connection pool exhaustion, log ingestion lag, recording backlog, retry storms, and transcript jobs that fall behind.
| Failure Mode | Usually Found By | Signal |
|---|---|---|
| Provider rate-limit surprise | Ramp or stress | 429s, rising queue depth, retries |
| Autoscaling lag | Spike | First minutes fail, later minutes recover |
| Memory leak marathon | Soak | Memory grows while traffic stays flat |
| Recording backlog | Soak | Calls complete but evidence appears late or not at all |
| LLM queue collapse | Ramp or spike | p95 expands before error rate rises |
| Tool bottleneck | Scenario-specific ramp | One flow degrades while simple calls pass |
One caution: "no errors" is not enough. A voice agent can return 200s everywhere and still be unusable because every turn takes five seconds.
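For the soak phase, the "memory grows while traffic stays flat" signal can be checked with a simple trend line. The sketch below fits a least-squares slope to periodic samples. The `slope_per_hour` helper, the sample readings, and the 0.05 GB per hour alert threshold are hypothetical.

```python
def slope_per_hour(samples):
    """samples: list of (hours_elapsed, value). Returns the least-squares slope."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    den = sum((x - mean_x) ** 2 for x, _ in samples)
    return num / den


# Hypothetical worker memory (GB) sampled hourly at constant concurrency.
leaky = [(0, 2.0), (1, 2.3), (2, 2.6), (3, 2.9), (4, 3.2)]
healthy = [(0, 2.0), (1, 2.1), (2, 1.9), (3, 2.0), (4, 2.0)]

LEAK_THRESHOLD_GB_PER_HOUR = 0.05  # assumption: pick a value that fits your workers

for name, series in (("leaky", leaky), ("healthy", healthy)):
    slope = slope_per_hour(series)
    flagged = slope > LEAK_THRESHOLD_GB_PER_HOUR
    print(f"{name}: {slope:+.2f} GB/hour, leak suspected: {flagged}")
```

The same slope check can be pointed at queue depth or recording backlog during the soak.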
Generating Synthetic Voice Traffic
Synthetic callers should behave like users, not like scripts hammering an endpoint.
Use a scenario model with:
- realistic call durations
- natural silence between turns
- varied speech rates
- accents and dialects that match the user base
- background noise when the production environment has it
- interruptions and barge-in attempts
- tool-using paths, not only FAQ paths
- seeded data that avoids real PII
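The scenario model above can be sketched as a seeded generator of synthetic caller profiles. This is an illustrative outline only: the field names, value ranges, noise labels, and the 15% barge-in probability are assumptions, and the scenario weights reuse the sample mix from the plan template. It produces caller descriptions, not audio.

```python
import random

# Sample mix from the plan template; replace with your production distribution.
SCENARIO_MIX = {"scheduling": 0.60, "reschedule": 0.25,
                "billing": 0.10, "escalation": 0.05}


def build_callers(count, seed=7):
    """Return reproducible synthetic caller profiles with no real PII."""
    rng = random.Random(seed)  # seeded so a failed run can be replayed
    names = list(SCENARIO_MIX)
    weights = list(SCENARIO_MIX.values())
    callers = []
    for i in range(count):
        callers.append({
            "caller_id": f"synthetic-{i:05d}",
            "scenario": rng.choices(names, weights=weights, k=1)[0],
            "speech_rate": round(rng.uniform(0.85, 1.20), 2),
            "pause_between_turns_s": round(rng.uniform(0.4, 2.5), 2),
            "barge_in": rng.random() < 0.15,
            "background_noise": rng.choice(["none", "office", "street"]),
        })
    return callers


callers = build_callers(1000)
print(callers[0])
```

Seeding the generator means the same caller population can be replayed when a failure needs to be reproduced.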
Twilio's synthetic call data walkthrough shows the core pattern: fictional personas, real call infrastructure, recordings, transcripts, downstream events, and webhook stress tests without using real customer data. That is the right privacy default for load testing. Generate safe conversations on purpose.
For WebRTC-native systems, include media-path testing too. LiveKit's benchmarking documentation and testing docs point teams toward load testing the realtime media layer, not only the agent logic. If production users connect over WebRTC, test WebRTC. If they call over PSTN/SIP, test that path too.
What To Measure During Load Tests
Measure the whole voice stack. Component-level visibility is the difference between "load test failed" and "TTS queue saturation starts at 620 concurrent calls."
| Layer | Metric | Warning Pattern |
|---|---|---|
| Telephony/WebRTC | call setup success, connection time, disconnect reason | setup failures or early disconnects rise |
| Media | packet loss, jitter, audio gap count, MOS proxy | users hear choppy audio before app metrics fail |
| STT | processing latency, confidence, timeout rate, queue depth | transcription gets late or wrong |
| LLM | first-token latency, total generation time, tokens per minute, retry count | queue grows and p95 turns slow |
| Tools | tool latency, error rate, timeout count | specific workflows fail under load |
| TTS | synthesis latency, stream start time, provider errors | silence after LLM completion |
| Conversation | task completion, barge-in recovery, escalation rate | behavior changes under load |
| Evidence | transcript capture, recording capture, trace linkage | debugging data disappears when needed most |
For dashboard routing, use the split in Voice Agent Analytics in Grafana: stable metrics to Prometheus or Mimir, searchable events to logs, stage timings to traces, and raw evidence in the QA system of record.
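With per-stage timings in hand, the stage that degraded most can be identified by comparing each stage's p95 under load against its baseline. The helper below is a minimal sketch; the stage names and numbers are hypothetical.

```python
def stage_that_moved_most(baseline, under_load):
    """Both args map stage name -> p95 seconds. Returns (stage, load/baseline ratio)."""
    ratios = {stage: under_load[stage] / baseline[stage] for stage in baseline}
    worst = max(ratios, key=ratios.get)
    return worst, ratios[worst]


# Hypothetical per-stage p95 latencies in seconds.
baseline_p95 = {"stt": 0.30, "llm": 0.70, "tools": 0.25, "tts": 0.20}
peak_p95 = {"stt": 0.33, "llm": 0.84, "tools": 0.27, "tts": 0.62}

stage, ratio = stage_that_moved_most(baseline_p95, peak_p95)
print(f"{stage} moved most: {ratio:.1f}x baseline")
```

In this example the LLM is still the largest absolute contributor, but TTS is the stage that moved, which points the investigation at synthesis capacity rather than prompt length.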
Launch Gates and Thresholds
Use relative degradation, not only absolute thresholds. Every stack starts with different latency.
| Gate | Pass | Warn | Fail |
|---|---|---|---|
| p95 turn latency at expected peak | under 1.5x baseline | 1.5x-2x baseline | over 2x baseline |
| p99 turn latency | no call-ending tail | occasional spikes with recovery | sustained silence or timeout risk |
| Call setup success | 99%+ | 97%-99% | below 97% |
| End-to-end task completion | within 5% of baseline | 5%-10% drop | over 10% drop |
| Provider throttling | none sustained | short burst, auto-recovers | sustained throttling |
| Evidence capture | 99%+ recordings/transcripts/traces | 97%-99% | below 97% |
| Recovery | returns to baseline in minutes | slow recovery | stuck queues or orphaned sessions |
Use this formula during analysis:
Latency degradation percent =
((p95 latency under load - baseline p95 latency) / baseline p95 latency) * 100
If baseline p95 is 1.4 seconds and peak p95 is 2.4 seconds, degradation is about 71% (roughly 1.7x baseline), which lands in the Warn band of the gate table. That might pass for a back-office task and fail for a fast customer-service flow. Tie thresholds to user experience, not only infrastructure comfort.
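The degradation formula and the p95 gate from the table can be expressed directly in code. This is a small sketch with hypothetical function names; the band boundaries are the ones in the gate table.

```python
def latency_degradation_pct(baseline_p95, load_p95):
    """Percent increase of p95 under load relative to baseline."""
    return (load_p95 - baseline_p95) / baseline_p95 * 100


def p95_gate(baseline_p95, load_p95):
    """Classify p95 at expected peak: under 1.5x pass, 1.5x-2x warn, over 2x fail."""
    ratio = load_p95 / baseline_p95
    if ratio < 1.5:
        return "pass"
    if ratio <= 2.0:
        return "warn"
    return "fail"


pct = latency_degradation_pct(1.4, 2.4)
print(f"degradation: {pct:.0f}%, gate: {p95_gate(1.4, 2.4)}")
```

For the 1.4 second to 2.4 second case, the ratio is about 1.7x, so the gate reports a warning rather than a clean pass.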
Capacity Planning From Load Test Results
The useful output of load testing is a capacity model.
required_peak_capacity =
forecast_peak_concurrent_calls * launch_safety_margin
Start with a 1.5x safety margin for ordinary launches. Use 2x or more when demand is hard to predict, when marketing is creating a spike, or when regulatory workflows cannot degrade.
Then calculate the real ceiling:
voice_agent_capacity =
minimum(
telephony_concurrent_call_limit,
media_server_session_limit,
stt_provider_limit,
llm_tokens_per_minute_limit / average_tokens_per_call_minute,
tts_provider_limit,
tool_backend_limit,
evidence_pipeline_limit
)
The minimum is the bottleneck. Scaling anything else first is theatre.
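The two capacity formulas can be combined into a quick check of whether the measured ceiling covers the forecast plus margin. The sketch below mirrors the formulas above; every limit in the example is a made-up number, and the function names are hypothetical.

```python
def required_peak_capacity(forecast_peak_calls, safety_margin=1.5):
    """Forecast peak concurrent calls multiplied by the launch safety margin."""
    return forecast_peak_calls * safety_margin


def voice_agent_capacity(limits):
    """limits maps component -> max concurrent calls. Returns (bottleneck, capacity)."""
    bottleneck = min(limits, key=limits.get)
    return bottleneck, limits[bottleneck]


# Hypothetical limits, all expressed in concurrent calls.
llm_tokens_per_minute_limit = 2_000_000
average_tokens_per_call_minute = 3_000
limits = {
    "telephony": 1000,
    "media_server": 1200,
    "stt": 800,
    "llm": llm_tokens_per_minute_limit / average_tokens_per_call_minute,
    "tts": 620,
    "tools": 900,
    "evidence_pipeline": 1500,
}

bottleneck, capacity = voice_agent_capacity(limits)
needed = required_peak_capacity(400, safety_margin=1.5)
print(f"bottleneck: {bottleneck} at {capacity} calls, need {needed:.0f}")
print("headroom ok" if capacity >= needed else "raise the bottleneck first")
```

In this example TTS caps the system at 620 concurrent calls, so raising the LLM quota alone would not increase capacity.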
Common Bottlenecks and Fixes
| Bottleneck | Symptom | Fix |
|---|---|---|
| STT provider limit | transcripts arrive late or time out | raise quota, shard providers, reduce retries |
| LLM queue | first-token latency grows before errors | reserve capacity, shorten prompts, add backpressure |
| TTS saturation | agent "thinks" then stays silent | pre-cache stable phrases, raise quota, stream earlier |
| Tool API | only tool-heavy flows fail | add caching, bulkhead tools, rate-limit lower-priority calls |
| Recording pipeline | calls pass but evidence is missing | queue recordings separately, monitor backlog |
| Load generator | generator CPU maxes before system under test | distribute generators, lower local synthesis cost |
| Autoscaling | spike fails then stabilizes | pre-warm capacity for launches |
One practical rule: monitor the load generator as carefully as the system under test. Otherwise you can mistake generator exhaustion for product capacity.
When Load Testing Is Not Critical
Load testing is not free. You pay for synthetic callers, telephony, STT, LLM, TTS, infrastructure, and analysis time.
Skip or shrink the full run when:
- the agent is an internal prototype
- expected usage is low and controllable
- a soft launch can throttle traffic manually
- the workflow is not customer-facing yet
- you have production monitoring and a rollback plan for a small beta
Do not skip it when:
- launch traffic is unpredictable
- the agent replaces human support during peak hours
- the workflow touches healthcare, finance, insurance, or compliance
- the agent depends on multiple external providers
- missed calls have direct revenue or safety impact
Synthetic traffic is not perfect. Real users are messier than test personas. The point is not to prove production will be flawless. The point is to find the obvious scale failures before customers do.
Pre-Launch Checklist
Use this checklist before approving launch.
| Item | Required Evidence |
|---|---|
| Forecast peak concurrent calls | model with source assumptions |
| Safety margin chosen | 1.5x, 2x, or explicit exception |
| Scenario mix defined | production-like flow distribution |
| Synthetic data safe | no real customer PII, PHI, PCI, or private call content |
| Baseline captured | clean p50, p95, p99, success, quality, evidence metrics |
| Ramp passed | expected peak within threshold |
| Stress ceiling known | bottleneck and breaking point documented |
| Spike behavior tested | autoscaling and queues recover |
| Soak passed | no sustained memory, queue, or evidence backlog |
| Recovery passed | metrics return to baseline |
| Monitoring ready | launch dashboard and alert owners assigned |
| Regression follow-up ready | failures converted into test coverage |
After launch, the load test should feed the regression suite. Every broken flow, slow path, provider throttle, and evidence-capture miss becomes a scenario you can rerun before the next release. Voice Agent Response Coverage covers that loop in detail, and AI Voice Agent Regression Testing explains how to run those checks before the next prompt or model change.

