Voice agents built on LiveKit fail differently than text-based AI systems. A chatbot with a 2-second response delay is mildly annoying. A voice agent with the same delay creates dead air that makes users hang up. When your pipeline spans ASR, NLU, LLM, and TTS, a failure in any single layer cascades through the entire conversation. Testing and monitoring voice agents requires a fundamentally different approach.
This guide covers the five-pillar framework for production-grade LiveKit voice agent quality: evaluation, regression testing, load testing, observability, and alerting.
Methodology Note: The testing framework, metrics, and thresholds in this guide are derived from Hamming's analysis of 4M+ production LiveKit voice agent deployments across 10K+ voice agents (2025-2026). Thresholds and benchmarks represent performance patterns across customer support, healthcare, and enterprise deployments. Your targets may vary based on use case and latency tolerance.
TL;DR: Five-Pillar Testing Framework
| Pillar | Purpose | Key Tool |
|---|---|---|
| Evaluation | Measure quality across ASR, NLU, LLM, TTS | Curated datasets + LLM-as-judge scorers |
| Regression | Prevent quality degradation on changes | Behavioral baselines + automated suites |
| Load | Validate scalability under traffic | lk perf agent-load-test + synthetic calls |
| Observability | Diagnose production issues | OpenTelemetry distributed tracing |
| Alerting | Catch problems before users complain | P90/P99 threshold monitoring |
Key targets: P90 end-to-end latency under 3.5s, P99 under 5s, WER under 5%, task completion above 90%.
Core insight: Offline evaluation catches logic errors. Online monitoring catches production-only failures. You need both, and they need to share metric definitions.
Quick filter: If you are running LiveKit agents in production without automated regression suites and distributed tracing, this guide covers the gaps that will eventually cause customer complaints.
Understanding Voice Agent Testing Fundamentals
The Voice Agent Testing Stack
Voice agents are not single-component systems. They are four interdependent layers where failures cascade:
User speaks → ASR (transcription) → NLU (intent) → LLM (response) → TTS (audio) → User hears
| Layer | Function | Failure Impact |
|---|---|---|
| ASR (Speech-to-Text) | Converts audio to text | Wrong transcription → wrong intent → wrong response |
| NLU (Intent Recognition) | Classifies user intent | Misrouted conversation → task failure |
| LLM (Response Generation) | Produces reply content | Hallucination, context loss, wrong tool call |
| TTS (Text-to-Speech) | Synthesizes audio output | Robotic voice, pronunciation errors, latency |
A 3% increase in ASR word error rate does not stay contained. It propagates through NLU, causes the LLM to generate responses to the wrong intent, and the user hears a confidently wrong answer. Testing each layer in isolation is necessary but insufficient—you need full-stack validation.
Offline vs. Online Evaluation
Offline evaluation runs before deployment against curated datasets with known-correct outcomes. You control the inputs, the environment, and the expected outputs. This catches logic errors, prompt regressions, and intent classification failures.
Online evaluation monitors live production traffic continuously. It samples real conversations and scores them against quality criteria. This catches failures that only appear in production: audio degradation from cellular connections, accent-driven ASR errors, network jitter disrupting turn-taking, and model drift over time.
| Dimension | Offline Evaluation | Online Monitoring |
|---|---|---|
| When it runs | Pre-deployment | Continuous in production |
| Data source | Curated test datasets | Sampled live traffic |
| Controls | Full (inputs, environment) | None (real-world conditions) |
| Catches | Logic errors, regressions, prompt issues | Audio quality, network effects, drift |
| Latency | Not representative | Actual production latency |
| Scale | 50-500 test cases | Thousands of conversations daily |
Both feed into the same metrics pipeline. If your offline evaluation measures task completion rate at 95% but production monitoring shows 82%, the gap is your testing blind spot.
Key Quality Metrics for Voice Agents
| Metric | Target | Alert Threshold | Why It Matters |
|---|---|---|---|
| Word Error Rate (WER) | <5% | >8% | Foundation of everything—bad transcription cascades |
| Mean Opinion Score (MOS) | >4.3/5 | <3.8 | Perceived audio quality drives user trust |
| End-to-End Latency P90 | <3.5s | >3.5s | 10% of users experiencing significant delays |
| End-to-End Latency P99 | <5s | >5s | Responses over 5 seconds feel broken |
| TTFT (Time to First Token) | <800ms | >1200ms | User-perceived response initiation |
| Task Completion Rate | >90% | <85% | Whether the agent actually solves the problem |
| Interruption Rate | <15% | >25% | Turn detection quality indicator |
For a deeper dive into latency benchmarks and optimization strategies, see Voice AI Latency: What's Fast, What's Slow, and How to Fix It.
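As a concrete reference for how the P90/P99 targets above are computed, here is a minimal nearest-rank percentile sketch over per-turn latency samples (the values are illustrative, not from a real deployment):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # ceil(pct * n / 100) gives the 1-based rank; clamp to a valid index
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Per-turn end-to-end latencies in seconds (illustrative)
latencies = [2.1, 2.4, 2.6, 2.8, 3.0, 3.1, 3.2, 3.4, 3.9, 5.2]
p90 = percentile(latencies, 90)  # 3.9s -> breaches the 3.5s P90 target
p99 = percentile(latencies, 99)  # 5.2s -> breaches the 5s P99 target
```

Averages hide exactly the tail behavior this table alerts on, which is why the targets are expressed as percentiles rather than means.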
Setting Up Your LiveKit Testing Environment
Installing LiveKit Testing Dependencies
Start with the core testing stack:
```bash
pip install livekit-agents pytest pytest-asyncio
```
For observability instrumentation, add OpenTelemetry:
```bash
pip install opentelemetry-api opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation
```
For load testing, install the LiveKit CLI:
```bash
# macOS
brew install livekit-cli

# Linux
curl -sSL https://get.livekit.io/cli | bash
```
Configuring Test Infrastructure
Set up environment variables for your test environment:
```bash
# .env.test
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your-test-api-key
LIVEKIT_API_SECRET=your-test-api-secret
OPENAI_API_KEY=your-openai-key  # or other LLM provider
```
Configure pytest for async support:
```ini
[pytest]
asyncio_mode = auto
testpaths = tests
python_files = test_*.py
markers =
    unit: Text-only logic tests
    webrtc: Full audio pipeline tests
    load: Concurrent capacity tests
```
Separate test markers let you run fast unit tests on every commit and reserve expensive WebRTC and load tests for deploy candidates.
Creating Your First Test Suite
LiveKit's pytest helpers validate agent behavior in text mode—no audio pipeline required:
```python
import pytest

from your_agent import create_agent


@pytest.mark.unit
@pytest.mark.asyncio
async def test_greeting_and_intent_routing():
    """Verify agent correctly identifies and routes initial intent."""
    agent = create_agent()
    response = await agent.process_text(
        "Hi, I need to reschedule my appointment for next week"
    )
    # Agent should recognize scheduling intent
    assert any(keyword in response.lower() for keyword in [
        "reschedule", "appointment", "when", "available", "date"
    ])
    # Should ask clarifying question
    assert "?" in response


@pytest.mark.unit
@pytest.mark.asyncio
async def test_multi_turn_context_retention():
    """Verify agent maintains context across conversation turns."""
    agent = create_agent()
    await agent.process_text("I want to book a table for 4 people")
    await agent.process_text("Actually, make that 6")
    response = await agent.process_text("What reservation details do you have?")
    # Agent should remember the corrected party size
    assert "6" in response or "six" in response.lower()
```
For comprehensive coverage of LiveKit's built-in testing helpers and pytest patterns, see Testing LiveKit Voice Agents: Unit, Scenario, Load and Production Guide.
Evaluation Frameworks for Voice Agents
Conversational Quality Metrics
Beyond pass/fail assertions, measure the qualitative aspects of conversation:
| Metric | How to Measure | Target |
|---|---|---|
| Turn-taking latency | Time from user silence to agent speech start | <500ms end-of-utterance delay |
| Interruption handling | Agent stops and listens when user barges in | Recovery within 1 turn |
| Time to first word | Duration from user finish to agent first audio | P90 under 2.5s |
| Talk-to-listen ratio | Agent speaking time vs. user speaking time | 40-60% agent, context-dependent |
Track these across conversations to detect systematic issues. A talk-to-listen ratio above 70% often indicates the agent is monologuing rather than having a dialogue.
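The talk-to-listen ratio can be computed directly from per-turn speech durations. A minimal sketch (the turn schema here is illustrative, not a LiveKit API):

```python
def talk_to_listen_ratio(turns: list[dict]) -> float:
    """Fraction of total speech time taken by the agent."""
    agent_ms = sum(t["duration_ms"] for t in turns if t["speaker"] == "agent")
    user_ms = sum(t["duration_ms"] for t in turns if t["speaker"] == "user")
    total = agent_ms + user_ms
    return agent_ms / total if total else 0.0

turns = [
    {"speaker": "agent", "duration_ms": 4000},
    {"speaker": "user", "duration_ms": 2500},
    {"speaker": "agent", "duration_ms": 3500},
]
ratio = talk_to_listen_ratio(turns)  # 7500 / 10000 = 0.75 -> monologue warning
```

Computed per conversation and aggregated, this surfaces monologuing agents that individual call reviews tend to miss.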
Task Completion Assessment
Task completion is the metric that matters most to the business. Define clear success criteria for each conversation type:
```python
TASK_DEFINITIONS = {
    "appointment_booking": {
        "required_fields": ["date", "time", "service_type"],
        "success_criteria": "booking_confirmed",
        "max_turns": 8,
    },
    "order_status": {
        "required_fields": ["order_id"],
        "success_criteria": "status_delivered",
        "max_turns": 4,
    },
    "complaint_resolution": {
        "required_fields": ["issue_description"],
        "success_criteria": "resolution_offered",
        "max_turns": 12,
    },
}
```
Measure both binary completion (did the task succeed?) and efficiency (how many turns did it take?). An agent that completes bookings in 15 turns when the baseline is 6 has a problem even if the task technically succeeds.
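A sketch of how both checks might be combined, using a task definition shaped like the ones above (the conversation record schema is hypothetical):

```python
TASK = {
    "required_fields": ["date", "time", "service_type"],
    "success_criteria": "booking_confirmed",
    "max_turns": 8,
}

def assess_task(conversation: dict, task: dict) -> dict:
    """Score one conversation for binary completion and turn efficiency."""
    fields_ok = all(f in conversation["collected_fields"]
                    for f in task["required_fields"])
    completed = fields_ok and conversation["outcome"] == task["success_criteria"]
    return {
        "completed": completed,
        "efficient": completed and conversation["turn_count"] <= task["max_turns"],
        "turn_count": conversation["turn_count"],
    }

convo = {
    "collected_fields": {"date": "2025-03-04", "time": "10:00",
                         "service_type": "cleaning"},
    "outcome": "booking_confirmed",
    "turn_count": 15,  # task succeeded, but far over the 8-turn budget
}
result = assess_task(convo, TASK)  # completed but not efficient
```

Tracking `efficient` separately from `completed` is what catches the 15-turn booking described above.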
ASR and TTS Accuracy Testing
Test speech recognition under realistic conditions, not just clean studio audio:
| Condition | Expected WER Impact | Test Approach |
|---|---|---|
| Quiet environment | Baseline (<3%) | Standard test audio |
| Background office noise | +1-2% | Mix noise at -20dB SNR |
| Street/traffic noise | +3-5% | Mix noise at -10dB SNR |
| Regional accents | +2-4% | Diverse speaker test set |
| Non-native speakers | +3-6% | Accented speech samples |
| Phone-quality audio (8kHz) | +1-3% | Downsample test audio |
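The WER figures above are measured with the standard word-level edit distance (substitutions, insertions, and deletions over the reference word count). A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Classic dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

score = wer("reschedule my appointment for tuesday",
            "reschedule my appointment for thursday")  # 1 sub / 5 words = 0.2
```

Note that WER can exceed 1.0 when the hypothesis inserts many spurious words, which is exactly what heavy background noise produces.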
For TTS, validate pronunciation of domain-specific terms, proper names, numbers, and dates. A medical scheduling agent that mispronounces "ophthalmologist" erodes user confidence regardless of task accuracy.
Intent Classification Validation
Intent routing determines conversation flow. Test both correct classification and recovery from misclassification:
```python
@pytest.mark.asyncio
async def test_intent_recovery_after_misrecognition():
    """Verify agent recovers when initial intent is misclassified."""
    agent = create_agent()
    # Ambiguous input that might be misclassified
    response1 = await agent.process_text("I want to change my plan")
    # User corrects/clarifies
    response2 = await agent.process_text(
        "No, not my subscription plan. I want to change my flight plan."
    )
    # Agent should pivot to flight-related handling
    assert any(keyword in response2.lower() for keyword in [
        "flight", "itinerary", "travel", "departure"
    ])
```
Compliance and Safety Guardrails
Implement automated scorers that run on every test case and sampled production conversations:
| Guardrail | What It Catches | Implementation |
|---|---|---|
| Prompt injection detection | Users attempting to override system prompt | Pattern matching + LLM classifier |
| Policy violation scoring | Agent making unauthorized promises or commitments | LLM-as-judge with policy rubric |
| PII handling | Agent improperly repeating or storing sensitive data | Regex + entity detection |
| Safety boundary enforcement | Agent engaging with harmful or off-topic requests | Topic classifier + refusal verification |
These scorers should run automatically in both offline evaluation and online monitoring. A compliance failure in production is worse than any latency spike.
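A minimal sketch of the regex half of a PII scorer (patterns illustrative; production systems typically pair these with entity detection, as the table notes):

```python
import re

# Illustrative patterns only -- real deployments need locale-aware rules
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def detect_pii(text: str) -> list[str]:
    """Return the PII categories found in an agent response."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

hits = detect_pii("Sure, I have your card ending 4111 1111 1111 1111 on file.")
# hits == ["credit_card"] -> the agent repeated a full card number
```

Run this on every agent response in both offline suites and sampled production traffic; a single hit should fail the test case outright.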
Implementing Regression Testing
Building Behavioral Baselines
Before changing anything, establish your baseline. This becomes the reference point for detecting regressions:
Quantitative baseline:
| Metric | Baseline Value | Acceptable Regression |
|---|---|---|
| Task completion rate | 93% | No more than 3% drop |
| P90 end-to-end latency | 3.2s | No more than 300ms increase |
| WER | 4.1% | No more than 1% increase |
| Intent accuracy | 96% | No more than 2% drop |
| Average turns to completion | 5.8 | No more than 1.5 turn increase |
Qualitative baseline: Context preservation across turns, coherent multi-step reasoning, appropriate tone and register. These require LLM-as-judge evaluation rather than simple metric thresholds.
Detecting Model Drift
Model drift happens silently. ASR provider updates, LLM fine-tuning, or even TTS voice model changes can shift behavior without any code change on your side:
```python
@pytest.mark.asyncio
async def test_behavioral_consistency():
    """Run weekly to detect drift in agent behavior."""
    agent = create_agent()
    results = []
    for scenario in BASELINE_SCENARIOS:
        response = await agent.process_text(scenario["input"])
        score = evaluate_response(response, scenario["expected"])
        results.append(score)
    avg_score = sum(results) / len(results)
    baseline_score = load_baseline_score()
    # Flag if cumulative score drops more than 5%
    assert avg_score >= baseline_score * 0.95, (
        f"Behavioral drift detected: {avg_score:.2f} vs baseline {baseline_score:.2f}"
    )
```
Schedule this weekly. Cumulative 1% degradations per month become a 12% regression over a year that nobody notices until customer complaints spike.
Prompt Version Testing
Every prompt change gets an A/B comparison against the regression suite before deployment:
```python
@pytest.mark.asyncio
async def test_prompt_upgrade_no_regression():
    """Compare candidate prompt against production baseline."""
    test_cases = load_regression_suite()
    baseline_agent = create_agent(prompt_version="production")
    candidate_agent = create_agent(prompt_version="candidate")
    baseline_pass = 0
    candidate_pass = 0
    for case in test_cases:
        b_response = await baseline_agent.process_text(case["input"])
        c_response = await candidate_agent.process_text(case["input"])
        if passes_criteria(b_response, case["expected"]):
            baseline_pass += 1
        if passes_criteria(c_response, case["expected"]):
            candidate_pass += 1
    baseline_rate = baseline_pass / len(test_cases)
    candidate_rate = candidate_pass / len(test_cases)
    assert candidate_rate >= baseline_rate - 0.05, (
        f"Prompt regression: {baseline_rate:.1%} → {candidate_rate:.1%}"
    )
```
For a comprehensive guide to regression testing patterns and production failure conversion workflows, see AI Voice Agent Regression Testing.
Automated Regression Suites
Integrate regression checks as deployment gates:
```yaml
# .github/workflows/voice-agent-ci.yml
- name: Run regression suite
  run: |
    pytest tests/regression/ -m "not load" --tb=short -q
    if [ $? -ne 0 ]; then
      echo "Regression suite failed - blocking deploy"
      exit 1
    fi
```
Every production failure becomes a permanent test case. Over time your regression suite becomes an increasingly complete specification of correct agent behavior.
Load and Scalability Testing
Using lk perf agent-load-test
The LiveKit CLI includes built-in load testing for agent scalability:
```bash
# Simulate 50 concurrent agent rooms
lk perf agent-load-test \
  --url wss://your-project.livekit.cloud \
  --api-key $LIVEKIT_API_KEY \
  --api-secret $LIVEKIT_API_SECRET \
  --num-rooms 50 \
  --duration 300s \
  --publish-audio
```
This creates concurrent rooms with synthetic participants, measuring agent join time, response latency, and resource consumption. Start with 10 rooms and scale up to identify the inflection point where performance degrades.
Stress Testing Strategies
Test at 2x your expected peak capacity to find breaking points before they find your users:
| Test Scenario | Configuration | What It Reveals |
|---|---|---|
| Sustained load | 100 concurrent rooms for 30 minutes | Memory leaks, connection pool exhaustion |
| Spike test | Ramp from 10 to 200 rooms in 60 seconds | Autoscaling responsiveness |
| Endurance test | 50 rooms for 4 hours | Long-running stability issues |
| Chaos test | Kill agent workers mid-conversation | Recovery and failover behavior |
For stress tests exceeding 100 concurrent sessions, use synthetic callers with realistic voice characteristics and background noise to simulate production conditions. Clean synthetic audio does not stress-test ASR the same way real-world audio does.
Identifying Performance Bottlenecks
Track P90 and P99 latency at each pipeline stage independently during load tests:
```text
Load Test Results (100 concurrent rooms)

Component Latency Breakdown:
├── STT Processing:   P90: 280ms   P99: 450ms   ← Within budget
├── LLM Inference:    P90: 1800ms  P99: 3200ms  ← Dominant bottleneck
├── TTS Synthesis:    P90: 220ms   P99: 380ms   ← Within budget
└── Network/Routing:  P90: 180ms   P99: 320ms   ← Within budget

Total End-to-End:     P90: 3100ms  P99: 4800ms  ← Under 3.5s P90 target
```
When P90 exceeds 3.5 seconds, identify which component is the bottleneck. In most deployments, LLM inference dominates at 60-70% of total latency. Optimization efforts should target the largest contributor first.
For detailed load testing methodology and the 3-Pillar testing framework, see Testing Voice Agents: Load, Regression, and A/B Evaluation for Production Reliability.
Capacity Planning
Determine your maximum concurrent call capacity by finding the degradation point:
| Concurrent Rooms | P90 Latency | P99 Latency | Task Completion | Status |
|---|---|---|---|---|
| 25 | 2.8s | 3.9s | 94% | Healthy |
| 50 | 3.1s | 4.3s | 93% | Healthy |
| 75 | 3.4s | 4.8s | 91% | Warning |
| 100 | 3.8s | 5.6s | 87% | Degraded |
| 150 | 4.9s | 7.2s | 78% | Critical |
Set your autoscaling threshold at 70% of the degradation point. If performance degrades at 100 rooms, trigger scaling at 70 concurrent rooms to maintain headroom.
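The 70% rule can be applied mechanically to load-test results. A sketch using sweep data shaped like the table above:

```python
# (rooms, p90_seconds, task_completion) from a load-test sweep, as in the table
sweep = [(25, 2.8, 0.94), (50, 3.1, 0.93), (75, 3.4, 0.91),
         (100, 3.8, 0.87), (150, 4.9, 0.78)]

def degradation_point(sweep, p90_target: float = 3.5):
    """First room count where P90 latency exceeds the target."""
    for rooms, p90, _completion in sweep:
        if p90 > p90_target:
            return rooms
    return None  # never degraded within the tested range

def autoscale_threshold(sweep, headroom: float = 0.70):
    """Trigger scaling at 70% of the degradation point."""
    point = degradation_point(sweep)
    return int(point * headroom) if point else None

threshold = autoscale_threshold(sweep)  # degrades at 100 rooms -> scale at 70
```

Re-run the sweep after any pipeline change; the degradation point moves when component latencies do.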
Logging and Tracing Architecture
OpenTelemetry Integration
Implement distributed tracing that captures the complete voice turn lifecycle with a unified trace ID:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter


def configure_tracing():
    """Call once at worker startup before any spans are created."""
    provider = TracerProvider()
    processor = BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317")
    )
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)


tracer = trace.get_tracer("livekit-voice-agent")


# transcribe, generate_response, execute_tool, and synthesize are your
# application's pipeline helpers; session_id and model_name come from session state
async def process_voice_turn(audio_input):
    with tracer.start_as_current_span("voice_turn") as turn_span:
        turn_span.set_attribute("session.id", session_id)

        with tracer.start_as_current_span("stt_processing") as stt_span:
            transcript = await transcribe(audio_input)
            stt_span.set_attribute("stt.confidence", transcript.confidence)
            stt_span.set_attribute("stt.word_count", len(transcript.words))

        with tracer.start_as_current_span("llm_inference") as llm_span:
            llm_span.set_attribute("llm.model", model_name)
            response = await generate_response(transcript.text)
            llm_span.set_attribute("llm.tokens", response.token_count)

        if response.requires_tool:
            with tracer.start_as_current_span("tool_call") as tool_span:
                tool_span.set_attribute("tool.name", response.tool_name)
                result = await execute_tool(response.tool_call)
                tool_span.set_attribute("tool.success", result.success)

        with tracer.start_as_current_span("tts_synthesis") as tts_span:
            audio = await synthesize(response.text)
            tts_span.set_attribute("tts.characters", len(response.text))
            tts_span.set_attribute("tts.duration_ms", audio.duration_ms)
```
This produces traces like:
```text
voice_turn (3.2s total)
├── stt_processing (210ms)
├── llm_inference (2.4s)  ← bottleneck visible
├── tool_call (380ms)
└── tts_synthesis (190ms)
```
For implementation details on Prometheus metrics collection and Grafana dashboard configuration, see LiveKit Agent Monitoring in Production: Prometheus, Grafana and Alerts.
Structured Logging Best Practices
Log full conversation context with every turn for post-incident debugging:
```python
import structlog

logger = structlog.get_logger()


async def log_turn(session_id: str, turn_number: int, turn_data: dict):
    logger.info(
        "voice_turn_completed",
        session_id=session_id,
        turn_number=turn_number,
        user_input=turn_data["transcript"],
        agent_response=turn_data["response"],
        stt_confidence=turn_data["stt_confidence"],
        stt_latency_ms=turn_data["stt_latency_ms"],
        llm_latency_ms=turn_data["llm_latency_ms"],
        tts_latency_ms=turn_data["tts_latency_ms"],
        total_latency_ms=turn_data["total_latency_ms"],
        tool_calls=turn_data.get("tool_calls", []),
        intent_detected=turn_data.get("intent"),
        interruption_occurred=turn_data.get("interrupted", False),
    )
```
Structure logs so you can query by session, by latency range, by error type, or by intent. When a customer reports a bad experience, you should be able to reconstruct the complete conversation within minutes.
Multi-Agent Tracing
When agents delegate tasks or coordinate with other agents, trace the full interaction graph:
```text
Primary Agent (session-abc)
├── voice_turn_1 (greeting)
├── voice_turn_2 (intent: transfer)
│   └── agent_handoff
│       ├── context_transfer (120ms)
│       └── Secondary Agent (session-abc-transfer)
│           ├── voice_turn_1 (pickup)
│           └── voice_turn_2 (resolution)
└── voice_turn_3 (confirmation)
```
Propagate trace context across agent boundaries so you can follow a single conversation through multiple agents, tool calls, and external API interactions.
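In practice you would use OpenTelemetry's propagators to inject and extract context at the handoff. As an illustration of what actually crosses the agent boundary, here is a stdlib-only sketch of the W3C `traceparent` header format (simplified; a real deployment should rely on the OTel propagation API rather than hand-rolling this):

```python
import re

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a W3C traceparent header: version-traceid-spanid-flags."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str):
    """Extract trace context on the receiving agent; None if malformed."""
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        return None
    return {"trace_id": m.group(1),
            "parent_span_id": m.group(2),
            "sampled": m.group(3) == "01"}

# Primary agent attaches the header to the handoff payload...
carrier = {"traceparent": make_traceparent("a" * 32, "b" * 16)}
# ...and the secondary agent continues the same trace from it
ctx = parse_traceparent(carrier["traceparent"])
```

Because both agents share one `trace_id`, the handoff tree above renders as a single trace in your backend instead of two disconnected ones.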
LiveKit Agent Observability
LiveKit Cloud provides native observability features for agent debugging:
- Trace View: Visual timeline showing turn detection, LLM timing, and tool execution per conversation
- Session Recordings: Audio and transcript capture for debugging and compliance review
- Real-time Metrics: WebRTC quality metrics, room health, and participant status
- Synchronized Playback: Listen to audio while viewing the corresponding transcript and trace data side by side
These built-in tools complement your custom instrumentation. Use LiveKit Cloud Dashboard for individual session debugging and your Prometheus/Grafana stack for aggregate monitoring and alerting.
Production Monitoring and Alerting
Real-Time Performance Monitoring
Monitor these metrics continuously in production:
| Metric Category | What to Track | Alert When |
|---|---|---|
| Latency | End-to-end P90, P99, TTFT | P90 > 3.5s or P99 > 5s |
| Audio Quality | ASR WER, TTS MOS | WER > 8% or MOS < 3.8 |
| Conversation | Intent accuracy, interruption rate | Intent accuracy < 90% or interruption > 25% |
| Reliability | Tool call success rate, connection rate | Tool success < 95% or connection drop > 5% |
| Cost | Token consumption, per-session cost | >2x baseline spend |
Configuring Alert Thresholds
Set alerts that catch real problems without generating noise:
| Alert | Warning | Critical | Rationale |
|---|---|---|---|
| P90 latency | >3.0s | >3.5s | 10% of users experiencing delays |
| P99 latency | >4.0s | >5.0s | Conversation flow breakdown |
| Connection drop rate | >5% | >15% | 15% drop suggests infrastructure issues |
| Intent accuracy (rolling 1h) | <92% | <85% | Sustained degradation, not momentary dips |
| Fallback/escalation rate | >20% | >35% | Rising fallback indicates systematic failure |
Use duration filters—require issues to persist for 5+ minutes before firing alerts. This avoids false alarms from momentary spikes while still catching sustained degradation.
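A duration filter can be as simple as requiring N consecutive breaches before firing. A minimal sketch, assuming one P90 check per minute (so a window of 5 equals a 5-minute filter):

```python
from collections import deque

class DurationFilter:
    """Fire only when a condition holds for `window` consecutive checks."""
    def __init__(self, window: int = 5):
        self.recent = deque(maxlen=window)

    def check(self, breached: bool) -> bool:
        self.recent.append(breached)
        return len(self.recent) == self.recent.maxlen and all(self.recent)

alert = DurationFilter(window=5)
# P90 samples in seconds; the momentary dip at minute 2 resets the filter
p90_samples = [3.6, 3.2, 3.7, 3.8, 3.9, 3.6, 3.7]
fired = [alert.check(p90 > 3.5) for p90 in p90_samples]
# Fires only on the final check, after five sustained breaches
```

Monitoring stacks like Prometheus implement the same idea declaratively (a `for:` clause on the alert rule); the logic is identical.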
For a comprehensive guide to monitoring voice agent outages and the 4-Layer Monitoring Framework, see How to Monitor Voice Agent Outages in Real Time.
Custom LLM-as-Judge Scorers
Define business-specific quality evaluators that run on sampled production conversations:
```python
SCORER_DEFINITIONS = {
    "empathy_check": {
        "description": "Evaluate whether agent shows appropriate empathy",
        "rubric": """Score 1-5:
            5: Acknowledges emotion, validates concern, offers help
            3: Acknowledges issue but skips emotional validation
            1: Ignores emotional context entirely""",
        "alert_threshold": 2.5,
    },
    "compliance_adherence": {
        "description": "Verify agent follows regulatory requirements",
        "rubric": """Score 1-5:
            5: All required disclosures made, no unauthorized promises
            3: Minor omissions in required language
            1: Missing critical disclosures or unauthorized commitments""",
        "alert_threshold": 4.0,
    },
}
```
Run these scorers on 5-10% of production conversations. Alert when rolling averages drop below thresholds over a 1-hour window.
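Sampling can be made deterministic by hashing the session ID, so a given session is always either scored or skipped regardless of which worker handles it. A sketch (the 7.5% rate is an illustrative midpoint of the 5-10% range):

```python
import hashlib

def should_score(session_id: str, sample_pct: float = 7.5) -> bool:
    """Deterministically sample ~sample_pct% of sessions for judge scoring."""
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 1000
    return bucket < sample_pct * 10

# Over many sessions the sample rate converges to the configured percentage
sampled = sum(should_score(f"session-{i}") for i in range(10_000))
```

Deterministic sampling also means a flagged session can be re-scored later with an updated rubric and compared apples-to-apples.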
Cross-Call Pattern Detection
Individual call reviews miss systemic issues. Aggregate analysis reveals patterns:
| Pattern | Detection Method | Example |
|---|---|---|
| Time-of-day degradation | Hourly latency heatmaps | LLM provider throttling during business hours |
| Geographic performance variance | Per-region P90 breakdown | Higher ASR errors in specific regions |
| Conversation loops | Repeated intent classification per session | Agent asking the same question three times |
| Silent failures | Task completion vs. user satisfaction gap | Task marked complete but user called back |
Build dashboards that surface these patterns automatically. A 5% task completion drop that only affects users calling between 2 and 4 PM EST would be invisible in daily aggregates.
For foundational concepts on production monitoring strategy, see An Intro to Production Monitoring for Voice Agents.
Building Continuous Testing Pipelines
CI/CD Integration Strategies
Gate deployments with automated quality checks:
```yaml
# .github/workflows/voice-agent-deploy.yml
name: Voice Agent Deploy Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run text-only regression suite
        run: pytest tests/ -m unit --tb=short -q
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

  webrtc-tests:
    needs: unit-tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run WebRTC validation suite
        run: pytest tests/ -m webrtc --tb=short -q
        env:
          LIVEKIT_URL: ${{ secrets.LIVEKIT_URL }}
          LIVEKIT_API_KEY: ${{ secrets.LIVEKIT_API_KEY }}
          LIVEKIT_API_SECRET: ${{ secrets.LIVEKIT_API_SECRET }}

  deploy:
    needs: [unit-tests, webrtc-tests]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to production
        run: ./deploy.sh
```
Run text-only tests on every PR. Run full WebRTC tests on merge to main. Block deploys when pass rates drop below 95%.
Automated Test Generation
Generate test cases from production conversations to expand coverage:
1. Sample low-scoring conversations from production monitoring
2. Extract the conversation transcript and expected outcomes
3. Convert to regression test format with assertions
4. Add to regression suite for continuous validation
```python
def generate_test_from_production(conversation_log: dict) -> dict:
    """Convert a production conversation into a regression test case."""
    return {
        "id": f"prod-{conversation_log['session_id'][:8]}",
        "name": f"Production failure: {conversation_log['failure_reason']}",
        "source": f"production-{conversation_log['timestamp'][:10]}",
        "conversation": conversation_log["turns"],
        "expected": {
            "task_completed": conversation_log["expected_outcome"],
            "max_turns": len(conversation_log["turns"]) + 2,
        },
    }
```
This creates a flywheel: production failures improve the test suite, which prevents future failures, which improves production quality.
A/B Testing in Production
Run parallel agent versions to measure the impact of changes with statistical confidence:
| Parameter | Recommendation |
|---|---|
| Minimum sample per variant | 1,000 conversations |
| Statistical confidence target | 95% |
| Key comparison metrics | Task completion, P90 latency, user satisfaction |
| Maximum test duration | 2 weeks |
| Traffic split | 50/50 for fastest results, 90/10 for lower risk |
Route traffic based on session ID hash for consistent assignment. Never switch a user mid-conversation between agent versions.
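Hash-based assignment looks like this (a sketch; the 90/10 split and variant names are illustrative):

```python
import hashlib

def assign_variant(session_id: str, candidate_pct: int = 10) -> str:
    """Stable 90/10 split: the same session always lands in the same variant."""
    bucket = int(hashlib.md5(session_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < candidate_pct else "baseline"

# Assignment depends only on the session ID, so it never changes mid-conversation
v1 = assign_variant("session-abc")
v2 = assign_variant("session-abc")  # always equal to v1

# Over many sessions the split converges to the configured percentages
n_candidate = sum(assign_variant(f"s{i}") == "candidate" for i in range(10_000))
```

Using a hash rather than random assignment also makes experiments reproducible when you re-analyze logged sessions later.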
Feedback Loop Implementation
Close the loop between production monitoring and offline testing:
```text
Production conversations
  → Score with LLM-as-judge (online)
  → Flag low-scoring sessions
  → Extract as test cases
  → Add to regression suite (offline)
  → Run on next deploy candidate
  → Deploy improved agent
  → Monitor production...
```
This continuous improvement cycle means your test coverage grows organically from real production scenarios rather than hypothetical test cases.
Common Failure Modes and Debugging
Diagnosing Cascading Failures
Voice agent failures rarely have a single root cause. Use multi-layer correlation to trace the cascade:
| Symptom | Layer 1 Check | Layer 2 Check | Layer 3 Check |
|---|---|---|---|
| Wrong response | ASR transcript accuracy | Intent classification | LLM prompt/context |
| High latency | Per-component latency breakdown | Network path analysis | Provider rate limits |
| User hangs up | TTFT metrics | Turn detection timing | Audio quality scores |
| Repeated questions | Context window management | Memory/state handling | Tool call failures |
Example cascade: Audio degradation (MOS drops to 3.2) causes ASR word error rate to spike to 12%, which causes intent misclassification in 30% of turns, which causes the LLM to generate irrelevant responses, which causes users to interrupt, which causes further ASR errors due to overlapping speech. The root cause is audio quality, but the symptom is "agent gives wrong answers."
Handling Real-World Edge Cases
Test scenarios that only happen in production:
| Edge Case | Test Approach | Expected Behavior |
|---|---|---|
| User interrupts mid-response | Synthetic barge-in at random points | Agent stops, listens, responds to new input |
| Connection drops for 3 seconds | Network simulation with packet loss | Agent resumes or gracefully re-establishes |
| Background noise spike | Inject noise at varying SNR levels | ASR degrades gracefully, agent asks to repeat |
| Mid-conversation context switch | User changes topic abruptly | Agent acknowledges pivot, updates context |
| Silence for 15+ seconds | No user input after agent prompt | Agent re-prompts once, then offers alternatives |
For in-depth coverage of WebRTC testing for interruptions and turn-taking, see How to Test Voice Agents Built with LiveKit.
Latency and Timing Issues
When end-to-end latency exceeds the 3.5-second P90 target, decompose by component:
```text
Total P90: 4.2s (OVER TARGET)
├── STT:     310ms  (OK, budget: 300ms)
├── LLM:     2.9s   (HIGH, budget: 2.0s)   ← Root cause
├── TTS:     280ms  (OK, budget: 300ms)
└── Network: 710ms  (HIGH, budget: 400ms)  ← Contributing factor
```
Common latency root causes:
| Component | Common Cause | Fix |
|---|---|---|
| STT | Long utterances, poor audio | Streaming transcription, noise filtering |
| LLM | Large context window, complex prompts | Prompt optimization, context pruning |
| TTS | Long responses, cold starts | Response chunking, connection pooling |
| Network | Geographic distance, routing | Edge deployment, CDN for static assets |
For detailed latency optimization techniques, see How to Optimize Latency in Voice Agents.
Audio Quality Problems
Identify audio issues that reduce ASR accuracy:
| Issue | Detection | Impact on WER | Mitigation |
|---|---|---|---|
| Reverberation | Room impulse response analysis | +3-8% | Echo cancellation, derev processing |
| Background noise | SNR measurement | +2-5% at -10dB | Noise suppression, gain control |
| Codec artifacts | Bitrate monitoring | +1-3% | Higher bitrate encoding |
| Packet loss | WebRTC stats | +2-6% at 3% loss | FEC, jitter buffer tuning |
Test ASR accuracy with synthetic impulse responses during development to catch reverberation issues before deployment. Production environments (call centers, cars, outdoor spaces) introduce acoustic challenges that clean test audio never exercises.
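Mixing noise at a target SNR, as in the ASR conditions table earlier, follows directly from the signal-to-noise power ratio. A pure-Python sketch with toy sine tones standing in for speech and office noise (real test harnesses would operate on recorded audio arrays):

```python
import math

def mix_at_snr(signal: list[float], noise: list[float], snr_db: float) -> list[float]:
    """Scale `noise` so the mix has the requested signal-to-noise ratio."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    # SNR_dB = 10*log10(Ps / (g^2 * Pn))  ->  solve for the noise gain g
    gain = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(signal, noise)]

# One second of toy 8 kHz tones standing in for speech and background noise
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
noise = [0.3 * math.sin(2 * math.pi * 120 * t / 8000) for t in range(8000)]
noisy = mix_at_snr(signal, noise, snr_db=-10)  # street-noise condition
```

Sweeping `snr_db` from -20 down to -10 reproduces the office-noise and street-noise rows of the WER impact table.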
Best Practices and Implementation Roadmap
Starting Small
Begin with a focused test suite before scaling:
1. Curate 50-100 conversations representing your core use cases
2. Define pass/fail criteria for each conversation type
3. Run offline evaluation against this dataset on every deploy
4. Track 3 key metrics in production: P90 latency, task completion rate, WER
This baseline takes days to set up and immediately catches the most common failures.
Scaling Your Testing Practice
| Phase | Focus | Tools | Effort |
|---|---|---|---|
| 1. Foundation | Text-only regression + 3 production metrics | pytest + basic monitoring | Low |
| 2. Audio coverage | Add WebRTC testing for latency and interruptions | Hamming + LiveKit testing | Medium |
| 3. Load validation | Concurrent capacity testing | lk perf + synthetic callers | Medium |
| 4. Full observability | Distributed tracing + automated scorers | OpenTelemetry + LLM-as-judge | High |
Each phase builds on the previous one. Do not skip to phase 4 without the foundation of phase 1.
Tool Selection Criteria
When evaluating voice agent testing platforms, weight capabilities based on production impact:
| Capability | Weight | What to Look For |
|---|---|---|
| Quality metric coverage | 30% | WER, MOS, task completion, latency percentiles |
| Production monitoring | 25% | Continuous scoring, alerting, drift detection |
| CI/CD integration | 20% | GitHub Actions/Jenkins support, deploy gating |
| Load testing | 15% | Concurrent session simulation, realistic audio |
| Ease of setup | 10% | Time to first test run, documentation quality |
Prioritize platforms that cover both offline evaluation and online monitoring. Tools that only do one or the other leave gaps that production will expose.
Building Internal Expertise
Voice agent quality is not a single team's responsibility:
| Stakeholder | Responsibility | Feedback Channel |
|---|---|---|
| Engineering | Instrumentation, CI/CD integration, incident response | Automated alerts, trace review |
| QA | Test case curation, regression suite maintenance | Weekly quality reports |
| Product | Success criteria definition, user experience standards | Customer satisfaction data |
| Operations | Capacity planning, cost monitoring, vendor management | Monthly capacity reviews |
Establish weekly quality review meetings where engineering, QA, and product review production metrics together. The feedback loop between "what customers experience" and "what tests validate" should be as short as possible.
Conclusion
Voice agent failures cascade. A small ASR degradation propagates through intent classification, response generation, and audio synthesis—each layer amplifying the original error. Without observability across all four layers, you spend hours debugging symptoms instead of root causes.
The five-pillar framework—evaluation, regression, load, observability, alerting—provides complete coverage. Start with the foundation: 50-100 curated test cases, automated regression suites blocking deploys, and three production metrics (P90 latency under 3.5 seconds, task completion above 90%, WER under 5%). Scale from there based on what production monitoring reveals.
Every production failure should make your test suite stronger. Every test suite improvement should prevent the next production failure. That flywheel is the difference between voice agents that work in demos and voice agents that work in production.

