Voice agents built on LiveKit fail differently than text-based AI systems. A chatbot with a 2-second response delay is mildly annoying. A voice agent with the same delay creates dead air that makes users hang up. When your pipeline spans ASR, NLU, LLM, and TTS, a failure in any single layer cascades through the entire conversation. Testing and monitoring voice agents requires a fundamentally different approach.
This guide covers the five-pillar framework for production-grade LiveKit voice agent quality: evaluation, regression testing, load testing, observability, and alerting.
Methodology Note: The testing framework, metrics, and thresholds in this guide are derived from Hamming's analysis of 4M+ production LiveKit voice agent deployments across 10K+ voice agents (2025-2026). Thresholds and benchmarks represent performance patterns across customer support, healthcare, and enterprise deployments. Your targets may vary based on use case and latency tolerance.
TL;DR: Five-Pillar Testing Framework
| Pillar | Purpose | Key Tool |
|---|---|---|
| Evaluation | Measure quality across ASR, NLU, LLM, TTS | Curated datasets + LLM-as-judge scorers |
| Regression | Prevent quality degradation on changes | Behavioral baselines + automated suites |
| Load | Validate scalability under traffic | lk perf agent-load-test + synthetic calls |
| Observability | Diagnose production issues | OpenTelemetry distributed tracing |
| Alerting | Catch problems before users complain | P90/P99 threshold monitoring |
Key targets: P90 end-to-end latency under 3.5s, P99 under 5s, WER under 5%, task completion above 90%.
Core insight: Offline evaluation catches logic errors. Online monitoring catches production-only failures. You need both, and they need to share metric definitions.
Quick filter: If you are running LiveKit agents in production without automated regression suites and distributed tracing, this guide covers the gaps that will eventually cause customer complaints.
Understanding Voice Agent Testing Fundamentals
The Voice Agent Testing Stack
Voice agents are not single-component systems. They are four interdependent layers where failures cascade:
User speaks → ASR (transcription) → NLU (intent) → LLM (response) → TTS (audio) → User hears
| Layer | Function | Failure Impact |
|---|---|---|
| ASR (Speech-to-Text) | Converts audio to text | Wrong transcription → wrong intent → wrong response |
| NLU (Intent Recognition) | Classifies user intent | Misrouted conversation → task failure |
| LLM (Response Generation) | Produces reply content | Hallucination, context loss, wrong tool call |
| TTS (Text-to-Speech) | Synthesizes audio output | Robotic voice, pronunciation errors, latency |
A 3% increase in ASR word error rate does not stay contained. It propagates through NLU, causes the LLM to generate responses to the wrong intent, and the user hears a confidently wrong answer. Testing each layer in isolation is necessary but insufficient—you need full-stack validation.
Offline vs. Online Evaluation
Offline evaluation runs before deployment against curated datasets with known-correct outcomes. You control the inputs, the environment, and the expected outputs. This catches logic errors, prompt regressions, and intent classification failures.
Online evaluation monitors live production traffic continuously. It samples real conversations and scores them against quality criteria. This catches failures that only appear in production: audio degradation from cellular connections, accent-driven ASR errors, network jitter disrupting turn-taking, and model drift over time.
| Dimension | Offline Evaluation | Online Monitoring |
|---|---|---|
| When it runs | Pre-deployment | Continuous in production |
| Data source | Curated test datasets | Sampled live traffic |
| Controls | Full (inputs, environment) | None (real-world conditions) |
| Catches | Logic errors, regressions, prompt issues | Audio quality, network effects, drift |
| Latency | Not representative | Actual production latency |
| Scale | 50-500 test cases | Thousands of conversations daily |
Both feed into the same metrics pipeline. If your offline evaluation measures task completion rate at 95% but production monitoring shows 82%, the gap is your testing blind spot.
Key Quality Metrics for Voice Agents
| Metric | Target | Alert Threshold | Why It Matters |
|---|---|---|---|
| Word Error Rate (WER) | <5% | >8% | Foundation of everything—bad transcription cascades |
| Mean Opinion Score (MOS) | >4.3/5 | <3.8 | Perceived audio quality drives user trust |
| End-to-End Latency P90 | <3.5s | >3.5s | 10% of users experiencing significant delays |
| End-to-End Latency P99 | <5s | >5s | Responses over 5 seconds feel broken |
| TTFT (Time to First Token) | <800ms | >1200ms | User-perceived response initiation |
| Task Completion Rate | >90% | <85% | Whether the agent actually solves the problem |
| Interruption Rate | <15% | >25% | Turn detection quality indicator |
For a deeper dive into latency benchmarks and optimization strategies, see Voice AI Latency: What's Fast, What's Slow, and How to Fix It.
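As a concrete reference for how the P90/P99 targets above are computed, here is a minimal nearest-rank percentile sketch over per-turn latency samples (the values are illustrative, not from a real deployment):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # ceil(pct * n / 100) gives the 1-based rank; clamp to a valid index
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Per-turn end-to-end latencies in seconds (illustrative)
latencies = [2.1, 2.4, 2.6, 2.8, 3.0, 3.1, 3.2, 3.4, 3.9, 5.2]
p90 = percentile(latencies, 90)  # 3.9s -> breaches the 3.5s P90 target
p99 = percentile(latencies, 99)  # 5.2s -> breaches the 5s P99 target
```

Averages hide exactly the tail behavior this table alerts on, which is why the targets are expressed as percentiles rather than means.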
Setting Up Your LiveKit Testing Environment
Installing LiveKit Testing Dependencies
Start with the core testing stack:
```bash
pip install livekit-agents pytest pytest-asyncio
```
For observability instrumentation, add OpenTelemetry:
```bash
pip install opentelemetry-api opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation
```
For load testing, install the LiveKit CLI:
```bash
# macOS
brew install livekit-cli

# Linux
curl -sSL https://get.livekit.io/cli | bash
```
Configuring Test Infrastructure
Set up environment variables for your test environment:
```bash
# .env.test
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your-test-api-key
LIVEKIT_API_SECRET=your-test-api-secret
OPENAI_API_KEY=your-openai-key  # or other LLM provider
```
Configure pytest for async support:
```ini
[pytest]
asyncio_mode = auto
testpaths = tests
python_files = test_*.py
markers =
    unit: Text-only logic tests
    webrtc: Full audio pipeline tests
    load: Concurrent capacity tests
```
Separate test markers let you run fast unit tests on every commit and reserve expensive WebRTC and load tests for deploy candidates.
Creating Your First Test Suite
LiveKit's pytest helpers validate agent behavior in text mode—no audio pipeline required:
```python
import pytest

from your_agent import create_agent


@pytest.mark.unit
@pytest.mark.asyncio
async def test_greeting_and_intent_routing():
    """Verify agent correctly identifies and routes initial intent."""
    agent = create_agent()
    response = await agent.process_text(
        "Hi, I need to reschedule my appointment for next week"
    )
    # Agent should recognize scheduling intent
    assert any(keyword in response.lower() for keyword in [
        "reschedule", "appointment", "when", "available", "date"
    ])
    # Should ask clarifying question
    assert "?" in response


@pytest.mark.unit
@pytest.mark.asyncio
async def test_multi_turn_context_retention():
    """Verify agent maintains context across conversation turns."""
    agent = create_agent()
    await agent.process_text("I want to book a table for 4 people")
    await agent.process_text("Actually, make that 6")
    response = await agent.process_text("What reservation details do you have?")
    # Agent should remember the corrected party size
    assert "6" in response or "six" in response.lower()
```
For comprehensive coverage of LiveKit's built-in testing helpers and pytest patterns, see Testing LiveKit Voice Agents: Unit, Scenario, Load and Production Guide.
Evaluation Frameworks for Voice Agents
Conversational Quality Metrics
Beyond pass/fail assertions, measure the qualitative aspects of conversation:
| Metric | How to Measure | Target |
|---|---|---|
| Turn-taking latency | Time from user silence to agent speech start | <500ms end-of-utterance delay |
| Interruption handling | Agent stops and listens when user barges in | Recovery within 1 turn |
| Time to first word | Duration from user finish to agent first audio | P90 under 2.5s |
| Talk-to-listen ratio | Agent speaking time vs. user speaking time | 40-60% agent, context-dependent |
Track these across conversations to detect systematic issues. A talk-to-listen ratio above 70% often indicates the agent is monologuing rather than having a dialogue.
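The talk-to-listen ratio can be computed directly from per-turn speech durations. A minimal sketch (the turn schema here is illustrative, not a LiveKit API):

```python
def talk_to_listen_ratio(turns: list[dict]) -> float:
    """Fraction of total speech time taken by the agent."""
    agent_ms = sum(t["duration_ms"] for t in turns if t["speaker"] == "agent")
    user_ms = sum(t["duration_ms"] for t in turns if t["speaker"] == "user")
    total = agent_ms + user_ms
    return agent_ms / total if total else 0.0

turns = [
    {"speaker": "agent", "duration_ms": 4000},
    {"speaker": "user", "duration_ms": 2500},
    {"speaker": "agent", "duration_ms": 3500},
]
ratio = talk_to_listen_ratio(turns)  # 7500 / 10000 = 0.75 -> monologue warning
```

Computed per conversation and aggregated, this surfaces monologuing agents that individual call reviews tend to miss.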
Task Completion Assessment
Task completion is the metric that matters most to the business. Define clear success criteria for each conversation type:
```python
TASK_DEFINITIONS = {
    "appointment_booking": {
        "required_fields": ["date", "time", "service_type"],
        "success_criteria": "booking_confirmed",
        "max_turns": 8,
    },
    "order_status": {
        "required_fields": ["order_id"],
        "success_criteria": "status_delivered",
        "max_turns": 4,
    },
    "complaint_resolution": {
        "required_fields": ["issue_description"],
        "success_criteria": "resolution_offered",
        "max_turns": 12,
    },
}
```
Measure both binary completion (did the task succeed?) and efficiency (how many turns did it take?). An agent that completes bookings in 15 turns when the baseline is 6 has a problem even if the task technically succeeds.
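A sketch of how both checks might be combined, using a task definition shaped like the ones above (the conversation record schema is hypothetical):

```python
TASK = {
    "required_fields": ["date", "time", "service_type"],
    "success_criteria": "booking_confirmed",
    "max_turns": 8,
}

def assess_task(conversation: dict, task: dict) -> dict:
    """Score one conversation for binary completion and turn efficiency."""
    fields_ok = all(f in conversation["collected_fields"]
                    for f in task["required_fields"])
    completed = fields_ok and conversation["outcome"] == task["success_criteria"]
    return {
        "completed": completed,
        "efficient": completed and conversation["turn_count"] <= task["max_turns"],
        "turn_count": conversation["turn_count"],
    }

convo = {
    "collected_fields": {"date": "2025-03-04", "time": "10:00",
                         "service_type": "cleaning"},
    "outcome": "booking_confirmed",
    "turn_count": 15,  # task succeeded, but far over the 8-turn budget
}
result = assess_task(convo, TASK)  # completed but not efficient
```

Tracking `efficient` separately from `completed` is what catches the 15-turn booking described above.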
ASR and TTS Accuracy Testing
Test speech recognition under realistic conditions, not just clean studio audio:
| Condition | Expected WER Impact | Test Approach |
|---|---|---|
| Quiet environment | Baseline (<3%) | Standard test audio |
| Background office noise | +1-2% | Mix noise at -20dB SNR |
| Street/traffic noise | +3-5% | Mix noise at -10dB SNR |
| Regional accents | +2-4% | Diverse speaker test set |
| Non-native speakers | +3-6% | Accented speech samples |
| Phone-quality audio (8kHz) | +1-3% | Downsample test audio |
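The WER figures above are measured with the standard word-level edit distance (substitutions, insertions, and deletions over the reference word count). A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Classic dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

score = wer("reschedule my appointment for tuesday",
            "reschedule my appointment for thursday")  # 1 sub / 5 words = 0.2
```

Note that WER can exceed 1.0 when the hypothesis inserts many spurious words, which is exactly what heavy background noise produces.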
For TTS, validate pronunciation of domain-specific terms, proper names, numbers, and dates. A medical scheduling agent that mispronounces "ophthalmologist" erodes user confidence regardless of task accuracy.
Intent Classification Validation
Intent routing determines conversation flow. Test both correct classification and recovery from misclassification:
```python
@pytest.mark.asyncio
async def test_intent_recovery_after_misrecognition():
    """Verify agent recovers when initial intent is misclassified."""
    agent = create_agent()
    # Ambiguous input that might be misclassified
    response1 = await agent.process_text("I want to change my plan")
    # User corrects/clarifies
    response2 = await agent.process_text(
        "No, not my subscription plan. I want to change my flight plan."
    )
    # Agent should pivot to flight-related handling
    assert any(keyword in response2.lower() for keyword in [
        "flight", "itinerary", "travel", "departure"
    ])
```
Compliance and Safety Guardrails
Implement automated scorers that run on every test case and sampled production conversations:
| Guardrail | What It Catches | Implementation |
|---|---|---|
| Prompt injection detection | Users attempting to override system prompt | Pattern matching + LLM classifier |
| Policy violation scoring | Agent making unauthorized promises or commitments | LLM-as-judge with policy rubric |
| PII handling | Agent improperly repeating or storing sensitive data | Regex + entity detection |
| Safety boundary enforcement | Agent engaging with harmful or off-topic requests | Topic classifier + refusal verification |
These scorers should run automatically in both offline evaluation and online monitoring. A compliance failure in production is worse than any latency spike.
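A minimal sketch of the regex half of a PII scorer (patterns illustrative; production systems typically pair these with entity detection, as the table notes):

```python
import re

# Illustrative patterns only -- real deployments need locale-aware rules
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def detect_pii(text: str) -> list[str]:
    """Return the PII categories found in an agent response."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

hits = detect_pii("Sure, I have your card ending 4111 1111 1111 1111 on file.")
# hits == ["credit_card"] -> the agent repeated a full card number
```

Run this on every agent response in both offline suites and sampled production traffic; a single hit should fail the test case outright.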
Implementing Regression Testing
Building Behavioral Baselines
Before changing anything, establish your baseline. This becomes the reference point for detecting regressions:
Quantitative baseline:
| Metric | Baseline Value | Acceptable Regression |
|---|---|---|
| Task completion rate | 93% | No more than 3% drop |
| P90 end-to-end latency | 3.2s | No more than 300ms increase |
| WER | 4.1% | No more than 1% increase |
| Intent accuracy | 96% | No more than 2% drop |
| Average turns to completion | 5.8 | No more than 1.5 turn increase |
Qualitative baseline: Context preservation across turns, coherent multi-step reasoning, appropriate tone and register. These require LLM-as-judge evaluation rather than simple metric thresholds.
Detecting Model Drift
Model drift happens silently. ASR provider updates, LLM fine-tuning, or even TTS voice model changes can shift behavior without any code change on your side:
```python
@pytest.mark.asyncio
async def test_behavioral_consistency():
    """Run weekly to detect drift in agent behavior."""
    agent = create_agent()
    results = []
    for scenario in BASELINE_SCENARIOS:
        response = await agent.process_text(scenario["input"])
        score = evaluate_response(response, scenario["expected"])
        results.append(score)
    avg_score = sum(results) / len(results)
    baseline_score = load_baseline_score()
    # Flag if cumulative score drops more than 5%
    assert avg_score >= baseline_score * 0.95, (
        f"Behavioral drift detected: {avg_score:.2f} vs baseline {baseline_score:.2f}"
    )
```
Schedule this weekly. Cumulative 1% degradations per month become a 12% regression over a year that nobody notices until customer complaints spike.
Prompt Version Testing
Every prompt change gets an A/B comparison against the regression suite before deployment:
```python
@pytest.mark.asyncio
async def test_prompt_upgrade_no_regression():
    """Compare candidate prompt against production baseline."""
    test_cases = load_regression_suite()
    baseline_agent = create_agent(prompt_version="production")
    candidate_agent = create_agent(prompt_version="candidate")
    baseline_pass = 0
    candidate_pass = 0
    for case in test_cases:
        b_response = await baseline_agent.process_text(case["input"])
        c_response = await candidate_agent.process_text(case["input"])
        if passes_criteria(b_response, case["expected"]):
            baseline_pass += 1
        if passes_criteria(c_response, case["expected"]):
            candidate_pass += 1
    baseline_rate = baseline_pass / len(test_cases)
    candidate_rate = candidate_pass / len(test_cases)
    assert candidate_rate >= baseline_rate - 0.05, (
        f"Prompt regression: {baseline_rate:.1%} → {candidate_rate:.1%}"
    )
```
For a comprehensive guide to regression testing patterns and production failure conversion workflows, see AI Voice Agent Regression Testing.
Automated Regression Suites
Integrate regression checks as deployment gates:
```yaml
# .github/workflows/voice-agent-ci.yml
- name: Run regression suite
  run: |
    pytest tests/regression/ -m "not load" --tb=short -q
    if [ $? -ne 0 ]; then
      echo "Regression suite failed - blocking deploy"
      exit 1
    fi
```
Every production failure becomes a permanent test case. Over time your regression suite becomes an increasingly complete specification of correct agent behavior.
Load and Scalability Testing
Using lk perf agent-load-test
The LiveKit CLI includes built-in load testing for agent scalability:
```bash
# Simulate 50 concurrent agent rooms
lk perf agent-load-test \
  --url wss://your-project.livekit.cloud \
  --api-key $LIVEKIT_API_KEY \
  --api-secret $LIVEKIT_API_SECRET \
  --num-rooms 50 \
  --duration 300s \
  --publish-audio
```
This creates concurrent rooms with synthetic participants, measuring agent join time, response latency, and resource consumption. Start with 10 rooms and scale up to identify the inflection point where performance degrades.
Stress Testing Strategies
Test at 2x your expected peak capacity to find breaking points before they find your users:
| Test Scenario | Configuration | What It Reveals |
|---|---|---|
| Sustained load | 100 concurrent rooms for 30 minutes | Memory leaks, connection pool exhaustion |
| Spike test | Ramp from 10 to 200 rooms in 60 seconds | Autoscaling responsiveness |
| Endurance test | 50 rooms for 4 hours | Long-running stability issues |
| Chaos test | Kill agent workers mid-conversation | Recovery and failover behavior |
For stress tests exceeding 100 concurrent sessions, use synthetic callers with realistic voice characteristics and background noise to simulate production conditions. Clean synthetic audio does not stress-test ASR the same way real-world audio does.
Identifying Performance Bottlenecks
Track P90 and P99 latency at each pipeline stage independently during load tests:
```text
Load Test Results (100 concurrent rooms)

Component Latency Breakdown:
├── STT Processing:   P90: 280ms   P99: 450ms   ← Within budget
├── LLM Inference:    P90: 1800ms  P99: 3200ms  ← Dominant bottleneck
├── TTS Synthesis:    P90: 220ms   P99: 380ms   ← Within budget
└── Network/Routing:  P90: 180ms   P99: 320ms   ← Within budget

Total End-to-End:     P90: 3100ms  P99: 4800ms  ← Under 3.5s P90 target
```
When P90 exceeds 3.5 seconds, identify which component is the bottleneck. In most deployments, LLM inference dominates at 60-70% of total latency. Optimization efforts should target the largest contributor first.
For detailed load testing methodology and the 3-Pillar testing framework, see Testing Voice Agents: Load, Regression, and A/B Evaluation for Production Reliability.
Capacity Planning
Determine your maximum concurrent call capacity by finding the degradation point:
| Concurrent Rooms | P90 Latency | P99 Latency | Task Completion | Status |
|---|---|---|---|---|
| 25 | 2.8s | 3.9s | 94% | Healthy |
| 50 | 3.1s | 4.3s | 93% | Healthy |
| 75 | 3.4s | 4.8s | 91% | Warning |
| 100 | 3.8s | 5.6s | 87% | Degraded |
| 150 | 4.9s | 7.2s | 78% | Critical |
Set your autoscaling threshold at 70% of the degradation point. If performance degrades at 100 rooms, trigger scaling at 70 concurrent rooms to maintain headroom.
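The 70% rule can be applied mechanically to load-test results. A sketch using sweep data shaped like the table above:

```python
# (rooms, p90_seconds, task_completion) from a load-test sweep, as in the table
sweep = [(25, 2.8, 0.94), (50, 3.1, 0.93), (75, 3.4, 0.91),
         (100, 3.8, 0.87), (150, 4.9, 0.78)]

def degradation_point(sweep, p90_target: float = 3.5):
    """First room count where P90 latency exceeds the target."""
    for rooms, p90, _completion in sweep:
        if p90 > p90_target:
            return rooms
    return None  # never degraded within the tested range

def autoscale_threshold(sweep, headroom: float = 0.70):
    """Trigger scaling at 70% of the degradation point."""
    point = degradation_point(sweep)
    return int(point * headroom) if point else None

threshold = autoscale_threshold(sweep)  # degrades at 100 rooms -> scale at 70
```

Re-run the sweep after any pipeline change; the degradation point moves when component latencies do.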
Logging and Tracing Architecture
OpenTelemetry Integration
Implement distributed tracing that captures the complete voice turn lifecycle with a unified trace ID:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter


def configure_tracing():
    """Call once at worker startup before any spans are created."""
    provider = TracerProvider()
    processor = BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317")
    )
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)


tracer = trace.get_tracer("livekit-voice-agent")


# transcribe, generate_response, execute_tool, and synthesize are your
# application's pipeline helpers; session_id and model_name come from session state
async def process_voice_turn(audio_input):
    with tracer.start_as_current_span("voice_turn") as turn_span:
        turn_span.set_attribute("session.id", session_id)

        with tracer.start_as_current_span("stt_processing") as stt_span:
            transcript = await transcribe(audio_input)
            stt_span.set_attribute("stt.confidence", transcript.confidence)
            stt_span.set_attribute("stt.word_count", len(transcript.words))

        with tracer.start_as_current_span("llm_inference") as llm_span:
            llm_span.set_attribute("llm.model", model_name)
            response = await generate_response(transcript.text)
            llm_span.set_attribute("llm.tokens", response.token_count)

        if response.requires_tool:
            with tracer.start_as_current_span("tool_call") as tool_span:
                tool_span.set_attribute("tool.name", response.tool_name)
                result = await execute_tool(response.tool_call)
                tool_span.set_attribute("tool.success", result.success)

        with tracer.start_as_current_span("tts_synthesis") as tts_span:
            audio = await synthesize(response.text)
            tts_span.set_attribute("tts.characters", len(response.text))
            tts_span.set_attribute("tts.duration_ms", audio.duration_ms)
```
This produces traces like:
```text
voice_turn (3.2s total)
├── stt_processing (210ms)
├── llm_inference (2.4s)  ← bottleneck visible
├── tool_call (380ms)
└── tts_synthesis (190ms)
```
For implementation details on Prometheus metrics collection and Grafana dashboard configuration, see LiveKit Agent Monitoring in Production: Prometheus, Grafana and Alerts.
Structured Logging Best Practices
Log full conversation context with every turn for post-incident debugging:
```python
import structlog

logger = structlog.get_logger()


async def log_turn(session_id: str, turn_number: int, turn_data: dict):
    logger.info(
        "voice_turn_completed",
        session_id=session_id,
        turn_number=turn_number,
        user_input=turn_data["transcript"],
        agent_response=turn_data["response"],
        stt_confidence=turn_data["stt_confidence"],
        stt_latency_ms=turn_data["stt_latency_ms"],
        llm_latency_ms=turn_data["llm_latency_ms"],
        tts_latency_ms=turn_data["tts_latency_ms"],
        total_latency_ms=turn_data["total_latency_ms"],
        tool_calls=turn_data.get("tool_calls", []),
        intent_detected=turn_data.get("intent"),
        interruption_occurred=turn_data.get("interrupted", False),
    )
```
Structure logs so you can query by session, by latency range, by error type, or by intent. When a customer reports a bad experience, you should be able to reconstruct the complete conversation within minutes.
Multi-Agent Tracing
When agents delegate tasks or coordinate with other agents, trace the full interaction graph:
```text
Primary Agent (session-abc)
├── voice_turn_1 (greeting)
├── voice_turn_2 (intent: transfer)
│   └── agent_handoff
│       ├── context_transfer (120ms)
│       └── Secondary Agent (session-abc-transfer)
│           ├── voice_turn_1 (pickup)
│           └── voice_turn_2 (resolution)
└── voice_turn_3 (confirmation)
```
Propagate trace context across agent boundaries so you can follow a single conversation through multiple agents, tool calls, and external API interactions.
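In practice you would use OpenTelemetry's propagators to inject and extract context at the handoff. As an illustration of what actually crosses the agent boundary, here is a stdlib-only sketch of the W3C `traceparent` header format (simplified; a real deployment should rely on the OTel propagation API rather than hand-rolling this):

```python
import re

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a W3C traceparent header: version-traceid-spanid-flags."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str):
    """Extract trace context on the receiving agent; None if malformed."""
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        return None
    return {"trace_id": m.group(1),
            "parent_span_id": m.group(2),
            "sampled": m.group(3) == "01"}

# Primary agent attaches the header to the handoff payload...
carrier = {"traceparent": make_traceparent("a" * 32, "b" * 16)}
# ...and the secondary agent continues the same trace from it
ctx = parse_traceparent(carrier["traceparent"])
```

Because both agents share one `trace_id`, the handoff tree above renders as a single trace in your backend instead of two disconnected ones.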
LiveKit Agent Observability
LiveKit Cloud provides native observability features for agent debugging:
- Trace View: Visual timeline showing turn detection, LLM timing, and tool execution per conversation
- Session Recordings: Audio and transcript capture for debugging and compliance review
- Real-time Metrics: WebRTC quality metrics, room health, and participant status
- Synchronized Playback: Listen to audio while viewing the corresponding transcript and trace data side by side
These built-in tools complement your custom instrumentation. Use LiveKit Cloud Dashboard for individual session debugging and your Prometheus/Grafana stack for aggregate monitoring and alerting.
Production Monitoring and Alerting
Real-Time Performance Monitoring
Monitor these metrics continuously in production:
| Metric Category | What to Track | Alert When |
|---|---|---|
| Latency | End-to-end P90, P99, TTFT | P90 > 3.5s or P99 > 5s |
| Audio Quality | ASR WER, TTS MOS | WER > 8% or MOS < 3.8 |
| Conversation | Intent accuracy, interruption rate | Intent accuracy < 90% or interruption > 25% |
| Reliability | Tool call success rate, connection rate | Tool success < 95% or connection drop > 5% |
| Cost | Token consumption, per-session cost | >2x baseline spend |
Configuring Alert Thresholds
Set alerts that catch real problems without generating noise:
| Alert | Warning | Critical | Rationale |
|---|---|---|---|
| P90 latency | >3.0s | >3.5s | 10% of users experiencing delays |
| P99 latency | >4.0s | >5.0s | Conversation flow breakdown |
| Connection drop rate | >5% | >15% | 15% drop suggests infrastructure issues |
| Intent accuracy (rolling 1h) | <92% | <85% | Sustained degradation, not momentary dips |
| Fallback/escalation rate | >20% | >35% | Rising fallback indicates systematic failure |
Use duration filters—require issues to persist for 5+ minutes before firing alerts. This avoids false alarms from momentary spikes while still catching sustained degradation.
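A duration filter can be as simple as requiring N consecutive breaches before firing. A minimal sketch, assuming one P90 check per minute (so a window of 5 equals a 5-minute filter):

```python
from collections import deque

class DurationFilter:
    """Fire only when a condition holds for `window` consecutive checks."""
    def __init__(self, window: int = 5):
        self.recent = deque(maxlen=window)

    def check(self, breached: bool) -> bool:
        self.recent.append(breached)
        return len(self.recent) == self.recent.maxlen and all(self.recent)

alert = DurationFilter(window=5)
# P90 samples in seconds; the momentary dip at minute 2 resets the filter
p90_samples = [3.6, 3.2, 3.7, 3.8, 3.9, 3.6, 3.7]
fired = [alert.check(p90 > 3.5) for p90 in p90_samples]
# Fires only on the final check, after five sustained breaches
```

Monitoring stacks like Prometheus implement the same idea declaratively (a `for:` clause on the alert rule); the logic is identical.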
For a comprehensive guide to monitoring voice agent outages and the 4-Layer Monitoring Framework, see How to Monitor Voice Agent Outages in Real Time.
Custom LLM-as-Judge Scorers
Define business-specific quality evaluators that run on sampled production conversations:
```python
SCORER_DEFINITIONS = {
    "empathy_check": {
        "description": "Evaluate whether agent shows appropriate empathy",
        "rubric": """Score 1-5:
            5: Acknowledges emotion, validates concern, offers help
            3: Acknowledges issue but skips emotional validation
            1: Ignores emotional context entirely""",
        "alert_threshold": 2.5,
    },
    "compliance_adherence": {
        "description": "Verify agent follows regulatory requirements",
        "rubric": """Score 1-5:
            5: All required disclosures made, no unauthorized promises
            3: Minor omissions in required language
            1: Missing critical disclosures or unauthorized commitments""",
        "alert_threshold": 4.0,
    },
}
```
Run these scorers on 5-10% of production conversations. Alert when rolling averages drop below thresholds over a 1-hour window.
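Sampling can be made deterministic by hashing the session ID, so a given session is always either scored or skipped regardless of which worker handles it. A sketch (the 7.5% rate is an illustrative midpoint of the 5-10% range):

```python
import hashlib

def should_score(session_id: str, sample_pct: float = 7.5) -> bool:
    """Deterministically sample ~sample_pct% of sessions for judge scoring."""
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 1000
    return bucket < sample_pct * 10

# Over many sessions the sample rate converges to the configured percentage
sampled = sum(should_score(f"session-{i}") for i in range(10_000))
```

Deterministic sampling also means a flagged session can be re-scored later with an updated rubric and compared apples-to-apples.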
Cross-Call Pattern Detection
Individual call reviews miss systemic issues. Aggregate analysis reveals patterns:
| Pattern | Detection Method | Example |
|---|---|---|
| Time-of-day degradation | Hourly latency heatmaps | LLM provider throttling during business hours |
| Geographic performance variance | Per-region P90 breakdown | Higher ASR errors in specific regions |
| Conversation loops | Repeated intent classification per session | Agent asking the same question three times |
| Silent failures | Task completion vs. user satisfaction gap | Task marked complete but user called back |
Build dashboards that surface these patterns automatically. A 5% task completion drop that only affects users calling between 2 and 4 PM EST would be invisible in daily aggregates.
For foundational concepts on production monitoring strategy, see An Intro to Production Monitoring for Voice Agents.
Building Continuous Testing Pipelines
CI/CD Integration Strategies
Gate deployments with automated quality checks:
```yaml
# .github/workflows/voice-agent-deploy.yml
name: Voice Agent Deploy Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run text-only regression suite
        run: pytest tests/ -m unit --tb=short -q
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

  webrtc-tests:
    needs: unit-tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run WebRTC validation suite
        run: pytest tests/ -m webrtc --tb=short -q
        env:
          LIVEKIT_URL: ${{ secrets.LIVEKIT_URL }}
          LIVEKIT_API_KEY: ${{ secrets.LIVEKIT_API_KEY }}
          LIVEKIT_API_SECRET: ${{ secrets.LIVEKIT_API_SECRET }}

  deploy:
    needs: [unit-tests, webrtc-tests]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to production
        run: ./deploy.sh
```
Run text-only tests on every PR. Run full WebRTC tests on merge to main. Block deploys when pass rates drop below 95%.
Automated Test Generation
Generate test cases from production conversations to expand coverage:
1. Sample low-scoring conversations from production monitoring
2. Extract the conversation transcript and expected outcomes
3. Convert to regression test format with assertions
4. Add to regression suite for continuous validation
```python
def generate_test_from_production(conversation_log: dict) -> dict:
    """Convert a production conversation into a regression test case."""
    return {
        "id": f"prod-{conversation_log['session_id'][:8]}",
        "name": f"Production failure: {conversation_log['failure_reason']}",
        "source": f"production-{conversation_log['timestamp'][:10]}",
        "conversation": conversation_log["turns"],
        "expected": {
            "task_completed": conversation_log["expected_outcome"],
            "max_turns": len(conversation_log["turns"]) + 2,
        },
    }
```
This creates a flywheel: production failures improve the test suite, which prevents future failures, which improves production quality.
A/B Testing in Production
Run parallel agent versions to measure the impact of changes with statistical confidence:
| Parameter | Recommendation |
|---|---|
| Minimum sample per variant | 1,000 conversations |
| Statistical confidence target | 95% |
| Key comparison metrics | Task completion, P90 latency, user satisfaction |
| Maximum test duration | 2 weeks |
| Traffic split | 50/50 for fastest results, 90/10 for lower risk |
Route traffic based on session ID hash for consistent assignment. Never switch a user mid-conversation between agent versions.
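Hash-based assignment looks like this (a sketch; the 90/10 split and variant names are illustrative):

```python
import hashlib

def assign_variant(session_id: str, candidate_pct: int = 10) -> str:
    """Stable 90/10 split: the same session always lands in the same variant."""
    bucket = int(hashlib.md5(session_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < candidate_pct else "baseline"

# Assignment depends only on the session ID, so it never changes mid-conversation
v1 = assign_variant("session-abc")
v2 = assign_variant("session-abc")  # always equal to v1

# Over many sessions the split converges to the configured percentages
n_candidate = sum(assign_variant(f"s{i}") == "candidate" for i in range(10_000))
```

Using a hash rather than random assignment also makes experiments reproducible when you re-analyze logged sessions later.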
Feedback Loop Implementation
Close the loop between production monitoring and offline testing:
```text
Production conversations
  → Score with LLM-as-judge (online)
  → Flag low-scoring sessions
  → Extract as test cases
  → Add to regression suite (offline)
  → Run on next deploy candidate
  → Deploy improved agent
  → Monitor production...
```
This continuous improvement cycle means your test coverage grows organically from real production scenarios rather than hypothetical test cases.
Common Failure Modes and Debugging
Diagnosing Cascading Failures
Voice agent failures rarely have a single root cause. Use multi-layer correlation to trace the cascade:
| Symptom | Layer 1 Check | Layer 2 Check | Layer 3 Check |
|---|---|---|---|
| Wrong response | ASR transcript accuracy | Intent classification | LLM prompt/context |
| High latency | Per-component latency breakdown | Network path analysis | Provider rate limits |
| User hangs up | TTFT metrics | Turn detection timing | Audio quality scores |
| Repeated questions | Context window management | Memory/state handling | Tool call failures |
Example cascade: Audio degradation (MOS drops to 3.2) causes ASR word error rate to spike to 12%, which causes intent misclassification in 30% of turns, which causes the LLM to generate irrelevant responses, which causes users to interrupt, which causes further ASR errors due to overlapping speech. The root cause is audio quality, but the symptom is "agent gives wrong answers."
Handling Real-World Edge Cases
Test scenarios that only happen in production:
| Edge Case | Test Approach | Expected Behavior |
|---|---|---|
| User interrupts mid-response | Synthetic barge-in at random points | Agent stops, listens, responds to new input |
| Connection drops for 3 seconds | Network simulation with packet loss | Agent resumes or gracefully re-establishes |
| Background noise spike | Inject noise at varying SNR levels | ASR degrades gracefully, agent asks to repeat |
| Mid-conversation context switch | User changes topic abruptly | Agent acknowledges pivot, updates context |
| Silence for 15+ seconds | No user input after agent prompt | Agent re-prompts once, then offers alternatives |
For in-depth coverage of WebRTC testing for interruptions and turn-taking, see How to Test Voice Agents Built with LiveKit.
Latency and Timing Issues
When end-to-end latency exceeds the 3.5-second P90 target, decompose by component:
```text
Total P90: 4.2s (OVER TARGET)
├── STT:     310ms  (OK, budget: 300ms)
├── LLM:     2.9s   (HIGH, budget: 2.0s)   ← Root cause
├── TTS:     280ms  (OK, budget: 300ms)
└── Network: 710ms  (HIGH, budget: 400ms)  ← Contributing factor
```
Common latency root causes:
| Component | Common Cause | Fix |
|---|---|---|
| STT | Long utterances, poor audio | Streaming transcription, noise filtering |
| LLM | Large context window, complex prompts | Prompt optimization, context pruning |
| TTS | Long responses, cold starts | Response chunking, connection pooling |
| Network | Geographic distance, routing | Edge deployment, CDN for static assets |
For detailed latency optimization techniques, see How to Optimize Latency in Voice Agents.
Audio Quality Problems
Identify audio issues that reduce ASR accuracy:
| Issue | Detection | Impact on WER | Mitigation |
|---|---|---|---|
| Reverberation | Room impulse response analysis | +3-8% | Echo cancellation, derev processing |
| Background noise | SNR measurement | +2-5% at -10dB | Noise suppression, gain control |
| Codec artifacts | Bitrate monitoring | +1-3% | Higher bitrate encoding |
| Packet loss | WebRTC stats | +2-6% at 3% loss | FEC, jitter buffer tuning |
Test ASR accuracy with synthetic impulse responses during development to catch reverberation issues before deployment. Production environments (call centers, cars, outdoor spaces) introduce acoustic challenges that clean test audio never exercises.
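Mixing noise at a target SNR, as in the ASR conditions table earlier, follows directly from the signal-to-noise power ratio. A pure-Python sketch with toy sine tones standing in for speech and office noise (real test harnesses would operate on recorded audio arrays):

```python
import math

def mix_at_snr(signal: list[float], noise: list[float], snr_db: float) -> list[float]:
    """Scale `noise` so the mix has the requested signal-to-noise ratio."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    # SNR_dB = 10*log10(Ps / (g^2 * Pn))  ->  solve for the noise gain g
    gain = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(signal, noise)]

# One second of toy 8 kHz tones standing in for speech and background noise
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
noise = [0.3 * math.sin(2 * math.pi * 120 * t / 8000) for t in range(8000)]
noisy = mix_at_snr(signal, noise, snr_db=-10)  # street-noise condition
```

Sweeping `snr_db` from -20 down to -10 reproduces the office-noise and street-noise rows of the WER impact table.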
Best Practices and Implementation Roadmap
Starting Small
Begin with a focused test suite before scaling:
1. Curate 50-100 conversations representing your core use cases
2. Define pass/fail criteria for each conversation type
3. Run offline evaluation against this dataset on every deploy
4. Track 3 key metrics in production: P90 latency, task completion rate, WER
This baseline takes days to set up and immediately catches the most common failures.
Scaling Your Testing Practice
| Phase | Focus | Tools | Effort |
|---|---|---|---|
| 1. Foundation | Text-only regression + 3 production metrics | pytest + basic monitoring | Low |
| 2. Audio coverage | Add WebRTC testing for latency and interruptions | Hamming + LiveKit testing | Medium |
| 3. Load validation | Concurrent capacity testing | lk perf + synthetic callers | Medium |
| 4. Full observability | Distributed tracing + automated scorers | OpenTelemetry + LLM-as-judge | High |
Each phase builds on the previous one. Do not skip to phase 4 without the foundation of phase 1.
Tool Selection Criteria
When evaluating voice agent testing platforms, weight capabilities based on production impact:
| Capability | Weight | What to Look For |
|---|---|---|
| Quality metric coverage | 30% | WER, MOS, task completion, latency percentiles |
| Production monitoring | 25% | Continuous scoring, alerting, drift detection |
| CI/CD integration | 20% | GitHub Actions/Jenkins support, deploy gating |
| Load testing | 15% | Concurrent session simulation, realistic audio |
| Ease of setup | 10% | Time to first test run, documentation quality |
Prioritize platforms that cover both offline evaluation and online monitoring. Tools that only do one or the other leave gaps that production will expose.
Building Internal Expertise
Voice agent quality is not a single team's responsibility:
| Stakeholder | Responsibility | Feedback Channel |
|---|---|---|
| Engineering | Instrumentation, CI/CD integration, incident response | Automated alerts, trace review |
| QA | Test case curation, regression suite maintenance | Weekly quality reports |
| Product | Success criteria definition, user experience standards | Customer satisfaction data |
| Operations | Capacity planning, cost monitoring, vendor management | Monthly capacity reviews |
Establish weekly quality review meetings where engineering, QA, and product review production metrics together. The feedback loop between "what customers experience" and "what tests validate" should be as short as possible.
Conclusion
Voice agent failures cascade. A small ASR degradation propagates through intent classification, response generation, and audio synthesis—each layer amplifying the original error. Without observability across all four layers, you spend hours debugging symptoms instead of root causes.
The five-pillar framework—evaluation, regression, load, observability, alerting—provides complete coverage. Start with the foundation: 50-100 curated test cases, automated regression suites blocking deploys, and three production metrics (P90 latency under 3.5 seconds, task completion above 90%, WER under 5%). Scale from there based on what production monitoring reveals.
Every production failure should make your test suite stronger. Every test suite improvement should prevent the next production failure. That flywheel is the difference between voice agents that work in demos and voice agents that work in production.

