Testing and Monitoring LiveKit Voice Agents in Production

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

February 12, 2026 · 19 min read

Voice agents built on LiveKit fail differently than text-based AI systems. A chatbot with a 2-second response delay is mildly annoying. A voice agent with the same delay creates dead air that makes users hang up. When your pipeline spans ASR, NLU, LLM, and TTS, a failure in any single layer cascades through the entire conversation. Testing and monitoring voice agents requires a fundamentally different approach.

This guide covers the five-pillar framework for production-grade LiveKit voice agent quality: evaluation, regression testing, load testing, observability, and alerting.

Methodology Note: The testing framework, metrics, and thresholds in this guide are derived from Hamming's analysis of 4M+ production LiveKit voice agent calls across 10K+ voice agents (2025-2026).

Thresholds and benchmarks represent performance patterns across customer support, healthcare, and enterprise deployments. Your targets may vary based on use case and latency tolerance.

TL;DR: Five-Pillar Testing Framework

| Pillar | Purpose | Key Tool |
| --- | --- | --- |
| Evaluation | Measure quality across ASR, NLU, LLM, TTS | Curated datasets + LLM-as-judge scorers |
| Regression | Prevent quality degradation on changes | Behavioral baselines + automated suites |
| Load | Validate scalability under traffic | lk perf agent-load-test + synthetic calls |
| Observability | Diagnose production issues | OpenTelemetry distributed tracing |
| Alerting | Catch problems before users complain | P90/P99 threshold monitoring |

Key targets: P90 end-to-end latency under 3.5s, P99 under 5s, WER under 5%, task completion above 90%.

Core insight: Offline evaluation catches logic errors. Online monitoring catches production-only failures. You need both, and they need to share metrics definitions.

Quick filter: If you are running LiveKit agents in production without automated regression suites and distributed tracing, this guide covers the gaps that will eventually cause customer complaints.


Understanding Voice Agent Testing Fundamentals

The Voice Agent Testing Stack

Voice agents are not single-component systems. They are four interdependent layers where failures cascade:

User speaks → ASR (transcription) → NLU (intent) → LLM (response) → TTS (audio) → User hears

| Layer | Function | Failure Impact |
| --- | --- | --- |
| ASR (Speech-to-Text) | Converts audio to text | Wrong transcription → wrong intent → wrong response |
| NLU (Intent Recognition) | Classifies user intent | Misrouted conversation → task failure |
| LLM (Response Generation) | Produces reply content | Hallucination, context loss, wrong tool call |
| TTS (Text-to-Speech) | Synthesizes audio output | Robotic voice, pronunciation errors, latency |

A 3% increase in ASR word error rate does not stay contained. It propagates through NLU, causes the LLM to generate responses to the wrong intent, and the user hears a confidently wrong answer. Testing each layer in isolation is necessary but insufficient—you need full-stack validation.

Offline vs. Online Evaluation

Offline evaluation runs before deployment against curated datasets with known-correct outcomes. You control the inputs, the environment, and the expected outputs. This catches logic errors, prompt regressions, and intent classification failures.

Online evaluation monitors live production traffic continuously. It samples real conversations and scores them against quality criteria. This catches failures that only appear in production: audio degradation from cellular connections, accent-driven ASR errors, network jitter disrupting turn-taking, and model drift over time.

| Dimension | Offline Evaluation | Online Monitoring |
| --- | --- | --- |
| When it runs | Pre-deployment | Continuous in production |
| Data source | Curated test datasets | Sampled live traffic |
| Controls | Full (inputs, environment) | None (real-world conditions) |
| Catches | Logic errors, regressions, prompt issues | Audio quality, network effects, drift |
| Latency | Not representative | Actual production latency |
| Scale | 50-500 test cases | Thousands of conversations daily |

Both feed into the same metrics pipeline. If your offline evaluation measures task completion rate at 95% but production monitoring shows 82%, the gap is your testing blind spot.

Key Quality Metrics for Voice Agents

| Metric | Target | Alert Threshold | Why It Matters |
| --- | --- | --- | --- |
| Word Error Rate (WER) | <5% | >8% | Foundation of everything: bad transcription cascades |
| Mean Opinion Score (MOS) | >4.3/5 | <3.8 | Perceived audio quality drives user trust |
| End-to-End Latency P90 | <3.5s | >3.5s | 10% of users experiencing significant delays |
| End-to-End Latency P99 | <5s | >5s | Responses over 5 seconds feel broken |
| TTFT (Time to First Token) | <800ms | >1200ms | User-perceived response initiation |
| Task Completion Rate | >90% | <85% | Whether the agent actually solves the problem |
| Interruption Rate | <15% | >25% | Turn detection quality indicator |

For a deeper dive into latency benchmarks and optimization strategies, see Voice AI Latency: What's Fast, What's Slow, and How to Fix It.


Setting Up Your LiveKit Testing Environment

Installing LiveKit Testing Dependencies

Start with the core testing stack:

pip install livekit-agents pytest pytest-asyncio

For observability instrumentation, add OpenTelemetry:

pip install opentelemetry-api opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation

For load testing, install the LiveKit CLI:

# macOS
brew install livekit-cli

# Linux
curl -sSL https://get.livekit.io/cli | bash

Configuring Test Infrastructure

Set up environment variables for your test environment:

# .env.test
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your-test-api-key
LIVEKIT_API_SECRET=your-test-api-secret
OPENAI_API_KEY=your-openai-key  # or other LLM provider

Configure pytest for async support:

[pytest]
asyncio_mode = auto
testpaths = tests
python_files = test_*.py
markers =
    unit: Text-only logic tests
    webrtc: Full audio pipeline tests
    load: Concurrent capacity tests

Separate test markers let you run fast unit tests on every commit and reserve expensive WebRTC and load tests for deploy candidates.

Creating Your First Test Suite

LiveKit's pytest helpers validate agent behavior in text mode—no audio pipeline required:

import pytest
from your_agent import create_agent

@pytest.mark.unit
@pytest.mark.asyncio
async def test_greeting_and_intent_routing():
    """Verify agent correctly identifies and routes initial intent."""
    agent = create_agent()

    response = await agent.process_text(
        "Hi, I need to reschedule my appointment for next week"
    )

    # Agent should recognize scheduling intent
    assert any(keyword in response.lower() for keyword in [
        "reschedule", "appointment", "when", "available", "date"
    ])
    # Should ask clarifying question
    assert "?" in response


@pytest.mark.unit
@pytest.mark.asyncio
async def test_multi_turn_context_retention():
    """Verify agent maintains context across conversation turns."""
    agent = create_agent()

    await agent.process_text("I want to book a table for 4 people")
    await agent.process_text("Actually, make that 6")
    response = await agent.process_text("What reservation details do you have?")

    # Agent should remember the corrected party size
    assert "6" in response or "six" in response.lower()

For comprehensive coverage of LiveKit's built-in testing helpers and pytest patterns, see Testing LiveKit Voice Agents: Unit, Scenario, Load and Production Guide.


Evaluation Frameworks for Voice Agents

Conversational Quality Metrics

Beyond pass/fail assertions, measure the qualitative aspects of conversation:

| Metric | How to Measure | Target |
| --- | --- | --- |
| Turn-taking latency | Time from user silence to agent speech start | <500ms end-of-utterance delay |
| Interruption handling | Agent stops and listens when user barges in | Recovery within 1 turn |
| Time to first word | Duration from user finish to agent first audio | P90 under 2.5s |
| Talk-to-listen ratio | Agent speaking time vs. user speaking time | 40-60% agent, context-dependent |

Track these across conversations to detect systematic issues. A talk-to-listen ratio above 70% often indicates the agent is monologuing rather than having a dialogue.
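The ratio in the last row is straightforward to compute from per-turn timings. A minimal sketch, assuming a turn log with `speaker` and `duration_ms` fields (hypothetical names, not a LiveKit API):

```python
def talk_to_listen_ratio(turns: list[dict]) -> float:
    """Fraction of total speech time attributable to the agent."""
    agent_ms = sum(t["duration_ms"] for t in turns if t["speaker"] == "agent")
    user_ms = sum(t["duration_ms"] for t in turns if t["speaker"] == "user")
    total = agent_ms + user_ms
    return agent_ms / total if total else 0.0

call = [
    {"speaker": "agent", "duration_ms": 4000},
    {"speaker": "user", "duration_ms": 2000},
    {"speaker": "agent", "duration_ms": 6000},
]
ratio = talk_to_listen_ratio(call)
if ratio > 0.70:
    print(f"Monologue warning: agent ratio {ratio:.0%}")
```

Aggregated per intent or per prompt version, this single number flags monologuing agents without any manual call review.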

Task Completion Assessment

Task completion is the metric that matters most to the business. Define clear success criteria for each conversation type:

TASK_DEFINITIONS = {
    "appointment_booking": {
        "required_fields": ["date", "time", "service_type"],
        "success_criteria": "booking_confirmed",
        "max_turns": 8,
    },
    "order_status": {
        "required_fields": ["order_id"],
        "success_criteria": "status_delivered",
        "max_turns": 4,
    },
    "complaint_resolution": {
        "required_fields": ["issue_description"],
        "success_criteria": "resolution_offered",
        "max_turns": 12,
    },
}

Measure both binary completion (did the task succeed?) and efficiency (how many turns did it take?). An agent that completes bookings in 15 turns when the baseline is 6 has a problem even if the task technically succeeds.
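Given definitions like the ones above, the dual check reduces to a small function. A sketch, with `assess_task` and its field names as assumptions rather than an established API:

```python
# Mirrors one entry of the TASK_DEFINITIONS dict shown earlier
APPOINTMENT_SPEC = {
    "required_fields": ["date", "time", "service_type"],
    "success_criteria": "booking_confirmed",
    "max_turns": 8,
}

def assess_task(spec: dict, collected: dict, outcome: str, turns: int) -> dict:
    """Score binary completion and turn-count efficiency separately."""
    completed = (
        outcome == spec["success_criteria"]
        and all(field in collected for field in spec["required_fields"])
    )
    return {
        "completed": completed,
        "efficient": turns <= spec["max_turns"],
        "turns": turns,
    }

result = assess_task(
    APPOINTMENT_SPEC,
    {"date": "2026-03-01", "time": "10:00", "service_type": "cleaning"},
    "booking_confirmed",
    turns=12,  # completed, but well over the 8-turn budget
)
```

Tracking `completed` and `efficient` as separate metrics surfaces exactly the 15-turns-instead-of-6 case described above.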

ASR and TTS Accuracy Testing

Test speech recognition under realistic conditions, not just clean studio audio:

| Condition | Expected WER Impact | Test Approach |
| --- | --- | --- |
| Quiet environment | Baseline (<3%) | Standard test audio |
| Background office noise | +1-2% | Mix noise at 20 dB SNR |
| Street/traffic noise | +3-5% | Mix noise at 10 dB SNR |
| Regional accents | +2-4% | Diverse speaker test set |
| Non-native speakers | +3-6% | Accented speech samples |
| Phone-quality audio (8kHz) | +1-3% | Downsample test audio |

For TTS, validate pronunciation of domain-specific terms, proper names, numbers, and dates. A medical scheduling agent that mispronounces "ophthalmologist" erodes user confidence regardless of task accuracy.
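The noise-mixing conditions in the table come down to scaling recorded noise against speech power before adding it. A stdlib-only sketch with synthetic signals; in practice you would mix recorded noise files into real test audio:

```python
import math
import random

def mix_at_snr(speech: list[float], noise: list[float], snr_db: float) -> list[float]:
    """Scale noise so the mixture hits the target SNR, then add it to the speech."""
    def power(samples: list[float]) -> float:
        return sum(v * v for v in samples) / len(samples)

    scale = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]

random.seed(0)
speech = [math.sin(2 * math.pi * 220 * t / 8000) for t in range(8000)]
noise = [random.gauss(0.0, 0.3) for _ in range(8000)]
noisy = mix_at_snr(speech, noise, snr_db=10)  # e.g. a moderate street-noise condition
```

Running the same test utterances through the ASR at several SNR levels gives you the WER-vs-noise curve the table summarizes.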

Intent Classification Validation

Intent routing determines conversation flow. Test both correct classification and recovery from misclassification:

@pytest.mark.asyncio
async def test_intent_recovery_after_misrecognition():
    """Verify agent recovers when initial intent is misclassified."""
    agent = create_agent()

    # Ambiguous input that might be misclassified
    response1 = await agent.process_text("I want to change my plan")

    # User corrects/clarifies
    response2 = await agent.process_text(
        "No, not my subscription plan. I want to change my flight plan."
    )

    # Agent should pivot to flight-related handling
    assert any(keyword in response2.lower() for keyword in [
        "flight", "itinerary", "travel", "departure"
    ])

Compliance and Safety Guardrails

Implement automated scorers that run on every test case and sampled production conversations:

| Guardrail | What It Catches | Implementation |
| --- | --- | --- |
| Prompt injection detection | Users attempting to override system prompt | Pattern matching + LLM classifier |
| Policy violation scoring | Agent making unauthorized promises or commitments | LLM-as-judge with policy rubric |
| PII handling | Agent improperly repeating or storing sensitive data | Regex + entity detection |
| Safety boundary enforcement | Agent engaging with harmful or off-topic requests | Topic classifier + refusal verification |

These scorers should run automatically in both offline evaluation and online monitoring. A compliance failure in production is worse than any latency spike.
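The PII row, for example, can start as plain regex matching before layering on entity detection. A minimal sketch; the patterns are illustrative, not exhaustive:

```python
import re

# Illustrative patterns only: real PII detection needs broader coverage
# (phone numbers, emails, addresses) plus an entity-detection pass.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def pii_violations(text: str) -> list[str]:
    """Return the names of PII patterns found in an agent response."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]
```

Run this over every agent response in both test and sampled production traffic; any non-empty result is a hard failure, not a score.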


Implementing Regression Testing

Building Behavioral Baselines

Before changing anything, establish your baseline. This becomes the reference point for detecting regressions:

Quantitative baseline:

| Metric | Baseline Value | Acceptable Regression |
| --- | --- | --- |
| Task completion rate | 93% | No more than 3% drop |
| P90 end-to-end latency | 3.2s | No more than 300ms increase |
| WER | 4.1% | No more than 1% increase |
| Intent accuracy | 96% | No more than 2% drop |
| Average turns to completion | 5.8 | No more than 1.5 turn increase |

Qualitative baseline: Context preservation across turns, coherent multi-step reasoning, appropriate tone and register. These require LLM-as-judge evaluation rather than simple metric thresholds.
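The quantitative table can be encoded directly as a deploy gate. A sketch, with metric names and allowed deltas taken from the table; the sign of the allowed delta distinguishes metrics that must not rise (latency, WER) from those that must not fall (completion, accuracy):

```python
# Allowed delta sign encodes direction: positive means "may increase by at
# most this much", negative means "may drop by at most this much".
BASELINE = {
    "task_completion": (0.93, -0.03),
    "p90_latency_s": (3.2, 0.3),
    "wer": (0.041, 0.01),
}

def check_regressions(current: dict) -> list[str]:
    """Return a human-readable failure line per metric outside its budget."""
    failures = []
    for metric, (baseline, allowed) in BASELINE.items():
        delta = current[metric] - baseline
        if (allowed >= 0 and delta > allowed) or (allowed < 0 and delta < allowed):
            failures.append(f"{metric}: {baseline} -> {current[metric]}")
    return failures
```

Failing the build on a non-empty list turns the baseline table into an enforced contract rather than documentation.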

Detecting Model Drift

Model drift happens silently. ASR provider updates, LLM fine-tuning, or even TTS voice model changes can shift behavior without any code change on your side:

@pytest.mark.asyncio
async def test_behavioral_consistency():
    """Run weekly to detect drift in agent behavior."""
    agent = create_agent()
    results = []

    for scenario in BASELINE_SCENARIOS:
        response = await agent.process_text(scenario["input"])
        score = evaluate_response(response, scenario["expected"])
        results.append(score)

    avg_score = sum(results) / len(results)
    baseline_score = load_baseline_score()

    # Flag if cumulative score drops more than 5%
    assert avg_score >= baseline_score * 0.95, (
        f"Behavioral drift detected: {avg_score:.2f} vs baseline {baseline_score:.2f}"
    )

Schedule this weekly. Cumulative 1% degradations per month become a 12% regression over a year that nobody notices until customer complaints spike.

Prompt Version Testing

Every prompt change gets an A/B comparison against the regression suite before deployment:

@pytest.mark.asyncio
async def test_prompt_upgrade_no_regression():
    """Compare candidate prompt against production baseline."""
    test_cases = load_regression_suite()

    baseline_agent = create_agent(prompt_version="production")
    candidate_agent = create_agent(prompt_version="candidate")

    baseline_pass = 0
    candidate_pass = 0

    for case in test_cases:
        b_response = await baseline_agent.process_text(case["input"])
        c_response = await candidate_agent.process_text(case["input"])

        if passes_criteria(b_response, case["expected"]):
            baseline_pass += 1
        if passes_criteria(c_response, case["expected"]):
            candidate_pass += 1

    baseline_rate = baseline_pass / len(test_cases)
    candidate_rate = candidate_pass / len(test_cases)

    assert candidate_rate >= baseline_rate - 0.05, (
        f"Prompt regression: {baseline_rate:.1%} → {candidate_rate:.1%}"
    )

For a comprehensive guide to regression testing patterns and production failure conversion workflows, see AI Voice Agent Regression Testing.

Automated Regression Suites

Integrate regression checks as deployment gates:

# .github/workflows/voice-agent-ci.yml
- name: Run regression suite
  run: pytest tests/regression/ -m "not load" --tb=short -q
  # A nonzero pytest exit code fails the step and blocks the deploy;
  # no explicit exit-code check is needed.

Every production failure becomes a permanent test case. Over time your regression suite becomes an increasingly complete specification of correct agent behavior.


Load and Scalability Testing

Using lk perf agent-load-test

The LiveKit CLI includes built-in load testing for agent scalability:

# Simulate 50 concurrent agent rooms
lk perf agent-load-test \
  --url wss://your-project.livekit.cloud \
  --api-key $LIVEKIT_API_KEY \
  --api-secret $LIVEKIT_API_SECRET \
  --num-rooms 50 \
  --duration 300s \
  --publish-audio

This creates concurrent rooms with synthetic participants, measuring agent join time, response latency, and resource consumption. Start with 10 rooms and scale up to identify the inflection point where performance degrades.

Stress Testing Strategies

Test at 2x your expected peak capacity to find breaking points before they find your users:

| Test Scenario | Configuration | What It Reveals |
| --- | --- | --- |
| Sustained load | 100 concurrent rooms for 30 minutes | Memory leaks, connection pool exhaustion |
| Spike test | Ramp from 10 to 200 rooms in 60 seconds | Autoscaling responsiveness |
| Endurance test | 50 rooms for 4 hours | Long-running stability issues |
| Chaos test | Kill agent workers mid-conversation | Recovery and failover behavior |

For stress tests exceeding 100 concurrent sessions, use synthetic callers with realistic voice characteristics and background noise to simulate production conditions. Clean synthetic audio does not stress-test ASR the same way real-world audio does.

Identifying Performance Bottlenecks

Track P90 and P99 latency at each pipeline stage independently during load tests:

Load Test Results (100 concurrent rooms)
Component Latency Breakdown:
├── STT Processing:  P90: 280ms  P99: 450ms    Within budget
├── LLM Inference:   P90: 1800ms P99: 3200ms   Dominant bottleneck
├── TTS Synthesis:   P90: 220ms  P99: 380ms    Within budget
└── Network/Routing: P90: 180ms  P99: 320ms    Within budget
Total End-to-End:    P90: 3100ms P99: 4800ms   Under 3.5s P90 target

When P90 exceeds 3.5 seconds, identify which component is the bottleneck. In most deployments, LLM inference dominates at 60-70% of total latency. Optimization efforts should target the largest contributor first.

For detailed load testing methodology and the 3-Pillar testing framework, see Testing Voice Agents: Load, Regression, and A/B Evaluation for Production Reliability.

Capacity Planning

Determine your maximum concurrent call capacity by finding the degradation point:

| Concurrent Rooms | P90 Latency | P99 Latency | Task Completion | Status |
| --- | --- | --- | --- | --- |
| 25 | 2.8s | 3.9s | 94% | Healthy |
| 50 | 3.1s | 4.3s | 93% | Healthy |
| 75 | 3.4s | 4.8s | 91% | Warning |
| 100 | 3.8s | 5.6s | 87% | Degraded |
| 150 | 4.9s | 7.2s | 78% | Critical |

Set your autoscaling threshold at 70% of the degradation point. If performance degrades at 100 rooms, trigger scaling at 70 concurrent rooms to maintain headroom.


Logging and Tracing Architecture

OpenTelemetry Integration

Implement distributed tracing that captures the complete voice turn lifecycle with a unified trace ID:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

def configure_tracing():
    provider = TracerProvider()
    processor = BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317")
    )
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)

tracer = trace.get_tracer("livekit-voice-agent")

async def process_voice_turn(audio_input):
    with tracer.start_as_current_span("voice_turn") as turn_span:
        turn_span.set_attribute("session.id", session_id)

        with tracer.start_as_current_span("stt_processing") as stt_span:
            transcript = await transcribe(audio_input)
            stt_span.set_attribute("stt.confidence", transcript.confidence)
            stt_span.set_attribute("stt.word_count", len(transcript.words))

        with tracer.start_as_current_span("llm_inference") as llm_span:
            llm_span.set_attribute("llm.model", model_name)
            response = await generate_response(transcript.text)
            llm_span.set_attribute("llm.tokens", response.token_count)

        if response.requires_tool:
            with tracer.start_as_current_span("tool_call") as tool_span:
                tool_span.set_attribute("tool.name", response.tool_name)
                result = await execute_tool(response.tool_call)
                tool_span.set_attribute("tool.success", result.success)

        with tracer.start_as_current_span("tts_synthesis") as tts_span:
            audio = await synthesize(response.text)
            tts_span.set_attribute("tts.characters", len(response.text))
            tts_span.set_attribute("tts.duration_ms", audio.duration_ms)

This produces traces like:

voice_turn (3.2s total)
├── stt_processing (210ms)
├── llm_inference (2.4s)  bottleneck visible
├── tool_call (380ms)
└── tts_synthesis (190ms)

For implementation details on Prometheus metrics collection and Grafana dashboard configuration, see LiveKit Agent Monitoring in Production: Prometheus, Grafana and Alerts.

Structured Logging Best Practices

Log full conversation context with every turn for post-incident debugging:

import structlog

logger = structlog.get_logger()

async def log_turn(session_id: str, turn_number: int, turn_data: dict):
    logger.info(
        "voice_turn_completed",
        session_id=session_id,
        turn_number=turn_number,
        user_input=turn_data["transcript"],
        agent_response=turn_data["response"],
        stt_confidence=turn_data["stt_confidence"],
        stt_latency_ms=turn_data["stt_latency_ms"],
        llm_latency_ms=turn_data["llm_latency_ms"],
        tts_latency_ms=turn_data["tts_latency_ms"],
        total_latency_ms=turn_data["total_latency_ms"],
        tool_calls=turn_data.get("tool_calls", []),
        intent_detected=turn_data.get("intent"),
        interruption_occurred=turn_data.get("interrupted", False),
    )

Structure logs so you can query by session, by latency range, by error type, or by intent. When a customer reports a bad experience, you should be able to reconstruct the complete conversation within minutes.
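If structlog is configured with its JSON renderer, those queries are one filter away. A sketch assuming the field names from the logging call above (structlog's default event dict puts the message under the `event` key):

```python
import json

def slow_turns(log_lines, min_latency_ms: int = 3500) -> list[dict]:
    """Filter JSON-rendered turn logs down to high-latency events."""
    events = (json.loads(line) for line in log_lines)
    return [
        e for e in events
        if e.get("event") == "voice_turn_completed"
        and e.get("total_latency_ms", 0) >= min_latency_ms
    ]

lines = [
    '{"event": "voice_turn_completed", "session_id": "abc", "total_latency_ms": 4100}',
    '{"event": "voice_turn_completed", "session_id": "def", "total_latency_ms": 2900}',
]
```

The same pattern works for filtering by error type, intent, or session ID when reconstructing a reported conversation.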

Multi-Agent Tracing

When agents delegate tasks or coordinate with other agents, trace the full interaction graph:

Primary Agent (session-abc)
├── voice_turn_1 (greeting)
├── voice_turn_2 (intent: transfer)
   └── agent_handoff
       ├── context_transfer (120ms)
       └── Secondary Agent (session-abc-transfer)
           ├── voice_turn_1 (pickup)
           └── voice_turn_2 (resolution)
└── voice_turn_3 (confirmation)

Propagate trace context across agent boundaries so you can follow a single conversation through multiple agents, tool calls, and external API interactions.
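Propagation amounts to serializing trace context into the handoff payload. A hand-rolled sketch of the W3C `traceparent` format (`00-<trace-id>-<span-id>-<flags>`) to show what actually crosses the boundary; in a real agent you would call `opentelemetry.propagate.inject` and `extract` on the payload rather than build headers yourself:

```python
def inject_context(trace_id: str, span_id: str, carrier: dict) -> dict:
    """Write the current trace context into the handoff payload."""
    carrier["traceparent"] = f"00-{trace_id}-{span_id}-01"
    return carrier

def extract_context(carrier: dict) -> dict:
    """Read trace context on the receiving agent so its spans join the trace."""
    _version, trace_id, parent_span_id, _flags = carrier["traceparent"].split("-")
    return {"trace_id": trace_id, "parent_span_id": parent_span_id}

handoff = inject_context("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", {})
ctx = extract_context(handoff)  # secondary agent parents its spans on this
```

Because both agents share the same trace ID, the multi-agent tree above renders as a single trace in your backend.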

LiveKit Agent Observability

LiveKit Cloud provides native observability features for agent debugging:

  • Trace View: Visual timeline showing turn detection, LLM timing, and tool execution per conversation
  • Session Recordings: Audio and transcript capture for debugging and compliance review
  • Real-time Metrics: WebRTC quality metrics, room health, and participant status
  • Synchronized Playback: Listen to audio while viewing the corresponding transcript and trace data side by side

These built-in tools complement your custom instrumentation. Use LiveKit Cloud Dashboard for individual session debugging and your Prometheus/Grafana stack for aggregate monitoring and alerting.


Production Monitoring and Alerting

Real-Time Performance Monitoring

Monitor these metrics continuously in production:

| Metric Category | What to Track | Alert When |
| --- | --- | --- |
| Latency | End-to-end P90, P99, TTFT | P90 > 3.5s or P99 > 5s |
| Audio Quality | ASR WER, TTS MOS | WER > 8% or MOS < 3.8 |
| Conversation | Intent accuracy, interruption rate | Intent accuracy < 90% or interruption > 25% |
| Reliability | Tool call success rate, connection rate | Tool success < 95% or connection drop > 5% |
| Cost | Token consumption, per-session cost | >2x baseline spend |

Configuring Alert Thresholds

Set alerts that catch real problems without generating noise:

| Alert | Warning | Critical | Rationale |
| --- | --- | --- | --- |
| P90 latency | >3.0s | >3.5s | 10% of users experiencing delays |
| P99 latency | >4.0s | >5.0s | Conversation flow breakdown |
| Connection drop rate | >5% | >15% | 15% drop suggests infrastructure issues |
| Intent accuracy (rolling 1h) | <92% | <85% | Sustained degradation, not momentary dips |
| Fallback/escalation rate | >20% | >35% | Rising fallback indicates systematic failure |

Use duration filters—require issues to persist for 5+ minutes before firing alerts. This avoids false alarms from momentary spikes while still catching sustained degradation.
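A duration filter is a few lines of state. A sketch with `SustainedAlert` as a hypothetical helper; `now` is injected rather than read from the clock so the logic is testable:

```python
class SustainedAlert:
    """Fire only when a threshold is breached continuously for hold_s seconds."""

    def __init__(self, threshold: float, hold_s: float = 300.0):
        self.threshold = threshold
        self.hold_s = hold_s
        self.breach_start = None  # timestamp when the current breach began

    def observe(self, value: float, now: float) -> bool:
        if value <= self.threshold:
            self.breach_start = None  # recovered: reset the timer
            return False
        if self.breach_start is None:
            self.breach_start = now
        return now - self.breach_start >= self.hold_s
```

Momentary spikes reset nothing but never fire; only a breach held for the full window (5 minutes here) pages anyone.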

For a comprehensive guide to monitoring voice agent outages and the 4-Layer Monitoring Framework, see How to Monitor Voice Agent Outages in Real Time.

Custom LLM-as-Judge Scorers

Define business-specific quality evaluators that run on sampled production conversations:

SCORER_DEFINITIONS = {
    "empathy_check": {
        "description": "Evaluate whether agent shows appropriate empathy",
        "rubric": """Score 1-5:
        5: Acknowledges emotion, validates concern, offers help
        3: Acknowledges issue but skips emotional validation
        1: Ignores emotional context entirely""",
        "alert_threshold": 2.5,
    },
    "compliance_adherence": {
        "description": "Verify agent follows regulatory requirements",
        "rubric": """Score 1-5:
        5: All required disclosures made, no unauthorized promises
        3: Minor omissions in required language
        1: Missing critical disclosures or unauthorized commitments""",
        "alert_threshold": 4.0,
    },
}

Run these scorers on 5-10% of production conversations. Alert when rolling averages drop below thresholds over a 1-hour window.
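Deterministic sampling keyed on session ID keeps the sampled population stable across scorer runs, so re-scoring the same window produces the same set of conversations. A sketch, with `should_score` as a hypothetical helper:

```python
import hashlib

def should_score(session_id: str, sample_rate: float = 0.08) -> bool:
    """Deterministically pick ~8% of sessions for LLM-as-judge scoring."""
    digest = hashlib.sha256(session_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate
```

Hash-based sampling also avoids the bias of "score the first N calls of the hour", which oversamples whatever traffic pattern dominates the top of the hour.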

Cross-Call Pattern Detection

Individual call reviews miss systemic issues. Aggregate analysis reveals patterns:

| Pattern | Detection Method | Example |
| --- | --- | --- |
| Time-of-day degradation | Hourly latency heatmaps | LLM provider throttling during business hours |
| Geographic performance variance | Per-region P90 breakdown | Higher ASR errors in specific regions |
| Conversation loops | Repeated intent classification per session | Agent asking the same question three times |
| Silent failures | Task completion vs. user satisfaction gap | Task marked complete but user called back |

Build dashboards that surface these patterns automatically. A 5% task completion drop that only affects users calling between 2-4 PM EST would be invisible in daily aggregates.

For foundational concepts on production monitoring strategy, see An Intro to Production Monitoring for Voice Agents.


Building Continuous Testing Pipelines

CI/CD Integration Strategies

Gate deployments with automated quality checks:

# .github/workflows/voice-agent-deploy.yml
name: Voice Agent Deploy Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run text-only regression suite
        run: pytest tests/ -m unit --tb=short -q
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

  webrtc-tests:
    needs: unit-tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run WebRTC validation suite
        run: pytest tests/ -m webrtc --tb=short -q
        env:
          LIVEKIT_URL: ${{ secrets.LIVEKIT_URL }}
          LIVEKIT_API_KEY: ${{ secrets.LIVEKIT_API_KEY }}
          LIVEKIT_API_SECRET: ${{ secrets.LIVEKIT_API_SECRET }}

  deploy:
    needs: [unit-tests, webrtc-tests]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to production
        run: ./deploy.sh

Run text-only tests on every PR. Run full WebRTC tests on merge to main. Block deploys when pass rates drop below 95%.

Automated Test Generation

Generate test cases from production conversations to expand coverage:

  1. Sample low-scoring conversations from production monitoring
  2. Extract the conversation transcript and expected outcomes
  3. Convert to regression test format with assertions
  4. Add to regression suite for continuous validation

def generate_test_from_production(conversation_log: dict) -> dict:
    """Convert a production conversation into a regression test case."""
    return {
        "id": f"prod-{conversation_log['session_id'][:8]}",
        "name": f"Production failure: {conversation_log['failure_reason']}",
        "source": f"production-{conversation_log['timestamp'][:10]}",
        "conversation": conversation_log["turns"],
        "expected": {
            "task_completed": conversation_log["expected_outcome"],
            "max_turns": len(conversation_log["turns"]) + 2,
        },
    }

This creates a flywheel: production failures improve the test suite, which prevents future failures, which improves production quality.

A/B Testing in Production

Run parallel agent versions to measure the impact of changes with statistical confidence:

| Parameter | Recommendation |
| --- | --- |
| Minimum sample per variant | 1,000 conversations |
| Statistical confidence target | 95% |
| Key comparison metrics | Task completion, P90 latency, user satisfaction |
| Maximum test duration | 2 weeks |
| Traffic split | 50/50 for fastest results, 90/10 for lower risk |

Route traffic based on session ID hash for consistent assignment. Never switch a user mid-conversation between agent versions.
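Hash-based assignment is a few lines. A sketch assuming the lower-risk 90/10 split from the table; because the bucket depends only on the session ID, a caller keeps the same agent version for the entire conversation:

```python
import hashlib

def assign_variant(session_id: str, candidate_share: float = 0.10) -> str:
    """Deterministically route a session to one agent version."""
    digest = hashlib.sha256(session_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "candidate" if bucket < candidate_share else "production"
```

Re-hashing the same ID always yields the same variant, which is exactly the consistency guarantee the paragraph above calls for.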

Feedback Loop Implementation

Close the loop between production monitoring and offline testing:

Production conversations
     ↓ Score with LLM-as-judge (online)
     ↓ Flag low-scoring sessions
     ↓ Extract as test cases
     ↓ Add to regression suite (offline)
     ↓ Run on next deploy candidate
     ↓ Deploy improved agent
     ↓ Monitor production...

This continuous improvement cycle means your test coverage grows organically from real production scenarios rather than hypothetical test cases.


Common Failure Modes and Debugging

Diagnosing Cascading Failures

Voice agent failures rarely have a single root cause. Use multi-layer correlation to trace the cascade:

| Symptom | Layer 1 Check | Layer 2 Check | Layer 3 Check |
| --- | --- | --- | --- |
| Wrong response | ASR transcript accuracy | Intent classification | LLM prompt/context |
| High latency | Per-component latency breakdown | Network path analysis | Provider rate limits |
| User hangs up | TTFT metrics | Turn detection timing | Audio quality scores |
| Repeated questions | Context window management | Memory/state handling | Tool call failures |

Example cascade: Audio degradation (MOS drops to 3.2) causes ASR word error rate to spike to 12%, which causes intent misclassification in 30% of turns, which causes the LLM to generate irrelevant responses, which causes users to interrupt, which causes further ASR errors due to overlapping speech. The root cause is audio quality, but the symptom is "agent gives wrong answers."

Handling Real-World Edge Cases

Test scenarios that only happen in production:

| Edge Case | Test Approach | Expected Behavior |
| --- | --- | --- |
| User interrupts mid-response | Synthetic barge-in at random points | Agent stops, listens, responds to new input |
| Connection drops for 3 seconds | Network simulation with packet loss | Agent resumes or gracefully re-establishes |
| Background noise spike | Inject noise at varying SNR levels | ASR degrades gracefully, agent asks to repeat |
| Mid-conversation context switch | User changes topic abruptly | Agent acknowledges pivot, updates context |
| Silence for 15+ seconds | No user input after agent prompt | Agent re-prompts once, then offers alternatives |
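The silence policy in the last row can be expressed as a timeout loop. A sketch, with `wait_for_input` and `speak` as stand-in callbacks, not LiveKit APIs:

```python
import asyncio

async def await_user_reply(wait_for_input, speak, timeout_s: float = 15.0):
    """Wait for the user; on silence, re-prompt once, then offer alternatives."""
    prompts = (
        "Are you still there?",
        "I can follow up by text instead, or you can call back anytime.",
    )
    for prompt in prompts:
        try:
            return await asyncio.wait_for(wait_for_input(), timeout=timeout_s)
        except asyncio.TimeoutError:
            await speak(prompt)
    return None  # caller decides: end the call or escalate
```

A regression test drives this with a `wait_for_input` stub that never resolves and asserts exactly two prompts were spoken.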

For in-depth coverage of WebRTC testing for interruptions and turn-taking, see How to Test Voice Agents Built with LiveKit.

Latency and Timing Issues

When end-to-end latency exceeds the 3.5-second P90 target, decompose by component:

Total P90: 4.2s (OVER TARGET)
├── STT: 310ms (OK, budget: 300ms)
├── LLM: 2.9s  (HIGH, budget: 2.0s)  Root cause
├── TTS: 280ms (OK, budget: 300ms)
└── Network: 710ms (HIGH, budget: 400ms)  Contributing factor

Common latency root causes:

| Component | Common Cause | Fix |
| --- | --- | --- |
| STT | Long utterances, poor audio | Streaming transcription, noise filtering |
| LLM | Large context window, complex prompts | Prompt optimization, context pruning |
| TTS | Long responses, cold starts | Response chunking, connection pooling |
| Network | Geographic distance, routing | Edge deployment, CDN for static assets |

For detailed latency optimization techniques, see How to Optimize Latency in Voice Agents.

Audio Quality Problems

Identify audio issues that reduce ASR accuracy:

| Issue | Detection | Impact on WER | Mitigation |
| --- | --- | --- | --- |
| Reverberation | Room impulse response analysis | +3-8% | Echo cancellation, dereverberation processing |
| Background noise | SNR measurement | +2-5% at 10 dB SNR | Noise suppression, gain control |
| Codec artifacts | Bitrate monitoring | +1-3% | Higher bitrate encoding |
| Packet loss | WebRTC stats | +2-6% at 3% loss | FEC, jitter buffer tuning |

Test ASR accuracy with synthetic impulse responses during development to catch reverberation issues before deployment. Production environments (call centers, cars, outdoor spaces) introduce acoustic challenges that clean test audio never exercises.
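Applying a measured or synthetic impulse response to clean audio is a convolution. A stdlib-only sketch suitable for short test clips; for real audio you would use `scipy.signal.fftconvolve` with measured room impulse responses:

```python
def convolve(signal: list[float], impulse_response: list[float]) -> list[float]:
    """Direct convolution: apply a room impulse response to clean test audio."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

# A toy two-tap "room": a direct path plus one delayed 0.3-gain echo
reverberant = convolve([1.0, 0.0, 0.0], [0.5, 0.0, 0.3])
```

Running the same WER test set through several impulse responses (small office, car cabin, open hall) quantifies the reverberation row of the table before any deployment.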


Best Practices and Implementation Roadmap

Starting Small

Begin with a focused test suite before scaling:

  1. Curate 50-100 conversations representing your core use cases
  2. Define pass/fail criteria for each conversation type
  3. Run offline evaluation against this dataset on every deploy
  4. Track 3 key metrics in production: P90 latency, task completion rate, WER

This baseline takes days to set up and immediately catches the most common failures.

Scaling Your Testing Practice

| Phase | Focus | Tools | Effort |
| --- | --- | --- | --- |
| 1. Foundation | Text-only regression + 3 production metrics | pytest + basic monitoring | Low |
| 2. Audio coverage | Add WebRTC testing for latency and interruptions | Hamming + LiveKit testing | Medium |
| 3. Load validation | Concurrent capacity testing | lk perf + synthetic callers | Medium |
| 4. Full observability | Distributed tracing + automated scorers | OpenTelemetry + LLM-as-judge | High |

Each phase builds on the previous one. Do not skip to phase 4 without the foundation of phase 1.

Tool Selection Criteria

When evaluating voice agent testing platforms, weight capabilities based on production impact:

| Capability | Weight | What to Look For |
| --- | --- | --- |
| Quality metric coverage | 30% | WER, MOS, task completion, latency percentiles |
| Production monitoring | 25% | Continuous scoring, alerting, drift detection |
| CI/CD integration | 20% | GitHub Actions/Jenkins support, deploy gating |
| Load testing | 15% | Concurrent session simulation, realistic audio |
| Ease of setup | 10% | Time to first test run, documentation quality |

Prioritize platforms that cover both offline evaluation and online monitoring. Tools that only do one or the other leave gaps that production will expose.
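The weighted comparison reduces to a few lines. A sketch using the weights from the table above (the candidate's 0-10 capability scores are illustrative):

```python
# Weights mirror the capability table; must sum to 1.0.
WEIGHTS = {
    "quality_metrics": 0.30,
    "production_monitoring": 0.25,
    "cicd_integration": 0.20,
    "load_testing": 0.15,
    "ease_of_setup": 0.10,
}

def platform_score(scores: dict) -> float:
    """Weighted sum of 0-10 capability scores; missing entries score 0."""
    return sum(WEIGHTS[cap] * scores.get(cap, 0) for cap in WEIGHTS)

candidate = {
    "quality_metrics": 9, "production_monitoring": 8,
    "cicd_integration": 7, "load_testing": 6, "ease_of_setup": 8,
}
print(round(platform_score(candidate), 2))  # → 7.8
```

Scoring every candidate on the same rubric keeps the decision anchored to production impact rather than demo polish.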

Building Internal Expertise

Voice agent quality is not a single team's responsibility:

| Stakeholder | Responsibility | Feedback Channel |
| --- | --- | --- |
| Engineering | Instrumentation, CI/CD integration, incident response | Automated alerts, trace review |
| QA | Test case curation, regression suite maintenance | Weekly quality reports |
| Product | Success criteria definition, user experience standards | Customer satisfaction data |
| Operations | Capacity planning, cost monitoring, vendor management | Monthly capacity reviews |

Establish weekly quality review meetings where engineering, QA, and product review production metrics together. The feedback loop between "what customers experience" and "what tests validate" should be as short as possible.


Conclusion

Voice agent failures cascade. A small ASR degradation propagates through intent classification, response generation, and audio synthesis—each layer amplifying the original error. Without observability across all four layers, you spend hours debugging symptoms instead of root causes.

The five-pillar framework—evaluation, regression, load, observability, alerting—provides complete coverage. Start with the foundation: 50-100 curated test cases, automated regression suites blocking deploys, and three production metrics (P90 latency under 3.5 seconds, task completion above 90%, WER under 5%). Scale from there based on what production monitoring reveals.

Every production failure should make your test suite stronger. Every test suite improvement should prevent the next production failure. That flywheel is the difference between voice agents that work in demos and voice agents that work in production.

Frequently Asked Questions

What latency targets should a production voice agent meet?

Target P90 end-to-end latency under 3.5 seconds and P99 under 5 seconds. Time to First Token (TTFT) should be under 800ms. Anything over 5 seconds feels completely broken to users and leads to conversation abandonment. Monitor latency at each pipeline stage independently—STT, LLM, and TTS—to identify bottlenecks when targets are exceeded.

What is the difference between offline evaluation and online monitoring?

Offline evaluation runs pre-deployment against curated datasets with known-correct outcomes, catching logic errors, prompt regressions, and intent classification failures. Online monitoring samples live production traffic continuously, catching failures that only appear in production such as audio degradation, accent-driven ASR errors, and network jitter disrupting turn-taking. Production-grade voice agents require both approaches.

How do you implement distributed tracing for a voice agent pipeline?

Use OpenTelemetry to create spans for each pipeline stage: STT processing, LLM inference, tool calls, and TTS synthesis. Configure a TracerProvider with an OTLP exporter pointing to your collector, then wrap each stage in a span with relevant attributes like confidence scores, token counts, and latency measurements. This produces waterfall traces showing exactly where time is spent in each voice turn.

What are the five pillars of voice agent testing?

The five pillars are: (1) Evaluation—measuring quality across ASR, NLU, LLM, and TTS layers; (2) Regression testing—preventing quality degradation when prompts, models, or configurations change; (3) Load testing—validating scalability under concurrent traffic; (4) Observability—distributed tracing and structured logging for diagnosis; (5) Alerting—threshold-based monitoring that catches problems before customers complain.

How do you detect model drift in a voice agent?

Run behavioral consistency tests weekly against a fixed set of baseline scenarios. Score responses using the same evaluation criteria and compare against stored baseline scores. Flag when cumulative scores drop more than 5%. Model drift can come from ASR provider updates, LLM fine-tuning, or TTS voice model changes—none of which require code changes on your side, making them easy to miss without automated detection.
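The baseline comparison is a few lines of code. A sketch (the scenario names, scores, and 5% tolerance are illustrative):

```python
def drift_detected(baseline: dict, current: dict,
                   tolerance: float = 0.05) -> bool:
    """Flag drift when the cumulative score across baseline scenarios
    drops more than `tolerance` relative to the stored baseline."""
    base_total = sum(baseline.values())
    curr_total = sum(current.get(scenario, 0.0) for scenario in baseline)
    return curr_total < base_total * (1 - tolerance)

baseline = {"greeting": 0.95, "booking": 0.90, "handoff": 0.88}
this_week = {"greeting": 0.94, "booking": 0.75, "handoff": 0.85}
print(drift_detected(baseline, this_week))  # → True
```

Because the check compares against stored scores rather than code versions, it catches upstream provider changes that never touch your repository.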

Which metrics should trigger alerts?

Alert on P90 and P99 latency (not averages), ASR word error rate exceeding 8%, intent accuracy dropping below 90% over a rolling hour, tool call success rate below 95%, and connection drop rates above 15%. Use duration filters requiring issues to persist for 5+ minutes before firing alerts to reduce noise from momentary spikes.

How do you start testing a voice agent from scratch?

Start with 50-100 curated conversations representing your core use cases. Define pass/fail criteria for each conversation type and run offline evaluation on every deploy. Track three key production metrics: P90 latency, task completion rate, and word error rate. This foundation takes days to set up and immediately catches the most common failures before scaling to comprehensive testing.

How do you load test a LiveKit voice agent?

Use the LiveKit CLI command `lk perf agent-load-test` to simulate concurrent rooms with synthetic participants. Start with 10 rooms and scale up to find the degradation point. Test at 2x expected peak capacity with realistic audio conditions including background noise. Set autoscaling thresholds at 70% of the capacity where P90 latency exceeds 3.5 seconds or task completion drops below 90%.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions in revenue per year.

He researched AI-powered medical image search at the University of Waterloo, where he graduated with honors in Engineering on the dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”