LiveKit Agent Monitoring in Production: Prometheus, Grafana & Alerts

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

February 4, 2026 · Updated February 4, 2026 · 14 min read

Why Voice Agents Need Specialized Monitoring

Voice agents operate in a fundamentally different environment than traditional web applications. While your HTTP API might tolerate a 500ms latency spike, a voice agent with the same delay creates an awkward pause that breaks conversational flow and erodes user trust.

The challenge compounds when you consider the architecture: real-time audio streaming, variable LLM inference times, multi-service orchestration across STT, LLM, and TTS providers, and the unpredictable nature of human speech. Generic APM tools like Datadog or New Relic excel at infrastructure monitoring but miss 60% of voice-specific failures.

Methodology Note: The monitoring framework, metrics, and alert thresholds in this guide are derived from Hamming's analysis of 4M+ production calls across 10K+ LiveKit voice agents (2025-2026).

Thresholds may vary by use case, latency tolerance, and user expectations. Our benchmarks represent performance patterns across customer support, healthcare, and enterprise deployments.

Last updated: December 2026

TL;DR: LiveKit agents need purpose-built monitoring with Prometheus for metrics collection, Grafana for visualization, and OpenTelemetry for distributed tracing. Key thresholds: TTFT under 800ms, P90 latency under 3.5s, P99 latency under 5s, WER under 5%, interruption rate under 15%. Alert on P99 metrics, not averages—responses over 5 seconds feel broken to users.

Understanding LiveKit Agent Architecture

Core Components of LiveKit Agents

LiveKit agents operate as real-time room participants that process audio through a pipeline: Speech-to-Text (STT) captures user utterances, the LLM generates responses, and Text-to-Speech (TTS) delivers the audio back to the user. Turn detection determines when the user has finished speaking, triggering the response generation.

User speaks → STT (150ms) → LLM (600ms) → TTS (120ms) → User hears
              └────────── Time to first audio: 870ms ──────────┘

Each component introduces latency and potential failure modes. The agent participates as a room member via WebRTC, handling real-time audio streams while maintaining conversation context across turns.

Why Voice Agents Need Specialized Monitoring

Traditional monitoring fails voice agents for three reasons:

| Challenge | What Generic APM Sees | What Voice Monitoring Sees |
|---|---|---|
| Variable LLM latency | Average API response time | Per-turn P99 latency, TTFT distribution |
| Multi-layer failures | Individual service errors | Cascading failures across STT→LLM→TTS |
| Audio quality issues | Nothing | MOS degradation, packet loss, jitter impact |
| Conversation breakdowns | Nothing | Interruption patterns, context loss, loops |
| Token cost spikes | API call counts | Per-session costs, TTS as 50% cost driver |

The monitoring you need must understand voice semantics: turn-taking, interruption handling, response timing, and conversation coherence.

Essential Metrics for LiveKit Voice Agent Monitoring

Voice Agent Monitoring Metrics: The Complete Reference

| Metric | Category | Definition | Target | Alert Threshold | Common Causes |
|---|---|---|---|---|---|
| TTFT (Time to First Token) | Latency | LLM response initiation time | <800ms | >1000ms | Cold starts, prompt length, rate limits |
| End-to-End Latency P90 | Latency | 90th percentile total response time | <3500ms | >3500ms | Cumulative STT+LLM+TTS delays |
| End-to-End Latency P99 | Latency | 99th percentile total response time | <5000ms | >5000ms | Cumulative STT+LLM+TTS delays |
| ASR Latency | Latency | Speech-to-text processing time | <200ms | >400ms | Audio quality, model complexity |
| TTS Latency | Latency | Text-to-speech synthesis time | <150ms | >300ms | Voice model, text length |
| Word Error Rate (WER) | Audio Quality | Transcription accuracy | <5% | >8% | Background noise, accents, audio degradation |
| Mean Opinion Score (MOS) | Audio Quality | Synthesized audio quality (1-5) | >4.3 | <3.8 | TTS model quality, network issues |
| Real-Time Factor (RTF) | Audio Quality | Processing time vs audio duration | <0.5 | >1.0 | Insufficient compute resources |
| End-of-Utterance Delay | Conversation | Time from speech end to response start | <500ms | >800ms | Turn detection tuning, VAD sensitivity |
| Interruption Rate | Conversation | Percentage of turns with user interruption | <15% | >25% | Slow responses, poor turn detection |
| Conversation Turns | Conversation | Average turns to task completion | <8 | >15 | Stuck logic, context loss, misunderstanding |
| Token Consumption | Cost | LLM tokens per session | Varies | >2x baseline | Prompt bloat, context window overflow |
| Per-Session Cost | Cost | Total cost per conversation | Varies | >2x baseline | TTS verbosity, LLM token usage |
| Tool Call Success Rate | Reliability | External integration success | >99% | <95% | API failures, timeout issues |

Latency Metrics Deep Dive

Time to First Token (TTFT) is the most critical latency metric for voice agents. Users perceive responses under 600ms as "instant" and responses over 1200ms as "delayed." Anything over 5 seconds feels completely broken.

Monitor latency at each pipeline stage independently:

Component Latency Breakdown (Target)
├── STT Processing: <200ms
├── LLM Inference: <600ms (TTFT <800ms)
└── TTS Synthesis: <150ms
Total End-to-End: <3500ms P90, <5000ms P99

Why P90 and P99 matter: Averages hide the worst experiences. If your P50 is 400ms but P90 is 3500ms and P99 is 5000ms, 10% of your users are experiencing significant delays and 1% are having terrible experiences—at scale, that's thousands of broken conversations daily.
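
To make the point concrete, here is a minimal, self-contained sketch (standard library only, with illustrative simulated numbers) showing how a comfortable-looking mean can coexist with a painful latency tail:

import random
import statistics

def percentile(samples, pct):
    # Nearest-rank percentile over a list of latency samples
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Simulated end-to-end latencies (ms): 90% of turns are fast, 10% sit in a slow tail
latencies = [random.gauss(700, 150) for _ in range(9000)]
latencies += [random.gauss(4200, 600) for _ in range(1000)]

print(f"mean: {statistics.mean(latencies):.0f} ms")  # looks tolerable on a dashboard
print(f"P50:  {percentile(latencies, 50):.0f} ms")   # hides the tail entirely
print(f"P90:  {percentile(latencies, 90):.0f} ms")   # what 1 in 10 users feel
print(f"P99:  {percentile(latencies, 99):.0f} ms")   # what 1 in 100 users feel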

Audio Quality Metrics

Word Error Rate (WER) directly impacts conversation quality. When transcription fails, the LLM receives incorrect input, generating irrelevant responses that frustrate users.

Target WER thresholds:

  • Excellent: <3% (quiet environment, clear speech)
  • Good: <5% (normal conditions)
  • Acceptable: <8% (noisy environment)
  • Critical: >12% (conversation breakdown likely)
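
If you want to spot-check WER on sampled production transcripts, one hedged option is the open-source jiwer package (an assumption on our part; it is not part of the LiveKit SDK). The reference transcripts and threshold below are illustrative:

# pip install jiwer  (third-party WER library; not part of LiveKit)
from jiwer import wer

# Hypothetical sampled pairs of (human-reviewed reference, ASR hypothesis)
samples = [
    ("i need to reschedule my appointment", "i need to reschedule my apartment"),
    ("what time do you close today", "what time do you close today"),
]

references = [ref for ref, _ in samples]
hypotheses = [hyp for _, hyp in samples]

sample_wer = wer(references, hypotheses)
print(f"sampled WER: {sample_wer:.1%}")  # export this value to the livekit_asr_wer gauge

# Flag if the sampled WER crosses the 8% alert threshold from the table above
if sample_wer > 0.08:
    print("WARNING: WER above critical threshold")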

Mean Opinion Score (MOS) for TTS output should exceed 4.3 on a 5-point scale. Lower scores indicate synthesized audio that sounds robotic or unnatural.

Conversation Flow Metrics

End-of-utterance delay measures how quickly the agent responds after the user stops speaking. This depends on turn detection accuracy—detecting too early causes interruptions, detecting too late creates awkward silences.

Track interruption rates carefully. Some interruption is natural (users correcting themselves), but rates above 15% indicate the agent is responding too slowly or turn detection is misconfigured.
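
One way to make interruption rate observable, anticipating the Prometheus setup in the next section, is to count interrupted turns alongside total turns and compute the ratio in PromQL. The detection hook (was_interrupted) is a placeholder for however your agent detects barge-in:

from prometheus_client import Counter

turns_total = Counter(
    'livekit_turns_total', 'Total agent turns', ['agent_id']
)
interrupted_turns_total = Counter(
    'livekit_interrupted_turns_total', 'Turns cut off by user barge-in', ['agent_id']
)

def record_turn(agent_id: str, was_interrupted: bool):
    # Call once per completed turn from your agent's event handling
    turns_total.labels(agent_id=agent_id).inc()
    if was_interrupted:
        interrupted_turns_total.labels(agent_id=agent_id).inc()

# Interruption rate in PromQL (alert above 25% per the thresholds table):
#   sum(rate(livekit_interrupted_turns_total[15m]))
#     / sum(rate(livekit_turns_total[15m]))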

Cost and Resource Metrics

Voice agents have complex cost structures. In our analysis, TTS typically drives 50% of per-session costs for verbose agents. Track:

  • Token consumption per turn and per session
  • TTS character count (many providers charge per character)
  • Concurrent session counts for capacity planning
  • Cost per successful task completion (the metric that matters)
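
As a rough, hedged illustration of per-session cost attribution (all unit prices below are placeholders, not actual provider rates), the sketch sums LLM token, TTS character, and STT minute costs per session and reports cost per completed task:

from dataclasses import dataclass

# Placeholder unit prices -- substitute your providers' actual rates
LLM_INPUT_PER_1K_TOKENS = 0.0025
LLM_OUTPUT_PER_1K_TOKENS = 0.01
TTS_PER_1K_CHARS = 0.015
STT_PER_MINUTE = 0.006

@dataclass
class SessionUsage:
    prompt_tokens: int
    completion_tokens: int
    tts_characters: int
    audio_minutes: float
    task_completed: bool

def session_cost(u: SessionUsage) -> float:
    return (
        u.prompt_tokens / 1000 * LLM_INPUT_PER_1K_TOKENS
        + u.completion_tokens / 1000 * LLM_OUTPUT_PER_1K_TOKENS
        + u.tts_characters / 1000 * TTS_PER_1K_CHARS
        + u.audio_minutes * STT_PER_MINUTE
    )

sessions = [SessionUsage(3200, 900, 2400, 4.5, True),
            SessionUsage(5100, 1400, 3800, 7.0, False)]

completed = [s for s in sessions if s.task_completed]
total_cost = sum(session_cost(s) for s in sessions)
# The metric that matters: cost per successful task completion
print(f"cost per completed task: ${total_cost / max(len(completed), 1):.3f}")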

Implementing Prometheus for LiveKit Agents

Setting Up Metrics Collection

LiveKit Agent Worker exposes a /metrics endpoint compatible with Prometheus scraping. Configure your agent to expose metrics during initialization:

from livekit.agents import JobProcess, metrics

# In your agent entrypoint / worker prewarm hook
def prewarm(proc: JobProcess):
    # Aggregate per-session usage (tokens, audio duration); per-component
    # metrics are emitted via the session's metrics_collected event
    proc.userdata["usage_collector"] = metrics.UsageCollector()

Key Prometheus Metrics to Instrument

Define counters, histograms, and gauges for voice-specific metrics:

from prometheus_client import Counter, Histogram, Gauge

# Conversation metrics
conversation_turns_total = Counter(
    'livekit_conversation_turns_total',
    'Total conversation turns processed',
    ['agent_id', 'outcome']
)

# Latency histograms with voice-appropriate buckets
llm_duration_ms = Histogram(
    'livekit_llm_duration_ms',
    'LLM response time in milliseconds',
    ['model', 'agent_id'],
    buckets=[100, 200, 400, 600, 800, 1000, 1500, 2000, 3000, 5000]
)

eou_delay_ms = Histogram(
    'livekit_eou_delay_ms',
    'End-of-utterance to response delay',
    ['agent_id'],
    buckets=[100, 200, 300, 500, 800, 1000, 1500, 2000]
)

# Audio quality gauges
wer_gauge = Gauge(
    'livekit_asr_wer',
    'Current word error rate',
    ['agent_id']
)

# Active sessions for capacity monitoring
active_sessions = Gauge(
    'livekit_active_sessions',
    'Number of active voice sessions',
    ['agent_id']
)
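
These instruments need to be fed from somewhere. Below is a minimal wiring sketch, assuming the v1.x Agents SDK metrics_collected event, LLMMetrics/EOUMetrics payloads with durations in seconds, and that you expose your own scrape port with prometheus_client rather than relying solely on the worker's built-in endpoint; the label values are placeholders:

from prometheus_client import start_http_server
from livekit.agents import AgentSession, MetricsCollectedEvent, metrics

AGENT_ID = "support-agent"  # illustrative label value

def wire_metrics(session: AgentSession):
    # Serve the instruments defined above on :8080/metrics for Prometheus to scrape
    start_http_server(8080)

    @session.on("metrics_collected")
    def _on_metrics(ev: MetricsCollectedEvent):
        m = ev.metrics
        if isinstance(m, metrics.LLMMetrics):
            # m.duration (and m.ttft) are reported in seconds; buckets above are in ms
            llm_duration_ms.labels(model="primary", agent_id=AGENT_ID).observe(
                m.duration * 1000
            )
        elif isinstance(m, metrics.EOUMetrics):
            eou_delay_ms.labels(agent_id=AGENT_ID).observe(
                m.end_of_utterance_delay * 1000
            )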

Prometheus Configuration for Voice Workloads

Configure Prometheus with scrape intervals appropriate for real-time monitoring:

# prometheus.yml
global:
  scrape_interval: 15s      # Balance between granularity and overhead
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'livekit-agents'
    static_configs:
      - targets: ['agent-worker-1:8080', 'agent-worker-2:8080']
    scrape_interval: 15s    # 15-30s appropriate for voice metrics
    scrape_timeout: 10s

  - job_name: 'livekit-server'
    static_configs:
      - targets: ['livekit-server:7880']

# Retention is configured via command-line flags, not prometheus.yml:
#   --storage.tsdb.retention.time=30d    # keep 30 days for trend analysis
#   --storage.tsdb.retention.size=50GB

For high-volume deployments, consider 30-second scrape intervals to reduce overhead while maintaining sufficient granularity for alerting.

Building Grafana Dashboards for Voice Agents

Dashboard Design Principles

Organize panels by pipeline stage to enable rapid root cause identification:

  1. Telephony Layer: Connection health, WebRTC metrics, room status
  2. ASR Layer: WER, transcription latency, confidence scores
  3. LLM Layer: TTFT, token usage, response generation time
  4. TTS Layer: Synthesis latency, audio quality, character throughput
  5. Integration Layer: Tool call success rates, external API latency

Critical Dashboard Panels

Latency Overview Panel:

# P50, P90, P95, P99 latency percentiles
histogram_quantile(0.50, rate(livekit_llm_duration_ms_bucket[5m]))
histogram_quantile(0.90, rate(livekit_llm_duration_ms_bucket[5m]))
histogram_quantile(0.95, rate(livekit_llm_duration_ms_bucket[5m]))
histogram_quantile(0.99, rate(livekit_llm_duration_ms_bucket[5m]))

Error Rate by Component:

# Error rate percentage by component
sum(rate(livekit_errors_total{component="stt"}[5m])) /
sum(rate(livekit_requests_total{component="stt"}[5m])) * 100

Active Sessions and Capacity:

# Current utilization vs capacity
livekit_active_sessions / livekit_max_sessions * 100

Cost Tracking:

# Hourly token consumption trend
increase(livekit_tokens_consumed_total[1h])

Visualizing End-to-End Call Journey

Create trace-like waterfall views showing timing from user utterance through agent response:

| Stage | Start | Duration | End |
|---|---|---|---|
| User Speech | 0ms | 2500ms | 2500ms |
| STT Processing | 2500ms | 180ms | 2680ms |
| LLM Inference | 2680ms | 650ms | 3330ms |
| TTS Synthesis | 3330ms | 140ms | 3470ms |
| Audio Playback | 3470ms | — | — |

This visualization helps identify which component is the bottleneck when latency spikes occur.

Distributed Tracing with OpenTelemetry

Enabling OpenTelemetry in LiveKit Agents

For Python Agents SDK v1.3+, configure the tracer provider in your entrypoint:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

def configure_tracing():
    provider = TracerProvider()
    processor = BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317")
    )
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)

# Call during agent initialization
configure_tracing()

Structuring Spans for Voice Pipelines

Create semantic spans for each pipeline stage with relevant attributes:

tracer = trace.get_tracer("livekit-agent")

async def process_turn(audio):
    with tracer.start_as_current_span("voice_turn") as turn_span:

        # STT span (processing raw audio)
        with tracer.start_as_current_span("stt_processing") as stt_span:
            transcript = await transcribe(audio)
            stt_span.set_attribute("stt.confidence", transcript.confidence)
            stt_span.set_attribute("stt.word_count", len(transcript.words))

        turn_span.set_attribute("turn.user_input_length", len(transcript.text))

        # LLM span
        with tracer.start_as_current_span("llm_inference") as llm_span:
            llm_span.set_attribute("llm.model", "gpt-4")
            response = await generate_response(transcript.text)
            llm_span.set_attribute("llm.prompt_tokens", response.prompt_tokens)
            llm_span.set_attribute("llm.completion_tokens", response.completion_tokens)

        # Tool call span (if applicable)
        if response.requires_tool:
            with tracer.start_as_current_span("tool_invocation") as tool_span:
                tool_span.set_attribute("tool.name", response.tool_name)
                result = await execute_tool(response.tool_call)
                tool_span.set_attribute("tool.success", result.success)

        # TTS span
        with tracer.start_as_current_span("tts_synthesis") as tts_span:
            audio_out = await synthesize(response.text)
            tts_span.set_attribute("tts.character_count", len(response.text))
            tts_span.set_attribute("tts.audio_duration_ms", audio_out.duration_ms)

Correlating Traces with Logs and Metrics

Tag spans with semantic attributes that enable cross-referencing:

span.set_attribute("prompt.version", "v2.3.1")
span.set_attribute("session.id", session_id)
span.set_attribute("agent.id", agent_id)
span.set_attribute("user.segment", user_segment)

Propagate trace context across HTTP and gRPC boundaries to maintain trace continuity through distributed services:

from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Adds trace context to outgoing request headers
response = await client.post(url, headers=headers, json=payload)
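
On the receiving service, the counterpart is to extract the context from incoming headers before starting new spans. A minimal sketch using the standard OpenTelemetry propagation API (do_work is a placeholder for your handler logic):

from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer("downstream-service")

async def handle_request(headers: dict, payload: dict):
    # Rebuild the caller's trace context from the incoming request headers
    ctx = extract(headers)
    # Spans started under this context join the original voice_turn trace
    with tracer.start_as_current_span("tool_backend", context=ctx):
        return await do_work(payload)  # placeholder for your handler logic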

Setting Up Production Alerts

Critical Alert Thresholds

| Metric | Warning | Critical | Rationale |
|---|---|---|---|
| TTFT P99 | >800ms | >1200ms | User-perceived delay threshold |
| End-to-End Latency P90 | >3000ms | >3500ms | 10% of users experiencing delays |
| End-to-End Latency P99 | >4000ms | >5000ms | Conversation flow breakdown |
| Word Error Rate | >5% | >8% | Transcription accuracy floor |
| Tool Call Success Rate | <99% | <95% | Integration reliability |
| Interruption Rate | >15% | >25% | Turn detection failure |
| Active Sessions | >80% capacity | >90% capacity | Capacity planning |

Configuring Alertmanager

Route voice agent alerts with appropriate severity and notification channels:

# alertmanager.yml
route:
  receiver: 'default'
  group_by: ['alertname', 'agent_id']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # P0: Revenue-impacting issues → PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-oncall'
      continue: true

    # P1: Customer-impacting issues → Slack #voice-alerts
    - match:
        severity: warning
      receiver: 'slack-voice-alerts'

receivers:
  - name: 'slack-voice-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#voice-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: 'Agent {{ .Labels.agent_id }}: {{ .Annotations.summary }}'

  - name: 'pagerduty-oncall'
    pagerduty_configs:
      - service_key: '...'
        severity: critical

Reducing Alert Fatigue

Focus on metrics that directly impact users:

  1. Alert on P90 and P99, not averages: A P50 of 400ms with P90 of 3500ms and P99 of 5000ms means 10% of users are experiencing delays and 1% are suffering. At 10,000 daily sessions, that's 1,000 delayed and 100 broken experiences.

  2. Use duration filters: Require issues to persist for 5+ minutes before alerting to avoid momentary spikes.

  3. Track behavior drift: Alert on sudden changes in response length, conversation loops, or topic distribution—these indicate prompt regression or model updates.

  4. Include runbook links: Every alert should link to a runbook with diagnostic steps and remediation actions.

Example alert rule with duration filtering:

groups:
  - name: voice-agent-latency
    rules:
      - alert: HighP90Latency
        expr: histogram_quantile(0.90, rate(livekit_llm_duration_ms_bucket[5m])) > 3500
        for: 5m  # Must persist for 5 minutes
        labels:
          severity: warning
        annotations:
          summary: "End-to-end P90 latency exceeds 3.5 seconds"
          runbook: "https://wiki.internal/runbooks/high-latency"

      - alert: HighP99Latency
        expr: histogram_quantile(0.99, rate(livekit_llm_duration_ms_bucket[5m])) > 5000
        for: 5m  # Must persist for 5 minutes
        labels:
          severity: critical
        annotations:
          summary: "End-to-end P99 latency exceeds 5 seconds"
          runbook: "https://wiki.internal/runbooks/high-latency"

LiveKit Agent Observability Features

Built-in Agent Observability

LiveKit Cloud provides native observability including:

  • Trace View: Visual timeline showing turn detection, LLM timing, and tool execution for each conversation
  • Session Recordings: Opt-in audio and transcript capture for debugging and compliance
  • Real-time Metrics: WebRTC quality metrics, room health, and participant status

Transcript and Session Recording

Configure recording at project or agent level for compliance-sensitive deployments:

room_options = RoomOptions(
    recording_enabled=True,
    recording_output="s3://recordings-bucket/",
)

Note that recording increases storage costs and may have compliance implications. Enable selectively based on use case requirements.

Integration with Third-Party Tools

LiveKit integrates with specialized voice QA platforms:

  • Hamming: End-to-end voice agent testing and production monitoring with 50+ built-in metrics
  • Simulation platforms: Synthetic caller generation for load testing and regression suites
  • Evaluation frameworks: Automated scoring for task completion, conversation quality, and compliance

Production Monitoring Best Practices

Monitoring Implementation Strategy

  1. Instrument at app lifecycle start: Configure metrics and tracing in your agent's initialization, not lazily during requests.

  2. Track component-level latency separately: Don't just measure end-to-end time. Instrument STT, LLM, tool calls, and TTS independently to enable root cause analysis.

  3. Enable continuous evaluation: Sample production conversations for quality scoring, not just latency metrics.

  4. Version your prompts: Tag traces and metrics with prompt version to correlate performance changes with prompt updates.

Handling Multi-Layer Failures

Voice agent failures cascade across layers. Use this diagnostic framework:

| Symptom | Check First | Check Second | Check Third |
|---|---|---|---|
| High latency | LLM provider status | Token rate limits | Network path |
| Poor transcription | Audio quality (MOS) | Background noise levels | ASR provider |
| Wrong responses | Intent classification | Context window | Prompt version |
| Broken tool calls | External API health | Authentication | Timeout settings |
| User abandonment | TTFT metrics | Conversation loops | Task completion rate |

Continuous Optimization

Use production data to identify bottlenecks:

  • LLM typically dominates latency: In our analysis, LLM inference accounts for 60-70% of end-to-end response time
  • TTS drives costs: For verbose agents, TTS synthesis represents 50% of per-session costs
  • Turn detection is tunable: Adjusting end-of-utterance thresholds can significantly impact perceived responsiveness
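
For the turn-detection lever, a hedged sketch of how this is commonly tuned, assuming the v1.x AgentSession endpointing options (min_endpointing_delay / max_endpointing_delay; verify exact names and defaults against your SDK version):

from livekit.agents import AgentSession

# Lower min_endpointing_delay -> snappier responses but more false barge-ins;
# higher values -> fewer interruptions of the user but longer perceived pauses.
session = AgentSession(
    min_endpointing_delay=0.4,   # assumed option: seconds of silence before the turn can end
    max_endpointing_delay=3.0,   # assumed option: hard cap when the turn detector is uncertain
    # stt=..., llm=..., tts=...  # your existing pipeline components
)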

Testing and Evaluation Integration

Pre-Production Testing with Prometheus

Run synthetic tests while collecting metrics to establish baseline performance:

  1. Generate test traffic with varied audio conditions (accents, background noise, interruptions)
  2. Collect Prometheus metrics during test runs
  3. Compare results against production thresholds
  4. Gate deployments on metric regressions

Connecting Evaluation Metrics to Monitoring

Feed low-scoring production calls back into offline datasets:

Production Call → Score Below Threshold → Add to Regression Suite
                                        → Create Test Case
                                        → Validate Fix in CI

Platforms like Hamming enable this workflow by capturing production conversations and integrating with CI/CD pipelines.

CI/CD Integration for Voice Agents

Block deployments when critical thresholds are exceeded:

# Example CI gate
voice-quality-gate:
  script:
    - run-synthetic-tests --duration 5m
    - check-metrics --ttft-p99 1200 --p90-max 3500 --p99-max 5000 --wer-max 8 --success-rate 95
  allow_failure: false
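
The check-metrics step above is a placeholder; one hedged way to implement it is a small script that queries the Prometheus HTTP API and fails the pipeline when thresholds are exceeded. The endpoint URL is environment-specific, and the LLM-latency histogram from earlier is used as a stand-in for an end-to-end latency metric:

import sys
import requests  # third-party HTTP client, assumed available in the CI image

PROM_URL = "http://prometheus:9090/api/v1/query"  # adjust to your environment

# PromQL expression -> maximum allowed value (ms or percent)
GATES = {
    'histogram_quantile(0.99, rate(livekit_llm_duration_ms_bucket[5m]))': 5000,
    'histogram_quantile(0.90, rate(livekit_llm_duration_ms_bucket[5m]))': 3500,
    'avg(livekit_asr_wer) * 100': 8,
}

def query(expr: str) -> float:
    resp = requests.get(PROM_URL, params={"query": expr}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

failed = False
for expr, limit in GATES.items():
    value = query(expr)
    status = "OK" if value <= limit else "FAIL"
    print(f"[{status}] {expr} = {value:.1f} (limit {limit})")
    failed = failed or value > limit

sys.exit(1 if failed else 0)  # non-zero exit blocks the deployment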

Real-World Implementation Examples

Reducing Latency from 7s to 4.8s

A team using the webrtc-agent-livekit pattern identified their bottleneck through Grafana dashboards:

  1. Discovery: P99 latency was 7000ms, with 70% attributed to LLM inference
  2. Analysis: Prompt length had grown to 4000 tokens over time
  3. Action: Prompt optimization reduced tokens by 60%, switched to faster model variant
  4. Result: P99 dropped to 4800ms (under the 5s target), user satisfaction scores improved 23%

Cost Optimization Through Monitoring

Dashboard analysis revealed TTS drove 50% of per-session costs for a verbose customer support agent:

  1. Discovery: Agents were generating 400+ character responses on average
  2. Analysis: Prompt encouraged detailed explanations even for simple queries
  3. Action: Added response length guidelines to prompt, implemented streaming TTS
  4. Result: Average response length dropped to 180 characters, costs reduced 35%

Operational Alerts in Action

When TTFT exceeded 800ms, alerts enabled rapid response:

  1. Alert triggered: "TTFT P99 > 800ms for 5 minutes" fired to Slack
  2. Investigation: LLM provider dashboard showed elevated latency
  3. Action: Triggered failover to secondary LLM provider
  4. Resolution: Latency normalized within 2 minutes of failover

Common Pitfalls and Solutions

Missing Context Propagation

Problem: Traces break across service boundaries, hiding root causes in distributed voice systems.

Solution: Ensure trace IDs propagate through all HTTP/gRPC calls. Verify with end-to-end trace validation in staging.

Over-Reliance on Average Metrics

Problem: Monitoring averages while tail latency (P90 and P99) quietly causes thousands of poor experiences at scale.

Solution: Configure dashboards and alerts for percentile distributions (P50, P90, P95, P99). A "healthy" average can mask severe tail latency. Focus especially on P90 (3.5s threshold) and P99 (5s threshold), which capture the slowest 10% and 1% of turns.

Ignoring ASR Error Correlation

Problem: LLM generates irrelevant responses, but LLM metrics look fine.

Solution: Attach WER and confidence scores at span level. When response quality drops, check if transcription errors are the root cause.

Alert Configuration Anti-Patterns

Common mistakes:

  • No duration filter: Alerts on momentary spikes
  • No cooldown: 50 alerts for single incident
  • Wrong severity: Everything feels equally urgent
  • Missing context: Alert fires but no diagnostic info included

Tools and Ecosystem

Prometheus and Grafana Stack

The Prometheus + Grafana combination provides a lightweight, self-hosted foundation for metrics collection and visualization. Benefits include:

  • Low operational overhead
  • Flexible query language (PromQL)
  • Rich visualization options
  • Strong community and ecosystem

OpenTelemetry for Tracing

OpenTelemetry provides language-agnostic primitives and vendor-neutral instrumentation:

  • Consistent span semantics across services
  • Auto-instrumentation for common frameworks
  • Export to multiple backends (Jaeger, Zipkin, commercial APM)
  • Growing support in voice/AI frameworks

Specialized Voice QA Platforms

For teams requiring comprehensive voice agent observability, specialized platforms offer:

  • End-to-end evaluation: Task completion, conversation quality, compliance scoring
  • Production monitoring: Real-time alerts with voice-specific metrics
  • Regression testing: Automated test suites for voice agent deployments
  • Cost analysis: Per-session and per-component cost attribution

Hamming provides 50+ built-in voice metrics and integrates with LiveKit, Pipecat, and other voice platforms for unified observability.

Conclusion

Production monitoring catches issues that testing misses. Voice agents require purpose-built observability that understands real-time audio processing, multi-service orchestration, and conversation semantics.

Start with the essential metrics: TTFT under 800ms, P90 latency under 3.5s, P99 latency under 5s, WER under 5%, and interruption rate under 15%. Build Grafana dashboards organized by pipeline stage. Configure alerts on P90 and P99 metrics with duration filters to reduce noise.

The difference between a frustrating voice agent and a delightful one often comes down to the monitoring infrastructure behind it. Users don't care about your architecture—they care about whether the agent responds quickly, understands them accurately, and completes their task.


Frequently Asked Questions

Why do voice agents need specialized monitoring instead of traditional APM?

Real-time audio streams, variable LLM latency, and multi-service orchestration require voice-specific metrics that traditional APM tools cannot capture. Generic monitoring tools like Datadog track infrastructure health but miss 60% of voice-specific failures including TTFT degradation, ASR accuracy drops, turn detection issues, and conversation flow breakdowns. Voice agents need monitoring that understands real-time audio processing and conversation semantics.

Which metrics matter most for voice agent monitoring?

The most critical metrics are: Time to First Token or TTFT (target under 800ms, critical above 1200ms), end-to-end P99 latency (target under 5000ms), Word Error Rate or WER (target under 5%), interruption rate (target under 15%), tool call success rate (target above 99%), and compliance violations. These metrics directly correlate with user satisfaction and conversation quality. Alert on P99 values rather than averages to catch tail latency issues.

How do I enable OpenTelemetry tracing in LiveKit agents?

Configure the tracer provider via set_tracer_provider in your agent's entrypoint function to capture spans for LLM calls, tool invocations, STT processing, and TTS synthesis. For Python Agents SDK v1.3+, initialize the TracerProvider with an OTLP exporter, add a BatchSpanProcessor, and call trace.set_tracer_provider(). Create semantic spans for each pipeline stage with relevant attributes like model name, token counts, and latency measurements.

What alert thresholds should I set for voice agent latency?

Set critical alerts at 1200ms for TTFT P99, 3500ms for end-to-end P90, and 5000ms for end-to-end P99 response time. Responses over 5 seconds feel broken to users and cause conversation abandonment. Use duration filters requiring issues to persist for 5+ minutes before alerting to avoid noise from momentary spikes. Warning thresholds should be set at 800ms for TTFT, 3000ms for P90, and 4000ms for P99 end-to-end latency.

How do Prometheus and Grafana integrate with LiveKit agents?

LiveKit Agent Worker exposes a /metrics endpoint that Prometheus can scrape. Configure prometheus.yml with 15-30 second scrape intervals for voice workloads, targeting your agent worker endpoints. Define counters for conversation turns, histograms for latency distributions with voice-appropriate buckets (100ms to 5000ms), and gauges for WER and active sessions. Grafana then visualizes the time-series data with dashboards organized by pipeline stage.

Which Grafana dashboards should I build for voice agents?

Essential dashboards include: latency percentile breakdown (P50/P95/P99) for LLM, STT, and TTS components; error rates by component and error type; cost transparency showing token consumption and per-session costs; active session counts and capacity utilization; and end-to-end call journey visualization showing timing waterfalls from user utterance through agent response. Organize panels by pipeline stage: telephony, ASR, LLM, TTS, and integrations.

How do I reduce alert fatigue when monitoring voice agents?

Focus on P99 metrics rather than averages, as averages hide the worst user experiences. Use duration filters requiring issues to persist for 5+ minutes before alerting. Track behavior drift patterns like sudden changes in response length or conversation loops. Monitor cost anomalies that indicate prompt bloat or integration issues. Include context with every alert: current value, baseline, sample calls, and runbook links with diagnostic steps.

What is the difference between offline testing and online production monitoring?

Offline testing validates curated datasets before deployment using simulated scenarios to catch bugs before launch. Online monitoring watches live production traffic with continuous scoring and real-time alerting to catch unknown issues and production degradation. You need both: testing prevents known issues from reaching production, while monitoring catches issues that only surface at scale or due to provider updates and conversation pattern changes.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, grew AI-powered sales program to 100s of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”