Voice Agent Analytics in Grafana: Dashboard Template

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

May 15, 2026 · Updated May 15, 2026 · 13 min read

A voice agent analytics Grafana dashboard works best when it treats voice calls as three different telemetry shapes, not one stream of "AI events." Stable counts and durations belong in Prometheus. Searchable call events belong in Loki. Cross-service timing belongs in traces through Tempo or your tracing backend.

If you skip that split, Grafana gets noisy fast. The first dashboard looks impressive, then somebody adds call_id, transcript text, intent name, user ID, prompt version, and customer tier as metric labels. Two weeks later, queries slow down, alert rules flap, and the team still cannot answer which failed call needs replay.

This guide is for teams that already use Grafana, Prometheus, Loki, Tempo, Alloy, or OpenTelemetry and want voice analytics in that stack without turning the observability system into a transcript warehouse.

Voice agent analytics in Grafana is the practice of routing voice-agent metrics, call events, traces, and QA pointers into Grafana-compatible backends so engineering and QA teams can monitor production quality, drill into failures, and alert on regressions without exposing raw conversation data unnecessarily.

Quick filter: If you run fewer than 100 voice agent calls per week, start with voice agent analytics metrics and manual call review. Grafana pays off when you have enough volume that trends, alerts, and per-intent breakdowns matter.

TL;DR: Build the dashboard with four rules:

  • Metrics: Send counters, current-value metrics, and histograms to Prometheus or Mimir.
  • Events: Send searchable JSON call events to Loki.
  • Traces: Send STT, LLM, tool, TTS, and webhook spans through OpenTelemetry.
  • Evidence: Store replay URLs, transcript IDs, QA score IDs, and redaction state as pointers, not raw payloads.

Grafana should show where quality is degrading. Your QA system should still own call replay, transcript annotation, and evaluation evidence.

Methodology Note: This dashboard template is based on Hamming's analysis of monitoring and debugging workflows across 4M+ production voice agent calls and 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions.

It also uses public Grafana, OpenTelemetry, Prometheus, and LiveKit documentation to ground the telemetry pipeline and query examples.

Last Updated: May 2026


The Routing Rule: Metric, Event, Trace, or Evidence Pointer

Most bad Grafana voice dashboards fail at the ingestion contract. They emit everything as a metric because Prometheus is already there.

That is the wrong default.

Use this routing table instead:

Voice Signal | Send To | Example | Why
Call volume | Prometheus / Mimir | voice_calls_total{agent="billing", outcome="resolved"} | Stable counter, easy to alert on
Latency percentiles | Prometheus / Mimir | voice_response_latency_seconds_bucket | Histograms support p50/p95/p99 queries
Active sessions | Prometheus / Mimir | voice_active_sessions | Low-cardinality current-value metric
Stage timings | OpenTelemetry traces / Tempo | stt.transcription, llm.inference, tts.synthesis spans | Debugs cascades across services
Transcript turn event | Loki | JSON log with turn_id, redaction_state, intent, confidence | Searchable event, not a metric label
Low-confidence ASR turn | Loki + metric counter | Log full event; increment voice_low_confidence_turns_total | Supports both search and alerting
QA assertion result | Metrics + evidence pointer | voice_assertion_failures_total; qa_result_id in logs | Aggregate in Grafana, replay in QA tool
Raw transcript text | Restricted transcript store | transcript_id pointer only in Grafana | Avoids privacy and label explosion
Audio recording | Restricted recording store | recording_id or replay URL pointer | Grafana should not store raw audio
Prompt version | Trace attribute and event field | prompt_version="checkout-v14" | Useful for filtering without storing prompt text

Grafana's OpenTelemetry docs describe the same three-signal shape at the collector level: OTLP comes in, then metrics, logs, and traces fan out to Prometheus/Mimir, Loki, and Tempo. The voice-specific part is deciding which call facts belong in which signal.
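If your voice runtime is written in Python, the application side of that fan-out is small: point the OpenTelemetry SDK at the collector and let the collector route each signal to the right backend. A minimal sketch, assuming an OTLP gRPC endpoint on localhost:4317 (Alloy or an OpenTelemetry Collector) and the opentelemetry-sdk plus opentelemetry-exporter-otlp packages; service names and endpoints are placeholders to adapt.

# Minimal OTLP trace export from a Python voice runtime (sketch, not production config).
# Assumes Alloy or an OpenTelemetry Collector listens on localhost:4317 and fans out to Tempo.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

resource = Resource.create({
    "service.name": "voice-agent",          # shared correlation attribute across metrics, logs, traces
    "deployment.environment": "prod",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("voice-agent")     # used later to create call.lifecycle spans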

Copy This Voice Analytics Event Envelope

Start with one event envelope. Do not let every service invent its own shape.

{
  "event_name": "voice.turn.completed",
  "event_version": "2026-05-15",
  "timestamp": "2026-05-15T17:42:31.122Z",
  "canonical_call_id": "call_01JZ7Q8P2F9M6K",
  "trace_id": "9f7c2d4f0f3a4c1e8e4d2a5b7c6f9012",
  "span_id": "3d4f0f3a4c1e8e4d",
  "agent_id": "billing-agent",
  "agent_version": "checkout-v14",
  "environment": "prod",
  "language": "en-US",
  "channel": "pstn",
  "provider": {
    "transport": "twilio",
    "stt": "deepgram",
    "llm": "openai",
    "tts": "elevenlabs"
  },
  "turn": {
    "turn_index": 7,
    "speaker": "user",
    "intent": "billing_dispute",
    "asr_confidence": 0.72,
    "interruption_count": 1,
    "silence_ms_before_response": 640
  },
  "quality": {
    "task_success": false,
    "policy_passed": true,
    "qa_score": 0.61,
    "failure_reason": "tool_timeout"
  },
  "evidence": {
    "transcript_turn_id": "turn_07",
    "recording_segment_id": "rec_seg_07",
    "qa_result_id": "qa_8ac2",
    "redaction_state": "redacted"
  }
}

The important fields are canonical_call_id, trace_id, and redaction_state.

canonical_call_id lets you join voice analytics to IVR paths, telephony provider records, CRM outcomes, and QA results. If you have IVR transfers or multiple provider IDs, use the IVR-to-agent log correlation runbook before building dashboards.

trace_id lets Grafana jump from a metric spike to the exact trace. Grafana's signal correlation documentation calls out shared labels, resource attributes, and trace context as the foundation for moving between metrics, logs, and traces.

redaction_state tells reviewers whether the event is safe for broad dashboards. If it says unredacted, do not index transcript text in Loki or show it in a shared Grafana table. Use a pointer instead.
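If the agent runtime is Python, one simple way to emit that envelope is to log each event as a single JSON line to stdout and let Alloy or Promtail ship it to Loki. A sketch under that assumption; emit_turn_event is an illustrative helper, not a fixed API, and the hardcoded agent fields are placeholders.

import json
import logging
import sys
from datetime import datetime, timezone

# One JSON object per line so Loki's `| json` stage can parse fields at query time.
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
logger = logging.getLogger("voice.events")


def emit_turn_event(call_id: str, trace_id: str, turn: dict, quality: dict, evidence: dict) -> None:
    """Illustrative helper: emit one voice.turn.completed event as a JSON log line."""
    event = {
        "event_name": "voice.turn.completed",
        "event_version": "2026-05-15",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "canonical_call_id": call_id,
        "trace_id": trace_id,            # lets Grafana jump from this row to the matching trace
        "agent_id": "billing-agent",     # placeholder
        "environment": "prod",
        "turn": turn,
        "quality": quality,
        "evidence": evidence,            # pointers only: transcript_turn_id, qa_result_id, redaction_state
    }
    logger.info(json.dumps(event))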

What Metrics Should Go Into Prometheus

Prometheus should get metrics whose aggregations make sense across many calls. That usually means counters, current-value metrics, and histograms with low-cardinality labels.

Metric | Type | Labels | Use
voice_calls_total | Counter | agent, environment, outcome, language | Volume and outcome trend
voice_active_sessions | Current-value metric | agent, environment, region | Live capacity and incident detection
voice_response_latency_seconds | Histogram | agent, environment, stage | p50/p95/p99 latency by pipeline stage
voice_turns_total | Counter | agent, environment, speaker | Loop and verbosity detection
voice_interruptions_total | Counter | agent, environment, intent_family | Barge-in and overtalk trend
voice_low_confidence_turns_total | Counter | agent, environment, stt_provider, language | ASR drift and audio quality
voice_assertion_failures_total | Counter | agent, environment, assertion_type | QA regression alerting
voice_tool_failures_total | Counter | agent, environment, tool_name, failure_type | Backend integration failures
voice_escalations_total | Counter | agent, environment, escalation_reason | Human handoff and containment
voice_cost_usd_total | Counter | agent, environment, provider_family | Cost monitoring

Do not add call_id, user_id, transcript text, account ID, phone number, prompt text, or raw intent text as metric labels.

Grafana's high-cardinality alerting docs explain why: each unique label set creates a separate time series. In voice analytics, call_id turns every call into a new time series. That is expensive and usually useless.
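If you instrument the agent in Python, the prometheus_client library makes the bounded-label contract explicit: label names are declared once, and per-call identifiers never become labels. A sketch, assuming metrics are scraped from a local /metrics endpoint on port 9000; the bucket boundaries and label values are illustrative.

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Bounded labels only: no call_id, user_id, phone number, or free-form intent text.
VOICE_CALLS = Counter(
    "voice_calls_total", "Completed voice agent calls",
    ["agent", "environment", "outcome", "language"],
)
ACTIVE_SESSIONS = Gauge(   # the "current-value metric" in the table is a Prometheus gauge
    "voice_active_sessions", "Currently active voice sessions",
    ["agent", "environment", "region"],
)
RESPONSE_LATENCY = Histogram(
    "voice_response_latency_seconds", "Per-stage response latency",
    ["agent", "environment", "stage"],
    buckets=(0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0),   # illustrative boundaries
)

start_http_server(9000)    # exposes /metrics for Prometheus or Alloy to scrape

# At call completion, every label value comes from a small known set.
VOICE_CALLS.labels(agent="billing-agent", environment="prod",
                   outcome="resolved", language="en-US").inc()
RESPONSE_LATENCY.labels(agent="billing-agent", environment="prod",
                        stage="llm").observe(0.84)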

Starter PromQL Queries

Use recording rules for expensive queries once the dashboard becomes important. These are starter expressions, not final production rules.

P95 response latency by stage

histogram_quantile(
  0.95,
  sum by (le, stage) (
    rate(voice_response_latency_seconds_bucket{environment="prod"}[5m])
  )
)

Prometheus documents the histogram_quantile() pattern for percentile estimates from histogram buckets. Use histograms for latency. Do not precompute only averages.

Low-confidence ASR rate

sum(rate(voice_low_confidence_turns_total{environment="prod"}[15m]))
/
sum(rate(voice_turns_total{environment="prod",speaker="user"}[15m]))
unless sum(rate(voice_turns_total{environment="prod",speaker="user"}[15m])) == 0

This catches the calls where the LLM may be fine but the input is degraded. Pair it with voice agent analytics metrics so ASR quality connects to containment, sentiment, and flow outcomes.

Tool failure rate by tool

sum by (tool_name) (
  rate(voice_tool_failures_total{environment="prod"}[15m])
)
/
sum by (tool_name) (
  rate(voice_tool_calls_total{environment="prod"}[15m])
)

This belongs on the same dashboard as latency. A slow tool and a failing tool often produce the same caller behavior: silence, repetition, and eventual escalation.

False containment guardrail

sum(rate(voice_repeat_contact_total{environment="prod"}[24h]))
/
sum(rate(voice_contained_calls_total{environment="prod"}[24h]))

Containment without repeat-contact guardrails becomes metric theater. If the agent "contains" the call and the user calls back tomorrow for the same issue, the first call did not really succeed.

What Should Go Into Loki

Loki is the right place for searchable voice analytics events: low-confidence turns, tool failures, escalation events, policy misses, redaction state changes, and QA assertion results.

Use labels sparingly:

Loki Label | Good? | Why
service_name | Yes | Standard correlation label
environment | Yes | Required for prod/staging filtering
agent_id | Yes, if bounded | Useful operational dimension
event_name | Yes | Lets teams filter event families
language | Yes, if normalized | Useful for language drift
canonical_call_id | No | Too high-cardinality for labels; keep in JSON body
transcript_text | No | Privacy risk and impossible to aggregate
phone_number | No | Private and high-cardinality

Example LogQL query:

{service_name="voice-agent", environment="prod", event_name="voice.turn.completed"}
| json
| quality_qa_score < 0.7
| evidence_redaction_state = "redacted"

That query finds low-scoring turns that are safe to show in a shared dashboard. The row should show canonical_call_id, trace_id, agent_id, intent, failure_reason, and qa_result_id, not the full transcript.

For a broader taxonomy of what belongs in call logs, use the call logging guide.

Grafana answers "what changed?" Traces answer "where did this call break?"

Use the trace hierarchy from the OpenTelemetry voice agents guide:

call.lifecycle
├── stt.transcription
│   └── stt.provider.deepgram
├── llm.inference
│   └── llm.tool_call.lookup_account
├── tts.synthesis
├── webhook.dispatch
└── evaluation.assertion_check
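In a Python runtime instrumented with the OpenTelemetry SDK, that hierarchy falls out of nested spans. A sketch only; the span names follow the tree above, and the attribute values are placeholders.

from opentelemetry import trace

tracer = trace.get_tracer("voice-agent")

with tracer.start_as_current_span(
    "call.lifecycle",
    attributes={"canonical_call_id": "call_01JZ7Q8P2F9M6K", "prompt_version": "checkout-v14"},
):
    with tracer.start_as_current_span("stt.transcription") as stt_span:
        stt_span.set_attribute("stt.provider", "deepgram")
        stt_span.set_attribute("stt.confidence", 0.72)          # placeholder value

    with tracer.start_as_current_span("llm.inference"):
        with tracer.start_as_current_span("llm.tool_call.lookup_account") as tool_span:
            tool_span.set_attribute("tool.timed_out", False)    # placeholder value

    with tracer.start_as_current_span("tts.synthesis"):
        pass    # synthesize the response audio here

    with tracer.start_as_current_span("webhook.dispatch"):
        pass    # notify downstream systems here

    with tracer.start_as_current_span("evaluation.assertion_check"):
        pass    # attach QA assertion results as span attributes or events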

Grafana AI Observability documentation describes an OpenTelemetry-based path for LLM agent generations, including traces and metrics. For voice agents, add the audio and telephony spans that LLM-only observability usually misses: end-of-utterance delay, STT confidence, provider fallback, TTS synthesis, tool latency, and interruption events.

The useful Grafana interaction is:

  1. Alert fires on p95 latency, low-confidence ASR rate, or QA assertion failures.
  2. Dashboard row shows the affected agent, language, provider, and intent family.
  3. Operator clicks into Loki event rows for recent examples.
  4. Event row links to trace by trace_id.
  5. Trace links to the QA replay or transcript review by qa_result_id or transcript_turn_id.

If step 5 opens raw audio or transcript in Grafana for every viewer, you probably crossed a privacy boundary.

Dashboard Layout Template

Use five rows. Resist the urge to make one giant wall of panels.

Row | Panels | Owner | Decision
Live health | Active sessions, calls/min, error rate, escalation rate | On-call engineer | Is production healthy right now?
Latency waterfall | STT p95, LLM TTFT p95, tool p95, TTS p95, end-to-end p95 | Engineering | Which stage is creating dead air?
Conversation quality | Containment, repeat contact, low-confidence ASR, interruption rate, QA pass rate | QA / product | Are calls working for users?
Event drilldown | Recent low-score turns, tool failures, policy misses, redaction warnings | QA / support | Which calls need replay?
Alert hygiene | Firing alerts, pending alerts, cardinality growth, missing trace IDs | Platform | Are alerts useful and affordable?

This is deliberately narrower than a full executive dashboard. For weekly stakeholder reporting, use the voice agent dashboard template. Grafana is better for operational loops: is something breaking, where, and what examples prove it?

Example dashboard row

VOICE AGENT ANALYTICS - PROD - LAST 6 HOURS

Live Health
  Active sessions | Calls/min | Error rate | Escalation rate

Latency Waterfall
  STT p95 | LLM TTFT p95 | Tool p95 | TTS p95 | End-to-end p95

Conversation Quality
  QA pass rate | Low-confidence turns | Interruptions/min | Repeat contact

Event Drilldown
  Recent failed QA turns | Tool failures | Policy misses | Redaction warnings

Alert Hygiene
  Firing alerts | Missing trace IDs | High-cardinality series | No-data panels

Alert Rules That Catch Real Voice Failures

Start with a small alert set. Too many voice dashboards fail because every metric gets a Slack alert.

Alert | Expression Shape | Window | Route To | Why
End-to-end latency p95 above baseline | histogram_quantile(0.95, ...) > threshold | 10m | Engineering | Catches dead air users feel
Low-confidence ASR spike | low-confidence turns / user turns > threshold | 15m | Engineering + QA | Catches audio/STT drift
QA pass rate drop | assertion failures / assertions > threshold | 30m | QA | Catches prompt or policy regression
Tool failure spike | tool failures / tool calls > threshold | 10m | Engineering | Catches backend dependency failures
Missing trace IDs | events without trace ID / events > threshold | 15m | Platform | Catches observability breakage
Redaction warning | unredacted events in shared stream > 0 | Immediate | Security / platform | Prevents private data leakage

Use duration windows. Voice traffic is bursty, and one bad call should create a drilldown row before it pages someone.

Prometheus alerting rules support a hold duration (the for field) and annotations that can include runbook links. Put the voice agent incident response runbook in those annotations so the alert tells the responder what to do next.

Privacy and Cardinality Guardrails

This is the part teams usually postpone. Do not.

Bad Pattern | Why It Breaks | Safer Pattern
call_id as a Prometheus label | Creates one time series per call | Store in Loki JSON body and trace attributes
Raw transcript in Loki by default | Privacy risk and noisy search | Store transcript_turn_id and redacted summary
Phone number as any label | Private and high-cardinality | Store hashed customer reference in restricted system
Prompt text in metrics | High-cardinality and sensitive | Store prompt_version
QA comments in Grafana table | Private reviewer notes leak broadly | Link to QA result in role-gated system
No redaction state field | Reviewers cannot tell safe vs restricted events | Add redaction_state to every event

The mental model is simple: Grafana should point to evidence; it should not become the evidence vault.

For healthcare, finance, or any regulated workflow, pair this with PII redaction for voice agents. Redaction must happen before events enter broad analytics streams, not after somebody notices a transcript in a dashboard screenshot.
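One lightweight enforcement point, sketched below in Python, is a gate in the event shipper that only lets explicitly redacted events reach the shared stream. The field names match the envelope above; the voice_unredacted_events_total metric name and the sink callables are assumptions, not an established convention.

from prometheus_client import Counter

# Assumed metric name: counts events blocked from the shared stream, feeding the redaction-warning alert.
UNREDACTED_EVENTS = Counter(
    "voice_unredacted_events_total", "Events blocked from shared analytics streams",
    ["agent", "environment"],
)


def safe_for_shared_streams(event: dict) -> bool:
    """Only events explicitly marked redacted may reach Loki and shared dashboards."""
    return event.get("evidence", {}).get("redaction_state") == "redacted"


def ship_event(event: dict, emit_shared, emit_restricted) -> None:
    """Route one event; emit_shared and emit_restricted are your own sinks (Loki, restricted store)."""
    if safe_for_shared_streams(event):
        emit_shared(event)
    else:
        emit_restricted(event)      # keep the evidence, but only in the restricted system
        UNREDACTED_EVENTS.labels(
            agent=event.get("agent_id", "unknown"),
            environment=event.get("environment", "unknown"),
        ).inc()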

How Hamming Fits With Grafana

Grafana is strong at operational monitoring. Hamming is strong at voice-agent QA: test cases, assertions, replayable evidence, prompt/version comparisons, and production call analysis.

Use both where they are strongest:

Job | Better System of Record
Live latency and error alerting | Grafana / Prometheus
Cross-service trace timing | OpenTelemetry + Grafana / Tempo
Searchable operational events | Loki or existing log stack
Raw transcript replay | Hamming or restricted transcript store
QA assertion evidence | Hamming
Regression test generation | Hamming
Executive trend reporting | Hamming dashboards plus Grafana summaries

We found that the cleanest pattern is to let Grafana show operational symptoms and link out to Hamming for examples. If QA pass rate drops after a prompt release, Grafana should show the drop, affected agent, language, and prompt version. Hamming should show the failed calls, expected behavior, transcript/audio evidence, and the regression tests to add next.

That boundary keeps the dashboard useful without making it carry private content, annotation workflows, and replay UX it was not designed to own.

Implementation Checklist

Use this checklist before you ship a Grafana voice analytics dashboard:

  • Define one canonical_call_id and pass it through IVR, voice runtime, QA, and CRM events.
  • Add trace_id to every event that should link from Grafana to traces.
  • Decide which fields become Prometheus labels and reject high-cardinality labels.
  • Emit latency histograms for STT, LLM, tools, TTS, and end-to-end response timing.
  • Send turn-level and QA events to Loki as JSON logs with redaction_state.
  • Store raw transcript and audio in restricted systems; expose only pointers in Grafana.
  • Build the five dashboard rows: live health, latency waterfall, conversation quality, event drilldown, alert hygiene.
  • Add alerts for latency, ASR confidence, QA pass rate, tool failures, missing trace IDs, and redaction warnings.
  • Link every alert to a runbook and at least one drilldown panel.
  • Run a synthetic bad-call test to verify that metric, log, trace, and replay links line up.

The synthetic test matters. Create one controlled call where the tool times out, one where ASR confidence is low, and one where redaction is required. If the dashboard cannot route you from the alert to the right trace and replay pointer, it is not ready for production.
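A minimal shape for that check, sketched in Python with the OpenTelemetry tracer from earlier; the injected failure, ID format, and manual verification steps are illustrative, and in practice you would run this against staging before trusting the links.

import uuid

from opentelemetry import trace

tracer = trace.get_tracer("voice-agent-synthetic")


def run_synthetic_tool_timeout_call() -> dict:
    """Emit one controlled bad call and return the IDs the dashboard links should resolve."""
    call_id = f"synthetic_{uuid.uuid4().hex[:12]}"
    with tracer.start_as_current_span(
        "call.lifecycle", attributes={"canonical_call_id": call_id}
    ) as call_span:
        with tracer.start_as_current_span("llm.tool_call.lookup_account") as tool_span:
            tool_span.set_attribute("tool.timed_out", True)     # the injected failure
        trace_id = format(call_span.get_span_context().trace_id, "032x")
    return {"canonical_call_id": call_id, "trace_id": trace_id}


ids = run_synthetic_tool_timeout_call()
# Then confirm by hand: the alert fires, the Loki row shows this canonical_call_id,
# the row links to this trace_id, and the trace links to a replay or QA pointer.
print(ids)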

Frequently Asked Questions

Can Grafana handle voice agent analytics?

Yes. According to Hamming's voice analytics dashboard model, Grafana works well when stable metrics go to Prometheus or Mimir, searchable call events go to Loki, and STT, LLM, tool, and TTS timings go to OpenTelemetry traces. Grafana should point to replay and QA evidence rather than storing raw transcripts or audio.

Which metrics should a voice agent Grafana dashboard start with?

Hamming recommends starting with active sessions, calls per minute, error rate, escalation rate, stage-level latency, low-confidence ASR turns, interruption rate, QA pass rate, tool failure rate, and repeat contact. These 10 metrics cover live health, latency, conversation quality, and regression detection without turning the dashboard into a transcript review tool.

Should raw transcripts be stored in Grafana or Loki?

Usually no. Hamming's privacy guardrail is to store raw transcript text in a restricted transcript or QA system, then expose transcript turn IDs, redacted summaries, trace IDs, and replay pointers in Grafana. That keeps shared dashboards useful without spreading PII, PHI, PCI, or call-specific identifiers unnecessarily.

How do you keep Prometheus label cardinality under control?

Keep Prometheus labels bounded to fields such as agent, environment, region, language, stage, provider family, outcome, and assertion type. Hamming recommends never using call IDs, user IDs, phone numbers, transcript text, account IDs, or free-form intent strings as metric labels because each unique value creates a new time series.

Where should each voice telemetry signal go?

Use Hamming's routing model: counters, current-value metrics, and histograms go to Prometheus or Mimir; searchable call events go to Loki; stage timing goes to OpenTelemetry traces; and transcript/audio/QA evidence stays in the system of record as pointers. This gives Grafana enough context to alert and drill down without making it the raw call evidence store.

How do Grafana and Hamming fit together?

Use Grafana for operational monitoring and alerting, then link to Hamming for replayable QA evidence, assertions, regression tests, prompt comparisons, and production call analysis. The practical pattern is that Grafana shows the symptom, affected agent, prompt version, and trace ID; Hamming shows the failed calls and the tests to add next.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”