A voice agent analytics Grafana dashboard works best when it treats voice calls as three different telemetry shapes, not one stream of "AI events." Stable counts and durations belong in Prometheus. Searchable call events belong in Loki. Cross-service timing belongs in traces through Tempo or your tracing backend.
If you skip that split, Grafana gets noisy fast. The first dashboard looks impressive, then somebody adds call_id, transcript text, intent name, user ID, prompt version, and customer tier as metric labels. Two weeks later, queries slow down, alert rules flap, and the team still cannot answer which failed call needs replay.
This guide is for teams that already use Grafana, Prometheus, Loki, Tempo, Alloy, or OpenTelemetry and want voice analytics in that stack without turning the observability system into a transcript warehouse.
Voice agent analytics in Grafana is the practice of routing voice-agent metrics, call events, traces, and QA pointers into Grafana-compatible backends so engineering and QA teams can monitor production quality, drill into failures, and alert on regressions without exposing raw conversation data unnecessarily.
Quick filter: If you run fewer than 100 voice agent calls per week, start with voice agent analytics metrics and manual call review. Grafana pays off when you have enough volume that trends, alerts, and per-intent breakdowns matter.
TL;DR: Build the dashboard with four rules:
- Metrics: Send counters, current-value metrics, and histograms to Prometheus or Mimir.
- Events: Send searchable JSON call events to Loki.
- Traces: Send STT, LLM, tool, TTS, and webhook spans through OpenTelemetry.
- Evidence: Store replay URLs, transcript IDs, QA score IDs, and redaction state as pointers, not raw payloads.
Grafana should show where quality is degrading. Your QA system should still own call replay, transcript annotation, and evaluation evidence.
Methodology Note: This dashboard template is based on Hamming's analysis of 4M+ production voice agent monitoring and debugging workflows across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions. It also uses public Grafana, OpenTelemetry, Prometheus, and LiveKit documentation to ground the telemetry pipeline and query examples.
Last Updated: May 2026
Related Guides:
- Voice Agent Analytics Metrics Guide - formulas and thresholds before you wire dashboards
- Voice Agent Dashboard Template - executive and QA dashboard layout
- OpenTelemetry for AI Voice Agents - span hierarchy and trace propagation
- LiveKit Agent Monitoring with Prometheus and Grafana - LiveKit-specific implementation guide
- IVR and Voice Agent Log Correlation - unified call context and provider ID mapping
- Call Logging for AI Voice Agents - log taxonomy and compliance requirements
- PII Redaction for Voice Agents - privacy architecture for transcripts and audio
- Voice Agent Incident Response Runbook - escalation workflow once alerts fire
The Routing Rule: Metric, Event, Trace, or Evidence Pointer
Most bad Grafana voice dashboards fail at the ingestion contract. They emit everything as a metric because Prometheus is already there.
That is the wrong default.
Use this routing table instead:
| Voice Signal | Send To | Example | Why |
|---|---|---|---|
| Call volume | Prometheus / Mimir | voice_calls_total{agent="billing", outcome="resolved"} | Stable counter, easy to alert on |
| Latency percentiles | Prometheus / Mimir | voice_response_latency_seconds_bucket | Histograms support p50/p95/p99 queries |
| Active sessions | Prometheus / Mimir | voice_active_sessions | Low-cardinality current-value metric |
| Stage timings | OpenTelemetry traces / Tempo | stt.transcription, llm.inference, tts.synthesis spans | Debugs cascades across services |
| Transcript turn event | Loki | JSON log with turn_id, redaction_state, intent, confidence | Searchable event, not a metric label |
| Low-confidence ASR turn | Loki + metric counter | Log full event; increment voice_low_confidence_turns_total | Supports both search and alerting |
| QA assertion result | Metrics + evidence pointer | voice_assertion_failures_total; qa_result_id in logs | Aggregate in Grafana, replay in QA tool |
| Raw transcript text | Restricted transcript store | transcript_id pointer only in Grafana | Avoids privacy and label explosion |
| Audio recording | Restricted recording store | recording_id or replay URL pointer | Grafana should not store raw audio |
| Prompt version | Trace attribute and event field | prompt_version="checkout-v14" | Useful for filtering without storing prompt text |
Grafana's OpenTelemetry docs describe the same three-signal shape at the collector level: OTLP comes in, then metrics, logs, and traces fan out to Prometheus/Mimir, Loki, and Tempo. The voice-specific part is deciding which call facts belong in which signal.
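If you want a concrete starting point for that fan-out, a minimal OpenTelemetry Collector sketch looks roughly like this. The exporter names are real collector components, but the endpoints, ports, and service names are placeholders for your own Mimir, Loki, and Tempo deployments (newer Loki versions accept OTLP logs directly at an /otlp endpoint).

```yaml
# Minimal collector sketch: one OTLP receiver, three signal pipelines.
# Endpoints and hostnames are placeholders, not a recommended deployment.
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}

processors:
  batch: {}

exporters:
  prometheusremotewrite:          # metrics -> Prometheus / Mimir
    endpoint: http://mimir:9009/api/v1/push
  otlphttp/loki:                  # logs -> Loki's native OTLP ingest
    endpoint: http://loki:3100/otlp
  otlp/tempo:                     # traces -> Tempo
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```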
Copy This Voice Analytics Event Envelope
Start with one event envelope. Do not let every service invent its own shape.
{
  "event_name": "voice.turn.completed",
  "event_version": "2026-05-15",
  "timestamp": "2026-05-15T17:42:31.122Z",
  "canonical_call_id": "call_01JZ7Q8P2F9M6K",
  "trace_id": "9f7c2d4f0f3a4c1e8e4d2a5b7c6f9012",
  "span_id": "3d4f0f3a4c1e8e4d",
  "agent_id": "billing-agent",
  "agent_version": "checkout-v14",
  "environment": "prod",
  "language": "en-US",
  "channel": "pstn",
  "provider": {
    "transport": "twilio",
    "stt": "deepgram",
    "llm": "openai",
    "tts": "elevenlabs"
  },
  "turn": {
    "turn_index": 7,
    "speaker": "user",
    "intent": "billing_dispute",
    "asr_confidence": 0.72,
    "interruption_count": 1,
    "silence_ms_before_response": 640
  },
  "quality": {
    "task_success": false,
    "policy_passed": true,
    "qa_score": 0.61,
    "failure_reason": "tool_timeout"
  },
  "evidence": {
    "transcript_turn_id": "turn_07",
    "recording_segment_id": "rec_seg_07",
    "qa_result_id": "qa_8ac2",
    "redaction_state": "redacted"
  }
}
The important fields are canonical_call_id, trace_id, and redaction_state.
canonical_call_id lets you join voice analytics to IVR paths, telephony provider records, CRM outcomes, and QA results. If you have IVR transfers or multiple provider IDs, use the IVR-to-agent log correlation runbook before building dashboards.
trace_id lets Grafana jump from a metric spike to the exact trace. Grafana's signal correlation documentation calls out shared labels, resource attributes, and trace context as the foundation for moving between metrics, logs, and traces.
redaction_state tells reviewers whether the event is safe for broad dashboards. If it says unredacted, do not index transcript text in Loki or show it in a shared Grafana table. Use a pointer instead.
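As a concrete sketch, a Python emitter for this envelope might pull trace_id and span_id from the active OpenTelemetry span so the Grafana links work later. The emit_turn_event helper and the log shipping path are assumptions, not a prescribed client:

```python
import json
import logging
from opentelemetry import trace

# Structured logger; assume your logging pipeline ships JSON lines to Loki
# (via Alloy, promtail, or an OTLP log exporter).
logger = logging.getLogger("voice-agent.events")


def emit_turn_event(canonical_call_id: str, turn: dict, quality: dict, evidence: dict) -> None:
    """Hypothetical helper: build the shared envelope and log it as one JSON line."""
    span_context = trace.get_current_span().get_span_context()
    event = {
        "event_name": "voice.turn.completed",
        "event_version": "2026-05-15",
        "canonical_call_id": canonical_call_id,
        # Format the OTel IDs as hex so they match what Tempo stores.
        "trace_id": format(span_context.trace_id, "032x"),
        "span_id": format(span_context.span_id, "016x"),
        "agent_id": "billing-agent",
        "environment": "prod",
        "turn": turn,
        "quality": quality,
        # Pointers only: transcript_turn_id, qa_result_id, redaction_state.
        "evidence": evidence,
    }
    logger.info(json.dumps(event))
```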
What Metrics Should Go Into Prometheus
Prometheus should get metrics whose aggregations make sense across many calls. That usually means counters, current-value metrics, and histograms with low-cardinality labels.
| Metric | Type | Labels | Use |
|---|---|---|---|
| voice_calls_total | Counter | agent, environment, outcome, language | Volume and outcome trend |
| voice_active_sessions | Current-value metric | agent, environment, region | Live capacity and incident detection |
| voice_response_latency_seconds | Histogram | agent, environment, stage | p50/p95/p99 latency by pipeline stage |
| voice_turns_total | Counter | agent, environment, speaker | Loop and verbosity detection |
| voice_interruptions_total | Counter | agent, environment, intent_family | Barge-in and overtalk trend |
| voice_low_confidence_turns_total | Counter | agent, environment, stt_provider, language | ASR drift and audio quality |
| voice_assertion_failures_total | Counter | agent, environment, assertion_type | QA regression alerting |
| voice_tool_failures_total | Counter | agent, environment, tool_name, failure_type | Backend integration failures |
| voice_escalations_total | Counter | agent, environment, escalation_reason | Human handoff and containment |
| voice_cost_usd_total | Counter | agent, environment, provider_family | Cost monitoring |
Do not add call_id, user_id, transcript text, account ID, phone number, prompt text, or raw intent text as metric labels.
Grafana's high-cardinality alerting docs explain why: each unique label set creates a separate time series. In voice analytics, call_id turns every call into a new time series. That is expensive and usually useless.
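If your agent runtime is Python, a minimal prometheus_client sketch of those series looks like this. Metric names match the table; label values and histogram buckets are illustrative, and the client library appends the _total suffix to counter names:

```python
from prometheus_client import Counter, Gauge, Histogram

# Exposed as voice_calls_total; only bounded label values belong here.
VOICE_CALLS = Counter(
    "voice_calls", "Completed voice calls",
    ["agent", "environment", "outcome", "language"],
)

# The "current-value metric" from the table is a Prometheus gauge.
ACTIVE_SESSIONS = Gauge(
    "voice_active_sessions", "Live voice sessions",
    ["agent", "environment", "region"],
)

# Buckets are a starting guess tuned for sub-second conversational latency.
RESPONSE_LATENCY = Histogram(
    "voice_response_latency_seconds", "Per-stage response latency",
    ["agent", "environment", "stage"],
    buckets=(0.1, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0, 5.0),
)

# Usage: never put call_id, user_id, or transcript text in these label values.
VOICE_CALLS.labels(agent="billing-agent", environment="prod",
                   outcome="resolved", language="en-US").inc()
RESPONSE_LATENCY.labels(agent="billing-agent", environment="prod",
                        stage="llm").observe(0.82)
```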
Starter PromQL Queries
Use recording rules for expensive queries once the dashboard becomes important. These are starter expressions, not final production rules.
P95 response latency by stage
histogram_quantile(
0.95,
sum by (le, stage) (
rate(voice_response_latency_seconds_bucket{environment="prod"}[5m])
)
)
Prometheus documents the histogram_quantile() pattern for percentile estimates from histogram buckets. Use histograms for latency. Do not precompute only averages.
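Once that p95 panel matters, a recording rule sketch for it might look like the following. The rule name follows the level:metric:operation convention but is otherwise an assumption:

```yaml
groups:
  - name: voice_agent_latency
    interval: 1m
    rules:
      - record: stage:voice_response_latency_seconds:p95
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, stage) (
              rate(voice_response_latency_seconds_bucket{environment="prod"}[5m])
            )
          )
```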
Low-confidence ASR rate
sum(rate(voice_low_confidence_turns_total{environment="prod"}[15m]))
/
sum(rate(voice_turns_total{environment="prod",speaker="user"}[15m]))
unless sum(rate(voice_turns_total{environment="prod",speaker="user"}[15m])) == 0
This catches the calls where the LLM may be fine but the input is degraded. Pair it with voice agent analytics metrics so ASR quality connects to containment, sentiment, and flow outcomes.
Tool failure rate by tool
sum by (tool_name) (
rate(voice_tool_failures_total{environment="prod"}[15m])
)
/
sum by (tool_name) (
rate(voice_tool_calls_total{environment="prod"}[15m])
)
This belongs on the same dashboard as latency. A slow tool and a failing tool often produce the same caller behavior: silence, repetition, and eventual escalation.
False containment guardrail
sum(rate(voice_repeat_contact_total{environment="prod"}[24h]))
/
sum(rate(voice_contained_calls_total{environment="prod"}[24h]))
Containment without repeat-contact guardrails becomes metric theater. If the agent "contains" the call and the user calls back tomorrow for the same issue, the first call did not really succeed.
What Should Go Into Loki
Loki is the right place for searchable voice analytics events: low-confidence turns, tool failures, escalation events, policy misses, redaction state changes, and QA assertion results.
Use labels sparingly:
| Loki Label | Good? | Why |
|---|---|---|
| service_name | Yes | Standard correlation label |
| environment | Yes | Required for prod/staging filtering |
| agent_id | Yes, if bounded | Useful operational dimension |
| event_name | Yes | Lets teams filter event families |
| language | Yes, if normalized | Useful for language drift |
| canonical_call_id | No | Too high-cardinality for labels; keep in JSON body |
| transcript_text | No | Privacy risk and impossible to aggregate |
| phone_number | No | Private and high-cardinality |
Example LogQL query:
{service_name="voice-agent", environment="prod", event_name="voice.turn.completed"}
| json
| quality_qa_score < 0.7
| evidence_redaction_state = "redacted"
That query finds low-scoring turns that are safe to show in a shared dashboard. The row should show canonical_call_id, trace_id, agent_id, intent, failure_reason, and qa_result_id, not the full transcript.
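The same event stream can also drive a panel series without new Prometheus metrics. A sketch of a LogQL metric query that counts low-scoring, redacted turns per agent over five-minute windows, assuming agent_id is available as a label:

```
sum by (agent_id) (
  count_over_time(
    {service_name="voice-agent", environment="prod", event_name="voice.turn.completed"}
      | json
      | quality_qa_score < 0.7
      | evidence_redaction_state = "redacted" [5m]
  )
)
```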
For a broader taxonomy of what belongs in call logs, use the call logging guide.
Traces: The Missing Link Between Metrics and Replay
Grafana answers "what changed?" Traces answer "where did this call break?"
Use the trace hierarchy from the OpenTelemetry voice agents guide:
call.lifecycle
├── stt.transcription
│ └── stt.provider.deepgram
├── llm.inference
│ └── llm.tool_call.lookup_account
├── tts.synthesis
├── webhook.dispatch
└── evaluation.assertion_check
Grafana AI Observability documentation describes an OpenTelemetry-based path for LLM agent generations, including traces and metrics. For voice agents, add the audio and telephony spans that LLM-only observability usually misses: end-of-utterance delay, STT confidence, provider fallback, TTS synthesis, tool latency, and interruption events.
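A minimal Python sketch of that span hierarchy, assuming the OpenTelemetry SDK is configured elsewhere and using hypothetical run_stt, run_llm, run_tts, and speak helpers:

```python
from opentelemetry import trace

tracer = trace.get_tracer("voice-agent")


def handle_call(call_audio_stream, canonical_call_id: str, prompt_version: str) -> None:
    """Simplified single pass through the pipeline; a real agent loops over many turns."""
    with tracer.start_as_current_span("call.lifecycle") as call_span:
        call_span.set_attribute("voice.canonical_call_id", canonical_call_id)
        call_span.set_attribute("voice.prompt_version", prompt_version)

        for audio_chunk in call_audio_stream:
            with tracer.start_as_current_span("stt.transcription") as stt_span:
                transcript, confidence = run_stt(audio_chunk)   # hypothetical STT helper
                stt_span.set_attribute("voice.asr_confidence", confidence)

            with tracer.start_as_current_span("llm.inference"):
                reply = run_llm(transcript, prompt_version)     # hypothetical LLM helper

            with tracer.start_as_current_span("tts.synthesis"):
                speak(run_tts(reply))                            # hypothetical TTS + playback
```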
The useful Grafana interaction is:
1. Alert fires on p95 latency, low-confidence ASR rate, or QA assertion failures.
2. Dashboard row shows the affected agent, language, provider, and intent family.
3. Operator clicks into Loki event rows for recent examples.
4. Event row links to the trace by trace_id.
5. Trace links to the QA replay or transcript review by qa_result_id or transcript_turn_id.
If step 5 opens raw audio or transcript in Grafana for every viewer, you probably crossed a privacy boundary.
Dashboard Layout Template
Use five rows. Resist the urge to make one giant wall of panels.
| Row | Panels | Owner | Decision |
|---|---|---|---|
| Live health | Active sessions, calls/min, error rate, escalation rate | On-call engineer | Is production healthy right now? |
| Latency waterfall | STT p95, LLM TTFT p95, tool p95, TTS p95, end-to-end p95 | Engineering | Which stage is creating dead air? |
| Conversation quality | containment, repeat contact, low-confidence ASR, interruption rate, QA pass rate | QA / product | Are calls working for users? |
| Event drilldown | recent low-score turns, tool failures, policy misses, redaction warnings | QA / support | Which calls need replay? |
| Alert hygiene | firing alerts, pending alerts, cardinality growth, missing trace IDs | Platform | Are alerts useful and affordable? |
This is deliberately narrower than a full executive dashboard. For weekly stakeholder reporting, use the voice agent dashboard template. Grafana is better for operational loops: is something breaking, where, and what examples prove it?
Example dashboard row
VOICE AGENT ANALYTICS - PROD - LAST 6 HOURS
Live Health
Active sessions | Calls/min | Error rate | Escalation rate
Latency Waterfall
STT p95 | LLM TTFT p95 | Tool p95 | TTS p95 | End-to-end p95
Conversation Quality
QA pass rate | Low-confidence turns | Interruptions/min | Repeat contact
Event Drilldown
Recent failed QA turns | Tool failures | Policy misses | Redaction warnings
Alert Hygiene
Firing alerts | Missing trace IDs | High-cardinality series | No-data panels
Alert Rules That Catch Real Voice Failures
Start with a small alert set. Too many voice dashboards fail because every metric gets a Slack alert.
| Alert | Expression Shape | Window | Route To | Why |
|---|---|---|---|---|
| End-to-end latency p95 above baseline | histogram_quantile(0.95, ...) > threshold | 10m | Engineering | Catches dead air users feel |
| Low-confidence ASR spike | low-confidence turns / user turns > threshold | 15m | Engineering + QA | Catches audio/STT drift |
| QA pass rate drop | assertion failures / assertions > threshold | 30m | QA | Catches prompt or policy regression |
| Tool failure spike | tool failures / tool calls > threshold | 10m | Engineering | Catches backend dependency failures |
| Missing trace IDs | events without trace ID / events > threshold | 15m | Platform | Catches observability breakage |
| Redaction warning | unredacted events in shared stream > 0 | Immediate | Security / platform | Prevents private data leakage |
Use duration windows. Voice traffic is bursty, and one bad call should create a drilldown row before it pages someone.
Prometheus alerting rules support a for duration on each rule and annotations that can include runbook links. Put the voice agent incident response runbook URL in those annotations so the alert tells the responder what to do next.
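As a sketch, one such rule for the low-confidence ASR ratio might look like this. The threshold, group name, and runbook URL are placeholders:

```yaml
groups:
  - name: voice_agent_quality
    rules:
      - alert: VoiceLowConfidenceAsrSpike
        expr: |
          sum(rate(voice_low_confidence_turns_total{environment="prod"}[15m]))
            /
          sum(rate(voice_turns_total{environment="prod",speaker="user"}[15m]))
          > 0.15
        for: 15m
        labels:
          severity: warning
          team: voice-engineering
        annotations:
          summary: "Low-confidence ASR rate above 15% for 15 minutes"
          runbook_url: "https://wiki.example.com/runbooks/voice-agent-incident-response"
```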
Privacy and Cardinality Guardrails
This is the part teams usually postpone. Do not.
| Bad Pattern | Why It Breaks | Safer Pattern |
|---|---|---|
| call_id as a Prometheus label | Creates one time series per call | Store in Loki JSON body and trace attributes |
| Raw transcript in Loki by default | Privacy risk and noisy search | Store transcript_turn_id and redacted summary |
| Phone number as any label | Private and high-cardinality | Store hashed customer reference in restricted system |
| Prompt text in metrics | High-cardinality and sensitive | Store prompt_version |
| QA comments in Grafana table | Private reviewer notes leak broadly | Link to QA result in role-gated system |
| No redaction state field | Reviewers cannot tell safe vs restricted events | Add redaction_state to every event |
The mental model is simple: Grafana should point to evidence; it should not become the evidence vault.
For healthcare, finance, or any regulated workflow, pair this with PII redaction for voice agents. Redaction must happen before events enter broad analytics streams, not after somebody notices a transcript in a dashboard screenshot.
How Hamming Fits With Grafana
Grafana is strong at operational monitoring. Hamming is strong at voice-agent QA: test cases, assertions, replayable evidence, prompt/version comparisons, and production call analysis.
Use both where they are strongest:
| Job | Better System of Record |
|---|---|
| Live latency and error alerting | Grafana / Prometheus |
| Cross-service trace timing | OpenTelemetry + Grafana / Tempo |
| Searchable operational events | Loki or existing log stack |
| Raw transcript replay | Hamming or restricted transcript store |
| QA assertion evidence | Hamming |
| Regression test generation | Hamming |
| Executive trend reporting | Hamming dashboards plus Grafana summaries |
We found that the cleanest pattern is to let Grafana show operational symptoms and link out to Hamming for examples. If QA pass rate drops after a prompt release, Grafana should show the drop, affected agent, language, and prompt version. Hamming should show the failed calls, expected behavior, transcript/audio evidence, and the regression tests to add next.
That boundary keeps the dashboard useful without making it carry private content, annotation workflows, and replay UX it was not designed to own.
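One way to wire that link is a Grafana data link on the event drilldown table panel, templated from the qa_result_id field in the Loki rows. The QA URL shape below is a placeholder, not a real Hamming endpoint:

```json
{
  "fieldConfig": {
    "defaults": {
      "links": [
        {
          "title": "Open QA result",
          "url": "https://qa.example.com/results/${__data.fields.qa_result_id}",
          "targetBlank": true
        }
      ]
    }
  }
}
```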
Implementation Checklist
Use this checklist before you ship a Grafana voice analytics dashboard:
- Define one canonical_call_id and pass it through IVR, voice runtime, QA, and CRM events.
- Add trace_id to every event that should link from Grafana to traces.
- Decide which fields become Prometheus labels and reject high-cardinality labels.
- Emit latency histograms for STT, LLM, tools, TTS, and end-to-end response timing.
- Send turn-level and QA events to Loki as JSON logs with redaction_state.
- Store raw transcript and audio in restricted systems; expose only pointers in Grafana.
- Build the five dashboard rows: live health, latency waterfall, conversation quality, event drilldown, alert hygiene.
- Add alerts for latency, ASR confidence, QA pass rate, tool failures, missing trace IDs, and redaction warnings.
- Link every alert to a runbook and at least one drilldown panel.
- Run a synthetic bad-call test to verify that metric, log, trace, and replay links line up.
The synthetic test matters. Create one controlled call where the tool times out, one where ASR confidence is low, and one where redaction is required. If the dashboard cannot route you from the alert to the right trace and replay pointer, it is not ready for production.
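A verification sketch for that test might query the three backends directly once the synthetic calls have run. The endpoints and the synthetic call ID below are assumptions; adjust them to your own deployment:

```python
import json
import requests

CALL_ID = "call_synthetic_tool_timeout"   # injected by your synthetic bad-call script

# 1. Loki: the turn event for the synthetic call exists and carries a trace_id.
loki_resp = requests.get(
    "http://loki:3100/loki/api/v1/query_range",
    params={"query": '{service_name="voice-agent", environment="staging"} | json'
                     f' | canonical_call_id = "{CALL_ID}"'},
).json()
streams = loki_resp["data"]["result"]
assert streams, "synthetic call never reached Loki"

# 2. Prometheus: the tool-failure counter moved during the synthetic scenario.
prom_resp = requests.get(
    "http://prometheus:9090/api/v1/query",
    params={"query": 'increase(voice_tool_failures_total{environment="staging"}[30m])'},
).json()
assert prom_resp["data"]["result"], "tool failure metric never incremented"

# 3. Tempo: the trace_id from the Loki event resolves to a stored trace.
event = json.loads(streams[0]["values"][0][1])
tempo_resp = requests.get(f"http://tempo:3200/api/traces/{event['trace_id']}")
assert tempo_resp.status_code == 200, "trace_id did not resolve in Tempo"
```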

