Last month, I watched a customer debug a voice agent failure with Langfuse open in one tab and Datadog in another. The LLM traces looked clean. The infrastructure metrics showed no anomalies. Yet more than half of their transcript merges were failing silently. The fix? OpenTelemetry, instrumented for voice agents, not just LLM calls.
The problem wasn't in the LLM. It wasn't in the infrastructure. It was in the space between them: a language routing decision in one service triggered an STT provider switch in another, which produced empty transcripts that a third service evaluated as "no data available." Three services. Two programming languages. One trace ID that connected them all, once we instrumented it properly.
I used to think you could bolt general LLM observability tools like Langfuse or Arize onto voice agents and call it done. After analyzing 4M+ voice agent calls at Hamming, I changed my mind. Voice agents need their own span hierarchy, their own attributes, and their own debugging playbooks. This is the implementation guide I wish existed when we started.
TL;DR: Instrument voice agents using Hamming's 3-Layer OTel Instrumentation Model:
- Layer 1: Span Hierarchy — Design parent-child spans for call.lifecycle → stt → llm → tools → tts → webhooks (not flat LLM-only traces)
- Layer 2: Voice-Specific Attributes — Attach stt.provider, llm.ttft_ms, tts.synthesis_ms, and tool.execution_ms to every span (generic HTTP attributes won't help you debug voice failures)
- Layer 3: Unified Export — Route all traces to a single backend that correlates with test results and production calls (the "Context Switching Tax" of 3-4 separate dashboards kills debugging speed)
One trace ID across TypeScript, Temporal, and Python beats five dashboards.
Methodology Note: The span hierarchies, attributes, and debugging playbooks in this guide are derived from Hamming's analysis of 4M+ production voice agent calls across 10K+ voice agents (2024-2026). Error cascade patterns are drawn from real production incidents across those deployments. All examples are anonymized. Last Updated: February 2026.
Related Guides:
- Voice Agent Observability: End-to-End Tracing — Hamming's 5-Layer Observability Stack (conceptual model)
- Why Voice Agent Teams Need Unified Observability — Why unified beats fragmented
- Debugging Voice Agents: Logs, Missed Intents & Error Dashboards — Turn-level debugging
- Voice Agent Troubleshooting Guide — Common failure patterns and fixes
Who This Is For (And Who Should Skip It)
If you're running fewer than 100 voice agent calls per week, structured logging is probably enough. Bookmark this for when you scale.
If you're on a managed platform like Retell or Vapi with built-in dashboards, check whether their traces span all layers you care about. Most don't include STT provider details or tool execution timing. If those gaps don't hurt you yet, this guide can wait.
This guide is for teams that:
- Run 1,000+ voice agent calls per week across custom infrastructure
- Use separate services for ASR, LLM, TTS, and tool execution
- Have hit a debugging wall where logs alone can't explain why calls fail
- Want to instrument with OpenTelemetry rather than lock into a proprietary tracing vendor
If you're already spending more than 30 minutes per incident correlating timestamps across Langfuse, Datadog, and your application logs, the approach below will save you hours.
Why Standard LLM Observability Tools Miss Voice Agent Failures
Here's the trap: you set up Langfuse or Arize, trace your LLM calls, and assume you have observability. The LLM dashboard looks green. Prompt latency is normal. Token counts are within range.
Meanwhile, your users hear silence.
We call this the Dashboard Blindspot: each observability tool sees its own slice of the voice pipeline, but no single view shows the cascade.
Here's what happens in a real voice call. Audio arrives from a phone network, gets routed to an ASR provider, flows through language detection logic, hits the LLM for inference, triggers tool calls via webhooks, generates TTS audio, and streams back to the caller. That chain crosses 3+ service boundaries and typically spans multiple programming languages (TypeScript on the web/queue side, Python on the voice worker side, often with Go-based Temporal orchestration in between).
Langfuse traces the LLM hop. Datadog monitors the infrastructure. Your application logs capture individual events. But none of them show you that a language routing decision in Service A triggered an STT provider switch in Service B that produced empty transcripts consumed by Service C. In our analysis of production incidents, the majority of voice agent failures that present as "LLM errors" actually originate in a different pipeline stage. Across the four cascade patterns we've documented in detail, every one involved the LLM performing fine while the input it received was garbage.
Standard LLM observability tools weren't built for this. Langfuse has added partial voice tracing through Pipecat and LiveKit integrations, which is a step forward, but these capture individual STT/TTS spans without the cross-service cascade visibility (provider fallback chains, language routing decisions, database consistency) that voice debugging requires. That's not a knock on Langfuse or Arize. They're excellent at what they do. But bolting them onto a voice agent and expecting full pipeline visibility is like monitoring your database and assuming you're observing your web app. (For more on structuring call logs and analytics, see our taxonomy guide.)
The Voice Agent Span Hierarchy: What to Trace
A voice agent span hierarchy mirrors the audio pipeline, not the HTTP request path. The root span is call.lifecycle, with child spans for each processing stage:
```
call.lifecycle (root, 45,000ms)
├── stt.transcription (1,200ms)
│   ├── stt.provider.deepgram (timeout, 800ms)
│   └── stt.provider.fallback.azure (400ms)
├── llm.inference (2,100ms)
│   ├── llm.tool_call.check_inventory (350ms)
│   └── llm.tool_call.create_order (890ms)
├── tts.synthesis (380ms)
├── webhook.dispatch (120ms)
│   └── webhook.callback.result_update (95ms)
└── evaluation.assertion_check (450ms)
```
Notice the nesting. When Deepgram times out, the system falls back to Azure. That fallback is a child span of stt.transcription, so you see it in context. When the LLM calls two tools, each tool call is a separate child span with its own timing. When the assertion evaluation runs after the call, it's traceable back to the same root.
This hierarchy is different from what Langfuse gives you. Langfuse would show the llm.inference span and its tool calls, but not the STT provider fallback that happened 800ms earlier or the assertion evaluation that failed 30 seconds later because the transcript was incomplete. (For a deep-dive on STT failure patterns that trigger these fallbacks, see 7 ASR failure modes in production.)
"The hardest part isn't adding spans. It's resisting the urge to trace everything and instead tracing the decision points that actually break," says Sumanyu Sharma, CEO of Hamming. "We wasted two weeks tracing HTTP headers before we realized provider fallback decisions were the thing we needed visibility into."
Design principle: Every decision point in the pipeline gets its own span. Provider selection, fallback triggers, webhook dispatches, and post-call evaluations are all traceable. If a component can fail independently, it needs a span.
| Pipeline Stage | Span Name | Parent | What It Captures |
|---|---|---|---|
| Call start/end | call.lifecycle | Root | Total duration, final status, error type |
| Speech-to-text | stt.transcription | call.lifecycle | Provider used, confidence, fallback triggered |
| STT provider attempt | stt.provider.{name} | stt.transcription | Per-provider latency, success/failure, timeout |
| LLM inference | llm.inference | call.lifecycle | Model, TTFT, tokens, finish reason |
| Tool execution | llm.tool_call.{name} | llm.inference | Webhook URL, status code, response time |
| Text-to-speech | tts.synthesis | call.lifecycle | Provider, synthesis latency, character count |
| Webhook dispatch | webhook.dispatch | call.lifecycle | Target URL, delivery status, retry count |
| Post-call evaluation | evaluation.assertion_check | call.lifecycle | Assertion type, result, score |
| STT provider routing | stt.provider_selection | stt.transcription | Routing decision, language array, selected provider order |
| Transcript finalization | transcript.finalization | call.lifecycle | Finalization status, canonical transcript written |
| Metric score write | metric.score.write | call.lifecycle | Metric type, write target (primary/replica), status |
| Transcript JSON parse | transcript.json_parse | call.lifecycle | Parse status, fallback triggered |
| Transcript merge fallback | transcript.merge.fallback | call.lifecycle | Fallback reason, quality level |
Voice-Specific Span Attributes That Actually Help Debugging
Generic HTTP span attributes (http.method, http.status_code, http.url) tell you a request happened. They don't tell you why your voice agent failed. These 12 attributes are what we recommend based on debugging thousands of production incidents:
| Attribute | Type | Example | When You'll Need This |
|---|---|---|---|
| stt.provider | string | "deepgram" | Debugging empty transcripts: which provider handled this call? |
| stt.confidence | float | 0.87 | Filtering calls where ASR quality degraded below threshold |
| stt.latency_ms | int | 412 | Identifying provider latency spikes that cascade to user-perceived delay |
| llm.model | string | "gpt-4o" | Correlating failures with specific model versions or deployments |
| llm.ttft_ms | int | 340 | Time-to-first-token: the latency users actually feel |
| llm.tokens.input | int | 487 | Detecting prompt bloat that slows inference |
| llm.tokens.output | int | 203 | Catching truncated responses from output token limits |
| tts.provider | string | "elevenlabs" | Tracing synthesis failures or quality issues to specific providers |
| tts.synthesis_ms | int | 290 | Identifying TTS bottlenecks in the latency waterfall |
| tool.name | string | "check_inventory" | Filtering traces by which tool calls were invoked |
| tool.execution_ms | int | 350 | Finding slow webhook endpoints that delay responses |
| call.duration_ms | int | 45000 | Correlating call length with error rates and user satisfaction |
The pattern: every attribute should answer a debugging question you'd actually ask. "Which STT provider handled this call?" "Was the LLM response truncated?" "Which tool call was slow?" If an attribute doesn't map to a question you'd ask during an incident, don't add it. Trace storage isn't free. (For the production KPIs these attributes power, see our voice agent monitoring KPIs guide.)
We found that adding stt.confidence to every transcription span was the single highest-ROI attribute. It lets you query "show me all calls where ASR confidence dropped below 0.7" and immediately see which calls had degraded input before the LLM ever ran.
How to Propagate Context Across Language Boundaries
Voice agent calls typically cross 3+ language boundaries in a single request: TypeScript (web server and task queue), Go (Temporal workflow orchestration), and Python (voice worker). Web APIs rarely do this, which is why standard tracing guides don't cover it.
W3C Trace Context (traceparent header) is the standard that works across all of them. The header format is straightforward:
traceparent: 00-{32-hex-char trace-id}-{16-hex-char parent span-id}-{2-hex-char trace-flags, where 01 = sampled}
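For illustration, here's a small Python sketch that builds and validates headers in that format. The regex and helper names are ours, not part of any SDK (in practice the OTel propagator handles this for you):

```python
import re

# version 00: lowercase hex trace-id (32 chars), span-id (16), flags (2)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Assemble a W3C traceparent header from its parts."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header: str):
    """Return (trace_id, span_id, sampled) or None if the header is malformed."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None
    trace_id, span_id, flags = m.groups()
    # all-zero trace-id or span-id is invalid per the W3C Trace Context spec
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, flags == "01"
```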
Here's how trace context flows through a voice agent system:
1. TypeScript (Web Server) — Create the root span:
```typescript
import { withOpenTelemetry, getWebTracer } from '@hamming/tracing/instrumentation';

const result = await withOpenTelemetry()
  .setSpanName('call.lifecycle')
  .setTracer(getWebTracer())
  .setSpanAttributes({
    'call.id': callId,
    'workspace.id': workspaceId,
    'agent.id': agentId,
  })
  .setFunction(async (span) => {
    // Temporal client auto-injects traceparent via OTel interceptor
    await temporalClient.workflow.start(voiceCallWorkflow, {
      args: [callConfig],
      taskQueue: 'voice-calls',
    });
  })
  .execute();
```
2. Temporal (Workflow Orchestration) — Context propagates automatically:
Temporal's OpenTelemetry interceptors extract the traceparent from workflow metadata and inject it into activity calls. No manual propagation code needed. This is the "Go boundary" that most teams don't realize exists.
3. Python (Voice Worker) — Extract and continue the trace:
```python
from opentelemetry import trace
from opentelemetry.propagate import extract

async def handle_voice_job(job):
    # Build a carrier dict from job metadata; start a fresh trace if header is missing
    traceparent = getattr(getattr(job, "metadata", None), "headers", {}).get("traceparent")
    carrier: dict[str, str] = {"traceparent": traceparent} if traceparent else {}
    context = extract(carrier)

    tracer = trace.get_tracer("voice-worker")
    with tracer.start_as_current_span(
        "stt.transcription",
        context=context,
        attributes={
            "stt.provider": "deepgram",
            "stt.confidence": 0.92,
            "stt.latency_ms": 412,
        },
    ):
        transcript = await stt_provider.transcribe(audio)
        return transcript
```
4. Webhook Callback — Context returns to TypeScript:
When the voice worker sends webhook events back to the web server, the traceparent header is included automatically by OpenTelemetry's HTTP instrumentation. The web server continues the same trace, meaning post-call evaluation spans are children of the original call.lifecycle root.
The result: one trace ID follows a call from the moment it's initiated in TypeScript, through Temporal's Go-based orchestration, into Python's voice processing, and back to TypeScript for evaluation. Any span in that chain is queryable by the original trace ID.
Framework-Specific OTel Setup
If you're building on LiveKit Agents or Pipecat, both frameworks have native OpenTelemetry support that gives you a head start on Layer 1 (span hierarchy) and Layer 2 (attributes).
LiveKit Agents exports the same OTel spans that power LiveKit Cloud's Agent Insights dashboard. Call set_tracer_provider() from livekit.agents.telemetry to route those spans to your own backend (Jaeger, SigNoz, etc.) instead of — or in addition to — LiveKit Cloud. LiveKit auto-instruments five pipeline stages: VAD, STT, end-of-utterance detection, LLM inference, and TTS. Each span carries gen_ai.* attributes (model, token counts, TTFB) aligned with the emerging OpenTelemetry GenAI semantic conventions. See the LiveKit observability data docs for the full set_tracer_provider() API and the Agent Insights guide for the session timeline view.
Pipecat instruments at three levels: a root conversation span, child turn spans (one per user-agent exchange), and grandchild service spans for each STT, LLM, and TTS call. Enable it with setup_tracing() and set enable_tracing=True plus enable_turn_tracking=True on your PipelineTask. Pipecat's service spans include provider-specific attributes like stt.audio_duration, llm.tokens.prompt, and tts.characters. See the Pipecat OTel docs for setup and span attribute reference, and the metrics guide for TTFB and processing time metrics.
Both frameworks handle Layer 1 and Layer 2 within a single service. What they don't handle is Layer 3: propagating trace context across services (your orchestrator, your webhook handlers, your evaluation pipeline). That's where the W3C traceparent propagation described above becomes essential.
For Hamming-specific integration patterns, see our guides on monitoring Pipecat agents in production and testing and monitoring LiveKit voice agents.
Debugging Playbook: 4 Error Cascades and How to Trace Them
This is the section I could write 5,000 words on. These four patterns account for the majority of "I can't figure out why my voice agent is broken" incidents we've seen across 10K+ deployments. Each one is invisible without cross-service traces. (If you're looking for a structured incident response workflow, see our voice agent SEV playbook and postmortem template.)
Cascade 1: Silent Transcript Merging Failures
Symptom: Test assertions return "SKIPPED" with "transcript is empty," but you can see the full conversation in the UI playback.
What happened: The voice worker streamed transcript data in real-time (visible in the UI), but the canonical transcript used by the assertion evaluator was never finalized. A protocol configuration caused the finalization step to be skipped, so streaming data existed but the evaluation system read an empty canonical source.
How to trace it:
| Step | Span to Check | What to Look For |
|---|---|---|
| 1 | call.lifecycle | Verify call completed successfully |
| 2 | stt.transcription | Check streaming transcript was received (messages > 0) |
| 3 | transcript.finalization | Missing or status=skipped — this is the break |
| 4 | evaluation.assertion_check | Reads canonical transcript, finds empty, marks SKIPPED |
Fix: Add a finalization guard that checks whether canonical transcript exists before assertion evaluation. If streaming data exists but canonical doesn't, trigger finalization before proceeding.
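That guard is plain control flow. A sketch, where `finalize` stands in for whatever writes your canonical transcript (the status strings are ours, for illustration):

```python
def ensure_canonical_transcript(streaming_messages, canonical, finalize):
    """Run before assertion evaluation. If streaming data exists but the
    canonical transcript was never finalized (Cascade 1), trigger
    finalization instead of letting the evaluator read an empty source."""
    if canonical:
        return canonical, "ok"
    if streaming_messages:
        # streaming data arrived but the finalization step was skipped
        return finalize(streaming_messages), "finalized_late"
    return None, "no_data"  # genuinely empty: SKIPPED is the right outcome
```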
Without unified tracing: You'd see passing UI playback in one dashboard, failing assertions in another, and have no way to connect the two. The streaming vs. canonical distinction is invisible to any single tool.
Cascade 2: Language Routing Causes Empty STT Output
Symptom: Non-English test runs (Japanese, Spanish, Hindi) produce empty transcripts. English calls work fine.
What happened: A language composition change combined persona language (ja-JP) with test case language (en-US) into a multilingual array [ja-JP, en-US]. The STT routing logic saw languages.length > 1 and selected a multilingual provider order. That specific provider returned empty transcripts for these language combinations. All downstream assertions failed with "transcript is empty."
How to trace it:
| Step | Span to Check | What to Look For |
|---|---|---|
| 1 | call.lifecycle | Check call.languages attribute — shows [ja-JP, en-US] |
| 2 | stt.provider_selection | Routing decision — multilingual path selected |
| 3 | stt.provider.deepgram | Empty or near-empty output (stt.confidence = 0 or missing) |
| 4 | evaluation.assertion_check | "Transcript is empty, cannot evaluate" |
Fix: Validate the STT provider's language support before routing. If a provider doesn't handle a specific combination, fall back to a provider that does or split into separate single-language transcriptions. (See our multilingual voice agent testing guide for testing patterns that catch this before production.)
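A minimal sketch of that validation, using a hypothetical capability table (the provider names and language sets below are illustrative, not real provider data):

```python
# Hypothetical capability table: which languages each provider supports
# and whether it can transcribe a multilingual mix in one pass.
PROVIDERS = {
    "multi_stt": {"languages": {"en-US", "es-ES"}, "multilingual": True},
    "single_stt": {"languages": {"en-US", "ja-JP", "es-ES", "hi-IN"}, "multilingual": False},
}

def route_stt(languages: list) -> dict:
    """Validate provider support *before* routing: prefer a multilingual
    provider that covers every requested language; otherwise split into
    per-language transcriptions rather than emit empty transcripts."""
    if len(languages) > 1:
        for name, caps in PROVIDERS.items():
            if caps["multilingual"] and set(languages) <= caps["languages"]:
                return {"mode": "multilingual", "provider": name}
        # no provider handles this combination (the Cascade 2 case): split
        return {"mode": "split", "assignments": {
            lang: next(n for n, c in PROVIDERS.items() if lang in c["languages"])
            for lang in languages
        }}
    lang = languages[0]
    return {"mode": "single",
            "provider": next(n for n, c in PROVIDERS.items() if lang in c["languages"])}
```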
Without unified tracing: The language routing decision lives in Service A, the STT provider selection in Service B, and the assertion failure in Service C. You'd need to correlate timestamps across three log streams to connect "language composition change" to "empty transcript" to "assertion failure."
Cascade 3: Replica Lag Causes Phantom "Not Found" Errors
Symptom: Custom metric assertions show "metric score not found" even though the metric was computed successfully. Scores are visible in the database when you check manually.
What happened: Metric scores were written to the primary database at timestamp T. The assertion evaluator read from a read replica. During a recovery-conflict storm, replica lag spiked from seconds to minutes. The evaluator's read hit the stale replica, got its query canceled by PostgreSQL's recovery process, and marked the assertion as SKIPPED.
How to trace it:
| Step | Span to Check | What to Look For |
|---|---|---|
| 1 | metric.score.write | Timestamp T, status=success, db.target=primary |
| 2 | evaluation.assertion_check | Timestamp T+5ms, db.target=replica |
| 3 | Compare timestamps | Write at T, read at T+5ms on a replica lagging seconds to minutes (recovery-conflict storm) |
| 4 | db.query spans | Check db.replica_lag_ms or PostgreSQL "canceling statement due to conflict with recovery" errors |
Fix: For consistency-critical reads (metrics needed for assertion evaluation), route to the primary database. Or add a retry with exponential backoff that waits for replica convergence.
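The retry path can be sketched as a small backoff loop. `read_fn` stands in for your replica query, and the delay values are illustrative:

```python
import time

def read_with_backoff(read_fn, attempts: int = 4, base_delay: float = 0.05, sleep=time.sleep):
    """Retry a replica read with exponential backoff so the assertion
    evaluator waits for replica convergence instead of marking the metric
    SKIPPED. read_fn returns the score row, or None if the replica is stale."""
    for attempt in range(attempts):
        result = read_fn()
        if result is not None:
            return result
        if attempt < attempts - 1:
            # 50ms, 100ms, 200ms, ... gives a lagging replica time to catch up
            sleep(base_delay * (2 ** attempt))
    return None  # still missing after retries: escalate, don't silently skip
```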
Without unified tracing: Langfuse would show the LLM call succeeded. Datadog would show the database is healthy. Neither would reveal that the read replica was in a recovery-conflict storm, canceling queries before they could return. In production, this pattern can generate thousands of errors across thousands of traces within minutes before anyone identifies the root cause.
Cascade 4: LLM Output Limits Cause Silent Quality Degradation
Symptom: Transcript merging quality drops sharply. In our incident reviews, 50-70% of merge operations produced degraded output, yet no errors appeared in logs.
What happened: The LLM model used for transcript merging hit its output token limit on long conversations (>15 messages). The model returned truncated JSON, the parser failed silently, and the system fell back to a lower-quality text-only merge. No error was thrown because the fallback was "working correctly."
How to trace it:
| Step | Span to Check | What to Look For |
|---|---|---|
| 1 | llm.inference (transcript merge) | llm.tokens.output near model limit, llm.finish_reason=length |
| 2 | transcript.json_parse | Status=failed, no exception thrown (silent failure) |
| 3 | transcript.merge.fallback | Fallback triggered, quality=degraded |
| 4 | Correlation: transcript length vs failure rate | In our data, long transcripts failed at 5-7x the rate of shorter transcripts |
Fix: Add a JSON repair step before falling back to text-only merge. Detect finish_reason=length and retry with a larger context window or split the transcript into chunks.
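The detection half of that fix is straightforward: surface why parsing failed instead of silently degrading. A sketch (the status strings are ours, for illustration):

```python
import json

def handle_merge_output(raw: str, finish_reason: str):
    """Classify LLM merge output. On parse failure, distinguish truncation
    (finish_reason=length, the Cascade 4 case) from genuinely malformed
    output, so the caller retries with chunking rather than silently
    falling back to a text-only merge."""
    try:
        return json.loads(raw), "ok"
    except json.JSONDecodeError:
        if finish_reason == "length":
            # output token limit hit: retry with a chunked transcript
            return None, "retry_chunked"
        return None, "fallback_text_merge"
```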
Without unified tracing: The LLM provider dashboard shows calls completing. The transcript service shows merges running. The quality degradation only becomes visible when you correlate LLM output token counts with downstream merge quality metrics. That correlation requires a unified trace.
Unified Platform vs. Stitching Together Langfuse + Arize + Datadog
Remember the Dashboard Blindspot from earlier, where each tool sees its own slice but misses the cascade? The cost of working around it has a name too. We call the debugging overhead of correlating information across separate tools the Context Switching Tax. In our analysis of production incidents, engineers spend 40-60% of debugging time switching between tools, matching timestamps, and mentally reconstructing the chain of events.
Here's what each tool sees versus what a unified trace shows. "Unified Trace" here means a single OTel-backed tracing backend (e.g., SigNoz, Jaeger, Tempo) that also correlates voice artifacts (audio, transcripts, test results) via shared trace IDs. OpenTelemetry gives you the trace; correlating recordings and transcripts alongside spans requires attaching IDs and a UI that surfaces those artifacts together.
| Debugging Dimension | Langfuse | Arize | Datadog | Unified Trace |
|---|---|---|---|---|
| LLM prompt/completion | Full visibility | Full visibility | Limited | Full visibility |
| LLM token counts and latency | Yes | Yes | No | Yes |
| STT provider selection | Partial | No | No | Yes |
| STT provider fallback chain | No | No | No | Yes |
| STT confidence per utterance | No | No | No | Yes |
| TTS synthesis latency | Partial | No | Partial | Yes |
| Tool call webhook timing | Partial | No | Yes | Yes |
| Database read/write routing | No | No | Yes | Yes |
| Replica lag correlation | No | No | Partial | Yes |
| Cross-service cascade | No | No | Partial | Yes |
| Test assertion correlation | No | No | No | Yes |
| Audio playback alongside trace | No | No | No | Yes |
| Call recording + trace linkage | No | No | No | Yes |
| Language routing decisions | No | No | No | Yes |
| Provider health across calls | No | Partial | Yes | Yes |
The pattern is clear: Langfuse and Arize excel at LLM-layer visibility, and Langfuse has added partial STT/TTS tracing via Pipecat and LiveKit integrations. Datadog excels at infrastructure monitoring. But voice agent debugging requires correlating events across all three layers simultaneously, including provider fallback chains, language routing decisions, and database consistency issues that none of these tools trace natively. The question is whether you want one trace or three dashboards.
I should be clear about when fragmented tooling works fine. If your voice agent is a thin wrapper around a single LLM call with no custom STT/TTS, Langfuse alone gives you most of what you need. If you're primarily debugging infrastructure issues (CPU, memory, network), Datadog is the right tool. The unified approach becomes essential when you're operating a multi-service voice pipeline where failures cascade across boundaries.
Flaws But Not Dealbreakers
Sampling decisions are brutal. You can't trace 100% of production calls at scale. (For the broader observability framework that tracing fits into, see Hamming's 4-Layer Voice Observability Framework.) At 10,000 calls/day, even with efficient BatchSpanProcessor export, full tracing generates substantial storage costs. But tail sampling (keep only slow or errored traces) has a critical weakness: the bugs you most need to trace are often the ones that don't trigger errors. The transcript merge that silently degrades quality? No error. The STT provider that returns low-confidence output? No error. We call this the "Sampling Blind Spot." We haven't fully resolved it. Different teams land differently based on call volume and debugging needs.
Third-party providers don't propagate trace context. When you call Deepgram's API or ElevenLabs' TTS endpoint, they don't return a traceparent header. Your trace has a gap at the provider boundary. You can bracket the call with a client-side span (start span before request, end after response), but you lose visibility into what happened inside the provider. This is an industry-wide limitation, not specific to any one tool.
Expect 1-3% latency overhead. Span creation and export aren't free. We measure 1-3% latency increase from OTel instrumentation across our deployments. For most voice agents where turn latency is 800-2,000ms, that's 8-60ms of added latency. Worth it for the debugging capability, but measure it in your system. Use BatchSpanProcessor (not SimpleSpanProcessor) to amortize export costs.
The OTel spec doesn't have voice agent semantic conventions yet. The OpenTelemetry GenAI semantic conventions cover LLM calls but not STT, TTS, or audio processing. That means the attribute names in this guide (stt.provider, tts.synthesis_ms) are Hamming's conventions, not an industry standard. We'd love to see the OTel community adopt voice-specific conventions. Until then, pick a naming scheme and be consistent.
OTel Instrumentation Readiness Checklist
Use this before your next deployment:
Span Hierarchy
- Root call.lifecycle span created at call initiation
- Child spans for STT, LLM, TTS, tool execution, webhooks
- Provider-level spans nested under STT/TTS (for fallback tracing)
- Post-call evaluation spans linked to same root trace
- Every span has call.id attribute for cross-referencing
- Framework-native spans enabled (LiveKit: set_tracer_provider(), Pipecat: setup_tracing() + enable_tracing=True)
Voice-Specific Attributes
- STT: provider, confidence, latency_ms
- LLM: model, ttft_ms, tokens.input, tokens.output, finish_reason
- TTS: provider, synthesis_ms, character_count
- Tools: name, execution_ms, webhook_status
- Call: duration_ms, status, error_type (if applicable)
Context Propagation
- W3C traceparent header on all HTTP requests between services
- Temporal OTel interceptor configured for workflow/activity context
- Python voice worker extracts context from job metadata
- Webhook callbacks carry traceparent back to web server
Export & Storage
- BatchSpanProcessor configured (not SimpleSpanProcessor)
- Sampling strategy defined (head sampling, tail sampling, or adaptive)
- Trace retention policy matches your debugging window (7-30 days)
- Alerts configured on trace completeness (>95% target)
Verification
- End-to-end trace visible from web server → voice worker → webhook callback
- Can filter traces by call.id, workspace.id, stt.provider
- Can identify provider fallback chains in the span tree
- Can correlate test assertion results with call traces
Voice agent observability isn't about more dashboards. It's about one trace that follows a call through every service, every language boundary, and every decision point. The teams that debug fastest aren't the ones with the best engineers. They're the ones with the best traces.

