Last month, I watched a customer debug a voice agent failure with Langfuse open in one tab and Datadog in another. The LLM traces looked clean. The infrastructure metrics showed no anomalies. Yet more than half of their transcript merges were failing silently. The fix? OpenTelemetry, instrumented for voice agents, not just LLM calls.
The problem wasn't in the LLM. It wasn't in the infrastructure. It was in the space between them: a language routing decision in one service triggered an STT provider switch in another, which produced empty transcripts that a third service evaluated as "no data available." Three services. Two programming languages. One trace ID that connected them all, once we instrumented it properly.
I used to think you could bolt general LLM observability tools like Langfuse or Arize onto voice agents and call it done. After analyzing 4M+ voice agent calls at Hamming, I changed my mind. Voice agents need their own span hierarchy, their own attributes, and their own debugging playbooks. This is the implementation guide I wish existed when we started.
TL;DR: Instrument voice agents using Hamming's 3-Layer OTel Instrumentation Model:
- Layer 1: Span Hierarchy — Design parent-child spans for call.lifecycle → stt → llm → tools → tts → webhooks (not flat LLM-only traces)
- Layer 2: Voice-Specific Attributes — Attach stt.provider, llm.ttft_ms, tts.synthesis_ms, and tool.execution_ms to every span (generic HTTP attributes won't help you debug voice failures)
- Layer 3: Unified Export — Route all traces to a single backend that correlates with test results and production calls (the "Context Switching Tax" of 3-4 separate dashboards kills debugging speed)
One trace ID across TypeScript, Temporal, and Python beats five dashboards.
Methodology Note: The span hierarchies, attributes, and debugging playbooks in this guide are derived from Hamming's analysis of 4M+ production voice agent calls across 10K+ voice agents (2024-2026). Error cascade patterns are drawn from real production incidents across those deployments. All examples are anonymized. Last Updated: February 2026.
Related Guides:
- Voice Agent Observability: End-to-End Tracing — Hamming's 5-Layer Observability Stack (conceptual model)
- Why Voice Agent Teams Need Unified Observability — Why unified beats fragmented
- Debugging Voice Agents: Logs, Missed Intents & Error Dashboards — Turn-level debugging
- Voice Agent Troubleshooting Guide — Common failure patterns and fixes
Who This Is For (And Who Should Skip It)
If you're running fewer than 100 voice agent calls per week, structured logging is probably enough. Bookmark this for when you scale.
If you're on a managed platform like Retell or Vapi with built-in dashboards, check whether their traces span all layers you care about. Most don't include STT provider details or tool execution timing. If those gaps don't hurt you yet, this guide can wait.
This guide is for teams that:
- Run 1,000+ voice agent calls per week across custom infrastructure
- Use separate services for ASR, LLM, TTS, and tool execution
- Have hit a debugging wall where logs alone can't explain why calls fail
- Want to instrument with OpenTelemetry rather than lock into a proprietary tracing vendor
If you're already spending more than 30 minutes per incident correlating timestamps across Langfuse, Datadog, and your application logs, the approach below will save you hours.
Why Standard LLM Observability Tools Miss Voice Agent Failures
Here's the trap: you set up Langfuse or Arize, trace your LLM calls, and assume you have observability. The LLM dashboard looks green. Prompt latency is normal. Token counts are within range.
Meanwhile, your users hear silence.
We call this the Dashboard Blindspot: each observability tool sees its own slice of the voice pipeline, but no single view shows the cascade.
Here's what happens in a real voice call. Audio arrives from a phone network, gets routed to an ASR provider, flows through language detection logic, hits the LLM for inference, triggers tool calls via webhooks, generates TTS audio, and streams back to the caller. That chain crosses 3+ service boundaries and typically spans multiple programming languages (TypeScript on the web/queue side, Python on the voice worker side, often with Go-based Temporal orchestration in between).
Langfuse traces the LLM hop. Datadog monitors the infrastructure. Your application logs capture individual events. But none of them show you that a language routing decision in Service A triggered an STT provider switch in Service B that produced empty transcripts consumed by Service C. In our analysis of production incidents, the majority of voice agent failures that present as "LLM errors" actually originate in a different pipeline stage. Across the four cascade patterns we've documented in detail, every one involved the LLM performing fine while the input it received was garbage.
Standard LLM observability tools weren't built for this. Langfuse has added partial voice tracing through Pipecat and LiveKit integrations, which is a step forward, but these capture individual STT/TTS spans without the cross-service cascade visibility (provider fallback chains, language routing decisions, database consistency) that voice debugging requires. That's not a knock on Langfuse or Arize. They're excellent at what they do. But bolting them onto a voice agent and expecting full pipeline visibility is like monitoring your database and assuming you're observing your web app. (For more on structuring call logs and analytics, see our taxonomy guide.)
The Voice Agent Span Hierarchy: What to Trace
A voice agent span hierarchy mirrors the audio pipeline, not the HTTP request path. The root span is call.lifecycle, with child spans for each processing stage:
```
call.lifecycle (root, 45,000ms)
├── stt.transcription (1,200ms)
│   ├── stt.provider.deepgram (timeout, 800ms)
│   └── stt.provider.fallback.azure (400ms)
├── llm.inference (2,100ms)
│   ├── llm.tool_call.check_inventory (350ms)
│   └── llm.tool_call.create_order (890ms)
├── tts.synthesis (380ms)
├── webhook.dispatch (120ms)
│   └── webhook.callback.result_update (95ms)
└── evaluation.assertion_check (450ms)
```
Notice the nesting. When Deepgram times out, the system falls back to Azure. That fallback is a child span of stt.transcription, so you see it in context. When the LLM calls two tools, each tool call is a separate child span with its own timing. When the assertion evaluation runs after the call, it's traceable back to the same root.
This hierarchy is different from what Langfuse gives you. Langfuse would show the llm.inference span and its tool calls, but not the STT provider fallback that happened 800ms earlier or the assertion evaluation that failed 30 seconds later because the transcript was incomplete. (For a deep-dive on STT failure patterns that trigger these fallbacks, see 7 ASR failure modes in production.)
"The hardest part isn't adding spans. It's resisting the urge to trace everything and instead tracing the decision points that actually break," says Sumanyu Sharma, CEO of Hamming. "We wasted two weeks tracing HTTP headers before we realized provider fallback decisions were the thing we needed visibility into."
Design principle: Every decision point in the pipeline gets its own span. Provider selection, fallback triggers, webhook dispatches, and post-call evaluations are all traceable. If a component can fail independently, it needs a span.
| Pipeline Stage | Span Name | Parent | What It Captures |
|---|---|---|---|
| Call start/end | call.lifecycle | Root | Total duration, final status, error type |
| Speech-to-text | stt.transcription | call.lifecycle | Provider used, confidence, fallback triggered |
| STT provider attempt | stt.provider.{name} | stt.transcription | Per-provider latency, success/failure, timeout |
| LLM inference | llm.inference | call.lifecycle | Model, TTFT, tokens, finish reason |
| Tool execution | llm.tool_call.{name} | llm.inference | Webhook URL, status code, response time |
| Text-to-speech | tts.synthesis | call.lifecycle | Provider, synthesis latency, character count |
| Webhook dispatch | webhook.dispatch | call.lifecycle | Target URL, delivery status, retry count |
| Post-call evaluation | evaluation.assertion_check | call.lifecycle | Assertion type, result, score |
| STT provider routing | stt.provider_selection | stt.transcription | Routing decision, language array, selected provider order |
| Transcript finalization | transcript.finalization | call.lifecycle | Finalization status, canonical transcript written |
| Metric score write | metric.score.write | call.lifecycle | Metric type, write target (primary/replica), status |
| Transcript JSON parse | transcript.json_parse | call.lifecycle | Parse status, fallback triggered |
| Transcript merge fallback | transcript.merge.fallback | call.lifecycle | Fallback reason, quality level |
Voice-Specific Span Attributes That Actually Help Debugging
Generic HTTP span attributes (http.method, http.status_code, http.url) tell you a request happened. They don't tell you why your voice agent failed. These 12 attributes are what we recommend based on debugging thousands of production incidents:
| Attribute | Type | Example | When You'll Need This |
|---|---|---|---|
| stt.provider | string | "deepgram" | Debugging empty transcripts: which provider handled this call? |
| stt.confidence | float | 0.87 | Filtering calls where ASR quality degraded below threshold |
| stt.latency_ms | int | 412 | Identifying provider latency spikes that cascade to user-perceived delay |
| llm.model | string | "gpt-4o" | Correlating failures with specific model versions or deployments |
| llm.ttft_ms | int | 340 | Time-to-first-token: the latency users actually feel |
| llm.tokens.input | int | 487 | Detecting prompt bloat that slows inference |
| llm.tokens.output | int | 203 | Catching truncated responses from output token limits |
| tts.provider | string | "elevenlabs" | Tracing synthesis failures or quality issues to specific providers |
| tts.synthesis_ms | int | 290 | Identifying TTS bottlenecks in the latency waterfall |
| tool.name | string | "check_inventory" | Filtering traces by which tool calls were invoked |
| tool.execution_ms | int | 350 | Finding slow webhook endpoints that delay responses |
| call.duration_ms | int | 45000 | Correlating call length with error rates and user satisfaction |
The pattern: every attribute should answer a debugging question you'd actually ask. "Which STT provider handled this call?" "Was the LLM response truncated?" "Which tool call was slow?" If an attribute doesn't map to a question you'd ask during an incident, don't add it. Trace storage isn't free. (For the production KPIs these attributes power, see our voice agent monitoring KPIs guide.)
We found that adding stt.confidence to every transcription span was the single highest-ROI attribute. It lets you query "show me all calls where ASR confidence dropped below 0.7" and immediately see which calls had degraded input before the LLM ever ran.
How to Propagate Context Across Language Boundaries
Voice agent calls typically cross 3+ language boundaries in a single request: TypeScript (web server and task queue), Go (Temporal workflow orchestration), and Python (voice worker). Web APIs rarely do this, which is why standard tracing guides don't cover it.
W3C Trace Context (traceparent header) is the standard that works across all of them. The header format is straightforward:
traceparent: 00-{32-hex-char trace-id}-{16-hex-char parent span-id}-{2-hex-char trace-flags, where 01 = sampled}
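For illustration, here's a small Python sketch that builds and validates headers in that format. The regex and helper names are ours, not part of any SDK (in practice the OTel propagator handles this for you):

```python
import re

# version 00: lowercase hex trace-id (32 chars), span-id (16), flags (2)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Assemble a W3C traceparent header from its parts."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header: str):
    """Return (trace_id, span_id, sampled) or None if the header is malformed."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None
    trace_id, span_id, flags = m.groups()
    # all-zero trace-id or span-id is invalid per the W3C Trace Context spec
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, flags == "01"
```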
Here's how trace context flows through a voice agent system:
1. TypeScript (Web Server) — Create the root span:
```typescript
import { withOpenTelemetry, getWebTracer } from '@hamming/tracing/instrumentation';

const result = await withOpenTelemetry()
  .setSpanName('call.lifecycle')
  .setTracer(getWebTracer())
  .setSpanAttributes({
    'call.id': callId,
    'workspace.id': workspaceId,
    'agent.id': agentId,
  })
  .setFunction(async (span) => {
    // Temporal client auto-injects traceparent via OTel interceptor
    await temporalClient.workflow.start(voiceCallWorkflow, {
      args: [callConfig],
      taskQueue: 'voice-calls',
    });
  })
  .execute();
```
2. Temporal (Workflow Orchestration) — Context propagates automatically:
Temporal's OpenTelemetry interceptors extract the traceparent from workflow metadata and inject it into activity calls. No manual propagation code needed. This is the "Go boundary" that most teams don't realize exists.
3. Python (Voice Worker) — Extract and continue the trace:
```python
from opentelemetry import trace
from opentelemetry.propagate import extract

async def handle_voice_job(job):
    # Build a carrier dict from job metadata; start a fresh trace if header is missing
    traceparent = getattr(getattr(job, "metadata", None), "headers", {}).get("traceparent")
    carrier: dict[str, str] = {"traceparent": traceparent} if traceparent else {}
    context = extract(carrier)

    tracer = trace.get_tracer("voice-worker")
    with tracer.start_as_current_span(
        "stt.transcription",
        context=context,
        attributes={
            "stt.provider": "deepgram",
            "stt.confidence": 0.92,
            "stt.latency_ms": 412,
        },
    ):
        transcript = await stt_provider.transcribe(audio)
        return transcript
```
4. Webhook Callback — Context returns to TypeScript:
When the voice worker sends webhook events back to the web server, the traceparent header is included automatically by OpenTelemetry's HTTP instrumentation. The web server continues the same trace, meaning post-call evaluation spans are children of the original call.lifecycle root.
The result: one trace ID follows a call from the moment it's initiated in TypeScript, through Temporal's Go-based orchestration, into Python's voice processing, and back to TypeScript for evaluation. Any span in that chain is queryable by the original trace ID.
Framework-Specific OTel Setup
If you're building on LiveKit Agents or Pipecat, both frameworks have native OpenTelemetry support that gives you a head start on Layer 1 (span hierarchy) and Layer 2 (attributes).
LiveKit Agents exports the same OTel spans that power LiveKit Cloud's Agent Insights dashboard. Call set_tracer_provider() from livekit.agents.telemetry to route those spans to your own backend (Jaeger, SigNoz, etc.) instead of — or in addition to — LiveKit Cloud. LiveKit auto-instruments five pipeline stages: VAD, STT, end-of-utterance detection, LLM inference, and TTS. Each span carries gen_ai.* attributes (model, token counts, TTFB) aligned with the emerging OpenTelemetry GenAI semantic conventions. See the LiveKit observability data docs for the full set_tracer_provider() API and the Agent Insights guide for the session timeline view.
Pipecat instruments at three levels: a root conversation span, child turn spans (one per user-agent exchange), and grandchild service spans for each STT, LLM, and TTS call. Enable it with setup_tracing() and set enable_tracing=True plus enable_turn_tracking=True on your PipelineTask. Pipecat's service spans include provider-specific attributes like stt.audio_duration, llm.tokens.prompt, and tts.characters. See the Pipecat OTel docs for setup and span attribute reference, and the metrics guide for TTFB and processing time metrics.
Both frameworks handle Layer 1 and Layer 2 within a single service. What they don't handle is Layer 3: propagating trace context across services (your orchestrator, your webhook handlers, your evaluation pipeline). That's where the W3C traceparent propagation described above becomes essential.
For Hamming-specific integration patterns, see our guides on monitoring Pipecat agents in production and testing and monitoring LiveKit voice agents.
Debugging Playbook: 4 Error Cascades and How to Trace Them
This is the section I could write 5,000 words on. These four patterns account for the majority of "I can't figure out why my voice agent is broken" incidents we've seen across 10K+ deployments. Each one is invisible without cross-service traces. (If you're looking for a structured incident response workflow, see our voice agent SEV playbook and postmortem template.)
Cascade 1: Silent Transcript Merging Failures
Symptom: Test assertions return "SKIPPED" with "transcript is empty," but you can see the full conversation in the UI playback.
What happened: The voice worker streamed transcript data in real-time (visible in the UI), but the canonical transcript used by the assertion evaluator was never finalized. A protocol configuration caused the finalization step to be skipped, so streaming data existed but the evaluation system read an empty canonical source.
How to trace it:
| Step | Span to Check | What to Look For |
|---|---|---|
| 1 | call.lifecycle | Verify call completed successfully |
| 2 | stt.transcription | Check streaming transcript was received (messages > 0) |
| 3 | transcript.finalization | Missing or status=skipped — this is the break |
| 4 | evaluation.assertion_check | Reads canonical transcript, finds empty, marks SKIPPED |
Fix: Add a finalization guard that checks whether canonical transcript exists before assertion evaluation. If streaming data exists but canonical doesn't, trigger finalization before proceeding.
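That guard is plain control flow. A sketch, where `finalize` stands in for whatever writes your canonical transcript (the status strings are ours, for illustration):

```python
def ensure_canonical_transcript(streaming_messages, canonical, finalize):
    """Run before assertion evaluation. If streaming data exists but the
    canonical transcript was never finalized (Cascade 1), trigger
    finalization instead of letting the evaluator read an empty source."""
    if canonical:
        return canonical, "ok"
    if streaming_messages:
        # streaming data arrived but the finalization step was skipped
        return finalize(streaming_messages), "finalized_late"
    return None, "no_data"  # genuinely empty: SKIPPED is the right outcome
```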
Without unified tracing: You'd see passing UI playback in one dashboard, failing assertions in another, and have no way to connect the two. The streaming vs. canonical distinction is invisible to any single tool.
Cascade 2: Language Routing Causes Empty STT Output
Symptom: Non-English test runs (Japanese, Spanish, Hindi) produce empty transcripts. English calls work fine.
What happened: A language composition change combined persona language (ja-JP) with test case language (en-US) into a multilingual array [ja-JP, en-US]. The STT routing logic saw languages.length > 1 and selected a multilingual provider order. That specific provider returned empty transcripts for these language combinations. All downstream assertions failed with "transcript is empty."
How to trace it:
| Step | Span to Check | What to Look For |
|---|---|---|
| 1 | call.lifecycle | Check call.languages attribute — shows [ja-JP, en-US] |
| 2 | stt.provider_selection | Routing decision — multilingual path selected |
| 3 | stt.provider.deepgram | Empty or near-empty output (stt.confidence = 0 or missing) |
| 4 | evaluation.assertion_check | "Transcript is empty, cannot evaluate" |
Fix: Validate the STT provider's language support before routing. If a provider doesn't handle a specific combination, fall back to a provider that does or split into separate single-language transcriptions. (See our multilingual voice agent testing guide for testing patterns that catch this before production.)
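A minimal sketch of that validation, using a hypothetical capability table (the provider names and language sets below are illustrative, not real provider data):

```python
# Hypothetical capability table: which languages each provider supports
# and whether it can transcribe a multilingual mix in one pass.
PROVIDERS = {
    "multi_stt": {"languages": {"en-US", "es-ES"}, "multilingual": True},
    "single_stt": {"languages": {"en-US", "ja-JP", "es-ES", "hi-IN"}, "multilingual": False},
}

def route_stt(languages: list) -> dict:
    """Validate provider support *before* routing: prefer a multilingual
    provider that covers every requested language; otherwise split into
    per-language transcriptions rather than emit empty transcripts."""
    if len(languages) > 1:
        for name, caps in PROVIDERS.items():
            if caps["multilingual"] and set(languages) <= caps["languages"]:
                return {"mode": "multilingual", "provider": name}
        # no provider handles this combination (the Cascade 2 case): split
        return {"mode": "split", "assignments": {
            lang: next(n for n, c in PROVIDERS.items() if lang in c["languages"])
            for lang in languages
        }}
    lang = languages[0]
    return {"mode": "single",
            "provider": next(n for n, c in PROVIDERS.items() if lang in c["languages"])}
```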
Without unified tracing: The language routing decision lives in Service A, the STT provider selection in Service B, and the assertion failure in Service C. You'd need to correlate timestamps across three log streams to connect "language composition change" to "empty transcript" to "assertion failure."
Cascade 3: Replica Lag Causes Phantom "Not Found" Errors
Symptom: Custom metric assertions show "metric score not found" even though the metric was computed successfully. Scores are visible in the database when you check manually.
What happened: Metric scores were written to the primary database at timestamp T. The assertion evaluator read from a read replica. During a recovery-conflict storm, replica lag spiked from seconds to minutes. The evaluator's read hit the stale replica, got its query canceled by PostgreSQL's recovery process, and marked the assertion as SKIPPED.
How to trace it:
| Step | Span to Check | What to Look For |
|---|---|---|
| 1 | metric.score.write | Timestamp T, status=success, db.target=primary |
| 2 | evaluation.assertion_check | Timestamp T+5ms, db.target=replica |
| 3 | Compare timestamps | Write at T, read at T+5ms on a replica lagging seconds to minutes (recovery-conflict storm) |
| 4 | db.query spans | Check db.replica_lag_ms or PostgreSQL "canceling statement due to conflict with recovery" errors |
Fix: For consistency-critical reads (metrics needed for assertion evaluation), route to the primary database. Or add a retry with exponential backoff that waits for replica convergence.
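The retry path can be sketched as a small backoff loop. `read_fn` stands in for your replica query, and the delay values are illustrative:

```python
import time

def read_with_backoff(read_fn, attempts: int = 4, base_delay: float = 0.05, sleep=time.sleep):
    """Retry a replica read with exponential backoff so the assertion
    evaluator waits for replica convergence instead of marking the metric
    SKIPPED. read_fn returns the score row, or None if the replica is stale."""
    for attempt in range(attempts):
        result = read_fn()
        if result is not None:
            return result
        if attempt < attempts - 1:
            # 50ms, 100ms, 200ms, ... gives a lagging replica time to catch up
            sleep(base_delay * (2 ** attempt))
    return None  # still missing after retries: escalate, don't silently skip
```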
Without unified tracing: Langfuse would show the LLM call succeeded. Datadog would show the database is healthy. Neither would reveal that the read replica was in a recovery-conflict storm, canceling queries before they could return. In production, this pattern can generate thousands of errors across thousands of traces within minutes before anyone identifies the root cause.
Cascade 4: LLM Output Limits Cause Silent Quality Degradation
Symptom: Transcript merging quality drops sharply. In our incident reviews, 50-70% of merge operations produced degraded output, yet no errors appeared in logs.
What happened: The LLM model used for transcript merging hit its output token limit on long conversations (>15 messages). The model returned truncated JSON, the parser failed silently, and the system fell back to a lower-quality text-only merge. No error was thrown because the fallback was "working correctly."
How to trace it:
| Step | Span to Check | What to Look For |
|---|---|---|
| 1 | llm.inference (transcript merge) | llm.tokens.output near model limit, llm.finish_reason=length |
| 2 | transcript.json_parse | Status=failed, no exception thrown (silent failure) |
| 3 | transcript.merge.fallback | Fallback triggered, quality=degraded |
| 4 | Correlation: transcript length vs failure rate | In our data, long transcripts failed at 5-7x the rate of shorter transcripts |
Fix: Add a JSON repair step before falling back to text-only merge. Detect finish_reason=length and retry with a larger context window or split the transcript into chunks.
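The detection half of that fix is straightforward: surface why parsing failed instead of silently degrading. A sketch (the status strings are ours, for illustration):

```python
import json

def handle_merge_output(raw: str, finish_reason: str):
    """Classify LLM merge output. On parse failure, distinguish truncation
    (finish_reason=length, the Cascade 4 case) from genuinely malformed
    output, so the caller retries with chunking rather than silently
    falling back to a text-only merge."""
    try:
        return json.loads(raw), "ok"
    except json.JSONDecodeError:
        if finish_reason == "length":
            # output token limit hit: retry with a chunked transcript
            return None, "retry_chunked"
        return None, "fallback_text_merge"
```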
Without unified tracing: The LLM provider dashboard shows calls completing. The transcript service shows merges running. The quality degradation only becomes visible when you correlate LLM output token counts with downstream merge quality metrics. That correlation requires a unified trace.
Unified Platform vs. Stitching Together Langfuse + Arize + Datadog
Remember the Dashboard Blindspot from earlier, where each tool sees its own slice but misses the cascade? The cost of working around it has a name too. We call the debugging overhead of correlating information across separate tools the Context Switching Tax. In our analysis of production incidents, engineers spend 40-60% of debugging time switching between tools, matching timestamps, and mentally reconstructing the chain of events.
Here's what each tool sees versus what a unified trace shows. "Unified Trace" here means a single OTel-backed tracing backend (e.g., SigNoz, Jaeger, Tempo) that also correlates voice artifacts (audio, transcripts, test results) via shared trace IDs. OpenTelemetry gives you the trace; correlating recordings and transcripts alongside spans requires attaching IDs and a UI that surfaces those artifacts together.
| Debugging Dimension | Langfuse | Arize | Datadog | Unified Trace |
|---|---|---|---|---|
| LLM prompt/completion | Full visibility | Full visibility | Limited | Full visibility |
| LLM token counts and latency | Yes | Yes | No | Yes |
| STT provider selection | Partial | No | No | Yes |
| STT provider fallback chain | No | No | No | Yes |
| STT confidence per utterance | No | No | No | Yes |
| TTS synthesis latency | Partial | No | Partial | Yes |
| Tool call webhook timing | Partial | No | Yes | Yes |
| Database read/write routing | No | No | Yes | Yes |
| Replica lag correlation | No | No | Partial | Yes |
| Cross-service cascade | No | No | Partial | Yes |
| Test assertion correlation | No | No | No | Yes |
| Audio playback alongside trace | No | No | No | Yes |
| Call recording + trace linkage | No | No | No | Yes |
| Language routing decisions | No | No | No | Yes |
| Provider health across calls | No | Partial | Yes | Yes |
The pattern is clear: Langfuse and Arize excel at LLM-layer visibility, and Langfuse has added partial STT/TTS tracing via Pipecat and LiveKit integrations. Datadog excels at infrastructure monitoring. But voice agent debugging requires correlating events across all three layers simultaneously, including provider fallback chains, language routing decisions, and database consistency issues that none of these tools trace natively. The question is whether you want one trace or three dashboards.
I should be clear about when fragmented tooling works fine. If your voice agent is a thin wrapper around a single LLM call with no custom STT/TTS, Langfuse alone gives you most of what you need. If you're primarily debugging infrastructure issues (CPU, memory, network), Datadog is the right tool. The unified approach becomes essential when you're operating a multi-service voice pipeline where failures cascade across boundaries.
Flaws But Not Dealbreakers
Sampling decisions are brutal. You can't trace 100% of production calls at scale. (For the broader observability framework that tracing fits into, see Hamming's 4-Layer Voice Observability Framework.) At 10,000 calls/day, even with efficient BatchSpanProcessor export, full tracing generates substantial storage costs. But tail sampling (keep only slow or errored traces) has a critical weakness: the bugs you most need to trace are often the ones that don't trigger errors. The transcript merge that silently degrades quality? No error. The STT provider that returns low-confidence output? No error. We call this the "Sampling Blind Spot." We haven't fully resolved it. Different teams land differently based on call volume and debugging needs.
Third-party providers don't propagate trace context. When you call Deepgram's API or ElevenLabs' TTS endpoint, they don't return a traceparent header. Your trace has a gap at the provider boundary. You can bracket the call with a client-side span (start span before request, end after response), but you lose visibility into what happened inside the provider. This is an industry-wide limitation, not specific to any one tool.
Expect 1-3% latency overhead. Span creation and export aren't free. We measure 1-3% latency increase from OTel instrumentation across our deployments. For most voice agents where turn latency is 800-2,000ms, that's 8-60ms of added latency. Worth it for the debugging capability, but measure it in your system. Use BatchSpanProcessor (not SimpleSpanProcessor) to amortize export costs.
The OTel spec doesn't have voice agent semantic conventions yet. The OpenTelemetry GenAI semantic conventions cover LLM calls but not STT, TTS, or audio processing. That means the attribute names in this guide (stt.provider, tts.synthesis_ms) are Hamming's conventions, not an industry standard. We'd love to see the OTel community adopt voice-specific conventions. Until then, pick a naming scheme and be consistent.
OTel Instrumentation Readiness Checklist
Use this before your next deployment:
Span Hierarchy
- Root call.lifecycle span created at call initiation
- Child spans for STT, LLM, TTS, tool execution, webhooks
- Provider-level spans nested under STT/TTS (for fallback tracing)
- Post-call evaluation spans linked to same root trace
- Every span has call.id attribute for cross-referencing
- Framework-native spans enabled (LiveKit: set_tracer_provider(), Pipecat: setup_tracing() + enable_tracing=True)
Voice-Specific Attributes
- STT: provider, confidence, latency_ms
- LLM: model, ttft_ms, tokens.input, tokens.output, finish_reason
- TTS: provider, synthesis_ms, character_count
- Tools: name, execution_ms, webhook_status
- Call: duration_ms, status, error_type (if applicable)
Context Propagation
- W3C traceparent header on all HTTP requests between services
- Temporal OTel interceptor configured for workflow/activity context
- Python voice worker extracts context from job metadata
- Webhook callbacks carry traceparent back to web server
Export & Storage
- BatchSpanProcessor configured (not SimpleSpanProcessor)
- Sampling strategy defined (head sampling, tail sampling, or adaptive)
- Trace retention policy matches your debugging window (7-30 days)
- Alerts configured on trace completeness (>95% target)
Verification
- End-to-end trace visible from web server → voice worker → webhook callback
- Can filter traces by call.id, workspace.id, stt.provider
- Can identify provider fallback chains in the span tree
- Can correlate test assertion results with call traces
Voice agent observability isn't about more dashboards. It's about one trace that follows a call through every service, every language boundary, and every decision point. The teams that debug fastest aren't the ones with the best engineers. They're the ones with the best traces.

