OpenTelemetry for AI Voice Agents: How to Trace Calls End-to-End

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

February 25, 2026 · Updated February 25, 2026 · 21 min read

Last month, I watched a customer debug a voice agent failure with Langfuse open in one tab and Datadog in another. The LLM traces looked clean. The infrastructure metrics showed no anomalies. Yet more than half of their transcript merges were failing silently. The fix? OpenTelemetry, instrumented for voice agents, not just LLM calls.

The problem wasn't in the LLM. It wasn't in the infrastructure. It was in the space between them: a language routing decision in one service triggered an STT provider switch in another, which produced empty transcripts that a third service evaluated as "no data available." Three services. Two programming languages. One trace ID that connected them all, once we instrumented it properly.

I used to think you could bolt general LLM observability tools like Langfuse or Arize onto voice agents and call it done. After analyzing 4M+ voice agent calls at Hamming, I changed my mind. Voice agents need their own span hierarchy, their own attributes, and their own debugging playbooks. This is the implementation guide I wish existed when we started.

TL;DR: Instrument voice agents using Hamming's 3-Layer OTel Instrumentation Model:

  • Layer 1: Span Hierarchy — Design parent-child spans for call.lifecycle → stt → llm → tools → tts → webhooks (not flat LLM-only traces)
  • Layer 2: Voice-Specific Attributes — Attach stt.provider, llm.ttft_ms, tts.synthesis_ms, tool.execution_ms to every span (generic HTTP attributes won't help you debug voice failures)
  • Layer 3: Unified Export — Route all traces to a single backend that correlates with test results and production calls (the "Context Switching Tax" of 3-4 separate dashboards kills debugging speed)

One trace ID across TypeScript, Temporal, and Python beats five dashboards.

Methodology Note: The span hierarchies, attributes, and debugging playbooks in this guide are derived from Hamming's analysis of 4M+ production voice agent calls across 10K+ voice agents (2024-2026).

Error cascade patterns are drawn from real production incidents across 10K+ voice agent deployments. All examples are anonymized. Last Updated: February 2026.

Who This Is For (And Who Should Skip It)

If you're running fewer than 100 voice agent calls per week, structured logging is probably enough. Bookmark this for when you scale.

If you're on a managed platform like Retell or Vapi with built-in dashboards, check whether their traces span all layers you care about. Most don't include STT provider details or tool execution timing. If those gaps don't hurt you yet, this guide can wait.

This guide is for teams that:

  • Run 1,000+ voice agent calls per week across custom infrastructure
  • Use separate services for ASR, LLM, TTS, and tool execution
  • Have hit a debugging wall where logs alone can't explain why calls fail
  • Want to instrument with OpenTelemetry rather than lock into a proprietary tracing vendor

If you're already spending more than 30 minutes per incident correlating timestamps across Langfuse, Datadog, and your application logs, the approach below will save you hours.

Why Standard LLM Observability Tools Miss Voice Agent Failures

Here's the trap: you set up Langfuse or Arize, trace your LLM calls, and assume you have observability. The LLM dashboard looks green. Prompt latency is normal. Token counts are within range.

Meanwhile, your users hear silence.

We call this the Dashboard Blindspot: each observability tool sees its own slice of the voice pipeline, but no single view shows the cascade.

Here's what happens in a real voice call. Audio arrives from a phone network, gets routed to an ASR provider, flows through language detection logic, hits the LLM for inference, triggers tool calls via webhooks, generates TTS audio, and streams back to the caller. That chain crosses 3+ service boundaries and typically spans 2 programming languages (TypeScript on the web/queue side, Python on the voice worker side).

Langfuse traces the LLM hop. Datadog monitors the infrastructure. Your application logs capture individual events. But none of them show you that a language routing decision in Service A triggered an STT provider switch in Service B that produced empty transcripts consumed by Service C. In our analysis of production incidents, the majority of voice agent failures that present as "LLM errors" actually originate in a different pipeline stage. Across the four cascade patterns we've documented in detail, every one involved the LLM performing fine while the input it received was garbage.

Standard LLM observability tools weren't built for this. Langfuse has added partial voice tracing through Pipecat and LiveKit integrations, which is a step forward, but these capture individual STT/TTS spans without the cross-service cascade visibility (provider fallback chains, language routing decisions, database consistency) that voice debugging requires. That's not a knock on Langfuse or Arize. They're excellent at what they do. But bolting them onto a voice agent and expecting full pipeline visibility is like monitoring your database and assuming you're observing your web app. (For more on structuring call logs and analytics, see our taxonomy guide.)

The Voice Agent Span Hierarchy: What to Trace

A voice agent span hierarchy mirrors the audio pipeline, not the HTTP request path. The root span is call.lifecycle, with child spans for each processing stage:

call.lifecycle (root, 45,000ms)
├── stt.transcription (1,200ms)
│   ├── stt.provider.deepgram (timeout, 800ms)
│   └── stt.provider.fallback.azure (400ms)
├── llm.inference (2,100ms)
│   ├── llm.tool_call.check_inventory (350ms)
│   └── llm.tool_call.create_order (890ms)
├── tts.synthesis (380ms)
├── webhook.dispatch (120ms)
│   └── webhook.callback.result_update (95ms)
└── evaluation.assertion_check (450ms)

Notice the nesting. When Deepgram times out, the system falls back to Azure. That fallback is a child span of stt.transcription, so you see it in context. When the LLM calls two tools, each tool call is a separate child span with its own timing. When the assertion evaluation runs after the call, it's traceable back to the same root.

This hierarchy is different from what Langfuse gives you. Langfuse would show the llm.inference span and its tool calls, but not the STT provider fallback that happened 800ms earlier or the assertion evaluation that failed 30 seconds later because the transcript was incomplete. (For a deep-dive on STT failure patterns that trigger these fallbacks, see 7 ASR failure modes in production.)

"The hardest part isn't adding spans. It's resisting the urge to trace everything and instead tracing the decision points that actually break," says Sumanyu Sharma, CEO of Hamming. "We wasted two weeks tracing HTTP headers before we realized provider fallback decisions were the thing we needed visibility into."

Design principle: Every decision point in the pipeline gets its own span. Provider selection, fallback triggers, webhook dispatches, and post-call evaluations are all traceable. If a component can fail independently, it needs a span.

| Pipeline Stage | Span Name | Parent | What It Captures |
| --- | --- | --- | --- |
| Call start/end | call.lifecycle | Root | Total duration, final status, error type |
| Speech-to-text | stt.transcription | call.lifecycle | Provider used, confidence, fallback triggered |
| STT provider attempt | stt.provider.{name} | stt.transcription | Per-provider latency, success/failure, timeout |
| LLM inference | llm.inference | call.lifecycle | Model, TTFT, tokens, finish reason |
| Tool execution | llm.tool_call.{name} | llm.inference | Webhook URL, status code, response time |
| Text-to-speech | tts.synthesis | call.lifecycle | Provider, synthesis latency, character count |
| Webhook dispatch | webhook.dispatch | call.lifecycle | Target URL, delivery status, retry count |
| Post-call evaluation | evaluation.assertion_check | call.lifecycle | Assertion type, result, score |
| STT provider routing | stt.provider_selection | stt.transcription | Routing decision, language array, selected provider order |
| Transcript finalization | transcript.finalization | call.lifecycle | Finalization status, canonical transcript written |
| Metric score write | metric.score.write | call.lifecycle | Metric type, write target (primary/replica), status |
| Transcript JSON parse | transcript.json_parse | call.lifecycle | Parse status, fallback triggered |
| Transcript merge fallback | transcript.merge.fallback | call.lifecycle | Fallback reason, quality level |

Voice-Specific Span Attributes That Actually Help Debugging

Generic HTTP span attributes (http.method, http.status_code, http.url) tell you a request happened. They don't tell you why your voice agent failed. These 12 attributes are what we recommend based on debugging thousands of production incidents:

| Attribute | Type | Example | When You'll Need This |
| --- | --- | --- | --- |
| stt.provider | string | "deepgram" | Debugging empty transcripts: which provider handled this call? |
| stt.confidence | float | 0.87 | Filtering calls where ASR quality degraded below threshold |
| stt.latency_ms | int | 412 | Identifying provider latency spikes that cascade to user-perceived delay |
| llm.model | string | "gpt-4o" | Correlating failures with specific model versions or deployments |
| llm.ttft_ms | int | 340 | Time-to-first-token: the latency users actually feel |
| llm.tokens.input | int | 487 | Detecting prompt bloat that slows inference |
| llm.tokens.output | int | 203 | Catching truncated responses from output token limits |
| tts.provider | string | "elevenlabs" | Tracing synthesis failures or quality issues to specific providers |
| tts.synthesis_ms | int | 290 | Identifying TTS bottlenecks in the latency waterfall |
| tool.name | string | "check_inventory" | Filtering traces by which tool calls were invoked |
| tool.execution_ms | int | 350 | Finding slow webhook endpoints that delay responses |
| call.duration_ms | int | 45000 | Correlating call length with error rates and user satisfaction |

The pattern: every attribute should answer a debugging question you'd actually ask. "Which STT provider handled this call?" "Was the LLM response truncated?" "Which tool call was slow?" If an attribute doesn't map to a question you'd ask during an incident, don't add it. Trace storage isn't free. (For the production KPIs these attributes power, see our voice agent monitoring KPIs guide.)

We found that adding stt.confidence to every transcription span was the single highest-ROI attribute. It lets you query "show me all calls where ASR confidence dropped below 0.7" and immediately see which calls had degraded input before the LLM ever ran.
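As an illustration of the kind of query this enables, here's a sketch against a simplified, hypothetical span-record shape (dicts with a name and attributes, as you might pull from your tracing backend's API):

```python
def low_confidence_calls(spans, threshold=0.7):
    """Return call IDs whose stt.transcription span logged confidence below threshold."""
    return [
        s["attributes"]["call.id"]
        for s in spans
        if s["name"] == "stt.transcription"
        # Treat a missing confidence value as fine (1.0) for this sketch.
        and s["attributes"].get("stt.confidence", 1.0) < threshold
    ]
```

In practice you'd run the equivalent filter as a trace query in your backend, but the shape of the question is the same: degraded input, found before you ever look at the LLM.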

How to Propagate Context Across Language Boundaries

Voice agent calls typically cross multiple programming-language boundaries in a single request: TypeScript (web server and task queue), Go (Temporal workflow orchestration), and Python (voice worker). Web APIs rarely do this, which is why standard tracing guides don't cover it.

W3C Trace Context (traceparent header) is the standard that works across all of them. The header format is straightforward:

traceparent: 00-{32-char-trace-id}-{16-char-span-id}-01
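A minimal Python sketch of building and validating that header shape (helper names are hypothetical; real systems should use the OTel propagator API rather than hand-rolling this):

```python
import re
import secrets

# version 00, 32-hex-char trace ID, 16-hex-char span ID, 2-hex-char flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent header with random trace and span IDs."""
    trace_id = secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)    # 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str):
    """Return (trace_id, span_id, sampled) or None if malformed."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None
    trace_id, span_id, flags = m.groups()
    return trace_id, span_id, flags == "01"
```

The last field is the trace flags byte: 01 means the trace was sampled, so downstream services know to record their spans too.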

Here's how trace context flows through a voice agent system:

1. TypeScript (Web Server) — Create the root span:

import { withOpenTelemetry, getWebTracer } from '@hamming/tracing/instrumentation';

const result = await withOpenTelemetry()
  .setSpanName('call.lifecycle')
  .setTracer(getWebTracer())
  .setSpanAttributes({
    'call.id': callId,
    'workspace.id': workspaceId,
    'agent.id': agentId,
  })
  .setFunction(async (span) => {
    // Temporal client auto-injects traceparent via OTel interceptor
    await temporalClient.workflow.start(voiceCallWorkflow, {
      args: [callConfig],
      taskQueue: 'voice-calls',
    });
  })
  .execute();

2. Temporal (Workflow Orchestration) — Context propagates automatically:

Temporal's OpenTelemetry interceptors extract the traceparent from workflow metadata and inject it into activity calls. No manual propagation code needed. This is the "Go boundary" that most teams don't realize exists.

3. Python (Voice Worker) — Extract and continue the trace:

from opentelemetry import trace
from opentelemetry.propagate import extract

async def handle_voice_job(job):
    # Build a carrier dict from job metadata; start a fresh trace if header is missing
    traceparent = getattr(getattr(job, "metadata", None), "headers", {}).get("traceparent")
    carrier: dict[str, str] = {"traceparent": traceparent} if traceparent else {}
    context = extract(carrier)

    tracer = trace.get_tracer("voice-worker")
    with tracer.start_as_current_span(
        "stt.transcription",
        context=context,
        attributes={
            "stt.provider": "deepgram",
            "stt.confidence": 0.92,
            "stt.latency_ms": 412,
        },
    ):
        transcript = await stt_provider.transcribe(audio)
        return transcript

4. Webhook Callback — Context returns to TypeScript:

When the voice worker sends webhook events back to the web server, the traceparent header is included automatically by OpenTelemetry's HTTP instrumentation. The web server continues the same trace, meaning post-call evaluation spans are children of the original call.lifecycle root.

The result: one trace ID follows a call from the moment it's initiated in TypeScript, through Temporal's Go-based orchestration, into Python's voice processing, and back to TypeScript for evaluation. Any span in that chain is queryable by the original trace ID.

Framework-Specific OTel Setup

If you're building on LiveKit Agents or Pipecat, both frameworks have native OpenTelemetry support that gives you a head start on Layer 1 (span hierarchy) and Layer 2 (attributes).

LiveKit Agents exports the same OTel spans that power LiveKit Cloud's Agent Insights dashboard. Call set_tracer_provider() from livekit.agents.telemetry to route those spans to your own backend (Jaeger, SigNoz, etc.) instead of — or in addition to — LiveKit Cloud. LiveKit auto-instruments five pipeline stages: VAD, STT, end-of-utterance detection, LLM inference, and TTS. Each span carries gen_ai.* attributes (model, token counts, TTFB) aligned with the emerging OpenTelemetry GenAI semantic conventions. See the LiveKit observability data docs for the full set_tracer_provider() API and the Agent Insights guide for the session timeline view.

Pipecat instruments at three levels: a root conversation span, child turn spans (one per user-agent exchange), and grandchild service spans for each STT, LLM, and TTS call. Enable it with setup_tracing() and set enable_tracing=True plus enable_turn_tracking=True on your PipelineTask. Pipecat's service spans include provider-specific attributes like stt.audio_duration, llm.tokens.prompt, and tts.characters. See the Pipecat OTel docs for setup and span attribute reference, and the metrics guide for TTFB and processing time metrics.

Both frameworks handle Layer 1 and Layer 2 within a single service. What they don't handle is Layer 3: propagating trace context across services (your orchestrator, your webhook handlers, your evaluation pipeline). That's where the W3C traceparent propagation described above becomes essential.

For Hamming-specific integration patterns, see our guides on monitoring Pipecat agents in production and testing and monitoring LiveKit voice agents.

Debugging Playbook: 4 Error Cascades and How to Trace Them

This is the section I could write 5,000 words on. These four patterns account for the majority of "I can't figure out why my voice agent is broken" incidents we've seen across 10K+ deployments. Each one is invisible without cross-service traces. (If you're looking for a structured incident response workflow, see our voice agent SEV playbook and postmortem template.)

Cascade 1: Silent Transcript Merging Failures

Symptom: Test assertions return "SKIPPED" with "transcript is empty," but you can see the full conversation in the UI playback.

What happened: The voice worker streamed transcript data in real-time (visible in the UI), but the canonical transcript used by the assertion evaluator was never finalized. A protocol configuration caused the finalization step to be skipped, so streaming data existed but the evaluation system read an empty canonical source.

How to trace it:

| Step | Span to Check | What to Look For |
| --- | --- | --- |
| 1 | call.lifecycle | Verify call completed successfully |
| 2 | stt.transcription | Check streaming transcript was received (messages > 0) |
| 3 | transcript.finalization | Missing or status=skipped — this is the break |
| 4 | evaluation.assertion_check | Reads canonical transcript, finds empty, marks SKIPPED |

Fix: Add a finalization guard that checks whether canonical transcript exists before assertion evaluation. If streaming data exists but canonical doesn't, trigger finalization before proceeding.
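A minimal sketch of such a guard, assuming a simplified, hypothetical call-record shape (streamed turns in one field, the canonical transcript in another):

```python
def ensure_canonical_transcript(call: dict):
    """Guard: if streaming data exists but no canonical transcript does, finalize first."""
    if call.get("canonical_transcript"):
        return call["canonical_transcript"]
    streaming = call.get("streaming_messages", [])
    if streaming:
        # Finalization was skipped; rebuild the canonical transcript from streamed turns.
        call["canonical_transcript"] = [
            {"role": m["role"], "text": m["text"]} for m in streaming
        ]
        return call["canonical_transcript"]
    # Genuinely no data: let the evaluator mark the assertion SKIPPED.
    raise ValueError("no transcript data: mark assertion SKIPPED")
```

Run this guard (and emit it as its own span) before assertion evaluation, so "streaming exists but canonical is empty" becomes a visible, fixable state instead of a silent skip.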

Without unified tracing: You'd see passing UI playback in one dashboard, failing assertions in another, and have no way to connect the two. The streaming vs. canonical distinction is invisible to any single tool.

Cascade 2: Language Routing Causes Empty STT Output

Symptom: Non-English test runs (Japanese, Spanish, Hindi) produce empty transcripts. English calls work fine.

What happened: A language composition change combined persona language (ja-JP) with test case language (en-US) into a multilingual array [ja-JP, en-US]. The STT routing logic saw languages.length > 1 and selected a multilingual provider order. That specific provider returned empty transcripts for these language combinations. All downstream assertions failed with "transcript is empty."

How to trace it:

| Step | Span to Check | What to Look For |
| --- | --- | --- |
| 1 | call.lifecycle | Check call.languages attribute — shows [ja-JP, en-US] |
| 2 | stt.provider_selection | Routing decision — multilingual path selected |
| 3 | stt.provider.deepgram | Empty or near-empty output (stt.confidence = 0 or missing) |
| 4 | evaluation.assertion_check | "Transcript is empty, cannot evaluate" |

Fix: Validate the STT provider's language support before routing. If a provider doesn't handle a specific combination, fall back to a provider that does or split into separate single-language transcriptions. (See our multilingual voice agent testing guide for testing patterns that catch this before production.)
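A sketch of that pre-routing validation, with a hypothetical provider support table (the real matrix depends on your providers and their documented language coverage):

```python
# Hypothetical support matrix; populate from your providers' documented coverage.
SUPPORTED_LANGUAGES = {
    "deepgram": {"en-US", "es-ES", "ja-JP"},
    "azure": {"en-US", "es-ES", "ja-JP", "hi-IN"},
}

def pick_stt_provider(languages, preference=("deepgram", "azure")):
    """Route to the first preferred provider that supports every requested language."""
    needed = set(languages)
    for name in preference:
        if needed <= SUPPORTED_LANGUAGES[name]:
            return name
    # No single provider covers the combination: caller should split into
    # separate single-language transcriptions instead.
    return None
```

The key property is that an unsupported combination becomes an explicit routing decision (its own span, per the hierarchy above) rather than an empty transcript discovered three services later.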

Without unified tracing: The language routing decision lives in Service A, the STT provider selection in Service B, and the assertion failure in Service C. You'd need to correlate timestamps across three log streams to connect "language composition change" to "empty transcript" to "assertion failure."

Cascade 3: Replica Lag Causes Phantom "Not Found" Errors

Symptom: Custom metric assertions show "metric score not found" even though the metric was computed successfully. Scores are visible in the database when you check manually.

What happened: Metric scores were written to the primary database at timestamp T. The assertion evaluator read from a read replica. During a recovery-conflict storm, replica lag spiked from seconds to minutes. The evaluator's read hit the stale replica, got its query canceled by PostgreSQL's recovery process, and marked the assertion as SKIPPED.

How to trace it:

| Step | Span to Check | What to Look For |
| --- | --- | --- |
| 1 | metric.score.write | Timestamp T, status=success, db.target=primary |
| 2 | evaluation.assertion_check | Timestamp T+5ms, db.target=replica |
| 3 | Compare timestamps | Write at T, read at T+5ms on a replica lagging seconds to minutes (recovery-conflict storm) |
| 4 | db.query spans | Check db.replica_lag_ms or PostgreSQL "canceling statement due to conflict with recovery" errors |

Fix: For consistency-critical reads (metrics needed for assertion evaluation), route to the primary database. Or add a retry with exponential backoff that waits for replica convergence.
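A sketch of the retry-then-primary pattern (function names are hypothetical; read_replica and read_primary stand in for your data-access layer):

```python
import time

def read_metric_score(call_id, read_replica, read_primary,
                      retries: int = 3, base_delay: float = 0.05):
    """Try the replica with exponential backoff, then fall through to the primary.

    Consistency-critical reads (metrics needed for assertion evaluation)
    should never be marked SKIPPED just because a replica is lagging.
    """
    for attempt in range(retries):
        score = read_replica(call_id)
        if score is not None:
            return score
        time.sleep(base_delay * (2 ** attempt))  # wait for replica convergence
    return read_primary(call_id)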

Without unified tracing: Langfuse would show the LLM call succeeded. Datadog would show the database is healthy. Neither would reveal that the read replica was in a recovery-conflict storm, canceling queries before they could return. In production, this pattern can generate thousands of errors across thousands of traces within minutes before anyone identifies the root cause.

Cascade 4: LLM Output Limits Cause Silent Quality Degradation

Symptom: Transcript merging quality drops sharply. In our incident reviews, 50-70% of merge operations produced degraded output, yet no errors appeared in logs.

What happened: The LLM model used for transcript merging hit its output token limit on long conversations (>15 messages). The model returned truncated JSON, the parser failed silently, and the system fell back to a lower-quality text-only merge. No error was thrown because the fallback was "working correctly."

How to trace it:

| Step | Span to Check | What to Look For |
| --- | --- | --- |
| 1 | llm.inference (transcript merge) | llm.tokens.output near model limit, llm.finish_reason=length |
| 2 | transcript.json_parse | Status=failed, no exception thrown (silent failure) |
| 3 | transcript.merge.fallback | Fallback triggered, quality=degraded |
| 4 | Correlation: transcript length vs failure rate | In our data, long transcripts failed at 5-7x the rate of shorter transcripts |

Fix: Add a JSON repair step before falling back to text-only merge. Detect finish_reason=length and retry with a larger context window or split the transcript into chunks.
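A sketch of the detection step (the response shape is a simplified stand-in for your LLM client's output, and retry_with_chunks is a hypothetical callback):

```python
import json

def parse_merge_output(response: dict, retry_with_chunks):
    """Detect truncation explicitly instead of silently falling back to text-only merge."""
    if response.get("finish_reason") == "length":
        # Output token limit hit: re-run the merge on smaller transcript chunks
        # (or retry with a model that allows more output tokens).
        return retry_with_chunks()
    try:
        return json.loads(response["text"])
    except json.JSONDecodeError:
        # Malformed for another reason: still recover, but this path should
        # emit a transcript.merge.fallback span so it's visible in traces.
        return retry_with_chunks()
```

The difference from the incident described above is that both failure paths are now explicit branches you can count and alert on, not silent degradation.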

Without unified tracing: The LLM provider dashboard shows calls completing. The transcript service shows merges running. The quality degradation only becomes visible when you correlate LLM output token counts with downstream merge quality metrics. That correlation requires a unified trace.

Unified Platform vs. Stitching Together Langfuse + Arize + Datadog

Remember the Dashboard Blindspot from earlier, where each tool sees its own slice but misses the cascade? The cost of working around it has a name too. We call the debugging overhead of correlating information across separate tools the Context Switching Tax. In our analysis of production incidents, engineers spend 40-60% of debugging time switching between tools, matching timestamps, and mentally reconstructing the chain of events.

Here's what each tool sees versus what a unified trace shows. "Unified Trace" here means a single OTel-backed tracing backend (e.g., SigNoz, Jaeger, Tempo) that also correlates voice artifacts (audio, transcripts, test results) via shared trace IDs. OpenTelemetry gives you the trace; correlating recordings and transcripts alongside spans requires attaching IDs and a UI that surfaces those artifacts together.

| Debugging Dimension | Langfuse | Arize | Datadog | Unified Trace |
| --- | --- | --- | --- | --- |
| LLM prompt/completion | Full visibility | Full visibility | Limited | Full visibility |
| LLM token counts and latency | Yes | Yes | No | Yes |
| STT provider selection | Partial | No | No | Yes |
| STT provider fallback chain | No | No | No | Yes |
| STT confidence per utterance | No | No | No | Yes |
| TTS synthesis latency | Partial | No | Partial | Yes |
| Tool call webhook timing | Partial | No | Yes | Yes |
| Database read/write routing | No | No | Yes | Yes |
| Replica lag correlation | No | No | Partial | Yes |
| Cross-service cascade | No | No | Partial | Yes |
| Test assertion correlation | No | No | No | Yes |
| Audio playback alongside trace | No | No | No | Yes |
| Call recording + trace linkage | No | No | No | Yes |
| Language routing decisions | No | No | No | Yes |
| Provider health across calls | No | Partial | Yes | Yes |

The pattern is clear: Langfuse and Arize excel at LLM-layer visibility, and Langfuse has added partial STT/TTS tracing via Pipecat and LiveKit integrations. Datadog excels at infrastructure monitoring. But voice agent debugging requires correlating events across all three layers simultaneously, including provider fallback chains, language routing decisions, and database consistency issues that none of these tools trace natively. The question is whether you want one trace or three dashboards.

I should be clear about when fragmented tooling works fine. If your voice agent is a thin wrapper around a single LLM call with no custom STT/TTS, Langfuse alone gives you most of what you need. If you're primarily debugging infrastructure issues (CPU, memory, network), Datadog is the right tool. The unified approach becomes essential when you're operating a multi-service voice pipeline where failures cascade across boundaries.

Flaws But Not Dealbreakers

Sampling decisions are brutal. You can't trace 100% of production calls at scale. (For the broader observability framework that tracing fits into, see Hamming's 4-Layer Voice Observability Framework.) At 10,000 calls/day, even with efficient BatchSpanProcessor export, full tracing generates substantial storage costs. But tail sampling (keep only slow or errored traces) has a critical weakness: the bugs you most need to trace are often the ones that don't trigger errors. The transcript merge that silently degrades quality? No error. The STT provider that returns low-confidence output? No error. We call this the "Sampling Blind Spot." We haven't fully resolved it. Different teams land differently based on call volume and debugging needs.

Third-party providers don't propagate trace context. When you call Deepgram's API or ElevenLabs' TTS endpoint, they don't return a traceparent header. Your trace has a gap at the provider boundary. You can bracket the call with a client-side span (start span before request, end after response), but you lose visibility into what happened inside the provider. This is an industry-wide limitation, not specific to any one tool.

Expect 1-3% latency overhead. Span creation and export aren't free. We measure 1-3% latency increase from OTel instrumentation across our deployments. For most voice agents where turn latency is 800-2,000ms, that's 8-60ms of added latency. Worth it for the debugging capability, but measure it in your system. Use BatchSpanProcessor (not SimpleSpanProcessor) to amortize export costs.

The OTel spec doesn't have voice agent semantic conventions yet. The OpenTelemetry GenAI semantic conventions cover LLM calls but not STT, TTS, or audio processing. That means the attribute names in this guide (stt.provider, tts.synthesis_ms) are Hamming's conventions, not an industry standard. We'd love to see the OTel community adopt voice-specific conventions. Until then, pick a naming scheme and be consistent.

OTel Instrumentation Readiness Checklist

Use this before your next deployment:

Span Hierarchy

  • Root call.lifecycle span created at call initiation
  • Child spans for STT, LLM, TTS, tool execution, webhooks
  • Provider-level spans nested under STT/TTS (for fallback tracing)
  • Post-call evaluation spans linked to same root trace
  • Every span has call.id attribute for cross-referencing
  • Framework-native spans enabled (LiveKit: set_tracer_provider(), Pipecat: setup_tracing() + enable_tracing=True)

Voice-Specific Attributes

  • STT: provider, confidence, latency_ms
  • LLM: model, ttft_ms, tokens.input, tokens.output, finish_reason
  • TTS: provider, synthesis_ms, character_count
  • Tools: name, execution_ms, webhook_status
  • Call: duration_ms, status, error_type (if applicable)

Context Propagation

  • W3C traceparent header on all HTTP requests between services
  • Temporal OTel interceptor configured for workflow/activity context
  • Python voice worker extracts context from job metadata
  • Webhook callbacks carry traceparent back to web server

Export & Storage

  • BatchSpanProcessor configured (not SimpleSpanProcessor)
  • Sampling strategy defined (head sampling, tail sampling, or adaptive)
  • Trace retention policy matches your debugging window (7-30 days)
  • Alerts configured on trace completeness (>95% target)

Verification

  • End-to-end trace visible from web server → voice worker → webhook callback
  • Can filter traces by call.id, workspace.id, stt.provider
  • Can identify provider fallback chains in the span tree
  • Can correlate test assertion results with call traces

Voice agent observability isn't about more dashboards. It's about one trace that follows a call through every service, every language boundary, and every decision point. The teams that debug fastest aren't the ones with the best engineers. They're the ones with the best traces.

Get started with unified voice agent observability

Frequently Asked Questions

How does OpenTelemetry tracing work for a voice agent call?

OpenTelemetry provides W3C Trace Context propagation, span creation APIs, and multi-backend export that are essential for voice agents. According to Hamming's 3-Layer OTel Instrumentation Model, you create a root span at call initiation, child spans for each pipeline stage (STT, LLM, TTS, tool execution), and propagate the trace ID across service boundaries using the traceparent header. This gives you a single trace spanning 3+ services and 2+ programming languages.

Which span attributes should voice agents capture?

Voice agents need attributes beyond standard HTTP spans. According to Hamming's analysis of 4M+ calls, the 12 most useful attributes are: stt.provider, stt.confidence, stt.latency_ms, llm.model, llm.ttft_ms, llm.tokens.input, llm.tokens.output, tts.provider, tts.synthesis_ms, tool.name, tool.execution_ms, and call.duration_ms. Each attribute maps to a specific debugging question you'd ask during an incident.

How do you propagate trace context across TypeScript, Temporal, and Python?

Use W3C Trace Context (traceparent header) at every service boundary. In TypeScript, OpenTelemetry HTTP instrumentation automatically injects the header into outgoing requests. For Temporal workflow orchestration, use the OTel interceptor to embed context in workflow metadata. In Python, extract the context using TraceContextTextMapPropagator. According to Hamming's architecture, a single voice call crosses 3+ language boundaries, and W3C traceparent is the only standard that works across all of them.

Can Langfuse or Arize fully trace voice agents?

Langfuse and Arize excel at LLM-layer tracing and have added partial voice framework integrations (Pipecat, LiveKit), but they still miss cross-service cascade visibility. They trace prompt/completion cycles and some STT/TTS spans, but don't capture provider fallback chains, language routing decisions, or database read-after-write consistency issues. According to Hamming's debugging data from 4M+ calls, the majority of voice agent failures that present as LLM errors actually originate in a different pipeline stage. A unified platform that traces the entire call catches these cross-component cascades.

What is W3C Trace Context and why does it matter for voice agents?

W3C Trace Context is a standard HTTP header format (traceparent: 00-traceId-spanId-flags) that propagates distributed trace identity across services. For voice agents, it matters because a single call typically crosses 3+ service boundaries and 2+ programming languages. According to Hamming's 3-Layer OTel Instrumentation Model, traceparent is the glue that turns five separate log streams into one debuggable trace. Without it, each service creates disconnected traces.

How do you debug a cross-service error cascade?

Start with the symptom, then follow the trace backward through the span hierarchy. According to Hamming's debugging playbook from 10K+ deployments, most cascades follow a pattern: a decision in Service A triggers behavior in Service B that produces bad output consumed by Service C. With a unified trace, you see the full chain in one view instead of correlating timestamps across 3 dashboards. Hamming calls this overhead the Context Switching Tax.

How much latency overhead does OpenTelemetry instrumentation add?

Expect 1-3% latency increase from span creation and export. According to Hamming's benchmarks across 10K+ voice agents, for typical turn latency of 800-2,000ms, that translates to 8-60ms of added latency. Use BatchSpanProcessor instead of SimpleSpanProcessor to minimize per-span cost. At high volume, implement sampling, but be aware of the Sampling Blind Spot: the bugs you most need to trace are often the ones that don't trigger traditional error thresholds.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, grew AI-powered sales program to 100s of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”