Which WebRTC metrics matter most for voice agents?

Start with packet loss, jitter, round-trip time, audio levels, jitter-buffer delay, concealed samples, codec, and ICE connection state. According to Hamming's testing checklist, those 8 signals explain whether the call was degraded before ASR, LLM, or TTS quality should be blamed.

What packet loss threshold should block a voice agent release?

Hamming treats sustained packet loss above 3% as a release-blocking signal for voice agents and sustained loss above 1% as a warning that needs investigation. Short spikes can recover, but repeated packet loss during speech usually creates missing words, low ASR confidence, and awkward turn-taking.

How do you test WebRTC jitter for a voice agent?

Collect jitter from WebRTC stats during synthetic calls and production sessions, then segment it by region, device, codec, and network type. Hamming's checklist uses 30ms as a warning threshold and 50ms as a block threshold when the issue is sustained during active speech.

Should teams use MOS for voice agent audio quality?

MOS is useful as a quality summary, but Hamming recommends pairing it with packet loss, jitter, RTT, and actual call outcomes. ITU MOS guidance uses a 1-5 opinion-score scale, while production voice-agent monitoring also needs replayable evidence for what the caller and agent actually heard.

How often should WebRTC call quality tests run?

Run lightweight pre-call checks before live user sessions when possible, synthetic WebRTC checks on every deploy candidate, and production sampling continuously. Hamming's release gate treats text-only tests as necessary but incomplete because they cannot exercise real packet loss, jitter, audio levels, or ICE behavior.

How is WebRTC call quality different from voice agent latency?

Latency measures how long the agent takes to respond, while WebRTC call quality measures whether the audio path is stable enough for the conversation to work. A voice agent can have healthy P95 latency and still fail if packet loss, one-way audio, or jitter corrupts the user's speech.

What should a WebRTC call quality evidence record include?

Store the canonical call ID, trace ID, ICE state, codec, packet loss, jitter, RTT, audio levels, quality score, ASR confidence, transcript pointer, audio pointer, and task outcome. Hamming recommends keeping those 13 fields together so QA can distinguish media-path failures from agent-logic failures.

WebRTC Call Quality Testing for Voice Agents

Q: What is WebRTC call quality testing for voice agents?

WebRTC call quality testing measures whether the media path is clean enough for a voice agent to hear, respond, and recover naturally. Hamming recommends testing packet loss, jitter, RTT, audio levels, jitter-buffer behavior, codec, ICE state, and task outcome together because media quality problems often look like ASR or prompt failures.

The fastest way to waste a day debugging a voice agent is to start with the prompt when the audio path is broken. The ASR transcript is wrong, the user interrupts twice, and the agent sounds late. Then someone opens the WebRTC stats and finds sustained packet loss during the same turn everyone was arguing about.

That is why WebRTC call quality testing needs to be a release gate for voice agents, not a post-incident chore. Text-mode tests prove the agent can reason. WebRTC tests prove the agent can survive a real call: ICE negotiation, codec behavior, jitter, packet loss, audio levels, device quirks, and network changes.

If your team is still pre-production and running fewer than 50 test calls per week, keep this simple: open browser diagnostics, run 10 noisy calls, and fix obvious media failures. This guide is for teams that need repeatable checks before every deploy and production evidence when callers say the agent "couldn't hear me."

WebRTC call quality testing is the practice of measuring the media path before, during, and after a voice-agent session. A useful test records packet loss, jitter, round-trip time, audio levels, jitter-buffer behavior, codec, ICE state, and conversation outcome in one evidence record.

Quick filter: If your QA record cannot answer "was this an agent failure or a bad media path?" you do not have enough call-quality evidence.

TL;DR: Treat WebRTC call quality as part of voice-agent QA:

Collect media stats: packet loss, jitter, RTT, audio levels, jitter-buffer delay, concealed samples, codec, and ICE state.

Run three checks: pre-call network check, synthetic WebRTC call, and sampled production monitoring.

Gate releases: block sustained packet loss, high RTT, one-way audio, failed ICE, and unexplained quality-score drops.

Correlate outcomes: store call-quality stats with ASR confidence, transcript/audio pointers, latency, and task completion.

Methodology Note: This checklist is based on Hamming's analysis of production voice agent calls across 10K+ voice agents (2025-2026). Hamming's platform has 10M+ minutes protected. We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions. It also uses public WebRTC, LiveKit, Twilio, Daily, and ITU documentation to ground the media-quality metrics and cases.

Last Updated: May 2026

Related Guides:

Debug WebRTC Voice Agents - troubleshoot ICE, RTP, jitter, packet loss, and browser diagnostics
Testing LiveKit Voice Agents - connect WebRTC validation to LiveKit agent testing
Voice AI Latency - separate media delay from model and TTS latency
Voice Agent Monitoring KPIs - production thresholds for voice-agent health
Voice Agent Analytics Metrics - formulas for call-quality and outcome reporting
Voice Agent SLOs - turn call-quality failures into reliability targets
Voice Agent Observability Tracing - preserve trace IDs across media, ASR, LLM, and TTS

How to use this checklist:

Run the pre-call and synthetic checks before debugging prompts or ASR behavior.
Save media stats with the same call ID and trace ID used by ASR, LLM, TTS, and outcome events.
Block releases only on sustained or repeated quality failures, not one isolated stat spike.
Segment production issues by region, device, network type, provider edge, and media direction.
Use replayable evidence to decide whether the fix belongs in media routing, ASR, latency, or agent logic.

What Is WebRTC Call Quality Testing for Voice Agents?

WebRTC call quality testing checks whether the caller's audio can reach the agent, and the agent's audio can reach the caller, with enough stability for the conversation to work.

That sounds obvious, but it is easy to skip because most voice-agent test suites start at the transcript. By then, the transcript is already downstream of microphone permissions, ICE negotiation, codec selection, packet delivery, jitter buffering, and browser/device behavior.

Working rule: Do not debug ASR accuracy until you know the audio path was healthy enough for ASR to be judged fairly.

There are three test layers:

Test Layer	When It Runs	What It Catches	What It Misses
Pre-call quality check	Before a user joins or before a synthetic test starts	Bad network, high RTT, obvious packet loss, device permission issues	Agent logic and real conversation timing
Synthetic WebRTC call	CI, deploy candidate, scheduled probe	ICE failures, jitter under load, one-way audio, codec issues, turn latency	Every real device and carrier path
Production sampling	Live calls, with redaction and retention controls	Region/device regressions, intermittent network patterns, issue replay	Failures you did not instrument

The MDN RTCPeerConnection.getStats() docs describe the browser API that returns connection statistics. The W3C WebRTC stats specification defines the identifiers behind those reports, including inbound RTP stats such as packets received, packets lost, and jitter.

For voice agents, those stats are only half the record. The other half is whether the caller completed the task, whether ASR confidence dropped, whether turn latency spiked, and whether the agent recovered.

Which WebRTC Metrics Matter Most?

Start with the metrics that can explain a bad conversation. Do not build a dashboard of every available stat and call it observability.

Metric	Where It Comes From	Warning Signal	Why Voice Agents Care
ICE connection state	Peer connection, SFU, or provider event	Stuck in checking, disconnected, failed	No media path means every downstream test is invalid
Packet loss	inbound-rtp and remote-inbound-rtp stats	Sustained above 1%	Missing audio frames create transcript gaps and robotic playback
Jitter	inbound-rtp and remote-inbound-rtp stats	Sustained above 30ms	Late packets make speech choppy and can increase perceived latency
RTT	remote inbound stats or provider metrics	Sustained above 300ms	High round-trip time makes turn-taking feel slow even if models are fast
Jitter-buffer delay	inbound RTP stats	Rising without recovery	Buffering hides packet timing problems but adds playout delay
Concealed samples	inbound audio stats	Increasing during active speech	Audio is being repaired because packets were lost or late
Audio level	inbound RTP stats or provider SDK	Flatline or clipping	Detects one-way audio, mute bugs, and device gain issues
Codec	SDP/provider summary	Unexpected codec or sample rate	Codec mismatch can degrade ASR or force transcoding
MOS or quality score	provider metric or test result	Under 4.0 warning, under 3.5 block	Useful summary, but not enough by itself

MDN's inbound RTP stats reference includes audio-level, jitter-buffer, packet-loss, and concealment fields that are useful when diagnosing received audio. MDN's remote inbound stats reference covers what the remote endpoint reports about the audio you send, such as RTT and fraction lost.

If you use LiveKit, its JS SDK exposes audio sender stats, including packets sent, packets lost, jitter, and RTT, through its client abstractions. If you use a telephony path, Twilio Voice Insights exposes call summaries, metrics, events, and quality tags for investigating call quality.

Call-quality evidence record: Store WebRTC stats, provider metrics, ASR confidence, transcript pointer, audio pointer, trace ID, and task outcome together. A packet-loss number without the failed turn is just trivia.
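
One way to enforce that rule is a completeness check that runs before a record is accepted into QA. The sketch below is a minimal example, not a Hamming schema: the type name, field names, and the missingEvidenceFields helper are all hypothetical, and the field list simply mirrors the evidence fields this guide recommends keeping together.

```typescript
// Hypothetical evidence record. Field names are illustrative assumptions,
// not a Hamming, LiveKit, or W3C schema.
type CallQualityEvidence = {
  callId?: string;
  traceId?: string;
  iceState?: string;
  codec?: string;
  packetLossPercent?: number;
  jitterMs?: number;
  roundTripTimeMs?: number;
  audioLevel?: number;
  qualityScore?: number;
  asrConfidence?: number;
  transcriptPointer?: string;
  audioPointer?: string;
  taskCompleted?: boolean;
};

// Fields a record needs before QA can separate media-path failures
// from agent-logic failures.
const REQUIRED_EVIDENCE_FIELDS: (keyof CallQualityEvidence)[] = [
  "callId",
  "traceId",
  "iceState",
  "codec",
  "packetLossPercent",
  "jitterMs",
  "roundTripTimeMs",
  "audioLevel",
  "qualityScore",
  "asrConfidence",
  "transcriptPointer",
  "audioPointer",
  "taskCompleted",
];

// Returns the names of required fields that are still undefined.
function missingEvidenceFields(record: CallQualityEvidence): string[] {
  return REQUIRED_EVIDENCE_FIELDS.filter((field) => record[field] === undefined);
}
```

An empty result means the record has enough context to answer the "agent failure or bad media path?" question. A non-empty result tells you which signal to instrument next.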

What Thresholds Should Trigger Warnings or Release Blocks?

Thresholds need calibration by codec, device, region, and workload. Still, a starter gate is better than "looks fine."

Signal	Warning	Release Block	First Action
Packet loss during active speech	Sustained above 1%	Sustained above 3%	Segment by region, network type, and media direction
Jitter	Sustained above 30ms	Sustained above 50ms	Check jitter-buffer delay and provider edge
RTT	Sustained above 300ms	Sustained above 600ms	Check TURN routing, region placement, and mobile network path
MOS or quality score	Under 4.0	Under 3.5	Inspect packet loss, jitter, codec, and audio samples
ICE connection	Reconnects or disconnected	Failed or repeated reconnect loop	Check STUN/TURN, firewall, and selected candidate pair
Audio level	Flatline for 2 seconds during expected speech	One-way audio or clipping in test call	Verify device, permissions, track state, and gain
Concealed samples	Rising during active speech	Repeated concealment across test set	Inspect packet loss and jitter burst pattern

Twilio's call-quality tags cover signals such as low MOS, high jitter, high packet loss, high latency, and ICE failure. Daily's testCallQuality() documentation uses packet-loss, RTT, and outbound-bitrate thresholds to classify pre-call quality as good, warning, bad, failed, or aborted. Those provider thresholds are not universal, but they are useful anchors.

We use stricter release thinking for voice agents because a "minor" media issue often becomes an agent-quality issue. A missing word changes intent. A late packet can make the ASR end the caller's turn too early. One-way audio makes the agent look broken even when every backend span is green.
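
The starter gate above can be sketched as a small classifier. The thresholds are copied from the table. The definition of "sustained" is an assumption: this guide does not fix a window, so the example treats three consecutive samples above a limit as sustained. The function and constant names are hypothetical.

```typescript
type GateLevel = "ok" | "warning" | "block";

// Starter thresholds from the table above. Calibrate per codec, device, and region.
const STARTER_THRESHOLDS = {
  packetLossPercent: { warning: 1, block: 3 },
  jitterMs: { warning: 30, block: 50 },
  roundTripTimeMs: { warning: 300, block: 600 },
};

type ThresholdMetric = keyof typeof STARTER_THRESHOLDS;

// Length of the longest run of back-to-back samples strictly above `limit`.
function longestRunAbove(values: number[], limit: number): number {
  let longest = 0;
  let current = 0;
  for (const value of values) {
    current = value > limit ? current + 1 : 0;
    longest = Math.max(longest, current);
  }
  return longest;
}

// Assumption: "sustained" means at least `minConsecutive` consecutive samples
// above the limit. A single spike never blocks on its own.
function classifySustained(
  metric: ThresholdMetric,
  values: number[],
  minConsecutive = 3,
): GateLevel {
  const limits = STARTER_THRESHOLDS[metric];
  if (longestRunAbove(values, limits.block) >= minConsecutive) return "block";
  if (longestRunAbove(values, limits.warning) >= minConsecutive) return "warning";
  return "ok";
}
```

With these defaults, a packet-loss series of 0.2, 5, 0.1, 0.3 stays "ok" because the spike is isolated, while three samples in a row above 3% return "block". Feed it only samples taken during active speech if you want the gate to match the table's first row.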

How to Run Pre-Call and Synthetic WebRTC Checks

Run checks before production traffic gets involved. The sequence should be boring enough that CI can enforce it.

Check	Environment	Pass Criteria	Evidence to Save
Browser/device readiness	Pre-call or synthetic client	Mic permission, expected input device, audio level present	device class, permission state, audio-level sample
Network quality	Pre-call	Packet loss and RTT below warning threshold	network state, RTT, packet loss, bitrate
ICE connectivity	Synthetic call	ICE connection state reaches connected/completed with a selected candidate pair	candidate type, region, TURN usage
Media loop	Synthetic call	Audio sent and received in both directions	inbound/outbound packet counts, audio levels
Speech sample	Synthetic call	Known phrase reaches ASR with expected confidence	audio pointer, transcript, confidence
Turn-taking sample	Synthetic call	Agent responds inside latency target and handles interruption	turn timestamps, interruption decision
Load sample	Deploy candidate	Quality holds under expected concurrent sessions	p95 stats by region and test batch
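
The ICE connectivity row can be checked from a recorded history of ICE connection states. The sketch below is a minimal example under stated assumptions: the state names are the standard RTCIceConnectionState values, but the function name, the verdict labels, and the rule that more than one drop to "disconnected" counts as a reconnect loop are hypothetical choices you should tune.

```typescript
// Standard RTCIceConnectionState values.
type IceState =
  | "new"
  | "checking"
  | "connected"
  | "completed"
  | "disconnected"
  | "failed"
  | "closed";

type IceVerdict = "ok" | "warning" | "block";

// Classifies the ordered ICE states seen during one synthetic call.
// Assumption: more than `maxDisconnects` drops is treated as a reconnect loop.
function classifyIceHistory(states: IceState[], maxDisconnects = 1): IceVerdict {
  // A failed state, or a call that never connected, blocks the release.
  if (states.some((state) => state === "failed")) return "block";
  const everConnected = states.some(
    (state) => state === "connected" || state === "completed",
  );
  if (!everConnected) return "block";

  const disconnects = states.filter((state) => state === "disconnected").length;
  if (disconnects > maxDisconnects) return "block";
  if (disconnects > 0) return "warning";
  return "ok";
}
```

In a browser test client, you could build the history by recording peerConnection.iceConnectionState inside an iceconnectionstatechange listener, then pass the array to the classifier when the call ends.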

For a minimal browser-side collector, sample getStats() on a short interval and normalize only the fields you need. jitterBufferDelay is cumulative, so divide it by jitterBufferEmittedCount for a lifetime average, or compute polling-window deltas if you need a rolling average.

    type WebRtcAudioQualitySample = {
      sampledAt: string;
      callId: string;
      direction: "inbound" | "outbound";
      packetsLost?: number;
      packetsReceived?: number;
      jitterMs?: number;
      roundTripTimeMs?: number;
      audioLevel?: number;
      jitterBufferDelayMs?: number;
      concealedSamples?: number;
    };

    export async function collectWebRtcAudioQuality(
      peerConnection: RTCPeerConnection,
      callId: string,
    ): Promise<WebRtcAudioQualitySample[]> {
      const stats = await peerConnection.getStats();
      const samples: WebRtcAudioQualitySample[] = [];

      stats.forEach((report) => {
        if (report.type === "inbound-rtp" && report.kind === "audio") {
          const jitterBufferDelayMs =
            report.jitterBufferDelay === undefined ||
            report.jitterBufferEmittedCount === undefined ||
            report.jitterBufferEmittedCount === 0
              ? undefined
              : (report.jitterBufferDelay / report.jitterBufferEmittedCount) *
                1000;

          samples.push({
            sampledAt: new Date(report.timestamp).toISOString(),
            callId,
            direction: "inbound",
            packetsLost: report.packetsLost,
            packetsReceived: report.packetsReceived,
            jitterMs: report.jitter === undefined ? undefined : report.jitter * 1000,
            audioLevel: report.audioLevel,
            jitterBufferDelayMs,
            concealedSamples: report.concealedSamples,
          });
        }

        if (report.type === "remote-inbound-rtp" && report.kind === "audio") {
          samples.push({
            sampledAt: new Date(report.timestamp).toISOString(),
            callId,
            direction: "outbound",
            packetsLost: report.packetsLost,
            jitterMs: report.jitter === undefined ? undefined : report.jitter * 1000,
            roundTripTimeMs:
              report.roundTripTime === undefined
                ? undefined
                : report.roundTripTime * 1000,
          });
        }
      });

      return samples;
    }

This is not a full monitoring system. It is the smallest useful proof that the voice agent's media path was measurable during the test. The OpenTelemetry voice-agent tracing guide covers how to attach the same call ID and trace ID to downstream spans.
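
For the polling-window deltas mentioned above, one approach is to keep the previous snapshot of the cumulative counters and subtract it from the current one. The sketch below assumes you store raw inbound-rtp counters between polls. The InboundCounters and WindowQuality types and the windowQuality function are hypothetical names for this example.

```typescript
// Cumulative counters copied from one inbound-rtp audio report.
type InboundCounters = {
  packetsLost: number;
  packetsReceived: number;
  jitterBufferDelay: number; // cumulative seconds
  jitterBufferEmittedCount: number; // cumulative samples
};

type WindowQuality = {
  packetLossPercent?: number;
  avgJitterBufferDelayMs?: number;
};

// Computes per-window values from two consecutive snapshots.
// Returns undefined for a metric when the window carried no data for it.
function windowQuality(
  previous: InboundCounters,
  current: InboundCounters,
): WindowQuality {
  // packetsLost can decrease when late duplicates arrive, so clamp at zero.
  const lost = Math.max(current.packetsLost - previous.packetsLost, 0);
  const received = current.packetsReceived - previous.packetsReceived;
  const expected = lost + received;

  const emitted =
    current.jitterBufferEmittedCount - previous.jitterBufferEmittedCount;
  const delaySeconds = current.jitterBufferDelay - previous.jitterBufferDelay;

  return {
    packetLossPercent: expected > 0 ? (lost / expected) * 100 : undefined,
    avgJitterBufferDelayMs:
      emitted > 0 ? (delaySeconds / emitted) * 1000 : undefined,
  };
}
```

Windowed values react to a loss burst within one polling interval, while lifetime averages dilute it across the whole call. That difference matters when you want to line up a bad turn with the stats from the same few seconds.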

How to Connect Call-Quality Stats to ASR, Latency, and Outcomes

The important question is not "what was jitter?" The important question is "did jitter change the conversation?"

Join WebRTC stats to the voice-agent outcome table:

Evidence	Join Key	Decision It Supports
WebRTC stats	call ID, participant ID, track ID, timestamp	Was the media path degraded?
ASR transcript and confidence	call ID, turn ID, timestamp	Did audio degradation affect recognition?
Agent trace	trace ID, turn ID	Did model/tool/TTS latency compound the issue?
Audio replay pointer	call ID, redacted asset ID	Can QA hear the failure?
Task outcome	call ID, workflow ID	Did the caller complete the job?
Incident or release ID	deploy version, policy version	Did a change introduce the regression?

This is where the page connects to broader voice-agent monitoring KPIs. Packet loss matters more when it correlates with worse ASR confidence, longer task time, higher abandonment, or more failed transfers.
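
A rough sketch of that join, assuming media samples and agent turns share a call ID and comparable millisecond timestamps. All type and function names here are hypothetical, and a production pipeline would more likely do this in a warehouse query than in application code.

```typescript
type MediaSample = {
  callId: string;
  sampledAtMs: number;
  packetLossPercent: number;
  jitterMs: number;
};

type AgentTurn = {
  callId: string;
  turnId: string;
  startMs: number;
  endMs: number;
  asrConfidence: number;
};

type TurnWithMedia = AgentTurn & {
  maxPacketLossPercent?: number;
  maxJitterMs?: number;
};

// Largest value in a non-empty list.
function maxOf(values: number[]): number {
  return values.reduce((best, value) => (value > best ? value : best), values[0]);
}

// Attaches the worst media stats seen inside each turn's time window.
// Turns with no matching samples keep the media fields undefined.
function joinMediaToTurns(
  turns: AgentTurn[],
  samples: MediaSample[],
): TurnWithMedia[] {
  return turns.map((turn) => {
    const inTurn = samples.filter(
      (sample) =>
        sample.callId === turn.callId &&
        sample.sampledAtMs >= turn.startMs &&
        sample.sampledAtMs <= turn.endMs,
    );
    if (inTurn.length === 0) return { ...turn };
    return {
      ...turn,
      maxPacketLossPercent: maxOf(inTurn.map((sample) => sample.packetLossPercent)),
      maxJitterMs: maxOf(inTurn.map((sample) => sample.jitterMs)),
    };
  });
}
```

Once each turn carries its own media stats, a low ASR confidence next to a high maxPacketLossPercent points at the media path, while a low confidence on a clean turn points back at ASR or the audio content itself.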

It also changes how you debug latency. If voice AI latency spikes but WebRTC stats are clean, inspect STT endpointing, LLM first-token time, tool calls, and TTS. If WebRTC stats degrade first, do not tune the prompt until the media path is healthy.

Release-gate rule: A voice-agent release should not pass on text tests alone. It should pass text-mode logic tests, synthetic WebRTC calls, and outcome checks that prove media quality did not hide a broken conversation.

Store the joined evidence in one compact event so QA can replay the same failure later:

    {
      "eventName": "voice.webrtc_call_quality.checked",
      "callId": "call_01JZ9W2M7K",
      "traceId": "trace_7db3e1",
      "sampleWindowMs": 30000,
      "network": {
        "region": "us-central",
        "connectionType": "wifi",
        "selectedCandidateType": "relay"
      },
      "media": {
        "iceState": "connected",
        "codec": "opus",
        "packetLossPercent": 1.6,
        "jitterMsP95": 34,
        "roundTripTimeMsP95": 280,
        "audioLevelFlatlineMs": 0
      },
      "voiceAgentOutcome": {
        "asrConfidenceMin": 0.82,
        "turnLatencyMsP95": 1180,
        "taskCompleted": true,
        "needsReplay": false
      }
    }

What Should CI and Production Monitoring Include?

Use this checklist as the minimum useful gate.

Area	CI / Deploy Candidate	Production Monitoring
Pre-call quality	Run a 15-30 second quality check from each target region	Track warning/bad pre-call rates by region and device
ICE and connectivity	Assert connection reaches connected/completed	Alert on failed ICE, reconnect loops, and TURN-only spikes
Packet loss and jitter	Block sustained packet loss above 3% or jitter above 50ms	Trend p50/p95 by direction, region, and provider edge
RTT	Block sustained RTT above 600ms	Alert on regional RTT drift and TURN-route changes
Audio levels	Verify inbound and outbound audio are non-flatline	Detect one-way audio and clipped input
Known speech sample	Assert phrase reaches ASR with expected confidence	Sample replayable failures with redaction
Turn timing	Verify first response and interruption tests	Correlate with latency SLOs and abandonment
Outcome	Confirm synthetic caller completes the task	Segment bad outcomes by media-quality bucket
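
The audio-levels row can be approximated with a flatline detector over sampled audioLevel values. This is a minimal sketch under stated assumptions: the samples come only from windows where speech is expected, audioLevel is the 0 to 1 value from WebRTC stats, and the silence cutoff of 0.001 is an arbitrary starting point. The 2-second limit matches the warning threshold used earlier in this guide. Function and type names are hypothetical.

```typescript
type AudioLevelSample = {
  timestampMs: number;
  audioLevel: number; // 0 to 1, as reported in WebRTC stats
};

// Longest continuous stretch, in milliseconds, where audioLevel stayed
// at or below `silenceLevel`. Samples must be in timestamp order.
function longestFlatlineMs(
  samples: AudioLevelSample[],
  silenceLevel = 0.001,
): number {
  let longest = 0;
  let runStart: number | undefined;
  for (const sample of samples) {
    if (sample.audioLevel <= silenceLevel) {
      if (runStart === undefined) runStart = sample.timestampMs;
      longest = Math.max(longest, sample.timestampMs - runStart);
    } else {
      runStart = undefined;
    }
  }
  return longest;
}

// Assumption: samples were taken while speech was expected on this track.
function isFlatlineWarning(
  samples: AudioLevelSample[],
  flatlineMs = 2000,
): boolean {
  return longestFlatlineMs(samples) >= flatlineMs;
}
```

Run it once per direction. A flatline on inbound audio while outbound audio is healthy is a strong one-way-audio candidate.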

For dashboards, start with 6 panels:

Connection success by region and client type.
Packet loss p95 by direction.
Jitter p95 by direction.
RTT p95 by region.
One-way-audio candidates from audio-level flatlines.
Task completion split by media-quality bucket.
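
The last panel can be prototyped before you build any dashboard. The sketch below buckets calls with the starter packet-loss and jitter thresholds from this guide, then computes task completion per bucket. The bucket names, the per-call summary fields, and both function names are assumptions made for this example.

```typescript
type MediaBucket = "clean" | "warning" | "degraded";

// Hypothetical per-call summary with one representative value per metric.
type CompletedCall = {
  packetLossPercent: number;
  jitterMs: number;
  taskCompleted: boolean;
};

// Buckets a call using the starter warning and block thresholds.
function mediaBucket(call: CompletedCall): MediaBucket {
  if (call.packetLossPercent > 3 || call.jitterMs > 50) return "degraded";
  if (call.packetLossPercent > 1 || call.jitterMs > 30) return "warning";
  return "clean";
}

// Task completion rate (0 to 1) per bucket, or undefined for empty buckets.
function completionRateByBucket(
  calls: CompletedCall[],
): Record<MediaBucket, number | undefined> {
  const totals: Record<MediaBucket, { completed: number; total: number }> = {
    clean: { completed: 0, total: 0 },
    warning: { completed: 0, total: 0 },
    degraded: { completed: 0, total: 0 },
  };
  for (const call of calls) {
    const bucket = totals[mediaBucket(call)];
    bucket.total += 1;
    if (call.taskCompleted) bucket.completed += 1;
  }
  const rate = (bucket: { completed: number; total: number }) =>
    bucket.total === 0 ? undefined : bucket.completed / bucket.total;
  return {
    clean: rate(totals.clean),
    warning: rate(totals.warning),
    degraded: rate(totals.degraded),
  };
}
```

If completion in the degraded bucket is far below the clean bucket, media quality is costing you outcomes. If the two rates are close, look at agent logic before spending time on network tuning.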

The voice-agent dashboard template covers layout patterns for operators. The voice-agent SLO guide covers how to turn these signals into error budgets and burn-rate alerts.

When WebRTC Metrics Are Not Enough

WebRTC stats can tell you the media path was damaged. They cannot tell you whether the agent made the right decision.

Three limitations matter:

Good media can still produce bad agent behavior. Clean packet delivery does not prove the prompt, tool call, handoff, or compliance policy worked.
Quality scores hide causes. MOS or provider quality labels are useful summaries, but you still need packet loss, jitter, RTT, audio levels, and replay evidence.
Browser stats are not the full telephony path. PSTN, SIP, SFU, and provider edges may each see a different version of the call.

The practical approach is layered. Use WebRTC stats to validate the media path. Use voice-agent observability traces to explain the AI pipeline. Use incident response runbooks when the failure affects production users.

We used to think of WebRTC checks as debugging support. That was too late. For production voice agents, call quality is part of the product contract. If the caller cannot be heard clearly, the agent does not get credit for being smart.

WebRTC Call Quality Release Checklist

Pre-call quality check runs from every supported region or test edge.
Synthetic WebRTC call captures inbound and outbound audio stats.
ICE state, selected candidate type, TURN usage, and codec are saved.
Packet loss, jitter, RTT, audio levels, and jitter-buffer delay have warning and block thresholds.
Known speech samples produce expected ASR confidence and transcript quality.
Turn latency and interruption behavior are tested with real audio, not only text turns.
Call-quality stats join to transcript, audio replay pointer, trace ID, and task outcome.
Production dashboard segments bad outcomes by media-quality bucket.
Release owner can block deploys when media-quality checks fail.


Sumanyu Sharma
