The fastest way to waste a day debugging a voice agent is to start with the prompt when the audio path is broken. The ASR transcript is wrong, the user interrupts twice, and the agent sounds late. Then someone opens the WebRTC stats and finds sustained packet loss during the same turn everyone was arguing about.
That is why WebRTC call quality testing needs to be a release gate for voice agents, not a post-incident chore. Text-mode tests prove the agent can reason. WebRTC tests prove the agent can survive a real call: ICE negotiation, codec behavior, jitter, packet loss, audio levels, device quirks, and network changes.
If your team is still pre-production and running fewer than 50 test calls per week, keep this simple: open browser diagnostics, run 10 noisy calls, and fix obvious media failures. This guide is for teams that need repeatable checks before every deploy and production evidence when callers say the agent "couldn't hear me."
WebRTC call quality testing is the practice of measuring the media path before, during, and after a voice-agent session. A useful test records packet loss, jitter, round-trip time, audio levels, jitter-buffer behavior, codec, ICE state, and conversation outcome in one evidence record.
Quick filter: If your QA record cannot answer "was this an agent failure or a bad media path?" you do not have enough call-quality evidence.
TL;DR: Treat WebRTC call quality as part of voice-agent QA:
- Collect media stats: packet loss, jitter, RTT, audio levels, jitter-buffer delay, concealed samples, codec, and ICE state.
- Run three checks: pre-call network check, synthetic WebRTC call, and sampled production monitoring.
- Gate releases: block sustained packet loss, high RTT, one-way audio, failed ICE, and unexplained quality-score drops.
- Correlate outcomes: store call-quality stats with ASR confidence, transcript/audio pointers, latency, and task completion.
Methodology Note: This checklist is based on Hamming's analysis of 4M+ production voice agent calls across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions. It also uses public WebRTC, LiveKit, Twilio, Daily, and ITU documentation to ground the media-quality metrics and cases.
Last Updated: May 2026
Related Guides:
- Debug WebRTC Voice Agents - troubleshoot ICE, RTP, jitter, packet loss, and browser diagnostics
- Testing LiveKit Voice Agents - connect WebRTC validation to LiveKit agent testing
- Voice AI Latency - separate media delay from model and TTS latency
- Voice Agent Monitoring KPIs - production thresholds for voice-agent health
- Voice Agent Analytics Metrics - formulas for call-quality and outcome reporting
- Voice Agent SLOs - turn call-quality failures into reliability targets
- Voice Agent Observability Tracing - preserve trace IDs across media, ASR, LLM, and TTS
How to use this checklist:
- Run the pre-call and synthetic checks before debugging prompts or ASR behavior.
- Save media stats with the same call ID and trace ID used by ASR, LLM, TTS, and outcome events.
- Block releases only on sustained or repeated quality failures, not one isolated stat spike.
- Segment production issues by region, device, network type, provider edge, and media direction.
- Use replayable evidence to decide whether the fix belongs in media routing, ASR, latency, or agent logic.
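One way to read "sustained, not one isolated spike" is as a run of consecutive samples above a threshold. The sketch below assumes that interpretation; the helper name `isSustainedBreach` and the three-sample run length are illustrative choices, not something this checklist prescribes.

```typescript
// Hypothetical helper: returns true only when at least `minConsecutive`
// samples in a row exceed `threshold`. A single spike resets nothing and
// never trips the gate on its own.
function isSustainedBreach(
  values: number[],
  threshold: number,
  minConsecutive: number,
): boolean {
  let run = 0;
  for (const value of values) {
    // Extend the run while samples stay above the threshold; reset otherwise.
    run = value > threshold ? run + 1 : 0;
    if (run >= minConsecutive) return true;
  }
  return false;
}

// Packet-loss percent sampled once per polling interval (made-up values).
const isolatedSpike = isSustainedBreach([0.2, 4.1, 0.3, 0.1, 0.2], 3, 3);
const sustainedLoss = isSustainedBreach([0.2, 3.4, 3.9, 4.2, 0.5], 3, 3);
```

Here `isolatedSpike` stays false while `sustainedLoss` is true. The right run length depends on your polling interval and should be tuned against real calls.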
What Is WebRTC Call Quality Testing for Voice Agents?
WebRTC call quality testing checks whether the caller's audio can reach the agent, and the agent's audio can reach the caller, with enough stability for the conversation to work.
That sounds obvious, but it is easy to skip because most voice-agent test suites start at the transcript. By then, the transcript is already downstream of microphone permissions, ICE negotiation, codec selection, packet delivery, jitter buffering, and browser/device behavior.
Working rule: Do not debug ASR accuracy until you know the audio path was healthy enough for ASR to be judged fairly.
There are three test layers:
| Test Layer | When It Runs | What It Catches | What It Misses |
|---|---|---|---|
| Pre-call quality check | Before a user joins or before a synthetic test starts | Bad network, high RTT, obvious packet loss, device permission issues | Agent logic and real conversation timing |
| Synthetic WebRTC call | CI, deploy candidate, scheduled probe | ICE failures, jitter under load, one-way audio, codec issues, turn latency | Every real device and carrier path |
| Production sampling | Live calls, with redaction and retention controls | Region/device regressions, intermittent network patterns, issue replay | Failures you did not instrument |
The MDN RTCPeerConnection.getStats() docs describe the browser API that returns connection statistics. The W3C WebRTC stats specification defines the identifiers behind those reports, including inbound RTP stats such as packets received, packets lost, and jitter.
For voice agents, those stats are only half the record. The other half is whether the caller completed the task, whether ASR confidence dropped, whether turn latency spiked, and whether the agent recovered.
Which WebRTC Metrics Matter Most?
Start with the metrics that can explain a bad conversation. Do not build a dashboard of every available stat and call it observability.
| Metric | Where It Comes From | Warning Signal | Why Voice Agents Care |
|---|---|---|---|
| ICE connection state | Peer connection, SFU, or provider event | Stuck in checking, disconnected, failed | No media path means every downstream test is invalid |
| Packet loss | inbound RTP and remote-inbound RTP stats | Sustained above 1% | Missing audio frames create transcript gaps and robotic playback |
| Jitter | inbound RTP and remote-inbound RTP stats | Sustained above 30ms | Late packets make speech choppy and can increase perceived latency |
| RTT | remote inbound stats or provider metrics | Sustained above 300ms | High round-trip time makes turn-taking feel slow even if models are fast |
| Jitter-buffer delay | inbound RTP stats | Rising without recovery | Buffering hides packet timing problems but adds playout delay |
| Concealed samples | inbound audio stats | Increasing during active speech | Audio is being repaired because packets were lost or late |
| Audio level | inbound RTP stats or provider SDK | Flatline or clipping | Detects one-way audio, mute bugs, and device gain issues |
| Codec | SDP/provider summary | Unexpected codec or sample rate | Codec mismatch can degrade ASR or force transcoding |
| MOS or quality score | provider metric or test result | Under 4.0 warning, under 3.5 block | Useful summary, but not enough by itself |
MDN's inbound RTP stats reference includes audio-level, jitter-buffer, packet-loss, and concealment fields that are useful when diagnosing received audio. MDN's remote inbound stats reference covers sender-side views such as RTT and fraction lost.
If you use LiveKit, its JS SDK exposes audio sender stats (packets sent, packets lost, jitter, and RTT) through its client abstractions. If you use a telephony path, Twilio Voice Insights exposes call summaries, metrics, events, and quality tags for investigating call quality.
Call-quality evidence record: Store WebRTC stats, provider metrics, ASR confidence, transcript pointer, audio pointer, trace ID, and task outcome together. A packet-loss number without the failed turn is just trivia.
What Thresholds Should Trigger Warnings or Release Blocks?
Thresholds need calibration by codec, device, region, and workload. Still, a starter gate is better than "looks fine."
| Signal | Warning | Release Block | First Action |
|---|---|---|---|
| Packet loss during active speech | Sustained above 1% | Sustained above 3% | Segment by region, network type, and media direction |
| Jitter | Sustained above 30ms | Sustained above 50ms | Check jitter-buffer delay and provider edge |
| RTT | Sustained above 300ms | Sustained above 600ms | Check TURN routing, region placement, and mobile network path |
| MOS or quality score | Under 4.0 | Under 3.5 | Inspect packet loss, jitter, codec, and audio samples |
| ICE connection | Reconnects or disconnected | Failed or repeated reconnect loop | Check STUN/TURN, firewall, and selected candidate pair |
| Audio level | Flatline for 2 seconds during expected speech | One-way audio or clipping in test call | Verify device, permissions, track state, and gain |
| Concealed samples | Rising during active speech | Repeated concealment across test set | Inspect packet loss and jitter burst pattern |
Twilio's call-quality tags cover signals such as low MOS, high jitter, high packet loss, high latency, and ICE failure. Daily's testCallQuality() documentation uses packet-loss, RTT, and outbound-bitrate thresholds to classify pre-call quality as good, warning, bad, failed, or aborted. Those provider thresholds are not universal, but they are useful anchors.
We use stricter release thinking for voice agents because a "minor" media issue often becomes an agent-quality issue. A missing word changes intent. A late packet can cause ASR to end the utterance early. One-way audio makes the agent look broken even when every backend span is green.
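The starter thresholds in the table above can be expressed as a small classifier. This is a sketch only: the names `STARTER_THRESHOLDS`, `classifySignal`, `classifyMos`, and `worstGateLevel` are hypothetical, and the input is assumed to be a value that has already been judged "sustained" by whatever windowing rule you use.

```typescript
type GateLevel = "ok" | "warning" | "block";

// Starter values copied from the threshold table; calibrate them by codec,
// device, region, and workload before trusting them as a gate.
const STARTER_THRESHOLDS = {
  packetLossPercent: { warning: 1, block: 3 },
  jitterMs: { warning: 30, block: 50 },
  roundTripTimeMs: { warning: 300, block: 600 },
} as const;

type GatedSignal = keyof typeof STARTER_THRESHOLDS;

// Higher is worse for loss, jitter, and RTT.
function classifySignal(signal: GatedSignal, sustainedValue: number): GateLevel {
  const limits = STARTER_THRESHOLDS[signal];
  if (sustainedValue > limits.block) return "block";
  if (sustainedValue > limits.warning) return "warning";
  return "ok";
}

// Lower is worse for MOS or a provider quality score.
function classifyMos(mos: number): GateLevel {
  if (mos < 3.5) return "block";
  if (mos < 4.0) return "warning";
  return "ok";
}

// The release verdict is the worst level across all signals.
function worstGateLevel(levels: GateLevel[]): GateLevel {
  if (levels.some((level) => level === "block")) return "block";
  if (levels.some((level) => level === "warning")) return "warning";
  return "ok";
}

const releaseVerdict = worstGateLevel([
  classifySignal("packetLossPercent", 1.6),
  classifySignal("jitterMs", 34),
  classifySignal("roundTripTimeMs", 280),
  classifyMos(4.1),
]);
```

With these made-up inputs the verdict is a warning rather than a block, because packet loss and jitter exceed the warning line but nothing crosses a block line.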
How to Run Pre-Call and Synthetic WebRTC Checks
Run checks before production traffic gets involved. The sequence should be boring enough that CI can enforce it.
| Check | Environment | Pass Criteria | Evidence to Save |
|---|---|---|---|
| Browser/device readiness | Pre-call or synthetic client | Mic permission, expected input device, audio level present | device class, permission state, audio-level sample |
| Network quality | Pre-call | Packet loss and RTT below warning threshold | network state, RTT, packet loss, bitrate |
| ICE connectivity | Synthetic call | Selected candidate pair reaches connected/completed | candidate type, region, TURN usage |
| Media loop | Synthetic call | Audio sent and received in both directions | inbound/outbound packet counts, audio levels |
| Speech sample | Synthetic call | Known phrase reaches ASR with expected confidence | audio pointer, transcript, confidence |
| Turn-taking sample | Synthetic call | Agent responds inside latency target and handles interruption | turn timestamps, interruption decision |
| Load sample | Deploy candidate | Quality holds under expected concurrent sessions | p95 stats by region and test batch |
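The ICE connectivity check can be reduced to a verdict over the sequence of ICE connection states observed during a synthetic call, for example states recorded from `iceconnectionstatechange` events. The sketch below is an assumption about how to encode that check: `evaluateIceTimeline` and the `maxReconnects` default are hypothetical, and the state names mirror the standard ICE connection states.

```typescript
type ObservedIceState =
  | "new"
  | "checking"
  | "connected"
  | "completed"
  | "disconnected"
  | "failed"
  | "closed";

type IceVerdict = "ok" | "warning" | "block";

// Hypothetical rule set:
// - block if the call ever failed, never connected, or disconnected more
//   than `maxReconnects` times (a reconnect loop)
// - warn if it disconnected at least once but stayed within the limit
function evaluateIceTimeline(
  states: ObservedIceState[],
  maxReconnects = 1,
): IceVerdict {
  let everConnected = false;
  let disconnects = 0;
  for (const state of states) {
    if (state === "failed") return "block";
    if (state === "connected" || state === "completed") everConnected = true;
    if (state === "disconnected") disconnects += 1;
  }
  if (!everConnected) return "block"; // e.g. stuck in "checking"
  if (disconnects > maxReconnects) return "block";
  return disconnects > 0 ? "warning" : "ok";
}

const healthyCall = evaluateIceTimeline(["new", "checking", "connected", "completed"]);
const oneBlip = evaluateIceTimeline(["new", "checking", "connected", "disconnected", "connected"]);
const stuckChecking = evaluateIceTimeline(["new", "checking"]);
```

Keeping the verdict a pure function of recorded states makes it easy to run in CI against saved timelines as well as live calls.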
For a minimal browser-side collector, sample getStats() on a short interval and normalize only the fields you need. jitterBufferDelay is cumulative, so divide it by jitterBufferEmittedCount for a lifetime average, or compute polling-window deltas if you need a rolling average.
```typescript
// Normalized audio-quality sample. Optional fields stay undefined when the
// browser does not report the underlying stat.
type WebRtcAudioQualitySample = {
  sampledAt: string;
  callId: string;
  direction: "inbound" | "outbound";
  packetsLost?: number;
  packetsReceived?: number;
  jitterMs?: number;
  roundTripTimeMs?: number;
  audioLevel?: number;
  jitterBufferDelayMs?: number;
  concealedSamples?: number;
};

export async function collectWebRtcAudioQuality(
  peerConnection: RTCPeerConnection,
  callId: string,
): Promise<WebRtcAudioQualitySample[]> {
  const stats = await peerConnection.getStats();
  const samples: WebRtcAudioQualitySample[] = [];

  stats.forEach((report) => {
    // Audio we receive, as measured locally.
    if (report.type === "inbound-rtp" && report.kind === "audio") {
      // jitterBufferDelay is cumulative seconds; dividing by the emitted
      // count gives a lifetime average, converted here to milliseconds.
      const jitterBufferDelayMs =
        report.jitterBufferDelay === undefined ||
        report.jitterBufferEmittedCount === undefined ||
        report.jitterBufferEmittedCount === 0
          ? undefined
          : (report.jitterBufferDelay / report.jitterBufferEmittedCount) *
            1000;

      samples.push({
        sampledAt: new Date(report.timestamp).toISOString(),
        callId,
        direction: "inbound",
        packetsLost: report.packetsLost,
        packetsReceived: report.packetsReceived,
        // jitter is reported in seconds.
        jitterMs: report.jitter === undefined ? undefined : report.jitter * 1000,
        audioLevel: report.audioLevel,
        jitterBufferDelayMs,
        concealedSamples: report.concealedSamples,
      });
    }

    // Audio we send, as reported back by the remote receiver.
    if (report.type === "remote-inbound-rtp" && report.kind === "audio") {
      samples.push({
        sampledAt: new Date(report.timestamp).toISOString(),
        callId,
        direction: "outbound",
        packetsLost: report.packetsLost,
        jitterMs: report.jitter === undefined ? undefined : report.jitter * 1000,
        // roundTripTime is reported in seconds.
        roundTripTimeMs:
          report.roundTripTime === undefined
            ? undefined
            : report.roundTripTime * 1000,
      });
    }
  });

  return samples;
}
```
This is not a full monitoring system. It is the smallest useful proof that the voice agent's media path was measurable during the test. The OpenTelemetry voice-agent tracing guide covers how to attach the same call ID and trace ID to downstream spans.
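The collector above reports lifetime averages because the underlying counters are cumulative. For a rolling view, subtract the previous poll from the current one. The sketch below assumes you keep the raw counters from two consecutive polls; `InboundCounters` and `windowedQuality` are hypothetical names, and the sample numbers are invented.

```typescript
// Raw cumulative counters taken from one inbound-rtp audio report.
type InboundCounters = {
  packetsLost: number;
  packetsReceived: number;
  jitterBufferDelay: number; // cumulative seconds
  jitterBufferEmittedCount: number; // cumulative emitted samples
};

// Computes per-window values from two consecutive polls.
function windowedQuality(prev: InboundCounters, curr: InboundCounters) {
  // packetsLost can go down when late or duplicate packets arrive, so the
  // delta is clamped at zero here (an assumption, not a spec requirement).
  const lost = Math.max(0, curr.packetsLost - prev.packetsLost);
  const received = curr.packetsReceived - prev.packetsReceived;
  const expected = lost + received;
  const emitted = curr.jitterBufferEmittedCount - prev.jitterBufferEmittedCount;
  const delaySeconds = curr.jitterBufferDelay - prev.jitterBufferDelay;
  return {
    packetLossPercent: expected > 0 ? (lost / expected) * 100 : undefined,
    jitterBufferDelayMs: emitted > 0 ? (delaySeconds / emitted) * 1000 : undefined,
  };
}

const previousPoll: InboundCounters = {
  packetsLost: 10,
  packetsReceived: 1000,
  jitterBufferDelay: 100,
  jitterBufferEmittedCount: 2000,
};
const currentPoll: InboundCounters = {
  packetsLost: 25,
  packetsReceived: 1485,
  jitterBufferDelay: 160,
  jitterBufferEmittedCount: 3000,
};
// 15 lost out of 500 expected packets, and 60 seconds of added delay
// spread over 1000 emitted samples.
const rolling = windowedQuality(previousPoll, currentPoll);
```

In this example the window shows roughly 3% packet loss and about 60ms of average jitter-buffer delay, even though the lifetime loss rate is lower. Rolling values are what a "sustained" threshold should be evaluated against.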
How to Connect Call-Quality Stats to ASR, Latency, and Outcomes
The important question is not "what was jitter?" The important question is "did jitter change the conversation?"
Join WebRTC stats to the voice-agent outcome table:
| Evidence | Join Key | Decision It Supports |
|---|---|---|
| WebRTC stats | call ID, participant ID, track ID, timestamp | Was the media path degraded? |
| ASR transcript and confidence | call ID, turn ID, timestamp | Did audio degradation affect recognition? |
| Agent trace | trace ID, turn ID | Did model/tool/TTS latency compound the issue? |
| Audio replay pointer | call ID, redacted asset ID | Can QA hear the failure? |
| Task outcome | call ID, workflow ID | Did the caller complete the job? |
| Incident or release ID | deploy version, policy version | Did a change introduce the regression? |
This is where call-quality testing connects to broader voice-agent monitoring KPIs. Packet loss matters more when it correlates with worse ASR confidence, longer task time, higher abandonment, or more failed transfers.
It also changes how you debug latency. If voice AI latency spikes but WebRTC stats are clean, inspect STT endpointing, LLM first-token time, tool calls, and TTS. If WebRTC stats degrade first, do not tune the prompt until the media path is healthy.
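To check whether media degradation actually changed the conversation, join media windows to ASR turns and compare confidence between clean and degraded windows. The sketch below assumes the join keys from the table (call ID and timestamp); the type names, `splitConfidenceByMediaHealth`, and the 1% loss cutoff are illustrative assumptions.

```typescript
type MediaWindow = {
  callId: string;
  startMs: number;
  endMs: number;
  packetLossPercent: number;
};

type AsrTurn = {
  callId: string;
  turnId: string;
  startedAtMs: number;
  asrConfidence: number;
};

// Splits ASR confidence by whether the turn started in a degraded window.
function splitConfidenceByMediaHealth(
  windows: MediaWindow[],
  turns: AsrTurn[],
  lossWarningPercent = 1,
) {
  const degraded: number[] = [];
  const clean: number[] = [];
  for (const turn of turns) {
    const match = windows.find(
      (w) =>
        w.callId === turn.callId &&
        turn.startedAtMs >= w.startMs &&
        turn.startedAtMs < w.endMs,
    );
    if (match === undefined) continue; // no media evidence for this turn
    const target = match.packetLossPercent > lossWarningPercent ? degraded : clean;
    target.push(turn.asrConfidence);
  }
  const mean = (xs: number[]) =>
    xs.length === 0 ? undefined : xs.reduce((sum, x) => sum + x, 0) / xs.length;
  return { cleanMean: mean(clean), degradedMean: mean(degraded) };
}

const confidenceSplit = splitConfidenceByMediaHealth(
  [
    { callId: "call_a", startMs: 0, endMs: 30000, packetLossPercent: 0.4 },
    { callId: "call_a", startMs: 30000, endMs: 60000, packetLossPercent: 2.5 },
  ],
  [
    { callId: "call_a", turnId: "t1", startedAtMs: 5000, asrConfidence: 0.95 },
    { callId: "call_a", turnId: "t2", startedAtMs: 20000, asrConfidence: 0.93 },
    { callId: "call_a", turnId: "t3", startedAtMs: 35000, asrConfidence: 0.7 },
    { callId: "call_a", turnId: "t4", startedAtMs: 50000, asrConfidence: 0.74 },
  ],
);
```

A large gap between `cleanMean` and `degradedMean` suggests the media path, not the prompt, is the first thing to fix. A small gap points back at the AI pipeline.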
Release-gate rule: A voice-agent release should not pass on text tests alone. It should pass text-mode logic tests, synthetic WebRTC calls, and outcome checks that prove media quality did not hide a broken conversation.
Store the joined evidence in one compact event so QA can replay the same failure later:
```json
{
  "eventName": "voice.webrtc_call_quality.checked",
  "callId": "call_01JZ9W2M7K",
  "traceId": "trace_7db3e1",
  "sampleWindowMs": 30000,
  "network": {
    "region": "us-central",
    "connectionType": "wifi",
    "selectedCandidateType": "relay"
  },
  "media": {
    "iceState": "connected",
    "codec": "opus",
    "packetLossPercent": 1.6,
    "jitterMsP95": 34,
    "roundTripTimeMsP95": 280,
    "audioLevelFlatlineMs": 0
  },
  "voiceAgentOutcome": {
    "asrConfidenceMin": 0.82,
    "turnLatencyMsP95": 1180,
    "taskCompleted": true,
    "needsReplay": false
  }
}
```
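The P95 fields in the event above have to be computed from the per-poll samples in the window. The article does not say which percentile method to use; the sketch below assumes the nearest-rank method, and `percentileNearestRank` is a hypothetical helper name.

```typescript
// Nearest-rank percentile: the smallest value such that at least
// `percentile` percent of the samples are less than or equal to it.
function percentileNearestRank(
  values: number[],
  percentile: number,
): number | undefined {
  if (values.length === 0) return undefined;
  // Copy before sorting so the caller's array is left untouched.
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((percentile * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Jitter in milliseconds from ten polls in one sample window (made-up data).
const jitterSamplesMs = [12, 18, 22, 25, 27, 30, 31, 33, 34, 90];
const jitterMsP95 = percentileNearestRank(jitterSamplesMs, 95);
const jitterMsP50 = percentileNearestRank(jitterSamplesMs, 50);
```

With only ten samples, P95 lands on the single worst poll (90 here) while the median is 27, which is why short windows make P95 noisy. Longer windows or more frequent polling give a steadier number.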
What Should CI and Production Monitoring Include?
Use this checklist as the minimum useful gate.
| Area | CI / Deploy Candidate | Production Monitoring |
|---|---|---|
| Pre-call quality | Run a 15-30 second quality check from each target region | Track warning/bad pre-call rates by region and device |
| ICE and connectivity | Assert connection reaches connected/completed | Alert on failed ICE, reconnect loops, and TURN-only spikes |
| Packet loss and jitter | Block sustained packet loss above 3% or jitter above 50ms | Trend p50/p95 by direction, region, and provider edge |
| RTT | Block sustained RTT above 600ms | Alert on regional RTT drift and TURN-route changes |
| Audio levels | Verify inbound and outbound audio are non-flatline | Detect one-way audio and clipped input |
| Known speech sample | Assert phrase reaches ASR with expected confidence | Sample replayable failures with redaction |
| Turn timing | Verify first response and interruption tests | Correlate with latency SLOs and abandonment |
| Outcome | Confirm synthetic caller completes the task | Segment bad outcomes by media-quality bucket |
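The audio-level check in the table can be sketched as a search for the longest stretch of near-silent samples while speech was expected. The names below and the `silenceFloor` default are assumptions; the 2-second limit comes from the starter threshold table earlier in this guide.

```typescript
type AudioLevelSample = { atMs: number; audioLevel: number };

// Returns the longest continuous stretch (in ms) where audioLevel stayed at
// or below `silenceFloor`. Samples are assumed to be in time order and to
// cover a period where speech was expected.
function longestFlatlineMs(
  samples: AudioLevelSample[],
  silenceFloor = 0.001,
): number {
  let longest = 0;
  let flatStart: number | undefined;
  for (const sample of samples) {
    if (sample.audioLevel <= silenceFloor) {
      if (flatStart === undefined) flatStart = sample.atMs;
      longest = Math.max(longest, sample.atMs - flatStart);
    } else {
      flatStart = undefined; // audio came back, so the run ends
    }
  }
  return longest;
}

// Polled every 500ms; the level drops to zero from 500ms through 2500ms.
const flatlineMs = longestFlatlineMs([
  { atMs: 0, audioLevel: 0.2 },
  { atMs: 500, audioLevel: 0 },
  { atMs: 1000, audioLevel: 0 },
  { atMs: 1500, audioLevel: 0 },
  { atMs: 2000, audioLevel: 0 },
  { atMs: 2500, audioLevel: 0 },
  { atMs: 3000, audioLevel: 0.3 },
]);
const oneWayAudioCandidate = flatlineMs >= 2000;
```

This only measures time between polled samples, so its resolution is limited by the polling interval. It also cannot tell a muted caller from a broken track without the "expected speech" context from the test script.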
For dashboards, start with 6 panels:
- Connection success by region and client type.
- Packet loss p95 by direction.
- Jitter p95 by direction.
- RTT p95 by region.
- One-way-audio candidates from audio-level flatlines.
- Task completion split by media-quality bucket.
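The last panel, task completion split by media-quality bucket, can be prototyped with a simple aggregation. The bucket edges below reuse the starter packet-loss thresholds (1% and 3%) as an assumption; the bucket names and function names are hypothetical.

```typescript
type FinishedCall = { packetLossPercent: number; taskCompleted: boolean };
type MediaBucket = "clean" | "degraded" | "bad";

// Assumed bucket edges: at or below 1% is clean, above 3% is bad.
function mediaBucketFor(packetLossPercent: number): MediaBucket {
  if (packetLossPercent > 3) return "bad";
  if (packetLossPercent > 1) return "degraded";
  return "clean";
}

// Completion rate per bucket; undefined when a bucket has no calls.
function completionRateByBucket(
  calls: FinishedCall[],
): Record<MediaBucket, number | undefined> {
  const totals: Record<MediaBucket, { calls: number; completed: number }> = {
    clean: { calls: 0, completed: 0 },
    degraded: { calls: 0, completed: 0 },
    bad: { calls: 0, completed: 0 },
  };
  for (const call of calls) {
    const bucket = totals[mediaBucketFor(call.packetLossPercent)];
    bucket.calls += 1;
    if (call.taskCompleted) bucket.completed += 1;
  }
  const rate = (t: { calls: number; completed: number }) =>
    t.calls === 0 ? undefined : t.completed / t.calls;
  return {
    clean: rate(totals.clean),
    degraded: rate(totals.degraded),
    bad: rate(totals.bad),
  };
}

const completionRates = completionRateByBucket([
  { packetLossPercent: 0.2, taskCompleted: true },
  { packetLossPercent: 0.5, taskCompleted: true },
  { packetLossPercent: 0.9, taskCompleted: false },
  { packetLossPercent: 0.1, taskCompleted: true },
  { packetLossPercent: 1.8, taskCompleted: true },
  { packetLossPercent: 2.4, taskCompleted: false },
]);
```

If completion drops sharply between the clean and degraded buckets, the media path is costing you outcomes. If the rates are similar, look at agent logic first.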
The voice-agent dashboard template covers layout patterns for operators. The voice-agent SLO guide covers how to turn these signals into error budgets and burn-rate alerts.
When WebRTC Metrics Are Not Enough
WebRTC stats can tell you the media path was damaged. They cannot tell you whether the agent made the right decision.
Three limitations matter:
- Good media can still produce bad agent behavior. Clean packet delivery does not prove the prompt, tool call, handoff, or compliance policy worked.
- Quality scores hide causes. MOS or provider quality labels are useful summaries, but you still need packet loss, jitter, RTT, audio levels, and replay evidence.
- Browser stats are not the full telephony path. PSTN, SIP, SFU, and provider edges may each see a different version of the call.
The practical approach is layered. Use WebRTC stats to validate the media path. Use voice-agent observability traces to explain the AI pipeline. Use incident response runbooks when the failure affects production users.
We used to think of WebRTC checks as debugging support. That was too late. For production voice agents, call quality is part of the product contract. If the caller cannot be heard clearly, the agent does not get credit for being smart.
WebRTC Call Quality Release Checklist
- Pre-call quality check runs from every supported region or test edge.
- Synthetic WebRTC call captures inbound and outbound audio stats.
- ICE state, selected candidate type, TURN usage, and codec are saved.
- Packet loss, jitter, RTT, audio levels, and jitter-buffer delay have warning and block thresholds.
- Known speech samples produce expected ASR confidence and transcript quality.
- Turn latency and interruption behavior are tested with real audio, not only text turns.
- Call-quality stats join to transcript, audio replay pointer, trace ID, and task outcome.
- Production dashboard segments bad outcomes by media-quality bucket.
- Release owner can block deploys when media-quality checks fail.

