WebRTC Call Quality Testing for Voice Agents

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

May 22, 2026 | Updated May 22, 2026 | 13 min read

The fastest way to waste a day debugging a voice agent is to start with the prompt when the audio path is broken. The ASR transcript is wrong, the user interrupts twice, and the agent sounds late. Then someone opens the WebRTC stats and finds sustained packet loss during the same turn everyone was arguing about.

That is why WebRTC call quality testing needs to be a release gate for voice agents, not a post-incident chore. Text-mode tests prove the agent can reason. WebRTC tests prove the agent can survive a real call: ICE negotiation, codec behavior, jitter, packet loss, audio levels, device quirks, and network changes.

If your team is still pre-production and running fewer than 50 test calls per week, keep this simple: open browser diagnostics, run 10 noisy calls, and fix obvious media failures. This guide is for teams that need repeatable checks before every deploy and production evidence when callers say the agent "couldn't hear me."

WebRTC call quality testing is the practice of measuring the media path before, during, and after a voice-agent session. A useful test records packet loss, jitter, round-trip time, audio levels, jitter-buffer behavior, codec, ICE state, and conversation outcome in one evidence record.

Quick filter: If your QA record cannot answer "was this an agent failure or a bad media path?" you do not have enough call-quality evidence.

TL;DR: Treat WebRTC call quality as part of voice-agent QA:

  • Collect media stats: packet loss, jitter, RTT, audio levels, jitter-buffer delay, concealed samples, codec, and ICE state.
  • Run three checks: pre-call network check, synthetic WebRTC call, and sampled production monitoring.
  • Gate releases: block sustained packet loss, high RTT, one-way audio, failed ICE, and unexplained quality-score drops.
  • Correlate outcomes: store call-quality stats with ASR confidence, transcript/audio pointers, latency, and task completion.

Methodology Note: This checklist is based on Hamming's analysis of 4M+ production voice agent calls across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions.

It also uses public WebRTC, LiveKit, Twilio, Daily, and ITU documentation to ground the media-quality metrics and cases.

Last Updated: May 2026

How to use this checklist:

  1. Run the pre-call and synthetic checks before debugging prompts or ASR behavior.
  2. Save media stats with the same call ID and trace ID used by ASR, LLM, TTS, and outcome events.
  3. Block releases only on sustained or repeated quality failures, not one isolated stat spike.
  4. Segment production issues by region, device, network type, provider edge, and media direction.
  5. Use replayable evidence to decide whether the fix belongs in media routing, ASR, latency, or agent logic.

What Is WebRTC Call Quality Testing for Voice Agents?

WebRTC call quality testing checks whether the caller's audio can reach the agent, and the agent's audio can reach the caller, with enough stability for the conversation to work.

That sounds obvious, but it is easy to skip because most voice-agent test suites start at the transcript. By then the audio has already passed through microphone permissions, ICE negotiation, codec selection, packet delivery, jitter buffering, and browser/device behavior.

Working rule: Do not debug ASR accuracy until you know the audio path was healthy enough for ASR to be judged fairly.

There are three test layers:

| Test Layer | When It Runs | What It Catches | What It Misses |
| --- | --- | --- | --- |
| Pre-call quality check | Before a user joins or before a synthetic test starts | Bad network, high RTT, obvious packet loss, device permission issues | Agent logic and real conversation timing |
| Synthetic WebRTC call | CI, deploy candidate, scheduled probe | ICE failures, jitter under load, one-way audio, codec issues, turn latency | Every real device and carrier path |
| Production sampling | Live calls, with redaction and retention controls | Region/device regressions, intermittent network patterns, issue replay | Failures you did not instrument |

The MDN RTCPeerConnection.getStats() docs describe the browser API that returns connection statistics. The W3C WebRTC stats specification defines the identifiers behind those reports, including inbound RTP stats such as packets received, packets lost, and jitter.

For voice agents, those stats are only half the record. The other half is whether the caller completed the task, whether ASR confidence dropped, whether turn latency spiked, and whether the agent recovered.

Which WebRTC Metrics Matter Most?

Start with the metrics that can explain a bad conversation. Do not build a dashboard of every available stat and call it observability.

| Metric | Where It Comes From | Warning Signal | Why Voice Agents Care |
| --- | --- | --- | --- |
| ICE connection state | Peer connection, SFU, or provider event | Stuck in checking, disconnected, failed | No media path means every downstream test is invalid |
| Packet loss | inbound-rtp stats (received audio) and remote-inbound-rtp stats (sent audio) | Sustained above 1% | Missing audio frames create transcript gaps and robotic playback |
| Jitter | inbound-rtp stats (received audio) and remote-inbound-rtp stats (sent audio) | Sustained above 30ms | Late packets make speech choppy and can increase perceived latency |
| RTT | remote inbound stats or provider metrics | Sustained above 300ms | High round-trip time makes turn-taking feel slow even if models are fast |
| Jitter-buffer delay | inbound RTP stats | Rising without recovery | Buffering hides packet timing problems but adds playout delay |
| Concealed samples | inbound audio stats | Increasing during active speech | Audio is being repaired because packets were lost or late |
| Audio level | inbound RTP stats or provider SDK | Flatline or clipping | Detects one-way audio, mute bugs, and device gain issues |
| Codec | SDP/provider summary | Unexpected codec or sample rate | Codec mismatch can degrade ASR or force transcoding |
| MOS or quality score | provider metric or test result | Under 4.0 warning, under 3.5 block | Useful summary, but not enough by itself |

MDN's inbound RTP stats reference includes audio-level, jitter-buffer, packet-loss, and concealment fields that are useful when diagnosing received audio. MDN's remote inbound stats reference covers sender-side views such as RTT and fraction lost.

If you use LiveKit, its JS SDK exposes audio sender stats (packets sent, packets lost, jitter, and RTT) through its own client abstractions. If you use a telephony path, Twilio Voice Insights exposes call summaries, metrics, events, and quality tags for investigating call quality.

Call-quality evidence record: Store WebRTC stats, provider metrics, ASR confidence, transcript pointer, audio pointer, trace ID, and task outcome together. A packet-loss number without the failed turn is just trivia.

What Thresholds Should Trigger Warnings or Release Blocks?

Thresholds need calibration by codec, device, region, and workload. Still, a starter gate is better than "looks fine."

| Signal | Warning | Release Block | First Action |
| --- | --- | --- | --- |
| Packet loss during active speech | Sustained above 1% | Sustained above 3% | Segment by region, network type, and media direction |
| Jitter | Sustained above 30ms | Sustained above 50ms | Check jitter-buffer delay and provider edge |
| RTT | Sustained above 300ms | Sustained above 600ms | Check TURN routing, region placement, and mobile network path |
| MOS or quality score | Under 4.0 | Under 3.5 | Inspect packet loss, jitter, codec, and audio samples |
| ICE connection | Reconnects or disconnected | Failed or repeated reconnect loop | Check STUN/TURN, firewall, and selected candidate pair |
| Audio level | Flatline for 2 seconds during expected speech | One-way audio or clipping in test call | Verify device, permissions, track state, and gain |
| Concealed samples | Rising during active speech | Repeated concealment across test set | Inspect packet loss and jitter burst pattern |

Twilio's call-quality tags cover signals such as low MOS, high jitter, high packet loss, high latency, and ICE failure. Daily's testCallQuality() documentation uses packet-loss, RTT, and outbound-bitrate thresholds to classify pre-call quality as good, warning, bad, failed, or aborted. Those provider thresholds are not universal, but they are useful anchors.

We use stricter release thinking for voice agents because a "minor" media issue often becomes an agent-quality issue. A missing word changes intent. A late packet can make ASR endpoint early. One-way audio makes the agent look broken even when every backend span is green.
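As a rough illustration, the starter thresholds for packet loss, jitter, and RTT could be encoded as a small gate. This is a minimal sketch, not Hamming's implementation: the names `STARTER_THRESHOLDS`, `evaluateSignal`, and `evaluateGate` are hypothetical, and "sustained" is assumed here to mean a run of consecutive samples above the limit, which the thresholds above do not define.

```typescript
type GateVerdict = "pass" | "warning" | "block";

type SignalThreshold = { warning: number; block: number };

// Starter values from the threshold table. Calibrate per codec, device, and region.
const STARTER_THRESHOLDS: Record<string, SignalThreshold> = {
  packetLossPercent: { warning: 1, block: 3 },
  jitterMs: { warning: 30, block: 50 },
  roundTripTimeMs: { warning: 300, block: 600 },
};

// Length of the longest run of consecutive samples strictly above `limit`.
function longestRunAbove(values: number[], limit: number): number {
  let longest = 0;
  let current = 0;
  for (const value of values) {
    current = value > limit ? current + 1 : 0;
    longest = Math.max(longest, current);
  }
  return longest;
}

// Assumption: "sustained" means at least `minConsecutive` samples in a row.
function evaluateSignal(
  values: number[],
  threshold: SignalThreshold,
  minConsecutive = 3,
): GateVerdict {
  if (longestRunAbove(values, threshold.block) >= minConsecutive) return "block";
  if (longestRunAbove(values, threshold.warning) >= minConsecutive) return "warning";
  return "pass";
}

// Returns the worst verdict across all known signals; unknown keys are skipped.
function evaluateGate(
  series: Record<string, number[]>,
  minConsecutive = 3,
): GateVerdict {
  let worst: GateVerdict = "pass";
  for (const signal of Object.keys(series)) {
    const threshold = STARTER_THRESHOLDS[signal];
    const values = series[signal];
    if (threshold === undefined || values === undefined) continue;
    const verdict = evaluateSignal(values, threshold, minConsecutive);
    if (verdict === "block") return "block";
    if (verdict === "warning") worst = "warning";
  }
  return worst;
}
```

With these assumptions, a series with isolated spikes such as `[0.5, 4, 0.5, 4, 0.5]` passes, while three consecutive packet-loss samples above 3% block the release. MOS and ICE state are left out because they use "under" and state-based rules rather than upper limits.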

How to Run Pre-Call and Synthetic WebRTC Checks

Run checks before production traffic gets involved. The sequence should be boring enough that CI can enforce it.

| Check | Environment | Pass Criteria | Evidence to Save |
| --- | --- | --- | --- |
| Browser/device readiness | Pre-call or synthetic client | Mic permission, expected input device, audio level present | device class, permission state, audio-level sample |
| Network quality | Pre-call | Packet loss and RTT below warning threshold | network state, RTT, packet loss, bitrate |
| ICE connectivity | Synthetic call | Selected candidate pair reaches connected/completed | candidate type, region, TURN usage |
| Media loop | Synthetic call | Audio sent and received in both directions | inbound/outbound packet counts, audio levels |
| Speech sample | Synthetic call | Known phrase reaches ASR with expected confidence | audio pointer, transcript, confidence |
| Turn-taking sample | Synthetic call | Agent responds inside latency target and handles interruption | turn timestamps, interruption decision |
| Load sample | Deploy candidate | Quality holds under expected concurrent sessions | p95 stats by region and test batch |
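
The media-loop and audio-level checks can be approximated by scanning audio-level samples for a long silent run. The sketch below is illustrative only: `AudioLevelSample`, `longestFlatlineMs`, `isOneWayAudioCandidate`, and the `silenceFloor` default are assumptions, and the caller is expected to pass only samples from a window where speech was expected (for example, while the synthetic caller is playing a known phrase).

```typescript
type AudioLevelSample = {
  atMs: number; // sample time in milliseconds, ascending
  audioLevel: number; // 0..1, as reported by WebRTC audioLevel
};

// Longest span of consecutive samples at or below `silenceFloor`, measured
// from the first to the last sample in the run.
function longestFlatlineMs(
  samples: AudioLevelSample[],
  silenceFloor = 0.001, // assumption: tune per device and codec
): number {
  let longest = 0;
  let runStart: number | undefined;
  for (const sample of samples) {
    if (sample.audioLevel <= silenceFloor) {
      if (runStart === undefined) runStart = sample.atMs;
      longest = Math.max(longest, sample.atMs - runStart);
    } else {
      runStart = undefined; // speech energy ends the flat run
    }
  }
  return longest;
}

// Flags a window as a one-way-audio candidate using the 2-second starter limit.
function isOneWayAudioCandidate(
  samples: AudioLevelSample[],
  flatlineLimitMs = 2000,
): boolean {
  return longestFlatlineMs(samples) >= flatlineLimitMs;
}
```

A flagged window is only a candidate. Confirm it against track state, permissions, and the audio replay before calling it one-way audio.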

For a minimal browser-side collector, sample getStats() on a short interval and normalize only the fields you need. jitterBufferDelay is cumulative, so divide it by jitterBufferEmittedCount for a lifetime average, or compute polling-window deltas if you need a rolling average.

type WebRtcAudioQualitySample = {
  sampledAt: string;
  callId: string;
  direction: "inbound" | "outbound";
  packetsLost?: number;
  packetsReceived?: number;
  jitterMs?: number;
  roundTripTimeMs?: number;
  audioLevel?: number;
  jitterBufferDelayMs?: number;
  concealedSamples?: number;
};

export async function collectWebRtcAudioQuality(
  peerConnection: RTCPeerConnection,
  callId: string,
): Promise<WebRtcAudioQualitySample[]> {
  const stats = await peerConnection.getStats();
  const samples: WebRtcAudioQualitySample[] = [];

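  stats.forEach((report) => {
    // inbound-rtp: audio this client received. jitter and jitterBufferDelay are reported in seconds.
    if (report.type === "inbound-rtp" && report.kind === "audio") {
      // Lifetime average: cumulative delay (seconds) per emitted sample, converted to ms.
      const jitterBufferDelayMs =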
        report.jitterBufferDelay === undefined ||
        report.jitterBufferEmittedCount === undefined ||
        report.jitterBufferEmittedCount === 0
          ? undefined
          : (report.jitterBufferDelay / report.jitterBufferEmittedCount) *
            1000;

      samples.push({
        sampledAt: new Date(report.timestamp).toISOString(),
        callId,
        direction: "inbound",
        packetsLost: report.packetsLost,
        packetsReceived: report.packetsReceived,
        jitterMs: report.jitter === undefined ? undefined : report.jitter * 1000,
        audioLevel: report.audioLevel,
        jitterBufferDelayMs,
        concealedSamples: report.concealedSamples,
      });
    }

    // remote-inbound-rtp: the remote end's view of audio this client sent (loss, jitter, RTT).
    if (report.type === "remote-inbound-rtp" && report.kind === "audio") {
      samples.push({
        sampledAt: new Date(report.timestamp).toISOString(),
        callId,
        direction: "outbound",
        packetsLost: report.packetsLost,
        jitterMs: report.jitter === undefined ? undefined : report.jitter * 1000,
        roundTripTimeMs:
          report.roundTripTime === undefined
            ? undefined
            : report.roundTripTime * 1000,
      });
    }
  });

  return samples;
}

This is not a full monitoring system. It is the smallest useful proof that the voice agent's media path was measurable during the test. The OpenTelemetry voice-agent tracing guide covers how to attach the same call ID and trace ID to downstream spans.
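
Because `packetsLost`, `packetsReceived`, `jitterBufferDelay`, and `jitterBufferEmittedCount` are cumulative, a per-window view needs the difference between two polls. The following is a minimal sketch of that delta calculation. `InboundCounters` and `windowQuality` are hypothetical names, and the input is assumed to be two consecutive snapshots of the same inbound audio stream.

```typescript
// Cumulative counters copied from one inbound-rtp audio report.
type InboundCounters = {
  packetsLost: number;
  packetsReceived: number;
  jitterBufferDelay: number; // seconds, summed over emitted samples
  jitterBufferEmittedCount: number; // samples emitted from the jitter buffer
};

type WindowQuality = {
  packetLossPercent: number | undefined;
  jitterBufferDelayMs: number | undefined;
};

// Quality over the polling window between `prev` and `curr`.
function windowQuality(prev: InboundCounters, curr: InboundCounters): WindowQuality {
  // packetsLost can go down when late or duplicate packets arrive, so clamp at zero.
  const lost = Math.max(curr.packetsLost - prev.packetsLost, 0);
  const received = Math.max(curr.packetsReceived - prev.packetsReceived, 0);
  const expected = lost + received;

  const emitted = curr.jitterBufferEmittedCount - prev.jitterBufferEmittedCount;
  const delaySeconds = curr.jitterBufferDelay - prev.jitterBufferDelay;

  return {
    // undefined means no packets were expected in this window
    packetLossPercent: expected > 0 ? (lost / expected) * 100 : undefined,
    // average time each emitted sample spent in the buffer, in ms
    jitterBufferDelayMs: emitted > 0 ? (delaySeconds / emitted) * 1000 : undefined,
  };
}
```

Returning `undefined` for an empty window keeps silent periods from being recorded as 0% loss, which would otherwise dilute a "sustained" check.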

How to Connect Call-Quality Stats to ASR, Latency, and Outcomes

The important question is not "what was jitter?" The important question is "did jitter change the conversation?"

Join WebRTC stats to the voice-agent outcome table:

| Evidence | Join Key | Decision It Supports |
| --- | --- | --- |
| WebRTC stats | call ID, participant ID, track ID, timestamp | Was the media path degraded? |
| ASR transcript and confidence | call ID, turn ID, timestamp | Did audio degradation affect recognition? |
| Agent trace | trace ID, turn ID | Did model/tool/TTS latency compound the issue? |
| Audio replay pointer | call ID, redacted asset ID | Can QA hear the failure? |
| Task outcome | call ID, workflow ID | Did the caller complete the job? |
| Incident or release ID | deploy version, policy version | Did a change introduce the regression? |

This is where the page connects to broader voice-agent monitoring KPIs. Packet loss matters more when it correlates with worse ASR confidence, longer task time, higher abandonment, or more failed transfers.

It also changes how you debug latency. If voice AI latency spikes but WebRTC stats are clean, inspect STT endpointing, LLM first-token time, tool calls, and TTS. If WebRTC stats degrade first, do not tune the prompt until the media path is healthy.
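
That debugging order can be written down as a small triage function. This is a sketch under stated assumptions: `triageLatencySpike` and `latencyTargetMs` are hypothetical, the media check reuses the starter warning thresholds (1% loss, 30ms jitter, 300ms RTT), and a real triage would also compare timestamps to see which signal degraded first.

```typescript
type TriageInput = {
  turnLatencyMsP95: number;
  latencyTargetMs: number; // assumption: your own turn-latency target
  packetLossPercent: number;
  jitterMsP95: number;
  roundTripTimeMsP95: number;
};

type TriageTarget = "media-path" | "ai-pipeline" | "none";

// Decide where to look first for a slow or degraded call.
function triageLatencySpike(input: TriageInput): TriageTarget {
  const mediaDegraded =
    input.packetLossPercent > 1 ||
    input.jitterMsP95 > 30 ||
    input.roundTripTimeMsP95 > 300;

  // A degraded media path takes priority: fix it before tuning prompts or models.
  if (mediaDegraded) return "media-path";

  // Clean media but slow turns: inspect STT endpointing, LLM first token, tools, TTS.
  if (input.turnLatencyMsP95 > input.latencyTargetMs) return "ai-pipeline";

  return "none";
}
```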

Release-gate rule: A voice-agent release should not pass on text tests alone. It should pass text-mode logic tests, synthetic WebRTC calls, and outcome checks that prove media quality did not hide a broken conversation.

Store the joined evidence in one compact event so QA can replay the same failure later:

{
  "eventName": "voice.webrtc_call_quality.checked",
  "callId": "call_01JZ9W2M7K",
  "traceId": "trace_7db3e1",
  "sampleWindowMs": 30000,
  "network": {
    "region": "us-central",
    "connectionType": "wifi",
    "selectedCandidateType": "relay"
  },
  "media": {
    "iceState": "connected",
    "codec": "opus",
    "packetLossPercent": 1.6,
    "jitterMsP95": 34,
    "roundTripTimeMsP95": 280,
    "audioLevelFlatlineMs": 0
  },
  "voiceAgentOutcome": {
    "asrConfidenceMin": 0.82,
    "turnLatencyMsP95": 1180,
    "taskCompleted": true,
    "needsReplay": false
  }
}

What Should CI and Production Monitoring Include?

Use this checklist as the minimum useful gate.

| Area | CI / Deploy Candidate | Production Monitoring |
| --- | --- | --- |
| Pre-call quality | Run a 15-30 second quality check from each target region | Track warning/bad pre-call rates by region and device |
| ICE and connectivity | Assert connection reaches connected/completed | Alert on failed ICE, reconnect loops, and TURN-only spikes |
| Packet loss and jitter | Block sustained packet loss above 3% or jitter above 50ms | Trend p50/p95 by direction, region, and provider edge |
| RTT | Block sustained RTT above 600ms | Alert on regional RTT drift and TURN-route changes |
| Audio levels | Verify inbound and outbound audio are non-flatline | Detect one-way audio and clipped input |
| Known speech sample | Assert phrase reaches ASR with expected confidence | Sample replayable failures with redaction |
| Turn timing | Verify first response and interruption tests | Correlate with latency SLOs and abandonment |
| Outcome | Confirm synthetic caller completes the task | Segment bad outcomes by media-quality bucket |

For dashboards, start with 6 panels:

  1. Connection success by region and client type.
  2. Packet loss p95 by direction.
  3. Jitter p95 by direction.
  4. RTT p95 by region.
  5. One-way-audio candidates from audio-level flatlines.
  6. Task completion split by media-quality bucket.

The voice-agent dashboard template covers layout patterns for operators. The voice-agent SLO guide covers how to turn these signals into error budgets and burn-rate alerts.
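
For the sixth panel, one possible way to split task completion by media-quality bucket is sketched below. The bucket names, the `CallQualityRecord` shape, and both function names are assumptions for illustration. The cutoffs reuse the starter warning and block thresholds for packet loss, jitter, and RTT.

```typescript
type CallQualityRecord = {
  packetLossPercent: number;
  jitterMsP95: number;
  roundTripTimeMsP95: number;
  taskCompleted: boolean;
};

type MediaQualityBucket = "good" | "warning" | "bad";

// Classify one call by its worst media signal.
function mediaQualityBucket(call: CallQualityRecord): MediaQualityBucket {
  if (call.packetLossPercent > 3 || call.jitterMsP95 > 50 || call.roundTripTimeMsP95 > 600) {
    return "bad";
  }
  if (call.packetLossPercent > 1 || call.jitterMsP95 > 30 || call.roundTripTimeMsP95 > 300) {
    return "warning";
  }
  return "good";
}

type BucketTotals = { calls: number; completed: number };

// Completion rate (0..1) per bucket; undefined when a bucket has no calls.
function completionRateByBucket(
  calls: CallQualityRecord[],
): Record<MediaQualityBucket, number | undefined> {
  const totals: Record<MediaQualityBucket, BucketTotals> = {
    good: { calls: 0, completed: 0 },
    warning: { calls: 0, completed: 0 },
    bad: { calls: 0, completed: 0 },
  };
  for (const call of calls) {
    const bucket = totals[mediaQualityBucket(call)];
    bucket.calls += 1;
    if (call.taskCompleted) bucket.completed += 1;
  }
  const rate = (t: BucketTotals): number | undefined =>
    t.calls === 0 ? undefined : t.completed / t.calls;
  return { good: rate(totals.good), warning: rate(totals.warning), bad: rate(totals.bad) };
}
```

If completion in the "bad" bucket is much lower than in "good", media quality is likely hiding behind what looks like an agent-logic failure. If the rates are similar, look at the agent first.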

When WebRTC Metrics Are Not Enough

WebRTC stats can tell you the media path was damaged. They cannot tell you whether the agent made the right decision.

Three limitations matter:

  • Good media can still produce bad agent behavior. Clean packet delivery does not prove the prompt, tool call, handoff, or compliance policy worked.
  • Quality scores hide causes. MOS or provider quality labels are useful summaries, but you still need packet loss, jitter, RTT, audio levels, and replay evidence.
  • Browser stats are not the full telephony path. PSTN, SIP, SFU, and provider edges may each see a different version of the call.

The practical approach is layered. Use WebRTC stats to validate the media path. Use voice-agent observability traces to explain the AI pipeline. Use incident response runbooks when the failure affects production users.

We used to think of WebRTC checks as debugging support. That was too late. For production voice agents, call quality is part of the product contract. If the caller cannot be heard clearly, the agent does not get credit for being smart.

WebRTC Call Quality Release Checklist

  • Pre-call quality check runs from every supported region or test edge.
  • Synthetic WebRTC call captures inbound and outbound audio stats.
  • ICE state, selected candidate type, TURN usage, and codec are saved.
  • Packet loss, jitter, RTT, audio levels, and jitter-buffer delay have warning and block thresholds.
  • Known speech samples produce expected ASR confidence and transcript quality.
  • Turn latency and interruption behavior are tested with real audio, not only text turns.
  • Call-quality stats join to transcript, audio replay pointer, trace ID, and task outcome.
  • Production dashboard segments bad outcomes by media-quality bucket.
  • Release owner can block deploys when media-quality checks fail.

Frequently Asked Questions

What is WebRTC call quality testing for voice agents?

WebRTC call quality testing measures whether the media path is clean enough for a voice agent to hear, respond, and recover naturally. Hamming recommends testing packet loss, jitter, RTT, audio levels, jitter-buffer behavior, codec, ICE state, and task outcome together because media quality problems often look like ASR or prompt failures.

Which WebRTC metrics should voice-agent teams track first?

Start with packet loss, jitter, round-trip time, audio levels, jitter-buffer delay, concealed samples, codec, and ICE connection state. According to Hamming's testing checklist, those 8 signals explain whether the call was degraded before ASR, LLM, or TTS quality should be blamed.

How much packet loss is too much for a voice agent?

Hamming treats sustained packet loss above 3% as a release-blocking signal for voice agents and sustained loss above 1% as a warning that needs investigation. Short spikes can recover, but repeated packet loss during speech usually creates missing words, low ASR confidence, and awkward turn-taking.

How should jitter be measured for voice agents?

Collect jitter from WebRTC stats during synthetic calls and production sessions, then segment it by region, device, codec, and network type. Hamming's checklist uses 30ms as a warning threshold and 50ms as a block threshold when the issue is sustained during active speech.

Is MOS enough to judge voice-agent call quality?

MOS is useful as a quality summary, but Hamming recommends pairing it with packet loss, jitter, RTT, and actual call outcomes. ITU MOS guidance uses a 1-5 opinion-score scale, while production voice-agent monitoring also needs replayable evidence for what the caller and agent actually heard.

When should WebRTC call quality checks run?

Run lightweight pre-call checks before live user sessions when possible, synthetic WebRTC checks on every deploy candidate, and production sampling continuously. Hamming's release gate treats text-only tests as necessary but incomplete because they cannot exercise real packet loss, jitter, audio levels, or ICE behavior.

How is WebRTC call quality different from latency?

Latency measures how long the agent takes to respond, while WebRTC call quality measures whether the audio path is stable enough for the conversation to work. A voice agent can have healthy P95 latency and still fail if packet loss, one-way audio, or jitter corrupts the user's speech.

What should a call-quality evidence record store?

Store the canonical call ID, trace ID, ICE state, codec, packet loss, jitter, RTT, audio levels, quality score, ASR confidence, transcript pointer, audio pointer, and task outcome. Hamming recommends keeping those 13 fields together so QA can distinguish media-path failures from agent-logic failures.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”