What is voice agent hallucination detection?

Voice agent hallucination detection checks whether a spoken AI answer is supported by approved source data, tool results, policy rules, or prior call context. Hamming recommends tracking both hallucination rate and detection coverage so teams know how many factual claims were actually checked.

How do you detect hallucinations in voice AI calls?

Attach each factual answer to the evidence it should use, then score the spoken claim against that evidence. Use tool-result checks for specific account facts, source-grounding checks for policy answers, and calibrated human review for ambiguous or high-risk calls.

What is a good hallucination rate for voice agents?

There is no universal safe rate because a wrong office-hours answer and a wrong medical instruction carry different risk. Hamming recommends zero tolerance for critical, harmful hallucinations, then separate high, medium, and low severity targets by workflow.

Can LLM judges detect voice agent hallucinations?

LLM judges can help score open-ended answers, but they should not be the only detector for critical facts. Use source-grounding, deterministic tool-result checks, and human-reviewed calibration sets to verify that the judge catches real failures instead of rewarding fluent answers.

What should be logged when a hallucination is detected?

Log the spoken claim, call ID, turn ID, agent and prompt version, source evidence, support decision, detector method, severity, and remediation owner. The transcript alone is not enough because reviewers need to know which source the answer should have matched.

How do hallucination alerts differ from compliance alerts?

Hallucination alerts focus on unsupported or contradicted factual claims, while compliance alerts focus on required or forbidden behavior. In regulated voice agents, the same call can trigger both if the agent invents a fact and violates a disclosure or escalation policy.

How do you turn hallucinations into regression tests?

Take the confirmed failure, remove private identifiers, preserve the source or tool result, define the expected answer, and add forbidden claims. Hamming recommends keeping the original severity and failure label so future reviewers understand why the test exists.

How often should hallucination detection be reviewed?

Review critical and high-severity issues immediately or same day, then review trend metrics weekly. The weekly dashboard should include hallucination rate, detection coverage, severity mix, false positive samples, false negative samples, and mean time from confirmed failure to regression test.

Voice Agent Hallucination Detection Guide

Voice agent hallucination detection is the process of checking whether a spoken AI answer is supported by the source data, tool result, policy, and conversation context the agent was allowed to use.

If your voice agent only reads scripted prompts, this guide is overkill. You have script-selection risk, not LLM hallucination risk. This guide is for teams using LLM-powered voice agents that generate novel answers, call tools, summarize account data, explain policies, or operate in regulated workflows.

The failure pattern is simple: the agent sounds confident, the call does not crash, the transcript looks fluent, and the customer walks away with the wrong fact. That is why hallucination detection belongs in production monitoring, not only in prompt review.

TL;DR: Detect voice agent hallucinations with a four-part loop: capture the answer, attach the source it should be grounded in, score the claim against that source, and turn confirmed failures into regression tests. Alert immediately only for harmful or high-impact fabrications. Route lower-severity unsupported claims into review so teams can improve the prompt, retrieval, tool call, or policy boundary without creating alert fatigue.

Hamming definition: A voice agent hallucination is a spoken claim, instruction, or tool-derived statement that is unsupported by the agent's approved sources or contradicts the call context. The important object is not the transcript alone; it is the transcript plus source evidence, tool result, severity, and remediation owner.

Quick filter: If your team cannot answer "which source should have supported this sentence?" for a failed call, your hallucination detector is measuring vibes, not factual accuracy.

Methodology Note: This guide is based on Hamming's analysis of production voice agent calls across 10K+ voice agents (2025-2026). Hamming's platform has 10M+ minutes protected. We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions.
Public source references include Microsoft, Google, and OpenAI documentation on groundedness checks, evaluation design, and graders. Benchmarks and thresholds should be calibrated by domain, traffic volume, and risk.

Last Updated: May 2026

Related Guides:

How to Evaluate Voice Agents - full evaluation loop and failure-mode coverage
Voice Agent Evaluation Metrics - hallucination rate, task success, WER, latency, and safety metrics
Voice Agent Response Coverage - turn production gaps into regression tests
Voice Agent Analytics and Post-Call Metrics - formulas and dashboards for live monitoring
Voice Agent Monitoring KPIs - production KPI selection and thresholds
Voice Agent Incident Response Runbook - triage and fix production failures
Voice Agent SLOs and Error Budgets - decide when quality regressions block releases
RAG Debugging - isolate retrieval failures from reasoning failures

How to use this guide:

1. Define which spoken claims are factual enough to check.
2. Attach each claim type to its allowed evidence: tool result, policy, knowledge chunk, or prior turn.
3. Separate confirmed hallucinations from missing evidence and ambiguous calls.
4. Route alerts by severity, not by detector confidence alone.
5. Convert every high-impact confirmed failure into a regression test.

What Counts as a Hallucination in a Voice Agent?

A hallucination is not any answer you dislike. It is a factual or policy claim that cannot be supported by the sources the agent was supposed to use.

Failure Type	Sample Scenario	Why It Matters	Detection Source
Fabricated fact	Agent invents an appointment slot, refund rule, balance, or provider name	Customer may act on something false	Database, API result, knowledge base
Wrong tool interpretation	Tool returns "not eligible" but agent says the customer qualifies	Correct tool call still becomes a bad spoken answer	Tool response and spoken transcript
Policy contradiction	Agent promises something the policy forbids	Compliance and trust risk	Policy document or guardrail rule
Context contradiction	Agent says "you asked for Monday" after caller corrected to Friday	Conversation state is being lost	Prior turns and state transition log
Unsupported extrapolation	Agent adds "your case will be approved" when only status was known	Sounds helpful but overstates evidence	Source confidence and claim type
Unsafe medical, legal, or financial advice	Agent gives a diagnosis, legal conclusion, or investment instruction outside scope	High-stakes harm	Domain policy and escalation rules

Microsoft's groundedness documentation defines ungroundedness as content that is inaccurate or not present in source materials. Google's grounding docs use a support score from 0 to 1 and claim-level citations to show whether an answer is supported by provided facts. For voice agents, use the same idea, but attach the call evidence around the spoken turn.

Grounded voice answer: a spoken answer where each factual claim can be traced to an approved source, a tool result, or a prior turn in the same call.

That last phrase matters. In voice, the "source" is often not a document. It may be a CRM response, scheduling API result, caller-provided entity, authentication state, or the specific words the agent already said.

Why Voice Hallucinations Need Production Evidence

Prompt rules help, but they are not enough.

A prompt can say "only answer from the knowledge base." Then the agent faces a caller with a noisy microphone, an ambiguous account lookup, a stale retrieval chunk, and a tool result whose field names are easy to misread. The hallucination does not come from one layer. It comes from the gap between speech, retrieval, tool use, and policy.

We used to think hallucination testing was mostly a pre-deployment problem: write known-answer questions, score the model, fix the prompt. That is still necessary. The bigger lesson from production voice systems is that factual accuracy decays when the inputs change.

Production Change	How Hallucinations Appear	Monitoring Signal
Knowledge base update	Agent cites old policy or mixes old and new rules	Grounding score drops for changed topics
Tool schema change	Agent reads the wrong field or ignores uncertainty	Tool-result mismatch rate rises
New caller phrasing	Retrieval pulls the wrong document	Unsupported-claim clusters grow
Model or prompt update	Agent becomes more fluent but less careful	Hallucination rate rises after version tag
Long call context	Agent forgets earlier correction or consent boundary	Context contradiction events increase
Noisy audio or ASR drift	Agent answers a different question than caller asked	Transcript-confidence and correction failures rise

The correction is to treat hallucination detection like production voice agent monitoring: versioned, sampled, reviewable, and connected to release policy.

The Hallucination Evidence Loop

Use this loop for every factual answer type.

Step	Artifact	Owner	Good Output
1. Capture the spoken answer	Transcript segment, audio pointer, turn ID	Platform / observability	The specific claim the caller heard
2. Attach allowed evidence	Knowledge chunk, tool result, prior turn, policy rule	Agent runtime	The source the answer should be grounded in
3. Score the claim	Grounded / unsupported / contradicted / needs review	Evaluator	A reasoned decision with confidence and cited evidence
4. Assign severity	Critical, high, medium, low	QA / compliance owner	Alert policy and review SLA
5. Fix the source of error	Prompt, retrieval, tool mapping, policy, test data	Engineering owner	Merged mitigation with version tag
6. Add regression coverage	Test case and expected behavior	QA owner	Failure cannot silently return in the next release

This is the same operating principle behind response coverage: production failures become durable tests, not one-off dashboard rows.

For grounding, start with the source type. A policy answer should be checked against policy text. An account-specific answer should be checked against the tool result. A medical triage answer should be checked against the approved triage protocol and escalation rule. A caller correction should be checked against prior turns.
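As a minimal sketch of the account-specific case, the check below compares a dollar amount in the spoken answer with the amount in the tool result. The function names, the `balance` field, and the regex are illustrative assumptions, not part of any Hamming API, and the sketch assumes the transcript has already been normalized to digits (for example "$125.50" rather than spelled-out numbers).

```python
import re
from decimal import Decimal

# Hypothetical pattern: a dollar sign followed by digits, optional commas, optional cents.
AMOUNT_PATTERN = re.compile(r"\$\s?([\d,]+(?:\.\d{1,2})?)")


def spoken_amounts(spoken_text: str) -> list[Decimal]:
    """Extract dollar amounts from a normalized transcript segment."""
    return [Decimal(m.replace(",", "")) for m in AMOUNT_PATTERN.findall(spoken_text)]


def check_balance_claim(spoken_text: str, tool_result: dict) -> str:
    """Return 'grounded', 'contradicted', or 'needs_review' for a balance claim."""
    expected = tool_result.get("balance")  # assumed field name
    if expected is None:
        return "needs_review"  # evidence missing, so the claim is not scored
    amounts = spoken_amounts(spoken_text)
    if not amounts:
        return "needs_review"  # no checkable amount found in the transcript
    expected_amount = Decimal(str(expected))
    if all(amount == expected_amount for amount in amounts):
        return "grounded"
    return "contradicted"


decision = check_balance_claim("Your balance is $152.50.", {"balance": "125.50"})
```

Here `decision` is "contradicted" because the spoken amount differs from the tool result. When the tool result is missing, the function returns "needs_review" rather than guessing, which keeps missing evidence out of the confirmed count.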

Hallucination rate = confirmed unsupported or contradicted factual claims / factual claims checked

Detection coverage = factual claims checked / factual claims eligible for checking

Confirmed critical rate = critical hallucinations / all eligible production calls

Do not report hallucination rate without detection coverage. A 0.2% hallucination rate on 3% reviewed coverage is not the same operating signal as 0.2% on 90% claim coverage.
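The three ratios can be computed together so the rate is never reported without its coverage. This is a sketch under stated assumptions: the `ClaimCounts` fields and the sample counts are hypothetical, chosen only to reproduce the 0.2% rate at 3% and 90% coverage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClaimCounts:
    eligible: int        # factual claims eligible for checking
    checked: int         # factual claims the detector actually scored
    confirmed: int       # confirmed unsupported or contradicted claims
    critical: int        # confirmed critical hallucinations
    eligible_calls: int  # all eligible production calls


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return None instead of dividing by zero when nothing was counted."""
    return numerator / denominator if denominator else None


def summarize(counts: ClaimCounts) -> dict:
    """Report the rate and its coverage side by side."""
    return {
        "hallucination_rate": ratio(counts.confirmed, counts.checked),
        "detection_coverage": ratio(counts.checked, counts.eligible),
        "confirmed_critical_rate": ratio(counts.critical, counts.eligible_calls),
    }


# Illustrative counts: the same 0.2% rate at very different coverage.
low_coverage = summarize(ClaimCounts(100_000, 3_000, 6, 1, 20_000))
high_coverage = summarize(ClaimCounts(100_000, 90_000, 180, 1, 20_000))
```

Both summaries show a hallucination rate of 0.002, but the first rests on 3% of eligible claims and the second on 90%, so only the second supports a confident statement about production behavior.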

Severity Taxonomy

Not every unsupported phrase deserves a pager. Severity should be based on harm, reversibility, and whether the caller is likely to act on the claim.

Severity	Definition	Sample Scenario	Response
Critical	Could cause safety, financial, legal, medical, or compliance harm	Agent invents payment approval, dosage guidance, fraud decision, or legal requirement	Page owner, pause risky release path, open incident
High	Changes customer action or account outcome	Agent gives wrong appointment time, refund eligibility, balance, cancellation status	Same-day review and regression test
Medium	Misstates a policy detail but has an obvious correction path	Agent gives outdated office hours or partial policy wording	Queue for QA review and source/prompt fix
Low	Unsupported wording with little customer impact	Agent embellishes a generic explanation without changing action	Track trend, sample manually
Needs review	Detector lacks enough evidence to classify	Source missing, transcript unclear, tool call unavailable	Improve logging before scoring

The highest-value change most teams can make is separating "needs review" from "confirmed hallucination." If evidence is missing, call it evidence missing. Do not let detector uncertainty inflate or hide the hallucination rate.

Which Detector Should Handle Each Answer Type?

Use the simplest detector that can prove the claim.

Answer Type	Best Detector	Why	Failure to Watch
Specific account fact	Tool-result check	The answer must match a known API result	Field-name mismatch or stale tool result
Policy explanation	Source-grounding check	Claims should be supported by approved policy text	Retrieval pulls the wrong version
Appointment or booking status	Tool-result plus prior-turn check	Caller corrections and system state both matter	Agent confirms an uncommitted slot
Compliance script	Rule/guardrail check	Required language must appear or be avoided	Paraphrase misses required wording
Open-ended support answer	Groundedness check plus human calibration	Some claims require interpretation	LLM judge rewards fluent unsupported answers
Medical, legal, or financial boundary	Policy classifier plus escalation guardrail	The agent should refuse or escalate outside scope	The detector treats unsafe advice as helpfulness
Long-call summary	Claim-level grounding	Summaries can mix accurate and false details	One wrong sentence hides inside a good summary
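The table above can be expressed as a simple lookup so that each answer type is always sent to the same detectors. The keys and detector names below are hypothetical labels for the table rows, and sending unknown answer types to human review is an assumption rather than something the table prescribes.

```python
# Hypothetical labels for the answer types and detectors in the table above.
DETECTORS_BY_ANSWER_TYPE = {
    "account_fact": ["tool_result_check"],
    "policy_explanation": ["source_grounding_check"],
    "booking_status": ["tool_result_check", "prior_turn_check"],
    "compliance_script": ["rule_guardrail_check"],
    "open_ended_support": ["groundedness_check", "human_calibration"],
    "regulated_boundary": ["policy_classifier", "escalation_guardrail"],
    "long_call_summary": ["claim_level_grounding"],
}


def detectors_for(answer_type: str) -> list[str]:
    """Return the detectors for an answer type.

    Assumption: unknown answer types go to human review instead of
    being skipped silently.
    """
    return list(DETECTORS_BY_ANSWER_TYPE.get(answer_type, ["human_review"]))


booking_detectors = detectors_for("booking_status")
```

Keeping the mapping in data rather than in branching code makes it easy to review alongside the table and to version with the agent.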

OpenAI's evaluation guidance recommends mixing production data, hard-coded correct answers, historical logs, automated scoring, and expert labels. Its grader docs also warn that model graders should be tested against human expert judgments because graders can be biased or gamed.

The voice-agent version is straightforward: use LLM judges where semantic judgment is unavoidable, but calibrate them against human-reviewed calls and source-backed samples. For critical workflows, keep deterministic checks for specific facts, required disclosures, and forbidden claims.
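One way to calibrate a judge is to compare its flags with human labels on the same calls and report precision, recall, and raw agreement. The sketch below assumes each call has one boolean judge flag and one boolean human label; the function name and the sample labels are illustrative.

```python
from typing import Optional


def safe_div(numerator: int, denominator: int) -> Optional[float]:
    """Return None when the denominator is zero."""
    return numerator / denominator if denominator else None


def calibrate(judge_flags: list[bool], human_flags: list[bool]) -> dict:
    """Compare judge hallucination flags with human-reviewed labels."""
    if len(judge_flags) != len(human_flags):
        raise ValueError("judge and human labels must cover the same calls")
    pairs = list(zip(judge_flags, human_flags))
    tp = sum(1 for j, h in pairs if j and h)          # judge flagged, human confirmed
    fp = sum(1 for j, h in pairs if j and not h)      # judge flagged, human rejected
    fn = sum(1 for j, h in pairs if not j and h)      # judge missed a real failure
    tn = sum(1 for j, h in pairs if not j and not h)  # both agree the call is clean
    return {
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "agreement": safe_div(tp + tn, len(pairs)),
        "false_positives": fp,
        "false_negatives": fn,
    }


# Illustrative labels for six reviewed calls.
judge = [True, True, False, False, True, False]
human = [True, False, True, False, True, False]
report = calibrate(judge, human)
```

In this sample the judge raises three flags and two are confirmed, and it misses one of the three real failures, so precision and recall are both two thirds. For critical workflows, recall against human-confirmed failures is usually the number to watch, because a missed failure is the case where a fluent wrong answer was rewarded.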

What to Log When a Hallucination Is Detected

If the log only says hallucination=true, it is not useful enough. The reviewer needs to know what was claimed, what source was expected, and what changed after the fix.

Use a compact evidence record:

{
  "eventName": "voice.hallucination.detected",
  "eventVersion": "2026-05-23",
  "occurredAt": "2026-05-23T16:08:31.224Z",
  "canonicalCallId": "call_01JZ9W2M7K",
  "turnId": "turn_0014",
  "agentVersion": "agent_billing_v42",
  "promptVersion": "prompt_2026_05_23",
  "claim": {
    "spokenText": "Your refund will arrive tomorrow morning.",
    "claimType": "account_specific_status",
    "heardByCaller": true
  },
  "evidence": {
    "sourceType": "tool_result",
    "sourceId": "refund_status_api_2026_05_23",
    "sourceExcerpt": "refund_status: pending_review",
    "supportDecision": "contradicted"
  },
  "severity": "high",
  "detector": {
    "method": "tool_result_check",
    "confidence": 0.94,
    "needsHumanReview": false
  },
  "remediation": {
    "owner": "billing_agent_team",
    "action": "add pending-review response branch and regression test",
    "status": "open"
  }
}

Store raw audio and transcripts according to your retention and privacy policy. For the monitoring event, keep the pointer, source type, support decision, severity, and remediation owner. Pair this with the voice agent log retention checklist if the call record may become audit evidence.
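A small validator can reject events that lack the fields reviewers need before they reach the dashboard. This sketch reuses the field names from the evidence record above. The choice of which fields are required and the `missing_fields` helper are assumptions to adapt to your own schema.

```python
# Assumed required fields, written as paths into the nested event.
REQUIRED_PATHS = [
    ("canonicalCallId",),
    ("turnId",),
    ("agentVersion",),
    ("promptVersion",),
    ("claim", "spokenText"),
    ("evidence", "sourceType"),
    ("evidence", "sourceId"),
    ("evidence", "supportDecision"),
    ("severity",),
    ("detector", "method"),
    ("remediation", "owner"),
]


def missing_fields(event: dict) -> list[str]:
    """Return dotted paths for required fields that are absent or empty."""
    missing = []
    for path in REQUIRED_PATHS:
        node = event
        for key in path:
            if not isinstance(node, dict) or node.get(key) in (None, ""):
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


# An event that only says "hallucination=true" fails every requirement.
bare_event = {"hallucination": True}
problems = missing_fields(bare_event)
```

For the bare event, `problems` lists all eleven required paths. An event shaped like the record above returns an empty list.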

How to Alert, Review, and Triage

Alert policy should follow severity, not detector excitement.

Condition	Alert Channel	Review SLA	Action
Critical hallucination in regulated or safety workflow	Pager / incident channel	Immediate	Triage call, pause risky release, notify owner
High-severity account or workflow claim	QA + engineering channel	Same business day	Confirm evidence, patch workflow, add test
Medium unsupported policy wording	QA queue	2-5 business days	Review cluster, update source or prompt
Low unsupported embellishment	Trend dashboard	Weekly	Sample and watch version drift
Needs-review due to missing source	Observability backlog	Next sprint	Fix logging, do not over-score

Do not route every hallucination flag to the same channel. A detector that pages for low-impact paraphrases will be muted. A detector that buries critical unsupported advice in a weekly QA queue fails the other way: the harm reaches customers before anyone reviews it.

Tie severe hallucinations to your incident response runbook and your voice agent SLOs. A critical hallucination should usually burn error budget faster than a mild latency regression.
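The routing policy can be sketched as a lookup keyed on severity only, so a confident detector cannot escalate a low-impact flag by itself. The channel names are hypothetical, and treating an unknown severity as needs-review is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    channel: str
    review_sla: str


# Hypothetical channel names; SLAs follow the alert table in this section.
ROUTES = {
    "critical": Route("pager", "immediate"),
    "high": Route("qa_engineering", "same business day"),
    "medium": Route("qa_queue", "2-5 business days"),
    "low": Route("trend_dashboard", "weekly"),
    "needs_review": Route("observability_backlog", "next sprint"),
}


def route_alert(event: dict) -> Route:
    """Pick a channel from severity alone; detector confidence is ignored here."""
    severity = event.get("severity", "needs_review")
    # Assumption: an unrecognized severity is a logging gap, not a page.
    return ROUTES.get(severity, ROUTES["needs_review"])


low_but_confident = {"severity": "low", "detector": {"confidence": 0.99}}
critical_but_unsure = {"severity": "critical", "detector": {"confidence": 0.55}}
routes = (route_alert(low_but_confident), route_alert(critical_but_unsure))
```

The low-severity event goes to the trend dashboard despite its 0.99 confidence, and the critical event pages despite its 0.55 confidence. Confidence can still decide whether a flag needs human confirmation, but it does not choose the channel.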

How to Turn Hallucinations Into Regression Tests

The fix is not complete when the prompt is edited. The fix is complete when the failure cannot silently return.

Use this conversion workflow:

1. Pull the confirmed hallucination event.
2. Strip private identifiers and keep only the necessary scenario shape.
3. Preserve the allowed source, tool result, or policy rule.
4. Write the expected answer or refusal behavior.
5. Add a guardrail for the forbidden claim.
6. Run the test against the fixed agent and the previous version.
7. Keep the test with the original failure label and severity.

Sample regression case:

test_name: refund_status_pending_review_no_promise
source_failure: production_hallucination_high
caller_goal: ask when a refund will arrive
tool_result:
  refund_status: pending_review
  estimated_arrival_date: null
expected_agent_behavior:
  - explain that the refund is still under review
  - do not promise a date
  - offer a human transfer or follow-up path
forbidden_claims:
  - "will arrive tomorrow"
  - "approved"
  - "guaranteed"

This connects hallucination detection to testing voice agents for production reliability. The monitoring loop finds the failure; the regression suite makes sure the next release does not reintroduce it.
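The forbidden-claims half of a regression case can be checked deterministically. The sketch below is a simple case-insensitive substring match over the agent's response. The `forbidden_hits` helper and the two sample responses are hypothetical, and the expected behaviors (explaining the review status, offering a transfer) would still need a semantic check that is not shown here.

```python
# Forbidden claims copied from the sample regression case.
FORBIDDEN_CLAIMS = ["will arrive tomorrow", "approved", "guaranteed"]


def forbidden_hits(response: str, forbidden_claims: list[str]) -> list[str]:
    """Return the forbidden claims that appear in the response, ignoring case."""
    lowered = response.lower()
    return [claim for claim in forbidden_claims if claim.lower() in lowered]


# Hypothetical agent outputs: the previous version and the fixed version.
previous_version = "Your refund is approved and will arrive tomorrow morning."
fixed_version = (
    "Your refund is still under review, so I can't give you a date yet. "
    "Would you like me to transfer you to a specialist?"
)

previous_hits = forbidden_hits(previous_version, FORBIDDEN_CLAIMS)
fixed_hits = forbidden_hits(fixed_version, FORBIDDEN_CLAIMS)
```

The previous version trips two forbidden claims and the fixed version trips none, which is the pass condition for running the test against both versions. Substring matching is deliberately strict: a phrase such as "not yet approved" would also be flagged, so treat hits as items to review rather than as final verdicts.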

Common Mistakes

Mistake	Why It Fails	Better Practice
Measuring only final transcript fluency	Fluent wrong answers look good	Check each factual claim against evidence
Treating all unsupported claims equally	Alert fatigue hides critical failures	Use severity and review SLAs
Using one LLM judge without calibration	The judge can miss domain-specific truth	Compare against human-reviewed samples
Ignoring tool results	Many voice hallucinations are tool interpretation errors	Log tool input, output, and spoken claim together
Reporting hallucination rate without coverage	Low review volume can hide risk	Report detection coverage beside rate
Fixing prompts without adding tests	The same failure can return next release	Convert confirmed failures into regression cases
Letting stale sources remain in retrieval	The model may ground itself in obsolete truth	Version source data and alert on stale citations

This is especially important in healthcare, finance, insurance, and debt collection. Pair this guide with regulatory script adherence testing and HIPAA voice agent testing when the agent handles regulated content.

30-Day Rollout Checklist

Week	Work	Exit Criteria
1	Define factual claim types and severity levels	Critical, high, medium, low, and needs-review are documented
1	Attach sources to answer types	Each factual flow maps to a tool result, policy, KB chunk, or prior turn
2	Start detector coverage on highest-risk flows	At least 80% of critical factual claims are checked
2	Add evidence event schema	Reviewers can see claim, source, decision, confidence, and owner
3	Calibrate with human-reviewed samples	Detector false positives and false negatives are sampled
3	Route alerts by severity	Critical and high issues reach the right owners without noisy low-severity pages
4	Convert confirmed failures into regression tests	Every high/critical confirmed hallucination has a test and owner
4	Add dashboard review	Hallucination rate, detection coverage, severity mix, and mean time to regression are reviewed weekly

The first useful milestone is not perfection. It is a loop where the team can prove why a claim was wrong, who owns the fix, and which test now prevents it from coming back.

Sumanyu Sharma
