Insurance Claims Intake Voice Agent Testing Runbook

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

June 22, 2026 | Updated June 22, 2026 | 14 min read

Insurance claims intake voice agent testing should prove that the caller's loss was captured, validated, routed, and preserved with evidence. It is not enough for the agent to sound empathetic and say, "I have started your claim."

If the agent only answers policy FAQs, use a smaller voice agent workflow testing runbook. This runbook is for teams where a call can create a first notice of loss (FNOL), update a claim file, trigger an adjuster assignment, request documents, or route a distressed claimant to a human.

The failure mode is blunt: the transcript sounds right, but the downstream claim is still a draft, missing a reporter, attached to an unverified policy, assigned to the wrong path, or impossible to reconstruct during review.

TL;DR: Test claims intake as a stateful workflow:

  • State: draft, open, incomplete, escalated, duplicate, withdrawn, or failed.
  • Required fields: policy, loss date, reporter, contact route, incident facts, exposure, attachments, and consent.
  • Invariants: no future loss dates, no claim submission without required fields, no coverage promise, no payment promise, no silent PII leak.
  • Evidence: transcript span, tool trace, claim ID, validation result, escalation reason, reviewer route, and cleanup status.

A claims-intake test passes only when the spoken outcome, tool calls, final claim state, and audit evidence agree.

Methodology Note: This runbook is based on Hamming's analysis of 4M+ production voice agent calls where claim intake, workflow state, and evidence quality changed the customer outcome across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions.

It also uses public Guidewire and NAIC documentation to ground the FNOL state model and claim-file evidence samples. It is not legal advice; map these patterns to your own claim system, product line, and jurisdiction.

Last Updated: June 2026

What Makes Claims Intake Hard to Test?

Claims intake is non-deterministic because callers do not report losses in a neat order.

They call from accident scenes. They forget policy numbers. They give a date, then correct it. They mention injuries late. They ask whether something is covered before the agent has enough facts. They may need empathy, but the system still needs structured data.

The test should not demand one fixed conversation. It should demand the right outcome.

Claims-intake correctness: the agent collected enough facts for the allowed claim state, avoided unsupported coverage or payment promises, escalated high-risk cases, and preserved evidence that lets a reviewer reconstruct what happened.

That is a different test from "did the agent answer politely?" Politeness matters. It does not prove the workflow.

We used to treat claims intake as a long form-filling call. Now we treat it as a state transition with guardrails. The agent can take many valid paths, but some things should never happen.

The FNOL State Model to Test

Start with the claim lifecycle your system actually uses. For many property and casualty systems, the useful model begins with FNOL: first notice of loss.

Guidewire's ClaimCenter documentation describes an FNOL process where a claim can begin as a draft, then become open when enough information exists to enter adjudication. It also distinguishes verified policies from unverified policies. That distinction is why claims intake tests need state assertions.

| State | What It Means | Voice-Agent Test Assertion | Block Release When |
|---|---|---|---|
| No claim created | Caller gave too little information or declined to proceed | Agent explains what is missing and writes no claim | Agent says a claim was started but no record exists |
| Draft claim | Enough information exists to save an intake record, but not enough to submit | Draft has policy/loss/reporter evidence and missing-field reason | Draft is saved without required traceability |
| Open claim | Claim met validation and assignment requirements | Open claim ID exists, assignment route is recorded, caller receives safe next steps | Agent says claim is open while backend remains draft |
| Escalated claim | Human review is required because risk, ambiguity, distress, fraud, injury, or coverage uncertainty is present | Escalation reason, handoff target, and evidence packet exist | Agent continues as if routine intake is allowed |
| Duplicate candidate | Caller may be reporting the same loss again | Agent checks prior claim context and avoids duplicate submission | A retry or repeat caller creates a second claim |
| Failed intake | Tool, validation, or assignment failed | Agent gives safe fallback and preserves failed-state evidence | Agent promises success after an error |

Guidewire's draft-claim submission docs also show why "created" is not enough. A draft can fail submission because of validation or assignment errors. Your test should catch that, not smooth it over.
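A state assertion of this kind can be sketched in a few lines. This is a minimal illustration, not Guidewire's API: the `ClaimState` values mirror the table above, while `IntakeResult`, `spoken_state`, `backend_state`, and `submit_error` are hypothetical names for whatever your transcript labeling and claims system actually expose.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ClaimState(Enum):
    NO_CLAIM = "no_claim"
    DRAFT = "draft"
    OPEN = "open"
    ESCALATED = "escalated"
    DUPLICATE_CANDIDATE = "duplicate_candidate"
    FAILED = "failed"


@dataclass
class IntakeResult:
    # Hypothetical fields: what the agent told the caller vs. what the
    # system of record reports after the call.
    spoken_state: ClaimState
    backend_state: ClaimState
    submit_error: Optional[str] = None


def state_assertion_failures(result: IntakeResult) -> List[str]:
    """Return human-readable failures; an empty list means the states agree."""
    failures = []
    if result.spoken_state != result.backend_state:
        failures.append(
            f"agent said {result.spoken_state.value} "
            f"but backend is {result.backend_state.value}"
        )
    # A validation or assignment error must never be narrated as success.
    if result.submit_error and result.spoken_state == ClaimState.OPEN:
        failures.append("agent promised success after a submission error")
    return failures
```

For example, a run where the agent announces an open claim while the backend still holds a draft after an assignment error would return two failures, which is the "created is not enough" case described above.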

Build a Claims Intake Test Matrix

Write the matrix before you write the prompt. A good claims-intake matrix names the caller condition, fixture state, allowed branch, forbidden branch, and evidence.

| Scenario | Fixture Setup | Expected Outcome | Forbidden Outcome | Evidence Required |
|---|---|---|---|---|
| Clean auto FNOL | Verified policy, loss date today, reporter matches policyholder | Draft or open claim with loss facts, reporter, vehicle, contact, and next step | Agent invents coverage or settlement timing | Claim ID, policy lookup, field map, transcript spans |
| Missing policy number | Caller knows name/address but not policy number | Agent uses approved lookup path or creates allowed unverified-policy flow | Agent blocks without offering supported lookup | Lookup trace, verification decision, missing-field note |
| Date conflict | Caller first says "yesterday," later says "last Friday" | Agent clarifies and stores one resolved loss date with rationale | Both dates appear as final truth | Clarification span, resolved field, reviewer note |
| Injury or urgent safety signal | Caller mentions injury, medical emergency, threat, or unsafe location | Agent escalates or gives approved safety handoff | Agent continues routine intake | Escalation trigger, timestamp, handoff result |
| Fraud or inconsistency signal | Caller gives inconsistent ownership, location, or timing facts | Agent records inconsistency and routes to review | Agent accuses caller or ignores conflict | Conflict fields, review queue ID, neutral language check |
| Document request | Claim needs photos, receipts, police report, or inventory | Agent requests allowed documents and records channel | Agent asks for unsupported sensitive data | Document request list, delivery channel, consent |
| Duplicate loss | Same caller reports same incident twice | Agent detects prior claim candidate and routes safely | Duplicate open claim is created | Prior-claim lookup, duplicate decision, final state |
| Tool timeout | Policy or claims API times out | Agent gives safe fallback and writes no false success | Agent says the claim is submitted | Error trace, no-write assertion, fallback transcript |

NAIC consumer guidance is useful for scenario design because it names the kinds of information claimants are often asked to provide: insurance information, contact information, damage descriptions, inventories, photos or videos, repair receipts, and follow-up communication.

Do not turn that into a universal legal checklist. Use it to make the test cases feel like real claims calls.
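One way to keep matrix rows machine-checkable is to store each row as data and evaluate a finished run against it. The sketch below encodes two rows from the matrix. The `Scenario` class, the outcome labels such as `two_final_loss_dates`, and the specific `allowed_states` chosen for each row are illustrative assumptions; the matrix itself does not fix them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    fixture: dict
    allowed_states: frozenset
    forbidden_outcomes: frozenset
    required_evidence: frozenset


DATE_CONFLICT = Scenario(
    name="date_conflict",
    fixture={"policy_state": "verified", "caller_says": ["yesterday", "last Friday"]},
    allowed_states=frozenset({"draft", "open"}),  # assumed for illustration
    forbidden_outcomes=frozenset({"two_final_loss_dates"}),
    required_evidence=frozenset({"clarification_span", "resolved_field", "reviewer_note"}),
)

TOOL_TIMEOUT = Scenario(
    name="tool_timeout",
    fixture={"policy_api": "timeout"},
    allowed_states=frozenset({"no_claim", "failed"}),  # assumed for illustration
    forbidden_outcomes=frozenset({"claimed_submitted"}),
    required_evidence=frozenset({"error_trace", "no_write_assertion", "fallback_transcript"}),
)


def evaluate(scenario, final_state, observed_outcomes, evidence_keys):
    """Compare one run against its matrix row and list every mismatch."""
    problems = []
    if final_state not in scenario.allowed_states:
        problems.append(f"{scenario.name}: state '{final_state}' not allowed")
    for outcome in sorted(scenario.forbidden_outcomes & set(observed_outcomes)):
        problems.append(f"{scenario.name}: forbidden outcome '{outcome}'")
    for missing in sorted(scenario.required_evidence - set(evidence_keys)):
        problems.append(f"{scenario.name}: missing evidence '{missing}'")
    return problems
```

Keeping the forbidden branch and the evidence list in the same record as the fixture means a new scenario cannot be added without stating what must not happen and what proof is expected.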

Invariants That Should Never Break

Non-deterministic paths still need deterministic rules.

These are the rules that should fail the run every time, no matter how natural the conversation sounded.

| Invariant | Why It Matters | Test Method |
|---|---|---|
| Loss date is not in the future | Future loss dates usually indicate ASR, caller, or extraction error | Schema check plus transcript span |
| Required reporter field exists before submission | A claim without reporter context may fail validation or review | Field completeness check |
| Policy status is explicit | Unknown policy state should not become a confident coverage statement | Policy lookup trace and language check |
| No guaranteed coverage statement | The agent should not promise coverage, liability, or payment before adjudication | Semantic prohibited-claim evaluator |
| High-risk triggers escalate | Injury, fraud, distress, threat, legal complaint, or vulnerable-caller signals need a human path | Classifier plus handoff assertion |
| Duplicate protection exists | Repeated calls and retries should not create duplicate claims | Idempotency key and prior-claim lookup |
| Evidence packet is complete | Claims workflows need reconstruction, not just a transcript | Metadata completeness check |
| Sensitive fields are minimized | Claim details can include PII, financial facts, medical details, or addresses | PII/security review and redaction policy |

Invariant check: a deterministic rule that must pass even when the conversation path varies. For claims intake, invariants are the guardrails that keep a flexible agent from creating unsafe claim records.

Pair these with structured output validation. The agent can summarize beautifully and still extract the wrong loss date, claimant role, address, or exposure.
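Three of these invariants are simple enough to sketch as deterministic checks. The field names follow the evidence packet used later in this runbook; the `submitted` flag and the phrase list are assumptions for illustration. In particular, the phrase match is a deliberately naive stand-in for the semantic prohibited-claim evaluator named in the table, not a replacement for it.

```python
from datetime import date
from typing import Dict, List

REQUIRED_BEFORE_SUBMIT = (
    "policy_number", "loss_date", "reporter", "contact_method", "incident_summary",
)

# Illustrative only: a real evaluator should be semantic, not a phrase list.
PROHIBITED_PHRASES = (
    "you are covered", "this is covered", "we will pay", "your claim is approved",
)


def check_invariants(claim: dict, agent_utterances: List[str], today: date) -> Dict[str, str]:
    """Return pass/fail per invariant for one extracted claim and its transcript."""
    results = {}

    # A missing loss date is handled by the required-field check below.
    loss_date = claim.get("loss_date")
    future = loss_date is not None and loss_date > today
    results["no_future_loss_date"] = "fail" if future else "pass"

    # Only submitted claims must be complete; drafts may have gaps.
    missing = [f for f in REQUIRED_BEFORE_SUBMIT if not claim.get(f)]
    incomplete_submit = bool(claim.get("submitted")) and bool(missing)
    results["required_fields_before_submit"] = "fail" if incomplete_submit else "pass"

    spoken = " ".join(agent_utterances).lower()
    promised = any(phrase in spoken for phrase in PROHIBITED_PHRASES)
    results["no_coverage_promise"] = "fail" if promised else "pass"
    return results
```

Passing `today` in explicitly keeps the check reproducible in CI, since a fixture recorded last week should not start failing or passing because the clock moved.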

How to Test Non-Deterministic Branches

Do not make CI flaky by requiring identical transcripts.

Use repeated scenario runs for the conversational layer, but deterministic checks for the state layer.

| Layer | What Varies | What Must Stay Stable | Suggested Gate |
|---|---|---|---|
| Conversation | Wording, order of clarification, empathy phrasing | Required information is requested before submission | 20-50 runs for critical paths |
| Tool calls | Retry timing, optional lookup path, handoff timing | Allowed tools only, correct order for state-changing calls | 99% allowed-tool compliance |
| Structured claim | Caller phrasing and ASR variants | Final normalized fields match the fixture or human-reviewed answer | 95%+ field correctness for blocking fields |
| Escalation | Caller wording and emotional state | High-risk signals route to the correct human path | 100% for severe injury, fraud, legal, or safety triggers |
| Evidence | Transcript chunking and reviewer fields | Required packet fields are present | 99%+ completeness |

This is where sandbox side-effect testing matters. Run the claim write against fixture data or a test claims environment. If a production write is unavoidable, keep it allowlisted, owner-approved, and outside normal CI.

For customer-specific claim rules, use the customer workflow rules testing template. One insurer may require escalation for glass claims above one threshold; another may route them automatically. The test should prove the active rule version, not a generic industry assumption.
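The layer gates in the table above could be evaluated over repeated runs roughly as follows. The thresholds come from the table; the shape of `run_results` (one dict of layer name to pass/fail per run) and the layer keys are assumptions. The conversation layer is omitted because its gate is a run count rather than a pass rate.

```python
from typing import Dict, List

# Minimum pass rate per layer, taken from the suggested gates above.
LAYER_THRESHOLDS = {
    "tool_calls": 0.99,
    "structured_claim": 0.95,
    "escalation": 1.0,
    "evidence": 0.99,
}


def gate_failures(run_results: List[Dict[str, bool]]) -> Dict[str, float]:
    """Return each layer whose pass rate is below its threshold, with the rate."""
    failures = {}
    for layer, threshold in LAYER_THRESHOLDS.items():
        outcomes = [run[layer] for run in run_results if layer in run]
        if not outcomes:
            # Layer was not measured in these runs; this sketch skips it,
            # though a stricter suite might treat that as a failure.
            continue
        rate = sum(outcomes) / len(outcomes)
        if rate < threshold:
            failures[layer] = rate
    return failures
```

With 20 runs and a single missed escalation, the escalation layer fails its 100% gate even though every other layer passes, which matches the intent that severe triggers tolerate no misses.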

Evidence Packet Template

A claims-intake run should leave enough evidence for QA, claims operations, compliance, and engineering to agree on what happened.

{
  "run_id": "fnol_run_2026_06_22_014",
  "call_id": "call_fixture_883",
  "agent_version": "claims-intake-agent-v17",
  "claim_fixture": {
    "line_of_business": "personal_auto",
    "policy_state": "verified",
    "caller_role": "policyholder"
  },
  "expected_state": "open_claim_or_escalated",
  "actual_state": "open_claim",
  "required_fields": {
    "policy_number": "present",
    "loss_date": "present",
    "reporter": "present",
    "contact_method": "present",
    "incident_summary": "present"
  },
  "tool_trace": [
    {
      "tool": "lookup_policy",
      "status": "verified"
    },
    {
      "tool": "create_draft_claim",
      "status": "created",
      "fixture_claim_id": "claim_fixture_991"
    },
    {
      "tool": "submit_claim",
      "status": "submitted"
    }
  ],
  "invariant_results": {
    "no_future_loss_date": "pass",
    "no_coverage_promise": "pass",
    "duplicate_check": "pass",
    "pii_redaction": "pass"
  },
  "review_route": "claims_ops_sample",
  "cleanup_status": "verified"
}

The packet should not expose raw PII in dashboards or alert payloads. Keep raw evidence under the right access controls, then export redacted fields for QA and engineering. For broader evidence handoff, use the voice agent call evidence export runbook.
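A completeness check and a redacted export for a packet like the one above might look like this. The required keys are the top-level keys of the template; the `SENSITIVE_KEYS` names are hypothetical, since the template does not show raw PII fields, and should be replaced with whatever your approved redaction policy lists.

```python
REQUIRED_PACKET_KEYS = (
    "run_id", "call_id", "agent_version", "expected_state", "actual_state",
    "required_fields", "tool_trace", "invariant_results", "review_route",
    "cleanup_status",
)

# Hypothetical raw-evidence keys; substitute your own policy's list.
SENSITIVE_KEYS = {"caller_name", "caller_phone", "loss_address", "transcript_text"}


def missing_packet_keys(packet: dict) -> list:
    """List required top-level keys that the packet does not contain."""
    return [key for key in REQUIRED_PACKET_KEYS if key not in packet]


def redact_for_dashboard(packet: dict) -> dict:
    """Return a redacted copy for broad sharing; the input is left unchanged."""
    def scrub(value):
        if isinstance(value, dict):
            return {
                key: "[REDACTED]" if key in SENSITIVE_KEYS else scrub(inner)
                for key, inner in value.items()
            }
        if isinstance(value, list):
            return [scrub(item) for item in value]
        return value

    return scrub(packet)
```

Redacting by key name at export time keeps the raw packet intact under access control while dashboards and alert payloads only ever see the scrubbed copy.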

The NAIC model regulation on unfair claims settlement practices emphasizes claim-file documentation and retrievable claim data in its model language. Treat that as a useful reminder: a claims-intake agent should create reviewable evidence, not just a nice conversation.

What Belongs in CI?

Claims tests get expensive quickly. Keep the blocking suite small and schedule the rest.

| Gate | Run When | Size | Blocks Merge? |
|---|---|---|---|
| Field extraction invariants | Prompt, ASR, entity extraction, schema, or tool changes | 20-80 cases | Yes |
| Mocked claims workflow | Tool wrapper or orchestration changes | 5-15 cases | Yes |
| Sandbox FNOL flow | Claim write, assignment, escalation, or document-request changes | 3-8 fixture cases | Yes for critical flows |
| Repeated stochastic scenarios | Model, prompt, or policy changes | 20-50 runs per critical scenario | Yes for severe risk paths |
| Phone-path tests | Telephony, transfer, audio, or interruption changes | 2-5 calls | Usually pre-release |
| Production review sampling | Live monitoring | 1-5% of eligible calls or top failure clusters | No, alert and triage |

When a production call fails, convert it into a replayable regression with the failed production call regression test runbook. For daily operations, use production call review triage to avoid reviewing thousands of calls when only 20 contain useful learning.
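To keep the blocking suite small, a CI job could select gates from the areas a change touches. The mapping below restates the "Run When" column of the table; the area labels such as `tool_wrapper` and the way a pipeline would detect them are assumptions. Production review sampling is left out because it runs on live monitoring rather than on a change.

```python
from typing import List, Set

# Gate name -> change areas that trigger it, restated from the table above.
GATE_TRIGGERS = {
    "field_extraction_invariants": {"prompt", "asr", "entity_extraction", "schema", "tool"},
    "mocked_claims_workflow": {"tool_wrapper", "orchestration"},
    "sandbox_fnol_flow": {"claim_write", "assignment", "escalation", "document_request"},
    "repeated_stochastic_scenarios": {"model", "prompt", "policy"},
    "phone_path_tests": {"telephony", "transfer", "audio", "interruption"},
}


def gates_for_change(changed_areas: Set[str]) -> List[str]:
    """Return the gates to run, in table order, for the given change areas."""
    return [
        gate for gate, triggers in GATE_TRIGGERS.items()
        if triggers & changed_areas
    ]
```

A prompt-only change would select the field extraction invariants and the repeated stochastic scenarios, and skip the sandbox and phone-path gates.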

Flaws But Not Dealbreakers

State rules vary by insurer and jurisdiction. This runbook gives a testing structure, not a universal claims policy. Compliance and claims operations should own the obligations.

Synthetic claims can overfit. If every fixture is clean, the agent will look better than it is. Add messy caller phrasing, missing data, background noise, conflicting facts, and interrupted calls.

LLM judges are not claim adjusters. Use them to aggregate semantic signals: missing facts, unsafe promises, escalation cues, and empathy quality. Do not let a judge decide coverage, liability, or payment.

The system of record is the source of truth. A transcript is evidence. The claim object, policy lookup, validation result, and review route prove whether intake actually worked.

Claims Intake Launch Checklist

  • Every claim path has a fixture: clean, missing data, conflict, duplicate, high-risk, tool failure, and escalation.
  • Required fields are separated from optional fields.
  • Draft vs open claim state is asserted explicitly.
  • Every claim write uses an idempotency key or duplicate guard.
  • Coverage, liability, settlement, and payment promises are prohibited unless your approved policy allows them.
  • High-risk triggers route to a human with an evidence packet.
  • PII, medical, financial, and address fields follow the approved retention and redaction policy.
  • Failed production claims become regression tests before the next prompt or model change.

For regulated language, pair this checklist with regulatory script adherence testing. Claims intake is not only a data-capture problem; it is also a disclosure, evidence, and escalation problem.
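The idempotency-key item in the checklist can be illustrated with a small in-memory stand-in for a claims system. Everything here is a sketch under assumptions: the key is built from policy number, loss date, and reporter, which is one plausible choice rather than a prescribed one, and `FakeClaimsStore` is a hypothetical test double, not a real claims API.

```python
import hashlib
from typing import Dict, Tuple


def claim_idempotency_key(policy_number: str, loss_date_iso: str, reporter_id: str) -> str:
    """Derive a stable key so retries of the same report map to one claim."""
    normalized = "|".join([
        policy_number.strip().upper(),
        loss_date_iso.strip(),
        reporter_id.strip().lower(),
    ])
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class FakeClaimsStore:
    """Hypothetical in-memory test double for a claims write API."""

    def __init__(self) -> None:
        self._claims: Dict[str, str] = {}

    def create_draft(self, key: str, payload: dict) -> Tuple[str, bool]:
        """Return (claim_id, created). A repeated key returns the prior claim."""
        if key in self._claims:
            return self._claims[key], False
        claim_id = f"claim_fixture_{len(self._claims) + 1}"
        self._claims[key] = claim_id
        return claim_id, True
```

A test can then call `create_draft` twice with the same key and assert that the second call returns the first claim ID with `created` set to false. A key match should mark a duplicate candidate for the prior-claim check rather than silently reject the caller, since two genuine losses can share a policy, date, and reporter.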

Insurance Claims Intake Testing FAQ

What is insurance claims intake voice agent testing?

Insurance claims intake voice agent testing verifies that an AI voice agent can collect first notice of loss facts, validate required fields, route exceptions, and preserve evidence for review. The test should check the final claim state and evidence packet, not only the transcript.

How do I test non-deterministic claims paths?

Run repeated scenario variants for caller phrasing, missing details, interruptions, and contradictions, then assert deterministic invariants such as required fields, no coverage promise, duplicate protection, and correct escalation. Hamming recommends measuring state outcomes across many valid paths instead of requiring one fixed script.

What should a claims intake test matrix include?

A claims intake test matrix should include fixture setup, caller goal, expected claim state, forbidden outcome, required fields, escalation trigger, tool trace, and cleanup evidence. The most useful rows cover clean FNOL, missing policy number, date conflict, injury or safety signal, duplicate claim, fraud signal, and tool failure.

Should claims intake tests create real claims?

Most CI tests should not create production claims. Use mocks for fast deterministic checks, sandbox claims environments for fixture-backed workflow tests, and tightly scoped live checks only when a release owner approves the risk.

What evidence should a passing FNOL test store?

A passing FNOL test should store run ID, call ID, agent version, policy lookup result, extracted required fields, claim state, tool trace, invariant results, review route, and cleanup status. Sensitive fields should be redacted in broad dashboards while raw evidence stays under the right access controls.

Which claims intake failures should block release?

Block release when the agent submits incomplete claims, creates duplicates, promises coverage or payment without authority, misses injury or fraud escalation, leaks sensitive data, or claims success after a tool error. Hamming treats these as state and evidence failures, not wording issues.

How does this differ from generic voice agent workflow testing?

Generic workflow testing checks that an agent followed the right branch and tool sequence. Claims intake testing adds insurance-specific state, required fields, documentation, escalation, duplicate protection, and audit evidence.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”