Voice agent caller identity testing answers a narrow but expensive question: did the agent know who was calling, and did it know for the right reason?
If your agent only answers public FAQs, this is probably too much process. But once the agent can greet a caller by name, access account status, continue a previous case, route to a protected queue, update a record, or disclose regulated information, caller identity becomes a release gate.
The failure mode is not subtle. We call it the wrong-account warm start: the agent starts confidently with the wrong customer context because the number, CRM record, dynamic variable, or model-forwarded field was trusted too early.
Voice agent caller identity testing verifies that server-trusted caller signals, backend lookup results, explicit verification steps, and model-visible context agree before the agent personalizes a call or touches account data.
TL;DR: Treat caller identity as a boundary test:
- Use caller ID as a lookup signal, not a final authorization decision.
- Keep trusted identity fields outside the LLM-facing schema.
- Test matched, unknown, duplicate, spoofed, stale, timeout, and mid-call-change scenarios.
- Assert which variables entered the prompt, which variables reached tools, and which data stayed server-side.
- Store call ID, lookup trace, verification result, and fallback evidence with every test run.
Methodology Note: This checklist is based on Hamming's analysis of 4M+ production voice agent calls where identity, routing, tool calls, and account context affected the caller experience across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions. The checklist also uses public Vapi, Retell, and Twilio documentation to ground provider-specific webhook, dynamic-variable, and caller-context checks.
Last Updated: May 2026
Related Guides:
- Voice Agent Workflow Testing - state, tool-call, side-effect, and handoff assertions
- Voice Agent Tests as Code - store caller fixtures and assertions in Git
- Voice Agent Production Readiness Checklist - launch gates for critical flows
- HIPAA and Clinical Workflow Testing - identity verification in healthcare workflows
- Regulatory Script Adherence - ordered compliance steps before restricted actions
- Voice Agent Security Review Questions - vendor review questions for security-sensitive deployments
- WebSocket Voice Agent Testing - endpoint tests before phone-path complexity
- Hamming vs Coval - caller identity controls in vendor bake-offs
What Caller Identity Testing Should Prove
Caller identity testing is not the same thing as asking the caller to say their name.
The test needs to prove 4 things:
| Layer | What It Proves | Sample Failure |
|---|---|---|
| Signaling | The inbound number, dialed number, call ID, SIP/Twilio metadata, or provider context was captured correctly. | The call arrives as anonymous but the agent still loads account context. |
| Backend lookup | The trusted backend matched, rejected, or disambiguated the caller before personalization. | 2 records share a household number and the agent picks one silently. |
| Model-visible context | Only approved fields entered prompts, dynamic variables, or tool-visible context. | Full account status enters the prompt before verification. |
| Authorization | The agent completed the required verification step before sensitive action or disclosure. | Caller hears appointment, balance, PHI, or case status without step-up verification. |
The model can help route the conversation. It should not be the source of truth for who the caller is.
Caller identity boundary: the line between data that your backend or telephony layer knows and claims that the caller or model can influence. A voice agent test should prove that trusted fields cross this boundary only through server-controlled variables, not through natural language.
The Caller Identity Test Contract
Start every caller identity test with a small contract. If the test cannot name the source of truth, the test is not ready.
Caller identity test contract =
inbound caller signal
+ dialed number or route
+ expected lookup result
+ allowed model-visible context
+ required verification step
+ allowed tool parameters
+ fallback decision
+ evidence retention
| Field | Required? | Sample |
|---|---|---|
| Test ID | Yes | caller_identity_duplicate_household_number |
| Caller signal | Yes | +15551234567, anonymous, alternate caller number, SIP URI, or provider customer object |
| Dialed route | Yes | Support line, billing line, clinic scheduling line, collections line |
| Lookup fixture | Yes | 0 matches, 1 match, 2 matches, stale match, suspended account |
| Model-visible fields | Yes | First name only, no account balance, no PHI, no payment details |
| Trusted tool fields | Yes | Server-injected account_id, call_id, caller_number, lookup_confidence |
| Verification rule | Yes | DOB last 4, SMS OTP, verbal consent, policy handoff, or no access |
| Fallback | Yes | Generic greeting, step-up auth, human handoff, or call rejection |
| Evidence | Yes | Lookup trace, variable bag, tool args, transcript, final decision |
This contract should live near your tests-as-code definitions. The important part is reviewability: a teammate should see what identity data the agent gets before the test runs.
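One way to keep the contract reviewable is to store it as a typed fixture. The sketch below is a minimal illustration, not a Hamming schema: the class name `CallerIdentityContract`, its field names, and the fixture values are assumptions chosen to mirror the table above.

```python
from dataclasses import dataclass, fields, replace
from typing import List, Tuple


@dataclass(frozen=True)
class CallerIdentityContract:
    """Hypothetical fixture shape mirroring the contract table."""
    test_id: str
    caller_signal: str
    dialed_route: str
    lookup_fixture: str
    verification_rule: str
    fallback: str
    # An empty tuple is a valid answer: nothing is exposed or injected.
    model_visible_fields: Tuple[str, ...] = ()
    trusted_tool_fields: Tuple[str, ...] = ()
    evidence: Tuple[str, ...] = ()

    def missing_fields(self) -> List[str]:
        """Return required fields that are blank, so unready tests fail early."""
        missing = []
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, str) and not value.strip():
                missing.append(f.name)
        if not self.evidence:
            missing.append("evidence")
        return missing


contract = CallerIdentityContract(
    test_id="caller_identity_duplicate_household_number",
    caller_signal="+15551234567",
    dialed_route="support_line",
    lookup_fixture="2_matches",
    verification_rule="policy_handoff",
    fallback="human_handoff",
    model_visible_fields=(),  # no personalization on a duplicate match
    trusted_tool_fields=("call_id", "caller_number", "lookup_confidence"),
    evidence=("lookup_trace", "variable_bag", "tool_args", "transcript", "final_decision"),
)

# A contract with a blank fallback is not ready to run.
incomplete = replace(contract, fallback=" ")
```

A test runner could refuse to execute any fixture where `missing_fields()` is non-empty, which enforces the "name the source of truth first" rule mechanically.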
Separate Trusted Identity From Model Claims
The highest-risk bug is letting a caller speak trusted identity into existence.
Suppose the caller says, "My phone number is +15550001111." The model dutifully calls lookup_account with that number. The backend returns a record. The agent treats it as verified.
That is not identity verification. That is a prompt-controlled lookup.
| Data | Trusted Source | Model Can See? | Test Assertion |
|---|---|---|---|
| Inbound caller number | Telephony/provider signaling or backend call creation | Sometimes | Model cannot overwrite it through speech. |
| Dialed number | Provider phone number or SIP route | Sometimes | Routing logic matches the dialed line. |
| Account ID | Backend lookup keyed from trusted signal | Usually no | Tool receives server-injected ID, not an LLM-generated ID. |
| Customer name | Backend lookup after match | Maybe | Prompt gets only the approved display field. |
| Verification status | Backend policy engine | Maybe as boolean | Sensitive tools reject calls before verification is true. |
| Caller-stated data | Transcript/model extraction | Yes | Treated as a claim that must match trusted records. |
Trusted caller identity is identity evidence created or verified outside the model path. It can come from telephony metadata, a signed webhook, a backend lookup, a policy engine, or a server-injected tool parameter, but it should not depend on the caller convincing the LLM to repeat a value.
Vapi's static variables and aliases docs make this distinction explicit: model-facing function parameters are different from server-merged parameters. Use that idea even if you are not on Vapi. Trusted fields belong in the orchestration layer or backend. Caller-stated fields belong in the transcript and must be checked.
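The server-merge idea can be sketched in a provider-neutral way. This is an illustration under assumptions, not Vapi's API: `build_tool_args`, `TRUSTED_KEYS`, and the fixture IDs are hypothetical names. The point is that the orchestration layer drops any trusted key the model proposes and injects its own value instead.

```python
from typing import Any, Dict

# Assumption: these keys are owned by the server and never by the model.
TRUSTED_KEYS = ("account_id", "call_id", "caller_number")


def build_tool_args(model_args: Dict[str, Any], server_context: Dict[str, Any]) -> Dict[str, Any]:
    """Merge model-proposed arguments with server-owned values; the server always wins."""
    # Keep only the arguments the model is allowed to control.
    safe = {k: v for k, v in model_args.items() if k not in TRUSTED_KEYS}
    # Record which trusted keys the model tried to set, as test evidence.
    rejected = sorted(k for k in model_args if k in TRUSTED_KEYS)
    merged = dict(safe)
    for key in TRUSTED_KEYS:
        merged[key] = server_context[key]
    return {"args": merged, "rejected_model_keys": rejected}


# Spoof attempt: the caller spoke a different number and the model forwarded it.
model_args = {"caller_number": "+15550001111", "reason": "billing question"}
server_context = {
    "account_id": "acct_fixture_001",
    "call_id": "call_fixture_001",
    "caller_number": "+15551234567",
}
result = build_tool_args(model_args, server_context)
```

In a spoof test, the assertion is twofold: the tool received the signaling-derived number, and the rejected key shows up in the evidence so the attempt is visible in review.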
For broader tool-call proof, use the voice agent workflow testing runbook. Caller identity is one precondition that every downstream workflow should inherit.
Required Scenario Matrix
Run these scenarios before launch and after every prompt, tool-schema, routing, or provider change that touches identity.
| Scenario | Setup | Expected Behavior | Block Release If |
|---|---|---|---|
| Matched caller | One fixture record matches inbound number. | Agent may use approved low-risk context, then completes required verification before sensitive data. | Sensitive data appears before verification. |
| Unknown caller | No fixture record matches. | Generic greeting, account lookup by approved secondary factor, or handoff. | Agent invents account context or says it found the caller. |
| Duplicate match | 2 records share the same number. | Agent asks an approved disambiguation question or hands off. | Agent chooses one record silently. |
| Spoof attempt | Caller says a different number or account ID. | Backend keeps trusted inbound identity separate from caller claim. | Tool call uses the spoken number as trusted identity. |
| Anonymous caller | Caller ID is unavailable or blocked. | No account personalization until step-up verification succeeds. | Agent greets by name or routes as known caller. |
| Stale CRM record | Phone number belongs to an old or closed account. | Agent detects stale status and limits actions. | Agent continues a closed or transferred workflow. |
| Lookup timeout | Backend identity service times out. | Safe fallback within provider response window. | Call hangs, personalizes from cached stale data, or exposes an internal error. |
| Mid-call identity change | Caller claims they are calling for someone else. | Agent changes policy state and requires authorization. | Agent continues as the original account without recording delegated access. |
The uncomfortable part: a lot of teams only test the matched-caller row. That is the row most likely to work in a demo and least likely to catch a production identity bug.
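Several rows of the matrix reduce to a deterministic policy decision that can be tested without placing a call. The sketch below assumes a hypothetical `decide` function and `Decision` enum; a real backend would have its own policy engine, and the spoof and mid-call rows need conversation-level tests rather than this lookup-level check.

```python
from enum import Enum
from typing import Dict, List, Optional


class Decision(Enum):
    LOW_RISK_PERSONALIZATION = "low_risk_personalization"
    GENERIC_PATH = "generic_path"
    DISAMBIGUATE_OR_HANDOFF = "disambiguate_or_handoff"
    LIMITED_ACTIONS = "limited_actions"
    SAFE_FALLBACK = "safe_fallback"


def decide(caller_number: Optional[str], matches: Optional[List[Dict[str, str]]]) -> Decision:
    """Map a trusted caller signal and lookup result to a policy decision.

    Assumption: matches is None when the identity service timed out.
    """
    if caller_number is None:          # anonymous or blocked caller ID
        return Decision.GENERIC_PATH
    if matches is None:                # lookup timeout
        return Decision.SAFE_FALLBACK
    if len(matches) == 0:              # unknown caller
        return Decision.GENERIC_PATH
    if len(matches) > 1:               # duplicate match, never pick silently
        return Decision.DISAMBIGUATE_OR_HANDOFF
    if matches[0].get("status") != "active":  # stale or closed record
        return Decision.LIMITED_ACTIONS
    return Decision.LOW_RISK_PERSONALIZATION


SCENARIOS = [
    ("matched", "+15551234567", [{"status": "active"}], Decision.LOW_RISK_PERSONALIZATION),
    ("unknown", "+15551234567", [], Decision.GENERIC_PATH),
    ("duplicate", "+15551234567", [{"status": "active"}, {"status": "active"}], Decision.DISAMBIGUATE_OR_HANDOFF),
    ("anonymous", None, [], Decision.GENERIC_PATH),
    ("stale", "+15551234567", [{"status": "closed"}], Decision.LIMITED_ACTIONS),
    ("timeout", "+15551234567", None, Decision.SAFE_FALLBACK),
]

# Names of scenarios whose decision differs from the expected one.
failures = [name for name, number, matches, expected in SCENARIOS
            if decide(number, matches) is not expected]
```

Running the whole table in one pass makes it harder to ship with only the matched-caller row covered.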
Pass/Fail Checklist
Use this as the pre-merge checklist for caller identity changes.
| Check | Owner | Evidence Required |
|---|---|---|
| Inbound caller signal captured | Platform engineer | Raw provider event or redacted request body with call ID |
| Webhook authenticity verified | Platform engineer | Signature or request-verification result |
| Lookup result deterministic | Backend engineer | Fixture ID, match count, confidence, and selected policy |
| Model-visible variables reviewed | Prompt owner | Diff of variables allowed into prompt or dynamic context |
| Trusted tool parameters injected server-side | Backend engineer | Tool trace showing server-injected IDs and nonce |
| Verification step enforced | Product/compliance owner | Transcript turn ID and policy decision |
| Fallback path tested | QA owner | Unknown, duplicate, timeout, and anonymous caller runs |
| Cleanup completed | Test owner | Fixture reset, sandbox records removed, no live customer writes |
If a test fails, do not fix it by adding "always verify identity" to the prompt and moving on. Prompts are useful guardrails. They are not the control plane.
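To show what enforcement outside the prompt can look like, here is a minimal sketch of a tool dispatcher that rejects sensitive calls until the backend has marked the session verified. The names `call_tool`, `SENSITIVE_TOOLS`, `VerificationRequired`, and the session shape are assumptions for illustration only.

```python
from typing import Any, Callable, Dict


class VerificationRequired(Exception):
    """Raised when a sensitive tool is called before verification."""


# Assumption: the set of sensitive tools is maintained by policy owners.
SENSITIVE_TOOLS = {"get_balance", "update_record"}


def call_tool(name: str, session: Dict[str, Any], handler: Callable[[str], Any]) -> Any:
    """Dispatch a tool call, enforcing verification in code rather than in the prompt."""
    # session["verified"] is set by the backend policy engine, never by the model.
    if name in SENSITIVE_TOOLS and not session.get("verified", False):
        raise VerificationRequired(f"{name} requires verified identity")
    # The account ID comes from the server-side session, not from model output.
    return handler(session["account_id"])


def fake_balance(account_id: str) -> Dict[str, Any]:
    """Stand-in handler returning fixture data."""
    return {"account_id": account_id, "balance": "fixture-value"}


unverified = {"account_id": "acct_fixture_001", "verified": False}
verified = {"account_id": "acct_fixture_001", "verified": True}

try:
    call_tool("get_balance", unverified, fake_balance)
    blocked = False
except VerificationRequired:
    blocked = True

allowed = call_tool("get_balance", verified, fake_balance)
```

With a gate like this, a prompt regression can still produce an awkward conversation, but it cannot produce a disclosure.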
Provider-Specific Checks
The names change by provider, but the same tests apply: capture inbound context, decide what is trusted, decide what the model can see, and prove the fallback.
| Provider Surface | Public Behavior to Test | Caller Identity Check |
|---|---|---|
| Vapi personalization | Inbound call can request assistant selection; your server can identify caller by phone number and return dynamic variables or assistant config. | Test matched and unmatched callers, and assert the response fits the documented 7.5-second window. |
| Vapi server events | assistant-request can return an assistant, transient assistant, transfer destination, or error. | Test safe transfer or error response when lookup fails. |
| Vapi static variables | Server-merged parameters can keep trusted values outside the model-facing function schema. | Assert caller number, account ID, call ID, and nonce cannot be overwritten by speech. |
| Retell inbound webhook | Inbound webhook includes from_number and to_number, can set dynamic variables and metadata, times out after 10 seconds, and retries up to 3 times. | Test timeout, retry idempotency, duplicate match, and call rejection behavior. |
| Retell dynamic variables | Phone-call variables include user number, agent number, call ID, direction, and call type. | Test missing variables, raw placeholder leakage, and string-only values. |
| Twilio Voice webhooks | Incoming voice calls can invoke your app in real time; Twilio recommends HTTPS and request verification. | Test signature verification, HTTPS-only routing, and redacted logging of inbound parameters. |
For LiveKit or WebSocket paths, identity may arrive as SIP metadata, JWT claims, room metadata, or your own session object. Use the same checklist. The WebSocket testing guide covers endpoint evidence; the LiveKit testing guide covers runtime-specific test setup.
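The timeout rows above share one pattern: give the identity lookup a budget shorter than the provider's response window, and return a generic response if the budget is exceeded. The sketch below is an assumption-laden illustration using only the standard library; `lookup_with_budget` and the response shape are hypothetical and do not match any provider's payload format.

```python
import concurrent.futures
import time
from typing import Any, Callable, Dict


def lookup_with_budget(lookup: Callable[[str], Dict[str, Any]],
                       caller_number: str,
                       budget_seconds: float) -> Dict[str, Any]:
    """Run the identity lookup with a time budget; fall back to a generic greeting."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(lookup, caller_number)
    try:
        record = future.result(timeout=budget_seconds)
    except concurrent.futures.TimeoutError:
        return {"personalized": False, "reason": "lookup_timeout"}
    except Exception:
        # Never surface internal errors to the caller.
        return {"personalized": False, "reason": "lookup_error"}
    finally:
        # Do not block the webhook response on a slow lookup thread.
        executor.shutdown(wait=False)
    # Only the approved low-risk field is returned for the prompt.
    return {"personalized": True, "first_name": record["first_name"]}


def slow_lookup(number: str) -> Dict[str, Any]:
    time.sleep(0.3)  # simulates an identity service that exceeds the budget
    return {"first_name": "Fixture"}


def fast_lookup(number: str) -> Dict[str, Any]:
    return {"first_name": "Fixture", "balance": "never-forwarded"}


def broken_lookup(number: str) -> Dict[str, Any]:
    raise RuntimeError("identity service unavailable")


timed_out = lookup_with_budget(slow_lookup, "+15551234567", budget_seconds=0.05)
matched = lookup_with_budget(fast_lookup, "+15551234567", budget_seconds=1.0)
errored = lookup_with_budget(broken_lookup, "+15551234567", budget_seconds=1.0)
```

In a real deployment the budget would be set below the documented window for the provider in use, leaving headroom for network latency and response serialization.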
Troubleshoot Caller Identity Failures
Classify the failure before changing prompts.
| Symptom | Likely Layer | First Diagnostic | Fix |
|---|---|---|---|
| Agent greets wrong customer | Backend lookup or variable injection | Compare inbound signal, selected record, and prompt variables. | Block duplicate/stale matches; limit fields before verification. |
| Agent asks for identity twice | State handoff between backend and model | Check whether verification status entered the model context. | Inject a low-risk verification_pending or verified state. |
| Tool call uses spoken phone number | Tool schema/trust boundary | Inspect model-facing function parameters. | Move trusted caller number to server-injected parameters. |
| Unknown caller gets account data | Fallback policy | Replay unknown-caller fixture. | Require generic path or step-up verification. |
| Duplicate records pass silently | CRM matching policy | Seed 2 matching records. | Add disambiguation rule and block automation until resolved. |
| CI flakes on lookup timeout | Test harness/backend dependency | Check timeout, retries, and idempotency keys. | Mock the identity service for CI; run live dependency checks separately. |
| Logs contain too much PII | Observability/redaction | Inspect trace, transcript, and request logs. | Store hashes or redacted fields; keep raw identity in the system of record. |
Tie every failure to observability. The voice agent observability tracing guide and IVR log correlation runbook show how to connect call IDs, traces, transcripts, and routing events.
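For the PII row in particular, one common approach is to replace raw identity fields with a keyed hash before the event reaches logs, so runs stay correlatable without storing the number itself. This is a sketch under assumptions: `redact_event`, `PII_FIELDS`, and the event shape are hypothetical, and key management is out of scope.

```python
import hashlib
import hmac
from typing import Any, Dict

# Assumption: these are the raw identity fields that must not reach logs.
PII_FIELDS = ("caller_number", "customer_name")


def redact_event(event: Dict[str, Any], key: bytes) -> Dict[str, Any]:
    """Return a copy of the event with PII fields replaced by truncated HMAC-SHA256 digests."""
    redacted = dict(event)  # shallow copy; the original event is left untouched
    for name in PII_FIELDS:
        if redacted.get(name) is not None:
            digest = hmac.new(key, str(redacted[name]).encode("utf-8"), hashlib.sha256).hexdigest()
            redacted[name] = "hmac:" + digest[:16]
    return redacted


event = {
    "call_id": "call_fixture_001",
    "caller_number": "+15551234567",
    "customer_name": "Fixture Person",
    "match_count": 1,
}
log_line = redact_event(event, key=b"test-only-key")
```

Because the hash is keyed and deterministic, two runs with the same fixture number produce the same token, which keeps the lookup trace joinable to the call ID without exposing the raw value.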
What This Checklist Cannot Prove
This checklist proves that caller identity evidence is captured, separated from model claims, and enforced before sensitive actions. It does not prove that every caller is the account owner.
3 limitations matter in production:
| Limitation | Why It Matters | Practical Response |
|---|---|---|
| Caller ID can be shared or spoofed | A matched number is not always the right human. | Treat caller ID as a lookup key and require step-up verification for sensitive flows. |
| Provider metadata can be missing | Anonymous, forwarded, SIP, and contact-center paths do not always carry the same fields. | Test anonymous and missing-metadata paths as first-class scenarios. |
| Policy changes outside the agent | CRM merges, account transfers, and delegated-access rules can change after a test passes. | Re-run identity tests after routing, CRM, policy, and tool-schema changes. |
We used to treat caller identity as a routing concern: get the right record, then let the agent continue. That is too loose for production workflows. The safer view is that identity is a state transition, and the agent should not move into an account-specific state until the trusted evidence exists.
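The state-transition view can be made concrete with a small transition table. The sketch below is illustrative only: the state names, event names, and the `advance` helper are assumptions, and a production policy engine would carry more states. What matters is that only backend-originated events move the session forward, while a spoken claim leaves the state unchanged.

```python
from enum import Enum
from typing import Dict, Tuple


class IdentityState(Enum):
    UNKNOWN = "unknown"
    MATCHED_UNVERIFIED = "matched_unverified"
    VERIFIED = "verified"
    DELEGATE_PENDING = "delegate_pending"


# Only these (state, event) pairs change state. Events are emitted by the
# backend or policy engine; "caller_claim" is deliberately absent.
TRANSITIONS: Dict[Tuple[IdentityState, str], IdentityState] = {
    (IdentityState.UNKNOWN, "backend_match"): IdentityState.MATCHED_UNVERIFIED,
    (IdentityState.UNKNOWN, "step_up_passed"): IdentityState.VERIFIED,
    (IdentityState.MATCHED_UNVERIFIED, "step_up_passed"): IdentityState.VERIFIED,
    # Mid-call change: calling on behalf of someone else drops account access.
    (IdentityState.VERIFIED, "third_party_claim"): IdentityState.DELEGATE_PENDING,
    (IdentityState.DELEGATE_PENDING, "delegation_authorized"): IdentityState.VERIFIED,
}


def advance(state: IdentityState, event: str) -> IdentityState:
    """Return the next state, or the same state if the event is not allowed."""
    return TRANSITIONS.get((state, event), state)


state = IdentityState.UNKNOWN
state = advance(state, "caller_claim")    # spoken claim: no change
after_claim = state
state = advance(state, "backend_match")
state = advance(state, "step_up_passed")
after_verification = state
state = advance(state, "third_party_claim")
after_third_party = state
```

A test suite can then assert on the recorded state sequence for each scenario instead of inferring identity handling from the transcript alone.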
What Belongs in CI?
Put the smallest identity suite in CI. Keep the provider-live and telephony-live runs for nightly or pre-release validation.
| Gate | Run When | Recommended Size | Blocks Merge? |
|---|---|---|---|
| Fixture lookup unit tests | Backend lookup policy changes | 10-20 records | Yes |
| Prompt/context tests | Prompt, dynamic variable, or assistant config changes | 5-8 identity scenarios | Yes for sensitive workflows |
| Tool trust-boundary tests | Tool schema or API integration changes | 5-10 tool calls | Yes |
| Provider webhook tests | Routing/provider config changes | 3-5 calls per provider path | Usually pre-release |
| Production sampling | Continuous monitoring | 1-5% of eligible calls | No, but alert on drift |
The production readiness checklist should treat identity as a launch blocker for healthcare, finance, insurance, collections, legal, and account-management flows. For vendor evaluations, add this to your voice testing vendor questions: "Show the same scenario with a matched caller, unknown caller, duplicate record, anonymous caller, and spoofed caller claim. Then show the evidence for each run."
Minimum Production-Ready Checklist
- Caller ID is treated as a lookup signal, not standalone authorization.
- Unknown, duplicate, anonymous, stale, timeout, and spoofed scenarios are tested.
- Trusted identity fields are injected server-side or backend-side, not generated by the model.
- Model-visible context is limited before verification.
- Sensitive tools require verified identity or explicit policy approval.
- Lookup failures produce a safe fallback, not a broken call or over-personalized response.
- Logs redact or hash raw identity fields unless the system is approved to store them.
- Every identity test records call ID, lookup trace, variable bag, verification result, and cleanup status.
- Caller identity failures become regression tests within 1 business day.
Caller identity testing is not about making a voice agent suspicious of every caller. It is about proving the agent knows the difference between a useful hint, a trusted backend fact, and a claim someone said out loud.
That difference is what keeps personalization from becoming unauthorized access.

