Regulatory script adherence for AI voice agents is not the same as putting a disclosure in the prompt. The test is whether the agent actually said the required thing at the required time, avoided prohibited language, preserved evidence, and failed closed when the caller pushed it off path.
That gap matters most in banking, lending, insurance, healthcare, and BPO programs where a voice agent can finish the task and still create a compliance problem.
Regulatory script adherence for AI voice agents is the practice of converting required disclosures, identity checks, prohibited statements, consent rules, and escalation policies into measurable runtime checks across every call.
Quick filter: If your agent only handles low-risk FAQ calls and has no required disclosures, this checklist is probably too much. If a missed line, unsafe paraphrase, or undocumented exception can trigger an audit finding, use it.
TL;DR: Use a four-part script adherence system:
- Script obligation matrix - every required line, semantic obligation, timing rule, and prohibited phrase in one table.
- Runtime evaluation - exact checks for mandatory text, semantic checks for policy intent, and timing checks for when the line was delivered.
- Audit evidence packet - transcript span, audio pointer, policy version, evaluator result, reviewer decision, and remediation status.
- Regression loop - every confirmed miss becomes a replayable test before the next prompt or model update.
Do not rely on prompt instructions alone. A prompt is the plan. The evaluated call is the evidence.
Methodology Note: This checklist is based on Hamming's analysis of 4M+ production voice agent calls across 10K+ voice agents (2025-2026). It also uses public regulator and vendor documentation reviewed from 2026-04-16 through 2026-05-16, including materials from the FCC, FTC, FINRA, Observe.AI, Sedric, and Itero.
Last Updated: May 2026
Related Guides:
- Call Logging for AI Voice Agents - taxonomy, retention, and audit trail basics
- AI Voice Agent Compliance & Security - behavior-level compliance failures
- PII Redaction for Voice Agents - transcript, trace, and log redaction patterns
- Voice Agent Monitoring KPIs - KPI thresholds for production monitoring
- Voice Agent Testing in CI/CD - regression gates and deployment checks
- Voice Agent Incident Response Runbook - escalation and postmortem workflow
Why Prompt-Level Script Compliance Fails
Most teams start with a reasonable instruction:
Always read the required disclosure before discussing loan terms.
That instruction is necessary. It is not evidence.
A caller can interrupt. The agent can paraphrase. The LLM can move the disclosure after the recommendation. The agent can answer a question that requires identity verification before the verification is complete. The transcript can show that the agent was helpful, fast, and wrong.
We see the same pattern in voice agent compliance testing: the failure is rarely "the prompt forgot the rule." The failure is that the rule was not turned into a measurable assertion with call-level evidence.
Regulators and compliance teams do not review your system prompt. They review what happened on the call.
The Script Obligation Matrix
Start by translating policy language into obligations the evaluator can check. Keep the matrix small enough that compliance, QA, and engineering can all maintain it.
| Obligation Type | Example | Check Type | Evidence Required | Fails When |
|---|---|---|---|---|
| Required disclosure | "This call may be recorded..." | Exact or near-exact text | Transcript span + audio timestamp | Missing, late, or materially altered |
| Required identity step | Verify DOB before account details | Ordered workflow check | Verification event + answer span | Account detail appears before verification |
| Prohibited statement | No guaranteed approval language | Semantic policy check | Transcript span + evaluator rationale | Agent implies approval, rate, or outcome is guaranteed |
| Consent capture | Caller consents before outreach | Binary event + transcript check | Consent span + call metadata | Agent proceeds without consent |
| Escalation trigger | Threat, complaint, fraud, hardship | Classification + handoff event | Trigger span + escalation record | Agent continues instead of escalating |
| Recordkeeping | Preserve reviewable evidence | Metadata completeness check | Call ID, policy version, evaluator version | Missing audit fields |
This is where call logging matters. If the call record does not include timestamps, agent version, policy version, transcript turns, and reviewer decisions, the compliance check cannot be defended later.
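One way to keep the matrix maintainable is to store it as versioned data that compliance can read and the evaluator can load, rather than prose in a wiki. A minimal Python sketch; the field names and the two example rules are illustrative, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScriptObligation:
    """One row of the script obligation matrix, pinned to a policy version."""
    rule_id: str                        # stable ID, e.g. "required_apr_disclosure_before_terms"
    obligation_type: str                # "required_disclosure", "prohibited_statement", ...
    check_type: str                     # "exact", "semantic", "ordered", or "metadata"
    policy_version: str                 # the rulebook version this obligation belongs to
    required_evidence: tuple[str, ...]  # evidence fields the packet must carry
    owner: str                          # the compliance or QA owner for this rule

OBLIGATIONS = [
    ScriptObligation(
        rule_id="required_recording_disclosure",
        obligation_type="required_disclosure",
        check_type="exact",
        policy_version="loan-servicing-script-v7",
        required_evidence=("transcriptSpan", "audioPointer"),
        owner="compliance",
    ),
    ScriptObligation(
        rule_id="identity_before_account_details",
        obligation_type="required_identity_step",
        check_type="ordered",
        policy_version="loan-servicing-script-v7",
        required_evidence=("verificationEvent", "transcriptSpan"),
        owner="qa",
    ),
]
```

Because each row carries its own policy version and owner, a rule change becomes a reviewable diff instead of a silent edit.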
Verbatim Checks vs. Semantic Checks
Do not use one evaluator style for every rule. Some obligations need exact text. Others need meaning.
| Rule | Best Evaluator | Why |
|---|---|---|
| Statutory disclosure with prescribed wording | Verbatim or phrase-window match | The wording itself matters. |
| Identity verification before account access | Ordered event assertion | Timing matters more than wording. |
| No misleading claims | Semantic evaluator | The unsafe statement may be paraphrased. |
| Required opt-out instruction | Exact + semantic hybrid | The call needs specific mechanics and clear meaning. |
| Complaint recognition | Semantic classifier | Callers use messy language. |
| PII or PHI handling | Entity detection + policy assertion | Sensitive data may appear outside the expected field. |
The practical setup is a hybrid:
- Exact checks for lines that must appear.
- Semantic checks for intent, prohibitions, and unsafe paraphrases.
- Order checks for "before/after" rules.
- Metadata checks for audit completeness.
- Human review queues for low-confidence or high-risk flags.
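To make the split concrete: the exact and order checks can be short deterministic code, while semantic checks usually go to an LLM judge or classifier that returns a rationale. A sketch of the deterministic half, where the event shape (`type`, `start_ms`) is an assumption about your call record, not a real API:

```python
import re

def _norm(s: str) -> str:
    """Collapse case and whitespace so formatting differences do not fail the match."""
    return re.sub(r"\s+", " ", s.lower()).strip()

def exact_disclosure_present(transcript: str, required: str) -> bool:
    """Exact check: the mandatory wording must appear in the transcript."""
    return _norm(required) in _norm(transcript)

def ordered_rule_holds(events: list[dict], first: str, then: str) -> bool:
    """Order check: no `then` event may occur before the earliest `first` event."""
    first_times = [e["start_ms"] for e in events if e["type"] == first]
    then_times = [e["start_ms"] for e in events if e["type"] == then]
    if not then_times:
        return True  # the restricted topic never came up, so the rule holds vacuously
    return bool(first_times) and min(first_times) < min(then_times)
```

The semantic layer sits on top of this: it flags paraphrases and implied meanings the string match cannot catch, and its output goes to the review queue rather than straight to a verdict.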
According to the FCC's February 2024 ruling, TCPA restrictions on artificial or prerecorded voice include AI technologies that generate human voices. The FTC's 2024 TSR update also emphasized disclosures, misrepresentation prohibitions, and recordkeeping. FINRA Rule 3230 applies telemarketing restrictions to member firms and points back to FTC and FCC requirements.
Those are not model-quality concerns. They are runtime behavior and evidence concerns.
The Evidence Packet Template
Every failed or high-risk compliance check should produce a packet a reviewer can inspect without reconstructing the call by hand.
```json
{
  "canonicalCallId": "call_01H...",
  "agentVersion": "banking-agent-2026-05-16",
  "policyVersion": "loan-servicing-script-v7",
  "ruleId": "required_apr_disclosure_before_terms",
  "ruleType": "ordered_required_disclosure",
  "result": "fail",
  "confidence": 0.91,
  "transcriptSpan": {
    "startMs": 42100,
    "endMs": 46850,
    "speaker": "agent"
  },
  "audioPointer": "recording://call_01H...#t=42.1",
  "evaluatorRationale": "Agent discussed payment terms before APR disclosure.",
  "reviewStatus": "pending_compliance_review"
}
```
The exact field names can differ. The evidence categories should not:
| Evidence Field | Why It Matters |
|---|---|
| canonicalCallId | Joins transcript, audio, trace, CRM outcome, and review notes. |
| agentVersion | Shows which prompt/model/tool version produced the call. |
| policyVersion | Proves which rulebook was active at the time. |
| ruleId | Keeps the finding tied to a stable obligation. |
| transcriptSpan | Lets reviewers inspect the relevant sentence, not the whole call. |
| audioPointer | Confirms whether ASR or punctuation distorted the transcript. |
| evaluatorRationale | Explains why the check passed or failed. |
| reviewStatus | Tracks whether the finding was confirmed, dismissed, or remediated. |
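A metadata completeness check can then be a short assertion over those categories before a packet is accepted into the audit store. A sketch, assuming the JSON field names shown above:

```python
REQUIRED_PACKET_FIELDS = (
    "canonicalCallId", "agentVersion", "policyVersion", "ruleId",
    "result", "transcriptSpan", "audioPointer",
    "evaluatorRationale", "reviewStatus",
)

def missing_evidence_fields(packet: dict) -> list[str]:
    """Return any missing or empty evidence fields; an empty list means audit-ready."""
    return [f for f in REQUIRED_PACKET_FIELDS if not packet.get(f)]
```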
For regulated workflows, use PII redaction before exporting broad analytics. Raw evidence can be retained under stricter roles, but dashboards and alert payloads should not spray account numbers, PHI, payment details, or other sensitive values.
Runtime Monitoring Runbook
Script adherence checks should run continuously, not only after a quarterly QA review.
| Step | Owner | System Action | Output |
|---|---|---|---|
| 1. Normalize the policy | Compliance + QA | Convert scripts into stable rule IDs | Script obligation matrix |
| 2. Attach call context | Engineering | Propagate call ID, agent version, policy version | Queryable call record |
| 3. Evaluate every call | Hamming / QA platform | Run exact, semantic, order, and metadata checks | Per-rule result |
| 4. Route exceptions | QA + compliance | Send high-risk or low-confidence failures to review | Review queue |
| 5. Preserve evidence | Data owner | Store transcript span, audio pointer, evaluator version | Audit packet |
| 6. Convert misses | Engineering + QA | Replay confirmed failures as regression tests | CI/CD gate |
| 7. Review trends | Operations | Track miss rate by rule, flow, agent version, language | Compliance dashboard |
This is the same operating model as voice agent monitoring KPIs, but focused on policy behavior instead of only latency, containment, or task completion.
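Wired together, steps 3 through 5 reduce to a small per-call loop. A sketch that reuses the ScriptObligation rows from the matrix sketch above; the checker, store, and review queue are injected callables standing in for whatever evaluator and infrastructure you actually run, and the 0.8 review threshold is an assumed starting point:

```python
from typing import Callable

def evaluate_call(
    call: dict,
    obligations: list,
    run_check: Callable[[object, dict], dict],
    store_packet: Callable[[dict], None],
    enqueue_for_review: Callable[[dict], None],
    review_threshold: float = 0.8,  # assumption: tune against confirmed reviewer decisions
) -> None:
    """Run every obligation against one call and route the results (steps 3-5)."""
    for rule in obligations:
        result = run_check(rule, call)  # exact / semantic / ordered / metadata check
        packet = {
            "canonicalCallId": call["canonicalCallId"],
            "ruleId": rule.rule_id,
            **result,  # result, confidence, transcriptSpan, evaluatorRationale, ...
        }
        store_packet(packet)  # step 5: preserve evidence for passes and failures alike
        if packet["result"] == "fail" or packet["confidence"] < review_threshold:
            enqueue_for_review(packet)  # step 4: route risky or low-confidence flags
```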
A Practical Compliance Scorecard
Use a scorecard that separates legal risk, customer impact, and engineering fixability. One blended "compliance score" hides the wrong things.
| Dimension | Passing Target | Warning | Critical |
|---|---|---|---|
| Required disclosure delivery | 99%+ | 95-99% | below 95% |
| Disclosure timing | 98%+ before restricted topic | 90-98% | below 90% |
| Prohibited-response rate | 0 confirmed misses | 1 unconfirmed high-confidence flag | any confirmed miss |
| Identity-before-disclosure | 100% | any low-confidence exception | any confirmed ordering failure |
| Evidence packet completeness | 99%+ | 95-99% | below 95% |
| Reviewer SLA | same business day for critical | within 3 business days | missed critical review |
| Regression coverage | every confirmed miss converted | backlog exists | repeated miss without regression |
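The bands above translate directly into code. A sketch for the required-disclosure row; the other dimensions follow the same pattern with their own thresholds:

```python
def disclosure_delivery_status(delivered: int, required: int) -> str:
    """Map required-disclosure delivery rate onto the scorecard bands above."""
    rate = delivered / required if required else 1.0
    if rate >= 0.99:
        return "passing"
    if rate >= 0.95:
        return "warning"
    return "critical"  # below 95%: treat as an incident, not a dashboard note
```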
Do not let this scorecard become theater. If a rule is truly critical, a single confirmed miss should trigger an incident path through your voice agent incident response runbook.
How Hamming Fits
Hamming is not legal counsel and does not decide what your regulated script must say. The useful boundary is different:
- Compliance defines the obligation.
- QA defines the expected behavior.
- Engineering attaches call context and versioning.
- Hamming evaluates calls, preserves evidence, tracks failures, and turns misses into regression tests.
That last piece matters because script failures often come back after a harmless-looking prompt edit. The agent gets more concise and drops a line. A model update changes phrasing. A multilingual route translates a required phrase too loosely. A tool error sends the caller into a fallback path that never reads the disclosure.
When those failures are replayable, CI/CD regression testing can block the next release before customers or auditors find the drift.
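A confirmed miss becomes a fixture the pipeline replays on every release. A sketch of what such a case might capture; the shape is an assumption for illustration, not Hamming's actual format:

```python
REGRESSION_CASES = [
    {
        "case_id": "reg_apr_disclosure_interrupt",
        "source_call": "call_01H...",  # the confirmed production miss this replays
        "rule_id": "required_apr_disclosure_before_terms",
        "scenario": "caller interrupts the APR disclosure and asks about payments",
        "expected": "APR disclosure completes before any payment terms are stated",
    },
]

def release_gate_passes(run_results: dict[str, str]) -> bool:
    """Block the deploy if any replayed compliance case fails in CI/CD."""
    return all(run_results.get(c["case_id"]) == "pass" for c in REGRESSION_CASES)
```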
What To Test Before Launch
Use this pre-launch checklist for any regulated voice flow:
- Every required line has a stable ruleId.
- Every rule has an owner in compliance or QA.
- Exact-wording rules are separated from semantic obligations.
- Required order is explicit: "before quote," "before payment," "before PHI," "before transfer."
- Prohibited phrases and prohibited meanings are both covered.
- Interruptions are tested: caller talks over the disclosure, asks a question, or tries to skip it (see the scenario sketch after this list).
- Multilingual paths are tested with language-specific expected wording.
- ASR errors are tested with noisy audio and accents.
- Evidence packets include transcript span, audio pointer, policy version, and evaluator version.
- Confirmed misses become regression tests.
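For the interruption items in particular, it helps to script the adversarial caller turns as data so they replay identically on every run. An illustrative sketch; the scenario names and fields are assumptions:

```python
INTERRUPTION_SCENARIOS = [
    {
        "name": "caller_talks_over_disclosure",
        "inject_at": "during_required_disclosure",
        "caller_turn": "Yeah, skip all that. What's my rate?",
        "pass_condition": "disclosure completes or restarts before any rate is discussed",
    },
    {
        "name": "caller_question_mid_disclosure",
        "inject_at": "during_required_disclosure",
        "caller_turn": "Wait, is this being recorded?",
        "pass_condition": "agent answers, then finishes the full required wording",
    },
]
```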
For contact-center programs, connect this checklist to the broader call center voice agent testing guide. Script adherence is one layer. You still need load, latency, escalation, PII, and outcome testing.
Flaws But Not Dealbreakers
Exact compliance is harder with generative speech. If a rule truly requires exact wording, constrain that part of the flow. Use a fixed disclosure block or controlled response, then let the agent return to normal conversation.
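One way to constrain that part of the flow is to route the fixed-wording step around the model entirely and only hand control back afterward. A minimal sketch, where the state names, the example disclosure text, and the injected `llm_respond` callable are all illustrative:

```python
from typing import Callable

FIXED_DISCLOSURES = {
    # Illustrative wording only; use your legally approved text verbatim.
    "recording": "This call may be recorded for quality and compliance purposes.",
}

def next_agent_turn(
    state: str,
    caller_text: str,
    llm_respond: Callable[[str], str],  # your generative agent, injected
) -> tuple[str, str]:
    """Route fixed-wording steps around the model so the exact text is guaranteed."""
    if state == "pre_disclosure":
        # Deterministic branch: the required wording is emitted verbatim, never generated.
        return FIXED_DISCLOSURES["recording"], "conversation"
    return llm_respond(caller_text), "conversation"  # everything else stays generative
```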
Semantic evaluators need calibration. A policy evaluator can over-flag harmless paraphrases or miss industry-specific phrasing. Start with reviewable alerts, then tune against confirmed decisions.
Monitoring does not replace policy design. If the script is vague, the evaluator will be vague. Clean policy language is the cheapest compliance control you have.

