Most voice agent vendor reviews ask the usual SaaS questions: SOC 2, SSO, encryption, subprocessors, uptime, and incident response.
Those questions matter. They just do not cover enough.
A voice agent testing or monitoring vendor may handle raw call audio, transcripts, redacted transcripts, PII, PHI, prompt versions, tool traces, QA annotations, escalation notes, and production failure samples. The security review has to prove how that evidence moves, who can see it, how long it lives, and whether the AI layer can be abused.
Voice agent security review questions are the vendor due-diligence questions that cover both normal SaaS controls and voice-specific risks: recordings, transcripts, redaction, telephony metadata, prompt injection, tool actions, data residency, retention, deletion, and production monitoring evidence.
Quick filter: This checklist is for security, procurement, engineering, and QA teams evaluating a voice agent testing, QA, or monitoring vendor. If the POC will only use synthetic calls with no customer data, use the shorter TL;DR list below. If the POC will touch production recordings, transcripts, or regulated workflows, use the full checklist before launch.
TL;DR: Ask 30 questions before a voice agent testing POC. They group into eight core questions:
- What data enters the vendor: audio, transcript, metadata, tool traces, QA notes, or exports?
- Is customer data used for model training, fine-tuning, evaluation, or support debugging?
- Who can access raw audio and unredacted transcript text?
- Can the vendor separate raw, redacted, aggregate, and exported evidence?
- How are retention, deletion, legal hold, and audit logs enforced?
- Which model, telephony, storage, and analytics subprocessors touch the data?
- How does the platform test prompt injection, sensitive-data leakage, and unsafe tool actions?
- What evidence can the vendor show before a POC uses sensitive calls?
Methodology Note: This checklist is based on Hamming's analysis of production voice agent testing, monitoring, and security-review workflows across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions. It also uses public guidance from AICPA, HHS, OWASP, and NIST so procurement questions stay tied to recognized control and risk-management frameworks.
Last Updated: May 2026
Related Guides:
- SOC 2 Voice Agent Testing - control evidence and audit readiness for voice-agent QA
- HIPAA PHI Clinical Workflow Testing Checklist - healthcare workflow and PHI testing controls
- PII Redaction for Voice Agents - redaction design for transcripts and audio evidence
- PII Redaction Compliance Architecture - architecture patterns for sensitive voice data
- Voice Agent Log Retention Compliance Checklist - retention, deletion, and audit archive decisions
- Call Logging for AI Voice Agents - taxonomy for call logs, transcripts, and metadata
- Voice Agent Workflow Testing - tool-call and side-effect test evidence
- Voice Agent Production Readiness Checklist - launch gates before real callers use the agent
- Voice Agent Monitoring Platform Guide - production monitoring requirements
- Voice Agent Incident Response Runbook - evidence needed during customer-impacting incidents
I used to treat this as a security-packet problem: collect the SOC 2 report, confirm SSO, check the subprocessor list, and move on. That misses the part unique to voice agents. The most uncomfortable review questions are usually about the evidence trail: who can play the recording, who can read the raw transcript, what the evaluator stores, and whether a spoken attack can trigger the wrong tool action.
What a Voice Agent Security Review Has to Cover
Start with a map. Do not start by forwarding a generic vendor security questionnaire.
| Review Area | What It Covers | Why Voice Agents Are Different |
|---|---|---|
| SaaS controls | SOC 2 scope, SSO, RBAC, encryption, vulnerability management, incident response | Necessary baseline, but not enough for call evidence and AI behavior |
| Data flow | Audio, transcript, metadata, tool traces, QA notes, exports, support access | Spoken conversations can contain account numbers, medical details, consent statements, and payment context |
| AI behavior | prompt injection, sensitive information disclosure, hallucination, tool abuse, model-provider data use | The caller can attack the system through natural speech, not just a web form |
| Voice stack | telephony provider, SIP/WebRTC path, recording storage, DTMF handling, transfer/handoff metadata | Call routing and recording systems may sit outside the core application boundary |
| Production monitoring | failure samples, sampled calls, reviewer access, alerting, incident evidence | Monitoring often expands data access after launch if it is not designed carefully |
| Deployment model | multi-tenant SaaS, private tenant, customer-owned storage, self-hosted, hybrid | Regulated buyers may need stronger isolation, regional residency, or customer-managed keys |
The key mistake is treating the vendor as a normal SaaS tool while ignoring the evidence it will process. A voice agent testing platform may become the place where the most sensitive failures are collected: angry callers, failed identity checks, escalations, medical questions, payment attempts, and agent mistakes.
Working rule: if a vendor will store or inspect real call evidence, the security review must cover every evidence class separately: raw audio, unredacted transcript, redacted transcript, metadata, QA annotation, tool trace, export, and aggregate metric.
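One way to make the working rule concrete is to treat each evidence class as a named value and check access per class rather than per "call record". The sketch below is illustrative only: the role names and grants are hypothetical, and a real review would take them from the vendor's RBAC matrix.

```python
from enum import Enum


class EvidenceClass(Enum):
    RAW_AUDIO = "raw_audio"
    UNREDACTED_TRANSCRIPT = "unredacted_transcript"
    REDACTED_TRANSCRIPT = "redacted_transcript"
    METADATA = "metadata"
    QA_ANNOTATION = "qa_annotation"
    TOOL_TRACE = "tool_trace"
    EXPORT = "export"
    AGGREGATE_METRIC = "aggregate_metric"


# Hypothetical role grants, used only to illustrate per-class access.
ROLE_GRANTS = {
    "qa_reviewer": {
        EvidenceClass.REDACTED_TRANSCRIPT,
        EvidenceClass.METADATA,
        EvidenceClass.QA_ANNOTATION,
        EvidenceClass.AGGREGATE_METRIC,
    },
    "analyst": {EvidenceClass.METADATA, EvidenceClass.AGGREGATE_METRIC},
    "security_admin": set(EvidenceClass),
}


def can_access(role: str, evidence: EvidenceClass) -> bool:
    """Deny by default: unknown roles and ungranted classes get no access."""
    return evidence in ROLE_GRANTS.get(role, set())


def unreviewed_classes(reviewed: set[EvidenceClass]) -> set[EvidenceClass]:
    """Return the evidence classes the security review has not covered yet."""
    return set(EvidenceClass) - reviewed
```

A reviewer could call `unreviewed_classes` with the classes already covered in the vendor's answers to see which ones still need a storage, access, and retention answer.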
AICPA's Trust Services Criteria cover security, availability, processing integrity, confidentiality, and privacy controls for service-organization systems. That is a good baseline. It does not remove the need to ask how the vendor handles voice-specific data and AI-specific failure modes.
The 30-Question Vendor Security Checklist
Use this as the first pass. It is intentionally direct. A mature vendor should be able to answer without inventing policy during the call.
| # | Question | Good Evidence |
|---|---|---|
| 1 | Which specific data types enter your system during testing or monitoring? | Data-flow diagram with audio, transcript, metadata, tool traces, QA notes, and exports labeled |
| 2 | Is any customer audio or transcript text used for model training, fine-tuning, evaluation, or support debugging? | Written data-use policy and opt-in/opt-out controls |
| 3 | Do raw audio and unredacted transcripts have separate permissions? | RBAC matrix and audit-log samples |
| 4 | Can redacted transcripts be used for search and analytics without exposing raw text? | Redaction workflow, status field, and separate storage or access boundary |
| 5 | Can customers configure retention by data type? | Retention policy by audio, transcript, metadata, QA note, and aggregate metric |
| 6 | How do deletion requests propagate across transcripts, recordings, exports, analytics, and backups? | Deletion runbook and completion evidence |
| 7 | Which subprocessors touch customer call data? | Current subprocessor list and data categories per subprocessor |
| 8 | Which regions store or process data? | Residency options and cross-border transfer explanation |
| 9 | What is covered by your SOC 2 report? | Report scope, control families, period, and excluded systems |
| 10 | Can you sign a BAA for HIPAA workloads? | BAA terms and ePHI handling boundaries |
| 11 | Is SSO/SAML supported and can MFA be enforced? | Identity-provider setup guide and access logs |
| 12 | Are admin actions, exports, playback, and transcript views audited? | Audit-log sample with actor, timestamp, action, and object |
| 13 | Can support staff access customer calls? | Support-access policy, approval workflow, time limits, and logging |
| 14 | How are secrets, API keys, webhook URLs, and telephony credentials stored? | Key-management policy and rotation process |
| 15 | How do you secure webhook ingestion and outbound callbacks? | Signature verification, replay protection, and retry policy |
| 16 | How do you prevent prompt injection through caller speech or transcript text? | Test suite, policy, and failure samples |
| 17 | How do you prevent the agent or evaluator from leaking sensitive information? | Red-team scenarios and sensitive-output checks |
| 18 | How do you prevent unauthorized tool actions during test calls? | Tool permissions, sandboxing, idempotency, and side-effect controls |
| 19 | Can test calls run without writing to production CRMs, calendars, EHRs, payment systems, or ticketing tools? | Sandbox or mock integration evidence |
| 20 | How do you isolate tenant data? | Architecture note, access model, and incident boundary |
| 21 | Can we use customer-owned storage or customer-managed keys? | Supported architecture and operational tradeoffs |
| 22 | Do you support private tenant or self-hosted deployment? | Deployment model table and shared-responsibility model |
| 23 | How are production monitoring samples selected? | Sampling policy and opt-out controls |
| 24 | Can customers prevent sensitive queues from being monitored? | Queue-level policy controls |
| 25 | Can call evidence be exported securely for auditors? | Scoped export workflow and audit trail |
| 26 | What happens to data after contract termination? | Termination deletion or export policy |
| 27 | What security incidents would trigger customer notification? | Incident notification policy and SLA |
| 28 | How do you validate upstream model-provider changes? | Model-change review and rollback process |
| 29 | Can customers pin, restrict, or approve model providers? | Provider configuration and governance controls |
| 30 | What are the known unresolved risks? | Honest risk register, roadmap, or compensating controls |
The last question matters. Vendors that can name unresolved risks are usually safer than vendors that claim there are none.
Evidence to Request Before a POC
Do not wait until procurement to ask for evidence. The POC is when sensitive data boundaries are most likely to get blurred.
| Evidence | Ask For It When | Review Owner |
|---|---|---|
| Security overview | Before any account setup | Security |
| SOC 2 report or bridge letter | Before enterprise contract review | Security/procurement |
| Data-flow diagram | Before production calls enter the platform | Engineering/security |
| Subprocessor list | Before any real customer data is processed | Legal/security |
| Data-use policy | Before transcripts, recordings, or QA notes are uploaded | Legal/security |
| Retention and deletion policy | Before persistent storage is enabled | Security/compliance |
| RBAC and audit-log sample | Before reviewers or support users are invited | Security/operations |
| BAA or regulated-workload terms | Before PHI or healthcare workflows are tested | Legal/compliance |
| Incident-response process | Before production monitoring is enabled | Security/on-call |
| AI behavior test summary | Before tool-calling or sensitive flows are tested | Engineering/security |
This is not paperwork for its own sake. It prevents the awkward moment where the POC works technically but fails security because the vendor cannot explain where call evidence went.
Voice-Specific Risks Generic SaaS Reviews Miss
A generic questionnaire may ask whether data is encrypted. It usually does not ask whether a caller can trick the agent into reading back private context, whether DTMF digits are isolated from the model, or whether a support user can play raw recordings.
| Risk | Why It Matters | Security Review Question |
|---|---|---|
| Prompt injection over speech | Attackers can speak instructions that alter model behavior. | How do you test spoken prompt injection and indirect transcript injection? |
| Sensitive information disclosure | The model may reveal secrets, account context, policy text, or prior-call details. | What checks prevent sensitive output before it reaches the caller or reviewer? |
| Excessive tool agency | A model can trigger actions beyond what the caller is authorized to request. | What tools can the system call, under what policy, and in which environment? |
| Transcript overexposure | Reviewers may need QA access without raw PII access. | Can raw and redacted transcript access be separated? |
| Recording playback | Audio can reveal biometric voice characteristics and sensitive spoken content. | Who can play recordings, export them, or share links? |
| DTMF and payment leakage | Digits can pass through systems that should never see payment data. | Can payment or DTMF collection be isolated from transcription and LLM context? |
| Handoff context leakage | Summaries can expose more data than the receiving queue needs. | Can handoff payloads be minimized and audited? |
| Production monitoring creep | Monitoring can expand from failure review into broad surveillance. | Can customers scope monitoring by queue, policy, and data class? |
OWASP's LLM application guidance calls out risks such as prompt injection, sensitive information disclosure, insecure output handling, and excessive agency. For voice agents, those risks show up through speech, transcripts, tool calls, and handoffs.
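A simple way to test for sensitive information disclosure is to seed known canary values into the agent's context and check whether they come back in its output. The sketch below assumes hypothetical canary names and values, and it only catches literal leaks after stripping punctuation and spacing. It would not catch digits spoken as words ("four one one one") or paraphrased content.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and keep only letters and digits, so '4111 2222' matches '4111-2222'."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def leaked_canaries(agent_output: str, canaries: dict[str, str]) -> list[str]:
    """Return the names of seeded canary values that appear in the agent output."""
    haystack = normalize(agent_output)
    return [name for name, value in canaries.items() if normalize(value) in haystack]


# Hypothetical seeded values for a red-team scenario.
CANARIES = {
    "account_number": "4111-2222-3333",
    "system_prompt_marker": "ZEBRA-PROMPT-7731",
}

leaks = leaked_canaries("Sure, your account is 4111 2222 3333.", CANARIES)
```

Here `leaks` contains `"account_number"` because the digits survive normalization, while a refusal such as "I can't share that." produces an empty list.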
Data Flow, Transcript Access, and Retention Questions
The data-flow review should be concrete enough that an engineer can draw it and a security reviewer can challenge it.
caller audio
-> telephony / WebRTC / SIP layer
-> recording and streaming transcription
-> voice agent runtime
-> LLM provider and tool calls
-> testing or monitoring platform
-> QA review, alerts, exports, analytics, and archive
Ask these questions for each data class, at every hop where it is stored or copied:
| Data Class | Storage Question | Access Question | Retention Question |
|---|---|---|---|
| Raw audio | Where is the recording stored and encrypted? | Who can play or export it? | Can it expire faster than redacted evidence? |
| Unredacted transcript | Is it stored after redaction completes? | Who can view raw text? | Can it be short-lived by default? |
| Redacted transcript | Is it the default analytics copy? | Can reviewers search without raw PII access? | Can it retain longer than raw content when policy allows? |
| Metadata | Which IDs, timestamps, queues, versions, and outcomes are stored? | Can low-PII metadata be broadly available? | Can it support debugging after raw content expires? |
| Tool trace | Are arguments, results, and errors filtered for secrets? | Who can inspect tool payloads? | Can traces expire separately from QA results? |
| QA annotation | Does it include reviewer notes or customer-sensitive text? | Can customer admins see reviewer activity? | Does it follow contract or audit requirements? |
| Export | Where do downloads, webhooks, and API exports go? | Can exports be disabled or approved? | Are exported copies included in deletion evidence? |
The voice agent log retention checklist goes deeper on retention classes. The call logging taxonomy helps standardize the fields before they spread across vendors.
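The retention column above implies that each data class has its own expiry, with raw content expiring faster than redacted evidence. A minimal sketch of that idea follows. The day counts are made-up example values, not recommendations, and the legal-hold flag is an assumed per-record field.

```python
from datetime import datetime, timedelta, timezone

# Assumed example retention windows in days; real values come from policy and contract.
RETENTION_DAYS = {
    "raw_audio": 7,
    "unredacted_transcript": 7,
    "redacted_transcript": 90,
    "metadata": 365,
    "tool_trace": 30,
    "qa_annotation": 365,
    "export": 30,
}


def is_expired(
    data_class: str,
    created_at: datetime,
    now: datetime,
    legal_hold: bool = False,
) -> bool:
    """Return True when a record has outlived its class-specific retention window."""
    if data_class not in RETENTION_DAYS:
        # Fail loudly so unclassified data is surfaced instead of silently retained.
        raise ValueError(f"no retention rule for data class {data_class!r}")
    if legal_hold:
        return False  # legal hold suspends expiry regardless of age
    return now - created_at >= timedelta(days=RETENTION_DAYS[data_class])
```

Raising on an unknown data class is a deliberate choice in this sketch: a new evidence type should force a retention decision rather than inherit a default.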
Deployment Model Decision Table
Do not ask for private tenant or self-hosting because it sounds more secure. Ask because a specific risk requires it.
| Deployment Model | Use When | Tradeoff |
|---|---|---|
| Multi-tenant SaaS | Low-to-moderate sensitivity, standard enterprise controls, fastest rollout | Depends on vendor tenant isolation and shared infrastructure controls |
| Private tenant | Regulated workflows, strict customer isolation, enterprise residency requirements | More operational overhead and longer setup |
| Customer-owned storage | Customer wants lifecycle, retention, keys, or archive controls in its own environment | Vendor may not control every retrieval or deletion workflow |
| Self-hosted | Contract, regulator, or internal policy requires customer-controlled runtime | Customer owns more upgrades, monitoring, and incident response |
| Hybrid | Sensitive evidence stays customer-side while aggregate results sync to vendor | More integration work and more shared-responsibility ambiguity |
NIST's AI Risk Management Framework emphasizes governance, mapping, measurement, and management of AI risks over the system lifecycle. Deployment model is one of those risk decisions. It should connect to data class, jurisdiction, customer contract, and operational owner.
AI Behavior, Prompt Injection, and Tool-Action Questions
Voice agent security is not only about storing data safely. It is also about preventing the system from doing the wrong thing with a caller, a transcript, or a tool.
Ask vendors to show how they test:
| AI Behavior Risk | Test Question | Passing Evidence |
|---|---|---|
| Spoken prompt injection | Can a caller override system instructions by speaking them? | Test cases for direct and indirect prompt injection |
| Sensitive output | Can the agent reveal system prompts, secrets, account details, or previous-call content? | Output checks and red-team samples |
| Unauthorized action | Can the agent call a tool the caller is not allowed to use? | Permission model and tool-policy assertions |
| Tool argument leakage | Can prompts, credentials, or PII leak into tool payloads? | Payload filtering and secret-scanning evidence |
| Duplicate side effect | Can retries create duplicate bookings, refunds, tickets, or messages? | Idempotency and sandbox test results |
| Handoff oversharing | Can the agent send too much context to a human queue or downstream system? | Handoff payload minimization and audit logs |
| Model drift | Can an upstream model change weaken safety behavior? | Model-change evaluation and rollback process |
This connects directly to voice agent workflow testing. Tool calls are not just quality events. They are permissioned actions that need preconditions, traceability, and side-effect evidence.
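To illustrate the "unauthorized action" and "duplicate side effect" rows, the sketch below wraps tool execution in a permission check and an idempotency-key cache. The class name `ToolGateway`, the role and tool names, and the mock executor are all hypothetical. A real system would persist idempotency keys and derive permissions from verified caller identity.

```python
from typing import Any, Callable


class ToolGateway:
    """Wraps tool execution with a permission check and idempotency-key deduplication."""

    def __init__(
        self,
        allowed_tools: dict[str, set[str]],
        executor: Callable[[str, dict], Any],
    ) -> None:
        self._allowed_tools = allowed_tools
        self._executor = executor
        self._completed: dict[tuple[str, str], Any] = {}

    def call(self, caller_role: str, tool: str, args: dict, idempotency_key: str) -> Any:
        # Precondition: the caller's role must be granted this tool.
        if tool not in self._allowed_tools.get(caller_role, set()):
            raise PermissionError(f"{caller_role!r} may not call {tool!r}")
        key = (tool, idempotency_key)
        if key in self._completed:
            return self._completed[key]  # retry: return the stored result, no new side effect
        result = self._executor(tool, args)
        self._completed[key] = result
        return result


# Mock executor standing in for a sandboxed integration.
executed: list[tuple[str, dict]] = []


def mock_executor(tool: str, args: dict) -> dict:
    executed.append((tool, args))
    return {"status": "ok", "tool": tool}


gateway = ToolGateway({"verified_caller": {"book_appointment"}}, mock_executor)
first = gateway.call("verified_caller", "book_appointment", {"slot": "09:00"}, "call-123-turn-4")
retry = gateway.call("verified_caller", "book_appointment", {"slot": "09:00"}, "call-123-turn-4")
```

After the retry, `executed` still holds a single entry, which is the kind of side-effect evidence the table asks vendors to show.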
Regulated Deployment Add-Ons
For healthcare, finance, insurance, and BPO deployments, add these questions on top of the baseline checklist rather than relying on it alone.
| Environment | Add These Questions |
|---|---|
| Healthcare | Will the vendor sign a BAA? Which systems handle ePHI? How are administrative, physical, and technical safeguards implemented? Can PHI be redacted before broad QA review? |
| Financial services | How are regulatory script adherence, call recordings, dispute evidence, access logs, and retention schedules handled? Can sensitive payment or account details be isolated? |
| Insurance | How are claims details, policy numbers, medical context, and adjuster notes protected? Can claim workflows be tested without writing to production systems? |
| BPO / outsourced operations | Can data, dashboards, reviewers, exports, and alerts be segmented by client? Are cross-client analytics de-identified and access-controlled? |
| International contact centers | Which regions process data? Can residency differ by customer, queue, or workspace? How are cross-border transfers documented? |
HHS describes the HIPAA Security Rule as requiring appropriate administrative, physical, and technical safeguards for electronic protected health information held by covered entities and business associates. For voice agent vendors, that means the review needs to include call recordings, transcripts, QA workflows, exports, and support access when they may contain PHI.
For more healthcare-specific test design, use the HIPAA PHI clinical workflow testing checklist. For redaction architecture, use the PII redaction compliance guide.
Red Flags and Green Flags
Use this section during vendor calls. It is often faster than a 200-row questionnaire.
| Red Flag | Why It Matters |
|---|---|
| "SOC 2 covers it" is the answer to every AI-specific question | SOC 2 scope may not cover prompt injection, tool abuse, or model-provider behavior |
| Raw audio and transcript permissions are the same | Reviewers may receive more sensitive data than they need |
| Support access is broad or informal | Debugging can become a privacy incident |
| Retention is one global number | Raw audio, redacted transcripts, metadata, and QA notes need different policies |
| The vendor cannot list subprocessors | Buyers cannot evaluate where sensitive call evidence goes |
| Production monitoring requires all calls by default | Sampling and queue-level policy should be configurable |
| The vendor cannot test deletion or export in staging | Compliance behavior is being assumed, not proven |
| The vendor cannot explain model-provider data use | Customer data may flow into systems procurement has not approved |

| Green Flag | What It Shows |
|---|---|
| Separate controls for raw audio, unredacted transcript, redacted transcript, metadata, exports, and QA notes | The vendor understands data-class risk |
| Time-boxed support access with approval and audit logs | Human access is controlled |
| Customer-configurable retention and deletion evidence | The platform can match policy, not just store data |
| Sandbox tool calls and mocked side effects | The vendor can test workflows without touching production systems |
| AI-specific red-team or regression tests | The vendor tests behavior, not only infrastructure |
| Clear shared-responsibility model for private tenant or self-hosted deployment | Security ownership will not be ambiguous |
| Honest unresolved-risk list | The vendor has operational maturity |
What This Checklist Cannot Prove
This checklist narrows the review. It does not replace a legal, security, or compliance decision.
| Limitation | What to Do Instead |
|---|---|
| A questionnaire cannot prove runtime behavior. | Run a non-production POC with seeded sensitive fields, mocked tool calls, and access-log review. |
| SOC 2 scope can lag the product surface. | Ask which systems are covered, which are excluded, and whether the POC path matches the audited path. |
| Redaction demos can hide edge cases. | Test 7-10 realistic samples from your own call patterns, including noisy audio and partial identifiers. |
| Private deployment does not remove shared responsibility. | Write down who owns upgrades, monitoring, incident response, key rotation, and deletion evidence. |
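The access-log review mentioned in the first row can be partly automated. The sketch below assumes audit events are exported as dictionaries with the actor, timestamp, action, and object fields the checklist asks for. The action names and the approved-actor list are hypothetical.

```python
# Assumed action names for access to the most sensitive evidence.
SENSITIVE_ACTIONS = {"recording_playback", "raw_transcript_view", "export"}


def unapproved_sensitive_access(events: list[dict], approved_actors: set[str]) -> list[dict]:
    """Return audit events where a non-approved actor touched sensitive evidence."""
    return [
        event
        for event in events
        if event["action"] in SENSITIVE_ACTIONS and event["actor"] not in approved_actors
    ]


# Hypothetical audit-log sample from a POC.
events = [
    {"actor": "alice@customer.example", "timestamp": "2026-05-01T10:00:00Z",
     "action": "recording_playback", "object": "call-001"},
    {"actor": "support@vendor.example", "timestamp": "2026-05-01T11:30:00Z",
     "action": "raw_transcript_view", "object": "call-001"},
    {"actor": "bob@customer.example", "timestamp": "2026-05-01T12:00:00Z",
     "action": "dashboard_view", "object": "weekly-report"},
]

findings = unapproved_sensitive_access(events, approved_actors={"alice@customer.example"})
```

In this sample, `findings` holds only the vendor support user's raw transcript view, which is the kind of event a reviewer would want explained before the POC expands.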
POC Gating Checklist
Before the POC handles sensitive production calls, make these gates explicit.
| Gate | Pass Criteria | Owner |
|---|---|---|
| Data scope | POC data classes are listed and approved | Engineering + security |
| Data-use policy | No customer data is used for training or support debugging without approval | Legal + security |
| Access model | Raw audio, transcript, export, admin, and support permissions are separated | Security |
| Redaction | Sensitive fields are redacted before broad search or reporting | Security + QA |
| Retention | Audio, transcripts, metadata, and QA notes have default expiration rules | Compliance |
| Deletion | A sample delete can be executed and evidenced | Security + vendor |
| Subprocessors | Data categories and regions are reviewed | Legal + security |
| Tool safety | Tool calls run against mocks, sandboxes, or approved dry-run endpoints | Engineering |
| Incident process | Notification path and severity rules are known | Security + on-call |
| Exit plan | Data export and deletion after POC are written down | Procurement + legal |
The POC can still be fast. The difference is that it starts with a smaller, approved data boundary instead of discovering the boundary after transcripts have already moved.
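If the gates are tracked in a ticket or a config file, a small check can keep the POC from starting with an unresolved gate. This is a minimal sketch: the gate identifiers mirror the table above, but the naming and the dictionary format are assumptions.

```python
# Gate identifiers mirroring the POC gating table (names are an assumed convention).
GATES = [
    "data_scope",
    "data_use_policy",
    "access_model",
    "redaction",
    "retention",
    "deletion",
    "subprocessors",
    "tool_safety",
    "incident_process",
    "exit_plan",
]


def blocking_gates(status: dict[str, bool]) -> list[str]:
    """Return gates that are missing or not yet passed, in table order."""
    return [gate for gate in GATES if not status.get(gate, False)]


# Example: everything passed except deletion, and exit_plan was never recorded.
status = {gate: True for gate in GATES if gate != "exit_plan"}
status["deletion"] = False
blockers = blocking_gates(status)
```

A gate that was never recorded counts as blocking, so `blockers` here lists both `deletion` and `exit_plan`.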
How Hamming Fits This Review
Hamming is built for teams that need to test and monitor voice agents before production failures become customer problems. In security review, the practical question is not "do you have a dashboard?" It is whether the platform can help teams evaluate real voice-agent behavior while respecting access, retention, redaction, and deployment constraints.
Use this checklist against Hamming too. Ask for the same evidence: data flow, retention, deletion, access controls, support access, subprocessors, deployment model, and AI-specific testing approach. Strong security review should make the buying process clearer, not more theatrical.
For broader vendor selection, pair this with questions to ask voice testing vendors, Hamming vs. Coval, and the voice agent production readiness checklist.

