What is voice agent structured output validation?

Voice agent structured output validation checks whether extracted JSON fields and labels are supported by the actual call evidence. Hamming's checklist requires schema version, call ID, evidence span, raw value, normalized value, confidence, and action policy before high-risk fields are trusted.

Is JSON schema validation enough for voice agent data extraction?

No. JSON schema validation proves shape, not truth. According to Hamming's validation checklist, high-risk voice fields also need evidence spans, ambiguity rules, normalization checks, and downstream action gates before they can update a system.

How do I validate extracted dates and times from voice calls?

Preserve the raw phrase, resolve it against the caller's timezone and business calendar, and mark uncertain phrases as needing confirmation. Hamming's checklist treats phrases like "Friday after 2" as candidates until the slot, timezone, and confirmation status are explicit.

How should structured outputs feed tool calls?

Structured outputs should feed tool calls only after schema, evidence, ambiguity, trust, destination, idempotency, and audit gates pass. If any gate fails, Hamming recommends asking for clarification, routing to review, or keeping the field read-only.

What evidence should I store with extracted voice agent fields?

Store call ID, transcript or audio span, schema version, extractor version, raw value, normalized value, confidence, review status, and any tool request or result that used the field. Hamming's checklist uses those fields so failed extractions can become repeatable regression tests.

How many regression tests should structured output extraction have?

Start with 5 to 10 blocking cases for the fields most likely to cause bad writes: names, account IDs, dates, money amounts, addresses, and intent labels. Hamming recommends adding a new regression case whenever a production call exposes a repeated extraction miss.

How do I handle PII in structured voice outputs?

Redact or role-scope sensitive fields before broad search, dashboards, or exports. Hamming's checklist keeps raw audio and unredacted transcripts behind stricter controls, and avoids storing sensitive structured fields unless the downstream workflow genuinely needs them.

Voice Agent Structured Output Validation Checklist

Voice agent structured output validation is the process of proving that extracted JSON fields, tool arguments, intent labels, and post-call analysis results are supported by what the caller actually said.

If the output is only used for a rough internal summary, a lightweight review may be enough. This checklist is for teams that send structured call data into CRMs, QA dashboards, compliance queues, test cases, routing decisions, or workflow automations.

The failure mode is subtle: the JSON parses, the schema passes, and the dashboard looks clean. Then a date, account ID, medication name, refund amount, or callback intent is wrong because the extractor normalized the wrong phrase.

We found this shows up most often when teams treat the extractor as the source of truth instead of treating it as one more system that needs evidence.

TL;DR: Validate voice agent structured outputs in 4 layers: schema, evidence, normalization, and action safety. A field should not drive a tool call, customer record, compliance decision, or metric until it is tied to transcript or audio evidence and the ambiguity policy is clear.

Methodology Note: This checklist is based on Hamming's analysis of production voice agent calls where extracted fields, tool arguments, and post-call labels changed downstream workflows across 10K+ voice agents (2025-2026). Hamming's platform has protected 10M+ minutes of calls. We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions.
Treat structured output validation as a workflow control, not a formatting check. The highest-risk failures are usually schema-valid and semantically wrong.

Last Updated: June 2026

Related Guides:

Voice Agent Workflow Testing - assert tool calls, state transitions, and side effects
Voice Agent Sandbox Testing - prove tool calls without touching production data
Voice Agent Transcript Search Schema - store searchable transcript evidence
Voice Agent Call Evidence Export Runbook - package call evidence for review
Voice Agent Tests as Code - put guardrails and fixtures in Git
Voice Agent Caller Identity Testing - decide which identity fields are trusted
Voice Agent CI/CD Testing - gate risky changes before release
OpenTelemetry for Voice Agents - connect extraction failures to traces
Voice Agent Response Coverage - convert production gaps into tests
WebSocket Voice Agent Testing - validate event streams and tool calls without PSTN

Do Not Confuse Schema Validity With Caller Truth

OpenAI's Structured Outputs documentation explains that Structured Outputs can make model responses adhere to a supplied JSON Schema. That is useful. It does not prove the fields are true.

Valid JSON means the parser can read it. Schema-valid output means the fields match the contract. Caller-supported output means each meaningful field can be traced back to the call evidence.

Level	What It Proves	What It Does Not Prove	Sample Failure
Valid JSON	The output parses.	Required fields, enums, and types are correct.	`"appointment_date": "next Friday"` parses but is not normalized.
Schema-valid	Required fields and types match the schema.	The extracted value was said by the caller.	`"refund_amount": 50` fits the schema but the caller said $15.
Caller-supported	The value is tied to transcript or audio evidence.	The value is safe to write downstream.	Caller mentioned "John" but identity lookup found 3 matching records.
Action-safe	Evidence, confidence, policy, and destination are all acceptable.	The whole workflow succeeded.	Extracted callback time is safe, but SMS delivery fails later.

Structured output validation rule: do not let a field cross into a tool call, CRM write, compliance queue, or KPI until it passes schema validation, caller-evidence validation, normalization validation, and action-safety validation.
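The first three levels in the table can be sketched as one function that reports the highest level an output reaches. This is a minimal illustration, not Hamming's implementation: `REQUIRED_FIELDS`, `evidence_quote`, and `validation_level` are hypothetical names, and the evidence check is a plain substring match that only confirms the quoted phrase was said.

```python
import json

# Hypothetical minimal contract: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"refund_amount": (int, float), "evidence_quote": str}


def validation_level(raw_output: str, transcript: str) -> str:
    """Return the highest level reached: invalid, valid_json,
    schema_valid, or caller_supported."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return "invalid"
    if not isinstance(data, dict):
        return "valid_json"
    for name, expected_type in REQUIRED_FIELDS.items():
        value = data.get(name)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return "valid_json"
    # Evidence check: the quoted phrase must literally appear in the transcript.
    quote = data["evidence_quote"].strip().lower()
    if not quote or quote not in transcript.lower():
        return "schema_valid"
    return "caller_supported"
```

With a transcript containing "fifteen dollars", an output that quotes "fifty dollars" stops at `schema_valid`. Checking that the numeric value agrees with the quoted phrase is a separate normalization step, and nothing here decides whether the value is action-safe.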

This is the same reason voice agent workflow testing cannot stop at "the agent sounded correct." A model can produce a valid object that still misrepresents the caller.

The Validation Contract

Start with a contract that describes how each field earns trust.

structured_output_validation =
  schema_version
  + source_call_id
  + transcript_or_audio_span
  + extraction_prompt_or_tool_version
  + field_value
  + normalized_value
  + confidence
  + ambiguity_policy
  + downstream_action_policy

Use this as the minimum review table.

Field	Required?	Sample	Why It Matters
`schema_version`	Yes	`booking_intake_v3`	Prevents old extractors from feeding new workflows.
`source_call_id`	Yes	`call_2026_06_07_1042`	Lets reviewers replay the evidence.
`evidence_span`	Yes	`turn_12: chars 40-78` or audio `00:03:12-00:03:18`	Shows where the value came from.
`raw_utterance`	Usually	"Friday after 2"	Keeps the original caller words.
`normalized_value`	Conditional	`2026-06-12T14:00:00-05:00`	Makes downstream systems deterministic.
`confidence`	Yes	`0.82` or `high`	Separates strong evidence from guesswork.
`review_status`	Yes for high-risk fields	`auto_pass`, `needs_review`, `rejected`	Blocks writes when evidence is weak.
`action_policy`	Yes	`may_write_sandbox`, `human_review_required`	Keeps extraction from becoming an unsafe side effect.
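One way to enforce the review table is a small record type plus a function that lists what is missing. The sketch below assumes numeric confidence (the table also allows labels such as `high`), and `ExtractedField` and `contract_violations` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

REVIEW_STATES = {"auto_pass", "needs_review", "rejected"}


@dataclass
class ExtractedField:
    schema_version: str
    source_call_id: str
    evidence_span: str
    confidence: float  # assumption: numeric score between 0 and 1
    action_policy: str
    raw_utterance: Optional[str] = None      # "Usually" required
    normalized_value: Optional[str] = None   # "Conditional"
    review_status: Optional[str] = None      # required for high-risk fields


def contract_violations(record: ExtractedField, high_risk: bool) -> list[str]:
    """List the contract rows this record fails; an empty list means it passes."""
    problems = []
    for name in ("schema_version", "source_call_id", "evidence_span", "action_policy"):
        if not getattr(record, name).strip():
            problems.append(f"missing {name}")
    if not 0.0 <= record.confidence <= 1.0:
        problems.append("confidence out of range")
    if high_risk and record.review_status not in REVIEW_STATES:
        problems.append("missing review_status")
    return problems
```

Returning a list of problems rather than a boolean gives reviewers the reason a field was blocked, which is easier to turn into a regression case later.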

Vapi's structured outputs documentation describes processing complete call context, transcripts, messages, tool results, and metadata before delivering schema-shaped results. That is the right direction. The validation layer still has to decide whether each result should be trusted.

Field-Level Checklist

Not every field deserves the same scrutiny. Tags and summaries can tolerate more ambiguity than dates, amounts, identity fields, and tool arguments.

Field Type	Validate Against	Block When	Safer Fallback
Caller name	Caller utterance, identity lookup, spelling confirmation	Name conflicts with trusted profile or only appears in ASR noise	Mark as unverified and ask for confirmation.
Phone or account ID	Trusted lookup, masked transcript, caller confirmation	Value came only from model inference	Use backend identity record instead.
Date and time	Transcript span, timezone, business calendar	Relative date cannot be resolved unambiguously	Store raw phrase and request confirmation.
Money amount	Transcript span, currency, decimal normalization	Amount differs between transcript and tool argument	Route to review before writing.
Address	Transcript span, geocoder result, required fields	Missing unit, city, postal code, or country when required	Ask a clarifying question.
Intent label	Transcript evidence, allowed label set, confidence	Label is inferred from agent summary only	Keep as candidate label.
Sentiment or CSAT	Caller words, explicit score, evaluator rationale	The caller never provided enough evidence	Mark as unavailable.
Tool arguments	Tool schema, evidence span, preconditions	Required argument was invented or copied from stale context	Fail the workflow test.
Summary	Transcript coverage, redaction policy	Summary includes unsupported facts or sensitive details	Regenerate from redacted evidence.

For transcript entity extraction, AssemblyAI documents entity records with text, entity type, and start/end timestamps. Timestamps are useful because they let a reviewer jump from the extracted field to the precise evidence span instead of reading the whole call.
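To illustrate why timestamps help, the sketch below looks up the transcript words that fall inside an entity's time span and compares them with the entity text. The `Word` shape and both function names are assumptions made for this example; they are not AssemblyAI's API, and a real transcript would need tolerance for small timestamp drift.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start_ms: int
    end_ms: int


def words_in_span(words: list[Word], start_ms: int, end_ms: int) -> str:
    """Join the transcript words that fall fully inside the time span."""
    inside = [w.text for w in words if w.start_ms >= start_ms and w.end_ms <= end_ms]
    return " ".join(inside)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())


def entity_matches_evidence(entity_text: str, words: list[Word],
                            start_ms: int, end_ms: int) -> bool:
    """True when the words at the entity's timestamps equal the entity text."""
    return normalize(words_in_span(words, start_ms, end_ms)) == normalize(entity_text)
```

A mismatch does not say which side is wrong. It only tells the reviewer exactly which seconds of the call to replay.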

A Copyable Schema for Evidence-Aware Extraction

Use a schema that forces evidence, not just values.

{
  "type": "object",
  "additionalProperties": false,
  "required": ["callId", "schemaVersion", "fields", "overallStatus"],
  "properties": {
    "callId": { "type": "string" },
    "schemaVersion": { "type": "string" },
    "overallStatus": {
      "type": "string",
      "enum": ["auto_pass", "needs_review", "rejected"]
    },
    "fields": {
      "type": "array",
      "items": {
        "type": "object",
        "additionalProperties": false,
        "required": [
          "name",
          "rawValue",
          "normalizedValue",
          "evidenceSpan",
          "confidence",
          "actionPolicy"
        ],
        "properties": {
          "name": { "type": "string" },
          "rawValue": { "type": "string" },
          "normalizedValue": { "type": ["string", "number", "boolean", "null"] },
          "evidenceSpan": { "type": "string" },
          "confidence": { "type": "number", "minimum": 0, "maximum": 1 },
          "actionPolicy": {
            "type": "string",
            "enum": ["read_only", "human_review_required", "may_write_sandbox", "may_write_production"]
          }
        }
      }
    }
  }
}

OpenAI's function-calling guide recommends strict mode for function schemas and documents requirements such as `additionalProperties: false` and required fields. Use those constraints for shape. Then add your own evidence requirements for truth.

Evidence-first schema rule: every high-risk extracted field should carry its raw value, normalized value, evidence span, confidence, and action policy. A value without provenance is a note, not a workflow input.
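The rule can be applied mechanically after schema validation. In this sketch, which uses the camelCase field names from the schema above, a field with any missing or empty provenance key is downgraded to `read_only` regardless of the policy the extractor declared. `effective_policy` is a hypothetical helper, and treating a null `normalizedValue` as missing provenance is an assumption you may want to relax for fields that are allowed to stay unresolved.

```python
PROVENANCE_KEYS = ("rawValue", "normalizedValue", "evidenceSpan", "confidence", "actionPolicy")


def effective_policy(field: dict) -> str:
    """Return the declared action policy only when every provenance key is
    present and non-empty; otherwise treat the value as a read-only note."""
    for key in PROVENANCE_KEYS:
        value = field.get(key)
        if value is None or (isinstance(value, str) and not value.strip()):
            return "read_only"
    return field["actionPolicy"]
```

The check only ever lowers trust. It never upgrades a field beyond what the extractor declared, so it is safe to run in front of any downstream writer.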

Normalize Only After Preserving What Was Said

Normalization is where many errors become permanent. Once "Friday after 2" becomes a timestamp, the original uncertainty can disappear.

Caller Phrase	Bad Normalization	Better Output
"Friday after 2"	`2026-06-12T14:00:00Z`	Raw phrase plus local timezone, candidate slots, and confirmation required.
"fifteen dollars"	`50`	Raw phrase, parsed amount `15.00`, currency, and transcript span.
"my son John"	`caller_name: John`	Relationship field plus dependent name, not caller identity.
"cancel the second one"	`cancel_all: true`	Referenced appointment ID only after listing candidates.
"same address as last time"	Full address copied from stale profile	Trusted profile pointer plus confirmation status.
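The first row of the table could look like this in code. The sketch keeps the raw phrase, resolves the next Friday at 14:00 in the caller's local time, and marks the result as a candidate rather than a confirmed slot. It assumes the call time is timezone-aware with a fixed UTC offset; a production version would more likely use an IANA zone and check the business calendar. `friday_after_two_candidate` is a hypothetical name.

```python
from datetime import datetime, timedelta

FRIDAY = 4  # datetime.weekday() counts Monday as 0


def friday_after_two_candidate(raw_phrase: str, call_time: datetime) -> dict:
    """Build a candidate callback window without discarding the raw phrase.

    call_time must be timezone-aware and expressed in the caller's local zone.
    If the call itself happens on a Friday, the same day is returned, which is
    one more reason the result stays unconfirmed.
    """
    days_ahead = (FRIDAY - call_time.weekday()) % 7
    friday = (call_time + timedelta(days=days_ahead)).replace(
        hour=14, minute=0, second=0, microsecond=0
    )
    return {
        "raw_value": raw_phrase,
        "earliest_start": friday.isoformat(),  # keeps the local offset, not Z
        "status": "needs_confirmation",
    }
```

For a call placed on Sunday 2026-06-07 at UTC-05:00, the candidate starts at `2026-06-12T14:00:00-05:00`. The offset survives in the output, which is the difference from the bad normalization in the table.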

I used to think structured outputs were mostly a schema problem. They are not. In voice, the hard part is preserving uncertainty long enough for the workflow to handle it safely.

That matters for caller identity testing. A trusted backend field and a caller-spoken field are different evidence types. Do not merge them just because both can fit into one JSON object.

Gate Tool Calls and Side Effects

Structured outputs often become tool arguments. That is where validation stops being editorial and starts being operational.

Before a field can drive a tool call or side effect, require these checks:

Gate	Pass Condition	Fail Action
Schema gate	Output matches the schema version expected by the workflow.	Reject and retry extraction or route to review.
Evidence gate	Field has transcript or audio support.	Mark field unavailable.
Ambiguity gate	The field is unambiguous for the action.	Ask a clarifying question.
Trust gate	Identity and permission fields come from approved sources.	Block sensitive action.
Destination gate	The target system is sandbox, test, or explicitly allowlisted.	Do not write.
Idempotency gate	The action has a dedupe key.	Block retries that can duplicate records.
Audit gate	Call ID, field evidence, tool request, and result are stored.	Treat as failed evidence.

Pair this with voice agent sandbox testing. A structured output can be correct and still unsafe to write to production if the environment, idempotency key, or cleanup rule is missing.
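The gate table maps naturally onto an ordered list of checks where the first failure decides the fail action. The context keys (`expected_schema_version`, `identity_source`, `dedupe_key`, `audit_record_id`, and so on) and the action names are hypothetical stand-ins for whatever your workflow engine exposes, so treat this as a sketch of the control flow only.

```python
from typing import Callable, Optional

# (gate name, pass condition, fail action), in the order of the table above.
GATES: list[tuple[str, Callable[[dict], bool], str]] = [
    ("schema",
     lambda c: c.get("schema_version") is not None
     and c.get("schema_version") == c.get("expected_schema_version"),
     "reject_or_review"),
    ("evidence", lambda c: bool(c.get("evidence_span")), "mark_field_unavailable"),
    # Missing ambiguity information is treated as ambiguous.
    ("ambiguity", lambda c: c.get("ambiguous", True) is False, "ask_clarifying_question"),
    ("trust", lambda c: c.get("identity_source") in {"backend_lookup"}, "block_sensitive_action"),
    ("destination", lambda c: c.get("destination") in {"sandbox", "test"}, "do_not_write"),
    ("idempotency", lambda c: bool(c.get("dedupe_key")), "block_retries"),
    ("audit", lambda c: bool(c.get("audit_record_id")), "treat_as_failed_evidence"),
]


def first_failed_gate(context: dict) -> Optional[tuple[str, str]]:
    """Return (gate, fail_action) for the first failing gate, or None if all pass."""
    for name, passes, fail_action in GATES:
        if not passes(context):
            return name, fail_action
    return None
```

Every gate defaults to failing when its input is absent, so an empty context is blocked at the schema gate instead of slipping through.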

CI Regression Sample

Put the risky cases in CI, not just in a dashboard.

id: callback_time_extraction_ambiguous_date
risk: blocking
agent_ref:
  environment: staging
  prompt_version: pr-842
input:
  call_fixture: fixtures/calls/callback_next_friday_after_two.wav
expected_structured_output:
  schema_version: callback_request_v2
  fields:
    - name: callback_window
      raw_value_must_include: "Friday after 2"
      normalized_value_status: needs_confirmation
      evidence_span_required: true
      action_policy: human_review_required
forbidden:
  - create_production_callback
  - write_crm_task_without_confirmation
evidence:
  retain_audio: true
  retain_transcript: true
  retain_structured_output: true
  retain_tool_trace: true

This belongs next to your voice agent tests as code. If a prompt change makes the extractor confidently choose the wrong date, the PR should fail before a customer record changes.
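A test runner could evaluate a case like the one above with a comparison along these lines. The expectation keys mirror the sample; the shape of the actual extraction result and the list of observed tool calls are assumptions, and `regression_failures` is a hypothetical helper rather than part of any specific test framework.

```python
def regression_failures(expected: dict, forbidden: list[str],
                        actual: dict, tool_calls: list[str]) -> list[str]:
    """Compare one extracted field and the observed tool calls against the
    regression case; an empty list means the case passes."""
    failures = []
    if expected["raw_value_must_include"] not in actual.get("raw_value", ""):
        failures.append("raw value lost")
    if actual.get("normalized_value_status") != expected["normalized_value_status"]:
        failures.append("wrong normalization status")
    if expected["evidence_span_required"] and not actual.get("evidence_span"):
        failures.append("missing evidence span")
    if actual.get("action_policy") != expected["action_policy"]:
        failures.append("wrong action policy")
    # Any forbidden side effect fails the case, even if the field looks right.
    failures += [f"forbidden tool call: {name}" for name in tool_calls if name in forbidden]
    return failures
```

In CI, a non-empty list for a case marked `risk: blocking` would fail the build.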

What This Checklist Cannot Prove

Structured output validation does not prove the whole voice agent worked.

It does not prove the caller was satisfied, the audio was clear, the tool completed, the downstream record synced, or the business policy was right. It proves a narrower thing: the structured fields you plan to trust are shaped correctly, supported by evidence, normalized safely, and gated before action.

Use call evidence exports when a reviewer needs the whole packet. Use transcript search schema when teams need to find similar failures across calls. Use response coverage when repeated extraction misses should become new test cases.

Final Review Checklist

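Before trusting voice agent structured outputs, check that every high-risk field has passed all four layers covered in this guide: schema, evidence, normalization, and action safety.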

Sumanyu Sharma

Related Resources

Voice Agent Sandbox Testing for Tool Calls and Side Effects

How to Turn Failed Production Calls Into Regression Tests

Long-Call Voice Agent Testing: How to Test 70+ Conversation Turns