Drive-thru voice agent order testing proves that a restaurant voice agent can take a messy spoken order and finish with the same order in the cart, the POS, and the readback.
If the agent only answers store hours or loyalty-program FAQs, this template is more than you need. A few conversation tests are enough.
If it takes drive-thru, phone, kiosk, or catering orders, transcript-only testing is where teams get burned. The call can sound normal while the cart contains the wrong modifier, the combo lost its drink, the POS accepted an unavailable item, or the allergy note never made it to the kitchen.
That failure has a name I would put on the wall of every restaurant voice AI test plan: cart drift. The caller and agent seem aligned, but the durable order state is no longer the order the caller asked for.
TL;DR: Build drive-thru voice agent tests as a menu-cart-POS matrix:
- Menu state: store, menu version, item IDs, modifier groups, prices, availability, and schedule.
- Caller order: the exact spoken phrase, including substitutions, allergies, corrections, and background noise.
- Cart snapshots: expected cart state after every add, remove, change, upsell, and readback.
- POS mode: mock, sandbox, test store, or narrow live-scoped path.
- Evidence: transcript, tool trace, item and modifier IDs, price check, final POS state, and cleanup result.
A passing transcript is useful. It is not proof of an accurate order. The cart and final order state have to agree.
Methodology Note: This template is based on Hamming's analysis of 4M+ production voice agent calls involving order capture, menu state, tool calls, POS writes, background noise, and QSR-style workflow failures across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions. Use it as a launch-safety template for drive-thru, phone ordering, kiosk, and catering voice agents that touch real menu or POS state.
Last Updated: June 2026
Related Guides:
- Hamming for QSR voice AI testing - QSR-specific testing surface for menu orders, customizations, load, and drive-thru conditions
- Lilac Labs customer spotlight - public example of automated drive-thru order accuracy testing
- Background Noise Voice Agent Testing - test speaker distortion, engines, wind, passengers, and store noise
- Voice Agent Sandbox Testing - run side-effect tests without production writes
- Customer-Specific Workflow Rules Template - handle store, franchise, region, or account-specific rules
- Voice Agent Workflow Testing Runbook - broader workflow and tool-call coverage
- Voice Agent Tests as Code - keep menu fixtures and expected cart state reviewable in Git
- Voice Agent Load Testing Guide - test lunch rush, dinner rush, and promo spikes
- Structured Output Validation Checklist - validate extracted items, modifiers, and order fields
- Failed Production Call Regression Runbook - turn missed orders into repeatable tests
What Drive-Thru Order Testing Should Prove
Drive-thru order testing checks whether the voice agent can turn the way people actually order food into the exact durable order the restaurant expects.
| Layer | What the test proves | Failure example |
|---|---|---|
| Menu understanding | The agent maps speech to canonical menu items, modifier groups, sizes, prices, and availability | "No cheese" is captured as a note instead of a modifier removal |
| Cart mutation | The cart changes correctly after each turn | Caller says "make that a large" and both the medium and large remain in the cart |
| Readback | The agent confirms the order in language the customer can correct | Agent reads back the wrong drink or skips the allergy note |
| POS write | The final order sent downstream matches the confirmed cart | POS receives a stale item ID from a previous menu version |
| Operational safety | Tests avoid real orders unless the release owner explicitly allows a scoped live check | CI creates a kitchen order or payment authorization |
Drive-thru order test: a voice agent test that loads a menu snapshot, runs a spoken ordering scenario, verifies cart state after every mutation, checks final POS or order state, and saves cleanup evidence.
The hard part is not "Can the agent understand fries?" It is whether it keeps the same order intact after the caller changes their mind, a passenger adds a drink, the lunch menu is missing an item, and the speaker sounds like it was installed in 2008.
Build the Menu-Cart-POS Matrix
Start with a matrix. Do not start by dialing the agent 200 times.
| Field | What to capture | Sample |
|---|---|---|
| Scenario ID | Stable identifier for the test | qsr_combo_modifier_017 |
| Store fixture | Store, franchise, daypart, menu version, tax region | store_fixture_midwest_03, breakfast disabled |
| Caller phrase | What the customer says | "Can I get a cheeseburger combo, no pickles, Coke Zero, and make the fries large?" |
| Menu expectation | Canonical item IDs, modifier IDs, availability, price | burger item, no-pickle modifier, combo drink slot, large fry upcharge |
| Cart checkpoints | Expected cart after each add, remove, or change | 1 combo, 0 pickles, Coke Zero, large fries |
| Forbidden state | What must not happen | duplicate burger, default drink, medium fries, stale price |
| POS mode | Mock, sandbox, test store, or live scoped | POS sandbox with fixture order tag |
| Evidence | What proves success | transcript, tool trace, cart snapshots, POS response, cleanup query |
| Gate | Blocking, scheduled, pre-release, or manual | blocking for allergy, payment, and POS writes |
Keep this matrix near your tests-as-code definitions. A teammate should be able to review the expected order without replaying the call.
Copyable Matrix Starter
suite: drive_thru_order_accuracy
owner: voice-platform
menu_version: qsr_menu_2026_06_20_lunch
scenarios:
  - id: qsr_combo_modifier_017
    store_fixture: store_fixture_midwest_03
    dependency_mode:
      menu: snapshot
      cart: sandbox
      pos: test_store
    caller_goal: "Order a cheeseburger combo with no pickles, Coke Zero, and large fries"
    # Each checkpoint is the full expected cart after that turn, not a delta.
    expected_cart_checkpoints:
      - turn: "add cheeseburger combo"
        item_ids: ["combo_cheeseburger"]
        modifier_ids: []
        subtotal_cents: 849
      - turn: "no pickles"
        item_ids: ["combo_cheeseburger"]
        modifier_ids: ["remove_pickles"]
      - turn: "Coke Zero and large fries"
        item_ids: ["combo_cheeseburger"]
        # The earlier pickle removal must still be on the cart at this point.
        modifier_ids: ["remove_pickles", "drink_coke_zero", "fry_large"]
        subtotal_cents: 999
    forbidden_states:
      - duplicate_combo
      - default_drink
      - medium_fry_after_large_upgrade
      - kitchen_note_instead_of_modifier_id
    final_assertions:
      readback_contains:
        - "cheeseburger combo"
        - "no pickles"
        - "Coke Zero"
        - "large fries"
      pos_order_count_for_run: 1
      cleanup_status: verified
That is more structured than a transcript score. It is also what lets you catch the order bug before a customer receives the wrong bag.
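A minimal Python sketch of how the starter's `forbidden_states` and `final_assertions` could be evaluated at the end of a run. The `CartLine` shape, the `check_final` helper, and the two detectors are assumptions made for illustration; they are not part of any Hamming or POS API, and the substring readback check is deliberately crude.

```python
from dataclasses import dataclass, field


@dataclass
class CartLine:
    item_id: str
    modifier_ids: list = field(default_factory=list)
    note: str = ""  # free-text kitchen note, if the agent wrote one


def observed_forbidden_states(lines):
    """Return names of forbidden states seen in the cart (hypothetical detectors)."""
    observed = []
    combo_ids = [line.item_id for line in lines if line.item_id.startswith("combo_")]
    if len(combo_ids) != len(set(combo_ids)):
        observed.append("duplicate_combo")
    for line in lines:
        # A removal that exists only as free text never reaches modifier routing.
        if "no pickles" in line.note.lower() and "remove_pickles" not in line.modifier_ids:
            observed.append("kitchen_note_instead_of_modifier_id")
    return observed


def check_final(lines, readback, readback_contains, pos_order_count):
    """Collect every final-assertion failure instead of stopping at the first."""
    failures = [f"forbidden state: {s}" for s in observed_forbidden_states(lines)]
    spoken = readback.lower()
    failures += [
        f"readback missing: {phrase}"
        for phrase in readback_contains
        if phrase.lower() not in spoken
    ]
    if pos_order_count != 1:
        failures.append(f"pos_order_count_for_run was {pos_order_count}, expected 1")
    return failures
```

An empty list means the run passed these checks. Returning all failures at once keeps the evidence useful when a single bad turn causes several symptoms.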
Test Menu Understanding Before Cart Mutation
Menu understanding has to be tested before cart mutation because the wrong canonical item poisons every downstream check.
Public restaurant ordering APIs expose why this gets tricky. Google Cloud's Food Ordering AI Agent menu integration docs separate menu structure, modifiers, schedules, prices, and availability before ordering works. Toast's modifier docs call out behavior around default modifiers, pre-modifiers, special requests, and quantity mismatches. Lightspeed's online ordering API docs similarly distinguish menu management from order creation and confirmation.
Your tests should do the same.
| Spoken request | Menu-state assertion | Why it matters |
|---|---|---|
| "No pickles" | Uses the remove-pickles modifier or explicit default-modifier removal | Kitchen notes may not remove the ingredient |
| "Extra sauce" | Adds allowed modifier quantity within menu constraints | Free-text notes may bypass price or kitchen routing |
| "Make it a large" | Changes the size or combo slot, not a second item | Duplicate items hide behind a correct readback |
| "I'll take the breakfast sandwich" at 2 PM | Marks item unavailable and offers valid alternatives | Menu schedules change the correct answer |
| "I'm allergic to sesame" | Adds the safe allergy workflow or escalation path | Allergy handling should not be treated as a casual note |
Menu snapshot rule: every order test should name the menu version it loaded. If the menu can change without changing the test expectation, the test is not reproducible.
This is where QSR tests differ from generic tool-call tests. A tool call named add_item is not enough. The test has to prove the item and modifier identifiers match the restaurant's active menu model.
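As a rough sketch of that check, the Python below validates an `add_item` tool call against a loaded menu snapshot. The `MenuItem` fields, the daypart names, and `validate_add_item` are hypothetical; a real menu model (Toast, Lightspeed, or an in-house one) will have its own structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MenuItem:
    item_id: str
    allowed_modifier_ids: frozenset  # modifiers this item's groups permit
    dayparts: frozenset              # e.g. {"lunch", "dinner"}
    available: bool = True           # store-level availability flag


def validate_add_item(menu, daypart, item_id, modifier_ids):
    """Return reasons the tool call does not match the active menu snapshot."""
    item = menu.get(item_id)
    if item is None:
        return [f"unknown item_id: {item_id}"]
    errors = []
    if not item.available:
        errors.append(f"{item_id} is marked unavailable")
    if daypart not in item.dayparts:
        errors.append(f"{item_id} is not on the {daypart} menu")
    for modifier_id in modifier_ids:
        if modifier_id not in item.allowed_modifier_ids:
            errors.append(f"modifier {modifier_id} is not allowed on {item_id}")
    return errors
```

Running this on every tool call in the trace, rather than only on the final cart, shows which turn first introduced a stale or invalid ID.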
Test Cart State Across Corrections
Most order bugs happen after the first item. That is why a single happy-path burger order is such weak evidence.
The caller changes their mind. A passenger interrupts. The agent suggests an upsell. The customer says "actually make that two." The agent needs to update the cart without keeping stale state.
Use turn-by-turn cart checkpoints.
| Turn | Caller says | Expected cart state | Common failure |
|---|---|---|---|
| 1 | "I want a spicy chicken sandwich combo" | 1 spicy chicken combo, default size, default side pending | Adds sandwich only, not combo |
| 2 | "Make the fries large" | Same combo, large fries, upcharge applied | Adds separate large fries |
| 3 | "No mayo" | Same combo, remove-mayo modifier | Stores note but leaves default mayo |
| 4 | "Actually make that grilled chicken" | Replaces spicy chicken with grilled chicken, keeps compatible modifiers | Leaves both sandwiches |
| 5 | "Add a kids meal too" | Combo plus kids meal | Replaces cart instead of appending |
| 6 | "Read it back" | Spoken readback matches cart and price | Reads back transcript memory, not cart state |
The test should fail as soon as cart state diverges. Waiting until the final readback turns a small state bug into a scavenger hunt across transcript, tool trace, and POS payload.
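One way to get that fail-fast behavior is to compare expected and actual cart snapshots turn by turn and stop at the first mismatch. This Python sketch assumes each snapshot is a list of dicts with `item_id`, optional `modifier_ids`, and optional `quantity`; those field names and the `first_divergence` helper are illustrative, not a fixed schema.

```python
def normalize(cart_lines):
    """Order-insensitive view of a cart: sorted (item, modifiers, quantity) tuples."""
    return sorted(
        (
            line["item_id"],
            tuple(sorted(line.get("modifier_ids", []))),
            line.get("quantity", 1),
        )
        for line in cart_lines
    )


def first_divergence(expected_by_turn, actual_by_turn):
    """Return details for the earliest turn where the cart drifted, or None."""
    for turn in sorted(expected_by_turn):
        if turn not in actual_by_turn:
            return {"turn": turn, "reason": "no cart snapshot captured"}
        expected = normalize(expected_by_turn[turn])
        actual = normalize(actual_by_turn[turn])
        if expected != actual:
            return {"turn": turn, "expected": expected, "actual": actual}
    return None
```

In the table above, a run that leaves both sandwiches in the cart would be reported at turn 4, before the readback ever happens.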
Evidence Envelope
{
  "run_id": "drive_thru_run_2026_06_20_0042",
  "store_fixture": "store_fixture_midwest_03",
  "menu_version": "qsr_menu_2026_06_20_lunch",
  "scenario_id": "qsr_correction_024",
  "cart_snapshots": [
    {
      "turn": 1,
      "expected_item_ids": ["combo_spicy_chicken"],
      "actual_item_ids": ["combo_spicy_chicken"],
      "expected_modifier_ids": [],
      "actual_modifier_ids": []
    },
    {
      "turn": 4,
      "expected_item_ids": ["combo_grilled_chicken"],
      "actual_item_ids": ["combo_grilled_chicken"],
      "forbidden_item_ids_observed": []
    }
  ],
  "final_order": {
    "pos_mode": "test_store",
    "pos_order_count_for_run": 1,
    "price_match": true,
    "readback_match": true,
    "cleanup_status": "verified"
  }
}
The vendor, test runner, or reviewer does not need private customer data. It needs enough structure to know whether the order state stayed honest.
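A reviewer or CI step could turn the envelope above into a pass/fail verdict with something like the following Python sketch. It assumes the envelope has already been parsed (for example with `json.load`) and uses only the field names shown above; `envelope_failures` itself is a hypothetical helper.

```python
def envelope_failures(envelope):
    """Return human-readable reasons an evidence envelope should not pass."""
    failures = []
    for snap in envelope.get("cart_snapshots", []):
        turn = snap.get("turn")
        for kind in ("item_ids", "modifier_ids"):
            expected = snap.get(f"expected_{kind}")
            if expected is None:
                continue  # this snapshot did not assert on this field
            actual = snap.get(f"actual_{kind}", [])
            if sorted(expected) != sorted(actual):
                failures.append(f"turn {turn}: {kind} expected {expected}, got {actual}")
        if snap.get("forbidden_item_ids_observed"):
            failures.append(
                f"turn {turn}: forbidden items {snap['forbidden_item_ids_observed']}"
            )
    final = envelope.get("final_order", {})
    if final.get("pos_order_count_for_run") != 1:
        failures.append("pos_order_count_for_run is not 1")
    for flag in ("price_match", "readback_match"):
        if final.get(flag) is not True:
            failures.append(f"{flag} is not true")
    if final.get("cleanup_status") != "verified":
        failures.append("cleanup_status is not verified")
    return failures
```

Treating a missing flag as a failure is a deliberate choice here: an envelope that never recorded `cleanup_status` should not pass by omission.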
Prove POS And Kitchen State Without Touching Production
Use the same dependency-mode thinking from voice agent sandbox testing, but make the order boundary explicit.
| Dependency mode | Use it for | Good signal | Release risk |
|---|---|---|---|
| Mocked menu and cart | Fast CI, negative cases, prompt changes, schema checks | Agent selects the right menu IDs and cart operations | Can miss POS auth, menu drift, and provider validation |
| Menu snapshot plus POS sandbox | Integration checks with fixture stores and fake orders | Final order validates against a realistic menu and POS contract | Requires fixture hygiene and cleanup |
| POS test store | Pre-release order injection and kitchen-routing checks | The real order path accepts the payload | Can pollute test dashboards if run IDs and cleanup are weak |
| Live scoped path | Production-only routing, store availability, or provider limits | Real path still works under release-owner controls | Can create real orders if allowlists are wrong |
For most teams, CI should never create a real kitchen order or payment authorization. Use mocks for speed, sandbox and test-store paths for confidence, and live scoped checks only with explicit release-owner approval.
When the voice agent does write downstream, assert the durable state:
- exactly one order for the run ID
- expected item IDs, modifier IDs, prices, tax, store, and fulfillment type
- no duplicate cart lines from retries
- no unavailable items accepted
- no production customer record touched
- cleanup verified after pass and fail
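A few of these durable-state assertions can be sketched in Python as follows. The sketch assumes the test can query POS orders as dicts that carry a `run_id` tag and a `lines` list; that shape and the `durable_state_failures` helper are assumptions, and the price, tax, customer-record, and cleanup checks are left out for brevity.

```python
from collections import Counter


def line_key(line):
    """Identify a cart line by item and its modifier set, ignoring modifier order."""
    return (line["item_id"], tuple(sorted(line.get("modifier_ids", []))))


def durable_state_failures(pos_orders, run_id, expected_lines, unavailable_item_ids=frozenset()):
    """Check order count, line contents, and availability for one test run."""
    tagged = [order for order in pos_orders if order.get("run_id") == run_id]
    if len(tagged) != 1:
        return [f"expected exactly 1 order for {run_id}, found {len(tagged)}"]
    lines = tagged[0]["lines"]
    actual = Counter(line_key(line) for line in lines)
    expected = Counter(line_key(line) for line in expected_lines)
    failures = []
    # Counter subtraction keeps only positive counts, so retries show up as extras.
    for key, count in (actual - expected).items():
        failures.append(f"unexpected or duplicated line x{count}: {key}")
    for key, count in (expected - actual).items():
        failures.append(f"missing line x{count}: {key}")
    for line in lines:
        if line["item_id"] in unavailable_item_ids:
            failures.append(f"unavailable item accepted: {line['item_id']}")
    return failures
```

Comparing with `Counter` rather than a set matters: a retried `add_item` produces two identical lines, which a set comparison would hide.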
The failed production call regression runbook is the right place to promote any escaped order bug into this matrix.
Cover The Edge Cases That Actually Break QSR Agents
Do not try to generate every menu combination on day one. Start with the cases that change the order, safety risk, or operations.
| Edge case | Test example | Blocking? |
|---|---|---|
| Allergy or dietary restriction | "I'm allergic to sesame. What can I get?" | Yes |
| Modifier removal | "No onions, no mayo" | Yes |
| Default modifier behavior | Item usually includes cheese; caller says no cheese | Yes |
| Combo substitution | Swap drink, side, size, or protein | Yes |
| Unavailable item | Breakfast item during lunch menu | Yes |
| Special request | "Sauce on the side" | Usually |
| Upsell accepted | Agent suggests fries; caller accepts | Usually |
| Upsell rejected | Agent suggests dessert; caller says no | Usually |
| Quantity correction | "Actually make that two" | Yes |
| Remove item | "Take off the nuggets" | Yes |
| Multiple speakers | Passenger adds a drink | Scheduled |
| Background noise | Engine, wind, speaker distortion, store noise | Scheduled and pre-release |
| Drive-off event | Customer leaves before confirmation | Pre-release |
| Crew interjection | Staff takes over or corrects the agent | Pre-release |
| Rush-hour load | 50+ concurrent ordering calls | Scheduled and pre-release |
Hamming's public Lilac Labs customer spotlight is a useful proof point here: the cases that mattered went beyond "normal orders." They included dietary restrictions, allergies, and modifications, with enough automated coverage to replace hours of manual retesting.
Decide What Belongs In CI
Keep the blocking suite small enough that engineers will tolerate it.
| Gate | Run when | Recommended size | Blocks merge? |
|---|---|---|---|
| Menu schema checks | Menu parser, prompt, tool schema, or item mapping changes | 10-25 cases | Yes |
| Cart mutation tests | Prompt, orchestration, tool-call, or state changes | 8-20 cases | Yes |
| POS sandbox tests | Order creation, price, tax, fulfillment, or payment handoff changes | 3-10 fixture orders | Yes for critical flows |
| Phone-path drive-thru tests | ASR, telephony, interruption, noise, or provider changes | 3-8 calls | Pre-release |
| Load and rush-hour tests | Model, provider, queue, or infrastructure changes | 50-500 synthetic calls | Scheduled or pre-release |
| Production sampling | Continuous monitoring | 1-5% of eligible calls | No, alert on drift |
If a failure can create a wrong order, allergy miss, payment problem, or kitchen workflow issue, keep at least one blocking test. Put long-tail combinations in scheduled coverage.
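The "at least one blocking test" rule can be enforced mechanically. This Python sketch assumes each scenario is tagged with a `gate` and a list of `risks`; the tag names and the `uncovered_blocking_risks` helper are hypothetical and would need to match whatever labels your suite already uses.

```python
# Risk categories taken from the sentence above; the exact tag spellings are assumed.
BLOCKING_RISKS = {"wrong_order", "allergy", "payment", "kitchen_workflow"}


def uncovered_blocking_risks(scenarios):
    """Return high-risk categories that no blocking scenario currently covers."""
    covered = set()
    for scenario in scenarios:
        if scenario["gate"] == "blocking":
            covered |= set(scenario["risks"])
    return sorted(BLOCKING_RISKS - covered)


suite = [
    {"id": "qsr_allergy_001", "gate": "blocking", "risks": ["allergy"]},
    {"id": "qsr_payment_009", "gate": "scheduled", "risks": ["payment", "wrong_order"]},
    {"id": "qsr_combo_modifier_017", "gate": "blocking", "risks": ["wrong_order"]},
]
print(uncovered_blocking_risks(suite))  # ['kitchen_workflow', 'payment']
```

A check like this can run as its own CI step, so moving a scenario from blocking to scheduled cannot silently leave a high-risk category without a blocking test.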
Launch Checklist
Before a drive-thru ordering agent sees real traffic, confirm these are true:
- The test suite names the menu version, store fixture, and POS dependency mode.
- Each critical menu item has at least one add, remove, change, and unavailable-item case.
- Modifiers are asserted by canonical ID, not only transcript words.
- Cart state is checked after every mutation, not only at the end.
- Readback is compared against cart state and price.
- POS writes use sandbox, test-store, or explicit live-scoped controls.
- Duplicate order prevention is tested with retries.
- Cleanup runs after pass and fail.
- Allergy and safety workflows are blocking.
- Background-noise and rush-hour tests run before launch.
- Production failures can be promoted into regression tests.
This is more work than a demo script. It is also the line between "the agent sounded right" and "the customer got the food they ordered."
What This Template Cannot Prove
This template proves that the agent followed the menu, cart, and POS expectations you loaded into the test. It does not prove every store configuration is correct.
Three limitations matter:
| Limitation | Why it matters | Practical response |
|---|---|---|
| Menu drift | Store menus, prices, item availability, and modifiers change | Refresh menu snapshots and compare versions before scheduled runs |
| Sandbox drift | POS sandbox behavior may differ from production routing, auth, or kitchen systems | Keep a small pre-release live-scoped check with owner approval |
| Combination explosion | Every modifier and combo permutation cannot block CI | Use risk-based blocking tests and scheduled coverage for long-tail combinations |
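For the menu-drift row, comparing snapshots before a scheduled run can be as simple as the Python sketch below. It assumes each snapshot maps `item_id` to a dict with `price_cents`, `available`, and `modifier_ids`; both that shape and the `menu_drift` helper are illustrative.

```python
TRACKED_FIELDS = ("price_cents", "available", "modifier_ids")


def _comparable(value):
    """Compare modifier lists without regard to order."""
    return sorted(value) if isinstance(value, list) else value


def menu_drift(old, new):
    """Summarize items added, removed, or changed between two menu snapshots."""
    changed = {}
    for item_id in old.keys() & new.keys():
        fields = [
            name
            for name in TRACKED_FIELDS
            if _comparable(old[item_id].get(name)) != _comparable(new[item_id].get(name))
        ]
        if fields:
            changed[item_id] = fields
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": changed,
    }
```

If the summary is non-empty, the expected cart state in affected scenarios is suspect and should be reviewed before the run is trusted.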
The practical win is narrow and worth it: stop letting a polished transcript hide a wrong cart.
Drive-Thru Voice Agent Testing FAQ
How do you test a drive-thru voice agent with menu and cart state?
Create a menu-cart-POS test matrix that maps each spoken order to the menu snapshot, expected item IDs, modifiers, prices, cart mutations, POS write, readback, and cleanup evidence. For a first pass, cover 10-25 menu schema cases and 3-10 fixture orders before adding load. The test should fail when the transcript sounds correct but the cart snapshots or final order state are wrong.
What should go in a restaurant voice agent order test matrix?
Include menu version, store or franchise configuration, caller phrase, expected item and modifier IDs, expected cart state after each turn, forbidden cart state, expected price, POS dependency mode, final order evidence, and cleanup owner. Hamming's template also saves run ID, menu version, cart snapshots, POS response, and cleanup_status so QA can prove the order was not left behind.
How do you test menu modifiers and substitutions?
Use fixtures for default modifiers, removed modifiers, added modifiers, substitutions, combo changes, allergies, and unavailable items. Assert the canonical menu IDs and modifier quantities, not just the words in the transcript. Keep a small blocking set for high-risk modifiers, then schedule long-tail menu combinations outside every pull request.
How do you test cart corrections in a voice ordering agent?
Run multi-turn cases where the caller adds an item, changes size, removes a modifier, replaces one item, accepts or rejects an upsell, and asks for a readback. Check the cart state after every mutation so stale items and duplicate modifiers are caught early. The useful artifact is the before-and-after cart snapshot, not just a final transcript score.
How do you test POS integration without creating real orders?
Use mocked tools for fast CI, POS sandbox or test stores for integration checks, and narrowly allowlisted live checks only when a production-only path cannot be represented elsewhere. Save the run ID, fixture order ID, POS response, final state, and cleanup result. Any live-scoped check should have an owner, rollback path, and explicit cleanup_status.
Should drive-thru voice agent tests block CI?
Block CI on high-risk menu, modifier, cart mutation, price, allergy, payment, and POS write cases. Run long-tail menu combinations, load tests, and phone-path checks on a schedule or before release when they are too slow for every pull request. A practical load pass starts with 50-500 synthetic calls, then scales only after cart correctness stays stable.
What edge cases matter most for QSR voice agents?
Prioritize unavailable items, modifier removal, combo substitutions, allergies, special requests, multiple speakers, background noise, drive-off events, crew interjections, rush-hour latency, payment fallback, duplicate order prevention, and store-specific menu differences. These cases catch cart drift: the caller and transcript look aligned while the durable order state is wrong.
What evidence should each drive-thru order test save?
Save run ID, menu version, store fixture, transcript, tool trace, cart snapshots after each turn, expected versus actual item and modifier IDs, price checks, POS response, final order state, readback outcome, and cleanup status. The minimum evidence packet should let a reviewer replay the order, inspect the cart mutation, and prove no test order remains open.

