Drive-Thru Voice Agent Order Testing Template

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 4M+ voice agent calls to find where they break.

June 20, 2026 | Updated June 20, 2026 | 16 min read

Drive-thru voice agent order testing proves that a restaurant voice agent can take a messy spoken order and end up with the same order in the cart, the POS, and the readback.

If the agent only answers store hours or loyalty-program FAQs, this template is more than you need. A few conversation tests are fine.

If it takes drive-thru, phone, kiosk, or catering orders, transcript-only testing is where teams get burned. The call can sound normal while the cart contains the wrong modifier, the combo lost its drink, the POS accepted an unavailable item, or the allergy note never made it to the kitchen.

That failure has a name I would put on the wall of every restaurant voice AI test plan: cart drift. The caller and agent seem aligned, but the durable order state is no longer the order the caller asked for.

TL;DR: Build drive-thru voice agent tests as a menu-cart-POS matrix:

  • Menu state: store, menu version, item IDs, modifier groups, prices, availability, and schedule.
  • Caller order: the exact spoken phrase, including substitutions, allergies, corrections, and background noise.
  • Cart snapshots: expected cart state after every add, remove, change, upsell, and readback.
  • POS mode: mock, sandbox, test store, or narrow live-scoped path.
  • Evidence: transcript, tool trace, item and modifier IDs, price check, final POS state, and cleanup result.

A passing transcript is useful. It is not proof of an accurate order. The cart and final order state have to agree.

Methodology Note: This template is based on Hamming's analysis of 4M+ production voice agent calls involving order capture, menu state, tool calls, POS writes, background noise, and QSR-style workflow failures across 10K+ voice agents (2025-2026). We've tested agents built on LiveKit, Pipecat, ElevenLabs, Retell, Vapi, and custom-built solutions.

Use it as a launch-safety template for drive-thru, phone ordering, kiosk, and catering voice agents that touch real menu or POS state.

Last Updated: June 2026

What Drive-Thru Order Testing Should Prove

Drive-thru order testing checks whether the voice agent can turn the way people actually order food into the exact durable order the restaurant expects.

| Layer | What the test proves | Failure example |
|---|---|---|
| Menu understanding | The agent maps speech to canonical menu items, modifier groups, sizes, prices, and availability | "No cheese" is captured as a note instead of a modifier removal |
| Cart mutation | The cart changes correctly after each turn | Caller says "make that a large" and both the medium and large remain in the cart |
| Readback | The agent confirms the order in language the customer can correct | Agent reads back the wrong drink or skips the allergy note |
| POS write | The final order sent downstream matches the confirmed cart | POS receives a stale item ID from a previous menu version |
| Operational safety | Tests avoid real orders unless the release owner explicitly allows a scoped live check | CI creates a kitchen order or payment authorization |

Drive-thru order test: a voice agent test that loads a menu snapshot, runs a spoken ordering scenario, verifies cart state after every mutation, checks final POS or order state, and saves cleanup evidence.

The hard part is not "Can the agent understand fries?" It is whether it keeps the same order intact after the caller changes their mind, a passenger adds a drink, the lunch menu is missing an item, and the speaker sounds like it was installed in 2008.

Build the Menu-Cart-POS Matrix

Start with a matrix. Do not start by dialing the agent 200 times.

| Field | What to capture | Sample |
|---|---|---|
| Scenario ID | Stable identifier for the test | qsr_combo_modifier_017 |
| Store fixture | Store, franchise, daypart, menu version, tax region | store_fixture_midwest_03, breakfast disabled |
| Caller phrase | What the customer says | "Can I get a cheeseburger combo, no pickles, Coke Zero, and make the fries large?" |
| Menu expectation | Canonical item IDs, modifier IDs, availability, price | burger item, no-pickle modifier, combo drink slot, large fry upcharge |
| Cart checkpoints | Expected cart after each add, remove, or change | 1 combo, 0 pickles, Coke Zero, large fries |
| Forbidden state | What must not happen | duplicate burger, default drink, medium fries, stale price |
| POS mode | Mock, sandbox, test store, or live scoped | POS sandbox with fixture order tag |
| Evidence | What proves success | transcript, tool trace, cart snapshots, POS response, cleanup query |
| Gate | Blocking, scheduled, pre-release, or manual | blocking for allergy, payment, and POS writes |

Keep this matrix near your tests-as-code definitions. A teammate should be able to review the expected order without replaying the call.

Copyable Matrix Starter

suite: drive_thru_order_accuracy
owner: voice-platform
menu_version: qsr_menu_2026_06_20_lunch

scenarios:
  - id: qsr_combo_modifier_017
    store_fixture: store_fixture_midwest_03
    dependency_mode:
      menu: snapshot
      cart: sandbox
      pos: test_store
    caller_goal: "Order a cheeseburger combo with no pickles, Coke Zero, and large fries"
    expected_cart_checkpoints:
      - turn: "add cheeseburger combo"
        item_ids: ["combo_cheeseburger"]
        modifier_ids: []
        subtotal_cents: 849
      - turn: "no pickles"
        item_ids: ["combo_cheeseburger"]
        modifier_ids: ["remove_pickles"]
      - turn: "Coke Zero and large fries"
        item_ids: ["combo_cheeseburger"]
        # Checkpoints are cumulative, so the earlier pickle removal must persist.
        modifier_ids: ["remove_pickles", "drink_coke_zero", "fry_large"]
        subtotal_cents: 999
    forbidden_states:
      - duplicate_combo
      - default_drink
      - medium_fry_after_large_upgrade
      - kitchen_note_instead_of_modifier_id
    final_assertions:
      readback_contains:
        - "cheeseburger combo"
        - "no pickles"
        - "Coke Zero"
        - "large fries"
      pos_order_count_for_run: 1
      cleanup_status: verified

That is more structured than a transcript score. It is also what lets you catch the order bug before a customer receives the wrong bag.
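
As a minimal sketch of how one checkpoint from the starter could be enforced, the Python below compares an observed cart against the expected item IDs, modifier IDs, and optional subtotal. The `Checkpoint` class, the `check_cart` helper, and the cart dictionary shape (including the `kitchen_notes` key) are assumptions made for illustration, not part of any specific test runner.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Checkpoint:
    """Expected cart state after one caller turn, mirroring the matrix fields."""
    turn: str
    item_ids: list
    modifier_ids: list
    subtotal_cents: Optional[int] = None  # None: price is not asserted at this turn


def check_cart(checkpoint: Checkpoint, cart: dict) -> list:
    """Return a list of mismatches; an empty list means the cart matches."""
    problems = []
    # Sorted-list comparison ignores ordering but still catches duplicates.
    for key in ("item_ids", "modifier_ids"):
        expected = sorted(getattr(checkpoint, key))
        actual = sorted(cart.get(key, []))
        if expected != actual:
            problems.append(f"{key}: expected {expected}, got {actual}")
    if checkpoint.subtotal_cents is not None:
        if cart.get("subtotal_cents") != checkpoint.subtotal_cents:
            problems.append(
                f"subtotal_cents: expected {checkpoint.subtotal_cents}, "
                f"got {cart.get('subtotal_cents')}"
            )
    # A free-text note standing in for a missing modifier ID is a forbidden state.
    missing = set(checkpoint.modifier_ids) - set(cart.get("modifier_ids", []))
    if missing and cart.get("kitchen_notes"):
        problems.append("forbidden: kitchen_note_instead_of_modifier_id")
    return problems


checkpoint = Checkpoint("no pickles", ["combo_cheeseburger"], ["remove_pickles"])
# Hypothetical drifted cart: the agent wrote a note instead of the modifier.
drifted = {
    "item_ids": ["combo_cheeseburger"],
    "modifier_ids": [],
    "kitchen_notes": ["no pickles"],
}
for problem in check_cart(checkpoint, drifted):
    print(problem)
```

Returning a list of problems rather than a boolean keeps the failure reviewable: the report names the field that drifted, so nobody has to replay the call to find it.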

Test Menu Understanding Before Cart Mutation

Menu understanding has to be tested before cart mutation because the wrong canonical item poisons every downstream check.

Public restaurant ordering APIs expose why this gets tricky. Google Cloud's Food Ordering AI Agent menu integration docs separate menu structure, modifiers, schedules, prices, and availability before ordering works. Toast's modifier docs call out behavior around default modifiers, pre-modifiers, special requests, and quantity mismatches. Lightspeed's online ordering API docs similarly distinguish menu management from order creation and confirmation.

Your tests should do the same.

| Spoken request | Menu-state assertion | Why it matters |
|---|---|---|
| "No pickles" | Uses the remove-pickles modifier or explicit default-modifier removal | Kitchen notes may not remove the ingredient |
| "Extra sauce" | Adds allowed modifier quantity within menu constraints | Free-text notes may bypass price or kitchen routing |
| "Make it a large" | Changes the size or combo slot, not a second item | Duplicate items hide behind a correct readback |
| "I'll take the breakfast sandwich" at 2 PM | Marks item unavailable and offers valid alternatives | Menu schedules change the correct answer |
| "I'm allergic to sesame" | Adds the safe allergy workflow or escalation path | Allergy handling should not be treated as a casual note |

Menu snapshot rule: every order test should name the menu version it loaded. If the menu can change without changing the test expectation, the test is not reproducible.

This is where QSR tests differ from generic tool-call tests. A tool call named add_item is not enough. The test has to prove the item and modifier identifiers match the restaurant's active menu model.
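
One way to make that proof concrete is to validate every item and modifier ID the agent emits against the loaded menu snapshot, including the item's schedule. The sketch below assumes a simplified snapshot shape; `MENU_SNAPSHOT`, `validate_line`, the `breakfast_sandwich` and `remove_cheese` IDs, and the availability hours are all hypothetical, and real POS menu models carry far more structure.

```python
from datetime import time

# Hypothetical menu snapshot; field names and hours are illustrative only.
MENU_SNAPSHOT = {
    "version": "qsr_menu_2026_06_20_lunch",
    "items": {
        "combo_cheeseburger": {
            "modifier_ids": {"remove_pickles", "drink_coke_zero", "fry_large"},
            "available": (time(10, 30), time(23, 0)),
        },
        "breakfast_sandwich": {
            "modifier_ids": {"remove_cheese"},
            "available": (time(6, 0), time(10, 30)),
        },
    },
}


def validate_line(menu: dict, item_id: str, modifier_ids: list, ordered_at: time) -> list:
    """Return validation errors for one cart line against the menu snapshot."""
    item = menu["items"].get(item_id)
    if item is None:
        return [f"unknown item id: {item_id}"]
    errors = []
    start, end = item["available"]
    # The schedule check is what makes "breakfast at 2 PM" fail.
    if not (start <= ordered_at < end):
        errors.append(f"{item_id} unavailable at {ordered_at:%H:%M}")
    for modifier_id in modifier_ids:
        if modifier_id not in item["modifier_ids"]:
            errors.append(f"{modifier_id} is not a valid modifier for {item_id}")
    return errors


print(validate_line(MENU_SNAPSHOT, "breakfast_sandwich", [], time(14, 0)))
print(validate_line(MENU_SNAPSHOT, "combo_cheeseburger", ["remove_pickles"], time(12, 0)))
```

Because the snapshot carries its own version string, a failing run can report which menu it was validated against.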

Test Cart State Across Corrections

Most order bugs happen after the first item. That is why a single happy-path burger order is such weak evidence.

The caller changes their mind. A passenger interrupts. The agent suggests an upsell. The customer says "actually make that two." The agent needs to update the cart without keeping stale state.

Use turn-by-turn cart checkpoints.

| Turn | Caller says | Expected cart state | Common failure |
|---|---|---|---|
| 1 | "I want a spicy chicken sandwich combo" | 1 spicy chicken combo, default size, default side pending | Adds sandwich only, not combo |
| 2 | "Make the fries large" | Same combo, large fries, upcharge applied | Adds separate large fries |
| 3 | "No mayo" | Same combo, remove-mayo modifier | Stores note but leaves default mayo |
| 4 | "Actually make that grilled chicken" | Replaces spicy chicken with grilled chicken, keeps compatible modifiers | Leaves both sandwiches |
| 5 | "Add a kids meal too" | Combo plus kids meal | Replaces cart instead of appending |
| 6 | "Read it back" | Spoken readback matches cart and price | Reads back transcript memory, not cart state |

The test should fail as soon as cart state diverges. Waiting until the final readback turns a small state bug into a scavenger hunt across transcript, tool trace, and POS payload.
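
A fail-fast check can be sketched in a few lines: walk the turns in order and stop at the first cart snapshot that does not match. The `CartDrift` exception, the `assert_no_drift` helper, and the `kids_meal` ID are hypothetical names for illustration. The turn-4 data reproduces the "leaves both sandwiches" failure.

```python
class CartDrift(AssertionError):
    """Raised at the first turn where the cart stops matching the expectation."""


def assert_no_drift(expected_by_turn: dict, actual_by_turn: dict) -> None:
    """Walk turns in order and stop at the first divergent cart snapshot."""
    for turn in sorted(expected_by_turn):
        expected = sorted(expected_by_turn[turn])
        actual = sorted(actual_by_turn.get(turn, []))
        if expected != actual:
            raise CartDrift(f"turn {turn}: expected {expected}, got {actual}")


expected = {
    1: ["combo_spicy_chicken"],
    4: ["combo_grilled_chicken"],
    5: ["combo_grilled_chicken", "kids_meal"],
}
actual = {
    1: ["combo_spicy_chicken"],
    # Drift starts here: the replaced sandwich was never removed.
    4: ["combo_spicy_chicken", "combo_grilled_chicken"],
    5: ["combo_spicy_chicken", "combo_grilled_chicken", "kids_meal"],
}

first_failure = None
try:
    assert_no_drift(expected, actual)
except CartDrift as err:
    first_failure = str(err)
    print(first_failure)
```

The error names turn 4 rather than turn 5, so the reviewer starts at the mutation that introduced the stale item instead of the later snapshot that inherited it.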

Evidence Envelope

{
  "run_id": "drive_thru_run_2026_06_20_0042",
  "store_fixture": "store_fixture_midwest_03",
  "menu_version": "qsr_menu_2026_06_20_lunch",
  "scenario_id": "qsr_correction_024",
  "cart_snapshots": [
    {
      "turn": 1,
      "expected_item_ids": ["combo_spicy_chicken"],
      "actual_item_ids": ["combo_spicy_chicken"],
      "expected_modifier_ids": [],
      "actual_modifier_ids": []
    },
    {
      "turn": 4,
      "expected_item_ids": ["combo_grilled_chicken"],
      "actual_item_ids": ["combo_grilled_chicken"],
      "forbidden_item_ids_observed": []
    }
  ],
  "final_order": {
    "pos_mode": "test_store",
    "pos_order_count_for_run": 1,
    "price_match": true,
    "readback_match": true,
    "cleanup_status": "verified"
  }
}

The vendor, test runner, or reviewer does not need private customer data. It needs enough structure to know whether the order state stayed honest.
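
To show how a reviewer or CI step might consume such an envelope, here is a small sketch that turns it into a list of findings. The envelope is abbreviated from the sample above; `review_envelope` and its finding strings are assumptions, not a defined report format.

```python
import json

# Abbreviated copy of the sample envelope; field names follow the example above.
ENVELOPE = json.loads("""
{
  "run_id": "drive_thru_run_2026_06_20_0042",
  "cart_snapshots": [
    {"turn": 1,
     "expected_item_ids": ["combo_spicy_chicken"],
     "actual_item_ids": ["combo_spicy_chicken"],
     "expected_modifier_ids": [],
     "actual_modifier_ids": []},
    {"turn": 4,
     "expected_item_ids": ["combo_grilled_chicken"],
     "actual_item_ids": ["combo_grilled_chicken"],
     "forbidden_item_ids_observed": []}
  ],
  "final_order": {"pos_order_count_for_run": 1, "price_match": true,
                  "readback_match": true, "cleanup_status": "verified"}
}
""")


def review_envelope(envelope: dict) -> list:
    """Return findings for one run; an empty list means the evidence is clean."""
    findings = []
    for snap in envelope["cart_snapshots"]:
        for kind in ("item_ids", "modifier_ids"):
            expected = snap.get(f"expected_{kind}")
            actual = snap.get(f"actual_{kind}")
            # Only compare fields the snapshot actually recorded.
            if expected is not None and sorted(expected) != sorted(actual or []):
                findings.append(f"turn {snap['turn']}: {kind} mismatch")
        if snap.get("forbidden_item_ids_observed"):
            findings.append(f"turn {snap['turn']}: forbidden items observed")
    final = envelope["final_order"]
    if final.get("pos_order_count_for_run") != 1:
        findings.append("final_order: expected exactly one POS order for the run")
    for flag in ("price_match", "readback_match"):
        if final.get(flag) is not True:
            findings.append(f"final_order: {flag} is not true")
    if final.get("cleanup_status") != "verified":
        findings.append("final_order: cleanup not verified")
    return findings


print(review_envelope(ENVELOPE))
```

Nothing in the check reads the transcript or any customer field. It needs only the IDs, flags, and counts.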

Prove POS And Kitchen State Without Touching Production

Use the same dependency-mode thinking from voice agent sandbox testing, but make the order boundary explicit.

| Dependency mode | Use it for | Good signal | Release risk |
|---|---|---|---|
| Mocked menu and cart | Fast CI, negative cases, prompt changes, schema checks | Agent selects the right menu IDs and cart operations | Can miss POS auth, menu drift, and provider validation |
| Menu snapshot plus POS sandbox | Integration checks with fixture stores and fake orders | Final order validates against a realistic menu and POS contract | Requires fixture hygiene and cleanup |
| POS test store | Pre-release order injection and kitchen-routing checks | The real order path accepts the payload | Can pollute test dashboards if run IDs and cleanup are weak |
| Live scoped path | Production-only routing, store availability, or provider limits | Real path still works under release-owner controls | Can create real orders if allowlists are wrong |

For most teams, CI should never create a real kitchen order or payment authorization. Use mocks for speed, sandbox and test-store paths for confidence, and live scoped checks only with explicit release-owner approval.

When the voice agent does write downstream, assert the durable state:

  • exactly one order for the run ID
  • expected item IDs, modifier IDs, prices, tax, store, and fulfillment type
  • no duplicate cart lines from retries
  • no unavailable items accepted
  • no production customer record touched
  • cleanup verified after pass and fail
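
The assertions above could be sketched as a single helper that inspects the orders returned by a sandbox or test-store query. The order dictionary shape (`run_id`, `lines`, `customer_id`, `cleanup_status`) and the `fixture_` customer prefix are assumptions for illustration; a real POS API will expose different fields. Price, tax, store, and fulfillment checks are left out to keep the sketch short.

```python
from collections import Counter


def line_key(line: dict) -> tuple:
    """Normalize a cart line so modifier ordering does not matter."""
    return (line["item_id"], tuple(sorted(line["modifier_ids"])))


def assert_durable_state(orders: list, run_id: str, expected_lines: list,
                         unavailable_ids: set) -> None:
    """Check the final POS state for one test run (hypothetical order shape)."""
    mine = [order for order in orders if order["run_id"] == run_id]
    assert len(mine) == 1, f"expected exactly one order for {run_id}, found {len(mine)}"
    order = mine[0]

    # Counter comparison catches duplicate cart lines created by retries.
    actual = Counter(line_key(line) for line in order["lines"])
    expected = Counter(line_key(line) for line in expected_lines)
    assert actual == expected, f"line mismatch: {dict(actual)} != {dict(expected)}"

    accepted = {line["item_id"] for line in order["lines"]} & unavailable_ids
    assert not accepted, f"unavailable items accepted: {sorted(accepted)}"

    # Assumption: fixture customers are identifiable by an ID prefix.
    assert order["customer_id"].startswith("fixture_"), "production customer record touched"
    assert order["cleanup_status"] == "verified", "cleanup not verified"


EXPECTED_LINES = [{"item_id": "combo_cheeseburger", "modifier_ids": ["remove_pickles"]}]
ORDERS = [{
    "run_id": "run_0042",
    "customer_id": "fixture_customer_01",
    "cleanup_status": "verified",
    "lines": [{"item_id": "combo_cheeseburger", "modifier_ids": ["remove_pickles"]}],
}]
assert_durable_state(ORDERS, "run_0042", EXPECTED_LINES, {"breakfast_sandwich"})
```

Filtering by run ID first is what makes the "exactly one order" check meaningful when several test runs share a sandbox.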

The failed production call regression runbook is the right place to promote any escaped order bug into this matrix.

Cover The Edge Cases That Actually Break QSR Agents

Do not try to generate every menu combination on day one. Start with the cases that change the order, safety risk, or operations.

| Edge case | Test example | Blocking? |
|---|---|---|
| Allergy or dietary restriction | "I'm allergic to sesame. What can I get?" | Yes |
| Modifier removal | "No onions, no mayo" | Yes |
| Default modifier behavior | Item usually includes cheese; caller says no cheese | Yes |
| Combo substitution | Swap drink, side, size, or protein | Yes |
| Unavailable item | Breakfast item during lunch menu | Yes |
| Special request | "Sauce on the side" | Usually |
| Upsell accepted | Agent suggests fries; caller accepts | Usually |
| Upsell rejected | Agent suggests dessert; caller says no | Usually |
| Quantity correction | "Actually make that two" | Yes |
| Remove item | "Take off the nuggets" | Yes |
| Multiple speakers | Passenger adds a drink | Scheduled |
| Background noise | Engine, wind, speaker distortion, store noise | Scheduled and pre-release |
| Drive-off event | Customer leaves before confirmation | Pre-release |
| Crew interjection | Staff takes over or corrects the agent | Pre-release |
| Rush-hour load | 50+ concurrent ordering calls | Scheduled and pre-release |

Hamming's public Lilac Labs customer spotlight is a useful proof point here: the hard cases were not just "normal orders." They included dietary restrictions, allergies, modifications, and enough automated coverage to replace hours of manual retesting.

Decide What Belongs In CI

Keep the blocking suite small enough that engineers will tolerate it.

| Gate | Run when | Recommended size | Blocks merge? |
|---|---|---|---|
| Menu schema checks | Menu parser, prompt, tool schema, or item mapping changes | 10-25 cases | Yes |
| Cart mutation tests | Prompt, orchestration, tool-call, or state changes | 8-20 cases | Yes |
| POS sandbox tests | Order creation, price, tax, fulfillment, or payment handoff changes | 3-10 fixture orders | Yes for critical flows |
| Phone-path drive-thru tests | ASR, telephony, interruption, noise, or provider changes | 3-8 calls | Pre-release |
| Load and rush-hour tests | Model, provider, queue, or infrastructure changes | 50-500 synthetic calls | Scheduled or pre-release |
| Production sampling | Continuous monitoring | 1-5% of eligible calls | No, alert on drift |

If a failure can create a wrong order, allergy miss, payment problem, or kitchen workflow issue, keep at least one blocking test. Put long-tail combinations in scheduled coverage.
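
That rule is simple enough to encode next to the scenarios. A rough sketch, assuming each scenario carries a hypothetical `risks` list; the tag names and the `gate_for` helper are illustrative, not an established schema:

```python
# Risk tags derived from the rule above; the tag names themselves are assumptions.
BLOCKING_RISKS = {"wrong_order", "allergy_miss", "payment_problem", "kitchen_workflow"}


def gate_for(scenario: dict) -> str:
    """Return 'blocking' when a failure could cause a high-risk outcome."""
    if BLOCKING_RISKS & set(scenario.get("risks", [])):
        return "blocking"
    return "scheduled"


scenarios = [
    {"id": "qsr_combo_modifier_017", "risks": ["wrong_order"]},
    {"id": "qsr_long_tail_sauce_combo", "risks": ["long_tail_combination"]},
]
for scenario in scenarios:
    print(scenario["id"], gate_for(scenario))
```

Keeping the gate derived from risk tags, rather than assigned by hand per scenario, makes it harder for a high-risk case to slip into scheduled coverage unnoticed.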

Launch Checklist

Before a drive-thru ordering agent sees real traffic, confirm these are true:

  • The test suite names the menu version, store fixture, and POS dependency mode.
  • Each critical menu item has at least one add, remove, change, and unavailable-item case.
  • Modifiers are asserted by canonical ID, not only transcript words.
  • Cart state is checked after every mutation, not only at the end.
  • Readback is compared against cart state and price.
  • POS writes use sandbox, test-store, or explicit live-scoped controls.
  • Duplicate order prevention is tested with retries.
  • Cleanup runs after pass and fail.
  • Allergy and safety workflows are blocking.
  • Background-noise and rush-hour tests run before launch.
  • Production failures can be promoted into regression tests.

This is more work than a demo script. It is also the line between "the agent sounded right" and "the customer got the food they ordered."

What This Template Cannot Prove

This template proves that the agent followed the menu, cart, and POS expectations you loaded into the test. It does not prove every store configuration is correct.

Three limitations matter:

| Limitation | Why it matters | Practical response |
|---|---|---|
| Menu drift | Store menus, prices, item availability, and modifiers change | Refresh menu snapshots and compare versions before scheduled runs |
| Sandbox drift | POS sandbox behavior may differ from production routing, auth, or kitchen systems | Keep a small pre-release live-scoped check with owner approval |
| Combination explosion | Every modifier and combo permutation cannot block CI | Use risk-based blocking tests and scheduled coverage for long-tail combinations |
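
For the menu-drift response, comparing snapshot versions can be as simple as a keyed diff run before each scheduled suite. This sketch assumes snapshots are dictionaries keyed by item ID with `price_cents` and `modifier_ids` fields. That shape, the `diff_menus` helper, and the sample prices are hypothetical.

```python
def diff_menus(old: dict, new: dict) -> dict:
    """Compare two menu snapshots keyed by item ID (hypothetical snapshot shape)."""
    old_items, new_items = old["items"], new["items"]
    shared = old_items.keys() & new_items.keys()
    return {
        "removed": sorted(old_items.keys() - new_items.keys()),
        "added": sorted(new_items.keys() - old_items.keys()),
        "price_changed": sorted(
            item_id for item_id in shared
            if old_items[item_id]["price_cents"] != new_items[item_id]["price_cents"]
        ),
        "modifiers_changed": sorted(
            item_id for item_id in shared
            if set(old_items[item_id]["modifier_ids"])
            != set(new_items[item_id]["modifier_ids"])
        ),
    }


old_menu = {"items": {
    "combo_cheeseburger": {"price_cents": 849, "modifier_ids": ["remove_pickles", "fry_large"]},
    "breakfast_sandwich": {"price_cents": 499, "modifier_ids": []},
}}
new_menu = {"items": {
    "combo_cheeseburger": {"price_cents": 899, "modifier_ids": ["remove_pickles", "fry_large"]},
}}
print(diff_menus(old_menu, new_menu))
```

A non-empty diff is a signal to refresh fixtures and expected prices before trusting the next scheduled run, not a test failure in itself.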

The practical win is narrow and worth it: stop letting a polished transcript hide a wrong cart.

Drive-Thru Voice Agent Testing FAQ

How do you test a drive-thru voice agent with menu and cart state?

Create a menu-cart-POS test matrix that maps each spoken order to the menu snapshot, expected item IDs, modifiers, prices, cart mutations, POS write, readback, and cleanup evidence. For a first pass, cover 10-25 menu schema cases and 3-10 fixture orders before adding load. The test should fail when the transcript sounds correct but the cart snapshots or final order state are wrong.

What should go in a restaurant voice agent order test matrix?

Include menu version, store or franchise configuration, caller phrase, expected item and modifier IDs, expected cart state after each turn, forbidden cart state, expected price, POS dependency mode, final order evidence, and cleanup owner. Hamming's template also saves run ID, menu version, cart snapshots, POS response, and cleanup_status so QA can prove the order was not left behind.

How do you test menu modifiers and substitutions?

Use fixtures for default modifiers, removed modifiers, added modifiers, substitutions, combo changes, allergies, and unavailable items. Assert the canonical menu IDs and modifier quantities, not just the words in the transcript. Keep a small blocking set for high-risk modifiers, then schedule long-tail menu combinations outside every pull request.

How do you test cart corrections in a voice ordering agent?

Run multi-turn cases where the caller adds an item, changes size, removes a modifier, replaces one item, accepts or rejects an upsell, and asks for a readback. Check the cart state after every mutation so stale items and duplicate modifiers are caught early. The useful artifact is the before-and-after cart snapshot, not just a final transcript score.

How do you test POS integration without creating real orders?

Use mocked tools for fast CI, POS sandbox or test stores for integration checks, and narrowly allowlisted live checks only when a production-only path cannot be represented elsewhere. Save the run ID, fixture order ID, POS response, final state, and cleanup result. Any live-scoped check should have an owner, rollback path, and explicit cleanup_status.

Should drive-thru voice agent tests block CI?

Block CI on high-risk menu, modifier, cart mutation, price, allergy, payment, and POS write cases. Run long-tail menu combinations, load tests, and phone-path checks on a schedule or before release when they are too slow for every pull request. A practical load pass starts with 50-500 synthetic calls, then scales only after cart correctness stays stable.

What edge cases matter most for QSR voice agents?

Prioritize unavailable items, modifier removal, combo substitutions, allergies, special requests, multiple speakers, background noise, drive-off events, crew interjections, rush-hour latency, payment fallback, duplicate order prevention, and store-specific menu differences. These cases catch cart drift: the caller and transcript look aligned while the durable order state is wrong.

What evidence should each drive-thru order test save?

Save run ID, menu version, store fixture, transcript, tool trace, cart snapshots after each turn, expected versus actual item and modifier IDs, price checks, POS response, final order state, readback outcome, and cleanup status. The minimum evidence packet should let a reviewer replay the order, inspect the cart mutation, and prove no test order remains open.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on the dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”