Voice Agent Drift Detection: Monitor Model and Behavior Changes

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 1M+ voice agent calls to find where they break.

January 15, 2026 · 20 min read

Three months after launch, your voice agent's satisfaction scores are dropping. Customers complain the bot "doesn't understand them anymore." Call containment rates fall from 85% to 72%. Yet nothing in your infrastructure has changed—no deployments, no configuration updates, no code modifications.

This is drift: the silent killer of voice agent quality. While you weren't watching, Deepgram tweaked their acoustic model. OpenAI refined GPT-5.1's safety filters. ElevenLabs adjusted prosody algorithms. Each change was minor. Together, they've degraded your agent's performance by 15-20%.

At Hamming, we've analyzed thousands of production voice agents. The pattern is consistent: gradual degradation that compounds over 90 days until customers revolt.

Just launched last week? Focus on immediate bugs first. Drift is a month-3+ problem.

Using a fully-managed platform that handles model updates? Check if they monitor drift for you. Most don't.

This guide is for teams with production voice agents running 30+ days who need to maintain quality over time.

TL;DR: Detect drift using Hamming's Voice Agent Drift Detection Framework:

  • STT Drift — Word error rate, confidence scores, domain term accuracy changing over time
  • LLM Drift — Response quality, formatting, instruction following degrading
  • TTS Drift — Voice quality, prosody, naturalness scores shifting
  • Behavioral Drift — End-to-end metrics (containment, completion) trending down

Establish baselines at launch, monitor continuously, alert on deviation greater than 10 percent from baseline.

What Is Voice Agent Drift?

Voice agent drift is the gradual degradation of agent performance over time without any changes to your code or configuration. Unlike bugs that break things immediately, drift happens slowly—quality erodes week by week until customers notice something feels "off." This is why regression testing alone isn't enough—you need continuous drift monitoring.

The Silent Degradation, as we call it, happens because voice agents depend on external components that change without notice:

  1. Speech-to-Text (STT) providers update their acoustic and language models monthly
  2. Large Language Models (LLMs) receive continuous updates for safety and quality
  3. Text-to-Speech (TTS) engines refine voice models and prosody algorithms
  4. Data distributions shift as user behavior and vocabulary evolve

When I first encountered drift, I assumed model updates would be announced. They're not. Silent updates are the norm. OpenAI updates GPT-5.1 without notice. Deepgram tweaks their Nova model regularly. ElevenLabs refines their voice synthesis continuously. Each small change compounds into noticeable quality degradation.

The challenge isn't detecting catastrophic failures—those trigger immediate alerts. It's catching the 1 to 2 percent weekly degradation that adds up to 15 to 20 percent accuracy loss over three months. By the time customers complain, you've already lost trust.
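To see how small weekly losses compound into that 15 to 20 percent figure, here's the arithmetic as a quick Python check (the 1.5% weekly rate is illustrative):

```python
# Compounding effect of small weekly quality losses (illustrative rate).
weekly_loss = 0.015   # 1.5% relative degradation per week
weeks = 13            # roughly three months

remaining = (1 - weekly_loss) ** weeks
print(f"Accuracy retained after {weeks} weeks: {remaining:.1%}")  # ~82.2%
print(f"Cumulative degradation: {1 - remaining:.1%}")             # ~17.8%
```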

The Voice Agent Drift Detection Framework (4 Types)

Hamming's Voice Agent Drift Detection Framework identifies four distinct types of drift, each requiring different monitoring approaches and response strategies. Understanding these types helps you build targeted detection systems rather than hoping generic monitoring catches everything.

| Drift Type | What Changes | Key Indicators | Detection Method |
|---|---|---|---|
| STT Drift | Transcription accuracy | WER increases, confidence drops | Synthetic test audio |
| LLM Drift | Response quality | Format violations, hallucinations | Prompt consistency tests |
| TTS Drift | Voice characteristics | MOS score drops, prosody shifts | Voice quality metrics |
| Behavioral Drift | End-to-end performance | Containment falls, completion drops | Production metrics |

Each drift type requires specific baselines, monitoring frequencies, and response procedures. Let's examine each in detail.

Type 1: STT/ASR Drift

Speech-to-Text drift is often the first drift type to impact production quality. When transcription accuracy degrades, everything downstream suffers—your LLM receives incorrect input, leading to confused responses and frustrated users.

Common STT Drift Patterns

STT models drift for several reasons:

  • Acoustic model updates change how phonemes are recognized
  • Language model updates alter word prediction probabilities
  • Vocabulary additions introduce new words that conflict with existing terms
  • Regional accent tuning improves some accents while degrading others

The most insidious form is domain-specific term drift. Your industry jargon that transcribed perfectly at launch suddenly becomes generic words. "HIPAA compliance" becomes "hip compliance." "401(k) rollover" becomes "401 cay rollover." Small changes that break intent detection.

Monitoring STT Drift

Voice agent drift detection requires tracking these metrics continuously:

| Metric | Baseline | Warning Threshold | Critical Threshold | Direction |
|---|---|---|---|---|
| Word Error Rate | 8% | 8.4% (+5% relative) | 8.8% (+10% relative) | Higher is worse |
| Confidence Score | 92% | 87.4% (-5% relative) | 82.8% (-10% relative) | Lower is worse |
| Domain Term Accuracy | 95% | 90.2% (-5% relative) | 85.5% (-10% relative) | Lower is worse |

Synthetic Test Audio Strategy

Run daily synthetic tests with consistent audio files to detect STT drift before it affects production:

  1. Golden audio set: Fifty to one hundred recordings covering your domain vocabulary
  2. Accent diversity: Include various accents your users have
  3. Background noise: Test with clean and noisy audio
  4. Edge cases: Numbers, acronyms, proper nouns specific to your use case

Compare transcriptions daily against known-good baselines. Any deviation indicates potential drift.
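A minimal sketch of that daily check, using the open-source jiwer package to compute WER; the `transcribe` callable is a placeholder for your STT provider's API, and the golden set entries are illustrative:

```python
import jiwer  # pip install jiwer

# Known-good reference transcripts for the golden audio set (illustrative).
GOLDEN_SET = {
    "hipaa_question.wav": "does this plan meet hipaa compliance requirements",
    "rollover_request.wav": "i want to start a 401k rollover",
}

BASELINE_WER = 0.08       # WER measured at launch
WARNING_RELATIVE = 0.05   # warn at +5% relative increase

def check_stt_drift(transcribe):
    """transcribe: placeholder callable, audio filename -> hypothesis text."""
    references = list(GOLDEN_SET.values())
    hypotheses = [transcribe(filename) for filename in GOLDEN_SET]
    wer = jiwer.wer(references, hypotheses)
    relative_drift = (wer - BASELINE_WER) / BASELINE_WER
    if relative_drift > WARNING_RELATIVE:
        print(f"STT drift warning: WER {wer:.1%} vs baseline {BASELINE_WER:.1%}")
    return wer
```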

Type 2: LLM Response Drift

LLM drift is subtle but profoundly impacts user experience. Your carefully crafted prompts that produced perfect responses at launch gradually degrade as the underlying model evolves. GPT-5.1 in January behaves differently than GPT-5.1 in April, even with identical prompts.

LLM Drift Manifestations

LLM drift appears in multiple forms:

  • Instruction adherence degradation: The model stops following your specific formatting requirements
  • Personality shifts: Your professional agent becomes overly casual or excessively formal
  • Knowledge cutoff changes: Previously known information becomes unavailable
  • Safety filter evolution: Responses become more conservative or restrictive
  • Response length variance: Concise answers become verbose explanations

The most frustrating aspect is format compliance drift. Your prompt says "Respond with ONLY the account number" but suddenly the model adds "Your account number is:" before the number. Small changes that break downstream parsing.

LLM Drift Detection Strategy

Monitor these response characteristics:

| Metric | Description | Measurement Method |
|---|---|---|
| Format Compliance Rate | % of responses matching expected structure | Regex validation against templates |
| Instruction Adherence Score | How well responses follow specific directives | Semantic similarity to ideal responses |
| Response Length Distribution | Token count variance from baseline | Statistical deviation tracking |
| Hallucination Rate | % of responses with fabricated information | Fact-checking against known data |
| Tone Consistency | Sentiment and formality alignment | NLP sentiment analysis |

Weekly Prompt Consistency Tests

Run the same test prompts weekly to detect gradual changes:

| Test Prompt | Expected Format | Required Elements | Forbidden Elements | Validation Method |
|---|---|---|---|---|
| "What is the account balance for user ID twelve-three-four-five?" | $XXX.XX pattern | Dollar sign, decimal, digits | N/A | Regex: `^\$[\d,]+\.\d{2}$` |
| "Schedule an appointment for next Tuesday at 2 PM" | "Appointment scheduled for [date] at [time]" | Date, time references | "I'll", "I will", "Let me" | Pattern matching + sentiment check |
| "List three benefits of our premium plan" | Numbered list (1-3 items) | Numbers, line breaks | Excessive detail (>50 words/item) | Structure validation |
| "Transfer to human agent" | Acknowledgment only | Transfer confirmation | Attempts to handle request | Length check (<20 tokens) |

Weekly comparison process (a minimal sketch follows the list):

  1. Run each prompt through current model
  2. Compare response structure to baseline
  3. Calculate drift score based on format violations
  4. Alert if drift exceeds 5% threshold for any prompt
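Here is a sketch of steps 1 through 4, assuming a placeholder `ask_model` callable for your LLM and validation patterns drawn from the table above:

```python
import re

# Expected response formats for two of the weekly test prompts (illustrative).
PROMPT_CHECKS = [
    ("What is the account balance for user ID twelve-three-four-five?",
     re.compile(r"^\$[\d,]+\.\d{2}$")),
    ("Transfer to human agent",
     re.compile(r"^.{1,80}$")),  # short acknowledgment only
]

DRIFT_THRESHOLD = 0.05  # alert if more than 5% of prompts violate format

def format_drift_score(ask_model):
    """ask_model: placeholder callable, prompt -> response text."""
    violations = sum(
        1 for prompt, pattern in PROMPT_CHECKS
        if not pattern.fullmatch(ask_model(prompt).strip())
    )
    score = violations / len(PROMPT_CHECKS)
    if score > DRIFT_THRESHOLD:
        print(f"LLM format drift: {score:.0%} of test prompts failed validation")
    return score
```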

Type 3: TTS Quality Drift

Text-to-Speech drift is immediately noticeable to users but hard to quantify programmatically. Voices that sounded natural become slightly robotic. Pronunciation that was perfect becomes occasionally garbled. Prosody that flowed naturally becomes stilted.

TTS Drift Indicators

Watch for these changes:

  • Prosody degradation: Unnatural pauses, wrong emphasis, monotone delivery
  • Pronunciation shifts: Proper nouns, numbers, acronyms spoken differently
  • Voice characteristic changes: Timbre, pace, or emotion shifting from baseline
  • Audio artifacts: Clicks, pops, or distortion not present at launch
  • Consistency variance: Same text producing noticeably different audio

TTS Quality Metrics

Track objective and subjective measures:

| Metric | Measurement Method | Baseline | Warning Threshold | Critical Threshold |
|---|---|---|---|---|
| Mean Opinion Score (MOS) | 1-5 scale from human evaluators | 4.2 | 3.9 (-0.3 drop) | 3.7 (-0.5 drop) |
| Prosody Score | Automated prosody analysis | 85% | 80% (-5% absolute) | 75% (-10% absolute) |
| Pronunciation Accuracy | Phoneme matching for test phrases | 95% | 90% (-5% absolute) | 85% (-10% absolute) |
| Audio Artifacts | Signal processing detection | <0.1% | 0.5% of samples | 1% of samples |
| Voice Consistency | Spectral similarity analysis | 92% match | 87% match | 82% match |

Voice Consistency Testing

Generate the same text weekly and compare audio characteristics (a sketch follows the list):

  1. Reference phrases: 20-30 sentences covering various speaking styles
  2. Spectral analysis: Compare frequency distributions
  3. Temporal alignment: Check speaking pace consistency
  4. Perceptual hashing: Detect subtle audio changes
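A minimal sketch of the spectral comparison using librosa; the file names are placeholders, and the 0.87 warning threshold comes from the voice consistency row in the table above:

```python
import numpy as np
import librosa  # pip install librosa

def spectral_similarity(baseline_wav: str, current_wav: str, sr: int = 16000) -> float:
    """Cosine similarity between average mel spectra of two renderings of the
    same reference phrase. Values near 1.0 indicate a consistent voice."""
    ref_audio, _ = librosa.load(baseline_wav, sr=sr)
    cur_audio, _ = librosa.load(current_wav, sr=sr)
    ref_spec = librosa.feature.melspectrogram(y=ref_audio, sr=sr).mean(axis=1)
    cur_spec = librosa.feature.melspectrogram(y=cur_audio, sr=sr).mean(axis=1)
    return float(
        np.dot(ref_spec, cur_spec)
        / (np.linalg.norm(ref_spec) * np.linalg.norm(cur_spec))
    )

# Warn when this week's rendering drifts below the 87% match threshold.
if spectral_similarity("phrase_01_baseline.wav", "phrase_01_today.wav") < 0.87:
    print("TTS voice consistency warning")
```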

Type 4: Behavioral Drift (End-to-End)

Behavioral drift represents the compound effect of all component drifts plus emergent behaviors from their interaction. Even if individual components drift within acceptable ranges, their combined effect can significantly degrade the user experience.

Behavioral Drift Patterns

End-to-end drift manifests as:

  • Containment rate decline: More users requesting human agents
  • Task completion drop: Fewer successful resolutions
  • Conversation length increase: Taking more turns to accomplish tasks
  • User satisfaction decrease: Lower CSAT scores over time
  • Escalation rate rise: More supervisor interventions needed

This is where The Silent Degradation becomes visible in business metrics. Individual component metrics might look acceptable, but users are having worse experiences.

End-to-End Monitoring Strategy

Track these production metrics continuously:

| Metric | Baseline Period | Update Frequency | Alert Threshold |
|---|---|---|---|
| Containment Rate | 30-day average | Daily | Decrease of 5% from baseline |
| Task Completion Rate | 30-day average | Daily | Decrease of 5% from baseline |
| Average Handle Time | 14-day median | Daily | Increase of 15% from baseline |
| First Call Resolution | 30-day average | Weekly | Decrease of 10% from baseline |
| User Satisfaction | 30-day average | Weekly | Decrease of 0.5 points (five-point scale) |

Correlation Analysis

Behavioral drift often correlates with component drift. To identify which component drives behavioral drift, analyze correlations between component metrics and business outcomes:

| Component Drift | Business Metric | What High Correlation Means |
|---|---|---|
| STT accuracy decline | Containment rate drop | Transcription errors cause call failures |
| LLM format violations | Task completion rate drop | Response quality affects success rates |
| TTS quality degradation | Customer satisfaction decline | Voice quality impacts user experience |

The component with the highest correlation coefficient is typically your primary drift driver. Focus remediation efforts there first.
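A minimal sketch of that correlation check with scipy, using illustrative daily series:

```python
from scipy.stats import pearsonr  # pip install scipy

# Seven days of paired measurements (illustrative values).
stt_wer = [8.0, 8.1, 8.3, 8.6, 8.8, 9.0, 9.1]   # % WER, rising
containment = [85, 84, 83, 81, 80, 78, 77]       # % contained, falling

r, p_value = pearsonr(stt_wer, containment)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
# A strong negative r here suggests STT drift is driving the containment decline.
```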

Setting Up Drift Baselines

Establishing accurate baselines is critical for voice agent drift detection. Poor baselines lead to false positives (alerting on normal variation) or false negatives (missing actual degradation).

Baseline Establishment Timeline

Follow this timeline for new deployments:

Week 1-2: Stabilization Period

  • System settling after launch
  • Initial bug fixes and adjustments
  • High variance is normal
  • Don't establish baselines yet

Week 3-4: Data Collection

  • Collect metrics across all layers
  • Include peak and off-peak periods
  • Capture weekend vs. weekday patterns
  • Build statistical distributions

Week 5-6: Baseline Calculation

  • Calculate median values (more robust than mean)
  • Determine standard deviation for each metric
  • Set percentile thresholds (P50, P90, P95)
  • Document seasonal considerations

Week 7+: Baseline Validation

  • Compare against weeks 5-6
  • Adjust for discovered patterns
  • Lock in baselines for monitoring
  • Schedule periodic reviews

Baseline Calculation Methods

Different metric types require different statistical approaches:

| Metric Type | Examples | Recommended Method | Why This Method |
|---|---|---|---|
| Success Rates | Containment, completion | Weighted average by call volume | Accounts for traffic patterns |
| Response Times | Latency, processing time | 50th, 90th, 99th percentiles | Captures distribution, not just average |
| Quality Scores | MOS, satisfaction ratings | Trimmed mean (exclude top/bottom 10%) | Removes outlier ratings |
| Binary Outcomes | Pass/fail, transferred/contained | Success rate with confidence intervals | Shows statistical significance |

For rate metrics, weight by call volume to avoid skewing baselines during low-traffic periods. For latency metrics, use percentiles rather than averages since outliers can distort mean values. Quality scores benefit from trimmed means that exclude the highest and lowest 10% of ratings.
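A minimal sketch of those three calculations with numpy and scipy (all values illustrative):

```python
import numpy as np
from scipy.stats import trim_mean  # pip install scipy

# Rate metric: weight by call volume so low-traffic days don't skew the baseline.
containment = np.array([0.85, 0.83, 0.86, 0.84, 0.80])
call_volume = np.array([1200, 1100, 1300, 400, 350])
baseline_rate = np.average(containment, weights=call_volume)

# Latency metric: percentiles capture the distribution, not just the average.
latencies_ms = np.array([420, 450, 430, 900, 460, 445, 1200])
p50, p90, p99 = np.percentile(latencies_ms, [50, 90, 99])

# Quality score: trimmed mean drops the top and bottom 10% of ratings.
csat = np.array([4.5, 4.2, 1.0, 4.4, 4.3, 5.0, 4.1])
baseline_csat = trim_mean(csat, proportiontocut=0.10)

print(f"Containment baseline: {baseline_rate:.1%}, latency P90: {p90:.0f} ms, "
      f"CSAT baseline: {baseline_csat:.2f}")
```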

Baseline Drift Considerations

Baselines themselves can become stale. The Stale Baseline Trap occurs when you compare current performance to outdated baselines that no longer represent acceptable quality.

Update baselines when:

  • Major model upgrades improve performance (new baseline = new normal)
  • Business requirements change (different success criteria)
  • User demographics shift significantly
  • More than 6 months have passed

Continuous Monitoring for Drift

Effective voice agent drift detection requires balancing comprehensive coverage with operational overhead. Monitor too much and you drown in noise. Monitor too little and you miss critical degradation.

Monitoring Architecture

Implement a three-tier monitoring strategy:

Tier 1: Real-time Production Metrics (Continuous)

  • Containment rate per hour
  • Error rates and timeouts
  • Latency percentiles
  • Task completion rates

Tier 2: Daily Synthetic Tests

  • STT accuracy on golden audio set
  • LLM response format compliance
  • TTS quality spot checks
  • End-to-end scenario tests

Tier 3: Weekly Deep Analysis

  • Comprehensive test suite execution
  • Human evaluation sampling
  • Trend analysis across all metrics
  • Correlation studies

Synthetic Test Framework

Build a comprehensive monitoring system with these components (for background noise specifically, see our background noise testing KPIs guide):

| Component | Test Frequency | What to Test | Alert Triggers |
|---|---|---|---|
| STT Tests | Daily | Golden audio set, domain terms | WER increase >5% |
| LLM Tests | Daily | Format compliance, response quality | Format violations >5% |
| TTS Tests | Weekly | Voice consistency, pronunciation | MOS drop >0.3 points |
| End-to-End | Daily | Complete call flows | Containment drop >5% |

Drift Calculation Formula: Calculate drift as the percentage change from baseline: |current - baseline| / baseline × 100

When drift exceeds warning thresholds for your configured duration (e.g., 7 days), trigger alerts. Critical thresholds should trigger immediate escalation. Store all test results with timestamps for trend analysis.
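A minimal sketch of the drift formula and the sustained-threshold check (values illustrative):

```python
def drift_percent(current: float, baseline: float) -> float:
    """Relative drift from baseline, as a percentage."""
    return abs(current - baseline) / baseline * 100

def sustained_drift(history: list[float], baseline: float,
                    threshold_pct: float = 5.0, days: int = 7) -> bool:
    """True only if every one of the last `days` measurements breaches the
    threshold, which filters out temporary spikes."""
    recent = history[-days:]
    return len(recent) == days and all(
        drift_percent(value, baseline) > threshold_pct for value in recent
    )

# Example: daily WER readings against an 8% baseline.
wer_history = [8.5, 8.6, 8.5, 8.7, 8.6, 8.8, 8.9]
if sustained_drift(wer_history, baseline=8.0):
    print("STT drift warning: >5% deviation sustained for 7 days")
```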

Production Sampling Strategy

Don't analyze every call—use intelligent sampling (a sketch follows the list):

  1. Random sampling: Sample 5 to 10 percent of all calls for unbiased overview
  2. Edge case sampling: Sample all calls matching risk criteria
  3. New pattern sampling: Calls with unusual characteristics
  4. Complaint sampling: All calls preceding customer complaints
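A minimal sketch of this sampling logic, assuming an illustrative call record with a `flags` list:

```python
import random

def should_sample(call: dict, rate: float = 0.07,
                  risk_flags: tuple = ("escalated", "long_silence")) -> bool:
    """Decide whether a call enters the analysis queue.
    `call` is assumed to be a dict with a `flags` list (illustrative schema)."""
    if any(flag in call.get("flags", []) for flag in risk_flags):
        return True                  # edge-case sampling: always analyze
    return random.random() < rate    # random 5-10% sample for unbiased coverage
```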

Alerting When Drift Occurs

Effective alerting balances sensitivity with actionability. Too sensitive and you get False Positive Fatigue—your team ignores alerts. Too conservative and you miss degradation until customers complain.

Alert Threshold Configuration

Configure graduated alert levels based on severity and business impact:

| Drift Type | Alert Level | Threshold | Duration Required | Action | Recipients |
|---|---|---|---|---|---|
| STT Drift | Info | 3% increase | 1 day | Log to dashboard | - |
| STT Drift | Warning | 5% increase | 7 days sustained | Slack notification | ML team, QA team |
| STT Drift | Critical | 10% increase | 3 days sustained | PagerDuty alert | On-call, ML, Product |
| Behavioral | Warning | 5% degradation | 3 days sustained | Slack + dashboard | Product team |
| Behavioral | Critical | 8% degradation | 1 day sustained | Immediate page | On-call, Leadership |

Key Principles:

  • Behavioral metrics get lower thresholds (more sensitive) since they directly impact customers
  • Require sustained drift to avoid alerting on temporary spikes
  • Critical alerts escalate if not acknowledged within 15 minutes
  • Info-level logging helps identify trends before they become problems

Alert Context and Actionability

Every alert must include:

  1. What drifted: Specific metric and component
  2. By how much: Current value vs. baseline vs. threshold
  3. Since when: Duration of degradation
  4. Trend direction: Getting better or worse?
  5. Likely cause: Recent model updates or known issues
  6. Recommended action: Specific steps to investigate

Example alert format:

🚨 CRITICAL: LLM Response Drift Detected

Component: GPT-5.1 Response Format Compliance
Current: 82% (Baseline: 95%, Critical Threshold: 85.5%)
Duration: 3 days sustained drift
Trend: Worsening (84% → 83% → 82%)

Likely Cause: GPT-5.1 update on Jan 12 (3 days ago)
Affected Formats: JSON responses, numbered lists

Recommended Actions:
1. Run format compliance test suite
2. Review prompt engineering for affected formats
3. Consider prompt adjustments or model rollback
4. Check OpenAI status page for known issues

Dashboard: https://monitoring/drift/llm/format-compliance
Runbook: https://wiki/drift-response/llm-format

Alert Fatigue Prevention

Prevent False Positive Fatigue through:

  • Sustained thresholds: Require drift to persist (not spike)
  • Business hours routing: Non-critical alerts wait for working hours
  • Smart grouping: Combine related alerts into single notification
  • Automatic resolution: Clear alerts when metrics recover
  • Threshold tuning: Monthly review of alert accuracy

Responding to Drift Events

When drift is detected, follow a systematic response process to identify root causes and implement fixes without causing additional disruption.

Drift Response Runbook

Step 1: Confirm Drift Is Real (5 minutes)

  • Check if monitoring system is functioning correctly
  • Verify baseline is still valid (not stale)
  • Confirm sustained drift, not temporary spike
  • Review any recent deployments or changes

Step 2: Identify Drift Source (15 minutes)

  • Check component-specific drift scores
  • Review provider status pages and changelogs
  • Analyze correlation with behavioral metrics
  • Look for patterns in affected calls

Step 3: Assess Impact (10 minutes)

  • Quantify affected users and call volumes
  • Measure business impact (containment, satisfaction)
  • Determine urgency of response
  • Evaluate risk of immediate action vs. monitoring

Step 4: Implement Response (varies)

For external drift (provider model updates):

```python
# Response strategies for external drift (provider model updates).
# Maps each component to remediation options, ordered least to most disruptive.
RESPONSE_OPTIONS = {
    "stt": [
        "Switch to previous model version if available",
        "Adjust confidence thresholds",
        "Implement preprocessing for problem terms",
        "Switch to alternative STT provider",
    ],
    "llm": [
        "Refine prompts for new model behavior",
        "Add explicit format enforcement",
        "Implement response post-processing",
        "Roll back to previous model version",
        "Switch to alternative LLM",
    ],
    "tts": [
        "Adjust voice settings (speed, pitch)",
        "Switch to different voice model",
        "Implement audio post-processing",
        "Change TTS provider",
    ],
}

def respond_to_external_drift(component: str) -> list[str]:
    """Return candidate remediations for the drifting component."""
    if component not in RESPONSE_OPTIONS:
        raise ValueError(f"Unknown component: {component}")
    return RESPONSE_OPTIONS[component]
```

For internal drift (data distribution changes):

  • Retrain on recent data
  • Update vocabulary and phrases
  • Adjust routing rules
  • Modify conversation flows

Step 5: Validate Fix (30 minutes)

  • Run comprehensive test suite
  • Compare metrics to pre-drift baseline
  • Monitor early production traffic
  • Prepare rollback plan

Step 6: Update Baselines (if needed)

Sometimes drift represents improvement:

  • Provider updates that enhance quality
  • Seasonal patterns that are expected
  • Evolution in user behavior

If drift is positive and sustained, update baselines to reflect new normal.

Recovery Tracking

Track key metrics during drift recovery to validate fixes and document lessons learned:

| Recovery Phase | What to Track | Duration | Success Criteria |
|---|---|---|---|
| Pre-Fix Baseline | Degraded metric values | Point in time | Document current state |
| Fix Application | Intervention type, changes made | During fix | Complete without errors |
| Initial Recovery | First hour metrics | 1 hour post-fix | Metrics trending upward |
| Stabilization | Hourly measurements | 24 hours | Within 10% of baseline |
| Full Recovery | Daily averages | 72 hours | Return to baseline range |

Recovery Documentation:

  • Record the intervention type (prompt adjustment, model rollback, configuration change)
  • Measure time to recovery from fix application
  • Track whether metrics return to original baseline or establish new normal
  • Document lessons learned for faster response to similar events

Monitor for 72 hours post-fix to ensure the drift doesn't recur. If metrics don't recover within acceptable thresholds after 24 hours, consider escalating or trying alternative fixes.

When Drift Detection Isn't Critical

Not every voice agent needs comprehensive drift detection. Understanding when simpler approaches suffice helps you allocate resources effectively.

Manual monitoring works for: Low-volume agents where you can review calls manually, catching quality issues through direct observation. If you handle fewer than 500 calls/month, listening to a sample of recordings weekly might be sufficient.

Periodic testing works for: Stable agents with infrequent model updates where monthly manual reviews catch issues. If your providers rarely update models and your use case is simple, quarterly test runs might suffice.

Basic alerting works for: Agents with clear success/failure outcomes where behavioral metrics alone indicate problems. If task completion is binary and immediately measurable, you might not need component-level drift detection.

Comprehensive drift detection may be unnecessary if:

  • Your agent handles non-critical interactions where quality variance is acceptable
  • You use fully-managed solutions that handle model updates transparently
  • Your interaction volume is too low for statistical significance
  • Cost of drift detection exceeds potential impact of degradation

Flaws But Not Dealbreakers

Some drift is improvement. Model updates often make things better. Your voice agent drift detection alerts will sometimes flag improvements, not regressions. Review before reverting. That 5 percent change in LLM behavior might be the model getting better at understanding context.

Baselines decay. If your baseline is 12 months old, drift from it might be intentional evolution. Update baselines periodically. What was acceptable quality a year ago might not meet today's standards.

Not all drift is detectable. Subtle changes in tone, personality, or nuance are hard to measure automatically. A voice agent that becomes slightly less empathetic won't trigger metrics but users will notice. Complement automated monitoring with periodic human review.

Seasonality confuses drift detection. Holiday greetings, weather-related conversations, and seasonal buying patterns can look like drift. Build seasonal baselines or use longer comparison windows during known seasonal periods.

Component interactions create emergent drift. Sometimes individual components remain within thresholds but their interaction creates problems. STT transcribes correctly, LLM responds appropriately, but together they mishandle specific scenarios.

Next Steps: Building Your Drift Detection System

Start with the highest-impact, lowest-effort monitoring:

  1. Week 1: Set up behavioral metric tracking (containment, completion rates)
  2. Week 2: Implement daily synthetic STT tests with golden audio
  3. Week 3: Add weekly LLM format compliance checks
  4. Week 4: Establish baselines from your first month of data
  5. Month 2: Add TTS quality monitoring and alert automation
  6. Month 3: Implement full drift detection framework with correlation analysis

Remember: drift is a month-3+ problem. If you just launched, focus on immediate bugs first. But once you're stable, drift detection becomes critical for maintaining quality over time.

Voice agent drift isn't a question of if, but when. External models will change. Data distributions will shift. Quality will degrade. The teams that maintain excellent voice agents aren't the ones that prevent drift—that's impossible. They're the ones that detect it early and respond effectively.

The Silent Degradation doesn't have to be silent. With proper monitoring, those 15 to 20 percent accuracy drops over 90 days become 2 to 3 percent corrections every few weeks. Your customers experience consistent quality. Your team prevents fires instead of fighting them. Your voice agent remains as good on day 300 as it was on day 30.

Ready to implement comprehensive drift detection for your voice agents? Hamming provides automated testing and monitoring that catches drift before your customers do. Our platform runs continuous synthetic tests, tracks all four drift types, and alerts you when metrics deviate from baseline—so you can maintain consistent quality over time.

Frequently Asked Questions

What is voice agent drift detection?

Voice agent drift detection is the process of monitoring gradual performance degradation in AI voice agents over time. It tracks changes in STT accuracy, LLM response quality, TTS voice consistency, and overall behavioral metrics to catch quality issues before customers notice them.

How often should I test for drift?

Run daily synthetic tests for critical components (STT and LLM), weekly tests for comprehensive coverage, and monthly evaluations including human review. The frequency depends on your call volume and quality requirements—higher volume agents need more frequent monitoring.

Why do voice agents drift?

Voice agents drift due to external model updates (GPT, Deepgram, ElevenLabs), changing data distributions, seasonal patterns in user behavior, and accumulating edge cases. Providers update their models without notice, causing gradual behavior changes.

How do I establish drift baselines?

Establish baselines by collecting metrics for 30 days post-launch, calculating statistical measures (median, percentiles), and setting thresholds based on acceptable variation. Update baselines quarterly or after major improvements to avoid the stale baseline trap.

How is drift different from regression?

Regression is sudden failure caused by code changes or deployments, caught immediately by CI/CD tests. Drift is gradual degradation over weeks without any changes on your side, requiring continuous monitoring with statistical comparison to baselines.

Can drift be prevented?

No, drift cannot be prevented because external dependencies (STT/LLM/TTS models) update without your control. The goal is early detection and rapid response—catching drift within days rather than discovering it through customer complaints months later.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”