Slack Alerts for Voice Agents: Monitoring Latency, ASR Drift & Prompt Regressions

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 1M+ voice agent calls to find where they break.

January 28, 2026 · Updated January 28, 2026 · 16 min read

Set up Slack alerts to catch voice agent failures before users notice them.

Voice agents fail silently. A latency spike that adds 800ms to every response. An ASR model update that drops transcription accuracy by 12%. A prompt change that causes the agent to ignore safety guardrails. Users notice immediately—they hang up, complain, or switch to competitors. Your dashboard might not show problems for hours.

Slack alerts close this gap. When configured correctly, they detect failures within 60 seconds and route actionable context to the right team. This guide covers the complete alerting stack for production voice agents: what to monitor, where to set thresholds, and how to structure alerts that enable fast triage.

What you'll get from this guide:

  • A master reference table mapping alert types to thresholds, severity levels, and Slack routing
  • Copy-paste Slack message templates for latency, ASR drift, jitter, and prompt regressions
  • Implementation patterns for webhooks, OpenTelemetry, and monitoring integrations
  • Noise control strategies that reduce alert fatigue by 60-80%

Understanding Voice Agent Monitoring Requirements

The Voice Agent Stack: What Needs Monitoring

Voice agents are multi-component pipelines where failures cascade. Each layer requires specific metrics and alert thresholds:

┌─────────────────────────────────────────────────────────────────────────────┐
                             VOICE AGENT PIPELINE
├─────────────────────────────────────────────────────────────────────────────┤

  [Network/VoIP] → [ASR/STT] → [LLM] → [Tools] → [TTS] → [Network/VoIP]

   Jitter          WER          Latency   Success   MOS       Packet Loss
   Packet Loss     Confidence   TTFB      Rate      TTFB
   MOS             Latency      Tokens    Latency   Quality

└─────────────────────────────────────────────────────────────────────────────┘

Each component has distinct failure modes:

  • Network/VoIP: Jitter greater than 30ms causes choppy audio; packet loss greater than 1% drops words
  • ASR/STT: WER drift degrades intent accuracy; confidence drops signal model issues
  • LLM: Latency spikes break conversational flow; token overflows cause truncation
  • Tools: API failures block task completion; timeouts cascade through the pipeline
  • TTS: Quality degradation (MOS <3.5) frustrates users; latency adds to total delay

Why Traditional APM Falls Short for Voice Agents

Generic APM tools (Datadog, New Relic, Grafana) excel at infrastructure monitoring but miss voice-specific failures:

| What APM Tracks | What Voice Agents Need | Gap |
|---|---|---|
| Server CPU/memory | Turn-level latency | APM sees averages; voice needs P99 per turn |
| API response times | Component breakdown (STT/LLM/TTS) | APM shows total; voice needs each stage |
| Error rates | Intent accuracy, WER | APM counts errors; voice needs quality metrics |
| Uptime | Conversation completion | APM sees calls connect; voice needs task success |
| Request latency | Time to First Word (TTFW) | APM measures server; voice measures user perception |

The result: APM dashboards show green while users experience broken conversations. One healthcare company tracked 99.9% uptime and sub-200ms API latency while its CSAT dropped 15%, because it wasn't measuring turn-level latency, intent accuracy, or context retention.

Critical Metrics for Production Voice Agents

These are the metrics that predict voice agent failure. Alert on these, not infrastructure proxies:

| Metric | Definition | Target | Why It Matters |
|---|---|---|---|
| P99 Latency (TTFW) | 99th percentile time from user stops speaking to agent starts speaking | <7s | 1% of users experience worst-case; they remember it |
| WER | Word Error Rate: (substitutions + deletions + insertions) / total words | <8% | Every 5% WER increase reduces intent accuracy by ~10% |
| Interruption Rate | % of turns where user interrupts agent | <15% | High rates indicate latency or relevance problems |
| TTFA | Time to First Audio byte from TTS | <200ms | Streaming TTS responsiveness |
| Tool Call Success | % of tool invocations returning expected results | >99% | Failed tools = failed tasks |
| Intent Accuracy | % correctly classified user intents | >95% | Foundation for correct agent behavior |
| MOS Score | Mean Opinion Score (1-5) for audio quality | >4.0 | Below 3.5, ASR accuracy degrades significantly |

Voice Agent Alert Master Table

This reference table maps every alert type to thresholds, severity, and routing. Use it to configure your alerting system.

| Alert Type | Category | Trigger Definition | Starter Threshold | Severity | Slack Channel | Include in Alert |
|---|---|---|---|---|---|---|
| Latency Spike | LLM | P95 TTFW exceeds threshold | >6s warning, >7s critical | SEV-1 (critical), SEV-2 (warning) | #voice-oncall | Component breakdown, sample calls |
| ASR Drift | ASR | WER increases from baseline | >5% increase over 100-call window | SEV-2 | #voice-alerts | Baseline comparison, sample transcripts |
| Jitter Spike | VoIP | Jitter exceeds acceptable variance | >30ms warning, >50ms critical | SEV-2 | #voice-alerts | MOS impact, affected regions |
| Packet Loss | VoIP | RTP packet loss rate | >1% warning, >3% critical | SEV-1 | #voice-oncall | Duration, affected calls |
| Prompt Regression | LLM | Prompt version scores drop | >10% drop from baseline | SEV-1 | #voice-oncall | Version comparison, failure examples |
| TTS Quality | TTS | MOS score degrades | <4.0 warning, <3.5 critical | SEV-2 | #voice-alerts | Audio samples, provider status |
| Intent Accuracy | ASR/LLM | Classification accuracy drops | <92% warning, <90% critical | SEV-1 | #voice-oncall | Confusion matrix, sample failures |
| Tool Failures | Integration | Tool call error rate | >2% warning, >5% critical | SEV-1 | #voice-oncall | Failed tool, error codes, affected tasks |
| Task Completion | Business | End-to-end task success rate | <80% warning, <70% critical | SEV-1 | #voice-oncall | Failed tasks, drop-off points |
| Compliance Violation | Safety | Policy or PII violation detected | Any occurrence | SEV-0 | #voice-oncall + #compliance | Conversation link, violation type |
| Dead Air | Quality | Silence >3s during conversation | >2 occurrences per call | SEV-2 | #voice-alerts | Timestamp, preceding context |
| Escalation Spike | Business | Transfer to human rate | >2x baseline | SEV-2 | #voice-alerts | Escalation reasons, affected intents |

Legend:

  • SEV-0: Immediate response required (less than 5 min), pages on-call
  • SEV-1: Urgent response (less than 15 min), posts to #voice-oncall
  • SEV-2: Standard response (less than 1 hour), posts to #voice-alerts
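
The routing and response-time rules in the master table can be encoded directly in your alerting service. A minimal sketch as a Python mapping; the channel names and response windows come from the table and legend above, while the `route_alert` helper and its calling convention are assumptions for illustration:

# Severity routing derived from the master table and legend above
SEVERITY_ROUTING = {
    "SEV-0": {"channels": ["#voice-oncall", "#compliance"], "page_oncall": True, "respond_within_min": 5},
    "SEV-1": {"channels": ["#voice-oncall"], "page_oncall": True, "respond_within_min": 15},
    "SEV-2": {"channels": ["#voice-alerts"], "page_oncall": False, "respond_within_min": 60},
}

def route_alert(severity: str) -> dict:
    """Return the Slack channels and paging policy for a given severity level."""
    # Unknown severities fall back to the least disruptive path
    return SEVERITY_ROUTING.get(severity, SEVERITY_ROUTING["SEV-2"])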

Latency Monitoring and Alerting

Defining Latency Thresholds for Voice AI

Human conversation has a natural rhythm: responses within 1 second feel natural, 1-2 seconds feels acceptable, and anything above 3 seconds starts to feel broken. Based on real-world production analysis, set alert thresholds around these practical boundaries:

| Percentile | Target | Warning Threshold | Critical Threshold | User Impact |
|---|---|---|---|---|
| P50 | <1.5s | >2s | >2.5s | Median user experience |
| P90 | <3s | >4s | >5s | Most users affected |
| P95 | <5s | >6s | >7s | Significant degradation |
| P99 | <7s | >8s | >10s | Worst experiences, high frustration |

Why P99 matters more than averages: A P50 of 1.5s with a P99 of 10s means your median user has an acceptable experience while 1% of users face significant delays. At 10,000 calls/day, that's 100 frustrated users daily. They remember.
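
To make these thresholds operational, compute the percentiles over a sliding window of per-turn measurements rather than relying on averages. A minimal sketch using the nearest-rank method; the `ttfw_ms` list of per-turn samples is an assumed input from your own instrumentation:

import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value at or above pct% of observations."""
    ordered = sorted(values)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]

def latency_summary(ttfw_ms):
    """Summarize per-turn time-to-first-word samples for threshold checks."""
    return {p: percentile(ttfw_ms, p) for p in (50, 90, 95, 99)}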

Component-Level Latency Breakdown

End-to-end latency is the sum of its parts. Instrument each component to identify bottlenecks:

| Component | Typical Range | Target | % of Total | Common Bottleneck |
|---|---|---|---|---|
| Network (inbound) | 50-100ms | <80ms | 2-3% | Geographic distance, mobile networks |
| ASR/STT | 300-600ms | <500ms | 10-15% | Model size, streaming vs batch |
| LLM Inference | 1.5-4s | <3s | 60-70% | Model size, prompt length, provider load |
| Tool Execution | 200-600ms | <400ms | 8-12% | API latency, database queries |
| TTS | 200-400ms | <300ms | 5-10% | Voice complexity, streaming setup |
| Network (outbound) | 50-100ms | <80ms | 2-3% | Same as inbound |

Example breakdown for 3.5s total latency:

ASR:     450ms (13%)
LLM:   2,400ms (69%)
Tools:   350ms (10%)
TTS:     300ms (8%)
─────────────────────
Total:  3,500ms
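
With per-component timings like the breakdown above, the likely bottleneck can be flagged automatically and attached to the alert. A small sketch; the component names and the example breakdown dict are illustrative:

def find_bottleneck(breakdown_ms):
    """Return the slowest component, its latency, and its share of total latency."""
    total = sum(breakdown_ms.values())
    component, latency = max(breakdown_ms.items(), key=lambda kv: kv[1])
    return component, latency, round(latency / total * 100)

# Example from the breakdown above
print(find_bottleneck({"asr": 450, "llm": 2400, "tools": 350, "tts": 300}))
# ('llm', 2400, 69)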

Slack Alert Template: Latency Spike Detection

🚨 SEV-1 ALERT: Voice Agent Latency Critical

📊 Metric: ttfw_p99
   Current: 9.2s (threshold: 8s)
   Baseline (24h): 5.8s
   Duration: 8 minutes
   Affected: 12% of calls (47 of 391)

🔍 Component Breakdown:
    Network In:   85ms  (normal)
    ASR:         450ms  (normal)
    LLM:       7,820ms  (↑ 4,200ms - LIKELY CAUSE)
    Tools:       380ms  (normal)
    TTS:         465ms  (normal)

📍 Context:
    Region: us-east-1
    Model: gpt-4-turbo
    Prompt Version: v2-4-1

🔗 Actions:
   [View Dashboard] [Sample Calls] [LLM Provider Status] [Runbook]

👤 On-call: @voice-team
   React with ✅ to acknowledge

Detecting Latency Regressions Across Deployments

Latency often regresses after deployments—new prompt versions, model updates, or infrastructure changes. Automate detection:

CI/CD Integration:

# .github/workflows/voice-agent-deploy.yml
- name: Run latency baseline test
  run: |
    hamming test run --suite latency-baseline --wait

- name: Compare against baseline
  run: |
    CURRENT_P99=$(hamming metrics get ttfw_p99 --last-hour)
    BASELINE_P99=$(hamming metrics get ttfw_p99 --baseline)

    # bc -l keeps decimal precision when comparing against a 20% regression budget
    if [ "$(echo "$CURRENT_P99 > $BASELINE_P99 * 1.2" | bc -l)" -eq 1 ]; then
      echo "::error::P99 latency regressed by more than 20%"
      exit 1
    fi

Post-Deploy Monitoring Checklist:

  • P99 latency within 20% of pre-deploy baseline after 15 minutes
  • No component showing greater than 50% latency increase
  • LLM token usage within expected range (longer prompts = slower inference)
  • No new timeout errors in tool calls

ASR Drift and Quality Monitoring

Measuring ASR Performance with WER

Word Error Rate (WER) is the standard metric for ASR accuracy:

WER = (Substitutions + Deletions + Insertions) / Total Words × 100

Example:
Reference:  "I need to check my account balance today"
ASR Output: "I need to check my count balance"

Substitutions: 1 ("account" → "count")
Deletions: 1 ("today" missing)
Insertions: 0

WER = (1 + 1 + 0) / 8 × 100 = 25%
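
The same calculation can be scripted with a word-level Levenshtein distance, which is what most WER libraries (for example, jiwer) implement under the hood. A minimal sketch:

def word_error_rate(reference, hypothesis):
    """WER via word-level edit distance: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref) * 100

print(word_error_rate("I need to check my account balance today",
                      "I need to check my count balance"))  # 25.0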

WER Benchmarks by Condition:

| Condition | Good | Acceptable | Poor | Action Required |
|---|---|---|---|---|
| Clean audio, native speaker | <5% | <8% | >10% | Investigate ASR provider |
| Background noise | <10% | <15% | >18% | Enable noise suppression |
| Non-native accent | <12% | <18% | >22% | Consider accent-aware models |
| Domain terminology | <8% | <12% | >15% | Add custom vocabulary |

Detecting Silent Model Updates and API Changes

ASR providers update models without notice. These "silent updates" can shift WER by 5-15% overnight. Detect them by monitoring WER trends independent of your code deployments:

Detection Strategy:

  • Maintain a 7-day rolling WER baseline
  • Alert when current 100-call WER deviates greater than 5% from baseline
  • Cross-reference against your deployment log—if no deploy, suspect provider change
# Pseudo-code for drift detection
def check_asr_drift(current_wer, baseline_wer, deploy_log):
    deviation = (current_wer - baseline_wer) / baseline_wer * 100

    if deviation > 5:  # 5% threshold
        recent_deploy = deploy_log.has_deploy_in_last_hours(4)

        if not recent_deploy:
            alert(
                severity="SEV-2",
                message="ASR drift detected without deployment",
                context={
                    "current_wer": current_wer,
                    "baseline_wer": baseline_wer,
                    "deviation_pct": deviation,
                    "likely_cause": "Provider model update"
                }
            )

Slack Alert Template: ASR Accuracy Degradation

⚠️ SEV-2 ALERT: ASR Drift Detected

📊 Metric: asr_wer
   Current (100-call window): 13.8%
   Baseline (7-day): 7.2%
   Deviation: +92%
   Duration: 45 minutes

🔍 No Recent Deployments
   Last deploy: 3 days ago (v2.3.8)
   Likely cause: Provider model update

📝 Sample Transcription Errors:
    "cancel my subscription"  "cancel mice subscription"
    "account number 4859"  "account number four eight five nine"
    "schedule for Tuesday"  "schedule for today"

📈 Segment Analysis:
    Mobile callers: 18.2% WER ( from 9.1%)
    Landline: 8.4% WER (normal)
    With background noise: 22.1% WER ( from 12.3%)

🔗 Actions:
   [Audio Samples] [WER Trend Dashboard] [Contact ASR Provider] [Runbook]

👤 On-call: @voice-team

Monitoring Transcription Quality by User Segment

Aggregate WER hides systematic failures. Slice by segments to identify affected populations:

| Segment | How to Identify | Common Issues | Alert Threshold |
|---|---|---|---|
| Accent | User profile, phone number region | Model bias toward standard accents | >2x overall WER |
| Background Noise | Audio analysis, SNR measurement | Office, traffic, wind | >150% of clean audio WER |
| Domain Terms | Entity extraction failures | Product names, medical terms | >3x general vocabulary WER |
| Device Type | User agent, call metadata | Mobile compression, Bluetooth | >130% of landline WER |
| Time of Day | Timestamp | Network congestion patterns | Variance >20% from daily mean |
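
A sketch of the segment slicing: group per-call WER by whatever segment tag your pipeline attaches (device type, region, SNR bucket) and flag segments that blow past a multiple of the overall rate. The call-record shape and the `segment`/`wer` field names are assumptions:

from collections import defaultdict

def segment_wer(calls):
    """Average WER per segment; `calls` is an iterable of dicts with 'segment' and 'wer' keys."""
    by_segment = defaultdict(list)
    for call in calls:
        by_segment[call["segment"]].append(call["wer"])
    return {seg: sum(vals) / len(vals) for seg, vals in by_segment.items()}

def segments_exceeding(calls, overall_wer, multiplier=2.0):
    """Return segments whose average WER exceeds `multiplier` times the overall WER."""
    return {seg: wer for seg, wer in segment_wer(calls).items()
            if wer > overall_wer * multiplier}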

Jitter and Network Quality Alerting

Jitter vs Latency: Why Both Matter

Latency is the absolute delay between sending and receiving audio packets. Jitter is the variation in that delay. For voice agents, consistent delay is manageable; inconsistent delay breaks audio.

| Metric | What It Measures | Impact on Voice Agents | Target |
|---|---|---|---|
| Latency (RTT) | Round-trip time for packets | Adds to total TTFW | <150ms |
| Jitter | Variance in packet arrival | Causes choppy audio, ASR errors | <30ms |
| Packet Loss | % of packets that don't arrive | Missing words, gaps in audio | <1% |
| MOS | Mean Opinion Score (predicted) | Overall call quality | >4.0 |

How jitter affects ASR:

  • Jitter greater than 30ms: Audio frames arrive out of order, requiring larger jitter buffers
  • Jitter greater than 50ms: Noticeable audio choppiness, WER increases 10-20%
  • Jitter greater than 100ms: Severe audio distortion, ASR essentially fails

Real-Time Jitter Detection

Monitor jitter at the network edge, not just in aggregate metrics:

Measurement Points:

  • WebRTC stats: RTCInboundRtpStreamStats.jitter (in seconds)
  • RTP stream analysis: Calculate from sequence number gaps and timestamps
  • Synthetic probes: Send test packets every 500ms, measure variance
// WebRTC jitter monitoring
const stats = await peerConnection.getStats();
stats.forEach(report => {
  if (report.type === 'inbound-rtp' && report.kind === 'audio') {
    const jitterMs = report.jitter * 1000;

    if (jitterMs > 50) {
      sendAlert({
        severity: 'SEV-2',
        metric: 'jitter',
        value: jitterMs,
        threshold: 50,
        callId: currentCallId
      });
    }
  }
});

Slack Alert Template: Network Quality Issues

🚨 SEV-2 ALERT: Network Quality Degradation

📊 Metrics (30-second window):
   Jitter:       67ms (threshold: 50ms)
   Packet Loss:  2.8% (threshold: 1%)
   MOS Score:    3.2 (threshold: 4.0)

🔍 Impact Assessment:
    Affected calls: 23 (in progress)
    ASR WER during incident: 18.4% (baseline: 7.2%)
    User interruptions: ↑ 340%

📍 Geographic Analysis:
    us-east-1: Normal (jitter 18ms)
    us-west-2: AFFECTED (jitter 67ms)
    eu-west-1: Normal (jitter 22ms)

🔧 Probable Cause:
    ISP routing change detected at 14:32 UTC
    Affects Comcast residential in California region

🔗 Actions:
   [Network Dashboard] [Affected Calls] [ISP Status] [Runbook]

👤 On-call: @infrastructure-team @voice-team

Prompt Regression Detection

Why Voice Agent Prompts Regress

Prompt regressions happen without code changes. Research shows 58.8% of prompt+model combinations experience accuracy drops when underlying LLM APIs update. Common causes:

| Cause | Detection Method | Frequency |
|---|---|---|
| LLM API updates | Monitor model version in response headers | Monthly |
| Prompt interaction effects | A/B test prompt versions continuously | Per prompt change |
| Training data drift | Track performance on held-out test set | Quarterly |
| Context window changes | Monitor truncation errors | Per model update |
| Temperature/parameter shifts | Log all inference parameters | Continuous |
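
Most LLM and ASR providers echo a model identifier somewhere in the response (headers or body, depending on the provider). A hedged sketch of watching that identifier for silent changes; the `alert` helper matches the drift-detection pseudo-code used earlier, and the response field name will vary by provider:

_last_seen_model = None

def check_model_version(reported_model, deployed_recently):
    """Alert when the provider-reported model identifier changes outside a deployment window."""
    global _last_seen_model
    if (_last_seen_model is not None
            and reported_model != _last_seen_model
            and not deployed_recently):
        alert(
            severity="SEV-2",
            message=f"Model identifier changed: {_last_seen_model} -> {reported_model}",
            context={"likely_cause": "Provider-side model update"},
        )
    _last_seen_model = reported_model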

Setting Up LLM-as-a-Judge Evaluators

Use a scoring LLM to evaluate agent responses against rubrics. Run in CI/CD and production:

# LLM-as-a-Judge evaluation
import json

EVAL_PROMPT = """
You are evaluating a voice agent's response quality.

User utterance: {user_utterance}
Agent response: {agent_response}
Expected behavior: {expected_behavior}

Score the response on these dimensions (1-5 each):
• Relevance: Does it address the user's need?
• Accuracy: Is the information correct?
• Compliance: Does it follow the prompt instructions?
• Tone: Is it appropriate for a voice conversation?

Return JSON: {{"relevance": N, "accuracy": N, "compliance": N, "tone": N, "overall": N}}
"""  # Braces doubled so str.format() leaves the JSON example literal

def evaluate_response(user_utterance, agent_response, expected_behavior):
    result = llm.complete(EVAL_PROMPT.format(
        user_utterance=user_utterance,
        agent_response=agent_response,
        expected_behavior=expected_behavior
    ))
    scores = json.loads(result)

    if scores["overall"] < 3:
        flag_for_review(user_utterance, agent_response, scores)

    return scores

Evaluation Triggers:

  • CI/CD: Run on test dataset before every deployment
  • Production sampling: Evaluate 5% of production calls async
  • Regression detection: Compare scores across prompt versions
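
For the production-sampling trigger, keep scoring off the hot path so it never adds latency to live calls. A minimal sketch that reuses the `evaluate_response` function above and assumes it is called from inside the agent's asyncio event loop:

import asyncio
import random

SAMPLE_RATE = 0.05  # score roughly 5% of production turns

def maybe_evaluate(user_utterance, agent_response, expected_behavior):
    """Fire-and-forget sampling: run the judge in a worker thread for a random slice of turns."""
    if random.random() < SAMPLE_RATE:
        asyncio.get_running_loop().run_in_executor(
            None, evaluate_response, user_utterance, agent_response, expected_behavior
        )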

Slack Alert Template: Prompt Performance Degradation

🚨 SEV-1 ALERT: Prompt Regression Detected

📊 Prompt Version Comparison:
   Current (v2.5.0):  Score 3.2/5.0
   Previous (v2.4.1): Score 4.1/5.0
   Regression: -22%

🔍 Breakdown by Dimension:
    Relevance:  3.8 → 3.4 (-10%)
    Accuracy:   4.2 → 3.0 (-29%) ⚠️
    Compliance: 4.0 → 3.1 (-23%) ⚠️
    Tone:       4.3 → 3.5 (-19%)

📝 Failure Examples:
    Intent: cancel_subscription
     Expected: Confirm cancellation, offer retention
     Actual: Processed cancellation without retention offer

    Intent: billing_dispute
     Expected: Acknowledge, gather details, escalate
     Actual: Provided incorrect refund policy

📈 Traffic Impact:
    v2.5.0 receiving 25% of traffic (canary)
    847 calls evaluated
    Recommend: ROLLBACK to v2.4.1

🔗 Actions:
   [Rollback Now] [View Diff] [Sample Failures] [Runbook]

👤 On-call: @voice-team @prompt-eng
   React with 🔙 to trigger rollback

Version Control and Baseline Management for Prompts

Treat prompts as versioned code artifacts, not strings in config files:

Best Practices:

  • Version every change: "v2-4-1" not "updated prompt"
  • Embed test suites: Each prompt version has associated test cases
  • Track lineage: Know which base prompt each version derives from
  • Store evaluation scores: Historical scores enable regression detection
# prompts/booking-agent/v2.5.0.yaml
version: "2.5.0"
parent_version: "2.4.1"
created_at: "2026-01-27T10:00:00Z"
author: "jane@company.com"

prompt: |
  You are a booking assistant for [Company].
  ...

test_suite:
  - intent: book_appointment
    cases: 50
    expected_score: 4.0
  - intent: reschedule
    cases: 30
    expected_score: 4.2

baseline_scores:
  overall: 4.1
  relevance: 3.9
  accuracy: 4.2
  compliance: 4.0
  tone: 4.3
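
With baseline scores stored alongside each prompt version, the regression check becomes a comparison against that file. A sketch assuming PyYAML is available and reusing the `alert` helper from the drift-detection pseudo-code:

import yaml

def check_prompt_regression(current_scores, version_file, max_drop_pct=10.0):
    """Compare live evaluation scores against the baseline stored with the prompt version."""
    with open(version_file) as f:
        spec = yaml.safe_load(f)

    baseline = spec["baseline_scores"]["overall"]
    drop_pct = (baseline - current_scores["overall"]) / baseline * 100

    if drop_pct > max_drop_pct:
        alert(
            severity="SEV-1",
            message=f"Prompt {spec['version']} regressed {drop_pct:.0f}% from baseline",
            context={"baseline": baseline, "current": current_scores["overall"]},
        )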

TTS Quality and Naturalness Monitoring

TTS Metrics That Matter in Production

| Metric | Definition | Target | Alert Threshold |
|---|---|---|---|
| MOS | Mean Opinion Score (1-5) | >4.0 | <3.5 |
| TTFB | Time to First Byte of audio | <200ms | >400ms |
| TTS WER | WER when ASR transcribes TTS output | <3% | >5% |
| Synthesis Latency | Total time to generate audio | <300ms | >500ms |
| Audio Artifacts | Clicks, pops, unnatural pauses | 0 per call | >2 per call |

Why TTS WER matters: If your ASR can't accurately transcribe what your TTS outputs, users hear something different than what you intended. TTS WER greater than 5% indicates pronunciation or audio quality issues.
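
A round-trip check makes this measurable: synthesize a known utterance, transcribe it with your own ASR, and compare. The sketch below assumes hypothetical `tts_client`/`asr_client` wrappers and reuses the `word_error_rate` helper sketched earlier:

def tts_round_trip_wer(text, tts_client, asr_client):
    """Synthesize text, transcribe it back, and measure WER between input and transcript."""
    audio = tts_client.synthesize(text)        # hypothetical TTS wrapper
    transcript = asr_client.transcribe(audio)  # hypothetical ASR wrapper
    return word_error_rate(text, transcript.text)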

Slack Alert Template: TTS Quality Degradation

⚠️ SEV-2 ALERT: TTS Quality Below Threshold

📊 Metrics (1-hour window):
   MOS Score:    3.2 (threshold: 3.5)
   TTS WER:      7.8% (threshold: 5%)
   TTFB:         380ms (threshold: 400ms)

🔊 Audio Quality Issues:
    Detected artifacts: 3.2 per call average
    Unnatural pauses: 12% of utterances
    Pronunciation errors: "schedule" → "skedule" (47 occurrences)

📍 Affected Configuration:
    Voice: "alloy"
    Provider: OpenAI
    Sample rate: 24kHz

🎧 Sample Audio:
   [Play Sample 1] [Play Sample 2] [Play Sample 3]

🔗 Actions:
   [TTS Dashboard] [Provider Status] [Switch to Backup Voice] [Runbook]

👤 On-call: @voice-team

Implementing Production Observability

OpenTelemetry for Voice Agent Instrumentation

Use OpenTelemetry with GenAI semantic conventions for consistent observability:

from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("voice-agent")

async def process_turn(user_audio, context):
    with tracer.start_as_current_span(
        "voice_agent.turn",
        kind=SpanKind.SERVER
    ) as turn_span:
        turn_span.set_attribute("gen_ai.system", "voice_agent")
        turn_span.set_attribute("session.id", context.session_id)
        turn_span.set_attribute("turn.index", context.turn_count)

        # ASR span
        with tracer.start_as_current_span("asr.transcribe") as asr_span:
            transcript = await asr.transcribe(user_audio)
            asr_span.set_attribute("asr.provider", "deepgram")
            asr_span.set_attribute("asr.model", "nova-2")
            asr_span.set_attribute("asr.confidence", transcript.confidence)
            asr_span.set_attribute("asr.word_count", len(transcript.words))

        # LLM span
        with tracer.start_as_current_span("llm.generate") as llm_span:
            response = await llm.generate(transcript.text, context)
            llm_span.set_attribute("gen_ai.request.model", "gpt-4-turbo")
            llm_span.set_attribute("gen_ai.usage.prompt_tokens", response.prompt_tokens)
            llm_span.set_attribute("gen_ai.usage.completion_tokens", response.completion_tokens)
            llm_span.set_attribute("llm.ttfb_ms", response.time_to_first_token_ms)

        # TTS span
        with tracer.start_as_current_span("tts.synthesize") as tts_span:
            audio = await tts.synthesize(response.text)
            tts_span.set_attribute("tts.provider", "elevenlabs")
            tts_span.set_attribute("tts.voice_id", "alloy")
            tts_span.set_attribute("tts.character_count", len(response.text))
            tts_span.set_attribute("tts.ttfb_ms", audio.time_to_first_byte_ms)

        # Record turn-level metrics
        turn_span.set_attribute("turn.total_latency_ms", calculate_total_latency())

        return audio

Span-Level vs Trace-Level vs Session-Level Evaluation

| Level | What It Captures | Use Case | Alert On |
|---|---|---|---|
| Span | Single component (ASR call, LLM inference) | Component debugging | Component latency >2x baseline |
| Trace | Complete turn (user speaks → agent responds) | Turn-level quality | TTFW > threshold |
| Session | Full conversation (all turns) | End-to-end quality | Task completion, FCR |

Instrumentation Hierarchy:

Session (conversation_id)
├── Trace (turn_1)
│   ├── Span (asr.transcribe)
│   ├── Span (llm.generate)
│   │   └── Span (tool.call)
│   └── Span (tts.synthesize)
├── Trace (turn_2)
│   └── ...
└── Trace (turn_n)

Integrating Voice Metrics with Slack

Option 1: Webhook to Slack Incoming Webhook

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T00/B00/XXX"

def send_slack_alert(alert):
    payload = {
        "channel": alert.channel,
        "username": "Voice Agent Alerts",
        "icon_emoji": ":telephone:",
        "blocks": [
            {
                "type": "header",
                "text": {
                    "type": "plain_text",
                    "text": f"{alert.emoji} {alert.severity}: {alert.title}"
                }
            },
            {
                "type": "section",
                "fields": [
                    {"type": "mrkdwn", "text": f"*Metric:* {alert.metric}"},
                    {"type": "mrkdwn", "text": f"*Current:* {alert.current_value}"},
                    {"type": "mrkdwn", "text": f"*Threshold:* {alert.threshold}"},
                    {"type": "mrkdwn", "text": f"*Duration:* {alert.duration}"}
                ]
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "View Dashboard"},
                        "url": alert.dashboard_url
                    },
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Acknowledge"},
                        "action_id": f"ack_{alert.id}"
                    }
                ]
            }
        ]
    }

    requests.post(SLACK_WEBHOOK_URL, json=payload)

Option 2: Datadog/Grafana → Slack

# Datadog monitor configuration
name: "Voice Agent TTFW P99 Critical"
type: metric alert
query: "p99:voice_agent.ttfw{env:production} > 8000"
message: |
  {{#is_alert}}
  @slack-voice-oncall

  TTFW P99 exceeded 8s threshold.

  Current: {{value}}ms
  Host: {{host.name}}
  Region: {{region.name}}

  [View Dashboard](https://app.datadoghq.com/dashboard/xxx)
  {{/is_alert}}
options:
  notify_no_data: true
  evaluation_delay: 60

Alert Configuration Best Practices

Defining Alert Severity Levels

| Severity | Response Time | Criteria | Notification Channels |
|---|---|---|---|
| SEV-0 | <5 min | Compliance violation, complete outage, data breach | PagerDuty + Slack #voice-oncall + Phone |
| SEV-1 | <15 min | Revenue impact, >20% users affected, critical metric breach | PagerDuty + Slack #voice-oncall |
| SEV-2 | <1 hour | Performance degradation, <20% users affected | Slack #voice-alerts |
| INFO | Next business day | Threshold warning, trend alert | Slack #voice-metrics, Email digest |

Severity Classification Matrix:

| Metric | SEV-2 | SEV-1 | SEV-0 |
|---|---|---|---|
| TTFW P99 | >7s | >10s | >15s |
| Task Completion | <80% | <70% | <50% |
| WER | >12% | >18% | >25% |
| Tool Call Failures | >2% | >5% | >15% |
| Compliance Violations | — | — | Any |
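
A sketch of turning the matrix into a classifier that alert rules can share; the metric keys and the direction handling are illustrative (lower-is-worse metrics such as task completion need the comparison inverted):

# Thresholds from the severity classification matrix above (checked most severe first)
SEVERITY_MATRIX = {
    "ttfw_p99_s":       [("SEV-0", 15), ("SEV-1", 10), ("SEV-2", 7)],
    "wer_pct":          [("SEV-0", 25), ("SEV-1", 18), ("SEV-2", 12)],
    "tool_failure_pct": [("SEV-0", 15), ("SEV-1", 5), ("SEV-2", 2)],
}

def classify_severity(metric, value):
    """Return the highest severity whose threshold the value exceeds, or None if healthy."""
    for severity, threshold in SEVERITY_MATRIX.get(metric, []):
        if value > threshold:
            return severity
    return None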

Preventing Alert Fatigue

Alert fatigue causes teams to ignore critical alerts. Implement these strategies:

1. Correlation Rules. Group related alerts into single incidents:

correlation_rules:
  - name: "Latency cascade"
    group_by: [region, time_window]
    window: 5m
    alerts: [ttfw_warning, llm_latency, tts_latency]
    output_severity: max(input_severities)

2. Intelligent Throttling. Limit repeat alerts for ongoing incidents:

throttling:
  cooldown_period: 30m  # Don't re-alert for 30 min
  max_alerts_per_hour: 5
  escalate_if_sustained: 60m  # Escalate if still firing after 1 hour

3. Risk-Based Prioritization. Score alerts by business impact:

from datetime import datetime

# Example severity weights (assumption; tune to your incident process)
SEVERITY_SCORES = {"SEV-0": 100, "SEV-1": 50, "SEV-2": 20, "INFO": 5}

def calculate_alert_priority(alert):
    base_score = SEVERITY_SCORES[alert.severity]

    # Multiply by affected users
    user_multiplier = min(alert.affected_users / 100, 5)

    # Multiply by time of day (peak hours = higher priority)
    hour = datetime.now().hour
    time_multiplier = 1.5 if 9 <= hour <= 17 else 1

    return base_score * user_multiplier * time_multiplier

Expected Results:

  • 60-80% reduction in total alerts
  • Zero missed critical incidents
  • less than 10% false positive rate

Dynamic vs Static Thresholds

| Threshold Type | When to Use | Example |
|---|---|---|
| Static | Hard limits, compliance requirements | Latency >3s is always bad |
| Dynamic (baseline %) | Metrics with natural variance | WER >20% above 7-day baseline |
| Dynamic (ML anomaly) | Complex patterns, seasonal variation | Unusual traffic patterns |

Implementation:

def evaluate_threshold(metric, value, threshold_config):
    if threshold_config.type == "static":
        return value > threshold_config.value

    elif threshold_config.type == "baseline_percent":
        baseline = get_baseline(metric, threshold_config.baseline_window)
        deviation = (value - baseline) / baseline * 100
        return deviation > threshold_config.percent

    elif threshold_config.type == "anomaly":
        # Use ML model trained on historical data
        return anomaly_detector.is_anomaly(metric, value)

Alert Payload Design for Quick Triage

Every alert should answer these questions in less than 10 seconds:

| Question | Alert Field | Example |
|---|---|---|
| How bad is it? | Severity + emoji | 🚨 SEV-1 |
| What broke? | Metric name + component | TTFW P99, LLM component |
| How bad exactly? | Current value vs threshold | 1,340ms (threshold: 1,000ms) |
| Is it getting worse? | Trend indicator | ↑ 340ms from baseline |
| How many affected? | User/call count | 47 calls in last 15 min |
| What do I do? | Runbook link | [Runbook] |
| Who else knows? | On-call mention | @voice-team |

Slack Alert Templates and Examples

Template 1: Multi-Component Pipeline Failure

🚨 SEV-1 ALERT: Voice Agent Pipeline Degradation

📊 Multiple Components Affected:
   ┌─────────┬─────────┬─────────┬─────────┐
   │   ASR   │   LLM   │  Tools  │   TTS   │
   │   ✅    │   ⚠️    │   ❌    │   ✅    │
   │  195ms  │ 1200ms  │ FAILING │  150ms  │
   └─────────┴─────────┴─────────┴─────────┘

🔍 Root Cause Analysis:
    Tool "lookup_customer" returning 503 errors
    LLM retrying tool calls, increasing latency
    Cascade effect: TTFW ↑ 180%

📈 Impact:
    Task completion: 45% (baseline: 87%)
    Affected calls: 234 in last 30 min
    Estimated revenue impact: ~$1,200

🔧 Suggested Actions:
    Check customer lookup API status
    Consider enabling fallback tool
    If it persists more than 15 min, switch to degraded mode

🔗 [Incident Dashboard] [API Status] [Enable Fallback] [Runbook]

📞 Affected Call IDs:
   call_abc123, call_def456, call_ghi789 (+ 231 more)

👤 On-call: @voice-team @backend-team

Template 2: Geographic Latency Degradation

⚠️ SEV-2 ALERT: Regional Latency Anomaly

📊 Metric: ttfw_p95 by region

   Region      Current    Baseline    Status
   ─────────────────────────────────────────
   us-east-1   680ms      720ms        Normal
   us-west-2   1,450ms    710ms        +104%
   eu-west-1   695ms      705ms        Normal
   ap-south-1  890ms      920ms        Normal

🔍 us-west-2 Analysis:
    Affected since: 14:23 UTC (47 minutes)
    ASR latency: Normal (185ms)
    LLM latency: 980ms (↑ 420ms)
    Network: Normal (32ms RTT)

📍 Probable Cause:
    LLM provider edge node degradation in us-west-2
    OpenAI status page shows elevated latency

🔧 Options:
    Route us-west-2 traffic to us-east-1 (adds ~40ms network)
    Wait for provider resolution
    Switch to backup LLM provider

🔗 [Regional Dashboard] [Provider Status] [Traffic Routing] [Runbook]

👤 On-call: @voice-team

Template 3: Compliance and Policy Violations

🚨 SEV-0 ALERT: Compliance Violation Detected

⚠️ Type: PII Exposure in Agent Response

📝 Incident Details:
   Call ID: call_xyz789
   Timestamp: 2026-01-27 14:32:17 UTC
   Agent Version: v2.5.0

🔒 Violation:
   Agent disclosed another customer's account number
   in response to identity verification question.

📄 Transcript Excerpt:
   User: "Can you confirm my account number?"
   Agent: "Your account number is 4859... wait, I see
          account 7734 as well. Which one?"  ← VIOLATION

🔍 Root Cause (Preliminary):
   • Context window contained previous caller's data
   • Session isolation failed between calls

🚫 Immediate Actions Taken:
    Call flagged for compliance review
    Similar calls in last hour being audited
    Agent isolated pending investigation

🔗 [Compliance Dashboard] [Call Recording] [Incident Ticket] [Runbook]

👤 Required Response:
   @compliance-team @security-team @voice-team

   ⚠️ Response required within 15 minutes
   React with 🔒 to confirm investigation started

Template 4: Tool Call and Intent Failure Spikes

⚠️ SEV-2 ALERT: Intent Classification Failures Elevated

📊 Metrics (1-hour window):
   Intent Accuracy: 84.2% (threshold: 90%)
   Fallback Rate:   23.1% (baseline: 8.4%)

🔍 Confusion Matrix (Top Errors):

   Actual Intent     Misclassified As    Count    %
   ─────────────────────────────────────────────────
   cancel_order      track_order         47      18.2%
   refund_request    return_status       31      14.8%
   billing_help      account_inquiry     28      12.1%

📝 Sample Misclassifications:
    "I want to cancel this order"
     Expected: cancel_order
     Classified: track_order (confidence: 0.62)

    "Can I get my money back for this?"
     Expected: refund_request
     Classified: return_status (confidence: 0.58)

📈 Downstream Impact:
    Task completion: 71% (baseline: 87%)
    User repeated themselves: 34% of calls
    Escalation rate: ↑ 2.1x

🔧 Possible Causes:
    Recent ASR model update (WER normal)
    Prompt version v2.5.0 deployed 2 hours ago ← LIKELY
    New user traffic patterns (no evidence)

🔗 [Intent Dashboard] [Prompt Comparison] [Rollback v2.5.0] [Runbook]

👤 On-call: @voice-team @ml-team

Monitoring Integration Architecture

Connecting Hamming Metrics to Slack

Hamming provides 50+ built-in voice agent metrics with native Slack integration:

# Hamming alert configuration
alerts:
  - name: "TTFW P99 Critical"
    metric: ttfw_p99
    condition: "> 8s"
    duration: "5m"
    severity: SEV-1
    channels:
      - slack: "#voice-oncall"
    include:
      - component_breakdown
      - sample_calls: 3
      - runbook_link: "https://docs.company.com/runbooks/ttfw"

  - name: "ASR Drift Detection"
    metric: wer
    condition: "> baseline * 120%"
    baseline_window: "7d"
    evaluation_window: "100 calls"
    severity: SEV-2
    channels:
      - slack: "#voice-alerts"
    include:
      - sample_transcripts: 5
      - segment_breakdown: [device_type, region]

Hamming's Unique Capabilities:

  • Turn-level latency breakdown (ASR/LLM/TTS)
  • Automated call replay links in alerts
  • Intent accuracy with confusion matrices
  • Prompt version comparison
  • One-click regression testing from alerts

Routing Alerts to the Right Teams

| Alert Category | Primary Team | Secondary | Slack Channel |
|---|---|---|---|
| Latency (TTFW) | Voice Platform | SRE | #voice-oncall |
| ASR/WER | ML Engineering | Voice Platform | #voice-alerts → #ml-oncall |
| LLM/Prompt | Prompt Engineering | ML Engineering | #voice-oncall → #prompt-eng |
| VoIP/Network | Infrastructure | Voice Platform | #infra-oncall → #voice-oncall |
| Tools/APIs | Backend | Voice Platform | #backend-oncall → #voice-alerts |
| Compliance | Security | Legal | #security-oncall + #compliance |
| Business (FCR) | Voice Platform | Product | #voice-alerts → #product |

Escalation Policy:

escalation:
  - level: 1
    delay: 0
    notify: ["slack:#voice-oncall"]

  - level: 2
    delay: 15m
    condition: "not acknowledged"
    notify: ["pagerduty:voice-primary"]

  - level: 3
    delay: 30m
    condition: "not acknowledged"
    notify: ["pagerduty:voice-secondary", "slack:#engineering-leadership"]

Alert Acknowledgment and Incident Workflows

Implement interactive Slack buttons for rapid response:

# Slack interactive message handler (async Slack Bolt app)
from datetime import datetime

@slack_app.action("acknowledge_alert")
async def handle_acknowledge(ack, body, client):
    await ack()

    alert_id = body["actions"][0]["value"]
    user = body["user"]["username"]

    # Update alert status and keep the record for the severity check below
    alert = await alerts.acknowledge(alert_id, user)

    # Update Slack message
    await client.chat_update(
        channel=body["channel"]["id"],
        ts=body["message"]["ts"],
        blocks=[
            *body["message"]["blocks"],
            {
                "type": "context",
                "elements": [{
                    "type": "mrkdwn",
                    "text": f"✅ Acknowledged by @{user} at {datetime.now()}"
                }]
            }
        ]
    )

    # Create incident ticket if SEV-1+
    if alert.severity in ["SEV-0", "SEV-1"]:
        await pagerduty.create_incident(alert)

@slack_app.action("trigger_rollback")
async def handle_rollback(ack, body, client):
    await ack()

    # Confirm before rollback
    await client.views_open(
        trigger_id=body["trigger_id"],
        view={
            "type": "modal",
            "title": {"type": "plain_text", "text": "Confirm Rollback"},
            "submit": {"type": "plain_text", "text": "Rollback Now"},
            "blocks": [
                {
                    "type": "section",
                    "text": {
                        "type": "mrkdwn",
                        "text": "This will rollback to prompt version v2-4-1. Continue?"
                    }
                }
            ]
        }
    )

Frequently Asked Questions

What are the most critical metrics to monitor for production voice agents?

The most critical metrics for voice agent production are P99 latency (TTFW), ASR Word Error Rate (WER), interruption rates, Time to First Audio (TTFA), tool call success rates, and compliance violations. P99 latency (TTFW) should stay under 7 seconds, WER under 8% for clean audio, interruption rates under 15%, and tool call success above 99%. These metrics predict user frustration and task failure before aggregate metrics like CSAT show problems.

How do you detect ASR drift in production?

Detect ASR drift by monitoring Word Error Rate (WER) across rolling 100-call windows and comparing against a 7-day baseline. Alert when WER increases more than 5% from baseline. Cross-reference deviations against your deployment log—if no code deployed, suspect a provider-side model update. Also track WER by user segments (accent, background noise, device type) to identify systematic failures affecting specific populations.

What latency thresholds should trigger voice agent alerts?

Set P95 TTFW (Time to First Word) alerts at >6s as warning and >7s as critical, and P99 alerts at >8s warning and >10s critical, matching the percentile table above and the practical boundary where responses beyond about 3 seconds start to feel broken. These percentile thresholds catch the worst user experiences that drive complaints, which averages would hide.

How do you prevent alert fatigue for voice agent monitoring?

Prevent alert fatigue using four strategies: correlation rules that group related alerts into single incidents, intelligent throttling with 15-30 minute cooldowns between repeat alerts, severity-based routing to different Slack channels (SEV-1 to #voice-oncall, SEV-2 to #voice-alerts), and risk-based prioritization that scores alerts by affected user count and business hours. These strategies typically reduce alert volume by 60-80% while maintaining zero missed critical incidents.

What is the difference between latency and jitter for voice agents?

Latency measures the absolute delay (round-trip time) for packets, while jitter measures the variation in that delay. For voice agents, consistent latency is manageable but inconsistent delay (jitter) causes choppy audio and ASR failures. Target latency under 150ms RTT and jitter under 30ms. Jitter above 50ms causes 10-20% increase in Word Error Rate because audio frames arrive out of order.

How do you test for prompt regressions before they reach users?

Implement prompt regression testing using LLM-as-a-Judge evaluators that score agent responses against predefined rubrics in CI/CD pipelines. Run evaluations on held-out test datasets before every deployment and on 5% of production calls asynchronously. Compare scores across prompt versions and alert when overall score drops more than 10% from baseline. Version prompts like code with embedded test suites and track lineage across updates.

Which TTS quality metrics should be monitored in production?

Monitor TTS Mean Opinion Score (MOS) with target above 4.0 and alert threshold at 3.5, Time to First Byte (TTFB) under 200ms with alert at 400ms, and TTS WER (accuracy when ASR transcribes TTS output) under 3% with alert at 5%. Also track audio artifacts like clicks, pops, and unnatural pauses. TTS WER above 5% indicates pronunciation issues where users hear something different than intended.

How should Slack alerts be structured for fast triage?

Structure Slack alerts to answer key triage questions in under 10 seconds: severity level with emoji (how bad), metric name and component (what broke), current value versus threshold (how bad exactly), trend from baseline (getting worse?), affected user/call count (scope), runbook link (what to do), and on-call mention (who knows). Include interactive buttons for acknowledge, snooze, and escalate actions. Use dedicated channels by severity: #voice-oncall for SEV-0/1, #voice-alerts for SEV-2.

How should voice agents be instrumented with OpenTelemetry?

Instrument voice agents using GenAI semantic conventions with distinct span levels: span-level for individual components (ASR calls, LLM inference), trace-level for complete conversation turns (user speaks to agent responds), and session-level for full multi-turn dialogues. Set attributes including gen_ai.system, model identifiers, voice_id, TTFB measurements, and token counts. Export traces to unified backends that support correlation across the full pipeline.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, grew AI-powered sales program to 100s of millions in revenue per year.

Researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”