Set up Slack alerts to catch voice agent failures before users notice them.
Voice agents fail silently. A latency spike that adds 800ms to every response. An ASR model update that drops transcription accuracy by 12%. A prompt change that causes the agent to ignore safety guardrails. Users notice immediately—they hang up, complain, or switch to competitors. Your dashboard might not show problems for hours.
Slack alerts close this gap. When configured correctly, they detect failures within 60 seconds and route actionable context to the right team. This guide covers the complete alerting stack for production voice agents: what to monitor, where to set thresholds, and how to structure alerts that enable fast triage.
What you'll get from this guide:
- A master reference table mapping alert types to thresholds, severity levels, and Slack routing
- Copy-paste Slack message templates for latency, ASR drift, jitter, and prompt regressions
- Implementation patterns for webhooks, OpenTelemetry, and monitoring integrations
- Noise control strategies that reduce alert fatigue by 60-80%
Related Guides:
- Voice Agent Monitoring KPIs — 10 production metrics with formulas and benchmarks
- Voice Agent Troubleshooting — VoIP diagnostics and ASR/LLM/TTS debugging
- Voice Agent Monitoring Platform Guide — 4-Layer monitoring architecture
Understanding Voice Agent Monitoring Requirements
The Voice Agent Stack: What Needs Monitoring
Voice agents are multi-component pipelines where failures cascade. Each layer requires specific metrics and alert thresholds:
VOICE AGENT PIPELINE

[Network/VoIP] → [ASR/STT] → [LLM] → [Tools] → [TTS] → [Network/VoIP]

  Network/VoIP (inbound):   Jitter, Packet Loss, MOS
  ASR/STT:                  WER, Confidence, Latency
  LLM:                      Latency, TTFB, Tokens
  Tools:                    Success Rate, Latency
  TTS:                      MOS, TTFB, Quality
  Network/VoIP (outbound):  Packet Loss
Each component has distinct failure modes:
- Network/VoIP: Jitter >30ms causes choppy audio; packet loss >1% drops words
- ASR/STT: WER drift degrades intent accuracy; confidence drops signal model issues
- LLM: Latency spikes break conversational flow; token overflows cause truncation
- Tools: API failures block task completion; timeouts cascade through the pipeline
- TTS: Quality degradation (MOS <3.5) frustrates users; latency adds to total delay
Why Traditional APM Falls Short for Voice Agents
Generic APM tools (Datadog, New Relic, Grafana) excel at infrastructure monitoring but miss voice-specific failures:
| What APM Tracks | What Voice Agents Need | Gap |
|---|---|---|
| Server CPU/memory | Turn-level latency | APM sees averages; voice needs P99 per turn |
| API response times | Component breakdown (STT/LLM/TTS) | APM shows total; voice needs each stage |
| Error rates | Intent accuracy, WER | APM counts errors; voice needs quality metrics |
| Uptime | Conversation completion | APM sees calls connect; voice needs task success |
| Request latency | Time to First Word (TTFW) | APM measures server; voice measures user perception |
The result: APM dashboards show green while users experience broken conversations. A healthcare company tracked 99.9% uptime and less than 200ms API latency while their CSAT dropped 15%—because they weren't measuring turn-level latency, intent accuracy, or context retention.
Critical Metrics for Production Voice Agents
These are the metrics that predict voice agent failure. Alert on these, not infrastructure proxies:
| Metric | Definition | Target | Why It Matters |
|---|---|---|---|
| P99 Latency (TTFW) | 99th-percentile time from when the user stops speaking to when the agent starts speaking | <7s | 1% of users experience the worst case; they remember it |
| WER | Word Error Rate: (substitutions + deletions + insertions) / total words | <8% | Every 5% WER increase reduces intent accuracy by ~10% |
| Interruption Rate | % of turns where the user interrupts the agent | <15% | High rates indicate latency or relevance problems |
| TTFA | Time to first audio byte from TTS | <200ms | Streaming TTS responsiveness |
| Tool Call Success | % of tool invocations returning expected results | >99% | Failed tools = failed tasks |
| Intent Accuracy | % of correctly classified user intents | >95% | Foundation for correct agent behavior |
| MOS Score | Mean Opinion Score (1-5) for audio quality | >4.0 | Below 3.5, ASR accuracy degrades significantly |
Voice Agent Alert Master Table
This reference table maps every alert type to thresholds, severity, and routing. Use it to configure your alerting system.
| Alert Type | Category | Trigger Definition | Starter Threshold | Severity | Slack Channel | Include in Alert |
|---|---|---|---|---|---|---|
| Latency Spike | LLM | P95 TTFW exceeds threshold | >6s warning, >7s critical | SEV-1 (critical), SEV-2 (warning) | #voice-oncall | Component breakdown, sample calls |
| ASR Drift | ASR | WER increases from baseline | >5% increase over 100-call window | SEV-2 | #voice-alerts | Baseline comparison, sample transcripts |
| Jitter Spike | VoIP | Jitter exceeds acceptable variance | >30ms warning, >50ms critical | SEV-2 | #voice-alerts | MOS impact, affected regions |
| Packet Loss | VoIP | RTP packet loss rate | >1% warning, >3% critical | SEV-1 | #voice-oncall | Duration, affected calls |
| Prompt Regression | LLM | Prompt version scores drop | >10% drop from baseline | SEV-1 | #voice-oncall | Version comparison, failure examples |
| TTS Quality | TTS | MOS score degrades | <4.0 warning, <3.5 critical | SEV-2 | #voice-alerts | Audio samples, provider status |
| Intent Accuracy | ASR/LLM | Classification accuracy drops | <92% warning, <90% critical | SEV-1 | #voice-oncall | Confusion matrix, sample failures |
| Tool Failures | Integration | Tool call error rate | >2% warning, >5% critical | SEV-1 | #voice-oncall | Failed tool, error codes, affected tasks |
| Task Completion | Business | End-to-end task success rate | <80% warning, <70% critical | SEV-1 | #voice-oncall | Failed tasks, drop-off points |
| Compliance Violation | Safety | Policy or PII violation detected | Any occurrence | SEV-0 | #voice-oncall + #compliance | Conversation link, violation type |
| Dead Air | Quality | Silence >3s during conversation | >2 occurrences per call | SEV-2 | #voice-alerts | Timestamp, preceding context |
| Escalation Spike | Business | Transfer to human rate | >2x baseline | SEV-2 | #voice-alerts | Escalation reasons, affected intents |
Legend:
- SEV-0: Immediate response required (<5 min), pages on-call
- SEV-1: Urgent response (<15 min), posts to #voice-oncall
- SEV-2: Standard response (<1 hour), posts to #voice-alerts
Latency Monitoring and Alerting
Defining Latency Thresholds for Voice AI
Human conversation has a natural rhythm: responses within 1 second feel natural, 1-2 seconds feels acceptable, and anything above 3 seconds starts to feel broken. Real-world production analysis backs these boundaries, so set your alert thresholds around them:
| Percentile | Target | Warning Threshold | Critical Threshold | User Impact |
|---|---|---|---|---|
| P50 | <1.5s | >2s | >2.5s | Median user experience |
| P90 | <3s | >4s | >5s | Most users affected |
| P95 | <5s | >6s | >7s | Significant degradation |
| P99 | <7s | >8s | >10s | Worst experiences, high frustration |
Why P99 matters more than averages: A P50 of 1.5s with a P99 of 10s means the median user has an acceptable experience while 1% of callers face significant delays. At 10,000 calls/day, that's 100 frustrated users daily. They remember.
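If your metrics pipeline doesn't already expose percentiles, they're cheap to compute from raw per-turn TTFW samples. A minimal sketch of the threshold check, assuming you collect latencies in milliseconds (the function names and nearest-rank method are illustrative, not tied to any particular platform):
# Evaluate latency percentiles against the thresholds in the table above (values in ms).
def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; precise enough for alert evaluation."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

def check_latency_thresholds(turn_latencies_ms: list[float]) -> list[str]:
    # (warning, critical) pairs per percentile, mirroring the table above
    thresholds = {"p50": (2000, 2500), "p90": (4000, 5000),
                  "p95": (6000, 7000), "p99": (8000, 10000)}
    breaches = []
    for name, (warn, crit) in thresholds.items():
        value = percentile(turn_latencies_ms, float(name[1:]))
        if value > crit:
            breaches.append(f"{name} = {value:.0f}ms (critical threshold {crit}ms)")
        elif value > warn:
            breaches.append(f"{name} = {value:.0f}ms (warning threshold {warn}ms)")
    return breaches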
Component-Level Latency Breakdown
End-to-end latency is the sum of its parts. Instrument each component to identify bottlenecks:
| Component | Typical Range | Target | % of Total | Common Bottleneck |
|---|---|---|---|---|
| Network (inbound) | 50-100ms | <80ms | 2-3% | Geographic distance, mobile networks |
| ASR/STT | 300-600ms | <500ms | 10-15% | Model size, streaming vs batch |
| LLM Inference | 1.5-4s | <3s | 60-70% | Model size, prompt length, provider load |
| Tool Execution | 200-600ms | <400ms | 8-12% | API latency, database queries |
| TTS | 200-400ms | <300ms | 5-10% | Voice complexity, streaming setup |
| Network (outbound) | 50-100ms | <80ms | 2-3% | Same as inbound |
Example breakdown for 3.5s total latency:
ASR: 450ms (13%)
LLM: 2,400ms (69%)
Tools: 350ms (10%)
TTS: 300ms (8%)
─────────────────────
Total: 3,500ms
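An alert is only actionable if it ships with this breakdown. A small sketch that produces it from per-component timings, assuming you record them for each turn (the report format and the "likely bottleneck" heuristic are illustrative):
# Summarize per-component latency for a single turn and name the dominant contributor.
def breakdown_report(components_ms: dict[str, float]) -> str:
    total = sum(components_ms.values())
    lines = [f"{name}: {ms:,.0f}ms ({ms / total:.0%})"
             for name, ms in sorted(components_ms.items(), key=lambda kv: -kv[1])]
    worst = max(components_ms, key=components_ms.get)
    lines.append(f"Total: {total:,.0f}ms (likely bottleneck: {worst})")
    return "\n".join(lines)

print(breakdown_report({"ASR": 450, "LLM": 2400, "Tools": 350, "TTS": 300}))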
Slack Alert Template: Latency Spike Detection
🚨 SEV-1 ALERT: Voice Agent Latency Critical
📊 Metric: ttfw_p99
Current: 9.2s (threshold: 8s)
Baseline (24h): 5.8s
Duration: 8 minutes
Affected: 12% of calls (47 of 391)
🔍 Component Breakdown:
• Network In: 85ms (normal)
• ASR: 450ms (normal)
• LLM: 7,820ms (↑ 4,200ms - LIKELY CAUSE)
• Tools: 380ms (normal)
• TTS: 465ms (normal)
📍 Context:
• Region: us-east-1
• Model: gpt-4-turbo
• Prompt Version: v2-4-1
🔗 Actions:
[View Dashboard] [Sample Calls] [LLM Provider Status] [Runbook]
👤 On-call: @voice-team
React with ✅ to acknowledge
Detecting Latency Regressions Across Deployments
Latency often regresses after deployments—new prompt versions, model updates, or infrastructure changes. Automate detection:
CI/CD Integration:
# .github/workflows/voice-agent-deploy.yml
- name: Run latency baseline test
run: |
hamming test run --suite latency-baseline --wait
- name: Compare against baseline
run: |
CURRENT_P99=$(hamming metrics get ttfw_p99 --last-hour)
BASELINE_P99=$(hamming metrics get ttfw_p99 --baseline)
# bc -l keeps decimal precision in the 20% regression comparison
if [ "$(echo "$CURRENT_P99 > $BASELINE_P99 * 1.2" | bc -l)" -eq 1 ]; then
  echo "::error::P99 latency regressed by >20%"
exit 1
fi
Post-Deploy Monitoring Checklist:
- P99 latency within 20% of pre-deploy baseline after 15 minutes
- No component showing >50% latency increase
- LLM token usage within expected range (longer prompts = slower inference)
- No new timeout errors in tool calls
ASR Drift and Quality Monitoring
Measuring ASR Performance with WER
Word Error Rate (WER) is the standard metric for ASR accuracy:
WER = (Substitutions + Deletions + Insertions) / Total Words × 100
Example:
Reference: "I need to check my account balance today"
ASR Output: "I need to check my count balance"
Substitutions: 1 ("account" → "count")
Deletions: 1 ("today" missing)
Insertions: 0
WER = (1 + 1 + 0) / 8 × 100 = 25%
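A self-contained WER sketch using word-level edit distance, so you can reproduce the number above without assuming any particular evaluation library:
# Word Error Rate via word-level Levenshtein distance (substitutions + deletions + insertions).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits needed to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref) * 100

print(wer("I need to check my account balance today",
          "I need to check my count balance"))  # 25.0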
WER Benchmarks by Condition:
| Condition | Good | Acceptable | Poor | Action Required |
|---|---|---|---|---|
| Clean audio, native speaker | <5% | <8% | >10% | Investigate ASR provider |
| Background noise | <10% | <15% | >18% | Enable noise suppression |
| Non-native accent | <12% | <18% | >22% | Consider accent-aware models |
| Domain terminology | <8% | <12% | >15% | Add custom vocabulary |
Detecting Silent Model Updates and API Changes
ASR providers update models without notice. These "silent updates" can shift WER by 5-15% overnight. Detect them by monitoring WER trends independent of your code deployments:
Detection Strategy:
- Maintain a 7-day rolling WER baseline
- Alert when the current 100-call WER deviates by more than 5% from that baseline
- Cross-reference against your deployment log—if no deploy, suspect provider change
# Pseudo-code for drift detection
def check_asr_drift(current_wer, baseline_wer, deploy_log):
deviation = (current_wer - baseline_wer) / baseline_wer * 100
if deviation > 5: # 5% threshold
recent_deploy = deploy_log.has_deploy_in_last_hours(4)
if not recent_deploy:
alert(
severity="SEV-2",
message="ASR drift detected without deployment",
context={
"current_wer": current_wer,
"baseline_wer": baseline_wer,
"deviation_pct": deviation,
"likely_cause": "Provider model update"
}
)
Slack Alert Template: ASR Accuracy Degradation
⚠️ SEV-2 ALERT: ASR Drift Detected
📊 Metric: asr_wer
Current (100-call window): 13.8%
Baseline (7-day): 7.2%
Deviation: +92%
Duration: 45 minutes
🔍 No Recent Deployments
Last deploy: 3 days ago (v2.3.8)
Likely cause: Provider model update
📝 Sample Transcription Errors:
• "cancel my subscription" → "cancel mice subscription"
• "account number 4859" → "account number four eight five nine"
• "schedule for Tuesday" → "schedule for today"
📈 Segment Analysis:
• Mobile callers: 18.2% WER (↑ from 9.1%)
• Landline: 8.4% WER (normal)
• With background noise: 22.1% WER (↑ from 12.3%)
🔗 Actions:
[Audio Samples] [WER Trend Dashboard] [Contact ASR Provider] [Runbook]
👤 On-call: @voice-team
Monitoring Transcription Quality by User Segment
Aggregate WER hides systematic failures. Slice by segments to identify affected populations:
| Segment | How to Identify | Common Issues | Alert Threshold |
|---|---|---|---|
| Accent | User profile, phone number region | Model bias toward standard accents | >2x overall WER |
| Background Noise | Audio analysis, SNR measurement | Office, traffic, wind | >150% of clean-audio WER |
| Domain Terms | Entity extraction failures | Product names, medical terms | >3x general-vocabulary WER |
| Device Type | User agent, call metadata | Mobile compression, Bluetooth | >130% of landline WER |
| Time of Day | Timestamp | Network congestion patterns | Variance >20% from daily mean |
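To put those thresholds into practice, group recently scored calls by segment and compare each segment's WER to a multiple of the overall rate. A sketch, assuming each scored call carries a segment label and a measured WER (the data shape, multipliers, and minimum sample size are illustrative):
from collections import defaultdict

# calls: list of dicts like {"segment": "mobile", "wer": 14.2}
def segment_wer_breaches(calls: list[dict], overall_wer: float,
                         multipliers: dict[str, float], min_calls: int = 50) -> dict:
    """Return segments whose average WER exceeds their allowed multiple of the overall WER."""
    by_segment = defaultdict(list)
    for call in calls:
        by_segment[call["segment"]].append(call["wer"])
    breaches = {}
    for segment, wers in by_segment.items():
        if len(wers) < min_calls:
            continue  # skip segments with too few calls to be meaningful
        segment_wer = sum(wers) / len(wers)
        limit = overall_wer * multipliers.get(segment, 2.0)
        if segment_wer > limit:
            breaches[segment] = {"wer": segment_wer, "limit": limit}
    return breaches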
Jitter and Network Quality Alerting
Jitter vs Latency: Why Both Matter
Latency is the absolute delay between sending and receiving audio packets. Jitter is the variation in that delay. For voice agents, consistent delay is manageable; inconsistent delay breaks audio.
| Metric | What It Measures | Impact on Voice Agents | Target |
|---|---|---|---|
| Latency (RTT) | Round-trip time for packets | Adds to total TTFW | <150ms |
| Jitter | Variance in packet arrival | Causes choppy audio, ASR errors | <30ms |
| Packet Loss | % of packets that don't arrive | Missing words, gaps in audio | <1% |
| MOS | Mean Opinion Score (predicted) | Overall call quality | >4.0 |
How jitter affects ASR:
- Jitter >30ms: Audio frames arrive out of order, requiring larger jitter buffers
- Jitter >50ms: Noticeable audio choppiness, WER increases 10-20%
- Jitter >100ms: Severe audio distortion, ASR essentially fails
Real-Time Jitter Detection
Monitor jitter at the network edge, not just in aggregate metrics:
Measurement Points:
- WebRTC stats: RTCInboundRtpStreamStats.jitter (reported in seconds)
- RTP stream analysis: Calculate from sequence number gaps and timestamps
- Synthetic probes: Send test packets every 500ms, measure variance
// WebRTC jitter monitoring
const stats = await peerConnection.getStats();
stats.forEach(report => {
if (report.type === 'inbound-rtp' && report.kind === 'audio') {
const jitterMs = report.jitter * 1000;
if (jitterMs > 50) {
sendAlert({
severity: 'SEV-2',
metric: 'jitter',
value: jitterMs,
threshold: 50,
callId: currentCallId
});
}
}
});
Slack Alert Template: Network Quality Issues
⚠️ SEV-2 ALERT: Network Quality Degradation
📊 Metrics (30-second window):
Jitter: 67ms (threshold: 50ms)
Packet Loss: 2.8% (threshold: 1%)
MOS Score: 3.2 (threshold: 4.0)
🔍 Impact Assessment:
• Affected calls: 23 (in progress)
• ASR WER during incident: 18.4% (baseline: 7.2%)
• User interruptions: ↑ 340%
📍 Geographic Analysis:
• us-east-1: Normal (jitter 18ms)
• us-west-2: AFFECTED (jitter 67ms)
• eu-west-1: Normal (jitter 22ms)
🔧 Probable Cause:
• ISP routing change detected at 14:32 UTC
• Affects Comcast residential in California region
🔗 Actions:
[Network Dashboard] [Affected Calls] [ISP Status] [Runbook]
👤 On-call: @infrastructure-team @voice-team
Prompt Regression Detection
Why Voice Agent Prompts Regress
Prompt regressions happen without code changes. Research shows 58.8% of prompt+model combinations experience accuracy drops when underlying LLM APIs update. Common causes:
| Cause | Detection Method | Frequency |
|---|---|---|
| LLM API updates | Monitor model version in response headers | Monthly |
| Prompt interaction effects | A/B test prompt versions continuously | Per prompt change |
| Training data drift | Track performance on held-out test set | Quarterly |
| Context window changes | Monitor truncation errors | Per model update |
| Temperature/parameter shifts | Log all inference parameters | Continuous |
Setting Up LLM-as-a-Judge Evaluators
Use a scoring LLM to evaluate agent responses against rubrics. Run in CI/CD and production:
# LLM-as-a-Judge evaluation
import json

EVAL_PROMPT = """
You are evaluating a voice agent's response quality.
User utterance: {user_utterance}
Agent response: {agent_response}
Expected behavior: {expected_behavior}
Score the response on these dimensions (1-5 each):
• Relevance: Does it address the user's need?
• Accuracy: Is the information correct?
• Compliance: Does it follow the prompt instructions?
• Tone: Is it appropriate for a voice conversation?
Return JSON: {{"relevance": N, "accuracy": N, "compliance": N, "tone": N, "overall": N}}
"""

def evaluate_response(user_utterance, agent_response, expected_behavior):
    # Doubled braces above keep the JSON example literal when .format() fills the template
    result = llm.complete(EVAL_PROMPT.format(
        user_utterance=user_utterance,
        agent_response=agent_response,
        expected_behavior=expected_behavior
    ))
    scores = json.loads(result)
    if scores["overall"] < 3:
        flag_for_review(user_utterance, agent_response, scores)  # your review queue
    return scores
Evaluation Triggers:
- CI/CD: Run on test dataset before every deployment
- Production sampling: Evaluate 5% of production calls async
- Regression detection: Compare scores across prompt versions
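The regression check itself is a few lines: compare the candidate version's judge scores to the stored baseline and flag any dimension that dropped past your tolerance. A sketch (the score format follows the evaluator above; the 10% default mirrors the master table):
# current/baseline: {"relevance": 3.4, "accuracy": 3.0, "compliance": 3.1, "tone": 3.5, "overall": 3.2}
def detect_prompt_regression(current: dict, baseline: dict,
                             max_drop_pct: float = 10.0) -> list[str]:
    """Return the scoring dimensions that regressed more than max_drop_pct vs the baseline."""
    regressed = []
    for dimension, base_score in baseline.items():
        current_score = current.get(dimension, 0.0)
        drop_pct = (base_score - current_score) / base_score * 100
        if drop_pct > max_drop_pct:
            regressed.append(f"{dimension}: {base_score:.1f} → {current_score:.1f} (-{drop_pct:.0f}%)")
    return regressed

Run it in CI against the prompt's test suite and again on the production sample; either source of scores can trigger the alert below.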
Slack Alert Template: Prompt Performance Degradation
🚨 SEV-1 ALERT: Prompt Regression Detected
📊 Prompt Version Comparison:
Current (v2.5.0): Score 3.2/5.0
Previous (v2.4.1): Score 4.1/5.0
Regression: -22%
🔍 Breakdown by Dimension:
• Relevance: 3.8 → 3.4 (-10%)
• Accuracy: 4.2 → 3.0 (-29%) ⚠️
• Compliance: 4.0 → 3.1 (-23%) ⚠️
• Tone: 4.3 → 3.5 (-19%)
📝 Failure Examples:
• Intent: cancel_subscription
Expected: Confirm cancellation, offer retention
Actual: Processed cancellation without retention offer
• Intent: billing_dispute
Expected: Acknowledge, gather details, escalate
Actual: Provided incorrect refund policy
📈 Traffic Impact:
• v2.5.0 receiving 25% of traffic (canary)
• 847 calls evaluated
• Recommend: ROLLBACK to v2.4.1
🔗 Actions:
[Rollback Now] [View Diff] [Sample Failures] [Runbook]
👤 On-call: @voice-team @prompt-eng
React with 🔙 to trigger rollback
Version Control and Baseline Management for Prompts
Treat prompts as versioned code artifacts, not strings in config files:
Best Practices:
- Version every change: "v2-4-1" not "updated prompt"
- Embed test suites: Each prompt version has associated test cases
- Track lineage: Know which base prompt each version derives from
- Store evaluation scores: Historical scores enable regression detection
# prompts/booking-agent/v2.5.0.yaml
version: "2.5.0"
parent_version: "2.4.1"
created_at: "2026-01-27T10:00:00Z"
author: "jane@company.com"
prompt: |
You are a booking assistant for [Company].
...
test_suite:
- intent: book_appointment
cases: 50
expected_score: 4.0
- intent: reschedule
cases: 30
expected_score: 4.2
baseline_scores:
overall: 4.1
relevance: 3.9
accuracy: 4.2
compliance: 4.0
tone: 4.3
TTS Quality and Naturalness Monitoring
TTS Metrics That Matter in Production
| Metric | Definition | Target | Alert Threshold |
|---|---|---|---|
| MOS | Mean Opinion Score (1-5) | >4.0 | <3.5 |
| TTFB | Time to first byte of audio | <200ms | >400ms |
| TTS WER | WER when ASR transcribes TTS output | <3% | >5% |
| Synthesis Latency | Total time to generate audio | <300ms | >500ms |
| Audio Artifacts | Clicks, pops, unnatural pauses | 0 per call | >2 per call |
Why TTS WER matters: If your ASR can't accurately transcribe what your TTS outputs, users hear something different than what you intended. TTS WER greater than 5% indicates pronunciation or audio quality issues.
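Measuring TTS WER is a round trip: synthesize the text, run the audio back through your ASR, and score the mismatch. A hedged sketch; tts and asr stand in for whichever provider clients you already use, and wer() is the helper sketched in the ASR section:
# Round-trip a sample of agent utterances through TTS and ASR; return the ones above threshold.
async def tts_quality_check(utterances: list[str], tts, asr,
                            wer_threshold: float = 5.0) -> list[tuple[str, float]]:
    failures = []
    for text in utterances:
        audio = await tts.synthesize(text)        # same TTS client the agent uses (assumed)
        transcript = await asr.transcribe(audio)  # same ASR client the agent uses (assumed)
        score = wer(text, transcript.text)        # wer() from the ASR section above
        if score > wer_threshold:
            failures.append((text, score))
    return failures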
Slack Alert Template: TTS Quality Degradation
⚠️ SEV-2 ALERT: TTS Quality Below Threshold
📊 Metrics (1-hour window):
MOS Score: 3.2 (threshold: 3.5)
TTS WER: 7.8% (threshold: 5%)
TTFB: 380ms (threshold: 400ms)
🔊 Audio Quality Issues:
• Detected artifacts: 3.2 per call average
• Unnatural pauses: 12% of utterances
• Pronunciation errors: "schedule" → "skedule" (47 occurrences)
📍 Affected Configuration:
• Voice: "alloy"
• Provider: OpenAI
• Sample rate: 24kHz
🎧 Sample Audio:
[Play Sample 1] [Play Sample 2] [Play Sample 3]
🔗 Actions:
[TTS Dashboard] [Provider Status] [Switch to Backup Voice] [Runbook]
👤 On-call: @voice-team
Implementing Production Observability
OpenTelemetry for Voice Agent Instrumentation
Use OpenTelemetry with GenAI semantic conventions for consistent observability:
from opentelemetry import trace
from opentelemetry.trace import SpanKind
tracer = trace.get_tracer("voice-agent")
async def process_turn(user_audio, context):
with tracer.start_as_current_span(
"voice_agent.turn",
kind=SpanKind.SERVER
) as turn_span:
turn_span.set_attribute("gen_ai.system", "voice_agent")
turn_span.set_attribute("session.id", context.session_id)
turn_span.set_attribute("turn.index", context.turn_count)
# ASR span
with tracer.start_as_current_span("asr.transcribe") as asr_span:
transcript = await asr.transcribe(user_audio)
asr_span.set_attribute("asr.provider", "deepgram")
asr_span.set_attribute("asr.model", "nova-2")
asr_span.set_attribute("asr.confidence", transcript.confidence)
asr_span.set_attribute("asr.word_count", len(transcript.words))
# LLM span
with tracer.start_as_current_span("llm.generate") as llm_span:
response = await llm.generate(transcript.text, context)
llm_span.set_attribute("gen_ai.request.model", "gpt-4-turbo")
llm_span.set_attribute("gen_ai.usage.prompt_tokens", response.prompt_tokens)
llm_span.set_attribute("gen_ai.usage.completion_tokens", response.completion_tokens)
llm_span.set_attribute("llm.ttfb_ms", response.time_to_first_token_ms)
# TTS span
with tracer.start_as_current_span("tts.synthesize") as tts_span:
audio = await tts.synthesize(response.text)
tts_span.set_attribute("tts.provider", "elevenlabs")
tts_span.set_attribute("tts.voice_id", "alloy")
tts_span.set_attribute("tts.character_count", len(response.text))
tts_span.set_attribute("tts.ttfb_ms", audio.time_to_first_byte_ms)
# Record turn-level metrics
turn_span.set_attribute("turn.total_latency_ms", calculate_total_latency())
return audio
Span-Level vs Trace-Level vs Session-Level Evaluation
| Level | What It Captures | Use Case | Alert On |
|---|---|---|---|
| Span | Single component (ASR call, LLM inference) | Component debugging | Component latency >2x baseline |
| Trace | Complete turn (user speaks → agent responds) | Turn-level quality | TTFW > threshold |
| Session | Full conversation (all turns) | End-to-end quality | Task completion, FCR |
Instrumentation Hierarchy:
Session (conversation_id)
├── Trace (turn_1)
│ ├── Span (asr.transcribe)
│ ├── Span (llm.generate)
│ │ └── Span (tool.call)
│ └── Span (tts.synthesize)
├── Trace (turn_2)
│ └── ...
└── Trace (turn_n)
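Trace-level alerting is just a rollup of span durations per turn before thresholds are applied. A sketch of that rollup over exported spans, assuming each span record carries its trace ID, name, and duration (the record shape is illustrative; the span names match the instrumentation above):
# spans: list of dicts like {"trace_id": "t1", "name": "llm.generate", "duration_ms": 2400}
def rollup_turn_latency(spans: list[dict]) -> dict[str, dict[str, float]]:
    """Group component spans by trace_id and add a per-turn total for threshold checks."""
    turns: dict[str, dict[str, float]] = {}
    for span in spans:
        turn = turns.setdefault(span["trace_id"], {})
        turn[span["name"]] = turn.get(span["name"], 0.0) + span["duration_ms"]
    for turn in turns.values():
        turn["turn.total_latency_ms"] = sum(turn.values())
    return turns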
Integrating Voice Metrics with Slack
Option 1: Direct POST to a Slack Incoming Webhook
import requests
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T00/B00/XXX"
def send_slack_alert(alert):
payload = {
"channel": alert.channel,
"username": "Voice Agent Alerts",
"icon_emoji": ":telephone:",
"blocks": [
{
"type": "header",
"text": {
"type": "plain_text",
"text": f"{alert.emoji} {alert.severity}: {alert.title}"
}
},
{
"type": "section",
"fields": [
{"type": "mrkdwn", "text": f"*Metric:* {alert.metric}"},
{"type": "mrkdwn", "text": f"*Current:* {alert.current_value}"},
{"type": "mrkdwn", "text": f"*Threshold:* {alert.threshold}"},
{"type": "mrkdwn", "text": f"*Duration:* {alert.duration}"}
]
},
{
"type": "actions",
"elements": [
{
"type": "button",
"text": {"type": "plain_text", "text": "View Dashboard"},
"url": alert.dashboard_url
},
{
"type": "button",
"text": {"type": "plain_text", "text": "Acknowledge"},
"action_id": f"ack_{alert.id}"
}
]
}
]
}
requests.post(SLACK_WEBHOOK_URL, json=payload)
Option 2: Datadog/Grafana → Slack
# Datadog monitor configuration
name: "Voice Agent TTFW P99 Critical"
type: metric alert
query: "p99:voice_agent.ttfw{env:production} > 8000"
message: |
{{#is_alert}}
@slack-voice-oncall
TTFW P99 exceeded 8s threshold.
Current: {{value}}ms
Host: {{host.name}}
Region: {{region.name}}
[View Dashboard](https://app.datadoghq.com/dashboard/xxx)
{{/is_alert}}
options:
notify_no_data: true
evaluation_delay: 60
Alert Configuration Best Practices
Defining Alert Severity Levels
| Severity | Response Time | Criteria | Notification Channels |
|---|---|---|---|
| SEV-0 | <5 min | Compliance violation, complete outage, data breach | PagerDuty + Slack #voice-oncall + Phone |
| SEV-1 | <15 min | Revenue impact, >20% of users affected, critical metric breach | PagerDuty + Slack #voice-oncall |
| SEV-2 | <1 hour | Performance degradation, <20% of users affected | Slack #voice-alerts |
| INFO | Next business day | Threshold warning, trend alert | Slack #voice-metrics, Email digest |
Severity Classification Matrix:
| Metric | SEV-2 | SEV-1 | SEV-0 |
|---|---|---|---|
| TTFW P99 | >7s | >10s | >15s |
| Task Completion | <80% | <70% | <50% |
| WER | >12% | >18% | >25% |
| Tool Call Failures | >2% | >5% | >15% |
| Compliance Violations | — | — | Any |
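The matrix translates directly into code. A sketch that maps a metric reading to a severity (thresholds copied from the matrix; the metric keys, units, and direction flag are assumptions):
# (direction, SEV-2, SEV-1, SEV-0); "high" = breach when the value is above the bound.
SEVERITY_MATRIX = {
    "ttfw_p99_s":         ("high", 7, 10, 15),
    "task_completion":    ("low", 0.80, 0.70, 0.50),
    "wer":                ("high", 0.12, 0.18, 0.25),
    "tool_call_failures": ("high", 0.02, 0.05, 0.15),
}

def classify_severity(metric: str, value: float) -> str | None:
    # Compliance violations map straight to SEV-0 (see the master table) and skip this check.
    direction, sev2, sev1, sev0 = SEVERITY_MATRIX[metric]
    breached = (lambda bound: value > bound) if direction == "high" else (lambda bound: value < bound)
    if breached(sev0):
        return "SEV-0"
    if breached(sev1):
        return "SEV-1"
    if breached(sev2):
        return "SEV-2"
    return None

assert classify_severity("ttfw_p99_s", 11) == "SEV-1"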
Preventing Alert Fatigue
Alert fatigue causes teams to ignore critical alerts. Implement these strategies:
1. Correlation Rules: Group related alerts into single incidents:
correlation_rules:
- name: "Latency cascade"
group_by: [region, time_window]
window: 5m
alerts: [ttfw_warning, llm_latency, tts_latency]
output_severity: max(input_severities)
2. Intelligent Throttling: Limit repeat alerts for ongoing incidents:
throttling:
cooldown_period: 30m # Don't re-alert for 30 min
max_alerts_per_hour: 5
escalate_if_sustained: 60m # Escalate if still firing after 1 hour
3. Risk-Based Prioritization: Score alerts by business impact:
from datetime import datetime

# Relative severity weights; tune to your own scale
SEVERITY_SCORES = {"SEV-0": 100, "SEV-1": 50, "SEV-2": 20, "INFO": 5}

def calculate_alert_priority(alert):
    base_score = SEVERITY_SCORES[alert.severity]
    # Multiply by affected users (capped at 5x)
    user_multiplier = min(alert.affected_users / 100, 5)
    # Multiply by time of day (peak hours = higher priority)
    hour = datetime.now().hour
    time_multiplier = 1.5 if 9 <= hour <= 17 else 1
    return base_score * user_multiplier * time_multiplier
Expected Results:
- 60-80% reduction in total alerts
- Zero missed critical incidents
- less than 10% false positive rate
Dynamic vs Static Thresholds
| Threshold Type | When to Use | Example |
|---|---|---|
| Static | Hard limits, compliance requirements | Latency >3s is always bad |
| Dynamic (baseline %) | Metrics with natural variance | WER >20% above 7-day baseline |
| Dynamic (ML anomaly) | Complex patterns, seasonal variation | Unusual traffic patterns |
Implementation:
def evaluate_threshold(metric, value, threshold_config):
if threshold_config.type == "static":
return value > threshold_config.value
elif threshold_config.type == "baseline_percent":
baseline = get_baseline(metric, threshold_config.baseline_window)
deviation = (value - baseline) / baseline * 100
return deviation > threshold_config.percent
elif threshold_config.type == "anomaly":
# Use ML model trained on historical data
return anomaly_detector.is_anomaly(metric, value)
Alert Payload Design for Quick Triage
Every alert should answer these questions in less than 10 seconds:
| Question | Alert Field | Example |
|---|---|---|
| How bad is it? | Severity + emoji | 🚨 SEV-1 |
| What broke? | Metric name + component | TTFW P99, LLM component |
| How bad exactly? | Current value vs threshold | 1,340ms (threshold: 1,000ms) |
| Is it getting worse? | Trend indicator | ↑ 340ms from baseline |
| How many affected? | User/call count | 47 calls in last 15 min |
| What do I do? | Runbook link | [Runbook] |
| Who else knows? | On-call mention | @voice-team |
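One way to enforce that checklist is to make the payload a typed object so an alert can't be constructed without its triage fields. A sketch with illustrative field names; adapt them to whatever your alerting pipeline already emits:
from dataclasses import dataclass

@dataclass
class VoiceAlert:
    severity: str        # How bad is it?        e.g. "SEV-1"
    metric: str          # What broke?           e.g. "ttfw_p99 (LLM component)"
    current_value: str   # How bad exactly?      e.g. "1,340ms"
    threshold: str       #                        e.g. "1,000ms"
    trend: str           # Is it getting worse?  e.g. "↑ 340ms from baseline"
    affected: str        # How many affected?    e.g. "47 calls in last 15 min"
    runbook_url: str     # What do I do?
    oncall_mention: str  # Who else knows?       e.g. "@voice-team"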
Slack Alert Templates and Examples
Template 1: Multi-Component Pipeline Failure
🚨 SEV-1 ALERT: Voice Agent Pipeline Degradation
📊 Multiple Components Affected:
┌─────────┬─────────┬─────────┬─────────┐
│   ASR   │   LLM   │  Tools  │   TTS   │
│   ✅    │   ⚠️    │   ❌    │   ✅    │
│  195ms  │ 1200ms  │ FAILING │  150ms  │
└─────────┴─────────┴─────────┴─────────┘
🔍 Root Cause Analysis:
• Tool "lookup_customer" returning 503 errors
• LLM retrying tool calls, increasing latency
• Cascade effect: TTFW ↑ 180%
📈 Impact:
• Task completion: 45% (baseline: 87%)
• Affected calls: 234 in last 30 min
• Estimated revenue impact: ~$1,200
🔧 Suggested Actions:
• Check customer lookup API status
• Consider enabling fallback tool
• If this persists >15 min, switch to degraded mode
🔗 [Incident Dashboard] [API Status] [Enable Fallback] [Runbook]
📞 Affected Call IDs:
call_abc123, call_def456, call_ghi789 (+ 231 more)
👤 On-call: @voice-team @backend-team
Template 2: Geographic Latency Degradation
⚠️ SEV-2 ALERT: Regional Latency Anomaly
📊 Metric: ttfw_p95 by region
Region        Current    Baseline   Status
───────────────────────────────────────────
us-east-1     680ms      720ms      ✅ Normal
us-west-2     1,450ms    710ms      ❌ +104%
eu-west-1     695ms      705ms      ✅ Normal
ap-south-1    890ms      920ms      ✅ Normal
🔍 us-west-2 Analysis:
• Affected since: 14:23 UTC (47 minutes)
• ASR latency: Normal (185ms)
• LLM latency: 980ms (↑ 420ms)
• Network: Normal (32ms RTT)
📍 Probable Cause:
• LLM provider edge node degradation in us-west-2
• OpenAI status page shows elevated latency
🔧 Options:
• Route us-west-2 traffic to us-east-1 (adds ~40ms network)
• Wait for provider resolution
• Switch to backup LLM provider
🔗 [Regional Dashboard] [Provider Status] [Traffic Routing] [Runbook]
👤 On-call: @voice-team
Template 3: Compliance and Policy Violations
🚨 SEV-0 ALERT: Compliance Violation Detected
⚠️ Type: PII Exposure in Agent Response
📝 Incident Details:
Call ID: call_xyz789
Timestamp: 2026-01-27 14:32:17 UTC
Agent Version: v2.5.0
🔒 Violation:
Agent disclosed another customer's account number
in response to identity verification question.
📄 Transcript Excerpt:
User: "Can you confirm my account number?"
Agent: "Your account number is 4859... wait, I see
account 7734 as well. Which one?" ← VIOLATION
🔍 Root Cause (Preliminary):
• Context window contained previous caller's data
• Session isolation failed between calls
🚫 Immediate Actions Taken:
• Call flagged for compliance review
• Similar calls in last hour being audited
• Agent isolated pending investigation
🔗 [Compliance Dashboard] [Call Recording] [Incident Ticket] [Runbook]
👤 Required Response:
@compliance-team @security-team @voice-team
⚠️ Response required within 15 minutes
React with 🔒 to confirm investigation started
Template 4: Tool Call and Intent Failure Spikes
⚠️ SEV-2 ALERT: Intent Classification Failures Elevated
📊 Metrics (1-hour window):
Intent Accuracy: 84.2% (threshold: 90%)
Fallback Rate: 23.1% (baseline: 8.4%)
🔍 Confusion Matrix (Top Errors):
Actual Intent     → Misclassified As     Count      %
──────────────────────────────────────────────────────
cancel_order      → track_order             47   18.2%
refund_request    → return_status           31   14.8%
billing_help      → account_inquiry         28   12.1%
📝 Sample Misclassifications:
• "I want to cancel this order"
Expected: cancel_order
Classified: track_order (confidence: 0.62)
• "Can I get my money back for this?"
Expected: refund_request
Classified: return_status (confidence: 0.58)
📈 Downstream Impact:
• Task completion: 71% (baseline: 87%)
• User repeated themselves: 34% of calls
• Escalation rate: ↑ 2.1x
🔧 Possible Causes:
• Recent ASR model update (WER normal)
• Prompt version v2.5.0 deployed 2 hours ago ← LIKELY
• New user traffic patterns (no evidence)
🔗 [Intent Dashboard] [Prompt Comparison] [Rollback v2.5.0] [Runbook]
👤 On-call: @voice-team @ml-team
Monitoring Integration Architecture
Connecting Hamming Metrics to Slack
Hamming provides 50+ built-in voice agent metrics with native Slack integration:
# Hamming alert configuration
alerts:
- name: "TTFW P99 Critical"
metric: ttfw_p99
condition: "> 8s"
duration: "5m"
severity: SEV-1
channels:
- slack: "#voice-oncall"
include:
- component_breakdown
- sample_calls: 3
- runbook_link: "https://docs.company.com/runbooks/ttfw"
- name: "ASR Drift Detection"
metric: wer
condition: "> baseline * 120%"
baseline_window: "7d"
evaluation_window: "100 calls"
severity: SEV-2
channels:
- slack: "#voice-alerts"
include:
- sample_transcripts: 5
- segment_breakdown: [device_type, region]
Hamming's Unique Capabilities:
- Turn-level latency breakdown (ASR/LLM/TTS)
- Automated call replay links in alerts
- Intent accuracy with confusion matrices
- Prompt version comparison
- One-click regression testing from alerts
Routing Alerts to the Right Teams
| Alert Category | Primary Team | Secondary | Slack Channel |
|---|---|---|---|
| Latency (TTFW) | Voice Platform | SRE | #voice-oncall |
| ASR/WER | ML Engineering | Voice Platform | #voice-alerts → #ml-oncall |
| LLM/Prompt | Prompt Engineering | ML Engineering | #voice-oncall → #prompt-eng |
| VoIP/Network | Infrastructure | Voice Platform | #infra-oncall → #voice-oncall |
| Tools/APIs | Backend | Voice Platform | #backend-oncall → #voice-alerts |
| Compliance | Security | Legal | #security-oncall + #compliance |
| Business (FCR) | Voice Platform | Product | #voice-alerts → #product |
Escalation Policy:
escalation:
- level: 1
delay: 0
notify: ["slack:#voice-oncall"]
- level: 2
delay: 15m
condition: "not acknowledged"
notify: ["pagerduty:voice-primary"]
- level: 3
delay: 30m
condition: "not acknowledged"
notify: ["pagerduty:voice-secondary", "slack:#engineering-leadership"]
Alert Acknowledgment and Incident Workflows
Implement interactive Slack buttons for rapid response:
# Slack interactive message handler (async Bolt app; slack_app, alerts, and pagerduty are your own helpers)
from datetime import datetime

@slack_app.action("acknowledge_alert")
async def handle_acknowledge(ack, body, client):
    await ack()
    alert_id = body["actions"][0]["value"]
    user = body["user"]["username"]
    # Update alert status; assume the store returns the updated alert record
    alert = await alerts.acknowledge(alert_id, user)
    # Append an acknowledgment line to the original Slack message
    await client.chat_update(
        channel=body["channel"]["id"],
        ts=body["message"]["ts"],
        blocks=[
            *body["message"]["blocks"],
            {
                "type": "context",
                "elements": [{
                    "type": "mrkdwn",
                    "text": f"✅ Acknowledged by @{user} at {datetime.now()}"
                }]
            }
        ]
    )
    # Create incident ticket if SEV-1+
    if alert.severity in ["SEV-0", "SEV-1"]:
        await pagerduty.create_incident(alert)
@slack_app.action("trigger_rollback")
async def handle_rollback(ack, body, client):
await ack()
# Confirm before rollback
await client.views_open(
trigger_id=body["trigger_id"],
view={
"type": "modal",
"title": {"type": "plain_text", "text": "Confirm Rollback"},
"submit": {"type": "plain_text", "text": "Rollback Now"},
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "This will rollback to prompt version v2-4-1. Continue?"
}
}
]
}
)
Related Guides
- Voice Agent Monitoring KPIs — 10 production metrics with formulas and benchmarks
- Voice Agent Troubleshooting — VoIP diagnostics and ASR/LLM/TTS debugging
- Voice Agent Monitoring Platform Guide — 4-Layer monitoring architecture
- Voice Agent Observability Tracing Guide — OpenTelemetry integration
- Monitor Pipecat Agents in Production — Pipecat-specific monitoring

