Voice agent analytics is the continuous measurement of performance across telephony, ASR, LLM, and TTS layers to ensure production quality. Unlike traditional call center metrics that track averages and aggregate outcomes, voice agent analytics requires layer-by-layer observability—tracing every interaction from audio ingestion through speech recognition, language model inference, and speech synthesis to pinpoint where and why conversations succeed or fail.
| Metric Category | Key Metrics | Production Target |
|---|---|---|
| Task Success | FCR, containment rate, TSR | FCR 70-85%, containment 80%+ |
| Latency | TTFW, turn latency, P90/P95 | P90 <3.5s, TTFW <500ms |
| ASR Quality | WER, confidence scores | WER <5% |
| NLU Accuracy | Intent recognition, slot filling | Intent accuracy 95%+ |
| TTS Quality | MOS, synthesis latency | MOS 4.3-4.5 |
| Safety | Hallucination rate, refusal rate | Hallucination <3% |
At Hamming, we've analyzed 4M+ voice agent calls across 10K+ production voice agents. This guide provides the standardized definitions, mathematical formulas, instrumentation approaches, and benchmark ranges for every critical voice agent metric.
Methodology Note: Metrics, benchmarks, and formula definitions in this guide are derived from Hamming's analysis of 4M+ voice agent interactions across 10K+ production voice agents (2025-2026). Thresholds were validated across healthcare, financial services, e-commerce, and customer support verticals.
Related Guides:
- Voice Agent Evaluation Metrics: Definitions, Formulas & Benchmarks — Complete technical reference for individual evaluation metrics
- Post-Call Analytics for Voice Agents: Metrics and Monitoring — Real-time data pipelines and 4-layer observability
- Voice Agent Dashboard Template — 6-Metric Framework with executive reports
- Voice Agent Monitoring KPIs: Production Guide — 10 critical KPIs with alert thresholds
- Voice AI Latency: What's Fast, What's Slow, How to Fix It — Engineering guide to latency optimization
- The Anatomy of a Perfect Voice Agent Analytics Dashboard — Dashboard design patterns and KPI categories
Core Voice Agent Metrics and KPIs
First Call Resolution (FCR)
First Call Resolution (FCR) measures the percentage of customer issues fully resolved during the initial interaction without requiring callbacks, transfers, or follow-up contacts.
Formula:
FCR = (Issues resolved on first contact / Total interactions) × 100
| Level | FCR Target | Context |
|---|---|---|
| Baseline | 70-75% | Standard voice agent deployment |
| Good | 75-80% | Optimized flows with knowledge base coverage |
| Top Performer | 85%+ | Specialized, well-defined use cases |
Measurement approach: Use 48-72 hour verification windows. If a customer contacts again within that window about the same issue, the original interaction did not achieve resolution—even if it was marked complete.
Segmentation matters: FCR varies significantly by intent category. Appointment scheduling may achieve 90%+ FCR while complex troubleshooting sits at 60-65%. Report FCR by intent to identify specific improvement opportunities rather than optimizing a blended average.
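To make the verification-window rule concrete, here is a minimal Python sketch that counts a call as first-contact resolved only if the same caller does not return about the same intent within 72 hours. The record fields (`caller`, `intent`, `start`, `marked_resolved`) are illustrative, not a required schema.

```python
from datetime import datetime, timedelta

# Hypothetical call records; field names are illustrative, not a required schema.
calls = [
    {"caller": "C1", "intent": "appointment_scheduling",
     "start": datetime(2026, 1, 5, 9, 0), "marked_resolved": True},
    {"caller": "C1", "intent": "appointment_scheduling",
     "start": datetime(2026, 1, 6, 14, 0), "marked_resolved": True},  # repeat contact within window
    {"caller": "C2", "intent": "order_status",
     "start": datetime(2026, 1, 5, 10, 0), "marked_resolved": True},
]

def fcr(calls, window_hours=72):
    """FCR with a callback verification window: a call only counts as resolved
    on first contact if the same caller does not return about the same intent
    within the window, even if the call was marked complete."""
    calls = sorted(calls, key=lambda c: c["start"])
    resolved_first_contact = 0
    for i, call in enumerate(calls):
        if not call["marked_resolved"]:
            continue
        window_end = call["start"] + timedelta(hours=window_hours)
        repeat = any(
            other["caller"] == call["caller"]
            and other["intent"] == call["intent"]
            and call["start"] < other["start"] <= window_end
            for other in calls[i + 1:]
        )
        if not repeat:
            resolved_first_contact += 1
    return 100 * resolved_first_contact / len(calls)

print(f"FCR: {fcr(calls):.1f}%")  # 66.7%: C1's first call is disqualified by the repeat contact
```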
Containment Rate
Containment rate measures the percentage of calls handled entirely by the AI agent without escalation to a human operator.
Formula:
Containment Rate = (AI-resolved calls / Total calls) × 100
| Level | Target | Use Case Context |
|---|---|---|
| Excellent | >80% | Well-defined transactional flows |
| Good | 70-80% | Standard customer service |
| Acceptable | 60-70% | Complex queries, new deployments |
| FAQ Bots | 40-60% | Simple information retrieval |
Target 80%+ after optimization, though rates vary substantially by use case complexity. Healthcare triage may appropriately target 60-70% while appointment scheduling targets 85%+.
Critical caveat: High containment with low CSAT indicates false containment—users abandoning rather than getting helped. Always pair containment tracking with task completion and satisfaction metrics.
Track escalation patterns by reason to identify actionable improvements:
- Knowledge gap (agent lacks required information)
- Authentication failure
- User preference (explicitly requested human)
- Conversation breakdown (intent confusion, loops)
- Policy requirement (regulatory escalation triggers)
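A minimal sketch of the containment calculation paired with an escalation-reason breakdown, assuming each call record carries an `escalation_reason` field that is `None` when the AI handled the call end to end (the field name and reason labels are assumptions):

```python
from collections import Counter

# Hypothetical per-call outcomes; escalation_reason is None for contained calls.
calls = [
    {"id": 1, "escalation_reason": None},
    {"id": 2, "escalation_reason": "knowledge_gap"},
    {"id": 3, "escalation_reason": None},
    {"id": 4, "escalation_reason": "user_preference"},
    {"id": 5, "escalation_reason": None},
]

contained = sum(1 for c in calls if c["escalation_reason"] is None)
containment_rate = 100 * contained / len(calls)
print(f"Containment rate: {containment_rate:.1f}%")  # 60.0%

# Break escalations down by reason so improvements can be prioritized.
reasons = Counter(c["escalation_reason"] for c in calls if c["escalation_reason"])
for reason, count in reasons.most_common():
    print(f"{reason}: {count} escalation(s)")
```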
Intent Recognition Accuracy
Intent recognition accuracy measures how correctly the voice agent classifies customer requests into predefined intent categories.
Formula:
Intent Recognition Accuracy = (Correct intent matches / Total utterances) × 100
| Level | Accuracy | Assessment |
|---|---|---|
| Production-ready | >95% | Required for customer-facing deployment |
| Acceptable | 90-95% | Requires monitoring and prompt tuning |
| Below threshold | <90% | Not production-ready without improvement |
Production requires 95%+ accuracy. Intent misclassification cascades through the entire conversation—a misrouted caller enters the wrong flow, receives irrelevant responses, and either escalates or abandons.
Track intent coverage rate alongside accuracy: the percentage of incoming calls that match a fully supported intent category. Low coverage (many "fallback" or "unknown" intents) indicates gaps in your intent taxonomy rather than classification quality.
Task Success Rate (TSR)
Task Success Rate (TSR) tracks completed objectives relative to total interaction attempts.
Formula:
TSR = (Successful task completions / Total interactions) × 100
| Use Case | TSR Benchmark |
|---|---|
| Appointment scheduling | 90-95% |
| Order status inquiry | 88-93% |
| Payment processing | 85-90% |
| Technical troubleshooting | 75-85% |
| Complex multi-step flows | 70-80% |
Benchmark 85-95% for specialized implementations with well-defined success criteria. TSR differs from FCR in that it measures whether the agent completed its designated task, regardless of whether the customer needed to call back for a different issue.
Define explicit completion criteria for each task type. "Appointment scheduled" is unambiguous. "Customer helped" is not.
Customer Satisfaction (CSAT) Score
CSAT measures satisfaction with individual interactions, typically on a 1-5 scale.
Formula:
CSAT = (Satisfied responses [4-5 rating] / Total survey responses) × 100
| Level | CSAT Target |
|---|---|
| Excellent | >85% |
| Good | 75-85% |
| Acceptable | 65-75% |
| Poor | <65% |
Target 75-85% with AI-based automated scoring supplementing explicit surveys. Voice agents can embed satisfaction measurement directly in conversations, achieving 30%+ higher completion rates than post-call email surveys.
Beyond explicit ratings: Track caller frustration through sentiment trajectory, interruption patterns, and repetition frequency. AI-based scoring analyzes tone, sentiment, resolution speed, and conversation ending patterns to predict CSAT without requiring explicit surveys—useful when survey response rates are low.
Latency and Response Time Metrics
Time to First Word (TTFW)
Time to First Word (TTFW) measures initial response delay from user speech completion (VAD silence detection) to the first agent audio reaching the caller.
Formula:
TTFW = VAD silence detection → Agent first audio byte
| Threshold | User Experience |
|---|---|
| <300ms | Natural, indistinguishable from human conversation |
| 300-500ms | Acceptable for most users |
| 500-800ms | Noticeable delay, users begin adapting speech patterns |
| >800ms | Conversation breakdown begins |
Target sub-500ms as ideal, with 800ms as the acceptable production threshold. Note that TTFW measures only the initial response delay—the time until the first audio byte reaches the caller. This differs from total turn latency (covered below), which measures complete end-to-end response time. Based on Hamming's analysis of production voice agents, industry median total turn latency is 1.4-1.7 seconds—significantly slower than the 300ms human conversational expectation. This gap explains why users report agents that "feel slow" or "keep getting interrupted."
Total Conversation Latency
Total conversation latency measures end-to-end response time including ASR processing, LLM inference, and TTS synthesis for each conversational turn.
Component breakdown (typical production):
- Audio transmission: ~40ms
- STT processing: 150-350ms
- LLM inference (TTFT): 200-800ms (typically 70% of total)
- TTS synthesis: 100-200ms
- Audio playback: ~30ms
LLM inference dominates total latency, making model selection and prompt optimization the highest-leverage improvement targets.
Latency Percentiles (P50, P90, P95, P99)
Track percentile distributions rather than averages to expose performance outliers that degrade user experience for significant portions of your traffic.
Production latency benchmarks:
| Percentile | Response Time | User Experience | Action Required |
|---|---|---|---|
| P50 (median) | 1.5s | Noticeable delay, functional | Optimize LLM inference |
| P90 | 3.5s | Significant frustration, talk-overs | Investigate infrastructure and model |
| P95 | 5.0s | Severe delay, frequent abandonment | Immediate attention required |
| P99 | 10s+ | Complete breakdown | Critical incident |
Why percentiles matter: A system reporting 500ms average latency may have 10% of calls experiencing 3.5+ second delays. Those callers don't care about the average—they experience a broken product. Alert on p90 threshold breaches (3.5s), not mean degradation.
Alert configuration: Set notifications when p90 latency exceeds 3.5s for a sustained 5-minute window. This catches degradation before it becomes widespread while filtering transient spikes.
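As a sketch of percentile-based monitoring, the snippet below computes nearest-rank percentiles over a window of turn latencies and checks the p90 threshold. The sample values and the 3.5s threshold mirror the guidance above; production systems would compute this over a rolling window of real measurements.

```python
# Turn latencies in seconds for one monitoring window (hypothetical sample).
latencies = [0.9, 1.1, 1.3, 1.4, 1.6, 1.8, 2.0, 2.4, 3.9, 5.2]

def percentile(values, p):
    """Nearest-rank percentile: simple, stdlib-only, no interpolation."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

for p in (50, 90, 95, 99):
    print(f"p{p}: {percentile(latencies, p):.1f}s")

# Alert on the tail, not the mean: a p90 above 3.5s means roughly one caller
# in ten is already having a degraded conversation.
if percentile(latencies, 90) > 3.5:
    print("ALERT: p90 turn latency exceeds 3.5s")
```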
Speech Recognition Quality Metrics
Word Error Rate (WER)
Word Error Rate (WER) is the industry standard metric for ASR accuracy, measuring the percentage of words incorrectly transcribed.
Formula:
WER = (S + D + I) / N × 100
Where:
S = Substitutions (wrong words)
D = Deletions (missing words)
I = Insertions (extra words)
N = Total words in reference transcript
| Level | WER | Assessment |
|---|---|---|
| Enterprise | <5% | Required for production deployment |
| Acceptable | 5-8% | Standard deployment, optimization needed |
| Below threshold | 8-12% | Not production-ready for high-stakes use |
| Poor | >12% | Requires fundamental ASR changes |
Require under 5% WER for production. ASR errors cascade through the entire pipeline—a misrecognized word becomes a wrong intent becomes a failed task. Track WER segmented by noise condition, accent, and domain vocabulary to identify specific improvement areas.
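The WER formula can be computed directly with a word-level edit distance. This sketch treats substitutions, deletions, and insertions as unit-cost operations; production pipelines typically also normalize punctuation and casing before scoring.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level edit distance, as a percentage."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return 100 * d[len(ref)][len(hyp)] / len(ref)

reference = "I would like to reschedule my appointment for Tuesday"
hypothesis = "I would like to schedule my appointment for Tuesday"
print(f"WER: {wer(reference, hypothesis):.1f}%")  # 1 substitution over 9 words, about 11.1%
```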
Transcription Confidence Scores
ASR systems provide probability scores indicating transcription certainty for each word or utterance segment. These scores enable real-time quality monitoring without requiring reference transcripts.
| Confidence Level | Score Range | Action |
|---|---|---|
| High | >0.85 | Process normally |
| Medium | 0.6-0.85 | Flag for monitoring |
| Low | <0.6 | Trigger re-prompting or human review |
Production use: Flag low-confidence segments (below 0.6) for re-prompting strategies—ask the caller to repeat or rephrase rather than proceeding with uncertain transcription. Monitor confidence score distributions over time to detect ASR drift.
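A small sketch of the confidence-based routing described above, using the guide's threshold bands; the exact cutoffs should be tuned per ASR provider, since providers calibrate confidence scores differently.

```python
def route_by_confidence(confidence: float) -> str:
    """Map an ASR confidence score to a handling decision using the bands above."""
    if confidence > 0.85:
        return "process"             # high confidence: continue normally
    if confidence >= 0.6:
        return "process_and_flag"    # medium: proceed but log for monitoring
    return "reprompt"                # low: ask the caller to repeat or rephrase

print(route_by_confidence(0.52))  # reprompt
print(route_by_confidence(0.91))  # process
```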
Speaker Diarization Accuracy
Speaker diarization identifies and separates multiple speakers in a conversation, attributing each utterance to the correct participant.
Critical for:
- Multi-party calls (caller + agent + transferred party)
- Accurate context attribution in analytics
- Compliance monitoring requiring speaker-specific tracking
- Training data quality for model improvement
Track diarization error rate as the percentage of speech segments attributed to the wrong speaker. Production systems should achieve under 5% diarization error for two-speaker conversations.
Natural Language Understanding Metrics
Intent Coverage Rate
Intent coverage rate measures the percentage of incoming calls that match a fully supported intent category in your voice agent's taxonomy.
Formula:
Intent Coverage = (Calls matching supported intents / Total calls) × 100
Track coverage gaps—calls routed to "fallback" or "unknown" intent categories—to identify where your agent lacks capability. High fallback rates (above 15%) indicate taxonomy gaps rather than classification errors.
Action pattern: Review fallback utterances weekly. Cluster similar requests and evaluate whether they warrant new intent categories or expanded training data for existing intents.
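A minimal sketch of coverage tracking plus the weekly fallback review, assuming each utterance has already been classified and unmatched calls are tagged `fallback` (the intent names and data shape are illustrative):

```python
from collections import Counter

SUPPORTED_INTENTS = {"appointment_scheduling", "order_status", "billing_question"}

# Hypothetical (utterance, classified intent) pairs from one week of traffic.
classified = [
    ("book me for thursday", "appointment_scheduling"),
    ("where is my package", "order_status"),
    ("can I change my delivery address", "fallback"),
    ("I want to update my shipping address", "fallback"),
    ("why was I charged twice", "billing_question"),
]

covered = sum(1 for _, intent in classified if intent in SUPPORTED_INTENTS)
coverage = 100 * covered / len(classified)
print(f"Intent coverage: {coverage:.1f}%")  # fallback above ~15% suggests a taxonomy gap

# Surface fallback utterances for the weekly review: similar requests that
# cluster together are candidates for a new intent or expanded training data.
fallbacks = [text for text, intent in classified if intent == "fallback"]
print(Counter(fallbacks).most_common(10))
```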
Semantic Accuracy Rate
Semantic accuracy measures whether agent responses align with the user's actual meaning, going beyond keyword matching to evaluate contextual understanding.
Unlike intent accuracy (which measures classification), semantic accuracy evaluates whether the agent's response appropriately addresses what the user meant—even when the intent was correctly classified.
Validation approach: Conduct periodic manual audits or use LLM-as-judge evaluation pipelines. Sample 100-200 conversations per week, scoring whether responses were semantically appropriate given the full conversation context. LLM-as-judge approaches achieve 95%+ agreement with human evaluators when using two-step evaluation pipelines.
Slot Filling Accuracy
Slot filling accuracy tracks successful extraction of required parameters (names, dates, account numbers, addresses) from user utterances before task execution.
Formula:
Slot Filling Accuracy = (Correctly extracted slots / Total required slots) × 100
Production target: 90%+ slot filling accuracy. Failed slot extraction forces repetitive re-prompting that degrades user experience. Track accuracy by slot type—dates and numbers typically achieve higher accuracy than proper nouns and addresses.
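A short sketch of slot filling accuracy broken down by slot type, assuming each extraction attempt has already been labeled correct or incorrect (for example, against confirmed values or human review):

```python
from collections import defaultdict

# Hypothetical extraction results: (slot_type, extracted_correctly)
extractions = [
    ("date", True), ("date", True), ("date", False),
    ("phone_number", True), ("phone_number", True),
    ("street_address", True), ("street_address", False), ("street_address", False),
]

totals, correct = defaultdict(int), defaultdict(int)
for slot_type, ok in extractions:
    totals[slot_type] += 1
    correct[slot_type] += ok

overall = 100 * sum(correct.values()) / sum(totals.values())
print(f"Overall slot filling accuracy: {overall:.1f}%")
for slot_type in totals:
    # Per-type breakdown: addresses and proper nouns usually lag dates and numbers.
    print(f"  {slot_type}: {100 * correct[slot_type] / totals[slot_type]:.1f}%")
```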
Text-to-Speech Quality Metrics
Mean Opinion Score (MOS)
Mean Opinion Score (MOS) is a subjective 1-5 scale rating of TTS naturalness, clarity, and overall quality, following the ITU-T P.800 standard.
| MOS Score | Quality Level |
|---|---|
| 4.5-5.0 | Excellent, indistinguishable from human |
| 4.3-4.5 | Very good, rivals human speech quality |
| 3.8-4.3 | Good, clearly synthetic but natural |
| 3.0-3.8 | Fair, robotic qualities noticeable |
| <3.0 | Poor, unacceptable for production |
Target 4.3-4.5 to rival human speech quality benchmarks. MOS remains the gold standard for TTS evaluation despite being resource-intensive, requiring crowdsourced evaluation or automated MOSNet scoring.
Voice Consistency Rate
Voice consistency measures stable prosody, tone, and pacing throughout an entire conversation. Inconsistent voice characteristics—sudden pitch shifts, pacing changes, or tonal breaks—break user immersion and erode trust.
Monitor for:
- Pitch stability across conversation turns
- Pacing consistency (words per minute variance)
- Tonal alignment with conversation context (empathetic when appropriate)
- Cross-session consistency for returning callers
Audio Synthesis Latency
TTS synthesis latency measures the time required to generate audio output from text input.
| Percentile | Target | Impact |
|---|---|---|
| P50 | <150ms | Contributes to natural conversational flow |
| P90 | <300ms | Acceptable production threshold |
| P95 | <500ms | Investigate TTS provider performance |
Track TTS p90 latency under 300ms to maintain conversational rhythm. TTS latency combines with STT and LLM latency to determine total turn latency—optimizing any single component improves end-to-end experience.
Hallucination Detection and Safety Metrics
Hallucination Rate
Hallucination rate tracks instances where the voice agent generates fabricated information, invented facts, or confident responses not grounded in its knowledge base.
Formula:
Hallucination Rate = (Hallucinated responses / Total responses) × 100
Target under 3% occurrence for general deployments. Regulated industries (healthcare, financial services) should target under 1%.
Detection approaches:
- Real-time validation against knowledge base sources
- LLM-as-judge evaluation on sampled conversations
- Tracking five or more consecutive transcription errors as potential hallucination signals
- Monitoring responses with high confidence but no matching source documents
Safety Refusal Rate
Safety refusal rate measures the percentage of adversarial, inappropriate, or out-of-scope prompts correctly rejected by the voice agent.
Track both:
- True positive refusals: Correctly blocked adversarial or policy-violating requests
- False positive refusals: Legitimate requests incorrectly blocked (over-aggressive guardrails)
Balance is critical. Under-refusing exposes your system to misuse. Over-refusing creates frustrated users who can't complete legitimate tasks.
Source Grounding Score
Source grounding validates that agent responses are traceable to verified knowledge base content, flagging "confident answers with no matching source" as potential hallucinations.
Implementation: For each response, check whether the key claims map to retrieved knowledge base passages. Responses with high confidence but low source overlap should trigger review and potential re-prompting.
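As an illustrative baseline, the sketch below scores grounding as lexical overlap between the response and the retrieved passages. Real systems typically use entailment models or embedding similarity; the 0.5 review threshold is an assumption to be tuned against labeled hallucinations.

```python
def grounding_score(response: str, retrieved_passages: list[str]) -> float:
    """Crude lexical-overlap proxy for source grounding: the share of response
    tokens that also appear in any retrieved passage."""
    response_tokens = set(response.lower().split())
    source_tokens = set()
    for passage in retrieved_passages:
        source_tokens.update(passage.lower().split())
    if not response_tokens:
        return 0.0
    return len(response_tokens & source_tokens) / len(response_tokens)

passages = ["Refunds are issued within 5 business days to the original payment method."]
answer = "Your refund will arrive within 5 business days to your original payment method."
score = grounding_score(answer, passages)
print(f"Grounding score: {score:.2f}")
if score < 0.5:  # illustrative threshold: tune against labeled hallucinations
    print("Flag for review: confident answer with weak source overlap")
```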
Conversation Quality Scoring
Call Deflection Success
Call deflection success measures prevention of unnecessary human transfers when the AI agent could resolve the issue, calculated against baseline pre-automation escalation rates.
Formula:
Deflection Success = (Pre-automation escalations - Current escalations) / Pre-automation escalations × 100
This metric only makes sense relative to historical baselines. Compare current escalation patterns to pre-automation rates, segmented by intent category.
Interruption Frequency
Interruption frequency counts instances where the agent speaks over the user or responds before the user completes their thought. High interruption rates indicate ASR timing issues, specifically problems with Voice Activity Detection (VAD) or end-of-turn prediction.
| Level | Interruption Rate | Assessment |
|---|---|---|
| Good | <5% of turns | Natural conversation flow |
| Acceptable | 5-10% of turns | Monitor VAD configuration |
| Poor | >10% of turns | Immediate VAD tuning required |
Diagnostic approach: Distinguish between agent-caused interruptions (premature response) and user-caused interruptions (barge-in). Agent-caused interruptions indicate system issues. User-caused barge-ins may indicate latency problems prompting users to repeat themselves.
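A sketch of that distinction using per-turn timing records: an agent-caused interruption is agent audio starting before the user's detected end of speech, while a barge-in is the user speaking during agent playback. The `Turn` structure and its field names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Turn:
    """Hypothetical per-turn timing record; all times are seconds from call start."""
    user_speech_end: float              # when VAD detected the user finished speaking
    agent_audio_start: float            # when agent audio began playing to the caller
    agent_audio_end: float
    user_barge_in_at: Optional[float] = None  # user started talking during agent audio

def classify_interruptions(turns: List[Turn]) -> Tuple[int, int]:
    # Agent-caused: the agent started speaking before the user finished.
    agent_caused = sum(1 for t in turns if t.agent_audio_start < t.user_speech_end)
    # User barge-in: the user started speaking while agent audio was still playing.
    barge_ins = sum(
        1 for t in turns
        if t.user_barge_in_at is not None
        and t.agent_audio_start <= t.user_barge_in_at < t.agent_audio_end
    )
    return agent_caused, barge_ins

turns = [
    Turn(user_speech_end=4.2, agent_audio_start=4.9, agent_audio_end=8.0),
    Turn(user_speech_end=12.0, agent_audio_start=11.6, agent_audio_end=15.0),  # premature response
    Turn(user_speech_end=20.5, agent_audio_start=21.3, agent_audio_end=27.0,
         user_barge_in_at=24.0),
]

agent_caused, barge_ins = classify_interruptions(turns)
print(f"Agent-caused interruptions: {agent_caused}, user barge-ins: {barge_ins}")  # 1, 1
```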
Conversation Abandonment Rate
Conversation abandonment rate tracks calls ended by the user mid-conversation before reaching resolution, signaling poor experience or agent failure.
Formula:
Abandonment Rate = (Calls abandoned before resolution / Total calls) × 100
Segment abandonment by:
- Time in call: Early abandonment (under 30s) suggests greeting or routing issues
- Intent stage: Abandonment during slot filling suggests re-prompting fatigue
- After specific turns: Identifies exact conversation points causing drop-off
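A minimal segmentation sketch over hypothetical abandoned-call records, bucketing by time in call and by conversation stage; the stage labels are assumptions.

```python
from collections import Counter

# Hypothetical abandoned-call records: seconds into the call when the user hung up,
# plus the conversation stage they were in at that moment.
abandoned = [
    {"seconds_in": 12, "stage": "greeting"},
    {"seconds_in": 25, "stage": "intent_routing"},
    {"seconds_in": 95, "stage": "slot_filling"},
    {"seconds_in": 140, "stage": "slot_filling"},
    {"seconds_in": 60, "stage": "authentication"},
]

def time_bucket(seconds: int) -> str:
    # Early abandonment (<30s) points at greeting or routing issues.
    if seconds < 30:
        return "early (<30s)"
    if seconds < 120:
        return "mid (30-120s)"
    return "late (>120s)"

print("By time in call:", Counter(time_bucket(a["seconds_in"]) for a in abandoned))
print("By stage:", Counter(a["stage"] for a in abandoned))
```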
Voice Agent Analytics Framework
4-Layer Monitoring Approach
Implement Hamming's 4-layer analytics framework for comprehensive voice agent observability:
| Layer | Function | Key Metrics | Failure Modes |
|---|---|---|---|
| Layer 1: Telephony & Audio | Audio quality, transport health | Packet loss, jitter, SNR, codec latency | Garbled audio, dropouts, echo |
| Layer 2: ASR & Transcription | Speech-to-text accuracy | WER, confidence scores, transcription latency | Mishearing, silent failures, drift |
| Layer 3: LLM & Semantic | Intent and response generation | TTFT, intent accuracy, hallucination rate | Wrong routing, confabulation, scope creep |
| Layer 4: TTS & Generation | Speech synthesis quality | Synthesis latency, MOS, voice consistency | Delays, robotic speech, voice drift |
Issues cascade across layers. An audio quality problem (Layer 1) causes transcription errors (Layer 2), which cause intent misclassification (Layer 3), which causes task failure. Without layer-by-layer instrumentation, you see the task failure but not the root cause.
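One way to make the cascade diagnosable is to record a per-turn trace that spans all four layers and walk it upstream-first when a turn fails. The `TurnTrace` fields and the numeric thresholds below are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TurnTrace:
    """Illustrative per-turn trace covering all four layers."""
    call_id: str
    turn_index: int
    # Layer 1: telephony & audio
    packet_loss_pct: float
    jitter_ms: float
    # Layer 2: ASR & transcription
    transcript: str
    asr_confidence: float
    # Layer 3: LLM & semantic
    intent: str
    intent_confidence: float
    ttft_ms: float
    # Layer 4: TTS & generation
    tts_latency_ms: float

def suspect_layer(trace: TurnTrace) -> Optional[str]:
    """Walk the layers upstream-first so a cascading failure is attributed
    to its root cause rather than the layer where the symptom appeared."""
    if trace.packet_loss_pct > 2.0 or trace.jitter_ms > 40:      # illustrative limits
        return "Layer 1: telephony & audio"
    if trace.asr_confidence < 0.6:
        return "Layer 2: ASR & transcription"
    if trace.intent_confidence < 0.7 or trace.ttft_ms > 800:
        return "Layer 3: LLM & semantic"
    if trace.tts_latency_ms > 300:
        return "Layer 4: TTS & generation"
    return None

trace = TurnTrace("call-123", 4, packet_loss_pct=3.5, jitter_ms=55,
                  transcript="uh can you repeat that", asr_confidence=0.41,
                  intent="fallback", intent_confidence=0.30, ttft_ms=620,
                  tts_latency_ms=180)
print(suspect_layer(trace))  # Layer 1 is blamed even though ASR and intent also degraded
```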
Real-Time vs Post-Call Analytics
Balance immediate alerting with deep post-call analysis:
| Approach | Purpose | Latency | Depth |
|---|---|---|---|
| Real-time | Detect degradation as it happens | Seconds | Surface-level indicators |
| Near-real-time | Pattern identification within sessions | Minutes | Trend analysis |
| Post-call | Root cause analysis, model improvement | Hours | Full conversation evaluation |
Real-time monitoring catches outages and severe degradation. Post-call analysis identifies systematic patterns—specific prompts that consistently underperform, intent categories with declining accuracy, or time-of-day latency variations—that inform model improvements and prompt optimization.
Dashboard Design and Visualization
Essential Dashboard Components
A production voice agent dashboard must answer four questions within 30 seconds:
- Is the system healthy? — Call volume trends, error rates, infrastructure status
- Are users satisfied? — CSAT trajectory, abandonment rates, sentiment patterns
- Where are the problems? — Latency percentile distributions, WER trends, containment drops
- What changed? — Deployment markers, model version annotations, configuration diffs
Display call volume trends, latency percentile distributions (p50/p90/p95), containment rates, and sentiment analysis with one-click drill-downs from anomaly to individual conversation transcript and audio playback.
Alert Configuration Best Practices
Set thresholds on percentile metrics, not averages. Average-based alerts mask degradation affecting minority populations of calls.
| Metric | Alert Threshold | Duration | Severity |
|---|---|---|---|
| P90 latency | >3.5s | 5 minutes | Warning |
| P95 latency | >5.0s | 5 minutes | Critical |
| Containment rate | <60% | 1 hour | Warning |
| WER | >8% | 15 minutes | Warning |
| Hallucination rate | >3% | 30 minutes | Critical |
| TTFW p95 | >800ms | 5 minutes | Warning |
Trigger notifications before customer-facing degradation becomes widespread. A p90 alert at 3.5s catches the problem when roughly 10% of users are affected, rather than waiting for degradation widespread enough to move the average.
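A sketch of the sustained-window rule: the alert fires only after the metric has stayed above its threshold for the full window, so transient spikes are filtered out. The 3.5s threshold and 300-second window follow the table above.

```python
class SustainedThresholdAlert:
    """Fires only when a metric stays above its threshold for a full window
    (a sketch, not a production alerting system)."""

    def __init__(self, threshold: float, window_seconds: float = 300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.breach_started_at = None

    def observe(self, value: float, now: float) -> bool:
        if value <= self.threshold:
            self.breach_started_at = None       # breach cleared: reset the clock
            return False
        if self.breach_started_at is None:
            self.breach_started_at = now        # first sample over the threshold
        return now - self.breach_started_at >= self.window_seconds

# p90 latency samples as (seconds since start, observed p90) pairs.
p90_alert = SustainedThresholdAlert(threshold=3.5, window_seconds=300)
for t, p90 in [(0, 3.8), (120, 3.9), (240, 3.7), (320, 4.1)]:
    if p90_alert.observe(p90, now=t):
        print(f"t={t}s: p90 latency has exceeded 3.5s for 5+ minutes")
```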
Metric Correlation Views
Link upstream failures to downstream impacts to trace root causes efficiently:
- High WER → Low intent accuracy → Low TSR — ASR degradation cascading to task failure
- Latency spike → High interruption rate → High abandonment — Infrastructure issue causing conversation breakdown
- Low confidence scores → High fallback rate → Low containment — ASR uncertainty driving escalations
Build correlation dashboards that surface these causal chains automatically, enabling operators to jump from symptom to root cause without manual investigation.
ROI and Business Impact Metrics
Cost Per Interaction
Calculate total resolution cost including infrastructure, model inference, and telephony:
| Channel | Cost Range | Context |
|---|---|---|
| Human agent | $5-8 per call | Fully loaded: salary, benefits, training, facilities |
| AI voice agent | $0.01-0.25 per minute | Infrastructure, model inference, telephony |
Comparison framework: For a 3-minute average call, AI costs $0.03-0.75 versus $5-8 for human handling—roughly a 7-265x cost reduction depending on complexity and infrastructure choices.
Automation ROI Formula
Formula:
ROI = (Containment Rate × Call Volume × Per-Call Savings - Infrastructure Costs) / Infrastructure Costs × 100
Expected returns: 200-500% ROI within 3-6 months for well-implemented deployments with sub-six-month payback periods.
Worked example: 10,000 monthly calls × 75% containment × $6 per-call savings = $45,000 monthly savings. Against $8,000 monthly infrastructure costs: ($45,000 - $8,000) / $8,000 = 462% ROI.
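The same formula and worked example in code form, for reuse with your own volumes and costs:

```python
def automation_roi(monthly_calls: int, containment_rate: float,
                   per_call_savings: float, monthly_infra_cost: float) -> float:
    """Automation ROI formula above, expressed as a percentage."""
    monthly_savings = monthly_calls * containment_rate * per_call_savings
    return 100 * (monthly_savings - monthly_infra_cost) / monthly_infra_cost

# Worked example from the text: 10,000 calls, 75% containment, $6 savings, $8,000 infrastructure.
print(f"ROI: {automation_roi(10_000, 0.75, 6.0, 8_000):.0f}%")  # about 462%
```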
Industry Benchmarks and Thresholds (2026)
Performance Benchmark Summary Table
| Metric | Baseline | Good | Top Performer | How to Measure |
|---|---|---|---|---|
| FCR | 70% | 75-80% | 85%+ | 48-72hr callback verification |
| Containment | 60% | 70-80% | 80%+ | Escalation event tracking |
| Latency P90 | 4.5s | 3.5s | <2.5s | Percentile distribution monitoring |
| WER | 8% | 5% | <3% | Reference transcript comparison |
| CSAT | 65% | 75-80% | 85%+ | In-call + AI-predicted scoring |
| MOS | 3.8 | 4.3 | 4.5+ | Crowdsourced + MOSNet evaluation |
| Intent Accuracy | 90% | 95% | 98%+ | Labeled test set evaluation |
| Hallucination Rate | 5% | 3% | <1% | Source grounding validation |
Benchmark Variation by Use Case
Performance expectations vary significantly by voice agent complexity:
| Use Case | Typical Containment | FCR Target | Latency Tolerance |
|---|---|---|---|
| FAQ / Information | 40-60% | 80%+ | P90 <4.0s |
| Appointment Scheduling | 75-85% | 90%+ | P90 <3.0s |
| Order Management | 70-80% | 80%+ | P90 <3.5s |
| Customer Service | 65-80% | 70-80% | P90 <3.5s |
| Technical Support | 50-65% | 65-75% | P90 <4.0s |
| Healthcare Triage | 55-70% | 75-85% | P90 <3.5s |
Simpler FAQ bots achieve 40-60% containment while complex customer service targets 75-80%. Adjust expectations by complexity rather than applying uniform benchmarks.
Production Monitoring and Alerting
Critical Alert Definitions
Configure tiered notifications linking severity to response expectations:
| Alert Type | Trigger | Severity | Response SLA |
|---|---|---|---|
| Containment drop | Below 60% for 1 hour | P2 - High | 30 minutes |
| Latency spike | P90 >3.5s for 5 minutes | P2 - High | 15 minutes |
| WER degradation | Above 8% for 15 minutes | P2 - High | 30 minutes |
| Hallucination increase | Above 3% for 30 minutes | P1 - Critical | 10 minutes |
| TTFW spike | P95 >800ms for 5 minutes | P3 - Medium | 1 hour |
| Total failure | Error rate >5% for 5 minutes | P1 - Critical | 5 minutes |
Incident Response Workflows
Establish playbooks linking alert types to diagnostic steps:
Latency alerts: Check infrastructure health → Review model inference times → Verify TTS provider status → Examine traffic patterns for load spikes
Accuracy alerts: Audit recent model or prompt changes → Compare WER distributions before/after → Review confidence score trends → Check ASR provider status
Containment alerts: Analyze escalation reason distribution → Review intent coverage gaps → Check for new conversation patterns → Verify knowledge base currency
Hallucination alerts: Validate knowledge base freshness → Review recent prompt modifications → Audit source grounding scores → Check retrieval pipeline health
Each playbook should terminate in either resolution or escalation within defined SLAs, with post-incident documentation capturing root cause and prevention measures.
Building a Measurement-Driven Voice Agent Practice
Voice agent analytics is not a dashboard project—it is an operational discipline. The teams that succeed in production share three practices: they instrument all four layers (telephony, ASR, LLM, TTS) independently, they alert on percentile distributions rather than averages, and they correlate upstream failures to downstream business impact.
Start with the metrics that drive decisions: FCR (70-85%), containment rate (80%+), WER (under 5%), and p90 latency (under 3.5s). Instrument these first, set alert thresholds, and build the correlation views that connect a WER spike to a containment drop to a CSAT decline. Everything else—MOS scores, slot filling accuracy, diarization—layers on top of that foundation.
At Hamming, we help teams validate these metrics before production deployment and continuously monitor them at scale. Whether you are establishing baseline measurements for a new voice agent or debugging a latency regression in an existing deployment, the definitions, formulas, and benchmarks in this guide provide the shared vocabulary your team needs to move from transcript logging to genuine observability.