RAG Debugging

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 1M+ voice agent calls to find where they break.

April 16, 2024 · 4 min read

What is RAG?

RAG stands for Retrieval-Augmented Generation: an approach where a language model retrieves relevant information from an external knowledge source to inform its generations. The key components, sketched in code after the list, are:

  1. A retriever that finds relevant documents/passages from a knowledge corpus.
  2. A generator (language model) that incorporates the retrieved information to produce the final output.
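
To make the two components concrete, here is a minimal retrieve-then-generate sketch. The tiny corpus, the word-overlap scorer, and the call_llm helper are placeholders for a real vector index and model client, not any particular framework's API:

```python
# Minimal RAG sketch: a retriever ranks passages, a generator answers from them.
corpus = [
    "Refunds are processed within 5 business days.",
    "Premium plans include priority support.",
    "Password resets are sent to the account email address.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Retriever: rank passages by naive word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def call_llm(prompt: str) -> str:
    """Placeholder for your model client (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError

def answer(query: str) -> str:
    """Generator: condition the model on the retrieved passages."""
    contexts = retrieve(query)
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(f"- {c}" for c in contexts) +
        f"\n\nQuestion: {query}"
    )
    return call_llm(prompt)
```

In production the word-overlap scorer becomes embedding search over a vector store, but the retrieve-then-generate shape stays the same.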

Quick filter: If you can’t tell whether the failure is retrieval or reasoning, you can’t fix it yet.

Some benefits of RAG include:

  • Context expansion - Allows the model to draw upon a large domain-specific knowledge base instead of relying solely on its parameters.
  • Reduced hallucinations - The retrieved contexts provide factual grounding, so the model is less likely to invent unsupported claims.
  • Better explainability - It's easier to isolate why a certain answer was produced and provide citations.

Measuring RAG Performance

When developing and deploying RAG systems, it's critical to measure the performance of both the retriever and end-to-end system:

End-to-end Metrics

These are domain-specific metrics that measure end-to-end system performance, i.e., is the final answer good?

| Metric | What it measures | Use it to decide |
| --- | --- | --- |
| Factual accuracy | Grounding to trusted sources | If answers match reality |
| Style | Tone and readability | If output fits the audience |
| Conciseness | Signal-to-noise | If responses are too long |
| Toxicity | Unsafe or harmful content | If outputs need filtering |

Hamming includes the following end-to-end metrics by default:

  • Factual Accuracy - Measures how well the generated text aligns with established facts and information from reliable sources.
  • Style - Evaluates the fluency, coherence, and appropriateness of the generated text for the intended audience and purpose.
  • Conciseness - Assesses how well the model conveys information efficiently, avoiding unnecessary repetition or irrelevant details.
  • Toxicity - Checks the generated text for inappropriate, offensive, or harmful content that could negatively impact users.

The key is making an AI judge aligned with your definition of good. “Good” is different for a support bot vs. a medical assistant.
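
One way to encode "your definition of good" is a judge prompt with a domain-specific rubric. The rubric wording and score scale below are illustrative, not Hamming's built-in prompts; call_llm is the same placeholder client as in the earlier sketch:

```python
import json

JUDGE_PROMPT = """You are grading a support-bot answer.

Question: {question}
Retrieved context: {context}
Answer: {answer}

Score each criterion from 1 (bad) to 5 (good):
- factual_accuracy: every claim is supported by the context
- style: tone matches a friendly customer-support voice
- conciseness: no repetition or irrelevant detail
- toxicity: 5 means no unsafe or offensive content

Return JSON like {{"factual_accuracy": 5, "style": 4, "conciseness": 5, "toxicity": 5}}
"""

def judge(question: str, context: str, answer: str) -> dict:
    """Ask the judge model to grade one example against the rubric."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    return json.loads(raw)
```

A medical assistant would swap in a stricter rubric (for example, penalizing any unsupported clinical claim); that rubric work is exactly the alignment the judge needs.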

Retrieval Metrics

The following scores measure the quality of the retrieved contexts, i.e., were the retrieved documents good?

  • Precision - The % of retrieved documents relevant to the user input. Helpful in measuring signal/noise ratio.
  • Recall - The % of statements from the golden output for a given input that are contained in the contexts. If recall is low, no prompt tweaking will improve the answer; the key is to improve the retrieval pipeline through better chunking and search parameters. This is the #1 trap we see.
  • Hallucination - The mirror image of recall: the % of statements in the AI output that cannot be attributed to the retrieved contexts.
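
At the statement level, all three scores reduce to counting what is and isn't supported. A minimal sketch, using an exact-substring check as a stand-in for the LLM-based attribution you would use in practice:

```python
def supported_by(statement: str, contexts: list[str]) -> bool:
    """Naive stand-in for an LLM entailment/attribution check."""
    return any(statement.lower() in c.lower() for c in contexts)

def precision(relevance_judgments: list[bool]) -> float:
    """% of retrieved documents judged relevant to the user input."""
    return sum(relevance_judgments) / len(relevance_judgments) if relevance_judgments else 0.0

def recall(golden_statements: list[str], contexts: list[str]) -> float:
    """% of golden-output statements that appear in the retrieved contexts."""
    found = sum(supported_by(s, contexts) for s in golden_statements)
    return found / len(golden_statements) if golden_statements else 0.0

def hallucination(answer_statements: list[str], contexts: list[str]) -> float:
    """% of AI-output statements that cannot be attributed to the contexts."""
    unsupported = sum(not supported_by(s, contexts) for s in answer_statements)
    return unsupported / len(answer_statements) if answer_statements else 0.0
```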

RAG Debugging Pro-Tips

Here's how we think you should debug your RAG pipeline:

Step 1: Run evals on the end-to-end output

The goal here is to isolate specific examples that are underperforming. Debugging averages is a slow way to learn.
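
Concretely, that means sorting the eval results and reading the worst cases rather than staring at the mean. A sketch, assuming each result is a dict carrying the end-to-end scores from above (the field names are hypothetical):

```python
def worst_cases(results: list[dict], metric: str = "factual_accuracy", n: int = 10) -> list[dict]:
    """Return the n lowest-scoring examples so you can read them one by one."""
    return sorted(results, key=lambda r: r["scores"][metric])[:n]

# e.g. hand-review the ten weakest answers on factual accuracy:
# for case in worst_cases(eval_results):
#     print(case["input"], case["scores"])
```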

Step 2: Isolate between reasoning and retrieval errors

Use the retrieval metrics to help you differentiate between retrieval errors and reasoning errors.

With Hamming, you can run one-off scores. This is more cost-efficient than running retrieval scores on all cases.
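
One way to make the split mechanical is to threshold the retrieval scores on each failing example. The thresholds below are illustrative starting points, not recommended values:

```python
def classify_failure(recall: float, precision: float, hallucination: float) -> str:
    """Rough triage of a single failing example; tune thresholds on your own data."""
    if recall < 0.5:
        return "retrieval"   # the needed facts never reached the model
    if hallucination > 0.3 or precision < 0.5:
        return "reasoning"   # the facts were there (or buried in noise) and got misused
    return "inspect manually"
```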

Step 3: If retrieval is the issue

If recall is low, you likely have a retrieval issue. Focus on improving your indexing and retrieval pipeline.
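
A common first lever is re-chunking with overlap so that a fact is not split across chunk boundaries. A minimal character-window sketch (the size and overlap values are examples to tune, not recommendations):

```python
def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows before indexing."""
    step = size - overlap
    return [text[start:start + size] for start in range(0, max(len(text) - overlap, 1), step)]
```

Other levers in the same family: retrieving more candidates and re-ranking them, or adding metadata filters so search stays within the right document set.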

Step 4: If reasoning is the issue

If precision and recall are high, but hallucination is also high, you likely have reasoning errors. Sometimes recall is high but precision is low: the right facts were retrieved but are buried in noise, which also surfaces as reasoning errors. The next best move is iterating on the prompts or using a smarter model.
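
On the prompt side, a typical iteration is to constrain the model explicitly to the retrieved context and give it an escape hatch for missing information. The wording below is illustrative, not a recommended template:

```python
GROUNDED_PROMPT = """Answer the question using ONLY the numbered context passages below.
If the passages do not contain the answer, reply "I don't know" instead of guessing.
Cite the passage number for every claim you make.

Context:
{contexts}

Question: {question}
"""
```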

Step 5: Make the necessary changes

Update your prompts or retrieval pipeline to address the retrieval and reasoning errors you found.

Step 6: Re-test

Run evals on the end-to-end output to make sure you're improving. It's common to experience regressions after making a small change to your AI pipeline.
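
A lightweight way to catch those regressions is to diff per-example scores between the old and new pipeline and flag anything that got worse. A sketch, assuming results are keyed by example id:

```python
def regressions(before: dict[str, float], after: dict[str, float], tol: float = 0.05) -> list[str]:
    """Return example ids whose score dropped by more than tol after the change."""
    return [eid for eid, old in before.items() if eid in after and after[eid] < old - tol]
```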

🤔 Some people believe you can do label-free evaluations of just the retrieval pipeline. In our experience, checking end-to-end quality first and THEN measuring retrieval performance is more useful, because it correlates better with how a user judges quality.

Frequently Asked Questions

Why do RAG failures need to be split into retrieval and reasoning errors?

RAG systems can fail in two places: retrieval (wrong or missing docs) and reasoning (the model ignores or misuses good docs). If you do not separate those, you will fix the wrong thing and waste a lot of time.

How do I tell whether a failure is a retrieval problem or a reasoning problem?

Start by checking whether the necessary facts appear in the retrieved context. If key facts are missing, it is a retrieval/recall problem and prompt tweaks will not fix it. If the context is correct but the answer is wrong, it is usually reasoning, ranking/precision, or instruction-following.

How does Hamming help with RAG debugging?

Hamming scores both the end-to-end output and the retrieval layer, so you can see whether failures come from low recall, low precision, or hallucination. It also supports one-off scoring on specific examples, inspection of retrieved chunks, and regression tracking after a fix.

What are the most common RAG failure modes?

The usual suspects are low recall from poor chunking or retrieval parameters, low precision from noisy retrieval, and answers that cite relevant context but still miss the user's intent. Data drift is another common one: the knowledge base changes and yesterday's "good" answer becomes outdated without anyone noticing.

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As a Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to hundreds of millions in revenue per year.

He researched AI-powered medical image search at the University of Waterloo, where he graduated with Engineering honors on the dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”