🚀 Launch BF: Hamming AI (S24) - Make RAG & AI agents reliable (YC deal inside)

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 1M+ voice agent calls to find where they break.

June 2, 2024 • 4 min read

👋 Sumanyu from @Hamming (S24)

TLDR: Are you struggling to make your RAG & AI agents reliable? We're launching our LLM Experimentation Platform to help eng teams speed up iteration velocity, root-cause bad LLM outputs, and prevent regressions.

Quick filter: If your iteration loop depends on “eyeballing a few examples,” you don’t have a reliable system yet.

🌟 Click here to try our LLM Experimentation Platform 🌟

Our thesis: Experimentation drives reliability

Previously, I ran growth-focused eng and data teams at Tesla and Citizen. We learned that running experiments is the best way to move a metric. More experiments = more growth.

We believe the same is true for eng teams building AI products. More experiments = more reliability = more retention for your AI products.

[Image: more-experiments]

Problem: Making RAG and AI agents reliable feels like whack-a-mole

Here's the workflow most teams follow:

  1. Tweak your RAG or AI agents by indexing new documents, adding new tools, changing the prompts, models or other business logic.
  2. Eyeball how well your changes improved a handful of examples you wanted to fix. Often ad-hoc and slow.
  3. Ship the changes if they worked.
  4. Detect regressions when users complain of things breaking in production.
  5. Repeat steps 1 to 4 until you get tired.
| Step | Manual pain point | How Hamming helps |
|---|---|---|
| Tweak | Unclear impact of changes | Structured evals on real traces |
| Eyeball | Slow, low-signal review | Automated scoring at scale |
| Ship | Risk of silent regressions | Gate changes with tests |
| Detect | Users complain first | Real-time production scoring |

Steps 2 and 4 are often the slowest & most painful parts of the feedback loop. These are the steps we tackle.
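To make that concrete, here is the manual loop as a rough Python sketch (the function names are our own illustration, not anyone's API); steps 2 and 4 are the parts automated evals replace:

```python
# The manual loop above, as an illustrative sketch. Steps 2 and 4 are where
# teams burn most of their time.

def tweak_pipeline() -> str:
    """Step 1: change prompts, tools, indexed docs, or other business logic."""
    return "candidate-v2"

def eyeball(change_id: str, examples: list[str]) -> bool:
    """Step 2: slow, low-signal manual review of a handful of examples."""
    print(f"Manually reviewing {len(examples)} examples for {change_id}...")
    return True  # "looks good to me"

def ship(change_id: str) -> None:
    """Step 3: ship the change if it seemed to work."""
    print(f"Shipped {change_id}")

def wait_for_complaints() -> list[str]:
    """Step 4: regressions surface only when users complain in production."""
    return ["ticket-812: the agent is calling the wrong tool again"]

# Step 5: repeat until you get tired.
change = tweak_pipeline()
if eyeball(change, ["example-1", "example-2", "example-3"]):
    ship(change)
print(wait_for_complaints())
```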

Our take: Use LLMs as judges to speed up iteration velocity

We use LLMs to score the outputs of other LLMs. This is the fastest way to speed up the feedback loop.
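As a minimal sketch of the idea (not Hamming's API; `call_llm` is a placeholder stub for whatever model client you use), an LLM judge takes the question, the retrieved context, and the answer, and returns a structured score:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for your model client (OpenAI, Anthropic, etc.).
    Stubbed so this sketch runs standalone."""
    return json.dumps({"score": 4, "reason": "Answer is grounded in the context."})

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Return JSON: {{"score": <1-5>, "reason": "<one sentence>"}}.
5 = fully correct and grounded in the context, 1 = wrong or hallucinated."""

def judge(question: str, context: str, answer: str) -> dict:
    """Score one output with an LLM judge and parse the structured verdict."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    return json.loads(raw)

verdict = judge(
    "What is our refund window?",
    "Refunds are accepted within 30 days of purchase.",
    "You can get a refund within 30 days.",
)
print(verdict)  # {'score': 4, 'reason': 'Answer is grounded in the context.'}
```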

[Image: midwit-evals]

Prod: Flag errors in production, before customers notice

We go beyond passive LLM / trace-level monitoring. We actively score your production outputs in real-time and flag cases the team needs to double-click on. This helps eng teams quickly prioritize cases they need to fix.
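A hedged sketch of what that looks like in code (the `judge` helper is the one sketched above, re-stubbed here so the snippet runs on its own; the review queue stands in for whatever alerting or triage tool your team uses):

```python
from collections import deque

def judge(question: str, context: str, answer: str) -> dict:
    """Stand-in for the LLM-as-judge helper sketched earlier."""
    return {"score": 2, "reason": "Answer contradicts the retrieved context."}

REVIEW_THRESHOLD = 3           # flag anything the judge scores below this
review_queue: deque = deque()  # stand-in for your alerting / triage tool

def score_production_trace(trace: dict) -> None:
    """Score one production output as it lands and flag low scores for review."""
    verdict = judge(trace["question"], trace["context"], trace["answer"])
    trace["ai_score"] = verdict["score"]
    trace["ai_reason"] = verdict["reason"]
    if verdict["score"] < REVIEW_THRESHOLD:
        review_queue.append(trace)  # surfaced before a customer complains

score_production_trace({
    "trace_id": "tr_456",
    "question": "Can I change my shipping address after ordering?",
    "context": "Addresses can be changed within 1 hour of placing an order.",
    "answer": "No, addresses can never be changed.",
})
print(len(review_queue))  # 1 case flagged for the team to review
```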

[Image: automated-production-scores]

Dev: Test new changes quickly and prevent regressions

We make offline evaluations easy, so you can make a change to your system and get feedback in minutes.
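One way to picture the offline loop, as a rough sketch under assumptions (`run_pipeline` stands in for the RAG or agent change under test and `judge` for an LLM judge; neither is Hamming's API): run the candidate over a golden dataset and block the ship if the mean judge score drops below your baseline.

```python
def run_pipeline(question: str) -> dict:
    """Stand-in for the RAG/agent change you want to test."""
    return {"context": "Refunds are accepted within 30 days of purchase.",
            "answer": "You can request a refund within 30 days."}

def judge(question: str, context: str, answer: str) -> dict:
    """Stand-in for the LLM-as-judge helper sketched earlier."""
    return {"score": 5, "reason": "Grounded and correct."}

GOLDEN = [
    {"question": "What is our refund window?"},
    {"question": "Do you ship internationally?"},
]
BASELINE_MEAN = 4.2  # mean judge score of the version currently in production

def evaluate(dataset: list[dict]) -> float:
    """Run every golden example through the candidate and average the judge scores."""
    scores = []
    for row in dataset:
        out = run_pipeline(row["question"])
        scores.append(judge(row["question"], out["context"], out["answer"])["score"])
    return sum(scores) / len(scores)

mean_score = evaluate(GOLDEN)
if mean_score < BASELINE_MEAN:
    raise SystemExit(f"Regression: {mean_score:.2f} < baseline {BASELINE_MEAN:.2f}")
print(f"OK to ship: mean judge score {mean_score:.2f}")
```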

Easily create golden datasets

Offline evaluations are bottlenecked by the need for a high-quality golden dataset of input/output pairs. We support converting production traces into dataset examples in one click.
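Conceptually (the field names below are assumptions for illustration, not Hamming's schema), the conversion lifts the logged input, retrieved context, and output of a trace into a dataset row you can review and correct:

```python
import json

def trace_to_example(trace: dict) -> dict:
    """Turn one logged production trace into a golden-dataset row."""
    return {
        "input": trace["question"],
        "reference_context": trace["retrieved_context"],
        "expected_output": trace["answer"],  # review/correct before accepting it as "golden"
        "source_trace": trace["trace_id"],
    }

trace = {
    "trace_id": "tr_123",
    "question": "What is our refund window?",
    "retrieved_context": ["Refunds are accepted within 30 days of purchase."],
    "answer": "You can request a refund within 30 days of purchase.",
}

with open("golden_dataset.jsonl", "a") as f:
    f.write(json.dumps(trace_to_example(trace)) + "\n")
```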

[Image: production-traces-to-dataset-examples]

Diagnose between retrieval, reasoning or function-calling errors quickly

Differentiating between retrieval, reasoning, and function-calling errors is time-consuming. We score each retrieved context on metrics like hallucination, recall, and precision to help you prioritize your eng effort where it matters most.
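For the retrieval piece, recall and precision reduce to simple set math once you know which documents should have been retrieved; hallucination is typically scored by an LLM judge against the retrieved context, as in the judge sketch above. A minimal sketch (our own illustration, not Hamming's scorer):

```python
def retrieval_metrics(retrieved_ids: list[str], relevant_ids: list[str]) -> dict:
    """Recall: how many of the relevant docs were retrieved.
    Precision: how many of the retrieved docs were actually relevant."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = retrieved & relevant
    return {
        "recall": len(hits) / len(relevant) if relevant else 0.0,
        "precision": len(hits) / len(retrieved) if retrieved else 0.0,
    }

print(retrieval_metrics(retrieved_ids=["doc_1", "doc_7"], relevant_ids=["doc_1", "doc_2"]))
# {'recall': 0.5, 'precision': 0.5}
# Low recall points at retrieval; good retrieval with a wrong answer points at
# reasoning or function-calling instead.
```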

[Image: measure-hallucinations]

Override AI scores

Sometimes our AI scores disagree with your definition of "good". We make it easy to override our scores with your own preferences. Our AI scorer learns from your feedback.
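As a rough sketch of the pattern (the storage and prompt format here are assumptions, not Hamming's implementation): record the human override alongside the AI score, then feed recent corrections back to the judge as calibration examples.

```python
overrides: list[dict] = []

def record_override(trace_id: str, ai_score: int, human_score: int, note: str) -> None:
    """Store a human correction that overrides the AI judge's score."""
    overrides.append({"trace_id": trace_id, "ai_score": ai_score,
                      "human_score": human_score, "note": note})

def calibration_examples(max_examples: int = 5) -> str:
    """Render recent corrections as few-shot guidance to append to the judge prompt."""
    lines = [
        f"- Case {o['trace_id']}: humans scored {o['human_score']} "
        f"(AI said {o['ai_score']}) because {o['note']}"
        for o in overrides[-max_examples:]
    ]
    return "Calibrate against these past corrections:\n" + "\n".join(lines)

record_override("tr_123", ai_score=4, human_score=2,
                note="the answer ignored the customer's follow-up question")
print(calibration_examples())
```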

[Image: learn-from-human-feedback]

Meet the team

Sumanyu previously helped Citizen (safety app; backed by Founders Fund, Sequoia, 8VC) grow its users by 4X, and at Tesla he grew an AI-powered sales program to $100s of millions in revenue/year.

Our ask

We previously launched Prompt Optimizer on BF, which saved 80% of manual prompt engineering effort. In this launch, we showed how teams use Hamming to build reliable RAG and AI agents.

  • YC Deal. 50% off our growth plan for the next 12 months, 1:1 workshops, premium support, Patagonia swag, and more ($10k+ worth of value). Link to our YC deal.
  • Warm intros. We'd love intros to anyone you know who wants to make their RAG and AI agents more reliable (including you!).

Email us here.

Book time on our Calendly.

Frequently Asked Questions

What is Hamming's LLM Experimentation Platform?

Hamming's LLM Experimentation Platform helps teams make RAG and agentic pipelines reliable by speeding up iteration velocity, root-causing bad outputs, and preventing regressions in production.

Why do RAG and AI agents need systematic experimentation?

Because reliability failures are often non-obvious: a retrieval change fixes one case and breaks another, model updates shift behavior, and regressions appear only after deployment. If your loop is “eyeball a few examples,” you’ll miss these. Experimentation turns this into a measurable loop—evaluate changes quickly, understand why outputs changed, and prevent the same failure from recurring.

How does Hamming help?

Hamming helps teams run offline evaluations and production scoring with LLM judges, convert real production traces into datasets, and differentiate retrieval failures from reasoning and tool-call failures. That makes it easier to iterate quickly and ship changes with confidence.

How should teams get started?

Start by defining a small “golden” dataset and evaluation criteria, then iterate in short cycles: run evals, inspect failures, and categorize root cause (retrieval vs reasoning vs tool calls). Ship with production scoring and alerts, and promote real production failures back into the dataset so your coverage grows automatically.
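A small sketch of that triage-and-promote step, under assumed field names and thresholds (not Hamming's schema): bucket each flagged failure by root cause, then append it to the golden dataset so coverage grows with every incident.

```python
import json

def root_cause(case: dict) -> str:
    """Very rough triage: retrieval, function-calling, or reasoning."""
    if case["retrieval_recall"] < 0.5:
        return "retrieval"
    if case.get("tool_call_error"):
        return "function-calling"
    return "reasoning"

def promote_to_golden(case: dict, path: str = "golden_dataset.jsonl") -> None:
    """Append a reviewed production failure to the golden dataset."""
    row = {"input": case["question"],
           "expected_output": case["corrected_answer"],
           "tag": root_cause(case)}
    with open(path, "a") as f:
        f.write(json.dumps(row) + "\n")

promote_to_golden({
    "question": "Do you ship to Canada?",
    "retrieval_recall": 0.0,   # the shipping-policy doc was never retrieved
    "tool_call_error": None,
    "corrected_answer": "Yes, we ship to Canada; delivery takes 5-7 business days.",
})
```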

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to $100s of millions in revenue per year.

He researched AI-powered medical image search at the University of Waterloo, where he graduated with honors in Engineering on the dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”