🚀 Launch BF: Hamming AI (S24) - Make RAG & AI agents reliable (YC deal inside)

Sumanyu Sharma
Founder & CEO, Voice AI QA Pioneer

Has stress-tested 1M+ voice agent calls to find where they break.

June 2, 2024 • 4 min read

👋 Sumanyu from @Hamming (S24)

TLDR: Are you struggling to make your RAG & AI agents reliable? We're launching our LLM Experimentation Platform to help eng teams speed up iteration velocity, root-cause bad LLM outputs, and prevent regressions.

Quick filter: If your iteration loop depends on “eyeballing a few examples,” you don’t have a reliable system yet.

🌟 Click here to try our LLM Experimentation Platform 🌟

Our thesis: Experimentation drives reliability

Previously, I ran growth-focused eng and data teams at Tesla and Citizen. We learned that running experiments is the best way to move a metric. More experiments = more growth.

We believe the same is true for eng teams building AI products. More experiments = more reliability = more retention for your AI products.

[Image: more-experiments]

Problem: Making RAG and AI agents reliable feels like whack-a-mole

Here's the workflow most teams follow:

  1. Tweak your RAG or AI agents by indexing new documents, adding new tools, changing the prompts, models or other business logic.
  2. Eyeball how well your changes improved a handful of examples you wanted to fix. Often ad-hoc and slow.
  3. Ship the changes if they worked.
  4. Detect regressions when users complain of things breaking in production.
  5. Repeat steps 1 to 4 until you get tired.
| Step | Manual pain point | How Hamming helps |
|---|---|---|
| Tweak | Unclear impact of changes | Structured evals on real traces |
| Eyeball | Slow, low-signal review | Automated scoring at scale |
| Ship | Risk of silent regressions | Gate changes with tests |
| Detect | Users complain first | Real-time production scoring |

Steps 2 and 4 are often the slowest & most painful parts of the feedback loop. These are the steps we tackle.
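To make that concrete, here is the manual loop as a rough Python sketch (the function names are our own illustration, not anyone's API); steps 2 and 4 are the parts automated evals replace:

```python
# The manual loop above, as an illustrative sketch. Steps 2 and 4 are where
# teams burn most of their time.

def tweak_pipeline() -> str:
    """Step 1: change prompts, tools, indexed docs, or other business logic."""
    return "candidate-v2"

def eyeball(change_id: str, examples: list[str]) -> bool:
    """Step 2: slow, low-signal manual review of a handful of examples."""
    print(f"Manually reviewing {len(examples)} examples for {change_id}...")
    return True  # "looks good to me"

def ship(change_id: str) -> None:
    """Step 3: ship the change if it seemed to work."""
    print(f"Shipped {change_id}")

def wait_for_complaints() -> list[str]:
    """Step 4: regressions surface only when users complain in production."""
    return ["ticket-812: the agent is calling the wrong tool again"]

# Step 5: repeat until you get tired.
change = tweak_pipeline()
if eyeball(change, ["example-1", "example-2", "example-3"]):
    ship(change)
print(wait_for_complaints())
```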

Our take: Use LLMs as judges to speed up iteration velocity

We use LLMs to score the outputs of other LLMs. This is the fastest way to speed up the feedback loop.
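As a minimal sketch of the idea (not Hamming's API; `call_llm` is a placeholder stub for whatever model client you use), an LLM judge takes the question, the retrieved context, and the answer, and returns a structured score:

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for your model client (OpenAI, Anthropic, etc.).
    Stubbed so this sketch runs standalone."""
    return json.dumps({"score": 4, "reason": "Answer is grounded in the context."})

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Return JSON: {{"score": <1-5>, "reason": "<one sentence>"}}.
5 = fully correct and grounded in the context, 1 = wrong or hallucinated."""

def judge(question: str, context: str, answer: str) -> dict:
    """Score one output with an LLM judge and parse the structured verdict."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    return json.loads(raw)

verdict = judge(
    "What is our refund window?",
    "Refunds are accepted within 30 days of purchase.",
    "You can get a refund within 30 days.",
)
print(verdict)  # {'score': 4, 'reason': 'Answer is grounded in the context.'}
```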

[Image: midwit-evals]

Prod: Flag errors in production, before customers notice

We go beyond passive LLM / trace-level monitoring. We actively score your production outputs in real-time and flag cases the team needs to double-click on. This helps eng teams quickly prioritize cases they need to fix.
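A hedged sketch of what that looks like in code (the `judge` helper is the one sketched above, re-stubbed here so the snippet runs on its own; the review queue stands in for whatever alerting or triage tool your team uses):

```python
from collections import deque

def judge(question: str, context: str, answer: str) -> dict:
    """Stand-in for the LLM-as-judge helper sketched earlier."""
    return {"score": 2, "reason": "Answer contradicts the retrieved context."}

REVIEW_THRESHOLD = 3           # flag anything the judge scores below this
review_queue: deque = deque()  # stand-in for your alerting / triage tool

def score_production_trace(trace: dict) -> None:
    """Score one production output as it lands and flag low scores for review."""
    verdict = judge(trace["question"], trace["context"], trace["answer"])
    trace["ai_score"] = verdict["score"]
    trace["ai_reason"] = verdict["reason"]
    if verdict["score"] < REVIEW_THRESHOLD:
        review_queue.append(trace)  # surfaced before a customer complains

score_production_trace({
    "trace_id": "tr_456",
    "question": "Can I change my shipping address after ordering?",
    "context": "Addresses can be changed within 1 hour of placing an order.",
    "answer": "No, addresses can never be changed.",
})
print(len(review_queue))  # 1 case flagged for the team to review
```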

[Image: automated-production-scores]

Dev: Test new changes quickly and prevent regressions

We make offline evaluations easy, so you can make a change to your system and get feedback in minutes.
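One way to picture the offline loop, as a rough sketch under assumptions (`run_pipeline` stands in for the RAG or agent change under test and `judge` for an LLM judge; neither is Hamming's API): run the candidate over a golden dataset and block the ship if the mean judge score drops below your baseline.

```python
def run_pipeline(question: str) -> dict:
    """Stand-in for the RAG/agent change you want to test."""
    return {"context": "Refunds are accepted within 30 days of purchase.",
            "answer": "You can request a refund within 30 days."}

def judge(question: str, context: str, answer: str) -> dict:
    """Stand-in for the LLM-as-judge helper sketched earlier."""
    return {"score": 5, "reason": "Grounded and correct."}

GOLDEN = [
    {"question": "What is our refund window?"},
    {"question": "Do you ship internationally?"},
]
BASELINE_MEAN = 4.2  # mean judge score of the version currently in production

def evaluate(dataset: list[dict]) -> float:
    """Run every golden example through the candidate and average the judge scores."""
    scores = []
    for row in dataset:
        out = run_pipeline(row["question"])
        scores.append(judge(row["question"], out["context"], out["answer"])["score"])
    return sum(scores) / len(scores)

mean_score = evaluate(GOLDEN)
if mean_score < BASELINE_MEAN:
    raise SystemExit(f"Regression: {mean_score:.2f} < baseline {BASELINE_MEAN:.2f}")
print(f"OK to ship: mean judge score {mean_score:.2f}")
```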

Easily create golden datasets

Offline evaluations are bottlenecked by the need for a high-quality golden dataset of input/output pairs. We support converting production traces into dataset examples in one click.
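Conceptually (the field names below are assumptions for illustration, not Hamming's schema), the conversion lifts the logged input, retrieved context, and output of a trace into a dataset row you can review and correct:

```python
import json

def trace_to_example(trace: dict) -> dict:
    """Turn one logged production trace into a golden-dataset row."""
    return {
        "input": trace["question"],
        "reference_context": trace["retrieved_context"],
        "expected_output": trace["answer"],  # review/correct before accepting it as "golden"
        "source_trace": trace["trace_id"],
    }

trace = {
    "trace_id": "tr_123",
    "question": "What is our refund window?",
    "retrieved_context": ["Refunds are accepted within 30 days of purchase."],
    "answer": "You can request a refund within 30 days of purchase.",
}

with open("golden_dataset.jsonl", "a") as f:
    f.write(json.dumps(trace_to_example(trace)) + "\n")
```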

[Image: production-traces-to-dataset-examples]

Diagnose between retrieval, reasoning or function-calling errors quickly

Differentiating between retrieval, reasoning, and function-calling errors is time-consuming. We score each retrieved context on metrics like hallucination, recall, and precision to help you prioritize your eng effort where it matters most.
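For the retrieval piece, recall and precision reduce to simple set math once you know which documents should have been retrieved; hallucination is typically scored by an LLM judge against the retrieved context, as in the judge sketch above. A minimal sketch (our own illustration, not Hamming's scorer):

```python
def retrieval_metrics(retrieved_ids: list[str], relevant_ids: list[str]) -> dict:
    """Recall: how many of the relevant docs were retrieved.
    Precision: how many of the retrieved docs were actually relevant."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = retrieved & relevant
    return {
        "recall": len(hits) / len(relevant) if relevant else 0.0,
        "precision": len(hits) / len(retrieved) if retrieved else 0.0,
    }

print(retrieval_metrics(retrieved_ids=["doc_1", "doc_7"], relevant_ids=["doc_1", "doc_2"]))
# {'recall': 0.5, 'precision': 0.5}
# Low recall points at retrieval; good retrieval with a wrong answer points at
# reasoning or function-calling instead.
```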

[Image: measure-hallucinations]

Override AI scores

Sometimes our AI scores disagree with your definition of "good". We make it easy to override our scores with your own preferences. Our AI scorer learns from your feedback.
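As a rough sketch of the pattern (the storage and prompt format here are assumptions, not Hamming's implementation): record the human override alongside the AI score, then feed recent corrections back to the judge as calibration examples.

```python
overrides: list[dict] = []

def record_override(trace_id: str, ai_score: int, human_score: int, note: str) -> None:
    """Store a human correction that overrides the AI judge's score."""
    overrides.append({"trace_id": trace_id, "ai_score": ai_score,
                      "human_score": human_score, "note": note})

def calibration_examples(max_examples: int = 5) -> str:
    """Render recent corrections as few-shot guidance to append to the judge prompt."""
    lines = [
        f"- Case {o['trace_id']}: humans scored {o['human_score']} "
        f"(AI said {o['ai_score']}) because {o['note']}"
        for o in overrides[-max_examples:]
    ]
    return "Calibrate against these past corrections:\n" + "\n".join(lines)

record_override("tr_123", ai_score=4, human_score=2,
                note="the answer ignored the customer's follow-up question")
print(calibration_examples())
```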

[Image: learn-from-human-feedback]

Meet the team

Sumanyu previously helped Citizen (safety app; backed by Founders Fund, Sequoia, 8VC) grow its users by 4X, and at Tesla he grew an AI-powered sales program to $100s of millions in revenue/year.

Our ask

We previously launched Prompt Optimizer on BF, which saved 80% of manual prompt engineering effort. In this launch, we showed how teams use Hamming to build reliable RAG and AI agents.

  • YC Deal. 50% off our growth plan for the next 12 months, 1:1 workshops, premium support, Patagonia swag, and more ($10k+ worth of value). Link to our YC deal.
  • Warm intros. We'd love intros to anyone you know who wants to make their RAG and AI agents more reliable (including you!).

Email us here.

Book time on our Calendly.

Frequently Asked Questions

What is Hamming's LLM Experimentation Platform?

Hamming's LLM Experimentation Platform helps teams make RAG and agentic pipelines reliable by speeding up iteration velocity, root-causing bad outputs, and preventing regressions in production.

Why do RAG and AI agents need systematic experimentation?

Because reliability failures are often non-obvious: a retrieval change fixes one case and breaks another, model updates shift behavior, and regressions appear only after deployment. If your loop is “eyeball a few examples,” you’ll miss these. Experimentation turns this into a measurable loop—evaluate changes quickly, understand why outputs changed, and prevent the same failure from recurring.

How does Hamming help?

Hamming helps teams run offline evaluations and production scoring with LLM judges, convert real production traces into datasets, and differentiate retrieval failures from reasoning and tool-call failures. That makes it easier to iterate quickly and ship changes with confidence.

How should teams get started?

Start by defining a small “golden” dataset and evaluation criteria, then iterate in short cycles: run evals, inspect failures, and categorize root cause (retrieval vs reasoning vs tool calls). Ship with production scoring and alerts, and promote real production failures back into the dataset so your coverage grows automatically.
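A small sketch of that triage-and-promote step, under assumed field names and thresholds (not Hamming's schema): bucket each flagged failure by root cause, then append it to the golden dataset so coverage grows with every incident.

```python
import json

def root_cause(case: dict) -> str:
    """Very rough triage: retrieval, function-calling, or reasoning."""
    if case["retrieval_recall"] < 0.5:
        return "retrieval"
    if case.get("tool_call_error"):
        return "function-calling"
    return "reasoning"

def promote_to_golden(case: dict, path: str = "golden_dataset.jsonl") -> None:
    """Append a reviewed production failure to the golden dataset."""
    row = {"input": case["question"],
           "expected_output": case["corrected_answer"],
           "tag": root_cause(case)}
    with open(path, "a") as f:
        f.write(json.dumps(row) + "\n")

promote_to_golden({
    "question": "Do you ship to Canada?",
    "retrieval_recall": 0.0,   # the shipping-policy doc was never retrieved
    "tool_call_error": None,
    "corrected_answer": "Yes, we ship to Canada; delivery takes 5-7 business days.",
})
```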

Sumanyu Sharma

Founder & CEO

Previously Head of Data at Citizen, where he helped quadruple the user base. As Senior Staff Data Scientist at Tesla, he grew an AI-powered sales program to $100s of millions in revenue per year.

He researched AI-powered medical image search at the University of Waterloo, where he graduated with honors in Engineering on the dean's list.

“At Hamming, we're taking all of our learnings from Tesla and Citizen to build the future of trustworthy, safe and reliable voice AI agents.”