Langfuse traces + RedHop reports: what each one tells you

If you run RAG in production you have probably reached for two different tools:

Langfuse (or LangSmith, Phoenix, Helicone, any LLM-observability product) for tracing. Every LLM call gets a span. You see latency, cost, the prompt that went in, the answer that came out, and the full trace tree across your agent or chain.
A retrieval diagnostic for retrieval-specific failures. There isn’t really an obvious tool here, which is why this guide exists.

The interesting thing is that the two questions barely overlap. Langfuse answers “what happened, when, and how much did it cost.” RedHop answers “why did retrieval return the wrong chunks.” You want both. The integration is unusually painless because RedHop’s report is intrinsic, structured, and serializable: it drops straight into a Langfuse metadata field, and neither SDK has to know about the other.

This guide builds the integration on a LangChain BM25 pipeline (the most common starting point), pipes a workload through it, and shows what Langfuse and RedHop each tell you about the same failure.

The pipeline we’re observing

The setup any team would actually have: a LangChain BM25 retriever over some product docs, an LLM that turns the retrieved chunks into an answer, and Langfuse wrapping the whole thing so each query produces a trace.

from langchain_community.retrievers import BM25Retriever
from langfuse import Langfuse
from langfuse.openai import openai  # auto-instrumented client

CORPUS = [
    "Refund Policy. Refunds are available within thirty days of purchase.",
    "Termination for convenience. Either party may terminate this agreement.",
    "Governing Law. This agreement is governed by the laws of California.",
    "Limitation of Liability. The cap is twelve months of fees.",
    # ... real corpus would be much larger
]

langfuse = Langfuse()
retriever = BM25Retriever.from_texts(CORPUS)
retriever.k = 5


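def search(query: str) -> list[str]:
    return [d.page_content for d in retriever.invoke(query)]


def build_prompt(query: str, chunks: list[str]) -> str:
    # Minimal stand-in; swap in your real prompt template.
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"


def answer(query: str) -> str:
    # Setting input on the root span is what puts the query on the trace.
    with langfuse.start_as_current_span(name="rag.query", input=query) as span:
        chunks = search(query)
        # Keep the retrieved chunks on the trace so they can be replayed later.
        span.update_trace(metadata={"retrieved_texts": chunks})
        prompt = build_prompt(query, chunks)
        resp = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        text = resp.choices[0].message.content
        span.update(output=text)
        return text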

This works. Langfuse stores every query, every retrieved chunk, every LLM response. You can grep your traces, you can A/B-test prompts, you can spot the slow ones. Good infrastructure.

But the trace doesn’t answer “why did retrieval miss.” That’s a different question. When a user complains the bot got their refund question wrong, you open the trace, see five chunks were retrieved, none of them mention “refund,” and… now what? Was it the query? The corpus? The chunker? The retriever? The trace shows the what. It does not show the why.

What RedHop adds

RedHop’s Decision Report is what you reach for here. It is structured data about how the query interacted with the corpus and the retrieved candidates: which query terms matched zero chunks, how concentrated the score mass was, whether the query looks templated. It also carries a small, closed registry of bounded hints that fire on documented failure shapes, each citing the measured finding that justifies it.

There’s no retriever to swap out. RedHop has a function called analyze_context(query, chunks) that takes the candidates you already have and produces the report. Drop it next to your existing call:

import redhop

def answer(query: str) -> str:
    with langfuse.start_as_current_span(name="rag.query", input=query) as span:
        texts = search(query)

        # NEW: hand the retrieved candidates to RedHop for diagnosis.
        chunks = [redhop.Chunk(t, id=str(i), source="langchain")
                  for i, t in enumerate(texts)]
        report = redhop.analyze_context(query, chunks)

        # Existing LLM call.
        prompt = build_prompt(query, texts)
        resp = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        text = resp.choices[0].message.content

        # Attach the report to the Langfuse span so it lands on the trace.
        span.update(output=text, metadata=redhop_report_to_metadata(report))
        # Keep the raw chunks at trace level so a later audit can replay them.
        span.update_trace(metadata={"retrieved_texts": texts})
        return text

redhop_report_to_metadata is a five-line helper we’ll get to in a second. The important bit is that this is the entire integration. The retriever doesn’t know RedHop exists. Langfuse doesn’t know RedHop exists. Both work as before, plus the trace now carries structured retrieval signals.

The metadata bridge

RedHop ships a helper for exactly this:

from redhop.otel import report_to_attributes

def redhop_report_to_metadata(report):
    # report_to_attributes() returns an OTel-legal flat dict
    # (bool / int / float / str / list[str], all keys under "redhop.").
    # Langfuse metadata accepts arbitrary JSON, so the dict drops in.
    return report_to_attributes(report)

The dict you end up with on every Langfuse trace looks like this:

{
  "redhop.strategy": "raw_topk",
  "redhop.auto_decision": "passthrough",
  "redhop.n_selected": 0,
  "redhop.retained_evidence_ratio": 1.0,
  "redhop.low_confidence": true,
  "redhop.diagnosis.empty_context": true,
  "redhop.diagnosis.n_candidates": 5,
  "redhop.diagnosis.hints": ["empty_context"],
  "redhop.diagnosis.zero_match_terms": ["cancel", "money", "back"]
}

That last line is the moment. The trace previously told you “five chunks were retrieved and the LLM answered wrong.” Now it also tells you the query used the words “cancel,” “money,” and “back,” and none of those words exist in the corpus. The corpus uses “refund” and “termination for convenience.” The diagnosis was sitting in the candidates all along; RedHop just surfaces it.
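To make the zero-match signal concrete, here is a minimal sketch of the idea in plain Python. It is not RedHop's implementation: the tokenizer, the `STOPWORDS` set, and the `zero_match_terms` function are all assumptions made up for illustration, and the corpus is the four-line toy corpus from earlier.

```python
import re

# Hypothetical stop-word list for this sketch; a real pipeline would bring its own.
STOPWORDS = {"a", "an", "and", "can", "do", "get", "how", "i",
             "is", "my", "the", "to", "what"}

CORPUS = [
    "Refund Policy. Refunds are available within thirty days of purchase.",
    "Termination for convenience. Either party may terminate this agreement.",
    "Governing Law. This agreement is governed by the laws of California.",
    "Limitation of Liability. The cap is twelve months of fees.",
]


def tokenize(text: str) -> list[str]:
    # Lowercase, then keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def zero_match_terms(query: str, corpus: list[str]) -> list[str]:
    # Vocabulary = every token that appears in at least one document.
    vocab = {tok for doc in corpus for tok in tokenize(doc)}
    seen: set[str] = set()
    missing: list[str] = []
    for tok in tokenize(query):
        if tok in STOPWORDS or tok in vocab or tok in seen:
            continue
        seen.add(tok)
        missing.append(tok)  # preserves query order
    return missing


print(zero_match_terms("Can I cancel and get my money back?", CORPUS))
# → ['cancel', 'money', 'back']
```

Exact-token matching is the point here: "cancel" and "terminate" mean the same thing to a user, but a lexical retriever like BM25 sees no shared token.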

What each tool answers

After the integration, here is what each tool is actually good for:

Question	Tool
When did this query run?	Langfuse
How much did it cost in tokens?	Langfuse
What prompt did the LLM see?	Langfuse
What were the retrieved chunks?	Langfuse
Did retrieval find evidence above the grounding bar?	RedHop (`low_confidence`)
Which query terms appear nowhere in the corpus?	RedHop (`zero_match_terms`)
Why did retrieval probably fail?	RedHop (`hints[].code`)
What does the evidence say to try first?	RedHop (workload audit)
Did the prompt fit the budget?	Both (different fields)
Did the chunker break a paragraph mid-thought?	Neither (open problem)

The split is clean: Langfuse stores the trace, RedHop interprets the retrieval. They barely overlap (prompt budget is the one question both can speak to, through different fields), and they never fight.

Workload-level diagnosis

A single trace tells you about one query. The interesting question is across hundreds. RedHop ships summarize_diagnoses for exactly this: hand it the reports from a slice of your production traffic and get one focus recommendation back, citing the measured finding behind it.

A common way to feed it: pull the last N traces out of Langfuse, run each one through analyze_context again (we have the original query and the retrieved chunks in the trace), and aggregate.

import redhop

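# Pull recent traces. The exact call depends on your langfuse version:
# v2 exposes langfuse.fetch_traces(...), v3 exposes langfuse.api.trace.list(...).
# Either way the traces sit on the response's .data, and you may need to page
# through results if the server caps the page size. The shape is what matters.
traces = langfuse.api.trace.list(limit=500, name="rag.query").data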

reports = []
for t in traces:
    query = t.input
    # The chunks the retriever returned are already in the trace.
    retrieved_texts = t.metadata.get("retrieved_texts", [])
    chunks = [redhop.Chunk(text, id=str(i), source="trace")
              for i, text in enumerate(retrieved_texts)]
    reports.append(redhop.analyze_context(query, chunks))

print(redhop.summarize_diagnoses(reports))

The rendered summary (excerpt):

RedHop Workload Audit
═════════════════════

  Reports aggregated: 487

Hint histogram
──────────────
  - empty_context              112  ( 23%)
  - vocab_mismatch             162  ( 33%)
  - low_confidence             134  ( 28%)

Rates
─────
  Empty-context rate:    23%
  Low-confidence rate:   28%
  Corpus-stats coverage: 0%

Focus
─────
  Code: vocab_mismatch
  33% of queries had most terms missing from the corpus. Top gap terms:
  "cancel", "refund", "subscription", "trial", "billing", "charge".
  Rephrasing toward the documents' vocabulary is the measured first
  fix, and dense retrieval (retrieval="hybrid") was measured to lift
  retention on exactly this shape.
      evidence: docs/findings/MULTIHOP_HYBRID.md

Exactly one focus per audit. The cited finding is what justifies the recommendation. RedHop never auto-applies the fix; the action is yours.
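If you want a feel for how a histogram and a single focus could fall out of per-trace metadata, the sketch below aggregates flat dicts shaped like the one shown earlier. This is an illustration, not how summarize_diagnoses works internally: the `pick_focus` rule (most frequent hint, subject to a share threshold) and the 0.2 threshold are assumptions, and `TRACE_METADATA` is made-up sample data.

```python
from collections import Counter

# Hypothetical per-trace metadata, keyed like the flat RedHop dict.
TRACE_METADATA = [
    {"redhop.diagnosis.hints": ["vocab_mismatch"]},
    {"redhop.diagnosis.hints": ["empty_context"]},
    {"redhop.diagnosis.hints": ["vocab_mismatch"]},
    {"redhop.diagnosis.hints": []},
]


def hint_histogram(rows: list[dict]) -> Counter:
    counts: Counter = Counter()
    for row in rows:
        # Count each hint at most once per trace.
        counts.update(set(row.get("redhop.diagnosis.hints", [])))
    return counts


def pick_focus(rows: list[dict], threshold: float = 0.2):
    # Assumed rule: the most frequent hint wins, but only if its share of
    # traces reaches the threshold. Otherwise report nothing.
    counts = hint_histogram(rows)
    if not rows or not counts:
        return None
    code, n = counts.most_common(1)[0]
    return code if n / len(rows) >= threshold else None


print(dict(hint_histogram(TRACE_METADATA)))
print(pick_focus(TRACE_METADATA))
```

Returning `None` when nothing clears the threshold mirrors the "no intervention indicated" outcome: a healthy workload should produce no focus rather than a forced one.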

Notice the 0% corpus-stats coverage. Because we only had what was in the trace (the retrieved chunks, not the corpus), we got Layer-1 facts only. For the full corpus-level diagnosis, point RedHop at the same corpus once, separately:

doc = redhop.Document.from_chunks(
    [redhop.Chunk(t, id=str(i), source="corpus") for i, t in enumerate(CORPUS)]
)
reports = [doc.context(t.input).report for t in traces]
print(redhop.summarize_diagnoses(reports))

Now corpus_stats_coverage is 100% and the summary’s top_zero_match_terms is populated with the exact words your users ask about that your docs don’t contain. That list is often the most useful artifact in the report: it’s a workload-derived Vocabulary seed, ready to compile into a synonym dict.
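As a rough sketch of what "compile into a synonym dict" could look like by hand, the snippet below counts gap terms across flat metadata dicts, emits an empty skeleton for a human to fill in, and uses the filled dict to expand a query before it reaches a lexical retriever. None of this is RedHop's Vocabulary API: `top_gap_terms`, `synonym_skeleton`, and `expand_query` are hypothetical helpers, and the mapping itself is something you would write after reading your own docs.

```python
from collections import Counter


def top_gap_terms(rows: list[dict], n: int = 5) -> list[str]:
    # rows: flat metadata dicts carrying "redhop.diagnosis.zero_match_terms".
    counts: Counter = Counter()
    for row in rows:
        counts.update(row.get("redhop.diagnosis.zero_match_terms", []))
    return [term for term, _ in counts.most_common(n)]


def synonym_skeleton(terms: list[str]) -> dict[str, list[str]]:
    # Each gap term maps to an empty list, to be filled with corpus vocabulary.
    return {term: [] for term in terms}


def expand_query(query: str, synonyms: dict[str, list[str]]) -> str:
    # Append the corpus-side synonyms of every query word that has any.
    extra: list[str] = []
    for word in query.lower().split():
        extra.extend(synonyms.get(word.strip("?.,!"), []))
    return query if not extra else f"{query} {' '.join(extra)}"


rows = [
    {"redhop.diagnosis.zero_match_terms": ["cancel", "money", "back"]},
    {"redhop.diagnosis.zero_match_terms": ["cancel", "trial"]},
]
seed = synonym_skeleton(top_gap_terms(rows, n=1))  # {'cancel': []}
seed["cancel"] = ["terminate", "termination"]      # filled in by a human
print(expand_query("Can I cancel?", seed))
```

The expanded query keeps the user's original words and only adds terms, so a retriever that already matched something is unlikely to get worse from it.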

Honest limits

Both tools have things they don’t do. Worth flagging before someone finds out the hard way.

No gold labels, no quality scores. Without a ground-truth answer set, neither tool can tell you your retrieval recall is 73% or your answer F1 is 0.41. RedHop measures failure shapes: “your queries use words your docs don’t.” That is genuinely actionable, but it isn’t an evaluation. For graded answer quality on a gold set, RedHop ships an evaluate surface and Langfuse has its own scoring product.
analyze_context observes waste; it does not remove it. If you want to prune the dilution it flags (the measured −80% prompt tokens with gold evidence kept), you migrate retrieval into RedHop’s Document.from_chunks(...).context(query). That’s a bigger commitment. The integration in this guide doesn’t help you skip it; it just tells you whether you should.
If the workload audit says healthy, RedHop has nothing for you. No failure shape exceeded the threshold. The recommendation is “no intervention indicated.” That’s the correct answer, even though it’s not the answer a vendor wants to give. Langfuse traces are still useful for the other things they answer.

What to do next

If the workload audit fires vocab_mismatch, the fix lives in two places. The first is in your content: the gap terms are what your users want to ask about, and your docs should probably mention them. The second is in retrieval: a dense embedder can match “cancel” to “refund” through semantic similarity. Both are documented in Choosing a configuration and the full walk-through of the diagnostic surface.

If it fires templated_queries, run analyze_query_set on a sample of your queries to extract the boilerplate, then compile a Stripper. This was measured to lift retention on CUAD from 81.3% to 87.7% in a controlled three-arm run.
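For intuition about what "extract the boilerplate" means, here is a deliberately naive sketch that finds the token prefix shared by every query in a sample and strips it. It is not analyze_query_set or a compiled Stripper, whose internals this guide does not describe; `common_prefix_tokens` and `strip_prefix` are hypothetical helpers and the sample queries are invented.

```python
def common_prefix_tokens(queries: list[str]) -> list[str]:
    # Longest run of leading tokens shared by every query (case-insensitive).
    token_lists = [q.lower().split() for q in queries]
    if not token_lists:
        return []
    prefix = token_lists[0]
    for tokens in token_lists[1:]:
        i = 0
        while i < min(len(prefix), len(tokens)) and prefix[i] == tokens[i]:
            i += 1
        prefix = prefix[:i]
    return prefix


def strip_prefix(query: str, prefix: list[str]) -> str:
    # Drop the boilerplate only when the query actually starts with it.
    tokens = query.split()
    head = [t.lower() for t in tokens[:len(prefix)]]
    return " ".join(tokens[len(prefix):]) if head == prefix else query


sample = [
    "Highlight the parts of this contract related to governing law",
    "Highlight the parts of this contract related to termination for convenience",
]
prefix = common_prefix_tokens(sample)
print([strip_prefix(q, prefix) for q in sample])
# → ['governing law', 'termination for convenience']
```

The stripped queries carry only the terms that distinguish one request from another, which is what a lexical retriever needs to score on.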

If it fires weak_retrieval with no specific shape, your corpus may simply not cover the questions users are asking. Look at top_zero_match_terms. That’s the gap.

Whichever hint fires, the next step is yours. RedHop is observation plus citation, never the planner.