Langfuse traces + RedHop reports: what each one tells you
If you run RAG in production you have probably reached for two different tools:
- Langfuse (or LangSmith, Phoenix, Helicone, any LLM-observability product) for tracing. Every LLM call gets a span. You see latency, cost, the prompt that went in, the answer that came out, and the full trace tree across your agent or chain.
- A retrieval diagnostic for retrieval-specific failures. There isn’t really an obvious tool here, which is why this guide exists.
The interesting thing is that the two questions barely overlap. Langfuse answers “what happened, when, how much did it cost.” RedHop answers “why did retrieval return the wrong chunks.” You want both, and the integration is unusually painless: RedHop’s report is intrinsic, structured, and serializable, so it drops straight into a Langfuse metadata field, and neither SDK has to know about the other.
This guide builds the integration on a LangChain BM25 pipeline (the most common starting point), pipes a workload through it, and shows exactly what Langfuse and RedHop each tell you about the same failure.
The pipeline we’re observing
The setup any team would actually have: a LangChain BM25 retriever over some product docs, an LLM that turns the retrieved chunks into an answer, and Langfuse wrapping the whole thing so each query produces a trace.
```python
from langchain_community.retrievers import BM25Retriever
from langfuse import Langfuse
from langfuse.openai import openai  # auto-instrumented client

CORPUS = [
    "Refund Policy. Refunds are available within thirty days of purchase.",
    "Termination for convenience. Either party may terminate this agreement.",
    "Governing Law. This agreement is governed by the laws of California.",
    "Limitation of Liability. The cap is twelve months of fees.",
    # ... real corpus would be much larger
]

langfuse = Langfuse()
retriever = BM25Retriever.from_texts(CORPUS)
retriever.k = 5


def build_prompt(query: str, chunks: list[str]) -> str:
    # Placeholder template so the example runs; substitute your own.
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"


def search(query: str) -> list[str]:
    return [d.page_content for d in retriever.invoke(query)]


def answer(query: str) -> str:
    with langfuse.start_as_current_span(name="rag.query") as span:
        chunks = search(query)
        prompt = build_prompt(query, chunks)
        resp = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
```

This works. Langfuse stores every query, every retrieved chunk, every LLM response. You can grep your traces, you can A/B-test prompts, you can spot the slow ones. Good infrastructure.
But the trace doesn’t answer “why did retrieval miss.” That’s a different question. When a user complains the bot got their refund question wrong, you open the trace, see five chunks were retrieved, none of them mention “refund,” and… now what? Was it the query? The corpus? The chunker? The retriever? The trace shows the what. It does not show the why.
What RedHop adds
RedHop’s Decision Report is what you reach for here. It is structured data about how the query interacted with the corpus and the retrieved candidates: which query terms matched zero chunks, how concentrated the score mass was, whether the query looks templated. It also carries a small closed registry of bounded hints that fire on documented failure shapes, each citing the measured finding that justifies it.
There’s no retriever to swap out. RedHop has a function called
analyze_context(query, chunks) that takes the candidates you already
have and produces the report. Drop it next to your existing call:
```python
import redhop


def answer(query: str) -> str:
    with langfuse.start_as_current_span(name="rag.query") as span:
        texts = search(query)

        # NEW: hand the retrieved candidates to RedHop for diagnosis.
        chunks = [
            redhop.Chunk(t, id=str(i), source="langchain")
            for i, t in enumerate(texts)
        ]
        report = redhop.analyze_context(query, chunks)

        # Existing LLM call.
        prompt = build_prompt(query, texts)
        resp = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )

        # Attach the report to the Langfuse span so it lands on the trace.
        span.update(metadata=redhop_report_to_metadata(report))
        return resp.choices[0].message.content
```

redhop_report_to_metadata is a five-line helper we’ll get to in a
second. The important bit is that this is the entire integration. The
retriever doesn’t know RedHop exists. Langfuse doesn’t know RedHop
exists. Both work as before, plus the trace now carries structured
retrieval signals.
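The workload section further down reads the retrieved texts back out of trace metadata under a retrieved_texts key, so it helps to write them there in the first place. A minimal sketch of one way to do that, assuming the RedHop attributes arrive as a flat dict; build_trace_metadata is a hypothetical helper name, not part of RedHop or Langfuse:

```python
from typing import Any


def build_trace_metadata(
    retrieved_texts: list[str],
    redhop_attributes: dict[str, Any],
) -> dict[str, Any]:
    """Merge the retrieved texts with the flat RedHop attributes."""
    metadata: dict[str, Any] = {"retrieved_texts": list(retrieved_texts)}
    # RedHop keys are namespaced under "redhop.", so they cannot collide
    # with the "retrieved_texts" key.
    metadata.update(redhop_attributes)
    return metadata


example = build_trace_metadata(
    ["Refund Policy. Refunds are available within thirty days of purchase."],
    {"redhop.low_confidence": True, "redhop.diagnosis.n_candidates": 1},
)
```

Inside answer, the call would then read span.update(metadata=build_trace_metadata(texts, redhop_report_to_metadata(report))).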
The metadata bridge
RedHop ships a helper for exactly this:
```python
from redhop.otel import report_to_attributes


def redhop_report_to_metadata(report):
    # report_to_attributes() returns an OTel-legal flat dict
    # (bool / int / float / str / list[str], all keys under "redhop.").
    # Langfuse metadata accepts arbitrary JSON, so the dict drops in.
    return report_to_attributes(report)
```

The dict you end up with on every Langfuse trace looks like this:

```json
{
  "redhop.strategy": "raw_topk",
  "redhop.auto_decision": "passthrough",
  "redhop.n_selected": 0,
  "redhop.retained_evidence_ratio": 1.0,
  "redhop.low_confidence": true,
  "redhop.diagnosis.empty_context": true,
  "redhop.diagnosis.n_candidates": 5,
  "redhop.diagnosis.hints": ["empty_context"],
  "redhop.diagnosis.zero_match_terms": ["cancel", "money", "back"]
}
```

That last line is the moment. The trace previously told you “five chunks were retrieved and the LLM answered wrong.” Now it also tells you the query used the words “cancel,” “money,” “back” and none of those words exist in the corpus. The corpus uses “refund” and “termination for convenience.” The diagnosis was sitting in the candidate scores all along; RedHop just surfaces it.
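Because the metadata is a flat dict, reading it back during triage needs no RedHop import at all. The sketch below turns the fields shown above into a one-line summary; explain_miss is a hypothetical helper, and the key names are taken from the example dict rather than from a published schema:

```python
from typing import Any


def explain_miss(metadata: dict[str, Any]) -> str:
    """Summarize the RedHop fields on one trace as a single sentence."""
    hints = metadata.get("redhop.diagnosis.hints", [])
    missing = metadata.get("redhop.diagnosis.zero_match_terms", [])
    n_candidates = metadata.get("redhop.diagnosis.n_candidates", 0)

    if not metadata.get("redhop.low_confidence", False):
        return f"{n_candidates} candidates, no low-confidence flag."

    parts = [f"{n_candidates} candidates, low confidence"]
    if hints:
        parts.append("hints: " + ", ".join(hints))
    if missing:
        parts.append("zero-match terms: " + ", ".join(missing))
    return "; ".join(parts) + "."


trace_metadata = {
    "redhop.low_confidence": True,
    "redhop.diagnosis.n_candidates": 5,
    "redhop.diagnosis.hints": ["empty_context"],
    "redhop.diagnosis.zero_match_terms": ["cancel", "money", "back"],
}
summary = explain_miss(trace_metadata)
```

For the trace above, summary reads "5 candidates, low confidence; hints: empty_context; zero-match terms: cancel, money, back."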
What each tool answers
After the integration, here is what each tool is actually good for:
| Question | Tool |
|---|---|
| When did this query run? | Langfuse |
| How much did it cost in tokens? | Langfuse |
| What prompt did the LLM see? | Langfuse |
| What were the retrieved chunks? | Langfuse |
| Did retrieval find evidence above the grounding bar? | RedHop (low_confidence) |
| Which query terms appear nowhere in the corpus? | RedHop (zero_match_terms) |
| Why did retrieval probably fail? | RedHop (hints[].code) |
| What does the evidence say to try first? | RedHop (workload audit) |
| Did the prompt fit the budget? | Both (different fields) |
| Did the chunker break a paragraph mid-thought? | Neither (open problem) |
The split is clean: Langfuse stores the trace, RedHop interprets the retrieval. They barely overlap and they never fight.
Workload-level diagnosis
A single trace tells you about one query. The interesting question is
across hundreds. RedHop ships summarize_diagnoses for exactly this:
hand it the reports from a slice of your production traffic and get one
focus recommendation back, citing the measured finding behind it.
A common way to feed it: pull the last N traces out of Langfuse, run
each one through analyze_context again (we have the original query
and the retrieved chunks in the trace), and aggregate.
```python
import redhop

# Pull recent traces. The exact call depends on your langfuse SDK
# version (some versions return a response object whose .data holds the
# list); the shape is what matters.
traces = langfuse.fetch_traces(limit=500, name="rag.query")

reports = []
for t in traces:
    query = t.input
    # The chunks the retriever returned are already in the trace. This
    # assumes they were stored under a "retrieved_texts" metadata key
    # when the trace was written.
    retrieved_texts = (t.metadata or {}).get("retrieved_texts", [])
    chunks = [
        redhop.Chunk(text, id=str(i), source="trace")
        for i, text in enumerate(retrieved_texts)
    ]
    reports.append(redhop.analyze_context(query, chunks))

print(redhop.summarize_diagnoses(reports))
```

The rendered summary (excerpt):
```text
RedHop Workload Audit
═════════════════════

Reports aggregated: 487

Hint histogram
──────────────
  - empty_context    112 ( 23%)
  - vocab_mismatch   162 ( 33%)
  - low_confidence   134 ( 28%)

Rates
─────
  Empty-context rate:     23%
  Low-confidence rate:    28%
  Corpus-stats coverage:   0%

Focus
─────
  Code: vocab_mismatch
  33% of queries had most terms missing from the corpus.
  Top gap terms: "cancel", "refund", "subscription", "trial", "billing", "charge".
  Rephrasing toward the documents' vocabulary is the measured first fix,
  and dense retrieval (retrieval="hybrid") was measured to lift retention
  on exactly this shape.
  evidence: docs/findings/MULTIHOP_HYBRID.md
```

Exactly one focus per audit. The cited finding is what justifies the recommendation. RedHop never auto-applies the fix; the action is yours.
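If the RedHop attributes are already stored on each trace, a rough hint histogram can also be computed straight from the metadata without re-running the analysis. The sketch below is plain Python over the flat dicts; hint_histogram is a hypothetical helper, and it only counts hints. It does not reproduce how summarize_diagnoses picks its focus or cites findings.

```python
from collections import Counter
from typing import Any


def hint_histogram(metadatas: list[dict[str, Any]]) -> list[tuple[str, int, int]]:
    """Return (hint, count, percent of traces), most frequent first."""
    counts: Counter[str] = Counter()
    for metadata in metadatas:
        # A trace can carry several hints; count each hint once per trace.
        counts.update(set(metadata.get("redhop.diagnosis.hints", [])))
    total = len(metadatas)
    if total == 0:
        return []
    return [
        (hint, count, round(100 * count / total))
        for hint, count in counts.most_common()
    ]


sample = [
    {"redhop.diagnosis.hints": ["vocab_mismatch"]},
    {"redhop.diagnosis.hints": ["vocab_mismatch", "low_confidence"]},
    {"redhop.diagnosis.hints": ["vocab_mismatch", "low_confidence"]},
    {"redhop.diagnosis.hints": ["empty_context"]},
]
histogram = hint_histogram(sample)
```

Percentages are per trace, so they can sum to more than 100 when traces carry more than one hint.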
Notice corpus_stats_coverage: 0%: because we only had what was in the
trace (the retrieved chunks, not the corpus), we got Layer-1 facts
only. If you want the full corpus-level diagnosis you point RedHop at
the same corpus once, separately:
```python
doc = redhop.Document.from_chunks(
    [redhop.Chunk(t, id=str(i), source="corpus") for i, t in enumerate(CORPUS)]
)
reports = [doc.context(t.input).report for t in traces]
print(redhop.summarize_diagnoses(reports))
```

Now corpus_stats_coverage is 100% and the summary’s
top_zero_match_terms is populated with the exact words your users
ask about that your docs don’t contain. That list is often the most
useful artifact in the report: it’s a workload-derived Vocabulary
seed, ready to compile into a synonym dict.
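To make the "vocabulary seed" idea concrete, here is a sketch in plain Python. It does not use RedHop's own Vocabulary tooling, whose API is not shown in this guide; gap_terms and expand_query are hypothetical helpers, and the synonym mapping is written by hand after looking at the gap list.

```python
from collections import Counter
from typing import Any


def gap_terms(metadatas: list[dict[str, Any]], min_count: int = 2) -> list[str]:
    """Zero-match terms seen on at least `min_count` traces, most frequent first."""
    counts: Counter[str] = Counter()
    for metadata in metadatas:
        counts.update(set(metadata.get("redhop.diagnosis.zero_match_terms", [])))
    frequent = [(term, n) for term, n in counts.items() if n >= min_count]
    # Sort by descending count, then alphabetically for a stable order.
    return [term for term, _ in sorted(frequent, key=lambda item: (-item[1], item[0]))]


def expand_query(query: str, synonyms: dict[str, list[str]]) -> str:
    """Append corpus-side synonyms for any gap term found in the query."""
    extra: list[str] = []
    for token in query.lower().split():
        for synonym in synonyms.get(token.strip("?.,!"), []):
            if synonym not in extra:
                extra.append(synonym)
    return " ".join([query, *extra])


traces_metadata = [
    {"redhop.diagnosis.zero_match_terms": ["cancel", "money", "back"]},
    {"redhop.diagnosis.zero_match_terms": ["cancel", "money"]},
    {"redhop.diagnosis.zero_match_terms": ["trial"]},
]
seed = gap_terms(traces_metadata)

# Hand-written mapping from user vocabulary to the corpus vocabulary.
synonyms = {"cancel": ["termination"], "money": ["refund"]}
expanded = expand_query("How do I cancel and get my money back?", synonyms)
```

The gap list only tells you which user words are missing; deciding which corpus words they should map to is still a human call.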
Honest limits
Both tools have things they don’t do. Worth flagging before someone finds out the hard way.
- No gold labels, no quality scores. Without a ground-truth answer
set, neither tool can tell you your retrieval recall is 73% or your
answer F1 is 0.41. RedHop measures failure shapes: “your queries
use words your docs don’t.” That is genuinely actionable, but it
isn’t an evaluation. For graded answer quality on a gold set, RedHop
ships an evaluate surface and Langfuse has its own scoring product.
- analyze_context observes waste; it does not remove it. If you want
to prune the dilution it flags (the measured −80% prompt tokens with
gold evidence kept), you migrate retrieval into RedHop’s
Document.from_chunks(...).context(query). That’s a bigger commitment.
The integration in this guide doesn’t help you skip it; it just tells
you whether you should.
- If the workload audit says healthy, RedHop has nothing for you. No
failure shape exceeded the threshold. The recommendation is “no
intervention indicated.” That’s the correct answer, even though it’s
not the answer a vendor wants to give. Langfuse traces are still
useful for the other things they answer.
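To make the first limit concrete, this is the kind of number that only exists once you have gold labels. The sketch computes recall@k from retrieved chunk ids and a hand-labelled set of relevant ids per query; it is a generic illustration, not RedHop's evaluate surface or Langfuse's scoring product, and the ids are made up.

```python
def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Fraction of gold chunks that appear in the top-k retrieved ids."""
    if not gold_ids:
        raise ValueError("gold_ids must not be empty")
    hits = gold_ids.intersection(retrieved_ids[:k])
    return len(hits) / len(gold_ids)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved_ids, gold_ids) pairs."""
    if not runs:
        raise ValueError("runs must not be empty")
    return sum(recall_at_k(retrieved, gold, k) for retrieved, gold in runs) / len(runs)


# Two labelled queries: the first finds one of two gold chunks in the
# top 2, the second finds its only gold chunk.
runs = [
    (["a", "b", "c"], {"a", "d"}),
    (["x", "y"], {"y"}),
]
score = mean_recall_at_k(runs, k=2)
```

Without the gold_ids column there is nothing to divide by, which is why neither a trace nor a diagnostic report can produce this number on its own.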
What to do next
If the workload audit fires vocab_mismatch, the fix lives in two
places. The first is in your content: the gap terms are what your
users want to ask about, and your docs should probably mention them.
The second is in retrieval: a dense embedder can match “cancel” to
“refund” through semantic similarity. Both are documented in
Choosing a configuration and the
full walk-through of the diagnostic
surface.
If it fires templated_queries, run analyze_query_set on a sample of
your queries to extract the boilerplate, then compile a Stripper.
This was measured to lift retention on CUAD from 81.3% to 87.7% in a
controlled three-arm run.
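For intuition about what "extract the boilerplate" means, here is a plain-Python stand-in. It is not how analyze_query_set or Stripper are implemented (their internals are not shown in this guide); common_prefix and strip_prefix are hypothetical helpers that only handle the simplest case, a shared leading phrase.

```python
from collections import Counter


def common_prefix(queries: list[str], min_share: float = 0.5) -> list[str]:
    """Longest token prefix shared by at least `min_share` of the queries."""
    tokenized = [q.lower().split() for q in queries]
    prefix: list[str] = []
    while True:
        position = len(prefix)
        # Only queries that still match the prefix so far can extend it.
        candidates = [
            tokens[position]
            for tokens in tokenized
            if tokens[:position] == prefix and len(tokens) > position
        ]
        if not candidates:
            return prefix
        token, count = Counter(candidates).most_common(1)[0]
        if count / len(queries) < min_share:
            return prefix
        prefix.append(token)


def strip_prefix(query: str, prefix: list[str]) -> str:
    """Remove the boilerplate prefix from a query if it is present."""
    tokens = query.split()
    head = [t.lower() for t in tokens[: len(prefix)]]
    if prefix and head == prefix:
        return " ".join(tokens[len(prefix):])
    return query


queries = [
    "Highlight the parts of this contract related to governing law",
    "Highlight the parts of this contract related to termination",
    "Highlight the parts of this contract related to liability caps",
    "What is the refund window",
]
boilerplate = common_prefix(queries)
stripped = [strip_prefix(q, boilerplate) for q in queries]
```

Stripping the shared phrase leaves the retriever scoring only the words that distinguish one query from another.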
If it fires weak_retrieval with no specific shape, your corpus may
simply not cover the questions users are asking. Look at
top_zero_match_terms. That’s the gap.
In every case, the next step is yours. RedHop is observation plus citation, never the planner.