Diagnose your existing RAG pipeline
You already run retrieval. Maybe it is LangChain BM25 over your contracts, LlamaIndex over a notebook of code, pgvector behind an internal API, or something you wrote yourself. You do not want to migrate. You want to know why retrieval sometimes returns the wrong thing, and which single knob the data says to reach for first.
This page walks you through pointing RedHop’s Decision Report at your existing pipeline. Three calls, no behavior change, ~10 lines, no new SDKs.
Who this is for
You retrieve with one of:
- LangChain (BM25Retriever, vector stores, ensemble retrievers)
- LlamaIndex (BM25, dense indexes, retriever modules)
- pgvector / Weaviate / Qdrant / Milvus (you query directly)
- A hand-rolled retriever inside your application code
You want a structured answer to “what is going wrong, and what does the evidence say to do about it?” without changing the pipeline.
Step 1: one query, zero behavior change
Hand RedHop the texts your retriever already returned. `analyze_context` observes the candidate pool without modifying it.

```python
import redhop

# Your existing pipeline; here is a LangChain sketch.
# from langchain_community.retrievers import BM25Retriever
# retriever = BM25Retriever.from_texts(my_corpus_texts)
# texts = [d.page_content for d in retriever.invoke(query)]

texts = retriever_invoke(query)  # whatever you have today
chunks = [redhop.Chunk(t, id=str(i), source="external") for i, t in enumerate(texts)]
report = redhop.analyze_context(query, chunks)

print(report)                     # the Decision Report
print(report.diagnosis["hints"])  # any bounded hints that fired
```

```js
const { Chunk, analyzeContext } = require("redhop");

const texts = retrieverInvoke(query);
const chunks = texts.map((t, i) => new Chunk(t, { id: String(i), source: "external" }));
const report = analyzeContext(query, chunks);

console.log(report.rendered);
console.log(report.diagnosis.hints);
```

```rust
use redhop::{
    analyze_context, Chunk, ChunkId, ContextConfig, Query, RetrievalMethod, RetrievalResult,
    Score, TokenCount,
};

let texts = retriever_invoke(query);
let retrieved: Vec<RetrievalResult> = texts
    .into_iter()
    .enumerate()
    .map(|(i, t)| {
        let chunk = Chunk::new(
            ChunkId::new(i.to_string()),
            t.clone(),
            "external",
            TokenCount(t.split_whitespace().count()),
        );
        RetrievalResult::new(chunk, Score { value: 1.0, method: RetrievalMethod::External })
    })
    .collect();

let report = analyze_context(&Query::new(query), &retrieved, &ContextConfig::default());
println!("{}", report.render(None));
```

What you get back: per-candidate facts about how the query interacted with what your retriever produced. The score spread, the empty-context flag, the low-confidence flag, the query terms that appear in none of the retrieved candidates. Layer-1 facts. Corpus-wide stats are not computable here because RedHop does not see your corpus, only what came out of your retriever.
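To make the Layer-1 facts concrete, here is a minimal sketch that derives a few of them from nothing but the query and the retrieved texts. It illustrates the idea and is not RedHop's implementation: the regex tokenizer, the `layer1_facts` helper name, and the definition of score spread as max minus min are all assumptions.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumerics (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def layer1_facts(query, texts, scores=None):
    """Facts derivable from the candidate pool alone, with no corpus access."""
    seen = set()
    for text in texts:
        seen.update(tokenize(text))

    # Query terms that occur in none of the retrieved candidates, in query order.
    missing = []
    for term in tokenize(query):
        if term not in seen and term not in missing:
            missing.append(term)

    facts = {
        "empty_context": len(texts) == 0,
        "n_candidates": len(texts),
        "zero_match_terms": missing,
    }
    # Score spread is only defined when the retriever exposes scores.
    if scores:
        facts["score_spread"] = max(scores) - min(scores)
    return facts


facts = layer1_facts(
    "cancel subscription money back",
    [
        "You can get your money back within 30 days.",
        "Termination for convenience requires notice.",
    ],
    scores=[0.5, 0.25],
)
print(facts["zero_match_terms"])  # terms the candidate pool never mentions
print(facts["score_spread"])
```

Everything here is computed from the candidate pool, which is why this level cannot say whether a missing term exists elsewhere in the corpus.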
Step 2: corpus-level diagnosis
Two lines to upgrade. Load the same chunks into a `Document` so RedHop can build a vocabulary map and tell you which query terms appear nowhere in your corpus (the canonical paraphrase failure: the user says “cancel” and “money back”, the doc says “refund” and “termination for convenience”).

```python
doc = redhop.Document.from_chunks(
    [redhop.Chunk(t, id=str(i), source="corpus") for i, t in enumerate(my_corpus_texts)]
)

ctx = doc.context(query)
print(ctx.report.diagnosis["zero_match_terms"])  # query terms missing from the corpus
print(ctx.report.diagnosis["term_stats"])        # per-term corpus frequency
```

RedHop indexes a copy in memory. Your retrieval is untouched. The report on `ctx` carries the full diagnosis with `corpus_stats_available = True`. The five hint codes from Choosing a configuration (vocabulary mismatch, polysemy, templated boilerplate, plus `empty_context` / `low_confidence`) can now fire with full evidence.
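The corpus-level check can be pictured as a document-frequency lookup for each query term. The sketch below is a simplified illustration under assumed names (`corpus_term_stats`, a regex tokenizer, no stemming); it is not how RedHop builds its vocabulary map, and the shape of the real `term_stats` field may differ.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def corpus_term_stats(query, corpus_texts):
    """Return per-term document frequency and the terms found in no document."""
    doc_freq = Counter()
    for text in corpus_texts:
        doc_freq.update(set(tokenize(text)))  # count each document once per term

    stats = {term: doc_freq[term] for term in tokenize(query)}
    zero_match = [term for term, df in stats.items() if df == 0]
    return stats, zero_match


corpus = [
    "A refund is available within 30 days of purchase.",
    "Either party may invoke termination for convenience.",
    "The refund is paid to the original payment method.",
]
stats, zero_match = corpus_term_stats("cancel refund money back", corpus)
print(stats)       # per-term document frequency
print(zero_match)  # query terms that appear nowhere in the corpus
```

A term with document frequency zero cannot be matched by any lexical retriever, however it is tuned, which is what makes this shape worth detecting before reaching for other knobs.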
Step 3: audit the workload
One query tells you about one query. The real value is across hundreds of production queries. Collect the reports, call `summarize_diagnoses`, read the focus.

```python
reports = [doc.context(q).report for q in production_queries]
summary = redhop.summarize_diagnoses(reports)
print(summary)
```

Rendered output (excerpt):

```
RedHop Workload Audit
═════════════════════

Reports aggregated: 247

Hint histogram
──────────────
  - empty_context   42 (17%)
  - vocab_mismatch  81 (33%)
  - low_confidence  57 (23%)

Rates
─────
  Empty-context rate:    17%
  Low-confidence rate:   23%
  Corpus-stats coverage: 100%

Top zero-match terms
────────────────────
  "cancel", "refund", "subscription", "trial", "billing", "charge", ...

Focus
─────
  Code: vocab_mismatch
  33% of queries had most terms missing from the corpus. Top gap terms:
  "cancel", "refund", "subscription", "trial", "billing", "charge".
  Rephrasing toward the documents' vocabulary is the measured first fix,
  and dense retrieval (retrieval="hybrid") was measured to lift retention
  on exactly this shape.
  evidence: docs/findings/MULTIHOP_HYBRID.md
```

Exactly one focus per summary. Resolution is by a fixed priority order (vocab mismatch outranks templated outranks underdetermined outranks weak-retrieval). The cited finding is what justifies the recommendation. RedHop never auto-applies the fix.
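The priority rule can be sketched as a walk down an ordered list that stops at the first failure shape over a threshold. This is an illustration only: the order and the 20-report floor come from this page, while the `resolve_focus` name, the per-report list-of-codes input, and the 25% threshold are assumptions (the real thresholds are documented separately).

```python
from collections import Counter

# Order as stated on this page; the threshold value below is an assumption.
PRIORITY = [
    "vocab_mismatch",
    "templated_queries",
    "underdetermined_queries",
    "weak_retrieval",
]
MIN_REPORTS = 20   # fewer reports than this yields sample_too_small
THRESHOLD = 0.25   # hypothetical rate at which a shape counts as present


def resolve_focus(codes_per_report):
    """codes_per_report: one list of failure-shape codes per aggregated report."""
    n = len(codes_per_report)
    if n < MIN_REPORTS:
        return "sample_too_small", {}

    histogram = Counter()
    for codes in codes_per_report:
        histogram.update(set(codes))  # count each code once per report

    rates = {code: histogram[code] / n for code in PRIORITY}
    for code in PRIORITY:  # the first shape over the threshold wins
        if rates[code] >= THRESHOLD:
            return code, rates
    return "healthy", rates


workload = (
    [["templated_queries"]] * 12
    + [["vocab_mismatch", "templated_queries"]] * 14
    + [[]] * 14
)
focus, rates = resolve_focus(workload)
print(focus)
print(rates)
```

In this made-up workload the templated shape is more frequent than the vocabulary shape, yet the focus is still `vocab_mismatch`, because resolution follows the priority order and not the highest rate.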
What each focus code points at
| Focus | What it means | Where the fix lives |
|---|---|---|
| `vocab_mismatch` | Queries use different words than the docs do | Rephrase, or `retrieval="hybrid"`, or build a `Vocabulary` |
| `templated_queries` | Fixed wrappers dominate the queries | `analyze_query_set` + `Stripper` |
| `underdetermined_queries` | Short queries can’t pick between candidates | UI prompt for one more keyword |
| `weak_retrieval` | High empty/low-confidence but no specific shape | Inspect `top_zero_match_terms`; your corpus may not cover the questions |
| `healthy` | No failure shape above the threshold | No intervention indicated |
| `sample_too_small` | Fewer than 20 queries aggregated | Run more |
Step 4: ship it to your telemetry
Decision Reports are intrinsic, structured, and serializable. Every field flattens cleanly into OpenTelemetry attributes or Langfuse metadata. RedHop imports no SDK; you attach the dict.
OpenTelemetry (Python)
```python
from opentelemetry import trace
from redhop.otel import report_to_attributes

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("rag.query") as span:
    ctx = doc.context(query)
    span.set_attributes(report_to_attributes(ctx.report))
    span.add_event("rag.report", attributes={"redhop.report": ctx.report.json()})
```

Langfuse (Python)
```python
from langfuse import Langfuse
from redhop.otel import report_to_attributes

langfuse = Langfuse()
ctx = doc.context(query)
langfuse.trace(
    name="rag.query",
    input=query,
    output=ctx.text(),
    metadata=report_to_attributes(ctx.report),
)
```

Node.js
There is no helper module on Node; the conventions table is small enough to inline. Drop this next to your retrieval call:

```js
function reportToAttributes(report, prefix = "redhop.") {
  const d = report.diagnosis;
  const attrs = {
    [prefix + "strategy"]: report.strategy,
    [prefix + "auto_decision"]: report.autoDecision,
    [prefix + "input_tokens"]: Number(report.inputTokens),
    [prefix + "total_tokens"]: Number(report.totalTokens),
    [prefix + "token_budget"]: Number(report.tokenBudget),
    [prefix + "n_selected"]: Number(report.nSelected),
    [prefix + "retained_evidence_ratio"]: Number(report.retainedEvidenceRatio),
    [prefix + "evidence_density"]: Number(report.evidenceDensity),
    [prefix + "estimated_waste_tokens"]: Number(report.estimatedWasteTokens),
    [prefix + "second_hop_rescues"]: Number(report.secondHopRescueCount),
    [prefix + "low_confidence"]: Boolean(report.lowConfidenceRetrieval),
    [prefix + "diagnosis.empty_context"]: Boolean(d.emptyContext),
    [prefix + "diagnosis.n_candidates"]: Number(d.nCandidates),
    [prefix + "diagnosis.hints"]: d.hints.map((h) => h.code),
    [prefix + "diagnosis.zero_match_terms"]: d.zeroMatchTerms.slice(0, 16),
  };
  // Optional field: omit it entirely when absent; never emit null.
  if (d.scoreSpread != null) attrs[prefix + "diagnosis.score_spread"] = Number(d.scoreSpread);
  return attrs;
}
```

Attribute conventions
All keys live under the `redhop.` namespace. Every value is one of: bool, int, float, string, list[string]. Optional fields are omitted rather than emitted as null.
| Attribute | Type | Source |
|---|---|---|
| `redhop.strategy` | string | `report.strategy` |
| `redhop.requested_strategy` | string | `report.requested_strategy` |
| `redhop.auto_decision` | string | `report.auto_decision` |
| `redhop.input_tokens` | int | `report.input_tokens` |
| `redhop.total_tokens` | int | `report.total_tokens` |
| `redhop.token_budget` | int | `report.token_budget` |
| `redhop.n_input_chunks` | int | `report.n_input_chunks` |
| `redhop.n_selected` | int | `report.n_selected` |
| `redhop.retained_evidence_ratio` | float | `report.retained_evidence_ratio` |
| `redhop.evidence_density` | float | `report.economics.evidence_density` |
| `redhop.estimated_waste_tokens` | int | `report.economics.estimated_waste_tokens` |
| `redhop.second_hop_rescues` | int | `report.second_hop_rescue_count` |
| `redhop.low_confidence` | bool | `report.low_confidence_retrieval` |
| `redhop.diagnosis.empty_context` | bool | `diagnosis.empty_context` |
| `redhop.diagnosis.n_candidates` | int | `diagnosis.n_candidates` |
| `redhop.diagnosis.hints` | string[] | hint codes, in fire order |
| `redhop.diagnosis.zero_match_terms` | string[] | display-ordered, capped at 16 |
| `redhop.diagnosis.score_spread` | float | omitted when None |
Excluded on purpose: hint message strings (size and cardinality; the evidence path is recoverable from the code), per-term `term_stats` rows (they belong in the JSON body of an event, not the attribute set), and the rendered report (use `report.json()` for the full body).
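The conventions above amount to a small flattening routine: prefix every key, keep only scalar and list-of-string values, cap the term list, and skip anything that is None. The sketch below shows that shape on a plain dict. It is not the `redhop.otel.report_to_attributes` implementation; the `to_attributes` name, the dict-shaped report, the hint objects with a `code` key, and the sample values are all assumptions for illustration.

```python
SCALARS = (bool, int, float, str)
MAX_TERMS = 16  # cap taken from the conventions table


def to_attributes(report, prefix="redhop."):
    """Flatten a nested report dict into telemetry-safe attributes."""
    diagnosis = report.get("diagnosis", {})
    candidates = {
        "strategy": report.get("strategy"),
        "n_selected": report.get("n_selected"),
        "low_confidence": report.get("low_confidence_retrieval"),
        "diagnosis.empty_context": diagnosis.get("empty_context"),
        "diagnosis.hints": [h["code"] for h in diagnosis.get("hints", [])],
        "diagnosis.zero_match_terms": diagnosis.get("zero_match_terms", [])[:MAX_TERMS],
        "diagnosis.score_spread": diagnosis.get("score_spread"),
    }

    attrs = {}
    for key, value in candidates.items():
        if value is None:
            continue  # optional fields are omitted, never emitted as null
        if isinstance(value, list):
            value = [str(v) for v in value]  # lists are always list[string]
        elif not isinstance(value, SCALARS):
            value = str(value)
        attrs[prefix + key] = value
    return attrs


# Hypothetical report contents, shaped only for this example.
sample_report = {
    "strategy": "example",
    "n_selected": 3,
    "low_confidence_retrieval": True,
    "diagnosis": {
        "empty_context": False,
        "hints": [{"code": "vocab_mismatch", "message": "not exported"}],
        "zero_match_terms": [f"term{i}" for i in range(20)],
        "score_spread": None,
    },
}
attrs = to_attributes(sample_report)
print(sorted(attrs))
```

Note that the hint message never reaches the attribute set and the missing score spread produces no key at all, matching the exclusions listed above.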
Honest limits
- This measures failure shapes, not retention or answer quality. You will see “your query terms appear nowhere in the corpus”, not “your retrieval recall is X%”. For graded answer quality on a gold set, see the eval surface.
- `analyze_context` (Step 1) observes waste; it does not remove it. Only the native `build_context` / `Document.context` path runs the reasoning-preserving prune. The wins on the comparison page are measured on the native path.
- If `summary.focus["code"] == "healthy"`, RedHop has nothing to recommend, and that is the correct outcome.
Reference
- Runnable example: `examples/python/13_workload_audit.py` (and the Node and Rust mirrors).
- The five hint codes and the failure shapes they map to: Choosing a configuration.
- Threshold provenance: `docs/DEFAULT_PROVENANCE.md`.