Why is my retention bad? A debugging guide
You ran doc.context("…"), the model gave the wrong answer, and a quick look at
ctx.text() shows the answer-bearing passage isn’t in there. What now?
Bad retention is almost always one of four patterns. This page walks each one: the symptom you see, the most likely cause, the measured fix, and the diagnostic tool RedHop ships to confirm it before you change anything.
Step 0: Measure first, change later
Before guessing at the fix, get a baseline number. If you have a small sample of (query, expected answer) pairs:

    from redhop import Document, evaluate

    doc = Document.from_file("my_corpus.pdf")
    for query, gold_text in my_samples:
        ctx = doc.context(query)
        report = evaluate(query, ctx, gold_answer=gold_text)
        print(f"{query[:40]}: context_recall={report.context_recall} answer_token_recall={report.answer_token_recall}")

evaluate is deterministic: no LLM judge, no API key. Run this once to know
where you’re starting from. Run it again after every fix to know whether you
actually improved anything. The number is the contract. A “guide says do
X” without a measured before/after is how libraries accumulate
cargo-culted advice.
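To make "deterministic, no LLM judge" concrete, here is a minimal sketch of a token-recall style metric. It is an illustration only: the function names and the tokenization are assumptions, and it does not claim to reproduce how RedHop's evaluate computes answer_token_recall.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def answer_token_recall(context_text: str, gold_answer: str) -> float:
    """Fraction of distinct gold-answer tokens that also occur in the context."""
    gold = set(tokenize(gold_answer))
    if not gold:
        return 0.0
    context = set(tokenize(context_text))
    return len(gold & context) / len(gold)


context = "Refunds are available within 30 days of purchase."
full = answer_token_recall(context, "within 30 days")     # every gold token present
partial = answer_token_recall(context, "within 14 days")  # "14" is missing
print(full, partial)
```

Because the score is pure string arithmetic, the same inputs always give the same number, which is what makes a before/after comparison meaningful.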
Pattern 1: Templated queries (CUAD-shape)
Symptom. All your queries share a long boilerplate frame
("Highlight the parts (if any) of this contract related to X. Details: …",
or "Please help me with my X issue, my account is broken …"). Retention is
worse than you’d expect. The gold chunk is in your corpus but BM25 ranks it
low.
Cause. BM25 weights every query token by its IDF over the corpus. The shared boilerplate words tend to be common, low-IDF terms, but because they appear in every query they still contribute to every score. They dilute the signal from the few discriminating words that vary across queries.
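The dilution effect can be reproduced with a toy scorer. The sketch below is a from-scratch BM25 over three made-up chunks, assuming the common k1=1.2, b=0.75 defaults and a Lucene-style IDF; it is not RedHop's scorer, and the chunks, query, and boilerplate set are invented for illustration.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # common BM25 defaults, assumed here


def bm25_scores(query: list[str], chunks: list[list[str]]) -> list[float]:
    """Score every chunk against the query with a plain BM25 formula."""
    n = len(chunks)
    avgdl = sum(len(c) for c in chunks) / n
    df = Counter(term for c in chunks for term in set(c))
    scores = []
    for chunk in chunks:
        tf = Counter(chunk)
        norm = 1 - B + B * len(chunk) / avgdl  # length normalization
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (K1 + 1) / (tf[term] + K1 * norm)
        scores.append(score)
    return scores


def top(scores: list[float]) -> int:
    """Index of the highest-scoring chunk."""
    return max(range(len(scores)), key=scores.__getitem__)


chunks = [
    "governing law is delaware".split(),  # the gold chunk
    "the parts of this contract related to payment details".split(),
    "the term of this contract is related to renewal".split(),
]
templated = "highlight the parts of this contract related to governing law details".split()
boilerplate = {"highlight", "the", "parts", "of", "this", "contract", "related", "to", "details"}
stripped = [t for t in templated if t not in boilerplate]

templated_top = top(bm25_scores(templated, chunks))
stripped_top = top(bm25_scores(stripped, chunks))
print(templated_top, stripped_top)
```

With the full template, the chunk that merely echoes the boilerplate wins; with only the discriminating words left, the gold chunk ranks first.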
Confirm with analyze_query_set:

    import redhop

    report = redhop.analyze_query_set(my_queries)  # list of strings, ~50-300 enough
    print(report.is_templated, report.template_word_share, report.boilerplate_terms[:8])
    # Templated workloads: is_templated=True, template_word_share ≥ 0.50
    # Diverse workloads: is_templated=False, template_word_share < 0.10

Fix. Apply a Stripper to remove the boilerplate before retrieval:

    stripper = redhop.Stripper(report.boilerplate_terms)
    ctx = doc.context_with_rewrites(query, [stripper])
    # `ctx.report.query_rewrites` shows what got stripped from each query

Measured impact on CUAD (template-heavy legal contracts, n=300): ≥0.8 retention rises
81.3% → 87.7% with Stripper alone, before any vocabulary curation.
See CUAD_CLAUSE_EXPANSION.
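If you want an intuition for what a "template word share" measures, the following sketch computes one plausible version by hand. The 0.8 query-share threshold and both function names are assumptions made for this example; analyze_query_set may define its statistics differently.

```python
from collections import Counter


def boilerplate_terms(queries: list[str], min_query_share: float = 0.8) -> list[str]:
    """Words that occur in at least min_query_share of the queries."""
    query_freq = Counter(word for q in queries for word in set(q.lower().split()))
    cutoff = min_query_share * len(queries)
    return sorted(word for word, count in query_freq.items() if count >= cutoff)


def template_word_share(queries: list[str], terms: set[str]) -> float:
    """Share of all query words (with repeats) that are boilerplate terms."""
    words = [word for q in queries for word in q.lower().split()]
    return sum(word in terms for word in words) / len(words) if words else 0.0


queries = [
    "highlight the parts related to governing law",
    "highlight the parts related to audit rights",
    "highlight the parts related to renewal term",
]
terms = boilerplate_terms(queries)
share = template_word_share(queries, set(terms))
print(terms, round(share, 3))
```

Here five of every seven query words belong to the shared frame, which is the kind of workload the Stripper is meant for.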
If Stripper doesn’t fire on a query you expected it to, use
is_effective_on:

    effect = stripper.is_effective_on("the office of records")
    print(effect["removed_terms"])          # what actually fired
    print(effect["probable_silent_no_op"])  # configured terms PRESENT in the
                                            # query that didn't fire; usually a
                                            # stem mismatch between your
                                            # boilerplate list and the analyzer

Pattern 2: Vocabulary mismatch (paraphrase gap)
Symptom. Queries use natural-language phrasings that the documents don’t
use. User asks "How long do I have to cancel and get my money back?". The
contract says "refund window" and "termination for convenience". The
retrieved context contains neither.
Cause. BM25 only matches surface tokens. If the query terms don’t appear in the document, there’s nothing to score against.
Fix (workload-curated synonyms). Build a small Vocabulary dict mapping
user phrasings to document terms:

    vocab = redhop.Vocabulary({
        "cancel": ["refund", "termination for convenience"],
        "money back": ["refund window"],
    })
    ctx = doc.context_with_rewrites(query, [vocab])

This works best for fixed-taxonomy workloads (legal clause names, support
ticket categories, schema columns) where the document vocabulary is bounded.
Measured on CUAD (with Stripper already applied): the curated Vocabulary
adds another +3.0 pts (87.7% → 90.7%). See
CUAD_CLAUSE_EXPANSION.
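Conceptually, a synonym rewrite gives BM25 document-side tokens to match on. The sketch below shows one simple way to do that, by appending mapped terms to the query. The expand_query helper and the append strategy are assumptions for illustration; the source does not say how Vocabulary rewrites a query internally.

```python
def expand_query(query: str, synonyms: dict[str, list[str]]) -> str:
    """Append the document-side terms for every user phrasing found in the query."""
    lowered = query.lower()
    extra: list[str] = []
    for phrase, targets in synonyms.items():
        if phrase in lowered:  # naive substring match, so "cancel" also hits "cancellation"
            for target in targets:
                if target not in extra:
                    extra.append(target)
    return f"{query} {' '.join(extra)}" if extra else query


synonyms = {
    "cancel": ["refund", "termination for convenience"],
    "money back": ["refund window"],
}
query = "How long do I have to cancel and get my money back?"
expanded = expand_query(query, synonyms)
print(expanded)
```

The expanded query now contains "refund" and "termination", so a surface-token scorer has something to match against the contract text.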
Fix (semantic retrieval). For workloads where you can’t enumerate the synonyms (open-domain Q&A, conversational queries), escalate the retrieval tier:

    doc = redhop.Document.from_file("my_corpus.pdf", retrieval="semantic", model="bge-small")
    # every chunk gets a dense embedding; the query cosines against all of them

For paraphrase-heavy workloads at small/medium scale this is the right escalation. Latency cost: ~80MB model download on first use, ~150ms per query.
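For readers new to dense retrieval, this is what "the query cosines against all of them" amounts to. The three-dimensional vectors below are invented stand-ins, not output from bge-small, and the ranking loop is a sketch rather than RedHop's implementation.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy embeddings: made-up numbers that place related meanings close together.
chunk_vectors = {
    "refund window": [0.9, 0.1, 0.0],
    "governing law": [0.0, 0.2, 0.9],
    "termination for convenience": [0.7, 0.6, 0.1],
}
query_vector = [0.8, 0.3, 0.0]  # stand-in for "cancel and get my money back"

ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine(query_vector, chunk_vectors[name]),
    reverse=True,
)
print(ranked)
```

No query word appears in any chunk label, yet the two refund-related chunks rank first because similarity is computed on vectors rather than surface tokens.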
Pattern 3: Multi-hop bridge passage
Symptom. The question requires combining information from two paragraphs
("Who is the spouse of the Green performer?": first hop names the
performer, second hop names the spouse). Retention measures fine on either
hop in isolation, but the assembled context is missing the bridge passage
that links them.
Cause. Standard BM25 ranks the obviously-relevant first-hop paragraph high (it shares the query’s content words), and the bridge paragraph low (it mentions only the linking entity, not the query). At a small budget, the bridge gets pruned.
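A quick overlap count shows why the bridge paragraph scores so low lexically. The two paragraphs and the names in them are made up for this sketch, and the stopword list is a small hand-picked assumption.

```python
import re

STOPWORDS = {"who", "is", "the", "of", "a", "an", "by", "in"}


def content_words(text: str) -> set[str]:
    """Lowercased words with a few stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


query = "Who is the spouse of the Green performer?"
first_hop = "Green is an album by the performer Alex Doe."  # invented example text
bridge = "Alex Doe married Sam Roe in 1990."                # invented example text

q = content_words(query)
first_overlap = q & content_words(first_hop)
bridge_overlap = q & content_words(bridge)
linking = content_words(first_hop) & content_words(bridge)
print(sorted(first_overlap), sorted(bridge_overlap), sorted(linking))
```

The first-hop paragraph shares two content words with the query, the bridge shares none, and the only thing tying the two paragraphs together is the linking entity that never appears in the query.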
Fix. Use retrieval="hybrid" (BM25 + dense rerank). The dense embedder
catches semantic relationships that BM25 misses:

    doc = redhop.Document.from_file("my_corpus.pdf", retrieval="hybrid", model="bge-small")

Measured on HotpotQA (multi-hop QA, n=100): ≥0.8 retention rises 71% → 81% under hybrid. On MuSiQue (compositional 2-4 hop, n=100): ≥0.8 retention rises 23% → 34%. Latency cost: ~250ms p50 vs ~3ms for BM25.
See MULTIHOP_HYBRID and MULTIHOP_HYBRID_COMPETITORS.
Stripper doesn’t help on multi-hop (the queries aren’t templated), and neither
does candidate_k tuning (the bridge passages are missing from the larger
BM25 pool too). Dense rerank is the only knob that moves multi-hop
retention.
Pattern 4: Chunking mismatch
Symptom. Retention is bad across queries you can’t classify into the three patterns above. The right passage might be there but split awkwardly across two retrieved chunks, or one chunk is so big that pulling it eats the whole token budget.
Cause. RedHop’s default chunker is sentence-aware, ~128-token target. That fits HotpotQA-shape Wikipedia paragraphs and CUAD contract clauses well. It doesn’t fit code (long functions), tabular data, or structured documents (SOAP-note clinical records, deeply-nested JSON dumps).
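To see why a sentence-aware chunker struggles with long indivisible units, consider this minimal greedy packer. It is a sketch under stated assumptions (whitespace words stand in for tokens, sentences are split on terminal punctuation) and is not RedHop's chunker.

```python
import re


def chunk_sentences(text: str, target_tokens: int = 128) -> list[str]:
    """Greedily pack whole sentences into chunks of roughly target_tokens words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())  # whitespace words as a stand-in for tokens
        if current and count + n > target_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)  # a sentence is never split, even if oversized
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six seven. Eight nine. " + " ".join(["word"] * 10) + "."
chunks = chunk_sentences(text, target_tokens=5)
for chunk in chunks:
    print(len(chunk.split()), chunk)
```

The last "sentence" is twice the target size and still lands in a single chunk, which is the same failure shape as a long function or a wide table row.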
Confirm with the chunk-size sweep:

    for cs in [128, 256, 384, 512]:
        doc = redhop.Document.from_text(my_text, chunk_size=cs)
        ctx = doc.context(my_query)
        print(f"chunk_size={cs}: {doc.n_chunks} chunks, recall={evaluate(...).context_recall}")

On the workloads we measured (HotpotQA + MuSiQue), bigger chunks regressed. See MULTIHOP_CHUNK_SIZE_NULL. That’s evidence the 128-token default is well-tuned for those shapes. If your workload is different (code, tables, etc.), run the sweep on your own data to find your sweet spot.
Fix. Bring your own chunker via Document.from_chunks(...):

    from your_chunker import chunk_into_sections

    sections = chunk_into_sections(open("paper.tex").read())
    chunks = [
        redhop.Chunk(text, source="paper.tex", id=f"sec-{i}", metadata={"section": title})
        for i, (title, text) in enumerate(sections)
    ]
    doc = redhop.Document.from_chunks(chunks)

The constant-chunking matrix (MULTIHOP_CONSTANT_CHUNKING) showed the chunker is the lever (the BM25 implementation barely matters). Bringing your workload-specific chunker can move retention 12-20 points.
Step N: None of the above
If you’ve checked all four patterns and retention is still bad, the most likely remaining causes are:
- Token budget too tight. Default is 8192. Bump higher if your model
  supports it: doc.context(query, budget=16000).
- Off-topic queries. If the user’s question genuinely isn’t answered
  anywhere in the corpus, retrieval can only return the closest chunks.
  No retrieval system will manufacture the right answer.
  ctx.report.low_confidence_retrieval flags this.
- Document parsing lost the content. For PDFs especially, the text may
  not have been extracted. Print len(doc.chunks) and a sample of
  ctx.chunks. If the actual answer text isn’t in any chunk, the
  retrieval tier can’t help. Try a different parser or
  from_text(open(...)) with a known-good extraction.
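The parsing check in the last bullet can be automated with a plain substring search once you have the chunk texts as strings. How you pull text out of RedHop chunk objects is not shown here; chunk_texts below is assumed to be an ordinary list of strings, and the helper names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so PDF line breaks do not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def chunks_containing(answer: str, chunk_texts: list[str]) -> list[int]:
    """Indices of chunks whose normalized text contains the normalized answer."""
    needle = normalize(answer)
    if not needle:
        return []
    return [i for i, text in enumerate(chunk_texts) if needle in normalize(text)]


chunk_texts = [
    "Refunds are available within\n30 days of purchase.",  # line break mid-phrase
    "Governing law: Delaware.",
]
found = chunks_containing("within 30 days", chunk_texts)
missing = chunks_containing("within 14 days", chunk_texts)
print(found, missing)
```

An empty result for an answer you know is in the source file points at extraction, not retrieval, so changing the retrieval tier will not help.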
See also
- Choosing a configuration: the prescriptive version of this page, organized by workload shape rather than symptom.
- Retrieval & context tips: the underlying principles (“why does BM25 work this way, when should I escalate to dense”).
- Evidence layer: the measured findings each recommendation here traces back to.