Why is my retention bad? A debugging guide

You ran doc.context("…"), the model gave the wrong answer, and a quick look at ctx.text() shows the answer-bearing passage isn’t in there. What now?

Bad retention is almost always one of four patterns. This page walks each one: the symptom you see, the most likely cause, the measured fix, and the diagnostic tool RedHop ships to confirm it before you change anything.

Before guessing at the fix, get a baseline number. If you have a small sample of (query, expected answer) pairs:

from redhop import Document, evaluate

doc = Document.from_file("my_corpus.pdf")
# my_samples: your own list of (query, expected answer) pairs
for query, gold_text in my_samples:
    ctx = doc.context(query)
    report = evaluate(query, ctx, gold_answer=gold_text)
    print(f"{query[:40]}: context_recall={report.context_recall} "
          f"answer_token_recall={report.answer_token_recall}")

evaluate is deterministic: no LLM judge, no API key. Run this once to know where you’re starting from. Run it again after every fix to know whether you actually improved anything. The number is the contract. A “guide says do X” without a measured before/after is how libraries accumulate cargo-culted advice.
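
To make "deterministic, no LLM judge" concrete, here is a minimal sketch of a token-recall style metric. It is an assumption about what a metric like answer_token_recall could compute, not RedHop's actual implementation; the helper names tokens and token_recall are hypothetical.

```python
import re

def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_recall(gold_answer: str, context_text: str) -> float:
    """Fraction of distinct gold-answer tokens that occur in the context.

    Hypothetical stand-in for a deterministic recall metric: the same
    inputs always produce the same number, with no model in the loop.
    """
    gold = set(tokens(gold_answer))
    if not gold:
        return 0.0
    present = gold & set(tokens(context_text))
    return len(present) / len(gold)

context = "The refund window is 30 days from the date of purchase."
full = token_recall("30 days", context)              # both tokens present
partial = token_recall("30 business days", context)  # "business" is missing
print(full, partial)
```

Because a metric of this kind is pure string arithmetic, a before/after comparison on the same sample set is reproducible.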

Symptom. All your queries share a long boilerplate frame ("Highlight the parts (if any) of this contract related to X. Details: …", or "Please help me with my X issue, my account is broken …"). Retention is worse than you’d expect. The gold chunk is in your corpus but BM25 ranks it low.

Cause. BM25 weights every query token by its IDF over the corpus. The shared boilerplate words have low IDF (they appear in every query) but still contribute to the score. They dilute the signal from the few discriminating words that vary across queries.
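
The dilution effect can be reproduced with a toy scorer. The sketch below is a textbook-style BM25 written from scratch for illustration (it is not RedHop's scorer, and the four "clauses" are invented). The full templated query ranks a boilerplate-heavy distractor above the gold clause, and the stripped query does not.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every doc against the query with a plain BM25 formula."""
    toks = [d.split() for d in docs]
    n = len(toks)
    avgdl = sum(len(t) for t in toks) / n
    df = Counter(term for t in toks for term in set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        norm = k1 * (1 - b + b * len(t) / avgdl)
        score = 0.0
        for term in query.split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

def rank_of(index, scores):
    """1-based rank of docs[index]; 1 means highest score."""
    return 1 + sum(1 for s in scores if s > scores[index])

docs = [
    "governing law delaware applies to all disputes",  # gold clause
    "the parts of this contract related to payment are set out in the schedule",
    "either party may terminate this contract on thirty days notice",
    "the supplier shall deliver the goods to the buyer",
]
full_query = "highlight the parts of this contract related to governing law"
stripped_query = "governing law"

rank_full = rank_of(0, bm25_scores(full_query, docs))
rank_stripped = rank_of(0, bm25_scores(stripped_query, docs))
print(rank_full, rank_stripped)
```

In this toy corpus the gold clause is outranked only because the distractor happens to share the template words; removing them leaves the two discriminating terms to decide the ranking.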

Confirm with analyze_query_set:

import redhop
report = redhop.analyze_query_set(my_queries) # list of strings, ~50-300 enough
print(report.is_templated, report.template_word_share, report.boilerplate_terms[:8])
# Templated workloads: is_templated=True, template_word_share ≥ 0.50
# Diverse workloads: is_templated=False, template_word_share < 0.10
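
For intuition about what a number like template_word_share measures, here is one plausible way to compute such a statistic by hand. This is a sketch under stated assumptions: the 0.8 cutoff, the whitespace tokenizer, and the function itself are hypothetical and are not taken from RedHop's source.

```python
from collections import Counter

def template_word_share(queries, min_query_fraction=0.8):
    """Share of all query tokens that are terms repeated across most queries.

    A term counts as boilerplate here if it occurs in at least
    `min_query_fraction` of the queries (an assumed cutoff).
    """
    tokenized = [q.lower().split() for q in queries]
    df = Counter(term for toks in tokenized for term in set(toks))
    cutoff = min_query_fraction * len(tokenized)
    boilerplate = {term for term, count in df.items() if count >= cutoff}
    total = sum(len(toks) for toks in tokenized)
    shared = sum(1 for toks in tokenized for t in toks if t in boilerplate)
    return (shared / total if total else 0.0), sorted(boilerplate)

templated = [
    "highlight the parts of this contract related to governing law",
    "highlight the parts of this contract related to renewal term",
    "highlight the parts of this contract related to audit rights",
]
diverse = [
    "who wrote the opening track",
    "when does my warranty expire",
    "maximum payload for the falcon rocket",
]

share_templated, terms_templated = template_word_share(templated)
share_diverse, terms_diverse = template_word_share(diverse)
print(share_templated, terms_templated)
print(share_diverse, terms_diverse)
```

With these inputs the templated set lands well above the 0.50 mark quoted above, and the diverse set has no shared terms at all.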

Fix. Apply a Stripper to remove the boilerplate before retrieval:

stripper = redhop.Stripper(report.boilerplate_terms)
ctx = doc.context_with_rewrites(query, [stripper])
# `ctx.report.query_rewrites` shows what got stripped from each query

Measured impact on CUAD (template-heavy legal contracts, n=300): ≥0.8 retention rises 81.3% → 87.7% with Stripper alone, before any vocabulary curation. See CUAD_CLAUSE_EXPANSION.

If Stripper doesn’t fire on a query you expected it to, use is_effective_on:

effect = stripper.is_effective_on("the office of records")
print(effect["removed_terms"])  # what actually fired
# Configured terms PRESENT in the query that didn't fire; usually a stem
# mismatch between your boilerplate list and the analyzer.
print(effect["probable_silent_no_op"])
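
A stem mismatch of this kind is easy to reproduce outside the library. The sketch below is illustrative only: the crude plural-stripping stem function and strip_terms are hypothetical and do not mirror RedHop's analyzer. It shows a configured term that is present in the query but never fires, because the query is stemmed and the boilerplate list is not.

```python
def stem(token):
    """Crude stemmer for illustration: drop a trailing 's' on longer words."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def strip_terms(query, boilerplate, stem_list):
    """Remove query tokens whose stem matches a boilerplate entry.

    With stem_list=False the list is compared in its raw form, which is
    the mismatch: the query side is stemmed, the list side is not.
    """
    targets = {stem(t) for t in boilerplate} if stem_list else set(boilerplate)
    kept, removed = [], []
    for tok in query.lower().split():
        (removed if stem(tok) in targets else kept).append(tok)
    return kept, removed

configured = ["office", "records"]
query = "the office of records"

kept, removed = strip_terms(query, configured, stem_list=False)
# Configured terms that appear in the query but did not fire.
silent = [t for t in configured if t in query.split() and t not in removed]

kept_fixed, removed_fixed = strip_terms(query, configured, stem_list=True)
print(removed, silent, removed_fixed)
```

Passing both sides through the same normalization makes the configured term fire in this toy setup.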

Pattern 2: Vocabulary mismatch (paraphrase gap)

Symptom. Queries use natural-language phrasings that the documents don’t use. User asks "How long do I have to cancel and get my money back?". The contract says "refund window" and "termination for convenience". The retrieved context contains neither.

Cause. BM25 only matches surface tokens. If the query terms don’t appear in the document, there’s nothing to score against.

Fix (workload-curated synonyms). Build a small Vocabulary dict mapping user phrasings to document terms:

vocab = redhop.Vocabulary({
    "cancel": ["refund", "termination for convenience"],
    "money back": ["refund window"],
})
ctx = doc.context_with_rewrites(query, [vocab])

This works best for fixed-taxonomy workloads (legal clause names, support ticket categories, schema columns) where the document vocabulary is bounded. Measured on CUAD (with Stripper already applied): the curated Vocabulary adds another +3.0 pts (87.7% → 90.7%). See CUAD_CLAUSE_EXPANSION.
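
To see why a synonym map closes the gap, the following sketch measures token overlap between the example query and a contract-style passage before and after expansion. It is a simplified illustration, not RedHop's rewrite logic: expand_query uses naive substring matching, and the passage text is invented.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def expand_query(query, vocabulary):
    """Append document-side terms for every user phrasing found in the query."""
    lowered = query.lower()
    extra = [term
             for phrase, terms in vocabulary.items()
             if phrase in lowered  # naive substring match, for illustration
             for term in terms]
    return query + " " + " ".join(extra) if extra else query

def overlap(query, passage):
    """Surface tokens shared by query and passage: all a lexical scorer can use."""
    return set(tokens(query)) & set(tokens(passage))

vocabulary = {
    "cancel": ["refund", "termination for convenience"],
    "money back": ["refund window"],
}
query = "How long do I have to cancel and get my money back?"
passage = ("Refund window: 30 days. Termination for convenience "
           "requires written notice.")

before = overlap(query, passage)
after = overlap(expand_query(query, vocabulary), passage)
print(before, after)
```

The unexpanded query shares no tokens with the passage, so a purely lexical scorer has nothing to rank it by; the expanded query shares several.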

Fix (semantic retrieval). For workloads where you can’t enumerate the synonyms (open-domain Q&A, conversational queries), escalate the retrieval tier:

doc = redhop.Document.from_file("my_corpus.pdf",
                                retrieval="semantic", model="bge-small")
# every chunk gets a dense embedding; the query cosines against all of them

For paraphrase-heavy workloads at small/medium scale this is the right escalation. Cost: a ~80MB model download on first use, then ~150ms per query.
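
The "query cosines against all of them" step amounts to a cosine-similarity ranking over chunk vectors. A minimal sketch follows, using hand-made three-dimensional vectors as stand-ins for real embeddings (the numbers are invented, and a real model such as bge-small would produce much higher-dimensional vectors):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy "embeddings": invented values, only meant to show the ranking step.
chunk_vectors = {
    "refund window": [0.8, 0.2, 0.1],
    "delivery schedule": [0.1, 0.9, 0.2],
    "governing law": [0.0, 0.2, 0.9],
}
query_vector = [0.9, 0.1, 0.0]  # stands in for "can I get my money back?"

ranked = sorted(chunk_vectors,
                key=lambda name: cosine(query_vector, chunk_vectors[name]),
                reverse=True)
best = ranked[0]
print(ranked)
```

Nothing in this ranking depends on shared surface tokens, which is why the dense tier can match a paraphrase that a lexical scorer cannot.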

Symptom. The question requires combining information from two paragraphs ("Who is the spouse of the Green performer?": first hop names the performer, second hop names the spouse). Retention measures fine on either hop in isolation, but the assembled context is missing the bridge passage that links them.

Cause. Standard BM25 ranks the obviously-relevant first-hop paragraph high (it shares the query’s content words), and the bridge paragraph low (it mentions only the linking entity, not the query). At a small budget, the bridge gets pruned.

Fix. Use retrieval="hybrid" (BM25 + dense rerank). The dense embedder catches semantic relationships that BM25 misses:

doc = redhop.Document.from_file("my_corpus.pdf",
                                retrieval="hybrid", model="bge-small")

Measured on HotpotQA (multi-hop QA, n=100): ≥0.8 retention rises 71% → 81% under hybrid. On MuSiQue (compositional 2-4 hop, n=100): 23% → 34% ≥0.8 retention. Latency cost: ~250ms p50 vs ~3ms for BM25.

See MULTIHOP_HYBRID and MULTIHOP_HYBRID_COMPETITORS.

Stripper doesn’t help on multi-hop (queries aren’t templated), and neither does candidate_k tuning (bridge passages aren’t in the larger BM25 pool). Dense rerank is the only knob that moves multi-hop retention.
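
The pruning mechanism described under Cause can be sketched as greedy packing into a token budget. Everything below is hypothetical: the scores and token counts are invented, and the "reranked" numbers merely stand in for whatever a dense reranker would produce. The point is only that the same budget keeps or drops the bridge depending on where the scorer ranks it.

```python
def pack(scored_chunks, budget):
    """Greedily keep the highest-scoring chunks that still fit the token budget."""
    chosen, used = [], 0
    ordered = sorted(scored_chunks, key=lambda c: c[1], reverse=True)
    for chunk_id, _score, n_tokens in ordered:
        if used + n_tokens <= budget:
            chosen.append(chunk_id)
            used += n_tokens
    return chosen

# (chunk_id, score, n_tokens); all numbers are invented for illustration.
lexical = [
    ("first_hop", 9.1, 120),
    ("distractor_a", 4.0, 110),
    ("distractor_b", 3.5, 130),
    ("bridge", 1.2, 100),   # mentions only the linking entity
]
reranked = [
    ("first_hop", 0.90, 120),
    ("bridge", 0.80, 100),  # assumed to be promoted by the dense scorer
    ("distractor_a", 0.40, 110),
    ("distractor_b", 0.30, 130),
]

lexical_context = pack(lexical, budget=256)
reranked_context = pack(reranked, budget=256)
print(lexical_context, reranked_context)
```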

Symptom. Retention is bad across queries you can’t classify into the above three patterns. The right chunk might be there but split awkwardly across two retrieved chunks, or one chunk is so big that pulling it eats the whole token budget.

Cause. RedHop’s default chunker is sentence-aware, ~128-token target. That fits HotpotQA-shape Wikipedia paragraphs and CUAD contract clauses well. It doesn’t fit code (long functions), tabular data, or structured documents (SOAP-note clinical records, deeply-nested JSON dumps).
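
A rough sketch of what a sentence-aware chunker with a token target does, and why it struggles on text without sentence boundaries. This is an assumption-laden illustration, not RedHop's chunker: it splits on sentence punctuation with a regex and counts whitespace-separated words as tokens.

```python
import re

def chunk_sentences(text, target_tokens=128):
    """Group whole sentences into chunks of roughly `target_tokens` words.

    Sentences are never split, so a single very long "sentence"
    (e.g. a function body with no sentence punctuation) becomes one
    oversized chunk.
    """
    if not text.strip():
        return []
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > target_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

prose = "Alpha beta gamma. Delta epsilon. Zeta eta theta iota."
prose_chunks = chunk_sentences(prose, target_tokens=5)

# 300 words with no sentence boundary, standing in for a long code block.
code_like = " ".join(["x"] * 300)
code_chunks = chunk_sentences(code_like, target_tokens=128)

print(prose_chunks)
print(len(code_chunks), len(code_chunks[0].split()))
```

In the second case the whole 300-word span comes back as one chunk, well over the target, which is the "one chunk eats the budget" symptom.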

Confirm with the chunk-size sweep:

for cs in [128, 256, 384, 512]:
    doc = redhop.Document.from_text(my_text, chunk_size=cs)
    ctx = doc.context(my_query)
    print(f"chunk_size={cs}: {doc.n_chunks} chunks, recall={evaluate(...).context_recall}")

On the workloads we measured (HotpotQA + MuSiQue), bigger chunks regressed. See MULTIHOP_CHUNK_SIZE_NULL. That’s evidence the 128-token default is well-tuned for those shapes. If your workload is different (code, tables, etc.), run the same sweep on your own data to find your sweet spot.

Fix. Bring your own chunker via Document.from_chunks(...):

from your_chunker import chunk_into_sections
sections = chunk_into_sections(open("paper.tex").read())
chunks = [
    redhop.Chunk(text, source="paper.tex", id=f"sec-{i}",
                 metadata={"section": title})
    for i, (title, text) in enumerate(sections)
]
doc = redhop.Document.from_chunks(chunks)

The constant-chunking matrix (MULTIHOP_CONSTANT_CHUNKING) showed the chunker is the lever (the BM25 implementation barely matters). Bringing your workload-specific chunker can move retention 12-20 points.

If you’ve checked all four patterns and retention is still bad, the most likely remaining causes are:

  • Token budget too tight. Default is 8192. Bump it higher if your model supports it, for example doc.context(query, budget=16000).
  • Off-topic queries. If the user’s question genuinely isn’t answered anywhere in the corpus, retrieval can only return the closest chunks. No retrieval system will manufacture the right answer. ctx.report.low_confidence_retrieval flags this.
  • Document parsing lost the content. For PDFs especially, the text may not have been extracted. Print len(doc.chunks) and a sample of ctx.chunks. If the actual answer text isn’t in any chunk, the retrieval tier can’t help. Try a different parser or from_text(open(...)) with a known-good extraction.

Related pages:

  • Choosing a configuration: the prescriptive version of this page, organized by workload shape rather than symptom.
  • Retrieval & context tips: the underlying principles (“why does BM25 work this way, when should I escalate to dense”).
  • Evidence layer: the measured findings each recommendation here traces back to.