Skip to content

Choosing a configuration

If you’re not sure which Document settings to use, this page tells you in 60 seconds, based on what the docs you’re loading actually look like.

Your corpus is…
┌────────────────────┼─────────────────────────┐
│ │ │
"normal" — code, "structured" — a "synonym-mismatch" — HR
API refs, runbooks, contract / policy with FAQs, support tickets
handbooks, reports, near-duplicate clauses where users ask in
mixed folders (e.g. governing-law different words than
overrides per region) the docs use
│ │
▼ ▼ ▼
Document.from_file(p) Document.from_file(p, Document.from_file(p,
doc.context(q) retrieval="hybrid", retrieval="hybrid",
model="bge-small") model="bge-small",
rerank="cross-encoder")
doc.context(q,
include_heading=True,
neighbors=1)

Three recipes cover the practical space.

No model download. ~50ms warm queries. Zero ONNX runtime.

import redhop
doc = redhop.Document.from_file("contract.pdf")
ctx = doc.context("What is the governing law?")
prompt = ctx.text() # feed to any LLM
print(ctx.report) # see what was retrieved and why

When this is right: code, API references, internal docs, runbooks, financial reports, handbooks, mixed folders (from_folder). The queries share vocabulary with the answers — which is the case for most technical and policy content.

Hybrid + heading-aware retrieval. Adds an ~80MB embedding model download on first run; warm queries climb to ~150ms. Worth it only if your doc has clauses like “main clause X” and “EU override of clause X” and “Japan override of clause X” — heading awareness disambiguates them.

doc = redhop.Document.from_file(
"msa.pdf",
retrieval="hybrid",
model="bge-small",
)
ctx = doc.context(
"What law applies in the UK?",
include_heading=True,
neighbors=1,
)

When this is right: legal contracts with regional variations, multi-jurisdiction policies, vendor security questionnaires with repeated sub-sections. When it’s wrong: clean single-chapter docs — adding neighbors=1 to well-structured chapters can dilute well-targeted retrieval rather than help it.

Adds a cross-encoder reranker — closes the synonym gap (the canonical “employee left” vs “staff terminated” case). Adds ~300MB of model download and 5–10× query latency.

doc = redhop.Document.from_file(
"support_kb.md",
retrieval="hybrid",
model="bge-small",
rerank="cross-encoder",
)
ctx = doc.context("why did the worker leave?")

When this is right: corpora where queries and answers regularly share no surface words (HR, support FAQs translated from internal phrasing, multilingual content). When it’s wrong: anywhere the lexical default already works — it adds latency without recovering anything. Verify it helps on your corpus before adopting.

Lexical defaultHybrid + bge+ cross-encoder rerank
First-run model downloadnone~80MB (bge-small)+ ~300MB (cross-encoder)
Warm query latency~50ms~150ms~1000ms
Compile-time depsnoneONNX runtimeONNX runtime
Where it helpsmost document QAregional overrides, parallel sub-sectionssynonym-mismatch retrieval
Where it hurtsadds latency on docs lexical already handlesadds latency without recovering anything unless the failure mode is synonym mismatch

Query writing — the part the user controls

Section titled “Query writing — the part the user controls”

The library can only retrieve what your query gives it. Two patterns no config can fix:

'vendor' retrieves the vendor-management section, not the liability cap (even when both mention vendors). 'settle' can retrieve the indemnification clause (“settle a claim”) rather than the arbitration clause (“settle a dispute”) — even with a cross-encoder reranker, because both readings are defensible.

Fix it in the query, not the config: add one disambiguating word. 'liability cap for vendor' correctly finds the cap clause. 'arbitration forum to settle disputes' finds the arbitration clause.

2. Natural-language paraphrase with no shared vocabulary

Section titled “2. Natural-language paraphrase with no shared vocabulary”

'How long do I have to cancel and get my money back?' against a contract that uses “refund” and “termination for convenience” — not “cancel” or “money back” — can return an empty or weak context across every tier.

Fix in the query: use the doc’s vocabulary. “What’s the refund window?” finds the relevant clause immediately. Fix at the config level (sometimes): retrieval="semantic" (full dense, BM25 bypassed) returns something where hybrid returns empty — but the result may still not be the right clause. There’s a known bug where hybrid sometimes returns fewer candidates than lexical alone.

  • Context optimization strategy — when to prune what was retrieved: Tips guide.
  • All parameters — the full reference: Options.
  • Loaders — every on-ramp (from_text, from_file, from_folder, from_bytes): Loaders.