Choosing a configuration
If you’re not sure which Document settings to use, this page tells you in
60 seconds, based on what the docs you’re loading actually look like.
The decision tree
Section titled “The decision tree” Your corpus is… │ ┌────────────────────┼─────────────────────────┐ │ │ │"normal" — code, "structured" — a "synonym-mismatch" — HRAPI refs, runbooks, contract / policy with FAQs, support ticketshandbooks, reports, near-duplicate clauses where users ask inmixed folders (e.g. governing-law different words than overrides per region) the docs use │ │ ▼ ▼ ▼ Document.from_file(p) Document.from_file(p, Document.from_file(p, doc.context(q) retrieval="hybrid", retrieval="hybrid", model="bge-small") model="bge-small", rerank="cross-encoder") doc.context(q, include_heading=True, neighbors=1)Three recipes cover the practical space.
The recipes
Section titled “The recipes”Default — for most docs
Section titled “Default — for most docs”No model download. ~50ms warm queries. Zero ONNX runtime.
import redhop
doc = redhop.Document.from_file("contract.pdf")ctx = doc.context("What is the governing law?")prompt = ctx.text() # feed to any LLMprint(ctx.report) # see what was retrieved and whyconst { Document } = require("redhop");const doc = Document.fromFile("contract.pdf");const ctx = doc.context("What is the governing law?");const prompt = ctx.text;console.log(ctx.report.rendered);let mut doc = redhop::read_file("contract.pdf")?;let ctx = doc.context("What is the governing law?")?;let prompt = ctx.text();When this is right: code, API references, internal docs, runbooks,
financial reports, handbooks, mixed folders (from_folder). The queries
share vocabulary with the answers — which is the case for most technical
and policy content.
Structured docs with parallel clauses
Section titled “Structured docs with parallel clauses”Hybrid + heading-aware retrieval. Adds an ~80MB embedding model download on first run; warm queries climb to ~150ms. Worth it only if your doc has clauses like “main clause X” and “EU override of clause X” and “Japan override of clause X” — heading awareness disambiguates them.
doc = redhop.Document.from_file( "msa.pdf", retrieval="hybrid", model="bge-small",)ctx = doc.context( "What law applies in the UK?", include_heading=True, neighbors=1,)const doc = Document.fromFile("msa.pdf", { retrieval: "hybrid", model: "bge-small",});const ctx = doc.context( "What law applies in the UK?", undefined, // budget — keep default 1, // neighbors true, // includeHeading);let mut doc = redhop::read_file_with("msa.pdf", &redhop::LoadOptions { retrieval: Some("hybrid".into()), model: Some("bge-small".into()), ..Default::default()})?;let ctx = doc.context_with("What law applies in the UK?", &redhop::ContextOptions { include_heading: true, neighbors: 1, ..Default::default()})?;When this is right: legal contracts with regional variations,
multi-jurisdiction policies, vendor security questionnaires with repeated
sub-sections. When it’s wrong: clean single-chapter docs — adding
neighbors=1 to well-structured chapters can dilute well-targeted
retrieval rather than help it.
Synonym-mismatch corpora
Section titled “Synonym-mismatch corpora”Adds a cross-encoder reranker — closes the synonym gap (the canonical “employee left” vs “staff terminated” case). Adds ~300MB of model download and 5–10× query latency.
doc = redhop.Document.from_file( "support_kb.md", retrieval="hybrid", model="bge-small", rerank="cross-encoder",)ctx = doc.context("why did the worker leave?")const doc = Document.fromFile("support_kb.md", { retrieval: "hybrid", model: "bge-small", rerank: "cross-encoder",});const ctx = doc.context("why did the worker leave?");When this is right: corpora where queries and answers regularly share no surface words (HR, support FAQs translated from internal phrasing, multilingual content). When it’s wrong: anywhere the lexical default already works — it adds latency without recovering anything. Verify it helps on your corpus before adopting.
Trade-offs at a glance
Section titled “Trade-offs at a glance”| Lexical default | Hybrid + bge | + cross-encoder rerank | |
|---|---|---|---|
| First-run model download | none | ~80MB (bge-small) | + ~300MB (cross-encoder) |
| Warm query latency | ~50ms | ~150ms | ~1000ms |
| Compile-time deps | none | ONNX runtime | ONNX runtime |
| Where it helps | most document QA | regional overrides, parallel sub-sections | synonym-mismatch retrieval |
| Where it hurts | — | adds latency on docs lexical already handles | adds latency without recovering anything unless the failure mode is synonym mismatch |
Query writing — the part the user controls
Section titled “Query writing — the part the user controls”The library can only retrieve what your query gives it. Two patterns no config can fix:
1. One-word polysemy queries
Section titled “1. One-word polysemy queries”'vendor' retrieves the vendor-management section, not the liability cap
(even when both mention vendors). 'settle' can retrieve the
indemnification clause (“settle a claim”) rather than the arbitration
clause (“settle a dispute”) — even with a cross-encoder reranker, because
both readings are defensible.
Fix it in the query, not the config: add one disambiguating word.
'liability cap for vendor' correctly finds the cap clause.
'arbitration forum to settle disputes' finds the arbitration clause.
2. Natural-language paraphrase with no shared vocabulary
Section titled “2. Natural-language paraphrase with no shared vocabulary”'How long do I have to cancel and get my money back?' against a
contract that uses “refund” and “termination for convenience” — not
“cancel” or “money back” — can return an empty or weak context across
every tier.
Fix in the query: use the doc’s vocabulary. “What’s the refund
window?” finds the relevant clause immediately. Fix at the config
level (sometimes): retrieval="semantic" (full dense, BM25 bypassed)
returns something where hybrid returns empty — but the result may
still not be the right clause. There’s a known bug
where hybrid sometimes returns fewer candidates than lexical alone.
See also
Section titled “See also”- Context optimization strategy — when to prune what was retrieved: Tips guide.
- All parameters — the full reference: Options.
- Loaders — every on-ramp (
from_text,from_file,from_folder,from_bytes): Loaders.