Choosing a configuration

If you’re not sure which Document settings to use, this page tells you in 60 seconds, based on what the docs you’re loading actually look like.

The decision tree

                  Your corpus is…
                        │
   ┌────────────────────┼─────────────────────────┐
   │                    │                         │
"normal" — code,    "structured" — a              "synonym-mismatch" — HR
API refs, runbooks, contract / policy with        FAQs, support tickets
handbooks, reports, near-duplicate clauses        where users ask in
mixed folders       (e.g. governing-law           different words than
                    overrides per region)         the docs use
                        │                         │
        ▼               ▼                         ▼
  Document.from_file(p)   Document.from_file(p,    Document.from_file(p,
  doc.context(q)          retrieval="hybrid",     retrieval="hybrid",
                          model="bge-small")      model="bge-small",
                                                  rerank="cross-encoder")
                          doc.context(q,
                              include_heading=True,
                              neighbors=1)

Three recipes cover the practical space.

The recipes

Default: for most docs

No model download. ~50ms warm queries. Zero ONNX runtime.

import redhop

doc = redhop.Document.from_file("contract.pdf")
ctx = doc.context("What is the governing law?")
prompt = ctx.text()         # feed to any LLM
print(ctx.report)           # see what was retrieved and why

const { Document } = require("redhop");
const doc = Document.fromFile("contract.pdf");
const ctx = doc.context("What is the governing law?");
const prompt = ctx.text;
console.log(ctx.report.rendered);

let mut doc = redhop::read_file("contract.pdf")?;
let ctx = doc.context("What is the governing law?")?;
let prompt = ctx.text();

When this is right: code, API references, internal docs, runbooks, financial reports, handbooks, mixed folders (from_folder). The queries share vocabulary with the answers, which is the case for most technical and policy content.

Structured docs with parallel clauses

Hybrid + heading-aware retrieval. Adds an ~80MB embedding model download on first run, and warm queries climb to ~150ms. Worth it only if your doc has clauses like “main clause X” and “EU override of clause X” and “Japan override of clause X”. Heading awareness disambiguates them.

doc = redhop.Document.from_file("msa.pdf", options=redhop.DocumentOptions(retrieval="hybrid", model="bge-small"))
ctx = doc.context(
    "What law applies in the UK?",
    include_heading=True,
    neighbors=1,
)

const doc = Document.fromFile("msa.pdf", {
  retrieval: "hybrid",
  model: "bge-small",
});
const ctx = doc.context(
  "What law applies in the UK?",
  undefined,   // budget — keep default
  1,           // neighbors
  true,        // includeHeading
);

let mut doc = redhop::read_file_with("msa.pdf", &redhop::LoadOptions {
    retrieval: Some("hybrid".into()),
    model: Some("bge-small".into()),
    ..Default::default()
})?;
let ctx = doc.context_with("What law applies in the UK?", &redhop::ContextOptions {
    include_heading: true,
    neighbors: 1,
    ..Default::default()
})?;

When this is right: legal contracts with regional variations, multi-jurisdiction policies, vendor security questionnaires with repeated sub-sections. When it’s wrong: clean single-chapter docs. Adding neighbors=1 to well-structured chapters can dilute well-targeted retrieval rather than help it.

Synonym-mismatch corpora

Adds a cross-encoder reranker, which closes the synonym gap (the canonical “employee left” vs “staff terminated” case). Adds ~300MB of model download and 5–10× query latency.

Python
Node.js

doc = redhop.Document.from_file("support_kb.md", options=redhop.DocumentOptions(retrieval="hybrid", model="bge-small", rerank="cross-encoder"))
ctx = doc.context("why did the worker leave?")

const doc = Document.fromFile("support_kb.md", {
  retrieval: "hybrid",
  model: "bge-small",
  rerank: "cross-encoder",
});
const ctx = doc.context("why did the worker leave?");

When this is right: corpora where queries and answers regularly share no surface words (HR, support FAQs translated from internal phrasing, multilingual content). When it’s wrong: anywhere the lexical default already works: it adds latency without recovering anything. Verify it helps on your corpus before adopting.

Trade-offs at a glance

	Lexical default	Hybrid + bge	+ cross-encoder rerank
First-run model download	none	~80MB (bge-small)	+ ~300MB (cross-encoder)
Warm query latency	~50ms	~150ms	~1000ms
Compile-time deps	none	ONNX runtime	ONNX runtime
Where it helps	most document QA	regional overrides, parallel sub-sections	synonym-mismatch retrieval
Where it hurts	—	adds latency on docs lexical already handles	adds latency without recovering anything unless the failure mode is synonym mismatch

When your corpus is a catalog, not prose

The decision tree above keys on content type. Two other axes change the answer and the tree doesn’t see them: corpus size (how many near-duplicate items you hold) and query length (a 2–5 token product reference is not a 15-token question). When your corpus is a product catalog, a parts list, an API surface, or anything with a high-cardinality family of near-identical items, read this section. The evidence is CATALOG_REGIME, a synthetic re-derivation of an external regime — treat the numbers as direction and measure on your own data before shipping a default. The deep-dive is the catalog & noisy-query guide.

The typo / short-token tier (char-ngram)

Short tokens carry no redundancy, so one transcription or OCR error (a brand arriving as 1ays instead of lays) zeroes token-exact BM25. A dense model doesn’t rescue a 1–2 token query either (the zero-dep ceiling is ~0.56). The lever is subword lexical matching, no model:

# default grams 3–5; tune with language="char_ngram:2-4"
doc = redhop.Document.from_text(catalog, options=redhop.DocumentOptions(language="char_ngram"))

const doc = Document.fromText(catalog, { language: "char_ngram" });

use std::sync::Arc;
use redhop::analyzer::CharNgramAnalyzer;
let mut doc = redhop::text(catalog, &redhop::LoadOptions {
    language: Some("char_ngram".into()),
    ..Default::default()
})?;
// or, at the retriever level: Bm25Retriever::with_analyzer(Arc::new(CharNgramAnalyzer::default()))

On brand-typo’d queries this held early precision near 0.98 at every catalog size while word-BM25 fell to ~0.10. The catch: char-ngram is a recall booster, not a drop-in. It actively reranks the near-duplicates word-BM25 leaves tied, so its clean set-coverage erodes at scale. Pair it with word-BM25 (the recall leg of a hybrid), don’t replace it. The default analyzer stays word-token for prose.

Corpus size can flip the best retriever

Holding content constant, the best retriever can change as the corpus grows: at a small catalog char-ngram and word-BM25 tie, at a large one word-BM25 holds the recall floor char-ngram loses. Size is a first-class axis, not a footnote. There is no auto-selector yet, so the procedure is manual — run both arms on a held-out sample at your real catalog size and pick on your own metric. Don’t carry a small-corpus choice to a large corpus unmeasured.

Choosing field weights (and why a boost can hurt)

BM25 indexes three fields (text / source / heading) at equal weight. For a near-duplicate catalog you can boost the field carrying the discriminating token:

# [text, source, heading]; boost heading (e.g. brand/title) 2x
doc = redhop.Document.from_text(catalog, options=redhop.DocumentOptions(bm25_field_weights=[1.0, 1.0, 2.0]))

const doc = Document.fromText(catalog, { bm25FieldWeights: [1.0, 1.0, 2.0] });

use redhop::retrieval::{Bm25Retriever, FieldWeights};
let r = Bm25Retriever::new()?
    .with_field_weights(FieldWeights { text: 1.0, source: 1.0, heading: 2.0 });
// or DocumentConfig.bm25_field_weights

A weight of 1.0 is the exact default (skipped before it reaches the index), so the knob is zero-regression. But a boost is not free lift. In our re-derivation, boosting a structured field had no effect on set-coverage — a boost on a field the near-duplicates share scales them all equally and reorders nothing (CATALOG_REGIME Panel D). The rule to remember: a field boost helps only when the boosted field separates the answer from its near-duplicates. Sweep the weight on a held-out set with your own eval, and watch set-coverage (below), not just recall@k.

Measuring the whole set, not just one answer

A catalog query often maps to a set (every size or flavor of a product), and recall@k against a single gold chunk hides a half-retrieved family. Score strict set-coverage with gold_families:

r = redhop.evaluate(query, ctx, gold_families=[["sku_a1", "sku_a2"], ["sku_b1", "sku_b2"]])
print(r.set_coverage)   # fraction of families fully present in the context

This caught families that a recall@20 of 1.000 reported as fine but that were actually un-offerable for disambiguation. More on the Evaluation page.

Query writing: the part the user controls

The library can only retrieve what your query gives it. Three patterns no config can fix.

1. One-word polysemy queries

'vendor' retrieves the vendor-management section, not the liability cap (even when both mention vendors). 'settle' can retrieve the indemnification clause (“settle a claim”) rather than the arbitration clause (“settle a dispute”), even with a cross-encoder reranker, because both readings are defensible.

Fix it in the query, not the config: add one disambiguating word. 'liability cap for vendor' correctly finds the cap clause. 'arbitration forum to settle disputes' finds the arbitration clause. The report flags this shape with the underdetermined_query hint when a short query produces a nearly flat score spread across many candidates.

2. Natural-language paraphrase with no shared vocabulary

'How long do I have to cancel and get my money back?' against a contract that uses “refund” and “termination for convenience” (not “cancel” or “money back”) can return an empty or weak context across every tier.

Fix in the query: use the doc’s vocabulary. “What’s the refund window?” finds the relevant clause immediately. Fix at the config level (sometimes): retrieval="hybrid" adds a dense embedder that can match refund to cancel through semantic similarity. Hybrid is a strict superset of lexical (BM25-tail fallback fills any chunks the dense pool missed), so you never lose candidates by turning it on. The cost is the ~80MB embedder download and ~3× warm latency. The report flags this shape with the vocab_mismatch hint and lists the exact zero-match terms.

3. Templated queries with heavy boilerplate

If every query in your workload follows a fixed template (“Highlight the parts (if any) of this contract related to X that should be reviewed by a lawyer. Details: …”, “Help me with X, my account is Y, the error is Z”, form-filled queries from a structured UI), BM25 weights each query term by corpus IDF, not by how often the term appears across your query set. So the 19 boilerplate words dilute the 5 real signal words, and retention suffers. The report flags this shape with the low_discrimination_query hint when most query terms appear in a large fraction of the corpus.

This is measured. On CUAD’s fixed 24-word template, stripping the boilerplate to just <clause name> <details> before calling context() lifts ≥0.8 retention from 81.3% → 87.7% (n=300, BM25, budget 2,000 tok), overtaking LlamaIndex’s 86%. Full mechanism + numbers: CUAD_CLAUSE_EXPANSION (the controlled three-arm run).

Two paths up the same hill: pick one, don’t combine. Measured on CUAD (CUAD_HYBRID_RERANK):

path	what you do	retention	latency
One-knob	`retrieval="hybrid"` (BGE-small embedder)	~86–88%	~10 ms/q
Best-quality	BM25 default + `analyze_query_set` → `Stripper` + `Vocabulary` chain via `doc.context_with_rewrites(...)`	90.7%	~2.5 ms/q

Hybrid retrieval reads chunks as semantic content rather than counting tokens, so the boilerplate ratio stops mattering. It substitutes for template stripping by a different mechanism. Running both gives diminishing returns: once one mechanism has fixed the boilerplate dilution, the other adds only +0.3 points. Strip + expand is Pareto-optimal on CUAD (higher retention AND lower latency) but takes the upfront work of writing a stripper and building a synonym dict.

Recommended workflow if you go the best-quality path: detect → compile → run-through-rewrites → A/B. Every step ships in the public API. The rewrites compile once (the analyzer pass happens at construction time, not per query) and run through doc.context_with_rewrites(query, [stripper, vocab]). The per-stage audit trail lands on ctx.report.query_rewrites so every transform is observable:

import redhop

# 1. Detect — analyzer reports the shape of your query set.
report = redhop.analyze_query_set(my_queries[:300])
# Cross-workload probe (findings/QUERY_SET_ANALYZER.md):
#   CUAD     → is_templated=True,  share=0.66, cost="high"
#   HotpotQA → is_templated=False, share=0.00, cost="none"
#   MuSiQue  → is_templated=False, share=0.12, cost="none"

if report.is_templated:
    # 2. Compile the rewrites. Stripper compiles the boilerplate
    #    once via the analyzer — token-level matching, so an "of"
    #    stripper does NOT erase the "of" inside "office".
    stripper = redhop.Stripper(report.boilerplate_terms)

    # 3. (optional) Vocabulary. If your workload has a known taxonomy
    #    of "topics" with predictable synonyms (clause types, error
    #    codes, issue categories), compile them once. On CUAD this
    #    lifts retention from 87.7% to 90.7% on top of the Stripper.
    #    Mechanism: workload-curated high-IDF discriminators raise
    #    the BM25 score of the relevant chunk. Opposite mechanism
    #    direction from PRF (which fails on boilerplate-heavy
    #    corpora; see CUAD_PRF_NULL).
    vocab = redhop.Vocabulary({
        # YOUR workload's keys → synonyms. Worked CUAD example in
        # CUAD_CLAUSE_EXPANSION.md.
        "change of control": ["merger", "successor", "acquisition"],
        "non-compete":       ["restraint", "non-competition"],
    })

    # 4. Run the chain inside context_with_rewrites; the per-stage
    #    audit lands on ctx.report.query_rewrites automatically.
    doc = redhop.Document.from_text(your_document)
    ctx_a = doc.context(user_query)                              # baseline
    ctx_b = doc.context_with_rewrites(user_query, [stripper, vocab])

    # 5. A/B — redhop.evaluate scores both arms deterministically.
    #          No LLM judge; same primitives the Decision Report uses.
    eval_a = redhop.evaluate(user_query, ctx_a, gold_chunks=gold_ids)
    eval_b = redhop.evaluate(user_query, ctx_b, gold_chunks=gold_ids)
    # eval_b.overall - eval_a.overall is the per-query lift.

    # 6. Inspect — every rewrite is observable.
    for rec in ctx_b.report.query_rewrites:
        print(rec.stage, "matched=", rec.matched,
              "added=", rec.added, "removed=", rec.removed)

The analyzer is conservative by design: HotpotQA and MuSiQue both register quiet on the probe (is_templated=False), while CUAD fires (is_templated=True, share 0.66). The analyzer measures the shape of your queries. It does not promise a specific retention lift. The CUAD lift was measured directly at +6 points on CUAD specifically. On a different templated workload the magnitude depends on how much of your real signal was being drowned, which is why step 3 matters.

For single-doc extraction workloads also set strategy="raw_topk". On contract-shape tasks the Auto-routed reasoning_preserving strategy is solving a multi-hop problem you don’t have, and raw_topk beats it by ~4 points at every chunk size.

We deliberately do not ship a CUAD-specific strip_template() helper. Templates are workload-specific, and embedding one into the library would make the wrong call for the next workload. Stripper(...) and Vocabulary({...}) take your boilerplate / synonym dict so the call stays on your side.

What about PRF / query expansion? Tested twice on RedHop, falsified twice with two different failure mechanisms. The dilution win here is subtraction at the query boundary, not addition. See CUAD_PRF_NULL for the mechanism that predicts where unweighted PRF will fail on a new workload.