Deploy RedHop to production

In-process retrieval means there is no separate service to operate, but the production decisions don’t disappear. Once you move past a script that re-indexes on every run, a handful of choices matter: where the indexed Document lives across requests, whether the index persists to disk, and which report fields you send to your observability stack.

The biggest gain over the tutorial pattern is loading the document once at startup instead of per request. A single PDF takes about a millisecond to index; from_folder over thousands of files takes seconds. Either way, paying that cost on every request is waste, so hold the indexed Document in shared state.

from fastapi import FastAPI
from pydantic import BaseModel
from contextlib import asynccontextmanager
import redhop


@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.doc = redhop.Document.from_file("contract.pdf")
    yield


app = FastAPI(lifespan=lifespan)


class Ask(BaseModel):
    question: str


@app.post("/ask")
def ask(req: Ask):
    ctx = app.state.doc.context(req.question)
    return {
        "prompt": ctx.text(),
        "citations": ctx.citations,
        "report": redhop.report_to_dict(ctx.report),
    }

Run it with several workers, for example uvicorn main:app --workers 4 if the code above lives in main.py. Each worker process holds its own Document. Warm queries land in 1 to 6 ms.

For a single file, re-indexing on startup is effectively free. For from_folder over thousands of files it isn’t. Pass persist=True and subsequent loads pay only for the files that changed since the last run.

doc = redhop.Document.from_folder("./docs", persist=True)

# or a custom location for the cache:
doc = redhop.Document.from_folder(
    "./docs",
    persist=True,
    index_dir="/var/cache/redhop",
)

The persisted layout stores a content fingerprint per file. On reload, RedHop diffs the fingerprints against the current files: unchanged files skip parsing and chunking, new and modified files re-index, deleted files drop. A 5,000-file repo where three files changed reloads in under a second.
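The diff step described above can be illustrated with a short standalone sketch. This is not RedHop's implementation or its on-disk format: the SHA-256 fingerprint and the function names fingerprint_folder and diff_fingerprints are assumptions chosen purely for illustration.

```python
import hashlib
from pathlib import Path


def fingerprint_folder(root: str) -> dict[str, str]:
    """Map each file's relative path to a SHA-256 hash of its bytes."""
    base = Path(root)
    return {
        str(p.relative_to(base)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }


def diff_fingerprints(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify files by comparing stored fingerprints with current ones."""
    return {
        # Same path, same hash: parsing and chunking can be skipped.
        "unchanged": sorted(k for k in new if old.get(k) == new[k]),
        # Same path, different hash: needs re-indexing.
        "modified": sorted(k for k in new if k in old and old[k] != new[k]),
        # Path not seen before: needs indexing.
        "added": sorted(k for k in new if k not in old),
        # Path no longer on disk: drop from the index.
        "deleted": sorted(k for k in old if k not in new),
    }
```

Only the "modified" and "added" buckets cost indexing work, which is why a reload after a small change stays cheap even for a large folder.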

A persistent index assumes a single writer. Multiple processes writing to the same .redhop/ directory will race. For multi-process deployments, give each worker its own index_dir, or build the index once in a deploy step and mount it read-only at runtime.

For docs or knowledge-base content that updates outside the service, watch the folder and rebuild the Document on change. Assigning the new Document to the shared reference is atomic: in-flight requests use the old one, new requests see the new one. No locking required.

import asyncio
from watchfiles import awatch
import redhop

doc = redhop.Document.from_folder("./docs", persist=True)


async def watcher(app):
    async for changes in awatch("./docs"):
        app.state.doc = redhop.Document.from_folder("./docs", persist=True)
        print(f"reloaded after {len(changes)} change(s)")
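If from_folder is a synchronous call, as its use elsewhere in this guide suggests, calling it directly inside the async watcher would block the event loop for the duration of the rebuild. A minimal sketch of one way around that, using asyncio.to_thread, follows. Here build_document is a hypothetical stand-in for redhop.Document.from_folder, and a SimpleNamespace stands in for app.state, so the example runs without RedHop installed. It also assumes the real call is safe to run in a worker thread.

```python
import asyncio
import time
from types import SimpleNamespace


def build_document(folder: str) -> dict:
    """Hypothetical stand-in for a blocking redhop.Document.from_folder call."""
    time.sleep(0.05)  # simulate indexing work
    return {"folder": folder, "built_at": time.monotonic()}


async def reload_document(state: SimpleNamespace, folder: str) -> None:
    """Rebuild in a worker thread, then swap the shared reference."""
    new_doc = await asyncio.to_thread(build_document, folder)
    # A single attribute assignment: requests that already read state.doc
    # keep the old object, later requests see the new one.
    state.doc = new_doc


async def main() -> tuple[dict, dict]:
    state = SimpleNamespace(doc=build_document("./docs"))
    in_flight = state.doc  # a request that started before the reload
    await reload_document(state, "./docs")
    return in_flight, state.doc
```

In the watcher, the loop body would then become an await on reload_document(app.state, "./docs"). The watcher itself still has to be started somewhere, for instance with asyncio.create_task inside the lifespan handler.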

The defaults are calibrated against measured failure modes, and the evidence layer in the repo documents the why for each one. A few knobs are worth knowing about in production:

| Knob | Default | When to change |
| --- | --- | --- |
| chunk_size | 128 | Larger (256 to 512) for prose-heavy docs where wider context helps. Smaller (64) for very dense reference material. Index-time, so changes require re-indexing. |
| chunk_overlap | 1 sentence | Higher (2 or 3) if retrieval misses chunks where the answer straddles a boundary. |
| token_budget | 8192 | Match the LLM’s prompt budget. Smaller (2,000 to 4,000) for cost-sensitive workloads. |
| candidate_k | 20 | More candidates (40 to 50) for diverse corpora. |
| retrieval | "lexical" | "hybrid" only when keyword search misses. See Choosing a configuration. |
| rerank | none | "cross-encoder" only when paraphrase mismatch is the actual failure mode. Adds five to ten times the query latency. |
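One way to read "match the LLM’s prompt budget" for token_budget is as simple arithmetic: whatever remains of the model's context window after reserving room for the answer and for the fixed parts of the prompt. The sketch below assumes that interpretation. The function name and all the numbers in the usage note are hypothetical.

```python
def retrieval_token_budget(
    context_window: int,
    reserved_output: int,
    prompt_overhead: int,
) -> int:
    """Tokens left for retrieved context after the fixed parts of the request.

    context_window:  total tokens the model accepts (prompt plus completion)
    reserved_output: tokens kept free for the model's answer
    prompt_overhead: system prompt, instructions, and the user's question
    """
    budget = context_window - reserved_output - prompt_overhead
    if budget <= 0:
        raise ValueError("no room left for retrieved context")
    return budget
```

For example, with a 16,384-token window, 1,024 tokens reserved for the answer, and 600 tokens of fixed prompt, retrieval_token_budget(16384, 1024, 600) leaves 14,760 tokens for retrieved context.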

The rule that always applies: don’t recreate the Document per request.

FROM python:3.12-slim
WORKDIR /app
RUN pip install --no-cache-dir redhop fastapi uvicorn openai
COPY app.py .
COPY docs/ ./docs/
# Bake the persistent index into the image so cold starts hit the
# cached state instead of indexing from scratch.
RUN python -c "import redhop; redhop.Document.from_folder('./docs', persist=True)"
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

The image comes to around 150 MB without the semantic tier. With it, add roughly 80 MB for the bge-small ONNX model. Either bake the model into the image or mount a persistent volume to avoid re-downloading on each cold start.

The report on every BuiltContext carries structured fields that slot into any tracing or logging stack.

import logging, redhop

log = logging.getLogger("redhop")


@app.post("/ask")
def ask(req: Ask):
    ctx = app.state.doc.context(req.question)
    log.info("retrieval", extra={
        "query": req.question,
        "auto_decision": ctx.report.auto_decision,
        "n_input_chunks": ctx.report.n_input_chunks,
        "n_selected": ctx.report.n_selected,
        "total_tokens": ctx.report.total_tokens,
        "retained_evidence_ratio": ctx.report.retained_evidence_ratio,
        "n_citations": len(ctx.citations),
    })
    return {"prompt": ctx.text(), "citations": ctx.citations}
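Note that Python's standard logging module attaches extra fields to the LogRecord as attributes but does not print them with the default formatter. A minimal sketch of a formatter that emits them as JSON is shown here. The field names come from the log.info call above; the class name RetrievalJsonFormatter and the helper configure_logging are hypothetical.

```python
import json
import logging

# Field names taken from the log.info call above.
RETRIEVAL_FIELDS = (
    "query", "auto_decision", "n_input_chunks", "n_selected",
    "total_tokens", "retained_evidence_ratio", "n_citations",
)


class RetrievalJsonFormatter(logging.Formatter):
    """Render the message plus any retrieval fields set via `extra` as JSON."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for name in RETRIEVAL_FIELDS:
            # Fields passed via `extra` become attributes on the record.
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        return json.dumps(payload, default=str)


def configure_logging() -> logging.Logger:
    """Attach the JSON formatter to the "redhop" logger used above."""
    handler = logging.StreamHandler()
    handler.setFormatter(RetrievalJsonFormatter())
    log = logging.getLogger("redhop")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    return log
```

Calling configure_logging() once at startup is enough; structured logging libraries or a tracing SDK would serve the same purpose.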

Worth alerting on:

  • An empty citations list. The retriever found nothing, usually a vocabulary mismatch between the query and the corpus.
  • retained_evidence_ratio dropping. The pruner is dropping more than before, often because the corpus shape has shifted.
  • auto_decision flipping its passthrough/prune mix. The input contexts have changed in size or density.
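The three conditions above could be checked in-process with something like the following sketch. It works on a plain dict with the report field names used in this guide (for example the output of redhop.report_to_dict). The class name, the thresholds, the rolling-window approach, and the assumption that auto_decision takes the string values "passthrough" and "prune" are all illustrative and should be tuned or corrected against your own baseline.

```python
from collections import Counter, deque


class RetrievalMonitor:
    """Track recent reports and flag the three conditions listed above."""

    def __init__(self, window=200, min_samples=20,
                 min_retained_ratio=0.5, max_prune_share=0.8):
        self.decisions = deque(maxlen=window)  # rolling auto_decision history
        self.min_samples = min_samples
        self.min_retained_ratio = min_retained_ratio
        self.max_prune_share = max_prune_share

    def check(self, report: dict, n_citations: int) -> list[str]:
        alerts = []
        if n_citations == 0:
            alerts.append("empty_citations")
        if report["retained_evidence_ratio"] < self.min_retained_ratio:
            alerts.append("low_retained_evidence_ratio")
        self.decisions.append(report["auto_decision"])
        # Only judge the mix once enough samples have accumulated.
        if len(self.decisions) >= self.min_samples:
            prune_share = Counter(self.decisions)["prune"] / len(self.decisions)
            if prune_share > self.max_prune_share:
                alerts.append("prune_share_high")
        return alerts
```

A static floor on retained_evidence_ratio is the simplest possible check; comparing against a rolling baseline in your metrics backend would catch gradual drops that a fixed threshold misses.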

For a pre-flight check without assembling the context, doc.analyze(query) returns the same report shape.

The native addon means RedHop will not run on Cloudflare Workers or Vercel Edge. The Node guide covers what to use instead in those environments: Node.js library for RAG.

Cross-encoder rerank adds five to ten times the query latency. Enable it only when synonym mismatch is the measured failure mode on your corpus, not by default.

Multi-node scaling: each node holds its own indexed Document. If the corpus exceeds single-node memory, that’s past RedHop’s sweet spot. Switch to a vector database (LanceDB, Qdrant, Pinecone) and keep RedHop as the context assembly layer if you still want that piece.

Tested against RedHop 0.3.x.