Deploy RedHop to production
In-process retrieval means there is no separate service to operate, but
it doesn’t mean the production decisions are gone. Once you move past
a script that re-indexes on every run, a handful of choices matter:
where the indexed Document lives across requests, whether the index
persists to disk, and which report fields you send to your
observability stack.
One Document, many requests
The biggest gain over the tutorial pattern is loading the document
once at startup instead of per request. A single PDF takes about a
millisecond to index; from_folder over thousands of files takes seconds.
Either way, paying that cost per request is waste. Hold the indexed
Document in shared state.
Python (FastAPI):

    from contextlib import asynccontextmanager

    from fastapi import FastAPI
    from pydantic import BaseModel
    import redhop

    @asynccontextmanager
    async def lifespan(app: FastAPI):
        # Index once at startup; every request reuses this Document.
        app.state.doc = redhop.Document.from_file("contract.pdf")
        yield

    app = FastAPI(lifespan=lifespan)

    class Ask(BaseModel):
        question: str

    @app.post("/ask")
    def ask(req: Ask):
        ctx = app.state.doc.context(req.question)
        return {
            "prompt": ctx.text(),
            "citations": ctx.citations,
            "report": redhop.report_to_dict(ctx.report),
        }

Run it with uvicorn main:app --workers 4. Each worker holds its own Document.
Warm queries land in 1 to 6 ms.
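To check the warm-query figure on your own corpus and hardware, a small timing wrapper is enough. This is a minimal sketch using only the standard library; `timed_ms` is a hypothetical helper name, not part of RedHop.

```python
import time


def timed_ms(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```

Inside the handler this would wrap the existing call, for example `ctx, ms = timed_ms(app.state.doc.context, req.question)`, and `ms` can be logged next to the report fields. Discard the first few calls if you only care about warm latency.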
Node.js (Express):

    import express from "express";
    import { Document } from "redhop";

    const app = express();
    app.use(express.json());

    const doc = Document.fromFile("contract.pdf");
    console.log(`indexed ${doc.chunkCount} chunks`);

    app.post("/ask", (req, res) => {
      const ctx = doc.context(req.body.question);
      res.json({
        prompt: ctx.text,
        citations: ctx.citations,
        report: {
          autoDecision: ctx.report.autoDecision,
          totalTokens: ctx.report.totalTokens,
          retainedEvidenceRatio: ctx.report.retainedEvidenceRatio,
        },
      });
    });

    app.listen(3000);

In a clustered setup (PM2 or the Node cluster module), each worker
holds its own Document. Memory cost is per worker.
Rust (axum):

    use axum::{Router, routing::post, extract::{State, Json}, response::Json as JsonResp};
    use serde::{Deserialize, Serialize};
    use redhop::read_file;
    use std::sync::Arc;
    use parking_lot::Mutex;

    #[derive(Clone)]
    struct AppState {
        doc: Arc<Mutex<redhop::Document>>,
    }

    #[derive(Deserialize)]
    struct Ask { question: String }

    #[derive(Serialize)]
    struct Answer { prompt: String, citations: Vec<serde_json::Value> }

    async fn ask(State(s): State<AppState>, Json(req): Json<Ask>) -> JsonResp<Answer> {
        let mut doc = s.doc.lock();
        let ctx = doc.context(&req.question).unwrap();
        JsonResp(Answer {
            prompt: ctx.text().to_string(),
            citations: serde_json::to_value(&ctx.citations).unwrap()
                .as_array().cloned().unwrap_or_default(),
        })
    }

    #[tokio::main]
    async fn main() {
        let doc = read_file("contract.pdf").expect("load failed");
        let state = AppState { doc: Arc::new(Mutex::new(doc)) };
        let app = Router::new().route("/ask", post(ask)).with_state(state);
        let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
        axum::serve(listener, app).await.unwrap();
    }

For high-concurrency reads, RwLock over Mutex lets multiple
queries run in parallel against the same index.
Persist the index
For a single file, reindexing on startup is effectively free. For
from_folder over thousands of files it isn’t. Pass persist=True and
subsequent loads only pay for the files that changed since the last run.
Python:

    doc = redhop.Document.from_folder("./docs", persist=True)

    # or a custom location for the cache:
    doc = redhop.Document.from_folder("./docs", persist=True, index_dir="/var/cache/redhop")

Node.js:

    const doc = Document.fromFolder("./docs", { persist: true });

Or, with a custom location for the cache:

    const doc = Document.fromFolder("./docs", {
      persist: true,
      indexDir: "/var/cache/redhop",
    });

Rust:

    use redhop::{read_folder_with, FolderOptions};

    let mut doc = read_folder_with("./docs", &FolderOptions {
        persist: true,
        index_dir: Some("/var/cache/redhop".into()),
        ..Default::default()
    })?;

The persisted layout stores a content fingerprint per file. On reload, RedHop diffs the fingerprints against the current files: unchanged files skip parsing and chunking, new and modified files re-index, deleted files drop. A 5,000-file repo where three files changed reloads in under a second.
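The fingerprint diff described above can be illustrated with a few lines of standard-library Python. This is only a sketch of the general technique, assuming SHA-256 over file bytes as the fingerprint; it is not RedHop's actual on-disk format or hashing scheme, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(path: Path) -> str:
    """Return a SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its content fingerprint."""
    return {
        str(p.relative_to(root)): fingerprint(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify files by comparing a stored snapshot with the current one."""
    return {
        "unchanged": sorted(k for k in new if old.get(k) == new[k]),  # skip work
        "modified": sorted(k for k in new if k in old and old[k] != new[k]),
        "added": sorted(k for k in new if k not in old),
        "deleted": sorted(k for k in old if k not in new),
    }
```

Only the "modified" and "added" buckets need parsing and chunking, which is why a reload after a small change stays cheap even on a large folder.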
A persistent index assumes a single writer. Multiple processes writing
to the same .redhop/ directory will race. For multi-process
deployments, give each worker its own index_dir, or build the index
once in a deploy step and mount it read-only at runtime.
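One way to give each worker its own index_dir is to derive the path from a worker identifier. The sketch below is an assumption about how you might wire this up, not something RedHop provides; `worker_index_dir` is a hypothetical helper.

```python
import os
from pathlib import Path
from typing import Optional


def worker_index_dir(base: str, worker_id: Optional[int] = None) -> Path:
    """Return a cache directory unique to this worker.

    Uses the given worker_id when the process manager supplies a stable
    slot number; otherwise falls back to the process ID.
    """
    wid = os.getpid() if worker_id is None else worker_id
    return Path(base) / f"worker-{wid}"


# Hypothetical usage with the persist options shown earlier:
#   redhop.Document.from_folder(
#       "./docs", persist=True,
#       index_dir=str(worker_index_dir("/var/cache/redhop", worker_id=0)),
#   )
```

The PID fallback avoids write races but changes on every restart, so the cache is not reused and stale directories accumulate. A stable slot number, or the build-once, mount-read-only approach, keeps the benefit of persistence.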
Hot-reload when files change
For docs or knowledge-base content that updates outside the service,
watch the folder and rebuild the Document on change. Assigning the
new Document to the shared reference is atomic: in-flight requests
use the old one, new requests see the new one. No locking required.
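The swap semantics can be seen in isolation with a stand-in object. This is a minimal sketch: `FakeDoc` and `State` are hypothetical placeholders for the indexed Document and app.state, and the guarantee depends on each request reading the shared reference once and then working with its local copy.

```python
class FakeDoc:
    """Hypothetical stand-in for an indexed Document (not the RedHop class)."""

    def __init__(self, version: int):
        self.version = version

    def context(self, question: str) -> str:
        return f"v{self.version}: {question}"


class State:
    """Plays the role of app.state: one attribute that a reload rebinds."""

    def __init__(self, doc):
        self.doc = doc


def handle(state, question, mid_request=None):
    doc = state.doc              # read the shared reference exactly once
    if mid_request is not None:
        mid_request()            # simulate a reload landing mid-request
    return doc.context(question)


state = State(FakeDoc(1))


def reload():
    state.doc = FakeDoc(2)       # the swap: a single attribute rebind


in_flight = handle(state, "term?", mid_request=reload)  # still uses version 1
later = handle(state, "term?")                          # sees version 2
```

A handler that reads the shared attribute twice within one request could see two different Documents, so bind it to a local variable first if the request touches it more than once.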
Python:

    import asyncio

    from watchfiles import awatch
    import redhop

    doc = redhop.Document.from_folder("./docs", persist=True)

    async def watcher(app):
        # Schedule with asyncio.create_task(watcher(app)) from the lifespan handler.
        async for changes in awatch("./docs"):
            app.state.doc = redhop.Document.from_folder("./docs", persist=True)
            print(f"reloaded after {len(changes)} change(s)")

Node.js:

    import { watch } from "node:fs/promises";
    import { Document } from "redhop";

    let doc = Document.fromFolder("./docs", { persist: true });

    for await (const _evt of watch("./docs", { recursive: true })) {
      doc = Document.fromFolder("./docs", { persist: true });
    }

Knobs that matter in production
The defaults are calibrated against measured failure modes, and the evidence layer in the repo documents the why for each one. A few knobs are worth knowing about in production:
| Knob | Default | When to change |
|---|---|---|
| chunk_size | 128 | Larger (256 to 512) for prose-heavy docs where wider context helps. Smaller (64) for very dense reference material. Index-time, so changes require re-indexing. |
| chunk_overlap | 1 sentence | Higher (2 or 3) if retrieval misses chunks where the answer straddles a boundary. |
| token_budget | 8192 | Match the LLM’s prompt budget. Smaller (2,000 to 4,000) for cost-sensitive workloads. |
| candidate_k | 20 | More candidates (40 to 50) for diverse corpora. |
| retrieval | "lexical" | "hybrid" only when keyword search misses. See Choosing a configuration. |
| rerank | none | "cross-encoder" only when paraphrase mismatch is the actual failure mode. Adds five to ten times the query latency. |
The rule that always applies: don’t recreate the Document per
request.
Docker
Python:

    FROM python:3.12-slim

    WORKDIR /app
    RUN pip install --no-cache-dir redhop fastapi uvicorn openai

    COPY app.py .
    COPY docs/ ./docs/

    # Bake the persistent index into the image so cold starts hit the
    # cached state instead of indexing from scratch.
    RUN python -c "import redhop; redhop.Document.from_folder('./docs', persist=True)"

    CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Around 150 MB without the semantic tier. With it, add roughly 80 MB for the bge-small ONNX model. Either bake the model into the image or mount a persistent volume to avoid re-downloading on each cold start.
Node.js:

    FROM node:20-slim

    WORKDIR /app
    COPY package*.json ./
    RUN npm ci --omit=dev

    COPY . .

    RUN node -e "import('redhop').then(({Document}) => Document.fromFolder('./docs', {persist: true}))"

    EXPOSE 3000
    CMD ["node", "server.mjs"]

The npm package ships native binaries per platform (linux-x64, linux-arm64, darwin-x64, darwin-arm64, win32-x64). Match the image’s architecture to the runtime.
Rust:

    FROM rust:1.75 AS builder
    WORKDIR /app
    COPY . .
    RUN cargo build --release --features files,semantic

    FROM debian:bookworm-slim
    RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
    COPY --from=builder /app/target/release/myapp /usr/local/bin/myapp
    COPY docs/ /app/docs/
    WORKDIR /app

    EXPOSE 3000
    CMD ["myapp"]

Around 80 MB with semantic enabled. Strip symbols
(RUSTFLAGS="-C strip=symbols") for closer to 30 MB.
Logging the report
The report on every BuiltContext carries structured fields that
slot into any tracing or logging stack.
Python:

    import logging

    import redhop

    log = logging.getLogger("redhop")

    @app.post("/ask")
    def ask(req: Ask):
        ctx = app.state.doc.context(req.question)
        log.info("retrieval", extra={
            "query": req.question,
            "auto_decision": ctx.report.auto_decision,
            "n_input_chunks": ctx.report.n_input_chunks,
            "n_selected": ctx.report.n_selected,
            "total_tokens": ctx.report.total_tokens,
            "retained_evidence_ratio": ctx.report.retained_evidence_ratio,
            "n_citations": len(ctx.citations),
        })
        return {"prompt": ctx.text(), "citations": ctx.citations}

Node.js:

    app.post("/ask", (req, res) => {
      const ctx = doc.context(req.body.question);
      console.log(JSON.stringify({
        msg: "retrieval",
        query: req.body.question,
        autoDecision: ctx.report.autoDecision,
        totalTokens: ctx.report.totalTokens,
        retainedEvidenceRatio: ctx.report.retainedEvidenceRatio,
        nCitations: ctx.citations.length,
      }));
      res.json({ prompt: ctx.text, citations: ctx.citations });
    });

Worth alerting on:

- An empty citations list. The retriever found nothing, usually a vocabulary mismatch between the query and the corpus.
- retained_evidence_ratio dropping. The pruner is dropping more than before, often because the corpus shape has shifted.
- auto_decision flipping its passthrough/prune mix. The input contexts have changed in size or density.
For a pre-flight check without assembling the context,
doc.analyze(query) returns the same report shape.
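The alert conditions listed above can be checked by comparing a recent window of logged records against a baseline window. The sketch below assumes each record is a dict with the same keys as the Python log call (n_citations, retained_evidence_ratio, auto_decision) and that auto_decision takes the string values "prune" and "passthrough". The thresholds and the `drift_alerts` name are illustrative assumptions, not RedHop defaults.

```python
from statistics import mean


def prune_share(records: list[dict]) -> float:
    """Fraction of records whose auto_decision is 'prune' (assumed value)."""
    return mean(1.0 if r["auto_decision"] == "prune" else 0.0 for r in records)


def drift_alerts(
    baseline: list[dict],
    recent: list[dict],
    ratio_drop: float = 0.10,   # assumed tolerance for the mean ratio falling
    mix_shift: float = 0.20,    # assumed tolerance for the prune share moving
) -> list[str]:
    """Return the names of the alert conditions that fire for the recent window."""
    alerts = []
    if any(r["n_citations"] == 0 for r in recent):
        alerts.append("empty_citations")
    base_ratio = mean(r["retained_evidence_ratio"] for r in baseline)
    recent_ratio = mean(r["retained_evidence_ratio"] for r in recent)
    if recent_ratio < base_ratio - ratio_drop:
        alerts.append("retained_evidence_ratio_drop")
    if abs(prune_share(recent) - prune_share(baseline)) > mix_shift:
        alerts.append("auto_decision_mix_shift")
    return alerts
```

In practice the two windows would come from your log store, for example last week versus the last hour, and the tolerances should be tuned against what is normal for your corpus.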
What to know going in
The native addon means RedHop will not run on Cloudflare Workers or Vercel Edge. The Node guide covers what to use instead in those environments: Node.js library for RAG.
Cross-encoder rerank adds five to ten times the query latency. Enable it only when synonym mismatch is the measured failure mode on your corpus, not by default.
Multi-node scaling: each node holds its own indexed Document. If
the corpus exceeds single-node memory, that’s past RedHop’s sweet
spot. Switch to a vector database (LanceDB, Qdrant, Pinecone) and
keep RedHop as the context assembly layer if you still want that
piece.
Tested against RedHop 0.3.x.