Deploy RedHop to production

In-process retrieval means there is no separate service to operate, but the production decisions don’t go away. Once you move past a script that re-indexes on every run, a handful of choices matter: where the indexed Document lives across requests, whether the index persists to disk, and which report fields you send to your observability stack.

One Document, many requests

The biggest gain over the tutorial pattern is loading the document once at startup instead of once per request. Indexing a single PDF takes around a millisecond; from_folder over thousands of files takes seconds. Either way, paying that cost on every request is waste. Hold the indexed Document in shared state.

from fastapi import FastAPI
from pydantic import BaseModel
from contextlib import asynccontextmanager
import redhop

@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.doc = redhop.Document.from_file("contract.pdf")
    yield

app = FastAPI(lifespan=lifespan)

class Ask(BaseModel):
    question: str

@app.post("/ask")
def ask(req: Ask):
    ctx = app.state.doc.context(req.question)
    return {
        "prompt": ctx.text(),
        "citations": ctx.citations,
        "report": redhop.report_to_dict(ctx.report),
    }

Run it with uvicorn main:app --workers 4. Each worker holds its own Document, so memory cost is per worker. Warm queries land in 1 to 6 ms.

import express from "express";
import { Document } from "redhop";

const app = express();
app.use(express.json());

const doc = Document.fromFile("contract.pdf");
console.log(`indexed ${doc.chunkCount} chunks`);

app.post("/ask", (req, res) => {
  const ctx = doc.context(req.body.question);
  res.json({
    prompt: ctx.text,
    citations: ctx.citations,
    report: {
      autoDecision: ctx.report.autoDecision,
      totalTokens: ctx.report.totalTokens,
      retainedEvidenceRatio: ctx.report.retainedEvidenceRatio,
    },
  });
});

app.listen(3000);

In a clustered setup (PM2 or the Node cluster module), each worker holds its own Document. Memory cost is per worker.

use axum::{Router, routing::post, extract::{State, Json}, response::Json as JsonResp};
use serde::{Deserialize, Serialize};
use redhop::read_file;
use std::sync::Arc;
use parking_lot::Mutex;

#[derive(Clone)]
struct AppState {
    doc: Arc<Mutex<redhop::Document>>,
}

#[derive(Deserialize)]
struct Ask { question: String }

#[derive(Serialize)]
struct Answer { prompt: String, citations: Vec<serde_json::Value> }

async fn ask(State(s): State<AppState>, Json(req): Json<Ask>) -> JsonResp<Answer> {
    let mut doc = s.doc.lock();
    let ctx = doc.context(&req.question).unwrap();
    JsonResp(Answer {
        prompt: ctx.text().to_string(),
        citations: serde_json::to_value(&ctx.citations).unwrap()
            .as_array().cloned().unwrap_or_default(),
    })
}

#[tokio::main]
async fn main() {
    let doc = read_file("contract.pdf").expect("load failed");
    let state = AppState { doc: Arc::new(Mutex::new(doc)) };
    let app = Router::new().route("/ask", post(ask)).with_state(state);
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}

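For high-concurrency reads, use an RwLock instead of a Mutex so that multiple queries can run in parallel against the same index. This only helps if the query path can work through a shared read guard; the handler above takes a mutable one.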

Persist the index

For a single file, re-indexing on startup is effectively free. For from_folder over thousands of files it isn’t. Pass persist=True and subsequent loads pay only for the files that changed since the last run.

doc = redhop.Document.from_folder("./docs", options=redhop.FolderOptions(persist=True))

# or a custom location for the cache:
doc = redhop.Document.from_folder("./docs", options=redhop.FolderOptions(persist=True, index_dir="/var/cache/redhop"))

const doc = Document.fromFolder("./docs", { persist: true });

const doc = Document.fromFolder("./docs", {
  persist: true,
  indexDir: "/var/cache/redhop",
});

use redhop::{read_folder_with, FolderOptions};

let mut doc = read_folder_with("./docs", &FolderOptions {
    persist: true,
    index_dir: Some("/var/cache/redhop".into()),
    ..Default::default()
})?;

The persisted layout stores a content fingerprint per file. On reload, RedHop diffs the fingerprints against the current files: unchanged files skip parsing and chunking, new and modified files re-index, deleted files drop. A 5,000-file repo where three files changed reloads in under a second.
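
The reload behavior described above can be pictured as a diff between two fingerprint maps. The following is a minimal sketch of that idea, not RedHop's actual implementation: the function names, the choice of SHA-256, and the dictionary layout are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def fingerprint_folder(root: str) -> dict[str, str]:
    """Map each file's relative path to a SHA-256 hash of its contents."""
    base = Path(root)
    return {
        str(p.relative_to(base)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }


def diff_fingerprints(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify files by comparing stored fingerprints with current ones."""
    return {
        # Same path, same hash: parsing and chunking can be skipped.
        "unchanged": sorted(k for k in new if old.get(k) == new[k]),
        # New path, or same path with a different hash: re-index.
        "reindex": sorted(k for k in new if old.get(k) != new[k]),
        # Path present in the stored index but gone from disk: drop.
        "dropped": sorted(k for k in old if k not in new),
    }
```

Because the sketch hashes content rather than comparing modification times, a file that was touched but not edited still counts as unchanged.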

A persistent index assumes a single writer. Multiple processes writing to the same .redhop/ directory will race. For multi-process deployments, give each worker its own index_dir, or build the index once in a deploy step and mount it read-only at runtime.

Hot-reload when files change

For docs or knowledge-base content that updates outside the service, watch the folder and rebuild the Document on change. Assigning the new Document to the shared reference is atomic: in-flight requests use the old one, new requests see the new one. No locking required.

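import asyncio
from watchfiles import awatch
import redhop

def load_docs():
    # With persist=True, only files that changed since the last run are re-indexed.
    return redhop.Document.from_folder("./docs", options=redhop.FolderOptions(persist=True))

doc = load_docs()  # initial load; store it on app.state.doc at startup

async def watcher(app):
    # awatch yields one batch of changes per burst of filesystem events.
    async for changes in awatch("./docs"):
        # from_folder is synchronous, so run it off the event loop thread.
        app.state.doc = await asyncio.to_thread(load_docs)
        print(f"reloaded after {len(changes)} change(s)")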

import { watch } from "node:fs/promises";
import { Document } from "redhop";

let doc = Document.fromFolder("./docs", { persist: true });

for await (const _evt of watch("./docs", { recursive: true })) {
  doc = Document.fromFolder("./docs", { persist: true });
}
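
The swap stays safe only as long as each request reads the shared reference once and keeps using that local value. A handler that reads the shared attribute twice could see two different Documents if a reload lands in between. The sketch below shows the pattern with stand-ins: FakeDocument and State are hypothetical placeholders for illustration, not RedHop or FastAPI types.

```python
class FakeDocument:
    """Hypothetical stand-in for an indexed Document; not a RedHop type."""

    def __init__(self, version: int):
        self.version = version

    def context(self, question: str) -> str:
        return f"v{self.version}: {question}"


class State:
    """Plays the role of app.state: one attribute holding the current Document."""

    def __init__(self, doc: FakeDocument):
        self.doc = doc


def handle(state: State, question: str, reload_midway: bool = False) -> tuple[str, int]:
    doc = state.doc  # read the shared reference exactly once
    if reload_midway:
        # Simulates the watcher swapping in a rebuilt Document mid-request.
        state.doc = FakeDocument(doc.version + 1)
    # Both values come from the same Document, even after the swap.
    return doc.context(question), doc.version
```

A request that started before the reload finishes against the old Document, and the next request picks up the new one.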

Knobs that matter in production

The defaults are calibrated against measured failure modes, and the evidence layer in the repo documents the why for each one. A few knobs are worth knowing about in production:

Knob	Default	When to change
`chunk_size`	128	Larger (256 to 512) for prose-heavy docs where wider context helps. Smaller (64) for very dense reference material. Index-time, so changes require re-indexing.
`chunk_overlap`	1 sentence	Higher (2 or 3) if retrieval misses chunks where the answer straddles a boundary.
`token_budget`	8192	Match the LLM’s prompt budget. Smaller (2,000 to 4,000) for cost-sensitive workloads.
`candidate_k`	20	More candidates (40 to 50) for diverse corpora.
`retrieval`	`"lexical"`	`"hybrid"` only when keyword search misses. See Choosing a configuration.
`rerank`	none	`"cross-encoder"` only when paraphrase mismatch is the actual failure mode. Adds five to ten times the query latency.

The rule that always applies: don’t recreate the Document per request.

Docker

FROM python:3.12-slim

WORKDIR /app
RUN pip install --no-cache-dir redhop fastapi uvicorn openai

COPY app.py .
COPY docs/ ./docs/

# Bake the persistent index into the image so cold starts hit the
# cached state instead of indexing from scratch.
RUN python -c "import redhop; redhop.Document.from_folder('./docs', options=redhop.FolderOptions(persist=True))"

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

The image is around 150 MB without the semantic tier. With it, add roughly 80 MB for the bge-small ONNX model. Either bake the model into the image or mount a persistent volume to avoid re-downloading it on each cold start.

FROM node:20-slim

WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev

COPY . .

RUN node -e "import('redhop').then(({Document}) => Document.fromFolder('./docs', {persist: true}))"

EXPOSE 3000
CMD ["node", "server.mjs"]

The npm package ships native binaries per platform (linux-x64, linux-arm64, darwin-x64, darwin-arm64, win32-x64). Match the image’s architecture to the runtime.

FROM rust:1.75 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release --features files,semantic

FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/myapp /usr/local/bin/myapp
COPY docs/ /app/docs/
WORKDIR /app

EXPOSE 3000
CMD ["myapp"]

The result is around 80 MB with semantic enabled. Strip symbols (RUSTFLAGS="-C strip=symbols") to get closer to 30 MB.

Logging the report

The report on every BuiltContext carries structured fields that slot into any tracing or logging stack.

import logging, redhop

log = logging.getLogger("redhop")

@app.post("/ask")
def ask(req: Ask):
    ctx = app.state.doc.context(req.question)
    log.info("retrieval", extra={
        "query": req.question,
        "auto_decision": ctx.report.auto_decision,
        "n_input_chunks": ctx.report.n_input_chunks,
        "n_selected": ctx.report.n_selected,
        "total_tokens": ctx.report.total_tokens,
        "retained_evidence_ratio": ctx.report.retained_evidence_ratio,
        "n_citations": len(ctx.citations),
    })
    return {"prompt": ctx.text(), "citations": ctx.citations}

app.post("/ask", (req, res) => {
  const ctx = doc.context(req.body.question);
  console.log(JSON.stringify({
    msg: "retrieval",
    query: req.body.question,
    autoDecision: ctx.report.autoDecision,
    totalTokens: ctx.report.totalTokens,
    retainedEvidenceRatio: ctx.report.retainedEvidenceRatio,
    nCitations: ctx.citations.length,
  }));
  res.json({ prompt: ctx.text, citations: ctx.citations });
});
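
One detail worth knowing on the Python side: fields passed through extra become attributes on the LogRecord, and the standard library's default formatter does not print them. Below is a minimal sketch of a JSON formatter that does, using only the standard library. The class name JsonFormatter, the helper configure_logging, and the field list are assumptions for illustration, not part of RedHop.

```python
import json
import logging

# Fields the Python handler above passes through `extra`.
REPORT_FIELDS = (
    "query", "auto_decision", "n_input_chunks", "n_selected",
    "total_tokens", "retained_evidence_ratio", "n_citations",
)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line, including the report fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "msg": record.getMessage()}
        for name in REPORT_FIELDS:
            # Keys passed via `extra` are set as attributes on the record.
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        # default=str keeps the line serializable if a field is not plain JSON.
        return json.dumps(payload, default=str)


def configure_logging() -> logging.Logger:
    """Attach the JSON formatter to the "redhop" logger used by the handler."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("redhop")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    return log
```

Calling configure_logging() once at startup should be enough for the log.info("retrieval", extra={...}) call to emit one JSON object per request.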

Worth alerting on:

An empty citations list. The retriever found nothing, usually a vocabulary mismatch between the query and the corpus.
retained_evidence_ratio dropping. The pruner is dropping more than before, often because the corpus shape has shifted.
auto_decision flipping its passthrough/prune mix. The input contexts have changed in size or density.
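
The three signals above can be checked over a rolling window of logged reports. This is a rough sketch, not a RedHop feature: the RetrievalMonitor class, the window size, the baselines and tolerance, and the assumption that auto_decision is logged as the string "passthrough" or "prune" are all illustrative choices to tune against your own traffic.

```python
from collections import deque


class RetrievalMonitor:
    """Rolling checks over logged report fields (hypothetical helper)."""

    def __init__(self, window: int = 200, baseline_ratio: float = 0.8,
                 baseline_prune_share: float = 0.5, tolerance: float = 0.2):
        self.records = deque(maxlen=window)
        self.baseline_ratio = baseline_ratio
        self.baseline_prune_share = baseline_prune_share
        self.tolerance = tolerance

    def observe(self, record: dict) -> list[str]:
        """Add one logged report and return the alerts it triggers."""
        self.records.append(record)
        alerts = []
        # Per-request signal: the retriever found nothing.
        if record["n_citations"] == 0:
            alerts.append("empty_citations")
        # Wait for a full window before judging averages.
        if len(self.records) < self.records.maxlen:
            return alerts
        n = len(self.records)
        avg_ratio = sum(r["retained_evidence_ratio"] for r in self.records) / n
        if avg_ratio < self.baseline_ratio - self.tolerance:
            alerts.append("retained_evidence_ratio_drop")
        prune_share = sum(r["auto_decision"] == "prune" for r in self.records) / n
        if abs(prune_share - self.baseline_prune_share) > self.tolerance:
            alerts.append("auto_decision_mix_shift")
        return alerts
```

In practice the baselines would come from a period of known-good traffic rather than from constructor defaults.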

For a pre-flight check without assembling the context, doc.analyze(query) returns the same report shape.

What to know going in

The native addon means RedHop will not run on Cloudflare Workers or Vercel Edge. The Node guide covers what to use instead in those environments: Node.js library for RAG.

Cross-encoder rerank adds five to ten times the query latency. Enable it only when paraphrase mismatch is the measured failure mode on your corpus, not by default.

Multi-node scaling: each node holds its own indexed Document. If the corpus exceeds single-node memory, that’s past RedHop’s sweet spot. Switch to a vector database (LanceDB, Qdrant, Pinecone) and keep RedHop as the context assembly layer if you still want that piece.

Tested against RedHop 0.3.x.