
Build a RAG app with RedHop

You’re going to build a small program that loads a PDF, takes a question, and sends an LLM the context it needs to answer well. Three calls do the work. The whole script is around thirty lines.

You’ll need Python 3.9+, Node 18+, or Rust 1.75+; an API key for any LLM provider (OpenAI in the examples below); and a PDF to ask questions about. A plain-text or Markdown file works too, since the code path is the same.

```shell
pip install redhop openai
```

Set your API key in the shell:

```shell
export OPENAI_API_KEY="sk-..."
```

Load the document:

```python
import redhop

doc = redhop.Document.from_file("contract.pdf")
print(f"indexed {doc.n_chunks} chunks across {doc.n_files} file(s)")
```

`from_file` parses the file, splits it into sentence-aware chunks, and builds an in-memory BM25 index. A 50-page PDF is ready in a millisecond or two, and there is no vector database to provision.
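To see why a keyword index needs no external database, here is a small standard-library sketch of the same two ideas: grouping whole sentences into chunks, then scoring chunks with Okapi BM25. This is not RedHop’s implementation. The regex sentence splitter, the `max_chars` chunk size, and the BM25 parameters `k1=1.5` and `b=0.75` are illustrative assumptions.

```python
import math
import re
from collections import Counter


def sentence_chunks(text: str, max_chars: int = 400) -> list[str]:
    """Group whole sentences into chunks of at most ~max_chars (naive regex splitter)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)  # close the chunk rather than split a sentence
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    """In-memory Okapi BM25 over a list of chunks (illustrative only)."""

    def __init__(self, chunks: list[str], k1: float = 1.5, b: float = 0.75):
        self.chunks = chunks
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(c)) for c in chunks]  # term frequencies per chunk
        self.lens = [sum(tf.values()) for tf in self.tfs]
        self.avgdl = sum(self.lens) / len(self.lens) if chunks else 0.0
        df = Counter(term for tf in self.tfs for term in tf)  # document frequencies
        n = len(chunks)
        # The +1 inside the log keeps idf non-negative for very common terms.
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}

    def score(self, query: str, i: int) -> float:
        tf, dl = self.tfs[i], self.lens[i]
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
                total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def search(self, query: str, k: int = 3) -> list[tuple[float, str]]:
        scored = [(self.score(query, i), c) for i, c in enumerate(self.chunks)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [pair for pair in scored[:k] if pair[0] > 0]  # drop non-matching chunks


text = (
    "The governing law of this Agreement is the law of the State of Delaware. "
    "Payment is due within thirty days. "
    "Either party may terminate with notice."
)
chunks = sentence_chunks(text, max_chars=80)
index = BM25Index(chunks)
results = index.search("governing law")
for score, chunk in results:
    print(f"{score:.2f}  {chunk}")
```

The whole index is a few dictionaries of counts, which is why it can live in memory and be rebuilt cheaply.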

Ask a question:

```python
ctx = doc.context("What is the governing law of this contract?")
print(ctx.text())
```

`context()` runs retrieval, budgets the result against the model’s prompt window, and returns a context object whose `text()` method gives the assembled string. Here that string is around a kilobyte of relevant clauses rather than the whole 50-page document.
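One way to picture budgeting is greedy packing: walk the retrieved chunks in rank order and keep each one that still fits in the token budget. The sketch below assumes a whitespace word count as a stand-in for a real tokenizer and a hypothetical `budget_tokens` value; it is not a description of RedHop’s actual budgeting logic.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def pack_context(ranked_chunks: list[str], budget_tokens: int, separator: str = "\n\n") -> str:
    """Keep the highest-ranked chunks whose combined size fits within budget_tokens."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # skip chunks that would overflow; a smaller one may still fit
        kept.append(chunk)
        used += cost
    return separator.join(kept)


ranked = [
    "Section 12.1 Governing Law. This Agreement is governed by Delaware law.",
    "Section 12.2 Venue. Disputes are heard in the courts of Wilmington.",
    "Section 3 Payment. Invoices are payable within thirty days of receipt.",
]
context = pack_context(ranked, budget_tokens=25)
print(context)
```

With a 25-token budget the first two chunks (11 words each) fit and the third is dropped, so the prompt stays small even when retrieval returns more than the model needs.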

```python
for c in ctx.citations:
    print(f" {c['source']} p{c['page']} {c['heading']}")
print()
print(ctx.report)
```

`ctx.citations` is a list with one entry per chunk that made it into the context. Each entry carries `source`, `page`, `heading`, `line`, and the raw chunk text, so you can render them however your UI needs.
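Because each citation is a plain record, rendering it is ordinary string formatting. The sketch below uses hand-written sample records shaped like the fields listed above; the sample values are invented, and the key name `text` for the raw chunk text is an assumption.

```python
# Hypothetical sample records; the "text" key name for the raw chunk text is assumed.
citations = [
    {"source": "contract.pdf", "page": 14, "heading": "12.1 Governing Law", "line": 3,
     "text": "This Agreement is governed by the laws of the State of Delaware."},
    {"source": "contract.pdf", "page": 14, "heading": "12.2 Venue", "line": 9,
     "text": "Disputes shall be heard in the courts of Wilmington."},
]


def render_footnotes(citations: list[dict], max_quote: int = 40) -> list[str]:
    """Number each citation and append a quote truncated to max_quote characters."""
    lines = []
    for n, c in enumerate(citations, start=1):
        quote = c["text"]
        if len(quote) > max_quote:
            quote = quote[: max_quote - 3].rstrip() + "..."
        lines.append(
            f"[{n}] {c['source']}, p. {c['page']}, {c['heading']} (line {c['line']}): \"{quote}\""
        )
    return lines


footnotes = render_footnotes(citations)
for line in footnotes:
    print(line)
```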

`ctx.report` explains the assembly decision. For a small, clean context it passes the retrieved chunks through unchanged; for a large, diluted one it prunes. Either way, the report shows which path it took:

```text
RedHop Decision Report
══════════════════════
Decision: Auto → passthrough (small context, no intervention needed)
Why:
- 1,240 tokens, below the dilution gate (1,500 tokens)
- pruning a small clean context risks dropping reasoning evidence
Result:
- kept all 8 retrieved chunks
- evidence retained 100%, second-hop links preserved
```
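Read as a rule, the sample report describes a gate on context size. The sketch below reproduces only what the report states: below the 1,500-token dilution gate, every chunk passes through. Above the gate it applies a hypothetical pruning rule (keep chunks scoring at least `keep_ratio` of the top score) as a stand-in for RedHop’s real pruning, which this page does not describe.

```python
from dataclasses import dataclass

DILUTION_GATE_TOKENS = 1_500  # value taken from the sample report


@dataclass
class Chunk:
    text: str
    tokens: int
    score: float


def assemble(chunks: list[Chunk], keep_ratio: float = 0.5) -> tuple[str, list[Chunk]]:
    """Return (decision, kept_chunks). keep_ratio is a hypothetical pruning knob."""
    total = sum(c.tokens for c in chunks)
    if total < DILUTION_GATE_TOKENS:
        # Small context: pruning risks dropping evidence, so keep everything.
        return "passthrough", list(chunks)
    top = max(c.score for c in chunks)
    kept = [c for c in chunks if c.score >= keep_ratio * top]
    return "prune", kept


# 8 chunks x 155 tokens = 1,240 tokens, matching the sample report.
small = [Chunk(f"clause {i}", tokens=155, score=1.0 / (i + 1)) for i in range(8)]
decision, kept = assemble(small)
print(decision, len(kept), "chunks,", sum(c.tokens for c in small), "tokens")

# 20 chunks x 155 tokens = 3,100 tokens, above the gate.
large = [Chunk(f"clause {i}", tokens=155, score=1.0 / (i + 1)) for i in range(20)]
large_decision, large_kept = assemble(large)
print(large_decision, len(large_kept), "chunks")
```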
Send the context and the question to your model:

```python
from openai import OpenAI

query = "What is the governing law of this contract?"
response = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"{ctx.text()}\n\nQuestion: {query}",
    }],
)
print(response.choices[0].message.content)
```

The prompt string is yours to send anywhere: OpenAI, Anthropic, a local Ollama instance, or your own model. RedHop never calls the LLM itself, which keeps the library single-purpose and lets you switch providers without touching retrieval.
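One way to keep that separation in your own code is to treat the model as any function from a prompt string to a completion. The `Complete` type, `answer`, and `echo_model` below are hypothetical helpers, not RedHop or provider APIs; in practice you would wrap the OpenAI call, an Anthropic client, or an Ollama request behind the same signature.

```python
from typing import Callable

# Any function that turns a prompt into a completion counts as a "model".
Complete = Callable[[str], str]


def build_prompt(context_text: str, question: str) -> str:
    # Same layout as the OpenAI example: context first, then the question.
    return f"{context_text}\n\nQuestion: {question}"


def answer(context_text: str, question: str, complete: Complete) -> str:
    """The provider is injected, so swapping it changes nothing else."""
    return complete(build_prompt(context_text, question))


def echo_model(prompt: str) -> str:
    # Offline stand-in for a real provider; echoes the last prompt line.
    return f"[{len(prompt)} chars received] " + prompt.splitlines()[-1]


# A real provider would look like:
# openai_model = lambda p: OpenAI().chat.completions.create(
#     model="gpt-4o-mini", messages=[{"role": "user", "content": p}]
# ).choices[0].message.content

reply = answer("12.1 Governing Law. Delaware law applies.", "What is the governing law?", echo_model)
print(reply)
```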

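The complete script, `rag.py`:

```python
import redhop
from openai import OpenAI

QUERY = "What is the governing law of this contract?"

# Load and index the document, then assemble context for the question.
doc = redhop.Document.from_file("contract.pdf")
ctx = doc.context(QUERY)

# Show where the context came from.
for c in ctx.citations:
    print(f" {c['source']} p{c['page']} {c['heading']}")

# Send the assembled context plus the question to the model.
response = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"{ctx.text()}\n\nQuestion: {QUERY}"}],
)
print(response.choices[0].message.content)

# Explain how the context was assembled.
print()
print(ctx.report)
```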

Run it:

```shell
python rag.py
```

To index a folder of files instead of a single document, use `Document.from_folder("./docs", persist=True)`. The persistent index makes subsequent loads sub-second on large corpora.
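A persistent index pays off because later runs can skip re-parsing and re-indexing when nothing has changed. The standard-library sketch below shows that general idea: it fingerprints the folder by each file’s path, size, and modification time, and reuses a pickled index when the fingerprint matches. The cache location, the fingerprint scheme, and `build_index` are hypothetical; RedHop’s on-disk format is not described on this page.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def folder_fingerprint(folder: Path) -> str:
    """Hash each file's relative path, size, and mtime so edits change the result."""
    h = hashlib.sha256()
    for path in sorted(p for p in folder.rglob("*") if p.is_file()):
        st = path.stat()
        h.update(f"{path.relative_to(folder)}|{st.st_size}|{st.st_mtime_ns}".encode())
    return h.hexdigest()


def load_or_build(folder: Path, cache_path: Path,
                  build_index: Callable[[Path], Any]) -> tuple[Any, bool]:
    """Return (index, from_cache). Rebuilds only when the fingerprint changed."""
    fingerprint = folder_fingerprint(folder)
    if cache_path.exists():
        # pickle is acceptable here only because we wrote this cache ourselves.
        with cache_path.open("rb") as f:
            cached = pickle.load(f)
        if cached["fingerprint"] == fingerprint:
            return cached["index"], True
    index = build_index(folder)
    with cache_path.open("wb") as f:
        pickle.dump({"fingerprint": fingerprint, "index": index}, f)
    return index, False


def build_index(folder: Path) -> dict[str, str]:
    # Hypothetical stand-in for expensive parsing and indexing.
    return {p.name: p.read_text() for p in folder.glob("*.txt")}


with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp, "docs")
    docs.mkdir()
    (docs / "a.txt").write_text("Governing law: Delaware.")
    cache = Path(tmp, "index.pkl")  # kept outside docs/ so it doesn't alter the fingerprint
    _, first = load_or_build(docs, cache, build_index)   # builds
    _, second = load_or_build(docs, cache, build_index)  # reuses the cache
    (docs / "a.txt").write_text("Governing law: the State of New York.")
    rebuilt, third = load_or_build(docs, cache, build_index)  # file changed, rebuilds
    print(first, second, third)
```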

For queries the keyword tier misses, such as HR FAQs or support knowledge bases where users phrase things differently from the docs, switch to hybrid retrieval by passing `retrieval="hybrid", model="bge-small"`. Choosing a configuration explains how to pick between the tiers.
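Hybrid retrieval combines a keyword ranking with an embedding-based one. A common, simple way to merge two rankings is reciprocal rank fusion (RRF), sketched below with hard-coded rankings standing in for real BM25 and `bge-small` results. Whether RedHop fuses its tiers this way is not stated here, and `k=60` is the conventional RRF constant rather than a RedHop setting.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk ids; an item ranked high in any list rises."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


# Hypothetical results for "how many days off do I get?"
keyword_ranking = ["holiday-calendar", "leave-policy", "expenses"]
vector_ranking = ["leave-policy", "pto-accrual", "holiday-calendar"]
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

`leave-policy` ends up first because both rankers place it near the top, and `pto-accrual`, which only the embedding ranker found, still outranks `expenses`.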

To wrap this behind an HTTP service, see Deploy to production.

For more patterns by use case, see Examples.

Tested against RedHop 0.3.x.