Build a RAG app with RedHop

You’re going to build a small program that loads a PDF, takes a question, and sends an LLM the context it needs to answer well. Three calls do the work. The whole script is around thirty lines.

You’ll need Python 3.9+, Node 18+, or Rust 1.75+; an API key for any LLM provider (OpenAI in the examples below); and a PDF to ask questions about. A plain-text or Markdown file works too, and the code path is the same.

Install

pip install redhop openai

npm install redhop openai

The Node binding is a native addon, so it doesn’t run on Cloudflare Workers or Vercel Edge. See the Node.js library for RAG guide if your target is an edge runtime.

cargo add redhop --features files,semantic
cargo add tokio --features macros,rt-multi-thread
cargo add async-openai anyhow

The files feature pulls in the document parsers, and the semantic feature adds the optional ONNX embedder for the dense retrieval tier. The lean build with just BM25 omits both.

Set your API key in the shell:

export OPENAI_API_KEY="sk-..."
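If the key is missing, the failure otherwise surfaces only when the LLM call is made. A minimal sketch of a fail-fast check follows; require_api_key is a hypothetical helper written for this guide, not part of RedHop or the OpenAI SDK.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the value of the environment variable `name`, or raise if it is unset or blank."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the script")
    return key
```

Calling require_api_key() at the top of the script stops it before any document is loaded.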

Load a document

import redhop

doc = redhop.Document.from_file("contract.pdf")
print(f"indexed {doc.n_chunks} chunks across {doc.n_files} file(s)")

import { Document } from "redhop";

const doc = Document.fromFile("contract.pdf");
console.log(`indexed ${doc.chunkCount} chunks`);

use redhop::read_file;

let mut doc = read_file("contract.pdf")?;
println!("indexed {} chunks", doc.n_chunks());

from_file handles parsing, sentence-aware chunking, and an in-memory BM25 index. A 50-page PDF is ready in a millisecond or two. There is no vector database to provision.
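To make "sentence-aware chunking" and "BM25 index" concrete, here is a toy illustration of both ideas in plain Python. It is not RedHop's implementation: the function names (chunk_sentences, tokenize, bm25_scores), the regex sentence splitter, and the defaults for max_chars, k1, and b are all assumptions chosen for the sketch.

```python
import math
import re
from collections import Counter


def chunk_sentences(text, max_chars=200):
    """Pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def tokenize(text):
    """Lowercase and split into alphanumeric terms (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score every chunk against the query with Okapi BM25."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: how many chunks contain each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Because this toy tokenizer does no stemming, a query for "law" will not match a chunk that only says "laws". Gaps like that are what a dense retrieval tier is meant to cover.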

Ask a question

ctx = doc.context("What is the governing law of this contract?")
print(ctx.text())

const ctx = doc.context("What is the governing law of this contract?");
console.log(ctx.text);

let ctx = doc.context("What is the governing law of this contract?")?;
println!("{}", ctx.text());

context() runs retrieval, budgets the result against the model’s prompt window, and returns a context object whose text is the assembled string. The output is around a kilobyte of relevant clauses rather than the whole 50-page document.
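Budgeting against a prompt window can be pictured as a greedy pass over the ranked chunks. The sketch below is only an illustration of that idea, not RedHop's algorithm: fit_to_budget and rough_token_count are hypothetical names, and the four-characters-per-token estimate stands in for a real tokenizer.

```python
def rough_token_count(text):
    """Crude stand-in for a real tokenizer: about four characters per token."""
    return max(1, len(text) // 4)


def fit_to_budget(ranked_chunks, max_tokens, count_tokens=rough_token_count):
    """Keep chunks in rank order while they fit within max_tokens.

    Returns the joined text and the number of tokens used.
    """
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # too big for the remaining budget; a later, smaller chunk may still fit
        kept.append(chunk)
        used += cost
    return "\n\n".join(kept), used
```

Skipping an oversized chunk instead of stopping at it is a design choice: it fills the budget more fully, at the cost of possibly including a lower-ranked chunk ahead of a higher-ranked one that did not fit.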

Show the sources

for c in ctx.citations:
    print(f"  {c['source']} p{c['page']}  {c['heading']}")

print()
print(ctx.report)

for (const c of ctx.citations) {
  console.log(`  ${c.source} p${c.page ?? "?"}  ${c.heading ?? ""}`);
}

console.log();
console.log(ctx.report.rendered);

for c in &ctx.citations {
    println!("  {} p{:?}  {:?}", c.source, c.page, c.heading);
}

println!();
println!("{}", ctx.report.rendered);

ctx.citations is a list with one entry per chunk that made it into the context. The fields are source, page, heading, line, and the raw text. Render them however the UI wants.
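The Node and Rust snippets above allow for a missing page or heading, while the Python loop prints None in that case. A small formatting helper can smooth that over. This is a sketch that assumes each citation behaves like a dict with the source, page, and heading keys used above; format_citation is a hypothetical name, not a RedHop function.

```python
def format_citation(c):
    """Render one citation as 'source p<page>  heading', tolerating missing fields."""
    page = c.get("page")
    heading = c.get("heading")
    page_part = f"p{page}" if page is not None else "p?"
    return f"{c['source']} {page_part}  {heading or ''}".rstrip()
```

With it, the loop body becomes print("  " + format_citation(c)).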

The report on the same object describes the assembly decision. For a small clean context it passes the input through unchanged. For a large diluted one it prunes. Either way you see which path it took:

RedHop Decision Report
══════════════════════

Decision: Auto → passthrough (small context, no intervention needed)

  Why:
    - 1,240 tokens, below the dilution gate (1,500 tokens)
    - pruning a small clean context risks dropping reasoning evidence
  Result:
    - kept all 8 retrieved chunks
    - evidence retained 100%, second-hop links preserved

Call the LLM

from openai import OpenAI

query = "What is the governing law of this contract?"
response = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"{ctx.text()}\n\nQuestion: {query}",
    }],
)
print(response.choices[0].message.content)

import OpenAI from "openai";

const query = "What is the governing law of this contract?";
const response = await new OpenAI().chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: `${ctx.text}\n\nQuestion: ${query}` }],
});
console.log(response.choices[0].message.content);

use async_openai::{Client, types::{
    CreateChatCompletionRequestArgs,
    ChatCompletionRequestUserMessageArgs,
}};

let query = "What is the governing law of this contract?";
let req = CreateChatCompletionRequestArgs::default()
    .model("gpt-4o-mini")
    .messages([ChatCompletionRequestUserMessageArgs::default()
        .content(format!("{}\n\nQuestion: {}", ctx.text(), query))
        .build()?
        .into()])
    .build()?;

let response = Client::new().chat().create(req).await?;
println!("{}", response.choices[0].message.content.as_deref().unwrap_or(""));

The prompt string is yours to send anywhere: OpenAI, Anthropic, a local Ollama, or your own model. RedHop never makes the LLM call itself, which keeps the library single-purpose and lets you change providers without touching retrieval.
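One way to keep the provider swap cheap is to build the message list in one place and hand it to whichever client you use. The sketch below assumes the role/content message shape used by the OpenAI examples above; build_messages is a hypothetical helper, and providers differ on where a system prompt belongs, so check the target provider's documentation before reusing the list as-is.

```python
def build_messages(context_text, question, system=None):
    """Return chat messages in the role/content shape used by the OpenAI examples above."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({
        "role": "user",
        "content": f"{context_text}\n\nQuestion: {question}",
    })
    return messages
```

In the Python example above, messages=build_messages(ctx.text(), query) would replace the inline list.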

The whole script

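import redhop
from openai import OpenAI

QUERY = "What is the governing law of this contract?"

# Parse, chunk, and index the document in memory.
doc = redhop.Document.from_file("contract.pdf")

# Retrieve the relevant chunks and assemble them into a prompt-sized context.
ctx = doc.context(QUERY)
# List the chunks that made it into the context.
for c in ctx.citations:
    print(f"  {c['source']} p{c['page']}  {c['heading']}")

# Send the context and the question to the LLM; RedHop does not make this call.
response = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"{ctx.text()}\n\nQuestion: {QUERY}"}],
)
print(response.choices[0].message.content)

# Print the report describing how the context was assembled.
print()
print(ctx.report)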

python rag.py

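import { Document } from "redhop";
import OpenAI from "openai";

const QUERY = "What is the governing law of this contract?";

// Parse, chunk, and index the document in memory.
const doc = Document.fromFile("contract.pdf");

// Retrieve the relevant chunks and assemble them into a prompt-sized context.
const ctx = doc.context(QUERY);
// List the chunks that made it into the context.
for (const c of ctx.citations) {
  console.log(`  ${c.source} p${c.page ?? "?"}  ${c.heading ?? ""}`);
}

// Send the context and the question to the LLM; RedHop does not make this call.
const response = await new OpenAI().chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: `${ctx.text}\n\nQuestion: ${QUERY}` }],
});
console.log(response.choices[0].message.content);

// Print the report describing how the context was assembled.
console.log();
console.log(ctx.report.rendered);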

node rag.mjs

use redhop::read_file;
use async_openai::{Client, types::{
    CreateChatCompletionRequestArgs,
    ChatCompletionRequestUserMessageArgs,
}};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let query = "What is the governing law of this contract?";

    // Parse, chunk, and index the document in memory.
    let mut doc = read_file("contract.pdf")?;

    // Retrieve the relevant chunks and assemble them into a prompt-sized context.
    let ctx = doc.context(query)?;
    // List the chunks that made it into the context.
    for c in &ctx.citations {
        println!("  {} p{:?}  {:?}", c.source, c.page, c.heading);
    }

    // Send the context and the question to the LLM; RedHop does not make this call.
    let req = CreateChatCompletionRequestArgs::default()
        .model("gpt-4o-mini")
        .messages([ChatCompletionRequestUserMessageArgs::default()
            .content(format!("{}\n\nQuestion: {}", ctx.text(), query))
            .build()?.into()])
        .build()?;
    let response = Client::new().chat().create(req).await?;
    println!("{}", response.choices[0].message.content.as_deref().unwrap_or(""));

    // Print the report describing how the context was assembled.
    println!("\n{}", ctx.report.rendered);
    Ok(())
}

cargo run

What to try next

A folder of files instead of a single document: Document.from_folder("./docs", options=FolderOptions(persist=True)). The persistent index makes subsequent loads sub-second on large corpora.

Hybrid retrieval for queries the keyword tier misses, such as HR FAQs or support knowledge bases where users phrase things differently from the docs: pass retrieval="hybrid", model="bge-small". The choice between tiers is covered in Choosing a configuration.

Wrapping this behind an HTTP service: Deploy to production.

More patterns by use case: Examples.

Tested against RedHop 0.3.x.