RedHop: a LlamaIndex alternative for document RAG

If you’re searching for a LlamaIndex alternative, you’re probably hitting one of these walls:

The framework assumes you need a vector store. Even for one PDF, LlamaIndex’s default path is VectorStoreIndex: embed every chunk, store the vectors, query via an embedding model. Most document QA doesn’t need that.
The mental model is its own thing to learn. Indexes, node parsers, query engines, response synthesizers, retrievers, post-processors. To answer one question about a contract.
It’s Python-first. A TypeScript port exists but trails behind, and there is nothing for Rust services.
No visibility into the retrieval decision. When the wrong chunk surfaces, you have to instrument LlamaIndex yourself.

RedHop is a focused alternative: an in-process retrieval + context library that does one thing (turn a document and a question into the right LLM prompt context) and tells you exactly what it kept, dropped, and why.

import redhop

doc = redhop.Document.from_file("contract.pdf")
ctx = doc.context("What is the governing law?")
answer = llm.generate(ctx.text())   # llm: placeholder for whatever LLM client you use

print(ctx.report)   # what was kept, dropped, and why

That’s the whole surface. Three calls. No vector store. No query engine. Python, Node, and Rust over a Rust core, all in-process.

Should you switch from LlamaIndex to RedHop?

The honest answer: it depends on what you’re building.

If you need…	Pick
Document QA with citations and a Decision Report	RedHop
In-process retrieval, no vector store, no infra	RedHop
The same API in Python, Node, and Rust	RedHop
Composable indices (Tree, KeywordTable, mixed)	LlamaIndex
Specialized query engines (sub-question, multi-step, citation)	LlamaIndex
Hosted / managed RAG with a dashboard	LlamaCloud (LlamaIndex’s offering)
Out-of-the-box CUAD on raw 24-word template queries	LlamaIndex (+4 points on the raw template; the gap stems from BM25 boilerplate dilution, and stripping the template helps every system, see the fair-preprocessing footnote below)
LlamaHub ecosystem of loaders / tools / readers	LlamaIndex

LlamaIndex is a framework purpose-built for RAG. RedHop is a library that does the one bounded step: here’s the file, here’s the question, give me the right context with a decision report. If you need LlamaIndex’s composition layer, stay there. If you just need the three-call shape with observability, RedHop is simpler.

The same question, two ways

Same contract.pdf. Same question. RedHop first, then LlamaIndex.

RedHop:

import redhop
from openai import OpenAI

query = "What is the governing law?"

ctx = redhop.Document.from_file("contract.pdf").context(query)
#  parsed, chunked, retrieved, and token-budgeted internally

response = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"{ctx.text()}\n\nQuestion: {query}"}],
)
print(response.choices[0].message.content)

What you stand up: nothing. Point it at the file and ask; parsing, chunking, retrieval, and token-budgeting happen inside, and every call returns a Decision Report explaining what it kept and why.

LlamaIndex:

from llama_index.core import VectorStoreIndex, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.file import PyMuPDFReader
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

query = "What is the governing law?"

Settings.embed_model = OpenAIEmbedding()
Settings.llm = OpenAI(model="gpt-4o-mini")

docs = PyMuPDFReader().load(file_path="contract.pdf")

index = VectorStoreIndex.from_documents(
    docs,
    transformations=[SentenceSplitter(chunk_size=512, chunk_overlap=50)],
)

engine = index.as_query_engine(similarity_top_k=4)
print(engine.query(query))

What you stand up: a node parser, an embedding model, a vector index, and a query engine. Cleaner than LangChain, but still an embed-and-index pipeline you own and pay for.

LlamaIndex’s four pieces (node parser, embedding model, vector index, query engine) form a more linear mental model than LangChain’s seven (loader, splitter, embedder, vector store, retriever, prompt, chain), but the result is still an embed-and-index pipeline. RedHop has one concept: document → context. Everything else is an implementation detail.
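
One of those implementation details, token budgeting, is easy to picture. A minimal sketch, assuming chunks arrive already ranked and that a whitespace word count is a good-enough token estimate. Both are assumptions for illustration, and pack_under_budget is a hypothetical helper, not RedHop’s internal logic:

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def pack_under_budget(ranked_chunks: List[str], budget: int) -> Tuple[List[str], List[str]]:
    """Greedily keep chunks in rank order while they fit in the budget.

    Returns (kept, dropped). A chunk that does not fit is dropped and the
    loop continues, so a smaller, lower-ranked chunk may still be kept.
    """
    kept: List[str] = []
    dropped: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost <= budget:
            kept.append(chunk)
            used += cost
        else:
            dropped.append(chunk)
    return kept, dropped


ranked = [
    "This Agreement is governed by the laws of Delaware.",
    "Either party may terminate with thirty days written notice to the other party.",
    "Notices must be sent in writing.",
]
kept, dropped = pack_under_budget(ranked, budget=16)
print("kept:", kept)
print("dropped:", dropped)
```

Returning the dropped list alongside the kept one is the design point: it is what makes a "what was kept, dropped, and why" report possible at all.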

The full head-to-head benchmark (evidence retention + downstream answer quality on CUAD contracts and HotpotQA multi-hop) is on the Comparison page: same documents, same BM25 retriever for fairness, same token budget.
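
For intuition about what an evidence-retention number measures, here is one plausible definition: the share of examples whose gold answer span still appears in the assembled context. This is an assumption about the metric, not the benchmark’s actual scoring code, and the function names are hypothetical:

```python
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wraps don't break matching."""
    return " ".join(text.lower().split())


def span_retained(context: str, gold_span: str) -> bool:
    """True if the gold span survives verbatim (after normalization)."""
    return normalize(gold_span) in normalize(context)


def retention_rate(contexts: List[str], gold_spans: List[str]) -> float:
    """Fraction of examples whose gold span is present in its context."""
    if len(contexts) != len(gold_spans):
        raise ValueError("contexts and gold_spans must have the same length")
    if not contexts:
        return 0.0
    hits = sum(span_retained(c, g) for c, g in zip(contexts, gold_spans))
    return hits / len(contexts)


contexts = [
    "12. Governing Law. This Agreement is governed by the laws\nof Delaware.",
    "All notices must be sent in writing.",
]
gold = ["governed by the laws of Delaware", "thirty days written notice"]
print(retention_rate(contexts, gold))
```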

What LlamaIndex gives you that RedHop doesn’t

To be clear: LlamaIndex offers things RedHop doesn’t even try to provide:

Composable indices. TreeIndex for hierarchical summarization, KeywordTableIndex for keyword routing, VectorStoreIndex for dense retrieval, KnowledgeGraphIndex for graph-shaped corpora, and you can compose them. RedHop is one path: chunk → BM25 (or hybrid) → assemble.
Specialized query engines. SubQuestionQueryEngine breaks a complex question into sub-questions. MultiStepQueryEngine runs iterative retrieval. CitationQueryEngine enforces source citations in the response. RedHop returns context + citations. You compose any retrieval-shape logic outside.
LlamaCloud. Hosted, managed RAG with a dashboard and a billing page. RedHop is OSS only, in-process. You run it.
LlamaHub. A library of loaders (Notion, Slack, Confluence, S3, Postgres, …), tools, and prompt templates. RedHop has built-in parsers for PDF / DOCX / PPTX / XLSX / Markdown / code, and that’s it.
Edges RedHop on CUAD’s raw-template query (measured, mechanism-known). Out of the box, LlamaIndex is 4 points ahead on CUAD’s 24-word template question (86% vs 82% gold-span retention). The cause is BM25 dilution from the fixed-template boilerplate, not a LlamaIndex chunking superpower. With RedHop’s analyze_query_set + Stripper + Vocabulary workflow run through doc.context_with_rewrites(...) (the same primitives ship in the public API across Python, Node, and Rust), RedHop reaches 90.7% while staying on default BM25 at ~2.5ms/query. Honest footnote (n=300, 2026-06-08): when the same Stripper is applied to every system’s query, LlamaIndex lifts to 94%, RedHop topk to 88%, LangChain to 79%, so the 90.7% RedHop number is RedHop with Stripper+Vocabulary vs LlamaIndex with default. We did not apply the Vocabulary recipe to LlamaIndex. Given LlamaIndex benefits more from the Stripper step alone, an unmeasured-but-likely outcome is that the same Vocabulary would also help LlamaIndex. The CUAD recipe’s value is the reproducible workflow, not an architectural lead. Full investigation: CUAD_RECALL_GAP + CUAD_CLAUSE_EXPANSION.

If you need any of the above, stay on LlamaIndex.
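
The boilerplate-dilution mechanism from the CUAD item above is easier to see on a concrete query set. The sketch below is not RedHop’s analyze_query_set or Stripper (their signatures aren’t shown on this page). find_template_tokens and strip_template are hypothetical helpers, and the queries are illustrative CUAD-style templates:

```python
import re
from collections import Counter
from typing import List, Set


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def find_template_tokens(queries: List[str], min_share: float = 0.9) -> Set[str]:
    """Tokens that occur in at least min_share of the queries."""
    doc_freq: Counter = Counter()
    for q in queries:
        doc_freq.update(set(tokenize(q)))
    threshold = min_share * len(queries)
    return {tok for tok, n in doc_freq.items() if n >= threshold}


def strip_template(query: str, template_tokens: Set[str]) -> str:
    """Drop the shared boilerplate and keep the query-specific terms."""
    return " ".join(t for t in tokenize(query) if t not in template_tokens)


queries = [
    'Highlight the parts (if any) of this contract related to "Governing Law" '
    "that should be reviewed by a lawyer.",
    'Highlight the parts (if any) of this contract related to "Termination For Convenience" '
    "that should be reviewed by a lawyer.",
    'Highlight the parts (if any) of this contract related to "Non-Compete" '
    "that should be reviewed by a lawyer.",
]
template = find_template_tokens(queries)
stripped = [strip_template(q, template) for q in queries]
print(stripped)
```

Words like "contract", "reviewed", and "lawyer" appear in every query and in most contract chunks, so under BM25 they add score to many chunks without helping to separate them. With a small query set a genuinely informative word can cross the threshold too, so min_share needs checking against real data.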

What RedHop gives you that LlamaIndex doesn’t

1. A Decision Report on every call

Every doc.context(query) returns a ctx.report describing exactly what happened: what was kept, what was dropped, whether the engine intervened, why it chose what it chose.

RedHop Decision Report
======================

Decision: Auto → passthrough (small context, no intervention needed)

  Why:
    - 1,240 tokens — below the dilution gate (1,500 tokens)
    - pruning a small clean context risks dropping reasoning evidence
  Result:
    - kept all 8 retrieved chunks
    - evidence retained 100%, second-hop links preserved

LlamaIndex returns a Response with source_nodes, but no structured report explaining the retrieval and assembly decision. With RedHop, the report is structured data on every call: auto_decision, total_tokens, n_input_chunks, n_selected, retained_evidence_ratio, second_hop_rescue_count. You can also run doc.analyze(query) to get the same diagnostics without assembling a context.
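
Because the report is structured, it can go straight into logs or test assertions. A minimal sketch using a stand-in dataclass with the field names listed above; whether ctx.report exposes them as attributes, keys, or something else is an assumption here, and ReportSnapshot, to_log_line, and dropped_chunks are hypothetical names:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ReportSnapshot:
    """Stand-in for the report fields named in the text (not RedHop's class)."""
    auto_decision: str
    total_tokens: int
    n_input_chunks: int
    n_selected: int
    retained_evidence_ratio: float
    second_hop_rescue_count: int


def to_log_line(report: ReportSnapshot) -> str:
    """Serialize the snapshot as one JSON line for a log pipeline."""
    return json.dumps(asdict(report), sort_keys=True)


def dropped_chunks(report: ReportSnapshot) -> int:
    """How many retrieved chunks did not make it into the context."""
    return report.n_input_chunks - report.n_selected


# Values mirror the sample report above; the rescue count is illustrative.
sample = ReportSnapshot(
    auto_decision="passthrough",
    total_tokens=1240,
    n_input_chunks=8,
    n_selected=8,
    retained_evidence_ratio=1.0,
    second_hop_rescue_count=0,
)
print(to_log_line(sample))
print("dropped:", dropped_chunks(sample))
```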

2. No vector store required

LlamaIndex’s default index assumes vectors: VectorStoreIndex.from_documents() embeds every chunk on construction. Even for one PDF, you’re paying the embed cost upfront and standing up a vector store.

RedHop’s default tier is BM25. Zero model download, zero embedding cost, sub-100ms warm queries. Most document QA (code, API references, runbooks, financial reports, handbooks) works on lexical alone, because the words in the question are usually the words in the answer.
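
To make "lexical alone" concrete, here is a minimal Okapi BM25 scorer in plain Python. It is a sketch of the general technique, not RedHop’s Rust implementation; k1=1.5 and b=0.75 are common defaults picked here as assumptions, and the tokenizer does no stemming:

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, chunks: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every chunk against the query with Okapi BM25."""
    docs = [tokenize(c) for c in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0

    # Document frequency: in how many chunks each term appears.
    doc_freq: Counter = Counter()
    for d in docs:
        doc_freq.update(set(d))

    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


chunks = [
    "Governing Law. This Agreement is governed by the law of the State of Delaware.",
    "Either party may terminate this Agreement with thirty days notice.",
    "All notices shall be delivered in writing.",
]
scores = bm25_scores("What is the governing law?", chunks)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```

Note what matched: "governing" and "law" hit the section heading, while "governed" would not have matched "governing" without stemming. That is the trade-off the hybrid tier exists for.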

If you need semantic retrieval, opt into retrieval="hybrid" with a small embedding model (bge-small, ~80MB, auto-downloaded). Even then, retrieval is exact cosine over your in-memory chunks: no ANN index, no vector store, no embedded service.
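
"Exact cosine over in-memory chunks" means a brute-force scan: score every chunk vector, sort, take the top k. A minimal sketch with toy 3-dimensional vectors standing in for real embeddings (the vectors and the top_k_exact name are illustrative assumptions):

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k_exact(query_vec: Sequence[float],
                chunk_vecs: List[Sequence[float]],
                k: int = 4) -> List[Tuple[int, float]]:
    """Compare the query against every chunk; no approximate index involved."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


chunk_vecs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
ranked = top_k_exact([1.0, 0.1, 0.0], chunk_vecs, k=2)
print(ranked)
```

The scan is linear in the number of chunks, which is fine for one document or a folder and is the reason no ANN index is needed at that scale.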

3. Three calls cover the surface

Load. Ask. Read. That’s the API.

doc = redhop.Document.from_file("contract.pdf")   # load (or .from_folder, .from_text, .from_bytes)
ctx = doc.context("What is the governing law?")   # ask
print(ctx.text())                                 # the prompt for your LLM
for c in ctx.citations: ...                        # source / page / heading / line per chunk
print(ctx.report)                                 # the decision

Compare to LlamaIndex’s load → parse → index → engine → query → response shape. Each piece is its own concept with its own config.

4. The same API in Python, Node, and Rust

LlamaIndex is primarily Python, with a TypeScript port (LlamaIndex.TS) that trails behind, and nothing for Rust. RedHop ships the same surface in Python, Node, and Rust over a single Rust core. Prototype in Python, ship the same API in your Rust service or Electron app.

5. In-process, no SaaS, no network calls

RedHop runs in your process. No service to call, no hosted endpoint, no API key. The optional embedding model is downloaded once and runs locally via ONNX. Your documents never leave the box. For finance / legal / health teams with data residency requirements, this is the shape of the answer.

Migrating from LlamaIndex to RedHop

If you’ve got an existing LlamaIndex pipeline doing document QA, here’s the equivalent in RedHop.

Loading + indexing

LlamaIndex:

from llama_index.core import VectorStoreIndex, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.file import PyMuPDFReader
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding()
docs = PyMuPDFReader().load(file_path="contract.pdf")
index = VectorStoreIndex.from_documents(
    docs,
    transformations=[SentenceSplitter(chunk_size=512, chunk_overlap=50)],
)

RedHop:

import redhop
doc = redhop.Document.from_file("contract.pdf")

That’s it. PDF parsing, chunking, indexing: all behind the API. No embedding call (default tier is BM25). For semantic retrieval add retrieval="hybrid", model="bge-small" to the constructor.
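
If chunk_size=512 and chunk_overlap=50 in the LlamaIndex snippet feel abstract, this is roughly what a sliding window with overlap does. It is a simplified word-based sketch, not SentenceSplitter itself, which counts tokens and tries to respect sentence boundaries, and not RedHop’s chunker either:

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> List[str]:
    """Split text into word windows where neighbours share chunk_overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(10))
small_chunks = chunk_words(text, chunk_size=4, chunk_overlap=1)
print(small_chunks)
```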

Querying

LlamaIndex:

engine = index.as_query_engine(similarity_top_k=4)
response = engine.query("What is the governing law?")
print(response.response)

RedHop (LLM-agnostic, bring your own):

from openai import OpenAI

ctx = doc.context("What is the governing law?")
answer = OpenAI().responses.create(
    model="gpt-4o-mini",
    input=f"{ctx.text()}\n\nQuestion: What is the governing law?",
).output_text

LlamaIndex bundles the LLM call in its query engine. RedHop hands you the prompt string and lets you call any provider, no lock-in to a wrapper.

Citations / source nodes

LlamaIndex:

for node in response.source_nodes:
    print(node.metadata, node.text)

RedHop:

for c in ctx.citations:
    print(c["source"], c["page"], c["heading"], c["line"])

Same shape, simpler keys. Each citation carries source plus whichever of page / heading / line the format provides, with no separate metadata layer.
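
Since not every format provides every locator, display code should tolerate gaps. A minimal sketch, assuming citations behave like dicts and that a locator the format lacks is either absent or None (both assumptions; format_citation is a hypothetical helper):

```python
from typing import Any, Mapping


def format_citation(c: Mapping[str, Any]) -> str:
    """Render source plus whichever locators are present."""
    parts = [str(c["source"])]
    for key, label in (("page", "p."), ("heading", "§"), ("line", "line")):
        value = c.get(key)
        if value is not None:
            parts.append(f"{label} {value}")
    return ", ".join(parts)


pdf_cite = format_citation({"source": "contract.pdf", "page": 12})
md_cite = format_citation({"source": "README.md", "heading": "Install", "line": 40})
print(pdf_cite)
print(md_cite)
```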

Folder of files

LlamaIndex:

from llama_index.core import SimpleDirectoryReader
docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)

RedHop:

doc = redhop.Document.from_folder("./docs", options=redhop.FolderOptions(persist=True))

from_folder honors .gitignore, accepts custom ignore patterns, and optionally writes an incremental on-disk index: reload is O(changed files), not O(all files).
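
The idea behind an incremental reload can be sketched with a content-hash manifest: remember a digest per file, and only re-process files whose digest changed. This illustrates the general technique under assumptions; RedHop’s on-disk format and change detection are not described here, and changed_files is a hypothetical helper:

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(folder: Path, manifest_path: Path) -> List[Path]:
    """Return files that are new or modified since the last call, then save the manifest."""
    old: Dict[str, str] = {}
    if manifest_path.exists():
        old = json.loads(manifest_path.read_text())

    new: Dict[str, str] = {}
    changed: List[Path] = []
    for path in sorted(folder.rglob("*")):
        if not path.is_file() or path == manifest_path:
            continue
        key = str(path.relative_to(folder))
        new[key] = file_digest(path)
        if old.get(key) != new[key]:
            changed.append(path)  # only these need re-parsing and re-indexing

    manifest_path.write_text(json.dumps(new, indent=2))
    return changed


with tempfile.TemporaryDirectory() as tmp:
    docs_dir = Path(tmp) / "docs"
    docs_dir.mkdir()
    manifest = Path(tmp) / "manifest.json"
    (docs_dir / "a.md").write_text("alpha")
    (docs_dir / "b.md").write_text("beta")

    first_run = [p.name for p in changed_files(docs_dir, manifest)]   # both files are new
    second_run = [p.name for p in changed_files(docs_dir, manifest)]  # nothing changed
    (docs_dir / "b.md").write_text("beta, revised")
    third_run = [p.name for p in changed_files(docs_dir, manifest)]   # only b.md changed

print(first_run, second_run, third_run)
```

Hashing still reads every file, so the savings in this sketch come from skipping the expensive parse and index work. Comparing size and modification time first would avoid most of the reads as well.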

Multi-step / sub-question queries

LlamaIndex has dedicated query engines for these (SubQuestionQueryEngine, MultiStepQueryEngine). RedHop doesn’t: it returns context for one question per call. If your workload is genuinely sub-question / multi-step, LlamaIndex is the better tool and you can still call RedHop for the per-step retrieval inside it.
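
The per-step pattern looks roughly like this. The sketch takes the sub-questions as given (producing them is the decomposition engine’s job) and a retrieve callable, which in real use could be something like lambda q: doc.context(q).text(). gather_contexts and merge_contexts are hypothetical helpers, and the stub retriever exists only so the example runs on its own:

```python
from typing import Callable, Dict, List


def gather_contexts(sub_questions: List[str], retrieve: Callable[[str], str]) -> Dict[str, str]:
    """Run one retrieval per sub-question, keyed by the question."""
    return {q: retrieve(q) for q in sub_questions}


def merge_contexts(contexts: Dict[str, str]) -> str:
    """Join per-question contexts into one prompt section per sub-question."""
    sections = [f"### {q}\n{text}" for q, text in contexts.items()]
    return "\n\n".join(sections)


def stub_retrieve(question: str) -> str:
    # Stand-in for a real retrieval call; returns a canned string.
    return f"(context for: {question})"


sub_questions = [
    "What is the governing law?",
    "What is the termination notice period?",
]
merged = merge_contexts(gather_contexts(sub_questions, stub_retrieve))
print(merged)
```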

Pick the right tool

Workload	RedHop	LlamaIndex
Document QA with one or many files	✅ shorter, observable	✅ flexible
Composable indices / multi-step retrieval	❌ out of scope	✅ flagship
Need to plug in a specific LLM provider	✅ (any, you call it)	✅ (built-in integration)
Hosted / managed RAG with a dashboard	❌	✅ LlamaCloud
Visibility into retrieval decisions	✅ Decision Report	❌ DIY observability
CUAD raw-template query (out of the box)	82%	86%
CUAD with the Stripper + Vocabulary workflow	90.7%	—
In-process, no vector store, no infra	✅	❌
Same API in Python / Node / Rust	✅	❌ Python + partial TS
Apache-2.0, no commercial gating	✅	✅ (with LlamaCloud as a paid layer)

If your workload sits firmly in document QA and you’ve been wondering why LlamaIndex’s index → query engine → response shape feels heavy for what you’re doing, RedHop is the alternative you’re looking for. If you’re doing composable retrieval, sub-question decomposition, or you specifically need contract extraction at LlamaIndex’s quality, stay on LlamaIndex.

Get started

pip install redhop                            # Python
cargo add redhop --features files,semantic    # Rust
npm install redhop                            # Node.js (published on npm)

Quickstart: the three-call surface
Choosing a configuration: when to use which retrieval tier
The full benchmark vs LangChain & LlamaIndex: same datasets, same retriever, head-to-head
Other alternatives: per-framework deep-dives (LangChain, Haystack)
llms.txt: single-file context for AI coding agents

Open source under Apache-2.0. Bug reports and use-case feedback welcome at github.com/vysakh0/redhop.