Rust library for RAG: the landscape, what's mature, and how to choose
If you’re building RAG and you’d rather not pull in Python, this guide is for you. It’s a tour of the Rust RAG ecosystem as of 2026 — what’s mature, what’s half-built, what’s missing entirely, and which library to pick for which job. We’ll cover end-to-end RAG libraries, the building blocks (embeddings, ML frameworks, tokenizers), and the vector databases that work natively from Rust.
We’ll be honest about the tradeoffs. Rust isn’t strictly better than Python for RAG; it’s better for specific reasons, in specific deployments. By the end of this guide you should know whether your deployment is one of them.
Why pick Rust for RAG?
Three real reasons, plus one honest pushback:
1. You already ship Rust. This is the most common case. You have a Rust service — an Axum API, a Tauri desktop app, a CLI, an edge worker — and you need to add document QA or semantic search. Calling out to a Python sidecar is a deployment headache. A Rust crate fits the existing build.
2. Latency and resource footprint matter. Rust RAG runtimes can answer warm queries in single-digit milliseconds with tiny memory footprints. For in-process retrieval at the edge (Cloudflare Workers via WASM, on-device mobile, embedded), Python is rarely a practical option; Rust is the default.
3. Static binary deployment. Ship one file with no Python interpreter, no pip dependencies, no system Python version to manage. For finance / legal / health / air-gapped environments, this is often the shape that gets the security review through.
The honest pushback: Python’s RAG ecosystem is far deeper. LangChain, LlamaIndex, and Haystack have hundreds of integrations, mature evaluation harnesses, and the bulk of the community. If your stack is already Python and you’re not constrained by latency or deployment, picking Rust is usually trading away ecosystem breadth for a marginal win. Pick Rust when the deployment / integration / footprint reasons are real, not because “Rust is faster” in a benchmark.
The Rust RAG landscape (2026)
The ecosystem has three layers. Most people only need the top one.
Layer 1 — End-to-end RAG libraries
These are the “drop in and go” libraries. You hand them documents and a query; they handle chunking, retrieval, and context assembly.
| Library | Shape | Best for |
|---|---|---|
| RedHop | 3-call library (load → ask → read) | Document QA, smallest surface, in-process |
| rig | LLM framework (agents + RAG) | Agents, tool-use, multi-provider LLM apps |
| swiftide | Async indexing pipeline + RAG | Production batch ingestion at scale |
| anyrag | Lightweight RAG with adapters | Smaller projects, pick-and-mix backends |
Each one is taking a different bet about what RAG-in-Rust should look like.
RedHop is the smallest surface. Three calls — Document::from_file(path),
doc.context(query), then read the assembled prompt + Decision Report off
the returned context. BM25 by default (no model download, no ONNX runtime),
optional dense retrieval via a small embedding model. Designed for document
context optimization — turning a file and a question into the right LLM
prompt, with full observability into the decision. The same library is also
published to PyPI and npm, so the API is identical across Python, Node,
and Rust services. Best when you want one bounded step you can drop into a
larger app.
rig is broader. It’s the closest thing to a “LangChain for Rust”: agents, tool-use, multi-step workflows, provider integrations (OpenAI, Anthropic, Cohere, Gemini, etc.). RAG is one capability among many. Best when you’re building an agent app where RAG is one tool the agent can call. Heavier surface than RedHop; closer in spirit to LangChain.
swiftide is an async indexing-and-RAG pipeline framework, focused on high-throughput document ingestion. It composes nicely with Qdrant, LanceDB, Redis, FastEmbed, and Tree-sitter for code. The pipeline DAG model is similar to Haystack’s. Best when you’re ingesting a large corpus on a schedule and want explicit pipeline composition.
anyrag and similar smaller libraries fill the long tail: minimal, pick-and-mix shapes for narrow use cases. They are worth knowing about, but their communities and API surfaces are smaller.
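BM25, the default retriever mentioned for RedHop above, needs no model download at all, which is why it suits a lean build. The following is a minimal std-only sketch of Okapi BM25 scoring, not RedHop's implementation: the function names, the naive tokenizer, and the constants `k1 = 1.2` and `b = 0.75` are assumptions chosen for illustration, and it assumes the corpus contains at least one non-empty document.

```rust
use std::collections::HashMap;

/// Lowercase and split on non-alphanumeric characters (a deliberately naive tokenizer).
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Score every document against `query` with Okapi BM25.
/// Returns one score per document, in input order.
fn bm25_scores(docs: &[&str], query: &str) -> Vec<f64> {
    const K1: f64 = 1.2; // term-frequency saturation
    const B: f64 = 0.75; // document-length normalization

    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();
    let n = tokenized.len() as f64;
    let total_len: usize = tokenized.iter().map(|d| d.len()).sum();
    let avg_len = total_len as f64 / n.max(1.0);

    // Document frequency: in how many documents does each term occur?
    let mut df: HashMap<&str, usize> = HashMap::new();
    for doc in &tokenized {
        let mut seen: Vec<&str> = doc.iter().map(String::as_str).collect();
        seen.sort_unstable();
        seen.dedup();
        for term in seen {
            *df.entry(term).or_insert(0) += 1;
        }
    }

    let query_terms = tokenize(query);
    tokenized
        .iter()
        .map(|doc| {
            let len = doc.len() as f64;
            query_terms
                .iter()
                .map(|q| {
                    let tf = doc.iter().filter(|t| *t == q).count() as f64;
                    let n_q = *df.get(q.as_str()).unwrap_or(&0) as f64;
                    // Lucene-style IDF variant, which stays non-negative.
                    let idf = ((n - n_q + 0.5) / (n_q + 0.5) + 1.0).ln();
                    idf * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * len / avg_len))
                })
                .sum::<f64>()
        })
        .collect()
}
```

This version rescans every document per query, which is fine for a single file's worth of chunks. A real engine would precompute an inverted index so that only documents containing a query term are touched.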
Layer 2 — Building blocks (embeddings, ML, tokenizers)
If you’re building your own RAG, or one of the layer-1 libraries doesn’t quite fit, these are the components you’ll wire together.
| Crate | What it is |
|---|---|
| candle | HuggingFace’s pure-Rust ML framework. Runs many HF model architectures (embedders, rerankers, LLMs) in Rust |
| ort | Rust bindings for ONNX Runtime. The fastest way to run small embedding models in Rust |
| fastembed-rs | Rust port of fastembed — opinionated, easy ONNX embeddings for common models |
| embed_anything | Multi-modal embeddings library; supports text, images, audio |
| tokenizers | HuggingFace’s Rust tokenizer (the Rust core under the Python tokenizers package) |
| tiktoken-rs | Rust port of OpenAI’s tiktoken |
candle vs ort is the big decision in this layer:
- ort wraps the C++ ONNX Runtime. Faster on most CPUs for small models (BGE, MiniLM, etc.), more mature, supported by Microsoft. Slightly heavier deps (needs the ONNX Runtime shared library or the download-binaries feature).
- candle is pure Rust, no native runtime. Lighter deps, runs anywhere Rust runs (including WASM-friendly targets). Smaller-to-mid model performance is comparable; large-model performance is improving but ort is generally faster today.
If you’re running 384-dim or 768-dim sentence embedders (BGE-small, all-MiniLM, etc.), ort is the typical pick. If you need maximum portability or want to avoid any non-Rust deps, candle is the answer.
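One practical detail when you drive an embedder through a low-level runtime yourself: sentence-embedding models generally emit one vector per token, and you pool them into a single vector afterwards. Below is a hedged, std-only sketch of mean pooling with an attention mask followed by L2 normalization. The function name and the plain `Vec<f32>` inputs are assumptions for illustration (real runtimes hand you tensors), and the right pooling strategy depends on the model: MiniLM-style models use mean pooling, while BGE models use the first (CLS) token, so check the model card.

```rust
/// Mean-pool per-token vectors into one sentence vector, skipping padding
/// positions (mask == 0), then L2-normalize the result.
fn mean_pool_normalized(token_vecs: &[Vec<f32>], attention_mask: &[u8]) -> Vec<f32> {
    let dim = token_vecs.first().map_or(0, |v| v.len());
    let mut pooled = vec![0.0f32; dim];
    let mut count = 0.0f32;

    for (token, &mask) in token_vecs.iter().zip(attention_mask) {
        if mask == 0 {
            continue; // padding token: contributes nothing
        }
        for (acc, x) in pooled.iter_mut().zip(token) {
            *acc += x;
        }
        count += 1.0;
    }

    // Average over the real (non-padding) tokens.
    if count > 0.0 {
        pooled.iter_mut().for_each(|x| *x /= count);
    }

    // Scale to unit length so that a dot product equals cosine similarity.
    let norm = pooled.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        pooled.iter_mut().for_each(|x| *x /= norm);
    }
    pooled
}
```

Higher-level crates such as fastembed-rs handle this step for you, which is a large part of their appeal.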
Layer 3 — Vector databases (Rust-native or Rust-callable)
If you need persistent vector storage at scale, you’ll reach for one of these. (As we’ll discuss below, you often don’t need one for document QA.)
| DB | Shape | Best for |
|---|---|---|
| LanceDB | Embedded columnar vector DB, Rust-native | Single-process; embedded use; mid-scale |
| Qdrant (server + Rust client) | Standalone vector DB written in Rust | Production at scale; cloud or self-host |
| Chroma (Rust client) | Open-source vector DB | Prototyping; Python-first but works from Rust |
| usearch | Embedded HNSW index (C++ core with Rust bindings) | Lightweight ANN over in-memory or mmap vectors |
| pgvector + sqlx/diesel | Postgres extension via Rust DB libs | Already have Postgres; want vectors in the same DB |
LanceDB is the most idiomatic Rust choice for an embedded vector DB today — Rust-native, columnar, handles updates and filters well, supports both flat scan and ANN. Qdrant is the production-grade server when you’ve outgrown embedded.
Worth saying explicitly: for document QA, you often don’t need a vector DB at any tier. BM25 with an in-memory inverted index handles most keyword-dense corpora (contracts, API references, runbooks, handbooks) just fine, and exact cosine over an in-memory chunk array handles moderate-sized semantic queries without ANN. The “you need a vector DB” assumption comes from Python tutorials defaulting to one; it doesn’t follow from the math. (RedHop’s whole pitch is built on this observation — see the Comparison page for measured numbers across real corpora.)
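To make the "exact cosine over an in-memory chunk array" point concrete, here is a minimal std-only sketch of brute-force top-k retrieval. The function names and the `Vec<Vec<f32>>` layout are assumptions for illustration. The arithmetic is modest: 10,000 chunks at 384 dimensions is 3.84 million multiply-adds per query for the dot products, over roughly 15 MB of `f32` data.

```rust
/// Cosine similarity between two equal-length vectors; 0.0 if either has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Exact top-k by brute force: score every chunk, sort, truncate.
/// Returns (chunk index, similarity) pairs, best first.
fn top_k(query: &[f32], chunks: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = chunks
        .iter()
        .enumerate()
        .map(|(i, c)| (i, cosine(query, c)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

If the vectors are normalized once at load time, `cosine` reduces to a plain dot product. An ANN index only starts to pay for itself when the corpus is large enough that this linear scan dominates the query budget.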
A working example: RAG in 30 lines of Rust
Here’s a complete, runnable RAG pipeline with RedHop. From cargo new to
asking a question about a contract.
Cargo.toml:

```toml
[package]
name = "my-rag"
version = "0.1.0"
edition = "2021"

[dependencies]
redhop = { version = "0.2", features = ["files", "semantic"] }
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
```

src/main.rs:

```rust
use redhop::read_file;

fn main() -> redhop::Result<()> {
    // 1. Load -- a single file (PDF / DOCX / PPTX / XLSX / Markdown / text /
    //    code). Parsing, chunking, and indexing all happen here.
    let mut doc = read_file("contract.pdf")?;

    // 2. Ask -- chunking, retrieval, and token-budgeting happen in-process.
    let ctx = doc.context("What is the governing law?")?;

    // 3. Read the assembled prompt + the Decision Report.
    println!("Prompt for the LLM:\n{}", ctx.text());

    for c in &ctx.citations {
        println!("  cited: {} p{:?} {:?}", c.source, c.page, c.heading);
    }

    println!("\nDecision: {}", ctx.report.auto_decision);
    println!(
        "Tokens: {}, evidence retained: {:.0}%",
        ctx.report.total_tokens,
        ctx.report.retained_evidence_ratio * 100.0
    );

    Ok(())
}
```

cargo run, and that’s the whole thing. The files feature pulls in the
built-in PDF/DOCX/PPTX/XLSX/text/code parsers; the semantic feature
adds the optional ONNX embedder for retrieval="hybrid" (off by default
so the lean build pulls only BM25 + Tantivy).
If you want to pass the prompt to an LLM:

```rust
let response = openai_client.chat_completion(&ctx.text()).await?;
```

ctx.text() is just a String, so you can hand it to any LLM provider’s
Rust crate (async-openai, anthropic-sdk, the OpenAI-compatible
endpoints exposed by many providers, etc.). RedHop doesn’t bundle the LLM
call; it owns the context step.
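For intuition about what a context step involves, here is a hedged sketch of one simple approach: greedily packing the highest-scoring chunks into a token budget. This is not RedHop's algorithm. The `Chunk` type, the function names, and the whitespace-based token estimate are all hypothetical; a real pipeline would count tokens with the target model's tokenizer (for example via tokenizers or tiktoken-rs).

```rust
/// A retrieved chunk with its relevance score (hypothetical type for this sketch).
struct Chunk {
    text: String,
    score: f32,
}

/// Crude token estimate: the number of whitespace-separated words.
fn estimate_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Greedily pack the highest-scoring chunks into `budget` tokens, then
/// join the kept chunks into one context string, best chunk first.
fn assemble_context(mut chunks: Vec<Chunk>, budget: usize) -> String {
    chunks.sort_by(|a, b| b.score.total_cmp(&a.score));

    let mut used = 0;
    let mut kept: Vec<String> = Vec::new();
    for chunk in chunks {
        let cost = estimate_tokens(&chunk.text);
        if used + cost > budget {
            continue; // does not fit; a smaller, lower-ranked chunk still might
        }
        used += cost;
        kept.push(chunk.text);
    }
    kept.join("\n\n")
}
```

Skipping an oversized chunk instead of stopping at it is a design choice: it fills the budget more completely, at the cost of sometimes including a lower-ranked chunk ahead of a higher-ranked one that did not fit.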
What about hybrid / semantic retrieval?
Add retrieval and model to the loader. The first run downloads the
embedding model (~80MB for bge-small); subsequent runs hit the cache.
```rust
use redhop::{read_file_with, LoadOptions};

let mut doc = read_file_with("contract.pdf", &LoadOptions {
    retrieval: Some("hybrid".into()),
    model: Some("bge-small".into()),
    ..Default::default()
})?;
```

For structured docs with parallel clauses (regional overrides etc.), pair
hybrid with context_with(.., include_heading: true, neighbors: 1).
There’s a decision guide covering when each
configuration is the right pick.
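Hybrid retrieval means merging a keyword ranking with a dense ranking. How RedHop fuses them is not described here, so the sketch below swaps in reciprocal rank fusion (RRF), a common general-purpose technique, purely as an illustration. The function name and the use of `usize` chunk ids are assumptions; `k = 60` is the constant conventionally used with RRF.

```rust
use std::collections::HashMap;

/// Reciprocal rank fusion: each ranking contributes 1 / (k + rank) per item,
/// with rank starting at 1. Returns (chunk id, fused score), best first.
fn rrf(rankings: &[Vec<usize>], k: f64) -> Vec<(usize, f64)> {
    let mut fused: HashMap<usize, f64> = HashMap::new();
    for ranking in rankings {
        for (pos, &id) in ranking.iter().enumerate() {
            *fused.entry(id).or_insert(0.0) += 1.0 / (k + (pos + 1) as f64);
        }
    }

    let mut out: Vec<(usize, f64)> = fused.into_iter().collect();
    // Sort by score descending; break ties by id so the output is deterministic.
    out.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    out
}
```

For example, fusing a BM25 ranking `[3, 1, 2]` with a dense ranking `[1, 4, 3]` puts chunk 1 first, because it ranks near the top of both lists. RRF uses only rank positions, so it avoids having to calibrate BM25 scores against cosine similarities.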
A working example with rig (for comparison)
If you’re building agent-shaped applications and want RAG as one piece of a larger flow, rig is the more natural fit. Rough shape:
```rust
use rig::completion::Prompt; // brings `agent.prompt(..)` into scope
use rig::embeddings::EmbeddingsBuilder;
use rig::providers::openai;
use rig::vector_store::in_memory_store::InMemoryVectorStore;

// Runs inside an async fn; `documents` is your own loaded and chunked input.
let openai_client = openai::Client::from_env();
let embedding_model = openai_client.embedding_model("text-embedding-3-small");

let embeddings = EmbeddingsBuilder::new(embedding_model.clone())
    .documents(documents)?
    .build()
    .await?;

let store = InMemoryVectorStore::from_documents(embeddings);
let agent = openai_client.agent("gpt-4o-mini")
    .preamble("You are a helpful assistant.")
    .dynamic_context(4, store.index(embedding_model))
    .build();

let response = agent.prompt("What is the governing law?").await?;
```

Compared to RedHop: rig is heavier on the LLM-provider integration side and lighter on the document-context side. You bring your own loader, chunker, and vector store; in return you get agent + tool-calling shapes out of the box. Roughly, rig is to RedHop what LangChain is to LlamaIndex in Python.
A working example with swiftide (for comparison)
If you’re ingesting a corpus at scale and want an explicit indexing pipeline, swiftide is the right shape:
```rust
use swiftide::indexing::{Pipeline, transformers, loaders};
use swiftide::integrations::{qdrant::Qdrant, fastembed::FastEmbed};

Pipeline::from_loader(loaders::FileLoader::new("./docs"))
    .then_chunk(transformers::ChunkMarkdown::default())
    .then_in_batch(50, transformers::Embed::new(FastEmbed::default()))
    .then_store_with(Qdrant::try_from_url("http://localhost:6334")?.build()?)
    .run()
    .await?;
```

Then a query pipeline does retrieval + answer assembly. Best when the ingestion side is the heavy lift: large corpora, scheduled refreshes, integration with Qdrant / LanceDB / Redis.
Where the Rust RAG ecosystem is still weaker than Python
We promised honesty, so:
- Evaluation harnesses. Python has TruLens, RAGAS, deepeval, Haystack’s eval framework, LlamaIndex evaluators. Rust has… benchmark scripts in individual repos. If you need rigorous RAG eval out of the box, Python still wins.
- Document loader breadth. LlamaIndex’s LlamaHub and LangChain’s community loaders cover hundreds of sources (Notion, Slack, S3, Confluence, ServiceNow, …). Rust libraries mostly cover the basics (files, folders, URLs) — anything exotic, you’ll write yourself.
- LLM provider integration breadth. rig leads here in Rust with ~10 providers; LangChain has 100+. For obscure providers, expect to wire the HTTP client yourself.
- Multi-step / agentic flows. LangGraph and LlamaIndex’s agent frameworks are far more mature than rig’s agent layer. If your problem needs sophisticated tool-calling + planning, the Python ecosystem is still the right place.
- Notebooks / experimentation. Iterating in Jupyter is faster than iterating in a Rust crate. For prompt engineering and config tuning, many teams prototype in Python even if production is Rust.
If any of those gaps matters to you, the right answer is often a Rust-Python hybrid — Rust for the latency-sensitive service path, Python for evaluation / experimentation. Or just pick Python.
When to pick Rust for RAG (and when not to)
A quick decision table:
| Your situation | Pick |
|---|---|
| Existing Rust service that needs document QA | Rust (RedHop or rig) |
| Edge / WASM / on-device deployment | Rust |
| Need single-static-binary deployment for security review | Rust |
| Sub-10ms warm-query budget | Rust |
| Prototyping or experimenting | Python (LangChain / LlamaIndex) |
| Need 100+ LLM provider integrations | Python (LangChain) |
| Need exotic document loaders (Notion, ServiceNow, …) | Python (LlamaHub) |
| Sophisticated agent / tool-calling flows | Python (LangGraph / LlamaIndex) |
| Doing serious RAG evaluation | Python (RAGAS / TruLens) |
| Stack is already Python and there’s no specific reason to switch | Python |
Quick library-by-library decision
Among Rust RAG libraries specifically:
- Document QA with a small API surface and observability → RedHop. Three calls, BM25 default, Decision Report, in-process. Best for “add document QA to an existing Rust service.”
- Agents that call tools and choose actions, with RAG as one capability → rig. Closer to LangChain in scope; multi-provider LLM support is its strongest suit.
- Production indexing pipelines for large corpora → swiftide. Async pipelines, batch ingestion, integration with Qdrant / LanceDB / Redis. Heaviest of the three but production-shaped.
Resources
- RedHop quickstart (Rust): the three-call surface
- Choosing a configuration — when to use which retrieval tier
- llms.txt — single-file context for AI coding agents
- RedHop on crates.io
- rig docs
- swiftide docs
- candle and ort — embedding model runtimes
- LanceDB — embedded vector DB
If you’re new to RAG itself (not just the Rust side), start with the Intro to RAG guide and the Retrieval & context tips for the parts that aren’t language-specific. If you’re weighing Node.js against Rust for your RAG service, the Node.js library for RAG guide covers that side with the same shape.