Loaders
RedHop owns everything from chunking onward; you just get content in. There are a
few on-ramps — from “I already have text,” to a file or whole folder, to bytes
straight from a cloud bucket. All return a Document and take the same options
(see Retrieval options). The API is the same in Python, Node, and Rust; each
example below shows all three.
| On-ramp | For |
|---|---|
| from_text | text you already have (your own parser/OCR, a DB field) |
| from_chunks | content you already chunked |
| from_file | a file on disk — PDF, DOCX, PPTX, XLSX, or text/code |
| from_bytes | bytes you already fetched — S3 / Azure Blob / GCS / HTTP / DB blobs |
| from_folder | a whole folder in one index (with an optional incremental on-disk index) |
from_text — text you already have
Got the text already — a DB field, an API response, your own parser or OCR? Hand it
straight to RedHop. This is also the escape hatch for formats from_file doesn’t
parse, or scanned PDFs that need OCR first.
```python
import redhop

doc = redhop.Document.from_text(my_text)
```

```js
const { Document } = require("redhop");
const doc = Document.fromText(myText);
```

```rust
let doc = redhop::Document::from_text("notes", my_text)?;
```

from_chunks — you already chunked
Already split your content (your own chunker, a DB dump)? Hand RedHop the chunks and it skips straight to indexing:
```python
doc = redhop.Document.from_chunks(["clause one …", "clause two …"])
```

```js
const doc = Document.fromChunks(["clause one …", "clause two …"]);
```

```rust
let doc = redhop::chunks(vec!["clause one …".into(), "clause two …".into()], &Default::default())?;
```

from_file — a file on disk, one line
Point it at a path and RedHop reads, parses, chunks, and indexes it — PDF, DOCX,
PPTX, XLSX, or any text/code file. It tracks the file path as each chunk’s source
plus structural location (citations), so a citation points at
contract.pdf, p.3 or notes.md → Setup or main.py:42, not just the filename.
```python
doc = redhop.Document.from_file("contract.pdf")
ctx = doc.context("how long is the refund window?")
```

```js
const doc = Document.fromFile("contract.pdf");
const ctx = doc.context("how long is the refund window?");
```

```rust
let mut doc = redhop::read_file("contract.pdf")?;
let ctx = doc.context("how long is the refund window?")?;
```

Indexing is type-aware. Each format carries the structural location it has:
| Format | How it’s split | Citation |
|---|---|---|
| Markdown | by heading | heading + line |
| Code (.py .ts .rs …) | by definition (function/class), kept verbatim | symbol + line |
| Text / data | by blank-line block | line |
| DOCX | paragraphs (heading-aware) + tables | heading |
| PPTX | one section per slide | page (slide #) |
| XLSX / ODS | one section per sheet, rows pipe-joined | heading (sheet) |
| PDF | one section per page | page |
Code is chunked verbatim (formatting preserved) and labeled with its nearest
definition (auth.py → def login); prose is sentence-packed. Code is also retrieved
lexically under hybrid — see Retrieval options.
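
To make the "by heading" row concrete, here is a minimal Python sketch of splitting Markdown into sections that each carry a heading and a starting line. It is an illustration only, not RedHop's chunker: the `Section` type and `split_markdown_by_heading` function are hypothetical names, and the sketch ignores edge cases such as `#` lines inside fenced code.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# ATX headings only ("# Title" through "###### Title"); an assumption of this sketch.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Section:
    heading: Optional[str]  # nearest heading text; None before the first heading
    line: int               # 1-based line where the section starts
    text: str               # section body with the heading line removed


def split_markdown_by_heading(markdown: str) -> List[Section]:
    """Split Markdown into one section per heading, keeping heading + line."""
    sections: List[Section] = []
    heading: Optional[str] = None
    start = 1
    buf: List[str] = []

    def flush() -> None:
        # Emit the buffered body only if it has non-blank content.
        if any(line.strip() for line in buf):
            sections.append(Section(heading, start, "\n".join(buf).strip()))

    for lineno, raw in enumerate(markdown.splitlines(), start=1):
        match = HEADING.match(raw)
        if match:
            flush()
            heading, start, buf = match.group(2), lineno, []
        else:
            buf.append(raw)
    flush()
    return sections
```

Each resulting section has what a "heading + line" citation needs: the heading text and the line it begins on.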
from_bytes — cloud storage, HTTP, blobs
The file isn’t always on local disk — it’s in S3, Cloudflare R2, Azure Blob, GCS,
behind a URL, or in a DB column. RedHop doesn’t bundle cloud SDKs (credentials and
auth aren’t its job) — fetch the bytes with the client you already have, and hand them
over. The name (e.g. "contract.pdf") picks the parser and becomes the citation
source, so pass something meaningful like the object key.
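
When the bytes come from a presigned or query-string URL, the raw URL makes a poor name: it ends in signature parameters rather than a file extension. The sketch below trims a URL down to a stable name using only the standard library. It assumes the parser is chosen from the end of the name (the text above implies this but does not spell it out), and `source_name` is a hypothetical helper, not part of RedHop.

```python
from urllib.parse import urlsplit, urlunsplit


def source_name(url: str) -> str:
    """Drop the query string and fragment so the name ends in the file name."""
    parts = urlsplit(url)
    # Keep scheme, host/bucket, and path; discard "?X-Amz-..." and "#...".
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

The result can then be passed as the name argument alongside the fetched bytes, which also keeps signatures out of citations.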
```python
import boto3, redhop

obj = boto3.client("s3").get_object(Bucket="my-bucket", Key="contract.pdf")
doc = redhop.Document.from_bytes(obj["Body"].read(), source="s3://my-bucket/contract.pdf")
```

```js
const { Document } = require("redhop");

// e.g. const buf = Buffer.from(await (await fetch(url)).arrayBuffer());
const doc = Document.fromBytes(buf, "s3://my-bucket/contract.pdf");
```

```rust
// let bytes = your_s3_client.get(...).await?;   // any client
let doc = redhop::read_bytes(&bytes, "s3://my-bucket/contract.pdf")?;
```

Citations
Every selected chunk remembers where it came from: source plus whichever of
page / heading / line the format provides (fields the format lacks are empty):
```python
for c in ctx.citations:
    print(c["source"], c["page"], c["heading"], c["line"])
    # contract.pdf 3 None None   → "contract.pdf, p.3"
    # notes.md None "Setup" 12   → "notes.md → Setup"
```

```js
for (const c of ctx.citations) {
  console.log(c.source, c.page, c.heading, c.line);
  // contract.pdf 3 null null → "contract.pdf, p.3"
}
```

```rust
for c in redhop::citations(&ctx) {
    println!("{} {:?} {:?} {:?}", c.source, c.page, c.heading, c.line);
}
```

This is what lets you show the model’s evidence trail — “answer grounded in
contract.docx → Refund Policy” — without a separate store. from_text/from_chunks
give you source and the chunk text; the structural fields are filled in per format.
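
One way to turn a citation into the display strings used on this page is a small formatter like the sketch below. It assumes a citation is a mapping with the `source`, `page`, `heading`, and `line` keys shown above. The `format_citation` name and the precedence (page, then heading, then line) are choices made for this illustration, not documented RedHop behavior.

```python
from typing import Any, Mapping


def format_citation(c: Mapping[str, Any]) -> str:
    """Render a citation as 'file, p.N', 'file → Heading', or 'file:line'."""
    source = c["source"]
    page = c.get("page")
    heading = c.get("heading")
    line = c.get("line")
    if page is not None:      # paged formats: PDF pages, PPTX slides
        return f"{source}, p.{page}"
    if heading is not None:   # headed formats: Markdown, DOCX, sheet names
        return f"{source} → {heading}"
    if line is not None:      # line-only formats: plain text and data
        return f"{source}:{line}"
    return source             # no structural location available
```

Using `.get` keeps the formatter tolerant of citations that omit a field entirely rather than setting it to `None`.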
from_folder — point at a directory
The on-ramp for file apps and coding agents: index a whole folder and ask, no vector DB to operate. RedHop walks the directory, reads every file it can, and builds one combined index — each chunk keeps its own file path as source, so citations point at the right file across the corpus.
```python
doc = redhop.Document.from_folder("./docs")
ctx = doc.context("what's our deprecation policy?")
```

```js
const doc = Document.fromFolder("./docs");
const ctx = doc.context("what's our deprecation policy?");
```

```rust
let mut doc = redhop::read_folder("./docs")?;
let ctx = doc.context("what's our deprecation policy?")?;
```

It reads everything from_file can. Files it can’t parse are skipped; hidden entries
and build/cache dirs (node_modules, target, __pycache__, venv, dist,
build) are always ignored. It respects your .gitignore (even outside a git
checkout). Add your own excludes, or turn gitignore off:
```python
doc = redhop.Document.from_folder(
    "./repo",
    retrieval="hybrid",                         # BM25 → dense; scales, no vector DB
    ignore=["*.lock", "tests/**", "*.min.js"],  # extra excludes
    # gitignore=False,                          # to include .gitignore'd files
    recursive=True,
)
```

```js
const doc = Document.fromFolder("./repo", {
  recursive: true,
  ignore: ["*.lock", "tests/**", "*.min.js"],  // extra excludes
  // gitignore: false,                         // to include .gitignore'd files
  options: { retrieval: "hybrid" },            // BM25 → dense; scales, no vector DB
});
```

```rust
use redhop::{read_folder_with, FolderOptions, LoadOptions};

let opts = FolderOptions {
    recursive: Some(true),
    ignore: vec!["*.lock".into(), "tests/**".into(), "*.min.js".into()],
    // gitignore: Some(false),
    load: LoadOptions { retrieval: Some("hybrid".into()), ..Default::default() },
    ..Default::default()
};
let mut doc = read_folder_with("./repo", &opts)?;
```

Persist the index — incremental reload
By default the index is in-memory (rebuilt each run). Turn on persistence to save it to disk and reload incrementally: on the next run, files whose modified-time and size are unchanged are reused from the cache — no re-parsing, no re-embedding — only new/changed files are processed and removed files are dropped. This is what makes a folder of thousands of files practical; the win is biggest on the semantic/hybrid tiers, where the saved index carries the embeddings.
```python
# First run embeds everything and saves to ./docs/.redhop; later runs reuse it.
doc = redhop.Document.from_folder("./docs", persist=True, retrieval="hybrid")

# Put the index elsewhere:
doc = redhop.Document.from_folder("./docs", persist=True, index_dir="/var/cache/redhop")
```

```js
// First run embeds everything and saves to ./docs/.redhop; later runs reuse it.
let doc = Document.fromFolder("./docs", { persist: true, options: { retrieval: "hybrid" } });

// Put the index elsewhere:
doc = Document.fromFolder("./docs", { persist: true, indexDir: "/var/cache/redhop" });
```

```rust
use redhop::{read_folder_with, FolderOptions, LoadOptions};

// First run embeds everything and saves to ./docs/.redhop; later runs reuse it.
let opts = FolderOptions {
    persist: true,
    load: LoadOptions { retrieval: Some("hybrid".into()), ..Default::default() },
    ..Default::default()
};
let mut doc = read_folder_with("./docs", &opts)?;
```

The cache is keyed by a fingerprint of your indexing settings (chunk size, retrieval tier, model), so changing any of them rebuilds rather than serving a stale index.
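
The reload rules above can be sketched in a few lines of Python. This is an illustration of the idea, not RedHop's implementation or on-disk format: `settings_fingerprint`, `stamp`, and `plan_reload` are hypothetical helpers, and the setting names in the usage note are placeholders.

```python
import hashlib
import json
import os
from typing import Dict, List, Tuple

Stamp = Tuple[int, int]  # (modified time in nanoseconds, size in bytes)


def settings_fingerprint(settings: dict) -> str:
    """Stable hash of the indexing settings; any change gives a new cache key."""
    # sort_keys makes the hash independent of dict insertion order.
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def stamp(path: str) -> Stamp:
    """Cheap change detector for one file: modified time plus size."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def plan_reload(cached: Dict[str, Stamp], current: Dict[str, Stamp]) -> Dict[str, List[str]]:
    """Decide which files to reuse, (re)process, or drop on the next run."""
    reuse = sorted(p for p, s in current.items() if cached.get(p) == s)
    process = sorted(p for p, s in current.items() if cached.get(p) != s)
    drop = sorted(p for p in cached if p not in current)
    return {"reuse": reuse, "process": process, "drop": drop}
```

In this sketch, a run would compute `settings_fingerprint({"retrieval": "hybrid", "chunk_size": 512})`, discard the cache if it differs from the stored value, and otherwise call `plan_reload` with the stored stamps and fresh `stamp(path)` values. Comparing modified time and size avoids reading file contents, at the cost of missing an edit that preserves both.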
→ Once content is loaded, pick how it’s retrieved: Retrieval options.