Loaders

RedHop owns everything from chunking onward; you just get content in. There are five on-ramps, ranging from “I already have text,” to a file or whole folder, to bytes straight from a cloud bucket. All return a Document and take the same options (see Retrieval options). The API is the same in Python, Node, and Rust.

On-ramp       For
from_text     text you already have (your own parser/OCR, a DB field)
from_chunks   content you already chunked
from_file     a file on disk: PDF, DOCX, PPTX, XLSX, or text/code
from_bytes    bytes you already fetched: S3 / Azure Blob / GCS / HTTP / DB blobs
from_folder   a whole folder in one index (with an optional incremental on-disk index)

Got the text already — a DB field, an API response, your own parser or OCR? Hand it straight to RedHop. This is also the escape hatch for formats from_file doesn’t parse, or scanned PDFs that need OCR first.

import redhop
doc = redhop.Document.from_text(my_text)
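
For the "DB field" case, a minimal sketch of pulling a text column out of SQLite with the standard library might look like the following. The table and column names (`notes`, `body`) are hypothetical, and the RedHop call is left as a comment because it only repeats the line above.

```python
import sqlite3

def fetch_text(conn: sqlite3.Connection, note_id: int) -> str:
    """Return the text stored for one row of the hypothetical notes table."""
    row = conn.execute("SELECT body FROM notes WHERE id = ?", (note_id,)).fetchone()
    if row is None:
        raise KeyError(f"no note with id {note_id}")
    return row[0]

# In-memory database so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (id, body) VALUES (1, 'Refunds are accepted within 30 days.')")

my_text = fetch_text(conn, 1)
# doc = redhop.Document.from_text(my_text)  # same call as shown above
```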

Already split your content (your own chunker, a DB dump)? Hand RedHop the chunks and it skips straight to indexing:

doc = redhop.Document.from_chunks(["clause one …", "clause two …"])

Point it at a path and RedHop reads, parses, chunks, and indexes it: PDF, DOCX, PPTX, XLSX, or any text/code file. It records the file path as each chunk’s source, plus the chunk’s structural location (see citations), so a citation points at contract.pdf, p.3, or notes.md → Setup, or main.py:42, not just the filename.

doc = redhop.Document.from_file("contract.pdf")
ctx = doc.context("how long is the refund window?")

Indexing is type-aware. Each format carries the structural location it has:

Format                 How it’s split                                  Citation
Markdown               by heading                                      heading + line
Code (.py .ts .rs …)   by definition (function/class), kept verbatim   symbol + line
Text / data            by blank-line block                             line
DOCX                   paragraphs (heading-aware) + tables             heading
PPTX                   one section per slide                           page (slide #)
XLSX / ODS             one section per sheet, rows pipe-joined         heading (sheet)
PDF                    one section per page                            page
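
To make the "by heading, heading + line" row concrete, here is a rough sketch of heading-based splitting in plain Python. It is an illustration of the idea only, not RedHop's actual splitter; the function name and the returned dict keys are assumptions, and it handles only ATX-style (`#`) headings.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
FENCE = "`" * 3  # code-fence marker; lines inside fences are never treated as headings

def split_markdown(text: str) -> list[dict]:
    """Split Markdown into one section per heading, with heading and 1-based start line."""
    sections = []
    current = {"heading": None, "line": 1, "lines": []}
    in_fence = False
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            # A new heading closes the section collected so far.
            sections.append(current)
            current = {"heading": match.group(2), "line": lineno, "lines": []}
        current["lines"].append(line)
    sections.append(current)
    result = []
    for s in sections:
        body = "\n".join(s["lines"]).strip()
        if body:  # drop sections that contain nothing but blank lines
            result.append({"heading": s["heading"], "line": s["line"], "text": body})
    return result
```

Each returned section carries the two pieces a Markdown citation needs: the heading it sits under and the line where it starts.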

Code is chunked verbatim (formatting preserved) and labeled with its nearest definition (auth.py → def login); prose is sentence-packed. Code is also retrieved lexically under hybrid — see Retrieval options.
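
As an illustration of chunking "by definition, kept verbatim", the sketch below uses Python's standard `ast` module to cut a Python source string into one chunk per top-level function or class. This is an assumption about how such a splitter could work, not a description of RedHop's internals; the `symbol` key is a hypothetical name, and other languages would need their own parsers.

```python
import ast

def split_python(source: str) -> list[dict]:
    """Return one chunk per top-level function/class with verbatim text and start line."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            keyword = "class" if isinstance(node, ast.ClassDef) else "def"
            # Start at the first decorator, if any, so the chunk is a verbatim slice.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append({
                "symbol": f"{keyword} {node.name}",
                "line": start,
                "text": "\n".join(lines[start - 1 : node.end_lineno]),
            })
    return chunks
```

Slicing the original lines, rather than regenerating code from the tree, is what keeps formatting and comments intact.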

The file isn’t always on local disk — it’s in S3, Cloudflare R2, Azure Blob, GCS, behind a URL, or in a DB column. RedHop doesn’t bundle cloud SDKs (credentials and auth aren’t its job) — fetch the bytes with the client you already have, and hand them over. The name (e.g. "contract.pdf") picks the parser and becomes the citation source, so pass something meaningful like the object key.

import boto3, redhop
obj = boto3.client("s3").get_object(Bucket="my-bucket", Key="contract.pdf")
doc = redhop.Document.from_bytes(obj["Body"].read(), source="s3://my-bucket/contract.pdf")
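
When the bytes come from a presigned or query-laden URL, the raw URL may not end in a recognizable extension. The helper below is a hypothetical convenience (not part of RedHop) that derives a clean file name from a URL or object URI, on the assumption that the parser is chosen from the name's extension, which the text above implies but does not spell out.

```python
import posixpath
from urllib.parse import unquote, urlsplit

def source_name(url: str) -> str:
    """Return the file-name part of a URL or object URI, without query or fragment."""
    path = urlsplit(url).path
    name = posixpath.basename(unquote(path))
    if not name:
        raise ValueError(f"no file name in {url!r}")
    return name
```

Whether to pass this short name or the full object URI as the source is a judgment call: the short name keeps the extension unambiguous, while the full URI makes citations easier to trace back to the bucket.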

Every selected chunk remembers where it came from, as source plus whichever of page / heading / line the format provides (the rest are absent):

for c in ctx.citations:
    print(c["source"], c["page"], c["heading"], c["line"])
# contract.pdf 3 None None → "contract.pdf, p.3"
# notes.md None Setup 12 → "notes.md → Setup"

This is what lets you show the model’s evidence trail — “answer grounded in contract.docx → Refund Policy” — without a separate store. from_text/from_chunks give you source and the chunk text; the structural fields are filled in per format.
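
The citation fields shown above can be turned into the display strings used on this page with a small formatter. This is a sketch that assumes each citation behaves like a dict with `source`, `page`, `heading`, and `line` keys; the precedence (page, then heading, then line) is a choice made here to reproduce the examples, not something RedHop prescribes.

```python
def format_citation(c: dict) -> str:
    """Render a citation dict as a short human-readable reference."""
    source = c.get("source", "unknown")
    if c.get("page") is not None:
        return f"{source}, p.{c['page']}"
    if c.get("heading"):
        return f"{source} → {c['heading']}"
    if c.get("line") is not None:
        return f"{source}:{c['line']}"
    return source
```

Using `.get` keeps the formatter safe whether missing fields are reported as `None` or left out entirely.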

The on-ramp for file apps and coding agents: index a whole folder and ask, no vector DB to operate. RedHop walks the directory, reads every file it can, and builds one combined index — each chunk keeps its own file path as source, so citations point at the right file across the corpus.

doc = redhop.Document.from_folder("./docs")
ctx = doc.context("what's our deprecation policy?")

It reads everything from_file can. Files it can’t parse are skipped; hidden entries and build/cache dirs (node_modules, target, __pycache__, venv, dist, build) are always ignored. It respects your .gitignore (even outside a git checkout). Add your own excludes, or turn gitignore off:

doc = redhop.Document.from_folder(
    "./repo",
    retrieval="hybrid",  # BM25 → dense; scales, no vector DB
    ignore=["*.lock", "tests/**", "*.min.js"],  # extra excludes
    # gitignore=False,  # to include .gitignore'd files
    recursive=True,
)
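
To preview which files a set of excludes would leave in play, a rough stand-in for the walk described above can be written with the standard library. This sketch is not RedHop's walker: it skips hidden entries and the listed build/cache directories and applies `fnmatch` globs, but it does not read .gitignore, and `fnmatch` patterns are looser than gitignore patterns (`*` also matches `/`). The function name and signature are hypothetical.

```python
import fnmatch
import os

ALWAYS_SKIPPED = {"node_modules", "target", "__pycache__", "venv", "dist", "build"}

def list_files(root: str, ignore: tuple[str, ...] = (), recursive: bool = True) -> list[str]:
    """Return sorted relative paths under root, minus hidden, build, and ignored entries."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories.
        dirnames[:] = [
            d for d in dirnames
            if recursive and not d.startswith(".") and d not in ALWAYS_SKIPPED
        ]
        for name in filenames:
            if name.startswith("."):
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), root).replace(os.sep, "/")
            # Match each pattern against both the relative path and the bare file name.
            if any(fnmatch.fnmatch(rel, pat) or fnmatch.fnmatch(name, pat) for pat in ignore):
                continue
            found.append(rel)
    return sorted(found)
```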

By default the index is in-memory (rebuilt each run). Turn on persistence to save it to disk and reload incrementally: on the next run, files whose modified-time and size are unchanged are reused from the cache — no re-parsing, no re-embedding — only new/changed files are processed and removed files are dropped. This is what makes a folder of thousands of files practical; the win is biggest on the semantic/hybrid tiers, where the saved index carries the embeddings.

# First run embeds everything and saves to ./docs/.redhop; later runs reuse it.
doc = redhop.Document.from_folder("./docs", persist=True, retrieval="hybrid")
# Put the index elsewhere:
doc = redhop.Document.from_folder("./docs", persist=True, index_dir="/var/cache/redhop")
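
The incremental reload described above boils down to comparing a per-file (modified-time, size) snapshot against the one saved on the previous run. The sketch below shows that comparison in isolation; it is an illustration under that assumption, not RedHop's on-disk format, and the function names are hypothetical.

```python
import os

def snapshot(root: str) -> dict[str, tuple[int, int]]:
    """Map each relative file path under root to (modified-time in ns, size in bytes)."""
    snap = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            snap[os.path.relpath(full, root)] = (st.st_mtime_ns, st.st_size)
    return snap

def diff(old: dict, new: dict) -> dict[str, list[str]]:
    """Classify paths as reused, changed (new or modified), or removed."""
    return {
        "reused": sorted(p for p in new if old.get(p) == new[p]),
        "changed": sorted(p for p in new if old.get(p) != new[p]),
        "removed": sorted(p for p in old if p not in new),
    }
```

Only the paths under "changed" would need parsing and embedding again; "reused" entries come straight from the saved index, and "removed" entries are dropped from it.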

The cache is keyed by a fingerprint of your indexing settings (chunk size, retrieval tier, model), so changing any of them rebuilds rather than serving a stale index.
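
One common way to build such a fingerprint is to hash a canonical serialization of the settings. The sketch below illustrates that idea with `hashlib` and `json`; the setting names and values are hypothetical, and RedHop's actual fingerprint may be computed differently.

```python
import hashlib
import json

def settings_fingerprint(settings: dict) -> str:
    """Hash indexing settings into a stable hex key; key order does not matter."""
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical settings: any change to a value yields a different key.
base = {"chunk_size": 512, "retrieval": "hybrid", "model": "example-embedding-model"}
key = settings_fingerprint(base)
```

Because the key changes whenever a setting does, a lookup under the new key simply misses, which forces a rebuild instead of returning an index built with the old settings.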

→ Once content is loaded, pick how it’s retrieved: Retrieval options.