
Node.js library for RAG: the landscape, what's mature, and how to choose

If you’re building RAG in Node.js or TypeScript and want to skip the Python sidecar, this guide is for you. It’s a tour of the Node.js RAG ecosystem as of 2026 — what’s mature, what’s half-built, what doesn’t exist yet, and which library to pick for which job. We’ll cover the end-to-end RAG libraries, the embedding / runtime layer (including the edge-runtime constraints that bite if you’re not careful), and the vector databases that work natively from Node.

We’ll be honest about the tradeoffs. Node.js isn’t strictly better than Python for RAG — it’s better for specific reasons, in specific deployments. By the end you should know whether you’re in that “specific” or not.

Why reach for Node.js at all? There are four real reasons, plus some honest pushback:

1. You already ship Node. This is the most common case. You have a Next.js app, an Express API, a Fastify service, a Cloudflare Worker, or a Deno Deploy script, and you need to add document QA or semantic search. Spawning a Python subprocess or calling a Python sidecar is a deployment headache; an npm package fits the existing build.

2. Edge runtime deployment. Vercel Edge, Cloudflare Workers, Netlify Edge — these only run JavaScript / WASM. If your service needs to live at the edge for latency, you can’t ship Python at all. Whether Node.js or pure-WASM-Rust is the right pick depends on which constraints bite (more below), but Python is out either way.

3. Streaming-first DX. The Node.js LLM ecosystem (especially Vercel AI SDK) has the best streaming-token UX of any language. If your app is “user types question → tokens stream into the UI,” Node is in its element. Python’s RAG libraries treat streaming as an afterthought.

4. Same language as the frontend. TypeScript everywhere — no language switch between your retrieval logic and the React component that renders the answer. For full-stack teams, this is a real win.

The honest pushback: Python’s RAG ecosystem is significantly deeper. LangChain.js usually trails LangChain Python by 1–2 quarters on feature parity; LlamaIndex.TS is further behind. Evaluation frameworks, document loader breadth, and exotic LLM provider integrations all favor Python. Don’t pick Node for RAG just because “JavaScript is faster than Python.” Pick it because the deployment / integration / streaming reasons are real for your project.

The ecosystem splits into three layers. Most full-stack apps only touch the top.

Layer 1 — End-to-end RAG libraries / frameworks


These are the “pick one and go” libraries. You hand them documents and a query; they handle chunking, retrieval, and context assembly (and sometimes the LLM call too).
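To make the chunking step concrete, here is a minimal sketch of fixed-size character chunking with overlap, the simplest version of what these libraries do for you. The `chunkText` name and the 1000 / 200 defaults (which mirror the LangChain.js example later in this guide) are illustrative assumptions; real libraries usually split on sentence, paragraph, or heading boundaries rather than raw character offsets.

```typescript
// A chunk of the source text, kept in document order.
interface Chunk {
  index: number; // position in document order
  start: number; // character offset where the chunk begins
  text: string;
}

// Fixed-size character chunker with overlap. Each window advances by
// (chunkSize - overlap) characters, so consecutive chunks share `overlap`
// characters and a sentence cut at one boundary still appears whole nearby.
function chunkText(text: string, chunkSize = 1000, overlap = 200): Chunk[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be in [0, chunkSize)");
  }
  const chunks: Chunk[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push({ index: chunks.length, start, text: text.slice(start, start + chunkSize) });
    // Stop once this window already reaches the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

For example, `chunkText("abcdefghij", 4, 2)` yields the windows `abcd`, `cdef`, `efgh`, `ghij`.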

| Library | Shape | Best for |
|---|---|---|
| RedHop | 3-call library (load → ask → read) | Document QA, smallest surface, in-process |
| Vercel AI SDK | LLM streaming + RAG primitives | Next.js apps, streaming UIs, edge-first |
| LangChain.js | Full LangChain port: chains, agents, retrievers | LangChain shape in TypeScript; agents and tool-use |
| LlamaIndex.TS | LlamaIndex port: indices, query engines | LlamaIndex shape in TypeScript |
| mastra | TS-first agents + RAG + evals | Building agent apps in TypeScript |
| Genkit (Google) | Flow + prompt management with RAG | Apps targeting Vertex AI / Google ecosystem |

Each is taking a different bet about what RAG in Node should look like.

RedHop has the smallest surface. Three calls: Document.fromFile(path), doc.context(query), then read the assembled prompt + Decision Report off the returned context. BM25 by default (no model download, no ONNX runtime in the lean build), optional dense retrieval via a small embedding model. Built as a native addon (napi-rs) over a Rust core — so the same library is also on PyPI and crates.io, with the same API shape. Best when you want one bounded context step you can drop into a larger app. Native-addon caveat: not suitable for Cloudflare Workers / Vercel Edge (those runtimes don’t allow .node binaries).

Vercel AI SDK is the dominant choice for Next.js apps. It’s not specifically a RAG library — it’s an LLM streaming SDK with great DX (streamText, generateObject, React hooks like useChat). For RAG, you bring your own retriever and pass the results into the prompt. Best when streaming UX matters and you’re already in the Vercel / Next ecosystem. Edge-runtime friendly.

LangChain.js is the TypeScript port of LangChain. It mirrors the Python API: chains, agents, retrievers, vector stores, document loaders. It trails the Python version on feature parity (usually by a quarter or two) but covers the common cases. Best when you’re building agent-shaped apps or want LangChain’s many integrations.

LlamaIndex.TS is the TypeScript port of LlamaIndex. Same Index → QueryEngine shape, smaller surface than Python LlamaIndex. Maintained but lagging more than LangChain.js. Best when you want LlamaIndex’s shape specifically.

mastra is a newer TS-first framework with agents, RAG, evaluations, and workflow management, and it’s production-shaped. Best when you’re building a serious TypeScript agent app and want the framework to handle the eval / observability story too.

Genkit is Google’s framework. Flow-based, strong with Vertex AI integration, has RAG primitives. Best when you’re targeting Google Cloud / Firebase.
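With the Vercel AI SDK, the “pass the results into the prompt” step is yours to write. A minimal sketch of that step, assuming your retriever returns scored chunks with a source label (the `RetrievedChunk` shape and `buildPrompt` are hypothetical names) and approximating tokens as characters / 4 instead of running a real tokenizer:

```typescript
interface RetrievedChunk {
  source: string; // citation label, e.g. a file name or URL
  text: string;
  score: number; // higher means more relevant
}

// Rough heuristic: about 4 characters per token for English text.
// Swap in a real tokenizer when the budget has to be exact.
function estimateTokens(s: string): number {
  return Math.ceil(s.length / 4);
}

// Greedily pack the highest-scoring chunks until the token budget is spent.
// Chunks that don't fit are skipped (not a hard stop), so a smaller,
// lower-ranked chunk can still make it in. Numbering follows included chunks.
function buildPrompt(question: string, hits: RetrievedChunk[], maxContextTokens = 2000): string {
  const ordered = [...hits].sort((a, b) => b.score - a.score);
  const blocks: string[] = [];
  let used = 0;
  for (const hit of ordered) {
    const block = `[${blocks.length + 1}] (${hit.source})\n${hit.text}`;
    const cost = estimateTokens(block);
    if (used + cost > maxContextTokens) continue;
    blocks.push(block);
    used += cost;
  }
  return [
    "Answer using only the numbered context below, citing sources like [1].",
    "",
    ...blocks,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

The returned string is what you would hand to `streamText` as its `prompt`; the numbered source labels give the model something concrete to cite.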

Layer 2 — Embeddings, model runtime, tokenizers


If you’re building your own RAG, or your layer-1 library needs a custom embedder, these are the components.

| Package | What it is |
|---|---|
| Transformers.js (@huggingface/transformers; earlier releases were published as @xenova/transformers) | Hugging Face’s Transformers ported to JS on top of ONNX Runtime. Runs models in Node and the browser |
| onnxruntime-node | Native ONNX Runtime for Node. Fastest local inference; native addon |
| @huggingface/inference | Hosted HF inference client (no local model) |
| fastembed | Fast ONNX embeddings (BGE / MiniLM / etc.) |
| tiktoken | OpenAI tokenizer for JS / TS |
| @dqbd/tiktoken | Alternative tiktoken with WASM + better DX |

The big decision in this layer is local vs hosted vs edge.

  • Local Node embeddings (native ONNX): onnxruntime-node or fastembed runs the model in the same Node process. Fast, no network. Neither runs on edge runtimes, because both rely on native addons.
  • Browser / edge-safe embeddings: Transformers.js (backed by onnxruntime-web’s WASM build outside Node) runs in browsers and edge workers as well as in Node. Slower than native ONNX but portable.
  • Hosted embeddings: @huggingface/inference, OpenAI’s text-embedding-3-small, Cohere, Voyage — call the API, pay per token, get back vectors. Easy, costs money, latency floor of one network hop.

If you’re shipping a Node service on a VM or container, native ONNX is the right pick (faster, free after the model download). If you’re shipping to Cloudflare Workers / Vercel Edge, you need either Transformers.js or hosted embeddings. Native addons are out at the edge.
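If one codebase deploys to both targets, the choice can be made at startup. A hedged sketch of runtime detection relying on two documented signals: Vercel’s Edge Runtime defines a global `EdgeRuntime` string, and Cloudflare Workers report `navigator.userAgent` as `"Cloudflare-Workers"`. Treating `process.versions.node` as the “full Node, native addons allowed” signal is a heuristic assumption (Bun, for instance, also sets it), and the backend labels are hypothetical.

```typescript
type EmbeddingBackend = "native-onnx" | "portable-wasm-or-hosted";

// Decide whether native addons (onnxruntime-node, fastembed) are an option.
// `g` defaults to globalThis; passing a plain object makes the logic testable.
function detectBackend(
  g: Record<string, unknown> = globalThis as unknown as Record<string, unknown>,
): EmbeddingBackend {
  // Vercel Edge Runtime: global `EdgeRuntime` is a string.
  if (typeof g.EdgeRuntime === "string") return "portable-wasm-or-hosted";
  // Cloudflare Workers: fixed user agent string.
  const nav = g.navigator as { userAgent?: string } | undefined;
  if (nav?.userAgent === "Cloudflare-Workers") return "portable-wasm-or-hosted";
  // Full Node.js (heuristic): process.versions.node is set.
  const proc = g.process as { versions?: { node?: string } } | undefined;
  if (typeof proc?.versions?.node === "string") return "native-onnx";
  // Browsers and unknown runtimes: assume JS + WASM only.
  return "portable-wasm-or-hosted";
}
```

From there you would dynamically import onnxruntime-node or fastembed on the native path, and Transformers.js or a hosted API otherwise. Whether a dynamic import actually keeps the native addon out of an edge bundle depends on your bundler configuration.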

Layer 3 — Vector databases

If you need persistent vector storage at scale, you’ll reach for one of these. (Big caveat below: for document QA, you often don’t need a vector DB at any tier.)

| DB | Shape | Best for |
|---|---|---|
| LanceDB (@lancedb/lancedb) | Embedded columnar vector DB, native bindings | Single-node, embedded; mid-scale |
| Chroma (chromadb) | Open-source vector DB | Prototyping; client-server shape |
| Qdrant (@qdrant/js-client-rest) | Standalone vector DB | Production at scale; self-host or cloud |
| Pinecone (@pinecone-database/pinecone) | Hosted vector DB | Pure hosted, scales without ops |
| Weaviate (weaviate-client) | Vector DB with hybrid / generative features | Hybrid search, multi-modal |
| hnswlib-node | Embedded HNSW index, native | Lightweight ANN over in-memory vectors |
| vectra | Pure-JS local file-based vector DB | Simple prototyping, no native deps |

vectra is the most edge-friendly option because it’s pure JS, but it’s file-based, so it won’t work where there’s no filesystem. LanceDB, which ships native bindings, is the most mature embedded option for serious workloads. Qdrant, Pinecone, and Weaviate are the production-grade server / hosted options.

Worth saying explicitly: for document QA, you often don’t need a vector DB at any tier. BM25 with an in-memory inverted index handles most keyword-dense corpora (contracts, API references, runbooks, handbooks) just fine, and exact cosine over an in-memory chunk array handles moderate semantic queries without ANN. The “you need a vector DB” assumption comes from tutorials defaulting to one; it doesn’t follow from the math. (RedHop’s whole pitch is built on this observation — see the Comparison page for measured numbers across real corpora.)
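To show how little machinery those two approaches need, here is a minimal sketch of an in-memory BM25 index (Okapi BM25 with the common k1 = 1.2, b = 0.75 defaults and a Lucene-style IDF) and an exact cosine top-k over an array of embedding vectors. It is not RedHop’s implementation; the tokenizer (lowercase, split on anything that isn’t a letter or digit, no stemming) and all names are illustrative assumptions.

```typescript
// Lowercase and split on non-letter/non-digit runs; no stemming or stop words.
function tokenize(s: string): string[] {
  return s.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

class Bm25Index {
  // Inverted index: term -> (chunk id -> term frequency in that chunk).
  private readonly postings = new Map<string, Map<number, number>>();
  private readonly docLens: number[] = [];
  private readonly avgLen: number;
  private readonly k1: number;
  private readonly b: number;

  constructor(docs: string[], k1 = 1.2, b = 0.75) {
    this.k1 = k1;
    this.b = b;
    docs.forEach((doc, id) => {
      const terms = tokenize(doc);
      this.docLens.push(terms.length);
      for (const t of terms) {
        const row = this.postings.get(t) ?? new Map<number, number>();
        row.set(id, (row.get(id) ?? 0) + 1);
        this.postings.set(t, row);
      }
    });
    const total = this.docLens.reduce((sum, n) => sum + n, 0);
    this.avgLen = this.docLens.length ? total / this.docLens.length : 0;
  }

  // Score only chunks that share at least one query term, then take the top k.
  search(query: string, k = 5): { id: number; score: number }[] {
    const n = this.docLens.length;
    const scores = new Map<number, number>();
    for (const t of new Set(tokenize(query))) {
      const row = this.postings.get(t);
      if (!row) continue;
      const idf = Math.log(1 + (n - row.size + 0.5) / (row.size + 0.5));
      for (const [id, tf] of row) {
        const lenNorm = 1 - this.b + (this.b * this.docLens[id]) / this.avgLen;
        const termScore = (idf * tf * (this.k1 + 1)) / (tf + this.k1 * lenNorm);
        scores.set(id, (scores.get(id) ?? 0) + termScore);
      }
    }
    return [...scores]
      .map(([id, score]) => ({ id, score }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k);
  }
}

// Exact (brute-force) cosine similarity; zero vectors score 0.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Linear scan over every chunk vector: no ANN index involved.
function topKCosine(query: number[], vectors: number[][], k = 5): { id: number; score: number }[] {
  return vectors
    .map((v, id) => ({ id, score: cosine(query, v) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Exact search costs N × d multiply-adds per query, e.g. 10,000 chunks × 384 dimensions = 3,840,000, which is typically small enough at moderate corpus sizes that an ANN index buys little.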

A working example: RAG in 20 lines of TypeScript


Here’s a complete, runnable RAG pipeline with RedHop, from npm init to asking a question about a contract.

package.json:

```json
{
  "type": "module",
  "dependencies": {
    "redhop": "^0.2.0",
    "openai": "^4.0.0"
  }
}
```

src/rag.ts:

```ts
import { Document } from "redhop";
import OpenAI from "openai";

// 1. Load: a single file (PDF / DOCX / PPTX / XLSX / Markdown / text /
//    code). Parsing, chunking, and indexing all happen here.
const doc = Document.fromFile("contract.pdf");

// 2. Ask: retrieval and token budgeting happen in-process.
const question = "What is the governing law?";
const ctx = doc.context(question);

// 3. Read: the assembled prompt + the Decision Report + citations.
console.log("Prompt for the LLM:\n", ctx.text);
for (const c of ctx.citations) {
  console.log(`  cited: ${c.source} p${c.page ?? "?"} ${c.heading ?? "—"}`);
}
console.log(`\nDecision: ${ctx.report.autoDecision}`);
console.log(
  `Tokens: ${ctx.report.totalTokens}, ` +
    `retained evidence: ${(ctx.report.retainedEvidenceRatio * 100).toFixed(0)}%`,
);

// Pass it to any LLM provider. Bring your own client (reads OPENAI_API_KEY).
const openai = new OpenAI();
const response = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: `${ctx.text}\n\nQuestion: ${question}` }],
});
console.log(response.choices[0].message.content);
```

Run it with node --experimental-strip-types src/rag.ts (Node 22.6+; recent Node releases strip types without the flag), or with tsx, or compile with tsc first. The native addon handles the heavy lifting; your TypeScript stays light.

To turn on dense or hybrid retrieval, pass retrieval and model options to the loader. The first run downloads the embedding model (~80MB for bge-small); subsequent runs hit the cache.

```ts
const doc = Document.fromFile("contract.pdf", {
  retrieval: "hybrid",
  model: "bge-small",
});
const ctx = doc.context(
  "What law applies in the UK?",
  undefined, // budget (default)
  1, // neighbors -- structural expansion
  true, // includeHeading
);
```

For structured docs with parallel clauses (regional overrides etc.), pair hybrid with neighbors=1, includeHeading=true. There’s a decision guide covering when each configuration is the right pick.
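RedHop’s internals aren’t shown in this guide, so as a hedged illustration of what structural expansion with neighbors=1 and includeHeading=true generally means, here is a sketch that assumes chunks are stored in document order with their nearest heading attached (the `SectionChunk` shape and `expandHits` are hypothetical). Each hit pulls in its adjacent chunks, overlapping ranges merge, and a heading is emitted whenever it changes, so parallel clauses such as a general rule and its UK override stay distinguishable.

```typescript
interface SectionChunk {
  heading: string; // nearest enclosing heading, e.g. "12.2 UK override"
  text: string;
}

// Expand each hit to [hit - neighbors, hit + neighbors], merge overlaps,
// and emit the kept chunks in document order.
function expandHits(
  chunks: SectionChunk[],
  hits: number[],
  neighbors = 1,
  includeHeading = true,
): string[] {
  const keep = new Set<number>();
  for (const h of hits) {
    const lo = Math.max(0, h - neighbors);
    const hi = Math.min(chunks.length - 1, h + neighbors);
    for (let i = lo; i <= hi; i++) keep.add(i);
  }
  const out: string[] = [];
  let lastHeading: string | undefined;
  let prev = -2;
  for (const i of [...keep].sort((a, b) => a - b)) {
    // A gap in document order starts a new run, so re-emit the heading there.
    if (i !== prev + 1) lastHeading = undefined;
    const c = chunks[i];
    if (includeHeading && c.heading !== lastHeading) out.push(`## ${c.heading}`);
    lastHeading = c.heading;
    out.push(c.text);
    prev = i;
  }
  return out;
}
```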

What about edge runtimes (Vercel Edge / Cloudflare Workers)?


RedHop is a native addon (.node binary) and won’t run on edge runtimes. That’s a hard constraint of the napi-rs approach. If your service is edge-first:

  • Use Vercel AI SDK for the LLM + streaming.
  • Bring an edge-safe retriever: vectra (pure JS, but file-based, so only where your runtime actually exposes a filesystem; R2, KV, and Vercel Blob are object / key-value stores, not filesystems), or a managed vector DB (Pinecone, Qdrant Cloud) over HTTP.
  • For embeddings: Transformers.js (works in Workers / Edge) or hosted (OpenAI, Cohere).

A WASM build of RedHop is on the roadmap — track the WASM feasibility note for status — but it’s not shipped yet.

A working example with Vercel AI SDK (for comparison)


If you’re building a Next.js streaming RAG chatbot, Vercel AI SDK is the more natural fit. Rough shape (you bring your own retriever):

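```ts
// Next.js App Router route handler, e.g. app/api/chat/route.ts
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";
import { retrieveContext } from "./retriever"; // placeholder for your own retriever module
export async function POST(req: Request) {
  const { question } = await req.json(); // expects a JSON body like { question: string }
  const context = await retrieveContext(question);
  const result = streamText({
    model: openai("gpt-4o-mini"),
    prompt: `${context}\n\nQuestion: ${question}`,
  });
  return result.toDataStreamResponse(); // AI SDK 4.x response helper
}
```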

This is not a RAG library — it’s the LLM + streaming layer. You’re responsible for retrieveContext(). The natural pairing:

  • For the retriever: RedHop (if you’re on a non-edge runtime), LangChain.js (if you want their integrations), or your own BM25/vector search.
  • For the LLM call + streaming: Vercel AI SDK.

This split is increasingly the dominant pattern in 2026 Next.js apps.

A working example with LangChain.js (for comparison)


If you want the full LangChain shape in TypeScript:

```ts
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { formatDocumentsAsString } from "langchain/util/document";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { RunnablePassthrough, RunnableSequence } from "@langchain/core/runnables";
import { StringOutputParser } from "@langchain/core/output_parsers";

// Load and split the PDF (PDFLoader needs the pdf-parse package installed).
const docs = await new PDFLoader("contract.pdf").load();
const chunks = await new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
}).splitDocuments(docs);

// Embed the chunks into an in-memory vector store; retrieve the top 4.
const store = await MemoryVectorStore.fromDocuments(chunks, new OpenAIEmbeddings());
const retriever = store.asRetriever({ k: 4 });

const prompt = ChatPromptTemplate.fromTemplate(
  "Answer using only the context.\n\n{context}\n\nQuestion: {input}",
);

// The retriever returns Document objects; join them into one string for {context}.
const chain = RunnableSequence.from([
  { context: retriever.pipe(formatDocumentsAsString), input: new RunnablePassthrough() },
  prompt,
  new ChatOpenAI({ model: "gpt-4o-mini" }),
  new StringOutputParser(),
]);

const answer = await chain.invoke("What is the governing law?");
console.log(answer);
```

Compared to RedHop: significantly more pieces to wire. The benefit is LangChain’s ecosystem — agents, tool-use, hundreds of LLM providers, many vector store integrations. Pick LangChain.js if you need that ecosystem; pick RedHop if you just want document QA with a Decision Report.

Where the Node.js RAG ecosystem is still weaker than Python


We promised honesty, so:

  • LangChain.js / LlamaIndex.TS feature parity. Both lag the Python versions. New retrievers / loaders / integrations land in Python first, ports follow weeks-to-months later. For cutting-edge work, Python is still the leading edge.
  • Evaluation harnesses. Python has TruLens, RAGAS, deepeval, LlamaIndex evaluators. Node has mastra’s eval module and DIY scripts. For rigorous RAG eval, Python wins.
  • Document loader breadth. LlamaHub (Python) has hundreds of loaders for Notion, ServiceNow, Confluence, S3, Salesforce, … The Node ecosystem covers the basics; exotic sources you’ll write yourself.
  • LLM provider integration breadth. LangChain.js covers ~40 providers; LangChain Python covers 100+. Most popular ones are in both; obscure ones may not be.
  • Notebook experimentation. Iterating in Jupyter is faster than iterating in tsc. For prompt engineering and config tuning, many teams prototype in Python even if production is Node.
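On the DIY side, the core retrieval metrics are small enough to write yourself. A sketch, assuming you have hand-labeled queries with the chunk ids a human judged relevant (the `EvalCase` shape and the `retrieve` callback are hypothetical), computing hit rate@k and mean reciprocal rank (MRR@k). It covers retrieval quality only, not the answer-quality metrics such as faithfulness that RAGAS-style tools score with an LLM judge.

```typescript
// One labeled query: the question plus the chunk ids judged relevant.
interface EvalCase {
  query: string;
  relevant: Set<number>;
}

// Your retriever under test: returns chunk ids, best first.
type Retrieve = (query: string, k: number) => number[];

function evaluate(cases: EvalCase[], retrieve: Retrieve, k = 5) {
  let hits = 0;
  let reciprocalRankSum = 0;
  for (const c of cases) {
    const ranked = retrieve(c.query, k).slice(0, k); // enforce the cutoff
    const firstRelevant = ranked.findIndex((id) => c.relevant.has(id));
    if (firstRelevant >= 0) {
      hits += 1; // at least one relevant chunk in the top k
      reciprocalRankSum += 1 / (firstRelevant + 1); // rank 1 -> 1, rank 2 -> 0.5, ...
    }
  }
  const n = cases.length || 1; // avoid dividing by zero on an empty set
  return { hitRateAtK: hits / n, mrrAtK: reciprocalRankSum / n };
}
```

Running this over a few dozen labeled questions before and after a config change (BM25 vs hybrid, different chunk sizes) is often enough to tell whether retrieval got better.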

If any of these gaps matters to your project, a common pattern is Node for the service path, Python for evaluation / experimentation. Or just pick Python if those gaps dominate the work.

When to pick Node.js for RAG (and when not to)


A quick decision table:

| Your situation | Pick |
|---|---|
| Existing Next.js / Node app that needs document QA | Node (RedHop / LangChain.js) |
| Streaming UX is the primary product feature | Node (Vercel AI SDK) |
| Edge / Cloudflare Workers / Vercel Edge deployment | Node (edge-safe libs only) |
| Need single-language stack with the frontend | Node |
| Need 100+ LLM provider integrations today | Python (LangChain) |
| Need exotic document loaders (Notion / Salesforce / etc.) | Python (LlamaHub) |
| Doing serious RAG evaluation | Python (RAGAS / TruLens) |
| Need sub-10ms warm-query latency from a static binary | Rust (RedHop / rig / swiftide) |
| Stack is already Python and there’s no specific reason to switch | Python |

Among Node.js RAG libraries specifically:

  • Document QA with a small API surface and observability: RedHop. Three calls, BM25 default, Decision Report, native addon (not for edge). Best for “add document QA to an existing Node service.”
  • Streaming chatbot in Next.js: Vercel AI SDK for the LLM + streaming layer, paired with RedHop or your own retriever for the context.
  • Full LangChain shape in TypeScript: LangChain.js. Best when you need LangChain’s integrations or agent / tool-use.
  • Production agent apps with built-in evals: mastra. TS-first, modern shape.
  • Targeting Vertex AI / Google Cloud: Genkit.
  • LlamaIndex shape specifically: LlamaIndex.TS.

If you’re new to RAG itself (not just the Node side), start with the Intro to RAG guide and the Retrieval & context tips for the parts that aren’t language-specific. If you’re choosing between Node, Python, and Rust for the language, the Rust library for RAG guide covers the Rust side with the same shape as this one.