RedHop vs Ragas

If you’re already running Ragas on a RAG pipeline, the natural question is whether `redhop.evaluate(...)` gives you a materially different number. On the metric they both ship, it mostly doesn’t.

On claim-decomposed faithfulness (`evaluate(..., decompose_faithfulness=True)` versus Ragas’s `Faithfulness`), the two are substantively equivalent: Pearson r = +0.664, MAE = 0.151, and exact agreement on 61% of cases (122/200), measured on n=200 HotpotQA with `openai/gpt-4o-mini`. A third LLM’s independent read does not show either library to be unambiguously “more correct”.

The differentiator is philosophy, not accuracy.

Different surfaces, same category

| | RedHop `evaluate` | Ragas |
|---|---|---|
| Scope | one API for closed-set answer quality + Decision Report self-eval | dedicated eval framework |
| Metric families | faithfulness, relevancy, correctness, critique, summarize | the above + similarity, context precision/recall, AspectCritic, more |
| LLM dependence | fully optional: `_lexical` (no LLM) + `_judged` (opt-in) | LLM-required for most metrics |
| Embedder dependence | none (no relevancy-cosine, no similarity) | required for relevancy / similarity / answer-correctness embedding term |
| Integration | `pip install redhop`, single package | langchain-wrapped LLM/embedder, multiple deps |
| Same primitives as runtime? | yes, `evaluate` uses the runtime’s own Decision Report machinery | no |
| Output shape | one `EvalReport` dataclass + `summarize(reports)` aggregator | per-metric `Result` objects + dataset-level aggregation |

RedHop is not an eval framework. It ships a narrow answer-quality surface that mirrors what the runtime already measures internally. The only question this comparison asks is: does it produce numbers similar to those of a dedicated eval library on the metric they both ship?

The numbers: n=200 HotpotQA, `gpt-4o-mini`

The bench generates an LLM answer for each HotpotQA question given the full distractor context (supporting + distractor paragraphs), then scores faithfulness in both libraries with the same judge model. Apples-to-apples on the score, the workload, and the judge.

Correlation with Ragas

| | Pearson r | MAE | exact agreement |
|---|---|---|---|
| RedHop decomposed ↔ Ragas | +0.664 | 0.151 | 61% (122/200) |
| RedHop single-prompt ↔ Ragas | +0.285 | 0.239 | n/a |

Read: decomposed-faithfulness (the path you should default to) matches Ragas exactly on the majority of cases (61%), with a mean absolute error of ~0.15 across all 200 cases. Because exact matches contribute zero error, the average gap on the 78 cases that do diverge is larger, roughly 0.39 (0.151 × 200 / 78). Single-prompt diverges more, mostly on refusal answers (“I don’t know” scores 1.0 single-prompt because there are no claims to contradict, the vacuous-truth failure mode). Use `decompose_faithfulness=True` for accuracy.
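For readers who want to check the three agreement numbers on their own paired scores, here is a minimal sketch using only the standard library. The function name `agreement_stats`, the tolerance `tol`, and the toy scores are illustrative assumptions, not part of RedHop, Ragas, or the bench script. It also assumes any `None` scores (for example refusals) have been dropped from both lists beforehand.

```python
import math

def agreement_stats(a, b, tol=1e-9):
    """Return (pearson_r, mae, exact_share) for two equal-length score lists."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Sums of centered products; the 1/n factors cancel in the ratio below.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("Pearson r is undefined when one list is constant")
    r = cov / math.sqrt(var_a * var_b)
    mae = sum(abs(x - y) for x, y in zip(a, b)) / n
    # "Exact agreement" here means the two scores are equal within tol.
    exact_share = sum(abs(x - y) <= tol for x, y in zip(a, b)) / n
    return r, mae, exact_share

# Toy scores invented for illustration; these are NOT bench data.
redhop_scores = [1.0, 0.5, 1.0, 0.0]
ragas_scores = [1.0, 1.0, 1.0, 0.0]
r, mae, exact_share = agreement_stats(redhop_scores, ragas_scores)
print(f"r={r:+.3f}  MAE={mae:.3f}  exact={exact_share:.0%}")
```

Note that MAE averages over every case, including the exact matches, so it understates the typical gap on the cases that actually differ.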

Third-judge tie-breaker (Claude haiku)

When the two libraries disagree by ≥ 0.5 (35 cases on n=200), we ask Claude haiku to score the same cases independently via the local claude -p --model haiku CLI:

| | MAE vs Claude haiku (66-case subset: 46 contested + 20 agreement) |
|---|---|
| RedHop decomposed | 0.340 |
| Ragas | 0.262 |

On contested cases, Claude favors: RedHop 12/35, Ragas 23/35.
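A tally like the one above can be sketched as follows. This is an assumption about the procedure, not the bench's actual code: it treats a case as contested when the two scores differ by at least the threshold, and counts the reference judge as "favoring" whichever library's score is closer in absolute distance. The function name and toy scores are hypothetical.

```python
def tally_contested(scores_a, scores_b, scores_ref, threshold=0.5):
    """Among cases where A and B differ by >= threshold, count which is closer to the reference."""
    wins_a = wins_b = ties = 0
    for a, b, ref in zip(scores_a, scores_b, scores_ref):
        if abs(a - b) < threshold:
            continue  # not contested: the two libraries roughly agree
        err_a, err_b = abs(a - ref), abs(b - ref)
        if err_a < err_b:
            wins_a += 1
        elif err_b < err_a:
            wins_b += 1
        else:
            ties += 1
    return {"a": wins_a, "b": wins_b, "ties": ties}

# Toy scores invented for illustration; these are NOT bench data.
library_a = [1.0, 0.0, 1.0, 0.5]
library_b = [0.0, 1.0, 0.9, 1.0]
third_judge = [0.8, 0.9, 1.0, 0.75]
result = tally_contested(library_a, library_b, third_judge)
print(result)
```

A win/loss count of this kind discards how large each gap is, which is one reason to read it alongside the MAE rather than instead of it.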

Read carefully: this looks like Ragas is “more correct”, but re-tracing 5 randomly sampled “RedHop loses to Ragas” cases at 5 runs each shows that 4 of the 5 score 1.0 consistently across the repeated runs. The bench captured a one-shot 0.0 because gpt-4o-mini at temperature 0.0 is not deterministic through OpenRouter (model-replica routing plus floating-point non-associativity, giving ~20–30% per-case variance on borderline judgments).

So the contested-cases MAE-to-Claude is noise-dominated. Aggregate metrics (Pearson r and MAE vs Ragas, averaged over many cases) are robust. Per-case verdicts are not.
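If you do need a verdict on one specific case, one option is to repeat the judged call and look at the median and the spread instead of trusting a single shot. The sketch below assumes a hypothetical zero-argument scorer (`score_once`) that wraps whatever evaluation call you use; `replay` is only a deterministic stand-in for a flaky judge so the example runs without an LLM.

```python
import statistics

def stable_score(score_once, runs=5):
    """Call a possibly nondeterministic scorer several times.

    Returns (median, spread), where spread is max - min over the runs.
    """
    samples = [score_once() for _ in range(runs)]
    return statistics.median(samples), max(samples) - min(samples)

def replay(values):
    """Stand-in scorer that returns pre-recorded values one call at a time."""
    it = iter(values)
    return lambda: next(it)

# One outlier 0.0 among five runs, mimicking a borderline judgment.
median, spread = stable_score(replay([1.0, 0.0, 1.0, 1.0, 1.0]), runs=5)
print(median, spread)
```

A large spread is itself a signal: it marks the case as borderline for the judge rather than as clearly faithful or unfaithful.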

What each library gives you that the other doesn’t

With RedHop you also get:

A single EvalReport dataclass that blends lexical metrics (deterministic, run in CI without an LLM) with judged metrics (opt-in via Judge.from_callable(fn).cached()), instead of running metrics one at a time.
summarize(reports) for test-set aggregation: one function call rolls up per-case reports into a means + N + share-flagged summary, the same shape RedHop’s own runtime uses for its Decision Report.
No embedder dependency. Ragas’s AnswerRelevancy and AnswerSimilarity need an embedder. RedHop’s relevancy_judged uses an LLM-only noncommittal-detection prompt (no embeddings, no extra dep).
Refusal handling. “I don’t know” answers correctly return None for decomposed faithfulness (0 claims extracted) instead of being scored as a vacuous 1.0. This surfaces refusals as a distinct category, not as faithfulness = 1.
critique(answer, aspects, ...) for user-defined dimensions. Ragas has AspectCritic. RedHop has the equivalent in critique with the same EvalReport-shape output as quantitative metrics.
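The refusal-handling point is easiest to see in a small sketch. This is not RedHop's implementation: `decomposed_score` and `mean_ignoring_none` are hypothetical helpers that only illustrate the idea of returning `None` when zero claims are extracted and then keeping those cases out of the mean while still counting them.

```python
from typing import Optional, Sequence, Tuple

def decomposed_score(n_supported: int, n_extracted: int) -> Optional[float]:
    """Share of extracted claims that are supported, or None if there were no claims."""
    if n_extracted == 0:
        return None  # e.g. a refusal: nothing to verify, so no score
    return n_supported / n_extracted

def mean_ignoring_none(
    scores: Sequence[Optional[float]],
) -> Tuple[Optional[float], int, int]:
    """Return (mean over scored cases, number scored, number unscored)."""
    kept = [s for s in scores if s is not None]
    mean = sum(kept) / len(kept) if kept else None
    return mean, len(kept), len(scores) - len(kept)

refusal = decomposed_score(0, 0)   # None, not a vacuous 1.0
partial = decomposed_score(3, 4)   # 0.75
summary = mean_ignoring_none([1.0, refusal, 0.5])
print(refusal, partial, summary)
```

Scoring the refusal as 1.0 instead would raise the mean in this toy example from 0.75 to about 0.83 without the answer containing a single supported claim.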

With Ragas you also get:

More metric families. Ragas ships AnswerSimilarity, ContextPrecision, ContextRecall, Faithfulness with NLI, more AspectCritic variants, and test-set generation. RedHop ships a focused subset.
Broader integration ecosystem. LangChain wrappers, LlamaIndex wrappers, Langfuse / Arize Phoenix integrations. RedHop stays in-process by design.
Dataset loaders. Ragas can load HuggingFace datasets and other dataset formats. RedHop expects you to construct (question, context, answer, gold_answer) tuples directly.

import redhop

# Same judge plumbing for everything — `Judge.from_callable(fn).cached()`
# accepts any sync function returning a float, a string (parsed as a number
# if it looks like one, else treated as raw text), or a dict.
judge = redhop.Judge.from_callable(my_llm).cached()

report = redhop.evaluate(
    user_query, ctx,
    answer=model_answer,
    gold_answer=gold,
    judge=judge,
    decompose_faithfulness=True,    # 2 LLM calls; catches paraphrase-hallucinations
    decompose_correctness=True,     # TP/FP/FN against gold → F1
)
report.faithfulness_judged          # mean of per-claim scores in [0, 1]
report.n_faithfulness_claims_extracted, report.n_faithfulness_claims_supported
report.correctness_judged           # F1
report.n_correctness_tp, report.n_correctness_fp, report.n_correctness_fn

const { Judge, evaluateWithJudge } = require("redhop");

const judge = Judge.fromCallable(async (err, prompt, system) => {
  return await myLlm({ prompt, system });    // your LLM client, your choice
}, "openai-mini").cached();

// Top-level await is not available in a CommonJS module, so run inside an async function.
async function main() {
  const report = await evaluateWithJudge(userQuery, ctx, judge, {
    answer: modelAnswer,
    goldAnswer: gold,
    decomposeFaithfulness: true,
    decomposeCorrectness: true,
  });
  report.faithfulnessJudged
  report.nFaithfulnessClaimsExtracted, report.nFaithfulnessClaimsSupported
  report.correctnessJudged
  report.nCorrectnessTp, report.nCorrectnessFp, report.nCorrectnessFn
}
main().catch(console.error);

How to read this

Where the result is robust: decomposed-faithfulness produces numbers that correlate with Ragas’s faithfulness (Pearson r = +0.664) across n=200 HotpotQA. If you use either library to evaluate a RAG system, the aggregate trends you see should be broadly the same.
Where the result is fragile: any single case’s score has ~0.2–0.3 absolute noise. Use the score as a signal across many cases, not as an oracle on one.
Where neither library shines: the metric is LLM-judged. Both libraries inherit the judge’s calibration. A different judge model produces different absolute numbers. The trend (the two libraries agreeing) is what’s stable.
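A rough back-of-the-envelope calculation shows why the aggregate is steadier than any single case. The sketch below rests on two assumptions that the bench does not verify: that the ~0.2–0.3 per-case noise can be treated as a standard deviation, and that the noise is independent across cases. Under those assumptions the noise on a mean over n cases shrinks by a factor of sqrt(n).

```python
import math

def standard_error(per_case_noise_sd: float, n_cases: int) -> float:
    """Standard error of a mean over n independent cases with equal noise."""
    if n_cases < 1:
        raise ValueError("n_cases must be at least 1")
    return per_case_noise_sd / math.sqrt(n_cases)

# 0.25 is the midpoint of the ~0.2-0.3 range, used here as an assumed std dev.
se_single = standard_error(0.25, 1)
se_bench = standard_error(0.25, 200)
print(f"one case: ±{se_single:.3f}   mean of 200 cases: ±{se_bench:.3f}")
```

The same reasoning suggests that small test sets (a few dozen cases) leave noticeably more noise in the mean than the n=200 run described here.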

Reproduce it yourself

python3 -m venv bench/.venv
bench/.venv/bin/pip install ragas openai langchain langchain-openai
bench/.venv/bin/pip install redhop
OPENROUTER_API_KEY=sk-or-... \
  bench/.venv/bin/python bench/eval_correlation_hotpot.py \
    --n 200 --context all

The script generates answers via the LLM, scores each one with both libraries, and prints Pearson r, MAE, and per-case scores. The bench source lives in the repo at bench/eval_correlation_hotpot.py.

Honest caveats

Single LLM. gpt-4o-mini only. Different judge models produce different absolute numbers. The agreement trend is likely robust; the absolute r and MAE values are not necessarily.
Only faithfulness was compared head-to-head. Ragas’s relevancy / similarity / correctness need an embedder, which RedHop deliberately doesn’t carry. Comparing those across embedder choices would muddy the apples-to-apples comparison.
No human ground truth. We’re measuring whether the two libraries agree, not whether either agrees with human judgment. Claude haiku as a third LLM is a tie-breaker, not an oracle.
The “correct” answer to the contested cases is genuinely ambiguous. Different graders (LLM or human) will reasonably disagree on partial-support cases. That’s a property of the metric, not a bug in either library.

Full evidence, plus the v0→v4 prompt iteration that closed the calibration gap, is in EVAL_JUDGED_CALIBRATION.