RedHop vs Ragas
If you’re already running Ragas
on a RAG pipeline, the natural question is whether redhop.evaluate(...)
gives you a different number. It doesn’t, on the metric they both ship.
On claim-decomposed faithfulness (evaluate(..., decompose_faithfulness=True)
versus Ragas’s Faithfulness) the two are substantively equivalent:
Pearson r = +0.664, MAE = 0.151, 61% exact agreement (122/200) on n=200 HotpotQA
with openai/gpt-4o-mini. Neither library is unambiguously “more correct”
under a third LLM’s read at scale.
The differentiator is philosophy, not accuracy.
Different surfaces, same category
| | RedHop evaluate | Ragas |
|---|---|---|
| Scope | one API for closed-set answer quality + Decision Report self-eval | dedicated eval framework |
| Metric families | faithfulness, relevancy, correctness, critique, summarize | the above + similarity, context precision/recall, AspectCritic, more |
| LLM dependence | fully optional: _lexical (no LLM) + _judged (opt-in) | LLM-required for most metrics |
| Embedder dependence | none (no relevancy-cosine, no similarity) | required for relevancy / similarity / answer-correctness embedding term |
| Integration | pip install redhop, single package | langchain-wrapped LLM/embedder, multiple deps |
| Same primitives as runtime? | yes, evaluate uses the runtime’s own Decision Report machinery | no |
| Output shape | one EvalReport dataclass + summarize(reports) aggregator | per-metric Result objects + dataset-level aggregation |
RedHop is not an eval framework. It ships a narrow answer-quality surface that mirrors what the runtime already measures internally. The comparison is “does it produce numbers similar to a dedicated eval library on the metric they both ship.”
The numbers: n=200 HotpotQA, gpt-4o-mini
The bench generates an LLM answer for each HotpotQA question given the full distractor context (supporting + distractor paragraphs), then scores faithfulness in both libraries with the same judge model. Apples-to-apples on the score, the workload, and the judge.
Correlation with Ragas
| | Pearson r | MAE | exact agreement |
|---|---|---|---|
| RedHop decomposed ↔ Ragas | +0.664 | 0.151 | 61% (122/200) |
| RedHop single-prompt ↔ Ragas | +0.285 | 0.239 | — |
Read: decomposed-faithfulness (the path you should default to) matches Ragas
exactly on the majority of cases (61%), and the mean absolute error over all
200 cases is ~0.15. If exact agreement means zero difference, the 78 cases
that do diverge differ by roughly 0.39 on average (0.151 × 200 / 78).
Single-prompt diverges more, mostly on refusal answers (“I don’t know” scores
1.0 single-prompt because there are no claims to contradict, the
vacuous-truth failure mode). Use decompose_faithfulness=True for accuracy.
Third-judge tie-breaker (Claude haiku)
When the two libraries disagree by ≥ 0.5 (35 cases on n=200), we ask Claude
haiku to score the same cases independently via the local
claude -p --model haiku CLI:
| | MAE vs Claude haiku (66-case subset: 46 contested + 20 agreement) |
|---|---|
| RedHop decomposed | 0.340 |
| Ragas | 0.262 |
On contested cases, Claude favors: RedHop 12/35, Ragas 23/35.
Read carefully: this looks like Ragas is “more correct”, but re-tracing 5
randomly-sampled “RedHop loses to Ragas” cases at 5 runs each shows 4 of 5
give 1.0 consistently when measured stably. The bench captured a one-shot
0.0 because gpt-4o-mini at temperature 0.0 is not deterministic through
OpenRouter (model-replica routing + floating-point non-associativity →
~20–30% per-case variance on borderline judgments).
So the contested-cases MAE-to-Claude is noise-dominated. Aggregate metrics (Pearson r and MAE vs Ragas, averaged over many cases) are robust. Per-case verdicts are not.
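One way to act on this is to flag the contested cases and re-score each of them several times before trusting a verdict. The sketch below assumes a scorer you supply yourself: `contested_cases`, `stable_score`, and `flaky_scorer` are hypothetical names, and `flaky_scorer` is a stand-in that imitates a one-shot 0.0 followed by stable 1.0s rather than a real judge call.

```python
from statistics import mean, pstdev


def contested_cases(scores_a, scores_b, threshold=0.5):
    """Indices where the two libraries disagree by at least `threshold`."""
    return [i for i, (a, b) in enumerate(zip(scores_a, scores_b))
            if abs(a - b) >= threshold]


def stable_score(score_once, case, runs=5):
    """Call a noisy scorer `runs` times; return (mean, population std dev)."""
    samples = [score_once(case) for _ in range(runs)]
    return mean(samples), pstdev(samples)


# Hypothetical stand-in for a judged scorer: returns 0.0 on the first call
# for a case and 1.0 afterwards, imitating a borderline judgment.
_calls = {}


def flaky_scorer(case):
    _calls[case] = _calls.get(case, 0) + 1
    return 0.0 if _calls[case] == 1 else 1.0


# Toy scores for illustration only.
redhop_scores = [0.0, 1.0, 0.5]
ragas_scores = [1.0, 1.0, 0.6]
for i in contested_cases(redhop_scores, ragas_scores):
    avg, spread = stable_score(flaky_scorer, case=i)
    print(f"case {i}: one-shot={redhop_scores[i]} mean={avg:.2f} spread={spread:.2f}")
```

A large spread across runs suggests the disagreement on that case is judge noise rather than a real difference between the libraries.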
What each library gives you that the other doesn’t
With RedHop you also get:
- A single EvalReport dataclass that blends lexical metrics (deterministic, run in CI without an LLM) with judged metrics (opt-in via Judge.from_callable(fn).cached()), instead of running metrics one at a time.
- summarize(reports) for test-set aggregation: one function call rolls up per-case reports into a means + N + share-flagged summary, the same shape RedHop’s own runtime uses for its Decision Report.
- No embedder dependency. Ragas’s AnswerRelevancy and AnswerSimilarity need an embedder. RedHop’s relevancy_judged uses an LLM-only noncommittal-detection prompt (no embeddings, no extra dep).
- Refusal handling. “I don’t know” answers correctly return None for decomposed faithfulness (0 claims extracted) instead of being scored as a vacuous 1.0. Surfaces refusals as a distinct category, not as faithfulness = 1.
- critique(answer, aspects, ...) for user-defined dimensions. Ragas has AspectCritic. RedHop has the equivalent in critique with the same EvalReport-shape output as quantitative metrics.
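The refusal-handling point can be illustrated with a small sketch. This is not RedHop's implementation: `decomposed_faithfulness` and `aggregate` are hypothetical helpers that only show why returning None for zero extracted claims keeps refusals out of the mean instead of inflating it with a vacuous 1.0.

```python
from typing import Optional, Sequence


def decomposed_faithfulness(claim_scores: Sequence[float]) -> Optional[float]:
    """Mean of per-claim support scores in [0, 1].

    Returns None when no claims were extracted (for example a refusal),
    rather than a vacuous 1.0.
    """
    if not claim_scores:
        return None
    return sum(claim_scores) / len(claim_scores)


def aggregate(scores: Sequence[Optional[float]]) -> dict:
    """Mean over scored cases, with unscored (None) cases counted separately."""
    scored = [s for s in scores if s is not None]
    return {
        "mean": sum(scored) / len(scored) if scored else None,
        "n_scored": len(scored),
        "n_unscored": len(scores) - len(scored),
    }


per_case = [
    decomposed_faithfulness([1.0, 1.0, 0.0, 1.0]),  # 3 of 4 claims supported
    decomposed_faithfulness([]),                    # "I don't know": no claims
    decomposed_faithfulness([1.0, 1.0]),
]
print(per_case)             # [0.75, None, 1.0]
print(aggregate(per_case))  # mean 0.875 over 2 scored cases, 1 unscored
```

Had the refusal been scored 1.0, the mean over the three cases would be about 0.92 and the refusal would be invisible in the summary.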
With Ragas you also get:
- More metric families. Ragas ships AnswerSimilarity, ContextPrecision, ContextRecall, Faithfulness with NLI, more AspectCritic variants, and test-set generation. RedHop ships a focused subset.
- Broader integration ecosystem. LangChain wrappers, LlamaIndex wrappers, and Langfuse / Arize Phoenix integrations. RedHop stays in-process by design.
- Dataset loaders. Ragas can load HuggingFace datasets and other dataset formats. RedHop expects you to construct (question, context, answer, gold_answer) tuples directly.
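Building those tuples by hand is usually a few lines. The sketch below assumes a HotpotQA-style record whose context holds parallel `title` and `sentences` lists, and it assumes the context is passed as one flattened string. Both are assumptions, as are the names `EvalCase`, `flatten_context`, and `build_case`; adapt them to whatever shape your data and `redhop.evaluate` actually use.

```python
from typing import Callable, NamedTuple


class EvalCase(NamedTuple):
    question: str
    context: str
    answer: str
    gold_answer: str


def flatten_context(context: dict) -> str:
    """Join HotpotQA-style paragraphs into one text block, one paragraph per line."""
    paragraphs = []
    for title, sentences in zip(context["title"], context["sentences"]):
        body = " ".join(s.strip() for s in sentences)
        paragraphs.append(f"{title}: {body}")
    return "\n".join(paragraphs)


def build_case(record: dict, generate: Callable[[str, str], str]) -> EvalCase:
    """Turn one dataset record into a (question, context, answer, gold_answer) tuple."""
    context = flatten_context(record["context"])
    answer = generate(record["question"], context)  # your LLM call goes here
    return EvalCase(record["question"], context, answer, record["answer"])


# Toy record for illustration only.
record = {
    "question": "Which city is the Eiffel Tower in?",
    "answer": "Paris",
    "context": {
        "title": ["Eiffel Tower", "Tower Bridge"],
        "sentences": [
            ["The Eiffel Tower is a lattice tower.", " It stands in Paris."],
            ["Tower Bridge crosses the River Thames in London."],
        ],
    },
}
case = build_case(record, generate=lambda q, ctx: "Paris")  # stub instead of an LLM
print(case)
```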
Code, side-by-side
```python
import redhop

# Same judge plumbing for everything: `Judge.from_callable(fn).cached()`
# accepts any sync function returning a float, a string (parsed as a number
# if it looks like one, else treated as raw text), or a dict.
judge = redhop.Judge.from_callable(my_llm).cached()

report = redhop.evaluate(
    user_query,
    ctx,
    answer=model_answer,
    gold_answer=gold,
    judge=judge,
    decompose_faithfulness=True,  # 2 LLM calls; catches paraphrase-hallucinations
    decompose_correctness=True,   # TP/FP/FN against gold → F1
)
report.faithfulness_judged  # mean of per-claim scores in [0, 1]
report.n_faithfulness_claims_extracted, report.n_faithfulness_claims_supported
report.correctness_judged  # F1
report.n_correctness_tp, report.n_correctness_fp, report.n_correctness_fn
```

```js
const { Judge, evaluateWithJudge } = require("redhop");

const judge = Judge.fromCallable(async (err, prompt, system) => {
  return await myLlm({ prompt, system }); // your LLM client, your choice
}, "openai-mini").cached();

const report = await evaluateWithJudge(userQuery, ctx, judge, {
  answer: modelAnswer,
  goldAnswer: gold,
  decomposeFaithfulness: true,
  decomposeCorrectness: true,
});
report.faithfulnessJudged
report.nFaithfulnessClaimsExtracted, report.nFaithfulnessClaimsSupported
report.correctnessJudged
report.nCorrectnessTp, report.nCorrectnessFp, report.nCorrectnessFn
```

How to read this
- Where the result is robust: decomposed-faithfulness produces numbers that correlate with Ragas’s faithfulness (Pearson r = +0.664) across n=200 HotpotQA. If you use either library to evaluate a RAG system, the trends you see should be the same.
- Where the result is fragile: any single case’s score has ~0.2–0.3 absolute noise. Use the score as a signal across many cases, not as an oracle on one.
- Where neither library shines: the metric is LLM-judged. Both libraries inherit the judge’s calibration. A different judge model produces different absolute numbers. The trend (the two libraries agreeing) is what’s stable.
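Why per-case noise matters less in aggregate can be shown with a toy simulation. The sketch below is an illustration under stated assumptions, not a model of the actual judge: it assumes every case has the same true score (0.8), that noise is uniform and independent across cases, and the function name `simulate_mean_spread` is hypothetical.

```python
import random
from statistics import mean, pstdev


def simulate_mean_spread(n_cases, per_case_noise=0.25, trials=500, seed=0):
    """Std dev of the dataset-level mean score across repeated simulated benches.

    Each case has a true score of 0.8 plus independent uniform noise in
    [-per_case_noise, +per_case_noise], clipped to [0, 1].
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        scores = [
            min(1.0, max(0.0, 0.8 + rng.uniform(-per_case_noise, per_case_noise)))
            for _ in range(n_cases)
        ]
        means.append(mean(scores))
    return pstdev(means)


for n in (1, 20, 200):
    print(f"n={n:>3}: spread of the mean ~ {simulate_mean_spread(n):.3f}")
```

The spread of the mean shrinks as the number of cases grows, which is why aggregate figures are usable even when single-case scores are not. Independent noise is the key assumption here: a systematic bias in the judge model does not average out this way.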
Reproduce it yourself
```bash
python3 -m venv bench/.venv
bench/.venv/bin/pip install ragas openai langchain langchain-openai
bench/.venv/bin/pip install redhop
OPENROUTER_API_KEY=sk-or-... \
  bench/.venv/bin/python bench/eval_correlation_hotpot.py \
  --n 200 --context all
```

The script generates answers via the LLM, scores each with both libraries, and prints Pearson r, MAE, and per-case scores. The bench source lives in the repo at bench/eval_correlation_hotpot.py.
Honest caveats
- Single LLM. gpt-4o-mini only. Different judge models produce different absolute numbers. The agreement trend is likely robust, the absolute r/MAE numbers aren’t necessarily.
- Only faithfulness was compared head-to-head. Ragas’s relevancy / similarity / correctness need an embedder which RedHop deliberately doesn’t carry. Comparing those across embedder choices would muddy the apples-to-apples.
- No human ground truth. We’re measuring whether the two libraries agree, not whether either agrees with human judgment. Claude haiku as a third LLM is a tie-breaker, not an oracle.
- The “correct” answer to the contested cases is genuinely ambiguous. Different graders (LLM or human) will reasonably disagree on partial-support cases. That’s a property of the metric, not a bug in either library.
Full evidence + the v0→v4 prompt iteration that closed the calibration gap is in EVAL_JUDGED_CALIBRATION.