RedHop vs Ragas

If you’re already running Ragas on a RAG pipeline, the natural question is whether redhop.evaluate(...) gives you a different number. Mostly it doesn’t, at least on the one metric both libraries ship.

On claim-decomposed faithfulness (evaluate(..., decompose_faithfulness=True) versus Ragas’s Faithfulness), the two are substantively equivalent: Pearson r = +0.664, MAE = 0.151, and 61% exact agreement on n=200 HotpotQA questions with openai/gpt-4o-mini. Neither library is unambiguously “more correct” under a third LLM’s read.

The differentiator is philosophy, not accuracy.

| | RedHop evaluate | Ragas |
|---|---|---|
| Scope | one API for closed-set answer quality + Decision Report self-eval | dedicated eval framework |
| Metric families | faithfulness, relevancy, correctness, critique, summarize | the above + similarity, context precision/recall, AspectCritic, more |
| LLM dependence | fully optional: _lexical (no LLM) + _judged (opt-in) | LLM-required for most metrics |
| Embedder dependence | none (no relevancy-cosine, no similarity) | required for relevancy / similarity / answer-correctness embedding term |
| Integration | pip install redhop, single package | langchain-wrapped LLM/embedder, multiple deps |
| Same primitives as runtime? | yes, evaluate uses the runtime’s own Decision Report machinery | no |
| Output shape | one EvalReport dataclass + summarize(reports) aggregator | per-metric Result objects + dataset-level aggregation |

RedHop is not an eval framework. It ships a narrow answer-quality surface that mirrors what the runtime already measures internally. The comparison is “does it produce numbers similar to a dedicated eval library on the metric they both ship.”

The bench generates an LLM answer for each HotpotQA question given the full distractor context (supporting + distractor paragraphs), then scores faithfulness in both libraries with the same judge model. Apples-to-apples on the score, the workload, and the judge.
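
To make “full distractor context” concrete, here is a minimal sketch of flattening one HotpotQA record into a single context string. It assumes the record layout of the Hugging Face hotpot_qa distractor configuration (parallel title and sentences lists under context). The record is hand-written for illustration, and build_context is a hypothetical helper, not part of the bench script.

```python
def build_context(record: dict) -> str:
    """Join every paragraph (supporting and distractor) into one context string."""
    ctx = record["context"]
    paragraphs = []
    for title, sentences in zip(ctx["title"], ctx["sentences"]):
        # Each paragraph is stored as a list of sentence strings.
        paragraphs.append(f"{title}: {''.join(sentences).strip()}")
    return "\n\n".join(paragraphs)


# Hand-written sample in the assumed schema (not taken from the dataset).
sample = {
    "question": "Which river flows through the capital of Austria?",
    "context": {
        "title": ["Vienna", "Graz"],
        "sentences": [
            ["Vienna is the capital of Austria.", " It lies on the Danube."],
            ["Graz is the second-largest city in Austria."],
        ],
    },
}

context = build_context(sample)
print(context)
```

The same string would then be passed to both libraries, so neither one sees a different subset of paragraphs.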

| Comparison | Pearson r | MAE | Exact agreement |
|---|---|---|---|
| RedHop decomposed ↔ Ragas | +0.664 | 0.151 | 61% (122/200) |
| RedHop single-prompt ↔ Ragas | +0.285 | 0.239 | |
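
For readers who want to recompute these three statistics from their own paired scores, the following is a minimal sketch in plain Python. The two score lists are made-up placeholders rather than bench data, and treating “exact agreement” as a zero difference (within a small tolerance) is an assumption about how the bench counts it.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length score lists (needs non-zero variance)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def mae(xs, ys):
    """Mean absolute error between paired scores."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def exact_agreement(xs, ys, tol=1e-9):
    """Share of cases where both scores match (assumed definition)."""
    return sum(abs(x - y) <= tol for x, y in zip(xs, ys)) / len(xs)


# Placeholder scores for four cases, not taken from the bench.
redhop_scores = [1.0, 0.5, 0.0, 1.0]
ragas_scores = [1.0, 0.75, 0.0, 0.5]

r = pearson_r(redhop_scores, ragas_scores)
error = mae(redhop_scores, ragas_scores)
agreement = exact_agreement(redhop_scores, ragas_scores)
print(f"r={r:+.3f} MAE={error:.3f} exact={agreement:.0%}")
```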

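Read: decomposed faithfulness (the path you should default to) agrees exactly with Ragas on the majority of cases, and the mean absolute error across all 200 cases is ~0.15. That average includes the 122 exact matches, so the typical gap on a diverging case is larger: roughly 0.39 if exact agreement means a zero difference (0.151 × 200 / 78). Single-prompt diverges more, mostly on refusal answers: “I don’t know” scores 1.0 single-prompt because there are no claims to contradict, which is the vacuous-truth failure mode. Use decompose_faithfulness=True for accuracy.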
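
The vacuous-truth failure mode is easy to reproduce in isolation. The sketch below is illustrative only: both functions are hypothetical stand-ins, not RedHop’s or Ragas’s implementation, and they take a list of per-claim support verdicts that a real pipeline would obtain from the judge.

```python
from typing import Optional


def naive_faithfulness(claim_supported: list[bool]) -> float:
    """Scores 1.0 when no claim is unsupported, including when there are no claims."""
    # all([]) is True, so a refusal with zero claims looks perfectly faithful.
    return 1.0 if all(claim_supported) else 0.0


def decomposed_faithfulness(claim_supported: list[bool]) -> Optional[float]:
    """Share of supported claims, or None when nothing was extracted."""
    if not claim_supported:
        return None  # refusal or empty answer: no score rather than a vacuous 1.0
    return sum(claim_supported) / len(claim_supported)


refusal_claims: list[bool] = []  # "I don't know" yields no extractable claims
print(naive_faithfulness(refusal_claims))
print(decomposed_faithfulness(refusal_claims))
print(decomposed_faithfulness([True, False]))
```

Returning None keeps refusals out of the mean, so they can be counted separately instead of inflating the faithfulness average.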

For the cases where the two libraries disagree by ≥ 0.5 (35 cases on n=200), we ask Claude Haiku to score them independently via the local claude -p --model haiku CLI:

| Library | MAE vs Claude Haiku (66-case subset: 46 contested + 20 agreement) |
|---|---|
| RedHop decomposed | 0.340 |
| Ragas | 0.262 |

On contested cases, Claude favors: RedHop 12/35, Ragas 23/35.

Read carefully: this looks like Ragas is “more correct”, but the gap does not survive re-measurement. We re-ran 5 randomly sampled “RedHop loses to Ragas” cases 5 times each, and 4 of the 5 score 1.0 consistently across runs. The bench captured a one-shot 0.0 because gpt-4o-mini at temperature 0.0 is not deterministic through OpenRouter (model-replica routing + floating-point non-associativity → ~20–30% per-case variance on borderline judgments).

So the contested-cases MAE-to-Claude is noise-dominated. Aggregate metrics (Pearson r and MAE vs Ragas, averaged over many cases) are robust. Per-case verdicts are not.
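
One way to get a steadier per-case verdict is to score the same case several times and keep the median along with the spread. This is a minimal sketch of that idea, not the procedure the bench uses: stable_score is a hypothetical helper, and the scripted judge below replays a fixed sequence so the example is deterministic.

```python
from statistics import median
from typing import Callable


def stable_score(judge_once: Callable[[], float], runs: int = 5) -> tuple[float, float]:
    """Score one case `runs` times; return (median score, max - min spread)."""
    scores = [judge_once() for _ in range(runs)]
    return median(scores), max(scores) - min(scores)


# Scripted stand-in for a noisy judge: one outlier followed by consistent runs.
scripted = iter([0.0, 1.0, 1.0, 1.0, 1.0])
score, spread = stable_score(lambda: next(scripted))
print(score, spread)
```

A large spread is a hint that the case is borderline for the judge and that a single run should not be read as a verdict.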

What each library gives you that the other doesn’t

With RedHop you also get:

  1. A single EvalReport dataclass that blends lexical metrics (deterministic, run in CI without an LLM) with judged metrics (opt-in via Judge.from_callable(fn).cached()), instead of running metrics one at a time.
  2. summarize(reports) for test-set aggregation: one function call rolls up per-case reports into a means + N + share-flagged summary, the same shape RedHop’s own runtime uses for its Decision Report.
  3. No embedder dependency. Ragas’s AnswerRelevancy and AnswerSimilarity need an embedder. RedHop’s relevancy_judged uses an LLM-only noncommittal-detection prompt (no embeddings, no extra dep).
  4. Refusal handling. “I don’t know” answers correctly return None for decomposed faithfulness (0 claims extracted) instead of being scored as a vacuous 1.0. Surfaces refusals as a distinct category, not as faithfulness = 1.
  5. critique(answer, aspects, ...) for user-defined dimensions. Ragas has AspectCritic. RedHop has the equivalent in critique with the same EvalReport-shape output as quantitative metrics.
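
To show what a “means + N + share” roll-up looks like when some cases have no score, here is a small sketch in plain Python. It is not RedHop’s summarize: summarize_scores and its field names are hypothetical, and it handles a single metric where None marks a case with no extractable claims.

```python
from typing import Optional


def summarize_scores(scores: list[Optional[float]]) -> dict:
    """Aggregate per-case scores, keeping unscored cases (None) out of the mean."""
    scored = [s for s in scores if s is not None]
    total = len(scores)
    return {
        "n": total,
        "n_scored": len(scored),
        "mean": sum(scored) / len(scored) if scored else None,
        "share_unscored": (total - len(scored)) / total if total else None,
    }


# Placeholder per-case faithfulness scores; None stands for a refusal.
summary = summarize_scores([1.0, 0.5, None, 1.0])
print(summary)
```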

With Ragas you also get:

  1. More metric families. Ragas ships AnswerSimilarity, ContextPrecision, ContextRecall, Faithfulness with NLI, more AspectCritic variants, and test-set generation. RedHop ships a focused subset.
  2. Broader integration ecosystem. LangChain wrappers, LlamaIndex wrappers, and Arize Phoenix / Langfuse integrations. RedHop stays in-process by design.
  3. Dataset loaders. Ragas can load HuggingFace datasets and other dataset formats. RedHop expects you to construct (question, context, answer, gold_answer) tuples directly.

```python
import redhop

# Same judge plumbing for everything: `Judge.from_callable(fn).cached()`
# accepts any sync function returning a float, a string (parsed as a number
# if it looks like one, else treated as raw text), or a dict.
judge = redhop.Judge.from_callable(my_llm).cached()

report = redhop.evaluate(
    user_query, ctx,
    answer=model_answer,
    gold_answer=gold,
    judge=judge,
    decompose_faithfulness=True,  # 2 LLM calls; catches paraphrase-hallucinations
    decompose_correctness=True,   # TP/FP/FN against gold → F1
)

report.faithfulness_judged  # mean of per-claim scores in [0, 1]
report.n_faithfulness_claims_extracted, report.n_faithfulness_claims_supported
report.correctness_judged  # F1
report.n_correctness_tp, report.n_correctness_fp, report.n_correctness_fn
```

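
The return-value contract described in the comment above (float, numeric-looking string, raw text, or dict) can be illustrated with a small normalizer. This is only a sketch of that contract as stated, not RedHop’s actual parsing code; coerce_judge_output is a hypothetical name.

```python
from typing import Union


def coerce_judge_output(raw: Union[float, int, str, dict]) -> Union[float, str, dict]:
    """Normalize a judge callable's return value (illustrative only)."""
    if isinstance(raw, dict):
        return raw  # structured output is passed through unchanged
    if isinstance(raw, (int, float)):
        return float(raw)
    text = raw.strip()
    try:
        return float(text)  # a string that looks like a number becomes a score
    except ValueError:
        return text  # anything else is kept as raw text


print(coerce_judge_output(" 0.75 "))
print(coerce_judge_output("unsupported"))
print(coerce_judge_output({"score": 1.0}))
```
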
  • Where the result is robust: decomposed-faithfulness produces numbers that correlate with Ragas’s faithfulness (r = +0.664) across n=200 HotpotQA. If you use either library to evaluate a RAG system, the trends you see should be broadly the same.
  • Where the result is fragile: any single case’s score has ~0.2–0.3 absolute noise. Use the score as a signal across many cases, not as an oracle on one.
  • Where neither library shines: the metric is LLM-judged. Both libraries inherit the judge’s calibration. A different judge model produces different absolute numbers. The trend (the two libraries agreeing) is what’s stable.
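
A rough back-of-the-envelope calculation shows why aggregates hold up while single cases do not. The sketch assumes the per-case noise behaves like an independent standard deviation of about 0.25 (the midpoint of the ~0.2–0.3 range above); both the independence and that reading of the figure are assumptions, not measurements.

```python
from math import sqrt


def standard_error(per_case_noise: float, n_cases: int) -> float:
    """Standard error of a mean score when per-case noise is independent."""
    return per_case_noise / sqrt(n_cases)


se_single = standard_error(0.25, 1)   # one case: the full per-case noise
se_bench = standard_error(0.25, 200)  # mean over 200 cases
print(round(se_single, 3), round(se_bench, 3))
```

Under those assumptions the noise on a 200-case mean is more than ten times smaller than on a single case.
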

```bash
python3 -m venv bench/.venv
bench/.venv/bin/pip install ragas openai langchain langchain-openai
bench/.venv/bin/pip install redhop
OPENROUTER_API_KEY=sk-or-... \
  bench/.venv/bin/python bench/eval_correlation_hotpot.py \
  --n 200 --context all
```

The script generates answers via the LLM, scores each one with both libraries, and prints Pearson r, MAE, and per-case scores. The bench source lives in the repo at bench/eval_correlation_hotpot.py.

  • Single LLM. gpt-4o-mini only. Different judge models produce different absolute numbers. The agreement trend is likely robust, but the absolute r and MAE values may not carry over.
  • Only faithfulness was compared head-to-head. Ragas’s relevancy / similarity / correctness need an embedder, which RedHop deliberately doesn’t carry. Comparing those across embedder choices would muddy the comparison.
  • No human ground truth. We’re measuring whether the two libraries agree, not whether either agrees with human judgment. Claude Haiku as a third LLM is a tie-breaker, not an oracle.
  • The “correct” answer to the contested cases is genuinely ambiguous. Different graders (LLM or human) will reasonably disagree on partial-support cases. That’s a property of the metric, not a bug in either library.

Full evidence + the v0→v4 prompt iteration that closed the calibration gap is in EVAL_JUDGED_CALIBRATION.