Evaluation in production
redhop.evaluate(query, ctx) is two tools in one, and which one you reach for
decides the use case:
- Self-eval signals: low_confidence, evidence_density, retained_evidence_ratio, second_hop_rescues, mean_grounding, estimated_waste_tokens. Always populated, zero LLM cost, deterministic. They aren’t re-measured: they’re the Decision Report refracted, so reading them is essentially free.
- Judged metrics: faithfulness, relevancy, correctness. Real (cached) LLM calls. For offline depth, not the hot path.
The free deterministic tier is what makes the online patterns below possible. A judged eval that costs 3–4 LLM calls and varies run-to-run can’t live inside a request loop. A signal that’s free and reproducible can.
Gate before you generate
The highest-leverage pattern, and the one that costs nothing: check whether the
retrieved context is worth answering from before you pay for the generation
call. low_confidence is a literal “retrieval found nothing it trusts” flag,
the perfect trigger to say so instead of letting the model guess.

    import redhop

    ctx = doc.context(user_query)
    report = redhop.evaluate(user_query, ctx)  # zero LLM cost: reads the Decision Report

    if report.low_confidence:
        # retrieval found nothing it's confident about, so don't let the model hallucinate
        answer = "I don't have enough information to answer that reliably."
    else:
        answer = llm.generate(ctx.text())  # only spend the call on context you trust

The same gate in JavaScript:

    const ctx = doc.context(userQuery);
    const report = redhop.evaluate(userQuery, ctx); // zero LLM cost

    const answer = report.lowConfidence
      ? "I don't have enough information to answer that reliably."
      : await llm.generate(ctx.text());

For a customer-facing site chatbot, that one branch is the difference between
“I’ll connect you to support” and a confidently wrong answer. The signals are
also on ctx.report directly (ctx.report.low_confidence_retrieval) if you’d
rather skip the evaluate call entirely in the hot path.
Let an agent improvise its retrieval
The same gate lets an agent self-correct: if the first retrieval comes back low-confidence, reformulate the query and try again before committing to an answer. It is a cheap, deterministic loop around an otherwise-expensive step.

    def answer_with_retry(doc, query, rephrase):
        for attempt in (query, rephrase(query)):  # raw query, then a reformulation
            ctx = doc.context(attempt)
            report = redhop.evaluate(attempt, ctx)  # free, so it can run every iteration
            if not report.low_confidence:
                return llm.generate(ctx.text())
        return "I don't have enough information to answer that reliably."

Diagnose why a result was bad
A low score isn’t one problem. The reason there are six signals instead of one number is that their combination localizes the failure to a stage: retrieval (didn’t fetch it), assembly (fetched it, then buried or dropped it), or generation (had it, model still missed). That’s what saves you from debugging the wrong stage.
| What you see | Where it broke | What to do | Needs |
|---|---|---|---|
| low_confidence = true | Retrieval: nothing matched | semantic retrieval, expand the query, broaden, or answer “I don’t know” | free |
| context_recall low | Retrieval: the gold evidence never came back | wrong corpus / vocabulary gap | gold chunks |
| not low-confidence, but evidence_density low | Assembly: found it, it’s buried in junk (dilution) | prune harder, smaller chunks, tighter budget | free |
| retained_evidence_ratio low | Assembly: evidence was in the input and got dropped | strategy too aggressive, stop over-filtering | free |
| context_recall high + faithfulness low | Generation: context was fine, the model hallucinated | fix the prompt/model, not retrieval | gold + answer |
| estimated_waste_tokens high | Cost: paying for chunks below the bar | tune budget / pruning | free |
The pair that earns its keep is the fifth row: right context in, wrong answer out tells you to stop tuning the retriever and go fix your prompt. Without it, every RAG team settles that question with hours of guessing.
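The table above can be read as a small decision procedure. The following is a minimal sketch of that reading, not RedHop's own logic: the Signals dataclass is a hypothetical stand-in for the report fields, and the low/high thresholds are illustrative assumptions to tune on your own traffic. Only the signal names come from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    """Hypothetical stand-in for the fields read off an eval report."""
    low_confidence: bool
    evidence_density: float
    retained_evidence_ratio: float
    context_recall: Optional[float] = None  # only available with gold chunks
    faithfulness: Optional[float] = None    # only available with gold + answer


def diagnose(s: Signals, low: float = 0.3, high: float = 0.7) -> str:
    """Map a signal combination to the stage that most likely failed.

    The `low` and `high` thresholds are assumptions, not library defaults.
    Checks run in the same order as the rows of the table.
    """
    if s.low_confidence:
        return "retrieval: nothing matched"
    if s.context_recall is not None and s.context_recall < low:
        return "retrieval: gold evidence never came back"
    if s.evidence_density < low:
        return "assembly: evidence is diluted"
    if s.retained_evidence_ratio < low:
        return "assembly: evidence was dropped"
    if (s.context_recall is not None and s.context_recall >= high
            and s.faithfulness is not None and s.faithfulness < low):
        return "generation: context was fine, answer was not"
    return "no stage flagged"


# Right context in, wrong answer out: the fifth row of the table.
verdict = diagnose(Signals(False, 0.8, 0.9, context_recall=0.9, faithfulness=0.1))
```

Returning the first matching stage keeps the output actionable: fix the earliest broken stage, re-run, and see whether a later one is still flagged.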
Mine failures into a content roadmap
Aggregate the gate signal over real traffic and it stops being a grade and
becomes a backlog. summarize() rolls many reports into one health view.
The low_confidence queries underneath it are the questions your corpus can’t
answer yet. Cluster them and you have a ranked list of the docs to write next.

    reports = [redhop.evaluate(q, doc.context(q)) for q in last_nights_queries]
    summary = redhop.summarize(reports)

    print(summary.low_confidence_rate)  # e.g. 0.18 → 18% of traffic is ungrounded
    # → collect the low-confidence queries, cluster them, and you've found your
    # top missing-content topics: a content roadmap, derived from real misses.

This is the same loop whether you’re running a site chatbot, an internal knowledge base, or support deflection: the failures tell you what to add.
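The "collect and cluster" step can be sketched without any library calls. This is a minimal illustration under stated assumptions: the traffic list is a made-up stand-in for (query, low_confidence) pairs, the stopword list is arbitrary, and word counting is a crude substitute for real clustering.

```python
from collections import Counter

# Hypothetical stand-in for (query, report.low_confidence) pairs from one night of traffic.
traffic = [
    ("how do I rotate api keys", True),
    ("reset my password", False),
    ("rotate api keys for a team", True),
    ("pricing for enterprise", False),
    ("export audit logs", True),
]

# Arbitrary stopword list for the sketch; a real pipeline would use a proper analyzer.
STOPWORDS = {"how", "do", "i", "for", "a", "my", "the", "to"}


def low_confidence_rate(rows):
    """Share of queries whose retrieval was flagged low-confidence."""
    return sum(flag for _, flag in rows) / len(rows) if rows else 0.0


def missing_topics(rows, top_n=3):
    """Count content words across low-confidence queries (a crude proxy for clustering)."""
    counts = Counter(
        word
        for query, flag in rows if flag
        for word in query.lower().split()
        if word not in STOPWORDS
    )
    return counts.most_common(top_n)


rate = low_confidence_rate(traffic)
topics = missing_topics(traffic)
```

Swapping the word count for embedding-based clustering changes the grouping quality, not the shape of the loop: filter to the misses, group them, rank the groups.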
A/B and CI regression: deterministic, zero-LLM
Because the lexical tier is a pure function of the text and the analyzer, the
same inputs always produce the same score, so it can gate a pull request or
compare two configs without an LLM in the loop. Pass gold_chunks to score one
arm against the other:

    ctx_a = doc.context(user_query)  # baseline
    ctx_b = doc.context_with_rewrites(user_query, [stripper, vocab])  # with a rewrite chain

    a = redhop.evaluate(user_query, ctx_a, gold_chunks=gold_ids)
    b = redhop.evaluate(user_query, ctx_b, gold_chunks=gold_ids)
    lift = b.overall - a.overall  # deterministic, so safe for "block the PR if this drops"

The full detect → strip → A/B recipe lives on the Choosing a configuration page. For wiring eval into an HTTP service and your observability stack, see Deploy to production.
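A "block the PR if this drops" check over a whole query set might look like the sketch below. It assumes you have already collected one overall score per query for each arm; regression_gate and its tolerance parameter are hypothetical names for this illustration, not part of RedHop.

```python
from statistics import mean


def regression_gate(baseline_scores, candidate_scores, tolerance=0.01):
    """Return (passed, lift) for two arms scored over the same query set.

    `tolerance` is a hypothetical allowance for an acceptable drop in the mean.
    """
    if len(baseline_scores) != len(candidate_scores) or not baseline_scores:
        raise ValueError("both arms must score the same non-empty query set")
    lift = mean(candidate_scores) - mean(baseline_scores)
    return lift >= -tolerance, lift


# Stand-in values for report.overall, one per query, in each arm.
baseline = [0.62, 0.70, 0.55]
candidate = [0.66, 0.71, 0.58]
passed, lift = regression_gate(baseline, candidate)
```

In a CI job, exiting non-zero when passed is False is enough to fail the build. Because the scores are deterministic, a failure reflects a real change in the inputs rather than run-to-run noise.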
Where the judged tier earns its place
For one-off audits and model migrations (“did faithfulness regress when
we upgraded the model?”), cost stops mattering and thoroughness wins. Add a judge
to unlock the LLM-scored metrics, and wrap it with .cached() so re-runs over
the same test set cost nothing:

    judge = redhop.Judge.from_callable(my_llm_score_fn).cached()
    report = redhop.evaluate(query, ctx, answer=model_answer, gold_answer=gold, judge=judge)
    report.faithfulness_judged  # LLM-scored: is every claim supported by the context?

A good funnel at scale: use the free faithfulness_lexical as a first-pass
screen across everything, then escalate only the suspicious cases to the paid
faithfulness_judged.
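The funnel can be sketched as a plain function that takes both scorers as callables. Everything here is an assumption for illustration: the funnel name, the 0.6 threshold, and the dictionaries standing in for faithfulness_lexical and faithfulness_judged values.

```python
def funnel(cases, lexical_score, judged_score, threshold=0.6):
    """Screen every case with the free score; call the paid judge only below threshold.

    Returns (results, n_judge_calls), where results maps case -> (score, tier).
    """
    results, n_judge_calls = {}, 0
    for case in cases:
        score = lexical_score(case)
        if score >= threshold:
            results[case] = (score, "lexical")
        else:
            # Suspicious case: escalate to the expensive scorer.
            results[case] = (judged_score(case), "judged")
            n_judge_calls += 1
    return results, n_judge_calls


# Stand-in scores keyed by answer id; real values would come from eval reports.
lexical = {"a1": 0.92, "a2": 0.35, "a3": 0.71}
judged = {"a2": 0.80}
results, calls = funnel(lexical, lexical.get, judged.get)
```

Tracking n_judge_calls makes the saving visible: here one of three cases reaches the judge, and lowering the threshold trades judge spend against the risk of trusting a weak lexical screen.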
Catch paraphrase-hallucinations with claim decomposition
The single-prompt faithfulness_judged asks the LLM “is the answer faithful?”
in one call. That’s fast, but it misses subtle hallucinations where the answer
reads faithful but smuggles in one unsupported fact (“X co-wrote Y in Z”
where the context only supports the “X co-wrote Y” part). Set
decompose_faithfulness=True to switch to the two-call path: extract atomic
claims, then batch-verify each against the context. That is two LLM calls
regardless of how many claims come out: one more than the single-prompt path,
and much sharper.

    report = redhop.evaluate(
        query, ctx,
        answer=model_answer,
        judge=judge,
        decompose_faithfulness=True,
    )
    report.faithfulness_judged  # mean of per-claim scores
    report.n_faithfulness_claims_extracted  # diagnostic: how many claims
    report.n_faithfulness_claims_supported  # diagnostic: how many ≥ 0.5

The diagnostics matter: a faithfulness of 0.5 from “1 of 2 claims supported” is much more actionable than “0.5” alone, because it tells you exactly one claim slipped through, which you can find and fix.
For correctness, the same trick maps onto answer-vs-gold:
decompose_correctness=True extracts claims from both the answer and the gold,
classifies each as TP / FP / FN, returns F₁, and exposes the counters so you
can see “the answer covered 3 of the gold’s 4 facts and added one not in the
gold.”

    report = redhop.evaluate(
        query, ctx,
        answer=model_answer,
        gold_answer=gold,
        judge=judge,
        decompose_correctness=True,
    )
    report.correctness_judged  # F₁
    report.n_correctness_tp, report.n_correctness_fp, report.n_correctness_fn

Refusal handling: when the answer is “I don’t know” the extractor returns
zero claims and faithfulness_judged is None (not a vacuous 1.0). That keeps
refusals as a distinct category in summaries instead of inflating your
faithfulness mean.
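The arithmetic behind the decomposed scores is small enough to write out. This is a sketch of the aggregation as described in this section, not RedHop's internals: the function names are hypothetical, and returning None from f1 when all counts are zero is an assumption made by analogy with the refusal rule.

```python
from typing import Optional, Sequence


def faithfulness_from_claims(scores: Sequence[float]) -> Optional[float]:
    """Mean of per-claim support scores; None when no claims were extracted (a refusal)."""
    return sum(scores) / len(scores) if scores else None


def supported(scores: Sequence[float], cutoff: float = 0.5) -> int:
    """Count claims at or above the support cutoff."""
    return sum(1 for s in scores if s >= cutoff)


def f1(tp: int, fp: int, fn: int) -> Optional[float]:
    """F1 = 2TP / (2TP + FP + FN); None when there is nothing to compare (assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else None


# "1 of 2 claims supported" gives a faithfulness of 0.5.
two_claims = [1.0, 0.0]
faith = faithfulness_from_claims(two_claims)
n_supported = supported(two_claims)

# "Covered 3 of the gold's 4 facts and added one not in the gold": TP=3, FN=1, FP=1.
correctness = f1(tp=3, fp=1, fn=1)

# A refusal yields no claims, so the score is None rather than a vacuous 1.0.
refusal = faithfulness_from_claims([])
```

For the correctness example the counts give 2·3 / (2·3 + 1 + 1) = 0.75, which shows why the counters are more informative than the score alone: the same 0.75 could come from different mixes of missed and invented facts.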
Critique open-ended dimensions
For qualities that aren’t faithfulness or correctness (harmfulness,
conciseness, brand voice, factuality on facts you don’t have a gold answer
for), use critique(). One judge call per aspect, polarity-corrected so
high = good across the report regardless of an aspect’s natural direction
(“harmfulness” is naturally bad-when-high, so the report inverts it).

    report = redhop.critique(
        model_answer,
        aspects=[
            redhop.Aspect("conciseness", "Is the answer concise?"),
            redhop.Aspect("harmfulness", "Is the answer harmful?", high_is_good=False),
            redhop.Aspect("brand_voice", "Is the answer in our brand voice?"),
        ],
        judge=judge,
    )
    for name, score in report.scores:
        print(f"{name:<14} {score:.2f}")  # high = good for all three

Same Judge.from_callable(...).cached() plumbing as evaluate, no new
primitive to learn.
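Polarity correction is easiest to see on raw numbers. The sketch below assumes scores in [0, 1] and assumes the inversion is 1 − score; the text above only says the report inverts bad-when-high aspects, so the exact formula is a guess, and polarity_correct is a hypothetical name.

```python
def polarity_correct(raw_score: float, high_is_good: bool = True) -> float:
    """Flip a [0, 1] score so that high always means good.

    Assumes inversion is 1 - score; the exact formula is not given in the text.
    """
    if not 0.0 <= raw_score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    return raw_score if high_is_good else 1.0 - raw_score


# Raw judge outputs as (score, high_is_good) pairs; values are made up for illustration.
raw = {"conciseness": (0.75, True), "harmfulness": (0.25, False)}
corrected = {name: polarity_correct(score, good) for name, (score, good) in raw.items()}
```

After correction both aspects read 0.75, so a dashboard can apply one "below X is a problem" rule to every row without tracking which aspects are naturally inverted.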
How does this compare to Ragas?
If you’re already running Ragas, the metric both libraries ship,
claim-decomposed faithfulness, calibrates as substantively equivalent
across n=200 HotpotQA cases: Pearson r = +0.664, MAE = 0.151, and 61%
perfect agreement, both run with gpt-4o-mini. With Claude Haiku
as an independent third judge, neither library is unambiguously “more correct”
on contested cases. They tie at scale.
Where they differ is philosophy: RedHop’s eval lives in-process with the
runtime (same primitives the Decision Report uses, no embedder dependency,
single EvalReport return), while Ragas is a broader framework with more
metric families and integrations. Full head-to-head + numbers:
RedHop vs Ragas.
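If you want to run the same kind of comparison on your own test set, the three agreement statistics quoted above are straightforward to compute. This sketch uses toy scores that are not the calibration data, and it assumes "perfect agreement" means the two per-case scores are exactly equal.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mae(xs, ys):
    """Mean absolute error between paired scores."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def exact_agreement(xs, ys):
    """Share of cases where both tools gave exactly the same score (assumed definition)."""
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)


# Toy per-case faithfulness scores from two tools; illustrative values only.
tool_a = [1.0, 0.5, 0.0, 1.0]
tool_b = [1.0, 0.5, 0.5, 1.0]

r = pearson_r(tool_a, tool_b)
err = mae(tool_a, tool_b)
agree = exact_agreement(tool_a, tool_b)
```

Reporting all three together is useful because they fail differently: correlation ignores a constant offset between tools, MAE exposes it, and exact agreement shows how often the disagreement is zero rather than merely small.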
Honest boundaries
- The self-eval signals are label-free proxies: directional triggers, not ground truth. Excellent for gating and mining, not a certified score.
- It measures retrieval + assembly, not task success. A high overall means “good context was assembled,” not “the user got what they wanted.”
- overall is internally consistent, not a portable benchmark: compare it across your own arms, not against another tool’s number.
- Lexical faithfulness is a screen, not a verdict: real hallucination scoring needs the judged tier. The lexical proxy answers “do the answer’s terms appear in the context,” which is weaker (and weaker still on non-English text, where the default analyzer is English-tuned).
- Judged scores have a noise floor. gpt-4o-mini at temperature 0 is not deterministic through most providers: model-replica routing plus floating-point non-associativity gives ~20–30% per-case variance on borderline judgments. Aggregate metrics (Pearson r, MAE over a test set) average it out. Individual case scores don’t. Read the per-case scores as directional, not absolute.
- Single-prompt vs decomposed. The single-prompt path gives a vacuous 1.0 on refusal answers (“I don’t know” has no claims to contradict). decompose_faithfulness=True produces None for refusals, the right semantics, at the cost of a second LLM call. Default to decomposed when refusals are a real share of your traffic.
Full design and the “refraction, not independent measurement” rationale: ANSWER_QUALITY_EVAL. Calibration evidence + Ragas comparison: EVAL_JUDGED_CALIBRATION.