Evaluation in production

redhop.evaluate(query, ctx) is two tools in one, and the use case decides which one you reach for:

  • Self-eval signals: low_confidence, evidence_density, retained_evidence_ratio, second_hop_rescues, mean_grounding, estimated_waste_tokens. Always populated, zero LLM cost, deterministic. They aren’t re-measured: they’re the Decision Report refracted, so reading them is essentially free.
  • Judged metrics: faithfulness, relevancy, correctness. Real (cached) LLM calls. For offline depth, not the hot path.

The free deterministic tier is what makes the online patterns below possible. A judged eval that costs 3–4 LLM calls and varies run-to-run can’t live inside a request loop. A signal that’s free and reproducible can.

The highest-leverage pattern, and the one that costs nothing: check whether the retrieved context is worth answering from before you pay for the generation call. low_confidence is a literal “retrieval found nothing it trusts” flag, which makes it the perfect trigger to say so instead of letting the model guess.

import redhop

# `doc` (the indexed document) and `llm` (your model client) are assumed to exist already
ctx = doc.context(user_query)
report = redhop.evaluate(user_query, ctx)  # zero LLM cost: reads the Decision Report
if report.low_confidence:
    # retrieval found nothing it's confident about, so don't let the model hallucinate
    answer = "I don't have enough information to answer that reliably."
else:
    answer = llm.generate(ctx.text())  # only spend the call on context you trust

For a customer-facing site chatbot, that one branch is the difference between “I’ll connect you to support” and a confidently wrong answer. The signals are also on ctx.report directly (ctx.report.low_confidence_retrieval) if you’d rather skip the evaluate call entirely in the hot path.

The same gate lets an agent self-correct: if the first retrieval comes back low-confidence, reformulate the query and try again before committing to an answer. It is a cheap, deterministic loop around an otherwise-expensive step.

def answer_with_retry(doc, query, rephrase):
    # Lazy attempts: rephrase() only runs if the raw query comes back low-confidence
    for make_attempt in (lambda: query, lambda: rephrase(query)):
        attempt = make_attempt()
        ctx = doc.context(attempt)
        report = redhop.evaluate(attempt, ctx)  # free, so it can run on every iteration
        if not report.low_confidence:
            return llm.generate(ctx.text())
    return "I don't have enough information to answer that reliably."

A low score isn’t one problem. The reason there are six signals instead of one number is that their combination localizes the failure to a stage: retrieval (didn’t fetch it), assembly (fetched it, then buried or dropped it), or generation (had it, model still missed). That’s what saves you from debugging the wrong stage.

| What you see | Where it broke | What to do | Needs |
| --- | --- | --- | --- |
| low_confidence = true | Retrieval: nothing matched | semantic retrieval, expand the query, broaden, or answer “I don’t know” | free |
| context_recall low | Retrieval: the gold evidence never came back | wrong corpus / vocabulary gap | gold chunks |
| not low-confidence, but evidence_density low | Assembly: found it, it’s buried in junk (dilution) | prune harder, smaller chunks, tighter budget | free |
| retained_evidence_ratio low | Assembly: evidence was in the input and got dropped | strategy too aggressive, stop over-filtering | free |
| context_recall high + faithfulness low | Generation: context was fine, the model hallucinated | fix the prompt/model, not retrieval | gold + answer |
| estimated_waste_tokens high | Cost: paying for chunks below the bar | tune budget / pruning | free |

The pair that earns its keep is the fifth row: right context in, wrong answer out tells you to stop tuning the retriever and go fix your prompt. That is the question every RAG team otherwise answers with hours of guessing.
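
The table reads naturally as a small rule set. The sketch below is one way to encode it, not RedHop's own logic: the `Signals` container, the `triage` function, and every threshold (`low`, `high`, `waste_limit`) are hypothetical and would need tuning against your own traffic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Signals:
    """Hypothetical container that mirrors the signal names in the table."""
    low_confidence: bool
    evidence_density: float
    retained_evidence_ratio: float
    estimated_waste_tokens: int
    context_recall: Optional[float] = None  # only available with gold chunks
    faithfulness: Optional[float] = None    # only available with gold + answer


def triage(s: Signals, low: float = 0.5, high: float = 0.8,
           waste_limit: int = 500) -> List[str]:
    """Map a signal combination to the stage(s) it points at.

    All thresholds are illustrative assumptions, not library defaults.
    """
    if s.low_confidence:
        # Nothing trustworthy was retrieved, so later stages can't be judged.
        return ["retrieval: nothing matched"]
    findings = []
    if s.context_recall is not None and s.context_recall < low:
        findings.append("retrieval: gold evidence never came back")
    if s.evidence_density < low:
        findings.append("assembly: evidence diluted")
    if s.retained_evidence_ratio < low:
        findings.append("assembly: evidence dropped")
    if (s.context_recall is not None and s.faithfulness is not None
            and s.context_recall >= high and s.faithfulness < low):
        findings.append("generation: context fine, answer unfaithful")
    if s.estimated_waste_tokens > waste_limit:
        findings.append("cost: paying for low-value chunks")
    return findings


# Fifth-row case: good recall, poor faithfulness.
case = Signals(low_confidence=False, evidence_density=0.9,
               retained_evidence_ratio=0.9, estimated_waste_tokens=40,
               context_recall=0.95, faithfulness=0.2)
print(triage(case))
```

Returning a list rather than a single label keeps combined failures visible, for example diluted evidence and wasted tokens in the same request.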

Aggregate the gate signal over real traffic and it stops being a grade and becomes a backlog. summarize() rolls many reports into one health view. The low_confidence queries underneath it are the questions your corpus can’t answer yet. Cluster them and you have a ranked list of the docs to write next.

reports = [redhop.evaluate(q, doc.context(q)) for q in last_nights_queries]
summary = redhop.summarize(reports)
print(summary.low_confidence_rate) # e.g. 0.18 → 18% of traffic is ungrounded
# → collect the low-confidence queries, cluster them, and you've found your
# top missing-content topics — a content roadmap, derived from real misses.

This is the same loop whether you’re running a site chatbot, an internal knowledge base, or support deflection: the failures tell you what to add.
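
As a rough illustration of the mining step, the sketch below ranks the terms that recur across low-confidence queries. It is a crude term-frequency stand-in for real clustering, and the `missing_topics` helper, the stopword list, and the sample records are all made up for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "how", "do", "i", "for", "to", "is", "what", "after"}


def missing_topics(records, top_n=3):
    """Rank terms that recur in low-confidence queries.

    `records` is an iterable of (query, low_confidence) pairs, for example
    built from the reports collected over a night of traffic.
    """
    counts = Counter()
    for query, low_confidence in records:
        if not low_confidence:
            continue
        # Count each term once per query so one long query can't dominate.
        terms = dict.fromkeys(re.findall(r"[a-z0-9]+", query.lower()))
        counts.update(t for t in terms if t not in STOPWORDS)
    return counts.most_common(top_n)


records = [
    ("How do I rotate an API key", True),
    ("rotate key for staging", True),
    ("API key expired after rotate", True),
    ("what is the refund policy", False),
    ("export invoices to csv", True),
]
print(missing_topics(records, top_n=2))
```

Here the key-rotation terms surface first because three separate ungrounded queries mention them, which is the kind of signal that would put a key-rotation page at the top of the content backlog.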

A/B and CI regression: deterministic, zero-LLM

Because the lexical tier is a pure function of the text and the analyzer, the same inputs always produce the same score, so it can gate a pull request or compare two configs without an LLM in the loop. Pass gold_chunks to score one arm against the other:

ctx_a = doc.context(user_query) # baseline
ctx_b = doc.context_with_rewrites(user_query, [stripper, vocab]) # with a rewrite chain
a = redhop.evaluate(user_query, ctx_a, gold_chunks=gold_ids)
b = redhop.evaluate(user_query, ctx_b, gold_chunks=gold_ids)
lift = b.overall - a.overall # deterministic — safe for "block the PR if this drops"

The full detect → strip → A/B recipe lives on the Choosing a configuration page. For wiring eval into an HTTP service and your observability stack, see Deploy to production.
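
To turn the lift into a pull-request gate, something along these lines could work. It is a minimal sketch that assumes you have already collected one `overall` score per query for each arm; `mean_lift`, `passes_gate`, and the `max_drop` tolerance are hypothetical names and values, not part of RedHop.

```python
from statistics import mean


def mean_lift(baseline, candidate):
    """Average per-query difference (candidate - baseline) in overall score."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("both arms need the same non-empty set of queries")
    return mean(c - b for b, c in zip(baseline, candidate))


def passes_gate(baseline, candidate, max_drop=0.01):
    """True if the candidate config did not regress by more than max_drop."""
    return mean_lift(baseline, candidate) >= -max_drop


baseline_scores = [0.70, 0.80]   # illustrative `overall` values, arm A
candidate_scores = [0.75, 0.80]  # illustrative `overall` values, arm B
print(passes_gate(baseline_scores, candidate_scores))
```

In CI you would typically exit non-zero when `passes_gate` returns False. Because the scores are deterministic, a small tolerance is enough; there is no run-to-run noise to absorb.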

For one-off audits and model migrations (“did faithfulness regress when we upgraded the model?”), cost stops mattering and thoroughness wins. Add a judge to unlock the LLM-scored metrics, and wrap it with .cached() so re-runs over the same test set cost nothing:

judge = redhop.Judge.from_callable(my_llm_score_fn).cached()
report = redhop.evaluate(query, ctx, answer=model_answer, gold_answer=gold, judge=judge)
report.faithfulness_judged # LLM-scored: is every claim supported by the context?

A good funnel at scale: use the free faithfulness_lexical as a first-pass screen across everything, then escalate only the suspicious cases to the paid faithfulness_judged.
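
A minimal sketch of that funnel, assuming you already have a lexical score per case and some callable that runs the paid judge for one case. The `screen_then_judge` helper and the `suspicious_below` cutoff are hypothetical; pick the cutoff from your own score distribution.

```python
from typing import Callable, Dict, Optional


def screen_then_judge(lexical_scores: Dict[str, float],
                      judge_fn: Callable[[str], float],
                      suspicious_below: float = 0.6) -> Dict[str, Optional[float]]:
    """Run the paid judge only on cases the free lexical screen flags.

    Returns a judged score for escalated cases and None for cases that
    passed the screen and were never sent to the judge.
    """
    judged: Dict[str, Optional[float]] = {}
    for case_id, lexical in lexical_scores.items():
        judged[case_id] = judge_fn(case_id) if lexical < suspicious_below else None
    return judged


calls = []


def fake_judge(case_id: str) -> float:
    """Stand-in for a real LLM judge; records which cases were escalated."""
    calls.append(case_id)
    return 0.2


result = screen_then_judge({"q1": 0.95, "q2": 0.40, "q3": 0.88}, fake_judge)
print(result, calls)
```

Only `q2` reaches the judge here, so the judged cost scales with the number of suspicious cases rather than with the size of the test set.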

Catch paraphrase-hallucinations with claim decomposition

The single-prompt faithfulness_judged asks the LLM “is the answer faithful?” in one call. That’s fast, but it misses subtle hallucinations where the answer reads as faithful but smuggles in one unsupported fact (“X co-wrote Y in Z” where the context only supports the “X co-wrote Y” part). Set decompose_faithfulness=True to switch to the two-call path: extract atomic claims, then batch-verify each against the context. It is two LLM calls regardless of how many claims come out: one more call than the single-prompt path, for a much sharper result.

report = redhop.evaluate(
    query, ctx,
    answer=model_answer,
    judge=judge,
    decompose_faithfulness=True,
)
report.faithfulness_judged # mean of per-claim scores
report.n_faithfulness_claims_extracted # diagnostic: how many claims
report.n_faithfulness_claims_supported # diagnostic: how many ≥ 0.5

The diagnostics matter. A faithfulness of 0.5 that comes from “1 of 2 claims supported” is much more actionable than a bare “0.5”: it tells you exactly one claim slipped through, which you can find and fix.
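
The arithmetic behind those three fields can be sketched as follows. This is an illustration of the described behavior (mean of per-claim scores, a claim counted as supported at ≥ 0.5, and no score at all when nothing was extracted), not RedHop's implementation; `aggregate_claims` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def aggregate_claims(claim_scores: List[float],
                     supported_at: float = 0.5) -> Tuple[Optional[float], int, int]:
    """Return (faithfulness, n_extracted, n_supported) for one answer.

    With zero extracted claims (for example a refusal) the score is None
    rather than a vacuous 1.0.
    """
    extracted = len(claim_scores)
    if extracted == 0:
        return None, 0, 0
    supported = sum(1 for score in claim_scores if score >= supported_at)
    return sum(claim_scores) / extracted, extracted, supported


print(aggregate_claims([1.0, 0.0]))  # one supported claim, one unsupported
print(aggregate_claims([]))          # refusal: nothing to verify
```

The first call reproduces the "1 of 2 claims supported" case: the mean is 0.5, and the counters show which half of the answer to go look at.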

For correctness, the same trick maps onto answer-vs-gold: decompose_correctness=True extracts claims from both the answer and the gold, classifies each as TP / FP / FN, returns F₁, and exposes the counters so you can see “the answer covered 3 of the gold’s 4 facts and added one not in the gold.”

report = redhop.evaluate(
    query, ctx, answer=model_answer, gold_answer=gold, judge=judge,
    decompose_correctness=True,
)
report.correctness_judged # F₁
report.n_correctness_tp, report.n_correctness_fp, report.n_correctness_fn
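
For reference, the F₁ above follows from the three counters by the standard formula F₁ = 2·TP / (2·TP + FP + FN). The helper below is a hypothetical sketch; in particular, returning None when all counters are zero is an assumption of this example, not documented library behavior.

```python
from typing import Optional


def f1_from_counts(tp: int, fp: int, fn: int) -> Optional[float]:
    """Standard F1 from claim-level true/false positives and false negatives."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return None  # assumption: no claims on either side means no score
    return 2 * tp / denominator


# "Covered 3 of the gold's 4 facts and added one not in the gold":
# TP = 3 (shared facts), FN = 1 (gold fact missed), FP = 1 (extra fact).
print(f1_from_counts(tp=3, fp=1, fn=1))  # 0.75
```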

Refusal handling: when the answer is “I don’t know” the extractor returns zero claims and faithfulness_judged is None (not a vacuous 1.0). That keeps refusals as a distinct category in summaries instead of inflating your faithfulness mean.

For qualities that aren’t faithfulness or correctness (harmfulness, conciseness, brand voice, factuality on facts you don’t have a gold answer for), use critique(). One judge call per aspect, polarity-corrected so high = good across the report regardless of an aspect’s natural direction (“harmfulness” is naturally bad-when-high, so the report inverts it).

report = redhop.critique(
    model_answer,
    aspects=[
        redhop.Aspect("conciseness", "Is the answer concise?"),
        redhop.Aspect("harmfulness", "Is the answer harmful?", high_is_good=False),
        redhop.Aspect("brand_voice", "Is the answer in our brand voice?"),
    ],
    judge=judge,
)
for name, score in report.scores:
    print(f"{name:<14} {score:.2f}")  # high = good for all three

Same Judge.from_callable(...).cached() plumbing as evaluate, no new primitive to learn.
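
Polarity correction is easy to picture with a small sketch. It assumes judge scores lie in [0, 1] and that inversion means `1 - score`; both are assumptions of this example, and `polarity_corrected` is a hypothetical helper rather than a RedHop function.

```python
def polarity_corrected(raw: float, high_is_good: bool = True) -> float:
    """Flip bad-when-high aspects so that a high score always means good.

    Assumes raw judge scores are in [0, 1].
    """
    if not 0.0 <= raw <= 1.0:
        raise ValueError("expected a score in [0, 1]")
    return raw if high_is_good else 1.0 - raw


# The judge says "harmfulness = 0.25" (mostly harmless); the report would
# then show 0.75 so that it reads the same way as conciseness or brand voice.
print(polarity_corrected(0.25, high_is_good=False))  # 0.75
print(polarity_corrected(0.80))                      # 0.8
```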

If you’re already running Ragas: on the one metric both libraries ship, claim-decomposed faithfulness, the two are substantively equivalent. Across n=200 HotpotQA cases, with both judged by gpt-4o-mini, they show Pearson r = +0.664, MAE = 0.151, and 61% perfect agreement. With Claude haiku as an independent third judge, neither library is unambiguously “more correct” on contested cases. They tie at scale.

Where they differ is philosophy: RedHop’s eval lives in-process with the runtime (same primitives the Decision Report uses, no embedder dependency, single EvalReport return), while Ragas is a broader framework with more metric families and integrations. Full head-to-head + numbers: RedHop vs Ragas.
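
If you want to run the same kind of comparison on your own test set, the three agreement figures can be computed from two parallel lists of per-case scores. This is a generic sketch with made-up data: the `agreement` helper is hypothetical, and reading "perfect agreement" as the share of cases with identical scores is an assumption here.

```python
from math import sqrt


def agreement(a, b, tol=1e-9):
    """Pearson r, mean absolute error, and exact-match rate for two score lists."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two equally long lists with at least 2 cases")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("Pearson r is undefined when one list is constant")
    return {
        "pearson_r": cov / sqrt(var_a * var_b),
        "mae": sum(abs(x - y) for x, y in zip(a, b)) / n,
        "exact_agreement": sum(1 for x, y in zip(a, b) if abs(x - y) <= tol) / n,
    }


tool_a = [1.0, 0.5, 0.0, 1.0]  # illustrative per-case scores, not real results
tool_b = [1.0, 0.5, 0.5, 0.5]
stats = agreement(tool_a, tool_b)
print(stats)
```

Because these are aggregates over the whole set, they are far more stable than any individual pair of per-case scores.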

  • The self-eval signals are label-free proxies: directional triggers, not ground truth. Excellent for gating and mining, not a certified score.
  • It measures retrieval + assembly, not task success. A high overall means “good context was assembled,” not “the user got what they wanted.”
  • overall is internally consistent, not a portable benchmark: compare it across your own arms, not against another tool’s number.
  • Lexical faithfulness is a screen, not a verdict: real hallucination scoring needs the judged tier. The lexical proxy answers “do the answer’s terms appear in the context,” which is weaker (and weaker still on non-English text, where the default analyzer is English-tuned).
  • Judged scores have a noise floor. gpt-4o-mini at temperature 0 is not deterministic through most providers: model-replica routing plus floating-point non-associativity gives ~20–30% per-case variance on borderline judgments. Aggregate metrics (Pearson r, MAE over a test set) average it out. Individual case scores don’t. Read the per-case scores as directional, not absolute.
  • Single-prompt vs decomposed. The single-prompt path gives a vacuous 1.0 on refusal answers (“I don’t know” has no claims to contradict). decompose_faithfulness=True produces None for refusals, the right semantics, at the cost of a second LLM call. Default to decomposed when refusals are a real share of your traffic.

Full design and the “refraction, not independent measurement” rationale: ANSWER_QUALITY_EVAL. Calibration evidence + Ragas comparison: EVAL_JUDGED_CALIBRATION.