Evaluation in production

redhop.evaluate(query, ctx) is two tools in one, and your use case decides which one you reach for:

Self-eval signals: low_confidence, evidence_density, retained_evidence_ratio, second_hop_rescues, mean_grounding, estimated_waste_tokens. Always populated, zero LLM cost, deterministic. They aren’t re-measured: they’re the Decision Report refracted, so reading them is essentially free.
Judged metrics: faithfulness, relevancy, correctness. Real (cached) LLM calls. For offline depth, not the hot path.

The free deterministic tier is what makes the online patterns below possible. A judged eval that costs 3–4 LLM calls and varies run-to-run can’t live inside a request loop. A signal that’s free and reproducible can.

Gate before you generate

The highest-leverage pattern, and the one that costs nothing: check whether the retrieved context is worth answering from before you pay for the generation call. low_confidence is a literal “retrieval found nothing it trusts” flag, which makes it the perfect trigger to say so instead of letting the model guess.

Python

import redhop

ctx = doc.context(user_query)
report = redhop.evaluate(user_query, ctx)   # zero LLM cost: reads the Decision Report

if report.low_confidence:
    # retrieval found nothing it's confident about, so don't let the model hallucinate
    answer = "I don't have enough information to answer that reliably."
else:
    answer = llm.generate(ctx.text())        # only spend the call on context you trust

Node.js

const ctx = doc.context(userQuery);
const report = redhop.evaluate(userQuery, ctx);   // zero LLM cost

const answer = report.lowConfidence
  ? "I don't have enough information to answer that reliably."
  : await llm.generate(ctx.text());

For a customer-facing site chatbot, that one branch is the difference between “I’ll connect you to support” and a confidently wrong answer. The signals are also on ctx.report directly (ctx.report.low_confidence_retrieval) if you’d rather skip the evaluate call entirely in the hot path.

Let an agent improvise its retrieval

The same gate lets an agent self-correct: if the first retrieval comes back low-confidence, reformulate the query and try again before committing to an answer. It is a cheap, deterministic loop around an otherwise expensive step.

def answer_with_retry(doc, query, rephrase):
    attempt = query                               # first pass: the raw query
    for is_last_try in (False, True):
        ctx = doc.context(attempt)
        report = redhop.evaluate(attempt, ctx)    # free, so it can run on every iteration
        if not report.low_confidence:
            return llm.generate(ctx.text())
        if not is_last_try:
            attempt = rephrase(query)             # reformulate only after a miss
    return "I don't have enough information to answer that reliably."

Diagnose why a result was bad

A low score isn’t one problem. The reason there are six signals instead of one number is that their combination localizes the failure to a stage: retrieval (didn’t fetch it), assembly (fetched it, then buried or dropped it), or generation (had it, model still missed). That’s what saves you from debugging the wrong stage.

What you see	Where it broke	What to do	Needs
`low_confidence = true`	Retrieval: nothing matched	add semantic retrieval, expand or broaden the query, or answer “I don’t know”	free
`context_recall` low	Retrieval: the gold evidence never came back	check for a wrong corpus or a vocabulary gap	gold chunks
`set_coverage` low (but `context_recall` looks fine)	Retrieval: a query maps to a set and a whole variant family is half-retrieved, so it can’t be offered for disambiguation	boost the discriminating field, raise `candidate_k`, or fetch the family explicitly	gold families
not low-confidence, but `evidence_density` low	Assembly: found it, it’s buried in junk (dilution)	prune harder, smaller chunks, tighter budget	free
`retained_evidence_ratio` low	Assembly: evidence was in the input and got dropped	the strategy is too aggressive: stop over-filtering	free
`context_recall` high + `faithfulness` low	Generation: context was fine, the model hallucinated	fix the prompt/model, not retrieval	gold + answer
`estimated_waste_tokens` high	Cost: paying for chunks below the bar	tune budget / pruning	free

The pair that earns its keep is the sixth row: right context in, wrong answer out tells you to stop tuning the retriever and go fix your prompt. That is the question every RAG team otherwise answers with hours of guessing.
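
The table can also be read as a decision procedure. Here is a minimal sketch of that reading in plain Python, assuming the signals have already been copied off a report. The Signals container, the triage function, and both thresholds are hypothetical illustrations rather than part of RedHop, and the cutoffs would need tuning against your own traffic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    """Hypothetical container for values copied off an evaluation report."""
    low_confidence: bool
    evidence_density: float
    retained_evidence_ratio: float
    context_recall: Optional[float] = None   # only available with gold chunks
    faithfulness: Optional[float] = None     # only available with gold + answer


def triage(s: Signals, low: float = 0.3, high: float = 0.8) -> str:
    """Return the stage the table points at. Both thresholds are assumptions."""
    if s.low_confidence:
        return "retrieval: nothing matched"
    if s.context_recall is not None and s.context_recall < low:
        return "retrieval: gold evidence never came back"
    if s.retained_evidence_ratio < low:
        return "assembly: evidence was dropped"
    if s.evidence_density < low:
        return "assembly: evidence is diluted"
    if (s.context_recall is not None and s.faithfulness is not None
            and s.context_recall >= high and s.faithfulness < low):
        return "generation: context was fine, the answer was not"
    return "no stage flagged"


# Right context in, wrong answer out
print(triage(Signals(False, 0.9, 0.9, context_recall=0.95, faithfulness=0.1)))
# generation: context was fine, the answer was not
```

Checking retrieval first, then assembly, then generation mirrors the pipeline order, so the earliest broken stage is the one reported.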

Mine failures into a content roadmap

Aggregate the gate signal over real traffic and it stops being a grade and becomes a backlog. summarize() rolls many reports into one health view. The low_confidence queries underneath it are the questions your corpus can’t answer yet. Cluster them and you have a ranked list of the docs to write next.

reports = [redhop.evaluate(q, doc.context(q)) for q in last_nights_queries]
summary = redhop.summarize(reports)

print(summary.low_confidence_rate)   # e.g. 0.18 → 18% of traffic is ungrounded
# → collect the low-confidence queries, cluster them, and you've found your
#   top missing-content topics — a content roadmap, derived from real misses.

This is the same loop whether you’re running a site chatbot, an internal knowledge base, or support deflection: the failures tell you what to add.
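
One way to turn the low-confidence queries into a ranked list is sketched below. It assumes you kept each query next to its low_confidence flag. The rank_missing_topics helper and its crude keyword counting are hypothetical stand-ins for whatever clustering you actually use, not a RedHop feature.

```python
from collections import Counter

# Small illustrative stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "the", "how", "do", "i", "is", "my", "to", "what", "for", "can"}


def rank_missing_topics(results, top_n=3):
    """Rank keywords by how many low-confidence queries mention them.

    `results` is an iterable of (query, low_confidence) pairs.
    Keyword counting is a crude stand-in for real clustering.
    """
    counts = Counter()
    for query, low_confidence in results:
        if not low_confidence:
            continue                      # answered fine, so not a content gap
        words = {w.strip("?.,!").lower() for w in query.split()}
        counts.update(w for w in words if w and w not in STOPWORDS)
    return counts.most_common(top_n)


results = [
    ("How do I rotate my API key?", True),
    ("What is the API key expiry?", True),
    ("How do I reset my password?", False),
    ("Can I export invoices?", True),
]
print(rank_missing_topics(results, top_n=2))   # "api" and "key", each seen in 2 queries
```

Each query contributes a keyword at most once, so one long query cannot dominate the ranking.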

A/B and CI regression: deterministic, zero-LLM

Because the lexical tier is a pure function of the text and the analyzer, the same inputs always produce the same score, so it can gate a pull request or compare two configs without an LLM in the loop. Pass gold_chunks to score one arm against the other:

ctx_a = doc.context(user_query)                              # baseline
ctx_b = doc.context_with_rewrites(user_query, [stripper, vocab])  # with a rewrite chain

a = redhop.evaluate(user_query, ctx_a, gold_chunks=gold_ids)
b = redhop.evaluate(user_query, ctx_b, gold_chunks=gold_ids)
lift = b.overall - a.overall          # deterministic — safe for "block the PR if this drops"

The full detect → strip → A/B recipe lives on the Choosing a configuration page. For wiring eval into an HTTP service and your observability stack, see Deploy to production.
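
As a rough illustration of the "block the PR if this drops" idea, the sketch below compares the mean overall score of two arms across a fixed query set. It assumes you have already collected the per-query overall values for each arm. The regression_gate function and its max_drop tolerance are hypothetical, not RedHop defaults.

```python
def regression_gate(baseline_scores, candidate_scores, max_drop=0.01):
    """Compare mean scores of two arms and return (passed, lift).

    Inputs are the per-query `overall` values of each arm, in the same order.
    `max_drop` is a hypothetical tolerance, not a RedHop default.
    """
    if len(baseline_scores) != len(candidate_scores) or not baseline_scores:
        raise ValueError("both arms need the same non-empty query set")
    mean_a = sum(baseline_scores) / len(baseline_scores)
    mean_b = sum(candidate_scores) / len(candidate_scores)
    lift = mean_b - mean_a
    return lift >= -max_drop, lift


passed, lift = regression_gate([0.70, 0.80, 0.60], [0.72, 0.79, 0.65])
print(f"lift={lift:+.3f} passed={passed}")   # lift=+0.020 passed=True
```

In CI you would turn a failed gate into a non-zero exit code. Because the inputs are deterministic, a failure points at the change under review rather than at judge noise.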

Set-valued queries: did you retrieve the whole family?

A single query often legitimately maps to a set — every size or flavor of a product, all regional overrides of a clause, both evidence chains of a multi-hop question. context_recall against a single flat gold hides a half-retrieved set: it reads a healthy 0.67 while a whole family is missing, so the user can’t be offered the options to disambiguate. Pass gold_families to score strict set-coverage — the fraction of families fully present in the context:

r = redhop.evaluate(
    query, ctx,
    gold_families=[["sku_a1", "sku_a2"], ["sku_b1", "sku_b2"]],  # Node: goldFamilies
)
print(r.set_coverage)   # 1.0 only if every family is fully retrieved

In testing, this caught families that a recall@20 of 1.000 reported as fine but that could not actually be offered for disambiguation (CATALOG_REGIME). summarize(reports).mean_set_coverage aggregates it across a workload.
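
To see why the two numbers diverge, here is a small worked example in plain Python. It assumes flat recall is the fraction of gold chunk ids retrieved and that set coverage is the fraction of families fully present, as described above. Both helper functions are illustrative re-implementations, not RedHop's code, and the SKU ids are made up.

```python
def flat_recall(retrieved, gold_families):
    """Fraction of all gold chunk ids present, ignoring family boundaries."""
    gold = {cid for family in gold_families for cid in family}
    return len(gold & set(retrieved)) / len(gold)


def strict_set_coverage(retrieved, gold_families):
    """Fraction of families whose members are ALL present in the context."""
    got = set(retrieved)
    complete = sum(1 for family in gold_families if set(family) <= got)
    return complete / len(gold_families)


families = [["sku_a1", "sku_a2", "sku_a3"], ["sku_b1", "sku_b2", "sku_b3"]]
retrieved = ["sku_a1", "sku_a2", "sku_b1", "sku_b2"]   # one member short in each family

print(round(flat_recall(retrieved, families), 2))      # 0.67
print(strict_set_coverage(retrieved, families))        # 0.0
```

Four of six gold chunks came back, so recall looks healthy, yet neither family is complete enough to list its options.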

Where the judged tier earns its place

For one-off audits and model migrations (“did faithfulness regress when we upgraded the model?”), cost stops mattering and thoroughness wins. Add a judge to unlock the LLM-scored metrics, and wrap it with .cached() so re-runs over the same test set cost nothing:

judge = redhop.Judge.from_callable(my_llm_score_fn).cached()
report = redhop.evaluate(query, ctx, answer=model_answer, gold_answer=gold, judge=judge)
report.faithfulness_judged    # LLM-scored: is every claim supported by the context?

A good funnel at scale: use the free faithfulness_lexical as a first-pass screen across everything, then escalate only the suspicious cases to the paid faithfulness_judged.
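
The funnel might look like the sketch below. It assumes you supply two callables, one returning the free lexical score and one returning the paid judged score for a case. The screen_then_judge function and the suspicious_below cutoff are hypothetical, and the stub scorers exist only to make the example runnable.

```python
def screen_then_judge(cases, lexical_score, judged_score, suspicious_below=0.6):
    """Two-stage funnel: cheap screen on everything, paid judge on the rest.

    `lexical_score` and `judged_score` are callables you supply.
    `suspicious_below` is a hypothetical cutoff, not a RedHop default.
    Returns {case: (lexical, judged_or_None)}.
    """
    results = {}
    for case in cases:
        lexical = lexical_score(case)
        # Only pay for the judge when the free screen looks suspicious.
        judged = judged_score(case) if lexical < suspicious_below else None
        results[case] = (lexical, judged)
    return results


# Stub scorers standing in for the real lexical and judged evaluations.
lexical_stub = {"q1": 0.9, "q2": 0.4}.get
judge_calls = []

def judge_stub(case):
    judge_calls.append(case)      # record which cases reached the paid tier
    return 0.2

print(screen_then_judge(["q1", "q2"], lexical_stub, judge_stub))
# {'q1': (0.9, None), 'q2': (0.4, 0.2)}
print(judge_calls)                # ['q2']
```

In practice the two callables would read faithfulness_lexical and faithfulness_judged off their respective reports.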

Catch paraphrase-hallucinations with claim decomposition

The single-prompt faithfulness_judged asks the LLM “is the answer faithful?” in one call. That’s fast, but it misses subtle hallucinations where the answer reads faithful but smuggles in one unsupported fact (“X co-wrote Y in Z” where the context only supports the “X co-wrote Y” part). Set decompose_faithfulness=True to switch to the two-call path: extract atomic claims, then batch-verify each against the context. That is two LLM calls regardless of how many claims came out: a fixed cost of one call more than the single-prompt path, and much sharper.

report = redhop.evaluate(
    query, ctx,
    answer=model_answer,
    judge=judge,
    decompose_faithfulness=True,
)
report.faithfulness_judged                       # mean of per-claim scores
report.n_faithfulness_claims_extracted           # diagnostic: how many claims
report.n_faithfulness_claims_supported           # diagnostic: how many ≥ 0.5

The diagnostics matter. A faithfulness of 0.5 that comes from “1 of 2 claims supported” is much more actionable than “0.5” alone, because it tells you exactly one claim slipped through, which you can find and fix.
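
To make that concrete, the sketch below derives the three numbers from a list of per-claim scores. It assumes the score is the mean of per-claim scores and that a claim counts as supported at 0.5 or above, as described above. The summarize_claims helper is illustrative, not RedHop's implementation.

```python
def summarize_claims(claim_scores, supported_at=0.5):
    """Return (mean score, n_extracted, n_supported) for per-claim scores.

    The mean is None when no claims were extracted (for example, a refusal).
    """
    n = len(claim_scores)
    if n == 0:
        return None, 0, 0
    supported = sum(1 for s in claim_scores if s >= supported_at)
    return sum(claim_scores) / n, n, supported


print(summarize_claims([1.0, 0.0]))   # (0.5, 2, 1): one claim is unsupported
print(summarize_claims([0.5, 0.5]))   # (0.5, 2, 2): both claims are borderline
```

Both answers score 0.5, but only the counters show that the first hides one clearly unsupported claim.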

For correctness, the same trick maps onto answer-vs-gold: decompose_correctness=True extracts claims from both the answer and the gold, classifies each as TP / FP / FN, returns F₁, and exposes the counters so you can see “the answer covered 3 of the gold’s 4 facts and added one not in the gold.”

report = redhop.evaluate(
    query, ctx, answer=model_answer, gold_answer=gold, judge=judge,
    decompose_correctness=True,
)
report.correctness_judged                        # F₁
report.n_correctness_tp, report.n_correctness_fp, report.n_correctness_fn
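
For the example quoted above (3 of the gold’s 4 facts covered, one extra fact added), the counters are TP = 3, FN = 1, FP = 1. The sketch below applies the standard F₁ formula to them. The claim_f1 helper is illustrative, and returning 0.0 when nothing matches is an assumption about edge-case handling, not documented RedHop behavior.

```python
def claim_f1(tp, fp, fn):
    """F1 from claim counts: TP matched, FP extra in the answer, FN missed from gold."""
    if tp == 0:
        return 0.0                      # assumption: no matches scores zero
    precision = tp / (tp + fp)          # share of the answer's claims that are in the gold
    recall = tp / (tp + fn)             # share of the gold's claims the answer covered
    return 2 * precision * recall / (precision + recall)


# "covered 3 of the gold's 4 facts and added one not in the gold"
print(claim_f1(tp=3, fp=1, fn=1))       # 0.75
```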

Refusal handling: when the answer is “I don’t know” the extractor returns zero claims and faithfulness_judged is None (not a vacuous 1.0). That keeps refusals as a distinct category in summaries instead of inflating your faithfulness mean.
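
If you aggregate judged scores yourself, the same rule applies: leave None out of the mean and count it separately. A minimal sketch, with a hypothetical faithfulness_summary helper and made-up scores:

```python
def faithfulness_summary(scores):
    """Mean over answered cases plus a separate refusal count.

    `scores` holds one judged faithfulness value per case, None for refusals.
    """
    answered = [s for s in scores if s is not None]
    refusals = len(scores) - len(answered)
    mean = sum(answered) / len(answered) if answered else None
    return mean, refusals


print(faithfulness_summary([1.0, None, 0.5, None]))   # (0.75, 2)
```

Scoring the two refusals as 1.0 instead would have lifted the mean to 0.875 and hidden them entirely.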

Critique open-ended dimensions

For qualities that aren’t faithfulness or correctness (harmfulness, conciseness, brand voice, factuality on facts you don’t have a gold answer for), use critique(). One judge call per aspect, polarity-corrected so high = good across the report regardless of an aspect’s natural direction (“harmfulness” is naturally bad-when-high, so the report inverts it).

report = redhop.critique(
    model_answer,
    aspects=[
        redhop.Aspect("conciseness",  "Is the answer concise?"),
        redhop.Aspect("harmfulness",  "Is the answer harmful?", high_is_good=False),
        redhop.Aspect("brand_voice",  "Is the answer in our brand voice?"),
    ],
    judge=judge,
)
for name, score in report.scores:
    print(f"{name:<14} {score:.2f}")    # high = good for all three

Same Judge.from_callable(...).cached() plumbing as evaluate, no new primitive to learn.
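
Polarity correction can be pictured as a one-line transform. The sketch below assumes raw judge scores lie in [0, 1] and that inversion means 1 minus the raw score. Both are assumptions made for illustration, and polarity_correct is a hypothetical helper, not a RedHop function.

```python
def polarity_correct(raw_score, high_is_good=True):
    """Map a raw judge score in [0, 1] so that high always means good."""
    if not 0.0 <= raw_score <= 1.0:
        raise ValueError("expected a score in [0, 1]")
    return raw_score if high_is_good else 1.0 - raw_score


print(polarity_correct(0.25, high_is_good=False))   # 0.75: low harmfulness reads as good
print(polarity_correct(0.75))                       # 0.75: conciseness is left unchanged
```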

How does this compare to Ragas?

If you’re already running Ragas: on the one metric both libraries ship, claim-decomposed faithfulness, the two are calibrated as substantively equivalent on n=200 HotpotQA cases: Pearson r = +0.664, MAE = 0.151, 61% perfect agreement, both judged by gpt-4o-mini. With Claude haiku as an independent third judge, neither library is unambiguously “more correct” on contested cases. They tie at scale.

Where they differ is philosophy: RedHop’s eval lives in-process with the runtime (same primitives the Decision Report uses, no embedder dependency, single EvalReport return), while Ragas is a broader framework with more metric families and integrations. Full head-to-head + numbers: RedHop vs Ragas.

Honest boundaries

The self-eval signals are label-free proxies: directional triggers, not ground truth. Excellent for gating and mining, not a certified score.
The self-eval tier measures retrieval + assembly, not task success. A high overall means “good context was assembled,” not “the user got what they wanted.”
overall is internally consistent, not a portable benchmark: compare it across your own arms, not against another tool’s number.
Lexical faithfulness is a screen, not a verdict: real hallucination scoring needs the judged tier. The lexical proxy answers “do the answer’s terms appear in the context,” which is weaker (and weaker still on non-English text, where the default analyzer is English-tuned).
Judged scores have a noise floor. gpt-4o-mini at temperature 0 is not deterministic through most providers: model-replica routing plus floating-point non-associativity gives ~20–30% per-case variance on borderline judgments. Aggregate metrics (Pearson r, MAE over a test set) average it out. Individual case scores don’t. Read the per-case scores as directional, not absolute.
Single-prompt vs decomposed. The single-prompt path gives a vacuous 1.0 on refusal answers (“I don’t know” has no claims to contradict). decompose_faithfulness=True produces None for refusals, the right semantics, at the cost of a second LLM call. Default to decomposed when refusals are a real share of your traffic.

Full design and the “refraction, not independent measurement” rationale: ANSWER_QUALITY_EVAL. Calibration evidence + Ragas comparison: EVAL_JUDGED_CALIBRATION.