RAG & Retrieval
IC4 · IC5 · IC6

RAG Evaluation: Measuring Retrieval and Generation Separately

RAG is two systems in a trench coat — a retriever and a generator — and you must score them apart, because a perfectly faithful answer over the wrong context is still wrong and retrieval is almost always where it breaks.

Prerequisites: rag/foundations, rag/hybrid-search-and-rerank, what recall@k and cosine similarity mean, LLM-as-judge basics

1. Quick anchor

A RAG system is two systems wearing one trench coat: a retriever that fetches context and a generator that writes an answer from it. The single most important move in evaluating RAG is to refuse to score them together. A blended "is the answer good?" number tells you nothing actionable: it conflates "we never fetched the right chunk" with "we fetched it and the model ignored it," and those have opposite fixes. Retrieval metrics (recall@k, nDCG, MRR, hit-rate) answer "did we put the right material in front of the model?"; generation metrics (faithfulness, answer relevancy) answer "did the model use that material honestly and completely?" The empirical prior every senior engineer carries into the room is that retrieval is usually the bottleneck, not generation, so when in doubt, instrument retrieval first. None of this works without a golden set of real queries with labeled relevant chunks, and a regression gate in CI is the line between shipping and praying.

2. Why interviewers probe this

RAG eval is where "I read the LangChain quickstart" engineers and "I have run this in production" engineers separate cleanly. Building a retrieval+generation pipeline is a weekend; knowing whether it got better is the actual job.

At IC4, the tells are competence:

  • Names retrieval vs generation metrics distinctly and doesn't confuse recall@k with faithfulness.
  • Knows you need a labeled golden set, not eyeballing 10 outputs.
  • Can wire up RAGAS or a hand-rolled recall@k and read the numbers.

At IC5, the tells are diagnosis and judgment:

  • Decomposes a failure into retrieval-miss vs generation-fault as a reflex, and knows retrieval is the usual culprit.
  • Understands LLM-as-judge for what it is: a cheap, biased, useful ranker — and validates it against human labels before trusting it.
  • Builds the golden set from production traffic, separates retrieval and generation gates, and keeps CI fast.

At IC6, the tells are systems and org thinking:

  • Treats eval as a product surface and a process: label budget, sampling strategy, judge-vs-human agreement (Cohen's κ), drift when the judge model upgrades.
  • Knows when offline metrics diverge from online A/B reality and designs the closed loop that catches it.
  • Reasons about metric gaming, eval-set overfitting, and the dollar cost of a regression that slips a gate. Can argue why a given threshold, not just assert one.

3. Concept build-up

3.1 Why you must split the pipeline

Consider the canonical pipeline: ingest → chunk → embed → index, then query → (rewrite) → retrieve → fuse → rerank → assemble context → generate → cite (see /rag/foundations). A wrong final answer has exactly two root-cause families:

  1. The needed information was never retrieved. The chunk doesn't exist, was chunked badly, the embedding missed the paraphrase, BM25 missed the rare token, or the reranker buried it below the top-k cutoff. The generator never had a chance.
  2. The information was retrieved but the answer is still wrong. The model hallucinated past the context, ignored a chunk sitting at position 8 ("lost in the middle"), answered a different question, or got confused by distractor chunks.

These demand different interventions. Family (1) → fix chunking, embeddings, hybrid search and reranking, or query transforms. Family (2) → fix the prompt, the model, context ordering, or the number of chunks you stuff. If your only metric is "answer quality," you cannot tell which lever to pull, and you will waste a sprint tuning the generator when the bug was in the retriever (or vice versa). Separable metrics turn a guessing game into a decision tree.
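The decision tree can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a prescribed implementation: the function name `triage` and both thresholds are hypothetical, and the inputs are per-query scores you would already have computed.

```python
def triage(recall_at_k: float, faithfulness: float,
           recall_floor: float = 0.8, faithfulness_floor: float = 0.9) -> str:
    """Attribute a bad answer to a pipeline layer from two per-query scores.

    Both floors are illustrative assumptions, not recommended values.
    """
    if recall_at_k < recall_floor:
        # The needed chunks never reached the generator: family (1).
        return "retrieval"
    if faithfulness < faithfulness_floor:
        # The context was present but the answer is not grounded in it: family (2).
        return "generation"
    # Neither metric flags the query; the answer can still be wrong
    # (for example, faithful to an outdated chunk).
    return "unflagged"
```

Retrieval is checked first on purpose: a low faithfulness score means little when the context never arrived.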

3.2 Retrieval metrics from first principles

Beginner explainer: New here? RAG evaluation fundamentals

The big picture first. RAG has two halves: a retriever that pulls chunks, and a generator that writes an answer. To know which half is broken, you need separate scores — one for retrieval, one for generation. Retrieval metrics ask "did we give the model the right stuff?" Generation metrics ask "did the model use it honestly?"

The key terms.

  • Recall@k — what fraction of the correct chunks did the retriever surface in the top k results?
  • Precision@k — what fraction of the top k results are actually correct?
  • nDCG — a rank-aware score that rewards putting good chunks near the top, and penalizes burying them.
  • Golden set — your labeled dataset: real queries, and for each one, which chunks should be retrieved.
  • Graded relevance — a chunk isn't just "relevant" or "not"; it can be "very relevant" (score 3) or "somewhat relevant" (score 1).

Step by step.

  1. Build or mine a golden set: take real queries from your system and label which chunks are correct for each.
  2. Run your retriever on each query; collect the top k results and their order.
  3. For each metric (recall, precision, nDCG), compute the score by comparing what you retrieved vs. what was labeled correct.
  4. Average the scores across all queries.
  5. When you change the retriever (new embedding model, reranker, hybrid search), re-run the metrics; if they go up, you improved; if down, you regressed.
  6. Gate your CI/CD so PRs that drop recall by more than a fixed margin (for example, 2 percentage points against the committed baseline) are rejected; this prevents silent degradation.
  7. Track the metrics over time so you spot drift if your query distribution shifts.

Remember this: Retrieval metrics are cheap to compute, so run them on every PR. Generation metrics cost real money (LLM calls), so run only a small subset on PRs and reserve the full run for nightly.

Retrieval is information retrieval — a 50-year-old field with crisp metrics. You need a golden set mapping each query to its set of relevant chunk IDs. Then:

  • Recall@k — fraction of all relevant chunks that appear in the top-k. This is the metric that matters most for RAG, because a chunk the retriever never surfaces is invisible to the generator. If recall@10 averages 0.6, then 40% of the relevant chunks never reach the generator, and any answer that depends on them is unattainable no matter how good your model is.
Recall@k — on real numbers

Every symbol named plain. Recall@k asks: of all the chunks that should be retrieved, what fraction did we actually get in the top k? The numerator is the count of correct chunks we found; the denominator is the count of all correct chunks (whether we found them or not).

Concrete example. Imagine the user asks "What are the benefits of the Enterprise plan?" Your golden set labels three chunks as relevant: chunk-5 (SSO), chunk-12 (audit logs), and chunk-18 (dedicated support). Your retriever returns the top 10, and you find chunk-5 at rank 2 and chunk-12 at rank 7, but chunk-18 never appears in the top 10.

Recall@10 = (chunks you found) / (chunks that exist) = 2 / 3 ≈ 0.67.

Interpretation: two-thirds of the correct context made it into the top 10; one-third was buried or missed, so any part of the answer that depends on chunk-18 (dedicated support) is unreachable, because the model never sees it. If recall@10 sits around 0.67 across your golden set, you have a retrieval problem: upgrade the embedding model, add hybrid search, or improve the reranker.

  • Precision@k — fraction of the top-k that are relevant. Matters less for RAG than recall, but high precision means less noise (and less token cost) handed to the generator.
  • Hit-rate@k (a.k.a. hit@k) — fraction of queries with at least one relevant chunk in top-k. A coarse but intuitive "did we get anything useful?" gauge.
  • MRR (Mean Reciprocal Rank) — average of 1/rank of the first relevant chunk. Rewards putting a relevant chunk near the top. Good proxy for single-answer lookups.
  • nDCG@k (normalized Discounted Cumulative Gain) — the rank-aware metric. DCG@k = Σ rel_i / log2(i+1), normalized by the ideal ordering (nDCG = DCG/IDCG). Handles graded relevance (a chunk can be "very relevant" = 3, "somewhat" = 1) and penalizes burying good chunks. This is the metric to watch when you add a reranker, because reranking is precisely about ordering the top candidates.
nDCG — discounted rank, come alive

What each symbol means.

  • rel_i = relevance score of the chunk at position i (e.g., 3 = very relevant, 1 = somewhat, 0 = not relevant).
  • The denominator log2(i+1) is the discount: position 1 gets discount 1/log2(2) = 1.0, position 2 gets 1/log2(3) ≈ 0.63, position 10 gets 1/log2(11) ≈ 0.3. Burying a good chunk deeper costs you more.
  • DCG@k sums up the discounted relevance of your top k results.
  • IDCG (ideal DCG) is what you'd get if the results were perfectly ranked (all high-relevance chunks first).
  • nDCG = DCG / IDCG normalizes to 0–1, so it's comparable across queries.

Concrete example. You retrieve 5 chunks for a query. Your golden set grades them: chunk-A is "very relevant" (3), chunk-B is "very relevant" (3), chunk-C is "somewhat" (1), chunk-D is "very relevant" (3), chunk-E is "not" (0). Your reranker orders them: A, B, D, C, E.

DCG@5 = 3/log2(2) + 3/log2(3) + 3/log2(4) + 1/log2(5) + 0/log2(6) = 3/1.0 + 3/1.585 + 3/2.0 + 1/2.322 + 0 = 3.0 + 1.89 + 1.5 + 0.43 + 0 = 6.82

IDCG@5 (perfect order: the three 3's, then the one 1, then the 0): = 3/1.0 + 3/1.585 + 3/2.0 + 1/2.322 + 0/2.585 = 3.0 + 1.89 + 1.5 + 0.43 + 0 = 6.82

In this case, nDCG = 6.82 / 6.82 = 1.0: you ranked perfectly. But if your reranker had swapped C and D (order A, B, C, D, E), DCG would drop to about 6.68, because the grade-3 chunk D now sits at position 4, where it is discounted more heavily, and nDCG would fall to about 0.98. That's why nDCG catches reranking quality: it rewards putting the best stuff at the top.
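The arithmetic in this worked example can be checked with a few lines of Python. This is only a verification sketch: the helper name `dcg` is introduced here for illustration, and the grades are the ones from the example above.

```python
from math import log2

def dcg(grades):
    # grades: graded relevance in ranked order, position 1 first
    return sum(g / log2(i + 1) for i, g in enumerate(grades, start=1))

grades = {"A": 3, "B": 3, "C": 1, "D": 3, "E": 0}
ideal = dcg(sorted(grades.values(), reverse=True))

perfect = dcg([grades[d] for d in "ABDCE"]) / ideal   # reranker's order
swapped = dcg([grades[d] for d in "ABCDE"]) / ideal   # C and D swapped

print(round(ideal, 2), round(perfect, 2), round(swapped, 2))  # 6.82 1.0 0.98
```

Swapping one grade-3 chunk from position 3 to position 4 costs about two points of nDCG, which is the scale of movement a reranker change typically has to be judged on.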

A minimal, honest implementation — no library needed:

from math import log2

def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of all labeled-relevant chunks that appear in the top k.
    relevant = set(relevant_ids)
    top = retrieved_ids[:k]
    return len(set(top) & relevant) / len(relevant) if relevant else 0.0

def reciprocal_rank(retrieved_ids, relevant_ids):
    # 1 / rank of the first relevant chunk; 0.0 if none was retrieved.
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved_ids, relevance, k):
    # relevance: {doc_id: graded_score}; unlabeled chunks count as 0.
    dcg = sum(relevance.get(d, 0) / log2(i + 1)
              for i, d in enumerate(retrieved_ids[:k], start=1))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

MRR and reciprocal_rank — what the code does

The reciprocal_rank function finds the first relevant chunk in your ranked list and returns 1 / rank. If the first match is at position 1, you get 1.0 (perfect). If it's at position 5, you get 0.2: the relevant chunk was found, but it was buried. If nothing matches, you get 0 (total miss). For a question like "What's the capital of France?" where one answer suffices, MRR is the right metric: it rewards having the answer within reach at the top of the list, and doesn't care about the rest. The mean across all queries becomes your final MRR score.

 
The staff-level point: **these metrics require labeled relevant chunk IDs, which is the expensive part.** You can label by hand, mine click/feedback data, or — increasingly common in 2026 — bootstrap labels with a strong LLM and then audit a sample by hand. More on that in §3.4.
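Precision@k and hit-rate@k from the metric list are not covered by the implementation above, and neither is the averaging step. A minimal sketch of both follows; the function names, the chunk IDs, and the two-query `runs` dictionary are hypothetical, made up purely to show the shape of the computation.

```python
from statistics import mean

def precision_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the top-k slots filled by relevant chunks (divides by k,
    # the usual convention, even if fewer than k results came back).
    top = retrieved_ids[:k]
    return len(set(top) & set(relevant_ids)) / k if k else 0.0

def hit_at_k(retrieved_ids, relevant_ids, k):
    # 1.0 if at least one relevant chunk is in the top k, else 0.0.
    return 1.0 if set(retrieved_ids[:k]) & set(relevant_ids) else 0.0

# Hypothetical golden set: query -> (ranked retrieved IDs, labeled relevant IDs)
runs = {
    "enterprise benefits": (["c5", "c9", "c12", "c3"], {"c5", "c12", "c18"}),
    "enterprise sla": (["c2", "c7", "c4", "c1"], {"c30"}),
}

k = 4
mean_precision = mean(precision_at_k(r, rel, k) for r, rel in runs.values())
hit_rate = mean(hit_at_k(r, rel, k) for r, rel in runs.values())
print(mean_precision, hit_rate)  # 0.25 0.5
```

Reporting the per-query values alongside the means is usually worth the extra column, since a mean hides which queries fail outright.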
 
### 3.3 Generation metrics: faithfulness, relevancy, and the RAGAS context pair
 
Generation metrics evaluate the answer *given* the retrieved context. The canonical four (popularized by **RAGAS**, Es et al. 2023) are:
 
- **Faithfulness / groundedness** — the anti-hallucination metric. Decompose the answer into atomic claims; for each claim, ask a judge whether it is supported by the retrieved context. Score = supported claims / total claims. Faithfulness is the metric you most want to gate on, because an *unfaithful* answer is the failure mode that erodes user trust fastest. Crucially, faithfulness measures grounding, **not correctness** — a claim faithfully copied from a wrong chunk scores 1.0.
- **Answer relevancy** — does the answer actually address the question? RAGAS computes it by prompting an LLM to generate candidate questions the answer *would* answer, embedding them, and averaging cosine similarity to the real question. Penalizes evasive, incomplete, or off-topic answers. Also not a correctness measure.
- **Context precision** — are the *relevant* retrieved chunks ranked near the top? A rank-aware, LLM-judged signal that directly catches "lost in the middle" risk: low context precision means useful chunks are buried among distractors.
- **Context recall** — decompose the **reference (ground-truth) answer** into claims and check what fraction can be attributed to the retrieved context. This is a retrieval-recall proxy that needs a reference answer rather than labeled chunk IDs.
 
A subtle but interview-winning distinction: **RAGAS's context precision/recall are LLM-judged claim-attribution metrics that need a reference answer; classic IR recall@k/nDCG need labeled relevant chunk IDs.** They measure overlapping but not identical things. Mature teams compute *both* — IR metrics for cheap, deterministic regression gates, and RAGAS context metrics when they have reference answers but not chunk-level labels. The TruLens framing calls faithfulness + answer relevancy + context relevance the **"RAG triad."**
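The faithfulness score (supported claims / total claims) has a simple skeleton once the two judge steps are treated as pluggable functions. The sketch below is an illustration of that structure only, not RAGAS's implementation: `extract_claims` and `is_supported` are hypothetical callables that a real pipeline would back with LLM-judge calls, and the toy stand-ins exist solely so the example runs offline.

```python
from typing import Callable, Sequence

def faithfulness(answer: str, contexts: Sequence[str],
                 extract_claims: Callable[[str], list],
                 is_supported: Callable[[str, Sequence[str]], bool]) -> float:
    claims = extract_claims(answer)
    if not claims:
        return 0.0  # assumption: an answer with no claims scores 0
    supported = sum(1 for claim in claims if is_supported(claim, contexts))
    return supported / len(claims)

# Toy stand-ins. Sentence splitting and substring matching are NOT usable
# judges; they only make the control flow visible.
def toy_extract(answer):
    return [s.strip() for s in answer.split(".") if s.strip()]

def toy_supported(claim, contexts):
    return any(claim.lower() in ctx.lower() for ctx in contexts)

score = faithfulness(
    "Enterprise plans receive 20% off list price. Support is available 24/7.",
    ["Annual billing: Enterprise plans receive 20% off list price."],
    toy_extract,
    toy_supported,
)
print(score)  # 0.5
```

The second claim may well be true, but nothing in the retrieved context supports it, so it counts against the score. That is the grounding-versus-correctness distinction in miniature.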
 
### 3.4 Building the golden set from production
 
A golden set is the single highest-leverage artifact in RAG eval, and the most neglected. Principles:
 
- **Source from real queries, not invented ones.** Sample actual production traffic — stratified across query types (factual lookup, multi-hop, comparison, "global" sense-making questions a flat vector search can't answer). Synthetic queries miss the weird phrasings, typos, and out-of-distribution intents your users actually send.
- **Label relevant chunks, not just answers.** This is what unlocks separable retrieval metrics. For each query, mark which chunks *should* be retrieved. Yes, it's expensive; it's also the asset competitors don't have.
- **Bootstrap with an LLM, audit with a human.** Use [claude-opus-4-8](/inference) or a strong open model to draft reference answers and candidate relevant-chunk labels, then have a human review a stratified sample. Measure how often the human disagrees — that's your label-noise floor.
- **Version it and grow it.** Every production incident becomes a golden-set entry. The set should track your evolving query distribution; a stale golden set silently stops representing reality (eval drift).
- **Guard against leakage.** If you tune chunking/prompts against the golden set repeatedly, you overfit to it. Keep a held-out slice you only touch at release.
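One way to make these principles concrete is a small record type plus a stratified held-out split. This is a sketch under assumptions: the `GoldenExample` fields, the `split_holdout` helper, and the 20% hold-out fraction are all hypothetical choices for illustration, not a schema any framework requires.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenExample:
    query: str
    query_type: str                  # e.g. "factual", "multi-hop", "comparison"
    relevant_chunk_ids: frozenset    # chunk-level labels for recall@k / nDCG
    reference_answer: str            # for context recall and correctness checks
    source: str = "production"       # or "incident"

def split_holdout(examples, holdout_frac=0.2, seed=0):
    """Stratified split by query_type; deterministic for a given seed."""
    rng = random.Random(seed)
    by_type = {}
    for ex in examples:
        by_type.setdefault(ex.query_type, []).append(ex)
    dev, holdout = [], []
    for group in by_type.values():
        group = group[:]             # do not mutate the caller's list
        rng.shuffle(group)
        # Keep at least one held-out row per type when the type has 2+ rows.
        n_hold = max(1, round(len(group) * holdout_frac)) if len(group) > 1 else 0
        holdout.extend(group[:n_hold])
        dev.extend(group[n_hold:])
    return dev, holdout
```

Fixing the seed and committing the split alongside the golden set keeps the held-out slice stable across runs, so it only changes when you deliberately version it.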
 
### 3.5 LLM-as-judge: how it works and where it lies
 
Faithfulness, relevancy, and RAGAS's context metrics are all **LLM-as-judge** under the hood. This is the only scalable way to grade open-ended generation, but treat it like a noisy sensor, not an oracle:
 
- **Biases.** Judges exhibit *position bias* (favoring the first option in a comparison), *verbosity bias* (longer = better), and *self-preference bias* (a model rates outputs from its own family higher). Mitigate by randomizing order, instructing the judge on length-neutrality, and, for fairness across a model bake-off, using a judge from a *different* family than the candidates.
- **Calibration.** Judge scores are not absolute truth. Before trusting one, validate against human labels and report agreement — **Cohen's κ** for binary/categorical, Spearman correlation for graded. Use the judge to *rank* candidates and *detect regressions*, where its relative ordering is reliable, rather than as ground-truth accuracy.
- **Determinism.** Pin the judge model *and version*, set `temperature=0` where the model allows it, and prompt-cache the rubric. Note for 2026: `claude-opus-4-8` and `claude-opus-4-7` reject `temperature` entirely (adaptive thinking only), so if you need a hard-deterministic judge, reach for `claude-sonnet-4-6` or `claude-haiku-4-5` (which still accept `temperature=0`), or accept adaptive variance and average over runs.
Judge determinism in practice — on real models

The model choice trade-off. claude-opus-4-8 is the strongest evaluator but runs with adaptive thinking and rejects the temperature parameter, so its answers vary run to run and you can't pin it. claude-sonnet-4-6 is nearly as strong for most evaluation tasks (claim verification, faithfulness judgments) and accepts temperature=0, which makes results far more repeatable from run to run (greedy decoding reduces variance, though a hosted API does not strictly guarantee bit-identical output). claude-haiku-4-5 is cheaper ($1/$5 vs. $3/$15 per 1M tokens) and also supports temperature=0, though it makes more mistakes on nuanced grading. For a production pipeline running thousands of claims per night, Sonnet at temperature=0 is the sweet spot: cheap enough to batch, strong enough to be trusted, repeatable enough to gate on. If you need Opus's reasoning power, accept the variance, run it 3–5 times per sample, and average the scores; you are then buying consistency with compute instead of getting it from a pinned decoding setting.

  • Cost. RAGAS faithfulness is not one LLM call — it's claim-extraction plus one verification call per claim. Over a 500-query golden set with ~5 claims each, that's thousands of calls per eval run. Use a cheaper judge (claude-haiku-4-5 at $1/$5 per 1M, or claude-sonnet-4-6 at $3/$15), batch through the Anthropic Batches API for ~50% off, and cache aggressively.
  • Drift. When you upgrade the judge model, scores shift even if your system didn't change. Pin the version and re-baseline deliberately on any judge upgrade.

This is the same discipline covered in the /evals pillar — RAG eval is a specialization of LLM eval, not a separate craft.

3.6 The frameworks, named

  • RAGAS (docs.ragas.io) — the de facto standard for the four metrics above; reference-free options and synthetic test-set generation. v0.2+ moved to a class-based metric API with SingleTurnSample/EvaluationDataset.
  • DeepEval (Confident AI) — pytest-style RAG assertions (assert_test), good for dropping eval directly into CI, with metrics that mirror RAGAS plus G-Eval-style custom rubrics.
  • TruLens — the "RAG triad" framing and strong tracing/observability for inspecting why a score is low.
  • ARES (Saad-Falcon et al. 2023) — trains lightweight judge classifiers on synthetic data with prediction-powered inference to give statistically grounded confidence intervals on your metrics; the rigorous choice when you need defensible numbers.

4. Minimal implementation

Evaluate a single RAG output for faithfulness (is the answer grounded in the retrieved chunks?) and context recall (did retrieval surface what the reference answer needs?), using RAGAS with a Claude judge.

pip install "ragas>=0.2" langchain-anthropic
export ANTHROPIC_API_KEY=sk-ant-...

from ragas import EvaluationDataset, evaluate
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import Faithfulness, LLMContextRecall
from ragas.llms import LangchainLLMWrapper
from langchain_anthropic import ChatAnthropic
 
# The judge. Sonnet 4.6 is the sweet spot for grading: strong enough to verify
# claims reliably, cheap enough to run over hundreds of samples ($3/$15 per 1M).
# temperature=0 is valid on Sonnet 4.6 — note it would 400 on claude-opus-4-8,
# which is adaptive-thinking-only. Pin the model id so scores are reproducible.
judge = LangchainLLMWrapper(ChatAnthropic(model="claude-sonnet-4-6", temperature=0))
 
# Each sample is one row of your golden set. `reference` is the human-labeled
# ground-truth answer; `retrieved_contexts` is exactly what your retriever
# returned for this query (the actual top-k, post-rerank).
samples = [
    SingleTurnSample(
        user_input="What discount does the Enterprise plan get on annual billing?",
        response="Enterprise customers get 20% off when they pay annually.",
        retrieved_contexts=[
            "Annual billing: Pro plans receive 15% off list price. "
            "Enterprise plans receive 20% off list price.",
            "All paid plans include SSO and audit logs.",
        ],
        reference="Enterprise plans receive a 20% discount on annual billing.",
    ),
    # A deliberately broken row: the retriever missed the discount chunk, so the
    # model guessed. Faithfulness will be low (claim unsupported) AND context
    # recall low (reference claim not attributable to context) — both fingers
    # point at retrieval.
    SingleTurnSample(
        user_input="What is the SLA for the Enterprise plan?",
        response="The Enterprise SLA guarantees 99.99% uptime.",
        retrieved_contexts=["All paid plans include SSO and audit logs."],
        reference="The Enterprise plan guarantees 99.95% uptime.",
    ),
]
 
dataset = EvaluationDataset(samples=samples)
 
result = evaluate(
    dataset=dataset,
    metrics=[Faithfulness(), LLMContextRecall()],
    llm=judge,
)

print(result)            # aggregate, e.g. {'faithfulness': 0.50, 'context_recall': 0.50}
df = result.to_pandas()  # per-row scores: THIS is what you triage
print(df[["user_input", "faithfulness", "context_recall"]])

RAGAS evaluation pipeline — section by section

Setup (lines before evaluate). You create a judge (Claude Sonnet 4.6 with temperature=0 for repeatability), build a list of SingleTurnSample objects (each row is one query with its response, the chunks your retriever returned, and a reference answer), and wrap them in an EvaluationDataset.

The evaluate call. RAGAS runs two metrics over your dataset. Faithfulness() takes the response and retrieved_contexts, and makes the judge extract atomic claims from the answer and verify each one against the chunks; no reference is needed. LLMContextRecall() takes the reference (ground-truth answer) and retrieved_contexts, decomposes the reference into claims, and checks what fraction can be sourced to the retrieved chunks; it needs a labeled answer but no chunk IDs. If a sample has bad retrieval (missing chunks), both metrics will drop together, pointing at retrieval. If retrieval is fine but the answer is hallucinated, faithfulness drops alone, pointing at generation.

Output. result contains aggregate scores (e.g., {'faithfulness': 0.87, 'context_recall': 0.92}) and result.to_pandas() gives you per-row scores so you can triage which queries/answers are failing. This is the diagnostic heart: if faithfulness is 0.87 overall but 0.3 on queries about pricing, you know a specific domain is broken.

 
What's happening, and why it's honest: `Faithfulness()` makes the judge decompose `response` into atomic claims and verify each against `retrieved_contexts` — no reference needed, so it scales to unlabeled production traffic. `LLMContextRecall()` decomposes `reference` into claims and checks attributability to `retrieved_contexts` — it needs a labeled answer but no chunk IDs. The second sample shows the diagnostic payoff: when *both* metrics drop together, you have a retrieval miss, not a generation bug. Run this over your whole golden set, write `df` to a parquet artifact, and you have the input to a CI gate. Add `Faithfulness` over *production* traffic (no reference required) as a continuous online monitor.
 
## 5. Production tradeoffs
 
| Dimension | Retrieval eval (recall@k, nDCG) | RAGAS/LLM-judge gen eval | What changes at scale |
|---|---|---|---|
| **Cost** | ~free — set ops on IDs | thousands of LLM calls per run (claim extraction + per-claim verify) | Batch API (−50%), cheap judge (Haiku/Sonnet), prompt-cache the rubric, subsample on PRs |
| **Latency** | milliseconds | minutes to hours over a large set | Full eval nightly; fast subset (50–100 rows) on every PR |
| **Determinism** | exact, repeatable | judge variance + model drift | Pin model+version; `temperature=0` on Sonnet/Haiku; average runs; re-baseline on judge upgrade |
| **Label cost** | high (chunk-level relevance labels) | medium (reference answers) | Bootstrap with LLM, human-audit a stratified sample, grow from incidents |
| **What it catches** | missing/buried context | hallucination, off-topic, distractor confusion ||
| **Failure mode** | stale golden set → measures the past | judge bias, gaming, overfit to eval set | Refresh golden set from live traffic; held-out slice; track κ vs humans |
 
**Prose.** The economic reality: retrieval metrics are nearly free and deterministic, so run them on *every* PR and gate hard. Generation metrics cost real money and wall-clock time, so split them: a cheap representative subset gates PRs, the full set runs nightly and on release. **Gate retrieval and generation separately**: a `recall@10 ≥ 0.85` gate and a `faithfulness ≥ 0.90` gate, so a red build immediately localizes the regression to a layer. Fail the build on a drop beyond a noise margin versus the committed baseline, not on an absolute floor alone; relative regression detection is what LLM judges are actually reliable at.
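A gate that combines an absolute floor with a regression-versus-baseline check could look like the sketch below. It is an illustration under assumptions: the `gate` function, its dictionary inputs, and the 0.02 noise margin are hypothetical, while the 0.85 and 0.90 floors are the example thresholds from the paragraph above.

```python
def gate(current: dict, baseline: dict, floors: dict, noise_margin: float = 0.02):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    for metric, floor in floors.items():
        value, base = current[metric], baseline[metric]
        if value < floor:
            failures.append(f"{metric}: {value:.3f} is below floor {floor:.3f}")
        if base - value > noise_margin:
            failures.append(
                f"{metric}: dropped {base - value:.3f} vs baseline {base:.3f}"
            )
    return failures

# Two separate gates, so a red build names the layer that regressed.
retrieval_failures = gate(
    {"recall@10": 0.83}, {"recall@10": 0.88}, {"recall@10": 0.85}
)
generation_failures = gate(
    {"faithfulness": 0.93}, {"faithfulness": 0.94}, {"faithfulness": 0.90}
)

for line in retrieval_failures + generation_failures:
    print(line)
```

Here the retrieval gate fails on both checks while the generation gate passes, since a 0.01 dip in faithfulness is inside the noise margin. In CI you would exit non-zero when either list is non-empty.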
 
The dominant failure mode at scale is **eval-reality divergence**: your offline faithfulness climbs while users complain. Causes — golden set drifted from the live query distribution; you overfit prompts to the golden set; the judge is biased toward your generator's style; faithfulness is high but *correctness* is low (the model faithfully parrots a wrong chunk). The fix is a closed loop: offline metrics gate releases, but online signals (thumbs, deflection rate, escalation rate, an A/B on answer acceptance) are the ground truth that feeds the golden set its next entries. **Offline eval prevents obvious regressions; online eval defines quality.**
 
A note on the build-vs-buy frontier: as retrieval architectures get more sophisticated — [agentic RAG](/agents) where a model decides whether and what to retrieve, multi-hop traversal, [Contextual Retrieval](/rag/advanced-retrieval) — the eval target gets harder, because there's no single "the retrieved context" anymore; it's a trajectory. Evaluate those as you'd evaluate an [agent](/agents/tool-use-and-mcp): score the final answer's faithfulness *and* score the trajectory (did it retrieve when it should have, did it stop when it had enough).
 
## 6. How it's asked
 
**[IC4]** *"A RAG answer comes back wrong. How do you tell whether retrieval or generation is at fault?"*
Pull the retrieved chunks for that query and ask one question first: was the answer-bearing chunk in the context? If it wasn't, it's a retrieval miss — go fix chunking, embeddings, hybrid search, or reranking. If it *was* present and the answer is still wrong, it's a generation fault — prompt, model, context ordering, or too many distractors. The order matters because retrieval is the more common culprit and the cheaper thing to check, and because fixing the generator does nothing if the context never arrived. Quantitatively: low recall@k / context recall ⇒ retrieval; high recall@k but low faithfulness ⇒ generation.
 
**[IC5]** *"Build a golden eval set from production and gate it in CI."*
Stratified-sample real production queries across intent types, label each with its relevant chunk IDs and a reference answer (bootstrap with a strong LLM, human-audit a sample to measure label noise), version it, and grow it from every incident. In CI: run cheap retrieval metrics (recall@k, nDCG) on every PR with a hard gate; run a 50–100-row generation subset (faithfulness, answer relevancy) on PRs and the full set nightly via the Batches API. Use two separate gates, one for retrieval and one for generation, so a failure localizes the regression. Fail on regression versus a committed baseline, keep a held-out slice to detect overfitting, and pin the judge model version.
 
**[IC5]** *"Why not just use one end-to-end 'answer quality' score from an LLM judge?"*
Because it's unactionable and it hides the most common failure. A blended score can't distinguish "we never retrieved the chunk" from "we retrieved it and hallucinated anyway," and those have opposite fixes. You'd spend a sprint tuning the generator when the bug was in chunking. Separable metrics turn debugging into a decision tree. The single end-to-end score is fine as a *headline* for stakeholders, but engineering decisions need the decomposition.
 
**[IC6]** *"Offline faithfulness is 0.95 and climbing, users still complain. Reconcile."*
Several live hypotheses, and I'd instrument to distinguish them. (1) Faithfulness measures *grounding, not correctness* — the model may be faithfully citing wrong or outdated chunks; add a correctness check against references and an online accuracy signal. (2) The golden set has drifted from the live query distribution, so I'm acing yesterday's questions — refresh from current traffic and check coverage of recent failed queries. (3) Judge bias / overfit — the judge may favor my generator's style, or I've tuned prompts to the eval set; validate judge-human agreement (κ) and rotate in a held-out slice. (4) The complaints are about *answerability* — questions whose answer isn't in the corpus at all, which RAG can't fix and faithfulness won't flag. I'd wire a closed loop: offline gates releases, online signals (acceptance rate, escalation, A/B) define truth and feed the golden set.
 
**[IC6]** *"How do you keep LLM-as-judge from quietly corrupting your eval over a year?"*
Pin the judge model and version and re-baseline deliberately on any upgrade — a silent model bump shifts every score. Periodically re-measure judge-vs-human agreement on a fresh human-labeled sample; if κ degrades, the judge has drifted relative to your evolving data. Use the judge as a *relative ranker* for regression detection, where it's reliable, not as an absolute accuracy oracle. Watch for gaming: if engineers optimize the metric directly (e.g., padding answers to game verbosity-biased relevancy), the metric decouples from quality — which is why online signals must anchor the offline ones.
 
## 7. Pitfalls & flashcards
 
- **Scoring the pipeline end-to-end only.** You lose the retrieval-vs-generation decomposition — the most useful thing eval can tell you.
- **Confusing faithfulness with correctness.** Faithfulness = "supported by the retrieved context." A faithful answer over a wrong chunk scores perfectly. Add a correctness check against references.
- **No chunk-level labels.** Without labeled relevant chunk IDs you can't compute recall@k/nDCG and you're flying blind on the layer that fails most.
- **Trusting the LLM judge as ground truth.** It's a biased ranker. Validate against humans (κ), pin the version, randomize order, neutralize verbosity.
- **A stale golden set.** It silently stops representing your live query distribution. Grow it from production incidents.
- **Overfitting to the golden set.** Repeated tuning against it inflates scores without improving reality. Keep a held-out release slice.
- **One gate for everything.** Separate retrieval and generation gates so red builds localize the regression.
- **Ignoring cost.** RAGAS faithfulness is N+1 LLM calls per sample; over a big set that's a real bill — batch, cache, use a cheap judge.
- **Only offline eval.** Offline prevents regressions; online (A/B, acceptance, escalation) defines quality. You need both.
 
> **Flashcard.** Evaluate RAG as two systems: **retrieval** (recall@k, nDCG, MRR over labeled chunk IDs — usually where it breaks) and **generation** (faithfulness, answer relevancy, context precision/recall — mostly LLM-judged). Build a versioned golden set from real traffic with labeled relevant chunks, treat the LLM judge as a calibrated *ranker* not an oracle (pin its version, check κ vs humans, watch cost), gate the two layers *separately* in CI on regression-versus-baseline, and close the loop with online signals — because faithfulness measures grounding, not correctness, and offline ≠ online.
 
## 8. Further reading
 
- **Anthropic — Contextual Retrieval** (Sept 2024): why retrieval is the bottleneck and how contextual embeddings + contextual BM25 cut failed retrievals by ~49%, or ~67% with reranking added, and therefore what your retrieval eval should be sensitive to. <https://www.anthropic.com/news/contextual-retrieval>
- **RAGAS paper** — Es et al., *Automated Evaluation of Retrieval Augmented Generation* (2023): the formal definitions of faithfulness, answer relevancy, context precision/recall. <https://arxiv.org/abs/2309.15217>
- **RAGAS docs** — current (v0.2+) metric API and synthetic test-set generation. <https://docs.ragas.io/>
- **ARES** — Saad-Falcon et al. (2023): trained judges + prediction-powered inference for statistically grounded RAG metrics. <https://arxiv.org/abs/2311.09476>
- **TruLens — the RAG triad**: faithfulness, answer relevance, context relevance, with tracing for root-cause. <https://www.trulens.org/getting_started/core_concepts/rag_triad/>
- **Lost in the Middle** — Liu et al. (2023): why context *ordering* (and thus context precision) matters for the generator. <https://arxiv.org/abs/2307.03172>
 
**Next:** [/evals](/evals) — the general LLM-evaluation pillar (golden sets, LLM-as-judge calibration, regression gates) of which RAG eval is a specialization; then revisit [/rag/advanced-retrieval](/rag/advanced-retrieval) to see which knobs your metrics should move.